DataOps Data Glossary

We live in a world with a rapidly evolving data landscape. In essence, data has changed, and is changing, our society and business world. Data is propelling incredible advancements across multiple areas with new terms being adopted and the definitions of existing terms changing as the data landscape and data technologies evolve.

Therefore, to mitigate any misunderstanding, our DataOps.live experts have created a glossary explaining all the terms relevant to the data ecosystem or “everything data.”

The 1950s definition described Artificial Intelligence as “any task performed by a machine that would previously have been considered to require human intelligence.”

Modern definitions are more specific. François Chollet provides a later definition of AI:

“The effort to automate intellectual tasks normally performed by humans. As such, AI is a general field that encompasses machine learning and deep learning, but also includes many more approaches that don’t involve any learning.”

Big Data is a collection of structured, unstructured, and semi-structured raw data that is particularly voluminous in nature. It comprises both external and internal data collected by organizations. It is used for machine learning, predictive analytics, and data analysis purposes. While the Big Data construct does not specify a minimum data volume to be considered Big Data, organizations can generate terabytes, petabytes, and even exabytes of data over time.

A business analyst’s role is to examine and document an organization’s business model, processes, systems, and technology integrations. The analyst aims to advise the business on how to improve its day-to-day operations. The business analyst also acts as a bridge between the business and IT, using data analytics to determine new requirements, assess processes, and provide data-driven reports and recommendations to executives and stakeholders.

Business Intelligence (BI) is a technology-driven process that analyzes data to deliver actionable information to executives, managers, and stakeholders throughout the organization. BI combines business analytics, data analytics, data visualization, and BI tools and infrastructure to help organizations make data-driven decisions.

Traditional Business Intelligence was first noted in the 1960s. At that time, it was merely a system for sharing information between different organizational role players. In the 1980s, it evolved alongside computer modeling into a methodology for decision-making and turning data into meaningful insights. The modern BI paradigm prioritizes flexible, self-service analytics and trusted, governed data, increasing the speed to insight and empowering business users and stakeholders.

Collaboration among team members, business stakeholders, and end-users is key to success. It leads to creative solutions and improved results. End-users and stakeholders have a keen understanding of what they want to see in a data product. Therefore, it is imperative to consult with them frequently as they will be the ones to determine whether the data product is a success or not. When stakeholders and end-users feel part of the process, they remain engaged until the data product is completed.

A data architecture is a blueprint of the organization’s data ecosystem. It includes the rules, policies, procedures, and standards used to govern how data is collected, stored, and processed. A data architecture also describes how data is extracted from multiple data sources, ingested into data pipelines, stored in the data platform, and transformed into valuable data. Lastly, it defines the processes used to extract, load, and transform the data into meaningful insights.

As described by data.world, a data catalog is a “metadata management tool that companies use to inventory and organize their data.” It provides data teams, business users, and stakeholders with the functionality to find, analyze, and manage the voluminous data that falls within the Big Data paradigm. Advantages of using a data catalog include organization-wide, role-based access to the data, data discovery, and data governance and security.

Data Governance is responsible for data security and data quality. It defines the roles and responsibilities that allow access to and ensure accountability for and ownership of data products and assets contained within the data platform. In summary, it covers the people, processes, and technologies needed to manage and protect data assets and products.

Governance is embedded in the data pipelines. Thus, the rules, processes, and procedures that ensure data security, data quality, and data deliverability are encapsulated inside the data pipelines.
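As a minimal sketch of what governance embedded in a pipeline can look like, the Python step below applies one quality rule and one security rule as part of the transformation itself. The field names (`customer_id`, `email`), both rules, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def mask_email(email: str) -> str:
    """Hide the local part of an email address, keeping only the domain."""
    local, _, domain = email.partition("@")
    return "***@" + domain if domain else "***"


def governed_step(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Apply quality and security rules while the data flows through.

    Returns (accepted, rejected) so that rejected rows remain auditable.
    """
    accepted, rejected = [], []
    for row in rows:
        # Data quality rule (hypothetical): customer_id must be present.
        if not row.get("customer_id"):
            rejected.append(row)
            continue
        # Data security rule (hypothetical): never pass raw emails downstream.
        accepted.append({**row, "email": mask_email(row.get("email", ""))})
    return accepted, rejected


raw = [
    {"customer_id": "C1", "email": "ana@example.com"},
    {"customer_id": "", "email": "bob@example.com"},
]
accepted, rejected = governed_step(raw)
```

Because the rules live inside the step, every run of the pipeline enforces them; there is no separate manual check that could be skipped.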

A data lake is a centralized repository that stores raw structured, unstructured, and semi-structured data at any scale. The data is stored as-is, in its raw format, without needing to be structured to fit a relational database or data warehouse model. Data teams can also run different types of analytics, such as visualizations, dashboards, and real-time analytics, and train machine learning models directly on top of the raw data in the data lake.

The data lifecycle is the sequence of processes (or stages) that successfully manage and preserve the data for use and reuse. The data lifecycle includes the following steps: plan, collect, data quality and governance, describe, preserve (store), discover, integrate, and analyze. Not all of these stages have to form part of every data lifecycle. For instance, some data activities, such as metadata gathering, might only focus on the discover, integrate, and analyze steps. Additionally, this lifecycle might not always follow a linear path, and multiple revolutions of the cycle might be necessary to process the data.

Data management is a term describing the methodology used to maintain and oversee the processes used to plan, create, acquire, maintain, use, archive, retrieve, control, and delete data. It is also described as enterprise information management (EIM), which Gartner defines as “an integrative discipline for structuring, describing, and governing information assets across organizational and technical boundaries to improve efficiency, promote transparency, and enable business insight.”

Data mining is the process of analyzing large data sets to discover patterns, anomalies, trends, correlations, and relationships that data analysts and scientists might otherwise miss. Data mining uses methods that intersect machine learning, advanced statistical analysis, and database or data warehousing systems with the overall goal of extracting information from the data set. This information helps organizations solve problems, mitigate risks, and determine new business opportunities.
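To make the idea of discovering anomalies concrete, here is a minimal sketch of one basic statistical technique, z-score outlier flagging, using only the Python standard library. The data set, the function name `find_anomalies`, and the threshold of 2.0 are illustrative assumptions; real data mining work typically uses far larger data sets and more robust methods.

```python
from statistics import mean, pstdev


def find_anomalies(values: list[float], threshold: float = 2.0) -> list[float]:
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily order counts with one unusual day.
daily_orders = [102, 98, 101, 97, 103, 99, 100, 250]
outliers = find_anomalies(daily_orders)
```

Here only the value 250 lies more than two standard deviations from the mean, so it is the single item flagged for a closer look.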

DataOps is a set of principles describing how data engineers build, test, and deploy data products and platforms in precisely the same way software applications are developed. DataOps (Data + Operations) is rooted in the principles of DevOps, Agile, and Lean. It is a way to approach and deliver enterprise data operations, incorporating agility and governance. DataOps relies on continual and Agile development. It also leaves room for creativity and balances governance with agility, resulting in improved collaboration between data teams, business stakeholders, and end-users.

A DataOps engineer is a technical professional who focuses exclusively on the data development and deployment lifecycle instead of the data product itself. DataOps engineers do not work with the data itself. Instead, they engineer the environment and processes that data engineers use to build the data products. DataOps engineers also improve the data engineers’ and analysts’ development processes by implementing the DataOps principles as outlined in the DataOps Manifesto and the #TrueDataOps philosophy.

A data product is not the same as a data project in that a project has a defined scope, start, and end. On the other hand, a data product has a continuous cycle of requirements gathering, development, testing, and delivery to the business. A product has regular releases throughout its lifecycle. It has high agility and an enormous capacity for change. Its workload backlog is continually being reprioritized. And a product has automated regression testing built into the release process. A data product can be as simple as a set of dashboards or a data mart, or more involved, such as a contracted data share with a third party. Each has a clear definition and changing requirements over time. And every change to the product (whether a technical change or a simple refresh of the data) should be treated like a new version of the product.
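The automated regression testing mentioned above can be sketched as follows: each new version of the product is rebuilt from a fixed test fixture and compared with the last approved output. The product (`build_revenue_by_region`), the fixture, and the baseline values are all hypothetical; this is an illustration of the idea, not a prescribed implementation.

```python
def build_revenue_by_region(orders: list[dict]) -> dict[str, float]:
    """Hypothetical data product: total revenue per region."""
    totals: dict[str, float] = {}
    for order in orders:
        region = order["region"]
        totals[region] = totals.get(region, 0.0) + order["amount"]
    return totals


def regression_check(actual: dict[str, float],
                     baseline: dict[str, float]) -> list[str]:
    """Compare a new build with the last approved output; list differences."""
    problems = []
    for key in sorted(baseline.keys() | actual.keys()):
        if key not in actual:
            problems.append(f"missing region: {key}")
        elif key not in baseline:
            problems.append(f"unexpected region: {key}")
        elif abs(actual[key] - baseline[key]) > 1e-9:
            problems.append(f"value changed for {key}")
    return problems


# Fixed input and approved output kept alongside the product's code.
fixture = [
    {"region": "EMEA", "amount": 100.0},
    {"region": "EMEA", "amount": 50.0},
    {"region": "APAC", "amount": 75.0},
]
baseline = {"EMEA": 150.0, "APAC": 75.0}
problems = regression_check(build_revenue_by_region(fixture), baseline)
```

An empty `problems` list means the release can proceed; any entry would block the release until the change is either fixed or approved as the new baseline.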

Data vault modeling is a “database modeling method that is designed to provide long-term historical storage of data coming in from multiple operational systems.” It also provides a method of considering historical data that deals with “issues such as auditing, tracing of data, loading speed and resilience to change as well as emphasizing the need to trace where all the data in the database came from.” Data vault modeling does not differentiate between good and bad data. Instead, it supplies a single version of the data facts. This means that every row in the data vault must be accompanied by the source of the record and the load data metrics.
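The requirement that every row carries its origin can be sketched in Python as below. This assumes that the load metadata amounts to a load timestamp; the class name `HubCustomerRow`, the field names, and the source name `crm_system` are hypothetical and do not come from any particular data vault implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HubCustomerRow:
    customer_key: str         # business key from the operational system
    record_source: str        # which system the row came from
    load_timestamp: datetime  # when the row was loaded (assumed metadata)


def to_hub_rows(business_keys: list[str], record_source: str,
                load_timestamp: datetime) -> list[HubCustomerRow]:
    """Wrap incoming keys with audit metadata; no rows are filtered out."""
    return [HubCustomerRow(key, record_source, load_timestamp)
            for key in business_keys]


loaded_at = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
rows = to_hub_rows(["CUST-001", "CUST-002"], "crm_system", loaded_at)
```

Note that the loader does not judge whether a key is good or bad; it records what arrived, from where, and when, which is what makes later auditing and tracing possible.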

Deep learning is a subset of machine learning. It is designed to train a machine or computer to perform human-like tasks, such as recognizing speech, making predictions, and identifying images. It is capable of unsupervised learning from unstructured or unlabeled data. Deep learning’s key focus, the ability to continuously improve and adapt to changes in the underlying data patterns, presents an opportunity to introduce more dynamic behavior into data analytics.

The Internet of Things (IoT) is a vastly different set of technologies and use cases that have multiple, disparate definitions. IoT is the use of networked devices that are embedded in the physical world and connect to the network, uploading read-only data that they gather from their environment. The data collected by these devices is known as telemetry. Much of this telemetry can be considered time series data, but it is not exclusively time series data.

Once a Git feature branch has been checked out, it can be changed and tested. Then it is ready to be merged back into the master branch. A merge request is created, which also acts as a review tool or a second pair of eyes to review the changes. Once these changes have been approved, the feature branch is merged back into the master branch. If there are any conflicts between multiple feature branches developed concurrently, they are manually reviewed and sorted out during the merge.

Structured data is data that observes or conforms to a pre-defined data structure or model. Therefore, it is straightforward to analyze. It adheres to a tabular format with relationships between different rows and columns. Structured data depends on the existence of a data model of how the data can be stored, processed, and accessed. Structured data is considered the most traditional form of data storage. The earliest versions of DBMSs were only able to house, process, and access structured data.
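As a small illustration of data conforming to a pre-defined model, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# The data model is declared up front, before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)",
    [(1, "Ana", "PT"), (2, "Bob", "UK"), (3, "Chen", "UK")],
)

# Because the structure is known in advance, analysis is a simple query.
uk_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE country = ?", ("UK",)
).fetchone()[0]

# A row that violates the model (missing name) is rejected by the database.
try:
    conn.execute(
        "INSERT INTO customers (id, name, country) VALUES (4, NULL, 'DE')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The query counts two customers in the UK, and the insert with a missing name raises `sqlite3.IntegrityError`, showing how the model both simplifies analysis and constrains what can be stored.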

One of the most significant challenges organizations face is the need for multiple environments (Prod, Dev, QA) as well as the creation and maintenance of these environments. This challenge is solved by establishing a Single Source of Truth or a trusted data source that provides a complete picture of the overarching data environment. In a DataOps environment, the code and configurations are all moved into the Single Source of Truth, usually a Git repository. The data’s Source of Truth remains its original source systems.

The DataOps philosophy supplies the WHAT: What DataOps aims to achieve and what it should deliver. #TrueDataOps provides the HOW. It is a philosophy that defines the seven core principles or pillars of how data is managed and delivered with agility and governance. It builds on the truest principles of DevOps, Agile, Lean, test-driven development, and Total Quality Management (TQM). It applies these principles to data, data platform management, and data analytics.
