DataOps Data Glossary
INTRODUCTION TO THE DATAOPS DATA GLOSSARY
We live in a world with a rapidly evolving data landscape. In essence, data has changed, and is changing, our society and business world. Data is propelling incredible advancements across multiple areas with new terms being adopted and the definitions of existing terms changing as the data landscape and data technologies evolve.
Therefore, to mitigate any misunderstanding, our Dataops.live experts have created a glossary explaining all the terms relevant to the data ecosystem or “everything data.”
A - F
Advanced analytics is data analysis that is more than simple operations like sums, averages, filtering, and sorting. In summary, advanced analytics uses mathematical and statistical algorithms to recognize patterns, predict outcomes and probabilities of outcomes, and generate new information.
Agile is an iterative approach to project management and software development that helps teams deliver value to customers faster and with fewer of the headaches that go hand-in-hand with traditional software development and project management. The data world has adopted the Agile principles to deliver data products in atomic increments, saving time, balancing governance with agility, and increasing product quality and availability.
An Application Programming Interface (API) is a software interface that allows two otherwise unrelated applications or systems to interact and pass data between each other via a service layer.
The term architectural coupling describes how different pieces or components of the data architecture fit together to form a cohesive, functional unit.
An architectural quantum is a highly independent, deployable component with high functional cohesion. It includes all the structural elements needed to function independently and correctly.
The 1950s definition described Artificial Intelligence as “any task performed by a machine that would previously have been considered to require human intelligence.”
Modern definitions are more specific. François Chollet provides a later definition of AI:
“The effort to automate intellectual tasks normally performed by humans. As such, AI is a general field that encompasses machine learning and deep learning, but also includes many more approaches that don’t involve any learning.”
An atomic code unit is the basic unit of executable code. It is reusable, essentially indivisible, and easy to maintain. It is based on a single, specific function, enhancing its reusability. A new feature should be constructed from as many existing atomic code units as possible so that the code base has as little repetition as possible.
Documentation is an integral part of data governance. Because no one likes writing documentation, Agile development promotes creating and updating it automatically from the code as part of the pipeline and data product build process. The documentation is usually written as plain text in Markdown (a lightweight markup language) and stored in the repository with the code and data.
Automated regression testing is the retesting of the entire data pipeline or system after every code change, using a set of pre-written automated scripts with little or no human intervention.
In contrast to manual testing, automated testing is a testing technique that uses automated testing tools and techniques to execute pre-written test sets.
Big Data is a collection of structured, unstructured, and semi-structured raw data that is particularly voluminous in nature. It comprises both external and internal data collected by organizations. It is used for machine learning, predictive analytics, and data analysis purposes. While the Big Data construct does not specify a minimum data volume to be considered Big Data, organizations can generate terabytes, petabytes, and even exabytes of data over time.
A business analyst’s role is to examine and document an organization’s business model, processes, systems, and technology integrations. The analyst aims to advise the business on how to improve its day-to-day operations. The business analyst also acts as a bridge between the business and IT, using data analytics to determine new requirements, assess processes, and provide data-driven reports and recommendations to executives and stakeholders.
Business Intelligence (BI) is a technology-driven process that analyzes data to deliver actionable information to executives, managers, and stakeholders throughout the organization. BI combines business analytics, data analytics, data visualization, and BI tools and infrastructure to help organizations make data-driven decisions.
Traditional Business Intelligence was first noted in the 1960s, when it was merely a system for sharing information between different organizational role players. In the 1980s, it evolved alongside computer modeling into a methodology for decision-making and turning data into meaningful insights. The modern BI paradigm prioritizes flexible, self-service analytics and trusted, governed data, increasing the speed to insight and empowering business users and stakeholders.
CI/CD or continuous integration/continuous delivery is a set of operating principles and practices that provide data teams with the functionality to deliver frequent, incremental code and data updates to the business. CI/CD is an Agile best practice. It enables data teams to focus on meeting business stakeholder requirements and deploy high-quality and highly available data products because the CI/CD pipeline steps are automated.
Collaboration among team members, business stakeholders, and end-users is key to success. It leads to creative solutions and improved results. End-users and stakeholders have a keen understanding of what they want to see in a data product. Therefore, it is imperative to consult with them frequently as they will be the ones to determine whether the data product is a success or not. When stakeholders and end-users feel part of the process, they remain engaged until the data product is completed.
Cron is a software utility that acts as a time-based scheduler in Unix-like operating systems (including Linux). A cron job is a task that cron runs automatically at pre-determined times.
A cron script is a list of one or more commands issued to a computer server operating system to be executed at a specific time. Each command is executed when its triggering time arrives.
Cross-sectional data is data that observes entities such as customers, companies, and stock items at a single point in time. The opposite of cross-sectional data is time series data.
Dark data is a subset of Big Data that an organization collects and stores but does not use for data analysis, so it is treated as having no importance to the business. Nevertheless, it can hold immense value and be of enormous relevance to the business.
A data dashboard is a data visualization tool that provides users with a visual representation or insight into their organization or department’s data analytics. Dashboards are particularly useful to non-technical users, allowing them to participate and understand the results of data analytics operations.
Data analytics is the methodical study of examining raw data to derive meaningful insights and reach specific conclusions. Data analytics looks for hidden patterns, unknown correlations, customer preferences, and market trends used to improve strategic decision-making, prevent fraud, and drive innovative product development.
A data artifact is a by-product produced during the data engineering, data analytics, or data science operations. Examples of data artifacts include data pipelines, data products, dashboards, visualizations, and analytics or machine learning reports.
A data architecture is a blueprint of the organization’s data ecosystem. It includes the rules, policies, procedures, and standards used to govern how data is collected, stored, and processed. A data architecture also describes how data is extracted from multiple data sources, ingested into data pipelines, stored in the data platform, and transformed into valuable data. Lastly, it defines the processes used to extract, load, and transform the data into meaningful insights.
A data asset is high-value data. It is a data set or data product that has been derived from the raw data and that the organization uses to derive meaningful insights and generate revenue. Data assets are now some of the most valuable assets that an organization owns.
Data assurance is a promise or guarantee from the data team to the business that the data sets and products delivered to the business are correct and extremely high quality, instilling high levels of confidence in the data by the business.
Data atrophy or data degradation is the gradual corruption or loss of value of the data. If not stored, transformed, and analyzed into meaningful insights, data starts losing value as soon as it has been created.
A data attribute is a data field representing the characteristics or features of a data entity or object. For instance, customer name is an attribute of the customer object.
As described by data.world, a data catalog is a “metadata management tool that companies use to inventory and organize their data.” It provides data teams, business users, and stakeholders with the functionality to find, analyze, and manage the voluminous data that falls within the Big Data paradigm. Advantages of using a data catalog include organization-wide, role-based access to the data, data discovery, and data governance and security.
Data cleansing, data cleaning, or data scrubbing is the process of removing or fixing incorrect, corrupt, duplicate, incorrectly formatted, or incomplete data within a data set. When combining different data types from multiple data sources, there are many opportunities for the resultant data set to be less than perfect. Consequently, it is crucial to cleanse the data before serving it to the business.
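The cleansing steps described above can be illustrated with a minimal Python sketch. The record layout (`name` and `email` fields), the `cleanse` function, and the rule that an email address identifies a duplicate are all assumptions made for this example, not part of any particular platform.

```python
def cleanse(records):
    """Fix formatting, then drop incomplete and duplicate records."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Fix incorrectly formatted values (stray whitespace, inconsistent case).
        name = (record.get("name") or "").strip().title()
        email = (record.get("email") or "").strip().lower()
        # Remove incomplete or obviously invalid records.
        if not name or "@" not in email:
            continue
        # Remove duplicates (assumption: the email identifies a customer).
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "", "email": "nobody@example.com"},            # incomplete
    {"name": "Alan Turing", "email": None},                 # incomplete
]
print(cleanse(raw))
```

Real cleansing rules are specific to each data set; the point is only that every rule is explicit and repeatable.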
A data cloud is a unified data management architecture or platform that breaks down data silos, is highly available, with unlimited scale, and can run multiple workloads in parallel.
Data debt comes from the term “technology debt” and describes the cost of deferring funding for a new data platform feature that will increase the value of the organization’s data. The debt could typically be avoided by funding and implementing data governance and data management functions.
Data discovery is the process through which data teams collect metadata from multiple data sources with the help of advanced analytics tools and visual navigation of the data. It aims to increase the visibility of the entire data environment by consolidating all of the information about the data environment.
A data domain forms part of the data governance model. It is strongly aligned with a business model. For instance, individual data domains can include customer data, product data, and vendor data. Each data domain includes a domain owner responsible for every aspect of the domain, including data governance.
Data drift is the unexplained and undocumented changes to the data structure, semantics, and infrastructure. It is typical of the modern, fast-paced, explosive growth of data and data sources and rapid innovation in technologies such as data science, AI, advanced analytics, and machine learning. Not only does data drift break pipelines and data products and corrupt data, but it also exposes new data usage and analysis opportunities.
A data-driven organization, also known as an insights-driven business, is a company that uses data-derived insights to improve its business operations and to drive sustainable growth and innovation over time.
In the traditional data environment, data is siloed and owned by the IT department. Business users do not have access to the data even though they use the data to make business decisions. Data democratization translates into the fact that everyone in the organization has equal access to the data without gatekeepers creating bottlenecks at the gateway to the data.
Data downtime is a data quality issue that occurs when your data is inaccurate, partial, or incorrect. It is incredibly costly for the modern data-driven organization. It reduces trust in the data because decisions are made based on this inaccurate data, impacting ROI and day-to-day operations.
A data ecosystem is a data environment designed to grow, evolve, and scale in line with the organization’s unique information needs. Its sole purpose is to collect, store, and analyze the data generated by the organization. A data ecosystem comprises three elements: infrastructure (including data stores), analytics pipelines, and BI applications.
Data engineering is a sub-discipline of software engineering that focuses on data transportation, data storage, and data transformation. A data engineer aims to provide a consistent flow of organized data for data analysts, data scientists, and self-service business users.
A data entity is an object that replicates a real-world object and forms part of a data model. Examples of data entities include a customer, a supplier, and a stock item record. Data entities are part of the Master Data Management (MDM) model.
A data environment is similar to a data ecosystem in that it provides a platform (or place) for an organization to extract, load, store, and transform data into meaningful information.
Data federation is the creation of a virtual database or data warehouse that aggregates data from disparate data sources without physically moving the data into the database or data warehouse. In other words, federated data is virtual data aggregated from multiple different data sources. The concept of data federation is similar to data virtualization.
Data fragmentation is a condition where an organization's data is poorly cataloged, redundant, outdated, and difficult to access.
Data Governance is responsible for data security and data quality. It defines the roles and responsibilities that allow access to and ensure accountability for and ownership of data products and assets contained within the data platform. In summary, it covers the people, processes, and technologies needed to manage and protect data assets and products.
Governance is embedded in the data pipelines. Thus, the rules, processes, and procedures that ensure data security, data quality, and data deliverability are encapsulated inside the data pipelines.
Most enterprise data is still collected and stored in standalone data stores or individual data silos. Data integration is the process of amalgamating or bringing the separate data stores together to generate improved data insights and increase the value of the data stored by the enterprise.
A data integration platform is a data platform that allows data from multiple disparate sources to be loaded, stored, and transformed so that it can be analyzed by data analysts and data scientists or served to business stakeholders and end-users in the form of self-service analytics.
A data lake is a centralized repository that stores raw structured, unstructured, and semi-structured data at any scale. The data is stored as-is, in its raw format, without needing to be structured to fit a relational database or data warehouse model. Data teams can also run different types of analytics such as visualizations, dashboards, real-time analytics and train machine learning models directly on top of the data lake, directly accessing the raw data.
A data landscape is the totality of an organization’s physical data assets. It does not include any logical data views. In other words, it is the organization’s overall data storage options, processing and analytics capabilities, and BI applications included in the company’s data environment.
Data latency is the time it takes for data to travel between two points. It is typically measured as the time lapse between the data-generating event and the data's arrival in the organization’s data platform. It is also measured as the time lapse between the data's arrival in the data platform and its availability for analysis and business use.
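Both latency measurements described above are simple timestamp differences. The sketch below assumes three hypothetical timestamps (the event time, the arrival time in the platform, and the time the data becomes available); the names and values are illustrative only.

```python
from datetime import datetime, timezone


def latency_seconds(earlier, later):
    """Return the elapsed time between two timestamps, in seconds."""
    return (later - earlier).total_seconds()


# Hypothetical timestamps for a single record.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
arrival_time = datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
available_time = datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)

ingestion_latency = latency_seconds(event_time, arrival_time)
availability_latency = latency_seconds(arrival_time, available_time)
print(ingestion_latency, availability_latency)  # 45.0 255.0
```

Using timezone-aware timestamps avoids false latency readings when the source system and the platform run in different time zones.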
The data lifecycle is the sequence of processes (or stages) that successfully manage and preserve the data for use and reuse. The data lifecycle includes the following steps: Plan, collect, data quality and governance, describe, preserve (store), discover, integrate, and analyze. Not all of these stages have to form part of every data lifecycle. For instance, some data activities, such as metadata gathering, might only focus on discovering, integrating, and analyzing steps. Additionally, this lifecycle might not always follow a linear path, and multiple revolutions of the cycle might be necessary to process the data.
The data lifecycle management model helps organizations manage data throughout its lifecycle, from initial creation to destruction. A clearly defined data lifecycle management method is integral to ensuring effective data governance, and a robust data governance policy is vital to data quality and data trust.
Data lineage is a map of the data journey from its origins to how it is in its present state. Each movement and transformation step (including its analytics lifecycle) is recorded with an explanation of how and why the data has moved over time. Data lineage provides an in-depth description of where the data comes from and where it ends up.
Data management is a term describing the methodology used to maintain and oversee the processes used to plan, create, acquire, maintain, use, archive, retrieve, control, and delete data. It is also described as enterprise information management (EIM) defined by Gartner as “an integrative discipline for structuring, describing, and governing information assets across organizational and technical boundaries to improve efficiency, promote transparency, and enable business insight.”
Data masking or data obfuscation is the process of hiding the original data by covering it with modified content. This process is most often used to protect classified and commercially sensitive data. Despite being masked, the data must remain usable and valid.
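As a minimal illustration of masking that keeps the data usable, the sketch below hides all but the last few digits of a card number while preserving its length and separators. The function name `mask_card_number` and its parameters are hypothetical, and real masking policies vary by organization and regulation.

```python
def mask_card_number(card_number, visible=4, mask_char="*"):
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    keep_from = max(total_digits - visible, 0)
    masked = []
    digits_seen = 0
    for ch in card_number:
        if ch.isdigit():
            # Digits before the cut-off are hidden; the rest stay readable.
            masked.append(ch if digits_seen >= keep_from else mask_char)
            digits_seen += 1
        else:
            masked.append(ch)  # keep dashes and spaces so the format is unchanged
    return "".join(masked)


print(mask_card_number("4111-1111-1111-1234"))  # ****-****-****-1234
```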
A data mesh is a decentralized, domain-driven data architecture designed to address the architectural failure nodes predominant in the centralized monolithic data architecture. The decentralized domain-driven design pattern is similar to the microservices architecture utilized in software development.
A data model is an abstract representation of the data. It organizes the data entities, their properties, and how they relate to one another and to the properties of real-world objects.
Data modeling is the process of providing a structure for raw data to transform and analyze it, creating meaningful information.
Data mining is the process of analyzing large data sets to discover patterns, anomalies, trends, correlations, and relationships that data analysts and scientists might otherwise miss. Data mining uses methods that intersect machine learning, advanced statistical analysis, and database or data warehousing systems with the overall goal of extracting information from the data set. This information helps organizations solve problems, mitigate risks, and identify new business opportunities.
Data observability is similar to data monitoring in that it continuously monitors the health of your data systems, tracking and troubleshooting incidents to reduce and prevent data downtime. Software application observability and data observability share the same core objective: minimal disruption to the service.
DataOps is a set of principles describing how data engineers build, test, and deploy data products and platforms in precisely the same way software applications are developed. DataOps (Data + Operations) is rooted in the principles of DevOps, Agile, and Lean. It is a way to approach and deliver enterprise data operations, incorporating agility and governance. DataOps relies on continual, Agile development. It includes creativity and balances governance with agility, resulting in improved collaboration between data teams, business stakeholders, and end-users.
A DataOps engineer is a technical professional who focuses exclusively on the data development and deployment lifecycle instead of the data product itself. DataOps engineers do not work with the data itself. Instead, they engineer the environment and processes that data engineers use to build the data products. DataOps engineers also improve the development processes of data engineers and analysts by implementing the DataOps principles as outlined in the DataOps Manifesto and the #TrueDataOps philosophy.
A DataOps platform, like our Dataops.live platform on Snowflake, is a data architecture that facilitates the implementation of the principles of DataOps and #TrueDataOps and the DataOps lifecycle, balancing agility and governance, providing end-to-end orchestration, environment management, automated testing, CI/CD, and ELT.
The root definition of a data pipeline is that it moves data from source to destination. It is essentially a series of actions or steps that ingests data from multiple, disparate sources and loads it into a data platform for storage and analysis. Additionally, data pipelines can be used to transform data to prepare it for analysis and for training machine learning models.
Data pipeline components are the individual parts that together make up the pipeline that moves the data from its source to its destination.
Data pipeline debt is a form of technical debt that is incurred when data pipeline code is not properly written and tested, reducing productivity and putting data integrity at risk.
Data pipeline orchestration is the automated coordination of the individual processes or steps that make up the data pipeline. It manages the dependencies between the pipeline tasks, schedules jobs, and monitors workflows, alerting to errors in the pipeline.
A data platform is an integrated data architecture or technology solution that provides end-to-end management of the organization’s data ecosystem. The organization’s data is governed, accessed, and delivered to business stakeholders and end-users via this data platform.
A data point or observation is a discrete unit of information. In statistics, a data point is a set of one or more measurements on a single member of a statistical population.
A data product is not the same as a data project: a project has a defined scope, start, and end, whereas a data product has a continuous cycle of requirements gathering, development, testing, and delivery to the business. A product has regular releases throughout its lifecycle. It has high agility and an enormous capacity for change. Its workload backlog is continually being reprioritized, and it has automated regression testing built into the release process. A data product can be as simple as a set of dashboards or a data mart, or more involved, such as a contracted data share with a third party. Each has a clear definition and requirements that change over time. Every change to the product (whether a technical change or a simple refresh of the data) should be treated as a new version of the product.
Data profiling is the act of extrapolating information or statistics about the data based on its known characteristics, traits, or tendencies.
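A simple form of profiling is to compute summary statistics for a single column. The sketch below is illustrative only; the `profile_column` function and the choice of statistics are assumptions, and dedicated profiling tools report far more.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, top value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


country_codes = ["UK", "US", None, "UK", "DE"]
print(profile_column(country_codes))
```

A profile like this makes problems such as unexpected nulls or too many distinct values visible before the data is served to the business.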
Data provenance describes the origins of the data, including inputs, entities, systems, and processes related to the data. In other words, data provenance is the data’s historical record keeper. Information derived from a data provenance record helps with error tracking, source identification, and data governance audit trails, and helps re-enact the flow of data from its source into the data platform.
Data Quality is the standard for the condition of the data based on factors such as reliability, whether it is up to date, errors, consistency, completeness, and accuracy. Organizations must measure the quality of their data to resolve errors, inaccuracies, incomplete data, and reliability issues; otherwise, the data will not be fit-for-purpose.
The practice of managing data quality aims to maintain a consistently high quality of data. It extends right across the data ecosystem, from data acquisition to data delivery. Its focus is on continuously improving the quality of your data.
Data relationships are the cornerstone of relational databases. A relationship between two data entities describes the commonalities between the two entities or objects, allowing users to explore this relationship by joining data from each object.
Data reliability describes the accuracy and completeness of the data served to the business for self-service analytics purposes.
A data resource is a component of an organization’s data infrastructure and overarching IT infrastructure. It represents all of the data generated by and available to the organization.
Data science is an interdisciplinary discipline that uses scientific methods, analytics processes, and machine learning and Artificial Intelligence to derive new and innovative insights from raw unstructured, semi-structured, and structured data. It uses techniques and theories drawn from the academic fields of mathematics, computer science, statistics, information science, and domain knowledge.
A data set is an ordered collection of related data. Data sets can be massive and span multiple data sources. The data is organized within each data set based on a specific model designed to process and analyze the data, creating meaningful information.
A data silo is an insular data management strategy where stakeholders and business users across the organization do not have access to all of the company’s data. The stakeholders and users will only have access to a specific subset of the organization’s data because it is siloed or partitioned.
Data standardization is converting data to a standardized format to facilitate the processing and analysis of this data. Data usually comes from a wide variety of different sources and in various formats. Therefore, it must be transformed into a standard format before it can be accessed and analyzed.
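As a small example of standardization, the sketch below converts dates that arrive in several formats into the single ISO 8601 format (`YYYY-MM-DD`). The list of source formats in `KNOWN_FORMATS` is an assumption for illustration; in practice it depends on the sources being integrated.

```python
from datetime import datetime

# Assumed source formats; the first format that parses wins.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardize_date(text):
    """Convert a date string in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {text!r}")


for value in ["2024-03-05", "05/03/2024", " 20240305 "]:
    print(standardize_date(value))  # each prints 2024-03-05
```

Raising an error for unknown formats, rather than guessing, keeps bad values from silently entering the standardized data set.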
A data swamp is a data lake that has deteriorated because its data is unmanaged. Either it is inaccessible to its users, or it is not providing any value to the organization.
The data value is the worth of the data to the organization it belongs to. In other words, it is the value that is derived from transforming, processing, and analyzing the data.
Data vault modeling is a “database modeling method that is designed to provide long-term historical storage of data coming in from multiple operational systems.” It also provides a method of considering historical data that deals with “issues such as auditing, tracing of data, loading speed and resilience to change as well as emphasizing the need to trace where all the data in the database came from.” Data vault modeling does not differentiate between good and bad data. It supplies a single version of the data facts. This means that every row in the data vault must be accompanied by the source of the record and its load date.
Data visualization is the graphical representation of the data and the information derived from this data. It is an effective way of communicating meaningful insights to business users, especially when the data is voluminous.
A data warehouse is a centralized data management system designed to support Business Intelligence and analytics activities. It centralizes and consolidates voluminous data from multiple disparate sources. It often contains massive volumes of historical data and is intended solely for performing large-scale queries and analytics on this data.
A database grant is a permission or privilege that gives users access to different database functions. These grants range from administrator access, where the user has access to the entire database and all of its functionality, to read-only access, where the user can only view the data in the database without making any changes to the data or its underlying structure.
Database lifecycle management (DLM) combines a business and technical policy-based approach to maintaining databases and database objects. It is not a product. Instead, it is a methodology for managing all the elements in a database, such as a database schema, the data, and the metadata.
A database object is a defined structure in a database used to store or reference data such as a table, view, sequence, synonym, or index.
A database schema is an abstract representation of the design of a database. It describes the organization of data as well as the relationships between the data in the different database tables, columns, and rows.
The SQL Data Definition Language (DDL) consists of the SQL commands used to define a database schema. It deals with descriptions of the database schema. It is used to create, drop, alter, truncate, and rename objects in the schema, such as tables, synonyms, views, indexes, and databases themselves.
A compiler is an engine or software application that converts instructions into a form that can be read and executed by a computer. A declarative compiler takes a SQL-based declarative definition describing the desired state of an object and determines how to reach that state without being told how by the programming code.
A declarative language follows a programming paradigm that describes what the state of a data product, database, or data pipeline component should be without describing how to reach that state.
Deep learning is a subset of machine learning. It is designed to train a machine or computer to perform human-like tasks, such as recognizing speech, making predictions, and identifying images. It is capable of unsupervised learning from unstructured or unlabeled data. Deep learning’s key focus, the ability to continuously improve and adapt to changes in the underlying data patterns, presents an opportunity to introduce more dynamic behavior into data analytics.
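The declarative idea can be sketched in a few lines of Python: the caller states only the desired columns of a table, and a small planner works out the steps needed to get there. The `plan_changes` function and the column names are hypothetical, and the emitted statements are simplified fragments rather than complete SQL for any specific database.

```python
def plan_changes(current_columns, desired_columns):
    """Work out the steps that turn the current state into the desired state."""
    to_add = [c for c in desired_columns if c not in current_columns]
    to_drop = [c for c in current_columns if c not in desired_columns]
    return [f"ADD COLUMN {c}" for c in to_add] + [f"DROP COLUMN {c}" for c in to_drop]


current = ["id", "name", "fax"]    # what exists today
desired = ["id", "name", "email"]  # the declared target state
print(plan_changes(current, desired))
# ['ADD COLUMN email', 'DROP COLUMN fax']
```

When the current state already matches the declared state, the plan is empty, which is what makes declarative definitions safe to apply repeatedly.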
A deployment pipeline deploys or delivers code updates from the Git repository or version control to other environments, making it readily available to end-users in an automated fashion.
A design pattern is a repeatable solution to a commonly occurring problem within a given context. A design pattern is an abstraction, a template that does not translate directly into executable code. In other words, it is a problem-solving template that can be used as a foundation for creating a solution to a problem. Data pipeline design patterns help provide a foundation for building and deploying robust, functional data pipelines.
DevOps (development + operations) is a philosophy and practice that combines software development and operations functions. It aims to shorten the time to value by reducing the software application development and deployment lifecycle and providing continuous delivery and high software application quality.
A directed acyclic graph or DAG is a conceptual representation of a series of activities, such as the steps involved in running a data pipeline from start to finish. The order in which the actions occur is visually represented by a set of circles that form a graph, some connected by lines showing the flow from one activity to another.
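A DAG can be expressed directly in code as a mapping from each step to the steps it depends on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to derive a valid run order; the step names and the `run_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


# Each key is a step; its value is the set of steps that must finish first.
pipeline = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "test": {"transform"},
    "publish": {"test"},
}
print(run_order(pipeline))
# ['extract', 'load', 'transform', 'test', 'publish']

# A cycle means the graph is not a DAG, so no valid order exists.
try:
    run_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: not a DAG")
```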
A distributed data pipeline is simply a data pipeline that is made up of individual components that run on different computer servers or hardware infrastructure. In order to run a distributed data pipeline, a notification or messaging system is used to orchestrate each step of the pipeline.
Data manipulation language is a SQL-based language that is used to insert, delete, and update or modify data in a database.
A domain-agnostic model is the opposite of a domain model. It is not subject-related. Rather it is an abstract interpretation of what a domain in the organization’s data environment could look like.
The domain-driven design pattern is an approach to software and data architecture development that focuses on the domain model as its core construct.
A domain model is a conceptual model of the domain that incorporates both data and behavior. A domain model is a formal representation of the knowledge domain in a knowledge graph, including roles, data types, concepts, and rules.
Dynamic data masking helps prevent unauthorized access to sensitive data by allowing data teams to decide how much sensitive data to reveal on the client-side without changing the raw data. Dynamic data masking is a policy-based security feature that hides sensitive data and is applied at the application layer.
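A rough sketch of the policy idea: the raw rows stay untouched, and a masking function decides at read time how much of a sensitive value a given role may see. The role names, the `mask_email` policy, and the sample row are all hypothetical; real data platforms implement masking policies in their own way.

```python
# Raw data is stored once and never modified by masking.
RAW_ROWS = [{"name": "Ada", "email": "ada@example.com"}]

def mask_email(value: str, role: str) -> str:
    """Hypothetical policy: only the 'admin' role sees the full address."""
    if role == "admin":
        return value
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain

def read_rows(role: str) -> list:
    """Return copies of the rows with the masking policy applied."""
    return [{**row, "email": mask_email(row["email"], role)} for row in RAW_ROWS]

print(read_rows("analyst"))
print(read_rows("admin"))
```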
Extract, Load, and Transform (ELT) is the process of extracting data from multiple, disparate data sources and loading it into a target data platform without transforming it in the process. ELT is the preferred alternative to the traditional ETL process because it pushes the transformation component to the target data platform, which improves data lineage, performance, data governance, and data provenance.
An ephemeral data set is a logical data set in that it does not contain physical data. Instead, it contains pointers to the physical data. The benefits of an ephemeral data set include needing to keep only one copy of the raw data, the ability to create multiple transformed data sets based on individual stakeholder requirements, reduced heavy workloads, and improved data platform and analytics performance.
ETL or extract, transform, and load is the traditional data integration process that combines data from multiple sources into a single, targeted data platform. It was introduced in the 1970s to load or integrate data into mainframes or supercomputers for computation and analysis.
Loading the raw data into the data platform before transforming it is a far superior method to transforming the data before loading it into the data platform. However, there are cases where data must be removed, anonymized, masked or obfuscated for privacy reasons or because of regulations such as GDPR.
Fail fast is a philosophy that values the CI/CD principles of incremental development and extensive testing to determine whether an idea has value. It is often associated with the Lean startup methodology and is designed to cut losses when testing reveals that something isn’t working and to quickly try something else. This concept is known as pivoting.
Data engineers can create individual branches that are safe and isolated to develop new data pipeline features and data products simultaneously. These feature branches are then merged back into the master branch via a merge request.
A first-class workflow is similar to a first-class object. It can be dynamically created, passed between different data pipeline components, and destroyed, and it has the same rights as any other programming language variable.
G - L
Gallium is DataOps.live's data compression algorithm used to extract value from volume in IoT time-series data.
Git is a software application that tracks and manages changes made to files. A Git repository is the .git folder within a project’s source code folders. It keeps a history of all the changes made to the configurations and source code files over time.
Granular data is detailed data, or the lowest level of detail represented in an analytics report. The greater the data granularity, the deeper the level of detail. Data granularity also allows business users to drill down into the data from a high-level overview, right down into the deepest level of detail.
The graph data model is the visualization of a data model, describing its nodes and relationships. This model is based on the property graph approach and is very close to how people draw out entities and their relationships. The graph data model is data-platform agnostic and is used to model the relationships and nodes (entities) found in data.
A graph edge is also known as an arc, branch, or link. Succinctly stated, it is the line between two nodes.
A graph node or vertex is the primary element from which graphs are formed. A node is also a basic unit of a data structure, such as a linked list or tree data structure. Graph nodes and data structure nodes also contain data and may connect to other nodes.
A highly available system or a system with high availability is able to operate continuously without failing or experiencing any downtime. This definition also includes the system’s ability to recover from unexpected events or downtime in the shortest time possible.
An idempotent operation is an action or request that can be applied multiple times without changing the result. Therefore, making multiple identical requests has the same effect as making a single request.
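To make the distinction concrete, here is a small sketch that contrasts an idempotent operation with one that is not. The function and field names are hypothetical.

```python
def set_status(record: dict, status: str) -> dict:
    """Idempotent: repeating the call leaves the record in the same state."""
    record["status"] = status
    return record

def append_event(record: dict, event: str) -> dict:
    """Not idempotent: every call adds another entry."""
    record.setdefault("events", []).append(event)
    return record

order = {}
set_status(order, "shipped")
set_status(order, "shipped")      # no further change
append_event(order, "shipped")
append_event(order, "shipped")    # the list grows again
print(order)
```

This property is what makes it safe to re-run a failed pipeline step: an idempotent step can be retried without duplicating its effect.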
Imperative programming is a programming paradigm that uses statements to change a program’s state. In other words, an imperative program specifies a clearly defined sequence of instructions detailing how the code should achieve its result.
Infrastructure-as-a-Service is a cloud computing service that provides users with virtual computing resources over the Internet. In this model, the cloud provider manages the IT infrastructure, including storage, networking, and server resources. These resources are delivered to subscriber organizations via virtual machines accessible via the Internet.
The Internet of Things (IoT) is a diverse set of technologies and use cases with multiple, disparate definitions. IoT is the use of networked devices that are embedded in the physical world and connect to the network to upload read-only data that they gather from their environment. The data collected by these devices is known as telemetry. Much of this telemetry can be considered time series data, but it is not exclusively time series data.
Jobs are the elements contained within a pipeline to orchestrate the pipeline’s workflow. Each job mainly consists of the applicable technology (like Snowflake, Talend, Python, Stitch, and HTTP APIs), configurations, criteria for the job to succeed, and the location where the pipeline should execute this job.
A knowledge graph is a knowledge base that uses graph technology or a graph-structured data model to integrate or describe data. Knowledge graphs are often used in data catalogs to store interlinked entity descriptions such as objects, events, abstract concepts, and situations.
The core idea of the Lean methodology is to maximize customer value while minimizing waste. In other words, Lean means creating more value with fewer resources. It is a way to optimize people, resources, effort, and energy towards maximizing value for the customer. It is based on two guiding tenets: Continuous improvement and respect for people.
A linear version control system is simple in its design and execution. Each version follows its predecessor and must be based on its forerunner. This versioning system is common in the Waterfall software development method and is extremely limited in a development environment where several colleagues are working on new features simultaneously. It is problematic to merge these new features to create a new linear version.
A logical data model represents or describes data definitions, characteristics, and relationships independent of the physical storage model. This process provides information about the many different elements that make up your organization’s business data, as well as how these elements relate to each other.
M - R
Machine learning is a subset of Artificial Intelligence. Succinctly stated, machine learning is where a computer system learns how to perform a task, rather than being programmed how to do so. In other words, it trains a computer system or machine how to learn. Machine learning algorithms learn from data, identify patterns, and make decisions with minimal human intervention.
Massively parallel processing (MPP) is the ability of a data platform to accept multiple concurrent requests and process these requests seamlessly and in parallel.
A Git master branch is the production-ready code in the code repository. A master branch is deployable. Therefore, it is not a good idea to make changes directly to the master branch. Feature branches can be created from the master branch, used to develop new features, edit or update existing features, and merged back into the master branch once the code has been tested and approved.
Master data management (MDM) is the technology, tools, and processes that create, maintain, and manage an organization’s master data, ensuring that the master data is consistent and coordinated across the organization.
Metadata is data about data. There are different types of metadata, including business metadata, technical metadata, data governance metadata, data elements metadata, reference metadata, and information architecture metadata.
Metadata management is the overarching discipline of ensuring that the organization’s metadata is created, maintained, governed, and utilized according to the company’s metadata policies and procedures.
A metadata model is similar to a data model. It is an abstract model that organizes the metadata elements and standardizes how they relate to the data and the properties of real-world entities.
Once a Git feature branch has been checked out, it can be changed and tested. Then it is ready to be merged back into the master branch. A merge request is created, which also acts as a review tool or a second pair of eyes to review the changes. Once these changes have been approved, the feature branch is merged back into the master branch. If there are any conflicts between multiple feature branches developed concurrently, they are manually reviewed and sorted out during the merge.
The NoSQL or Not Only SQL database is a database model that provides flexible schemas for storing and retrieving data that does not fit in the traditional relational database model. NoSQL databases have recently increased in popularity due to the rise of Big Data, cloud-based data platforms, and high-volume web and mobile applications. The most common types of NoSQL databases are key-value, document, column, and graph databases.
An ontology is a data model that represents a set of concepts or categories within a subject area or domain, showing their properties and relationships between them.
Orchestration is the movement of data and code logic along the data pipeline and between the different tools and components of the pipeline. Automating the orchestration process is an essential part of ensuring simplicity and pipeline run success over time.
Personally identifiable information (PII) is any data that can be used to identify a specific person, including social security numbers, email, physical, and postal addresses, phone numbers, identity numbers like passport numbers, IP addresses, and user names and passwords.
The principle of least privilege (POLP) is a computer security concept that limits user access rights to only what they strictly need to do their jobs. Users are granted read, write, and execute privileges only for the resources required to do their jobs.
A property graph is a graph model that contains nodes, relationships, and additional information like name, type, and other properties. Its primary aim is to show connections between data scattered across distributed, diverse data architectures and data schemas. A property graph provides a richer view of how data can be modeled over many different databases and how diverse metadata relates to each node and relationship.
Protected Health Information (PHI) is health data created, stored, received, or transmitted by organizations in relation to the provision of healthcare, healthcare operations, and payment for healthcare services.
Raw data, also known as source data or atomic data, is data that is extracted from its source and stored in a data platform (or data store) without being transformed or processed.
Refactoring is defined as a way to refine data product and pipeline code without changing the functionality. This process can reduce complexity and increase speed. The goal is to improve maintainability and extendibility.
Regression testing is a testing practice conducted to verify that a data pipeline (or platform) code change does not negatively impact the existing pipeline functionality. Its primary function is to make sure that the pipeline works well with the new or updated code. Practically speaking, previously executed test cases or test sets are re-executed to verify the impact of the change.
A relational database management system (RDBMS) is a database management system based on the relational model. Relational databases use a structure that allows the identification and access of data in relation to other data in the database. This data is organized into tables, columns, and rows.
The repeatable schema management paradigm maintains and manages the database schema’s ability to duplicate itself with the same effects.
A RESTful API is an Application Programming Interface that uses HTTP requests and responses to access and use the data in a data store that the API is linked to. HTTP commands like GET, PUT, POST, and DELETE refer to querying, updating, creating, and deleting data in a database or data warehouse.
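The mapping from HTTP methods to data operations can be sketched without a network by using a plain function and an in-memory dictionary as the data store. The `handle` function, its signature, and the exact status codes returned are assumptions made for illustration, not a real framework API.

```python
# In-memory stand-in for the data store behind the API.
store = {}

def handle(method: str, key: str, body=None):
    """Hypothetical dispatcher: returns (status_code, response_body)."""
    if method == "POST":            # create
        store[key] = body
        return 201, body
    if method == "GET":             # query
        return (200, store[key]) if key in store else (404, None)
    if method == "PUT":             # update
        store[key] = body
        return 200, body
    if method == "DELETE":          # delete
        store.pop(key, None)
        return 204, None
    return 405, None                # method not allowed

created = handle("POST", "42", {"name": "Ada"})
updated = handle("PUT", "42", {"name": "Ada L."})
fetched = handle("GET", "42")
deleted = handle("DELETE", "42")
missing = handle("GET", "42")
print(created, updated, fetched, deleted, missing)
```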
S - Z
A Software-as-a-Service model is a software application that is hosted in the cloud or on cloud-based infrastructure. It is accessed and operated via a web browser, and businesses pay monthly or annual subscription fees for the right to use the application to run their business operations.
Scalability measures a system’s ability to increase or decrease in performance quickly and efficiently in response to changes in system and application processing demands. For instance, the scalability of a data platform measures how well it performs with increasingly heavy workloads.
A schema describes the logical organization of any data resource. It depicts the logical, overarching view of the entire data environment, all the data stored in the data platform. It defines how the data is organized and what the relationships between the entities are.
Secrets management is the method of managing digital authentication credentials, including passwords, keys, APIs, and tokens for use in applications, services, privileged accounts, and other sensitive parts of the data ecosystem. Secrets management is vital across the organization; it is especially relevant to the DataOps environments, tools, and processes.
Self-service data analytics offers advanced analytics functionality to self-service business users. It allows users to perform all of the advanced analytics functions using complex mathematical and statistical algorithms without a background in mathematics, statistics, or technology.
Self-service BI provides business stakeholders and end-users with the functionality to explore data sets even if they do not have a background in data analytics. Self-service BI tools allow users to filter, sort, analyze, and visualize data without relying on the organization’s BI or data teams.
Semi-structured data is a form of structured data that does not fit the tabular structure of data models but contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Good examples of semi-structured data include JSON, XML, and YAML files.
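As a small illustration, the JSON below holds two records that share some fields but not others, and one of them nests a sub-record. The field names and values are invented for the example.

```python
import json

# Two records with different shapes: keys act as the "tags" that give
# the data its structure, even though it is not tabular.
raw = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Grace", "tags": ["navy", "compiler"]}
]
"""

records = json.loads(raw)

# Fields may be absent, so read them defensively.
emails = [r.get("contact", {}).get("email") for r in records]
print(emails)
```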
A service level agreement (SLA) is a commitment between a service provider and a client. It documents the services required and the expected level of service. Internal SLAs maintain a level of service between different infrastructure components, describing expected availability and performance.
A single point of failure is any critical part of a data architecture (or Information Technology infrastructure architecture) that will cause the entire system to fail if it breaks down. This concept is opposite to the goal of a highly available system. It is essentially a design flaw, incorrect configuration, or system component that is a high risk to system availability.
Snowflake’s time travel feature allows access to data that has been changed or deleted at any point within the pre-defined retention period. It is helpful in restoring tables, databases, and schemas that have been dropped. It also enables users to create clones of tables, databases, and schemas at or before specific points in the past, provided the retention period has not expired.
The separation of data cloud storage and compute functions provides data platform functionality with increased flexibility, scalability, and availability. It also has the potential to reduce costs dramatically.
Structured data is data that observes or conforms to a pre-defined data structure or model. Therefore, it is straightforward to analyze. It adheres to a tabular format with relationships between different rows and columns. Structured data depends on the existence of a data model of how the data can be stored, processed, and accessed. Structured data is considered the most traditional form of data storage. The earliest versions of DBMSs were only able to house, process, and access structured data.
SQL is a programming language that is used to communicate with databases or data stream management systems. It is used to store, retrieve, and manipulate the data in a relational database.
Source control or version control is the practice of tracking and maintaining changes to source code and data contained within the data products. Version control keeps a history of any changes to both the source code and data, prevents concurrent work conflicts, and allows changes to be rolled back if necessary. Version control also facilitates data provenance, lineage, governance, and security.
One of the most significant challenges organizations face is the need for multiple environments (Prod, Dev, QA) as well as the creation and maintenance of these environments. This challenge is solved by establishing a Single Source of Truth or a trusted data source that provides a complete picture of the overarching data environment. In a DataOps environment, the code and configurations are all moved into the Single Source of Truth, usually a Git repository. The data’s Source of Truth is still its original source systems.
Technical debt is a software development concept that measures the actual and implied costs of reworking or refactoring programming code that result from choosing a quick and easy solution instead of using a slightly longer, improved solution.
Test-driven development is a software development approach that writes the test cases before the code. This reduces the time to value and reduces the number of code rewrites resulting from changing stakeholder requirements.
Time series data is data that is collected at different points in time. It contrasts with cross-sectional data, where the behavior of an entity is observed at a single point in time. Because time series data points are collected at adjacent periods in time, it is possible to run correlations between observations.
The time-to-value metric describes the length of time it takes for an organization to derive value from its data.
Total Quality Management is the methodology by which management and employees can become involved in the continuous improvement of products and services. It is a combination of quality and management tools aimed at reducing losses due to wasteful practices and increasing business opportunities. It seeks to integrate all of the organization’s functions to meet customer expectations and organizational objectives.
The DataOps philosophy supplies the WHAT: What DataOps aims to achieve and what it should deliver. #TrueDataOps provides the HOW. It is a philosophy that defines the seven core principles or pillars of how data is managed and delivered with agility and governance. It builds on the truest principles of DevOps, Agile, Lean, test-driven development, and Total Quality Management (TQM). It applies these principles to data, data platform management, and data analytics.
Trusted data is data that inspires user and stakeholder confidence because it is valid, complete, and of high enough quality to produce analytics that can be used as a reliable basis for strategic business decision-making. Trusted data includes the following six dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness.
A ubiquitous language is a common business language or universal language that helps communication and collaboration between data teams, stakeholders, domain specialists, and business users.
Unit testing is a testing type where individual units or components are tested. Its purpose is to validate that each unit of a data product or data pipeline performs as expected. In other words, unit tests isolate a section of a data pipeline or product and test it to verify its correctness.
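A minimal sketch of a unit test, using Python's standard `unittest` module. The `normalize_country` function stands in for one small, isolated piece of a pipeline; it is hypothetical and exists only for this example.

```python
import unittest

def normalize_country(code: str) -> str:
    """Hypothetical pipeline step: tidy up a country code."""
    return code.strip().upper()

class NormalizeCountryTest(unittest.TestCase):
    def test_strips_whitespace_and_uppercases(self):
        self.assertEqual(normalize_country(" gb "), "GB")

    def test_leaves_clean_input_unchanged(self):
        self.assertEqual(normalize_country("US"), "US")

# Run the tests programmatically so the result can be inspected.
suite = unittest.TestLoader().loadTestsFromTestCase(NormalizeCountryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```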
Unstructured data is data that cannot be stored in a traditional relational database or data warehouse column-row structure. It includes data types such as image files, video files, call center transcripts or recordings, text files, PDF documents, social media posts, web pages and blog posts, and audio files.
A user privilege is the right to access a specific feature or function in a data environment. Role-based Access Control is a means to create user roles and grant privileges to a particular role rather than to individual users. For instance, should business users need access to a specific analytics dashboard, it is simply a case of creating a role with access to the dashboard and granting the users access to this role.
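The dashboard example can be sketched with two lookups: privileges are attached to roles, and users are attached to roles. All role, user, and privilege names here are hypothetical.

```python
# Privileges are granted to roles, never directly to users.
role_privileges = {
    "sales_dashboard_viewer": {"read:sales_dashboard"},
}

# Users receive access by being granted a role.
user_roles = {
    "maria": {"sales_dashboard_viewer"},
    "li": set(),
}

def can(user: str, privilege: str) -> bool:
    """True if any of the user's roles carries the privilege."""
    return any(
        privilege in role_privileges.get(role, set())
        for role in user_roles.get(user, set())
    )

print(can("maria", "read:sales_dashboard"))
print(can("li", "read:sales_dashboard"))

# Granting access later is a single change: add the role to the user.
user_roles["li"].add("sales_dashboard_viewer")
print(can("li", "read:sales_dashboard"))
```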
Value-led pipeline development defines the ultimate goal of every data pipeline as the need to deliver value to the business (stakeholders and end-users).
A workflow is a series of processes, steps, or actions performed in a sequence. Workflow orchestration is the automation of the series of steps included in a workflow. Lastly, workflow orchestration tools are software applications that provide data teams with the functionality to develop a workflow and automate it.
YAML (YAML Ain’t Markup Language), often saved with a .yml or .yaml file extension, is a human-readable data serialization language. It is often used for configuration files, but its object serialization abilities also make it an acceptable replacement for other serialization languages like XML and JSON.
Zero-copy clone is a feature found in Snowflake that allows a branch of the production environment to be instantly created in a feature branch, providing data engineers with the ability to develop and test new features in isolation, safely and securely without impacting anyone else.