When you think about Data Operations, or “DataOps”, what comes to mind?
DataOps can be a confusing topic to discuss. People have various ideas about what makes up the responsibilities of DataOps and how that relates to data management. I have worked in Data Operations and been a Data Operations manager several times in my career. In my experience, more people understand the moving pieces involved in running an application for an enterprise than understand what it takes to run its data.
Load balancers, dynamic web servers, Kubernetes clusters running microservices, virtual machines or containerized applications coming up and down throughout the day, adequate disk storage, centralized collection of logs, blue-green deployments, application databases configured for high availability, sometimes even more than one type of database server supporting the applications: these are all relatively common topics when discussing the operation of an application that supports a business, more commonly known as DevOps.
Data is different. In the application world, if a microservice misbehaves, you can destroy it and replace it with another. In the data world, destroying data could be a crime, depending on the nature of the data being kept.
In the application world, APIs process one record at a time and give their results in milliseconds. In the data world, we have processes that can manipulate millions of records at a time and can run for minutes to hours (I have even seen data processing jobs run for days).
We refer to the data ecosystem because many moving parts protect, enrich, enable, and produce data products for an enterprise.
My mission statement for a data operations group is as follows:
“To get the right data to the right people, at the right time, and in the right format.”
What is the right data?
Who are the right people?
When is the right time?
Where is the definition of the format?
This simple mission statement, backed up by a proper philosophical approach, can accomplish many things. This statement implies that data must move.
Move from where to where?
Data must move for it to be of value. Typically, there is a process that brings application data into an enrichment platform (data lake, data warehouse, data mart, data lakehouse) to be used by non-application users. The tools needed to move data are very diverse: ETL, ELT, data loading platforms, change-data-capture processes, transaction-log monitoring, and even running queries directly on the application database (this last approach should be frowned upon because it will impact application performance).
The reason the data has to move is that a data model designed for application performance is seldom the design that is useful for data scientists, business analysts, or business intelligence consumers. The data modeling techniques used for application performance differ from those needed for reporting and analysis. Not only must the data flow; it must also be transformed.
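As a minimal sketch of that difference, assume a hypothetical application database with normalized `customers` and `orders` tables. The table and column names are invented for illustration, and Python's built-in `sqlite3` module stands in for both the application database and the enrichment platform. The transformation flattens and aggregates the application tables into one row per customer, which is closer to what an analyst wants to query.

```python
import sqlite3


def build_app_db() -> sqlite3.Connection:
    """Create a tiny normalized application schema (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
        INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
    """)
    return conn


def build_reporting_rows(conn: sqlite3.Connection) -> list:
    """Flatten and aggregate into one row per customer for reporting."""
    query = """
        SELECT c.name, c.region,
               COUNT(o.id)   AS order_count,
               SUM(o.amount) AS total_amount
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name, c.region
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


rows = build_reporting_rows(build_app_db())
for row in rows:
    print(row)
```

In a real ecosystem the query would not run against the live application database; the raw tables would first be landed in the enrichment platform by one of the movement tools listed earlier, and the transformation would run there.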
Do I have the right data?
Knowing the data collected by the application systems can take time and effort. I have gone through the process of explaining what a data dictionary is to several application developers. I don’t think I would have had the strength to describe a data catalog to that audience.
Data dictionaries, metadata, and the data-model conventions that record when a row was created or updated, not to mention whether it was soft-deleted rather than hard-deleted, are all important. So are data model design and documentation that illustrate how the data in one table relates to the data in another. These are all fundamental parts of the data ecosystem, even if they are not part of the application environment.
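To illustrate why those audit columns matter, here is a small sketch assuming a hypothetical `accounts` table with `created_at`, `updated_at`, and `deleted_at` columns (all names are invented for this example, and `sqlite3` stands in for the application database). With these columns in place, a data movement process can ask for only the rows that changed since its last run, including rows that were soft-deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        deleted_at TEXT            -- NULL means the row is still active
    );
    INSERT INTO accounts VALUES
        (1, 'Acme',    '2024-01-01T00:00:00', '2024-01-01T00:00:00', NULL),
        (2, 'Globex',  '2024-01-02T00:00:00', '2024-01-05T00:00:00', NULL),
        (3, 'Initech', '2024-01-03T00:00:00', '2024-01-06T00:00:00',
                       '2024-01-06T00:00:00');
""")


def changed_since(conn: sqlite3.Connection, watermark: str) -> list:
    """Return rows created, updated, or soft-deleted after the watermark.

    Timestamps are ISO-8601 strings, so plain string comparison orders
    them correctly.
    """
    return conn.execute(
        "SELECT id, name, deleted_at IS NOT NULL AS is_deleted "
        "FROM accounts WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()


changes = changed_since(conn, "2024-01-04T00:00:00")
print(changes)
```

Had the third row been hard-deleted instead, the query would return nothing for it, and the downstream copy would silently keep a record that no longer exists in the application.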
Having a well-defined data modeling process as part of the deployment of the enrichment platform is crucial to success. Finding the data necessary to answer a question should be as simple as looking it up. It should not be a multi-week effort to pin down various application developers and ask them what they were thinking when they created a new table.
The world runs on schedules. Running the data movement and transformation processes frequently, and ensuring that the data is fresh enough for its purpose, is why some data operations groups have multiple shifts. At one point, one of my promotions to data operations manager required me to go into the office and monitor all the data feeds starting at 0400.
It was an on-premises physical server solution, a few years before the data cloud or even cloud computing was available. If there was any failure, whether soft or hard, the data still had to flow, so fixing it on the spot (in production) was mandatory. The need for the data to flow has only grown with the arrival of more diverse cloud-based applications.
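The kind of monitoring described here can be sketched as a simple freshness check. This is an illustration under stated assumptions, not a description of any particular tool: the feed names, the timestamps, and the per-feed age limits are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_feeds(last_loaded: dict, max_age: dict, now: datetime) -> list:
    """Return the names of feeds whose last successful load is too old."""
    return sorted(
        name
        for name, loaded_at in last_loaded.items()
        if now - loaded_at > max_age[name]
    )


# Hypothetical state at the start of a 0400 shift.
now = datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": now - timedelta(minutes=20),
    "inventory": now - timedelta(hours=30),
}
max_age = {
    "orders": timedelta(hours=1),
    "inventory": timedelta(hours=24),
}

stale = stale_feeds(last_loaded, max_age, now)
print(stale)
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.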
Ensuring that the data is fit for purpose requires that the data be transformed. The physical data model for an application database will differ from the one for an enrichment platform. Data structures optimized for application performance are only sometimes useful for analytical, visualization, or reporting purposes.
Machine learning algorithms, data visualization tools, business intelligence tools, dashboards, and even Excel spreadsheets need to have data in a particular structure. All of this is transformation work that the DataOps team must provide.
Whether or not the enterprise has a formal group named DataOps, I assure you that all this work is necessary. Sometimes the budget for all these tools, much like the data ecosystem described above, is spread across an assortment of organizational responsibilities, with one set of people taking on the responsibility of ensuring that data flows. Hidden in the whitespace of budget line items resides a group of people working tirelessly to ensure that the enterprise’s most valuable asset, its data, can do its job.
Their job could be made more efficient. They could start producing valuable data products for your organization, unless they are too busy working with Terraform or writing integration code to orchestrate all the tools in the ecosystem to meet the mission objectives. They should be able to write individual test cases for the intermediate data structures that will ultimately produce the data products for the organization.
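A test case for an intermediate data structure can be quite small. The sketch below assumes a hypothetical intermediate dataset of order rows represented as Python dictionaries; the column names and the three rules (unique key, no missing customer, no negative amount) are illustrative assumptions rather than a standard.

```python
def check_intermediate(rows: list) -> list:
    """Return a list of human-readable failures for an intermediate dataset."""
    failures = []
    ids = [row["order_id"] for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id is not unique")
    if any(row["customer_id"] is None for row in rows):
        failures.append("customer_id contains nulls")
    if any(row["amount"] < 0 for row in rows):
        failures.append("amount contains negative values")
    return failures


good = [
    {"order_id": 1, "customer_id": 7, "amount": 10.0},
    {"order_id": 2, "customer_id": 8, "amount": 0.0},
]
bad = [
    {"order_id": 1, "customer_id": 7, "amount": 10.0},
    {"order_id": 1, "customer_id": None, "amount": -5.0},
]

print(check_intermediate(good))
print(check_intermediate(bad))
```

Returning every failure, instead of stopping at the first one, gives whoever is on shift the full picture of what is wrong with a load before they start fixing it.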
A Data Platform is needed to manage the operations of these moving parts. It needs to:
- Coordinate the various tools within the ecosystem.
- Manage data structures and objects within the database.
- Manage versioning and branching of the production database.
- Run all tests that have ever been written to validate the data.
- Provide visibility into the overall status of operations.
- Ensure that the data flows to the right place.
- Avoid interfering with production in any negative way.
- Capture metadata and generate documentation dynamically from the metadata.
- Allow for the straightforward integration of new tools.
These are just some of the things that DataOps.live does out of the box that I have had to build from scratch during my time as a Data Operations Manager.
As you think about how to enable your data operations team to turn things up to 11, be sure to check out our free trial through Snowflake Partner Connect.
The Data Operations team that uses this platform will be able to show an organization how to use data to grow the business.