
The evolution of DataOps: DIY pitfalls and peril

Written by Doug 'The Data Guy' Needham | May 2, 2023

Offense gets the glory; defense wins the game.  

Customer-facing applications built within an enterprise are obvious revenue generators. I think of the ability to reach out directly to customers and provide them with a good or service that our organization offers as an offensive capability.

There are ways to compare the value generated by such an application to the cost of running the infrastructure needed to support it.

For internal applications purchased for specific purposes, such as customer management, resource planning, supply chain, merchandising, purchasing, and production (if the enterprise makes the products being sold), it is easy to compare the cost savings, and perhaps even the revenue generated, against the cost of implementing and running the application suite that provides that functionality.

What about the defense side of things?  

An application without Data is useless.  

Data without an application just needs to be organized differently.  

Whether an application is custom built or purchased off the shelf, each one will have different requirements for data. For reporting and analytics, it is seldom the case that a data structure built to store data for an application is useful, without changes, for creating reports or feeding a machine learning algorithm.
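
To make that concrete, here is a minimal, purely illustrative sketch in Python with pandas. The customers and orders tables, their columns, and the aggregation are all hypothetical; the point is simply that data stored in a normalized, application-friendly shape usually has to be joined and reshaped before it is ready for a report or a model.

  # Hypothetical example: application tables stored in normalized form must be
  # joined and aggregated before they are useful for reporting or ML.
  import pandas as pd

  # Normalized application tables, as an order-entry system might store them
  customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})
  orders = pd.DataFrame({
      "order_id": [10, 11, 12],
      "customer_id": [1, 1, 2],
      "order_total": [120.0, 80.0, 200.0],
  })

  # Reporting-friendly structure: one row per customer with aggregated measures
  report = (
      orders.merge(customers, on="customer_id")
            .groupby(["customer_id", "region"], as_index=False)
            .agg(order_count=("order_id", "count"),
                 total_spend=("order_total", "sum"))
  )
  print(report)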

Once data from the various applications within an organization starts to be collected into an Enrichment Platform, new products can be created.

Data Products can be created.

Data Products use data collected from as many applications as are available to provide additional value. That additional value could be insight into customer interactions, financial statements for external reporting requirements, or even daily progress reports on how well a business unit is performing.

Creating a set of Data Products requires a team separate from the DevOps team that manages application needs. This team is generally called a Data Operations, or DataOps, team.

Some of the things that the DataOps team focuses its time on are:  

  • Creating Data Products
  • Updating data models
  • Extracting data
  • Transforming data 
  • Pulling data from outside the enterprise 
  • Managing infrastructure for data enrichment tools
  • Managing Data Catalogs or data observability platforms
  • Data Cleansing and labelling
  • Data acquisition 
  • Feature engineering
  • Deploying data modifications to production
  • Ensuring security is enforced
  • Report creation

The main person responsible for designing how all these things work together is the Data Architect. Various Data Engineers report to the Data Architect. Every Data Architect I have ever met has been quite the thought leader. They can understand and explain how data flows through the entire organization.  

The following is an optimistic scenario in which a recently hired Data Architect builds a proper data architecture from scratch. The organization that hired her has grown from humble beginnings into a mid-sized company that wants to bring Data Science into the business. The teams that built the organization have moved on to bigger and better things, so she is hired, on the strength of her experience, to revolutionize the data department!

Starting from almost nothing, with limited resources, she looks to open-source tools to bring things together: GitLab, Terraform, Airflow, Docker, and dbt all have open-source options. She has been hired to create a custom data architecture because the existing one is a handful of scripts that have been running since the applications were new, and all of the Data Engineers are SQL experts with limited system administration expertise.

To get everything auditable, she implements GitLab. There is a learning curve for the data engineers, but they are able to get a GitLab server deployed within the organization, and all the scripts are cleaned up and stored in a repository on that server. Overall, this "side project," which provides little perceived value to the business, takes two people about six weeks.

Once these two have GitLab up and running, their next task is to set up, learn, and deploy Airflow. They start on that while the other six people on the team work on their own tasks: the rest of the team continues to maintain what is already running and handles break-fixes to keep the lights on for the organization.

Keeping track of time: that is six weeks with no new projects for the business (limited value).

Now, to ensure the overall data structures can be modified without complicated DDL (Data Definition Language) changes, all the objects defined in the databases she is responsible for are stored in the repository and fed into Terraform so that Terraform can manage the environment automatically. It takes two additional engineers about two weeks to get familiar with Terraform and learn how to apply Terraform scripts against the database without breaking anything.

During this window, the Airflow team gets Airflow set up to communicate with everything in the production environment. With that done, they start working with the dbt team to learn how to tie the two together.

Keeping track of time: that is eight weeks with no new projects for the business (limited value).

Now that Terraform is in place, the scripts must be broken apart so that Terraform handles only the schemas, data structures, supporting database objects, and the security roles covering all objects in the database. They start the migration and, over the following four weeks, get everything moved to Terraform with no outage and no interference with production.

Keeping track of time: that is twelve weeks with no new projects for the business (limited value).

In the meantime, while the Terraform team is doing that work, the other four data engineers are learning dbt and reworking their approach to follow dbt best practices. It takes a few weeks (let us say four) of trial and error to learn how to convert their stored procedures to the dbt approach. Since this team must wait for the infrastructure components before it can make production changes, the long pole in the tent becomes the Terraform migration; the next steps do not start until that is complete.

Finally, the dbt team is ready to deploy the new scripts to production. They disable the scripts that have been running and enable the dbt processes orchestrated by Airflow. All six people work together to get the Airflow orchestration of dbt running, and they pull it off! dbt and Airflow are now deployed in production with no outage and no interference. This entire process takes only three weeks.
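
For a sense of what that orchestration can look like, here is a minimal sketch of an Airflow DAG that runs and tests a dbt project, assuming dbt is invoked through Airflow's BashOperator. The DAG name, schedule, and project path are hypothetical, not details from the story; a real deployment would add connections, alerting, and per-task retries.

  # A minimal, hypothetical Airflow DAG that orchestrates dbt
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  DBT_PROJECT_DIR = "/opt/dbt/analytics"  # hypothetical dbt project location

  with DAG(
      dag_id="dbt_daily_build",
      start_date=datetime(2023, 1, 1),
      schedule_interval="@daily",
      catchup=False,
  ) as dag:
      # Build the models, then run the tests dbt defines against them
      dbt_run = BashOperator(
          task_id="dbt_run",
          bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
      )
      dbt_test = BashOperator(
          task_id="dbt_test",
          bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
      )

      # Only consider the nightly build healthy if the tests pass
      dbt_run >> dbt_test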

Keeping track of time: that is fifteen weeks with no new projects for the business (limited value).

At this point, all the historical legacy code has been modified and migrated to a new data architecture made up of current, state-of-the-art tools. She can now start taking requests from the newly formed Data Science team and begin deploying a Business Intelligence suite to produce reports easily. However, before accepting any new requests, she decides it would be best to run for one week with all the new tools in place before making any modifications.

They make it through the week with no interruption of services, no nighttime calls to the data operations team. Everything runs smoothly.  

The team can now accept new requests for the creation of Data Products.

Keeping track of time: that is sixteen weeks with no new projects for the business (limited value).

  • 9 People
  • 16 Weeks
  • 4 Months
  • 1 Month + 1 Quarter

This is just a story, right? With a little imagination, you can see how these numbers would apply to your organization. How would something built like this within your organization handle a stress test? I have also left out the ongoing maintenance of this data platform and the constant upgrades necessary to keep it current. If you ignore that effort, there will come a time when packages no longer work or are no longer supported.

I once worked at an organization that did not keep things current. When we migrated the application to the cloud, we had to migrate an entire repository of Perl packages pinned to fixed versions (some more than a few years old) to ensure that the application still worked. Updating the application to use new packages would have taken months, since there was never time to do things the right way. There always seemed to be time, though, to hack things together and apply band-aids to make it work for now.

Do not let this be your organization.

By leveraging the DataOps.live Platform, this can remain just a story. For numerous customers, we have had everything set up, and the team trained in how to start creating Data Products, within the first few weeks of purchase. Our team focuses on enabling new features and keeping the plumbing working.

Want to learn more about DataOps.live?   


Ready to jump right in?
Begin your DataOps.live Free Trial