DataOps FAQs
Introduction
Below you'll find answers to the questions we get asked the most about DataOps and DataOps.live. If you have a question that you can't find an answer to, please contact us or attend an Office Hours session.
FAQs
DataOps is to data engineering as DevOps is to software engineering. Software engineering is the process of developing the software or writing the lines of programming code. DevOps is the set of processes and technologies around how the complete software application is built, tested, and deployed. DevOps supports software engineering and makes software engineers far more agile and efficient. Engineers change the code, and DevOps ensures that an entire version of the software, including their changes, is built and tested automatically, with the correct environments created along the way.
DataOps is precisely the same for data. Data engineering is still required, but DataOps takes on all the heavy manual lifting: building environments, getting test data, running all the automated testing, and, assuming everything functions as expected, dealing with the review, promotion, and ultimate deployment into production.
The DataOps for Dummies book describes a way for development teams to work together collaboratively around “data products” to achieve rapid results and improve customer satisfaction. This book is intended for everyone looking to adopt the DataOps philosophy to improve the governance and agility of their data products.
Most new clients we work with already have a data ecosystem (or environment) in place, and they want to retrofit DataOps to their current way of working. This is typically achievable, although some small changes may be needed.
However, much like DevOps and CI/CD for software, it’s much easier, and consequently, extremely important, to “start as you mean to go on” and build DataOps in from day 0.
If you had asked us this question two years ago, we would have said the most significant obstacle would be the die-hard data veterans or people with 30 years of experience saying: “well, that’s not how we do things.” However, the reality has been quite the opposite. These data veterans are saying: “Thank goodness, we’ve been waiting for this to come along for years.”
Therefore, in a larger company, the biggest obstacle is how teams are organized and how work is done. Inevitably, teams are organized around the way they work or are forced to work. When a new, better way appears, it can be challenging for larger organizations to adopt this quickly. That said, we are working with some of the largest organizations in the world, and they are 100% committed to this sort of transition because of the benefits they get as a result.
In many ways, the perfect time to implement DataOps is when you are starting small but know you are going to grow: you have time to learn and get everything the way you want it before the data volumes explode.
The first thing a professional software engineer does before writing a single line of functional code in a new project, no matter how small the project is, is to set up the DevOps and CI/CD system. Starting this way is always an excellent investment for the future.
In the early days of DevOps, DevOps professionals didn’t exist. They converted from a variety of disciplines such as software engineers, Sys Admins, and even project managers.
DataOps is now at the same place. The pioneers who are championing DataOps are coming from a variety of places. However, some will have a slightly easier time transitioning into DataOps than others.
The most typical path to becoming a DataOps engineer is from a data engineer with some coding/automation background. However, a DevOps engineer who has a reasonable knowledge of data will also have an easy time of it.
The data scientist is harder to predict. We have worked with a lot of data scientists with very different skillsets, so this transition is less predictable than the others.
As with many questions, for the answer we can look at how highly successful software teams work. Successful collaboration requires three principal elements: The right overall philosophy, technology, and team structure.
One of the biggest things people seem to struggle with is that no amount of technology can mitigate the need for people to talk and work together. The goal of technology in this situation is to support people working collaboratively rather than to create collaboration instantly.
In our experience, if you take the technical barriers away, people are generally pretty good and work quite naturally in a collaborative way. Setting up a collaborative environment, including the way you structure your team and plan your work, plays a fundamental role in how naturally team members work together. For instance, running your teams in an Agile way naturally creates collaboration.
Technically, by far the best approach, and the one we follow with the DataOps.live platform, is the one the software world used to solve the challenge of collaborating within teams: git.
Git is a phenomenal tool to allow potentially massive teams with thousands of members to go off, do their own thing, work in small groups, but still bring work together in a very controlled way. Therefore, together with git and robust Agile methodologies, collaboration is straightforward.
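As a simple illustration (the branch name, file paths, and commit message are hypothetical), the day-to-day flow for a data engineer looks just like a software feature branch:

```bash
# Hypothetical feature-branch flow; branch and file names are illustrative.
git checkout -b feature/customer-dimension    # work in isolation
# ...edit models, tests, and configuration...
git add models/ tests/
git commit -m "Add customer dimension model and tests"
git push -u origin feature/customer-dimension
# then open a merge request so the change is reviewed before it reaches main
```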
There are, however, some challenges that are unique to data. For example, historically in the data world, dev and test environments have been expensive, manual, and time-consuming to create, so large numbers of developers have had to share them, coordinate carefully, and ultimately deal with all the problems of “what state did the last person leave this in?”. A true DataOps approach deals with this.
Enough! There is no simple answer to this question. It is very much a “how long is a piece of string” question. The software world has measured test coverage for more than 20 years and still can’t agree on this.
The way we encourage customers to think about this is as an ongoing process rather than a one-off activity. When you start, spend some time thinking about the most business-impacting failure modes and add tests for these. As a rule of thumb, if you have 3-5 tests for an average model/table, then you are probably in the right ballpark.
Much more important than the number of tests is the usefulness of these tests. It’s trivial to add a set of tests to every column in a table – unique, not null, and so on. In doing so, you can easily create tens or hundreds of tests. The question you need to ask is how much does each one of these tests tell you? One really smart test can be worth one hundred tests created just for the sake of creating tests.
However you get there, you will eventually go live with a set of tests, and they will be imperfect, and issues can slip through. It’s a fact; just accept it. By catching most issues that would have made it into production, you are already well ahead of most people. Business users are relatively forgiving of things like this. What they do not forgive is repetition. If they report an issue and you fix it, no problem. However, if this problem reoccurs and they report it again, you will lose trust and credibility.
Therefore, every time you fix a data issue that you missed, you must do the following:
- Add tests for this issue to your existing set of tests to prevent it from getting to users again.
- Spend 20 minutes thinking about whether this issue could occur again in other places or in a slightly different form.
In this way, you'll improve your tests over time and catch an ever-greater share of issues before they reach users.
We have built automated documentation processes into our DataOps.live platform. Since we have all the logic about how we build, transform, test, and so on in our Git repository and have access to the target database itself, we have all the information needed to build a really good set of automated documentation.
We host a weekly office hours session if you’d like to learn more.
These sessions are 45-minute office hours led by DataOps experts and are open to anyone who wants to understand more about DataOps and Snowflake.
Our experience is that existing DevOps teams have typically been told that the data team is unique and that DevOps team members have no business in the data team. Once they see the organization move towards DataOps and embrace DevOps principles, however, they are usually keen to get involved.
Warning: If you are starting to look at DataOps, always involve your existing DevOps team. If you don’t, they will see this as a shadow DevOps initiative, and the team that should be your biggest supporter may turn into a barrier.
They may well have some corporate requirements you need to adhere to, but these are usually straightforward to adopt. Ownership is generally within the DataOps team because DataOps is unique. But the adoption of company best practices and standards would come from the DevOps team.
We believe that an explosion of ideas and terms occurs during the development of new concepts such as DataOps. There is often a lot of overlap, duplication, and contradiction between these terms (MLOps, AIOps). Over time these concepts and terms coalesce and become better defined and more standardized.
For instance, we consider MLOps as a subset of functionality within DataOps. As the tools become easier to use and the areas become better defined, most organizations won’t need individual specialists to handle each functional subset. One team will be able to handle everything.
Additionally, we don’t think the wider business will care about any of this detail. They just want the data team to deliver quickly, reliably, and with good governance.
One of the most significant (and complex) aspects of testing data pipelines (and DataOps) is that data models constantly shift due to schema changes and evolving user requirements. Therefore, it can become time-consuming to keep test sets aligned when everything changes continually and rapidly.
We solved this challenge by implementing the following:
- Ensure your tests are stored in the same git repo as your configuration and code files so that as you make changes and deploy them, the functional changes and tests are deployed together.
- Make sure your tests are defined in the same place as (or alongside) your functional logic. If you have data modeling defined in one place and tests in a different location, it is virtually impossible to keep them in sync.
  Note: The same applies to grants and permissions. If you define them together with the functional code, it is much easier to manage and more challenging to make mistakes.
- Deploy your functional changes using an automated declarative approach like the Snowflake Object Lifecycle Engine (SOLE) found in our DataOps platform. This removes the need to write endless ALTER TABLE statements.
Have you encountered this use case before? How are others solving it?
There are two ways to solve this:
- Solve it within a single job
We do this pretty regularly – one job can still saturate all the resources on a host. The command typically takes the form:
create_list_of_jobs | parallel --jobs 800% do_some_work.sh
This will create a thread pool of 8 threads per CPU core (32 threads on a 4-core machine). It will then take the first 32 jobs and call do_some_work.sh, passing in the individual parameters each time the shell script is called. As soon as the first call finishes, the system will start on the 33rd call, and so on until all 5000 have completed.
This is a very effective way to perform a lot of work in parallel while still respecting upstream API concurrency limitations. If your upstream only allows you 16 concurrent connections, just fix the jobs limit as --jobs 16, as in the sketch below.
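For example, a minimal sketch putting these pieces together (do_some_work.sh and the list of 5000 work items are illustrative assumptions):

```bash
# Run 5000 work items with at most 16 in flight at any time,
# respecting an upstream API limit of 16 concurrent connections.
# do_some_work.sh is an illustrative script that processes one item.
seq 1 5000 | parallel --jobs 16 ./do_some_work.sh {}
```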
Note: There are many variations on this we can help you with.
- Solve it with many jobs
In DataOps.live, you can create jobs dynamically/programmatically. This involves having some sort of trivial script that essentially produces a large YAML block.
For instance:
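A minimal sketch of such a generator script follows; the job names, stage name, and do_some_work.sh are illustrative assumptions, not platform specifics:

```bash
#!/usr/bin/env bash
# Emit one pipeline job per work item into a YAML file.
# Job names, the stage name, and do_some_work.sh are illustrative.
for i in $(seq 1 5000); do
  cat <<EOF
process_item_${i}:
  stage: "Batch Processing"
  script:
    - ./do_some_work.sh ${i}
EOF
done > generated-jobs.yml
```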
This will create a YAML file that defines many jobs, all in the same stage and in parallel. We can also apply a concurrency limit to this, so only 16 jobs are running at once.
However, this will create 5000+ jobs, which we don’t think is a good idea. There are small overheads associated with starting and stopping a job, and these overheads add up and become significant at this sort of scale.
Pipeline jobs can pass files between themselves in 4 ways:
- As part of the pipeline cache. Every job in a pipeline has access to a directory/cache. This is a shared filesystem between all the jobs in a pipeline, but only for that pipeline. It is not visible to any other pipeline. This is the usual way of passing data between jobs.
- Jobs in a pipeline also have access to the /persistent_cache, an area shared across multiple pipelines for the same branch. For instance, this is useful when you are doing incremental ingestion and want to store your high-water mark somewhere that will be available to the following pipeline run for that branch/environment (see the sketch after this list).
- Using something completely external such as an AWS S3 bucket
- As a job artifact. This is sent up to the SaaS platform at the end of a job, stored as a downloadable object for the job over a period of months. It is also available to other jobs immediately after in the pipeline. Since this is stored in our platform, we don’t recommend using this method for production data for data governance and privacy reasons.
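As a sketch of the /persistent_cache pattern mentioned above (the file name, variable name, and date format are illustrative assumptions):

```bash
# Store an incremental-ingestion high-water mark in the persistent cache,
# which is shared across pipeline runs for the same branch/environment.
HWM_FILE=/persistent_cache/orders_high_water_mark.txt   # illustrative name

# Read the previous high-water mark, defaulting to the epoch on first run.
LAST_LOADED_AT=$(cat "$HWM_FILE" 2>/dev/null || echo "1970-01-01T00:00:00Z")
echo "Ingesting records newer than $LAST_LOADED_AT"

# ...run the incremental extract/load here...

# Persist the new high-water mark for the next pipeline run.
date -u +"%Y-%m-%dT%H:%M:%SZ" > "$HWM_FILE"
```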
The scheduling and pipeline management are handled in the DataOps SaaS platform in the cloud. Essentially, after building the complete pipeline graph, it maintains a queue of pending jobs (jobs that are ready to run because all requirements and other dependencies have been met) for each DataOps runner.
Each DataOps runner (essentially a long-running, stateless container) dials home regularly (typically every second) and asks if there are any pending jobs for it to execute. If the answer is yes, the DataOps runner starts another container of a specific type, passes in the relevant job execution information, and monitors it for completion, streaming the logs back in real time.
Today our standard deployment model is that the long-running DataOps runner and the child containers it spawns run on a Linux machine, typically EC2 in AWS. Therefore, resource allocation isn’t very complex.
The DataOps.live platform includes a full-scope REST API that can do basically everything, from kicking off pipelines and monitoring results to creating and approving merge requests, user management, and even creating whole new projects.
To kick off a pipeline run, the call would look something like:
curl -s --request GET --header "PRIVATE-TOKEN: $AUTH_TOKEN" --header "Content-Type: application/json" "$URL/api/v4/projects/$PROJECT/repository/files/$PIPELINE_FILE?ref=$BRANCH" | jq
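The call above reads a pipeline definition file from the repository. Assuming the v4 API follows GitLab conventions (an assumption on our part; check the platform's API reference for the exact endpoint), actually creating a pipeline run for a branch might look like:

```bash
# Hedged sketch: create a pipeline for $BRANCH via a GitLab-style v4 API.
# Verify the endpoint against the DataOps.live API reference before use.
curl -s --request POST \
  --header "PRIVATE-TOKEN: $AUTH_TOKEN" \
  "$URL/api/v4/projects/$PROJECT/pipeline?ref=$BRANCH" | jq
```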
For AWS (or the GCS equivalent), you could write a very simple trigger:
S3 -> Trivial Lambda -> DataOps REST API
There are two ways to do this:
- You can have a job at the end of a pipeline that gathers all the log information and fires it off to Datadog.
- Our preferred method is to containerize and run the Datadog agent on the same host as the DataOps runner. This is what we do for our runner hosts since it gives us log information from the containers and valuable host information.
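As a minimal sketch of that second option, based on Datadog's standard Docker agent setup (check Datadog's documentation for the current recommended flags and image tag), the agent runs alongside the runner on the same host:

```bash
# Run the Datadog agent as a container on the runner host so it can
# collect logs from all containers plus host-level metrics.
docker run -d --name datadog-agent \
  -e DD_API_KEY="$DD_API_KEY" \
  -e DD_SITE="datadoghq.com" \
  -e DD_LOGS_ENABLED=true \
  -e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -v /var/lib/docker/containers:/var/lib/docker/containers:ro \
  gcr.io/datadoghq/agent:7
```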
Download Our Free Ebook DataOps for Dummies
DataOps describes a novel way for development teams to work together collaboratively around data to achieve rapid results and improve customer satisfaction. This book is intended for everyone looking to adopt the DataOps philosophy to improve the governance and agility of their data products. The principles in this book should create a shared understanding of the goals and methods of DataOps and #TrueDataOps and create a starting point for collaboration.
