Below you'll find answers to the questions we get asked the most about DataOps and DataOps.live. If you have a question that you can't find an answer to, please contact us.
DataOps is to data engineering as DevOps is to software engineering. Software engineering is the process of developing software — writing the lines of programming code. DevOps is the set of processes and technologies around how the complete software application is built, tested, and deployed. DevOps supports software engineering and makes software engineers far more agile and efficient: engineers change the code, and DevOps ensures that an entire version of the software, including their changes, is built and tested automatically, with the correct environments created along the way.
DataOps is precisely the same for data. Data engineering is still required, but DataOps takes on all the heavy manual lifting: building environments, getting test data, running all the automated tests, and, assuming everything functions as expected, handling the review, promotion, and ultimate deployment into production.
The DataOps for Dummies book describes a way for development teams to work together collaboratively around “data products” to achieve rapid results and improve customer satisfaction. This book is intended for everyone looking to adopt the DataOps philosophy to improve the governance and agility of their data products.
Most new clients we work with already have a data ecosystem (or environment) in place, and they want to retrofit DataOps to their current way of working. This is typically achievable, though some small changes may be needed.
However, much like DevOps and CI/CD for software, it’s much easier, and consequently, extremely important, to “start as you mean to go on” and build DataOps in from day 0.
If you had asked us this question two years ago, we would have said the most significant obstacle would be the die-hard data veterans or people with 30 years of experience saying: “well, that’s not how we do things.” However, the reality has been quite the opposite. These data veterans are saying: “Thank goodness, we’ve been waiting for this to come along for years.”
Therefore, in a larger company, the biggest obstacle is how teams are organized and how work is done. Inevitably, teams are organized around the way they work or are forced to work. When a new, better way appears, it can be challenging for larger organizations to adopt this quickly. That said, we are working with some of the largest organizations in the world, and they are 100% committed to this sort of transition because of the benefits they get as a result.
In many ways, the perfect time to implement DataOps is when you are starting small but know you are going to grow: you have time to learn and get everything the way you want it before the data volumes explode.
The first thing a professional software engineer does in a new project, no matter how small, before writing a single line of functional code, is to set up the DevOps and CI/CD system. Starting this way is always an excellent investment for the future.
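As an illustration, a day-0 pipeline can begin with nothing more than a single automated smoke test that runs on every commit. Here is a minimal sketch in Python; the `transform` function and its behaviour are hypothetical placeholders, not part of any specific project:

```python
# Hypothetical day-0 smoke test: the simplest possible CI check.
# `transform` stands in for whatever the project's first piece of
# functional code turns out to be.

def transform(rows):
    """Toy transformation: keep only rows with a positive amount."""
    return [r for r in rows if r["amount"] > 0]

def test_transform_drops_non_positive_amounts():
    rows = [{"amount": 10}, {"amount": 0}, {"amount": -5}]
    assert transform(rows) == [{"amount": 10}]

if __name__ == "__main__":
    test_transform_drops_non_positive_amounts()
    print("smoke test passed")
```

The point is not the test itself but the habit: with even one check wired into the pipeline from day 0, every subsequent test has somewhere to live and runs automatically on every change.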
In the early days of DevOps, DevOps professionals didn’t exist. They converted from a variety of disciplines such as software engineers, Sys Admins, and even project managers.
DataOps is now at the same point. The pioneers championing DataOps are coming from a variety of places, but some will have a slightly easier time transitioning into DataOps than others.
The most typical path to becoming a DataOps engineer is from a data engineering role, combined with some coding/automation background. However, a DevOps engineer, who typically has a reasonable knowledge of data, will also have an easy time of it.
The data scientist is harder to predict. We have worked with many data scientists with very different skillsets, so this transition is less uniform than the others.
As with many questions, for the answer we can look at how highly successful software teams work. Successful collaboration requires three principal elements: The right overall philosophy, technology, and team structure.
One of the biggest things people seem to struggle with is that no amount of technology can remove the need for people to talk and work together. The goal of technology here is to support people working collaboratively, not to create collaboration instantly.
In our experience, if you take the technical barriers away, people are generally pretty good and work quite naturally in a collaborative way. Setting up a collaborative environment, including the way you structure your team and plan your work, plays a fundamental role in how naturally team members work together. For instance, running your teams in an Agile way naturally creates collaboration.
Technically, by far the best approach, and the one we follow with the DataOps.live system, is to follow how the software world solved the challenge of collaborating within teams: use Git. Git is a phenomenal tool that allows potentially massive teams with thousands of members to go off, do their own thing, and work in small groups, while still bringing work together in a very controlled way. Therefore, with Git and robust Agile methodologies together, collaboration is straightforward.
There are, however, some challenges that are unique to data. For example, historically in the data world, dev and test environments have been expensive, manual, and time-consuming to create, so large numbers of developers have had to share them, coordinate carefully, and ultimately deal with all the problems of “what state did the last person leave this in?” A true DataOps approach deals with this.
Enough! There is no simple answer to this question. It is very much a “how long is a piece of string” question. The software world has measured test coverage for more than 20 years and still can’t agree on this.
The way we encourage customers to think about this is as an ongoing process rather than a one-off activity. When you start, spend some time thinking about the most business-impacting failure modes and add tests for these. As a rule of thumb, if you have 3-5 tests for an average model/table, you are probably in the right ballpark.
Much more important than the number of tests is the usefulness of these tests. It’s trivial to add a set of tests to every column in a table – unique, not null, and so on. In doing so, you can easily create tens or hundreds of tests. The question you need to ask is how much does each one of these tests tell you? One really smart test can be worth one hundred tests created just for the sake of creating tests.
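To make the “3-5 useful tests per table” idea concrete, here is a hedged sketch in plain Python. The table and column names (an `orders` table with `order_id`, `customer_id`, and `amount`) are hypothetical examples, not from any real schema, and the plausible-range bounds are invented for illustration:

```python
# Hypothetical orders table represented as a list of rows.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 99.0},
    {"order_id": 2, "customer_id": 11, "amount": 15.5},
    {"order_id": 3, "customer_id": 10, "amount": 42.0},
]

def run_table_tests(rows):
    """Return the names of failed tests (an empty list means all passed)."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id must be unique")
    if any(r["order_id"] is None or r["customer_id"] is None for r in rows):
        failures.append("key columns must not be null")
    if any(r["amount"] <= 0 for r in rows):
        failures.append("amount must be positive")
    # One "smart" business-impacting test: total revenue in a plausible
    # range catches both truncated loads and double loads in one check.
    total = sum(r["amount"] for r in rows)
    if not 1 <= total <= 1_000_000:
        failures.append("total amount outside plausible range")
    return failures

print(run_table_tests(orders))  # → []
```

Note that the last check is worth more than the first three combined: a column-level `not null` test says little about the health of a load, whereas a table-level sanity check on total revenue catches entire classes of failure in one assertion.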
However you get there, you will eventually go live with a set of tests, and they will be imperfect, and issues can slip through. It’s a fact; just accept it. By catching most issues that would have made it into production, you are already well ahead of most people. Business users are relatively forgiving of things like this. What they do not forgive is repetition. If they report an issue and you fix it, no problem. However, if this problem reoccurs and they report it again, you will lose trust and credibility.
Therefore, every time you fix a data issue that you missed, you must do the following:
Add a test for it to your existing set of tests to prevent this issue from reaching users again.
Spend 20 minutes thinking about whether this issue could occur in other places, and whether it could occur in a slightly different form.
In this way, you’ll improve your tests over time and catch an ever-growing share of the outstanding issues.
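The two steps above can be sketched as code. Suppose a hypothetical incident where a file was loaded twice and produced duplicate rows; the fix is captured as a permanent regression test, generalised so it covers any key column rather than just the one that failed (the scenario and names are illustrative assumptions):

```python
# Hedged sketch: turning a fixed data issue into a regression test.

def find_duplicate_keys(rows, key):
    """Return the set of key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r[key]
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes

def test_no_double_loaded_orders():
    # Step one: the exact failure mode that slipped through - a file
    # loaded twice creating duplicate order_ids - is now detected.
    double_loaded = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
    assert find_duplicate_keys(double_loaded, "order_id") == {2}
    # Step two: a clean load passes, so the test can run on every build.
    clean = [{"order_id": 1}, {"order_id": 2}]
    assert find_duplicate_keys(clean, "order_id") == set()
```

Because `find_duplicate_keys` takes the key column as a parameter, the same check can be applied to every other table you load — which is exactly the “could this occur elsewhere?” generalisation the 20-minute review is meant to produce.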
Our experience is that existing DevOps teams have already been told that the data team is unique and that DevOps team members have no business there. That changes once they see the organization move towards DataOps and embrace DevOps principles.
Warning: If you are starting to look at DataOps, always involve your existing DevOps team. If you don’t, they will see this as a shadow DevOps initiative, and the team that should be your biggest supporter may turn into a barrier.
They may well have some corporate requirements you need to adhere to, but these are usually straightforward to adopt. Ownership is generally within the DataOps team because DataOps is unique. But the adoption of company best practices and standards would come from the DevOps team.
We believe that an explosion of ideas and terms occurs during the development of new concepts such as DataOps. There is often a lot of overlap, duplication, and contradiction between these terms (MLOps, AIOps, and so on). Over time, these concepts and terms coalesce and become better defined and more standardized.
For instance, we consider MLOps as a subset of functionality within DataOps. As the tools become easier to use and the areas become better defined, most organizations won’t need individual specialists to handle each functional subset. One team will be able to handle everything.
Additionally, we don’t think the wider business will care about any of this detail. They just want the data team to deliver quickly, reliably, and with good governance.
Download Our Free Ebook DataOps for Dummies
DataOps describes a novel way for development teams to work together collaboratively around data to achieve rapid results and improve customer satisfaction. This book is intended for everyone looking to adopt the DataOps philosophy to improve the governance and agility of their data products. The principles in this book should create a shared understanding of the goals and methods of DataOps and #TrueDataOps and create a starting point for collaboration.