Thanks to everyone who attended the Technical Masterclass on CI/CD and DataOps for Snowflake two weeks ago, and to everyone who has watched the recording since.
This Technical Masterclass was an amazingly well-attended event, which demonstrates how significant the demand is today for bringing proven agile/DevOps/lean orchestration and code management practices from the software world to our world of data and, specifically, to Snowflake. Not least because Snowflake is one of the first data platforms available today with the underlying features needed to enable the philosophy of #TrueDataOps.
We received almost 80 questions, more than twice the number we could answer during the session itself. So we decided to publish this blog so everyone can see the answers to the questions we didn't get to, plus some important ones that we did answer. Several questions came in from multiple people, so we have consolidated them where possible.
Here are all the answers — enjoy!
What is Data Governance in Snowflake?
Data governance refers to many features, including RBAC, secure data sharing, dynamic data masking, etc.
Will the session be recorded?
You should all have received links to the recording now. If not, here it is https://resources.snowflake.com/webinars-thought-leadership/technical-masterclass-on-cicd-and-dataops-for-snowflake.
What are the tools needed for release management for Snowflake?
The DataOps tool natively contains everything needed for release management for Snowflake.
What is Agility in Snowflake?
Agility is not a specific set of features or functions but rather the ability to respond to new changes from stakeholders very rapidly. It is based on the "Agile" methodology from the software development world, and it focuses on many of the agile principles that form part of the core features of DevOps and how they are enabled for data platforms. The product shown in the webinar from DataOps.live enables this type and level of agility on Snowflake.
What is DataOps.live that you mentioned for writing test scenarios?
DataOps.live is a SaaS platform you can find out more about at www.dataops.live. This platform is an orchestration platform that enables orchestrating “code” or “jobs” from other vendors (Talend, Fivetran, Matillion, Stitch, Python, R, etc.) in pipelines. As part of that, you write test scenarios and automated testing code routines that run:
- after Every stage in a pipeline, and
- every time the pipeline is run.
So that data that fails the tests can be immediately detected and resolved before it impacts the end customer using the data.
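As an illustrative sketch (the table, schema, and column names here are hypothetical), such a test is often just a SQL query that returns the rows violating a rule, and the stage fails if any rows come back:

```sql
-- Not-null test: fails the stage if any order is missing a customer.
SELECT order_id
FROM staging.orders
WHERE customer_id IS NULL;

-- Uniqueness test: fails the stage if any order_id appears twice.
SELECT order_id, COUNT(*) AS occurrences
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```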
Is DataOps a SAAS?
DataOps is a SaaS product. But it also has agents to enable you to run processes on data behind secure firewalls. DataOps is heavily focused on data security, data governance, and data lineage and has lots of enterprise features to support this.
Is that just dbt, or do you have extra features on the YAMLs and documentation portal?
We are a contributor to the dbt open-source community, but we have extended the core dbt functionality in our DataOps platform with significant functionality around enterprise security, auditability, and governance.
DataOps UI — is this a part of the Snowflake solution or a third party?
DataOps is a partner solution provided by DataOps.live (www.dataops.live) — it is available via Snowflake Partner Connect.
What if two engineers change the same objects/fields? Can it get corrupt data, or which change will win?
This is a fully Git-compliant solution which means that two engineers can absolutely work in parallel on the same code at the same time. And the system is designed to resolve all issues when each of the developers merges their code from feature branches to main branches.
In the webinar, we showed how two developers could make changes even in the same file without causing conflicts/issues.
Of course, there will be rare occasions when two developers will make changes to exactly the same line, and in those situations, it’s physically impossible for the system to know what to do automatically, so it simply asks which change should be kept.
Can you design those referential integrity tests to run against new/changed rows only?
Absolutely. Tests are completely configurable to be deep against all of the data all of the time (though this would be resource-demanding) or shallow against only the data that changed, assuming there are clear criteria underlying this.
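One hedged sketch of such a shallow referential-integrity test follows; the `_loaded_at` audit column and the watermark table are assumptions for illustration, not a built-in feature:

```sql
-- Check only rows loaded since the last test run for orphaned foreign keys.
SELECT o.order_id
FROM staging.orders o
LEFT JOIN staging.customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
  AND o._loaded_at > (
        SELECT last_run_at
        FROM meta.test_watermarks
        WHERE test_name = 'orders_customer_fk'
      );
```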
If possible, can you talk a little about the configuration? Do we have configurable templates available within the platform?
Yes – the platform comes with an entire test project which contains examples of all of the key components. In addition, Demonstration Projects are available, which show examples of additional useful situations from which configuration/code can be copied.
Have you written all test scenarios in the YAML file? Is it similar to CFT on AWS?
We have either reused existing test scenarios from a library or written the test scenarios in the YAML file. This is similar to the AWS CloudFormation approach of using "stacks."
Can you allow data that failed tests through to production but with flags/tags which indicate there is a DQ query outstanding?
Yes. That is a 100% valid scenario – especially where the rest of the data has passed the critical test that ensures the data meets the base threshold for usefulness. In these cases, records can be let through with either incorrect data flagged or only the invalid data stripped out and the relevant records flagged for either awareness or later action.
What does this solution provide, compared to performing data pipelines in a traditional Jenkins?
Jenkins is the original CI/CD software development framework that supported the rise of DevOps and Agile, and it remains heavily focused on the software development world. While similar, the data world differs in many ways, such as:
- Large amounts of actual data (state) to maintain.
- The need to run multiple pipelines at multiple frequencies (real-time, hourly, daily, weekly).
- The nuanced changes that need to be made depending on whether you are running pipelines against dev, test, or prod.
- The fact that data platform engines have never been managed like “dumb” engines in the same way the software development world managed web servers, video encoding servers, cloud infrastructure, etc.
- The need to orchestrate EVERY object within the data platform (data carrying and non-data carrying objects e.g., roles, permissions, tasks, etc.) and outside the data platform (e.g., 3rd party code and vendor tools)
Jenkins is an excellent but generic DevOps platform that is not designed with these requirements in mind and therefore has limited, if any, of these capabilities.
Templates YAML and models based on dbt? It seems that the testing and modeling of DataOps.live is built on top of dbt using YAML. If so, can it be integrated with existing dbt projects?
Yes. All our YAML templates and models are based on and fully compliant with dbt. And yes, we have many cases where customers have developed a lot of core functionality in dbt and lifted and shifted this code directly into DataOps. 100% of dbt functionality is supported in DataOps, and then this is extended with enterprise functionality like environment management, secrets management, full object lifecycle management, GRANT and permissions management, source model automation, etc. In addition, we have an extended library of tests and core macros.
Hi, is it a managed product available in AWS, or do you have your own hosting?
It’s a managed product running on AWS and available to buy on AWS Marketplace with a credit card.
Who would be the person responsible for running DevOps?
While there are more and more DataOps Engineer roles appearing all the time, these typically grow out of Data Engineering roles where the work involves the ongoing management of production pipelines.
From an ETL perspective, what are the advantages of the DataOps platform compared to Python Data Pipeline and traditional Informatica ETL?
A DataOps pipeline is about the complete end-to-end rather than just the ETL/ELT piece. Orchestrating existing ETL/ELT in a DataOps pipeline allows you to chain together many steps, including Environment Management in Snowflake, the ETL/ELT itself, the post-ETL/ELT testing, the transformations, subsequent testing, final publishing, and many other steps. Putting these into a single pipeline allows a) dependency management i.e., run D exactly and only when A, B, and C have finished, and b) success criteria i.e., only run D if A, B, and C completed successfully.
How do you release DDL changes that are not covered by dbt, and is clone/swap used for a release to production?
We have support for lifecycle management for ALL objects in Snowflake, not just tables and views. We also have full support for GRANT management on all dbt-created objects.
We have a feature in Private Preview called “Candidate/Deploy Model,” which is exactly as you describe – clone the current Environment DB, make all the changes, and test them all, and then only if all the tests pass, swap this back into place as a single, atomic operation. This isn’t needed in many cases, but in situations where end users must only be allowed to see new data as a single atomic operation, all at the same time and never with partial changes (e.g., in a financial reconciliation use case), then this approach is very important and powerful.
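In plain Snowflake SQL, the underlying pattern looks roughly like this sketch (database names are illustrative):

```sql
-- 1. Zero-copy clone of the current production database.
CREATE DATABASE analytics_candidate CLONE analytics;

-- 2. Apply all changes and run all tests against analytics_candidate.
--    (omitted here)

-- 3. Only if every test passes, swap the candidate into place atomically.
ALTER DATABASE analytics_candidate SWAP WITH analytics;

-- 4. The candidate now holds the previous state; drop it when satisfied.
DROP DATABASE analytics_candidate;
```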
Can you provide any information on pricing?
The DataOps tool is priced at $1000 per engineer per month or $10,000 per annum. We have customers running amazingly well with just one or two users, and this is orchestrating their entire data platform.
Where do Snowflake tags apply in DataOps?
As you know, Snowflake queries can be tagged (https://docs.snowflake.com/en/sql-reference/parameters.html#query-tag). A (string) tag is added to each and every query. This can be accessed via Query History, allowing you to do things like track credit usage for certain departments or subject areas (HR, production, sales, reporting, sales dashboards, etc.).
DataOps also uses these tags to add the pipeline ID to each and every query, enabling you to track Snowflake usage on a per-pipeline basis. This helps with both profiling and optimizing usage within Snowflake.
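As a small illustration (the tag value and the usage query are simplified):

```sql
-- Tag every query in this session with a pipeline identifier.
ALTER SESSION SET QUERY_TAG = 'dataops-pipeline-1234';

-- Later, summarize usage per pipeline from the account usage views.
SELECT query_tag,
       COUNT(*)                AS queries,
       SUM(total_elapsed_time) AS total_elapsed_ms
FROM snowflake.account_usage.query_history
WHERE query_tag = 'dataops-pipeline-1234'
GROUP BY query_tag;
```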
What are the language and editor for the code artifacts that get stored in the Git repo?
Every language is supported, but the primary languages we see used today with Snowflake are SQL, Python, and sometimes R.
The DataOps platform has its own inbuilt editor/IDE, but since the Git repository is 100% standard Git, any IDE can be used, e.g., Visual Studio Code or Eclipse.
Does the tool contain tests out of the box, or do you need to define all of them? (and can you add new ones if you have a few out of the box)
We have a library of standard tests that come out of the box, and these can then be extended or used as the basis for new tests.
All tests are based on config files and SQL.
If you clone to a feature branch database, do you keep the RBAC (role-based access control) similar to production?
No – we use a different approach here where the clone function is actually a “clone, mask, and grant” function.
This means we first clone the data with no one having any access to the cloned data, then we mask any data that developers should not be able to interact with (credit card data, PII, etc.), and then we apply the grants necessary for the developer to develop their new features.
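Sketched in Snowflake SQL (the database, role, and column names, and the masking rule itself, are illustrative assumptions):

```sql
-- 1. Clone production into a feature-branch database.
CREATE DATABASE fb_101_db CLONE prod_db;

-- 2. Mask sensitive columns so developers never see real values.
CREATE MASKING POLICY fb_101_db.admin.mask_pii AS (val STRING)
  RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'PROD_ADMIN' THEN val ELSE '****' END;

ALTER TABLE fb_101_db.crm.customers
  MODIFY COLUMN credit_card SET MASKING POLICY fb_101_db.admin.mask_pii;

-- 3. Grant the developer role access to the masked clone only.
GRANT USAGE  ON DATABASE fb_101_db TO ROLE developer_role;
GRANT USAGE  ON ALL SCHEMAS IN DATABASE fb_101_db TO ROLE developer_role;
GRANT SELECT ON ALL TABLES  IN DATABASE fb_101_db TO ROLE developer_role;
```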
You can choose whether to propagate the GRANTs from the source DB to the Feature Branch DB. However, if you apply all of your GRANTs using a DataOps pipeline (highly recommended), then it doesn’t matter because the pipeline will reapply these in the feature branch pipeline anyway.
Are you removing first the grants given in the cloned database for the “feature branch database” and then applying similar RBAC roles for the cloned database with similar roles so you won’t use the same roles from Prod?
See above. In addition, the GRANT Management system is environment aware so that you create namespace sets of roles i.e., changes in dev or a FB don’t impact QA or PROD.
We will be publishing a blog series about this topic soon!
So, Matillion would not be used for orchestration and building jobs?
If you use Matillion, that’s excellent, and we would fully support all the Matillion jobs produced, but you would orchestrate this from DataOps.live. This enables you to build pipelines with both Matillion jobs and non-Matillion jobs/functions/code (e.g., either 3rd party tools or hand-generated code).
A DataOps pipeline is about the complete end-to-end rather than just the ETL/ELT piece. Orchestrating existing ETL/ELT in a DataOps pipeline allows you to chain together many steps, including Environment Management in Snowflake, the ETL/ELT itself, the post-ETL/ELT testing, the transformations, subsequent testing, final publishing, and many other steps. Putting these into a single pipeline allows:
- Dependency management i.e. run D exactly and only when A, B, and C have finished and,
- Success criteria i.e. only run D if A, B, and C are completed successfully.
How can DataOps help me test my Matillion jobs?
Regardless of the type of Matillion job (ETL, ELT, Snowflake transform, etc.), after every Matillion job, automated test routines can be run using SQL, Python, or anything else required to ensure the job completed successfully without errors and that the data meets all the defined tests for quality, validity, usability, etc.
Is there migration support?
Yes. Migration support is available for every customer either as advice and guidance, migration and architectural services, or as a fully managed migration service with any of our approved DataOps.live silver, gold, and platinum delivery partners (e.g., Datalytyx www.datalytyx.com).
Is there migration support - if we are using airflow Git and Matillion, and Snowflake?
Yes. We can help to migrate all these elements to operate in and be orchestrated by DataOps.live.
And over time, you will find that many, if not all, of the Airflow DAGs (directed acyclic graphs) can be run directly with DataOps.live, removing the need for the additional Airflow layer.
Can your tool trigger data loads for non-SAAS tools out of the box? Like Informatica BDM?
Absolutely. We have native support for cloud-based and on-premises solutions, and Informatica is amongst these.
On-premises non-SaaS workloads like that of Informatica BDM are invoked using our on-premises runners which are based on docker containers.
How would you compare this product with Airflow and dbt?
dbt is excellent, and we have been contributors to the community for some time.
Airflow is a great tool for running DAGs and deciding what should run and when; that includes scripts, SQL, and Python, and can include dbt.
But neither of these is an enterprise, production-grade automation and orchestration platform with all the code management, version control, security management, secrets management, auditing, and governance capability required in an enterprise data pipeline system. And this is the level of governance and control required by nearly every enterprise, and now by many start-ups and small to medium businesses, where data is at the core of their service delivery.
What is the cost of the DataOps tool?
See the pricing answer above.
Do you need to use the same snowflake user in prod, QA, and dev in order to have the zero-copy clone writeable by the different pipelines?
No, as long as the role inheritance is set up correctly (we have a standard script to do this all for you), you can use different users/roles for each environment.
Do you manage deltas to do the ETL? Do you have stream processing for semi-real-time loads? Or do you rely on Kafka Connect?
Delta management is handled differently by different ETL/ELT tools. Most of them manage their own delta state, but if not, they can be managed in DataOps.
The DataOps platform has its own ability to ingest from Kafka but works very well with Kafka Connect.
Behind CI/CD, there’s the idea of having a “continuous” toolchain. How’s your tool able to integrate with other CI/CD tools for the rest of the dev cycle?
This is an excellent question, and one that shows a clear level of understanding and maturity in CI/CD and DataOps.
DataOps.live contains a full CI/CD "continuous toolchain," as you describe it, to orchestrate everything and allow the seamless promotion of "code" from feature branches to dev, then test, and finally prod, depending on your custom approach to environment management and governance.
Additionally, DataOps.live has been designed so it can run pipelines in other CI/CD tools or be run by higher-level enterprise CI/CD tools like Jenkins or Azure DevOps. This enables DataOps to operate within a wider enterprise software development lifecycle while remaining responsible for CI/CD around all things data and data related.
What happens with other concurrent queries or DML sentences from other sessions when you roll back a table? Is there any locking mechanism in place?
This is more of a Snowflake question, and you can read about it here: https://docs.snowflake.com/en/user-guide/data-time-travel.html.
All DML for a specific database (or at the very least a specific schema within a database) should be handled through DataOps pipelines, with no automated systems making wholesale changes that could conflict with hand-made changes. So there should never be a DML clash: you would make sure your DataOps pipelines were paused before you rolled back the table.
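For reference, a manual Time Travel rollback of a single table might look like this sketch (the names and the one-hour offset are assumptions):

```sql
-- Clone the table as it existed one hour ago (offset is in seconds).
CREATE TABLE analytics.public.orders_restored
  CLONE analytics.public.orders AT (OFFSET => -3600);

-- Atomically swap the restored copy into place.
ALTER TABLE analytics.public.orders_restored
  SWAP WITH analytics.public.orders;

-- The _restored table now holds the bad state; drop it when satisfied.
DROP TABLE analytics.public.orders_restored;
```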
Not a question: You must be really, really proud of what you’ve done here. Thanks a lot!
Even though this is not a question, we left this in as it is such a lovely thing to say during a deeply technical webinar.
Thank you to whoever sent this. And yes, we are extremely proud of what we have created and delivered.
And this comes from the fact that the technology platform we built, which is delivering value to so many Snowflake customers, is founded on one primary belief and one well-thought-through philosophy. The belief is simply, "Why can't we develop, test, and deploy data platforms the same way we do software?" And the philosophy is the technology-agnostic #TrueDataOps philosophy that we and other data industry pioneers have developed and continue to evolve as the DataOps movement evolves.
So, the main orchestration is done in the DataOps platform?
Yes! That's correct.
Can we revert the data in an automated way in this case?
While this is technically possible, in practice you would only use this function when something has gone fairly significantly wrong, and virtually all system and database administrators tell us that when dealing with this sort of situation, they don't want an automated system; they want to review and check each step before they execute it.
Can we have a lab video for DevOps testing?
These are currently being produced, and we’ll send the links out shortly.
How does DataOps work for Trunk-Based Development?
You can actually use any branch strategy and development model supported by Git; we just demonstrated one specific approach!
What is Git revert?
The Git revert command can be considered an ‘undo’ type command; however, it is not a traditional undo operation. You can read more here: https://www.atlassian.com/git/tutorials/undoing-changes/git-revert.
What is a DataOps engineer in Snowflake?
DataOps is a new discipline, and the role and nature of a DataOps Engineer are still emerging. However, it's fair to think of it this way.
In the software world, a DevOps Engineer builds and manages the DevOps system, the configurations, pipelines, etc., and a Software Engineer writes the stuff that actually goes on inside the pipeline.
In the data world, we can consider a DataOps Engineer who builds and manages the DataOps system, the configurations, pipelines, etc., and a Data Engineer (or another type of user who understands basic SQL and configuration files) writes the stuff that actually goes on inside the pipeline.
How was the graph generated at the start of the session?
The graph was generated automatically from a pipeline configuration file (or files) stored in the repository.
Is Selenium supported by DevOps?
Selenium can be fully supported by DataOps.live for automated testing in the same way any 3rd party tool can be.
Can we also run snowflake TASKS DAG from DataOps pipelines?
Yes — this is fully supported.
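For context, a minimal Snowflake task DAG looks like this sketch (the warehouse, schedule, and statements are illustrative):

```sql
-- Root task runs on a schedule.
CREATE TASK load_root
  WAREHOUSE = transform_wh
  SCHEDULE  = '60 MINUTE'
AS
  INSERT INTO raw.load_log (loaded_at) VALUES (CURRENT_TIMESTAMP());

-- Child task runs only after the root completes.
CREATE TASK transform_orders
  WAREHOUSE = transform_wh
  AFTER load_root
AS
  INSERT INTO analytics.transform_log (run_at) VALUES (CURRENT_TIMESTAMP());

-- Resume children before the root so the whole DAG is live.
ALTER TASK transform_orders RESUME;
ALTER TASK load_root RESUME;
```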
Can we have hybrid operations using DataOps? I mean, in an existing database, some objects are managed manually, and the rest are managed by DataOps. Is it possible?
Yes, that is definitely possible, especially when the manual vs DataOps managed pieces are, say, in separate schemas. We have some good design patterns to help you here.
But the true principle of DataOps is that you move the "source of truth". When you run things manually or manage some objects manually, the source of truth will always be the database or the data platform itself, and that is not the way the software world works.
No one in the software world ever configures a server, a service, a web service, or a video encoding server manually. It breaks every element of governance, manageability, recoverability, auditability, etc., in the software world.
It is possible to configure things manually in the software world, but no one ever does. If we stay true to the belief, "Why can't we build, test, and deploy data platforms the same way we do software?", then most things people have historically thought must be done manually can now be done through a good DataOps platform.
Do merge operations only update/create the new/changed objects, or do they refresh all objects, even those not changed in a feature branch?
Merge will only update the new or changed objects in the pipeline as the rest remains the same.
It looks like DataOps would require a lot of rights to work — how to secure it?
The level of rights needed by DataOps.live is generally fairly high: if you want a platform that will create schemas, tables, shares, stages, file formats, masking policies, etc., then, by definition, it needs the rights to create all of these. However, it doesn't need any rights on objects it hasn't created, so if you have things in your Snowflake tenant not managed by DataOps, the DataOps service accounts need no access to them.
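A least-privilege setup might look roughly like this sketch (the role, warehouse, and database names are assumptions):

```sql
-- A dedicated service role for the DataOps pipelines.
CREATE ROLE dataops_svc;

-- Rights scoped to the objects DataOps manages, and nothing else.
GRANT USAGE         ON WAREHOUSE dataops_wh TO ROLE dataops_svc;
GRANT USAGE         ON DATABASE  analytics  TO ROLE dataops_svc;
GRANT CREATE SCHEMA ON DATABASE  analytics  TO ROLE dataops_svc;

-- No grants are made on databases outside DataOps management.
```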
Is SSO with Azure AD supported?
The standard DataOps enterprise cloud instance will not support this. But this is fully supported by our DataOps.live Enterprise private instance.
How do you do schema management? Is the process declarative like a visual studio database project or imperative like the flyway database migration tool?
This is another excellent question, and a great one to finish on, as it is a good segue into the 9-blog series that is about to commence, where we discuss:
- The challenges of repeatable and idempotent schema management
- Imperative versus Declarative for Data
- What the current “latest and greatest” is, and where are we going in the near future
As you probably know, based on your question, declarative is the holy grail and where we want to be, but it's EXTREMELY hard to do with data, so imperative approaches like Flyway dominate the solutions available today.
We are not satisfied with that answer! Just because it’s hard does not make the alternative the correct solution.
We are 80% of the way to a full declarative engine for data. We call the 80% point a “check-based imperative” approach.
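To make "check-based imperative" concrete, here is a hedged sketch of the idea (the names are hypothetical): instead of replaying a fixed migration script, the engine first checks the live state and then applies only the change needed, making it idempotent:

```sql
-- 1. Check: does the column already exist?
SELECT COUNT(*) AS col_exists
FROM information_schema.columns
WHERE table_schema = 'PUBLIC'
  AND table_name   = 'ORDERS'
  AND column_name  = 'DISCOUNT';

-- 2. Apply the DDL only if the check returned 0.
ALTER TABLE public.orders ADD COLUMN discount NUMBER(10, 2);
```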
In 2021 we expect to release our fully declarative engine for data.