
Q&A from the Masterclass on CI/CD and DataOps for Snowflake


Thanks to everyone who attended the Technical Masterclass on CI/CD and DataOps for Snowflake two weeks ago, and to everyone who has watched the recording since. 

This was an amazingly well-attended event and demonstrates how significant the demand is today for bringing proven agile/DevOps/lean orchestration and code management practices from the software world to our world of data, and specifically to Snowflake. Not least because Snowflake is probably the only data platform available today that has the underlying features needed to enable the philosophy of #TrueDataOps. 

We received almost 80 questions, more than twice the number we could answer during the session. So we decided to release this blog so everyone can see the answers to the questions we didn’t get to, plus some important ones that we did answer. A number were duplicates from multiple people, so we have consolidated them as much as we can. 

Anyway, here are all the answers below – enjoy! 

And if you want to know more – don’t forget you can download our DataOps.live Deep Dive from the Eckerson Group and DataOps for Dummies.

Q&A

What is Data Governance in Snowflake? 

This refers to many features, including RBAC, secure data sharing, dynamic data masking, etc. 

Will the session be recorded? 

You should all have received links to the recording by now. If not, here it is: https://resources.snowflake.com/webinars-thought-leadership/technical-masterclass-on-cicd-and-dataops-for-snowflake 

What are the tools needed for release management for Snowflake? 

The DataOps tool natively contains everything needed for release management for Snowflake. 

What is Agility in Snowflake? 

Agility is not a specific set of features or functions but rather the ability to respond to new requirements from stakeholders very rapidly. It is based on the “Agile” methodology from the software development world, and it draws on many of the agile principles that form part of the core features of DevOps, applied to data platforms. The product shown in the webinar from DataOps.live enables this type and level of agility on Snowflake. 

What is DataOps.live that you mentioned for writing test scenarios? 

DataOps.live is a SaaS platform which you can find out more about at www.dataops.live. It is an orchestration platform that enables you to orchestrate “code” or “jobs” from other vendors (Talend, Fivetran, Matillion, Stitch, Python, R, etc.) in pipelines. As part of that, you write test scenarios and automated testing code routines that run 

  • after every stage in a pipeline, and 
  • every time the pipeline is run 

so that data that fails the tests can be immediately detected and resolved before it impacts the end customer using the data. 
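As a loose illustration of the idea (the function, table, and test names here are invented for the example; this is not the DataOps.live testing API), a per-stage test harness can be as simple as:

```python
# Sketch: run a set of data quality checks against the output of a pipeline
# stage, collecting the names of any tests that fail. Illustrative only.

def run_tests(rows, tests):
    """rows: stage output as dicts; tests: name -> per-row predicate."""
    return [name for name, check in tests.items()
            if not all(check(row) for row in rows)]

# Hypothetical output of a "load customers" stage
stage_output = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},  # this row should trip a test
]

tests = {
    "id_not_null": lambda r: r["id"] is not None,
    "email_not_null": lambda r: r["email"] is not None,
}

failures = run_tests(stage_output, tests)  # -> ["email_not_null"]
```

Because checks like this run after every stage and on every pipeline run, a failure is caught at the stage that introduced it rather than downstream.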

DataOps is a SAAS? 

DataOps is a SaaS product. But it also has agents to enable you to run processes on data behind secure firewalls. DataOps is heavily focused on data security, data governance, data lineage, and has lots of enterprise features to support this. 


Is that just plain dbt, or do you have extra features on the YAMLs and documentation portal? 

We are a contributor to the dbt open-source community, but we have extended the core dbt functionality in our DataOps platform with significant functionality around enterprise security, auditability, and governance. 

DataOps UI – is this part of the Snowflake solution or third party? 

DataOps is a partner solution provided by DataOps.live (www.dataops.live) – it is available via Snowflake Partner Connect. 

What if two engineers change the same objects/fields? Can this corrupt data, or which change will win? 

This is a fully Git-compliant solution, which means that two engineers can absolutely work in parallel on the same code at the same time. The system is designed to resolve all issues when each developer merges their code from feature branches to main branches. 

In the webinar we showed how two developers could make changes even in the same file without causing conflicts/issues. 

Of course, there will be rare occasions when two developers will make changes to exactly the same line, and in those situations it’s physically impossible for the system to know what to do automatically, so it simply asks which change should be kept. 

Can you design those referential integrity tests to run against new/changed rows only? 

Absolutely.  Tests are completely configurable: they can run deep against all of the data all of the time (though this is resource-demanding), or shallow against only the changed data, assuming there are clear criteria for identifying it. 
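For example, a shallow referential-integrity check might be restricted to recently loaded rows via a watermark column. This is a hand-rolled sketch; the table, column, and watermark names are assumptions for the example:

```python
# Build a referential-integrity test that only inspects rows loaded since the
# last run, assuming the child table carries a LOADED_AT watermark column.

def incremental_ri_check(child, parent, fk, pk, watermark):
    """Return SQL counting orphaned child rows among recently loaded data."""
    return (
        f"SELECT COUNT(*) AS orphans FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
        f"WHERE p.{pk} IS NULL AND c.LOADED_AT > '{watermark}'"
    )

sql = incremental_ri_check("ORDERS", "CUSTOMERS", "CUSTOMER_ID", "ID",
                           "2021-01-01")
# A result of 0 orphans means every new ORDERS row references a CUSTOMERS row.
```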

 

If possible, can you talk a little about configuration, do we have configurable templates available within the platform? 

Yes – the platform comes with an entire test project which contains examples of all of the key components. In addition, Demonstration Projects are available which show examples of additional useful situations from which configuration/code can be copied. 

Have you written all test scenarios in the YAML file? Is it similar to CFT on AWS? 

We have either reused existing test scenarios from a library or written the test scenarios in the YAML file. This is very similar to the AWS CloudFormation approach of using “stacks”. 
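To make that concrete, here is a rough sketch of how a declared test (shown as the Python equivalent of a YAML entry; the structure is illustrative, not the exact DataOps or dbt schema) can expand into a runnable SQL check:

```python
# A test declared in config (the dict below mirrors what a YAML block might
# express) is expanded into SQL at pipeline run time. Structure is illustrative.

test_config = {
    "model": "CUSTOMERS",
    "columns": [{"name": "ID", "tests": ["not_null"]}],
}

def render_not_null(cfg):
    """Expand a not_null declaration into a failing-row count query."""
    column = cfg["columns"][0]["name"]
    return f"SELECT COUNT(*) FROM {cfg['model']} WHERE {column} IS NULL"

sql = render_not_null(test_config)  # a result of 0 means the test passes
```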

Can you allow data that failed tests, through to production but with flags/tags which indicate there is a DQ query outstanding? 

Yes.  That is a 100% valid scenario – especially where the rest of the data has passed the critical tests that ensure it meets the base threshold for usefulness. In these cases, records can be let through with the incorrect data flagged, or with only the invalid data stripped out and the relevant records flagged for either awareness or later action. 
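A minimal sketch of that flag-rather-than-block pattern (the field names and flag value are made up for the example):

```python
# Let all records through to production, but tag any that fail a data quality
# check so they can be reviewed or corrected later. Names are illustrative.

def flag_records(rows, is_valid):
    """Return every row, adding a dq_flag on those that fail the check."""
    return [dict(row, dq_flag=None if is_valid(row) else "DQ_CHECK_FAILED")
            for row in rows]

rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": -5}]
flagged = flag_records(rows, lambda r: r["amount"] >= 0)
# Both rows reach production; only order 2 carries a DQ flag.
```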

What does this solution provide, compared to performing data pipelines in a traditional Jenkins? 

Jenkins is the original CI/CD software development framework that supported the rise of DevOps and Agile, and it remains heavily focused on the software development world. While similar, there are many ways in which the data world is different, such as: 

  • large amounts of actual data (state) to maintain, 
  • the need to run multiple pipelines at multiple frequencies (real-time, hourly, daily, weekly), 
  • the nuanced changes that need to be made depending on whether you are running pipelines against dev, test, or prod, 
  • the fact that data platform engines have never been managed like “dumb” engines in the way the software development world managed web servers, video encoding servers, cloud infrastructure, etc., 
  • the need to orchestrate EVERY object within the data platform (data-carrying and non-data-carrying objects, e.g., roles, permissions, tasks) and outside the data platform (e.g., 3rd-party code and vendor tools), 
  • etc. 

Jenkins is an excellent but generic DevOps platform that is not designed with these requirements in mind, and therefore has limited if any of these capabilities. 

 

Are the YAML templates and models based on dbt? It seems that the testing and modelling of DataOps.live is built on top of dbt using YAML. If so, can it be integrated with existing dbt projects? 

Yes.  All our YAML templates and models are based on, and fully compliant with, dbt. And yes, we have many cases where customers have developed a lot of core functionality in dbt and then lift and shift this code directly into DataOps. 100% of dbt functionality is supported in DataOps, and it is then extended with enterprise functionality like environment management, secrets management, full object lifecycle management, GRANT and permissions management, source model automation, etc. In addition, we have an extended library of tests and core macros. 

Hi, is it a managed product available in AWS, or do you have your own hosting? 

It’s a managed product running on AWS and available to buy on AWS Marketplace with a credit card. 

Who would be the responsible person to run DataOps? 

While there are more and more DataOps engineer roles all the time, these are typically emerging from Data Engineering roles where the role involves the ongoing management of production pipelines. 

From ETL perspective what are the advantages of DataOps platform compared to Python Data Pipeline and traditional Informatica ETL? 

A DataOps pipeline is about the complete end to end, rather than just the ETL/ELT piece. Orchestrating existing ETL/ELT in a DataOps pipeline allows you to chain together many steps including Environment Management in Snowflake, the ETL/ELT itself, the post ETL/ELT testing, the transformations, subsequent testing, final publishing, and many other steps. Putting these into a single pipeline allows a) dependency management i.e., run D exactly and only when A, B and C have finished and b) success criteria i.e., only run D if A, B and C completed successfully. 
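Those two guarantees, dependency management and success criteria, can be sketched in a few lines. This is a toy scheduler for illustration, not how the platform is implemented:

```python
# Toy pipeline runner: a job runs only once all of its prerequisites have run
# AND succeeded; otherwise it is skipped. Purely illustrative.

def run_pipeline(jobs, deps):
    """jobs: name -> callable returning True/False (success/failure).
    deps: name -> list of prerequisite job names.
    Jobs are assumed to be listed in dependency order."""
    status = {}
    for name, job in jobs.items():
        if all(status.get(d) for d in deps.get(name, [])):
            status[name] = job()   # prerequisites succeeded: run it
        else:
            status[name] = None    # skipped: a prerequisite failed
    return status

status = run_pipeline(
    {"A": lambda: True, "B": lambda: True, "C": lambda: False,
     "D": lambda: True},
    {"D": ["A", "B", "C"]},
)
# D is skipped (None) because C failed: the "only run D if A, B and C
# completed successfully" rule described above.
```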

How do you release DDL changes that are not covered by dbt and is clone/swap used for a release to production? 

Yes, we have support for lifecycle management for ALL objects in Snowflake, not just tables and views. We also have full support for GRANT management on all dbt-created objects. 

We have a feature in Private Preview called “Candidate/Deploy Model” which is exactly as you describe – clone the current environment DB, make all of the changes, test them all, and then, only if all the tests pass, swap this back into place as a single, atomic operation. This isn’t needed in many cases, but in situations where end users must only be allowed to see new data as a single atomic operation, all at the same time and never with partial changes (e.g., in a financial reconciliation use case), this approach is very important and powerful. 

Can you provide any information on pricing? 

The DataOps tool is priced at $1000 per engineer per month, or $10,000 per annum.  We have customers running amazingly well with just one or two users and this is orchestrating their entire data platform. 

Where do Snowflake tags apply in DataOps? 

As you know, Snowflake queries can be tagged (https://docs.snowflake.com/en/sql-reference/parameters.html#query-tag). A (string) tag is added to each and every query. This can be accessed via Query History to allow you to do things like tracking credit usage for certain departments or subject areas (HR, production, sales, reporting, sales dashboards, etc.). 

DataOps uses these tags to also add the pipeline ID to each and every query, enabling you to track Snowflake usage on a per-pipeline basis. This helps with both profiling and optimising usage within Snowflake. 
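For instance, the session-level tag is just a string set before a pipeline’s queries run, something like the statement built below (the tag format shown is invented for the example, not the one DataOps.live emits):

```python
# Build the ALTER SESSION statement that tags every subsequent query in the
# session; the tag then appears in Snowflake's QUERY_HISTORY for each query.

def tag_session_sql(pipeline_id, department):
    tag = f"pipeline:{pipeline_id};dept:{department}"
    return f"ALTER SESSION SET QUERY_TAG = '{tag}'"

sql = tag_session_sql(12345, "finance")
# -> "ALTER SESSION SET QUERY_TAG = 'pipeline:12345;dept:finance'"
```

Tracking credit usage per pipeline then reduces to grouping Query History by this tag.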

What is the language and editor for the code artifacts that get stored in the Git repo? 

Every language is supported, but the primary languages we see used today across Snowflake are SQL, Python, and sometimes R. 

The DataOps platform has its own built-in editor/IDE, but since the Git repository is 100% standard Git, any IDE can be used, e.g., Visual Studio Code or Eclipse. 

 

Does the tool contain tests out-of-the-box, or do you need to define all of them? (and can you add new ones, if you have a few out of the box) 

We have a library of standard tests that come out of the box, and these can then be extended or used as the basis for new tests. 

All tests are based on config files and SQL. 

So, if you clone your fb database do you keep the RBAC as close to the prod? 

 

Updated question: So, if you clone to a feature branch database, do you keep the RBAC (role-based access control) similar to the production? 

No – we use a different approach here where the clone function is actually a “clone, mask and grant” function.   

This means we first clone the data with no one having any access to it, then we mask any data that developers should not be able to interact with, like credit card data, PII data, etc., and then we apply the grants necessary for the developer to be able to develop their new features. 

You can choose whether to propagate the GRANTs from the source DB to the Feature Branch DB. However, if you apply all of your GRANTs using a DataOps pipeline (highly recommended) then it doesn’t matter because the pipeline will reapply these in the feature branch pipeline anyway. 
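The “clone, mask and grant” sequence boils down to a few Snowflake statements, roughly in this order. The database, column, policy, and role names below are made up for illustration; only the ordering matters:

```python
# Generate the three-step "clone, mask and grant" sequence as plain Snowflake
# SQL. All object names are hypothetical. The ordering is the point: no one
# can see the clone until masking is applied, and grants are issued last.

def clone_mask_grant(source_db, feature_db, developer_role):
    return [
        f"CREATE DATABASE {feature_db} CLONE {source_db}",
        f"ALTER TABLE {feature_db}.PUBLIC.CUSTOMERS MODIFY COLUMN EMAIL "
        f"SET MASKING POLICY MASK_PII",
        f"GRANT USAGE ON DATABASE {feature_db} TO ROLE {developer_role}",
    ]

statements = clone_mask_grant("PROD_DB", "FB_MY_FEATURE_DB", "DEV_ROLE")
```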

Are you first removing the grants in the cloned “feature branch database” and then applying similar RBAC roles to it, so you won’t use the same roles as prod? 

See above. In addition, the GRANT management system is environment-aware, so you create namespaced sets of roles, i.e., changes in dev or a feature branch don’t impact QA or PROD. 

We will be publishing a blog series about this topic soon! 

So, Matillion would not be used for orchestration and building jobs? 

If you use Matillion, that’s excellent, and we would fully support all the Matillion jobs produced, but you would orchestrate them from DataOps.live. This enables you to build pipelines with both Matillion jobs and non-Matillion jobs/functions/code (e.g., either 3rd-party tools or hand-generated code). 


How can DataOps help me test my Matillion jobs? 

Regardless of the type of the Matillion job (ETL, ELT, Snowflake transform, etc.), after every Matillion job automated test routines can be run using SQL, Python, or anything else required, to ensure the job has completed successfully without errors and the data meets all the defined tests for quality, validity, usability, etc. 

 

Is there migration support? 

Yes.  Migration support is available for every customer, either as advice and guidance, migration and architectural services, or as a fully managed migration service through any of our approved DataOps.live silver, gold, and platinum delivery partners (e.g., Datalytyx, www.datalytyx.com). 

Is there migration support if we are using Airflow, Git, Matillion, and Snowflake? 

Yes.  We can help migrate all these elements to operate in, and be orchestrated by, DataOps.live. 

And over time you will find that many, if not all, of the Airflow DAGs (directed acyclic graphs) can be run directly within DataOps.live, removing the need for the additional Airflow layer. 

Can your tool also trigger data loads for non-SaaS tools out of the box, like Informatica BDM? 

Absolutely.  We have native support for cloud-based and on-premises solutions, and Informatica is amongst these. 

On-premises, non-SaaS workloads like Informatica BDM are invoked using our on-premises runners, which are based on Docker containers. 

How would you compare this product with Airflow and dbt? 

dbt is excellent and we have been contributors to the community for some time. 

Airflow is a great tool for running DAGs and deciding what should run when. That includes scripts, SQL, and Python, and can include dbt. 

But neither of these is an enterprise, production-grade automation and orchestration platform with all the code management, version control, security management, secrets management, auditing, and governance capability required in an enterprise data pipeline system. And this is the level of governance and control that is required by nearly every enterprise, and now by many start-ups and small-to-medium businesses, where data is at the core of their service delivery. 

What is the cost of DataOps tool? 

As above. 

Do you need to use the same Snowflake user in prod, QA, and dev in order to have the zero-copy clone writable by the different pipelines? 

No. As long as the role inheritance is set up correctly (we have a standard script to do this all for you), you can use different users/roles for each environment. 

Do you manage deltas to do the ETL? Do you have stream processing for semi-real-time loads? or do you rely on Kafka Connect? 

Delta management is handled differently by different ETL/ELT tools. Most of them manage their own delta state, but if not, they can be managed in DataOps. 

The DataOps platform has its own ability to ingest from Kafka, but it also works very well with Kafka Connect. 

Behind CI/CD, there's the idea of having a "continuous" toolchain. How's your tool able to integrate with other CI/CD tools for the rest of the dev cycle? 

This is an excellent question and one that shows a clear level of understanding and maturity in CICD and DataOps. 

DataOps.live contains a full CI/CD “continuous toolchain”, as you describe it, to orchestrate everything and allow the seamless promotion of “code” from feature branches to dev, then test, and finally prod, depending on your custom approach to environment management and governance. 

Additionally, DataOps.live has been designed to enable you to “run” the pipelines of other CI/CD tools, or BE RUN BY higher-level enterprise CI/CD tools like Jenkins or Azure DevOps. This enables DataOps to operate within a wider enterprise software development lifecycle while remaining responsible for CI/CD around all things data and data-related. 

 

What happens with other concurrent queries or DML statements from other sessions when you roll back a table? Is there any locking mechanism in place? 

This is more of a Snowflake question and can be read about here: https://docs.snowflake.com/en/user-guide/data-time-travel.html 

All DML for a specific database (or at the very least a specific schema within a database) should be handled through DataOps pipelines (no automated systems making wholesale changes that conflict with hand-made changes), so there should never be a DML clash – you would make sure your DataOps pipelines were paused before you rolled back the table. 

Not a question: You must be really, really proud of what you've done here. Thanks a lot! 

Even though this is not a question, we left this in as it is such a lovely thing to say during a deeply technical webinar. 

Thank you whoever sent this.  And yes, we are extremely proud of what we have created and delivered.   

And this comes from the fact that the technology platform we built, which is delivering value to so many Snowflake customers, is founded on one primary belief and one well-thought-through overall philosophy. The belief is simply “why can’t we develop, test, and deploy data platforms the same way we do software?” And the philosophy is the technology-agnostic #TrueDataOps philosophy that we and other data industry pioneers have developed and continue to evolve as the DataOps movement evolves. 

So, the main orchestration is done in the DataOps platform? 

Yes. That is correct. 

Can we revert the data in an automated way in this case? 

While this is technically possible, in practice you would only use this functionality when something has gone fairly significantly wrong, and virtually all system and database administrators tell us that when dealing with this sort of situation, they don’t want an automated system – they want to review and check each step before they execute it. 

Can we have a lab video for DataOps testing? 

These are currently being produced, and we’ll send the links out shortly. 

How does DataOps work for Trunk Based Development? 

You can actually use any branch strategy and development model supported by Git – we just demoed one specific approach! 

What is Git revert? 

The git revert command can be considered an 'undo' type command; however, it is not a traditional undo operation. You can read more here: https://www.atlassian.com/git/tutorials/undoing-changes/git-revert. 

What is a DataOps engineer in SNOWFLAKE? 

DataOps is a new discipline, and the role and nature of a DataOps engineer is still emerging. However, it’s fair to consider it this way. 

In the software world, a DevOps Engineer builds and manages the DevOps system, the configurations, pipelines, etc., and a Software Engineer writes the stuff that actually goes on inside the pipeline. 

In the data world, a DataOps Engineer builds and manages the DataOps system, the configurations, pipelines, etc., and a Data Engineer (or another type of user who understands basic SQL and configuration files) writes the stuff that actually goes on inside the pipeline. 

 

How was the graph generated at the start of the session? 

The graph was generated automatically from a pipeline configuration file (or files) stored in the repository.  

Is Selenium supported by DevOps? 

Selenium can be fully supported by DataOps for automated testing in the same way any 3rd party tool can be. 

Can we also run Snowflake TASK DAGs from DataOps pipelines? 

Yes – this is fully supported. 

Can we have hybrid operations using DataOps? I mean, in an existing database some objects are managed manually and the rest are managed by DataOps. Is that possible? 

Yes, that is definitely possible, especially when the manual vs. DataOps-managed pieces are, say, in separate schemas. We have some good design patterns to help you here. 

But the true principle of DataOps is that you move the “source of truth”. When you run things manually, or manage some objects manually, the source of truth will always be the database or the data platform. And that is not the way the software world works. 

No one in the software world ever configures a server, a service, a web service, or a video encoding server manually. It breaks every element of governance, manageability, recoverability, auditability, etc. in the software world. 

It is possible to configure things manually in the software world, but no one ever does it. If we stay true to the belief of “why can’t we build, test, and deploy data platforms the same way we do software?”, the same should apply to data. 

Most things that people have historically thought need to be done manually can now be done through a good DataOps platform. 

 

Do merge operations only update/create the new/changed objects, or do they refresh all objects, even those which are not changed in a feature branch? 

A merge will only update the new or changed objects in the pipeline, as the rest remains the same. 

It looks like DataOps would require a lot of rights to work; how is this secured? 

The level of rights needed by DataOps is generally fairly high: if you want a platform that will create schemas, tables, shares, stages, file formats, masking policies, etc., then, by definition, it needs the rights to create all of these. However, it doesn’t need any rights on any objects it hasn’t created, so if you have things in your Snowflake tenant not managed by DataOps, the DataOps service accounts need no access to them. 

Is SSO with Azure AD supported? 

The standard DataOps.live enterprise cloud instance does not support this, but it is fully supported with our DataOps.live Enterprise private instance. 

How do you do schema management? Is the process declarative like a visual studio database project or imperative like the flyway database migration tool? 

Another excellent question, and a great one to finish on, as it is a good segue into the 9-blog series that is about to commence, where we discuss: 

  1. The challenges of repeatable and idempotent schema management 
  2. Imperative versus declarative for data 
  3. What the current “latest and greatest” is, and where we are going in the near future 

As you probably know, based on your question, declarative is the holy grail and where we want to be, but it’s EXTREMELY hard to do with data, so imperative approaches like Flyway dominate the solutions available today. 

We are not satisfied with that as an answer! Just because it’s hard does not make the alternative the correct solution. 

We are 80% of the way to a full declarative engine for data.  We call the 80% point a “check based imperative” approach. 

In 2021 we expect to release our fully declarative engine for data.  
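To give a feel for what a “check based imperative” step does, here is a heavily simplified sketch: inspect the declared target schema, compare it to what actually exists, and emit only the statements needed to close the gap. The diffing logic and names below are a toy illustration, not our actual engine:

```python
# Toy schema diff: given the declared (target) columns and the columns that
# actually exist, generate only the ALTER statements needed. Illustrative only;
# a real engine must also handle drops, type changes, dependencies, etc.

def diff_columns(declared, actual, table):
    """declared/actual: dict of column name -> type."""
    statements = []
    for column, col_type in declared.items():
        if column not in actual:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
    return statements

statements = diff_columns(
    declared={"ID": "NUMBER", "EMAIL": "VARCHAR"},
    actual={"ID": "NUMBER"},
    table="CUSTOMERS",
)
# -> ["ALTER TABLE CUSTOMERS ADD COLUMN EMAIL VARCHAR"]
```

Run against an up-to-date schema, a step like this emits nothing, which is what makes the approach repeatable and idempotent.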

 

READ MORE:  Repeatable and Idempotent Schema Management

For more great technical content around DataOps and Snowflake, why not check out our series of three blogs on the Challenges of Repeatable and Idempotent Schema Management.
