
Lifecycling Snowpark and Java UDFs Through DataOps

Written by DataOps.live | Jul 20, 2021 7:00:00 AM

We recently (14 July 2021) completed a masterclass with Kent Graziano, Chief Technical Evangelist, Snowflake, discussing Snowpark, the use of Scala and Java UDFs, and how we integrate this new technology into our DataOps platform. In particular, we discussed how we use our Snowflake Object Lifecycle Engine to lifecycle these Snowpark objects through our DataOps platform via CI/CD pipelines and automated regression testing.

It was great to share our integration with Snowpark and Java UDFs. In short, we are very excited about Snowpark because it represents a significant new way of thinking in the data world.

The philosophy of #TrueDataOps 

While the main purpose of this blog post is not a deep dive into the #TrueDataOps philosophy, it is worth a brief look, because our DataOps for Snowflake platform is built on #TrueDataOps.

When we developed our DataOps platform, we started by defining the #TrueDataOps philosophy. We began with a pure, clean model of DevOps and CI/CD, kept what worked, discarded what did not, and built #TrueDataOps on top of DevOps rather than creating DataOps as something entirely new.

Consequently, the most significant benefits of #TrueDataOps are applicable and relevant to both data and traditional software development. 

Snowpark 

Snowpark, Snowflake’s developer environment, provides developers and data engineers with the functionality to write, compile, and execute code inside Snowflake in their preferred language. In other words, Snowpark makes it easier to extend Snowflake’s functionality with custom Java UDFs and Scala code.

Snowpark’s primary objective is to move the execution of extended functionality closer to the data and significantly reduce the high-volume movement of data in and out of Snowflake by storing the code files close to Snowflake’s compute engine. 
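As a minimal sketch of what that looks like in practice (our own illustrative example, not code from the masterclass; the properties file, table, and column names are assumptions), a Scala function can be registered as a UDF and applied directly inside a Snowpark query:

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

object UdfSketch {
  def main(args: Array[String]): Unit = {
    // Connection details live in a properties file (hypothetical path).
    val session = Session.builder.configFile("snowflake.properties").create

    // Register a simple Scala function as a UDF. Snowpark ships it to
    // Snowflake and runs it next to the data instead of pulling rows out.
    val toUsd = udf((amount: Double, fxRate: Double) => amount * fxRate)

    // Apply the UDF in a DataFrame expression (illustrative table/columns).
    session.table("RAW_ORDERS")
      .withColumn("AMOUNT_USD", toUsd(col("AMOUNT"), col("FX_RATE")))
      .show()
  }
}
```

The key point is that the function executes inside Snowflake's compute, so the data never has to leave the platform.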

The additional benefits of Snowpark include:  

  • Improved data security 
  • Significantly reduced costs 
  • Faster time-to-value 
  • Reduced operational overhead 

The Snowflake Object Lifecycle Engine  

Succinctly stated, we developed the Snowflake Object Lifecycle Engine (SOLE) to manage the full lifecycle of every object in Snowflake, such as tables, grants, roles, constraints, sequences, shares, stored procedures, and user-defined functions (UDFs).

The fundamental question this masterclass answered is: how do we lifecycle Snowpark objects and Java UDFs as part of the DevOps infrastructure?

We answered it with the following example of a pipeline (each numbered step is a job in the pipeline):

  1. Secrets management: This job extracts the necessary credentials from secure tools like Data Vault
  2. Snowflake orchestration
  3. Ingest data
  4. Advanced processing: Compile, build, execute Snowpark
  5. Advanced processing: Snowpark testing 
  6. Business domain validation:  Final data testing 
  7. Generate docs

The two pipeline jobs that are relevant to this scenario are as follows.

The Snowflake orchestration job 

This is where we use the Snowflake Object Lifecycle Engine to build warehouses, roles, users, grants, shares, external functions, and resource monitors: all the objects required to make a Snowflake infrastructure work. The challenge with traditional SQL DDL (Data Definition Language) CREATE and ALTER statements is that they are very difficult to orchestrate. To solve this challenge, we model what the Snowflake environment must look like in YAML configuration files, including the objects that hold the Snowpark output.

When the job containing the configuration files runs in the DataOps pipeline, it builds a complete Snowflake infrastructure if the environment is empty. If some objects do not yet exist, or existing objects need to be altered or dropped, the engine works out the changes required to make the environment match the model.

 


Advanced processing: Compile, build, execute Snowpark 
 

For this masterclass, we built a simple Scala job called customer segmentation. Very simply, it selects customer data from an ingestion table, filters it, joins this data to master data information, groups it, shows the result, and appends it to another table. 
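The sketch below shows roughly what such a job looks like in Snowpark Scala. It is our own illustrative reconstruction rather than the exact code from the masterclass, and the connection properties, table names, and column names are assumptions:

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

object CustomerSegmentation {
  def main(args: Array[String]): Unit = {
    // Connection details are read from a properties file (hypothetical path).
    val session = Session.builder.configFile("snowflake.properties").create

    // Select customer data from the ingestion table and filter it.
    val customers = session.table("INGEST_CUSTOMERS")
      .filter(col("IS_ACTIVE") === lit(true))

    // Join to the master data and group it to produce the segmentation.
    val masterData = session.table("CUSTOMER_MASTER")
    val segments = customers
      .join(masterData, customers("CUSTOMER_ID") === masterData("CUSTOMER_ID"))
      .groupBy(masterData("SEGMENT"))
      .agg(Seq(count(customers("CUSTOMER_ID")).as("CUSTOMER_COUNT")))

    // Show the result, then append it to the output table.
    segments.show()
    segments.write.mode(SaveMode.Append).saveAsTable("CUSTOMER_SEGMENTS")
  }
}
```

Calling explain() on any of these DataFrames prints the SQL that Snowpark generates, which is what ultimately runs inside Snowflake.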

The Scala code works entirely with data frames. However, when we look at the output, we can see that SQL was compiled and executed. In summary, Snowpark decides on a case-by-case basis whether it is best to generate SQL statements or Java UDFs. We ask for the job to be done, and Snowpark does the rest.


Conclusion

The key theme of this masterclass is that Snowpark brings with it the standard software development requirements set out in the DevOps, CI/CD, and #TrueDataOps philosophies: code must be compiled, built, tested, and deployed. Because our DataOps for Snowflake platform is built on these paradigms, all of this functionality is native to our platform.

Thus, the DataOps/Snowpark integration cycle answers the question: how do we build, test, and deploy data environments the same way we do software, while still keeping all the core software DevOps capabilities?


Lastly, as described above, we are now using the Snowflake Object Lifecycle Engine alongside Snowpark to manage all the Snowflake objects that Snowpark needs to interact with.
 

For more detail and technical information, you can watch the recording of this masterclass and connect with us here.

 

Ready to get started?

Access the latest resources on DataOps lifecycle management and support for Snowpark and Java UDFs from Snowflake.