
DataOps.live & Snowflake: Better MLOps with Snowpark

Written by Doug 'The Data Guy' Needham | Mar 8, 2023 7:05:38 PM

MLOps is a new discipline. It is a core function of machine learning engineering, focused on taking machine learning models to production and managing them there. Once a Data Scientist arrives at an algorithm, or a combination of algorithms, that makes a useful prediction, that algorithm and the trained machine learning model must go into production.

Snowpark is an exciting new way to program Snowflake: rather than moving data to your code, you promote your code to the data, using the languages you already like, including Python, Java, and Scala. It runs on dedicated warehouses and works alongside Snowflake's SQL interface. This is an exciting option for architects, because they can now design warehouse solutions without having to move data out of Snowflake.
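To make that concrete, here is a minimal Snowpark for Python sketch. The connection parameters are placeholders, and the ORDERS table with its STATUS, REGION, and AMOUNT columns is hypothetical; the point is that the DataFrame operations are translated to SQL and executed inside a Snowflake warehouse, so the data never leaves the platform.

```python
# Minimal Snowpark for Python sketch: DataFrame operations are pushed down
# to Snowflake as SQL, so only the small result set comes back to the client.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg

# Placeholder connection details.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

session = Session.builder.configs(connection_parameters).create()

# Aggregate a hypothetical ORDERS table without exporting it from Snowflake.
orders = session.table("ORDERS")
summary = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by("REGION")
          .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)
summary.show()
```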

Snowpark is a powerful tool in the toolbox, and its value increases when operational best practices support the deployment and monitoring of Snowpark code. The DataOps.live Data Platform has supported Snowpark implementations for Scala and Java for some time, and we have recently added Python. Our platform handles the deployment, promotion, and management of code that would otherwise have to be done manually to get a Snowpark implementation off to the races.

With the increasing volume of data coming into the enrichment platform from a wide variety of sources (streaming, third parties, IoT, and so on), keeping the data in one place and bringing the processing to it pays off in speed, performance, and cost, because the time it takes to move data around the enterprise only grows as volumes grow. One of the best use cases for deploying this kind of code into a Snowflake environment is running machine learning directly on Snowflake.

Challenges with ML (machine learning) models 

In my experience as a Data Scientist, one of the most difficult things is getting a machine learning model into production; that gap is the reason MLOps emerged as a discipline. I have taken several courses, vendor training classes, MOOCs, and other programs that teach Data Science. All of them cover data preparation, data munging, some data cleaning, exploratory data analysis (EDA), visualizations, and the various dials that can be tuned for several types of machine learning algorithms such as XGBoost or neural networks.

Almost none of them cover delivering your machine learning model to production, retraining an existing model with new data, rolling back a trained model, comparing the performance of models, or validating model performance. All of these capabilities are built into our platform, and we have been doing similar things since our product began. Extending them to support Snowpark was a logical step forward.

A machine learning model is not code alone. When a Data Scientist trains a machine learning algorithm in her Jupyter notebook, the model lives in memory. That in-memory object may have analyzed millions or billions of records and, using the sophisticated algorithms that Mathematics PhDs love spending time explaining, jumped through every hoop imaginable: the cost function has been minimized, and the R² shows the model explaining well over 90% of the variance in the data. The combination of the code that preps and munges the data into a particular structure, plus the machine learning function that has access to that in-memory object, is what produces predictions on new data. Those predictions, whatever they may be, are usually what the business is after.

Going live with ML projects faster, better with DataOps.live 

How do you take that in-memory object from her notebook into production? It must be persisted as a file of some sort. That file then has to be moved to a location that counts as "production." At the same time, all the precursor code that works with that file must be promoted to a "production" environment and invoked at the right moment, whenever new data is seen, to produce the predictions. Done by hand, all of this is manual and ad hoc, with no automation and unpredictable results.
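As a rough illustration of that "persist and promote" step, the sketch below serializes a model and uploads it to a Snowflake stage that Snowpark code can later read from. It reuses the `session` from the earlier sketch; the ML_MODELS stage, the churn_model.joblib file name, and the synthetic training data standing in for the notebook's real data are all illustrative assumptions, not DataOps.live's helper functions.

```python
# Hedged sketch: persist an in-memory model as a file, then move that file
# to a "production" location (an internal Snowflake stage).
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for the data the Data Scientist prepared in her notebook.
X_train, y_train = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

# 1. Persist the in-memory object as a file.
joblib.dump(model, "/tmp/churn_model.joblib")

# 2. Upload the file to an internal stage so Snowpark code can reach it.
session.sql("CREATE STAGE IF NOT EXISTS ML_MODELS").collect()
session.file.put(
    "/tmp/churn_model.joblib",
    "@ML_MODELS",
    auto_compress=False,
    overwrite=True,
)
```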

The team of engineers at DataOps.live works to add new features to our Data Platform very soon after Snowflake makes them available in preview, and our platform has incorporated support for Snowpark since its initial release. Our earlier blog, DataOps.live support for Snowpark, covers the first iterations of Snowpark for Java and Scala. We have now been working with Snowpark for well over two years.
 

How it works 

The DataOps.live platform has built-in helper functions for the Python Data Scientist to save her models in easily retrieved locations for use within a Snowpark function. Our repository-based infrastructure ensures that configuration is stored as code, checked in, audited, and approved before releasing it to production—naturally supporting the incorporation of Snowpark code.  
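To show the kind of wiring involved (this is a sketch of the underlying Snowpark pattern, not DataOps.live's actual helper API), the example below registers a UDF that loads the staged model and scores rows next to the data. The PREDICT_CHURN name, the NEW_CUSTOMERS table, and the F1 through F5 feature columns are hypothetical.

```python
# Hedged sketch: a Snowpark UDF that loads a staged model file and predicts
# directly inside Snowflake. Names and columns are illustrative.
from snowflake.snowpark.functions import udf, col

@udf(
    name="PREDICT_CHURN",
    is_permanent=True,
    stage_location="@ML_MODELS",
    replace=True,
    packages=["scikit-learn", "joblib"],
    imports=["@ML_MODELS/churn_model.joblib"],
    session=session,
)
def predict_churn(f1: float, f2: float, f3: float, f4: float, f5: float) -> float:
    import sys
    import joblib
    # Files listed in `imports` are made available in this directory.
    import_dir = sys._xoptions.get("snowflake_import_directory")
    model = joblib.load(import_dir + "churn_model.joblib")
    return float(model.predict([[f1, f2, f3, f4, f5]])[0])

# Score new rows where the data already lives.
scored = session.table("NEW_CUSTOMERS").with_column(
    "CHURN_SCORE",
    predict_churn(col("F1"), col("F2"), col("F3"), col("F4"), col("F5")),
)
scored.show()
```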

A DataOps.live orchestration can perform a multitude of functions, including data ingestion, transformation, automated testing, and more. It is imperative to perform extensive validation and ensure everything meets the high quality standards required by the data product or application. DataOps.live can seamlessly orchestrate the running of code within the Snowpark framework, for example by invoking an entry point like the one sketched below, which opens the door to a multitude of machine learning use cases and applications.
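The sketch below shows the kind of entry point an orchestration step could call: the scoring logic is registered as a Snowpark stored procedure, so a pipeline job only needs to invoke it by name. This is an assumed arrangement for illustration, not DataOps.live's orchestration syntax; the SCORE_NEW_CUSTOMERS, CUSTOMER_SCORES, NEW_CUSTOMERS, and PREDICT_CHURN names are hypothetical.

```python
# Hedged sketch: package Snowpark logic as a stored procedure that an
# orchestration job (or a scheduled task) can call by name.
from snowflake.snowpark import Session

def score_new_customers(session: Session) -> str:
    # Re-use the hypothetical PREDICT_CHURN UDF to write scores back
    # into a results table inside Snowflake.
    session.sql(
        """
        CREATE OR REPLACE TABLE CUSTOMER_SCORES AS
        SELECT C.*, PREDICT_CHURN(F1, F2, F3, F4, F5) AS CHURN_SCORE
        FROM NEW_CUSTOMERS C
        """
    ).collect()
    return "scored"

session.sproc.register(
    func=score_new_customers,
    name="SCORE_NEW_CUSTOMERS",
    replace=True,
    packages=["snowflake-snowpark-python"],
)

# An orchestration step can now simply run:
session.call("SCORE_NEW_CUSTOMERS")
```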

The code of the machine learning model is stored in a git-compatible repository, and the model object that code needs to make accurate predictions is easily retrieved using our helper functions. Following our best practices, the code can even update the model object when new data arrives or on a schedule, as sketched below. All of it is controlled, managed, and visible, ready to support any audit or question about what a prediction means.
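As a rough picture of that retraining loop (again a sketch with hypothetical table and column names, not our helper functions), new labelled data can be pulled straight from Snowflake, the model refit, and the staged artifact overwritten so the next deployment picks up the new version.

```python
# Hedged retraining sketch: refresh the model from the latest labelled data
# and overwrite the staged model file. Table and column names are illustrative.
import joblib
from sklearn.ensemble import GradientBoostingRegressor

# Read the freshest training data without a separate export step.
train_df = session.table("CUSTOMER_TRAINING_DATA").to_pandas()
X = train_df[["F1", "F2", "F3", "F4", "F5"]]
y = train_df["CHURNED"]

model = GradientBoostingRegressor().fit(X, y)

# Overwrite the previous model artifact on the stage.
joblib.dump(model, "/tmp/churn_model.joblib")
session.file.put(
    "/tmp/churn_model.joblib",
    "@ML_MODELS",
    auto_compress=False,
    overwrite=True,
)
```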

This comprehensive approach enables DataOps teams to work efficiently with Data Scientists in the organization to produce valuable insights quickly.  

Your data team should focus on creating value for your organization—that begins with an organized, repeatable approach for building and operating machine learning models—MLOps made easy. DataOps.live provides the only purpose-built platform that helps modern data teams take the next step forward with Snowpark. 

Want to learn more about DataOps.live? Connect with us at DataOps.live—“The folks who wrote the book on DataOps.”  Here is a link to the book—DataOps for Dummies 

Check out the Roche Case Study, the OneWeb Case Study, and Accelerated Snowpark Deployments with DataOps.live, and be sure to register for our Podcast Series hosted by Kent Graziano, The Data Warrior.

Ready to give it a try? Click here: DataOps.live free trial