DataOps.live · May 31, 2020 · 5 min read

You’re either burning money or not meeting demand with your time series read/write provisioning

Unless it’s on Snowflake with Instantaneous Demand

One of the standard promises of Cloud Data Warehouses when compared to their on-premises alternatives is elasticity. As your requirements go up and down, the system can stretch or shrink with them. All such systems are capable of change to some extent. The question is all about timing.

Changing the capacity of an on-premises Data Warehouse could easily take months (by the time hardware has been procured, installed, tested, and incorporated into a cluster). Cloud Data Warehouses offer a new approach where scaling is at the click of a button and can take just minutes. But does this really allow you to avoid overprovisioning or under-provisioning?

Time Series workload profiles

Time series data usually follows a characteristic usage pattern, which splits into read and write workloads.

Write

Writes can happen anywhere on a spectrum, from near-continuous:

[Figure: near-continuous write load]


Very regularly (e.g. every 30 minutes):

[Figure: regular write load, e.g. every 30 minutes]


Batch (e.g. every 24 hours):

[Figure: batch write load, e.g. every 24 hours]


All of these workloads except the continuous one share the same problem: with traditional Data Warehouse sizing (i.e. how many servers do I have and how large are they?) there is always a compromise to be made. If I provision for the peak load, I waste all that capacity whenever the system is not being used, e.g.:

[Figure: peak-provisioned capacity sitting idle between regular loads]

or

[Figure: peak-provisioned capacity sitting idle between batch loads]


The problem here is that the elasticity time for traditional Cloud Data Warehouses is still measured in several minutes. It’s just about possible to consider resizing a Data Warehouse for a daily load, but practical limitations (often including some downtime while the switch is being made) make this very rare.

Read

Read workloads are even more complex. In addition to the usual steady state of small queries (e.g. to update status pages, wall boards, KPI reports, etc.):

[Figure: steady state of small read queries]


There are also the much more unpredictable, and often much larger queries driven by human usage (e.g. through a BI tool or Data Science Platform):

[Figure: unpredictable, larger human-driven read queries]


And of course, the real world being what it is, these queries won’t spread evenly over time, or even over the working day. There will be large bursts of load (e.g. at 9am on a Monday morning when everyone wants to explore what happened over the weekend, or during an ongoing incident when everyone is trying to find out why), and these bursts will overlap each other. So if we profiled the instantaneous desired system capacity, at times it would be much higher than the peaks above:

[Figure: overlapping bursts producing very high instantaneous demand]


Of course, trying to size for the peak load here is incredibly wasteful:

[Figure: sizing for peak read load leaves most capacity idle]


And yet not doing so means that users, possibly key ones, will see significant slowdowns.

Multiple workloads

In practice it’s unheard of for a Data Warehouse to have JUST reads or writes. In the real world they have both. This creates both an opportunity and a problem. If the reads and the writes occur at different times, then this can increase the effective utilization and reduce the wasted resource.

[Figure: non-overlapping read and write workloads increasing effective utilization]


But is the real world ever so convenient? Absolutely not. In practical situations all the workloads, both read and write, will overlap and interfere with each other, creating a very bursty and unpredictable total demand with very large peaks.

The possible outcomes are:

  • The system will be overprovisioned and under-loaded. More capacity will exist than is needed by the total read-write demands. This is akin to burning money.

[Figure: overprovisioned system with large amounts of unused capacity]


  • The system will be under-provisioned and overloaded: the total read-write demands will exceed capacity; users will have to wait longer for results and write loads will be slowed down. In extreme, but all too frequent, cases this can trigger a feedback loop: write queries fall behind, read queries time out and retry, which increases system load further; write queries start to fail and fall further behind. Very quickly this loop will bring a system to its knees.
  • Very, very rarely the system will have absolutely the right capacity for the user demands and will be both cost effective and meet all user demands.


Snowflake’s Instantaneous Demand

The problem here is all down to Elasticity. Even most modern Cloud Data Warehouses have elasticity models that are meant to flex based on trends (e.g. “this week I have more users than last week”, “this month we added new data loads, so we need to increase capacity”). What is really needed is a system that can respond to Instantaneous Demand.

The Snowflake Virtual Warehouse model provides exactly this capability.

In a continuous write model, a relatively small Virtual Warehouse can be provisioned ‘always on’. In a regular or batch load model, a Virtual Warehouse can be provisioned for the load with Auto Resume and Auto Suspend, so the warehouse is only paid for when it’s needed and being used.

In more complex load scenarios, different Virtual Warehouses can be provisioned for each data set, for example:

  • Massive batch events being loaded daily: dedicated X-Large Virtual Warehouse with Auto Suspend and Auto Resume
  • Regular Time Series loads every 30 minutes: dedicated Large Virtual Warehouse with Auto Suspend and Auto Resume
  • Continuous Status Update loads: dedicated X-Small Virtual Warehouse always on

Critically, since each of these loads has its own warehouse, there is no contention between them (or between them and any read workloads).
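As a sketch, the three write workloads above might be provisioned along these lines (warehouse names are illustrative, and the AUTO_SUSPEND threshold would be tuned per workload):

```sql
-- Massive daily batch loads: large, but suspended most of the day.
CREATE WAREHOUSE IF NOT EXISTS batch_load_wh
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Regular time series loads every 30 minutes: billed only while loading.
CREATE WAREHOUSE IF NOT EXISTS regular_load_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Continuous status update loads: small and always on.
CREATE WAREHOUSE IF NOT EXISTS status_load_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 0;          -- 0 disables auto-suspend
```

Because each load targets its own warehouse, a slow batch run can never queue behind (or starve) the continuous status updates.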

Read workloads can be similarly provisioned:

  • Updating wall boards every 60 seconds: dedicated X-Small Virtual Warehouse always on
  • Hourly updating of KPI reports: dedicated Large Virtual Warehouse with Auto Suspend and Auto Resume
  • Ad-hoc queries from Engineering users: dedicated X-Large Virtual Warehouse with Auto Suspend and Auto Resume and auto-scale of the cluster size
  • Ad-hoc queries from Operations users: dedicated 2X-Large Virtual Warehouse with Auto Suspend and Auto Resume and auto-scale of the cluster size
  • CFO finance reports: dedicated X-Large Virtual Warehouse with Auto Suspend and Auto Resume: when the CFO wants their report, they don’t want to wait in line. So why make them?!
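The ad-hoc entries above rely on Snowflake’s multi-cluster warehouses (an Enterprise Edition feature), which add and remove clusters as concurrency rises and falls. A sketch, with the name and cluster bounds purely illustrative:

```sql
-- Ad-hoc analyst queries: one cluster when quiet, up to four clusters
-- during a 9am-Monday-style burst, and fully suspended when nobody is
-- querying at all.
CREATE WAREHOUSE IF NOT EXISTS adhoc_eng_wh
  WAREHOUSE_SIZE = 'XLARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```

The STANDARD scaling policy favours starting clusters as soon as queries queue, which suits bursty human-driven workloads; the ECONOMY policy waits longer and trades some latency for fewer credits.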

Ultimately, this creates an architecture where workloads are modelled, distributed based on key attributes (some discussed here, some out of scope for this blog post, such as cache hit frequency), and provisioned as specific Snowflake Virtual Warehouses. The result is a system where capacity follows demand very closely (certainly within 60 seconds), creating a true win-win: all read-write workloads get the capacity they need when they need it.
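One nice side effect of the per-workload warehouse model is that spend becomes directly attributable. A query like the following (against Snowflake’s built-in ACCOUNT_USAGE schema, which lags real time by up to a few hours) shows whether each warehouse’s credit consumption is tracking its workload’s actual demand:

```sql
-- Credits consumed per warehouse over the last 7 days, so each
-- workload's cost can be checked against its expected usage pattern.
SELECT warehouse_name,
       SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```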
