DataOps.live | Nov 23, 2021 | 5 min read

Why Data Orchestration Platforms are an Imperative in the Modern Data Ecosystem

As described in the article titled “Everything You Need to Know About DataOps and Data Platforms,” the amount of data generated per day continues to grow unabated. And trying to quantify these daily data volumes (as well as types) continues to prove challenging. At best, it is only possible to derive approximations.

Current estimates suggest that Internet users collectively create around 1.145 trillion MB per day. Looking back, 2020 statistics showed that each person with Internet access created at least 1.7 MB of data per second.

Coupled with the exponential explosion in the 3Vs of Big Data (Volume, Velocity, and Variety), there has been, and continues to be, a strong movement toward the “development of more feasible and viable tools for storing” and processing this data.

The journal article titled “Big Data Storage Technologies: A Survey” describes a radical shift in data storage methods and mechanisms, from traditional database and data management systems to modern technologies such as NoSQL databases, data warehouses, and data lakes. The authors also correctly note that “storage is the preliminary process of Big Data analytics for real-world applications.” Snowflake, AWS, and Google Cloud are among the biggest cloud-based data storage providers (with Snowflake rapidly gaining ground on the other two).

ELT: Extracting, loading and transforming data

In short, every business needs to derive valuable insights from its data as fast as possible, and that need is growing exponentially. Traditional data management practices, such as siloed data sources and systems, combined with a lack of resources to modernize them, slow down or block the delivery of information when it is needed.

Therefore, to move forward and solve these challenges, you need to implement the latest ELT tools that allow the adoption of modern data management best practices (such as those described in the #TrueDataOps philosophy).

At the same time, there has been an explosion of individual tools that are designed to manage a specific part of the ELT (Extract, Load, Transform) process or pipeline. Some of these tools include:


Matillion ETL

Matillion ETL is a data integration tool that is extremely effective at its core function: extracting, loading, and transforming data. Our blog post titled “DataOps Launches Integration with Matillion” describes what Matillion is good at:

“It is designed to extract data from multiple disparate data sources and load it into the Snowflake data cloud.” 


Soda SQL

Let’s turn, once again, to the DataOps.live blog for a description of another part of the data processing lifecycle, this time to a post titled “DataOps Launches Support for Soda SQL and Soda Cloud.”

Soda SQL is an “open-source, data testing, monitoring, and profiling tool for data-intensive environments… [and] designed to be integrated with your existing data pipelines and workflows.”

Data.world data catalog

As our highly rated and very experienced data engineers (led by Guy Adams) note: “Highly accurate, trustworthy, and timely metadata is gold to data catalogs and the business communities they serve.”

Data catalogs, like the data.world data catalog, are a vital part of the data processing toolchain and should not be overlooked in the data processing workflow.

Orchestrating data toolchains with a data orchestration platform

If we consider each of the data processing tools highlighted above as separate entities, which they are in many scenarios, we end up with a disjointed, chaotic process. A manual approach, where each tool runs independently of the others, can easily lead to breakdowns in the ELT (or ETL) process, increasing the time it takes to deliver data insights to the business, if they are delivered at all. There is also a significant chance that these insights will be inaccurate, causing more harm than if they had never been delivered.

This scenario is untenable and unsustainable, and a solution is needed. In short, it is imperative to adopt a data orchestration platform such as DataOps.live.

It is vital to note that data processes (managed by the independent tools highlighted above) have interdependencies. To deliver reliable, trusted data insights, it is critical to implement a workflow (in the form of a data pipeline) where tasks execute in the correct order. And if one process fails, the pipeline must stop so the errors can be corrected before it is restarted.
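The execute-in-order, fail-fast behavior described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the DataOps.live engine; the task names and `PipelineError` type are assumptions for the example:

```python
class PipelineError(Exception):
    """Raised when a pipeline task fails, halting all downstream tasks."""

def run_pipeline(tasks):
    """Run (name, callable) pairs in dependency order; stop at the first failure."""
    completed = []
    for name, task in tasks:
        try:
            task()
        except Exception as exc:
            # Fail fast: downstream tasks depend on this step's output,
            # so the whole pipeline must stop until the error is fixed.
            raise PipelineError(f"task '{name}' failed: {exc}") from exc
        completed.append(name)
    return completed
```

A failed "load" step, for example, would stop the run before "transform" ever executes, which is exactly the stop-fix-restart discipline the text calls for.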

Therefore, integrations between DataOps.live and tools like data.world, Soda SQL, and Matillion are critical to the value these tools deliver to business communities exploring data. These integrations enable DataOps.live to run data pipelines that orchestrate a toolchain of these individual products. In short, the value to a customer of a 1-click, zero-effort data pipeline that returns highly available, robust, trusted data is immense.

SOLE: Managing the state of a database in a data pipeline

As discussed elsewhere, we have developed our Snowflake Object Lifecycle Engine (SOLE), which manages a data pipeline’s DDL statements, making it straightforward to restart a pipeline without worrying that recreating tables (and other Snowflake objects) will cause the pipeline to fail.

SOLE essentially looks at the desired state of the database and compares it to its current state. It then creates and runs only the DDL statements needed to bring the current state in line with the desired state.
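The desired-versus-current comparison can be pictured as a simple set diff. The sketch below is a hypothetical illustration of the reconciliation idea, not SOLE’s actual implementation; the table names and DDL templates are assumptions:

```python
def plan_ddl(desired_tables, current_tables):
    """Return only the DDL needed to converge the current table set on
    the desired one; when the two sets match, the plan is empty."""
    to_create = sorted(desired_tables - current_tables)
    to_drop = sorted(current_tables - desired_tables)
    ddl = [f"CREATE TABLE {name} (...);" for name in to_create]
    ddl += [f"DROP TABLE {name};" for name in to_drop]
    return ddl
```

Restarting a pipeline whose CREATE statements already ran produces an empty plan, so nothing is rerun and nothing breaks.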

For instance, if a data orchestration pipeline breaks down before finishing and needs to be restarted, SOLE considers whether any create, alter, or drop DDL statements have already run and whether it should run them again.

Let’s assume your data pipeline must create several new tables, add columns to other tables, and change the data type of several existing columns, and that the pipeline stops after these DDL statements have run. Without SOLE, the DDL statements will rerun when the pipeline is restarted, breaking the pipeline if they have not been coded to check whether each object already exists.
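The difference between unguarded and restart-safe DDL can be shown with a small helper. The generator function below is hypothetical and purely illustrative; what makes the statement safe to replay is Snowflake’s `CREATE TABLE IF NOT EXISTS` guard:

```python
def create_table_sql(table, columns, if_not_exists=True):
    """Build a CREATE TABLE statement; the IF NOT EXISTS guard makes it
    safe to re-run after a pipeline restart, since an existing table is
    left untouched instead of raising an 'already exists' error."""
    guard = "IF NOT EXISTS " if if_not_exists else ""
    return f"CREATE TABLE {guard}{table} ({', '.join(columns)});"
```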

SOLE ensures this scenario never occurs by comparing the database’s intended state with its actual state and running only the DDL scripts necessary to make the current state match the intended one.

Conclusion

The need to harness the massive volumes of data generated daily to derive business value is a given. As a result, businesses must prioritize becoming data-driven organizations. And a significant part of deriving value from organizational data in the form of trusted data insights is adopting a data orchestration platform, thereby ensuring that your data management processes are reliable and robust.
