As described in the article titled “Everything You Need to Know About DataOps and Data Platforms,” the amount of data generated per day continues to grow unabated. And trying to quantify these daily data volumes (as well as types) continues to prove challenging. At best, it is only possible to derive approximations.
Current estimates suggest that global netizens together create circa 1.145 trillion MB per day. Looking back, 2020 statistics showed that each person with access to the Internet created at least 1.7 MB of data per second.
Coupled with the exponential explosion in the 3Vs of Big Data (Volume, Velocity, and Variety), there has been and continues to be a strong movement towards the “development of more feasible and viable tools for storing” and processing this data.
The journal article titled “Big Data Storage Technologies: A Survey” describes a radical shift in the data storage methods and mechanisms from traditional database or data management systems to modern technologies such as NoSQL databases, data warehouses, and data lakes. Additionally, the authors of this research paper correctly note that “storage is the preliminary process of Big Data analytics for real-world applications.” Snowflake, AWS, and Google Cloud are the three biggest cloud-based data storage solution providers (with Snowflake rapidly gaining ground on the other two service providers).
ELT: Extracting, loading and transforming data
In short, the pressure on businesses to derive valuable insights from their data as quickly as possible keeps increasing. Traditional data management practices, such as siloed data sources and systems, together with a lack of resources to manage them, slow down or even block the delivery of information when it is needed.
Therefore, to move forward and solve these challenges, you need to implement the latest ELT tools that allow the adoption of modern data management best practices (such as those described in the #TrueDataOps philosophy).
At the same time, there has been an explosion of individual tools that are designed to manage a specific part of the ELT (Extract, Load, Transform) process or pipeline. Some of these tools include:
Matillion ETL
Matillion ETL (Extract, Transform, Load) is a data integration tool that is extremely effective at its core function: ELT. Our blog post titled “DataOps Launches Integration with Matillion” neatly describes what Matillion is good at:
“It is designed to extract data from multiple disparate data sources and load it into the Snowflake data cloud.”
Soda SQL
Let’s turn, once again, to the DataOps.live blog for a description of another part of the data processing lifecycle, this time in a post titled “DataOps Launches Support for Soda SQL and Soda Cloud.”
Soda SQL is an “open-source, data testing, monitoring, and profiling tool for data-intensive environments… [and] designed to be integrated with your existing data pipelines and workflows.”
Data.world data catalog
As our highly rated and very experienced data engineers (led by Guy Adams) note: “Highly accurate, trustworthy, and timely metadata is gold to data catalogs and the business communities they serve.”
Data catalogs, like the data.world data catalog, are a vital part of the data toolchain and should not be overlooked in the data processing workflow.
Orchestrating data toolchains with a data orchestration platform
If we treat each of the data processing tools highlighted above as a separate entity, which they are in many scenarios, we end up with a disjointed, chaotic process. A manual approach where each tool is run independently is likely to lead to breakdowns in the ELT (or ETL) process, increasing the time it takes to deliver data insights to the business, if they are delivered at all. There is also a significant chance that these insights will be inaccurate, causing more harm than if they had not been delivered in the first place.
As a result, a solution is needed for this untenable and unsustainable scenario. In short, it is imperative to adopt a data orchestration platform such as our DataOps.live platform.
It is vital to note that data processes (managed by the independent tools highlighted above) have interdependencies. And, to deliver reliable, trusted data insights, it is critical to implement a workflow (in the form of a data pipeline) where tasks are executed in the correct order. And if one process fails, the pipeline must stop, the errors must be corrected, and the pipeline restarted.
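To make this concrete, here is a minimal sketch in Python (standard library only) of the behaviour described above; the step names are hypothetical and this is not the DataOps.live pipeline syntax. Each step runs only after the steps it depends on have succeeded, and the pipeline halts on the first failure so the error can be corrected before a restart.

```python
# A minimal sketch of an orchestrated toolchain: hypothetical step names,
# not the DataOps.live API. Each step maps to the set of steps it depends on.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

pipeline = {
    "extract_and_load": set(),                    # e.g. a Matillion ELT job
    "transform":        {"extract_and_load"},
    "test_data":        {"transform"},            # e.g. a Soda SQL scan
    "publish_catalog":  {"test_data"},            # e.g. a data.world catalog update
}

def run_step(name: str) -> bool:
    """Stand-in for invoking the real tool; returns True on success."""
    print(f"running {name}")
    return True

# Execute steps in dependency order and stop at the first failure.
for step in TopologicalSorter(pipeline).static_order():
    if not run_step(step):
        raise RuntimeError(
            f"Step '{step}' failed: stopping the pipeline so the error "
            "can be corrected before the pipeline is restarted."
        )
```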
Therefore, integrations between DataOps.live and tools like data.world, Soda SQL, and Matillion are critical to the value these tools deliver to business communities looking to explore data. These integrations enable DataOps.live to run data pipelines that orchestrate a toolchain of these individual products. In short, the value to a customer of a 1-click, zero-effort data pipeline that returns highly available, robust, trusted data to the business is immense.
SOLE: Managing the state of a database in a data pipeline
As discussed in several other places, we have developed our Snowflake Object Lifecycle Engine (SOLE), which manages the data pipeline’s DDL statements. This makes it straightforward to restart a pipeline without worrying that attempts to recreate tables (and other Snowflake objects) will cause the pipeline to fail.
SOLE essentially looks at the desired state of the database and compares it to its current state. It then creates and runs DDL statements to match the current state with the desired state.
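To picture how that comparison works, here is a purely illustrative Python sketch (hypothetical tables and deliberately simplified logic, not SOLE’s actual implementation) that diffs a desired schema against the current one and emits only the DDL needed to close the gap:

```python
# Purely illustrative: compare a desired schema against the current one and
# generate only the DDL needed to close the gap (not SOLE's actual code).
desired = {
    "CUSTOMERS": {"ID": "NUMBER", "NAME": "VARCHAR", "SIGNUP_DATE": "DATE"},
    "ORDERS":    {"ID": "NUMBER", "CUSTOMER_ID": "NUMBER", "TOTAL": "NUMBER"},
}
current = {
    "CUSTOMERS": {"ID": "NUMBER", "NAME": "VARCHAR"},  # SIGNUP_DATE is missing
    # ORDERS does not exist yet
}

def plan_ddl(desired: dict, current: dict) -> list[str]:
    """Return only the DDL statements needed to reach the desired state."""
    statements = []
    for table, columns in desired.items():
        if table not in current:
            cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        for name, dtype in columns.items():
            if name not in current[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {dtype}")
    return statements

for ddl in plan_ddl(desired, current):
    print(ddl)
# ALTER TABLE CUSTOMERS ADD COLUMN SIGNUP_DATE DATE
# CREATE TABLE ORDERS (ID NUMBER, CUSTOMER_ID NUMBER, TOTAL NUMBER)
```

Because the plan is derived from the difference between the two states, rerunning it after a failure does not generate duplicate CREATE statements.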
For instance, if a data orchestration pipeline breaks down before it finishes and needs to be restarted, SOLE will consider whether any DDL statements that create, alter, or drop objects have already been run and whether they need to be run again.
Let’s assume your data pipeline must create several new tables, add columns to other tables, and change the data type of several existing columns, and that the pipeline stops after these DDL statements have been run. Without SOLE, the DDL statements will be rerun when the pipeline is restarted, potentially causing breakdowns if the statements have not been coded to check whether each object already exists.
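To illustrate what such a check involves, here is a generic sketch (hypothetical table name, assuming a standard DB-API cursor such as the one provided by the Snowflake Python connector; this is not how SOLE works internally) of the guard a hand-written pipeline would need around every CREATE statement:

```python
# Generic sketch of a hand-coded existence check (hypothetical table name);
# SOLE removes the need to write these guards yourself.
def table_exists(cursor, table_name: str) -> bool:
    """Check the database's actual state before issuing any DDL."""
    cursor.execute(
        "SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = %s",
        (table_name,),
    )
    return cursor.fetchone()[0] > 0

def ensure_table(cursor, table_name: str, create_ddl: str) -> None:
    # Only run the CREATE statement if the object is genuinely missing,
    # so a restarted pipeline does not fail on "object already exists".
    if not table_exists(cursor, table_name):
        cursor.execute(create_ddl)

# Example (hypothetical):
# ensure_table(cursor, "CUSTOMERS",
#              "CREATE TABLE CUSTOMERS (ID NUMBER, NAME VARCHAR)")
```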
SOLE ensures that this scenario never occurs by comparing the database’s intended state with its actual state and running only the DDL statements necessary to bring the current (or existing) state in line with the intended state.
Conclusion
The need to harness the massive volumes of data generated daily to derive business value is a given. As a result, businesses must prioritize becoming data-driven organizations. And a significant part of deriving value from organizational data, in the form of trusted data insights, is adopting a data orchestration platform, thereby ensuring that your data management processes are reliable and robust.