Increasing Productivity, Empowering Developers: Dynamically Scale Data Product Development Using DataOps.live and Kubernetes

Written by Patrice Borne | Feb 06, 2023

When you run a pipeline, you need a runner: a piece of code that does the job on your behalf. Someone has to set this up. It’s not very hard, of course, but the problem is that you only have access to the resources of the one machine where the runner runs. This brings limitations, including the size and cost of the machines available to rent. It’s far from easy to scale up and down in a flexible way, and the big ‘monster machines’ can get very expensive very quickly.

Resiliency can also be a problem, as can the fact that you need to provision a large virtual machine even if you only use it for a fraction of the time. You can set up a second runner, or as many as you want, but each one takes time and effort to set up, manage and control. It’s possible but far from ideal.

The combination of DataOps.live and Kubernetes solves these issues, enabling you to scale up and down and improve your resiliency. You can define the target number of pods that you want for a service and let Kubernetes figure out at runtime how to get there. If something breaks, it will recover and reschedule. And of course, you’re boosting developer productivity.
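To make the "define a target and let the cluster get there" idea concrete, here is a minimal sketch of a reconciliation pass in Python. It is a toy model, not Kubernetes or DataOps.live code: the function name `reconcile` and the `runner-N` pod names are hypothetical and chosen only for illustration.

```python
def reconcile(desired: int, running: list[str], next_id: int) -> tuple[list[str], int]:
    """Return the pod list after one reconciliation pass, plus the next unused id."""
    pods = list(running)
    # Start replacement pods until the desired count is reached.
    while len(pods) < desired:
        pods.append(f"runner-{next_id}")  # hypothetical naming scheme
        next_id += 1
    # Drop surplus pods if the target was lowered (no-op otherwise).
    del pods[desired:]
    return pods, next_id


# Ask for three runner pods, starting from an empty cluster.
pods, next_id = reconcile(desired=3, running=[], next_id=0)
started = list(pods)

# Simulate one pod failing, then reconcile again: a replacement is scheduled.
pods.remove("runner-1")
recovered, next_id = reconcile(desired=3, running=pods, next_id=next_id)
print(started)
print(recovered)
```

You only state the target (three pods); the loop works out what to start or stop. A real cluster does this continuously and also decides which node each pod lands on, which this sketch leaves out.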

It’s worth remembering that Kubernetes itself is pretty low level and requires quite a lot of work to configure from scratch. So we provide Helm charts in which you specify just a few things, adding a level of abstraction on top of those super-low-level Kubernetes concepts. You can now think in terms of what your application does rather than what Kubernetes expects. Because of how the pipeline and job runner mechanisms work, you need to be able to pass information from one step to the next, and we have a shared storage area mechanism to do this. You then need to consider two important parameters: resiliency and concurrency.

Providing Resiliency and Concurrency for Dynamic Scalability

The first parameter is resiliency: how many Pods do you want? This is a typical Kubernetes concept, and each Pod presents itself to DataOps.live as a runner. If you want two Pods, you’ll get two runners for the project you’re working on, and you can register them with a specific project.

The second important parameter is concurrency. If you look at how a pipeline is defined, logically, you have different stages, and within a stage you may have multiple jobs. When you put multiple jobs in the same stage, those jobs can execute in parallel at runtime. You can also constrain the execution order using dependencies. In this way, you (or rather DataOps.live) can schedule more work to run in parallel to make the pipeline go faster.
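The stage and job model described above can be sketched in a few lines of Python. This is an illustration only, not how DataOps.live schedules work: the stage names, job names and the `run_pipeline` helper are all hypothetical, and explicit job dependencies are not modelled. Stages run in order, jobs inside a stage run in parallel, and `concurrency` caps how many jobs run at once.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pipeline definition: stage name -> jobs in that stage.
PIPELINE = {
    "ingest": ["load_orders", "load_customers"],
    "transform": ["build_models"],
    "test": ["test_orders", "test_customers", "test_models"],
}


def run_job(name: str) -> str:
    """Stand-in for real work; a real job would run in its own container."""
    return f"{name}: ok"


def run_pipeline(pipeline: dict[str, list[str]], concurrency: int) -> list[str]:
    """Run stages sequentially and the jobs within each stage in parallel."""
    results: list[str] = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for jobs in pipeline.values():
            # Consuming pool.map waits for every job in the stage to finish,
            # so the next stage only starts once this one is complete.
            results.extend(pool.map(run_job, jobs))
    return results


results = run_pipeline(PIPELINE, concurrency=2)
print(results)
```

With `concurrency=2`, the three jobs in the hypothetical `test` stage need two rounds; raising the limit to 3 lets them all run together, which is the effect the concurrency parameter has on pipeline duration.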

Using the traditional approach, the runner schedules each job on the virtual machine as an individual Docker container. We’re back to that resource/cost issue: there’s only so big a machine you can rent from AWS, say, to run jobs in parallel, and eventually you risk running out of resources. This is especially true if you’re involved in non-trivial development: your headcount is growing, people want to run more projects, they want more pipelines, and what you start asking of your runner(s) can become overwhelming.

DataOps.live and Kubernetes solve this: you can define and manage a massive concurrency level should you want. When you’re running a pipeline, 10 jobs could run in parallel. You could have five or more teams working on multiple different projects in DataOps.live, each one with multiple pipelines, sharing Kubernetes runners. Instead of thinking in terms of setting up 1, 2, 10 or 20 runners, think in terms of a Kubernetes cluster, without the need to manage specifically where each runner is going to run.

The beauty of this approach is that when you have to deal with a huge spike in demand in your DataOps.live environment, with everyone wanting to run pipelines and needing huge resources, your Kubernetes cluster scales up dynamically, with no human intervention. In the traditional way, with specific runners you defined, you faced a physical limit: when you ran out of resources, everyone had to get in the queue. With the DataOps.live Kubernetes approach, you instead scale up dynamically to address whatever demands you have. And when you’re done, it will scale down automatically.
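A rough back-of-the-envelope model shows why the queue matters. The sketch below compares a fixed number of job slots with an elastic pool during a spike. Every number in it (40 jobs, 8 fixed slots, 5-minute jobs, a 2 to 50 slot range) is an invented assumption, the helper names are hypothetical, and it ignores the time a real cluster needs to add nodes.

```python
import math


def makespan_minutes(jobs: int, slots: int, job_minutes: int) -> int:
    """Time to finish `jobs` equal-length jobs when `slots` can run at once."""
    if jobs == 0:
        return 0
    waves = math.ceil(jobs / slots)  # jobs beyond the slot count wait in the queue
    return waves * job_minutes


def elastic_slots(jobs: int, min_slots: int, max_slots: int) -> int:
    """Slots an elastic pool would offer: follows demand within its bounds."""
    return max(min_slots, min(jobs, max_slots))


spike = 40  # hypothetical number of jobs submitted at the same time

# Fixed runners: 8 slots, so the 40 jobs run in 5 waves of 5 minutes each.
fixed = makespan_minutes(spike, slots=8, job_minutes=5)

# Elastic pool: grows to 40 slots for the spike, so one wave is enough.
elastic = makespan_minutes(spike, elastic_slots(spike, 2, 50), job_minutes=5)

# After the spike, the pool shrinks back to its minimum size.
idle = elastic_slots(0, 2, 50)

print(fixed, elastic, idle)
```

Under these assumptions the fixed setup takes 25 minutes and the elastic one 5, and the elastic pool falls back to 2 slots once demand disappears.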

So what does this mean in practice? First, improved cost management as it now becomes dynamic. You don’t pay for compute resources for DataOps if you don’t use those resources (the original promise of the cloud).

Second, the DataOps.live platform means you can also share these flexible additional resources around runners and jobs across multiple teams and projects, safe in the knowledge that you’re supported by the secrets management and security features provided by the platform. This further improves productivity and collaboration. Developers are no longer constrained and have better access to resources.

And third, you get the resiliency you need. If a pipeline breaks, it’s detected automatically, and you keep on running. You can rely on Kubernetes to do the right thing at run time. So, business-critical processes using those pipelines become more reliable.

DataOps.live and Kubernetes support our mission to improve:

Developer productivity: making it easier for developers to work concurrently and faster (making changes, running, seeing the results, making further improvements) for higher productivity and accelerated development
End user productivity and customer satisfaction: delivering higher quality products faster plus increased resiliency and reduced downtime mean more productive and satisfied business users
Cross-project productivity: using leading-edge DataOps.live technology and processes across all groups enables people to easily move around to balance changing workloads and priorities
Business productivity (time to market and revenue): increasing the speed, effectiveness and resiliency of data and analytic processes means you can build new use cases in days rather than months or years
