
Making Data Engineering Work: Clear, Repeatable, + Well-Documented

Make sure to read the first blog in our series featuring data engineers talking about all things data and DataOps. It covers Bulgaria-based CloudOps & DataOps Engineer Martin Getov's story: Driven by Curiosity: Making Sense of the Data Puzzle.
In this second blog, CloudOps & DataOps Engineer Mincho Ganchev talks about his route into data engineering and how data catacombs can be avoided by using the platform.

Data engineering is the link between the input a business has and the decisions it needs to make. As data engineers, our role is to collect, curate and automate data so it's available to be interpreted. And that's quite a responsibility in today's whirlwind of tech solutions and methods.

Businesses need to base their decisions and assumptions on facts. If an organization wants to stay competitive and successful, it has to be data-driven, not gut-driven. That's why DataOps-driven data engineering is so important.

I wasn't always a data engineer. For almost a decade, I was a sales representative in the machinery sector, and I wanted to work in a more challenging field. I already understood business requirements from diverse stakeholders and had a longstanding love for coding and solving complex problems, so I brought the two together. I started using a BI tool and learned SQL. One thing led to another: I got into the Jinja templating engine, which made me curious about Python. As an avid fan of Snowflake and dbt, seeing how it all comes together on the platform made me want to be part of the team.

My main expertise lies in analytics and data modeling. But prior to using the platform, there were a number of issues I wondered how to tackle, including essential but mundane, time-consuming tasks in your day-to-day work that you'd love to automate or skip. If I wanted to test a colleague's code - not just whether it worked, but whether tests on the data results held up over a period of time - I had to switch over to their branch, use the same roles and permissions, use the same target, and hope the source data was up to date. By then, the PROD database had usually moved ahead a few ingestion batches, meaning I had to re-run that branch's ingestion - and for datasets built on logic from the previous day, that carried too much overhead. I had to simulate ingestion with flags for the days I lacked. Annoying, to say the least.

Even when everyone has their own database or schema to work on, if it isn't kept up to date (which requires resources), it quickly turns into a stale and inaccurate representation of the data. You risk what I call the data catacombs (or data swamp, if you prefer): a place of musty old data, with many schemas named like 'CUSTOMERS_JOHN', 'CUSTOMERS_CLARE', or any other 'CUSTOMERS_your_colleague's_name_here'.

By contrast, the platform provides the automation and simplified processes you need to reach your destination faster and more easily. You get the developer experience you want - to, say, zero-copy clone instantly and automatically when branching out - in a clever, quick way. No need to wonder whether I've run ingestion, whether my data is valid, and so on. I know from experience that a well-administered project saves you a large amount of time, so you can turn your attention to the more fun stuff: interpreting data rather than administering it. And that's where the real value lies.
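For illustration only - the platform automates this step, and the object names below are invented - Snowflake's zero-copy clone is the mechanism that makes cheap per-branch environments possible. It creates a full, writable copy of a database from metadata alone, near-instantly, regardless of data volume:

```sql
-- Hypothetical names: PROD_DB is production, DEV_FEATURE_X is a
-- per-branch environment. The clone copies metadata, not storage,
-- so it completes in seconds even on very large databases.
CREATE DATABASE DEV_FEATURE_X CLONE PROD_DB;

-- Work against the clone; only micro-partitions you change consume
-- new storage (copy-on-write). Drop it when the branch is merged.
DROP DATABASE DEV_FEATURE_X;
```

Because nothing is physically copied up front, every branch can start from a fresh, current snapshot of production instead of a stale personal schema.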

I think our customers want answers to two questions: how quickly can we get a data product out, ready to be used by analysts, and how quickly can we onboard new team members? The platform helps you do both. This is about improving time-to-value without any compromise in the quality of the resulting data product, building something that is robust and can scale.

When an organization outgrows a small team, you're no longer in a simple Hello World environment, switching targets or copying schemas. And you face those data catacombs: dangling objects in your cloud instance where no one knows who made them, why they exist, or whether they're the final version - or the final-final one.

Based on my experiences and observations, there are three main requirements for an effective data engineering function. First, idempotence: running it multiple times should produce the same result as running it once, no matter when or for what purpose it's applied - it should always do the same thing. Second, it needs to be simple. We sometimes forget it's generally people reading our code and functions rather than machines; we spend most of our time reading code rather than writing it. And third, it should be well-documented, so everyone - engineers, product owners, business users - can understand what the data engineering function is for, and what it can do.
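To make the first requirement concrete, here is a minimal sketch (the function names and records are invented for illustration) contrasting an idempotent load - an upsert keyed on an ID - with a naive append. The upsert yields the same table whether it runs once or many times; the append does not:

```python
def load_idempotent(table: dict, batch: list[dict]) -> dict:
    """Upsert each record by its 'id' key: re-running the same
    batch leaves the table unchanged (idempotent)."""
    for record in batch:
        table[record["id"]] = record  # insert or overwrite, never duplicate
    return table

def load_append(table: list, batch: list[dict]) -> list:
    """Naive append: re-running the same batch duplicates rows."""
    table.extend(batch)
    return table

batch = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Run each load twice with the same batch, as a retry or re-run would.
upserted = load_idempotent(load_idempotent({}, batch), batch)
appended = load_append(load_append([], batch), batch)

print(len(upserted))  # 2 - same result no matter how often it runs
print(len(appended))  # 4 - duplicated rows after the second run
```

In SQL terms this is the difference between a MERGE keyed on a unique ID and a plain INSERT: the former is safe to retry, the latter silently corrupts the table on every re-run.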

#TrueDataOps enabled via the platform is an opportunity to develop that true data-driven approach. It’s a mindset, a culture to be shared via champions: people who are inspired by this approach, and who can articulate the many benefits to be gained. 

Mincho has been with the company since August 2022. He was previously an Analytics Engineer at Infinite Lambda, a Web Developer at Real Time Games, and an Area Sales Manager at AIGER Engineering Ltd. He studied at the University of National and World Economy. Connect with Mincho here:


Stay tuned for our next blog with CloudOps & DataOps Engineer Aleksandra Shumkoska, where she explains how ensuring a more effective developer experience can help you move closer to becoming a data-driven organization.