April 20, 2022
How I got started: a common scenario
I got started in data science and machine learning the way that many others do: with a series of model.fit()’s. That is, taking a flat file, applying an algorithm from a popular library, tuning it to the dataset, and reporting a standard set of performance metrics. I imagine many data scientists take a similar path into the field.
Over time, as I spent more time in the ecosystem, I realized that there are far more fundamental forms of data work that I lacked context on.
I joined the MLOps Community, where people discuss the practical challenges they face in machine learning and data work, and helped host the community’s podcast, MLOps Coffee Sessions. I read the engineering blogs of companies like Uber, Pinterest, and Google about the architecture that was crucial to operationalizing machine learning at scale. In particular, I worked with a company that was processing millions of text records a week. I came to appreciate the challenges of QAing, storing, transforming, and analyzing data at that scale in a way that met the business's need to innovate rapidly for its customers.
I had no ability to perform this kind of work with my core data science skill of calling model.fit(). I had no contextual understanding of modern data-intensive systems, let alone the ability to design and build them.
Since realizing this, I’ve been spending more time trying to understand the totality of the data stack, not just the modeling part. In many ways, it’s the classic maturation any data scientist or MLE goes through. It’s why Mihail Eric wrote a great article about why we need more data engineers, not data scientists.
It’s important for MLEs to understand the totality of the data stack.
MLEs and DSs are walking into organizations where data maturity, not machine learning models, is the bottleneck to becoming a more informed company. This is eloquently summarized by the Data Hierarchy of Needs. If you are not up to speed on the technologies that shape an organization's data workflow, you will find yourself in a position where your work cannot be as effective as it should be. That's exactly what I experienced, and I'm sure many others have experienced it too.
It’s not just me saying MLEs need to understand data maturity and tools. Shreya Shankar, a noted production ML expert, said as much in this episode of MLOps Coffee Sessions.
Her advice for aspiring MLEs? Learn how to set up a Postgres database and perform effective data management.
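To make that advice concrete, here is a minimal sketch of the data-management basics in question: a schema with constraints that rejects bad data at write time. It uses Python's standard-library sqlite3 module as a stand-in so it runs anywhere; against an actual Postgres database you would connect with a driver like psycopg2 instead, and the table and column names here are purely illustrative.

```python
import sqlite3

# sqlite3 stands in for Postgres so this sketch is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY,
        model_version TEXT NOT NULL,
        score REAL NOT NULL CHECK (score BETWEEN 0 AND 1),
        created_at TEXT NOT NULL
    )
""")

# Constraints catch bad data at write time, not weeks later at training time.
conn.execute(
    "INSERT INTO predictions (model_version, score, created_at) VALUES (?, ?, ?)",
    ("v1", 0.87, "2022-04-20"),
)
conn.commit()

rows = conn.execute("SELECT model_version, score FROM predictions").fetchall()
print(rows)  # [('v1', 0.87)]
```

The design point: a `CHECK` constraint like the one on `score` is the database doing QA for you, which is exactly the kind of leverage a model.fit()-centric education skips over.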
In this article, I will share some of the fundamental insights that helped me better understand the design of modern data-intensive systems.
These are things I wish I knew when I first got started in the world of data and machine learning.
This is an intro and survey, which will reference further resources you can read.
Cloud warehouses have changed the process of ingestion.
The first step in working with data is getting/ingesting data. If you’re an MLE, you’ve probably spent a lot of time working with informal sources of data like CSVs.
As you know, businesses don’t run on CSVs. But what do they really rely on instead?
The modern place to store and access data is the data warehouse. Examples include Google BigQuery, Snowflake, and Amazon Redshift. These warehouses are incredibly powerful distributed systems that let you store and query very large datasets very quickly. In the same way GPUs have allowed MLEs to train larger and larger models, data warehouses have allowed us to store and process larger and larger amounts of data. It’s crucial to understand the central role data warehouses play in helping businesses store and work with large quantities of data. In fact, warehouses are now so powerful that people are starting to apply machine learning directly at the warehouse layer (check out Continual); these kinds of things are not possible in traditional database systems.
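The working pattern with a warehouse is worth internalizing: you send SQL to where the data lives and pull back only a small result set. Here is a hedged sketch using sqlite3 as a local stand-in so it runs anywhere; in BigQuery or Snowflake the same `GROUP BY` pattern executes over billions of rows across many machines, and the table and data here are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for a cloud warehouse so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "purchase"), (2, "click"), (2, "click")],
)

# Aggregate inside the "warehouse" and pull back only the summary,
# rather than downloading raw rows to aggregate in Python.
result = conn.execute("""
    SELECT event, COUNT(*) AS n
    FROM events
    GROUP BY event
    ORDER BY n DESC
""").fetchall()
print(result)  # [('click', 3), ('purchase', 1)]
```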
For more great reads on the topic, I suggest:
Place all tools in the ELT context.
In the context of data work, you hear about a lot of tools: dbt, Airflow, Prefect, Fivetran, Airbyte, Starburst, Atlan, and on and on and on. It’s overwhelming to understand what all these tools do, especially when you don’t use them on a day-to-day basis. Just look at all these tools:
It feels like there is a new tool every day!
To stay sane, constantly put data tooling into the ELT (extract, load, transform) context. Whenever you hear about a new tool, ask yourself “where does this fit in the ELT flow?” This helps you contextualize the application of the tool, not just its features. I often find myself overwhelmed by all the feature-level descriptions data tools provide of themselves. By focusing on the notion of ELT and application, you can cut through the noise and better compare similar tools.
This is also important because the ELT context helps you understand how modern data tools differ from their predecessors. The modern data stack is by definition premised on the notion that ELT is different from ETL. The ELT paradigm heavily leverages the speed and power of modern data warehouses. It also empowers business users (e.g., analysts) to more easily control the transform layer of data ingestion via tools like dbt.
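The three letters can be sketched end to end. This is a toy illustration under loud assumptions: in a real stack, Extract and Load are handled by a tool like Fivetran or Airbyte, the warehouse is BigQuery/Snowflake/Redshift rather than the sqlite3 stand-in here, and the Transform step is a dbt model; the CSV data and table names are invented.

```python
import csv
import io
import sqlite3

# Extract: pull raw data from a source system (an inline CSV here).
raw = io.StringIO("order_id,amount\n1,9.99\n2,24.50\n3,5.00\n")
rows = list(csv.DictReader(raw))

# Load: land the data in the warehouse *untransformed* (ELT, not ETL).
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(r["order_id"], r["amount"]) for r in rows],
)

# Transform: model the data with SQL inside the warehouse (dbt's job),
# leaning on the warehouse's own compute power.
wh.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = wh.execute("SELECT ROUND(SUM(amount), 2) FROM orders").fetchone()[0]
print(total)
```

Note the ordering: the data lands raw first, and the transform is just more SQL run in the warehouse, which is what makes the transform layer accessible to analysts.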
For more great reads on the topic, I suggest:
- What is ELT?
- Data Orchestration—A Primer
- MLOps Community thread about ELT v. ETL
- Why ELT is the Future of Data Integration
There are three (maybe four) core roles in the data ecosystem.
As a machine learning engineer or scientist, you’re one of the newest, hottest professionals to hire on the market. You are also, however, working in the most amorphous role in the data ecosystem. Where do MLEs fit into the modern data team? What kinds of roles are even in the modern data team? You’ve probably heard a lot of discussion about data analysts, data engineers, data scientists, machine learning engineers, software engineers, even machine learning software engineers.
Thanks to Laszlo Sragner, I learned that there are currently three core types of data professionals:
- data engineers: They focus on applying software engineering best practices to data, often using Python and SQL. They are often engineering stakeholders.
- data analysts: They focus on applying business logic and understanding core and operational business questions using SQL. They are often business stakeholders.
- data scientists: They focus on modeling and applying advanced statistical techniques or understanding to data, often using Python or R. They are often product stakeholders.
As an advanced technology, machine learning requires uniquely qualified talent. Right now, productionizing a machine learning model requires operating across both the data and application stacks. As a result, the machine learning engineer title has evolved along a separate trajectory from other data professions.
As the data stack evolves, it’s likely that we continue to see evolution in the titles and roles on the data team. “Analytics engineer” has lately emerged as a new title, in a similar vein as machine learning engineer. I consider these newer titles to potentially be a fourth addition to the three core roles Laszlo identified. With all that said, MLEs walking into modern data teams should understand that their data counterparts fit into a defined organizational order. MLEs should understand where their work interacts with each of these data counterparts.
For more great reads on the topic of data roles, I suggest:
- The aforementioned Mihail Eric article
Conclusion and Final Thoughts
In the past few years, the ML ecosystem has evolved in parallel to the data ecosystem. This is because ML has emerged more out of academic contexts, while the data world has emerged more out of commercial contexts.
Whatever the reason, this can all be really confusing to the novice machine learning professional. I know I was really confused for large portions of my career about the differences between the data world and the ML world. I hope this gave you a little bit of context on how modern data-intensive systems have evolved.
Some parting thoughts for all you MLEs as I sign off:
- SQL is really important. As much fun as Python is to write, it will never do data manipulation with the performance of SQL.
- You’re only as good as your tools. Most of our time as data and ML professionals is spent working with a clear set of tools (e.g. TensorFlow or Fivetran). Picking a good set of tools and investing in them can yield huge long-term advantages for your productivity.
- A lot of the modern data stack content out there is really good content marketing. It’s actually educational and worth reading, but it’s also self-serving for the vendor trying to gain early adoption. Don’t let every new blog post about the latest and greatest tool make you feel like you should learn it; sticking with what works can yield great results!
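The SQL point in the parting thoughts can be shown side by side: the same aggregation done by shipping every row into Python versus asking the database to do it. On toy data both are instant, but at warehouse scale the SQL version wins because the work happens where the data lives. The table, columns, and data below are invented for illustration, with sqlite3 again standing in for a real database.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(1,), (1,), (2,)])

# Python way: ship every row across the wire, then count client-side.
in_python = Counter(uid for (uid,) in conn.execute("SELECT user_id FROM events"))

# SQL way: ship only the aggregated result.
in_sql = dict(conn.execute(
    "SELECT user_id, COUNT(*) FROM events GROUP BY user_id"
))
print(in_sql)  # {1: 2, 2: 1}
```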