📚

What reads impacted my ML Engineering journey most?

April 26, 2022

💡
TL;DR: This is a list of must-read articles for anyone interested in building machine learning systems. They have deeply influenced my perspective on machine learning engineering.

Why I’m writing this

Inspired by Ben Kuhn’s Essays on programming I think about a lot, I put together a list of the most influential reads on my journey as a machine learning engineer. I would highly recommend all of these essays as must-reads for MLEs at any stage of their career.

Before I begin, a point on the nature of the works included in this list.

There are timely reads and timeless reads. Timely reads are knowledge; timeless reads are wisdom. Throughout my MLE career, I’ve been bombarded with information and knowledge, much of which has been merely timely. I don’t consider these to be worth re-reading and truly absorbing. This list is heavily biased towards timeless reads. The lessons they contain are wise enough to stand up for years (as many of them already have). The insights I apply to elevate my approach to work over the long term come from these timeless reads.

Without further ado and in no particular order, here’s a list of my top essays about MLE that I think about a lot. For each one, I’ll cover its core insight, the context in which it was written, and how I apply its lessons in my work.

150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com

Link: https://blog.kevinhu.me/2021/04/25/25-Paper-Reading-Booking.com-Experiences/bernardi2019.pdf

Booking.com is one of the world’s largest travel businesses. In this 2019 paper, the team at Booking summarized 6 core lessons from applying machine learning at scale.

We conducted an analysis on about 150 successful customer facing applications of Machine Learning, developed by dozens of teams in Booking.com, exposed to hundreds of millions of users worldwide and validated through rigorous Randomized Controlled Trials... Our main conclusion is that an iterative, hypothesis driven process, integrated with other disciplines was fundamental to build 150 successful products enabled by Machine Learning.

The paper is worth reading in its entirety. What it taught me most about are the challenges of scaling machine learning models (not just building them) and how a thoughtful, company-wide approach to machine learning systems is a huge competitive advantage.

I got a chance to interview one of the paper’s authors, Pablo Estevez, on an episode of MLOps Coffee Sessions.

Machine Learning: The High Interest Credit Card of Technical Debt

Link: https://research.google/pubs/pub43146/

Perhaps the single most famous paper on this list, this 2014 Google paper could be considered the intellectual fountainhead of the modern MLE profession. Sculley et al. identify problems that machine learning’s entropic nature imposes on software systems, problems that still exist today.

One of the basic arguments in this paper is that machine learning packages have all the basic code complexity issues as normal code, but also have a larger system-level complexity that can create hidden debt... we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate.

This paper taught me how to think about machine learning systems in a rigorous, almost-mathematical way. It also showed me how fundamental some of the problems I worked on as an MLE were (e.g. data drift and testing) and how crucial adopting best practices early on was to the long-term success of a project.

I got a chance to interview D. Sculley, the lead author on this paper, for the MLOps Coffee Sessions podcast.

The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

Link: https://research.google/pubs/pub46555/

This paper was also authored by Sculley, and I love how practical it is. At a time when few others were concerned with the large-scale application of machine learning systems, Sculley et al. articulate an extremely clear way to assess a system’s production readiness and pay down its ML technical debt.

In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt.

This paper taught me how to apply the engineering mindset to machine learning: through rigorous, identifiable, and documented processes and techniques. Engineering is really a craftsman’s discipline, and this paper shows how to bring that ethos to the machine learning context. It also supplied me with a rubric to grade myself against and measure progress with.
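To make that concrete, here is a minimal sketch of how I encode a rubric like this to grade a project. The scoring scheme (0 / 0.5 / 1 per test depending on whether it is skipped, manual, or automated, with the overall score capped by the weakest section) follows my reading of the paper; the individual test names below are paraphrased examples, not the paper’s full list of 28.

```python
from enum import Enum

class TestStatus(Enum):
    NOT_DONE = 0.0   # test not performed
    MANUAL = 0.5     # performed by hand, results documented
    AUTOMATED = 1.0  # automated and run repeatedly (e.g. in CI or on a schedule)

# Illustrative checklist grouped the way the paper groups its tests
# (data, model development, infrastructure, monitoring). The test names
# are paraphrased examples, not the paper's full list of 28.
rubric = {
    "data": {
        "feature_expectations_captured": TestStatus.MANUAL,
        "features_and_privacy_reviewed": TestStatus.NOT_DONE,
    },
    "model": {
        "compared_against_simple_baseline": TestStatus.AUTOMATED,
        "hyperparameters_tuned_systematically": TestStatus.MANUAL,
    },
    "infrastructure": {
        "training_is_reproducible": TestStatus.MANUAL,
        "full_pipeline_integration_tested": TestStatus.NOT_DONE,
    },
    "monitoring": {
        "training_serving_skew_monitored": TestStatus.NOT_DONE,
        "prediction_quality_alerts_in_place": TestStatus.NOT_DONE,
    },
}

def section_score(tests: dict) -> float:
    """Points earned within one section of the rubric."""
    return sum(status.value for status in tests.values())

def ml_test_score(rubric: dict) -> float:
    """Overall readiness is capped by the weakest section (minimum section score)."""
    return min(section_score(tests) for tests in rubric.values())

if __name__ == "__main__":
    for section, tests in rubric.items():
        print(f"{section:15s} {section_score(tests):.1f}")
    print(f"overall score   {ml_test_score(rubric):.1f}")
```

Even a toy version like this changes the conversation with stakeholders: instead of arguing about whether a system is “production ready,” you can point to the weakest section and the specific tests that would raise the score.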

Software 2.0

Link: https://karpathy.medium.com/software-2-0-a64152b37c35

Karpathy had long been a thought leader of note to me, but this was the first blog post by him I saw that highlighted the engineering opportunities of machine learning. Reading this made me think: “Wait, there’s more to machine learning than just the modeling...”.

Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software. They are Software 2.0.... Software 2.0 is written in much more abstract, human unfriendly language, such as the weights of a neural network.

I learned about the opportunity presented by the systematic application of machine learning. It planted the first seed in my head that machine learning engineering would be a field of interest to me. The essay also highlights core challenges of machine learning engineering that still haven’t been fully solved: in a coding paradigm where data and code are both first-class citizens, what tools and processes are appropriate?

Meet Michelangelo: Uber’s Machine Learning Platform

Link: https://eng.uber.com/michelangelo-machine-learning-platform/

It’s hard to believe this blog post was authored in 2017. This piece, which describes Uber’s machine learning platform, Michelangelo, was written at the peak of Uber’s hypergrowth, just as excitement around machine learning was reaching a fever pitch. The Michelangelo blog post showed that machine learning had arrived as a business-ready technology.

We are increasingly investing in artificial intelligence (AI) and machine learning (ML) to fulfill this vision. At Uber, our contribution to this space is Michelangelo, an internal ML-as-a-service platform that democratizes machine learning and makes scaling AI to meet the needs of business as easy as requesting a ride.

I learned that I wanted to work on machine learning platforms after reading this piece. A lightbulb went off for me as a product-minded engineer: I didn’t just want to solve individual applications of machine learning, I wanted to enable the entirety of a business’s value through the thoughtful design and implementation of machine learning platforms. This piece showed me what such platforms do and what their architecture and impact look like.

Rules of Machine Learning: Best Practices for ML Engineering

Link: https://developers.google.com/machine-learning/guides/rules-of-ml

Google is the most influential company in the machine learning engineering realm. With their core products (search and recommendation systems) being so machine learning-driven, they have experienced more of the challenges of scaling machine learning than any other company out there. Importantly, they’ve been extremely transparent about sharing their best practices. The Rules of Machine Learning is one of Google’s earlier attempts at doing so.

To make great products, do machine learning like the great engineer you are, not like the great machine learning expert you aren’t... Most of the problems you will face are, in fact, engineering problems.

The Rules of Machine Learning distilled for me the core tenets of machine learning engineering. As I’ve said, it’s easy to get lost in models, tools, frameworks, etc. The Rules of Machine Learning simplified the entire process of building ML-driven products into a set of commandments that I was able to follow. It also had the Google stamp of approval, which helped me really buy into simplicity and focus as essential to building better machine learning systems.

MLOps: Continuous delivery and automation pipelines in machine learning

Link: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

I swear I am not a Google shill; they just consistently put out the best machine learning content in the world because it is informed by their experience. This piece was written by teams at Google Cloud, which has a different focus than Google proper: Cloud builds services for others, while Google mostly builds services for itself. This was the first article that showed me what MLOps is.

This document is for data scientists and ML engineers who want to apply DevOps principles to ML systems (MLOps). MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment and infrastructure management.

This article introduced me to the concept of system maturity, which helped me better architect systems and communicate progress to stakeholders. It challenged me to build systems end-to-end first and then improve them iteratively after delivering initial value. It was also the first article I read that rigorously categorized production machine learning systems in their entirety and described their operation.
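As a rough illustration of the maturity idea, here is how I sketch the article’s three levels when assessing a system. The one-line descriptions are my paraphrase, not Google’s wording, and the helper is just a convenience for my own notes.

```python
# My paraphrase of the MLOps maturity levels described in the article;
# the wording of each description is mine, not Google's.
MLOPS_MATURITY_LEVELS = {
    0: "Manual process: notebook-driven experiments, hand-off of a model artifact to engineers.",
    1: "ML pipeline automation: continuous training on fresh data through an automated pipeline.",
    2: "CI/CD pipeline automation: the pipeline itself is built, tested, and deployed automatically.",
}

def describe_maturity(level: int) -> str:
    """Return the description for a maturity level, clamped to the known range."""
    clamped = max(0, min(level, max(MLOPS_MATURITY_LEVELS)))
    return f"Level {clamped}: {MLOPS_MATURITY_LEVELS[clamped]}"

if __name__ == "__main__":
    print(describe_maturity(1))
```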

Continuous Delivery for Machine Learning

Link: https://martinfowler.com/articles/cd4ml.html

Martin Fowler writes and edits some of the most thoughtful, detailed content on software development. This piece on his blog applies his practical approach to software development to the context of machine learning. It follows an end-to-end example of building a machine learning system with a continuous delivery mindset (the crucial cultural element of DevOps).

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

I learned from this article what the three components of machine learning systems are: code, data, and model. Every challenge and opportunity that follows stems from the uniqueness of each of these components. I can’t tell you how many times I’ve solved problems or communicated progress by returning to this framework. It also supplied a crucial system-level example of how to apply software engineering techniques to machine learning, one I referred back to often.
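Here is a toy sketch of how I use that framework: one reproducible release pins a version of each of the three components together. The identifiers below are made up for illustration; this is not CD4ML’s actual tooling, just the mental model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MLRelease:
    """One reproducible release of an ML system: code, data, and model pinned together."""
    code_version: str   # e.g. a git commit SHA (hypothetical value below)
    data_version: str   # e.g. a dataset snapshot or DVC tag (hypothetical)
    model_version: str  # e.g. a model registry entry (hypothetical)

# Example release record; every identifier here is illustrative.
release = MLRelease(
    code_version="git:3f9a2c1",
    data_version="dvc:training-set-2022-04",
    model_version="registry:ranking-model/v12",
)

print(release)
```

When something breaks in production, asking which of the three versions changed is usually the fastest way to localize the problem.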

Ways to think about machine learning

Benedict Evans is a technology industry observer. He writes excellent content on major shifts in tech (mobile, the web, e-commerce) and their business implications. This was one of the first pieces I read about the business implications of machine learning, and it stuck with me.

So, this is a good grounding way to think about ML today - it’s a step change in what we can do with computers, and that will be part of many different products for many different companies. Eventually, pretty much everything will have ML somewhere inside and no-one will care.

This article gave me my first hint of what machine learning practically means for business. I learned that machine learning is not a fantastical technology, and likely never will be. Rather, it is a “step change” in the speed and scale of the industrial automation and enablement that has been happening over the last century, with a new probabilistic component added. Machine learning will be boring and everywhere.

The deployment phase of machine learning

Link: https://www.ben-evans.com/benedictevans/2019/10/4/machine-learning-deployment

Shout out again to Ben Evans. He and D. Sculley are the only authors cited more than once in this piece. The prior piece opened my mind to the business implications of machine learning; this piece clarified them.

ML is the new SQL... if you want to know ‘what’s our AI strategy?’ or ‘how do we choose an AI vendor?’, the answer is, well, how did you choose a cloud vendor or a SaaS Vendor, and how did you identify opportunities for databases?

This article definitively pulled me out of the abstract realm of AI hype and into the concrete realm of driving business value from machine learning. ML really is the new SQL; that’s why data warehouses and the modern data stack are so crucial to realizing its value at scale. It’s just as boring, fundamental, and cross-functional as SQL. I love this article and I cite its core insight all the time.

Conclusion

I’ll keep adding to this piece over time. Is there anything else you’d add to this list as must-reads for all MLEs? Let me know!

Appendix: Other Helpful Content

This is a list of other articles or websites I’ve found helpful and am considering for inclusion at some point.

  1. https://arxiv.org/abs/2011.03395
  2. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf
  3. https://www.microsoft.com/en-us/research/uploads/prod/2019/03/amershi-icse-2019_Software_Engineering_for_Machine_Learning.pdf
  4. https://fullstackdeeplearning.com/spring2021/
  5. Chip Huyen
  6. https://madewithml.com/#mlops