AWS SageMaker tackles important problems, but can fall flat.

@November 8, 2020

Heads up! I wrote this article for a healthcare industry networking group I'm in. Because of the largely non-technical audience, I tried to make it very detailed. If you're an ML-experienced professional reading this, it might come across as excessively explanatory.

Amazon SageMaker is AWS's flagship product for machine learning, and probably one of the hottest ML tools currently out there.

I'm somebody who has owned its use internally, thought about how it fits into our architecture, and used it in multiple different contexts. I spend at least 10 hours a month with it, depending on the flow of our team's work.

I wanted to give you a little bit of info so that you can better understand the work I do as a machine learning engineer.

What is SageMaker exactly?

In the most basic sense, SageMaker strings together multiple services that Amazon Web Services provides. In healthcare terms, you could call it a bundle 😉 The services SageMaker bundles are used frequently in an ML context. SageMaker aims to turn those fragmented tools into a cohesive, packaged product that allows companies to train and monitor machine learning models quickly, scalably, and cost-effectively.

Why use it?

Let's start from the very basics of the problem. What is it you do when you want to train a machine learning model?

By training, what I mean is running code that specifies exactly what mathematical operations a machine learning model consists of. You have a good sense of the problem domain and the performance you need. All that's stopping you from returning predictions is turning that code into an object that has, loosely speaking, been taught the patterns in your data according to the math you specify.
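To make "training" concrete, here is a toy sketch in plain Python (no ML libraries). The tiny dataset and the one-parameter model are made up purely for illustration: the code repeatedly nudges a number `w` until the model `y = w * x` matches the pattern hidden in the data.

```python
# Toy "training" example: teach a one-parameter model y = w * x
# a pattern from a tiny, made-up dataset where the true rule is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0              # the model starts out knowing nothing
learning_rate = 0.05

for step in range(200):
    for x, y in data:
        prediction = w * x
        error = prediction - y
        # Nudge w downhill on the squared error (gradient descent)
        w -= learning_rate * 2 * error * x

# After training, the learned object (here just the number w)
# can return predictions: w * x for any new x.
print(round(w, 2))
```

Real models have millions of parameters instead of one, which is exactly why the computing resources discussed next start to matter.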

You might start by training your model on your own computer. However, in today's paradigm of data-intensive deep learning models, personal computers very quickly run out of juice. In that situation, you turn to the "cloud" (a term for renting remote computing resources quickly and efficiently). AWS is the foremost cloud provider.

AWS created SageMaker to be the product hub for all things machine learning, once you decide to turn to the cloud. You can train, monitor, and deploy models all on SageMaker, radically simplifying the cloud ML work that was needed prior to its introduction.

Who uses it?

In a nutshell, anyone touching machine learning: data scientists, ML engineers, ML scientists, and even data PMs.

For a lot of companies, employing machine learning can be tricky, because of all the people involved in building a production model. Data scientists and engineers think through the nature of the data, the availability, and the problems it can solve. Machine learning scientists think through research-y approaches to algorithm and model development, and how to mathematically set up problems the right way. Machine learning engineers think through how to train, productionize, and monitor models. Additional software engineering expertise may be needed for integration of the model's outputs into a larger product.

SageMaker is an attempt to harmonize these workforces and solve all of the confusion around deploying machine learning in one simple product. Any one of the aforementioned professionals can use SageMaker to quickly train models, and then perform an ever-growing list of interesting things with those models.

How to use it?

How is a really big question for this ambitious product. Let me quickly summarize the main tools offered, described as part of the overall ML development cycle.

  1. Label: An annotation service, called SageMaker Ground Truth, that allows you to quickly gather labels from a group of workers (e.g. Mechanical Turk) and store that data in a structured format that is easy to parse (in AWS S3).
  2. Build: A series of interactive environments, such as SageMaker Studio and SageMaker Notebooks, that offer machine learning-specific tools and computing resources.
  3. Train and Tune: This can be kind of a pain (I wrote another blog post about it here), but you can use a series of AWS services (e.g. AWS Elastic Container Service) to take the model you built in the previous step, and train and optimize it.
  4. Deploy & Manage: Finally, once you have a model, you can serve it in a "microservice"-oriented way through AWS.
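To give a flavor of steps 3 and 4, here is a rough sketch using the SageMaker Python SDK. This assumes you already have AWS credentials configured, an IAM role for SageMaker, and training data in S3; the role ARN, bucket path, and the `train.py` script name are all hypothetical placeholders, not real resources.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical IAM role

# Train: SageMaker spins up a managed instance, runs your script on it,
# saves the resulting model artifact to S3, and tears the instance down.
estimator = SKLearn(
    entry_point="train.py",        # your training script (hypothetical name)
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="0.23-1",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/training-data/"})  # hypothetical path

# Deploy & Manage: serve the trained model behind an HTTPS endpoint,
# in the "microservice"-oriented way described above.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
print(predictor.predict([[0.5, 1.2]]))  # hypothetical input row
```

The point is less the specific calls than the shape of the workflow: a handful of lines stand in for what used to be a lot of hand-rolled infrastructure.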

When to use it?

This is probably the toughest question for me. When does it make sense to adopt SageMaker thoroughly and depend on it for your architecture? There's a lot of discussion about this in the industry forums I follow.

At the simplest level, you should use SageMaker whenever you have a model training job. However, the rap against AWS services is that they can be really easy to set up, but a pain in the ass to customize, integrate, or manage later. Because of this, at my company, we try to use it in a piecemeal way that doesn't result in infrastructure-level lock-in to AWS yet.

As an example, right now I'm doing an annotation study with SageMaker's Ground Truth labeling service. It took me 15 minutes to set up a labeling study with an optometrist, from whom I'm getting labels on images. That's huge! Furthermore, I have that data in exactly the format I need for analysis and model training. At the same time, though, I consistently run into roadblocks in terms of task type (e.g. if I want my annotations structured in a way SageMaker doesn't currently provide for) or in terms of custom workforces (e.g. the way to notify non-AWS workers of jobs sucks).
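To show what "exactly the format I need" looks like: Ground Truth writes its results to S3 as an "output manifest," with one JSON object per line. The exact fields vary by task type; the snippet below parses a simplified, made-up image-classification manifest, where the label attribute name ("eye-labels"), the class names, and the S3 paths are all hypothetical.

```python
import json

# Two made-up lines mimicking a Ground Truth output manifest (JSON Lines).
manifest = """\
{"source-ref": "s3://my-bucket/images/001.jpg", "eye-labels": 0, "eye-labels-metadata": {"class-name": "healthy", "confidence": 0.94}}
{"source-ref": "s3://my-bucket/images/002.jpg", "eye-labels": 1, "eye-labels-metadata": {"class-name": "glaucoma", "confidence": 0.81}}
"""

# Pull out (image, label) pairs, ready for analysis or model training.
labels = []
for line in manifest.splitlines():
    record = json.loads(line)
    labels.append((record["source-ref"],
                   record["eye-labels-metadata"]["class-name"]))

print(labels)
```

Because each line is self-describing JSON, feeding these pairs straight into a training job takes a few lines of code rather than a custom parsing project.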

This feeds well into the overall pros and cons of the product.

Pros and Cons

Pro 1: Simplicity

As an ML engineer, I feel that there is an explosion of tools, products, and frameworks for my work. Managing a diverse set of tools and making them work together is not easy. I think anyone, if they could, would want to rely on a core set of tools.

SageMaker aims to solve this problem completely. Many ML teams already interface with AWS, as it is the foremost cloud provider, and AWS wants to make SageMaker the natural home for the emerging software process of ML development and engineering. The result is a process-level simplicity that is really attractive for architects and engineers.

Pro 2: Speed

With SageMaker, you can go end-to-end really fast. Most importantly, that's true even as a novice user. When I started with SageMaker, I was not very familiar with AWS's services, yet in a matter of a few days I was able to get models training and deploying. That level of speed is incredibly attractive for experimenting and iterating on machine learning models.

Pro 3: Comprehensive

As the "how" section of this article shows, SageMaker packs a lot of things into one product. It's nearly comprehensive for the entire machine learning process. In fact, that comprehensive nature actually helps companies like ours do things better and ensure all of our methods follow baseline best practices.

Con 1: Lack of Flexibility

Sometimes, SageMaker feels like a tool designed for companies that want to get started with machine learning but don't necessarily have any background in it. It's not necessarily a tool for companies that already have perspective on, and knowledge of, the kinds of machine learning problems and processes they want to adapt to their own domain.

This is because it routinely sacrifices flexibility in favor of speed in performing basic operations. For example, while training a standard ML algorithm can be really easy, doing your own custom training can be a complete pain, because the SageMaker APIs for it are underbaked and poorly documented.

Additionally, SageMaker's rush to be comprehensive rather than well-documented hints at an adoption strategy focused on IT and engineering architects, rather than on developers themselves. Those are the people who tend to make such decisions at bigger companies.

Con 2: Unclear Costs

Keeping the costs associated with machine learning down is a crucial objective. Managing costs with SageMaker as your needs scale is not intuitive, particularly given how it's a wrapper around other AWS services. I'm not sure it's a very cost-effective solution in the long run for any company that plans to seriously spend money on building machine learning models.

Con 3: Lack of Community

Community is crucial for the uptake of a technical product. Oftentimes, issues you encounter with a product are solved by the community (e.g. in Stack Overflow threads), not in official documentation. Many other machine learning products have thriving communities: DVC, TensorFlow, etc. Comparatively, I've been a little disappointed in the quality of the community output for SageMaker. Most of the people in GitHub threads and other forums are SageMaker employees themselves, which is not the same as a diverse, thriving community applying a tool to novel problem areas and generating knowledge.