Custom Algorithm Training on SageMaker Using Docker


If you are trying to train custom algorithms (e.g. TensorFlow-defined models) on SageMaker, then you need to use Docker. Using Docker with SageMaker comes with a few challenges, primarily because the documentation leaves some things implicit. The overall documentation on SageMaker and Docker, however, is definitely a good start (link here).

The best place to start, after initial reading, is to run the "train your own custom algorithms" notebook on AWS. You can log in to SageMaker and open this tutorial directly. If you start with that link and run through all the code on SageMaker, you can quickly get an example running on CIFAR-10.

I suggest completing the tutorial before referring back to this guide for a more complete discussion of the issues in the tutorial. Furthermore, I suggest using this SageMaker Workshop link as an additional guide on how containers for SageMaker work.

Step 1: Understand Dockerization and Push to ECR

The first step is to understand Dockerization. Docker is a useful tool for running code with all of its dependencies in an environment that is separate from the native environment. It works by starting an isolated process, called a container, whose filesystem is re-created from the snapshot packaged in the relevant image. Unlike a full virtual machine, a container on a Linux host shares the host's kernel.

Importantly, what you include in a Docker image and share is ONLY code, not data.

After creating an image, the next step is to push it to ECR. This is simple with the AWS CLI, and AWS provides a sample shell script to do so in its tutorial.
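Once the image is pushed, everything downstream refers to it by its ECR URI. Here is a minimal sketch of how that URI is put together, assuming a standard commercial AWS region. The function name, account ID, region, and repository name are all hypothetical placeholders.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the URI under which an image is addressed in ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical values, for illustration only.
image_uri = ecr_image_uri("123456789012", "us-west-2", "my-custom-algo")
print(image_uri)
```

This is the same string that you tag the local image with before running docker push.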

Step 2: Run Training Code

How does SageMaker run training?

Now that our image is on ECR, we use the SageMaker Python SDK to set up the calls to run the code in the image and actually start the training process.

The SageMaker Python SDK is worth understanding. Once you understand the various objects that the Python SDK creates in the process of training (e.g. an Estimator object), the process becomes a lot more intuitive to work with.

Ultimately, using SageMaker's objects and methods, you're going to call the Docker image that you pushed to ECR, run it as a container, and train your algorithm.
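The flow above can be sketched roughly as follows. This assumes version 2 of the SageMaker Python SDK, where the Estimator arguments are named image_uri, instance_count, and instance_type (version 1 used image_name, train_instance_count, and train_instance_type). The helper names, role ARN, bucket, instance type, and the "training" channel name are hypothetical, and launch_training only works with the sagemaker package installed and AWS credentials configured.

```python
def estimator_kwargs(image_uri, role_arn, output_s3_uri,
                     instance_type="ml.m5.xlarge", instance_count=1):
    """Collect the arguments for sagemaker.estimator.Estimator (SDK v2 names)."""
    return {
        "image_uri": image_uri,          # the image pushed to ECR
        "role": role_arn,                # IAM role that SageMaker assumes
        "instance_count": instance_count,
        "instance_type": instance_type,
        "output_path": output_s3_uri,    # where the model artifact is written
    }


def launch_training(kwargs, training_data_s3_uri):
    """Start a training job; SageMaker runs the image in train mode."""
    # Imported lazily so the rest of the sketch runs without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(**kwargs)
    estimator.fit({"training": training_data_s3_uri})
    return estimator


# Hypothetical values, for illustration only.
kwargs = estimator_kwargs(
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-algo:latest",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",
    output_s3_uri="s3://sagemaker-example-bucket/output",
)
```

Calling launch_training(kwargs, "s3://sagemaker-example-bucket/data") would then start the job.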

Importantly, SageMaker runs a Docker container in two different modes: train and serve. It does this by passing train or serve as the command when it starts the container, which overrides ANY default command defined in the image. To further clarify, SageMaker pulls the image from ECR and runs one of the following commands:

docker run <image name> train
docker run <image name> serve

This is crucial to understand because, for these commands to work, SageMaker requires the container to have executable files called train and serve, depending on which use case you are working on. AWS provides sample scripts for these two commands, which are basically command line interface scripts written in Python that call your training algorithm code.
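To make the train script more concrete, here is a minimal sketch of what such a file might contain. It relies on the directory layout SageMaker sets up inside the container under /opt/ml (hyperparameters in input/config, data in input/data/<channel>, the model in model, and failure details in output/failure). The "training" channel name, the epochs hyperparameter, and the placeholder "model" are assumptions for illustration, not the AWS sample code.

```python
import json
import os
import traceback


def train(prefix="/opt/ml"):
    """Run one training job using SageMaker's container directory layout."""
    hyperparams_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    data_dir = os.path.join(prefix, "input", "data", "training")
    model_dir = os.path.join(prefix, "model")
    failure_path = os.path.join(prefix, "output", "failure")
    try:
        with open(hyperparams_path) as f:
            hyperparams = json.load(f)  # SageMaker passes every value as a string
        epochs = int(hyperparams.get("epochs", "1"))
        files = sorted(os.listdir(data_dir))

        # Placeholder for the real training code.
        model = {"epochs": epochs, "trained_on": files}

        # Anything written to the model directory becomes the model artifact.
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "model.json"), "w") as f:
            json.dump(model, f)
        return 0
    except Exception:
        # Whatever is written to the failure file is reported for the failed job.
        os.makedirs(os.path.dirname(failure_path), exist_ok=True)
        with open(failure_path, "w") as f:
            f.write(traceback.format_exc())
        return 1
```

In the container, the file would be named train, be executable, and end with a call such as sys.exit(train()). Taking the prefix as a parameter makes it easy to try the script against a temporary directory on your own machine first.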

Train with either local data or S3 data

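In the tutorial linked above, you run Dockerized training code on both local data and data located in S3. This process is pretty self-explanatory; the only addendum I will make is about bucket names. With the default AmazonSageMakerFullAccess policy, the execution role can only access S3 buckets whose names contain the word "sagemaker". To use any other bucket, you need to grant the role S3 permissions for that bucket explicitly.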
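The bucket-naming constraint above can be checked before a job is launched, which saves waiting for a training job to fail on an access error. A minimal sketch; the function name and the bucket names are hypothetical.

```python
from urllib.parse import urlparse


def uses_sagemaker_bucket(s3_uri: str) -> bool:
    """Return True if the bucket name in an s3:// URI contains 'sagemaker'."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {s3_uri}")
    bucket = parsed.netloc  # for s3://bucket/key, the bucket is the network location
    return "sagemaker" in bucket.lower()


# Hypothetical URIs, for illustration only.
ok = uses_sagemaker_bucket("s3://sagemaker-example-bucket/cifar10/train")
not_ok = uses_sagemaker_bucket("s3://my-data/cifar10/train")
```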

Step 3: Set up an endpoint to predict

After the previous step in the tutorial, you should have a trained model file sitting in a folder in your SageMaker S3 bucket. This is called a model artifact.
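SageMaker packages whatever the training container wrote to its model directory as model.tar.gz and uploads it under the output path of the job. A small sketch of where to look for it; the function name, bucket, and job name are hypothetical.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the S3 location of the artifact written by a training job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Hypothetical values, for illustration only.
artifact = model_artifact_uri("s3://sagemaker-example-bucket/output/", "my-training-job")
print(artifact)
```

If you trained through the Python SDK, the same location should also be available on the estimator as its model_data attribute once the job has finished.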

Using this model artifact, we want to set up an endpoint to allow for inference.

There are four different ways that SageMaker allows trained model artifacts to be served as endpoints:

  1. Using the SageMaker TensorFlow Serving Model object in the Python SDK
    1. This is the recommended method. However, it is paramount that you specify the TensorFlow framework_version argument when defining the object.
  2. Using the SageMaker TensorFlowModel object in the Python SDK
    1. This is NOT a recommended method. It is not based on Python 3. SageMaker's documentation is unclear and sometimes recommends this method. See this thread for an example of the error you get when the framework version or Python version does not match that of your trained artifact.
  3. Using Docker images
    1. As we discussed earlier, a Docker container on SageMaker can be run in either train or serve mode. Using the serve.py provided by AWS, you can create a simple web server to send requests to. This is NOT a recommended method, because it is not as seamlessly integrated with Python. You have to convert the model's input and output data to and from the HTTP request format yourself.
  4. Using AWS Lambda
    1. I don't know too much about this, but here is an AWS guide to do so.
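To illustrate the Docker image option and the request handling it involves: a container in serve mode has to answer GET /ping (the health check) and POST /invocations (the predictions) on port 8080. Below is a minimal standard-library sketch of that contract. The JSON shape with "instances" and "predictions" keys mirrors the TensorFlow Serving REST convention and is only an assumption here, since a custom container is free to choose its own format. The predict function is a placeholder, and this is not the server that AWS ships in its sample.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(instances):
    """Placeholder model: returns the sum of each input row."""
    return [sum(row) for row in instances]


def handle_invocation(body: bytes):
    """Turn a JSON request body into (status code, JSON response body)."""
    try:
        instances = json.loads(body)["instances"]
        result = {"predictions": predict(instances)}
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError) as err:
        return 400, json.dumps({"error": str(err)}).encode("utf-8")


class Handler(BaseHTTPRequestHandler):
    def _send(self, status, payload=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        # SageMaker polls /ping to decide whether the container is healthy.
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_invocation(self.rfile.read(length))
        self._send(status, payload)


def run_server(port=8080):
    """Blocks forever; this is what a serve script would call."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the parsing in handle_invocation, separate from the HTTP handler, means the input and output conversion can be tested without starting a server.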

Notes on SageMaker Deployment

SageMaker works with TensorFlow Serving to set up model-based inference. TensorFlow Serving is a serving system that exposes trained machine learning models over REST and gRPC APIs.

To use TensorFlow Serving properly, your model must be saved in the TensorFlow SavedModel format. As of TF 2.0, SavedModel is the new default format for models; the Keras h5 format is older and will be phased out in future versions of TensorFlow.

If your model is written in pure Keras and saved in the h5 format, you can follow this AWS guide to convert it to a TensorFlow SavedModel and serve it. If your model is in pure TensorFlow, you can use the SageMaker TensorFlow Estimator class to train and serve your model.

I suggest not using TensorFlow Keras to write your model if you would like to deploy it using TensorFlow Serving. TF Keras does not fully benefit from the simplicity of Keras. It also does not have the same tight integration for training and inference that pure TensorFlow has. It's basically neither here nor there.

For TensorFlow Serving, the model has to be saved in a specific location. SageMaker requires the folder 'export/Servo', and TensorFlow Serving requires a numbered version folder below 'export/Servo', e.g. 'export/Servo/1'. "The save-path follows a convention used by TensorFlow Serving where the last path component (1/ here) is a version number for your model - it allows tools like Tensorflow Serving to reason about the relative freshness." Let us save the model to the right file path and then use it.
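Here is a minimal sketch of that path convention, together with one possible way of exporting a Keras h5 file into it. The function names are hypothetical. The export function assumes TensorFlow 2.x is installed and uses tf.keras.models.load_model and tf.saved_model.save; it is not the method from the AWS guide mentioned above.

```python
import os


def servo_export_dir(model_dir: str, version: int = 1) -> str:
    """Return <model_dir>/export/Servo/<version>, the layout described above."""
    return os.path.join(model_dir, "export", "Servo", str(version))


def export_keras_h5(h5_path: str, model_dir: str, version: int = 1) -> str:
    """Load a Keras h5 model and write it out as a SavedModel (assumes TF 2.x)."""
    # Imported lazily so the path helper works without TensorFlow installed.
    import tensorflow as tf

    model = tf.keras.models.load_model(h5_path)
    export_dir = servo_export_dir(model_dir, version)
    tf.saved_model.save(model, export_dir)
    return export_dir


# Inside a training container, the model directory is /opt/ml/model.
print(servo_export_dir("/opt/ml/model"))
```

Bumping the version argument on each export keeps older versions side by side, which is what lets TensorFlow Serving pick the freshest one.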


  1. GitHub sample notebook for deploying TensorFlow code in container on SageMaker
  2. SageMaker guide to containers
  3. How to deploy Keras h5 models to TensorFlow serving
  4. SageMaker namespacing docs
  5. Building fully custom machine learning models on AWS SageMaker: a practical guide
  6. Brewing up custom ML models on AWS SageMaker