Docker Best Practices for Machine Learning
Arslan Ashraf
March 2024
Docker is an extremely important tool for modern software development. What makes Docker so powerful is that it solves the "but it works on my machine" problem. Docker allows us to take a set of programs built in one specific environment and run it in a completely different environment without dependency conflicts.
However, powerful tools often come with a learning curve. What we want to explore in this guide are some of the best practices for building and running Docker images. We want to touch on some of the best practices for keeping our images small, secure, reproducible, and built off of stable, secure base images.
Best practice - choose the right base image. We want base images that roughly meet the following criteria:
- Stability - the base image must have a basic set of widely used packages and stable infrastructure to build off of.
- Security updates - a base image that gets regular security patches.
- Updated dependencies - the base image needs to have updated packages that it depends on, such as compilers.
- Small size - smaller images are better: they are faster to build, faster to pull and run, and present a smaller attack surface.
Debian-based official Python images from Docker (the company) are a good choice. We will avoid Alpine Linux for the reasons discussed by Itamar Turner-Trauring [1] and Martin Heinz [2]. Furthermore, one should always specify the exact tag for an image and never depend on the default tag.
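As a sketch, pinning a Debian-based official Python image might look like this (the specific tag below is an example; choose the exact version your project needs):

```dockerfile
# Pin an exact, Debian-based official Python tag rather than relying
# on the default "latest" tag, which changes over time.
FROM python:3.11.8-slim-bookworm
```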
Best practice - batch together consecutive RUN instructions. To understand why, we need to know how Docker builds an image. Docker builds images additively as a stack of layers, where each instruction in the Dockerfile (FROM, COPY, RUN, etc.) produces a layer on top of the previous ones. More layers generally mean a more bloated image. This is why it's best to batch commands together when possible.
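A minimal sketch of batching, using an apt-get install as the example:

```dockerfile
# Bad: three RUN instructions produce three layers, and the apt cache
# deleted in the last one still persists in the earlier layers.
# RUN apt-get update
# RUN apt-get install -y --no-install-recommends git
# RUN rm -rf /var/lib/apt/lists/*

# Better: one RUN, one layer; the cleanup happens in the same layer,
# so the apt cache never ends up in the image at all.
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*
```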
Best practice - remove unnecessary files added to the image. As mentioned before, we want to keep images as small as possible, so we want to clean up and remove any unnecessary files, such as tar or zip files that were downloaded and then extracted.
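For example (the dataset URL below is hypothetical), the download, extraction, and deletion should all happen in the same RUN instruction so the archive never persists in any layer:

```dockerfile
# Hypothetical dataset URL for illustration. Because the rm runs in
# the same layer as the download, the tarball adds no size to the image.
RUN curl -fsSL -o /tmp/data.tar.gz https://example.com/data.tar.gz && \
    mkdir -p /opt/data && \
    tar -xzf /tmp/data.tar.gz -C /opt/data && \
    rm /tmp/data.tar.gz
```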
Best practice - place frequently changing instructions towards the bottom. To see why, we need to understand Docker layer caching. Docker builds images from top to bottom, and when an image is rebuilt, Docker reuses as many layers as it can from its local cache. But if a layer changes, that change forces a rebuild of this layer and all layers below it. Hence, it's a best practice to place frequently changing files, such as training code, as far down in the Dockerfile as possible.
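A common ordering for a Python ML project (file names here are illustrative) copies the rarely changing dependency list first and the frequently edited code last:

```dockerfile
# Dependencies change rarely: copy and install them first so the
# expensive pip layer stays cached across rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Training code changes often: copy it last so an edit only
# invalidates this final layer, not the dependency install above.
COPY train.py .
```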
Best practice - switch to a non-root user after installing packages. Docker runs on Linux, and if an attacker gains access to a container running as root, they may be able to escape the container and reach the underlying operating system as root. This serious security threat is best avoided by running as a non-root user with very limited or no privileges.
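A minimal sketch (the user name is arbitrary): package installs that need root go first, then the image switches to an unprivileged user for everything that follows, including the container's runtime process.

```dockerfile
# Create an unprivileged user with no login shell, then switch to it.
# Instructions and the runtime process after this point run as appuser.
RUN useradd --create-home --shell /usr/sbin/nologin appuser
USER appuser
WORKDIR /home/appuser
```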
Best practice - don't expose build-time secrets. Suppose you're copying files from an AWS S3 bucket into the Docker image, which requires AWS authentication. We must be careful not to leak our secrets into the Docker image. The following are both bad practices, as they both save secrets directly into the Docker image, and should be avoided:
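Sketches of both anti-patterns (file and variable names are illustrative):

```dockerfile
# BAD: the credentials file is baked into an image layer and can be
# recovered from the layer even if a later instruction deletes it.
COPY aws_credentials /app/secrets/

# BAD: build args and environment variables are recorded in the image
# metadata and are visible via `docker history`.
ARG AWS_SECRET_ACCESS_KEY
ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
```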
A much more secure way to provide authentication is through temporary secret mounts, which we can do as follows:
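A sketch using BuildKit's secret mount (the bucket and file names are hypothetical); the secret is available only for the duration of the single RUN instruction and is never written into a layer:

```dockerfile
# The credentials file is mounted at the target path only while this
# RUN executes; it leaves no trace in the resulting image.
RUN --mount=type=secret,id=aws,target=/root/.aws/credentials \
    aws s3 cp s3://my-bucket/model-weights.pt /app/model-weights.pt
```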
The Docker build command changes now because we need the secrets file saved on our local machine to be mounted into the build. We can securely do that as follows:
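For example (paths and image name are illustrative; the `id` must match the one used in the Dockerfile's secret mount):

```shell
# Requires BuildKit (the default builder in recent Docker versions).
DOCKER_BUILDKIT=1 docker build \
  --secret id=aws,src=$HOME/.aws/credentials \
  -t my-ml-image .
```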
Best practice - don't expose run-time secrets. Sometimes we may need to provide authentication when running a Docker container. Once again, we must be careful not to leak secrets into the Docker image. There are several ways to do this. One way, in cloud environments, is to use IAM roles that grant the container specific permissions to use a service. Another, more straightforward way is to use volume mounts when running the container:
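A sketch of the volume-mount approach (paths and image name are illustrative); the credentials stay on the host and are never part of the image:

```shell
# Mount the credentials read-only into the running container.
docker run --rm \
  -v $HOME/.aws/credentials:/home/appuser/.aws/credentials:ro \
  my-ml-image
```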
The list above is by no means exhaustive. There are numerous other best practices that we don't touch on in this guide. We list some more best practices below:
- multi-stage builds
- using official base machine learning images that have been heavily optimized
- giving Docker images multiple tags with git commit hash
- building new images faster by reusing the unchanged layers of a previous build
- scanning images for security vulnerabilities
References
[1] https://pythonspeed.com/articles/alpine-docker-python/
[2] https://martinheinz.dev/blog/92
[3] https://pythonspeed.com/products/docker/