Docker Best Practices for Machine Learning

Arslan Ashraf

March 2024

Docker is an extremely important tool for modern software development. What makes Docker so powerful is that it solves the "but it works on my machine" problem. Docker allows us to take a set of programs built in one specific environment and run them in a completely different environment without dependency conflicts.

However, powerful tools often come with a learning curve. In this guide, we explore some of the best practices for building and running Docker images, in particular for keeping images small, secure, reproducible, and built on stable, well-maintained base images.

Best practice - choose the right base image. We want base images that are officially maintained, stable, secure, and reasonably small.

Debian-based official Python images maintained by Docker are a good choice. We will avoid Alpine Linux for reasons discussed by Itamar Turner-Trauring [1] and Martin Heinz [2]. Furthermore, one should always specify an exact tag for an image and never depend on the default latest tag.

#   pin an exact image tag instead of relying on the default latest tag
FROM python:3.11.8-slim-bookworm
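
For even stronger reproducibility, the tag can optionally be combined with the image digest. This is only a sketch; the digest below is a placeholder whose real value can be found with docker images --digests after pulling the image.

#   optionally pin the image digest as well for fully reproducible builds
FROM python:3.11.8-slim-bookworm@sha256:< digest_of_image >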

Best practice - batch together consecutive RUN instructions. To understand why, we need to know how Docker builds an image. Docker builds an image additively as a stack of layers, where each layer corresponds to an instruction in the Dockerfile such as FROM, COPY, RUN, etc. Once a layer is written it is never shrunk by later layers, so files created by one RUN instruction and deleted by a later one still take up space in the final image. This is why it's best to batch related commands, including their cleanup, into a single RUN instruction when possible.

FROM python:3.11.8-slim-bookworm
#
#   DEBIAN_FRONTEND=noninteractive prevents packages from prompting for
#   interactive input, which would otherwise make the build hang indefinitely
#
#   -y flag answers yes automatically so apt-get does not stop to ask whether
#   you really want to install the packages
#
#   --no-install-recommends flag tells apt to install only the required
#   dependencies of a package and none of the recommended dependencies
#
#   apt-get clean clears out the local apt package cache
#
#   rm -rf /var/lib/apt/lists/* removes the package lists fetched by
#   apt-get update so they don't end up in the image
#
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends < package_to_install > && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
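
To see the effect of batching, the layers of a built image can be listed with docker history, where each line of output corresponds to one layer and its size; the image name below is a placeholder.

#   list the layers of a built image and their sizes
docker history < image_name >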

Best practice - remove unnecessary files added to the image. As mentioned before, we want to keep images as small as possible, so we want to clean up and remove any unnecessary files such as tar or zip archives that have been downloaded and then untarred or unzipped. Because a file deleted in a later layer still occupies space, the download, extraction, and removal should all happen in the same RUN instruction.

FROM python:3.11.8-slim-bookworm
#
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends < package_to_install > && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
#
#   note that curl and unzip are not included in the slim base image, so they
#   need to be installed in the apt-get step above
#
#   -o or --output flag tells curl to save the downloaded file to a particular
#   local path in the image
#
#   -d flag tells unzip to extract the archive into the target directory
#
#   rm then removes the original zip file in the same RUN instruction to
#   prevent it from unnecessarily bloating the image
#
RUN curl < url_of_file_to_download > \
        --output < destination_directory >/< filename >.zip && \
    unzip < destination_directory >/< filename >.zip \
        -d < unzipped_target_directory > && \
    rm < destination_directory >/< filename >.zip

Best practice - place frequently changing instructions towards the bottom. To see why, we need to understand Docker layer caching. Docker builds images from top to bottom, and when an image is rebuilt, Docker reuses as many layers as it can from its local cache. But if a layer changes, that change forces a rebuild of this layer and every layer below it. Hence, it's a best practice to place frequently changing content, such as training code, as far down in the Dockerfile as possible.

FROM python:3.11.8-slim-bookworm
#
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends < package_to_install > && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
#
RUN curl < url_of_file_to_download > \
        --output < destination_directory >/< filename >.zip && \
    unzip < destination_directory >/< filename >.zip \
        -d < unzipped_target_directory > && \
    rm < destination_directory >/< filename >.zip
#
#   copy the requirements.txt file into the Docker image
#
COPY requirements.txt .
#
#   pip automatically caches HTTP responses and locally built wheels and it
#   checks its local cache first when installing packages
#
#   --no-cache-dir turns off pip's caching mechanism, so pip won't leave behind a
#   local cache of HTTP responses and built wheels, thereby reducing the image size
#
RUN pip install --no-cache-dir -r requirements.txt
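
The frequently changing training code itself should then be copied after the dependencies are installed, so that editing the code only rebuilds the final layer while the apt and pip layers come from the cache. A minimal sketch, assuming the training code lives in a local src/ directory:

#   copy the frequently changing training code last so that code edits
#   don't invalidate the dependency layers above
COPY src/ ./src/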

Best practice - don't run the container as the root user. Installing packages requires root privileges during the build, but once that is done, the image should switch to a non root user. Docker runs on Linux, and if an attacker gains access to a container running as root, the attacker might be able to escape the container and gain root access to the underlying operating system. This is a serious security threat that is best avoided by running as a non root user with very limited or no privileges.

FROM python:3.11.8-slim-bookworm
#
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends < package_to_install > && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
#
RUN curl < url_of_file_to_download > \
        --output < destination_directory >/< filename >.zip && \
    unzip < destination_directory >/< filename >.zip \
        -d < unzipped_target_directory > && \
    rm < destination_directory >/< filename >.zip
#
COPY requirements.txt .
#
RUN pip install --no-cache-dir -r requirements.txt
#
#   create a non root user
RUN useradd --create-home non_root_user
#
#   set the user to be the newly created non root user
USER non_root_user
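
A quick way to confirm the image no longer runs as root is to check which user a command inside the container runs as; with the USER instruction above, the following should print non_root_user (the image name is a placeholder):

#   verify that the container runs as the non root user
docker run --rm < image_name > whoami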

Best practice - don't expose build time secrets. Let's say you're copying files from an AWS S3 bucket into the Docker image, which requires AWS authentication. We must be careful not to leak our credentials into the image. The following are both bad practices that should be avoided, since ENV values are stored in the image configuration and ARG values can be recovered with docker history, meaning both approaches save secrets directly into the Docker image:

#   bad practice
ENV AWS_ACCESS_KEY=...
ENV AWS_SECRET_ACCESS_KEY=...
RUN ./file_that_needs_secrets.sh
#
#   bad practice
ARG AWS_ACCESS_KEY
ARG AWS_SECRET_ACCESS_KEY
RUN ./file_that_needs_secrets.sh

A much more secure way to provide authentication is through temporary secret mounts, which make the secret available only while the RUN instruction that needs it executes and never store it in the image:

COPY file_that_needs_secrets.sh .
#
#   mount the secret from the local system only for the duration of this RUN
#   instruction; without a target option the secret would appear at
#   /run/secrets/aws_secrets inside the container, here it is mounted at
#   /root/.aws/aws_credentials where file_that_needs_secrets.sh expects to find it
#
RUN --mount=type=secret,id=aws_secrets,target=/root/.aws/aws_credentials,required=true \
    ./file_that_needs_secrets.sh

The docker build command changes as well, because the secrets file saved on our local machine needs to be passed to the build. We can securely do that as follows:

docker build --secret id=aws_secrets,src=< local_path_to_secrets_file > .
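
Note that both --mount=type=secret and the --secret flag are BuildKit features. BuildKit is the default builder on recent versions of Docker, but on older installations it may need to be enabled explicitly:

#   explicitly enable BuildKit on older Docker installations
DOCKER_BUILDKIT=1 docker build --secret id=aws_secrets,src=< local_path_to_secrets_file > .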

Best practice - don't expose run time secrets. Sometimes we may need to provide authentication when running a Docker container. Once again, we must be careful not to leak secrets into the Docker image. There are several ways to do this. One way, in cloud environments, is to attach an IAM role that grants the container permission to use a specific service. Another, more straightforward way is to mount the secrets file into the container with a volume at run time, so the file exists only in the running container and never becomes part of the image:

docker run -v "< local_path_to_secrets_file >:< image_base_dir >/< file_that_needs_secret >" < image_name >
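
Putting the pieces above together, a minimal sketch of a Dockerfile that follows these practices might look like the following; the package names and directories are placeholders to be filled in for a specific project:

FROM python:3.11.8-slim-bookworm
#
#   install system packages in a single batched RUN instruction and clean up
#   within the same layer
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends < package_to_install > && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
#
#   install Python dependencies without keeping pip's cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
#
#   copy the frequently changing training code last
COPY src/ ./src/
#
#   create and switch to a non root user
RUN useradd --create-home non_root_user
USER non_root_user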

The list above is by no means exhaustive; there are numerous other best practices that we don't touch on in this guide. A good starting point for further reading is Itamar Turner-Trauring's Docker packaging guide for Python [3].

References

[1] https://pythonspeed.com/articles/alpine-docker-python/

[2] https://martinheinz.dev/blog/92

[3] https://pythonspeed.com/products/docker/