Dockerfiles are a key part of building Docker images. Here, we share best practices for writing Dockerfiles so that your images are built efficiently while remaining secure.
Machine Learning Engineer and DevOps Expert
Dockerfiles provide the build commands that give rise to your Docker images, which can then be used to create Docker containers! Writing a good Dockerfile is one of the most important steps in building a containerized application, and care should be taken to ensure that best practices are being followed.
Because the Docker runtime provides so many different commands that can be used, it can be difficult to figure out how to best structure your Dockerfiles. This article aims to provide six best practices for those writing Dockerfiles so that your images are built efficiently while remaining secure.
For the most part, ADD and COPY commands are used to move files from your local filesystem into your Docker image, whereas RUN commands are used to install application libraries and dependencies. If your application install does not depend on your local files (e.g. you’re creating a container that runs a Python script), it’s best to put your RUN commands towards the beginning of the Dockerfile. This is because Docker caches unchanged layers during the build process, and each command in the Dockerfile represents a delta over the previous layer.
If you plan on making frequent changes to your application scripts, but not to the runtime libraries, it behooves you to put the RUN commands at the beginning of the Dockerfile. This way, the various library install processes will be cached, and only the relevant ADD and COPY commands will rerun each time you rebuild the Docker image. Conversely, placing the ADD and COPY commands toward the beginning of the file and the RUN commands after will cause the libraries installed using RUN to reinstall every time the local files are changed and copied into the image. This makes for expensive and time-consuming builds that could be spared by simply reordering the commands in the Dockerfile.
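As a concrete sketch, a Dockerfile ordered this way might look like the following (the base image, package list, and script name are all illustrative):

```dockerfile
FROM python:3.10-slim

# library installs first: this layer is cached across rebuilds
RUN pip install django torch

# application code last: frequent edits here only invalidate this final layer
COPY train_model.py .
```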
In general, you want to minimize the number of layers you use in your Dockerfiles, so long as you can keep a clear separation of commands of different types. In fact, you’re somewhat limited by the Docker runtime itself, which specifies that images can contain at most 127 layers, although in practice you’d be hard-pressed to reach that amount.
Generally, you want to tie together commands of the same type. For example, instead of writing multiple “pip install” commands to install many packages, you can drop them into a single layer by writing something like
RUN pip install \
    sklearn \
    django \
    torch \
    …
Additionally, commands that depend upon one another should be chained together. This is accomplished using the && operator. A common example involves apt-get update and apt-get install, which are used frequently in Ubuntu-derived images. If you write apt-get update and apt-get install in separate layers, i.e. without &&, Docker may reuse a stale cached layer from a previous apt-get update, and you might end up installing an outdated version of a package. Combining the two in a single RUN instruction is commonly known as “cache busting.”
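A minimal sketch of the chained form (the package names are illustrative):

```dockerfile
# chaining ensures a fresh package index is fetched
# whenever this install line changes
RUN apt-get update && apt-get install -y \
    curl \
    python3
```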
Another way to accomplish the same outcome is to manually pin each package installed to a specific version, e.g. by writing
RUN apt-get install -y \
    curl=7.88.0 \
    python=3.9.1
By manually specifying a version number, you can ensure that Docker will always include that specific version of the package into the build environment.
Environment variables are great for linking in information to your programs that can change at runtime. This often includes paths to executables that will be called by your script, package version numbers, or things like API keys. Creating an environment variable in a Dockerfile is simple. You just write something like this:
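A minimal sketch, using the GITHUB_API_KEY example discussed next (the ARG/ENV pairing shown is one common pattern for avoiding hard-coded secrets, not the only option):

```dockerfile
# declare a build arg and bind it to an environment variable
ARG GITHUB_API_KEY
ENV GITHUB_API_KEY=$GITHUB_API_KEY
```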
Note that we want to be careful with data such as API keys and secrets as they grant privileged access to accounts. NEVER commit keys or other sensitive data directly to a git repository. Thus, in this scenario we bind the GITHUB_API_KEY environment variable to a build arg (see section #4 below) and pass it in at build time as follows (assuming we have the key saved in a ~/.bashrc file or similar):
docker build -t my_img --build-arg GITHUB_API_KEY=$GITHUB_API_KEY .
This environment variable is then set in your Docker image/container and can be retrieved at runtime by writing “echo $GITHUB_API_KEY” from the terminal or accessing it from within your program code.
Environment variables can also be set when a container is launched by using the “-e” flag to docker run, e.g.
docker run -e GITHUB_API_KEY=$GITHUB_API_KEY my_img
(Note that docker build does not accept an “-e” flag; build-time values are passed with --build-arg as above.)
Building off #3, let’s say you want to set certain values in your Dockerfile dynamically, at build time. For example, you might want to build different versions of your image that use different versions of the base Python image, such as one version for Python 2.7 and another for Python 3.10. You can accomplish this using build args. Build args provide an interface for dynamically declaring Dockerfile variables and values. You declare a build arg by writing the following within your Dockerfile:
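A declaration might look like this (the default value shown is illustrative and can be overridden at build time):

```dockerfile
ARG PYTHON_VERSION=3.10.0
```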
You can then access the argument further down in the Dockerfile by prefacing it with $, for example:
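For example, an ARG can select the base image itself (note that an ARG consumed by FROM must be declared before the FROM instruction):

```dockerfile
ARG PYTHON_VERSION=3.10.0
FROM python:${PYTHON_VERSION}
```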
Values are specified for build args at build time using the
--build-arg flag in docker. You can write
docker build -t img_py3.10.0 --build-arg PYTHON_VERSION=3.10.0 .
and the Python version will dynamically be included into the Dockerfile build.
Docker containers are usually used for stateless microservices that are meant to be horizontally scaled. This means that many container instances of the same Docker image will often be running simultaneously and need to be launched quickly at random, for example upon failure and restarting of a cloud instance. For this reason, it’s important to keep images lightweight so as to enable quicker restarts and access times. To that end, you should not install unnecessary packages into your images. Only the bare minimum essentials required to run your application should be installed.
In a similar vein, each Docker image and container should have a singular focus: running one, specific application. This provides the best profile for scaling applications. If two applications are coupled within a single Docker container, then it becomes impossible to scale them individually despite the fact that they likely have different resource requirements and demand. Constraining each image to focus on a singular task provides the best scaling profile and allows containers to operate relatively autonomously.
ENTRYPOINT and CMD are two Dockerfile directives for running code when a container is launched. Because ENTRYPOINT and
CMD provide similar functionality, there is often confusion among users as to when one should be preferred over the other.
CMD works by allowing you to run an executable by writing a directive in your Dockerfile such as
CMD ["python", "train_model.py", "--lr", "1e-3"]
This would cause the container to run the Python script
train_model.py with the learning rate parameter
(--lr) set to a default value of 10^-3.
Similarly, we can write
ENTRYPOINT ["python", "train_model.py", "--lr", "1e-3"]
and the default behavior will be practically the same. The container will run the
train_model.py script upon launch. So what’s the difference?
The primary contrast is in the behavior of the two directives in response to user-provided arguments. Specifically, it is possible to override the default
CMD instruction from the Dockerfile when running the container using
docker run. This is not the case with
ENTRYPOINT: arguments passed to docker run are appended after the entrypoint command rather than replacing it, so the entrypoint executes every time the container is run (it can only be swapped out explicitly with the --entrypoint flag). Thus,
ENTRYPOINT is appropriate for defining a command with fixed parameters that should run exactly as written EVERY time the container is launched. Conversely,
CMD should be used for running scripts with variable arguments that are intended to be adjusted and overridden by the user of the container.
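For instance, assuming an image named my_img built with the CMD directive shown above, the default command can be replaced wholesale at launch:

```
# the arguments after the image name replace the default CMD entirely
docker run my_img python train_model.py --lr 1e-2
```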
ENTRYPOINT and CMD can be cleverly combined to enable more complex, dynamic behavior. For example, to create a container that runs an executable with some arguments fixed and others overridable, you could write the following in the Dockerfile:
ENTRYPOINT ["python", "train_model.py", "--lr", "1e-3"]
CMD ["--batch-size", "32"]
This container will always execute the
train_model.py script with a fixed learning rate of 10^-3. However, it enables the user to change the batch size by launching the container using a
docker run command of the form
docker build -t train_model .
docker run train_model --batch-size 64
There are some small caveats to be aware of when considering whether to use CMD,
ENTRYPOINT, or both, more details of which can be found in the official Docker documentation. In short,
CMD is especially well-suited for use in a development environment, where you might want to interact with the shell of your Docker container directly.
Docker provides a number of commands and directives that make it possible to configure advanced behaviors for your images and containers. However, all of this added complexity can make it difficult to determine the best practices for defining Docker images via Dockerfiles. In this article, we walked through some guiding principles for building your Docker images and running Docker containers. The core maxim of Docker-based development is to value simplicity, so ideally you should streamline your images and builds whenever possible so as to reduce build sizes, make effective use of the build cache, and enable horizontal scaling of your applications and services.