Metadata:
A tutorial to help understand the Dockerfile. The approach synthesizes work in live Docker containers to guide the composition of a Dockerfile. This is contrary to tutorials that simply show how to deploy a web app; the Dockerfiles involved here do not include the line CMD ["node", "index.js"].
- Initially published on 01-21-2025.
Making and understanding the Dockerfile
Docker, and containerization by extension, is not taught in the typical computer science curriculum. This statement comes from an individual who has recently been academically immersed in Computer Science for 7 years. This span has brought three degrees from three different schools, including a handful of graduate-level software engineering courses. One would think that this amount of experience would make learning Docker trivial. There is truth to this assertion, but that triviality is not apparent in the plethora of Docker tutorials that are pushed by any search engine.
A good Docker tutorial is rare. Most produce a Dockerfile which concludes with the statement CMD ["node", "index.js"]. These tutorials do well enough in terms of telling a reader how to make a container for their JavaScript web app. Frankly, they do a poor job of helping an individual produce their own Dockerfiles from scratch.
This is an attempt at producing a piece that helps one better understand both the Dockerfile and Docker in general. Personal experience has shown that an iterative approach to pedagogy allows for a better understanding; an act of scaffolding knowledge is what will be done here. That being said, it will be assumed that the reader has experience using Linux through a command-line shell. This page will use a handful of Linux commands that will not be described in detail.
It will also be assumed that a reader already has taken the steps to install Docker onto their system. The general purpose and motivation for using a container should also be understood, despite the fact that this tutorial will illustrate a subset of these motivations.
The basis of using Docker
Running a container from an image
The Dockerfile acts as a set of declarations to build a container. These instructions tell Docker what to build and the order in which it should be built. Perusing DockerHub exposes one to a wide range of images for containers that can be used as-is. This is typically done by using docker pull within the command-line shell. Pulling an image downloads it from the hub so that it can be run as a container.
Let us dive into using an image. Ubuntu is a Linux distribution which is a common gateway for those who are new to practicing computer science. This provides a good basis in becoming comfortable with the behaviors of Docker.
In a terminal, run docker pull ubuntu:24.04 to download Ubuntu version 24.04. This will pull an image that represents the minimal installation of the most recent, (at the time of this writing), long-term-support version of Ubuntu.

Pulling ubuntu:24.04 from DockerHub.
The Docker image has been pulled from DockerHub and is ready for use. Docker now needs to be told to start a container which runs this image. This is done with the docker run command, where the name of the image is given along with any other arguments. In this case, a trio of arguments will be given to ensure that the container remains running such that it can be accessed.
Input docker run -dit ubuntu:24.04. After issuing this command, a string will be output. This string is a unique identifier for the container. This can be used to confirm whether or not the container is running.

Creating and running a Docker container using the ubuntu:24.04 image.
To view a list of running containers, most tutorials will recommend using docker ps. To someone who is new to Docker, this command has a confusing label. Rarely is it discussed what ps actually means; no elaboration is given as to why ps is used instead of ls, the POSIX-compliant means of listing entries that just about every other piece of Unix software uses. It turns out that ps stands for process status, mirroring the Unix ps command. It also turns out that docker ps is now an alias of docker container list, which arrived when Docker grouped its commands into management namespaces, and there are even more aliases for this command. This tutorial will opt to use docker container ls as an alternative. Just remember that docker container ls is functionally equivalent to docker ps.

Listing the currently running Docker containers. It is worth observing the CONTAINER ID column and correlating it to the id previously generated.
Take note of the CONTAINER ID column in the output shown in the screenshot above. This is a truncated version of the id that was generated by docker run. This truncated value can be used to attach to the running container. This is done by using docker attach <container-id>, where <container-id> is c954e606dfe3 in the example above.

Attaching to the container whose ID is c954e606dfe3.
The shell for Ubuntu 24.04 is now being accessed! This is indicated by root@c954e606dfe3 as the username and hostname designation. This shell can receive any command that ships with the minimal distribution of Ubuntu. This includes all the POSIX-compliant commands such as ls, cat, echo, and so on. This also includes access to programs such as dpkg or apt which are unique to Ubuntu and its Debian heritage.
It's good to get your bearings in a new environment like this. Get a feel for the current working directory and then navigate to /root, which acts as the home folder for the root user. Here, create a file called test.txt by directing the output of echo "Hello" to it.
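The sequence looks something like this:

```
pwd                      # show the current working directory
cd /root                 # the home folder for the root user
echo "Hello" > test.txt  # create test.txt containing "Hello"
cat test.txt             # confirm the file's contents
```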

Testing out the containerized Ubuntu command-line shell. The hotkey ctrl-l was used prior to taking this screenshot to clear stdout.
A computer scientist is never complacent with using just "Hello" as test output. "Hello World!" is in order. Try editing the text file by using a text editor like nano.

Attempting to run nano in the containerized Ubuntu shell.
The result of attempting to use nano should make it apparent how minimal this installation of Ubuntu really is. As mentioned prior, the apt package manager is available. Install nano with apt install nano. Before doing so, be sure to update the package information so that apt knows what to reference. This is done with apt update. Once finished, use nano to alter test.txt and then confirm the change has been made with cat.
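In the container's shell, the full sequence is along these lines:

```
apt update           # refresh the package information apt references
apt install nano -y  # install nano; -y skips the confirmation prompt
nano test.txt        # edit the file to read "Hello World!"
cat test.txt         # confirm the change
```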

Post-installation of nano. It is used to alter test.txt.
The persistence of a container
One could scaffold exactly what they need within the shell of this container. Need a LAMP server? Use apt to install Apache2, MySQL, and PHP. Need that JavaScript web app? Install Node.js and its dependencies. One could then use the docker inspect command to discover the local IP address of the container and perform the network configuration on the host machine that allows external access to the container.
This approach is not typically taken, for reasons of scalability. Docker shines when deploying services at scale. A machine may end up running multiple instances of the same image, and these cloned containers may have their state diverge as time progresses. Conversely, the same image may be deployed across multiple machines as a means to load balance. Here, the states of the containers may remain the same as they work in concert; the state itself might be stored by some abstractly singular entity such as an external database.
To shore up this intuition, it's important to get a good feel for how data persists within a container. This intuition can be evaluated while a container switches between operational states through Docker's engine. What happens to the data when a user detaches from the shell? What happens to the data when a container is stopped?
The Ubuntu shell is currently waiting for some command. Intuitively, supplying exit to this shell will exit the shell and bring the scope of the terminal back to the environment that initially called docker attach. Unfortunately, this has the side effect of stopping the container. Recall that when this container was initiated, the docker run command was supplied with a trio of arguments: -dit. Documentation defines these as follows:
- -d, --detach: Run container in background and print container ID
- -i, --interactive: Keep STDIN open even if not attached
- -t, --tty: Allocate a pseudo-TTY
If these arguments were not supplied, then the container would have spun up, completed the required set of instructions to ensure that Ubuntu is powered on, and then immediately shut down. These docker run arguments allow the container to persist in a headless state.
A command is needed that leaves the container's shell while allowing the container to continue running in this headless state. Docker's engine affords ctrl+p ctrl+q to accomplish this. With the shell selected, press and hold ctrl and then press p followed by q. stdout will present read escape sequence and then return to the shell from which docker attach was originally called.
Here, it can be confirmed that the container is still running by using docker container ls.

Using docker container ls to confirm that the container is still running.
Logic would dictate that data would persist within a running container. This can be confirmed by reattaching to the container with docker attach as done previously. Before trying this, take note of the NAMES column within the output of docker container ls. This is a mnemonic that acts as an alias for the container. In this example, instead of using docker attach c954e606dfe3 to connect to the container, docker attach strange_herschel can be used. These mnemonics are randomly generated by Docker whenever a container is instantiated.
Data persistence can be confirmed within the shell by navigating back to the root folder and observing that test.txt still has the same value.

Confirming that data has persisted within this container.
Once again, exit could be used to stop the container. Instead of taking this route, detach from the container using the ctrl+(p q) method. Once the shell is detached, run docker stop <container-name>, where <container-name> is the aforementioned alias given to the container. It may take a moment for the container to spin down. Once it is finished, its alias will be printed to the terminal's stdout. Confirm that the container is no longer running with docker container ls.

Result of stopping a container. Observe that it is no longer listed in docker container ls.
The container is not being listed! Is the container gone along with its data? docker container ls actually reports a list of running containers. To view the entire list of containers, opt for docker container ls -a.

The complete process status of Docker, including containers which are powered off.
To confirm whether or not data persists in a container that has stopped, start the container back up using docker start <container-name>. Reattach to the session's shell and confirm that the text file still exists. It wouldn't hurt to use nano to change the file as well to confirm that it is still installed.

Starting and attaching to a container shows that data has persisted from a state where the container was shut off.
This clues us in to the fact that data persists as long as the container hasn't been deleted.
Detach from the container by using ctrl+(p q).
The persistence of multiple containers
Reconsider the case in which multiple containers of the same image may exist in a Docker environment. Spinning up another Ubuntu container can help build intuition about how data persists within Docker. Do so by running docker run -dit ubuntu:24.04 once more. This command will produce a new identifying string for the new container.
The existence of this new container can be shown by once again running docker container ls. Two running containers are now displayed.

Viewing the containers running in tandem after creating a new one.
The name of the new container is ecstatic_archimedes in this current example. Attaching to the old container will show that its data is still in place, but attaching to ecstatic_archimedes will show that it is a fresh slate. Nano is not installed and there is no text file within the root's home folder.

Attaching to the new container to confirm that it does not have nano and that the root's home directory is empty.
Exit from this shell using the exit command. Stop the remaining container by using docker stop. Then proceed to remove both of these containers. This is done by using the docker rm command. It is with this command that the containers are destroyed and their data stops persisting.

Stopping and removing the active containers.
Using a Dockerfile
When it comes to deploying containers at scale, requiring some system administrator to manually install the dependencies and set up the environment within each container is unreasonable.
Reconsider the need to deploy multiple containers of the same web application. Perhaps this can be an array of containers which serve a website using the LAMP stack. The purpose of using an array of containers could be to ensure that users of varying geographic regions can access the website from a server that is within reasonable proximity. Containers that exist on different hardware can be leveraged as load balancers to capture the case where a single server might be overwhelmed with traffic. Load balancing will ensure that a user can access the same website by communicating with a different physical server.
Assume that the same web app is deployed across 100 different machines. Using a container, as has been done here, actually adds more steps to the process compared to just running the server on bare metal. That is, for every instance, a system admin would have to attach to the container and install the set of dependencies. Using the example of a LAMP stack, this includes running apt install apache2 php and likely even more tools to ensure the server runs properly. They would need to manually configure Apache and manually set up any pointers to an external MySQL database. They would also need to configure which ports to expose using iptables on the host machine.
Imagine having to do this 100 times over. The intuition of a computer scientist should be to automate this process. Indeed, this could be done by building a script that iterates through the relevant IP addresses within a virtual private network. Here, docker inspect can be used to discover the route to a container. Then ssh can be used to pipe in a set of instructions as a bash script, which can be run within each container to ensure all the pieces are in place.
While this general process is valid, the Docker engine provides this functionality through the Dockerfile. The Dockerfile is an abstraction that allows an administrator to present a recipe for a new image to Docker. These new images can inherit instructions from other images. Furthermore, docker pull can be thought of as retrieving a prebuilt image: the published result of someone having already run docker build against a Dockerfile.
Reconsider the images of containers presented on DockerHub. Earlier, docker pull ubuntu:24.04 was used. The syntax of the image label includes a colon. Anything to the left-hand side of the colon is the base name of the image. Anything to the right indicates a specific tag that implies a version. This specific image can be found by searching Ubuntu within DockerHub and selecting the tags section. Specifically, it is listed as an image layer of ubuntu:24.04.
Looking at the image layer page, the left pane has a section dedicated to layers. These layers are as follows:
- ARG RELEASE
- ARG LAUNCHPAD_BUILD_ARCH
- LABEL org.opencontainers.image.ref.name=ubuntu
- LABEL org.opencontainers.image.version=24.04
- ADD file:bcebbf0f....
- CMD ["/bin/bash"]
These linear statements are actually the Dockerfile of this image!
What is Ubuntu? We know it is an operating system which makes use of the Linux kernel; it is a set of programs and subroutines that exist as a shell around it. Selecting the second-to-last entry of the image layers pane, followed by the Packages tab in the pane on the right-hand side of the web page, lists the packages that constitute Ubuntu. One of the first entries within this pane is apt, a program that has already been used in one of the containers created through this tutorial.

Ubuntu's 24.04 page on Dockerhub. Take note of the packages pane.
This implies that Dockerfiles can be composed of other images which are themselves defined by Dockerfiles. Granted, this recursive nature isn't apparent in the example of ubuntu:24.04, but it can be shown by creating a Dockerfile for an Ubuntu image that has nano preinstalled in addition to having a "Hello World!" text file in the root's home folder.
Building the minimal
What is the bare minimum required of a useful Dockerfile? DockerHub contains a wide range of images which seem to scaffold on top of some minimal implementation of an operating system. What would be required of a Dockerfile to contain one of these images?
It is simple. A Dockerfile would need to include a statement that pulls from DockerHub! DockerDocs has a reference page that explains all the applicable keywords that can be used within a Dockerfile. Take note of FROM:
FROM: Create a new build stage from a base image.
Referring back to the layers that compose the ubuntu:24.04 image, the ARG, LABEL, ADD, and CMD keywords are used within the Dockerfile. In order, they are defined as such:
ARG: Use build-time variables.
LABEL: Add metadata to an image.
ADD: Add local or remote files and directories.
CMD: Specify default commands.
Firstly, build-time variables are established which are dynamically assigned values by the Docker engine to communicate version and CPU architecture information. This information is used by the bootstrapped file system being imported with the ADD keyword. When the file system is in place, (which includes a set of executable binaries that allow the file system's contents to be an Ubuntu distribution), a shell to the operating system can be accessed using /bin/bash. This is what allows usage of the command line to interact with the programs and system calls.
The above paragraph may seem daunting. Understanding the operating system concepts which allow this to happen is way beyond the scope of this writing. The big takeaway is that the maintainers of the DockerHub image have allowed the usage of the operating system such that we can create a Dockerfile which makes use of this image.
When building from a Dockerfile, the Docker engine expects the Dockerfile to be named Dockerfile. Create a new folder within the host operating system which will contain the Dockerfile. Create a new text file named Dockerfile. Be sure that it has no file extension; it cannot be named Dockerfile.txt. Now edit Dockerfile to include a single line:
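```
FROM ubuntu:24.04
```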
That's it! That's all that's needed. The next step is to tell Docker to build from Dockerfile. Within the directory, run the command docker build -t ubuntu:24.04-from-dockerfile . . The -t argument provides a tag name which acts as a label for the image. The period which follows this argument is the build context: the path where the Dockerfile resides.
Once the Docker image is built, the command docker images -a can report a list of images that are built within a host's environment. This image can be run and attached just as the prior Docker image.

Building an image from a Dockerfile.
What's needed of a Dockerfile such that it comes preinstalled with nano and has a text file in the root's home directory? To answer this question, one needs to ask themselves how they accomplished this without using a Dockerfile. What were the steps they took to set this up in a live container?
Having these steps in mind, take note of the following keywords. These can be used in a Dockerfile:
RUN: Execute build commands.
WORKDIR: Change working directory.
RUN can be used to execute programs and system calls that are afforded by earlier imports, such as by the ADDition of Ubuntu. WORKDIR sets the working directory in which these commands are executed. It also defines the working directory seen when attaching to the container.
Alter Dockerfile such that it contains the following:
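```
# one possible layout; it mirrors the steps taken in the live container
FROM ubuntu:24.04
RUN apt update
RUN apt install nano -y
WORKDIR /root
RUN echo "Hello World!" > test.txt
```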
Once in place, build the image from the Dockerfile with docker build -t ubuntu:24.04-with-nano . . While the image is being built, more information will be output to stdout informing the user of the progress of apt update and of the installation of nano.
The image can now be used to create a container: docker run -dit ubuntu:24.04-with-nano.

Building the new image from a Dockerfile. Exact output may differ due to the Docker engine's caching features.
Attach to a container created from this image to confirm that nano is already installed and test.txt is in place in the root's home folder.

Attaching to the newly created container and observing that nano is installed and the text file is in place.
Installing Linux programs from different sources
Many Linux distributions come with their own package management software. These often afford the ability to communicate with remote package repositories to fetch the program that a user may want to install. They also have awareness in terms of the dependencies required to run said program. With this awareness, they will fetch and install the dependencies, if necessary.
apt is what is commonly used for operating systems based on Debian; Ubuntu is one of these. Arch-based distributions use pacman. Fedora uses dnf and the RPM format. Gentoo uses Portage. And so on.
apt acts as a front-end for dpkg, a medium-level package manager. dpkg will check to see whether dependencies are met prior to installing a package, but if those dependencies aren't met, it will not seek them out. It's up to the user to find the relevant packages.
Drilling further down takes us to the low-level package manager: dpkg-deb. This tool does not even determine whether dependencies are in place. All three of these tiers install .deb packages.
Consider a case where a system administrator may not want to use apt to install packages within a container. This requires access to a .deb package file. To install nano this way, the package file needs to be obtained from nano's website.
Nano's .deb package is located here. Navigating through the stable package link to Debian's package listing, a list of dependencies is given. It needs to be confirmed whether the base image comes prepackaged with these dependencies.

Listing of required dependencies for Nano from Debian's manpages website.
Create a new container using docker run -dit ubuntu:24.04. Attach to this container. Within the container's shell, use dpkg -S to determine whether a package has been installed:
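Assuming the dependencies shown on the Debian listing are libc6, libncursesw6, and libtinfo6, the checks look like:

```
dpkg -S libc6         # reports files owned by libc6 if it is installed
dpkg -S libncursesw6
dpkg -S libtinfo6
```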

Confirming that the required dependencies are already installed within the container.
It looks like the packages are already pre-installed! Exit this container such that it stops as a process. Navigating back to the Debian package information page, look past the Other Packages Related to nano section toward the Download nano section. Download the relevant package based on the host computer's system architecture. This will most likely be amd64, whose direct download link is here.
Once the .deb package is downloaded, place it in a subfolder with a name such as packages. Ultimately, this subfolder name is arbitrary, but a good name helps keep items in order. The file can now be copied into the container.
The command docker cp will be used. The syntax here is docker cp path/to/source_file container:path/to/destination. That is, the first argument is the path to the source file as it exists on the host machine. The second argument is the container name or id followed by a colon which is then followed by the path where the file should be placed within the container.
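For example, with the new container's ID substituted in (the package file name will vary by version):

```
# the .deb file name here is illustrative; use the file actually downloaded
docker cp packages/nano_7.2-1_amd64.deb <container-id>:/root
```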

Copying nano from the host machine's filesystem to the container's filesystem.
Start and attach to the container. Navigate to the destination path. Then run dpkg -i <package-file-name> on the .deb file to install it. Give nano a try once it's installed!

Installing the nano .deb package within the container and confirming its success.
Leave the container once finished by using the exit command in the container's shell.
Shared volumes
It's been shown that files can be copied to a Docker container using the docker cp command. An alternative involves taking advantage of shared volumes. Volumes are storage spaces that are shared among different containers. A volume can also be used to connect the host's file system to a given container such that the host machine can interact with the container's files, even through a file browser, after the container has been created.
On the host system, navigate to the packages subfolder. Download the test version of nano, whose information can be found here. The same dependencies are required, so no extra effort needs to be made to ensure that this package can be installed. Download the .deb file for the required architecture. The amd64 package is at this url.
Place the downloaded .deb package into the packages subfolder. There should now be two files in the packages subfolder.

Using wget within the host system to download another version of nano onto the host machine.
To create a shared volume, a new container must be created. Information pertaining to the shared volume will be given by the -v parameter of docker run. Within the packages parent folder, use docker run -dit -v ./packages:/root/packages ubuntu:24.04. The argument given to -v is the source folder as it exists on the host machine. This is followed by a colon and then the path in which the contents of the source folder should be placed. In this case, it's within the home folder of the root user.
Attach to the docker container and navigate to the placement folder. Take a peek inside to see that both the .deb packages are included. Either of these can now be installed within this new container.

Starting a container using the -v parameter and then confirming that the expected files are in place.
On Dockerfiles and shared volumes
Recall the purpose of leveraging a Dockerfile: it reduces the amount of work a system admin needs to do once a container is built. The -v parameter used in docker run is given its value at the time the container is created. Because of this, folding the creation of a volume into a Dockerfile may seem like a non-issue. It becomes more of a problem when considering how to integrate multiple volumes.
Consider the VOLUME keyword as noted in the DockerDocs reference manual:
VOLUME: Create volume mounts.
One would logically conclude that the VOLUME keyword within a Dockerfile can be leveraged to map a folder in the host's file system to a container's file system. Unfortunately, this is not the case. A volume is defined by Docker as such:
Volumes are persistent data stores for containers, created and managed by Docker. You can create a volume explicitly using the docker volume create command, or Docker can create a volume during container or service creation.
The existence of the docker volume create command implies that there exists a list of units which act as logical storage for the Docker engine. The documentation reinforces this notion by stating:
When you create a volume, it's stored within a directory on the Docker host. When you mount the volume into a container, this directory is what's mounted into the container. This is similar to the way that bind mounts work, except that volumes are managed by Docker and are isolated from the core functionality of the host machine.
What the above documentation implies is that volumes are typically reserved as units that only containers may take advantage of. This can be confirmed by looking at the list of available volumes by running docker volume ls on the host machine, which reports an empty list at this point.
What is happening here? Isn't the -v parameter of docker run shorthand for --volume? It is, but the name is misleading in this context. The documentation for docker run lists the parameter as follows:
-v, --volume: Bind mount a volume
The documentation for using volumes makes things a bit more clear by stating:
To mount a volume with the docker run command, you can use either the --mount or --volume flag. In general, --mount is preferred. The main difference is that the --mount flag is more explicit and supports all the available options.
It turns out that the way -v has been used in this tutorial is shorthand for mounting the host machine's folder. This can be confirmed by using docker inspect to take a peek at a container's properties and finding information pertaining to a mount. This can be done by running docker inspect -f '{{ .Mounts }}' <container-id>.

Using the inspect command to reveal the directory mapping which constitutes this 'volume' as a mount.
The --mount parameter is more flexible and can be used for interacting with volumes whilst also allowing the creation of bind mounts. In short, docker run -dit -v ./packages:/root/packages ubuntu:24.04 is functionally equivalent to docker run -dit --mount type=bind,source="$(pwd)/packages",target=/root/packages ubuntu:24.04 (note that --mount expects an absolute source path). A Docker volume, by contrast, should be reserved for allowing containers to share data among each other, logically exclusive from the host machine or array of host machines.
Furthermore, to enforce containers using a shared volume, the VOLUME keyword within a Dockerfile creates an anonymous volume that can be discovered using docker volume ls. In order to take advantage of this volume, any subsequently created container must reference the anonymous volume's id with the --mount or --volume parameter of docker run.
These facts add a layer of complexity and highlight an inflexibility which makes the VOLUME keyword an inadequate option for porting data from a host's file system into a container. So how can data portability be accomplished within a Dockerfile?
Including local data in a Dockerfile
It has been discussed that using the VOLUME keyword creates logical space for containers to share data. This is not too useful when designing a Dockerfile unless initiating an anonymous volume is the intended action.
How can data be moved into an image in the first place? Instead of using a mount, reconsider the technique that led into the discussion of volumes and mounts: docker cp. The Dockerfile reference includes an analogous keyword:
COPY: Copy files and directories.
Here the basic syntax is COPY <source_path> <destination_path>. The <source_path> is one that exists on the host's file system!
Create a new file called Dockerfile within the parent directory of the packages folder. Alter it so that it reads as follows:
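```
FROM ubuntu:24.04
# copy the downloaded packages from the host into the image
COPY packages/ /root/packages/
WORKDIR /root/packages
# install nano from the local .deb; the exact file name will vary by version
RUN dpkg -i nano_7.2-1_amd64.deb
```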
Within this folder, execute sudo docker build -t ubuntu:24.04-nano-local . . Take note of the inclusion of sudo. This might be necessary to ensure the files can be copied from the host's file system.
Create a new container and then take note that nano is installed.

Result of building an image from the new Dockerfile. Take note that nano is ready to be used.
Dockerfiles and building packages from source
Recall the benefits provided by the different tiers of package managers. The high-level package manager apt acts as a front-end for dpkg and proactively installs any dependencies that dpkg finds a system is missing. dpkg confirms that dependencies are in place and passes the actual installation of a program or library on to dpkg-deb.
There are times when no .deb package exists for dpkg-deb to make use of! This typically occurs in the case of niche software. For example, an organization may develop their own tooling and decide not to package it for wider use, or to keep the package distribution-agnostic. When this occurs, compiling from source is what is typically done.
Compiling from source leverages a set of tools to build an executable binary that represents the program. Thinking abstractly, these tools use the source code and compile it into a language that can be more easily interpreted by a computer's architecture.
It's worth noting that package managers cannot be made without these tools; indeed, creating a package file that a package manager operates on is also often reliant on them. The requirement of these tools is only temporary; the result of building from source is a binary that is not reliant on any of the programs which produced it. For example, the binary for dpkg in the Ubuntu image is located at /usr/bin. When a user types dpkg into the shell, the PATH environment variable is used to locate the binary within this folder.
Reconsider nano. Package managers so far have been leveraged to place its binary somewhere within /usr. Nano's website also gives the option to download an RPM package, so that a binary built for the Fedora Linux distribution can be placed into some /bin subdirectory within its /usr directory. Nano also provides source code to allow one to build the binary from source.
Taking a look at the Installation and Configuration section within nano's website, a simple set of steps is given guiding one through building from source. These steps are typical for Linux programs. They involve:
- Using tar to extract a compressed archive of the program's source code.
- Running a configuration shell script (.sh) which gathers information about the system and its dependencies. The script considers this system information and produces a Makefile.
- Running the program called make with respect to the Makefile, which compiles the source code into a binary.
- Placing the resultant binary into the correct system folder while also setting up environment variables to allow the usage of the program.
These bullet points should seem familiar. Both the concept and naming convention of the Makefile are very close to those of the Dockerfile. Compiling from source under different conditions will produce different Makefiles; different computers may have different versions of libraries, programs, and even operating systems which affect this. The advantage of Docker is emphasized when observing that, while compiling from source within a Docker image, the resultant Makefile will be the same regardless of the underlying hardware and software of the host machine.
While making use of package managers to install nano, no dependencies were required. On the surface, this remains true for nano while building from source. If one were to create a new Docker image using Ubuntu as a base image where the source archive is copied to the container's file system, the first step of extracting the archive would succeed because Ubuntu already has tar installed. The next step of running the configuration script would fail. It would not produce a Makefile, as various dependencies for the tools required to compile are missing. Furthermore, make itself is missing. Running make in the shell will confirm this, where the shell produces the output bash: make: command not found.
Yes, nano does not require any dependencies to run, but it requires dependencies in order for it to be built. The build tools that are needed to compile nano from source are missing.
To explore how to build a package from source, three different contexts will be given. The first assumes that using a high-level package manager is a valid option for installing the tools required to build from source. The second assumes that a low-level package manager can be leveraged for these tools. The third uses a different Docker base image which already includes the build tools.
Installing dependencies by external source
Assume a scenario where nano is not distributed such that it can be installed with a package manager. Assume that the system does indeed have access to a high-level package manager such as apt. This makes the process convenient, as one can simply leverage apt to download and install the tools required to build nano from source.
In order to compile something from source, a system will need to have access to the source code. This will require a space in which the source code can be copied to a container.
Create an empty directory. Within it, create a sub-directory called 'archives'. Download nano from Berkeley's nano mirror page. Specifically, download nano-8.2.tar.gz. Place the tar.gz archive into the archives sub-folder.

Downloading nano-8.2 and placing it in an adequate subdirectory.
Knowing which tools to install
Looking through the build instructions for nano, one could conclude that the only tool required is make. This is a reasonable conclusion. But if one were to build a Dockerfile based on this conclusion, where there is an attempt to compile the source code, the build will not succeed. While Docker attempts to build the image, a portion of the base-image's stdout will be exposed to the user, revealing the following:

The attempt at building an image from a Dockerfile which leads to an error.
The Dockerfile that was being built here is as follows:
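```
# reconstruction of the failing attempt: only make is installed,
# so the RUN ./configure step will fail
FROM ubuntu:24.04
RUN apt update
RUN apt install make -y
COPY archives/ /root/archives/
WORKDIR /root/archives
RUN tar -xf nano-8.2.tar.gz
WORKDIR /root/archives/nano-8.2
RUN ./configure
RUN make
RUN make install
```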
The truncated error received from the base-image's environment gives enough information in this case, but there are situations in which far more information is written to stdout and this truncated view becomes an obstruction.
How can this be navigated? As was the case in prior examples, attaching to a base-image and exploring which steps to take in a live environment is a surefire way of becoming acquainted with knowing what instructions to put in a Dockerfile.
Attach to a container which is built using ubuntu:24.04 that has access to the archive. This can be done by creating a minimal Dockerfile that includes FROM ubuntu:24.04 and COPY archives/ /root/archives, or by creating the container with a mounted folder using docker run -dit -v ./archives:/root/archives ubuntu:24.04.
Once attached, run apt update followed by apt install make. Navigate to /root/archives. Extract the archive using tar with the command tar -xf nano-8.2.tar.gz. Here, in the live environment, the configuration script can be run and its output examined more closely. Execute the configuration file within the extracted folder.

The error output of the attempt at building from source within a running container.
The tail of the information output to stdout from running configure reads along these lines (abridged):
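```
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... no
checking whether make sets $(MAKE)... yes
checking for gcc... no
checking for cc... no
checking for cl.exe... no
checking for clang... no
configure: error: in `/root/archives/nano-8.2':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details
```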
Taking a look at each line, it seems that the configuration script checks for certain packages and programs. If the package is in place, its name is output. If it is not in place, then a simple "no" is noted.
It's encouraged that the reader look up the packages that are noted with a "no". Not all of these are necessary. For example, cl.exe pertains to something that can only exist within a Windows operating system environment. Pay attention to how gcc, cc, and clang are related.
The remaining packages that are required for this build to succeed are gawk, gcc, and clang. Run apt install gawk gcc clang.
Once completed, run the configuration script with ./configure within the directory of the extracted archive. Once the configuration script is finished, run make. Once make is finished, run make install.
Upon completion, nano can be used!

Nano has successfully been installed.
The steps to allow building nano from source are now known. From this information, the Dockerfile can be built.
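Transcribed, the Dockerfile reads along these lines:

```
# dependencies are installed via apt; nano is then built from source
FROM ubuntu:24.04
COPY archives/ /root/archives/
RUN apt update
RUN apt install make gawk gcc clang -y
WORKDIR /root/archives
RUN tar -xf nano-8.2.tar.gz
WORKDIR /root/archives/nano-8.2
RUN ./configure
RUN make
RUN make install
```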
Installing dependencies as local packages
On top of assuming a scenario where nano is not distributed such that it can be installed with a package manager, also assume that the system does not have access to a high-level package manager such as apt. This would facilitate a need to install the dependencies of clang, gawk, gcc, and make from local packages.
There are many reasons why such an environment may exist. This may be on account of an organizational decision involving the use of internal packages that aren't published outside the organization. Alternatively, an organizational decision may dictate that an attempt should be made to reduce external bandwidth incurred by calling the apt repositories. This is relevant when considering the multitude of containers that may be created from the same Dockerfile. These containers may exist on different sets of machines. For each machine, a call to the repositories is required and each dependency is subsequently downloaded. Imagine how this bandwidth may add up when deploying something like 50 containers across a set of load balancers.
Reconsider a call to apt install gawk gcc clang. The dependencies for these packages are numerous in both quantity and size: 142 new packages were required, totaling 241 MB of archives that needed to be downloaded. This is just for the sake of getting a program as simple as nano onto the machine!

The long list of dependencies required for gawk, gcc, and clang. 142 dependencies totalling a 241 MB download.
When building the resultant image from a Dockerfile, the leg of the process which took the longest was certainly downloading the packages from the repositories. It may be in the best interest of an organization to download these packages once and then distribute them within an internal network. This will save time and bandwidth both from the perspective of the organization and the maintainers of the apt network.

Building an image from some Dockerfile which includes apt install gawk gcc clang. Observe the line which reads [ 4/10] RUN apt install make gawk gcc clang -y and its associated runtime.
Taking this stance leads to needing to solve the problem of knowing which dependencies to install. Ubuntu is a Debian-based Linux distribution; both apt and dpkg make use of .deb files to configure the operating system and place binaries. An exploration of Debian's manpages is a place to start. Manual inspection shows that the dependencies of make wouldn't take too much effort to gather: make is only reliant on libc6, a package library already installed on the ubuntu:24.04 base image.
Diving into the dependency chain for gawk shows a bit more of a complicated picture. On the surface level, gawk is dependent on five packages: libmpfr6, libgmp10, libreadline8t64, libsigsegv2, and libc6. These packages in turn have dependencies of their own that may or may not be installed.
Using dpkg -S to determine whether each dependency is in place shows that only libmpfr6, libreadline8t64, and libsigsegv2 are missing from the above group of five. In terms of recursive dependencies, libreadline8t64 requires another missing dependency: readline-common.
Making use of Debian's manpages in conjunction with dpkg's search feature becomes more and more cumbersome as the size of a given dependency tree increases, to the point of being too time consuming. Consider the effort that would be required to gather the dependencies of gcc:
The dependency tree for gcc.
How can this process be automated? Intuitively, one could build a web crawler that takes a breadth-first-search of a package's dependencies using Debian manpages whilst also running dpkg -S for each dependency discovered in the dependency tree.
Developing a breadth-first search is beyond the scope of this writing. It is something to consider should apt absolutely not be an option. Instead, an alternative approach will be taken which leverages apt's ability to download .deb files without installing them. This is adequate for the scenario in which an organization wants to reduce calls to the apt repositories.
On the host machine, create an empty directory called "packages". This directory will be used to collect the required dependencies. Use Docker to create a container using the ubuntu:24.04 base-image while using the "packages" subfolder as a mount-point. Attach to the resultant container.
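For example:

```
docker run -dit -v ./packages:/root/packages ubuntu:24.04
docker attach <container-id>
```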

Creating a new Docker container with a mount to a packages directory which will contain the relevant .deb dependencies.
Run apt update. When this is complete, run the following command: apt-get install --download-only gcc -y. Running gcc once this is completed will show that it hasn't been installed. Using ls within the current working directory or the directory located at /root/packages won't expose these packages. Where were they downloaded?
The .deb files currently reside within /var/cache/apt/archives/. These need to be moved into the packages folder. Within packages, create a sub-directory using mkdir /root/packages/gcc. Then move these archives into the gcc folder using mv /var/cache/apt/archives/*.deb /root/packages/gcc. Run ls /root/packages/gcc to confirm the move was successful.
Repeat the process of making a new sub-directory within /root/packages, downloading the relevant packages using apt-get, and moving said packages into the sub-folder for make, gawk, and clang.
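This repetition lends itself to a small loop within the container's shell, for instance:

```
# download each package's .deb files into its own sub-directory
for pkg in make gawk clang; do
  mkdir /root/packages/$pkg
  apt-get install --download-only $pkg -y
  mv /var/cache/apt/archives/*.deb /root/packages/$pkg
done
```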

Using wc to confirm the quantity of files that exist in each of the packages' subdirectories.
Since the packages folder is a mounted folder, the host machine can now copy it wherever it pleases. This means that these packages can be included in a COPY statement within a Dockerfile.
Before exiting the container, install the packages to get a sense of what exactly should be included within the Dockerfile. Luckily, dpkg allows the batch installation of .deb files where a user need not worry about the order in which they are installed, so long as no Pre-Depends exist. Within /root/packages, run dpkg -i gcc/*.deb. Once this is finished, run gcc to confirm its installation.

Confirming the success of installing gcc by attempting to run it.
Do the same for the remaining primary dependencies. Starting with gawk, an issue arises while running dpkg: stdout concludes with a report that dependency problems prevented the configuration of gawk, leaving it unconfigured.
This is due to gawk having dependencies that qualify for Pre-Depends status; that is, they must be fully installed before gawk. This can be confirmed by running dpkg -i /root/packages/gawk/*.deb once again. This time the dependencies are already installed, allowing gawk to succeed in its installation.
This facet will need to be considered when building a Dockerfile. Within the packages sub-directory, rename the gawk sub-folder to gawk-dependencies. Then make a new directory called gawk and move gawk_1%3a5.2.1-2build3_amd64.deb to it.

Reorganizing the directory structure of the gawk dependencies. gawk is in its own folder while the collection of dependencies are in another. This creates an environment where it's easier to install one grouping of packages, (the dependencies), before the other group, (the program).
Moving onto clang, another problem arises when attempting to run dpkg on the resultant set of .deb files. The batch concludes with an "Errors were encountered while processing" message naming several of the python3 and llvm packages.
Piping ls to grep for each of the packages noted within the error message shows that these packages do indeed exist within /root/packages/clang. Try installing each of the packages from the error output individually.
An attempt at installing python3_3.12.3-0ubuntu2_amd64.deb presents a dependency error: python3 pre-depends on python3-minimal. Performing a search using ls | grep -i minimal shows that three minimal packages exist within this set of packages. This is a flag that python3-minimal_3.12.3-0ubuntu2_amd64.deb should be installed first. Do so using dpkg.
Having installed python3-minimal, the other packages which have a prefix of python3- are now installable. Moving onto the packages with the prefix of llvm-, if one tries to install llvm-18-dev, another dependency error is revealed: llvm-18-dev depends on llvm-18-tools. A grep search reveals the existence of llvm-18-tools_1%3a18.1.3-1ubuntu1_amd64.deb within the clang packages folder. Install this individual package using dpkg. Follow up with the installation of the correct llvm-18-dev file.
Intuitively, this seems to have the effect of putting all the ducks in a row. The python packages included in the error output were dependent on python3-minimal. This is also true of llvm-18-tools, which was in turn a dependency of llvm-18-dev. Keep in mind the context of the running container: dpkg -i *.deb is not transactional, meaning the packages that didn't result in an error were still installed. This leaves behind an environment containing the set of dependencies that allows the installation of the packages which errored on the first attempt. This environment will need to be recreated when building the Dockerfile.
Thus, a set of folders needs to be set up within /root/packages to emulate this environment. Within /root/packages, rename clang to clang_dependencies-tier1. Make two new directories called clang_dependencies-tier2 and clang. From clang_dependencies-tier1, move the following files into clang_dependencies-tier2:
- llvm-18-dev_1%3a18.1.3-1ubuntu1_amd64.deb
- llvm-18-tools_1%3a18.1.3-1ubuntu1_amd64.deb
- python3_3.12.3-0ubuntu2_amd64.deb
- python3-minimal_3.12.3-0ubuntu2_amd64.deb
- python3-pkg-resources_68.1.2-2ubuntu1.1_all.deb
- python3-pygments_2.17.2+dfsg-1_all.deb
- python3-yaml_6.0.1-2build2_amd64.deb
After placing the above files, move the following file from clang_dependencies-tier1 into clang: clang_1%3a18.0-59~exp2_amd64.deb

Reorganizing the directory structure of the clang dependencies. This involves a two-tier approach where the dependencies are separated into two groups making it easier to install the packages in order.
With these packages in place, run the following:
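```
# the dependencies were installed during the troubleshooting above;
# only the clang package itself remains
dpkg -i /root/packages/clang/*.deb
```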
clang should now be installed! Run clang in the shell to confirm. Once confirmation is made, install the packages within the make sub-directory of the packages folder.
This container is now in a state in which nano can be built from source! This needs to be abstracted such that the process can be applied to a Dockerfile. Exit the running container and create an empty directory on the host machine. Within this directory, copy the packages from the mounted folder. Create a second sub-folder called "archives" which contains the nano-8.2.tar.gz archive. Lastly, create a Dockerfile.

Using wc to confirm the quantity of files that exist in each of the packages' subdirectories.
Alter the Dockerfile such that it reads as follows:
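```
# a sketch following the directory structure assembled above;
# the dpkg ordering mirrors what was worked out in the live container
FROM ubuntu:24.04
COPY packages/ /root/packages/
COPY archives/ /root/archives/
WORKDIR /root/packages
RUN dpkg -i make/*.deb
RUN dpkg -i gcc/*.deb
RUN dpkg -i gawk-dependencies/*.deb
RUN dpkg -i gawk/*.deb
RUN dpkg -i clang_dependencies-tier1/*.deb
RUN dpkg -i clang_dependencies-tier2/*.deb
RUN dpkg -i clang/*.deb
# build nano from source using the locally installed tools
WORKDIR /root/archives
RUN tar -xf nano-8.2.tar.gz
WORKDIR /root/archives/nano-8.2
RUN ./configure
RUN make
RUN make install
```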
Build this image using docker build -t ubuntu:24.04-nano-source . and use docker run to create a container and check to see that nano is operational!
Using a base image with dependencies already included
The prior section has shown that a good amount of effort needs to be put into building a Dockerfile when a high-level package manager is not an option. The manual process of discovering dependencies and determining the order in which they should be installed requires a lot of time. As mentioned, this process could be automated with a script using a breadth-first search traversal of the dependency tree. Building such a script will likely be the subject of some other project note, as it is out of the scope of this current writing.
Not being able to build such an algorithm does not preclude finding another way to save time. Continue to assume an environment where a high-level package manager is not an option, and operate on the premise that time should not be wasted wrangling dependencies for the deployment of a piece of in-house software that needs to be built from source. What approach can be taken?
So far, ubuntu:24.04 has been used as the base-image for all the containers built in this tutorial. This minimal instance of the operating system does not come preinstalled with the dependencies necessary to build packages from source. Ubuntu is not the sole base-image, though. There exist base-images for most mainstream distributions, and there exist base-images which scaffold on top of these.
DockerHub can be leveraged to find a suitable base-image. One just needs to know what problem they're trying to solve. In this case, an image that is able to build programs from source is required.
Providence would have it that somebody on DockerHub has addressed the same problem. There exists the gcc image for the purpose of compiling and building programs. Upon inspection of the tags tab, this image is also Debian-based; it scaffolds on top of the minimal Debian operating system in a similar vein to Ubuntu. This is evidenced by the bookworm tag, which corresponds to the latest release of Debian at the time of this writing.
Let's put this base-image to use in building nano from source. Firstly, run docker pull gcc:bookworm on the host machine to pull the base-image from DockerHub. Once this is complete, ensure that the current working directory has access to a folder called 'archives' that contains the nano-8.2.tar.gz archive. Use docker run -dit -v ./archives:/root/archives gcc:bookworm to create a running container using this base-image and folder mount point. Attach to the new container.

Pulling gcc:bookworm from DockerHub and using a container from the image.
Within the container, navigate to /root/archives and extract the archive using tar -xf nano-8.2.tar.gz. Change directory into the resultant nano-8.2 folder and run the configuration script with ./configure.
Observe that running configure completes without a hitch and a Makefile is created. This is because the base-image has what's required for building packages from source. Finish up by running make followed by make install. Once completed, run nano to see that its installation was successful.
The Dockerfile that can be pieced together from the above procedure is as follows:
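```
FROM gcc:bookworm
COPY archives/ /root/archives/
WORKDIR /root/archives
RUN tar -xf nano-8.2.tar.gz
WORKDIR /root/archives/nano-8.2
RUN ./configure
RUN make
RUN make install
```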
Using local Docker images
Throughout this page, a set of Docker images has been built. Each of these images has been given an arbitrary name and tag, such as ubuntu:24.04-with-nano, ubuntu:24.04-nano-local, and ubuntu:24.04-nano-source. On top of those with arbitrary names, various images have been pulled from DockerHub with their own predefined names and tags, including ubuntu:24.04 and gcc:bookworm. These can all be viewed by using docker images within the command-line shell.
As each subsequent image was built, it was assumed that the context of the deployment environment changed. ubuntu:24.04-with-nano, for example, had access to apt to get all packages. ubuntu:24.04-nano-local had access to apt, but used dpkg to install nano from a .deb package file. ubuntu:24.04-nano-source was built under two different assumptions about how the dependencies required to build nano from source would be obtained. The first considered apt an applicable source for these dependencies. The second made use of dpkg to install .deb files that were downloaded using apt.
The motivation for using apt to download .deb files for the required dependencies was to have a basket of files that could be passed around on an internal network. Sharing these files with containers that exist on the internal network saves external bandwidth for an organization deploying them while also preventing any stress that multiple pulls may cause on the apt repositories.
There still is room to alleviate excess usage of external bandwidth. The act of pulling a base-image from DockerHub itself can compound. Each theoretical machine within an organization is still accessing the Docker repositories to retrieve ubuntu:24.04 or gcc:bookworm.
Consider the image that would be built from the following Dockerfile:
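```
# gcc-equiv: an Ubuntu image carrying the same build tools
# that were used from gcc:bookworm
FROM ubuntu:24.04
RUN apt update
RUN apt install make gawk gcc clang -y
```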
The resultant image would represent an environment capable of building nano from source, correct? This image would satisfy the expectations of gcc:bookworm as it has been used in this tutorial. Go ahead and build an image based on this Dockerfile. Name it gcc-equiv.

Building an image from a Dockerfile which has the programs that are used from gcc:bookworm.
Now set up a directory which contains archives/nano-8.2.tar.gz. Within this directory create a Dockerfile that reads as follows:
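```
FROM gcc-equiv
COPY archives/ /root/archives/
WORKDIR /root/archives
RUN tar -xf nano-8.2.tar.gz
WORKDIR /root/archives/nano-8.2
RUN ./configure
RUN make
RUN make install
```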
Take note that this Dockerfile is almost exactly the same as the one which leveraged gcc:bookworm; it instead references gcc-equiv as the base-image. This exposes a behavior inherent in the FROM keyword: it will leverage any Docker image that exists on the system before attempting to pull from DockerHub.
Build this image as gcc-equiv:nano.

Creating a new container using the image that is an equivalent to the gcc:bookworm image.
There now exists another image that is functionally similar to the others made in this tutorial! That is, it exhibits the behavior we expect of it - it has nano and a set of build packages.
DockerHub is still being tapped into. When porting this Dockerfile to other pieces of hardware, DockerHub will still be queried for the originating ubuntu:24.04 image. How can this be mitigated?
Within the command-line shell of the host machine, run docker image save -o nano-image.tar.gz gcc-equiv:nano. This will save the image onto the host machine's filesystem as an archive! This archive can be distributed within an internal network, negating any external repository access.
To test this local image, remove gcc-equiv and gcc-equiv:nano from the host machine by running docker rmi gcc-equiv:nano gcc-equiv. Once these two images are removed, run docker load -i nano-image.tar.gz.

Loading a Docker image from local file archive and then attaching to it.
Conclusion and future considerations
In general, this page has described how to set up a suitable environment within a running Docker container from some base-image. The efforts within these live containers were then transcribed to a Dockerfile which can be used to create containers with that environment built in. An iterative approach was taken, where setting up the same environment became progressively more complex.
Understanding that the same problem can be solved many different ways is a core component to the study of Computer Science. A good programmer will consider this as they approach a given problem. The typical Docker tutorial only gives singular absolutes; they do not satisfy the curiosity of knowing the extent of how and why a Dockerfile is put together. They only satisfy the need to run a containerized node.js server.
One may rebut that this tutorial only succeeds in informing a user how to build a container that runs nano, something that is much less useful. This would be an obtuse assertion. The different approaches taken through the iterative process guide a reader into recognizing that the workflow generalizes beyond the deployment of nano; different programs install through these same processes.
That being said, this tutorial exists to guide a reader to a better understanding of how the Dockerfile fits into the process of building Docker images. Hopefully it has become much less daunting after working through this page.
There are lingering observations to be made in this tutorial that I will leave to the reader. The following are worth thinking about to help shore up general understanding:
- Looking at the list of images using docker images, there is a discrepancy between the SIZE of each of these images. ubuntu:24.04-nano-local is significantly smaller than ubuntu:24.04-nano-source. Why is this? What can be added to the Dockerfile to bring ubuntu:24.04-nano-source closer to the size of ubuntu:24.04-nano-local?
- Why is it that the host system's disk usage doesn't always increase at a rate that correlates to the SIZE of a Docker image when a new image is created?
- How could splitting RUN apt install clang gawk gcc make into a separate RUN statement per package be an advantage within a Dockerfile? (This relates to cache efficiency.)
- For practice, make an image from a Dockerfile which has both nano and tmux installed. Try the approaches of using apt and dpkg to install tmux while also trying an approach which builds tmux from source.