An Advanced, Practical Introduction to Docker
This is an advanced, practical introduction to Docker. It's mainly for people who have used Docker, but want a deeper understanding. It's similar to, after finishing Calculus I, II, III, and DiffEq, taking Advanced Calculus to go back over limits, derivatives, and integration at a deeper level.
There's no such thing as an "operating system" in a Docker container. There's no such thing as being "in" a Docker container. What you call a Docker "container" is just an abstraction.
Docker "containers" aren't like virtual machines; you're not creating a general purpose environment with its own kernel. What you're doing is creating a runnable, deliverable Docker binary that contains the minimum that's needed to run a single application. You're delivering an application, not an environment. When you hear "container", think "application". When you hear "image", think "binary".
Docker hides so many of the underlying mechanics that you get the impression you're dealing with lightweight virtual machines. We should be grateful for the fact that we're victims of Docker's success.
The basic idea behind Docker is that Linux already has the capabilities for creating isolation, they just needed to be harnessed in a user-friendly manner. Docker is largely a front-end that abstracts already existing Linux cleverness, including namespaces (isolation) and cgroups (resource utilization).
The session "What Have Namespaces Done for You Lately?" by Liz Rice helps to demonstrate this concept; she effectively builds her own Docker-like tool from the ground up using Go (which is what Docker is written in!)
When you're running what is colloquially known as a Docker "container", you're running a process just like any other process, but with a different namespace ID. This namespace concept is the same concept you already know from C# and C++: it separates entities so they don't conflict. Thus, in Linux, you can have process ID 1 in one namespace and process ID 1 in another. They aren't in different environments like virtual machines. They're isolated, but not entirely separate.
Namespaces also let you have /some/random/file in one namespace and a different /some/random/file in another namespace: think super-chrooting. You can even have something listening on port 80 in one namespace and something entirely different listening on port 80 in a different namespace. No conflicts.
There's just a lot of namespace magic to give the illusion of various "micro-machines". In reality, there are no "micro-machines"; everything is running in the same space, but with a simple label separating them.
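To see that "label" for yourself, here is a minimal sketch that reads the namespace identifiers the kernel attaches to a process. It assumes a Linux host with /proc mounted; the function name show_namespaces is made up for illustration.

```shell
#!/bin/sh
# Print the namespace identifiers the kernel has attached to a process.
# Usage: show_namespaces [PID]   (defaults to the current shell)
# show_namespaces is a hypothetical helper name, not a standard tool.
show_namespaces() {
    pid="${1:-$$}"
    for ns in pid net mnt uts ipc; do
        # Each entry is a symlink whose target looks something like "pid:[4026531836]".
        readlink "/proc/$pid/ns/$ns"
    done
}

show_namespaces
```

Two processes that print the same identifier for a given type share that namespace; a process started by Docker will normally print different ones.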
The term "container" and the preposition "in" lead to extreme confusion. There's nothing "in" a container, but the terminology is pretty much baked into the industry at this point. Note also, you never run something "in" Docker, but you can run something "using" Docker.
One way to prove to yourself that there's no voodoo subsystem is to look at how ps works on your machine: you see the processes across each namespace. You may be running Elasticsearch and MongoDB as separate Docker "containers", but both of them will show up in the same ps output on your host machine.
See example below:
    [dbetz@ganymede src]$ docker run -d mongo:3.7
    fa55205ad518da6fe61a794732c325263c96d6a10a5692fa6ea9821c4bbcfc79
    [dbetz@ganymede src]$ docker run -d docker.elastic.co/elasticsearch/elasticsearch:6.2.4
    c7e337b67689831da6beb78231394ff2bcb9341e60689187f14d579650027d5e
    [dbetz@ganymede src]$ ps aux | grep -E "mongo|elastic"
    polkitd 43993 1.1 0.7 985104 56008 ? Ssl 15:11 0:01 mongod --bind_ip_all
    dbetz 45304 74.0 14.3 4040776 1145040 ? Ssl 15:13 0:02 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.uHJg0AmQ -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
A solid grasp of namespaces is critical to understanding Docker. Once you understand namespace concepts, you can move to understanding how namespaces can interact with each other. That's the larger world of Docker that extends deep into the design and deployment of orchestration.
To further review and reframe Docker concepts, let's recognize some of the resource types Docker uses. For the purpose of this discussion, let's use Azure's provider categories. This should keep the concepts general enough for reuse and specific enough to be practical.
The different resource types are:
- compute (e.g. processes)
- storage (e.g. mounted file-systems)
- networking (e.g. listening ports)
There are others as well, but they're usually very similar to these (e.g. IPC is similar to networking).
When you spin something up using Docker (e.g. docker run), it will have everything in its own namespaces: the process, storage, and networking. You manage the mapping between namespaces yourself, per resource type.
Let's review with an example.
Run Elasticsearch ("ES"):
docker run -d docker.elastic.co/elasticsearch/elasticsearch:6.2.4
ES will run in its own process (PID) namespace. It will listen on port 9200 in its own network namespace. It will store data at /usr/share/elasticsearch/data in its own mount namespace. It's entirely sandboxed.
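One way to check this claim is to compare the namespace identifiers of the process Docker started against those of your own shell. What follows is only a sketch: the function names are made up for illustration, it relies on docker inspect reporting the host PID through the .State.Pid field, and reading another user's /proc entries usually requires root.

```shell
#!/bin/sh
# Host PID of the main process Docker started for a given container ID or name.
# (container_host_pid and compare_namespaces are hypothetical helper names.)
container_host_pid() {
    docker inspect --format '{{.State.Pid}}' "$1"
}

# For each namespace type, report whether that process shares it with this shell.
compare_namespaces() {
    target_pid="$(container_host_pid "$1")" || return 1
    for ns in pid net mnt; do
        mine="$(readlink "/proc/$$/ns/$ns")"
        theirs="$(readlink "/proc/$target_pid/ns/$ns")" || {
            echo "$ns unreadable (try as root)"
            continue
        }
        if [ "$mine" = "$theirs" ]; then
            echo "$ns same"
        else
            echo "$ns different"
        fi
    done
}
```

Calling compare_namespaces with the ID that docker run -d printed should report the pid, net, and mnt namespaces as different from your shell's, unless you started the process with options that share them with the host.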
To make ES practical, you need to map 9200 to something that can touch your network card and /usr/share/elasticsearch/data to something in a less ephemeral location.
Here's our new command:
docker run -d -p 9200:9200 -v /srv/elasticsearch6:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:6.2.4
The point of reviewing these fundamental concepts is to further train your intuition in terms of namespaces. It's important to have this intuitive training before going too deep into more Docker concepts like volumes or networks. Without training your intuitions to work in terms of namespaces, you'll inevitably end up with confused analogies with virtual machines, inefficient images, overly complex deployments, and unbelievably confused discussions.
On the other hand, with this understanding, it should be easy to understand that Docker represents applications, not operating systems. There are no kernels in Docker binaries or in "containers" just as there are no network drivers in your application's tarballs or zip files.
Namespaces are clever and very helpful. If you were to write a plug-in model for an application, you could instantiate each plug-in into a different namespace, then share an IPC namespace for communication. Supposedly, Google Chrome on Linux does something similar. Namespaces give you an easy, built-in way to do jailing/sandboxing.
Consider also this: because Docker spins up processes just like any other process, each process has the same direct hardware access. Once you do a few device mappings to let the process in that namespace know where the real-world hardware is, you're solid. So, you don't have to put too much thought into how to set up GPU access. Consider this in the context of this very confused SO discussion where people continue to cause confusion by talking about something "inside Docker containers". There is no "inside"; it's a process like any other.
Run man unshare on Linux to see the details for a native tool that creates namespaces.
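As a quick illustration, the sketch below uses unshare to start a shell in new user and PID namespaces and asks that shell for its own PID. It assumes the util-linux version of unshare and a kernel that permits unprivileged user namespaces; where that is disabled, the command fails and you would need root instead. The function name is made up for illustration.

```shell
#!/bin/sh
# Start a shell in fresh user and PID namespaces and print the PID it sees for itself.
# (pid_in_new_namespace is a hypothetical helper name.)
pid_in_new_namespace() {
    # --user --map-root-user : new user namespace, so sudo is not required
    # --pid --fork           : new PID namespace; the forked child is its first process
    unshare --user --map-root-user --pid --fork sh -c 'echo $$'
}

echo "PID of this shell:               $$"
echo "PID seen from the new namespace: $(pid_in_new_namespace)"
```

The first process in a new PID namespace is PID 1, so the second line should report 1 even though the host sees the same process under an ordinary, much larger PID. No Docker is involved at any point.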
Docker "containers" are created from binaries called images. Images are merely Docker binaries of your application just like any other deliverable binary format (e.g. binary tarball).
Images are nothing more than file-system layouts with some metadata. The blueprint that provides instructions on how to build the file-system layout for an image is a Dockerfile.
This file-system will contain the application binary you want Docker to run. When your application runs, it may reach for various files (e.g. /lib64/libstdc++.so.6); these files just need to be where the application would expect them.
The Dockerfile also provides metadata that describes the resulting binary. It also adds an instruction for how to start your application (e.g. CMD or ENTRYPOINT).
The most important concept reframe here is this: the resulting Docker image is your complete deliverable application binary. It does not represent a system, just a single application.
Take care to avoid large multi-level image inheritance for the sake of "standardization". Standardization is the exact opposite of what you want with Docker. Tailor the deliverable to your specific application's needs.
Image Starting Points
Your application will run like any other application on your system. As such, it will follow the same rules of dependencies as any other application: if your application needs a file in order to run, you need to make sure it's within grasp. A solid understanding that these file-systems exist in different namespaces, instead of different subsystems, enables flexible ways of satisfying dependency requirements.
For example, if your machine already has a fairly large file (e.g. /lib64/liblarge.so.7), instead of putting it in each image, keep it on the host and map it at run time (-v /lib64/liblarge.so.7:/lib64/liblarge.so.7). When Docker sees the running application ask for /lib64/liblarge.so.7, it will get it from the host machine. This concept, similar to symlinking, is at the heart of some important techniques discussed later.
When creating images, one option you have is to create a file-system from scratch. This entails adding each and every file to the proper location in the image. Much of what follows a bit later will pursue this method and explain how to effectively create such lightweight images.
Another option you have is to build your file-system on an existing file-system template. This is the traditional approach most applications use. It maximizes portability, but the resulting images are larger, containing a huge number of unused files.
When not careful, this second approach leads to atrocious misunderstandings.
Consider the following Dockerfile:
    FROM ubuntu
    RUN groupadd user01 \
        && useradd --gid user01 user01
    RUN apt-get update && apt-get install -y sometools
    CMD [ "sometool" ]
This file could very well lead many to think that there's an Ubuntu operating system "base image" that you're using and extending. This is entirely wrong.
As mentioned previously, Docker is primarily a front-end for existing Linux functionality. There is no concept of a hypervisor subsystem or the like. Applications run as they have always run. There are no kernels in images, therefore there are no operating systems in images. Docker does not work with operating systems, it works with applications. There is no OS "base image". There is no place for sysadmins to do any work with Docker at all. Your ENTRYPOINT instruction does not start systemd, it starts your application.
Ubuntu is not in your image, only a file-system that looks like an Ubuntu file-system is in your image.
FROM ubuntu merely states that the Dockerfile will start with a file-system template that looks like Ubuntu. You use it when you don't care about the size of your image and really need your application to work in an Ubuntu-like file-system. If your host system is RHEL, your binaries still run on RHEL -- Docker does not deal with operating systems.
For the most part, using Linux OS file-system templates is a very poor practice. They are largely not optimized for Docker. However, there is one OS file-system template that is optimized for Docker: Docker Alpine.
Docker Alpine provides a very small OS file-system template that maximizes application portability while minimizing binary size.
The previous Dockerfile would transform to Docker Alpine like this:
    FROM alpine
    RUN addgroup -g 1000 user01 && \
        adduser -D -u 1000 -G user01 user01
    RUN apk add --no-cache sometools
    CMD [ "sometool" ]
The resulting binary would be much smaller. Yet, keep in mind that Alpine is not in your Docker image. Docker does not put operating systems into images. Your image is merely built on a file-system template that looks like an Alpine Linux file-system.
Do not confuse Docker Alpine with Alpine Linux. The former is a file-system template that looks like an Alpine file-system, while the latter is an operating system for routers, tiny Linux deployments, and Raspberry Pi.
When creating portable images without extensive binary optimization, Docker Alpine is the only viable option. Do not use FROM centos or FROM ubuntu in any environment. These lead to extremely large images and cause severe confusion.
This bears repeating: the entire point of Docker is to run your application. To do this, you just need to make sure your application has what it needs to run. The question is not "What do I build my application on?", the question is "What specific files does my application require?" Your application most likely doesn't need 90% of the files that a Linux file-system template provides, it probably just needs a few libraries. It may not even need the full XYZ library, but just file Y.
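For a dynamically linked binary, one rough way to answer "What specific files does my application require?" is to ask the dynamic loader. This is only a sketch: it assumes ldd is available on the machine, list_deps is a made-up helper name, and it only finds shared libraries, not configuration or data files your application opens on its own.

```shell
#!/bin/sh
# List the absolute paths of the shared libraries a dynamically linked binary loads.
# Only run ldd on binaries you trust: some implementations execute the target.
# (list_deps is a hypothetical helper name.)
list_deps() {
    ldd "$1" 2>/dev/null | awk '
        # lines shaped like: libfoo.so.1 => /path/to/libfoo.so.1 (0x...)
        $2 == "=>" && $3 ~ /^\// { print $3; next }
        # lines shaped like: /lib64/ld-linux-x86-64.so.2 (0x...)
        $1 ~ /^\//               { print $1 }
    ' | sort -u
}

# Example: the libraries the system shell depends on.
list_deps "$(command -v sh)"
```

Each path it prints is a file that has to be within reach of the running application, whether you place it in the image or map it from the host at run time. A statically linked binary prints nothing.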
Docker lets you optimize your application like this. If you can identify the dependencies of your application, you'll be able to build your file-system from scratch.
At runtime, you're working with namespaces. At build time, you're working with images. Your ability to create optimal images is directly proportional to your understanding of namespaces and your application.
Let's jump right to an example of creating a tiny, usable Docker image...
First let's look at the hello.asm file we want to run (taken from http://cs.lmu.edu/~ray/notes/x86assembly/):
            global  _start

            section .text
    _start:
            ; write(1, message, 13)
            mov     rax, 1          ; system call 1 is write
            mov     rdi, 1          ; file handle 1 is stdout
            mov     rsi, message    ; address of string to output
            mov     rdx, 13         ; number of bytes
            syscall                 ; invoke operating system to do the write

            ; exit(0)
            mov     eax, 60         ; system call 60 is exit
            xor     rdi, rdi        ; exit code 0
            syscall                 ; invoke operating system to exit
    message:
            db      "Hello, World", 10  ; note the newline at the end
Our goal is to create a tiny, deliverable Docker binary that writes-out "Hello, World".
Here's how we'll do it:
    FROM alpine as asm
    WORKDIR /elephant
    COPY hello.asm .
    RUN apk add --no-cache binutils nasm && \
        nasm -f elf64 -p hello.o hello.asm && \
        ld -o hello hello.o

    FROM scratch
    COPY --from=asm /elephant/hello /
    ENTRYPOINT ["./hello"]
This Dockerfile uses two stages: a build stage and a run stage. The first stage installs NASM, assembles the code, then links the application; the second carefully places the application into the deliverable Docker binary. The second stage contains a single file: /hello, copied from /elephant/hello in the first stage. It does not contain NASM, the source code, or any intermediate files.
You can use as many stages as you want: sometimes you'll need a CI-setup stage (setup tools), then a backend-build stage (setup node, run npm install), then a front-end build stage (build Angular), then a final stage to carefully place files from previous stages (copy /node_modules/ and Angular/dist files to node application). Only the final stage is deployed; everything else is thrown out.
This results in the following:
    [dbetz@ganymede tiny-image]$ docker build . -t local/tiny-image
    Sending build context to Docker daemon 4.096kB
    Step 1/7 : FROM alpine as asm
     ---> 3fd9065eaf02
    Step 2/7 : WORKDIR /elephant
    Removing intermediate container da8e9f72ebd2
     ---> 29896ad4bb3c
    Step 3/7 : COPY hello.asm .
     ---> 9ccc8ab38794
    Step 4/7 : RUN apk add --no-cache binutils nasm && nasm -f elf64 -p hello.o hello.asm && ld -o hello hello.o
     ---> Running in f99cbecc309d
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
    (1/3) Installing binutils-libs (2.30-r1)
    (2/3) Installing binutils (2.30-r1)
    (3/3) Installing nasm (2.13.01-r0)
    Executing busybox-1.27.2-r7.trigger
    OK: 17 MiB in 14 packages
    Removing intermediate container f99cbecc309d
     ---> d840e66bfbdb
    Step 5/7 : FROM scratch
     --->
    Step 6/7 : COPY --from=asm /elephant/hello /
     ---> fd85715eaf85
    Step 7/7 : ENTRYPOINT ["./hello"]
     ---> Running in e163f47f4d8a
    Removing intermediate container e163f47f4d8a
     ---> 1f30c749e8b9
    Successfully built 1f30c749e8b9
    Successfully tagged local/tiny-image:latest
    [dbetz@ganymede tiny-image]$ docker run local/tiny-image
    Hello, World
    [dbetz@ganymede tiny-image]$ docker image ls | grep "local/tiny-image"
    local/tiny-image latest 1f30c749e8b9 15 seconds ago 848B
It builds and it runs, and the entire binary is 848B.
The more functionality you add to your binary, the larger it grows. Your image size should remain somewhat proportional to your functionality. That's what you'd expect from a tarball, that's how you should think with Docker.
This means that you should be careful with what files go into your resulting binary. This means being careful with how you satisfy your application's dependency needs.
Would you really throw an entire Linux operating system into your tarball?
In the previous assembler example, we had an application with zero dependencies. When this is your situation, your Docker image size will be very near your application size. You want them to be as close as possible.
One popular way to satisfy this need is to use Go: this can output statically linked binaries that require zero dependencies. Go has many places where it fits nicely. You can see, for example, my recursivecall for Docker project. Docker itself is also written in Go.
Regardless, while Go is beautiful for many uses, you already have applications. Let's focus on deploying those applications via Docker, not rewriting them in Go.
For this next section, let's assume that our application requires /usr/lib64/libc.so.6. Our application will crash if it doesn't find it.
In this situation, we have two options based on your understanding of Docker namespaces:
- Copy /usr/lib64/libc.so.6 into the image with COPY
- Map /usr/lib64/libc.so.6 from the host machine to your running application's namespace
The first option can be accomplished with a multi-stage build with a simple COPY from the first stage:
    FROM ubuntu as os

    FROM scratch
    COPY ./runner /var/app/runner
    COPY --from=os /usr/lib64/libc.so.6 /lib64/
    CMD ["/var/app/runner"]
Remember, the Ubuntu stage will be thrown out, but, yeah, you should still try to use Alpine where possible.
This will create a portable binary; everything the application needs will be within reach.
As your portability increases, so does your binary size. When you need more than just a few files and you must maintain portability (e.g. posting to Docker Hub), it's time to use the Docker Alpine file-system template.
However, in the case where you control your environment, and thus don't require portability, the second approach may be better.
It allows you to provide a much simpler Dockerfile:
    FROM scratch
    COPY ./runner /var/app/runner
    CMD ["/var/app/runner"]
Instead of copying the dependency into the image, you tell Docker at run time to use a different namespace to satisfy the dependency.
Your build and run would look like this:
    docker build . -t local/runner
    docker run -v /usr/lib64/libc.so.6:/lib64/libc.so.6 local/runner
If you don't want to play around with each and every file, just map the entire /lib64/ directory:
    docker run -v /lib64/:/lib64/ local/runner
Since most libraries are loaded from /lib64/, this technique will account for a large percentage of your scenarios.
scratch with Node
Let's make this more real-world by manually building a Docker Node binary which our deliverable Docker application binaries will use.
    FROM alpine
    RUN apk add --no-cache curl && \
        mkdir -p /tmp/node && \
        mkdir -p /tmp/etc && \
        curl -s https://nodejs.org/dist/v8.11.2/node-v8.11.2-linux-x64.tar.xz | tar -Jx -C /tmp/node/
    RUN addgroup -g 500 -S nodeuser && \
        adduser -u 500 -S nodeuser -G nodeuser
    RUN grep nodeuser /etc/passwd > /tmp/etc/passwd && \
        grep nodeuser /etc/group > /tmp/etc/group

    FROM scratch
    COPY --from=0 /bin/sh /tmp/node/node-v8.11.2-linux-x64/bin/node /bin/
    COPY --from=0 /usr/bin/env /usr/bin/
    COPY --from=0 /tmp/etc/passwd /tmp/etc/group /etc/
    CMD ["/bin/node"]
The resulting deliverable Docker binary will contain the node binary, passwd/group, and env (as an example of copying something you may need in Node development).
The first stage downloads Node, creates a user and group, then simplifies /etc/passwd and /etc/group. Only the final stage represents the deliverable binary.
Build and run:
    docker build . -t local/node8
    docker run -it local/node8
Building and running results in the following error:
standard_init_linux.go:195: exec user process caused "no such file or directory"
Run it again with the mapping:
docker run -it -v /lib64/:/lib64/ local/node8
    [dbetz@ganymede node8]$ docker run -it -v /lib64/:/lib64/ local/node8 node
    >
Let's check the application version:
    [dbetz@ganymede node8]$ docker run -it -v /lib64/:/lib64/ local/node8 node -v
    v8.11.2
What's our size?
    [dbetz@ganymede ~]$ docker image ls | grep "local/node8"
    local/node8 latest 901c2740deb9 12 seconds ago 36.4MB
It's 36.4MB. Pretty small.
Your Docker application binary will only contain 36.4MB of overhead when you ship your product.
Using our binary
With the Docker Node image built, we can build our deliverable application binary.
    FROM node:8.11.2-alpine as swap-space
    WORKDIR /var/app
    COPY package.json /var/app/
    RUN npm install
    COPY . /var/app

    FROM local/node8
    WORKDIR /var/app
    COPY --from=swap-space /var/app/ /var/app/
    ENV PORT=3000
    USER nodeuser:nodeuser
    ENTRYPOINT ["node", "server.js"]
The first stage will use an official Docker Node binary to prepare our application. The second stage merely copies the application in. NPM isn't needed for your application to run. It only needs the application folder consisting of your code and the modules that npm install produced.
Build it and push it out (real-world example):
    TAG=`date +%F_%H-%M-%S`
    docker build . -t local/docker-sample-project:$TAG -t registry.gitlab.com/davidbetz/docker-sample-project:$TAG
    docker push registry.gitlab.com/davidbetz/docker-sample-project:$TAG
/etc/passwd and /etc/group
The addition of /etc/passwd and /etc/group is an artifact of how most Linux tools work: they want a name, not a UID or GID. You create a user and group just to name them. You can't simply specify UID 500.
Since /etc/passwd and /etc/group are part of a file-system in a specific namespace, tools use the files within the file-system they're looking at to do the ID to name lookup.
This gives us an opportunity to do an experiment...
Let's run MongoDB in the background:
    [dbetz@ganymede ~]$ docker run -d mongo
    8e4be179031cc7221e561934df02c513cb9f7b3946343e0a2b027dace5f83f03
Start sh in that namespace:
Remember, you aren't going into a container, there is no container. But, it's still phenomenological language like "the sun rises". You'll end up talking about "containers", but remember they're merely abstractions.
    [dbetz@ganymede ~]$ docker exec -it 8e4b sh
    #
Let's see the processes from the perspective of that namespace:
    # ps aux
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    mongodb 1 7.0 0.6 984072 55524 ? Ssl 18:43 0:00 mongod --bind_
OK, so the user for mongod is mongodb. Let's get the UID and GID for mongodb:
    # grep mongodb /etc/passwd
    mongodb:x:999:999::/home/mongodb:/bin/sh
    # grep mongodb /etc/group
    mongodb:x:999:
Exit sh and look for mongod in your processes on your host machine:
    [dbetz@ganymede ~]$ ps aux | grep mongod
    polkitd 40754 0.6 0.7 990396 58628 ? Ssl 13:43 0:02 mongod --bind_ip_all
The user is polkitd? Well, look for user 999 in your /etc/passwd:
    [dbetz@ganymede ~]$ grep 999 /etc/passwd
    polkitd:x:999:997:User for polkitd:/:/sbin/nologin
On the host, ps saw UID 999 and used the /etc/passwd within reach to do the lookup, thus interpreted it as polkitd.
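The same effect can be reproduced without Docker at all. The sketch below resolves UID 999 against two stand-in passwd files: one holding the host's entry and one holding the image's entry from the session above. The function name and the temporary file names are made up for illustration.

```shell
#!/bin/sh
# Resolve a numeric UID to a user name using one specific passwd-format file.
# Usage: name_for_uid UID PASSWD_FILE   (name_for_uid is a hypothetical helper name)
name_for_uid() {
    awk -F: -v uid="$1" '$3 == uid { print $1; exit }' "$2"
}

# Two stand-in files: the host's /etc/passwd entry and the image's /etc/passwd entry.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
printf 'polkitd:x:999:997:User for polkitd:/:/sbin/nologin\n' > "$workdir/host_passwd"
printf 'mongodb:x:999:999::/home/mongodb:/bin/sh\n' > "$workdir/image_passwd"

echo "host view:  $(name_for_uid 999 "$workdir/host_passwd")"
echo "image view: $(name_for_uid 999 "$workdir/image_passwd")"
```

The process only ever carries the number 999. The name you see depends entirely on which file does the translation: polkitd with the host's file, mongodb with the image's.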
- Docker uses existing Linux functionality.
- There is no "subsystem" or hypervisor.
- Images do not contain operating systems.
- Docker images are merely Docker binaries of your application.
- Your Docker images should not contain anything other than your application and its dependencies.
- Images are either from scratch or based on a file-system template like Docker Alpine.
- You can map files already on your system's file-system to minimize the size of images.