An Advanced, Practical Introduction to Docker

This is an advanced, practical introduction to Docker. It's mainly for people who have used Docker, but want a deeper understanding. It's similar to, after finishing Calculus I, II, III, and DiffEq, taking Advanced Calculus to go back over limits, derivatives, and integration at a deeper level.

There's no such thing as an "operating system" in a Docker container. There's no such thing as being "in" a Docker container. What you call a Docker "container" is just an abstraction.

Docker "containers" aren't like virtual machines; you're not creating a general purpose environment with its own kernel. What you're doing is creating a runnable, deliverable Docker binary that contains the minimum that's needed to run a single application. You're delivering an application, not an environment. When you hear "container", think "application". When you hear "image", think "binary".

Docker hides so many of the underlying mechanics that you get the impression you're dealing with lightweight virtual machines. We should be grateful for the fact that we're victims of Docker's success.

Namespaces

The basic idea behind Docker is that Linux already has the capabilities for creating isolation; they just needed to be harnessed in a user-friendly manner. Docker is largely a front-end that abstracts already existing Linux cleverness, including namespaces (isolation) and cgroups (resource limits and accounting).

The session "What Have Namespaces Done for You Lately?" by Liz Rice helps to demonstrate this concept; she effectively builds her own Docker-like tool from the ground up using Go (which is what Docker is written in!).

When you're running what is colloquially known as a Docker "container", you're running a process just like any other process, but with a different namespace ID. This namespace concept is the same concept you already know from C# and C++: it separates entities so they don't conflict. Thus, in Linux, you can have process ID 1 in one namespace and process ID 1 in another. They aren't in different environments like virtual machines. They're isolated, but not entirely separate.

Namespaces also let you have /some/random/file in one namespace and a different /some/random/file in another namespace: think super-chrooting. You can even have something listening on port 80 in one namespace and something entirely different listening on port 80 in a different namespace. No conflicts.

There's just a lot of namespace magic to give the illusion of various "micro-machines". In reality, there are no "micro-machines"; everything is running in the same space, but with a simple label separating them.
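You can look at these labels yourself. The following is a minimal sketch, assuming a Linux host with /proc mounted; show_ns is a made-up helper name, not something Docker or Linux provides. Each entry under /proc/<pid>/ns is a symlink whose target names a namespace type and its identifier. Two processes that print the same identifier share that namespace.

```shell
#!/bin/sh
# show_ns is a hypothetical helper name used only for this illustration.
# Usage: show_ns [PID]   (defaults to "self", the calling process)
show_ns() {
    target="${1:-self}"
    for kind in pid mnt net; do
        # Each link target looks like "pid:[4026531836]".
        printf '%s\n' "$(readlink "/proc/$target/ns/$kind")"
    done
}

show_ns self
```

Reading the links of a process owned by another user typically requires root, so you'd likely need sudo to compare your shell against something started with docker run.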

The term "container" and the preposition "in" lead to extreme confusion. There's nothing "in" a container, but the terminology is pretty much baked into the industry at this point. Note also, you never run something "in" Docker, but you can run something "using" Docker.

One way to prove to yourself that there's no voodoo subsystem is to look at how ps works on your machine: you see the processes across each namespace. You may be running Elasticsearch and MongoDB as separate Docker "containers", but both of them will show up in the same ps output on your host machine.

See example below:

[dbetz@ganymede src]$ docker run -d mongo:3.7
fa55205ad518da6fe61a794732c325263c96d6a10a5692fa6ea9821c4bbcfc79

[dbetz@ganymede src]$ docker run -d docker.elastic.co/elasticsearch/elasticsearch:6.2.4
c7e337b67689831da6beb78231394ff2bcb9341e60689187f14d579650027d5e

[dbetz@ganymede src]$ ps aux | grep -E "mongo|elastic"
polkitd   43993  1.1  0.7 985104 56008 ?        Ssl  15:11   0:01 mongod --bind_ip_all
dbetz     45304 74.0 14.3 4040776 1145040 ?     Ssl  15:13   0:02 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.uHJg0AmQ -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch
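The kernel also records how a process is numbered in each PID namespace it belongs to. Here's a minimal sketch, assuming a kernel recent enough to expose the NSpid field in /proc/<pid>/status (4.1 or later); pid_views is a made-up helper name. For a process started through Docker, you'd pass the host PID that ps printed (such as 43993 above), and the line should list the host PID followed by the PID as seen from the process's own namespace.

```shell
#!/bin/sh
# pid_views is a hypothetical helper name used only for this illustration.
# Prints the NSpid line for a process: one PID per nested PID namespace,
# outermost first.
pid_views() {
    grep '^NSpid:' "/proc/${1:-self}/status"
}

# A process in the same PID namespace as this shell shows a single number.
pid_views self
```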

A solid grasp of namespaces is critical to understanding Docker. Once you understand namespace concepts, you can move to understanding how namespaces can interact with each other. That's the larger world of Docker that extends deep into the design and deployment of orchestration.

To further review and reframe Docker concepts, let's recognize some of the resource types Docker uses. For the purpose of this discussion, let's use Azure's provider categories. This should keep the concepts general enough for reuse and specific enough to be practical.

The different resource types are:

compute (e.g. processes)
storage
networking

There are others as well, but they're usually very similar to these (e.g. IPC is similar to networking).

When you spin something up using Docker (e.g. docker run), it will have everything in its own namespaces: the process, storage, and networking. You manage the mapping between namespaces yourself, per resource type.

Let's review with an example.

Run Elasticsearch ("ES"):

docker run -d docker.elastic.co/elasticsearch/elasticsearch:6.2.4

ES will run in its own process (PID) namespace. It will listen on port 9200 in its own network namespace. It will store data at /usr/share/elasticsearch/data in its own mount namespace. It's entirely sandboxed.

To make ES practical, you need to map 9200 to something that can touch your network card and /usr/share/elasticsearch/data to something in a less ephemeral location.

Here's our new command:

docker run -d -p 9200:9200 -v /srv/elasticsearch6:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:6.2.4

The point of reviewing these fundamental concepts is to further train your intuition in terms of namespaces. It's important to have this intuitive training before going too deep into more Docker concepts like volumes or networks. Without training your intuitions to work in terms of namespaces, you'll inevitably end up with confused analogies with virtual machines, inefficient images, overly complex deployments, and unbelievably confused discussions.

On the other hand, with this understanding, it should be easy to understand that Docker represents applications, not operating systems. There are no kernels in Docker binaries or in "containers" just as there are no network drivers in your application's tarballs or zip files.

Namespaces are clever and very helpful. If you were to write a plug-in model for an application, you could instantiate each plug-in into a different namespace, then share an IPC namespace for communication. Supposedly, Google Chrome on Linux does something similar. Namespaces give you an easy, built-in way to do jailing/sandboxing.

Consider also this: because Docker spins up processes just like any other process, each process has the same direct hardware access. Once you do a few device mappings to let the process in that namespace know where the real-world hardware is, you're solid. So, you don't have to put too much thought into how to get something like GPU access set up. Consider this in the context of this very confused SO discussion where people continue to cause confusion by talking about something "inside Docker containers". There is no "inside"; it's a process like any other.

Run man unshare on Linux to see the details for a native tool that creates namespaces.
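As a rough illustration of that native tool, the sketch below asks unshare for a new PID namespace and prints the PID a shell sees from there. It assumes util-linux's unshare and a kernel that permits unprivileged user namespaces (otherwise you'd drop the user-namespace flags and run it as root); pid_in_new_ns is a made-up name.

```shell
#!/bin/sh
# pid_in_new_ns is a hypothetical wrapper name used only for this illustration.
# --user --map-root-user : new user namespace, current user mapped to root
# --pid --fork           : new PID namespace; the forked child is its first process
pid_in_new_ns() {
    unshare --user --map-root-user --pid --fork sh -c 'echo $$'
}

# The shell started by unshare is the first process in the new namespace,
# so it reports its own PID as 1. Some kernels and sandboxes forbid this.
pid_in_new_ns || echo "unshare was not permitted on this system" >&2
```

No Docker is involved here; it's the same kernel feature, reached directly.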

Images

Docker "containers" are created from binaries called images. Docker images are merely Docker binaries of your application just like any other deliverable binary format (e.g. binary tarball).

These images are nothing more than file-system layouts with some metadata. The blueprint that provides instructions on how to build the file-system layout for an image is a Dockerfile.

This file-system will contain the application binary you want Docker to run. When your application runs, it may reach for various files (e.g. /lib64/libstdc++.so.6); these files just need to be where the application would expect them.

A Dockerfile also provides metadata that describes the resulting binary. It also adds an instruction for how to start your application (e.g. CMD, ENTRYPOINT).

The most important concept reframe here is this: the resulting Docker image is your complete deliverable application binary. It does not represent a system, just a single application.

Take care to avoid large multi-level image inheritance for the sake of "standardization". Standardization is the exact opposite of what you want with Docker. Tailor the deliverable to your specific application's needs.

Image Starting Points

Your application will run like any other application on your system. As such, it will follow the same rules of dependencies as any other application: if your application needs a file in order to run, you need to make sure it's within grasp. A solid understanding that these file-systems exist in different namespaces, instead of different subsystems, enables flexible ways of satisfying dependency requirements.

For example, if your machine already has a fairly large file (e.g. /lib64/liblarge.so.7), instead of putting it in each image, keep it on the host and map it at run time (-v /lib64/liblarge.so.7:/lib64/liblarge.so.7). When the running application asks for /lib64/liblarge.so.7, it gets the copy from the host machine. This concept, similar to symlinking, is at the heart of some important techniques discussed later.

When creating images, one option you have is to create a file-system from scratch. This entails adding each and every file to the proper location in the image. Much of what follows a bit later will pursue this method and explain how to effectively create such lightweight images.

Another option you have is to build your file-system on an existing file-system template. This is the traditional approach most applications use. It maximizes portability, but the resulting images are larger, containing a huge number of unused files.

When not careful, this second approach leads to atrocious misunderstandings.

Consider the following Dockerfile:

FROM ubuntu

RUN groupadd user01 \
  && useradd --gid user01 user01

RUN apt-get update && apt-get install -y sometools

CMD [ "sometool" ]

This file could very well lead many to think that there's an Ubuntu operating system "base image" that you're using and extending. This is entirely wrong.

As mentioned previously, Docker is primarily a front-end for existing Linux functionality. There is no concept of a hypervisor subsystem or the like. Applications run as they have always run. There are no kernels in images, therefore there are no operating-systems in images. Docker does not work with operating-systems, it works with applications. There is no OS "base image". There is no place for sysadmins to do any work with Docker at all. Your CMD/ENTRYPOINT instruction does not start init nor systemd, it starts your application.

Ubuntu is not in your image, only a file-system that looks like an Ubuntu file-system is in your image. FROM ubuntu merely states that the Dockerfile will start with a file-system template that looks like Ubuntu. You use it when you don't care about the size of your image and really need your application to work in an Ubuntu-like file-system. If your host system is RHEL, your binaries still run on RHEL -- Docker does not deal with operating systems.

For the most part, using Linux OS file-system templates is a very poor practice. They are largely not optimized for Docker. However, there is one OS file-system template that is optimized for Docker: Docker Alpine.

Docker Alpine provides a very small OS file-system template that maximizes application portability while minimizing binary size.

The previous Dockerfile would transform to Docker Alpine like this:

FROM alpine

RUN addgroup -g 1000 user01 && \
adduser -D -u 1000 -G user01 user01

RUN apk add --no-cache sometools

CMD [ "sometool" ]

The resulting binary would be much smaller. Yet, keep in mind that Alpine is not in your Docker image. Docker does not put operating systems into images. Your image is merely built on a file-system template that looks like an Alpine Linux file-system.

Do not confuse Docker Alpine with Alpine Linux. The former is a file-system template that looks like an Alpine file-system, while the latter is an operating system for routers, tiny Linux deployments, and Raspberry Pi.

When creating portable images without extensive binary optimization, Docker Alpine is the only viable option. Do not use FROM centos or FROM ubuntu in any environment. These lead to extremely large images and cause severe confusion.

This bears repeating: the entire point of Docker is to run your application. To do this, you just need to make sure your application has what it needs to run. The question is not "What do I build my application on?", the question is "What specific files does my application require?" Your application most likely doesn't need 90% of the files that a Linux file-system template provides, it probably just needs a few libraries. It may not even need the full XYZ library, but just file Y.

Docker lets you optimize your application like this. If you can identify the dependencies of your application, you'll be able to build your file-system FROM scratch.
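One possible way to start answering "What specific files does my application require?" for a dynamically linked binary is ldd, which prints the shared libraries the loader would resolve. The sketch below assumes ldd is installed and that the binary is one you trust (ldd may run code from the binary it inspects); list_deps is a made-up helper name. It won't find files the application opens on its own at run time, such as configuration or data files.

```shell
#!/bin/sh
# list_deps is a hypothetical helper name used only for this illustration.
# Prints the absolute path of each shared library (and the loader) that
# ldd reports for the given binary, one per line.
list_deps() {
    ldd "$1" | awk '{
        # The first field that starts with "/" is the resolved path.
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^\//) { print $i; break }
        }
    }' | sort -u
}

# Example: the libraries behind the ls binary on this machine.
list_deps "$(command -v ls)"
```

A short list here is a good sign that FROM scratch is within reach.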

`scratch`

At runtime, you're working with namespaces. At build time, you're working with images. Your ability to create optimal images is directly proportional to your understanding of namespaces and your application.

Let's jump right to an example of creating a tiny, usable Docker image...

First let's look at the hello.asm file we want to run (taken from http://cs.lmu.edu/~ray/notes/x86assembly/):

        global  _start

        section .text
_start:
        ; write(1, message, 13)
        mov     rax, 1                  ; system call 1 is write
        mov     rdi, 1                  ; file handle 1 is stdout
        mov     rsi, message            ; address of string to output
        mov     rdx, 13                 ; number of bytes
        syscall                         ; invoke operating system to do the write

        ; exit(0)
        mov     eax, 60                 ; system call 60 is exit
        xor     rdi, rdi                ; exit code 0
        syscall                         ; invoke operating system to exit
message:
        db      "Hello, World", 10      ; note the newline at the end

Our goal is to create a tiny, deliverable Docker binary that writes out "Hello, World".

Here's how we'll do it:

FROM alpine as asm

WORKDIR /elephant

COPY hello.asm .

RUN apk add --no-cache binutils nasm && \
    nasm -f elf64 -p hello.o hello.asm && \
    ld -o hello hello.o

FROM scratch

COPY --from=asm /elephant/hello /

ENTRYPOINT ["./hello"]

This Dockerfile uses two stages: a build-stage and a run-stage. The first stage installs NASM, assembles the code, then links it into the application; the second carefully places the application into the deliverable Docker binary. The second stage contains a single file: /hello (copied from /elephant/hello in the first stage). It does not contain NASM, the source code, nor any intermediate files.

You can use as many stages as you want: sometimes you'll need a CI-setup stage (setup tools), then a backend-build stage (setup node, run npm install), then a front-end build-stage (build Angular), then a final stage to carefully place files from previous stages (copy /node_modules/ and Angular/dist files to node application). Only the final stage is deployed, everything else is thrown out.

This results in the following:

[dbetz@ganymede tiny-image]$ docker build . -t local/tiny-image
Sending build context to Docker daemon  4.096kB
Step 1/7 : FROM alpine as asm
 ---> 3fd9065eaf02
Step 2/7 : WORKDIR /elephant
Removing intermediate container da8e9f72ebd2
 ---> 29896ad4bb3c
Step 3/7 : COPY hello.asm .
 ---> 9ccc8ab38794
Step 4/7 : RUN apk add --no-cache binutils nasm &&     nasm -f elf64 -p hello.o hello.asm &&     ld -o hello hello.o
 ---> Running in f99cbecc309d
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
(1/3) Installing binutils-libs (2.30-r1)
(2/3) Installing binutils (2.30-r1)
(3/3) Installing nasm (2.13.01-r0)
Executing busybox-1.27.2-r7.trigger
OK: 17 MiB in 14 packages
Removing intermediate container f99cbecc309d
 ---> d840e66bfbdb
Step 5/7 : FROM scratch
 --->
Step 6/7 : COPY --from=asm /elephant/hello /
 ---> fd85715eaf85
Step 7/7 : ENTRYPOINT ["./hello"]
 ---> Running in e163f47f4d8a
Removing intermediate container e163f47f4d8a
 ---> 1f30c749e8b9
Successfully built 1f30c749e8b9
Successfully tagged local/tiny-image:latest

[dbetz@ganymede tiny-image]$ docker run local/tiny-image
Hello, World

[dbetz@ganymede tiny-image]$ docker image ls | grep "local/tiny-image"
local/tiny-image                                    latest                                     1f30c749e8b9        15 seconds ago      848B

It builds and it runs, and the entire binary is 848 bytes.

The more functionality you add to your binary, the larger it grows. Your image size should remain somewhat proportional to your functionality. That's what you'd expect from a tarball, that's how you should think with Docker.

This means that you should be careful with what files go into your resulting binary. This means being careful with how you satisfy your application's dependency needs.

Would you really throw an entire Linux operating system into your tarball?

Practical `scratch`

In the previous assembler example, we had an application with zero dependencies. When this is your situation, your Docker image size will be very near your application size. You want them to be as close as possible.

One popular way to satisfy this need is to use Go: this can output statically linked binaries that require zero dependencies. Go has many places where it fits nicely. You can see, for example, my recursivecall for Docker project. Docker itself is also written in Go.

On the other hand, Go doesn't have deep support for dynamic types. This means you won't have the JavaScript/Python dynamic object concept. Instead, you'll have to refresh yourself on those data-structures we all forgot decades ago.

Regardless, while Go is beautiful for many uses, you already have applications. Let's focus on deploying those applications via Docker, not rewriting them in Go.

For this next section, let's assume that our application /var/app/runner requires /usr/lib64/libc.so.6. Our application will crash if it doesn't find /usr/lib64/libc.so.6.

In this situation, you have two options based on your understanding of Docker namespaces:

copy /usr/lib64/libc.so.6 into the image with /var/app/runner
link /usr/lib64/libc.so.6 from the host machine to your running application's namespace

The first option can be accomplished with a multi-stage build with a simple COPY from the first stage:

FROM ubuntu as os

FROM scratch

COPY ./runner /var/app/runner

COPY --from=os /usr/lib64/libc.so.6 /lib64/

CMD ["/var/app/runner"]

Remember, the Ubuntu stage will be thrown out, but, yeah, you should still try to use Alpine where possible.

This will create a portable binary; everything the application needs will be within reach.
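When there are more than one or two files to copy, the same idea can be scripted on the build machine. This is only a sketch under a few assumptions: ldd is available, the binary is trusted, and stage_rootfs and the rootfs directory name are made up for the example. It copies a binary and each library path ldd reports into a staging directory, preserving absolute paths, so the directory could then be brought into a FROM scratch image with a single COPY.

```shell
#!/bin/sh
# stage_rootfs is a hypothetical helper name used only for this illustration.
# Usage: stage_rootfs /path/to/binary DEST_DIR
stage_rootfs() {
    bin="$1"
    dest="$2"
    {
        printf '%s\n' "$bin"
        # Keep the first field that starts with "/" on each ldd line.
        ldd "$bin" | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }'
    } | sort -u | while read -r file_path; do
        mkdir -p "$dest$(dirname "$file_path")"
        cp -L "$file_path" "$dest$file_path"   # -L copies the file a symlink points to
    done
}

# Example (not run here): stage_rootfs /var/app/runner ./rootfs
```

The trade-off is the one already described: the image becomes portable, and it grows by exactly the files you staged.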

As your portability increases, so does your binary size. When you need more than just a few files and you must maintain portability (e.g. posting to Docker Hub), it's time to use the Docker Alpine file-system template.

However, in the case where you control your environment, and thus don't require portability, the second approach may be better.

It allows you to provide a much simpler Dockerfile:

FROM scratch

COPY ./runner /var/app/runner

CMD ["/var/app/runner"]

Instead of copying the dependency into the image, you tell Docker at run-time to use a different namespace to satisfy the dependency.

Your build and run would look like this:

docker build . -t local/runner
docker run -v /usr/lib64/libc.so.6:/lib64/libc.so.6 local/runner

If you don't want to play around with each and every file, just map the entire /lib64/ folder.

docker run -v /lib64/:/lib64/ local/runner

Since most libraries are loaded from /lib64/, this technique will account for a large percentage of your scenarios.

Practical `scratch` with Node

Let's make this more real-world by manually building a Docker Node binary which our deliverable Docker application binaries will use.

Here's our Dockerfile:

FROM alpine

RUN apk add --no-cache curl && \
    mkdir -p /tmp/node && \
    mkdir -p /tmp/etc && \
    curl -s https://nodejs.org/dist/v8.11.2/node-v8.11.2-linux-x64.tar.xz | tar -Jx -C /tmp/node/

RUN addgroup -g 500 -S nodeuser && \
    adduser -u 500 -S nodeuser -G nodeuser

RUN grep nodeuser /etc/passwd > /tmp/etc/passwd && \
    grep nodeuser /etc/group > /tmp/etc/group

FROM scratch

COPY --from=0 /bin/sh /tmp/node/node-v8.11.2-linux-x64/bin/node /bin/
COPY --from=0 /usr/bin/env /usr/bin/
COPY --from=0 /tmp/etc/passwd /tmp/etc/group /etc/

CMD  ["/bin/node"]

The resulting deliverable Docker binary will contain the node binary, passwd/group, and env (as an example of copying something you may need in Node development).

The first stage downloads Node, creates a user and group, then simplifies /etc/passwd and /etc/group. Only the final stage represents the deliverable binary.

Build and run:

docker build . -t local/node8
docker run -it local/node8

Building and running results in the following error:

standard_init_linux.go:195: exec user process caused "no such file or directory"

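The node binary is dynamically linked, and the image has no loader or shared libraries under /lib64, so the exec fails. Run it again with the mapping: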

docker run -it -v /lib64/:/lib64/ local/node8

It works.

[dbetz@ganymede node8]$ docker run -it -v /lib64/:/lib64/ local/node8 node
>

Let's check the application version:

[dbetz@ganymede node8]$ docker run -it -v /lib64/:/lib64/ local/node8 node -v
v8.11.2

What's our size?

[dbetz@ganymede ~]$ docker image ls | grep "local/node8"
local/node8                                           latest                                     901c2740deb9        12 seconds ago        36.4MB

It's 36.4MB. Pretty small.

Your Docker application binary will only contain 36.4MB of overhead when you ship your product.

Using our binary

With the Docker Node image built, we can build our deliverable application binary.

FROM node:8.11.2-alpine as swap-space

WORKDIR /var/app

COPY package.json /var/app/

RUN npm install

COPY . /var/app

FROM local/node8

WORKDIR /var/app

COPY --from=swap-space /var/app/ /var/app/

ENV PORT=3000

USER nodeuser:nodeuser

ENTRYPOINT ["node", "server.js"]

The first stage will use an official Docker Node binary to prepare our application. The second stage merely copies the application in. NPM isn't needed for your application to run. It only needs the application folder consisting of your code and node_modules/.

Build it and push it out (real-world example):

TAG=`date +%F_%H-%M-%S`
docker build . -t local/docker-sample-project:$TAG -t registry.gitlab.com/davidbetz/docker-sample-project:$TAG

docker push registry.gitlab.com/davidbetz/docker-sample-project:$TAG

/etc/passwd and /etc/group

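The addition of /etc/passwd and /etc/group is an artifact of how most Linux tools work: they want a name, not a UID or GID. You create a user and group just to name them. Docker itself accepts a numeric USER (e.g. USER 500:500), but tools that look up the name for that ID will find nothing without these files.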

Because /etc/passwd and /etc/group are part of a file-system in a specific namespace, tools use the files within the file-system they're looking at to do the ID to name lookup.

This gives us an opportunity to do an experiment...

Let's run MongoDB in the background:

[dbetz@ganymede ~]$ docker run -d mongo
8e4be179031cc7221e561934df02c513cb9f7b3946343e0a2b027dace5f83f03

Let's execute sh in that namespace:

Remember, you aren't going into a container, there is no container. But, it's still phenomenological language like "the sun rises". You'll end up talking about "containers", but remember they're merely abstractions.

[dbetz@ganymede ~]$ docker exec -it 8e4b sh
#

Let's see the processes from the perspective of that namespace:

# ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb       1  7.0  0.6 984072 55524 ?        Ssl  18:43   0:00 mongod --bind_

OK, so the user for mongod is mongodb. Let's get the UID and GID for mongodb:

# grep mongodb /etc/passwd
mongodb:x:999:999::/home/mongodb:/bin/sh
# grep mongodb /etc/group
mongodb:x:999:

It's 999.

Now exit sh and look for mongod in your processes on your host machine:

[dbetz@ganymede ~]$ ps aux | grep mongod
polkitd   40754  0.6  0.7 990396 58628 ?        Ssl  13:43   0:02 mongod --bind_ip_all

The user is polkitd, not mongodb.

Why polkitd?

Well, look for user 999 in your /etc/passwd file:

[dbetz@ganymede ~]$ grep 999 /etc/passwd
polkitd:x:999:997:User for polkitd:/:/sbin/nologin

ps saw 999 and used the /etc/passwd within reach to do the lookup, thus interpreted it as polkitd.
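You can reproduce the same lookup without Docker. The sketch below is a minimal illustration, assuming standard colon-separated passwd files; name_for_uid and the two sample file names are made up, and the entries are copied from the output above.

```shell
#!/bin/sh
# name_for_uid is a hypothetical helper name used only for this illustration.
# Usage: name_for_uid PASSWD_FILE NUMERIC_UID
# Prints the user name that PASSWD_FILE assigns to the numeric UID.
name_for_uid() {
    awk -F: -v uid="$2" '$3 == uid { print $1 }' "$1"
}

# Two sample passwd files standing in for the two file-systems.
workdir="$(mktemp -d)"
echo 'mongodb:x:999:999::/home/mongodb:/bin/sh' > "$workdir/image-passwd"
echo 'polkitd:x:999:997:User for polkitd:/:/sbin/nologin' > "$workdir/host-passwd"

name_for_uid "$workdir/image-passwd" 999   # prints mongodb
name_for_uid "$workdir/host-passwd" 999    # prints polkitd
```

The process only ever carries the number 999; the name depends entirely on which file you consult.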

Key Takeaways

Docker uses existing Linux functionality.
There is no "subsystem" or hypervisor.
Images do not contain operating systems.
Docker images are merely Docker binaries of your application.
Your Docker images should not contain anything other than your application and its dependencies.
Images are either from scratch or based on a file-system template like Docker Alpine.
You can map files already on your system's file-system to minimize the size of images.

David Betz' Home for Architecture, Cloud, Linux, Docker, and Elegance