Linux Application Container Fundamentals

At my current job, I spend a lot of time coaxing Docker to run containers while trying to avoid network failures, Docker bugs and kernel reference count issues. Recently, I’ve gotten into reading about how Docker and other containerization software are implemented under the hood. This is a write-up of my exploration and experiments looking at how container runtimes are implemented. Nothing in this essay is original, but I hope it helps some people.

Running applications as containers requires some support from the operating system. Docker, rkt and other container runtimes wrap these kernel APIs, provide good defaults and add features like image description formats and process management. In this article we will take a look at the primitives Linux provides, and how they are combined to containerize applications.

The aim of containerization is to isolate and impose restrictions on applications, while keeping the applications themselves unaware that they are running in any special environment. The kernel usually provides mechanisms so that the userspace APIs remain unchanged, while giving the kernel and the operator granular control over the resources made available to containerized processes.

Like most things in systems programming, this sleight of hand is achieved via another layer of indirection. We need to do 3 things to isolate an application:

  1. Allow it to have a possibly unique view of the “global” environment. This includes things like the process space, network devices and filesystem mounts. In Linux, this is achieved via namespaces.

  2. Allow the kernel and the operator to impose limits on the containerized application that are stricter than hardware imposed limits. This allows multiple applications to run on the same host without abusing resources. The Linux mechanisms for this have been around for almost a decade. They are called control groups, or simply cgroups.

  3. Allow custom filesystem layouts to be attached to containerized processes. For example, the operator may want to mount a read-only layer in all processes that contains configuration options. At the same time, perhaps they do not want to expose system details like /proc, /etc or /sys to the process. But they may want the database process to have write access to a specific directory so that records may persist across reboots. Unsurprisingly, filesystem “wrappers” that can unify these multiple approaches are called union filesystems.

Before we dig in, here is the system configuration I’m running these examples on:

Linux 4.7.4-1-ARCH x86_64 GNU/Linux

Namespaces

Namespaces allow isolating a process’s view of various shared resources. That is, a process inside a namespace can view what it perceives as global attributes, but this view is independent of processes in other namespaces. Similarly, it may change these global attributes, but these changes are not visible to processes in other namespaces.

Linux provides various namespaces which let an application have its own network stack, own process space and so on. Today we are going to look at the simplest one - the UTS namespace. This namespace affects only the host name and domain name and its effects are easy to see in practice.

A process’s namespaces are represented under the /proc/<pid>/ns directory. Each entry is a symbolic link whose target names the namespace type, followed by an inode number that identifies the namespace the process belongs to. In normal operation, all processes are in the default, shared namespaces. For example, my shell ($$ evaluates to the shell’s PID):

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 net -> 'net:[4026531957]'
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 nikhil users 0 Sep 25 15:20 uts -> 'uts:[4026531838]'

This is the root user’s shell on my system:

# ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 25 15:21 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Sep 25 15:21 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Sep 25 15:21 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Sep 25 15:21 net -> 'net:[4026531957]'
lrwxrwxrwx 1 root root 0 Sep 25 15:21 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Sep 25 15:21 uts -> 'uts:[4026531838]'

Every inode number matches, so both shells are in the same namespaces.
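The same comparison can be done programmatically. Here is a minimal sketch using only the Rust standard library; the helper names parse_ns_link and ns_inode are made up for this illustration, and comparing against PID 1 is just an arbitrary choice.

```rust
use std::fs;
use std::io;

/// Split a link target such as "uts:[4026531838]" into its namespace
/// type and inode number. Returns None if the text has another shape.
fn parse_ns_link(target: &str) -> Option<(String, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode = rest.strip_suffix(']')?.parse().ok()?;
    Some((kind.to_string(), inode))
}

/// Read /proc/<pid>/ns/<kind> and return the namespace inode number.
/// `pid` may be a number or the literal "self".
fn ns_inode(pid: &str, kind: &str) -> io::Result<u64> {
    let target = fs::read_link(format!("/proc/{}/ns/{}", pid, kind))?;
    parse_ns_link(&target.to_string_lossy())
        .map(|(_, inode)| inode)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "unexpected link format"))
}

fn main() {
    // Compare this process with PID 1. Reading the entries of another
    // user's process may require root.
    for kind in ["uts", "net", "pid"] {
        match (ns_inode("self", kind), ns_inode("1", kind)) {
            (Ok(a), Ok(b)) => println!("{}: self={} init={} same={}", kind, a, b, a == b),
            (a, b) => println!("{}: could not compare ({:?}, {:?})", kind, a, b),
        }
    }
}
```

Two processes share a namespace exactly when the inode numbers for that entry are equal, which is all the sketch checks.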

A new process can be put in its own namespace by passing the CLONE_NEW* family of flags to clone(2). An existing process can call unshare(2) to move itself into new namespaces, or setns(2) to join namespaces that already exist.
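The CLONE_NEW* flags are plain bit flags, so several namespaces can be requested in one call by OR-ing them together. The sketch below repeats the constant values from the kernel’s <linux/sched.h> so that it builds without the libc crate (which exports the same names); flag_for and clone_flags are hypothetical helpers for this illustration.

```rust
// Values of the CLONE_NEW* constants from <linux/sched.h>, repeated here
// so the sketch builds with the standard library alone.
const CLONE_NEWNS: i32 = 0x0002_0000; // mount namespace ("mnt")
const CLONE_NEWCGROUP: i32 = 0x0200_0000;
const CLONE_NEWUTS: i32 = 0x0400_0000;
const CLONE_NEWIPC: i32 = 0x0800_0000;
const CLONE_NEWPID: i32 = 0x2000_0000;
const CLONE_NEWNET: i32 = 0x4000_0000;

/// Map an entry name from /proc/<pid>/ns to the matching clone(2) flag.
fn flag_for(ns_name: &str) -> Option<i32> {
    match ns_name {
        "mnt" => Some(CLONE_NEWNS),
        "cgroup" => Some(CLONE_NEWCGROUP),
        "uts" => Some(CLONE_NEWUTS),
        "ipc" => Some(CLONE_NEWIPC),
        "pid" => Some(CLONE_NEWPID),
        "net" => Some(CLONE_NEWNET),
        _ => None,
    }
}

/// OR together the flags for every requested namespace.
/// Returns None if any name is unknown.
fn clone_flags(names: &[&str]) -> Option<i32> {
    names
        .iter()
        .try_fold(0i32, |acc, name| flag_for(name).map(|flag| acc | flag))
}

fn main() {
    match clone_flags(&["uts", "net"]) {
        Some(flags) => println!("flags for uts+net: {:#x}", flags),
        None => println!("unknown namespace name"),
    }
}
```

The resulting value is what would be passed as the flags argument of clone(2) or unshare(2), usually combined with other flags such as SIGCHLD.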

We will explore a simple Rust program that creates a child process with its own UTS namespace. The child process will then change the hostname. The change will be visible in the child, but not in the parent or the rest of the system. The program source is on GitHub in the uts-namespace directory. It can be run with cargo run.

This program needs root privileges to work. It is your responsibility to do due diligence on the source code before executing it. Any damage to your computer is not my responsibility!

When run, it proceeds as:

# cargo run
Parent pid 15798
Original host name in parent arbitrary
Child pid 15799
Original host name in child arbitrary
Host name in child temporary-only-in-child
Sleeping for 30 seconds
Host name in parent arbitrary

While the program sleeps, we can check on it in another shell (again as root).

# ls -l /proc/15798/ns
...
lrwxrwxrwx 1 root root 0 Sep 25 17:15 uts -> 'uts:[4026531838]'
# ls -l /proc/15799/ns
...
lrwxrwxrwx 1 root root 0 Sep 25 17:15 uts -> 'uts:[4026532272]'

As you can see, the two namespaces are different. The hostname change is only applied in uts:[4026532272].
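Another way to watch this from the outside is to read /proc/sys/kernel/hostname, which reports the hostname of the UTS namespace the reading process lives in. A minimal sketch, assuming a hypothetical helper called read_hostname:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a hostname from a file and strip the trailing newline.
fn read_hostname(path: &Path) -> io::Result<String> {
    Ok(fs::read_to_string(path)?.trim_end().to_string())
}

fn main() {
    // This file reflects the UTS namespace of the calling process, so the
    // parent and the child in the example would print different values.
    match read_hostname(Path::new("/proc/sys/kernel/hostname")) {
        Ok(name) => println!("Host name in this UTS namespace: {}", name),
        Err(e) => eprintln!("could not read host name: {}", e),
    }
}
```

Run from an ordinary shell while the example program sleeps, this should still report the original hostname, because the shell is not in the child’s namespace.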

The Rust program uses the libc FFI bindings and some unsafe calls to hook into the various system calls we are going to perform. We allocate some stack space, then execute the enter function in a new process. The CLONE_NEWUTS flag requests creation of a new UTS namespace.

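fn child_in_new_uts() -> Result<i32, i32> {
    const STACK_SIZE: usize = 1024 * 1024;
    // Reserve memory for the child's stack. The stack grows downwards on
    // x86_64, so clone(2) expects a pointer to the top of the region.
    let mut stack: Vec<u8> = Vec::with_capacity(STACK_SIZE);
    let stack_top = unsafe { stack.as_mut_ptr().add(STACK_SIZE) } as *mut libc::c_void;
    // CLONE_NEWUTS gives the child its own UTS namespace. SIGCHLD is the
    // signal sent to the parent when the child exits.
    let child_pid = unsafe {
        libc::clone(enter,
                    stack_top,
                    libc::CLONE_NEWUTS | libc::SIGCHLD,
                    std::ptr::null_mut())
    };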
    if child_pid == -1 {
        perror("clone");
        return Err(-1);
    }
    Ok(child_pid)
}

In enter we display our new PID to make sure we aren’t cheating. We change the hostname using this fragment.

fn sethostname(new: &str) {
    let cs = CString::new(new).unwrap();
    let ptr = cs.as_ptr();
    let r = unsafe { libc::sethostname(ptr, new.len()) };
    if r == -1 {
        perror("sethostname");
    }
}

As you can see, the program (in this case the child process) isn’t even aware that it is in a custom namespace. It goes about its merry business assuming it is manipulating the system hostname. This is really powerful. It allows us to containerize all kinds of programs, including init daemons, browser plugins, and other programs that need restricted privileges, without having to change the programs. We can also do things like set up virtual networks for programs, on which we can test how they react to network failures or dropped packets, without having to stop watching YouTube at the same time.

cgroups

cgroups allow imposing granular resource limits on CPU usage, memory, pids, block devices and several other system resources. I haven’t looked into the details much, but cgroups is how Docker and friends allow multiple containers to run on the same host without one process hogging CPU and memory.

cgroups are much easier than namespaces to fiddle around with, as they can be created and manipulated simply by operating on files and directories in /sys/fs/cgroup using standard shell utilities.

cgroups have a version 1 and a version 2. While version 2 is the better API, it is not fully featured yet and we will be sticking to version 1 in this article.

The entries in /sys/fs/cgroup are directories representing each kind of cgroup controller. Each controller represents one kind of resource, so there is a cpu controller, a memory controller and so on. A top-level cgroup.procs file in each controller lists the processes on the system that are not in custom cgroups. By default, all (non-zombie) system processes are in here. For example:

$ cat /sys/fs/cgroup/memory/cgroup.procs
1
2
3
5
7
8
9
10
11
12
...
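Since cgroup.procs is just a text file with one PID per line, reading it from a program is straightforward. A small sketch using only the standard library; parse_procs and cgroup_pids are names invented for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse the contents of a cgroup.procs file: one PID per line.
/// Lines that are not numbers are skipped.
fn parse_procs(contents: &str) -> Vec<u32> {
    contents
        .lines()
        .filter_map(|line| line.trim().parse().ok())
        .collect()
}

/// List the PIDs that belong to the cgroup rooted at `cgroup_dir`.
fn cgroup_pids(cgroup_dir: &Path) -> io::Result<Vec<u32>> {
    let contents = fs::read_to_string(cgroup_dir.join("cgroup.procs"))?;
    Ok(parse_procs(&contents))
}

fn main() {
    // Path of the version 1 memory controller used in this article.
    match cgroup_pids(Path::new("/sys/fs/cgroup/memory")) {
        Ok(pids) => println!("{} processes in the root memory cgroup", pids.len()),
        Err(e) => eprintln!("could not read cgroup.procs: {}", e),
    }
}
```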

Let’s jump right in and create a simple cgroup that restricts how much memory an application has access to.

(The following examples need root access).

# cd /sys/fs/cgroup
# mkdir memory/myfirstgroup
# ls -l memory/myfirstgroup
total 0
-rw-r--r-- 1 root root 0 Sep 25 15:39 cgroup.clone_children
--w--w--w- 1 root root 0 Sep 25 15:39 cgroup.event_control
-rw-r--r-- 1 root root 0 Sep 25 15:39 cgroup.procs
...and more

As you can see, the directory was populated with various files by the kernel. Some of these files are read-only and allow accessing the cgroup attributes. Others are read-write and allow modifying the cgroup attributes. Changes made to group myfirstgroup do not affect any processes not in the cgroup. First let’s make sure no process is currently in the cgroup:

# cat memory/myfirstgroup/cgroup.procs
# (empty)

To test memory limits, we are going to use this shell fragment:

$ dd if=/dev/zero | read x

which attempts to keep storing a string of zeroes into the variable x, causing the shell to buffer in memory. This runs forever, but if you terminate it, it will show you how much data was copied.

# interrupt with Ctrl+C
$ dd if=/dev/zero | read x
^C108321+0 records in
108320+0 records out
55459840 bytes (55 MB, 53 MiB) copied, 9.0239 s, 6.1 MB/s

To test cgroup limits, we are going to move this shell into the cgroup first. Although modifying cgroups requires root privileges, the processes running in cgroups can have non-root privileges.

$ echo $$
13399

Back in the root shell:

# echo 13399 > /sys/fs/cgroup/memory/myfirstgroup/cgroup.procs
# cat /sys/fs/cgroup/memory/myfirstgroup/cgroup.procs 
13399
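The echo and cat above translate directly into file operations. The following is a sketch under the assumption that the cgroup directory already exists; add_pid and contains_pid are hypothetical helper names, and the program has to run as root to touch a real cgroup.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Ask the kernel to move `pid` into the cgroup at `cgroup_dir` by
/// writing the PID to its cgroup.procs file.
fn add_pid(cgroup_dir: &Path, pid: u32) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .open(cgroup_dir.join("cgroup.procs"))?;
    writeln!(file, "{}", pid)
}

/// Check whether `pid` is listed in the cgroup's cgroup.procs file.
fn contains_pid(cgroup_dir: &Path, pid: u32) -> io::Result<bool> {
    let contents = fs::read_to_string(cgroup_dir.join("cgroup.procs"))?;
    Ok(contents.lines().any(|line| line.trim().parse() == Ok(pid)))
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: <program> <cgroup-dir> <pid>");
        return;
    }
    let dir = Path::new(&args[1]);
    let pid: u32 = match args[2].parse() {
        Ok(pid) => pid,
        Err(_) => {
            eprintln!("not a PID: {}", args[2]);
            return;
        }
    };
    match add_pid(dir, pid).and_then(|_| contains_pid(dir, pid)) {
        Ok(found) => println!("pid {} in cgroup: {}", pid, found),
        Err(e) => eprintln!("cgroup update failed: {}", e),
    }
}
```

For the example in this article it would be invoked with /sys/fs/cgroup/memory/myfirstgroup and 13399 as arguments.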

So now the user shell is in the cgroup. Let’s put a 4 MiB cap on this shell.

# echo 4M > /sys/fs/cgroup/memory/myfirstgroup/memory.memsw.limit_in_bytes
# cat memory/myfirstgroup/memory.memsw.limit_in_bytes 
4194304

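The memsw part of the file name stands for “memory+swap”. We want to impose a swap limit too so that the process does not simply page to disk. Note that cgroup v1 requires memory.memsw.limit_in_bytes to be at least as large as memory.limit_in_bytes, so if the write above is rejected with an “Invalid argument” error, set memory.limit_in_bytes to 4M first.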
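The 4M we wrote came back as 4194304 because the kernel expands the k, m and g suffixes as binary multiples. The arithmetic can be sketched as follows; parse_size is a made-up helper that mimics this expansion and is not the kernel’s own parser.

```rust
/// Expand a size such as "4M" into bytes, treating the k/m/g suffixes
/// (either case) as powers of 1024. Returns None for malformed input
/// or on overflow.
fn parse_size(text: &str) -> Option<u64> {
    let text = text.trim();
    let (digits, shift) = match text.chars().last()? {
        'k' | 'K' => (&text[..text.len() - 1], 10),
        'm' | 'M' => (&text[..text.len() - 1], 20),
        'g' | 'G' => (&text[..text.len() - 1], 30),
        _ => (text, 0),
    };
    digits.parse::<u64>().ok()?.checked_mul(1u64 << shift)
}

fn main() {
    // 4 * 2^20 bytes
    println!("{:?}", parse_size("4M")); // prints Some(4194304)
}
```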

Try reading to infinity in the user shell again.

$ dd if=/dev/zero | read x
zsh: killed     /bin/zsh

Poof! That got killed fast.

Union filesystems

Union filesystems are an idea that has been around for decades, and they are the least container-specific of the three mechanisms. They’ve had other uses in a pre-container world, including Live-CDs, configuration management and creating default files. A union filesystem has branches or layers that correspond to each underlying filesystem. The union filesystem is a virtual filesystem that manages changes to the merged view and spreads them around the layers as appropriate. For example, consider a database program I’ve installed off of a CD.

/mnt/cdrom0 - a read-only CD-ROM that has the software and related
configuration.
/home/dbuser/data - mount to an NFS server hosted on my NAS.

The database process I’m using requires a directory structure in the filesystem that has /var/lib/database/config, and it stores data in /var/lib/database/data.

Union filesystems allow us to represent this structure. Writes to /var/lib/database/data will be reflected onto the NFS system. Also, we do not want to allow access to any other filesystem on the host.

The union filesystem recommended on Linux, overlayfs, allows us to do the former. Docker and other container runtimes use some other techniques, including mount namespaces, to do the latter. Today, we will just look at a quick overlayfs example.

overlayfs has two layers, called upper and lower. The upper is read-write, the lower is read-only. First we create a simple directory tree to represent these layers.

$ mkdir ~/overlay-lower ~/overlay-upper ~/overlay-work ~/merged
$ echo "Read only file" > ~/overlay-lower/ro.txt
$ echo "Read write file" > ~/overlay-upper/rw.txt

# As root
# mount -t overlay overlay \
        -olowerdir=/home/nikhil/overlay-lower,upperdir=/home/nikhil/overlay-upper,workdir=/home/nikhil/overlay-work \
        /home/nikhil/merged

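Now overlayfs has mounted the two directories in a unified view under ~/merged. The overlayfs documentation requires workdir to be an empty directory on the same filesystem as upperdir. As far as I can tell, overlayfs uses it as scratch space for preparing files before moving them into the upper layer.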
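The long -o argument is just three key=value pairs joined by commas. If you want to generate the mount invocation from a program, a sketch could look like this; overlay_options and mount_args are hypothetical helpers, and the code only prints the command rather than running it.

```rust
use std::path::Path;

/// Build the option string passed to mount -o for an overlay mount.
fn overlay_options(lower: &Path, upper: &Path, work: &Path) -> String {
    format!(
        "lowerdir={},upperdir={},workdir={}",
        lower.display(),
        upper.display(),
        work.display()
    )
}

/// Build the argument list for mount(8), matching the command above.
fn mount_args(lower: &Path, upper: &Path, work: &Path, target: &Path) -> Vec<String> {
    vec![
        "-t".to_string(),
        "overlay".to_string(),
        "overlay".to_string(),
        "-o".to_string(),
        overlay_options(lower, upper, work),
        target.display().to_string(),
    ]
}

fn main() {
    let home = Path::new("/home/nikhil");
    let args = mount_args(
        &home.join("overlay-lower"),
        &home.join("overlay-upper"),
        &home.join("overlay-work"),
        &home.join("merged"),
    );
    // Print the command instead of running it; mounting needs root.
    println!("mount {}", args.join(" "));
}
```

The printed arguments could be handed to std::process::Command to perform the mount as root.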

Let’s try out how ~/merged works.

$ cd ~/merged
$ ls
ro.txt rw.txt
$ cat ro.txt
Read only file
$ echo "Another line" >> ro.txt
$ cat ro.txt
Read only file
Another line
$ cat ~/overlay-lower/ro.txt
Read only file
$ echo "This reflects in the original filesystem" >> rw.txt
$ cat rw.txt
Read write file
This reflects in the original filesystem
$ cat ~/overlay-upper/rw.txt
Read write file
This reflects in the original filesystem
$ rm -f ro.txt
$ ls
rw.txt
$ ls ~/overlay-lower
ro.txt
$ ls ~/overlay-upper
ro.txt rw.txt
$ cat ~/overlay-upper/ro.txt
cat: /home/nikhil/overlay-upper/ro.txt: Permission denied

That was very interesting! The upper filesystem was kept writable. The “read-only” file was also writable, but the changes did not reflect in the original filesystem. They were only visible through the merged filesystem. Once the overlay is unmounted, the merged view disappears and the lower layer is exactly as it was; the modifications survive only as files in the upper directory. How did that work?

The writes to the read-only file were performed by copying the file up to the upper layer as soon as we tried to change it, and all further changes went to that copy. When we then removed ro.txt from the merged view, overlayfs could not delete it from the read-only lower layer, so it replaced the copy in the upper layer with a whiteout: a special character device with device number 0/0 that tells overlayfs to hide the name from ls. That is why ~/overlay-upper still lists ro.txt but will not let us read it. Creating whiteouts is part of the support that overlayfs requires from the filesystem on the upper layer. The references at the end of this article explain all these tricks.
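To make copy-up and whiteouts concrete, here is a toy in-memory model of the lookup rules. It is only a sketch for building intuition, not how the kernel implements overlayfs; the Overlay and Upper types are invented for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// What the upper layer holds for a name: real contents, or a whiteout
/// marking the name as deleted.
enum Upper {
    File(String),
    Whiteout,
}

struct Overlay {
    lower: BTreeMap<String, String>, // read-only layer
    upper: BTreeMap<String, Upper>,  // read-write layer
}

impl Overlay {
    /// Look up a name: the upper layer wins, and a whiteout hides
    /// whatever the lower layer has.
    fn read(&self, name: &str) -> Option<&str> {
        match self.upper.get(name) {
            Some(Upper::File(data)) => Some(data.as_str()),
            Some(Upper::Whiteout) => None,
            None => self.lower.get(name).map(|data| data.as_str()),
        }
    }

    /// Append to a file. If only the lower layer has it, the contents
    /// are copied up first; the lower layer is never modified.
    fn append(&mut self, name: &str, text: &str) {
        let mut data = self.read(name).unwrap_or("").to_string();
        data.push_str(text);
        self.upper.insert(name.to_string(), Upper::File(data));
    }

    /// Remove a name from the merged view. If the lower layer has it,
    /// a whiteout must stay behind in the upper layer to keep it hidden.
    fn remove(&mut self, name: &str) {
        if self.lower.contains_key(name) {
            self.upper.insert(name.to_string(), Upper::Whiteout);
        } else {
            self.upper.remove(name);
        }
    }

    /// Names visible in the merged view, sorted.
    fn list(&self) -> Vec<String> {
        let names: BTreeSet<&String> = self.lower.keys().chain(self.upper.keys()).collect();
        names
            .into_iter()
            .filter(|name| self.read(name).is_some())
            .cloned()
            .collect()
    }
}

fn main() {
    let mut overlay = Overlay { lower: BTreeMap::new(), upper: BTreeMap::new() };
    overlay.lower.insert("ro.txt".into(), "Read only file\n".into());
    overlay.upper.insert("rw.txt".into(), Upper::File("Read write file\n".into()));

    overlay.append("ro.txt", "Another line\n"); // triggers the copy-up
    overlay.remove("ro.txt"); // leaves a whiteout in the upper layer
    println!("{:?}", overlay.list()); // prints ["rw.txt"]
}
```

The main function replays the shell session above: the append lands in the upper layer, the lower layer keeps its original contents, and the removal leaves only rw.txt visible.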

I encourage you to play with creating, manipulating and removing directories in the overlay filesystem.
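While experimenting, it can be handy to spot whiteouts in the upper directory. overlayfs represents a whiteout as a character device with device number 0/0, so a check along these lines should work; is_whiteout is a hypothetical helper, and the sketch assumes a Unix target.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{FileTypeExt, MetadataExt};
use std::path::Path;

/// Return true if `path` looks like an overlayfs whiteout, that is,
/// a character device whose device number is 0/0.
fn is_whiteout(path: &Path) -> io::Result<bool> {
    // symlink_metadata inspects the entry itself without following links.
    let meta = fs::symlink_metadata(path)?;
    Ok(meta.file_type().is_char_device() && meta.rdev() == 0)
}

fn main() {
    // Directory to scan, for example ~/overlay-upper; defaults to ".".
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    match fs::read_dir(&dir) {
        Ok(entries) => {
            for entry in entries.flatten() {
                if is_whiteout(&entry.path()).unwrap_or(false) {
                    println!("whiteout: {}", entry.path().display());
                }
            }
        }
        Err(e) => eprintln!("could not read {}: {}", dir, e),
    }
}
```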

Wrapping up

That is all for now. If you have had any experience with Docker or other container runtimes, you should now be able to see how they use these primitives to provide application containerization. Things I want to explore next include playing around with cgroup control using Rust and trying to containerize a simple process. That is, give it its own namespaces, root filesystem and use cgroups to impose CPU and memory controls. I may write some more thoughts when I get more experience with that.

Thanks to bluss on #rust for help in tracking down an FFI issue that was biting me. Finally, none of this would have been possible without the excellent containerization and Linux internals resources out there.

References

  • Rami Rosen’s slides at Net Dev Conf 2016 are a great introduction to cgroups and go in detail about v1 vs v2.

  • The memory cgroup documentation. The memory restriction example above is pretty much a copy of the example presented here.

  • Valerie Aurora’s three part series about union filesystems is recommended reading.

  • Follow that up with the overlayfs documentation.

  • The excellent Linux man page on namespaces.

  • The clone(2) man page describes namespace creation behavior. The Rust example is a translation of the C program presented here, with minor tweaks.