Container Internals
From Syscalls to Docker

An interactive deep dive into the Linux primitives that power every container runtime. No magic — just namespaces, cgroups, and union filesystems.

Docker / containerd / runc
Overlay Filesystem
Control Groups (cgroups)
Linux Namespaces
Linux Kernel (5.x+)
Hardware (CPU · Memory · Disk · Network)
01

Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources, while another set sees a different set. They're the isolation primitive — the reason a container can't see your host's processes, network, or filesystem.

The kernel currently supports 8 namespace types. Each is created via clone(2), unshare(2), or setns(2). Every process belongs to exactly one namespace of each type.

Key insight: Namespaces don't provide security — they provide visibility isolation. A process inside a PID namespace literally cannot address PIDs outside it, but a root process could escape via other means. Namespaces + capabilities + seccomp = real containment.

Namespace Lifecycle

```c
// Create a child process in new PID + NET + MNT namespaces
clone(child_fn, stack + STACK_SIZE, CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, arg);

// OR: move current process into new namespace
unshare(CLONE_NEWUTS);
sethostname("container", 9);

// OR: join an existing namespace via fd
int fd = open("/proc/<pid>/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);
```

Gotcha: CLONE_NEWPID affects the child, not the caller. The calling process doesn't move — its next fork will land in the new PID namespace. This trips people up when building container runtimes.

Interactive: Toggle Isolation

See what the container process can observe with each namespace enabled or disabled:

02

Control Groups (cgroups)

If namespaces are about what you can see, cgroups are about what you can use. They limit, account for, and isolate the resource usage (CPU, memory, I/O, PIDs) of process groups.

cgroups v2 uses a single unified hierarchy mounted at /sys/fs/cgroup. Each cgroup is a directory. Controllers are enabled per-subtree via cgroup.subtree_control.

```bash
# Create a cgroup for our container
mkdir /sys/fs/cgroup/mycontainer

# Enable memory + cpu controllers for children of the root cgroup
echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control

# Set memory limit to 256MB
echo 268435456 > /sys/fs/cgroup/mycontainer/memory.max

# Set CPU weight (relative, default=100)
echo 50 > /sys/fs/cgroup/mycontainer/cpu.weight

# Move process into cgroup
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs
```
v2 key design: The "no internal processes" rule — a cgroup can either contain processes OR have child cgroups with controllers, not both. This eliminates the resource accounting ambiguity of v1.

v1 uses separate hierarchies per controller — each controller (memory, cpu, cpuacct, blkio, etc.) is a distinct mount. A process can be in different cgroups across different hierarchies.

```bash
# v1 paths are per-controller
/sys/fs/cgroup/memory/docker/<id>/memory.limit_in_bytes
/sys/fs/cgroup/cpu/docker/<id>/cpu.shares
/sys/fs/cgroup/pids/docker/<id>/pids.max
```

v1 pain point: Because hierarchies are independent, you can end up with conflicting resource policies across controllers. This is the primary motivation for v2's unified tree.
| Aspect | v1 | v2 |
| --- | --- | --- |
| Hierarchy | Multiple (one per controller) | Single unified tree |
| Thread granularity | Threads can be in different cgroups | All threads in same cgroup (threaded controllers opt-in) |
| Memory accounting | Doesn't track shared pages well | Accurate shared-page tracking |
| PSI (Pressure Stall Info) | Not available | Built-in per-cgroup pressure metrics |
| OOM control | Configurable OOM killer disable | cgroup-aware OOM with memory.oom.group |
| Default in systemd | Pre-248 | 248+ (Sept 2021) |

Interactive: Resource Limiter Simulator

Adjust cgroup limits and watch how processes behave when they hit resource ceilings:

03

Overlay Filesystems

Container images are layered. OverlayFS merges multiple directory trees into a single unified view. Lower layers are read-only; the top layer captures writes. This is what makes docker pull efficient — shared base layers aren't duplicated.

```bash
mount -t overlay overlay \
  -o lowerdir=/layer2:/layer1:/base,upperdir=/container/upper,workdir=/container/work \
  /merged
```

Click each layer to explore its contents and see how copy-on-write, whiteouts, and opaque directories work:

Copy-on-Write Mechanics

Read Path

Lookup starts at upperdir → walks down through lowerdirs in order. First match wins. O(layers) worst case, but dentries are cached after first lookup.

Write / Modify

File is copied from lower layer to upperdir (copy_up), then modified in place. Metadata (xattrs, permissions) preserved. Large files pay a full copy cost on first write.

Delete

Creates a whiteout — a char device (0,0) in upperdir. For directories, an opaque xattr (trusted.overlay.opaque=y) hides all lower contents.

Rename

Cross-layer rename requires redirect_dir feature (kernel 4.10+). Without it, directory renames fail with EXDEV. This is a common gotcha in older kernels.

Performance gotcha: copy_up on large files (think: database files) in lower layers triggers a full copy to upper on first write. This is why database containers often use volumes or tmpfs for data directories, bypassing overlayfs entirely.
04

Build a Container from Scratch

Walk through the exact syscall sequence to create a container. No Docker, no containerd — just raw Linux primitives. Click each step to see the commands execute.

05

Docker Architecture

Docker is a convenience layer. It orchestrates all the primitives above through a well-defined component chain. Here's how docker run translates to syscalls:

docker CLI → dockerd (API) → containerd → containerd-shim → runc → clone() + exec()

dockerd

REST API daemon. Handles image management, volumes, networks. Delegates container lifecycle to containerd via gRPC. If dockerd crashes, running containers survive (containerd keeps them alive).

containerd

Industry-standard container runtime. Manages image pull/push (via content store), snapshot drivers (overlayfs), and container lifecycle. Direct CRI integration for Kubernetes. This is the real runtime.

containerd-shim

Per-container process that becomes the parent of the container's PID 1. Keeps stdin/stdout open, reports exit status. Allows containerd to restart without killing containers. The shim is why restarting dockerd (or containerd) is safe.

runc

OCI reference runtime. Reads the OCI bundle spec (config.json + rootfs), does the actual clone() with namespace flags, sets up cgroups, pivots root, drops capabilities, applies seccomp, execs the entrypoint. Then exits — it's not a daemon.

OCI Image Spec

Images are a manifest → config + ordered layer digests. Layers are tar+gzip diffs. The manifest's mediaType and platform fields handle multi-arch. Content-addressable via SHA256.

OCI Runtime Spec

config.json defines: root filesystem, mounts, process (args, env, cwd, capabilities), Linux-specific (namespaces, cgroups path, seccomp, apparmor, rlimits). This is the "container contract".
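A trimmed config.json sketch shows the shape of that contract — the values here are illustrative, not a complete spec-valid bundle:

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/bin/sh"],
    "env": ["PATH=/usr/bin:/bin"],
    "cwd": "/",
    "capabilities": { "bounding": ["CAP_NET_BIND_SERVICE"] }
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" },
      { "type": "uts" }, { "type": "ipc" }
    ],
    "cgroupsPath": "/mycontainer"
  }
}
```

`runc spec` generates a full default version of this file; everything Docker passes on `docker run` (caps, limits, mounts) ultimately lands in these fields.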

docker run — Full Call Chain

```
1.  docker run alpine sh
2.  CLI → POST /containers/create to dockerd (REST)
3.  dockerd → containerd.Create() (gRPC) + pull image if needed
4.  containerd → prepare snapshot (overlayfs), create OCI bundle
5.  containerd → start containerd-shim, pass bundle path
6.  shim → exec runc create --bundle /run/containerd/...
7.  runc: clone(NEWPID|NEWNS|NEWNET|NEWUTS|NEWIPC|NEWCGROUP)
8.  runc: setup cgroups, mount proc/sys/dev, pivot_root
9.  runc: drop caps, apply seccomp, execve("/bin/sh")
10. runc exits. shim reparents PID 1. Container running.
```
06

Security Layers

Containers are not VMs. Isolation comes from layered security mechanisms, each addressing a different attack surface. Understanding these layers is critical for threat modeling containerized workloads.

Linux Capabilities

Root's monolithic power split into ~41 fine-grained caps. Default Docker drops: CAP_SYS_ADMIN, CAP_NET_RAW (post-20.10), CAP_SYS_PTRACE, etc. Adding --privileged grants ALL caps — effectively uncontained.

Seccomp-BPF

Syscall filter at the kernel boundary. Default Docker profile blocks ~44 of ~330+ syscalls. Notably blocks: clone with CLONE_NEWUSER (preventing nested user namespaces), mount, reboot, kexec_load, bpf.

AppArmor / SELinux

MAC (Mandatory Access Control) policies. Docker generates an AppArmor profile per container that restricts mount, ptrace, signal operations. SELinux uses MCS labels (sVirt) to prevent cross-container access.

User Namespaces

Map container UID 0 to unprivileged host UID (e.g., 100000). Even if process escapes, it's nobody on the host. Not enabled by default in Docker (--userns-remap). Rootless mode uses this extensively.

Read-only Layers

Image layers are immutable. --read-only flag makes the entire rootfs read-only. tmpfs mounts for /tmp, /run. Prevents persistent compromise of the container filesystem.

No New Privileges

The no_new_privs bit (prctl PR_SET_NO_NEW_PRIVS) prevents gaining privileges via setuid/setgid binaries or filesystem capabilities. Docker sets this when a seccomp profile is active.

Container Escape Vectors

| Vector | Mechanism | Mitigation |
| --- | --- | --- |
| Kernel exploit | Shared kernel = shared vulnerabilities. Container shares syscall surface with host. | Keep kernel patched. Use gVisor/Kata for untrusted workloads (separate kernel/VMM). |
| Privileged mode | --privileged disables all isolation. Full device access, all caps, no seccomp. | Never use --privileged. Use --cap-add for specific needs. |
| Mounted Docker socket | -v /var/run/docker.sock gives full Docker API access = host root equivalent. | Use Docker-in-Docker or rootless Docker. Never mount the socket in prod. |
| Sensitive host mounts | -v /:/host exposes entire host filesystem to the container. | Principle of least privilege. Mount only what's needed, read-only where possible. |
| Metadata service | Cloud metadata (169.254.169.254) accessible from containers = credential theft. | Network policies. IMDSv2 with hop limit=1. Workload identity. |