An interactive deep dive into the Linux primitives that power every container runtime. No magic — just namespaces, cgroups, and union filesystems.
Namespaces partition kernel resources so that one set of processes sees one set of resources, while another set sees a different set. They're the isolation primitive — the reason a container can't see your host's processes, network, or filesystem.
The kernel currently supports 8 namespace types. New namespaces are created via clone(2) or unshare(2); setns(2) moves the caller into an existing one. Every process belongs to exactly one namespace of each type.
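To make the "one namespace of each type" point concrete, here is a minimal sketch that lists the namespaces of the calling process by reading the symlinks under /proc/self/ns. It assumes a Linux host with procfs mounted; the helper names `parse_ns_link` and `current_namespaces` are illustrative, not part of any library.

```python
import os
import re

NS_DIR = "/proc/self/ns"  # Linux-only; each entry is a symlink such as "pid:[4026531836]"


def parse_ns_link(target):
    """Split a namespace symlink target into (namespace type, inode number)."""
    match = re.fullmatch(r"([a-z_]+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def current_namespaces(ns_dir=NS_DIR):
    """Return {entry name: inode} for the calling process, or {} if unavailable."""
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or do not recognize
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in current_namespaces().items():
        print(f"{name:20s} {inode}")
```

Two processes whose entries show the same inode number share that namespace, so comparing this output on the host and inside a container shows which namespaces differ.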
With unshare(2), CLONE_NEWPID affects the caller's children, not the caller itself. The calling process doesn't move; its next fork lands in the new PID namespace as PID 1. This trips people up when building container runtimes.
If namespaces are about what you can see, cgroups are about what you can use. They limit, account for, and isolate the resource usage (CPU, memory, I/O, PIDs) of process groups.
cgroups v2 uses a single unified hierarchy mounted at /sys/fs/cgroup. Each cgroup is a directory. Controllers are enabled per-subtree via cgroup.subtree_control.
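Because every cgroup is a directory and every knob is a file, configuring limits comes down to plain file writes. The sketch below assumes the standard v2 interface files (cgroup.subtree_control, memory.max, cpu.max, pids.max); the function names are illustrative, and the demo writes into a scratch directory standing in for /sys/fs/cgroup so it runs without root. On a real cgroup filesystem the kernel creates these files when you mkdir the cgroup.

```python
import os
import tempfile


def enable_controllers(parent_dir, controllers):
    """Enable controllers for the children of parent_dir via cgroup.subtree_control."""
    line = " ".join(f"+{name}" for name in controllers)
    with open(os.path.join(parent_dir, "cgroup.subtree_control"), "w") as fh:
        fh.write(line + "\n")
    return line


def write_limits(cgroup_dir, memory_max_bytes, cpu_quota_us, cpu_period_us, pids_max):
    """Write limits into the cgroup v2 interface files of one cgroup directory."""
    settings = {
        "memory.max": str(memory_max_bytes),
        "cpu.max": f"{cpu_quota_us} {cpu_period_us}",  # "<quota> <period>", microseconds
        "pids.max": str(pids_max),
    }
    for filename, value in settings.items():
        with open(os.path.join(cgroup_dir, filename), "w") as fh:
            fh.write(value + "\n")
    return settings


def demo():
    """Exercise the helpers against a temporary directory (not a real cgroup)."""
    with tempfile.TemporaryDirectory() as parent:
        child = os.path.join(parent, "demo")
        os.mkdir(child)
        enable_controllers(parent, ["cpu", "memory", "pids"])
        write_limits(child, 256 * 1024 * 1024, 50000, 100000, 64)
        with open(os.path.join(child, "cpu.max")) as fh:
            return fh.read().strip()
```

Here a quota of 50000 against a period of 100000 corresponds to half of one CPU. To place a process under these limits on a real system, its PID is written to the cgroup's cgroup.procs file.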
v1 uses separate hierarchies per controller — each controller (memory, cpu, cpuacct, blkio, etc.) is a distinct mount. A process can be in different cgroups across different hierarchies.
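The difference between the two layouts is visible in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. A v2 membership appears as a single line with hierarchy ID 0 and an empty controller list, while v1 shows one line per mounted hierarchy. The parser below is a small sketch; `parse_proc_cgroup` is an illustrative name and the sample paths in the usage are made up.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the cgroup path itself may contain ':'
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
            # v2 is reported as hierarchy 0 with no controller list
            "version": 2 if hierarchy_id == "0" and controllers == "" else 1,
        })
    return entries


if __name__ == "__main__":
    sample = "5:cpu,cpuacct:/docker/abc\n12:memory:/docker/abc\n0::/system.slice/demo.service\n"
    for entry in parse_proc_cgroup(sample):
        print(entry)
```

On a pure v2 host the result has exactly one entry; on a v1 or hybrid host the same process can show different paths for different controllers.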
| Aspect | v1 | v2 |
|---|---|---|
| Hierarchy | Multiple (one per controller) | Single unified tree |
| Thread granularity | Threads can be in different cgroups | All threads in same cgroup (threaded controllers opt-in) |
| Memory accounting | Doesn't track shared pages well | Accurate shared page tracking |
| PSI (Pressure Stall Info) | Not available | Built-in per-cgroup pressure metrics |
| OOM control | OOM killer can be disabled per cgroup (memory.oom_control) | cgroup-aware OOM with memory.oom.group |
| Default in systemd | Pre-243 (hybrid or legacy) | 243+ upstream build default (Sept 2019); distributions switched on their own schedules |
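The per-cgroup pressure metrics in the table are exposed as small text files (cpu.pressure, memory.pressure, io.pressure) inside each v2 cgroup directory, with one line per stall kind in key=value form. As a sketch, a parser might look like the following; `parse_pressure` is an illustrative name and the numbers in the usage are invented sample values.

```python
def parse_pressure(text):
    """Parse a PSI file into {"some": {...}, "full": {...}}.

    avg10/avg60/avg300 are percentages over the trailing window;
    total is the cumulative stall time in microseconds.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        metrics = {}
        for pair in pairs:
            key, value = pair.split("=", 1)
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    print(parse_pressure(sample)["some"]["avg10"])
```

A monitoring loop could read memory.pressure for a container's cgroup periodically and alert when the "full" averages rise, which signals that all tasks in the cgroup were stalled at once.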
Container images are layered. OverlayFS merges multiple directory trees into a single unified view. Lower layers are read-only; the top layer captures writes. This is what makes docker pull efficient — shared base layers aren't duplicated.
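The merge is configured entirely through mount options. The helper below is a sketch that assembles the lowerdir/upperdir/workdir option string and the matching mount(8) argument list; it only builds strings and does not mount anything. The function names and the example paths are illustrative assumptions.

```python
def overlay_mount_options(lower_dirs, upper_dir, work_dir):
    """Build the -o string for an overlay mount.

    lower_dirs is ordered top to bottom: the first entry is the
    topmost read-only layer, the last entry is the base layer.
    """
    if not lower_dirs:
        raise ValueError("at least one lower directory is required")
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower_dirs), upper_dir, work_dir
    )


def overlay_mount_command(lower_dirs, upper_dir, work_dir, merged_dir):
    """Return the argv for mounting the merged view at merged_dir."""
    options = overlay_mount_options(lower_dirs, upper_dir, work_dir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged_dir]


if __name__ == "__main__":
    print(" ".join(overlay_mount_command(
        ["/layers/app", "/layers/base"], "/ctr/upper", "/ctr/work", "/ctr/merged"
    )))
```

Note that upperdir and workdir must live on the same filesystem, and that paths containing ':' or ',' would need escaping, which this sketch does not handle.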
A lookup starts at upperdir, then walks down through the lowerdirs in order. The first match wins. This is O(layers) in the worst case, but dentries are cached after the first lookup.
On the first write to a file that lives in a lower layer, the file is copied to upperdir (copy_up) and then modified in place. Metadata (xattrs, permissions) is preserved. Large files pay a full copy cost on that first write.
Deleting a file that exists in a lower layer creates a whiteout: a character device with device number 0/0 in upperdir. For a directory that is removed and recreated, an opaque xattr (trusted.overlay.opaque=y) hides all lower contents.
Renaming a directory that originates in a lower layer requires the redirect_dir feature (kernel 4.10+). Without it, the rename fails with EXDEV. This is a common gotcha on older kernels.
copy_up on large files (think: database files) in lower layers triggers a full copy to upper on first write. This is why database containers often use volumes or tmpfs for data directories, bypassing overlayfs entirely.
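The lookup, copy_up, and whiteout rules described above can be modeled in a few lines. This is a toy in-memory simulation, not how the kernel implements OverlayFS: layers are plain dicts, the `WHITEOUT` sentinel stands in for the 0/0 character device, and opaque directories are not modeled. All names are illustrative.

```python
WHITEOUT = object()  # stands in for the whiteout character device in upperdir


class OverlaySim:
    def __init__(self, lowers):
        # lowers[0] is the topmost read-only layer, like lowerdir= ordering
        self.lowers = lowers
        self.upper = {}
        self.copy_ups = 0

    def read(self, path):
        """Search upper first, then each lower layer; the first match wins."""
        for layer in [self.upper, *self.lowers]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify a file, copying it up whole if it only exists in a lower layer."""
        if path not in self.upper:
            self.upper[path] = self.read(path)  # copy_up: full content is copied
            self.copy_ups += 1
        elif self.upper[path] is WHITEOUT:
            raise FileNotFoundError(path)
        self.upper[path] += data

    def delete(self, path):
        """Remove a file; lower-layer files are hidden by a whiteout in upper."""
        self.read(path)  # raises FileNotFoundError if the file is not visible
        if any(path in layer for layer in self.lowers):
            self.upper[path] = WHITEOUT
        else:
            del self.upper[path]
```

In this model the lower dicts are never mutated, which mirrors why shared base layers stay intact, and `copy_ups` counts how often a first write paid the full-copy cost.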
Walk through the exact syscall sequence to create a container. No Docker, no containerd, just raw Linux primitives.
Docker is a convenience layer. It orchestrates all the primitives above through a well-defined component chain. Here's how docker run translates to syscalls:
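REST API daemon. Handles image management, volumes, networks. Delegates container lifecycle to containerd via gRPC. Containers are not children of dockerd, so they can keep running while it is down; note that by default Docker stops containers when the daemon shuts down, and the live-restore option is what keeps them running across daemon restarts.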
Industry-standard container runtime. Manages image pull/push (via content store), snapshot drivers (overlayfs), and container lifecycle. Direct CRI integration for Kubernetes. This is the real runtime.
Per-container process that becomes the parent of the container's init process (PID 1 inside the container). Keeps stdin/stdout open and reports the exit status. Allows containerd to restart without killing containers: the shim is why restarting containerd is safe for running workloads.
OCI reference runtime. Reads the OCI bundle spec (config.json + rootfs), does the actual clone() with namespace flags, sets up cgroups, pivots root, drops capabilities, applies seccomp, execs the entrypoint. Then exits — it's not a daemon.
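An image is a manifest that references a config blob plus an ordered list of layer digests. Layers are tar+gzip diffs. Multi-arch images add an image index (manifest list) whose per-entry platform fields select the right manifest, while mediaType identifies what each object is. Everything is content-addressable via SHA-256.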
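Content addressing is easy to see in code. The sketch below builds a manifest-shaped dictionary from raw blobs, computing each descriptor's digest and size with hashlib. The media type strings are the OCI image-spec values; the helper names and the placeholder blobs are illustrative, and a real layer would be an actual gzip-compressed tar archive.

```python
import hashlib
import json


def descriptor(media_type, blob):
    """Describe a blob by media type, SHA-256 digest, and size in bytes."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob, layer_blobs):
    """Assemble an OCI-style manifest; layers are listed base first."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for real config JSON and layer archives.
    manifest = build_manifest(b"{}", [b"base-layer-bytes", b"app-layer-bytes"])
    print(json.dumps(manifest, indent=2))
```

Because a digest depends only on the bytes, two images that share a base layer produce the same layer descriptor, which is what lets a registry or local store keep a single copy.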
config.json defines: root filesystem, mounts, process (args, env, cwd, capabilities), Linux-specific (namespaces, cgroups path, seccomp, apparmor, rlimits). This is the "container contract".
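As a rough illustration of that contract, the function below generates a stripped-down config.json as a Python dictionary. The field names follow the OCI runtime spec, but the chosen values (capability list, cgroups path, hostname, rlimit) are arbitrary assumptions for the example, and a production config would carry more mounts and a seccomp section.

```python
import json


def minimal_oci_config(args, rootfs="rootfs", hostname="demo"):
    """Return a small OCI runtime config as a dict (values are illustrative)."""
    caps = ["CAP_KILL", "CAP_NET_BIND_SERVICE"]
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "capabilities": {
                kind: list(caps) for kind in ("bounding", "effective", "permitted")
            },
            "rlimits": [{"type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024}],
            "noNewPrivileges": True,
        },
        "root": {"path": rootfs, "readonly": True},
        "hostname": hostname,
        "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}],
        "linux": {
            # One entry per namespace the runtime should create
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
            "cgroupsPath": "/demo/container1",
        },
    }


if __name__ == "__main__":
    print(json.dumps(minimal_oci_config(["/bin/sh"]), indent=2))
```

Written to config.json next to a rootfs directory, a document of this shape is what forms an OCI bundle.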
Containers are not VMs. Isolation comes from layered security mechanisms, each addressing a different attack surface. Understanding these layers is critical for threat modeling containerized workloads.
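Root's monolithic power is split into ~41 fine-grained caps. Docker's default set omits CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_ADMIN, CAP_SYS_MODULE, and others. CAP_NET_RAW is still granted by default, so drop it explicitly with --cap-drop=NET_RAW if the workload doesn't need raw sockets. Adding --privileged grants ALL caps, which is effectively uncontained.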
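A process's capability sets are visible as hex bitmasks in /proc/<pid>/status (the CapEff line is the effective set). The sketch below decodes such a mask into names using the bit numbering from linux/capability.h. The function names are illustrative, and the mask in the usage example, 0xa80425fb, is the value commonly seen inside a container started with Docker's default settings; treat it as a sample input rather than a guarantee.

```python
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]  # index in this list == capability bit number


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    return []


if __name__ == "__main__":
    for name in decode_caps(0xA80425FB):
        print(name)
```

Decoding 0xa80425fb yields 14 names, and CAP_SYS_ADMIN is not among them. Running `effective_caps(open("/proc/self/status").read())` inside your own container shows what it actually holds.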
Syscall filter at the kernel boundary. Default Docker profile blocks ~44 of ~330+ syscalls. Notably blocks: clone with CLONE_NEWUSER (preventing nested user namespaces), mount, reboot, kexec_load, bpf.
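Docker accepts custom profiles as JSON through --security-opt seccomp=<file>. As a simplified sketch of that file format, the function below generates a denylist profile that allows everything except a named set of syscalls. This is an assumption-laden teaching example: Docker's own default profile works the other way around (deny by default, allow a listed set), and `deny_profile` is an illustrative name.

```python
import json


def deny_profile(syscall_names):
    """Build a seccomp profile dict that rejects only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": sorted(set(syscall_names)),
                "action": "SCMP_ACT_ERRNO",  # the call fails with an errno instead of running
            }
        ],
    }


if __name__ == "__main__":
    profile = deny_profile(["mount", "reboot", "kexec_load", "bpf"])
    print(json.dumps(profile, indent=2))
```

An allowlist is the safer design because newly added kernel syscalls stay blocked until someone reviews them, which is why this denylist form is best kept for experiments.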
MAC (Mandatory Access Control) policies. Docker generates a default AppArmor profile (docker-default) and applies it to each container, restricting mount, ptrace, and signal operations. SELinux uses MCS labels (sVirt) to prevent cross-container access.
Maps container UID 0 to an unprivileged host UID (e.g., 100000). Even if a process escapes, it runs as an unprivileged user on the host. Not enabled by default in Docker (opt in with --userns-remap). Rootless mode uses this extensively.
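The mapping itself is a small table exposed at /proc/<pid>/uid_map, where each line reads "inside-start outside-start length". Translating a container UID to a host UID is then simple range arithmetic, sketched below with illustrative function names and the 100000 example from above.

```python
def parse_uid_map(text):
    """Parse uid_map contents into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_host_uid(container_uid, uid_map):
    """Translate a UID seen inside the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


if __name__ == "__main__":
    mapping = parse_uid_map("         0     100000      65536\n")
    print(to_host_uid(0, mapping))     # container root
    print(to_host_uid(1000, mapping))  # an ordinary container user
```

With this mapping, container root becomes host UID 100000 and container UID 1000 becomes 101000; any UID outside the 65536-wide range has no host identity at all.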
Image layers are immutable, and the --read-only flag makes the container's entire rootfs read-only. Add tmpfs mounts for paths that must stay writable, such as /tmp and /run. This prevents persistent compromise of the container filesystem.
The no_new_privs bit (prctl PR_SET_NO_NEW_PRIVS) prevents gaining privileges via setuid/setgid binaries or filesystem capabilities. Docker does not set it by default; enable it per container with --security-opt no-new-privileges.
| Vector | Mechanism | Mitigation |
|---|---|---|
| Kernel exploit | Shared kernel = shared vulnerabilities. Container shares syscall surface with host. | Keep kernel patched. Use gVisor/Kata for untrusted workloads (separate kernel/VMM). |
| Privileged mode | --privileged removes most confinement: full device access, all caps, no seccomp or AppArmor. | Never use --privileged. Use --cap-add for specific needs. |
| Mounted Docker socket | -v /var/run/docker.sock gives full Docker API access = host root equivalent. | Prefer rootless Docker. Docker-in-Docker avoids the host socket but usually needs --privileged itself. Never mount the socket in prod. |
| Sensitive host mounts | -v /:/host exposes entire host filesystem to the container. | Principle of least privilege. Mount only what's needed, read-only where possible. |
| Metadata service | Cloud metadata (169.254.169.254) accessible from containers = credential theft. | Network policies. IMDSv2 with hop limit=1. Workload identity. |