Container Internals — From Syscalls to Docker

01

Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources, while another set sees a different set. They're the isolation primitive — the reason a container can't see your host's processes, network, or filesystem.

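The kernel currently supports 8 namespace types: mount, UTS, IPC, PID, network, user, cgroup, and time. New namespaces are created with clone(2) or unshare(2); setns(2) moves a process into a namespace that already exists. Every process belongs to exactly one namespace of each type.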

Key insight: Namespaces don't provide security — they provide visibility isolation. A process inside a PID namespace literally cannot address PIDs outside it, but a root process could escape via other means. Namespaces + capabilities + seccomp = real containment.
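To make "exactly one namespace of each type" concrete, here is a minimal sketch that reads the namespace identities of the calling process from /proc/self/ns. The helper names self_ns_inode and print_self_namespaces are hypothetical, and the sketch assumes /proc is mounted. Two processes share a namespace when the inode numbers match.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Look up the inode that identifies the calling process's namespace of
 * the given type ("pid", "net", "mnt", ...).
 * Returns 0 on success, -1 if the entry cannot be read. */
int self_ns_inode(const char *type, ino_t *inode)
{
    char path[64];
    struct stat st;

    if (snprintf(path, sizeof(path), "/proc/self/ns/%s", type) >= (int)sizeof(path))
        return -1;
    if (stat(path, &st) != 0)
        return -1;
    *inode = st.st_ino;
    return 0;
}

/* Print one line per namespace type that the running kernel exposes. */
void print_self_namespaces(void)
{
    static const char *types[] = {
        "mnt", "uts", "ipc", "pid", "net", "user", "cgroup", "time"
    };

    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        ino_t ino;
        if (self_ns_inode(types[i], &ino) == 0)
            printf("%-6s %lu\n", types[i], (unsigned long)ino);
    }
}
```

Calling print_self_namespaces() on the host and again inside a container should show different inode numbers for every namespace type the container does not share with the host.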

Namespace Lifecycle

    C
    // Create a child process in new PID + NET + MNT namespaces
    clone(child_fn, stack + STACK_SIZE,
          CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD,
          arg);

    // OR: move current process into new namespace
    unshare(CLONE_NEWUTS);
    sethostname("container", 9);

    // OR: join an existing namespace via fd
    int fd = open("/proc/<pid>/ns/net", O_RDONLY);
    setns(fd, CLONE_NEWNET);

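Gotcha: with unshare(2), CLONE_NEWPID affects the caller's children, not the caller. The calling process stays in its original PID namespace; only its next fork lands in the new one, where it becomes PID 1. (With clone(2), the newly created child is placed in the new PID namespace directly.) This trips people up when building container runtimes.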
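The gotcha can be checked with a small experiment. The sketch below is an illustration, not how any particular runtime does it: the function name demo_unshare_newpid is hypothetical, and it combines CLONE_NEWPID with CLONE_NEWUSER on the assumption that unprivileged user namespaces are allowed, so that it can run without root. It runs inside a forked helper so the caller's own namespaces are untouched.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the unshare() caller kept its PID and its first child saw
 * itself as PID 1, 0 if the behaviour differed, and -1 if the namespaces
 * could not be created here (for example, blocked by seccomp or sysctl). */
int demo_unshare_newpid(void)
{
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        pid_t before = getpid();

        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(2);                       /* not permitted in this environment */
        if (getpid() != before)
            _exit(1);                       /* the caller itself must not move */

        pid_t child = fork();               /* first child enters the new PID namespace */
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);

        int st;
        if (waitpid(child, &st, 0) != child)
            _exit(2);
        _exit(WIFEXITED(st) ? WEXITSTATUS(st) : 1);
    }

    int status;
    if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

A result of -1 is common inside containers, where the default seccomp profile typically denies unshare.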


02

Control Groups (cgroups)

If namespaces are about what you can see, cgroups are about what you can use. They limit, account for, and isolate the resource usage (CPU, memory, I/O, PIDs) of process groups.

cgroups v2 uses a single unified hierarchy mounted at /sys/fs/cgroup. Each cgroup is a directory. Controllers are enabled per-subtree via cgroup.subtree_control.

    bash
    # Create a cgroup for our container
    mkdir /sys/fs/cgroup/mycontainer

    # Enable memory + cpu controllers
    echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control

    # Set memory limit to 256MB
    echo 268435456 > /sys/fs/cgroup/mycontainer/memory.max

    # Set CPU weight (relative, default=100)
    echo 50 > /sys/fs/cgroup/mycontainer/cpu.weight

    # Move process into cgroup
    echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs
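A runtime performs the same writes from code rather than from a shell. The following is a minimal sketch under that assumption; cgroup_write and cgroup_set_memory_max_mib are hypothetical helper names, and the directory passed in is expected to be an existing cgroup v2 directory such as the one created above.

```c
#include <stdio.h>

/* Write `value` into the control file `file` inside `cgroup_dir`.
 * Returns 0 on success, -1 on any error. The kernel may reject a value
 * only when the buffered data is flushed, so fclose() is checked too. */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value)
{
    char path[4096];
    FILE *f;
    int ok;

    if (snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file) >= (int)sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Set memory.max from a size in MiB; 256 MiB becomes 268435456 bytes. */
int cgroup_set_memory_max_mib(const char *cgroup_dir, unsigned long mib)
{
    char value[32];

    snprintf(value, sizeof(value), "%llu",
             (unsigned long long)mib * 1024ULL * 1024ULL);
    return cgroup_write(cgroup_dir, "memory.max", value);
}
```

For example, cgroup_set_memory_max_mib("/sys/fs/cgroup/mycontainer", 256) is the C counterpart of the memory.max line above, and cgroup_write(dir, "cpu.weight", "50") mirrors the weight setting.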

v2 key design: the "no internal processes" rule. A non-root cgroup can either contain processes or enable controllers for its child cgroups via cgroup.subtree_control, not both. This eliminates the resource accounting ambiguity of v1.

v1 uses separate hierarchies per controller — each controller (memory, cpu, cpuacct, blkio, etc.) is a distinct mount. A process can be in different cgroups across different hierarchies.

    bash
    # v1 paths are per-controller
    /sys/fs/cgroup/memory/docker/<id>/memory.limit_in_bytes
    /sys/fs/cgroup/cpu/docker/<id>/cpu.shares
    /sys/fs/cgroup/pids/docker/<id>/pids.max

v1 pain point: Because hierarchies are independent, you can end up with conflicting resource policies across controllers. This is the primary motivation for v2's unified tree.
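The difference between the two layouts is visible in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. A v2 membership shows up as a single entry with ID 0 and an empty controller list, while v1 lists one entry per mounted hierarchy. Below is a small parsing sketch; the struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct cgroup_entry {
    int  hierarchy_id;
    char controllers[128];
    char path[256];
};

/* Parse one line of /proc/<pid>/cgroup ("ID:controllers:path").
 * Returns 0 on success, -1 if the line is malformed or too long. */
int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
    const char *c1 = strchr(line, ':');
    const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
    size_t clen, plen;

    if (c1 == NULL || c2 == NULL)
        return -1;
    if (sscanf(line, "%d", &out->hierarchy_id) != 1)
        return -1;

    clen = (size_t)(c2 - c1 - 1);              /* text between the two colons */
    if (clen >= sizeof(out->controllers))
        return -1;
    memcpy(out->controllers, c1 + 1, clen);
    out->controllers[clen] = '\0';

    plen = strcspn(c2 + 1, "\n");              /* path without trailing newline */
    if (plen >= sizeof(out->path))
        return -1;
    memcpy(out->path, c2 + 1, plen);
    out->path[plen] = '\0';
    return 0;
}

/* A v2 (unified hierarchy) entry has ID 0 and no controller names. */
int is_v2_entry(const struct cgroup_entry *e)
{
    return e->hierarchy_id == 0 && e->controllers[0] == '\0';
}
```

On a pure v2 host a process has exactly one such line; on a v1 host the same process appears once per controller hierarchy, which is where the conflicting policies come from.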

Aspect	v1	v2
Hierarchy	Multiple (one per controller)	Single unified tree
Thread granularity	Threads can be in different cgroups	All threads in same cgroup (threaded controllers opt-in)
Memory accounting	Doesn't track shared pages well	Accurate shared page tracking
PSI (Pressure Stall Info)	Not available	Built-in per-cgroup pressure metrics
OOM control	OOM killer can be disabled per cgroup (memory.oom_control)	cgroup-aware group kill via memory.oom.group
Default in systemd	Before 247	247+ (upstream build default since Nov 2020; distro adoption varies)


03

Overlay Filesystems

Container images are layered. OverlayFS merges multiple directory trees into a single unified view. Lower layers are read-only; the top layer captures writes. This is what makes docker pull efficient — shared base layers aren't duplicated.

    bash
    # The -o value must be a single comma-separated word with no spaces
    mount -t overlay overlay \
      -o lowerdir=/layer2:/layer1:/base,upperdir=/container/upper,workdir=/container/work \
      /merged
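The same mount can be issued through mount(2). The sketch below assembles the option string and passes it as the data argument; build_overlay_opts and mount_overlay are hypothetical helper names, and the actual mount call needs CAP_SYS_ADMIN in the current mount namespace.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Append `s` to `buf`, tracking the used length. Returns -1 if it does not fit. */
static int append(char *buf, size_t size, size_t *used, const char *s)
{
    size_t len = strlen(s);

    if (*used + len >= size)
        return -1;
    memcpy(buf + *used, s, len + 1);
    *used += len;
    return 0;
}

/* Build "lowerdir=a:b:c,upperdir=U,workdir=W". lowers[0] is the topmost
 * read-only layer, matching the order used on the mount command line. */
int build_overlay_opts(char *buf, size_t size,
                       const char *const lowers[], size_t n_lowers,
                       const char *upper, const char *work)
{
    size_t used = 0;

    if (size == 0 || n_lowers == 0)
        return -1;
    buf[0] = '\0';
    if (append(buf, size, &used, "lowerdir=") != 0)
        return -1;
    for (size_t i = 0; i < n_lowers; i++) {
        if (i > 0 && append(buf, size, &used, ":") != 0)
            return -1;
        if (append(buf, size, &used, lowers[i]) != 0)
            return -1;
    }
    if (append(buf, size, &used, ",upperdir=") != 0 ||
        append(buf, size, &used, upper) != 0 ||
        append(buf, size, &used, ",workdir=") != 0 ||
        append(buf, size, &used, work) != 0)
        return -1;
    return 0;
}

/* Thin wrapper around mount(2) for an overlay filesystem. */
int mount_overlay(const char *target, const char *opts)
{
    return mount("overlay", target, "overlay", 0, opts);
}
```

Note that upperdir and workdir must live on the same filesystem, and workdir should be an empty directory.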


Copy-on-Write Mechanics

Read Path

Lookup starts at upperdir → walks down through lowerdirs in order. First match wins. O(layers) worst case, but dentries are cached after first lookup.

Write / Modify

File is copied from lower layer to upperdir (copy_up), then modified in place. Metadata (xattrs, permissions) preserved. Large files pay a full copy cost on first write.

Delete

Creates a whiteout — a char device (0,0) in upperdir. For directories, an opaque xattr (trusted.overlay.opaque=y) hides all lower contents.

Rename

Renaming a directory that exists in a lower layer requires the redirect_dir feature (kernel 4.10+). Without it, such renames fail with EXDEV and the caller has to fall back to copy and delete. This is a common gotcha on older kernels.
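The read path and the whiteout rule described above can be modelled in user space. This is only a model for intuition, not the kernel's implementation: overlay_lookup is a hypothetical function, it treats layers[0] as upperdir, and it ignores opaque directories and redirects.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Return the index of the first layer that provides `rel`, searching
 * from layers[0] (upperdir) downward. Returns -1 if no layer has the
 * path or if a whiteout (char device 0,0) hides it. */
int overlay_lookup(const char *const layers[], size_t n_layers, const char *rel)
{
    for (size_t i = 0; i < n_layers; i++) {
        char path[4096];
        struct stat st;

        if (snprintf(path, sizeof(path), "%s/%s", layers[i], rel) >= (int)sizeof(path))
            return -1;
        if (lstat(path, &st) != 0)
            continue;                      /* not in this layer, keep walking down */
        if (S_ISCHR(st.st_mode) &&
            major(st.st_rdev) == 0 && minor(st.st_rdev) == 0)
            return -1;                     /* whiteout: deleted in a higher layer */
        return (int)i;                     /* first match wins */
    }
    return -1;
}
```

The loop makes the O(layers) worst case visible: a path that exists only in the base layer is found after one failed lstat per layer above it.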

Performance gotcha: copy_up on large files (think: database files) in lower layers triggers a full copy to upper on first write. This is why database containers often use volumes or tmpfs for data directories, bypassing overlayfs entirely.

04

Build a Container from Scratch

Walk through the exact syscall sequence to create a container. No Docker, no containerd, just raw Linux primitives.
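One possible shape of that sequence is sketched here: clone into new namespaces, set a hostname, pivot into a root filesystem, mount /proc, and exec. It is a deliberately reduced sketch that omits cgroups, capability dropping, and seccomp. The names run_container and container_args are hypothetical, the rootfs is assumed to be an already extracted directory tree, and the program needs root (or equivalent capabilities) to succeed.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

struct container_args {
    const char *rootfs;
    char *const *argv;
};

/* Runs as PID 1 inside the new namespaces. */
static int child_fn(void *p)
{
    struct container_args *a = p;

    if (sethostname("container", 9) != 0) return 1;
    /* Keep our mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) return 1;
    /* pivot_root needs the new root to be a mount point: bind it onto itself. */
    if (mount(a->rootfs, a->rootfs, NULL, MS_BIND | MS_REC, NULL) != 0) return 1;
    if (chdir(a->rootfs) != 0) return 1;
    /* pivot_root(".", ".") stacks the old root on top; detach it afterwards. */
    if (syscall(SYS_pivot_root, ".", ".") != 0) return 1;
    if (umount2(".", MNT_DETACH) != 0) return 1;
    if (chdir("/") != 0) return 1;
    if (mount("proc", "/proc", "proc", 0, NULL) != 0) return 1;

    execv(a->argv[0], a->argv);
    return 1;                              /* only reached if execv failed */
}

/* Returns the container's exit status, or -1 if it could not be started. */
int run_container(const char *rootfs, char *const argv[])
{
    struct container_args args = { rootfs, argv };
    char *stack = malloc(STACK_SIZE);
    int status, rc = -1;
    pid_t pid;

    if (stack == NULL)
        return -1;
    pid = clone(child_fn, stack + STACK_SIZE,   /* stack grows downward */
                CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                CLONE_NEWIPC | CLONE_NEWNET | SIGCHLD,
                &args);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        rc = WEXITSTATUS(status);
    free(stack);
    return rc;
}
```

A caller would pass the path of an unpacked root filesystem and an argv such as {"/bin/sh", NULL}. Because CLONE_NEWNET is set, the container starts with only a loopback interface that is down; real runtimes wire up networking as a separate step.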

05

Docker Architecture

Docker is a convenience layer. It orchestrates all the primitives above through a well-defined component chain. Here's how docker run translates to syscalls:

docker CLI → dockerd (API) → containerd → containerd-shim → runc → clone() + exec()

dockerd

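REST API daemon. Handles image management, volumes, networks. Delegates container lifecycle to containerd via gRPC. If dockerd crashes, running containers can survive because their shims keep them alive; for a planned daemon stop or restart, this requires the live-restore option, since dockerd otherwise shuts containers down.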

containerd

Industry-standard container runtime. Manages image pull/push (via content store), snapshot drivers (overlayfs), and container lifecycle. Direct CRI integration for Kubernetes. This is the real runtime.

containerd-shim

Per-container process that becomes the parent of the container's PID 1. Keeps stdin/stdout open, reports exit status. Allows containerd to restart without killing containers. The shim is why restarting the Docker daemon (with live-restore enabled) leaves containers running.

runc

OCI reference runtime. Reads the OCI bundle spec (config.json + rootfs), does the actual clone() with namespace flags, sets up cgroups, pivots root, drops capabilities, applies seccomp, execs the entrypoint. Then exits — it's not a daemon.

OCI Image Spec

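An image is a manifest that points to a config blob and an ordered list of layer digests. Layers are tar diffs, usually gzip-compressed. Multi-arch images add an image index whose entries carry a platform field and point to per-platform manifests. Everything is content-addressable via SHA256.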

OCI Runtime Spec

config.json defines: root filesystem, mounts, process (args, env, cwd, capabilities), Linux-specific (namespaces, cgroups path, seccomp, apparmor, rlimits). This is the "container contract".
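To show how the namespace list in config.json turns into a clone() call, here is a sketch that maps the runtime spec's namespace type names to CLONE_* flags. The function name clone_flags_from_oci is hypothetical, JSON parsing is assumed to have happened already, and the spec's "time" type is left out to keep the example portable across older headers.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000   /* fallback for older libc headers */
#endif

/* Translate OCI namespace type names into a clone() flag mask.
 * Returns -1 if a name is not recognised. */
long clone_flags_from_oci(const char *const types[], size_t n)
{
    static const struct { const char *name; long flag; } table[] = {
        { "pid",     CLONE_NEWPID },
        { "network", CLONE_NEWNET },
        { "mount",   CLONE_NEWNS },
        { "ipc",     CLONE_NEWIPC },
        { "uts",     CLONE_NEWUTS },
        { "user",    CLONE_NEWUSER },
        { "cgroup",  CLONE_NEWCGROUP },
    };
    long flags = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < sizeof(table) / sizeof(table[0]); j++) {
            if (strcmp(types[i], table[j].name) == 0) {
                flags |= table[j].flag;
                break;
            }
        }
        if (j == sizeof(table) / sizeof(table[0]))
            return -1;                     /* unknown namespace type */
    }
    return flags;
}
```

A namespace entry that carries a path in config.json means "join this existing namespace" and would go through setns() instead of contributing a flag here.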

docker run — Full Call Chain

    sequence
    docker run alpine sh
    CLI → POST /containers/create to dockerd (REST)
    dockerd → containerd.Create() (gRPC) + pull image if needed
    containerd → prepare snapshot (overlayfs), create OCI bundle
    containerd → start containerd-shim, pass bundle path
    shim → exec runc create --bundle /run/containerd/...
    runc: clone(NEWPID|NEWNS|NEWNET|NEWUTS|NEWIPC|NEWCGROUP)
    runc: setup cgroups, mount proc/sys/dev, pivot_root
    runc: drop caps, apply seccomp, execve("/bin/sh")
    runc exits. shim reparents PID 1. Container running.

06

Security Layers

Containers are not VMs. Isolation comes from layered security mechanisms, each addressing a different attack surface. Understanding these layers is critical for threat modeling containerized workloads.

Linux Capabilities

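Root's monolithic power is split into ~41 fine-grained capabilities. Docker's default set drops, among others, CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, and CAP_SYS_MODULE; CAP_NET_RAW stays enabled unless you remove it with --cap-drop. Adding --privileged grants ALL caps, which is effectively uncontained.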
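A process can inspect which capabilities it actually holds by reading the CapEff bitmask from /proc/self/status. The sketch below parses that line and tests individual bits; the function names are hypothetical, and the CAP_* constants come from the kernel's linux/capability.h header.

```c
#include <linux/capability.h>
#include <stdio.h>

/* Parse a "CapEff:\t<hex>" line. Returns 0 on success, -1 otherwise. */
int parse_capeff(const char *line, unsigned long long *mask)
{
    return sscanf(line, "CapEff: %llx", mask) == 1 ? 0 : -1;
}

/* Read the effective capability mask of the calling process. */
int read_self_capeff(unsigned long long *mask)
{
    char line[256];
    int rc = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (parse_capeff(line, mask) == 0) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}

/* Capability number N corresponds to bit N of the mask. */
int has_cap(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}
```

Inside a container started without extra flags, has_cap(mask, CAP_SYS_ADMIN) is expected to be 0, while under --privileged every defined bit is set.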

Seccomp-BPF

Syscall filter at the kernel boundary. Default Docker profile blocks ~44 of ~330+ syscalls. Notably blocks: clone with CLONE_NEWUSER (preventing nested user namespaces), mount, reboot, kexec_load, bpf.
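To give a feel for what such a filter looks like, here is a deliberately tiny seccomp-BPF program that denies a single harmless syscall, uname, with EPERM and allows everything else. It is a sketch, not Docker's profile: the function name is hypothetical, it runs in a forked child so the caller stays unfiltered, and it skips the architecture check that any production filter should perform on seccomp_data.arch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if uname() was denied with EPERM under the filter, 0 if it
 * was not, and -1 if the filter could not be installed here. */
int demo_seccomp_blocks_uname(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is uname, fall through to the ERRNO return; else skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct utsname u;

        /* Unprivileged processes must set no_new_privs before installing a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        _exit(uname(&u) == -1 && errno == EPERM ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 1 ? 0 : -1;
}
```

Real profiles are usually written as allowlists and compiled with libseccomp rather than hand-assembled, but the kernel-side mechanism is the same.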

AppArmor / SELinux

MAC (Mandatory Access Control) policies. On AppArmor hosts, Docker loads a default profile (docker-default) and applies it to each container unless overridden; it restricts mount, ptrace, and signal operations. SELinux uses MCS labels (sVirt) to prevent cross-container access.

User Namespaces

Map container UID 0 to an unprivileged host UID (e.g., 100000). Even if a process escapes, it holds only that unprivileged UID's rights on the host. Not enabled by default in Docker; the daemon option is --userns-remap. Rootless mode uses this extensively.
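The mapping itself is simple arithmetic over the ranges listed in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The sketch below applies one such range; the struct and function names are hypothetical.

```c
#include <stdio.h>

struct id_map {
    unsigned long inside;    /* first ID as seen in the container */
    unsigned long outside;   /* first ID as seen on the host */
    unsigned long count;     /* number of consecutive IDs mapped */
};

/* Parse one uid_map/gid_map line such as "0 100000 65536". */
int parse_id_map_line(const char *line, struct id_map *m)
{
    return sscanf(line, "%lu %lu %lu", &m->inside, &m->outside, &m->count) == 3
               ? 0 : -1;
}

/* Translate a container ID to its host ID, or -1 if it is unmapped. */
long long map_to_host(const struct id_map *m, unsigned long id)
{
    if (id < m->inside || id - m->inside >= m->count)
        return -1;
    return (long long)(m->outside + (id - m->inside));
}
```

With the range "0 100000 65536", container UID 0 becomes host UID 100000 and container UID 1000 becomes 101000; IDs outside the range have no host identity at all.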

Read-only Layers

Image layers are immutable. The --read-only flag makes the container's entire root filesystem read-only; pair it with tmpfs mounts for /tmp and /run so the workload still has scratch space. This prevents persistent compromise of the container filesystem.

No New Privileges

The no_new_privs bit (prctl PR_SET_NO_NEW_PRIVS) prevents gaining privileges via setuid/setgid binaries or file capabilities across execve. Docker does not set it by default; enable it with --security-opt no-new-privileges.
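The bit is a one-way switch, which a few prctl calls can demonstrate. This sketch (hypothetical function name, run in a forked child so the caller is unaffected) sets the bit, reads it back, and confirms that an attempt to clear it is rejected.

```c
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if no_new_privs could be set, read back as 1, and not
 * cleared again; 0 if any of those expectations failed; -1 on error. */
int demo_no_new_privs(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        if (prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) != 1)
            _exit(1);
        /* Clearing is not allowed: the kernel only accepts the value 1. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0) == 0)
            _exit(1);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 1 ? 0 : -1;
}
```

The bit is inherited across fork and preserved across execve, so setting it once before exec'ing the entrypoint covers the whole container process tree.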

Container Escape Vectors

Vector	Mechanism	Mitigation
Kernel exploit	Shared kernel = shared vulnerabilities. Container shares syscall surface with host.	Keep kernel patched. Use gVisor/Kata for untrusted workloads (separate kernel/VMM).
Privileged mode	--privileged disables all isolation. Full device access, all caps, no seccomp.	Never use --privileged. Use --cap-add for specific needs.
Mounted Docker socket	-v /var/run/docker.sock gives full Docker API access = host root equivalent.	Use Docker-in-Docker or rootless Docker. Never mount the socket in prod.
Sensitive host mounts	-v /:/host exposes entire host filesystem to the container.	Principle of least privilege. Mount only what's needed, read-only where possible.
Metadata service	Cloud metadata (169.254.169.254) accessible from containers = credential theft.	Network policies. IMDSv2 with hop limit=1. Workload identity.
