An Interactive Deep Dive

How Linux Works

Explore the eight fundamental subsystems that power everything from smartphones to supercomputers.

root@kernel:~# explore --interactive --depth=deep
40M+ Lines of Code
460+ System Calls
128 TB Virtual Addr Space
8 Namespace Types
01 — Kernel Architecture

Monolithic by design, modular in practice

Linux runs all core services — scheduling, memory management, filesystems, drivers, networking — in a single shared address space at Ring 0. This monolithic design trades isolation for speed: no IPC overhead between subsystems, just function calls.

"Linux is obsolete" — Tanenbaum's 1992 critique predicted monolithic kernels would lose to microkernels. Three decades later, Linux powers 100% of the top 500 supercomputers.

— The Tanenbaum–Torvalds Debate, comp.os.minix, 1992

Loadable Kernel Modules (.ko files) resolve the monolithic-vs-microkernel tension: drivers and filesystems load at runtime via modprobe, executing in Ring 0 with full kernel privileges. Rust gained kernel support in 6.1 and was promoted to a core language in December 2025.
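A minimal C module makes the mechanics concrete. This is a hedged sketch: the file name hello.c and the log strings are illustrative, but module_init/module_exit, pr_info, and MODULE_LICENSE are the standard module building blocks.

```c
// hello.c - minimal loadable kernel module (sketch; names are illustrative)
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module");

static int __init hello_init(void)
{
    pr_info("hello: loaded into the running kernel\n");
    return 0;                 /* a non-zero return would abort the load */
}

static void __exit hello_exit(void)
{
    pr_info("hello: unloaded\n");
}

module_init(hello_init);      /* called when insmod/modprobe loads the .ko */
module_exit(hello_exit);      /* called when rmmod removes it */
```

Built against the running kernel's headers with a one-line Kbuild makefile (obj-m += hello.o), the resulting hello.ko loads with modprobe or insmod, runs at Ring 0 like built-in code, and its messages appear in dmesg.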

Click a layer to explore the architecture

Linux Kernel Architecture
User Applications Ring 3
Bash, Firefox, Docker, your code — all run in unprivileged user space. They can only interact with hardware through system calls. Each gets a private virtual address space (128 TB on x86-64) enforced by the MMU. Attempting to access kernel memory triggers a General Protection Fault.
C Library (glibc / musl) Ring 3
The C library wraps raw syscalls into POSIX-compatible functions. printf() eventually calls write(1, buf, len). The VDSO (Virtual Dynamic Shared Object) maps kernel time data into user space, letting gettimeofday() execute without any ring transition — zero syscall overhead.
System Call Interface Ring 3 → Ring 0
The controlled gateway. On x86-64: place the syscall number in %rax, arguments in %rdi/%rsi/%rdx/%r10/%r8/%r9, execute the syscall instruction. The CPU flips to Ring 0, saves the return address in %rcx (and flags in %r11), and jumps to the entry point stored in the LSTAR MSR. The kernel indexes into sys_call_table (460+ entries) and dispatches. Cost: ~100ns per transition. (A code sketch of this convention follows the diagram.)
VFS / Process Scheduler / Memory Manager Ring 0
Core kernel subsystems. The VFS dispatches file operations through function pointer tables — ext4, procfs, and /dev/null all implement the same interface. CFS/EEVDF schedules tasks via a red-black tree of virtual runtimes. The memory manager handles page tables, demand paging, COW, and the OOM killer.
Device Drivers & Loadable Modules Ring 0
Drivers account for ~60% of the kernel codebase. Loaded via modprobe from /lib/modules/$(uname -r)/. Character devices (byte streams: serial, /dev/null) and block devices (random access: NVMe, SATA) are identified by major:minor number pairs. DMA enables zero-copy data transfers between device and memory.
Hardware (CPU, RAM, Devices) Physical Layer
The MMU translates virtual→physical addresses through 4-level page tables. The TLB caches recent translations (64–1536 entries). Interrupts signal asynchronous events (disk completion, network packet arrival, timer tick). DMA controllers transfer data without CPU involvement. The IOMMU protects against rogue device memory access.
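To make the Ring 3 to Ring 0 transition in the System Call Interface layer concrete, here is a hedged sketch that performs the same write three ways: via the glibc wrapper, via syscall(2), and via raw inline assembly following the register convention described above. The message text is illustrative.

```c
// write(1, ...) three ways: libc wrapper, syscall(2), raw x86-64 assembly
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>

int main(void)
{
    const char msg[] = "hello from ring 3\n";
    size_t len = strlen(msg);

    write(1, msg, len);                /* glibc wrapper around the syscall */
    syscall(SYS_write, 1, msg, len);   /* generic syscall(2) wrapper       */

    /* Raw convention: number in %rax, args in %rdi/%rsi/%rdx;
       the syscall instruction clobbers %rcx and %r11.          */
    long ret;
    asm volatile ("syscall"
                  : "=a"(ret)
                  : "a"((long)SYS_write), "D"(1L), "S"(msg), "d"(len)
                  : "rcx", "r11", "memory");
    return 0;
}
```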
02 — Process Management

From fork() to the scheduler's red-black tree

Every process starts as a copy of another via fork(), which duplicates the task_struct (~6–8 KB) and page tables — but not the actual memory pages (that's copy-on-write). The child then typically calls exec() to replace its image. This two-step model lets the gap between fork and exec handle arbitrary setup: redirecting FDs, dropping privileges, configuring namespaces.
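A hedged sketch of that two-step model, with the file-descriptor setup happening in the gap between fork() and exec(); the output file name and the target command are illustrative:

```c
// fork() + exec() with file-descriptor setup in between
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();                     /* duplicate task_struct + page tables (COW) */
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {                         /* child: arbitrary setup before exec */
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); _exit(1); }
        dup2(fd, STDOUT_FILENO);            /* redirect stdout into the file */
        close(fd);
        execlp("ls", "ls", "-l", (char *)NULL);  /* replace the process image */
        perror("execlp");                   /* only reached if exec failed */
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);               /* parent reaps the child (no zombie) */
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```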

The kernel treats threads and processes identically — both are task_struct entries. The clone() syscall with CLONE_VM | CLONE_FILES | CLONE_THREAD creates a thread; without those flags, a full process. EEVDF (kernel 6.6+) replaced CFS, using virtual deadlines and lag-based eligibility to eliminate scheduling heuristics.
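The flag set can be exercised directly. The sketch below shares the address space, filesystem context, and file table with the child (thread-like sharing) while keeping it waitable via SIGCHLD; adding CLONE_SIGHAND | CLONE_THREAD would make it a true thread. The stack size and shared counter are illustrative.

```c
// clone() with CLONE_VM | CLONE_FILES: child shares memory and the FD table
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)      /* illustrative 1 MiB child stack */

static int shared_counter = 0;        /* visible to the child via CLONE_VM */

static int child_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;              /* writes the parent's memory directly */
    return 0;
}

int main(void)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Stack grows down, so pass the top of the mapping. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); exit(1); }

    waitpid(pid, NULL, 0);
    printf("shared_counter = %d (the child wrote our memory)\n", shared_counter);
    return 0;
}
```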

Click a state to explore the process lifecycle

Process State Machine
R Running
S Sleeping
D Disk Sleep
T Stopped
Z Zombie
TASK_RUNNING (R)
Either executing on a CPU or waiting in the run queue. CFS sorts runnable tasks in a red-black tree by virtual runtime; the leftmost node (lowest vruntime, cached for O(1) access) runs next. EEVDF keeps the red-black tree but selects the eligible task with the earliest virtual deadline. There are no fixed time slices; the targeted latency (~6ms) is divided proportionally by task weight (nice value).
03 — Virtual Memory

Every process owns 128 terabytes (it thinks)

Virtual memory gives each process the illusion of a vast, private, contiguous address space. A 48-bit virtual address is split into four 9-bit indices walking PGD → PUD → PMD → PTE page tables, plus a 12-bit page offset. The MMU does this in hardware; the TLB caches results. Huge pages (2 MB or 1 GB) reduce TLB pressure for large working sets.
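The 9+9+9+9+12 split can be reproduced with plain bit arithmetic. A small sketch that decomposes a 48-bit virtual address into its PGD/PUD/PMD/PTE indices and page offset; the sample address is illustrative:

```c
// Decompose a 48-bit x86-64 virtual address into 4-level page-table indices
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567abcULL;   /* illustrative user-space address */

    uint64_t offset = vaddr & 0xFFF;          /* bits 0-11 : offset within the 4 KiB page */
    uint64_t pte    = (vaddr >> 12) & 0x1FF;  /* bits 12-20: page table index             */
    uint64_t pmd    = (vaddr >> 21) & 0x1FF;  /* bits 21-29: page middle directory index  */
    uint64_t pud    = (vaddr >> 30) & 0x1FF;  /* bits 30-38: page upper directory index   */
    uint64_t pgd    = (vaddr >> 39) & 0x1FF;  /* bits 39-47: page global directory index  */

    printf("vaddr = 0x%016llx\n", (unsigned long long)vaddr);
    printf("PGD=%llu PUD=%llu PMD=%llu PTE=%llu offset=0x%llx\n",
           (unsigned long long)pgd, (unsigned long long)pud,
           (unsigned long long)pmd, (unsigned long long)pte,
           (unsigned long long)offset);
    return 0;
}
```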

Copy-on-write makes fork() fast: the kernel copies page table entries and marks all shared pages read-only. A write triggers a page fault; the handler allocates a new page and copies on demand. Since most forks immediately exec(), actual copying rarely happens.
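Copy-on-write is observable from user space: after fork(), parent and child see the same value until one of them writes, at which point the writer gets its own private copy. A minimal sketch:

```c
// Observing copy-on-write: the child's write does not affect the parent
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static int value = 1;            /* one physical page, shared read-only after fork */

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {
        value = 99;              /* page fault: the kernel copies the page for the child */
        printf("child : value = %d\n", value);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent: value = %d (unchanged by the child's write)\n", value);
    return 0;
}
```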

Click a region to explore the address space

x86-64 Virtual Address Space Layout
0xFFFF800000000000 | Kernel Space | 128 TB
Non-canonical | Guard Hole | ~16M TB
0x7FFF........ | Stack | ↓ grows down
variable | mmap / Shared Libs | dynamic
after BSS | Heap | ↑ grows up
0x00600000 | BSS / Data / Text | ELF segments
Kernel Space (128 TB) — The upper half of the address space is mapped identically in every process. Contains the direct map of all physical memory, vmalloc regions, module space, and fixmap. User-space code cannot access it (Supervisor bit in PTEs). KPTI (Kernel Page Table Isolation) unmaps most kernel pages from user-space page tables to mitigate Meltdown.
04 — Filesystems

Everything is a file (descriptor)

The Virtual Filesystem Switch dispatches open(), read(), write() through function pointer tables — ext4, procfs, and /dev/null all implement the same interface. Four objects make it work: superblock (mounted FS metadata), inode (file metadata, no filename), dentry (name→inode mapping, cached in dcache), and file (open handle with offset).
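That uniform interface is visible from user space: the identical open/read loop works on a regular ext4 file, a procfs pseudo-file, and /dev/null. A hedged sketch; the chosen paths are illustrative:

```c
// The same read() loop against three very different "files" under the VFS
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static void dump_first_bytes(const char *path)
{
    char buf[256];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return; }

    ssize_t n = read(fd, buf, sizeof(buf) - 1);   /* dispatched via the file's f_op table */
    if (n < 0) n = 0;
    buf[n] = '\0';
    printf("%-16s -> %zd bytes: %.40s\n", path, n, buf);
    close(fd);
}

int main(void)
{
    dump_first_bytes("/etc/hostname");   /* regular file on a disk filesystem       */
    dump_first_bytes("/proc/uptime");    /* procfs pseudo-file, generated on read   */
    dump_first_bytes("/dev/null");       /* char device: read() returns 0 (EOF)     */
    return 0;
}
```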

ext4 replaced indirect block mapping with extents: each describes a contiguous range (up to 128 MiB). Journaling (JBD2, default data=ordered) writes intended metadata changes to a journal first — crash recovery takes seconds instead of the hours fsck needed on ext2.

Click a directory to explore the FHS

Filesystem Hierarchy Standard
/ - root of everything
📁 /bin - essential binaries
📁 /etc - configuration files
📁 /proc - process virtual FS
📁 /sys - kernel device model
📁 /dev - device nodes
📁 /home - user directories
📁 /var - variable/runtime data
📁 /tmp - ephemeral scratch
📁 /boot - kernel & bootloader
/ — The root directory. The single root of the entire filesystem tree. Every file, device, process pseudo-file, and mount point is reachable from here. The root partition is kept intentionally small to minimize corruption risk. All other filesystems are mounted at subdirectories of /.
05 — I/O & Devices

The /proc, /sys, udev triad

Linux classifies devices as character (byte streams: serial, /dev/null) or block (random-access: NVMe, SATA), identified by major:minor number pairs. Three virtual filesystems expose internals: /proc (process info, tunable knobs in /proc/sys/), /sys (structured device model by bus/class/device), and udev (userspace daemon creating device nodes in /dev from kernel uevents).
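The major:minor identification can be read with stat() and the major()/minor() macros. A short sketch; the default path is illustrative:

```c
// Print the device class and major:minor numbers for a /dev node
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/null";
    struct stat st;

    if (stat(path, &st) != 0) { perror(path); return 1; }

    if (S_ISCHR(st.st_mode))
        printf("%s: character device %u:%u\n", path, major(st.st_rdev), minor(st.st_rdev));
    else if (S_ISBLK(st.st_mode))
        printf("%s: block device %u:%u\n", path, major(st.st_rdev), minor(st.st_rdev));
    else
        printf("%s: not a device node\n", path);
    return 0;
}
```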

I/O schedulers sit between the filesystem and the block driver: mq-deadline (sorted batches), BFQ (interactive fairness), kyber (latency-optimized for NVMe), or none (passthrough for NVMe drives that schedule internally; on fast NVMe, benchmarks commonly show none far ahead, e.g. 785 KIOPS vs 315 KIOPS under BFQ).
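The active scheduler is exposed (and switchable) per device under /sys/block/<dev>/queue/scheduler. A small sketch that reads it; the default device name is illustrative:

```c
// Show the active I/O scheduler for a block device via sysfs
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "nvme0n1";   /* illustrative device name */
    char path[256], line[256];

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    /* The file lists all available schedulers; the active one is in brackets,
       e.g. "[none] mq-deadline kyber bfq".                                    */
    if (fgets(line, sizeof(line), f))
        printf("%s: %s", dev, line);
    fclose(f);
    return 0;
}
```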

06 — Networking Stack

From NIC interrupt to socket read()

A packet arrives → NIC DMAs it into a ring buffer → hardware interrupt → NAPI polling drains up to 300 packets per cycle → sk_buff allocation (pointer manipulation, not copying) → GRO reassembly → ip_rcv() routing → transport layer (TCP state machine or UDP direct queue) → socket buffer → read() returns data.
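The end of that path is an ordinary socket read. A minimal sketch of a TCP listener whose accept() and read() consume data delivered along exactly that route; the port number is illustrative:

```c
// Minimal TCP listener: the read() at the bottom of the receive path
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(9000),          /* illustrative port */
                                .sin_addr   = { htonl(INADDR_ANY) } };
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    listen(srv, 16);

    int conn = accept(srv, NULL, NULL);       /* blocks until a handshake completes */
    char buf[1024];
    ssize_t n = read(conn, buf, sizeof(buf)); /* drains the socket receive buffer */
    printf("received %zd bytes\n", n);

    close(conn);
    close(srv);
    return 0;
}
```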

Netfilter provides 5 hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING). nftables replaced iptables with atomic rulesets and unified IPv4/IPv6. Network namespaces virtualize the entire stack — each gets independent interfaces, routing, and firewall rules. Connected by veth pairs, this is the foundation of container networking.

eBPF + XDP enables programmable packet processing at the earliest point in the stack — before sk_buff allocation — achieving up to 24M packets/sec/core for DDoS mitigation.

07 — Boot Process

Firmware to PID 1 in five stages

Click Play Boot to step through the sequence, or click any stage directly.

Linux Boot Sequence
Firmware (BIOS / UEFI) T+0ms
POST checks CPU, RAM, and peripherals. Legacy BIOS: 16-bit real mode, reads 512-byte MBR, limited to 2 TB disks. UEFI: 32/64-bit protected mode from the start, reads .efi executables from a FAT32 ESP, supports GPT (>2 TB, 128 partitions), and provides Secure Boot — cryptographic chain of trust from firmware through shim to GRUB to kernel.
GRUB2 Bootloader T+500ms
GRUB2 understands filesystems (ext4, XFS, Btrfs) and locates the kernel by path, not raw sectors. Loads vmlinuz (compressed kernel) and the initramfs image into memory, passing command line parameters (root=UUID=..., quiet, ro). In UEFI mode, firmware directly executes grubx64.efi from the ESP.
Kernel Decompression T+800ms
"Decompressing Linux..." — vmlinuz (the "z" means compressed) self-extracts. kASLR randomizes the load address. Assembly entry code sets up GDT, initial page tables, and a 16 KB kernel stack, then jumps to start_kernel() in init/main.c — 86+ initialization calls: arch setup, memory, scheduler, IRQs, timers, VFS, console.
initramfs T+1.5s
Solves the chicken-and-egg problem: the kernel needs storage/FS drivers to mount root, but those drivers are on the root filesystem. initramfs is a gzip+cpio archive unpacked into tmpfs, containing exactly the needed modules. Its init runs udev, loads drivers, assembles RAID/LVM, decrypts LUKS, mounts real root, then switch_root deletes the tmpfs and execs the real init.
systemd (PID 1) T+2.5s
PID 1 is immune to SIGKILL — if it dies, the kernel panics. systemd replaces sequential SysVinit with a dependency graph of declarative unit files, starting services in parallel. Socket activation creates listening sockets before services start — connections trigger on-demand launch. Every service runs in its own cgroup for precise process tracking and resource limits (MemoryMax=, CPUQuota=, TasksMax=).
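Socket activation is a small documented protocol: systemd passes the pre-created listening sockets as file descriptors starting at 3 and advertises them via the LISTEN_PID and LISTEN_FDS environment variables. A hedged sketch that reads the protocol by hand (libsystemd's sd_listen_fds() performs the same checks):

```c
// Minimal socket-activation check: fds >= 3 are passed in by systemd
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SD_LISTEN_FDS_START 3   /* first inherited fd, per the sd_listen_fds(3) protocol */

int main(void)
{
    const char *pid_s = getenv("LISTEN_PID");
    const char *fds_s = getenv("LISTEN_FDS");

    /* Only trust the variables if they were set for *this* process. */
    if (!pid_s || !fds_s || (pid_t)atol(pid_s) != getpid()) {
        fprintf(stderr, "not socket-activated; create the listening socket ourselves\n");
        return 0;
    }

    int nfds = atoi(fds_s);
    for (int i = 0; i < nfds; i++) {
        int fd = SD_LISTEN_FDS_START + i;
        printf("inherited listening fd %d from systemd\n", fd);
        /* accept()/read() on fd as usual; the connection that triggered
           activation is already queued on it. */
    }
    return 0;
}
```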
08 — Security & Isolation

From chmod to containers

Traditional Unix security is Discretionary Access Control: owner/group/other × read/write/execute. Capabilities break root's monolithic privilege into ~41 units (CAP_NET_BIND_SERVICE, CAP_SYS_MODULE, etc.). SELinux (RHEL/Fedora, label-based MAC) and AppArmor (Ubuntu/SUSE, path-based MAC) enforce mandatory policies beyond DAC.

Namespaces (PID, network, mount, user, UTS, IPC, cgroup, time) virtualize kernel resources. Cgroups v2 control CPU, memory, I/O, and PID limits. Combined, they are containers — Docker just orchestrates the setup. Seccomp-BPF filters syscalls per-process; Docker blocks ~44 dangerous ones by default.
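The namespace primitives are ordinary syscalls. A hedged sketch that unshares a user and UTS namespace and sets a private hostname, invisible to the rest of the system; the hostname string is illustrative, and the host must permit unprivileged user namespaces:

```c
// unshare() a user + UTS namespace and set a private hostname
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    /* A new user namespace grants full capabilities inside it; the new UTS
       namespace it owns gives us a private hostname.                        */
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    const char *name = "container-demo";       /* illustrative hostname */
    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return 1;
    }

    struct utsname u;
    uname(&u);
    printf("hostname inside the namespace: %s\n", u.nodename);
    return 0;
}
```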

Interactive permission calculator

Unix File Permissions
Owner: rwx   Group: r-x   Other: r-x
-rwxr-xr-x = 0755
chmod 755 filename
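The decomposition the calculator performs can be reproduced with stat(): mask st_mode into the owner/group/other triads and print the symbolic and octal forms. A short sketch; the default path is illustrative:

```c
// Decode st_mode into the rwxr-xr-x string and the octal chmod value
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/bin/ls";
    struct stat st;
    if (stat(path, &st) != 0) { perror(path); return 1; }

    /* Owner, group, other: three read/write/execute triads. */
    static const mode_t bits[9] = { S_IRUSR, S_IWUSR, S_IXUSR,
                                    S_IRGRP, S_IWGRP, S_IXGRP,
                                    S_IROTH, S_IWOTH, S_IXOTH };
    static const char flags[] = "rwxrwxrwx";
    char sym[10] = "---------";

    for (int i = 0; i < 9; i++)
        if (st.st_mode & bits[i])
            sym[i] = flags[i];

    printf("%s: %s  octal %04o  (chmod %o %s)\n",
           path, sym, (unsigned)(st.st_mode & 07777),
           (unsigned)(st.st_mode & 0777), path);
    return 0;
}
```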