Explore the eight fundamental subsystems that power everything from smartphones to supercomputers.
Linux runs all core services — scheduling, memory management, filesystems, drivers, networking — in a single shared address space at Ring 0. This monolithic design trades isolation for speed: no IPC overhead between subsystems, just function calls.
"Linux is obsolete" — Tanenbaum's 1992 critique predicted monolithic kernels would lose to microkernels. Three decades later, Linux powers 100% of the top 500 supercomputers.
Loadable Kernel Modules (.ko files) resolve the monolithic-vs-microkernel tension: drivers and filesystems load at runtime via modprobe from /lib/modules/$(uname -r)/, executing in Ring 0 with full kernel privileges. Rust gained kernel support in 6.1 and was promoted to a core language in December 2025.
printf() eventually calls write(1, buf, len). The syscall instruction flips the CPU to Ring 0, saves the return address, and jumps to the entry point held in LSTAR. The kernel indexes into sys_call_table (460+ entries) and dispatches. Cost: ~100 ns per transition. The VDSO (Virtual Dynamic Shared Object) maps kernel time data into user space, letting gettimeofday() execute without any ring transition: zero syscall overhead.

Every process starts as a copy of another via fork(), which duplicates the task_struct (~6–8 KB) and page tables, but not the actual memory pages (that's copy-on-write). The child then typically calls exec() to replace its image. This two-step model lets the gap between fork and exec handle arbitrary setup: redirecting FDs, dropping privileges, configuring namespaces.
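A minimal user-space sketch of that two-step model (the output filename and the ls command are arbitrary choices): the child uses the gap between fork() and exec() to redirect stdout before replacing itself.

```c
/* Sketch: fork, do setup in the child, then exec. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                      /* copy-on-write duplicate of this process */
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {                          /* child: the fork/exec gap */
        int fd = open("out.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) { perror("open"); _exit(1); }
        dup2(fd, STDOUT_FILENO);             /* redirect fd 1 before exec */
        close(fd);
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");                    /* only reached if exec failed */
        _exit(127);
    }

    waitpid(pid, NULL, 0);                   /* parent waits for the child */
    return 0;
}
```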
The kernel treats threads and processes identically — both are task_struct entries. The clone() syscall with CLONE_VM | CLONE_FILES | CLONE_THREAD creates a thread; without those flags, a full process. EEVDF (kernel 6.6+) replaced CFS, using virtual deadlines and lag-based eligibility to eliminate scheduling heuristics.
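A sketch of that flag-driven distinction, assuming glibc's clone() wrapper: with CLONE_VM | CLONE_FILES the child shares the parent's memory and file table, so its write to the counter is visible to the parent; drop the flags and clone() behaves like fork(), giving the child a private copy. (A true CLONE_THREAD task additionally requires CLONE_SIGHAND and joins the caller's thread group, which makes it awkward to wait for in a short example.)

```c
/* Sketch: a thread-like task created with clone(), sharing the address space. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter = 0;               /* visible to the child via CLONE_VM */

static int child_fn(void *arg)
{
    shared_counter = 42;                     /* writes the parent's memory directly */
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* CLONE_VM | CLONE_FILES: share memory and FD table (thread-like).
     * Without them, clone() gives the child private copies, like fork(). */
    pid_t pid = clone(child_fn, stack + stack_size,   /* stack grows downward */
                      CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    printf("shared_counter = %d\n", shared_counter);  /* prints 42 */
    free(stack);
    return 0;
}
```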
Virtual memory gives each process the illusion of a vast, private, contiguous address space. A 48-bit virtual address is split into four 9-bit indices walking PGD → PUD → PMD → PTE page tables, plus a 12-bit page offset. The MMU does this in hardware; the TLB caches results. Huge pages (2 MB or 1 GB) reduce TLB pressure for large working sets.
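The split is easy to reproduce with shifts and masks; this toy program (the example address is arbitrary) mirrors the 9 + 9 + 9 + 9 + 12 bit layout described above.

```c
/* Toy decomposition of a 48-bit x86-64 virtual address into page-table indices. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;     /* arbitrary example address */

    unsigned offset = va & 0xFFF;            /* bits 0-11:  offset within the 4 KiB page */
    unsigned pte    = (va >> 12) & 0x1FF;    /* bits 12-20: PTE index */
    unsigned pmd    = (va >> 21) & 0x1FF;    /* bits 21-29: PMD index */
    unsigned pud    = (va >> 30) & 0x1FF;    /* bits 30-38: PUD index */
    unsigned pgd    = (va >> 39) & 0x1FF;    /* bits 39-47: PGD index */

    printf("PGD %u -> PUD %u -> PMD %u -> PTE %u, offset 0x%x\n",
           pgd, pud, pmd, pte, offset);
    return 0;
}
```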
Copy-on-write makes fork() fast: the kernel copies page table entries and marks all shared pages read-only. A write triggers a page fault; the handler allocates a new page and copies on demand. Since most forks immediately exec(), actual copying rarely happens.
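A small demonstration of the resulting semantics: after fork(), a write in the child faults, the handler gives the child its own copy of the page, and the parent's view stays untouched.

```c
/* Sketch: the child's write lands on its private copy-on-write page. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 1;                               /* shared read-only after fork() */

int main(void)
{
    pid_t pid = fork();
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {
        value = 99;                          /* page fault: kernel copies the page for the child */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent still sees value = %d\n", value);   /* prints 1 */
    return 0;
}
```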
The Virtual Filesystem Switch dispatches open(), read(), write() through function pointer tables — ext4, procfs, and /dev/null all implement the same interface. Four objects make it work: superblock (mounted FS metadata), inode (file metadata, no filename), dentry (name→inode mapping, cached in dcache), and file (open handle with offset).
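A toy out-of-tree module shows what "implementing the same interface" means in practice: fill a struct file_operations and register it. This is a hedged sketch, not the real /dev/null implementation in drivers/char/mem.c; the name "sink", the dynamic major allocation, and the usual obj-m Makefile are assumptions.

```c
/* Sketch: a /dev/null-like character device plugged into the VFS ops table. */
#include <linux/fs.h>
#include <linux/module.h>

static int sink_major;

static ssize_t sink_read(struct file *f, char __user *buf, size_t len, loff_t *off)
{
    return 0;                                /* always EOF */
}

static ssize_t sink_write(struct file *f, const char __user *buf, size_t len, loff_t *off)
{
    return len;                              /* pretend everything was consumed */
}

static const struct file_operations sink_fops = {
    .owner = THIS_MODULE,
    .read  = sink_read,
    .write = sink_write,
};

static int __init sink_init(void)
{
    sink_major = register_chrdev(0, "sink", &sink_fops);  /* 0 = allocate a major */
    return sink_major < 0 ? sink_major : 0;
}

static void __exit sink_exit(void)
{
    unregister_chrdev(sink_major, "sink");
}

module_init(sink_init);
module_exit(sink_exit);
MODULE_LICENSE("GPL");
```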
ext4 replaced indirect block mapping with extents: each describes a contiguous range of up to 32,768 blocks (128 MiB with 4 KiB blocks). Journaling (JBD2, default data=ordered) writes intended metadata changes to a journal first — crash recovery takes seconds instead of the hours fsck needed on ext2.
Linux classifies devices as character (byte streams: serial, /dev/null) or block (random-access: NVMe, SATA), identified by major:minor number pairs; DMA enables zero-copy data transfers between device and memory. Three virtual filesystems expose internals: /proc (process info, tunable knobs in /proc/sys/), /sys (structured device model by bus/class/device), and udev (userspace daemon creating device nodes in /dev from kernel uevents).
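Those major:minor pairs are visible from user space via stat(); on a typical system /dev/null reports character device 1:3. A small sketch:

```c
/* Prints the major:minor pair and device class of a device node. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/null";
    struct stat st;

    if (stat(path, &st) != 0) { perror("stat"); return 1; }

    const char *type = S_ISCHR(st.st_mode) ? "character" :
                       S_ISBLK(st.st_mode) ? "block" : "not a device";
    printf("%s: %s device %u:%u\n", path, type,
           major(st.st_rdev), minor(st.st_rdev));
    return 0;
}
```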
I/O schedulers sit between filesystem and block driver: mq-deadline (sorted batches), BFQ (interactive fairness), kyber (latency-optimized for NVMe), or none (passthrough for NVMe drives that schedule internally; benchmarks have measured around 785 KIOPS with none versus 315 KIOPS with BFQ).
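The active scheduler is exposed (and switchable) per device under /sys/block/<dev>/queue/scheduler, with the current choice shown in brackets. A minimal reader, with the device name nvme0n1 assumed:

```c
/* Prints e.g. "[none] mq-deadline kyber bfq" for the assumed device nvme0n1. */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/block/nvme0n1/queue/scheduler", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(line, sizeof line, f))
        printf("%s", line);
    fclose(f);
    return 0;
}
```

Writing one of the listed names back to the same file switches schedulers at runtime.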
A packet arrives → NIC DMAs it into a ring buffer → hardware interrupt → NAPI polling drains up to 300 packets per cycle → sk_buff allocation (pointer manipulation, not copying) → GRO reassembly → ip_rcv() routing → transport layer (TCP state machine or UDP direct queue) → socket buffer → read() returns data.
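The last step of that chain is the only one an application sees. A minimal UDP receiver (port 9000 is an arbitrary choice) blocks in recvfrom() until the kernel has queued a datagram on the socket buffer:

```c
/* Sketch: drain one datagram from the per-socket receive buffer. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);             /* arbitrary port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);  /* blocks here */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}
```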
Netfilter provides 5 hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING). nftables replaced iptables with atomic rulesets and unified IPv4/IPv6. Network namespaces virtualize the entire stack — each gets independent interfaces, routing, and firewall rules. Connected by veth pairs, this is the foundation of container networking.
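The primitive underneath is a single syscall. This sketch (requires root) moves the process into a fresh network namespace and lists its interfaces; only an isolated loopback device appears until a veth pair is added from outside:

```c
/* Sketch: enter a new network namespace, then exec "ip link" inside it. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWNET) != 0) {        /* new, empty network stack */
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }
    /* Inside the new namespace only "lo" exists (and it starts down). */
    execlp("ip", "ip", "link", (char *)NULL);
    perror("execlp");
    return 1;
}
```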
eBPF + XDP enables programmable packet processing at the earliest point in the stack — before sk_buff allocation — achieving up to 24M packets/sec/core for DDoS mitigation.
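A minimal XDP program, hedged as a sketch: it drops UDP packets destined to an arbitrary port (9999) before any sk_buff exists, with the bounds checks the BPF verifier insists on. The clang/libbpf toolchain, the interface name, and the iproute2 attach command are assumptions about the build environment.

```c
/* Build (assumed):  clang -O2 -g -target bpf -c xdp_drop.c -o xdp_drop.o
 * Attach (assumed): ip link set dev eth0 xdpgeneric obj xdp_drop.o sec xdp */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int drop_udp_9999(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every pointer advance must be bounds-checked or the verifier rejects us. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))        /* hypothetical port under attack */
        return XDP_DROP;                     /* dropped before sk_buff allocation */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```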
Traditional Unix security is Discretionary Access Control: owner/group/other × read/write/execute. Capabilities break root's monolithic privilege into ~41 units (CAP_NET_BIND_SERVICE, CAP_SYS_MODULE, etc.). SELinux (RHEL/Fedora, label-based MAC) and AppArmor (Ubuntu/SUSE, path-based MAC) enforce mandatory policies beyond DAC.
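One of those units is easy to observe: binding below port 1024 requires CAP_NET_BIND_SERVICE, so the sketch below fails with EACCES as an ordinary user but succeeds as root or after something like setcap cap_net_bind_service=+ep on the binary (assuming the default ip_unprivileged_port_start of 1024).

```c
/* Sketch: attempt to bind a privileged port and report the result. */
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80);               /* privileged port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
        printf("bound to port 80 (CAP_NET_BIND_SERVICE present)\n");
    else
        printf("bind failed: %s\n", strerror(errno));   /* EACCES without the capability */

    close(fd);
    return 0;
}
```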
Namespaces (PID, network, mount, user, UTS, IPC, cgroup, time) virtualize kernel resources. Cgroups v2 control CPU, memory, I/O, and PID limits. Combined, they are containers — Docker just orchestrates the setup. Seccomp-BPF filters syscalls per-process; Docker blocks ~44 dangerous ones by default.
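A toy seccomp-BPF filter shows the mechanism; Docker's real profile is a much larger allowlist. Assuming x86-64 syscall numbers and headers new enough for SECCOMP_RET_KILL_PROCESS, this denies a single syscall (chdir) and allows everything else; a production filter would also validate seccomp_data.arch first.

```c
/* Sketch: install a one-rule seccomp-BPF filter, then trip it. */
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number from seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is chdir, kill the process; otherwise allow. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chdir, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof filter / sizeof filter[0],
        .filter = filter,
    };

    /* Required to load a filter without CAP_SYS_ADMIN. */
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return 1;
    }

    printf("filter installed; calling chdir()...\n");
    chdir("/");                              /* delivers SIGSYS: process is killed here */
    printf("never reached\n");
    return 0;
}
```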