From Power-On to Your Kernel's First Instruction
The x86 boot process is a chain of increasingly capable environments. The CPU starts in 16-bit real mode with 1 MiB of addressable memory, and your job is to climb through protected mode to 64-bit long mode — setting up the GDT, page tables, and control registers along the way.
BIOS Legacy Boot
When power is applied, the CPU begins executing at the reset vector 0xFFFFFFF0 (mapped to BIOS ROM). The BIOS performs POST, initializes the memory controller and PCI bus, then loads the first sector of the boot device (512 bytes) to physical address 0x7C00, verifies the boot signature 0x55, 0xAA at bytes 510–511, and jumps there in 16-bit real mode with DL set to the boot drive number.
Minimal Boot Sector
```nasm
[BITS 16]
[ORG 0x7C00]

start:
    jmp 0x0000:.flush_cs   ; Far jump normalizes CS to 0x0000
.flush_cs:
    xor ax, ax
    mov ds, ax
    mov ss, ax
    mov sp, 0x7C00         ; Stack grows down from load address
    mov si, msg
.print:
    lodsb
    test al, al
    jz .halt
    mov ah, 0x0E           ; BIOS TTY output
    int 0x10
    jmp .print
.halt:
    cli
    hlt

msg: db "Hello from bootloader!", 0x0D, 0x0A, 0

times 510 - ($ - $$) db 0  ; Pad to 510 bytes
dw 0xAA55                  ; Boot signature
```
Real Mode → Protected Mode → Long Mode
The transition requires a precise sequence:
- Disable interrupts and load a GDT containing 32-bit code and data descriptors.
- Set CR0.PE (bit 0), then far-jump to reload CS — the CPU is now in protected mode.
- Build identity-mapped page tables, set CR4.PAE (bit 5), and load CR3 with the physical address of the PML4.
- Set EFER.LME (bit 8 of MSR 0xC0000080), then set CR0.PG (bit 31) to enable paging — long mode is now active.
- Far-jump through a 64-bit code descriptor to begin executing 64-bit code.
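The protected-mode half of the transition — PAE, CR3, EFER.LME, paging — can be sketched in NASM. This is a sketch, not a drop-in: `pml4_table`, `gdt64.code`, and `long_mode_entry` are assumed to be defined elsewhere (illustrative names).

```nasm
[BITS 32]
enter_long_mode:
    mov eax, cr4
    or  eax, 1 << 5           ; CR4.PAE — required for long mode
    mov cr4, eax
    mov eax, pml4_table       ; Physical address of the PML4
    mov cr3, eax
    mov ecx, 0xC0000080       ; IA32_EFER MSR
    rdmsr
    or  eax, 1 << 8           ; EFER.LME — long mode enable
    wrmsr
    mov eax, cr0
    or  eax, 1 << 31          ; CR0.PG — paging on, long mode activates
    mov cr0, eax
    jmp gdt64.code:long_mode_entry ; Far jump through the 64-bit descriptor
```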
Modern Alternative: The Limine Bootloader
Limine eliminates all the assembly above. It boots your kernel directly into 64-bit long mode with paging active and higher-half mappings configured. The kernel uses a request/response protocol — place struct requests in a .limine_requests section, and the bootloader fills in the response:
```c
#include <limine.h>

__attribute__((used, section(".limine_requests")))
static volatile struct limine_framebuffer_request fb_request = {
    .id = LIMINE_FRAMEBUFFER_REQUEST,
    .revision = 0
};

void _start(void) {
    // The bootloader fills in .response before handing over control;
    // it remains NULL if no framebuffer exists, so check before use.
    if (!fb_request.response || fb_request.response->framebuffer_count < 1)
        for (;;) asm("hlt");

    struct limine_framebuffer *fb = fb_request.response->framebuffers[0];
    volatile uint32_t *pixels = fb->address;
    pixels[0] = 0xFFFFFF; // White pixel at top-left (assumes 32 bpp)

    for (;;) asm("hlt");
}
```
4-Level Page Tables on x86-64
x86-64 uses a 4-level page table hierarchy. CR3 points to the PML4 (Page Map Level 4). Each table contains 512 entries of 8 bytes, occupying exactly one 4 KiB page. A 48-bit virtual address is decomposed into indices at each level plus a 12-bit page offset.
Page Table Entry Layout
Key bits in a 64-bit entry: bit 0 Present, bit 1 Read/Write, bit 2 User/Supervisor, bit 3 Write-Through, bit 4 Cache Disable, bit 5 Accessed, bit 6 Dirty, bit 7 Page Size (maps a 2 MiB or 1 GiB page directly at the PD or PDPT level), bits 12–51 the physical frame address, and bit 63 NX (No-Execute, requires EFER.NXE).
Higher-Half Kernel
During early boot, identity mapping (virtual = physical) is essential so the instruction right after paging activation can still be fetched. A higher-half kernel maps the kernel at a high address like 0xFFFFFFFF80000000 while user processes occupy the lower half. This gives user space a contiguous range and keeps the kernel mapped in every process's page tables — avoiding TLB flushes on syscalls.
Canonical addressing enforces that bits 48–63 must be sign-extensions of bit 47, creating a gap: lower half 0x0000000000000000–0x00007FFFFFFFFFFF (user space) and upper half 0xFFFF800000000000–0xFFFFFFFFFFFFFFFF (kernel space). Non-canonical addresses trigger a #GP fault.
TLB Management
The TLB caches virtual-to-physical translations. After modifying any page table entry, invalidate the stale TLB entry with INVLPG (single page) or reload CR3 (flushes everything except global pages). On SMP systems, modifying shared page tables requires a TLB shootdown: the modifying CPU sends IPIs to all other cores, each of which runs INVLPG in its IPI handler.
Physical Frames, Heaps, and Allocators
Before you can allocate anything, you need the physical memory map. The BIOS provides this via INT 15h, EAX=E820h, returning entries with a 64-bit base, 64-bit length, and 32-bit type (1=usable, 2=reserved, 3=ACPI reclaimable). UEFI provides the equivalent via GetMemoryMap().
Allocator Designs
Bitmap Allocator
One bit per 4 KiB page frame: 0 = free, 1 = used. Costs 32 KiB per GiB of RAM. Allocation scans for a zero bit (O(n) worst case); freeing clears a bit (O(1)).
```c
#define BITMAP_SIZE 32768 // Covers 1 GiB: one bit per 4 KiB frame

static uint8_t frame_bitmap[BITMAP_SIZE];

uint64_t alloc_frame(void) {
    for (uint64_t i = 0; i < BITMAP_SIZE; i++) {
        if (frame_bitmap[i] != 0xFF) {            // Skip fully used bytes
            for (int bit = 0; bit < 8; bit++) {
                if (!(frame_bitmap[i] & (1 << bit))) {
                    frame_bitmap[i] |= (1 << bit);
                    return (i * 8 + bit) * 4096;  // Frame's physical address
                }
            }
        }
    }
    return 0; // Out of memory (mark frame 0 used at init so 0 is unambiguous)
}
```
Buddy Allocator
Linux's primary page allocator. Memory is divided into power-of-two blocks of 2^0 through 2^10 pages (orders 0–10). Allocation finds the smallest free block ≥ the requested size, splitting larger blocks in half; freeing coalesces a block with its buddy whenever both are free. The buddy of the block at physical address A and order k is at A XOR (1 << (k + 12)). Both alloc and free are O(log n).
Strengths
- Fast O(log n) alloc and free
- Low external fragmentation
- Efficient coalescing via XOR buddy check
- Good cache behavior for large allocations
Weaknesses
- Internal fragmentation (power-of-2 rounding)
- More complex to implement than bitmap
- May waste memory on odd-sized requests
- Splitting/coalescing overhead per operation
Slab Allocator
The slab allocator pre-allocates caches of fixed-size objects — achieving O(1) alloc and free for kernel structures like task_struct and inode. Generic caches (kmalloc-32, kmalloc-64, kmalloc-128, etc.) handle variable-size requests by rounding up to the nearest cache size. Linux uses SLUB (the unqueued slab allocator) as its default.
IDT, PIC, APIC, and System Calls
The Interrupt Descriptor Table maps vectors 0–255 to handler functions. In 64-bit mode, each IDT entry is 16 bytes. The first 32 vectors are reserved for CPU exceptions (#DE, #DB, NMI, #BP, #DF, #GP, #PF, etc.). Vectors 8, 10–14, and 17 push an error code; the rest do not.
```c
typedef struct {
    uint16_t offset_low;      // Bits 0–15 of handler address
    uint16_t selector;        // Code segment selector (e.g., 0x08)
    uint8_t  ist;             // Bits 0–2: IST index (0 = legacy)
    uint8_t  type_attributes; // Gate type + DPL + Present
    uint16_t offset_mid;      // Bits 16–31
    uint32_t offset_high;     // Bits 32–63
    uint32_t reserved;        // Must be zero
} __attribute__((packed)) idt_entry_t;
```
PIC Remapping
The legacy 8259 PIC maps IRQs 0–7 to interrupts 0x08–0x0F, colliding with CPU exceptions. You must remap them — the standard approach puts them at vectors 0x20–0x2F via the ICW1–ICW4 initialization sequence on ports 0x20/0x21 (master) and 0xA0/0xA1 (slave). After handling an IRQ, send EOI (0x20) to the master's command port — and for IRQs 8–15, to the slave's command port (0xA0) as well.
SYSCALL / SYSRET
The modern fast-path system call mechanism uses MSRs instead of the IDT:
| MSR | Address | Purpose |
|---|---|---|
| IA32_EFER | 0xC0000080 | Bit 0 (SCE): enable SYSCALL |
| STAR | 0xC0000081 | Bits 47:32 = kernel CS; 63:48 = user CS base |
| LSTAR | 0xC0000082 | Kernel entry point (RIP loaded on SYSCALL) |
| SFMASK | 0xC0000084 | RFLAGS bits cleared on entry (typically 0x700) |
On entry, the handler is still running on the user's stack pointer — SYSCALL does not switch stacks. Execute SWAPGS to reach per-CPU data and manually load the kernel stack pointer before touching the stack.
Context Switching and Ring Transitions
Task State Segment (64-bit)
The 64-bit TSS is no longer used for hardware task switching but remains essential for RSP0 (kernel stack loaded on ring 3→0 transitions) and the IST (dedicated stacks for critical exceptions like double faults). Each CPU core needs its own TSS.
Context Switch
Save callee-saved registers, swap stack pointers, restore. The full flow: timer IRQ → push interrupt frame → save GPRs → schedule() picks next task → swap RSP → update TSS.RSP0 → if different process, mov cr3, new_cr3 → restore GPRs → IRETQ.
```nasm
; void context_switch(uint64_t *old_rsp, uint64_t new_rsp)
context_switch:
    push rbp
    push rbx
    push r12
    push r13
    push r14
    push r15
    pushfq
    mov [rdi], rsp   ; Save current RSP into *old_rsp
    mov rsp, rsi     ; Load the next task's RSP
    popfq
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret              ; Returns into the new task
```
Entering User Mode (Ring 3)
Construct a fake interrupt frame on the stack and execute IRETQ:
```nasm
enter_usermode:
    push 0x23        ; SS — user data selector, RPL=3
    push user_rsp    ; RSP — user stack pointer
    push 0x202       ; RFLAGS — IF=1 (interrupts enabled)
    push 0x1B        ; CS — user code selector, RPL=3
    push user_entry  ; RIP — user entry point
    iretq
```
Scheduling Algorithms
Round Robin
Circular queue of runnable tasks. Each task gets a fixed time quantum (typically 10–100 ms). On the timer interrupt, the current task moves to the back of the queue. Simple and fair for equal-priority tasks, but offers no priority differentiation.
CFS (Completely Fair Scheduler)
Linux's Completely Fair Scheduler tracks a per-task vruntime in a red-black tree. The task with the lowest vruntime runs next. Higher-priority tasks accumulate vruntime more slowly, getting proportionally more CPU time. No fixed quanta — slice lengths adapt dynamically.
Async/Await (Rust)
Rust's async/await provides cooperative multitasking within the kernel. Futures run until Poll::Pending at .await points — zero context-switch overhead, no per-task stacks. Ideal for I/O-bound kernel tasks (disk, network) while user processes still use preemptive scheduling via timer interrupts.
From Ramdisk to Disk Drivers
VFS Layer
A Virtual File System provides a uniform API (open, read, write, close) dispatching to different backends via function pointer tables. Core objects: superblock (per-FS metadata), inode (per-file metadata), dentry (path→inode), file (open handle with cursor).
Initrd (Initial Ramdisk)
Solves the chicken-and-egg problem: the kernel needs drivers to read from disk, but those drivers might live on disk. The bootloader loads a ramdisk image into memory alongside the kernel. A tar-based initrd is simplest — parse 512-byte USTAR headers sequentially to find files by name.
ATA PIO Disk Access
The primary ATA bus uses ports 0x1F0–0x1F7. Key commands: IDENTIFY (0xEC) returns drive info, READ SECTORS (0x20) reads via 28-bit LBA.
```c
// Assumes outb/inb/inw port-I/O helpers (typically inline asm wrappers).
void ata_read_sector(uint32_t lba, uint8_t *buffer) {
    while (inb(0x1F7) & 0x80);                 // Wait for BSY before issuing
    outb(0x1F6, 0xE0 | ((lba >> 24) & 0x0F)); // Master drive + LBA bits 24–27
    outb(0x1F2, 0x01);                         // Sector count: 1
    outb(0x1F3, lba & 0xFF);                   // LBA low
    outb(0x1F4, (lba >> 8) & 0xFF);            // LBA mid
    outb(0x1F5, (lba >> 16) & 0xFF);           // LBA high
    outb(0x1F7, 0x20);                         // READ SECTORS command
    while (inb(0x1F7) & 0x80);                 // Wait for BSY to clear
    while (!(inb(0x1F7) & 0x08));              // Wait for DRQ to set
    for (int i = 0; i < 256; i++)              // 256 words = 512 bytes
        ((uint16_t *)buffer)[i] = inw(0x1F0);
}
```
Choices That Shape Everything
Monolithic vs Microkernel
Monolithic (Linux)
- All services in kernel space — fast direct calls
- No IPC overhead between subsystems
- Simple to develop incrementally
- Vast existing ecosystem and drivers
Microkernel (seL4, MINIX 3)
- Superior fault isolation — driver crash ≠ kernel crash
- Minimal trusted computing base
- Modern L4 IPC: ~0.5–1 μs per message
- Formal verification possible (seL4: ~10K lines C)
Rust vs C for Kernel Development
C offers direct memory manipulation and a 40+ year ecosystem. But ~70% of Linux kernel CVEs are memory-safety bugs that Rust prevents at compile time. In a kernel, unsafe blocks (for hardware access) represent roughly 5–15% of code, concentrated in the HAL. The rest gets ownership-enforced guarantees against use-after-free, double-free, buffer overflows, null derefs, and data races. Rust entered the Linux kernel mainline in v6.1 (2022).
Preemptive vs Cooperative
Cooperative (tasks yield voluntarily) is simpler but lets one bad task freeze everything — classic Mac OS and Windows 3.1 suffered this. Preemptive uses timer interrupts to forcibly switch, ensuring fairness. All modern general-purpose OSes are preemptive. Best approach: async/await (cooperative) for kernel I/O tasks, preemptive scheduling for user processes.
Essential References
OSDev Wiki
The single most important resource. Bare Bones, GCC Cross-Compiler, Paging, ATA PIO, FAT, Interrupts — all documented with code.
Writing an OS in Rust
Philipp Oppermann's step-by-step series. VGA, exceptions, paging, heap allocation, async/await. 17.4k GitHub stars.
Limine Bare Bones
Skip the bootloader assembly. Limine boots you straight to 64-bit long mode with framebuffer, memory map, and higher-half.
Intel SDM / AMD APM
The authoritative architecture references. Volume 3 covers system programming: GDT, IDT, paging, control registers, MSRs.
The Little OS Book
Concise guide from Bochs setup through paging and user mode. Good for a weekend sprint.
r/osdev & OSDev Discord
Active communities for questions, debugging help, and project showcases. Check the forum at forum.osdev.org too.
Dev Workflow
Build a cross-compiler targeting x86_64-elf: configure binutils + GCC with --target=x86_64-elf --without-headers. For Rust: rustup component add rust-src on nightly, build with --target x86_64-unknown-none. Test with QEMU (-s -S for GDB, -d int for interrupt logging, -serial stdio). Use Bochs for precise hardware emulation and page table inspection.