
Writing an OS From Scratch

From the first instruction the CPU executes after power-on, through virtual memory, interrupts, scheduling, and filesystem I/O — every layer of the hardware-software boundary, explained with real register layouts and working code.

$ qemu-system-x86_64 -drive format=raw,file=kernel.bin -s -S

From Power-On to Your Kernel's First Instruction

The x86 boot process is a chain of increasingly capable environments. The CPU starts in 16-bit real mode with 1 MiB of addressable memory, and your job is to climb through protected mode to 64-bit long mode — setting up the GDT, page tables, and control registers along the way.

BIOS Legacy Boot

When power is applied, the CPU executes at reset vector 0xFFFFFFF0 (mapped to BIOS ROM). The BIOS performs POST, initializes the memory controller and PCI bus, then loads sector 0 (512 bytes) to physical address 0x7C00, checks for the boot signature 0x55AA at bytes 510–511, and jumps there in 16-bit real mode with DL set to the boot drive number.

Minimal Boot Sector

NASM
[BITS 16]
[ORG 0x7C00]

start:
    jmp  0x0000:.flush_cs
.flush_cs:
    xor  ax, ax
    mov  ds, ax
    mov  ss, ax
    mov  sp, 0x7C00      ; Stack grows down from load address

    mov  si, msg
.print:
    lodsb
    test al, al
    jz   .halt
    mov  ah, 0x0E        ; BIOS TTY output
    int  0x10
    jmp  .print
.halt:
    cli
    hlt

msg: db "Hello from bootloader!", 0x0D, 0x0A, 0
times 510 - ($ - $$) db 0
dw 0xAA55             ; Boot signature

Real Mode → Protected Mode → Long Mode

The transition requires a precise sequence:

  • Enable the A20 line so addresses above 1 MiB are reachable.
  • Load a GDT with 32-bit code and data descriptors, set CR0.PE = 1, then far-jump to flush the prefetch queue — the CPU is now in protected mode.
  • Build identity-mapped page tables and load the PML4's physical address into CR3.
  • Set CR4.PAE = 1, then set EFER.LME (bit 8 of MSR 0xC0000080).
  • Set CR0.PG = 1 to activate paging, and far-jump through a 64-bit code descriptor — the CPU is now in long mode.

Modern Alternative: The Limine Bootloader

Limine eliminates all the assembly above. It boots your kernel directly into 64-bit long mode with paging active and higher-half mappings configured. The kernel uses a request/response protocol — place struct requests in a .limine_requests section, and the bootloader fills in the response:

C
#include <limine.h>

__attribute__((used, section(".limine_requests")))
static volatile struct limine_framebuffer_request fb_request = {
    .id = LIMINE_FRAMEBUFFER_REQUEST, .revision = 0
};

void _start(void) {
    // Halt if the bootloader did not answer the request
    if (fb_request.response == NULL
        || fb_request.response->framebuffer_count < 1)
        for (;;) asm("hlt");

    struct limine_framebuffer *fb = fb_request.response->framebuffers[0];
    uint32_t *pixels = fb->address;
    pixels[0] = 0xFFFFFF;  // White pixel at top-left
    for (;;) asm("hlt");
}
Recommendation
Use Limine for any new project. It saves thousands of lines of assembly bootstrap and hands you a 64-bit, paged, higher-half environment with a framebuffer and memory map ready to go. Write your own bootloader only if understanding the boot process is the goal.

4-Level Page Tables on x86-64

x86-64 uses a 4-level page table hierarchy. CR3 points to the PML4 (Page Map Level 4). Each table contains 512 entries of 8 bytes, occupying exactly one 4 KiB page. A 48-bit virtual address is decomposed into indices at each level plus a 12-bit page offset.
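The decomposition above can be sketched directly in C — each level's index is 9 bits wide (512 entries per table), with a 12-bit page offset at the bottom. The example address is the classic -2 GiB higher-half kernel base:

```c
#include <stdint.h>

// Extract the four table indices and page offset from a 48-bit
// virtual address. Each index selects one of 512 entries.
static inline uint64_t pml4_index(uint64_t va)  { return (va >> 39) & 0x1FF; }
static inline uint64_t pdpt_index(uint64_t va)  { return (va >> 30) & 0x1FF; }
static inline uint64_t pd_index(uint64_t va)    { return (va >> 21) & 0x1FF; }
static inline uint64_t pt_index(uint64_t va)    { return (va >> 12) & 0x1FF; }
static inline uint64_t page_offset(uint64_t va) { return va & 0xFFF; }

// For va = 0xFFFFFFFF80000000 (a common higher-half kernel base):
// PML4 index 511, PDPT index 510, PD index 0, PT index 0, offset 0.
```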

Page Table Entry Layout

Each 64-bit page table entry controls access permissions, caching behavior, and the physical frame address for one 4 KiB virtual page.
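A minimal sketch of composing such an entry — the flag bit positions (P, R/W, U/S, NX) and the 40-bit frame-address field in bits 12–51 follow the architectural layout; the macro names are illustrative, not a standard API:

```c
#include <stdint.h>

// Common page-table-entry flag bits.
#define PTE_PRESENT   (1ULL << 0)   // entry is valid
#define PTE_WRITABLE  (1ULL << 1)   // writes allowed
#define PTE_USER      (1ULL << 2)   // accessible from ring 3
#define PTE_NX        (1ULL << 63)  // no-execute (requires EFER.NXE)
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL  // physical frame, bits 12-51

// Combine a 4 KiB-aligned physical frame address with flags.
static uint64_t make_pte(uint64_t phys, uint64_t flags) {
    return (phys & PTE_ADDR_MASK) | flags;
}
```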

Higher-Half Kernel

During early boot, identity mapping (virtual = physical) is essential so the instruction right after paging activation can still be fetched. A higher-half kernel maps the kernel at a high address like 0xFFFFFFFF80000000 while user processes occupy the lower half. This gives user space a contiguous range and keeps the kernel mapped in every process's page tables — avoiding TLB flushes on syscalls.

Canonical addressing requires that bits 48–63 be sign-extensions of bit 47, creating a gap: lower half 0x0000000000000000–0x00007FFFFFFFFFFF (user space) and upper half 0xFFFF800000000000–0xFFFFFFFFFFFFFFFF (kernel space). Non-canonical addresses trigger a #GP fault.
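The canonical-form check reduces to sign-extending from bit 47 and comparing, a trick worth knowing when validating user pointers:

```c
#include <stdbool.h>
#include <stdint.h>

// An address is canonical when bits 63:48 replicate bit 47.
// Shift bit 47 up to the sign position, then arithmetic-shift back
// (two's-complement behavior, universal in practice).
static bool is_canonical(uint64_t va) {
    int64_t sext = ((int64_t)(va << 16)) >> 16;
    return (uint64_t)sext == va;
}
```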

TLB Management

The TLB caches virtual-to-physical translations. After modifying any page table entry, invalidate the stale TLB entry with INVLPG (single page) or reload CR3 (full flush). On SMP systems, modifying shared page tables requires a TLB shootdown: the modifying CPU sends IPIs to all other cores, each of which runs INVLPG in its IPI handler.

Gotcha: Page Fault Error Code
When #PF fires, CR2 holds the faulting virtual address and the error code has specific bits: bit 0 = protection violation (1) vs non-present (0), bit 1 = write fault, bit 2 = user-mode fault, bit 4 = instruction fetch / NX violation. These bits drive demand paging and copy-on-write logic.
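A page fault handler typically begins by unpacking those bits; a copy-on-write fault, for example, shows up as present + write + user (error code 0x7). The struct below is an illustrative sketch, not a fixed interface:

```c
#include <stdbool.h>
#include <stdint.h>

// Decoded #PF error code, matching the bit meanings above.
struct pf_info {
    bool present;  // bit 0: 1 = protection violation, 0 = page not present
    bool write;    // bit 1: 1 = write access, 0 = read
    bool user;     // bit 2: 1 = fault occurred in ring 3
    bool ifetch;   // bit 4: 1 = instruction fetch (NX violation)
};

static struct pf_info decode_pf(uint64_t err) {
    return (struct pf_info){
        .present = err & 1,  .write  = err & 2,
        .user    = err & 4,  .ifetch = err & 16,
    };
}
```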

Physical Frames, Heaps, and Allocators

Before you can allocate anything, you need the physical memory map. The BIOS provides this via INT 15h, EAX=E820h, returning entries with a 64-bit base, 64-bit length, and 32-bit type (1=usable, 2=reserved, 3=ACPI reclaimable). UEFI provides the equivalent via GetMemoryMap().

Allocator Designs

Bitmap Allocator

One bit per 4 KiB page frame: 0 = free, 1 = used. Costs 32 KiB per GiB of RAM. Allocation scans for a zero bit (O(n) worst case); freeing clears a bit (O(1)).

C
static uint8_t frame_bitmap[BITMAP_SIZE];

uint64_t alloc_frame(void) {
    for (uint64_t i = 0; i < BITMAP_SIZE; i++) {
        if (frame_bitmap[i] != 0xFF) {
            for (int bit = 0; bit < 8; bit++) {
                if (!(frame_bitmap[i] & (1 << bit))) {
                    frame_bitmap[i] |= (1 << bit);
                    return (i * 8 + bit) * 4096;
                }
            }
        }
    }
    return 0; // Out of memory
}

Buddy Allocator

Linux's primary page allocator. Memory is divided into blocks of 2^0 through 2^k pages. Allocation finds the smallest available block ≥ the requested size, splitting larger blocks in half. The buddy of the block at address A with order k is at A XOR (1 << (k + 12)). Both alloc and free are O(log n).
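The XOR relationship makes coalescing symmetric — a block and its buddy each compute the other's address, which is a one-liner (assuming 4 KiB base pages):

```c
#include <stdint.h>

// Buddy of the order-k block starting at physical address `addr`
// (block size 2^k pages of 4 KiB): flip bit (k + 12).
static uint64_t buddy_of(uint64_t addr, unsigned k) {
    return addr ^ (1ULL << (k + 12));
}
```

Applying `buddy_of` twice returns the original address, which is exactly what makes free-time coalescing cheap.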

Strengths

  • Fast O(log n) alloc and free
  • Low external fragmentation
  • Efficient coalescing via XOR buddy check
  • Good cache behavior for large allocations

Weaknesses

  • Internal fragmentation (power-of-2 rounding)
  • More complex to implement than bitmap
  • May waste memory on odd-sized requests
  • Splitting/coalescing overhead per operation

Slab Allocator

The slab allocator pre-allocates caches of fixed-size objects — achieving O(1) alloc and free for kernel structures like task_struct and inode. Generic caches (kmalloc-32, kmalloc-64, kmalloc-128, etc.) handle variable-size requests by rounding up to the nearest cache size. Linux uses SLUB (simplified unqueued slab allocator) as the default.
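The size-class rounding can be sketched as a power-of-two walk. The bounds here (32 to 4096 bytes) are illustrative, not Linux's exact cache set:

```c
#include <stddef.h>

// Round a request up to the nearest power-of-two cache size, in the
// spirit of the generic kmalloc-32 ... kmalloc-N caches.
static size_t kmalloc_cache_size(size_t n) {
    size_t sz = 32;
    while (sz < n && sz < 4096)
        sz <<= 1;
    return sz;  // requests above the largest cache fall back to the page allocator
}
```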

Architecture layering
Physical frame allocator (buddy) → Slab allocator (built on top of buddy pages) → kmalloc / kfree (slab API). Each layer builds on the one below.

IDT, PIC, APIC, and System Calls

The Interrupt Descriptor Table maps vectors 0–255 to handler functions. In 64-bit mode, each IDT entry is 16 bytes. The first 32 vectors are reserved for CPU exceptions (#DE, #DB, #NMI, #BP, #DF, #GP, #PF, etc.). Vectors 8, 10–14, and 17 push an error code; the rest do not.

C
typedef struct {
    uint16_t offset_low;       // Bits 0–15 of handler address
    uint16_t selector;         // Code segment selector (e.g., 0x08)
    uint8_t  ist;              // Bits 0–2: IST index (0 = legacy)
    uint8_t  type_attributes;  // Gate type + DPL + Present
    uint16_t offset_mid;       // Bits 16–31
    uint32_t offset_high;      // Bits 32–63
    uint32_t reserved;         // Must be zero
} __attribute__((packed)) idt_entry_t;
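Filling an entry means splitting the 64-bit handler address across the three offset fields. A sketch (the typedef is repeated from above so the snippet stands alone; 0x8E = present, DPL 0, 64-bit interrupt gate):

```c
#include <stdint.h>

typedef struct {  // idt_entry_t, as defined above
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;
    uint8_t  type_attributes;
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed)) idt_entry_t;

static idt_entry_t idt[256];

// Install a handler for vector `vec` with the given code selector
// and IST index (0 = use the normal kernel stack).
static void idt_set_gate(int vec, uint64_t handler, uint16_t sel, uint8_t ist) {
    idt[vec] = (idt_entry_t){
        .offset_low      = handler & 0xFFFF,
        .selector        = sel,
        .ist             = ist & 0x7,   // bits 3-7 must stay zero
        .type_attributes = 0x8E,        // present, DPL=0, interrupt gate
        .offset_mid      = (handler >> 16) & 0xFFFF,
        .offset_high     = (handler >> 32) & 0xFFFFFFFF,
        .reserved        = 0,
    };
}
```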

PIC Remapping

The legacy 8259 PIC maps IRQs 0–7 to interrupts 0x08–0x0F, colliding with CPU exceptions. You must remap them — the standard approach puts them at vectors 0x20–0x2F via the ICW1–ICW4 initialization sequence on ports 0x20/0x21 (master) and 0xA0/0xA1 (slave). After handling an IRQ, send EOI (0x20) to the command port.
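The ICW sequence can be sketched as follows. Here `outb` is a recording stub so the write sequence is visible and checkable; on real hardware it would wrap the `out` instruction:

```c
#include <stdint.h>

// Stand-in for port I/O: record (port, value) pairs for inspection.
static struct { uint16_t port; uint8_t val; } io_log[16];
static int io_n;
static void outb(uint16_t port, uint8_t val) {
    io_log[io_n].port = port;
    io_log[io_n].val  = val;
    io_n++;
}

// Remap the master PIC to vectors 0x20-0x27 and the slave to 0x28-0x2F.
static void pic_remap(void) {
    outb(0x20, 0x11); outb(0xA0, 0x11);  // ICW1: init, expect ICW4
    outb(0x21, 0x20); outb(0xA1, 0x28);  // ICW2: new vector offsets
    outb(0x21, 0x04); outb(0xA1, 0x02);  // ICW3: slave on IRQ2 / cascade id
    outb(0x21, 0x01); outb(0xA1, 0x01);  // ICW4: 8086 mode
}
```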

SYSCALL / SYSRET

The modern fast-path system call mechanism uses MSRs instead of the IDT:

MSR        Address      Purpose
IA32_EFER  0xC0000080   Bit 0 (SCE): enable SYSCALL
STAR       0xC0000081   Bits 47:32 = kernel CS; bits 63:48 = user CS base
LSTAR      0xC0000082   Kernel entry point (RIP loaded on SYSCALL)
SFMASK     0xC0000084   RFLAGS bits cleared on entry (typically 0x700)
Critical: RSP is NOT changed by SYSCALL
SYSCALL saves RIP → RCX and RFLAGS → R11, loads CS/SS from STAR, and jumps to LSTAR — but it does not swap the stack pointer. Your kernel entry code must use SWAPGS to reach per-CPU data and load a kernel stack pointer manually before touching the stack.
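Composing the STAR value follows directly from the table: the kernel CS selector goes in bits 47:32 (SS is implied as CS + 8) and the SYSRET base selector in bits 63:48. The selector values below are illustrative:

```c
#include <stdint.h>

// Pack STAR (MSR 0xC0000081) from two GDT selectors: the kernel CS
// (bits 47:32) and the SYSRET base from which user CS/SS are derived
// (bits 63:48).
static uint64_t make_star(uint16_t kernel_cs, uint16_t sysret_base) {
    return ((uint64_t)sysret_base << 48) | ((uint64_t)kernel_cs << 32);
}
```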

Context Switching and Ring Transitions

Task State Segment (64-bit)

The 64-bit TSS is no longer used for hardware task switching but remains essential for RSP0 (kernel stack loaded on ring 3→0 transitions) and the IST (dedicated stacks for critical exceptions like double faults). Each CPU core needs its own TSS.

Context Switch

Save callee-saved registers, swap stack pointers, restore. The full flow: timer IRQ → push interrupt frame → save GPRs → schedule() picks next task → swap RSP → update TSS.RSP0 → if different process, mov cr3, new_cr3 → restore GPRs → IRETQ.

NASM
; void context_switch(uint64_t *old_rsp, uint64_t new_rsp)
context_switch:
    push rbp
    push rbx
    push r12
    push r13
    push r14
    push r15
    pushfq

    mov  [rdi], rsp   ; Save current RSP
    mov  rsp, rsi     ; Load next task's RSP

    popfq
    pop  r15
    pop  r14
    pop  r13
    pop  r12
    pop  rbx
    pop  rbp
    ret              ; Returns into the new task

Entering User Mode (Ring 3)

Construct a fake interrupt frame on the stack and execute IRETQ:

NASM
enter_usermode:
    push 0x23        ; SS  — user data selector, RPL=3
    push user_rsp    ; RSP — user stack pointer
    push 0x202       ; RFLAGS — IF=1 (interrupts enabled)
    push 0x1B        ; CS  — user code selector, RPL=3
    push user_entry  ; RIP — user entry point
    iretq

Scheduling Algorithms

Round-Robin

Circular queue of runnable tasks. Each gets a fixed time quantum (10–100 ms). On timer interrupt, the current task moves to the back. Simple, fair for equal-priority tasks, but no priority differentiation.
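A minimal sketch of the rotation, with the run queue as a fixed array of task IDs (hypothetical contents) and a moving head index:

```c
#include <stddef.h>

// Circular run queue: advancing the head on each timer tick is
// equivalent to sending the current task to the back.
#define NTASKS 4
static int runqueue[NTASKS] = {1, 2, 3, 4};  // task IDs; current = runqueue[head]
static size_t head;

static int schedule_next(void) {
    head = (head + 1) % NTASKS;
    return runqueue[head];
}
```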

CFS

Linux's Completely Fair Scheduler tracks vruntime per task in a red-black tree. The task with the lowest vruntime runs next. Higher-priority tasks accumulate vruntime more slowly, getting proportionally more CPU time. No fixed quanta — slice lengths adapt dynamically.
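The weighting is the key idea: vruntime advances by actual runtime scaled by the nice-0 weight over the task's own weight, so a task with double the weight accrues vruntime at half speed. A sketch (ignoring the fixed-point arithmetic the kernel actually uses):

```c
#include <stdint.h>

#define NICE_0_LOAD 1024  // the kernel's weight for nice level 0

// Virtual runtime accrued for delta_exec_ns of real CPU time by a
// task with the given load weight.
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint32_t weight) {
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```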

Async/Await (Cooperative)

Rust's async/await provides cooperative multitasking within the kernel. Futures run until Poll::Pending at .await points — zero context-switch overhead, no per-task stacks. Ideal for I/O-bound kernel tasks (disk, network) while user processes still use preemptive scheduling via timer interrupts.

From Ramdisk to Disk Drivers

VFS Layer

A Virtual File System provides a uniform API (open, read, write, close) dispatching to different backends via function pointer tables. Core objects: superblock (per-FS metadata), inode (per-file metadata), dentry (path→inode), file (open handle with cursor).
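The dispatch mechanism can be sketched in a few lines — the VFS entry point calls through the table without knowing the backend. The `ramfs_read` backend below is a toy stand-in for illustration:

```c
#include <stddef.h>
#include <string.h>

struct file;
struct file_operations {
    long (*read)(struct file *f, char *buf, size_t n);
};
struct file {
    const struct file_operations *ops;
    const char *data;   // backing bytes for the toy ramfs
    size_t pos, size;   // cursor and file length
};

// Backend implementation: copy from the in-memory backing store.
static long ramfs_read(struct file *f, char *buf, size_t n) {
    size_t left = f->size - f->pos;
    if (n > left) n = left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

static const struct file_operations ramfs_fops = { .read = ramfs_read };

// VFS layer: dispatch through the function-pointer table only.
static long vfs_read(struct file *f, char *buf, size_t n) {
    return f->ops->read(f, buf, n);
}
```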

Initrd (Initial Ramdisk)

Solves the chicken-and-egg problem: the kernel needs drivers to read from disk, but those drivers might live on disk. The bootloader loads a ramdisk image into memory alongside the kernel. A tar-based initrd is simplest — parse 512-byte USTAR headers sequentially to find files by name.
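Walking a USTAR archive mostly comes down to parsing the ASCII-octal size field (12 bytes at offset 124 of each 512-byte header) to find the next header. A sketch of the octal parser:

```c
#include <stddef.h>
#include <stdint.h>

// Parse a USTAR ASCII-octal field; stops at the first non-octal
// character (fields are NUL- or space-terminated).
static uint64_t tar_octal(const char *s, size_t n) {
    uint64_t v = 0;
    for (size_t i = 0; i < n && s[i] >= '0' && s[i] <= '7'; i++)
        v = v * 8 + (uint64_t)(s[i] - '0');
    return v;
}
```

The next header then lies at the current offset plus 512 plus the file size rounded up to a 512-byte boundary.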

ATA PIO Disk Access

The primary ATA bus uses I/O ports 0x1F0–0x1F7. Key commands: IDENTIFY (0xEC) returns drive info, READ SECTORS (0x20) reads via 28-bit LBA.

C
void ata_read_sector(uint32_t lba, uint8_t *buffer) {
    outb(0x1F6, 0xE0 | ((lba >> 24) & 0x0F));
    outb(0x1F2, 0x01);              // 1 sector
    outb(0x1F3, lba & 0xFF);        // LBA low
    outb(0x1F4, (lba >> 8) & 0xFF); // LBA mid
    outb(0x1F5, (lba >> 16) & 0xFF);// LBA high
    outb(0x1F7, 0x20);              // READ SECTORS
    while (inb(0x1F7) & 0x80);      // Wait BSY clear
    while (!(inb(0x1F7) & 0x08));   // Wait DRQ set
    for (int i = 0; i < 256; i++)
        ((uint16_t*)buffer)[i] = inw(0x1F0);
}

Choices That Shape Everything

Monolithic vs Microkernel

Monolithic (Linux)

  • All services in kernel space — fast direct calls
  • No IPC overhead between subsystems
  • Simple to develop incrementally
  • Vast existing ecosystem and drivers

Microkernel (seL4, MINIX 3)

  • Superior fault isolation — driver crash ≠ kernel crash
  • Minimal trusted computing base
  • Modern L4 IPC: ~0.5–1 μs per message
  • Formal verification possible (seL4: ~10K lines C)

Rust vs C for Kernel Development

C offers direct memory manipulation and a 40+ year ecosystem. But ~70% of Linux kernel CVEs are memory-safety bugs that Rust prevents at compile time. In a kernel, unsafe blocks (for hardware access) represent roughly 5–15% of code, concentrated in the HAL. The rest gets ownership-enforced guarantees against use-after-free, double-free, buffer overflows, null derefs, and data races. Rust entered the Linux kernel mainline in v6.1 (2022).

Preemptive vs Cooperative

Cooperative (tasks yield voluntarily) is simpler but lets one bad task freeze everything — classic Mac OS and Windows 3.1 suffered this. Preemptive uses timer interrupts to forcibly switch, ensuring fairness. All modern general-purpose OSes are preemptive. Best approach: async/await (cooperative) for kernel I/O tasks, preemptive scheduling for user processes.

Dev Workflow

Build a cross-compiler targeting x86_64-elf: configure binutils + GCC with --target=x86_64-elf --without-headers. For Rust: rustup component add rust-src on nightly, build with --target x86_64-unknown-none. Test with QEMU (-s -S for GDB, -d int for interrupt logging, -serial stdio). Use Bochs for precise hardware emulation and page table inspection.