Lexical Analysis
Breaking source code into meaningful atoms — tokens — is the compiler's first act of interpretation.
A lexer (or scanner, or tokenizer) consumes a raw stream of characters and produces a stream of tokens — the smallest meaningful units in a language. Identifiers, keywords, operators, literals, and punctuation each become discrete objects with a type, a value, and a source location.
Most lexers are implemented as finite automata — either hand-written (a big switch statement over the current character) or generated from regular-expression grammars via tools like Flex or re2c. The generated DFA approach guarantees O(n) scanning with minimal branching, but hand-written lexers offer better error recovery and are easier to debug. GCC, Clang, Go, Rust, and V8 all use hand-written lexers.
Maximal Munch
Lexers use the longest match rule: given >>=, the lexer will produce one RIGHT_SHIFT_ASSIGN token rather than GT, GT, EQ. This is why C++ required a space in vector<vector<int> > before C++11 — the lexer would greedily match >> as a right-shift operator.
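To make the longest-match rule concrete, here is a minimal sketch in Python; the operator set and token spellings are illustrative, not taken from any real lexer:

```python
# Maximal munch: at each position, try the longest operator first.
# (Operator set is illustrative; a real lexer covers the full language.)
OPERATORS = {">", ">=", ">>", ">>=", "=", "=="}

def lex_operators(src: str) -> list[str]:
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        for length in (3, 2, 1):          # longest candidate first
            candidate = src[i:i + length]
            if candidate in OPERATORS:
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            raise SyntaxError(f"unexpected character {src[i]!r}")
    return tokens

print(lex_operators(">>="))    # ['>>='] (one token, not ['>', '>', '='])
print(lex_operators("> > ="))  # ['>', '>', '=']
```

The second call mirrors the pre-C++11 workaround: inserting spaces forces the lexer to stop matching early.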
Lookahead Budget
Most tokens need only 1–2 characters of lookahead to disambiguate. But some languages break this: Python's significant indentation requires tracking a stack of indent levels, making the lexer context-sensitive. C's type/identifier ambiguity (T * x; — declaration or multiplication?) leaks all the way into parsing.
Deep Dive — Position Tracking and Error Reporting
Production lexers track byte offset, line, and column for every token. This metadata flows downstream into the AST, IR, and eventually debug info (DWARF). Rust's Span type and TypeScript's source maps are direct descendants of this metadata. A common optimization: store a line-start offset table and compute line/column on demand rather than updating counters per character.
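The line-table idea can be sketched in a few lines of Python (1-based line and column numbering is my assumption):

```python
import bisect

def build_line_starts(src: str) -> list[int]:
    """Offsets at which each line begins; built once per file."""
    starts = [0]
    for i, ch in enumerate(src):
        if ch == "\n":
            starts.append(i + 1)
    return starts

def offset_to_line_col(starts: list[int], offset: int) -> tuple[int, int]:
    """Binary-search the table on demand instead of counting per character."""
    line = bisect.bisect_right(starts, offset) - 1
    return line + 1, offset - starts[line] + 1  # 1-based line/column

starts = build_line_starts("let x = 1\nlet y = 2\n")
print(offset_to_line_col(starts, 14))  # (2, 5): the 'y' on line 2
```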
Parsing & AST Construction
Imposing hierarchical structure on a flat token stream — discovering the tree hidden in the text.
Parsing transforms a linear token stream into a tree — usually an Abstract Syntax Tree (AST). This tree captures the syntactic structure of the program: which expressions are nested, what the operator precedence is, and how statements group together. Every subsequent compiler phase operates on this tree (or a derivative of it).
There are two fundamental families of parsing algorithms. Top-down parsers start from the root of the grammar and try to derive the input. Bottom-up parsers start from the input tokens and reduce them toward the start symbol. The choice has deep implications for what grammars you can express and how you handle ambiguity.
Recursive Descent
The workhorse of hand-written parsers. Each grammar rule becomes a function. The call stack mirrors the parse tree. Easy to write, easy to debug, good error messages. Used by GCC (since 2004), Clang, Rust, Go, Swift, V8. Limitation: can't directly handle left-recursive grammars without rewriting them.
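A sketch of the pattern for a toy arithmetic grammar (the grammar and class names are mine): each rule becomes a method, and the left-recursive rules are rewritten as iteration:

```python
# Toy grammar, with left recursion rewritten as loops:
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, src: str):
        self.tokens = re.findall(r"\d+|[-+*/()]", src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        cur = self.peek()
        if tok is not None and cur != tok:
            raise SyntaxError(f"expected {tok!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):  # one method per grammar rule
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return int(self.eat())

print(Parser("1 + 2 * 3").expr())  # ('+', 1, ('*', 2, 3))
```

Note how the call stack mirrors the parse tree: parsing `2 * 3` happens inside a `term` call nested under the `expr` call that handles the `+`.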
Pratt Parsing
A top-down operator-precedence parser — elegantly handles infix, prefix, postfix, and mixfix expressions. Each token has a binding power (numeric precedence) and optional nud (null denotation) and led (left denotation) handlers. Dramatically simplifies expression parsing. Used in V8, Rust (partially), and many interpreters.
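A minimal Pratt sketch for the same kind of arithmetic (binding powers and the tuple AST are illustrative): the literal and grouping branches play the nud role, and the operator loop plays the led role.

```python
import re

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

def parse(tokens, pos=0, min_bp=0):
    tok = tokens[pos]
    pos += 1
    if tok == "(":                       # nud: grouping
        left, pos = parse(tokens, pos, 0)
        pos += 1                         # skip ')' (assumes well-formed input)
    else:                                # nud: number literal
        left = int(tok)
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING_POWER.get(op, 0)
        if bp <= min_bp:
            break
        # led: parse the right operand at this operator's binding power
        right, pos = parse(tokens, pos + 1, bp)
        left = (op, left, right)
    return left, pos

tree, _ = parse(tokenize("1 + 2 * 3"))
print(tree)  # ('+', 1, ('*', 2, 3))
```

One loop replaces the one-method-per-precedence-level pattern of plain recursive descent, which is why Pratt parsing scales so well to languages with many operator levels.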
LR / LALR Parsers
Table-driven shift-reduce parsers generated by tools like Yacc and Bison (ANTLR, often grouped with these, actually generates top-down LL parsers). Can handle a wider class of grammars (including left recursion) but produce famously inscrutable error messages. The parse table itself can be enormous — LALR(1) reduces it at the cost of some expressive power. Used by older GCC, Ruby (LALR), and PostgreSQL's SQL parser.
PEG / Packrat
Parsing Expression Grammars use ordered choice (/) instead of ambiguous alternation (|). Packrat parsing memoizes intermediate results for guaranteed O(n) time at the cost of O(n) memory. Never ambiguous by construction, but the ordered choice can make some grammars behave unintuitively. Used by LPeg (a Lua library), Pest (Rust), and CPython's parser since Python 3.9.
Architectural insight: The trend is strongly toward hand-written recursive descent. Parser generators save initial effort but create pain in error recovery, incremental parsing, and tooling integration. Language server protocols (LSP) need parsers that can handle incomplete and broken code — something recursive descent handles naturally with strategic synchronization points.
Deep Dive — Concrete vs. Abstract Syntax Trees
A CST (Concrete Syntax Tree, or parse tree) preserves every token including parentheses, commas, and whitespace. An AST (Abstract Syntax Tree) discards syntactic sugar and retains only semantic structure. IDEs and formatters often need the CST; compilers work with the AST. Tree-sitter produces CSTs; most compilers produce ASTs. Rust-analyzer uses a lossless syntax tree (à la Roslyn/libcst) — a hybrid that preserves all trivia but provides AST-like traversal APIs.
Deep Dive — The Expression Problem in Parsers
How you represent your AST matters. Two patterns dominate: discriminated unions (Rust enum, OCaml variants) make adding new operations easy but new node types hard. Class hierarchies (OOP visitor pattern) make adding new node types easy but new operations hard. This is the classic expression problem. Rust's compiler uses enums; javac uses visitors. Modern approach: autogenerated AST types from a grammar spec (used by rust-analyzer's ungrammar).
Intermediate Representation
The lingua franca between the frontend and backend — where most optimization happens.
An Intermediate Representation is the compiler's internal language — neither the source language nor the target machine code. It's designed to be easy to analyze and transform. Many compilers use multiple IRs at different abstraction levels, progressively lowering from high-level constructs toward machine operations.
The key innovation is SSA form — Static Single Assignment. In SSA, every variable is assigned exactly once. When control flow merges, φ (phi) functions select the correct value. SSA makes def-use chains explicit, dramatically simplifying optimizations like constant propagation, dead code elimination, and global value numbering.
// Before SSA
x = 1
x = x + 1
y = x
// After SSA transformation
x₁ = 1
x₂ = x₁ + 1
y₁ = x₂
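The renaming step can be sketched for straight-line code; with no control-flow merges there is nothing for a φ to select, so plain renaming suffices (the statement format below is mine):

```python
# SSA renaming for straight-line code: every assignment gets a fresh
# version, and every use is rewritten to the operand's latest version.
def to_ssa(stmts):
    version = {}          # variable -> latest version number
    out = []
    for dest, expr in stmts:
        # Rewrite uses to the current version of each operand.
        expr = [f"{t}_{version[t]}" if t in version else t for t in expr]
        # Each assignment creates a fresh version of its destination.
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", expr))
    return out

program = [("x", ["1"]), ("x", ["x", "+", "1"]), ("y", ["x"])]
for dest, expr in to_ssa(program):
    print(dest, "=", " ".join(expr))
# x_1 = 1
# x_2 = x_1 + 1
# y_1 = x_2
```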
LLVM IR
The most influential compiler IR in production. A typed, SSA-based, low-level instruction set. Target-independent but close enough to hardware for efficient lowering. Features: explicit types, getelementptr for pointer arithmetic, intrinsics for target-specific operations. Used by Clang, Rust, Swift, Julia, Zig, and over 30 other language frontends.
define i32 @add(i32 %a, i32 %b) {
%result = add i32 %a, %b
ret i32 %result
}
Sea of Nodes
Used by V8 (TurboFan), HotSpot C2, and Graal. Instead of a linear instruction list within basic blocks, all values and effects float in a graph. Nodes have data edges (value dependencies) and effect/control edges. This representation makes optimizations like GVN, loop-invariant code motion, and scheduling fall out naturally from graph rewriting — but it's notoriously hard to debug.
Control Flow Graphs
A CFG is the backbone of most IR representations. Nodes are basic blocks — straight-line sequences of instructions with a single entry and single exit. Edges represent branches (conditional, unconditional, exceptions). Dominance trees computed from the CFG are essential for SSA construction: block A dominates block B if every path from the entry block to B passes through A, and a definition must dominate all of its uses. The dominance frontier tells you exactly where to insert φ functions.
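The textbook iterative fixed-point computation of dominator sets can be sketched on a diamond-shaped CFG (production compilers use faster algorithms such as Cooper-Harvey-Kennedy or Lengauer-Tarjan; the dict-based CFG format is mine):

```python
def dominators(cfg, entry):
    """cfg maps each block to its successors; assumes every non-entry
    block is reachable (has at least one predecessor)."""
    nodes = set(cfg)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    preds = {n: [p for p in nodes if n in cfg[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # A block's dominators: itself, plus everything that
            # dominates all of its predecessors.
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# Diamond: A -> B, A -> C, B -> D, C -> D
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
doms = dominators(cfg, "A")
print(sorted(doms["D"]))  # ['A', 'D']: neither branch dominates the merge
```

The merge block D is exactly where a φ would be needed: values defined in B or C reach D, but neither B nor C dominates it.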
| IR | Used By | Form | Notable Property |
|---|---|---|---|
| LLVM IR | Clang, Rust, Swift, Zig | SSA, typed | Target-independent; most widely reused IR |
| GIMPLE | GCC | SSA, 3-address | Tree-SSA; GIMPLE→RTL lowering boundary |
| MIR | Rust (rustc) | SSA, CFG | Borrow checker operates here |
| Bytecode (CIL) | .NET / C# | Stack-based | Metadata-rich, verifiable type safety |
| Sea of Nodes | V8 TurboFan, HotSpot C2 | Graph | Value, effect, and control edges |
| Cranelift IR (CLIF) | Wasmtime, rustc (experimental) | SSA, block params | Designed for fast compilation |
| MLIR | TensorFlow, CIRCT, Flang | Multi-level | Nested regions; progressive lowering by design |
Architecture tradeoff: Higher-level IRs preserve more source semantics (making language-specific optimizations possible) but are harder to share across languages. Lower-level IRs are more reusable but lose information. LLVM's sweet spot — typed, SSA, but low-level — is why it became the industry standard. MLIR's multi-level approach attempts to get both.
Bytecode & Virtual Machines
Portable instruction sets for abstract machines — the layer where many languages live and die.
Bytecode is a compact, portable instruction set designed for a virtual machine rather than real hardware. It sits between the high-level AST and native machine code. Languages like Java, Python, C#, Lua, Erlang, and WebAssembly compile to bytecode and then either interpret it or JIT-compile it to native code at runtime.
The fundamental design decision: stack-based or register-based? This choice cascades through instruction encoding, code density, interpreter performance, and JIT complexity.
Stack Machines
Operands are implicit — they live on a virtual stack. Instructions like PUSH, ADD, STORE manipulate the top of stack. Compact encoding (no register operands in the instruction), trivially simple code generation from an AST (just post-order traverse), and easy verification. Used by JVM, .NET CIL, WebAssembly, CPython, Forth.
// x = 42 + y (stack-based)
PUSH 42 // stack: [42]
LOAD y // stack: [42, y]
ADD // stack: [42+y]
STORE x // stack: []
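A toy interpreter for this bytecode makes the implicit-operand style concrete (opcode names follow the listing above; everything else is mine):

```python
# Minimal stack-machine interpreter: operands live on the stack,
# so ADD takes no explicit operands at all.
def run_stack_vm(program, env):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "LOAD":
            stack.append(env[args[0]])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "STORE":
            env[args[0]] = stack.pop()
    return env

env = run_stack_vm(
    [("PUSH", 42), ("LOAD", "y"), ("ADD",), ("STORE", "x")],
    {"y": 8},
)
print(env["x"])  # 50
```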
Register Machines
Operands are explicit register names. Instructions like ADD r0, r1, r2 specify source and destination. Fewer instructions (no push/pop overhead), maps more naturally to hardware. But larger instruction encoding and more complex code generation. Used by Lua 5.0+, Dalvik/ART (Android), LuaJIT, Dis (Limbo).
// x = 42 + y (register-based)
LOADK r0, 42 // r0 = 42
LOAD r1, y // r1 = y
ADD r2, r0, r1 // r2 = r0+r1
STORE x, r2 // x = r2
| Bytecode | Type | Typed? | Key Characteristic |
|---|---|---|---|
| JVM | Stack | Yes | Verifiable; rich metadata; 200+ opcodes |
| CPython | Stack | No | Dynamic; opcode set changes between versions |
| WebAssembly | Stack | Yes | Structured control flow (no goto); sandboxed |
| Lua 5.x | Register | No | Fixed 32-bit instruction word; 256 registers |
| Dalvik/ART | Register | Yes | Optimized for memory-constrained devices |
| CIL (.NET) | Stack | Yes | Generics in bytecode (not erased like JVM) |
| BEAM | Register | No | Process-oriented; hot code swapping support |
JIT Compilation: Where Bytecode Meets Native
A Just-In-Time compiler translates bytecode to native machine code at runtime, guided by profiling data. The canonical approach: start by interpreting, profile to find hot loops and functions, then compile those to optimized native code. If speculative optimizations prove wrong (a type guard fails), deoptimize — fall back to the interpreter with the correct state.
Gotcha — Deoptimization is expensive. When V8's TurboFan-compiled code hits an unexpected type, it must reconstruct the interpreter's stack frame from the optimized code's state. This "OSR (On-Stack Replacement) exit" is one of the most complex operations in a JIT. Getting the frame reconstruction wrong causes subtle, unreproducible crashes.
Deep Dive — Interpreter Dispatch Techniques
How an interpreter dispatches to the next opcode handler is a surprising performance bottleneck. From slowest to fastest:
Switch dispatch: A big switch(opcode). Simple but every dispatch goes through one branch point, defeating CPU branch prediction.
Direct threading: Each handler ends with goto *dispatch_table[*ip++]. One indirect branch per opcode. GCC's "labels as values" extension makes this possible. ~15–25% faster than switch.
Tail-call threading: Each handler is a function that tail-calls the next. Requires guaranteed tail call optimization. Used in some experimental interpreters.
Copy-and-patch: Pre-compile opcode handlers to machine code "stencils" and stitch them together at load time by patching in operands. Achieves near-JIT performance at interpreter-construction speed. Adopted by CPython's experimental JIT in 3.13.
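Python cannot express computed goto, but the table-dispatch shape that all of these techniques refine can still be sketched (opcode numbering and handler names are mine):

```python
# Dispatch table: one indirect lookup per opcode instead of a chain of
# comparisons. Threaded dispatch refines this by jumping handler-to-handler.
def op_push(vm, arg): vm.stack.append(arg)
def op_add(vm, _):    vm.stack.append(vm.stack.pop() + vm.stack.pop())
def op_print(vm, _):  print(vm.stack[-1])

DISPATCH = {0: op_push, 1: op_add, 2: op_print}  # opcode -> handler

class VM:
    def __init__(self):
        self.stack = []
    def run(self, code):
        for opcode, arg in code:
            DISPATCH[opcode](self, arg)   # single indirect dispatch point

VM().run([(0, 2), (0, 3), (1, None), (2, None)])  # prints 5
```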
Machine Code Generation
The final lowering: from abstract operations to the exact bytes a processor will execute.
Code generation ("codegen") translates the compiler's IR into the target machine's native instruction set. This phase must solve several interrelated problems simultaneously: instruction selection (which machine instructions implement each IR operation), register allocation (mapping virtual registers to physical ones), and instruction scheduling (ordering instructions to maximize pipeline throughput and minimize stalls).
Instruction Selection
Mapping IR operations to machine instructions. A single IR operation might map to one instruction, multiple instructions, or be folded into a complex addressing mode. Tree-pattern matching (BURG/iburg) or DAG-based selection (LLVM's SelectionDAG / GlobalISel) are the standard approaches. The challenge: modern ISAs have thousands of instructions with overlapping capabilities — x86-64 has ~1500 distinct instruction forms.
Register Allocation
The IR assumes infinite virtual registers; the hardware has a small fixed set (x86-64: 16 GPRs, ARM64: 31). Graph coloring allocators (Chaitin-Briggs) model this as coloring an interference graph — two live ranges that overlap can't share a register. When coloring fails, spill to the stack. Linear scan allocators (used in JITs) trade allocation quality for speed — one pass over the live ranges, O(n log n).
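A compact sketch of the linear-scan approach, after Poletto and Sarkar's formulation (the interval format is mine, and real allocators also handle interval splitting and spill-code insertion):

```python
# Linear scan: one pass over live intervals (name, start, end), sorted by
# start. When no register is free, spill the interval that ends furthest away.
def linear_scan(intervals, num_regs):
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    active = []            # (end, name) for intervals currently in registers
    free = list(range(num_regs))
    location = {}          # name -> register number or "spill"
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(location[n])
        if free:
            location[name] = free.pop()
            active.append((end, name))
            active.sort()
        else:
            # Spill whichever live interval ends last.
            last_end, last_name = active[-1]
            if last_end > end:
                location[name] = location[last_name]
                location[last_name] = "spill"
                active[-1] = (end, name)
                active.sort()
            else:
                location[name] = "spill"
    return location

ivs = [("a", 0, 8), ("b", 1, 3), ("c", 2, 9), ("d", 4, 6)]
print(linear_scan(ivs, 2))  # {'a': 1, 'b': 0, 'c': 'spill', 'd': 0}
```

Note that `b` and `d` share register 0 because their live ranges do not overlap, exactly the condition a graph-coloring allocator encodes as interference edges.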
Instruction Scheduling
Reorder instructions to hide latencies and fill pipeline slots. A multiply on modern x86 has 3-cycle latency — the scheduler inserts independent instructions in those cycles. List scheduling builds a DAG of dependencies and greedily picks ready instructions. This interacts with register allocation: more reordering increases register pressure. LLVM schedules both before and after regalloc.
Peephole Optimization
A final pass over the generated assembly, replacing instruction patterns with cheaper equivalents. Example: mov rax, 0 → xor eax, eax (shorter encoding, breaks dependency chain). imul rax, rax, 2 → shl rax, 1. Modern peephole optimizers are often auto-generated from declarative pattern-match rules.
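A sketch of the declarative-rule idea using the two rewrites above (the instruction-tuple and rule formats are mine; real peepholers match windows of several instructions, not just one):

```python
# Peephole rules as (match predicate, rewrite) pairs over instruction tuples.
RULES = [
    (lambda ins: ins == ("mov", "rax", 0),
     lambda ins: ("xor", "eax", "eax")),          # shorter; breaks dep chain
    (lambda ins: ins[0] == "imul" and ins[3] == 2,
     lambda ins: ("shl", ins[1], 1)),             # multiply by 2 -> shift
]

def peephole(code):
    out = []
    for ins in code:
        for matches, rewrite in RULES:
            if matches(ins):
                ins = rewrite(ins)
                break
        out.append(ins)
    return out

print(peephole([("mov", "rax", 0), ("imul", "rax", "rax", 2)]))
# [('xor', 'eax', 'eax'), ('shl', 'rax', 1)]
```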
Calling Conventions & ABIs
The ABI (Application Binary Interface) dictates how functions communicate: which registers pass arguments, who saves which registers (caller-saved vs callee-saved), how the stack is laid out, and how values are returned. Getting this wrong causes bugs that only manifest across compilation-unit boundaries — the worst kind. x86-64 has two major calling conventions: System V AMD64 (Linux, macOS — args in rdi, rsi, rdx, rcx, r8, r9) and Microsoft x64 (Windows — args in rcx, rdx, r8, r9).
| ISA | GPRs | Word Size | Design Philosophy |
|---|---|---|---|
| x86-64 | 16 | 64-bit | CISC; variable-length encoding; backward-compatible to 1978 |
| AArch64 | 31 | 64-bit | RISC; fixed 32-bit encoding; clean break from ARM32 |
| RISC-V | 32 | 32/64-bit | Open ISA; modular extensions; minimal base set |
| Wasm | ∞ (stack) | 32/64-bit | Portable target; structured control flow; sandboxed |
Architecture note: LLVM uses a two-phase approach to codegen: first SelectionDAG (or GlobalISel) lowers IR to MachineInstr — target-specific pseudo-instructions with virtual registers. Then register allocation, scheduling, and frame layout produce the final MCInst stream, which is either emitted as assembly text or encoded directly to machine bytes.
The Broader Landscape
Where these pieces come together — and where the field is heading.
MLIR & Multi-Level IRs
Developed at Google and now part of the LLVM project, MLIR provides a framework for defining custom IR dialects at any abstraction level — from TensorFlow operations to GPU kernel launches to LLVM IR. Dialects can be mixed in the same module and progressively lowered. This is reshaping compiler design for domain-specific languages, especially in ML hardware targeting.
WebAssembly Beyond the Web
Wasm was designed as a browser compilation target, but its sandboxed execution model, fast startup, and near-native speed make it compelling for edge computing, plugin systems, and serverless. WASI (WebAssembly System Interface) provides POSIX-like capabilities. The component model adds cross-language interop via interface types — a universal ABI for the future.
Compiler as a Service
Modern compilers aren't just batch translators — they're interactive services. The Language Server Protocol (LSP) requires the compiler to provide completions, diagnostics, refactoring, and go-to-definition on every keystroke. This demands incremental compilation (don't recompute what hasn't changed), lazy evaluation (don't analyze files the user isn't looking at), and error recovery (produce useful output from broken code).
Compile-Time Execution
The boundary between compile-time and runtime is dissolving. Zig's comptime runs arbitrary code during compilation. Rust's const fn evaluates functions at compile time. C++'s constexpr/consteval has become Turing-complete. This requires the compiler to include an interpreter for its own IR — Rust's const evaluator (built on Miri) and Clang's constant expression evaluator serve this role.
Optimizations: The Greatest Hits
Most compiler optimizations are variations on a few fundamental ideas, applied at different IR levels.
| Optimization | What It Does | Enabled By |
|---|---|---|
| Constant Folding | Evaluate constant expressions at compile time | SSA (unique defs → easy to check if inputs are constant) |
| Dead Code Elimination | Remove code whose results are never used | SSA (if a def has no uses, it's dead) |
| Inlining | Replace function call with the function body | Call graph analysis; enables further optimizations |
| Loop Unrolling | Duplicate loop body to reduce branch overhead | Loop analysis; trip count estimation |
| LICM | Move loop-invariant computations outside the loop | Dominance analysis + alias analysis |
| GVN | Eliminate redundant computations with identical values | SSA + value numbering (hash-based) |
| Escape Analysis | Stack-allocate objects that don't escape the function | Points-to analysis; critical for GC'd languages |
| Vectorization | Convert scalar loops to SIMD operations | Loop dependence analysis; target SIMD width |
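Two rows of the table, constant folding and dead code elimination, can be sketched together over SSA-style triples (the instruction format is mine) to show how SSA's unique-definition property makes both checks nearly trivial:

```python
# Constant folding + DCE over SSA triples (dest, op, args). Because each
# name has exactly one definition, "is this operand constant?" and
# "does this def have any uses?" are simple dictionary/set lookups.
def fold_and_dce(instrs, live_out):
    consts, folded = {}, []
    for dest, op, args in instrs:
        vals = [consts.get(a, a) for a in args]   # propagate known constants
        if op == "const":
            consts[dest] = vals[0]
        elif op == "add" and all(isinstance(v, int) for v in vals):
            consts[dest] = sum(vals)              # fold to a constant
            op, vals = "const", [sum(vals)]
        folded.append((dest, op, vals))
    # DCE, walking backwards: keep a def only if its result is used.
    used, kept = set(live_out), []
    for dest, op, args in reversed(folded):
        if dest in used:
            kept.append((dest, op, args))
            used.update(a for a in args if isinstance(a, str))
    return kept[::-1]

prog = [
    ("a", "const", [2]),
    ("b", "const", [3]),
    ("c", "add", ["a", "b"]),   # folds to const 5
    ("d", "add", ["c", "x"]),   # x unknown: stays an add, with 5 substituted
    ("e", "const", [99]),       # never used: eliminated
]
for ins in fold_and_dce(prog, live_out=["d"]):
    print(ins)
# ('d', 'add', [5, 'x'])
```

Everything but the one live instruction disappears: folding turns `c` into a constant, substitution makes `a`, `b`, and `c` useless, and DCE deletes them along with the never-used `e`.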
Closing thought: A compiler is one of the most beautiful artifacts in computer science — a program that reads programs and writes programs. Every layer involves fundamental CS: automata theory (lexing), formal grammars (parsing), graph algorithms (IR optimization), constraint satisfaction (register allocation), and hardware architecture (codegen). Understanding compilers doesn't just make you better at writing them — it makes you a better programmer in every domain.