Lexical Analysis
Breaking source code into meaningful atoms — tokens — is the compiler's first act of interpretation.
A lexer (or scanner, or tokenizer) consumes a raw stream of characters and produces a stream of tokens — the smallest meaningful units in a language. Identifiers, keywords, operators, literals, and punctuation each become discrete objects with a type, a value, and a source location.
Most lexers are implemented as finite automata — either hand-written (a big switch statement over the current character) or generated from regular-expression grammars via tools like Flex or re2c. The generated DFA approach guarantees O(n) scanning with minimal branching, but hand-written lexers offer better error recovery and are easier to debug. GCC, Clang, Go, Rust, and V8 all use hand-written lexers.
Maximal Munch
Lexers use the longest match rule: given >>=, the lexer will produce one RIGHT_SHIFT_ASSIGN token rather than GT, GT, EQ. This is why C++ required a space in vector<vector<int> > before C++11 — the lexer would greedily match >> as a right-shift operator.
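To make the longest-match rule concrete, here is a minimal sketch in Python; the operator set and token spellings are illustrative, not taken from any real lexer:

```python
# Maximal munch: at each position, try the longest operator first.
# (Operator set is illustrative; a real lexer covers the full language.)
OPERATORS = {">", ">=", ">>", ">>=", "=", "=="}

def lex_operators(src: str) -> list[str]:
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        for length in (3, 2, 1):          # longest candidate first
            candidate = src[i:i + length]
            if candidate in OPERATORS:
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            raise SyntaxError(f"unexpected character {src[i]!r}")
    return tokens

print(lex_operators(">>="))    # ['>>='] (one token, not ['>', '>', '='])
print(lex_operators("> > ="))  # ['>', '>', '=']
```

The second call mirrors the pre-C++11 workaround: inserting spaces forces the lexer to stop matching early.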
Lookahead Budget
Most tokens need only 1–2 characters of lookahead to disambiguate. But some languages break this: Python's significant indentation requires tracking a stack of indent levels, making the lexer context-sensitive. C's type/identifier ambiguity (T * x; — declaration or multiplication?) leaks all the way into parsing.
Deep Dive — Position Tracking and Error Reporting
Production lexers track byte offset, line, and column for every token. This metadata flows downstream into the AST, IR, and eventually debug info (DWARF). Rust's Span type and TypeScript's source maps are direct descendants of this metadata. A common optimization: store a line-start offset table and compute line/column on demand rather than updating counters per character.
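The line-table idea can be sketched in a few lines of Python (1-based line and column numbering is my assumption):

```python
import bisect

def build_line_starts(src: str) -> list[int]:
    """Offsets at which each line begins; built once per file."""
    starts = [0]
    for i, ch in enumerate(src):
        if ch == "\n":
            starts.append(i + 1)
    return starts

def offset_to_line_col(starts: list[int], offset: int) -> tuple[int, int]:
    """Binary-search the table on demand instead of counting per character."""
    line = bisect.bisect_right(starts, offset) - 1
    return line + 1, offset - starts[line] + 1  # 1-based line/column

starts = build_line_starts("let x = 1\nlet y = 2\n")
print(offset_to_line_col(starts, 14))  # (2, 5): the 'y' on line 2
```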
Parsing & AST Construction
Imposing hierarchical structure on a flat token stream — discovering the tree hidden in the text.
Parsing transforms a linear token stream into a tree — usually an Abstract Syntax Tree (AST). This tree captures the syntactic structure of the program: which expressions are nested, what the operator precedence is, and how statements group together. Every subsequent compiler phase operates on this tree (or a derivative of it).
There are two fundamental families of parsing algorithms. Top-down parsers start from the root of the grammar and try to derive the input. Bottom-up parsers start from the input tokens and reduce them toward the start symbol. The choice has deep implications for what grammars you can express and how you handle ambiguity.
Recursive Descent
The workhorse of hand-written parsers. Each grammar rule becomes a function. The call stack mirrors the parse tree. Easy to write, easy to debug, good error messages. Used by GCC (since 2004), Clang, Rust, Go, Swift, V8. Limitation: can't directly handle left-recursive grammars without rewriting them.
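A sketch of the pattern for a toy arithmetic grammar (the grammar and class names are mine): each rule becomes a method, and the left-recursive rules are rewritten as iteration:

```python
# Toy grammar, with left recursion rewritten as loops:
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, src: str):
        self.tokens = re.findall(r"\d+|[-+*/()]", src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        cur = self.peek()
        if tok is not None and cur != tok:
            raise SyntaxError(f"expected {tok!r}, got {cur!r}")
        self.pos += 1
        return cur

    def expr(self):  # one method per grammar rule
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return int(self.eat())

print(Parser("1 + 2 * 3").expr())  # ('+', 1, ('*', 2, 3))
```

Note how the call stack mirrors the parse tree: parsing `2 * 3` happens inside a `term` call nested under the `expr` call that handles the `+`.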
Pratt Parsing
A top-down operator-precedence parser — elegantly handles infix, prefix, postfix, and mixfix expressions. Each token has a binding power (numeric precedence) and optional nud (null denotation) and led (left denotation) handlers. Dramatically simplifies expression parsing. Used in V8, Rust (partially), and many interpreters.
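A minimal Pratt sketch for the same kind of arithmetic (binding powers and the tuple AST are illustrative): the literal and grouping branches play the nud role, and the operator loop plays the led role.

```python
import re

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

def parse(tokens, pos=0, min_bp=0):
    tok = tokens[pos]
    pos += 1
    if tok == "(":                       # nud: grouping
        left, pos = parse(tokens, pos, 0)
        pos += 1                         # skip ')' (assumes well-formed input)
    else:                                # nud: number literal
        left = int(tok)
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING_POWER.get(op, 0)
        if bp <= min_bp:
            break
        # led: parse the right operand at this operator's binding power
        right, pos = parse(tokens, pos + 1, bp)
        left = (op, left, right)
    return left, pos

tree, _ = parse(tokenize("1 + 2 * 3"))
print(tree)  # ('+', 1, ('*', 2, 3))
```

One loop replaces the one-method-per-precedence-level pattern of plain recursive descent, which is why Pratt parsing scales so well to languages with many operator levels.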
LR / LALR Parsers
Table-driven shift-reduce parsers generated by tools like Yacc and Bison (ANTLR, often grouped with these, actually generates top-down LL parsers). Can handle a wider class of grammars (including left recursion) but produce famously inscrutable error messages. The parse table itself can be enormous — LALR(1) reduces it at the cost of some expressive power. Used by older GCC, Ruby (LALR), and PostgreSQL's SQL parser.
PEG / Packrat
Parsing Expression Grammars use ordered choice (/) instead of ambiguous alternation (|). Packrat parsing memoizes intermediate results for guaranteed O(n) time at the cost of O(n) memory. Never ambiguous by construction, but the ordered choice can make some grammars behave unintuitively. Used by LPeg (a Lua library), Pest (Rust), and CPython's parser since Python 3.9.
Architectural insight: The trend is strongly toward hand-written recursive descent. Parser generators save initial effort but create pain in error recovery, incremental parsing, and tooling integration. Language server protocols (LSP) need parsers that can handle incomplete and broken code — something recursive descent handles naturally with strategic synchronization points.
Deep Dive — Concrete vs. Abstract Syntax Trees
A CST (Concrete Syntax Tree, or parse tree) preserves every token including parentheses, commas, and whitespace. An AST (Abstract Syntax Tree) discards syntactic sugar and retains only semantic structure. IDEs and formatters often need the CST; compilers work with the AST. Tree-sitter produces CSTs; most compilers produce ASTs. Rust-analyzer uses a lossless syntax tree (à la Roslyn/libcst) — a hybrid that preserves all trivia but provides AST-like traversal APIs.
Deep Dive — The Expression Problem in Parsers
How you represent your AST matters. Two patterns dominate: discriminated unions (Rust enum, OCaml variants) make adding new operations easy but new node types hard. Class hierarchies (OOP visitor pattern) make adding new node types easy but new operations hard. This is the classic expression problem. Rust's compiler uses enums; javac uses visitors. Modern approach: autogenerated AST types from a grammar spec (used by rust-analyzer's ungrammar).
Intermediate Representation
The lingua franca between the frontend and backend — where most optimization happens.
An Intermediate Representation is the compiler's internal language — neither the source language nor the target machine code. It's designed to be easy to analyze and transform. Many compilers use multiple IRs at different abstraction levels, progressively lowering from high-level constructs toward machine operations.
The key innovation is SSA form — Static Single Assignment. In SSA, every variable is assigned exactly once. When control flow merges, φ (phi) functions select the correct value. SSA makes def-use chains explicit, dramatically simplifying optimizations like constant propagation, dead code elimination, and global value numbering.
// Before SSA
x = 1
x = x + 1
y = x
// After SSA transformation
x₁ = 1
x₂ = x₁ + 1
y₁ = x₂
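The renaming step can be sketched for straight-line code; with no control-flow merges there is nothing for a φ to select, so plain renaming suffices (the statement format below is mine):

```python
# SSA renaming for straight-line code: every assignment gets a fresh
# version, and every use is rewritten to the operand's latest version.
def to_ssa(stmts):
    version = {}          # variable -> latest version number
    out = []
    for dest, expr in stmts:
        # Rewrite uses to the current version of each operand.
        expr = [f"{t}_{version[t]}" if t in version else t for t in expr]
        # Each assignment creates a fresh version of its destination.
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", expr))
    return out

program = [("x", ["1"]), ("x", ["x", "+", "1"]), ("y", ["x"])]
for dest, expr in to_ssa(program):
    print(dest, "=", " ".join(expr))
# x_1 = 1
# x_2 = x_1 + 1
# y_1 = x_2
```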
LLVM IR
The most influential compiler IR in production. A typed, SSA-based, low-level instruction set. Target-independent but close enough to hardware for efficient lowering. Features: explicit types, getelementptr for pointer arithmetic, intrinsics for target-specific operations. Used by Clang, Rust, Swift, Julia, Zig, and over 30 other language frontends.
define i32 @add(i32 %a, i32 %b) {
%result = add i32 %a, %b
ret i32 %result
}
Sea of Nodes
Used by V8 (TurboFan), HotSpot C2, and Graal. Instead of a linear instruction list within basic blocks, all values and effects float in a graph. Nodes have data edges (value dependencies) and effect/control edges. This representation makes optimizations like GVN, loop-invariant code motion, and scheduling fall out naturally from graph rewriting — but it's notoriously hard to debug.
Control Flow Graphs
A CFG is the backbone of most IR representations. Nodes are basic blocks — straight-line sequences of instructions with a single entry and single exit. Edges represent branches (conditional, unconditional, exceptions). Dominance trees computed from the CFG are essential for SSA construction: block A dominates block B if every path from the entry block to B passes through A, and a definition must dominate all of its uses. The dominance frontier tells you exactly where to insert φ functions.
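The textbook iterative fixed-point computation of dominator sets can be sketched on a diamond-shaped CFG (production compilers use faster algorithms such as Cooper-Harvey-Kennedy or Lengauer-Tarjan; the dict-based CFG format is mine):

```python
def dominators(cfg, entry):
    """cfg maps each block to its successors; assumes every non-entry
    block is reachable (has at least one predecessor)."""
    nodes = set(cfg)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    preds = {n: [p for p in nodes if n in cfg[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # A block's dominators: itself, plus everything that
            # dominates all of its predecessors.
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# Diamond: A -> B, A -> C, B -> D, C -> D
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
doms = dominators(cfg, "A")
print(sorted(doms["D"]))  # ['A', 'D']: neither branch dominates the merge
```

The merge block D is exactly where a φ would be needed: values defined in B or C reach D, but neither B nor C dominates it.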
| IR | Used By | Form | Notable Property |
|---|---|---|---|
| LLVM IR | Clang, Rust, Swift, Zig | SSA, typed | Target-independent; most widely reused IR |
| GIMPLE | GCC | SSA, 3-address | Tree-SSA; GIMPLE→RTL lowering boundary |
| MIR | Rust (rustc) | SSA, CFG | Borrow checker operates here |
| Bytecode (CIL) | .NET / C# | Stack-based | Metadata-rich, verifiable type safety |
| Sea of Nodes | V8 TurboFan, HotSpot C2 | Graph | Value, effect, and control edges |
| Cranelift IR (CLIF) | Wasmtime, rustc (experimental) | SSA, block params | Designed for fast compilation |
| MLIR | TensorFlow, CIRCT, Flang | Multi-level | Nested regions; progressive lowering by design |
Architecture tradeoff: Higher-level IRs preserve more source semantics (making language-specific optimizations possible) but are harder to share across languages. Lower-level IRs are more reusable but lose information. LLVM's sweet spot — typed, SSA, but low-level — is why it became the industry standard. MLIR's multi-level approach attempts to get both.
Bytecode & Virtual Machines
Portable instruction sets for abstract machines — the layer where many languages live and die.
Bytecode is a compact, portable instruction set designed for a virtual machine rather than real hardware. It sits between the high-level AST and native machine code. Languages like Java, Python, C#, Lua, Erlang, and WebAssembly compile to bytecode and then either interpret it or JIT-compile it to native code at runtime.
The fundamental design decision: stack-based or register-based? This choice cascades through instruction encoding, code density, interpreter performance, and JIT complexity.
Stack Machines
Operands are implicit — they live on a virtual stack. Instructions like PUSH, ADD, STORE manipulate the top of stack. Compact encoding (no register operands in the instruction), trivially simple code generation from an AST (just post-order traverse), and easy verification. Used by JVM, .NET CIL, WebAssembly, CPython, Forth.
// x = 42 + y (stack-based)
PUSH 42 // stack: [42]
LOAD y // stack: [42, y]
ADD // stack: [42+y]
STORE x // stack: []
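A toy interpreter for this bytecode makes the implicit-operand style concrete (opcode names follow the listing above; everything else is mine):

```python
# Minimal stack-machine interpreter: operands live on the stack,
# so ADD takes no explicit operands at all.
def run_stack_vm(program, env):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "LOAD":
            stack.append(env[args[0]])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "STORE":
            env[args[0]] = stack.pop()
    return env

env = run_stack_vm(
    [("PUSH", 42), ("LOAD", "y"), ("ADD",), ("STORE", "x")],
    {"y": 8},
)
print(env["x"])  # 50
```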
Register Machines
Operands are explicit register names. Instructions like ADD r0, r1, r2 specify source and destination. Fewer instructions (no push/pop overhead), maps more naturally to hardware. But larger instruction encoding and more complex code generation. Used by Lua 5.0+, Dalvik/ART (Android), LuaJIT, Dis (Limbo).
// x = 42 + y (register-based)
LOADK r0, 42 // r0 = 42
LOAD r1, y // r1 = y
ADD r2, r0, r1 // r2 = r0+r1
STORE x, r2 // x = r2
| Bytecode | Type | Typed? | Key Characteristic |
|---|---|---|---|
| JVM | Stack | Yes | Verifiable; rich metadata; 200+ opcodes |
| CPython | Stack | No | Dynamic; opcode set changes between versions |
| WebAssembly | Stack | Yes | Structured control flow (no goto); sandboxed |
| Lua 5.x | Register | No | Fixed 32-bit instruction word; 256 registers |
| Dalvik/ART | Register | Yes | Optimized for memory-constrained devices |
| CIL (.NET) | Stack | Yes | Generics in bytecode (not erased like JVM) |
| BEAM | Register | No | Process-oriented; hot code swapping support |
JIT Compilation: Where Bytecode Meets Native
A Just-In-Time compiler translates bytecode to native machine code at runtime, guided by profiling data. The canonical approach: start by interpreting, profile to find hot loops and functions, then compile those to optimized native code. If speculative optimizations prove wrong (a type guard fails), deoptimize — fall back to the interpreter with the correct state.
Gotcha — Deoptimization is expensive. When V8's TurboFan-compiled code hits an unexpected type, it must reconstruct the interpreter's stack frame from the optimized code's state. This "OSR (On-Stack Replacement) exit" is one of the most complex operations in a JIT. Getting the frame reconstruction wrong causes subtle, unreproducible crashes.
Deep Dive — Interpreter Dispatch Techniques
How an interpreter dispatches to the next opcode handler is a surprising performance bottleneck. From slowest to fastest:
Switch dispatch: A big switch(opcode). Simple but every dispatch goes through one branch point, defeating CPU branch prediction.
Direct threading: Each handler ends with goto *dispatch_table[*ip++]. One indirect branch per opcode. GCC's "labels as values" extension makes this possible. ~15–25% faster than switch.
Tail-call threading: Each handler is a function that tail-calls the next. Requires guaranteed tail call optimization. Used in some experimental interpreters.
Copy-and-patch: Pre-compile opcode handlers to machine code "stencils" and stitch them together at load time by patching in operands. Achieves near-JIT performance at interpreter-construction speed. Adopted by CPython's experimental JIT in 3.13.
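Python cannot express computed goto, but the table-dispatch shape that all of these techniques refine can still be sketched (opcode numbering and handler names are mine):

```python
# Dispatch table: one indirect lookup per opcode instead of a chain of
# comparisons. Threaded dispatch refines this by jumping handler-to-handler.
def op_push(vm, arg): vm.stack.append(arg)
def op_add(vm, _):    vm.stack.append(vm.stack.pop() + vm.stack.pop())
def op_print(vm, _):  print(vm.stack[-1])

DISPATCH = {0: op_push, 1: op_add, 2: op_print}  # opcode -> handler

class VM:
    def __init__(self):
        self.stack = []
    def run(self, code):
        for opcode, arg in code:
            DISPATCH[opcode](self, arg)   # single indirect dispatch point

VM().run([(0, 2), (0, 3), (1, None), (2, None)])  # prints 5
```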
Machine Code Generation
The final lowering: from abstract operations to the exact bytes a processor will execute.
Code generation ("codegen") translates the compiler's IR into the target machine's native instruction set. This phase must solve several interrelated problems simultaneously: instruction selection (which machine instructions implement each IR operation), register allocation (mapping virtual registers to physical ones), and instruction scheduling (ordering instructions to maximize pipeline throughput and minimize stalls).
Instruction Selection
Mapping IR operations to machine instructions. A single IR operation might map to one instruction, multiple instructions, or be folded into a complex addressing mode. Tree-pattern matching (BURG/iburg) or DAG-based selection (LLVM's SelectionDAG / GlobalISel) are the standard approaches. The challenge: modern ISAs have thousands of instructions with overlapping capabilities — x86-64 has ~1500 distinct instruction forms.
Register Allocation
The IR assumes infinite virtual registers; the hardware has a small fixed set (x86-64: 16 GPRs, ARM64: 31). Graph coloring allocators (Chaitin-Briggs) model this as coloring an interference graph — two live ranges that overlap can't share a register. When coloring fails, spill to the stack. Linear scan allocators (used in JITs) trade allocation quality for speed — one pass over the live ranges, O(n log n).
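A compact sketch of the linear-scan approach, after Poletto and Sarkar's formulation (the interval format is mine, and real allocators also handle interval splitting and spill-code insertion):

```python
# Linear scan: one pass over live intervals (name, start, end), sorted by
# start. When no register is free, spill the interval that ends furthest away.
def linear_scan(intervals, num_regs):
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    active = []            # (end, name) for intervals currently in registers
    free = list(range(num_regs))
    location = {}          # name -> register number or "spill"
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(location[n])
        if free:
            location[name] = free.pop()
            active.append((end, name))
            active.sort()
        else:
            # Spill whichever live interval ends last.
            last_end, last_name = active[-1]
            if last_end > end:
                location[name] = location[last_name]
                location[last_name] = "spill"
                active[-1] = (end, name)
                active.sort()
            else:
                location[name] = "spill"
    return location

ivs = [("a", 0, 8), ("b", 1, 3), ("c", 2, 9), ("d", 4, 6)]
print(linear_scan(ivs, 2))  # {'a': 1, 'b': 0, 'c': 'spill', 'd': 0}
```

Note that `b` and `d` share register 0 because their live ranges do not overlap, exactly the condition a graph-coloring allocator encodes as interference edges.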
Instruction Scheduling
Reorder instructions to hide latencies and fill pipeline slots. A multiply on modern x86 has 3-cycle latency — the scheduler inserts independent instructions in those cycles. List scheduling builds a DAG of dependencies and greedily picks ready instructions. This interacts with register allocation: more reordering increases register pressure. LLVM schedules both before and after regalloc.
Peephole Optimization
A final pass over the generated assembly, replacing instruction patterns with cheaper equivalents. Example: mov rax, 0 → xor eax, eax (shorter encoding, breaks dependency chain). imul rax, rax, 2 → shl rax, 1. Modern peephole optimizers are often auto-generated from declarative pattern-match rules.
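A sketch of the declarative-rule idea using the two rewrites above (the instruction-tuple and rule formats are mine; real peepholers match windows of several instructions, not just one):

```python
# Peephole rules as (match predicate, rewrite) pairs over instruction tuples.
RULES = [
    (lambda ins: ins == ("mov", "rax", 0),
     lambda ins: ("xor", "eax", "eax")),          # shorter; breaks dep chain
    (lambda ins: ins[0] == "imul" and ins[3] == 2,
     lambda ins: ("shl", ins[1], 1)),             # multiply by 2 -> shift
]

def peephole(code):
    out = []
    for ins in code:
        for matches, rewrite in RULES:
            if matches(ins):
                ins = rewrite(ins)
                break
        out.append(ins)
    return out

print(peephole([("mov", "rax", 0), ("imul", "rax", "rax", 2)]))
# [('xor', 'eax', 'eax'), ('shl', 'rax', 1)]
```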
Calling Conventions & ABIs
The ABI (Application Binary Interface) dictates how functions communicate: which registers pass arguments, who saves which registers (caller-saved vs callee-saved), how the stack is laid out, and how values are returned. Getting this wrong causes bugs that only manifest across compilation-unit boundaries — the worst kind. x86-64 has two major calling conventions: System V AMD64 (Linux, macOS — args in rdi, rsi, rdx, rcx, r8, r9) and Microsoft x64 (Windows — args in rcx, rdx, r8, r9).
| ISA | GPRs | Word Size | Design Philosophy |
|---|---|---|---|
| x86-64 | 16 | 64-bit | CISC; variable-length encoding; backward-compatible to 1978 |
| AArch64 | 31 | 64-bit | RISC; fixed 32-bit encoding; clean break from ARM32 |
| RISC-V | 32 | 32/64-bit | Open ISA; modular extensions; minimal base set |
| Wasm | ∞ (stack) | 32/64-bit | Portable target; structured control flow; sandboxed |
Architecture note: LLVM uses a two-phase approach to codegen: first SelectionDAG (or GlobalISel) lowers IR to MachineInstr — target-specific pseudo-instructions with virtual registers. Then register allocation, scheduling, and frame layout produce the final MCInst stream, which is either emitted as assembly text or encoded directly to machine bytes.
The Broader Landscape
Where these pieces come together — and where the field is heading.
MLIR & Multi-Level IRs
Developed at Google and now part of the LLVM project, MLIR provides a framework for defining custom IR dialects at any abstraction level — from TensorFlow operations to GPU kernel launches to LLVM IR. Dialects can be mixed in the same module and progressively lowered. This is reshaping compiler design for domain-specific languages, especially in ML hardware targeting.
WebAssembly Beyond the Web
Wasm was designed as a browser compilation target, but its sandboxed execution model, fast startup, and near-native speed make it compelling for edge computing, plugin systems, and serverless. WASI (WebAssembly System Interface) provides POSIX-like capabilities. The component model adds cross-language interop via interface types — a universal ABI for the future.
Compiler as a Service
Modern compilers aren't just batch translators — they're interactive services. The Language Server Protocol (LSP) requires the compiler to provide completions, diagnostics, refactoring, and go-to-definition on every keystroke. This demands incremental compilation (don't recompute what hasn't changed), lazy evaluation (don't analyze files the user isn't looking at), and error recovery (produce useful output from broken code).
Compile-Time Execution
The boundary between compile-time and runtime is dissolving. Zig's comptime runs arbitrary code during compilation. Rust's const fn evaluates functions at compile time. C++'s constexpr/consteval has become Turing-complete. This requires the compiler to include an interpreter for its own IR — Rust's const evaluator (built on Miri) and Clang's constant expression evaluator serve this role.
Optimizations: The Greatest Hits
Most compiler optimizations are variations on a few fundamental ideas, applied at different IR levels.
| Optimization | What It Does | Enabled By |
|---|---|---|
| Constant Folding | Evaluate constant expressions at compile time | SSA (unique defs → easy to check if inputs are constant) |
| Dead Code Elimination | Remove code whose results are never used | SSA (if a def has no uses, it's dead) |
| Inlining | Replace function call with the function body | Call graph analysis; enables further optimizations |
| Loop Unrolling | Duplicate loop body to reduce branch overhead | Loop analysis; trip count estimation |
| LICM | Move loop-invariant computations outside the loop | Dominance analysis + alias analysis |
| GVN | Eliminate redundant computations with identical values | SSA + value numbering (hash-based) |
| Escape Analysis | Stack-allocate objects that don't escape the function | Points-to analysis; critical for GC'd languages |
| Vectorization | Convert scalar loops to SIMD operations | Loop dependence analysis; target SIMD width |
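Two rows of the table, constant folding and dead code elimination, can be sketched together over SSA-style triples (the instruction format is mine) to show how SSA's unique-definition property makes both checks nearly trivial:

```python
# Constant folding + DCE over SSA triples (dest, op, args). Because each
# name has exactly one definition, "is this operand constant?" and
# "does this def have any uses?" are simple dictionary/set lookups.
def fold_and_dce(instrs, live_out):
    consts, folded = {}, []
    for dest, op, args in instrs:
        vals = [consts.get(a, a) for a in args]   # propagate known constants
        if op == "const":
            consts[dest] = vals[0]
        elif op == "add" and all(isinstance(v, int) for v in vals):
            consts[dest] = sum(vals)              # fold to a constant
            op, vals = "const", [sum(vals)]
        folded.append((dest, op, vals))
    # DCE, walking backwards: keep a def only if its result is used.
    used, kept = set(live_out), []
    for dest, op, args in reversed(folded):
        if dest in used:
            kept.append((dest, op, args))
            used.update(a for a in args if isinstance(a, str))
    return kept[::-1]

prog = [
    ("a", "const", [2]),
    ("b", "const", [3]),
    ("c", "add", ["a", "b"]),   # folds to const 5
    ("d", "add", ["c", "x"]),   # x unknown: stays an add, with 5 substituted
    ("e", "const", [99]),       # never used: eliminated
]
for ins in fold_and_dce(prog, live_out=["d"]):
    print(ins)
# ('d', 'add', [5, 'x'])
```

Everything but the one live instruction disappears: folding turns `c` into a constant, substitution makes `a`, `b`, and `c` useless, and DCE deletes them along with the never-used `e`.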
Closing thought: A compiler is one of the most beautiful artifacts in computer science — a program that reads programs and writes programs. Every layer involves fundamental CS: automata theory (lexing), formal grammars (parsing), graph algorithms (IR optimization), constraint satisfaction (register allocation), and hardware architecture (codegen). Understanding compilers doesn't just make you better at writing them — it makes you a better programmer in every domain.