From source code to machine execution
The lexer (or scanner) is the compiler's first pass. It reads raw characters and groups them into tokens — the smallest meaningful units. Keywords, identifiers, literals, operators, and delimiters are all token types. Whitespace and comments are consumed and discarded.
Token patterns are formally defined as regular expressions, implemented via the classic pipeline: regex → NFA (Thompson's construction) → DFA (subset construction) → minimized DFA (Hopcroft's algorithm). Tools like Lex/Flex generate DFA-based scanners from specification files. However, most production compilers use hand-written lexers — GCC, Clang, Rust, Go, and V8 all do, for better error messages and context-sensitive handling.
Lexers follow the maximal munch (longest match) rule: given ==, the lexer matches the two-character EQUALS token, not two separate ASSIGN tokens. When two patterns match the same longest string, declaration order breaks ties.
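A minimal hand-written lexer sketch in JavaScript (the token shapes and helper names are illustrative, not from any particular compiler), showing maximal munch for == versus =:

    // Sketch of a hand-written lexer. '==' is tried before '=' (maximal munch).
    function tokenize(src) {
      const tokens = [];
      let i = 0;
      while (i < src.length) {
        const c = src[i];
        if (/\s/.test(c)) { i++; continue; }                   // skip whitespace
        if (c === '=' && src[i + 1] === '=') {                 // longest match first
          tokens.push({ type: 'EQUALS', text: '==' }); i += 2;
        } else if (c === '=') {
          tokens.push({ type: 'ASSIGN', text: '=' }); i += 1;
        } else if (/[0-9]/.test(c)) {
          let j = i;
          while (j < src.length && /[0-9]/.test(src[j])) j++;  // consume the whole literal
          tokens.push({ type: 'NUMBER', text: src.slice(i, j) }); i = j;
        } else if (/[A-Za-z_]/.test(c)) {
          let j = i;
          while (j < src.length && /\w/.test(src[j])) j++;
          tokens.push({ type: 'IDENT', text: src.slice(i, j) }); i = j;
        } else {
          throw new SyntaxError(`unexpected character '${c}' at position ${i}`);
        }
      }
      return tokens;
    }

    // tokenize('x == 42') → IDENT 'x', EQUALS '==', NUMBER '42'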
Pure DFA-based lexing breaks down in several real-world cases. Python's indentation requires a stack to emit INDENT/DEDENT tokens. JavaScript's / is ambiguous — it could begin a regex literal or be a division operator, resolved by tracking the prior token's syntactic category. And C's "lexer hack" requires the lexer to consult the symbol table to distinguish type names from identifiers, since A * B can be a pointer declaration or a multiplication.
The parser takes a flat token stream and builds a hierarchical Abstract Syntax Tree. Unlike a Concrete Syntax Tree (which mirrors the grammar one-to-one), an AST retains only semantically meaningful structure — parentheses, for example, are implicit in the tree's shape.
Recursive descent maps each grammar rule to a function. For Expr → Term (('+' | '-') Term)*, a parseExpr() function calls parseTerm(), checks for operators, and builds AST nodes. This is the overwhelmingly dominant approach: GCC, Clang, Rust, Go, TypeScript, Java (javac), and V8 all use hand-written recursive descent.
    function parseExpr() {
      let left = parseTerm();
      while (match('+') || match('-')) {   // match() consumes the token if it matches
        const op = previous();             // the operator token just consumed
        const right = parseTerm();
        left = { type: 'BinaryExpr', op, left, right };
      }
      return left;
    }
Pratt parsing (precedence climbing) assigns each operator a binding power. The algorithm: parse a prefix, then loop consuming infix operators with binding power above the current minimum. Left-associativity is encoded by making right binding power slightly higher. Used in rust-analyzer, V8's expression parser, and GCC.
    function prattParse(minBp = 0) {
      let left = parsePrefix();                  // literals, unary ops, grouped exprs
      while (infixBp(peek()) > minBp) {          // infixBp returns the operator's left bp
        const op = advance();
        const [lbp, rbp] = bindingPower(op);
        const right = prattParse(rbp);
        left = { type: 'BinaryExpr', op, left, right };
      }
      return left;
    }
    // Binding powers: '+' → (1, 2), '*' → (3, 4). Left-assoc because rbp > lbp.
LR parsers are bottom-up: they shift tokens onto a stack and reduce when a complete production right-hand side appears. LALR(1), generated by Yacc/Bison, is the sweet spot between power and table size. LR parsers handle left-recursive grammars naturally — something LL parsers cannot — but error messages are notoriously poor and grammar debugging is difficult.
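For the grammar E → E + T | T, T → T * F | F, F → id, a hand-worked trace on id + id * id looks like this (a sketch; in a real parser the LALR table drives each decision):

    Stack           Input            Action
    $               id + id * id $   shift
    $ id            + id * id $      reduce F → id
    $ F             + id * id $      reduce T → F
    $ T             + id * id $      reduce E → T
    $ E             + id * id $      shift
    $ E +           id * id $        shift
    $ E + id        * id $           reduce F → id
    $ E + F         * id $           reduce T → F
    $ E + T         * id $           shift (on *, shifting beats reducing E → E + T)
    $ E + T *       id $             shift
    $ E + T * id    $                reduce F → id
    $ E + T * F     $                reduce T → T * F
    $ E + T         $                reduce E → E + T
    $ E             $                accept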
Parsing Expression Grammars use ordered choice (/ instead of |), making them inherently unambiguous. CPython switched from LL(1) to PEG in Python 3.9 because LL(1) required accepting syntactically invalid programs and rejecting them later. Packrat parsing guarantees linear time via memoization at O(n) memory cost.
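Two hypothetical PEG rules showing how ordered choice replaces ambiguity with priority:

    Rel  <- '<=' / '<'           # '<=' first, or '<' would always win and strand the '='
    Stmt <- IfElse / IfNoElse    # trying the with-else alternative first resolves dangling else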
After parsing produces a syntactically valid AST, semantic analysis determines whether the program is meaningful. This phase resolves names, checks types, and validates control flow.
A symbol table maps names to declarations, types, and scopes. Lookup follows the most-closely-nested rule: search the current scope, then walk parent pointers upward. Most languages use lexical (static) scoping, where bindings are determined by textual structure. Dynamic scoping (early Lisp, Bash) resolves names by walking the call stack — largely abandoned because it defeats local reasoning.
    int x = 1;              // scope 0: x → int
    int y = 0;              // scope 0: y → int
    void foo() {            // scope 1
        int x = 2;          // scope 1: x → int (shadows global)
        {                   // scope 2
            int x = 3;      // scope 2: x → int (shadows scope 1)
            int z = x + y;  // x resolves to scope 2; y walks up to scope 0
        }
    }
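A minimal scope-chain sketch in JavaScript (the class and method names are illustrative):

    // Lexically scoped symbol table: each scope points to its parent.
    class Scope {
      constructor(parent = null) {
        this.parent = parent;        // enclosing scope, or null for the global scope
        this.symbols = new Map();    // name → declaration info
      }
      define(name, info) { this.symbols.set(name, info); }
      lookup(name) {                 // most-closely-nested rule
        for (let s = this; s !== null; s = s.parent) {
          if (s.symbols.has(name)) return s.symbols.get(name);
        }
        return null;                 // unresolved name → semantic error
      }
    }

    // const global = new Scope();  global.define('y', { type: 'int' });
    // const inner  = new Scope(global);
    // inner.lookup('y')            // misses locally, walks up, finds the global y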
The Hindley-Milner system infers the most general type for untyped programs. Algorithm W works through constraint generation and unification: for let f = λx. x + 1, fresh type variable α is assigned to x, α unifies with Int via the + constraint, yielding f : Int → Int. The occurs check prevents infinite types. HM powers ML, OCaml, and Haskell; Rust and TypeScript use partial inference with bidirectional type checking.
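A minimal unification sketch in JavaScript with an occurs check (illustrative only; full Algorithm W adds constraint generation and let-generalization). Types here are a base-type string like 'Int', a variable { var: name }, or a function type { fn: [param, result] }:

    let counter = 0;
    const fresh = () => ({ var: 't' + counter++ });

    function resolve(t, subst) {
      // Follow the substitution chain for type variables.
      while (t.var && subst.has(t.var)) t = subst.get(t.var);
      return t;
    }

    function occurs(v, t, subst) {
      // The occurs check: does variable v appear inside t? Prevents infinite types.
      t = resolve(t, subst);
      if (t.var) return t.var === v;
      if (t.fn) return t.fn.some(u => occurs(v, u, subst));
      return false;
    }

    function unify(a, b, subst = new Map()) {
      a = resolve(a, subst);
      b = resolve(b, subst);
      if (a.var) {
        if (b.var && b.var === a.var) return subst;     // same variable: nothing to do
        if (occurs(a.var, b, subst)) throw new TypeError('infinite type');
        subst.set(a.var, b);
        return subst;
      }
      if (b.var) return unify(b, a, subst);
      if (a.fn && b.fn) {
        unify(a.fn[0], b.fn[0], subst);
        return unify(a.fn[1], b.fn[1], subst);
      }
      if (a === b) return subst;                        // matching base types
      throw new TypeError('cannot unify ' + JSON.stringify(a) + ' with ' + JSON.stringify(b));
    }

    // For λx. x + 1: x gets a fresh variable, which '+' forces to unify with 'Int',
    // so the inferred type is Int → Int.
    const alpha = fresh();
    unify(alpha, 'Int');   // substitution now maps alpha ↦ 'Int'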
The compiler builds a control flow graph and runs dataflow analyses. Definite assignment (Java, C#) propagates assignment sets forward, ensuring every variable is assigned before use. Liveness analysis works backward: LiveIn(n) = Use(n) ∪ (LiveOut(n) \ Def(n)). Dead variables enable dead code elimination and inform register allocation.
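A minimal backward liveness pass in JavaScript (the node shape is illustrative), iterating the equation above to a fixed point:

    // Each node: { use: Set, def: Set, succ: [indices of successor nodes] }.
    function liveness(nodes) {
      const liveIn  = nodes.map(() => new Set());
      const liveOut = nodes.map(() => new Set());
      let changed = true;
      while (changed) {                                 // sets only grow, so this terminates
        changed = false;
        for (let n = nodes.length - 1; n >= 0; n--) {   // backward over the CFG
          const out = new Set();
          for (const s of nodes[n].succ) for (const v of liveIn[s]) out.add(v);
          // LiveIn(n) = Use(n) ∪ (LiveOut(n) \ Def(n))
          const inn = new Set(nodes[n].use);
          for (const v of out) if (!nodes[n].def.has(v)) inn.add(v);
          if (inn.size !== liveIn[n].size || out.size !== liveOut[n].size) changed = true;
          liveIn[n] = inn;
          liveOut[n] = out;
        }
      }
      return { liveIn, liveOut };
    }

    // a = 1; b = a + 1; return b   →   liveIn of the middle node is { a }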
IRs decouple languages from machines. Without them, supporting M languages and N architectures requires M×N compilers. LLVM demonstrates the power: Clang, rustc, and Swift all emit LLVM IR, consumed by backends for x86, ARM, AArch64, and RISC-V — M+N components instead of M×N.
Static Single Assignment requires every variable to be assigned exactly once. At control-flow join points, φ (phi) functions select between values from different predecessors. SSA makes def-use chains trivially explicit, enabling constant propagation, dead code elimination, global value numbering, and loop-invariant code motion. It is arguably the most important compiler IR innovation — used by LLVM, GCC, V8's TurboFan, and HotSpot's C2.
    // Before SSA
    x = 0
    if (cond)
      x = 1
    else
      x = 2
    print(x)

    // After SSA
    x0 = 0
    if (cond)
      x1 = 1
    else
      x2 = 2
    x3 = φ(x1, x2)
    print(x3)
Most bytecode VMs (JVM, CPython, .NET) use a stack-based architecture: operands are pushed and popped implicitly. Consider how such a VM evaluates the expression below.
2 + 3 * 4
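A plausible instruction sequence (opcode names are illustrative; real instruction sets such as JVM bytecode differ in detail), with the operand stack after each step:

    PUSH 2    ; stack: [2]
    PUSH 3    ; stack: [2, 3]
    PUSH 4    ; stack: [2, 3, 4]
    MUL       ; stack: [2, 12]   (pops 3 and 4, pushes 12)
    ADD       ; stack: [14]      (pops 2 and 12, pushes 14)

Multiplication runs first because the compiler emitted MUL before ADD, encoding operator precedence in instruction order.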
Modern VMs use tiered compilation: start interpreting, then progressively optimize hot code. V8's pipeline is the canonical example:
1. Ignition: register-based interpreter. Collects type feedback at every operation via inline caches.
2. Sparkplug: baseline compiler. Translates bytecode to native code 1:1 without optimization; compiles 10-1000× faster than TurboFan.
3. Maglev: mid-tier SSA optimizer. Inlines small functions, eliminates redundant checks based on type feedback.
4. TurboFan: full optimizer. Sea-of-Nodes IR, speculative type specialization, escape analysis, inlining. Deoptimizes on speculation failure.
Generational GC rests on the generational hypothesis: most objects die young. The heap is split into a young generation, reclaimed by frequent, cheap minor collections, and an old generation holding survivors, reclaimed by rarer, costlier major collections.
"1" + 1 → TypeError. Types exist at runtime and are enforced strictly.void* weaken guarantees"5" - 3 → 2. Implicit coercion everywhere. JS: [] + {} → "[object Object]"TypeScript, Go interfaces. Compatibility based on shape — any object with makeSound() satisfies an interface requiring it. Less boilerplate, better third-party interop.
Java, Rust, C#. Compatibility based on declared names. Prevents accidental equivalence: distinguishing UserId from ProductId when both wrap int.
Memory management strategies trade control against safety and throughput:

Manual (C): malloc/free. Maximum control, maximum foot-gun potential: use-after-free, double-free, memory leaks.
Ownership (Rust): compile-time rules (one owner; values dropped when they go out of scope; a borrow checker preventing data races). No GC, no manual frees.
Reference counting (Swift's ARC): compiler-inserted retain/release. Deterministic deallocation; cycles broken with weak/unowned references.
Tracing GC (JVM, Go, JavaScript engines): the runtime traces reachable objects. Generational, concurrent, and compacting variants. Highest throughput, non-deterministic pauses.
Evaluation strategy determines when function arguments are computed:

Eager (call by value): arguments evaluated before the function call. Used by almost all languages. Simple mental model, predictable performance.
Lazy (call by need): Haskell's default. Computation deferred until a value is demanded. Enables infinite data structures (ones = 1 : ones), at the risk of space leaks from thunk buildup. A JavaScript analogue via generators is sketched below.
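A JavaScript sketch of the same idea (generators opt in to laziness in an otherwise eager language; take is a hypothetical helper, not a built-in):

    // Infinite stream, analogous to Haskell's ones = 1 : ones.
    function* ones() {
      while (true) yield 1;          // each value is produced only on demand
    }
    function* take(n, iter) {        // pull at most n values from a lazy stream
      for (const v of iter) {
        if (n-- <= 0) return;
        yield v;
      }
    }
    console.log([...take(3, ones())]);  // [1, 1, 1]; the infinite stream is never realized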
All parsing infrastructure rests on formal grammar theory. A context-free grammar is a 4-tuple G = (V, Σ, R, S) of nonterminals, terminals, production rules, and start symbol. BNF (Backus-Naur Form, 1959) describes these production rules; EBNF adds repetition and optionality.
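For example, the expression grammar from the parsing section, written with EBNF's repetition operator:

    Expr   ::= Term (('+' | '-') Term)*
    Term   ::= Factor (('*' | '/') Factor)*
    Factor ::= NUMBER | '(' Expr ')'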
The Chomsky hierarchy classifies grammars by expressive power, from the most restrictive (Type 3, regular) to the most general (Type 0, unrestricted). Programming language syntax is mostly Type 2 (context-free), but semantics push into context-sensitive or Turing-complete territory — C++ templates are Turing-complete.
The dangling else is a classic grammar ambiguity: if a then if b then s1 else s2 has two valid parse trees. Solutions include convention (else binds to the nearest if), grammar rewriting, or syntactic design — Go and Rust require braces, eliminating the ambiguity entirely. PEGs resolve it through ordered choice: the with-else alternative is tried first.
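The two candidate readings, with the conventional one first:

    if a then (if b then s1 else s2)    (else binds to the nearest if; the usual convention)
    if a then (if b then s1) else s2    (else binds to the outer if)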