From source code to machine execution
The lexer (or scanner) is the compiler's first pass. It reads raw characters and groups them into tokens — the smallest meaningful units. Keywords, identifiers, literals, operators, and delimiters are all token types. Whitespace and comments are consumed and discarded.
Token patterns are formally defined as regular expressions, implemented via the classic pipeline: regex → NFA (Thompson's construction) → DFA (subset construction) → minimized DFA (Hopcroft's algorithm). Tools like Lex/Flex generate DFA-based scanners from specification files. However, most production compilers use hand-written lexers — GCC, Clang, Rust, Go, and V8 all do, for better error messages and context-sensitive handling.
Lexers follow the maximal munch (longest match) rule: given ==, the lexer matches the two-character EQUALS token, not two separate ASSIGN tokens. When two patterns match the same longest string, declaration order breaks ties.
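A minimal hand-written lexer sketch in JavaScript (the token shapes and helper names are illustrative, not from any particular compiler), showing maximal munch for == versus =:

    // Sketch of a hand-written lexer. '==' is tried before '=' (maximal munch).
    function tokenize(src) {
      const tokens = [];
      let i = 0;
      while (i < src.length) {
        const c = src[i];
        if (/\s/.test(c)) { i++; continue; }                   // skip whitespace
        if (c === '=' && src[i + 1] === '=') {                 // longest match first
          tokens.push({ type: 'EQUALS', text: '==' }); i += 2;
        } else if (c === '=') {
          tokens.push({ type: 'ASSIGN', text: '=' }); i += 1;
        } else if (/[0-9]/.test(c)) {
          let j = i;
          while (j < src.length && /[0-9]/.test(src[j])) j++;  // consume the whole literal
          tokens.push({ type: 'NUMBER', text: src.slice(i, j) }); i = j;
        } else if (/[A-Za-z_]/.test(c)) {
          let j = i;
          while (j < src.length && /\w/.test(src[j])) j++;
          tokens.push({ type: 'IDENT', text: src.slice(i, j) }); i = j;
        } else {
          throw new SyntaxError(`unexpected character '${c}' at position ${i}`);
        }
      }
      return tokens;
    }

    // tokenize('x == 42') → IDENT 'x', EQUALS '==', NUMBER '42'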
Pure DFA-based lexing breaks down in several real-world cases. Python's indentation requires a stack to emit INDENT/DEDENT tokens. JavaScript's / is ambiguous — it could begin a regex literal or be a division operator, resolved by tracking the prior token's syntactic category. And C's "lexer hack" requires the lexer to consult the symbol table to distinguish type names from identifiers, since A * B can be a pointer declaration or a multiplication.
The parser takes a flat token stream and builds a hierarchical Abstract Syntax Tree. Unlike a Concrete Syntax Tree (which mirrors the grammar one-to-one), an AST retains only semantically meaningful structure — parentheses, for example, are implicit in the tree's shape.
Recursive descent maps each grammar rule to a function. For Expr → Term (('+' | '-') Term)*, a parseExpr() function calls parseTerm(), checks for operators, and builds AST nodes. This is the overwhelmingly dominant approach: GCC, Clang, Rust, Go, TypeScript, Java (javac), and V8 all use hand-written recursive descent.
    function parseExpr() {
      let left = parseTerm();
      while (match('+') || match('-')) {   // match() consumes the token if it matches
        const op = previous();             // the operator token just consumed
        const right = parseTerm();
        left = { type: 'BinaryExpr', op, left, right };
      }
      return left;
    }
Pratt parsing (precedence climbing) assigns each operator a binding power. The algorithm: parse a prefix, then loop consuming infix operators with binding power above the current minimum. Left-associativity is encoded by making right binding power slightly higher. Used in rust-analyzer, V8's expression parser, and GCC.
    function prattParse(minBp = 0) {
      let left = parsePrefix();                  // literals, unary ops, grouped exprs
      while (infixBp(peek()) > minBp) {          // infixBp returns the operator's left bp
        const op = advance();
        const [lbp, rbp] = bindingPower(op);
        const right = prattParse(rbp);
        left = { type: 'BinaryExpr', op, left, right };
      }
      return left;
    }
    // Binding powers: '+' → (1, 2), '*' → (3, 4). Left-assoc because rbp > lbp.
LR parsers are bottom-up: they shift tokens onto a stack and reduce when a complete production right-hand side appears. LALR(1), generated by Yacc/Bison, is the sweet spot between power and table size. LR parsers handle left-recursive grammars naturally — something LL parsers cannot — but error messages are notoriously poor and grammar debugging is difficult.
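For the grammar E → E + T | T, T → T * F | F, F → id, a hand-worked trace on id + id * id looks like this (a sketch; in a real parser the LALR table drives each decision):

    Stack           Input            Action
    $               id + id * id $   shift
    $ id            + id * id $      reduce F → id
    $ F             + id * id $      reduce T → F
    $ T             + id * id $      reduce E → T
    $ E             + id * id $      shift
    $ E +           id * id $        shift
    $ E + id        * id $           reduce F → id
    $ E + F         * id $           reduce T → F
    $ E + T         * id $           shift (on *, shifting beats reducing E → E + T)
    $ E + T *       id $             shift
    $ E + T * id    $                reduce F → id
    $ E + T * F     $                reduce T → T * F
    $ E + T         $                reduce E → E + T
    $ E             $                accept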
Parsing Expression Grammars use ordered choice (/ instead of |), making them inherently unambiguous. CPython switched from LL(1) to PEG in Python 3.9 because LL(1) required accepting syntactically invalid programs and rejecting them later. Packrat parsing guarantees linear time via memoization at O(n) memory cost.
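Two hypothetical PEG rules showing how ordered choice replaces ambiguity with priority:

    Rel  <- '<=' / '<'           # '<=' first, or '<' would always win and strand the '='
    Stmt <- IfElse / IfNoElse    # trying the with-else alternative first resolves dangling else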
After parsing produces a syntactically valid AST, semantic analysis determines whether the program is meaningful. This phase resolves names, checks types, and validates control flow.
A symbol table maps names to declarations, types, and scopes. Lookup follows the most-closely-nested rule: search the current scope, then walk parent pointers upward. Most languages use lexical (static) scoping, where bindings are determined by textual structure. Dynamic scoping (early Lisp, Bash) resolves names by walking the call stack — largely abandoned because it defeats local reasoning.
    int x = 1;              // scope 0: x → int
    int y = 0;              // scope 0: y → int
    void foo() {            // scope 1
        int x = 2;          // scope 1: x → int (shadows global)
        {                   // scope 2
            int x = 3;      // scope 2: x → int (shadows scope 1)
            int z = x + y;  // x resolves to scope 2; y walks up to scope 0
        }
    }
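A minimal scope-chain sketch in JavaScript (the class and method names are illustrative):

    // Lexically scoped symbol table: each scope points to its parent.
    class Scope {
      constructor(parent = null) {
        this.parent = parent;        // enclosing scope, or null for the global scope
        this.symbols = new Map();    // name → declaration info
      }
      define(name, info) { this.symbols.set(name, info); }
      lookup(name) {                 // most-closely-nested rule
        for (let s = this; s !== null; s = s.parent) {
          if (s.symbols.has(name)) return s.symbols.get(name);
        }
        return null;                 // unresolved name → semantic error
      }
    }

    // const global = new Scope();  global.define('y', { type: 'int' });
    // const inner  = new Scope(global);
    // inner.lookup('y')            // misses locally, walks up, finds the global y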
The Hindley-Milner system infers the most general type for untyped programs. Algorithm W works through constraint generation and unification: for let f = λx. x + 1, fresh type variable α is assigned to x, α unifies with Int via the + constraint, yielding f : Int → Int. The occurs check prevents infinite types. HM powers ML, OCaml, and Haskell; Rust and TypeScript use partial inference with bidirectional type checking.
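A minimal unification sketch in JavaScript with an occurs check (illustrative only; full Algorithm W adds constraint generation and let-generalization). Types here are a base-type string like 'Int', a variable { var: name }, or a function type { fn: [param, result] }:

    let counter = 0;
    const fresh = () => ({ var: 't' + counter++ });

    function resolve(t, subst) {
      // Follow the substitution chain for type variables.
      while (t.var && subst.has(t.var)) t = subst.get(t.var);
      return t;
    }

    function occurs(v, t, subst) {
      // The occurs check: does variable v appear inside t? Prevents infinite types.
      t = resolve(t, subst);
      if (t.var) return t.var === v;
      if (t.fn) return t.fn.some(u => occurs(v, u, subst));
      return false;
    }

    function unify(a, b, subst = new Map()) {
      a = resolve(a, subst);
      b = resolve(b, subst);
      if (a.var) {
        if (b.var && b.var === a.var) return subst;     // same variable: nothing to do
        if (occurs(a.var, b, subst)) throw new TypeError('infinite type');
        subst.set(a.var, b);
        return subst;
      }
      if (b.var) return unify(b, a, subst);
      if (a.fn && b.fn) {
        unify(a.fn[0], b.fn[0], subst);
        return unify(a.fn[1], b.fn[1], subst);
      }
      if (a === b) return subst;                        // matching base types
      throw new TypeError('cannot unify ' + JSON.stringify(a) + ' with ' + JSON.stringify(b));
    }

    // For λx. x + 1: x gets a fresh variable, which '+' forces to unify with 'Int',
    // so the inferred type is Int → Int.
    const alpha = fresh();
    unify(alpha, 'Int');   // substitution now maps alpha ↦ 'Int'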
The compiler builds a control flow graph and runs dataflow analyses. Definite assignment (Java, C#) propagates assignment sets forward, ensuring every variable is assigned before use. Liveness analysis works backward: LiveIn(n) = Use(n) ∪ (LiveOut(n) \ Def(n)). Dead variables enable dead code elimination and inform register allocation.
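A minimal backward liveness pass in JavaScript (the node shape is illustrative), iterating the equation above to a fixed point:

    // Each node: { use: Set, def: Set, succ: [indices of successor nodes] }.
    function liveness(nodes) {
      const liveIn  = nodes.map(() => new Set());
      const liveOut = nodes.map(() => new Set());
      let changed = true;
      while (changed) {                                 // sets only grow, so this terminates
        changed = false;
        for (let n = nodes.length - 1; n >= 0; n--) {   // backward over the CFG
          const out = new Set();
          for (const s of nodes[n].succ) for (const v of liveIn[s]) out.add(v);
          // LiveIn(n) = Use(n) ∪ (LiveOut(n) \ Def(n))
          const inn = new Set(nodes[n].use);
          for (const v of out) if (!nodes[n].def.has(v)) inn.add(v);
          if (inn.size !== liveIn[n].size || out.size !== liveOut[n].size) changed = true;
          liveIn[n] = inn;
          liveOut[n] = out;
        }
      }
      return { liveIn, liveOut };
    }

    // a = 1; b = a + 1; return b   →   liveIn of the middle node is { a }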
IRs decouple languages from machines. Without them, supporting M languages and N architectures requires M×N compilers. LLVM demonstrates the power: Clang, rustc, and Swift all emit LLVM IR, consumed by backends for x86, ARM, AArch64, and RISC-V — M+N components instead of M×N.
Static Single Assignment requires every variable to be assigned exactly once. At control-flow join points, φ (phi) functions select between values from different predecessors. SSA makes def-use chains trivially explicit, enabling constant propagation, dead code elimination, global value numbering, and loop-invariant code motion. It is arguably the most important compiler IR innovation — used by LLVM, GCC, V8's TurboFan, and HotSpot's C2.
    // Before SSA
    x = 0
    if (cond)
      x = 1
    else
      x = 2
    print(x)

    // After SSA
    x0 = 0
    if (cond)
      x1 = 1
    else
      x2 = 2
    x3 = φ(x1, x2)
    print(x3)
Most bytecode VMs (JVM, CPython, .NET) use a stack-based architecture: operands are pushed and popped implicitly. Consider how such a VM evaluates the expression below.
2 + 3 * 4
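A plausible instruction sequence (opcode names are illustrative; real instruction sets such as JVM bytecode differ in detail), with the operand stack after each step:

    PUSH 2    ; stack: [2]
    PUSH 3    ; stack: [2, 3]
    PUSH 4    ; stack: [2, 3, 4]
    MUL       ; stack: [2, 12]   (pops 3 and 4, pushes 12)
    ADD       ; stack: [14]      (pops 2 and 12, pushes 14)

Multiplication runs first because the compiler emitted MUL before ADD, encoding operator precedence in instruction order.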
Modern VMs use tiered compilation: start interpreting, then progressively optimize hot code. V8's pipeline is the canonical example:
1. Ignition: register-based interpreter. Collects type feedback at every operation via inline caches.
2. Sparkplug: baseline compiler. Translates bytecode to native code 1:1 without optimization; compiles 10-1000× faster than TurboFan.
3. Maglev: mid-tier SSA optimizer. Inlines small functions, eliminates redundant checks based on type feedback.
4. TurboFan: full optimizer. Sea-of-Nodes IR, speculative type specialization, escape analysis, inlining. Deoptimizes on speculation failure.
Generational GC rests on the generational hypothesis: most objects die young. The heap is split into a young generation, reclaimed by frequent, cheap minor collections, and an old generation holding survivors, reclaimed by rarer, costlier major collections.
"1" + 1 → TypeError. Types exist at runtime and are enforced strictly.void* weaken guarantees"5" - 3 → 2. Implicit coercion everywhere. JS: [] + {} → "[object Object]"TypeScript, Go interfaces. Compatibility based on shape — any object with makeSound() satisfies an interface requiring it. Less boilerplate, better third-party interop.
Java, Rust, C#. Compatibility based on declared names. Prevents accidental equivalence: distinguishing UserId from ProductId when both wrap int.
Memory management strategies trade control against safety and throughput:

Manual (C): malloc/free. Maximum control, maximum foot-gun potential: use-after-free, double-free, memory leaks.
Ownership (Rust): compile-time rules (one owner; values dropped when they go out of scope; a borrow checker preventing data races). No GC, no manual frees.
Reference counting (Swift's ARC): compiler-inserted retain/release. Deterministic deallocation; cycles broken with weak/unowned references.
Tracing GC (JVM, Go, JavaScript engines): the runtime traces reachable objects. Generational, concurrent, and compacting variants. Highest throughput, non-deterministic pauses.
Evaluation strategy determines when function arguments are computed:

Eager (call by value): arguments evaluated before the function call. Used by almost all languages. Simple mental model, predictable performance.
Lazy (call by need): Haskell's default. Computation deferred until a value is demanded. Enables infinite data structures (ones = 1 : ones), at the risk of space leaks from thunk buildup. A JavaScript analogue via generators is sketched below.
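A JavaScript sketch of the same idea (generators opt in to laziness in an otherwise eager language; take is a hypothetical helper, not a built-in):

    // Infinite stream, analogous to Haskell's ones = 1 : ones.
    function* ones() {
      while (true) yield 1;          // each value is produced only on demand
    }
    function* take(n, iter) {        // pull at most n values from a lazy stream
      for (const v of iter) {
        if (n-- <= 0) return;
        yield v;
      }
    }
    console.log([...take(3, ones())]);  // [1, 1, 1]; the infinite stream is never realized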
All parsing infrastructure rests on formal grammar theory. A context-free grammar is a 4-tuple G = (V, Σ, R, S) of nonterminals, terminals, production rules, and start symbol. BNF (Backus-Naur Form, 1959) describes these production rules; EBNF adds repetition and optionality.
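For example, the expression grammar from the parsing section, written with EBNF's repetition operator:

    Expr   ::= Term (('+' | '-') Term)*
    Term   ::= Factor (('*' | '/') Factor)*
    Factor ::= NUMBER | '(' Expr ')'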
The Chomsky hierarchy classifies grammars by expressive power, from the most restrictive (Type 3, regular) to the most general (Type 0, unrestricted). Programming language syntax is mostly Type 2 (context-free), but semantics push into context-sensitive or Turing-complete territory — C++ templates are Turing-complete.
The dangling else is a classic grammar ambiguity: if a then if b then s1 else s2 has two valid parse trees. Solutions include convention (else binds to the nearest if), grammar rewriting, or syntactic design — Go and Rust require braces, eliminating the ambiguity entirely. PEGs resolve it through ordered choice: the with-else alternative is tried first.
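The two candidate readings, with the conventional one first:

    if a then (if b then s1 else s2)    (else binds to the nearest if; the usual convention)
    if a then (if b then s1) else s2    (else binds to the outer if)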