The Complete Guide · March 2026

Understanding Large Language Models

How they work, how they're trained, what they can do, and where they're headed — an interactive deep dive into the technology reshaping the world.

900
Million Weekly Users
15
Trillion Training Tokens
2
Million Token Context
Scroll to explore

How LLMs Work

Every major LLM — GPT-5, Claude, Gemini, Llama, DeepSeek — is a decoder-only transformer. The architecture has remained fundamentally stable since 2019; what has changed is a steady accumulation of carefully validated component upgrades that collectively yield massive gains.

The Transformer, Explored

Click any layer below to see how it works. Each component in the stack plays a critical role in transforming raw tokens into intelligent predictions.

T
Tokenization + Embedding
P
Positional Encoding (RoPE)
A
Multi-Head Self-Attention
F
Feed-Forward Network (SwiGLU)
N
RMSNorm + Residual Connections
O
Output Projection → Next Token
Text is split into subword tokens via BPE (Byte Pair Encoding). Each token is mapped to a dense vector — Llama 3 405B uses 16,384-dimensional embeddings across a vocabulary of 128,256 tokens. Larger vocabularies improve multilingual and code efficiency. In English, one token averages ~0.75 words (roughly 1.3 tokens per word).

Try It: Simulated Tokenization

Type text below to see how a BPE-like tokenizer would split it into subword tokens. Each color represents a different token.
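If you want the mechanism rather than the widget, the heart of BPE training fits in a few lines. This is a toy sketch in plain Python — real tokenizers operate on bytes, apply pre-tokenization rules, and train on vastly larger corpora:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (illustrative only)."""
    # Start with each word as a sequence of characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent substrings ("lo", "low") become single tokens after a few merges.
print(train_bpe("low lower lowest low low", num_merges=3))
```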


Key Architectural Innovations

Grouped-Query Attention

Standard multi-head attention gives each head its own Key and Value projections. GQA groups multiple query heads to share K/V projections — Llama 3 405B uses 8 KV heads shared across 128 query heads. This shrinks the KV cache by the grouping factor (16× in that configuration) with negligible quality loss, making long-context inference practical.
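A minimal NumPy sketch of the mechanism — head counts and dimensions here are illustrative toys, far smaller than Llama 3's, and causal masking is omitted for brevity:

```python
import numpy as np

# Grouped-query attention: n_q query heads share n_kv K/V heads.
n_q, n_kv, d, seq = 8, 2, 16, 5       # group size = n_q / n_kv = 4
q = np.random.randn(n_q, seq, d)
k = np.random.randn(n_kv, seq, d)
v = np.random.randn(n_kv, seq, d)

group = n_q // n_kv                   # query heads per KV head
out = np.empty_like(q)
for h in range(n_q):
    kv = h // group                   # each query head reads its group's K/V
    scores = q[h] @ k[kv].T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)     # softmax over keys
    out[h] = w @ v[kv]

# Only n_kv K/V tensors are cached instead of n_q: a 4x saving here,
# and 16x for Llama 3 405B's 128 query / 8 KV head layout.
```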

Flash Attention

Developed by Tri Dao, Flash Attention is an IO-aware algorithm that computes exact attention while reducing memory from O(n²) to O(n) by processing in tiles, never materializing the full attention matrix. Flash Attention 2 and 3 are now standard across production inference engines, delivering 2-4× speedups.
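The enabling trick is the online softmax: attention can be accumulated tile by tile while carrying only a running maximum, denominator, and weighted sum. A NumPy sketch for a single query — real Flash Attention fuses this per query tile inside a GPU kernel:

```python
import numpy as np

def tiled_attention_row(q, K, V, tile=64):
    """Exact attention for one query without the full n x n score matrix."""
    d = q.shape[0]
    m = -np.inf                  # running max of scores (numerical stability)
    denom = 0.0                  # running softmax denominator
    acc = np.zeros_like(V[0])    # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Kt @ q / np.sqrt(d)            # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale earlier partial results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ Vt
        m = m_new
    return acc / denom

# Matches a reference full-matrix softmax attention exactly.
n, d = 1024, 64
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(tiled_attention_row(q, K, V), w @ V)
```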

Mixture of Experts

MoE replaces the dense FFN with multiple expert sub-networks, activating only a subset per token via a learned router. DeepSeek-V3 has 671B total but only 37B active parameters (256 experts, 8 routed + 1 shared). Over 60% of open-source models in late 2025 used MoE, and all top-10 models on intelligence leaderboards are MoE architectures.
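Stripped to its essentials, the routing looks like the sketch below — a single-token NumPy toy with invented sizes, not DeepSeek's actual router (which adds shared experts, load balancing, and fine-grained segmentation):

```python
import numpy as np

def moe_ffn(x, experts, router_w, k=2):
    """Route one token to the top-k of E expert FFNs and mix their
    outputs by renormalized router scores."""
    logits = router_w @ x                    # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    topk = np.argsort(probs)[-k:]            # indices of the k best experts
    gate = probs[topk] / probs[topk].sum()   # renormalize over chosen experts
    # Only k experts execute — active params are a small slice of the total.
    return sum(g * experts[e](x) for g, e in zip(gate, topk))

d, E = 16, 8
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(E)]
experts = [lambda x, W=W: np.maximum(W @ x, 0.0) for W in Ws]  # tiny ReLU FFNs
router_w = rng.standard_normal((E, d)) / np.sqrt(d)
y = moe_ffn(rng.standard_normal(d), experts, router_w, k=2)
```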

Quantization

Reduces weight precision from FP16 to INT8 or INT4, shrinking a 7B model from 14 GB → 3.5 GB with minimal quality loss. Major methods: GPTQ (Hessian-informed), AWQ (activation-aware), and GGUF (the community standard for llama.cpp). Enables 7B models to run on laptops at 50-100+ tokens/sec.
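The simplest form of the idea — symmetric per-tensor INT8 in NumPy. GPTQ, AWQ, and GGUF go much further, with per-group scales, calibration data, and lower bit-widths:

```python
import numpy as np

def quantize_int8(w):
    """Store int8 weights plus one FP scale; reconstruct approximately."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one toy weight matrix
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean |error| = {err:.4f}")
# Production methods place the quantization error where it matters least,
# which is why INT4 can still preserve quality.
```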

KV Cache & PagedAttention

During generation, each new token must attend to all previous tokens. The KV cache stores Key/Value tensors for reuse — for Llama 3 70B at 128K context, this alone requires ~40 GB. PagedAttention (vLLM) borrows virtual memory paging to eliminate cache fragmentation, achieving <4% waste vs 60-80% in naive allocation.
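The ~40 GB figure follows directly from the cache geometry. A back-of-envelope check, using the commonly reported Llama 3 70B shape:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
layers, kv_heads, head_dim = 80, 8, 128   # Llama 3 70B geometry (with GQA)
bytes_per_value = 2                        # FP16
tokens = 128_000
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(f"{per_token / 1024:.0f} KB per token")                       # 320 KB
print(f"{per_token * tokens / 2**30:.0f} GB at {tokens:,} tokens")  # ~39 GB
# One long request pins ~40 GB, and naive contiguous allocation wastes
# most of it — hence PagedAttention's block-table approach.
```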

From Raw Text to Useful Assistants

Training an LLM is a multi-stage pipeline: pre-train on vast internet text, then refine through supervised fine-tuning, and align with human preferences via RLHF or DPO.

STAGE 01

Pre-Training

Next-token prediction on trillions of tokens — the objective is sketched in code below. The model learns grammar, facts, reasoning patterns, and common sense from vast text corpora.

15T
tokens (Llama 3)
16K
H100 GPUs
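The entire Stage 01 objective is a few lines of math: cross-entropy between each position's prediction and the next token. A NumPy sketch on toy logits — a real run differs only in scale:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Causal LM objective: mean cross-entropy of each position's
    prediction against the *next* token in the sequence."""
    preds, targets = logits[:-1], token_ids[1:]      # shift by one position
    z = preds - preds.max(axis=-1, keepdims=True)    # stabilize softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab, seq = 100, 12
rng = np.random.default_rng(0)
loss = next_token_loss(rng.standard_normal((seq, vocab)),
                       rng.integers(0, vocab, size=seq))
print(loss)   # roughly ln(100) ≈ 4.6 for random (untrained) logits
```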
STAGE 02

Supervised Fine-Tuning

Train on instruction-response pairs. A 1.3B model fine-tuned this way was preferred by human raters over the 175B GPT-3 — alignment matters more than raw scale. Adapter methods like LoRA (sketched below) make this stage cheap.

25M+
examples
<1%
params (LoRA)
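The '<1% params' figure comes from low-rank adaptation: the pretrained weight stays frozen and only a small low-rank update trains. A NumPy sketch with illustrative sizes:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA: frozen pretrained weight W plus a learned low-rank update
    B @ A, so only r * (d_in + d_out) parameters train."""
    r = A.shape[0]
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

d_in, d_out, r = 4096, 4096, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen
A = rng.standard_normal((r, d_in)) * 0.01                # trainable
B = np.zeros((d_out, r))                                  # trainable, init 0
y = lora_forward(rng.standard_normal(d_in), W, A, B)

frac = (A.size + B.size) / W.size
print(f"trainable fraction: {frac:.2%}")   # ~0.39% — the '<1%' in the card
```

Initializing B to zero means the adapted model starts out identical to the base model; fine-tuning only gradually moves it away.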
STAGE 03

RLHF / DPO

Align with human preferences. RLHF trains a reward model on human rankings, then optimizes the policy against it via PPO. DPO collapses this into a single classification-style loss, sketched below.

$1-10
per annotation
10×
cheaper (RLAIF)
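DPO's single loss is a logistic loss on how much more the policy (versus a frozen reference model) prefers the chosen response over the rejected one. A sketch with invented log-probability values:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are total log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# The loss falls as the policy raises the chosen response's likelihood,
# relative to the reference, more than the rejected one's.
print(dpo_loss(logp_w=-52.0, logp_l=-60.0,
               ref_logp_w=-55.0, ref_logp_l=-58.0))   # ≈ 0.47
```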
STAGE 04

Safety & Red-Teaming

Constitutional AI, red-team adversarial testing, and capability evaluations. Anthropic's ASL framework scales precautions with model capability.

ASL 1-5
safety levels
Grade D
best safety score

The Scaling Laws Paradigm

DeepMind's 2022 Chinchilla paper established that compute-optimal training needs ~20 tokens per parameter. But the industry deliberately moved beyond this: modern models are massively over-trained to reduce inference cost. Llama 3 8B was trained at 1,875:1 tokens-to-parameters — 94× the Chinchilla-optimal ratio.
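The arithmetic behind that 94× figure, spelled out in Python:

```python
# Chinchilla-optimal: ~20 training tokens per parameter.
params = 8e9                 # Llama 3 8B
tokens = 15e12               # its actual training budget
ratio = tokens / params      # 1,875 tokens per parameter
print(f"{ratio:,.0f}:1 — {ratio / 20:.0f}x the ~20:1 Chinchilla optimum")
# Over-training burns extra training compute to buy a smaller model
# that is cheaper to serve — the trade the industry chose.
```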

Training costs have grown ~2.4× per year. GPT-4 cost an estimated $78-100M. Llama 3.1 405B cost ~$170-191M. DeepSeek-V3 reported just $5.6M in training compute — an efficiency claim that, together with R1's release, shocked the industry in January 2025.

The Data Wall

Epoch AI estimates quality public text at roughly 300 trillion tokens. With aggressive over-training ratios, this may be functionally exhausted by 2027. The response: synthetic data generation, licensing deals (Reddit-Google, News Corp-OpenAI), and efficiency improvements. FineWeb provides 15 trillion filtered tokens from 96 Common Crawl snapshots.

What LLMs Can Do

From writing production code to discovering drug targets, LLMs have moved from novelty to infrastructure. Here's how they're being used across industries.

⌨️

Code Generation

GitHub Copilot has 4.7M paid subscribers. Claude Code scored 72.5% on SWE-bench. Cursor IDE reached $29.3B valuation. 85% of developers now use AI tools regularly.

77-81% SWE-bench top scores
🔬

Scientific Discovery

AlphaFold's protein-structure prediction earned a Nobel Prize; AlphaFold 3 extends it to biomolecular interactions with a 50%+ accuracy improvement. AI achieved unofficial IMO gold medals, solving 5 of 6 problems.

98.5% human proteome covered
🏥

Healthcare

FDA authorized 1,250+ AI medical devices by mid-2025. Ambient AI scribes save physicians 2+ hours daily. ~80% of devices are in radiology.

1,250+ FDA-authorized devices
⚖️

Legal

Harvey AI reached $190M ARR, used by most AmLaw 100 firms. CoCounsel won the federal judiciary contract serving 25,000+ legal professionals.

$190M ARR (Harvey AI)
🤖

Agentic AI

MCP reached 97M monthly SDK downloads. Claude Computer Use, OpenAI Operator, and Google Mariner enable autonomous task execution. MCP itself was donated to the Agentic AI Foundation.

97M monthly MCP downloads
🧠

Reasoning Models

Test-time compute scaling: o1 scored 74% on AIME 2024 vs GPT-4o's 12%. DeepSeek-R1 matched o1 at 70% lower cost. Aggregate inference compute is projected to exceed training compute by 118×.

10-100× more tokens per query

Key Milestones

From the Transformer paper to billion-user agents — a new frontier model was released approximately every 17.5 days throughout 2025. Click any milestone to expand.

JUNE 2017
Transformer Paper — "Attention Is All You Need"
Google researchers introduced the self-attention mechanism, replacing recurrence entirely. This single paper underpins every major LLM that exists today.
JUNE 2018
GPT-1 — Generative Pre-Training (117M params)
OpenAI established the paradigm: pre-train a transformer on unlabeled text, then fine-tune on downstream tasks. Small by today's standards, revolutionary in concept.
JUNE 2020
GPT-3 — Few-Shot Learning at Scale (175B params)
Demonstrated that scaling alone could yield emergent capabilities. The first model to spark the "prompt engineering" paradigm — no fine-tuning needed for many tasks.
NOV 2022
ChatGPT Launch — 100M Users in 60 Days
The fastest-growing consumer application in history. Combined GPT-3.5 with RLHF alignment to create the first broadly usable conversational AI.
MAR 2023
GPT-4 — 90th Percentile on the Bar Exam
The first frontier model widely rumored to use an MoE architecture. Claude 1 launched the same month with Constitutional AI. LLaMA 1 leaked and catalyzed the open-source ecosystem.
FEB 2024
Gemini 1.5 Pro — 1 Million Token Context
First production model with a million-token context window, enabling entire codebases and book-length documents in a single prompt.
SEP 2024
OpenAI o1 — The Reasoning Revolution Begins
First model to "think" before answering via test-time compute scaling. Scored 74% on AIME 2024 vs GPT-4o's 12% — a paradigm shift in how inference compute is used.
JAN 2025
DeepSeek-R1 — The "Sputnik Moment"
Open-sourced reasoning matching o1, trained at a fraction of the cost. Triggered a $600B single-day Nvidia market cap loss and challenged assumptions about AI compute requirements.
MAY 2025
Claude 4 — Agentic Coding Goes Mainstream
Claude Code reached general availability, scoring leading results on SWE-bench. Extended thinking allowed dynamic reasoning budgets per query.
AUG 2025
GPT-5 — Unified Multimodal Reasoning
Combined multimodal perception with deep reasoning. 400K context window with intelligent routing between fast and deep thinking modes. OpenAI reached 700M weekly active users.
EARLY 2026
The Frontier Converges
Claude, GPT-5.x, Gemini 3, DeepSeek, and Qwen converge in capability while diverging in specialization. The Agentic AI Foundation governs MCP and interoperability standards. OpenAI surpasses 900M weekly active users.

Where LLMs Are Headed

Three tensions will define the next phase: efficiency vs scale, open vs closed, and capability vs safety. Here are the frontiers being explored.

Architecture

MoE Dominates the Frontier

Over 60% of open-source model releases in late 2025 used MoE. DeepSeek-V3's 671B total / 37B active parameters showed how to get frontier quality at a fraction of the compute cost. NVIDIA's Blackwell delivers 10× MoE inference improvement.

Efficiency

On-Device Models Get Capable

Microsoft Phi-4 (14B) beats larger models on reasoning. Apple runs 3B models with 2-bit quantization on-device. Gartner forecasts organizations will use small domain-specific models 3× more than general-purpose LLMs by 2027.

Ecosystem

Open Source Reaches Frontier

Stanford AI Index 2025 confirmed open models match closed ones on knowledge, math, and science benchmarks. DeepSeek (MIT), Qwen (Apache 2.0), Llama, and Mistral compete at the frontier. But geopolitical bans create new barriers.

Regulation

Three Global Models Emerge

EU AI Act phases in through 2027 with risk-based classification. The US has no comprehensive federal law, creating state-level patchwork. China enforces labeling requirements. No AI lab scored above D in existential safety planning.

Scaling

The Pre-Training Wall is Real

Ilya Sutskever declared internet data "the fossil fuel of AI." But $7.8 trillion is committed to AI infrastructure through 2030. Scaling has shifted to post-training, test-time compute, and tooling — not just bigger pre-training runs.

AGI

Timelines Remain Contested

Dario Amodei targets late 2026/early 2027. Sam Altman says "closer than most think." Jensen Huang: 2029. Yann LeCun: decades. Metaculus median: 2037. OpenAI places the field at Level 2 (Reasoners), pushing into Level 3 (Agents).

Frontier Model Comparison

The current landscape as of early 2026 — a snapshot of the models that define the frontier.

Model            | Organization | Parameters        | Context | Architecture  | Access
GPT-5            | OpenAI       | Undisclosed       | 400K    | MoE (rumored) | Closed
Claude Opus 4.6  | Anthropic    | Undisclosed       | 1M      | Dense         | Closed
Gemini 2.5 Pro   | Google       | Undisclosed       | 2M      | MoE           | Closed
DeepSeek-V3      | DeepSeek     | 671B (37B active) | 128K    | MoE + MLA     | Open
Llama 4 Maverick | Meta         | 400B (17B active) | 1M      | MoE + iRoPE   | Open
Qwen3-235B       | Alibaba      | 235B (22B active) | 128K    | MoE           | Open
Mistral Large 3  | Mistral      | Undisclosed       | 128K    | Dense         | Open

LLM Knowledge Quiz

See how well you absorbed the material. Click an answer to check — no second chances.

1. What is the fundamental training objective of modern LLMs?
Classifying text into categories
Predicting the next token given all preceding tokens
Translating between languages
Compressing text into embeddings
Correct! Causal language modeling — predicting the next token — forces the model to learn grammar, facts, reasoning, and common sense from vast text.
2. Why do modern LLMs use Grouped-Query Attention (GQA) instead of standard multi-head attention?
It improves training speed by 10×
It allows the model to process images
It reduces KV cache memory by sharing key-value heads across query groups
It replaces the feed-forward network
GQA shares K/V projections across groups of query heads, cutting the KV cache by the grouping factor — up to 16× in Llama 3 405B — critical for making long-context inference practical.
3. DeepSeek-V3 has 671 billion total parameters. How many are active per token?
671 billion
175 billion
37 billion
7 billion
DeepSeek-V3 uses a Mixture of Experts architecture with 256 fine-grained experts. Only 8 routed experts plus 1 shared expert activate per token — totaling 37B active parameters.
4. What event in January 2025 was called AI's "Sputnik moment"?
GPT-5 announcement
DeepSeek-R1 open-sourcing reasoning capabilities rivaling o1
Google Gemini reaching 1B users
EU AI Act taking effect
DeepSeek-R1 open-sourced reasoning matching OpenAI's o1 at a fraction of the cost, triggering a $600B Nvidia market cap loss and challenging Western AI compute assumptions.
5. What does the Chinchilla scaling law recommend for compute-optimal training?
1 token per parameter
~20 tokens per parameter
1,000 tokens per parameter
It depends on model architecture
The Chinchilla paper showed ~20 tokens per parameter is compute-optimal. But modern models deliberately over-train (Llama 3 8B used 1,875:1) to reduce inference costs — smaller, well-trained models are cheaper to serve.