The Complete Guide · March 2026

Understanding Large Language Models

How they work, how they're trained, what they can do, and where they're headed — an interactive deep dive into the technology reshaping the world.

900
Million Weekly Users
15
Trillion Training Tokens
2
Million Token Context
Scroll to explore

How LLMs Work

Every major LLM — GPT-5, Claude, Gemini, Llama, DeepSeek — is a decoder-only transformer. The architecture has remained fundamentally stable since 2019; what has changed is a steady accumulation of carefully validated component upgrades that collectively yield massive gains.

The Transformer, Explored

Click any layer below to see how it works. Each component in the stack plays a critical role in transforming raw tokens into intelligent predictions.

T
Tokenization + Embedding
P
Positional Encoding (RoPE)
A
Multi-Head Self-Attention
F
Feed-Forward Network (SwiGLU)
N
RMSNorm + Residual Connections
O
Output Projection → Next Token
Text is split into subword tokens via BPE (Byte Pair Encoding). Each token is mapped to a dense vector — Llama 3 405B uses 16,384-dimensional embeddings across a vocabulary of 128,256 tokens. Larger vocabularies improve multilingual and code efficiency. In English, one token averages ~0.75 words (roughly 1.3 tokens per word).

Try It: Simulated Tokenization

Type text below to see how a BPE-like tokenizer would split it into subword tokens. Each color represents a different token.
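If you want the mechanism rather than the widget, the heart of BPE training fits in a few lines. This is a toy sketch in plain Python — real tokenizers operate on bytes, apply pre-tokenization rules, and train on vastly larger corpora:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (illustrative only)."""
    # Start with each word as a sequence of characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent substrings ("lo", "low") become single tokens after a few merges.
print(train_bpe("low lower lowest low low", num_merges=3))
```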


Key Architectural Innovations

Grouped-Query Attention

Standard multi-head attention gives each head its own Key and Value projections. GQA groups multiple query heads to share K/V projections — Llama 3 405B uses 8 KV heads shared across 128 query heads. This shrinks the KV cache by the grouping factor (16× in that configuration) with negligible quality loss, making long-context inference practical.
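A minimal NumPy sketch of the mechanism — head counts and dimensions here are illustrative toys, far smaller than Llama 3's, and causal masking is omitted for brevity:

```python
import numpy as np

# Grouped-query attention: n_q query heads share n_kv K/V heads.
n_q, n_kv, d, seq = 8, 2, 16, 5       # group size = n_q / n_kv = 4
q = np.random.randn(n_q, seq, d)
k = np.random.randn(n_kv, seq, d)
v = np.random.randn(n_kv, seq, d)

group = n_q // n_kv                   # query heads per KV head
out = np.empty_like(q)
for h in range(n_q):
    kv = h // group                   # each query head reads its group's K/V
    scores = q[h] @ k[kv].T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)     # softmax over keys
    out[h] = w @ v[kv]

# Only n_kv K/V tensors are cached instead of n_q: a 4x saving here,
# and 16x for Llama 3 405B's 128 query / 8 KV head layout.
```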

Flash Attention

Developed by Tri Dao, Flash Attention is an IO-aware algorithm that computes exact attention while reducing memory from O(n²) to O(n) by processing in tiles, never materializing the full attention matrix. Flash Attention 2 and 3 are now standard across production inference engines, delivering 2-4× speedups.
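The enabling trick is the online softmax: attention can be accumulated tile by tile while carrying only a running maximum, denominator, and weighted sum. A NumPy sketch for a single query — real Flash Attention fuses this per query tile inside a GPU kernel:

```python
import numpy as np

def tiled_attention_row(q, K, V, tile=64):
    """Exact attention for one query without the full n x n score matrix."""
    d = q.shape[0]
    m = -np.inf                  # running max of scores (numerical stability)
    denom = 0.0                  # running softmax denominator
    acc = np.zeros_like(V[0])    # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Kt @ q / np.sqrt(d)            # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale earlier partial results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ Vt
        m = m_new
    return acc / denom

# Matches a reference full-matrix softmax attention exactly.
n, d = 1024, 64
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(tiled_attention_row(q, K, V), w @ V)
```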

Mixture of Experts

MoE replaces the dense FFN with multiple expert sub-networks, activating only a subset per token via a learned router. DeepSeek-V3 has 671B total but only 37B active parameters (256 experts, 8 routed + 1 shared). Over 60% of open-source models in late 2025 used MoE, and all top-10 models on intelligence leaderboards are MoE architectures.
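Stripped to its essentials, the routing looks like the sketch below — a single-token NumPy toy with invented sizes, not DeepSeek's actual router (which adds shared experts, load balancing, and fine-grained segmentation):

```python
import numpy as np

def moe_ffn(x, experts, router_w, k=2):
    """Route one token to the top-k of E expert FFNs and mix their
    outputs by renormalized router scores."""
    logits = router_w @ x                    # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    topk = np.argsort(probs)[-k:]            # indices of the k best experts
    gate = probs[topk] / probs[topk].sum()   # renormalize over chosen experts
    # Only k experts execute — active params are a small slice of the total.
    return sum(g * experts[e](x) for g, e in zip(gate, topk))

d, E = 16, 8
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(E)]
experts = [lambda x, W=W: np.maximum(W @ x, 0.0) for W in Ws]  # tiny ReLU FFNs
router_w = rng.standard_normal((E, d)) / np.sqrt(d)
y = moe_ffn(rng.standard_normal(d), experts, router_w, k=2)
```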

Quantization

Reduces weight precision from FP16 to INT8 or INT4, shrinking a 7B model from 14 GB → 3.5 GB with minimal quality loss. Major methods: GPTQ (Hessian-informed), AWQ (activation-aware), and GGUF (the community standard for llama.cpp). Enables 7B models to run on laptops at 50-100+ tokens/sec.
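The simplest form of the idea — symmetric per-tensor INT8 in NumPy. GPTQ, AWQ, and GGUF go much further, with per-group scales, calibration data, and lower bit-widths:

```python
import numpy as np

def quantize_int8(w):
    """Store int8 weights plus one FP scale; reconstruct approximately."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one toy weight matrix
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean |error| = {err:.4f}")
# Production methods place the quantization error where it matters least,
# which is why INT4 can still preserve quality.
```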

KV Cache & PagedAttention

During generation, each new token must attend to all previous tokens. The KV cache stores Key/Value tensors for reuse — for Llama 3 70B at 128K context, this alone requires ~40 GB. PagedAttention (vLLM) borrows virtual memory paging to eliminate cache fragmentation, achieving <4% waste vs 60-80% in naive allocation.
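The ~40 GB figure follows directly from the cache geometry. A back-of-envelope check, using the commonly reported Llama 3 70B shape:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
layers, kv_heads, head_dim = 80, 8, 128   # Llama 3 70B geometry (with GQA)
bytes_per_value = 2                        # FP16
tokens = 128_000
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(f"{per_token / 1024:.0f} KB per token")                       # 320 KB
print(f"{per_token * tokens / 2**30:.0f} GB at {tokens:,} tokens")  # ~39 GB
# One long request pins ~40 GB, and naive contiguous allocation wastes
# most of it — hence PagedAttention's block-table approach.
```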

From Raw Text to Useful Assistants

Training an LLM is a multi-stage pipeline: pre-train on vast internet text, then refine through supervised fine-tuning, and align with human preferences via RLHF or DPO.

STAGE 01

Pre-Training

Next-token prediction on trillions of tokens — the objective is sketched in code below. The model learns grammar, facts, reasoning patterns, and common sense from vast text corpora.

15T
tokens (Llama 3)
16K
H100 GPUs
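The entire Stage 01 objective is a few lines of math: cross-entropy between each position's prediction and the next token. A NumPy sketch on toy logits — a real run differs only in scale:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Causal LM objective: mean cross-entropy of each position's
    prediction against the *next* token in the sequence."""
    preds, targets = logits[:-1], token_ids[1:]      # shift by one position
    z = preds - preds.max(axis=-1, keepdims=True)    # stabilize softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab, seq = 100, 12
rng = np.random.default_rng(0)
loss = next_token_loss(rng.standard_normal((seq, vocab)),
                       rng.integers(0, vocab, size=seq))
print(loss)   # roughly ln(100) ≈ 4.6 for random (untrained) logits
```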
STAGE 02

Supervised Fine-Tuning

Train on instruction-response pairs. A 1.3B model fine-tuned this way was preferred by human raters over the 175B GPT-3 — alignment matters more than raw scale. Adapter methods like LoRA (sketched below) make this stage cheap.

25M+
examples
<1%
params (LoRA)
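The '<1% params' figure comes from low-rank adaptation: the pretrained weight stays frozen and only a small low-rank update trains. A NumPy sketch with illustrative sizes:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA: frozen pretrained weight W plus a learned low-rank update
    B @ A, so only r * (d_in + d_out) parameters train."""
    r = A.shape[0]
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

d_in, d_out, r = 4096, 4096, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen
A = rng.standard_normal((r, d_in)) * 0.01                # trainable
B = np.zeros((d_out, r))                                  # trainable, init 0
y = lora_forward(rng.standard_normal(d_in), W, A, B)

frac = (A.size + B.size) / W.size
print(f"trainable fraction: {frac:.2%}")   # ~0.39% — the '<1%' in the card
```

Initializing B to zero means the adapted model starts out identical to the base model; fine-tuning only gradually moves it away.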
STAGE 03

RLHF / DPO

Align with human preferences. RLHF trains a reward model on human rankings, then optimizes the policy against it via PPO. DPO collapses this into a single classification-style loss, sketched below.

$1-10
per annotation
10×
cheaper (RLAIF)
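DPO's single loss is a logistic loss on how much more the policy (versus a frozen reference model) prefers the chosen response over the rejected one. A sketch with invented log-probability values:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are total log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# The loss falls as the policy raises the chosen response's likelihood,
# relative to the reference, more than the rejected one's.
print(dpo_loss(logp_w=-52.0, logp_l=-60.0,
               ref_logp_w=-55.0, ref_logp_l=-58.0))   # ≈ 0.47
```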
STAGE 04

Safety & Red-Teaming

Constitutional AI, red-team adversarial testing, and capability evaluations. Anthropic's ASL framework scales precautions with model capability.

ASL 1-5
safety levels
Grade D
best safety score

The Scaling Laws Paradigm

DeepMind's 2022 Chinchilla paper established that compute-optimal training needs ~20 tokens per parameter. But the industry deliberately moved beyond this: modern models are massively over-trained to reduce inference cost. Llama 3 8B was trained at 1,875:1 tokens-to-parameters — 94× the Chinchilla-optimal ratio.
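The arithmetic behind that 94× figure, spelled out in Python:

```python
# Chinchilla-optimal: ~20 training tokens per parameter.
params = 8e9                 # Llama 3 8B
tokens = 15e12               # its actual training budget
ratio = tokens / params      # 1,875 tokens per parameter
print(f"{ratio:,.0f}:1 — {ratio / 20:.0f}x the ~20:1 Chinchilla optimum")
# Over-training burns extra training compute to buy a smaller model
# that is cheaper to serve — the trade the industry chose.
```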

Training costs have grown ~2.4× per year. GPT-4 cost an estimated $78-100M. Llama 3.1 405B cost ~$170-191M. DeepSeek-V3 reported just $5.6M in training compute — an efficiency claim that, together with R1's release, shocked the industry in January 2025.

The Data Wall

Epoch AI estimates quality public text at roughly 300 trillion tokens. With aggressive over-training ratios, this may be functionally exhausted by 2027. The response: synthetic data generation, licensing deals (Reddit-Google, News Corp-OpenAI), and efficiency improvements. FineWeb provides 15 trillion filtered tokens from 96 Common Crawl snapshots.

What LLMs Can Do

From writing production code to discovering drug targets, LLMs have moved from novelty to infrastructure. Here's how they're being used across industries.

⌨️

Code Generation

GitHub Copilot has 4.7M paid subscribers. Claude Code scored 72.5% on SWE-bench. Cursor IDE reached $29.3B valuation. 85% of developers now use AI tools regularly.

77-81% SWE-bench top scores
🔬

Scientific Discovery

AlphaFold's protein-structure prediction earned a Nobel Prize; AlphaFold 3 extends it to biomolecular interactions with a 50%+ accuracy improvement. AI achieved unofficial IMO gold medals, solving 5 of 6 problems.

98.5% human proteome covered
🏥

Healthcare

FDA authorized 1,250+ AI medical devices by mid-2025. Ambient AI scribes save physicians 2+ hours daily. ~80% of devices are in radiology.

1,250+ FDA-authorized devices
⚖️

Legal

Harvey AI reached $190M ARR, used by most AmLaw 100 firms. CoCounsel won the federal judiciary contract serving 25,000+ legal professionals.

$190M ARR (Harvey AI)
🤖

Agentic AI

MCP reached 97M monthly SDK downloads. Claude Computer Use, OpenAI Operator, and Google Mariner enable autonomous task execution. MCP itself was donated to the Agentic AI Foundation.

97M monthly MCP downloads
🧠

Reasoning Models

Test-time compute scaling: o1 scored 74% on AIME 2024 vs GPT-4o's 12%. DeepSeek-R1 matched o1 at 70% lower cost. Aggregate inference compute is projected to exceed training compute by 118×.

10-100× more tokens per query

Key Milestones

From the Transformer paper to billion-user agents — a new frontier model was released approximately every 17.5 days throughout 2025. Click any milestone to expand.

JUNE 2017
Transformer Paper — "Attention Is All You Need"
Google researchers introduced the self-attention mechanism, replacing recurrence entirely. This single paper underpins every major LLM that exists today.
JUNE 2018
GPT-1 — Generative Pre-Training (117M params)
OpenAI established the paradigm: pre-train a transformer on unlabeled text, then fine-tune on downstream tasks. Small by today's standards, revolutionary in concept.
JUNE 2020
GPT-3 — Few-Shot Learning at Scale (175B params)
Demonstrated that scaling alone could yield emergent capabilities. The first model to spark the "prompt engineering" paradigm — no fine-tuning needed for many tasks.
NOV 2022
ChatGPT Launch — 100M Users in 60 Days
The fastest-growing consumer application in history. Combined GPT-3.5 with RLHF alignment to create the first broadly usable conversational AI.
MAR 2023
GPT-4 — 90th Percentile on the Bar Exam
The first frontier model widely rumored to use an MoE architecture. Claude 1 launched the same month with Constitutional AI. LLaMA 1 leaked and catalyzed the open-source ecosystem.
FEB 2024
Gemini 1.5 Pro — 1 Million Token Context
First production model with a million-token context window, enabling entire codebases and book-length documents in a single prompt.
SEP 2024
OpenAI o1 — The Reasoning Revolution Begins
First model to "think" before answering via test-time compute scaling. Scored 74% on AIME 2024 vs GPT-4o's 12% — a paradigm shift in how inference compute is used.
JAN 2025
DeepSeek-R1 — The "Sputnik Moment"
Open-sourced reasoning matching o1, trained at a fraction of the cost. Triggered a $600B single-day Nvidia market cap loss and challenged assumptions about AI compute requirements.
MAY 2025
Claude 4 — Agentic Coding Goes Mainstream
Claude Code reached general availability, scoring leading results on SWE-bench. Extended thinking allowed dynamic reasoning budgets per query.
AUG 2025
GPT-5 — Unified Multimodal Reasoning
Combined multimodal perception with deep reasoning. 400K context window with intelligent routing between fast and deep thinking modes. OpenAI reached 700M weekly active users.
EARLY 2026
The Frontier Converges
Claude, GPT-5.x, Gemini 3, DeepSeek, and Qwen converge in capability while diverging in specialization. The Agentic AI Foundation governs MCP and interoperability standards. OpenAI surpasses 900M weekly active users.

Where LLMs Are Headed

Three tensions will define the next phase: efficiency vs scale, open vs closed, and capability vs safety. Here are the frontiers being explored.

Architecture

MoE Dominates the Frontier

Over 60% of open-source model releases in late 2025 used MoE. DeepSeek-V3's 671B total / 37B active parameters showed how to get frontier quality at a fraction of the compute cost. NVIDIA's Blackwell delivers 10× MoE inference improvement.

Efficiency

On-Device Models Get Capable

Microsoft Phi-4 (14B) beats larger models on reasoning. Apple runs 3B models with 2-bit quantization on-device. Gartner forecasts organizations will use small domain-specific models 3× more than general-purpose LLMs by 2027.

Ecosystem

Open Source Reaches Frontier

Stanford AI Index 2025 confirmed open models match closed ones on knowledge, math, and science benchmarks. DeepSeek (MIT), Qwen (Apache 2.0), Llama, and Mistral compete at the frontier. But geopolitical bans create new barriers.

Regulation

Three Global Models Emerge

EU AI Act phases in through 2027 with risk-based classification. The US has no comprehensive federal law, creating state-level patchwork. China enforces labeling requirements. No AI lab scored above D in existential safety planning.

Scaling

The Pre-Training Wall is Real

Ilya Sutskever declared internet data "the fossil fuel of AI." But $7.8 trillion is committed to AI infrastructure through 2030. Scaling has shifted to post-training, test-time compute, and tooling — not just bigger pre-training runs.

AGI

Timelines Remain Contested

Dario Amodei targets late 2026/early 2027. Sam Altman says "closer than most think." Jensen Huang: 2029. Yann LeCun: decades. Metaculus median: 2037. OpenAI places the field at Level 2 (Reasoners), pushing into Level 3 (Agents).

Frontier Model Comparison

The current landscape as of early 2026 — a snapshot of the models that define the frontier.

Model            | Organization | Parameters        | Context | Architecture  | Access
GPT-5            | OpenAI       | Undisclosed       | 400K    | MoE (rumored) | Closed
Claude Opus 4.6  | Anthropic    | Undisclosed       | 1M      | Dense         | Closed
Gemini 2.5 Pro   | Google       | Undisclosed       | 2M      | MoE           | Closed
DeepSeek-V3      | DeepSeek     | 671B (37B active) | 128K    | MoE + MLA     | Open
Llama 4 Maverick | Meta         | 400B (17B active) | 1M      | MoE + iRoPE   | Open
Qwen3-235B       | Alibaba      | 235B (22B active) | 128K    | MoE           | Open
Mistral Large 3  | Mistral      | Undisclosed       | 128K    | Dense         | Open

LLM Knowledge Quiz

See how well you absorbed the material. Click an answer to check — no second chances.

1. What is the fundamental training objective of modern LLMs?
Classifying text into categories
Predicting the next token given all preceding tokens
Translating between languages
Compressing text into embeddings
Correct! Causal language modeling — predicting the next token — forces the model to learn grammar, facts, reasoning, and common sense from vast text.
2. Why do modern LLMs use Grouped-Query Attention (GQA) instead of standard multi-head attention?
It improves training speed by 10×
It allows the model to process images
It reduces KV cache memory by sharing key-value heads across query groups
It replaces the feed-forward network
GQA shares K/V projections across groups of query heads, cutting the KV cache by the grouping factor — up to 16× in Llama 3 405B — critical for making long-context inference practical.
3. DeepSeek-V3 has 671 billion total parameters. How many are active per token?
671 billion
175 billion
37 billion
7 billion
DeepSeek-V3 uses a Mixture of Experts architecture with 256 fine-grained experts. Only 8 routed experts plus 1 shared expert activate per token — totaling 37B active parameters.
4. What event in January 2025 was called AI's "Sputnik moment"?
GPT-5 announcement
DeepSeek-R1 open-sourcing reasoning capabilities rivaling o1
Google Gemini reaching 1B users
EU AI Act taking effect
DeepSeek-R1 open-sourced reasoning matching OpenAI's o1 at a fraction of the cost, triggering a $600B Nvidia market cap loss and challenging Western AI compute assumptions.
5. What does the Chinchilla scaling law recommend for compute-optimal training?
1 token per parameter
~20 tokens per parameter
1,000 tokens per parameter
It depends on model architecture
The Chinchilla paper showed ~20 tokens per parameter is compute-optimal. But modern models deliberately over-train (Llama 3 8B used 1,875:1) to reduce inference costs — smaller, well-trained models are cheaper to serve.