How they work, how they're trained, what they can do, and where they're headed — an interactive deep dive into the technology reshaping the world.
Every major LLM — GPT-5, Claude, Gemini, Llama, DeepSeek — is a decoder-only transformer. The architecture has remained fundamentally stable since 2019; what has changed is a series of carefully validated component upgrades that collectively yield massive gains.
Click any layer below to see how it works. Each component in the stack plays a critical role in transforming raw tokens into intelligent predictions.
Text is split into subword tokens via BPE (Byte Pair Encoding), and each token is mapped to a dense vector — Llama 3 405B uses 16,384-dimensional embeddings across a vocabulary of 128,256 tokens. Larger vocabularies improve multilingual and code efficiency. In English, one token averages ~0.75 words.
Type text below to see how a BPE-like tokenizer would split it into subword tokens. Each color represents a different token.
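Under the hood, training a BPE vocabulary is just repeated pair-merging. Here is a minimal sketch on a toy word list (character-level symbols for readability; production tokenizers operate on bytes with learned merge ranks):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():      # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged[tuple(out)] = freq
        corpus = Counter(merged)
    return merges

print(train_bpe(["low", "lower", "lowest", "newest"], 4))
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('s', 't')]
```

Frequent fragments like "low" become single tokens while rare words stay split into pieces, which is exactly the behavior the demo above visualizes.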
Standard multi-head attention gives each head its own Key and Value projections. GQA groups multiple query heads to share K/V projections — Llama 3 405B shares just 8 KV heads across its 128 query heads, shrinking the KV cache 16× versus full multi-head attention with negligible quality loss and making long-context inference practical.
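A minimal NumPy sketch of the grouping mechanic. Sizes are illustrative (8 query heads sharing 2 KV heads rather than Llama's 128/8), and the causal mask is omitted for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Toy GQA: each group of n_q_heads // n_kv_heads query heads
    attends using one shared K/V head. No causal mask, for brevity."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                # query heads per KV head
    q = (x @ Wq).reshape(seq, n_q_heads, d_head)   # (seq, Hq, d_head)
    k = (x @ Wk).reshape(seq, n_kv_heads, d_head)  # (seq, Hkv, d_head)
    v = (x @ Wv).reshape(seq, n_kv_heads, d_head)  # only k, v are cached
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                            # which shared KV head to use
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        out[:, h] = softmax(scores) @ v[:, kv]
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d, seq, Hq, Hkv = 64, 10, 8, 2
x = rng.normal(size=(seq, d))
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d * Hkv // Hq))   # K/V projections are Hkv/Hq as wide
Wv = rng.normal(size=(d, d * Hkv // Hq))
print(grouped_query_attention(x, Wq, Wk, Wv, Hq, Hkv).shape)  # (10, 64)
```

The K/V projections are only `n_kv_heads / n_q_heads` as wide as the query projection; since only K and V are cached during generation, that narrowing is exactly where the cache saving comes from.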
Developed by Tri Dao, Flash Attention is an IO-aware algorithm that computes exact attention while reducing memory from O(n²) to O(n) by processing in tiles without materializing the full attention matrix. Flash Attention 2 & 3 are now standard in every production inference engine, delivering 2-4× speedups.
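The enabling trick is the online softmax: a running max, normalizer, and un-normalized output let each tile be folded in and rescaled, so the full n×n score matrix never exists. A schematic NumPy version for a single query vector (the real algorithm also tiles over queries and fuses everything into one GPU kernel):

```python
import numpy as np

def attention_one_query(q, K, V, tile=4):
    """Online-softmax attention for one query. Processes K/V in tiles,
    keeping a running max `m`, normalizer `l`, and un-normalized output
    `o` -- the full score vector is never materialized at once."""
    d = q.shape[0]
    m, l, o = -np.inf, 0.0, np.zeros_like(V[0])
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                    # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[start:start + tile]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=8)
s = K @ q / np.sqrt(8)
p = np.exp(s - s.max())
exact = (p / p.sum()) @ V
assert np.allclose(attention_one_query(q, K, V), exact)  # matches exact attention
```

Because every rescale multiplies the old accumulators by exp(m − m_new), the result is the same softmax-weighted sum, up to floating-point rounding; nothing is approximated.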
MoE replaces the dense FFN with multiple expert sub-networks, activating only a subset per token via a learned router. DeepSeek-V3 has 671B total but only 37B active parameters (256 experts, 8 routed + 1 shared). Over 60% of open-source models in late 2025 used MoE, and all top-10 models on intelligence leaderboards are MoE architectures.
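A toy top-k router makes the sparsity concrete. The expert shapes, ReLU FFNs, and renormalization here are illustrative simplifications; DeepSeek's actual design adds a shared expert and load-balancing machinery:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, W_router, experts, top_k=2):
    """Toy top-k MoE: route each token to its top_k experts and combine
    their outputs, weighted by renormalized router scores.
    `experts` is a list of (W1, W2) pairs -- tiny two-layer FFNs."""
    probs = softmax(x @ W_router)               # (tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]     # indices of the top_k experts
        w = probs[t, top] / probs[t, top].sum() # renormalize over the selected
        for weight, e in zip(w, top):
            W1, W2 = experts[e]
            h = np.maximum(x[t] @ W1, 0)        # ReLU FFN (real models use SwiGLU)
            out[t] += weight * (h @ W2)
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 4, 5
x = rng.normal(size=(tokens, d))
W_router = rng.normal(size=(d, n_experts))
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]
print(moe_layer(x, W_router, experts).shape)    # (5, 16)
```

Only `top_k` of the experts run per token, which is how a 671B-parameter model can cost 37B parameters' worth of compute per forward pass.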
Reduces weight precision from FP16 to INT8 or INT4, shrinking a 7B model from 14 GB to 7 GB (INT8) or 3.5 GB (INT4) with minimal quality loss. Major approaches: GPTQ (Hessian-informed), AWQ (activation-aware), and the GGUF format family used by llama.cpp (the community standard for local inference). Enables 7B models to run on laptops at 50-100+ tokens/sec.
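The simplest variant, symmetric per-row INT8, fits in a few lines. GPTQ and AWQ differ mainly in how carefully they choose the scales; this sketch picks them naively from each row's absolute max:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row INT8 quantization: one floating-point scale per
    row, integers in [-127, 127]. Calibration-based methods pick scales
    (and rounding) more carefully to minimize output error."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"{w.nbytes} B fp32 -> {q.nbytes + scale.nbytes} B int8, mean |err| = {err:.2e}")
```

INT4 follows the same recipe with 16 levels instead of 255, which is why it needs the smarter scale selection that GPTQ and AWQ provide.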
During generation, each new token must attend to all previous tokens. The KV cache stores Key/Value tensors for reuse — for Llama 3 70B at 128K context, this alone requires ~40 GB. PagedAttention (vLLM) borrows virtual memory paging to eliminate cache fragmentation, achieving <4% waste vs 60-80% in naive allocation.
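The ~40 GB figure falls out of back-of-envelope arithmetic with Llama 3 70B's published configuration (80 layers, 8 KV heads, head dimension 128, FP16):

```python
# Back-of-envelope KV-cache size for Llama 3 70B at 128K context
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                      # FP16
context = 128 * 1024                    # 128K tokens

per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # x2: K and V
total_gb = per_token * context / 1024**3
print(f"{per_token / 1024:.0f} KiB/token, {total_gb:.0f} GiB at 128K context")
# -> 320 KiB/token, 40 GiB at 128K context
```

Note the 8 KV heads in that formula: without GQA the same calculation would use 64 query heads and the cache would be 8× larger.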
Training an LLM is a multi-stage pipeline: pre-train on vast internet text, then refine through supervised fine-tuning, and align with human preferences via RLHF or DPO.
Next-token prediction on trillions of tokens. The model learns grammar, facts, reasoning patterns, and common sense from vast text corpora.
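The objective itself is ordinary cross-entropy over the vocabulary at every position. A sketch with random stand-in logits:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Pre-training objective: average cross-entropy of the model's
    next-token distribution against the token that actually came next."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab, seq = 100, 8
logits = rng.normal(size=(seq, vocab))      # model outputs at each position
targets = rng.integers(0, vocab, size=seq)  # actual next token at each position
print(next_token_loss(logits, targets))     # roughly ln(100) ~ 4.6 for random logits
```

Everything the model learns during pre-training comes from driving this one number down across trillions of positions.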
Train on instruction-response pairs. In OpenAI's InstructGPT work, outputs from a 1.3B aligned model were preferred over those of the 175B GPT-3 — alignment matters more than raw scale.
Align with human preferences. RLHF trains a reward model on rankings then optimizes via PPO. DPO simplifies this into a single classification loss.
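The DPO objective in miniature (the log-probabilities, β, and example values below are made up for illustration):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO sketch: a logistic loss on how much more the policy prefers the
    chosen response over the rejected one, relative to a frozen reference
    model. No reward model or PPO rollouts needed."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Policy already prefers the chosen answer a bit more than the reference does:
print(dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
               ref_chosen=-11.0, ref_rejected=-13.0))  # loss below ln 2
```

Gradient descent on this loss pushes the policy to widen the chosen-vs-rejected gap beyond the reference model's, which is the same preference signal RLHF extracts, collapsed into one classification step.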
Constitutional AI, red-team adversarial testing, and capability evaluations. Anthropic's ASL framework scales precautions with model capability.
DeepMind's 2022 Chinchilla paper established that compute-optimal training needs ~20 tokens per parameter. But the industry deliberately moved beyond this: modern models are massively over-trained to reduce inference cost. Llama 3 8B was trained at 1,875:1 tokens-to-parameters — 94× the Chinchilla-optimal ratio.
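The 94× figure is simple arithmetic:

```python
# Chinchilla rule of thumb vs. Llama 3 8B's actual over-training
params = 8e9
chinchilla_tokens = 20 * params          # compute-optimal: ~20 tokens/param
actual_tokens = 15e12                    # Llama 3 trained on ~15T tokens
print(f"optimal: {chinchilla_tokens/1e12:.2f}T tokens, "
      f"actual: {actual_tokens/1e12:.0f}T "
      f"({actual_tokens/params:.0f}:1, {actual_tokens/chinchilla_tokens:.0f}x optimal)")
# -> optimal: 0.16T tokens, actual: 15T (1875:1, 94x optimal)
```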
Training costs have grown ~2.4× per year. GPT-4 cost an estimated $78-100M. Llama 3.1 405B cost ~$170-191M. DeepSeek-V3 reported just $5.6M in compute for its final training run — a dramatic efficiency claim that shocked the industry in January 2025.
Epoch AI estimates quality public text at roughly 300 trillion tokens. With aggressive over-training ratios, this may be functionally exhausted by 2027. The response: synthetic data generation, licensing deals (Reddit-Google, News Corp-OpenAI), and efficiency improvements. FineWeb provides 15 trillion filtered tokens from 96 Common Crawl snapshots.
From writing production code to discovering drug targets, LLMs have moved from novelty to infrastructure. Here's how they're being used across industries.
GitHub Copilot has 4.7M paid subscribers. Claude Opus 4 scored 72.5% on SWE-bench Verified. The Cursor IDE reached a $29.3B valuation. 85% of developers now use AI tools regularly.
AlphaFold 3 predicts biomolecular structures and interactions with a 50%+ accuracy improvement; its predecessor earned the 2024 Nobel Prize in Chemistry. AI models achieved gold-medal performance at the IMO, solving 5 of 6 problems.
FDA authorized 1,250+ AI medical devices by mid-2025. Ambient AI scribes save physicians 2+ hours daily. ~80% of devices are in radiology.
Harvey AI reached $190M ARR, used by most AmLaw 100 firms. CoCounsel won the federal judiciary contract serving 25,000+ legal professionals.
MCP reached 97M monthly SDK downloads. Claude Computer Use, OpenAI Operator, and Google Mariner enable autonomous task execution. Anthropic has donated MCP to the Agentic AI Foundation.
Test-time compute scaling: o1 scored 74% on AIME 2024 vs GPT-4o's 12%. DeepSeek-R1 matched o1 at 70% lower cost. Inference compute is projected to exceed training compute by 118×.
From the Transformer paper to billion-user agents — a new frontier model was released approximately every 17.5 days throughout 2025. Click any milestone to expand.
Three tensions will define the next phase: efficiency vs scale, open vs closed, and capability vs safety. Here are the frontiers being explored.
Over 60% of open-source model releases in late 2025 used MoE. DeepSeek-V3's 671B total / 37B active split showed how to get frontier quality at a fraction of the compute cost. NVIDIA's Blackwell delivers a 10× improvement in MoE inference.
Microsoft Phi-4 (14B) beats larger models on reasoning. Apple runs 3B models with 2-bit quantization on-device. Gartner forecasts organizations will use small domain-specific models 3× more than general-purpose LLMs by 2027.
Stanford AI Index 2025 confirmed open models match closed ones on knowledge, math, and science benchmarks. DeepSeek (MIT), Qwen (Apache 2.0), Llama, and Mistral compete at the frontier. But geopolitical bans create new barriers.
EU AI Act phases in through 2027 with risk-based classification. The US has no comprehensive federal law, creating state-level patchwork. China enforces labeling requirements. No AI lab scored above D in existential safety planning.
Ilya Sutskever declared internet data "the fossil fuel of AI." But $7.8 trillion is committed to AI infrastructure through 2030. Scaling has shifted to post-training, test-time compute, and tooling — not just bigger pre-training runs.
Dario Amodei targets late 2026/early 2027. Sam Altman says "closer than most think." Jensen Huang: 2029. Yann LeCun: decades. Metaculus median: 2037. OpenAI places the field at Level 2 (Reasoners), pushing into Level 3 (Agents).
The current landscape as of early 2026 — a snapshot of the models that define the frontier.
| Model | Organization | Parameters | Context | Architecture | Access |
|---|---|---|---|---|---|
| GPT-5 | OpenAI | Undisclosed | 400K | MoE (rumored) | Closed |
| Claude Opus 4.6 | Anthropic | Undisclosed | 1M | Dense | Closed |
| Gemini 2.5 Pro | Google | Undisclosed | 2M | MoE | Closed |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | 128K | MoE + MLA | Open |
| Llama 4 Maverick | Meta | 400B (17B active) | 1M | MoE + iRoPE | Open |
| Qwen3-235B | Alibaba | 235B (22B active) | 128K | MoE | Open |
| Mistral Large 3 | Mistral | Undisclosed | 128K | Dense | Open |
See how well you absorbed the material. Click an answer to check — no second chances.