An Interactive Exploration

How Neural Networks Think

Dive into the fascinating mechanics behind artificial intelligence. Touch, play, and experiment with the building blocks that power modern AI.

Begin the journey ↓

The Artificial Neuron

Inspired by biology, an artificial neuron receives inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function. Adjust the sliders below to see it in action.

Weighted Sum: 0.00
Output: 0.00
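The neuron described above fits in a few lines of Python. This is a minimal sketch, not the page's own demo code; the input values, weights, and bias are arbitrary examples, and sigmoid is used as the activation:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum: multiply each input by its weight, then add the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation: sigmoid squashes z into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Two inputs, two weights, one bias — all illustrative values
print(neuron([0.5, -0.2], [0.8, 0.3], 0.1))
```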

Activation Functions

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns beyond simple linear relationships. Hover over the graphs to explore each function.

Sigmoid

σ(x) = 1 / (1 + e⁻ˣ)

Squashes values between 0 and 1. Historically popular for binary classification, but suffers from vanishing gradients at extreme values.

ReLU

f(x) = max(0, x)

The most widely used activation today. Computationally efficient and avoids vanishing gradients for positive inputs. Neurons can "die" if their weights push every input to a negative pre-activation, producing zero output and zero gradient with no way to recover.

Tanh

f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Outputs between -1 and 1, zero-centered unlike sigmoid. Often used in recurrent networks. Still suffers from vanishing gradients at saturation.

Leaky ReLU

f(x) = max(0.01x, x)

Fixes the "dying ReLU" problem by allowing a small negative slope. Ensures neurons always have a non-zero gradient, keeping them alive during training.

Network Playground

Watch a neural network learn in real time. Click "Train" to see how the network adjusts its weights to classify the data points. Blue and orange represent two different classes.

Epoch 0
Loss
Accuracy
Architecture: 2 → 8 → 8 → 1

How Networks Learn

Backpropagation is the engine of learning. It calculates how much each weight contributed to the error, then adjusts them to reduce it. Click each step to see the process animated.

Step 01

Forward Pass

Input data flows through the network, layer by layer. Each neuron computes its weighted sum and applies the activation function, producing an output prediction.

Step 02

Calculate Loss

The network's prediction is compared to the true answer using a loss function (like mean squared error). This quantifies how wrong the network is.

Step 03

Backward Pass

Using the chain rule of calculus, gradients flow backward through the network. Each weight learns its share of responsibility for the error.

Step 04

Update Weights

Weights are nudged in the direction that reduces the loss, proportional to the learning rate. This cycle repeats thousands of times until the network converges.
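The four-step cycle above can be sketched for the simplest possible case: a single linear neuron y = w·x + b trained with squared error. The learning rate, epoch count, and data are illustrative, and the gradient here is worked out by hand rather than by an autodiff library:

```python
def train(data, lr=0.1, epochs=100):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            pred = w * x + b              # 1. forward pass
            loss = (pred - target) ** 2   # 2. calculate loss (squared error)
            grad = 2 * (pred - target)    # 3. backward pass: d(loss)/d(pred)
            w -= lr * grad * x            # 4. update weights (chain rule: d(pred)/dw = x)
            b -= lr * grad                #    and bias (d(pred)/db = 1)
    return w, b

# Learn y = 2x + 1 from three points
w, b = train([(0, 1), (1, 3), (2, 5)])
```

Repeating the cycle drives w toward 2 and b toward 1 — the same nudge-and-repeat loop, just with one weight instead of millions.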

Gradient Descent

Click on the landscape to place a ball and watch it roll downhill toward the minimum. Gradient descent works the same way — always moving in the direction of steepest descent. Adjust the learning rate to see its effect.

Imagine standing on a mountain in thick fog. You can only feel the slope under your feet. Gradient descent works the same way: at each step, it measures the local slope (gradient) and moves downhill.

Too small: The ball barely moves. Training takes forever.

Too large: The ball overshoots, bouncing past the minimum.

Just right: Smooth convergence to the lowest point.
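The three regimes above can be sketched with a one-dimensional bowl-shaped loss, f(x) = x², whose minimum sits at x = 0. The step count and learning rates are illustrative:

```python
def descend(x, lr, steps=20):
    for _ in range(steps):
        grad = 2 * x     # local slope of f(x) = x^2
        x -= lr * grad   # step downhill, scaled by the learning rate
    return x

start = 5.0
print(descend(start, lr=0.01))  # too small: barely moves
print(descend(start, lr=1.1))   # too large: overshoots and diverges
print(descend(start, lr=0.3))   # just right: converges near 0
```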

Architectures That Changed Everything

Different problems demand different structures. Here are the architectures that shaped modern AI.

Feedforward

The Foundation · 1958

The simplest architecture: data flows in one direction from input to output. Each layer transforms the data, extracting increasingly abstract features. The building block for all others.

Classification · Regression · Tabular Data

CNN

Convolutional · 1989

Applies sliding filters across input data to detect spatial patterns. Convolution layers learn to recognize edges, textures, and objects. Revolutionized computer vision and image understanding.

Image Recognition · Object Detection · Medical Imaging
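The sliding-filter idea can be shown in one dimension. The kernel [-1, 1] here is a hypothetical edge detector, not taken from any particular network; real CNNs learn their kernels from data:

```python
def conv1d(signal, kernel):
    # Slide the kernel across the signal, taking a dot product at each position
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference filter responds strongly where the signal jumps
print(conv1d([0, 0, 1, 1, 0], [-1, 1]))  # → [0, 1, 0, -1]
```

The output spikes at the rising and falling edges — the same mechanism, in two dimensions with learned kernels, that lets CNNs detect edges and textures in images.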

RNN / LSTM

Recurrent · 1997

Processes sequences by maintaining hidden state — a form of memory. LSTMs add gating mechanisms to control information flow, solving the vanishing gradient problem in long sequences.

Time Series · Speech Recognition · Music Generation
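The hidden-state idea can be sketched as a single recurrent step. The weights here are arbitrary placeholders (real RNNs learn them), and the gating machinery of an LSTM is omitted:

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # New hidden state mixes the previous state (memory) with the
    # current input, squashed by tanh into (-1, 1)
    return math.tanh(w_h * h + w_x * x)

h = 0.0                        # start with empty memory
for x in [1.0, 0.5, -0.3]:     # process a sequence, carrying state forward
    h = rnn_step(h, x)
```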

Transformer

Attention Is All You Need · 2017

Uses self-attention to weigh relationships between all elements simultaneously, regardless of distance. No recurrence needed. Powers GPT, BERT, and virtually all modern large language models.

Language Models · Translation · Code Generation · Vision
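The self-attention idea can be sketched in a few lines, assuming NumPy. For clarity the query/key/value projections are the identity; real Transformers learn these projections:

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention over a sequence X of shape (n, d)
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # every pair compared, regardless of distance
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X                               # each output is a weighted mix of all inputs

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # a toy 3-element sequence
out = self_attention(X)
```

Because every element attends to every other in one step, the whole computation is a pair of matrix multiplies — which is what makes Transformers so parallelizable compared to recurrence.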

A Brief History

From mathematical curiosity to world-changing technology — the key milestones in neural network history.

1943

McCulloch-Pitts Neuron

Warren McCulloch and Walter Pitts create the first mathematical model of an artificial neuron, proving that simple units could compute logical functions.

1958

The Perceptron

Frank Rosenblatt builds the Mark I Perceptron, the first hardware implementation. The New York Times calls it the embryo of a computer that "will be able to walk, talk, see, write, reproduce itself."

1986

Backpropagation

Rumelhart, Hinton, and Williams popularize the backpropagation algorithm, finally making it practical to train multi-layer networks.

2012

AlexNet & Deep Learning

Alex Krizhevsky's deep CNN crushes the ImageNet competition, reducing error rates by half. The deep learning revolution begins. GPUs become essential.

2017

Attention Is All You Need

Google researchers introduce the Transformer architecture. Self-attention replaces recurrence, enabling massive parallelism and spawning GPT, BERT, and the era of large language models.

2020s

The Age of Foundation Models

Models with billions of parameters demonstrate emergent abilities: reasoning, code generation, creative writing. AI becomes a general-purpose tool reshaping every industry.