Track 0 · Foundations · Lesson 3

Neural networks in one page

After this lesson you can explain what a neuron and a layer are, why the nonlinear activation is the whole point, and what "backpropagation" buys you — without any calculus.

Level: beginner · Read time: ~9 min · Prerequisites: How models learn

The house-price model was a straight line: w × size + b. Straight lines are limited — most real relationships bend. A neural network is how we build models that can bend arbitrarily, while still being trainable by the exact gradient-descent loop from the last lesson. The trick is to stack many simple pieces.

The piece: a neuron

A single neuron does two things. First it takes a weighted sum of its inputs and adds a bias — that's just the linear model again: z = w₁x₁ + w₂x₂ + … + b. Then it passes that sum through a nonlinear activation function. The most common one is ReLU ("rectified linear unit"), which is almost insultingly simple: if the number is negative, output 0; otherwise output the number unchanged.

The weights and bias are parameters — the knobs gradient descent will tune. A neuron, then, is a tiny learnable function with a kink in it.
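To make the two steps concrete, here is a minimal sketch of one neuron in plain Python. The function names (relu, neuron) and all the numbers are made up for illustration; nothing here comes from a particular library.

```python
def relu(z):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Step 1: weighted sum plus bias. Step 2: pass the sum through ReLU."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

# Illustrative values only: two inputs, two weights, one bias.
print(neuron([1.0, 2.0], [0.5, -1.0], 0.25))  # z = 0.5 - 2.0 + 0.25 = -1.25, so ReLU gives 0.0
print(neuron([1.0, 2.0], [0.5, 1.0], 0.25))   # z = 0.5 + 2.0 + 0.25 = 2.75, passed through unchanged
```

Changing a weight or the bias moves where the kink sits and how steep the sloped side is, which is exactly what training adjusts.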

Why the nonlinearity is the entire point

Here is the fact that makes neural networks work: if you stack linear layers with no activation between them, the whole stack collapses back into a single linear function. A line of lines is still a line. You gain nothing from depth.

The activation breaks that. Insert a nonlinearity between layers and stacking them lets the network compose bends into bends, so that with enough neurons it can approximate essentially any continuous function. So the activation isn't a detail; it is the reason a "deep" network can model things a line never could.

Key idea

Linear layers supply the tunable parameters; the nonlinear activation supplies the expressiveness. Remove the activations and a 50-layer network is mathematically equivalent to a single linear layer: it can compute nothing that one layer could not.
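The collapse is easy to check numerically. The sketch below uses two hypothetical one-input "layers" with arbitrary weights: stacked without an activation they match a single line exactly, and with a ReLU in between the output gains a bend.

```python
def relu(z):
    return max(0.0, z)

# Two hypothetical one-input layers; the numbers are arbitrary.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def stacked_linear(x):
    # Layer 2 applied directly to layer 1, no activation in between.
    return w2 * (w1 * x + b1) + b2

def single_line(x):
    # The same function written as one line:
    # slope w2*w1, intercept w2*b1 + b2.
    return (w2 * w1) * x + (w2 * b1 + b2)

def stacked_with_relu(x):
    # Same two layers, with ReLU between them.
    return w2 * relu(w1 * x + b1) + b2

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([stacked_linear(x) for x in xs])     # [9.5, 3.5, -2.5, -8.5, -14.5]
print([single_line(x) for x in xs])        # [9.5, 3.5, -2.5, -8.5, -14.5]
print([stacked_with_relu(x) for x in xs])  # [0.5, 0.5, -2.5, -8.5, -14.5]
```

The third list is flat on the left and sloped on the right. No single straight line produces that shape, and the bend comes entirely from the activation.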

From neuron to layer to network

Put many neurons side by side, all reading the same inputs, and you have a layer. Feed one layer's outputs into the next and you have a stack of layers. The layers between input and output are hidden layers — "hidden" only because you don't directly observe their values. "Deep learning" is just neural networks with several hidden layers.

Running data left to right — input, through each layer, to the output — is the forward pass. It's the same "model(inputs)" step from the training loop, now unpacked into layer-by-layer arithmetic.

[Figure: network diagram with columns labeled inputs, hidden layer, and output.]
Every edge is a weight; every hidden/output node adds a bias and applies an activation. A real network has many such layers and millions of edges.
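As a rough sketch of the forward pass, here is a tiny network in plain Python: 2 inputs, 3 hidden neurons, 1 output. The layer sizes and every weight and bias are hand-picked assumptions for illustration, and ReLU is applied at every node to match the description above.

```python
def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases):
    """One layer: every neuron reads the same inputs.
    weights[j] holds the weights of neuron j; biases[j] is its bias."""
    return [
        relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs):
    # Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 1 output.
    hidden = layer(inputs,
                   weights=[[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]],
                   biases=[0.0, -1.0, 0.5])
    output = layer(hidden,
                   weights=[[1.0, 2.0, -1.0]],
                   biases=[0.25])
    return hidden, output

hidden, output = forward([2.0, 1.0])
print(hidden)  # [1.0, 0.5, 0.0]  (the third neuron's sum was negative, so ReLU zeroed it)
print(output)  # [2.25]
```

The hidden list is what the "hidden" in hidden layer refers to: intermediate values the network computes on the way to its output.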

How does it learn? Backpropagation

Training still needs the gradient: for every one of those millions of weights, which way should it move to reduce the loss? Backpropagation is the efficient algorithm that computes all of them in a single backward sweep. Intuitively, it runs the network in reverse, distributing "blame" for the final error back through the layers — each weight learns how much it contributed to the mistake.

You will essentially never implement backpropagation by hand. Frameworks like PyTorch record the operations of the forward pass and differentiate them automatically ("autograd"). What matters for you: backprop produces the gradient, and gradient descent uses it. Same loop as before — now over a network instead of a line.
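Purely to show what the backward sweep produces, here is a hand-written sketch for the smallest possible network: one hidden ReLU neuron feeding one linear output, with squared error as the loss. All names and numbers are assumptions for illustration, and a framework would do this step for you. The sketch compares the backward sweep against the slow alternative of nudging each parameter and re-measuring the loss.

```python
def relu(z):
    return max(0.0, z)

def forward(params, x):
    w1, b1, w2, b2 = params
    z = w1 * x + b1      # hidden neuron: weighted sum
    h = relu(z)          # hidden neuron: activation
    y = w2 * h + b2      # output (left linear in this sketch)
    return z, h, y

def loss_fn(params, x, target):
    _, _, y = forward(params, x)
    return (y - target) ** 2

def backward(params, x, target):
    """One backward sweep: pass blame from the loss back toward the input."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    d_y = 2.0 * (y - target)             # how the loss reacts to the output
    d_w2 = d_y * h
    d_b2 = d_y
    d_h = d_y * w2                       # blame handed to the hidden neuron
    d_z = d_h * (1.0 if z > 0 else 0.0)  # ReLU passes blame only where it was active
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

def nudge_gradient(params, x, target, eps=1e-6):
    """Slow check: nudge each parameter up and down and re-measure the loss."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))
    return grads

params = [0.5, 0.5, 2.0, -1.0]           # w1, b1, w2, b2
print(backward(params, 2.0, 1.0))        # [8.0, 4.0, 3.0, 2.0]
print(nudge_gradient(params, 2.0, 1.0))  # approximately the same four numbers
```

The nudging approach needs two extra forward passes per parameter, which is hopeless at millions of weights. The backward sweep gets every gradient from one pass forward and one pass back.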

Heads up

People say a network has "billions of parameters." Those parameters are exactly these weights and biases, counted across every layer. Training sets all of them; inference reads all of them. There is nothing else inside.
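The count itself is simple arithmetic. This sketch assumes fully connected layers, where every neuron reads every output of the previous layer; the layer sizes are hypothetical examples.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.
    layer_sizes lists the number of values at each stage, input first."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per edge between the two layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

print(count_parameters([2, 3, 1]))             # (2*3 + 3) + (3*1 + 1) = 13
print(count_parameters([784, 512, 512, 10]))   # 669706
```

Even the modest second example has well over half a million parameters, which is why the counts for large models climb into the billions.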

Why this matters for language

A language model is a particular, large neural network whose inputs and outputs are about text. It has special structure — the Transformer — that we'll meet in two lessons. But it is trained by backprop + gradient descent on a loss, exactly like the toy network above. Before we get to the Transformer, we have to solve one problem: a network does arithmetic on numbers, and text is not numbers. That's the next lesson.

Key terms

Neuron
A weighted sum of inputs plus a bias, passed through a nonlinear activation.
Weight / bias
The learnable parameters of a neuron.
Activation function (ReLU)
The nonlinearity that lets stacked layers model non-linear patterns; ReLU outputs 0 for negatives, the input otherwise.
Layer / hidden layer
Many neurons in parallel; hidden layers sit between input and output.
Forward pass
Computing outputs from inputs, layer by layer.
Backpropagation
The algorithm that computes the gradient for every parameter in one backward sweep.
