How Models Learn
- #AI
- #Backpropagation
- #Loss Function
- #Gradient Descent
In the previous post I covered what AI is and how it developed. This one goes a step further into how models actually learn — and in particular, two concepts I personally found confusing for a long time: the loss function and backpropagation.
What learning is
In AI, a model is a function that takes input and produces output. What that function does depends entirely on the numbers packed inside it: the parameters, meaning its weights and biases.
Learning is the process of nudging those numbers, over and over, until the model's output gets close to the right answer. When the output is a number (regression — say, house prices), you look at how far off the prediction is. When it's a category (classification — say, cat vs. dog), you look at how close to 1 the model's probability for the correct category is. Either way, you adjust the numbers to shrink that gap, and you keep repeating.
The one-line version: learning means repeatedly adjusting the parameters by a small amount in the direction that reduces the gap between prediction and ground truth, one batch of data at a time.
Each batch goes through four steps.
- ① Forward pass — the model takes input and produces a prediction
- ② Loss — the difference between prediction and ground truth is distilled into a single number
- ③ Backward pass — "which direction should each parameter move to reduce the loss?" is computed for all parameters at once (= gradient)
- ④ Update — the parameters are moved in that direction by a step proportional to the learning rate
What actually happens inside a single step: every data point in the batch (say, 32) is fed through the model at once, producing 32 predictions. The loss for each is computed separately, then averaged into one number, the batch loss, and the parameters are updated once based on that average. So a batch of 32 does not trigger 32 updates; it triggers a single step based on the mean loss.
Run this cycle over the full dataset multiple times (one full pass is called an epoch) and the model's predictions gradually converge on the right answers.
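To make the cycle concrete, here is a minimal sketch of the four steps written out by hand for a one-feature linear model. The toy data (y = 2x + 1 plus noise), the batch size of 32, and the learning rate of 0.1 are all assumptions chosen for illustration, and the gradients are written out manually for this specific model rather than computed by a library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 2x + 1 plus a little noise
x = rng.uniform(-1.0, 1.0, size=128)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=128)

w, b = 0.0, 0.0      # parameters, starting from arbitrary values
lr = 0.1             # learning rate (assumed)
batch_size = 32
num_updates = 0

for epoch in range(100):                    # one epoch = one full pass over the data
    order = rng.permutation(len(x))         # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]

        y_hat = w * xb + b                              # ① forward: 32 predictions at once
        loss = np.mean((yb - y_hat) ** 2)               # ② loss: 32 errors averaged into one number
        grad_w = np.mean(-2.0 * (yb - y_hat) * xb)      # ③ gradient of the batch loss w.r.t. w
        grad_b = np.mean(-2.0 * (yb - y_hat))           #    and w.r.t. b
        w -= lr * grad_w                                # ④ one update per batch, not per example
        b -= lr * grad_b
        num_updates += 1

print(f"w={w:.2f}, b={b:.2f}, updates={num_updates}")
```

With 128 points and batches of 32 there are 4 updates per epoch, so 100 epochs give 400 updates in total, and `w` and `b` should end up close to 2 and 1.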
Preparing the data
Before you can run the training cycle, you need to get the data into a form the model can accept. The usual steps look like this.
- Collect and clean — gather data, remove corrupted values and duplicates
- Split — divide into train (70%) / validation (15%) / test (15%)
- Normalize — align the numeric scale across features
- Encode — convert text and categories into numeric vectors
- Batch — divide into fixed-size batches
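As a rough sketch of the split, normalize, and batch steps, the snippet below works on a made-up numeric table in NumPy. The feature scales, the 1000-row size, and the batch size of 32 are assumptions for illustration; cleaning and encoding are left out because they depend on the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up table: 1000 rows, 3 numeric features on very different scales
X = rng.normal(loc=[5.0, 300.0, 0.1], scale=[2.0, 50.0, 0.02], size=(1000, 3))

# Split: shuffle once, then cut at 70% and 85%
order = rng.permutation(len(X))
n_train = len(X) * 70 // 100
n_val = len(X) * 15 // 100
train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

# Normalize: compute mean/std on the training split only, then reuse them everywhere
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)
X_train = (X[train_idx] - mean) / std
X_val = (X[val_idx] - mean) / std
X_test = (X[test_idx] - mean) / std

# Batch: fixed-size slices (the last batch may be smaller)
batch_size = 32
batches = [X_train[i:i + batch_size] for i in range(0, len(X_train), batch_size)]

print(len(X_train), len(X_val), len(X_test), len(batches))  # 700 150 150 22
```

Taking the normalization statistics from the training split alone keeps information from the validation and test rows out of training.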
In real projects, getting the data ready takes more effort than writing the model code. That said, when you're studying or need to iterate quickly, Hugging Face's datasets library lets you pull public datasets and get straight to experimenting.
from datasets import load_dataset
# California housing dataset (1990 census, regression task)
ds = load_dataset("gvlassis/california_housing")
print(ds)
# DatasetDict({
# train: Dataset({ features: [...], num_rows: 16640 }),
# validation: Dataset({ features: [...], num_rows: 2000 }),
# test: Dataset({ features: [...], num_rows: 2000 }),
# })
print(ds["train"][0])
# {'MedInc': 8.3252, 'HouseAge': 41.0, ..., 'MedHouseVal': 4.526}
Forward pass
The forward pass is the step where you compute what answer the model gives with its current parameters. Input flows in one direction toward the output — hence the name.
For the simplest linear model:
$$\hat{y} = wx + b$$
Multiply the input $x$ by $w$ and add $b$, and you're done. Two parameters, $w$ and $b$, determine everything the model does.
import numpy as np
w, b = 2.0, 1.0 # parameters
x = np.array([1.0, 2.0, 3.0]) # 3 inputs
y_hat = w * x + b # forward pass
print(y_hat) # [3. 5. 7.]
A neural network stacks this transformation across multiple layers with a nonlinearity (like ReLU) between each pair. That's what lets it learn curved patterns, not just straight lines.
For the simplest two-layer network:
$$\hat{y} = W_2 \, \mathrm{ReLU}(W_1 x + b_1) + b_2$$
That is linear transform → nonlinearity (ReLU) → linear transform, in sequence. To add more layers, just repeat the same pattern (linear → nonlinearity).
import numpy as np
# 2-layer network: x → ReLU(W1·x + b1) → W2·a1 + b2 = ŷ
x = np.array([1.0, 2.0])
W1 = np.array([[0.5, -0.3], # layer 1 weights (3 × 2)
[0.8, 0.2],
[-0.1, 0.4]])
b1 = np.array([0.1, -0.2, 0.05])
W2 = np.array([0.7, -0.5, 0.3]) # layer 2 weights (1 × 3)
b2 = 0.1
z1 = W1 @ x + b1 # layer 1 linear
a1 = np.maximum(0, z1) # ReLU (nonlinearity)
y_hat = W2 @ a1 + b2 # layer 2 linear → output
print(y_hat) # ≈ -0.175
Before training begins, $W$ and $b$ are typically random numbers. So the first forward pass usually produces a prediction that's way off. As training progresses and $W$ and $b$ are gradually adjusted, the output converges on the right answer.
Loss function
The forward pass has produced a prediction $\hat{y}$. The ground truth is $y$. The job of the loss function is to express how far apart those two values are as a single number.
$$L = \mathrm{loss}(y, \hat{y})$$
Two inputs (ground truth, prediction), one non-negative output. The closer the two values, the smaller the output; the further apart, the larger. The function you use depends on the task. Here are the two most common.
Regression — MSE (Mean Squared Error)
For tasks that predict a continuous number, like house prices, you use mean squared error (MSE).
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
For each data point, you square the difference between the true value and the prediction, then average everything. Squaring serves two purposes: it removes the sign so positive and negative errors don't cancel each other out, and it penalizes large errors more heavily. An error of 2 becomes 4, but an error of 10 becomes 100. A five-fold difference in error balloons into a twenty-five-fold difference in the loss — big mistakes matter a lot more.
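The two effects of squaring can be checked with a few lines of arithmetic. The error values below are made up for illustration.

```python
errors = [2.0, -2.0, 10.0]   # prediction minus truth for three made-up data points

# Without squaring, +2 and -2 cancel and the average error looks perfect
mean_raw = sum(errors[:2]) / 2

# Squaring removes the sign first, so both mistakes are counted
mean_sq = sum(e ** 2 for e in errors[:2]) / 2

# An error 5x larger (10 vs 2) contributes 25x more to the loss (100 vs 4)
ratio = errors[2] ** 2 / errors[0] ** 2

print(mean_raw, mean_sq, ratio)   # 0.0 4.0 25.0
```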
import numpy as np
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 8.0])
mse = np.mean((y - y_hat) ** 2)
print(mse) # 0.5
Classification — Cross-Entropy
For tasks that predict a category — like "cat or dog" — you use cross-entropy.
A classification model's output is typically a probability distribution over categories — e.g., [cat 0.7, dog 0.2, bird 0.1]. The ground truth is represented as a one-hot vector with a 1 in the correct position and 0s everywhere else — e.g., [1, 0, 0] (correct answer: cat). Cross-entropy measures how different those two distributions are.
$$L = -\sum_i y_i \log \hat{y}_i$$
Here $y_i$ is the $i$-th value of the one-hot ground truth, and $\hat{y}_i$ is the probability the model assigned to that position.
Compute it by hand once and you'll see how simple it is. With ground truth [1, 0, 0] and prediction [0.7, 0.2, 0.1] (using the natural log):
$$L = -(1 \cdot \log 0.7 + 0 \cdot \log 0.2 + 0 \cdot \log 0.1) = -\log 0.7 \approx 0.357$$
Because $y_i$ is 0 everywhere except the correct position, all the other terms vanish. Only the correct position survives. In the end, cross-entropy collapses to $-\log$ applied to the probability the model assigned to the correct answer. It measures nothing but "how confident was the model about the right class?"
When the model's confidence in the correct answer is close to 1, the loss is close to 0. As that confidence drops toward 0, the loss explodes. Confident and right gets a small penalty; confident and wrong gets a huge one.
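A quick way to see that shape is to evaluate $-\log p$ for a few probabilities assigned to the correct class. The probabilities below are illustrative values, not outputs of any particular model.

```python
import math

# Probability the model assigned to the correct class (illustrative values)
probs = [0.99, 0.7, 0.5, 0.1, 0.01]
losses = [-math.log(p) for p in probs]

for p, loss in zip(probs, losses):
    print(f"p={p:<5} loss={loss:.4f}")
# p=0.99  loss=0.0101
# p=0.7   loss=0.3567
# p=0.5   loss=0.6931
# p=0.1   loss=2.3026
# p=0.01  loss=4.6052
```

Going from 0.99 to 0.7 barely moves the loss, while going from 0.1 to 0.01 adds more than 2 to it.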
import numpy as np
y_hat = np.array([0.7, 0.2, 0.1]) # model probabilities
y = np.array([1.0, 0.0, 0.0]) # ground truth: class 0
ce = -np.sum(y * np.log(y_hat))
print(ce) # 0.3567
Choosing a loss function is straightforward
- Output is a continuous value → MSE family (regression)
- Output is a category → Cross-entropy (classification)
- For both — the function must be differentiable.
Why differentiability matters is explained in the next two sections on backpropagation and the update step.
Backpropagation
To reduce the loss, you need to know which direction to move each parameter, and by how much, so that the loss goes down. The value that carries that information is the gradient.
The gradient of the loss with respect to a single parameter $w$, written $\partial L / \partial w$, is a single number that tells you: "if I nudge $w$ up slightly right now, how much does the loss change?"
- Sign encodes direction — positive means increasing $w$ increases the loss; negative means the opposite.
- Magnitude encodes steepness — a large value means the loss changes a lot in that direction; near zero means it's nearly flat.
Together they tell you exactly which way to move and how much of a difference it'll make.
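The "nudge it slightly and watch the loss" reading can be turned into code directly with a finite difference. This is only a sketch: the one-parameter loss $(w - 3)^2$ and the helper name `numeric_grad` are assumptions made up for illustration.

```python
def loss(w):
    # Toy one-parameter loss (assumed for illustration); its minimum is at w = 3
    return (w - 3.0) ** 2

def numeric_grad(f, w, eps=1e-6):
    # Central difference: nudge w slightly both ways and measure how much f changes
    return (f(w + eps) - f(w - eps)) / (2 * eps)

for w in [0.0, 2.9, 5.0]:
    print(f"w={w}: gradient={numeric_grad(loss, w):+.3f}")
# w=0.0: gradient=-6.000   (negative: increasing w lowers the loss, and steeply)
# w=2.9: gradient=-0.200   (negative but nearly flat: almost at the minimum)
# w=5.0: gradient=+4.000   (positive: increasing w raises the loss)
```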
The problem is that a model has far more than one parameter. A single update step requires a separate gradient for every parameter: $\partial L / \partial w_1, \partial L / \partial w_2, \ldots, \partial L / \partial w_n$.
A real neural network has hundreds of millions of parameters, and each one influences the loss through multiple layers. Computing gradients one by one from scratch is essentially impossible.
That's where the chain rule — a basic rule of calculus — and the algorithm that exploits it, backpropagation, come in.
The chain rule
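Suppose a function is composed of two steps. For example, $y = u^2$ with $u = 3x + 1$.
A change in $x$ causes a change in $u$, which causes a change in $y$. It's a chain of dominoes.
The key insight is simple: the effect of $x$ on $y$ equals the product of the rates of change at each step.
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Working it out directly:
- $\frac{dy}{du} = 2u$ (derivative of $u^2$)
- $\frac{du}{dx} = 3$ (derivative of $3x + 1$)
- Multiply: $\frac{dy}{dx} = 2u \cdot 3 = 6(3x + 1)$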
Neural networks work exactly the same way — the chain just gets longer, one link per layer. The total effect of a weight on the loss is found by multiplying the derivatives all the way to the end.
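The chain rule is easy to sanity-check numerically. The sketch below assumes a two-step function $y = u^2$ with $u = 3x + 1$, chosen purely for illustration, and compares the product of the local derivatives against a finite-difference estimate of the end-to-end derivative.

```python
def g(x):
    # Inner step: x -> u
    return 3.0 * x + 1.0

def f(u):
    # Outer step: u -> y
    return u ** 2

x = 2.0
u = g(x)

du_dx = 3.0               # local derivative of the inner step
dy_du = 2.0 * u           # local derivative of the outer step, evaluated at u
chain = dy_du * du_dx     # chain rule: multiply the local rates of change

# Independent check: nudge x and measure the change in y directly
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(chain, round(numeric, 4))   # 42.0 42.0
```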
Why compute backward?
The chain is established. Now — which direction should we traverse it for efficiency?
The key observation is that multiple parameters share the tail end of the chain (the $L$ side). The question is whether you recompute that shared tail every time, or compute it once and reuse it.
A small analogy: suppose you need to compute $c \cdot d$, $b \cdot c \cdot d$, and $a \cdot b \cdot c \cdot d$ all at once. Going left to right, you redo the same multiplications. Going right to left ($c \cdot d$ first, then $b \cdot (c \cdot d)$, then $a \cdot (b \cdot c \cdot d)$), each step reuses what came before.
Neural networks work the same way.
- Forward (parameter → loss) — for each weight, you'd follow the chain from that weight all the way to $L$ from scratch. With hundreds of millions of weights, you'd repeat the same tail hundreds of millions of times.
- Backward (loss → parameters) — you compute the gradient near $L$ once, then carry it backward one layer at a time. The shared tail is reused naturally, so a single backward pass yields gradients for all parameters at once.
That's why it's called "back" propagation. Data flows forward; gradients flow backward.
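The efficiency difference is enormous. Following the chain separately from every layer repeats the shared tail each time, so the work grows roughly with the square of the number of layers, and doing it separately for every individual weight is far worse. Backpropagation needs just one forward pass plus one backward pass, so its cost grows linearly with the size of the network. That efficiency is what makes training huge neural networks possible.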
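The reuse argument can be seen by counting multiplications. The sketch below reduces each layer to a single made-up local derivative, which is a simplification for illustration; real layers involve matrices, but the counting works the same way.

```python
# Local derivatives along a 5-link chain (made-up numbers); the last link is next to the loss
local = [0.5, 2.0, 3.0, 0.1, 4.0]
n = len(local)

# Naive: for every starting link, multiply the rest of the chain from scratch
naive, naive_mults = [], 0
for k in range(n):
    prod = 1.0
    for d in local[k:]:
        prod *= d
        naive_mults += 1
    naive.append(prod)

# Backward: start at the loss end and extend one running product a link at a time
backward, back_mults = [0.0] * n, 0
running = 1.0
for k in reversed(range(n)):
    running *= local[k]
    back_mults += 1
    backward[k] = running

print(naive_mults, back_mults)   # 15 5
```

Both approaches produce the same five products, but the naive count grows as 1 + 2 + ... + n while the backward count grows as n.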
How it works in a neural network
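Let's trace the flow through a small two-layer network. In the forward pass, input $x$ flows in one direction through $W_1$, ReLU, and $W_2$ all the way to the loss $L$.
$$x \xrightarrow{W_1,\, b_1} z_1 \xrightarrow{\mathrm{ReLU}} a_1 \xrightarrow{W_2,\, b_2} z_2 \rightarrow L$$
In the backward pass, you start at $L$ and send gradients backward one step at a time.
$$\frac{\partial L}{\partial z_2} \rightarrow \frac{\partial L}{\partial a_1} \rightarrow \frac{\partial L}{\partial z_1}$$
Each step does the same thing: the gradient from the previous step × the local derivative at this step. As the backward pass moves through each layer, the gradient of the loss with respect to that layer's weights ($\partial L / \partial W_2$, then $\partial L / \partial W_1$) drops out, and the gradient to pass on to the next layer is formed at the same time. By the time it reaches the end, every layer's weight gradients are in hand.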
# continues from the two-layer forward pass above (x, W1, W2, z1, a1, y_hat)
z2 = y_hat # network output, called z₂ here
y = 1.0 # ground-truth target (illustrative value, not part of the example above)
# backprop for the squared-error loss L = (y - z₂)², starting from L, one step at a time
dL_dz2 = -2 * (y - z2) # start: ∂L/∂z₂
dL_dW2 = dL_dz2 * a1 # × a₁ → gradient of W₂
dL_da1 = dL_dz2 * W2 # × W₂ (passed to next step)
dL_dz1 = dL_da1 * (z1 > 0) # × ReLU′ (1 if z₁>0, else 0)
dL_dW1 = np.outer(dL_dz1, x) # × x → gradient of W₁
Every line follows the same pattern: gradient from the previous step × local derivative at this step. dL_dW2 is the gradient for the second layer's weights; dL_dW1 is for the first. One drops out each time the backward pass crosses a layer. Once all the weight gradients are collected, they're handed to the update step and one training step is complete.
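One way to build trust in hand-written backprop is a gradient check: compare the backprop result with finite differences computed by nudging one weight at a time. The sketch below is self-contained and rests on several assumptions: a squared-error loss, a made-up target `y = 1.0`, a hypothetical helper `numeric_grad`, and a first-layer bias shifted so that no unit sits exactly at the ReLU kink, where finite differences are unreliable.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = 1.0                                        # target (made up for this check)
W1 = np.array([[0.5, -0.3], [0.8, 0.2], [-0.1, 0.4]])
b1 = np.array([-0.4, -0.2, 0.05])              # chosen so z1 stays away from 0
W2 = np.array([0.7, -0.5, 0.3])
b2 = 0.1

def loss():
    # Forward pass plus squared-error loss, reading the current W1 and W2
    a1 = np.maximum(0, W1 @ x + b1)
    return (y - (W2 @ a1 + b2)) ** 2

def numeric_grad(f, W, eps=1e-6):
    # Central differences: nudge one entry of W at a time and re-run the forward pass
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old                           # restore the original value
        G[idx] = (up - down) / (2 * eps)
    return G

# Backprop: one forward pass, then gradients carried backward
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)
z2 = W2 @ a1 + b2
dL_dz2 = -2 * (y - z2)
dL_dW2 = dL_dz2 * a1
dL_dz1 = dL_dz2 * W2 * (z1 > 0)
dL_dW1 = np.outer(dL_dz1, x)

print(np.allclose(dL_dW1, numeric_grad(loss, W1), atol=1e-6))   # True
print(np.allclose(dL_dW2, numeric_grad(loss, W2), atol=1e-6))   # True
```

The finite-difference version needs two forward passes per weight, which is exactly the cost backpropagation avoids; it is useful as a test, not as a training method.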
Update
Backpropagation has computed the gradient for every parameter at once. Now you take those gradients and decide how to move each parameter — and one step is done.
Gradient descent
When the parameter value $w$ changes, so does the loss. For the same data, a different $w$ means a different prediction, which means a different loss. If you plot every $(w, L)$ pair for one parameter of a simple model like the linear one above, you get a U-shaped curve, the loss curve. With more than one parameter it becomes a surface, but the idea is the same.
We want to reach the bottom of that surface. But we don't know the whole curve in advance and can't jump to the answer in one shot. Instead, starting from some $w$ (usually random), we look at only the gradient at the current position and take one step at a time. It's exactly like descending a mountain in fog: you can't see far ahead, but you can feel the slope underfoot, so you follow it downhill one step at a time.
The sign of the gradient tells you which way is downhill.
- Gradient positive → increasing $w$ increases the loss → decrease $w$ to reduce the loss
- Gradient negative → increasing $w$ decreases the loss → increase $w$ to reduce the loss
In both cases, moving opposite to the gradient reduces the loss. Condensed into one line, that's the update rule of gradient descent.
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
- $\frac{\partial L}{\partial w}$ — the gradient itself (what backpropagation gave us)
- $-$ (minus) — we want to go in the opposite direction of the gradient
- $\eta$ (eta) — the learning rate, a step size controlling how far we move
A large $\eta$ means big steps; a small $\eta$ means tiny steps. Too large and you overshoot the minimum and bounce up the other side (divergence). Too small and you barely move after many steps. Commonly used values fall somewhere between 0.001 and 0.1, depending on the optimizer and the problem.
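The effect of the step size is easy to see on a toy problem. The sketch below assumes the loss $L(w) = (w - 3)^2$ and three learning rates picked to show the three regimes; the helper name `run` is made up for illustration.

```python
def run(lr, steps=20, w=0.0):
    # Plain gradient descent on the toy loss L(w) = (w - 3)**2 (assumed for illustration)
    for _ in range(steps):
        grad = 2 * (w - 3)       # dL/dw
        w = w - lr * grad        # step against the gradient
    return w

for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: w after 20 steps = {run(lr):.4f}")
```

With `lr=0.001` the parameter has barely left its starting point after 20 steps, with `lr=0.1` it lands close to 3, and with `lr=1.1` every step overshoots by more than it corrects, so `w` ends up far away from 3.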
A small demo
Let's say the loss is $L(w) = (w - 3)^2$, so the optimal value is $w = 3$. Wherever $w$ starts, it should converge to 3.
First, let's find the gradient by hand. Differentiating $L$ with respect to $w$ gives $\frac{dL}{dw} = 2(w - 3)$.
If differentiation isn't familiar: when you differentiate a squared expression, the exponent 2 comes down in front, and $(w - 3)$ stays once. That's where the 2 comes from; it's a coefficient of the derivative, not the learning rate. The learning rate appears separately as lr = 0.1 in the code below.
At each step, we compute this gradient and move in the opposite direction by the learning rate times the gradient.
w = 0.0 # initial value
lr = 0.1 # learning rate
for step in range(20):
    grad = 2 * (w - 3) # gradient dL/dw
    w = w - lr * grad # step in the opposite direction of the gradient
    loss = (w - 3) ** 2
    print(f"step {step:2d} w={w:.4f} loss={loss:.4f}")
Output (first few and last):
step 0 w=0.6000 loss=5.7600
step 1 w=1.0800 loss=3.6864
step 2 w=1.4640 loss=2.3593
...
step 19 w=2.9654 loss=0.0012
Twenty steps and we're nearly at the answer. All we did at each step was look at the gradient and move against it by 0.1 times its size, and $w$ converged toward the minimum on its own. This example had one parameter, so we could differentiate by hand. A real neural network has hundreds of millions, and that's exactly why backpropagation was needed to get all the gradients at once.
Optimizer
Backpropagation has already computed the gradient for every parameter. How do you apply the 1D rule above to hundreds of millions of $w$s?
The answer is simple: apply the same rule to each parameter independently. Each one moves by its own gradient, no more.
Writing out hundreds of millions of lines is unwieldy, so in practice you bundle all parameters into a single vector $\theta$ and write it as one line.
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$
"All parameters simultaneously" means they're all updated in one step — not that they all move by the same amount. Each parameter has its own gradient, so each one moves in its own direction by its own amount.
The algorithm that performs this update step is called the optimizer. Call it once and every parameter in the model is updated according to the rule above.
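As a small illustration of "one line, but a different step for each parameter", the sketch below bundles three parameters into a NumPy vector. The loss (sum of squared distances to a made-up target of 3) and the starting values are assumptions for illustration.

```python
import numpy as np

theta = np.array([0.0, 10.0, -4.0])     # three parameters bundled into one vector
target = np.array([3.0, 3.0, 3.0])      # made-up optimum for L = sum((theta - target)**2)
lr = 0.1

grad = 2 * (theta - target)             # one gradient entry per parameter
step = -lr * grad                       # so each parameter gets its own step
theta_new = theta + step                # the whole vector is updated in one line

print(step)        # approximately [0.6, -1.4, 1.4]
print(theta_new)   # approximately [0.6, 8.6, -2.6]
```

The first parameter moves up by 0.6 while the second moves down by 1.4: same rule, same learning rate, different direction and amount.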
The simplest form we've been using is SGD (Stochastic Gradient Descent). There are also smarter optimizers like Adam and RMSprop that adapt the learning rate or add momentum — all variations on gradient descent that apply small corrections to the formula above. In practice, Adam is the default most people reach for.
Putting it all together — training a model with PyTorch
Now let's tie all four steps into a single training loop using PyTorch. In the earlier sections I wrote each step out by hand to make the mechanics visible, but in practice PyTorch provides handy abstractions (nn.Linear, nn.MSELoss, optim.SGD, loss.backward()) that handle all of it.
Let's train a model to discover the ground truth ($w = 2$, $b = 1$) from 100 data points drawn from $y = 2x + 1$ plus a little noise.
import torch
import torch.nn as nn
# data — true: y = 2x + 1 (+ small noise)
x = torch.linspace(-3, 3, 100).unsqueeze(1)
y = 2 * x + 1 + torch.randn(100, 1) * 0.3
# model · loss · optimizer
model = nn.Linear(1, 1) # trainable w, b built in
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
# training loop — the four-step skeleton
for epoch in range(200):
    y_hat = model(x) # ① forward
    loss = loss_fn(y_hat, y) # ② loss
    optimizer.zero_grad() # clear previous gradients
    loss.backward() # ③ backprop (autograd computes automatically)
    optimizer.step() # ④ update
w, b = model.weight.item(), model.bias.item()
print(f"After training: w={w:.2f}, b={b:.2f}")
# After training: w≈2.00, b≈1.00 (slight variation due to noise)
Key methods
- nn.Linear(in, out) — a linear transform layer with learnable $w$ and $b$ built in
- model.parameters() — the learnable parameters to hand to the optimizer
- nn.MSELoss() — mean squared error (regression loss)
- torch.optim.SGD(params, lr) — gradient descent optimizer
- model(x) — runs the forward pass (internally calls forward(x))
- loss.backward() — autograd traverses the forward graph in reverse and computes gradients for all parameters automatically
- optimizer.step() — updates all parameters at once using the computed gradients
- optimizer.zero_grad() — clears the gradients from the previous step (without this they accumulate)
After 200 epochs, the model has converged to within a whisker of the true answer.
Wrapping up
The four steps in this post form the shared skeleton of every neural network training run.
- Forward pass — compute what answer the model gives with its current parameters
- Loss function — express how far that answer is from the ground truth as a single number
- Backpropagation — compute the direction of improvement (gradients) for all parameters at once, efficiently
- Update — move all parameters in that direction by the learning rate (gradient descent)
No matter how deep the network or how complex the task, the training loop stays anchored to these four steps. Some architectures add or modify steps, but transformers, GPT, ViT — most of the models you'll encounter run on exactly this skeleton.
Writing this post gave me a chance to look at each step a lot more closely. Concepts like gradient descent and the loss function that I'd had a vague sense of for a long time finally clicked once I slowed down to research them carefully and explain them in my own words.
This post may contain factual or interpretive errors. If you spot anything wrong or have a question, feel free to leave a comment.