From Derivatives to Machine Learning

§ 01

What a derivative is

A derivative answers one question: how fast is something changing right now?

Think of yourself driving. Your speedometer reads 60 mph. That is a derivative — it tells you the rate at which your position is changing at this exact instant. Not over the last hour, not on average for the trip, but right now.

Geometrically, a derivative is the slope of a curve at one specific point. To see how this works, drag the sliders below. The variable $x$ moves you along the curve $y = x^2$, and $h$ controls how far apart two points $P$ and $Q$ are.

The teal line is the tangent — the curve's exact slope at point $P$. The dashed coral line is the secant, which connects $P$ to a nearby point $Q$ and tells you the average slope between them. As you drag $h$ toward $0$, the secant collapses onto the tangent, and the two slope numbers merge.

That collapse is the derivative.

§ 02

The limit definition formula

Formally, what you saw in the slider above is captured by:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Read it as a recipe:

$f(x)$: the height of the curve at point $x$
$f(x+h)$: the height a tiny step $h$ to the right
$f(x+h) - f(x)$: how much the height changed over that step (the rise)
$h$: how wide the step was (the run)
$\div\, h$: rise over run, the slope between the two points
$\lim_{h \to 0}$: squeeze the step width down to zero

The whole expression says: take the slope between two nearby points, then shrink the gap to nothing. What's left is the slope at a single point — the derivative.

Worked example: f(x) = x²

Write $f(x+h)$: $$f(x+h) = (x+h)^2 = x^2 + 2xh + h^2$$
Subtract $f(x)$: $$f(x+h) - f(x) = 2xh + h^2$$
The $x^2$ terms cancel — the part of the function that doesn't change with $h$ gets subtracted away.
Divide by $h$: $$\frac{2xh + h^2}{h} = 2x + h$$
Every term in the numerator has a factor of $h$, so the $h$ in the denominator cancels and nothing is left dividing by zero. That is why the limit can be taken.
Take the limit as $h \to 0$: $$f'(x) = \lim_{h \to 0}(2x + h) = 2x$$

So $f'(x) = 2x$. At $x = 1$ the slope is $2$; at $x = 3$ the slope is $6$ — exactly what the picture in §01 shows.
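The result can be sanity-checked numerically. The sketch below is only an illustration: the names `difference_quotient` and `square` are invented here, and the step sizes are arbitrary choices. It evaluates the secant slope at $x = 3$ for shrinking $h$.

```python
def difference_quotient(f, x, h):
    """Slope of the secant through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    """The example function f(x) = x^2 from the text."""
    return x * x


# The algebra above predicts a secant slope of 2x + h, so at x = 3 the
# values should be about 7, 6.1, 6.01, 6.001 (up to floating-point rounding).
for h in (1.0, 0.1, 0.01, 0.001):
    slope = difference_quotient(square, 3.0, h)
    print(f"h = {h:<6} secant slope = {slope:.4f}")
```

The printed slopes creep toward $6$, which is $2x$ at $x = 3$. Making $h$ extremely small in floating-point arithmetic eventually hurts accuracy, which is one reason the exact limit is found algebraically rather than by plugging in tiny numbers.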

§ 03

Heights, rise, and run

"Height" in the formula means exactly what you'd think: the vertical distance from the x-axis up to the point on the curve. For $f(x) = x^2$:

At $x = 2$, the point is $(2, 4)$. Height $= 4 = f(2)$.
At $x = 3$, the point is $(3, 9)$. Height $= 9 = f(3)$.

Drop a vertical line from any point on the curve to the x-axis — its length is the height. The interactive plot below shows all three pieces — heights, run, and rise — at once.

The little right triangle hugging the curve has horizontal side $h$ (run), vertical side $f(x+h) - f(x)$ (rise), and a hypotenuse that lies along the secant line. Slope is rise over run: the same idea as for straight lines, applied to two points sitting on a curve.
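To put numbers on the triangle, take $f(x) = x^2$ with $x = 2$ and $h = 1$ (values chosen here for illustration, reusing the two heights listed above):

$$\text{run} = h = 1, \qquad \text{rise} = f(3) - f(2) = 9 - 4 = 5, \qquad \text{secant slope} = \frac{5}{1} = 5$$

This agrees with the $2x + h$ expression from §02, since $2 \cdot 2 + 1 = 5$. The tangent slope at $x = 2$ is $2x = 4$, so with a step this wide the secant slope is still off by $h = 1$.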

§ 04

How the formula is the derivative

The formula is not merely related to the derivative: it is the literal definition.

Here is the chain that ties everything together:

derivative = limit of secant slope = tangent slope = instantaneous rate of change

$$\underbrace{f'(x)}_{\text{derivative}} \;=\; \underbrace{\lim_{h \to 0}}_{\text{shrink h}} \; \underbrace{\frac{f(x+h) - f(x)}{h}}_{\text{secant slope}} \;=\; \underbrace{\text{slope of curve at }x}_{\text{tangent slope}}$$

Why bother with a limit at all? A single point has no width, so there's no rise and no run to divide. Slope needs two points. So we cheat cleverly: we use two points ($P$ and $Q$), then slide $Q$ closer and closer to $P$. Both the rise and the run shrink toward zero, but their ratio settles on a specific number: the slope right at $P$.

That settling-on-a-number act is what "$\lim_{h \to 0}$" means.

A useful three-layer hierarchy

Layer	What it is	What it tells you
Geometric idea	Slope of a curve at one point	What a derivative means
Algebraic definition	The limit formula	How you prove what it equals
Shortcut rules	Power rule, product rule, chain rule	How you compute it quickly in practice

Every time you use the power rule to write "derivative of $x^5$ is $5x^4$," you're cashing in work that was done once, with the limit definition, to prove that the shortcut is correct.
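As one small instance of that one-time work, here is the limit definition applied to $f(x) = x^3$, following the same steps as the $x^2$ example in §02. The choice of $x^3$ is ours, picked because the expansion is short.

$$(x+h)^3 - x^3 = 3x^2h + 3xh^2 + h^3$$
$$\frac{3x^2h + 3xh^2 + h^3}{h} = 3x^2 + 3xh + h^2$$
$$f'(x) = \lim_{h \to 0}\,(3x^2 + 3xh + h^2) = 3x^2$$

The result matches what the power rule gives for $x^3$: bring the exponent down and reduce it by one.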

§ 05

Slope in disguise

You're always calculating a slope when you compute a derivative — but the slope isn't always called "slope." It gets renamed depending on what the two axes represent. The pattern is invariant:

$$\text{derivative} = \frac{\text{change in y-axis quantity}}{\text{change in x-axis quantity}}$$

If y is…	and x is…	the derivative is called…
Position	Time	Velocity
Velocity	Time	Acceleration
Total cost	Quantity produced	Marginal cost
Population	Time	Growth rate
Temperature	Time	Rate of heating/cooling
Charge	Time	Electric current
Volume of water	Time	Flow rate
Concentration	Time	Reaction rate
Loss (in ML)	A model parameter	Gradient

Acceleration as a derivative of a derivative

Suppose a ball drops from a height with position $s(t) = 5t^2$ meters.

Velocity (slope of position vs time): apply the limit definition and you get $s'(t) = 10t$. At $t = 2$ seconds, the ball moves at $20$ m/s. Units: meters per second.

Acceleration (slope of velocity vs time): apply the formula again to $v(t) = 10t$ and get $v'(t) = 10$. A constant $10$ m/s², close to the acceleration due to gravity near Earth's surface (about $9.8$ m/s²). Units: meters per second per second.

Both numbers came from the same formula. The geometric "slope" idea was running quietly in the background the whole time.
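The "derivative of a derivative" idea can be sketched by applying one numerical slope routine twice. This is an illustration under stated assumptions: the helper names are invented here, and the code uses a central difference (a step on both sides of $t$) instead of the one-sided formula from §02, because it is more accurate for a fixed step size.

```python
def derivative(f, t, h=1e-3):
    """Central-difference estimate of f'(t); the step h is an assumed value."""
    return (f(t + h) - f(t - h)) / (2 * h)


def position(t):
    """s(t) = 5 t^2, in meters, as in the text."""
    return 5 * t ** 2


def velocity(t):
    """Slope of position versus time."""
    return derivative(position, t)


def acceleration(t):
    """Slope of velocity versus time: the same routine applied again."""
    return derivative(velocity, t)


# Expect values close to 20 m/s and 10 m/s^2 at t = 2 s.
print(round(velocity(2.0), 6), round(acceleration(2.0), 6))
```

Nothing in `acceleration` knows about physics. It simply asks for the slope of whatever function it is handed, which here happens to be the velocity.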

· · ·

§ 06

Partial differentiation

A regular derivative is the slope of a curve. A partial derivative is the slope of a surface — but only in one direction at a time.

So far every function has had one input: $y = f(x)$. Now consider a function with two inputs, like $f(x, y) = x^2 + y^2$. For every pair $(x, y)$ the function spits out a number $z$, and the graph is a surface in 3D — for this function, a bowl.

At any point on a surface, "slope" is ambiguous. Walking north has one steepness; walking east has a different one. Partial derivatives answer the question: "slope in which direction?"

Two partials, one for each variable

$$\frac{\partial f}{\partial x} \quad \text{(slope as you move in the x-direction)}$$ $$\frac{\partial f}{\partial y} \quad \text{(slope as you move in the y-direction)}$$

The curly $\partial$ ("partial") signals: I'm differentiating with respect to one variable, but there are others lurking that I'm pretending are constants.

The trick that makes computation easy

To compute $\partial f / \partial x$, treat $y$ as if it were just a number — a constant — and differentiate normally. Then to compute $\partial f / \partial y$, treat $x$ as a constant. No new differentiation rules. You temporarily freeze the other variable.

Worked example: f(x, y) = x²y + 3xy² + 7

$\partial f / \partial x$ (treat $y$ as constant):

$x^2 y$: $y$ is a constant multiplier → derivative is $2xy$
$3xy^2$: $3y^2$ is a constant multiplier → derivative is $3y^2$
$7$: derivative of a constant is $0$

$$\frac{\partial f}{\partial x} = 2xy + 3y^2$$

$\partial f / \partial y$ (treat $x$ as constant):

$x^2 y$: $x^2$ is a constant multiplier → derivative is $x^2$
$3xy^2$: $3x$ is a constant multiplier → derivative is $6xy$
$7$: derivative is $0$

$$\frac{\partial f}{\partial y} = x^2 + 6xy$$

Each partial derivative still depends on both $x$ and $y$ — the slope of the surface in one direction can change as you move in the other direction.
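The two hand-computed partials can be checked numerically by doing exactly what the trick says: hold one variable fixed and take an ordinary slope in the other. The sketch below is illustrative only. The function names, the test point $(2, 3)$, and the step size are all assumptions made for this example.

```python
def f(x, y):
    """The worked example: f(x, y) = x^2 y + 3 x y^2 + 7."""
    return x ** 2 * y + 3 * x * y ** 2 + 7


def partial_x(func, x, y, h=1e-6):
    """Freeze y and take a central-difference slope in the x-direction."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Freeze x and take a central-difference slope in the y-direction."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 2.0, 3.0  # arbitrary test point

# Numerical estimate next to the hand-derived formula from the text.
print(partial_x(f, x0, y0), 2 * x0 * y0 + 3 * y0 ** 2)  # both close to 39
print(partial_y(f, x0, y0), x0 ** 2 + 6 * x0 * y0)      # both close to 40
```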

The geometric picture

To compute $\partial f / \partial x$ at a point, imagine slicing the surface with a vertical plane that runs in the x-direction. The slice through the surface is a curve, and the slope of that curve at your point is $\partial f / \partial x$. Different slice (in the y-direction) for $\partial f / \partial y$.

So a partial derivative reduces a hard 3D problem (slope of a surface) to a familiar 2D one (slope of a curve), by slicing.

§ 07

The ML connection: weights & loss

Partial derivatives are arguably the most important math idea in modern machine learning. Here's why.

The simplest model

A linear model with one input:

$$\hat{y} = wx + b$$

It predicts an output $\hat{y}$ from an input $x$ using two tunable knobs: a weight $w$ and a bias $b$. Training the model means finding values of $w$ and $b$ that make predictions match the true labels $y$ as closely as possible.

To measure "as closely as possible," we define a loss function — a number that's high when predictions are bad and low when they're good:

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2$$

The data points $(x_i, y_i)$ are fixed. The variables in this function are $w$ and $b$ — exactly the multivariable setup partial differentiation was built for.
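Written as code, the loss is a function of $w$ and $b$ with the data held fixed. The sketch below is a minimal plain-Python version. The names `predict` and `mse_loss` and the four data points are assumptions made for illustration (the points are generated from $y = 2x + 1$).

```python
def predict(w, b, x):
    """Linear model: y_hat = w * x + b."""
    return w * x + b


def mse_loss(w, b, xs, ys):
    """Mean of squared prediction errors over the dataset."""
    n = len(xs)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse_loss(0.0, 0.0, xs, ys))  # (1 + 9 + 25 + 49) / 4 = 21.0
print(mse_loss(2.0, 1.0, xs, ys))  # 0.0, the line passes through every point
```

Calling `mse_loss` with different `(w, b)` pairs while `xs` and `ys` stay put is the code version of moving around on the loss surface.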

The geometric picture

Plot the loss as a 3D surface: the two horizontal axes are $w$ and $b$, the vertical axis is $L$. Each point on this surface says, "if I picked these values for $w$ and $b$, here's how badly my model would do."

For a linear model with squared-error loss, the surface looks like a bowl: somewhere there's a lowest point, and that's the $(w, b)$ combination that makes the model best. Training is the process of walking downhill on this loss surface.

Each partial derivative tells you which way to nudge one parameter.

Partials → direction of nudge

At your current spot on the loss surface:

If $\partial L / \partial w$ is positive: increasing $w$ increases loss → decrease $w$.
If $\partial L / \partial w$ is negative: increasing $w$ decreases loss → increase $w$.
Same logic for $b$.

Each partial tells you (1) which direction to nudge that parameter — opposite of the sign — and (2) how sensitive the loss is to it (the magnitude). Stack the partials together and you get the gradient:

$$\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial w} \\[6pt] \dfrac{\partial L}{\partial b} \end{bmatrix}$$

The gradient points in the direction of steepest ascent. We want to go down, so we step opposite — $-\nabla L$. That is gradient descent in one sentence.
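For the squared-error loss $L(w, b)$ defined above, the two entries of the gradient can be written out explicitly. Applying the chain rule to each squared term, and treating the other parameter as a constant, gives:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)$$

Each term is the prediction error $wx_i + b - y_i$, scaled by how strongly that parameter affects the prediction: by $x_i$ for $w$, and by $1$ for $b$.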

The update rule

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$ $$b \leftarrow b - \eta \cdot \frac{\partial L}{\partial b}$$

$\eta$ (eta) is the learning rate, a small positive number like $0.01$ that controls step size. The minus sign is there because the gradient points uphill and we want to go down. Training repeats this update thousands or millions of times.

Concrete numerical example

Take a tiny dataset: one point $(x, y) = (2, 5)$. Loss is:

$$L(w, b) = (2w + b - 5)^2$$

By the chain rule:

$$\frac{\partial L}{\partial w} = 4(2w + b - 5), \quad \frac{\partial L}{\partial b} = 2(2w + b - 5)$$

Start at $w = 0, b = 0$:

Prediction $\hat{y} = 0$. True $y = 5$. Off by 5.
$\partial L / \partial w = 4(0 + 0 - 5) = -20$
$\partial L / \partial b = 2(0 + 0 - 5) = -10$

With $\eta = 0.01$:

New $w = 0 - 0.01 \cdot (-20) = +0.2$
New $b = 0 - 0.01 \cdot (-10) = +0.1$

New prediction: $\hat{y} = 0.2 \cdot 2 + 0.1 = 0.5$. We've moved from predicting $0$ to predicting $0.5$, closer to the true value. Repeat many times and the prediction converges toward $5$. With a single data point there are many such combinations: any $(w, b)$ with $2w + b = 5$ gives zero loss.
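The arithmetic above can be replayed in a few lines. This is a sketch of the single-point example only: the function names are invented here, and the count of 1000 further steps is an arbitrary choice.

```python
def grad(w, b):
    """Partials of L(w, b) = (2w + b - 5)^2, from the chain rule."""
    residual = 2 * w + b - 5
    return 4 * residual, 2 * residual


def step(w, b, eta=0.01):
    """One gradient-descent update with learning rate eta."""
    dw, db = grad(w, b)
    return w - eta * dw, b - eta * db


# First update from w = 0, b = 0, matching the hand calculation.
w1, b1 = step(0.0, 0.0)
print(w1, b1, 2 * w1 + b1)  # approximately 0.2, 0.1, 0.5

# Keep stepping; the prediction for x = 2 should approach 5.
w, b = w1, b1
for _ in range(1000):
    w, b = step(w, b)
print(2 * w + b)
```

With $\eta = 0.01$ each update multiplies the error $2w + b - 5$ by $1 - 10\eta = 0.9$, so the error shrinks geometrically.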

§ 08

Why both w and b?

If $\partial L / \partial w$ tells you how to change $w$ to reduce loss, why have $b$ at all? The answer reveals why ML works the way it does.

What each parameter actually does

Look again at $\hat{y} = wx + b$:

$w$ (the slope) controls how steep the line is — how much $\hat{y}$ changes when $x$ changes. Tweaking $w$ rotates the line.
$b$ (the intercept) controls where the line sits vertically — it shifts the whole line up or down without changing its tilt.

If you only had $w$, your model would be $\hat{y} = wx$ — a line forced to pass through the origin $(0, 0)$. Real data rarely sits on such lines. You need $b$ to lift the line off the origin.

Try it yourself

The black dots are the data. The true relationship is $y = 2x + 1$, so the perfect fit is $w = 2, b = 1$. Click the "no intercept" preset: even with $w$ set to the true slope of $2$, the line is forced through the origin and misses every point by exactly $1$ unit. No tilt-only adjustment can fix a vertical offset; only $b$ can do that. Each parameter has a specific role, and they're not interchangeable.
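The same comparison can be made without the plot. The sketch below assumes five hypothetical inputs and generates labels from $y = 2x + 1$. The `residuals` helper is a name invented for this example.

```python
def residuals(w, b, xs, ys):
    """Signed miss (true minus predicted) for each data point."""
    return [y - (w * x + b) for x, y in zip(xs, ys)]


# Hypothetical inputs; labels follow the true relationship y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

print(residuals(2.0, 0.0, xs, ys))  # no intercept: [1.0, 1.0, 1.0, 1.0, 1.0]
print(residuals(2.0, 1.0, xs, ys))  # with b = 1:   [0.0, 0.0, 0.0, 0.0, 0.0]
```

Every residual in the no-intercept case is the same constant, which is the signature of a pure vertical offset.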

The 3D loss-surface view

Picture the loss surface (a bowl) over the $(w, b)$ plane. The bottom of the bowl is at some specific $(w^*, b^*)$. If you only update $w$, you're constrained to a 1D slice of the bowl — a single line through $(w, b)$ space at your current $b$. The minimum along that slice is not the same as the minimum of the whole bowl. To reach the actual bottom you need both knobs.

Updating only $w$ would be like trying to find the bottom of a bowl while only being allowed to walk north–south.
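To make the slice argument concrete, the sketch below finds the best $w$ along the slice $b = 0$ and compares its loss with the bottom of the full bowl. The data (five points on $y = 2x + 1$) and the helper names are assumptions. The slice minimum uses the closed form obtained by setting $\partial L / \partial w = 0$ with $b$ held fixed.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y_hat = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def best_w_for_fixed_b(b, xs, ys):
    """Minimizer of the loss along the slice where b is frozen.

    Setting dL/dw = 0 gives w = sum(x * (y - b)) / sum(x * x).
    """
    return sum(x * (y - b) for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical data on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w_slice = best_w_for_fixed_b(0.0, xs, ys)
print(w_slice, mse_loss(w_slice, 0.0, xs, ys))  # best point on the b = 0 slice
print(mse_loss(2.0, 1.0, xs, ys))               # bottom of the whole bowl: 0.0
```

On this data the best $w$ along the slice comes out to $7/3$ with a loss of about $0.33$, so the lowest point reachable with $b$ frozen at $0$ is still above the true minimum, and it is not even at $w = 2$.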

§ 09

Vocabulary & the bigger picture

The core ML training vocabulary

x: input (a feature — pixel value, word embedding, sensor reading)
ŷ: prediction (model's output)
y: true label (ground truth from data)
w: weight (how much each input matters)
b: bias (baseline / offset, the model's "default" output)
L: loss (how wrong the predictions are)
∂L/∂w, ∂L/∂b: partial derivatives (which way to nudge each parameter)
∇L: gradient (vector of all partials — steepest-ascent direction)
η: learning rate (how big a nudge per step)
backpropagation: efficient algorithm for computing all partials at once via the chain rule
gradient descent: repeat the nudges until loss is minimized

The three steps every training loop performs

Forward pass. Feed input through the network. Compute prediction. Compute loss.
Backward pass. Compute $\partial L / \partial \theta$ for every parameter $\theta$ in the network — this is backpropagation, systematic application of partial derivatives + chain rule.
Update. For every parameter, $\theta \leftarrow \theta - \eta \cdot (\partial L / \partial \theta)$.

Repeat over many training examples (billions, for the largest models) until the loss stops improving. When you hear "the model is learning," what's literally happening is: the partial derivatives are telling each parameter which way to move, and the parameters are sliding downhill on the loss surface.
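For the one-input linear model from §07, the three steps fit in a short loop. This is a sketch under assumptions, not how a real framework is organized: the gradients are written out by hand for the squared-error loss instead of being produced by backpropagation, the parameters start at zero instead of random values, and the learning rate, step count, and data are arbitrary choices.

```python
def train(xs, ys, eta=0.05, steps=2000):
    """Fit y_hat = w * x + b by full-batch gradient descent on squared error."""
    w, b = 0.0, 0.0  # assumed starting point (real models start random)
    n = len(xs)
    loss = 0.0
    for _ in range(steps):
        # Forward pass: predictions, errors, and the loss.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n

        # Backward pass: partials of the loss with respect to w and b.
        dw = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        db = 2 * sum(errors) / n

        # Update: nudge each parameter against its partial derivative.
        w -= eta * dw
        b -= eta * db
    return w, b, loss  # loss is from the last forward pass


# Hypothetical data on y = 2x + 1; training should recover w near 2, b near 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b, final_loss = train(xs, ys)
print(round(w, 4), round(b, 4), final_loss)
```

In a network with many layers the forward and update steps look much the same. The backward pass is where backpropagation replaces the two handwritten formulas.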

§ 10

Is w derived from x?

No. They're completely different kinds of things.

$x$ is data — it comes from the outside world. You don't choose it; you're given it.
$w$ is a parameter — it's something the model owns and tunes. The whole point of training is to find good values for $w$.

The volume-knob analogy

Imagine a stereo. The music signal coming in is like $x$ — whatever the radio is broadcasting, you can't change it. The knob position is like $w$ — how much you're amplifying the signal. The sound coming out is like $\hat{y} = wx$. Different songs (different $x$'s) come and go through the radio. The knob ($w$) stays where you set it until you decide to turn it. The knob isn't "derived from" the music — it's a separate thing you tune to make the output sound right across all the music that comes through.

Where w actually comes from

Initialization. $w$ starts as a random number, usually a small one drawn from a chosen probability distribution. Neural networks begin life as random nonsense, producing garbage predictions.

Training. $w$ gets updated repeatedly using gradient descent: $w \leftarrow w - \eta \cdot (\partial L / \partial w)$. After enough updates, $w$ settles into a value that makes the model good.

So $w$'s final value is determined by the data plus the loss function plus gradient descent. It's informed by $x$ in the sense that the data shapes what the right value of $w$ should be — but $w$ isn't computed from any single $x$.

The contrast in one table

	x — input	w — weight
Where it comes from	Dataset / outside world	Random init, then updated by training
Who controls it	You can't change it	The training algorithm sets it
When it changes	New value for each example	Updates each training step
What it represents	A measurable feature	The model's learned knowledge
After training	Still arrives fresh from data	Frozen — the model "is" its weights

Once a model is trained, the weights are saved to disk — that file is the trained model. When you later use it on new data, you load those weights and feed in fresh $x$'s. The $x$'s keep changing; the $w$'s stay locked in place.
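The save, load, and predict cycle can be sketched with a toy file format. Everything specific here is an assumption for illustration: real frameworks use their own formats, and the JSON layout, the file name `model.json`, and the weight values $w = 2$, $b = 1$ are made up for this example.

```python
import json
import os
import tempfile


def save_weights(path, w, b):
    """Write the trained parameters to disk (toy JSON format)."""
    with open(path, "w") as fh:
        json.dump({"w": w, "b": b}, fh)


def load_weights(path):
    """Read the parameters back; no training data is needed."""
    with open(path) as fh:
        params = json.load(fh)
    return params["w"], params["b"]


def predict(w, b, x):
    """Inference: the weights stay fixed, only x varies."""
    return w * x + b


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")  # hypothetical file name
    save_weights(path, 2.0, 1.0)            # pretend these came from training
    w, b = load_weights(path)

# Fresh inputs arrive; the loaded weights do not change.
for x in (10.0, -3.0):
    print(x, predict(w, b, x))  # 10.0 -> 21.0, -3.0 -> -5.0
```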

A model is a bag of weights. Training figures out what those weights should be. Inference uses them on new x's.