Optimisation & Backpropagation

How a model turns errors into gradients, then gradients into parameter updates.

Gradient Descent

Move parameters in the direction opposite the gradient

loss = 1/2 * (w*x - target)^2
dloss/dw = x * (w*x - target)
w_new = w_old - learning_rate * dloss/dw

The gradient points in the direction that increases the loss. Subtracting it, scaled by the learning rate, moves the parameter toward lower loss.
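
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the values x = 2, target = 4, and learning_rate = 0.1 are assumptions chosen so that the ideal weight is 2.

```python
def loss(w, x, target):
    """Squared-error loss for a single weight: 1/2 * (w*x - target)^2."""
    return 0.5 * (w * x - target) ** 2


def grad(w, x, target):
    """Derivative of the loss with respect to w: x * (w*x - target)."""
    return x * (w * x - target)


def gradient_descent(w, x, target, learning_rate, steps):
    """Apply w_new = w_old - learning_rate * gradient for a fixed number of steps."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w, x, target)
        history.append(w)
    return history


# Illustrative values only: with x = 2 and target = 4 the ideal weight is 2.
weights = gradient_descent(w=0.0, x=2.0, target=4.0, learning_rate=0.1, steps=20)
print(weights[-1])  # close to 2.0
```

Each step shrinks the distance to the ideal weight by a constant factor here, which is a property of this one-weight quadratic loss rather than of gradient descent in general.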

Learning Rate

Step size controls speed and stability

Too small: updates move in the right direction, but training is slow.
Moderate: converges quickly.
Too large: oscillates or explodes.
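
The three regimes can be reproduced on the one-weight loss from the Gradient Descent section. The sketch below assumes x = 1 and target = 1. For this particular loss the error is multiplied by (1 - learning_rate * x^2) at every step, so the thresholds shown are specific to this toy problem.

```python
def signed_errors(learning_rate, steps=10, w=0.0, x=1.0, target=1.0):
    """Return the prediction error (w*x - target) after each update step."""
    errors = []
    for _ in range(steps):
        w = w - learning_rate * x * (w * x - target)
        errors.append(w * x - target)
    return errors


small = signed_errors(0.01)        # error shrinks by only 1% per step
moderate = signed_errors(0.5)      # error halves every step
oscillating = signed_errors(1.9)   # error flips sign every step, shrinking slowly
exploding = signed_errors(2.5)     # error flips sign and grows by 1.5x per step

for name, run in [("small", small), ("moderate", moderate),
                  ("oscillating", oscillating), ("exploding", exploding)]:
    print(name, [round(e, 3) for e in run[:4]], "... final:", round(run[-1], 3))
```

In real networks the stable range depends on the curvature of the loss, so it usually has to be found by experiment.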

Momentum

Momentum adds inertia

Momentum keeps part of the previous update direction. A moderate value can speed up convergence, but too much can make training unstable.
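
The page does not give a momentum formula, so the sketch below assumes one common formulation (classical, or "heavy-ball", momentum): a velocity term accumulates past updates and is added to the weight. The toy loss, the learning rate of 0.05, and the momentum values are illustrative assumptions.

```python
def signed_errors(momentum, learning_rate=0.05, steps=30, x=1.0, target=1.0):
    """Gradient descent with classical momentum on loss = 1/2 * (w*x - target)^2."""
    w, velocity = 0.0, 0.0
    errors = []
    for _ in range(steps):
        gradient = x * (w * x - target)
        # Keep a fraction of the previous update, then add the new gradient step.
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
        errors.append(w * x - target)
    return errors


no_momentum = signed_errors(0.0)
moderate = signed_errors(0.5)
high = signed_errors(0.99)

print("final |error|, momentum 0.0 :", abs(no_momentum[-1]))
print("final |error|, momentum 0.5 :", abs(moderate[-1]))
print("largest overshoot, momentum 0.99:", max(high))
```

In this toy setting the moderate value ends with a smaller error than plain gradient descent after the same number of steps, while the high value carries the weight well past the target before swinging back.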

Forward Pass

Forward pass calculates predictions

hidden_sum = bias_hidden + input * input_hidden_weights
hidden = activation(hidden_sum)
output = bias_output + hidden * hidden_output_weights

A forward pass moves from the input, through the hidden activations, to the output predictions.
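
The three formulas above can be written out for a tiny network. This sketch assumes two inputs, two hidden units with ReLU activation, and one output. The weight layout (input_hidden_weights[i][j] connects input i to hidden unit j) and all numeric values are illustrative assumptions.

```python
def relu(value):
    return max(0.0, value)


def forward(inputs, input_hidden_weights, bias_hidden, hidden_output_weights, bias_output):
    """Return (hidden_sum, hidden, output) for a one-hidden-layer network."""
    # hidden_sum[j] = bias_hidden[j] + sum over i of inputs[i] * input_hidden_weights[i][j]
    hidden_sum = [
        bias_hidden[j]
        + sum(inputs[i] * input_hidden_weights[i][j] for i in range(len(inputs)))
        for j in range(len(bias_hidden))
    ]
    hidden = [relu(s) for s in hidden_sum]
    # output = bias_output + sum over j of hidden[j] * hidden_output_weights[j]
    output = bias_output + sum(h * w for h, w in zip(hidden, hidden_output_weights))
    return hidden_sum, hidden, output


hidden_sum, hidden, output = forward(
    inputs=[1.0, 2.0],
    input_hidden_weights=[[0.5, -1.0], [0.25, 0.5]],
    bias_hidden=[0.0, -0.5],
    hidden_output_weights=[2.0, 3.0],
    bias_output=0.5,
)
print(hidden_sum)  # [1.0, -0.5]
print(hidden)      # [1.0, 0.0]  (ReLU zeroes the negative pre-activation)
print(output)      # 2.5
```

Returning the intermediate values as well as the output is deliberate: backpropagation needs them later.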

Backpropagation

Backpropagation moves gradients backward

output_gradient = prediction - target
weight_gradient = input_to_weight * output_gradient
hidden_gradient = sum(output_weights * output_gradients)
pre_activation_gradient = activation_derivative * hidden_gradient

For ReLU, a negative pre-activation gives a derivative of 0, so no gradient passes back through that hidden unit.
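
The four gradient formulas above, including the ReLU gating, can be traced on a small example. The sketch assumes a squared-error loss with a linear output (so that output_gradient = prediction - target), one input, two ReLU hidden units, and one output. The variable names w1, b1, w2, b2 and all values are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def relu_derivative(v):
    return 1.0 if v > 0 else 0.0


def forward(x, w1, b1, w2, b2):
    hidden_sum = [b + x * w for w, b in zip(w1, b1)]
    hidden = [relu(s) for s in hidden_sum]
    prediction = b2 + sum(h * w for h, w in zip(hidden, w2))
    return hidden_sum, hidden, prediction


def backward(x, target, w2, hidden_sum, hidden, prediction):
    # Gradient of 1/2 * (prediction - target)^2 with respect to the prediction.
    output_gradient = prediction - target
    # weight_gradient = input_to_weight * output_gradient
    w2_gradient = [h * output_gradient for h in hidden]
    b2_gradient = output_gradient
    # Only one output unit here, so the sum over outputs has a single term.
    hidden_gradient = [w * output_gradient for w in w2]
    pre_activation_gradient = [
        relu_derivative(s) * g for s, g in zip(hidden_sum, hidden_gradient)
    ]
    w1_gradient = [x * g for g in pre_activation_gradient]
    b1_gradient = pre_activation_gradient
    return w1_gradient, b1_gradient, w2_gradient, b2_gradient


x, target = 2.0, 1.0
w1, b1, w2, b2 = [0.5, -1.0], [0.0, 0.5], [1.5, -2.0], 0.25

hidden_sum, hidden, prediction = forward(x, w1, b1, w2, b2)
w1_gradient, b1_gradient, w2_gradient, b2_gradient = backward(
    x, target, w2, hidden_sum, hidden, prediction
)
print(hidden_sum)   # second unit has a negative pre-activation
print(w1_gradient)  # so its incoming weight receives zero gradient
```

The second hidden unit has a pre-activation of -1.5, so its entries in w1_gradient and b1_gradient are zero even though its outgoing weight is non-zero.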

Loss Functions

Different losses create different gradients

Squared error: 1/2 * (z - t)^2; its gradient with respect to z is z - t.
Binary cross entropy: -[t log y + (1-t) log(1-y)], common with sigmoid outputs.
Softmax cross entropy: common for multi-class, single-label (one-of-many) classification.
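
The three losses above can be sketched with their gradients. The function names are illustrative. The code relies on two standard simplifications: binary cross entropy on a sigmoid output has gradient y - t with respect to the pre-sigmoid value z, and softmax cross entropy has gradient (probabilities - one-hot target) with respect to the logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(z, t):
    return 0.5 * (z - t) ** 2


def squared_error_grad(z, t):
    return z - t


def binary_cross_entropy(y, t):
    """y is a predicted probability strictly between 0 and 1; t is 0 or 1."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


def bce_grad_wrt_logit(z, t):
    """Gradient of binary_cross_entropy(sigmoid(z), t) with respect to z."""
    return sigmoid(z) - t


def softmax(logits):
    shifted = [v - max(logits) for v in logits]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    return -math.log(softmax(logits)[target_index])


def softmax_ce_grad(logits, target_index):
    """Gradient with respect to each logit: probability minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


print(squared_error_grad(2.0, 0.5))           # 1.5
print(bce_grad_wrt_logit(0.3, 1.0))           # negative: push z up toward class 1
print(softmax_ce_grad([1.0, 2.0, 0.5], 1))    # negative only at the target index
```

All three gradients have the form "prediction minus target", which is why the output-layer gradient in the Backpropagation section looks the same for these common pairings.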

Parameter Update Checklist

A reliable order for manual backprop

Run the forward pass and store every intermediate value.
Compute the loss.
Start gradients at the output layer.
Move backward through weights, activations, and biases.
Update each parameter with old - learning_rate * gradient.
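
The checklist can be followed literally in code. The sketch below assumes the smallest possible network (one input, one ReLU hidden unit, one linear output) with a squared-error loss. The starting parameters, input, target, and learning rate are illustrative.

```python
def relu(v):
    return max(0.0, v)


def train_step(params, x, target, learning_rate):
    """One manual backprop step for a network with a single hidden ReLU unit."""
    w1, b1, w2, b2 = params

    # 1. Forward pass, keeping every intermediate value.
    hidden_sum = b1 + w1 * x
    hidden = relu(hidden_sum)
    prediction = b2 + w2 * hidden

    # 2. Loss (squared error is assumed here).
    loss = 0.5 * (prediction - target) ** 2

    # 3. Gradient at the output layer.
    output_gradient = prediction - target

    # 4. Move backward through weights, activation, and biases.
    w2_gradient = hidden * output_gradient
    b2_gradient = output_gradient
    hidden_gradient = w2 * output_gradient
    pre_activation_gradient = (1.0 if hidden_sum > 0 else 0.0) * hidden_gradient
    w1_gradient = x * pre_activation_gradient
    b1_gradient = pre_activation_gradient

    # 5. Update each parameter: old - learning_rate * gradient.
    gradients = (w1_gradient, b1_gradient, w2_gradient, b2_gradient)
    new_params = tuple(p - learning_rate * g for p, g in zip(params, gradients))
    return new_params, loss


params = (0.5, 0.1, 0.5, 0.0)  # illustrative starting values: w1, b1, w2, b2
losses = []
for _ in range(20):
    params, loss = train_step(params, x=1.0, target=2.0, learning_rate=0.1)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss before the first and the last update
```

All gradients are computed from the old parameter values before any parameter is changed, which is the usual source of errors when doing this by hand.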

Vanishing and Exploding Gradients

Deep chains multiply many derivatives

If many derivatives are smaller than 1 in magnitude, gradients shrink as they move backward. If many are larger than 1, gradients grow. This is why activation choice, initialisation, normalisation, and skip connections matter.
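
The effect of repeated multiplication is easy to see numerically. The sketch below makes a strong simplifying assumption: every layer contributes the same scalar derivative along one path. The value 0.25 is the largest derivative a sigmoid can have, and 1.5 is an arbitrary value above 1.

```python
def backprop_scale(local_derivative, depth):
    """Product of `depth` identical local derivatives along a single path."""
    scale = 1.0
    for _ in range(depth):
        scale *= local_derivative
    return scale


for depth in (5, 20, 50):
    shrinking = backprop_scale(0.25, depth)
    growing = backprop_scale(1.5, depth)
    print(f"depth {depth:2d}: 0.25^depth = {shrinking:.3e}, 1.5^depth = {growing:.3e}")
```

In a real network each factor is a matrix that also involves the weights, so the picture is less tidy, but the exponential dependence on depth is the same.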

New Practice Prompt

Mini exercise for this page

If w = 0.4, the gradient is -0.6, and the learning rate is 0.5, what is the new weight?

Answer: w_new = 0.4 - 0.5*(-0.6) = 0.7. Subtracting a negative gradient increases the parameter.