Optimisation & Backpropagation
How a model turns errors into gradients, then gradients into parameter updates.
Move parameters opposite the gradient
loss = 1/2 * (w*x - target)^2
d(loss)/dw = x * (w*x - target)
w_new = w_old - learning_rate * d(loss)/dw
The gradient points in the direction in which the loss increases. Stepping against it, by subtracting the gradient scaled by the learning rate, reduces the loss as long as the step is small enough.
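As a minimal sketch of the update rule above, the Python snippet below applies it repeatedly to the single-weight example. The function names and the values of x, target, and learning_rate are illustrative assumptions, not part of the original page.

```python
def loss(w, x, target):
    """Squared-error loss for a single linear unit: 1/2 * (w*x - target)^2."""
    return 0.5 * (w * x - target) ** 2


def grad(w, x, target):
    """Derivative of the loss with respect to w: x * (w*x - target)."""
    return x * (w * x - target)


def gradient_step(w, x, target, learning_rate):
    """One plain gradient-descent update: w_new = w_old - learning_rate * gradient."""
    return w - learning_rate * grad(w, x, target)


# Illustrative values (assumed for this sketch).
x, target, learning_rate = 2.0, 4.0, 0.1
w = 0.0
for step in range(5):
    print(f"step {step}: w = {w:.4f}, loss = {loss(w, x, target):.4f}")
    w = gradient_step(w, x, target, learning_rate)
```

With these values the weight moves toward 2.0, the value at which w*x equals the target, and the printed loss falls at every step.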
Step size controls speed and stability
- Too small: updates point in the right direction, but training is slow.
- Moderate: converges quickly.
- Too large: the loss oscillates or diverges.
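The three regimes can be seen on the same one-weight squared-error problem. This is a rough sketch: the helper name run and the specific learning rates are assumptions chosen for this toy problem, and the thresholds between regimes will differ for other losses and inputs.

```python
def run(learning_rate, steps=20, w=0.0, x=1.0, target=1.0):
    """Run gradient descent on 1/2 * (w*x - target)^2 and return the loss before each step."""
    history = []
    for _ in range(steps):
        error = w * x - target
        history.append(0.5 * error ** 2)
        w -= learning_rate * x * error  # w_new = w_old - learning_rate * gradient
    return history


# With x = 1 the error is multiplied by (1 - learning_rate) at every step.
for lr in (0.01, 0.5, 1.9, 2.1):
    losses = run(lr)
    print(f"lr={lr}: first loss {losses[0]:.4f}, last loss {losses[-1]:.4g}")
```

In this setup 0.01 makes slow progress, 0.5 converges quickly, 1.9 overshoots the target on every step while still shrinking the error slowly, and 2.1 makes the error grow.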
Momentum adds inertia
Momentum keeps part of the previous update direction. A moderate value can speed up convergence, but too much can make training unstable.
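One common way to write momentum is to keep a running velocity that mixes the previous update with the new gradient step. The sketch below assumes that formulation; the names descend and velocity, and all numeric values, are illustrative rather than taken from the page.

```python
def descend(momentum, learning_rate=0.05, steps=30, w=0.0, x=1.0, target=1.0):
    """Gradient descent with classical momentum on 1/2 * (w*x - target)^2.

    Returns the loss after `steps` updates.
    """
    velocity = 0.0
    for _ in range(steps):
        gradient = x * (w * x - target)
        # Keep a fraction of the previous update, then add the new gradient step.
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return 0.5 * (w * x - target) ** 2


for momentum in (0.0, 0.8, 0.99):
    print(f"momentum={momentum}: loss after 30 steps = {descend(momentum):.5f}")
```

On this toy problem a momentum of 0.8 reaches a lower loss than plain gradient descent in the same number of steps, while 0.99 overshoots the target and keeps oscillating around it, so it ends up worse than either.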
Forward pass calculates predictions
hidden_sum = bias_hidden + input * input_hidden_weights
hidden = activation(hidden_sum)
output = bias_output + hidden * hidden_output_weights
A forward pass moves from the input, through the hidden activations, to the output predictions.
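The three lines above can be sketched in plain Python for a network with one hidden layer and a single output. ReLU is assumed as the activation, the weight layout (one row of weights per hidden unit) is an assumption, and the numbers are made up for illustration.

```python
def relu(value):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, value)


def forward(inputs, input_hidden_weights, bias_hidden, hidden_output_weights, bias_output):
    """Forward pass for one hidden layer and one output unit.

    input_hidden_weights[j] holds the weights from every input to hidden unit j.
    Returns the intermediate values so a backward pass could reuse them.
    """
    hidden_sum = [
        b + sum(w * x for w, x in zip(weights, inputs))
        for weights, b in zip(input_hidden_weights, bias_hidden)
    ]
    hidden = [relu(s) for s in hidden_sum]
    output = bias_output + sum(w * h for w, h in zip(hidden_output_weights, hidden))
    return hidden_sum, hidden, output


# Illustrative values (assumed for this sketch).
hidden_sum, hidden, output = forward(
    inputs=[1.0, 2.0],
    input_hidden_weights=[[0.5, 0.25], [-1.0, 0.25]],
    bias_hidden=[0.0, 0.0],
    hidden_output_weights=[2.0, 3.0],
    bias_output=0.5,
)
print(hidden_sum, hidden, output)  # [1.0, -0.5] [1.0, 0.0] 2.5
```

The second hidden unit has a negative pre-activation, so ReLU outputs 0 and that unit contributes nothing to the prediction.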
Backpropagation moves gradients backward
output_gradient = prediction - target
weight_gradient = input_to_weight * output_gradient
hidden_gradient = sum(output_weights * output_gradients)
pre_activation_gradient = activation_derivative * hidden_gradient
For ReLU, a negative pre-activation gives a derivative of 0, so no gradient passes back through that hidden unit.
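The four gradient lines above can be sketched for a one-hidden-layer network with a single output and squared-error loss. With one output, the sum over output gradients reduces to a single term. The stored forward-pass values below are illustrative numbers, and the function and key names are assumptions.

```python
def backward(inputs, hidden_sum, hidden, prediction, target, hidden_output_weights):
    """Backward pass for one hidden ReLU layer and one linear output unit."""
    # Gradient of 1/2 * (prediction - target)^2 with respect to the prediction.
    output_gradient = prediction - target

    # Each output weight's gradient is the value that fed it times the output gradient.
    grad_hidden_output_weights = [h * output_gradient for h in hidden]

    # Gradient reaching each hidden activation (single output, so one term per unit).
    hidden_gradient = [w * output_gradient for w in hidden_output_weights]

    # ReLU derivative is 1 for positive pre-activations and 0 otherwise.
    pre_activation_gradient = [
        (1.0 if s > 0 else 0.0) * g for s, g in zip(hidden_sum, hidden_gradient)
    ]

    grad_input_hidden_weights = [[x * g for x in inputs] for g in pre_activation_gradient]
    return {
        "hidden_output_weights": grad_hidden_output_weights,
        "bias_output": output_gradient,
        "input_hidden_weights": grad_input_hidden_weights,
        "bias_hidden": pre_activation_gradient,
    }


# Values as a forward pass might have stored them (assumed for this sketch).
grads = backward(
    inputs=[1.0, 2.0],
    hidden_sum=[1.0, -0.5],
    hidden=[1.0, 0.0],
    prediction=2.5,
    target=1.5,
    hidden_output_weights=[2.0, 3.0],
)
print(grads["input_hidden_weights"])  # [[2.0, 4.0], [0.0, 0.0]]
```

The second hidden unit had a negative pre-activation, so every gradient flowing into its weights and bias is 0 even though its output weight is non-zero.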
Different losses create different gradients
- Squared error: 1/2 * (z - t)^2; the gradient with respect to z is z - t.
- Binary cross entropy: -[t log y + (1-t) log(1-y)], common with sigmoid outputs.
- Softmax cross entropy: common for one-of-many (single-label, multi-class) classification.
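The gradients of these losses can be checked numerically. The sketch below compares a finite-difference estimate with the closed forms. The squared-error gradient z - t is from the list above; the other two closed forms (y - t for sigmoid with binary cross entropy, and probabilities minus the one-hot target for softmax cross entropy, both taken with respect to the pre-activation) are standard results that the page does not state. All names and values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(z, t):
    return 0.5 * (z - t) ** 2


def binary_cross_entropy(z, t):
    """Binary cross entropy on y = sigmoid(z)."""
    y = sigmoid(z)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    return -math.log(softmax(logits)[target_index])


def numerical_gradient(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


z, t = 0.3, 1.0
se_numeric = numerical_gradient(lambda v: squared_error(v, t), z)
bce_numeric = numerical_gradient(lambda v: binary_cross_entropy(v, t), z)
print(se_numeric, z - t)             # squared error: z - t
print(bce_numeric, sigmoid(z) - t)   # sigmoid + BCE: y - t

logits, target_index = [1.0, 2.0, 0.5], 1
probs = softmax(logits)
sce_analytic = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
sce_numeric = []
for i in range(len(logits)):
    def loss_at(v, i=i):
        shifted = logits[:i] + [v] + logits[i + 1:]
        return softmax_cross_entropy(shifted, target_index)
    sce_numeric.append(numerical_gradient(loss_at, logits[i]))
print(sce_numeric, sce_analytic)     # softmax + CE: probabilities - one-hot target
```

In all three cases the gradient at the pre-activation has the form "prediction minus target", which is one reason these loss and output pairings are so common.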
A reliable manual backprop order
- Run the forward pass and store every intermediate value.
- Compute the loss.
- Start gradients at the output layer.
- Move backward through weights, activations, and biases.
- Update each parameter with old - learning_rate * gradient.
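The five steps above can be traced on the smallest possible network: one input, one hidden ReLU unit, one linear output. This is only a sketch; the parameter names w1, b1, w2, b2, the squared-error loss, and the starting values are assumptions made for illustration.

```python
def train_step(params, x, target, learning_rate):
    """One manual backprop step; returns the updated parameters and the loss before the update."""
    w1, b1, w2, b2 = params["w1"], params["b1"], params["w2"], params["b2"]

    # 1. Forward pass, keeping every intermediate value.
    hidden_sum = b1 + w1 * x
    hidden = max(0.0, hidden_sum)  # ReLU
    output = b2 + w2 * hidden

    # 2. Loss.
    loss = 0.5 * (output - target) ** 2

    # 3. Gradient at the output layer.
    output_gradient = output - target

    # 4. Move backward through weights, activation, and biases.
    grads = {"w2": hidden * output_gradient, "b2": output_gradient}
    hidden_gradient = w2 * output_gradient
    pre_activation_gradient = (1.0 if hidden_sum > 0 else 0.0) * hidden_gradient
    grads["w1"] = x * pre_activation_gradient
    grads["b1"] = pre_activation_gradient

    # 5. Update each parameter with old - learning_rate * gradient.
    new_params = {name: params[name] - learning_rate * grads[name] for name in params}
    return new_params, loss


# Illustrative starting point (assumed for this sketch).
params = {"w1": 0.5, "b1": 0.0, "w2": 1.0, "b2": 0.0}
params, first_loss = train_step(params, x=2.0, target=2.0, learning_rate=0.1)
params, second_loss = train_step(params, x=2.0, target=2.0, learning_rate=0.1)
print(first_loss, second_loss)
```

The second loss is smaller than the first, which is a quick sanity check that the signs and the order of the steps are right.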
Deep chains multiply many derivatives
If many of the derivatives in the chain are smaller than 1 in magnitude, gradients shrink as they move backward (vanishing gradients). If many are larger than 1, gradients grow (exploding gradients). This is why activation choice, initialisation, normalisation, and skip connections matter.
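A deliberately crude way to see the effect is to multiply the same per-layer derivative many times. The sketch below assumes every layer contributes an identical factor, which real networks do not; the sigmoid derivative at 0 (its maximum, 0.25) and the factor 1.5 are illustrative choices.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid, y * (1 - y); its largest value is 0.25 at z = 0."""
    y = 1.0 / (1.0 + math.exp(-z))
    return y * (1.0 - y)


def chain_product(derivatives):
    """Multiply the per-layer derivatives, as the chain rule does."""
    return math.prod(derivatives)


depth = 10
shrinking = chain_product([sigmoid_derivative(0.0)] * depth)  # each factor is 0.25
growing = chain_product([1.5] * depth)                        # each factor is 1.5
print(f"{depth} factors of 0.25: {shrinking:.3g}")
print(f"{depth} factors of 1.5:  {growing:.3g}")
```

Even at the sigmoid's best case, ten layers scale the gradient by less than one millionth, while ten factors of 1.5 scale it by more than 50.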
Mini exercise for this page
If w = 0.4, the gradient is -0.6, and the learning rate is 0.5, what is the new weight?
Answer: w_new = 0.4 - 0.5*(-0.6) = 0.7. Subtracting a negative gradient increases the parameter.