Formula Review

Formula Summary

A compact sheet of frequently used formulas for quick review and for checking that you actually understand each concept.

Topic	Formula	Memory hook
Perceptron	`score = w0 + w1x1 + w2x2`	The sign of the score decides the class.
Decision boundary	`w0 + w1x1 + w2x2 = 0`	Points on this line lie exactly on the classification boundary.
Perceptron update	`w = w + eta*x` or `w = w - eta*x`	Add for a misclassified positive, subtract for a misclassified negative.
OR	`bias = -0.5, weights = 1`	Any input equal to 1 makes the score positive.
AND	`bias = 0.5 - n, weights = 1`	All n inputs must be 1 for a positive output.
Negated OR clause	`negated literal weight = -1, bias = k - 0.5`	k is the number of negated literals in the clause; non-negated literals keep weight 1.
ReLU	`ReLU(u) = max(0, u)`	Negative values are cut to 0; positive values pass through unchanged.
Sigmoid	`sigmoid(u) = 1 / (1 + exp(-u))`	Squashes any real number into the range 0 to 1.
tanh	`tanh(u) = (exp(u)-exp(-u))/(exp(u)+exp(-u))`	Squashes any real number into the range -1 to 1.
Squared loss	`E = 1/2 * (prediction - target)^2`	The 1/2 cancels the 2 that comes from differentiating the square.
Squared loss gradient	`dE/dz = z - t`	Prediction z minus target t.
Gradient descent	`w_new = w_old - eta * gradient`	Move against the gradient.
Momentum idea	`velocity = mom*velocity + gradient`	Keep part of the previous update direction.
Bayes	`P(H|E) = P(E|H)P(H) / P(E)`	Update belief in the hypothesis after seeing the evidence.
Total probability	`P(E) = sum_i P(E|H_i)P(H_i)`	Add up every way the evidence can occur.
Entropy	`H(p) = sum p(x)[-log2 p(x)]`	Measures average uncertainty.
Cross entropy	`H(p,q) = -sum p(x)log q(x)`	Average surprise when q is used to predict the true distribution p.
Binary cross entropy	`E = -[t log(y) + (1-t)log(1-y)]`	Common loss for binary classification with a single sigmoid output.
KL divergence	`D_KL(p||q) = sum p(x)log[p(x)/q(x)]`	Extra coding cost of using q to approximate p.
Gaussian entropy	`H = 1/2 log|Sigma| + d/2 * (1 + log(2*pi))`	Covariance controls spread; for d = 2 the constant term is `1 + log(2*pi)`.
2D diagonal determinant	`|[[a,0],[0,b]]| = a*b`	Multiply the diagonal entries.
Trace	`Trace([[a,0],[0,b]]) = a + b`	Add the diagonal entries.
Diagonal inverse	`diag(a,b)^-1 = diag(1/a, 1/b)`	Take the reciprocal of each diagonal entry.
Gaussian KL to identity	`D_KL(N(mu,Sigma)||N(0,I)) = 1/2[||mu||^2 + Trace(Sigma) - log|Sigma| - d]`	Useful when the comparison Gaussian has zero mean and identity covariance.
Wasserstein 2D diagonal	`W2^2 = ||mu1-mu2||^2 + sum_i (sigma_i - tau_i)^2`	Compares the centres and the per-axis standard deviations.
Softmax	`Prob(i) = exp(zi) / sum_j exp(zj)`	Turns logits into probabilities that sum to 1.
Log softmax	`log Prob(k) = zk - log sum_j exp(zj)`	Used when deriving the cross-entropy gradient.
Correct-class log-softmax gradient	`d log Prob(k)/dz_k = 1 - Prob(k)`	The logit of the correct class is pushed up.
Other log-softmax gradients	`d log Prob(k)/dz_j = -Prob(j)` for `j != k`	The logits of incorrect classes are pushed down.
Identical input optimum	`z = positive_count / total_count`	The best single prediction is the fraction of positive examples in the sample.
Hidden unit pre-activation	`u = b + sum_i x_i*w_i`	Weighted sum before the activation function.
Backprop weight gradient	`dW = input * downstream_gradient`	The input feeding the weight times the gradient arriving from downstream.
Backprop through ReLU	`du = 0 if u <= 0, else dy`	A closed ReLU blocks the gradient.
Convolution size	`out = floor((in - filter) / stride) + 1`	No-padding version.
Conv filter parameters	`filter_w * filter_h * channels + 1`	One bias per filter.
Conv layer parameters	`(filter_w * filter_h * channels + 1) * filters`	Each filter has its own parameter set.
Conv layer neurons	`out_w * out_h * filters`	Every filter produces one feature map.
Conv connections	`neurons * (filter_w * filter_h * channels + 1)`	Count every use of a filter at every location.
Dense layer parameters	`(input_units + 1) * output_units`	Each output unit has a weight for every input plus one bias.
Batch normalisation	`x_hat = (x - mean) / sqrt(var + eps)`	Standardise the values within a batch.
Batch norm scale/shift	`y = gamma*x_hat + beta`	Then learn a useful scale and offset.
Residual block	`output = x + F(x)`	Shortcut: the original input plus a learned change.
Dense connection	`input_l = concat(x0, x1, ..., x_(l-1))`	Layer l takes all earlier feature maps as input, so later layers reuse earlier features directly.
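
The OR, AND, and negated-clause rows can be checked by brute force over all binary inputs. The sketch below assumes inputs in {0, 1} and that a strictly positive score means the positive class; the helper name `fires` and the example clause `x1 OR NOT x2 OR NOT x3` are illustrative choices, not part of the sheet.

```python
from itertools import product

def fires(bias, weights, inputs):
    """Return True when the score bias + sum(w_i * x_i) is strictly positive."""
    return bias + sum(w * x for w, x in zip(weights, inputs)) > 0

n = 3  # number of inputs (assumed for this example)

# OR: bias = -0.5, all weights = 1
or_ok = all(fires(-0.5, [1] * n, x) == any(x)
            for x in product([0, 1], repeat=n))

# AND: bias = 0.5 - n, all weights = 1
and_ok = all(fires(0.5 - n, [1] * n, x) == all(x)
             for x in product([0, 1], repeat=n))

# Clause (x1 OR NOT x2 OR NOT x3): two negated literals, so k = 2.
# Negated literals get weight -1, the plain literal keeps weight +1.
k = 2
clause_weights = [1, -1, -1]
clause_ok = all(
    fires(k - 0.5, clause_weights, x) == bool(x[0] or not x[1] or not x[2])
    for x in product([0, 1], repeat=3)
)

print(or_ok, and_ok, clause_ok)  # prints: True True True
```

The only input that makes the clause false is `(0, 1, 1)`, where the score is `-2 + 1.5 = -0.5`.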
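
The two log-softmax gradient rows can be sanity-checked against a finite-difference estimate. This is a minimal sketch using only the standard library; the logits `z`, the target index `k`, and the step size `h` are arbitrary values chosen for illustration.

```python
import math

def softmax(z):
    """Prob(i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax_k(z, k):
    """log Prob(k) = z_k - log sum_j exp(z_j)."""
    m = max(z)
    return z[k] - (m + math.log(sum(math.exp(v - m) for v in z)))

def analytic_grad(z, k):
    """1 - Prob(k) for the correct class, -Prob(j) for every other class."""
    p = softmax(z)
    return [(1.0 - p[j]) if j == k else -p[j] for j in range(len(z))]

def numeric_grad(z, k, h=1e-6):
    """Central-difference estimate of d log Prob(k) / d z_j for each j."""
    grad = []
    for j in range(len(z)):
        up, down = list(z), list(z)
        up[j] += h
        down[j] -= h
        grad.append((log_softmax_k(up, k) - log_softmax_k(down, k)) / (2 * h))
    return grad

z = [2.0, -1.0, 0.5]  # example logits (assumed)
k = 0                 # index of the correct class (assumed)
max_gap = max(abs(a - b)
              for a, b in zip(analytic_grad(z, k), numeric_grad(z, k)))
print(max_gap < 1e-6)  # prints: True
```

The analytic gradient sums to zero across classes, which matches the fact that adding a constant to every logit leaves the softmax unchanged.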
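
The convolution and dense-layer counting rows are easiest to remember with one worked case. The sketch below assumes a 32 x 32 input with 3 channels, six 5 x 5 filters, stride 1, and no padding; these numbers and the function names are illustrative only.

```python
def conv_output_size(in_size, filter_size, stride):
    """out = floor((in - filter) / stride) + 1, no padding."""
    return (in_size - filter_size) // stride + 1

def conv_layer_counts(in_w, in_h, channels, filter_w, filter_h, stride, filters):
    """Apply the sheet's counting formulas to one convolutional layer."""
    out_w = conv_output_size(in_w, filter_w, stride)
    out_h = conv_output_size(in_h, filter_h, stride)
    per_filter = filter_w * filter_h * channels + 1  # weights plus one bias
    neurons = out_w * out_h * filters                # one feature map per filter
    return {
        "out_w": out_w,
        "out_h": out_h,
        "params_per_filter": per_filter,
        "layer_params": per_filter * filters,
        "neurons": neurons,
        "connections": neurons * per_filter,
    }

def dense_params(input_units, output_units):
    """(input_units + 1) * output_units: every input weight plus a bias."""
    return (input_units + 1) * output_units

counts = conv_layer_counts(32, 32, 3, 5, 5, 1, 6)
print(counts["out_w"], counts["out_h"])  # prints: 28 28
print(counts["params_per_filter"])       # prints: 76
print(counts["layer_params"])            # prints: 456
print(counts["neurons"])                 # prints: 4704
print(counts["connections"])             # prints: 357504
print(dense_params(4704, 10))            # prints: 47050
```

The gap between 456 parameters and 357504 connections shows the effect of weight sharing: each filter's parameters are reused at every output location.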
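
The Gaussian rows (determinant, trace, entropy, KL to the standard normal, and Wasserstein distance) combine naturally for a 2D Gaussian with diagonal covariance. The sketch below assumes natural logarithms, a mean of (1, 2), and variances (4, 0.25); the function names and these values are illustrative only.

```python
import math

def diag_det(variances):
    """Determinant of a diagonal matrix: product of the diagonal entries."""
    return math.prod(variances)

def gaussian_entropy(variances):
    """H = 1/2 log|Sigma| + d/2 * (1 + log(2*pi)) for a diagonal Sigma."""
    d = len(variances)
    return 0.5 * math.log(diag_det(variances)) + 0.5 * d * (1 + math.log(2 * math.pi))

def kl_to_standard_normal(mu, variances):
    """D_KL(N(mu, Sigma) || N(0, I)) = 1/2[||mu||^2 + Trace(Sigma) - log|Sigma| - d]."""
    d = len(mu)
    sq_norm = sum(m * m for m in mu)
    trace = sum(variances)
    return 0.5 * (sq_norm + trace - math.log(diag_det(variances)) - d)

def w2_squared_diag(mu1, std1, mu2, std2):
    """W2^2 = ||mu1 - mu2||^2 + sum_i (sigma_i - tau_i)^2 for diagonal covariances."""
    centre = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    spread = sum((s - t) ** 2 for s, t in zip(std1, std2))
    return centre + spread

mu = [1.0, 2.0]
variances = [4.0, 0.25]                  # diagonal of Sigma, so |Sigma| = 1
stds = [math.sqrt(v) for v in variances]

print(kl_to_standard_normal(mu, variances))                # prints: 3.625
print(w2_squared_diag(mu, stds, [0.0, 0.0], [1.0, 1.0]))   # prints: 6.25
print(round(gaussian_entropy(variances), 4))               # prints: 2.8379
```

Because the determinant here is exactly 1, the log-determinant term vanishes, and the entropy reduces to the 2D constant `1 + log(2*pi)`.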