Probability

Probability & Information

A compact review of Bayesian updating, entropy, KL divergence, Gaussian uncertainty, and distances between distributions.

Bayes' Rule

Update belief after seeing evidence

P(H | E) = P(E | H) P(H) / P(E)

H: the hypothesis.
E: the observed evidence.
P(E): the total probability of seeing the evidence, summed over all possible causes.
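As a minimal sketch of the rule above, the function below handles the two-cause case, where the evidence comes either from H or from not-H. The function name, parameter names, and the numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) when the only alternatives are H and not-H."""
    # P(E): total probability of the evidence across both possible causes.
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    # Bayes' rule: P(H | E) = P(E | H) P(H) / P(E)
    return p_e_given_h * prior_h / p_e

# Hypothetical numbers: 1% prior, evidence seen 90% of the time under H
# and 5% of the time under not-H.
p = posterior(0.01, 0.90, 0.05)
print(round(p, 4))
```

Even with strong evidence, a small prior keeps the posterior modest, because most of P(E) comes from the far more common not-H case.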

Entropy

Entropy measures uncertainty

H(p) = sum_x p(x) * [-log2 p(x)]

High entropy means the outcomes are more spread out and harder to predict. Low entropy means the distribution is more concentrated. With log2, entropy is measured in bits.
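The formula can be checked on two small distributions. This is a sketch; the helper name `entropy_bits` and the example distributions are assumptions made for illustration, and zero-probability outcomes are skipped by the usual convention that 0 * log 0 = 0.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits: sum_x p(x) * [-log2 p(x)]."""
    # Outcomes with p(x) = 0 contribute nothing, so they are skipped.
    return sum(-px * math.log2(px) for px in p if px > 0)

uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])   # spread out
peaked = entropy_bits([0.97, 0.01, 0.01, 0.01])    # concentrated

print(uniform, peaked)
```

The uniform distribution over four outcomes gives exactly 2 bits, while the peaked one gives far less.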

KL Divergence

KL measures the extra cost of using the wrong distribution

D_KL(p || q) = sum_x p(x) log2[p(x) / q(x)]

KL is not symmetric. D_KL(p || q) and D_KL(q || p) usually have different meanings and different values.
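A short sketch makes the asymmetry concrete. The helper `kl_bits` and the two distributions are hypothetical. The code assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite and this simple version would raise a division error.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits for two discrete distributions on the same outcomes."""
    # Assumes q(x) > 0 wherever p(x) > 0. Terms with p(x) = 0 contribute nothing.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_bits(p, q)   # D_KL(p || q)
reverse = kl_bits(q, p)   # D_KL(q || p)
print(forward, reverse)
```

The two directions give different positive numbers, and the divergence of a distribution from itself is zero.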

Gaussian Entropy

For Gaussians, covariance controls entropy

H = 1/2 log|Sigma| + 1 + log(2*pi)   # for 2D Gaussian

Changing the mean only moves the distribution and leaves the entropy unchanged. Changing the covariance changes how spread out it is, and therefore the entropy. Here log is the natural logarithm, so H is measured in nats.
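The 2D formula can be evaluated directly from a 2x2 covariance matrix. This is a sketch under the assumption that the covariance is given as nested lists and is positive definite; the function name is hypothetical. Note that the mean never appears in the computation.

```python
import math

def gaussian_entropy_2d(cov):
    """Entropy in nats of a 2D Gaussian: 1/2 log|Sigma| + 1 + log(2*pi)."""
    (a, b), (c, d) = cov
    det = a * d - b * c          # |Sigma| for a 2x2 matrix
    return 0.5 * math.log(det) + 1.0 + math.log(2.0 * math.pi)

narrow = gaussian_entropy_2d([[1.0, 0.0], [0.0, 1.0]])
wide = gaussian_entropy_2d([[4.0, 0.0], [0.0, 4.0]])

print(narrow, wide)
```

Doubling the standard deviation in both directions multiplies |Sigma| by 16, which adds log(4) nats of entropy.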

Wasserstein Distance

A distance that compares both location and shape

W2^2 = ||mu1 - mu2||^2 + sum_i (sigma_i - tau_i)^2

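For Gaussians with diagonal covariance, compare the means and the standard deviations in each direction separately. Here mu1 and mu2 are the two means, and sigma_i and tau_i are the two distributions' standard deviations along direction i.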
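For the diagonal case, the formula reduces to two sums of squared differences. The sketch below assumes each Gaussian is described by a list of means and a list of per-direction standard deviations; the function name and the numbers are hypothetical.

```python
def w2_squared_diag(mu1, std1, mu2, std2):
    """Squared 2-Wasserstein distance between two diagonal-covariance Gaussians."""
    # Location term: squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Shape term: squared differences of the standard deviations, per direction.
    spread_term = sum((s - t) ** 2 for s, t in zip(std1, std2))
    return mean_term + spread_term

d2 = w2_squared_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [2.0, 1.0])
print(d2)
```

In this example the mean term contributes 25 and the spread term contributes 1, so location dominates the distance.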

Huffman Coding

Frequent symbols should get shorter codes

A distribution such as [1/2, 1/4, 1/8, 1/8] naturally suggests code lengths [1, 2, 3, 3], because each length equals -log2 of the symbol's probability. Entropy is the expected number of bits per symbol under an ideal code. For this distribution both the entropy and the average code length are 1.75 bits.
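The code lengths can be reproduced with the standard Huffman merge procedure: repeatedly combine the two least probable subtrees, which pushes every symbol inside them one level deeper. This is a minimal sketch that returns only code lengths, not the codewords themselves; the function name is hypothetical, and an alphabet with a single symbol would get length 0 here.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length of each symbol, in input order."""
    # Heap entries: (probability, unique tie-breaker, symbol indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol they contain.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
avg_bits = sum(p * n for p, n in zip(probs, lengths))
entropy = sum(-p * math.log2(p) for p in probs)
print(lengths, avg_bits, entropy)
```

Because every probability here is a power of 1/2, the average code length matches the entropy exactly. For other distributions the Huffman average is generally slightly above the entropy.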

Cross Entropy

Cross entropy measures how surprised the model is

Binary cross entropy:
E = -[t log(y) + (1-t) log(1-y)]

Multi-class cross entropy:
E = -sum_i target_i log(prob_i)

If the model assigns a low probability to the correct class, the cross entropy becomes large. This makes it a useful loss for training classifiers.
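Both formulas can be sketched in a few lines. The function names are hypothetical, log is the natural logarithm, and the code assumes the predicted probabilities lie strictly between 0 and 1 (practical implementations usually clip them to avoid log(0)).

```python
import math

def binary_cross_entropy(t, y):
    """E = -[t log(y) + (1-t) log(1-y)] for target t in {0, 1} and prediction y."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def cross_entropy(target, prob):
    """E = -sum_i target_i log(prob_i); target is typically one-hot."""
    # Classes with target_i = 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, prob) if t > 0)

confident = binary_cross_entropy(1, 0.8)   # correct class gets 0.8
wrong = binary_cross_entropy(1, 0.1)       # correct class gets only 0.1
multi = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
print(confident, wrong, multi)
```

With a one-hot target, the multi-class loss reduces to -log of the probability given to the correct class, so it matches the binary case when that probability is the same.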

Direction Matters

Why KL is not an ordinary distance

D_KL(p || q) averages the penalty over samples drawn from p. D_KL(q || p) averages it over samples drawn from q. If one distribution puts substantial mass where the other has almost none, the direction that samples from the first distribution can become much larger.
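A small numeric case illustrates the effect. In this sketch, p splits its mass evenly over two outcomes while q puts almost nothing on the second one. Both distributions and the helper name are hypothetical.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]        # half the mass sits on the second outcome
q = [0.999, 0.001]    # q gives the second outcome almost no mass

p_to_q = kl_bits(p, q)   # samples from p often land where q is tiny
q_to_p = kl_bits(q, p)   # samples from q rarely leave the first outcome
print(p_to_q, q_to_p)
```

Sampling from p hits the outcome that q nearly rules out half of the time, so D_KL(p || q) is several times larger than D_KL(q || p).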

New Practice Prompt

Mini exercise for this page

A classifier assigns probability 0.8 to the correct binary label. What is the cross entropy loss for that one example?

Answer: -log(0.8) ≈ 0.223, using the natural logarithm. The higher the probability assigned to the correct class, the lower the loss.