Week 10 · Floating Point & Text Encoding
Representing Numbers & Characters Reliably
IEEE-754 arithmetic, rounding pitfalls, ASCII/Unicode evolution, and UTF encoding strategies
🎯 Learning Objectives
- Decode IEEE-754 single- and double-precision bit patterns.
- Identify rounding errors, catastrophic cancellation, and special values (NaN, infinity).
- Compare text encodings (ASCII, Latin-1, UTF-8/16/32) and justify Unicode adoption.
- Convert between Unicode code points and UTF-8 byte sequences.
- Craft exam-ready explanations for floating-point equality pitfalls and locale-sensitive text handling.
🧭 Exam Alignment
- IEEE decoding: 22T2 Q8 asks you to convert bit strings to decimal values.
- Equality traps: Tutorial10 emphasises why d == d + 1 may be true for large doubles.
- UTF-8 mapping: Lab10 "utf8_encoder" mirrors final exam encoding tasks.
- Locale Q&A: past finals ask why ASCII cannot represent emoji or CJK characters.
- Cancellation reasoning: 20T3 Q7 includes catastrophic cancellation analysis.
Coverage goals: decode ≥5 floating-point examples, document rounding-error mitigation strategies, and practise encoding ≥4 Unicode characters into UTF-8.
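📚 Core Concepts
IEEE-754 Layout
A floating-point value is stored as three fields: sign, exponent, and mantissa (fraction). Single precision uses 1/8/23 bits; double precision uses 1/11/52 bits. The exponent is stored with a bias (127 for single, 1023 for double), and normalised values carry an implicit leading 1 that is not stored.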
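The 1/8/23 split can be made concrete with a few shifts and masks. The following is a minimal sketch, assuming `float` is an IEEE-754 single-precision value on the target platform; the type `float_fields` and the function `split_float` are hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single-precision value. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, still biased by 127 */
    uint32_t fraction;  /* 23 bits, without the implicit leading 1 */
} float_fields;

/* Assumes float is IEEE-754 binary32 (true on most current platforms). */
float_fields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to read the raw bits */

    float_fields out;
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

For 0.15625f (which is 1.25 × 2^-3) this yields sign 0, biased exponent 124, and fraction 0x200000; subtracting the bias gives 124 - 127 = -3.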
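Rounding & Cancellation
Limited precision means most results must be rounded to the nearest representable value. Catastrophic cancellation occurs when two nearly equal numbers are subtracted: the leading digits cancel, and the rounding error already present in the operands dominates the result.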
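A small experiment shows how the error appears. This sketch assumes `double` is IEEE-754 double precision with round-to-nearest; `subtract_near_one` is a hypothetical name. The subtraction itself is exact, but it exposes the rounding that happened when `1.0 + tiny` was stored.

```c
#include <float.h>

/* Returns (1 + tiny) - 1 computed in double precision.
   Assumes IEEE-754 doubles and the default round-to-nearest mode. */
double subtract_near_one(double tiny) {
    volatile double x = 1.0 + tiny;  /* the sum is rounded to a double here */
    return x - 1.0;                  /* exact, but reveals the earlier rounding error */
}
```

With `tiny = 1e-15` the result is `5 * DBL_EPSILON` (about 1.11e-15) rather than 1e-15, a relative error of roughly 11%. With `tiny = 1e-17` the result is exactly 0, because the sum rounds back to 1.0.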
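Special Values
Zero (signed), denormalised numbers, infinities, and NaNs each have a reserved bit pattern: an all-zero exponent field marks zero and denormals, while an all-one exponent field marks infinities (zero fraction) and NaNs (non-zero fraction). Comparisons follow special rules; in particular, NaN != NaN is true.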
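The comparison rules can be checked directly. This is a sketch under the assumption that the compiler follows IEEE-754 semantics (options such as `-ffast-math` may break the self-comparison trick); the function names are hypothetical, while `NAN`, `INFINITY`, and `DBL_MAX` are standard C macros.

```c
#include <float.h>    /* DBL_MAX */
#include <math.h>     /* NAN and INFINITY macros (C99) */
#include <stdbool.h>

/* NaN is the only value that compares unequal to itself. */
bool is_nan_by_self_compare(double x) {
    return x != x;
}

/* Infinities are the only values beyond the largest finite double. */
bool is_infinite_by_range(double x) {
    return x > DBL_MAX || x < -DBL_MAX;
}
```

Note also that `0.0 == -0.0` is true even though the two zeros have different sign bits, so equality of values does not imply equality of bit patterns.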
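From ASCII to Unicode
ASCII defines 128 codes (7 bits). The various "extended ASCII" sets, such as Latin-1, use the remaining 128 byte values for region-specific characters, so they are mutually incompatible. Unicode instead assigns every character a code point in the range U+0000 … U+10FFFF.
UTF-8 Encoding
UTF-8 is a variable-length encoding that uses 1–4 bytes per code point. It is ASCII compatible: code points up to U+007F are encoded as the same single byte. In multi-byte sequences the high bits of the lead byte (110, 1110, 11110) give the sequence length, and every continuation byte starts with 10.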
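The lead-byte and continuation-byte templates translate almost line for line into code. The following is a minimal sketch, not the lab's reference solution; `utf8_encode` and its return convention (0 for an unencodable value) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the UTF-8 encoding of code point cp into out and returns the
   number of bytes written. Returns 0 if cp is a surrogate (U+D800..U+DFFF)
   or lies above U+10FFFF, since neither can be encoded. */
size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp <= 0x7F) {                        /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC (the euro sign) encodes to E2 82 AC and U+00E9 to C3 A9.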
UTF-16 & Surrogates
UTF-16 uses 16-bit code units; characters beyond the BMP (above U+FFFF) are encoded as a surrogate pair of two units. Byte order is indicated by a BOM (U+FEFF) at the start of the stream, or fixed by the encoding label (UTF-16LE/UTF-16BE).
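The surrogate-pair arithmetic is short enough to sketch. The function below produces code units only and leaves byte order (LE/BE) and the BOM to the caller; `utf16_encode` and its return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the UTF-16 code units for cp into out and returns how many were
   written (1 or 2). Returns 0 for surrogate values or cp > U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved for surrogates */
    if (cp <= 0xFFFF) {                          /* BMP: one unit, value unchanged */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;

    uint32_t v = cp - 0x10000;                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (v >> 10));     /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

U+1F600 becomes the pair D83D DE00, while U+20AC stays a single unit 20AC.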
🧪 Worked Examples
Example 1: Decode an IEEE-754 Float
Bit pattern: 0 10000000 11000000000000000000000. The sign bit is 0; the exponent field is 128, so the exponent is 128 − 127 = 1; the fraction bits .11 give a mantissa of 1.75. Value: (-1)^0 × 1.75 × 2^1 = 3.5. (For comparison, 3.0 = 1.5 × 2^1 has fraction bits 1000…0.)
Remember to subtract the bias (127 for float) when reconstructing the exponent; see ./week10-encoding-floating.html#ieee-format.
Example 2: Encode U+1F600 in UTF-8
U+1F600 is above U+FFFF, so it needs the 4-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Its 21-bit value splits as 000 011111 011000 000000; filling the slots gives 11110000 10011111 10011000 10000000 (0xF0 0x9F 0x98 0x80).
Working through this by hand is good preparation for emoji-encoding exam tasks.
⚠️ Common Pitfalls
- Comparing doubles with ==/!= instead of using a tolerance.
- Assuming decimal fractions like 0.1 are represented exactly; in binary they are not.
- Treating UTF-8 as fixed width; slicing in the middle of a multi-byte sequence corrupts the character.
- Omitting the BOM when writing UTF-16 files, so that a reader assuming the other byte order decodes every code unit byte-swapped.
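The first two pitfalls are usually addressed with a tolerance-based comparison. The following is one possible sketch; `nearly_equal`, its parameters, and the idea of combining a relative and an absolute tolerance are illustrative choices, and suitable tolerance values depend on the computation.

```c
#include <stdbool.h>

static double abs_double(double x) {
    return x < 0 ? -x : x;
}

/* Hypothetical helper: true when a and b differ by at most abs_tol, or by
   at most rel_tol times the larger magnitude. Returns false if either
   argument is NaN, because every comparison involving NaN is false. */
bool nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    double diff  = abs_double(a - b);
    double mag_a = abs_double(a);
    double mag_b = abs_double(b);
    double scale = mag_a > mag_b ? mag_a : mag_b;
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

With IEEE-754 doubles, `0.1 + 0.2 == 0.3` is false, whereas `nearly_equal(0.1 + 0.2, 0.3, 1e-9, 1e-12)` is true.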
🛠️ Practice Task
Create encode_lab.c: accept floating-point literals and Unicode code points, then output their IEEE-754 bit patterns and UTF-8 byte sequences.
- Use memcpy into uint64_t/uint32_t to inspect raw bits.
- Implement a UTF-8 encoder based on lead-byte templates (1–4 bytes).
- Optional: add a UTF-16LE encoder and ASCII fallback logic.
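As a starting point for the memcpy step, here is a rough sketch of a bit-pattern formatter for doubles. It assumes `double` is IEEE-754 binary64; the name `double_to_bit_string` and the output layout (fields separated by spaces) are assumptions, not the format the task's autotests expect.

```c
#include <stdint.h>
#include <string.h>

/* Writes the 64 bits of d as '0'/'1' characters, most significant first,
   with a space after the sign bit and after the 11 exponent bits.
   buf must hold at least 67 bytes (64 bits + 2 spaces + terminator). */
void double_to_bit_string(double d, char buf[67]) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* copy the raw representation */

    size_t pos = 0;
    for (int i = 63; i >= 0; i--) {
        buf[pos++] = ((bits >> i) & 1u) ? '1' : '0';
        if (i == 63 || i == 52) {
            buf[pos++] = ' ';        /* field boundary: 1 | 11 | 52 */
        }
    }
    buf[pos] = '\0';
}
```

For 1.0 the string starts with `0 01111111111 ` followed by 52 zeros, which matches a biased exponent of 1023.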
🧪 Tutorial & Lab Mapping
Tutorial 10 Highlights
- IEEE decoding drills and catastrophic cancellation discussions.
- Why d == d+1 can be true: exploring the representable range and the spacing between adjacent doubles.
- Manual UTF-8 encoding practice (BMP and supplementary planes).
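The d == d+1 item can be tested in a few lines. This sketch assumes IEEE-754 doubles with round-to-nearest-even; `absorbs_plus_one` is a hypothetical name.

```c
#include <stdbool.h>

/* True when adding 1.0 to d leaves it unchanged after rounding to double.
   Assumes IEEE-754 binary64 and the default rounding mode. */
bool absorbs_plus_one(double d) {
    volatile double next = d + 1.0;  /* volatile forces rounding to double */
    return next == d;
}
```

A double has 53 significant bits, so every integer up to 2^53 = 9007199254740992 is representable. From 2^53 upwards adjacent doubles are at least 2 apart, so `absorbs_plus_one(9007199254740992.0)` is true while `absorbs_plus_one(9007199254740991.0)` is false.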
Lab 10 Programming Tasks
- float_bits.c: print bit patterns of float/double inputs.
- float_accuracy.c: illustrate rounding/cancellation scenarios.
- utf8_encoder.c: convert code points to UTF-8.
- utf_sanitiser.c (challenge): validate UTF-8 streams and report invalid sequences.
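For the challenge task, validation amounts to decoding each sequence and rejecting anything malformed. The following is a sketch of one way to do it, not the lab's expected interface; `utf8_first_invalid` and its "index of first bad sequence" convention are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index where the first invalid sequence starts, or len if
   the whole buffer is well-formed UTF-8. Rejects stray continuation
   bytes, truncated sequences, overlong forms, surrogates, and values
   above U+10FFFF. */
size_t utf8_first_invalid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t need;      /* continuation bytes expected after the lead byte */
        uint32_t cp;      /* code point being decoded */
        uint32_t min;     /* smallest code point legal for this length */

        if (b < 0x80) { i++; continue; }     /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return i;                       /* stray continuation or bad lead byte */

        if (len - i <= need) return i;       /* sequence runs past the end */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return i;                        /* overlong encoding */
        if (cp > 0x10FFFF) return i;                   /* outside Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;    /* surrogate */
        i += need + 1;
    }
    return len;
}
```

A caller can compare the result with the buffer length: equal means the stream is valid, anything smaller is the offset to report.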
📝 Study Log
- Inputs shared: floating_point.pdf, unicode.pdf, Lab10 spec, Tutorial10 sheet, exam archives.
- Prompt: "Combine floating-point decoding tables with UTF cheat-sheets for rapid revision."
- Breakthrough: Visualising how the exponent scales the 52-bit mantissa clarified why very large doubles cannot register a +1: the gap between adjacent values exceeds 1.
- Misconception fixed: Previously thought UTF-8 needs a BOM; UTF-8 has no byte-order ambiguity, so only UTF-16/32 use a BOM to indicate endianness.
- Action items: Practise encoding random Unicode code points, review decimal approximation techniques, and prepare a cheat-sheet table.
Premium Quiz: 40 Questions on Floating Point & Unicode
28 basic (IEEE/ASCII basics) · 8 intermediate (error analysis & encoding) · 4 advanced (combined scenarios)
🔭 Next Steps
Week 11 onwards focuses on final consolidation: revisit high-weight quizzes, compile formula cheat-sheets, and simulate final exam timing.
📎 Resources & Checklist
- PPT: floating_point.pdf p1–28, unicode.pdf p1–46.
- Autotest: 1521 autotest lab10_float_bits, lab10_float_accuracy, lab10_utf8_encoder, lab10_utf_sanitiser.
- Self-check: Can you convert U+20AC to UTF-8 quickly? Can you explain why NaN breaks equality checks?