Lab 08: Text Processing & Simple ML

1. Defining a "Word"

In this lab, a word is a maximal, non-empty, contiguous sequence of alphabetic characters [a-zA-Z]. Any non-alphabetic character separates words.

  • "The soul's desire" contains 4 words: The, soul, s, desire
  • "hello-world" contains 2 words: hello, world
  • "123abc!def" contains 2 words: abc, def
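All three examples can be checked directly with Python's standard re module (a quick sketch using only the lab's own pattern):

```python
import re

# Verify the word definition on the examples above:
# a word is a maximal run of [a-zA-Z] characters.
for text in ["The soul's desire", "hello-world", "123abc!def"]:
    words = re.findall(r'[a-zA-Z]+', text)
    print(f"{text!r}: {len(words)} words -> {words}")
```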

Use re.findall() with the pattern [a-zA-Z]+ to extract all words:

```python
import re

text = "The soul's desire"
words = re.findall(r'[a-zA-Z]+', text)  # ['The', 'soul', 's', 'desire']
```

⚠️ Don't use split()!

split() splits on whitespace but doesn't handle punctuation correctly: splitting "soul's" gives ["soul's"] (1 item), not 2 words. Always use re.findall(r'[a-zA-Z]+', text).
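A quick side-by-side comparison (using only the standard library) shows the difference:

```python
import re

text = "The soul's desire"
print(text.split())                    # ['The', "soul's", 'desire'] - 3 items, apostrophe kept
print(re.findall(r'[a-zA-Z]+', text))  # ['The', 'soul', 's', 'desire'] - 4 words per the lab's definition
```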

2. Counting Total Words (total_words.py)

Read from stdin and count all words using our definition:

```python
#!/usr/bin/env python3
import sys
import re

text = sys.stdin.read()
words = re.findall(r'[a-zA-Z]+', text)
print(f"{len(words)} words")
```

Key: sys.stdin.read()

sys.stdin.read() reads ALL of stdin as one string. Then re.findall() extracts every word. len() gives the count.
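The same read-then-count logic can be exercised without a real pipe by substituting io.StringIO for sys.stdin (the helper function here is for testing only; the script itself reads real stdin):

```python
import io
import re

def count_words(stream):
    """Read the whole stream and count words per the lab's definition."""
    text = stream.read()
    return len(re.findall(r'[a-zA-Z]+', text))

fake_stdin = io.StringIO("The soul's desire\nhello-world\n")
print(f"{count_words(fake_stdin)} words")  # 4 + 2 = 6 words
```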

3. Counting a Specific Word (count_word.py)

Count how many times a specific word (given on the command line) appears. Case-insensitive!

```python
#!/usr/bin/env python3
import sys
import re

target = sys.argv[1].lower()
text = sys.stdin.read()
words = re.findall(r'[a-zA-Z]+', text)

count = 0
for word in words:
    if word.lower() == target:
        count += 1

print(f"{target} occurred {count} times")
```

Why .lower()?

The search is case-insensitive. Convert both the target and each word to lowercase before comparing. "Death", "death", "DEATH" should all match.
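collections.Counter offers an equivalent, shorter variant (a sketch of an alternative; the explicit loop above is what the lab expects):

```python
import re
from collections import Counter

text = "Death, death, and DEATH again"
target = "death"

# Lowercase every word once, then look the target up in the Counter.
counts = Counter(w.lower() for w in re.findall(r'[a-zA-Z]+', text))
print(f"{target} occurred {counts[target]} times")  # death occurred 3 times
```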

4. Frequency Analysis & Dict-of-Dicts (frequency.py)

Calculate how frequently each artist uses a given word. This requires a dict of dicts: a dictionary whose values are themselves dictionaries.

Data Structure

```python
# Dict-of-dicts: {artist: {word: count}}
word_counts = {}
total_words = {}

# Example structure:
# word_counts = {
#     "Justin Bieber": {"love": 493, "baby": 800, ...},
#     "Metallica": {"death": 69, "fire": 45, ...},
#     ...
# }
```

Using glob to Find Files

```python
import glob

for file in glob.glob("lyrics/*.txt"):
    # file = "lyrics/Justin_Bieber.txt"
    name = file[len("lyrics/"):-len(".txt")]
    artist = name.replace("_", " ")  # "Justin Bieber"
    word_counts[artist] = {}
    total = 0
    with open(file) as f:
        text = f.read()
    words = re.findall(r'[a-zA-Z]+', text)
    for word in words:
        total += 1
        w = word.lower()
        if w in word_counts[artist]:
            word_counts[artist][w] += 1
        else:
            word_counts[artist][w] = 1
    total_words[artist] = total
```

Calculating & Printing Frequency

```python
target = sys.argv[1].lower()

for artist in sorted(word_counts):
    count = word_counts[artist].get(target, 0)
    total = total_words[artist]
    frequency = count / total
    print(f"{count:4}/{total:6} = {frequency:.9f} {artist}")
```

Dict-of-Dicts Pattern

Outer dict = rows (artists). Inner dict = columns (word counts). Access: word_counts[artist][word]. Use .get(key, 0) to handle missing words gracefully.
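collections.defaultdict can replace the if/else counting inside the file loop (an alternative sketch, not required by the lab):

```python
from collections import defaultdict

# Outer dict maps artist -> inner dict mapping word -> count.
# defaultdict(int) makes the inner dict return 0 for unseen keys,
# so counting needs no membership test.
word_counts = {}
artist = "Justin Bieber"
word_counts[artist] = defaultdict(int)

for word in ["baby", "Baby", "love", "baby"]:
    word_counts[artist][word.lower()] += 1

print(dict(word_counts[artist]))  # {'baby': 3, 'love': 1}
```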

5. Log Probability (log_probability.py)

To determine which artist is most likely to sing a phrase, we multiply the probabilities of each word. But multiplying many small numbers causes underflow. Solution: use logs, so multiplication becomes addition.

```python
import math

# Instead of: P(phrase) = P(word1) * P(word2) * P(word3)
# We compute: log P(phrase) = log P(word1) + log P(word2) + log P(word3)
```

Add-1 (Laplace) Smoothing

If an artist never uses a word, the probability is 0, which makes the entire product 0. To avoid this, add 1 to every count:

```python
# Without smoothing: P(word|artist) = count / total
# With +1 smoothing:  P(word|artist) = (count + 1) / total
count = word_counts[artist].get(word, 0) + 1  # add 1
total = total_words[artist]                   # DON'T add vocab_size!
log_prob += math.log(count / total)
```

⚠️ Smoothing Gotcha

The lab says (0+1)/18205: the denominator is the raw total word count, NOT total + vocab_size. Only add 1 to the numerator (the count).
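Plugging the lab's own numbers in (count 0, total 18205) makes this concrete:

```python
import math

count = 0       # the artist never uses this word
total = 18205   # the artist's raw total word count, per the lab

# Correct: add 1 to the numerator only; keep the raw total as denominator.
p = (count + 1) / total
print(f"P     = {p:.9f}")   # P     = 0.000054930
print(f"log P = {math.log(p):.5f}")

# Without smoothing, count/total would be 0 and math.log(0) raises ValueError.
```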

Complete Implementation

```python
#!/usr/bin/env python3
import sys
import re
import glob
import math

# ... (build word_counts and total_words as before) ...

target_words = sys.argv[1:]
for artist in sorted(word_counts):
    log_prob = 0.0
    for word in target_words:
        w = word.lower()
        count = word_counts[artist].get(w, 0) + 1
        total = total_words[artist]
        log_prob += math.log(count / total)
    print(f"{log_prob:10.5f} {artist}")
```

Why Log?

  • Probabilities become tiny: 0.0001 * 0.0003 * 0.00005 = 1.5e-12, an underflow risk
  • Log values stay manageable: log(0.0001) + log(0.0003) + log(0.00005) ≈ -27.23 (natural log)
  • Higher log-prob = more likely (closer to 0 = better)
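The underflow risk is easy to demonstrate: multiplying 200 per-word probabilities of 0.0001 underflows to exactly 0.0, while the log-sum stays an ordinary float (a minimal sketch):

```python
import math

p = 0.0001  # a typical per-word probability

product = 1.0
log_sum = 0.0
for _ in range(200):        # a 200-word song
    product *= p
    log_sum += math.log(p)

print(product)   # 0.0 -- underflowed: 1e-800 is far below float's ~1e-308 limit
print(log_sum)   # about -1842.07, still perfectly representable
```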

6. Artist Identification (identify_artist.py)

Given mystery song files, identify the most likely artist using Naive Bayes: find the artist with the highest log-probability for all words in the song.

```python
#!/usr/bin/env python3
import sys
import re
import glob
import math

# ... (build word_counts and total_words as before) ...

for filename in sys.argv[1:]:
    with open(filename) as f:
        text = f.read()
    words = re.findall(r'[a-zA-Z]+', text)

    best_artist = None
    best_log_prob = float('-inf')
    for artist in word_counts:
        log_prob = 0.0
        for word in words:
            w = word.lower()
            count = word_counts[artist].get(w, 0) + 1
            total = total_words[artist]
            log_prob += math.log(count / total)
        if log_prob > best_log_prob:
            best_log_prob = log_prob
            best_artist = artist

    print(f"(unknown) most resembles the work of {best_artist} "
          f"(log-probability={best_log_prob:.1f})")
```

Key Differences from log_probability.py

  • Words come from song files (whose names are in sys.argv), not directly from command-line arguments
  • Repeated words add their log-probability multiple times
  • Only print the best artist (highest log-prob) per file
  • No glob needed: the shell expands song?.txt

7. The Complete Pattern

All 5 exercises build on the same foundation. Here's how they relate:

| Exercise | Input | Core Logic |
|---|---|---|
| total_words.py | stdin | len(re.findall(r'[a-zA-Z]+', text)) |
| count_word.py | stdin + argv[1] | Count words matching target (case-insensitive) |
| frequency.py | lyrics/ + argv[1] | Dict-of-dicts, count / total |
| log_probability.py | lyrics/ + argv words | sum(log((count+1)/total)) |
| identify_artist.py | lyrics/ + song files | Find max log-prob artist per file |
