Lab 08: Text Processing & Simple ML

1. Defining a "Word"

In this lab, a word is a maximal, non-empty, contiguous sequence of alphabetic characters [a-zA-Z]. Any non-alphabetic character separates words.

  • "The soul's desire" contains 4 words: The, soul, s, desire
  • "hello-world" contains 2 words: hello, world
  • "123abc!def" contains 2 words: abc, def
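All three examples can be checked directly with Python's standard re module (a quick sketch using only the lab's own pattern):

```python
import re

# Verify the word definition on the examples above:
# a word is a maximal run of [a-zA-Z] characters.
for text in ["The soul's desire", "hello-world", "123abc!def"]:
    words = re.findall(r'[a-zA-Z]+', text)
    print(f"{text!r}: {len(words)} words -> {words}")
```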

Use re.findall() with the pattern [a-zA-Z]+ to extract all words:

```python
import re

text = "The soul's desire"
words = re.findall(r'[a-zA-Z]+', text)  # ['The', 'soul', 's', 'desire']
```

⚠️ Don't use split()!

split() splits on whitespace but doesn't handle punctuation correctly: splitting "soul's" gives ["soul's"] (1 item), not 2 words. Always use re.findall(r'[a-zA-Z]+', text).
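A quick side-by-side comparison (using only the standard library) shows the difference:

```python
import re

text = "The soul's desire"
print(text.split())                    # ['The', "soul's", 'desire'] - 3 items, apostrophe kept
print(re.findall(r'[a-zA-Z]+', text))  # ['The', 'soul', 's', 'desire'] - 4 words per the lab's definition
```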

2. Counting Total Words (total_words.py)

Read from stdin and count all words using our definition:

```python
#!/usr/bin/env python3
import sys
import re

text = sys.stdin.read()
words = re.findall(r'[a-zA-Z]+', text)
print(f"{len(words)} words")
```

Key: sys.stdin.read()

sys.stdin.read() reads ALL of stdin as one string. Then re.findall() extracts every word. len() gives the count.
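The same read-then-count logic can be exercised without a real pipe by substituting io.StringIO for sys.stdin (the helper function here is for testing only; the script itself reads real stdin):

```python
import io
import re

def count_words(stream):
    """Read the whole stream and count words per the lab's definition."""
    text = stream.read()
    return len(re.findall(r'[a-zA-Z]+', text))

fake_stdin = io.StringIO("The soul's desire\nhello-world\n")
print(f"{count_words(fake_stdin)} words")  # 4 + 2 = 6 words
```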

3. Counting a Specific Word (count_word.py)

Count how many times a specific word (given on the command line) appears. Case-insensitive!

```python
#!/usr/bin/env python3
import sys
import re

target = sys.argv[1].lower()
text = sys.stdin.read()
words = re.findall(r'[a-zA-Z]+', text)

count = 0
for word in words:
    if word.lower() == target:
        count += 1

print(f"{target} occurred {count} times")
```

Why .lower()?

The search is case-insensitive. Convert both the target and each word to lowercase before comparing. "Death", "death", "DEATH" should all match.
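collections.Counter offers an equivalent, shorter variant (a sketch of an alternative; the explicit loop above is what the lab expects):

```python
import re
from collections import Counter

text = "Death, death, and DEATH again"
target = "death"

# Lowercase every word once, then look the target up in the Counter.
counts = Counter(w.lower() for w in re.findall(r'[a-zA-Z]+', text))
print(f"{target} occurred {counts[target]} times")  # death occurred 3 times
```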

4. Frequency Analysis & Dict-of-Dicts (frequency.py)

Calculate how frequently each artist uses a given word. This requires a dict of dicts: a dictionary whose values are themselves dictionaries.

Data Structure

```python
# Dict-of-dicts: {artist: {word: count}}
word_counts = {}
total_words = {}

# Example structure:
# word_counts = {
#     "Justin Bieber": {"love": 493, "baby": 800, ...},
#     "Metallica": {"death": 69, "fire": 45, ...},
#     ...
# }
```

Using glob to Find Files

```python
import glob

for file in glob.glob("lyrics/*.txt"):
    # file = "lyrics/Justin_Bieber.txt"
    name = file[len("lyrics/"):-len(".txt")]
    artist = name.replace("_", " ")  # "Justin Bieber"
    word_counts[artist] = {}
    total = 0
    with open(file) as f:
        text = f.read()
    words = re.findall(r'[a-zA-Z]+', text)
    for word in words:
        total += 1
        w = word.lower()
        if w in word_counts[artist]:
            word_counts[artist][w] += 1
        else:
            word_counts[artist][w] = 1
    total_words[artist] = total
```

Calculating & Printing Frequency

```python
target = sys.argv[1].lower()

for artist in sorted(word_counts):
    count = word_counts[artist].get(target, 0)
    total = total_words[artist]
    frequency = count / total
    print(f"{count:4}/{total:6} = {frequency:.9f} {artist}")
```

Dict-of-Dicts Pattern

Outer dict = rows (artists). Inner dict = columns (word counts). Access: word_counts[artist][word]. Use .get(key, 0) to handle missing words gracefully.
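collections.defaultdict can replace the if/else counting inside the file loop (an alternative sketch, not required by the lab):

```python
from collections import defaultdict

# Outer dict maps artist -> inner dict mapping word -> count.
# defaultdict(int) makes the inner dict return 0 for unseen keys,
# so counting needs no membership test.
word_counts = {}
artist = "Justin Bieber"
word_counts[artist] = defaultdict(int)

for word in ["baby", "Baby", "love", "baby"]:
    word_counts[artist][word.lower()] += 1

print(dict(word_counts[artist]))  # {'baby': 3, 'love': 1}
```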

5. Log Probability (log_probability.py)

To determine which artist is most likely to sing a phrase, we multiply the probabilities of each word. But multiplying many small numbers causes underflow. Solution: use logs, so multiplication becomes addition.

```python
import math

# Instead of: P(phrase) = P(word1) * P(word2) * P(word3)
# We compute: log P(phrase) = log P(word1) + log P(word2) + log P(word3)
```

Add-1 (Laplace) Smoothing

If an artist never uses a word, the probability is 0, which makes the entire product 0. To avoid this, add 1 to every count:

```python
# Without smoothing: P(word|artist) = count / total
# With +1 smoothing:  P(word|artist) = (count + 1) / total
count = word_counts[artist].get(word, 0) + 1  # add 1
total = total_words[artist]                   # DON'T add vocab_size!
log_prob += math.log(count / total)
```

⚠️ Smoothing Gotcha

The lab says (0+1)/18205: the denominator is the raw total word count, NOT total + vocab_size. Only add 1 to the numerator (the count).
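Plugging the lab's own numbers in (count 0, total 18205) makes this concrete:

```python
import math

count = 0       # the artist never uses this word
total = 18205   # the artist's raw total word count, per the lab

# Correct: add 1 to the numerator only; keep the raw total as denominator.
p = (count + 1) / total
print(f"P     = {p:.9f}")   # P     = 0.000054930
print(f"log P = {math.log(p):.5f}")

# Without smoothing, count/total would be 0 and math.log(0) raises ValueError.
```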

Complete Implementation

```python
#!/usr/bin/env python3
import sys
import re
import glob
import math

# ... (build word_counts and total_words as before) ...

target_words = sys.argv[1:]
for artist in sorted(word_counts):
    log_prob = 0.0
    for word in target_words:
        w = word.lower()
        count = word_counts[artist].get(w, 0) + 1
        total = total_words[artist]
        log_prob += math.log(count / total)
    print(f"{log_prob:10.5f} {artist}")
```

Why Log?

  • Probabilities become tiny: 0.0001 * 0.0003 * 0.00005 = 1.5e-12, an underflow risk
  • Log values stay manageable: log(0.0001) + log(0.0003) + log(0.00005) ≈ -27.23 (natural log)
  • Higher log-prob = more likely (closer to 0 = better)
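The underflow risk is easy to demonstrate: multiplying 200 per-word probabilities of 0.0001 underflows to exactly 0.0, while the log-sum stays an ordinary float (a minimal sketch):

```python
import math

p = 0.0001  # a typical per-word probability

product = 1.0
log_sum = 0.0
for _ in range(200):        # a 200-word song
    product *= p
    log_sum += math.log(p)

print(product)   # 0.0 -- underflowed: 1e-800 is far below float's ~1e-308 limit
print(log_sum)   # about -1842.07, still perfectly representable
```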

6. Artist Identification (identify_artist.py)

Given mystery song files, identify the most likely artist using Naive Bayes: find the artist with the highest log-probability for all words in the song.

```python
#!/usr/bin/env python3
import sys
import re
import glob
import math

# ... (build word_counts and total_words as before) ...

for filename in sys.argv[1:]:
    with open(filename) as f:
        text = f.read()
    words = re.findall(r'[a-zA-Z]+', text)

    best_artist = None
    best_log_prob = float('-inf')
    for artist in word_counts:
        log_prob = 0.0
        for word in words:
            w = word.lower()
            count = word_counts[artist].get(w, 0) + 1
            total = total_words[artist]
            log_prob += math.log(count / total)
        if log_prob > best_log_prob:
            best_log_prob = log_prob
            best_artist = artist

    print(f"(unknown) most resembles the work of {best_artist} "
          f"(log-probability={best_log_prob:.1f})")
```

Key Differences from log_probability.py

  • Words come from song files (whose names are in sys.argv), not directly from command-line arguments
  • Repeated words add their log-probability multiple times
  • Only print the best artist (highest log-prob) per file
  • No glob needed: the shell expands song?.txt

7. The Complete Pattern

All 5 exercises build on the same foundation. Here's how they relate:

| Exercise | Input | Core Logic |
|---|---|---|
| total_words.py | stdin | len(re.findall(r'[a-zA-Z]+', text)) |
| count_word.py | stdin + argv[1] | Count words matching target (case-insensitive) |
| frequency.py | lyrics/ + argv[1] | Dict-of-dicts, count / total |
| log_probability.py | lyrics/ + argv words | sum(log((count+1)/total)) |
| identify_artist.py | lyrics/ + song files | Find max log-prob artist per file |
