In this lab, a word is a maximal, non-empty, contiguous sequence of alphabetic characters [a-zA-Z]. Any non-alphabetic character separates words.
"The soul's desire" contains 4 words: The, soul, s, desire"The soul's desire" 包含 4 个单词:The, soul, s, desire"hello-world" contains 2 words: hello, world"hello-world" 包含 2 个单词:hello, world"123abc!def" contains 2 words: abc, def"123abc!def" 包含 2 个单词:abc, defUse re.findall() with the pattern [a-zA-Z]+ to extract all words:使用 re.findall() 和模式 [a-zA-Z]+ 来提取所有单词:
split() splits on whitespace only, so punctuation stays attached to the words. "soul's".split() gives ["soul's"] (1 item), not the 2 words soul and s. Always use re.findall(r'[a-zA-Z]+', text).
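As a quick check of the definition, the sketch below runs the pattern over the three examples and contrasts it with str.split(). The helper name extract_words is only an illustrative choice, not something the lab prescribes.

```python
import re


def extract_words(text):
    """Return every maximal run of ASCII letters in text, in order."""
    return re.findall(r"[a-zA-Z]+", text)


print(extract_words("The soul's desire"))  # ['The', 'soul', 's', 'desire']
print(extract_words("hello-world"))        # ['hello', 'world']
print(extract_words("123abc!def"))         # ['abc', 'def']

# str.split() only breaks on whitespace, so the apostrophe stays inside the item
print("soul's".split())                    # ["soul's"]
print(extract_words("soul's"))             # ['soul', 's']
```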
Read from stdin and count all words using our definition.
sys.stdin.read() reads ALL of stdin as one string. Then re.findall() extracts every word, and len() gives the count.
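The three steps above might be put together as follows. This is a minimal sketch: the function names are illustrative, and the demo feeds an in-memory stream so the block runs without piped input. A real total_words.py would import sys and call main(sys.stdin) instead.

```python
import io
import re


def total_words(text):
    """Count words, where a word is a maximal run of ASCII letters."""
    return len(re.findall(r"[a-zA-Z]+", text))


def main(stream):
    # read() returns the whole input as a single string
    text = stream.read()
    print(total_words(text))


# Demo only: 4 words on the first line plus 2 on the second
main(io.StringIO("The soul's desire\nhello-world\n"))  # prints 6
```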
Count how many times a specific word (from the command line) appears. Case-insensitive!
The search is case-insensitive. Convert both the target and each word to lowercase before comparing. "Death", "death", "DEATH" should all match.
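One way to sketch the case-insensitive count is shown below. The function names and the bare-number output are assumptions for illustration; the lab's required output format is not shown here. A real count_word.py would import sys and call main(sys.argv, sys.stdin).

```python
import io
import re


def count_word(text, target):
    """Count whole-word, case-insensitive matches of target in text."""
    target = target.lower()
    words = re.findall(r"[a-zA-Z]+", text)
    return sum(1 for word in words if word.lower() == target)


def main(argv, stream):
    # argv[1] is the word to search for; the text comes from the stream
    print(count_word(stream.read(), argv[1]))


# Demo only: "deathly" is a different word, so it is not counted
main(["count_word.py", "death"], io.StringIO("Death, death and DEATH; deathly"))  # prints 3
```

Comparing whole extracted words, rather than searching for a substring, keeps "deathly" from matching "death".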
Calculate how frequently each artist uses a given word. This requires a dict of dicts: a dictionary where each value is another dictionary.
Outer dict = rows (artists). Inner dict = columns (word counts). Access: word_counts[artist][word]. Use .get(key, 0) to handle missing words gracefully.
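The nested structure can be sketched with in-memory sample data. Everything below is hypothetical: the artist names, the lyrics, the helper names, and the choice to lowercase words before counting. Reading the files under lyrics/ is left out because their layout is not described here.

```python
import re


def build_word_counts(lyrics_by_artist):
    """Build word_counts[artist][word] from a dict of artist -> lyrics text."""
    word_counts = {}
    for artist, text in lyrics_by_artist.items():
        counts = word_counts.setdefault(artist, {})
        # Assumption: counting is case-insensitive, so lowercase first
        for word in re.findall(r"[a-zA-Z]+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
    return word_counts


def frequency(word_counts, artist, word):
    """Return count / total for one artist; 0.0 if the artist has no words."""
    counts = word_counts[artist]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # .get(key, 0) avoids a KeyError for words the artist never uses
    return counts.get(word.lower(), 0) / total


sample = {  # hypothetical data
    "Artist A": "Love me, love me not",
    "Artist B": "Death or glory",
}
word_counts = build_word_counts(sample)
print(word_counts["Artist A"]["love"])             # 2
print(frequency(word_counts, "Artist A", "love"))  # 0.4
print(frequency(word_counts, "Artist B", "love"))  # 0.0
```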
To determine which artist is most likely to sing a phrase, we multiply the probabilities of its words. But multiplying many small numbers underflows to 0 in floating point. Solution: sum the logarithms of the probabilities instead, since log(a × b) = log(a) + log(b), so multiplication becomes addition.
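The underflow is easy to reproduce. The sketch below uses made-up probabilities (100 words, each with probability 0.00001) purely to show the effect.

```python
import math

# Hypothetical per-word probabilities
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
# The true value is 1e-500, far below the smallest positive float
print(product)  # 0.0

# Summing logs keeps the result in a comfortable range (roughly -1151.3)
log_sum = sum(math.log(p) for p in probs)
print(log_sum)
```

Because log is strictly increasing, the artist with the largest sum of logs is also the artist with the largest product, so the ranking is unchanged.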
If an artist never uses a word, the probability is 0, which makes the entire product 0 (and log(0) is undefined). To avoid this, add 1 to every count.
The lab says (0+1)/18205: the denominator is the raw total word count, NOT total + vocab_size. Only add 1 to the numerator (count).
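A sketch of the add-one rule with the raw total as denominator follows. The function name, the counts, and the phrase are hypothetical; only the total 18205 comes from the lab's example. The phrase words are assumed to be lowercased already.

```python
import math


def log_probability(counts, total_words, phrase_words):
    """Sum log((count + 1) / total_words) over the words of a phrase.

    Only the numerator gets +1; the denominator stays the raw total.
    """
    return sum(
        math.log((counts.get(word, 0) + 1) / total_words)
        for word in phrase_words
    )


counts = {"love": 3, "me": 2}  # hypothetical counts for one artist
total = 18205                  # raw total word count from the lab's example

# An unseen word contributes log((0 + 1) / 18205) instead of log(0)
print(log_probability(counts, total, ["unseen"]))
# Seen and unseen words simply add up
print(log_probability(counts, total, ["love", "unseen"]))
```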
Given mystery song files, identify the most likely artist using Naive Bayes: for each song, find the artist with the highest total log-probability over all words in the song.
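The selection step might look like the sketch below. The models dictionary (artist mapped to a pair of word counts and raw total) is a hypothetical in-memory stand-in for data built from lyrics/, and the names are illustrative. Reading each song file named on the command line is left out.

```python
import math
import re


def log_probability(counts, total_words, words):
    """Add-one log-probability with the raw total as denominator."""
    return sum(math.log((counts.get(w, 0) + 1) / total_words) for w in words)


def identify_artist(models, song_text):
    """Return (artist, score) for the artist with the highest log-probability."""
    # Assumption: matching is case-insensitive, so lowercase the song first
    words = re.findall(r"[a-zA-Z]+", song_text.lower())
    scores = {
        artist: log_probability(counts, total, words)
        for artist, (counts, total) in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


models = {  # hypothetical counts and totals
    "Artist A": ({"love": 10, "me": 5}, 100),
    "Artist B": ({"death": 8, "glory": 4}, 100),
}
print(identify_artist(models, "Love me, love me")[0])  # Artist A
print(identify_artist(models, "Death or glory")[0])    # Artist B
```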
No glob is needed: the shell expands song?.txt into the matching file names, which the script receives as command-line arguments.
All 5 exercises build on the same foundation. Here's how they relate:
| Exercise | Input | Core Logic |
|---|---|---|
| total_words.py | stdin | len(re.findall(r'[a-zA-Z]+', text)) |
| count_word.py | stdin + argv[1] | Count words matching target (case-insensitive) |
| frequency.py | lyrics/ + argv[1] | Dict-of-dicts, count / total |
| log_probability.py | lyrics/ + argv words | sum(log((count+1)/total)) |
| identify_artist.py | lyrics/ + song files | Find max log-prob artist per file |