Lab 09 teaches you three progressively more "Pythonic" ways to download a web page and extract course codes & names from UNSW's timetable site.
- Problem 1 — `courses.sh`: pure shell with `curl` + `sed` + `paste` + `uniq` + `sort`.
- Problem 2 — `courses_subprocess.py`: Python shells out to `curl` via the `subprocess` module, then parses with `re.findall`.
- Problem 3 — `courses_requests.py`: pure Python — `requests.get()` downloads, BeautifulSoup parses the DOM.

| Task | Shell | subprocess.py | requests.py |
|---|---|---|---|
| Download | `curl -sL URL` | `subprocess.run(["curl", ...])` | `requests.get(url).text` |
| Parse | `sed` regex | `re.findall()` | `BeautifulSoup.find_all()` |
| Dedup & sort | `uniq -w8 \| sort` | `dict` + `sorted()` | `dict` + `sorted()` |
For a prefix like COMP, the URL is:
The HTML contains each course in two paired `<a>` tags that share the same `href`:

Because code and name share the same `href`, a regex like `<a href="(COMP\d+)\.html">([^<]+)</a>` matches both anchors. The first match is `(COMP1010, COMP1010)`, the second is `(COMP1010, The Art of Computing)`.
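This pairing behaviour is easy to check; the HTML fragment below is a made-up sample in the shape described above, not the live timetable page:

```python
import re

# Made-up fragment in the paired-anchor shape described above.
html = ('<a href="COMP1010.html">COMP1010</a>'
        '<a href="COMP1010.html">The Art of Computing</a>')

# Two capture groups, so findall returns a list of (code, name) tuples.
matches = re.findall(r'<a href="(COMP\d+)\.html">([^<]+)</a>', html)
print(matches)
# [('COMP1010', 'COMP1010'), ('COMP1010', 'The Art of Computing')]
```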
3. Problem 1 — courses.sh (Shell)
One pipeline: download → extract anchors → pair code+name → dedup by code → sort.
Why `paste - -`? `sed` emits one line per anchor: `COMP1010 COMP1010` then `COMP1010 The Art of Computing`. `paste - -` joins every two lines into one, so we can keep just the name columns.
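As a Python analogy (not the lab's shell solution), the same two-lines-into-one pairing can be sketched with `zip` over a single iterator:

```python
# Lines in the order sed emits them: code anchor, then name anchor.
lines = ["COMP1010 COMP1010", "COMP1010 The Art of Computing"]

# zip(it, it) pulls two lines at a time from one iterator,
# joining them with a tab just as `paste - -` does.
it = iter(lines)
pairs = ["\t".join(pair) for pair in zip(it, it)]
print(pairs)  # ['COMP1010 COMP1010\tCOMP1010 The Art of Computing']
```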
4. Problem 2 — courses_subprocess.py
Python calls out to `curl` via the `subprocess` module, then parses with `re.findall`.
- `capture_output=True` — capture stdout and stderr.
- `text=True` — return strings instead of bytes.

Using `if code not in courses: courses[code] = name` keeps the first match — which is the code-anchor, where name == code. Output becomes `COMP1010 COMP1010`. Fix: drop the `if` and let the second (name) match overwrite the first.
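Putting the pieces together, a minimal sketch of the subprocess approach with the overwrite fix applied (function names are illustrative, not the lab's reference solution):

```python
import re
import subprocess

COURSE_RE = r'<a href="(COMP\d+)\.html">([^<]+)</a>'

def parse_courses(html):
    """Map code -> name; the later name-anchor overwrites the code-anchor."""
    courses = {}
    for code, name in re.findall(COURSE_RE, html):
        courses[code] = name  # no `if code not in courses`: second match wins
    return courses

def fetch_courses(url):
    """Shell out to curl, as Problem 2 requires."""
    result = subprocess.run(
        ["curl", "-sL", url],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
    )
    return parse_courses(result.stdout)
```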
5. Problem 3 — courses_requests.py
Pure Python. Use `requests` to download, BeautifulSoup with `html5lib` to parse the DOM tree.
| Call | Returns |
|---|---|
| `BeautifulSoup(html, 'html5lib')` | Parsed DOM tree |
| `soup.find_all('a')` | List of all `<a>` tags |
| `link.get('href')` | `href` attribute (safe: returns `None` if missing) |
| `link.text` / `link.text.strip()` | Text inside the tag |
Why `html5lib`? The lab spec requires it. `html5lib` is the most lenient parser — it handles broken/malformed HTML the same way a real browser does, so timetable quirks don't break your script.
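A minimal sketch of how these calls fit together, assuming the paired-anchor page layout described earlier (function names are illustrative; needs the third-party `requests`, `beautifulsoup4` and `html5lib` packages):

```python
import requests
from bs4 import BeautifulSoup

def extract_courses(html, prefix="COMP"):
    """Collect code -> name from every <a> tag on the page."""
    soup = BeautifulSoup(html, "html5lib")   # lenient, browser-like parsing
    courses = {}
    for link in soup.find_all("a"):
        href = link.get("href")              # safe: None if attribute missing
        if href and href.startswith(prefix) and href.endswith(".html"):
            code = href[: -len(".html")]
            courses[code] = link.text.strip()  # name-anchor overwrites code-anchor
    return courses

def format_courses(courses):
    """'CODE Name' lines, sorted by code, as the lab output requires."""
    return [f"{code} {name}" for code, name in sorted(courses.items())]

def main(url):
    html = requests.get(url).text            # pure-Python download
    print("\n".join(format_courses(extract_courses(html))))
```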
6. Challenge — regex_prime.txt
Write a regex that matches composite (non-prime) unary numbers. A unary number is a string of `x` characters: `xxxx` = 4, `xxxxx` = 5, etc.
- `^` / `$` — anchor to the whole string.
- `(x{2,}?)` — capture 2 or more `x`'s (any divisor ≥ 2), non-greedy.
- `\1+` — backreference: repeat the captured group one or more times.

A number n is composite iff it can be expressed as n = a × b where a ≥ 2 and b ≥ 2. The regex says "find a block of ≥ 2 `x`'s and repeat it ≥ 1 more time to cover the whole string" — i.e. the total length is a multiple of a divisor ≥ 2.
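Assembling the three pieces gives `^(x{2,}?)\1+$`, which can be sanity-checked with Python's `re` module:

```python
import re

# The three pieces from the breakdown above, assembled into one pattern.
COMPOSITE = re.compile(r"^(x{2,}?)\1+$")

def is_composite(n):
    """True iff the unary string of n x's matches, i.e. n = a*b with a, b >= 2."""
    return COMPOSITE.match("x" * n) is not None

composites = [n for n in range(1, 11) if is_composite(n)]
print(composites)  # [4, 6, 8, 9, 10]
```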
| Input | Length | Prime? | Match? |
|---|---|---|---|
| `xxxx` | 4 = 2×2 | No | ✅ |
| `xxxxxx` | 6 = 2×3 | No | ✅ |
| `xxx` | 3 | Yes | ❌ |
| `xxxxx` | 5 | Yes | ❌ |
| `xxxxxxx` | 7 | Yes | ❌ |
| `x` | 1 | — | ❌ |
- `chmod +x` applied so scripts are executable.
- First line has a shebang (`#!/bin/dash` or `#!/usr/bin/env python3`).
- All autotests pass locally before `give`.
- Output is `CODE Name`, sorted, one per line.

The quiz has 30 questions covering subprocess, regex, requests, BeautifulSoup and the composite-number regex.
Start Week 9 Quiz