Week 9: Subprocess, Requests & BeautifulSoup


1. Lab Overview — Three Ways to Fetch & Parse

Lab 09 teaches you three progressively more "Pythonic" ways to download a web page and extract course codes & names from UNSW's timetable site.

  • Problem 1 — courses.sh: Pure shell with curl + sed + paste + uniq + sort.
  • Problem 2 — courses_subprocess.py: Python shells out to curl via the subprocess module, then parses with re.findall.
  • Problem 3 — courses_requests.py: Pure Python — requests.get() downloads, BeautifulSoup parses the DOM.
Task          Shell             subprocess.py                  requests.py
Download      curl -sL URL      subprocess.run(["curl", ...])  requests.get(url).text
Parse         sed regex         re.findall()                   BeautifulSoup.find_all()
Dedup & Sort  uniq -w8 | sort   dict + sorted()                dict + sorted()

2. The Target URL & HTML Structure

For a prefix like COMP, the URL is:

http://www.timetable.unsw.edu.au/2024/${prefix}KENS.html
# e.g. http://www.timetable.unsw.edu.au/2024/COMPKENS.html

The HTML contains each course in two paired <a> tags that share the same href:

<td class="data"><a href="COMP1010.html">COMP1010</a></td>
<td class="data"><a href="COMP1010.html">The Art of Computing</a></td>

⚠️ The Trap

Because the code and name share the same href, a regex like <a href="(COMP\d+)\.html">([^<]+)</a> matches both anchors. The first capture is (COMP1010, COMP1010); the second is (COMP1010, The Art of Computing).
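The double match is easy to reproduce on the sample snippet above (hard-coded here so it runs offline):

```python
import re

# The two paired anchors from the timetable page, as a single string
html = ('<td class="data"><a href="COMP1010.html">COMP1010</a></td> '
        '<td class="data"><a href="COMP1010.html">The Art of Computing</a></td>')

pattern = r'<a href="(COMP\d+)\.html">([^<]+)</a>'
matches = re.findall(pattern, html)
print(matches)
# [('COMP1010', 'COMP1010'), ('COMP1010', 'The Art of Computing')]
```

Both tuples share the same code, which is exactly why the dedup step later has to pick the right one.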

3. Problem 1 — courses.sh (Shell)

One pipeline: download → extract anchors → pair code+name → dedup by code → sort.

#!/bin/dash
prefix=$1
url="http://www.timetable.unsw.edu.au/2024/${prefix}KENS.html"
curl --location --silent "$url" |
sed -n "s|.*<a href=\"\(${prefix}[0-9]*\)\.html\">\([^<]*\)</a>.*|\1 \2|p" |
paste - - |
awk '{ print $1, $4, $5, $6, $7, $8, $9, $10 }' |
sort -u

Why paste - -?

sed emits one line per anchor: COMP1010 COMP1010, then COMP1010 The Art of Computing. paste - - joins every two lines into one, so we can keep just the name columns.
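The pairing step can be seen in isolation with a toy two-line input standing in for the real sed output:

```shell
# Two sed-style lines: the code anchor, then the name anchor
printf 'COMP1010 COMP1010\nCOMP1010 The Art of Computing\n' |
paste - -    # joins each pair of lines into one, separated by a tab
```

After paste, awk sees the fields COMP1010, COMP1010, COMP1010, The, Art, of, Computing — so `$1` plus `$4` onwards yields the code and the name.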

4. Problem 2 — courses_subprocess.py

Python calls out to curl via the subprocess module, then parses with re.findall.

Step 1 — Run curl via subprocess

import subprocess
import sys

prefix = sys.argv[1]
url = f"http://www.timetable.unsw.edu.au/2024/{prefix}KENS.html"
result = subprocess.run(
    ["curl", "--location", "--silent", url],
    capture_output=True, text=True
)
html = result.stdout

Key subprocess.run() args

  • capture_output=True — capture stdout and stderr.
  • text=True — return strings instead of bytes.
  • Pass the command + args as a list, not a single string (avoids shell injection).
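The same argument pattern can be checked with a harmless stand-in command (echo here instead of curl, purely for illustration):

```python
import subprocess

# Command and arguments as a list -- nothing passes through a shell
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)

print(result.returncode)  # 0 on success
print(repr(result.stdout))  # 'hello\n' -- a str, because text=True
```

Without text=True, result.stdout would be the bytes b'hello\n' instead.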

Step 2 — Parse with re.findall

import re

pattern = rf'<a href="({prefix}\d+)\.html">([^<]+)</a>'
matches = re.findall(pattern, html)
# [('COMP1010', 'COMP1010'),
#  ('COMP1010', 'The Art of Computing'),
#  ('COMP1511', 'COMP1511'),
#  ('COMP1511', 'Programming Fundamentals'),
#  ...]

Step 3 — Dedup with a dict (last-wins)

courses = {}
for code, name in matches:
    courses[code] = name  # second match overwrites first → keeps name, not code

for code in sorted(courses):
    print(f"{code} {courses[code]}")

⚠️ Common Bug

Using if code not in courses: courses[code] = name keeps the first match — the code-anchor where name == code. Output becomes COMP1010 COMP1010. Fix: drop the if and let the second (name) match overwrite.
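A minimal side-by-side of the buggy first-wins dict and the correct last-wins dict, using a hard-coded match list:

```python
matches = [('COMP1010', 'COMP1010'),
           ('COMP1010', 'The Art of Computing')]

first_wins = {}
for code, name in matches:
    if code not in first_wins:   # buggy: keeps the code-anchor match
        first_wins[code] = name

last_wins = {}
for code, name in matches:
    last_wins[code] = name       # correct: the name-anchor match overwrites

print(first_wins['COMP1010'])  # COMP1010
print(last_wins['COMP1010'])   # The Art of Computing
```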

5. Problem 3 — courses_requests.py

Pure Python. Use requests to download and BeautifulSoup with html5lib to parse the DOM tree.

Complete Script

#!/usr/bin/env python3
import sys
import re
import requests
from bs4 import BeautifulSoup

prefix = sys.argv[1]
url = f"http://www.timetable.unsw.edu.au/2024/{prefix}KENS.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')

courses = {}
for link in soup.find_all('a'):
    href = link.get('href') or ''
    match = re.match(rf'^({prefix}\d+)\.html$', href)
    if match:
        code = match.group(1)
        courses[code] = link.text.strip()  # last-wins overwrites

for code in sorted(courses):
    print(f"{code} {courses[code]}")

BeautifulSoup Essentials

Call                             Returns
BeautifulSoup(html, 'html5lib')  Parsed DOM tree
soup.find_all('a')               List of all <a> tags
link.get('href')                 href attribute (safe: returns None if missing)
link.text / link.text.strip()    Text inside the tag
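A tiny self-contained demo of these calls on a static snippet (the built-in html.parser is used here only so the demo runs without downloading anything; the lab itself requires html5lib):

```python
from bs4 import BeautifulSoup

# One course anchor, hard-coded instead of fetched
html = '<td class="data"><a href="COMP1010.html">COMP1010</a></td>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')          # list of all <a> tags
print(links[0].get('href'))         # COMP1010.html
print(links[0].text.strip())        # COMP1010
print(links[0].get('title'))        # None -- .get() is safe for missing attributes
```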

Why html5lib?

The lab spec requires it. html5lib is the most lenient parser — it handles broken or malformed HTML the same way a real browser does, so timetable quirks won't break your script.

6. Challenge — regex_prime.txt

Write a regex that matches composite (non-prime) unary numbers. A unary number is a string of x characters: xxxx = 4, xxxxx = 5, etc.

^(x{2,}?)\1+$

Why It Works

  • ^ / $ — anchor to the whole string.
  • (x{2,}?) — capture 2 or more x's (any divisor ≥ 2), non-greedy.
  • \1+ — backreference: repeat the captured group one or more times.

Intuition

A number n is composite iff it can be expressed as n = a × b where a ≥ 2 and b ≥ 2. The regex says "find a block of ≥ 2 x's and repeat it ≥ 1 more time to cover the whole string" — i.e. the total length is a multiple of a divisor ≥ 2.

Input    Length   Prime?   Match?
xxxx     4 = 2×2  No       Yes
xxxxxx   6 = 2×3  No       Yes
xxx      3        Yes      No
xxxxx    5        Yes      No
xxxxxxx  7        Yes      No
x        1        Neither  No
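The table above can be verified mechanically by applying the regex to unary strings of each length:

```python
import re

# The lab's composite-number regex, applied to x, xx, xxx, ..., xxxxxxxxxx
pattern = re.compile(r'^(x{2,}?)\1+$')
composite = [n for n in range(1, 11) if pattern.match('x' * n)]
print(composite)  # only composite lengths match: [4, 6, 8, 9, 10]
```

1 and the primes 2, 3, 5, 7 fail because no block of ≥ 2 x's can be repeated to cover them exactly.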

7. Testing & Submission

# Autotest each problem
2041 autotest shell_courses
2041 autotest python_courses_subprocess
2041 autotest python_courses_requests
2041 autotest regex_prime

# Submit by Monday 2026-04-20 12:00
give cs2041 lab09_shell_courses courses.sh
give cs2041 lab09_python_courses_subprocess courses_subprocess.py
give cs2041 lab09_python_courses_requests courses_requests.py
give cs2041 lab09_regex_prime regex_prime.txt

Pre-submit Checklist

  • chmod +x applied so scripts are executable.
  • Shebang at line 1 (#!/bin/dash or #!/usr/bin/env python3).
  • All autotests pass locally before give.
  • Output is CODE Name, sorted, one per line.
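The executable-bit check can be rehearsed on a throwaway file (demo.sh is an arbitrary name, not part of the lab):

```shell
# Create a throwaway script, mark it executable, and confirm the x bit
touch demo.sh
chmod +x demo.sh
test -x demo.sh && echo "demo.sh is executable"
```

autotest runs your scripts directly, so a missing x bit or shebang fails every test at once.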
