Week 9: Subprocess, Requests & BeautifulSoup


1. Lab Overview — Three Ways to Fetch & Parse

Lab 09 teaches you three progressively more "Pythonic" ways to download a web page and extract course codes & names from UNSW's timetable site.

  • Problem 1 — courses.sh: Pure shell with curl + sed + paste + uniq + sort.
  • Problem 2 — courses_subprocess.py: Python shells out to curl via the subprocess module, then parses with re.findall.
  • Problem 3 — courses_requests.py: Pure Python — requests.get() downloads, BeautifulSoup parses the DOM.
Task          Shell             subprocess.py                  requests.py
Download      curl -sL URL      subprocess.run(["curl", ...])  requests.get(url).text
Parse         sed regex         re.findall()                   BeautifulSoup.find_all()
Dedup & Sort  uniq -w8 | sort   dict + sorted()                dict + sorted()

2. The Target URL & HTML Structure

For a prefix like COMP, the URL is:

http://www.timetable.unsw.edu.au/2024/${prefix}KENS.html
# e.g. http://www.timetable.unsw.edu.au/2024/COMPKENS.html

The HTML contains each course in two paired <a> tags that share the same href:

<td class="data"><a href="COMP1010.html">COMP1010</a></td>
<td class="data"><a href="COMP1010.html">The Art of Computing</a></td>

⚠️ The Trap

Because the code and name share the same href, a regex like <a href="(COMP\d+)\.html">([^<]+)</a> matches both anchors. The first capture is (COMP1010, COMP1010); the second is (COMP1010, The Art of Computing).
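The double match is easy to reproduce on the sample snippet above (hard-coded here so it runs offline):

```python
import re

# The two paired anchors from the timetable page, as a single string
html = ('<td class="data"><a href="COMP1010.html">COMP1010</a></td> '
        '<td class="data"><a href="COMP1010.html">The Art of Computing</a></td>')

pattern = r'<a href="(COMP\d+)\.html">([^<]+)</a>'
matches = re.findall(pattern, html)
print(matches)
# [('COMP1010', 'COMP1010'), ('COMP1010', 'The Art of Computing')]
```

Both tuples share the same code, which is exactly why the dedup step later has to pick the right one.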

3. Problem 1 — courses.sh (Shell)

One pipeline: download → extract anchors → pair code+name → dedup by code → sort.

#!/bin/dash
prefix=$1
url="http://www.timetable.unsw.edu.au/2024/${prefix}KENS.html"
curl --location --silent "$url" |
sed -n "s|.*<a href=\"\(${prefix}[0-9]*\)\.html\">\([^<]*\)</a>.*|\1 \2|p" |
paste - - |
awk '{ print $1, $4, $5, $6, $7, $8, $9, $10 }' |
sort -u

Why paste - -?

sed emits one line per anchor: COMP1010 COMP1010, then COMP1010 The Art of Computing. paste - - joins every two lines into one, so we can keep just the name columns.
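The pairing step can be seen in isolation with a toy two-line input standing in for the real sed output:

```shell
# Two sed-style lines: the code anchor, then the name anchor
printf 'COMP1010 COMP1010\nCOMP1010 The Art of Computing\n' |
paste - -    # joins each pair of lines into one, separated by a tab
```

After paste, awk sees the fields COMP1010, COMP1010, COMP1010, The, Art, of, Computing — so `$1` plus `$4` onwards yields the code and the name.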

4. Problem 2 — courses_subprocess.py

Python calls out to curl via the subprocess module, then parses with re.findall.

Step 1 — Run curl via subprocess

import subprocess
import sys

prefix = sys.argv[1]
url = f"http://www.timetable.unsw.edu.au/2024/{prefix}KENS.html"
result = subprocess.run(
    ["curl", "--location", "--silent", url],
    capture_output=True, text=True
)
html = result.stdout

Key subprocess.run() args

  • capture_output=True — capture stdout and stderr.
  • text=True — return strings instead of bytes.
  • Pass the command + args as a list, not a single string (avoids shell injection).
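The same argument pattern can be checked with a harmless stand-in command (echo here instead of curl, purely for illustration):

```python
import subprocess

# Command and arguments as a list -- nothing passes through a shell
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)

print(result.returncode)  # 0 on success
print(repr(result.stdout))  # 'hello\n' -- a str, because text=True
```

Without text=True, result.stdout would be the bytes b'hello\n' instead.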

Step 2 — Parse with re.findall

import re

pattern = rf'<a href="({prefix}\d+)\.html">([^<]+)</a>'
matches = re.findall(pattern, html)
# [('COMP1010', 'COMP1010'),
#  ('COMP1010', 'The Art of Computing'),
#  ('COMP1511', 'COMP1511'),
#  ('COMP1511', 'Programming Fundamentals'),
#  ...]

Step 3 — Dedup with a dict (last-wins)

courses = {}
for code, name in matches:
    courses[code] = name  # second match overwrites first → keeps name, not code

for code in sorted(courses):
    print(f"{code} {courses[code]}")

⚠️ Common Bug

Using if code not in courses: courses[code] = name keeps the first match — the code-anchor where name == code. Output becomes COMP1010 COMP1010. Fix: drop the if and let the second (name) match overwrite.
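A minimal side-by-side of the buggy first-wins dict and the correct last-wins dict, using a hard-coded match list:

```python
matches = [('COMP1010', 'COMP1010'),
           ('COMP1010', 'The Art of Computing')]

first_wins = {}
for code, name in matches:
    if code not in first_wins:   # buggy: keeps the code-anchor match
        first_wins[code] = name

last_wins = {}
for code, name in matches:
    last_wins[code] = name       # correct: the name-anchor match overwrites

print(first_wins['COMP1010'])  # COMP1010
print(last_wins['COMP1010'])   # The Art of Computing
```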

5. Problem 3 — courses_requests.py

Pure Python. Use requests to download and BeautifulSoup with html5lib to parse the DOM tree.

Complete Script

#!/usr/bin/env python3
import sys
import re
import requests
from bs4 import BeautifulSoup

prefix = sys.argv[1]
url = f"http://www.timetable.unsw.edu.au/2024/{prefix}KENS.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')

courses = {}
for link in soup.find_all('a'):
    href = link.get('href') or ''
    match = re.match(rf'^({prefix}\d+)\.html$', href)
    if match:
        code = match.group(1)
        courses[code] = link.text.strip()  # last-wins overwrites

for code in sorted(courses):
    print(f"{code} {courses[code]}")

BeautifulSoup Essentials

Call                             Returns
BeautifulSoup(html, 'html5lib')  Parsed DOM tree
soup.find_all('a')               List of all <a> tags
link.get('href')                 href attribute (safe: returns None if missing)
link.text / link.text.strip()    Text inside the tag
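A tiny self-contained demo of these calls on a static snippet (the built-in html.parser is used here only so the demo runs without downloading anything; the lab itself requires html5lib):

```python
from bs4 import BeautifulSoup

# One course anchor, hard-coded instead of fetched
html = '<td class="data"><a href="COMP1010.html">COMP1010</a></td>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')          # list of all <a> tags
print(links[0].get('href'))         # COMP1010.html
print(links[0].text.strip())        # COMP1010
print(links[0].get('title'))        # None -- .get() is safe for missing attributes
```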

Why html5lib?

The lab spec requires it. html5lib is the most lenient parser — it handles broken or malformed HTML the same way a real browser does, so timetable quirks won't break your script.

6. Challenge — regex_prime.txt

Write a regex that matches composite (non-prime) unary numbers. A unary number is a string of x characters: xxxx = 4, xxxxx = 5, etc.

^(x{2,}?)\1+$

Why It Works

  • ^ / $ — anchor to the whole string.
  • (x{2,}?) — capture 2 or more x's (any divisor ≥ 2), non-greedy.
  • \1+ — backreference: repeat the captured group one or more times.

Intuition

A number n is composite iff it can be expressed as n = a × b where a ≥ 2 and b ≥ 2. The regex says "find a block of ≥ 2 x's and repeat it ≥ 1 more time to cover the whole string" — i.e. the total length is a multiple of a divisor ≥ 2.

Input    Length   Prime?   Match?
xxxx     4 = 2×2  No       Yes
xxxxxx   6 = 2×3  No       Yes
xxx      3        Yes      No
xxxxx    5        Yes      No
xxxxxxx  7        Yes      No
x        1        Neither  No
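The table above can be verified mechanically by applying the regex to unary strings of each length:

```python
import re

# The lab's composite-number regex, applied to x, xx, xxx, ..., xxxxxxxxxx
pattern = re.compile(r'^(x{2,}?)\1+$')
composite = [n for n in range(1, 11) if pattern.match('x' * n)]
print(composite)  # only composite lengths match: [4, 6, 8, 9, 10]
```

1 and the primes 2, 3, 5, 7 fail because no block of ≥ 2 x's can be repeated to cover them exactly.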

7. Testing & Submission

# Autotest each problem
2041 autotest shell_courses
2041 autotest python_courses_subprocess
2041 autotest python_courses_requests
2041 autotest regex_prime

# Submit by Monday 2026-04-20 12:00
give cs2041 lab09_shell_courses courses.sh
give cs2041 lab09_python_courses_subprocess courses_subprocess.py
give cs2041 lab09_python_courses_requests courses_requests.py
give cs2041 lab09_regex_prime regex_prime.txt

Pre-submit Checklist

  • chmod +x applied so scripts are executable.
  • Shebang at line 1 (#!/bin/dash or #!/usr/bin/env python3).
  • All autotests pass locally before give.
  • Output is CODE Name, sorted, one per line.
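The executable-bit check can be rehearsed on a throwaway file (demo.sh is an arbitrary name, not part of the lab):

```shell
# Create a throwaway script, mark it executable, and confirm the x bit
touch demo.sh
chmod +x demo.sh
test -x demo.sh && echo "demo.sh is executable"
```

autotest runs your scripts directly, so a missing x bit or shebang fails every test at once.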
