Q1 regex · Q2 shell pipelines · Q3 Python · Q4 shell scripts
📋 The Exam
12 questions, 100 marks, 3 hours working + 10 min reading. Questions divide into practical (Q1–Q5) and theory/longer scripts (Q6–Q12). This review covers Q1–Q4 of the 24T1 practice exam: the foundations every later question builds on.
💡 Strategy
Each section is a knowledge dump + worked example + your real bugs. Don't just read: close the page and try writing each pattern from memory. Then take the quiz at the bottom (30 questions, click-to-reveal explanations).
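Each line of awards.psv has 6 fields separated by |. The patterns in this section rely on three of them: field 1 is the award, field 5 is the country, and the last field is the birth year.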
Why you need -E: bare | vs \|
In extended regex (grep -E), a bare | means OR (alternation). To match a literal pipe, you must escape it: \|. This is the single most common bug in Q1.
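A minimal sketch of the difference, run against three invented rows. The field layout used here (award|year|name|area|country|birth year) and the names are assumptions for illustration, not the real awards.psv.

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|1998|Pierre P. Pascal|Mathematics|France|1961
Turing Award|2011|Anne B. Curie|Computing|France|1949
Abel Prize|2015|Frances Moreau|Mathematics|Norway|1991
EOF
}

# Bare | is alternation: count lines containing "Abel" OR "Turing".
sample_awards | grep -cE 'Abel|Turing'      # prints 2

# Escaped \| is a literal pipe: count lines containing the text "Medal|1998".
sample_awards | grep -cE 'Medal\|1998'      # prints 1
```

Note that the whole pattern sits inside single quotes, so the shell never sees either pipe; only grep does.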
^ = start of line · $ = end of line. To pin down a middle field, surround it with literal pipes: \|Australia\|.
Pipes on both sides ⇒ "Australia" must be a complete field (here field 5, the country), not a substring inside a name or award title.
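To see what the surrounding pipes buy you, here is a small sketch on invented rows (the field layout and the names are assumptions). The last pattern is an optional stricter form, not something the question requires.

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|2006|Terry Example|Mathematics|Australia|1975
Turing Award|1983|Alan Australia Smith|Computing|United Kingdom|1941
Abel Prize|2015|Frances Moreau|Mathematics|Norway|1991
EOF
}

# Unanchored: also hits "Australia" inside a name.
sample_awards | grep -c 'Australia'                # prints 2

# Pipes on both sides: "Australia" must be a whole field.
sample_awards | grep -cE '\|Australia\|'           # prints 1

# Stricter: exactly one more field follows, so this is the second-to-last field.
sample_awards | grep -cE '\|Australia\|[^|]*$'     # prints 1
```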
Two fields at once: use .*
"Fields Medal winners from France" needs two conditions on the same line: field 1 = Fields Medal AND field 5 = France. Anchor each, join with .*: '^Fields Medal\|.*\|France\|'.
⚠️ Bug you hit
Your first attempt was '^Fields Medal| *\|France\|'. The unescaped | made it OR, not AND: it returned every Fields medalist plus every French winner of any award. Always escape (\|) when you mean a literal pipe.
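The AND form and the buggy OR form can be compared directly. This is a sketch on invented rows (field layout and names assumed); the second pattern is the first attempt quoted above.

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|1998|Pierre P. Pascal|Mathematics|France|1961
Fields Medal|2006|Terry Example|Mathematics|Australia|1975
Turing Award|2011|Anne B. Curie|Computing|France|1949
Abel Prize|2015|Frances Moreau|Mathematics|Norway|1991
EOF
}

and_pattern='^Fields Medal\|.*\|France\|'   # field 1 AND field 5
or_pattern='^Fields Medal| *\|France\|'     # bare | splits this into two alternatives

# AND: only the French Fields medalist survives.
sample_awards | grep -E "$and_pattern" | cut -d'|' -f3    # prints Pierre P. Pascal

# OR: both Fields medalists plus the French Turing winner.
sample_awards | grep -cE "$or_pattern"                    # prints 3
```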
| Symbol | Meaning | Example |
|---|---|---|
| . | any single character | a.c matches abc, a c |
| * | 0 or more of previous | ab*c matches ac, abbbc |
| + | 1 or more of previous | ab+c matches abc, not ac |
| ? | 0 or 1 of previous | colou?r matches both spellings |
| {n} | exactly n times | [0-9]{4} = a 4-digit number |
| [A-Z] | character class (set) | any uppercase letter |
| .* | 0+ of anything (workhorse) | "skip past whatever" |
| \. | literal dot | matches the period in "F." |
⚠️ Bugs you hit (character classes)
[194] ≠ 194. Brackets mean "one of these characters": [194] matches a single 1, 9, OR 4. To match the literal sequence "194" (e.g. for the 1940s), write 194 with no brackets.
[a+z] ≠ [a-z]. Inside brackets, + is a literal plus, not a quantifier. Use a hyphen for ranges.
[A-Z*] ≠ [A-Z]*. Inside brackets, * is literal. Outside, it's "0 or more of the previous thing."
. alone matches exactly one character. To allow many, write .* or .+.
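Each of these is quick to check at the prompt. A sketch using throwaway input lines (all invented); -x makes grep require the whole line to match, which keeps the results unambiguous.

```shell
# [194] is ONE character from the set {1, 9, 4}; 194 is the three-character sequence.
printf '%s\n' 1945 2019 1872 | grep -cE '[194]'    # prints 3: every line has a 1, 9 or 4
printf '%s\n' 1945 2019 1872 | grep -cE '194'      # prints 1: only 1945

# Inside brackets, + and * are literal characters.
printf '%s\n' m + | grep -xE '[a+z]'               # prints +
printf '%s\n' m + | grep -xE '[a-z]'               # prints m
printf '%s\n' ABC '*' | grep -xE '[A-Z*]'          # prints *
printf '%s\n' ABC '*' | grep -xE '[A-Z]*'          # prints ABC

# . is exactly one character; .* is any number of them.
printf '%s\n' abc abbbc | grep -xE 'a.c'           # prints abc
printf '%s\n' abc abbbc | grep -cxE 'a.*c'         # prints 2
```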
Wrap a part with (...) to capture it. Refer back with \1, \2, etc.; groups are numbered by their opening parenthesis, counting left to right.
✅ Q1.4: first name + middle initial + last name all start with the same letter
Read it left-to-right: pipe, capture an uppercase letter, lowercase rest, space, same letter + literal dot + space, same letter starting the surname, lowercase rest, pipe.
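Written out, that left-to-right reading gives a pattern like the one below. Treat it as a reconstruction from the description, tried on invented rows (field layout and names assumed). Backreferences in grep -E work in GNU grep, though POSIX does not define them for extended regex.

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|1998|Pierre P. Pascal|Mathematics|France|1961
Turing Award|2011|Anne B. Curie|Computing|France|1949
Abel Prize|2015|Adam A. Brown|Mathematics|Norway|1991
EOF
}

# pipe, (capital), lowercase*, space, \1 + literal dot, space, \1, lowercase*, pipe
same_initial='\|([A-Z])[a-z]* \1\. \1[a-z]*\|'

sample_awards | grep -E "$same_initial" | cut -d'|' -f3    # prints Pierre P. Pascal
```

"Adam A. Brown" gets as far as the middle initial and then fails, because the surname starts with B rather than the captured A.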
Best pattern: a character class with * (zero or more). Safer than .* because, as long as the class itself excludes |, it can't accidentally match across a pipe into the next field.
[\1\2] doesn't mean "match group 1 OR group 2." Inside brackets, \1 is a literal backslash and a literal 1. To say "match group 1 OR group 2," use alternation outside brackets: (\1|\2).
Group 1 = first digit. Group 2 = second digit. Then \2\1 mirrors them. $ anchors the year to the end of line because birth year is the last field.
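A sketch of the mirrored-year pattern that this description implies, on invented rows (field layout, names, and years are assumptions):

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|1998|Pierre P. Pascal|Mathematics|France|1961
Turing Award|2011|Anne B. Curie|Computing|France|1949
Abel Prize|2015|Frances Moreau|Mathematics|Norway|1991
EOF
}

# pipe, (digit 1), (digit 2), digit 2 again, digit 1 again, end of line
mirror_year='\|([0-9])([0-9])\2\1$'

sample_awards | grep -E "$mirror_year" | cut -d'|' -f3,6    # prints Frances Moreau|1991
```

The leading \| matters: without it the pattern could start matching partway through a longer number.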
A pipeline is a chain of small programs, each doing one thing, connected by |. Output of one becomes input of the next. Don't write the whole thing at once: build it stage by stage and check the output after each stage.
Three meanings of |
Same character, three jobs depending on context (quotes + position). This trips up everyone, but once you see it, you can't unsee it.
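One way to see the three roles side by side, assuming they are the shell pipe, regex alternation, and the literal character. The rows are invented and the field layout is an assumption.

```shell
# Invented sample rows; the layout award|year|name|area|country|birth_year is assumed.
sample_awards() {
    cat <<'EOF'
Fields Medal|1998|Pierre P. Pascal|Mathematics|France|1961
Turing Award|2011|Anne B. Curie|Computing|France|1949
Abel Prize|2015|Frances Moreau|Mathematics|Norway|1991
EOF
}

# 1. Unquoted, between two commands: the shell's pipe operator.
sample_awards | wc -l                       # counts the 3 lines

# 2. Quoted and unescaped inside a grep -E pattern: regex alternation (OR).
sample_awards | grep -cE 'Fields|Turing'    # prints 2

# 3. Quoted and escaped in the pattern, or handed to cut -d: a literal pipe character.
sample_awards | grep -cE '\|France\|'       # prints 2
sample_awards | cut -d'|' -f1               # prints the award column
```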
| Tool | Job | Common flag |
|---|---|---|
| grep | filter lines by pattern | -E ERE · -v invert · -i ignore case · -c count |
| cut | extract columns | -d 'X' -f N (delimiter, field; both needed here, since -d defaults to TAB) |
| sort | order lines | -n numeric · -u unique · -r reverse |
| uniq | collapse adjacent duplicates | -d only dups · -u only uniques · -c with counts |
| wc | count | -l lines |
| head / tail | first / last N lines | -n N |
| seq | integers from n to m | e.g. seq 1 5 |
| tr | translate / delete chars | -d delete · -s squeeze |
| sed | substitute | 's/old/new/g' |
Read enrolment lines from stdin (5 fields: course|id|name|plan|gender). Output surnames of male students, sorted, no duplicates.
✅ The 4-stage pattern
This is the master template for "filter → extract → dedupe → (count)" questions. Memorise it.
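A sketch of what such a pipeline can look like, built one stage per line. The enrolment rows are invented, and it assumes names are stored as "Surname, Given", so the second cut keeps the part before the comma.

```shell
# Invented enrolment rows: course|id|name|plan|gender.
sample_enrolments() {
    cat <<'EOF'
COMP1511|5201111|Wang, Wei|3778/1|M
COMP2041|5201111|Wang, Wei|3778/1|M
COMP2041|5202222|Nguyen, Anh|3707/2|F
COMP3331|5203333|Smith, John|3778/3|M
COMP3331|5204444|Wang, Wei|3785/1|M
EOF
}

# Reads enrolment lines on stdin, prints sorted unique surnames of male students.
male_surnames() {
    grep -E '\|M$' |        # 1. filter: gender is the last field
        cut -d'|' -f3 |     # 2. extract: the name field
        cut -d',' -f1 |     # 3. extract: the surname before the comma
        sort -u             # 4. sort and dedupe in one step
}

sample_enrolments | male_surnames    # prints Smith, then Wang
```

Run it after each added stage while you build it; the intermediate output tells you at once whether the last stage did what you meant.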
Count variant: wc -l
Always dedupe by the uniquely-identifying field (student ID), not by name: two different students can share a name. Two "Wang, Wei" with different IDs are two people.
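The difference shows up as soon as two students share a name. A sketch on invented enrolment rows:

```shell
# Invented enrolment rows: course|id|name|plan|gender.
# Two different students (5201111 and 5204444) are both called "Wang, Wei".
sample_enrolments() {
    cat <<'EOF'
COMP1511|5201111|Wang, Wei|3778/1|M
COMP2041|5201111|Wang, Wei|3778/1|M
COMP3331|5203333|Smith, John|3778/3|M
COMP3331|5204444|Wang, Wei|3785/1|M
EOF
}

# Dedupe by ID (field 2): the correct head count.
sample_enrolments | grep -E '\|M$' | cut -d'|' -f2 | sort -u | wc -l    # 3 students

# Dedupe by name (field 3): undercounts, the two Wang, Wei collapse into one.
sample_enrolments | grep -E '\|M$' | cut -d'|' -f3 | sort -u | wc -l    # 2 names
```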
⚠️ Bugs you hit
cut '|' -f3 is missing -d. cut then reads '|' as a filename and keeps its default TAB delimiter, so a whole |-separated line counts as one field. Always: cut -d'|' -f3.
sort on numbers gives 1, 10, 2, 45 (alphabetical). Use sort -n for numeric order.
grep 'COMP3331' works but is loose: it would also match a hypothetical course "COMP33119." Defensive form: grep -E '^COMP3331\|'.
sys.stdin is iterable: loop through it line by line. The program doesn't know whether the data comes from the keyboard, a file (< file.txt), or another command (cmd | ./script.py). That's the point.
line.rstrip('\n'): when you read a line, Python keeps the trailing \n. Without stripping, the last field becomes 'M\n' and == 'M' fails for every line.
fields[4] for "field 5": Python lists are 0-indexed. Always subtract 1 from the spec field number. Or use fields[-1] for "the last field" (more robust).
if ... != 'M': continue is an early-exit filter. continue skips the rest of this iteration and jumps back to the top of the loop. Same as grep -E '\|M$' in shell.
set() + .add(): auto-dedupe. Equivalent to sort -u in shell.
sorted(surnames): sets are unordered. You must wrap with sorted() or the output comes out in arbitrary order.
| Shell | Python |
|---|---|
| grep PATTERN | if PATTERN not in line: continue |
| grep -v PATTERN | if PATTERN in line: continue |
| cut -d'X' -f3 | line.split('X')[2] |
| sort -u | set() then sorted(s) |
| wc -l | len(items) |
| $1 | sys.argv[1] |
| cat < file | for line in sys.stdin: |
⚠️ Top-5 Python bugs you hit
Missing colon: for line in sys.stdin → must be for line in sys.stdin:
= vs ==: if x = 5 is wrong (= is assignment, so this is a syntax error); if x == 5 is comparison.
Unquoted string: if fields[4] == F makes Python look up a variable named F and raise NameError. Write 'F'.
print x is Python 2. Python 3 needs parens: print(x).
Off-by-one: the spec says "field 3" → in Python write fields[2].
A file contains an unordered list of positive integers from n to m, with possibly one missing. Print the missing integer, or nothing if none missing.
✅ Final answer (4 lines)
Concept 1: $1 is a filename, not the data
When you run ./practice_q4.sh numbers_1.txt, $1 = the string "numbers_1.txt". To get the actual numbers inside, you must use a tool that reads the file: cat "$1", sort "$1", etc. Always quote: "$1" in case the filename has spaces.
Concept 2: $() captures output
Without $(), a command's output goes to the screen and is gone. With $(), you capture it as a string that you can store in a variable.
⚠️ Variable assignment rule
No spaces around =! n=42 works. n = 42 fails (the shell tries to run n as a command).
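The first two concepts and the assignment rule fit in one short sketch. The temporary file stands in for the script's "$1"; its contents are invented.

```shell
file=$(mktemp)                       # $() captures mktemp's output (a path) in a variable
printf '%s\n' 7 5 8 > "$file"        # invented data

echo "$file"                         # prints the file's NAME
cat "$file"                          # prints the file's CONTENTS: 7, 5, 8

n=$(sort -n "$file" | head -n 1)     # smallest number; note: no spaces around =
m=$(sort -n "$file" | tail -n 1)     # largest number
echo "min=$n max=$m"                 # prints min=5 max=8

# n = 42   would fail: the shell would look for a command called n

rm -f "$file"
```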
Concept 3: -n means different things in different commands
| Command | -n means |
|---|---|
| sort -n | numeric sort (not alphabetical) |
| head -n 5 | show this many lines |
| tail -n 1 | show this many lines |
Concept 4: the core trick, sort | uniq -u
Combine the actual numbers (from file) with the expected complete sequence (from seq). Numbers in both lists appear twice. The missing number appears once. uniq only compares adjacent lines, so sort first; uniq -u then prints the lines that appear exactly once.
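The trick is easy to try without a file. In this sketch the "actual" numbers come from printf and the range 3 to 7 is hard-coded, both purely for illustration.

```shell
# Actual numbers (6 is missing) followed by the full expected sequence 3..7.
( printf '%s\n' 5 3 7 4 ; seq 3 7 ) | sort -n | uniq -u    # prints 6

# Nothing missing: every number appears twice, so uniq -u prints nothing.
( printf '%s\n' 5 3 7 4 6 ; seq 3 7 ) | sort -n | uniq -u
```

After sorting, the first stream looks like 3 3 4 4 5 5 6 7 7, and 6 is the only line without a twin.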
Concept 5: subshell grouping ( cmd1 ; cmd2 )
Parentheses run two (or more) commands and merge their outputs into one stream, which can then be piped. Semicolons separate commands within the group.
💡 Why this is elegant
If nothing is missing, every number appears twice → uniq -u outputs nothing. Spec satisfied with no special-case code.
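Putting the five concepts together gives something like the sketch below. It is a reconstruction from the concepts above, wrapped in a function (the name missing_number is made up) so it can be tried without a script file. Inside the function "$1" plays the same role as the script's first argument. Because n and m are read from the file itself, this assumes the missing number lies strictly between the smallest and largest values present.

```shell
# Hypothetical reconstruction of the script body; "$1" is the filename argument.
missing_number() {
    n=$(sort -n "$1" | head -n 1)                       # smallest number in the file
    m=$(sort -n "$1" | tail -n 1)                       # largest number in the file
    ( cat "$1" ; seq "$n" "$m" ) | sort -n | uniq -u    # the value that appears only once
}

# Try it on a throwaway file (invented data: 6 is missing from 3..8).
numbers=$(mktemp)
printf '%s\n' 8 3 5 7 4 > "$numbers"
missing_number "$numbers"     # prints 6
rm -f "$numbers"
```

As a script, the three lines inside the function plus a shebang line would make up the four lines.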
Step 1: cat "$1" to confirm the $1 plumbing works.
Step 2: capture min/max with $(), then echo them to confirm.
Step 3: run seq "$n" "$m" on its own.
30 questions · click an option to see the answer + explanation immediately