FINAL EXAM REVIEW

COMP2041 Final Review: 24T1 Practice

Q1 regex · Q2 shell pipelines · Q3 Python · Q4 shell scripts

Exam Format & How To Use This Page

📋 The Exam

12 questions, 100 marks, 3 hours of working time plus 10 minutes of reading time. Questions divide into practical (Q1–Q5) and theory or longer scripts (Q6–Q12). This review covers Q1–Q4 of the 24T1 practice exam, the foundations every later question builds on.

💡 Strategy

Each section combines a knowledge summary, a worked example, and the bugs you actually hit. Don't just read: close the page and try writing each pattern from memory. Then take the quiz at the bottom (30 questions, click-to-reveal explanations).

Q1 Regex with grep -E

The data: pipe-separated awards

Each line of awards.psv has 6 fields separated by |:

Award Name | Year | Winner Name | Gender | Country | Birth Year
ACM Turing Award|2000|Andrew Chi-Chih Yao|Male|China|1946
Nobel Prize for medicine|1963|Andrew F. Huxley|Male|United Kingdom|1917
Fields Medal|1982|William Thurston|Male|United States|1946

Why -E: bare | vs \|

In extended regex (grep -E), a bare | means OR (alternation). To match a literal pipe you must escape it as \|. This is the single most common bug in Q1.

grep -E 'A|B'   → match lines containing 'A' OR 'B'
grep -E 'A\|B'  → match lines containing the literal text 'A|B'
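
One way to convince yourself of the difference is to run both kinds of pattern over the same rows. The sketch below uses Python's re module instead of grep (an assumption made so the check is self-contained; for these two patterns re and grep -E behave the same way). The second row is made up for contrast.

```python
import re

lines = [
    "Fields Medal|1982|William Thurston|Male|United States|1946",
    "Hypothetical Prize|1999|Ada Example|Female|France|1950",  # made-up row
]

# Bare | is alternation: a line matches if it contains "Medal" OR "France".
either = [l for l in lines if re.search(r"Medal|France", l)]

# Escaped \| is a literal pipe: the exact text "Medal|1982" must appear.
literal = [l for l in lines if re.search(r"Medal\|1982", l)]

print(len(either))   # number of rows hit by the alternation
print(len(literal))  # number of rows hit by the literal-pipe pattern
```

The alternation hits both rows; the escaped version hits only the row where "Medal" is immediately followed by a pipe and "1982".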

Anchors: ^, $, and field boundaries

^ = start of line · $ = end of line. To pin down a middle field, surround it with literal pipes: \|Australia\|.

Q1.1 worked example: Australian winners

grep -E '\|Australia\|' awards.psv

Pipes on both sides ⇒ "Australia" must be an entire field (in this data, field 5, Country), not a substring inside a name or award title.
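
The effect of the surrounding pipes can be seen by comparing the anchored pattern with a bare substring search. This is a small sketch using Python's re module as a stand-in for grep -E; both rows are made-up examples in the awards.psv layout.

```python
import re

rows = [
    "Hypothetical Prize|2005|Pat Sample|Female|Australia|1961",  # made-up row
    "Australia Day Award|2001|Sam Example|Male|Canada|1960",     # made-up row
]

# Substring search: also hits "Australia" inside the award title.
loose = [r for r in rows if re.search(r"Australia", r)]

# Pipes on both sides: "Australia" has to be a whole middle field.
strict = [r for r in rows if re.search(r"\|Australia\|", r)]

print(len(loose), len(strict))
```

Only the first row survives the strict pattern, because in the second row "Australia" has no pipe in front of it.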

AND across two fields: use .*

"Fields Medal winners from France" needs two conditions on the same line: field 1 = Fields Medal AND field 5 = France. Anchor each, then join them with .*:

grep -E '^Fields Medal\|.*\|France\|' awards.psv

⚠️ Bug you hit

Your first attempt was '^Fields Medal| *\|France\|'. The unescaped | turned the AND into an OR: it returned every Fields medallist plus every French winner of any award. Always escape (\|) when you mean a literal pipe. The " *" after it only skips spaces as well; the fixed pattern uses .* to skip over whole fields.
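
The difference between the buggy and the fixed pattern shows up clearly on three rows. This is a sketch using Python's re module in place of grep -E (the two agree for these patterns); the second and third rows are made up so that each condition can be tested separately.

```python
import re

rows = [
    "Fields Medal|1982|William Thurston|Male|United States|1946",
    "Fields Medal|2010|Example Person|Male|France|1973",        # made-up row
    "Nobel Prize for physics|1999|Other Person|Female|France|1950",  # made-up row
]

buggy = r"^Fields Medal| *\|France\|"     # bare | = OR
fixed = r"^Fields Medal\|.*\|France\|"    # escaped \| and .* = AND

buggy_hits = [r for r in rows if re.search(buggy, r)]
fixed_hits = [r for r in rows if re.search(fixed, r)]

print(len(buggy_hits), len(fixed_hits))
```

The buggy pattern returns all three rows (a Fields Medal row OR a France row), while the fixed pattern returns only the row that satisfies both conditions.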

Quantifiers and characters

Symbol   Meaning                       Example
.        any single character          a.c matches abc, a c
*        0 or more of previous         ab*c matches ac, abbbc
+        1 or more of previous         ab+c matches abc, not ac
?        0 or 1 of previous            colou?r matches both spellings
{n}      exactly n times               [0-9]{4} = four digits in a row
[A-Z]    character class (set)         any uppercase letter
.*       0+ of anything (workhorse)    "skip past whatever"
\.       literal dot                   matches the period in "F."

⚠️ Bugs you hit (character classes)

  • [194] should be 194. Brackets mean "one of these characters": [194] matches a single 1, 9, or 4. To match the literal sequence "194" (e.g. for the 1940s), write 194 with no brackets.
  • [a+z] should be [a-z]. Inside brackets, + is a literal plus, not a quantifier. Use a hyphen for ranges.
  • [A-Z*] should be [A-Z]*. Inside brackets, * is literal. Outside, it means "0 or more of the previous thing."
  • . alone matches exactly one character. To allow many, write .* or .+.
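
Each of the four points above can be checked in a few lines. This sketch uses Python's re module, which treats these bracket and quantifier constructs the same way grep -E does; the variable names are only illustrative.

```python
import re

# [194] is ONE character from the set; 194 is the three-character sequence.
single = re.findall(r"[194]", "1946")
sequence = re.findall(r"194", "1946")

# Inside brackets, + is a literal plus sign and not a quantifier.
plus_is_literal = re.fullmatch(r"[a+z]", "+") is not None
m_not_in_set = re.fullmatch(r"[a+z]", "m") is None

# * inside brackets is literal; outside it repeats the class.
star_inside = re.fullmatch(r"[A-Z*]", "ABC") is None      # class = one char only
star_outside = re.fullmatch(r"[A-Z]*", "ABC") is not None

# A lone dot is exactly one character.
dot_one = re.fullmatch(r".", "ab") is None
dot_many = re.fullmatch(r".*", "ab") is not None

print(single, sequence)
```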

Capture groups & backreferences

Wrap part of a pattern in (...) to capture it, then refer back to it with \1, \2, etc. Groups are numbered by their opening parenthesis, left to right.

Q1.4: first name + middle initial + last name all start with the same letter

\|([A-Z])[a-z]+ \1\. \1[a-z]+\|

Read it left to right: pipe, capture an uppercase letter, lowercase rest, space, same letter + literal dot + space, same letter starting the surname, lowercase rest, pipe.
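
To see the backreference at work, the Q1.4 pattern can be tried on one row that should match and one that should not. This is a sketch with Python's re module standing in for grep -E; the first row is made up, the second comes from the sample data.

```python
import re

pattern = r"\|([A-Z])[a-z]+ \1\. \1[a-z]+\|"

rows = [
    "Hypothetical Prize|1990|Anna A. Adams|Female|Canada|1950",  # made-up row
    "Nobel Prize for medicine|1963|Andrew F. Huxley|Male|United Kingdom|1917",
]

hits = [r for r in rows if re.search(pattern, r)]
captured = re.search(pattern, rows[0]).group(1)  # the letter held by \1

print(hits, captured)
```

"Andrew F. Huxley" is rejected because once group 1 has captured A, both \1 positions insist on another A.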

Optional middle name (or no middle name)

Best pattern: a character class with * (zero or more). It is safer than .* because it cannot accidentally match across a pipe into the next field.

# Same first/last initial, optional middle:
grep -E '\|([A-Z])[a-z]+ [a-zA-Z. ]*\1[a-z]+\|' awards.psv
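
The optional-middle pattern can be exercised on names with and without a middle part. This is a sketch using Python's re module in place of grep -E; the first two rows are made up, the third comes from the sample data.

```python
import re

pattern = r"\|([A-Z])[a-z]+ [a-zA-Z. ]*\1[a-z]+\|"

rows = [
    "Hypothetical Prize|1990|Marie Martin|Female|France|1950",   # made-up, no middle name
    "Hypothetical Prize|1991|Anna B. Adams|Female|Canada|1951",  # made-up, middle initial
    "Fields Medal|1982|William Thurston|Male|United States|1946",
]

hits = [r for r in rows if re.search(pattern, r)]
print(len(hits))
```

One caveat: because the class also contains letters, \1 is allowed to land in the middle of a word (the A in a hypothetical surname such as McAdam), so treat the pattern as a good exam answer rather than a perfect name parser.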

Backreferences DON'T work inside [...]

[\1\2] doesn't mean "match group 1 OR group 2." Inside brackets, \1 is just a literal backslash and a literal 1. To say "match group 1 OR group 2," use alternation outside brackets: (\1|\2).
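
A three-character toy pattern ("the third character equals the first or the second") shows the two forms side by side. This sketch uses Python's re module, with one difference worth labelling: Python reads \1 inside brackets as a numeric character escape rather than as a backslash and a 1. In neither tool is it a backreference, which is the point being illustrated.

```python
import re

in_brackets = r"^(.)(.)[\1\2]$"   # NOT a backreference inside [...]
alternation = r"^(.)(.)(\1|\2)$"  # backreferences combined with OR

bracket_result = re.search(in_brackets, "aba")   # no match

alt_results = {
    word: re.search(alternation, word) is not None
    for word in ["aba", "abb", "abc"]
}

print(bracket_result, alt_results)
```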

4-digit palindrome year (ABBA)

# 1881, 1991, 2002 ...
grep -E '\|([0-9])([0-9])\2\1$' awards.psv

Group 1 = first digit. Group 2 = second digit. Then \2\1 mirrors them. $ anchors the year to the end of the line because birth year is the last field.
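
The mirror pattern can be tested on a few birth years. This is a sketch with Python's re module standing in for grep -E; the rows with palindrome years are made up, and the 1946 row comes from the sample data.

```python
import re

pattern = r"\|([0-9])([0-9])\2\1$"

rows = [
    "Hypothetical Prize|1950|Sam Example|Male|Canada|1881",   # made-up row
    "Hypothetical Prize|2050|Pat Sample|Female|France|2002",  # made-up row
    "Fields Medal|1982|William Thurston|Male|United States|1946",
]

palindromes = [r.split("|")[-1] for r in rows if re.search(pattern, r)]
print(palindromes)
```

The award year is never tested, because the pipe plus four digits have to sit directly against the end of the line.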

Q2 Shell Pipelines

The philosophy

A pipeline is a chain of small programs, each doing one thing, connected by |. The output of one becomes the input of the next. Don't write the whole thing at once: build it stage by stage and check the output after each stage.

⚠️ The three meanings of |

grep -E '\|M$' | cut -d'|' -f3
          ↑ ↑  ↑        ↑
          | |  |        └─ INSIDE quotes: cut's literal delimiter
          | |  └─ OUTSIDE quotes: SHELL pipe (connects commands)
          | └─ INSIDE quotes: $ is the end-of-line anchor (grep regex), not a pipe
          └─ INSIDE quotes: ESCAPED literal pipe (regex)

Same character, three jobs depending on context (quotes + position). This trips up everyone; once you see it, you can't unsee it.

Tools you'll chain

Tool          Job                            Common flags
grep          filter lines by pattern        -E ERE · -v invert · -i ignore case · -c count
cut           extract columns                -d'X' -f N (delimiter, field; -d defaults to TAB, so | data needs both)
sort          order lines                    -n numeric · -u unique · -r reverse
uniq          collapse adjacent duplicates   -d only dups · -u only uniques · -c with counts
wc            count                          -l lines
head / tail   first / last N lines           -n N
seq           integers from n to m           e.g. seq 1 5
tr            translate / delete chars       -d delete · -s squeeze
sed           substitute                     's/old/new/g'

The Q2 problem

Read enrolment lines from stdin (5 fields: course|id|name|plan|gender). Output the surnames of male students, sorted, with no duplicates.

COMP1511|3360379|Costner, Kevin Augustus |3978/1|M
COMP1511|3364562|Carey, Mary |3711/1|F
COMP3311|3383025|Thorpe, Ian Augustus |3978/3|M
...

The 4-stage pattern

#!/bin/dash
grep -E '\|M$' |      # filter: male students
cut -d'|' -f3 |       # extract: name field
cut -d',' -f1 |       # extract: surname
sort -u               # sort + dedupe

This is the master template for "filter → extract → dedupe → (count)" questions. Memorise it.

Counting variant: wc -l

# How many distinct students enrolled in COMP3331?
grep -E '^COMP3331\|' enrolments.txt | cut -d'|' -f2 | sort -u | wc -l

Always dedupe by the uniquely identifying field (student ID), not by name: two different students can share a name. Two "Wang, Wei" rows with different IDs are two people.
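
The ID-versus-name point is easy to check on a handful of rows. The sketch below redoes the counting pipeline in Python (filter by course prefix, take field 2, dedupe with a set, count); all COMP3331 rows are made up for the illustration.

```python
# Made-up enrolment rows: two different students share the name "Wang, Wei",
# and the first student appears twice.
data = """COMP3331|3360379|Wang, Wei |3978/1|M
COMP3331|3364562|Wang, Wei |3711/1|F
COMP3331|3360379|Wang, Wei |3978/1|M
COMP1511|3383025|Thorpe, Ian Augustus |3978/3|M
"""

rows = [line for line in data.splitlines() if line.startswith("COMP3331|")]

distinct_ids = {row.split("|")[1] for row in rows}    # like cut -f2 | sort -u
distinct_names = {row.split("|")[2] for row in rows}  # wrong field to dedupe on

print(len(distinct_ids), len(distinct_names))
```

Deduping on the ID gives two students; deduping on the name would report only one.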

⚠️ Bugs you hit

  • cut '|' -f3 is missing -d. Without it, cut takes '|' as the name of a file to read, and the delimiter stays at the default TAB, so a whole |-separated line counts as one field. Always: cut -d'|' -f3.
  • Plain sort on numbers gives 1, 10, 2, 45 (alphabetical). Use sort -n for numeric order.
  • grep 'COMP3331' works but is loose: it would also match a hypothetical course "COMP33119." Defensive form: grep -E '^COMP3331\|'.
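
The sort versus sort -n difference is the same one Python shows between sorting strings and sorting by integer value. A minimal sketch, using the numbers from the bullet above:

```python
values = ["10", "2", "45", "1"]

as_text = sorted(values)              # like plain sort: compares character by character
as_numbers = sorted(values, key=int)  # like sort -n: compares numeric value

print(as_text)
print(as_numbers)
```

Lines read from a file are always text, which is why the numeric behaviour has to be requested explicitly in both tools.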

Q3 Python (same as Q2, different language)

stdin in Python

sys.stdin is iterable, so you can loop through it line by line. The program doesn't know whether the data comes from the keyboard, a file (< file.txt), or another command (cmd | ./script.py). That's the point.

The Q3 solution

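#!/usr/bin/python3
import sys

surnames = set()                   # a set drops duplicates automatically
for line in sys.stdin:
    line = line.rstrip('\n')       # remove the newline so the last field is exactly 'M' or 'F'
    fields = line.split('|')
    if fields[4] != 'M':           # field 5 (index 4) is gender
        continue
    name = fields[2]               # field 3 (index 2) is "Surname, Given names"
    surname = name.split(',')[0]
    surnames.add(surname)

for surname in sorted(surnames):
    print(surname)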
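
Because the solution only iterates over lines, the same logic can be wrapped in a function and fed any iterable, which makes it testable without a terminal. This is a sketch, not part of the exam answer: the function name male_surnames is hypothetical, io.StringIO stands in for sys.stdin, and the fourth row is a made-up duplicate to show the dedupe.

```python
import io

def male_surnames(lines):
    """Return the sorted, de-duplicated surnames of male students."""
    surnames = set()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if fields[4] != "M":
            continue
        surnames.add(fields[2].split(",")[0])
    return sorted(surnames)

sample = io.StringIO(
    "COMP1511|3360379|Costner, Kevin Augustus |3978/1|M\n"
    "COMP1511|3364562|Carey, Mary |3711/1|F\n"
    "COMP3311|3383025|Thorpe, Ian Augustus |3978/3|M\n"
    "COMP3311|3360379|Costner, Kevin Augustus |3978/1|M\n"  # made-up duplicate
)

result = male_surnames(sample)
print(result)
```

Calling male_surnames(sys.stdin) would behave like the script itself.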

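Why each line matters

  • line.rstrip('\n'): each line read from sys.stdin still ends in a newline; without stripping it, the last field would be 'M\n' and never equal 'M'.
  • line.split('|'): the Python equivalent of cut -d'|'; the resulting fields are numbered from 0.
  • set(): adding the same surname twice keeps one copy, which is the -u in sort -u.
  • sorted(surnames): a set has no order, so sort at output time.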

Shell ↔ Python translation table

Shell             Python
grep PATTERN      if PATTERN not in line: continue
grep -v PATTERN   if PATTERN in line: continue
cut -d'X' -f3     line.split('X')[2]
sort -u           set() then sorted(s)
wc -l             len(items)
$1                sys.argv[1]
cat < file        for line in sys.stdin:
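
One limit of the first two rows is worth seeing once: Python's in operator is a plain substring test, so it only stands in for grep when the pattern has no anchors or other regex syntax. For an anchored pattern such as '\|M$', re.search is the closer translation. A small sketch, with a made-up second row:

```python
import re

rows = [
    "COMP1511|3360379|Costner, Kevin Augustus |3978/1|M",
    "COMP1511|3364562|Monroe, Mary |3711/1|F",  # made-up row: "|M" appears in "|Monroe"
]

substring_hits = [r for r in rows if "|M" in r]            # no anchoring possible
regex_hits = [r for r in rows if re.search(r"\|M$", r)]    # like grep -E '\|M$'

print(len(substring_hits), len(regex_hits))
```

The Q3 solution sidesteps the problem by splitting into fields first and comparing fields[4] exactly.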

⚠️ Top-5 Python bugs you hit

  1. Missing colon: for line in sys.stdin → must be for line in sys.stdin:
  2. = vs ==: if x = 5 is a syntax error (= assigns); if x == 5 is the comparison.
  3. Unquoted string: if fields[4] == F makes Python look up a variable named F and raise NameError. Write 'F'.
  4. print x is Python 2. Python 3 needs parentheses: print(x).
  5. Off-by-one: the spec says "field 3" → Python uses fields[2].

Run it

chmod +x practice_q3.py
./practice_q3.py < enrolments.txt       # redirect file as stdin
cat enrolments.txt | ./practice_q3.py   # pipe in
./practice_q3.py                        # type lines + Ctrl+D

Q4 Shell Script: Find the Missing Integer

The problem

A file contains an unordered list of positive integers from n to m, with possibly one missing. Print the missing integer, or nothing if none is missing.

Final answer (4 lines)

#!/bin/dash
n=$(sort -n "$1" | head -n 1)
m=$(sort -n "$1" | tail -n 1)
( sort -n "$1" ; seq "$n" "$m" ) | sort -n | uniq -u

Concept 1: $1 is a filename, not the data

When you run ./practice_q4.sh numbers_1.txt, $1 is the string "numbers_1.txt". To get the actual numbers inside, you must use a tool that reads the file: cat "$1", sort "$1", etc. Always quote "$1" in case the filename has spaces.

Concept 2: $() captures output

Without $(), a command's output goes to the screen and is gone. With $(), you capture it as a string in a variable.

sort -n "$1" | head -n 1        # → prints to screen, lost
n=$(sort -n "$1" | head -n 1)   # → captured into n="39"

⚠️ Variable assignment rule

No spaces around =! n=42 works. n = 42 fails (the shell tries to run a command named n with the arguments = and 42).

Concept 3: -n means different things

Command      -n means
sort -n      numeric sort (not alphabetical)
head -n 5    show this many lines (the first 5)
tail -n 1    show this many lines (the last 1)

Concept 4: the clever trick, sort | uniq -u

Combine the actual numbers (from the file) with the expected complete sequence (from seq). Numbers in both lists appear twice. The missing number appears once. uniq -u prints only the lines that appear exactly once.

# Walk through with file = 39 45 40 44 41 43 (n=39, m=45):
sort -n "$1"                       → 39 40 41 43 44 45       # 42 missing!
seq 39 45                          → 39 40 41 42 43 44 45    # complete
( sort -n "$1" ; seq "$n" "$m" )   → combined stream
  | sort -n                        → 39 39 40 40 41 41 42 43 43 44 44 45 45
  | uniq -u                        → 42 ✓
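
The same counting idea can be written in Python to check the walkthrough by hand. This is a sketch of the technique rather than an exam answer (Q4 asks for a shell script): the function name missing_integers is hypothetical, collections.Counter plays the role of sort | uniq -u, and the input is assumed to contain no repeated numbers, as in the problem statement.

```python
from collections import Counter

def missing_integers(numbers):
    """Mimic `( sort -n file ; seq n m ) | sort -n | uniq -u`."""
    n, m = min(numbers), max(numbers)
    combined = list(numbers) + list(range(n, m + 1))  # file contents + seq n m (inclusive)
    counts = Counter(combined)
    # Keep the values seen exactly once, in numeric order (what uniq -u prints).
    return sorted(value for value, count in counts.items() if count == 1)

walkthrough = missing_integers([39, 45, 40, 44, 41, 43])
nothing_missing = missing_integers([3, 1, 2])

print(walkthrough, nothing_missing)
```

When nothing is missing, every value is counted twice and the result is empty, with no special case in the code.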

Concept 5: subshell grouping ( cmd1 ; cmd2 )

Parentheses run two (or more) commands in a subshell and merge their outputs into one stream, which can then be piped. Semicolons separate the commands within the group.

💡 Why this is elegant

If nothing is missing, every number appears twice → uniq -u outputs nothing. The spec is satisfied with no special-case code.

Build it incrementally

  1. Step 1: cat "$1" to confirm the $1 plumbing works.
  2. Step 2: capture min/max with $(), then echo them to confirm.
  3. Step 3: run seq "$n" "$m" on its own.
  4. Step 4: combine + sort + uniq -u.

📝 Test Yourself

30 bilingual questions · click an option to see the answer + explanation immediately
