Lab 02: UNIX Pipelines & Filters

Overview

This lab covers chaining Unix commands into pipelines and using filters such as sort, cut, uniq, wc, grep, awk, and sed to process text data. Mastering these patterns is essential for the lab exercises and the exam.

What You'll Learn:

  • Chain commands with pipes (|)
  • Sort data by multiple columns with -k options
  • Extract specific fields with cut
  • Count and filter with uniq -c and awk
  • Edit text streams with sed

1. UNIX Pipeline

The pipe (|) connects the standard output of one command to the standard input of the next.

command1 | command2 | command3

Example

$ cat enrolments.psv | sort | head -n 10

Read the file → sort the lines → show the first 10
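
The same three-stage shape can be tried without enrolments.psv. This is a minimal sketch on inline data; the course codes are made up for illustration, and LC_ALL=C is an assumption added so the sort order is the same on every machine.

```shell
# Hypothetical sample data standing in for enrolments.psv.
# Stage 1 emits lines, stage 2 sorts them, stage 3 keeps the first two.
first_two=$(printf 'COMP1531\nCOMP1511\nCOMP1521\n' | LC_ALL=C sort | head -n 2)
echo "$first_two"
```

Each stage only sees the output of the stage before it, so head never sees the unsorted input.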

2. sort: Sort Lines

Option       Meaning                        Example
-t'|'        Set the field delimiter to |   sort -t'|'
-k2,2        Sort by column 2 only          sort -k2,2
-n           Numeric sort                   sort -k2,2n
-r           Reverse (descending) order     sort -k1,1r
-u           Sort and remove duplicates     sort -u
-k6.5,6.7    Column 6, characters 5 to 7    sort -k6.5,6.7nr
⚠️ sort vs sort -n
Without -n, numbers sort alphabetically: 10 comes before 9.
With -n, they sort numerically: 9 comes before 10.
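
The difference is easy to see on three numbers. A minimal sketch; the input values are arbitrary and LC_ALL=C is an assumption added for a predictable byte-wise ordering.

```shell
# The same three numbers, sorted two ways.
alpha=$(printf '9\n10\n2\n' | LC_ALL=C sort)       # compares character by character
numeric=$(printf '9\n10\n2\n' | LC_ALL=C sort -n)  # compares numeric values
echo "$alpha"     # 10, 2, 9 (one per line)
echo "$numeric"   # 2, 9, 10 (one per line)
```
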
💡 Multiple sort keys
$ sort -t'|' -k2,2n -k1,1r file.psv
Sort by column 2 first (numeric, ascending); where column 2 ties, sort by column 1 (alphabetic, descending).
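
A minimal sketch of the two-key sort on inline data. The name|mark records are hypothetical, chosen so that column 2 has ties.

```shell
# Hypothetical records: name|mark. Two students share each mark.
sorted=$(printf 'alice|70\nbob|85\ncarol|70\ndave|85\n' \
  | LC_ALL=C sort -t'|' -k2,2n -k1,1r)
echo "$sorted"
# Within mark 70, carol comes before alice because key 1 is descending.
```
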
📌 Understanding the -k format

-kSTART,END[options]

Form         Meaning
-k1,1        Column 1 only, alphabetic ascending
-k1,1r       Column 1 only, alphabetic descending
-k2,2n       Column 2 only, numeric ascending
-k6.5,6.7    Column 6, characters 5 through 7
⚠️ -k1,14 means "from column 1 through column 14", NOT "column 1, character 4"! Always write -k1,1 to mean "column 1 only".
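
The FIELD.CHAR form can be sketched on a small example. The records below are hypothetical: column 2 holds a three-letter day followed by a two-digit hour, so characters 4 to 5 of column 2 are the hour.

```shell
# Hypothetical records: id|DayHH. Sort numerically on characters 4-5 of column 2.
by_hour=$(printf 'A|Mon10\nB|Tue09\nC|Wed14\n' \
  | LC_ALL=C sort -t'|' -k2.4,2.5n)
echo "$by_hour"
```

Because -t'|' is given, character positions are counted from the first character of the field.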

3. cut: Extract Columns

cut splits each line into fields on a delimiter, then prints the field(s) you select.

Option    Meaning
-f1       Take field (column) 1; fields are Tab-separated by default
-f1,3     Take columns 1 and 3
-d'|'     Use | as the delimiter

Example: extract the day and start hour from "Mon 10:00-12:00"

$ echo "Mon 10:00-12:00" | cut -d'-' -f1
Mon 10:00
$ echo "Mon 10:00-12:00" | cut -d'-' -f1 | cut -d':' -f1
Mon 10

Step 1: split on -, take part 1 → Mon 10:00
Step 2: split on :, take part 1 → Mon 10
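
Two further uses of cut, as a minimal sketch. The pipe-separated record is hypothetical, and splitting on a space to isolate the hour is one possible approach rather than the only one.

```shell
# Select columns 1 and 3 of a pipe-separated record (hypothetical data).
# cut joins the selected fields with the same delimiter it split on.
cols=$(echo 'COMP1511|5012345|Smith, John' | cut -d'|' -f1,3)

# Isolate only the start hour: split on the space first, then on the colon.
hour=$(echo 'Mon 10:00-12:00' | cut -d' ' -f2 | cut -d':' -f1)

echo "$cols"   # COMP1511|Smith, John
echo "$hour"   # 10
```
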

4. uniq: Remove / Count Duplicates

⚠️ uniq only removes adjacent duplicates! Always sort first.

Command    Meaning
uniq       Remove adjacent duplicate lines
uniq -c    Count how many times each line appears

$ sort file | uniq -c
      3 COMP1511
      1 COMP1521
      2 COMP1531

Output format: count, then value. When this output is piped into awk, the count is field $1 and the value is field $2.
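
Why the sort matters can be seen on three lines. A minimal sketch with made-up input; the awk step is only there to re-print the padded count as "count value", since the amount of padding in uniq -c output can differ between systems.

```shell
# Hypothetical input with a non-adjacent duplicate: "a" is on lines 1 and 3.
input='a
b
a'
unsorted=$(printf '%s\n' "$input" | uniq)                # nothing is adjacent, nothing removed
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)  # sort puts the two "a" lines together
counts=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
echo "$unsorted"   # a, b, a (one per line)
echo "$sorted"     # a, b (one per line)
echo "$counts"     # "2 a" then "1 b"
```
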

5. wc: Count Lines / Words / Chars

Command    Meaning
wc -l      Count lines
wc -w      Count words
wc -c      Count bytes (equal to characters for plain ASCII text; wc -m counts characters)
💡 To get just the number (no filename), feed the file to wc through standard input:
$ cat file | wc -l
1275
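
The filename appears only when wc is given a file argument. A minimal sketch using a temporary file created with mktemp; the file contents are arbitrary, and input redirection is shown as an alternative to cat that has the same effect.

```shell
# Create a throwaway two-line file (hypothetical contents).
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

with_name=$(wc -l "$tmp")     # count followed by the filename
only_count=$(wc -l < "$tmp")  # count only: wc reads standard input and knows no name
words=$(wc -w < "$tmp")       # three words in total

rm -f "$tmp"
echo "$with_name"
echo "$only_count"
echo "$words"
```

Some wc implementations pad the number with leading spaces, so strip whitespace before comparing it to another value.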

6. grep: Filter Lines by Pattern

Command              Meaning
grep 'pattern'       Keep lines matching pattern
grep -v 'pattern'    Keep lines NOT matching pattern
grep '^COMP'         Keep lines starting with COMP
grep -E 'pattern'    Use extended regex (enables |, +, ?)
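
The forms in the table can be sketched on a short list. The course codes are hypothetical sample data.

```shell
# Hypothetical course list, one code per line.
courses='COMP1511
MATH1131
COMP1521
SENG1031'

comp=$(printf '%s\n' "$courses" | grep '^COMP')             # lines starting with COMP
not_comp=$(printf '%s\n' "$courses" | grep -v '^COMP')      # everything else
either=$(printf '%s\n' "$courses" | grep -E '^(MATH|SENG)') # alternation needs -E
echo "$comp"
echo "$not_comp"
echo "$either"
```
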

7. awk: Filter by Count

awk processes input one line at a time and splits each line into whitespace-separated fields ($1, $2, and so on). After uniq -c, use it to filter by occurrence count.

$ sort file | uniq -c | awk '$1 >= 2 {print $2}'

Part          Meaning
$1            First column (the count from uniq -c)
$2            Second column (the value)
$1 >= 2       Condition: count ≥ 2
{print $2}    Action: print the second column
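
A minimal sketch of the count filter on inline data. The course codes are hypothetical, with COMP1511 appearing three times, COMP1531 twice and COMP1521 once.

```shell
# Hypothetical enrolment list with repeated course codes.
repeated=$(printf 'COMP1511\nCOMP1531\nCOMP1511\nCOMP1521\nCOMP1531\nCOMP1511\n' \
  | LC_ALL=C sort | uniq -c | awk '$1 >= 2 {print $2}')
echo "$repeated"   # COMP1511 and COMP1531; COMP1521 appears once and is dropped
```

awk skips the leading spaces that uniq -c puts before the count, so $1 is the count itself.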

8. sed: Stream Editor

Command            Meaning
s/old/new/         Replace the first match on each line
s/old/new/g        Replace ALL matches on each line
/pattern/d         Delete lines matching pattern
s/pattern//        Delete the matched text; the line itself stays, possibly empty
-n '/pattern/p'    Print only matching lines (-n turns off the default printing)
/start/,/end/d     Delete each range of lines from a start match through the next end match
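
The commands in the table can be sketched on tiny inputs. All input strings here are made up for illustration.

```shell
first=$(echo 'a-b-c' | sed 's/-/+/')    # only the first match is replaced
all=$(echo 'a-b-c' | sed 's/-/+/g')     # g replaces every match on the line
printed=$(printf 'keep 1\ndrop\nkeep 2\n' | sed -n '/keep/p')      # print matching lines only
ranged=$(printf '1\nstart\n2\nend\n3\n' | sed '/start/,/end/d')    # drop start..end inclusive
echo "$first"     # a+b-c
echo "$all"       # a+b+c
echo "$printed"
echo "$ranged"
```
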
📌 Capture groups in sed

Use \(.*\) to capture content and \1 to reuse it.

$ sed 's/#include "\(.*\)"/#include <\1>/' program.c

#include "stdlib.h" → \1 captures stdlib.h → result: #include <stdlib.h>

⚠️ Capture groups must be paired: \( opens, \) closes. Leaving one out causes an error!
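
The substitution can be checked on a single line instead of a whole program.c. A minimal sketch; the second command uses the -E option, which GNU and BSD sed accept, so the group is written with plain parentheses.

```shell
# Basic regex: the group is written \( ... \).
basic=$(echo '#include "stdlib.h"' | sed 's/#include "\(.*\)"/#include <\1>/')

# Extended regex (-E): the same group is written ( ... ).
extended=$(echo '#include "stdlib.h"' | sed -E 's/#include "(.*)"/#include <\1>/')

echo "$basic"      # #include <stdlib.h>
echo "$extended"   # #include <stdlib.h>
```
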
📌 d vs s//: delete the line vs keep an empty line
Goal                   Command            Result
Delete whole line      /TODO/d            Line disappears completely
Delete matched text    s/\/\/ TODO.*//    Empty line remains (leading spaces stay)
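
The two results differ in line count. A minimal sketch on a hypothetical three-line C fragment whose middle line is an indented TODO comment.

```shell
# Hypothetical source fragment; line 2 is "  // TODO fix" with two leading spaces.
src='int x;
  // TODO fix
int y;'

deleted=$(printf '%s\n' "$src" | sed '/TODO/d')           # line 2 is removed
blanked=$(printf '%s\n' "$src" | sed 's/\/\/ TODO.*//')   # line 2 keeps its two spaces

printf '%s\n' "$deleted" | wc -l   # 2 lines remain
printf '%s\n' "$blanked" | wc -l   # still 3 lines
```
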

9. Common Pipeline Patterns

A. Count total lines
$ cat file | wc -l
B. Count unique values in a column
$ cut -f1 file | sort -u | wc -l
C. Find the most frequent value
$ cut -f1 file | sort | uniq -c | sort -k1,1nr -k2,2 | head -n 1
D. Find values appearing ≥ 2 times
$ cut -f1 file | sort | uniq -c | awk '$1 >= 2 {print $2}' | sort -u
E. Find the max value in a column
$ cut -f6 file | sort -n | tail -n 1
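
Patterns B, C and E can be sketched end to end on inline data. The tab-separated records and the list of numbers are hypothetical; the final awk in pattern C is an addition that strips the count so only the value is left.

```shell
# Hypothetical tab-separated data: course code, then student id.
data=$(printf 'COMP1511\tz1\nCOMP1521\tz2\nCOMP1511\tz3\n')

# B: number of distinct values in column 1.
distinct=$(printf '%s\n' "$data" | cut -f1 | LC_ALL=C sort -u | wc -l | tr -d ' ')

# C: most frequent value in column 1 (count descending, then value ascending).
top=$(printf '%s\n' "$data" | cut -f1 | LC_ALL=C sort | uniq -c \
  | LC_ALL=C sort -k1,1nr -k2,2 | head -n 1 | awk '{print $2}')

# E: largest number in a list.
max=$(printf '3\n12\n7\n' | sort -n | tail -n 1)

echo "$distinct"   # 2
echo "$top"        # COMP1511
echo "$max"        # 12
```
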

Lab 02 Practice Quiz

Test your knowledge with 30 practice questions covering sort, cut, uniq, wc, grep, awk, and sed. You get immediate feedback on each answer, with detailed explanations.
