COMP2521 Week 9 - Hash Tables (Linear Probing)

📦 What is a Hash Table? 📦 什么是哈希表？

A hash table stores key-value pairs with O(1) average-case insert, search, and delete. It uses a hash function to map keys to array indices. 哈希表存储键-值对，平均情况支持 O(1) 插入、查找、删除。通过哈希函数将键映射到数组索引。

Core Concepts:核心概念：

Hash function: h(key) → integer index in [0, N-1]哈希函数：h(key) → [0, N-1] 中的整数索引
Ideal slot for key = h(key) % Nkey 的理想位置 = h(key) % N
Collision: two keys hash to the same slot碰撞：两个 key 映射到同一个槽位
Load factor α = numItems / numSlots (keep ≤ 0.5 for probing)负载因子 α = numItems / numSlots（探测法保持 ≤ 0.5）

Simple Hash Function 简单哈希函数

static inline unsigned int hash(int key, int N) {
    return key % N;
}

Example (N = 10): key 15 → slot 5; key 26 → slot 6; key 5 → slot 5 (collision with 15!) 例子（N = 10）：key 15 → 槽位 5；key 26 → 槽位 6；key 5 → 槽位 5（与 15 碰撞！）

🔍 Linear Probing 🔍 线性探测

When the ideal slot is taken, linear probing looks at the next slot, then the next, until it finds an empty slot (or matching key). Wrap around to index 0 when you reach the end. 当理想槽位被占时，线性探测查看下一个槽位，再下一个，直到找到空位（或匹配的 key）。到达末尾时绕回索引 0。

Insertion Algorithm 插入算法

void HashTableInsert(HashTable table, int key, int value) {
    if (table->numItems >= table->numSlots * MAX_LOAD_FACTOR) {
        resize(table);
    }
    int index = hash(key, table->numSlots);

    // Linear probing: walk forward until empty OR matching key
    while (!table->slots[index].empty) {
        if (table->slots[index].key == key) {
            table->slots[index].value = value;  // overwrite
            return;                              // DO NOT numItems++
        }
        index = (index + 1) % table->numSlots;   // wrap-around
    }

    // Found empty slot
    table->slots[index].key = key;
    table->slots[index].value = value;
    table->slots[index].empty = false;
    table->numItems++;
}

Three rules of linear probing insert: 线性探测插入的三条铁律：

If slot is empty → insert here, numItems++槽位为空 → 插入这里，numItems++
If slot has same key → overwrite value, DO NOT increment numItems槽位 key 相同 → 覆盖 value，不要增加 numItems
Otherwise → move to (index + 1) % N否则 → 移动到 (index + 1) % N

Walkthrough (N = 10) 插入演示（N = 10）

Insert (15, 4): h(15)=5 → slot 5 empty ✓ Insert (1, 12): h(1)=1 → slot 1 empty ✓ Insert (26, 7): h(26)=6 → slot 6 empty ✓ Insert (5, 28): h(5)=5 → slot 5 TAKEN (key=15) → slot 6 TAKEN (key=26) → slot 7 empty ✓ (probed 2 extra slots) Final table: index: 0 1 2 3 4 5 6 7 8 9 key: --- 1 --- --- --- 15 26 5 --- --- value: --- 12 --- --- --- 4 7 28 --- ---

🗑️ Backshift Deletion 🗑️ 回移删除

With linear probing, you cannot just mark a slot empty on delete — it breaks the probe chain. Use backshift: clear the slot, then rehash all subsequent non-empty slots until you hit an empty one. 使用线性探测时，删除槽位不能简单清空 — 会打断探测链。使用回移：清空槽位后，重新哈希所有后续非空槽位，直到遇到空位。

⚠️ Why simple delete breaks things: Suppose key 5 is at slot 7 (probed from slot 5). If you delete key 15 at slot 5 by marking it empty, then search(5) hashes to slot 5, sees empty, and gives up — even though 5 is at slot 7! ⚠️ 为什么简单删除会出问题： 假设 key 5 在槽位 7（从槽位 5 探测过来）。如果你把槽位 5 的 key 15 标记为空，那么 search(5) 映射到槽位 5，看到空，就放弃了 — 但 5 明明在槽位 7！

Deletion Algorithm 删除算法

void HashTableDelete(HashTable table, int key) {
    int i = getIndex(table, key);
    if (i == MISSING_INDEX) return;  // key not found

    // Clear the target slot
    table->slots[i].empty = true;
    table->numItems--;

    // Backshift: rehash subsequent non-empty slots
    int j = (i + 1) % table->numSlots;
    while (!table->slots[j].empty) {
        int rehashKey = table->slots[j].key;
        int rehashValue = table->slots[j].value;

        table->slots[j].empty = true;
        table->numItems--;

        HashTableInsert(table, rehashKey, rehashValue);
        j = (j + 1) % table->numSlots;
    }
}

Backshift Walkthrough 回移演示

Before delete(15): index: 5 6 7 key: 15 26 5 (5 was probed from slot 5) Step 1: Clear slot 5 (delete 15) index: 5 6 7 key: --- 26 5 Step 2: j=6 → non-empty (26) Save (26, value), clear slot 6, reinsert 26 h(26)=6 → slot 6 empty now → placed at slot 6 Step 3: j=7 → non-empty (5) Save (5, value), clear slot 7, reinsert 5 h(5)=5 → slot 5 is EMPTY now → placed at slot 5 ✓ (chain fixed!) Step 4: j=8 → empty → STOP After: index: 5 6 7 key: 5 26 --- ✓ search(5) works again

Why it works: an empty slot is a natural boundary — any key beyond it must have hashed to ≥ j, independent of the deleted key. 原理： 空槽位是天然边界 — 空位之后的 key 一定是从 ≥ j 的位置映射来的，与被删除的 key 无关。

💡 Applications (LeetCode Patterns) 💡 应用场景（LeetCode 模式）

Pattern 1: Existence Check — Find Missing Integer 模式 1：存在性检查 — 找缺失整数

Given an array with integers 1 to n and exactly one missing, find the missing one. Naive O(n²) → hash table O(n). 给定 1 到 n 中恰好少一个的数组，找出缺失的数。朴素 O(n²) → 哈希表 O(n)。

int findMissingInteger(int *integers, int n) {
    HashTable seen = HashTableNew();
    for (int i = 0; i < n - 1; i++) {
        HashTableInsert(seen, integers[i], 1);
    }
    int missing = -1;
    for (int i = 1; i <= n; i++) {
        if (!HashTableContains(seen, i)) {
            missing = i;
            break;
        }
    }
    HashTableFree(seen);
    return missing;
}

Pattern 2: Counter — Find the Vote Winner 模式 2：计数器 — 找投票赢家

Count frequency of each candidate. The one with the most votes wins; ties return NO_WINNER. 统计每个候选人的频次。票数最多的赢；平局返回 NO_WINNER。

int findWinner(int *ballots, int numBallots) {
    HashTable counts = HashTableNew();
    for (int i = 0; i < numBallots; i++) {
        int newCount = HashTableGetOrDefault(counts, ballots[i], 0) + 1;
        HashTableInsert(counts, ballots[i], newCount);
    }
    int winner = NO_WINNER, maxVotes = 0;
    bool tie = false;
    for (int i = 0; i < numBallots; i++) {
        int votes = HashTableGet(counts, ballots[i]);
        if (votes > maxVotes) {
            maxVotes = votes; winner = ballots[i]; tie = false;
        } else if (votes == maxVotes && ballots[i] != winner) {
            tie = true;
        }
    }
    HashTableFree(counts);
    return tie ? NO_WINNER : winner;
}

Pattern 3: Bijection — Similar Strings (Isomorphic) 模式 3：双射 — 相似字符串（同构）

Two strings are similar if there is a one-to-one mapping between their characters. Example: "adt" ↔ "bst" ✓, but "adt" ↔ "dcc" ✗ (d and t both map to c — not one-to-one). 两个字符串相似当且仅当存在字符间的一一映射。例子："adt" ↔ "bst" ✓，但 "adt" ↔ "dcc" ✗（d 和 t 都映射到 c — 不是一一对应）。

Key insight: use TWO hash tables to track both directions. A single hash table can only detect "one-to-many", missing "many-to-one" violations like "aa" ↔ "bc". 关键洞察：用两张哈希表追踪两个方向。单张哈希表只能检测"一对多"，漏掉"多对一"的违规，比如 "aa" ↔ "bc"。

bool areSimilarStrings(char *s1, char *s2) {
    if (strlen(s1) != strlen(s2)) return false;
    HashTable forward = HashTableNew();   // s1[i] -> s2[i]
    HashTable backward = HashTableNew();  // s2[i] -> s1[i]
    bool result = true;
    for (int i = 0; s1[i] != '\0'; i++) {
        int a = s1[i], b = s2[i];
        if (HashTableContains(forward, a)) {
            if (HashTableGet(forward, a) != b) { result = false; break; }
        } else if (HashTableContains(backward, b)) {
            result = false; break;   // many-to-one violation
        } else {
            HashTableInsert(forward, a, b);
            HashTableInsert(backward, b, a);
        }
    }
    HashTableFree(forward);
    HashTableFree(backward);
    return result;
}

⏱️ Time Complexity ⏱️ 时间复杂度

Operation操作	Average平均	Worst最坏
Insert插入	O(1)	O(n)
Search / Contains查找 / 包含	O(1)	O(n)
Delete (backshift)删除（回移）	O(1)	O(n)

Why O(n) worst case? If the load factor is too high or the hash function is poor, probe chains grow long and every operation has to walk through many slots. 为什么最坏是 O(n)？ 如果负载因子过高或哈希函数不好，探测链会变得很长，每次操作都要遍历许多槽位。

Rule of thumb: keep load factor ≤ 0.5 by resizing — this guarantees average O(1). 经验法则： 通过扩容保持负载因子 ≤ 0.5 — 保证平均 O(1)。

⚠️ Common Pitfalls ⚠️ 常见陷阱

❌ Wrong错误

Forgetting % numSlots → out-of-bounds忘记 % numSlots → 数组越界
Using = instead of == in comparison比较时用 = 而非 ==
Incrementing numItems on overwrite覆盖时增加 numItems
Simple clear-on-delete (breaks probe chain)简单清空删除（打断探测链）
Single hash table for bijection checks双射检查只用一张哈希表
Forgetting HashTableFree → memory leak忘记 HashTableFree → 内存泄漏

✅ Correct正确

Always (index + 1) % numSlots始终用 (index + 1) % numSlots
Use ==, or Yoda style CONST == i比较用 ==，或 Yoda 式 CONST == i
Overwrite returns without incrementing覆盖时 return 不增加计数
Use backshift deletion使用回移删除
Two tables: forward and backward两张表：forward 和 backward
Always free allocated tables before return返回前总是释放分配的表

📝 Summary 📝 总结

Core memorisation points: 核心记忆点：

Insert: while (!empty) { same-key overwrite / index = (i+1) % N }插入：while (!empty) { 相同 key 覆盖 / index = (i+1) % N }
Delete (backshift): clear target, then rehash all non-empty slots forward until empty found删除（回移）：清空目标，然后重哈希所有后续非空槽位直到遇空
Existence pattern: dump all into table → query 1..n存在性模式：全扔进表 → 查 1..n
Counter pattern: GetOrDefault(k, 0) + 1 → Insert计数器模式：GetOrDefault(k, 0) + 1 → Insert
Bijection pattern: two hash tables for two directions双射模式：两张哈希表检查两个方向

💭 Analogy: a hash table is like gym lockers. Your membership number mod N is your locker. If it's taken, try the next one (linear probing). When someone checks out (deletion), anyone after them who "squeezed past" needs to try again — maybe they can move to a better locker now (backshift). 💭 类比：哈希表就像健身房储物柜。会员号 mod N 是柜号。被占了就试下一个（线性探测）。有人退柜时（删除），排在他后面"挤过去"的人要重新刷卡 — 也许他们现在能回到更合适的柜子（回移）。