Lecture 4: Hashing
Total Page:16
File Type:pdf, Size:1020Kb
Lecture 4: Hashing Rafael Oliveira University of Waterloo Cheriton School of Computer Science [email protected] September 21, 2020 1 / 78 Overview Introduction Hash Functions Why is hashing? How to hash? Succinctness of Hash Functions Coping with randomness Universal Hashing Hashing using 2-universal families Perfect Hashing Acknowledgements 2 / 78 Computational Model Before we talk about hash functions, we need to state our model of computation: Definition (Word RAM model) In the word RAMa model: all elements are integers that fit in a machine word of w bits Basic operations (comparison, arithmetic, bitwise) on such words take Θ(1) time We can also access any position in the array in Θ(1) time aRAM stands for Random Access Model 3 / 78 Naive approach: use an array A of m elements, initially A[i] = 0 for all i, and when a key is inserted, set A[i] = 1. Insertion: O(1), Deletion: O(1), Search: O(1) Memory: O(m log(m)) (this is very bad!) Want to also achieve optimal memory O(n log(m)). For this we will use a technique called hashing. A hash function is a function h : U ! [0; n − 1], where jUj = m >> n. A hash table is a data structure that consists of: a table T with n cells [0; n − 1], each cell storing O(log(m)) bits a hash function h : U ! [0; n − 1] From now on, we will define memory as # of cells. What is hashing? We want to store n elements (keys) from the set U = f0; 1;:::; m − 1g, where m >> n, in a data structure that supports insertions, deletions, search \as efficiently as possible." 4 / 78 Insertion: O(1), Deletion: O(1), Search: O(1) Memory: O(m log(m)) (this is very bad!) Want to also achieve optimal memory O(n log(m)). For this we will use a technique called hashing. A hash function is a function h : U ! [0; n − 1], where jUj = m >> n. A hash table is a data structure that consists of: a table T with n cells [0; n − 1], each cell storing O(log(m)) bits a hash function h : U ! [0; n − 1] From now on, we will define memory as # of cells. What is hashing? We want to store n elements (keys) from the set U = f0; 1;:::; m − 1g, where m >> n, in a data structure that supports insertions, deletions, search \as efficiently as possible." Naive approach: use an array A of m elements, initially A[i] = 0 for all i, and when a key is inserted, set A[i] = 1. 5 / 78 Want to also achieve optimal memory O(n log(m)). For this we will use a technique called hashing. A hash function is a function h : U ! [0; n − 1], where jUj = m >> n. A hash table is a data structure that consists of: a table T with n cells [0; n − 1], each cell storing O(log(m)) bits a hash function h : U ! [0; n − 1] From now on, we will define memory as # of cells. What is hashing? We want to store n elements (keys) from the set U = f0; 1;:::; m − 1g, where m >> n, in a data structure that supports insertions, deletions, search \as efficiently as possible." Naive approach: use an array A of m elements, initially A[i] = 0 for all i, and when a key is inserted, set A[i] = 1. Insertion: O(1), Deletion: O(1), Search: O(1) Memory: O(m log(m)) (this is very bad!) 6 / 78 What is hashing? We want to store n elements (keys) from the set U = f0; 1;:::; m − 1g, where m >> n, in a data structure that supports insertions, deletions, search \as efficiently as possible." Naive approach: use an array A of m elements, initially A[i] = 0 for all i, and when a key is inserted, set A[i] = 1. Insertion: O(1), Deletion: O(1), Search: O(1) Memory: O(m log(m)) (this is very bad!) Want to also achieve optimal memory O(n log(m)). For this we will use a technique called hashing. A hash function is a function h : U ! [0; n − 1], where jUj = m >> n. A hash table is a data structure that consists of: a table T with n cells [0; n − 1], each cell storing O(log(m)) bits a hash function h : U ! [0; n − 1] From now on, we will define memory as # of cells. 7 / 78 Why is hashing useful? Designing efficient data structures (dictionaries) for searching Data streaming algorithms Derandomization Cryptography Complexity Theory many more 8 / 78 Will settle for: # collisions small with high probability. Ideally, want hash function to map different keys into different locations. Definition (Collision) We say that a collision happens for hash function h with inputs x; y 2 U if x 6= y and h(x) = h(y). By pigeonhole principle, impossible to achieve without knowing keys in advance. Challenges in Hashing Setup: Universe U = f0;:::; m − 1g of size m >> n where n is the size of the range of our hash function h : U ! [0; n − 1] Store O(n) elements of U (keys) in hash table T (which has n cells) 9 / 78 Will settle for: # collisions small with high probability. Challenges in Hashing Setup: Universe U = f0;:::; m − 1g of size m >> n where n is the size of the range of our hash function h : U ! [0; n − 1] Store O(n) elements of U (keys) in hash table T (which has n cells) Ideally, want hash function to map different keys into different locations. Definition (Collision) We say that a collision happens for hash function h with inputs x; y 2 U if x 6= y and h(x) = h(y). By pigeonhole principle, impossible to achieve without knowing keys in advance. 10 / 78 Challenges in Hashing Setup: Universe U = f0;:::; m − 1g of size m >> n where n is the size of the range of our hash function h : U ! [0; n − 1] Store O(n) elements of U (keys) in hash table T (which has n cells) Ideally, want hash function to map different keys into different locations. Definition (Collision) We say that a collision happens for hash function h with inputs x; y 2 U if x 6= y and h(x) = h(y). By pigeonhole principle, impossible to achieve without knowing keys in advance. Will settle for: # collisions small with high probability. 11 / 78 Simplest version to keep in mind: 1 Pr [h(x) = h(y)] ≤ 8x 6= y 2 U h2R H poly(n) Assumptions: keys are independent from hash function we choose. we do not know keys in advance (even if we did, nontrivial problem!) Question Still could have collisions. How do we handle them? Our solution: family of hash functions Construct family of hash functions H such that the number of collisions is small with high probability, when we pick hash function uniformly at random from the family H. 12 / 78 Assumptions: keys are independent from hash function we choose. we do not know keys in advance (even if we did, nontrivial problem!) Question Still could have collisions. How do we handle them? Our solution: family of hash functions Construct family of hash functions H such that the number of collisions is small with high probability, when we pick hash function uniformly at random from the family H. Simplest version to keep in mind: 1 Pr [h(x) = h(y)] ≤ 8x 6= y 2 U h2R H poly(n) 13 / 78 Our solution: family of hash functions Construct family of hash functions H such that the number of collisions is small with high probability, when we pick hash function uniformly at random from the family H. Simplest version to keep in mind: 1 Pr [h(x) = h(y)] ≤ 8x 6= y 2 U h2R H poly(n) Assumptions: keys are independent from hash function we choose. we do not know keys in advance (even if we did, nontrivial problem!) Question Still could have collisions. How do we handle them? 14 / 78 This setting is same as our balls-and-bins setting! So, if we have to store n keys: Expected number of keys in a location:1 maximum number of collisions (max load) in one particular location: O(log n= log log n) keys Solving collisions: store all keys hashed into location i by a linked list. Known as chain hashing. Could also pick two random hash functions and use power of two choices. Collision bound becomes O(log log n). Random Hash Functions? Natural to consider following approach: From all functions h : U ! [0; n − 1], just pick one uniformly at random. 15 / 78 So, if we have to store n keys: Expected number of keys in a location:1 maximum number of collisions (max load) in one particular location: O(log n= log log n) keys Solving collisions: store all keys hashed into location i by a linked list. Known as chain hashing. Could also pick two random hash functions and use power of two choices. Collision bound becomes O(log log n). Random Hash Functions? Natural to consider following approach: From all functions h : U ! [0; n − 1], just pick one uniformly at random. This setting is same as our balls-and-bins setting! 16 / 78 Solving collisions: store all keys hashed into location i by a linked list. Known as chain hashing. Could also pick two random hash functions and use power of two choices. Collision bound becomes O(log log n).