CMU 15-445/645 Database Systems (Fall 2018) :: Hash Tables

CMU 15-445/645 Database Systems (Fall 2018) :: Hash Tables

Hash Tables Lecture #06 Database Systems Andy Pavlo 15-445/15-645 Computer Science Fall 2018 AP Carnegie Mellon Univ. 2 UPCOMING DATABASE EVENTS MapD Talk → Thursday Sept 20th @ 12:00pm → CIC 4th Floor CMU 15-445/645 (Fall 2018) 3 ADMINISTRIVIA Project #1 is due Wednesday Sept 26th @ 11:59pm Homework #2 is due Friday Sept 28th @ 11:59pm CMU 15-445/645 (Fall 2018) 4 REMINDER If you have a question during the lecture, raise your hand and stop me. Do not come up to the front after the lecture. There are no stupid questions(*). CMU 15-445/645 (Fall 2018) 5 COURSE STATUS We are now going to talk about how Query Planning to support the DBMS's execution engine to read/write data from pages. Operator Execution Access Methods Two types of data structures: → Hash Tables Buffer Pool Manager → Trees Disk Manager CMU 15-445/645 (Fall 2018) 6 DATA STRUCTURES Internal Meta-data Core Data Storage Temporary Data Structures Table Indexes CMU 15-445/645 (Fall 2018) 7 DESIGN DECISIONS Data Organization → How we layout data structure in memory/pages and what information to store to support efficient access. Concurrency → How to enable multiple threads to access the data structure at the same time without causing problems. CMU 15-445/645 (Fall 2018) 8 HASH TABLES A hash table implements an associative array abstract data type that maps keys to values. It uses a hash function to compute an offset into the array, from which the desired value can be found. CMU 15-445/645 (Fall 2018) 9 STATIC HASH TABLE Allocate a giant array that has one slot hash(key) for every element that you need to 0 abc record. 1 Ø 2 def To find an entry, mod the key by the ⋮ number of elements to find the offset in the array. n xyz CMU 15-445/645 (Fall 2018) 9 STATIC HASH TABLE Allocate a giant array that has one slot hash(key) for every element that you need to 0 record. 1 abcdefghi 2 xyz123 To find an entry, mod the key by the ⋮ number of elements to find the offset defghijk in the array. n CMU 15-445/645 (Fall 2018) 10 ASSUMPTIONS You know the number of elements hash(key) ahead of time. 0 1 abcdefghi Each key is unique. 2 xyz123 Perfect hash function. ⋮ defghijk → If key1≠key2, then n hash(key1)≠hash(key2) CMU 15-445/645 (Fall 2018) 11 HASH TABLE Design Decision #1: Hash Function → How to map a large key space into a smaller domain. → Trade-off between being fast vs. collision rate. Design Decision #2: Hashing Scheme → How to handle key collisions after hashing. → Trade-off between allocating a large hash table vs. additional instructions to find/insert keys. CMU 15-445/645 (Fall 2018) 12 TODAY'S AGENDA Hash Functions Static Hashing Schemes Dynamic Hashing Schemes CMU 15-445/645 (Fall 2018) 13 HASH FUNCTIONS We don’t want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate. CMU 15-445/645 (Fall 2018) 14 HASH FUNCTIONS MurmurHash (2008) → Designed to a fast, general purpose hash function. Google CityHash (2011) → Based on ideas from MurmurHash2 → Designed to be faster for short keys (<64 bytes). Google FarmHash (2014) → Newer version of CityHash with better collision rates. CLHash (2016) → Fast hashing function based on carry-less multiplication. CMU 15-445/645 (Fall 2018) 15 HASH FUNCTION BENCHMARKS Intel Core i7-8700K @ 3.70GHz std::hash MurmurHash3 CityHash FarmHash CLHash 18000 32 64 192 128 12000 6000 Throughput (MB/sec) Throughput 0 1 51 101 151 201 251 Key Size (bytes) Source: Fredrik Widlund CMU 15-445/645 (Fall 2018) 16 HASH FUNCTION BENCHMARKS Intel Core i7-8700K @ 3.70GHz std::hash MurmurHash3 CityHash FarmHash CLHash 36000 192 128 24000 64 32 12000 Throughput (MB/sec) Throughput 0 1 51 101 151 201 251 Key Size (bytes) Source: Fredrik Widlund CMU 15-445/645 (Fall 2018) 17 STATIC HASHING SCHEMES Approach #1: Linear Probe Hashing Approach #2: Robin Hood Hashing Approach #3: Cuckoo Hashing CMU 15-445/645 (Fall 2018) 18 LINEAR PROBE HASHING Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table. → To determine whether an element is present, hash to a location in the index and scan for it. → Have to store the key in the index to know when to stop scanning. → Insertions and deletions are generalizations of lookups. CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) A B A | val C <key>|<value> D E F CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) B | val A B A | val C D E F CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) B | val A B A | val C D C | val E F CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) B | val A B A | val C D C | val E D | val F CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) B | val A B A | val C D C | val E D | val F E | val CMU 15-445/645 (Fall 2018) 19 LINEAR PROBE HASHING hash(key) B | val A B A | val C D C | val E D | val F E | val F | val CMU 15-445/645 (Fall 2018) 20 NON-UNIQUE KEYS Choice #1: Separate Linked List → Store values in separate storage area for each key. CMU 15-445/645 (Fall 2018) 20 NON-UNIQUE KEYS Value Lists Choice #1: Separate Linked List XYZ value1 value2 → Store values in separate storage area for ABC each key. value3 value1 value2 Choice #2: Redundant Keys → Store duplicate keys entries together in the hash table. XYZ|value1 ABC|value1 XYZ|value2 XYZ|value3 ABC|value2 CMU 15-445/645 (Fall 2018) 21 OBSERVATION To reduce the # of wasteful comparisons, it is important to avoid collisions of hashed keys. This requires a hash table with ~2x the number of slots as the number of elements. CMU 15-445/645 (Fall 2018) 22 ROBIN HOOD HASHING Variant of linear probe hashing that steals slots from "rich" keys and give them to "poor" keys. → Each key tracks the number of positions they are from where its optimal position in the table. → On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key. CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) A B A | val [0] # of "Jumps" From First Position C D E F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] C D E F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] A[0] == C[0] C D C | val [1] E F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] C D C | val [1] C[1] > D[0] E D | val [1] F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] A[0] == E[0] C D C | val [1] E D | val [1] F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] A[0] == E[0] C D C | val [1] C[1] == E[1] E D | val [1] F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] A[0] == E[0] C D C | val [1] C[1] == E[1] E D | val [1] D[1] < E[2] F CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] A[0] == E[0] C D C | val [1] C[1] == E[1] E E | val [2] D[1] < E[2] F D | val [2] CMU 15-445/645 (Fall 2018) 23 ROBIN HOOD HASHING hash(key) B | val [0] A B A | val [0] C D C | val [1] E E | val [2] F D | val [2] D[2] > F[0] F | val [1] CMU 15-445/645 (Fall 2018) 24 CUCKOO HASHING Use multiple hash tables with different hash functions. → On insert, check every table and pick anyone that has a free slot. → If no table has a free slot, evict the element from one of them and then re-hash it find a new location. Look-ups and deletions are always O(1) because only one location per hash table is checked. CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) ⋮ ⋮ CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) A|val ⋮ ⋮ CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) A|val Insert B hash1(B) hash2(B) ⋮ ⋮ CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) B|val A|val Insert B hash1(B) hash2(B) ⋮ ⋮ CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) B|val A|val Insert B hash1(B) hash2(B) ⋮ ⋮ Insert C hash1(C) hash2(C) CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) C|val A|val Insert B hash1(B) hash2(B) ⋮ ⋮ Insert C hash1(C) hash2(C) CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) C|val A|val Insert B hash1(B) hash2(B) ⋮ ⋮ Insert C hash1(C) hash2(C) hash1(B) CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) C|val B|val Insert B hash1(B) hash2(B) ⋮ ⋮ Insert C hash1(C) hash2(C) hash1(B) CMU 15-445/645 (Fall 2018) 25 CUCKOO HASHING Hash Table #1 Hash Table #2 Insert A hash1(A) hash2(A) C|val B|val Insert B hash (B) hash (B) 1 2 A|val ⋮ ⋮ Insert C hash1(C) hash2(C) hash1(B) hash2(A) CMU 15-445/645 (Fall 2018) 26 CUCKOO HASHING Make sure that we don’t get stuck in an infinite loop when moving keys.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    78 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us