Double Hashing
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  SAHA: a String Adaptive Hash Table for Analytical Databasesapplied sciences Article SAHA: A String Adaptive Hash Table for Analytical Databases Tianqi Zheng 1,2,* , Zhibin Zhang 1 and Xueqi Cheng 1,2 1 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [email protected] (Z.Z.); [email protected] (X.C.) 2 University of Chinese Academy of Sciences, Beijing 100049, China * Correspondence: [email protected] Received: 3 February 2020; Accepted: 9 March 2020; Published: 11 March 2020 Abstract: Hash tables are the fundamental data structure for analytical database workloads, such as aggregation, joining, set filtering and records deduplication. The performance aspects of hash tables differ drastically with respect to what kind of data are being processed or how many inserts, lookups and deletes are constructed. In this paper, we address some common use cases of hash tables: aggregating and joining over arbitrary string data. We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data with the following advantages: (1) it inlines short strings and saves hash values for long strings only; (2) it uses special memory loading techniques to do quick dispatching and hashing computations; and (3) it utilizes vectorized processing to batch hashing operations. Our evaluation results reveal that SAHA outperforms state-of-the-art hash tables by one to five times in analytical workloads, including Google’s SwissTable and Facebook’s F14Table. It has been merged into the ClickHouse database and shows promising results in production. Keywords: hash table; analytical database; string data 1.
- 
												  Chapter 5 HashingIntroduction hashing performs basic operations, such as insertion, Chapter 5 deletion, and finds in average time Hashing 2 Hashing Hashing Functions a hash table is merely an of some fixed size let be the set of search keys hashing converts into locations in a hash hash functions map into the set of in the table hash table searching on the key becomes something like array lookup ideally, distributes over the slots of the hash table, to minimize collisions hashing is typically a many-to-one map: multiple keys are if we are hashing items, we want the number of items mapped to the same array index hashed to each location to be close to mapping multiple keys to the same position results in a example: Library of Congress Classification System ____________ that must be resolved hash function if we look at the first part of the call numbers two parts to hashing: (e.g., E470, PN1995) a hash function, which transforms keys into array indices collision resolution involves going to the stacks and a collision resolution procedure looking through the books almost all of CS is hashed to QA75 and QA76 (BAD) 3 4 Hashing Functions Hashing Functions suppose we are storing a set of nonnegative integers we can also use the hash function below for floating point given , we can obtain hash values between 0 and 1 with the numbers if we interpret the bits as an ______________ hash function two ways to do this in C, assuming long int and double when is divided by types have the same length fast operation, but we need to be careful when choosing first method uses
- 
												  Chapter 5. Hash TablesCSci 335 Software Design and Analysis 3 Prof. Stewart Weiss Chapter 5 Hashing and Hash Tables Hashing and Hash Tables 1 Introduction A hash table is a look-up table that, when designed well, has nearly O(1) average running time for a nd or insert operation. More precisely, a hash table is an array of xed size containing data items with unique keys, together with a function called a hash function that maps keys to indices in the table. For example, if the keys are integers and the hash table is an array of size 127, then the function hash(x), dened by hash(x) = x%127 maps numbers to their modulus in the nite eld of size 127. Key Space Hash Function Hash Table (values) 0 1 Japan 2 3 Rome 4 Canada 5 Tokyo 6 Ottawa Italy 7 Figure 1: A hash table for world capitals. Conceptually, a hash table is a much more general structure. It can be thought of as a table H containing a collection of (key, value) pairs with the property that H may be indexed by the key itself. In other words, whereas you usually reference an element of an array A by writing something like A[i], using an integer index value i, with a hash table, you replace the index value "i" by the key contained in location i in order to get the value associated with it. For example, we could conceptualize a hash table containing world capitals as a table named Capitals that contains the set of pairs (Italy, Rome), (Japan, Tokyo), (Canada, Ottawa) and so on, and conceptually (not actually) you could write a statement such as print Capitals[Italy] This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
- 
												  Lock-Free Hopscotch HashingLock-Free Hopscotch Hashing Robert Kelly∗ Barak A. Pearlmuttery Phil Maguirez Abstract ensure that reading the data structure is as cheap as In this paper we present a lock-free version of Hopscotch possible. The majority of operations on structures like Hashing. Hopscotch Hashing is an open addressing algo- hash tables are read operations, meaning that it pays rithm originally proposed by Herlihy, Shavit, and Tzafrir to optimise them for fast reads. [10], which is known for fast performance and excellent cache locality. The algorithm allows users of the table to skip or jump over irrelevant entries, allowing quick search, inser- tion, and removal of entries. Unlike traditional linear prob- ing, Hopscotch Hashing is capable of operating under a high load factor, as probe counts remain small. Our lock-free version improves on both speed, cache locality, and progress guarantees of the original, being a chimera of two concur- Figure 1: Table legend. The subscripts represent the rent hash tables. We compare our data structure to various optimal bucket index for an entry, the > represents other lock-free and blocking hashing algorithms and show a tombstone, and the ? represents an empty bucket. that its performance is in many cases superior to existing Bucket 1 contains the symbol for a tombstone, bucket 2 strategies. The proposed lock-free version overcomes some of contains an entry A which belongs at bucket 2, bucket the drawbacks associated with the original blocking version, 3 contains another entry B which also belongs at bucket leading to a substantial boost in scalability while maintain- 2, and bucket 4 is empty.
- 
												  SIMD Vectorized Hashing for Grouped AggregationSIMD Vectorized Hashing for Grouped Aggregation Bala Gurumurthy, David Broneske, Marcus Pinnecke, Gabriel Campero, and Gunter Saake Otto-von-Guericke-Universität, Magdeburg, Germany [email protected] Abstract. Grouped aggregation is a commonly used analytical func- tion. The common implementation of the function using hashing tech- niques suffers lower throughput rate due to the collision of the insert keys in the hashing techniques. During collision, the underlying tech- nique searches for an alternative location to insert keys. Searching an alternative location increases the processing time for an individual key thereby degrading the overall throughput. In this work, we use Single Instruction Multiple Data (SIMD) vectorization to search multiple slots at an instant followed by direct aggregation of results. We provide our experimental results of our vectorized grouped aggregation with various open-addressing hashing techniques using several dataset distributions and our inferences on them. Among our findings, we observe different impacts of vectorization on these techniques. Namely, linear probing and two-choice hashing improve their performance with vectorization, whereas cuckoo and hopscotch hashing show a negative impact. Overall, we provide in this work a basic structure of a dedicated SIMD acceler- ated grouped aggregation framework that can be adapted with different hashing techniques. Keywords: SIMD · hashing techniques · hash based grouping · grouped aggregation · direct aggregation · open addressing 1 Introduction Many analytical processing queries (e.g., OLAP) are directly related to the effi- ciency of their underlying functions. Some of these internal functions are time- consuming and even require the complete dataset to be processed before produc- ing the results. One of such compute-intensive internal functions is a grouped aggregation function that affects the overall throughput of the user query.
- 
											Script ( PDF , 8 MB )Main Memory DBMS G. Moerkotte February 4, 2019 2 Contents 1 Introduction9 2 Hardware 11 2.1 Memory................................ 11 2.1.1 Aligned vs. Unaligned Access................ 11 2.1.2 Address Translation (Virtual Memory)........... 11 2.1.3 Caches and the Memory Hierachie............. 15 2.1.4 Some Numbers of Some Processors............. 18 2.1.5 Prefetching.......................... 18 2.2 CPU.................................. 19 2.2.1 Pipelining........................... 19 2.2.2 Out-Of-Order-Execution................... 20 2.2.3 (Cost of) Branch (Mis-) Prediction............. 21 2.2.4 SIMD............................. 23 2.2.5 Simultaneous multithreading (SMT)............ 30 2.3 Cache Coherence........................... 30 2.4 Synchronization Primitives..................... 32 2.5 NUMA................................. 32 2.6 Performance Monitoring Unit.................... 34 2.7 References............................... 35 3 Operating System 37 4 Hash Tables 39 4.1 Hash Functions............................ 39 4.1.1 Why hash-functions matter................. 40 4.1.2 Properties........................... 42 4.1.3 Uniformity.......................... 42 4.1.4 Average-case Search Length................. 42 4.1.5 Expected Length of the Longest Probe Sequence (llps).. 42 4.1.6 Universality.......................... 43 4.1.7 k-Universal.......................... 46 4.1.8 (c; k)-Universal........................ 46 4.1.9 Dietzfelbinger: Universality without Primes........ 47 4.1.10 k-universal Hash Functions................. 47 4.1.11 Tabulation Hashing..................... 48 3 4 CONTENTS 4.1.12 Hashing string values.................... 48 4.2 Hash Table Organization....................... 49 4.2.1 Two versions of Chained Hashtable............. 49 4.2.2 Cuckoo-Hashing....................... 49 4.2.3 Robin-Hood-Hashing..................... 50 4.2.4 Hopscotch-Hashing...................... 50 5 Compression 51 6 Storage Layout 53 6.1 Row stores and Column Stores..................
- 
												  Implementation and Performance Analysis of Hash Functions and Collision Resolutions6.851: Advanced Data Structures Implementation and Performance Analysis of Hash Functions and Collision Resolutions Victor Jakubiuk Sanja Popovic Massachusetts Institute of Technology Spring 2012 Contents 1 Introduction 2 2 Theoretical background 2 2.1 Hash functions . 2 2.2 Collision resolution . 2 2.2.1 Linear and quadratic probing . 3 2.2.2 Cuckoo hashing . 3 2.2.3 Hopscotch hashing . 3 3 The Main Benchmark 4 4 Discussion 6 5 Conclusion 7 References 8 1 1 Introduction Hash functions are among the most useful functions in computer science as they map large sets of keys to smaller sets of hash values. Hash functions are most commonly used for hash tables, when a fast key-value look up is required. Because they are so ubiquitous, improving their performance has a potential for large impact. We implemented two hash functions (simple tabulation hashing and multiplication hash- ing), as well as four collision resolution methods (linear probing, quadratic probing, cuckoo hashing and hopscotch hashing). In total, we have eight combinations of hash functions and collision resolutions. They are compared against three most popular C++ hash libraries: unordered map structure in C++11 (GCC 4.4.5), unordered map in Boost library (version 1.42.0) and hash map in libstdc (GCC C++ extension library, GCC 4.4.5). We analyzed raw performance parameters: running time, memory consumption and cache behavior. We also did a comparison of the number of probes in the table depending on the load factor, as well as dependency of running time and load factor. 2 Theoretical background We begin with a brief theoretical overview of the hash functions and collisions resolution methods used.
- 
												  Robin Hood HashingCOP4530 Notes Summer 2019 TBD 1 Hashing Hashing is an old and time-honored method of accessing key/value pairs that have fairly random keys. The idea here is to create an array with a fixed size of N, and then “hash” the key to compute an index into the array. Ideally the hashing function should “spatter” the bits fairly evenly over the range N. 2 Hash functions As Weiss remarks, if the keys are just integers that are reasonably randomly distributed, simply using the mod function (key % N) (where N is prime) should work reasonably well. His example of a bad function would be one that has a table with 10 elements and keys that all end in 0, thus all computing to the same index of 0. For strings, Weiss points out that could just add the bytes in the string together into a sum and then again compute sum % N. This generalizes for any arbitrary key type; just add the actual bytes of the key in the same fashion. Another possibility is to use XOR rather the sum (though your XOR needs to be multibyte, with enough bytes to successfully get to the maximum range value.) You still have undesirable properties in the distribution of bits in most string coding schemes; ASCII leaves the high bit at zero, and UTF-8 unifor- mally sets the high bit for any code point outside of ASCII. This can be cured; C++ provides XOR-based hashing via FNV hashing as a standard option. Weiss gives this example for a good string hash function: 1 for(char c: key) 2 { 3 hash_value = 37 * hash_value+c; 4 } 5 return hash_value% table_size; 3 Collision resolution This is all well and good, but what happens when two dierent keys compute the same hash index? This is called a “collision” (and this term is used throughout the wider literature on hashing, and not just for the data structure hash table.) Resolving hashes, along with choosing hash functions, is among the primary concerns in designing hash tables.
- 
												  Practical Survey on Hash TablesPractical Survey on Hash Tables Aurelian Țuțuianu In memoriam Mihai Pătraşcu (17 July 1982 – 5 June 2012) “I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen.” Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf) Abstract • Hash table definition • Collision resolving schemas: – Chained hashing – Linear and quadratic probing – Cuckoo hashing • Some hash function theory • Simple tabulation hashing Omnipresence of hash tables • Symbol tables in compilers • Cache implementations • Database storages • Manage memory pages in Linux • Route tables • Large number of documents Hash Tables Considering a set of elements 푆 from a finite and much larger universe 푈. A hash table consists of: ℎ푎푠ℎ 푓푢푛푐푡표푛 ℎ: 푈 → {0, . , 푚 − 1} 푣푒푐푡표푟 푣 표푓 푠푧푒 푚 Collisions same hash for two different 26.17.41.60 keys 126.15.12.154 What to do? 202.223.224.33 • Ignore them 7.239.203.66 • Chain colliding values • Skip and try again 176.136.103.233 • Hash and displace • Find a perfect hash function War Story: cache with hash tables Problem: An application which application gets some data from an expensive repository. hash table Solution: Hash table with collision replacement. Key point: a big chunk of users “watched” a lot of common data. Data source Collision Resolution Schemas • Chained hashing • Open hash: linear and quadratic probing • Cuckoo hashing And many many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc. Chained Hashing Each slot contains a linked list. 0 푛 푂( ) = 푂(1) for all operations.
- 
												  Hashing: Part 2 332Adam BlankLecture 11 Winter 2017 CSE 332: Data Structures and Parallelism CSE Hashing: Part 2 332 Data Structures and Parallelism HashTable Review 1 Hashing Choices 2 Hash Tables 1 Choose a hash function Provides 1 core Dictionary operations (on average) 2 Choose a table size We call the key space the “universe”: U and the Hash Table T O( ) We should use this data structure only when we expect U T 3 Choose a collision resolution strategy (Or, the key space is non-integer values.) S S >> S S Separate Chaining Linear Probing Quadratic Probing Double Hashing hash function mod T collision? E int Table Index Fixed Table Index Other issues to consider: S S ÐÐÐÐÐÐÐ→Hash Table Client ÐÐÐÐ→ HashÐÐÐÐÐ→ Table Library 4 Choose an implementation of deletion ´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶ 5 Choose a λ that means the table is “too full” Another Consideration? What do we do when λ (the load factor) gets too large? We discussed the first few of these last time. We’ll discuss the rest today. Review: Collisions 3 Open Addressing 4 Definition (Collision) Definition (Open Addressing) A collision is when two distinct keys map to the same location in the Open Addressing is a type of collision resolution strategy that resolves hash table. collisions by choosing a different location when the natural choice is full. A good hash function attempts to avoid as many collisions as possible, There are many types of open addressing. Here’s the key ideas: but they are inevitable. We must be able to duplicate the path we took. How do we deal with collisions? We want to use all the spaces in the table.
- 
												  Fundamental Data StructuresFundamental Data Structures PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Thu, 17 Nov 2011 21:36:24 UTC Contents Articles Introduction 1 Abstract data type 1 Data structure 9 Analysis of algorithms 11 Amortized analysis 16 Accounting method 18 Potential method 20 Sequences 22 Array data type 22 Array data structure 26 Dynamic array 31 Linked list 34 Doubly linked list 50 Stack (abstract data type) 54 Queue (abstract data type) 82 Double-ended queue 85 Circular buffer 88 Dictionaries 103 Associative array 103 Association list 106 Hash table 107 Linear probing 120 Quadratic probing 121 Double hashing 125 Cuckoo hashing 126 Hopscotch hashing 130 Hash function 131 Perfect hash function 140 Universal hashing 141 K-independent hashing 146 Tabulation hashing 147 Cryptographic hash function 150 Sets 157 Set (abstract data type) 157 Bit array 161 Bloom filter 166 MinHash 176 Disjoint-set data structure 179 Partition refinement 183 Priority queues 185 Priority queue 185 Heap (data structure) 190 Binary heap 192 d-ary heap 198 Binomial heap 200 Fibonacci heap 205 Pairing heap 210 Double-ended priority queue 213 Soft heap 218 Successors and neighbors 221 Binary search algorithm 221 Binary search tree 228 Random binary tree 238 Tree rotation 241 Self-balancing binary search tree 244 Treap 246 AVL tree 249 Red–black tree 253 Scapegoat tree 268 Splay tree 272 Tango tree 286 Skip list 308 B-tree 314 B+ tree 325 Integer and string searching 330 Trie 330 Radix tree 337 Directed acyclic word graph 339 Suffix tree 341 Suffix array 346 van Emde Boas tree 349 Fusion tree 353 References Article Sources and Contributors 354 Image Sources, Licenses and Contributors 359 Article Licenses License 362 1 Introduction Abstract data type In computing, an abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior; or for certain data types of one or more programming languages that have similar semantics.
- 
												  SCALABLE GRAPH PROCESSING on RECONFIGURABLE SYSTEMS by ROBERT G. KIRCHGESSNER a DISSERTATION PRESENTED to the GRADUATE SCHOOL OFSCALABLE GRAPH PROCESSING ON RECONFIGURABLE SYSTEMS By ROBERT G. KIRCHGESSNER A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2016 © 2016 Robert G. Kirchgessner To my parents, Robert and Janette, and my wife Minjeong ACKNOWLEDGMENTS I would like to express my deepest gratitude to all those who have helped me down this long road towards completing my doctoral studies. I thank my advisor, Dr. Alan George, for his wisdom, guidance, and support both academically and personally throughout my graduate studies; and my co-advisor, Dr. Greg Stitt, whose invaluable academic insights helped shape my research. I thank Vitaliy Gleyzer, for his invaluable feedback, suggestions, and guidance, making me a better researcher, and MIT/LL for their support and resources which made this work possible. I would also like to thank my committee members, Dr. Herman Lam and Dr. Darin Acosta, for their important suggestions, advice, and feedback on my work. Additionally, I thank my friends and colleagues: Kenneth Hill, Bryant Lam, Abhijeet Lawande, Adam Lee, Barath Ramesh, and Gongyu Wang, who have always provided me with support, both in my research and personal life. Furthermore, I would like to thank my loving wife, without whom this would have not been possible, and my parents, who knew I could achieve this well before I knew it myself. Last but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022.