Near-Optimal Space Perfect Hashing Algorithm
Total Page:16
File Type:pdf, Size:1020Kb
Near-Optimal Space Perfect Hashing Algorithm Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil IR Course CS Department, UFMG, February 19th, 2014 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 1 Objective of the Presentation Present a perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets: Key set fits in the internal memory Internal Random Access Memory algorithm ( RAM ) Key set larger than the internal memory External Memory (cache-aware) algorithm ( EM ) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 2 Objective of the Presentation Present a perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets: Key set fits in the internal memory Internal Random Access Memory algorithm ( RAM ) Key set larger than the internal memory External Memory (cache-aware) algorithm ( EM ) Theoretically well-founded, time efficient, higly scalable and near space-optimal LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 3 Perfect Hash Function Static key set S of size n 0 1 ... n -1 Hash Table Perfect Hash Function 0 1 ... m -1 S ⊆ U, where |U| = u LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 4 Minimal Perfect Hash Function Static key set S of size n 0 1 ... n -1 Minimal Perfect Hash Function Hash Table 0 1 ... n -1 S ⊆ U, where |U| = u LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 5 Where to use a PHF or a MPHF? Access items based on the value of a key is ubiquitous in Computer Science Work with huge static key sets: In data warehousing applications: On -Line Analytical Processing (OLAP) applications In Web search engines: Large vocabularies Map long URLs in smaller integer numbers that are used as IDs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 6 Crawling Pages Web Crawling Pages Parser URLs To Be Crawled Visited New URLs URLs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 7 Crawling Pages Web Crawling Pages Parser URLs To Be Crawled Visited New URLs URLs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 8 Representing Visited URLS MPHFs is the most compact way of representing the set of visited URLs Enable to keep much more URLs in main memory of each machine When the set of new URLs becomes large, a new MPHF is generated for the whole set of URLs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 9 Indexing Collection of Vocabulary Inverted List documents Term 1 Doc 1 Doc 5 ... Doc 1 Term 2 Doc 1 Doc 2 ... Doc 2 Term 3 Doc 3 Doc 4 ... Doc 3 Term 4 Doc 7 Doc 9 ... Indexing Doc 4 Term 5 Doc 6 Doc 10 ... Doc 5 Term 6 Doc 1 Doc 5 ... ... Term 7 Doc n Term 8 ... Term t Doc 9 Doc 11 ... LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 10 Representing the Vocabulary Vocabulary Inverted List Term 1 Doc 1 Doc 5 ... Term 2 Doc 1 Doc 2 ... Term 3 Doc 3 Doc 4 ... Term 4 Doc 7 Doc 9 ... Term 5 Doc 6 Doc 10 ... Term 6 Doc 1 Doc 5 ... Term 7 Term 8 ... Term t Doc 9 Doc 11 ... LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 11 Mapping URLs to Web Graph Vertices 0 Web Graph 1 Vertices URL 1 URLS URL 2 2 URL 3 URL 4 3 URL 5 4 URL 6 URL 7 5 ... URL n 6 .. n-1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 12 Mapping URLs to Web Graph Vertices 0 Web Graph 1 Vertices URL 1 URLS URL 2 2 URL 3 M 3 URL 4 P URL 5 H 4 URL 6 F URL 7 5 ... URL n 6 .. n-1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 13 Information Theoretical Lower Bounds for Storage Space n2 PHFs (m ≈ n): Storage Space ≥ log e m MPHFs (m = n ): Storage Space ≥ nlog e m < 3n log e ≈ .1 4427 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 14 Uniform Hashing Versus Universal Hashing Key universe Hash function Range M of size m U of size u LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 15 Uniform Hashing Versus Universal Hashing Key universe Hash function Range M of size m U of size u Uniform hashing # of functions from U to M? m u # of bits to encode each fucntion u log m Independent functions with values uniformly distributed LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 16 Uniform Hashing Versus Universal Hashing Key universe Hash function Range M of size m U of size u Uniform hashing Universal hashing # of functions from U to M? A family of hash functions HHH m u is universal if: for any pair of distinct # of bits to encode each fucntion keys (x 1, x 2) from U and u log m a hash function h chosen Independent functions with uniformly from HHH then: values uniformly distributed 1 Pr( h(x ) = h(x )) ≤ 1 2 m LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 17 Intuition Behind Universal Hashing We often lose relatively little compared to using a completely random map (uniform hashing) If S of size n is hashed to n2 buckets, with probability more than ½, no collisions occur Even with complete randomness, we do not expect little o( n2) buckets to suffice (the birthday paradox) So nothing is lost by using a universal family instead! LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 18 Related Work Theoretical Results (use uniform hashing) Practical Results (assume uniform hashing for free) Heuristics LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 19 Theoretical Results Work Gen. Time Eval. Time Size (bits) Mehlhorn (1984) Expon. Expon. O(n) Schmidt and Not analyzed O(1) O(n) Siegel (1990) Hagerup and O(n+log log u) O(1) O(n) Tholey (2001) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 20 Theoretical Results – Use Uniform Hashing Work Gen. Time Eval. Time Size (bits) Mehlhorn (1984) Expon. Expon. O(n) Schmidt & Not analyzed O(1) O(n) Siegel (1990) Hagerup & O(n+log log u) O(1) O(n) Tholey (2001) Theoretic EM O(n) O(1) O(n) (CIKM 2007) F.C. Botelho, R. Pagh N. Ziviani. “Practical Perfect Hashing Algorithm in Nearly Optimal Space”, Information Systems 38(1), 2013, 108-131. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 21 Practical Results – Assume Uniform Hashing For Free Work Gen. Time Eval. Time Size (bits) Czech, Havas & Majewski (1992) O(n) O(1) O(n log n) Majewski, Wormald, O(n) O(1) O(n log n) Havas & Czech (1996) Pagh (1999) O(n) O(1) O(n log n) Botelho, Kohayakawa, O(n) O(1) O(n log n) & Ziviani (2005) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 22 Practical Results – Assume Uniform Hashing For Free Work Gen. Time Eval. Time Size (bits) Czech, Havas & Majewski (1992) O(n) O(1) O(n log n) Majewski, Wormald, O(n) O(1) O(n log n) Havas & Czech (1996) Pagh (1999) O(n) O(1) O(n log n) Botelho, Kohayakawa, O(n) O(1) O(n log n) & Ziviani (2005) RAM O(n) O(1) O(n) (WADS 2007) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 23 Practical Results – Assume Uniform Hashing For Free Work Gen. Time Eval. Time Size (bits) Czech, Havas & Majewski (1992) O(n) O(1) O(n log n) Majewski, Wormald, O(n) O(1) O(n log n) Havas & Czech (1996) Pagh (1999) O(n) O(1) O(n log n) Botelho, Kohayakawa, O(n) O(1) O(n log n) & Ziviani (2005) RAM O(n) O(1) O(n) (WADS 2007) Heuristic EM O(n) O(1) O(n) (CIKM 2007) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 24 Empirical Results Gen. Eval. Size Work Application Time Time (bits) Fox, Chen & Heath Index data in Exp. O(1) O(n) (1992) CD-ROM Lefebvre & Hoppe Sparse O(n) O(1) O(n) (2006) spatial data Chang, Lin & Chou Not Data mining O(n) O(1) (2005, 2006) analyzed LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 25 Internal Random Access and External Memory Algorithms Near-optimal space Evaluation in constant time Function generation in linear time Simple to describe and implement Known algorithms with near -optimal space either: Require exponential time for construction and evaluation, or Use near-optimal space only asymptotically, for large n Acyclic random hypergraphs Used before by Majewski et all (1996): O(n log n) bits We proceed differently: O(n) bits (we changed space complexity, close to theoretical lower bound) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 26 Random Hypergraphs (r-graphs) 3-graph: 0 1 2 3 4 5 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 27 Random Hypergraphs (r-graphs) 3-graph: h0(jan) = 1 h 1(jan) = 3 h 2(jan) = 5 0 1 2 3 4 5 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 28 Random Hypergraphs (r-graphs) 3-graph: h0(jan) = 1 h 1(jan) = 3 h 2(jan) = 5 0 1 h0(feb) = 1 h 1(feb) = 2 h 2(feb) = 5 2 3 4 5 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 29 Random Hypergraphs (r-graphs) 3-graph: h0(jan) = 1 h 1(jan) = 3 h 2(jan) = 5 0 1 h0(feb) = 1 h 1(feb) = 2 h 2(feb) = 5 2 3 h0(mar) = 0 h 1(mar) = 3 h 2(mar) = 4 4 5 3-graph is induced by three uniform hash functions Our best result uses 3-graphs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 30 The MPHF Proposed by Czech, Havas e Majewski ..