Near-Optimal Space Perfect Hashing Algorithms

Fabiano Cupertino Botelho
Supervisor: Nivio Ziviani

PhD dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais as a partial requirement to obtain the PhD degree in Computer Science.

Belo Horizonte
September 29, 2008

To my dear wife Janaína.
To my dear parents Maria Lúcia and José Vítor.
To my dear sisters Gleiciane and Cristiane.

Acknowledgements

To God for having granted me life and wisdom to realize a childhood dream and for the great help in difficult moments.

To my dear wife Janaína Marcon Machado Botelho for her love, for her understanding on the many occasions when I could not give her the attention she deserves, and for her companionship and encouragement during moments in which I wanted to give up everything. Jana, thank you for sharing your life with me and the victories won during the entire doctorate. With the grace of God in our lives we will continue to be very happy.

To my dear parents Maria Lúcia de Lima Botelho and José Vitor Botelho for the sacrifices made in the past that supported this achievement.

To my dear sisters Cristiane Cupertino Botelho and Gleiciane Cupertino Botelho for the love of the two best sisters in the world.

To my dear aunt Márcia Novaes Alves and my dear uncle Sudário Alves for always welcoming me with affection and giving me much support throughout my doctorate.

To Prof. Nivio Ziviani for the excellent supervision and for being an example of professionalism and dedication to work. His extensive experience in academic research, particularly in the areas of information retrieval and algorithms, has been of extreme importance to this work. In addition, his support, attention and encouragement were of great importance not only for completing the doctorate, but also for my academic and professional life.

To Prof.
Rasmus Pagh, from whom I have learned a lot about techniques for designing and analyzing hashing algorithms, and whose participation was crucial to this thesis.

To Prof. Yoshiharu Kohayakawa for the attention dedicated to the discussions that helped improve the quality of this work. Thanks also for receiving me at the Institute of Mathematics and Statistics at the University of São Paulo and for all the support given to my work during the time I spent in São Paulo.

To Prof. Edleno Silva de Moura for trusting me and for always encouraging me. Thanks also for receiving me at the Department of Computer Science at the Federal University of Amazonas during the time I spent in Manaus.

To the other professors who evaluated this thesis, namely Gaston Gonnet, Antônio Alfredo Loureiro, Wagner Meira Jr. and Jayme Luiz Szwarcfiter, for having agreed to participate in the PhD defense and for the relevant criticisms and suggestions.

To Djamal Belazzougui for the intelligent suggestions and contributions made to this thesis and to the CMPH library.

To Davi Reis for having conceived the idea of the CMPH library, which was fundamental to disseminating the results obtained in this thesis.

To my colleague and friend Marco Antônio Pinheiro de Cristo for the fun moments we spent together during our English classes and for always encouraging me.

To my colleague and friend Thierson Couto for his friendship and for always being ready to cooperate.

To my colleague and friend David Menotti for the discussions, suggestions and criticisms that contributed much at the beginning of this work.

To my colleague and friend David Fernandes for having hosted me in his home during the time I spent in Manaus and for his endless friendship.

To my colleagues and friends of our great and unforgettable soccer team Curucu and their wives for the friendship built during the period we spent together.
Thanks to Pedro Neto, Maurício Figueiredo, Eduardo Freire Nakamura, Ruiter Caldas, André Lins, José Pinheiro, Guillermo Camara Chavez, Martin Gomez Ravetti, David Patricio Viscarra del Pozo and David Menotti for the amazing and fun moments that served to relieve the stress of this difficult period of the doctorate.

To colleagues and friends from our undergraduate years who, through the mailing list intrigas99, always supported me, whether near or far. I am also thankful for all the good laughs I had when reading some of the posts on the list, which certainly helped a lot to ease the tension in difficult times.

To my colleagues and friends of the Laboratory for Treating Information (LATIN), Anísio Mendes Lacerda, Álvaro Pereira Jr., Charles Ornelas Almeida, Claudine Santos Badue, Daniel Galinkin, Denilson Pereira, Guilherme Vale Menezes, Hendrickson R. Langbehn, Humberto Mossri, Marco Antônio Pinheiro de Cristo, Marco Aurélio Barreto Modesto, Pável Calado and Wladmir Cardoso Brandão, for the criticism and suggestions provided during the defense preparation and for the climate of friendship we have established within LATIN.

To the professors and employees of the Department of Computer Science at the Federal University of Minas Gerais who contributed in various ways to the completion of this work.

To the professors and employees of the Department of Computer Engineering at the Federal Center for Technological Education of Minas Gerais for having welcomed me so warmly and respectfully into the department team.

To CAPES (Coordination of Improvement of Higher Education) and CNPq (National Council for Scientific and Technological Development) for the scholarships granted, which funded the time dedicated to this thesis.

Abstract

A perfect hash function (PHF) h : S → [0, m − 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values.
A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory-efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, uniform resource locators (URLs) in web search engines, or item sets in data mining techniques. In this thesis we present a simple, highly scalable and near space-optimal perfect hashing algorithm. Evaluation of a PHF on a given element of S requires constant time, and the dominating phase in the construction algorithm consists of sorting n fingerprints of O(log n) bits in O(n) time. The space usage depends on the relation between m and n. For m = n the space usage is in the range 2.62n to 3.3n bits, depending on the constants involved in the construction and in the evaluation phases. For m = 1.23n the space usage is in the range 1.95n to 2.7n bits. In all cases, this is within a small constant factor of the information-theoretic minimum of approximately 1.44n bits for MPHFs and 0.89n bits for PHFs with m = 1.23n, something that has not been achieved by previous algorithms, except asymptotically for very large n. This small space usage opens up the use of MPHFs to applications for which they were not useful in the past.

We demonstrate the scalability of our algorithm by constructing an MPHF for a set of 1.024 billion URLs from the World Wide Web, with an average length of 64 characters, in approximately 50 minutes on a commodity PC. We also present a distributed and parallel implementation of the algorithm, which generates an MPHF for the same URL set on a 14-computer cluster in approximately 4 minutes, achieving an almost linear speedup. Also, for 14.336 billion 16-byte random integers distributed among the 14 participating machines, the algorithm outputs an MPHF in approximately 50 minutes, with a performance degradation of 20%.
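The definitions and the space figures quoted in the abstract can be checked with a short sketch. The code below is only an illustration, not the construction algorithm of the thesis: `is_perfect_hash`, `lower_bound_bits_per_key`, `toy_hash` and the three-word key set are hypothetical names and data introduced here. The bound uses the standard asymptotic estimate n log2(e) (1 + (c − 1) ln(1 − 1/c)) bits for a range m = cn, under the assumption that the key universe is much larger than m; it evaluates to roughly 1.44 bits per key for c = 1 and 0.89 bits per key for c = 1.23.

```python
import math
from typing import Callable, Hashable, Iterable


def is_perfect_hash(h: Callable[[Hashable], int],
                    keys: Iterable[Hashable], m: int) -> bool:
    """Return True if h maps every key to a distinct value in [0, m - 1]."""
    seen = set()
    for key in keys:
        value = h(key)
        if not 0 <= value < m or value in seen:
            return False  # out of range or collision: not a PHF
        seen.add(value)
    return True


def lower_bound_bits_per_key(c: float) -> float:
    """Asymptotic bits per key needed by any PHF with range m = c * n.

    Assumption: the key universe is much larger than m.
    c = 1 corresponds to the minimal (MPHF) case.
    """
    if c < 1.0:
        raise ValueError("the range must be at least as large as the key set")
    log2_e = math.log2(math.e)
    if c == 1.0:
        return log2_e
    return log2_e * (1.0 + (c - 1.0) * math.log(1.0 - 1.0 / c))


def toy_hash(key: str) -> int:
    """Hypothetical hash: "if" -> 0, "else" -> 2, "while" -> 3."""
    return len(key) - 2


keys = ["if", "else", "while"]
print(is_perfect_hash(toy_hash, keys, m=4))  # injective into [0, 3]: a PHF
print(is_perfect_hash(toy_hash, keys, m=3))  # value 3 falls outside [0, 2]

# Compare the space ranges reported in the abstract with the bound.
for c, low, high in [(1.0, 2.62, 3.3), (1.23, 1.95, 2.7)]:
    bound = lower_bound_bits_per_key(c)
    print(f"m = {c}n: bound {bound:.2f}n bits, reported range is "
          f"{low / bound:.2f}x to {high / bound:.2f}x the bound")
```

In this toy case the function is perfect for m = 4 but not minimal, since a minimal function would need m = n = 3. The final loop shows how far the reported space ranges sit above the respective lower bounds.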
Resumo (English translation)

A perfect hash function (PHF) h : U → [0, m − 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for efficient storage and fast retrieval of items from static sets, such as words in natural language, reserved words in programming languages or interactive systems, URLs (uniform resource locators) in search engines, or item sets in data mining techniques. In this thesis we present a highly scalable perfect hashing algorithm with near-optimal space usage. Evaluating a PHF on a given element of S requires constant time, and the dominant phase of the construction algorithm consists of sorting n fingerprints of O(log n) bits in O(n) time. The space usage depends on the relation between m and n. For m = n the space usage is in the range 2.62n to 3.3n bits, depending on the constants involved in the construction and evaluation phases. For m = 1.23n the space usage is in the range 1.95n to 2.7n bits. In all cases, this is within a small constant factor of the theoretical minimum of approximately 1.44n bits for MPHFs and 0.89n bits for PHFs, something that has not been achieved by previous algorithms, except asymptotically for very large values of n.
