Similarity Search: Algorithms for Sets and Other High Dimensional Data
Thomas Dybdahl Ahle
Advisor: Rasmus Pagh
Submitted: March 2019

Abstract

We study five fundamental problems in the field of high dimensional algorithms, similarity search and machine learning. We obtain many new results, including:

• Data structures with Las Vegas guarantees for ℓ1-Approximate Near Neighbours and Braun Blanquet Approximate Set Similarity Search with performance matching state of the art Monte Carlo algorithms up to factors n^o(1).
• “Supermajorities”: the first space/time tradeoff for Approximate Set Similarity Search, along with a number of lower bounds suggesting that the algorithms are optimal among all Locality Sensitive Hashing (LSH) data structures.
• The first lower bounds on approximate data structures based on the Orthogonal Vectors Conjecture (OVC).
• Output Sensitive LSH data structures with query time near t + n^ρ for outputting t distinct elements, where n^ρ is the state of the art time to output a single element and t·n^ρ was the previous best for t items.
• A Johnson Lindenstrauss embedding with fast multiplication on tensors (a Tensor Sketch), with high probability guarantees for subspace- and near-neighbour-embeddings.

Resumé

We touch on five fundamental problems within high dimensional algorithms, similarity search and machine learning. We discover many new results, among them:

• Data structures with Las Vegas guarantees for ℓ1-Approximate Near Neighbour and Braun Blanquet Approximate Set Similarity Search that match state of the art Monte Carlo algorithms up to n^o(1) factors in time and space.
• “Supermajorities”: the first time/space trade-offs for Approximate Set Similarity Search, along with a number of lower bounds indicating that the algorithms are optimal among Locality Sensitive Hashing (LSH) data structures.
• The first lower bounds for approximate data structures based on the Orthogonal Vectors Conjecture (OVC).
• An Output Sensitive LSH data structure with query time of nearly t + n^ρ when it must return t distinct points. Here n^ρ is the time the state of the art algorithm takes to return a single element, and t·n^ρ the time it previously took to return t points.
• A Johnson Lindenstrauss embedding that allows fast multiplication on tensors (a Tensor Sketch). Unlike previous results, it is with high probability a subspace embedding.

Acknowledgements

Starting this thesis four years ago, I thought it would be all about coming up with cool new mathematics. It mostly was, but lately I have started to suspect that there may be more to it.

I want to start by thanking my superb Ph.D. advisor Rasmus Pagh. I always knew I wanted to learn from his conscientiousness and work-life balance. Working with him I have enjoyed his open office door, always ready to listen to ideas, be they good, bad or half baked. Even once he moved to California, he was still easy to reach and quick to give feedback on manuscripts or job searches. I may not have been the easiest of students – always testing the limits – but when I decided to shorten my studies to go co-found a startup, Rasmus was still supportive all the way through.

I would like to thank all my co-authors and collaborators, past, present and future. The papers I have written with you have always been more fun, and better written, than the ones I have done on my own. Academia doesn’t always feel like a team sport, but perhaps it should.
The entire Copenhagen algorithms community has been an important part of my Ph.D. life. Mikkel Thorup inspired us with memorable quotes such as “Only work on things that have a 1% chance of success” and “Always celebrate breakthroughs immediately, tomorrow they may not be.” Thore Husfeldt made sure we never forgot the ethical ramifications of our work, and Troels Lund reminded us that not everyone understands infinities. At ITU I want to thank Johan Sivertsen, Tobias Christiani and Matteo Dusefante, and the other pagan tribes, for our great office adventures. In the first year of my Ph.D. I would sometimes stay very long hours at the office and even slept there once.

A year into my Ph.D. I bought a house with 9 other people, one of the best decisions of my life. Throughout my work, the people at Gejst have been a constant source of support, discussions, sanity and introspection. With dinner waiting at home in the evening, I finally had a good reason to leave the office.

In 2017 I visited Eric Price and the algorithms group in Austin, Texas. He and everyone there were extremely hospitable and friendly. I want to thank John Kallaugher, Adrian Trejo Nuñez and everyone else there for making my stay enjoyable and exciting. In general, the international community around Theoretical Computer Science has been nothing but friendly and accepting. Particularly the Locality Sensitive Hashing group around MIT and Columbia University has been greatly inspiring, and I want to thank Ilya Razenshteyn for many illuminating discussions.

In Austin I also met Morgan Mingle, who became a beautiful influence on my life. So many of my greatest memories over the last two years revolve around her. She is also the only person I have never been able to interest in my research.[1]

Finally, I want to thank my parents, siblings and family for their unending support. Even as I have been absent-minded and moved between countries, they have always been there, visiting me and inviting me to occasions small as well as big.

[1] She did help edit this thesis though – like the acknowledgements.

Contents

1 Introduction
  1.1 Overview of Problems and Contributions
  1.2 Similarity Search
  1.3 Approximate Similarity Search
  1.4 Locality Sensitive Hashing
  1.5 Las Vegas Similarity Search
  1.6 Output Sensitive Similarity Search
  1.7 Hardness of Similarity Search through Orthogonal Vectors
  1.8 Tensor Sketching

2 Small Sets Need Supermajorities: Towards Optimal Hashing-based Set Similarity
  2.1 Introduction
  2.2 Preliminaries
  2.3 Upper bounds
  2.4 Lower bounds
  2.5 Conclusion
  2.6 Appendix

3 Optimal Las Vegas Locality Sensitive Data Structures
  3.1 Introduction
  3.2 Overview
  3.3 Hamming Space Data Structure
  3.4 Set Similarity Data Structure
  3.5 Conclusion and Open Problems
  3.6 Appendix
4 Parameter-free Locality Sensitive Hashing for Spherical Range Reporting
  4.1 Introduction
  4.2 Preliminaries
  4.3 Data Structure
  4.4 Standard LSH, Local Expansion, and Probing the Right Level
  4.5 Adaptive Query Algorithms
  4.6 A Probing Sequence in Hamming Space
  4.7 Conclusion
  4.8 Appendix

5 On the Complexity of Inner Product Similarity Join
  5.1 Introduction
  5.2 Hardness of IPS join
  5.3 Limitations of LSH for IPS
  5.4 Upper bounds
  5.5 Conclusion

6 High Probability Tensor Sketch
  6.1 Introduction
  6.2 Preliminaries
  6.3 Technical Overview
  6.4 The High Probability Tensor Sketch
  6.5 Fast Constructions
  6.6 Applications
  6.7 Appendix

Bibliography

Chapter 1

Introduction

  Right action is better than knowledge, but in order to do what is right, we must know what is right.
  – Charlemagne

As the power of computing has increased, so has the responsibility given to it by society. Machine Learning and algorithms are now asked to identify hate speech on Facebook, piracy on YouTube, and risky borrowers at banks. These tasks all require computations to be done on the meaning behind text, video and data, rather than simply the characters, pixels and bits. Oddly, the philosophically complex idea of “meaning” is now commonly represented in Computer Science simply as a vector in R^d. However, this still leaves the questions of what computations can be done, how, and using how many resources.

For example, say we want to find duplicate posts in Facebook’s reportedly 300 billion comments a year. A classic but efficient way to represent the meaning of documents such as a Facebook post is the “bag of words” approach. We create a vector in {0,1}^d, indicating for every word in the English language whether it is present in the document or not. We would then want to find two vectors with a large overlap, however such
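To make the “bag of words” idea concrete, here is a minimal sketch in Python. The helper names and the two example posts are illustrative inventions, and the text above does not commit to a specific overlap measure; Braun Blanquet similarity, |A ∩ B| / max(|A|, |B|), is used here because it is one of the set similarity measures studied in this thesis.

```python
def bag_of_words(document: str) -> set[str]:
    # A document becomes the set of words it contains: a sparse
    # representation of a vector in {0,1}^d with one coordinate per word.
    return set(document.lower().split())


def braun_blanquet(a: set[str], b: set[str]) -> float:
    # Overlap of two sets: |A ∩ B| / max(|A|, |B|), a value in [0, 1].
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


post1 = bag_of_words("happy birthday to my best friend")
post2 = bag_of_words("happy birthday to my very best friend")
print(braun_blanquet(post1, post2))  # 6/7 ~ 0.86: likely duplicates
```

Comparing every pair of posts this way takes time quadratic in the number of posts; the approximate, hashing-based techniques of the following chapters are aimed at avoiding exactly that brute-force comparison.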