Nearest Neighbor Search: the Old, the New, and the Impossible

by Alexandr Andoni

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, September 2009

© Massachusetts Institute of Technology 2009. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, September 4, 2009

Certified by: Piotr Indyk, Associate Professor, Thesis Supervisor

Accepted by: Terry P. Orlando, Chairman, Department Committee on Graduate Students

Abstract

Over the last decade, an immense amount of data has become available. From collections of photos, to genetic data, and to network traffic statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we effectively use all this data? The ever-growing sizes of the datasets make it imperative to design new algorithms capable of sifting through this data with extreme efficiency.

A fundamental computational primitive for dealing with massive datasets is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so that later, given a query object, one can efficiently find the data object most similar to the query. This problem has a broad set of applications in data processing and analysis.
For instance, it forms the basis of a widely used classification method in machine learning: to give a label for a new object, find the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, finding duplicate files and web pages, vector quantization, and many others.

To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector, with one coordinate per pixel, whereas the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied.

In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with massive datasets. Our contribution is two-fold: we significantly improve the algorithms within the classical approaches to NN, as well as propose new approaches where the classical ones fail. We focus on several key distances and similarity measures, including the Euclidean distance, string edit distance and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of the NN algorithms.

The high-level structure of our thesis is summarized as follows.

New algorithms via the classical approaches. We give a new algorithm for the approximate NN problem in the d-dimensional Euclidean space. For an approximation factor $c > 1$, our algorithm achieves $dn^{\rho}$ query time and $dn^{1+\rho}$ space for $\rho = 1/c^2 + o(1)$. This greatly improves on the previous algorithms that achieved $\rho$ that was only slightly smaller than $1/c$. The same technique also yields an algorithm with $dn^{O(\rho)}$ query time and space near-linear in n. Furthermore, our algorithm is near-optimal in the class of "hashing" algorithms.
Failure of the classical approaches for some hard distances. We give evidence that the classical approaches to NN under certain hard distances, such as the string edit distance, meet a concrete barrier at a nearly logarithmic approximation. Specifically, we show that for all classical approaches to NN under the edit distance, involving embeddings into a general class of spaces (such as $\ell_1$, powers of $\ell_2$, etc.), the resulting approximation has to be at least near-logarithmic in the strings' length.

A new approach to NN under hard distances. Motivated by the above impossibility results, we develop a new approach to the NN problem, where the classical approaches fail. Using this approach, we give a new efficient NN algorithm for a variant of the edit distance, the Ulam distance, which achieves a double-logarithmic approximation. This is an exponential improvement over the lower bound on the approximation achievable via the previous classical approaches to this problem.

Data structure lower bounds. To complement our algorithms, we prove lower bounds on NN data structures for the Euclidean distance and for the mysterious but important case of the $\ell_\infty$ distance. In both cases, our lower bounds are the first ones to hold in the same computational model as the respective upper bounds. Furthermore, for both problems, our lower bounds are optimal in the considered models.

External applications. Although our main focus is on the NN problem, our techniques naturally extend to related problems. We give such applications for each of our algorithmic tools. For example, we give an algorithm for computing the edit distance between two strings of length d in near-linear time. Our algorithm achieves approximation $2^{\tilde{O}(\sqrt{\log d})}$, improving over the previous bound of $d^{1/3+o(1)}$. We note that this problem has a classical exact algorithm based on dynamic programming, running in quadratic time.
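For reference, the classical quadratic-time dynamic program mentioned above can be sketched as follows. This is a minimal textbook illustration, not code from the thesis; the function name `edit_distance` is chosen here for illustration, and unit costs for insertion, deletion, and substitution are assumed.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact edit (Levenshtein) distance with unit costs, in O(len(a) * len(b)) time."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b; row 0 is all insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into the empty string costs i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute ca by cb, or match for free
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so the sketch uses space linear in the length of the second string while the running time stays quadratic.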
Thesis Supervisor: Piotr Indyk
Title: Associate Professor

Acknowledgments

For the existence of this thesis, I owe a huge "Thank you" to many people. Hardest of all is to express the entire extent of my gratitude to Piotr Indyk, my adviser. He combined all the right ingredients - including incessant support, genuine interest, and humor - to make my experience most enjoyable and complete. Reflecting on the past years, I am only impressed by how he has always seemed to know precisely what to do in order to keep me captivated and enthusiastic about research. Thanks to his effort, not only have I reached my personal aspirations, but I have seen what lies beyond them.

I met Piotr in 2002, when I was an undergraduate student at MIT. In the summer of that year, he gave me a cute problem: prove a lower bound for embedding edit distance into normed spaces. At that time, I would only understand the "edit distance" part of the problem formulation, but that was enough for me. It was easy to like the problem: it seemed important with tangible motivation, and admitted some partial-progress solution that was non-trivial but still within reach. The project worked out well: in several weeks we had a result, later accepted to SODA. Moreover, 4.5 years later, I returned to the problem, and an improved bound is now part of the present thesis. This story is a perfect illustration of Piotr's exemplary advising style: get me interested with just the right combination of motivation, projection of personal enthusiasm, challenge, collaboration, and then let me explore. For putting in all this tremendous effort and more, I will always remain in debt to him.

I would also like to thank Robi Krauthgamer and Ronitt Rubinfeld, who, in many ways, have been my informal advisers. I first spoke to Robi at SODA'06, when I asked him about an internship opportunity at IBM Almaden. While he encouraged me to apply, he also mentioned that it is very competitive to obtain the internship.
I was extremely lucky that it wasn't too competitive, and that we started our years-long collaboration that summer. Many of the results from this thesis have been developed in collaboration with him. I thank Robi for hosting me at IBM Almaden and Weizmann Institute, for being a brilliant and responsible collaborator, and, last but not least, for being a great friend.

Ronitt has influenced me in many ways through advising on matters related to research, and academic life in general. I have learned from and enjoyed all my interactions with her, including collaborating on the two joint papers, co-organizing the theory colloquium, and teaching a class. I very much appreciated her personal, warm style which has always been refreshing, and brought me back from the steel world of razor-thin process of matching the upper and lower bounds to the world of human researchers doing what they like most, namely discovering exciting new facts (which could very well be the delightfully matching upper and lower bounds!).

My summer internships at IBM Almaden and MSR SVC were of immense importance in my formation as a researcher. They provided the much desired "non-MIT" research exposure, and, above all, an opportunity to meet and collaborate with some outstanding people in a stellar environment. For this, I thank the IBM Almaden group, with special thanks to Ken Clarkson, Jayram, and Ron Fagin, as well as the MSR SVC group, with special thanks to Rina Panigrahy, Udi Wieder, Kunal Talwar, and Cynthia Dwork.

My research has been greatly influenced by my collaborators: Anupam, Avinatan, Costis, David, Dorian, Huy, Jayram, Jessica, Kevin, Khanh, Krzysztof, Mayur, Michel, Mihai, Nicole, Ning, Noga, Piotr, Robi, Ronitt, Sebastien, Sofya, Tali, and Vahab. Thank you all, I have learned a lot from you. To many I am thankful for also being much more than "just" collaborators.
A number of my friends have been the spine of my entire graduate student life (of both the "graduate student" and the "life" components). Without them I do not know how I could have fared this far: their optimism, enthusiasm, and support have been an inspiration. My old friend, Mihai, has always been great to talk to, on academic and non-academic topics, and party with (mostly on non-academic topics), which made him a superb office-mate and apartment-mate.
