Tractable Algorithms for Proximity Search on Large Graphs
Total Page:16
File Type:pdf, Size:1020Kb
Tractable Algorithms for Proximity Search on Large Graphs Purnamrita Sarkar July 2010 CMU-ML-10-107 Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee: Andrew W. Moore, Chair Geoffrey J. Gordon Anupam Gupta Jon Kleinberg, Cornell University Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright c 2010 Purnamrita Sarkar This research was sponsored by the National Science Foundation under grant numbers AST-0312498, IIS- 0325581, and IIS-0911032, the Defense Advanced Research Projects Agency under grant number S518700 and EELD grant number F30602-01-2-0569, the U.S. Department of Agriculture under grant numbers 53-3A94-03-11 and TIRNO-06-D-00020, the Department of Energy under grant numbers W7405ENG48 and DE-AC52-07NA27344, the Air Force Research Laboratory under grant numbers FA865005C7264 and FA875006C0216, and the National Institute of Health under grant number 1R01PH000028. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. Keywords: random walks, proximity measures, graphs, nearest neighbors, link pre- diction, algorithms To, Ramaprasad Das, my grandfather, who is behind everything I am, or ever will be. iv Abstract Identifying the nearest neighbors of a node in a graph is a key ingre- dient in a diverse set of ranking problems, e.g. friend suggestion in social networks, keyword search in databases, web-spam detection etc. For find- ing these “near” neighbors, we need graph theoretic measures of similarity or proximity. Most popular graph-based similarity measures, e.g. length of short- est path, the number of common neighbors etc., look at the paths between two nodes in a graph. One such class of similarity measures arise from random walks. In the context of using these measures, we identify and address two im- portant problems. First, we note that, while random walk based measures are useful, they are often hard to compute. Hence we focus on designing tractable algorithms for faster and better ranking using random walk based proximity measures in large graphs. Second, we theoretically justify why path-based similarity measures work so well in practice. For the first problem, we focus on improving the quality and speed of near- est neighbor search in real-world graphs. This work consists of three main components: first we present an algorithmic framework for computing near- est neighbors in truncated hitting and commute times, which are proximity measures based on short term random walks. Second, we improve upon this ranking by incorporating user feedback, which can counteract ambiguities in queries and data. Third, we address the problem of nearest neighbor search when the underlying graph is too large to fit in main memory. We also prove a number of interesting theoretical properties of these measures, which have been key to designing most of the algorithms in this thesis. We address the second problem by bringing together a well known gen- erative model for link formation, and geometric intuitions. As a measure of the quality of ranking, we examine link prediction, which has been the pri- mary tool for evaluating the algorithms in this thesis. Link prediction has been extensively studied in prior empirical surveys. Our work helps us better understand some common trends in the predictive performance of different measures seen across these empirical results. vi Acknowledgements It is almost impossible to express my gratitude to Andrew Moore, my advisor, in a couple of words. To him, I owe many things. Through his brilliant insights, infectious enthusiasm, and gentle guidance he has helped me formulate the problems of this thesis, and solve them. He has been like a father figure for me through all these years. Besides having time for brainstorming on new ideas and exciting algorithms, he has also lent a patient and affectionate ear to my personal problems. I have learnt many things about research and life from him. I have learnt that, when it comes to dealing with issues with people, the best way is to be honest, direct and sincere. And above all, I have learnt that, it is not the success and fame, but “loving the science” that brings us to work every morning. This thesis was shaped by the insightful suggestions, and perceptive comments of my thesis committee members. In fact, the last chapter of this thesis was mostly written be- cause of the great questions they had brought up during my thesis proposal. Anupam Gupta and Geoffrey Gordon have always had open doors for all my questions, doubts, and half-formed ideas; I am indeed very thankful to them. I am grateful to Jon Kleinberg, for always finding time to talk to me about my research at conferences, or during his brief visits to Carnegie Mellon. He has also been extremely supportive and helpful during the stressful times of job search. I am very grateful to many of the faculty members at Carnegie Mellon. Gary Miller’s lectures and discussions on Spectral Graph Theory and Scientific Computing have been a key to my understanding of random walks on graphs. Drew Bagnell and Jeff Schneider have given valuable suggestions related to my research and pointed me to relevant litera- ture. Artur Dubrawski has been very supportive of my research, and has always inspired me to apply my ideas to completely new areas. I will like to thank Carlos Guestrin, who has helped me immensely in times of need. Tom Mitchell and John Lafferty have helped me through some difficult times at CMU; I am grateful to them. I would like to acknowl- edge Christos Faloutsos, Steve Fienberg and Danny Sleator for their help and guidance during my years at CMU. I will also like to thank Charles Isbell at Georgia Tech; interning with him as an undergraduate gave me my first taste of ideas from machine learning and statistics, and so far it has been the most enjoyable and fulfilling experience. During my stay at Carnegie Mellon, I have made some great friends and life would not vii have been the same without them. I have had the pleasure to know Shobha Venkataraman, Sajid Siddiqi, Sanjeev Koppal, Stano Funiak, Andreas Zollmann, Viji Manoharan, Jeremy Kubica, Jon Huang, Lucia Castellanos, Khalid El-Arini, Ajit Singh, Anna Goldenberg, Andy Schlaikjer, Alice Zheng, Arthur Gretton, Andreas Krause, Hetunandan and Ram Ravichandran. They have looked out for me in difficult times; brainstormed about new exciting ideas; read my papers at odd hours before pressing deadlines; and last, but not least, helped me during job search. To my officemates Indrayana Rustandi, and Shing- Hon Lau, it was a pleasure to share an office, which led to the sharing of small wins and losses from everyday. Thanks to the current and past members of the Auton Lab for their friendship, and good advice. I am grateful to Diane Stidle, Cathy Serventi and Michele Kiriloff for helping me out with administrative titbits, and above all being so kind, friendly and patient. Finally, I would like to thank my family, my parents Somenath and Sandipana Sarkar, and my little brother Anjishnu for their unconditional love, endless support and unwaver- ing faith in me. To Deepayan, my husband, thanks for being a wonderful friend and a great colleague, and for being supportive of my inherent stubbornness. viii Contents 1 Introduction 1 2 Background and Related Work 9 2.1 Introduction . .9 2.2 Random Walks on Graphs: Background . 10 2.2.1 Random Walk based Proximity Measures . 11 2.2.2 Other Graph-based Proximity Measures . 17 2.2.3 Graph-theoretic Measures for Semi-supervised Learning . 18 2.2.4 Clustering with Random Walk Based Measures . 20 2.3 Related Work: Algorithms . 22 2.3.1 Algorithms for Hitting and Commute Times . 23 2.3.2 Algorithms for Computing Personalized Pagerank and Simrank . 24 2.3.3 Algorithms for Computing Harmonic Functions . 28 2.4 Related Work: Applications . 28 2.4.1 Application in Computer Vision . 29 2.4.2 Text Analysis . 29 2.4.3 Collaborative Filtering . 31 2.4.4 Combating Webspam . 31 2.5 Related Work: Evaluation and Datasets . 32 2.5.1 Evaluation: Link Prediction . 32 2.5.2 Publicly Available Data Sources . 33 2.6 Discussion . 35 ix 3 Fast Local Algorithm for Nearest-neighbor Search in Truncated Hitting and Commute Times 37 3.1 Introduction . 37 3.2 Short Term Random Walks on Graphs . 40 3.2.1 Truncated Hitting Times . 40 3.2.2 Truncated and True Hitting Times . 41 3.2.3 Locality Properties . 41 3.3 GRANCH: Algorithm to Compute All Pairs of Nearest Neighbors . 44 3.3.1 Expanding Neighborhood . 48 3.3.2 Approximate Nearest Neighbors . 49 3.4 Hybrid Algorithm . 51 3.4.1 Sampling Scheme . 52 3.4.2 Lower and Upper Bounds on Truncated Commute Times . 53 3.4.3 Expanding Neighborhood . 53 3.4.4 The Algorithm at a Glance . 54 3.5 Empirical Results . 55 3.5.1 Co-authorship Networks . 55 3.5.2 Entity Relation Graphs . 57 3.5.3 Experiments . 58 3.6 Summary and Discussion . 62 4 Fast Dynamic Reranking in Large Graphs 65 4.1 Introduction . 65 4.2 Relation to Previous Work . 67 4.3 Motivation . 69 4.4 Harmonic Functions on Graphs, and T-step Variations . 71 4.4.1 Unconditional vs. Conditional Measures . 72 4.5 Algorithms . 75 4.5.1 The Sampling Scheme . 76 4.5.2 The Branch and Bound Algorithm . 76 4.5.3 Number of Nodes with High Harmonic Function Value .