Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features
by

Pradeep Muthukrishnan

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Computer Science and Engineering)
in The University of Michigan
2011

Doctoral Committee:
    Professor Dragomir Radkov Radev, Chair
    Associate Professor Steven P. Abney
    Assistant Professor Honglak Lee
    Assistant Professor Qiaozhu Mei
    Assistant Professor Zeeshan Syed

© Pradeep Muthukrishnan 2011
All Rights Reserved

Dedicated to my parents

ACKNOWLEDGEMENTS

First and foremost, I am truly grateful to my advisor Dragomir Radev. This thesis would not have been possible without him and the support he has given me over the past few years. His constant guidance, motivation, and patience have been essential to my development as a scientist. I am also thankful to Qiaozhu Mei for the many insightful discussions and feedback that helped shape and improve this work. I would also like to thank the other members of my dissertation committee, Steve Abney, Honglak Lee, and Zeeshan Syed, for their careful criticism and insightful feedback, which improved the quality of this work.

I would like to thank all members of the Computational Linguistics and Information Retrieval (CLAIR) group at the University of Michigan, Vahed Qazvinian, Ahmed Hassan, Arzucan Ozgur, and Amjad Abu-Jbara, for being great colleagues and for the many insightful discussions we had together. Special thanks to my roommates Kumar Sricharan and Gowtham Bellala for the innumerable late-night discussions on my research. My heartfelt thanks to all my friends who have been family for me here in Ann Arbor, especially Avani Kamath, Tejas Jayaraman, and Gandharv Kashinath. I would also like to thank all my friends in India, especially Jason Lazard, Zeeshan Qureshi, and Roshan Muthaiya.

Words cannot express my gratitude to my parents. Not a single page of this thesis would have been possible without their great sacrifices and constant support.

TABLE OF CONTENTS

DEDICATION ........ ii
ACKNOWLEDGEMENTS ........ iii
LIST OF FIGURES ........ viii
LIST OF TABLES ........ x
LIST OF APPENDICES ........ xii
ABSTRACT ........ xiii

CHAPTER

I. Introduction ........ 1
    1.1 Introduction ........ 1

II. Related Work ........ 9
    2.1 Introduction ........ 9
    2.2 Vector Similarity ........ 10
    2.3 Set Similarity ........ 11
    2.4 Word Similarity ........ 12
        2.4.1 String Match-based Similarity ........ 12
        2.4.2 Knowledge-Based Approaches ........ 13
    2.5 Document Similarity ........ 19
    2.6 Document Representation ........ 20
        2.6.1 Free Text ........ 20
        2.6.2 Bag of Words ........ 20
        2.6.3 Stemmed Representation ........ 21
        2.6.4 Knowledge Rich Representation ........ 21
    2.7 Document Similarity Measures ........ 22
        2.7.1 String-based approach ........ 22
        2.7.2 Vector Space Similarity ........ 22
        2.7.3 Knowledge-based approach ........ 23
        2.7.4 Corpus-based approach ........ 24
        2.7.5 Kernel Methods ........ 29
    2.8 Structured Object Similarity ........ 32
        2.8.1 Graph Similarity ........ 32
        2.8.2 Tree Kernels ........ 34
    2.9 Machine Learning Approaches ........ 36
    2.10 Avoiding Pairwise Similarity Computation ........ 38
    2.11 Evaluation ........ 40

III. Integrating Multiple Similarity Measures ........ 45
    3.1 Introduction ........ 45
    3.2 Problem Motivation ........ 49
    3.3 Problem Formulation ........ 50
    3.4 Regularization Framework ........ 54
        3.4.1 Convex Optimization - Direct Solution ........ 57
        3.4.2 An Efficient Algorithm ........ 58
        3.4.3 Proof of Convergence ........ 61
        3.4.4 Layered Random Walk ........ 62
    3.5 Experiments ........ 64
        3.5.1 Experiment I ........ 64
        3.5.2 Experiment II ........ 68
    3.6 Results ........ 82

IV. Simultaneously Optimizing Feature Weights and Similarity ........ 87
    4.1 Introduction ........ 87
    4.2 Motivation ........ 87
    4.3 Problem Formulation ........ 89
    4.4 Regularization Framework ........ 91
    4.5 Efficient Algorithm ........ 94
        4.5.1 Layered Random Walk Interpretation ........ 96
        4.5.2 Relation to Node Label Regularization ........ 98
        4.5.3 Computational Complexity ........ 102
        4.5.4 MapReduce Implementation ........ 102
    4.6 Experiments ........ 103
        4.6.1 Experiment I ........ 104
        4.6.2 Experiment II ........ 108
        4.6.3 Experiment III ........ 109
    4.7 Results and Discussion ........ 114

V. Topic Detection in Blogs ........ 118
    5.1 Introduction ........ 118
    5.2 Related Work ........ 120
    5.3 Existing redundancy measures ........ 122
    5.4 Data ........ 124
    5.5 Method ........ 127
        5.5.1 Generating Candidate Phrases ........ 127
        5.5.2 Greedy Algorithm ........ 130
    5.6 Adaptive Algorithm ........ 134
    5.7 Experiments ........ 137
    5.8 Evaluation ........ 138
    5.9 Results ........ 146

VI. Conclusion ........ 148
    6.1 Summary of Contributions ........ 148
    6.2 Future Work ........ 152

APPENDIX ........ 155
BIBLIOGRAPHY ........ 182

LIST OF FIGURES

2.1 Example WordNet taxonomy for nouns ........ 16
2.2 Example of Sub-Set Trees (Bloehdorn and Moschitti, 2007) ........ 35
3.1 An example graph: the proposed representation for objects with heterogeneous features. For example, P1, P2, and P3 are documents, K1, K2, and K3 are keywords, and A1, A2, and A3 represent authors. Objects of the same type constitute a layer of the graph, and the different layers are separated by dotted lines. Intra-layer edges represent initial similarity between the objects, while inter-layer edges represent feature weights. For example, the edge between P1 and P2 is initialized to the similarity between them, while the edge between P1 and K2 represents the feature weight of K2 in P1's feature vector representation ........ 53
3.2 Classification accuracy on the different data sets. The number of points labeled is plotted along the x-axis, and the y-axis shows the classification accuracy on the unlabeled data ........ 79
3.3 Classification accuracy of a linear combination of textual similarity and link similarity measures on the different data sets ........ 81
3.4 Co-training algorithm (Blum and Mitchell, 1998) ........ 85
4.1 The graph at the top is the original graph, while the transformed graph is on the bottom. P1, P2, and P3 are objects and K1, K2, and K3 are features. P12, P23, and P13 correspond to edges e1, e2, and e3, respectively. K12, K13, and K23 correspond to edges e4, e5, and e6. L1, L2, L3, and L4 are inter-layer edges corresponding to e7, e8, e9, and e10 ........ 100
4.2 Classification accuracy on the different data sets. The number of points labeled is plotted along the x-axis, and the y-axis shows the classification accuracy on the unlabeled data ........ 107
5.1 Example blog post discussing video games (Hoopdog, 2007) ........ 125
5.2 Example blog post discussing video games (Hoopdog, 2007) ........ 126
5.3 Generalized Venn diagram of topic overlap in the Virginia Tech collection ........ 126
5.4 Bipartite graph representation of topic document coverage ........ 130
5.5 Subtopic redundancy vs. coverage. Each data set contains a different number of documents, hence the number of topics chosen for each data set is different ........ 146
A.1 Active Learning Algorithm for Clustering Using Minimum Set Cover ........ 164
A.2 Toy Graph ........ 165
A.3 Modified Graph ........ 166
A.4 Classification Accuracy ........ 172
B.1 CGI interface used for matching new references to existing papers ........ 176
B.2 Snapshot of the different statistics for a paper ........ 177
B.3 Sample contents of the downloadable corpus ........ 181

LIST OF TABLES

2.1 The coefficients of correlation between human ratings of similarity (by Miller and Charles and by Rubenstein and Goodenough) and the five computational measures ........ 41
3.1 Top similar pairs of papers according to link similarity in the AAN data set ........ 67
3.2 Statistics of different data sets ........ 72
3.3 Top similar pairs of keyphrases extracted from the publications in the AAN data set ........ 73
3.4 Example phrases extracted from a publication from each class in the AAN data set ........ 75
3.5 Details of a few sample papers classified according to research topic ........ 76
3.6 Top few words from each cluster extracted from the keyword similarity graph ........ 77
3.7 Normalized Mutual Information scores of the different similarity measures on the different data sets ........ 82
4.1 Normalized Mutual Information scores of the different similarity measures on the different data sets ........ 108
4.2 Sample statistics of the Yahoo! KDD Cup 2011 data set ........ 108
4.3 RMSE scores obtained by the different similarity measures on the 2011 KDD Cup test set ........ 114
4.4 Clustering coefficient of the different similarity graphs on the different data sets ........ 116
5.1 Major events summarized ........ 125
5.2 Coverage, overlap, and relevance evaluation scores for the gold standard, the baseline, the graph-based method, and the improved graph-based method ........ 141
5.3 Precision, recall, and F-score for the baseline, LDA, LDA using keyphrase detection, the graph-based algorithm, and the improved graph-based method ........ 142
5.4 Kappa scores for the gold standard ........ 143
5.5 Different topics chosen by the graph-based algorithm for the different data sets. Each data set contains a different number of documents, hence the number of topics chosen for each data set is different ........ 144
5.6 Number of generated topics for each collection ........ 145
5.7 Different topics chosen by the adaptive algorithm for the Virginia Tech Tragedy data set ........ 145
A.1 Example showing the covering relation for a toy graph ........ 161
B.1 Growth of Citation Volume ........ 178
B.2 Statistics for popular venues ........ 179

LIST OF APPENDICES

A. Graph Clustering Using Active Learning ........ 156
B. ACL Anthology Network (AAN) Corpus ........ 173

ABSTRACT

Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features

by
Pradeep Muthukrishnan

Chair: Dragomir Radkov Radev

Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are universal and have a broad appeal in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP), and Information Retrieval (IR) problems such as clustering, classification, and word sense disambiguation.