Contextualized Web Search: Query-Dependent Ranking and Social Media Search

CONTEXTUALIZED WEB SEARCH: QUERY-DEPENDENT RANKING AND SOCIAL MEDIA SEARCH A Thesis Presented to The Academic Faculty by Jiang Bian In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the College of Computing Georgia Institute of Technology December 2010 CONTEXTUALIZED WEB SEARCH: QUERY-DEPENDENT RANKING AND SOCIAL MEDIA SEARCH Approved by: Dr. Hongyuan Zha, Advisor Dr. Guy Lebanon College of Computing College of Computing Georgia Institute of Technology Georgia Institute of Technology Dr. Eugene Agichtein Dr. Richard Vuduc Mathematics and Computer Science College of Computing Department Georgia Institute of Technology Emory University Dr. Alexander Gray Date Approved: September 15, 2010 College of Computing Georgia Institute of Technology ACKNOWLEDGEMENTS I would never have been able to finish my dissertation without the guidance of my committee members, help from my friends, and support from my family. I owe an enormous debt of gratitude to my advisor, Dr. Hongyuan Zha, who has showed me the beauty of science, offered me precious opportunities for collabo- rations, and supplied sustenance, literally and metaphorically, through long seasons in Atlanta. I would like to express my deep gratitude to one of committee members and collaborators, Dr. Eugene Agichtein, for offering many insightful suggestions and friendly accessibility throughout my graduate study. I would like to thank Dr. Ding Zhou for his patient guidance on my early research work and my PhD life. This dissertation has also received contributions from outside the university, for which I owe my special thanks to Dr. Tie-Yan Liu and Dr. Zhaohui Zheng who mentored my two important industrial internships. I thank Dr. Tao Qin, Dr. Xin Li, Dr. Fan Li and Dr. Max Gubin for giving me so much invaluable advice from indus- try. I feel especially thankful to my colleagues from Georgia Institute of Technology, Emory University, Microsoft Research Asia, Yahoo! Labs, Facebook, University of Illinois at Urbana-Champaign, and Nokia Research Center, with whom I have enjoyed collaboration. Finally, I am deeply indebted to my parents for bringing me into this world and providing me with a sound education. Finally, I would like to thank Dr. Weiyun Chen, who was always standing by me through both good and bad times. iii TABLE OF CONTENTS ACKNOWLEDGEMENTS ............................ iii LIST OF TABLES ................................. ix LIST OF FIGURES ................................ xi SUMMARY ..................................... xiv CHAPTERS I INTRODUCTION .............................. 1 1.1 Query-Dependent Ranking for Web Search . 3 1.2 Context-Aware Social Media Search . 4 1.3 Contributions . 9 1.4 Dissertation Organization . 10 II RELATED WORK .............................. 12 2.1 Ranking for Web Search . 12 2.1.1 Learning to Rank . 12 2.1.2 Incorporate Query Difference in Ranking . 13 2.2 Search in Social Media . 14 2.2.1 Information Retrieval over Community Question Answering 14 2.2.2 Modeling User Authority . 15 2.2.3 User Interactions and Related Spam in Web Search . 16 2.2.4 Information Extraction and Integration over Social Media . 17 III QUERY-DEPENDENT RANKING FOR SEARCH . 19 3.1 Query Difference in Ranking . 19 3.2 Incorporating Query Difference into Ranking . 21 3.2.1 Query-Dependent Loss Functions . 21 3.2.2 Learning Methods . 22 3.3 Applying Query-Dependent Loss To A Specific Query Categorization 23 iv 3.3.1 Query-Dependent Loss Functions Based on Query Taxonomy of Web Search . 24 3.3.2 Learning Methods . 27 3.3.3 Example Query-Dependent Loss Functions . 30 3.4 Experiments . 33 3.4.1 Experimental Setup . 33 3.4.2 Experimental Results . 35 3.4.3 Discussions . 40 3.5 Query-Dependent Loss Functions v.s. Query-Dependent Rank Func- tions . 43 3.6 Summary . 46 IV RANKING SPECIALIZATION FOR WEB SEARCH . 47 4.1 A Divide-and-Conquer Framework . 48 4.2 Identifying Ranking-Sensitive Query Topics . 50 4.2.1 Generating Query Features . 50 4.2.2 Generating Query Topics and Computing Topic Distribution for Queries . 51 4.3 A Unified Approach for Learning Multiple Ranking Models . 52 4.3.1 Problem Statement . 52 4.3.2 Topical RankSVM . 53 4.4 Ensemble Ranking for New Queries . 55 4.5 Experimental Setup . 56 4.5.1 Data Collection . 56 4.5.2 Evaluation Metrics . 58 4.5.3 Ranking Methods Compared . 58 4.6 Experimental Results . 59 4.6.1 Experiments with LETOR Dataset . 59 4.6.2 Experiments with SE-Dataset . 68 4.7 Summary . 71 v V RANKING OVER SOCIAL MEDIA .................... 74 5.1 Community Question Answering . 74 5.2 Learning Ranking Functions for CQA Retrieval . 78 5.2.1 Problem Definition of QA Retrieval . 78 5.2.2 User Interactions in Community QA . 79 5.2.3 Features and Preference Data Extraction . 80 5.2.4 Learning Ranking Function from Preference Data . 84 5.3 Experimental Setup . 85 5.3.1 Datasets . 85 5.3.2 Evaluation Metrics . 88 5.3.3 Ranking Methods Compared . 89 5.4 Experimental Results . 91 5.4.1 Learning Ranking Function . 91 5.4.2 Robustness to Noisy Labels . 92 5.4.3 Ablation Study on Feature Set . 93 5.4.4 QA Retrieval . 94 5.5 Summary . 97 VI LEARNING TO RECOGNIZE RELIABLE USERS AND CONTENT IN SOCIAL MEDIA ............................... 98 6.1 Content Quality and User Reputation in Social Media . 98 6.2 Learning Content Quality and User Reputation in CQA . 99 6.2.1 Problem Statement . 100 6.2.2 Coupled Mutual Reinforcement Principle . 101 6.2.3 CQA-MR: Coupled Semi-Supervised Mutual Reinforcement 104 6.3 Experimental Setup . 109 6.3.1 Data Collection . 109 6.3.2 Evaluation Metrics . 111 6.3.3 Methods Compared . 111 6.4 Experimental Results . 113 vi 6.4.1 Predicting Content Quality and User Reputation . 113 6.4.2 Quality-aware CQA Retrieval . 115 6.4.3 Effects of the QA quality and User Reputation Features . 117 6.4.4 Effects of the Amount of Supervision . 118 6.5 Summary . 120 VII ROBUST RANKING IN SOCIAL MEDIA . 122 7.1 User Votes in Social Media . 122 7.2 Vote Spam in Social Media . 124 7.2.1 Vote Spam Attack Models . 124 7.3 Robust Ranking in Community Question Answering . 126 7.3.1 Robust Ranking Method . 126 7.3.2 Datasets . 127 7.3.3 Ranking Methods Compared . 127 7.3.4 Evaluation on Robustness of Ranking to Vote Spam Attack 128 7.4 Experimental Results . 128 7.4.1 QA Retrieval . 128 7.4.2 Robustness to Vote Spam . 131 7.4.3 Analyzing Feature Contributions . 133 7.5 Summary . 136 VIII LEARNING TO ORGANIZE THE KNOWLEDGE IN SOCIAL MEDIA 137 8.1 Task Question and Its Structured Semantics . 137 8.2 Integrating Task Questions Based on Structured Semantics . 141 8.2.1 Problem Formulation . 141 8.2.2 Aspect Discovery and Clustering on CQA Content . 143 8.2.3 Generating Representative Description for Aspects . 149 8.3 Applications . 150 8.3.1 Identifying Semantics for New Questions . 150 8.3.2 Finding Similar Questions . 151 vii 8.3.3 Other Applications . 153 8.4 Experiments . 154 8.4.1 Data Collection . 155 8.4.2 Empirical Results and Analysis . 157 8.4.3 Quantitative Evaluation . 161 8.4.4 Experiments on External Applications . 162 8.5 Summary . 166 IX CONCLUSION ................................ 168 APPENDICES REFERENCES ................................... 170 VITA ........................................ 178 viii LIST OF TABLES 1 MAP value of RankNet, SQD-RankNet, and UQD-RankNet . 37 2 MAP value of ListMLE, SQD-ListMLE, and UQD-ListMLE . 38 3 Top 10 Results (Doc IDs) of One Informational Query (Query ID: 87) 41 4 Top 10 Results (Doc IDs) of One Navigational Query (Query ID: 52) 42 5 MAP value of RSVM, CRSVM, LRSVM, and TRSVM on TREC2003 and TREC2004 . 62 6 MAP value of RSVM, LRSVM, and TRSVM on MQ2003 and MQ2004 62 7 Top 8 most important features for RSVM on TREC2003 . 63 8 Top 8 most important features for CRSVM on TREC2003 . 63 9 Top 8 most important features for TRSVM on TREC2003 . 63 10 Features used to represent textual elements and user interactions . 81 11 Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) for GBrank and other metrics for baseline1 . 96 12 Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) for GBrank and other metrics for baseline2 . 96 13 Features Spaces: X(Q);X(A) and X(U)................ 105 14 Inter-annotator agreement and Kappa for question quality . 111 15 Accuracy of GBRank-MR, GBRank-Supervised, GBRank-HITS, GBRank, and Baseline (TREC 1999-2006 questions) . 117 16 Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) for GBrank, GBrank-robust and Baseline . 130 17 Information Gain for all features both when training and testing data are not polluted . 134 18 Information Gain for all features both when training and testing data are polluted . 135 19 Examples of Integrated Task Summaries of two instances of The Task Topic \travel" . 154 20 Examples of Integrated Task Summaries of two instances of The Task Topic \pets" . 155 21 Basic statistics of the collections of task questions from Yahoo! Answers156 ix 22 Basic statistics of the prior knowledge of aspect from eHow . 156 23 Inter-annotator agreement for aspect discovering . 161 24 Evaluation of Clustering Accuracy . 162 x LIST OF FIGURES 1 Accuracy in terms of NDCG of query-dependent RankNet compared with original RankNet on TREC2003 . 36 2 Accuracy in.

Load more