Effective and Efficient Approaches to Retrieving and Using Expertise in Social Media

Reyyan Yeniterzi

CMU-LTI-15-008

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Jamie Callan, Chair
William Cohen
Eric Nyberg
Aditya Pal (Facebook)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies.

© 2015 Reyyan Yeniterzi

For my dear parents and sister and the little princess Dila (2009-2015)

Abstract

The recent popularity of social media is changing the way people share and acquire knowledge. Companies have started using intra-organizational social media applications to improve communication and collaboration among employees. In addition to their professional use, people have been using these sites in their personal lives to acquire information, for example by posting their questions on community question answering sites. In such environments, interactions do not always occur between users who know each other well enough to assess one another's expertise or to trust the accuracy of the content they create. This dissertation addresses this problem by estimating topic-specific expertise scores of users, which can also be used to improve expertise-related applications in social media.

Expert retrieval has been widely studied using organizational documents; however, the additional structure and information available in social media provide the opportunity to improve the developed expert finding approaches. One such difference is the availability of different types of user-created content, which can be used to represent users' expertise and the information need being searched more effectively in order to retrieve an initial set of good expert candidates.
The underlying social network structure constructed from the interactions among the users, such as commenting or replying, is also investigated, and topic-specific authority graph construction and estimation approaches are developed in order to estimate topic-specific authorities from these graphs. Finally, the available timestamp information within social media is explored, and a more dynamic expert identification approach is proposed which takes into account the recent topic-specific interest of users as well as their availability.

This available information is explored and the proposed approaches are combined in an expert identification system which consists of three parts: (1) content-based retrieval, (2) authority estimation and (3) temporal modeling. Depending on the environment and task being tested, some or all of these parts can be used to identify topic-specific experts. The proposed system is applied to two data collections, an intra-organizational blog collection and data from a popular community question answering site, for three expertise estimation related tasks: identification of topic-specific expert bloggers, routing questions to users who can provide accurate and timely replies, and ranking replies based on responders' question-specific expertise. Statistically significant improvements are observed in all three tasks. In addition to improving the effectiveness of expert identification applications in social media, the proposed approaches are also more efficient, which makes the proposed expert finding system applicable to real-time environments.

Acknowledgments

First and foremost, I would like to thank my advisor Jamie Callan, who introduced me to the field of Information Retrieval and advised me through my PhD studies in this field. Being his student and working with him was a privilege that I am grateful for.
His constructive feedback and continuous guidance helped me produce this dissertation and, more importantly, prepared me for the academic world. For the last six years, he has been not just a great advisor but also a mentor who guided me through the obstacles I encountered and prepared me for future ones.

I am deeply grateful to my committee members, William Cohen, Eric Nyberg and Aditya Pal, who have kindly given me their valuable time and insightful feedback to make this thesis stronger. I express my sincerest gratitude to Ramayya Krishnan, who has supported my thesis in many ways. I am thankful to my previous advisors and mentors, Kemal Oflazer, Bradley Malin, Dilek Hakkani Tur, Yucel Saygin, Berrin Yanikoglu, Esra Erdem and Ugur Sezerman, who introduced me to the research world.

I am indebted to my friends and colleagues at Carnegie Mellon University. They include, but are not limited to, Bhavana, Derry, Yubin, Meghana, Sunayana, Jaime, Anagha, Jonathan, Le, David, Leman, Selen, Pradeep and Justin. I am also deeply grateful to Fatma Selcen and Hakan for their close friendship and mentoring throughout my years in Pittsburgh. They, together with Mehmet Ertugrul, made my life in Pittsburgh very enjoyable. I would also like to thank all my other close friends for their friendship and support.

Last but most importantly, I would like to thank my parents, who raised me as an individual who can follow her dreams and who does not give up easily. They have supported me at every stage of my life, and especially when I decided to go abroad to pursue my PhD. This dissertation would not have been possible without the continuous motivation and support of my dear sister Suveyda. She was there both to lift me up in my low moments and to celebrate my accomplishments. Therefore, I dedicate this dissertation to these three wonderful people in my life.

Contents

1 Introduction
  1.1 Retrieving Expertise in Social Media
  1.2 Using Expertise in Social Media
    1.2.1 Question Routing in CQA
    1.2.2 Reply Ranking in CQA
  1.3 Significance of the Research
  1.4 Overview of Dissertation Organization

2 Related Work
  2.1 Related Work on Expert Retrieval
    2.1.1 Document-Candidate Associations
    2.1.2 Profile-based Approach
    2.1.3 Document-based Approaches
    2.1.4 Graph-based Approaches
    2.1.5 Learning-based Approaches
  2.2 Related Work on Expert Retrieval in CQA
    2.2.1 User and Information Need Representation
    2.2.2 Expertise Estimation Approaches
  2.3 Summary

3 The Proposed Expert Retrieval Architecture for Social Media
  3.1 Content-based Approaches (Document Content)
  3.2 Authority-based Approaches (User Network)
  3.3 Temporal Approaches (Temporal Metadata)
  3.4 Summary

4 Datasets and Experimental Methodology
  4.1 The Corporate Blog Collection
    4.1.1 The Expert Blogger Finding Task
  4.2 StackOverflow Dataset
    4.2.1 The Question Routing Task
    4.2.2 The Reply Ranking Task
  4.3 Data Preprocessing, Indexing and Retrieval
  4.4 Parameter Optimization
  4.5 Statistical Significance Tests
  4.6 Bias Analysis on CQA Vote-based Test Collections
    4.6.1 Temporal Bias
    4.6.2 Presentation Bias
    4.6.3 The Effects of These Biases
    4.6.4 A More Bias-Free Test Set Construction Approach
    4.6.5 Summary
  4.7 Summary

5 Content-based Approaches
  5.1 Representation of Expertise in CQA Sites
    5.1.1 Representation of Information Need
    5.1.2 Representation of Users
    5.1.3 Constructing Effective Structured Queries
    5.1.4 Identifying Experts
    5.1.5 Analyzing the Dataset
    5.1.6 Experiments with Different Representations of Information Need and User Expertise
    5.1.7 Experiments with Weighting the Question Tags
  5.2 Exploring Comments for Expertise Estimation in CQA Sites
    5.2.1 Commenting in StackOverflow
    5.2.2 Pre-Processing Comments
    5.2.3 Using Comments in Expertise Estimation
    5.2.4 Experiments with Different Representations of Information Need and User Expertise
    5.2.5 Experiments with Weighting the Question Tags
    5.2.6 Summary
  5.3 Content-based Expert Retrieval in Blogs
  5.4 Summary

6 Authority-based Approaches
  6.1 Background on User Authority Estimation
    6.1.1 PageRank and Topic-Sensitive PageRank
    6.1.2 Hyperlink-Induced Topic Search (HITS)
  6.2 Constructing More Topic-Specific Authority Networks
  6.3 Using Expertise in Authority Estimation
    6.3.1 Using Expertise as Influence
    6.3.2 Using Expertise in Teleportation
  6.4 Adapting HITS to CQA Authority Networks
  6.5 Experiments
    6.5.1 Authority Networks
    6.5.2 Experiments with Topic-Candidate Graph
    6.5.3 Experiments with Using Initially Estimated Expertise
    6.5.4 Experiments with HITSCQA
  6.6 Summary

7 Temporal Approaches
  7.1 Motivation for Temporal Modeling
  7.2 Related Work
  7.3 Temporal Discounting
  7.4 Temporal Modeling of Expertise
    7.4.1 Constructing Time Intervals
    7.4.2 Temporal Count-based Approaches
  7.5 Experiments
    7.5.1 Baseline Temporal Approaches
    7.5.2 Question Routing Experiments
  7.6 Summary

8 Combining Approaches
  8.1 The Expert Blogger Finding Task
  8.2 The Question Routing Task
  8.3 The Reply Ranking Task
  8.4 Summary