Extracting Users in Community Question-Answering in Particular Contexts
EXTRACTING USERS IN COMMUNITY QUESTION-ANSWERING IN PARTICULAR CONTEXTS

by LONG T. LE

A dissertation submitted to the Graduate School—New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Computer Science.

Written under the direction of Professor Chirag Shah
And approved by

New Brunswick, New Jersey
October, 2017

© 2017 Long T. Le
ALL RIGHTS RESERVED

ABSTRACT OF THE DISSERTATION

Extracting Users in Community Question-Answering in Particular Contexts
by Long T. Le
Dissertation Director: Professor Chirag Shah

Community Question-Answering (CQA) services, such as Yahoo! Answers, Stack Overflow, and Brainly, have become important sources for seeking and sharing information. Online users turn to CQA to look for information and to share knowledge on topics ranging from arts to travel. Questions posted on CQA sites often rely on the wisdom of the crowd: the idea that the best answer may emerge from the combination of several answers given by different people with varying expertise and opinions. Because CQA is a user-driven service, user experience is an important aspect, affecting the activeness and even the survival of a site. In this work, we are interested in studying the behavior of the users who participate in CQA. Specifically, we wish to understand how different types of users can be identified based on their behavior with respect to the specific CQA problem at hand. A user’s behavior depends on their particular context. For example, when we say that Alice is a “good user,” the interpretation of her behavior rests on the context in which it occurs. She might be a good user in the whole community, a good user for a specific topic, a good user for a particular question, or a good user for a particular answer. In this dissertation, we study and extract users at different levels of granularity. Users are the main driving force in CQA, and understanding them allows us to know the current state of their respective sites. The findings in this dissertation will be useful in identifying specific CQA user types.

Preface

Parts of this dissertation are based on previous work published by the author in [62], [61], and [63].

Acknowledgements

First, I want to express my gratitude to my research advisor, Professor Chirag Shah, for his advice, support, and guidance. He helped me overcome difficulties and gave me critical comments. I thank him for his guidance in the face of unexpected obstacles. Without his help, I would not have finished this dissertation.

I want to thank my committee members, Professor Amélie Marian, Professor Thu D. Nguyen, and Professor Panos Ipeirotis, for their comments and suggestions on my work. Their time and input are much appreciated, and they truly improved my research. Special thanks go to Professor Thu D. Nguyen, who served as my academic advisor and provided support and encouragement throughout my time at Rutgers. I thank Professor Ahmed Elgammal and Professor Eric Allender for being members of my qualifying exam committee. Professor William Steiger also provided a great deal of sound advice. I also want to thank Professor Tina Eliassi-Rad for advising me in the early stages of my Ph.D. program.

I would like to thank my friends in the Department of Computer Science and the InfoSeeking Lab for creating a supportive academic environment. In particular, I want to thank Erik Choi for a great collaboration, and Priya Govindan for discussing my work with me.
I also want to thank Diana Floegel for helping with some proofreading.

I want to thank my parents, Vinh Le and Kien Nguyen, and my brother, Hoang Le, for their love and support. Last, but not least, I want to thank my wife, Lan Hoang, for her love, encouragement, and support. She always stood by me on the long road to this dissertation.

Dedication

To my family.

Table of Contents

Abstract
Preface
Acknowledgements
Dedication
List of Tables
List of Figures

1. Introduction
   1.1. Overview
   1.2. Problem Definitions
   1.3. Thesis Statement
   1.4. Thesis Organization

2. Background
   2.1. Online Q&A
   2.2. Community Q&A (CQA)
   2.3. Q&A in Social Networks
   2.4. Searchers Become Askers: Understanding the Transition
   2.5. Ranking in CQA
   2.6. Question and Answer Quality in CQA
   2.7. Expert Finding
   2.8. User Behavior in Online Communities
        2.8.1. User Engagement with Online Communities
        2.8.2. User Similarity in Social Media
        2.8.3. Evolution of Users in CQA

3. Finding Potential Answerers in CQA
   3.1. Motivation and Problem Definition
   3.2. Method
        3.2.1. Constructing a User Profile
        3.2.2. Computing Similarity Between Question and User Profile
        3.2.3. Inferring Users’ Topic Interests
        3.2.4. Similarity in Information Network
        3.2.5. Activity Level of User
        3.2.6. Summary of Our Framework to Find Potential Answerers
   3.3. Datasets and Characterization of the Data
        3.3.1. Data Description
        3.3.2. Characterization of the Data
   3.4. Experiments
        3.4.1. Experimental Setup
        3.4.2. Results
   3.5. Discussion
   3.6. Conclusion

4. Good and Bad Answerers
   4.1. Motivation and Problem Definition
   4.2. Examining the Quality of an Answer
        4.2.1. Feature Extraction
        4.2.2. Classification
   4.3. Datasets and Characterization of the Data
        4.3.1. Brainly: An Educational CQA
        4.3.2. Stack Overflow: A Focused CQA
   4.4. Experiments and Results
        4.4.1. Experimental Setup
        4.4.2. Main Results
   4.5. Discussion
   4.6. Conclusion

5. Struggling Users in Educational CQA
   5.1. Datasets and Characterization of the Data
   5.2. Examining Struggling Users
        5.2.1. Definition of Struggling Users
        5.2.2. Existence of Struggling Users
        5.2.3. Social Connections of Struggling Users
        5.2.4. Time to Answer a Question
        5.2.5. Difference Between Levels of Education
        5.2.6. Activeness of Struggling Users
        5.2.7. Readability of Answers
   5.3. Detecting Struggling Users without Human Judgment
        5.3.1. Feature Extraction
        5.3.2. Classification
        5.3.3. Experiment Setup
        5.3.4. Results
   5.4. Detecting Struggling Users at Their Early Stage with Human Judgment
        5.4.1. Experiment Setup
        5.4.2. Results
        5.4.3. Experiment Results Discussion
   5.5. Discussion
   5.6. Conclusion

6. Retrieving Rising Stars in CQA
   6.1. Problem Definition
   6.2. Our Method
        6.2.1. Feature Extraction
        6.2.2. Building the Training Set
        6.2.3. Classification
   6.3. Data Description
        6.3.1. Data Pre-processing
        6.3.2. Defining the Rising Star
   6.4. Experimental Setup
   6.5. Results
        6.5.1. Overview of Results
        6.5.2. Applying Different Classification Algorithms
   6.6. Discussion
        6.6.1. Feature Importance
        6.6.2. Using Different Groups of Features
        6.6.3. Effect of the Community Size
   6.7. Conclusion

7. Conclusions and Future Work
   7.1. Conclusions
   7.2. Future Work

References

List of Tables

1.1. Notations used in the dissertation.
3.1. Score systems in Stack Overflow and Yahoo! Answers.
3.2. Description of the data.
3.3. Most popular topics in the two CQA sites.
3.4. Comparison of the MRR of different algorithms. QRec achieves the highest MRR score on both data sets.
3.5. The importance of each similarity feature in QRec.
3.6. Comparing the Loss of Anarchy. QRec preserves the quality of best answers when the question is answered by a small set of potential answerers.
3.7. Comparing the Overload. QRec has reasonable overload.
4.1. Lists of features on educational CQA are classified into four groups of features: Personal,