RECOMMENDING BEST ANSWER IN A COLLABORATIVE QUESTION ANSWERING SYSTEM

Mrs Lin Chen

Submitted in fulfilment of the requirements for the degree of Master of Information Technology (Research)
School of Information Technology
Faculty of Science & Technology
Queensland University of Technology
2009


Keywords

Authority, Collaborative Social Network, Content Analysis, Link Analysis, Natural Language Processing, Non-Content Analysis, Online Question Answering Portal, Prestige, Question Answering System, Recommending Best Answer, Social Network Analysis, Yahoo! Answers.



Abstract

The World Wide Web has become a medium for people to share information. People use Web-based collaborative tools such as question answering (QA) portals, forums, email and instant messaging to acquire information and to form online communities. In an online QA portal, a user asks a question and other users provide answers based on their knowledge, with a question usually being answered by many users. It can become overwhelming and time consuming for a user to read all of the answers provided for a given question. Thus, there exists a need for a mechanism to rank the provided answers so users can focus on reading only the good quality answers. The majority of online QA systems use user feedback to rank answers: the user who asked the question can decide on the best answer, and other users who did not participate in answering the question can also vote to determine the best answer. However, ranking the best answer via this collaborative method is time consuming and requires the ongoing involvement of users to provide the needed feedback. The objective of this research is to discover a way to automatically recommend the best answer as part of a ranked list of answers for a posted question, without the need for user feedback.

The proposed approach combines a non-content-based reputation method and a content-based method to solve the problem of recommending the best answer to the user who posted the question. The non-content method assigns a score to each user which reflects that user's reputation level in using the QA portal system. Each user is assigned two types of non-content-based reputation scores: a local reputation score and a global reputation score. The local reputation score plays an important role in deciding the reputation level of a user for the category in which the question is asked. The global reputation score indicates the prestige of a user across all of the categories in the QA system.

Due to the possibility of user cheating, such as awarding the best answer to a friend regardless of the answer quality, a content-based method for determining the quality of a given answer is proposed alongside the non-content-based reputation method. Answers for a question from different users are compared with an ideal (or expert) answer using traditional Information Retrieval and Natural Language Processing techniques. Each answer provided for a question is assigned a content score according to how well it matches the ideal answer.

To evaluate the performance of the proposed methods, each recommended best answer is compared with the best answer determined by one of the most popular link analysis methods, Hyperlink-Induced Topic Search (HITS). The proposed methods yield high accuracy, as shown by Kendall and Spearman rank correlation scores. The reputation method outperforms the HITS method in recommending the best answer. Combining the reputation score with the content score improves the overall performance, which is measured through the use of Top-n match scores.


Table of Contents

Keywords ...... i
Abstract ...... iii
Table of Contents ...... v
List of Figures ...... viii
List of Tables ...... xii
List of Abbreviations ...... xvi
Statement of Original Authorship ...... xvii
Acknowledgments ...... xix
CHAPTER 1: INTRODUCTION ...... 1
1.1 Background ...... 1
1.2 Related Works ...... 3
1.3 Research Objectives ...... 4
1.4 Research Contribution ...... 5
1.5 Thesis Organisation ...... 6
1.6 Published Paper ...... 7
CHAPTER 2: BACKGROUND & LITERATURE REVIEW ...... 9
2.1 Online Social Network ...... 9
2.2 Social Network Analysis Methods for Yahoo! Answers ...... 11
2.3 Approaches to Identify Answer Quality ...... 14
2.3.1 Content Based Approach – Natural Language Processing and Information Retrieval ...... 14
2.3.1.1 Information Retrieval ...... 15

2.3.1.2 Natural Language Processing ...... 18

2.3.1.3 Content Based Question Answering System Architecture ...... 20

2.3.1.4 Current status of Question Answering Systems...... 22

2.3.2 Reputation Based Approaches ...... 24
2.3.3 Link Analysis ...... 25
2.3.3.1 PageRank Algorithm ...... 25

2.3.3.2 Hyperlink-Induced Topic Search Algorithm ...... 27

2.3.4 Statistical Approach ...... 31
2.4 Conclusion ...... 33


CHAPTER 3: ANALYSIS OF YAHOO! ANSWERS ...... 35
3.1 Yahoo! Answers Mechanism ...... 36
3.2 Graph Representation ...... 41
3.3 The Bow Tie Structure Analysis ...... 43
3.4 Degree Centrality ...... 46
3.5 Question Quality & Answer Quality ...... 48
3.6 A Hierarchical Classification Structure for Placing Questions ...... 50
3.7 Conclusion ...... 52
CHAPTER 4: METHODOLOGY ...... 55
4.1 Overview of Methodology ...... 55
4.2 Reputation-based Method ...... 57
4.2.1 Local Reputation Score ...... 60
4.2.2 Global Reputation Score ...... 69
4.3 Content Method ...... 71
4.3.1 Question Type Analysis ...... 71
4.3.1.1 Support Vector Machine (SVM) ...... 71

4.3.1.2 Question Type Class ...... 72

4.3.2 Named Entity Recognition (NER) ...... 74
4.3.3 Question Answering Systems as an Expert ...... 76
4.3.3.1 Comparison of Various QA Systems ...... 77

4.3.3.2 Process of Matching User Answers and Expert Answer ...... 80

4.3.3.3 Semantic Expansion of Answer Keywords ...... 81

4.3.3.4 Query and Answer Matching ...... 82

4.4 Answer Score Fusion ...... 82
4.5 Conclusion ...... 85
CHAPTER 5: EXPERIMENTATION & RESULT ...... 87
5.1 Dataset ...... 87
5.2 Experiment Design ...... 89
5.2.1 Reputation Method Experiment Setup ...... 90
5.2.1.1 Local Reputation Score ...... 90

5.2.1.2 Global Reputation Score ...... 91

5.2.1.3 Settings for HITS ...... 91

5.2.2 Content Method ...... 93
5.2.2.1 Question Type Analysis ...... 93

5.2.2.2 Named Entity Recognition ...... 95

5.2.2.3 Question Pre-Processing ...... 97


5.3 Evaluation Criteria ...... 98
5.4 Results ...... 100
5.4.1 Reputation Method Evaluation & Results ...... 100
5.4.1.1 Trend Comparisons ...... 100

5.4.1.2 Kendall & Spearman correlation ...... 102

5.4.1.3 Weighting of Reputation Method ...... 103

5.4.2 Content Method Evaluation & Results ...... 109
5.4.3 Score Fusion Evaluation and Results ...... 111
5.5 Discussion ...... 119
5.6 Conclusion ...... 120
CHAPTER 6: CONCLUSIONS ...... 123
6.1 Main Findings ...... 123
6.2 Contributions ...... 125
6.3 Future Work ...... 126
BIBLIOGRAPHY ...... 129
APPENDIX A ...... 138
APPENDIX B ...... 146


List of Figures

Figure 2.1. Basic architecture of a Question-Answering system (Prager, 2006) ...... 21
Figure 2.2. Authorities and hubs (Liu, 2007) ...... 28
Figure 3.1. An example of a question and its answers ...... 37
Figure 3.2. Result from ...... 38
Figure 3.3. Points (Points and Levels, 2009) ...... 40
Figure 3.4. Levels (Points and Levels, 2009) ...... 40
Figure 3.5. Tripartite example ...... 42
Figure 3.6. Bipartite example ...... 43
Figure 3.7. Bow Tie structure (Borodin et al., 2003) ...... 44
Figure 3.8. The algorithm in pseudocode (Tarjan, 1972) ...... 45
Figure 3.9. Indegree and Outdegree ...... 47
Figure 3.10. Question quality ...... 49
Figure 4.1. Flowchart of process model ...... 57
Figure 4.2. Sigmoid Function (www.wikipedia.com) ...... 64
Figure 4.3. Answers distribution in Arts & Humanity ...... 65
Figure 4.4. Best answer distribution in Arts & Humanity ...... 65
Figure 4.5. Answer distribution in Science & Mathematics ...... 66
Figure 4.6. Best answer distribution in Science & Mathematics ...... 66
Figure 4.7. Answer distribution in Sports ...... 67
Figure 4.8. Best answer distribution in Sports ...... 67
Figure 4.9. Local reputation score algorithm ...... 68
Figure 4.10. Global answer distributions ...... 70
Figure 4.11. Global best answers distribution ...... 70
Figure 4.12. Question type analysis algorithm ...... 74
Figure 4.13. Process of generating combined score ...... 84
Figure 5.1. Document in XML format ...... 89
Figure 5.2. HITS algorithm used in the proposed experiment ...... 92
Figure 5.3. Sample results from GATE ...... 96
Figure 5.4. NER example result ...... 97
Figure 5.5. Trend comparisons for bins 0-0.4 ...... 101
Figure 5.6. Trend comparisons for bins 0.4-1 ...... 101
Figure 5.7. Correlation score for Top-k users ...... 103
Figure 5.8. Overall results for Top-n matching for weighting of reputation score ...... 105
Figure 5.9. Arts & Humanity for Top-n matching for weighting of reputation score ...... 106
Figure 5.10. Science & Math for Top-n matching for weighting of reputation score ...... 107
Figure 5.11. Sports for Top-n matching for weighting of reputation score ...... 108


Figure 5.12. Sample of output from proposed system ...... 114
Figure 5.13. Overall match rate for tested combinations ...... 115
Figure 5.14. Arts & Humanity match rate for tested combinations ...... 116
Figure 5.15. Science & Math match rate for tested combinations ...... 117
Figure 5.16. Sports match rate for tested combinations ...... 118
Figure A.1. Overall reputation score weighting ...... 139
Figure A.2. Arts & Humanity reputation score weighting ...... 140
Figure A.3. Science & Math reputation score weighting ...... 141
Figure A.4. Sports reputation score weighting ...... 142
Figure B.1. Overall top n match score for first small dataset using different combinations of reputation and content scoring (1) ...... 147
Figure B.2. Overall top n match score for first small dataset using different combinations of reputation and content scoring (2) ...... 148
Figure B.3. Overall top n match score for first small dataset using different combinations of reputation and content scoring (3) ...... 149
Figure B.4. Overall top n match score for first small dataset using different combinations of reputation and content scoring (4) ...... 150
Figure B.5. Overall top n match score for first small dataset using different combinations of reputation and content scoring (5) ...... 151
Figure B.6. Overall top n match score for first small dataset using different combinations of reputation and content scoring (6) ...... 152
Figure B.7. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (1) ...... 153
Figure B.8. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (2) ...... 154
Figure B.9. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (3) ...... 155
Figure B.10. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (1) ...... 156
Figure B.11. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (2) ...... 157
Figure B.12. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (3) ...... 158
Figure B.13. Sports top n match score for first small dataset using different combinations of reputation and content scoring (1) ...... 159
Figure B.14. Sports top n match score for first small dataset using different combinations of reputation and content scoring (2) ...... 160
Figure B.15. Sports top n match score for first small dataset using different combinations of reputation and content scoring (3) ...... 161
Figure B.16. Overall top n match score for second small dataset using different combinations of reputation and content scoring (1) ...... 167
Figure B.17. Overall top n match score for second small dataset using different combinations of reputation and content scoring (2) ...... 168
Figure B.18. Overall top n match score for second small dataset using different combinations of reputation and content scoring (3) ...... 169


Figure B.19. Overall top n match score for second small dataset using different combinations of reputation and content scoring (3) ...... 170
Figure B.20. Overall top n match score for second small dataset using different combinations of reputation and content scoring (4) ...... 171
Figure B.21. Overall top n match score for second small dataset using different combinations of reputation and content scoring (5) ...... 172
Figure B.22. Arts & Humanity top n match score for second small dataset using different combinations of reputation and content scoring (1) ...... 173
Figure B.23. Arts & Humanity top n match score for second small dataset using different combinations of reputation and content scoring (2) ...... 174
Figure B.24. Arts & Humanity top n match score for second small dataset using different combinations of reputation and content scoring (3) ...... 175
Figure B.25. Science & Maths top n match score for second small dataset using different combinations of reputation and content scoring (1) ...... 176
Figure B.26. Science & Maths top n match score for second small dataset using different combinations of reputation and content scoring (2) ...... 177
Figure B.27. Science & Maths top n match score for second small dataset using different combinations of reputation and content scoring (3) ...... 178
Figure B.28. Sports top n match score for second small dataset using different combinations of reputation and content scoring (1) ...... 179
Figure B.29. Sports top n match score for second small dataset using different combinations of reputation and content scoring (2) ...... 180
Figure B.30. Sports top n match score for second small dataset using different combinations of reputation and content scoring (3) ...... 181
Figure B.31. Overall top n match score for big dataset using different combinations of reputation and content scoring (1) ...... 187
Figure B.32. Overall top n match score for big dataset using different combinations of reputation and content scoring (2) ...... 188
Figure B.33. Arts & Humanity top n match score for big dataset using different combinations of reputation and content scoring ...... 189
Figure B.34. Science & Maths top n match score for big dataset using different combinations of reputation and content scoring ...... 190
Figure B.35. Sports top n match score for big dataset using different combinations of reputation and content scoring ...... 191
Figure B.36. Overall top n match rate for big dataset using 0.75 reputation weighting and 0.25 content weighting ...... 192
Figure B.37. Overall top n match rate for big dataset using 0.25 reputation weighting and 0.75 content weighting ...... 193
Figure B.38. Arts & Humanity top n match rate for big dataset using 0.75 reputation weighting and 0.25 content weighting ...... 194
Figure B.39. Science & Math top n match rate for big dataset using 0.25 reputation weighting and 0.75 content weighting ...... 195
Figure B.40. Science & Math top n match rate for big dataset using 0.75 reputation weighting and 0.25 content weighting ...... 196
Figure B.41. Sports top n match rate for big dataset using 0.25 reputation weighting and 0.75 content weighting ...... 197


Figure B.42. Sports top n match rate for big dataset using 0.75 reputation weighting and 0.25 content weighting...... 198


List of Tables

Table 3.1. Bow Tie comparison ...... 46
Table 3.2. Yahoo! Answers categories ...... 51
Table 3.3. The crossover of users between categories ...... 52
Table 4.1. The coarse and fine grained question categories ...... 73
Table 4.2. Question Answering system comparison ...... 80
Table 4.3. An example of vectors for AnswerBus (A) and Yahoo! Answers (Y) ...... 81
Table 5.1. Yahoo! Answers dataset statistics ...... 87
Table 5.2. Values assigned to µ and σ ...... 90
Table 5.3. AnswerBus match score ...... 110
Table 5.4. Question answer match score ...... 111
Table 5.5. AnswerBus with WordNet match score ...... 111
Table 5.6. Combined AnswerBus with question answer match score ...... 111
Table 5.7. Combined AnswerBus with WordNet and question answer match score ...... 111
Table A.1. 0.1 participation score weighting with 0.9 answer score weighting (QA-NC-0.1AS-0.9BAS) ...... 143
Table A.2. 0.2 participation score weighting with 0.8 answer score weighting (QA-NC-0.2AS-0.8BAS) ...... 143
Table A.3. 0.3 participation score weighting with 0.7 answer score weighting (QA-NC-0.3AS-0.7BAS) ...... 143
Table A.4. 0.4 participation score weighting with 0.6 answer score weighting (QA-NC-0.4AS-0.6BAS) ...... 143
Table A.5. 0.5 participation score weighting with 0.5 answer score weighting (QA-NC-0.5AS-0.5BAS) ...... 143
Table A.6. 0.6 participation score weighting with 0.4 answer score weighting (QA-NC-0.6AS-0.4BAS) ...... 144
Table A.7. 0.7 participation score weighting with 0.3 answer score weighting (QA-NC-0.7AS-0.3BAS) ...... 144
Table A.8. 0.8 participation score weighting with 0.2 answer score weighting (QA-NC-0.8AS-0.2BAS) ...... 144
Table A.9. 0.9 participation score weighting with 0.1 answer score weighting (QA-NC-0.9AS-0.1BAS) ...... 144
Table A.10. 0.25 participation score weighting with 0.75 answer score weighting (QA-NC-0.25AS-0.75BAS) ...... 144
Table A.11. 0.75 participation score weighting with 0.25 answer score weighting (QA-NC-0.75AS-0.25BAS) ...... 145
Table B.1. Top n match score for first small dataset using reputation scoring (QA-NC) ...... 162
Table B.2. Top n match score for first small dataset using AnswerBus (QA-C(1)) ...... 162
Table B.3. Top n match score for first small dataset using question answer match (QA-C(2)) ...... 162
Table B.4. Top n match score for first small dataset using combination of reputation and AnswerBus scoring (QA-0.5NC-0.5C(1)) ...... 162


Table B.5. Top n match score for first small dataset using combination of reputation and question answer match scoring (QA-0.5NC-0.5C(2)) ...... 162
Table B.6. Top n match score for first small dataset using combination of reputation and content scoring (QA-0.5NC-0.5(C(1)-C(2))) ...... 163
Table B.7. Top n match score for first small dataset using different content scoring (QA-C(1)-C(2)) ...... 163
Table B.8. Top n match score for first small dataset using HITS (QA-HITS) ...... 163
Table B.9. Top n match score for first small dataset using WordNet applied to AnswerBus (QA-C(1)-WN) ...... 163
Table B.10. Top n match score for first small dataset using combination of reputation and WordNet scoring (QA-0.5NC-0.5C(1)-WN) ...... 163
Table B.11. Top n match score for first small dataset using combination of reputation and content scoring (QA-0.5NC-0.5(C(1)-C(2))-WN) ...... 164
Table B.12. Top n match score for first small dataset using combination of different content scoring (QA-C(1)-C(2)-WN) ...... 164
Table B.13. Top n match score for first small dataset using different weights on reputation and content scoring (1) (QA-0.25NC-0.75C(1)) ...... 164
Table B.14. Top n match score for first small dataset using different weights on reputation and content scoring (2) (QA-0.25NC-0.75C(2)) ...... 164
Table B.15. Top n match score for first small dataset using different weights on reputation and content scoring (3) (QA-0.25NC-0.75(C(1)-C(2))) ...... 164
Table B.16. Top n match score for first small dataset using different weights on reputation and content scoring (4) (QA-0.25NC-0.75C(1)-WN) ...... 165
Table B.17. Top n match score for first small dataset using different weights on reputation and content scoring (5) (QA-0.25NC-0.75(C(1)-C(2))-WN) ...... 165
Table B.18. Top n match score for first small dataset using different weights on reputation and content scoring (6) (QA-0.75NC-0.25C(1)) ...... 165
Table B.19. Top n match score for first small dataset using different weights on reputation and content scoring (7) (QA-0.75NC-0.25C(2)) ...... 165
Table B.20. Top n match score for first small dataset using different weights on reputation and content scoring (8) (QA-0.75NC-0.25(C(1)-C(2))) ...... 165
Table B.21. Top n match score for first small dataset using different weights on reputation and content scoring (9) (QA-0.75NC-0.25C(1)-WN) ...... 166
Table B.22. Top n match score for first small dataset using different weights on reputation and content scoring (10) (QA-0.75NC-0.25(C(1)-C(2))-WN) ...... 166
Table B.23. Top n match score for second small dataset using reputation scoring (QA-NC) ...... 182
Table B.24. Top n match score for second small dataset using AnswerBus (QA-C(1)) ...... 182
Table B.25. Top n match score for second small dataset using question answer match scoring (QA-C(2)) ...... 182
Table B.26. Top n match score for second small dataset using different combination of reputation and content scoring (1) (QA-0.5NC-0.5C(1)) ...... 182
Table B.27. Top n match score for second small dataset using different combination of reputation and content scoring (2) (QA-0.5NC-0.5C(2)) ...... 182
Table B.28. Top n match score for second small dataset using different combination of reputation and content scoring (3) (QA-0.5NC-0.5(C(1)-C(2))) ...... 183
Table B.29. Top n match score for second small dataset using different combination of content scoring (1) (QA-C(1)-C(2)) ...... 183


Table B.30. Top n match score for second small dataset using HITS (QA-HITS) ...... 183
Table B.31. Top n match score for second small dataset using WordNet (QA-C(1)-WN) ...... 183
Table B.32. Top n match score for second small dataset using different combination of reputation and content scoring (4) (QA-0.5NC-0.5C(1)-WN) ...... 183
Table B.33. Top n match score for second small dataset using different combination of reputation and content scoring (5) (QA-0.5NC-0.5(C(1)-C(2))-WN) ...... 183
Table B.34. Top n match score for second small dataset using different combination of content scoring (2) (QA-C(1)-C(2)-WN) ...... 184
Table B.35. Top n match score for second small dataset using different combination of reputation and content scoring (6) (QA-0.25NC-0.75C(1)) ...... 184
Table B.36. Top n match score for second small dataset using different combination of reputation and content scoring (7) (QA-0.25NC-0.75C(2)) ...... 184
Table B.37. Top n match score for second small dataset using different combination of reputation and content scoring (8) (QA-0.25NC-0.75(C(1)-C(2))) ...... 184
Table B.38. Top n match score for second small dataset using different combination of reputation and content scoring (9) (QA-0.25NC-0.75C(1)-WN) ...... 184
Table B.39. Top n match score for second small dataset using different combination of reputation and content scoring (10) (QA-0.25NC-0.75(C(1)-C(2))-WN) ...... 185
Table B.40. Top n match score for second small dataset using different combination of reputation and content scoring (11) (QA-0.75NC-0.25C(1)) ...... 185
Table B.41. Top n match score for second small dataset using different combination of reputation and content scoring (12) (QA-0.75NC-0.25C(2)) ...... 185
Table B.42. Top n match score for second small dataset using different combination of reputation and content scoring (13) (QA-0.75NC-0.25(C(1)-C(2))) ...... 185
Table B.43. Top n match score for second small dataset using different combination of reputation and content scoring (14) (QA-0.75NC-0.25C(1)-WN) ...... 185
Table B.44. Top n match score for second small dataset using different combination of reputation and content scoring (15) (QA-0.75NC-0.25(C(1)-C(2))-WN) ...... 186
Table B.45. Top n match score for big dataset using reputation scoring (QA-NC) ...... 199
Table B.46. Top n match score for big dataset using content scoring (1) (QA-C(1)) ...... 199
Table B.47. Top n match score for big dataset using content scoring (2) (QA-C(2)) ...... 199
Table B.48. Top n match score for big dataset using different combination of reputation and content scoring (1) (QA-0.5NC-0.5C(1)) ...... 199
Table B.49. Top n match score for big dataset using different combination of reputation and content scoring (2) (QA-0.5NC-0.5C(2)) ...... 199
Table B.50. Top n match score for big dataset using different combination of reputation and content scoring (3) (QA-0.5NC-0.5(C(1)-C(2))) ...... 200
Table B.51. Top n match score for big dataset using different combination of content scoring (1) (QA-C(1)-C(2)) ...... 200
Table B.52. Top n match score for big dataset using HITS (QA-HITS) ...... 200
Table B.53. Top n match score for big dataset using different combination of content scoring (2) (QA-C(1)-WN) ...... 200
Table B.54. Top n match score for big dataset using different combination of reputation and content scoring (4) (QA-0.5NC-0.5C(1)-WN) ...... 200
Table B.55. Top n match score for big dataset using different combination of reputation and content scoring (5) (QA-0.5NC-0.5(C(1)-C(2))-WN) ...... 201


Table B.56. Top n match score for big dataset using different combination of reputation and content scoring (6) (QA-C(1)-C(2)-WN) ...... 201
Table B.57. Top n match rate for big dataset using 0.25 for reputation weighting and 0.75 for AnswerBus weighting (QA-0.25NC-0.75C(1)) ...... 201
Table B.58. Top n match rate for big dataset using 0.25 for reputation weighting and 0.75 for question answer weighting (QA-0.25NC-0.75C(2)) ...... 201
Table B.59. Top n match rate for big dataset using 0.25 for reputation weighting and 0.75 for content weighting (1) (QA-0.25NC-0.75(C(1)-C(2))) ...... 201
Table B.60. Top n match rate for big dataset using 0.25 reputation weighting and 0.75 for AnswerBus with WordNet (QA-0.25NC-0.75C(1)-WN) ...... 202
Table B.61. Top n match rate for big dataset using 0.25 reputation weighting and 0.75 for content weighting (2) (QA-0.25NC-0.75(C(1)-C(2))-WN) ...... 202
Table B.62. Top n match rate for big dataset using 0.75 for reputation weighting and 0.25 for content weighting (1) (QA-0.75NC-0.25C(1)) ...... 202
Table B.63. Top n match rate for big dataset using 0.75 for reputation weighting and 0.25 for content weighting (2) (QA-0.75NC-0.25C(2)) ...... 202
Table B.64. Top n match rate for big dataset using 0.75 for reputation weighting and 0.25 for content weighting (3) (QA-0.75NC-0.25(C(1)-C(2))) ...... 202
Table B.65. Top n match rate for big dataset using 0.75 for reputation weighting and 0.25 for content weighting (4) (QA-0.75NC-0.25C(1)-WN) ...... 203
Table B.66. Top n match rate for big dataset using 0.75 for reputation weighting and 0.25 for content weighting (5) (QA-0.75NC-0.25(C(1)-C(2))-WN) ...... 203


List of Abbreviations

• QA – Question Answering
• cQA – Collaborative Question Answering
• SNA – Social Network Analysis
• SN – Social Network
• NLP – Natural Language Processing
• IR – Information Retrieval
• VSM – Vector Space Model
• NER – Named Entity Recognition
• POS – Part of Speech
• WSD – Word Sense Disambiguation
• SCC – Strongly Connected Component
• WCC – Weakly Connected Component
• SVM – Support Vector Machine
• TF-IDF – Term Frequency-Inverse Document Frequency


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: ______

Date: ______



Acknowledgments

I would like to express my gratitude to all those who gave me this chance to complete this thesis.

I would like to thank Dr Richi Nayak, my supervisor, for her guidance. Thank you for providing the opportunity to achieve this degree, and for your valuable suggestions and encouragement throughout my research. Thank you also for your tolerance of my delays in finishing this work and for bearing with the ambiguous sentences and poor English in my draft thesis.

I would like to give my parents Mr Xuan Xin Chen and Mrs Xiu Juan Qian big hugs for their endless and unconditional love. Thanks to them for supporting me in studying overseas in my earlier years and for their encouragement and nagging “Have you finished your thesis yet?”.

Thank you to my lovely husband, Gavin Shaw, for keeping me company during my late night studies, and for his discussions, suggestions and editing of my thesis. Thanks also for fixing up my programming. I am indebted to him for taking on the housework during the writing up of my thesis.

Lastly, I’d like to thank John King for his support in giving advice on dataset collection.



Chapter 1: Introduction

1.1 BACKGROUND

With advancements in the fields of Information Science and Computer Science, online social networks have gained popularity in the last decade. For example, Friendster attracted over 5 million registered users in the span of only a few months; MySpace (http://www.myspace.com) has over 250 million users (Jeyes, 2009); YouTube (http://www.youtube.com) serves over 6 billion video views daily (YouTube, 2009). Online social networks, unlike the Web, are organized around users. To join an online social network, users create a profile, publish content and build links to anyone they want to associate with. As a result, online social networks have become sites for maintaining social relationships, for finding users with similar interests and for locating content and knowledge that has been contributed or endorsed by other users (Mislove, 2007). The great advantages that online social networks bring to users include: (1) low or no cost to access the contained resources, with the Web-based nature of these networks meaning data is publicly available; (2) the sharing of knowledge and experiences; (3) provision of up-to-date information; and (4) ubiquity. Because of their value and advantages, the study of these social network websites is of great importance to modern society.

A collaborative Question Answering (cQA) portal such as Yahoo! Answers (http://answers.yahoo.com) is an example of an online collaborative social network. The main purpose of a collaborative social network is to share the knowledge that users possess. Yahoo! Answers allows users both to submit questions to be answered and to respond by providing answers to questions asked by other users. PC World acclaims Yahoo! Answers as one of the best examples of community participation on the Web (PC World celebrates Yahoo! Answers, 2008). Launched on December 13, 2005, it attracted 0.7% of the Internet audience within a month and surged to 4.2% of all Web visitors in one year. Within just six months it had become the third most popular website in the education category, after Wikipedia and Dictionary.com (Sullivan, 2006). Users of Yahoo! Answers gain award points through: (1) logging into the system; (2) providing answers to submitted questions; (3) voting for which answer they believe is the best for a submitted question; and (4) being the user who provided the best answer to a question. To determine the best answer for a question, either the question author chooses the best answer or the community of users votes for the best answer.

There are shortcomings to Yahoo! Answers. The first problem is the poor quality of some of the submitted answers. Under the current points system, users are encouraged to answer as many questions as they can, regardless of the quality of the answers they provide. The second problem is information overload. Each day, thousands of new questions are submitted to the system (Linden, 2006). Each question can be answered by many users, and it can become overwhelming for the question author to read all the provided answers. Answers are not ranked based on quality but are listed in the order they were submitted. Yahoo! Answers allows the question author to choose the best answer, or other users (who did not provide an answer to the question) to vote for the best answer. This results in a lot of manual work. Users who wish to choose or vote for the best answer (to gain the point awarded for such action) must read all the answers provided before they decide which answer is best.

Furthermore, the answer chosen by users as the best answer is not necessarily the highest quality answer. The decision of an asker is influenced by subjective factors such as relationships between users, the asker's own point of view, or a lack of knowledge on the subject. The asker may choose an answer as the best for many reasons, for example because it best satisfies the asker subjectively, or because it was provided by an author the asker is more familiar with. Preference may be given to an answer author whose opinion most closely matches that of the asker, even if other answers are equally correct. Therefore, an automatic best answer recommendation system may improve this situation, as it will choose the best answer objectively, rather than simply selecting 'what the asker wants to hear'.

This thesis attempts to provide an automatic process for recommending the best answer for a posted question in a collaborative Question Answering network such as Yahoo! Answers. Having an automatic process for ranking the submitted answers will save the network users a lot of time and effort. In order to provide this solution, this thesis addresses the following questions.

1. How can the quality of a user's answer in a cQA be decided: by using the reputation of the answer author, by using the content of the answer, or by using both?
2. How does Yahoo! Answers fare as a social network or cQA? Can its features be used for selecting the best answer?
3. Can the cQA questions and answers be analysed with existing Natural Language Processing or Information Retrieval techniques?
4. What evaluation criteria can be used for the proposed approach of ranking users' answers in a cQA and recommending the best answer, given that no standard evaluation criteria exist?

In summary, the task of this research is to design and develop an effective and efficient automatic approach to evaluate the quality of supplied answers, to propose the best answer to the question, and to present a ranked list of all the answers for a question according to their quality.

1.2 RELATED WORKS

Several researchers have attempted to solve the problem of evaluating answer quality in relation to both online question answering (QA) portals and collaborative social networks (CSN). Previous research can be classified into three types of approaches based on their features: (1) content-based methods using a combination of Natural Language Processing (NLP) and Information Retrieval (IR) techniques; (2) link analysis methods; and (3) statistical methods.

The content-based approach has been used to solve QA problems for several decades. In the 1960s, Baseball was the first QA system; it answered questions about the US baseball league over the period of one year. LUNAR, the second QA system, answered questions regarding the analysis of rocks returned by the Apollo moon missions. Both of these systems were closed-domain QA systems with the data being kept in a core database. Their performance proved excellent in terms of the accuracy achieved; the only flaw was that these early closed-domain QA systems were limited to a certain area or topic. Over time, many open-domain QA systems have been developed that allow questions on a wide range of topics. Such systems include START, AnswerBus, BrainBoost, Ephyra and QuALim, to name a few. With the increased popularity of QA systems, TREC (http://trec.nist.gov/) started the QA track in 1999. The questions studied in the track, however, are limited to simple types (factoid, definition and list). A factoid question asks about a simple fact or relationship, with the answer easily being expressed in just a few words, often a noun phrase (Prager, 2006).

Current QA systems that are capable of evaluating answer quality do so based on the quality of the content contained within the answer. In the simplest process of answer evaluation, a user posts a question into a QA system. The QA system then analyses the question and finds one or more answer candidates from its input resources. Once the answer candidates are retrieved, the QA system evaluates the content of each one and scores them based on the quality of their content. The quality of the content may also be evaluated on how well the answer relates to the question and/or the relevance of the question itself. The highest scoring candidate is then returned as the answer. Many current QA systems are for closed domains, that is, specific topic areas such as medical topics, or for limited types of questions only, such as descriptive questions. The problem with current QA systems is that they suffer from low recall. The answer to a question is also limited to predefined categories (Kim et al., 2001).

A second solution to overcome these problems is the link analysis-based approach. Since the introduction of link analysis methods to determine a website's ranking or popularity (Kleinberg, 1999; Page, 1998), link analysis has been widely applied to the social network area as well. Examples include evaluating the popularity of different topics in a website (Nie et al., 2006); predicting the user's needs based on link analysis (Eirinaki, 2007); an academic paper recommendation system which evaluates the authority of the writer (Shimbo et al., 2007); and computing Web-user relevance to a given topic (Wang et al., 2002). The popular link analysis algorithm HITS has been applied to estimating the authority of an answer author (Jurczyk & Agichtein, 2007). That work is the most relevant to this research; it claims to achieve more than 60% accuracy in estimating the authority of answer authors in Yahoo! Answers for the top 30 users of the system. A further study still needs to be conducted to find out whether link analysis is suitable for Yahoo! Answers, because HITS (Kleinberg, 1999) is based on the assumption that a good hub points to many good authorities; in other words, that a good question is answered by many users. However, this is not always true in the case of Yahoo! Answers. Therefore, there exists the need to develop a mechanism that utilizes the authority of answer authors to determine the ranked list of answers.
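The following is a minimal, illustrative sketch (not the implementation used in this thesis or by Jurczyk & Agichtein) of the HITS hub/authority iteration applied to a directed asker-to-answerer graph. The edge list, user names and fixed iteration count are invented assumptions for demonstration only.

# Minimal sketch of the HITS hub/authority iteration (Kleinberg, 1999) on a
# question-answering graph. Edges point from askers (hubs) to answerers (authorities).
# The toy edge list below is made up, not real Yahoo! Answers data.
from collections import defaultdict
import math

edges = [("asker1", "answerer1"), ("asker1", "answerer2"),
         ("asker2", "answerer1"), ("asker3", "answerer1")]

nodes = {n for e in edges for n in e}
out_links = defaultdict(list)   # asker -> answerers
in_links = defaultdict(list)    # answerer -> askers
for src, dst in edges:
    out_links[src].append(dst)
    in_links[dst].append(src)

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(50):  # fixed iteration count; a convergence test could be used instead
    # authority score: sum of the hub scores of nodes pointing to it
    auth = {n: sum(hub[src] for src in in_links[n]) for n in nodes}
    # hub score: sum of the authority scores of nodes it points to
    hub = {n: sum(auth[dst] for dst in out_links[n]) for n in nodes}
    # normalise to unit length so the scores do not grow without bound
    a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {n: v / a_norm for n, v in auth.items()}
    hub = {n: v / h_norm for n, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # answerers ranked by authority

Under the HITS assumption, answerer1 receives the highest authority score simply because it is answered to from the most askers; the sketch illustrates why this can reward participation rather than answer quality.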

The third approach is the statistical approach, which utilizes the features of specific QA portals or CSNs. This approach is promising but still emerging, and there has not been a lot of work done on QA systems using it. Some work has been done using specific QA portals and CSNs (Zhang, 2007; Jeon, 2006; Guo, 2008; Jeon et al., 2005; Harabagiu, 2006).

1.3 RESEARCH OBJECTIVES

Yahoo! Answers is "the next generation of search … it is a kind of collective brain – a searchable database of everything everyone knows" (Noguchi, 2006). The automatic recommendation of the best answer to the question author will benefit users through ease of use, reduced effort in finding answers and a faster process for finding the best answer. This research aims to find a method to recommend the best answer automatically by utilizing: (1) a content-based approach using traditional natural language processing and information retrieval techniques to evaluate the answer's quality; and (2) a non-content-based approach using a reputation-based technique to evaluate the answer authors' reputation or expertise.

The content-based approach aims to evaluate the quality of an answer using natural language processing and information retrieval methods. Many QA systems currently exist that can be used as an expert against which the answers provided by users can be compared. Experiments evaluating existing QA systems are conducted so that the best QA system can be selected; the answers returned by the best performing QA system are then compared against the answers provided by answer authors in Yahoo! Answers using IR techniques. Normally, a keyword-based method is used for finding relevant information; however, the answers in Yahoo! Answers are expressed in natural language, so a keyword-based approach may not work well. Therefore, WordNet is used in this research to allow semantically related terms to be matched.
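As an illustration only, the sketch below shows one simple way such a comparison could be set up: bag-of-words cosine similarity between a user answer and an expert answer, with the expert answer's keywords expanded through WordNet synonyms. It is not the thesis implementation; the example sentences are invented, and it assumes NLTK is installed with the WordNet corpus downloaded.

# Illustrative sketch: score a user answer against an "expert" answer using
# bag-of-words cosine similarity, with WordNet synonym expansion of the expert
# answer so that semantically related terms can still match.
# Assumes NLTK is installed and nltk.download("wordnet") has been run.
import math
import re
from collections import Counter
from nltk.corpus import wordnet as wn

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def expand_with_wordnet(tokens):
    """Add WordNet synonyms of each token to the bag of words."""
    expanded = list(tokens)
    for tok in tokens:
        for synset in wn.synsets(tok):
            expanded.extend(l.lower() for l in synset.lemma_names())
    return expanded

def cosine(bag_a, bag_b):
    a, b = Counter(bag_a), Counter(bag_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

expert_answer = "The capital of Australia is Canberra, not Sydney."   # invented example
user_answer = "Canberra is Australia's capital city."                 # invented example

score = cosine(expand_with_wordnet(tokenize(expert_answer)), tokenize(user_answer))
print(round(score, 3))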

The non-content approach aims to evaluate and utilize the reputation of an answer author. The underlying belief is that a user with a good reputation gives high quality answers and one with a bad reputation gives low quality answers. The mechanisms of Yahoo! Answers are reviewed and a comprehensive analysis is conducted. The popular link analysis method HITS has been widely applied to evaluating the quality of Web pages or the authority of users; this thesis examines whether HITS is suitable for measuring users' reputation in online collaborative QA portals such as Yahoo! Answers. Based on the findings of the analysis of Yahoo! Answers, a method utilizing the pattern of answer distribution and best answer distribution is proposed. Finally, in order to validate the proposed methods, extensive experiments are conducted.

1.4 RESEARCH CONTRIBUTION

This thesis has developed an approach to automatically recommend the best answer, along with a ranked list of answers, for a posted question in an online collaborative QA portal such as Yahoo! Answers. In particular, the contributions of this research are:

• A comprehensive analysis of a collaborative QA portal. The analysis includes an examination of the social network structure, shortcomings and organisational structure of Yahoo! Answers.

• A non-content (statistical) based approach is proposed and developed for evaluating a user's authority based on the concept of the reputation of the answer author.


• A content-based approach on Yahoo! Answers is developed to evaluate the quality of answers using IR and NLP techniques which, to the best of our knowledge, have not been applied in this setting before.

• A combination of the reputation and content methods is proposed, suitable as a possible plug-in for Yahoo! Answers, allowing the recommendation of the best answer to the question author. It is hoped that it will reduce the amount of time that a user (who posted a question) spends on reading all of the answers. Therefore, the proposed method will improve the efficiency of using the Yahoo! Answers portal to ask questions.

1.5 THESIS ORGANISATION

In Chapter 2, a review of relevant literature is conducted. The review focuses on the social network aspect of Yahoo! Answers. Possible social network analysis methods which may be applied to Yahoo! Answers are then discussed. Lastly, the three most common types of approaches studied in the literature are reviewed, including a discussion of the advantages and disadvantages of each approach. The proposed research utilizes the literature reviewed in this chapter to avoid the shortcomings of existing methods.

In Chapter 3, a comprehensive analysis of Yahoo! Answers is conducted in order to propose the most suitable method to solve the research problem. The analysis includes an overview of how Yahoo! Answers works, a Bow Tie analysis (In, Out, Core, Tendril structure), and an explanation of why HITS is not suitable for Yahoo! Answers. This chapter is important as it provides the preliminary critique of Yahoo! Answers and outlines why specific approaches to research are employed in this thesis.

In Chapter 4, the proposed methods, namely the reputation-based and content-based methods, are detailed. The reasons why the original degree centrality measure is unsuitable for evaluating a user's reputation are also explained. The content-based method is then described, with a focus on why some steps are utilized and some are not, and why certain QA tools are selected.

In Chapter 5, details about the experimental setup, data format, statistics of the dataset and experimental results are presented. The results give a clear illustration of the performance of the content-based method, reputation-based method and the combined reputation-based and content method. This chapter also includes the comparative performance of the proposed methods and baseline methods such as HITS.


In the last chapter (Chapter 6), conclusions from the research are drawn and recommendations are made regarding possible future research and development activities.

1.6 PUBLISHED PAPER

A paper based on the reputation method proposed in this thesis has been published.

Chen, Lin and Nayak, Richi. (2008). Expertise Analysis in a Question Answer Portal for Author Ranking. Paper presented at 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08), Sydney, Australia, pp.134-140.



Chapter 2: Background & Literature Review

The purpose of this chapter is to briefly present a social network - the online collaborative QA portal - and the associated techniques to determine the characteristics of this type of social network. This chapter starts by introducing Yahoo! Answers as an online social network. The following section discusses the social network analysis methods that can be used to understand the behaviour of users in Yahoo! Answers. Following that, the third section discusses the approaches that can be used to analyse the quality of answers provided by the users who participate in Yahoo! Answers.

2.1 ONLINE SOCIAL NETWORK

A social network is a social structure of people who are related directly or indirectly to each other through a common relation or interest. Two fundamental elements in a social network are the people, as actors, and their connections, the ties. Different types of relations identify different networks, even when observations are restricted to the same set of actors (Knoke, 2008).

With the advent of the Web, online social networks have become popular. Online social networks are online communities of people who share interests and/or activities, or those who are interested in exploring the interests and activities of others. Online social networks can be of many types depending on the way people communicate with each other or interact amongst themselves. According to the functionality of online social networks, they can be divided into: communication social networks, collaborative social networks, multimedia social networks and entertainment social networks (Arrington, 2006).

Examples of the communication type of social networks include blogs, Internet forums, MySpace and Facebook. Users in this type of social network set up connections through conversations and discussions. In MySpace and Facebook, users are allowed to have a profile and communicate with the friends they have added to their friendship list. Collaborative types of social networks include Wikipedia, Yahoo! Answers and Epinions. The main purpose of a collaborative social network is to share the knowledge users possess and to exchange information and opinions. Multimedia social networks include Flickr and YouTube. These types of social networks allow users to share their photos, videos, music, and so on. Examples of entertainment social networks include Second Life and The Sims. The users of this type of social network set up their relations with other users through their virtual world characters.

Yahoo! Answers is a typical collaborative social network. Yahoo! research has claimed that it is "the next generation of search … it is a kind of collective brain – a searchable database of everything everyone knows. It is a culture of generosity. The fundamental belief is that everyone knows something" (Noguchi, 2006, p.1). Yahoo! Answers is a community-driven knowledge market website. Users communicate and exchange information through the asking and answering of questions. It is unlike other online QA portals such as Ask.com (http://www.ask.com/) that are not collaborative; the knowledge in those portals comes from Web databases. Such portals act as question search engines, and answers are returned by finding the relevant information from the Web.

The characteristics and features of an online social network can be observed when the graph structure of the social network is modelled and analysed. An in-depth understanding of the graph structure of online social networks is necessary in order to evaluate current online social networks and to understand the impact online social networks have on the Internet (Mislove et al., 2007). Applying graph structure theory to Yahoo! Answers might result in the detection of flaws in existing methods and help propose suitable methods for Yahoo! Answers (for example, the study of the Web structure led to the discovery of algorithms for finding the sources of authority in the Web). Graph structure theory is able to identify the power law, small-world networks and scale-free networks in online social networks (Mislove et al., 2007; Kumar et al., 1999; Adamic et al., 2003) and can help show the users' distribution in Yahoo! Answers. Graph structure theory is also able to highlight the Strongly Connected Component (SCC) and Weakly Connected Component (WCC), which give an insight into how well the users in a social network are connected and how easy it is for a user to get help from other users.

A power law network is one where the probability that a node has degree k is proportional to k^(-γ), for large k and γ > 1. The parameter γ is known as the power law coefficient (Mislove et al., 2007). Many online social networks follow a power law. Examples include the Web (Kumar et al., 1999), and Flickr, LiveJournal and YouTube (Mislove et al., 2007); to our knowledge, no such study has been done on Yahoo! Answers. In the case of the Web, the power law means that a few websites are quite popular, while the majority of websites receive only a few visits from Web users.
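For illustration, the sketch below estimates the power law coefficient γ from a list of node degrees by fitting a straight line to the log-log degree histogram. The degree list is synthetic and the simple least-squares fit is an assumption for demonstration; a real analysis of the Yahoo! Answers graph would use the actual degree data and a more robust estimator (e.g. maximum likelihood).

# Illustrative sketch (synthetic data): estimate gamma in P(k) ~ k^(-gamma)
# from a log-log least-squares fit of the degree histogram.
import math
from collections import Counter

degrees = [1]*500 + [2]*120 + [3]*50 + [4]*25 + [5]*15 + [8]*6 + [13]*3 + [21]*1

hist = Counter(degrees)                      # degree k -> number of nodes with degree k
xs = [math.log(k) for k in hist]
ys = [math.log(c) for c in hist.values()]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)

gamma = -slope                               # log-log slope of P(k) ~ k^(-gamma) is -gamma
print(f"estimated power law coefficient gamma = {gamma:.2f}")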


Small-world networks are networks that have a small diameter and feature a large number of clusters. They are a type of graph in which most nodes are not neighbours of one another, but most nodes can be reached from every other node through a small number of hops or steps. The online social networks studied in Adamic et al. (2003) exhibit the characteristics of a small-world network along with local clustering.

A scale-free network is one where the degree distribution of nodes follows a power law, at least asymptotically. It is a class of power law network in which high-degree nodes have a tendency to be connected to other high-degree nodes (Mislove et al., 2007). Examples of scale-free networks include the Web and protein networks (Scale-free network, 2009).

Strongly connected components and weakly connected components are two structures often found in the Web graph. A study of the Web structure shows that the Web has a "bow tie" shape and consists of a single large strongly connected component. The strongly connected component can be reached from other groups of nodes, and it also easily reaches other groups of nodes. A weakly connected component is a part of a digraph in which every node is reachable from other groups of nodes; however, not necessarily every node in the weakly connected component can reach other groups of nodes (Tarjan, 1972).
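As a small illustration (not part of the thesis), the sketch below extracts the strongly and weakly connected components of a toy directed asker-to-answerer graph using the networkx library, whose component routines build on Tarjan-style algorithms. The edge list and user names are invented.

# Illustrative sketch: SCCs and WCCs of a directed asker -> answerer graph
# using networkx. The toy edge list is made up, not Yahoo! Answers data.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("userA", "userB"), ("userB", "userA"),   # mutual Q&A: forms an SCC of size 2
    ("userC", "userA"),                        # userC only asks, never answers
    ("userD", "userE"),                        # a separate weakly connected component
])

sccs = list(nx.strongly_connected_components(G))
wccs = list(nx.weakly_connected_components(G))

print("SCCs:", sccs)   # {'userA', 'userB'} plus singleton components
print("WCCs:", wccs)   # two components: {userA, userB, userC} and {userD, userE}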

In general, Yahoo! Answers is a collaborative online social network in which a question can be posted online and users answer the question to share their knowledge and expertise. Communication in this social network takes the form of the questions and answers posted on the portal. As an online social network, Yahoo! Answers may also feature the structural properties mentioned previously. It is hoped that the analysis of social network structure in Yahoo! Answers will help the researcher to understand it better. The findings of this analysis are reported in Chapter 3 of this thesis.

2.2 SOCIAL NETWORK ANALYSIS METHODS FOR YAHOO! ANSWERS

Social Network Analysis (SNA) is the study of social entities (people in an organisation) and their interactions and relationships (Wasserman & Faust, 1994). The purpose of SNA is to understand the structure, behaviour and composition of social networks and thus improve the social network and the social relations contained within. SNA has been applied in many situations, from product marketing to search engines and organizational dynamics (Domingos & Richardson, 2001; Kim & Srivastava, 2007). SNA has been used to discover how a rumour spreads and what social structures exist among people (Wasserman & Faust, 1994). By analysing a company's email, SNA helps the manager to find the hidden leader who may not necessarily have a high official position, or much responsibility, but plays an important role in the company (Song et al., 2005; Cai et al., 2005; Matsuo et al., 2006; Kautz et al., 1997). In the same context, studies have analysed "who knows what knowledge", "who thinks who knows who" and "who believes in what" (Srivastava et al., 2006; Pathak et al., 2007). The Google search engine provides a classic example of social network analysis: the famous PageRank algorithm is based on the concept that linked pages are related to each other to a certain extent (Brin & Page, 1998).

Along the same lines, it is expected that an in-depth study of Social Network Analysis will benefit the study of Yahoo! Answers. SNA methods will help in understanding which methods should be applied in each situation and in developing the adaptive social network analysis methods needed in the case of Yahoo! Answers.

There are three important concepts in SNA: actors, relations and ties. Actors refer to people or organizations (Wasserman & Faust, 1994). Relations are characterized by content, direction and strength (Garton et al., 1997). A tie connects a pair of actors by one or more relations and has the features of content, direction and strength. An actor in Yahoo! Answers is a person who logs into Yahoo! Answers as a user and asks and/or answers questions. The content of a relation in Yahoo! Answers is the exchange of information by asking and/or answering questions. The direction of a relation in Yahoo! Answers cannot be bi-directional, because the Yahoo! Answers portal does not allow a user to both ask and answer the same question; a relation between two users therefore runs in one direction, where one user asks a question and the other answers it. The strength of a relation in Yahoo! Answers can be identified by the frequency of two users' exchange of information. Two actors in Yahoo! Answers may have multiple ties, in which they are connected by relations of asking and answering several different questions.

There are several traditional SNA methods that can be applied to the Yahoo! Answers portal. These methods fall into two groups: representations of networks, for modelling the structure of social networks, and analysis of networks, for understanding the structure of and communication within social networks. This section is limited to the methods that can be applied to the Yahoo! Answers portal.

Graph theory is one of the basic methods used to represent social networks. A graph is a two-dimensional diagram in which the points represent actors and lines represent the relations

between them. A graph is a non-directed graph if the connecting lines have no directions (arrowheads), otherwise, it is a directed graph. Matrices are also used to represent the characteristics of social networks. The numerical value in the matrix cell measures the strength of a relation between a pair of actors. The value can be binary or non-binary. Both the graph and matrix representations have advantages and disadvantages when used in social network analysis. Graphs provide a more forceful visual illustration of network structures but do not support mathematical manipulations. In contrast, matrices are less user-friendly, but they facilitate sophisticated mathematical and computer analyses of social network data.
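
To make the contrast between the two representations concrete, the following minimal sketch (the user names and ask–answer edges are invented for illustration) builds a small directed graph and its binary adjacency matrix.

```python
import numpy as np

# Hypothetical actors in a small ask-answer network (illustrative only).
users = ["alice", "bob", "carol", "dave"]
index = {u: i for i, u in enumerate(users)}

# A directed edge (asker, answerer) is recorded when the answerer
# replies to a question posted by the asker.
edges = [("alice", "bob"), ("alice", "carol"), ("dave", "bob"), ("carol", "bob")]

# Binary adjacency matrix: cell (i, j) = 1 if user j answered user i.
A = np.zeros((len(users), len(users)), dtype=int)
for asker, answerer in edges:
    A[index[asker], index[answerer]] = 1

print(A)
# A non-binary variant could instead store the number of exchanged
# answers, reflecting the strength of the relation.
```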

Once the graph and matrices theories are utilized for modelling a social network, analysis methods can be applied to understand the structure, communications and evolution of the network. Previous researchers have studied the knowledge of the Web structure to devise better crawling strategies, by performing clustering and classification, to improve both browsing and the performance of search engines (Donato, 2005). In the context of Yahoo! Answers, analysis of the structure is necessary to identify the prestige of users. Centrality and prestige measures are common methods to identify an actor’s prominence within a complete network by summarizing the structural relations amongst all the nodes (actors).

Centrality attempts to identify those actors in a network that appear to be highly connected. This is often referred to as "directed" centrality, which looks at the number of direct incoming and outgoing links (Gotta, 2008). Prestige applies to a prominent actor who is the object of extensive ties, thus focusing solely on the actor as a recipient (Wasserman & Faust, 1994). Centrality can be measured by degree, closeness and betweenness (Freeman, 1977). These measures vary in their applicability to non-directed and directed relations and differ in granularity, whether at the individual actor level, the group level or the complete network level.

Actor degree centrality measures the extent to which a node connects to all other nodes in a social network. The central actors must be the most active in the sense that they have the most ties to other actors in the network or graph (Wasserman & Faust, 1994). It counts the actor's ties to the g − 1 other actors (where g is the network or group size and the "− 1" excludes the feedback tie to the actor him or herself) and then divides this count by g − 1 to eliminate the effect of variation in network size on degree centrality.
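
As a rough illustration of the normalisation just described, the sketch below computes each actor's degree centrality as the number of ties divided by g − 1; the toy network is invented for illustration.

```python
def degree_centrality(adjacency):
    """Normalised actor degree centrality for an undirected binary network.

    adjacency: square list-of-lists where cell [i][j] is 1 if actors i and j
    share a tie. Each actor's degree is divided by g - 1 so that networks of
    different sizes can be compared.
    """
    g = len(adjacency)
    return [sum(row) / (g - 1) for row in adjacency]

# Toy 4-actor network (illustrative only).
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(degree_centrality(A))  # the first actor is the most central
```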

Group degree centralization measures the extent to which the actors in a social network differ from one another in their individual degree centralities. The value of group degree centralization varies from 0 to 1. A value of 0 indicates every node has the same degree centrality, which means

that every individual in a group is as important as each other individual in the group. A value of 1 indicates that the distribution of the group degree is quite uneven or hierarchical.

Closeness centrality reflects the closeness of one actor to the other actors in a social network. The closeness centrality is a function of the node’s geodesic distance to all other nodes. The idea is that an actor is central if it can quickly interact with all others. In the context of a communication relation, such actors do not need to rely on other actors for the relaying of information to their destination (Wasserman & Faust, 1994).

Betweenness centrality measures the extent to which an actor lies on the geodesic paths, that is, the shortest paths between pairs of other actors in the network. It is an important indicator of the control over information exchange or resource flows within a network.
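
For anything beyond toy networks, the geodesic-based measures are easier to obtain from an existing library than to code by hand; the sketch below uses the networkx package on an invented directed ask–answer graph.

```python
import networkx as nx

# Hypothetical directed ask-answer network (edges point from asker to answerer).
G = nx.DiGraph([("u1", "u2"), ("u1", "u3"), ("u3", "u2"), ("u4", "u2"), ("u2", "u5")])

# Closeness: how quickly a node can reach (or be reached by) the others
# along geodesics; betweenness: how often a node lies on shortest paths.
print(nx.closeness_centrality(G))
print(nx.betweenness_centrality(G))
```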

The prestige is the extent to which a social actor in a network “receives” or “serves as the object” of relations sent by others in the network. The prestige can be measured by counting the number of directed ties that an actor receives from other network actors for a specified relation.

2.3 APPROACHES TO IDENTIFY ANSWER QUALITY

This section reviews the approaches for analysing the quality of answers that can be applied to Yahoo! Answers and identifies the advantages and disadvantages of these approaches. Approaches to decide the quality of an answer can be classified into three types based on: (1) Information Retrieval and Natural Language Processing techniques; (2) link analysis, which includes HITS and PageRank types of analysis; and (3) statistical analysis.

2.3.1 Content Based Approach – Natural Language Processing and Information Retrieval

A content-based approach is a method that finds the answer to a question, selected from a database or the Web, based on the meaning of the question. Natural Language Processing and Information Retrieval are the two major techniques used for content-based approaches. TREC (http://trec.nist.gov/) is one of the best-known conferences in this area and has held a QA track since 1999. In the early stages of the QA track in TREC, the experimental competition was set up to solve simple question types, such as the definition type. The latest version of the competition tries to solve more difficult question types such as the descriptive type. All of the proposed methods in TREC are a combination of Information Retrieval and Natural Language Processing methods.


The reasons for using a combination of these two techniques are as follows. Traditional IR can track down the relevant documents by using key words. However, questions which ask for more specific information cannot be solved by IR alone. For example, consider the question "Who is the Australian Prime Minister?" and two answers. Answer A: "Australian Prime Minister Kevin Rudd visited America …"; Answer B: "The Australian Prime Minister is the head of government of the Commonwealth of Australia." Both answers contain the key words "Australian Prime Minister", but IR techniques cannot differentiate the two answers and decide which is the better answer. A question normally starts with a wh-word (when, where, what, why, who, which, and how). These words are treated as stop-words in IR processing and are removed during pre-processing, so the retrieved answer may never be the right one. To find answers to natural language questions, considerable NLP would be required. NLP, as with most symbolic activities, tends to be very computationally expensive, making the processing of a million documents in response to a question out of bounds. Good IR systems, on the other hand, are particularly adept at finding those few documents out of huge corpora that have the best participation of query terms. Thus, the most common approach is to use IR to narrow down the search to a relatively small number of documents and then process the remaining documents using NLP techniques to extract and rank the answers (Prager, 2006).

2.3.1.1 Information Retrieval

Information retrieval is closely related to QA systems. Information retrieval aims to find the documents that are relevant to the user query, whereas the QA task requires the system to provide the exact answer to the question. QA is more difficult than information retrieval in that it must return more specific information. However, the quality of a QA system depends on the effectiveness of the information retrieval it uses: returning highly relevant documents results in a higher accuracy of answers from the system. Therefore, it is necessary to discuss the IR techniques that QA systems employ.

The simplest approach to modelling IR is known as the Vector Space Model (VSM). The VSM computes measures of similarity by constructing a series of vectors to represent each document in a collection, and a vector to represent the query. The idea behind this model is that in a rough sense, the meaning of a document or query can be conveyed by the words used within it. If the words within the document or query can be represented within a vector, it then becomes possible to compare the documents against the queries to determine the similarity of their content. If a query is considered to be the same as a document, a similarity coefficient (SC) that measures the similarity between a given document and query can then be computed. Those documents whose

content (the terms present in the vector that represents the document) corresponds most closely to the terms in the query vector (which represents the content of the query) are considered to be the most relevant ones (Grossman & Frieder, 2004). An example of how to use information retrieval techniques in QA systems appears in the work of Feng (2006), which aims to match queries and answers in threaded discussions. Suppose there are n passages in the corpus, denoted $\{p_1, p_2, \ldots, p_n\}$, and a student query denoted q. Now say there are m different unique words $\{w_1, w_2, \ldots, w_m\}$ across all of the documents and posts.

Now let the number of occurrences of word $w_j$ in passage $p_i$ be $tf_{ij}$ and the number of passages in which word $w_j$ is present be $c_j$. Thus, the query and each passage in the corpus can be represented by a vector in the following format:

$q = \langle w_{q1}, w_{q2}, \ldots, w_{qm} \rangle$   Eq. 2.1

$p_i = \langle w_{p_i 1}, w_{p_i 2}, \ldots, w_{p_i m} \rangle$   Eq. 2.2

where m is the total number of words in the document and $w_{ij} = 0$ if a word is missing in that passage. After normalization in the vectors, each weight can be computed by

$w_{ij} = \dfrac{tf_{ij}\,\log(n / c_j)}{\sqrt{\sum_{j=1}^{m} (tf_{ij})^2 \, [\log(n / c_j)]^2}}$   Eq. 2.3

When a question is posted, the relevant passages can be retrieved by calculating the cosine similarities equation as below.

$\cos\_sim(q, p_i) = \sum_{j=1}^{m} w_{qj} \cdot w_{p_i j}$   Eq. 2.4
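
A minimal sketch of Eqs. 2.3 and 2.4, assuming simple whitespace tokenisation and an invented three-passage corpus: each passage and the query are weighted with the normalised tf-idf scheme above and the passages are then ranked by cosine similarity.

```python
import math
from collections import Counter

passages = [
    "the stock market opened higher today",
    "harry potter is a book about a young wizard",
    "investing in the stock market carries risk",
]
query = "is the stock market risky"

def tokenize(text):
    # Whitespace tokenisation only; no stemming or stop-word removal.
    return text.lower().split()

docs = [tokenize(p) for p in passages]
n = len(docs)
vocab = sorted({w for d in docs for w in d})
# c_j: number of passages containing word w_j.
c = {w: sum(1 for d in docs if w in d) for w in vocab}

def weight_vector(tokens):
    """Normalised tf-idf vector over the passage vocabulary (Eq. 2.3)."""
    tf = Counter(tokens)
    raw = [tf[w] * math.log(n / c[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

doc_vecs = [weight_vector(d) for d in docs]
q_vec = weight_vector(tokenize(query))   # query words outside the vocabulary are ignored

# Eq. 2.4: cosine similarity is the dot product of the normalised vectors.
scores = [sum(qw * dw for qw, dw in zip(q_vec, vec)) for vec in doc_vecs]
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```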

The first important point when using information retrieval for QA is to identify the keywords in the question sentence. Kangavari (2008) suggests six steps to select keywords in a question sentence: (1) all words which are in 'quotations' and "double quotations"; (2) all words that are names; (3) all words that are adverbs (time, location, status); (4) all words that are a main verb or modal verb; (5) all words that are a subject; and (6) all words that are an object. Moldovan et al. (2002 & 2003) explain the importance of the idea of keyword expansion. They give a detailed failure analysis of a QA system, showing that 37.8% of the errors are due to the retrieval module. There are several smaller modules under the retrieval module. In their experimental findings the

keyword selection contributes to 8.9% of the errors, keyword expansion contributes to 25.7% of the errors, and 1.6% of the errors are due to the actual retrieval and passage post filtering. The reason for the impact of keyword expansion is the choice of module for retrieval. They chose to use VSM instead of a Boolean model as the Boolean retrieval model is much more sensitive to query formulation. Also, stemming plays an important role in the errors of their retrieval model. Therefore, the choice of stemming and the retrieval method are directly related to the performance of the retrieval module.

Another point that needs extra attention when using IR for QA is the retrieval unit: whether passage-based retrieval or full-document retrieval is used as the information retrieval strategy. Roberts (2002) reports a slight increase of 2.8% in documents that contain an answer when using two-paragraph passages instead of full-document retrieval. Clarke and Terra (2003) compare their version of passage retrieval to full-document retrieval. They found that full-document retrieval returns more documents that contain a correct answer, but that passage-based retrieval is still useful since it makes the process of identifying an actual answer easier.

Greenwood (2006) gives a detailed process on how information retrieval is used. The question and target (most important points in question or the emphasis of the question) are first combined to form a single IR query and then the query is used to retrieve the twenty most relevant passages (based on experiments) from the AQUAINT collection. While retrieval of more text means greater coverage, there comes a point at which the larger volume of text actually inhibits the ability of answer extraction components in extracting the correct answer.

There are benefits of using information retrieval as the main strategy for a QA system. Firstly, it is simple to build up the QA system when using information retrieval. The assumption is that the more words that match with the question, the higher the relevance of the passage or sentences. Not a lot of pre-processing steps are needed. The required pre-processing steps are the common ones like stop-word removal and stemming. After the answers are returned from the system, the cosine similarity formula can be used to select the answer. Secondly, it does not require lots of computation time. If the system uses an existing online-based search engine like Google, then the computation time is reduced as it only needs to calculate the similarity between the question and answer candidates and undertake some pre-and post-processing.

Although information retrieval is the simplest method when implementing a QA system, there are clear limitations with using term-based approaches. The answers to questions normally lie in a sentence (or several sentences) located in different parts of a passage. One of the problems in

using an IR-based system is the extraction of the right answer. It is hard for an IR system to tell apart the different meanings of the same term, as it cannot determine which part of speech the term is. Natural Language Processing, on the other hand, makes use of linguistic information to better analyse a term in the relevant documents.

2.3.1.2 Natural Language Processing

The main purpose of using Information Retrieval techniques in a QA system is to search for and discover relevant information that should be included in the answer provided. The information that is returned from the search may vary from only one result, to a few results, to many more. However, how can a QA system "understand" or comprehend the results it has retrieved and then determine which of them best meets the user's needs? This is what natural language processing is designed to do: to be "smart" enough to decide and measure the quality of the results returned from the search that was performed. NLP is a modern computational technology, as well as a method for investigating and evaluating claims about the human language itself. It is a general study of cognitive functions by computational processes, in which the emphasis is normally placed on the role of knowledge representation. That is, it meets the need for representing human knowledge of what is considered to be the real world in order for computers to understand human language (NLP group, 2006).

There are two types of approaches in NLP for a QA system: the shallow approach and the deep approach. The shallow method often uses a keyword-based approach to locate interesting passages and sentences from the retrieved documents and then finds the match between the candidate text and the desired answer type. After that, it ranks the candidates based on syntactic features such as word order or location and similarity to query. It is more suitable for the “factoid” type of questions, which ask about simple facts or relationships and can be answered easily with just a few words, often as a noun-phrase (Prager, 2006). The deep method, which is a more complicated approach, involves multiple processes including; question type analysis, answer type prediction, query extension, answer analysis and also requires multiple NLP techniques. In the deep approach, answering the “what-is” questions is more challenging than answering the “who-is” questions. To improve the accuracy of the “what-is” type of question, the question needs to be divided into finer classes such as organization, location, disease and general substance. Then, techniques such as named-entity recognition, relation detection, co-reference resolution, syntactic alternations, word sense disambiguation, logic form transformation, logical inferences and common sense reasoning can be utilised. As a knowledge database, WordNet is often used to get semantic connections and definitions. In general, the deep approach relies more on natural language processing approaches (Gunter et al., 2003; Question Answering, 2008).


Several natural language processing techniques have been mentioned above. The following paragraphs will discuss those techniques in detail for a better understanding of them.

Part of Speech (POS) categories include noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. The purpose of POS tagging is to take a section of text and identify the part of speech of each token. In a question, the word "bark" can be used as a verb, but it can also be used as a noun in an answer candidate. With POS tags, we know that the "bark" in the query does not match the "bark" in the answer. POS tagging can also be used to identify phrases: for example, sequences of adjectives followed by nouns such as "big red sun". But POS tagging cannot detect different word senses. "Bark" in "dog's bark" and "tree bark" are both nouns, so it is not easy to distinguish the two meanings in this case if only POS is used (Grossman, 2004).
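
As a small illustration of POS tagging, the sketch below uses NLTK (assuming its standard tokenizer and tagger models are available); the sentences are invented, and the tags noted in the comments are what the tagger typically produces.

```python
import nltk

# One-off downloads of the tokenizer and tagger models
# (resource names may vary slightly between NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

question = "Why does my dog bark at night?"
candidate = "The bark of the old tree was rough."

print(nltk.pos_tag(nltk.word_tokenize(question)))
# 'bark' is typically tagged as a verb (VB) in the question ...
print(nltk.pos_tag(nltk.word_tokenize(candidate)))
# ... but as a noun (NN) in the candidate answer, so the two uses
# of the same surface form need not be treated as a match.
```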

Named Entity Recognition (NER) is used to identify names, organizations and locations. If the QA system knows what type of answer is expected, NER can help to identify certain words so as to associate them better with the correct type of answer. Normally, NER is useful for "who(m)", "where" and "when" types of questions. For example, take a question which asks "Who is president of the US?". A candidate answer is "Barack Obama is elected as 44th US president". In this example, the question asks for the name of a US president. NER is able to pick out Barack Obama as a name from the candidate answer and therefore, the answer matches the question.

Apposition is a pair of noun phrases in which one associates with the other (Question Answering, 2008). For example, in “The British Prime Minister, Tony Blair steps down on June 27”, the name “Tony Blair” is in apposition to “British Prime Minister”.

Relation is used to decide the relation type between two objects (Peng, 2005). A relation can be spouse-of, staff-of, parent-of, management-of and member-of. This type of relation is useful for the “who” type of question. Based-in, located(-in) and part-of are the types of relation used for “what” questions.

Co-reference resolution determines which entity a pronoun or other referring expression in a sentence refers to (Peng, 2005). For example, in "Secretary General Anan requested every side… He said …", 'He', in this case, refers to Anan.

Word Sense Disambiguation (WSD) is the process to identify the right meaning of a word in a sentence, when the word can have multiple meanings (Grossman, 2004). For example, the word

jaguar has three meanings: the first is a type of cat, the second is the car brand and the third is a code name for an Apple operating system. In "The Jaguar XF fuses sports car styling and performance with the refinement, features and space of a luxury saloon", the word jaguar in this context refers to a car.
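
A rough sketch of dictionary-based WSD using the simplified Lesk algorithm shipped with NLTK and WordNet; the sentence is invented, and Lesk is only a weak baseline for the disambiguation task described above, not the method used in this thesis.

```python
import nltk
from nltk.wsd import lesk

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

sentence = "He deposited the cheque at the bank on Monday"
tokens = nltk.word_tokenize(sentence)

# Lesk picks the WordNet sense whose gloss overlaps most with the context,
# so 'bank' here should resolve to the financial-institution sense rather
# than the river-bank sense.
sense = lesk(tokens, "bank", pos="n")
print(sense, "-", sense.definition() if sense else "no sense found")
```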

2.3.1.3 Content Based Question Answering System Architecture

Most QA systems that are based on the two techniques (NLP & IR) have different software architectures, but they share the same basic architecture shown in Figure 2.1. After a question is put forward to a QA system, the Question Analysis module processes the question. As a result, keywords are selected from the question and the expected answer type is determined. Then, the search of relevant documents or passages is conducted after the keywords of the query are sent through to the search module. The search is either among the indexed datasets or Web resources. Results returned from the search module, documents or passages, combined with the answer type from the Question Analysis module, are passed to the next module – Answer Extraction. The task of Answer Extraction is to extract the answer from the relevant documents or passages. The output of Answer Extraction is the recommended answer for the question.

The Question Analysis module includes extraction of keywords for the query, removal of stop-words and wh-words, and normalization (lemmatization or stemming and/or case normalization) according to the search engine's indexing regimen (Prager, 2006). The keywords may be expanded using synonyms and/or morphological variants (Srihari & Li, 2000) or using full-blown query expansion techniques, for example issuing a query based on the keywords against an encyclopaedia and using the top-ranked retrieved passages to expand the keyword set (Ittycheriah et al., 2001). The answer type can also be deduced from the question type. Many systems have in-built hierarchies of question types based on the type of answers sought and attempt to place the input question into the appropriate category in the hierarchy. Moldovan et al. (2000) identify 25 types of question represented in a hierarchical structure. Hovy et al. (2001) build a QA typology of 47 categories based on an analysis of 17,000 real questions. Harabagiu (2001) describes a manually crafted top-level answer type hierarchy which links to parts of WordNet to extend the set of possible answer types available to the system.


Figure 2.1. Basic architecture of a Question-Answering system (Prager, 2006).
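
The three-module flow of Figure 2.1 can be summarised as a small runnable sketch; every function, the stop-word list and the two-passage corpus below are invented placeholders standing in for the real modules, not an existing API.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "was", "who", "what", "when", "where", "why", "how", "did", "in"}

def extract_keywords(question):
    # Question Analysis: drop stop-words and wh-words, keep content terms.
    return [w for w in re.findall(r"[a-z0-9]+", question.lower()) if w not in STOPWORDS]

def classify_answer_type(question):
    # Tiny stand-in for an answer-type hierarchy.
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("when") or "year" in q:
        return "DATE"
    return "OTHER"

def retrieve(corpus, keywords, top_k=3):
    # Search module: rank passages by simple keyword overlap.
    ranked = sorted(corpus, key=lambda p: sum(k in p.lower() for k in keywords), reverse=True)
    return ranked[:top_k]

def extract_candidates(passages, answer_type):
    # Answer Extraction: for DATE questions, pull 4-digit years from the passages.
    if answer_type == "DATE":
        return [year for p in passages for year in re.findall(r"\b\d{4}\b", p)]
    return passages

corpus = [
    "Mozart was born in 1756 in Salzburg.",
    "Newton published the Principia in 1687.",
]
question = "When was Mozart born?"
keywords = extract_keywords(question)                     # ['mozart', 'born']
passages = retrieve(corpus, keywords)
print(extract_candidates(passages, classify_answer_type(question)))  # ['1756', '1687']
```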

Once the relevant documents or passages are returned from the Search module according to the query keywords, the Answer Extraction module is ready to seek the answer to the question. Answer Extraction will return a ranked ordered list of answers. By running named–entity recognition on the passages, a set of candidate answers is generated. Answer boundaries are determined directly by how much of the text is matched (Prager, 2006). The best answers come from the documents or passages whose textual context matches the question. There are four types of approaches for finding the matches between the textual context of an answer and the question: heuristic, pattern-based, relationship-based and logic-based.

Heuristic methods observe the important features for each candidate answer and decide on the values of those features. They then combine all the features mentioned above by using weights learned in training. Non-linear combinations of the features are better performing than linear combinations of features. The work of Ko (2007) is an example of this type of approach.

The pattern-based methods place emphasis on using shallow NLP. Sometimes they employ machine learning techniques to discover the relationship between the question and the answer. The patterns found using the machine learning techniques during the training process are then applied to the same type of real world questions. Consider this question: "When was Mozart born?" The answer is that Mozart was born in 1756. A pattern-based method learns the pattern of how a question is asked and how the answer is phrased. When a similar question is asked,

such as "When was Newton born?", the system applies the pattern learnt and looks for "Newton was born in (birth year)".
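
A minimal sketch of the surface-pattern idea: once the template "<name> was born in <year>" has been learnt (here it is simply hard-coded), it can be instantiated for a new entity; the text snippets are invented.

```python
import re

# Pattern learnt (or hand-crafted) from training pairs such as
# ("When was Mozart born?", "Mozart was born in 1756").
PATTERN = r"{name} was born in (\d{{4}})"

def answer_birth_year(name, snippets):
    pattern = re.compile(PATTERN.format(name=re.escape(name)), re.IGNORECASE)
    for snippet in snippets:
        match = pattern.search(snippet)
        if match:
            return match.group(1)   # the captured birth year
    return None

snippets = [
    "Sir Isaac Newton was born in 1643 at Woolsthorpe Manor.",
    "Newton's work laid the foundations of classical mechanics.",
]
print(answer_birth_year("Isaac Newton", snippets))  # -> 1643
```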

The relationship-based methods express the query in natural language. Thus, the relationships between the various question words can be used. The best answers come from the passages that possess the same relationships as the query with the score being generated from the number of relationship matches. There are two scores: one is the passage score, which is the function of the number of instantiated relationship matches, the second score is the intermediate candidate score, which is the function of the number of relationship matches with the variable. The total score is computed by combining the weighted two scores.

Logic-based methods first convert the question into a logical form. Then the passages or documents from the search module are converted to a set of logical forms representing individual assertions, with real-world knowledge added as further assertions. Lexical chains are added, and the answer is verified by negating the question and proving a contradiction. The LCC Logic Prover is an example of a system which uses lexical chains and linguistic axioms and proves the question from the logic of an answer (Prager, 2006).

2.3.1.4 Current status of Question Answering Systems

Most of the research on QA systems aims to solve a certain type of question such as factoid or definition questions. This is partly due to the complexities inherent in QA systems. From the start of the QA track in TREC 1999 until TREC 2002, the focus was on closed-class questions. The participants were given the document collections and the questions were mostly fact-based (factoid) short-answer questions such as "In what year did Joe DiMaggio compile his 56-game hitting streak?" and list questions such as "Name a film in which Jude Law acted" (Voorhees, 2001; Voorhees, 2002). TREC 2004 introduced definition questions about a person, organization, or thing (such as "Who is Colin Powell?" and "What is mould?") and passage answers, which required the system to return a single text snippet in response to factoid questions. In TREC 2005, the range of definition questions was widened to include events (Voorhees, 2005). The direction of TREC 2006 changed from short interaction or cache interaction to live interaction. The goal of interactive QA is to push the QA system toward more complex information needs that exist within richer user contexts (Dang et al., 2006).

The accuracy of the top QA systems which participated in TREC attained ratings ranging from 50% to 80% depending on the difficulty of the tasks for the different years. Although the

accuracy of different QA systems in TREC is reasonable, most questions are simple questions which ask for a fact or a definition. The question "In what year did Joe DiMaggio compile his 56-game hitting streak?" is quite straightforward in that it asks for a year (i.e. a date). However, the circumstances for asking complex questions exist in a collaborative social network such as Yahoo! Answers. It has been pointed out that even if the question complexity is limited to the factual type, current fact-seeking question-answering technology has only a moderate impact on global-scale information-seeking environments such as Web search (Pasca, 2005).

Moreover, in Yahoo! Answers, a question can be asked after a long description of context. Consider the following question from Yahoo! Answers. “I am 14 and I want to begin in the stock market. I have heard very good things about this product, and I am very impressed that it took 5 years and 3,000,000 million dollars to create, and they are selling it for 47 dollars. I would like to invest in this product starting out with the recommended 50 to 100 dollars. Have any of you had good results with this product?” The question that is asked at the end is “Have any of you had good results with this product?” However, the main purpose of the question is far beyond the description of the question. It actually asks for an opinion on whether the stock product is good or not and what experience users have with the product.

In summary, recent research in the QA area focuses on solving:

(1) One part of the QA system such as improving answer filtering techniques to improve the accuracy of answer classification (Moschitti, 2003; Wu &Yang, 2008; Kim, 2008; Ligozat, 2007; Cao, 2008; Li, 2008);

(2) Certain domains such as e-learning and e-teaching (Fu et al., 2008; Hung et al., 2005; Agrawal, 2008; Xu et al., 2008; Wang & Gao, 2008);

(3) A specific approach for a QA system such as pattern-based QA systems and ontology-based QA systems (Liu et al., 2007; Day et al., 2007; Guo, 2008; Wang & Li, 2008; Fu et al., 2008);

(4) TREC-oriented tasks for factoid and/or definition question tasks (Lin, 2007; Han et al., 2006; Cui et al., 2007; Yang et al., 2003; Kor & Chua, 2007).

To our knowledge, no research has been conducted on analysing the answering quality of users in Yahoo! Answers or applying a content-based approach to Yahoo! Answers.

Most research work on QA systems uses experiments conducted on an ideal dataset in which the question is well formed. The data from Yahoo! Answers used in this research is from real life and is composed purely of real questions that have been asked by users at Yahoo! Answers.


Conducting experiments on Yahoo! Answers and analysing the results may provide better insights into analysis of other QA portals. Some of the findings from results obtained here may also be applicable to other QA portals. Inevitably, some challenges exist for this research, as follows:

(1) The questions in Yahoo! Answers are not always in a single sentence. They may be expressed in several sentences. How you identify the exact or actual question is one of the discovery tasks of this research.

(2) The questions are not only expressed in simple sentences but also in complicated sentences. Most current QA systems only deal with simple structured questions. How to identify the right meaning of the question is not an easy task.

(3) Current QA systems do not deal with questions containing irrelevant information. Most questions in Yahoo! Answers have irrelevant sentences.

(4) There are lots of questions in Yahoo! Answers that may not have answers that can be obtained from the Web. Some questions in Yahoo! Answers, especially about “Science & Math”, require human logic to get the answer. Therefore, it may not be possible to retrieve an answer from the Web for this type of question in this research. The Web search engine can not be used as a source of the ideal answers or as an expert in this situation. A specific, perhaps even closed-domain, QA portal needs to be used for this purpose.

(5) The answers have the same meaning but may be expressed using different words or in a different fashion. It will be interesting to know the performance of exact word matching between the answer from the Web and the answer from Yahoo! Answers, and the performance of semantic matching using WordNet. In this thesis, the tasks related to question identification are attempted. Comparisons between the exact word matching and semantic matching using WordNet will also be made.

2.3.2 Reputation Based Approaches

Reputation-based approaches do not consider the quality of answers as determined by their semantic meaning. Instead, a reputation-based approach takes the features of a social network and the statistical features of a QA system into consideration. The reputation-based approaches covered in this section are tailored for collaborative social networks and are divided into two types: link analysis-based approaches and statistical-based approaches. Link analysis utilizes link information to decide the prestige of users in the social network. Two

essential algorithms based on link analysis are PageRank and HITS. Both of these are worth studying in this research. On the other hand, statistical approaches utilize the features which are relevant to the research topic such as answer acceptance ratio and user recommendation. A comprehensive review on reputation-based approaches is conducted here to decide on the best possible approach for this research.

2.3.3 Link Analysis

Link analysis, as its name suggests, analyses the linkages between the actors in a social network to determine the relationships amongst them. These relationships can be used to derive rank prestige in social network analysis. In a collaborative social network like Yahoo! Answers, link analysis can be used to rank the answers according to the prestige of users. Link analysis has been successfully used in the field of web searching (Liu, 2007; Ding et al., 2002; Pujol et al., 2002). A good example of how link analysis is applied to a search engine is PageRank, which made Google quite famous. Two types of algorithms are commonly used for link analysis. The first is the PageRank algorithm and its variations (Ding et al., 2002; Zhu et al., 2001; Richardson et al., 2006; Mihalcea et al., 2004) and the second is the HITS algorithm and its variations (Gibson et al., 1998; Jurczyk, 2007).

2.3.3.1 PageRank Algorithm

The PageRank algorithm was first introduced for static ranking of web pages on the basis that a PageRank value is computed for each page off-line and does not depend on search queries (Liu, 2007). In-links and out-links are the most important concepts in the algorithm. The in-links of page i are the hyperlinks that point to page i from other pages, but any links from the same site (or perhaps domain) do not count. The out-links of page i are the hyperlinks that point out to other pages from page i.

The PageRank Algorithm for a search engine has two assumptions. Firstly, a hyperlink pointing from one page to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page receives, the more prestige that page has. Secondly, pages that point to page i have their own prestige scores. A page with a higher prestige score pointing to page i is more important than a page with a lower prestige score pointing to page i. In the algorithm, the Web is considered to be a graph denoted by G. A graph is the set of vertices and the edges in which the vertices are the representations of the web pages and the edges are the representations of the hyperlinks that link them. Thus the formula is G= (V, E) . The score of a page is denoted by P as shown in Eq. 2.5.


$P = \left( (1 - d)\,\dfrac{E}{n} + d\,A^{T} \right) P$   Eq. 2.5

where the probability 1 − d allows the surfer to jump to any page without a link, while d is the probability that the surfer chooses to follow an out-link from the current page. $E$ is $ee^{T}$ and 1/n is the probability of jumping to a particular page (Liu, 2007).

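A compact power-iteration sketch of Eq. 2.5 on an invented four-page link graph; dangling pages and other practical details are ignored, and because P is kept normalised to sum to 1, the jump term (1 − d)EP/n reduces to the uniform vector (1 − d)/n.

```python
import numpy as np

# Toy Web graph: adjacency[i][j] = 1 if page i links to page j (invented).
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

n = adjacency.shape[0]
d = 0.85                                   # probability of following an out-link
# Row-normalise so A[i][j] is the probability of moving from page i to page j.
A = adjacency / adjacency.sum(axis=1, keepdims=True)

P = np.full(n, 1.0 / n)                    # initial PageRank vector
for _ in range(100):
    P = (1 - d) / n + d * A.T @ P          # Eq. 2.5 with the uniform jump term

print(P)  # pages with more (and more important) in-links score higher
```
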
PageRank can also be applied to classification or clustering tasks. The principle is based on the assumption that: (1) if page $p_1$ has a link to page $p_2$, $p_1$ should be similar to $p_2$ in content; and (2) if $p_1$ and $p_2$ are co-cited by some common pages ($p_3$, …), $p_1$ and $p_2$ should be similar (Costa & Gong, 2005).

There are two major advantages to using PageRank on the Internet. In terms of Web content mining, PageRank is effective in avoiding spamming. Ranking the page as relevant or irrelevant does not depend on the number of key words or terms used in the page. Instead, it depends on the number of important links. Because adding an in-link into an author’s home page from other important pages is not so easy, it prevents or reduces the effect of spamming. Another advantage of PageRank is that it is calculated off-line and only one lookup of the stored value of PageRank is necessary if the value of PageRank is required at query time.

But PageRank is not perfect and there are two major disadvantages to it. One complaint is that the value of PageRank only reflects the situation at a set time and so the value is not up to date. Also, PageRank favours those pages that have many in-links. However, a new page will not get many in-links initially and thus, PageRank favours the older pages which have established many in-links over time. Another criticism is that PageRank may cause topic drift. If a Web page has multiple topics, two links from the same Web page may end up with different semantics (Liu, 2007; Costa & Gong, 2005).

There are several improved algorithms. Tomlin proposes a generalization of the PageRank algorithm that computes flow values for the edges of the Web graph and a TrafficRank value for each page (Tomlin, 2003). Several other papers discuss personalization of the PageRank algorithm (Page et al., 1998; Haveliwala, 2002; Richardson & Domingos, 2002; Jeh & Widom, 2003).

There have not been many applications of PageRank and its variations in online social networks. Zhang (2007) has modified the PageRank algorithm to better suit the situation in social network analysis. This variation of the algorithm calculates the expertise level of a person by not only

considering how many other people the user has helped, but also by considering whom they have helped. The assumption is that the users with higher expertise answer questions posted by those users who have lower expertise than them. The adjusted PageRank algorithm is implemented using the following steps:

(1) Assume user A has provided answers to users $U_1, U_2, \ldots, U_n$; (2) The ExpertiseRank equation to calculate the expertise score is:

$ER(A) = (1 - d) + d \left( \dfrac{ER(U_1)}{C(U_1)} + \ldots + \dfrac{ER(U_n)}{C(U_n)} \right)$   Eq. 2.6

where $C(U_i)$ is the total number of users helping $U_i$, the parameter d is a damping factor which ranges from 0 to 1, and the sum of all users' PageRank scores is 1. (3) The ExpertiseRank score is calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized adjacency matrix of the users (Zhang, 2007). This adapted algorithm works well in their situation because the expertise of the answer author is genuinely higher than that of those whom s/he helps. The problem with the ExpertiseRank algorithm when applied to Yahoo! Answers is whether the same condition, that an answer author has higher expertise than the question author, holds. Further discussion of this point will be presented in Chapter 3.
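
A minimal iterative sketch of Eq. 2.6 on an invented ask–answer network, where a user's ExpertiseRank accumulates from the users they have answered.

```python
d = 0.85

# For each user, the list of users they have provided answers to (invented data).
answers_to = {
    "expert":  ["novice1", "novice2", "novice3"],
    "helper":  ["novice1"],
    "novice1": ["novice2"],
    "novice2": [],
    "novice3": [],
}

# C(U): how many users have helped (answered) user U.
helpers = {u: 0 for u in answers_to}
for answerer, helped in answers_to.items():
    for u in helped:
        helpers[u] += 1

# Iterate Eq. 2.6: a user's score accumulates from the users they helped,
# each contribution diluted by how many helpers that user had.
ER = {u: 1.0 for u in answers_to}
for _ in range(50):
    ER = {
        answerer: (1 - d) + d * sum(ER[u] / helpers[u] for u in helped)
        for answerer, helped in answers_to.items()
    }

print(sorted(ER.items(), key=lambda kv: -kv[1]))  # 'expert' ranks highest
```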

2.3.3.2 Hyperlink-Induced Topic Search Algorithm

Hyperlink-Induced Topic Search (HITS) is a query-dependent algorithm (Liu, 2007). In order to make it work, each page is assigned two roles, known as authority and hub. An authority is a page with many in-links: the page contains good authoritative content, so many people choose to link to it. A hub is a page with many out-links: a good hub tells people which pages are good authoritative pages and provides links directly to them. Authorities and hubs have a mutual reinforcement relationship, in the sense that a good hub points to many good authorities and a good authority is pointed to by many good hubs (Figure 2.2).


Figure 2.2. Authorities and hubs (Liu, 2007).

This algorithm starts by gathering the root set W, which contains the m highest ranked or most related pages. Secondly, it enlarges the root set W by adding the pages pointed to by a page in the set W and the pages that point to a page in the set W. This yields the base set S. Thirdly, HITS works on the pages in S and assigns every page in the set S an authority score and a hub score. Let G = (V, E) denote the link graph of S, where V is the set of pages and E is the set of directed edges. Let L denote the adjacency matrix of the graph. If there exists a link from node i to node j, then $L_{ij}$ equals 1, otherwise it is 0. Then let the authority score of page i be a(i) and the hub score of page i be h(i). The mutually reinforcing relationship of the two scores is represented as follows:

$a(i) = \sum_{(j,\,i) \in E} h(j)$   Eq. 2.7

$h(i) = \sum_{(i,\,j) \in E} a(j)$   Eq. 2.8
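
A short sketch of the mutual-reinforcement updates of Eqs. 2.7 and 2.8 on an invented link graph, with the usual normalisation applied after each iteration.

```python
import numpy as np

# L[i][j] = 1 if node i links to node j (invented graph).
L = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
], dtype=float)

n = L.shape[0]
a = np.ones(n)   # authority scores
h = np.ones(n)   # hub scores

for _ in range(50):
    a = L.T @ h                 # Eq. 2.7: authority = sum of hub scores of in-linking nodes
    h = L @ a                   # Eq. 2.8: hub = sum of authority scores of out-linked nodes
    a, h = a / np.linalg.norm(a), h / np.linalg.norm(h)

print("authorities:", a.round(3))
print("hubs:       ", h.round(3))
```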

HITS should not be used in the following three situations. Firstly, when there is a set of documents on one host pointing to a single document on a second host, or when a single document on one host points to a set of documents on a second host; this leads to inappropriately high hub and/or authority scores. Secondly, when there are links that were generated automatically by link-generating tools. Thirdly, when there are pages pointing to other pages which have no relevance to the query topic, as this causes the topic drift problem, where the highly ranked hubs and authorities are not related to the original query topic (Bharat & Henzinger, 1998).


The HITS algorithm also has mathematical limitations. First, if the dominant eigenvalue of $M^{T}M$ is repeated, the HITS algorithm converges to an authority vector which is not unique, but dependent on the initial seed $a_0$. The authority vector can be any normalized vector in the dominant eigenvalue's eigenspace. Second, the HITS algorithm yields zero authority weights for apparently important nodes of certain graphs (Miller et al., 2001). It is recognized that small changes to the Web graph topology can significantly change the final authority and hub vectors and their scores (Bharat & Henzinger, 1998; Lempel & Moran, 2000; Ng et al., 2001).

To address the problems mentioned above, improved algorithms have been proposed by several different authors. Randomized HITS, which is similar to PageRank, has been shown to stabilize the HITS algorithm significantly (Ng et al., 2001). In Randomized HITS, the surfer starts at a random page and then randomly chooses either to go to a new Web page or to follow the links coming out of the current page. A Stochastic Algorithm for Link Structure Analysis (SALSA) is recommended to improve the authority and hub computation (Lempel & Moran, 2000). Weights should be introduced in order to solve the problem that occurs when multiple pages from the same host point to one page on a second host (Bharat & Henzinger, 1998). Content-based similarity comparison is a way to deal with the topic drift problem: depending on the result of the cosine similarity, the expanded page is kept or discarded (Chakrabarti et al., 1998).

Jurczyk and Agichtein (2007) adapted HITS by treating question authors as the hubs and answer authors as the authorities. A high hub value is given to users who post many good questions. Low-quality questions are not answered and thus the users who posted those questions receive a low hub score. A user's authority score is the accumulation of all the hub scores of the question authors that this user has provided answers to. A user's hub score is the accumulation of all the authority scores of those answer authors who have provided an answer to the user's questions. The result of HITS in their paper outperforms the frequency method, which measures the authority of users by counting the number of answers that they have provided. However, it is debatable whether the quality of a question is decided by or related to the popularity of the question and whether the quality of an answer is decided by the hub score of the question author. It is necessary to conduct further analysis to decide on the accuracy of these points.

The HITS algorithm is also used for evaluating user expertise in JavaForum (Zhang, 2007). The assignment of the two scores (hub and authority) to each node is iterative. A good hub is a user who is helped by many expert users, and a good authority (an expert) is a user who helps many

other users (by answering their questions) who in turn, act as good hubs. As a result, HITS does not perform well in the experiments. It is concluded that HITS is not the most suitable algorithm to handle the human dynamics that shape the JavaForum online community (Zhang, 2007).

The Randomized HITS algorithm is used to measure the reputation of users (Gyongyi & Koutrika, 2008). The formula is given as Eq. 2.9 and Eq. 2.10, where α represents the authority score and ρ represents the hub score:

$\alpha^{(k+1)} = \epsilon \cdot \mathbf{1} + (1 - \epsilon)\, A_{row}^{T}\, \rho^{(k)}$   Eq. 2.9

$\rho^{(k+1)} = \epsilon \cdot \mathbf{1} + (1 - \epsilon)\, A_{col}\, \alpha^{(k+1)}$   Eq. 2.10

where A is the adjacency matrix, with normalized rows ($A_{row}$) and columns ($A_{col}$), and ε is a reset probability. Compared with the original HITS algorithm, randomized HITS is a more stable algorithm. A user's α score and ρ score are processed to yield the answer score and question score which are then used for the ranking of search results. The best answer score is the value of α for the answerer. The question score is a linear combination of the question author's hub score and the best answer author's authority score. As shown in Eq. 2.11, c is the weighting ratio when combining these two scores.

$c\,\rho_i + (1 - c)\,\alpha_j$   Eq. 2.11
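
A minimal sketch of Eqs. 2.9–2.11, assuming the reset term ε·1 is spread uniformly over the nodes and that A is row- and column-normalised as described; the asker-to-answerer graph and the weighting c are invented.

```python
import numpy as np

A = np.array([            # invented asker -> answerer adjacency matrix
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

eps, c = 0.15, 0.5
A_row = A / A.sum(axis=1, keepdims=True)   # row-normalised
A_col = A / A.sum(axis=0, keepdims=True)   # column-normalised

n = A.shape[0]
alpha = np.full(n, 1.0 / n)   # authority (answerer) scores
rho = np.full(n, 1.0 / n)     # hub (asker) scores

for _ in range(100):
    alpha = eps * np.ones(n) / n + (1 - eps) * A_row.T @ rho    # Eq. 2.9
    rho = eps * np.ones(n) / n + (1 - eps) * A_col @ alpha      # Eq. 2.10

# Eq. 2.11: combine asker i's hub score with best answerer j's authority score.
i, j = 0, 2
print(alpha, rho, c * rho[i] + (1 - c) * alpha[j])
```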

It is pointed out in this paper that the community reinforces a user's reputation, such that question authors who did not attract attention from authoritative answerers do not have high hub scores. This point is debatable, as a question may simply not have been noticed by an authoritative answerer, which does not mean the question is bad. Instead, the quality of a question is decided by the nature of the question. It is also discovered that users receive a low authority score when they provide a reasonable number of best answers but ask a disproportionately large number of questions. It cannot be denied that users who ask lots of questions have limited knowledge and therefore are not authoritative. However, the situation occurs where users provide a lot of answers on certain subjects or topics but also ask lots of questions in other subjects (where their knowledge is limited). Such users should be regarded as authoritative users in the topics where they actively answer. Chapter 3 includes further discussion on the applicability of HITS to Yahoo! Answers for measuring expertise scores.


2.3.4 Statistical Approach

Statistical approaches for determining the quality of answers can be as simple as using statistical analysis on non-textual features such as frequency of questions, answer length and answer acceptance ratio. These approaches can also be as complex as the use of sophisticated machine learning techniques on non-textual and contextual features in order to learn the question-answer pattern so as to predict the quality of an answer. Several probability-based classification and clustering techniques such as Bayesian, Markov chain and Maximum Entropy have also been applied to solve at least part, if not the whole, of QA tasks (Pasca & Harabagiu, 2001; Clarke et al., 2001; Prager et al., 2002; Florian et al., 2003; Ratnaparkhi, 1996; Schmid, 1994).

Florian (2003) used a combination of classifiers, including a robust linear classifier, maximum entropy, transformation-based learning and hidden Markov modelling, for named entity recognition. Ratnaparkhi (1996) proposed a Maximum Entropy model and used many contextual features to predict the POS tag. Prager (2002) proposed a scoring function utilizing the co-occurrence of answer type and question words in training data to solve answer type identification. Harabagiu (2006) presented a framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that uses a Markov chain to follow a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question; subquestions are derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art QA system in order to retrieve a set of passages that can be merged into a comprehensive answer by a Multi-Document Summarization system.

Zhang (2007) analysed JavaForum, a large online help-seeking community, and proposed a Z-score measure based on the observation that the people asking questions lack knowledge, while the people answering many questions have high expertise. The Z-score combines a person's asking and replying patterns, comparing the difference between the number of questions asked and the number of answers given against the standard deviation expected if asking and answering were equally likely. If a person asks the same number of questions as they answer, their Z-score is 0. If the number of questions they have asked is greater than the number of answers they have given, the Z-score is negative; if they ask fewer questions than they answer, it is positive. The assumption is that the more answers a user provides in comparison to the number of questions they have asked, the higher their expertise is, and vice versa. This may work well for JavaForum; in Yahoo! Answers, however, this assumption encourages spamming, since a user in Yahoo! Answers is rewarded whenever they answer a question (even if the answer is bad). Moreover, the people

providing answers to the question author are not necessarily people with more knowledge, but perhaps are just expressing an opinion.
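
The Z-score described above is commonly written as z = (a − q) / √(a + q), where a is the number of answers a user has given and q the number of questions they have asked; the sketch below assumes that form and uses invented counts.

```python
import math

def z_score(answers, questions):
    """Z-score expertise measure: positive when a user answers more than they ask."""
    total = answers + questions
    if total == 0:
        return 0.0
    return (answers - questions) / math.sqrt(total)

# Invented users: (answers given, questions asked).
for name, a, q in [("heavy answerer", 40, 5), ("balanced", 10, 10), ("heavy asker", 3, 30)]:
    print(name, round(z_score(a, q), 2))
```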

Jeon et al. (2006) indicate that the qualities of answers are decided by 5 factors: relevance, informativeness, objectiveness, sincereness and readableness. These factors are determined by 13 non-textual features. The features are: “Answerer’s Acceptance Ratio”, “Answer Length”, “Questioner’s Self Evaluation”, “Answerer’s Activity Level”, “Answerer’s Category Specialty”, “Print Counts”, “Copy Counts”, “Users’ Recommendation”, “Editor’s Recommendation”, “Sponsor’s Answer”, “Click Counts”, “Number of Answers”, and “Users’ Dis-Recommendation”. All these features are taken into consideration when deciding the quality of an answer, with a Maximum Entropy approach used to build up their quality predictor. The results proved to be significant. However, there are limitations to this method. The solution only judges the answer as good or bad and the quality judgment score is obtained manually. There is no ranked list of answers based on quality, and the role of these features in different QA portals varies. Moreover, many of these features are hard to collect in QA systems and the analysis of such features is time-consuming.
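
Maximum Entropy classification over such non-textual features amounts to (multinomial) logistic regression; the sketch below trains a binary good/bad quality predictor with scikit-learn on invented feature vectors (acceptance ratio, answer length, answerer activity level), purely to illustrate the setup rather than Jeon et al.'s actual model.

```python
from sklearn.linear_model import LogisticRegression

# Invented training data: [acceptance ratio, answer length, answerer activity level].
X = [
    [0.80, 350, 120],
    [0.75, 220, 300],
    [0.10,  15,   4],
    [0.05,  30,  10],
    [0.60, 180,  90],
    [0.15,  20,   7],
]
y = [1, 1, 0, 0, 1, 0]   # 1 = good answer, 0 = bad answer (manual judgements)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict the probability that a new, unseen answer is of good quality.
print(clf.predict_proba([[0.50, 200, 60]])[0][1])
```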

Another probability-based method is that used by Ko (2007), who presents a joint prediction model based on a probabilistic graphical model. It estimates the joint probability of all answer candidates, from which the probability of an individual candidate is inferred. The preparation for the joint prediction model involves getting the answer relevance features and answer similarity features. This involves building a synonym list by using WordNet, the CIA World Factbook and Wikipedia, and calculating the tf-idf and word distance through Wikipedia and Google. The whole model is complicated and, in addition, it requires $O(2^{N})$ time complexity, where N is the number of answer candidates.

Guo (2008) recognizes that low participation rate of users in collaborative QA services is a crucial problem. In his work, he recommends possible answer providers for a question, instead of choosing/recommending the best answer from the answers submitted for a question. Two steps are needed: (1) Discovering latent topics in the content of questions and answers, as well as the latent interests of users in order to build user profiles; and (2) Recommending question answerers for newly arrived questions based on the latent topics and a term-level model. Similar works are also proposed by Jeon et al. (2005) and Cao (2008). Jeon et al. proposed an approach to estimate the question semantic similarity based on their answers and then use the translated model to find similar questions in large question answer archives. Cao tries to recommend related questions in collaborative QA systems.


There is no single rule on how to carry out a statistical method/approach for question answering. But the statistical methods are based on the observation of patterns or features of a particular system. Because statistical methods have been designed for certain systems, the performances of these systems are normally better than the content-based QA systems.

2.4 CONCLUSION

This thesis addresses the problem of automatically recommending the best answer from the several answers provided by many users of an online collaborative social network. To establish the necessary concepts, this chapter began by introducing online social networks and collaborative social networks such as Yahoo! Answers. This chapter then summarised the social network analysis methods that can be applied to an online collaborative social network. Next, it discussed different approaches to evaluating and recommending the best answer after a user has posted a question and answers have been provided by other users in an online collaborative social network. The discussion focused on three types of methods for evaluating and choosing the best answer.

The content-based methods, which rely on NLP and IR, are the most widely used approaches in this area. A question is asked in natural language, so the straightforward solution is to analyse the human language and try to interpret it into a machine-understandable form. The aim is to find the answer to the question based on the relevance or the quality of the answers given to a question. However, as human language is often complicated, NLP and IR based methods are often not intelligent enough to answer the question.

Link analysis approaches include PageRank and HITS. Both can be applied in similar applications as they both rank the authority of users, and they identify the best answer based on how authoritative its author is. The main obstacle to applying these methods to Yahoo! Answers is the reinforcement relation between the hub (the question author) and the authority (the answer author): the precondition does not hold, because the goodness or badness of a hub does not depend on how many answers the hub attracts but rather, in the case of Yahoo! Answers, on the content of the question and answer.

The third approach is using statistical techniques, where many methods have been suggested. However, none of them have been considered suitable for the Yahoo! Answers. Therefore, in Chapter 4, an innovative way to solve the problem for Yahoo! Answers will be proposed.



Chapter 3: Analysis of Yahoo! Answers

This chapter analyses the characteristics of Yahoo! Answers. These characteristics include network structure features as well as more general features. The purpose of the analysis is to establish a suitable method of choosing the best answer based on Yahoo! Answers’ unique features.

This chapter starts with the Yahoo! Answers mechanism which details the operation of Yahoo! Answers. It then discusses various graph representations that can be used to show and detail Yahoo! Answers. The Bow Tie structure is discussed in the third section. The Bow Tie structure is famous for its four parts with the percentage makeup of its four components being analysed to determine the general activity of Yahoo! Answers and its users. Based on the findings of network activity, a study has been conducted on whether or not the HITS algorithm is suitable for use in selecting the best answer. The fourth section focuses on indegree and outdegree which are measurements of degree centrality. The fifth section explores the possibilities of spamming in Yahoo! Answers. Questions are randomly chosen and the content of the questions and corresponding answers are analysed. In this way the quality of questions and answers are checked. In the last section, the hierarchical structure of the question categories/types in Yahoo! Answers is discussed.

Before analysing Yahoo! Answers, several important terms used in this thesis are explained below:
• User: A user is the person who has registered an account in Yahoo! Answers.
• Question author: The user who posted the given question in Yahoo! Answers.
• Answer author: The user who provided an answer to the given question in Yahoo! Answers.
• Question Answering (QA) portal: A website on which people can ask questions and receive answers. A portal returns answers by looking up all of the resources available. The resources may be on the Web or within a homogenous/heterogeneous database.
• Collaborative Question Answering (cQA) portal: A website where people ask questions and the answer is returned/provided by other people through the sharing of their knowledge.


3.1 YAHOO! ANSWERS MECHANISM

Yahoo! Answers is a place where people share facts, knowledge, opinions and personal experiences through asking and answering relations. As a collaborative Question Answering (cQA) portal, Yahoo! Answers is convenient to use: all it requires in order to return answers to a question author is that the question be written in natural language. Unlike the traditional way of finding information online (for example Google (www.google.com)), which asks the user for keywords to search for information, Yahoo! Answers asks for the question as a whole sentence if it has not already been asked. After users post their questions, the only thing they need to do is wait. Within a few days, the answers provided by different users are available to the question author. The amount of effort that a question author spends on Yahoo! Answers is much less than on a search using Google. For example, suppose someone wants to ask about the theme of the third Harry Potter book. In Yahoo! Answers, they can post the question and wait to get an answer. See Figure 3.1 for the returned answers for this example question.

Question: What is the theme to the third Harry Potter book?

Answer 1:

The theme. Okay. A prisoner, Sirius Black, breaks out of Askaban, intent on killing Harry. Or so people think. Convicted of brutally murdering his long-time friend Peter Pettigrew, Black suddenly escapes the prison, which no one has ever managed before. He then breaks into Hogwarts, into the Griffindor Common Room, and creeps into the boys' room with a dagger. Ron wakes up and shouts, so Black has to flee. I don't know if that answered your question, but there you go.

Chosen as the best answer by voter.

50% vote, 3 votes

Answer 2:

Prison

0% vote, 0 votes

Answer 3:

Um, I guess you could say Azkaban or Sirius Black.

17% vote, 1 vote.

Answer 4:

Many. Despair and Emotions (how to control them).

0% vote, 0 votes.


Answer 5:

The 3rd book is when JK Rowling really started to introduce a lot of darkness into the series but it is also where Harry began to learn about his parent not just from storys but from their friends. To pick just one thing that stands out as a theme is pretty diffecult in my opinion but if I had to chose one I would say it would be "Understanding Emotions"...The dementors draining a person of there feelings, Harry had to understand where he comes from, understand the way his parents where as students, and even understand that his godfather did not do it!! This was an emotion book for Harry. 17% vote, 1 vote.

Answer 6:

The theme of Harry Potter and the Prisoner of Azkaban is : things aren’t always as they seem. It was shown all through the book, as Harry became aware of various misconceptions. Trelawney was thought to be a complete fraud, but she made a genuine prediction. Neville was humiliated for losing the list of passwords to the Gryffindor common room, but it was Crookshanks who stole it. Sirius was believed to be a mass murderer and one of Voldemort's followers, but he was innocent on both counts. Peter Pettigrew was thought to have died bravely, but he faked his death while framing Sirius.

17% vote, 1 vote

Answer 7:

Things are not always what they seem.

0% vote, 0 votes

Answer 8:

it depends on what u mean by theme...if ur talking about what its about then here it is.... sirius black has (supposedly) escaped from azkaban to kill harry potter.....the twist is that he has actually came 2 kill ron's rat because he is actually peter pettigrew(the person sirius "killed" and got thrown into azkaban for ) any way evry1 thinks harry is in danger blah,blah,blah then pettigrew escapes in the end and if u havent read the book then thats all im gonna tell u.....srry. 0% vote, 0 votes.

Figure 3.1. An example of a question and its answers.


Figure 3.2. Result from Google.

Compared with the results from Yahoo! Answers, Google (Figure 3.2) does not return an exact match to what the question asked. The best answer returned from Yahoo! Answers is reasonable. The first link from Google is about a book download, which is not relevant to the question. It is not until the fourth link that the answer is found, with that link providing an extensive answer to the question. Information seeking in Yahoo! Answers is an example of focused retrieval whereas information seeking in Google is an example of document retrieval. Also, Yahoo! Answers obtains its results from collaboration among users whereas Google is a typical example of getting information out of heterogeneous and distributed databases.

Another example of a collaborative social network is Wikipedia, where knowledge is obtained from the contributions of its online community members, just like Yahoo! Answers. However, Wikipedia is structured more systematically, with the knowledge organised into topics. For example, suppose someone wants to know who Harry Potter's best friends are. In Wikipedia, they need to find the Harry Potter topic and read through all the paragraphs, where they may find the answer in one of them. In Yahoo! Answers, a user can post the question to the system and wait for the answers from other users. Once again, the difference in information seeking between the two systems is document versus focused retrieval: Wikipedia users have to passively read an entire passage to get the exact information they want, whereas Yahoo! Answers users ask for information actively and receive exactly the information requested.

In short, Yahoo! Answers has the following characteristics: (1) It is a place where people can communicate, sharing their precious knowledge; (2) The method of acquiring knowledge is efficient and quick (Chi, 2008); (3) Question topics have a wide range, from politics to travel and from science to family, all in one portal; (4) The points system encourages users to actively participate in Yahoo! Answers; and (5) It is a social entertainment portal where users ask for opinions and discussion.

Yahoo! Answers gives members the chance to earn points as a way to encourage their participation. Users are allowed to provide an answer to any question within the four days immediately after the question author posts their question. Normally the ability to answer a question is disabled 4 days after the question was asked, however, the question author is able to extend or shorten the time period if they desire. Users other than the question author can only answer a given question once. After users post their answers, they are not allowed to edit it. They can, however, remove their answer before the question is closed. Each new user (those whose point level is still at stage one) has the ability to ask up to 5 questions per day.

There are several reasons why people enthusiastically provide answers. Some of the reasons are as follows: (1) To build up a personal reputation - providing answers, especially best answers, shows how knowledgeable that user is; (2) To compete - users are interested in knowing what position or level they have achieved in Yahoo! Answers. Users like to better others or compete with themselves in terms of getting points from Yahoo! Answers; (3) To help others by sharing information, opinion or advice; (4) To show professional courtesy - providing answers to those in need can encourage other users to do the same; and (5) To gain insight - obtaining different perspectives from other users by participating in answering questions (TechGazing, 2008).

The points system is illustrated in Figure 3.3. There exists the possibility that an answer may win the answer author as much as 50+10+2=62 points, if one provides an answer and the answer is then selected as the best answer and the answer receives the maximum number of counted ‘thumbs-up’ (which is currently 50) from other users voting for their answer. Therefore, the number of points that a user can win for just one answer is substantial. In general, users in Yahoo! Answers receive points from: (1) logging into the Yahoo! Answers system; (2) providing an answer to a question; (3) voting for the best answer; and (4) providing the best answer to a question.

Figure 3.3. Points (Points and Levels, 2009).
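As a rough illustration of this arithmetic (assuming, as described above, 2 points for posting an answer, 10 points for having it selected as the best answer, and 1 point per counted thumbs-up with a cap of 50; these values are taken from the text rather than from any current Yahoo! Answers rules), the maximum return on a single answer could be computed as follows:

# Hedged sketch: maximum points obtainable from a single answer, using the
# point values quoted in the surrounding text (assumed, not official figures).
ANSWER_POINTS = 2        # awarded for posting any answer
BEST_ANSWER_POINTS = 10  # awarded if the answer is chosen as best answer
THUMBS_UP_POINTS = 1     # per thumbs-up received
THUMBS_UP_CAP = 50       # only the first 50 thumbs-up are counted

def points_for_answer(is_best, thumbs_up):
    points = ANSWER_POINTS
    if is_best:
        points += BEST_ANSWER_POINTS
    points += THUMBS_UP_POINTS * min(thumbs_up, THUMBS_UP_CAP)
    return points

print(points_for_answer(is_best=True, thumbs_up=120))  # 2 + 10 + 50 = 62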

Another calculation system is called the ‘level’ which indicates how active the user has been. The more points that one accumulates, the higher the level obtained. Figure 3.4 illustrates how levels are obtained in Yahoo! Answers. Level 7 is the highest level. The users at level 7 have unlimited access to voting for the best answers and can ask as many questions as they want to. They are allowed to provide as many answers to questions as they can. On the other hand, users at level 1, which are new users, can only ask up to 5 questions per day. They are only allowed to provide a maximum of 20 answers per day to posted questions. They can not provide any ratings to other users’ answers. It is easier for users higher than level 4 to get more points than the users at lower levels because these higher level users can answer as many questions as they want to as they are not limited to any daily answer quota.

Figure 3.4. Levels (Points and Levels, 2009).


The way the Yahoo! Answers system awards points to users encourages them to participate by answering as many questions as possible, regardless of the quality of their answers. Figures 3.3 and 3.4 show that each user can even get one point a day for just logging into the system and more points for participation in voting for a best answer. Points and levels are only good for representing the activeness of user participation in the site; they should never be considered to be a measure of the user’s expertise.

3.2 GRAPH REPRESENTATION

The structure and communications of Yahoo! Answers can be represented as a graph G(V, E), with the question authors and answer authors as the nodes V and with asking or answering relations as the edges E joining the nodes together (Gyongyi, 2008). The graph of Yahoo! Answers can be displayed in different ways from different perspectives, and therefore different properties and phenomena can be learnt.

One type of graph is the tripartite graph. As its name indicates, there are three kinds of nodes or vertices involved in the graph; in this case, the three parties are the users, questions and answers. Let us denote nodes as V, users as U, questions as Q, and answers as A: V = U ∪ Q ∪ A. Let us denote the edges as E. Edges can be links representing that a user posted a question, that a user answered the question, or that a user chose/voted for the best answer. Figure 3.5 shows an example of a tripartite graph. In this graph, a circle denotes a user node. The user can be a question author and/or answer author (but not both for the same question). The diamonds represent question nodes and the squares represent answer nodes. A dashed line indicates that the user provided an answer to the question author, but it was not selected as the best answer. A solid line indicates that the user provided an answer to the question author and it was selected as the best answer. In this example, u1 asks questions q1 and q2. u2 provides an answer a1 to q1 and it has been chosen as the best answer (as shown by the solid line).

u3 provides an answer a2 to q1 as well, but it is not deemed to be the best answer (as indicated by the dashed line). A user node can connect to a question node as well as an answer node, as in the case of u2.



Figure 3.5. Tripartite example.

Another type of graph is the bipartite graph. A bipartite graph is suitable for analysing the connections between users asking and answering questions. Suppose we have the bipartite

graph G′(V′, E′), where V′ includes the question authors U1 and the answer authors U2, and therefore V′ = U1 ∪ U2. The same user can appear in both roles as a question author and an answer author. E′ is the set of edges which link question authors and answer authors through asking and answering relations. Figure 3.6 is a simplified version of Figure 3.5 showing the question author and answer author relations in a bipartite graph. A circle represents a question author node while a triangle represents an answer author node. The edges are unweighted, which means that an answer author may provide an answer to the question author only once per question, but they can answer multiple questions. For example, in Figure 3.5 u3 provides answer a2 to question q1 and answer a3 to q2; both of these questions were asked by u1. In Figure 3.6, the u1 and u3 relations are simplified to a single edge. If the edge were weighted, the weight would be 2; as the edge is unweighted, the weight is 1. Note that u2 acts in both roles as a question author and an answer author. Figure 3.6 is also a directed bipartite graph, with the direction of the edges pointing from the answer author to the question author, meaning the answer author provides the answer to the user who asked the question. A solid arrow indicates that the answer author has provided the best answer to the question author at least once. A dashed arrow indicates that the answer author has provided answers to the question author, but none of their answers have been selected as the best answer. In Figure 3.6, u2 provided the best answer to u1 while u3 provided the best answers to u1 and u2 for different questions.


Figure 3.6. Bipartite example.

A tripartite graph with users, questions and answers represents three sets of parties and shows the comprehensive relationships in the graph between users. From the graph, information such as the question author, answer author, the number of questions the user has asked, the number of answers the user has provided, who asked which question and who answered which question can be seen clearly. The bipartite graph is a simplified graph based on the tripartite graph, where information about who asked which question and who answered which question is ignored. The bipartite graph places emphasis on the role of users as a question author or answer author.
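Both representations can be sketched with a small graph library. The snippet below is a hedged illustration (using networkx and toy node names mirroring Figure 3.5, none of which come from the thesis implementation) that builds the tripartite user-question-answer graph and then collapses it into the directed, unweighted bipartite graph of answer authors and question authors described above:

# Illustrative sketch (not the thesis implementation): tripartite and bipartite
# views of Yahoo! Answers interactions, using toy data shaped like Figure 3.5.
import networkx as nx

# (question, question_author, answer, answer_author, is_best_answer)
records = [
    ("q1", "u1", "a1", "u2", True),
    ("q1", "u1", "a2", "u3", False),
    ("q2", "u1", "a3", "u3", True),
    ("q3", "u2", "a4", "u3", True),
    ("q3", "u2", "a5", "u4", False),
]

# Tripartite graph: user, question and answer nodes; edges record who asked
# which question and who supplied which answer.
tri = nx.Graph()
for q, asker, a, answerer, best in records:
    tri.add_node(asker, kind="user")
    tri.add_node(answerer, kind="user")
    tri.add_node(q, kind="question")
    tri.add_node(a, kind="answer")
    tri.add_edge(asker, q, relation="asks")
    tri.add_edge(answerer, a, relation="writes")
    tri.add_edge(a, q, relation="answers", best=best)

# Bipartite (user-to-user) view: one unweighted directed edge from the answer
# author to the question author, flagged if a best answer was ever given.
bi = nx.DiGraph()
for q, asker, a, answerer, best in records:
    if bi.has_edge(answerer, asker):
        bi[answerer][asker]["best"] |= best
    else:
        bi.add_edge(answerer, asker, best=best)

print(bi.edges(data=True))
# e.g. u3 -> u1 appears once even though u3 answered two of u1's questions.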

3.3 THE BOW TIE STRUCTURE ANALYSIS

The Bow Tie structure, as shown in Figure 3.7, has been successfully used to explain the dynamic behaviour of the Web and helps in understanding the Web's structure (Borodin et al., 2003). It has four distinct components: “Core”, “In”, “Out” and “Others” (“Tendrils” and “Tubes”). This research applies the Bow Tie structure to Yahoo! Answers in order to understand the behaviour of its users. The Core component is made up of the users who frequently participate by asking and answering questions; a large core indicates the presence of a community where many users interact, directly or indirectly. The In component is composed of users who always ask questions. The Out component is made up of users who predominantly answer questions. The Tendrils and Tubes components attach to either the In or the Out components or both. The Tendrils component contains those users who only answer questions posted by a user contained within the “In” component. The Tubes component contains those users who post questions which are only answered by users contained within the “Out” component.

To calculate the Core (the SCC, or strongly connected component), Tarjan's algorithm is used. Tarjan's algorithm is a graph theory algorithm for finding the strongly connected components of a graph. Under this definition, a vertex or node A is strongly connected to a vertex or node B if there exist two paths: one from A to B and another from B to A. The basic idea of the algorithm is as follows. A depth-first search begins from a start node. The strongly connected components form the subtrees of the search tree, the roots of which are the roots of the strongly connected components. When the search returns from a subtree, the nodes are taken from the stack (in visit order) and each node is checked to determine whether it is the root of a strongly connected component. If a node is the root of a strongly connected component, then it and all of the nodes taken off the stack before it form that strongly connected component (SCC), which contributes to the Core (Tarjan, 1972).

Figure 3.7. Bow Tie structure (Borodin, et. al, 2003).


Figure 3.8. The algorithm in pseudocode (Tarjan, 1972).
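Since Figure 3.8 itself is not reproduced here, the following is a generic Python rendering of Tarjan's strongly connected components algorithm in its recursive textbook form, not a transcription of the figure's exact pseudocode; node names are illustrative:

# Hedged sketch of Tarjan's SCC algorithm (textbook recursive form).
def tarjan_scc(graph):
    """graph: dict mapping node -> iterable of successor nodes."""
    index_counter = [0]
    index = {}      # discovery index of each node
    lowlink = {}    # smallest index reachable from the node's subtree
    stack = []
    on_stack = set()
    sccs = []

    def strongconnect(v):
        index[v] = lowlink[v] = index_counter[0]
        index_counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:        # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# Toy asker/answerer graph: u1 and u2 answer each other, u3 only answers u1.
print(tarjan_scc({"u1": ["u2"], "u2": ["u1"], "u3": ["u1"]}))
# [['u2', 'u1'], ['u3']]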

The pseudocode for Tarjan's algorithm is presented in Figure 3.8, where V stands for the users in Yahoo! Answers and E stands for either the relation of asking a question or answering a question. For the purpose of this analysis, the experiment randomly selects the starting node. The strongly connected components found are only those reachable from the start node, so it is possible that not all strongly connected nodes will be visited. This can be overcome by executing the algorithm several times from randomly chosen starting nodes; in the experiment, the algorithm is run 10,000 times, randomly selecting a node from which to start each iteration. From the algorithm, the In and Out components can be calculated as well. The In component can be calculated by counting the edges which have the relation pointing outward to other users, and the Out component by counting the edges which have the relation pointing toward the user. The Tendrils and Tubes are considered as a combined Others part, which can be calculated by deducting the values of the Core, In, and Out from the total number of edges.
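Using the simplified component definitions given at the start of this section (Core: users in a non-trivial strongly connected component; In: users who only ask; Out: users who only answer; Others: the remainder), the classification could be sketched as below. This is an illustration of the idea rather than the experiment's actual code, and it reuses the tarjan_scc function from the sketch above:

# Hedged sketch: classify users into Bow Tie components using the simplified
# definitions of Section 3.3 (illustrative only, not the thesis implementation).
def bow_tie_percentages(ask_answer_edges):
    """ask_answer_edges: iterable of (answer_author, question_author) pairs."""
    graph = {}
    askers, answerers = set(), set()
    for answerer, asker in ask_answer_edges:
        graph.setdefault(answerer, set()).add(asker)
        graph.setdefault(asker, set())
        answerers.add(answerer)
        askers.add(asker)

    users = set(graph)
    # Core: users belonging to any non-trivial strongly connected component.
    core = {u for scc in tarjan_scc(graph) if len(scc) > 1 for u in scc}
    only_ask = (askers - answerers) - core        # "In": ask but never answer
    only_answer = (answerers - askers) - core     # "Out": answer but never ask
    others = users - core - only_ask - only_answer

    total = len(users)
    return {name: 100.0 * len(part) / total
            for name, part in [("Core", core), ("In", only_ask),
                               ("Out", only_answer), ("Others", others)]}

print(bow_tie_percentages([("u2", "u1"), ("u3", "u1"), ("u1", "u2"),
                           ("u3", "u2"), ("u4", "u3")]))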

Table 3.1 shows the result of conducting a Bow Tie structure analysis on Yahoo! Answers. A subset of questions and their answers in the categories of “Arts & Humanity”, “Science & Math” and “Sports” was selected for this experiment (see Chapter 5 for details of the dataset). The results are compared with the Bow Tie structure analysis of the Web. They indicate that only a few users (0.01%) just ask questions. 43.21% of users actively participate in Yahoo! Answers by asking and answering questions, while a little under a third of users (31.52%) act as helpers by mainly answering questions (without asking any of their own). This phenomenon indicates that most users seek to earn as many points as possible in Yahoo! Answers by providing answers, and to lose as few as they can by not asking lots of questions. In comparison to the Web, Yahoo! Answers is a less structurally balanced network: there are more users willing to participate by answering questions than by asking them.

                  Core      In       Out       Tendrils, Tubes, Disconnected
Web               27.7%     21.2%    21.2%     29.9%
Yahoo! Answers    43.21%    0.01%    31.52%    25.26%

Table 3.1. Bow Tie comparison.

3.4 DEGREE CENTRALITY

Centrality measures are used to identify highly connected actors in a network. Degree centrality, closeness centrality and betweenness centrality each have a different focus when defining connectivity between actors. Degree centrality recognises the most central or important actor as the one with the most inflow and/or outflow of information (ties). The most central actor, in terms of closeness centrality, is the one with the shortest paths to the other actors. Betweenness centrality identifies the central actor as the one that lies on the maximum number of shortest paths between other nodes in the network. Degree centrality is the most suitable measure to use when analysing the structure of Yahoo! Answers. Degree centrality in the Yahoo! Answers portal is equivalent to measuring users' popularity/prestige by counting the number of questions and/or answers they have posted/submitted. It is not useful to calculate betweenness and closeness centrality to find important users, as there are no information-passing paths: when a question is asked, users other than the question author do not need to pass the question to other users, and when a user answers a question, no one else needs to pass the answer on to the question author.

The degree centrality measures the activity and the participation of an actor in the network (Marcos et al., 2006). In the case of a relationship that considers the direction of the link, two indexes are defined: indegree and outdegree. Indegree is the number of links terminating at the node. In the case of Yahoo! Answers, it refers to the number of questions a user has asked. Outdegree is the number of links originating from the node, and for Yahoo! Answers it refers to the number of questions that a user has provided an answer to.
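Under these definitions, indegree and outdegree can be tallied directly from the question/answer records. The short sketch below uses illustrative record and variable names (not the thesis code) and also bins the outdegree counts in the way a log-log plot such as Figure 3.9 would use:

# Hedged sketch: indegree = number of questions a user has asked,
# outdegree = number of questions a user has answered (Section 3.4 definitions).
from collections import Counter

# (question_id, question_author, answer_author), one row per answer.
rows = [("q1", "u1", "u2"), ("q1", "u1", "u3"), ("q2", "u1", "u3"),
        ("q3", "u2", "u3"), ("q3", "u2", "u4"), ("q4", "u3", "u4")]

questions_asked = {}     # user -> set of question ids they posted
questions_answered = {}  # user -> set of question ids they answered
for qid, asker, answerer in rows:
    questions_asked.setdefault(asker, set()).add(qid)
    questions_answered.setdefault(answerer, set()).add(qid)

indegree = {u: len(qs) for u, qs in questions_asked.items()}
outdegree = {u: len(qs) for u, qs in questions_answered.items()}

# Degree distribution (how many users have each outdegree value); plotting
# this on log-log axes would reveal the power-law shape shown in Figure 3.9.
outdegree_distribution = Counter(outdegree.values())
print(indegree, outdegree, outdegree_distribution)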


As the results show (Figure 3.9), the indegree and outdegree follow a power law, which describes the phenomenon where large events are rare but small ones are quite common. For example, there are a few words, such as “and” and “the”, that occur very frequently, but many which occur rarely. It can be seen from Figure 3.9 that only a small number of users ask a large number of questions. Most users ask only one or two questions while a smaller number of users ask ten or more questions. The same behaviour is reflected by users when it comes to answering questions. For example, there are around 10^5 users who either ask or answer just one question. Around 1000 users answer 10 questions while approximately 10^6.2 users ask 10 questions.

The indegree and outdegree indicate that only a small number of users reach a high level (i.e. providing a great number of answers/best answers) and that a large number of users remain at a lower level (i.e. providing a smaller number of answers/best answers). Therefore, only a small number of users are at the level of an expert while the majority of users are novices. This observation should be considered when proposing a method that determines the prestige of users in Yahoo! Answers: only a small number of users should receive a high reputation score and the majority of users should receive a lower reputation score.

Figure 3.9. Indegree and Outdegree.


3.5 QUESTION QUALITY & ANSWER QUALITY Spamming, in the Yahoo! Answers context, refers to the behaviour whereby users ask questions for the purpose of obtaining conversation or by asking low quality questions; or users actively answer questions but provide irrelevant information or provide low quality answers. The Yahoo! Answers portal gives users bonus points when they answer questions. The Yahoo! Answers method of awarding points to its users means there is a possibility of spamming by users that will influence the quality of questions and answers.

Poor questions and answers appear frequently. Based on a random selection of 100 questions, it was found that 21% of the questions were of poor quality. Unfortunately, poor quality questions attract a number of users who answer them simply for points. For example, poor quality questions like “I got killed yesterday”, “I love life” and “How did you find who you are” received 7, 2 and 17 answers respectively.

Consider the two questions below as a comparison of question quality. Question 1 just asks for free items; it is not a real question. Question 2 is of higher quality than Question 1, yet Question 2 has fewer users providing answers than Question 1. This kind of example is not rare in Yahoo! Answers.

Question 1: Does anyone have any brushes? I need brushes, because I don’t have any, if you have any please post them here or if you know of a site where I can get any free brushes please tell me. Answer 1: Are you that poor? Go buy one at a building supply store for like a dollar. How can we post a brush online? Answer 2: Sorry to burst your bubble but nothing online is free. And brushes that are that cheap will just loose all their hair the second they hit the paint. Answer 3: Are you talking about photoshop brushes? I think I remember reading here once that they have some at deiantart.com. But I can’t be sure, because I mostly use the old fashioned brushes with old fashioned paint.

Question 2: What term applies to a composition that features a human figure standing between two animals? Answer 1: Dinner for two.

Sometimes the number of answers a question receives depends on the nature of the question. Some questions may invoke a discussion more easily than others. For example, “What does the word freedom mean to you?” has 24 answers; because the question is philosophical, it attracts a lot of users to provide answers to it. Another example, “Why are there so many paranormal events, psychics UFO sighting in these day than ever before?”, is a discussion type question, and it receives 12 answers. But the question “Can you explain to me all the phases of microbial growth curve” requires some expertise in the associated field, and the people who could answer this type of question are few. However, the fact that a question requires expertise to answer does not mean the quality of the question is poor.

Another example is the question shown in Figure 3.10. The question author asks how to send a music file. It can be seen that both Answers 2 and 3 simply repeat what the question asks for; they would not be considered good quality answers. The answer authors who submitted Answers 2 and 3 have provided answers to the question, but this does not mean that either of them understands the question or knows how to answer it. Generally, a fair number of answers are submitted to questions of low quality. Thus users who provide answers to a higher number of questions are not necessarily more knowledgeable.

According to the HITS link analysis algorithm, the more answers a question gets, the better a hub its question author is, and the more questions a user answers, the higher the authority of that user. However, the spamming cases mentioned above occur often, and therefore methods that simply depend on links, the number of questions and the number of answers will not work. There is a need for reputation-based methods that combine link analysis and statistical analysis to determine the expertise of users in the network.

Question: what’s the simplest way to send a music file from my hard drive via email? So that recipient can immediately play it?

Answer 1: If you include it as an attachment, the recipient should be able to simply play it directly. As long as the file is not too large and it does not have DRM, then everything should be ok. When you compose your email, make sure the file is attached, and you are not simply sending a link. Also, many servers refuse attachments more than 10 MB. If this is the case, you should have received an error reply. If the file you are sending has Digital rights management, the recipient will not be able to play it. The recipient’s computer must also be setup to automatically play sound files.

Answer 2: Try sending him the music file on your hard drive via email. That always seems to work for me.

Answer 3: Why don’t you try doing what you said you were going to do, send it via email.

Figure 3.10. Question quality.


3.6 A HIERARCHICAL CLASSIFICATION STRUCTURE FOR PLACING QUESTIONS

Yahoo! Answers follows a hierarchical structure in order to place any question into the system. When a user wants to post a question, they have to post it under a predefined category that they choose, and each question can only belong to one category. Organising questions under a hierarchical structure makes it easier for Yahoo! Answers users to find questions and respond to them when compared with an unorganised style. In addition, this approach attracts the attention of users who are only interested in certain categories. There are 27 top level categories and each top category has a number of subcategories, as presented in Table 3.2. However, not every first level subcategory under a top level category has a second level subcategory. For example, there is a top category known as “Computers and Internet”. Its first level subcategories include “Computer Networking”, “Hardware”, “Internet”, “Programming and Design”, “Security”, “Software” and “Other Computer”. “Computer Networking” has no second level subcategory. “Hardware” has second level subcategories which include “Add-on”, “Desktop”, “Laptop”, “Notebook”, “Monitors”, “Printers”, “Scanners” and “Other Hardware”. All questions have to be placed under a leaf or bottom level category, under which there are no more subcategories.

Top level category First level subcategory

Arts & Humanity Books & Authors; Dancing; Genealogy; History; Performing Arts; Philosophy; Poetry; Theatre & Acting; Visual Arts; Other-Arts& Humanity

Beauty & Style Fashion & Accessories; Hair; Makeup; Skin & Body; Other-Beauty &Style

Business & Finance Advertising & Marketing; Careers & Employment; Corporations; Credit; Insurance; Investing; Personal Finance; Renting & Real Estate; Small Business; Taxes; Other- Business & Finance

Cars & Transportation Aircraft; Boats & Boating; Buying & Selling; Car Audio; Car Makes; Commuting; Insurance & Registration; Maintenance & Repairs; Motorcycles; Rail; Safety; Other-Cars & Transportation

Computers & Internet Computer Networking; Hardware; Internet; Programming & Design; Security; Software; Other-Computers

Consumer Electronics Camcorders; Cameras; Cell Phones & Plans; Games & Gear; Home Theatre; Land Phones; Music & Music Players; PDAs & Handhelds; TVs; TiVo & DVRs; Other- Electronics

Dining Out Argentina; Australia; Austria; Brazil; Canada; Fast Food; France; Germany; ; Indonesia; Ireland; Italy; Malaysia; Mexico; New Zealand; Philippines; Singapore; Spain; Switzerland; Thailand; United Kingdom; ; Vietnam; Other-Dining Out

Education & Reference Financial Aid; Higher Education (University +); Home Schooling; Homework Help; Preschool; Primary & Secondary Education; Quotations; Special Education; Standards & Testing; Studying Abroad; Teaching; Trivia; Words & Wordplay; Other -Education

Entertainment & Music Celebrities; Comics & Animation; Horoscopes; Jokes & Riddles; Magazines; Movies; Music; Polls & Surveys; Radio; Television; Other-Entertainment


Environment Alternative Fuel Vehicles; Conservation; Global Warming; Green Living; Other- Environment

Family & Relationships Family; Friends; Marriage & Divorce; Singles & Dating; Weddings; Other-Family & Relationships

Food & Drink Beer, Wine & Spirits; Cooking & Recipes; Entertaining; Ethnic Cuisine; Non-Alcoholic Drinks; Vegetarian & Vegan; Other-Food & Drink

Games & Recreation Amusement Parks; Board Games; Card Games; Gambling; Hobbies & Crafts; Toys; Video & Online Games; Other-Games & Recreation

Health Alternative Medicine; Dental; Diet &Fitness; Diseases & Conditions; General Health Care; Men’s Health; Mental Health; Optical; Women’s Health; Other-Health

Home & Garden Cleaning & Laundry; Decorating & Remodelling; DIY; Garden & Landscape; Maintenance & Repairs; Other-Home & Garden

Local Businesses Argentina; Australia; Austria; Brazil; Canada; France; Germany; India; Indonesia; Ireland; Italy; Mexico; New Zealand; Singapore; Spain; Switzerland; Thailand; United Kingdom; United States; Vietnam; Other-Local Businesses

News & Events Current Events; Media & Journalism; Other- News & Events

Pets Birds; Cats; Dogs; Fish; Horses; Reptiles; Rodents; Other-Pets

Politics & Government Civic Participation; Elections; Embassies & Consulates; Government; Immigration; International Organizations; Law & Ethics; Law Enforcement & Police; Military; Politics; Other-Politics & Government

Pregnancy & Parenting Adolescent; Adoption; Baby Names; Grade-Schooler; Newborn & Baby; Parenting; Pregnancy; Toddler & Preschooler; Trying to Conceive; Other- Pregnancy & Parenting

Science & Mathematics Agriculture; Alternative; Astronomy & Space; Biology; Botany; Chemistry; Earth Sciences & Geology; Engineering; Geography; Mathematics; Medicine; Physics; Weather; Zoology; Other-Science

Social Science Anthropology; Dream Interpretation; Economics; Gender & Women’s Studies; Psychology; Sociology; Other-Social Science

Society & Culture Community Service; Cultures & Groups; Etiquette; Holidays; Languages; Mythology & Folklore; Religion & Spirituality; Royalty; Other-Society & Culture

Sports Auto Racing; Baseball; Basketball; Boxing; Cricket; Cycling; Fantasy Sports; Football (American); Football (Australian); Football (Canadian); Football (Soccer); Golf; Handball; Hockey; Horse Racing; Martial Arts; Motorcycle Racing; Olympics; Outdoor Recreation; Rugby; Running; Snooker & Pool; Surfing; Swimming & Diving; Tennis; Volleyball; Water Sports; Winter Sports; Wrestling; Other-Sports

Travel Africa & Middle East; Air Travel; Argentina; Asia Pacific; Australia; Austria; Brazil; Canada; Caribbean; Cruise Travel; Europe (Continental); France; Germany; India; Ireland; Italy; Latin America; Mexico; Nepal; New Zealand; Spain; Switzerland; Travel (General); United Kingdom; United States; Vietnam; Other-Destination

Yahoo! Products My Yahoo! ; Yahoo! 360; Yahoo! Answers; Yahoo! Autos; Yahoo! Bookmarks; Yahoo! Finance; Yahoo! Groups; Yahoo! Local; Yahoo! Mail; Yahoo! Message Boards; Yahoo! Messenger; Yahoo! Mobile; Yahoo! Music; Yahoo! Photos; Yahoo! Real Estate; Yahoo! Search; Yahoo! Shopping; Yahoo! Small Business; Yahoo! Toolbar; Yahoo! Travel; Yahoo! Widgets; Other-Yahoo! Products

Best of Answers Special Guests; Arts & Humanities; Beauty & Style; Business & Finance; Cars & Transportation; Computers & Internet; Consumer Electronics; Education & Reference; Food & Drink; Games & Recreation; Home & Garden; Pets; Politics & Government; Pregnancy & Parenting; Science & Mathematics; Society & Culture; Sports; Travel

Table 3.2. Yahoo! Answers categories.
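As a small, assumed illustration of the leaf-only placement rule, the sketch below models a toy slice of the “Computers & Internet” branch from Table 3.2 as nested dictionaries and checks whether a category path ends at a leaf; the nesting shown is only the example quoted in the text, not the full hierarchy:

# Hedged sketch: a question may only be filed under a leaf (bottom-level)
# category; a dict of dicts models a small slice of the Table 3.2 hierarchy.
CATEGORY_TREE = {
    "Computers & Internet": {
        "Computer Networking": {},          # no second-level subcategories
        "Hardware": {
            "Add-on": {}, "Desktop": {}, "Laptop": {}, "Monitors": {},
        },
    },
}

def is_valid_placement(path):
    """path: e.g. ("Computers & Internet", "Hardware", "Desktop")."""
    node = CATEGORY_TREE
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return node == {}   # valid only if the path ends at a leaf

print(is_valid_placement(("Computers & Internet", "Computer Networking")))  # True
print(is_valid_placement(("Computers & Internet", "Hardware")))             # False: not a leaf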


It is observed that the majority of users prefer to answer questions in certain categories only. Table 3.3 shows the crossover of users between categories. The value in each cell indicates the percentage of users who participate in the row category and who participate only in the intersection of the row and column categories. For example, 60.76% of the users who provide answers in the “Arts & Humanity” category participate only in this category. 11.85% of users who provide answers in the “Arts & Humanity” category also provide answers to questions asked in the “Science & Math” category, but do not participate in “Sports”. 15.12% of users providing answers in the “Arts & Humanity” category also provide answers in the “Sports” category, but not “Science & Math”. Finally, 12.27% of users who provide answers in the “Arts & Humanity” category participate in all three categories (“Arts & Humanity”, “Science & Math” and “Sports”). A test conducted on the “Arts & Humanity”, “Science & Math” and “Sports” categories shows that only around 2% of users within the test dataset answer across all three categories. A “Top Contributor” often focuses the majority of their contributions on one category; users' expertise is limited to certain categories. When deciding the best answer to a question, consideration should therefore be given to the user's expertise in the category in which the question is asked.

Category Arts & Science & Math Sports All Humanity

Arts & Humanity 60.76% 11.85% 15.12% 12.27%

Science & Math 9.10% 65.46% 15.47% 9.97%

Sports 4.25% 5.95% 85.95% 3.85%

Table 3.3. The crossover of users between categories.
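In principle, the percentages in Table 3.3 can be recomputed from per-user participation sets. The sketch below uses toy data and illustrative names (not the thesis dataset) and mirrors the table's layout: the diagonal counts users active only in the row category, off-diagonal cells count users active in exactly the row and column categories, and the "All" column counts users active in all three:

# Hedged sketch: crossover percentages in the style of Table 3.3, computed
# from per-user participation sets (toy data, not the thesis dataset).
CATEGORIES = ["Arts & Humanity", "Science & Math", "Sports"]

# user -> set of categories the user has answered in.
participation = {
    "u1": {"Arts & Humanity"},
    "u2": {"Arts & Humanity", "Sports"},
    "u3": {"Arts & Humanity", "Science & Math", "Sports"},
    "u4": {"Science & Math"},
    "u5": {"Sports"},
}

def crossover_row(row_cat):
    members = [cats for cats in participation.values() if row_cat in cats]
    total = len(members)
    row = {}
    for col_cat in CATEGORIES:
        if col_cat == row_cat:
            # participates in the row category only
            count = sum(cats == {row_cat} for cats in members)
        else:
            # participates in exactly the row and this column category
            count = sum(cats == {row_cat, col_cat} for cats in members)
        row[col_cat] = 100.0 * count / total
    row["All"] = 100.0 * sum(cats == set(CATEGORIES) for cats in members) / total
    return row

for cat in CATEGORIES:
    print(cat, crossover_row(cat))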

3.7 CONCLUSION

Yahoo! Answers is a cQA portal; that is, users receive answers through collaboration. Unlike Wikipedia, information seeking in Yahoo! Answers is focused retrieval. From the analysis of the Yahoo! Answers mechanism, it can be noted that Yahoo! Answers encourages user participation by awarding points to users. Large participation may boost the popularity of Yahoo! Answers; however, the quality of answers may vary. Graph representations show examples of Yahoo! Answers question authors, answer authors and their relations in tripartite and bipartite graphs. A tripartite graph provides a clear illustration of users' ask and answer relations, such as who asks which questions and who answers which questions. A bipartite graph is not as clear as a tripartite graph in terms of identifying the role of users (question author or answer author); it is a simplified graph which aims to represent a certain part of the relations, such as the connections between users regardless of their role (like the best answer relation).

Further analysis of Yahoo! Answers focused on four parts: the Bow Tie structure, degree centrality, the possibility of spamming and the hierarchical structure. As a result, it was found that Yahoo! Answers has an unbalanced network structure with a low percentage of users in the “In” component and a comparatively high percentage of users in the “Core” and “Out” components, which together account for 74.73% of users in the Bow Tie structure. As with many phenomena that occur in nature, users' asking and answering behaviour follows a power law. Indegree and outdegree analysis revealed that a great number of users ask or answer only a few questions, and only a small number of users ask or answer a great number of questions. Corresponding to this power law phenomenon, the majority of users should receive low expertise scores because of their small contribution, and only the few users who provide a large number of answers or best answers should be given a high expertise score.

The observation is made that spamming is a serious problem in Yahoo! Answers, which would affect the quality of the expertise scores if the HITS algorithm were adapted to measure user expertise in Yahoo! Answers. Under HITS, the authority score of a user is decided by the quality of the hubs whose questions the user has answered, and the quality of a hub is decided by the popularity of its questions among users. Because of the spamming problem, the popularity of a question (the number of answers received) does not always reflect the quality of the hub.

The questions in the Yahoo! Answers portal are organised under a hierarchical structure, where every question is assigned to one and only one category path. The majority of users are interested in answering under certain categories only. The statistics show that around 60% of users in the “Arts & Humanity” category answer only in this category, around 65% of users in the “Science & Math” category only respond to “Science & Math” questions, and around 85% of users in the “Sports” category only answer sports related questions.

In general, all of these findings inform the design of the proposed method to rank answers and recommend the best answer for Yahoo! Answers. The Bow Tie structure and the problem of spamming explain why the HITS algorithm alone is not effective in a QA portal. Degree centrality provides information on the user distribution, while the analysis of the hierarchical structure indicates the usability of the category information.


Chapter 4: Methodology

The purpose of Chapter 4 is to detail the plan and overview of the methodology. The design of the methodology is based on the findings of the analysis presented in Chapter 3. The methodology consists of two parts: a reputation-based (non-content) method and a content-based method. This chapter presents the steps of these methods and the reasons for choosing particular algorithms. It includes the following sections: Overview of Methodology, Reputation-Based Non-Content Expertise Score Calculation, NLP-Based Content Score, and Answer Score Fusion.

4.1 OVERVIEW OF METHODOLOGY

The proposed methodology has two parallel processes for determining which answers should be presented to users in a cQA. One process calculates a reputation-based score, based on the user interactions within the cQA. The second process calculates a content score, based on the content quality of answers in the cQA. Both the content and reputation-based scores are utilised to recommend the best answer to the question author. Because the reputation-based score is based on users' interactions with the cQA, a certain amount of favouritism is possible in a cQA such as Yahoo! Answers; this is due to the awarding of points that such portals offer, as some people are driven by the points rather than by providing help to other users, as shown by examples in the preceding chapters. The content-based method, on the other hand, overcomes this limitation of the reputation-based method by analysing the content of answers using IR and NLP techniques. However, a content-based method by itself cannot reliably select a good or even the best answer: as discussed in Chapter 2, research reveals that content-based methods suffer from either low recall or low accuracy. Therefore, it is hoped that better performance in terms of accuracy can be obtained by combining these two processes.

In order to find the best answer to a question, the question and its associated answers are analysed for the purpose of calculating the content score and the reputation score. Figure 4.1 illustrates the overview of the proposed method. To calculate the reputation score, all of the answer authors' IDs are passed to the “Extract User Reputation” process. This process receives input from two databases: one that stores users' answering history in all categories in the cQA network, and another that stores users' answering history in the specific category in which the question was asked. The “Extract User Reputation” process generates the local reputation score based on the specific category history and the global reputation score based on the all-category history. Mostly, the local reputation score is used because the question is answered under the category in which it was posted; the local reputation score evaluates the answer authors' reputation in this specific category. The HITS hub and authority scores, by contrast, are calculated from users' information across all categories, and the local reputation score in one category cannot describe the performance of a user who only participates in other categories. In order to compare the performance of HITS and the proposed reputation method, the global reputation score should therefore also be taken into consideration. The local reputation score or the global reputation score for the answer authors is then passed to the Answer Score Fusion module.

For calculating the content score, the first task is to analyse the question type via the “Question Type Analysis” process. In parallel, answers are processed to determine the answer type in the “Name Entity Recognition” process. If the question type matches the answer type, then a bonus score is given to the answer. Next, an ideal (or expert) answer is obtained: a high performance QA system is used to retrieve this answer for the question. The cosine similarity between a user's answer and the ideal answer from the selected QA system is used to determine the content score of that answer in Yahoo! Answers; if the cosine similarity between the user-submitted answer and the expert QA system's answer is high, then the content score for the answer is high. Finally, the content score is combined with the reputation score in the Answer Score Fusion module. The best answer is the answer with the highest score from either the reputation score, the content score or the combined score.
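A minimal sketch of this content-scoring idea is given below, assuming TF-IDF cosine similarity via scikit-learn and a flat bonus for question/answer type agreement; the bonus value, function names and example texts are illustrative assumptions, and the expert QA system and named-entity step are not reproduced here:

# Hedged sketch of the content-score idea: cosine similarity between each
# user answer and an ideal (expert) answer, plus a bonus when the answer's
# type matches the expected question type. Parameter values are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TYPE_MATCH_BONUS = 0.1   # assumed flat bonus, not the thesis's exact value

def content_scores(ideal_answer, user_answers, question_type, answer_types):
    texts = [ideal_answer] + user_answers
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    scores = []
    for sim, a_type in zip(sims, answer_types):
        bonus = TYPE_MATCH_BONUS if a_type == question_type else 0.0
        scores.append(sim + bonus)
    return scores

ideal = "The prisoner of Azkaban's theme is that things are not always as they seem."
answers = ["Things are not always what they seem.", "Prison."]
print(content_scores(ideal, answers, question_type="description",
                     answer_types=["description", "entity"]))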


Figure 4.1. Flowchart of process model.

4.2 REPUTATION-BASED METHOD

As discussed in Chapter 2, there are no fixed criteria regarding which features should be used in calculating a reputation score. Several reputation approaches have been proposed that judge the quality of answers using features of the system other than the contextual features of the questions and answers themselves. Joen (2006) suggests that good answers tend to be relevant, informative, objective, sincere and readable. However, there is no standard metric to measure and represent the quality of documents using reputation features.

This thesis proposes a reputation-based non-content method to recommend the best answers. Reputation is the opinion (more technically, a social evaluation) of the public toward a person, a group of people, or an organization. Reputation is known to be a ubiquitous, spontaneous and highly efficient mechanism of social control in societies (Knoke & Yang, 2008). In Yahoo! Answers, it is noticed that an answer author who has a good reputation usually provides quality answers, while an answer author who has a bad reputation does not care about the quality of the answers they provide.

There is some research that utilizes information about users' reputation, applying it in e-commerce and academic intranets to evaluate the trustworthiness of the people involved. Zheleva (2008) builds a reputation system for spam filtering that makes use of the feedback of trustworthy users; the trustworthiness of email users depends on their direct interactions, that is, the spam reporting behaviour of each user. In Wikipedia, authors gain reputation when the edits they make to an article are preserved by subsequent authors, and they lose reputation when their edits are undone by subsequent authors (Adler & Alfaro, 2007). In the academic area, a reputation-based method has been applied to decide on the authoritativeness of paper authors as a measure of confidence in an author's ability to review papers (Baumgartner et al., 2000). A more famous example is eBay, where reputation is a function of the cumulative positive and non-positive ratings of a seller or buyer over several recent periods (Mui et al., 2002). The positive reputation of a seller on eBay has a positive influence and, in this case, helps them be more profitable (House & Wooders, 2001).

Reputation is a sociological aggregate of individuals' opinions about one another (Wasserman & Faust, 1994), and it is often quantified by centrality measures (Katz, 1953). The simplest approach to measuring an actor's reputation is the indegree of each actor, denoted d_i(n_i) by Wasserman (1994), where d_i is the indegree and n_i is node i; the sum of x_ji over j is the number of direct ties that the other nodes (out of g-1 total nodes, not counting node i) have pointing to node i. Reputation is related to the number of nominations or choices that one has received, so the measure can be written as shown in Equation 4.1. Furthermore, because the measure depends on the group size g, it can be normalised as shown in Equation 4.2. The prestige of node i is governed by the indegree of node i: the larger the indegree, the more prestige the actor has.

$$ P_D(n_i) = d_i(n_i) = \sum_{j=1}^{g} x_{ji}, \quad i \neq j \qquad \text{(Eq. 4.1)} $$

$$ P_D(n_i) = \frac{\sum_{j=1}^{g} x_{ji}}{g - 1} \qquad \text{(Eq. 4.2)} $$

However, Equation 4.2 is not appropriate for use in Yahoo! Answers. The first reason is the role of the group size in Equation 4.2. When this formula is applied in the Yahoo! Answers context, the reputation value is in most cases very small, given that the group size is defined as the size of the whole collection of Yahoo! Answers users. According to Wikipedia (2006), Yahoo! Answers had 60 million users by December 2006, while indegrees ranged from 0 to 1000; in many cases the reputation is rounded to 0 because the value falls below the threshold of significant digits. Even if the result of Equation 4.2 were multiplied by a sufficiently large scaling factor, the problem would lie in how to choose that number and justify the choice. Yahoo! Answers is a very large and sparse social network, so the calculation of reputation in Yahoo! Answers is difficult if Equation 4.2 is applied. The second reason is the very dynamic nature of Yahoo! Answers. We might wish to decrease the group size, for example by limiting it to users who answer only in the category where the question is posted. The value of g - 1 would then be smaller, but the group size is volatile because the number of users participating in a category varies continuously, which raises a new problem of the stability of the reputation value.
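The scale problem can be made concrete with the figures quoted above: with g around 60 million users and indegrees of at most about 1000, Equation 4.2 yields values on the order of 10^-5 or smaller. The toy calculation below is only a sketch of this argument, not part of the proposed method:

# Hedged sketch: why Equation 4.2 collapses toward zero for Yahoo! Answers.
g = 60_000_000          # approximate number of Yahoo! Answers users (Dec 2006)
for indegree in (1, 10, 1000):
    prestige = indegree / (g - 1)   # Equation 4.2: P_D(n_i) = sum_j x_ji / (g - 1)
    print(indegree, f"{prestige:.2e}")
# Even the largest observed indegree (about 1000) gives roughly 1.67e-05,
# which rounds to zero at modest numeric precision.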

This discussion reveals that the standard formula for reputation calculation in social networks cannot be applied to Yahoo! Answers, so a new approach based on the concept of indegree is proposed in this thesis. In the current Yahoo! Answers system, question authors and voters provide feedback to those who offer answers to questions, and to a great extent this feedback serves as useful information for deciding users' reputation. It is proposed in this thesis to measure user reputation based on the number of answers and the number of best answers a user has given. A user will receive a high reputation if they have answered many questions and if many of their answers have been chosen as the best answer; conversely, a user will receive a low reputation if they participate infrequently in responding to questions. The reputation of a user is decided by two values: the user's expertise level and the user's confidence level. The expertise level is determined by how active the user is and how good the user is, both in a subject area and overall. The confidence level reflects the user's reliability in providing good quality answers, and is determined by how good the user is at providing best answers.


It is noticed that users tend to be more interested in participating in one or only a handful of categories rather than participating in all or many categories. An analysis was conducted on how many users are active in all 3 categories (“Arts & Humanity”, “Science & Math” and “Sports”). There were only 3221 users out of a total of 158,079 users across the 3 categories who provided answers to questions posted in these 3 different categories. This is only around 2.04% of the total users who are globally (in this case the 3 categories of “Arts & Humanity”, “Science & Math”, and “Sports”) active. A user’s confidence and expertise level is different for each of the different category contexts.

Since a question must be posted under a set category, the local expertise score, which represents the user's score in a certain category, should be the first factor considered when deciding on the expertise of users in that category. If a user is new to a category but not new to other categories, the local score for the category the user is new to will be 0; however, their local score in the other categories in which they participate will not be 0. Thus a user's local score in one category does not affect their local score in any other category.

A global score, which indicates the user's overall rank amongst all the users across all the categories, should also be considered. This is because users may like to compare their reputation level in a holistic fashion, but the local score only counts the reputation of a user in a given category. If user A only answers questions in category A and user B only answers questions in category B, the local score of user A cannot be compared with the local score of user B. The global score bridges this problem of comparing reputation scores across different categories by using reputation feedback regardless of the category it comes from. Another reason why the global score is used is that HITS does not use category information; its hub and authority scores are calculated over all categories. It is possible to compare HITS with the local score in a certain category if a user has answered questions in that category only, but a better comparison of performance between HITS and the proposed reputation score can only be made by using the global score across all categories.

4.2.1 Local Reputation Score

The local reputation score measures a user's (i.e., answer author's) reputation in a single category. When a question is answered, the proposed method looks up the category in which the question was posted and retrieves the reputation of all the users who answered this question from the database that stores users' reputation scores in this category. The proposed method is then able to suggest the answer provided by the user with the highest reputation score amongst all the users who answered the question.

Let C denote the k categories in Yahoo! Answers, C = {C_1, C_2, …, C_k}. Let U denote all the answer authors in all k categories in Yahoo! Answers, U = {U_1, U_2, …, U_k}, where U_i is the set of the answer authors in category i. Let U_i contain the p answer authors in category i, U_i = {u_i1, u_i2, …, u_ip}. The reputation score of answer author j in category i is calculated by combining the user's confidence level and expertise level. The confidence level is calculated by a confidence score; the expertise level is calculated by a participation score and a best answer score. It is as follows:

$$ R(u_{ij}) = con(u_{ij}) \times \left( w_1 f(u_{ij}) + w_2 g(u_{ij}) \right) \qquad \text{(Eq. 4.3)} $$

where con(u_ij) is the confidence level/score of u_ij, calculated as in Equation 4.4:

$$ con(u_{ij}) = \frac{n}{m} \qquad \text{(Eq. 4.4)} $$

where n is the number of best answers that u_ij has been given in category i and m is the number of answers that u_ij provides in category i. w_1 and w_2 in Equation 4.3 are weighting scores with w_1 + w_2 = 1. The weights are decided by empirical analysis; the details of the choice of the weights are discussed in Section 5.3.

f(u_ij) in Equation 4.3 is the participation function, which gives the participation score of u_ij, and g(u_ij) in Equation 4.3 is the best answer function, which gives the best answer score of u_ij. Both the participation score and the best answer score range from 0 to 1, and the local reputation score also ranges from 0 to 1; zero indicates that the user is a new answer author and 1 indicates that the user is an expert with high reputation. Both the participation function f(u_ij) and the best answer function g(u_ij) are determined as:

$$ \frac{1}{1 + e^{-\frac{x - \mu}{\sigma}}} \qquad \text{(Eq. 4.5)} $$


where x is the number of answers provided by the answer author uij in the case of the participation function f (uij ) and x is the number of best answers provided by the answer author µ uij in the case of best answer function g(uij ) . is a threshold value above which the participation function or the best answer function begins to score 0.5. σ is the variation in the number of answers or the number of the best answers in a category. Let x be the average of the number of answers or the average of the number of the best answers in the category i. Let xij be the number of answers that answer author uij has provided in category i in the case of participation function f (uij ). Let xij be the number of best answers that answer author uij has provided in category i in the case of best answer function g(uij ) . Let t be the total number of unique answer authors in the category i. The calculation of σ is as Equation 4.6.

σ = √( ∑_{j=1..t} (x_ij − x̄)² / t )        Eq. 4.6

Expertise level is a measure of a user’s indegree and is decided by the participation function f(u_ij) and the best answer function g(u_ij). The participation and best answer functions determine how good a user is in a subject area as well as overall. Expertise level combines the information on how active the user is (through the participation function) with the information on how good the user is at answering questions (through the best answer function). Both the participation and best answer functions are considered for the expertise level, rather than just the best answer function, which reflects only how good the user is at providing best answers. This reduces the possibility of a situation where a user has a high percentage of best answers but low participation in providing answers. For example, a user provides 70 answers in the “Sports” category, of which 60 are rated as best answers, so this user appears to be highly rated at providing best answers. However, on average, users in this category provide 20 best answers while submitting an average of 140 answers. From this example, it can be observed that the user may be rated highly in providing best answers, but the user has certainly not attained the level of a so-called expert because they have not provided enough answers: their participation is low. The simple reason for the use of Equation 4.6 is that the goodness or badness of a user’s expertise is relative; only when the user’s numbers are compared to the number distribution of the whole system can an idea of how good or bad a user’s expertise is be obtained. More details of Equation 4.6 are explained in the next paragraph.


The confidence score ensures that high reputation users provide a high proportion of best answers out of the total number of answers they provide. Consider a scenario where a user provides a high number of answers, and a high number of best answers, compared to other users in the same category, yet the ratio of the user’s best answers to their total answers in the category is low. If the reputation measure only included the expertise score, which is the combination of the participation score and the best answer score, then this user would be deemed to have a high reputation, which would be misleading. Therefore, the confidence score is taken into consideration when determining one’s reputation score.

It is important to consider the distribution of the answers/best answers that users provide. Figures 4.3 to 4.8 show how users behave in a social collaborative network in terms of the quantity of answers or best answers they provide. These figures reveal that very few users answer a large number of questions; the majority of users provide answers to fewer than 10 questions. For example, the distribution of answers in the “Arts & Humanity” category (Figure 4.3) shows that more than 10,000 users out of 24,804 answer authors in the current Yahoo! Answers system offered only one answer to the questions posted in the Arts & Humanity category, and a very small number of users offered answers to more than 10 questions. Equation 4.5 is developed to reflect this trend. Equation 4.5 is a variation on the sigmoid function. The sigmoid function, as shown in Figure 4.2, has the property that when x goes to minus infinity, y goes to 0, and when x goes to positive infinity, y goes to 1.

The expected score distribution calculated using the participation and best answer functions should result in only a few active users getting a very high score; the majority of users, who are not highly active, should get a low score. This reflects the sigmoid distribution. An answer author receives little reward when they provide a small number of answers or best answers, far fewer than the average provided by users, and a big reward when they provide a large number of answers or best answers, far more than the average. In order to adapt the sigmoid function for the participation and best answer functions, some changes to the formula are necessary. The range of x is set to start from 0 and go to positive infinity. At the point x = µ, the function value is 1/2, and the variation value is σ. The threshold value µ in the participation and best answer functions accounts for this phenomenon: µ is the value above which the participation and best answer functions begin to score 0.5. It cannot simply be the average number of answers or best answers provided by answer authors in a given category, because the majority of answer authors provide only a low number of answers/best answers, and thus the average value would make the threshold too low.


To determine µ for a category, it is necessary to determine the highest total number of answers or best answers an answer author can have in the category and still be included in the lowest 99.8% of users; thus, only 0.2% of users have a higher total number of answers or best answers. µ is this total number of answers (or best answers) divided by two and rounded down to the nearest whole integer. The usage pattern also indicates that there is a variation in usage among highly active users (the top 0.2%). For example, in the “Sports” category one answer author provides 831 answers and the next most active answer author provides 1603 answers; as no other answer author provides a total number of answers between 831 and 1603, there is a big gap between those two numbers. To distinguish among heavy users, the variation factor σ is included in the participation and best answer functions. The detailed algorithm to get the local reputation score is presented in Figure 4.9.
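To make these parameter choices concrete, the following Python sketch (illustrative only: the function and variable names are my own, and the 99.8% cut-off and halving rule simply follow the description above) derives µ and σ for a category from the per-user totals and then computes a local reputation score according to Equations 4.3 to 4.6.

    import math

    def mu_sigma(counts):
        # counts: list of per-user totals (answers or best answers) in one category.
        # mu is half of the highest total still inside the lowest 99.8% of users,
        # rounded down; sigma is the spread of the totals as in Eq. 4.6.
        ordered = sorted(counts)
        cutoff = max(int(len(ordered) * 0.998) - 1, 0)   # last user inside the lowest 99.8%
        mu = ordered[cutoff] // 2
        mean = sum(counts) / len(counts)
        sigma = math.sqrt(sum((x - mean) ** 2 for x in counts) / len(counts))
        return mu, sigma

    def s_curve(x, mu, sigma):
        # Modified sigmoid of Eq. 4.5: equals 0.5 at x = mu and approaches 1 for heavy users.
        return 1.0 / (1.0 + math.exp(-(x - mu) / sigma))

    def local_reputation(n_answers, n_best, mu_a, sigma_a, mu_b, sigma_b, w1=0.5, w2=0.5):
        # Eq. 4.3: confidence score times the weighted expertise level.
        confidence = n_best / n_answers if n_answers else 0.0   # Eq. 4.4
        participation = s_curve(n_answers, mu_a, sigma_a)       # f(u_ij)
        best_answer = s_curve(n_best, mu_b, sigma_b)            # g(u_ij)
        return confidence * (w1 * participation + w2 * best_answer)

With the “Sports” values later reported in Table 5.2 (µ = 88, σ ≈ 17.98 for answers; µ = 18, σ ≈ 4.42 for best answers), the user from the 70-answer example above would obtain a participation score of about 0.27, a best answer score close to 1, a confidence score of 60/70 ≈ 0.86, and hence a local reputation score of roughly 0.54 with equal weights.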

Figure 4.2. Sigmoid function (www.wikipedia.com).


[Bar chart: “Answers Distribution in Arts & Humanity”; x-axis: the number of answers; y-axis: the number of users.]

Figure 4.3. Answers distribution in Arts & Humanity.

Figure 4.4. Best answer distribution in Arts & Humanity.


Figure 4.5. Answer distribution in Science & Mathematics.

Figure 4.6. Best answer distribution in Science & Mathematics.


Figure 4.7. Answer distribution in Sports.

Figure 4.8. Best answer distribution in Sports.


(1) Let x_ij ∈ x_i, where x_ij is the total number of answers that user j provides in category i, and x_i is the total number of answers that all users have provided in category i.
(2) Let y_ij ∈ y_i, where y_ij is the total number of best answers that user j provides in category i, and y_i is the total number of best answers that all users have provided in category i.
(3) Let a_mn be the id of the user who provides an answer to question m, where m is the question id and n is the total number of answers for question m.
(4) Let b_m be the id of the best answer provider, where m is the question id.
(5) For each category c_i ∈ C (i = 1 to the total number of categories)
        Get µ, σ from the distribution table of answers for c_i
        Get µ_b, σ_b from the distribution table of best answers for c_i
        For each u_ij ∈ U_i
            Initialize x_ij = 0
            Initialize y_ij = 0
        End For
        For each u_ij ∈ U_i
            If u_ij = a_mn
                x_ij = x_ij + 1
            End If
            If u_ij = b_m
                y_ij = y_ij + 1
            End If
        End For
        For each u_ij ∈ U_i
            Calculate participation score = 1 / (1 + e^(−(x_ij − µ)/σ))
            Calculate best answer score = 1 / (1 + e^(−(y_ij − µ_b)/σ_b))
            Calculate confidence score = y_ij / x_ij
            Calculate reputation score = confidence score × (w1 × participation score + w2 × best answer score)
        End For
    End For

Figure 4.9. Local reputation score algorithm.


4.2.2 Global Reputation Score

A user’s local reputation score is independently calculated for each category to which they contribute. However, to indicate the user’s reputation within the overall network, the global reputation score is used to measure the reputation of an answer author across all the categories. The calculation of the global reputation score is similar to the calculation of the local reputation score, including the confidence level and the expertise level. The only difference is that the number of answers and best answers is counted across all the categories, instead of just the specific category used when determining the local reputation score. The confidence level, which is determined by the confidence score, is calculated as in Equation 4.4, but n stands for the number of best answers across all the categories and m stands for the number of answers across all the categories. The expertise level is decided by the user’s participation score and best answer score; this is because the distribution of the number of answers or best answers at the global level shows the same trend as it does in the individual categories (see Figures 4.10 and 4.11 for details). Thus, the formulas for the participation score and best answer score are the same as the formulas for the local participation score and best answer score: x is either the number of answers (for the participation score) or the number of best answers (for the best answer score) that a user provides at the global level; µ is the threshold value above which the participation and best answer functions begin to score 0.5 at the global level; and σ is the variation factor for the participation and best answer functions at the global level. The method used to decide the values of µ and σ is the same as the method used to decide their values in the local reputation function. The reputation function is also the same as Equation 4.3, but with the confidence, participation and best answer functions evaluated at the global level.
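Since the global score reuses the same machinery, a minimal sketch (again with illustrative names) only needs to aggregate a user’s counts across categories before applying the same functions with the global parameters:

    # Illustrative only: per_category maps category name -> (answers, best_answers) for one user.
    per_category = {"Sports": (70, 60), "Science & Mathematics": (8, 2)}

    global_answers = sum(a for a, _ in per_category.values())
    global_best = sum(b for _, b in per_category.values())
    # The global reputation score is then computed exactly as in Section 4.2.1,
    # but with the global mu and sigma values (the last two columns of Table 5.2).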


Figure 4.10. Global answer distributions.

Figure 4.11. Global best answers distribution.


4.3 CONTENT METHOD

Researchers have previously used non-content-based methods such as link analysis and statistics to recommend the best answer in a cQA. To the best of our knowledge, a content-based method has not been applied to recommend the best answer in a cQA such as Yahoo! Answers. It is necessary to take a content-based method into consideration when recommending the best answer. By doing so, it is hoped that the performance of the combined methods of non-content and content scoring will be better than either one of the methods alone.

As discussed in Chapter 2, a modern QA system normally contains “Question Analysis”, “Search” and “Answer Extraction” modules. This section will discuss the “Question Analysis” module which includes question type analysis and classification. As an expansion to the analysis of the question, the use of WordNet (http://wordnet.princeton.edu/) will be covered. Finally, a comparison of Question Answering engines will be discussed.

4.3.1 Question Type Analysis

Question type analysis is the first step for most content-based QA systems. A good question type analysis feature assists a question answering system by identifying the most critical part of the question that needs to be addressed in the answer. For example, the question “Who wrote Harry Potter?” asks for the author of the Harry Potter book series. This type of question can be classified as a “person” type, and a person’s name should appear in the desired answer. Through question type analysis, irrelevant answers can be filtered out before the process of comparing the answers from Yahoo! Answers with the answers from the expert question answering system.

Machine learning is one of the most frequently used approaches for question type analysis. Classification is the most common formulation of the question type analysis problem, and decision trees and the support vector machine (SVM) have previously been used to successfully classify a question into a certain question type (Li & Roth, 2002; Zhang & Lee, 2003). In this research, SVM is applied to the question classification problem. SVM has been reported to achieve the highest classification accuracy when compared against algorithms such as Nearest Neighbours, Naïve Bayes and Decision Trees (Zhang & Lee, 2003).

4.3.1.1 Support Vector Machine (SVM)

The basic idea behind the support vector machine is that the original vector space can be separated by a line (Michelakis et al., 2004). The input data is categorized into two classes, y_i ∈ {−1, 1}. It is assumed that the input data has d dimensionality, {(x_1, y_1), (x_2, y_2), …, (x_d, y_d)}, and each input data item belongs to one of the two classes. The general function for the two classes can be found if the minimum of ||w||² exists, as follows:

y_i (w · x_i − b) ≥ 1        Eq. 4.7

Training examples satisfying the equation are called support vectors. Furthermore, they form the two hyperplanes when the distance between the two hyperplanes (the margin) is maximized (while the minimum of ||w||² is found). To find the minimum of ||w||², the following function should be used:

W(α) = ∑_{i=1..N} α_i − 0.5 ∑_{i=1..N} ∑_{j=1..N} α_i α_j (x_i · x_j) y_i y_j        Eq. 4.8

where α_i ≥ 0 and the x_j are training vectors. The decision function can then be expressed as:

F(x_j) = sign{ w* · x_j − b }        Eq. 4.9

where

w* = ∑_{i=1..r} α_i y_i x_i        Eq. 4.10

The advantage of linear SVM is that its performance is independent of the parameters when the number of training data instances is over 50 (Vapnik, 1997). Being easy to construct (information that is easy for people to provide is fed into the system) and to update is another advantage (Dumais, 1998). Furthermore, the error rate tends to be smaller than when other algorithms are adopted, even though the aim of SVMs is not to directly reduce the error rate. Even if an unequal number of training data items is distributed between the two classes, the position of the support vectors will not change. However, the disadvantages are that the training time can be large if there are a large number of training examples, and execution can be slow for nonlinear SVM (Drucker & Wu, 1999).

4.3.1.2 Question Type Class

Question classification means assigning a semantic category to a question (Zhang & Lee, 2003). Zhang and Lee (2003) proposed a two-layered question taxonomy (as shown in Table 4.1). This taxonomy has 6 coarse grained categories and 50 fine grained subcategories. Although the coarse grained categories can work for the question type analysis problem, a fine grained category definition is more beneficial in locating and verifying plausible answers.

Coarse   Fine
ABBR     abbreviation, expansion
DESC     definition, description, manner, reason
ENTY     animal, body, color, creation, currency, disease/medical, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word
HUM      description, group, individual, title
LOC      city, country, mountain, other, state
NUM      code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight

Table 4.1. The coarse and fine grained question categories.

The question type can be analysed by classifying the question into well defined question categories. This research applies the SVM model to this task. The result of the question classification can reduce the amount of time spent on selecting the best answer later on. Figure 4.12 presents the algorithm used to implement question classification by SVM. Each question type category has unique words, and these are the words that differentiate it from other question type categories. In this experiment, the Top-20 terms are selected for each category according to the importance of the terms for the category using tf-idf. Experiments were performed to select the Top-n terms to represent a category: the cross validation accuracy for the Top-20, Top-50, Top-100, Top-200 and Top-500 terms per category was determined, and using the Top-20 terms per category achieved the highest cross validation accuracy of all the tests. Because there are 50 categories, the number of SVM features is 50 × 20 = 1000.
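As an illustration of this setup, the sketch below builds the 1000-feature space from the top 20 TF-IDF terms per category and trains an SVM classifier. It uses scikit-learn, which is an assumption on my part; the thesis does not name the SVM toolkit used, and the training data layout here (parallel lists of question strings and labels) is a placeholder.

    from collections import defaultdict
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def top_terms_per_class(questions, labels, k=20):
        # Pick the k highest TF-IDF terms for each of the 50 fine-grained categories.
        by_class = defaultdict(list)
        for text, label in zip(questions, labels):
            by_class[label].append(text)
        vocab = set()
        for texts in by_class.values():
            vec = TfidfVectorizer(stop_words="english")
            scores = vec.fit_transform([" ".join(texts)]).toarray()[0]   # one document per class
            terms = vec.get_feature_names_out()
            top = sorted(zip(scores, terms), reverse=True)[:k]
            vocab.update(term for _, term in top)
        return sorted(vocab)   # about 50 * 20 = 1000 features

    def train_question_classifier(questions, labels):
        vocab = top_terms_per_class(questions, labels)
        vec = TfidfVectorizer(vocabulary=vocab)       # words outside the vocabulary are ignored
        X = vec.fit_transform(questions)
        clf = SVC(kernel="rbf")                       # C and gamma would be tuned by cross validation
        print("cross validation accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
        return clf.fit(X, labels), vec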

Several experiments were conducted based on the algorithm presented in Figure 4.12; a more detailed experiment description can be found in Chapter 5, Section 5.2.2. A validation accuracy of 83% was achieved for the training dataset. However, when 500 randomly selected questions from Yahoo! Answers were applied, the accuracy dropped to only 18%. The question type classification accuracy for the testing data was checked by manually deciding on the question type of each question in the testing data and then comparing it with the question type assigned by the system in the test results. Because of the low accuracy obtained in identifying the question type of questions posted in Yahoo! Answers, question type analysis is not added into the proposed system.


(1) Access 5000 labeled questions as training data
    // training data includes 50 types of questions, available from
    // http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC
(2) Do stemming and stop word removal
(3) For each question category (C_i, where i = 1 to 50)
        Calculate TF-IDF scores for all the words
(4) For each question category
        Choose the top 20 words according to TF-IDF score
(5) Set up the SVM features (20 top words/category × 50 categories = 1000 features)
(6) Perform n-fold cross validation on the training data
(7) Choose the parameters c and gamma from the best performing SVM model
(8) Load the test data
(9) For each question
        For each word
            If the word in the testing data is also available in the SVM features
                Keep the word
            Else
                Discard the word
(10) Perform SVM classification on the test data with the parameters from training
(11) Get the result

Figure 4.12. Question type analysis algorithm.

4.3.2 Named Entity Recognition (NER)

The role of Named Entity Recognition (NER) is to take atomic elements contained within text and place them into predefined categories. The predefined categories include proper names such as people, organizations and locations, as well as times, quantities, monetary values and percentages (Borthwick, 1999). The use of NER in this study is to filter out answers that are poorly related to a question so that the process of recommending the best answer can be shortened. For example, take the question, “Who was the legendary Trojan prince whose descendants founded Rome?” Answer author 1 replies: “Aeneas. Of course, his wife died in the fire when Troy was set on fire by the Greeks. This way….” and answer author 2 replies: “Are you sure there is such person?” From the previous task, Question Analysis, the question may be classified as HUM:individual (see Table 4.1). An NER tool then tries to find words related to a name. In answer 1, NER can identify “Aeneas” as a person’s name; in answer 2, NER cannot find any words related to a person’s name. By using an NER tool, irrelevant answers such as answer 2 can be discarded and will not be processed in the next step.
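The kind of filtering described above can be sketched as follows. This uses the spaCy library, which is an assumption on my part (it is not one of the tools evaluated in this study), and the model name is a placeholder for whichever English model is installed.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumed: a small English model is installed

    def mentions_person(answer_text):
        # Keep an answer only if it contains at least one PERSON entity.
        doc = nlp(answer_text)
        return any(ent.label_ == "PERSON" for ent in doc.ents)

    answers = [
        "Aeneas. Of course, his wife died in the fire when Troy was set on fire by the Greeks.",
        "Are you sure there is such person?",
    ]
    # For a question classified as HUM:individual, discard answers with no person name.
    relevant = [a for a in answers if mentions_person(a)]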


Several NER tools were considered for conducting this part of the research. The NER tools examined in this study include Balie (http://balie.sourceforge.net/), MinorThird (http://minorthird.sourceforge.net) and GATE (http://gate.ac.uk/).

Baseline information extraction (Balie) is a system for multilingual textual information extraction. Balie’s features for information extraction include finding and structuring data from free-written texts, named entity recognition, abbreviation resolution, key phrase extraction, identification of semantic roles, and a number of specific tasks such as finding emails and URLs. Balie is based on machine learning techniques. It is reportedly intended for programmatic use and for inclusion in advanced IE systems (Nadeau, 2005).

MinorThird stands for “Methods for Identifying Names and Ontological Relationships in Text using Heuristics for Identifying Relationships in Data” (Cohen, 2004). It is a toolkit that allows users to store text, annotate it manually or programmatically, learn to extract entities, and categorize text. The documents are stored in a database called a TextBase. A user can write logical assertions (such as the types and properties of text tokens) about the documents in the TextBase, which can be saved in objects called TextLabels. There are four types of functions. The first is the extraction function, which extracts portions of a document, such as names, places and noun phrases. The second is the classification function, which is able to classify an email as real or spam. The third is a function called mixup, a program that annotates the documents based on rules already defined in the program. The final function, called ApplyAnnotator, applies a saved classifier to a set of documents to output a set of predicted labels. The most useful function for this research is the extraction function. To implement this function, three processes are involved: a train extractor process, a test extractor process and a train-test extractor process. Because the first process needs training data that may require hand-annotation, which would involve a lot of manual work, this software was not selected for this research.

GATE is an architecture, a framework and a development environment for language engineering (Cunningham, 2002). The GATE architecture has facilitated the development of a number of successful applications for various language processing tasks such as Information Extraction, dialogue and summarisation, the building and annotation of corpora, and quantitative evaluations of language engineering applications (Bontcheva, 2004). The key features of GATE are described in Bontcheva’s paper as follows:


• Component-based development reduces the system integration overhead in collaborative research.
• Open source, documented and supported software enhances the repeatability of experiments.
• Automatic performance measurement of language engineering components promotes quantitative comparative evaluation.
• Distinction between low-level tasks such as data storage, data visualisation, discovery and loading of components and the high-level language processing tasks.
• Clean separation between data structures and algorithms that process human language.
• Consistent use of standard mechanisms for components to communicate data about language, and use of open standards such as Unicode and XML.
• Insulation from idiosyncratic data formats – GATE performs automatic format conversion and enables uniform access to linguistic data.
• Provision of a baseline set of language engineering components that can be extended and/or replaced by users as required.

GATE is a GUI-based program which allows a user to conduct their experiment using a graphical interface. However, a command line-based program is easier for conducting a research experiment when importing and exporting data is less frequent.

Although NER software can successfully identify “Person”, “Location” and “Organization” entities, it cannot recognize pronouns, which appear quite frequently in answers. If NER were applied to the proposed system, it would need extra steps, which would add to the complexity of the system, and it is only effective for “who” and “where” types of questions. Therefore, considering the time constraints on this research, it was decided not to add an NER process into the system.

4.3.3 Question Answering Systems as an Expert

A content-based method compares the user answers with an ideal (or expert) answer to check their quality. It is a big challenge to determine what the expert answer is. An expert answer could be obtained using a Web-based search engine such as Google, or online Wikipedia, by inputting the keywords that appear in the question. However, as discussed earlier, these systems do not provide a focused retrieval feature and will retrieve whole documents that may or may not contain the exact answer to the question. Consequently, this research utilises a QA system that supports focused retrieval to help decide on the quality of a user’s answers. The selection of a QA system is crucial, as the content score will be affected by how well the answer returned by the QA system and the answers submitted by a user (which are being evaluated) agree. Several QA systems are studied and their test results compared in order to select the best QA system available.


4.3.3.1 Comparison of Various QA Systems

The candidate QA systems under consideration include START (http://start.csail.mit.edu/), AnswerBus (http://www.answerbus.com/index.shtml), Ephyra (http://www.ephyra.info/), QuALiM (http://demos.inf.ed.ac.uk:8080/qualim/) and askEd! (http://wikiferret.com/edw/pc/).

SynTactic Analysis using Reversible Transformations (START) was the first online Web-based QA system, having been available since December 1993 (http://start.csail.mit.edu/). The key techniques used are Knowledge Annotation and Knowledge Mining. Knowledge Annotation connects information seekers to information sources by associating the query with a database schema that indicates the object, property and value of the query. Knowledge Mining is based on the theory that as the size of a text collection increases, the occurrence of a correct answer tends to also increase. In the Knowledge Mining process, the exact query, which employs pattern matching techniques, and the inexact query, which is the original query, are generated as the input questions. Documents are then retrieved from the search engine, and those documents that are returned frequently are kept for the Answer Boosting step. The purpose of Answer Boosting is to double check the documents to prove their relevance to the question and to drop any extra words. The last step is Answer Projection, whose purpose is to extract the passages where the answer is located and rank each answer according to its relevance (Lin et al., 2002; Lin & Katz, 2003; Katz et al., 2003).

AnswerBus is an open-domain QA system based on sentence-level Web information retrieval (Zheng, 2002). The structure of AnswerBus is that of a typical modern QA system. AnswerBus has a Question Type identification module in which NER is used and the correlations between question types and expected answer words are stored in a database. In the Search Engine Selection and Query Generation module, 2 out of 5 search engines are picked and the query for them is generated, taking into consideration the focus of the question and synonyms of the question terms. In the Sentence Extraction module, the score for each answer is generated according to the terms that the candidate answers and the query (including the expected answer type) have in common. Finally, in the Answer Ranking module the answers are returned, ranked by their scores.

Ephyra is organized as a pipeline of standardized components for question analysis, query generation, search, and answer extraction and selection (Schlaefer et al., 2006). It is a typical pattern-based QA system. The question normalization component drops punctuation and quotation marks and modifies verb constructions. The query generator transforms the question string into one or more queries by rephrasing the question and anticipating the format of the expected answer. The question interpreter utilizes a pattern learning approach, while the search component includes knowledge mining tools such as Google and knowledge annotation tools such as Wikipedia for the retrieval of relevant information for the query. Finally, the search results are processed by a set of filters such as a sentence segmentation filter, a keyword filter and an answer type filter (Schlaefer et al., 2006).

The QuALiM system only answers questions about topics available in Wikipedia and employs two answer strategies that contain a fallback mechanism and a rephrasing algorithm (Kaisser, 2004). The fallback mechanism expands queries based on keywords and key phrases from the question, using three rules/queries that are applied for query expansion. The first query contains all non-stop-words from the question. The second contains all noun phrases from the question. The third query contains all noun phrases and all non-stop-words that do not occur in the noun part-of-speech. The rephrasing algorithm consists of 3 parts. The first part generates a sequence to match the question. For example, for “When did Floyd Patterson win the title?” the sequence is (1) when, (2) did, (3) NP, (4) verb as part of speech, (5) NP or PP, (6) “?”. In the second part, the rephrasing algorithm generates a target template which is applied to answer sentences; the proposed target template indicates the sequence of words that is expected from the candidate answer. For the final part, the rephrasing algorithm generates an answer type to filter out some answers (deemed inappropriate as their type does not match the filter).

askEd is a special QA system which does not depend on any linguistic knowledge (Whittaker et al., 2005). Rather, askEd considers the QA task to be a classification problem. The claim is that the redundancy of data is effective enough for data expansion, so that query expansion using complex linguistic analysis such as question type analysis and semantic analysis can be ignored. Search engines in this case are used as tools for the retrieval of documents instead of passages. askEd transforms the task of finding the best answer into a mathematical equation. To optimize the equation, three models are applied. The first is the retrieval model, where the probability of an answer sequence, given a set of information bearing features, is calculated. The second is a filter model, where the probability of the different ways of asking a question matching the classes of valid answers is calculated. The final model is the length model, which focuses on the probability of answer length given the type of question that is being asked. The performance of this system is reported to be competitive with other contemporary QA systems (Whittaker et al., 2005).

The QA systems mentioned above have different strategies for finding the best answer to a question, and there is no straightforward way to decide which strategies are better. All of these QA systems report high performance on their own datasets, and no research has been conducted comparing the accuracy of their returned answers on a common dataset. Thus, it is necessary to conduct a test to identify the best performing QA system for a specific dataset.

The process of selecting the best QA system involves the following steps. Firstly, 100 questions will be randomly selected from the “Science & Math” category in Yahoo! Answers from 10 subcategories that have been randomly chosen. These subcategories are “Agriculture”, “Astronomy”, “Biology”, “Chemistry”, “Earth Science”, “Engineering”, “General”, “Weather”, “Medicine” and “Physics”. The “Science & Math” category is chosen because of the observation that questions asked in this category are of higher quality than any other category in our dataset. By testing the quality of the answers of quality questions, the results are more likely to reflect the performance of the QA system. If the test is conducted on low quality questions, most QA systems will not return any answers and therefore distort the evidence of the performance of the QA system. From each of these subcategories 10 questions are picked.

Next, the questions are posted to 4 QA systems, namely, AnswerBus, askEd, QuALim and START. Ephyra is not used in this testing as it takes a significant amount of time for Ephyra to execute each step. The time to start an instance of Ephyra is around 30 seconds. Every time a question is input into the system, it takes around 10 seconds to just analyse the question, with more time needed to execute the search for the answer and then perform the answer analysis steps. Moreover, the returned answer is expressed in several words instead of one or more sentences. The preliminary test using 10 factoid questions shows that the answer returned from Ephyra is of poor quality, with the responses by Ephyra for all 10 questions being judged to be wrong.

The returned answers from AnswerBus, askEd, QuALim and START are judged manually. The number of returned answers ranges from 0 to 15; when a system finds no answer, the number of returned answers is 0. Every answer returned by a QA system is reviewed regardless of its ranking position. If at least one correct answer appears anywhere in the returned answer set, the number of correct answers for that system is increased by one, with correctness judged by a human based on relevance, informativeness, objectiveness, sincereness and reachableness (Jeon et al., 2006). Table 4.2 shows the test results of the 4 systems. From the results obtained, AnswerBus has the best performance and is selected to provide the expert answer in the proposed content-based method. Although the accuracy is low, it is the system that can best serve as an expert to provide a platform to test the answers provided by users at Yahoo! Answers. As discussed in Chapter 2, almost all QA systems are built on the assumption that the questions are well formed and an answer can be found on the Web. For Yahoo! Answers, this is not the case. Given the time constraints, it was not possible to develop a new system that can act as an expert.

System                 AnswerBus   askEd   QuALim   MIT START
Num. Correct Answers   34          30      24       5
Correct Percentage     34%         30%     24%      5%

Table 4.2. Question Answering system comparison.

4.3.3.2 Process of Matching User Answers and Expert Answer

After the answers to a question are returned from AnswerBus, a comparison between the answers from AnswerBus and the Yahoo! Answers user answers is made. A cosine similarity score is measured between each user’s answer and the expert answer. The similarity score acts as a measure of the relevance of the user’s answer to the answers from AnswerBus, whose knowledge is obtained from the Web. The answers from AnswerBus for a question are combined to form a single vector. Assume A is the answer set returned from AnswerBus for a given question, {A_1, A_2, A_3, …, A_n} ⊂ A, where n is the number of answers. V_A is the vector that holds the terms appearing in the answer set A and their corresponding term frequencies. The users’ answers from Yahoo! Answers are denoted as Y_1, Y_2, Y_3, …, Y_m, where m is the number of users who provided an answer to the question.

To compare the similarity of the answer set from AnswerBus with each user’s answer, a vector space model is set up. Table 4.3 shows an example of VSM.


Term    A    Y1    Y2    Y3
w1      0    1     0     2
w2      0    2     0     3
w3      2    0     0     1
w4      4    0     3     5
w5      0    0     0     0
w6      1    1     5     0

(The weight W of each term is its term frequency Tf.)

Table 4.3. An example of vectors for AnswerBus (A) and Yahoo! Answers (Y).

The formula for cosine similarity is expressed as:

SC(A, Y) = ∑_j (w_A,j × w_Y,j) / ( √(∑_j w_A,j²) × √(∑_j w_Y,j²) )        Eq. 4.11

Using the formula above, the similarity scores for A and Y1, A and Y2, and A and Y3 are:

SC_A,Y1 = (0×1 + 0×2 + 2×0 + 4×0 + 0×0 + 1×1) / ( √(0²+0²+2²+4²+0²+1²) × √(1²+2²+0²+0²+0²+1²) ) = 0.089
SC_A,Y2 = (0×0 + 0×0 + 2×0 + 4×3 + 0×0 + 1×5) / ( √(0²+0²+2²+4²+0²+1²) × √(0²+0²+0²+3²+0²+5²) ) = 0.636
SC_A,Y3 = (0×2 + 0×3 + 2×1 + 4×5 + 0×0 + 1×0) / ( √(0²+0²+2²+4²+0²+1²) × √(2²+3²+1²+5²+0²+0²) ) = 0.769

In the example, answer 3 (Y 3) has the highest similarity score. This answer therefore has the best match with the answers returned from AnswerBus. The similarity score then becomes the content score in the proposed method.
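A direct sketch of this matching step in Python (illustrative only; the helper names are mine and the sample strings are placeholders) merges the AnswerBus answers into one term-frequency vector and scores each user answer against it with Equation 4.11:

    import math
    from collections import Counter

    def term_vector(text):
        # Very simple bag-of-words vector: lower-cased whitespace tokens with term frequencies.
        return Counter(text.lower().split())

    def cosine(v1, v2):
        # Eq. 4.11: cosine similarity between two term-frequency vectors.
        dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
        norm1 = math.sqrt(sum(w * w for w in v1.values()))
        norm2 = math.sqrt(sum(w * w for w in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    answerbus_answers = ["Aeneas was the Trojan prince whose descendants founded Rome."]
    user_answers = ["Aeneas, according to legend.", "No idea, sorry."]

    expert = term_vector(" ".join(answerbus_answers))     # AnswerBus answers merged into vector A
    content_scores = [cosine(expert, term_vector(a)) for a in user_answers]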

4.3.3.3 Semantic Expansion of Answer Keywords

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short general definitions, and records the various semantic relations between these synonym sets (WordNet, 2009). The purpose of using WordNet in this research is to boost the content score when an answer from AnswerBus and an answer from Yahoo! Answers are semantically identical but use different terms to express the same meaning. Keywords of the answers from AnswerBus are expanded by looking them up in WordNet and finding the corresponding synonyms, if there are any. If a synonym of a word in the answer from AnswerBus appears in a user’s answer from Yahoo! Answers, the weight (term frequency) of the original word (of which the matched word is a synonym) in vector A is utilized. The rest of the process is the same as explained in the previous section.
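A sketch of this synonym boost, using the NLTK interface to WordNet (an assumption; the thesis does not state which WordNet API was used), expands the expert vector so that a user term matching a synonym is credited with the original term’s frequency:

    from nltk.corpus import wordnet   # requires the NLTK 'wordnet' corpus to be downloaded

    def synonyms(word):
        # All WordNet lemma names for the word, lower-cased.
        return {lemma.name().lower().replace("_", " ")
                for synset in wordnet.synsets(word)
                for lemma in synset.lemmas()}

    def expand_expert_vector(expert_tf, user_terms):
        # If a user term is a synonym of an expert term, give it the expert term's weight,
        # as described above; the rest of the cosine comparison is unchanged.
        expanded = dict(expert_tf)
        for term, weight in expert_tf.items():
            for syn in synonyms(term):
                if syn in user_terms and syn not in expanded:
                    expanded[syn] = weight
        return expanded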

4.3.3.4 Query and Answer Matching

Another strategy used for calculating the content score is comparing the similarity of the keywords appearing in the question and in the answers provided by the users. This comparison is simple to implement and requires little computation time. A good answer usually repeats the keywords appearing in the question, thus showing an understanding of the question. Consider the example, “What is a social network?”. A good answer to the question starts with the terms social and network; such an answer may be “A social network is a social structure made up of nodes which are generally individuals or organizations that are tied by one or more specific relations”. Some words that appear in the question are normally the focus of the question, and the expected answer should also focus on the terms in the question. Consider the example, “How do you think of the book of Harry Potter?”. If the answer to this question contains the words “Harry Potter”, it is certainly more relevant to the question than an answer that does not contain the words “Harry Potter” at all.

The method for comparing the question and user answers adopts the vector space model as well. Terms that appear in the question and the answers are included in the term list used to set up the VSM. The question forms the question vector, while each answer forms an answer vector with the term frequency used as the weighting. The cosine similarity is then used to measure the similarity.

4.4 ANSWER SCORE FUSION

The purpose of answer score fusion is to smooth the process of ranking the answers and recommending the best answer; the best answer is the one with the highest answer score. The final answer score for an answer to a question is composed of both the reputation-based and content scores. Denote the reputation score as NC and the content score as C. There are several options to consider when deriving the content score, and three different ways of measuring the content score for an answer have been discussed. The first uses the similarity score obtained by comparing the answers returned from AnswerBus to the user’s answer in Yahoo! Answers; this option is denoted as C1. Another option uses the similarity score obtained by comparing the answers from AnswerBus, expanded using WordNet, with the user answers in Yahoo! Answers, and is denoted as C1(WN). A third option uses the similarity score of the original question and the answer provided in Yahoo! Answers and is denoted as C2. The fourth option combines both C1 and C2 (the first and third options), while the fifth option combines C1(WN) and C2 (the second and third options).

The answer score ranges from 0 to 1, with 0 indicating an answer of poor quality and 1 indicating a high quality answer. Both the reputation score NC and the content score C range from 0 to 1 as well. To make the final combined answer score fall in the range of 0 to 1, weights are applied to NC and C. Denote W_NC as the weight for the reputation score and W_C as the weight for the content score, with W_NC + W_C = 1. No literature has been found regarding which weight ratio should be applied when combining a reputation score with a content score in order to achieve the best performance. The solution is to find the best weight ratio experimentally, through multiple tests on the same dataset with different weight ratios. Equation 4.12 shows how the final answer score is obtained when combining the reputation and content scores.

Answer Score = W_NC × reputation score + W_C × content score        Eq. 4.12
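A minimal sketch of the fusion step of Equation 4.12, with the weights left as parameters to be tuned empirically as described above (the candidate scores shown are made up):

    def answer_score(reputation, content, w_nc=0.5, w_c=0.5):
        # Eq. 4.12: weighted combination of the reputation (non-content) and content scores.
        # Both inputs and the output lie in [0, 1] as long as w_nc + w_c = 1.
        return w_nc * reputation + w_c * content

    # Rank the answers to one question and recommend the top one.
    candidates = [("user1", 0.54, 0.09), ("user2", 0.12, 0.64), ("user3", 0.31, 0.77)]
    ranked = sorted(candidates, key=lambda c: answer_score(c[1], c[2]), reverse=True)
    recommended_best = ranked[0]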

Figure 4.13 presents the process to obtain the combined reputation and content scores.


Identify the category structure
For each category to be processed
    Reset the result counters
    For each QA file in this category
        Extract the question subject and content
        Remove junk characters
        Identify the user
        Extract each answer to the question
        Remove junk characters from each answer
        Identify the author of each answer
        Identify which answer was selected as the best answer by the question asker
        For each answer
            If using the non-content reputation score
                Extract the local score of the answer author from the database
            End If
            If using the online QA portal (AnswerBus)
                Post the question to the online QA portal
                Receive the result
                Process the result HTML and extract the answers
                Merge the answers into a single general answer
                Remove junk characters
                If WordNet is used
                    Build the VSM of the general answer from AnswerBus
                    Identify synonyms for each word in the general answer and link them
                    Build the VSM of the answer provided in Yahoo
                    Compare the VSMs using cosine similarity, where a match between a term in the answer and a synonym of a general-answer term counts as though the original general-answer term were present in the answer VSM, and get the score
                Else
                    Build the VSM of the general answer from AnswerBus
                    Build the VSM of the answer provided in Yahoo
                    Compare the VSMs using cosine similarity and get the score
                End If
            End If
            If using the Yahoo question to Yahoo answer comparison
                Build the VSM of the original Yahoo question that this answer is for
                Build the VSM of the answer provided
                Compare the VSMs using cosine similarity and get the score
                If the QA portal was used
                    Merge the previous content score with this content score
                End If
            End If
            Combine the non-content and content scores into a single score for the answer and store it with the answer
        End For
        Rank all answers for this question from highest to lowest
        Output the results for this QA file
        Update the results for this category
        Update the results for the current top level category
        Clear memory
    End For
    Output the results for that category
End For
Output the overall results

Figure 4.13. Process of generating combined score.


4.5 CONCLUSION

This chapter discusses the reputation and content methods for recommending the best answer in a cQA. An analysis was conducted on utilizing the standard centrality degree formula for the calculation of a user’s reputation; however, as Yahoo! Answers is a very large and sparse social network, the calculation of reputation in Yahoo! Answers is difficult if the standard centrality degree formula is used. User behaviour in a cQA is discussed, with this behaviour being modelled as a modified sigmoid function; this is the foundation of the proposed reputation method. The user behaviour analysed from the dataset indicates that the number of users drops dramatically with an increase in the number of answers/best answers provided: the majority of users provide only a few answers and/or best answers, and only a few users have provided many answers/best answers. The proposed non-content method, which is based on the user’s reputation, is then explained.

This chapter also includes the proposal for a content score to address the problem of answer content quality. Question type identification and Named Entity Recognition were initially planned to be included in the content method; however, experiments with both processes showed very low accuracy. The number of features used in setting up the SVM for question type classification was decided on, and 20 terms per category, rather than 50, 100 or 200 terms per category, was deemed the best choice from the point of view of classification accuracy. Named Entity Recognition was not as good as hoped for and is not practical to use, as NER can only be used for “who” and “where” types of questions; a further problem exists in the form of the co-reference issue. The proposed content method, therefore, only considers Question Answering systems with Query and Answer matching. Several QA portals were considered for use in this approach, and amongst the existing QA systems tested, AnswerBus was the best at returning good answers. Web-based search engines were not considered for this research, as search engines are good for document retrieval whereas Yahoo! Answers requires focused retrieval. The similarity score is attained by comparing: (1) the answers returned from AnswerBus to user answers; (2) the answers returned from AnswerBus, with synonym expansion from WordNet, to user answers; and (3) the original question to the answers provided by the users.

The reputation score and content score are finally combined to rank the answers according to user reputation and content quality respectively. Assignment of weights for the reputation score and content score is empirically decided.



Chapter 5: Experimentation & Result

This chapter contains the details of the empirical analysis of the proposed method including the performance of the reputation-based method and content-based method to rank the answers submitted by users in a cQA. Firstly, how the dataset was collected for the experiment and why questions under the “Arts & Humanity”, “Science & Math” and “Sports” categories in Yahoo! Answers were selected is explained in this chapter. Secondly, the experiment design is discussed and includes detailed information of the parameters for various experiments and how each experiment was carried out. Third, the evaluation methods to measure the performance of the reputation and content methods are presented. Finally, the results are included and the findings are discussed.

5.1 DATASET

Data for this research was obtained using the Yahoo! Answers Web Service (http://developer.yahoo.com/answers/) and is drawn from three top level categories: “Arts & Humanity”, “Science & Mathematics” and “Sports”. Up to 1000 resolved questions were retrieved from each leaf category under these three top level categories; some leaf categories have fewer than 1000 resolved questions, in which case all of the available resolved questions were retrieved. Table 5.1 shows the statistics of the dataset collected from Yahoo! Answers. Some users answered questions across multiple categories, which is why the numbers of users in the individual top level categories do not sum to the total number of users in the dataset.

Category            Questions   Answers    Users     Answers Per Question
Arts & Humanity     13,683      59,517     34,887    4.35
Science & Math      16,337      83,939     43,812    5.14
Sports              50,882      357,102    109,945   7.02
Total               80,902      500,558    158,079   6.19

Table 5.1. Yahoo! Answers dataset statistics.


The reason why the dataset is set up this way has been mentioned previously: to imitate the environment used by Jurczyk (2007) in their experiments, so that a comparison of HITS and the proposed method can be conducted. Yahoo! Answers only allows each API user to submit a maximum of 5000 queries per day. In order to collect a similar amount of data to that used in Jurczyk’s experiment, the data was collected over a period of about one month, from December 16, 2007 to January 19, 2008. Two steps are involved in downloading a document that contains a question and its corresponding submitted answers from Yahoo! Answers. First, the category from which the data will be downloaded is identified, and the question ids of the desired questions under this category are retrieved using the getByCategory service operation, which takes the category id as input. Second, the service operation called getQuestionID is used to retrieve the document for each question in XML format; it is the only operation that returns the details of a question and its answers, and it requires the question id obtained in the first step.
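The two-step retrieval can be sketched as follows. The Yahoo! Answers Web Service has since been retired, so the base URL, parameter names and category id below are assumptions based on the description above (the operation names follow the thesis’s wording), and the application id is a placeholder.

    import urllib.parse
    import urllib.request

    BASE = "http://answers.yahooapis.com/AnswersService/V1/"   # assumed service base URL
    APP_ID = "YOUR_APP_ID"                                     # placeholder application id

    def fetch(operation, **params):
        params["appid"] = APP_ID
        url = BASE + operation + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return response.read()   # XML document

    # Step 1: list resolved questions in a leaf category to obtain question ids.
    listing_xml = fetch("getByCategory", category_id="CATEGORY_ID", type="resolved", results=50)
    # Step 2: retrieve the full question document (question plus answers) by question id.
    question_xml = fetch("getQuestionID", question_id="20070404032842AA8zeFI")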

Each downloaded document contains the information about question id, the id of the user who posted the question, the time when the question was asked, the content of the question, the answer author’s id, the time when the answer authors answered the question, the content of the submitted answers, the chosen best answer and id of the user who submitted it and the corresponding score which ranges from 1 to 5. The best answer score is assigned by the question author with a score of 1 being the lowest and 5 being the highest score that the question author can give. An example of such a document can be found in Figure 5.1.

Several fields are extracted from the downloaded documents for the experiment. The question id is extracted to identify the document. The question itself is extracted from the subject and content fields: the subject field usually carries the actual question, while the content field usually holds a further, more detailed description of the question (sometimes the question is asked in this field). The user id fields are used to identify the question author and the answer authors: the first user id in the document is the id of the user who posted the question, and the following user id fields are the ids of the users who submitted an answer to this question. When using the HITS algorithm, the relation between question author and answer author is required; this information about question author ids and answer author ids is obtained from the user id fields of each document and is the easiest way of determining and recording the relations between askers and answer authors. It is also important information when ranking users based on their reputation and/or content score. The chosen best answer field is used for evaluating the performance of the proposed method against the best answer selected by the Yahoo! Answers users. The content field under each answer contains the text of the answers provided by users and is used for comparing the users’ answer quality against the answers returned from another QA system.

[Sample question document from the “Other - Alternative” category; the XML markup has been lost in extraction. The document contains the question subject and content, timestamps, the question URL, the category, the asker’s user id and nickname, each answer’s content with its author id, nickname and timestamps, and the chosen best answer with its rating.]

Figure 5.1. Document in XML format.

5.2 EXPERIMENT DESIGN

The experiment design section details how the experiments were carried out, their configuration and the variables used. These experiments were conducted to test the proposed reputation-based non-content method and a NLP- and IR-based content method.


5.2.1 Reputation Method Experiment Setup

The non-content based reputation method utilizes a user’s reputation information within a given category as well as across multiple categories. The local reputation score is used to evaluate the user’s reputation within the given category. In almost all cases, the local reputation score should be used and combined with a content score later in the process, since the question was asked under a given category. The global reputation score is used to compare the performance of our proposed approach against that of the HITS algorithm, especially for the evaluation of the Top-k users at the global level. The HITS algorithm measures a user’s hub and authority scores in the overall network, and the proposed global reputation score measures the reputation of users across the network.

5.2.1.1 Local Reputation Score

Each document under the desired category is processed; the answer author id is kept if it is not already stored in an id list, and the number of answers which has been provided by this answer author is increased by one for each answer they have submitted in the given category. If the answer was also chosen as the best answer, then the number of best answers for the answer author is also increased by one for each best answer given. The answer author id, the corresponding number of answers and the number of best answers the author/user has provided are stored in a database.

The distributions of the number of answers and the number of best answers given by users in each category are obtained and are shown in Figures 4.3, 4.5 and 4.7 in Chapter 4. The values of the parameters µ and σ are determined from this database as discussed in Section 4.2.1. Table 5.2 shows the values assigned to µ and σ in each category and across the three categories (global). Each answer author’s local reputation score is calculated by applying the number of answers and the number of best answers that the answer author has provided to Equations 4.3, 4.4, 4.5 and 4.6, as explained in Chapter 4. The value of the local reputation score is then also stored in the database.

     Arts & Hum   Arts & Hum    Sci & Math   Sci & Math    Sports     Sports        Global     Global
     Answer       BestAnswer    Answer       BestAnswer    Answer     BestAnswer    Answer     BestAnswer
µ    32           11            35           14            88         18            78         17
σ    6.79752      2.94687       14.12207     3.533655      17.97862   4.416297      16.92201   4.3

Table 5.2. Values assigned to µ and σ.


When calculating a user’s reputation, the expertise level of each user is measured through a participation score and a best answer score, as shown in Equation 4.3. The participation score and best answer score are combined using weights w1 and w2 respectively, and the sum of w1 and w2 is always equal to 1. However, no work has been done to establish whether an answer author’s participation in a cQA portal or their ability to provide best answers is more important when deciding on one’s reputation. Experiments were therefore conducted using 11 different combinations of weights: (0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.6), (0.5, 0.5), (0.6, 0.4), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1), (0.25, 0.75) and (0.75, 0.25), where the first value is w1, the weight for the participation score, and the second is w2, the weight for the best answer score.
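To see how the weight pair shifts the expertise part of Equation 4.3, a small illustrative loop is enough (the participation and best answer scores used, 0.27 and 0.99, are illustrative values for a Sports user with 70 answers and 60 best answers, not results from the experiments):

    pairs = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.6), (0.5, 0.5), (0.6, 0.4),
             (0.7, 0.3), (0.8, 0.2), (0.9, 0.1), (0.25, 0.75), (0.75, 0.25)]
    participation, best_answer = 0.27, 0.99
    for w1, w2 in pairs:
        expertise = w1 * participation + w2 * best_answer
        print(f"w1={w1}, w2={w2}: expertise={expertise:.3f}")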

5.2.1.2 Global Reputation Score

The process involved in calculating each answer author’s global reputation score is similar to how the local reputation score within a given category is determined. The values of µ and σ at the global level are shown in the last two columns of Table 5.2. The score for each answer author is also stored in the database. The weighting combinations of w1 and w2 for the global reputation are the same as mentioned in the preceding section.

5.2.1.3 Settings for HITS

The expertise of an answer author in HITS is decided by the authority score, that is, by the number of questions answered by the user. If the answer author has provided answers to quality hubs, which are question authors with many different answer author relations, the answer author will receive a much higher reward in HITS than an answer author who has provided an answer to a hub with only a few answer author relations.

The algorithm used to implement HITS (Kleinberg, 1999) for this experiment is shown in Figure 5.2. A database is used to store the information on the relations between question authors and answer authors; its merit is efficient storage and retrieval of the data. First, the tables TQ and TA are set up. Table TQ has two attributes: the question author id as the key attribute, and the answer author id, which in one instance may contain one or several answer author ids. From a database design point of view the table is not well normalized; however, its main purpose is to capture the information about a question author and the corresponding answer authors, which is used for the calculation of the hub scores. Table TA has the answer author id as the key attribute and a question author id attribute, which is a list containing one or more question author ids

© 2009 Lin Chen Page 91 Chapter 5: Experimentation & Result Page 92

corresponding to the answer author id. Table TA is used to calculate the authority score. The HITS algorithm is run for twenty (20) iterations so that both the hub score and the authority score stabilize. Both scores are designed to fall in the range of 0 to 1; therefore, normalization is performed at the end of each iteration.

(1) Access the whole dataset
(2) For each question
        If the question author id U_Q is already in table TQ (which stores each question author id and its corresponding answer author ids)
            Find the position of U_Q in TQ; append the answer author ids U_A1 ... U_An (where n is the number of answer authors who have answered the question) to its answer author list
        Else
            Insert the question author id U_Q and the answer author ids U_A into TQ
        End If
        If the answer author id U_A is already in table TA (which stores each answer author id and its corresponding question author ids)
            Find the position of U_A in TA; append the question author ids U_Q1 ... U_Qm (where m is the number of question authors related to the answer author) to its question author list
        Else
            Insert U_A and U_Q into TA
        End If
    End For
(3) Initialize each hub score h(i) to 0 and each authority score a(i) to 1
(4) For p = 1 to 20
        For each hub i
            Calculate the hub score h(i) = Σ a(j), where the hub and its corresponding authorities are retrieved from TQ
        End For
        For each authority i
            Calculate the authority score a(i) = Σ h(j), where the authority and its corresponding hubs are retrieved from TA
        End For
        For each hub i
            Normalize h(i) = h(i) / maximum hub score for this iteration
        End For
        For each authority i
            Normalize a(i) = a(i) / maximum authority score for this iteration
        End For
    End For

Figure 5.2. HITS algorithm used in the proposed experiment.
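
For readers who prefer working code to pseudocode, the iterative computation of Figure 5.2 can be sketched as follows. This illustration replaces the database tables TQ and TA with in-memory dictionaries and assumes the question author/answer author pairs have already been extracted from the dataset.

```python
from collections import defaultdict

def hits(qa_pairs, iterations=20):
    """qa_pairs: list of (question_author, answer_author) relations.
    Returns (hub_scores, authority_scores) after the given number of
    iterations, normalized by the maximum score in each round."""
    tq = defaultdict(set)  # question author -> answer authors (hub -> authorities)
    ta = defaultdict(set)  # answer author -> question authors (authority -> hubs)
    for q, a in qa_pairs:
        tq[q].add(a)
        ta[a].add(q)

    hub = {q: 0.0 for q in tq}    # initialized to 0, as in Figure 5.2
    auth = {a: 1.0 for a in ta}   # initialized to 1
    for _ in range(iterations):
        for q in tq:
            hub[q] = sum(auth[a] for a in tq[q])
        for a in ta:
            auth[a] = sum(hub[q] for q in ta[a])
        max_h, max_a = max(hub.values()), max(auth.values())
        hub = {q: s / max_h for q, s in hub.items()}
        auth = {a: s / max_a for a, s in auth.items()}
    return hub, auth

hub, auth = hits([("q1", "a1"), ("q1", "a2"), ("q2", "a1"), ("q3", "a2")])
print(auth)
```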


5.2.2 Content Method

A well-performing question type analysis and NER process leads to an efficient QA system; their role is to filter out some answer candidates so that only relevant candidates are processed in the later stages of the QA system. This section presents the detailed experimental setup for question type analysis and NER, and discusses the reasons for the choice of parameters for each. It is also necessary to discuss question pre-processing here: unlike the questions used in standard QA system experiments, questions in Yahoo! Answers are complicated, so the method of identifying the question in a Yahoo! Answers question file is also discussed. The details of using an existing QA system and the setup for attaining the content score have already been discussed in Chapter 4 and are therefore not repeated in this section.

5.2.2.1 Question Type Analysis

To categorise the question types, a training dataset is required. The training data used here is available online from http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/ . This dataset was chosen because its questions are already labelled with their type; as the questions from Yahoo! Answers are unlabelled, a large amount of manual labelling would otherwise have been needed to build a training set of sufficient size. To get a better result, the 5500 training entries covering all question types were used. The question types are categorized into 6 coarse types and 50 fine types, as shown in Table 4.1 in Chapter 4. The purpose of the training dataset is to obtain the unique features, terms or structure of each question type: even if a question changes, it should still exhibit the same features if its type remains unchanged. The training entries are pre-processed with stop-word removal and stemming (based on the Porter algorithm). The software used to implement the SVM is bsvm, available from http://www.csie.ntu.edu.tw/~cjlin/bsvm/ .

The top 20, 50, 100, 200 and 500 terms of each question type are extracted to serve as the feature vector for training and testing. The weight of each feature is calculated using the tf-idf method. tf-idf is applied to the training data in order to identify the terms that are important for representing each question type so that questions can be classified. Here "tf" is based on the total number of times a term appears in the same question type: if a term appears in a question once but appears in that question type 4 times overall, then "tf" is 4 rather than 1. The total number of documents is equal to the total number of question types, so the number of documents equates to 50 (the 50 fine question type categories). "df" is the number of different question types in which the term appears. For the testing data

a binary weight is used: if the term is present, the weight for the feature is 1, otherwise it is 0, and a term is only counted once even if it appears multiple times. For the testing data the question type of a question is not known, so applying tf-idf to a test question does not guarantee that a highly weighted term is important to a given question type; it only shows that a term is important for representing that question, not that it helps to deduce one or more question types. In addition, key terms mostly appear only once or a few times in a question, producing a low tf score, and they may appear in many questions in the testing dataset, producing a low idf score. Terms may therefore be important for one or more question types but still receive a low score simply because they appear frequently in the testing dataset, incorrectly marking them as poor discriminators. So tf-idf weighting is not suitable for the testing data. Binary weighting treats each word in a question as equal and uses the weights of the terms in the question types (derived from the training data) to discriminate the question.
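
A minimal sketch of this weighting scheme, with each question type treated as one document for idf purposes and binary presence weights for test questions, might look as follows; the helper names are illustrative and not the thesis implementation.

```python
import math
from collections import Counter

def train_feature_weights(type_to_terms, top_k=20):
    """type_to_terms: dict mapping a question type to the list of (stemmed,
    stop-word-filtered) terms from all of its training questions.
    Each question type is treated as one 'document' for idf purposes."""
    num_types = len(type_to_terms)          # 50 fine types in the thesis
    df = Counter()
    for terms in type_to_terms.values():
        df.update(set(terms))               # document frequency over types
    weights = {}
    for qtype, terms in type_to_terms.items():
        tf = Counter(terms)                  # frequency within the question type
        scored = {t: tf[t] * math.log(num_types / df[t]) for t in tf}
        top = sorted(scored, key=scored.get, reverse=True)[:top_k]
        weights[qtype] = {t: scored[t] for t in top}
    return weights

def binary_test_vector(question_terms, vocabulary):
    """Binary weighting for a test question: 1 if the term is present, else 0."""
    present = set(question_terms)
    return [1 if term in present else 0 for term in vocabulary]
```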

For the testing dataset, 500 randomly selected questions from the collected Yahoo! Answers dataset are extracted. The stop-words are removed and stemming (using the Porter algorithm) is performed for these 500 questions from Yahoo! Answers before they are used in the question type analysis experiment.

Most of the settings for SVM training are kept at their defaults (the SVM type is multi-class bound-constrained support vector classification, the kernel is a radial basis function, and the gamma in the kernel function is set to 1/number of features), except for the "cost" parameter, which is set to 1000. Five-fold cross validation is used and the data is scaled. With these settings, the cross validation accuracy when using the Top-20 terms per category from the training dataset reaches 83.5469%; with the Top-50 terms per category it is 69.4424%, with the Top-100 terms 69.4791%, with the Top-200 terms 71.2032%, and with the Top-500 terms 74.1379%. The Top-20 terms per category is clearly the best performing feature setup. Including more features when building the SVM model does not necessarily improve performance in this experiment, which may be due to the short length of the questions in the training data. Most of the questions are only 6-8 words long; therefore, the features which differentiate question types are not numerous, and including a large number of terms dilutes the distinguishing power of some important terms. According to Zhang (2003), "every question is represented as binary feature vectors, because the term frequency of each word in a question usually is 0 or 1." Zhang also states that stop-words should not be removed because the

common words like "what" and "is" are actually very important for question classification. When the same parameter settings as Zhang's are repeated in this experiment, the accuracy rate drops to only 55.6%. Comparing the results obtained with no stop-word removal and binary feature weights against the results obtained with stop-words removed and tf-idf feature weights, the latter performs much better.
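
The bsvm package was used for these experiments. Purely as an illustration of the equivalent settings (RBF kernel, cost 1000, gamma set to 1/number of features, feature scaling and 5-fold cross validation), a sketch using scikit-learn, which is not the tool used in the thesis, could look like this; the data below is random stand-in data, not the thesis features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: feature vectors built from the Top-k terms per question type,
# y: question type labels (6 coarse types used here as dummy labels).
rng = np.random.default_rng(0)
X = rng.random((500, 120))
y = rng.integers(0, 6, size=500)

# cost=1000, RBF kernel, gamma = 1/number_of_features, with scaling,
# evaluated by 5-fold cross validation (mirroring the settings above).
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1000, gamma=1.0 / X.shape[1]))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```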

However, when the testing dataset is applied to the model generated from the training data, classification accuracy is low. Fifty questions are checked manually by looking up the original question and analysing the test result; the accuracy achieved on the testing data is only 18%. Several reasons may account for this. The questions in the training data are simple questions such as "What are liver enzymes?". Some questions in Yahoo! Answers are noticeably longer and are not clear in what they are asking. For example: "The british point of view in the…? I really need help. I can not find anything about the british point of view in the sinking of the lusitania. I found the germans now I need the british.. I really need help.. if you know what there point of view was or any sites or anything please help!!". This question asks about the "British point of view in the…"; it is neither clear nor short, and only the accompanying description makes it clear that the user wants information about the British point of view on the sinking of the Lusitania. From this example, it is observed that the question is not well organized. Some questions in Yahoo! Answers also contain several questions in one. For example, one question asks: "Battle of Shiloh questions? What confederate general expressed concern at a council of war on the evening of April 6, 1862 that the Union Army had been alerted to their presence at Corinth? Who served as an assistant professor while still an undergraduate at the Military Academy, west point?". This can cause misclassification of the question type. Because of the low accuracy of identifying the question type, question type analysis was not added to the proposed system.

5.2.2.2 Named Entity Recognition

As discussed in Chapter 4, the ANNIE component of the GATE software is a user-friendly, GUI-based tool, although its settings are complex. First, the required corpus, which contains a collection of documents with question and answer information, needs to be added into GATE manually through the GUI, and the researcher needs to identify the corpus location on the computer. An application pipeline is then set up for the corpus. Several processing resources such as the Tokeniser, Sentence Splitter and POS (part of speech) Tagger are available in the application pipeline, and manual selection of the processing resources is required. In this case, the POS Tagger is selected and applied to a chosen document in the corpus.


After clicking the "Run" button, the results are displayed. A sample of the results is shown in Figure 5.3, where each colour represents a different part of speech. To be used in another program, the results need to be exported manually.

Figure 5.3. Sample results from GATE.

Another named entity recognition tool tested is Balie. Compared with other software, it is more stable and can handle a batch of tasks. The tool recognizes the "Person", "Location" and "Organization" entity types, which correspond to the "who" and "where" types of questions under testing. The main purpose of this test is to evaluate the usefulness of NER when certain types of questions are asked and whether they correlate with a certain kind of named entity. Although the software can successfully identify "Person", "Location" and "Organization" in most cases, it cannot recognize pronouns, which appear quite frequently in some answers. For example, a question asked in Yahoo! Answers under the "History" category was "Who was Leonardo Da Vinci?". One answer author replied: "He was a polymath, meaning he was a genius in many different subjects, he is most famous for his drawings and designs but he was also a mathematician, engineer, inventor, anatomist,

sculptor, architect, botanist, musician and a writer." When this answer is put through Balie, the result is as shown in Figure 5.4: the software cannot identify "he" as a person, nor can it identify mathematician, engineer or inventor as a person. If NER were added to the proposed system, it would require extra steps, adding to the complexity of the system, and it is only effective for two types of questions ("who" and "where"). A statistical analysis conducted on Yahoo! Answers indicates that only around 4% of questions (38 out of 1000) are of the "who" and "where" types. Moreover, "who" questions sometimes do not ask for a person; they may instead ask for agreement or an opinion, such as "Who agrees with me…", or for help, such as "Who can help me". Therefore, it was decided not to add a NER process to the proposed system.

************************************************************
* Named Entity Recognition testing.                         *
*  - test recognition of an entity of each type             *
************************************************************
*************************
* Entity-Noun Ambiguity *
*************************
Rejected low: in
********************************
* Entity-Entity Classification *
********************************
******************************
* Check Very Ambiguous Types *
******************************
*****************************
* Check Unknown Capitalized *
*****************************
He was a polymath, meaning he was a genius in many different subject, he is most famous for his drawings and designs but he was also a mathematician, engineer, inventor, anatomist, sculptor, architect, botanist, musician and a writer.

Figure 5.4. NER example result.

5.2.2.3 Question Pre-Processing

Before a question is posted to AnswerBus, pre-processing of the question is performed. The purpose of pre-processing is to identify the question within the question answer file. In Yahoo! Answers, a question can appear in either the question title field or the question description field, and it is often buried before or after a long description of context in either field. It is observed that most of the time the context is not important: it is a repetition of the question, the question author's opinion about the question, or an explanation of why the question arose. Cases do exist in which the long description of context is important or essential to the question, and in such cases techniques for paragraph summarization would be needed. Because it

is not the main purpose of this research and such cases do not occur regularly, no further research or experiments are conducted on this.

The question is recognized by identifying a question mark in one or both fields. If a sentence contains a question mark, it is regarded as the question; question marks may appear several times in a sentence, in which case the sentence is still deemed to be the question but the extra question marks are deleted. If there are no question marks in either field, then the content of the title field is used as the question, and if the title field is empty, the content of the description field is used.
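
A minimal sketch of these extraction rules is given below; it assumes the question record exposes a title field and a description field, which are stand-ins for the actual field names in the collected data.

```python
import re

def extract_question(title, description):
    """Apply the pre-processing rules: prefer a sentence containing a
    question mark, collapse repeated question marks, and fall back to the
    title (or to the description if the title is empty)."""
    for field in (title, description):
        for sentence in re.split(r"(?<=[.!?])\s+", field or ""):
            if "?" in sentence:
                return re.sub(r"\?+", "?", sentence).strip()
    return (title or description or "").strip()

print(extract_question("The british point of view in the...?",
                       "I really need help. I can not find anything..."))
```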

5.3 EVALUATION CRITERIA

To evaluate the performance of the proposed methods, trend comparison, correlation score and Top-n match rate are used. Analysis of Yahoo! Answers indicates that this online social network follows a power law distribution. If the results from the experiments are correct, they should also exhibit a power law trend; that is, the majority of answer authors should have low scores and only a few answer authors should have high scores.

The correlation score evaluates the performance of the proposed method in ranking the answer authors against a benchmark, namely a human ranking, and against the results obtained from the popular HITS method. The correlation score indicates how well the ranking of n answer authors produced by the proposed method (or by the HITS algorithm) agrees with the ranking given by the human ranking method. Two measures, Kendall's Tau coefficient (Herlocker et al., 2004) and Spearman's rho coefficient (Fagin et al., 2003), are used to calculate the correlation.

Kendall’s Tau coefficient is defined as:

$$\tau = \frac{4P}{n(n-1)} - 1 \qquad \text{(Eq. 5.1)}$$

where n is the number of answer authors, and P is the sum, over all answer authors, of the number of answer authors ranked after the given answer author by both rankings.

Spearman’s rho is given as follows:


$$\rho = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\;\sqrt{n\sum y_i^2 - (\sum y_i)^2}} \qquad \text{(Eq. 5.2)}$$

where n is the number of answer authors in the dataset, x_i is the rank of answer author i according to the first method, and y_i is the rank of answer author i according to the second method. The correlation coefficient ranges from -1 to 1, where -1 means complete disagreement between the two rankings, 1 means perfect agreement, and 0 means the two rankings are independent.
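
Both coefficients can be computed directly, for example with scipy.stats; the standard concordant/discordant formulation used there is equivalent to Equation 5.1 when there are no ties. This is an illustrative sketch only.

```python
from scipy.stats import kendalltau, spearmanr

# Ranks of the same six answer authors under two methods
# (e.g. the proposed reputation ranking vs. the manual ranking).
rank_method_a = [1, 2, 3, 4, 5, 6]
rank_method_b = [2, 1, 3, 5, 4, 6]

tau, _ = kendalltau(rank_method_a, rank_method_b)
rho, _ = spearmanr(rank_method_a, rank_method_b)
print(tau, rho)
```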

The Top-n match rate is defined as the degree of agreement between the proposed method and the feedback returned from Yahoo! Answers for recommending the best answer. The "n" in Top-n means that the best answer matches one of the first n ranked answers based on the scores; in these experiments n ranges from 1 to 5.

$$\text{Top-}n\text{ match rate} = \frac{\text{Number of matched best answers}}{\text{Total number of questions}} \qquad \text{(Eq. 5.3)}$$

As it is represented in Equation 5.3, the match rate is the ratio of number of questions for which the proposed method agrees with the user suggestion in selecting the best answer to the total number of questions in the category or dataset. Match score ranges from 0 to 1 with 0 as no matches for any question in the category and 1 as a match between the proposed best answer and the user-selected best answer for every question in the category. If the user-picked best answer matches with the top ranked answer from the proposed methods, then it is the Top-1 answer match. If the user-picked best answer matches with the second highest ranked answer from the proposed methods, then it is the Top-2 answer match and so on.
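
A minimal sketch of this computation, assuming each question record carries the user-selected best answer id and the list of answer ids ranked by the proposed method, is:

```python
def top_n_match_rate(questions, n):
    """questions: iterable of (user_best_answer_id, ranked_answer_ids),
    where ranked_answer_ids is ordered by the proposed score, best first.
    Returns the fraction of questions whose user-selected best answer
    appears within the top n ranked answers (Equation 5.3)."""
    matched = sum(1 for best_id, ranked in questions if best_id in ranked[:n])
    return matched / len(questions)

sample = [("a3", ["a3", "a1", "a2"]),   # Top-1 match
          ("a7", ["a5", "a7", "a6"]),   # Top-2 match
          ("a9", ["a4", "a8", "a2"])]   # no match within Top-3
print(top_n_match_rate(sample, 1), top_n_match_rate(sample, 2))
```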

Ideally, the Top-n match rate would measure agreement between the proposed method and judgements by neutral voters who do not use Yahoo! Answers, or by human experts in the subject matter. However, human judgement is expensive and time consuming; therefore, the feedback returned from Yahoo! Answers users is used for this evaluation. A test was conducted to determine the degree of agreement in choosing the best answer between the feedback from Yahoo! Answers users and manual judgement by independent users: 50 questions posted by the top 18 users were randomly selected to compare the Yahoo! Answers best answer selection against the best answer chosen by independent users. The analysis shows that the best answers chosen by Yahoo!


Answers users agree 73.5% of the time with manual judgement by independent users for all categories. The agreement goes up to 87.2% for the Science & Math category.

5.4 RESULTS

5.4.1 Reputation Method Evaluation & Results

Two methods are used to evaluate the reputation score. The first compares the trends of the different reputation scores: the expected trend graph should show a power law distribution, that is, only a very few users should receive a high expertise score while a large number of users receive a low one. The second measures the accuracy of the reputation score by comparing the similarity of the Top-k users identified by the proposed method, by HITS and by human manual ranking.

5.4.1.1 Trend Comparisons

Trend comparisons of reputation scores are conducted as follows. The global reputation score, the local reputation scores for the 3 categories, the HITS score and the baseline scores are binned into 11 categories. The scores range from 0 to 1: the first bin contains all scores with the value of 0, and each subsequent bin covers an incremental range of 0.1, giving the bins 0, 0-0.1, 0.1-0.2 and so on up to 0.9-1.0. For the number-of-answers baseline, the bins are 0 answers, 1-100 answers (corresponding to a score of 0-0.1) and so on, with the last bin containing 901 answers and above. For the number-of-best-answers baseline, the bins are 0 best answers, 1-35 best answers (corresponding to a score of 0-0.1) and so on, with the last bin representing 316-350 best answers. The global score is included so as to compare with the HITS score, as the HITS scores are calculated over all categories.
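
A minimal sketch of this binning, assuming scores already normalized into [0, 1], is:

```python
from collections import Counter

def bin_scores(scores):
    """Bin scores in [0, 1] into the 11 bins used for the trend plots:
    an exact-zero bin plus ten bins of width 0.1."""
    counts = Counter()
    for s in scores:
        if s == 0:
            counts["0"] += 1
        else:
            lower = min(int(s * 10), 9) / 10          # 0.9-1.0 bin includes 1.0
            counts[f"{lower:.1f}-{lower + 0.1:.1f}"] += 1
    return counts

print(bin_scores([0, 0.05, 0.05, 0.37, 0.95, 1.0]))
```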

Figures 5.5 and 5.6 show the scores gained by the proposed method, HITS and the baseline scores (number of answers and number of best answers). These graphs reveal that the global, local reputation and baseline scores follow the power law. However, HITS has an initial spike before the score starts to decrease in a similar fashion. The graph shows that the global reputation score and number of answers baseline score are similar for all bins from 0-0.1 to 0.7-0.8. HITS and the number of best answers baseline are also similar for the lower bins (mainly 0-0.1 to 0.3-0.4), after which the HITS scores drop off and high scores are assigned to very few users. HITS assigns a score to noticeably fewer users in comparison to the best answer baseline. It shows that HITS rewards users who have provided answers but lack best answers, with most of these users ending up with a score in the 0-0.1 bin for HITS.


The graph also shows that many users only provide a small number of answers (and therefore can only achieve a small number of best answers) and thus often will have a lower score. Only a small number of users participate enough to amass a large number of answers (and potentially a large number of best answers). In conclusion, the scores determined by the proposed method do follow

the expected power law distribution. The scores gained by the proposed method are yet to be evaluated against human judgement.

[Figure: number of users (log scale) per score bin, for HITS, the global non-content score, the Arts & Humanity, Science & Math and Sports local scores, and the number-of-answers and number-of-best-answers baselines.]

Figure 5.5. Trend comparisons for bins 0-0.4.

[Figure: number of users (log scale) per score bin for the same series as Figure 5.5, over the higher bins.]

Figure 5.6. Trend comparisons for bins 0.4-1.


5.4.1.2 Kendall & Spearman Correlation

The second set of experiments compares: (1) the correlation between the human ranking of answer authors and HITS; and (2) the correlation between the human ranking and the proposed expertise scores.

Three sets of experiments are conducted. In the first, the Top-6 answer authors are retrieved using the global reputation score, the local reputation scores for the 3 categories, and HITS. In each subsequent set the number of top answer authors is increased by 6, giving Top-12 and Top-18. For each answer author in all three sets of experiments, 50 questions answered by that author are randomly retrieved. Manual rating involves choosing the best answer manually for all of these questions; the answer authors are then ranked based on the number of best answers they provided, so the user who provided the largest number of best answers receives the highest ranking.

The ranking of users based on the global reputation score, local non-content score for 3 categories, and HITS are evaluated against the manual ranking. Two common methods, Kendall’s Tau (Herlocker et al., 2004) and Spearman’s rho (Fagin et al., 2003), are used to compare the correlation of the automated score ranking against the manual ranking.

Figure 5.7 shows the correlation results. The graphs in Figure 5.7 indicate that HITS is poorly correlated with the human ranking. The reason is that HITS awards a high rank to users who answer many questions, independent of the quality of those answers (for example, the number of best answers) and of whether they are right or wrong.

The proposed reputation method has a high correlation with the human rankings. The local expertise score works the best in the “Science & Math” category, where the Kendall correlation score reaches around 1. However, it does not work so well in the “Arts & Humanity” category, where Kendall correlation scores are around 0.6. The reason for the big differences between the two categories is the difference in the nature of the questions and the quality of answers provided in these two categories. The questions in “Arts & Humanity” can be quite general and not a lot of expertise is necessarily required. The best answer as selected by the question author is quite subjective. In the “Science & Math” category, the question requires a certain degree of expertise. It is easier to tell a good answer from a bad answer in this category. The best answer selected by the question author is usually less subjective.


Figure 5.7. Correlation score for Top-k users.

5.4.1.3 Weighting of Reputation Method

An experiment was conducted to decide the weight assignment for the participation score and best answer score components of the reputation score. In this test, several combinations of weights are tried, each yielding a different reputation score for each answer author. For each document containing a question and its answers, a ranked list is generated based on the reputation score; the answer author at the top of the list is the one the proposed system recommends as providing the best answer. The system-selected best answer is then compared to the best answer chosen by the Yahoo! Answers question author. If they match (i.e. the proposed system has picked the same answer author as the question author), it is counted as a Top-1 match. If they do not match, the second highest ranked answer (as selected by the proposed system) is compared to the best answer chosen by the question author; a match here is counted as a Top-2 match, and so on for the top five ranked answers. At the end of the

experiment, the results show the match rate between the best answers chosen by the proposed system using just the reputation score (with different weights for its component parts) and the question author's chosen best answer, for Top-1 to Top-5. The match rates of the best answer within the Top-5 answer authors are shown in Figures 5.8, 5.9, 5.10 and 5.11. These figures are also presented in tabular form (Tables A.1 to A.11 in Appendix A) to show the exact values.

Figures 5.8, 5.9, 5.10 and 5.11 show the results obtained for the dataset. In the legend, for example, QA-NC-0.1AS-0.9BAS means that this matching accuracy is obtained with the reputation score only, with 0.1AS indicating a 0.1 weighting for the participation score and 0.9BAS indicating a 0.9 weighting for the best answer score. As seen from these graphs, when the weight for the best answer score is decreased and the weight for the participation score is increased, the match rate decreases, except in the "Science & Math" category (Figure 5.10), where it is the other way around. The range of match rates within the same Top-n is small. Tables A.1 to A.11 in Appendix A show that with the decrease of the best answer score weight, the Top-1 match rate across all categories decreases from 0.5334 to 0.5097, a slight change. For the Top-2 to Top-5 match rates across all categories, the change is even smaller than for the Top-1 match rate, limited to about 0.025; for example, the overall Top-2 match rate changes from 0.7671 to 0.7441 as the best answer score weight decreases, and the Top-5 match rate changes from 0.9445 to 0.9343. Comparing the 3 categories, the match rate in the "Arts & Humanity" category is the highest while the lowest appears in the "Sports" category: the highest Top-1 match rate is 0.6734 in "Arts & Humanity" and the lowest is 0.5104 in "Sports". The change in match rate with decreasing best answer score weight and increasing participation score weight is small in all 3 categories: for the Top-1 match rates the change is limited to 0.04, and for the Top-5 match rates it is less than 0.03.

The results show the insensitivity of the proposed reputation method towards the weighting parameters. It is important to combine both the participation and best answer functions, as neither of them can work well if used alone.


Figure 5.8. Overall results for Top-n matching for weighting of reputation score.


Figure 5.9. Arts & Humanity for Top-n matching for weighting of reputation score.


Figure 5.10. Science & Math for Top-n matching for weighting of reputation score.


Figure 5.11. Sports for Top-n matching for weighting of reputation score.

The highest match rates are attained for the “Arts & Humanity” and “Sports” categories, when the weights for the participation score and best answer score are 0.0 and 1.0 respectively. However, the match rate does not vary much when the weights are varied. For the “Science & Math” category, the weighting of 1.0 for the participation score and 0.0 for the best answer score gives the best combination in terms of the match rate. As discussed in Chapter 4, the participation function reflects the user’s level of participation and the best answer function reflects the user’s level of expertise. A user may be good at providing best answers (higher than average), but the user’s participation can be lower than the average user. Without the participation function, the

user's participation information may be missed and the reputation score would not truly reflect the user's reputation level. Similarly, for a user who participates a lot but is poor at providing best answers, the reputation score would be biased without the best answer function. Therefore, both functions are essential to calculating the reputation score as they complement each other, and neither weighting should be ignored. To obtain better performance for the global reputation score and the local reputation scores for the "Arts & Humanity" and "Sports" categories, a higher weighting factor should be assigned to the best answer score than to the participation score. In these two categories, the questions are general questions and most users can provide answers, so expertise is the more important factor in deciding a user's reputation score. For the "Science & Math" category, the weighting factors should be the other way around: these questions ask for specific knowledge, and in most situations only expert users are able to answer them, so the quantity of answers matters more when deciding a user's reputation score. A future version of this reputation score may allow the weights to be varied across different categories in the cQA in order to achieve the best performance possible.

5.4.2 Content Method Evaluation & Results

Manually conducted human ranking is the most desirable evaluation technique for this type of research: the researcher selects the best answer according to the quality of the answers and compares it with the result returned by the proposed content method. However, this evaluation method requires too much time and effort and is tedious for a dataset as large as Yahoo! Answers. Therefore, the best answer recommended by the proposed content method is compared with the best answer picked by the question author in Yahoo! Answers. The user-picked best answer may not be perfect, as there may be cheating behaviour, such as a question author picking his puppet account as the best answer provider; however, the sheer size of the dataset makes such cheating behaviour insignificant.

The process of evaluation starts with calculating the content score for each answer given in response to a question. The content score can be calculated in several different ways. The score can be obtained by: (1) comparing an answer in Yahoo! Answers with the answers provided from an expert (the AnswerBus QA portal in this case); (2) comparing an answer with the keywords of the question asked in Yahoo! Answers; (3) comparing an answer in Yahoo! Answers with the answers provided from AnswerBus with WordNet synonym expansion; (4) a combination of

approaches 1 and 2; and (5) a combination of approaches 2 and 3, with weighting applied to each of the two components. Next, a ranked list is generated according to the content score in descending order. The process of determining the Top-1 to Top-5 matches is the same as described in Section 5.4.1.3 (Weighting of Reputation Method).
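
As an illustration of how such content scores could be computed, the sketch below uses simple word overlap between an answer and either the AnswerBus answers or the question keywords, with optional WordNet synonym expansion via NLTK (which requires the WordNet corpus to be installed); the overlap measure and helper names are assumptions for illustration, not the exact similarity measure used in Chapter 4.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

def expand_with_synonyms(terms):
    """Add WordNet synonyms of each term (the expansion step of method 3)."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            expanded.update(l.name().lower() for l in synset.lemmas())
    return expanded

def overlap_score(answer_terms, reference_terms):
    """Fraction of reference terms covered by the answer (illustrative)."""
    if not reference_terms:
        return 0.0
    return len(set(answer_terms) & set(reference_terms)) / len(set(reference_terms))

def content_score(answer_terms, answerbus_terms, question_terms,
                  use_wordnet=False, w_ab=0.5, w_q=0.5):
    """Combined content score (methods 4/5): answer vs. AnswerBus answers
    plus answer vs. question keywords, with optional synonym expansion."""
    reference = expand_with_synonyms(answerbus_terms) if use_wordnet else answerbus_terms
    return w_ab * overlap_score(answer_terms, reference) + \
           w_q * overlap_score(answer_terms, question_terms)
```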

By comparing the answer recommended by the proposed content method with the user-picked best answer, a Top-n match score can be obtained. Tables 5.3, 5.4, 5.5, 5.6 and 5.7 give the results for the 80% dataset using the 5 methods mentioned above. Amongst the 5 methods, the combined use of AnswerBus and question answer comparison performs the best, with a Top-1 match score of 0.4794 across the 3 categories (Table 5.6). The results also show that applying WordNet synonym expansion to AnswerBus produces a match score with no noticeable difference from AnswerBus without WordNet. Several reasons may account for the low Top-1 match rate: (1) An answer may not be available for the question, as some questions in Yahoo! Answers ask for such detailed information that no information is available to answer them. (2) Some questions in Yahoo! Answers may not be phrased appropriately and may contain ambiguous words, such as pronouns. For example, if a user asks "Who is she?" without defining who 'she' refers to, the question is ambiguous and cannot be answered appropriately; the user may define who 'she' refers to in another sentence, but if that sentence is not marked as a question, the developed system fails to identify it. (3) Some questions do not really ask for information but rather for an opinion, and the content-based method is not suited to opinion type questions. The best Top-5 match score reaches 0.84. For all 5 methods, the match score in the "Arts & Humanity" category is the highest amongst the 3 top level categories, and the worst match score is always obtained in the "Sports" category, due to its large number of opinion type questions.

Cat                 Top 1    Top 2    Top 3    Top 4    Top 5
Overall             0.395    0.5933   0.7126   0.7897   0.8406
Arts & Humanity     0.5143   0.7113   0.8152   0.8796   0.9147
Science & Maths     0.4775   0.6846   0.7923   0.8519   0.8885
Sports              0.3355   0.5313   0.6285   0.7448   0.8047

Table 5.3. AnswerBus match score.


Cat                 Top 1    Top 2    Top 3    Top 4    Top 5
Overall             0.378    0.5808   0.7063   0.7861   0.8399
Arts & Humanity     0.4988   0.7079   0.8183   0.8794   0.9159
Science & Maths     0.4635   0.6703   0.7851   0.8475   0.8869
Sports              0.3176   0.5164   0.6499   0.7405   0.8037

Table 5.4. Question answer match score.

Cat                 Top 1    Top 2    Top 3    Top 4    Top 5
Overall             0.3983   0.5987   0.7173   0.7925   0.8426
Arts & Humanity     0.5183   0.7143   0.818    0.8793   0.9146
Science & Maths     0.4803   0.6885   0.7935   0.8525   0.8892
Sports              0.3388   0.5377   0.6648   0.7493   0.8077

Table 5.5. AnswerBus with WordNet match score.

Cat                 Top 1    Top 2    Top 3    Top 4    Top 5
Overall             0.4794   0.6709   0.7789   0.8454   0.8886
Arts & Humanity     0.5744   0.7681   0.8616   0.9116   0.9412
Science & Maths     0.573    0.7562   0.8505   0.8991   0.9269
Sports              0.423    0.6165   0.7329   0.8097   0.8618

Table 5.6. Combined AnswerBus with question answer match score.

Cat                 Top 1    Top 2    Top 3    Top 4    Top 5
Overall             0.3939   0.5972   0.7189   0.7951   0.8468
Arts & Humanity     0.5077   0.7199   0.8269   0.8855   0.9191
Science & Maths     0.4834   0.6886   0.7975   0.8581   0.8936
Sports              0.3337   0.5338   0.6637   0.7499   0.8117

Table 5.7. Combined AnswerBus with WordNet and question answer match score.

5.4.3 Score Fusion Evaluation and Results

The next task is to combine the content-based score and the reputation-based score so that the accuracy of the proposed system's best answer recommendation improves. Figure 5.12 shows an example result document that combines the reputation-based score and the content-based score (returned from using AnswerBus) with a 0.5 weighting factor for each component. This allows

the researcher to check the result for further analysis. The user-selected best answer id and the system-selected best answer id are recorded for further evaluation of the proposed system.

Different weightings for combining the reputation-based score and content-based score into a single score are also tested, in order to decide whether the weighting factors affect the resultant match score. The weighting combinations tested for the reputation-based and content-based scores were (0.5, 0.5), (0.25, 0.75) and (0.75, 0.25). The combined score can be based on: (1) the reputation score with AnswerBus; (2) the reputation score with Question Answer word matching; (3) the reputation score with AnswerBus applying WordNet synonym expansion; and (4) the reputation score with combined AnswerBus and Question Answer word matching. The combined score is compared with: (1) the reputation-based score only; (2) the AnswerBus score only; (3) Question Answer word matching only; (4) AnswerBus combined with Question Answer word matching; (5) AnswerBus with WordNet synonym expansion only; and (6) HITS.
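
The fusion itself is a weighted sum; a minimal sketch, assuming both scores are already normalised to [0, 1], reproduces the overall rating shown for the top answer in Figure 5.12.

```python
def fused_score(reputation, content, w_rep=0.5, w_content=0.5):
    """Weighted fusion of the reputation-based and content-based scores;
    the tested weightings are (0.5, 0.5), (0.25, 0.75) and (0.75, 0.25)."""
    return w_rep * reputation + w_content * content

# Figure 5.12's top answer: reputation 0.193407, content 0.6988696.
print(fused_score(0.193407, 0.6988696))   # 0.4461383 with equal weights
```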

The overall dataset for these experiments has been divided into 3 datasets: a 'big' dataset containing 80% of the retrieved questions, and two smaller datasets (20% (1) and 20% (2)). The first 20% dataset contains the records excluded from the 80% dataset; the second 20% dataset contains randomly selected records from the whole dataset. This setup is used to test the consistency of the results and to reduce interference from noise. Figure 5.13 shows the overall results for the 'big' (80%) dataset: the reputation-based score performs the best amongst all of the methods. All the results for the 'big' dataset and the two small datasets are shown in Appendix B. In the 'big' dataset with a 0.5 weighting factor for both the reputation and content scores, the combination of the reputation score with AnswerBus and Question Answer word matching is the second best performing measure. However, this differs from the results obtained from the two small datasets: in the first small dataset (20% (1)), the combination of the reputation score with AnswerBus was second best, while in the second small dataset (20% (2)), the combination of the reputation score with AnswerBus using WordNet synonym expansion was second best. There is no distinguishable difference in score when determining the third best approach in the 'big' dataset, as all of the remaining methods perform very similarly. In the first small dataset (20% (1)), the third best performing method is the combination of the reputation score with AnswerBus and Question Answer word matching; in the second small dataset (20% (2)), the third best is the combination of the reputation score with AnswerBus using WordNet synonym expansion and Question Answer word matching. From the comparisons in the

different datasets, it is shown that the results are influenced by the choice of the data sample, but in general the reputation only approach always performs the best.

The results from changing the weight assignment between the reputation score and content score are more significant, with a high weight on the reputation score and a low weight on the content score performing better than the other way around. From the tables in Appendix B, it is observed that a weighting of 0.75 for the reputation score and 0.25 for the content score performs better than a weighting of 0.25 for the reputation score and 0.75 for the content score. Figures 5.14, 5.15 and 5.16 show the match rates for the three individual top level categories for the different combinations of scores. Amongst the 3 top level categories, the "Sports" category always obtains the worst match scores, while the "Arts & Humanity" category achieves the best, as can also be seen from the tables in Appendix B. The match rate obtained from using the HITS algorithm is low, making it the second worst performing method, with the Question Answer word matching only approach having the worst performance.


Question: Who explored Florida looking for the Fountain of Youth? Question Asker ID: RkGJkmwsaa

User Selected Best Answer (Best Answer Selected by Yahoo! Answers):

"Juan Ponce de Leon" Juan Ponce de Le??n (c. 1460 ??? July 1521 was a Spanish conquistador. He was born in Santerv??s de Campos (Valladolid). As a young man he joined the war to conquer Granada, the last Moorish state on the Iberian peninsula. Ponce de Le??n accompanied Christopher Columbus on his second voyage to the New World. He became the first Governor of Puerto Rico by appointment of the Spanish Crown. "He is also notable for his voyage to Florida, the first known European excursion there, as well as for being associated with the legend of the Fountain of Youth which is said to be in Florida." "The Fountain of Youth is a legendary spring that reputedly restores the youth of anyone who drinks of its waters. Florida is said to be its location, and stories of the fountain are some of the most persistent stories associated with the state."

User Selected Best Answer Answerer User ID: PmtnHFyQaa

System Selected Best Answer (Best Answer Selected by Proposed Methods):

"Juan Ponce de Leon" Juan Ponce de Le??n (c. 1460 ??? July 1521 was a Spanish conquistador. He was born in Santerv??s de Campos (Valladolid). As a young man he joined the war to conquer Granada, the last Moorish state on the Iberian peninsula. Ponce de Le??n accompanied Christopher Columbus on his second voyage to the New World. He became the first Governor of Puerto Rico by appointment of the Spanish Crown. "He is also notable for his voyage to Florida, the first known European excursion there, as well as for being associated with the legend of the Fountain of Youth which is said to be in Florida." "The Fountain of Youth is a legendary spring that reputedly restores the youth of anyone who drinks of its waters. Florida is said to be its location, and stories of the fountain are some of the most persistent stories associated with the state."

System Selected Best Answer Answerer User ID: PmtnHFyQaa

Reputation Rating: 0.193407 Content Rating: 0.6988696 Overall Rating: 0.4461383

Other Answers Rating by Proposed Methods

Answer (1): Ponce de Leon Answerer User ID: ZDrXFS2daa Reputation Rating: 0.0214295 Content Rating: 0.020149663 Overall Rating: 0.020789582

Answer (2): You mean? Answerer User ID: AA11260781 Reputation Rating: 0.00789477 Content Rating: 0.0 Overall Rating: 0.003947385

Figure 5.12. Sample of output from proposed system.

[Figure: overall Top-1 to Top-5 match rates for the tested score combinations (reputation only, the content-only variants, the fused scores under the different weightings and datasets, and HITS).]

Figure 5.13. Overall match rate for tested combinations.


[Figure: Arts & Humanity Top-1 to Top-5 match rates for the tested score combinations.]

Figure 5.14. Arts & Humanity match rate for tested combinations.

[Figure: Science & Maths Top-1 to Top-5 match rates for the tested score combinations.]

Figure 5.15. Science & Math match rate for tested combinations.


[Figure: Sports Top-1 to Top-5 match rates for the tested score combinations.]

Figure 5.16. Sports match rate for tested combinations.


5.5 DISCUSSION

From the evaluation conducted, the reputation method generates the most promising results. For the Top-18 answer authors, the correlation reaches close to 1. However, the overall Top-1 match rate for all answer authors is less than 0.6. This outcome may have been influenced by noise, as in Yahoo! Answers the best answer is not always selected properly. Another reason may be the type or nature of the questions being asked. The quality of answers to opinion and discussion type questions cannot easily be determined from the user's reputation: the question author simply chooses their favourite opinion (influenced by ratings and votes from other users, the points score of the answer authors, or whichever user most agrees with the question author's view) as the best answer. Just because the question author picked an answer as best does not necessarily mean it is the right answer, as it is difficult to say whether an opinion is right or wrong. As shown in the results, the "Sports" category has the worst match rate of the three top level categories. A check of the number of opinion and discussion type questions in the "Sports" category found that many questions in the football sub-categories are of the opinion type.

The content score is closely tied to the performance of existing QA systems; the best performing QA system tested achieves around 30% accuracy, so it is not surprising that the proposed content score by itself cannot easily recommend the best answer. Another consideration is the availability of online knowledge: a question asked in Yahoo! Answers can be quite unique, and web-based sources may not contain sufficient knowledge to provide a good answer, which is one reason why only around 30% accuracy is obtained for the answers returned by QA systems. Assessing the quality of answer content in Yahoo! Answers is a complex and difficult process: an answer needs to be validated against an expert answer, and it is difficult to choose an expert for Yahoo! Answers because the questions do not ask for information from one particular domain but rather from all fields. Furthermore, it is difficult to analyse the purpose of a question when it is blended into a long paragraph; the meanings of some questions are not clear even under manual interpretation and only make sense when the context is included, yet current QA portals cannot handle a question embedded within a paragraph. Moreover, questions asked in Yahoo! Answers can be of any type, including opinion and factoid types, while current QA research can only handle particular kinds of questions efficiently, such as factoid and definition questions. This research on the content method reinforces issues raised by previous research, such as identifying the correct meaning of pronouns in questions and answering

questions which require human logic. In general, QA systems based on a content method find it difficult to achieve decent performance, and more research still needs to be done.

Three different types of evaluation criteria were used: trend comparisons; correlation scores (comparing the performance of the proposed method against human judgement); and the Top-n match rate. The feedback returned from Yahoo! Answers users is treated as the ideal outcome. However, this feedback is sometimes biased for various reasons, such as the question author's opinion of the answers and their limited knowledge. Ideally, the best answers returned by the proposed method would be compared against the judgement of several neutral voters; different neutral voters may disagree on the best answer for a question, in which case the best answer is normally taken to be the one chosen by the majority. However, human judgement is costly and time-consuming, so this thesis adopted a feasible solution that still reflects the performance of the proposed method.

The hypothesis was that combining the reputation and content methods might improve on the performance of the reputation only method. However, combining the two methods degrades the performance of the reputation method, possibly because of the low accuracy of QA systems.

In summary, the reputation only method performs better than the content only method, the combined reputation and content method and HITS, with HITS performing poorly. The reputation only method achieves its best results when the weights for the participation function and best answer function are set to 0.0 and 1.0 respectively in the "Arts & Humanity" and "Sports" categories, and the other way around for the "Science & Math" category. For the combined method, performance is better when the weighting of the reputation component is higher than the weighting of the content component. Applying WordNet to the content method does not make a significant difference to its performance.

5.6 CONCLUSION

This chapter discussed the design and setup of the experiments along with the results obtained. The experiments test the performance of the reputation method using two criteria: trend comparison and correlation score. On both, the reputation method performs well and outperforms HITS. Also, an

experiment was carried out to evaluate how applying different weights to the participation score and best answer score influences the performance of the reputation method. The findings show that these weights do not have much effect on the performance of the reputation method.

The experiments testing the performance of the content methods obtain the Top-n match score by comparing: (1) an answer in Yahoo! Answers with the answers provided by the AnswerBus QA portal; (2) the answer with the question itself (question answer match); (3) the first method adapted to use WordNet along with AnswerBus; (4) a combination of methods 1 and 2; and (5) a combination of methods 2 and 3. The performance of these various combinations of the content-based method is also presented in this chapter. Amongst the five methods, the fourth method (comparing the answer supplied in Yahoo! Answers with the answers from AnswerBus, combined with comparing the Yahoo! Answers answer against the question asked) is the best performing one. However, the overall Top-1 match rate is not high. The reasons for the low match rate are discussed and include the complexity of the questions and the lack of information available on the Web to answer them.
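As an illustration only, the following Python sketch shows one way such a combined content score could be computed with standard IR tooling. The TF-IDF representation, the equal weights and the function names are assumptions made for this example, not the exact formulation used in this thesis.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity(text_a, text_b):
        """Cosine similarity between two texts in a shared TF-IDF space."""
        tfidf = TfidfVectorizer(stop_words="english").fit_transform([text_a, text_b])
        return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

    def content_score(candidate_answer, question, expert_answer,
                      w_expert=0.5, w_question=0.5):
        """Combined content score: agreement with an expert (AnswerBus-style)
        answer plus agreement with the question itself (question answer match)."""
        return (w_expert * similarity(candidate_answer, expert_answer)
                + w_question * similarity(candidate_answer, question))

    # Hypothetical example.
    print(content_score(
        candidate_answer="The Eiffel Tower is located in Paris, France.",
        question="Where is the Eiffel Tower?",
        expert_answer="The Eiffel Tower stands in Paris on the Champ de Mars."))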

Finally, experiments were conducted to measure the performance of the proposed system when the reputation method is combined with the various content methods. The results show that the reputation method alone outperforms all of the combinations of reputation and content methods. Different weighting factors for combining the reputation and content scores were also tested, and it was found that these weights do not have much impact on performance. In summary, the experiments with Yahoo! Answers show that an expert with a good reputation can be considered a good source of knowledge to meet an information seeker's needs.
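For illustration, a weighted score fusion of this kind might look like the following Python sketch. The record layout and weights are hypothetical; the 0.75/0.25 default merely reflects the observation above that heavier reputation weighting tends to perform better.

    def fused_score(reputation, content, w_reputation=0.75, w_content=0.25):
        """Weighted fusion of a user's reputation score and an answer's content
        score; either component can be switched off by setting its weight to 0."""
        return w_reputation * reputation + w_content * content

    # Rank the candidate answers for one question (hypothetical data).
    candidates = [
        {"answer_id": "a1", "reputation": 0.91, "content": 0.35},
        {"answer_id": "a2", "reputation": 0.40, "content": 0.72},
        {"answer_id": "a3", "reputation": 0.15, "content": 0.20},
    ]
    ranked = sorted(candidates,
                    key=lambda a: fused_score(a["reputation"], a["content"]),
                    reverse=True)
    print([a["answer_id"] for a in ranked])   # ['a1', 'a2', 'a3']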


Chapter 6: Conclusions

As a collaborative social network website, Yahoo! Answers has gained great popularity in recent years. Communication in Yahoo! Answers is based around asking and answering questions, where answers to a question can be provided by many different users. The quality of answers and questions can vary greatly. Currently, the best answer to a question is chosen through a manual process which requires time and effort from users: the best answer is decided by the question author, or by other users who did not answer the question, after reading all the answers submitted for it. An automatic process for choosing the best answer is therefore desirable, as it would allow the question author or other users to read just the best answer instead of having to go through all the answers provided.

6.1 MAIN FINDINGS

This research presents an innovative approach to automatically selecting the best answer from the many different answers provided for a given question in a cQA network. The research determines the quality of users' answers by utilizing both the reputation of users and the content of their answers. The reputation method follows the power law, so that the majority of users receive a low reputation score due to their limited contribution, and only the few users who provide many quality answers receive a high reputation score. Alongside this, a content method is proposed to filter out irrelevant and low quality answers based on the textual content of the answers. The proposed content method uses AnswerBus, an online QA system, as a possible expert source: questions from Yahoo! Answers are passed on to the online QA system so that possible answers can be discovered. These two features are combined using a score fusion, allowing either feature or both to be used as needed. Experiments were conducted to compare the performance of the reputation method, the content method and the combined methods. Empirical analysis shows that the proposed reputation method produced the best performance amongst all the methods. The content-based methods perform poorly for several reasons, such as the complexity of the questions in Yahoo! Answers, the types of questions asked, the ambiguous meaning of terms in the questions and the unavailability of (ideal) answer information on the Web. The empirical analysis ascertains that the reputation method, based on user reputation and expertise, should be used for recommending the best answer at the current stage of this research.


There are many types of social networks, each with a different structure and features. Analysing the structure and features of Yahoo! Answers helps in understanding the collaborative question answering type of social network. A comprehensive study of the Bow Tie structure, degree centrality and the question answering process in Yahoo! Answers has been carried out. It was discovered that only a few users exclusively ask questions, while users who only answer questions, surprisingly, constitute a third of the population in the sample dataset. The analysis of indegree and outdegree confirms the existence of the power law in Yahoo! Answers; that is, the majority of users answer only a few questions and only a few users provide a massive number of answers.
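The following minimal Python sketch, with made-up counts, illustrates the kind of check involved: on a log-log scale a power-law out-degree distribution appears roughly linear, with many users at low degree and very few at high degree.

    import math
    from collections import Counter

    # Hypothetical number of answers (out-degree) contributed by each user.
    answers_per_user = [1, 1, 2, 1, 3, 1, 1, 5, 1, 2, 1, 40, 1, 2, 1, 1, 120, 1, 3, 1]

    degree_counts = Counter(answers_per_user)
    for degree in sorted(degree_counts):
        users = degree_counts[degree]
        print(f"degree={degree:4d}  users={users:3d}  "
              f"log10(degree)={math.log10(degree):5.2f}  log10(users)={math.log10(users):5.2f}")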

A series of experiments was conducted to test the effectiveness of the proposed reputation and content methods. The first experiment tested the influence of different weights on the reputation score. The results show that the weights have only a small influence on the final reputation score; the way the reputation scores are measured therefore makes the reputation-based method robust and insensitive to its weights and parameters. The reputation method was also compared with the HITS algorithm, and the experiments indicate that the results from the proposed reputation method follow the power law, whereas the results from the HITS algorithm do not. Both methods were also compared against manual judgement of the best answer, with the correlation score as the measure for this experiment. It was found that the proposed reputation method was more highly correlated with manual judgement than HITS. HITS performs poorly largely because it does not take the best answer score into account; it would, however, still work even if manual ranking is not done.
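The sketch below illustrates the general shape of such a weighted reputation score in Python. The participation and best answer formulas, and the variable names, are simplified placeholders rather than the definitions used in this thesis.

    def reputation_score(n_answers, n_best_answers, total_answers_in_category,
                         w_participation=0.5, w_best_answer=0.5):
        """Illustrative weighted reputation score: a participation component (how
        active a user is in the category) combined with a best-answer component
        (how often the user's answers are chosen as best)."""
        participation = n_answers / max(total_answers_in_category, 1)
        best_answer = n_best_answers / max(n_answers, 1)
        return w_participation * participation + w_best_answer * best_answer

    # Hypothetical users: a very active user and a newer but accurate user.
    print(reputation_score(n_answers=500, n_best_answers=120, total_answers_in_category=10000))
    print(reputation_score(n_answers=20, n_best_answers=15, total_answers_in_category=10000))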

From the experiments on the proposed content-based methods, it was found that using AnswerBus to evaluate the quality of an answer is the best performing of all the question answer matching methods, including the use of WordNet with AnswerBus. However, the experimental results show that the reputation-based method is better than the content-based methods, which rely on NLP and IR techniques. Current NLP and IR techniques cannot resolve pronouns, disambiguate terms or handle many other such problems, and the questions in Yahoo! Answers are more complicated than the questions tested on current NLP and IR based QA systems; current QA systems are not able to handle these questions. The reputation method is therefore the better indicator and is more suitable for use in a cQA network.

Trend comparison, correlation score and Top-n score are used to evaluate the proposed approach. Trend comparison observes the trend of the scores generated by the proposed methods to ensure they follow the power law. The correlation score and Top-n score are useful for evaluating how closely the best answer chosen by the proposed method agrees with the best answer chosen by the human expert(s).

6.2 CONTRIBUTIONS

The contributions that this research makes include the following:

• A comprehensive analysis of Yahoo! Answers has been conducted. As a consequence, the properties and features of Yahoo! Answers are discovered and applied in the proposed methods.

• A non-content method based on users' reputation is proposed, and a reputation score is calculated for each answer author. The reputation method has two essential components: the participation score and the best answer score. The participation score is based on how active a user is in the network, while the best answer score is based on how good the user is at providing quality (best) answers.

• A question type classification experiment is carried out, with the performance on the training dataset exceeding that of previous research conducted by another research group. A Named Entity Recognition experiment is also conducted. The results are reviewed and a content method is proposed. The content score for each answer is calculated from the cosine similarity between the answer in Yahoo! Answers and the answer returned from another online QA system.

• A series of experiments is conducted to evaluate the performance of the proposed methods. The experiments compare the distributions of the reputation score, HITS and the baseline score. The correlation between the Top-n users' reputation scores and the manual evaluation of answers is calculated, as is the correlation between HITS and the manual evaluation. As the results show, the correlation between the Top-n users' reputation scores and the manual evaluation of answers is high (close to 1), whereas HITS suffers from a low correlation score. Furthermore, a matching score is computed based on the degree of agreement between the best answer according to the reputation score and the best answer chosen by Yahoo! Answers users; the same matching score is computed to evaluate the content score and the combined reputation and content score. Amongst all the proposed methods, the Top-n match score of the reputation method is always the highest, both in each of the three categories and overall across the three categories.

• Weights are applied to the reputation score calculation, and weighting is also applied to the combined reputation and content score calculation. Experiments check the influence of these weights on the performance of the reputation method and the combined reputation and content method in predicting the Top-n matches. The results show that the variation of the Top-n match rate is small when various weights are applied to the reputation method and to the combined reputation and content method.

• This research can be directly applied to Yahoo! Answers to improve the efficiency of the process of selecting the best answer to a question. A feature of this research is that it can also list the answers and their authors in descending order of score using the proposed methods. The proposed methods can be applied not only to Yahoo! Answers but also to other cQA systems. Reputation is decided by a user's participation level and expertise level; in most current cQA systems, the participation level is related to the number of answers a user has provided, while the expertise level is related to the number of best answers a user has provided. Both of these counts are easily obtained from current cQA systems, so the proposed methods can have wide application to cQA systems. In addition, the quality of answers is determined by the content quality of an answer, such as its relevance and informativeness. A content score based on the content quality of an answer is proposed and the process for calculating it is detailed; when applying the content method to another cQA system, the content score can be obtained by following the same steps as the content method outlined in this thesis.

6.3 FUTURE WORK

There are a number of limitations in this research. One limitation is that the data collection represents only a small sample of the data available in Yahoo! Answers; data has not been collected from all the available categories. Another limitation is that no method has been proposed to evaluate the quality of the questions asked. In this research, the best answers for different questions are all treated the same, although it stands to reason that the best answer to a high quality question is more valuable than the best answer to a low quality question. In addition, this thesis does not completely solve the cold start problem, that is, treating new users in the same fashion as older existing users. The proposed reputation method favours older, active users over new users: new users who are good at providing quality answers but have a low participation level and few best answers are not deemed expert users by the current method, simply because of their short time as users in Yahoo! Answers. The content method was included so that the quality of answers is considered alongside the reputation of the users; unfortunately, due to the complexities and limitations inherent in the content-based methods, the reputation method is considered to be the best method.


As for future work, more data will be collected from all categories. Questions need to be analysed to determine their nature (for example, whether a question is an opinion question, a discussion question or a factoid question). In addition, the quality of a question should be considered in future research, so that questions which are opinion-based, discussion-based or otherwise of poor quality are excluded; there is little value in recommending a best answer for these types of questions. An improved reputation method should be proposed to solve the cold start problem, so that new users can receive a fair reputation score. A classification study may also be carried out to decide on the type of a question. Further improvements can be made to the QA system: training data can be used to support question analysis, for example by manually labelling the questions in Yahoo! Answers and assigning each question a type based on the labels shown in Table 4.1. Setting up a knowledge database may also serve as a replacement for NER, which is only useful for a few types of questions; such a database would record the relations between certain types of questions and certain answer terms. Finally, it would be desirable to test and compare the performance of the current Yahoo! Answers system against the proposed answer quality evaluation system through real world use.



Appendix A

Figure A.1. Overall reputation score weighting. [Chart: x-axis Top 1 to Top 5; y-axis match rate with user chosen answers; one series per weighting, QA-NC-0.1AS-0.9BAS through QA-NC-0.9AS-0.1BAS, plus QA-NC-0.25AS-0.75BAS and QA-NC-0.75AS-0.25BAS.]

Figure A.2. Arts & Humanity reputation score weighting. [Chart: same series and axes as Figure A.1.]

Figure A.3. Science & Math reputation score weighting. [Chart: same series and axes as Figure A.1.]

Figure A.4. Sports reputation score weighting. [Chart: same series and axes as Figure A.1.]

The following are based on a dataset size of 63,990 QA pages with Arts & Humanity containing 10,939 QA pages, Science & Maths containing 13,052 QA pages and Sports containing 39,999 QA pages.
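As a check on how the figures in the following tables relate to one another, each match rate is simply the match count divided by the number of QA pages in the corresponding category; a small illustrative computation in Python (not part of the original appendix):

    # Overall Top 1 entry of Table A.1: 35,848 matches out of 63,990 QA pages.
    print(round(35848 / 63990, 4))   # 0.5602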

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35848 (0.5602) | 49092 (0.7671) | 55399 (0.8657) | 58572 (0.9153) | 60440 (0.9445)
Arts & Humanity | 7367 (0.6734) | 9532 (0.8713) | 10375 (0.9484) | 10676 (0.9759) | 10809 (0.9881)
Science & Maths | 8063 (0.6177) | 10582 (0.8107) | 11716 (0.8976) | 12195 (0.9343) | 12513 (0.9587)
Sports | 20418 (0.5104) | 28978 (0.7244) | 33308 (0.8327) | 35701 (0.8925) | 37118 (0.9279)
Table A.1. 0.1 participation score weighting with 0.9 answer score weighting (QA-NC-0.1AS-0.9BAS). Each cell gives the number of matches with the match rate in brackets.

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35616 (0.5565) | 48870 (0.7637) | 55243 (0.8633) | 58419 (0.9129) | 60333 (0.9428)
Arts & Humanity | 7290 (0.6664) | 9491 (0.8676) | 10357 (0.9467) | 10656 (0.9741) | 10795 (0.9868)
Science & Maths | 8139 (0.6235) | 10663 (0.8169) | 11753 (0.9004) | 12237 (0.9375) | 12541 (0.9608)
Sports | 20187 (0.5046) | 28716 (0.7179) | 33133 (0.8283) | 35526 (0.8881) | 36997 (0.9249)
Table A.2. 0.2 participation score weighting with 0.8 answer score weighting (QA-NC-0.2AS-0.8BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35443 (0.5538) | 48700 (0.761) | 55092 (0.8609) | 58309 (0.9112) | 60239 (0.9413)
Arts & Humanity | 7241 (0.6619) | 9450 (0.8638) | 10332 (0.9445) | 10640 (0.9726) | 10784 (0.9858)
Science & Maths | 8184 (0.627) | 10710 (0.8205) | 11780 (0.9025) | 12259 (0.9392) | 12549 (0.9614)
Sports | 20018 (0.5004) | 28540 (0.7135) | 32980 (0.8245) | 35410 (0.8852) | 36906 (0.9226)
Table A.3. 0.3 participation score weighting with 0.7 answer score weighting (QA-NC-0.3AS-0.7BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35214 (0.5503) | 48490 (0.7577) | 54916 (0.8581) | 58108 (0.9096) | 60140 (0.9398)
Arts & Humanity | 7183 (0.6566) | 9395 (0.8588) | 10304 (0.9419) | 10627 (0.9714) | 10777 (0.9851)
Science & Maths | 8226 (0.6302) | 10731 (0.8221) | 11806 (0.9045) | 12285 (0.9412) | 12567 (0.9628)
Sports | 19805 (0.4951) | 28364 (0.7091) | 32806 (0.8201) | 35296 (0.8824) | 36796 (0.9199)
Table A.4. 0.4 participation score weighting with 0.6 answer score weighting (QA-NC-0.4AS-0.6BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34975 (0.5465) | 48321 (0.7551) | 54764 (0.8558) | 58085 (0.9077) | 60085 (0.9389)
Arts & Humanity | 7132 (0.6519) | 9354 (0.8551) | 10276 (0.9393) | 10610 (0.9699) | 10769 (0.9844)
Science & Maths | 8259 (0.6327) | 10778 (0.8257) | 11831 (0.9064) | 12306 (0.9428) | 12586 (0.9642)
Sports | 19584 (0.4896) | 28189 (0.7047) | 32657 (0.8164) | 35169 (0.8792) | 36730 (0.9182)
Table A.5. 0.5 participation score weighting with 0.5 answer score weighting (QA-NC-0.5AS-0.5BAS).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34790 (0.5436) | 48179 (0.7529) | 54626 (0.8536) | 58012 (0.9065) | 59981 (0.9373)
Arts & Humanity | 7083 (0.6474) | 9315 (0.8515) | 10246 (0.9366) | 10606 (0.9695) | 10751 (0.9828)
Science & Maths | 8322 (0.6376) | 10838 (0.8303) | 11869 (0.9093) | 12333 (0.9449) | 12601 (0.9654)
Sports | 19385 (0.4846) | 28026 (0.7006) | 32511 (0.8127) | 35073 (0.8768) | 36629 (0.9157)
Table A.6. 0.6 participation score weighting with 0.4 answer score weighting (QA-NC-0.6AS-0.4BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34661 (0.5416) | 48014 (0.7503) | 54519 (0.8519) | 57945 (0.9055) | 59959 (0.937)
Arts & Humanity | 7047 (0.6442) | 9282 (0.8485) | 10215 (0.9338) | 10587 (0.9678) | 10745 (0.9822)
Science & Maths | 8387 (0.6425) | 10903 (0.8353) | 11911 (0.9125) | 12373 (0.9479) | 12640 (0.9684)
Sports | 19227 (0.4806) | 27829 (0.6957) | 32393 (0.8098) | 34985 (0.8746) | 36574 (0.9143)
Table A.7. 0.7 participation score weighting with 0.3 answer score weighting (QA-NC-0.7AS-0.3BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34479 (0.5388) | 47831 (0.7474) | 54370 (0.8496) | 57848 (0.904) | 59894 (0.9359)
Arts & Humanity | 6975 (0.6376) | 9237 (0.8444) | 10187 (0.9312) | 10585 (0.9676) | 10742 (0.9819)
Science & Maths | 8467 (0.6487) | 10958 (0.8395) | 11947 (0.9153) | 12411 (0.9508) | 12663 (0.9701)
Sports | 19037 (0.4759) | 27636 (0.6909) | 32236 (0.8059) | 34852 (0.8713) | 36489 (0.9122)
Table A.8. 0.8 participation score weighting with 0.2 answer score weighting (QA-NC-0.8AS-0.2BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34255 (0.5353) | 47616 (0.7441) | 54180 (0.8466) | 57703 (0.9017) | 59790 (0.9343)
Arts & Humanity | 6908 (0.6315) | 9201 (0.8411) | 10161 (0.9288) | 10572 (0.9664) | 10735 (0.9813)
Science & Maths | 8538 (0.6541) | 11004 (0.843) | 11967 (0.9168) | 12424 (0.9518) | 12673 (0.9709)
Sports | 18809 (0.4702) | 27411 (0.6852) | 32052 (0.8013) | 24707 (0.8676) | 36382 (0.9095)
Table A.9. 0.9 participation score weighting with 0.1 answer score weighting (QA-NC-0.9AS-0.1BAS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35515 (0.555) | 48801 (0.7626) | 55171 (0.8621) | 58366 (0.9121) | 60283 (0.942)
Arts & Humanity | 7267 (0.6643) | 9470 (0.8657) | 10348 (0.9459) | 10648 (0.9733) | 10790 (0.9863)
Science & Maths | 8170 (0.6259) | 10697 (0.8195) | 11776 (0.9022) | 12254 (0.9388) | 12546 (0.9612)
Sports | 20078 (0.5019) | 28634 (0.7158) | 33047 (0.8261) | 35464 (0.8866) | 36947 (0.9236)
Table A.10. 0.25 participation score weighting with 0.75 answer score weighting (QA-NC-0.25AS-0.75BAS).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 34573 (0.5402) | 47912 (0.7487) | 54455 (0.8509) | 57883 (0.9045) | 59924 (0.9364)
Arts & Humanity | 7003 (0.6401) | 9252 (0.8457) | 10197 (0.9321) | 10593 (0.9683) | 10743 (0.982)
Science & Maths | 8434 (0.6461) | 10930 (0.8374) | 11934 (0.9143) | 12390 (0.9492) | 12650 (0.9692)
Sports | 19136 (0.4784) | 27730 (0.6932) | 32324 (0.8081) | 34900 (0.8725) | 36531 (0.9132)
Table A.11. 0.75 participation score weighting with 0.25 answer score weighting (QA-NC-0.75AS-0.25BAS).


Appendix B

Figure B.1. Overall top n match score for first small dataset using different combinations of reputation and content scoring (1). [Chart: Overall, weighting of 0.5NC & 0.5C; series QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; x-axis Top 1 to Top 5; y-axis match rate with user chosen answers.]

Figure B.2. Overall top n match score for first small dataset using different combinations of reputation and content scoring (2). [Chart: Overall, weighting of 0.25NC & 0.75C; series QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); same axes.]

Figure B.3. Overall top n match score for first small dataset using different combinations of reputation and content scoring (3). [Chart: Overall, weighting of 0.75NC & 0.25C; series QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); same axes.]

Figure B.4. Overall top n match score for first small dataset using different combinations of reputation and content scoring (4). [Chart: Overall, weighting of 0.5NC & 0.5C with WordNet comparison; series QA-NC, QA-C(1), QA-C(1)-WN, QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(1)-WN, QA-0.5NC-0.5(C(1)-C(2)), QA-0.5NC-0.5(C(1)-C(2))-WN, QA-C(1)-C(2), QA-C(1)-C(2)-WN; same axes.]

Figure B.5. Overall top n match score for first small dataset using different combinations of reputation and content scoring (5). [Chart: Overall, weighting of 0.25NC & 0.75C with WordNet comparison; series QA-NC, QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(1)-WN, QA-0.25NC-0.75(C(1)-C(2)), QA-0.25NC-0.75(C(1)-C(2))-WN; same axes.]

Figure B.6. Overall top n match score for first small dataset using different combinations of reputation and content scoring (6). [Chart: Overall, weighting of 0.75NC & 0.25C with WordNet comparison; series QA-NC, QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(1)-WN, QA-0.75NC-0.25(C(1)-C(2)), QA-0.75NC-0.25(C(1)-C(2))-WN; same axes.]

Figure B.7. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (1). [Chart: weighting of 0.5NC & 0.5C; same axes.]

Figure B.8. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (2). [Chart: weighting of 0.25NC & 0.75C; same axes.]

Figure B.9. Arts & Humanity top n match score for first small dataset using different combinations of reputation and content scoring (3). [Chart: weighting of 0.75NC & 0.25C; same axes.]

Figure B.10. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (1). [Chart: weighting of 0.5NC & 0.5C; same axes.]

Figure B.11. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (2). [Chart: weighting of 0.25NC & 0.75C; same axes.]

Figure B.12. Science & Math top n match score for first small dataset using different combinations of reputation and content scoring (3). [Chart: weighting of 0.75NC & 0.25C; same axes.]

Figure B.13. Sports top n match score for first small dataset using different combinations of reputation and content scoring (1). [Chart: weighting of 0.5NC & 0.5C; same axes.]

Figure B.14. Sports top n match score for first small dataset using different combinations of reputation and content scoring (2). [Chart: weighting of 0.25NC & 0.75C; same axes.]

Figure B.15. Sports top n match score for first small dataset using different combinations of reputation and content scoring (3). [Chart: weighting of 0.75NC & 0.25C; same axes.]


The following are based on a dataset size of 15,991 QA pages with Arts & Humanity containing 2,735 QA pages, Science & Maths containing 3,268 QA pages and Sports containing 9,988 QA pages.

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 8449 (0.5283) | 11831 (0.7398) | 13470 (0.8423) | 14415 (0.9014) | 14953 (0.9350)
Arts & Humanity | 1668 (0.6098) | 2274 (0.8314) | 2529 (0.9246) | 2646 (0.9674) | 2688 (0.9828)
Science & Maths | 1939 (0.5933) | 2583 (0.7903) | 2879 (0.8809) | 3033 (0.9280) | 3117 (0.9537)
Sports | 4842 (0.4847) | 6974 (0.6982) | 8062 (0.8071) | 8736 (0.8746) | 9148 (0.9158)
Table B.1. Top n match score for first small dataset using reputation scoring (QA-NC). Each cell gives the number of matches with the match rate in brackets.

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6017 (0.3762) | 9242 (0.5779) | 11213 (0.7012) | 12494 (0.7813) | 13398 (0.8378)
Arts & Humanity | 1238 (0.4526) | 1834 (0.6705) | 2158 (0.7890) | 2336 (0.8541) | 2467 (0.9020)
Science & Maths | 1425 (0.4360) | 2127 (0.6508) | 2479 (0.7585) | 2711 (0.8295) | 2875 (0.8797)
Sports | 3354 (0.3358) | 5281 (0.5287) | 6576 (0.6583) | 7447 (0.7455) | 8056 (0.8065)
Table B.2. Top n match score for first small dataset using AnswerBus (QA-C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 5736 (0.3587) | 9037 (0.5651) | 11070 (0.6922) | 12426 (0.7770) | 13379 (0.8366)
Arts & Humanity | 1203 (0.4398) | 1800 (0.6581) | 2148 (0.7853) | 2344 (0.8570) | 2466 (0.9016)
Science & Maths | 1351 (0.4134) | 2077 (0.6355) | 2459 (0.7524) | 2705 (0.8277) | 2867 (0.8772)
Sports | 3182 (0.3185) | 5160 (0.5166) | 6463 (0.6470) | 7377 (0.7385) | 8046 (0.8055)
Table B.3. Top n match score for first small dataset using question answer match (QA-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7807 (0.4882) | 11001 (0.6879) | 12678 (0.7928) | 13690 (0.8561) | 14331 (0.8961)
Arts & Humanity | 1447 (0.5290) | 2065 (0.7550) | 2346 (0.8577) | 2488 (0.9096) | 2561 (0.9363)
Science & Maths | 1864 (0.5703) | 2470 (0.7558) | 2770 (0.8476) | 2934 (0.8977) | 3032 (0.9277)
Sports | 4496 (0.4501) | 6466 (0.6473) | 7562 (0.7571) | 8268 (0.8277) | 8738 (0.8748)
Table B.4. Top n match score for first small dataset using combination of reputation and AnswerBus scoring (QA-0.5NC-0.5C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7387 (0.4619) | 10557 (0.6601) | 12345 (0.7719) | 13406 (0.8383) | 14132 (0.8837)
Arts & Humanity | 1369 (0.5005) | 1954 (0.7144) | 2282 (0.8343) | 2446 (0.8943) | 2546 (0.9308)
Science & Maths | 1757 (0.5376) | 2415 (0.7389) | 2723 (0.8332) | 2883 (0.8821) | 2992 (0.9155)
Sports | 4261 (0.4266) | 6188 (0.6195) | 7340 (0.7348) | 8077 (0.8086) | 8594 (0.8604)
Table B.5. Top n match score for first small dataset using combination of reputation and question answer match scoring (QA-0.5NC-0.5C(2)).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7510 (0.4696) | 10620 (0.6641) | 12383 (0.7743) | 13422 (0.8393) | 14132 (0.8837)
Arts & Humanity | 1396 (0.5104) | 1978 (0.7232) | 2279 (0.8332) | 2427 (0.8873) | 2531 (0.9254)
Science & Maths | 1811 (0.5541) | 2422 (0.7411) | 2722 (0.8329) | 2892 (0.8849) | 2998 (0.9173)
Sports | 4303 (0.4308) | 6220 (0.6227) | 7382 (0.7390) | 8103 (0.8112) | 8603 (0.8613)
Table B.6. Top n match score for first small dataset using combination of reputation and content scoring (QA-0.5NC-0.5(C(1)-C(2))).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6087 (0.3806) | 9367 (0.5857) | 11377 (0.7114) | 12640 (0.7904) | 13531 (0.8461)
Arts & Humanity | 1264 (0.4621) | 1863 (0.6811) | 2203 (0.8054) | 2359 (0.8625) | 2479 (0.9063)
Science & Maths | 1427 (0.4366) | 2128 (0.6511) | 2506 (0.7668) | 2732 (0.8359) | 2888 (0.8837)
Sports | 3396 (0.3400) | 5376 (0.5382) | 6668 (0.6676) | 7549 (0.7558) | 8164 (0.8173)
Table B.7. Top n match score for first small dataset using different content scoring (QA-C(1)-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 5873 (0.3672) | 9080 (0.5678) | 11040 (0.6903) | 12426 (0.777) | 13383 (0.8369)
Arts & Humanity | 1159 (0.4237) | 1770 (0.6471) | 2077 (0.7594) | 2296 (0.8394) | 2435 (0.8903)
Science & Maths | 1397 (0.4274) | 2101 (0.6429) | 2424 (0.7417) | 2672 (0.8176) | 2850 (0.872)
Sports | 3317 (0.332) | 5209 (0.5215) | 6539 (0.6546) | 7458 (0.7466) | 8098 (0.8107)
Table B.8. Top n match score for first small dataset using HITS (QA-HITS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6060 (0.3789) | 9275 (0.58) | 11253 (0.7037) | 12544 (0.7844) | 13447 (0.8409)
Arts & Humanity | 1249 (0.4566) | 1841 (0.6731) | 2161 (0.7901) | 2343 (0.8566) | 2481 (0.9071)
Science & Maths | 1433 (0.4384) | 2137 (0.6539) | 2497 (0.764) | 2724 (0.8335) | 2872 (0.8788)
Sports | 3378 (0.3382) | 5297 (0.5303) | 6595 (0.6602) | 7477 (0.7485) | 8094 (0.8103)
Table B.9. Top n match score for first small dataset using WordNet applied to AnswerBus (QA-C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6555 (0.4099) | 9704 (0.6068) | 11604 (0.7256) | 12826 (0.802) | 13636 (0.8527)
Arts & Humanity | 1467 (0.5363) | 2064 (0.7546) | 2349 (0.8588) | 2490 (0.9104) | 2561 (0.9363)
Science & Maths | 1712 (0.5238) | 2347 (0.7181) | 2662 (0.8145) | 2861 (0.8754) | 2982 (0.9124)
Sports | 3376 (0.338) | 5293 (0.5299) | 6593 (0.66) | 7475 (0.7483) | 8093 (0.8102)
Table B.10. Top n match score for first small dataset using combination of reputation and WordNet scoring (QA-0.5NC-0.5C(1)-WN).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6397 (0.4) | 9615 (0.6012) | 11594 (0.725) | 12797 (0.8002) | 13670 (0.8548)
Arts & Humanity | 1402 (0.5126) | 1974 (0.7217) | 2287 (0.8361) | 2424 (0.8862) | 2534 (0.9265)
Science & Maths | 1573 (0.4813) | 2263 (0.6924) | 2638 (0.8072) | 2824 (0.8641) | 2963 (0.9066)
Sports | 3422 (0.3426) | 5378 (0.5384) | 6669 (0.6677) | 7549 (0.7558) | 8173 (0.8182)
Table B.11. Top n match score for first small dataset using combination of reputation and content scoring (QA-0.5NC-0.5(C(1)-C(2))-WN).

Cat Top 1 Top 2 Top 3 Top 4 Top 5 Overall 6115 0.3824 9379 0.5865 11411 0.7135 12656 0.7914 13555 0.8476 Arts & Humanity 1272 0.465 1859 0.6797 2209 0.8076 2366 0.865 2486 0.9089 Science & Maths 1421 0.4348 2141 0.6551 2532 0.7747 2740 0.8384 2896 0.8861 Sports 3422 0.3426 5379 0.5385 6670 0.6678 7550 0.7559 8173 0.8182 Table B.12. Top n match score for first small of dataset using combination of different content scoring (QA-C(1)-C(2)-WN).

Cat Top 1 Top 2 Top 3 Top 4 Top 5 Overall 7411 0.4634 10626 0.6644 12404 0.7756 13495 0.8439 14204 0.8882 Arts & Humanity 1449 0.5297 2016 0.7371 2306 0.8431 2461 0.8998 2547 0.9312 Science & Maths 1754 0.5367 2400 0.7343 2704 0.8274 2899 0.887 3018 0.9235 Sports 4208 0.4213 6210 0.6217 7394 0.7402 8135 0.8144 8639 0.8649 Table B.13. Top n match score for first small of dataset using different weights on reputation and content scoring (1) (QA-0.25NC-0.75C(1)).

Cat Top 1 Top 2 Top 3 Top 4 Top 5 Overall 6776 0.4237 10097 0.6314 12017 0.7514 13186 0.8245 14009 0.876 Arts & Humanity 1324 0.484 1922 0.7027 2257 0.8252 2429 0.8881 2532 0.9257 Science & Maths 1634 0.5 2308 0.7062 2643 0.8087 2828 0.8653 2964 0.9069 Sports 3818 0.3822 5867 0.5874 7117 0.7125 7929 0.7938 8513 0.8523 Table B.14. Top n match score for first small of dataset using different weights on reputation and content scoring (2) (QA-0.25NC-0.75C(2)).

Cat Top 1 Top 2 Top 3 Top 4 Top 5 Overall 7029 0.4395 10229 0.6396 12075 0.7551 13189 0.8247 13964 0.8732 Arts & Humanity 1362 0.4979 1950 0.7129 2253 0.8237 2403 0.8786 2498 0.9133 Science & Maths 1684 0.5152 2320 0.7099 2652 0.8115 2839 0.8687 2966 0.9075 Sports 3983 0.3987 5959 0.5966 7170 0.7178 7947 0.7956 8500 0.851 Table B.15. Top n match score for first small of dataset using different weights on reputation and content scoring (3) (QA-0.25NC-0.75(C(1)-C(2))).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6372 0.3984 | 9533 0.5961 | 11488 0.7184 | 12751 0.7973 | 13598 0.8503
Arts & Humanity | 1458 0.533 | 2014 0.7363 | 2322 0.8489 | 2468 0.9023 | 2553 0.9334
Science & Maths | 1539 0.4709 | 2225 0.6808 | 2573 0.7873 | 2807 0.8589 | 2952 0.9033
Sports | 3375 0.3379 | 5294 0.53 | 6593 0.66 | 7476 0.7484 | 8093 0.8102
Table B.16. Top n match score for the first small dataset using different weights on reputation and content scoring (4) (QA-0.25NC-0.75C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6298 0.3938 | 9538 0.5964 | 11525 0.7207 | 12753 0.7975 | 13621 0.8517
Arts & Humanity | 1363 0.4983 | 1946 0.711 | 2259 0.8259 | 2413 0.8822 | 2506 0.9162
Science & Maths | 1512 0.4626 | 2214 0.6774 | 2598 0.7949 | 2792 0.8543 | 2942 0.9002
Sports | 3423 0.3427 | 5378 0.5384 | 6668 0.6676 | 7548 0.7557 | 8173 0.8182
Table B.17. Top n match score for the first small dataset using different weights on reputation and content scoring (5) (QA-0.25NC-0.75(C(1)-C(2))-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7982 0.4991 | 11207 0.7008 | 12833 0.8025 | 13812 0.8637 | 14440 0.903
Arts & Humanity | 1516 0.5542 | 2117 0.774 | 2379 0.8698 | 2511 0.918 | 2590 0.9469
Science & Maths | 1886 0.5771 | 2521 0.7714 | 2800 0.8567 | 2968 0.9082 | 3059 0.936
Sports | 4580 0.4585 | 6569 0.6576 | 7654 0.7663 | 8333 0.8343 | 8791 0.8801
Table B.18. Top n match score for the first small dataset using different weights on reputation and content scoring (6) (QA-0.75NC-0.25C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7679 0.4802 | 10848 0.6783 | 12602 0.788 | 13631 0.8524 | 14317 0.8953
Arts & Humanity | 1434 0.5243 | 2037 0.7447 | 2334 0.8533 | 2485 0.9085 | 2576 0.9418
Science & Maths | 1848 0.5654 | 2486 0.7607 | 2786 0.8525 | 2938 0.899 | 3046 0.932
Sports | 4397 0.4402 | 6325 0.6332 | 7482 0.749 | 8208 0.8217 | 8695 0.8705
Table B.19. Top n match score for the first small dataset using different weights on reputation and content scoring (7) (QA-0.75NC-0.25C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7793 0.4873 | 10947 0.6845 | 12667 0.7921 | 13667 0.8546 | 14358 0.8978
Arts & Humanity | 1463 0.5349 | 2054 0.751 | 2346 0.8577 | 2491 0.9107 | 2582 0.944
Science & Maths | 1864 0.5703 | 2493 0.7628 | 2788 0.8531 | 2957 0.9048 | 3060 0.9363
Sports | 4466 0.4471 | 6400 0.6407 | 7533 0.7542 | 8219 0.8228 | 8716 0.8726
Table B.20. Top n match score for the first small dataset using different weights on reputation and content scoring (8) (QA-0.75NC-0.25(C(1)-C(2))).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6467 0.4044 | 9669 0.6046 | 11599 0.7253 | 12823 0.8018 | 13650 0.8536
Arts & Humanity | 1517 0.5546 | 2114 0.7729 | 2393 0.8749 | 2519 0.921 | 2590 0.9469
Science & Maths | 1574 0.4816 | 2260 0.6915 | 2610 0.7986 | 2825 0.8644 | 2966 0.9075
Sports | 3376 0.338 | 5295 0.5301 | 6596 0.6603 | 7479 0.7487 | 8094 0.8103
Table B.21. Top n match score for the first small dataset using different weights on reputation and content scoring (9) (QA-0.75NC-0.25C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6452 0.4034 | 9689 0.6059 | 11668 0.7296 | 12879 0.8053 | 13737 0.859
Arts & Humanity | 1465 0.5356 | 2053 0.7506 | 2354 0.8606 | 2494 0.9118 | 2583 0.9444
Science & Maths | 1565 0.4788 | 2258 0.6909 | 2645 0.8093 | 2836 0.8678 | 2981 0.9121
Sports | 3422 0.3426 | 5378 0.5384 | 6669 0.6677 | 7549 0.7558 | 8173 0.8182
Table B.22. Top n match score for the first small dataset using different weights on reputation and content scoring (10) (QA-0.75NC-0.25(C(1)-C(2))-WN).

Figure B.16. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (1): weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.17. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (2): weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.18. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (3): weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.19. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (3): weighting of 0.5NC & 0.5C with WordNet comparison. (Series: QA-NC, QA-C(1), QA-C(1)-WN, QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(1)-WN, QA-0.5NC-0.5(C(1)-C(2)), QA-0.5NC-0.5(C(1)-C(2))-WN, QA-C(1)-C(2), QA-C(1)-C(2)-WN; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)

Figure B.20. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (4): weighting of 0.25NC & 0.75C with WordNet comparison. (Series: QA-NC, QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(1)-WN, QA-0.25NC-0.75(C(1)-C(2)), QA-0.25NC-0.75(C(1)-C(2))-WN; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.21. Overall top n match score for the second small dataset using different combinations of reputation and content scoring (5): weighting of 0.75NC & 0.25C with WordNet comparison. (Series: QA-NC, QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(1)-WN, QA-0.75NC-0.25(C(1)-C(2)), QA-0.75NC-0.25(C(1)-C(2))-WN; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.22. Arts & Humanity top n match score for the second small dataset using different combinations of reputation and content scoring (1): weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.23. Arts & Humanity top n match score for the second small dataset using different combinations of reputation and content scoring (2): weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.24. Arts & Humanity top n match score for the second small dataset using different combinations of reputation and content scoring (3): weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.25. Science & Maths top n match score for the second small dataset using different combinations of reputation and content scoring (1): weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.26. Science & Maths top n match score for the second small dataset using different combinations of reputation and content scoring (2): weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.27. Science & Maths top n match score for the second small dataset using different combinations of reputation and content scoring (3): weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.28. Sports top n match score for the second small dataset using different combinations of reputation and content scoring (1): weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.29. Sports top n match score for the second small dataset using different combinations of reputation and content scoring (2): weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.30. Sports top n match score for the second small dataset using different combinations of reputation and content scoring (3): weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


The following tables are based on the second small dataset of 16,024 QA pages, with Arts & Humanity containing 2,740 QA pages, Science & Maths containing 3,267 QA pages, and Sports containing 10,017 QA pages. In each table, a cell lists the number of QA pages for which the user-chosen best answer appears within the top n recommended answers, followed by the corresponding match rate.
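As an illustration only (this is a minimal sketch, not the thesis implementation; the function names and data layout are assumptions), the top n match figures reported in these tables can be reproduced as follows: a QA page counts as a match at rank n if the asker-selected best answer appears among the first n answers ranked by the method being evaluated, and the match rate is the match count divided by the number of QA pages.

```python
# Hypothetical sketch of the "Top n match" computation; names are illustrative.

def top_n_match(ranked_answer_ids, chosen_best_id, n):
    """A QA page is a Top n match if the asker-chosen best answer
    appears among the first n recommended answers."""
    return chosen_best_id in ranked_answer_ids[:n]

def top_n_match_rate(pages, n):
    """pages: iterable of (ranked_answer_ids, chosen_best_id) pairs.
    Returns (number of matches, match rate)."""
    pages = list(pages)
    matches = sum(top_n_match(ranked, chosen, n) for ranked, chosen in pages)
    return matches, matches / len(pages)
```

Read this way, the Overall row of Table B.23, for example, reports 9,018 Top 1 matches out of the 16,024 QA pages.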

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 9018 0.5627 | 12369 0.7719 | 13931 0.8693 | 14701 0.9174 | 15120 0.9435
Arts & Humanity | 1828 0.6671 | 2404 0.8773 | 2609 0.9521 | 2666 0.9729 | 2702 0.9861
Science & Maths | 2138 0.6544 | 2740 0.8386 | 2988 0.9146 | 3104 0.9501 | 3153 0.9651
Sports | 5052 0.5043 | 7225 0.7212 | 8334 0.8319 | 8931 0.8915 | 9265 0.9249
Table B.23. Top n match score for the second small dataset using reputation scoring (QA-NC).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6286 0.3922 | 9504 0.5931 | 11359 0.7088 | 12559 0.7837 | 13384 0.8352
Arts & Humanity | 1453 0.5302 | 1979 0.7222 | 2263 0.8259 | 2419 0.8828 | 2513 0.9171
Science & Maths | 1570 0.4805 | 2268 0.6942 | 2600 0.7958 | 2777 0.85 | 2873 0.8794
Sports | 3263 0.3257 | 5257 0.5248 | 6496 0.6484 | 7363 0.735 | 7998 0.7984
Table B.24. Top n match score for the second small dataset using AnswerBus (QA-C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6082 0.3795 | 9324 0.5818 | 11254 0.7023 | 12522 0.7814 | 13373 0.8345
Arts & Humanity | 1404 0.5124 | 1973 0.72 | 2271 0.8288 | 2428 0.8861 | 2518 0.9189
Science & Maths | 1569 0.4802 | 2218 0.6789 | 2566 0.7854 | 2741 0.8389 | 2856 0.8741
Sports | 3109 0.3103 | 5133 0.5124 | 6417 0.6406 | 7353 0.734 | 7999 0.7985
Table B.25. Top n match score for the second small dataset using question-answer match scoring (QA-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6494 0.4052 | 9699 0.6052 | 11505 0.7179 | 12668 0.7905 | 13479 0.8411
Arts & Humanity | 1637 0.5974 | 2151 0.785 | 2396 0.8744 | 2522 0.9204 | 2604 0.9503
Science & Maths | 1583 0.4845 | 2280 0.6978 | 2613 0.7998 | 2788 0.8533 | 2878 0.8809
Sports | 3274 0.3268 | 5268 0.5259 | 6496 0.6484 | 7358 0.7345 | 7997 0.7983
Table B.26. Top n match score for the second small dataset using a different combination of reputation and content scoring (1) (QA-0.5NC-0.5C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7624 0.4557 | 10671 0.6659 | 12355 0.771 | 13466 0.8403 | 14144 0.8826
Arts & Humanity | 1588 0.5795 | 2129 0.777 | 2377 0.8675 | 2501 0.9127 | 2577 0.9405
Science & Maths | 1928 0.5901 | 2495 0.7636 | 2757 0.8438 | 2933 0.8977 | 3003 0.9191
Sports | 4108 0.4101 | 6047 0.6036 | 7221 0.7208 | 8032 0.8018 | 8564 0.8549
Table B.27. Top n match score for the second small dataset using a different combination of reputation and content scoring (2) (QA-0.5NC-0.5C(2)).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6484 0.4046 | 9667 0.6032 | 11550 0.7207 | 12746 0.7954 | 13527 0.8441
Arts & Humanity | 1572 0.5737 | 2090 0.7627 | 2368 0.8642 | 2486 0.9072 | 2561 0.9346
Science & Maths | 1620 0.4958 | 2300 0.704 | 2621 0.8022 | 2798 0.8564 | 2892 0.8852
Sports | 3292 0.3286 | 5277 0.5268 | 6561 0.6549 | 7462 0.7449 | 8074 0.806
Table B.28. Top n match score for the second small dataset using a different combination of reputation and content scoring (3) (QA-0.5NC-0.5(C(1)-C(2))).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6323 0.3945 | 9551 0.596 | 11452 0.7146 | 12690 0.7919 | 13501 0.8425
Arts & Humanity | 1414 0.516 | 1978 0.7218 | 2285 0.8339 | 2434 0.8883 | 2534 0.9248
Science & Maths | 1614 0.494 | 2292 0.7015 | 2615 0.8004 | 2794 0.8552 | 2889 0.8842
Sports | 3295 0.3289 | 5281 0.5272 | 6552 0.654 | 7462 0.7449 | 8078 0.8064
Table B.29. Top n match score for the second small dataset using a different combination of content scoring (1) (QA-C(1)-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6209 0.3874 | 9432 0.5886 | 11350 0.7083 | 12582 0.7851 | 13427 0.8379
Arts & Humanity | 1353 0.4937 | 1913 0.6981 | 2218 0.8094 | 2396 0.8744 | 2516 0.9182
Science & Maths | 1580 0.4836 | 2229 0.6822 | 2543 0.7783 | 2739 0.8383 | 2872 0.879
Sports | 3276 0.327 | 5290 0.5281 | 6589 0.6577 | 7447 0.7434 | 8039 0.8025
Table B.30. Top n match score for the second small dataset using HITS (QA-HITS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6347 0.396 | 9540 0.5953 | 11423 0.7128 | 12582 0.7851 | 13400 0.8362
Arts & Humanity | 1455 0.531 | 1983 0.7237 | 2269 0.8281 | 2412 0.8802 | 2514 0.9175
Science & Maths | 1593 0.4876 | 2276 0.6966 | 2607 0.7979 | 2783 0.8518 | 2892 0.8852
Sports | 3299 0.3293 | 5281 0.5272 | 6547 0.6535 | 7387 0.7374 | 7994 0.798
Table B.31. Top n match score for the second small dataset using WordNet (QA-C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7976 0.4977 | 11112 0.6934 | 12686 0.7916 | 13648 0.8517 | 14287 0.8916
Arts & Humanity | 1623 0.5923 | 2147 0.7835 | 2395 0.874 | 2512 0.9167 | 2599 0.9485
Science & Maths | 1984 0.6072 | 2582 0.7903 | 2827 0.8653 | 2964 0.9072 | 3039 0.9302
Sports | 4369 0.4361 | 6383 0.6372 | 7464 0.7451 | 8172 0.8158 | 8649 0.8634
Table B.32. Top n match score for the second small dataset using a different combination of reputation and content scoring (4) (QA-0.5NC-0.5C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7646 0.4771 | 10704 0.6679 | 12349 0.7706 | 13458 0.8398 | 14104 0.8801
Arts & Humanity | 1566 0.5715 | 2084 0.7605 | 2358 0.8605 | 2487 0.9076 | 2562 0.935
Science & Maths | 1929 0.5904 | 2517 0.7704 | 2772 0.8484 | 2938 0.8992 | 3014 0.9225
Sports | 4151 0.4143 | 6103 0.6092 | 7219 0.7206 | 8033 0.8019 | 8528 0.8513
Table B.33. Top n match score for the second small dataset using a different combination of reputation and content scoring (5) (QA-0.5NC-0.5(C(1)-C(2))-WN).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6355 0.3965 | 9568 0.5971 | 11469 0.7157 | 12683 0.7915 | 13528 0.8442
Arts & Humanity | 1421 0.5186 | 1979 0.7222 | 2287 0.8346 | 2432 0.8875 | 2528 0.9226
Science & Maths | 1621 0.4961 | 2287 0.7 | 2615 0.8004 | 2800 0.857 | 2904 0.8888
Sports | 3313 0.3307 | 5302 0.5293 | 6567 0.6555 | 7451 0.7438 | 8096 0.8082
Table B.34. Top n match score for the second small dataset using a different combination of content scoring (2) (QA-C(1)-C(2)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6448 0.4023 | 9667 0.6032 | 11484 0.7166 | 12651 0.7895 | 13470 0.8406
Arts & Humanity | 1592 0.581 | 2131 0.7777 | 2385 0.8704 | 2502 0.9131 | 2592 0.9459
Science & Maths | 1585 0.4851 | 2279 0.6975 | 2614 0.8001 | 2787 0.853 | 2880 0.8815
Sports | 3271 0.3265 | 5257 0.5248 | 6485 0.6473 | 7362 0.7349 | 7998 0.7984
Table B.35. Top n match score for the second small dataset using a different combination of reputation and content scoring (6) (QA-0.25NC-0.75C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7159 0.4467 | 10248 0.6395 | 12073 0.7534 | 13196 0.8235 | 13970 0.8718
Arts & Humanity | 1552 0.5664 | 2090 0.7627 | 2349 0.8572 | 2483 0.9062 | 2560 0.9343
Science & Maths | 1818 0.5564 | 2419 0.7404 | 2715 0.831 | 2883 0.8824 | 2970 0.909
Sports | 3789 0.3782 | 5739 0.5729 | 7009 0.6997 | 7830 0.7816 | 8440 0.8425
Table B.36. Top n match score for the second small dataset using a different combination of reputation and content scoring (7) (QA-0.25NC-0.75C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6447 0.4023 | 9619 0.6002 | 11502 0.7177 | 12717 0.7936 | 13511 0.8431
Arts & Humanity | 1530 0.5583 | 2054 0.7496 | 2321 0.847 | 2459 0.8974 | 2544 0.9284
Science & Maths | 1620 0.4958 | 2294 0.7021 | 2620 0.8019 | 2796 0.8558 | 2892 0.8852
Sports | 3297 0.3291 | 5271 0.5262 | 6561 0.6549 | 7462 0.7449 | 8075 0.8061
Table B.37. Top n match score for the second small dataset using a different combination of reputation and content scoring (8) (QA-0.25NC-0.75(C(1)-C(2))).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7676 0.479 | 10810 0.6746 | 12470 0.7782 | 13454 0.8396 | 14130 0.8818
Arts & Humanity | 1588 0.5795 | 2119 0.7733 | 2373 0.866 | 2492 0.9094 | 2583 0.9427
Science & Maths | 1923 0.5886 | 2526 0.7731 | 2791 0.8543 | 2938 0.8992 | 3013 0.9222
Sports | 4165 0.4157 | 6165 0.6154 | 7306 0.7293 | 8024 0.801 | 8534 0.8519
Table B.38. Top n match score for the second small dataset using a different combination of reputation and content scoring (9) (QA-0.25NC-0.75C(1)-WN).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7221 0.4506 | 10305 0.643 | 12049 0.7519 | 13178 0.8223 | 13885 0.8665
Arts & Humanity | 1528 0.5576 | 2055 0.75 | 2331 0.8507 | 2460 0.8978 | 2541 0.9273
Science & Maths | 1831 0.5604 | 2428 0.7431 | 2720 0.8325 | 2901 0.8879 | 2974 0.9103
Sports | 3862 0.3855 | 5822 0.5812 | 6998 0.6986 | 7817 0.7803 | 8370 0.8355
Table B.39. Top n match score for the second small dataset using a different combination of reputation and content scoring (10) (QA-0.25NC-0.75(C(1)-C(2))-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6699 0.418 | 9919 0.619 | 11723 0.7315 | 12859 0.8024 | 13648 0.8517
Arts & Humanity | 1659 0.6054 | 2202 0.8036 | 2452 0.8948 | 2564 0.9357 | 2628 0.9591
Science & Maths | 1777 0.5439 | 2459 0.7526 | 2784 0.8521 | 2928 0.8962 | 3018 0.9237
Sports | 3263 0.3257 | 5258 0.5249 | 6487 0.6475 | 7367 0.7354 | 8002 0.7988
Table B.40. Top n match score for the second small dataset using a different combination of reputation and content scoring (11) (QA-0.75NC-0.25C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7956 0.4965 | 11021 0.6877 | 12710 0.7931 | 13749 0.858 | 14362 0.8962
Arts & Humanity | 1629 0.5945 | 2181 0.7959 | 2438 0.8897 | 2555 0.9324 | 2621 0.9565
Science & Maths | 2013 0.6161 | 2575 0.7881 | 2839 0.8689 | 2995 0.9167 | 3062 0.9372
Sports | 4314 0.4306 | 6265 0.6254 | 7433 0.742 | 8199 0.8185 | 8679 0.8664
Table B.41. Top n match score for the second small dataset using a different combination of reputation and content scoring (12) (QA-0.75NC-0.25C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 6710 0.4187 | 9895 0.6175 | 11752 0.7333 | 12950 0.8081 | 13711 0.8556
Arts & Humanity | 1629 0.5945 | 2178 0.7948 | 2433 0.8879 | 2563 0.9354 | 2621 0.9565
Science & Maths | 1788 0.5472 | 2435 0.7453 | 2763 0.8457 | 2923 0.8947 | 3004 0.9194
Sports | 3293 0.3287 | 5282 0.5273 | 6556 0.6544 | 7464 0.7451 | 8086 0.8072
Table B.42. Top n match score for the second small dataset using a different combination of reputation and content scoring (13) (QA-0.75NC-0.25(C(1)-C(2))).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 8242 0.5143 | 11371 0.7096 | 12959 0.8087 | 13894 0.867 | 14467 0.9028
Arts & Humanity | 1657 0.6047 | 2193 0.8003 | 2443 0.8916 | 2555 0.9324 | 2617 0.9551
Science & Maths | 2043 0.6253 | 2623 0.8028 | 2885 0.883 | 3011 0.9216 | 3089 0.9455
Sports | 4542 0.4534 | 6555 0.6543 | 7631 0.7618 | 8328 0.8313 | 8761 0.8746
Table B.43. Top n match score for the second small dataset using a different combination of reputation and content scoring (14) (QA-0.75NC-0.25C(1)-WN).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 7994 0.4988 | 11077 0.6912 | 12762 0.7964 | 13778 0.8598 | 14357 0.8959
Arts & Humanity | 1618 0.5905 | 2168 0.7912 | 2435 0.8886 | 2561 0.9346 | 2616 0.9547
Science & Maths | 2029 0.621 | 2599 0.7955 | 2864 0.8766 | 3006 0.9201 | 3067 0.9387
Sports | 4347 0.4339 | 6310 0.6299 | 7463 0.745 | 8211 0.8197 | 8674 0.8659
Table B.44. Top n match score for the second small dataset using a different combination of reputation and content scoring (15) (QA-0.75NC-0.25(C(1)-C(2))-WN).


Figure B.31. Overall top n match score for the big dataset using different combinations of reputation and content scoring (1): weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.32. Overall top n match score for the big dataset using different combinations of reputation and content scoring (2): weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.33. Arts & Humanity top n match score for the big dataset using different combinations of reputation and content scoring: weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)

Figure B.34. Science & Maths top n match score for the big dataset using different combinations of reputation and content scoring: weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.35. Sports top n match score for the big dataset using different combinations of reputation and content scoring: weighting of 0.5NC & 0.5C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.5NC-0.5C(1), QA-0.5NC-0.5C(2), QA-0.5NC-0.5(C(1)-C(2)), QA-C(1)-C(2), QA-HITS; y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.36. Overall top n match rate for the big dataset using 0.75 reputation weighting and 0.25 content weighting: weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.37. Arts & Humanity top n match rate for the big dataset using 0.25 reputation weighting and 0.75 content weighting: weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.38. Arts & Humanity top n match rate for the big dataset using 0.75 reputation weighting and 0.25 content weighting: weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.39. Science & Maths top n match rate for the big dataset using 0.25 reputation weighting and 0.75 content weighting: weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.40. Science & Maths top n match rate for the big dataset using 0.75 reputation weighting and 0.25 content weighting: weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.41. Sports top n match rate for the big dataset using 0.25 reputation weighting and 0.75 content weighting: weighting of 0.25NC & 0.75C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.25NC-0.75C(1), QA-0.25NC-0.75C(2), QA-0.25NC-0.75(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


Figure B.42. Sports top n match rate for the big dataset using 0.75 reputation weighting and 0.25 content weighting: weighting of 0.75NC & 0.25C. (Series: QA-NC, QA-C(1), QA-C(2), QA-0.75NC-0.25C(1), QA-0.75NC-0.25C(2), QA-0.75NC-0.25(C(1)-C(2)), QA-C(1)-C(2); y-axis: match rate with user chosen answers; x-axis: Top 1 to Top 5.)


The following tables are based on the big dataset of 63,990 QA pages, with Arts & Humanity containing 10,939 QA pages, Science & Maths containing 13,052 QA pages, and Sports containing 39,999 QA pages; each cell again lists the number of matches and the corresponding match rate.
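The weighted schemes reported below (for example QA-0.75NC-0.25C(1)) combine the non-content reputation score and a content score before the answers are ranked. The following is a minimal sketch of such a weighted linear combination, assuming both scores have already been normalised to a common scale; it is illustrative only and the function and field names are not taken from the thesis.

```python
# Hypothetical sketch of a weighted reputation/content combination; names are illustrative.

def combined_score(nc_score, c_score, nc_weight):
    """Linear combination of a non-content (reputation) score and a content score."""
    return nc_weight * nc_score + (1.0 - nc_weight) * c_score

def rank_answers(answers, nc_weight=0.75):
    """answers: list of dicts with normalised 'nc' and 'c' scores in [0, 1].
    nc_weight=0.75 corresponds to a 0.75NC-0.25C weighting."""
    return sorted(answers,
                  key=lambda a: combined_score(a["nc"], a["c"], nc_weight),
                  reverse=True)
```

With nc_weight set to 0.5 or 0.25, the same routine would correspond to the 0.5NC-0.5C and 0.25NC-0.75C weightings that label the tables below.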

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 35443 0.5538 | 48700 0.761 | 55092 0.8609 | 58309 0.9112 | 60239 0.9413
Arts & Humanity | 7241 0.6619 | 9450 0.8638 | 10332 0.9445 | 10640 0.9726 | 10784 0.9858
Science & Maths | 8184 0.627 | 10710 0.8205 | 11780 0.9025 | 12259 0.9392 | 12549 0.9614
Sports | 20018 0.5004 | 28540 0.7135 | 32980 0.8245 | 35410 0.8852 | 36906 0.9226
Table B.45. Top n match score for the big dataset using reputation scoring (QA-NC).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25282 0.395 | 37970 0.5933 | 45603 0.7126 | 50535 0.7897 | 53795 0.8406
Arts & Humanity | 5627 0.5143 | 7782 0.7113 | 8918 0.8152 | 9623 0.8796 | 10006 0.9147
Science & Maths | 6233 0.4775 | 8936 0.6846 | 10342 0.7923 | 11119 0.8519 | 11598 0.8885
Sports | 13422 0.3355 | 21252 0.5313 | 26343 0.6585 | 29793 0.7448 | 32191 0.8047
Table B.46. Top n match score for the big dataset using content scoring (1) (QA-C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 24214 0.3784 | 37170 0.5808 | 45197 0.7063 | 50303 0.7861 | 53746 0.8399
Arts & Humanity | 5457 0.4988 | 7764 0.7097 | 8952 0.8183 | 9620 0.8794 | 10020 0.9159
Science & Maths | 6050 0.4635 | 8749 0.6703 | 10248 0.7851 | 11062 0.8475 | 11576 0.8869
Sports | 12707 0.3176 | 20657 0.5164 | 25997 0.6499 | 29621 0.7405 | 32150 0.8037
Table B.47. Top n match score for the big dataset using content scoring (2) (QA-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25993 0.4062 | 38662 0.6041 | 46154 0.7212 | 50918 0.7957 | 54105 0.8455
Arts & Humanity | 6355 0.5809 | 8459 0.7732 | 9451 0.8639 | 9985 0.9127 | 10309 0.9424
Science & Maths | 6221 0.4766 | 8930 0.6841 | 10338 0.792 | 11120 0.8519 | 11594 0.8882
Sports | 13417 0.3354 | 21273 0.5318 | 26365 0.6591 | 29813 0.7453 | 32202 0.805
Table B.48. Top n match score for the big dataset using a different combination of reputation and content scoring (1) (QA-0.5NC-0.5C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 30683 0.4794 | 42934 0.6709 | 49845 0.7789 | 54100 0.8454 | 56867 0.8886
Arts & Humanity | 6284 0.5744 | 8403 0.7681 | 9426 0.8616 | 9973 0.9116 | 10296 0.9412
Science & Maths | 7479 0.573 | 9871 0.7562 | 11101 0.8505 | 11736 0.8991 | 12098 0.9269
Sports | 16920 0.423 | 24660 0.6165 | 29318 0.7329 | 32391 0.8097 | 34473 0.8618
Table B.49. Top n match score for the big dataset using a different combination of reputation and content scoring (2) (QA-0.5NC-0.5C(2)).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 30037 0.4694 | 42519 0.6644 | 49597 0.775 | 53779 0.8404 | 56472 0.8825
Arts & Humanity | 6238 0.5702 | 8317 0.7603 | 9373 0.8568 | 9902 0.9052 | 10231 0.9352
Science & Maths | 7573 0.5802 | 9996 0.7658 | 11189 0.8572 | 11776 0.9022 | 12113 0.928
Sports | 16226 0.4056 | 24206 0.6051 | 29035 0.7258 | 32101 0.8025 | 34128 0.8532
Table B.50. Top n match score for the big dataset using a different combination of reputation and content scoring (3) (QA-0.5NC-0.5(C(1)-C(2))).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25101 0.3922 | 38074 0.5949 | 45937 0.7178 | 50828 0.7943 | 54140 0.846
Arts & Humanity | 5547 0.507 | 7860 0.7185 | 9029 0.8253 | 9690 0.8858 | 10056 0.9192
Science & Maths | 6295 0.4823 | 8965 0.6868 | 10406 0.7972 | 11199 0.858 | 11657 0.8931
Sports | 13259 0.3314 | 21249 0.5312 | 26502 0.6625 | 29939 0.7484 | 32427 0.8106
Table B.51. Top n match score for the big dataset using a different combination of content scoring (1) (QA-C(1)-C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25073 0.3918 | 37972 0.5934 | 45755 0.715 | 50764 0.7933 | 54146 0.8461
Arts & Humanity | 5283 0.4829 | 7549 0.69 | 8777 0.8023 | 9545 0.8725 | 10023 0.9162
Science & Maths | 6313 0.4836 | 8885 0.6807 | 10266 0.7865 | 11095 0.85 | 11605 0.8891
Sports | 13477 0.3369 | 21538 0.5384 | 26712 0.6678 | 30124 0.7531 | 32518 0.8129
Table B.52. Top n match score for the big dataset using HITS (QA-HITS).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25492 0.3983 | 38311 0.5987 | 45901 0.7173 | 50718 0.7925 | 53921 0.8426
Arts & Humanity | 5670 0.5183 | 7814 0.7143 | 8949 0.818 | 9619 0.8793 | 10005 0.9146
Science & Maths | 6270 0.4803 | 8987 0.6885 | 10358 0.7935 | 11127 0.8525 | 11607 0.8892
Sports | 13552 0.3388 | 21510 0.5377 | 26594 0.6648 | 29972 0.7493 | 32309 0.8077
Table B.53. Top n match score for the big dataset using a different combination of content scoring (2) (QA-C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 26154 0.4087 | 38849 0.6071 | 46281 0.7232 | 50996 0.7969 | 54141 0.846
Arts & Humanity | 6386 0.5837 | 8477 0.7749 | 9468 0.8655 | 9991 0.9133 | 10308 0.9423
Science & Maths | 6248 0.4787 | 8955 0.6861 | 10334 0.7917 | 11126 0.8524 | 11605 0.8891
Sports | 13520 0.338 | 21417 0.5354 | 26479 0.6619 | 29879 0.7469 | 32228 0.8057
Table B.54. Top n match score for the big dataset using a different combination of reputation and content scoring (4) (QA-0.5NC-0.5C(1)-WN).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25782 0.4029 | 38602 0.6032 | 46310 0.7237 | 51092 0.7984 | 54349 0.8493
Arts & Humanity | 6131 0.5604 | 8269 0.7559 | 9347 0.8544 | 9898 0.9048 | 10217 0.9339
Science & Maths | 6313 0.4836 | 8983 0.6882 | 10410 0.7975 | 11202 0.8582 | 11662 0.8935
Sports | 13338 0.3334 | 21350 0.5337 | 26553 0.6638 | 29992 0.7498 | 32470 0.8117
Table B.55. Top n match score for the big dataset using a different combination of reputation and content scoring (5) (QA-0.5NC-0.5(C(1)-C(2))-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25212 0.3939 | 38215 0.5972 | 46007 0.7189 | 50884 0.7951 | 54187 0.8468
Arts & Humanity | 5554 0.5077 | 7875 0.7199 | 9046 0.8269 | 9687 0.8855 | 10055 0.9191
Science & Maths | 6310 0.4834 | 8988 0.6886 | 10410 0.7975 | 11200 0.8581 | 11664 0.8936
Sports | 13348 0.3337 | 21352 0.5338 | 26551 0.6637 | 29997 0.7499 | 32468 0.8117
Table B.56. Top n match score for the big dataset using a different combination of reputation and content scoring (6) (QA-C(1)-C(2)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 26423 0.4129 | 39119 0.6113 | 46614 0.7284 | 51309 0.8018 | 54446 0.8508
Arts & Humanity | 6276 0.5737 | 8393 0.7672 | 9425 0.8615 | 9960 0.9105 | 10274 0.9392
Science & Maths | 6694 0.5128 | 9403 0.7204 | 10769 0.825 | 11461 0.8781 | 11912 0.9126
Sports | 13453 0.3363 | 21323 0.533 | 26420 0.6605 | 29888 0.7472 | 32260 0.8065
Table B.57. Top n match rate for the big dataset using 0.25 for reputation weighting and 0.75 for AnswerBus weighting (QA-0.25NC-0.75C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 28425 0.4442 | 41154 0.6431 | 48534 0.7584 | 53067 0.8293 | 56083 0.8764
Arts & Humanity | 6056 0.5536 | 8271 0.7561 | 9324 0.8523 | 9912 0.9061 | 10238 0.9359
Science & Maths | 7012 0.5372 | 9528 0.73 | 10825 0.8293 | 11528 0.8832 | 11938 0.9146
Sports | 15357 0.3839 | 23355 0.5838 | 28385 0.7096 | 31627 0.7906 | 33907 0.8476
Table B.58. Top n match rate for the big dataset using 0.25 for reputation weighting and 0.75 for question-answer weighting (QA-0.25NC-0.75C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25989 0.4061 | 38765 0.6057 | 46449 0.7258 | 51197 0.8 | 54445 0.8508
Arts & Humanity | 6029 0.5511 | 8155 0.7454 | 9234 0.8441 | 9825 0.8981 | 10165 0.9292
Science & Maths | 6680 0.5117 | 9316 0.7137 | 10684 0.8185 | 11407 0.8739 | 11843 0.9073
Sports | 13280 0.332 | 21294 0.5323 | 26531 0.6632 | 29965 0.7491 | 32437 0.8109
Table B.59. Top n match rate for the big dataset using 0.25 for reputation weighting and 0.75 for content weighting (1) (QA-0.25NC-0.75(C(1)-C(2))).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 26542 0.4147 | 39260 0.6135 | 46729 0.7302 | 51364 0.8026 | 54469 0.8512
Arts & Humanity | 6290 0.575 | 8394 0.7673 | 9418 0.8609 | 9959 0.9104 | 10268 0.9386
Science & Maths | 6723 0.515 | 9413 0.7211 | 10774 0.8254 | 11478 0.8794 | 11921 0.9133
Sports | 13529 0.3382 | 21453 0.5363 | 26537 0.6634 | 29927 0.7481 | 32280 0.807
Table B.60. Top n match rate for the big dataset using 0.25 reputation weighting and 0.75 for AnswerBus with WordNet (QA-0.25NC-0.75C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 25995 0.4062 | 38856 0.6072 | 46505 0.7267 | 51265 0.8011 | 54477 0.8513
Arts & Humanity | 6023 0.5505 | 8159 0.7458 | 9245 0.8451 | 9836 0.8991 | 10164 0.9291
Science & Maths | 6615 0.5068 | 9297 0.7123 | 10663 0.8169 | 11418 0.8748 | 11844 0.9074
Sports | 13357 0.3339 | 21400 0.535 | 26597 0.6649 | 30011 0.7502 | 32469 0.8117
Table B.61. Top n match rate for the big dataset using 0.25 reputation weighting and 0.75 for content weighting (2) (QA-0.25NC-0.75(C(1)-C(2))-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 27046 0.4226 | 39720 0.6207 | 47119 0.7363 | 51727 0.8083 | 54775 0.8559
Arts & Humanity | 6625 0.6056 | 8765 0.8012 | 9712 0.8878 | 10196 0.932 | 10437 0.9541
Science & Maths | 6981 0.5348 | 9635 0.7382 | 10996 0.8424 | 11643 0.892 | 12074 0.925
Sports | 13440 0.336 | 21320 0.533 | 26411 0.6602 | 29888 0.7472 | 32264 0.8066
Table B.62. Top n match rate for the big dataset using 0.75 for reputation weighting and 0.25 for content weighting (1) (QA-0.75NC-0.25C(1)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 32029 0.5005 | 44449 0.6946 | 51213 0.8003 | 55098 0.861 | 57527 0.8989
Arts & Humanity | 6483 0.5926 | 8631 0.789 | 9658 0.8828 | 10137 0.9266 | 10416 0.9521
Science & Maths | 7802 0.5977 | 10224 0.7833 | 11401 0.8735 | 11956 0.916 | 12278 0.9406
Sports | 17744 0.4436 | 25594 0.6398 | 30154 0.7538 | 33005 0.8251 | 34833 0.8708
Table B.63. Top n match rate for the big dataset using 0.75 for reputation weighting and 0.25 for content weighting (2) (QA-0.75NC-0.25C(2)).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 26812 0.419 | 39640 0.6194 | 47237 0.7381 | 51819 0.8097 | 54959 0.8588
Arts & Humanity | 6513 0.5953 | 8629 0.7888 | 9658 0.8828 | 10138 0.9267 | 10415 0.952
Science & Maths | 6941 0.5317 | 9612 0.7364 | 10984 0.8415 | 11671 0.8941 | 12077 0.9252
Sports | 13358 0.3339 | 21399 0.5349 | 26595 0.6648 | 30010 0.7502 | 32467 0.8116
Table B.64. Top n match rate for the big dataset using 0.75 for reputation weighting and 0.25 for content weighting (3) (QA-0.75NC-0.25(C(1)-C(2))).


Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 27034 0.4224 | 39796 0.6219 | 47191 0.7374 | 51744 0.8086 | 54770 0.8559
Arts & Humanity | 6638 0.6068 | 8767 0.8014 | 9722 0.8887 | 10194 0.9318 | 10428 0.9532
Science & Maths | 6869 0.5262 | 9570 0.7332 | 10928 0.8372 | 11621 0.8903 | 12063 0.9242
Sports | 13527 0.3381 | 21459 0.5364 | 26541 0.6635 | 29929 0.7482 | 32279 0.8069
Table B.65. Top n match rate for the big dataset using 0.75 for reputation weighting and 0.25 for content weighting (4) (QA-0.75NC-0.25C(1)-WN).

Cat | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
Overall | 26812 0.419 | 39640 0.6194 | 47237 0.7381 | 51819 0.8097 | 54959 0.8588
Arts & Humanity | 6513 0.5953 | 8629 0.7888 | 9658 0.8828 | 10138 0.9267 | 10415 0.952
Science & Maths | 6941 0.5317 | 9612 0.7364 | 10984 0.8415 | 11671 0.8941 | 12077 0.9252
Sports | 13358 0.3339 | 21399 0.5349 | 26595 0.6648 | 30010 0.7502 | 32467 0.8116
Table B.66. Top n match rate for the big dataset using 0.75 for reputation weighting and 0.25 for content weighting (5) (QA-0.75NC-0.25(C(1)-C(2))-WN).
