Social Impact Retrieval: Measuring Author Influence on Information Retrieval

James Lanagan

A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing

Supervisor: Prof. Alan F. Smeaton

June 2009

Declaration

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: ID No: Date:

ACKNOWLEDGEMENTS

Firstly I would like to thank my supervisor Alan Smeaton for the help he has given me in finding a direction for my research. I must thank him also for the effort he has put into the balancing act of knowing when to let me follow my own ideas, whilst still providing that direction.

A collection of people helped me through this thesis, all of whom provided me with intelligent insight and the odd push to keep me focussed and confident that the research could be done. During the creation of both my initial systems, Hyowon Lee was of immeasurable help in advising on and tweaking the ideas that I had for how the systems should look. Throughout this thesis Paul Ferguson and Pete Wilkins have both provided valuable knowledge and help in getting through my experiments, providing me with expert judgements, and lastly with perhaps some questionable sporting wisdom. Colum Foley, Adam Bermingham, Cathal Gurrin, and Gareth Jones all listened to my thoughts towards the later months of my thesis and helped to push me over the finish line.

A word of thanks to all those both past and present within the CDVP who have helped me enjoy my PhD experience, in particular: Georgina, Sinéad, Kieran, Mike, Kevin, Phil, Fabrice, Sandra, Neil, Edel, Daragh and Niamh.

Mike Christel first got me into the information retrieval game, and put me in contact with Alan and the CDVP. For this I am always grateful.

Mick Cooney for all the advice and direction he has provided thanks to his incredible knowledge of all things statistical or otherwise.

I must also thank most sincerely my parents Geoff and Jane for all they have done for me through my many adventures. Your unwavering support, along with that of my sister Sarah, has meant so much. I am truly blessed.

Lastly, I thank my girlfriend Linda for all that she has put up with through these last few years. It has been made easier each day by your constant patience and encouragement, and I could not have done this without you. Thank you from the bottom of my heart.

ABSTRACT

The increased presence of technologies collectively referred to as Web 2.0 means that the entire process of new media production and dissemination has moved away from an author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly. In this thesis we propose a new way of looking at a user's contributions to the network in which they are present, using these interactions to provide a measure of the user's authority and centrality. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms: AuthorRank and MessageRank.

We present two real-world user experiments focussed around the multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten-year period (1997-2007) of the ACM SIGIR conference on information retrieval. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user-generated content, or annotations, to improve information retrieval.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS i
ABSTRACT ii
LIST OF TABLES ix
LIST OF FIGURES xi

I INTRODUCTION 1
  1.1 Social Information Retrieval 1
    1.1.1 User Generated Content and the Social Web 3
  1.2 Thesis Hypothesis 4
  1.3 Research Objectives 5
  1.4 Thesis Organisation 6

II BACKGROUND & RELATED WORK 8
  2.1 Information Retrieval 9
    2.1.1 An Information System 10
    2.1.2 Retrieval Strategies 13
    2.1.3 Linkage Analysis 20
    2.1.4 Evaluating Retrieval Performance 24
    2.1.5 Combining Sources of Evidence 26
  2.2 Social Network Analysis 29
    2.2.1 The Small World 30
    2.2.2 Social Network Models 31
  2.3 Trust and Authority 33
    2.3.1 A Trust Framework 34
    2.3.2 Reputation 36
    2.3.3 Propagation of Trust 37
  2.4 Data Quality 40
    2.4.1 Defining Data 41
    2.4.2 Data Systems 42
    2.4.3 Classifications of Data Quality 44
  2.5 Annotation 48
    2.5.1 Physical Vs. Digital 49
    2.5.2 Annotations as Queries 51
    2.5.3 Annotations as Hyper-links 51
    2.5.4 Grouping Annotations 52

III WEB 2.0 54
  3.1 Web 2.0: People Talking to People 54
    3.1.1 Vote for me 55
    3.1.2 Tag, you're it 57
    3.1.3 Social Commentary 59
  3.2 Two novel Web 2.0 systems 60
    3.2.1 SportsAnno 61
    3.2.2 Summarising Sporting Events 65
    3.2.3 Annoby 73
    3.2.4 Comparison of SportsAnno and Annoby Usage 82
  3.3 Conclusions 89

IV EXPERIMENTAL SETUP 90
  4.1 Citation 91
    4.1.1 Citation Indexing 92
    4.1.2 Citation Analysis 94
  4.2 An Annotated Corpus 98
    4.2.1 The SIGIR Corpus 99
  4.3 Deriving Annotations from Scientific Citations 103
    4.3.1 System Description 104
    4.3.2 An XML Collection 108
  4.4 Graph Theory 110
  4.5 Corpus Analysis 112
    4.5.1 Citation Network of SIGIR Publications 114
    4.5.2 Citation Network of SIGIR Authors 115
  4.6 Network measures 119
    4.6.1 The Value of Authorship 122
  4.7 An Hypothesis Re-visited 126

V EXPERIMENTAL SETUP II 128
  5.1 Creation of a Ground Truth 128
    5.1.1 Document Selection 129
    5.1.2 Expert Rankings 133
    5.1.3 A Combined Expert Ground-Truth Ranking 136
  5.2 Additional Ranking Sources 137
  5.3 Comparisons of Rankings 139
    5.3.1 Comparing Google Scholar to Our Experts 141
    5.3.2 Harnessing Community Expertise 144

VI EXPERIMENTS 146
  6.1 A Search System 146
    6.1.1 Document Relevance 148
    6.1.2 A TF-IDF Baseline 148
  6.2 Author Value Revisited 150
  6.3 Calculation of Author Feature Contributions 156
    6.3.1 Combination of Author Features 167
  6.4 Calculation of AuthorRank Weights 170
  6.5 The Contribution of Single Messages 173
  6.6 Calculation of Message Feature Contributions 177
    6.6.1 Combination of Message Features 181
  6.7 Calculation of MessageRank Weights 186
    6.7.1 The Performance of MessageRank 189
  6.8 Comparisons with the SportsAnno Corpus 191
    6.8.1 Collection of a Ground-Truth 192
    6.8.2 Searching Against the SportsAnno Corpus 195
    6.8.3 Using Ar and Mr to Re-Rank 197

VII CONCLUSIONS AND SUMMARY 200
  7.1 Hypothesis Re-visited 200
  7.2 Research Objectives Re-visited 201
    7.2.1 Annotation 201
    7.2.2 Utility 202
  7.3 Conclusions 203
    7.3.1 Annotation Creation 203
    7.3.2 Expert Opinion 204
    7.3.3 Author Network Features 205
    7.3.4 Citation/Annotation Network 205
  7.4 Considerations 206
  7.5 Directions for Future Work 208
  7.6 Summary 210

APPENDIX A | ANNOBY USER QUESTIONNAIRE 212
APPENDIX B | KENDALL'S CO-EFFICIENT OF CONCORDANCE 216
APPENDIX C | XML AND MPEG-7 219
REFERENCES 221

LIST OF TABLES

1 Dimensions of Data Quality according to the Intuitive Approach 45
2 Dimensions of Data Quality according to the Empirical Approach 47
5 PDF file error Statistics for the extended SIGIR corpus 101
6 PDF file error Statistics for the extended SIGIR corpus 105
7 Statistics for the different graphs of SIGIR and extended SIGIR corpora 114
8 PageRank for citation of SIGIR papers within the extended SIGIR corpus 116
9 Statistics For The Top-10 Authors By Co-Authorship 118
10 Top 5 Authors by Co-Author and Publication Counts 119
11 Statistics for the top-10 authors by citation within our extended SIGIR corpus 120
12 Number of papers per year for each of the topics chosen 130
13 Number of documents returned by Google Scholar queries restricted to the years 1997-2007, and only the ACM SIGIR publication series 132
14 Rankings assigned for collaborative feedback documents by the experts