On Search Engine Evaluation Metrics
Inaugural dissertation for the degree of Doctor of Philosophy (Dr. phil.), submitted to the Philosophische Fakultät of Heinrich-Heine-Universität Düsseldorf
Submitted by Pavel Sirotkin, from Düsseldorf
Supervisor: Prof. Wolfgang G. Stock
Düsseldorf, April 2012

- Oh my God, a mistake!
- It’s not our mistake!
- Isn’t it? Whose is it?
- Information Retrieval.
“BRAZIL”

Acknowledgements

One man deserves the credit, one man deserves the blame…
TOM LEHRER, “LOBACHEVSKY”

I would like to thank my supervisor, Wolfgang Stock, who provided me with patience, support and the occasional much-needed prod to my derrière. He gave me the opportunity to write part of this thesis as part of my research at the Department of Information Science at Düsseldorf University; and it was also he who arranged for undergraduate students to act as raters for the study described in this thesis.

I would like to thank my co-supervisor, Wiebke Petersen, who bravely delved into a topic not directly connected to her research, and took the thesis on sightseeing tours in India and to winter beaches in Spain. Wiebke did not spare me any mathematical rod, and many a faulty formula has been spotted thanks to her.

I would like to thank Dirk Lewandowski, in whose undergraduate seminar I first encountered the topic of web search evaluation, and who provided me with encouragement and education on the topic. I am also indebted to him for valuable comments on a draft of this thesis.

I would like to thank the aforementioned undergraduates from the Department of Information Science for the time and effort they spent providing the data on which this thesis’ practical part stands.

Last, but definitely not least, I thank my wife Alexandra, to whom I am indebted for far more than I can express. She even tried to read my thesis, which just serves to show.
As is the custom, I happily credit those acknowledged with all the good that I have derived from their help, while offering to blame myself for any errors they might have induced.

Contents

1 Introduction
  1.1 What It Is All About
  1.2 Web Search and Search Engines
  1.3 Web Search Evaluation
Part I: Search Engine Evaluation Measures
2 Search Engines and Their Users
  2.1 Search Engines in a Nutshell
  2.2 Search Engine Usage
3 Evaluation and What It Is About
4 Explicit Metrics
  4.1 Recall, Precision and Their Direct Descendants
  4.2 Other System-based Metrics
  4.3 User-based Metrics
  4.4 General Problems of Explicit Metrics
5 Implicit Metrics
6 Implicit and Explicit Metrics
Part II: Meta-Evaluation
7 The Issue of Relevance
8 A Framework for Web Search Meta-Evaluation
  8.1 Evaluation Criteria
  8.2 Evaluation Methods
    8.2.1 The Preference Identification Ratio
    8.2.2 PIR Graphs
9 Proof of Concept: A Study
  9.1 Gathering the Data
  9.2 The Queries
  9.3 User Behavior
  9.4 Ranking Algorithm Comparison
10 Explicit Metrics
  10.1 (N)DCG
  10.2 Precision
  10.3 (Mean) Average Precision
  10.4 Other Metrics
  10.5 Inter-metric Comparison
  10.6 Preference Judgments and Extrinsic Single-result Ratings
  10.7 PIR and Relevance Scales
    10.7.1 Binary Relevance
    10.7.2 Three-point Relevance
11 Implicit Metrics
  11.1 Session Duration Evaluation
  11.2 Click-based Evaluations
    11.2.1 Click Count
    11.2.2 Click Rank
12 Results: A Discussion
  12.1 Search Engines and Users
  12.2 Parameters and Metrics
    12.2.1 Discount Functions
    12.2.2 Thresholds
      12.2.2.1 Detailed Preference Identification
    12.2.3 Rating Sources
    12.2.4 Relevance Scales
    12.2.5 Cut-off Ranks
    12.2.6 Metric Performance
  12.3 The Methodology and Its Potential
  12.4 Further Research Possibilities
Executive Summary
Bibliography
Appendix: Metrics Evaluated in Part II

1 Introduction

You shall seek all day ere you find them, and when you have them, they are not worth the search.
WILLIAM SHAKESPEARE, “THE MERCHANT OF VENICE”

1.1 What It Is All About

The present work deals with certain aspects of the evaluation of web search engines. This does not sound too exciting; but for some people,