Modelling Search and Session Effectiveness

Alfan Farizki Wicaksono

Supervisors:
Professor Alistair Moffat
Professor Justin Zobel

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

School of Computing and Information Systems
The University of Melbourne

October 2020

Copyright © 2020 Alfan Farizki Wicaksono

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Abstract

Search effectiveness metrics are used to quantify the quality of a ranked list of search results relative to a query. One line of argument suggests that incorporating user behaviour into the measurement of search effectiveness via a user model is useful, so that the metric scores reflect what the user has experienced during the search process. A wide range of metrics has been proposed, and many of these metrics correspond to user models. In reality, users often reformulate their queries during the course of a session, and hence it is desirable to involve both query- and session-level behaviours in the development of model-based metrics.

In this thesis, we use interaction data from commercial search engines and laboratory-based user studies to model query- and session-level search behaviours and user satisfaction; to inform the method for evaluation of search sessions; and to explore the interaction between user models, metric scores, and satisfaction.

We consider two goals in session evaluation. The first goal is to develop an effectiveness model for session evaluation; the second is to establish a fitted relationship between individual query scores and session-level satisfaction ratings. To achieve the first goal, we investigate factors that affect query- and session-level behaviours, and develop a new session-based user model that provides a closer fit to the observed behaviour than do previous models.
This model is then used to devise a new session-based metric, sINST. In regard to the second goal, we explore variables influencing session-level satisfaction, and suggest that the combination of query positional and quality factors provides a better correlation with session satisfaction than query position alone. Based on this observation, we propose a novel query-to-session aggregation function that is useful for scoring sessions when sequences of query reformulations are observed. We also propose a meta-evaluation framework that allows metric comparisons based on empirical evidence derived from search interaction logs, and investigate the connection between predicted behaviour and observed behaviour, and between metric scores and user satisfaction, at both the query and session levels.

Declaration

This is to certify that:

1. the thesis comprises only my original work towards the PhD,
2. due acknowledgement has been made in the text to all other material used, and
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Alfan Farizki Wicaksono, October 2020

Credits

The material in Chapter 3 is based on the following published papers:

• Alfan F. Wicaksono and Alistair Moffat. Empirical Evidence for Search Effectiveness Models. In Proc. CIKM, pages 1571–1574, 2018.
• Alfan F. Wicaksono and Alistair Moffat. Exploring Interaction Patterns in Job Search. In Proc. Aust. Doc. Comp. Symp., pages 1–8, 2018.
• Alfan F. Wicaksono. Measuring Job Search Effectiveness. In Proc. SIGIR, page 1453, 2019.
• Alfan F. Wicaksono, Alistair Moffat, and Justin Zobel. Modeling User Actions in Job Search. In Proc. ECIR, pages 652–664, 2019.

The material in Chapter 4 (Sections 4.4, 4.5, and 4.6) is currently under review. The material in Chapter 5 (except Sections 5.5.4 and 5.6.2) is based on the following published paper:

• Alfan F. Wicaksono and Alistair Moffat. Metrics, User Models, and Satisfaction. In Proc. WSDM, pages 654–662, 2020.

Acknowledgements

All praise is due to Allah (the most glorified, the most high), who blessed me with the ability to complete this thesis. I am grateful to my mother, my father, and my sisters, who always pray for me and provide support throughout my life; to my wife Nisa, who always loves me; and to my son Budi, who brings me joy.

I would like to thank my supervisors, Professor Alistair Moffat and Professor Justin Zobel, for their invaluable support and guidance throughout my doctoral study. I wish to be able to apply what I have learned from Alistair and Justin, and to become a good supervisor for my own future students. I would also like to thank Professor Trevor Cohn, who served as my committee chair.

I gratefully acknowledge the generosity of the creators of the datasets used in this thesis, and Dr. Paul Thomas (Microsoft) in particular for his assistance. This work was supported by the University of Melbourne, by the Australian Research Council, and by Seek.com. I attended ADCS 2018 in Dunedin using travel support from ADCS, and SIGIR 2019 in Paris with support from SIGIR and from University of Melbourne Google-funded travel grants.

Thanks are also due to Dr. Damiano Spina (RMIT University) and Dr. Bahar Salehi (The University of Melbourne), who supported my research; to IR researchers at SEEK, Dr. Sargol Sadeghi and Dr. Vincent Li, who provided access to the Seek.com datasets; and to my colleagues at the University of Melbourne, Alicia and Unni, who helped me in my study. Finally, I would also like to thank John Papandriopoulos, who provided a template for this thesis.

To those who sincerely search for the eternal truth throughout their lives.

Contents

1 Introduction
  1.1 Research Questions
  1.2 Contributions
  1.3 Thesis Structure
2 Background
  2.1 Information Retrieval Evaluation
    2.1.1 The Use of Ranking, Search Success, and Evaluation
    2.1.2 User-Based and Test Collection-Based Evaluation
    2.1.3 Search Task Classification
    2.1.4 Fundamental Effectiveness Metrics
    2.1.5 Relaxations of the Assumptions
    2.1.6 The Problem of Recall and The Virtue of Precision
  2.2 User Search Behaviour
    2.2.1 Interaction Log Study
    2.2.2 User Browsing Behaviour
    2.2.3 User Stopping Behaviour
  2.3 Metrics and User Models
    2.3.1 User Model
    2.3.2 C/W/L Framework
  2.4 Classification of User Models
    2.4.1 Static User Models
    2.4.2 Adaptive User Models
    2.4.3 Incorporating Costs into Metrics
  2.5 User Satisfaction
    2.5.1 The Concept of User Satisfaction for IR Evaluation
    2.5.2 User Feedback for Predicting Satisfaction
  2.6 Meta-Evaluation
    2.6.1 Meta-Evaluation Based on User Satisfaction
    2.6.2 Meta-Evaluation Based on User Performance
    2.6.3 Meta-Evaluation Based on User Preference
    2.6.4 Meta-Evaluation Based on User Model Accuracy
    2.6.5 Comparison-Based Meta-Evaluation
    2.6.6 Axiomatic-Based Meta-Evaluation
  2.7 Summary
3 Modelling User Actions
  3.1 Motivation and Research Question
  3.2 Action Sequences and Interaction Logs
    3.2.1 Action Sequences
    3.2.2 Interaction Logs
  3.3 Inferring Continuation Probability
    3.3.1 Computing Empirical Ĉ(i)
    3.3.2 Predicted C(i) Versus Empirical Ĉ(i)
  3.4 Exploring Interaction Patterns
    3.4.1 Impression and Clickthrough Orderings
    3.4.2 A Prelude to Clickthroughs
    3.4.3 Last and Deepest Clickthroughs
  3.5 Predicting Impression Distributions
    3.5.1 Can Clickthroughs Directly Substitute for Impressions?
    3.5.2 Impression Model
  3.6 Impression Model Evaluation
    3.6.1 Inferring C(i) from Impression Models
    3.6.2 Model Validation
  3.7 Summary
4 Modelling Search Sessions
  4.1 Motivation and Research Question
    4.1.1 Motivation
    4.1.2 Session Effectiveness Model
    4.1.3 Observational Goal
  4.2 Previous Work
    4.2.1 Session-Based Effectiveness Metrics
    4.2.2 Query-to-Session Aggregation Functions
  4.3 Interaction Logs
    4.3.1 Industrial-Based Datasets
    4.3.2 Laboratory-Based Datasets
    4.3.3 Organic SERPs
  4.4 A Session-Based C/W/L Framework
  4.5 Search Behaviours
    4.5.1 Query-Level Behaviours
    4.5.2 Session-Level Behaviours
  4.6 A Model-Based Session Metric
  4.7 Factors Affecting Session Satisfaction
  4.8 Modelling Session Satisfaction
    4.8.1 Query Aggregation Using Weighted Mean Method
    4.8.2 Memory-Based Query Aggregation
  4.9 Summary
5 Metrics, User Models, and Satisfaction
  5.1 Motivation and Research Question
  5.2 Previous Work
  5.3 Datasets
  5.4 Metric Scores and Satisfaction
    5.4.1 Query-Level Satisfaction
    5.4.2 Session-Level Satisfaction
  5.5 User Models and User Behaviour
    5.5.1 Measuring User Model Accuracy
    5.5.2 Measuring Accuracy Using View Distributions
    5.5.3 User Model Evaluation
    5.5.4 Empirical Evidence for Adaptive Models
  5.6 Model Accuracy and Satisfaction
    5.6.1 Tuning Parameters via Model Accuracy and Satisfaction
    5.6.2 Metrics Based on What Users Have Seen
  5.7 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

List of Figures

1.1 Comparison of two ranked lists of results generated from two different systems, Bing.com and Google.com, for the query "parenthood and phd" (searched on 2020-10-20).