Federated Search for Heterogeneous Environments

Jaime Arguello

CMU-LTI-11-008

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:

Jamie Callan (chair)
Jaime Carbonell
Yiming Yang
Fernando Diaz (Yahoo! Research)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© 2011, Jaime Arguello

For Sophia, with all my love.

Acknowledgments

This dissertation would not have been possible without the persistent guidance and encouragement of my mentors. I owe a big ‘thank you’ to my advisor, Jamie Callan. It was during the Fall of 2004, when I took two half-semester courses taught by Jamie (Digital Libraries and Text Data Mining), that I discovered the field of information retrieval. Seven years later, I am writing my dissertation acknowledgments. Jamie’s advice on both research and life in general has been a great source of insight and support. His feedback on research papers, presentations, lecture slides, proposals, and this dissertation has taught me a great deal about communicating effectively. From Jamie’s example, I have learned valuable lessons on research, teaching, and mentoring.

I would like to thank Jaime Carbonell, Yiming Yang, and Fernando Diaz for agreeing to be on my thesis committee. Their feedback was critical in making this dissertation stronger. Fernando Diaz helped shape many of the ideas presented in this dissertation. It was during an internship with Fernando at Yahoo! that I began working on vertical selection. I enjoyed it so much I returned for a second internship a year later. I have been fortunate to have had Fernando as a mentor and collaborator ever since.

I wish to thank the rest of the group at Yahoo! Labs Montreal for making my two internships so rewarding. I was lucky to have had Jean-François Crespo as a manager, who provided everything necessary to run experiments quickly. Jean-François Paiement was a helpful collaborator on the work on domain adaptation for vertical selection. Hughes Buchard, Alex Rochette, Remi Kwan, and Daniel Boies provided much needed feedback, help with tools, and a model work environment. Merci pour tout!

I would like to thank Carolyn Rosé for taking me in as a master's student and for guiding me through my first two years doing research. More generally, I would like to thank all the LTI faculty for putting together such an engaging curriculum and the LTI staff for keeping the place running and always being patient with my questions.

Grad school would not have been the same without my friendship with Jon Elsas. It is quite possible I ran every research idea through Jon. Those hundreds of lunch and coffee-break discussions are much appreciated. Thank you to all my other friends at CMU, who provided such valuable feedback on every practice talk I subjected them to. And, thank you to my officemates, Grace Hui-Yang and Anagha Kulkarni, for providing such a nice environment to work in. I would also like to thank all my friends outside of CMU for taking me away from work (admittedly, without much convincing) and for making those times so worthwhile. I am grateful to Tori Ames for her friendship and support during my last year as a PhD student.


Finally and most importantly, I would like to thank my family for their love and encouragement. Leaping into the unknown is easier when you have someone there to pick you up and cheer you on. None of this would have been possible without my mother, who taught me the value of loving what you do.

Abstract

In information retrieval, federated search is the problem of automatically searching across multiple distributed collections or resources. It is typically decomposed into two subsequent steps: deciding which resources to search (resource selection) and deciding how to combine results from multiple resources into a single presentation (results merging). Federated search occurs in different environments. This dissertation focuses on an environment that has not been deeply investigated in prior work.

The growing heterogeneity of digital media and the broad range of user information needs that occur in today’s world have given rise to a multitude of systems that specialize in a specific type of search task. Examples include search for news, images, video, local businesses, items for sale, and even social-media interactions. In the Web search domain, these specialized systems are called verticals, and one important task in Web search is the prediction and integration of relevant vertical content into the Web search results. This is known as aggregated web search and is the main focus of this dissertation. Providing a single point of access to all these diverse systems requires federated search solutions that can support result-type and retrieval-algorithm heterogeneity. This type of heterogeneity violates major assumptions made by state-of-the-art resource selection and results merging methods.

While existing resource selection methods derive predictive evidence exclusively from sampled resource content, the approaches proposed in this dissertation draw on machine learning as a means to easily integrate various types of evidence. These include, for example, evidence derived from (sampled) vertical content, vertical query-traffic, click-through information, and properties of the query string. In order to operate in a heterogeneous environment, we focus on methods that can learn a vertical-specific relationship between features and relevance. We also present methods that reduce the need for human-produced training data.

Existing results merging methods formulate the task as score normalization. In a more heterogeneous environment, however, combining results into a single presentation requires satisfying a number of layout constraints. The dissertation proposes a novel formulation of the task: block ranking. During block ranking, the objective is to rank sequences of results that must appear grouped together (vertically or horizontally) in the final presentation. Based on this formulation, the dissertation proposes and empirically validates a cost-effective methodology for evaluating aggregated web search results. Finally, it proposes the use of machine learning methods for the task of block ranking.

Contents

1 Introduction 1
1.1 Unique Properties of Aggregated Web Search ...... 3
1.2 The Goal and Contribution of this Dissertation ...... 6
1.2.1 Summary of Contributions ...... 10
1.3 Impact on Other Applications ...... 11
1.4 Dissertation Outline ...... 12

2 Prior Research in Federated Search 13
2.1 Cooperative vs. Uncooperative Federated Search ...... 13
2.2 Resource Representation ...... 14
2.2.1 How much to sample ...... 14
2.2.2 What to sample ...... 15
2.2.3 When to re-sample ...... 16
2.2.4 Combining collection representations ...... 16
2.3 Resource Selection ...... 16
2.3.1 Problem Formulation and Evaluation ...... 16
2.3.2 Modeling Query-Collection Similarity ...... 17
2.3.3 Modeling the Relevant Document Distribution ...... 19
2.3.4 Modeling the Document Score Distribution ...... 20
2.3.5 Combining Collection Representations ...... 22
2.3.6 Modeling Search Engine Effectiveness ...... 23
2.3.7 Query Classification ...... 25
2.3.8 Machine Learned Resource Selection ...... 26
2.4 Results Merging ...... 27
2.5 Summary ...... 28

3 Vertical Selection 30
3.1 Formal Task Definition ...... 32
3.2 Classification Framework ...... 32
3.3 Features ...... 32
3.3.1 Corpus Features ...... 33
3.3.2 Query-Log Features ...... 36
3.3.3 Query Features ...... 38
3.3.4 Summary of Features ...... 39
3.4 Methods and Materials ...... 39


3.4.1 Verticals ...... 40
3.4.2 Queries ...... 40
3.4.3 Evaluation Metric ...... 41
3.4.4 Implementation Details ...... 42
3.4.5 Single-evidence Baselines ...... 42
3.5 Experimental Results ...... 42
3.6 Discussion ...... 44
3.6.1 Feature Ablation Study ...... 44
3.6.2 Per Vertical Performance ...... 45
3.7 Related Work in Vertical Selection ...... 46
3.8 Summary ...... 48

4 Domain Adaptation for Vertical Selection 50
4.1 Formal Task Definition ...... 51
4.2 Related Work ...... 51
4.3 Vertical Adaptation Approaches ...... 54
4.3.1 Gradient Boosted Decision Trees ...... 54
4.3.2 Learning a Portable Model ...... 55
4.3.3 Adapting a Model to the Target Vertical ...... 58
4.4 Features ...... 59
4.4.1 Query Features ...... 59
4.4.2 Query-Vertical Features ...... 59
4.5 Methods and Materials ...... 60
4.5.1 Verticals ...... 60
4.5.2 Queries ...... 61
4.5.3 Evaluation Metrics ...... 61
4.5.4 Unsupervised Baselines ...... 62
4.6 Experimental Results ...... 63
4.7 Discussion ...... 63
4.8 Summary ...... 68

5 Vertical Results Presentation: Evaluation Methodology 69
5.1 Related Work in Aggregated Web Search Evaluation ...... 71
5.1.1 Understanding User Behavior ...... 71
5.1.2 Aggregated Web Search Evaluation ...... 72
5.2 Layout Assumptions ...... 74
5.3 Task Formulation: Block Ranking ...... 75
5.4 Metric-Based Evaluation Methodology ...... 75
5.4.1 Collecting Block-Pair Preference Judgements ...... 76
5.4.2 Deriving the Reference Presentation ...... 77
5.4.3 Measuring Distance from the Reference ...... 79
5.5 Methods and Materials ...... 80
5.5.1 Block-Pair Preference Assessment ...... 81

5.5.2 Verticals ...... 81
5.5.3 Queries ...... 82
5.5.4 Empirical Metric Validation ...... 83
5.6 Experimental Results ...... 83
5.6.1 Assessor Agreement on Block-Pair Judgements ...... 84
5.6.2 Assessor Agreement on Presentation-Pair Judgements ...... 85
5.6.3 Metric Validation Results ...... 86
5.7 Discussion ...... 88
5.8 Summary ...... 89

6 Vertical Results Presentation: Approaches 92
6.1 Formal Task Definition ...... 93
6.2 Related Work ...... 94
6.3 Features ...... 95
6.3.1 Pre-retrieval Features ...... 95
6.3.2 Post-retrieval Features ...... 97
6.3.3 Summary of Features ...... 100
6.4 Block-Ranking Approaches ...... 101
6.4.1 Classification Approach ...... 101
6.4.2 Voting Approach ...... 103
6.4.3 Learning to Rank Approaches ...... 104
6.5 Methods and Materials ...... 106
6.5.1 Verticals and Queries ...... 107
6.5.2 Block-Pair Preference Assessment ...... 108
6.5.3 Modeling Bias Towards Vertical Results ...... 110
6.5.4 Evaluation Metric ...... 110
6.5.5 Implementation Details ...... 112
6.6 Experimental Results ...... 113
6.7 Discussion ...... 115
6.7.1 Effect of Parameter α on LTR variants ...... 116
6.7.2 Feature Contribution to Overall Performance ...... 117
6.7.3 Feature Contribution to Per-Vertical Performance ...... 118
6.7.4 LTR Model with Cross-Product Features ...... 121
6.7.5 Binning Analysis ...... 122
6.8 Summary ...... 123

7 Resource Selection in a Homogeneous Environment 126
7.1 Formal Task Definition ...... 127
7.2 Classification-based Approach ...... 128
7.3 Features ...... 128
7.3.1 Corpus Features ...... 128
7.3.2 Query Category Features ...... 130
7.3.3 Click-through Features ...... 130

7.3.4 Summary of Features ...... 131
7.4 Methods and Materials ...... 131
7.4.1 Evaluation Testbeds ...... 131
7.4.2 Queries ...... 132
7.4.3 Evaluation Metrics ...... 132
7.4.4 Implementation Details ...... 133
7.4.5 Single-Evidence Baselines ...... 134
7.5 Experimental Results ...... 134
7.6 Discussion ...... 138
7.6.1 Collection Representation Quality ...... 138
7.6.2 Feature Ablation Studies ...... 139
7.7 Summary ...... 140

8 Conclusions 142
8.1 Summary ...... 142
8.1.1 Resource Selection ...... 142
8.1.2 Results Presentation ...... 144
8.2 Thesis Contributions ...... 146
8.3 New Directions ...... 148
8.3.1 Aggregated Search with Dynamic Content and User Interests ...... 148
8.3.2 Improving the Coherence between Cross-Vertical Results ...... 149
8.3.3 Aggregated Mobile Search ...... 150
8.3.4 Search Tool Recommendation ...... 151

A Vertical Blocks 164

B User Study Queries and Descriptions 167

List of Figures

1.1 Given the query “first man in space”, an aggregated web search system determines that the video, news, and images verticals are relevant and incorporates their top results into the Web results as shown ...... 2

4.1 Feature portability values (πhmap) for all features across sets of source verticals. Features ordered randomly along x-axis. The features that were consistently the most portable were query-vertical features. As shown, many of these correspond to existing unsupervised resource selection methods used in prior work ...... 66

5.1 An example presentation of vertical results for “pittsburgh”, where the maps vertical is presented above, the news vertical within, and the images vertical below the top Web results. Consistent with the major commercial search engines (i.e., Bing, Google, and Yahoo!), we assume that same-vertical results must appear grouped together—horizontally (e.g., images) or vertically (e.g., news)—within a block in the output presentation ...... 70
5.2 The vertical results presentation task can be formulated as block ranking ...... 76
5.3 Approach Overview ...... 77
5.4 Block-pair assessment interface. Two blocks are displayed randomly side-by-side (in this case a books block and w3) and the assessor is asked which would best satisfy a user for the given query and description: the left block, the right block, or neither ...... 78
5.5 Example vertical blocks ...... 82
5.6 Presentation-pair assessment interface. Two presentations are displayed randomly side-by-side and the assessor is asked which they prefer: the left presentation, the right presentation, or neither ...... 84
5.7 Context-aware block-pair assessment interface ...... 88
5.8 Example of a multi-dimensional vertical display. Given the query “michelle obama”, Yahoo! presents news, images, video, and twitter within a horizontal tabbed display. Other verticals can appear embedded within the Web results below ...... 90


6.1 Vertical slotting positions. The classification approach predicts an output presentation by assigning verticals to slots s1–s4 using threshold parameters τ1–τ4. Ties (i.e., verticals assigned to the same slot) are broken using the prediction probability ...... 103
6.2 Given the query “pittsburgh”, a search engine recommends a set of related queries (red, dashed box). The related query “pittsburgh photos” (blue, solid box) suggests that users who issue the query “pittsburgh” often want to see photos (i.e., it suggests that the images vertical is relevant) ...... 108

7.1 Collection size distribution of our three experimental testbeds ...... 132
7.2 Number of instances in which a collection of a given size (bin) contributes at least 10 relevant documents to a test query ...... 139

8.1 A commercial search engine presents news and images results in response to the query “joplin” ...... 150

A.1 Example vertical blocks ...... 164
A.2 Example vertical blocks ...... 165
A.3 Example vertical blocks ...... 166

List of Tables

3.1 Vertical descriptions ...... 40
3.2 Percentage of queries for which the vertical (or no vertical) was labeled relevant. Percentages do not sum to one because some queries were assigned multiple relevant verticals ...... 41
3.3 Single Vertical Precision (P). Approaches are listed in ascending order of P. A significant improvement over all worse-performing approaches is indicated with a ! at the p < 0.05 level and a " at the p < 0.005 level ...... 43
3.4 Feature type ablation study. A # denotes a significant improvement (p < 0.005) over our classifier using all features ...... 44
3.5 Per vertical precision (P). Verticals without a query-log are marked with !. Verticals without a Wikipedia-sampled surrogate corpus are marked with ∗ ...... 46
4.1 Vertical descriptions ...... 61
4.2 Target-trained (“cheating”) results: AP and NDCG ...... 62
4.3 Portability results: normalized AP and NDCG. A "(#) denotes significantly better (worse) performance compared to Soft.ReDDE. A !($) denotes significantly better (worse) performance compared to the unbalanced basic model with all features. Significance was tested at the p < 0.05 level ...... 64
4.4 Trada results: normalized AP and NDCG. A "(#) denotes significantly better (worse) performance compared to Soft.ReDDE. A !($) denotes significantly better (worse) performance compared to the base model. Significance was tested at the p < 0.05 level ...... 65
4.5 Target-trained results. Given access to target-vertical training data, a model that uses all features (portable + non-portable) performs significantly better than one that is given access to only the non-portable ones. A # denotes a significant degradation in performance at the p < 0.05 level ...... 67

5.1 The Fleiss’ Kappa agreement on presentation-pairs ...... 86
5.2 Metric agreement with majority preference. Significance is denoted by # and " at the p < 0.05 and p < 0.005 level, respectively ...... 87

6.1 Vertical-to-URL mapping. URLs correspond to either the actual vertical (e.g., the shopping vertical was constructed using the eBay search API) or content related to the vertical (e.g., http://www.cooks.com contains recipes and is thus related to the recipe vertical). A * denotes a “wildcard” operation. 98


6.2 Vertical-to-keyword mapping ...... 99
6.3 Block-types along with their textual representations ...... 100
6.4 Summary of Features. Pre-retrieval features are independent of the block under consideration. Thus, every block-type is associated with the same set of 80 pre-retrieval features. Post-retrieval features, on the other hand, are derived from the Web or vertical search engine or directly from those results presented in the block. Therefore, different types of blocks are associated with a different set of post-retrieval features. Some features are common to multiple block-types, and others are unique to a particular type. Text-similarity features are marked in gray ...... 102
6.5 Number of queries (out of 1070) in which each block-type was presented within the top 3 ranks of σ*_q as a function of pseudo-votes ...... 111
6.6 Block-ranking results in terms of average K*(σ*_q, σ_q). Statistical significance was tested using a one-tailed paired t-test, comparing performance across pairs of queries. A !($) denotes significantly better (worse) performance compared to the classification approach, a "(#) denotes significantly better (worse) performance compared to the voting (all pairs) approach, and a ∧(∨) denotes significantly better (worse) performance compared to the LTR-G approach ...... 114

6.7 Block-ranking results in terms of K(σ*_q, σ_q). A "(#) denotes a significant improvement (drop) in performance for an LTR variant with α tuned compared to its counterpart for which α is forced to zero. A !($) denotes significantly better (worse) performance compared to the classification approach ...... 116
6.8 Feature ablation results for the classification approach and the LTR-S approach. A "(#) denotes a significant improvement (drop) in performance compared to the model with access to all features ...... 117
6.9 Feature contribution to per-vertical performance, based on AP. Statistical significance is tested using a one-tailed paired t-test, comparing across cross-validation test-folds. A "(#) denotes a significant improvement (drop) in performance compared to the model with access to all features ...... 119

6.10 Block-ranking performance in terms of average K*(σ*_q, σ_q). Statistical significance was tested using a one-tailed paired t-test, comparing performance across pairs of queries. A !($) denotes significantly better (worse) performance for LTR-C vs. LTR-S. A "(#) denotes significantly better (worse) performance for LTR-C vs. LTR-G ...... 122
6.11 Binning evaluation results ...... 124

7.1 Results for experimental condition gov2.1000.1000 ...... 135
7.2 Results for experimental condition gov2.250.1000 ...... 136
7.3 Results for experimental condition gov2.30.1000 ...... 136
7.4 Results for experimental condition gov2.1000.300 ...... 137
7.5 Results for experimental condition gov2.250.300 ...... 137

7.6 Results for experimental condition gov2.30.300 ...... 138
7.7 Feature type ablation study. A significant drop in performance, using a paired t-test on queries, is denoted with a # at the p < 0.05 level and a " at the p < 0.005 level ...... 140

Chapter 1

Introduction

In the broadest sense, the goal of information retrieval (IR) is to rank items by their relevance to a user’s information need, expressed using a query. The earliest advancements in the field were in the area of ad-hoc retrieval. Ad-hoc retrieval assumes that the retrievable items consist of textual documents contained within a single collection and assumes relevance to be topical relevance, which can typically be modeled using text-similarity. In more recent years, however, the scope of the field has broadened to address more specific and nuanced information-seeking tasks. On the Web, these include, for example, search for news, images, videos, local businesses, blogs, weather information, on-line discussions, items for sale, digitized books, scientific publications, movie times, and, more recently, social-media interactions like Twitter or Facebook status updates.

One common finding in empirical IR research is that different information-seeking tasks require different solutions. In other words, a single monolithic search engine cannot effectively support the wide range of search tasks afforded by today’s search systems. Different search tasks are associated with different types of media (e.g., images, video, news, blog feeds), which may require different representations to facilitate search. Different search tasks are associated with different definitions of relevance and may therefore require different retrieval algorithms. For example, recency is important for news search [28], geographical proximity is important for local business search [1], and authority is important in micro-blog search [109]. Finally, different search objectives may benefit from different ways of presenting results. For example, news results may need to present their publication date, local results may need to be displayed geographically in a map, and shopping results may need to display their selling price. As a result, search systems today are more specialized and diverse than ever before. This gives rise to a new challenge: how do we provide users with integrated access to all these diverse search services within a single search interface?

In the Web search domain, specialized search services are referred to as vertical search services or verticals. Verticals are typically developed and maintained by the same search company, but address a specific information-seeking task. Verticals common to the major commercial search providers (i.e., Bing1, Google2, Yahoo!3) include, for example, news, images, video, local, blogs, finance, shopping, weather, movies, and sports. There are currently two ways that users can access vertical content. If the vertical has direct search capabilities and the user is aware of them, the query can be issued directly to the vertical.

1 http://www.bing.com
2 http://www.google.com
3 http://www.yahoo.com

In other cases, however, the user may not know that a vertical is relevant or may want results from multiple verticals at once. For these reasons, one important step for commercial search providers has become the prediction and integration of relevant vertical content into the Web search results, as shown in Figure 1.1. In the research community, this is referred to as aggregated web search [74]. The goal of aggregated web search is to provide integrated access to multiple search services (including Web search) within a single search interface.


Figure 1.1: Given the query “first man in space”, an aggregated web search system determines that the video, news, and images verticals are relevant and incorporates their top results into the Web results as shown.

In information retrieval, federated search (also known as distributed information retrieval) is the task of automatically searching across multiple distributed collections. It has been an active area of IR research for over 20 years. In the research literature, federated search is often formulated as three separate, but interrelated subtasks: (1) gathering information

about each collection’s contents (resource representation), (2) deciding which collections to search given a query (resource selection), and (3) merging results from the selected collections into a single ranking (results merging) [15]. Resource representation happens off-line and its main objective is to inform resource selection (i.e., to help predict which resources have relevant content without issuing the query to the resource). Resource selection and results merging happen every time a query is issued to the federated search system.

Aggregated web search can be viewed as a federated search problem, with similar subtasks. It is often impractical, if not impossible, to issue the query to every available vertical search service. Therefore, the first sub-task, vertical selection, is to predict which few verticals (if any) are relevant to the query. Vertical selection is analogous to resource selection. Once a set of verticals has been selected, the second sub-task, vertical results presentation, is to predict where in the Web results to present the vertical results. Vertical results presentation is analogous to results merging. As we discuss in more detail later, there are some important differences. For example, in aggregated web search, results from the same vertical must be presented together in the final results (e.g., video, news, and images in Figure 1.1). However, similar to results merging, the objective in vertical results presentation is to combine results from potentially multiple search services into a single presentation of results.

The main focus of the dissertation is on new methods for vertical selection and vertical results presentation. While aggregated web search can be viewed as a type of federated search task, it is associated with unique challenges and opportunities that have not been deeply investigated in prior federated search research. To motivate our research agenda, we describe these at a high level in the next section.
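To make this two-stage decomposition concrete, here is a minimal, illustrative sketch of an aggregated search pipeline; the function and variable names are hypothetical and do not correspond to any particular system described later in the dissertation.

```python
from typing import Callable, Dict, List

Result = Dict[str, str]                   # a single retrieved result (toy representation)
Vertical = Callable[[str], List[Result]]  # a vertical's search interface


def aggregate(query: str,
              verticals: Dict[str, Vertical],
              select: Callable[[str, List[str]], List[str]],
              present: Callable[[str, Dict[str, List[Result]]], List[str]]) -> List[str]:
    """Two-stage aggregation: vertical selection, then vertical results presentation."""
    # Stage 1 (vertical selection): predict which few verticals, if any, to query.
    selected = select(query, list(verticals))
    # Query only the selected verticals (Web search would be queried as well, not shown).
    blocks = {name: verticals[name](query) for name in selected}
    # Stage 2 (vertical results presentation): decide where to slot each vertical's block.
    return present(query, blocks)
```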

1.1 Unique Properties of Aggregated Web Search

A vertical search service is designed to satisfy a specific type of information need (e.g., image search, product search, job search). For this reason, different verticals often retrieve different types of results (e.g., images, product descriptions, job listings) and use significantly different retrieval algorithms. We refer to these two properties of aggregated web search as result-type and retrieval-algorithm heterogeneity. Prior federated search research investigated environments where collections focus on different topics [50, 94, 113], have a skewed size distribution [90, 94, 95, 96, 104], have a skewed relevant document distribution [90, 94, 95, 96, 104], and vary in terms of search engine effectiveness [96]. However, result-type and retrieval-algorithm heterogeneity have been ignored. Assumptions about document type and retrieval algorithm heterogeneity affect resource selection and results merging.

Prior federated search research often distinguishes between a cooperative and an uncooperative environment. In an uncooperative environment, resources are assumed to provide the system no more than the same functionality they provide their human users: a search interface. Partly for this reason, state-of-the-art resource selection methods,

designed for an uncooperative environment, derive evidence exclusively from sampled collection documents, obtained (off-line) by issuing queries to the resource and downloading results. In contrast, aggregated web search occurs in at least a semi-cooperative environment. We assume that the system and its target verticals are operated by the same search company or its business partners. In this environment, the system may have access to sources of evidence beyond collection documents. For example, for verticals with direct search capabilities, alternative sources of evidence include vertical-specific query-traffic data, click-through data, and query-reformulation data. This type of evidence conveys how users interact directly with the vertical search engine and may be helpful in predicting vertical relevance. A vertical selection system should be capable of incorporating these various sources of evidence into selection decisions.

In addition to deriving evidence exclusively from (sampled) resource content, existing methods apply the same scoring function to every resource in order to decide which to select. In other words, existing methods assume that predictive evidence is similarly predictive across resources. The ReDDE algorithm [94], for example, uses the predicted relevance of samples from different resources in order to estimate the relevant document distribution across resources. Implicitly, ReDDE assumes a consistent relationship between the number of relevant documents in the resource and its relevance: if collection A has twice as many relevant documents as B, then A is twice as relevant as B. This is not necessarily true in aggregated web search. A weather vertical that retrieves a single relevant result may be more relevant than a news vertical that retrieves ten relevant results. This suggests that aggregated web search requires methods that can learn a vertical-specific relationship between predictive evidence and vertical relevance, even if we focus exclusively on content-based evidence.

Some verticals focus on specific topics (e.g., health, auto, travel) or types of media (e.g., images, video). For this reason, it is sometimes possible to detect vertical intent from the query string alone, for example, based on the query topic (e.g., “swine flu” → health), based on named entity types appearing in the query (e.g., “bmw reviews” → auto), or based on query keywords (e.g., “obama inauguration pics” → images). This type of evidence is not inherently associated with a vertical. We refer to this type of evidence as resource-agnostic evidence. Useful correlations between resource-agnostic evidence and resource relevance must be learned using some form of supervision (e.g., by learning a model using training data). Existing resource selection methods do not provide mechanisms for easily learning predictive relationships between resource-agnostic evidence and resource relevance. This point also resonates with the previous one. We may benefit from methods that are capable of learning a vertical-specific predictive relationship between agnostic evidence and vertical relevance. We can imagine, for example, that different verticals will focus on different topics and be associated with different query keywords.
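To make the uniform-scoring assumption discussed above concrete, the following is a simplified, ReDDE-style scoring rule. It is a sketch for illustration only; the published ReDDE algorithm [94] differs in details (e.g., how collection sizes are estimated and how many top-ranked samples are considered).

```python
from typing import Dict, List, Tuple


def redde_style_scores(top_sampled_docs: List[Tuple[str, str]],
                       collection_size: Dict[str, int],
                       sample_size: Dict[str, int]) -> Dict[str, float]:
    """Score collections by counting their sampled documents near the top of a
    retrieval from the centralized sample index, scaled by |C| / |S|.

    The same linear rule is applied to every collection, so a collection with
    twice as many estimated relevant documents always scores twice as high;
    this is exactly the assumption questioned in the text above."""
    scores = {c: 0.0 for c in collection_size}
    for _doc_id, collection in top_sampled_docs:
        # each top-ranked sample stands in for roughly |C| / |S| unseen documents
        scores[collection] += collection_size[collection] / sample_size[collection]
    return scores


# Toy usage: (doc_id, source_collection) pairs from the top of a sample-index retrieval.
top_docs = [("d1", "A"), ("d2", "A"), ("d3", "B")]
print(redde_style_scores(top_docs, {"A": 10000, "B": 500}, {"A": 300, "B": 300}))
# {'A': 66.66..., 'B': 1.66...}
```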
In some federated search environments, resource selection can be formulated as resource ranking. During resource ranking, the objective is not to make a binary selection decision for every available resource, but rather to prioritize resources for selection. Resource ranking is an appropriate formulation of the task when two assumptions hold true: when at least one resource must be selected in order to retrieve documents for the user and when an effective merged retrieval can be produced by always selecting some fixed number k < n of resources from the top ranks, where n is the number of available resources. In aggregated web search, these assumptions do not hold true. Given n candidate verticals, it is possible for the user to not want to see any vertical results (i.e., they may just want to see Web results). This may happen, for example, when the query is a navigational query. Therefore, in aggregated web search, the task cannot be formulated as vertical ranking. It may be that the best choice for the system is to entirely suppress even the most confident vertical.

The aggregated web search environment is highly dynamic. Changes to the environment can originate from a change in the set of candidate verticals, a change in the contents of a particular vertical, and a change in the interests of the user population. A news vertical is an example of a dynamic vertical search service. Solutions that assume a static environment may not be appropriate for a dynamic environment. A federated search solution well-suited for a dynamic environment should be robust to changes in the environment and, in cases when necessary, it should be possible to learn a new model without extensive human effort.

So far, we have focused on vertical selection. Aggregated web search is also unique in terms of merging, the task of combining results from those resources (or verticals) selected into a single presentation of results. In some federated search environments, results merging is free from layout constraints. That is, documents from different resources can be interleaved freely in the merged ranking. In the absence of constraints, results merging is often formulated as score normalization—converting scores from different resources so that they are directly comparable and can be used to infer a merged ranking. In aggregated web search, however, results merging must satisfy a number of layout constraints. For example, results from the same vertical must appear grouped together (vertically or horizontally) within the Web search results page (see Figure 1.1). This constraint is partly due to the fact that results from different verticals are presented differently. For example, news results display their publication date, images and video results are displayed as thumbnail images, and local results are displayed geographically in a map. Given such layout constraints, the merging task cannot be formulated solely as score normalization. Vertical results presentation requires new solutions.

To summarize, aggregated web search exemplifies a federated search environment that has not been deeply investigated in prior work. In terms of vertical selection, the task of predicting which verticals are relevant, aggregated web search requires approaches that:

• do not assume result-type and retrieval-algorithm homogeneity;
• can integrate various types of evidence;
• can learn a vertical-specific relationship between predictive evidence and vertical relevance;
• can predict when no vertical is relevant; and
• can be (re-)trained without extensive human effort.

In terms of vertical results presentation, the task of predicting where to present the vertical results, aggregated web search requires approaches that can aggregate cross-vertical content while satisfying a number of layout constraints.
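To make the vertical selection requirements above concrete, the sketch below gives each candidate vertical its own scoring function and its own decision threshold, so that the relationship between evidence and relevance is vertical-specific and the empty selection is a legal outcome. The toy scoring functions and feature names are hypothetical; they merely stand in for the learned classifiers developed in Chapter 3.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


def select_verticals(query_features: Dict[str, Features],
                     models: Dict[str, Callable[[Features], float]],
                     thresholds: Dict[str, float]) -> List[str]:
    """Return the verticals whose own model clears its own threshold.

    Each vertical sees its own feature vector (content, query-log, click-through,
    and query-string evidence), its own scoring function, and its own threshold,
    so predicting no vertical at all (an empty list) is possible."""
    return [v for v, score_fn in models.items()
            if score_fn(query_features[v]) >= thresholds[v]]


# Toy example: hand-set "models" standing in for learned per-vertical classifiers.
models = {
    "news":    lambda f: 0.7 * f["query_log_match"] + 0.3 * f["sample_retrieval_score"],
    "weather": lambda f: 1.0 * f["has_weather_keyword"],
}
thresholds = {"news": 0.5, "weather": 0.5}
features = {
    "news":    {"query_log_match": 0.2, "sample_retrieval_score": 0.4},
    "weather": {"has_weather_keyword": 1.0},
}
print(select_verticals(features, models, thresholds))  # ['weather']
```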

1.2 The Goal and Contribution of this Dissertation

Federated search has been an active area of information retrieval research for close to two decades. Most of this work, however, was conducted under the assumption that resources return similar types of (text-rich) documents (result-type homogeneity) and use similar retrieval algorithms (retrieval-algorithm homogeneity). These assumptions are deeply embedded in state-of-the-art methods for resource selection and results merging.

Resource Selection In terms of resource selection, existing approaches can be divided into two general classes.4 So-called large document models treat each resource as a single, monolithic “large document” and select resources based on a retrieval from a large document index [14, 98, 113]. These methods adapt well-studied text-based document ranking functions to ranking resources. On the other hand, so-called small document models predict resource relevance as a function of sample relevance [90, 94, 95, 96, 104]. In the most general sense, these small document methods proceed as follows. First, samples from every resource are combined within a centralized sample index. Then, given a query, a retrieval from this index is used to produce a scoring of samples. Finally, each collection is scored based on some function of the predicted relevance of its samples. Small document models differ in how they go from a scoring of samples to a scoring of resources. However, at a high level, they all follow this sequence of steps.

From this high-level description, we can see that both large and small document models assume result-type homogeneity. That is, they assume that content from different resources can be combined in a single index—either a large document index or a centralized sample index. Furthermore, both large and small document models assume retrieval-algorithm homogeneity. That is, they assume that a single (text-based) retrieval algorithm (internal to the federated search system) can be used to predict relevance across resources.

Neither of these assumptions is true in an aggregated web search environment. While some vertical search services are associated with text-rich documents (e.g., blogs, news), others are associated with text-impoverished documents (e.g., images, video), and others with no documents at all (e.g., calculator, language translation). Thus, it may not be possible to combine cross-vertical samples within a single index. Furthermore, even if cross-resource content could be combined within a single index, because they address

4These methods are reviewed in more detail in Section 2.3.

different types of information needs, different verticals predict relevance differently. For example, as previously mentioned, recency is important when searching for news [28], geographical proximity is important when searching for local businesses [1], and the author’s connectivity in the network is important when searching for Twitter updates [109]. These various definitions of relevance cannot be modeled by a single retrieval algorithm.

Additionally, content-based methods have another potential limitation: they do not provide ways to easily incorporate other types of evidence. Partly because the environment is not completely uncooperative, in aggregated web search we have access to multiple sources of non-content-based evidence that might be useful for selection. This includes information derived from vertical query-traffic, from historical click-through data, and from properties of the query string (e.g., whether the query is about a particular topic or contains a particular keyword).

To address these limitations, this dissertation takes a new approach to resource selection. We cast resource selection as a supervised machine-learned classification problem. By design, our approach differs from existing resource selection methods in three respects. First, it incorporates multiple sources of evidence as input features. Second, it uses training data (e.g., a set of queries with vertical-relevance judgements) to learn a predictive relationship between features and vertical relevance. Third, we learn a different model for each vertical. Therefore, we allow the system the freedom to learn a vertical-specific relationship between features and vertical relevance. We demonstrate in Chapter 3 that this approach outperforms existing methods and that feature-integration is a major contributor to its success. No single type of feature is solely responsible for its performance.

While a supervised machine learning approach to resource selection has many advantages, one limitation is that it requires training data, for example, in the form of a set of queries with relevance judgements for a set of verticals. In practice, the aggregated web search environment is dynamic. A new vertical can be introduced to the set of candidate verticals, content within a vertical can change, and the interests of the user population can shift. Prior work showed that even a change in resource content degrades performance for an already-tuned resource selector [93]. Human annotation is costly in terms of time and resources. An annotation effort may make sense as a one-time investment. However, it may not be feasible to annotate a new set of queries every time the environment undergoes a significant change. Therefore, in Chapter 4, we propose models that attempt to make maximal use of already-available training data. In particular, we address the scenario where we have a set of existing verticals (associated with training data) and we are given a new vertical (associated with no training data). We call the existing verticals the source verticals and the new vertical the target vertical. Our goal is to learn a predictive model for the target vertical using only source-vertical training data. In machine learning, the field of domain adaptation is concerned with algorithms that generalize from one domain to another using little or no training data in the new domain. This dissertation presents the first exploration of domain adaptation applied to resource selection.
We investigate vertical selection models that exhibit two properties:

portability—the ability of a model to be trained once and then applied effectively to any arbitrary vertical, including one with no training data—and adaptability—the ability of a model to adjust its parameters to a specific vertical. We present portable models that can be trained on one set of verticals and applied to another. Also, we present an approach that adapts a portable model to a specific target vertical using no target-vertical training data. We show that features can be divided into two classes: portable features, those that are correlated with relevance (in the same direction) across verticals, and non-portable features, those that are correlated with relevance for a specific vertical, but not across verticals. We show that learning an effective portable model (one that makes predictions with respect to any vertical) requires focusing on the most portable features. We present a way of identifying these portable features automatically. Furthermore, we show that, given target-vertical training data, the most predictive features are the non-portable ones. In other words, the features that help the most are precisely those that do not generalize across verticals. We demonstrate, however, that non-portable features can be harnessed using target-vertical predictions from a portable model.

As previously mentioned, most federated search research was conducted under the assumption that resources contain similar types of documents and use similar algorithms to retrieve content. To test the generality of a machine-learning approach to resource selection, in Chapter 7 we focus precisely on this type of federated search environment. While existing resource selection methods are expected to perform well in this environment, we demonstrate that a machine learning approach is more robust. It performs either at the same level or significantly better across a number of experimental conditions.

One of the research questions investigated in Chapter 7 is the effect of resource representation quality on resource selection performance across methods. Existing content-based methods make use of resource samples to predict resource relevance. Therefore, we might expect these methods to perform comparatively well when given access to high-quality representations. That is, when sample sets are large (relative to the resource) for those resources with many relevant documents. We investigate the importance of evidence-integration across various experimental conditions. For example, when given access to high-quality representations, does a machine learning approach learn to focus on content-based evidence? Likewise, when given access to impoverished representations, does it learn to focus on other sources of evidence (e.g., evidence derived from historical click-through data)?

Finally, we investigate whether a machine learning approach can be trained without human judgements of any kind. Our zero judgement training approach automatically generates training data using retrievals that merge content from all available resources. Our basic assumption is that, on average, a retrieval that merges content from every resource will be superior to one that merges content from only a few. Thus, we train a machine learning model to select those few resources whose merged ranking most closely approximates one that merges content from all.
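The zero-judgement idea can be sketched as follows. This is a deliberate simplification under an assumed overlap-based notion of "closeness"; the actual training-data generation procedure is described in Chapter 7, and the function and parameter names here are hypothetical.

```python
from typing import Dict, List, Set


def zero_judgement_labels(full_merge_topk: List[str],
                          docs_by_resource: Dict[str, Set[str]],
                          num_positive: int = 3) -> Dict[str, int]:
    """Label as positive the resources that contribute the most documents to the
    top of a ranking that merges content from *all* resources.

    The full merge acts as a silver standard: selecting only the resources
    labeled 1 should approximate it most closely, so no human relevance
    judgements are needed."""
    contribution = {r: sum(1 for d in full_merge_topk if d in docs)
                    for r, docs in docs_by_resource.items()}
    ranked = sorted(contribution, key=contribution.get, reverse=True)
    return {r: int(r in ranked[:num_positive]) for r in docs_by_resource}


# Toy usage: documents are identified by string ids.
full_merge = ["d1", "d2", "d3", "d4", "d5"]
resources = {"A": {"d1", "d2", "d9"}, "B": {"d3"}, "C": {"d8"}}
print(zero_judgement_labels(full_merge, resources, num_positive=1))
# {'A': 1, 'B': 0, 'C': 0}
```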

Results Merging In terms of results merging, existing methods were developed under the assumption that an appropriate presentation of results is an unconstrained interleaving of results from different resources. For this reason, existing methods cast the task as score normalization—normalizing retrieval scores from different resources so that they are directly comparable and can be used to produce a single ranking. In aggregated web search, results merging or vertical results presentation is the task of deciding where in the Web results to present the vertical results, and it is subject to a number of layout constraints. Most importantly, results from the same vertical, if presented, must appear grouped together (either horizontally or vertically) in the final presentation. Therefore, the task cannot be merely formulated as score normalization.

We propose a novel formulation of the problem. Given a set of layout constraints, which we define explicitly in Chapter 5, we cast vertical results presentation as block ranking. A result block is defined as a sequence of Web or vertical results that must appear grouped together in the final presentation. Based on this formulation, we propose a methodology for evaluating a particular presentation of results (i.e., a particular ranking of blocks). At a high level, our approach proceeds as follows. First, given a query and a set of blocks, we collect (redundant) human preference judgements on block-pairs. Then, we use these judgements to construct a “gold standard” or reference presentation. This reference presentation is essentially a ranking of blocks that attempts to be consistent with preferences expressed on block-pairs. Finally, we propose that any alternative presentation for the query can be evaluated using a rank-based metric to measure its distance or similarity to the reference presentation.

Compared to evaluation methodologies adopted in prior work, our approach has several advantages. First, given a query, it requires relatively few human judgements in order to evaluate any possible presentation of results. This means that it can be used to construct an evaluation testbed (a set of queries with block-pair preference judgements). In information retrieval, evaluation testbeds are important because they facilitate the direct comparison between alternative approaches. Second, our approach does not require an operational system with users.5 Finally, our approach is grounded in user preference behavior. Chapter 5 presents a user study that empirically validates the metric. We show that when assessors agree that one presentation is better than another, the metric agrees with the stated preference.

An evaluation metric that can automatically score alternative presentations for a query is not only essential for comparing systems, it is also essential for model building—for training a system to produce high-quality output. In Chapter 6, we propose machine learning methods that rank blocks in response to a query. In information retrieval, learning-to-rank methods are the state-of-the-art in document retrieval. These methods learn to rank individual documents based on features derived from the query (e.g., whether it contains a domain name [56]), the document (e.g., its PageRank [12]), and

5In an operational environment, algorithms can be tested based on implicit user feedback (clicks and skips) [28, 78].

the query-document pair (e.g., its BM25 score [80]). This dissertation presents the first investigation of learning-to-rank methods for the purpose of ranking blocks of results from different search services.

There are three novel components to this work. First, in terms of features, how do we represent a sequence of documents as a single unit of retrieval? In other words, how do we aggregate document evidence and query-document evidence across multiple documents (i.e., those that compose a particular block)? Second, different verticals retrieve different types of results, each associated with a unique set of meta-data. For example, news results have a publication date, local results have a location, shopping results have a product name. How do we rank blocks that are associated with different feature representations? Finally, even if a feature is common to multiple verticals, it may have a vertical-specific predictive relationship with relevance. For example, the number of retrieved results may be predictive evidence for news, but not for weather, which retrieves at most a single result. The query-term “weather” may be positive evidence for weather, but negative evidence for local. The presence of a city name in the query may be positive evidence for local, but negative evidence for finance. Thus, block-ranking may require methods that can exploit a vertical-specific relationship between features and relevance. Learning-to-rank methods rank documents by adopting a consistent feature representation and learning a feature-to-relevance relationship that is uniform across documents. Chapter 6 presents approaches that learn to rank different types of elements (i.e., blocks from different verticals), where each type is associated with a unique set of features and, possibly, a unique feature-to-relevance relationship.
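Because the output of block-ranking is an ordering of blocks, a predicted presentation can be scored against the reference presentation with a rank-based distance of the kind mentioned above. The sketch below counts pairwise disagreements in the style of Kendall's tau; the metric actually used in this dissertation, K*, is defined in Chapter 5 and adds weighting not shown here.

```python
from itertools import combinations
from typing import List


def pairwise_disagreements(reference: List[str], predicted: List[str]) -> int:
    """Count block pairs ordered one way in the reference presentation and the
    opposite way in the predicted one. 0 means the orderings are identical."""
    position = {block: i for i, block in enumerate(predicted)}
    return sum(1
               for a, b in combinations(reference, 2)  # a precedes b in the reference
               if position[a] > position[b])           # ...but follows b in the prediction


reference = ["news", "w1", "images", "w2"]
predicted = ["w1", "news", "w2", "images"]
print(pairwise_disagreements(reference, predicted))  # 2
```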

1.2.1 Summary of Contributions

The dissertation makes the following contributions to the field of information retrieval.

1. An evaluation of existing resource selection methods on the task of vertical selection. We show that existing methods are not well-suited for an aggregated web search environment.

2. A machine-learning approach to resource selection capable of integrating multiple sources of evidence. We show that a machine-learning approach outperforms existing methods in two different environments: in an aggregated web search environment and in a more traditional federated search environment.

3. An understanding of the contribution of feature-integration in resource selection. We investigate the effectiveness of various types of evidence: content-based evidence (derived from resource content), query-log evidence (derived from resource-specific query-traffic), click-through evidence (derived from previously seen clicks on resource content), and query-string evidence (derived from properties of the query). We demonstrate that no single feature type is exclusively responsible for performance. Furthermore, we demonstrate the importance of different types of features under different experimental conditions (e.g., given different levels of resource-representation quality).

4. The first investigation of domain adaptation applied to resource selection. We contribute a set of machine learning models that can generalize across verticals, as well as adapt their parameters to a specific vertical that is associated with no training data.

5. A new formulation of the vertical results presentation task. We formulate the task as block-ranking.

6. A new methodology for evaluating aggregated web search results. Our methodology includes an interface for collecting block relevance judgements in a quick and inexpensive manner and a way of using these judgements to conduct off-line metric-based evaluation.

7. An evaluation of machine-learning approaches to vertical results presentation. These methods, viewed broadly, address the task of ordering different types of items, which have an inconsistent feature representation and a non-uniform (item-type-specific) relationship between features and the item’s target rank.

1.3 Impact on Other Applications

Most of the dissertation focuses on aggregated web search, which, as previously mentioned, is characterized by two important properties: different resources retrieve different types of results (result-type heterogeneity) and different resources use different retrieval algorithms (retrieval-algorithm heterogeneity). While we focus on aggregated web search, these two types of heterogeneity occur in other federated search environments. Thus, the solutions presented in the thesis may be applicable to other federated search tasks. Library search, for example, must support single-point access to various types of media, such as books, encyclopedic entries, archived news and magazine articles, film, and conference proceedings. Likewise, desktop search must support single-point access to various file-types, such as documents, spreadsheets, presentations, images, email, schedule information, and applications. Kim and Croft [60] approached the task of target file-type prediction in desktop search as a type of resource selection problem.

Mobile search is another environment where aggregation takes place. Currently, commercial mobile search providers support many of the same vertical search services familiar to Web search users.6 One important difference with mobile search, however, is the availability of rich information about the user’s current context. Contextual information includes, for example, the time of day, the user’s location, and whether the user is stationary or moving. Additionally, historical data can provide clues about the user’s activity. For example, are they likely to be at home, at work, or in an unfamiliar

6For a demonstration, see http://mobile.yahoo.com/search.

location? Aggregated mobile search may benefit from methods that can easily integrate a wide range of contextual evidence.

The search scenarios mentioned above are all situations where the user issues a query. Aggregation also occurs in situations where there is no explicitly stated information need. On-line portals such as Yahoo!7 and AOL8 as well as on-line newspapers such as the New York Times9 provide visitors with access to a wide range of content, for example, videos, images, blogs, international news, local news, movie news, real-estate news, financial news, job listings, classifieds, personals, and even advertisements. Each type of content can be viewed as a type of vertical. An important task for the system, then, is deciding which type of content to display and where to display it. In the absence of a user query, we can imagine harnessing various types of user-interest data, for example, from bursts in the production of a particular type of content, from the user’s preferences (if the user is a returning user), from queries issued to the portal’s search engine (if one exists), from previous user interactions (clicks and skips), or from trending topics on blogs or on Twitter10.

1.4 Dissertation Outline

The remainder of the dissertation is organized as follows. Chapter 2 surveys prior work in federated search, including resource representation (Section 2.2), resource selection (Section 2.3), and results merging (Section 2.4). Chapters 3-6 focus on aggregated web search, where the goal is to incorporate relevant vertical results into Web search results. Chapter 3 evaluates a machine learning approach to vertical selection, the task of predicting which verticals to search for a given query. To address the limitation of requiring training data, Chapter 4 focuses on domain adaptation for vertical selection, where the task is to learn a predictive model for a new vertical using only existing-vertical training data. Chapters 5 and 6 focus on vertical results presentation, the task of predicting where to present results from different verticals. Chapter 5 describes an evaluation methodology for vertical results presentation and Chapter 6 focuses on machine learning methods that predict how to present results. To test the generality of our vertical selection approach (Chapter 3), in Chapter 7 we focus on resource selection in a more traditional federated search environment, where resources retrieve similar types of documents and use similar retrieval algorithms. Finally, Chapter 8 concludes the dissertation by summarizing the major results and contributions and highlighting potential areas for future work.

7 http://www.yahoo.com
8 http://www.aol.com
9 http://www.nytimes.com
10 http://www.twitter.com

Chapter 2

Prior Research in Federated Search

Federated search, or distributed information retrieval, is the problem of automatically searching across multiple distributed collections or resources. Aggregated web search—the detection and integration of relevant vertical content into the Web search results—can be viewed as one type of federated search task. The goal of this chapter is to survey prior research in federated search. Throughout the rest of this dissertation, we use algorithms described in this chapter in two ways: to generate features for our models and to serve as baselines for comparison.

Most prior federated search research partitions the problem into three separate, but interrelated, tasks. Resource representation is the task of gathering information about the contents of each resource. Resource selection is the task of deciding which resources to search for a given query. Results merging is the task of combining results from different resources—those selected—into a single document ranking. It is often impractical (if not impossible) to issue the query to every available resource. Therefore, the goal of resource selection is to determine which few resources are most likely to have relevant content. Resource representation and selection go hand-in-hand. Ultimately, the goal of resource representation is to inform resource selection. Resource representation happens off-line, while resource selection and results merging happen every time a query is issued to the system.

In the following sections, we review prior work in representation (Section 2.2), selection (Section 2.3), and merging (Section 2.4). We focus on approaches intended for an uncooperative environment.

2.1 Cooperative vs. Uncooperative Federated Search

Prior research in federated search often distinguishes between a cooperative and an uncooperative environment. In a cooperative environment, resources provide the system with all the information needed to perform federated search accurately and efficiently. A cooperative environment may occur, for example, when the system and its target resources are operated by the same search company. In an uncooperative environment, resources provide the system no more than the functionality they provide their human users: a search interface. An uncooperative environment may occur, for example, when the system searches across external digital libraries that cannot be crawled and centrally indexed. In a cooperative environment, federated search can be done using a cooperative federated

search protocol such as STARTS [41]. The STARTS protocol standardizes how resources should publish content descriptions, defines a unified query language to retrieve documents from resources, and specifies result set statistics to be provided alongside search results to facilitate results merging. Cooperative protocols such as STARTS have the advantage that resource descriptions are complete and accurate. However, they have a few drawbacks. First, they require coordination between resource providers. Second, they assume that there exists a single description template that can effectively summarize the contents of every resource. Such a template may become overly complex as resources specialize on a wider range of content and media (e.g., news, blogs, images, video, items for sale). Third, because representation is not handled locally by the system, every resource is responsible for providing the system with complete and accurate content descriptions. In some cases, we may prefer a configuration in which the federated search system is exclusively responsible for representation. Perhaps for these reasons, most federated search research after Gravano et al.'s work on the STARTS protocol assumes an uncooperative environment.

2.2 Resource Representation

Resource representation has a dual objective: to gather information about the contents of each resource and to represent resource content in a way that informs selection. The manner in which resource content is represented (e.g., term-frequency information [98]) depends on the resource selection algorithm (e.g., language-model-based selection [98]). Regardless of how the content is represented, in an uncooperative environment, resource statistics (e.g., term-frequency counts) must be collected by issuing queries to the resource and analyzing their results. This process is referred to as resource sampling. The end goal of resource sampling is to produce a sample set that is highly representative of unsampled, unseen documents within the collection. In general, resource sampling iterates over the following steps: a query is issued to the collection, a few documents are downloaded, the resource description is updated, and the process iterates. Callan and Connell's query-based sampling (QBS) approach derives single-term queries for sampling from the emerging resource description, which takes the form of term frequencies for the set of documents downloaded so far [16]. Callan and Connell compared federated search retrievals using complete vs. partial descriptions (acquired using QBS) and concluded that the two are likely indistinguishable to a human. The research questions associated with resource sampling include: (1) how much to sample, (2) what to sample, (3) when to re-sample, and (4) how to make optimal use of sampled documents for representation (and ultimately selection).
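The following is a minimal Python sketch of this sampling loop, assuming a hypothetical search(query) interface that returns (document id, term list) pairs; probe-term selection and the stopping criterion are deliberately simplified.

import random
from collections import Counter

def query_based_sampling(search, seed_term, max_docs=300, docs_per_query=4, max_probes=1000):
    # Sketch of QBS: each single-term probe query is drawn from the emerging
    # resource description (term frequencies over the documents sampled so far).
    description = Counter()
    sampled_ids = set()
    query = seed_term
    for _ in range(max_probes):
        if len(sampled_ids) >= max_docs:
            break
        for doc_id, terms in search(query)[:docs_per_query]:
            if doc_id not in sampled_ids:
                sampled_ids.add(doc_id)
                description.update(terms)
        if not description:
            break
        query = random.choice(list(description))   # next probe term
    return description, sampled_ids

# Toy usage with a fabricated in-memory "resource".
corpus = {i: f"federated search sampling document {i}".split() for i in range(50)}
def toy_search(q):
    return [(i, terms) for i, terms in corpus.items() if q in terms]
description, sampled = query_based_sampling(toy_search, seed_term="search", max_docs=10)
print(len(sampled), description.most_common(3))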

2.2.1 How much to sample

Callan and Connell [16] show that effective resource descriptions, suitable for resource selection, can be produced by sampling as few as 300 documents per collection. Although

their results were obtained on testbeds with relatively small collections (with fewer than 50,000 documents), sampling about 300 documents per collection has become standard practice, even on testbeds with larger collections [24, 76, 91, 92, 94, 95, 96, 104]. To some extent, this was done to make results comparable with prior work. In fact, Shokouhi et al. show that federated search results, based on the final merged retrieval, improve when uniformly sampling about 1,000 documents per resource [92]. An alternative to uniformly sampling the same number of documents per resource is adaptive sampling [20, 92]. Shokouhi et al. [92] sample from each resource until the rate of previously unseen terms drops below a threshold. They show that adaptive sampling outperforms uniformly sampling 300 documents per resource. However, during adaptive sampling, every sample set consisted of more than 300 samples. Therefore, while adaptive sampling outperformed 300 documents per resource, it is unclear from this study whether it outperforms uniform sampling with equally large sample sets. Caverlee et al. [20] propose adaptive sampling in two phases. In the first phase, an equal number of documents is sampled from each collection. In the second phase, the remaining sampling budget is reallocated by sampling more from collections predicted to have a weaker intermediate representation. Two budget-reallocation approaches outperformed uniformly sampling 300 documents: sampling proportional to the estimated number of documents in the collection and sampling proportional to the estimated vocabulary size of the collection. As opposed to Shokouhi et al. [92], Caverlee's experiments on adaptive sampling were conducted with a fixed budget of n × 300 total samples, where n is the number of collections. To summarize, given a relatively small sampling budget (e.g., n × 300 total samples), adaptive sampling outperforms uniform sampling. It is unclear, however, whether its performance improvement diminishes with a larger sampling budget.
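As a concrete illustration of the stopping rule behind adaptive sampling, the sketch below keeps sampling a resource until newly downloaded documents contribute few previously unseen terms; the sample_batch callable, batch size, and threshold are assumptions made here for illustration, not the published configurations.

import random

def adaptive_sample(sample_batch, new_term_rate_threshold=0.05, max_batches=100):
    # Stop sampling once a batch of newly downloaded documents contributes
    # few previously unseen vocabulary terms.
    vocabulary, samples = set(), []
    for _ in range(max_batches):
        batch = sample_batch()                    # e.g., results of one probe query
        if not batch:
            break
        batch_terms = [t for doc in batch for t in doc]
        unseen = [t for t in batch_terms if t not in vocabulary]
        samples.extend(batch)
        vocabulary.update(batch_terms)
        if batch_terms and len(unseen) / len(batch_terms) < new_term_rate_threshold:
            break
    return samples

# Toy usage: documents drawn from a small vocabulary, delivered in batches of 50.
docs = [[f"term{random.randint(0, 200)}", f"term{random.randint(0, 200)}"] for _ in range(5000)]
batches = iter([docs[i:i + 50] for i in range(0, len(docs), 50)])
print(len(adaptive_sample(lambda: next(batches, []))))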

2.2.2 What to sample

Besides deciding how much to sample, another question is what to sample, which in an uncooperative setting is instantiated as how to select queries for sampling. The traditional QBS approach is to derive sampling queries from the ongoing representation. The risk, however, is that this does not guarantee that the sampled documents will be typical of those requested by users at test time. Shokouhi et al. [92] show that probing collections using high-frequency query-log queries can produce more effective representations. Their hypothesis was that query-log queries can help bias the sample towards documents likely to be requested at test time and that this leads to more effective representations. Their results suggest this to be true. However, the same set of query-log queries was used to sample from every collection. This makes their results slightly inconclusive. It may be that the single set of query-log probe queries used in the experiment was particularly effective. A better evaluation would have used multiple sets of query-log queries and reported variance in performance.

2.2.3 When to re-sample

Ensuring that sampled documents are representative of unsampled documents in the resource is difficult when resources are dynamic—when documents are periodically added to and removed from the resource. This may occur, for example, in a news collection, where new content is produced daily. Shokouhi et al. [93] investigate re-sampling strategies in environments where the system has no prior knowledge of a resource's update schedule. Several heuristics were explored. Re-sampling according to collection size (i.e., re-sampling larger collections more frequently) outperformed re-sampling according to level of query traffic (i.e., re-sampling higher query-traffic collections more frequently). Re-sampling outperformed sampling just once, showing that representations can become stale if collections change.

2.2.4 Combining collection representations

The final research question is how to make optimal use of sampled documents. A risk of deriving content summaries from document samples is not representing meaningful content in the collection (e.g., missing important vocabulary terms). Ipeirotis and Gravano [50] address this problem by combining content summaries (in their case, term frequency information) from semantically-related collections. Their approach proceeds in two steps. First, each collection is assigned to a node in a topic hierarchy. A collection's topical affinity is determined by issuing topically-focused queries to the collection and observing their hit counts. Second, each collection's language model (used in Ipeirotis and Gravano's resource selection approach) is smoothed with those assigned to related nodes in the hierarchy. We revisit this approach in Section 2.3.

2.3 Resource Selection

As previously mentioned, it is often impractical (if not impossible) to send the query to every resource. Therefore, the goal of resource selection is to decide which few resources are most likely to have relevant content. More formally, the goal is to decide which k < n collections to select for a given query.

2.3.1 Problem Formulation and Evaluation

Most prior research assumes that k—the number of collections to select—is given. For this reason, resource selection is often cast as a resource ranking problem. If k is given, the system is only required to produce a ranking of n collections, from which the top k are selected and evaluated. Given this formulation of the problem, resource selection is often evaluated based on the system’s ranking of resources. The most common metric is Rk, which evaluates a ranking of n collections up to some rank k. The assumption behind Rk is that the best a

system can do is to rank resources based on their number of true relevant documents (in descending order). For a given query q and value k, let,

$$R_{k,q} = \frac{\sum_{i=1}^{k} |E_{i,q} \cap \mathcal{D}_q^+|}{\sum_{i=1}^{k} |B_{i,q} \cap \mathcal{D}_q^+|}.$$

Here, $E_{i,q}$ denotes the $i$th resource in the algorithmic ranking, $B_{i,q}$ denotes the $i$th resource in a relevant-document-based ranking, and $\mathcal{D}_q^+$ denotes the set of all documents relevant to q. Rk is the average of $R_{k,q}$ across queries. Rk ranges between 0 and 1 and is easy to interpret—it corresponds to the (average) relevant-document recall from the top k collections, normalized by the best possible recall given k. Resource selection is only the first step in a two-step process: selection and merging.

An evaluation based on Rk has the advantage that it isolates results merging performance from resource selection evaluation. However, precisely because of this, it is difficult to determine how values of Rk correlate with the quality of the end-to-end system output: the merged results. Selecting resources based on their number of relevant documents may not always be the best strategy. For instance, even if a resource has many relevant documents, its search engine may not retrieve them. And, even if it retrieves them, the merging algorithm may not rank them high. That is, the merging algorithm may handle documents from some collections better than others. Previous experiments show that maximizing Rk does not necessarily maximize the quality of the final merged results [90]. For this reason, the more recent trend is to hold the merging algorithm constant and to evaluate resource selection based on the final merged results, borrowing evaluation metrics from document retrieval (e.g., P@10 of the merged results for a given value k < n).
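For concreteness, the sketch below computes Rk for a single query directly from per-collection relevant-document counts; the collection names and counts are fabricated.

def r_k(algorithmic_ranking, relevant_counts, k):
    # Relevant documents covered by the top-k algorithmic ranking, normalized
    # by the best possible coverage at rank k (the relevance-based ranking).
    best_ranking = sorted(relevant_counts, key=relevant_counts.get, reverse=True)
    achieved = sum(relevant_counts.get(c, 0) for c in algorithmic_ranking[:k])
    best = sum(relevant_counts[c] for c in best_ranking[:k])
    return achieved / best if best > 0 else 0.0

# Toy per-collection relevant-document counts for one query.
relevant_counts = {"c1": 10, "c2": 3, "c3": 0, "c4": 7}
print(r_k(["c2", "c1", "c3", "c4"], relevant_counts, k=2))   # (3 + 10) / (10 + 7)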

2.3.2 Modeling Query-Collection Similarity

Some methods rank collections by comparing the text in the query with the text in the entire collection, using metrics adapted from document retrieval. These methods model the collection as a single unit of retrieval and make no distinction between documents in the collection. For this reason, they are sometimes referred to as "large document models". CORI [14] adapts INQUERY's inference net document ranking approach to rank collections. Collection $C_i$ is scored according to the query likelihood, $P(q|C_i)$, where the likelihood of query term $w \in q$ is given by,

$$\mathrm{CORI}_w(C_i) = b + (1 - b) \times \frac{df_{w,i}}{df_{w,i} + 50 + 150 \times \frac{cw_i}{avg\_cw}} \times \frac{\log\left(\frac{|\mathcal{C}| + 0.5}{cf_w}\right)}{\log\left(|\mathcal{C}| + 1.0\right)},$$

where $df_{w,i}$ is the document frequency of query term $w$ in $C_i$, $cf_w$ is the number of collections containing query term $w$, $|\mathcal{C}|$ is the number of target collections, $cw_i$ is the number of terms in $C_i$, $avg\_cw$ is the average number of words per collection, and $b$ is the default belief.
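The sketch below applies the CORI formula above to toy statistics. The per-collection statistics, the default belief b = 0.4, and the simple averaging of per-term beliefs are assumptions for illustration (INQUERY combines beliefs through query operators).

import math

def cori_score(query_terms, coll, df, cf, num_collections, avg_cw, b=0.4):
    # df[(w, name)]: document frequency of w in the collection;
    # cf[w]: number of collections containing w; coll["cw"]: collection term count.
    score = 0.0
    for w in query_terms:
        d = df.get((w, coll["name"]), 0)
        t = d / (d + 50 + 150 * coll["cw"] / avg_cw)
        i = math.log((num_collections + 0.5) / max(cf.get(w, 1), 1)) / \
            math.log(num_collections + 1.0)
        score += b + (1 - b) * t * i
    return score / len(query_terms)    # average belief over query terms

df = {("election", "news"): 120, ("election", "images"): 2}
cf = {"election": 2}
collections = [{"name": "news", "cw": 50000}, {"name": "images", "cw": 20000}]
avg_cw = sum(c["cw"] for c in collections) / len(collections)
for c in collections:
    print(c["name"], round(cori_score(["election"], c, df, cf, len(collections), avg_cw), 3))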

Xu and Croft [113] and Si et al. [98] adapt approaches from language-model-based retrieval. Xu and Croft [113] rank collections based on the Kullback-Leibler divergence between the query and collection language models,

$$KL_q(C_i) = \sum_{w \in q} P(w|q) \log\left(\frac{P(w|q)}{P(w|C_i)}\right),$$

where $P(w|q)$ and $P(w|C_i)$ are the probabilities of query term $w$ in the query and collection language models, respectively. Si et al. [98] rank collections based on the query generation probability given a collection's language model,

$$P(q|C_i) = \prod_{w \in q} \lambda P(w|C_i) + (1 - \lambda) P(w|G),$$

where $P(w|C_i)$ is linearly smoothed using the probability of query term $w$ in a global language model, $P(w|G)$. Large document models have the advantage of being straightforward adaptations of well-studied document ranking techniques. However, because they do not distinguish between documents in the collection, the collection language model is dominated by the larger documents. This is a potential disadvantage if we care about relevance at the document level. Also, they compare the text in the query with the text in the entire collection, making no distinction between documents that are related and unrelated to the query. Because of this, large document models favor collections with a high proportion of relevant-to-non-relevant documents. They may favor a small topically focused collection (related to the query), when a larger, more topically diverse collection contains more relevant documents. An alternative to modeling a collection as a single, query-independent language model is to focus only on those documents that are most similar to the query. Seo and Croft [86] score collections using the similarity between the query and the collection's most similar m documents. More specifically, their approach scores collection $C_i$ using the geometric average query likelihood from $C_i$'s top m documents,

$$GAVG_q(C_i) = \left(\prod_{d \in \text{top } m \text{ from } C_i} P(q|d)\right)^{\frac{1}{m}}.$$

Seo and Croft applied this approach to the task of blog ranking, where they treat a blog as a collection of documents: its blog posts. Although they assume a complete view of the blog/collection, in an uncooperative federated search environment we can imagine applying this method using sampled documents. In the blog ranking task, the GAVG method outperformed a method that uses a global (query-independent) representation of the collection, fundamentally similar to the large document models presented above. In blog ranking, users are primarily interested in blogs that are consistently relevant to the query across its blog posts. In this sense, topically-diverse blogs should be penalized irrespective of the query. In their results, Seo and Croft show that a large document

model, which uses a global representation of the blog, can be effectively used to penalize diversity in a blog. Above, we claimed that large document approaches model the proportion of relevant-to-non-relevant documents in the collection. This result from Seo and Croft [86] supports our claim that large document models may disfavor a topically diverse collection, even if it has many relevant documents. In the GAVG scoring function, parameter m is query- and collection-independent. However, the number of relevant documents in a collection may be greater than or less than m. In the next section, we introduce methods that address this limitation by directly modeling the relevant document distribution across collections.
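To make the two families concrete, the sketch below computes the smoothed query likelihood of Si et al. [98] and the geometric-average (GAVG) score of Seo and Croft [86] over toy term counts; the smoothing weight, the value of m, and the use of the collection itself as the global model are arbitrary choices for illustration.

import math

def query_likelihood(query_terms, coll_tf, coll_len, global_tf, global_len, lam=0.5):
    # log P(q | C_i), linearly smoothed against a global language model.
    log_p = 0.0
    for w in query_terms:
        p_coll = coll_tf.get(w, 0) / coll_len
        p_glob = global_tf.get(w, 0) / global_len
        log_p += math.log(lam * p_coll + (1 - lam) * p_glob + 1e-12)
    return log_p

def gavg(query_terms, docs, m=2):
    # Geometric average of P(q | d) over the collection's top-m documents,
    # where each document is a term-count dictionary.
    def doc_loglik(doc):
        length = sum(doc.values())
        return sum(math.log(doc.get(w, 0) / length + 1e-12) for w in query_terms)
    top = sorted((doc_loglik(d) for d in docs), reverse=True)[:m]
    return math.exp(sum(top) / len(top))

docs = [{"election": 3, "senate": 1}, {"sunset": 5}, {"election": 1, "photo": 2}]
coll_tf = {}
for doc in docs:
    for w, count in doc.items():
        coll_tf[w] = coll_tf.get(w, 0) + count
coll_len = sum(coll_tf.values())
print(query_likelihood(["election"], coll_tf, coll_len, coll_tf, coll_len))
print(gavg(["election"], docs))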

2.3.3 Modeling the Relevant Document Distribution

GlOSS [40] scores a collection based on its expected number of relevant documents, as follows,

$$GlOSS_q(C_i) = |C_i| \times \phi(q, C_i),$$

where $\phi(q, C_i)$ denotes the query-collection similarity. While GlOSS estimates the number of relevant documents in a collection, it does so using the entire collection's language model. Essentially, it assumes that the collection's language is evenly distributed across documents. vGlOSS [42], a vector-space extension of GlOSS, more explicitly addresses the fact that different documents cover different topics. vGlOSS scores collections proportional to the number of documents exceeding a query-document similarity threshold,

$$vGlOSS_q(C_i) = \sum_{d \in C_i} \mathcal{I}\left(\phi(q, d) > \tau\right) \times \phi(q, d),$$

where $\mathcal{I}$ is the indicator function, $\phi(q, d)$ is the query-document similarity, and $\tau$ is a parameter. The ideal vGlOSS algorithm requires collections to publish their document term vectors in order to compute $\phi(q, d)$. ReDDE [94] follows the same principle as vGlOSS. It prioritizes collections by their expected number of relevant documents. However, it differs from vGlOSS in two respects. First, it treats all predicted relevant documents equally, that is,

$$ReDDE_q(C_i) \approx \sum_{d \in C_i} \mathcal{I}\left(\phi(q, d) > \tau\right).$$

Second, it does not require collections to publish their document term vectors. It accomplishes this by using a centralized sample index, an index that combines samples from every collection. ReDDE proceeds as follows. First, it conducts a retrieval from the centralized sample index. Given this retrieval, it uses a rank-based cut-off to predict a set of relevant sampled documents, all considered equally relevant. Finally, it assumes that each (predicted)

relevant sample represents some number of unseen relevant documents in the collection from which it originates. ReDDE is described in detail because it is used in methods proposed in this thesis.

ReDDE scores collection $C_i$ according to,

$$ReDDE_q(C_i) = \mathcal{SF}_i \times \sum_{d \in S_i} P(rel|d, q), \qquad (2.1)$$

where $P(rel|d, q)$ is the probability that document $d$ is relevant to $q$ and $\mathcal{SF}_i$ is the scale factor of collection $C_i$, defined by,

$$\mathcal{SF}_i = \text{scale factor}(C_i) = \frac{|C_i|}{|S_i|}. \qquad (2.2)$$

The scale factor quantifies the difference between the size of the original collection, $|C_i|$, and its sample set size, $|S_i|$. ReDDE models $P(rel|d, q)$ as a binary relevant/non-relevant variable based on document $d$'s projected rank in an unobserved full-dataset retrieval, $\hat{\mathcal{R}}_{q,full}(d)$. Document $d$'s projected full-dataset rank is estimated as the sum of scale factors associated with samples ranked above $d$ in the retrieval from the centralized sample index,

$$\hat{\mathcal{R}}_{q,full}(d) = \sum_{j=1}^{\mathcal{R}_{q,S}(d) - 1} \sum_{i=1}^{|\mathcal{C}|} \mathcal{I}(d_j \in C_i) \times \mathcal{SF}_i,$$

where $\mathcal{R}_{q,S}(d)$ denotes document $d$'s rank in the retrieval from the centralized sample index for query $q$.

$P(rel|d, q)$ is modeled according to,

$$P(rel|d, q) = \begin{cases} 1 & \text{if } \hat{\mathcal{R}}_{q,full}(d) < \tau \times |C_{full}| \\ 0 & \text{otherwise,} \end{cases}$$

where $|C_{full}| = \sum_{C_i \in \mathcal{C}} |C_i|$ and $\tau$ is a parameter. Between ReDDE and CORI, prior work shows that ReDDE is more robust. It performs either at the same level or significantly better than CORI across testbeds.
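Because ReDDE is reused later in this thesis, the compact sketch below implements the scoring logic of equations 2.1 and 2.2, operating on a ranked list from a centralized sample index; the cut-off value and the toy collection sizes are assumptions.

def redde_scores(csi_ranking, sample_sizes, coll_sizes, tau=0.003):
    # csi_ranking: (doc_id, collection) pairs ranked by the centralized sample
    # index retrieval. Each sampled document ranked above the projected cut-off
    # votes for its collection with weight equal to the collection's scale factor.
    scale = {c: coll_sizes[c] / sample_sizes[c] for c in coll_sizes}
    c_full = sum(coll_sizes.values())
    scores = {c: 0.0 for c in coll_sizes}
    projected_rank = 0.0
    for _, coll in csi_ranking:
        if projected_rank < tau * c_full:      # P(rel | d, q) = 1
            scores[coll] += scale[coll]
        projected_rank += scale[coll]          # d stands in for scale-factor unseen documents
    return scores

ranking = [("d1", "news"), ("d2", "images"), ("d3", "news"), ("d4", "news")]
print(redde_scores(ranking, {"news": 300, "images": 300},
                   {"news": 1000000, "images": 50000}))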

2.3.4 Modeling the Document Score Distribution

A federated retrieval (one that selects and merges results from k < n collections) can be viewed as a substitute for a full-dataset retrieval (one that searches a single index combining all n collections). In fact, if we assume full-dataset retrieval to be effective on average, then we can formulate the resource selection task as that of selecting the collections whose merged ranking most closely approximates a full-dataset ranking. For example, we may want to score each resource based on its contribution to the top ranks of a full-dataset ranking. This is the underlying assumption made by the next set of algorithms: CRCS [90], SUSHI [104], and UUM [95]. Shokouhi's Central Rank-based Collection Selection (CRCS) [90] method is similar to ReDDE. The only difference is that while ReDDE models sample relevance using a binary

relevant/non-relevant function, CRCS uses a real value. Two functions are proposed: one decays relevance linearly, CRCS(l), and one decays relevance exponentially, CRCS(e), as a function of rank. One disadvantage of ReDDE is the importance of parameter $\tau$—all sampled documents ranked above this threshold are considered equally relevant. CRCS makes more fine-grained judgments about document relevance. Given a retrieval from the centralized sample index, each sample's contribution to its collection's score is proportional to its rank. To model the full-dataset ranking, CRCS assumes that for each sample, with a particular relevance score, there are scale factor number of documents in its collection with exactly the same score. Essentially, this imposes a limit on how well a full-dataset retrieval can be approximated. This is because the (unseen) documents represented by a sampled document must appear adjacent in the unobserved full-dataset retrieval. More fine-grained approximations of a full-dataset retrieval can be made by estimating each collection's full-dataset score curve. A collection's full-dataset score curve is one that expresses (estimated) full-dataset score as a function of (estimated) full-dataset rank. Si and Callan's Unified Utility Maximization (UUM) method [95] proceeds in two phases: a training phase, which happens off-line, and a test phase, which happens at query time. During the training phase, the objective is to learn a logistic model that expresses a document's probability of relevance as a function of its centralized sample index score. This is accomplished using a small set of training queries (e.g., 50) with relevance judgements on documents, which are scattered across collections. Each training query is issued to every collection and each collection's top documents are downloaded and scored according to the centralized sample index statistics. After doing this for every training query, a logistic model is fit to predict relevance as a function of centralized sample index score. During the test phase, the objective is to estimate, for each collection, a probability of relevance curve using all of its (mostly unseen) documents. Let $R_i(r)$ denote collection $C_i$'s probability of relevance curve, where $r \in [1, |C_i|]$. $R_i(r)$ returns the probability of relevance of the $r$th most relevant document from $C_i$. Given a probability of relevance curve from every collection, collections can be prioritized according to different criteria. For example, they can be prioritized by their total document relevance,

$$\sum_{r=1}^{|C_i|} R_i(r),$$

or the relevance of their top N documents, to maximize P@N,

$$\sum_{r=1}^{N} R_i(r).$$

UUM approximates each collection's probability of relevance curve $R_i$ as follows. First, the query is issued to the centralized sample index. From this ranking, the algorithm extracts a sub-ranking corresponding to those documents sampled from $C_i$. Then, UUM assumes that between every two sampled documents adjacent in this sub-ranking

there are $\mathcal{SF}_i$ documents in $C_i$ whose retrieval score degrades linearly. Given this assumption, a full-dataset score curve can be constructed by linearly interpolating the scores from each of $C_i$'s samples, assuming $\mathcal{SF}_i$ unseen documents between each point. Finally, UUM translates this full-dataset score curve into a probability of relevance curve using the logistic model learned in the training phase. Like UUM, Thomas and Shokouhi's SUSHI approach [104] prioritizes collections based on their contribution to an approximated full-dataset retrieval. SUSHI differs from UUM in two respects. First, it does not translate centralized sample index scores to probabilities of relevance. Instead, it uses the centralized sample index scores directly. Second, it does not construct a collection's full-dataset score curve by linearly interpolating points. Instead, it uses curve fitting. Like UUM, SUSHI first issues the query to the centralized sample index. Next, it extracts the sub-ranking corresponding to $C_i$'s samples and scales this sub-ranking by assuming $\mathcal{SF}_i$ unseen documents between adjacent points. Next, it approximates the score curve of all (mostly unseen) documents from $C_i$ by fitting a curve to the observed points. SUSHI fits a linear, exponential, and logarithmic function and chooses the one which fits best according to the $R^2$ metric. SUSHI then interleaves the points from the different collection-specific score curves to approximate a full-dataset retrieval. To maximize P@N, resource selection is done by prioritizing collections by their contribution to the top N full-dataset retrieval results. In terms of performance, both SUSHI and UUM outperform ReDDE on testbeds where relevant documents are evenly distributed across collections. The gap narrows on testbeds where the collection size distribution is skewed and the larger collections contain more relevant documents. SUSHI and UUM have not been directly compared in prior work.

2.3.5 Combining Collection Representations

Many of the approaches discussed so far derive evidence from sampled documents rather than the full collection. A significant risk with sampling is misrepresenting the collection content, particularly if the collection is large relative to the sample set. We discuss two methods that aim to minimize the negative effect of incomplete content descriptions by combining content descriptions from semantically-related collections. Ipeirotis and Gravano [50] propose the following approach. First, collections are assigned to nodes in a topic hierarchy, similar to the Open Directory Project (ODP) hierarchy.1 This is accomplished by issuing topically-focused queries to each collection and observing their hit counts. Then, each category node is treated as a large pseudo-collection, represented by combining the representations of those collections assigned to it directly (assigned to the same node) or indirectly (assigned to a more general node). In general, the manner in which the representations of related collections are combined will depend on the resource selection method. Specifically, Ipeirotis and Gravano use a large document model. Therefore, pseudo-collections were represented by summing the term

1http://www.dmoz.org

counts from those collections assigned to it. At test time, the selection process traverses down the hierarchy, at each step selecting a pseudo-collection using a non-hierarchical resource selection method (e.g., CORI [14]). The search trickles down the hierarchy until selecting a pseudo-collection with k collections or fewer. In later work, Ipeirotis and Gravano [51] combine collection descriptions using a smoothing technique known as shrinkage. Similar to the method above, collections are assigned to nodes in a topic hierarchy and the term statistics from collections assigned to the same node are combined to form node-specific language models. Then, a collection's language model is smoothed by linearly interpolating its language model with that of its related categories, from its parent up to the root node (a language model of the full vocabulary). The mixing coefficients for a given node in the tree are found using Expectation-Maximization. At test time, collections are selected using a non-hierarchical "large document" resource selection algorithm, using each collection's "shrunk" content description.
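The following is a minimal sketch of shrinkage-style smoothing along one category path; unlike Ipeirotis and Gravano [51], the mixing weights below are fixed by hand rather than estimated with Expectation-Maximization.

def shrink_language_model(path_models, weights):
    # Linearly interpolate a collection's language model with the models of its
    # ancestor categories (collection first, root last); weights sum to one.
    vocabulary = {t for model in path_models for t in model}
    return {t: sum(w * model.get(t, 0.0) for w, model in zip(weights, path_models))
            for t in vocabulary}

# Toy path: collection -> "Sports" category -> root.
collection_lm = {"football": 0.6, "score": 0.4}
sports_lm = {"football": 0.3, "team": 0.4, "score": 0.3}
root_lm = {"football": 0.1, "team": 0.2, "score": 0.2, "news": 0.5}
shrunk = shrink_language_model([collection_lm, sports_lm, root_lm], [0.6, 0.3, 0.1])
print({t: round(p, 3) for t, p in shrunk.items()})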

2.3.6 Modeling Search Engine Effectiveness

So far, we have discussed methods that rank collections based on their content. Content-based methods, however, ignore the fact that even if a resource has relevant content, its search engine may not retrieve it. The following methods address this limitation by considering the search engine's expected effectiveness in resource selection decisions. To our knowledge, the first method to consider retrieval effectiveness in federated search was proposed by Voorhees et al. [106]. This method has a training and test phase. During training, the system issues every training query to every resource and observes the number of relevant documents retrieved by each. Then, given a previously unseen (test) query q, the system ranks resources based on their average number of retrieved relevant documents for the set of training queries most similar to q. The test query's nearest-neighbor (training) queries are obtained using the cosine similarity in the query-term space. Computing the query-query similarity based on query-term overlap risks not finding enough training queries similar to the test query. A second approach uses query clustering. For each search engine, all training queries are clustered using as a query similarity metric the overlap in the top documents retrieved. Notice that query clusters could have also been collection independent. In this case, however, they were collection dependent.2 Given a collection-dependent clustering, each cluster is assigned an effectiveness score based on the average number of relevant documents retrieved by queries assigned to it. Finally, at test time, given a previously unseen query, resources are prioritized by the effectiveness of the query cluster most similar to it. Clusters are represented by their term centroid. Both approaches implicitly model search engine effectiveness by focusing on documents retrieved from the collection. Si and Callan propose the Returned Utility Maximization (RUM) model [96], an extension of UUM that models search engine effectiveness. RUM estimates a search engine's

2This choice was not evaluated empirically.

23 gine’s effectiveness by comparing its retrievals with those produced with a presumably effective ranking algorithm on a subset of its documents. RUM proceeds as follows. First, a set of J training queries is used to build mapping functions specific to collection Ci of the form,

$$\phi_{ij} : \mathcal{R}_{C_i, q_j}(d) \rightarrow \mathcal{R}_{S, q_j}(d).$$

Mapping function $\phi_{ij}$ is constructed by issuing training query $q_j$ to collection $C_i$ (producing ranking $\mathcal{R}_{C_i, q_j}$), downloading the top documents, and scoring these using a supposedly effective retrieval algorithm according to the centralized sample index statistics

(producing ranking $\mathcal{R}_{S, q_j}$). From these two rankings, mapping function $\phi_{ij}$ maps ranks in $C_i$'s (possibly ineffective) retrieval to ranks in the (assumed to be) effective retrieval. Recall that UUM constructs, for every collection $C_i$, a probability of relevance curve, $R_i(r)$, where $r \in [1, |C_i|]$. Function $R_i(r)$ specifies a document's estimated probability of relevance as a function of its rank in a retrieval of collection $C_i$. RUM models retrieval effectiveness by injecting the mapping functions, $\phi_{ij}$, learned in the training phase, into the collection selection criterion. For example, if selecting collections by their total number of relevant documents, instead of maximizing,

$$\sum_{r=1}^{|C_i|} R_i(r),$$

RUM maximizes,

$$\sum_{r=1}^{|C_i|} \frac{1}{J} \sum_{j=1}^{J} R_i(\phi_{ij}(r)).$$

If selecting collections by their expected number of top N relevant documents, instead of maximizing,

$$\sum_{r=1}^{N} R_i(r),$$

RUM maximizes,

$$\sum_{r=1}^{N} \frac{1}{J} \sum_{j=1}^{J} R_i(\phi_{ij}(r)).$$

Injecting the mapping functions into the collection selection criterion has the effect that documents from collections with an ineffective search engine will tend to have lower probabilities of relevance and therefore lower ranks in the approximated full-dataset retrieval. At test time, RUM combines a collection's J mapping functions with equal weight. In other words, while each mapping function is specific to a query, RUM's application of these J mapping functions is query-independent. In this respect, Voorhees et al. [106] go one step further by considering the similarity between the test query and the training queries.

2.3.7 Query Classification

Query classification is the task of assigning queries to one or more categories, and is done, for example, when processing different query classes using customized retrieval algorithms or representations [57]. If we consider target collections as query categories, we may view resource selection in federated search as a type of query classification problem. The major difference is that in resource selection each target "category" is associated with a document collection that can be used to inform selection. Because queries are terse, query classification methods often generate evidence from sources other than the query string. Among these resources are query-logs [6], query click-through data [108], and documents associated with target categories [57, 88, 89].3 In spite of the similarity between query classification and resource selection, the most successful query classification approaches differ from most resource selection methods in one respect—they tend to combine different types of evidence. As we have seen, most resource selection methods score collections using a single ranking function and derive evidence exclusively from collection documents. Beitzel et al. [6] classify queries into semantic categories using a large (unlabeled) query-log and a technique known as selectional preference (SP). According to SP, the query "interest rates" belongs to the target category finance because the terms "interest" and "rates" co-occur with terms known to be finance-related. Beitzel et al. combine an SP-based classifier with two supervised classifiers: a rote classifier and a bag-of-words text classifier. A combination approach outperformed all three base classifiers. Wen et al. [108] introduce several query-similarity metrics. Although they focus on query clustering, their similarity metrics could be used in any memory-based approach to classification, such as K-Nearest Neighbor (KNN). They explored query similarity based on the query string, degree of click overlap, and the average pair-wise text similarity between clicked documents. Their most effective similarity metric, evaluated on clustering, was a combination of these three sources of evidence. Kang and Kim [57] categorize queries by type of information need: informational vs. known-item. To inform classification, they harvested documents likely to be relevant to each query type. Web documents were assigned to either a "known-item query" collection or an "informational query" collection based on URL length—short URLs tend to be website entry pages, which tend to satisfy known-item queries. Kang and Kim focus on two types of evidence: hit counts derived from these two collections and features derived from the query string. A linear combination of all features outperformed all single-evidence baselines. Shen et al. [89] also use a meta-classification approach, and, like Kang and Kim [57], use corpus-based evidence. Each target category was automatically mapped (based on title overlap) to a set of "intermediate" categories (e.g., ODP categories), each associated with a set of documents. The end result is a corpus of documents associated with each target category. These category-specific collections were used to produce three base

3Approaches in this last category use techniques similar to those used in resource selection.

classifiers. The first classifier resembles ReDDE [94]. The query is issued to an index of all category-specific collections combined and the classifier predicts the category most frequently represented in the top-ranked documents. The second method uses a category's related documents as training data for a multi-class bag-of-words text classifier that is applied to the query string to make a category prediction. The third classifier makes predictions based on the query likelihood given the category language model, derived from its associated documents. In this respect, it resembles the large document model presented in Si et al. [98], with one major difference: intermediate categories are not mapped to a single target category. Instead, this method derives a "soft" intermediate-to-target category mapping using the similarity between associated documents. Again, their best-performing approach was a combination of these three classifiers. In contrast to these approaches, Li et al. classify queries into the query intent categories product and jobs and propose a technique that derives evidence exclusively from the query string [67]. Instead of enriching the query representation using external resources, they increase the amount of training data using a query click graph. Query labels are propagated to unlabeled queries with clicks on similar documents. Their semi-supervised approach outperformed a baseline trained only on the labeled "seed" queries. Focusing on query string features is efficient. Feature enrichment requires, for example, issuing the query to a collection of category-representative documents. However, query string features have a potential shortcoming. If the language associated with a particular category shifts significantly, the model may need to be retrained. This may not be the case when using external evidence such as the number of times the query appears in a set of category-related documents. A trained model may be able to maintain its performance as long as external resources change in order to reflect the concept drift.

2.3.8 Machine Learned Resource Selection

Prior to the work presented in this dissertation, Xu and Li [112] are the only ones to cast resource ranking as a machine learning problem. Collections are represented as feature vectors, a combination of query-independent static features (e.g., the size of the collection, its level of query traffic) and query-dependent dynamic features (e.g., the query hit count in the collection, the query hit count in anchor text linking to a collection document). Two machine learning approaches were evaluated: combining collection-specific binary classifiers (i.e., ranking collections by each classifier's confidence in a positive prediction) and a rank-learning approach. Rank-learning outperformed classification and both machine learning methods outperformed a CORI baseline. Although this is a result in favor of machine learned resource selection, the work has some limitations. First, their results cannot be easily compared with previous work. Algorithms were evaluated by their ability to prioritize collections using a five-point relevance scale. Most prior resource selection work is evaluated either based on a strict ordering of resources using Rk or based on the quality of the merged document ranking. Second, the machine learned methods and CORI used different types of evidence. Therefore, it is unclear whether the machine learned methods are inherently superior

to CORI or used evidence better suited to this particular testbed. A better comparison would have used CORI as one type of dynamic feature. The work conducted for this dissertation (particularly that presented in Chapter 7) influenced other researchers to consider casting resource selection as a supervised classification task. Hong et al. [48] also train logistic regression models, as we do. Their work differs from ours in several respects. First, a set of binary resource-relevance labels was generated by observing the number of relevant documents (judged manually) retrieved from each resource for a set of training queries. Second, they limited their features only to existing content-based resource selection methods (e.g., CORI [14], GAVG [86]). Finally, and more importantly, they propose methods that exploit the relatedness between different resources. The idea is to favor a resource for selection if an "independent" model assigns a high prediction probability to those resources that are the most similar. Several (query-independent and query-dependent) resource-similarity measures were evaluated. The best resource-similarity measure was found to be query-dependent.

2.4 Results Merging

Results merging is the task of integrating results from different resources—those selected—into a single merged ranking. Even when resources adopt a similar retrieval algorithm, they often use different representations (e.g., stemming, stopword removal) and have different corpus statistics (e.g., idf values). For these reasons, resource-specific retrieval scores (for the same query) may not be directly comparable across resources. To address this problem, the goal of results merging is to perform score normalization. That is, to transform each retrieved document's resource-specific score (not comparable across resources) into a resource-general score (comparable across resources). Results from different resources can then be ranked based on their normalized scores. We review two widely-used score-normalization techniques. CORI-merge [14] assigns each retrieved document a collection-general score that is a function of its collection-specific score and its collection's resource selection score. CORI-merge scores document $d$, originating from $C_i$, according to,

$$\mathcal{S}_{CORI}(d) = \frac{\hat{\mathcal{S}}_i(d) + 0.4 \times \hat{\mathcal{S}}_i(d) \times \hat{\mathcal{S}}_{coll}(C_i)}{1.4},$$

where $\hat{\mathcal{S}}_i(d)$ is document $d$'s collection-specific score (scaled to zero min and unit max) and $\hat{\mathcal{S}}_{coll}(C_i)$ is collection $C_i$'s resource selection score (scaled to zero min and unit max). CORI-merge applies the same normalization function to all collections and all queries. In contrast, Si and Callan's semi-supervised learning (SSL) approach [97] learns a different transformation model for each query-collection combination. Suppose we had access to a centralized index of all collection content. A retrieval from this centralized index would give us a resource-general score for every document in every resource. While we may not have access to a centralized index of all content, we may have access to a centralized sample index that combines samples from every resource. SSL proceeds as follows.

First, it issues the query to every selected resource and to the centralized sample index. Then, for each selected resource, it uses linear regression to learn a transformation between a result's resource-specific score and its centralized sample index score. To learn this transformation, it uses as training data pairs of scores associated with documents that are both retrieved from the resource and present in the centralized sample index. In the absence of retrieved documents that are also in the centralized sample index, it downloads a few results and scores them using statistics from the centralized sample index.
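The sketch below contrasts the two normalization schemes: CORI-merge's fixed heuristic combination and an SSL-style per-resource linear mapping fit on overlap documents. The closed-form least-squares fit and the toy score pairs are illustrative and do not reproduce the published implementations.

def cori_merge_score(doc_score, coll_score):
    # CORI-merge: combine a min-max normalized document score with its
    # collection's min-max normalized selection score.
    return (doc_score + 0.4 * doc_score * coll_score) / 1.4

def fit_linear_map(resource_scores, csi_scores):
    # SSL-style mapping: least-squares fit of centralized-sample-index scores
    # as a linear function of resource-specific scores (overlap documents only).
    n = len(resource_scores)
    mean_x = sum(resource_scores) / n
    mean_y = sum(csi_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(resource_scores, csi_scores))
    var = sum((x - mean_x) ** 2 for x in resource_scores)
    slope = cov / var if var else 0.0
    return lambda s: mean_y + slope * (s - mean_x)

# Toy overlap documents: (resource-specific score, centralized sample index score).
overlap = [(12.0, 0.31), (9.5, 0.24), (7.0, 0.18)]
to_global = fit_linear_map([x for x, _ in overlap], [y for _, y in overlap])
print(round(to_global(10.0), 3))               # normalized score for a new result
print(round(cori_merge_score(0.8, 0.6), 3))    # CORI-merge example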

2.5 Summary

Several trends are noticeable from surveying prior research in federated search. In terms of prior resource selection work, we can draw the following conclusions.

• Most resource selection methods score collections using a single function with relatively few parameters that are tuned manually on a few training queries (e.g., 50).

• In most prior work, resource selection is formulated as a resource ranking problem. That is, given a set of n collections and a query q, the algorithm selects k ≤ n collections from which to retrieve documents. The objective is to produce the best retrieval possible for a given value k. In theory, the quality of the merged results may not always improve with greater values of k. No prior work addresses the problem of selecting, for a given query, the optimal value of k.

• No single resource selection method outperforms the rest across all testbeds. This may be in part because some methods can be tuned particularly well to a testbed with certain properties. For example, if the evaluation is based on Rk, ReDDE can be tuned particularly well to an environment where the collection size distribution is skewed and most of the relevant documents are in the larger collections.

• Most resource selection methods derive evidence exclusively from collection documents, for example, by comparing the text in the query with all the text in a collection (e.g., CORI [14]) or by estimating the number of relevant documents in a collection using samples (e.g., ReDDE [94]). Other types of non-content-based evidence, derived, for example, from queries previously issued to a collection or from the topic of the query, have been ignored.

• There has been a divide between resource selection and query classification work. While most resource selection methods focus on a single source of evidence (mostly collection samples), the most successful query classification approaches combine sources of evidence. Evidence-integration approaches have not been applied to the task of resource selection.

In terms of prior results merging work, we can draw the following conclusions.

• Existing approaches to results merging assume that results from different collections can be interleaved freely in an unconstrained fashion. For this reason, the task is often formulated as score normalization.

• To make scores from different resources comparable, the state-of-the-art approach, SSL [97], assumes that samples from different resources can be combined in a centralized sample index and that each resource's retrieval algorithm can be modeled internally within the system using a single scoring function.

Chapter 3

Vertical Selection

In addition to web search, commercial search providers maintain many different search services customized for a particular type of information need. These specialized search services (referred to as verticals) include, for example, search for news, images, video, local businesses, movie times, items for sale, and weather forecast information, to name a few. There are two ways that users can access vertical content. In some cases, if the user wants results from a particular vertical, the query can be issued directly to the vertical search engine. In other cases, however, a user may not know that a vertical is relevant or may want results from multiple verticals at once. For these reasons, an important task for a commercial search provider is the detection and integration of relevant vertical content in response to a web search query. This task is referred to as aggregated web search. Aggregated web search can be viewed as a type of federated search, reviewed extensively in Chapter 2. Like federated search, it can also be decomposed into two separate tasks. It is often impractical, if not impossible, to issue the query to every vertical in order to decide which (if any) to present to the user. Therefore, the first task, vertical selection, is to predict (using only pre-retrieval evidence) which verticals are likely to be relevant. The second task, vertical results presentation, is to predict where in the Web results to present or embed the vertical results. In this chapter, we focus on the first task: vertical selection. Vertical selection is related to the task of resource selection in federated search (Section 2.3). Existing solutions to resource selection divide into two classes. So-called large document models (Section 2.3.2) treat resources as large documents and adapt well-studied, text-based document retrieval algorithms to predict resource relevance [14, 98, 113]. On the other hand, small document models (Sections 2.3.3 and 2.3.4) first conduct a retrieval from an index that combines sampled documents from every resource, and then predict resource relevance based on sample relevance [90, 94, 95, 96, 104]. While vertical selection can be viewed as a type of resource selection, the aggregated web search environment violates two of the major assumptions made by state-of-the-art resource selection methods. Both large- and small-document models were designed for environments where resources retrieve similar types of (text-rich) documents and use a similar retrieval algorithm, which can be approximated internally by the federated search system. In other words, existing resource selection methods assume result-type and retrieval-algorithm homogeneity across resources. Neither of these assumptions is

1This work was conducted during an internship in Yahoo! Montreal and was published in SIGIR 2009 with co-authors Fernando Diaz, Jamie Callan, and Jean-François Crespo [3].

true in an aggregated web search environment. First, different verticals retrieve items with different levels of text-richness (e.g., news, images). Second, because they are carefully tuned to satisfy a unique type of information need, different verticals use different retrieval algorithms. This suggests the need for a different approach to vertical selection. Furthermore, existing resource selection methods have a second potential limitation: they derive evidence exclusively from collection content (usually sampled) and do not provide a means to easily incorporate other types of non-content-based evidence. Partly because the environment is not completely uncooperative, in an aggregated web search environment the system may have access to other sources of evidence. Some verticals, for example, are associated with a vertical-specific search interface through which users can issue queries directly to the vertical. Therefore, vertical query-traffic is another potentially useful source of evidence. Some verticals focus on a specific topic (e.g., health, auto, travel) or type of media (e.g., images, video). Thus, it may be possible to detect vertical relevance from the query string alone, for example, based on the query topic (e.g., "swine flu" → health), based on the presence of a particular named entity type (e.g., "bmw reviews" → auto), or based on a query term (e.g., "obama pictures" → images). To address these potential limitations, we cast vertical selection as a machine learning problem. That is, we draw on machine learning as a way to combine multiple sources of evidence as input features and learn a predictive relationship between features and vertical relevance using training data. In particular, we exploit three different types of features. Vertical corpus features derive evidence from (sampled) vertical content. Vertical query-log features derive evidence from vertical-specific query traffic. Finally, query features derive evidence from the query string. Vertical corpus and query-log features enrich the query representation beyond the query string and focus on two potentially complementary sources of evidence: corpus features relate to content production (i.e., content in the vertical) and query-log features relate to content demand (i.e., content sought by users). Another assumption of existing resource selection methods is that evidence is similarly predictive of relevance across all resources. In this chapter, as in the rest of the dissertation, we relax this assumption. We propose a classification framework that combines vertical-specific binary classifiers, each trained independently on its own training data (its own set of queries with human relevance judgements with respect to one vertical). Thus, the classification framework is allowed the freedom to learn a vertical-specific relationship between features and vertical relevance. Consider our query features, mentioned above. Some of these features exploit, for example, the topical category of the query (determined automatically using a query classifier)—if the query is health-related, the health vertical is relevant. Because verticals focus on different topics, exploiting this type of evidence requires learning a vertical-specific relationship between query topic and vertical relevance.
Our objectives in this chapter are to investigate the importance of evidence-integration in vertical selection and to compare our machine-learning, evidence-integration approach with state-of-the-art resource selection methods, which, by design, focus on a single type of evidence.

3.1 Formal Task Definition

Let $\mathcal{V}$ denote the set of all candidate verticals and $\mathcal{Q}$ denote the set of all queries. We assume that each query $q \in \mathcal{Q}$ is associated with a set of true relevant verticals $\mathcal{V}_q \subseteq \mathcal{V}$. In aggregated web search, it is possible for no vertical to be relevant to query $q$. In this case, the optimal choice for the vertical selector is to predict no relevant vertical and simply show the web results. This may occur, for example, when the query is a navigational query. We denote the absence of any true relevant vertical by $\mathcal{V}_q = \emptyset$. In this chapter, we address the task of single vertical selection. Given $q$, the objective is to predict a single relevant vertical, $\tilde{v}_q \in \mathcal{V}_q$, if one exists, and to predict no relevant vertical, $\tilde{v}_q = \emptyset$, otherwise. In general, it is possible for query $q$ to be associated with multiple true relevant verticals. In other words, it is possible for $|\mathcal{V}_q| > 1$. In this sense, we have simplified the task. However, as described in Section 3.4, only about 30% of our evaluation queries were associated with more than one relevant vertical. The remaining 70% were associated with either a single relevant vertical or none.

3.2 Classification Framework

Our classification-based framework consists of $|\mathcal{V}| + 1$ independent binary classifiers: one for each vertical $v \in \mathcal{V}$ and one to predict that no vertical is relevant. Each binary classifier is trained independently (i.e., using potentially different training data) to make a binary prediction with respect to its class (i.e., its vertical or the no relevant vertical class). We train logistic regression models using the Liblinear Toolkit [35].1 We chose logistic regression based on its accuracy and short training time on other large-scale classification tasks [68] and because logistic regression can be used to output prediction probabilities, which we use as described below. In order to make a single prediction per query (either a single relevant vertical or that no vertical is relevant), we combine our independent binary classifiers as follows. Let $P_v(y = 1|q)$ denote the probability of a positive prediction from vertical $v$'s classifier and let $P_{\emptyset}(y = 1|q)$ denote the probability of a positive prediction from the no relevant vertical classifier. If the most confident positive prediction is from the no relevant vertical classifier, then we predict no relevant vertical. Otherwise, we predict the most confident vertical prediction if it exceeds a predefined threshold $\tau$. In other words, in the absence of a confident vertical relevance prediction, we default to predicting no relevant vertical. Parameter $\tau$ is tuned on validation data.
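The decision rule can be summarized with the sketch below, assuming each classifier exposes a positive-class probability; the toy classifiers and threshold are placeholders standing in for the trained logistic regression models.

def select_vertical(query, vertical_classifiers, no_vertical_classifier, tau=0.5):
    # Return the single predicted relevant vertical for the query, or None.
    p_none = no_vertical_classifier(query)
    vertical_probs = {v: clf(query) for v, clf in vertical_classifiers.items()}
    best_vertical, best_p = max(vertical_probs.items(), key=lambda kv: kv[1])
    # Default to "no relevant vertical" unless some vertical classifier is both
    # more confident than the no-vertical classifier and above the threshold tau.
    if best_p > p_none and best_p >= tau:
        return best_vertical
    return None

# Toy classifiers standing in for trained models.
classifiers = {"news": lambda q: 0.7 if "election" in q else 0.1,
               "images": lambda q: 0.8 if "pictures" in q else 0.1}
no_vertical = lambda q: 0.3
print(select_vertical("obama pictures", classifiers, no_vertical))   # images
print(select_vertical("facebook login", classifiers, no_vertical))   # None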

3.3 Features

We exploit three different types of features. Corpus features derive evidence from sampled vertical documents or from external documents associated heuristically with each

1Available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/

vertical. The idea is to use the predicted relevance of vertical documents to help predict the relevance of the vertical. Query-log features derive evidence from vertical-specific query-traffic. These features exploit the similarity between the query and those issued to the vertical directly by users, which describe the types of information needs satisfied by the vertical. Query features derive evidence from the query string. As opposed to corpus features and query-log features, the value of a query feature is independent of the vertical under consideration. These features exploit, for example, the query topic or the presence of a particular named entity type.

3.3.1 Corpus Features

Corpus features help predict vertical relevance based on the (predicted) relevance of documents associated with the vertical. We investigate two methods for associating documents with a vertical. One option is to sample results directly from the vertical. In federated search, resource sampling is often used for resource representation [16]—a resource is represented by a sample of documents intended to be typical of unseen documents within the resource. Generating corpus features from vertical-sampled documents has one potential limitation. At the core, corpus features derive evidence of vertical relevance from sample relevance. However, most methods for predicting sample relevance are text-based (i.e., they focus on the query-sample text similarity). This is a problem for verticals that retrieve text-impoverished items (e.g., images, video, maps). For this reason, in addition to deriving corpus features from documents sampled directly from the vertical, we also derive corpus features from text-rich documents that are external to the vertical, but associated with the vertical using manual heuristics. In the next two sections, we describe our two methods for producing corpora intended to represent the contents of each vertical. First, we describe our method for sampling vertical documents. Then, we describe our method for associating external documents with a vertical. Each method produces a set of vertical-representative corpora, which are then used to produce our corpus features.

Direct Vertical Sampling

As discussed in Section 2.2, query-based sampling is a method for gathering a representative set of samples from every resource, which are then used to inform resource selection. The idea is to iteratively issue queries to the resource and download documents. In the original query-based sampling algorithm, each query used for sampling is derived from those documents downloaded in previous iterations [16]. Shokouhi et al. [93] showed that sampling using popular query-log queries biases the sample towards documents more likely to be requested by users and can improve resource selection accuracy. We follow a similar approach, with one major difference. While Shokouhi et al. used the same set of web search queries to sample from every collection, we use different sets of queries, each produced from the vertical's own query-log.

We use the following sampling method. First, we issued each of the 1,000 most frequent (non-stopword) query-log unigrams as a query and downloaded the top 100 documents for each. Then, we uniformly sampled at most 25,000 documents from the union of these sets. Sample sets of 25,000 documents are much larger than those used in prior federated search research [24, 76, 91, 92, 94, 95, 96, 104]. However, some of our verticals are also much larger than resources investigated in prior work. For example, the news vertical contained approximately 15.4M documents. Furthermore, one reason prior work used sample sets of about 300 documents was because in an uncooperative environment the system may have a limit on the number of queries it can issue to the resource. In an aggregated web search environment, however, we do not have this restriction. Sampling using vertical query-log terms has two potential advantages. First, it decouples sampling queries from sampled documents. This may be beneficial when samples are text-impoverished. Second, it biases the sample towards documents most likely to be requested by users. This may be beneficial when the vertical is large and most of its content is rarely requested, either because it is outdated, of low quality, or not of general interest. We denote the set of documents sampled directly from vertical v by S_v^vertical.
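The sampling procedure can be outlined as in the sketch below. This is illustrative only; the search interface (search_vertical) and the query-log term list are hypothetical stand-ins for the actual infrastructure used.

    import random
    from collections import Counter

    def sample_vertical(query_log_terms, stopwords, search_vertical,
                        num_terms=1000, top_k=100, max_docs=25000):
        # Count non-stopword query-log unigrams and take the most frequent ones.
        counts = Counter(t for t in query_log_terms if t not in stopwords)
        sampled = set()
        for term, _ in counts.most_common(num_terms):
            # search_vertical(term, k) is assumed to return the top-k document ids.
            sampled.update(search_vertical(term, top_k))
        # Uniformly subsample at most max_docs documents from the union.
        if len(sampled) > max_docs:
            return random.sample(sorted(sampled), max_docs)
        return list(sampled)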

Associating External Documents with Verticals

To associate external documents with each vertical, we adopt an approach similar to that of Shen et al. [88, 89]. That is, we make use of a body of documents that has been annotated with category information: Wikipedia.2 Each Wikipedia article is associated with a set of Wikipedia categories. Using the terminology from Shen et al. [88, 89], our objective is to associate each vertical with a set of intermediate (Wikipedia) categories. By doing this, each document associated with a vertical's set of intermediate (Wikipedia) categories is associated with the vertical. We associated intermediate (Wikipedia) categories with verticals using hand-crafted regular expressions. For example, a collection of documents associated with the autos vertical was harvested from all articles assigned a Wikipedia category containing any of the terms "automobile", "car", and "vehicle". We cannot claim that our regular expressions produce an optimal Wikipedia-article-to-vertical mapping. However, deriving evidence from vertical-related Wikipedia articles may have some advantages. Wikipedia articles are rich in text, have a consistent format, and are usually semantically coherent and on topic. We denote the set of Wikipedia articles mapped to vertical v by S_v^wiki. Above, we described our two methods for constructing vertical-representative corpora. Next, we describe our corpus features. We generated two types of corpus features: ReDDE.top features and retrieval effectiveness features.

ReDDE.top Features

The ReDDE algorithm, described in detail in Section 2.3, is a well-studied resource selection method. Given a query, ReDDE scores each resource based on its predicted number

2http://www.wikipedia.org

of relevant documents. To do this, it first conducts a retrieval from a centralized sample index, which contains a mix of documents sampled from every resource. Given this retrieval, it then uses a rank-based threshold τ to predict a set of relevant samples. Finally, it assumes that each predicted relevant sample represents some number of unseen relevant documents in its originating resource. ReDDE can be viewed as a voting algorithm—each predicted relevant sample contributes a fixed number of votes in favor of its resource. The exact number of votes is based on the resource's scale factor, which measures the difference between the resource size and the resource's sample set size. A potential limitation of ReDDE is the importance of parameter τ. Every sample ranked above τ is considered equally relevant and, therefore, contributes an equal number of votes. For this reason, we used a slight variation of ReDDE, which we refer to as ReDDE.top. Like ReDDE, ReDDE.top first conducts a retrieval from a centralized sample index. Then, it scores vertical v according to,

ReDDE.top_q^∗(v) = (|v| / |S_v^∗|) × ∑_{d ∈ R_100} I(d ∈ S_v^∗) P(q|d),

where P(q|d) is document d's retrieval score given query q, |v| is the number of documents in vertical v, and S_v^∗ denotes either S_v^vertical (the set of documents sampled directly from v) or S_v^wiki (the set of Wikipedia documents mapped to v). The major difference between ReDDE.top and ReDDE is that in ReDDE.top the number of votes that sample d contributes to its resource score is proportional to P(q|d). We used two distinct sets of ReDDE.top features: one set derived from vertical-sampled documents and one set derived from Wikipedia articles (mapped to each vertical heuristically). Each set was normalized separately, such that ∑_{v ∈ V} ReDDE.top_q^vertical(v) = 1 and ∑_{v ∈ V} ReDDE.top_q^wiki(v) = 1.
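A minimal sketch of the ReDDE.top score follows, assuming the centralized sample-index retrieval has already produced the top-100 results as (document id, retrieval score) pairs; the input names are illustrative.

    def redde_top(R100, sample_sets, vertical_sizes):
        """R100: list of (doc_id, P(q|d)); sample_sets: vertical -> set of sampled
        doc_ids (S_v); vertical_sizes: vertical -> |v| (estimated vertical size)."""
        scores = {}
        for v, S_v in sample_sets.items():
            scale = vertical_sizes[v] / len(S_v)           # |v| / |S_v|
            votes = sum(p for d, p in R100 if d in S_v)    # sum of P(q|d) "votes"
            scores[v] = scale * votes
        total = sum(scores.values())
        # Normalize so the feature values sum to one across verticals.
        return {v: (s / total if total > 0 else 0.0) for v, s in scores.items()}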

Retrieval Effectiveness Features

ReDDE.top has one potential disadvantage: it uses a single retrieval to score samples from different verticals. In this environment, however, verticals focus on documents with varying degrees of text-richness (e.g., news, images, video). Therefore, retrieval scores may not be directly comparable across vertical samples. For example, the retrieval algorithm may be biased towards samples from text-rich verticals. To minimize this type of bias, it is necessary to estimate the relevance of sampled content without directly comparing retrieval scores across collections. Our solution is to index collection samples separately and then to use a retrieval effectiveness measure as a proxy for the number of relevant samples in the sample set. Retrieval effectiveness prediction is the task of automatically detecting when a retrieval returns relevant content, for example, based on observable properties of the top-ranked documents. Our assumption is that a high retrieval effectiveness score means that the collection (in our case, the set of vertical-representative documents) has relevant content.

We used a retrieval effectiveness measure known as Clarity [25], which uses the Kullback-Leibler divergence to compare the language of the top ranked documents with that of the collection,

Clarity_q(C) = ∑_w P(w|q) log ( P(w|q) / P(w|C) ),

where P(w|q) and P(w|C) are the query and collection language models, respectively. The query language model was estimated using the top 100 documents, R_100, according to,

P(w|q) = (1/Z) ∑_{d ∈ R_100} P(w|d) P(q|d),

where P(q|d) is the query likelihood score given document d, and Z = ∑_{d ∈ R_100} P(q|d). Clarity's assumption is that in an effective retrieval the top ranked documents will use language that is distinguished from topic-general language derived from the entire index. As with ReDDE.top features, we generate two sets of Clarity features: one set derived from vertical-sampled documents and one set derived from Wikipedia articles associated with a vertical. Each set was normalized separately, such that ∑_{v ∈ V} Clarity_q^vertical(v) = 1 and ∑_{v ∈ V} Clarity_q^wiki(v) = 1.
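The Clarity computation can be sketched as below, assuming unigram language models are available as word-probability dictionaries (hypothetical inputs); the query model is built from the top-100 documents weighted by their query-likelihood scores.

    import math

    def clarity(R100, doc_models, collection_model):
        """R100: list of (doc_id, P(q|d)); doc_models: doc_id -> {word: P(w|d)};
        collection_model: {word: P(w|C)}."""
        Z = sum(p for _, p in R100) or 1.0
        p_w_q = {}
        for d, p_q_d in R100:                      # P(w|q) = (1/Z) sum_d P(w|d) P(q|d)
            for w, p_w_d in doc_models[d].items():
                p_w_q[w] = p_w_q.get(w, 0.0) + p_w_d * p_q_d / Z
        # KL divergence between the query model and the collection model.
        return sum(p * math.log(p / collection_model[w])
                   for w, p in p_w_q.items() if w in collection_model and p > 0)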

3.3.2 Query-Log Features

Query-log features derive evidence from vertical-specific query traffic. That is, they consider the similarity between the query and those previously issued directly to the vertical by users. The assumption is that queries issued directly to the vertical describe the types of information needs the vertical satisfies.

Query Likelihood

Query-likelihood features consider the query-generation probability given a language model derived from the vertical's query-log. Let θ_v^qlog denote a language model derived from vertical v's query-log. We generate one query-likelihood feature per vertical, as follows,

QL_q(v) = (1/Z) P(q|θ_v^qlog),

where P(q|θ_v^qlog) = ∏_{w ∈ q} P(w|θ_v^qlog) and Z = ∑_{v ∈ V} P(q|θ_v^qlog). Vertical query-log language models were constructed using a year's worth of vertical query-traffic. Additionally, to inform classification into the no relevant vertical class, we collected a month's worth of queries issued to the Web search engine. We assume that most Web search queries do not seek vertical content. We used a month's worth of Web queries (rather than a year's worth) because Web search has more query traffic than vertical search. The CMU-Cambridge Language Modeling Toolkit was used to build a

unigram language model from each query-log.3 Each language model's vocabulary was defined by its most frequent 20,000 unigrams and we used Witten-Bell smoothing [110]. Query likelihood features were evaluated under two conditions: allowing and disallowing zero probabilities from out-of-vocabulary (OOV) terms. In the first condition, a single OOV query term results in a zero probability from the vertical. In the second condition, P(OOV|θ_v^qlog) was estimated proportional to the frequency of terms not in vertical v's top 20,000 query-log terms.
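The query-likelihood feature under the two conditions can be sketched as follows; the smoothed language model and the OOV estimate are assumed to come from the toolkit output, and the per-vertical value is then normalized across verticals as in the equation above.

    def query_log_likelihood(query_terms, lm, p_oov=None):
        """lm: {word: P(w | theta_v^qlog)} over the vertical's 20,000-term vocabulary;
        p_oov: out-of-vocabulary estimate, or None to allow zero probabilities."""
        prob = 1.0
        for w in query_terms:
            if w in lm:
                prob *= lm[w]
            elif p_oov is None:
                return 0.0          # condition 1: a single OOV term zeroes the score
            else:
                prob *= p_oov       # condition 2: back off to the OOV estimate
        return prob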

Soft.ReDDE Features

Soft.ReDDE features are related to ReDDE.top features (the set of ReDDE.top features that uses external documents mapped heuristically to each vertical). Like ReDDE.top features, Soft.ReDDE features also use external (Wikipedia) documents. However, different from ReDDE.top features, which use a binary Wikipedia-article-to-vertical assignment, Soft.ReDDE uses a soft assignment. In other words, every Wikipedia article has some degree of membership to every vertical. The degree of membership is based on the similarity between the Wikipedia article's language model and the vertical's language model, derived from its query-traffic. Compared to ReDDE.top features, Soft.ReDDE features have two potential benefits. First, every Wikipedia article contributes to a vertical's score to a degree that depends on its correlation with the vertical. Second, Soft.ReDDE features do not use a manual mapping between an article and a vertical, which may misrepresent the vertical.

Soft.ReDDE features were generated as follows. Let θ_d denote the language model of Wikipedia article d and θ_v^qlog denote vertical v's query-log language model. We computed the degree of membership φ between article d and vertical v using the Bhattacharyya correlation [7],

B(θ_d, θ_v^qlog) = ∑_w √( P(w|θ_d) P(w|θ_v^qlog) ),

normalizing across verticals,

φ(d, v) = B(θ_d, θ_v^qlog) / ∑_{v′ ∈ V} B(θ_d, θ_{v′}^qlog).

Given query q, we generate one Soft.ReDDE feature per vertical using a retrieval from an index of the entire English Wikipedia. The Soft.ReDDE feature value for vertical v is given by,

Soft.ReDDE_q(v) = (1/Z) ∑_{d ∈ R_100} φ(d, v) P(q|d),

where Z = ∑_{d ∈ R_100} P(q|d). We normalize Soft.ReDDE features across verticals such that ∑_{v ∈ V} Soft.ReDDE_q(v) = 1.

3http://svr-www.eng.cam.ac.uk/ prc14/toolkit.
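A sketch of the Soft.ReDDE computation follows, with language models represented as word-probability dictionaries (illustrative inputs); membership degrees are the normalized Bhattacharyya correlations, combined with each article's retrieval score.

    import math

    def bhattacharyya(lm_a, lm_b):
        return sum(math.sqrt(lm_a[w] * lm_b[w]) for w in lm_a if w in lm_b)

    def soft_redde(R100, article_models, vertical_qlog_models):
        """R100: list of (article_id, P(q|d)) from a Wikipedia retrieval."""
        Z = sum(p for _, p in R100) or 1.0
        scores = {v: 0.0 for v in vertical_qlog_models}
        for d, p_q_d in R100:
            corr = {v: bhattacharyya(article_models[d], lm)
                    for v, lm in vertical_qlog_models.items()}
            norm = sum(corr.values()) or 1.0
            for v in scores:
                scores[v] += (corr[v] / norm) * p_q_d / Z    # phi(d, v) * P(q|d) / Z
        total = sum(scores.values()) or 1.0
        return {v: s / total for v, s in scores.items()}     # sums to 1 across verticals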

3.3.3 Query Features

Query features are generated from the query and are independent of the vertical under consideration. We generated three types of query features. Rule-based vertical intent features exploit a correlation between certain keywords and the relevance of a particular vertical (e.g., "obama inauguration pics"). Geographic features correspond to various geographic entity types possibly appearing in the query. Category features are motivated by the fact that some verticals (e.g., health, travel) are topically focused. Thus, knowing the topic of the query may help in vertical selection.

Rule-based Intent Features

Rule-based intent features are based on a set of 45 classes aimed to characterize query intent (e.g., local phone, product, person, weather, movies, driving direction, music artist). Some of these 45 features map conceptually one-to-one to a target vertical (e.g., movies → movies, autos → autos). Others map many-to-one (e.g., {sports players, sports} → sports, {product review, product} → shopping). Others do not map directly to a vertical, but may provide (positive or negative) evidence in favor of a vertical (e.g., patent, events, weather). Each of these boolean features is associated with a set of manual rules based on regular expressions and dictionary lookups. A query may be associated with multiple classes, each triggered if at least one rule in its inventory matches the query.

Geographic Features

Geographic features were produced using a rule-based geographic annotation tool that outputs a probability vector for a set of geographic entities appearing in the query. We focused on the following 17 geographic features: airport, colloquial (i.e., location information associated with a named entity, such as "North Shore Bank"), continent, country, county, estate, historical county, historical state, historical town, island, land feature, point of interest (e.g., Eiffel Tower), sports team, suburb, supername (i.e., a region name, such as Middle East), town, and zip code. We used the probability of each entity being present in the query as a feature. Geographic features are intended to inform classification into verticals whose queries often mention a location name, such as local, travel, and maps.

Query Category Features

As previously mentioned, some verticals, such as health, autos, and travel, are topically focused. For such verticals, a potentially useful source of evidence is the topic of the query (e.g., if the query is about fitness, then the health vertical is relevant). One way to classify queries into topics is to directly apply a trained topic classifier to the query string. However, queries are terse. Therefore, instead of classifying the query directly, we predict query categories based on the query's association with topically-classified documents. This approach is common in prior query classification work [88, 89]. We used a large document collection where each document is classified into a set

of categories. Documents in this collection were classified using a proprietary maximum entropy classifier from Yahoo!. Given a retrieval from this collection, the query is classified based on the categories assigned to the top 100 documents. We divide categories into two disjoint sets: general categories (e.g., recreation, science, health) and specific categories (e.g., recreation/sports, recreation/travel, health/nutrition). Every document in the collection is associated with a vector of confidence values P(y_j|d). We set the value of category feature y_j to be the sum of its confidence values in the top 100 documents,

CAT_q(y_j) = ∑_{d ∈ R_100} P(y_j|d).

We focused on 14 general category features and 42 specific category features, for a total of 56 query category features.
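As a small illustration, the category features reduce to summing per-document classifier confidences over the top-100 results; the per-document category vectors below stand in for the proprietary classifier's output.

    from collections import defaultdict

    def category_features(top_doc_ids, doc_categories):
        """top_doc_ids: ids of the top-100 documents;
        doc_categories: doc_id -> {y_j: P(y_j|d)}."""
        cat = defaultdict(float)
        for d in top_doc_ids:
            for y_j, p in doc_categories.get(d, {}).items():
                cat[y_j] += p
        return dict(cat)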

3.3.4 Summary of Features

In summary, the total number of features is as follows. Let n denote the number of candidate verticals. In the experiments below, n = 18. We focused on two types of vertical corpus features: ReDDE.top and Clarity. Each contributes 2n features: n using vertical samples and n using Wikipedia articles that were heuristically mapped to each vertical. We focused on two types of vertical query-log features: Soft.ReDDE and the query-likelihood given the vertical's query-log language model. Each contributes n features (one per vertical). Finally, we focused on three types of query features: rule-based intent, geographic, and category features. Because they are vertical-independent, these are not a function of n. We focused on 45 rule-based intent features, 17 geographic features, and 56 query-category features. In the ideal case, the total number of features would be 226 (i.e., 6n + 45 + 17 + 56 = 226). For practical reasons, however, we used fewer features than in the ideal case. The autos, maps, sports, and tv verticals did not have query-logs available. Query-logs were used directly and indirectly to derive different types of features. Query likelihood features use vertical query-logs directly. ReDDE.top features (using vertical-sampled documents) rely on vertical query-logs indirectly because vertical query-log queries were used for sampling. Soft.ReDDE features use vertical query-logs to estimate the vertical language model. These feature types did not have a feature associated with the autos, maps, sports, and tv verticals. Also, for practical reasons, some verticals were not associated with Wikipedia articles. The directory vertical and the "no relevant vertical" class are too broad to be characterized by a set of Wikipedia categories, while the maps vertical intersects semantically with local and travel. The actual total number of features was 190.

3.4 Methods and Materials

Our goal is to test the effectiveness of our supervised classification-based framework on the task of single vertical selection—predicting a single relevant vertical if one exists, or predicting that no vertical is relevant.

vertical      retrievable items
autos         car reviews, product descriptions
directory     web page directory nodes
finance       financial data and corporate information
games         hosted online games
health        health-related articles
images        online images
jobs          job listings
local         business listings
maps          maps and directions
movies        movie show times
music         musician profiles
news          news articles
reference     encyclopedic entries
shopping      product reviews and listings
sports        sports articles, scores, and statistics
travel        travel and accommodation reviews and listings
tv            television listings
video         online videos

Table 3.1: Vertical descriptions.

In this section, we describe our experimental methodology: our verticals, evaluation data, evaluation metric, and baseline methods used for comparison.

3.4.1 Verticals

We focused on 18 verticals, described in Table 3.1. Each vertical corresponds to a Yahoo! property.4 Some of these verticals, for example, news, images, and local, are currently integrated into Yahoo! Web search results. Others, for example, health, tv, and games, exist only as Yahoo! properties. Users can access the vertical directly, and in most cases, search within the vertical. However, content from these verticals is not currently surfaced in response to Web search queries. These verticals differ across several meaningful dimensions: number of documents, level of text-richness (e.g., news, images), topic (e.g., health, finance, sports), and level of vertical query-traffic.

3.4.2 Queries

Our evaluation data consisted of 25,195 unique, randomly sampled queries issued to the Yahoo! Web search engine. Given a query, human editors were instructed to assign verticals to one of four relevance grades ('most relevant', 'highly relevant', 'relevant',

4As previously noted, these experiments were conducted during an internship at Yahoo! Labs Montreal. We are thankful to Yahoo! for providing these resources.

'not relevant') based on their best guess of the user's vertical intent. For the purpose of this study, we collapsed 'most relevant', 'highly relevant', and 'relevant' into one class: 'relevant'. The vertical distribution across queries is described in Table 3.2. As previously mentioned, only 30% of all queries were assigned more than one true relevant vertical. The remaining 70% were assigned a single relevant vertical or none. 26% were assigned no relevant vertical. Of those assigned multiple true relevant verticals (30%), many were ambiguous to an assessor attempting to guess the user's intent. For example, the query "hairspray" was assigned verticals movies, video, and shopping ("hairspray" is a movie, a Broadway play, and a hair product).

maps        1.1%
jobs        1.5%
movies      2.3%
games       2.6%
finance     2.6%
tv          2.7%
autos       3.0%
video       3.1%
sports      3.3%
health      4.3%
directory   4.4%
music       4.6%
news        5.1%
images      6.0%
travel      8.7%
reference   15.4%
local       19.1%
shopping    20.3%
none        26.3%

Table 3.2: Percentage of queries for which the vertical (or no vertical) was labeled relevant. Percentages do not sum to 100% because some queries were assigned multiple relevant verticals.

3.4.3 Evaluation Metric

We evaluated single vertical selection in terms of single vertical precision, given by,

P = (1/|Q|) [ ∑_{q ∈ Q | V_q ≠ ∅} I(ṽ_q ∈ V_q) + ∑_{q ∈ Q | V_q = ∅} I(ṽ_q = ∅) ],   (3.1)

where I is the indicator function. Single vertical precision is the percentage of queries for which the model makes a correct prediction. The first summation determines the number of queries for which a single vertical is correctly predicted. The second summation determines the number of queries for which no relevant vertical is correctly predicted.
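The metric in Equation 3.1 can be computed as in the following sketch, where each query contributes one correct or incorrect prediction.

    def single_vertical_precision(predictions, relevant_sets):
        """predictions: query -> predicted vertical or None (no relevant vertical);
        relevant_sets: query -> set of true relevant verticals (possibly empty)."""
        correct = 0
        for q, v_pred in predictions.items():
            V_q = relevant_sets[q]
            if V_q:                              # first summation: V_q is non-empty
                correct += int(v_pred in V_q)
            else:                                # second summation: V_q is empty
                correct += int(v_pred is None)
        return correct / len(predictions)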

41 In addition to single vertical precision, we also report coverage (% cov), defined as the percentage of queries for which a (single) vertical is predicted (correctly or incorrectly). Statistical significance was tested using a 2-tailed paired t-test (paired on queries).

3.4.4 Implementation Details

All features were scaled to zero minimum and unit maximum. Features associated one-to-one with a vertical (Clarity, ReDDE.top, query-log likelihood, and Soft.ReDDE) were normalized across verticals before scaling. Supervised training/testing was done via 10-fold cross-validation. Threshold parameter τ was tuned for each training fold on a held-out set of 500 queries. Given a query, the classification framework predicts a single vertical if the most confident vertical prediction exceeds parameter τ and the confidence from the no relevant vertical classifier.

3.4.5 Single-evidence Baselines

To evaluate the effectiveness of our classification framework, we compare its performance with eight different baselines. We refer to these as single-evidence baselines because, like most resource selection methods, each focuses on a single type of evidence, for example, the predicted relevance of sampled vertical content or the similarity between the query and the vertical's query-traffic. Our eight single-evidence baselines were the following: all four combinations of Clarity and ReDDE.top using vertical-sampled and Wikipedia-sampled documents, the query likelihood given the vertical's query-log language model (both allowing and disallowing zero probabilities), Soft.ReDDE, and an approach that always predicts "no relevant vertical". One of these baselines, ReDDE.top, is a slight variation of ReDDE, which has produced strong results in various federated search evaluations [90, 94, 104]. With the exception of the no relevant vertical predictor, given a query, all these baselines produce a score for each vertical. The higher the score, the more relevant the vertical. Each was adapted for single-vertical selection as follows. First, scores were mass-normalized across verticals. Then, we predict the single vertical with the highest score if its score exceeds threshold τ. Threshold parameter τ was tuned on validation data.

3.5 Experimental Results

In this section, we compare the performance of our classification framework with our eight single-evidence baselines. Results, in terms of single vertical precision P, are presented in Table 3.3. Several results are worth noting. First, the no.rel approach obtained P = 0.263 because 26.3% of queries had no true relevant vertical. Clarity using both vertical- and Wikipedia-sampled documents performed significantly worse than no.rel. This suggests that Clarity scores for a given query may not be directly comparable across collections with different corpus statistics. In prior work,

                     P        % cov
clarity.vertical     0.254    3.4%
clarity.wiki         0.256†   2.7%
no.rel               0.263‡   0.0%
redde.top.wiki       0.293‡   54.4%
QL                   0.312‡   61.9%
soft.redde           0.324‡   43.6%
redde.top.vertical   0.336‡   45.7%
QL (zero probs)      0.368‡   51.0%
classification       0.583‡   64.3%

Table 3.3: Single vertical precision (P). Approaches are listed in ascending order of P. A significant improvement over all worse-performing approaches is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level.

Clarity was used to compare retrievals from different queries on the same collection, but not retrievals from the same query on different collections. Further experiments are needed to determine whether Clarity can be adapted for the purpose of vertical selection, or resource selection in general.

ReDDE.top with vertical samples outperformed ReDDE.top with Wikipedia samples, in spite of more verticals having a Wikipedia-sampled collection than a vertical-sampled collection. We examined the types of classification errors made by both methods. Both approaches performed comparably with respect to the "no relevant vertical" class. However, redde.top.wiki more often predicted a wrong vertical, that is, a vertical not in the true relevant set V_q. Precision on queries for which a vertical was predicted was 0.382 for redde.top.vertical and 0.284 for redde.top.wiki. This suggests that our heuristic mapping of Wikipedia categories to verticals may have misrepresented one or more verticals.

The query likelihood given the vertical's query-log language model was the best single-evidence predictor. This method performed better when allowing than when disallowing zero probabilities. This may have been due to the non-uniformity of P(OOV) estimates across vertical language models. Each vertical's P(OOV) estimate was based on the frequency of terms not in its top 20,000, which is expected to be greater for verticals with a more open vocabulary. A vertical's P(OOV) estimate affects the probability estimates of within-vocabulary terms through discounting. Different P(OOV) estimates across verticals may have made the query likelihood given by different vertical language models less comparable.

Finally, the classification approach outperformed all single-evidence baselines by a large margin—a 58% improvement over the best single-evidence predictor, QL. This confirms our expectation that vertical selection can be cast as a machine learning problem.

3.6 Discussion

3.6.1 Feature Ablation Study

Compared to our single-evidence baselines, the classification approach has two main advantages: access to training data and the ability to integrate multiple types of features. In this section, we present a feature-type ablation study with a dual purpose: to explore the contribution of different feature types and to confirm that the classification approach benefits from integrating multiple sources of evidence. In other words, we wish to confirm that the classifier is not just benefiting from a single source of evidence and lots of training data. Additionally, we conducted a feature ablation study to compare the contribution of corpus features derived from vertical-sampled documents and corpus features derived from Wikipedia articles associated with a vertical. Table 3.4 shows the percent change in precision associated with each feature type. Keep in mind that features were not evaluated in isolation. Therefore, a non-significant performance drop in P does not necessarily mean that the feature captures no useful evidence, as features may be correlated.

feature type analysis

feature variation          P        % diff    % cov
all features               0.583              64.30%
no query-likelihood        0.583    0.03%     64.36%
no rule-based intent       0.583    -0.03%    64.30%
no clarity                 0.582    -0.10%    63.68%
no geographical            0.577#   -1.01%    65.30%
no (general) category      0.572#   -1.84%    63.91%
no redde.top               0.568#   -2.60%    60.27%
no soft.redde              0.567#   -2.67%    62.47%
no (specific) category     0.552#   -5.33%    64.19%

vertical/wikipedia-based feature type analysis

feature variation          P        % diff    % cov
no vertical                0.577#   -3.1%     62.06%
no wikipedia               0.574#   -3.5%     62.67%

Table 3.4: Feature type ablation study. A # denotes a significant drop in performance (p < 0.005) relative to our classifier using all features.

In terms of feature types, omitting query-likelihood, rule-based intent, and Clarity features did not produce a significant drop in performance. It is possible that query-likelihood features, the best single-evidence predictor, did not contribute significantly because they are correlated with Soft.ReDDE features, which did contribute significantly and, like query-likelihood features, also derive evidence from the vertical query-log.

Recall that rule-based features are binary and each feature attempts to match a set of regular expressions and dictionary terms to the query. It turned out that there were only 4,367 (18%) queries with at least one non-zero rule-based intent feature. This may have been why they did not contribute significantly to performance. As previously mentioned, Clarity scores for the same query and different collections may not be directly comparable. The largest contribution to performance came from query-category features (i.e., the specific categories), which characterize the topic of the query. Interestingly, query-category features are not derived from the vertical (i.e., they are not derived from vertical samples or from the vertical query-log). Our classifier learns to associate these features with each vertical from training data. The contribution of the specific query-category features was significantly greater than that of general query-category features. This was because our general categories were too coarse to discriminate between our verticals. For example, the general category recreation conflates recreation/sports, recreation/auto, and recreation/travel, which map conceptually to different verticals. The second and third most helpful features were Soft.ReDDE and ReDDE.top features, respectively. This analysis confirms that different sources of evidence contribute significantly to performance. To evaluate the usefulness of evidence derived directly from the vertical, we omitted ReDDE.top and Clarity features using vertical-sampled documents. Likewise, to evaluate the usefulness of evidence derived from surrogate corpora (in our case, Wikipedia), we omitted ReDDE.top and Clarity features using Wikipedia-sampled documents. Removing either set of features produced a significant drop in P. Vertical- and Wikipedia-sampled documents were sampled using different techniques and have a different sample set size distribution. Thus, we cannot (and did not intend to) directly compare one against the other. However, this result confirms that external documents (associated with a vertical heuristically) can provide evidence complementary to evidence derived directly from the vertical.

3.6.2 Per Vertical Performance

In this section, we look at per-vertical performance. The objective is to determine which verticals are easier to predict than others. Table 3.5 shows precision for each vertical (using all features). Here, given vertical v, per-vertical precision is defined by,

P_v = (1/|Q_v|) ∑_{q ∈ Q} I( (ṽ_q = v) ∧ (q ∈ Q_v) ),

where Q_v is the subset of queries in Q with v ∈ V_q. As previously noted, some verticals lacked query-logs (†) and/or a Wikipedia-sampled surrogate corpus (∗). Verticals autos, sports, and tv performed well in spite of lacking features derived from query-logs. Verticals video, news, and reference performed poorly in spite of having all resources. Therefore, the difference in performance across verticals cannot be attributed only to missing features.

vertical        P
travel          0.842
health          0.788
music           0.772
games           0.771
autos†          0.730
sports†         0.726
tv†             0.716
movies          0.688
finance         0.655
local           0.619
jobs            0.570
shopping        0.563
images          0.483
video           0.459
news            0.456
reference       0.348
maps†,∗         0.000
directory∗      0.000

Table 3.5: Per-vertical precision (P_v). Verticals without a query-log are marked with †. Verticals without a Wikipedia-sampled surrogate corpus are marked with ∗.

The system performed best on verticals that focus on a coherent topic with identifiable vocabulary (i.e., travel, health, games, music). The vocabulary associated with these verticals may have been the least confusable with that of other verticals. Performance for these topically-coherent verticals was higher than for shopping, reference, and no relevant vertical, which had more positive examples for training. This result is consistent with the fact that query-category features, which characterize the topic of the query, were found to be the most predictive, as previously shown in Table 3.4. Performance was the worst on the verticals images, video, news, reference, maps, and directory, possibly for several different reasons. The maps vertical had the fewest positive instances for training, was feature-impoverished, and was probably confusable with local and travel. Verticals images and video focus on a type of media rather than a specific genre. Queries related to reference and directory characterize broad encyclopedic information needs. The news vertical tends to be highly dynamic and may require features that characterize whether the query occurs as part of a burst in content demand.

3.7 Related Work in Vertical Selection

We cast vertical selection as a supervised machine learning problem and integrate diverse types of evidence as input features. Other researchers proposed similar solutions to similar problems. Some of this work precedes the work presented in this chapter (e.g., news-vertical prediction [28, 63]). Other work was done after (e.g., vertical selection in the presence of user feedback [29]).

The SIGIR 2008 Workshop on Aggregated Search [74] was one of the first forums to discuss aggregated web search as a new and interesting research area. The work presented in the workshop focused on a wide range of problems, including search results summarization and visualization, especially in environments with rich meta-data. Of those papers that focused on aggregating content from different search engines [75, 107], none addressed the task of vertical selection, but rather assumed that every vertical is presented for every query (in the same position). Both König et al. [63] and Diaz [28] focused on approaches to vertical selection for the news vertical, a particularly challenging vertical due to its dynamic nature—a query's newsworthiness is likely to change with time. Both formulate the task as predicting whether a user will click on news results and both assume a fixed presentation strategy: to present news results above the Web results or not at all. König et al. [63] made use of query features (e.g., whether the query contains a URL) and corpus features (e.g., the hit counts from various news- and non-news-related corpora). Additionally, Diaz [28] made use of query-traffic features (e.g., the number of times the query was recently issued to the Web and news search engines). The probability that a user will click on news results can be estimated in two ways: (1) using a model that considers various contextual features (e.g., corpus and query-log statistics) and (2) using user-interaction data from previous impressions of the query (e.g., observed clicks and skips from previous system decisions). In addition to an off-line model, Diaz [28] introduced an approach that combines both estimation methods. The model automatically relies more heavily on user-interaction data as it becomes available for the query. Moreover, it can proactively display news results more frequently for queries with little user-interaction data. In subsequent work, Diaz extended the multiple-vertical framework presented in this chapter to consider user-interaction data [29]. Song et al. [99] focused on the task of searchable website recommendation, where the goal is to display (alongside the Web search results) a short ranking of third-party searchable websites (e.g., ebay, flickr, amazon, imdb). Searchable websites can be viewed as verticals. Song et al. [99] combined various features derived from user-interaction data, for example, the similarity between the query and those associated with clicks on searchable website results and the average dwell time on searches from a particular website. Interestingly, searchable website recommendation occurs in a type of uncooperative environment; third-party search engines do not typically divulge user-interaction data. Song et al. [99] collected user-interaction data using an opt-in plug-in distributed with a major commercial browser. This work shows that a feature-integration approach capable of combining various types of user-interaction features can be useful in an uncooperative setting, where interactions can be logged at the client rather than the server side. Similar to our approach, the work discussed so far enriches the query representation using various types of contextual features (e.g., different user interactions and different statistics derived from collection content and query-traffic). Li et al. [67] take the opposite approach, using only the terms in the query as features.
Their method, however, is to vastly expand the training set by propagating labels to unlabeled queries using a

click-graph. The assumption is that queries with a similar click pattern are relevant to the same verticals. Evaluation was conducted separately for two verticals: jobs and shopping. Deriving evidence exclusively from the query-string has the advantage that feature generation is fast. The main issue, however, is that performance may degrade as the environment changes in terms of content and user interest. One advantage of focusing on evidence beyond the query-string is that a model may be able to maintain performance as long as the resources used for feature generation (e.g., query-logs and corpora) are updated frequently to reflect changes in the environment. The field of sponsored search is concerned with the placement of advertisements (ads) in response to a Web search query and typically assumes a fixed location for presenting the set of most relevant ads (e.g., above the Web results). As advocated in Broder et al. [13], part of the task is to decide when to completely suppress ads. The assumption is that displaying marginally relevant or non-relevant ads "trains" users to ignore ads in the future. The task of predicting whether to display or suppress the set of most relevant ads can be viewed as a type of vertical selection with one vertical: ads. Similar to our approach, Broder et al. [13] cast the problem as a supervised classification task. Different types of features were investigated and, consistent with our results, no single type of feature was exclusively responsible for performance. As might be expected, many of the features that were effective consider the text-similarity between the query and the set of candidate ads. More interestingly, however, other features that were effective consider the text-similarity between the set of ads themselves. This shows that "vertical" relevance can be a function of the coherence between the (top) vertical results. Our Clarity feature, which in our case was ineffective, attempts to harness this type of evidence, though we applied it to sampled and surrogate content rather than the actual vertical results. This suggests that directly comparing Clarity scores from different verticals might have been the problem. Broder et al.'s classification-based, evidence-integration approach outperformed a single-evidence baseline that predicts whether to display ads based on their retrieval score. In more recent work, Guo et al. [44, 45] considered the task of predicting whether the user will click on ads during the remainder of the search session. This work is the first to consider features from previous retrievals within the session, which include low-level user interactions such as mouse movements, temporal delays, and scrolls. Within-session low-level interactions were found to improve performance. We do not make use of session-level features in this dissertation. Investigating their effectiveness for vertical selection and vertical results presentation is an important direction for future work.

3.8 Summary

We evaluated a classification-based approach to vertical selection. The proposed classification approach outperformed all single-evidence baselines, which included adaptations of existing methods for resource selection [94] and retrieval effectiveness prediction [25]. More importantly, we demonstrated the importance of feature-integration for the task

of vertical selection. A set of feature ablation studies confirmed that removing different types of features results in a significant drop in performance. Therefore, one important advantage of the classification-based approach is its ability to easily integrate evidence as input features. In other words, the classifier did not merely capitalize on a single source of evidence and lots of training data. In terms of corpus evidence, features derived from vertical samples as well as external surrogate corpora contributed significantly to performance. In terms of query-traffic evidence, access to vertical query-logs helped. Not only was the query-likelihood given the vertical's query-log language model the best single-evidence predictor, but query-logs were used (directly and indirectly) in generating various other features. For Soft.ReDDE, vertical query-logs were used to associate external documents to verticals. For ReDDE.top, they were used to sample from the vertical. Both of these features significantly improved performance. In terms of cross-vertical results, performance was the best for topically coherent verticals (e.g., travel, health, games, music, autos, etc.). Good performance across these verticals was at least partly attributed to our query-category features, which characterize the topical distribution of the query. Interestingly, this is precisely the type of evidence that requires some form of human supervision in order to learn a predictive relation between the query category and the relevance of a particular vertical. In our case, our models learned to associate categories with verticals from training queries. Performance was poor on verticals that are not topically focused (e.g., reference), are text-impoverished (e.g., images), or are dynamic (e.g., news).

Chapter 4

Domain Adaptation for Vertical Selection

While a supervised approach to vertical selection outperforms state-of-the-art resource selection methods, one of its main drawbacks is that it requires extensive training data (e.g., a set of queries with relevance labels with respect to a vertical). Human annotation is resource intensive in terms of time and money. An annotation effort may be sensible as a one-time investment. However, in practice, the aggregated search environment is dynamic. Verticals can be added to or removed from the set of candidate verticals. Content can be added to or removed from an existing vertical. The interests of the user population may drift, effectively changing the vertical content most likely to be requested by users. It may not be feasible to annotate a fresh new set of queries every time the environment undergoes a significant change. Advancing a machine learning approach to vertical selection, therefore, requires methods that maximize the system's return on editorial investment. As the environment changes, how can we make maximal use of annotations we already have in order to maintain performance? In this chapter, we focus on the case where a new vertical is introduced to the set of candidate verticals (the first type of change suggested above). How can we use training data collected for a set of existing verticals to learn a predictive model for a new vertical (associated with no training data)? Although we focus on the introduction of a new vertical, the methods described in this chapter may also be effective in learning new models to account for changes in vertical content and/or user interest. We explore the following scenario. We assume a set of existing verticals (referred to as the source verticals) for which we have collected training data (a set of queries with vertical-relevance judgements). Then, we assume that we are given a new vertical (referred to as the target vertical), associated with no training data. Our objective is to learn a predictive model for the target vertical using only source-vertical training data. Our approach to the problem focuses on two model properties: portability and adaptability. A portable model is one that can be trained once and then effectively applied to any vertical, including the new vertical (associated with no training data). An adaptable model is one that can be tailored to a specific vertical, including the new vertical. If a model is adaptable, its parameters can be automatically adjusted to suit the new vertical at no additional editorial cost. We present models that exhibit these two properties and evaluate their performance across a set of 11 target verticals. Our end goal is to show

1This work was conducted during a second internship at Yahoo! Montreal and was published in SIGIR 2010 with co-authors Jean-François Paiement and Fernando Diaz [4].

that a machine learning approach to vertical selection can capitalize on human labels collected for one set of existing verticals to learn a model for a new one.

4.1 Formal Task Definition

In this set of experiments, we consider the relevance of vertical v with respect to query q independent of any other vertical. Let y_v(q) denote a function that outputs the true relevance of v with respect to q. In the general vertical selection setting, the goal is to learn a function f_v that approximates y_v. Here, we focus on the following scenario. Suppose we have a set of existing source verticals S with a set of labeled queries Q_S = ∪_{s ∈ S} Q_s, where Q_s denotes the set of queries with relevance labels with respect to source vertical s. Then, suppose we are given a new target vertical t with no labeled data. Our objective is to learn a function f_t that approximates y_t using only source-vertical training data Q_S. The quality of an approximation will be measured using some metric that compares the predicted and true query labels. We use notation,

μ(f_t, y_t, Q_t)

to refer to the evaluation of function f_t on query set Q_t, which has relevance labels with respect to t. This metric μ can be classification-based (e.g., accuracy) or rank-based (e.g., average precision).

4.2 Related Work

The task of using existing-vertical training data to learn a predictive model for a new vertical can be viewed as a type of domain adaptation. In machine learning, domain adaptation is the task of using training data from one or more source domains to learn a predictive model for a target domain. The domain adaptation problem arises when the source and target data originate from different distributions. Suppose, for example, we want to learn a classifier to distinguish between positive and negative restaurant reviews (i.e., the target domain), but only have training data in the movie domain (i.e., the source domain). A model trained on movies may not generalize to restaurants, for various reasons. It may be, for example, that the prior class distribution (i.e., positive and negative) is different across domains. It may also be that the predictive relationship between features and the target class is different. The term "unpredictable" may signal a positive review in the movie domain, but a negative review in the restaurant domain. Domain adaptation algorithms solve this problem in various ways and have been applied to classification tasks such as sentiment analysis [10], named-entity tagging [43], and document ranking [39]. Different approaches to domain adaptation assume different amounts of training data in the target domain. In our case, we assume none. However, while we focus on this

extreme case, one could imagine collecting a small amount of (possibly noisy) target-vertical training data using implicit feedback or active learning. Thus, we review domain adaptation approaches that consider various amounts of target-domain training data. As mentioned above, a potential challenge in domain adaptation is that the class distribution may be different across domains. An easy solution is to re-weight (or over/under-sample) source-domain instances such that the class distribution resembles the target-domain class distribution [69]. This, of course, assumes that the target-domain class distribution can be estimated reliably. Also, it assumes that the predictive relationship between features and the target class is consistent across domains. It is only the class distribution that is different. Domain adaptation is more difficult when features have an inconsistent predictive relationship with the target class. One line of work performs instance-weighting (or instance filtering) to down-weight (or remove) "misleading" source-domain training instances. Jiang and Zhai [55] do this by training a model on whatever little target-domain data is available and removing those source-domain instances that are misclassified by this model. Gao et al. [39] propose a similar solution that does not require any target-domain training data. Their approach is to train a classifier to predict whether an instance originates from the source or target domain. Then, they keep only those source-domain training instances that are misclassified (or nearly misclassified) as being from the target domain. The idea is to favor source-domain training instances that are confusable with the target-domain instances. An alternative to instance-weighting is feature weighting (or feature selection). The objective is to focus only on those features that are similarly predictive across the source and target domains. Satpal and Sarawagi [83] cast this as a constrained optimization problem. A model is trained not only to maximize the likelihood of the source-domain training data, but also to select those features which minimize the distance between the source and target distributions. The distance function is the difference between the (source and target) mean values across features. The work discussed so far assumes a single source domain. Jiang [54] proposes an approach that uses training data from multiple source domains. First, a generalizable subset of features is identified based on their predictiveness across source domains. Then, a model that focuses heavily on these features is used to produce predictions on the target domain. Finally, a target-domain classifier with access to all features is trained using these predictions as pseudo-training examples. This work is very similar to the work presented in this chapter. The major difference is that we use a learning algorithm that can model complex feature interactions. This learning algorithm is used both to produce the target-vertical pseudo-labels and the final target-vertical predictions. Feature selection can be viewed as a type of feature representation change. An alternative to feature selection is feature augmentation (increasing the feature space). The goal, however, is the same: to find a representation that behaves similarly across domains. Daumé III [26] augments the feature space by making three copies of each feature: a source-domain, target-domain, and domain-agnostic copy. Then, a model is learned by

pooling together the source and target training data. Effectively, this allows the model to weight each copy differently, depending on the feature's correlation with the target class across domains. For example, if a feature is only predictive in the target domain, the model can assign a high (positive/negative) weight to its target-specific copy and a near-zero weight to its other two copies. If a feature behaves similarly across domains, the model can assign a high (positive/negative) weight to its domain-agnostic copy. Blitzer et al. [9] propose a feature augmentation method that does not require target-domain training data. Their structural correspondence learning approach is based on the distinction between pivot and non-pivot features. A pivot feature is one that occurs frequently across domains and has a similar predictive relationship with the target class (e.g., the term "bad" in movie and restaurant reviews). A non-pivot feature is one that occurs more frequently in one domain and may have a different predictive relationship with the target class (e.g., the term "unpredictable" in movie and restaurant reviews). The aim of structural correspondence learning is to identify the correspondence between non-pivot features across domains. The assumption is that correspondences between non-pivot features are encoded in the weights assigned to each when training linear classifiers to predict the value of each pivot feature using non-pivot features. A different approach to domain adaptation is to train a model on the source domain and then to only fine-tune its parameters using the available target-domain training data. Chen et al. [21] propose a simple approach that uses the Gradient Boosted Decision Trees (GBDT) learning algorithm. GBDT uses a boosting framework to combine weak learners (i.e., decision trees) to produce a more complex model. During each GBDT training iteration, a new decision tree is trained on the residuals of the current model's predictions. Their approach to domain adaptation is to first train a model on the source domain and then to simply continue the training process on target-domain training data. One advantage of this approach is that the feature representation across domains can be different (even completely disjoint). Our approach to domain adaptation uses this algorithm, but during adaptation, we continue the GBDT training process using target-domain pseudo-training data. If we assume a small amount of target-vertical training data, our problem is also related to multi-task learning or transfer learning. The multi-task learning problem arises when the data is associated with multiple output variables, as in the case of multi-class classification. Rather than learn each class independently, the aim of multi-task learning is to share information across predictive tasks. The assumption is that overall performance for any particular class can be improved by exploiting correlations between classes. In our case, it may be possible to improve performance for the target vertical by exploiting correlations with other verticals associated with more training data. There is a large body of prior work on multi-task learning. Thus, our goal is to provide only a brief overview. Early work used neural networks with multiple inputs (i.e., features) and multiple outputs (i.e., predictions for each task) [19, 105]. Using a single neural network, predictive relationships between tasks can be modeled within the network's hidden layer(s) and learned using back-propagation.
Shrinkage methods

combine predictions for different classes that are learned independently [11]. Combination coefficients can be learned using cross-validation. Regularization-based methods exploit associations between target classes during training. The assumption is that models learned for similar tasks should have similar model parameters [33]. Hierarchical Bayesian approaches, like neural net approaches, model relations between tasks using shared latent variables [117]. While multi-task learning approaches may be useful for our task, there are two caveats. First, they require at least some amount of target-vertical training data. The methods proposed in this chapter require none. Second, at least some multi-task learning approaches (e.g., neural-network methods [19]) assume a common dataset with labels for every class. In aggregated web search, verticals labeled during different time periods will likely be associated with different queries. The methods proposed in this chapter can handle source verticals with different (even completely disjoint) training sets.

4.3 Vertical Adaptation Approaches

In this chapter, as in Chapter 3, we cast vertical selection as a classification task. In other words, we train models (using only source-vertical training data) to predict the relevance of the new target vertical as a function of a set of features, described later in Section 4.4. Several different models are proposed. In Section 4.3.2, we propose several portable models. A model is portable if, once trained, it can be effectively applied to any vertical, including one without training data. Then, in Section 4.3.3, we propose models that are adaptable. A model is adaptable if it can be fine-tuned to a specific vertical, including one without training data. All proposed methods use the same base learning algorithm: Gradient Boosted Decision Trees (GBDT) [38]. We adopted GBDT because it is able to model complex feature interactions and has been effective in other tasks such as text categorization [61] and rank-learning [119]. Next, we provide a general description of GBDT and then describe our portable and adaptable models.

4.3.1 Gradient Boosted Decision Trees

The main component of a GBDT model is a regression tree. A regression tree is a simple binary tree. Each internal node corresponds to a feature and a splitting condition which partitions the data. Each terminal node corresponds to a response value, the predicted output value. GBDT combines regression trees in a boosting framework to form a more complex model. During training, each additional regression tree is trained on the residuals of the current prediction. In our case, we chose to minimize logistic loss, which has demonstrated effective performance for vertical selection in prior work [2, 3, 28, 29]. That is, the model maximizes,

\mu_{\log}(f_v, y_v, \mathcal{Q}) = -\sum_{q \in \mathcal{Q}} \ell_{\log}\big(f_v(\phi_v(q)) \times y_v(q)\big) \qquad (4.1)

where \phi_v(q) denotes a feature generating function and \ell_{\log} is the logistic loss function,

" (z) = log(1 + exp( z)) (4.2) log −

Note that feature generator φv(q) is a function of v because at least some features are derived from the vertical under consideration. That is, their value depends on v.
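The following Python sketch computes the objective in Equations 4.1 and 4.2 for a single vertical, assuming relevance labels coded as ±1 (so that the product f_v(φ_v(q)) × y_v(q) behaves as a margin). The arrays are hypothetical toy values, not data from this work.

import numpy as np

def logistic_loss(z):
    # Equation 4.2: l_log(z) = log(1 + exp(-z)); logaddexp is the numerically stable form.
    return np.logaddexp(0.0, -z)

def mu_log(scores, labels):
    # Equation 4.1 for one vertical: the negated sum of logistic losses over the query set,
    # assuming labels in {-1, +1} and scores = f_v(phi_v(q)).
    return -np.sum(logistic_loss(scores * labels))

# Hypothetical toy values for three queries scored against one vertical.
scores = np.array([2.1, -0.3, 0.8])   # f_v(phi_v(q))
labels = np.array([+1, -1, +1])       # y_v(q)
print(mu_log(scores, labels))         # closer to zero is better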

4.3.2 Learning a Portable Model

A portable model is one that can make accurate relevance predictions with respect to any vertical. Once trained, a portable model can be used as a black-box predictor: given a query and an arbitrary vertical, it can predict whether the vertical is relevant to the query. On the other hand, a model is non-portable if it can make relevance predictions with respect to a specific vertical, but not a different one. Let us examine the distinction between a portable and a non-portable vertical selection model with an example. Consider a single-evidence model that predicts a vertical relevant based on the number of times the query was previously issued to the vertical by users. This model would likely be portable because this type of evidence, the frequency of the query in the vertical's query-traffic, is likely to be positively correlated with relevance across verticals: the higher the value, the more likely the vertical is relevant, whichever vertical it is. On the other hand, consider a single-evidence model that predicts a vertical relevant if the query is classified as related to the travel domain. This would be a non-portable model because this type of evidence is likely to be predictive for a travel vertical, but not for one that focuses on a different domain. Our goal is to use only source-vertical training data to learn a portable model and then to apply the portable model to the new target vertical.

The Basic Portable Model

The objective of learning a portable model (denoted by f_⋆) is to learn a generic (vertical-agnostic) relationship between features and vertical relevance. One way to do this is to train a model to maximize the average performance across source verticals. The assumption is that if f_⋆ performs consistently well across S, then f_⋆ will perform well on the target t, even if not trained on any target-vertical training examples. More formally, we will define the portability of f_⋆ using a metric that quantifies performance for a vertical s ∈ S and a function that averages performance across verticals in S. For example, the portability π^avg_log, which uses the arithmetic mean of the logistic loss metric, is defined by,

\pi^{\mathrm{avg}}_{\log}(f_{\star}, y_{\mathcal{S}}, \mathcal{Q}_{\mathcal{S}}) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mu_{\log}(f_{\star}, y_s, \mathcal{Q}_s). \qquad (4.3)

Q_s is the set of training queries for source s and Q_S is the set of those sets. Similarly, y_s provides labels for vertical s and y_S is the set of these functions. We refer to the model which optimizes π^avg_log as the basic model. Notice that,

\pi^{\mathrm{avg}}_{\log}(f_{\star}, y_{\mathcal{S}}, \mathcal{Q}_{\mathcal{S}}) = -\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \sum_{q \in \mathcal{Q}_s} \ell_{\log}\big(f_{\star}(\phi_s(q)) \times y_s(q)\big)

As a result, the solution which maximizes π^avg_log is equivalent to the solution which minimizes the logistic loss across all feature-vector/relevance-label pairs from all source verticals. In other words, to train a basic portable model, we can simply perform standard GBDT training on a pooling of each source vertical's training set.
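A minimal sketch of this pooled training procedure, using scikit-learn's GradientBoostingClassifier (which minimizes log loss by default) as a stand-in for the GBDT implementation used here; the per-vertical feature matrices and labels are hypothetical toy data.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_basic_portable(X_by_vertical, y_by_vertical):
    # Pool every source vertical's training set and run standard GBDT training on the pool.
    X_pool = np.vstack([X_by_vertical[s] for s in X_by_vertical])
    y_pool = np.concatenate([y_by_vertical[s] for s in X_by_vertical])
    model = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=8, learning_rate=0.1)
    return model.fit(X_pool, y_pool)

# Hypothetical toy data: identically indexed feature vectors phi_s(q) and 0/1 labels per source vertical.
rng = np.random.RandomState(0)
X = {"news": rng.rand(50, 10), "travel": rng.rand(40, 10)}
y = {"news": rng.randint(0, 2, 50), "travel": rng.randint(0, 2, 40)}
f_star = train_basic_portable(X, y)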

As previously mentioned, the feature generator φ_v is vertical-specific because at least some features are generated from the vertical under consideration. During portable-model training, we pool together each source vertical's training set. It should be noted that while the value of a particular feature may depend on the vertical, the semantics of each feature are consistent across verticals. In other words, all feature vectors are identically indexed and have the same length. So, for example, if feature φ_v(q)_i corresponds to the number of times the query was issued by users directly to v, then φ_{v′}(q)_i refers to the number of times the query was issued to v′. In the next two sections, we describe two possible limitations of this basic portable model and suggest potential improvements.

Vertical Balancing

The set of positive instances within the basic model's training set will correspond to the union of relevant query-vertical pairs from all source verticals. For this reason, we expect the portable model to focus on vertical-agnostic evidence (consistently correlated with the positive class) and ignore vertical-specific evidence (inconsistently correlated with the positive class). The assumption is that by focusing on evidence whose relationship with the positive class does not conflict across verticals, the basic model will be effective on the target (even in the absence of target training data). One challenge, however, is that the positive instances within the basic model's training pool will be skewed towards the most popular verticals (those that tend to be relevant often). This may be problematic because the vertical that contributes the greatest number of positive examples may be reliably predicted relevant based on vertical-specific evidence, which is unlikely to be predictive of the target. To compensate for this, we consider a weighted average of metrics across verticals. Specifically,

\pi^{\mathrm{wavg}}_{\log}(f_{\star}, y_{\mathcal{S}}, \mathcal{Q}_{\mathcal{S}}) = \frac{1}{Z} \sum_{s \in \mathcal{S}} w_s\, \mu_{\log}(f_{\star}, y_s, \mathcal{Q}_s) \qquad (4.4)

where Z = \sum_{s \in \mathcal{S}} w_s. We use the simple heuristic of weighting a vertical with the inverse of its prior,

w_s = \frac{1}{p_s}

where p_s is the prior probability of observing a query with relevant vertical s. This value is approximated with the training data,

p_s \approx \frac{\sum_{q \in \mathcal{Q}_s} y_s(q)}{|\mathcal{Q}_s|}

The goal is to make training instances from minority verticals more influential and those from majority verticals less influential. It is easy to see that Equation 4.4 is a generalization of Equation 4.3. Because we use logistic loss, this technique reduces to training with an instance-weighted logistic loss where the instances are weighted by w_s, the weight of the vertical,

\pi^{\mathrm{wavg}}_{\log}(f_{\star}, y_{\mathcal{S}}, \mathcal{Q}_{\mathcal{S}}) = -\frac{1}{Z} \sum_{s \in \mathcal{S}} \sum_{q \in \mathcal{Q}_s} w_s\, \ell_{\log}\big(f_{\star}(\phi_s(q)) \times y_s(q)\big)

As with the basic model, we can use standard GBDT training to optimize for this metric.
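A sketch of the instance weighting described above: every training instance from source vertical s receives weight w_s = 1/p_s, with p_s estimated from that vertical's training labels. Again, scikit-learn's GBDT is used as a stand-in and the inputs are hypothetical.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_balanced_portable(X_by_vertical, y_by_vertical):
    X_parts, y_parts, w_parts = [], [], []
    for s in X_by_vertical:
        y_s = y_by_vertical[s]
        p_s = max(np.mean(y_s), 1e-6)                  # prior of vertical s, estimated from its labels
        X_parts.append(X_by_vertical[s])
        y_parts.append(y_s)
        w_parts.append(np.full(len(y_s), 1.0 / p_s))   # every instance of s gets weight w_s = 1 / p_s
    model = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=8)
    return model.fit(np.vstack(X_parts), np.concatenate(y_parts),
                     sample_weight=np.concatenate(w_parts))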

Feature Weighting

An alternative to optimizing for a portable model is to find portable features and to train a model using only those. A portable feature is defined as a feature which is highly correlated with relevance (in the same direction) across all verticals. Recall that, across verticals, all features are identically indexed. Let φi be a predictor based only on the value of feature i. In previous work, the effectiveness of features across verticals was shown to be very dependent on the vertical being considered. In order to address the expected instability of feature predictiveness across verticals, we adopt a harmonic average for our aggregation method.

\pi^{\mathrm{havg}}(\phi^i, y_{\mathcal{S}}, \mathcal{Q}_{\mathcal{S}}) = \frac{|\mathcal{S}|}{\sum_{s \in \mathcal{S}} \frac{1}{\mu(\phi^i_s, y_s, \mathcal{Q}_s)}} \qquad (4.5)

Additionally, features, on their own, are not scaled to the label range, making the use of logistic loss difficult. Instead of constructing a mapping from a feature value to the appropriate range, we adopt a rank-based metric. Let ρ_f(Q) be the ranking of Q by f. We use average precision as our rank-based metric,

\mu_{\mathrm{AP}}(f, y, \mathcal{Q}) = \sum_{r=1}^{|\mathcal{Q}|} y\big(\rho_f(\mathcal{Q})_r\big) \times \mathcal{P}_r\big(y, \rho_f(\mathcal{Q})\big) \qquad (4.6)

where ρ_f(Q)_k denotes the query at rank k and P_k is the precision at rank k,

\mathcal{P}_k(y, \rho) = \frac{1}{k} \sum_{r=1}^{k} y(\rho_r)

In other words, for each feature, we rank queries by feature value and compute the harmonic mean average precision across verticals.1 Having computed the portability of

each feature, we build a portable model by restricting our training to the most portable features. The most portable features were selected by inspecting the distribution of portability values. Because portability values are in the unit range, we model our data with the Beta distribution. We fit the Beta distribution using the method of moments and then select features whose portability is in the top quartile of this distribution.

1Because we do not know whether the feature value has a positive or negative relationship with the label, we compute π^havg_AP(f, y_S, Q_S) using ρ induced in both directions and use the max.
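The sketch below illustrates this selection procedure under a few simplifying assumptions: scikit-learn's average_precision_score (which normalizes by the number of relevant queries) stands in for Equation 4.6, the maximum over both ranking directions implements the footnote, and the Beta distribution is fit by the method of moments. The per-vertical input arrays are hypothetical.

import numpy as np
from scipy.stats import beta
from sklearn.metrics import average_precision_score

def feature_portability(values_by_vertical, labels_by_vertical):
    # Harmonic mean, across source verticals, of the AP obtained when queries are ranked
    # by one feature's value; the max over both ranking directions is taken because the
    # sign of the feature's relationship with relevance is unknown.
    aps = []
    for s in values_by_vertical:
        v, y = values_by_vertical[s], labels_by_vertical[s]
        aps.append(max(average_precision_score(y, v), average_precision_score(y, -v)))
    aps = np.asarray(aps)
    return len(aps) / np.sum(1.0 / aps)

def select_top_quartile(portability_values):
    # Fit a Beta distribution by the method of moments and keep features whose portability
    # exceeds the fitted distribution's 75th percentile.
    p = np.asarray(portability_values)
    m, var = p.mean(), p.var()
    common = m * (1.0 - m) / var - 1.0
    a, b = m * common, (1.0 - m) * common
    return np.where(p >= beta.ppf(0.75, a, b))[0]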

4.3.3 Adapting a Model to the Target Vertical

Above, we focus on ways of improving the portability of a model by influencing the model to ignore evidence that is vertical-specific. The argument is that a model that focuses heavily on vertical-specific evidence will not generalize well to a new target vertical. Given access to target-vertical training data, Chapter 3 reveals two meaningful trends. First, given a wide range of input features, most features contribute significantly to performance. The analysis presented in Section 3.6.1 demonstrates that no single feature type is exclusively responsible for effective vertical prediction. Second, the features that contributed the most to performance, which characterize the topic of the query, are vertical-specific (assuming that verticals focus on different topics). Based on these observations, while ignoring vertical-specific evidence seems necessary to improve a model's portability, a model customized to a particular vertical is likely to benefit from it.

In the context of adaptation for web search, Chen et al. [21] propose several ways to adapt an already-tuned GBDT model given data in a new domain. Their approach, Tree-based Domain Adaptation (TRADA), essentially consists of continuing the GBDT training process on labeled data from the target domain. More specifically, a set of new regression trees is appended to the existing model while minimizing a loss function (logistic loss, in our case) on the target data. In our case, the challenge of using TRADA to adapt a model to a specific target is that we lack target-vertical training data. In the context of semi-supervised learning, self-training or bootstrapping [115] is the process of re-training a model using previous model predictions on unlabeled data. We combine self-training and model adaptation in the following way. First, we use a portable model to label a set of queries with respect to the target vertical. Then, we use TRADA to adapt the portable model to its own target-vertical predictions.

Tree adaptation provides a method for adjusting the modeling of all features. Just as we can select portable features for a portable model, we can select vertical-specific, non-portable features while adapting a model to a specific target vertical. That is, the base model may focus on portable features while the additional trees, added in the context of pseudo-training data, may focus on non-portable features. In this case, we use the same feature portability measure (Equation 4.5) but select the least portable features for tree augmentation. Pseudo-labels for the target vertical were produced using the portable model's prediction confidence value with respect to the target vertical. We used the simple heuristic of considering the top N% most confident positive predictions as positive examples and the bottom (100 − N)% of predictions as negative examples.
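A small sketch of this pseudo-labeling heuristic, with hypothetical confidence values; the adapted model would then continue GBDT training on these pseudo-labels using the least portable features.

import numpy as np

def pseudo_label(confidences, n_percent):
    # Top n_percent most confident predictions become pseudo-positives;
    # the remaining (100 - n_percent)% become pseudo-negatives.
    order = np.argsort(-confidences)
    n_pos = int(round(len(confidences) * n_percent / 100.0))
    labels = np.zeros(len(confidences), dtype=int)
    labels[order[:n_pos]] = 1
    return labels

# Hypothetical confidences from the portable model on unlabeled target-vertical queries.
conf = np.array([0.91, 0.10, 0.55, 0.87, 0.02, 0.33, 0.76, 0.44])
print(pseudo_label(conf, n_percent=25))   # two most confident queries labeled 1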

4.4 Features

We generated a number of features which we believe are correlated with vertical relevance and are generalizable across queries. Our features can be viewed as belonging to the following two classes.

1. Query features are specific to the query and are independent of the vertical under consideration. These include, for example, whether the query relates to the travel domain or whether the query contains the term “news”.

2. Query-vertical features are specific to the query-vertical pair (i.e., they are generated from the vertical). These include, for example, the similarity between the query and those previously issued directly to the vertical by users.2

4.4.1 Query Features

Query features are generated from the query and are independent of the vertical. These features are described more completely in Section 3.3.3. Rule-based intent features include regular expressions and dictionary look-ups likely to correlate with vertical intent (e.g., does the query contain the term "news"?). Geographic features correspond to the output of a geographic named-entity tagger (e.g., does the query contain a city name?). Categorical features correspond to the output of a query-domain categorization algorithm (e.g., is the query related to the travel domain?). The total number of query features was 118.

4.4.2 Query-Vertical Features

Query-vertical features are query- and vertical-dependent. In other words, their value depends on the query and the vertical under consideration. We generated five query-vertical features. ReDDE.top, described in Section 3.3.1, scores the vertical based on its predicted number of relevant documents for a given query. To do this, it uses a retrieval from a centralized sample index, which combines documents sampled from each vertical. Soft.ReDDE, also described in Section 3.3.1, is a variant of ReDDE.top, but uses a soft document-to-vertical assignment (possibly using an external collection) to compute the vertical's score. We used an external collection: the English Wikipedia. We included two different Soft.ReDDE features. These differ in the computation of the Wikipedia-article-to-vertical assignment, which is based on the similarity between the Wikipedia article and the vertical language model. One version of Soft.ReDDE derived the vertical language model from vertical samples and a second version derived it from the vertical's query-log. In addition to ReDDE.top and Soft.ReDDE, we used a third resource selection algorithm: GAVG (for geometric average), described in Section 2.3. Like ReDDE.top, GAVG also uses a centralized sample index, but scores the vertical based on the geometric average score from its top m samples. Finally, we use the query likelihood given a language model constructed from the vertical's query-log. In total, this corresponds to 5 query-vertical features: ReDDE.top, two Soft.ReDDE features, one GAVG feature, and one query-log feature. Each feature type was mass normalized across verticals.

2One could also imagine having query-independent vertical features, for example, whether the vertical is experiencing a sudden increase in query traffic. We did not make use of vertical features in this work.
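As a rough illustration of one of these signals, the sketch below computes a GAVG-style score: the geometric average of the vertical's top-m retrieval scores from the centralized sample index. The retrieval scores, the padding floor for verticals with fewer than m matching samples, and the value of m are assumptions made for the example, not values taken from this work.

import numpy as np

def gavg_score(sample_scores, m=10, floor=1e-6):
    # Geometric average of the vertical's top-m retrieval scores from the centralized
    # sample index; verticals with fewer than m matching samples are padded with a floor.
    top = sorted(sample_scores, reverse=True)[:m]
    top = np.maximum(np.array(top + [floor] * (m - len(top))), floor)
    return float(np.exp(np.mean(np.log(top))))

# Hypothetical retrieval scores for documents sampled from one vertical.
print(gavg_score([0.42, 0.35, 0.30, 0.12]))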

4.5 Methods and Materials

The objective is to predict the relevance of a new target vertical for a given query. For this reason, we evaluate on a per-vertical basis. Given a set of verticals V with query-vertical relevance labels, each vertical was artificially treated as the new target vertical and all remaining verticals as the source verticals. That is, for each t ∈ V, we set S = V − {t}. Given a set of queries with vertical-relevance judgements, evaluation numbers were averaged across 10 cross-validation test folds, using the remaining 90% of the data for training. GBDTs require tuning several parameters: the number of trees, the maximum number of nodes per tree, and the shrinkage factor (see Friedman [38] for details). These were tuned using a grid search and 10-fold cross-validation on each training fold. In all cases, cross-validation folds were split randomly. Significance is tested using a 2-tailed unpaired t-test. TRADA was self-trained using predictions made on the test set. More specifically, at each cross-validation step, a basic model was tuned on the training fold (90% of all queries) and applied to the test fold (10% of all queries). Then, a TRADA model was pseudo-trained using predictions on the test fold. In other words, for a given vertical, we trained 10 basic and 10 TRADA models. Rather than tune the pseudo-training parameter N, we present results for N = 2.5%, 5%, and 10%.

4.5.1 Verticals

We focused on 11 verticals, described in Table 4.1. These 11 verticals are a subset of the 18 verticals used in Chapter 3. For practical reasons, some verticals in Chapter 3 had missing features (e.g., not every vertical had a vertical-specific search interface from which to derive query-log features). To facilitate a fairer comparison of performance across (target) verticals, in this chapter we focus on the 11 verticals which did not have missing features. These verticals differ along several dimensions: size, domain, document type, and level of query traffic.

Table 4.1: Vertical descriptions.

vertical   retrievable items
finance    financial data and corporate information
games      hosted online games
health     health-related articles
images     online images
jobs       job listings
local      business listings
movies     movie show times
music      musician profiles
news       news articles
travel     travel and accommodation reviews and listings
video      online videos

4.5.2 Queries

Our evaluation data consisted of the same set of 25,195 randomly sampled Web queries used in Chapter 3. Given a query, human editors were instructed to assign verticals to one of four relevance grades ('most relevant', 'highly relevant', 'relevant', 'not relevant') based on their best guess of the user's vertical intent. It is possible for a query to have multiple verticals tied for a particular relevance grade. For about 25% of queries, all verticals were labeled 'not relevant'. These are queries for which a user would prefer to see only Web search results.

4.5.3 Evaluation Metrics

We are interested in the accuracy of a model's target-vertical predictions. Given a set of predictions for a target vertical, precision and recall can be computed by setting a threshold on the prediction confidence value. Rather than report a single precision-recall operating point, we evaluate using ranking metrics. These metrics are computed from a ranking of test queries by descending order of prediction confidence value, the probability that the target vertical is relevant to the query. We adopt two rank-based metrics: average precision (AP) and normalized discounted cumulative gain (NDCG). In computing AP, the labels 'most relevant', 'highly relevant', and 'relevant' were collapsed into a single grade: 'relevant'. We then compute average precision according to Equation 4.6. NDCG differs from AP in two respects. First, it differentiates between relevance grades. Second, given a target vertical t, it favors a model that is more confident on queries for which t is more relevant. Put differently, compared to AP, it punishes high-confidence errors more severely than low-confidence errors. Following Järvelin and Kekäläinen [52], NDCG for a target vertical t, evaluated over queries Q_t, is computed as,

\mu_{\mathrm{NDCG}}(f, y_t, \mathcal{Q}_t) = \frac{1}{Z} \sum_{r=1}^{|\mathcal{Q}_t|} \frac{2^{y_t(\rho_f(\mathcal{Q}_t)_r)} - 1}{\log(\max(r, 2))} \qquad (4.7)

where y_t maps the relevance grade to a scalar ('most relevant': 3, 'highly relevant': 2, 'relevant': 1, 'not relevant': 0). The normalizer Z is the DCG of an optimal ranking of the queries with respect to their relevance.

Recall that our objective is to achieve the best performance possible without target-vertical training data. A major motivation is to alleviate the need for a model trained on target-vertical data. Therefore, it is useful to measure our performance as a fraction of the performance achievable by a model with access to target training data. We show AP and NDCG results given human-annotated target-vertical training data in Table 4.2. A target-specific model (using all available features) was trained and tested for each vertical using 10-fold cross-validation. As in all results, we present performance averaged across test folds. Given our objective, this can be considered a "cheating" experiment. However, these numbers present a kind of upper bound for our methods. Our goal is to approximate these numbers using no target-vertical training data. For this reason, all results beyond this section are normalized by these numbers (i.e., results are reported as a percentage of target-trained performance). Also, this normalization facilitates an understanding of the cost-benefit of labeling the new vertical given source-vertical labels and our proposed methods.

Table 4.2: Target-trained ("cheating") results: AP and NDCG.

vertical   AP      NDCG
finance    0.556   0.861
games      0.741   0.919
health     0.800   0.945
hotjobs    0.532   0.814
images     0.513   0.855
local      0.684   0.926
movies     0.575   0.851
music      0.791   0.934
news       0.339   0.748
travel     0.797   0.947
video      0.290   0.701
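A sketch of the NDCG computation in Equation 4.7, ranking a hypothetical set of test queries by the model's prediction confidence and using the grade mapping given above.

import numpy as np

def ndcg(confidences, grades):
    # Equation 4.7: gain 2^grade - 1, discount log(max(r, 2)), normalized by the DCG of
    # the optimal ordering of the same queries. Grades use the mapping in the text
    # (3 = most relevant, ..., 0 = not relevant).
    def dcg(g):
        ranks = np.arange(1, len(g) + 1)
        return np.sum((2.0 ** g - 1.0) / np.log(np.maximum(ranks, 2)))
    order = np.argsort(-confidences)
    ideal = dcg(np.sort(grades)[::-1])
    return dcg(grades[order]) / ideal if ideal > 0 else 0.0

# Hypothetical test queries: prediction confidences and editorial grades.
conf = np.array([0.9, 0.2, 0.7, 0.4, 0.1])
grades = np.array([3, 0, 1, 2, 0])
print(ndcg(conf, grades))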

4.5.4 Unsupervised Baselines

Section 4.4.2 describes several query-vertical features that are used as input signals to our models. Prior work shows that each of these can be used as an unsupervised single-evidence vertical predictor [2, 3]. In other words, these methods can make target vertical predictions without any (source- or target-vertical) training data. Therefore, to justify the added complexity of our models, we compared against these unsupervised single-evidence approaches. We present results for the unsupervised baseline that performed best in this evaluation: Soft.ReDDE [3]. Soft.ReDDE (the version for which the vertical language model was estimated using vertical samples) performed equal to or better than the next best single-evidence method for all but 3/11 verticals based on both AP and NDCG.

4.6 Experimental Results

We present portability results in Table 4.3. We report performance for our three portable models, each trained using data from every vertical except the target. Each basic model uses all features and weights all instances equally. Each basic+VB model uses all features and performs instance weighting to balance the positive instances from different verticals. Finally, each basic+FW model uses only the most portable features and weights all instances equally. Across both metrics, both vertical balancing (basic+VB) and feature weighting (basic+FW) improve the performance of the basic model (basic). Performance across verticals was either statistically indistinguishable or better. Compared to each other, feature weighting (basic+FW) significantly improved the basic model across more verticals (8/11 for both metrics). Compared to Soft.ReDDE, the only method that did noticeably better was the basic model with feature weighting (basic+FW). Performance was significantly better for 4 verticals based on AP and 5 based on NDCG. Performance was significantly worse for one vertical based on AP and none based on NDCG.

We present adaptability results in Table 4.4. TRADA adapts a basic model using its own target-vertical predictions as pseudo-training data. Given its superior performance, we used the unbalanced basic model with only the most portable features (basic+FW). We refer to this as the base model. TRADA was tested under two conditions. In the first condition, TRADA is given access to all features for adaptation. In the second condition, it is given access to only the non-portable features (precisely those purposely ignored in order to improve the portability of the base model).

Table 4.4 presents several meaningful results. First, TRADA does poorly when the adapted model is given access to all features (columns 4-6). In contrast, when given access to only the non-portable features (TRADA+FW), results improve (columns 7-9). TRADA+FW performs either equal to or better than the base model in all but one case for AP and in all cases for NDCG. Similarly, across metrics, TRADA+FW significantly outperforms Soft.ReDDE across most verticals for all values of N. It performs significantly worse than Soft.ReDDE in three cases in terms of AP and none in terms of NDCG. Overall, we view this as a positive result in favor of adaptability. TRADA succeeds at adapting a portable model to a specific target vertical at no additional editorial cost.

4.7 Discussion

In the previous section, we demonstrated that the performance improvement for both portable and adaptive models requires measuring individual feature portability. In order to investigate precisely which features were being selected, we plot the value of π^havg_AP in Figure 4.1.

vertical   Soft.ReDDE   basic (all feats.)   basic+VB (all feats.)   basic+FW (only portable feats.)
finance    0.446        0.209↓               0.199↓                  0.392†
games      0.720        0.636                0.724                   0.683
health     0.823        0.797                0.793                   0.839
jobs       0.155        0.193                0.226↑                  0.321↑†
images     0.283        0.365↑               0.404↑                  0.390↑
local      0.696        0.543↓               0.614↓†                 0.628↓†
movies     0.477        0.294↓               0.388↓†                 0.478†
music      0.757        0.673↓               0.700↓                  0.780†
news       0.559        0.293↓               0.434↓†                 0.548†
travel     0.487        0.571↑               0.618↑†                 0.639↑†
video      0.525        0.449                0.539                   0.691↑†
avg        0.539        0.457                0.513                   0.581

(a) AP

vertical   Soft.ReDDE   basic (all feats.)   basic+VB (all feats.)   basic+FW (only portable feats.)
finance    0.776        0.663↓               0.651↓                  0.775†
games      0.910        0.884                0.918                   0.903
health     0.953        0.950                0.946                   0.960
jobs       0.563        0.583                0.607                   0.671↑†
images     0.712        0.745                0.776↑                  0.768↑
local      0.905        0.875↓               0.897†                  0.910†
movies     0.775        0.685↓               0.745                   0.798†
music      0.937        0.922                0.922                   0.957↑†
news       0.852        0.703↓               0.781↓†                 0.875†
travel     0.846        0.881↑               0.908↑†                 0.911↑†
video      0.817        0.816                0.828                   0.902↑†
avg        0.822        0.792                0.816                   0.857

(b) NDCG

Table 4.3: Portability results: normalized AP and NDCG. A ↑ (↓) denotes significantly better (worse) performance compared to Soft.ReDDE. A † (‡) denotes significantly better (worse) performance compared to the unbalanced basic model with all features. Significance was tested at the p < 0.05 level.

As it turns out, the same five features were consistently chosen as the most portable for all target verticals (i.e., for all sets of source verticals). Interestingly, these correspond to our five query-vertical features (Section 4.4.2). Conversely, the features that were the least portable were consistently our remaining query features (Section 4.4.1).

                        basic+FW           trada (all features)            trada+FW (only non-portable)
vertical   Soft.ReDDE   (only portable)    N=2.5%    N=5%      N=10%       N=2.5%    N=5%      N=10%
finance    0.446        0.392              0.364     0.328↓    0.226↓‡     0.476     0.407     0.339↓
games      0.720        0.683              0.735     0.660     0.491↓‡     0.819↑†   0.817↑†   0.787†
health     0.823        0.839              0.814     0.813     0.592↓‡     0.907↑†   0.868↑    0.818
jobs       0.155        0.321↑             0.360↑    0.384↑    0.345↑      0.390↑    0.348↑    0.323↑
images     0.283        0.390↑             0.320‡    0.370↑    0.405↑      0.410↑    0.499↑†   0.523↑†
local      0.696        0.628↓             0.523↓‡   0.601↓    0.609↓      0.562↓‡   0.614↓    0.663
movies     0.477        0.478              0.493     0.462     0.411‡      0.640↑†   0.587↑†   0.578↑†
music      0.757        0.780              0.751     0.778     0.760       0.868↑†   0.866↑†   0.838↑†
news       0.559        0.548              0.509     0.556     0.523       0.607     0.665↑†   0.615
travel     0.487        0.639↑             0.531‡    0.573↑‡   0.597↑      0.744↑†   0.709↑†   0.710↑†
video      0.525        0.691↑             0.633     0.648     0.586       0.735↑    0.722↑    0.688↑
avg        0.539        0.581              0.548     0.561     0.504       0.651     0.646     0.626

(a) AP

                        basic+FW           trada (all features)            trada+FW (only non-portable)
vertical   Soft.ReDDE   (only portable)    N=2.5%    N=5%      N=10%       N=2.5%    N=5%      N=10%
finance    0.776        0.775              0.760     0.718     0.632↓‡     0.826     0.765     0.734
games      0.910        0.903              0.920     0.885     0.778↓‡     0.947↑†   0.951↑†   0.938†
health     0.953        0.960              0.947     0.939     0.827↓‡     0.974↑    0.960     0.953
jobs       0.563        0.671↑             0.720↑    0.736↑    0.710↑      0.747↑    0.698↑    0.676↑
images     0.712        0.768↑             0.745     0.775↑    0.790↑      0.801↑    0.850↑†   0.869↑†
local      0.905        0.910              0.885↓‡   0.907     0.898       0.901     0.911     0.930↑†
movies     0.775        0.798              0.808     0.790     0.732‡      0.856↑    0.842↑    0.840↑
music      0.937        0.957↑             0.939     0.939     0.931       0.977↑†   0.974↑†   0.965↑
news       0.852        0.875              0.850     0.868     0.828       0.898     0.908     0.879
travel     0.846        0.911↑             0.875‡    0.890↑    0.889↑      0.944↑†   0.933↑†   0.926↑
video      0.817        0.902↑             0.869     0.865     0.827       0.930↑    0.896↑    0.880
avg        0.822        0.857              0.847     0.847     0.804       0.891     0.881     0.872

(b) NDCG

Table 4.4: TRADA results: normalized AP and NDCG. A ↑ (↓) denotes significantly better (worse) performance compared to Soft.ReDDE. A † (‡) denotes significantly better (worse) performance compared to the base model. Significance was tested at the p < 0.05 level.

In hindsight, this observation makes sense. Our query-vertical features included several existing resource selection methods: GAVG [86], ReDDE.top [3], and Soft.ReDDE [4]. These methods, by design, score resources using a single metric that is expected to be positively correlated with resource relevance irrespective of the resource. This result places two decades' worth of unsupervised resource selection approaches under a new light. Existing resource selection methods are portable and can therefore be used to harness non-portable (vertical-specific) evidence (e.g., the topic of the query).

[Figure 4.1 appears here: a plot of π^havg_AP for each feature, distinguishing query-vertical features from query features; the labeled points are soft.redde (vertical samples), soft.redde (vertical query-log), redde.top (vertical samples), query likelihood (vertical query-log), and gavg (vertical samples).]

Figure 4.1: Feature portability values (π^havg_AP) for all features across sets of source verticals. Features are ordered randomly along the x-axis. The features that were consistently the most portable were query-vertical features. As shown, many of these correspond to existing unsupervised resource selection methods used in prior work.

In other words, existing methods play an important role in adapting a portable model to a new vertical at no additional editorial cost. Vertical balancing significantly improved the basic model only across 4/11 verticals based on AP and 3/11 based on NDCG. Recall that we introduced balancing in order to discourage examples from source verticals with high priors from dominating the training set. However, this does not address cases where several sources have the same non-portable features correlated with relevance. For example, video and images tend to be relevant to queries that mention a celebrity name; travel and local tend to be relevant to queries that mention a geographic entity; and finance and jobs tend to be relevant to queries that contain a company name. Even though vertical balancing addresses a single vertical's dominance in the training set, it does not address a small coalition of related verticals causing the model to focus on non-portable features. We believe that a more robust averaging technique (for example, the harmonic average used for feature portability) may result in superior performance in the presence of similarly predictive source verticals.

In the previous section, TRADA improved considerably when given access to only the non-portable features for adaptation. We believe this is due to the following. TRADA was pseudo-trained using predictions from the base model (basic+FW). This was our best portable model and was given access to only the most portable features.

When TRADA is given access to all features (including the most portable ones), the adapted model tends to focus on the portable features (and ignore the non-portable ones) in order to better fit the original base-model predictions. On the other hand, when TRADA is given access to only the non-portable features for adaptation, it must focus on these vertical-specific ones. As it turns out, when given access to target-vertical training data, the non-portable, vertical-specific features are highly effective for predicting the target vertical. In Table 4.5, we compare the performance of two target-trained models: one that is given access to all features (portable and non-portable) and one that is given access to only the most portable ones.3 Performance across all verticals deteriorates substantially when a target-trained model is not given access to the non-portable features. When given access to only the non-portable features for adaptation, TRADA is forced to focus on these highly effective vertical-specific features. This mechanism can be seen as a sort of regularization ensuring that both portable and non-portable features are used for prediction.

(a) AP

vertical   all features (portable and non-portable)   only portable
finance    0.556                                       0.360↓
games      0.741                                       0.623↓
health     0.800                                       0.690↓
hotjobs    0.532                                       0.271↓
images     0.513                                       0.282↓
local      0.684                                       0.605↓
movies     0.575                                       0.325↓
music      0.791                                       0.660↓
news       0.339                                       0.232↓
travel     0.797                                       0.608↓
video      0.290                                       0.246↓

(b) NDCG

vertical   all features (portable and non-portable)   only portable
finance    0.861                                       0.758↓
games      0.919                                       0.877↓
health     0.945                                       0.917↓
hotjobs    0.814                                       0.673↓
images     0.855                                       0.711↓
local      0.926                                       0.906↓
movies     0.851                                       0.719↓
music      0.934                                       0.903↓
news       0.748                                       0.677↓
travel     0.947                                       0.894↓
video      0.701                                       0.658

Table 4.5: Target-trained results. Given access to target-vertical training data, a model that uses all features (portable + non-portable) performs significantly better than one that is given access to only the portable ones. A ↓ denotes a significant degradation in performance at the p < 0.05 level.

With respect to pseudo-training data parameter N, we observe that the optimal value of N seems to be vertical-dependent. There may be two reasons for this. First, the base model performance (column 5 in Table 4.3) is also vertical-dependent. Given a fixed value of N across verticals, pseudo-labels from some verticals may be noisier than others.

3This is a “cheating” experiment, as we assume no target-vertical training data. However, we do this for the purpose of error analysis.

Second, the optimal value of N for a given vertical may correlate with the vertical's prior. That is, it may depend on the number of queries for which the vertical is truly relevant.

4.8 Summary

Maximizing the return on editorial investment is an important aspect of any system requiring training data. We presented an ensemble of approaches which significantly improve prediction of a new target vertical using only source-vertical training data. Our results demonstrate that model portability, the ability of a model to generalize across different target verticals, requires careful attention to feature portability, the ability of a feature to correlate with vertical relevance across different target verticals. We found that the features which seemed to be the most portable, and hence most important for a portable model, were query-vertical features, as opposed to those that are independent of the candidate vertical. Conversely, when we tried to adapt a model for a specific target, the least portable features appeared to be those most important for the adapted model to consider. Furthermore, we showed that a portable solution can be used to build a target-specific one at no additional editorial cost. Our best approach adapted a portable model using its own target-vertical predictions. This approach consistently outperformed both the base model and a competitive alternative which does not require adaptation. Results also showed that, given available resources, human annotation on the new target vertical remains the best alternative.

This work could be extended in several directions. In terms of portability, vertical balancing may be improved by modeling the similarity (in terms of predictive evidence) between source verticals. In terms of adaptability, further improvements may be achieved by modeling the similarity between each source vertical and the target vertical. In the absence of target-vertical training data, we may be able to measure this source-to-target similarity using query-traffic data.

Chapter 5

Vertical Results Presentation: Evaluation Methodology

Aggregated web search is a two-part task. Up to this point, we have focused on vertical selection: deciding which verticals are relevant to the query. In this and the next chapter we focus on vertical results presentation: deciding where in the Web results to present the vertical results. Together, these two sub-tasks combine to produce the end-to-end system output: the final presentation of results. An example presentation for the query "pittsburgh" is shown in Figure 5.1. One natural question is: how good is this presentation of results? Suppose we swapped maps and news. Would that presentation be better? And if so, would the improvement be noticeable? The first step towards addressing the presentation task is to determine what the system should do. In other words, we need an evaluation methodology that can help us determine the quality of a particular arrangement of Web and vertical results. Not only is an evaluation methodology essential for comparing systems, it is also essential for model building (i.e., for automatically tuning a system to produce high-quality output).

Aggregated web search results are typically evaluated based on user feedback, collected either implicitly (e.g., by observing user clicks and skips [28, 63, 67, 78, 101]) or explicitly (e.g., by directly asking users which presentations they prefer [121]). Most of this prior work, however, focuses on the integration of at most a single vertical into the Web results, either assuming a fixed location for the vertical [28, 63, 67] (e.g., above the Web results), or a few alternative locations [101, 121] (above, in the middle of, or below the Web results). We want an evaluation methodology that can handle multiple verticals in multiple positions. Ponnuswami et al. [78] focus on the integration of potentially multiple verticals throughout the Web results. However, they use metrics that measure performance for each vertical independently (vertical click-through rate and coverage). While vertical-specific metrics are informative, we want an evaluation metric that can determine the quality of a particular presentation as a whole.

Our objective in this chapter is to propose and empirically validate a methodology for aggregated web search evaluation. This includes evaluating not only which verticals are presented/suppressed (as we do in Chapters 3 and 4), but also where they are presented. As mentioned above, the number of possible presentations for a given query is large. In spite of this, we want an evaluation metric that uses a relatively small number of human relevance judgements to evaluate any possible presentation for a given query.

1This work was published in ECIR 2011 with co-authors Fernando Diaz, Jamie Callan, and Ben Carterette [5].


Figure 5.1: An example presentation of vertical results for "pittsburgh", where the maps vertical is presented above, the news vertical within, and the images vertical below the top Web results. Consistent with the major commercial search engines (i.e., Bing, Google, and Yahoo!), we assume that same-vertical results must appear grouped together, horizontally (e.g., images) or vertically (e.g., news), within a single block in the output presentation.

The proposed methodology is defined by two main components. The first component is the prediction of a ground truth or reference presentation. The reference presentation marks the best possible presentation a system can produce given the query (and a set of layout constraints). Because the set of all possible presentations is prohibitively large, we do not attempt to identify the reference by eliciting judgements directly on full presentations. Instead, we take a piece-wise, bottom-up approach. In other words, we collect human judgements on sequences of Web and vertical results that must appear grouped together in the final presentation. Then, we use these piece-wise judgements to derive the reference presentation. Finally, we propose that any possible presentation of results can be evaluated based on its distance (using a rank-based distance metric) to the reference. This rank-based distance metric is the second major component. To validate our evaluation metric, we empirically test whether the metric correlates with user preferences. More specifically, we show assessors pairs of presentations and ask whether they prefer one over the other. In cases where the assessors state a preference, we verify whether the preferred presentation is the one scored superior by our metric.

5.1 Related Work in Aggregated Web Search Evaluation

Vertical results presentation focuses on deciding where to present vertical results (if at all). In this chapter, we focus on evaluation. How can we determine the quality of a particular presentation of results from zero or more verticals? Any evaluation metric should be grounded on user behavior. Several studies investigate user behavior with aggregated web search interfaces, for example, by considering click-through behavior and user preference behavior. We review this related work in Section 5.1.1. Subsequently, in Section 5.1.2, we describe existing alternatives for aggregated web search evaluation.

5.1.1 Understanding User Behavior

Several studies (most by Sushmita et al. [87, 100, 101, 102]) investigate how users interact with aggregated web search interfaces and how these affect search quality, both from the user's perspective and in terms of task completion. Sushmita et al. [87, 102] present several analyses of a large commercial query-log, with clicks on Web and cross-vertical content. Several trends were found. First, most sessions (about 90%) had clicks on a single type of content (either Web content or content from the same vertical). Second, most sessions with clicks on multiple types of content included clicks on Web content. Finally, as might be expected, users clicked more on highly ranked content irrespective of its source. However, the distribution was flatter for images results. That is, click-through rate for images results was less affected by rank. This may be due to the visually appealing nature of image thumbnails, typically used to represent images results. From these query-log analyses, one might draw the following conclusions. First, Web results continue to be highly desired. Second, predicting where in the Web results to embed a particular (relevant) vertical is an important task (users tend to examine those ranked higher). Third, click-through behavior may be vertical-dependent.

Sushmita et al. [100] conducted a small task-oriented user study with two types of interfaces: a tabbed interface, where the user could access results from different verticals by clicking on different tabs, and an aggregated interface, where the user was presented the top results from each vertical within the search results page. The aggregated search interface was static: each vertical was allocated a specific query-independent slot. Users were given a topic and asked to compile as much relevant content about the topic as possible. A few trends were observed. The aggregated interface was associated with more clicks. In other words, participants inspected a greater number of results using the aggregated interface. Also, the amount and diversity of content collected by participants (i.e., perceived as being relevant) was greater for the aggregated interface. The user study suggests that users interact with vertical content more when the vertical results are readily visible within the search results page than when the vertical results are accessible using tabs.

In a different user study, Sushmita et al. [101] investigated user click-through behavior using a dynamic aggregated search interface, similar to the one modeled in this dissertation. This work focused on presentations with one of three verticals (images, news, and video) embedded in three positions relative to the Web results (above the first Web result, between Web results 5 and 6, and below the last Web result). They found a correlation between click-through rate and both the relevance of the vertical and its rank. More surprisingly, perhaps, they found a click-through bias in favor of video results. That is, users clicked more on video irrespective of its relevance and rank. Zhu and Carterette [121] conducted a similar study with the images vertical. They found a preference for images ranked high for queries likely to have image intent. From these studies we can draw the conclusion that users prefer relevant verticals ranked high. Thus, a dynamic aggregated interface, where a vertical's position is query-dependent, may be more appropriate than a static one. Our proposed formulation of the task and our evaluation metric are consistent with these studies. More specifically, we always present Web results, we do not assume a fixed slot for a particular vertical, and we propose an evaluation metric that punishes mistakes at the top of the ranking more than at the bottom.

5.1.2 Aggregated Web Search Evaluation

Most research on aggregated web search focuses on vertical selection: predicting which verticals are relevant. Besides the vertical selection work presented in this dissertation (Chapters 3 and 4), Li et al. [67] evaluate classifiers for two verticals: shopping and jobs. Performance was measured in terms of precision and recall for each vertical independently. Diaz [28] and König et al. [63] focused on predicting when to display news results (always displayed above the Web results) and evaluated in terms of correctly predicted clicks and skips. Diaz [29] investigated single-vertical selection in the presence of user feedback. Evaluation was conducted using a simulated stream of queries, where each query impression was associated with at most a single relevant vertical. Performance was measured in terms of correctly predicted queries, weighing false positives less than false negatives. Their assumption is that users are more dissatisfied by not seeing a relevant vertical than by seeing a non-relevant one (i.e., a user can easily skip over a non-relevant vertical to get to the desired Web results).

The above work assumes at most a single relevant vertical per query and, either implicitly or explicitly, assumes a fixed presentation template: a single vertical presented above the Web results or not at all [28, 29, 63, 67]. For the purpose of vertical results presentation evaluation, this work has several limitations. First, even if at most a single vertical were truly relevant for every query, its optimal positioning may depend on its degree of relevance and the relevance of the Web results. Second, in practice, the system can only guess when a vertical is relevant. Thus, displaying multiple verticals at different ranks is a way of resolving contention between verticals. Finally, though prior research suggests that single-vertical intent is more common than multiple-vertical intent [100], in some situations, a user may actually want results from multiple verticals simultaneously. For all these reasons, our goal is an evaluation methodology that can handle an arbitrary number of verticals embedded throughout the Web results.

Ponnuswami et al. [78] propose a method for vertical results presentation with multiple verticals and slotting positions. Their approach is to train independent vertical classifiers (one per vertical) and to assign each vertical to a slot using its prediction confidence value and a set of slot-specific thresholds. In their approach, each vertical is assigned to a slot independently of any other vertical. In a similar fashion, performance for each vertical was evaluated independently of any other. Evaluation was conducted in an operational setting based on implicit user feedback. Their evaluation methodology is to allocate a small percentage of query-traffic to one of two conditions: the control condition, which uses the existing baseline system, and the experimental condition, which uses the proposed system. Then, performance (for a specific vertical and slot) is measured in terms of coverage and vertical click-through rate. Coverage measures the percentage of queries for which the vertical is assigned to the slot. Vertical click-through rate measures the percentage of those queries for which the user clicks on the vertical. The goal for the experimental system is to improve click-through rate while maintaining an acceptable level of coverage.1

Evaluating based on click data has the advantage that the implicit judgements are provided by real users within the context of a real search. On-line evaluation, however, is time-consuming and can degrade the user experience if the experimental system is worse than the baseline. Recent work proposes methods for collecting on-line user-interaction data once and using this data to perform multiple rounds of off-line testing [66, 77]. The objective of this work is to perform off-line testing while avoiding a system bias. That is, we do not want off-line evaluation to favor systems that are similar to the one used to collect the user interaction data. The basic idea is to collect the user interaction data in a completely unbiased fashion, where every system output is equally probable. Then, given a particular model (with a fixed parameter configuration), evaluation can be done by considering only those outputs that are identical to what the system would have produced given the same query. Results show that metrics computed in this off-line fashion closely approximate those computed in an on-line setting using the same experimental system [66, 77].

Whether evaluation is on-line or off-line (i.e., using cached results, features, and user clicks), evaluating based on user interaction data requires a live system with users. Moreover, because clicks (and skips) are noisy, they require many users. Our goal is an evaluation methodology that does not suffer from this cold-start problem. In this work, we evaluate alternative search results by presenting them to users side-by-side and asking which they prefer. Several works elicit preference judgements on pairs of search results.

1If click-through rate improves, but coverage decreases significantly, then the experimental system is simply being more conservative in predicting the vertical, which may not be considered an improvement.

Thomas and Hawking [103] validated the side-by-side comparison approach by presenting assessors with pairs of results of different quality (e.g., Google results 1-10/11-20, or overlapping sets 1-10/6-15). As expected, users typically preferred results 1-10 over 11-20 or 6-15. Sanderson et al. [82] used a similar interface with Amazon's Mechanical Turk to validate a set of test-collection-based metrics: Normalized Discounted Cumulative Gain (NDCG), P@10, and Mean Reciprocal Rank (MRR). NDCG agreed the most with user preferences, obtaining 63% agreement overall and 82% agreement for navigational queries. We use a similar methodology for validating our metric.

5.2 Layout Assumptions

In order to formally define the vertical results presentation task, we must first assume a set of layout constraints. As previously mentioned, our goal in this chapter is an evaluation metric that can measure the quality of any possible presentation of results for a given query. The following layout constraints define the set of all possible presentations.

First, we assume that vertical results (if any) can only be embedded or slotted in specific positions relative to the top Web results (we focus on the top 10 Web results). Specifically, we assume 4 vertical slotting positions: above the first Web result, between Web results 3-4, between Web results 6-7, and below the last Web result. A similar assumption is made in prior work [28, 101, 121]. Effectively, this divides the top 10 Web results into three blocks of Web results, denoted as w1, w2, and w3. Multiple verticals can be presented in a particular slot, that is, above w1, between w1 and w2, between w2 and w3, or below w3. Web results within the same Web block cannot be partitioned in any presentation.

Second, we assume that, if a vertical is presented, a fixed set of its top results must be presented. This assumption simplifies the vertical results presentation task. If a vertical is presented, the system does not have to decide which of its top results to present. Instead, the task is to determine whether to present a vertical given its top results, and, if so, where. Each vertical is associated with a minimum and maximum number of top results that must be presented if the vertical is presented. If the vertical retrieves fewer than its minimum number of results, it is not a candidate vertical.

Third, if a vertical is presented, then its top results must be grouped together, vertically (e.g., news) or horizontally (e.g., images), in the final presentation. This assumption is consistent with all major commercial search providers and is at least partly motivated by the fact that results from different verticals are presented differently. For example, image results are presented as thumbnail images; news results are presented using their title, news-source, and age or publication date; and local business results are presented using the name of the business and its contact information, with their locations often displayed geographically on a map. Combining our second and third assumptions, every candidate vertical (i.e., every vertical that retrieves at least its minimum number of results for the query) forms its own block of results. As with our Web blocks, results from the same vertical block cannot be partitioned in any presentation.

Fourth, we assume that Web blocks w1-w3 are always displayed and always appear in their original order; Web results are not re-ranked relative to each other. Finally, we assume that users prefer to not see results from non-relevant verticals, even below the last Web block, w3. Non-relevant vertical results should be suppressed entirely from view.

5.3 Task Formulation: Block Ranking

Given our layout assumptions, we can formulate the vertical results presentation task as one of ranking blocks of Web and vertical results, as shown in Figure 5.2. A block is defined as a sequence of Web or same-vertical results that must be presented together in the final aggregated results. If a block is presented, then all the results within the block must be presented and must be grouped together. If a block is suppressed, then all the results within the block must be suppressed.

Let B_q denote the set of blocks associated with query q. B_q will always include all three Web blocks (w1-w3) and will include one block for each vertical that retrieves results. The goal of block ranking is to produce an ordering of B_q, denoted by σ_q. Suppressed verticals will be handled using an imaginary "end of search results" block, denoted by eos. Blocks that are ranked above eos are displayed and those ranked below eos are suppressed. Let σ_q(i) denote the rank of block i in σ_q. We say σ_q is a partial ranking because all blocks ranked below eos (i.e., those that are suppressed) are effectively tied at rank σ_q(eos) + 1.

Our objective is an evaluation measure that can determine the quality of any possible presentation σ_q for query q. Our solution is to evaluate alternative presentations based on their distance (using a rank-based distance metric) to a "gold standard" or reference presentation, denoted by σ*_q. We assume that σ*_q defines the best possible block ranking for query q. Given the prohibitively large number of possible presentations for a given query, we do not elicit human judgements directly on alternative presentations. Instead, σ*_q is derived from judgements on individual blocks (we describe this in more detail below). Any evaluation metric should be grounded on user preference data. We validate the metric by making sure that the metric agrees with human preferences stated on pairs of presentations.
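A tiny sketch of this block-ranking representation, with hypothetical block identifiers: everything ranked below the imaginary eos block counts as suppressed.

# A presentation is an ordering of the query's blocks; everything ranked below the
# imaginary 'eos' block is suppressed. The block identifiers here are hypothetical.
sigma_q = ["maps", "w1", "news", "w2", "w3", "eos", "images", "video"]

def displayed(sigma):
    return sigma[:sigma.index("eos")]

def suppressed(sigma):
    return sigma[sigma.index("eos") + 1:]

print(displayed(sigma_q))    # ['maps', 'w1', 'news', 'w2', 'w3']
print(suppressed(sigma_q))   # ['images', 'video'], effectively tied below eos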

5.4 Metric-Based Evaluation Methodology

Our metric-based evaluation methodology is depicted in Figure 5.3. The primary goal is to derive a reference presentation for query q, which we denote by σ*_q. Then, we propose that any arbitrary presentation σ_q can be evaluated based on its distance (using a rank-based distance metric) to σ*_q. Our methodology proceeds in 4 steps. First, given query q (and our set of layout constraints), we compose a set of blocks associated with the query, denoted by B_q. The set B_q will include all three Web blocks w1-w3 and one block for each vertical that retrieves more than the minimum number of results (Figure 5.3 (a-b)). Then, we generate σ*_q using human relevance judgements on individual blocks. Prior work on document-ranking evaluation found that assessors work faster and have higher agreement when providing pairwise preference judgements than when providing absolute judgements [18].

[Figure 5.2 appears here: an example ranking of an images block, w1, a news block, w2, and w3, with a divider between displayed and suppressed blocks.]

Figure 5.2: The vertical results presentation task can be formulated as block ranking.

We assume this is also true for blocks of Web and vertical results. Thus, the second step is to collect pairwise preference judgements on all block-pairs in B_q (Figure 5.3 (c)). More specifically, each block-pair i, j ∈ B_q is presented to assessors who are asked to state a preference. We denote these pairwise preference judgements by π_q. Third, given block-pair preference judgements π_q, we use a voting method to derive the reference presentation σ*_q (Figure 5.3 (d)). Finally, we propose that any alternative presentation σ_q for q can be evaluated using a rank-based metric to measure its distance to σ*_q (Figure 5.3 (e)). We describe these steps in more detail in the next sections.

5.4.1 Collecting Block-Pair Preference Judgements

Given B_q, our methodology is to collect pairwise preference judgements on all block-pairs i, j ∈ B_q, with the exception of pairs of Web blocks. We do not collect preference judgements between Web block pairs because we assume that Web blocks must be presented in their natural order, that is, σ*_q(w1) < σ*_q(w2) < σ*_q(w3). Each pairwise preference judgement consists of a triplet of the form (q, i, j), composed of query q and block pair i, j presented side-by-side in random order. A screenshot of the assessment interface is shown in Figure 5.4. Assessors were given three choices: i is better than j, j is better than i, or both are bad.

Figure 5.3: Approach overview: (a) fix slots, (b) compose blocks Bq, (c) collect preferences πq, (d) derive reference σq∗, (e) evaluate.

We omitted the choice that both i and j are equally good to prevent assessors from abstaining from difficult decisions. We interpret the assessor selecting “both are bad” as evidence that both i and j should be suppressed for query q. These preference judgements, denoted by πq, are the raw input to the voting method that derives the reference presentation σq∗. Let πq(i, j) denote the strength with which block i is preferred over block j (note that, in general, πq(i, j) ≠ πq(j, i)). In general, πq(i, j) can be set differently depending on the particular choice of assessors. For example, suppose assessor reliability is known. Then, we can collect one judgement per block-pair and set πq(i, j) according to the reliability of an assessor who prefers block i over j. In a different setting, for example, when using non-expert annotators, we can collect redundant judgements per block-pair and set πq(i, j) equal to the number of assessors who prefer i over j.
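To make the bookkeeping concrete, the following is a minimal sketch (in Python, with hypothetical names) of how redundant assessor judgements for one query could be aggregated into the preference strengths πq, crediting “both are bad” votes to the imaginary eos block in the spirit of the voting step described next. It is an illustration under these assumptions, not the thesis' actual pipeline.

```python
from collections import defaultdict

def build_preferences(judgements):
    """Aggregate redundant pairwise judgements into preference strengths pi_q.

    `judgements` is a list of (block_i, block_j, choice) tuples for one query,
    where `choice` is "i", "j", or "bad" (the "both are bad" option).
    Returns a dict pi such that pi[(a, b)] is the number of assessors who
    preferred block a over block b; "both are bad" votes are credited to the
    imaginary 'eos' block.
    """
    pi = defaultdict(int)
    for block_i, block_j, choice in judgements:
        if choice == "i":
            pi[(block_i, block_j)] += 1
        elif choice == "j":
            pi[(block_j, block_i)] += 1
        else:  # "both are bad": evidence that eos should outrank both blocks
            pi[("eos", block_i)] += 1
            pi[("eos", block_j)] += 1
    return pi

# Example: four assessors judged the (news, w1) pair for some query.
votes = [("news", "w1", "i"), ("news", "w1", "i"),
         ("news", "w1", "j"), ("news", "w1", "bad")]
pi_q = build_preferences(votes)
print(dict(pi_q))
# {('news', 'w1'): 2, ('w1', 'news'): 1, ('eos', 'news'): 1, ('eos', 'w1'): 1}
```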

5.4.2 Deriving the Reference Presentation

There exist many voting methods for aggregating item preference data into a single ranking. In this work, we used the Schulze voting algorithm because of its widespread adoption and ease of implementation [85]. The Schulze algorithm is described in Algorithm 1. The input to the algorithm is Bq and the preference judgements πq, and the output is the reference presentation σq∗.

Figure 5.4: Block-pair assessment interface. Two blocks are displayed randomly side-by-side (in this case books block and w3) and the assessor is asked which would best satisfy a user for the given query and description: the left block, the right block, or neither.

The general idea behind the Schulze voting method is the following. Let πq(i, j) denote the strength with which block i is preferred over block j. We say that i directly defeats j if π(i, j) > π(j, i). A beatpath from i to j is defined as a direct or indirect defeat from i to j. An indirect beatpath from i to j is a sequence of direct defeats from i to j. For example, if i directly defeats k and k directly defeats j, then this is an indirect beatpath from i to j. The strength of an indirect beatpath corresponds to the strength associated with the weakest direct defeat in the beatpath. Finally, we say that i defeats j if the strongest (direct or indirect) beatpath from i to j is stronger than the one from j to i. Blocks are then ranked by their number of defeats.

As previously noted, the aggregation task is not only ranking blocks, but also deciding which vertical blocks to suppress. Suppressed verticals are handled using the imaginary eos block. The eos block is treated by the Schulze method the same as any non-imaginary block. Every time an assessor selected that both i and j are bad, we incremented the value of π(eos, j) and π(eos, i). Also, recall that we assume that Web blocks (w1−3) are always presented and never re-ranked. This constraint was imposed by setting π(eos, w∗) = π(wx, wy) = 0, where x > y, and by setting π(w∗, eos) = π(wx, wy) = N, where x < y and N is some large number (we used N = 1000).

Input: blocks Bq and pairwise preferences πq
Output: reference presentation σq∗
foreach i ∈ Bq do
    foreach j ∈ Bq do
        if i ≠ j then
            if πq(i, j) > πq(j, i) then
                p(i, j) ← πq(i, j) − πq(j, i)
            else
                p(i, j) ← 0
            end
        end
    end
end
foreach i ∈ Bq do
    foreach j ∈ Bq do
        if i ≠ j then
            foreach k ∈ Bq do
                if i ≠ k and j ≠ k then
                    p(j, k) ← max( p(j, k), min( p(j, i), p(i, k) ) )
                end
            end
        end
    end
end
foreach i ∈ Bq do
    foreach j ∈ Bq do
        if i ≠ j then
            if p(i, j) > p(j, i) then
                wins(i) ← wins(i) + 1
            end
        end
    end
end
σq∗ ← SortDescending(wins)
Algorithm 1: The Schulze Voting Algorithm.
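For readers who prefer running code to pseudocode, the sketch below transcribes Algorithm 1 into Python. The dictionary pi plays the role of πq (missing pairs default to zero); the function and variable names are illustrative only.

```python
def schulze_ranking(blocks, pi):
    """Widest-path (Schulze) ranking sketch following Algorithm 1.

    `blocks` is the set B_q (including 'eos'); `pi[(i, j)]` is the preference
    strength of i over j. Returns the blocks sorted by their number of
    pairwise defeats, strongest first.
    """
    p = {}
    # Direct defeat strengths.
    for i in blocks:
        for j in blocks:
            if i == j:
                continue
            if pi.get((i, j), 0) > pi.get((j, i), 0):
                p[(i, j)] = pi.get((i, j), 0) - pi.get((j, i), 0)
            else:
                p[(i, j)] = 0
    # Strongest (direct or indirect) beatpaths.
    for i in blocks:
        for j in blocks:
            if i == j:
                continue
            for k in blocks:
                if k in (i, j):
                    continue
                p[(j, k)] = max(p[(j, k)], min(p[(j, i)], p[(i, k)]))
    # Rank by number of defeats.
    wins = {i: sum(1 for j in blocks if i != j and p[(i, j)] > p[(j, i)])
            for i in blocks}
    return sorted(blocks, key=lambda b: wins[b], reverse=True)
```

Blocks that end up below eos in the returned order are the suppressed ones.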

5.4.3 Measuring Distance from the Reference

Our proposed method is to evaluate any possible presentation σq by measuring its distance to the reference σq∗. We used a rank-based distance metric. Possibly the most widely used rank-based distance metric is Kendall's tau (K), which counts the number of discordant pairs between two rankings,

    K(σ∗, σ) = ∑_{σ∗(i) < σ∗(j)} [σ(i) > σ(j)],

where σ(i) denotes the rank of element i in σ. Kendall's tau treats all discordant pairs equally regardless of position. In our case, however, we assume that users scan results from top to bottom. Therefore, we care more about a discordant pair at the top of the ranking than one at the bottom. For this reason, we used a variation of Kendall's tau proposed by Kumar and Vassilvitskii [64], referred to as generalized Kendall's tau (K∗), which can encode positional information using element weights.

To account for positional information, K∗ models the cost of an adjacent swap, denoted by δ. In traditional Kendall's tau, δ = 1, irrespective of rank: adjacent swaps are treated equally regardless of position. In our case, however, we would like discordant pairs at the top to be more influential. Let δr denote the cost of an adjacent swap between elements at ranks r − 1 and r. We used the DCG-like cost function proposed in Kumar and Vassilvitskii [64],

    δr = 1/log(r) + 1/log(r + 1),

which is defined for 2 ≤ r ≤ n. Given rankings σ∗ and σ, element i's displacement weight p̄i(σ∗, σ) is given by the average cost (in terms of adjacent swaps) it incurs in moving from rank σ∗(i) to rank σ(i),

    p̄i(σ∗, σ) = 1,  if σ∗(i) = σ(i);
    p̄i(σ∗, σ) = ( p_{σ∗(i)} − p_{σ(i)} ) / ( σ∗(i) − σ(i) ),  otherwise,

where p_r = ∑_{s=2}^{r} δs. The K∗ distance is then given by

    K∗(σ∗, σ) = ∑_{σ∗(i) < σ∗(j)} p̄i(σ∗, σ) p̄j(σ∗, σ) [σ(i) > σ(j)].

A discordant pair’s contribution to the metric is equal to the product of the two element weights.
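The following sketch implements the K∗ distance as reconstructed above. Rankings are passed as block-to-rank dictionaries (rank 1 is the top, ties share a rank); the logarithm base is not pinned down in the text, so the natural logarithm is assumed here.

```python
import math
from itertools import combinations

def k_star(sigma_ref, sigma):
    """Generalized Kendall's tau distance sketch (Kumar & Vassilvitskii style)."""
    n = len(sigma_ref)
    # delta_r: cost of an adjacent swap between ranks r-1 and r (2 <= r <= n).
    delta = {r: 1.0 / math.log(r) + 1.0 / math.log(r + 1) for r in range(2, n + 1)}
    # p_r: cumulative swap cost of moving an element from rank 1 down to rank r.
    p = {1: 0.0}
    for r in range(2, n + 1):
        p[r] = p[r - 1] + delta[r]

    def weight(i):
        # Average per-rank cost incurred moving i from its reference rank to
        # its predicted rank; 1 if the element did not move.
        if sigma_ref[i] == sigma[i]:
            return 1.0
        return (p[sigma_ref[i]] - p[sigma[i]]) / (sigma_ref[i] - sigma[i])

    dist = 0.0
    for i, j in combinations(sigma_ref, 2):
        if sigma_ref[i] > sigma_ref[j]:
            i, j = j, i  # order the pair so that i precedes j in the reference
        if sigma_ref[i] < sigma_ref[j] and sigma[i] > sigma[j]:  # discordant pair
            dist += weight(i) * weight(j)
    return dist

# A single discordant pair at the very top is heavily weighted.
ref = {"news": 1, "w1": 2, "w2": 3, "eos": 4, "images": 5}
alt = {"w1": 1, "news": 2, "w2": 3, "eos": 4, "images": 5}
print(k_star(ref, alt))
```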

5.5 Methods and Materials

One important property of any evaluation measure is that it should correlate with user preference data. In other words, aggregated search results preferred by humans should be judged superior by the metric. In this section, we describe a user study that was conducted to empirically validate the metric. In the next few sections, we first describe our block-pair assessment interface. Block-pair assessments were collected in order to derive reference presentations for a set of queries. Then, we describe our verticals and queries. Finally, we describe our user study, where assessors were presented pairs of presentations (shown side-by-side) and asked to state a preference.

5.5.1 Block-Pair Preference Assessment

Given query q, we collected pairwise preference assessments for all pairs i, j ∈ Bq (except between pairs of Web blocks). These assessments were collected in order to derive a reference presentation for each query, denoted by σq∗. All block-pair judgements were collected using Amazon's Mechanical Turk (AMT). Workers were compensated US$ 0.01 for each judgement. Because we used untrained, non-expert assessors, we collected redundant judgements for each block-pair. Each block-pair was judged by 4 different assessors. Thus, we set πq(i, j) equal to the number of assessors who preferred block i over block j. Likewise, we set πq(eos, i) equal to the number of assessors who selected that block i was bad (i.e., should be suppressed) in conjunction with another block j. Following Sanderson et al. [82], quality control was done by including 150 “trap” HITs (a Human Intelligence Task, or HIT, is the unit of work on AMT). Each trap HIT consisted of a triplet (q, i, j) where either i or j was taken from a query other than q. We interpreted an assessor preferring the set of extraneous results as evidence of malicious or careless judgement. If an assessor failed more than 2 trap HITs, all their judgements were removed from the judgement pool. As previously mentioned, while collecting block-pair judgements, in addition to the query, assessors were given a topic description to help disambiguate the user's intent. In a preliminary experiment, we observed an improvement in inter-annotator agreement from giving assessors topic descriptions. We were careful, however, not to explicitly mention vertical intent. For example, for the query “pressure cooker”, we stated: “The user plans to buy a pressure cooker and is looking for product information.” We did not say: “The user is looking for shopping results.”

5.5.2 Verticals

We focused on a set of 13 verticals constructed using freely-available search APIs pro- vided by eBay (shopping), Google (blogs, books, weather), Recipe Puppy (recipes), Yahoo! (community Q&A, finance, images, local, maps, news), Twitter (micro-blogs), and YouTube (video). Figure 5.5 shows a few examples. A screenshot of every vertical is given in Appendix A. Each vertical was associated with a unique presentation of results. For example, news results were associated with the article title and url, the news source title and url, and the article’s publication date. The news block also included an optional thumbnail image associated with the top result. Shopping results were associated with the product name, condition (e.g., new, used), and price, and also included an optional thumbnail image associated with the top result. Local results were associated with the business name and url, its address and telephone number, and the number of reviews associated with it, and included a map. Each vertical was associated with a minimum and maximum number of top results (e.g., 1-4) from which to construct a block.

Figure 5.5: Example vertical blocks: (a) books, (b) finance, (c) local, (d) video.

5.5.3 Queries

Our evaluation was conducted on a set of 72 queries from two different sources: the AOL query log and Google Trends. Google Trend queries cover recent events and topics currently discussed in news articles, blogs, and on Twitter (e.g., “us open fight”). AOL queries cover more persistent topics likely to be relevant to verticals such as local (e.g., “cheap hotels in anaheim ca”), recipe (e.g., “cooking ribs”), and weather (e.g., “marbella weather”). Queries were selected manually in order to ensure coverage across our set of 13 verticals. The complete list of queries and descriptions is given in Appendix B.

5.5.4 Empirical Metric Validation

The goal of our user study is to determine whether our metric (the K∗ distance between σq and σq∗) agrees with user preferences. That is, whether presentations that are preferred by human assessors are also judged superior by the metric. Assessors were shown pairs of presentations side-by-side (along with the query and its description) and were asked to state a preference (“left is better”, “right is better”). We assumed that assessors would have difficulty deciding between two bad presentations. Therefore, to reduce the cognitive load, we also gave assessors a “both are bad” option (as it turns out, however, this option was seldom used). Our hypothesis is that, if one presentation is preferred over the other, the metric will agree with the stated preference. In other words, the preferred presentation will be closer to σq∗ in terms of K∗. Significance was tested using a sign test, where the null hypothesis is that the metric selects the preferred presentation randomly with uniform probability.

For this analysis, we also used Amazon's Mechanical Turk, using trap HITs to detect unreliable assessors. As before, assessors were compensated with US$ 0.01 per judgement. Given that presentation-pair assessments require a greater effort than block-pairs, presentation-pair assessments took much longer to be completed by AMT workers. Block-pairs were assessed in a couple of days, while presentation-pairs took a couple of weeks. A screenshot of our presentation-pair assessment interface is shown in Figure 5.6.

Conducting this analysis required a method for selecting pairs of presentations to show assessors. One alternative is to sample pairs uniformly from the set of all presentations. However, we were particularly interested in pairs of presentations from specific regions of the metric space. For example, does the metric agree with user preferences when one presentation is close to the reference (presumably high quality) and the other is far (presumably low quality)? Does it agree when both are close to the reference or both are far? To investigate these questions, we sampled presentation-pairs using a binning approach. For each query, we enumerated every possible presentation and divided these into three bins: a presumably high-quality bin (H), a medium-quality bin (M), and a low-quality bin (L). The binning was done based on the metric value. The metric distribution is such that this produces bins where |H| < |M| < |L|. The H bin is the smallest and contains those presentations that are nearest to σq∗. The L bin is the largest and contains those presentations that are furthest from σq∗. For each query, we sampled 4 presentation-pairs from each of the six bin-combinations (H-H, H-M, H-L, M-M, M-L, and L-L) and collected 4 redundant judgements per presentation-pair. In total, this resulted in 1,728 presentation-pairs and 6,912 individual preference judgements.
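A rough sketch of this binning-and-sampling procedure is given below. The bin cut points, the random seed, and the assumption that every bin is non-empty are placeholders of this sketch; the thesis derives its bins from the metric distribution itself (which is why |H| < |M| < |L|), so the quantile cutoffs here are only illustrative.

```python
import random
from itertools import combinations_with_replacement

def sample_presentation_pairs(presentations, distance,
                              cutoffs=(0.33, 0.66), pairs_per_bin=4, seed=0):
    """Split candidate presentations into H/M/L bins by their distance to the
    reference, then sample `pairs_per_bin` presentation-pairs from each of the
    six bin combinations. `distance` maps a presentation id to its K* value."""
    rng = random.Random(seed)
    ranked = sorted(presentations, key=lambda p: distance[p])
    lo, hi = int(cutoffs[0] * len(ranked)), int(cutoffs[1] * len(ranked))
    bins = {"H": ranked[:lo], "M": ranked[lo:hi], "L": ranked[hi:]}
    sampled = []
    for a, b in combinations_with_replacement("HML", 2):  # H-H, H-M, ..., L-L
        for _ in range(pairs_per_bin):
            x, y = rng.choice(bins[a]), rng.choice(bins[b])
            if x != y:  # skip the rare draw of the same presentation twice
                sampled.append((x, y))
    return sampled
```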

5.6 Experimental Results

Our evaluation methodology is to collect preference judgements on block-pairs for a given query, to use these to derive a reference block-ranking, and, finally, to evaluate

alternative block-rankings by measuring their distance to the reference. To validate this methodology, we present a user study where we verify whether block-rankings that are preferred by assessors are scored superior by the metric. Before presenting our metric validation results (i.e., the level of agreement between the metric and the preferred block-ranking), we present inter-assessor results (i.e., the level of agreement between assessors). We collected two different types of judgements from assessors: block-pair judgements (used to derive the reference presentation) and presentation-pair judgements (used to validate the metric). Inter-assessor agreement results on block-pairs are given in Section 5.6.1. Inter-assessor agreement results on presentation-pairs are given in Section 5.6.2. Finally, in Section 5.6.3, we present our metric-validation results.

Figure 5.6: Presentation-pair assessment interface. Two presentations are displayed randomly side-by-side and the assessor is asked which they prefer: the left presentation, the right presentation, or neither.

5.6.1 Assessor Agreement on Block-Pair Judgements

The total number of AMT workers who contributed block-pair assessments was 120. Of these, 2 workers had their assessments removed due to failing more than 2 “traps”. For the remaining 118/120, worker contribution followed a power-law distribution. That is, about 20% of the workers (24/118) completed about 80% of all block-pair assessments (9,856/12,293).

We report inter-annotator agreement in terms of Fleiss' Kappa (κf) [37] and Cohen's Kappa (κc) [22], both of which correct for agreement due to chance. Fleiss' Kappa measures the (chance-corrected) agreement between any pair of assessors over a set of triplets. Cohen's Kappa measures the (chance-corrected) agreement between a specific pair of assessors over a common set of triplets. Compared to Cohen's Kappa, Fleiss' Kappa is convenient because it ignores the identity of the assessor-pair. It is designed to measure agreement over instances labeled by different (even disjoint) sets of assessors. However, precisely because it ignores the identity of the assessor-pair, Fleiss' Kappa is dominated by the level of agreement between the most active assessors, which we know to be a select few (i.e., 20% of the workers completed 80% of the judgements). To compensate for this, in addition to reporting Fleiss' Kappa, we report the Cohen's Kappa agreement for all assessor-pairs with at least 100 triplets in common.
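For reference, Fleiss' Kappa can be computed from a per-triplet table of response counts as in the sketch below, written for a fixed panel size of four redundant judgements per triplet; this is the standard textbook formula, not code from the thesis.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts, e.g. one row per
    block-pair triplet and one column per response option ("i better",
    "j better", "both are bad"). Assumes every item received the same number
    of judgements."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[k] for row in counts) for k in range(len(counts[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 triplets, 4 assessors each, categories [i, j, both-bad].
print(round(fleiss_kappa([[4, 0, 0], [2, 2, 0], [3, 0, 1]]), 3))  # 0.034
```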

The Fleiss' Kappa agreement over all triplets was κf = 0.656, which is considered substantial agreement based on Landis and Koch [65]. In terms of Cohen's Kappa agreement, there were 25 pairs of assessors with at least 100 triplets in common. Of these, 5 (20%) had moderate agreement (0.40 < κc ≤ 0.60), 16 (64%) had substantial agreement (0.60 < κc ≤ 0.80), and the remaining 4 (16%) had perfect agreement (0.80 < κc ≤ 1.00). Overall, assessor agreement on block-pairs was high. We interpret this as evidence that assessors did not have difficulty providing preferences for pairs of Web and vertical blocks. In terms of sources of disagreement, a pair of assessors could disagree in two ways: by having one assessor state a preference and the other state the opposite preference, or by having one state a preference and the other state that “both are bad”. We may view the first type of disagreement as more severe: the assessors prefer the blocks in the opposite order. To examine which type of disagreement was more common, we used a weighted version of Cohen's Kappa [23] and penalized the first type of disagreement more than the second. This increased the Cohen's Kappa agreement for 20/25 assessor-pairs with at least 100 common triplets. Therefore, when assessors disagreed, it was typically not because they preferred the blocks in the opposite order, but rather because one assessor stated a preference and the other stated “both are bad”.

5.6.2 Assessor Agreement on Presentation-Pair Judgements

Assessor agreement on presentation-pairs is given in Table 5.1. Overall, assessor agree- ment on presentation-pairs was low (κf = 0.216), which is considered fair agreement [65]. It was significantly lower than assessor agreement on block-pairs. This may not be sur- prising. Agreement on presentation-pairs requires that assessors agree on the relative costs associated with different types of errors: false-positives (displaying a non-relevant vertical), false-negatives (suppressing a relevant vertical), and ranking errors (displaying a relevant vertical in the wrong position). The cost associated with each error type may be highly user-specific. More than 4 judgements per presentation-pair may be required to see greater convergence.

Although assessor agreement on presentation-pairs was low, a few trends are worth noting. First, agreement was particularly low (nearly random) on H-H pairs (κf = 0.066). This is because the H bin corresponds to those presentations closest to σq∗ based on K∗, which focuses on discordant pairs within the top ranks. As it turns out, about half of all H-H pairs had the same top-three blocks and all pairs had the same top-two blocks. The low inter-assessor agreement may be explained by users primarily focusing on the top results, perhaps rarely scrolling down to see the results below the “fold”. Alternatively, it may be that assessors have a hard time distinguishing between good presentations. Further experiments are required to determine the exact cause of disagreement. Assessor agreement was highest for H-M and H-L pairs. This means that assessors agreed more on pairs where one presentation was close to σq∗ (based on K∗) and the other was far. Agreement was lower on M-M, M-L, and L-L pairs. We revisit these points below.

bins    κf
H-H     0.066
H-M     0.290
H-L     0.303
M-M     0.216
M-L     0.179
L-L     0.237
all     0.216

Table 5.1: The Fleiss' Kappa agreement on presentation-pairs.

5.6.3 Metric Validation Results

Given the low level of inter-assessor agreement on presentation-pairs, rather than focus on the metric’s agreement with each individual preference, we focus on the metric’s agree- ment with a majority preference. We present results for two levels of majority preference: a majority preference of 3/4 or greater and a perfect (4/4) majority preference. These results are presented in Table 5.2. The “pairs” column shows the number of presenta- tion pairs for which the level of majority preference was observed. The “% agreement” column shows the percentage of these pairs for which the metric agreed with the ma- jority preference. Notice that, across bin-combinations, column “pairs” correlates with inter-assessor agreement (Table 5.1). The metric’s agreement with the majority preference was 67% on pairs where at least 3/4 assessors preferred the same presentation and 73% on pairs where all (4/4) assessors preferred the same presentation (both significant at the p < 0.005 level). Agreement with each individual preference (not in Table 5.2) was 60% (also significant at the p < 0.005 level).

bins    majority preference    pairs    % agreement
all     3/4 or greater         1151     67.07% ‡
H-H     3/4 or greater          164     60.37% ‡
H-M     3/4 or greater          210     81.90% ‡
H-L     3/4 or greater          204     84.31% ‡
M-M     3/4 or greater          184     57.61% †
M-L     3/4 or greater          187     50.80%
L-L     3/4 or greater          202     63.37% ‡
all     4/4                     462     72.51% ‡
H-H     4/4                      47     65.96% †
H-M     4/4                      95     87.37% ‡
H-L     4/4                      97     91.75% ‡
M-M     4/4                      75     58.67%
M-L     4/4                      71     54.93%
L-L     4/4                      77     63.64% †

Table 5.2: Metric agreement with majority preference. Significance is denoted by † and ‡ at the p < 0.05 and p < 0.005 level, respectively.

Possibly the most important trend observed in these results is that the metric's agreement with the majority preference was higher on pairs where there was greater consensus between assessors. Across all bin-combinations, the metric's agreement with the majority preference was higher on presentation-pairs that had a perfect (4/4) majority preference than on pairs that had a (3/4) majority preference or greater. This is a positive result if we primarily care about pairs where one presentation was strongly (or even unanimously) preferred over the other. This same trend can also be seen across bin-combinations. The metric's agreement with the majority was the highest on H-M and H-L pairs (82-92%). These were also the bin-combinations with the highest inter-assessor agreement (κf = 0.290 for H-M and κf = 0.303 for H-L).

The stronger agreement between the metric and the majority preference for H-M and H-L pairs, as well as the higher inter-assessor agreement, also means that, on average, the reference presentation σq∗ was good. Assessors strongly preferred presentations close to σq∗ over presentations far from σq∗ in terms of K∗. The metric's agreement on H-H pairs (60-66%) was lower. However, as discussed above, these pairs were very similar in terms of the top-ranked blocks and assessor agreement was also low.

Finally, the metric was less reliable for M-M, M-L, and L-L pairs. However, again, inter-assessor agreement was also lower for pairs in these bin-combinations (κf = 0.216, κf = 0.179, and κf = 0.237, respectively). Inter-assessor agreement (and the metric's agreement with the majority preference) was lower when neither presentation was close to σq∗.

5.7 Discussion

We examined the queries for which the metric’s agreement with the majority preference was the lowest. In some cases, assessors had a strong preference towards a particular ver- tical, but only when seen in the context of a full presentation, not during the block-pair assessment phase. For example, for the query “ihop nutritional facts”, assessors favored presentations with images ranked high. For the query “nikon coolpix”, assessors favored presentations with shopping ranked high. For the queries “san bruno fire”, “learn to play the banjo”, “miss universe 2010”, and “us open fight”, assessors favored presentations with video ranked high. Compared to some of the other verticals, the images, shopping, and video verticals are visually appealing (i.e., all include at least one thumbnail image). Prior research found a click-through bias in favor of visually appealing verticals (e.g., video) [101]. It may be that this type of bias affected assessors more on presentation-pairs (i.e., where the ver- tical is embedded within less visually appealing results) than on block-pairs (where the vertical is shown in isolation). If accounting for such a bias is desired, then future work might consider incorporating more context into the block-pair assessment interface. For example, one possibility would be to show presentation-pairs where the only difference is that the vertical-pair is swapped, as in Figure 5.7. This block-pair assessment interface would allow the direct comparison between two blocks and would present the blocks in the context of other results.

Figure 5.7: Context-aware block-pair assessment interface.

One important component of this work was the use of Amazon’s Mechanical Turk to obtain relevance judgements (pairwise preferences for block-pairs and presentation- pairs). As previously noted, we found that worker participation tends to follow a power- law: 20% of the assessors complete 80% of the assessments. At first, this may seem undesirable. We may prefer greater diversity of opinion. This trend, however, has a major benefit for doing quality control: 80% of the assessments are made by assessors who contribute many assessments, and thereby provide enough evidence to assess their reliability (for example, using traps). The difficulty, however, lies in determining the reliability of the 20% produced by assessors who contribute only a few assessments. Thus, one way to improve quality control might be to increase the barrier of entry for

assessors, for example, using qualification tests. Assessors should be given an incentive to provide either no assessments or many.

Our formulation of the presentation task as block-ranking assumes a one-dimensional presentation of Web and vertical results (e.g., Figure 5.1). This assumption is motivated by prior research which suggests that users interact with vertical results more when these are directly displayed within the Web results [100]. This is also the most common presentation strategy used by commercial search providers (e.g., Bing, Google, Yahoo!). While we assume a one-dimensional display, there are vertical presentation strategies that are multi-dimensional. Yahoo!, for example, sometimes presents multiple verticals horizontally using a tabbed display above the first Web result (e.g., Figure 5.8). Our assumption of a one-dimensional display affects two different components of our evaluation methodology. First, it affects how we derive the reference presentation. The Schulze voting algorithm [85] is used to produce a ranking of blocks based on their number of pairwise defeats over other blocks. Under the assumption of a one-dimensional display, there is a direct correspondence between the reference block-ranking and the reference presentation (thus, we refer to both as simply the reference). Our assumption, which is common to most IR research, is that users scan results from top to bottom. Second, it affects our evaluation metric. We use a variant of Kendall's tau, which measures the distance between pairs of (one-dimensional) rankings.

Extending our evaluation methodology to handle multi-dimensional displays would require re-thinking both of these components. It may be possible, however, to continue modeling the reference presentation as a ranking of blocks and to continue using a rank-based metric to measure distance from the reference block-ranking. Extending the first component (i.e., deriving the reference presentation) would require learning an indirect correspondence between a reference block-ranking and how blocks should be arranged in a multi-dimensional display. For example, in Figure 5.8, it may be that users typically examine results first from left to right and then from top to bottom (this is a simple scenario). If so, then this provides a recipe for arranging blocks given a reference block-ranking. Extending the second component (i.e., measuring distance from the reference block-ranking) would require learning the weights that users assign to different types of discordant pairs. These weights would have to be a function of the multi-dimensional template. For example, in Figure 5.8, it may be that horizontal discordant pairs are less important than vertical discordant pairs. Then, it might be possible to incorporate these weights into the generalized Kendall's tau distance metric [64].

5.8 Summary

In this chapter, our objective was a methodology for evaluating the end-to-end output of an aggregated search system. The general idea is to evaluate the quality of an arbitrary presentation σq by measuring its distance (using K∗) to a reference presentation σq∗. Our method for deriving σq∗ focuses on preference judgements on block-pairs, where a block is a sequence of Web or same-vertical results that must appear grouped together (vertically or horizontally) in the final ranking. The proposed methodology has the following advantages.

Figure 5.8: Example of a multi-dimensional vertical display. Given the query “michelle obama”, Yahoo! presents news, images, video, and twitter within a horizontal tabbed display. Other verticals can appear embedded within the Web results below.

• It does not require a live system with (many) users.

• It is cost-effective. With fewer than about 400 judgements per query, we can evaluate the quality of any presentation of results. Furthermore, we show that block-pair judgements can be collected using a large pool of untrained and inexpensive assessors.

• It is portable. The methodology facilitates the construction of a portable test collection: a set of queries with relevance judgements (on block-pairs). The methodology is portable in the sense that these relevance judgements can be distributed across researchers and used to directly compare systems.

• It is general. We used a particular interface for assessing block-pairs, a particular voting method for deriving the reference, and a particular rank-similarity metric for measuring distance from the reference. Future work may consider others.

• It is grounded in assessor behavior. We empirically show that the metric correlates with user preferences, especially on presentation-pairs where one presentation was strongly preferred across multiple assessors.

Several open questions remain. Assessor agreement on presentation-pairs was low. Further experiments are needed to understand why. It may be, for example, that users assign a different cost to different types of errors (false positives, false negatives, ranking errors). Also, in some cases, assessors favored a particular vertical only when seen within the context of other results. There may be preferential biases that affect presentation-pair judgements more than block-pair judgements. Accounting for these presentation-level biases in the block-pair judgements may require re-thinking the block-pair assessment interface.

Chapter 6

Vertical Results Presentation: Approaches

End-to-end aggregated web search requires not only deciding which verticals are relevant (vertical selection), but also deciding where in the Web results to present them (vertical results presentation). If we assume that vertical results must be presented in specific positions relative to the Web results (i.e., above, below, and in between certain Web results), then the vertical results presentation task can be cast as block-ranking. A block is defined as a short sequence of Web or vertical results that must appear grouped together in the output presentation. In the previous chapter, we presented an evaluation methodology for block-ranking. Based on this methodology, the objective is to produce a block-ranking

σq that approximates a “gold-standard” or reference ranking σq∗. A user study was con- ducted to empirically verify that when assessors prefer one block-ranking over another, the preferred block-ranking is closer (in terms of a rank-based distance metric) to σq∗. This methodology facilitates not only evaluation, but also model learning. A set of la- beled queries (each with a known σq∗) can be used to train a model to produce output that approximates σq∗. In this Chapter, we develop supervised approaches to block-ranking. Casting block-ranking as a supervised machine learning problem has two major chal- lenges. First, different verticals focus on different types of media (e.g., news, images, videos) and serve different types of search goals (e.g., search for local businesses, items for sale, weather information). Therefore, results from different verticals are associated with different types of meta-data, which are likely to influence their relevance to a query. For example, news results are associated with a publication date, local results are asso- ciated with a geographical location, and community Q&A results are associated with a number of suggested answers. If we want to use these meta-data as features, then block- ranking requires approaches that can handle an inconsistent feature representation across blocks from different verticals. Secondly, even if a feature is common to multiple verti- cals, it may not be equally predictive. For example, the number of results retrieved by the vertical search engine may be predictive for news. However, this type of information is meaningless for a vertical that retrieves at most a single result, such as weather. Alter- natively, a feature may be predictive for different verticals, but in the opposite direction. For example, the query-term “pics” may be positive evidence for images, but negative evidence for shopping. Likewise, a city name in the query may be positive evidence for local, but negative evidence for finance. Thus, block-ranking may require approaches that can exploit a vertical-specific predictive relationship between features and relevance. We evaluate methods that address both challenges in different ways.

Consistent with our work on vertical selection (Chapters 3 and 4), we evaluate approaches that rank blocks as a function of a set of features. To this end, we exploit various types of features, including, for example, properties of the query, vertical-specific click-through data, and the text-similarity between the query and the results presented in the block. We partition features into two general classes: pre-retrieval and post-retrieval features. During vertical results presentation, we assume that the system has already issued the query to each vertical, or at least to those predicted relevant during vertical selection. Thus, post-retrieval features can be derived directly from the vertical results. We investigate the cost-benefit of post-retrieval features. Are they useful for predicting where a vertical should be presented? Can they also be used to re-evaluate vertical selection decisions in light of the vertical results? Answering these questions affects, for example, whether the end-to-end system should issue the query to as many verticals as possible (or to those which post-retrieval features help the most), or whether it should cache post-retrieval features for future impressions of the query.

6.1 Formal Task Definition

The goal of block ranking is to produce a ranking of Web and vertical blocks in response to a query. A block is defined as a short sequence of Web or same-vertical results that must be grouped together, vertically (e.g., blogs, news) or horizontally (e.g., images, video), in the output presentation. Let Bq denote the set of blocks associated with query q. Consistent with the set of layout constraints given in Section 5.2, set Bq will include: all three Web blocks w1−3, one block for each vertical that retrieves results, and the imaginary “end of search results” block, denoted by eos (explained below).

Let σq∗ denote the gold-standard or reference block-ranking. σq∗ is assumed to be the optimal ordering of Bq for query q. Formally, the objective of block ranking is to predict a block-ranking σq that approximates σq∗. The quality of the approximation can be measured using a rank-based distance metric, such as Kendall's tau. In our experiments, we measure the quality of σq using the generalized Kendall's tau distance between σq and σq∗, denoted as K∗(σq, σq∗). Generalized Kendall's tau punishes ranking mistakes at the top of σq more severely than at the bottom [64]. The imaginary eos block is used to mark the end of the visible results. Let σq(i) denote the rank of block i in σq. Blocks ranked above eos are considered to be displayed to the user and those ranked below eos are considered to be suppressed. Because the ranking of blocks below rank σq(eos) is inconsequential to the user (i.e., it is not observed), for the purpose of comparing σq with σq∗, all blocks ranked below eos in σq are considered tied at rank σq(eos) + 1. Aggregated web search involves predicting which verticals are likely to be relevant (vertical selection) and predicting where to present them, or whether to suppress them, for example, in light of their results (vertical results presentation). In this chapter, we focus on vertical results presentation. In order to separate vertical selection performance

from our evaluation, we assume that the query is issued to every available vertical. That is, Bq will include one vertical block for every vertical that retrieves results.

6.2 Related Work

Most previous research on vertical results presentation assumes at most a single rele- vant vertical and a fixed presentation template. As previously mentioned in Section 3.7, König et al. [63] and Diaz [28] focused on predicting whether to present news results above the Web results (also known as slot 0) or not at all. In this chapter, we consider po- tentially multiple verticals simultaneously and cast the problem as block-ranking rather than assume a pre-specified slot for the vertical results. Ponnuswami et al. [77, 78] is the only published research that considers potentially multiple verticals at different positions. More specifically, this work considered three slotting positions: top (above the Web results), middle (between Web results 3-4), and bottom (below the last Web result). Their proposed framework combines vertical-specific binary classifiers as follows. During the training phase, each classifier is trained to pre- dict whether its vertical should be presented in the top slotting position. Ponnuswami et al. [78] show that training data for such a binary classifier can be generated from im- plicit user feedback—by presenting random presentations to users and observing clicks and skips. Then, at test time, first an upstream vertical selection component predicts which verticals to present (in any of the three slotting position). Then, each classifier de- scribed above is used to predict whether its vertical (if selected) should be presented in the top slotting position. Finally, using two threshold parameters, each selected vertical is assigned to the top, middle, or bottom position based on its classifier’s prediction con- fidence value. As previously mentioned, one challenge in vertical results presentation is that different verticals may be associated with different types of predictive evidence. Ponnuswami et al. [78] addressed this challenge by learning a different binary classifier per vertical. Thus, each classifier can use a different set of features. The underlying assumption is that confidence values from these different binary classifiers are directly comparable. We make a similar assumption in our work on vertical selection (Chap- ter 3) and in this chapter evaluate a similar classification framework for vertical results presentation. The work above casts vertical results presentation as a classification task. Addition- ally, in this chapter, we investigate whether it can also be cast as a learning-to-rank task. In machine learning, learning-to-rank (LTR) algorithms learn to order items as a function of a set of features. In document ranking, features may be derived from the query (e.g., whether the query contains a domain name [56]), the document (e.g., the URL’s PageR- ank [80]), and the query-document pair (e.g., the document’s BM25 score [80]). Exist- ing LTR methods can be classified into three types. Point-wise methods (e.g., Gradient Boosted Decision Trees [38]) learn to predict a document’s relevance grade independent of other documents. Pair-wise methods (e.g., RankSVM [56]) learn to predict whether one document is more relevant than another. List-wise methods (e.g., AdaRank [114]) di-

rectly optimize an IR evaluation measure such as NDCG, which considers the quality of the ranking as a whole. In addition to document-ranking, LTR methods have been successfully applied to other tasks, such as ranking alternative language translations [31], ranking related news stories [71], and ranking 3D protein structures for a given amino acid sequence [32]. No prior work, however, has applied LTR methods to the task of ranking blocks of Web and vertical results.

6.3 Features

We propose machine learning approaches that rank blocks as a function of a set of fea- tures. To this end, we use various types of features which we believe are predictive of a particular block’s relevance to a query. These features can be divided into two general classes.

1. Pre-retrieval features can be generated before the query is issued to the Web or ver- tical search engine. These features include, for example, the topic of the query or whether the query contains a particular named-entity type (e.g., the name of a person, product, or location). Pre-retrieval features are independent of the block.

2. Post-retrieval features must be generated after the query is issued to the Web or vertical search engine. These features include, for example, the total number of results retrieved by the block's search engine or the average text-similarity between the query and the results presented in the block.

The goal of vertical selection (Chapters 3 and 4) is to predict which verticals (if any) are likely to have relevant content. Thus, vertical selection uses only pre-retrieval features. In vertical results presentation, however, our assumption is that the query has already been issued to the selected verticals. Therefore, the system also has access to post-retrieval features. We investigate the importance of post-retrieval features. Are they useful for block-ranking? If so, are they predictive for some verticals more than others? As previously stated, the vertical results presentation task includes deciding which selected verticals to present (rank above eos in σq) and which to suppress (rank below eos in σq). Are post-retrieval features useful for predicting which selected verticals to present/suppress in light of their results? Next, we motivate and describe each feature. A summary of our features is presented in Section 6.3.3.

6.3.1 Pre-retrieval Features

Pre-retrieval features can be generated before the query is issued to any search engine. Thus, pre-retrieval features are independent of the block under consideration.

Named-Entity Type Features

These binary features correspond to named-entity types possibly appearing in the query. Queries were automatically annotated using the BBN IdentiFinder named-entity tag- ger [8]. Named-entity features include location (possibly predictive for local, maps, and weather), product (possibly predictive for shopping), person (possibly predictive for news and images), and organization (possibly predictive for finance). In total, we focused on 24 named-entity types. Each binary feature equals 1 if the named-entity type appears at least once in the query and 0 otherwise.

Category Features

Some verticals may be topically focused. Thus, knowing the general topic of the query may help in block ranking. We focused on 30 topical categories, derived as follows. First, we selected 150 categories from the Open Directory Project (ODP) hierarchy and crawled Web documents associated with these ODP nodes. Then, in order to reduce the number of category features, these 150 categories were clustered into 30 clusters. Cluster- ing was done using complete-link agglomerative clustering [72]. The distance between ODP categories was computed using the symmetric Kullback-Leibler divergence [53] between category language models. We used unigram language models with add-one smoothing to avoid zero probabilities. Let θi denote the language model associated with cluster/category i. We set the ith category feature for query q according to,

    catq(i) = (1/Z) ∏_{w ∈ q} P(w | θi),

where Z normalizes across clusters/categories.
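A minimal sketch of this feature is shown below, assuming the 30 cluster language models are available as term-count tables; the smoothing vocabulary size is a parameter of the sketch, not something specified in the text.

```python
import math
from collections import Counter

def category_features(query, category_lms, vocab_size):
    """Score the query under each cluster's add-one-smoothed unigram model,
    then normalize across clusters so the features sum to one."""
    terms = query.lower().split()
    scores = {}
    for cat, counts in category_lms.items():
        total = sum(counts.values())
        log_p = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in terms)
        scores[cat] = math.exp(log_p)
    z = sum(scores.values()) or 1.0
    return {cat: s / z for cat, s in scores.items()}

# Illustrative usage with two tiny "cluster" models:
lms = {"travel": Counter({"hotel": 3, "flight": 2}),
       "food": Counter({"recipe": 4, "cook": 2})}
print(category_features("cheap hotel", lms, vocab_size=1000))
```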

Click-through Features

User clicks are often viewed as surrogates for relevance. The queries associated with clicks on a particular document convey the types of information needs the document satisfies. Likewise, the queries associated with clicks on vertical results convey the types of information needs the vertical satisfies. Our click-through features harness this type of evidence by considering the similarity between the query and those associated with clicks on vertical content (or content very similar to the vertical’s). Click-through data was derived from the AOL query-log. We derived one click- through feature per vertical as follows. First, for each vertical, we manually selected a set of Web domains which we believe have content closely related to the vertical. The vertical-to-domain mapping is given in Table 6.1. Then, we constructed vertical-specific (unigram) language models using all queries (allowing duplicates) associated with click- events on the vertical’s corresponding domains. Finally, given a query, we generate one feature per vertical based on the query generation probability given the vertical’s query- based language model. Given query q, we set the click-through feature for vertical i

according to,

    clickq(i) = (1/Z) ∏_{w ∈ q} P(w | θi),

where Z normalizes across verticals.
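Computing clickq(i) mirrors the category feature above; the main extra step is building the vertical-specific query language models from click events, sketched below using a vertical-to-domain mapping in the spirit of Table 6.1. The log format, helper names, and the assumption of fully-qualified URLs are all illustrative.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def build_click_lms(click_log, vertical_domains):
    """Build one query-based unigram model per vertical from a click log of
    (query, clicked_url) events, keeping duplicate queries as in the text.
    `vertical_domains` maps a vertical name to a list of related domains."""
    lms = defaultdict(Counter)
    for query, url in click_log:
        host = urlparse(url).netloc.lower()  # assumes absolute URLs
        for vertical, domains in vertical_domains.items():
            if any(host == d or host.endswith("." + d) for d in domains):
                lms[vertical].update(query.lower().split())
    return lms
```

Given these models, the feature value for a query is computed exactly as catq(i) above, swapping in the vertical query models and normalizing across verticals.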

Vertical-Intent Features

Users often express vertical-intent explicitly using keywords such as “news” (for the news vertical), “pics” (for the images vertical) or “buy” (for the shopping vertical). The goal of our explicit vertical-intent features is to determine vertical relevance based on how often the query co-occurs with keywords used in explicit requests for the vertical. For example, given the query “britney spears”, we may predict that images is relevant because users often issue the query “britney spears pics”. Co-occurrence statistics were derived from the AOL query-log. Vertical-intent features were generated as follows. First, we manually associate each vertical with a small set of keywords which we believe are often used in explicit re- quests for the vertical. For example, the images vertical was associated with “picture(s)”, “photo(s)”, “pic(s)”, and “image(s)”. The complete vertical-to-keyword-set mapping is given in Table 6.2. Then, to measure the affinity between the query and a particular vertical, we use the chi-squared statistic to measure the lack of independence between two events: the occurrence of the query in the AOL query-log and the occurrence of any of the vertical’s vertical-intent keywords. To reduce sparsity (particularly for long queries), we compute the chi-squared statistic for each query-term individually and use the geometric average. The geometric average (rather than the arithmetic average) was used to favor queries with terms that consistently co-occur with keywords used in explicit requests for the vertical.
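The sketch below illustrates this computation with a naive pass over the query log; a real implementation would precompute the term/keyword co-occurrence counts, but the chi-squared statistic and the geometric average are as described above. All names are illustrative.

```python
import math

def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def vertical_intent_feature(query, log_queries, intent_keywords):
    """For each query term, measure the association between (log query
    contains the term) and (log query contains one of the vertical's intent
    keywords, e.g. "pics"), then combine per-term statistics geometrically."""
    kw = set(intent_keywords)
    stats = []
    for term in query.lower().split():
        a = b = c = d = 0
        for q in log_queries:
            toks = set(q.lower().split())
            has_term, has_kw = term in toks, bool(toks & kw)
            if has_term and has_kw:
                a += 1
            elif has_term:
                b += 1
            elif has_kw:
                c += 1
            else:
                d += 1
        stats.append(chi_squared(a, b, c, d))
    if not stats or 0.0 in stats:
        return 0.0  # geometric mean is zero if any term never co-occurs
    return math.exp(sum(math.log(s) for s in stats) / len(stats))
```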

6.3.2 Post-retrieval Features

Post-retrieval features must be generated after the query is issued to the Web or vertical search engine. Because they are generated from the vertical (or Web) search engine results, different block-types are associated with a different set of post-retrieval features, though some features are common to multiple block-types.

Hit Count Features

This feature considers the number of results retrieved from the Web or vertical search engine. For some verticals, an abundance of query-related content in the index may be predictive of its relevance. This may be true, for example, for news, where the rate of content production may correlate with the rate of content demand. However, we do not expect this feature to be useful for every vertical. For example, the number of retrieved results contributes no useful information for verticals that retrieved at most a single result (finance, maps, and weather).

vertical         urls
blogs            blogger.com, wordpress.org, typepad.com, qawker.com, ibiblio.org, blogspot.com
books            *books.com
community Q&A    answers.yahoo.com, howstuffworks.com
finance          finance.yahoo.com, hoovers.com
images           picasa.google.com, flickr.com
local            local.yahoo.com, yellowpages.com, citysearch.com
maps             maps.google.com, maps.yahoo.com, mapquest.com
news             news.yahoo.com, nytimes.com, wsj.com, cnn.com, foxnews.com, ap.org, cbsnews.com
recipes          allrecipes.com, cooks.com, foodnetwork.com, *recipe.com
shopping         craigslist.com, target.com, ebay.com, walmart.com, shopping.yahoo.com, amazon.com, sears.com, overstock.com
video            *videos.com, youtube.com, video.google.com, movies.yahoo.com
weather          weather.com, weather.yahoo.com

Table 6.1: Vertical-to-URL mapping. URLs correspond to either the actual vertical (e.g., the shopping vertical was constructed using the eBay search API) or content related to the vertical (e.g., http://www.cooks.com contains recipes and is thus related to the recipe vertical). A * denotes a “wildcard” operation.

vertical         keyword set
blogs            blog(s)
books            book(s)
community Q&A    how, when, where, who, why, what
finance          stock(s)
images           picture(s), photo(s), pic(s), image(s)
local            city name
maps             map(s)
news             news
recipes          recipe(s)
shopping         shop, shopping, buy, sale(s), product(s), review(s), price(s)
video            video(s)
weather          weather, forecast

Table 6.2: Vertical-to-keyword mapping.

Temporal Features

Some verticals may be highly time sensitive. Prior work, for example, shows that recency is important in news search [28]. Temporal information was available for four verticals: news, blogs, community Q&A, and twitter. Our assumption is that, for these verticals, users care primarily about recent results. If this is true, then the average age of results presented in the block should affect its relevance. Temporal features are generated as follows. For each individual vertical result, we measure the elapsed time (in hours) between the current and created date/time. We included four features for each time- sensitive vertical: the minimum, maximum, mean, and standard deviation of the elapsed time across results within the block.
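In code, the block-level temporal features reduce to simple summary statistics over the per-result ages, for example (a minimal sketch with hypothetical names):

```python
import statistics

def temporal_features(result_ages_hours):
    """Summarize the elapsed time (in hours) since each result's creation
    into the four block-level features used for time-sensitive verticals."""
    ages = list(result_ages_hours)
    return {
        "age_min": min(ages),
        "age_max": max(ages),
        "age_mean": statistics.mean(ages),
        "age_std": statistics.pstdev(ages),
    }

print(temporal_features([2.5, 10.0, 26.0]))
```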

Text-Similarity Features

The goal of these features is to characterize the text-similarity between the query and the results presented in the block. The challenge in deriving text-similarity features is that results from different sources (i.e., results from the Web search engine and from different verticals) are associated with different sets of textual representations. For example, each Web result is associated with three representations: a title, URL, and summary snippet. Each community Q&A result is associated with two representations: a question and an optional “best answer”. Each weather result is associated with a single representation: the location, which we define as the concatenation of the city, state, and country of the weather forecast. The complete set of representations associated with each block-type is given in Table 6.3. Text-similarity features are generated for a query-block pair in two steps. In the first step, for each result within the block, we measure the text similarity between the query and each representation associated with the result. We use four different text-similarity

measures: (1) the cosine similarity between the query and the representation, (2) the maximum number of query-terms appearing consecutively in the representation, (3) the percentage of query-terms appearing in the representation, and (4) the percentage of the representation-terms corresponding to a query-term. Similar text-similarity features were used in prior learning-to-rank research (for document ranking) [56, 70].¹ The second step depends on whether the block-type is associated with a single result per block (e.g., weather, finance, and maps) or multiple results per block (e.g., news, local, and shopping). For block-types with a single result, we simply include our four similarity measures for each of its text representations. For block-types with multiple results, for each query-representation similarity measure, we use the minimum, maximum, mean, and the standard deviation across results within the block.
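The sketch below illustrates the two steps for one textual representation (e.g., result titles): compute the four query-representation measures per result, then aggregate them into block-level features with the minimum, maximum, mean, and standard deviation. Tokenization and other details are simplifying assumptions of the sketch.

```python
import math
import statistics
from collections import Counter

def similarity_measures(query, text):
    """The four query-representation similarity measures described above."""
    q, t = query.lower().split(), text.lower().split()
    qc, tc = Counter(q), Counter(t)
    # (1) cosine similarity between term-frequency vectors
    dot = sum(qc[w] * tc[w] for w in qc)
    norm = (math.sqrt(sum(v * v for v in qc.values())) *
            math.sqrt(sum(v * v for v in tc.values())))
    cosine = dot / norm if norm else 0.0
    # (2) longest run of query terms appearing consecutively in the representation
    longest = 0
    for qs in range(len(q)):
        for ts in range(len(t)):
            run = 0
            while (qs + run < len(q) and ts + run < len(t)
                   and q[qs + run] == t[ts + run]):
                run += 1
            longest = max(longest, run)
    # (3) fraction of query terms covered by the representation
    q_cov = sum(1 for w in qc if w in tc) / len(qc) if qc else 0.0
    # (4) fraction of representation terms that are query terms
    t_cov = sum(1 for w in t if w in qc) / len(t) if t else 0.0
    return [cosine, longest, q_cov, t_cov]

def block_features(query, texts):
    """Aggregate per-result measures (one text per result) into block-level
    features: min / max / mean / std of each of the four measures."""
    per_result = [similarity_measures(query, r) for r in texts]
    feats = []
    for k in range(4):
        vals = [m[k] for m in per_result]
        feats += [min(vals), max(vals), statistics.mean(vals), statistics.pstdev(vals)]
    return feats
```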

block-type       representations
blogs            url, title
books            title, author(s)
community Q&A    question, best answer, subject
finance          company
images           url, title
local            url, title, location
maps             location
news             url, title, summary snippet, news source
recipes          url, title
shopping         url, title
twitter          tweet
video            title
weather          location
web              url, title, summary snippet

Table 6.3: Block-types along with their textual representations.

6.3.3 Summary of Features

Table 6.4 provides a summary of our features and the block-types for which each feature group is available. Pre-retrieval features are independent of the block. Thus, every block- type is associated with the same set of 80 pre-retrieval features: 24 named-entity type features, 30 category features, 13 click-through features, and 13 vertical intent features. Notice that we added all 13 click-through features (one per vertical) and all 13 vertical intent features (one per vertical) to every block’s feature representation, irrespective of

¹ In previous chapters, we use KL-divergence (KLD) to measure text-similarity. Here, we use the cosine similarity for the following reason. KLD requires smoothing both language models to avoid zero probabilities. Both queries and our representations are terse (most terms occur once or zero times). Therefore, smoothing would significantly change their language models. The cosine similarity does not require smoothing: the query and the representation can be completely disjoint.

its type. Click-through features were not generated for Web blocks w1−3. A language model derived from queries with clicks on Web content would simply look like a background language model. Likewise, vertical-intent features were not generated for Web blocks w1−3. Because Web results are always presented by default, users typically do not explicitly request Web content using query keywords. Post-retrieval features, as opposed to pre-retrieval features, are derived from the Web or vertical search engine results or directly from those results presented in the block. Therefore, different types of blocks are associated with a different set of post-retrieval features. Some post-retrieval features are available for multiple block-types (e.g., query-summary text-similarity features are available for news and w1−3). Others, however, are available for a single type of block (e.g., query-tweet similarity features are available only for twitter). Table 6.4 presents the number of features that each group contributes for each block-type and the total feature count.

6.4 Block-Ranking Approaches

As previously mentioned, casting block-ranking as a supervised machine learning prob- lem is associated with two main challenges. First, different types of blocks are associated with different features. Second, even when a feature is common to multiple block-types, it may have a type-specific relationship with relevance. We propose three general ap- proaches which address both challenges in different ways.

6.4.1 Classification Approach

Our classification framework takes the form of n independent binary classifiers (one per vertical). We choose to use logistic regression due to its prediction accuracy and training speed on large-scale classification tasks [68]. Each binary classifier is trained to predict whether a particular vertical should be pre- sented (ranked above eos) or suppressed (ranked below eos). While training the classifier for vertical v, a query is considered a positive instance if v is ranked above eos in the reference ranking σq∗ and a negative instance otherwise. To compensate for class imbal- ance (verticals are more often suppressed), each positive training instance is weighted according to the number of negative instances in the training set and vice-versa. The classification approach produces a block-ranking by assigning vertical blocks to slots. Consistent with our layout constraints, we assume four vertical slotting positions relative to the Web results: slot s1 (above w1), slot s2 (between w1 and w2), slot s3 (between w2 and w3), and slot s4 (between w3 and eos). In the output ranking, a slot may contain zero or more vertical blocks. The classification approach combines all n vertical-specific binary classifiers as fol- lows. First, the query, represented as a vector of features, is submitted to each candidate vertical’s classifier. Each classifier outputs a prediction probability that its vertical should be presented (i.e., ranked above eos in the predicted ranking σq). Let P(σq(v) < σq(eos))

block-type       pre-retrieval   post-retrieval (group sizes)   total
blogs            80              1, 4, 16, 16                   117
books            80              1, 16, 16, 16                  129
community Q&A    80              1, 4, 16, 16, 16               133
finance          80              4                              84
images           80              1, 16, 16                      113
local            80              1, 16, 16, 16                  129
maps             80              4                              84
news             80              1, 4, 16, 16, 16               133
recipes          80              1, 16, 16                      113
shopping         80              1, 16, 16                      113
twitter          80              1, 4                           85
video            80              1, 16, 16                      113
weather          80              4                              84
w1               80              1, 16, 16, 16                  129
w2               80              1, 16, 16, 16                  129
w3               80              1, 16, 16, 16                  129

Table 6.4: Summary of Features. Pre-retrieval features are independent of the block under consideration. Thus, every block-type is associated with the same set of 80 pre-retrieval features. Post-retrieval features, on the other hand, are derived from the Web or vertical search engine or directly from those results presented in the block. Therefore, different types of blocks are associated with a different set of post-retrieval features. Some features are common to multiple block-types, and others are unique to a particular type. The post-retrieval column lists the sizes of the post-retrieval feature groups (hit count, temporal, and text-similarity features computed against result fields such as the url, title, summary, location, question, subject, answer, company, author, and tweet) available for each block-type.

denote the prediction probability that v should be presented. Then, as shown in Figure 6.1, each candidate vertical is assigned to a slot (or is suppressed) using four threshold parameters τ1-τ4. Vertical v is assigned to slot sx if P(σq(v) < σq(eos)) ≥ τy for all y ≥ x. In other words, a vertical is assigned to the highest slot for which its prediction probability is greater than or equal to all thresholds at or below that slot. At this point, vertical blocks assigned to the same slot are tied. Finally, ties are broken by ordering vertical blocks within the same slot in descending order of prediction probability. This approach focuses on assigning verticals to slots. It might not, however, do a good job of ordering vertical blocks within a slot.

[Figure 6.1 diagram: slots and thresholds interleaved with the Web blocks, in the order s1 (τ1), w1, s2 (τ2), w2, s3 (τ3), w3, s4 (τ4), eos.]

Figure 6.1: Vertical slotting positions. The classification approach predicts an output presentation by assigning verticals to slots s1-s4 using threshold parameters τ1-τ4. Ties (i.e., verticals assigned to the same slot) are broken using the prediction probability.

The classification approach addresses the two challenges mentioned above by training a different binary classifier per vertical. Each classifier adopts its own feature representation, which is unique to the vertical, and learns a vertical-specific relationship between features and block relevance.
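As a concrete illustration of the slotting step, the minimal Python sketch below assigns verticals to slots from their prediction probabilities and the four thresholds. It is only a sketch of the procedure described above, not the implementation used in this work; the names slot_verticals, probs, and thresholds are hypothetical.

def slot_verticals(probs, thresholds):
    """Assign each candidate vertical to a slot, or suppress it.

    probs maps each vertical to P(sigma_q(v) < sigma_q(eos)); thresholds is
    [tau_1, tau_2, tau_3, tau_4]. A vertical goes to the highest slot s_x such
    that its probability is >= every threshold tau_y with y >= x; if it fails
    tau_4 it is suppressed. Ties within a slot are broken by probability.
    """
    slots = {x: [] for x in range(1, len(thresholds) + 1)}
    suppressed = []
    for vertical, p in probs.items():
        assigned = None
        # Walk from the lowest slot (s4) up to the top (s1), remembering the
        # highest slot whose threshold (and all thresholds below it) p clears.
        for x in range(len(thresholds), 0, -1):
            if p >= thresholds[x - 1]:
                assigned = x
            else:
                break
        if assigned is None:
            suppressed.append(vertical)
        else:
            slots[assigned].append((vertical, p))
    for x in slots:
        slots[x].sort(key=lambda vp: -vp[1])  # descending prediction probability
    return slots, suppressed

# Example with made-up probabilities and thresholds tau_1 >= ... >= tau_4.
slots, suppressed = slot_verticals(
    {"news": 0.91, "images": 0.55, "weather": 0.12},
    thresholds=[0.9, 0.7, 0.5, 0.3])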

6.4.2 Voting Approach

In the classification approach, each independent binary classifier is trained to predict whether a particular vertical should be presented or suppressed. Our voting approach also combines independent binary classifiers. However, these are trained to make more fine-grained predictions. Independent binary classifiers are trained to predict the relative ordering between pairs of blocks of a particular type. Each classifier is trained to predict whether block i (of a particular type) should be ranked above or below block j (of another particular type) for a given query. The voting approach uses one binary classifier per block-type pair. Given n verticals

and m non-vertical block-types, the total number of classifiers is given by (n+m choose 2) − (m choose 2). The second term accounts for the fact that Web results are always presented and always ranked in their original order (i.e., σq(w1) < σq(w2) < σq(w3) < σq(eos)). Thus, it is not required to learn a classifier to determine the relative ordering between pairs of non-vertical blocks. In our case, n = 13 and m = 4, which results in 130 independent binary classifiers. The large number of binary classifiers used by this method may be viewed as a disadvantage. Later, we describe one way to reduce this number.

The training phase proceeds as follows. Recall that Bq denotes the set of blocks associated with query q and includes one block per candidate vertical, all three Web blocks (w1-w3), and the eos block. While training a classifier specific to a block-type pair, the query is considered a positive or negative instance depending on the pair's relative rank. Certain block-type pairs occur more frequently in one order versus the other. Therefore, to correct for class imbalance, each positive training instance is weighted according to the number of negative instances in the training set and vice-versa.

To predict a block-ranking for query q, first, every block-pair i, j ∈ Bq is submitted to the appropriate classifier, depending on the type of i and the type of j. Let P(σq(i) < σq(j)) denote the prediction probability that i should be ranked above j. This probability can be treated as a preference score between i and j. The voting approach produces the output block-ranking σq by combining these preference scores as input to the Schulze voting method [85].

As in the classification approach, each binary classifier is associated with a unique feature representation. More specifically, each classifier is associated with three sets of features: one set of pre-retrieval features, which are independent of the block-types under consideration, and two separate sets of post-retrieval features (one set specific to each type in the block-type pair). The voting approach addresses both challenges mentioned above (block-type-specific features and a potentially different predictive relationship across types) by training a different binary classifier per block-type pair. Each classifier adopts its own feature representation and learns a predictive relationship that is specific to the block-type pair.

One limitation of this configuration is the large number of independent binary classifiers. An alternative to learning one model per block-type pair is to learn one model per vertical/non-vertical block-type pair. This results in four models per vertical: three which predict the vertical's relevance compared to each Web block (w1-w3) and one which predicts its relevance compared to eos (i.e., it predicts whether to display/suppress the vertical). As before, given a query, every block-pair i, j ∈ Bq, where one is a vertical block and the other a non-vertical block, is submitted to the appropriate classifier. Then, the output prediction probabilities are combined as the input to the Schulze voting method in order to derive σq.
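The sketch below shows one way the Schulze aggregation step could look; it is a simplified illustration rather than the implementation used here. The names schulze_rank and pref are hypothetical, pref holds the pairwise preference scores (vote counts πq(i, j) when deriving σq∗, or prediction probabilities at prediction time), and ordering candidates by their number of strongest-path wins is a common simplification of the full method.

from itertools import permutations

def schulze_rank(blocks, pref):
    """Rank blocks from pairwise preference scores using the Schulze method."""
    # Initial strengths: keep a pairwise score only if it beats the reverse.
    strength = {}
    for i, j in permutations(blocks, 2):
        d_ij, d_ji = pref.get((i, j), 0.0), pref.get((j, i), 0.0)
        strength[(i, j)] = d_ij if d_ij > d_ji else 0.0
    # Strongest-path strengths via a Floyd-Warshall style update.
    for k in blocks:
        for i in blocks:
            if i == k:
                continue
            for j in blocks:
                if j == i or j == k:
                    continue
                strength[(i, j)] = max(strength[(i, j)],
                                       min(strength[(i, k)], strength[(k, j)]))
    # Order blocks by how many others they beat in strongest-path strength.
    wins = {i: sum(strength[(i, j)] > strength[(j, i)] for j in blocks if j != i)
            for i in blocks}
    return sorted(blocks, key=lambda b: -wins[b])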

6.4.3 Learning to Rank Approaches

Block-ranking can also be cast as a learning-to-rank (LTR) problem. Many different LTR methods have been proposed. In this work, we adopt a pairwise learning to rank

approach. Pairwise approaches optimize over the set of pairwise preferences implicit in the training data. More specifically, we adopt a method that solves the classic RankSVM optimization problem, first proposed by Joachims [56].

RankSVM learns a linear model fw parameterized by feature weight vector w. Let φ(i, q) denote a feature generating function that outputs a feature vector for block i and query q. At test time, given q, the predicted score for block i is given by fw(i, q) = ⟨w, φ(i, q)⟩. Finally, the output ranking σq is inferred by sorting blocks by score.

In contrast with the previous two approaches, casting block-ranking as an LTR problem requires training a single model fw to predict a block's rank irrespective of its type. In our situation, this is problematic because different block-types are associated with different features, and those that are common to multiple block-types may be predictive for some types more than others, or predictive in the opposite direction. Next, we propose three LTR variants which address these challenges in different ways. Each variant makes a different assumption about how features may be correlated with block relevance across different block-types.
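A minimal sketch of this scoring step is shown below, assuming a trained weight vector w and a feature-generating function phi are available; the names are hypothetical and the sketch only illustrates sorting blocks by the linear score.

import numpy as np

def rank_blocks(w, blocks, phi, q):
    """Score each block with f_w(i, q) = <w, phi(i, q)> and sort by score."""
    scores = {i: float(np.dot(w, phi(i, q))) for i in blocks}
    return sorted(blocks, key=lambda i: -scores[i])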

Equally Correlated Features

One alternative is to assume that each feature is equally predictive of block relevance (in the same direction) independent of the block-type. The feature representation is as follows. Pre-retrieval features are independent of the block. This model uses a single copy of each pre-retrieval feature. Post-retrieval features are block-specific (i.e., they are generated directly from the block or the block's search engine results). Similar to pre-retrieval features, this approach also uses a single copy of each post-retrieval feature. If a block is not associated with a particular post-retrieval feature, then the feature is zeroed-out in that instance. Consider, for example, our post-retrieval features which determine the text-similarity between the query and the summary snippets presented in the block. These features are only associated with news and Web blocks w1-w3. Therefore, if the block is not one of these types, all these features are zeroed-out.

This approach assumes that features are equally correlated with relevance irrespective of the block-type. Once trained, model fw will apply the same weight to a feature independent of the instance's block-type. We denote this LTR variant as LTR-G because it assumes a vertical-general relationship between features and relevance.

Uniquely Correlated Features

This approach makes the opposite assumption as the previous one. It assumes that every feature—whether it is a pre- or post-retrieval feature—is uniquely correlated with relevance across different block-types. The feature representation is as follows. We make a separate, block-type-specific copy of each feature. So, for example, given 17 block-types (13 verticals + 3 Web blocks and the eos block), we make 17 copies of each pre-retrieval feature (one per block-type). Given an instance, all copies are zeroed-out except for those corresponding to the instance’s block-type. For post-retrieval features, we make one copy

per block-type for which the feature is available. Consider, for example, our temporal features, which are available for blocks from blogs, community Q&A, news, and twitter. We make 4 copies of each temporal feature.

This approach assumes that features are correlated differently with relevance depending on the block-type. Once trained, model fw will apply a different weight to a feature, depending on the instance's block-type. While this added flexibility may be advantageous, the increased number of features may introduce predictive noise and result in overfitting. Thus, this LTR variant may require more training data than LTR-G. We denote this LTR variant as LTR-S because it assumes a vertical-specific relationship between features and relevance.

Equally and Uniquely Correlated Features

The previous two approaches make opposite assumptions: features are either equally correlated or uniquely correlated with relevance for different block-types. A third alternative is to make neither assumption a priori, but to give the algorithm the freedom to exploit both types of relationships using training data.

For this approach, we maintain a single copy of each pre- and post-retrieval feature which is shared across all block-types. As before, if an instance's block-type is not associated with a shared feature, the feature is zeroed-out for that instance. In addition to these shared features, we make one block-type-specific copy of each pre- and post-retrieval feature. Given an instance, all copies corresponding to types other than the instance's block-type are zeroed-out. The canonical feature representation for this approach is the union of features used by the previous two approaches.

This approach makes no assumption about how a feature is correlated with relevance across block-types. If a feature is equally correlated across block-types, then the algorithm has the flexibility of exploiting this relationship by assigning a large (positive or negative) weight to the version of the feature which is shared across types. Alternatively, if a feature is correlated differently for different block-types, then the algorithm can exploit a type-specific relationship by assigning a large positive weight to some copies of the feature and a large negative weight to others. We denote this LTR variant as LTR-GS because it has the flexibility of learning a vertical-specific and a vertical-general relationship for every feature. Of all three LTR variants, LTR-GS is associated with the largest number of features and may therefore need the most training data to avoid overfitting.
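To make the three feature layouts concrete, the sketch below builds an instance's feature vector under each variant. It is an illustrative sketch under stated assumptions, not the code used in this work: the helper name ltr_feature_vector is hypothetical, and shared_feats is assumed to contain every feature name for every instance, with zeros for features the block-type lacks.

import numpy as np

def ltr_feature_vector(shared_feats, block_type, all_types, variant="GS"):
    """Build an instance's feature vector under LTR-G, LTR-S, or LTR-GS.

    shared_feats must contain every feature name (zero for features the
    block-type lacks) so the ordering is consistent across blocks;
    all_types is the fixed list of the 17 block-types.
    """
    names = sorted(shared_feats)
    shared = np.array([shared_feats[n] for n in names])
    if variant == "G":                       # one shared copy of every feature
        return shared
    # One block-type-specific copy of every feature: all copies are zero
    # except the slice belonging to this instance's block-type.
    specific = np.zeros(len(all_types) * len(names))
    offset = all_types.index(block_type) * len(names)
    specific[offset:offset + len(names)] = shared
    if variant == "S":
        return specific
    return np.concatenate([shared, specific])  # "GS": union of both layouts

Under this layout, LTR-G sees only the shared slice, LTR-S only the block-type-specific slice, and LTR-GS their concatenation.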

6.5 Methods and Materials

We train algorithms to predict a block-ranking σq that approximates the reference block-ranking σq∗ and compare methods based on the quality of their approximation on previously unseen queries. More specifically, we evaluate each predicted block-ranking σq based on its generalized Kendall's tau distance to the reference σq∗. Evaluation was conducted using 10-fold cross-validation. Unless otherwise stated,

statistical significance is tested using a one-tailed paired t-test (at the p < 0.05 level) by comparing performance across test queries (i.e., the concatenation of all 10 test-folds). We evaluate block-ranking performance using a set of 13 verticals and 1070 queries.

6.5.1 Verticals and Queries

We focused on the same set of 13 verticals used in Chapter 5, which were constructed using freely available APIs from eBay (shopping), Google (blogs, books, weather), Recipe Puppy (recipes), Yahoo! (community Q&A, finance, images, local, maps, news), Twitter (micro-blogs), and YouTube (video).

A set of queries for model learning and evaluation was collected using sampling. Different situations call for different methods for query sampling. In our case, we are interested not only in performance across queries, but also across verticals. Queries were sampled from two different sources: the AOL query-log and Google Trends.2 The AOL query-log contains about 20M queries issued to the AOL search engine over a period of three months in 2006. Being from 2006, AOL queries may not cover current topics for which verticals like news and twitter are relevant. For this reason, we also sampled queries from Google Trends. Google Trends provides the 20 most popular queries issued to the Google Web portal in the last hour (after removing spam queries) and also provides daily aggregates, which are available for dates in the past.

A total of 1,070 queries were sampled in two phases. During the first phase, 500 queries were sampled uniformly at random from the AOL query log (rejecting duplicates). As it turns out, after collecting assessments for these queries, only 43 had at least one vertical displayed (i.e., ranked above eos) in σq∗. In other words, for these 500 queries, assessors had a strong preference towards Web results. Thus, during the second phase, the sampling was biased in favor of queries with at least one relevant vertical (and potentially multiple).

Our biased sampling approach proceeds as follows. Let P(v|q) denote the relevance of vertical v to query q and let Q denote the set of all queries. To ensure coverage across all verticals, first, a vertical v is selected in a round-robin fashion. Then, a query is sampled from Q according to,

P(v|q) / Σ_{q' ∈ Q} P(v|q').

We now describe how we estimate P(v|q) without using human judgements. In order to guide users towards more successful searches, commercial search providers often suggest queries that are related to the current search. In some cases, the set of recommended queries provides evidence that a particular vertical is relevant to the query. For example, as shown in Figure 6.2, given the query “pittsburgh”, Bing recommends the query “pittsburgh photos”. This recommended query suggests that users who issue the query “pittsburgh” often want to see image results.

2http://www.google.com/trends

Figure 6.2: Given the query “pittsburgh”, a search engine recommends a set of related queries (red, dashed box). The related query “pittsburgh photos” (blue, solid box) suggests that users who issue the query “pittsburgh” often want to see photos (i.e., it suggests that the images vertical is relevant).

Each vertical was associated with a set of keywords believed to be used in explicit requests for the vertical. Let Tv denote the set of keywords associated with v and let Qq^s denote the set of queries suggested by Bing in response to q. P(v|q) was approximated according to,

P(v|q) ≈ (1 / |Qq^s|) Σ_{q' ∈ Qq^s} I(|q' ∩ Tv| > 0),

where q' ∩ Tv denotes the intersection between the terms in q' and Tv. Queries were sampled from a large set of queries: 20K queries sampled uniformly from the AOL query-log, all Google Trends daily aggregates from a year preceding the second round of assessments, and all Google Trends queries (scraped hourly) from 3 days preceding the second round of assessments.
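A sketch of this estimation and sampling procedure is shown below. It is a hypothetical illustration rather than the code used here: estimate_vertical_affinity and biased_sample are invented names, and the inputs (suggested queries and per-vertical keyword sets) are assumed to be available from elsewhere.

import random

def estimate_vertical_affinity(query_suggestions, vertical_keywords):
    """Approximate P(v|q) as the fraction of suggested queries containing at
    least one keyword associated with v."""
    if not query_suggestions:
        return 0.0
    hits = sum(any(kw in s.lower().split() for kw in vertical_keywords)
               for s in query_suggestions)
    return hits / len(query_suggestions)

def biased_sample(queries, affinities, verticals, n_samples):
    """Round-robin over verticals; for each vertical, draw a query with
    probability proportional to its estimated P(v|q)."""
    sampled, pool = [], list(queries)
    v_idx = 0
    while len(sampled) < n_samples and pool:
        v = verticals[v_idx % len(verticals)]
        v_idx += 1
        weights = [affinities.get((v, q), 0.0) for q in pool]
        if sum(weights) == 0.0:
            # Stop if no remaining query has affinity with any vertical.
            if all(sum(affinities.get((u, q), 0.0) for u in verticals) == 0.0
                   for q in pool):
                break
            continue
        q = random.choices(pool, weights=weights, k=1)[0]
        sampled.append(q)
        pool.remove(q)
    return sampled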

6.5.2 Block-Pair Preference Assessment

As described in Section 5.4, given q, the reference presentation σq∗ is derived from human preference judgements on block-pairs. More specifically, it is derived from preference judgements on all block pairs i, j ∈ Bq. These preference judgements, denoted by πq, are the raw input to the Schulze voting method, which produces σq∗. A block-pair judgment is defined by a triplet (q, i, j).

As in Chapter 5, block-pair judgments were collected using Amazon's Mechanical Turk (AMT). Assessors were compensated with US$ 0.01 per block-pair. The block-pair assessment interface was similar to the one used in our user study, with three main differences. First, rather than include a single block-pair assessment per HIT (Human Intelligence Task), we included five block-pair assessments per HIT and priced HITs at US$ 0.05.3 Second, assessors were not given a topic description for the query. Instead, they were instructed to interpret results based on their best guess of the hypothetical user's intent. We purposely omitted topic descriptions in order to model query ambiguity and diversity in the aggregated results. That is, we want query ambiguity to be reflected in the verticals presented in σq∗. We expected, however, that omitting the topic description would reduce inter-assessor agreement, particularly for queries without an obvious relevant vertical. Third, in order to annotate more queries, we collected 3 redundant judgements per triplet (q, i, j) instead of 4.

Let πq(i, j) denote the strength with which block i is preferred over j given q. We set πq(i, j) equal to the number of assessors who prefer i over j given q. πq(eos, i) was set equal to the number of assessors who stated that i is bad in conjunction with another block j.

To do quality control, 10% of all block-pair assessments were traps. In a trap assessment, the assessor is given a triplet (q, i, j), but either block i or j is taken from a query other than q. If the assessor selects the extraneous block as the preferred block, we consider it a failed trap. Assessors who failed more than 2 traps had all their judgements removed from the pool and resubmitted to AMT.

Block-Pair Assessment Results

Because we collected redundant block-pair judgements, we can measure inter-assessor agreement. Inter-assessor agreement in terms of Fleiss' Kappa was κf = 0.524, which is considered moderate agreement based on Landis and Koch [65]. We interpret this level of agreement as high enough to suggest that assessors did not find the assessment task or the interface confusing. Agreement on this round of assessments, however, was lower than in Section 5.4 (κf = 0.656). There may be two reasons for this. First, queries were sampled automatically, whereas in Chapter 5 they were selected manually. It may be that these manually selected queries had a more obvious relevant vertical, which resulted in higher agreement. On this round of assessments, agreement on the 500 queries that were sampled uniformly from the AOL query log was κf = 0.502. Agreement on the 570 queries that were sampled using biased sampling was κf = 0.546. Thus, agreement is higher when the query is biased towards at least one vertical. Second, assessors were not given topic descriptions, and, therefore, may have adopted different interpretations for the same query.

3In addition to compensating assessors for their work, using AMT requires paying Amazon a 10% commission, with a minimum commission fee of $0.005. Thus, pricing HITs at US$ 0.01 would result in paying Amazon a 50% rather than a 10% commission per HIT.

6.5.3 Modeling Bias Towards Vertical Results

In certain scenarios, one may prefer a system that is biased towards vertical results. In other words, the aggregated web search provider may want to tune the system so that it more frequently presents vertical results and/or presents them higher in the output presentation (i.e., ranks them higher in σq). This may occur, for example, if the search company has contractual obligations with vertical providers or advertisers (prior research shows that such contractual obligations occur [78]).

Vertical bias can be modeled in our evaluation framework using vertical pseudo-votes, a parameter p in the range [0, ∞]. As previously noted, πq(i, j) corresponds to the number of assessors who prefer i over j given q. A vertical bias can be introduced by incrementing this value by some number of pseudo-votes p, but only if i corresponds to a vertical block. If i and j correspond to blocks from different verticals, this method increments both πq(i, j) and πq(j, i) by an equal number of pseudo-votes. This has the effect that, after producing σq∗ using the Schulze voting method, vertical results are ranked higher in σq∗. However, the ranking of verticals relative to each other is unchanged. Modeling vertical bias using vertical pseudo-votes changes σq∗ and, therefore, changes model learning and evaluation. Rather than select a single pseudo-vote parameter p, we learn and evaluate models across several values of p (i.e., p ∈ {0, 1, 2, 3, 4, 5}).

Table 6.5 presents the number of queries for which a type of block was presented within the top 3 ranks of σq∗ for different values of p. Across values of p, the verticals ranked higher in σq∗ were video, local, news, blogs, and community Q&A. This is consistent with prior work that found a bias in favor of video [101]. Given no vertical bias (p = 0), only 190/1070 (18%) queries had at least one vertical presented in σq∗ (i.e., ranked above eos). This number may seem low. Bing, for example, presents at least one vertical for 652/1070 (61%) of our queries. There are several possible explanations for this difference. Commercial search providers often use implicit feedback (i.e., clicks and skips) as features or targets for prediction [28, 78]. Gathering this feedback data requires presenting a vertical more often than it is relevant. Additionally, as explained above, commercial search providers may have strong incentives to show verticals often. These may include contractual obligations with advertisers (who advertise in the vertical) and alleviating traffic from the Web search engine. Finally, presenting verticals is a type of diversification in light of query ambiguity. We might have observed more verticals presented if we had collected a greater number of redundant preference judgements per block-pair.
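The pseudo-vote adjustment itself is simple; the sketch below shows it with hypothetical names (pi is the preference matrix πq as a dict over ordered block pairs, and is_vertical indicates whether a block comes from a vertical).

def add_vertical_pseudo_votes(pi, p, is_vertical):
    """Return a copy of the preference matrix pi with p pseudo-votes added to
    every pi[(i, j)] where block i is a vertical block. When both i and j are
    vertical blocks, pi[(i, j)] and pi[(j, i)] are both incremented, so the
    relative ordering of verticals is unchanged."""
    biased = dict(pi)
    for (i, j), votes in pi.items():
        if is_vertical(i):
            biased[(i, j)] = votes + p
    return biased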

6.5.4 Evaluation Metric

Given query q, the objective in block-ranking is to produce a ranking of blocks σq that approximates the reference ranking σq∗. Thus, our primary evaluation metric is the Generalized Kendall's tau distance between the predicted block-ranking σq and the reference block-ranking σq∗, denoted as K∗(σq∗, σq) [64]. Generalized Kendall's tau is closely related to Kendall's tau (K), which measures the number of discordant pairs between two rankings of the same set of elements,

block-type        p=0    p=1    p=2    p=3    p=4    p=5
blogs               5     13     38     80    139    193
book                0      1      3      6     12     23
community Q&A       4     10     26     52     80    115
finance             5      8     11     14     16     20
image               5     10     19     39     61     83
local              10     17     24     32     46     60
map                 1      3      5      8     12     15
news                8     26     53     86    105    123
recipe              1      3      7     16     22     32
shopping            4      8     18     33     40     66
twitter             0      0      0      2      5     12
video              21     44    100    208    286    360
weather             6     11     12     13     14     14
w1               1070   1070   1066   1045   1013    947
w2               1067   1059   1027    940    852    749
w3               1003    927    801    636    507    398
any vertical       67    143    269    434    563    672

Table 6.5: Number of queries (out of 1070) in which each block-type was presented within the top 3 ranks of σq∗ as a function of pseudo-votes p.

K(σ∗, σ) = Σ_{σ∗(i) < σ∗(j)} [σ(i) > σ(j)],

where σ(i) denotes the rank of element i in σ. For our purposes (i.e., to measure the distance between σq and σq∗), Kendall's tau has a major limitation: it considers all discordant pairs equally important. However, in aggregated web search users typically scan results from top to bottom. Thus, discordant pairs at the top of the ranking should be more influential. Generalized Kendall's tau (K∗) accounts for positional information using element weights. Let δ denote the cost of an adjacent swap. In traditional Kendall's tau, δ = 1, irrespective of rank; adjacent swaps are treated equally regardless of rank. Let δr denote the cost of an adjacent swap between elements at ranks r−1 and r. As suggested by Kumar and Vassilvitskii [64], adjacent swaps at the top of the ranking can be made more influential by using the following DCG-like cost function,

δr = 1/log(r) + 1/log(r + 1),

defined for 2 ≤ r ≤ n. Given rankings σ∗ and σ, element i's displacement weight p̄i(σ∗, σ) is given by the average cost (in terms of adjacent swaps) it incurs in moving

from rank σq∗(i) to rank σq(i),

p̄i(σ∗, σ) = 1 if σ∗(i) = σ(i), and p̄i(σ∗, σ) = (p_{σ∗(i)} − p_{σ(i)}) / (σ∗(i) − σ(i)) otherwise, where p_r = Σ_{i=2}^{r} δ_i. The K∗ distance is then given by,

K∗(σ∗, σ) = Σ_{σ∗(i) < σ∗(j)} p̄i(σ∗, σ) p̄j(σ∗, σ) [σ(i) > σ(j)].

A discordant pair's contribution to the metric is equal to the product of the two element weights. In Chapter 5, we present a user study showing that when assessors strongly prefer one block-ranking over another, the metric judges the preferred ranking as superior.

In its current form, K∗ has two limitations: lower is better, which is uncommon for evaluation metrics, and its range is query-dependent (its highest value is a function of the number of blocks associated with the query). Since we are interested in average performance across queries, the metric's range should be query-independent. Both limitations can be addressed by scaling K∗ to the range [−1, +1] according to,

K∗scaled(σ∗, σ) = (K∗¬(σ∗, σ) − K∗(σ∗, σ)) / (K∗¬(σ∗, σ) + K∗(σ∗, σ)),

where,

K∗¬(σ∗, σ) = Σ_{σ∗(i) < σ∗(j)} p̄i(σ∗, σ) p̄j(σ∗, σ) [σ(i) < σ(j)].

This normalization is analogous to the normalized form of regular Kendall's tau,

K(σ∗, σ) = (C − D) / (C + D),

where C is the number of concordant pairs, D is the number of discordant pairs, and

C + D is the total number of pairs. From here on, we refer to K∗scaled simply as K∗.
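The sketch below puts the pieces of the scaled metric together: the DCG-like swap cost, the displacement weights, and the weighted concordant/discordant sums. It is an illustrative sketch rather than the implementation used in this work; the base of the logarithm is not specified in the text, so natural log is assumed, and rankings are represented as dicts mapping blocks to ranks (1 = top).

import math

def swap_cost(r):
    # DCG-like cost of an adjacent swap between ranks r-1 and r (r >= 2);
    # the log base is an assumption (natural log here).
    return 1.0 / math.log(r) + 1.0 / math.log(r + 1)

def cum_cost(r):
    # p_r = sum of delta_i for i = 2..r (p_1 = 0)
    return sum(swap_cost(i) for i in range(2, r + 1))

def displacement_weight(r_ref, r_pred):
    if r_ref == r_pred:
        return 1.0
    return (cum_cost(r_ref) - cum_cost(r_pred)) / (r_ref - r_pred)

def scaled_kstar(ref, pred):
    """Scaled generalized Kendall's tau between a reference ranking `ref` and a
    predicted ranking `pred`, each a dict mapping block -> rank (1 = top)."""
    blocks = list(ref)
    concordant = discordant = 0.0
    for a in range(len(blocks)):
        for b in range(len(blocks)):
            i, j = blocks[a], blocks[b]
            if ref[i] >= ref[j]:
                continue                  # visit each reference-ordered pair once
            w = displacement_weight(ref[i], pred[i]) * \
                displacement_weight(ref[j], pred[j])
            if pred[i] > pred[j]:
                discordant += w           # weighted discordant pair
            elif pred[i] < pred[j]:
                concordant += w           # weighted concordant pair
    total = concordant + discordant
    return (concordant - discordant) / total if total > 0 else 1.0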

6.5.5 Implementation Details

Classification and Voting Approaches

The classification framework requires tuning threshold parameters τ1-τ4, which are used to slot verticals based on each classifier's prediction confidence value. In addition to these parameters, logistic regression requires tuning cost factor C, which determines the cost of misclassifying a training set instance. Instead of using a single held-out validation set, these 5 parameters were tuned using 10-fold cross-validation. More specifically, they were tuned for each train/test pair individually by doing a second level of 10-fold cross-validation on each primary fold's training set. An exhaustive search was used to find the parameter values with the best average performance across secondary train/test

pairs. We did not tune parameter C for each vertical individually as doing so would have resulted in an intractable number of parameter settings.

The voting approach also uses logistic regression models: one per block-type pair. The C parameter was tuned as described above (i.e., not individually for each block-type pair). In both approaches, models were trained using the LibLinear toolkit.4

Learning-to-Rank Approaches

As shown in Table 6.5, most queries have Web blocks w1-w3 ranked high and verticals ranked low (or suppressed). This is a problem if during training the LTR model learns that the best alternative is to present w1-w3 and suppress all verticals. In the classification and voting approaches, we balanced the training data using instance weighting. That is, more weight was allocated to training instances from the minority class. Casting block-ranking as a learning-to-rank problem may also require some form of instance weighting.

Our approach to instance weighting is as follows. Let σq^w denote a block-ranking which presents Web blocks w1-w3 in their original order and suppresses all verticals. Given a training query q, −K∗(σq∗, σq^w), which is in the range [−1, +1], measures the distance between σq∗ and σq^w. If −K∗(σq∗, σq^w) is close to +1, it means that σq∗ has verticals presented in the top ranks. If −K∗(σq∗, σq^w) is close to −1, it means that σq∗ has Web blocks w1-w3 presented in the top ranks and verticals presented in the low ranks or suppressed. Our approach to instance weighting is to replicate queries in the training set proportional to this distance. This has the effect that queries which deviate from σq^w are oversampled.

The amount of replication is controlled using parameter α. First, we scale −K∗(σq∗, σq^w) to zero min and unit max (using queries from the training set). Then, we multiply this value by α. Given training query q, the number of additional copies of q in the training set is given by α times the min-max-scaled value of −K∗(σq∗, σq^w). Note that when α = 0, no additional copies of q are added to the training set. We evaluate the effect of parameter α in Section 6.7.1.

In addition to parameter α, RankSVM has a regularization parameter λ. Both parameters were tuned using two levels of 10-fold cross-validation. An exhaustive search was used to find the parameter values with the best average performance across secondary train/test pairs. We trained LTR models using the sofia-ml toolkit.5
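A sketch of this replication scheme is shown below, reusing the scaled_kstar sketch from Section 6.5.4. The names and data layout are hypothetical (each training entry is a (query, reference ranking, Web-only ranking) triple), and rounding the number of extra copies to an integer is an assumption; the text does not state how fractional counts are handled.

def replicate_training_queries(train, alpha):
    """Oversample training queries whose reference ranking deviates from the
    Web-only ranking sigma_q^w.

    Each entry of train is a (query, ref_ranking, web_only_ranking) triple,
    where both rankings cover the same blocks (suppressed verticals are
    ranked below eos).
    """
    dists = [-scaled_kstar(ref, web) for _, ref, web in train]
    lo, hi = min(dists), max(dists)
    replicated = []
    for item, d in zip(train, dists):
        scaled = 0.0 if hi == lo else (d - lo) / (hi - lo)  # min-max scaling
        copies = 1 + int(round(alpha * scaled))             # alpha = 0 -> no extras
        replicated.extend([item] * copies)
    return replicated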

6.6 Experimental Results

We evaluate three general approaches to block-ranking: the classification approach, the voting approach, and the LTR approach. One limitation of the voting approach is that it requires a large number of independent binary classifiers. To address this limitation, we propose a second version of the voting approach that trains a binary classifier only for block-type-pairs where one type corresponds to a vertical block-type and the other corresponds to a non-vertical block-type. Both are included in this evaluation. We also

4http://www.csie.ntu.edu.tw/cjlin/liblinear/ 5http://code.google.com/p/sofia-ml/

include all three variants of the LTR approach, with parameter α tuned using cross-validation.

Results in terms of average K∗(σq∗, σq) are presented in Table 6.6. As mentioned, pseudo-vote parameter p controls for vertical bias. The higher its value, the greater the bias towards verticals ranked high. Rather than choose a single parameter p, results are presented for p ∈ {0, 1, 2, 3, 4, 5}. When p = 0, the σq∗'s used for training and evaluation have no vertical bias. The second row in Table 6.6 (WEB) corresponds to a degenerate baseline that suppresses all vertical results and simply presents w1-w3 above eos in their original order.

                          p = 0      p = 1      p = 2      p = 3      p = 4      p = 5
WEB                       0.982!     0.959!∨    0.896$∨    0.773$#∨   0.656$#∨   0.520$#∨
classification (!$)       0.950#∨    0.943#∨    0.924"     0.878"     0.824"∧    0.775"∧
voting (all pairs) ("#)   0.984!     0.960!∨    0.898$∨    0.830$∨    0.744$∨    0.734$∨
voting (only v-w pairs)   0.980!#    0.956!#∨   0.898$∨    0.822$∨    0.760$"∨   0.734$
LTR-G (∧∨)                0.980!     0.967!"    0.932"     0.871"     0.798$"    0.754$"
LTR-S                     0.986!∧    0.970!"    0.937!"    0.883"∧    0.838!"∧   0.792!"∧
LTR-GS                    0.986!∧    0.973!"∧   0.936!"    0.882"∧    0.843!"∧   0.795!"∧

Table 6.6: Block-ranking results in terms of average K∗(σq∗, σq). Statistical significance was tested using a one-tailed paired t-test, comparing performance across pairs of queries. A !($) denotes significantly better(worse) performance compared to the classification approach, a "(#) denotes significantly better(worse) performance compared to the voting (all pairs) approach, and a ∧(∨) denotes significantly better(worse) performance compared to the LTR-G approach.

Table 6.6 shows several meaningful trends. Performance for all methods decreases with greater values of p (this was expected for WEB). One possible reason is the following. As p increases, the number of queries with top-ranked verticals increases. This means that the room for error also increases. When p is low, ranking w1-w3 above all verticals is a reliable and effective strategy.

When p = 0, WEB performs well. This is not surprising given that 94% of all queries

(1003/1070) had Web blocks w1-w3 in the top-three ranks of σq∗ (see Table 6.5). In fact, the only two approaches that significantly outperform WEB when p = 0 are LTR-S and LTR-GS. Significance with respect to WEB is not shown in Table 6.6. For values of p ≥ 3, all block-ranking approaches significantly outperform WEB. Thus, given a vertical bias, any of the three general block-ranking approaches is better than presenting only Web results.

Both voting approaches perform similar to each other. This is interesting because the second version (which learns one binary classifier per vertical/non-vertical block-type pair) uses many fewer binary classifiers than the first (which learns one classifier per block-type pair). The performance of both voting approaches, however, deteriorates for p ≥ 2.

The classification approach does surprisingly well considering that it uses just one binary classifier per vertical. Its performance is competitive across all values of p,

showing that it can be effectively trained to fit different degrees of vertical bias.

Comparing among the three LTR variants, performance is significantly worse for LTR-G. It never performs better and often performs significantly worse than both LTR-S and LTR-GS, particularly for values of p ≥ 3. For values of p ≥ 4, LTR-G performs significantly worse than the classification approach. The improvement of LTR-S and LTR-GS over LTR-G reveals the importance of exploiting vertical-specific evidence. Imposing the constraint that features must be similarly correlated with relevance across different block-types degrades performance. LTR-S and LTR-GS perform better by giving the learning algorithm the flexibility to exploit a block-type-specific relationship between features and block relevance.

Compared to the classification approach, LTR-S and LTR-GS both perform better. The improvement is significant for all values of p, except p = 3, where all three perform at the same level. We view this as a positive result in favor of casting block-ranking as a learning-to-rank task. In contrast with the classification approach, both of these LTR variants perform better and require training only a single model (rather than one per vertical). Relative to each other, LTR-S and LTR-GS are statistically indistinguishable across all values of p (statistical significance not shown in Table 6.6). We provide an explanation for this in Section 6.7.3.

6.7 Discussion

Several questions remain. Our results show that casting block-ranking as a learning-to-rank problem is a good idea. In addition to allowing the model to learn a vertical-specific relationship between features and relevance, we argue that instance weighting during LTR training is also important. In Section 6.7.1, we test the importance of parameter α on LTR performance.

Our approach is to rank blocks as a function of a set of features. Different features have a different cost. For example, post-retrieval features require issuing the query to the vertical search engine. Thus, we want to know each feature group's contribution to overall performance (Section 6.7.2) as well as performance for each vertical (Section 6.7.3).

Making a block-type-specific copy of each feature (as in the LTR-S and LTR-GS approaches) is one way to allow an LTR model to learn a block-type-specific relationship between features and relevance. In Section 6.7.4, we evaluate a second alternative, which is to include features that identify the block-type and use a learning algorithm that can exploit feature interactions.

In Section 5.6, we found that our metric's ability to predict the preferred block-ranking is the highest (in the 80-90% range) when one block-ranking is close to σq∗ (i.e., in the H bin) and the other is far from σq∗ (i.e., in the M or L bin). Taking a highly conservative view of the metric, we might consider two alternative presentations indistinguishable to users if they fall under bin-combinations other than H-M and H-L. In Section 6.7.5, we compare the voting, classification, and LTR methods under this conservative assumption.

6.7.1 Effect of Parameter α on LTR variants

The goal of parameter α is to focus LTR training on those queries that have verticals ranked high. Given enough evidence in favor of a particular vertical, we want the LTR model to overcome the prior probability that the vertical is ranked low. Our method for instance weighting is to replicate queries in the training set proportional to the K∗ distance between the reference block-ranking σq∗ and one that presents w1-w3 and suppresses all verticals. Parameter α controls the amount of replication.

We are interested in the effect of parameter α on performance. Each LTR variant is evaluated under two conditions. In the first condition, α = 0, which corresponds to no replication: each training query q appears just once in the training set. In the second condition, α is tuned by sweeping across α ∈ {0, 10, 25, 50}. Notice that when α is tuned, α = 0 (no replication) is also a parameter choice. The results with α tuned are identical to those in Table 6.6.

Table 6.7 presents results (in terms of average K∗(σq∗, σq)) for all three LTR variants under these two conditions. As a second point of reference, we also present results for the classification approach.

                    p = 0     p = 1     p = 2     p = 3     p = 4     p = 5
classification      0.950     0.943     0.924     0.878     0.824     0.775
LTR-G, α = 0        0.983!    0.967!    0.923     0.839$    0.775$    0.701$
LTR-S, α = 0        0.986!    0.966!    0.928     0.863$    0.798$    0.723$
LTR-GS, α = 0       0.986!    0.967!    0.940!    0.870     0.812     0.743$
LTR-G, α tuned      0.980#!   0.967!    0.932"    0.871"    0.798"$   0.754"$
LTR-S, α tuned      0.986!    0.970!    0.937"!   0.883"    0.838"!   0.792"!
LTR-GS, α tuned     0.986!    0.973"!   0.936!    0.882"    0.843"!   0.795"!

Table 6.7: Block-ranking results in terms of average K∗(σq∗, σq). A "(#) denotes a significant improvement(drop) in performance for an LTR variant with α tuned compared to its counterpart for which α is forced to zero. A !($) denotes significantly better(worse) performance compared to the classification approach.

Before comparing LTR variants across both conditions, it should be noted that the relative performance between LTR variants when α = 0 (rows 3-5) is consistent with their relative performance when α is tuned (rows 6-8). That is, LTR-S and LTR-GS outperform LTR-G and their improvement increases with p. Due to their superior performance, we focus the discussion on LTR-S and LTR-GS. For both approaches, tuning α (vs. setting it to zero) has either no effect or significantly improves performance. For p ≥ 4, setting α = 0 degrades performance even compared to the classification approach. Thus, casting block-ranking as a learning-to-rank task requires not only providing the learning algorithm the flexibility to exploit a block-type-specific predictive relationship between features and relevance, it also requires re-weighting training-phase instances in order to overcome the prior probability that verticals are ranked low.

6.7.2 Feature Contribution to Overall Performance

                 p = 0            p = 1            p = 2             p = 3             p = 4             p = 5
all feats.       0.950            0.943            0.924             0.878             0.824             0.775
no ne type       0.952 (0.29%)    0.946 (0.23%)    0.920 (-0.47%)    0.877 (-0.17%)    0.826 (0.14%)     0.778 (0.49%)
no category      0.953 (0.38%)    0.944 (0.02%)    0.916 (-0.83%)#   0.889 (1.21%)"    0.827 (0.28%)     0.775 (0.03%)
no click         0.946 (-0.42%)   0.940 (-0.40%)   0.908 (-1.71%)#   0.863 (-1.69%)#   0.808 (-1.97%)#   0.759 (-2.05%)#
no vert. int.    0.949 (-0.06%)   0.939 (-0.44%)   0.914 (-1.15%)#   0.879 (0.12%)     0.824 (0.00%)     0.768 (-0.87%)
no hit count     0.952 (0.28%)    0.946 (0.22%)    0.920 (-0.51%)    0.875 (-0.41%)    0.826 (0.24%)     0.785 (1.32%)
no location      0.949 (-0.06%)   0.940 (-0.33%)   0.921 (-0.31%)    0.877 (-0.16%)    0.826 (0.23%)     0.775 (0.08%)
no temporal      0.948 (-0.19%)   0.950 (0.66%)"   0.924 (0.02%)     0.880 (0.22%)     0.822 (-0.24%)    0.774 (-0.02%)
no text-sim.     0.947 (-0.28%)   0.945 (0.17%)    0.915 (-0.94%)#   0.868 (-1.14%)#   0.805 (-2.34%)#   0.744 (-3.97%)#
no post-ret.     0.949 (-0.10%)   0.942 (-0.19%)   0.916 (-0.86%)#   0.864 (-1.62%)#   0.805 (-2.29%)#   0.748 (-3.44%)#

(a) classification approach

                 p = 0            p = 1            p = 2             p = 3             p = 4             p = 5
all feats.       0.986            0.970            0.937             0.883             0.838             0.792
no ne type       0.986 (-0.02%)   0.970 (0.05%)    0.938 (0.19%)     0.888 (0.57%)     0.835 (-0.42%)    0.802 (1.28%)
no category      0.985 (-0.10%)   0.970 (-0.03%)   0.940 (0.31%)     0.879 (-0.46%)    0.835 (-0.35%)    0.794 (0.21%)
no click         0.985 (-0.04%)   0.970 (0.00%)    0.939 (0.26%)     0.877 (-0.67%)    0.836 (-0.20%)    0.791 (-0.13%)
no vert. int.    0.986 (-0.03%)   0.973 (0.33%)"   0.938 (0.12%)     0.886 (0.31%)     0.839 (0.16%)     0.797 (0.60%)
no hit count     0.984 (-0.15%)   0.970 (-0.01%)   0.934 (-0.26%)    0.887 (0.41%)     0.834 (-0.45%)    0.790 (-0.23%)
no location      0.986 (-0.01%)   0.971 (0.13%)    0.938 (0.12%)     0.885 (0.15%)     0.833 (-0.65%)    0.805 (1.61%)"
no temporal      0.986 (0.01%)    0.970 (0.04%)    0.937 (0.05%)     0.884 (0.13%)     0.837 (-0.10%)    0.797 (0.65%)
no text-sim.     0.985 (-0.09%)   0.970 (0.04%)    0.933 (-0.41%)    0.874 (-1.04%)    0.816 (-2.60%)#   0.765 (-3.44%)#
no post-ret.     0.986 (-0.02%)   0.968 (-0.18%)   0.937 (0.08%)     0.863 (-2.29%)#   0.813 (-2.95%)#   0.762 (-3.79%)#

(b) LTR-S

Table 6.8: Feature ablation results for the classification approach and the LTR-S approach. A "(#) denotes a significant improvement(drop) in performance compared to the model with access to all features.

A feature ablation study was conducted to test each feature group's contribution to overall performance, measured in terms of average K∗(σq∗, σq). The analysis is conducted for both the classification and the LTR-S approach because these were the two general approaches that performed the best in Section 6.6. Each feature group was individually omitted and this model was compared to a model with access to all features. Because features may be correlated, a non-significant drop in performance does not necessarily mean that the feature group contributes no useful information.

Results are presented in Table 6.8 for the classification approach and the LTR-S approach. The top four feature groups are pre-retrieval features. The next four feature groups are post-retrieval features. The last row corresponds to a model that ignores all post-retrieval features.

Table 6.8 shows several interesting results. The LTR-S approach appears to be slightly more robust to missing features. One possible reason is the following. The classification approach trains one independent binary classifier per vertical. The LTR-S approach, on the other hand, trains a single model. It may be that training a single model allows

the LTR-S approach to better shift its focus towards block-types for which it is more confident.

The features with the greatest drop in performance (at least for p ≥ 4) are text-similarity features, which are a type of post-retrieval feature. This shows the importance of deriving evidence directly from those results presented in the block. Also, it suggests the importance of issuing the query to as many vertical search engines as possible (in order to derive this type of evidence) or caching these post-retrieval features for future impressions of the query. Text-similarity features may have contributed the most to performance because many of the block-types often ranked high in σq∗ were associated with text-rich information. These included Web blocks w1-w3, news, blogs, and community Q&A.

Finally, omitting most feature groups did not result in a significant drop in performance. There may be two reasons for this. First, features may be correlated. Second, most verticals are rarely ranked high in σq∗. It may be that some features are highly predictive for these minority verticals, but this may not have a noticeable effect on the average K∗(σq∗, σq). We explore this further in the next section by using a different metric to measure per-vertical performance.

6.7.3 Feature Contribution to Per-Vertical Performance

Evaluating a feature's contribution to a particular vertical is not trivial. Suppose, for example, that we omit temporal features and want to evaluate the effect on the news vertical. One possibility might be to compare the news vertical's predicted rank and its ideal rank across a set of queries. However, omitting temporal features may affect other verticals as well. And, if so, then ranking mistakes for those verticals would displace news from its ideal rank.

For this reason, we focus our analysis on the classification approach, which trains one binary classifier per vertical. Each vertical-specific classifier is trained to predict whether the vertical should be displayed (ranked above eos) or suppressed (ranked below eos) in σq. Performance for a particular vertical is evaluated by considering the quality of confidence values produced by the vertical's corresponding classifier. Let Qv denote the queries for which v is a candidate vertical. We evaluate the quality of confidence values output from v's classifier by computing the average precision (AP) metric on a ranking of Qv (ranked in descending order of confidence that v should be presented). In computing AP, a ranked query q is considered “relevant” if σq∗(v) < σq∗(eos) and “non-relevant” otherwise.

Results are presented in Table 6.9. We limit our discussion to the case where p = 5 because it is associated with verticals ranked higher in σq∗. Table 6.9 shows several noteworthy trends. First, performance across verticals varied widely (see row “all features”). The best-performing verticals were weather (AP=0.938), finance (AP=0.638), and news (AP=0.594). Interestingly, both weather and finance were minority verticals. Both were presented (i.e., ranked above eos in σq∗) for only 14 and 21 queries, respectively. In spite of having few positive instances for training, performance for both these verticals was good.
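For reference, the sketch below shows how AP over a vertical's confidence-ranked queries can be computed; the function and argument names are hypothetical.

def average_precision(ranked_queries, is_relevant):
    """AP over queries ranked by a vertical classifier's confidence.

    ranked_queries is Q_v sorted by descending confidence that v should be
    presented; is_relevant(q) is True when v is ranked above eos in the
    reference ranking for q.
    """
    hits, precision_sum = 0, 0.0
    for rank, q in enumerate(ranked_queries, start=1):
        if is_relevant(q):
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0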

                     community Q&A   blogs      book       finance    image      local      map        news       recipe    shopping   twitter    video      weather
all features         0.284           0.431      0.088      0.638      0.323      0.546      0.419      0.594      0.48      0.377      0.053      0.571      0.938
no ne type           10.15%"         2.65%"     34.41%"    -18.69%#   9.19%"     -3.97%#    -1.99%     13.04%"    6.97%"    1.35%      -14.19%#   2.14%"     4.55%"
no category          15.49%"         -6.61%#    7.77%      21.95%"    6.61%"     -8.67%#    -0.36%     21.92%"    18.10%"   -32.93%#   -0.91%     3.35%"     -17.31%#
no click-through     2.32%"          0.32%      -1.29%     -14.35%#   -9.23%#    -10.44%#   -16.86%#   3.27%"     -0.01%    -8.88%#    1.99%      -4.48%#    -50.11%#
no vertical intent   7.80%"          2.60%"     1.97%      0.00%      -19.34%#   7.37%"     -74.42%#   2.73%      -3.63%#   -0.65%     0.31%      4.41%"     -1.11%#
no hit count         5.53%"          -0.61%#    3.65%"     2.44%"     0.81%      -2.89%#    -1.94%#    -4.21%#    3.78%"    0.21%      -5.90%#    -0.75%#    -0.45%
no location          0.00%           0.00%      0.00%      0.00%      0.00%      -8.84%#    13.12%"    0.00%      0.00%     0.00%      0.00%      0.00%      -17.68%#
no temporal          0.12%           -0.43%#    0.00%      0.00%      0.00%      0.00%      0.00%      0.51%      0.00%     0.00%      -5.71%#    0.00%      0.00%
no text-similarity   -8.44%#         -12.35%#   -31.16%#   -3.38%#    2.47%      -16.93%#   0.00%      -48.88%#   0.00%     -8.02%#    -26.78%#   -12.11%#   0.00%
no post-retrieval    -6.03%#         -14.20%#   -31.07%#   -3.71%#    2.20%      -18.04%#   13.15%"    -45.26%#   3.78%"    -10.96%#   -27.64%#   -12.46%#   -17.81%#

Table 6.9: Feature contribution to per-vertical performance, based on AP. Statistical significance is tested using a one-tailed paired t-test, comparing across cross-validation test-folds. A "(#) denotes a significant improvement(drop) in performance compared to the model with access to all features.

As it turns out, weather was easy because every query for which it was presented had the query term “weather” (e.g., “weather louisville ky”). This explains why the most predictive feature for weather was the click-through feature (i.e., the query's similarity to those with clicks on weather-related content, many of which contain “weather”). Similarly, finance was easy because 10/21 queries for which it was presented had the query term “stock(s)” (e.g., “boeing stock”). In other words, performance for weather and finance was high because their queries often had explicit vertical intent, which is easy to detect.

It is interesting that news was among the best performing verticals. This result is inconsistent with our vertical selection results in Chapter 3 (see Table 3.5). The most useful features for news were text-similarity features, which are a type of post-retrieval feature. Thus, while it is difficult to detect that a query is newsworthy using only pre-retrieval evidence (as in Chapter 3), useful evidence can be harnessed by issuing the query to the news vertical.

The worst performing verticals were twitter (AP=0.053), books (AP=0.088), and community Q&A (AP=0.284). Both twitter and books were minority verticals, while community Q&A was fairly common. Predicting twitter may require predicting that the query is about a trending topic, which we did not explicitly address. Many queries for which twitter was presented were sampled from Google Trends; however, we purposely did not use this information as evidence. Predicting books and community Q&A seems difficult. Queries for which these verticals were relevant had no clear pattern. For books, for example, some queries had the keyword “book” (e.g., “books on giraffes”), some corresponded to a book title (e.g., “lolita”), some corresponded to an author name (e.g., “dr. phil”), and, finally, others corresponded to encyclopedic information needs (e.g., “wedding cake ideas”, “why don't babies sleep at night”, “perennials”). Community Q&A queries showed a similar pattern. This may explain why text-similarity features were the only ones to significantly improve performance for both. In the absence of a clearly predictive query-pattern, it seems useful to derive evidence directly from the block.

Features contributed to performance differently for different verticals. For example, vertical intent features, which exploit vertical-related keywords, were predictive for maps. Many queries for which maps was relevant had the keyword “map(s)”. Similarly, vertical-intent features were predictive for images because many queries for which images was relevant had the keywords “photo(s)”, “pic(s)”, and “picture(s)”. Category features were predictive for shopping because one of our category clusters was related to the shopping ODP node.

Different features also hurt performance for different verticals. For example, named-entity type features hurt performance for books and news. Category features hurt performance for community Q&A, finance, news, and recipe. Click-through features hurt performance for community Q&A and news. The only features that did not significantly hurt performance for any vertical were text-similarity features.
In the previous section, text-similarity features had the greatest contribution to overall performance. In this analysis, text-similarity features were the most consistently predictive for different verticals.

To summarize, this analysis reveals several important results. First, as might be expected, some verticals are more difficult than others. This is consistent with results presented in Chapters 3 and 4. Second, different features were predictive for different verticals. This may suggest why LTR-S and LTR-GS perform at the same level. Because most features help some verticals and hurt others, there is little to be gained from learning a vertical-general relationship between features and relevance. Finally, text-similarity features did not hurt performance for any vertical and improved performance for several. For some verticals, the improvement from including text-similarity features was substantial (an 86% improvement for news).6 Therefore, in practice, it may be worth determining which verticals gain the most from text-similarity evidence. Then, the vertical selection component can be tuned so that those verticals are selected more often.

6.7.4 LTR Model with Cross-Product Features

The success of LTR-S and LTR-GS over LTR-G reveals that an LTR approach to block-ranking requires allowing the model the flexibility to exploit a block-type-specific relationship between features and block relevance. With a linear ranker (i.e., RankSVM), one solution is to make a block-type-specific copy of each feature. A second potential alternative is to use an LTR model that can exploit feature interactions that somehow convey block-type information. In this section, we investigate the performance of a RankSVM model with a full cross-product feature representation. We denote this LTR variant as LTR-C.

The input feature representation for LTR-C is identical to the one used by LTR-G. Internally, however, LTR-C learns on all conjunctions of two features. As with all our LTR variants, the set of features input to LTR-C includes 17 binary features which serve to identify the block-type (recall that we have 17 block-types: 13 verticals, 3 Web blocks, and the eos block). In this respect, the set of features used internally in LTR-C can be viewed as a superset of those used by LTR-S: the feature representation adopted by LTR-S corresponds to the set of two-feature conjunctions that include one of these 17 block-type-identifying features. One might expect LTR-C to perform well because it has access to the same feature representation as LTR-S as well as other types of feature interactions. On the other hand, however, the (internal) feature representation is the largest of all LTR variants, which may lead to over-fitting (considering our small amount of training data).
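The sketch below materializes the explicit cross-product representation for a single instance. Whether the conjunctions are formed explicitly or through a polynomial kernel is not stated in the text, so this is only an illustrative sketch; the function name is hypothetical.

import numpy as np

def cross_product_features(x):
    """All two-feature conjunctions of the input vector x (the LTR-C layout).

    Because x includes the 17 binary block-type indicators, the products with
    those indicators recover the block-type-specific copies used by LTR-S.
    """
    x = np.asarray(x, dtype=float)
    return np.outer(x, x)[np.triu_indices(len(x))]  # upper triangle, incl. diagonal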

Table 6.10 shows performance for the LTR-C model (in terms of average K∗-distance to the reference) compared to LTR-G and LTR-S. LTR-C performs essentially at the same level as LTR-G (it performs significantly better for p = 4). LTR-C performs significantly worse than LTR-S. There may be two reasons for this. First, it may be that the only two-feature conjunctions modeled by LTR-C that are useful are those that are modeled by LTR-S (the rest may be simply introducing noise). Second, it may be necessary to model interactions

6Table 6.9 shows a 45.26% decrease in performance from omitting text-similarity features. This corresponds to an 85.63% improvement from including them.

more complex than conjunctions. Gradient Boosted Decision Trees [38], used in Chapter 4, may be one alternative to model more complex interactions. Further experiments are needed to determine why LTR-C does not perform better.

         p = 0     p = 1     p = 2     p = 3     p = 4     p = 5
LTR-S    0.986     0.970     0.937     0.883     0.838     0.792
LTR-G    0.980     0.967     0.932     0.871     0.798     0.754
LTR-C    0.981$    0.966$    0.938     0.869$    0.814$"   0.760$

Table 6.10: Block-ranking performance in terms of average K∗(σq∗, σq). Statistical significance was tested using a one-tailed paired t-test, comparing performance across pairs of queries. A !($) denotes significantly better(worse) performance for LTR-C vs. LTR-S. A "(#) denotes significantly better(worse) performance for LTR-C vs. LTR-G.

6.7.5 Binning Analysis

Our main evaluation metric is the K∗-distance between the predicted block-ranking σq and the reference block-ranking σq∗. In Chapter 5, we tested this metric's agreement with user preferences on pairs of alternative block-rankings. We found that the metric's agreement with the assessors' majority preference is high (in the 80-90% range) when one block-ranking is close to σq∗ (i.e., in the H bin) and the other is far from σq∗ (i.e., in the M or L bin). Conversely, the metric's agreement is low (in the 50-65% range) when both block-rankings are close to σq∗ or both are far from σq∗ (see Table 5.2). As might be expected, inter-assessor agreement followed a similar trend (see Table 5.1). One might interpret these results by concluding that presentation pairs which fall under bin-combinations other than H-M and H-L are indistinguishable to users: they are either both equally good or both bad enough that their difference does not matter. In this section, we compare approaches under this assumption.

To conduct this analysis, we take the following approach. We binarize the metric value (i.e., the K∗-distance) to equal ‘1’ if the predicted block-ranking σq is in the H bin and ‘0’ otherwise. Note that if we average this binary variable across queries, it corresponds to the proportion of queries for which the approach predicts a block-ranking in the H bin. To compare approaches against each other, we use a one-tailed paired t-test (paired on queries) on this binary metric. The null hypothesis is that both approaches predict block-rankings that are indistinguishable (i.e., both block-rankings correspond to a ‘1’ or both correspond to a ‘0’). Given a query, one approach will outperform another only if it predicts a block-ranking in the H bin (i.e., outputs a ‘1’) and the other predicts one in the M or L bin (i.e., outputs a ‘0’).

Our results are given in Table 6.11 for the voting approach, the classification approach, and the LTR-G and LTR-S LTR variants. We present results for different values of pseudo-votes p. The table should be read as follows. Column ‘avg’ gives the proportion of queries for which the approach predicts a block-ranking in the H bin. Subsequent columns give the significance test results. Cell cij corresponds to the p-value

associated with the test that the approach in row i outperforms that in column j. A '–' is inserted if i does not outperform j.

Table 6.11 presents several noteworthy results. In general, the trends are the same as in Section 6.6. All approaches perform worse as p increases. As we argued previously, this is probably because as p increases there is more room for error (i.e., simply presenting w1-3 in the top ranks becomes less effective). LTR-S performs either at the same level as or better than the voting and classification approaches. Furthermore, it performs equal to or better than LTR-G, which reaffirms that casting block-ranking as a learning-to-rank task requires learning a type-specific relationship between features and block relevance.

One difference in this analysis is that LTR-S and LTR-G perform at the same level when p = 0 (no vertical bias), whereas in Section 6.6 LTR-S significantly outperforms LTR-G. When p = 0, both approaches are equal if we consider only those block-rankings that are likely to be noticeably better. This result, combined with the fact that LTR-S and LTR-G diverge with greater values of p (observed here and in Section 6.6), shows that LTR-S's advantage over LTR-G comes from being able to rank verticals more effectively.

In summary, our most robust solution, LTR-S, performs relatively well when p ≤ 3. Performance for LTR-S (and all other approaches) drops as we increase the bias in favor of vertical results (e.g., when p = 5). There are several possible reasons for this. It may be that more training data is necessary to learn how to rank verticals more effectively. It may also be that when p = 5, the noise-to-signal ratio is too high: as we introduce a greater bias, we deviate more from the choices made by our assessors. Future extensions of this work should consider these possibilities. Additionally, the proposed approaches may be improved by considering different types of evidence or by exploiting predictive relationships between different verticals.
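To make the binning evaluation described above concrete, the following minimal sketch binarizes the per-query bin assignments and applies the one-tailed paired t-test. It assumes the bin labels (H, M, or L) have already been derived as in Chapter 5; the function and variable names are hypothetical and not part of the dissertation's code.

```python
from scipy import stats

def binarize(bins):
    """Map each query's bin label to 1 if the predicted block-ranking
    falls in the H bin, and 0 otherwise."""
    return [1 if b == "H" else 0 for b in bins]

def one_tailed_paired_test(bins_a, bins_b):
    """Test whether approach A outperforms approach B on the binarized
    metric, using a one-tailed paired t-test across queries."""
    a, b = binarize(bins_a), binarize(bins_b)
    t, p_two_sided = stats.ttest_rel(a, b)
    # One-tailed p-value: halve the two-sided p-value when the observed
    # difference points in the hypothesized direction (A better than B).
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return sum(a) / len(a), sum(b) / len(b), p_one_sided

# Hypothetical per-query bin assignments for two approaches.
ltr_s = ["H", "H", "M", "H", "L", "H"]
ltr_g = ["H", "M", "M", "H", "L", "H"]
print(one_tailed_paired_test(ltr_s, ltr_g))
```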

6.8 Summary

In this chapter, we proposed and evaluated three general machine learning approaches to block-ranking—ordering blocks of Web and vertical results in response to a query. The block-ranking task is associated with two main challenges. First, blocks from different verticals have a unique feature representation. Second, similar features may be correlated with relevance differently for different verticals. Our three proposed approaches address these challenges in different ways.

The best overall performance was obtained by casting block-ranking as a learning-to-rank (LTR) problem. However, our results show that, to be successful, the LTR model requires two things. First, the feature representation must allow the model to learn a vertical-specific predictive relationship between features and block relevance. That is, even if a feature is shared among blocks from different verticals, the feature representation must allow the model to weight the feature as positive evidence for some verticals and negative evidence for others. Second, during training, the model must be biased in favor of training queries with verticals ranked high, which are the minority. We show that ignoring either of these requirements significantly hurts performance.

p = 0    avg     vote    class   LTR-G   LTR-S
vote     0.972   –       0.000   –       –
class    0.868   –       –       –       –
LTR-G    0.974   0.353   0.000   –       –
LTR-S    0.979   0.072   0.000   0.083   –

p = 1    avg     vote    class   LTR-G   LTR-S
vote     0.928   –       0.000   –       –
class    0.864   –       –       –       –
LTR-G    0.936   0.153   0.000   –       –
LTR-S    0.950   0.001   0.000   0.004   –

p = 2    avg     vote    class   LTR-G   LTR-S
vote     0.778   –       –       –       –
class    0.803   0.032   –       –       –
LTR-G    0.862   0.000   0.000   –       –
LTR-S    0.876   0.000   0.000   0.029   –

p = 3    avg     vote    class   LTR-G   LTR-S
vote     0.646   –       –       –       –
class    0.693   0.002   –       –       –
LTR-G    0.732   0.000   0.001   –       –
LTR-S    0.764   0.000   0.000   0.002   –

p = 4    avg     vote    class   LTR-G   LTR-S
vote     0.507   –       –       –       –
class    0.586   0.000   –       –       –
LTR-G    0.599   0.000   0.182   –       –
LTR-S    0.676   0.000   0.000   0.000   –

p = 5    avg     vote    class   LTR-G   LTR-S
vote     0.472   –       –       –       –
class    0.505   0.027   –       –       –
LTR-G    0.512   0.020   0.318   –       –
LTR-S    0.567   0.000   0.000   0.000   –

Table 6.11: Binning evaluation results.

Consistent with the central theme of the dissertation, various types of evidence were integrated as input features. A set of feature ablation studies showed that different features improve performance for different verticals, which is another result in favor of evidence-integration. The features that contributed the most to overall performance (and those that were consistently useful across different verticals) were text-similarity features, derived from the similarity between the query and the results presented in the block. These are a type of post-retrieval feature. Thus, they can only be generated for those verticals predicted relevant during vertical selection. Their contribution to performance suggests the importance of issuing the query to a vertical when possible (or at least to those verticals which benefit the most), or of caching this type of evidence for future impressions of the query or queries similar to it.

The relevance of a particular vertical (and therefore its target rank) may be a function of the other verticals presented. For example, suppose that a user issues the query "nikon cool pix" and wants to know what the camera looks like. Both the images and the video vertical may be relevant. However, presenting both of them in the top ranks may be redundant. This is a phenomenon we did not investigate in this chapter. Addressing this issue would first require a change in our block-pair assessment interface. Block-pair judgements were collected outside the context of other vertical results (we raised this issue in Section 5.7). Thus, novelty/diversity may not be reflected in our derived reference presentations (used for training and testing).

Suppose, however, that we have training data that reflects (positive/negative) correlations between different verticals. There are several possibilities, which future research might consider. One is to filter redundancies during vertical selection using a two-stage classification framework, where the predicted relevance of each vertical is used as a feature in each vertical-specific classifier. A second alternative is to identify query-intent classes at a higher level than vertical-relevance and to associate each class with a particular set of verticals. A third alternative is to address redundancy as a post-process to block-ranking. A simple solution, for example, would be to modify the predicted block-ranking using Maximal Marginal Relevance [17], where vertical similarity can be estimated using any of the metrics proposed in Hong et al. [48]. A final alternative is to use a learning-to-rank algorithm that favors diversity in the predicted ranking [81].

Chapter 7

Resource Selection in a Homogeneous Environment

A consistent trend in the experiments presented so far in the dissertation is the importance of evidence integration in aggregated web search. Evidence integration plays a critical role in deciding which verticals to issue the query to (Chapters 3 and 4) as well as in deciding where to present the vertical results (Chapter 6). However, up to this point, we have focused on aggregated web search. Is evidence integration also useful in a more traditional—more homogeneous—federated search environment?

In this chapter, we propose and evaluate a classification-based framework for resource selection in a homogeneous federated search environment, where collections contain similar types of (text-rich) documents and use similar retrieval algorithms. Most prior resource selection methods, such as CORI [14] and ReDDE [94], were designed with a homogeneous environment in mind, and a large body of published work empirically confirms their effectiveness in this type of environment [14, 90, 94, 95, 96, 98, 104].

Consistent with the main theme of the dissertation, we explore a variety of sources of evidence believed to be correlated with resource relevance. We use features similar to those presented in previous chapters and explore a few new ones. Most importantly, we explore evidence derived from click-through data. Once in operation, a federated search system has access to implicit user feedback on aggregated results from previously seen queries. Given a new query, one potentially useful source of evidence is the query's similarity to those previously seen queries that have resulted in a click on a document from a particular resource.

In Chapter 4, we presented models that attempt to make maximal use of already-available training data (collected for a set of existing verticals) to learn a predictive model for a new vertical. In this chapter, we propose another cost-effective method for training a classification-based resource selector. In contrast to the methods in Chapter 4, however, the method proposed here uses no editorial input of any kind. For this reason, we refer to it as zero-judgement training.

The idea behind zero-judgement training is the following. Given n candidate resources, the goal of resource selection is to decide which k resources to issue the query to. We refer to a (federated) retrieval that merges content from k < n resources as a partial-dataset retrieval and one that merges content from all k = n resources as a full-dataset retrieval. At the core, our zero-judgement training method is based on the following assumption:

1 This work was published in CIKM 2009 with co-authors Fernando Diaz and Jamie Callan [2].

given a query, an effective partial-dataset retrieval will resemble a full-dataset retrieval. If we assume this to be true, then, given k, the resource selection objective can be viewed as that of selecting those k resources whose merged ranking most closely approximates a full-dataset merge. More specifically, because users scan results from top to bottom, the objective is to select those k resources that contribute the greatest number of results to the top ranks of a full-dataset merge. Given this objective, our zero-judgement training method is to generate training data from a set of full-dataset retrievals conducted off-line. Put differently, we train a classification system to predict a collection's inclusion in the top ranks of a full-dataset retrieval.

We present a thorough evaluation of the classification-based approach and several state-of-the-art resource selection methods on three federated search testbeds, which were constructed in-house. This allowed us to investigate several questions not yet addressed in the dissertation. How does a classification-based approach perform with hundreds of candidate resources? Our three testbeds consisted of 30, 250, and 1000 collections, where the first testbed had the largest collections and the third had the smallest. Recall that the classic resource selection approach is to derive evidence exclusively from sampled documents. Another question we explore is: what is the effect of resource representation quality on resource selection effectiveness across algorithms? Compared to the classification approach, how do these sample-based methods perform when given access to fairly complete representations (i.e., large sample sets relative to the collection size)? How do they perform when given access to impoverished representations (small sample sets relative to the collection size)? Is there more to gain from evidence-integration—from the classification-based approach's ability to easily integrate evidence as features—in one scenario vs. the other?

7.1 Formal Task Definition

Given a set of n candidate collections and a query q, the goal of resource selection is to choose k collections from which to retrieve documents. Consistent with most prior work in resource selection, we assume that k, or the number of collections to select, is given. We expect that, on average, a retrieval that merges content from k = n collections will be superior to one that merges content from k < n collections. Therefore, given k, our goal is to select those collections whose merged ranking more closely approximates a ranking that merges content from all n collections. In this chapter, we are primarily concerned with the quality of the end-to-end output: the merged results. In order to focus on resource selection (rather than merging), we simulate merging by assuming access to each (retrieved) document’s centralized index score. That is, we assume access to the retrieval score that would be given to a (retrieved) document if the query were issued to an index of all collection content.

7.2 Classification-based Approach

Our classification approach takes the form of n independent, one-vs-all logistic regression models (one per collection). We trained logistic regression models using the LibLinear toolkit.1 At test time, given a query, each classifier makes a binary prediction with respect to its collection. Then, we prioritize all n collections based on each classifier's confidence of a positive prediction, P(y_i = 1 | q). Given this ranking of n collections, we select the top k.

Training collection-specific classifiers requires training data in the form of binary judgements on collections. If Q denotes the set of queries and C the set of target collections, we require a function of the form,

\mathcal{F} : \mathcal{Q} \times \mathcal{C} \rightarrow \{+1, -1\},

which maps query-collection pairs to +1 if C_i is relevant to q, and -1 otherwise.

As previously noted, our objective is to learn a model that selects collections based on their contribution to a full-dataset retrieval. This is based on the assumption that, on average, a full-dataset retrieval, where k = n, outperforms a partial-dataset retrieval, where k < n. Given a full-dataset retrieval of query q, we generate true (positive/negative) labels for every collection based on each collection's contribution to the top (full-dataset) ranks. More specifically, we consider the query a positive example for the collection if the collection contributes more than τ documents to the top T full-dataset results. Otherwise, we consider the query a negative example.
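As an illustration of this pipeline, the sketch below derives zero-judgement labels from a full-dataset ranking and, at test time, ranks collections by one-vs-all classifier confidence. It is a minimal sketch using scikit-learn's logistic regression rather than the LibLinear toolkit used in the dissertation; the variable names and feature extraction are hypothetical.

```python
from collections import Counter
from sklearn.linear_model import LogisticRegression

def zero_judgement_labels(full_ranking, collections, T=30, tau=3):
    """full_ranking: list of (doc_id, collection_id) sorted by centralized score.
    A collection is a positive example for this query if it contributes more
    than tau documents to the top T results of the full-dataset merge."""
    counts = Counter(coll for _, coll in full_ranking[:T])
    return {c: (+1 if counts.get(c, 0) > tau else -1) for c in collections}

def train_one_vs_all(features, labels, collections):
    """features: {query: feature_vector}; labels: {query: {collection: +1/-1}}.
    Trains one binary logistic-regression model per collection."""
    queries = list(features)
    X = [features[q] for q in queries]
    models = {}
    for c in collections:
        y = [1 if labels[q][c] == 1 else 0 for q in queries]
        models[c] = LogisticRegression(max_iter=1000).fit(X, y)
    return models

def select_top_k(models, query_features, k):
    """Rank collections by P(y_i = 1 | q) and return the top k."""
    conf = {c: m.predict_proba([query_features])[0][1] for c, m in models.items()}
    return sorted(conf, key=conf.get, reverse=True)[:k]
```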

7.3 Features

We investigate three types of features. Consistent with Chapter 3, we explore corpus features, derived from sampled documents, and query category features, derived from the query topic. Additionally, we explore click-through features, derived from queries with clicks on documents from the collection.

7.3.1 Corpus Features

As in Chapter 3, corpus features are derived from sampled collection documents. In this work, however, instead of using a single collection scoring method as a type of feature, we adopted three existing resource selection methods: CORI [14], Seo and Croft’s geometric average approach [86] (GAVG), and ReDDE.top, the same ReDDE variant used in Chapter 3. Although these three methods derive evidence from the same source (i.e., collection samples), they model different phenomena. CORI and GAVG model the similarity between the query and collection text. They differ in that CORI models the collection as one large query-independent bag of words, while GAVG focuses on those collection documents most similar to the query. In contrast, ReDDE.top approximates

1 Available at: http://www.csie.ntu.edu.tw/cjlin/liblinear/

each collection's average retrieval score given a full-dataset retrieval. Because they model different phenomena, these resource-scoring methods may contribute complementary evidence. Thus, we included features from all three. Next, we describe each in more detail.

CORI

CORI adapts INQUERY's inference net document ranking approach to ranking collections [14]. Here, all statistics are derived from sampled documents rather than the full collection. We used the CORI score with respect to each collection as a feature, for a total of n CORI score features.

Geometric Average (GAVG)

Seo and Croft’s approach [86] issues the query to the centralized sample index and scores collection Ci by the geometric average query likelihood from its top m samples,

\mathrm{GAVG}_q(C_i) = \left( \prod_{d \in \text{top } m \text{ of } S_i} P(q \mid d) \right)^{\frac{1}{m}},

where P(q | d) is the query likelihood score and S_i denotes C_i's sample set. If fewer than m documents from S_i match the query, the product above is padded with P_min(q | d), the retrieval's minimum query likelihood. We used the GAVG score with respect to each collection as a feature, for a total of n GAVG score features.
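The following sketch computes the GAVG feature in log space to avoid numerical underflow. It is a minimal illustration assuming query-likelihood scores from a centralized sample index are already available and strictly positive; the function and variable names are hypothetical.

```python
import math

def gavg_score(sample_scores, m, min_score):
    """sample_scores: query-likelihood scores P(q|d) for the collection's
    sampled documents that match the query. The geometric average is taken
    over the top m scores; if fewer than m documents match, the product is
    padded with the retrieval's minimum query-likelihood score."""
    top = sorted(sample_scores, reverse=True)[:m]
    padded = top + [min_score] * (m - len(top))
    # Geometric mean = exponential of the mean log score.
    return math.exp(sum(math.log(s) for s in padded) / m)

# Example: a collection with only two matching samples, m = 10.
print(gavg_score([0.02, 0.005], m=10, min_score=1e-6))
```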

ReDDE.top

Like GAVG, given a query, ReDDE.top conducts a retrieval from a centralized sample index, producing a retrieval score for every sampled document. Then, it assumes that each scored sampled document within the top N represents some number of documents in its original collection with the same score. Finally, by normalizing across collections, it scores each collection by the average retrieval score in its estimated distribution of scores. More details are presented in Section 3.3.1.

We used two sets of ReDDE.top features, one set using N = 100 and a second set using N = 1,000, for the following reason. The first set accumulates scores from the top 100 sampled documents. However, a collection with no samples in the top 100 receives a score of zero. This is problematic if the number of collections with a non-zero score is less than k, the number of collections to select. To increase the number of collections with a non-zero ReDDE.top score, we used a second set of ReDDE.top features with N = 1,000. In total, we use 2n ReDDE.top features: n features corresponding to the ReDDE.top score for each collection with N = 100 and another n with N = 1,000.
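A minimal sketch of this style of scoring is shown below. It assumes retrieval scores from the centralized sample index, per-collection sample sizes, and estimated collection sizes are available, and it follows only the high-level description above (the exact formulation in Section 3.3.1 is not reproduced here); all names are hypothetical.

```python
def redde_top_scores(ranked_samples, sample_size, coll_size, N):
    """ranked_samples: list of (collection_id, retrieval_score) for sampled
    documents, sorted by descending score. Each sampled document in the top N
    is assumed to represent coll_size[c] / sample_size[c] documents from its
    collection, all with the same score."""
    mass, count = {}, {}
    for coll, score in ranked_samples[:N]:
        weight = coll_size[coll] / sample_size[coll]
        mass[coll] = mass.get(coll, 0.0) + weight * score
        count[coll] = count.get(coll, 0.0) + weight
    # Average retrieval score in each collection's estimated score
    # distribution, normalized across collections.
    avg = {c: mass[c] / count[c] for c in mass}
    total = sum(avg.values()) or 1.0
    return {c: avg[c] / total for c in avg}
```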

7.3.2 Query Category Features

If collections are topically focused, then a potentially useful source of evidence is the topic of the query. We built a set of binary query-topic classifiers as follows. First, we selected a set of 166 topics from the Open Directory Project (ODP) hierarchy and crawled Web documents associated with each of these ODP nodes.2 Then, these document sets were used to train logistic-regression classifiers (one per category) using unigram features. Because queries are terse, instead of applying each trained classifier directly to the query string (represented as a unigram vector), we apply each classifier to the documents in the centralized sample index. Finally, we classify the query using a retrieval from this index. We set the value of category feature y_i according to,

\mathrm{CAT}_q(y_i) = \frac{1}{Z} \sum_{d \in \mathcal{R}^{N,q}_{S}} P(q \mid d) \left( \frac{P(y_i \mid d)}{\sum_{y_j \in \mathcal{Y}} P(y_j \mid d)} \right), \qquad (7.1)

where P(y_i | d) is category y_i's confidence value on document d and the normalizer Z = \sum_{d \in \mathcal{R}^{N,q}_{S}} P(q \mid d). For these features, we set N = 100.

We use 166 query category features (one per topical category). In Chapter 6 we also used query-category features, but, in order to reduce the number of features, we clustered ODP nodes into 30 categories. In this work, we used a larger number of category features for two reasons. First, we focus on testbeds with hundreds of resources (i.e., 250 and 1000 resources). Thus, using more fine-grained topics seems useful for distinguishing between them. Second, as we show later, our zero-judgement training method can be used to produce a fairly large training set (i.e., 75,000 queries). Thus, reducing the number of category features (at the expense of possibly losing discriminative category granularity) seems undesirable.
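The sketch below computes Equation 7.1 from the top-N documents retrieved from the centralized sample index, assuming per-document query-likelihood scores and per-document category confidences are already available; the function and variable names are hypothetical.

```python
def query_category_features(retrieved, categories, N=100):
    """retrieved: list of (P(q|d), {category: P(y|d)}) pairs for the top-N
    documents retrieved from the centralized sample index.
    Returns CAT_q(y_i) for every category, following Equation 7.1."""
    top = retrieved[:N]
    Z = sum(p_qd for p_qd, _ in top) or 1.0  # normalizer
    feats = {}
    for y in categories:
        total = 0.0
        for p_qd, cat_conf in top:
            denom = sum(cat_conf.values()) or 1.0
            total += p_qd * (cat_conf.get(y, 0.0) / denom)
        feats[y] = total / Z
    return feats
```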

7.3.3 Click-through Features

Once in operation, a resource selection system has access to user feedback in the form of clicks on collection documents. A click on a document can be viewed as a surrogate for document relevance. We view a click on a document as a surrogate for collection relevance, in favor of the collection from which the document originates. Click-through features exploit the similarity between the query and those previously seen queries that have resulted in a click on a document from the collection.

We model click events as follows. For a collection C_i, let Q_i denote the set of queries (allowing duplicates) associated with a click event on a document from C_i. We index each Q_i as an individual pseudo-document in a corpus of n documents. Then, given a query, we use the retrieval score of each Q_i as a feature. This results in n click-through features (one per collection).
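A minimal sketch of this feature follows, using a simple Dirichlet-smoothed query-likelihood score over each click-query pseudo-document; any standard retrieval function over the pseudo-document index could play this role, and the names below are hypothetical.

```python
import math
from collections import Counter

def build_pseudo_docs(click_log):
    """click_log: iterable of (query, collection_id) click events.
    Concatenates each collection's clicked queries into one pseudo-document,
    represented as a term-frequency Counter."""
    docs = {}
    for query, coll in click_log:
        docs.setdefault(coll, Counter()).update(query.lower().split())
    return docs

def click_feature(query, pseudo_doc, background, mu=2500):
    """Dirichlet-smoothed log query likelihood of the query given the
    collection's pseudo-document of previously clicked queries."""
    dlen = sum(pseudo_doc.values())
    score = 0.0
    for term in query.lower().split():
        p_bg = background.get(term, 1e-9)  # background term probability
        score += math.log((pseudo_doc.get(term, 0) + mu * p_bg) / (dlen + mu))
    return score
```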

2 http://www.dmoz.org

7.3.4 Summary of Features

The total number of features varies with the number of candidate resources (denoted by n), which varies across testbeds. CORI and GAVG contribute n features each (one per resource). ReDDE.top contributes 2n features (n with N = 100 and another n with N = 1,000). In addition to these, we use 166 query-category features and n click-through features (one per resource).

7.4 Methods and Materials

In this section, we describe our experimental methodology: our experimental testbeds, evaluation queries, evaluation metrics, and the baselines against which we compare the proposed classification-based framework. We also describe a few implementation details pertaining to our zero-judgement training method and some of our features.

7.4.1 Evaluation Testbeds

The TREC GOV2 test collection is a large crawl of the ".gov" portion of the Web, containing about 25M documents.3 The GOV2 corpus was used to construct three experimental federated search testbeds, varying the number of target collections: 1,000, 250, and 30. We refer to these testbeds as gov2.1000, gov2.250, and gov2.30, respectively. We constructed the gov2.1000 testbed following the procedure described in Fallen and Newby [34]. While the GOV2 corpus consists of about 17,000 unique hosts (e.g., www.epa.gov), the largest 1,000 hosts contain about 90% of the GOV2 collection (i.e., about 22M documents). The gov2.1000 testbed was constructed by treating each of the largest 1,000 hosts as a separate collection.

Testbeds gov2.250 and gov2.30 were constructed by clustering hosts in the gov2.1000 testbed into 250 and 30 clusters, respectively, as follows. First, to represent hosts, we randomly sampled 1,000 documents from each. We define a host's vocabulary by all term-stems (using the Porter stemmer [79]) appearing at least 10 times in its document sample. Host-specific language models were constructed using maximum likelihood without smoothing. The distance between hosts was computed using the Jeffrey divergence between their respective language models [53], also known as the symmetric Kullback-Leibler divergence,

D_J(\theta_i \,\|\, \theta_j) = \sum_{w} \big( P(w \mid \theta_i) - P(w \mid \theta_j) \big) \log_2 \frac{P(w \mid \theta_i)}{P(w \mid \theta_j)}.

We used average-link agglomerative clustering, iteratively merging clusters according to their hosts' average pair-wise similarity. We seeded the clustering by first combining hosts belonging to the same government entity (e.g., nih, usgs, usda, epa, uspto, nasa). Figure 7.1 shows each testbed's collection size distribution. The gov2.1000 and gov2.250 testbeds have a few large collections and many small collections, while gov2.30 has many large collections and a few small ones.
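To make the clustering distance above concrete, here is a minimal sketch. It assumes each host's language model is a dict of term probabilities and, to keep the toy example well-defined with unsmoothed models, it sums only over terms with non-zero probability in both models (a detail the dissertation does not spell out); all names are hypothetical.

```python
import math

def jeffrey_divergence(lm_i, lm_j):
    """Symmetric KL (Jeffrey) divergence between two unigram language models,
    each given as {term: probability}. Terms with zero probability in either
    model are skipped, an assumption made for this sketch."""
    shared = set(lm_i) & set(lm_j)
    return sum(
        (lm_i[w] - lm_j[w]) * math.log2(lm_i[w] / lm_j[w])
        for w in shared
        if lm_i[w] > 0 and lm_j[w] > 0
    )

# Example with two tiny maximum-likelihood models.
print(jeffrey_divergence({"tax": 0.6, "form": 0.4}, {"tax": 0.3, "park": 0.7}))
```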

3 http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm

[Figure 7.1: three histograms (GOV2.1000, GOV2.250, GOV2.30); x-axis: Collection Size (10^x); y-axis: Frequency.]

Figure 7.1: Collection size distribution of our three experimental testbeds.

In the gov2.1000 testbed, 720 (72%) collections have fewer than 10,000 documents and 438 (44%) have fewer than 5,000 documents. In the gov2.250 testbed, 131 (66%) collections have fewer than 10,000 documents. In the gov2.30 testbed, 24 (80%) collections have more than 1M documents.

7.4.2 Queries

We evaluated on TREC queries 701-850, used in the ad-hoc retrieval task of the Terabyte Track from 2004, 2005, and 2006. Recall that we are missing about 10% of the GOV2 collection in our testbeds, corresponding to those documents in GOV2 not originating from the 1,000 largest hosts. In spite of these missing documents, all queries had at least one relevant document in our testbeds except query number 703, which has no relevant documents in the entire GOV2 collection.

7.4.3 Evaluation Metrics

We are interested in the quality of document rankings produced by selecting only a few collections and combining their documents into a single ranked list. For this reason, we evaluate in terms of precision at different cut-off points, P@{5, 10, 30}, when selecting between 1 and 5 collections. To focus evaluation on resource selection rather than results merging, we assume access to a function that provides the score that a centralized retrieval would have provided for every document retrieved. Access to such a function is realistic in a cooperative setting, for example, where resources divulge corpus statistics (e.g., number of documents, average document length, IDF values, etc.) to facilitate document score normalization and merging. Thus, given a set of selected collections, we combine their documents into a single ranked list according to each document's "global" retrieval score.
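For concreteness, this evaluation can be sketched as follows: merge the documents retrieved from the selected collections by their centralized-index ("global") scores, then compute precision at a cut-off. This is a minimal illustration with hypothetical names.

```python
def merge_selected(retrievals, selected, global_score):
    """retrievals: {collection_id: [doc_id, ...]} for the selected collections.
    Documents are merged into one list ordered by their centralized-index
    ('global') retrieval score."""
    docs = {d for c in selected for d in retrievals[c]}
    return sorted(docs, key=lambda d: global_score[d], reverse=True)

def precision_at(ranking, relevant, cutoff):
    """Fraction of the top `cutoff` merged documents that are relevant."""
    return sum(1 for d in ranking[:cutoff] if d in relevant) / cutoff

# Example usage for one query (inputs are hypothetical):
# merged = merge_selected(retrievals, selected, global_score)
# print([precision_at(merged, relevant_docs, c) for c in (5, 10, 30)])
```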

7.4.4 Implementation Details

Zero-Judgement Training

As described in Section 7.2, we train a classification system to select collections based on their contribution to the top ranks of a full-dataset retrieval. During training, for a given collection, the query is considered a positive instance if the collection contributes more than τ documents to the top T ranks of a full-dataset retrieval. We set T = 30 because we evaluate merged results in terms of P@{5, 10, 30}, and we set τ = 3 in order to ignore collections that contribute only a few documents to the top 30. We do not claim that this configuration is optimal. Another alternative, for example, would have been to train different models using T = 5, 10, and 30 when evaluating based on P@5, P@10, and P@30, respectively.

Our "training data" consists of full-dataset retrievals which select and merge content from all collections. There are multiple possibilities for selecting a set of "training" queries (e.g., using a query-log or generating artificial queries from the collection text). However, one requirement is that there be enough positive instances for training for every collection. In other words, for each collection, there should be a sufficient number of queries with hits in the collection. In this work, training queries were sampled from the AOL query-log, which consists of about 20M queries that were issued by about 650K users to the AOL search engine over a period of three months in 2006.

Recall that our three experimental testbeds consist of clusters of hosts from the ".gov" domain (1,000 singleton host clusters in the case of the gov2.1000 testbed). Click events in the AOL query-log are uniquely identified by user ID, query, date/time, and host URL (i.e., the host associated with the clicked document). Therefore, it is possible to identify all AOL click events associated with any one of our 1,000 hosts. For each host, we estimate a query multinomial using each query's relative frequency in the host's click events. A set of 75,000 queries was sampled (without replacement) using a two-step iterative process, as sketched below. First, a host is sampled uniformly from the set of 1,000 hosts. Then, a query is sampled from the host's query multinomial. Hosts were sampled uniformly to favor coverage across hosts and, thereby, coverage across collections in our three testbeds. Queries were sampled according to their relative frequency in click events in order to favor popular queries likely to have hits in the collection.
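The two-step sampling procedure can be sketched as follows, assuming a per-host query multinomial estimated from click frequencies; the function and variable names are hypothetical.

```python
import random

def sample_training_queries(host_multinomials, n_queries, seed=0):
    """host_multinomials: {host: {query: probability}}.
    Step 1: sample a host uniformly. Step 2: sample a query from that host's
    multinomial. Repeat until n_queries distinct queries are drawn (sampling
    without replacement over queries; assumes enough distinct queries exist)."""
    rng = random.Random(seed)
    hosts = list(host_multinomials)
    sampled = set()
    while len(sampled) < n_queries:
        host = rng.choice(hosts)                      # uniform over hosts
        queries = list(host_multinomials[host])
        weights = [host_multinomials[host][q] for q in queries]
        query = rng.choices(queries, weights=weights, k=1)[0]
        sampled.add(query)                            # without replacement
    return sampled
```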

Features Implementation

Some of our features (i.e., CORI, GAVG, ReDDE.top) required sampling documents from every collection. Documents were sampled from each collection uniformly without replacement. In some federated search environments, collection documents may only be accessible to the system via a search interface. If this is the case, prior work suggests that high-quality samples (similar to those obtained using uniform sampling) can be obtained using query-based sampling [16]. Click-through features required simulating click events on collection documents.

Click-through data collected over time was also simulated using the AOL query-log. We collected a total of 305,236 click events associated with our 1,000 ".gov" hosts. About 25% of hosts had no AOL click events and about 35% had fewer than 50. Click events associated with a query in our test set (described in Section 7.4.2) were omitted from the set of queries used for training and from those used to simulate click-through data.

7.4.5 Single-Evidence Baselines

The classification approach was evaluated against six single-evidence baselines, including one for every type of feature used in the classification approach. ReDDE.top was used as a single-evidence baseline as follows. First, collections are prioritized by ReDDE.top score using N = 100. A second priority list is constructed using N = 1,000. If the first priority list has fewer than k collections, the remaining collections are selected from the second priority list.

In addition to ReDDE.top, we evaluated against the original version of ReDDE [94], which is described in detail in Section 2.3. We used the version of ReDDE referred to as modified ReDDE in Si and Callan [94]. As we did with ReDDE.top, modified ReDDE creates two priority lists: one by setting ReDDE parameter τ = 0.0005 and a second one by setting τ = 0.003. Given k, modified ReDDE selects collections from the first priority list only if the collection has a ReDDE mass greater than 0.10. If more collections are needed to complete the k collections, the remaining collections are selected from the second priority list.

Our category baseline (denoted CATS) scores collections based on the similarity between the topical profile of the query and the topical profile of the collection. The query's topical profile is given by normalizing Equation 7.1 across categories, such that

\sum_{y_j \in \mathcal{Y}} \mathrm{CATS}_q(y_j) = 1. The collection's topical profile is defined by,

P(y_j \mid C_i) = \frac{1}{|S_i|} \sum_{d \in S_i} \frac{P(y_j \mid d)}{\sum_{y_k \in \mathcal{Y}} P(y_k \mid d)},

where S_i denotes the set of documents sampled from collection C_i. The similarity between the query and collection topical profiles is given by the Bhattacharyya distance between these two distributions [7],

\mathcal{B}(q, C_i) = \sum_{y_k \in \mathcal{Y}} \sqrt{P(y_k \mid q) \times P(y_k \mid C_i)}.

Finally, we also evaluated a baseline approach that scores collections by the query likelihood given their click-through queries, denoted CLICK, as described in Section 7.3.3.
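A minimal sketch of the CATS baseline score follows, assuming the query and collection topical profiles have already been computed as above (e.g., with a function like the hypothetical query_category_features sketch in Section 7.3.2); all names are hypothetical.

```python
import math

def cats_score(query_profile, collection_profile):
    """Bhattacharyya similarity between the query's topical profile and a
    collection's topical profile, both given as {category: probability}."""
    categories = set(query_profile) | set(collection_profile)
    return sum(
        math.sqrt(query_profile.get(y, 0.0) * collection_profile.get(y, 0.0))
        for y in categories
    )

# Example: rank collections by CATS score for one query (inputs hypothetical).
# ranked = sorted(profiles, key=lambda c: cats_score(q_profile, profiles[c]),
#                 reverse=True)
```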

7.5 Experimental Results

As previously mentioned, one of our objectives is to investigate the effect of resource representation quality on resource selection performance across algorithms. In particular, we focus on sample set quality. Four of our single-evidence baselines (i.e., CORI,

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.224  0.405  0.360      0.166  0.192  0.183  0.392 (-3.31%)
2  0.569  0.315  0.446  0.447      0.275  0.256  0.239  0.436 (-2.40%)
3  0.569  0.372  0.479  0.489      0.336  0.302  0.277  0.482 (-1.37%)
4  0.569  0.405  0.483  0.506      0.380  0.321  0.322  0.506 (0.00%)
5  0.569  0.417  0.495  0.529      0.395  0.336  0.337  0.510 (-3.55%)

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.188  0.331  0.321      0.150  0.152  0.147  0.355 (7.30%)
2  0.534  0.264  0.390  0.394      0.248  0.215  0.194  0.399 (1.19%)
3  0.534  0.323  0.423  0.436      0.302  0.261  0.228  0.446 (2.47%)
4  0.534  0.359  0.438  0.457      0.344  0.285  0.270  0.458 (0.15%)
5  0.534  0.380  0.442  0.484      0.364  0.302  0.281  0.468 (-3.33%)

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.113  0.201  0.206      0.102  0.095  0.091  0.224 (8.68%)
2  0.452  0.167  0.266  0.268      0.168  0.139  0.124  0.281 (4.59%)
3  0.452  0.217  0.305  0.312      0.206  0.170  0.152  0.319 (2.51%)
4  0.452  0.247  0.319  0.337      0.248  0.194  0.185  0.339 (0.53%)
5  0.452  0.266  0.325  0.362      0.275  0.205  0.195  0.352 (-2.60%)

Table 7.1: Results for experimental condition gov2.1000.1000.

GAVG, ReDDE.top, and ReDDE) focus exclusively on evidence derived from collection samples. The classification-based approach used these as input features and, thus, is also indirectly affected by sample set quality. To investigate the effect of sample set quality on performance, every method was evaluated on all three testbeds under two conditions: using sample sets of 300 samples per collection and using sample sets of 1000 samples per collection. We denote the resulting six experimental conditions as gov2.x.y, where x denotes the testbed and y denotes the sample set size. For instance, gov2.1000.1000 denotes the condition which uses the gov2.1000 testbed and sample sets of 1000 samples per collection.

Results are presented based on P@{5, 10, 30}. Tables 7.1-7.3 show results across our three testbeds when sampling 1,000 documents from each collection. Tables 7.4-7.6 show results when sampling 300 documents from each collection. In addition, to evaluate the overall performance of federated search, we present results from centralized retrieval (denoted as "full"), from a single index of all n collections combined. The classification approach's percent improvement is with respect to the best single-evidence baseline. Statistical significance is with respect to all single-evidence baselines. Significance, using a paired t-test on queries, is denoted with a # at the p < 0.05 level and a " at the p < 0.005 level.

The classification-based approach either significantly outperforms or is statistically indistinguishable from the best single-evidence baseline in all cases. In the gov2.1000.1000 condition, the GAVG and ReDDE.top baselines perform at the same level as the classification approach. We investigate how this experimental condition favors these methods in the next section.

From the performance of our single-evidence baselines, we notice two trends. First, all baselines that derive evidence from sampled documents (i.e., CORI, GAVG, ReDDE.top,

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.137  0.294  0.326      0.238  0.174  0.220  0.419 (28.40%)"
2  0.569  0.228  0.328  0.408      0.360  0.242  0.303  0.494 (21.05%)"
3  0.569  0.291  0.360  0.432      0.417  0.272  0.340  0.497 (14.91%)"
4  0.569  0.323  0.374  0.475      0.483  0.313  0.364  0.505 (4.44%)
5  0.569  0.357  0.389  0.489      0.503  0.345  0.388  0.515 (2.40%)

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.105  0.248  0.283      0.209  0.142  0.188  0.371 (31.35%)"
2  0.534  0.186  0.293  0.363      0.311  0.201  0.262  0.452 (24.40%)"
3  0.534  0.248  0.330  0.394      0.372  0.229  0.291  0.460 (16.70%)"
4  0.534  0.282  0.338  0.432      0.430  0.266  0.308  0.477 (10.58%)"
5  0.534  0.293  0.350  0.438      0.457  0.297  0.334  0.487 (6.46%)#

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.068  0.158  0.197      0.143  0.090  0.130  0.265 (34.13%)"
2  0.452  0.124  0.196  0.272      0.230  0.132  0.182  0.343 (26.19%)"
3  0.452  0.168  0.233  0.309      0.283  0.151  0.213  0.359 (16.05%)"
4  0.452  0.204  0.245  0.337      0.331  0.187  0.227  0.372 (10.22%)"
5  0.452  0.226  0.262  0.344      0.353  0.208  0.246  0.382 (8.38%)"

Table 7.2: Results for experimental condition gov2.250.1000.

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.281  0.302  0.322      0.295  0.323  0.298  0.370 (14.52%)
2  0.569  0.380  0.403  0.419      0.428  0.384  0.374  0.447 (4.39%)
3  0.569  0.434  0.446  0.456      0.447  0.427  0.421  0.487 (6.76%)
4  0.569  0.462  0.468  0.487      0.472  0.451  0.454  0.499 (2.48%)
5  0.569  0.474  0.472  0.503      0.491  0.482  0.460  0.507 (0.80%)

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.246  0.264  0.269      0.246  0.280  0.255  0.318 (13.67%)
2  0.534  0.332  0.348  0.361      0.368  0.340  0.335  0.393 (6.56%)
3  0.534  0.391  0.387  0.403      0.392  0.384  0.374  0.438 (8.49%)#
4  0.534  0.426  0.415  0.442      0.415  0.407  0.413  0.461 (4.41%)
5  0.534  0.445  0.429  0.462      0.444  0.433  0.423  0.471 (2.03%)

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.181  0.188  0.185      0.167  0.195  0.176  0.220 (12.87%)
2  0.452  0.253  0.262  0.261      0.269  0.267  0.241  0.304 (13.32%)#
3  0.452  0.309  0.294  0.304      0.304  0.302  0.280  0.346 (11.71%)#
4  0.452  0.339  0.326  0.337      0.328  0.313  0.309  0.361 (6.47%)
5  0.452  0.353  0.341  0.358      0.345  0.334  0.320  0.377 (5.25%)

Table 7.3: Results for experimental condition gov2.30.1000.

and ReDDE) performed better using 1,000 vs. 300 samples per collection. This shows that these methods are sensitive to the sample set size. They perform better with more evidence, which is consistent with previous evaluations [92]. Second, the relative performance between single-evidence methods varied across experimental conditions. In the gov2.1000.1000 condition, GAVG and ReDDE.top clearly outperform CATS and CLICK in all cases. This is not true in the gov2.30.300 condition. When k = 1 in the gov2.30.300 condition, CATS and CLICK both outperform GAVG and ReDDE.top. These approaches derive evidence from different sources. GAVG and ReDDE.top derive evidence exclusively

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.209  0.323  0.303      0.125  0.200  0.183  0.383 (18.26%)
2  0.569  0.306  0.358  0.407      0.221  0.263  0.239  0.427 (4.95%)
3  0.569  0.340  0.399  0.450      0.283  0.301  0.277  0.463 (2.99%)
4  0.569  0.370  0.427  0.466      0.313  0.313  0.322  0.482 (3.46%)
5  0.569  0.381  0.440  0.478      0.350  0.333  0.337  0.490 (2.53%)

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.174  0.277  0.270      0.110  0.166  0.147  0.332 (19.90%)#
2  0.534  0.260  0.317  0.358      0.191  0.224  0.194  0.392 (9.57%)
3  0.534  0.293  0.361  0.399      0.253  0.269  0.228  0.432 (8.07%)
4  0.534  0.321  0.381  0.419      0.283  0.276  0.270  0.444 (5.93%)
5  0.534  0.339  0.399  0.433      0.321  0.297  0.281  0.456 (5.27%)

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.097  0.174  0.171      0.074  0.101  0.091  0.218 (25.65%)#
2  0.452  0.162  0.216  0.245      0.126  0.141  0.124  0.270 (10.52%)
3  0.452  0.191  0.249  0.284      0.172  0.176  0.152  0.304 (7.10%)
4  0.452  0.211  0.267  0.304      0.197  0.185  0.185  0.324 (6.47%)
5  0.452  0.230  0.284  0.321      0.224  0.208  0.195  0.341 (6.05%)

Table 7.4: Results for experimental condition gov2.1000.300.

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.125  0.212  0.274      0.228  0.111  0.220  0.391 (42.65%)"
2  0.569  0.184  0.267  0.353      0.333  0.176  0.303  0.472 (33.84%)"
3  0.569  0.246  0.286  0.407      0.391  0.232  0.340  0.494 (21.45%)"
4  0.569  0.267  0.306  0.434      0.428  0.268  0.364  0.498 (14.86%)"
5  0.569  0.290  0.319  0.462      0.447  0.279  0.388  0.518 (12.21%)"

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.096  0.178  0.228      0.186  0.089  0.188  0.342 (49.71%)"
2  0.534  0.161  0.234  0.304      0.270  0.156  0.262  0.427 (40.40%)"
3  0.534  0.218  0.249  0.366      0.350  0.195  0.291  0.450 (22.94%)"
4  0.534  0.236  0.273  0.399      0.387  0.218  0.308  0.456 (14.31%)"
5  0.534  0.258  0.283  0.417      0.409  0.233  0.334  0.476 (13.99%)"

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.057  0.115  0.148      0.133  0.055  0.130  0.241 (62.60%)"
2  0.452  0.109  0.161  0.209      0.190  0.107  0.182  0.315 (51.13%)"
3  0.452  0.149  0.182  0.253      0.251  0.138  0.213  0.333 (31.42%)"
4  0.452  0.173  0.200  0.292      0.277  0.154  0.227  0.350 (19.77%)"
5  0.452  0.187  0.210  0.314      0.302  0.166  0.246  0.362 (15.34%)"

Table 7.5: Results for experimental condition gov2.250.300.

from sampled documents. CATS derives evidence from the topical similarity between the query and the collection. CLICK derives evidence from click-through data. Different types of evidence were particularly useful under different conditions.

Two results support the hypothesis that full-dataset retrievals can be used to harvest data for training a machine-learned resource selection method. First, a full-dataset retrieval outperformed all methods, including the classification approach, in all cases. Second, the classification approach, trained on data harvested from full-dataset retrievals, performed at the same level as or better than the best single-evidence baseline in all cases.

P@5
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.569  0.224  0.266  0.251      0.231  0.282  0.298  0.374 (25.68%)"
2  0.569  0.317  0.322  0.350      0.342  0.353  0.374  0.450 (20.07%)"
3  0.569  0.409  0.376  0.391      0.400  0.412  0.421  0.493 (16.88%)"
4  0.569  0.446  0.424  0.403      0.413  0.444  0.454  0.487 (7.40%)
5  0.569  0.467  0.436  0.443      0.442  0.464  0.460  0.509 (8.91%)

P@10
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.534  0.194  0.222  0.206      0.178  0.238  0.255  0.321 (25.79%)"
2  0.534  0.287  0.271  0.313      0.292  0.299  0.335  0.402 (20.04%)"
3  0.534  0.361  0.327  0.353      0.338  0.352  0.374  0.442 (18.13%)"
4  0.534  0.403  0.363  0.370      0.376  0.394  0.413  0.457 (10.55%)#
5  0.534  0.432  0.385  0.413      0.401  0.428  0.423  0.479 (10.71%)#

P@30
k  full   cori   gavg   redde.top  redde  cats   click  classification
1  0.452  0.128  0.153  0.143      0.118  0.153  0.176  0.223 (26.75%)"
2  0.452  0.206  0.198  0.227      0.200  0.218  0.241  0.312 (29.38%)"
3  0.452  0.263  0.243  0.265      0.246  0.271  0.280  0.347 (24.06%)"
4  0.452  0.308  0.282  0.291      0.282  0.300  0.309  0.367 (18.51%)"
5  0.452  0.336  0.295  0.334      0.317  0.322  0.320  0.390 (15.97%)"

Table 7.6: Results for experimental condition gov2.30.300.

7.6 Discussion

In this section, we further investigate the effect of resource representation quality (i.e., sample set quality) on resource selection performance across methods. Furthermore, we present results from a set of feature ablation studies which investigate whether different features are more or less predictive given varying degrees of resource representation quality.

7.6.1 Collection Representation Quality

As shown in Table 7.1, ReDDE.top and GAVG, which derive evidence from sampled documents, performed well in the gov2.1000.1000 experimental condition. In this condition, 1,000 documents were sampled from every collection. As previously mentioned, 72% of collections in the gov2.1000 testbed have fewer than 10,000 documents and 44% have fewer than 5,000 documents. This means that a sample set of 1,000 documents constitutes at least 20% of the full collection for about half the collections in gov2.1000. In other words, in the gov2.1000.1000 condition, ReDDE.top and GAVG had access to fairly complete representations for about half the collections.

Furthermore, we would expect these methods to do well if these smaller collections frequently contain relevant documents. To examine this, we binned collections by their number of documents and determined, for each bin, the number of times a collection from the bin contains at least 10 documents relevant to a test query. These histograms are shown in Figure 7.2. In the gov2.1000 testbed, the smallest collections, with 1,000-10,000 documents, most often contained at least 10 documents relevant to a test query. In contrast, in the gov2.250 and gov2.30 testbeds, the collections that most often contain

[Figure 7.2: three histograms (GOV2.1000, GOV2.250, GOV2.30); x-axis: Collection Size (10^x); y-axis: Frequency.]

Figure 7.2: Number of instances in which a collection of a given size (bin) contributes at least 10 relevant documents to a test query.

at least 10 documents relevant to a test query have more than 100,000 documents.

To conclude, we can say that in the gov2.1000.1000 condition, corpus-based single-evidence baselines such as ReDDE.top and GAVG benefited from having fairly complete representations (i.e., large sample sets relative to the collection size) for those collections containing many relevant documents. Interestingly, in this condition, the classification approach performed at the same level as these methods. In other conditions, we see a clearer benefit from the kind of evidence-integration afforded by the classification approach.

7.6.2 Feature Ablation Studies

The classification approach integrates different types of evidence as input features. In this section, we conduct a set of feature ablation studies to test the contribution of evidence integration to the classification approach's performance. We focus on the experimental conditions gov2.1000.1000 and gov2.30.300. Our motivation is to verify that the classification approach is capable of focusing on the most reliable features under different experimental conditions. Based on the analysis in Section 7.6.1, in the gov2.1000.1000 condition, we expect the classification approach to focus on evidence derived from sampled documents (i.e., CORI, GAVG, and ReDDE.top features). In the gov2.30.300 condition, we expect it to focus on other types of evidence. We individually omitted each feature type (CORI, GAVG, ReDDE.top, CATS, and CLICK) and measured its contribution to performance based on the classifier's percent decrease in precision. Significance, again, is tested using a paired t-test on queries.

Results are presented in Table 7.7. These results confirm our hypothesis. In the gov2.1000.1000 condition, in the majority of cases, omitting ReDDE.top features leads to a significant drop in performance. This is because, in the gov2.1000.1000 condition, ReDDE.top had access to fairly complete representations for those collections with relevant content. On the other hand, in the gov2.30.300 condition, CLICK features were more predictive, particularly in terms of P@30.

gov2.30.300 P@10 k all.features no.cori no.gavg no.redde.top no.cats no.click 1 0.321 0.321 (0.21%) 0.319 (-0.63%) 0.305 (-4.81%) 0.324 (1.05%) 0.279 (-13.18%) 2 0.402 0.394 (-2.00%) 0.390 (-3.01%) 0.392 (-2.50%) 0.388 (-3.51%) 0.379 (-5.84%) 3 0.442 0.438 (-0.91%) 0.428 (-3.04%) 0.423 (-4.26%) 0.431 (-2.43%) 0.435 (-1.52%) 4 0.457 0.449 (-1.76%) 0.455 (-0.44%) 0.469 (2.64%) 0.450 (-1.62%) 0.456 (-0.29%) 5 0.479 0.477 (-0.42%) 0.472 (-1.40%) 0.474 (-0.84%) 0.480 (0.28%) 0.465 (-2.81%) P@30 k all.features no.cori no.gavg no.redde.top no.cats no.click 1 0.223 0.224 (0.70%) 0.219 (-1.41%) 0.206 (-7.34%) 0.228 (2.41%) 0.191 (-14.37%) 2 0.312 0.309 (-1.22%) 0.300 (-4.08%) # 0.301 (-3.51%) 0.295 (-5.52%) 0.283 (-9.46%) # 3 0.347 0.338 (-2.51%) 0.333 (-3.99%) 0.335 (-3.54%) 0.330 (-4.90%) # 0.319 (-8.12%) # 4 0.367 0.356 (-2.93%) 0.364 (-0.79%) 0.370 (0.98%) 0.349 (-4.70%) 0.349 (-4.82%) 5 0.390 0.380 (-2.52%) 0.387 (-0.63%) 0.383 (-1.72%) 0.381 (-2.29%) 0.372 (-4.65%)

Table 7.7: Feature type ablation study. A significant drop in performance, using a paired t-test on queries, is denoted with a # at the p < 0.05 level and a " at the p < 0.005 level.

This analysis demonstrates that the classification approach is capable on focusing on the most reliable features depending on the condition. Also, although CORI, GAVG, and ReDDE.top features derive evidence from the same source (i.e., sampled documents), they model different phenomena. Our results show that they do not contribute equally to performance. This further motivates a feature integration approach, even when the features are derived from the same source.

7.7 Summary

We evaluated a classification approach to resource selection in a homogeneous environment, one in which collections contain similar types of documents and satisfy similar types of information requests. The classification approach was compared to a number of single-evidence baselines, including three existing resource selection methods that have produced good results in previous evaluations. From this set of experiments, we draw the following conclusions.

• Most resource selection methods derive evidence from a single source of evidence, mainly sampled collection documents. Casting resource selection as a multiclass learning problem allows us to integrate multiple sources of evidence as input features. Evidence integration leads to a more robust solution. Different sources of evidence are more or less predictive under different experimental conditions. By integrating them all as input features, the classification approach is able to perform either at the same level as or better than the best single-evidence baseline.

• This is the first time that a resource selection method explicitly models the full-dataset retrieval. A classification system is trained to select collections based on their impact on a retrieval that merges content from all collections. This is significant because a lot of training data can be produced without human relevance judgements, as long as the system has access to (offline) full-dataset retrievals.

• Corpus-based evidence, derived from collection samples, was integrated using three existing resource selection methods. Although they derive evidence from the same source, these three methods model different phenomena. A set of feature ablation studies (across experimental conditions) shows that these methods are not redundant in terms of predictiveness. This further validates a feature-integration approach, even if features are derived from the same source (in this case, sampled documents).

• In a homogeneous environment, a machine learning approach to resource selection can be trained using full-dataset retrievals, without using human relevance judgements. This result may generalize to a heterogeneous environment, provided that the merging algorithm produces, on average, superior retrievals when merging results from every collection than when merging results from only a few. In fact, this may be a logical design principle for results-merging algorithm development. If the results merging algorithm satisfies this criterion, it may be possible to train a resource selection algorithm from full-dataset retrievals in a heterogeneous environment, where results merging is more complex.

Chapter 8

Conclusions

This chapter summarizes the work presented in the thesis, considers its major contributions, and discusses directions for future work.

8.1 Summary

This dissertation takes a machine-learning, feature-integration approach to the two major sub-tasks associated with federated search: resource selection (deciding which resources to issue the query to) and results merging (deciding where to display the results from different resources within the final presentation). With the exception of Chapter 7, we focused on aggregated web search—the prediction and integration of relevant vertical content into the Web search results.

Compared to federated search environments investigated in prior research, aggregated web search is associated with two distinguishing properties: result-type heterogeneity—verticals retrieve very different types of results (e.g., news articles, images, videos, local business listings, weather forecasts)—and retrieval-algorithm heterogeneity—verticals use very different retrieval algorithms (e.g., recency is important for news, geographical proximity is important for local). These two types of heterogeneity violate some of the major assumptions made by state-of-the-art selection and merging algorithms.

8.1.1 Resource Selection

A consistent finding in all our experiments is that feature-integration is essential for effective resource selection. The lack of feature integration is, in fact, one major limitation of existing resource selection methods—most derive evidence exclusively from (sampled) resource content and do not provide an easy way of incorporating other types of evidence. Throughout the thesis, we have demonstrated the importance of corpus features (derived from sampled vertical content), query-log features (derived from previous queries issued directly to the vertical), click-through features (derived from clicks on vertical content), and query features (derived from properties of the query-string, independent of any vertical). In every feature ablation study, we found that no single type of feature is exclusively responsible for performance. A significant advantage of a machine learning approach is its capability of easily integrating multiple types of evidence as input features.

Our feature ablation studies also revealed a second important trend: query features, which are not generated from the vertical, are particularly important. In Chapter 3, for example, we found that the topic of the query (e.g., whether the query is health-related)

was the single most important type of evidence for vertical selection.1 This is an important result because it means that any appropriate solution to vertical selection requires learning a vertical-specific relationship between features and vertical relevance. Different verticals will focus on different topics. Thus, exploiting query category evidence requires learning a vertical-specific relationship between the query category and the relevance of a particular vertical. For example, if the query is health-related, this is positive evidence for health, but negative evidence for movies. Other types of query evidence also require taking the identity of the vertical into consideration. The presence of the query-term "weather" (i.e., query-keyword evidence) is positive evidence for weather, but negative evidence for games. The presence of a location name in the query (i.e., named-entity type evidence) is positive evidence for local, but negative evidence for finance. A high degree of co-occurrence between the query and the term "video" (i.e., query-log evidence) is positive evidence for video, but negative evidence for images. Such vertical-specific predictive relationships can be automatically learned from training data, either by learning a different model for each vertical, as in Chapter 3, or by exploiting feature-interactions which consider the identity of the vertical, as in Chapter 6.

One limitation of a machine learning approach is that it requires training data. Therefore, our goal in Chapter 4 was to investigate ways of making maximal use of already-available training data (collected for a set of existing source verticals) to learn a predictive model for a new target vertical (associated with no training data).

We found that training an effective portable model—one that can make accurate predictions with respect to any vertical, including one with no training data—requires identifying the most portable features—those consistently correlated with relevance (in the same direction) across verticals. The most portable features were found to be those generated from the vertical in question (i.e., from vertical query-traffic or from sampled vertical content). Within the set of most portable features were those corresponding to existing content-based resource selection methods. In retrospect, this result makes sense. These unsupervised resource-scoring methods were designed to use a single metric to make predictions with respect to all resources.

Conversely, we found that learning an effective target-specific model—one that is customized specifically for the target—requires harnessing non-portable, vertical-specific features. The challenge, however, is that harnessing target-specific features requires training data (with respect to the target). We showed that high-confidence predictions from a portable model (with respect to the target) can be used to harness these target-specific features. This result puts existing content-based resource selection methods in a new light—their portability across resources makes them a good source of target-vertical labels, which can be used to bootstrap a more effective model that harnesses target-specific evidence.

To investigate the generality of a machine-learning approach to selection, in Chapter 7 we focused on a more traditional federated search environment, associated with result-

1 This was partly because many of our verticals were topically focused (e.g., autos, games, health, movies, sports, travel). However, we expect this to be the case in many aggregated search environments.

type and retrieval-algorithm homogeneity. There, we found that integrating non-content-based evidence, for example, derived from click-through data, is particularly important when resource representations are poor. That is, when resource sample sets (used by content-based methods such as CORI [14], GAVG [86], and ReDDE [94]) are relatively small for those collections with relevant content. In practice, we may not know when resource representations are poor. Or worse, even if we do, sampling restrictions may prevent us from constructing better ones. We showed, however, that when given access to multiple sources of evidence, a machine-learning approach is capable of shifting its focus between content-based features (derived from sampled content) and non-content-based features (derived from other information) depending on the actual resource representation quality. Compared to content-based methods, a feature-integration machine learning approach is more robust.

Additionally, in Chapter 7, we showed that, at least in a homogeneous environment, a machine-learning approach can be trained without using human-produced training data of any kind. Our zero-judgement training method harvests training data from retrievals that merge content from every available resource. A machine-learning method was trained to select those resources whose merged ranking approximates one that merges content from all. This training approach may be particularly useful in a dynamic environment, where changes in resource content and user interest might reduce the effectiveness of an already-trained model. Assuming access to recent queries (representative of current interests), a fresh new set of training queries can be harvested at zero editorial cost.

8.1.2 Results Presentation

When resources retrieve similar types of results, an appropriate presentation strategy is to interleave results from different resources in an unconstrained fashion. In an aggregated web search environment, however, result-type heterogeneity imposes a number of layout constraints on the final presentation. One constraint, for instance, is that same-vertical results must appear grouped together—vertically (e.g., news) or horizontally (e.g., images)—in the aggregated results. In the absence of layout constraints, existing methods cast the results presentation task as retrieval score normalization: transforming scores from different resources so that they are directly comparable and can be used to induce a merged ranking. These methods are not well-suited for aggregated web search. Even if we had some way of incorporating layout constraints, some verticals, particularly those that behave like a database look-up (e.g., weather, finance, movie times), are not associated with a retrieval score. A major contribution of the dissertation is the formulation of the vertical results presentation task as block-ranking. In this formulation, the goal is to order sequences of Web and vertical results that must appear grouped together in the final presentation.

Based on this formulation of the presentation task, in Chapter 5, we proposed and empirically validated a methodology for evaluating a particular presentation of results (i.e., a particular ranking of Web and vertical result-blocks).
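The next paragraph describes how a reference block-ranking is derived and how alternative block-rankings are scored against it. As a concrete illustration of that scoring step, the sketch below computes a plain Kendall's tau distance (the number of discordant block pairs); the dissertation uses the generalized rank distances of Kumar and Vassilvitskii [64], which additionally support element weights and positional costs, and the block identifiers here are invented for the example.

```python
from itertools import combinations

def kendall_distance(candidate, reference):
    """Count pairwise disagreements between two block-rankings.

    Both arguments are lists of block identifiers (e.g., 'web_1', 'news',
    'images') covering the same set of blocks.
    """
    pos_cand = {block: i for i, block in enumerate(candidate)}
    pos_ref = {block: i for i, block in enumerate(reference)}
    disagreements = 0
    for a, b in combinations(reference, 2):
        # A pair is discordant if the two rankings order it differently.
        if (pos_cand[a] - pos_cand[b]) * (pos_ref[a] - pos_ref[b]) < 0:
            disagreements += 1
    return disagreements

# Example: the candidate demotes the news block by one position.
reference = ["news", "web_1", "images", "web_2"]
candidate = ["web_1", "news", "images", "web_2"]
print(kendall_distance(candidate, reference))  # -> 1
```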

We found that, in most cases, non-expert, inexpensive assessors are capable of making preferential judgements between pairs of blocks and that these pairwise judgements can be used to derive a ground truth or reference presentation. Our evaluation approach, then, is to score alternative presentations (i.e., alternative block-rankings) based on their distance (using an existing rank-based distance metric [64]) to the reference. A user study demonstrated that when assessors (strongly or unanimously) prefer a particular block-ranking over another, the metric also scores the preferred block-ranking as superior.

Finally, in Chapter 6, we investigated machine learning approaches to block-ranking. The methodology from Chapter 5 was used for model learning and evaluation. In other words, models were trained to produce output that approximates the reference block-ranking and were evaluated based on the quality of their approximation on unseen queries. Consistent with the rest of the dissertation, models were trained to predict a block’s rank as a function of a set of features. These experiments show several important results.

The best-performing models were those that allow the algorithm to learn a vertical-specific relationship between features and block relevance. In other words, even if a feature is common to multiple verticals, the best models do not assume that the feature is equally correlated with relevance for different verticals. We presented two general approaches that can exploit a vertical-specific relationship between features and block relevance. One alternative is to learn and combine vertical-specific classifiers in order to produce a final block-ranking. A second alternative is to learn a single model by casting the task as a learning-to-rank (LTR) problem. The challenge with this second approach, however, is that LTR models assume that each feature has a consistent predictive relationship (with rank) across all elements being ranked (in our case, across blocks from different verticals). We show that a vertical-specific predictive relationship can be learned by making “copies” of each feature (one per vertical) so that the model can assign a different weight to different copies, depending on whether the feature has a positive or negative correlation (or no correlation) with a particular vertical’s relevance. The importance of learning a vertical-specific relationship between features and vertical relevance is consistent with the results from Chapter 3, where the query-category was the most predictive source of evidence for vertical selection.

Another major finding in this work is the importance of post-retrieval evidence. During vertical selection, the task is to predict which verticals are relevant using only pre-retrieval evidence (i.e., without issuing the query to the vertical). During vertical results presentation, however, the assumption is that the query has already been issued to those verticals selected. Thus, post-retrieval features can be derived from the vertical search engine results, including those actually presented in the vertical block. Our results show that post-retrieval features contribute significantly to performance. In fact, they were the only features that did not hurt performance for any vertical and that improved performance for several. Moreover, in Chapter 3, news was one of the worst-performing verticals. Given access to post-retrieval features, however, it was one of the best-performing. This result has important implications for end-to-end aggregated web search.

For some verticals, the decision to present or suppress the vertical is better informed using post-retrieval evidence. While this may not be surprising, the magnitude of the improvement is considerable for some verticals (e.g., an 86% improvement for news). While post-retrieval evidence never hurts performance, it does not help every vertical. For example, it had little impact for finance and images. To improve the end-to-end results (and maintain efficiency), system designers might consider identifying those verticals for which post-retrieval features are the most useful and tuning the vertical selection accordingly so that post-retrieval evidence can be generated for those verticals more often. A second alternative is to develop a framework for caching post-retrieval features for future impressions of the query or queries similar to it.
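Returning to the feature-copying idea described above, the sketch below illustrates how a shared feature vector can be expanded into vertical-specific copies so that a linear learning-to-rank model can assign each vertical its own weights. The vertical set, feature layout, and function name are illustrative; Chapter 6 describes the actual feature representation.

```python
import numpy as np

VERTICALS = ["news", "images", "local", "weather"]   # illustrative set

def copy_features(block_features, vertical):
    """Expand a block's feature vector into vertical-specific copies.

    block_features : 1-D numpy array of shared features for one block
    vertical       : which vertical produced the block

    Returns a vector of length len(VERTICALS) * len(block_features) in which
    only the slots belonging to `vertical` are non-zero.  A linear model over
    this representation can therefore assign a different weight to the same
    underlying feature for each vertical.
    """
    d = len(block_features)
    expanded = np.zeros(len(VERTICALS) * d)
    slot = VERTICALS.index(vertical)
    expanded[slot * d:(slot + 1) * d] = block_features
    return expanded

# Example: the same two features occupy different slots for news vs. images.
f = np.array([0.7, 0.2])
print(copy_features(f, "news"))    # non-zero in the first two slots
print(copy_features(f, "images"))  # non-zero in the next two slots
```

Depending on the setup, a shared (vertical-independent) copy of each feature can also be retained so that evidence common to all verticals is not lost.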

8.2 Thesis Contributions

As the field of Information Retrieval has progressed, researchers and commercial search providers have considered a wider range of search tasks. A consistent finding in empirical IR research is that different search tasks require specialized solutions. Different types of media require different representations. Different definitions of relevance require different search algorithms. Different user objectives require different ways of presenting results and different user interactions. As a result, in recent years we have seen an explosion of highly customized search services. In the context of Web search, these include search services for news, images, videos, local businesses, weather forecasts, driving directions, on-line discussions, items for sale, digitized books, scientific publications, and, more recently, even social-media interactions. This gives rise to a new challenge: How do we provide users with integrated access to all these diverse search services within a single search interface? This is the goal of aggregated web search and the main focus of this dissertation.

Aggregated web search can be viewed as a type of federated search, which has been studied extensively for the past two decades. However, as this dissertation shows, existing federated search solutions are not well-suited for an environment where resources retrieve very different types of results (result-type heterogeneity) and use very different retrieval algorithms (retrieval-algorithm heterogeneity). This dissertation contributes new approaches to the two main tasks associated with aggregated web search: vertical selection and vertical results presentation. We show through extensive experimentation that the proposed methods are effective, robust, and generalize across federated search environments. Not only do they outperform existing approaches in a heterogeneous environment, they also outperform them in a more homogeneous federated search environment, for which existing methods were designed.

At the core of our proposed methods is the use of supervised machine learning as a means for (1) integrating different types of features and (2) learning a predictive relationship between features and vertical relevance. We believe that this new approach to federated search benefits the field in several meaningful ways. First, it places a greater emphasis on feature engineering. This is advantageous because it encourages division of labor, which may result in better features and faster system development.

Second, diversifying across features is a way of mitigating risk. Given enough high-quality training data, ineffective features should not completely devastate system performance. This added robustness may encourage the development of new types of features. Finally, while our feature-integration approach differs from existing federated search techniques, which focus on a single type of evidence, it does not ignore them. Two decades' worth of federated search research has produced many theoretically grounded resource selection methods that are highly effective in certain environments. In our approach, these can be used as input features in the same way that BM25 can be used as a feature in machine-learned document ranking. A feature-integration approach generalizes existing approaches.

Throughout the dissertation, we investigate various types of features and sources of training data. We believe this is an important contribution. Prior federated search research focused almost exclusively on content-based evidence. Additionally, we show that training data can be produced not only by expert assessors, but also by non-expert, inexpensive Amazon Mechanical Turk workers. Most prior federated search research simulated a federated search environment by partitioning TREC corpora associated with a judged set of queries (usually judged by expert assessors).

While domain adaptation has been applied to various other learning problems, we present the first investigation of domain adaptation for the task of vertical selection. This investigation revealed that certain features (particularly those generated from the vertical) generalize well across verticals and can therefore be applied to a new vertical with no training data. Furthermore, they can be used to discover highly predictive vertical-specific features. The dissertation contributes algorithms that can re-use training data, which is expensive and time-consuming to produce. Additionally, our methods may help better allocate resources for the production of new training data. For example, after adapting a model to a new vertical, assessors can focus on annotating low-confidence queries. Alternatively, it may be more cost-effective to have assessors identify vertical-specific features, as we found those to be particularly predictive and non-portable.

While most prior work in federated search recognized the interdependence between resource selection and results merging, no work to date has exploited this interdependence to improve end-to-end performance. Our results show that there is much to gain from sharing information between these two sub-tasks. In a homogeneous environment, retrievals that merge content from all available resources can be used to generate training data for resource selection. In aggregated web search, a trained vertical results presentation model can re-evaluate vertical selection decisions in light of post-retrieval evidence (derived from the vertical results). This may be a potential source of training data for vertical selection. In other words, if the vertical results presentation component suppresses a vertical based on post-retrieval evidence, this is evidence that the vertical selection component should not have selected it based on pre-retrieval evidence. Future research should continue investigating ways in which decisions made by one component can be a source of (unsupervised) training data for the other.
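To make the feature-integration contribution described at the beginning of this section concrete, the sketch below assembles a single instance in which the score of an existing content-based method (labeled here as a ReDDE-style score) sits alongside non-content-based and query evidence; the feature names, inputs, and scoring functions are placeholders rather than the dissertation's actual feature set.

```python
def vertical_feature_vector(query, vertical, redde_score, click_rate,
                            category_scores, has_location_entity):
    """Assemble one training/prediction instance for a supervised selector.

    redde_score        : score from an existing content-based method (e.g.,
                         ReDDE) computed over the vertical's sampled documents
    click_rate         : historical click-through rate of this vertical for
                         similar queries (non-content-based evidence)
    category_scores    : dict of query-category confidences from a classifier
    has_location_entity: whether a named-entity tagger found a location
    """
    features = {
        "redde": redde_score,        # unsupervised selection score as a feature
        "ctr": click_rate,
        "query_has_location": float(has_location_entity),
        "query_mentions_vertical": float(vertical in query.lower()),
    }
    # One feature per query category, mirroring the category evidence
    # discussed earlier in this chapter.
    for category, score in category_scores.items():
        features[f"category:{category}"] = score
    return features

example = vertical_feature_vector(
    query="weather in pittsburgh",
    vertical="weather",
    redde_score=0.42,
    click_rate=0.18,
    category_scores={"weather": 0.9, "travel": 0.3},
    has_location_entity=True,
)
```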

Formulating the vertical results presentation task as block-ranking plays two important roles in the dissertation. First, it facilitates evaluation. Our methodology is to derive a reference presentation for the query and to evaluate alternative presentations based on their distance (or similarity) to the reference. Second, it facilitates model learning. Models can be trained to produce output that approximates the reference. Our evaluation methodology has several advantages, which we believe will encourage other researchers to work on aggregated web search. First, it facilitates (off-line) test-collection-based evaluation. Relevance assessments and cached features can be easily distributed among researchers, and alternative approaches can be directly compared. Most prior research in aggregated web search was conducted in a commercial setting, with a live system and millions of users to provide implicit feedback. On-line evaluations (using implicit feedback) are more difficult in an academic setting. Second, our evaluation methodology is inexpensive; therefore, new evaluation testbeds can be developed at a reasonable cost.

Other aggregation tasks have layout constraints similar to those we defined for the vertical results presentation task. Examples include news story aggregation, aggregating content in portals like AOL, and recommending search-assistance tools to users. In all these cases, the system can assume a fixed presentation template (i.e., a fixed set of slots) and the problem can be formulated as deciding which content to surface and where to display it. Thus, our block-ranking methods and our evaluation methodology may encourage new research in these areas, especially in non-commercial environments.

8.3 New Directions

Within the area of aggregated web search, there are two major problems we did not address. First, advancing the use of supervised approaches to selection and presentation requires methods that can maintain performance given changes in vertical content and user interests. Second, one of our assumptions (which is also made by all aggregated web search work to date) is that verticals behave independently. We believe that overall performance can be substantially improved by sharing evidence across vertical results.

The work presented in this dissertation is important because it may facilitate research in other IR applications with similar assumptions and a similar task formulation. We propose work on two areas which have large potential and are currently not well understood.

8.3.1 Aggregated Search with Dynamic Content and User Interests

The aggregated web search environment is dynamic. The environment can change in three ways: a new vertical can be added to the set of candidate verticals, content can be added to (or removed from) a particular vertical, and the interests of the user population may drift, effectively changing the distribution of queries input to the system. In Chapter 4, we focus on the first type of change—the introduction of a new vertical (with no training data) to the set of candidate verticals (with training data).

Further advancing the use of supervised methods for vertical selection and presentation requires approaches that can handle the other two changes in the environment. Prior work confirmed that a change in resource content can degrade the performance of an already-tuned resource selector [93].

In machine learning, the problem of concept drift occurs when the relationship between features and the target class changes over time [84]. One approach, for example, is to maintain a running ensemble of classifiers, which are re-weighted, pruned, and/or augmented with new ensemble members based on new, more recent training examples [62]. Future work might consider selection and presentation approaches that can adapt to concept drift without access to new training examples. Just as Chapter 4 found that some features are portable across verticals (and therefore also predictive for a new vertical with no training data), we may find that some features are portable for the same vertical across time-periods. For example, the presence of the query-term “news” will likely be a portable feature across time-periods for news. A model trained on these time-portable features may be able to produce high-precision/low-recall predictions that can be used to discover features that are specifically predictive for the vertical, but only for the current time-period.
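As an illustration of the ensemble-style adaptation cited above [62], the following sketch demotes ensemble members that err on recent examples, prunes members whose weight has collapsed, and optionally adds a member trained on the newest data. It is a simplified sketch in the spirit of dynamic weighted majority, not the algorithm of [62] verbatim, and all names and constants are illustrative.

```python
def update_ensemble(ensemble, weights, recent_examples,
                    beta=0.5, prune_below=0.01, new_member=None):
    """Re-weight (and optionally prune/refresh) an ensemble under concept drift.

    ensemble        : list of classifiers, each with a .predict(x) method
    weights         : list of floats, one weight per ensemble member
    recent_examples : list of (x, y) pairs from the current time-period
    """
    # Demote members that mispredict recent examples.
    for x, y in recent_examples:
        for i, member in enumerate(ensemble):
            if member.predict(x) != y:
                weights[i] *= beta

    # Normalize so the weights remain comparable across updates.
    total = sum(weights)
    weights = [w / total for w in weights]

    # Prune members whose weight has collapsed, and optionally add a member
    # trained on the most recent data.
    kept = [(m, w) for m, w in zip(ensemble, weights) if w >= prune_below]
    ensemble, weights = ([list(t) for t in zip(*kept)] if kept else ([], []))
    if new_member is not None:
        ensemble.append(new_member)
        weights.append(max(weights, default=1.0))
    return ensemble, weights
```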

8.3.2 Improving the Coherence between Cross-Vertical Results

Throughout the dissertation, we made the assumption that verticals operate fairly independently from each other. In fact, in our formulation of the vertical results presentation task, we explicitly assume that the task is not to predict which results from the vertical to present, but rather to predict whether to present the vertical given its results (and if so, where). There are two issues with having verticals operate independently from each other. First, different verticals may focus on different interpretations of the query, making the aggregated results seem fragmented and incoherent. Second, while the query may be a difficult query for one vertical, it may be an easy query for another. Thus, the results from a confident vertical are a source of evidence for improving the results from a less confident vertical. In this sense, having verticals operate independently is a missed opportunity.

Figure 8.1 shows results from two verticals predicted relevant to the query “joplin” by a commercial search provider. At the time the query was issued, a tornado had recently hit the community of Joplin, Missouri. While the news vertical focuses on this event, the images vertical focuses on a different sense of the query (the singer Janis Joplin). As it turns out, each vertical focused on the query sense that is most prevalent in its collection.2 One might argue, however, that the disproportionate amount of content in the news index about Joplin, Missouri provides evidence of a new query sense that is suddenly important and should be reflected in the images results. A better retrieval might have presented images from both senses.

2At the time, the news vertical had 13,900 hits for “joplin missouri” and 445 hits for “janis joplin”. On the other hand, the images vertical had 1,430,000 hits for “joplin missouri” and 3,050,000 hits for “janis joplin”.

news

images

Figure 8.1: A commercial search engine presents news and images results in response to the query “joplin”.

The idea is that we want to bias results from some verticals (predicted ineffective) to be similar to those from other verticals (predicted effective). This requires addressing two questions. First, how do we detect that a vertical produced an effective retrieval? Several (automatic) retrieval effectiveness predictors have been proposed in prior work, which focus, for example, on properties of the top-ranked documents [25, 27] or on the rank-stability [116, 120]. The second question is, given an effective vertical retrieval, how can we bias other vertical retrievals to be similar to it? One approach may be query expansion—generating expansion terms from some verticals and re-submitting the query to the others. A second approach might be to bias results from different verticals to be close in a hyperlink graph.3
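A minimal sketch of the query-expansion alternative might look as follows: frequent terms are extracted from the confident vertical's result text and appended to the query before it is re-submitted to the other verticals. The stopword list, the term-selection rule, and the absence of term weighting are simplifications, and all names are illustrative.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on", "is"}

def expansion_terms(confident_results, n_terms=5):
    """Pick frequent non-stopword terms from a confident vertical's results.

    confident_results : list of result titles/snippets (strings) from the
                        vertical whose retrieval we trust.
    """
    counts = Counter()
    for text in confident_results:
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOPWORDS:
                counts[term] += 1
    return [term for term, _ in counts.most_common(n_terms)]

def biased_query(original_query, confident_results):
    """Build the expanded query to re-submit to the less confident verticals."""
    return original_query + " " + " ".join(expansion_terms(confident_results))

# Example: news results about the Joplin tornado bias the images query.
news_snippets = ["Tornado devastates Joplin Missouri",
                 "Joplin Missouri tornado relief efforts continue"]
print(biased_query("joplin", news_snippets))
# e.g. "joplin tornado joplin missouri devastates relief"
```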

8.3.3 Aggregated Mobile Search

Currently, commercial mobile search providers support many of the same verticals known to desktop searchers.4 However, aggregated mobile search is associated with unique challenges and opportunities, which are not well understood. The solutions presented in this dissertation may provide a framework for not only developing aggregated mobile search solutions, but also learning more about the problem.

One major opportunity in aggregated mobile search is the availability of rich information about the user’s current context (e.g., their local time, location, and movement) and, possibly, even their current activity (e.g., are they likely at home, at work, in an unfamiliar location, or traveling?). Clues about the user’s activity can be provided by historical contextual data.

3Often, vertical content is harvested from the Web and can therefore be associated with one or several nodes in a Web hyperlink graph.
4For a demonstration, see: http://mobile.yahoo.com/search

IR researchers have long advocated that user context (including time and location) should play a more active role in search [49]. In aggregated mobile search, for example, weather results may be more relevant in the morning, maps results may be more relevant if the user is moving, traffic information may be more relevant during rush hours, and movie times may be more relevant in the evenings.

In addition to contextual information, aggregated mobile search is also unique in the way that users interact with the search results. For example, local results often present the business’s phone number, which the user can click to directly place a call. This is a stronger signal of relevance than a traditional click. Also, the ability to scroll, pinch-and-zoom, and rotate (coupled with the reduced screen size) means that the system always knows precisely where the user is looking. The evidence-integration approaches presented in this dissertation may provide a framework for investigating the role of contextual information and implicit feedback signals in aggregated mobile search. Does contextual information help? Are some feedback signals more predictive than others? Are there confounding effects between the user’s context and the predictiveness of a particular feedback signal?
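As a sketch of how such contextual signals could enter the feature-integration framework, the function below converts a query's time, movement, and location context into features of the kind a vertical-selection model could consume; the particular fields, thresholds, and bucket boundaries are invented for illustration.

```python
from datetime import datetime

def context_features(timestamp, speed_kmh, distance_from_home_km):
    """Derive simple contextual features for aggregated mobile search.

    timestamp             : datetime of the query
    speed_kmh             : estimated movement speed from GPS
    distance_from_home_km : distance from the user's usual location
    """
    hour = timestamp.hour
    return {
        "is_morning": float(6 <= hour < 11),     # e.g., weather may matter more
        "is_evening": float(18 <= hour < 23),    # e.g., movie times
        "is_rush_hour": float(hour in (7, 8, 16, 17, 18)),  # e.g., traffic
        "is_moving": float(speed_kmh > 5),       # e.g., maps / directions
        "is_away_from_home": float(distance_from_home_km > 50),
    }

print(context_features(datetime(2011, 5, 23, 8, 30), speed_kmh=40,
                       distance_from_home_km=2))
```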

8.3.4 Search Tool Recommendation

Just as different verticals support different search goals, different search tools and interventions are designed to assist searchers in different scenarios. For example, if a search is ineffective, the user may benefit from seeing query suggestions or potential query-expansion terms [59]. If a query is too general, a clustering of results may assist the user in narrowing their search [47, 58]. Facet-value pair recommendation can also help users narrow their search when there is evidence that document meta-data can effectively isolate the relevant set [118]. If a search is largely exploratory, the user may benefit from seeing search trails from other users who performed a similar search and identified useful content [73]. During long sessions, note-taking applications can help users keep track of search results [30]. The goal of all these search tools and interactions is to guide users towards more successful searches. The challenge, however, is that they are not always appropriate.

The problem of search tool recommendation can be cast as an aggregated search problem, where the goal is to resolve contention between different interventions rather than different verticals. Several factors affect the relevance of a particular intervention. First, different interventions are appropriate in different user scenarios, as motivated above. Second, the usefulness of a particular intervention is affected by the information available to the system (query suggestions should not be presented without evidence of their effectiveness). Third, different users may prefer different search tools. The feature-integration methods presented in the dissertation may provide a framework for addressing the problem of search-assistance tool recommendation. Potentially useful sources of predictive evidence include pre- and post-retrieval effectiveness predictors [46], user interaction data which can predict the user’s intent [44] or level of frustration [36], session-level evidence [111], and user-preference information.

Bibliography

[1] Tony Abou-Assaleh and Weizheng Gao. Geographic ranking for a engine. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 911–911, New York, NY, USA, 2007. ACM. 1, 1.2

[2] Jaime Arguello, Jamie Callan, and Fernando Diaz. Classification-based resource selection. In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1277–1286, New York, NY, USA, 2009. ACM. 4.3.1, 4.5.4, 1

[3] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-François Crespo. Sources of evidence for vertical selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 315– 322, New York, NY, USA, 2009. ACM. 1, 4.3.1, 4.5.4, 4.7

[4] Jaime Arguello, Fernando Diaz, and Jean-François Paiement. Vertical selection in the presence of unlabeled verticals. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 691–698, New York, NY, USA, 2010. ACM. 1, 4.7

[5] Jaime Arguello, Fernando Diaz, Jamie Callan, and Ben Carterette. A methodology for evaluating aggregated search results. In Advances in Information Retrieval, volume 6611 of Lecture Notes in Computer Science, pages 141–152. Springer Berlin / Heidelberg, 2011. 1

[6] Steven M. Beitzel, Eric C. Jensen, Ophir Frieder, David D. Lewis, Abdur Chowdhury, and Aleksander Kolcz. Improving automatic query classification via semi-supervised learning. In Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM ’05, pages 42–49, Washington, DC, USA, 2005. IEEE Computer Society. 2.3.7

[7] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by probability distributions. Bulletin of the Calcutta Mathematical Society, 35: 99 – 109, 1943. 3.3.2, 7.4.5

[8] Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. An algorithm that learns what is in a name. Machine Learning, 34:211–231, 1999. 6.3.1

[9] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages 120–128, Stroudsburg, PA, USA, 2006. ACL. 4.2

[10] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL ’07, pages 187–205, Stroudsburg, PA, USA, 2007. ACL. 4.2

[11] Leo Breiman and Jerome H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997. 4.2

[12] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the seventh international conference on World Wide Web, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V. 1.2

[13] Andrei Broder, Massimiliano Ciaramita, Marcus Fontoura, Evgeniy Gabrilovich, Vanja Josifovski, Donald Metzler, Vanessa Murdock, and Vassilis Plachouras. To swing or not to swing: learning when (not) to advertise. In Proceeding of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 1003– 1012, New York, NY, USA, 2008. ACM. 3.7

[14] James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’95, pages 21–28, New York, NY, USA, 1995. ACM. 1.2, 2.3.2, 2.3.5, 2.3.8, 2.4, 2.5, 3, 7, 7.3.1, 7.3.1, 8.1.1

[15] Jamie Callan. Distributed information retrieval. In W. B. Croft, editor, Advances in Information Retrieval, pages 127–150. Kluwer Academic Publishers, 2000. 1

[16] Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19:97–130, 2001. 2.2, 2.2.1, 3.3.1, 3.3.1, 7.4.4

[17] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’98, pages 335–336, New York, NY, USA, 1998. ACM. 6.8

[18] Ben Carterette and Paul N. Bennett. Evaluation measures for preference judgments. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 685–686, New York, NY, USA, 2008. ACM. 5.4

[19] Rich Caruana. Multitask learning. Machine Learning, 28:41–75, July 1997. 4.2

[20] James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: A quality-conscious approach. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 340–347, New York, NY, USA, 2006. ACM. 2.2.1

[21] Keke Chen, Rongqing Lu, C. K. Wong, Gordon Sun, Larry Heck, and Belle Tseng. Trada: Tree based ranking function adaptation. In Proceeding of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 1143–1152, New York, NY, USA, 2008. ACM. 4.2, 4.3.3

[22] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. 5.6.1

[23] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968. 5.6.1

[24] Nick Craswell, Peter Bailey, and David Hawking. Server selection on the world wide web. In Proceedings of the fifth ACM conference on Digital libraries, DL ’00, pages 37–46, New York, NY, USA, 2000. ACM. 2.2.1, 3.3.1

[25] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’02, pages 299–306, New York, NY, USA, 2002. ACM. 3.3.1, 3.8, 8.3.2

[26] Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263. ACL, 2007. 4.2

[27] Fernando Diaz. Performance prediction using spatial autocorrelation. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 583–590, New York, NY, USA, 2007. ACM. 8.3.2

[28] Fernando Diaz. Integration of news content into web results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 182–191, New York, NY, USA, 2009. ACM. 1, 1.2, 5, 3.7, 4.3.1, 5, 5.1.2, 5.2, 6.2, 6.3.2, 6.5.3

[29] Fernando Diaz and Jaime Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 323–330, New York, NY, USA, 2009. ACM. 3.7, 4.3.1, 5.1.2

[30] Debora Donato, Francesco Bonchi, Tom Chi, and Yoelle Maarek. Do you want to take notes?: Identifying research missions in yahoo! search pad. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 321–330, New York, NY, USA, 2010. ACM. 8.3.4

[31] Kevin Duh. Ranking vs. regression in machine translation evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT ’08, pages 191–194, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. 6.2

[32] Kevin K. Duh. Learning to Rank with Partially-Labeled Data. PhD thesis, University of Washington, 2009. 6.2

[33] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 109–117, New York, NY, USA, 2004. ACM. 4.2

[34] Christopher T. Fallen and Gregory B. Newby. Distributed web search efficiency by truncating results. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, JCDL ’07, pages 195–203, New York, NY, USA, 2007. ACM. 7.4.1

[35] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, June 2008. 3.2

[36] Henry A. Feild, James Allan, and Rosie Jones. Predicting searcher frustration. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 34–41, New York, NY, USA, 2010. ACM. 8.3.4

[37] J.L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971. 5.6.1

[38] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001. 4.3, 4.5, 6.2, 6.7.4

[39] Wei Gao, Peng Cai, Kam-Fai Wong, and Aoying Zhou. Learning to rank only using training data from related domain. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 162– 169, New York, NY, USA, 2010. ACM. 4.2

[40] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. The effectiveness of GIOSS for the text database discovery problem. In Proceedings of the 1994 ACM SIGMOD international conference on Management of data, SIGMOD ’94, pages 126– 137, New York, NY, USA, 1994. ACM. 2.3.3

[41] Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. Starts: Stanford proposal for internet meta-searching. In Proceedings of the 1997 ACM SIGMOD international conference on Management of data, SIGMOD ’97, pages 207–218, New York, NY, USA, 1997. ACM. 2.1

[42] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. Gloss: Text-source discovery over the internet. ACM Transactions on Database Systems, 24:229–264, 1999. 2.3.3

[43] Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu, and Zhong Su. Domain adaptation with latent semantic association for named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pages 281–289, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. 4.2

[44] Qi Guo and Eugene Agichtein. Ready to buy or just browsing?: Detecting web searcher goals from interaction data. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 130–137, New York, NY, USA, 2010. ACM. 3.7, 8.3.4

[45] Qi Guo, Eugene Agichtein, Charles L. A. Clarke, and Azin Ashkan. In the mood to click? Towards inferring receptiveness to search advertising. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’09, pages 319–324, Washington, DC, USA, 2009. IEEE Computer Society. 3.7

[46] Claudia Hauff, Diane Kelly, and Leif Azzopardi. A comparison of user and system query performance predictions. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 979–988, New York, NY, USA, 2010. ACM. 8.3.4

[47] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’96, pages 76–84, New York, NY, USA, 1996. ACM. 8.3.4

[48] Dzung Hong, Luo Si, Paul Bracke, Michael Witt, and Tim Juchcinski. A joint probabilistic classification model for resource selection. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 98–105, New York, NY, USA, 2010. ACM. 2.3.8, 6.8

[49] Peter Ingwersen and Nick Belkin. Information retrieval in context - IRiX: workshop at SIGIR 2004 - Sheffield. SIGIR Forum, 38:50–52, December 2004. 8.3.3

[50] Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the 28th international conference on Very Large Data Bases, VLDB ’02, pages 394–405. VLDB Endowment, 2002. 1.1, 2.2.4, 2.3.5

[51] Panagiotis G. Ipeirotis and Luis Gravano. When one sample is not enough: Improving text database selection using shrinkage. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 767–778, New York, NY, USA, 2004. ACM. 2.3.5

[52] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems, 20:422–446, 2002. 4.5.3

[53] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186(1007):453–461, September 1946. 6.3.1, 7.4.1

[54] Jing Jiang. Domain Adaptation in Natural Language Processing. PhD thesis, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, 2008. 4.2

[55] Jing Jiang and Chengxiang Zhai. Instance weighting for domain adaptation in nlp. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL ’07, pages 264–271, Stroudsburg, PA, USA, 2007. ACL. 4.2

[56] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, pages 133–142, New York, NY, USA, 2002. ACM. 1.2, 6.2, 6.3.2, 6.4.3

[57] In-Ho Kang and GilChang Kim. Query type classification for web document retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, pages 64–71, New York, NY, USA, 2003. ACM. 2.3.7

[58] Weimao Ke, Cassidy R. Sugimoto, and Javed Mostafa. Dynamicity vs. effectiveness: Studying online clustering for scatter/gather. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 19–26, New York, NY, USA, 2009. ACM. 8.3.4

[59] Diane Kelly, Karl Gyllstrom, and Earl W. Bailey. A comparison of query and term suggestion features for interactive searching. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 371–378, New York, NY, USA, 2009. ACM. 8.3.4

[60] Jinyoung Kim and W. Bruce Croft. Ranking using multiple document types in desktop search. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 50–57, New York, NY, USA, 2010. ACM. 1.3

[61] Soo-Min Kim, Patrick Pantel, Lei Duan, and Scott Gaffney. Improving web page classification by label-propagation over click graphs. In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1077–1086, New York, NY, USA, 2009. ACM. 4.3

[62] J. Zico Kolter and Marcus A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755–2790, December 2007. 8.3.1

[63] Arnd Christian König, Michael Gamon, and Qiang Wu. Click-through prediction for news queries. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 347–354, New York, NY, USA, 2009. ACM. 3.7, 5, 5.1.2, 6.2

[64] Ravi Kumar and Sergei Vassilvitskii. Generalized distances between rankings. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 571–580, New York, NY, USA, 2010. ACM. 5.4.3, 5.7, 6.1, 6.5.4, 8.1.2

[65] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977. 5.6.1, 5.6.2, 6.5.2

[66] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 297–306, New York, NY, USA, 2011. ACM. 5.1.2

[67] Xiao Li, Ye-Yi Wang, and Alex Acero. Learning query intent from regularized click graphs. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 339–346, New York, NY, USA, 2008. ACM. 2.3.7, 3.7, 5, 5.1.2

[68] Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region newton methods for large-scale logistic regression. In Proceedings of the 24th international conference on Machine learning, ICML ’07, pages 561–568, New York, NY, USA, 2007. ACM. 3.2, 6.4.1

[69] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46:191–202, 2002. 4.2

[70] Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3:225–331, 2009. 6.3.2

[71] Yuanhua Lv, Taesup Moon, Pranam Kolari, Zhaohui Zheng, Xuanhui Wang, and Yi Chang. Learning to model relatedness for news recommendation. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 57–66, New York, NY, USA, 2011. ACM. 6.2

[72] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. 6.3.1

[73] Meredith Ringel Morris and Eric Horvitz. Searchtogether: An interface for collaborative web search. In Proceedings of the 20th annual ACM symposium on User interface software and technology, UIST ’07, pages 3–12, New York, NY, USA, 2007. ACM. 8.3.4

[74] Vanessa Murdock and Mounia Lalmas, editors. SIGIR 2008 Workshop on Aggregated Search, New York, NY, USA, 2008. ACM. 1, 3.7

[75] Aditya Pal and Jaya Kawale. Leveraging query associations in federated search. In Proceedings of the SIGIR 2008 Workshop on Aggregated Search, 2008. 3.7

[76] Georgios Paltoglou, Michail Salampasis, and Maria Satratzemi. Integral based source selection for uncooperative distributed information retrieval environments. In Proceeding of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval, LSDS-IR ’08, pages 67–74, New York, NY, USA, 2008. ACM. 2.2.1, 3.3.1

[77] Ashok Kumar Ponnuswami, Kumaresh Pattabiraman, Desmond Brand, and Tapas Kanungo. Model characterization curves for federated search using click-logs: predicting user engagement metrics for the span of feasible operating points. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 67–76, New York, NY, USA, 2011. ACM. 5.1.2, 6.2

[78] Ashok Kumar Ponnuswami, Kumaresh Pattabiraman, Qiang Wu, Ran Gilad-Bachrach, and Tapas Kanungo. On composition of a federated web search result page: Using online users to provide pairwise preference for heterogeneous verticals. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 715–724, New York, NY, USA, 2011. ACM. 5, 5, 5.1.2, 6.2, 6.5.3

[79] M. F. Porter. An algorithm for suffix stripping. Readings in information retrieval, pages 313–316, 1997. 7.4.1

[80] Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. Letor: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13: 346–374, August 2010. ISSN 1386-4564. 1.2, 6.2

[81] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning, pages 784–791, 2008. 6.8

[82] Mark Sanderson, Monica Lestari Paramita, Paul Clough, and Evangelos Kanoulas. Do user preferences and evaluation measures line up? In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 555–562, New York, NY, USA, 2010. ACM. 5.1.2, 5.5.1

[83] Sandeepkumar Satpal and Sunita Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2007, pages 224–235, Berlin, Heidelberg, 2007. Springer-Verlag. 4.2

[84] J.C. Schlimmer and R.H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the 5th National Conference on Artificial Intelligence, AAAI ’86, pages 502–507. AAAI, 1986. 8.3.1

[85] Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single-winner election method. Social Choice and Welfare, 36: 267–303, 2011. 5.4.2, 5.7, 6.4.2

[86] Jangwon Seo and W. Bruce Croft. Blog site search using resource selection. In Proceeding of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 1053–1062, New York, NY, USA, 2008. ACM. 2.3.2, 2.3.8, 4.7, 7.3.1, 7.3.1, 8.1.1

[87] Shanu Sushmita, Hideo Joho, Mounia Lalmas, and Joemon M. Jose. Understanding domain relevance in web search. In WWW Workshop on Web Search Result Summarization and Presentation, 2010. 5.1.1

[88] Dou Shen, Rong Pan, Jian-Tao Sun, Jeffrey Junfeng Pan, Kangheng Wu, Jie Yin, and Qiang Yang. Q2C@UST: Our winning solution to query classification in KDDCUP 2005. ACM SIGKDD Explorations Newsletter, 7:100–110, December 2005. ISSN 1931-0145. 2.3.7, 3.3.1, 3.3.3

[89] Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Building bridges for web query classification. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 131–138, New York, NY, USA, 2006. ACM. 2.3.7, 3.3.1, 3.3.3

[90] Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In Proceedings of the 29th European conference on IR research, ECIR’07, pages 160–172, Berlin, Heidelberg, 2007. Springer-Verlag. 1.1, 1.2, 2.3.1, 2.3.4, 3, 3.4.5, 7

[91] Milad Shokouhi and Justin Zobel. Federated text retrieval from uncooperative overlapped collections. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 495– 502, New York, NY, USA, 2007. ACM. 2.2.1, 3.3.1

[92] Milad Shokouhi, Falk Scholer, and Justin Zobel. Sample sizes for query probing in uncooperative distributed information retrieval. In Proceedings of the 2006 Asia Pacific Web Conference, pages 63–75. Springer, 2006. 2.2.1, 2.2.2, 3.3.1, 7.5

[93] Milad Shokouhi, Mark Baillie, and Leif Azzopardi. Updating collection representations for federated search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 511–518, New York, NY, USA, 2007. ACM. 1.2, 2.2.3, 3.3.1, 8.3.1

[94] Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, pages 298–305, New York, NY, USA, 2003. ACM. 1.1, 1.2, 2.2.1, 2.3.3, 2.3.7, 2.5, 3, 3.3.1, 3.4.5, 3.8, 7, 7.4.5, 8.1.1

[95] Luo Si and Jamie Callan. Unified utility maximization framework for resource selection. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, CIKM ’04, pages 32–41, New York, NY, USA, 2004. ACM. 1.1, 1.2, 2.2.1, 2.3.4, 3, 3.3.1, 7

[96] Luo Si and Jamie Callan. Modeling search engine effectiveness for federated search. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’05, pages 83–90, New York, NY, USA, 2005. ACM. 1.1, 1.2, 2.2.1, 2.3.6, 3, 3.3.1, 7

[97] Luo Si and Jamie Callan. A semisupervised learning method to merge search engine results. ACM Transactions on Information Systems, 21:457–491, 2003. 2.4, 2.5

[98] Luo Si, Rong Jin, Jamie Callan, and Paul Ogilvie. A language modeling framework for resource selection and results merging. In Proceedings of the eleventh international conference on Information and knowledge management, CIKM ’02, pages 391–397, New York, NY, USA, 2002. ACM. 1.2, 2.2, 2.3.2, 2.3.7, 3, 7

[99] Yang Song, Nam Nguyen, Li-wei He, Scott Imig, and Robert Rounthwaite. Searchable web sites recommendation. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 405–414, New York, NY, USA, 2011. ACM. 3.7

[100] Shanu Sushmita, Hideo Joho, and Mounia Lalmas. A task-based evaluation of an aggregated search interface. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval, SPIRE ’09, pages 322–333, Berlin, Heidelberg, 2009. Springer-Verlag. 5.1.1, 5.1.2, 5.7

[101] Shanu Sushmita, Hideo Joho, Mounia Lalmas, and Robert Villa. Factors affecting click-through behavior in aggregated search interfaces. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 519–528, New York, NY, USA, 2010. ACM. 5, 5.1.1, 5.2, 5.7, 6.5.3

[102] Shanu Sushmita, Benjamin Piwowarski, and Mounia Lalmas. Dynamics of genre and domain intents. In Proceedings of the 6th Asia Information Retrieval Societies Conference, AIRS ’10, pages 399–409, Berlin, Heidelberg, 2010. Springer-Verlag. 5.1.1

[103] Paul Thomas and David Hawking. Evaluation by comparing result sets in context. In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM ’06, pages 94–101, New York, NY, USA, 2006. ACM. 5.1.2

[104] Paul Thomas and Milad Shokouhi. SUSHI: Scoring scaled samples for server selection. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 419–426, New York, NY, USA, 2009. ACM. 1.1, 1.2, 2.2.1, 2.3.4, 3, 3.3.1, 3.4.5, 7

[105] Sebastian Thrun and Lorien Pratt, editors. Learning to learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998. 4.2

[106] Ellen M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. Learning collection fusion strategies. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’95, pages 172–179, New York, NY, USA, 1995. ACM. 2.3.6

[107] Stephen Wan, Cecile Paris, and Alexander Krumpholz. From aggravated to aggregated search: Improving utility through coherent organisations of an answer space. In Proceedings of the SIGIR 2008 Workshop on Aggregated Search, 2008. 3.7

[108] Ji-Rong Wen, Jian-Yun Nie, and Hong-Jiang Zhang. Query clustering using content words and user feedback. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 442– 443, New York, NY, USA, 2001. ACM. 2.3.7

[109] Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. Twitterrank: Finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 261–270, New York, NY, USA, 2010. ACM. 1, 1.2

[110] Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37, 1991. 3.3.2

[111] Biao Xiang, Daxin Jiang, Jian Pei, Xiaohui Sun, Enhong Chen, and Hang Li. Context-aware ranking in web search. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 451–458, New York, NY, USA, 2010. ACM. 8.3.4

[112] Jingfang Xu and Xing Li. Learning to rank collections. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 765–766, New York, NY, USA, 2007. ACM. 2.3.8

[113] Jinxi Xu and W. Bruce Croft. Cluster-based language models for distributed retrieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 254–261, New York, NY, USA, 1999. ACM. 1.1, 1.2, 2.3.2, 3

[114] Jun Xu and Hang Li. Adarank: A boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 391–398, New York, NY, USA, 2007. ACM. 6.2

[115] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, ACL ’95, pages 189–196, Stroudsburg, PA, USA, 1995. Association for Computational Linguistics. 4.3.3

[116] Elad Yom-Tov, Shai Fine, David Carmel, and Adam Darlow. Learning to estimate query difficulty: Including applications to missing content detection and distributed information retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’05, pages 512–519, New York, NY, USA, 2005. ACM. 8.3.2

[117] Jian Zhang, Zoubin Ghahramani, and Yiming Yang. Flexible latent variable models for multi-task learning. Machine Learning, 73:221–242, December 2008. 4.2

[118] Lanbo Zhang and Yi Zhang. Interactive retrieval based on faceted feedback. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 363–370, New York, NY, USA, 2010. ACM. 8.3.4

[119] Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 287–294, New York, NY, USA, 2007. ACM. 4.3

[120] Yun Zhou and W. Bruce Croft. Ranking robustness: A novel framework to predict query performance. In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM ’06, pages 567–574, New York, NY, USA, 2006. ACM. 8.3.2

[121] Dongqing Zhu and Ben Carterette. An analysis of assessor behavior in crowdsourced preference judgements. In SIGIR Workshop on Crowdsourcing for Search Evaluation, pages 21–26, New York, NY, USA, 2010. ACM. 5, 5.1.1, 5.2

Appendix A: Vertical Blocks

Figures A.1–A.3 show example vertical blocks associated with the 13 verticals used in Chapters 5 and 6.

(a) blogs

(b) books

(c) community Q&A

(d) finance

Figure A.1: Example vertical blocks.

(e) images

(f) local

(g) maps

(h) news

Figure A.2: Example vertical blocks.

(i) recipes

(j) shopping

(k) twitter

(l) weather

(m) video

Figure A.3: Example vertical blocks.

Appendix B: User Study Queries and Descriptions

Set of queries and descriptions used in the user study presented in Chapter 5.

1. aperture vs photoshop: The user is looking for reviews that compare "Apple Aperture" and "Adobe Photoshop".

2. apple store pittsburgh: The user is looking for an Apple Store in Pittsburgh, PA.

3. bank of america online banking: The user is looking for information on the new features that Bank of America is adding to their online services.

4. belize map: The user is looking for a map of Belize.

5. best buy: The user is looking for financial information on the company Best Buy.

6. best weather to fish in: The user is looking for recommendations on favorable weather conditions for fishing.

7. boston hotels near convention center: The user is looking for hotels in Boston that are near the convention center.

8. brussels belgium weather: The user is looking for current weather information for Brussels, Belgium.

9. calculate a golf handicap: The user is looking for instructions on how to properly calculate a "golf handicap".

10. caribou coffee: The user is looking for financial information on the company Caribou Coffee.

11. cheap hotels in anaheim california: The user is looking for inexpensive hotels in Anaheim, CA.

12. cooking ribs: The user is looking for instructions on how to cook ribs.

13. craigslist suspect dies: The user wants to read about the recent death of the "Craigslist suspect".

14. culinary school charlotte north carolina: The user is looking for culinary schools in Charlotte, NC.

15. destin florida forecast: The user is looking for weather information for Destin, FL.

16. discovery channel hostage crisis: The user is looking for information on the recent hostage incident that occurred at the Discovery Channel’s Headquarters.

17. dominican republic maps: The user is looking for a map of the Dominican Republic.

18. earthquake christchurch new zealand: The user is looking for information on the recent earthquake in Christchurch, New Zealand.

19. eggs recall list: The user would like to find a list of brands, stores, and plants affected by the recent egg salmonella outbreak.

20. facebook places: The user would like to learn more about Facebook’s new application: "Facebook Places".

21. floyd mayweather racist rant: The user is looking for a video clip of boxer Floyd Mayweather’s controversial rant against opponent Manny Pacquiao.

22. fodors europe travel guide: The user would like to purchase a Fodor’s travel guide for Europe.

23. giant goldfish france: The user is looking for information on the unusually large "goldfish" recently captured in France.

24. help people who have cancer and need funding: The user is looking for charitable ways of giving financial support to cancer victims.

25. home depot: The user would like to find a Home Depot store in Pittsburgh, PA.

26. hotel captain cook anchorage ak: The user is looking for the "Captain Cook" hotel in Anchorage, Alaska.

27. hotels in biloxi ms: The user is looking for hotel information in Biloxi, Mississippi.

28. how to remove glue from fabric: The user is looking for tips on how to remove glue from fabric.

29. hurricane earl path: The user is looking for forecast information on Hurricane Earl.

30. ihop nutritional facts: The user is looking for nutritional information for menu items in the restaurant IHOP.

31. instrument stores houston tx: The user is looking for stores that sell musical instruments in Houston, TX.

32. intel: The user would like to learn more about Intel’s plan to acquire McAfee.

33. james lee discovery channel: The user is looking for information on "James Lee", responsible for the hostage incident at the Discovery Channel headquarters.

34. kitchen photos: The user is looking for images of kitchens.

35. learn to play the banjo: The user is looking for general information about learning to play the banjo.

36. local weather petoskey mi: The user is looking for weather information for Petoskey, MI.

37. machete race war: The user is looking for information on the controversial movie "Machete".

38. map of london england: The user is looking for a map of London, England.

39. map of wyoming: The user is looking for a map of Wyoming.

40. marbella spain weather: The user is looking for weather information for Marbella, Spain.

41. energy: The user is looking for information on the recent explosion on the "Mariner Energy" oil rig.

42. marvin sapp wife dies: The user is looking for information on the recent death of the wife of singer Marvin Sapp.

43. miami above ground pool stores: The user is looking for stores that sell "above ground" swimming pools in Miami, FL.

44. michael jordan baseball stats: The user is looking for statistics on Michael Jordan’s brief baseball career.

45. miss universe 2010: The user would like to find information about the "miss uni- verse 2010" beauty contest.

46. new ipod nano: The user is looking for reviews and pricing information for Apple’s new iPod Nano.

47. nikon cool pix: The user plans to purchase a Nikon Cool Pix camera and is looking for product and pricing information.

48. nissan dealerships in albuquerque: The user is looking for Nissan dealerships in Albuquerque, New Mexico.

49. no time to die book: The user is considering purchasing the book "No Time to Die".

50. organic gardening pest: The user is looking for information on pest-control in organic farming.

51. photography books: The user is looking for books on photography.

52. pictures of bread: The user is looking for images of bread.

53. piranha 3d preview: The user would like to view a trailer for the movie: "Piranha 3D".

54. politics in the philippines: The user is looking for information on the historical and current political situation in the Philippines.

55. pot roast cooking time in oven: The user is looking for instructions on how to cook pot roast.

56. pressure cooker: The user would like to purchase a pressure cooker.

57. robert de niro photos: The user is looking for images of the actor Robert De Niro.

58. san bruno fire: The user is looking for information on the recent fire in San Bruno, CA.

59. southern home cooking: The user is looking for recipes of southern cuisine.

60. sports bars downtown chicago: The user is looking for a sports bar in downtown Chicago, IL.

61. tampa thai restraurants: The user is looking for Thai restaurants in Tampa, FL.

62. target super stores: The user is looking for Target stores in Pittsburgh, PA.

63. texas tax free weekend: The user would like to learn more about Texas’ tax-free weekend sale.

64. tilapia fish recipes: The user is looking for recipes for Tilapia.

65. tom brady car accident: The user is looking for information on quarterback Tom Brady’s recent car accident.

66. toyota auto parts: The user is looking for places to purchase Toyota auto parts.

67. true blood rolling stone cover: The user wants to know the public’s general reaction to the new cover of "Rolling Stone" magazine, featuring the cast of TV show "True Blood".

68. ups plane crash: The user is looking for information on the recent UPS plane crash in Dubai.

69. us open fight: The user is looking for information on the recent fight that broke out between spectators at the US Open.

70. vera zvonareva: The user is looking for information on tennis player Vera Zvonareva’s recent performance at the US Open.

71. watch true blood season 3 episode 10: The user would like to see a preview of the 10th episode of the 3rd season of the TV show True Blood.

72. world trade center: The user is looking for information on the latest construction plans for the future World Trade Center Memorial Site.
