LOOKING FOR A HAYSTACK

SELECTING DATA SOURCES IN A DISTRIBUTED RETRIEVAL SYSTEM

Ryan Scherle

Submitted to the faculty of the Graduate School

in partial fulfillment of the requirements

for the degree

Doctor of Philosophy

in Computer Science and Cognitive Science

Indiana University

November 2006

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements of the degree of Doctor of Philosophy.

Doctoral Committee:

David B. Leake, Ph.D. (Principal Advisor, Computer Science)

Michael Gasser, Ph.D. (Principal Advisor, Cognitive Science)

Gregory J. E. Rawlins, Ph.D.

Javed Mostafa, Ph.D.

October 5, 2006

Copyright © 2006

Ryan Scherle

ALL RIGHTS RESERVED

Acknowledgements

When I was growing up, many of my friends owned the Atari 2600, an early home video game system. I wanted one, too. But my parents thought I should “learn something” instead of just playing games, so they bought me a TRS-80 Color Computer instead. That single purchase changed the course of my life, and I have never been far from a computer since. My parents have always provided this sort of gentle guidance, pointing me in useful directions while supporting the decisions I made, even when the decisions took me in a different direction than they had originally intended. For this, I will always be grateful.

While attending high school, I came across James Gleick’s book Chaos: Making a New Science.

This was my first exposure to the world of scientific research. Del Steinhart and Richard Abell, two of my science teachers, indulged me as I spent an inordinate amount of time reconstructing the experiments described in this book.

In college, my interest in computers remained, but I also had a keen interest in electrical engineering. Fortunately, Frank Young convinced me to major in computer science, which allowed me to work on much more interesting problems. Cary Laxer and Claude Anderson provided a wealth of knowledge, advice, and friendship. David Mutchler was instrumental in my decision to pursue research, and provided critical advice along the way.

My primary advisor, David B. Leake, is amazing. He has an uncanny ability to provide a critical idea or reference just when it is needed. His wordsmithing abilities are incredible. All of my writings, including this dissertation, are greatly improved due to his involvement. He also displayed an immense amount of patience while the dissertation process took much longer than expected.

The rest of my dissertation committee, Michael Gasser, Gregory J. E. Rawlins, and Javed Mostafa, also provided critical insights along the way.

Members of the Indiana University Cranium lab provided camaraderie and stimulating conversation. These include: Douglas Eck, Andrew Kinley, Adam Leary, Kyle Wagner, David C. Wilson, Ana Maguitman, Thomas Reichherzer, and Jocelyn Bauer. Travis Bauer deserves to be singled out. In addition to being a source of stimulating ideas, Travis was an outstanding model of time management, proving that it is possible to balance a growing family and a serious research career.

Josh Goldberg was instrumental in determining the best way to vary databases with respect to a particular topic, which opened the door to the bulk of the experiments in this dissertation.

I would like to thank Northwestern University and the Intelligent Information Laboratory for providing a place to work and intellectual support during my brief sojourn in Chicago. Many thanks go to Kris Hammond, Jay Budzik, Shannon Bradshaw, Dave Franklin, and Josh Flachsbart.

The support staff of the Indiana University Computer Science department is wonderful. I’m certain that each of them helped me at some time during the process, but Rob Henderson, Laura Reed, and Sherry Kay deserve special thanks.

I worked in the Indiana University Digital Library Program during much of the dissertation process. Switching between my dissertation research and my work in this department each day was like stepping into a parallel universe. Libraries contain different types of data than Web pages, and this data often adds a new level of complexity to the search process. Working in this environment gave me a wonderful new perspective on the process of search. I am deeply indebted to all of the people there, including: Jon W. Dunn, Mark Notess, Jim Halliday, Donald Byrd, Jenn Riley, John Walsh, Tamara Lopez, David Jiao, Randall Floyd, and Michelle Dalmau.

Finally, eternal gratefulness goes to my wife, Amanda, and my children, Kedric, Graham, and Adrian. They waited patiently through hundreds of hours of writing, always greeting me with happy smiles when I returned to the real world. They have been the driving force that kept me going through the darkest times, and this dissertation would not exist without their support.

Abstract

The Internet contains billions of documents and thousands of systems for searching over these documents. Searching for a useful document can be as difficult as the proverbial search for a needle in a haystack. Each search system provides access to a different collection of documents. Collections may be large or small, focused or comprehensive. Focused collections may be centered on any possible topic, and comprehensive collections typically have particular topical areas with higher concentrations of documents. Some of these collections overlap, but many documents are available from only a single collection. To find the most needles, one must first select the best haystacks.

This dissertation develops a framework for automatic selection of search engines. In this framework, the collection underlying each search engine is examined to determine how properties such as central topic, size, and degree of focus affect retrieval performance. When measured with appropriate techniques, these properties may be used to predict performance. A new distributed retrieval algorithm that takes advantage of this knowledge is presented and compared to existing retrieval algorithms.

Contents

Acknowledgements
Abstract

1 Introduction
   1.1 The Need for Resource Selection
   1.2 Project Outline
   1.3 Contributions of This Work
   1.4 Structure of the Dissertation

2 Information Retrieval
   2.1 Centralized Information Retrieval
      Evaluating Information Retrieval Systems
   2.2 Distributed Information Retrieval
   2.3 Distributed Search on the Web
      Meta Search
   2.4 Intelligent Information Systems
   2.5 Cognitive Perspectives on Information Retrieval
   2.6 Information Retrieval in This Dissertation

3 Choosing Collections Using Topic Similarity
   3.1 Previous Work on Topic Similarity
   3.2 Creating a Collection
      The WT2g Collection
      The Search Engine
      Creating a Collection From WT2g
   3.3 Evaluating Collection Performance
      Lucene's Overall Performance
      Comparing the Topical Collection to the Full Collection
   3.4 Varying the Topical Focus of a Collection
      Creating Collections With Varied Focus
      Initial Results
   3.5 Calculating Similarity Between Queries and Collections
      The Nature of IDF
      Similarity Estimation
   3.6 The Robustness of Similarity Scores
   3.7 Varying Topic Difficulty
      Selecting Target Topics
      Results of Changing Topic Difficulty
   3.8 Conclusions

4 The Effects of Collection Size
   4.1 Previous Work on Collection Size
   4.2 The Effect of Size on Essential Properties
   4.3 Performance of Smaller Collections
   4.4 Performance of Larger Collections
   4.5 Trends as a Factor of Size

5 The Effects of Collection Focus
   5.1 Previous Work on Collection Focus
   5.2 Measuring Collection Focus
      Entropy Measures
      Similarity-Based Measures
   5.3 Collection Focus as a Predictor of Performance
   5.4 Combining the Topic, Size, and Focus Measures

6 Improving Topic Match With Context
   6.1 Background
   6.2 Varying the Amount of Information Available in a Query
   6.3 The Effect of Context on Focused Collections
      Using Context to Select Collections
      Using Context to Query Collections
   6.4 Discussion

7 Putting It All Together
   7.1 Communicating With a Search Engine
   7.2 Building a Collection Representation
      Estimating Topic
      Estimating Size
      Estimating Focus
   7.3 Result Merging
   7.4 Testing the Combined System
      The DIR Algorithm
      Results

8 Conclusions and Future Work
   8.1 Summary of Results
   8.2 Future Directions

A Appendix: Topics and Queries
B Appendix: Results of Topic and Size Testing
C Appendix: Results of Focus Testing
D Appendix: Results of Context Testing

List of Tables

3.1 Comparison of one-dimensional precision measures.
3.2 Summary of the three target topics.
3.3 Summary of P@20 values for the topical and centralized collections.
4.1 Similarity and performance scores for the large t437 collections.
5.1 Entropy-based scores.
5.2 Similarity-based scores.
5.3 Results of multiple linear regression on candidate features.

List of Figures

3.1 Topic 401
3.2 Documents relevant to topic 401
3.3 Lucene's average performance on the TREC-8 Web topics.
3.4 A comparison of precision-recall curves using a search engine based on topic 401 and a search engine based on the full WT2g set.
3.5 A comparison of P@ curves using a search engine based on topic 401 and a search engine based on the full WT2g set.
3.6 Comparison of P@20 and Average Precision scores for the collections based on topic 401.
3.7 A comparison of the three candidate similarity measures.
3.8 Comparison of the similarity scores to measured collection performance.
3.9 Effect of changing similarity on topic 401.
3.10 Lucene's precision-recall curve for each of the test topics using the centralized collection.
3.11 Lucene's "P@" performance on the test topics using the centralized collection.
3.12 Effect of changing similarity on topic 441 (easy).
3.13 Effect of changing similarity on topic 437 (difficult).
3.14 Summary of changing similarity.
4.1 Performance of small databases on the medium-difficulty topic.
4.2 Performance of small databases on the easy topic.
4.3 Performance of small databases on the difficult topic.
4.4 Performance of small databases on the three test topics.
4.5 Performance of large databases on the three test topics.
4.6 Summary of database performance on the medium-difficulty topic.
4.7 Summary of database performance on the easy topic.
4.8 Summary of database performance on the difficult topic.
4.9 Summary of the effects of changing size on database performance.
4.10 Number of relevant documents in collections based on topic 401.
5.1 Comparison of the collection focus measures.
5.2 The relationship between collection focus and collection performance.
5.3 The relationship between focus and performance for the small collections.
5.4 The relationship between focus and performance for the typical-sized collections.
5.5 The relationship between focus and performance for the large collections.
6.1 The effect of varying query context over the full set of Web Track topics.
6.2 Topic 401
6.3 The effect of varying context on similarity, for the medium-difficulty topic.
6.4 The effect of varying context on similarity, for the easy topic.
6.5 The effect of varying context on similarity, for the difficult topic.
6.6 The effect of varying context on query performance, for the medium-difficulty topic.
6.7 The effect of varying context on query performance, for the easy topic.
6.8 The effect of varying context on query performance, for the difficult topic.
7.1 Convergence of query-based sampling.
7.2 Effect of sampling on topical similarity scores.
7.3 Effect of sampling on term-entropy2 scores.
7.4 Comparison of the source selector vs. the centralized collections over all 50 topics.

1 Introduction

“Only librarians like to search; everyone else likes to find.” – Roy Tennant

The Internet contains billions of documents and thousands of systems for searching over these documents. Searching for a given document can be as difficult as the proverbial search for a needle in a haystack. Each search system provides access to a different collection of documents. Collections may be large or small, focused or comprehensive. Focused collections may be centered on any possible topic, and comprehensive collections typically have particular topical areas with higher concentrations of documents. Some of these collections overlap, but many documents are available from only a single collection. Thus, to find the most (and best) needles, one must first select the best haystacks. The process of retrieving documents from many search engines is called distributed information retrieval. Fortunately, the problem of distributed retrieval has some features that make it easier to solve than a search through actual stacks of hay. Unlike haystacks, search engines provide some indication of their contents. In many cases, the search engine’s home page describes the type of content. In other cases, the content may be determined by performing some test queries and browsing the results.


The primary question we will consider is: Given an information need and a set of search engines, is it possible to predict which search engines will return the best sets of documents? To answer this question, we will focus on two more specific questions: Under what circumstances will topic-specific search engines outperform general-purpose search engines, and what information is necessary to determine whether those conditions have been met?

1.1 The Need for Resource Selection

Another way to look at the search problem is to compare it with the way a search is performed in a library. If a patron in a library has a problem, he can ask a librarian for help. The librarian may point the patron to various catalogs, books, Web sites, or other reference literature. Based on discussions with the patron, the librarian can adjust the search to include other types of material, keep searching after the patron has left, and notify the patron if a particularly relevant resource becomes available. In the patron’s home, using the Web alone, this kind of detailed interaction is not possible. The user must determine his own information need, state it in appropriate language, select all information resources he will use, query them individually, store results of searches, and keep track of the performance of searches. When most people search for information on the Web, their first reaction is to send a query to their favorite general-purpose search engine. They do not analyze whether a more specific search engine might be appropriate.

Typical queries on the Web are short, often one or two words (Lesk 1997; Nichols 1997). These short queries are often ambiguous, resulting in poor results from general-purpose search engines.

For example, consider the owner of a moving company interested in buying heavy-duty dollies to move furniture. As a first query, he may simply search for “dolly.” In response to this query, the first page of a recent search on the popular search engine Google1 provided pages for singer Dolly Parton, the musical “Hello Dolly,” “Dolly” the cloned sheep, and links to various pages about children’s dolls.

A common approach to this problem is to add information to the query, making the topic of the query more specific and enabling general search engines to gather better information. Some search engines support this process by enabling users to add terms to narrow subsequent queries. Other approaches allow users to select word senses for ambiguous terms (Cheng & Wilensky 1997), automatically add context-based terms to queries (Budzik & Hammond 2000), or exploit user profiles to anticipate user preferences (Billsus & Pazzani 1999; Chen & Sycara 1998). Unfortunately, even with this extra information for disambiguation, the usefulness of results from general-purpose information sources may be limited by their need to perform well over a broad set of topics. If the topic being searched is not popular, or has a name sufficiently similar to a popular topic, it will be difficult to find. In Google, for example, a page that matches a user’s needs perfectly may be buried in the result listing simply because there are not enough other pages pointing to it. For queries including “dolly”, for example, Google is biased towards results related to Dolly Parton, because of the popularity of Dolly Parton pages. If a system can determine that the context of a query is moving furniture, it can instead direct the query to the search engine at Handtrucks.com2, which provides access to numerous types of moving dollies.

Searches that depend on time-sensitive information can also benefit from automatic search engine selection. A sports fan wanting to know the outcome of a baseball game may simply provide a query containing the names of two teams, perhaps “red sox yankees”. When sent to a general-purpose search engine, this query produces pages describing the virtues of the two teams, pages detailing the history of their rivalry, and pages that provide general baseball information. But general-purpose search engines have difficulty returning the score of a recent game. For that task, it would be more appropriate to route the query to a sports-oriented search engine3 or a search engine that is frequently updated with current news stories4.

1 http://www.google.com
2 http://www.handtrucks.com
3 e.g., http://www.espn.com
4 e.g., http://news.google.com

The most obvious deficiency of large, centralized, general-purpose search engines is the fact that no one engine can contain all of the data in the world. Even though current search engines can crawl a sizable fraction of the Web, there is still data that is inaccessible. Many of the best documents are available from the “hidden Web”. Bergman (2000) estimated that the size of the hidden Web was 500 times larger than the “surface Web” accessible by centralized search systems.

Data on the hidden Web can be inaccessible for a variety of reasons, including:

1. The owner of the data wants to mediate access through their site, and uses the “robot exclusion protocol” (i.e., a robots.txt file) to declare that some documents should not be indexed. Most large centralized search engines follow the “robot exclusion protocol”, and refrain from following links to content when the authors of the content have declared that it should not be indexed. For example, Ipeirotis and Gravano (2002) show that Google does not index documents in the database at the CANCERLIT5 site. Users searching for detailed medical information regarding cancer would get much better results by using the CANCERLIT site’s internal search engine than a more general-purpose engine like Google.

2. The owner of the data wants only paying customers to access the data. Access to the data may be restricted by a username and password, or by a set of allowed IP addresses. Even though a centralized search engine could pay for access to this type of data, it would be counterproductive to include the data in the centralized index, because most users of the search engine would not be allowed to access the data. This type of restriction has often been used for electronic journal articles and newspaper archives, but recently many of these organizations have begun making article summaries available for free.

3. The data is dynamically generated or based on a database, and cannot be crawled easily. Results of searches in library catalogs or real estate sites are examples of this.

4. The data is publicly available, but changes so frequently that the copy in a search engine is always outdated. This often happens for sites that report the news.

5 http://www.cancer.gov

When content is not hidden, and is accessible via a centralized search system, it may still be difficult to retrieve. Even when a user has overcome the problems of off-topic results described above, the resultant documents may be at the wrong level of specificity. For many users, the documents returned by centralized search systems are at the appropriate level, but for novice users the broad summary documents returned by an encyclopedia-like site may be more appropriate. For users who are experts in a particular topic, highly-detailed journal articles may be more useful.

There are thousands of specialized search engines on the Web, each indexing a carefully-crafted set of documents relevant to a particular topic or audience. Many of these are listed in the weekly Internet Scout Report6. Unfortunately, the availability of specialized sources is not a panacea: to use them, the user must know which sources are available and where to locate them. Instead, users tend to focus entirely on general-purpose search engines, ignoring content that is available from more specific sites. Graham and Metaxas (2003), when researching the trust students have in Internet-based information, noted that

    If the survey question did not mention a particular Web site, almost all students immediately turned to a search engine. Many remained faithful to one search engine throughout the survey, even if it did not immediately provide the answer sought.

6 http://scout.wisc.edu/

Similar results were found in a study of the mental models of novice Web searchers (Brandt & Uden 2003). It is unreasonable to expect a human searcher to remember the existence, contents, and addresses of thousands of topically-focused search systems. Human memories are not well suited to this type of information. Information retrieval systems, on the other hand, were designed to track information on thousands of documents. It is relatively simple to transform basic information retrieval algorithms to work at the level of collections rather than with individual documents. Information retrieval systems therefore are a form of “intelligence augmentation” (Engelbart 1962), extending the capabilities of the human mind.

If an automatic system can select the most appropriate set of search engines, the focus provided by these topically-focused search engines may provide a better set of result documents than using a single large, general-purpose search engine. Given sufficient bandwidth, it would be possible to skip the selection process, retrieve documents from all available search engines, and then filter the results according to user needs. However, that approach would destroy the benefits of the pre-focused document sets. It would require sophisticated filtering to substitute for the careful pre-selection already done in the creation of specialized search engines. Taken to an extreme, this filtering approach would simply provide another general-purpose search engine.

Automatically selecting sources requires considerable knowledge, and has been identified as an important problem in information retrieval (Baeza-Yates & Ribeiro-Neto 1999; Tenenbaum 2005). Even within a library, it can be difficult to find the correct sources. Libraries have a growing problem with managing multiple information providers. Library patrons should not need to remember which provider has which type of information (and which subscriptions each library holds). The library would like to integrate all of these providers into a single access point. The recent Open Archives Initiative (Lagoze & Van de Sompel 2001) is an attempt to make search possible across disparate collections, but uptake of the OAI standard is slow (Lagoze et al. 2006).

1.2 Project Outline

In a centralized retrieval system, the search engine simply tries to do the best it can with whatever query is presented. In a distributed system, however, a query can be analyzed to determine how well a source will be able to answer it. There are many factors that can affect whether a search system will be able to provide adequate results for a particular information need. The general approach taken throughout the dissertation is to create collections that vary on the property in question, determine methods for measuring the property, and then measure how variations in the property affect retrieval performance. Properties to be analyzed include:

• Topic match between the query and collection: Intuitively, a query should do well with a source that focuses on the query topic.

• Absolute topic of the query: Some topics are “easy”, either because they deal with widely available information, or because they deal with highly specific, unambiguous terms.

• Size of collection: In the early days of Web search, general-purpose search engines were constantly competing for the title of “largest collection”. Increasing collection size allows for coverage of more topics, but at the possible expense of accuracy. Topically focused collections will typically be a small fraction of the size of general-purpose collections.

• Collection focus: Sources that are highly focused on a particular topic should be able to perform well for queries closely associated with that topic. Queries that are slightly off-topic may perform better with a source that is less focused, even if the topic does not match as closely.

Many of these factors are related. For example, large collections tend to have less focus. Collections that are highly focused may perform better for “difficult” topics, but be useless for “easy” topics. These factors will be examined individually, as well as in combinations.

1.3 Contributions of This Work

This research makes four main contributions. First, it demonstrates that distributed information retrieval can work with realistic collections composed of actual Web pages, rather than the more artificial collections used in the past.

Second, this work develops a methodology for evaluating how changes in a collection affect retrieval performance. Methods for generating collections that vary systematically on each property in question are examined. Measures that best reflect retrieval performance in a Web-based system are selected. Methods for interpreting the results are discussed.

Third, this research establishes how the basic properties of collections affect retrieval performance. When a collection is small and topically focused, high performance will be obtained only if there is a high degree of similarity between the central focus of the collection and the user’s need. As collections become larger and/or less focused, there is a greater chance that they will perform well. If a search engine is known to be very large or contain a wide array of documents, there is little point in measuring its “central” topic. However, when a collection is smaller or more focused, the collection’s central topic is much more important than the collection’s size or the amount of focus in the collection.

Fourth, this work examines the effect of contextual information on retrieval from both centralized and distributed collections. Adding a small amount of contextual information to the query can aid retrieval of documents from any collection, but contextual information is unlikely to aid in the selection of collections in a distributed system.

To illustrate these contributions in a practical setting, an algorithm for a complete distributed retrieval system is presented, combining existing techniques with the insights gained from this work. In this algorithm, search engines are automatically sampled to obtain a representative set of documents. The sampled documents are analyzed to estimate the essential properties of the full collection. When a new query comes into the system, it is compared with each search engine’s representation. The query is forwarded to all search engines that are predicted to return good result sets, and results from all engines are merged before display to the user.

1.4 Structure of the Dissertation

Chapter 2 describes the information retrieval process in more detail, for both centralized and distributed systems. Then Chapter 3 looks at how the central topic of a collection affects query performance. Chapters 4 and 5 examine how the size and focus of a collection affect query performance. In Chapter 6, the effects of automatic query expansion are examined. Chapter 7 describes how the results of the preceding chapters can be used to build a distributed search system and examines how well such a system will perform. Chapter 8 summarizes the results of this work and provides a roadmap for future work.

2 Information Retrieval

Information retrieval is the process of taking a user’s query and returning a set of documents that are expected to be relevant to the query. Before we get into a detailed description of information retrieval, though, it is useful to establish a common set of definitions for the terms that will be used throughout this dissertation.

Depending on the community one comes from, the terminology for the type of work presented in this dissertation may differ. “Distributed information retrieval” in the information retrieval community may be called “federated search” in the digital library community, although the problems being described are nearly identical. I will use the term “distributed information retrieval” to better separate it from the “centralized information retrieval” common in most search engines on the Web. I will also maintain the following definitions:

Document: A single Web page, referenced by a URL

Collection: A set of documents

Database: A set of documents


Query: A statement of information need as expressed by a user or automated searching system, represented as a set of keywords

Search algorithm: A process for ranking documents in a collection with respect to a query

Search engine: The combination of a particular collection and search algorithm

Within distributed information retrieval, the problem of selecting an appropriate search engine for a query has been given many names. “Database selection” and “collection selection” have been used commonly.

These terms are useful for describing systems in which all individual engines use the same search algorithm, but on the Web, there is a wide variety of search algorithms in place. Since all access to topic-specific collections must be through their search systems, the database and search algorithm must be treated as an inseparable unit, and thus I will use the term “search engine selection”, where a “search engine” means not only the collection, but also the search algorithm and Web interface that give access to the collection. Within this dissertation, the search algorithm will be held constant, so we can focus on the effects of different collections. Therefore, the terms “collection” and “search engine” may be used interchangeably for convenience.

2.1 Centralized Information Retrieval

In a centralized information retrieval system, a set of documents is processed to create an index. Processing steps typically include parsing each document to break it into a set of terms, performing a calculation to assign weights to the terms, and storing the terms in an index that can be efficiently searched. The details of weighting terms and retrieving documents from the index depend on the search model being used. The boolean and probabilistic search models have been used heavily, but by far the most popular is the vector space model created by Salton (1975). As described by a recent information retrieval textbook (Baeza-Yates & Ribeiro-Neto 1999):

    A large variety of alternative ranking methods have been compared to the vector model but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast.

Many search engines use a processing engine based on simple vector space techniques, with only minor configuration changes to meet local needs. Typical modifications to the basic retrieval process include:

Stopword removal: Words that are common in English (or common within a particular domain) but typically contain no informational value are excluded, reducing the chances of documents falsely matching based on these words alone. Examples of common stopwords are “a”, “the”, “begin”, and “whether”.

Term stemming: Word suffixes are removed so that different versions of the same word (e.g., “author” and “authored”) are treated as the same term.

Phrase detection: The documents may be parsed to identify noun phrases and proper names. These phrases may be treated as single terms during retrieval to boost precision.
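To make this indexing and retrieval pipeline concrete, the sketch below is a simplified Python illustration (not the Lucene-based engine used later in this dissertation) of vector-space retrieval with stopword removal, a crude suffix-stripping stemmer, TF-IDF term weighting, and cosine ranking. The tiny stopword list and the stemmer are illustrative assumptions standing in for a real stopword file and a stemmer such as Porter’s.

```python
import math
from collections import Counter

# Illustrative assumptions: a tiny stopword list and a crude suffix stripper.
STOPWORDS = {"a", "the", "and", "of", "to", "in", "begin", "whether"}

def stem(term):
    # Strip a few common suffixes so "author" and "authored" share one term.
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def tokenize(text):
    terms = [t.strip(".,!?\"'()").lower() for t in text.split()]
    return [stem(t) for t in terms if t and t not in STOPWORDS]

def build_index(docs):
    """Index documents (id -> text) as TF-IDF weighted term vectors."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    n = len(docs)
    idf = {t: math.log(n / d) for t, d in df.items()}
    vectors = {doc_id: {t: tf * idf[t] for t, tf in c.items()}
               for doc_id, c in counts.items()}
    return vectors, idf

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values())) *
            math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

def search(query, vectors, idf):
    """Rank documents by cosine similarity between query and document vectors."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, v), doc_id) for doc_id, v in vectors.items()),
                  reverse=True)
```

Because the same tokenizer is applied to both documents and queries, a query such as “authored books” matches documents that mention “author” or “book”, since both are reduced to the same stems at indexing time.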

Two more substantial modifications to the vector space retrieval process have been used in limited settings. Latent Semantic Indexing (Berry, Dumais, & Letsche 1995) is a process of analyzing a document set to identify clusters of co-occurring terms, so documents may be retrieved even if they use language that differs slightly from the language used in the query. The major drawback of Latent Semantic Indexing is that it cannot handle incremental updates. When a new document is added to the collection, scores for all terms in the collection must be recomputed, which is a time-consuming process. For this reason, Latent Semantic Indexing is only used in a few applications where the document set changes very infrequently.
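The core idea, projecting documents into a low-dimensional space of co-occurring terms, can be sketched with a truncated singular value decomposition. This is only a rough illustration of the approach, not the specific method of Berry, Dumais, and Letsche; the raw count matrix and the choice of k are assumptions for illustration.

```python
import numpy as np

def lsi_embed(term_doc_matrix, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc_matrix: a terms-by-documents array of (weighted) term counts.
    Returns a k-by-documents array of latent coordinates; documents that use
    co-occurring vocabulary end up close together even when they share few
    exact terms with one another or with a folded-in query.
    """
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return np.diag(s[:k]) @ vt[:k, :]

# The drawback noted above: adding a document changes the matrix, so the SVD
# (and every document's coordinates) must be recomputed from scratch.
```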

The second substantial modification to vector space retrieval is using link analysis to adjust the weights of documents. The popular search engine Google made a name for itself by leveraging the highly-connected nature of documents on the Web. Google’s PageRank algorithm (Brin & Page 1998) takes advantage of the fact that documents on the Web cluster into “hubs” and “authorities” (Kleinberg 1999) that provide an indication of a page’s popularity. In this dissertation, we will not be working with link analysis, because the collections held by small, topically focused search engines do not have enough interconnection for this type of analysis.
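Although link analysis is not used in this dissertation, the idea behind PageRank can be made concrete with a short power-iteration sketch. The damping factor of 0.85 and the handling of dangling pages are common textbook conventions assumed here for illustration, not details of Google’s implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank.

    links: dict mapping every page to the list of pages it links to.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:  # spread a dangling page's weight evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank
```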

Evaluating Information Retrieval Systems

Each year, a Text REtrieval Conference (TREC) is held, allowing teams from various institutions to compare their retrieval systems on a common set of documents and information needs (Harman 1992). Evaluation in TREC conferences proceeds as follows: A collection of documents is given to the participating teams, and at some later point in time, a set of “topics”, or statements of information needs. (These topics will be described in more detail in Chapter 3.) Each team uses their search engine to retrieve a set of result documents for each topic. All result documents for the topic are pooled, and a panel of human judges evaluates the relevance of each document with respect to the information need expressed in the topic. The resulting relevance scores are then used to determine how well each search engine performed. Many evaluation measures have been used over the years, but the most popular are:

Precision: the percentage of retrieved documents that are relevant

Recall: the percentage of relevant documents that are retrieved

P@N: examining only the first N documents retrieved, the percentage that are relevant.

Precision and recall can often be traded for each other. A search engine that is restricted to returning only a few documents will typically exhibit high precision and low recall, while a search engine that simply returns its entire collection in response to any query will exhibit high recall and low precision.
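These measures translate directly into code. The small sketch below assumes the retrieved results are an ordered list of document identifiers and the relevance judgments are a set of identifiers for the judged-relevant documents; the function names are illustrative.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(relevant)

def precision_at_n(retrieved, relevant, n=20):
    """Precision computed over only the first n retrieved documents (P@N)."""
    return precision(retrieved[:n], relevant)
```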

In the early years, TREC conferences focused on document sets that came from a single source (like a newswire service). These document sets often had idiosyncratic language structures, and did not reflect the diversity of documents available on the Web. As the prominence of the Web grew, the types of competitions falling under the TREC umbrella were expanded. The first TREC competition to feature a document set based on actual Web pages was TREC-8, in 1999 (Hawking et al. 1999).

2.2 Distributed Information Retrieval

Distributed information retrieval (DIR) uses multiple centralized retrieval systems, and employs a process for choosing which centralized systems are queried for any particular information need. The first distributed information systems grew out of work in distributed databases (Marcus 1983; Arens et al. 1993; Levy, Rajaraman, & Ordille 1996).

The typical steps followed by a DIR system include:

1. Locate search engines.

2. Determine a method for communicating with each search engine.

3. Form a representation of each search engine to store in the selection system.

4. When a query is received, apply a selection algorithm to the set of collection representatives, and forward the query to the collections that are expected to return the best results.

5. Apply a merging algorithm to combine results from the selected collections, and present the results to the user. (A minimal sketch of steps 4 and 5 appears below.)
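A skeleton of steps 4 and 5 might look like the following sketch, which assumes the per-engine representations from steps 1–3 are already available as term-weight dictionaries. The scoring and merging functions here are deliberately simple placeholders, not the selection or merging algorithms developed later in this dissertation.

```python
def score_engine(query_terms, representation):
    """Toy selection score: total representation weight of the query terms."""
    return sum(representation.get(t, 0.0) for t in query_terms)

def merge_results(result_lists):
    """Toy merge: round-robin interleaving of the per-engine rankings."""
    merged, seen = [], set()
    for rank in range(max((len(r) for r in result_lists), default=0)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

def distributed_search(query_terms, engines, k_engines=3):
    """Steps 4 and 5: select the engines whose stored representations best
    match the query, forward the query to them, and merge the rankings.

    engines: list of (representation, search_fn) pairs, where representation
    is a term-weight dict built offline in steps 1-3 (e.g., by sampling) and
    search_fn(query_terms) returns a ranked list of document identifiers.
    """
    ranked = sorted(engines, key=lambda e: score_engine(query_terms, e[0]),
                    reverse=True)
    result_lists = [search_fn(query_terms) for _, search_fn in ranked[:k_engines]]
    return merge_results(result_lists)
```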

Early work on DIR systems (Gravano & García-Molina 1995; Xu & Callan 1998) focused on the question of whether collections could be searched more efficiently if they were stored in multiple (local) search indexes, rather than one large index. These systems worked primarily with collections generated based on the organization that created the documents. In some sense, this is a useful distinction, but it ignores the fact that some organizations (like news agencies) will produce documents with no discernible topical focus. This leads many of the collections under consideration to be less focused than actual collections in the wild (where even within an organization, there may be several search systems for documents pertaining to different topics). Another difference between these early experiments and actual Web collections is that no overlap was allowed between the experimental collections. The distributed collections were a strict partitioning of the centralized collection.

Using this type of configuration, Callan et al. (1995) found that DIR produces the same results as centralized IR when half of the distributed collections are searched. That is, when an appropriate selection algorithm is used, and collections comprising half of the documents of the centralized system are searched, overall retrieval results are similar to retrieval from a centralized collection containing all of the documents. Voorhees et al. (1995) found that when the number of documents to be used from each collection is guided by relevance judgments on a set of training queries, DIR is only 10 percent worse than centralized retrieval. However, this approach is expensive, as it requires the results of training queries to be manually judged for each search engine.

Distributed retrieval techniques have been used in other types of information retrieval with some success, including peer-to-peer retrieval systems (Lu & Callan 2005; Akavipat et al. 2006), text classification systems (Fu, Ke, & Mostafa 2005), and case-based reasoning systems (Leake & Sooriamurthi 2004).

2.3 Distributed Search on the Web

A number of DIR systems have been adapted for use with collections on the Web, including GlOSS (Gravano, García-Molina, & Tomasic 1994; Meng et al. 1999), EMIR (Kulyukin 1999), and OBIWAN (Zhu et al. 1999). These systems provide methods for routing queries to a collection of sources, but they depend on the sources to cooperate by providing indices or other data to a central distribution system, and require costly updating of central information as their contents change.

On the Web, the assumption that search engines will want to cooperate with a centralized system has not held true. Search engines on the Web are often competing for traffic, and while providing information about their contents may help boost traffic, they are very reluctant to provide this type of information to a central authority. In fact, some search engines actively misrepresent their contents (using high-demand terms like “sex”) in order to boost the amount of activity they receive. See (Callan, Connell, & Du 1999) for a good discussion. No distributed system on the Web has achieved the full cooperation of all sources it indexes. Unless the sources have an overriding economic reason to cooperate, they will not. In general, non-interference is the best that can be hoped for. Recent work in the information retrieval community has started to address the problem of representing sources without explicit cooperation. Most notably, Callan and Connell (2001) have developed methods for approximating the content of a server’s collection by sending repeated small queries and analyzing the results.
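The sketch below gives the flavor of query-based sampling in the spirit of Callan and Connell (2001), though the query-selection and stopping rules here are simplified assumptions rather than their published procedure: single-term probe queries are sent, the top results are added to the sample, and new probe terms are drawn from the documents seen so far.

```python
import random

def sample_collection(search_fn, seed_terms, max_docs=300, max_queries=100,
                      docs_per_query=4):
    """Approximate a search engine's contents via repeated one-term probe queries.

    search_fn(term) -> ranked list of document texts from the engine.
    The sampled documents' term statistics serve as the engine's
    representation. All parameter values are illustrative assumptions.
    """
    vocabulary = list(seed_terms)
    sampled, seen = [], set()
    for _ in range(max_queries):
        if len(sampled) >= max_docs or not vocabulary:
            break
        term = random.choice(vocabulary)            # pick a probe term
        for doc in search_fn(term)[:docs_per_query]:
            if doc not in seen:                     # keep only unseen documents
                seen.add(doc)
                sampled.append(doc)
                # grow the probe vocabulary from the documents seen so far
                vocabulary.extend(w.lower() for w in doc.split())
    return sampled
```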

When search engines have a desire to cooperate with a central authority, they must follow a standard protocol. Harvest (Bowman et al. 1995) and STARTS (Gravano et al. 1997) were early popular protocols, but most search engines that wish to make their content explicitly available for distributed retrieval are publishing records that follow the recent OAI-PMH protocol (Lagoze & Van de Sompel 2001). Any record published via this protocol may be picked up by an OAI harvester, such as the University of Michigan’s OAIster1 service (Hagedorn 2005). NCSTRL2, one of the oldest DIR systems (Davis 1995), has recently begun conversion of its services to use the OAI-PMH protocol.

Meta Search

DIR systems that do not assume cooperation from their sources fall under the heading “metasearch engines”. This approach was first taken by the MetaCrawler (Selberg & Etzioni 1995) to access search engines without explicit cooperation by simply forwarding queries to them and collating the results. Popular metasearch engines include Dogpile3, Mamma4, and IxQuick5. There are also a growing number of topical metasearch engines that collect results from several smaller topical search engines. For example, the WindowsSecrets6 search engine collects tips on using Microsoft Windows from several other sites. Metasearch systems often ignore the differences in coverage of topics by search engines they index. Bandwidth constraints limit the number of search engines that can be queried, restricting the list of available search engines to a relatively small number of general-purpose engines.

1 http://oaister.umdl.umich.edu/o/oaister/
2 http://www.ncstrl.org/
3 http://www.dogpile.com/
4 http://www.mamma.com/
5 http://ixquick.com/
6 http://www.windowssecrets.com/

Metasearch systems with more manual control have been developed. The Sherlock tool (Montbriand 1999) included with Apple’s MacOS, ProFusion (Gauch, Wang, & Gomez 1996), and Inquirus2 (Glover 2001) allow users to select a topic area, and the query is automatically sent to three or four search engines covering that subject. While these systems are useful, the user cannot be expected to correctly classify every query. Many users do not have the knowledge or desire to perform this task.

Some “personal” metasearch systems can be extended by the user to search new sources. These include Apple’s Sherlock, Magellan Metasearch7, and Watson8 (Budzik & Hammond 2000). Creating custom “plug-ins” allows users to add search engines to the system and configure them into searchable groups. In order to create such a plug-in, the users must be aware of a search engine’s existence, and must manually code settings that allow searches to be performed. Adding too many plug-ins to a system like this can degrade performance, as additional bandwidth is used by each new search engine added to the system.

A few metasearch systems have attempted to intelligently select search engines. SavvySearch (Dreilinger & Howe 1997) addressed this problem by recording how well particular search engines handled past queries, and using vector-space retrieval to match new queries to search engines that did well with similar queries. ProFusion (Gauch, Wang, & Gomez 1996) used a category hierarchy to automatically categorize queries and select relevant search engines. Q-Pilot (Sugiura & Etzioni 2000) used a combination of home page keywords, back-link keywords, and database sampling to determine which search engines were most appropriate for each query. However, since these systems use hand-coded or computation-intensive representations, they are still restricted to using a relatively small number of search engines.

7 http://sourceforge.net/projects/magellan2/
8 http://www.intellext.com

Another approach to adding intelligence to metasearch systems is post-processing of results. Both Clusty9 and KillerInfo10 attempt to cluster search results into meaningful categories. Ask.com11 takes a slightly different approach, using a pre-generated set of categories that are matched to the queries being answered.

2.4 Intelligent Information Systems

Various enhancements have been added to information retrieval systems in an effort to make them more “intelligent”. Just-in-time information systems attempt to provide information as the user needs it, based on user behavior. The Remembrance Agent (Rhodes & Starner 1996) watches the information a user types in a text editor, and sends related queries to local databases including the local file system and email folders. The system has recently been adapted for use with the Web (Rhodes 2002). The Lumiere project monitors user behavior in Microsoft Office to predict user questions and provide information (Horvitz et al. 1998). The Watson system (Budzik & Hammond 2000) analyzes applications that are running on the user’s desktop, automatically generates queries, and processes results in the background, so that information is available whenever the user decides to look at it. Code Broker (Ye & Fischer 2002) retrieves useful pieces of code as the user writes a computer program.

Tracking the interests of users has been shown to increase the effectiveness of information retrieval systems. Profile information can be used to automatically rank documents (Kaplan, Fenwick, & Chen 1993; Pazzani & Billsus 1997; Bollacker, Lawrence, & Giles 1999), or to select categories that interest the user (Mostafa et al. 1997; Pretschner & Gauch 1999). Calvin (Leake et al. 2000) tracks the ways in which users access documents, so the documents can be more easily retrieved in the future.

9 http://clusty.com
10 http://www.killerinfo.com
11 http://www.ask.com

2.5 Cognitive Perspectives on Information Retrieval

Cognitive science is an interdisciplinary field focusing on the study of the human mind and the nature of intelligence. Often, cognitive science research confines itself to modeling the processes that occur solely within the mind. However, this approach has been questioned. Donald Norman (1988) notes that people often supplement knowledge in the mind by transferring portions of that knowledge to objects in the world. One of the simplest ways this can be accomplished is by writing a note.

Information retrieval systems supplement knowledge in the mind by storing documents that are of interest to a user and making the documents easily accessible via simple keyword searches. If a user knows that the information he needs is available in a search engine, he may not make the effort to memorize this information.

Edwin Hutchins (Hutchins 1995) goes a step farther than Norman, arguing that any study of cognitive science should treat the mind as a single part of a complete system. The system may include the person, the physical environment, and any tools (physical, electronic, or mental) that the user has at his disposal. An information retrieval task should be viewed as a cooperation between the user and the information retrieval system.

The process of searching for information on the web is usually complex, with users performing many queries that increase in specificity. Users may make multiple visits to pages that result from a search, as the sites may point to each other and/or appear multiple times in the search results. Initial searches often target a relatively broad topic as the user solidifies their information need. More specific searches occur when the user has a good idea what they want, and they are comparing several options. Males and females have slightly different search patterns, with females taking more time to digest each page, and males skipping around a bit more (Hotchkiss 2003).

Rhodes argues that each user implicitly performs a cost/benefit analysis before performing a search. Depending on the situation, the user’s expected cost/benefit ratio for searching the web may be higher or lower than the expected cost/benefit ratio of other search methods, such as visiting a library. The user constantly updates their cost/benefit evaluation as the search progresses, and newly discovered information, such as the availability of a relevant book in the local library, may cause a change of strategy midway through the search process (Rhodes 2000, pp. 41–44).

Rhodes describes search systems in terms of a “ramping interface”, which allows the user to easily manage the cost/benefit ratio (Rhodes 2000, pp. 56–59). A ramping interface displays search results in the most unobtrusive manner possible until the user decides that they are interested in obtaining more detail. At any time, the user can progress through steps that take slightly more effort and produce more detail about the search results. For example, in a typical search system, document titles are displayed in a larger font than document summaries, allowing the user to quickly skim the results. The user can expend a small amount of extra effort by reading summary information for the items that are of interest. In cases where the title and summary appear to be relevant, the user can expend a small amount of extra effort to click on the result and review the contents of the full document. In a distributed retrieval system, a ramping interface may start by providing the names of the selected search engines and a count of the number of documents returned by each engine, allowing the user to view results from individual engines or merged results from all engines.

Distributed retrieval systems face several challenges. Queries on the Web are relatively short, averaging just over 2 words in length (Lesk 1997; Spink & Jansen 2004). DIR systems typically need large queries (30 words or more) generated by experts to accurately select sources and retrieve relevant information. Because of the cost/benefit evaluations mentioned above, users expect search engines to return results quickly, often in less than two seconds (Miller 1968). Another compounding factor is that users examine very few results from each search. More than 70 percent of the time, users only view the top 10 results from a search (Spink & Jansen 2004). Consider a typical Web search engine that returns 10 hits on each page of results. If none of the results on the first page are relevant, a user may check the second page, but if there are no relevant hits on that page, the user will often change their query or look elsewhere. This reasoning suggests that the threshold is at or above an initial precision of 0.05, the equivalent of one relevant document in the first 20 results. Any time precision drops below this, users are likely to look elsewhere.

2.6 Information Retrieval in This Dissertation

The evaluations in this dissertation will be somewhat different from previous evaluations of DIR systems. Previous evaluations (French et al. 1999; Xu & Croft 1999) have focused on recall metrics (proportion of total relevant documents retrieved). On the Web, where there may be millions of documents relevant to a given query, recall has little meaning. This work will focus on obtaining high precision (proportion of documents retrieved that are relevant).

Most previous experiments on DIR systems assume that the collections are a method of partitioning a centralized collection. That is, each document from the centralized collection appears in exactly one collection in the distributed system. In the real world, this is not the case. While the “hidden Web” generally contains independent, non-overlapping databases, the public Web has opportunity for much overlap between the contents of differing databases. Some topically focused search engines may have a high degree of overlap, as they are indexing the same small set of documents available on their target topic. Other focused search engines provide access to documents internal to a particular site, none of which will be indexed by any other site on the Internet.

In this work, we will focus primarily on the act of selecting a collection, but the final system presented in Chapter 7 will combine all of these steps. To make the experiments manageable, some simplifications must be made. In a more general, fielded system, these simplifications will not always hold. However, it is assumed that they will hold often enough, or can be controlled enough, to make the experimental results applicable. The simplifications are:

1. Context is always relevant: The context and/or user profile generated by a user-tracking system may not be compatible with the query presented. For example, if a user working on financial statements suddenly remembers he was supposed to order a present for his niece, he may send a query to the system that has nothing to do with the currently available context, or his recorded preferences. For the purposes of this dissertation, it is assumed that any available context will always be relevant to the query. In a fully fielded system, this could be assured by allowing the user manual control of the system’s use of context.

2. All IR systems use vector-space retrieval: There are several competing models for information retrieval systems. However, they have been shown to produce very similar results. Vector-space retrieval is the most popular model for Internet search engines, so it is reasonable to use this as a starting point.

3. Bag-of-words is an adequate representation: The basic representation being used for this project is a set of weighted keywords. This type of representation is traditional for information retrieval systems, and has been shown to provide adequate results in both single-source and distributed systems. It is assumed that adding more complex, knowledge-intensive representations would only improve effectiveness. In fact, the possibility of performing more intense calculations on small sets of documents is one of the advantages of distributed retrieval. The weighted keyword representation will be used for all documents, contexts, and queries.

3 Choosing Collections Using Topic Similarity

An obvious way to choose search engines is by finding the engine whose collection most closely matches the topic of the current query. For example, if the user’s query contains many terms related to sports, it seems reasonable to direct the query to a sports-oriented search engine, rather than a general-purpose search engine.

In general, distributed information retrieval (DIR) systems assume that an “on topic” collection will perform better than an “off topic” collection. That is, a collection containing many documents similar to the query will produce better results than a collection containing few documents similar to the query (for example, see Callan, Lu, & Croft 1995 or Gravano & García-Molina 1995). While this assumption sounds reasonable, it does not necessarily hold true.

The algorithms used by information retrieval systems have been designed for the express purpose of identifying relevant documents from a collection containing a wide variety of documents. It is possible that collections with a narrower focus will not improve the capabilities of a searching system. If we are searching for a document on batting averages, will it be easier to pick this document out of a collection of baseball documents, or a collection with broader focus, such as a collection of sports documents? As long as a collection has a few documents similar to the query, these documents could potentially be retrieved by the search engine, and be useful to the searcher.

Even if topically-focused collections can perform better than non-focused collections, there may still be a problem with the accuracy of the system for selecting target collections. Sometimes documents (or collections) that appear to match the user's topic will turn out to be irrelevant. If the selection system cannot accurately choose focused collections that contain relevant documents, retrieval from a broader collection may produce better results. In this chapter, we will investigate several questions:

1. Will a search engine with a topically-focused collection perform better than a collection of documents covering a wide range of topics?

2. Will a search engine perform better with a collection that contains many documents similar to the query, or is it better for the collection to be focused on a different topic?

3. Can an automatic system select topically-focused collections with enough accuracy to provide better overall results than a centralized collection?

In order to answer these questions, we will generate collections that are focused on particular topics. The performance of search engines based on these collections will be compared to the performance of a search engine based on a more general-purpose collection. The topically-focused collections will be varied in their similarity to a given query, allowing us to study how their retrieval performance changes at different similarity levels. Document collections “in the wild” will not always be perfect matches (or even as-good-as-possible matches) to the query, so we will need to determine whether there are levels of similarity that accurately predict the answers to the questions above.

3.1 Previous Work on Topic Similarity

The typical approach taken by distributed information retrieval systems is to generate a summary of the topical content of a database, and use this summary as a surrogate for the actual database contents when calculating similarity between the query and the database. This approach allows similarity between a query and a collection to be computed in much the same manner as similarity between a query and a document in a centralized IR system.
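For illustration, a collection summary of this kind can be as simple as a table of per-term statistics. The sketch below is hypothetical code with hypothetical field names, not taken from any particular DIR system; it records, for each term, how many documents in the collection contain it, along with the collection's overall size.

    from collections import Counter

    def summarize_collection(documents):
        """Build a compact surrogate for a collection of tokenized documents."""
        doc_freq = Counter()    # term -> number of documents containing the term
        term_freq = Counter()   # term -> total number of occurrences
        for doc in documents:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        return {"num_docs": len(documents),
                "num_terms": sum(term_freq.values()),
                "doc_freq": doc_freq,
                "term_freq": term_freq}

A selection algorithm can then compare a query against this small summary instead of against the collection's full index.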

While the technique of choosing collections that are topically similar to the query has been used in many distributed retrieval systems (Gravano, García-Molina, & Tomasic 1994; Gravano et al. 1997; Callan, Lu, & Croft 1995), there have been few formal tests of the theory behind topic matching. Most previous work consists of ad hoc tests that revealed interesting trends. A group of researchers led by David Hawking analyzed several methods for topic-based selection (Hawking & Thistlewaite 1999). Many of these methods outperformed a simple random selection of collections. The topic-based methods sometimes outperformed retrieval from a centralized collection, but not always. Later, these results were augmented with further indications that centralized retrieval was generally more effective than distributed retrieval using topic-based selection (Craswell, Bailey, & Hawking 2000). In a limited test (Voorhees, Gupta, & Johnson-Laird 1995), it was determined that centralized retrieval is better than distributed retrieval. While these tests are useful steps towards answering the questions posed above, they were primarily focused on selecting the best among several competing selection algorithms, rather than on determining the conditions that make topic match useful.

Ipeirotis and Gravano (2002) performed one of the few DIR experiments on actual search engines in the wild, using an algorithm for automatically categorizing the central topic of a search engine. Their experiments resulted in very low performance scores. In part this was due to the difficulty of distributed search, but another major factor was the minimal overlap between the topics represented in the search engines they used and the topics of the target queries. This indicates the need to have a large selection of search engines covering a wide variety of topics for a distributed searching system to work well.

Most previous DIR work has focused on artificially generated collections, often consisting of non-overlapping partitions of an existing collection. In the experiments that follow, we will start with artificial collections, but generate them in a way that is similar to the way actual Web collections are generated.

3.2 Creating a Collection

To test the usefulness of matching collections on topic similarity, we need to start with a collection, and determine how its performance can be evaluated. Rather than randomly selecting a collection from the Web, we will use a more controlled document set as a basis for building collections. This standard collection of Web pages, called WT2g, has many advantages for experimental use.

The WT2g Collection

WT2g is a 2-gigabyte sample of Web pages created for testing information retrieval systems. It was originally created for use with the TREC information retrieval conferences (Hawking et al. 1999), and is now available from the Australian research organization CSIRO (http://www.ted.cmis.csiro.au/TRECWeb/access to data.html). It is composed of 247,491 documents that were originally published across 956 Web servers. This collection is more representative of the Web than collections often used to evaluate information retrieval systems (like the widely-used TIPSTER corpus). While there are larger, more comprehensive collections available (such as the 10-gigabyte WT10g collection and the 100-gigabyte VLC2 collection), the reasonable size of WT2g allows for more flexible experimentation.

The WT2g collection was used for the retrieval competition at the eighth Text REtrieval Conference (TREC-8). As a result of this use, documents in the collection have been evaluated by human judges regarding their relevance to 50 TREC topics. The TREC topics are statements of “information needs” that a user may bring to a search system. Each topic contains a description of the information need at three levels of specificity. First is a title, which contains a 2-3 word summary of the topic. This is the sort of information a user might type into a typical search engine. The description section of the topic contains 1-2 sentences with a bit more detail, such as a user might bring to a reference desk in a library. Finally, there is a narrative description, which discusses in greater detail the types of documents that would be considered relevant and non-relevant to the topic. Figure 3.1 shows topic 401, which is the first topic for which relevance scores are available with respect to WT2g documents. The topic's ID number is 401 because topics used in TREC conferences are given sequential numbers, and new topics are created for use at every conference.

Relevance scores for TREC topics are binary, so a document cannot be considered “partially” relevant. The human judges are instructed to count a document as relevant if it meets the criteria set out in the narrative portion of the topic. The number of documents in the WT2g collection that are considered relevant to a given topic ranges from 6 to 148, with an average of 45.6. Topic 401 falls very near the average, having 45 relevant documents in the collection. Figure 3.2 contains the titles of some documents the human judges considered relevant to topic 401.

Number: 401
foreign minorities, Germany
<desc> Description: What language and cultural differences impede the integration of foreign minorities in Germany?
<narr> Narrative: A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant. </top>

Figure 3.1: Topic 401

WT02-B12-172  Germany-Kohl: Immigration Affects Unemployment
WT02-B12-220  The third Migration Dialogue Seminar on Integration Issues and Immigration Policy
WT02-B13-128  MIGRATION NEWS Vol. 1, No. 2 March, 1994
WT09-B19-132  The Forced Migration Alert: 20 September 1996: Germany Refugees

Figure 3.2: Documents relevant to topic 401

Documents in the WT2g collection have a small amount of topical overlap. Some documents cover multiple topics, but most documents focus on a single topic, in a manner similar to documents typically found on the Web. Over the 50 topics used in the TREC-8 Web competition, 2149 WT2g documents are relevant to at least one topic (this is less than 1% of the entire collection). 116 documents are relevant to two or more topics, 13 are relevant to 3 or more topics, and only one document (WT02-B15-8) is relevant to 4 topics.

The Search Engine

Once we have selected a document collection, we need a way to search over it. For this task, we will use Lucene (http://lucene.apache.org/java/), an open-source toolkit for information retrieval written in Java. Lucene uses a simple variant of the popular vector-space model (Salton, Wong, & Yang 1975) for ranking documents relative to a query. The vector-space model treats both documents and queries as high-dimensional vectors, where every term in the dictionary corresponds to a dimension of the vector. The vectors are used to compute a score that represents the “similarity” between each document and the user's query. These similarity scores are then used to rank documents in order of their estimated relevance to the query.

Most vector-space retrieval systems, including Lucene, use a metric known as TFIDF to calculate scores for each term in the vector. TFIDF metrics estimate the amount each term contributes to the “meaning” of a document or query. TFIDF is an abbreviation for “term frequency, inverse document frequency”. Under a TFIDF metric, each term in a query or document is given a score based on two factors:

Term Frequency (TF): Terms that occur frequently in a document are good indicators of the document's central topic. When a term appears many times in a document it is likely that the document is “about” that term. A term's TF score varies with its frequency within the document. In other words, higher frequencies within the document result in higher TF scores. A document about water skiing may have a high term frequency for the terms “water” and “ski”, among others.

Inverse Document Frequency (IDF): Terms that occur frequently throughout the collection of documents do not assist in discriminating between documents. A term's score varies inversely with its frequency in the collection. That is, higher frequencies within the collection result in lower IDF scores.
In a collection of documents about water sports, “water” may appear in nearly all of the documents, giving it a low IDF score, while “polo” may appear in only a few of the documents, resulting in a high IDF score for that term.

When the similarity is computed, these two scores are multiplied, so only terms that score highly on both will make a large contribution to the final score. In fact, for any individual query or document, most dictionary terms will have a term frequency of zero, and therefore a TFIDF score of zero.

To calculate the similarity between a document and a query, the two vectors are normalized, or scaled to have a length of one. Then, the dot product of the two vectors is taken. The dot product operation multiplies corresponding scores in the two vectors, and adds all resultant values. This means that the TFIDF scores from corresponding terms in the query and document will be multiplied, and only terms that the two vectors have in common (with non-zero values) will contribute to the similarity score.

The result is equivalent to finding the cosine of the angle between the two vectors, and is often called the “cosine similarity measure”. The “closer” the two vectors are to each other, the higher the value of the similarity measure. At the extremes, this measure will have a value of 0 for vectors that have no terms in common, and a value of 1 for vectors that are equivalent.

The actual computation used by Lucene, described in (Lucene: Frequently Asked Questions Web site 2006), is slightly more complex than a standard TFIDF calculation. Lucene's similarity formula is, for a query q, a document d, and a dictionary of terms T:

    \mathrm{similarity}(d, q) = \sum_{t \in T} \frac{\mathrm{TF}(t, q)\,\mathrm{IDF}(t) \cdot \mathrm{TF}(t, d)\,\mathrm{IDF}(t)}{\mathrm{norm}(q) \cdot \mathrm{norm}(d)} \cdot \mathrm{boost}(t) \cdot \mathrm{coord}(q, d)    (3.1)

    \mathrm{TF}(t, x) = \sqrt{\text{frequency of } t \text{ in } x}    (3.2)

    \mathrm{IDF}(t) = \log\!\left(\frac{\text{number of documents in collection}}{\text{number of documents containing } t + 1}\right) + 1    (3.3)

    \mathrm{norm}(x) = \sqrt{\sum_{t \in x} \left(\mathrm{TF}(t, x) \cdot \mathrm{IDF}(t)\right)^2}    (3.4)

    \mathrm{boost}(t) = \text{user-specified boost factor for } t    (3.5)

    \mathrm{coord}(q, d) = \frac{\text{number of terms } q \text{ and } d \text{ have in common}}{|q|}    (3.6)

The square root of a term's frequency is used to compute the TF score, to keep very long documents from skewing the results. For similar reasons, a logarithm is applied to calculate the IDF score. Lucene also allows users to specify a “boost” in the weight of terms for their query. Setting the boost parameter to a higher value is equivalent to repeating a term within the query. We will be ignoring the boost for now. Finally, Lucene applies an extra “coordination factor” to increase the score of documents that have more terms in common with the query. This ensures that documents that contain three terms of a four-term query will score higher than documents that only contain two of the query terms (regardless of how high those two terms score).
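
For concreteness, the following Python sketch re-implements this scoring scheme directly from equations 3.1-3.6. It is an illustration of the formulas rather than Lucene's actual code: documents and queries are plain lists of tokens, and all boost factors are fixed at 1, as they are in the experiments below.

    import math

    def tf(term, text):
        # Equation 3.2: square root of the term's frequency in the text.
        return math.sqrt(text.count(term))

    def idf(term, documents):
        # Equation 3.3: log(N / (documents containing the term + 1)) + 1.
        containing = sum(1 for doc in documents if term in doc)
        return math.log(len(documents) / (containing + 1)) + 1

    def norm(text, documents):
        # Equation 3.4: length of the text's TFIDF vector.
        return math.sqrt(sum((tf(t, text) * idf(t, documents)) ** 2
                             for t in set(text)))

    def similarity(doc, query, documents):
        # Equations 3.1 and 3.6, with every boost(t) set to 1.
        shared = set(query) & set(doc)
        if not shared:
            return 0.0
        coord = len(shared) / len(set(query))
        score = sum(tf(t, query) * idf(t, documents) *
                    tf(t, doc) * idf(t, documents)
                    for t in shared)
        return score / (norm(query, documents) * norm(doc, documents)) * coord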

Creating a Collection From WT2g

To investigate the effects of topically-focused collections, we will begin by comparing two types of collections. The first collection, representing the contents of a large general-purpose search engine, will consist of all documents in the WT2g collection. The second collection will be a smaller collection focused on the topic described previously, topic 401.

There are many ways to create a topically-focused collection, but one of the simplest is to search a larger collection for documents that have keywords of interest. This process is similar to the method of collection creation used by some search engines, called focused crawling (Chakrabarti, van den Berg, & Dom 1999).

We will build the focused collection by searching the full WT2g collection, using the description portion of topic 401 as our search criteria. The top 4000 documents will be selected from the result list, and this will form the collection. While large centralized search engines contain billions of documents, the number 4000 was chosen as representative of the size of smaller, more topically focused search engines. This is in line with collection sizes for previous distributed retrieval systems (Yuwono & Lee 1997; Xu & Croft 1999; Powell et al. 2000), and is assumed to be typical of topic-specific search engines on the Web.
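
Schematically, the construction looks like the sketch below: rank every document in the full collection against a seed query and keep the highest-scoring 4000. The function and parameter names are hypothetical; the score argument can be any retrieval scoring function, such as the TFIDF similarity sketched earlier.

    def build_focused_collection(full_collection, seed_query, score, size=4000):
        """Select the documents that score highest against the seed query,
        mimicking the way a focused crawler or topic-specific engine gathers
        its collection.  full_collection is a list of tokenized documents and
        score(document, query) is any retrieval scoring function."""
        ranked = sorted(full_collection,
                        key=lambda doc: score(doc, seed_query),
                        reverse=True)
        return ranked[:size]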

3.3 Evaluating Collection Performance

After placing the collection into a search engine, we need a way to measure the performance of the search system as a whole. Although there are many evaluation metrics used by the information retrieval community, most metrics are variants of three basic measures:

Precision: the percentage of retrieved documents that are relevant

Recall: the percentage of relevant documents that are retrieved

P@N: examining only the first N documents retrieved, the percentage that are relevant.

The most common method of examining search system performance is to graph precision as a function of recall. A graph is created by gathering documents until a particular recall level is met, measuring precision at that point, and repeating for multiple recall levels. A good retrieval algorithm will place most of the relevant documents near the beginning of the retrieved set, and precision will typically decrease as higher levels of recall are sought. While it is possible for precision to increase from one recall level to the next, most precision-recall graphs decrease monotonically (similar to the one shown in Figure 3.3).

If the precision value is noted after each relevant document is seen (rather than at specific levels of recall), these values can be averaged to give an evaluation measure called average precision. The average precision measure indicates the area underneath the precision-recall curve, and is a useful summary of the overall effectiveness of a search system for a given query.

There are many other evaluation measures, but most of them are simply variants of those discussed above. For example, weighted precision is similar to average precision, but adds extra weight to the first few precision scores, boosting the overall score of search systems that have high precision early in the retrieved set.

These evaluations will be somewhat different from previous DIR evaluations. Evaluations of DIR systems (French et al. 1999; Xu & Croft 1999) have previously focused on recall metrics (proportion of relevant documents retrieved). On the Web, it is generally not possible to calculate recall, because the number of relevant documents in the universe is not known. For many queries, the Web contains millions of relevant documents, which cannot all be read by a single searcher, further decreasing the utility of recall as a performance measure. Instead, we will focus on two precision-based measures, P@20 and average precision. P@20 (precision at 20 documents) is particularly important for Web search, because users often look at only the first page of results from a search engine (Spink & Jansen 2004). A system that boosts initial precision at the expense of low precision later in the result set is acceptable. In fact, early precision is even more important for DIR, because a typical DIR system will take the first few results from each individual engine and combine them to present the final result list. Average precision, on the other hand, is a good measure of the general usefulness of an IR system. Both P@20 and average precision have a history of use for comparing Web search engines (for example, in Hawking et al. 1999).
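
As a minimal sketch of the two chosen measures, assuming a ranked list of document identifiers and the set of identifiers judged relevant:

    def precision_at_n(ranked, relevant, n=20):
        """P@N: the fraction of the first n retrieved documents that are relevant."""
        return sum(1 for doc_id in ranked[:n] if doc_id in relevant) / n

    def average_precision(ranked, relevant):
        """Average of the precision values noted at each relevant document.
        Relevant documents that are never retrieved contribute zero, following
        the usual TREC convention."""
        hits = 0
        precision_points = []
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                hits += 1
                precision_points.append(hits / rank)
        return sum(precision_points) / len(relevant) if relevant else 0.0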

Lucene's Overall Performance

To illustrate the use of these measures, and verify the basic functionality of the search system, Lucene was applied to the Web retrieval task from TREC-8, retrieving documents from the WT2g collection for each of the 50 target topics.

For all of the experiments that follow, the search engine itself was held constant, while the contents of the document collection were varied. Lucene was set up with a typical configuration. A basic set of “stopwords”, common words with little informational content (like “the”, “and”, and “begin”), was removed from the documents. No other pre-processing was performed. Queries were automatically generated from the “Description” portion of the topic, weighting all terms equally. For example, the query generated for topic 401 was: “what language cultural differences impede integration foreign minorities germany”.

Figure 3.3 shows Lucene's average precision-recall curve for the 50 topics when retrieving documents from the full WT2g collection. This curve was created by averaging the precision scores for the 50 topics at each recall level. An “overall” average precision score was calculated, averaging the precision scores for all recall levels over all 50 topics. This score was 0.2175, which would have placed in 11th position against the 17 groups that participated in the TREC-8 Web competition (see Hawking et al. 1999). Although this performance was not exceptional, and the state of the art has obviously advanced in the years since the TREC-8 competition was held, our primary concern is not that Lucene outperform the other systems, but rather to affirm that it is not significantly lacking. We are interested in comparing the difficulty of queries across various collections, not in comparing the details of the search engine being used. Lucene's search algorithm is typical of the engines that underlie a large number of topic-specific search services present on the Web. In fact, many of these search services are actually based on Lucene.

Figure 3.3: Lucene's average performance on the TREC-8 Web topics.

Comparing the Topical Collection to the Full Collection

Now that our evaluation metrics have been established, we can compare retrieval results from a topically-focused collection with results from a centralized collection. The topically-focused collection will be the collection based on topic 401, as described in Section 3.2. The centralized collection will be the full set of WT2g documents. Both collections will be evaluated using topic 401 as the query. While this is somewhat unrealistic (collections are seldom custom-built for a single query), it can be thought of as a “best case” for a topical search engine. What better place to look for information on topic 401 than a search engine that was custom built with this topic in mind?

For simplicity, I will be referring to a search system based on a collection simply by the name of the collection. The search engine (Lucene), and all other processing, will remain the same, using the configuration described in Section 3.3. Therefore, the search system based on the full WT2g set will be referred to simply as “the full collection”, while the search system based on topic 401 will be called collection t401-t1 (the t1 designation is added to distinguish this collection from variants that will be described later).

There are several differences between the collections that may affect retrieval performance. First, there is a great difference in the size of the collections. The full collection contains nearly 250,000 documents, while the topical collection contains only 4000 documents. One effect of this size difference is that the topical collection does not contain as many “distractor” documents as the full collection. Intuitively, this should improve the performance of the topical collection, since it only has to identify relevant documents from a small pool of documents, rather than the much larger pool of documents in the full collection. Another difference between the collections is in the number of documents that are actually considered relevant. While the full WT2g collection contains 45 documents relevant to topic 401, the topical collection only contains 42 of these documents. This difference does not immediately suggest a difference in the performance of the two engines. It does mean that the number of documents needed to achieve 100% recall is less for the topical collection than for the full collection, but this does not matter much, since we are primarily concerned with measures of precision.

Figure 3.4 compares the precision-recall curves produced by the full collection and the t401-t1 collection. The two curves are very similar, particularly at high levels of recall. This is to be expected, since the two collections contain nearly the same sets of target documents. It is promising to see that for low recall levels, the topical collection has higher precision scores.

Figure 3.4: A comparison of precision-recall curves using a search engine based on topic 401 and a search engine based on the full WT2g set.

However, this is not really a fair comparison. Precision-recall curves are used extensively in the information retrieval literature when comparing algorithms and holding the collections constant. In our case, where the collection itself is being varied, comparing recall scores can distort the results. Since there is a difference in the total number of relevant documents contained in the two collections, the number of relevant documents required to reach a given level of recall is no longer constant. We will dispense with this problem by ignoring recall-based measures. As discussed previously, recall is not a particularly useful measure when dealing with Web search, so we will focus on measures of precision.

The graph shown in Figure 3.5 more accurately displays the difference between the two collections, comparing P@ values for the first 1000 documents retrieved from each collection. As in the previous comparison, the t401-t1 collection outperforms the full collection, particularly when small numbers of documents are retrieved, the case most typical for Web searching.

Figure 3.5: A comparison of P@ curves using a search engine based on topic 401 and a search engine based on the full WT2g set.

In the next section, when we start comparing many collections, it will be cumbersome to make these comparisons between P@ curves, so we will focus on one-dimensional measures of precision. Scores for the one-dimensional measures on these two collections are shown in Table 3.1. In this case, the topical collection slightly outperforms the full collection, but this will not always be the case, as we shall soon see.

Collection     P@20 (Topic 401)    Average P (Topic 401)
Full WT2g      0.40                0.2094
t401-t1        0.45                0.2267

Table 3.1: Comparison of one-dimensional precision measures.

3.4 Varying the Topical Focus of a Collection

The comparison above indicates that small, topically-focused collections can outperform larger, more general collections. But this comparison was somewhat contrived, using a collection that was custom built with the topic in mind. In fact, the query used for testing was exactly the same as the query used to generate the collection. Collections “in the wild” tend not to be built like this. It is extremely rare for a collection to directly correspond with a specific user need. It is easy to understand why: just look at topic 401 (in Figure 3.1). It is unlikely that any real search system on the Web caters to this topic. However, there are many Web search sites that focus on information related to this topic, such as search engines specific to Germany, search engines dealing with immigration, etc.

Creating Collections With Varied Focus

To investigate the effects of collections that are not specifically created for the topic in question, we will look at collections that have varying degrees of similarity to topic 401. These collections were created from the full WT2g collection in the same way the t401-t1 collection was created, by sending a query to the “full collection” search engine and selecting the first 4000 documents returned.

We will look at collections representing four levels of similarity. The first collection is the t401-t1 collection we have been using.
Two additional collections were generated by making manual modifications to the topic 401 query. These modifications consisted of deleting terms or substituting terms with similar terms, often resulting in a collection that covered a broader array of topics while remaining the same size. The fourth collection is intended to be as off-topic as possible. This collection was generated from a query based on a completely separate topic (topic 413, regarding steel production). Details about the specific queries used can be found in Appendix A.

The following designations will be given to the collections based on topic 401:

- t401-t1 = documents retrieved using topic 401's description as a query
- t401-t2 = documents retrieved by modifying the topic description slightly
- t401-t3 = documents retrieved by modifying the topic description greatly
- t401-t4 = documents retrieved using the description of a completely different topic

Initial Results

Figure 3.6 shows a comparison of the four collections based on topic 401, using both of the one-dimensional precision measures. Keep in mind that each pair of columns on this graph is an attempt to summarize a precision graph. For example, the leftmost pair of bars is a summary of the solid line from Figure 3.5.

Figure 3.6: Comparison of P@20 and Average Precision scores for the collections based on topic 401.

For both measures, there is a definite decrease between the t401-t1 collection and the t401-t4 collection. However, the scores for the intermediate collections are a bit more confusing. The average precision measure drops off gradually as the collections are moved more off-topic, but the P@20 measure has an unexpectedly high value for the t401-t3 collection, illustrating the greater variability of the P@20 measure.

3.5 Calculating Similarity Between Queries and Collections

The results above are consistent with our intuition. As similarity between the query and the collection decreases, performance decreases. We will now try to establish a method for predicting performance based on the topic of a collection. To do this, we must establish a method for quantifying the amount of similarity between a query and a collection.

One of the most successful algorithms for calculating this type of similarity is CORI, first expressed in (Callan, Lu, & Croft 1995), and later refined (Callan 2000; French et al. 1999). In comparisons of the most popular DIR algorithms, CORI came out on top (French et al. 1999; Powell et al. 2000).

CORI is a modification of TFIDF to work on the collection level instead of the document level, treating each collection as a “superdocument”. CORI's core formulas are, for a term t, collection c, and set of collections C:
AB)C+D?AE ¦5= F</p><p> frequency of  in collection cw avgcw</p><p>4 4</p><p>+ 6 )GAIHJ?</p><p>4 4</p><p>'</p><p>7;89:1<</p><p> #%$&</p><p>IDF ¦ / (3.8)</p><p>-</p><p>#K$L&M¦ 6 ),+N number of collections containing </p><p> cw(c)</p><p>4 4 </p><p> avgcw  (3.9)</p><p>6</p><p></p><p>O</p><p>7 5 cw ¦¨5= / total number of terms in collection (3.10)</p><p>In much of the literature, CORI’s algorithm is described as TFICF (“term frequency, inverse collection frequency”), but we will keep the term TFIDF for simplicity of comparison. By comparing with formulas 3.2-3.6 in section 3.2, some obvious similarities can be seen. Although there are differences in the exact method for calculating weights, the essential properties of these formulas are the same. The TF values rise with increasing frequency of the target term in the item (whether the item is a document or a collection), and the IDF values rise with decreasing frequency of the term throughout the set (whether the set is a collection of documents or a set of collections). All of these formulas make use of functions that reduce the effect of very large values (e.g., logarithms, square roots, asymptotes) to place large documents or collections on a relatively equal footing with smaller ones. 3. Choosing Collections Using Topic Similarity 44</p><p>One major difference between the equations here and those in section 3.2 is that the CORI equa- tions were designed to work with a probabilistic retrieval system, rather than a vector space re- trieval system. CORI uses these TF and IDF values to compute a “belief” that a given query will be satisfied by the contents of a collection. Although the probabilistic model allows for complex combinations of queries, the belief calculations for a standard “bag of words” query are roughly</p><p> equivalent to the similarity calculations in the vector space model. In this case, CORI’s belief that a 5</p><p> query ¢ will be satisfied by a collection is:</p><p>4</p><p>798;:1< 7;89:1<</p><p>RE)GAIH ST ¦© 5= § ¦ </p><p>A H TF IDF</p><p>4 4</p><p> 5P /</p><p> p ¦¨¢ (3.11)</p><p>¢ LQ</p><p>It seems that the CORI equations are perfectly suited for our purpose of estimating the “simi- larity” between a query and a collection. However, we run into a small problem. The TFIDF metric was designed for distinguishing between documents with high and low similarity within a collec- tion. CORI, which is meant for calculating similarity between a query and a collection, therefore needs to distinguish a collection from a set of collections. Thus far, we have been considering collec- tions in isolation, and for the time being, it is necessary to continue looking at them in isolation. It is difficult to artificially generate a set of collections that would be similar to the set of collections used by a fielded system. We cannot accurately predict the sorts of collections that will exist on the</p><p>Web in the future, nor the collections that would be sought and discovered by a distributed search- ing system working on the behalf of a particular user. This is likely one of the reasons that CORI and other distributed information retrieval algorithms have thus far been used almost exclusively on simple test collections.</p><p>In the absence of other collections, weighting schemes like CORI that rely on statistics of the entire set of collections must be modified before use. While the IDF factor is very helpful in pulling 3. 
documents out of a particular set, it is not particularly useful when the background set is unknown.

The Nature of IDF

When dealing with individual documents, the IDF portion of the TFIDF calculation serves mostly to balance the effect of “noise words” in the query. It can be viewed as adjusting the weights of the query terms, giving higher weights to words that have better discriminatory power. For example, in a collection of documents on water sports, we may have a document on the topic of “water polo”, and a document on “water skiing”. The frequency of the terms might be:

Document A: water=100, skiing=50, polo=0
Document B: water=20, skiing=0, polo=20

If we only take TF into account, a query for “water polo” would rate document A higher, because the occurrence of the term “water” is much higher, even though document B is obviously more likely to contain references to the sport of water polo. The IDF factor identifies words like “polo” that tend to be descriptive of documents on a particular subject, and weights these words higher than words like “water” that will match most of the documents in the collection.

However, the IDF score is less important when comparing collections. Collections contain a variety of documents, not all of which will have exactly the same focus, and not all of which will use the same terminology to describe a particular topic. This makes collections less sensitive to minor changes in vocabulary usage than individual documents. The term frequencies of the two documents described above would be averaged in the collection's TF score, making the score more indicative of usage in an “average” document on that topic.

Similarity Estimation

Because the effect of IDF is small on the collection level, and because there is no good way to estimate the IDF values for a single collection in the absence of other collections, we will initially have to modify CORI. In the modified version, the IDF calculation is ignored. That is, IDF(t) is always set to 1. The TF score (equation 3.7) also has an adjustment for collections that have a larger or smaller number of terms than the average. Since this adjustment (cw(c)/avgcw) cannot be made in the absence of other collections, its value is also set to 1. With these adjustments, CORI's similarity calculation simplifies to:

    \mathrm{similarity}_{\mathrm{CORI}}(c, q) = \frac{1}{|q|} \sum_{t \in q} \left(0.4 + 0.6 \cdot \mathrm{TF}_{\mathrm{simp}}(t, c)\right)    (3.12)

    \mathrm{TF}_{\mathrm{simp}}(t, c) = \frac{\text{frequency of } t \text{ in collection } c}{\text{frequency of } t \text{ in collection } c + 200}    (3.13)
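
Read as code, the simplified calculation is very small. The sketch below is an illustrative transcription of equations 3.12 and 3.13, with the collection represented simply as a mapping from each term to its frequency in the collection (a hypothetical representation, chosen for clarity):

    def simplified_cori(query, collection_freq):
        """Simplified CORI score (equations 3.12-3.13): IDF and the cw/avgcw
        adjustment are both fixed at 1, so only within-collection term
        frequencies matter."""
        total = 0.0
        for term in query:
            freq = collection_freq.get(term, 0)
            total += 0.4 + 0.6 * freq / (freq + 200)
        return total / len(query)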

Since we are forced to modify the CORI formula, it is possible that another method would give better results. We will compare the CORI scores with two alternative similarity calculations.

The first alternative method is a centroid, which finds a term vector representing the “average” document in a collection. Instead of using the collection-wide TF measure, this calculation uses the document-based TF and IDF values, as indexed by Lucene (equations 3.2 and 3.3):

    \mathrm{similarity}_{\mathrm{cen}}(c, q) = \frac{\sum_{t \in q} \mathrm{TF}_{\mathrm{cen}}(t, c) \cdot \mathrm{IDF}_{\mathrm{cen}}(t, c) \cdot \mathrm{TF}(t, q)}{\mathrm{norm}_{\mathrm{cen}}(q) \cdot \sqrt{\sum_{t \in c} \left(\mathrm{TF}_{\mathrm{cen}}(t, c) \cdot \mathrm{IDF}_{\mathrm{cen}}(t, c)\right)^2}}    (3.14)

    \mathrm{TF}_{\mathrm{cen}}(t, c) = \text{average of } \mathrm{TF}(t, d) \text{ over all documents } d \text{ in collection } c    (3.15)

    \mathrm{IDF}_{\mathrm{cen}}(t, c) = \mathrm{IDF}(t) \text{ computed within collection } c    (3.16)

    \mathrm{norm}_{\mathrm{cen}}(q) = \sqrt{\sum_{t \in q} \mathrm{TF}(t, q)^2}    (3.17)

The other alternative is based on document frequency. That is, the score of a term t is the number of documents in the collection that contain t. This type of score is very crude, because it does not include any scaling to correct for large documents or large collections. A large document may contain many terms, perhaps matching every term in the query even though it is not a good match for the query. However, this measure has the advantage that a collection will not match the query based on the high score of a single document. Multiple documents in the collection must contain the term in order to generate a high similarity score.

    \mathrm{similarity}_{\mathrm{df}}(c, q) = \frac{\sum_{t \in q} \mathrm{df}(t, c) \cdot \mathrm{TF}(t, q)}{\mathrm{norm}_{\mathrm{df}}(q) \cdot \sqrt{\sum_{t \in c} \mathrm{df}(t, c)^2}}    (3.18)

    \mathrm{df}(t, c) = \text{number of documents in collection } c \text{ containing } t    (3.19)

    \mathrm{norm}_{\mathrm{df}}(q) = \sqrt{\sum_{t \in q} \mathrm{TF}(t, q)^2}    (3.20)

None of these approaches will provide the “optimal” score for each collection, but we are more interested in obtaining reasonable estimates to identify trends. In Chapter 7, when we are actually choosing collections from a specific set of collections, it will be possible to use the full CORI score, but we will see that this is not necessary.
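
The two alternatives can be sketched in the same style. The code below is a direct (and deliberately unoptimized) transcription of equations 3.14-3.20, with a collection represented as a list of tokenized documents; the exact normalization details are reconstructed from the formulas above rather than taken from the original implementation.

    import math

    def doc_freq(term, collection):
        # Equation 3.19: number of documents in the collection containing the term.
        return sum(1 for doc in collection if term in doc)

    def centroid_similarity(collection, query):
        # Equations 3.14-3.17: compare the query with the "average" document,
        # using document-level TF and IDF computed within the collection.
        n = len(collection)
        vocab = set(t for doc in collection for t in doc)
        def tf_doc(t, doc):
            return math.sqrt(doc.count(t))                           # equation 3.2
        def idf_in_collection(t):
            return math.log(n / (doc_freq(t, collection) + 1)) + 1   # equation 3.3
        weight = {t: (sum(tf_doc(t, d) for d in collection) / n) * idf_in_collection(t)
                  for t in vocab}
        coll_norm = math.sqrt(sum(w * w for w in weight.values()))
        q_tf = {t: math.sqrt(query.count(t)) for t in set(query)}
        q_norm = math.sqrt(sum(v * v for v in q_tf.values()))
        if coll_norm == 0 or q_norm == 0:
            return 0.0
        dot = sum(weight.get(t, 0.0) * q_tf[t] for t in q_tf)
        return dot / (coll_norm * q_norm)

    def df_similarity(collection, query):
        # Equations 3.18-3.20: score terms by the number of documents containing them.
        vocab = set(t for doc in collection for t in doc)
        dfs = {t: doc_freq(t, collection) for t in vocab}
        q_tf = {t: math.sqrt(query.count(t)) for t in set(query)}
        coll_norm = math.sqrt(sum(v * v for v in dfs.values()))
        q_norm = math.sqrt(sum(v * v for v in q_tf.values()))
        if coll_norm == 0 or q_norm == 0:
            return 0.0
        dot = sum(dfs.get(t, 0) * q_tf[t] for t in q_tf)
        return dot / (coll_norm * q_norm)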
Figure 3.8 combines the data</p><p> from Figures 3.6 and 3.7. There is a definite relationship between the similarity scores and the gL? precision scores ( efCAIH ). 3. Choosing Collections Using Topic Similarity 49</p><p>1.0</p><p>0.9</p><p>0.8</p><p>0.7</p><p>0.6 0 2 0.5 @ P 0.4</p><p>0.3</p><p>0.2</p><p>0.1</p><p>0.0 0.70 0.75 0.80 0.85 0.90 0.95 1.00 CORI similarity</p><p>Figure 3.8: Comparison of the similarity scores to measured collection performance.</p><p>3.6 The Robustness of Similarity Scores</p><p>It appears that CORI similarity is a good indicator of how well a collection will perform with respect to a given query. Unfortunately, this relationship is not guaranteed. It is possible that we might encounter a collection that looks like it matches the query, but does not contain any relevant documents. How well can similarity scores detect these types of collections?</p><p>The collections described to this point can be considered the “typical” case. These collections are representative of the type of collection created by an automatic system that crawls the Web looking for certain keywords (Chakrabarti, van den Berg, & Dom 1999; Qin, Zhou, & Chau 2004). They are also assumed to be representative of databases created from a single source of homogeneous documents, such as a company or a department within a university.</p><p>It is possible to manipulate the contents of a collection so that its contents (relative to the query) are changed greatly without having much affect on the similarity score. To illustrate this, two new sets of collections were generated. A set of collections representing the “worst possible case” was 3. Choosing Collections Using Topic Similarity 50</p><p>Best Typical Worst 1.0</p><p>0.9</p><p>0.8</p><p>0.7</p><p>0.6 0 2 0.5 @ P 0.4</p><p>0.3</p><p>0.2</p><p>0.1</p><p>0.0 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Similarity</p><p>Figure 3.9: Effect of changing similarity on topic 401. created by removing all relevant documents from the “typical” databases, and replacing them with non-relevant documents. For example, all documents marked relevant that appeared in the t401-t1 collection were replaced to create the t401-t1 worst case collection. A third set of collections, the</p><p>“best possible case” collections, was created by substituting documents relevant to the target query for non-relevant documents until all relevant documents were included in the collection.</p><p>Figure 3.9 compares the similarity scores and resultant precision scores for all 12 collections based on topic 401. The worst case scores are disappointing. While the similarity values have changed little from the typical case, the precision scores have all dropped to zero, because the collection no longer contains relevant documents. An automatic system based on comparing simi- larity scores would not be able to tell worst case collections from typical collections.</p><p>The best case scores are along the top of the graph, always performing as well as or better than the typical collections. (Note that one of the x’s on the graph is obscured by a square, as both the best case collection and typical collection have a precision of 0.3 in this case.) The best case scores 3. Choosing Collections Using Topic Similarity 51</p><p> actually fall off a bit as similarity increases. The usefulness of a “best possible” collection decreases as it becomes more similar to the query, due to the fact that when all documents in the collection are similar, it is more difficult to pick out the relevant documents. 

Figure 3.9 compares the similarity scores and resultant precision scores for all 12 collections based on topic 401. The worst case scores are disappointing. While the similarity values have changed little from the typical case, the precision scores have all dropped to zero, because the collection no longer contains relevant documents. An automatic system based on comparing similarity scores would not be able to tell worst case collections from typical collections.

Figure 3.9: Effect of changing similarity on topic 401.

The best case scores are along the top of the graph, always performing as well as or better than the typical collections. (Note that one of the x's on the graph is obscured by a square, as both the best case collection and typical collection have a precision of 0.3 in this case.) The best case scores actually fall off a bit as similarity increases. The usefulness of a “best possible” collection decreases as it becomes more similar to the query, due to the fact that when all documents in the collection are similar, it is more difficult to pick out the relevant documents. When the relevant documents are sufficiently different from the other documents in the collection (e.g., in the t401-t4 case at the far left) it is much easier to pick them out.

Even though the worst possible case performance is disappointing, it is an extremely unlikely case. Just as it is difficult for our similarity score to tell the difference between a collection with relevant documents and a collection without relevant documents, an automatic process for building collections would have a difficult time excluding relevant documents, even if it were to attempt this task. Nonetheless, we will continue to examine the best case and worst case performance as we investigate other possible areas of variation.

3.7 Varying Topic Difficulty

It would be foolish to assume that the results discussed previously will hold for any topic. “Easy” topics may be less sensitive to changes in the underlying collection, because a search algorithm will always be able to discriminate relevant from non-relevant documents. Likewise, “difficult” topics may produce poor results regardless of the collection being searched.

It is not possible to determine in advance whether a particular topic will be easy or difficult. The perceived difficulty of a query is highly dependent on the number of relevant documents in the collection, the match between language used in a query and language used in relevant documents, and the use of query terms in non-relevant documents. The TREC community briefly addressed this issue (Buckley & Walz 1999), but did not come to any meaningful conclusions. However, since the WT2g collection includes relevance scores, we have a handy measurement for the “difficulty” of a query: the performance of our search system when searching over the entire WT2g collection.

Selecting Target Topics

To investigate the effect of varying topic difficulty on the predictive power of the similarity calculation, three topics were chosen from the TREC-8 Web task. These topics were chosen based on the precision scores that resulted from running the queries through the base Lucene configuration, searching over the full WT2g set (see Section 3.3). The topics used were:

1. Easy topic (#441) - This topic requested information on the prevention and treatment of Lyme Disease. The system was able to identify relevant documents with relative ease due to the word “Lyme”, which rarely appears in non-relevant documents. However, since the desired result set was a subset of the information available on Lyme disease (prevention and treatment only), this topic proved to be a bit more difficult when using queries based on the topic's title only.

2. Medium difficulty topic (#401) - This topic, described previously, requested information on the integration of minority groups into Germany's culture. As with the easy topic, the title did not fully specify the intended documents, and title-only queries proved more difficult than queries derived from the description.

3. Difficult topic (#437) - This topic requested information regarding the effects of gas and electric deregulation.
It is difficult to pinpoint exactly what made this topic difficult, but it is likely that the individual terms in the query were common enough to make it difficult to single out documents relevant to this topic.

Figure 3.10 shows the precision-recall curves for the three topics. This graph is presented to provide a comparison with earlier precision-recall graphs, even though it is somewhat misleading. The three curves on this graph are not directly comparable, since the three topics had differing numbers of relevant documents. For example, the difficult topic had 14 relevant documents, so the search engine would only need to find 5 documents to reach the 30% recall level, while the medium topic, with 45 relevant documents, required finding 14 documents to reach the 30% level.

Figure 3.10: Lucene's precision-recall curve for each of the test topics using the centralized collection.

Figure 3.11 shows the results from Lucene according to the P@ measure, which gives a better comparison, because it mirrors the way documents are typically viewed by a user. Remember that our primary purpose is to evaluate the effectiveness of the retrieval system from a user's perspective, and users hardly ever look at more than the first 20 documents in a result set. Table 3.2 summarizes the three topics and the effectiveness with which Lucene was able to retrieve them from the full WT2g collection.

Figure 3.11: Lucene's “P@” performance on the test topics using the centralized collection.

Topic       TREC ID#    # Relevant    P@20    Average P
Easy        441         29            0.65    0.5055
Medium      401         45            0.40    0.2094
Difficult   437         14            0.30    0.1878

Table 3.2: Summary of the three target topics.

Results of Changing Topic Difficulty

To complement the initial 12 collections shown in Figure 3.9, an additional 24 collections were generated, consisting of typical, worst-case, and best-case collections for both the easy topic (441) and the difficult topic (437). These collections were then evaluated by comparing the CORI similarity scores with the P@20 precision values.

Figure 3.12 shows the scores for collections built from the easy topic. Comparison with the medium-difficulty topic (Figure 3.9) reveals that both the similarity scores and precision scores are higher for the easy topic. Figure 3.13 shows the same comparison for the difficult topic. In this case, the similarity scores are higher than for collections from both of the other topics. Even though some of the documents in these collections had high similarity, very few of them were actually relevant. The precision scores are the lowest so far, indicating the true difficulty of this topic.

Figure 3.12: Effect of changing similarity on topic 441 (easy).

Figure 3.13: Effect of changing similarity on topic 437 (difficult).

The precision scores, as expected, increased or decreased based on the difficulty of the topic. The lack of a clear trend in the similarity scores is initially surprising, but it makes sense, since the process of building a collection chooses the 4000 documents most similar to the seed query, regardless of each document's absolute similarity level.

The collections built from the easy topic and the difficult topic display a bit more variation in similarity scores between best-case, typical-case, and worst-case collections than collections built from the medium-difficulty topic. The data points for a given level of similarity (t1, t2, etc., as described in Section 3.4) do not always end up in vertical columns. However, the overall trend is the same as before. The best-case collections always perform at least as well as the typical collections, and often better. The worst-case collections always have a P@20 value of 0.

Figure 3.14 compares the precision scores obtained from all three topics. For clarity, the best-case and worst-case scores are not shown, leaving only the results of the 12 typical-case collections (four similarity levels for each topic). Looking at these graphs, it is immediately apparent that there is a relationship between the database's similarity to the target topic and its precision score. When the typical collections of all three topics are considered together, there is a positive correlation (r=0.64), even though the three topics are not directly comparable. When each of the three topics is considered on its own, the correlation is much stronger (easy r=0.92, medium r=0.88, difficult r=0.83).

Figure 3.14: Summary of changing similarity.
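
The r values above are correlation coefficients between the similarity scores and the P@20 scores of a set of collections. Assuming the conventional Pearson coefficient, a minimal computation looks like this (a generic sketch, not the original analysis script):

    import math

    def pearson_r(xs, ys):
        """Pearson correlation between paired similarity and precision scores."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    # Hypothetical usage: paired similarity and P@20 values for one topic's
    # four typical-case collections.
    # r = pearson_r(similarity_scores, p_at_20_scores)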

3.8 Conclusions

Now it is time to return to the questions posed at the beginning of this chapter:

Will a search engine with a topically-focused collection perform better than a search engine containing documents with a wide range of topics? We have limited data, but the answer to this is currently no. Table 3.3 summarizes the P@20 values for our target topics using the full WT2g collection and the t1 (most similar) collection created for each topic. Even though the t1 collections were created specifically for the topic in question, retrieval performance was only improved in one case. However, when half of the documents are removed from the centralized collection (WT2g-half), the topical collection outperforms the centralized collection. This is an important fact, because topical search engines will often contain many relevant documents that are not available to general-purpose search engines.

Collection    Topic 441    Topic 401    Topic 437
Full WT2g     .65          .40          .30
Topical       .65          .45          .25
WT2g-half     .65          .30          .05

Table 3.3: Summary of P@20 values for the topical and centralized collections.

Will a search engine perform better with a collection that contains many documents similar to the query, or is it better for the collection to be focused on a different topic? It is better for the collection to be similar to the query. There is a definite correlation between similarity scores and retrieval performance. Even though it is possible to artificially construct collections which are off-topic, yet perform well (the best-case t4 collections), this is nearly impossible to accomplish without prior knowledge of the documents that are relevant to the query.

Can an automatic system select topically-focused collections with enough accuracy to provide better overall results than a centralized collection? If the centralized collection contains a superset of the documents in the topically-focused collection, this will be very difficult, because even when a collection is specifically built for the query in question, the centralized collection often performs better than the topically-focused collection. To perform better, a topically-focused collection must not only be centered on the correct topic, it must be highly focused on that topic. Even with topic 401, only the t1 collection (most on-topic) outperformed the full WT2g collection. As noted above, though, topically-focused collections can be a good source of relevant documents that are not available through centralized collections, which could make an automatic selection system more useful.

One possible explanation for the relatively poor performance of the topical collections is their small size. The full WT2g collection is much larger than the topical collections. It is possible that the topical collections did not include enough relevant documents to allow higher performance. When relevant documents were explicitly added (the best-case collections), retrieval performance always increased. In the next chapter, we will manipulate the size of collections, investigating whether the presence of a larger selection of documents (both relevant and non-relevant) improves performance.

4 The Effects of Collection Size

The previous chapter suggests that choosing collections based on topic alone will not provide a significant improvement in retrieval performance. A large general-purpose collection almost always outperformed a smaller collection that was focused on the target topic.

It is possible that the negative results of the previous chapter were solely due to the small size of the focused collections. Even though 4,000 documents is typical of the collections used by small search engines, it may not be the optimal size for distributed retrieval. The WT2g collection that was used to represent a centralized collection contains “only” a quarter of a million documents. This collection could be considered “small” when compared to the billions of documents indexed by large general-purpose search engines on the Web today.

A moderately-sized collection may provide a “happy medium” between the small collections used in the previous chapter and the full WT2g collection. The larger a collection is, the more likely it is to contain relevant documents, but as a collection grows very large, it is also more likely to contain “spurious matches”, documents that match the query by chance combinations of words rather than actual topical similarity.

In this chapter, we will investigate how variations in collection size affect retrieval performance.
4.1 Previous Work on Collection Size

There has been very little work on the effect that collection size has on collection selection. Most notably, French and Powell (1999; 2000) found that distributed information retrieval (DIR) algorithms tend to choose larger collections more often than smaller collections. Even so, selection algorithms that are based on size alone do not perform as well as DIR algorithms that take other factors (such as topic) into account.

Collection size is often a factor when calculating similarity between a query and a collection. Algorithms like CORI include factors that are affected by collection size. For example, the raw term frequency score "frequency of t in collection C" will be higher if there are more documents in the collection. Algorithms like this typically include compensating factors that dampen the effects of larger collections (see equation 3.7). Frequently, the size of a collection is employed to adjust the weights of individual terms rather than to modify the similarity score as a whole.

4.2 The Effect of Size on Essential Properties

In the previous chapter, it was argued that IDF scores have less impact when dealing with collections than they do when dealing with individual documents. When the size of a collection is increased, IDF scores become even less important. To illustrate this, consider a set of trivial collections, where each collection holds a single document. Using CORI's IDF formula (equation 3.8, repeated below), these collections will have IDF scores very similar to the IDF scores of individual documents in a large collection.

\mathrm{IDF}_{cori}(t) = \frac{\log\left((C + 0.5) / c_t\right)}{\log(C + 1.0)}    (4.1)

where C is the total number of collections and c_t is the number of collections containing term t.

Now, consider a set of collections where each collection holds two documents. The total number of collections remains the same, so the only change in equation 3.8 is that the number of collections containing t increases for certain terms, slightly lowering the IDF values for these terms. As the number of documents per collection increases, the number of terms included in each collection will increase. At the extreme, each collection in the set will hold every document in the world. This type of collection would contain every term in the dictionary, making all of the IDF values identical. As collection size increases, it is increasingly likely that the collection will contain every term somewhere, making the IDF score even less relevant than in smaller collections.

As collections become larger, they naturally become more general. For any given topic, there are a limited number of documents in the world focused on the topic. If the process that creates a collection is very restrictive, it will eventually run out of documents to include in the collection, because it can no longer find novel documents that meet its requirements. On the other hand, if the process that creates the collection is more lax in its selection, it will continually accept documents that are only tangentially (or worse yet, superficially) related to the collection's target topic. As more documents are gathered that are not strictly focused on the target topic, the overall contents of the collection become more general.
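As a concrete illustration of equation 4.1, the sketch below computes the CORI-style IDF of a term over a toy set of collections. The collections and terms are made up; in practice the formula would be applied to the full set of candidate search engines.

```python
# Sketch of equation 4.1: the CORI-style IDF of a term over a set of collections.
import math

def cori_idf(term, collections):
    """collections: list of sets of terms (one set per collection)."""
    c = len(collections)                                   # total number of collections
    cf = sum(1 for coll in collections if term in coll)    # collections containing the term
    return math.log((c + 0.5) / cf) / math.log(c + 1.0)

# Toy example: as more collections contain a term, its IDF shrinks toward zero.
collections = [{"germany", "minority"}, {"germany", "coal"}, {"lyme", "disease"}]
print(cori_idf("lyme", collections))     # rare across collections -> higher IDF
print(cori_idf("germany", collections))  # common across collections -> lower IDF
```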
4.3 Performance of Smaller Collections

The collections used by search engines on the Web can range in size from just a few documents to billions of documents (e.g., Google). DIR systems typically contain several thousand documents in each collection (e.g., Yuwono & Lee 1997; Xu & Croft 1999; Powell et al. 2000). The collections created in Chapter 3 were based on the typical size of 4000 documents. To investigate the effect of collection size on similarity scores and retrieval performance, we will compare the typical-size collections to collections both an order of magnitude larger and smaller. That is, we will compare collections containing 400, 4000, and 40,000 documents.

We will use the same TREC topics described in Chapter 3. For each of these three topics, we have been using 12 collections with varying topical focus and number of relevant documents. Now, we will add a parallel set of "small size" collections (each containing 400 documents). There will be one "small size" collection for every collection that exists so far. That is, 3 topics × 4 similarity levels × 3 relevance levels = 36 new collections.

As with the typical-size collections, the smaller collections are given names based on the properties used to create them. The only difference in the names is that the penultimate character is replaced. Instead of the letter t (for "typical"), it will be the letter s (for "small"). The small collections based on topic 401 are:

• t401-s1 = documents retrieved using topic 401's description as a query
• t401-s2 = documents retrieved by modifying the topic description slightly
• t401-s3 = documents retrieved by modifying the topic description greatly
• t401-s4 = documents retrieved using the description of a completely different topic

Figure 4.1: Performance of small databases on the medium-difficulty topic.

One difference that is immediately obvious upon creation of the small collections is that they do not contain as many relevant documents as the typical-size collections. For example, the t401-s1 collection contains 32 documents considered relevant to topic 401, while the typical-size collection t401-t1 contains 42 relevant documents. This discrepancy may cause degraded retrieval performance.

Figure 4.1 shows the performance of the small collections based on the medium-difficulty topic (topic 401). When compared with Figure 3.9, many of the same trends are evident. For the typical-case collections (denoted by squares), similarity score is closely correlated with performance.

As with the typical-size collections, only the most on-topic of the naturally generated collections (t401-s1) has a performance score higher than the full WT2g collection. The t401-s1 collection has a P@20 score of 0.45. The full WT2g collection scored 0.40 on this topic, which is the same as the t401-s2 collection.
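All of the size-varied collections are assembled in the same way: the documents in the base set are ranked against a seed query, and the N most similar documents are kept (400, 4000, or 40,000). The sketch below illustrates that construction under stated assumptions; the cosine ranking is a stand-in for the Lucene similarity scores used in the actual experiments, and the corpus is made up.

```python
# Sketch: building a topical collection of a given size by keeping the N documents
# most similar to a seed query. The cosine ranking below is a stand-in for the
# Lucene similarity scores used in the actual experiments.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_collection(seed_query: str, documents: dict, size: int) -> list:
    """Return the ids of the `size` documents most similar to the seed query."""
    q = Counter(seed_query.lower().split())
    ranked = sorted(documents,
                    key=lambda doc_id: cosine(q, Counter(documents[doc_id].lower().split())),
                    reverse=True)
    return ranked[:size]

# Hypothetical corpus; 400, 4000, or 40,000 would be passed as `size` in practice.
docs = {"d1": "integration of foreign minorities in germany",
        "d2": "clean combustion of coal",
        "d3": "language and cultural differences in germany"}
print(build_collection("foreign minorities germany", docs, size=2))
```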
Figure 4.2: Performance of small databases on the easy topic.

The small best-case collections display the same trends as the typical-size best-case collections. The similarity score of each new collection is near that of the typical-case collection it is based on, but the best-case collection always has an equal or greater performance score. These performance scores increase as the collection's similarity to the topic decreases (again, likely due to the fact that relevant documents are easier to pick out from a collection of mostly off-topic documents). One difference is that the best-case scores for the small collections increase in performance even more rapidly than the best-case scores for the typical-size collections. It seems that the smaller size of the collections enhances this trend. The worst-case collections, as before, display similarity scores in line with the typical-case collections and have P@20 scores of zero.

Figures 4.2 and 4.3 display the performance of small-size collections on the easy (441) and difficult (437) topics, respectively. As with the medium-difficulty topic, the basic trends observed in the previous chapter remain.

Figure 4.3: Performance of small databases on the difficult topic.

Figure 4.4: Performance of small databases on the three test topics.

When we compare the typical-case collections from each of the three topics (Figure 4.4), we see that the easy topic produces the highest performance scores. Collections based on the difficult topic have the lowest performance scores, as expected.

As we observed with the typical-size collections, the difficult topic produces collections with higher similarity scores than the other topics. While it is possible that there is something about difficult topics that tends to create this type of relationship, we expect that this behavior is simply due to the particular combination of topic (437) and base collection (WT2g) being used.

We have discovered that making collections smaller does not make any major changes in the trends we observed in the last chapter. Will this hold as we increase the size of collections?

4.4 Performance of Larger Collections

Now we turn our attention to "large" collections. These collections are an order of magnitude larger than the collections in Chapter 3, with each collection containing 40,000 documents. As with the small collections, there is one new collection for every existing typical collection at the typical-case relevance level. There will be a total of 3 topics × 4 similarity levels = 12 new collections created at the large size.

We will not examine best-case and worst-case large collections. Our investigations with the best-case and worst-case collections thus far have produced very consistent results.
Since we do not expect the large collections to change our conclusions regarding the best and worst cases, and since the large collections can take considerable time to generate, it was deemed unnecessary to continue this line of investigation.

As with the small and typical-size collections, the large collections are named based on their target topic, size (l for "large"), and similarity level. For example, the collection based on topic 441, large size, and using a slightly off-topic query is denoted t441-l2.

Figure 4.5: Performance of large databases on the three test topics.

Large-size collections could not always be generated with the full 40,000 documents, because there were not always 40,000 documents in the WT2g set matching the target query (for actual collection sizes, see Appendix B). This is representative of the way real collections are built on the Web. If an automatic system (like a "focused crawler") were building a search engine with a particular topic in mind, it would need to stop when it ran out of documents that matched its topic.

It would be possible to simply add random documents to guarantee that each large collection contained 40,000 documents, but this would not affect the performance results. In the case of the collections that are specifically built for the query being evaluated (t401-l1, t441-l1, and t437-l1), none of the extra documents would match the query, and therefore would not appear in a result set. The documents retrieved from the collection would remain the same. In the off-topic collections (similarity levels 2 through 4), documents that were randomly added to the collection may "accidentally" match the target topic, but this would be unlikely. For search engines on the Web, accidental matches would be extremely unlikely, due to the sheer number of off-topic documents available.

The results of evaluating the 12 large collections are shown in Figure 4.5. As usual, the easy topic (441) produces the best performance scores. The trend of the difficult topic generating the highest similarity scores continues as well.

Surprisingly, the medium-difficulty topic and the difficult topic have performance scores very near each other. Another surprising result that is difficult to see on the graph is the progression of performance scores for the difficult topic, detailed in Table 4.1. The most on-topic collection has the lowest performance score, while the off-topic collections produce higher scores. The cause of this discrepancy is unclear.

Collection   Similarity   P@20
t437-l1      0.9905       0.25
t437-l2      0.9892       0.30
t437-l3      0.9821       0.35
t437-l4      0.9789       0.35

Table 4.1: Similarity and performance scores for the large t437 collections.

Each of the three topics produces a cluster of dots in Figure 4.5. As noted earlier, increasing the size of a collection increases its generality. Since each of these collections is based on gathering documents surrounding a central topic, increasing the size of the collections tends to increase the amount of overlap between them. The greater overlap tends to drown out the differences in the queries that were used to generate the collections.
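One illustrative way to quantify this kind of overlap is the Jaccard coefficient between the document sets of two collections. This measure is not used in the experiments; the document ids below are made up purely to show the idea that collections built from different seed queries share more documents as they grow.

```python
# Sketch: measuring the document overlap between two collections built from
# different seed queries. The document-id sets here are made up for illustration.
def jaccard_overlap(collection_a: set, collection_b: set) -> float:
    """|A intersect B| / |A union B|: 0 when disjoint, 1 when identical."""
    if not collection_a and not collection_b:
        return 0.0
    return len(collection_a & collection_b) / len(collection_a | collection_b)

small_a = {"d1", "d2", "d3"}
small_b = {"d4", "d5", "d6"}
large_a = small_a | {"d7", "d8", "d9", "d10"}
large_b = small_b | {"d7", "d8", "d9", "d11"}

print(jaccard_overlap(small_a, small_b))  # small, focused collections: no overlap
print(jaccard_overlap(large_a, large_b))  # larger collections: overlap grows
```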
4.5 Trends as a Factor of Size

Now that we have examined the performance of all three collection sizes separately, we can look at the data from another point of view, holding topic constant and comparing only variations in collection size.

Figure 4.6: Summary of database performance on the medium-difficulty topic.

The results of the three collection sizes with respect to topic 401 are compared in Figure 4.6. This graph indicates that collection size can have an effect on similarity scores, while only moderately affecting performance. Although the effect is not nearly as large, the same trend can be seen in comparisons of topics 441 and 437 (Figures 4.7 and 4.8).

Figure 4.7: Summary of database performance on the easy topic.

Figure 4.8: Summary of database performance on the difficult topic.

Now we see a pattern emerging. The overall theory looks like Figure 4.9. At small collection sizes, there is a great difference between on-topic collections and off-topic collections. As collection size increases (and the collection becomes more general), this variability decreases. Both similarity and performance scores tend to cluster. The larger a collection is, the more likely it is to produce good results, even if it is somewhat off topic.

Figure 4.9: Summary of the effects of changing size on database performance.

At the extreme, the largest possible collection for a given topic will contain all of the documents in the full WT2g set that match any of the topic's keywords. Searching one of these collections is equivalent to searching the centralized collection. The centralized collection can be thought of as the largest possible collection on the target topic, although it also happens to contain thousands of off-topic documents that do not affect the query (because they do not contain any of the query's keywords). As we saw in the previous chapter, a centralized collection produces good results, even though it is not focused on any particular topic.

One reason for the more robust performance of large collections is the raw number of relevant documents in these collections. As mentioned previously, larger collections will naturally contain more relevant documents than smaller collections. Regardless of the retrieval algorithm used, a small collection will not be able to retrieve documents it does not contain, resulting in lowered performance scores. Figure 4.10 compares the number of relevant documents contained in each of the 12 collections based on topic 401.

These results reinforce the conclusions from the previous chapter.
In general, better performance can be obtained from centralized collections than from topically-focused collections. It is important to remember, though, that centralized collections do not always contain every document that might be of interest to a user. There are many topically-focused search systems on the Web that contain important documents but do not allow their contents to be indexed by centralized search engines (e.g., a search engine for bugs in a particular software package).

Figure 4.10: Number of relevant documents in collections based on topic 401.

It seems that a successful distributed retrieval system could be built by trading off size and topical focus, selecting only collections that would fall in the upper half of the graph in Figure 4.9. The distributed system could take results from large collections without worrying about topical focus. When the system encountered smaller collections, it would only accept results from these collections if they were highly focused on the topic of the current query.
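A minimal sketch of that selection rule follows. The size and similarity thresholds are assumptions chosen for illustration, not values derived from these experiments.

```python
# Sketch of the selection heuristic described above: accept large collections
# regardless of focus, and accept small collections only when they are highly
# focused on the query's topic. The thresholds below are assumptions.
def accept_collection(num_documents, query_similarity,
                      large_size=40_000, similarity_threshold=0.95):
    """Decide whether a DIR system should request results from this collection."""
    if num_documents >= large_size:
        return True                      # large, general collections tend to do well
    return query_similarity >= similarity_threshold

print(accept_collection(num_documents=250_000, query_similarity=0.70))  # True
print(accept_collection(num_documents=4_000, query_similarity=0.97))    # True
print(accept_collection(num_documents=4_000, query_similarity=0.80))    # False
```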
But absolute collection size does not guarantee good results. The results presented in this chapter rely on the fact that collections become more general as they grow in size. While this is true, the amount of added generality depends on the algorithm used to generate the collection as well as the number of relevant documents available. It is certainly possible to create "large" collections that do not contain a general set of documents. The WT2g collection we have been using to represent the available universe contains roughly a quarter of a million documents. A topic-specific site on the Web (like espn.com, money.cnn.com, or support.microsoft.com) can contain many more documents than this and still not contain documents on a wide variety of topics. It is unlikely that such a site would contain a single document relevant to German minority integration, topic 401.

In addition to the size and central topic of a collection, we will also have to account for the amount of focus or generality in a collection. This factor will be addressed in the next chapter.

5 The Effects of Collection Focus

So far, we have looked at the predictive power of the central topic of a collection and the size of a collection, but we have not directly investigated the amount of focus that a collection has. A collection may have a central topic that is in line with the user's need, but the contents of the collection may be spread in many directions, with no documents that address the need directly.

For our topic- and size-based investigations, it would have been preferable to hold the focus of the collections constant, selecting documents within a given distance from the collection's target topic. However, these three factors are not independent. Often, a change in one factor will force a change in another.

As explained in Section 4.2, in practice there is a relationship between collection focus and collection size. Adding documents to a collection will typically decrease the collection's focus. Restricting a collection to contain documents within a certain distance from the central topic would have limited the number of documents available, restricting the maximum size of the collection. If a collection is built on a "popular" topic that is covered by millions of documents, the collection can be very focused, regardless of the size. But if the central topic of the collection is not addressed by many documents on the Web, even a small collection will fail to be highly focused on the topic.

In the preceding experiments, the "farther away" databases (txxx-x2 through txxx-x4) were created by manually tweaking the query to use alternative terms that were synonyms of the original terms and/or terms that were more general than the original terms. In addition to moving the topic away from the original (txxx-x1) information need, this type of manipulation could broaden the focus of the collection, which may have an effect on retrieval performance.

In this chapter, we will look at the effects that changes in the focus of a collection have on retrieval performance. We will compare several methods for measuring a collection's focus, including measures based on document similarity and measures based on entropy. We will create a few new collections to aid in the comparison of these measures, but most of the evaluations will be based on the collections described in the preceding chapters.

5.1 Previous Work on Collection Focus

Researchers in distributed information retrieval (DIR) have rarely investigated the effects of a collection's focus. One possible explanation for this is that (to my knowledge) there has been no previous work in which the properties of collections were explicitly manipulated. This means the researchers had very little control over collection focus, and were forced to rely on topical similarity measures to account for any differences in focus.

As part of the development of the vector-space model, Salton (1975) showed that the most effective retrieval is obtained when documents within a collection are spread as widely as possible, reducing the likelihood of false positives. That is, a collection with a broad focus will produce better results than a highly-focused collection, because the relevant documents will be easier to distinguish from the non-relevant documents. We have already confirmed this with our investigations of the "best possible" collections (Section 3.6). In those collections, which were guaranteed to contain all documents relevant to the target query, performance increased as the central focus of the collection was moved more off topic. That is, as more off-topic documents were added to the collection, the collection became less focused, making it easier to distinguish between the relevant and non-relevant documents.

While not often used to evaluate the focus of a collection, entropy measures have been used for many purposes in information retrieval. One of the most popular uses is Kullback-Leibler divergence, also called "relative entropy" (Kullback, Keegel, & Kullback 1987). This is a similarity measure that is sometimes used to calculate query-collection similarity in DIR systems. It often works as well as the CORI algorithm described in Section 3.5 (better under certain circumstances), but it is not as robust (Larkey, Connell, & Callan 2000).
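For reference, the sketch below shows relative entropy used as a query-collection similarity score, with lower divergence suggesting a better-matching collection. The term distributions and the smoothing constant are illustrative assumptions; this is not the CORI-based similarity used elsewhere in this work.

```python
# Sketch: Kullback-Leibler divergence ("relative entropy") between a query's term
# distribution and a collection's term distribution. Lower divergence suggests a
# better-matching collection. The distributions and smoothing are illustrative.
import math

def kl_divergence(p: dict, q: dict, epsilon: float = 1e-6) -> float:
    """D(P || Q) = sum_t P(t) * log(P(t) / Q(t)), with simple epsilon smoothing."""
    terms = set(p) | set(q)
    return sum(p.get(t, 0.0) * math.log((p.get(t, 0.0) + epsilon) /
                                        (q.get(t, 0.0) + epsilon))
               for t in terms if p.get(t, 0.0) > 0.0)

query_model = {"germany": 0.5, "minorities": 0.5}
on_topic    = {"germany": 0.3, "minorities": 0.2, "integration": 0.5}
off_topic   = {"coal": 0.6, "combustion": 0.4}

print(kl_divergence(query_model, on_topic))   # smaller divergence
print(kl_divergence(query_model, off_topic))  # much larger divergence
```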
Another similarity measure, cue-validity variance (CVV) (Yuwono & Lee 1997), calculates the entropy of a single term with respect to a set of collections. Terms with higher entropy are given higher weights, under the assumption that these terms can more effectively distinguish between collections.

A measure called "clarity", which is a form of relative entropy, has been used to measure the focus of result sets from searches (Cronen-Townsend, Zhou, & Croft 2002). Highly focused result sets are deemed more likely to contain relevant items. Clarity correlates well (r between 0.368 and 0.577) with actual query performance, although this correlation is not as high as the correlations between 0.64 and 0.92 found in Section 3.7. The primary disadvantage of the clarity measure is that the query must actually be executed before the measure can be computed. This expensive operation is exactly what we are trying to avoid by finding measures that predict the performance of a query with a given collection.

5.2 Measuring Collection Focus

We will evaluate several different measures of collection focus to determine whether any of them have the ability to improve predictions of collection performance. These measures fall into two classes: measures based on entropy and measures based on similarity.

Entropy Measures

One way to measure the focus of a collection is to use a measure of entropy. Entropy literally means "randomness" or "lack of organization", so a high entropy score indicates that a collection lacks focus. Low entropy scores indicate a highly-focused collection. Claude Shannon (1948) defined the entropy of a random event X as

H(X) = -\sum_{x} p(x) \log p(x)    (5.1)

p(x) = probability of X being in state x    (5.2)

Essentially, if p(x) is low, the contribution of state x to the entropy will be low. If -\log p(x) is low (meaning p(x) is close to 1), the contribution of x will also be low. Medium values of p(x) result in higher entropy. That is, states that occur very frequently or very infrequently contribute little to an entropy score, while states that are more "random" make the entropy score higher. To treat a collection of documents as a random event, we will define a very simplistic probability measure, indicating the degree to which each term is distributed across the documents of the collection. For a collection C and a dictionary of terms D:

term-entropy(C) = -\sum_{t \in D} p(t, C) \log p(t, C)    (5.3)

p(t, C) = percentage of documents in C containing t    (5.4)

Unfortunately, term-entropy is not a "true" entropy measure. Although p(t, C) expresses the probability that a random document selected from the collection will contain term t, any given document will contain many terms, and the "states" in our random event will overlap. In a "true" entropy measure, the sum of all probabilities must be 1. We need something more like "the percentage of term-document occurrences that term t is responsible for":

term-entropy2(C) = -\sum_{t \in D} p_2(t, C) \log p_2(t, C)    (5.5)

p_2(t, C) = \frac{\text{number of documents in } C \text{ containing } t}{\sum_{t' \in D} \text{number of documents in } C \text{ containing } t'}    (5.6)

Terms that appear in no documents or that appear in all documents will have low scores. Terms that appear in some documents (distinguishing terms) will have high scores, contributing to a higher term-entropy score. The more terms that are shared among all documents, the lower the overall entropy score.
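The following is a small sketch of the two measures defined in equations 5.3 through 5.6, treating each document as a set of terms. The toy collection is made up; the actual experiments compute these scores over collections of 4000 or more documents.

```python
# Sketch of the term-entropy and term-entropy2 measures (equations 5.3-5.6).
# Each document is represented as a set of terms; the toy collection is made up.
import math

def term_entropy(docs):
    n = len(docs)
    vocab = set().union(*docs)
    total = 0.0
    for t in vocab:
        p = sum(1 for d in docs if t in d) / n   # fraction of documents containing t
        if 0.0 < p < 1.0:
            total -= p * math.log(p)
        # p == 1 contributes 0; p == 0 never occurs for terms in the vocabulary
    return total

def term_entropy2(docs):
    vocab = set().union(*docs)
    counts = {t: sum(1 for d in docs if t in d) for t in vocab}
    grand_total = sum(counts.values())           # all term-document occurrences
    total = 0.0
    for t in vocab:
        p2 = counts[t] / grand_total
        total -= p2 * math.log(p2)
    return total

docs = [{"germany", "minorities", "integration"},
        {"germany", "language", "culture"},
        {"germany", "immigration", "integration"}]
print(term_entropy(docs), term_entropy2(docs))
```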
One possible problem with a score like this is that there may not be enough overlap between terms across the documents. In a very unfocused collection, it is possible that many terms will appear in only a single document, causing the term-entropy score to be low (when we would expect an unfocused collection to have a high "entropy" score). In a somewhat focused collection, many terms may have high entropy scores, because they appear in only a subset of the documents. If this occurred, moderately-focused collections would have higher entropy scores than unfocused collections, and the entropy measures would not be useful in determining a collection's focus.

To test the reasonableness of the two entropy measures, several new collections were created, all with the "typical" size of 4000 documents. They are listed below, in an order that is expected to reflect decreasing collection focus:

• all440 = A collection in which all documents in the collection are copies of the same document, WT27-B27-440, which describes classroom technology at various colleges. This is the "most focused" collection possible.

• all440-435 = A collection composed of 2000 copies each of two documents, WT27-B27-440 and WT27-B27-435. This collection is highly focused, as both documents describe classroom technology at different institutions.

• all440-400 = A collection composed of 2000 copies each of two documents, WT27-B27-440 and WT27-B27-400. This collection has "mixed focus", as the two documents describe widely different topics and have few terms in common. Document WT27-B27-400 describes areas of research that fall under the heading "clean combustion of coal".

• all10 = A collection composed of 400 copies each of 10 documents randomly selected from the WT2g set. In one sense this collection is very focused, because the documents cluster into 10 tightly focused sets. In another sense, this collection is very unfocused, because there is little or no relationship between the document sets.

• germany = A collection built from the results of the query "germany", a moderately specific word that happens to appear in topic 401. This collection is expected to be relatively focused, but not as tightly focused as the collection built from topic 401.

• water = A collection built from the results of the query "water", one of the most common words in the English language. This collection is expected to be relatively unfocused, but it is expected to have more focus than the full WT2g collection.

The term-entropy and term-entropy2 measures were applied to each of these collections, along with the topically focused collection t401-t1 and the full WT2g collection. In Table 5.1, the collections are ordered by decreasing expected focus.

Collection    term-entropy   term-entropy2
all440        0              8.0637
all440-435    1076           8.0633
all440-400    1107           8.0743
all10         2390           8.9497
t401-t1       3005           9.8518
germany       1311           9.7498
water         930            9.4238
WT2g          1167           10.0756

Table 5.1: Entropy-based scores.

The scores for the original term-entropy measure are confusing.
It is perfectly reasonable for the all440 collection to receive an entropy of 0, since every document contains exactly the same terms. But the score for all10, a collection created from only 10 documents, is higher than the score for the full WT2g collection. And the highest score goes to t401-t1, a collection that we expect to be relatively focused. The term-entropy2 score is much more promising. These scores generally correspond with the expected ordering, increasing from the top of the table to the bottom of the table. The main aberration is that t401-t1, germany, and water received scores that are the reverse of the expected order. Neither of the proposed entropy measures is an obviously good indicator of a collection's focus, so we will turn our attention to measures based on document similarity.

Similarity-Based Measures

Another way to compute a collection's focus is to see how "close" the collection's documents are to the collection's center. We will compare two different measures. The first measure, called average-similarity, simply averages the distances from the collection's center to each document in the collection:

1. For each document in the collection, generate a list of terms in the document, and use this list as a query.

2. Calculate the similarity between each query and the collection, using the "modified" CORI formulas (from Section 3.5, reproduced below):

\mathrm{similarity}_{cori}(C, Q) = \frac{\sum_{t \in Q} \left( 0.4 + 0.6 \cdot \mathrm{TF}_{cori}(t, C) \right)}{|Q|}    (5.7)

\mathrm{TF}_{cori}(t, C) = \frac{\text{frequency of } t \text{ in collection } C}{\text{frequency of } t \text{ in collection } C + 100}    (5.8)

3. Average the resultant similarity scores.

Every document in the collection contributes to the final score, giving a result that reflects the focus of the collection as a whole. A disadvantage of this method is that if the collection has a tightly focused central cluster of documents, but also contains a large number of documents that are not related to the central cluster, the score will be lowered. It may be better to only have documents in the central cluster contribute to the score.

The second method, called retrieved-focus, is based on the similarity scores that Lucene reports as documents are retrieved from a single, central query. This method ensures that only documents related to the collection's central focus are included in the score. It is calculated by the following algorithm:

1. Generate a term vector representative of the collection's center. This vector consists of the 100 highest-weighted terms in the collection, using the modified CORI TF score (equation 3.13, reproduced below) to calculate weights:

\mathrm{TF}_{cori}(t, C) = \frac{\text{frequency of } t \text{ in collection } C}{\text{frequency of } t \text{ in collection } C + 100}    (5.9)

2. Weighting all 100 terms equally, submit the term vector as a query to the collection.

3. For each document returned, obtain the similarity score reported by Lucene (see equation 3.1).

4. Average the resultant similarity scores.
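Below is a sketch of the first measure, average-similarity, following steps 1 through 3 above and the modified CORI formulas of equations 5.7 and 5.8. The tokenized documents are made up, and the retrieved-focus measure is omitted here because it depends on the similarity scores reported by Lucene.

```python
# Sketch of the average-similarity focus measure (steps 1-3 above), using the
# modified CORI TF of equation 5.8. Documents and tokenization are illustrative.
from collections import Counter

def modified_cori_tf(term, collection_counts):
    f = collection_counts[term]          # frequency of the term in the collection
    return f / (f + 100)

def modified_cori_similarity(query_terms, collection_counts):
    beliefs = [0.4 + 0.6 * modified_cori_tf(t, collection_counts) for t in query_terms]
    return sum(beliefs) / len(beliefs)

def average_similarity(documents):
    """Average query-collection similarity, using each document's terms as a query."""
    collection_counts = Counter(t for doc in documents for t in doc)
    scores = [modified_cori_similarity(sorted(set(doc)), collection_counts)
              for doc in documents]
    return sum(scores) / len(scores)

docs = [["germany", "minorities", "integration", "germany"],
        ["germany", "language", "culture"],
        ["coal", "combustion", "clean"]]
print(average_similarity(docs))
```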
Table 5.2 compares the similarity-based measures over the same collections that were used to compare the entropy-based measures. Collections are listed in expected order of decreasing focus. Both of the similarity-based measures appear to be good indicators of a collection's focus. The primary differences are the scores for the full WT2g collection, and the fact that the retrieved-focus scores drop off more drastically as collection focus is decreased. These differences make the retrieved-focus measure more useful than the average-similarity measure.

Collection    average-similarity   retrieved-focus
all440        0.9089               0.6048
all440-435    0.8454               0.4699
all440-400    0.8789               0.5013
all10         0.8034               0.2149
t401-t1       0.7297               0.2059
germany       0.6289               0.0836
water         0.6182               0.0878
WT2g          0.7793               0.0657

Table 5.2: Similarity-based scores.

Figure 5.1: Comparison of the collection focus measures.

For better comparison, the three most promising measures are displayed together in Figure 5.1. Normally, for entropy-based measures, lower scores indicate higher focus. To provide a usable comparison, the term-entropy2 measure has been reversed (by negating the values) and scaled so that its highest value (all440) is 1 and its lowest value (WT2g) is 0. As noted previously, the term-entropy2 scores do not always have the expected values, particularly among the "moderately focused" collections t401-t1, germany, and water.

5.3 Collection Focus as a Predictor of Performance

Now we will examine the relationship between collection focus and collection performance. Although qualitative assessment of Figure 5.1 indicated that the retrieved-focus measure would be the most useful, there was essentially no correlation between the retrieved-focus values and the performance scores of the 36 topical collections used in the experiments of the previous chapters. Moderate correlations were obtained with both the average-similarity and term-entropy2 measures. Figure 5.2 shows the performance of the 36 topical collections as a function of their term-entropy2 score.

Figure 5.2: The relationship between collection focus and collection performance.

Figures 5.3 through 5.5 display the same comparison, separating out the collections based on their size and topic difficulty. For a complete listing of the focus scores for these collections, see Appendix C. It is clear that focus decreases (term-entropy2 increases) as the collection becomes larger. This is not surprising, because these collections were built around a target topic, and the smaller collections only contain those documents that have the highest similarity to the target topic. It is unlikely that this result would hold for collections that were not topically focused.

5.4 Combining the Topic, Size, and Focus Measures

It is now possible to combine the results of the topic, size, and focus information to predict the performance of a collection. A multiple linear regression analysis was performed on the data for the 36 collections. The results are shown in Table 5.3.
Figure 5.3: The relationship between focus and performance for the small collections.

Figure 5.4: The relationship between focus and performance for the typical-sized collections.

Figure 5.5: The relationship between focus and performance for the large collections.

              Topic (CORI)   Size        Term-entropy2   Constant
Coefficient   0.8967         -0.000003   0.1509          -1.8799
Std. error    0.3356         0.000003    0.1762          1.6296

Table 5.3: Results of multiple linear regression on candidate features.

It is clear that topic has the greatest influence on the ability of a collection to perform well. The small size of the coefficient for term-entropy2 means that it affects the predicted performance, but not greatly. Although the raw entropy values are typically larger than the topical similarity values, in practice they do not vary much, lessening the effect of their larger values. A surprising result, though, is that collection size matters very little. The extremely small coefficient shown is partially a result of the large collections having a size of 40,000, which is much greater than the topic similarity scores (ranging from 0 to 1). However, even collection sizes of 40,000 have little effect on the performance predicted using the indicated linear combination.

This analysis indicates that (if our only method for combining the scores is linear) the best collections will be those that have a high topic match and are not highly focused. Based on the results of Section 4.5, we will be able to apply a non-linear modification to the prediction mechanism: use the linear combination indicated above, but if the collection is very large and not highly focused, predict that it will do well regardless of the other scores.
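A sketch of this combined predictor appears below. The linear part uses the coefficients from Table 5.3; the thresholds that trigger the non-linear override, and the value it returns, are assumptions chosen for illustration rather than quantities derived from the experiments.

```python
# Sketch of the combined predictor described above: the linear combination from
# Table 5.3, overridden for collections that are very large and not highly focused.
# The override thresholds and the floor value of 0.40 are assumptions.
def predict_p20(topic_similarity, size, term_entropy2,
                large_threshold=40_000, unfocused_threshold=9.9):
    # Coefficients from Table 5.3 (Topic, Size, Term-entropy2, Constant).
    linear = (0.8967 * topic_similarity
              - 0.000003 * size
              + 0.1509 * term_entropy2
              - 1.8799)
    if size >= large_threshold and term_entropy2 >= unfocused_threshold:
        return max(linear, 0.40)   # assume a large, general collection will do well
    return linear

print(predict_p20(topic_similarity=0.95, size=4_000, term_entropy2=9.5))
print(predict_p20(topic_similarity=0.75, size=40_000, term_entropy2=10.0))
```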
6 Improving Topic Match With Context

We have now verified that topical focus is the best predictor of a collection's performance with respect to a query. Can we do anything to improve a distributed information retrieval (DIR) system's ability to identify collections that match the topic of a query? Algorithms for measuring the topical similarity between queries and collections have been studied for over a decade. It is unlikely that the approach of comparing term vectors that summarize queries and collections can be radically improved. While it is possible that some alternate technique will yield improvements, matching techniques developed in the last decade have failed to improve significantly on the performance of CORI.

If we do not expect the matching algorithm itself to become much better, we must look for ways to maximize the effectiveness of the matching algorithm. One way to do this is to introduce more information into the matching process. Methods for adding information include:

1. Adding formal metadata to documents in the collection, including subject headings from standardized lists of terms (e.g., see Svenonius 2000). To ensure consistency, formal metadata is often created by trained librarians, making it relatively expensive.

2. Adding informal metadata "tags" to documents in the collection (Golder & Huberman 2006).
Tags may be added by the document's author, or by readers of the document who are interested in making the document easier to find. Because tags can be created more quickly than formal metadata, they are sometimes used when formal metadata is not available.

3. Expanding the query by asking the user to submit more (or more specific) information. This is often accomplished via an "advanced search" system or by allowing users to enter their query using a more complex syntax than a simple set of keywords. This approach places a greater burden on the user, and most users do not take advantage of it (Spink & Jansen 2004).

4. Expanding the query by an automatic process acting within the search system. This often takes the form of an iterative process known as "relevance feedback". Another method is to pre-compute information about language usage within the collection and expand queries based on this information (completing common noun phrases, for example).

5. Expanding the query by an automatic process acting on information external to the search system. This external information may be built from a set of preferences the user has submitted to the system, the user's past search history, or activities the user has recently performed. In the best case, a system will take all of these into account when determining the appropriate external information to submit with the query. For convenience, we will refer to all of these types of information as "contextual information".

The first three approaches require some amount of human effort, which means the desired information will not always be available. The maintainers of search systems (especially smaller, topically focused systems) will not want to shoulder the expense of augmenting documents with extra metadata. Individual users of search systems are not likely to change their querying habits, even with extensive encouragement and/or training. The fourth approach is interesting, but it falls somewhat outside the scope of our current investigations. If this approach improves performance, it can always be added on top of the regular matching algorithm (and this may already be the case within some of the topical search systems we want to access). In this chapter, we will focus on the fifth approach, automatic query enhancement based on contextual information. This approach has the potential to increase the length and specificity of queries without extra effort from users or search engine maintainers.

6.1 Background

A query submitted to a search system can be interpreted in many different ways, particularly if it only contains 2 words, as is typical (Lesk 1997; Spink & Jansen 2004). Researchers in cognitive science and information retrieval have long recognized that context plays an important role in interpreting information.

Marvin Minsky's concept of "frames" (Minsky 1975) allows language-understanding systems to store and use contextual information.
Result documents are constantly displayed on a portion of the user’s screen, allowing relevant information to be found before the user issues an explicit query.</p><p>IBM recently deployed a system for customer support centers that can extract keywords from a conversation and use these keywords to retrieve relevant technical documents (Graham-Rowe</p><p>2004).</p><p>The systems listed above indicate that contextual information can be effective in generating queries when an explicit query is not present. Can context be used to effectively augment an ex- isting query? The 8th TREC conference included a track devoted to investigating the effects of variations in queries (Buckley & Walz 1999). This track concluded that longer queries produce higher variability in retrieval results, but not necessarily a higher overall effectiveness. However, the ”longer queries” used in this investigation had many distractor terms that could have affected the results. In another study of the use of context with a centralized search system, Kraft et al.</p><p>(2006) compared three different methods for augmenting queries with contextual information, and found that the greatest improvements in query performance could be obtained by adding between</p><p>3 and 5 keywords to the query.</p><p>There have been some investigations of the effect of expanded queries in DIR, although not using contextual information to perform the expansion. Xu and Callan (1998) determined that a query expanded with up to 20 terms could improve retrieval performance if it was used for both collection selection and querying the collections. Ogilvie and Callan (2001) found that expanding short queries improved retrieval performance. Expanding longer queries did not work as well, presumably because the longer queries already contained enough information. Query expansion 6. Improving Topic Match With Context 93</p><p> provided only small improvements for collection selection. French et al. (2002) found that auto- matic query expansion based on a controlled vocabulary could benefit a search system (both DIR and centralized) if the target documents were tagged with controlled vocabulary terms.</p><p>The only previous work that combines the use of context and distributed retrieval is Inquirus2</p><p>(Glover 2001). Inquirus2 asked the user to choose a topical category when a query is submitted.</p><p>These categories covered extremely broad topics, such as “personal home pages” and “research papers”. It is unclear whether this type of contextual information improved selection of collec- tions, but it did increase the number of relevant documents retrieved from the collections that were selected.</p><p>6.2 Varying the Amount of Information Available in a Query</p><p>To see how the availability of contextual information affects retrieval performance, we will sim- ulate the effects of a context-tracking system by expanding the query with extra keywords from the TREC topic. 
It is possible to represent context with more sophisticated structures than simple sets of keywords, but search engines operate primarily on keywords, making it difficult to utilize additional complexity in a contextual representation.</p><p>For a moment, we will step back from our focused collections, concentrating on the centralized collection built from the entire WT2g set and the full set of topics from the TREC-8 Web Track.</p><p>Recall that Web Track participants were given a set of Web pages (WT2g) and 50 topic descriptions.</p><p>Each topic description consists of a two- or three-word title, a one-sentence description, and a multi-sentence longer narrative section. If we consider the title the “basic query”, the description and narrative sections represent two levels of additional context available for each topic.</p><p>Figure 6.1 compares queries that were formed using 5 different levels of simulated context. All 6. Improving Topic Match With Context 94</p><p>Title-desc Description Title Narrative-mod Narrative</p><p>0.50</p><p>0.45</p><p>0.40</p><p>0.35</p><p>0.30 n o i s i 0.25 c e r P 0.20</p><p>0.15</p><p>0.10</p><p>0.05</p><p>0.00 1 10 100 1000 Documents retrieved</p><p>Figure 6.1: The effect of varying query context over the full set of Web Track topics. queries had a basic set of stopwords removed, but no other processing was performed on them.</p><p>Documents were retrieved from the full WT2g collection, and the resultant precision scores from all 50 topics were averaged. The 5 levels of simulated context are:</p><p>¡ Title = The title only</p><p>¡ Description = The description only</p><p>¡ Narrative = The narrative only</p><p>¡ Narrative-mod = The narrative, with an additional set of stopwords removed.</p><p>¡ Title-desc = The description, augmented with two copies of the title, to give the most impor-</p><p> tant terms higher weight.</p><p>Queries based on the description section of the topic fared better than queries based on the title, showing that a adding a small amount of context to short queries does indeed help. Surpris- ingly, the narrative portion of the query, which provides the most context available for these topics, 6. Improving Topic Match With Context 95</p><p><top> <num> Number: 401 <title> foreign minorities, Germany <desc> Description: What language and cultural differences impede the integration of foreign minorities in Germany? <narr> Narrative: A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant. </top></p><p>Figure 6.2: Topic 401 performed poorly. A closer investigation of the description for topic 401 (Figure 6.2) reveals the reason. The narrative portions of the topics provide specific descriptions of documents that are considered relevant and non-relevant. These descriptions often contain terms like “relevant”, “sig- nificant”, “documents”, and “unrelated”, which are not normally considered stopwords, but have nothing in common with the topic. When these terms are removed (Narrative-mod), performance is increased. The modified narrative performance scores are still not as high as those for titles or descriptions, indicating that too many distractor terms remain in the narrative descriptions. 
Many of the narratives specify types of documents that are not relevant to the topic, and these descrip- tions often contain terms that decrease query performance. A highly-intelligent system would be required to parse the narrative descriptions and determine which terms truly apply to the topic at hand. I believe that this behavior is a peculiarity of the TREC topics. Automatic methods for gath- ering contextual information (like the methods used by the “proactive information agents”) would not typically include these kinds of distractor terms.</p><p>Since the terms contained in the topic title are always a more succinct version of the topic de- scription, it was expected that improved results could be obtained by using the title to augment the 6. Improving Topic Match With Context 96</p><p> description, indicating which terms should receive the highest weighting. As expected, the Title- desc queries consistently performed better than any of the other query formulations. This indicates that a system could be very successful if it took a query from the user and augmented it with con- textual information, as long as the system weighted the human-generated query terms higher than terms that came from the automatically-generated context.</p><p>6.3 The Effect of Context on Focused Collections</p><p>In a DIR system, context can be used in two ways: as additional information to use in selecting collections to query, and as additional information that can be sent with the query along to the selected collections.</p><p>Using Context to Select Collections</p><p>First, we will investigate the effect of contextual information on collection selection. Adding contextual information has no effect on the documents in the collection itself, and so will not affect the collection’s score on measures of collection size or focus. Only the query-collection similarity is affected. In the preceding chapters, collections were built that explicitly varied each property under investigation. It does not make sense to build collections with varying amounts of context. A col- lection, by the simple fact that it contains many documents, contains a large amount of contextual information. Here, we focus on context available in the query.</p><p>The five levels of simulated context described above were used as queries to generate query- collection similarity scores for each of the 12 “typical size” focused collections. Figure 6.3 compares the scores for the collections based on the medium-difficulty topic (Topic 401). It is clear that adding or removing context has an effect on the similarity score, but it is unclear whether that effect will 6. Improving Topic Match With Context 97</p><p>Title Desc Narr Narr (mod) Title-desc</p><p>1.0 0.9 0.8 0.7 0.6 0 2</p><p>@ 0.5 P 0.4 0.3 0.2 0.1 0.0 0.6 0.7 0.8 0.9 1.0 Similarity</p><p>Figure 6.3: The effect of varying context on similarity, for the medium-difficulty topic. be helpful in selecting collections to query. Note that scores fall along horizontal lines, because we are only using the contextual information while determining query-collection similarity, not while submitting the query to each collection. This has the effect of varying the similarity scores while holding the P@20 score for each collection constant. The queries based on the title alone produced the highest similarity scores. 
6.3 The Effect of Context on Focused Collections

In a DIR system, context can be used in two ways: as additional information to use in selecting collections to query, and as additional information that can be sent along with the query to the selected collections.

Using Context to Select Collections

First, we will investigate the effect of contextual information on collection selection. Adding contextual information has no effect on the documents in the collection itself, and so will not affect the collection's score on measures of collection size or focus. Only the query-collection similarity is affected. In the preceding chapters, collections were built that explicitly varied each property under investigation. It does not make sense to build collections with varying amounts of context. A collection, by the simple fact that it contains many documents, contains a large amount of contextual information. Here, we focus on context available in the query.

The five levels of simulated context described above were used as queries to generate query-collection similarity scores for each of the 12 "typical size" focused collections. Figure 6.3 compares the scores for the collections based on the medium-difficulty topic (Topic 401). It is clear that adding or removing context has an effect on the similarity score, but it is unclear whether that effect will be helpful in selecting collections to query. Note that scores fall along horizontal lines, because we are only using the contextual information while determining query-collection similarity, not while submitting the query to each collection. This has the effect of varying the similarity scores while holding the P@20 score for each collection constant. The queries based on the title alone produced the highest similarity scores. Adding more context to the queries lowered the scores in a relatively uniform manner.

Figure 6.3: The effect of varying context on similarity, for the medium-difficulty topic. (P@20 vs. query-collection similarity for the Title, Desc, Narr, Narr (mod), and Title-desc queries.)

Although the ability of context to raise and lower the overall similarity scores is an interesting effect, it is more useful to see whether context will improve our ability to discriminate between collections that perform well and collections that do not. The title-only queries provide the greatest separation between the scores for the most on-topic (t1) collection and the most off-topic (t4) collection. This separation is lowest for the queries with the most context (those based on the narrative portion of the topic), indicating that a large amount of context may actually interfere with the process of selecting collections.

Figure 6.4: The effect of varying context on similarity, for the easy topic.

Figures 6.4 and 6.5 display query-collection similarity scores for the five levels of context using the easy collections and the difficult collections, respectively. The scores are ordered somewhat differently for the easy collections, and no ordering is evident for the difficult collections. However, the same trend remains: queries based on title and description provide a greater spread in similarity scores between the most on-topic and the most off-topic collections, indicating that shorter, more concise queries are better for calculating topical similarity.

One oddity in these graphs is that the CORI measure showed no difference between the description and title-desc queries. This is due to the fact that CORI (in either the original form or our modified form) does not take the weight of query terms into account. CORI simply treats a query as a list of terms, and calculates scores for those terms based on their frequency within the collection. To determine whether the title-desc queries would provide more useful similarity scores, the document frequency (DF) similarity measure was used to generate a parallel set of scores. This measure produced similar results, with the title-desc queries receiving scores very similar to the description queries (exact scores may be found in Appendix D).

Figure 6.5: The effect of varying context on similarity, for the difficult topic.

Using Context to Query Collections

If context is not helpful in the process of selecting a collection, will it be helpful when the query is sent to the collection? Previous systems that used context (Budzik & Hammond 2000; Ogilvie & Callan 2001; Kraft et al. 2006) have shown that adding context is useful when querying general-purpose collections. We want to determine whether that result will hold for more focused collections.

In the previous section, we varied context in the similarity calculations only, holding the performance scores constant (with the values from description-based queries, which we have been using thus far). In this section, we are varying context when querying the target collections, which only affects the performance scores.
Similarity scores are held constant for each particular collection, so all values end up arranged in horizontal lines.

Scores for the five levels of simulated context across the 12 topically-focused collections are shown in Figures 6.6 through 6.8. On the medium-difficulty topic, scores based on the narrative and modified narrative are the highest, indicating that context can provide improvements. On the easy topic, the narrative queries produced the worst scores of all. This could have been caused by the presence of the generic terms "reports" and "research" in the narrative query (which do not appear in the narratives for the other topics). But even when these terms are removed to produce the modified narrative query, the scores are not as high as those produced by the other types of queries. For collections based on the difficult topic, queries based on the description section performed better than other queries, although all of the queries performed poorly.

Figure 6.6: The effect of varying context on query performance, for the medium-difficulty topic.
Figure 6.7: The effect of varying context on query performance, for the easy topic.
Figure 6.8: The effect of varying context on query performance, for the difficult topic.

It is possible that the erratic performance of queries with greater context can be explained by the way the TREC topic descriptions are constructed. The topic descriptions often describe the information need in a formal, abstract way. The "easy" topic (441), regarding the prevention and treatment of Lyme disease, uses the terms "prevention" and "treatment". These terms will not necessarily occur in documents that describe methods for preventing and treating Lyme disease. In the narrative portions of the TREC topics, significant terms often appear a single time, making it difficult to distinguish the important terms from unimportant terms. The documents used in other context-generation systems are typically longer, and it is often easier to pick significant words out of them. A "proactive information agent" would be expected to identify the most significant terms in a document, giving them high weights, and include a sampling of other, less significant terms. This is very similar to the content of the title-desc queries. In our experiments, the title-desc queries performed well. They performed the best on the difficult topic, near the top on the easy topic, and in the middle of the pack on the medium-difficulty topic. It appears that using a small amount of context (5 terms or less) may be a way to provide some "safety" in the results, decreasing the variability of the resultant performance scores.
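The simplest version of that behavior can be sketched as follows, with plain term frequency standing in for whatever weighting a real agent would use.

    # Sketch: picking a small number of high-weight context terms from a
    # document, in the spirit of a proactive information agent.  Term
    # frequency is a stand-in for a more sophisticated weighting scheme.
    import re
    from collections import Counter

    def top_context_terms(document_text, stopwords, k=5):
        words = [w for w in re.findall(r"[a-z]+", document_text.lower())
                 if w not in stopwords and len(w) > 2]
        return [term for term, _count in Counter(words).most_common(k)]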
6.4 Discussion

It appears that a small amount of contextual information can produce better performance from general-purpose collections, and can produce better performance from focused collections under the right circumstances. However, there is still much work to be done before contextual information can reliably be used to enhance search results.

Expanded queries derived from the sections of the TREC topics may not be an accurate reflection of the type of queries that would come from an actual context-tracking system. Since the TREC topic descriptions are manually created, we would expect them to be indicative of the "true" context that surrounds a query. But due to the succinct nature of the descriptions, as well as the tendency to include both positive and negative examples in the narrative sections, it is possible that an automatic context-tracking system could produce queries that would outperform the queries generated from the TREC topics.

There are still problems in selecting the correct context that fits with a query. For any given query, there are many available contexts, including all of the applications the user has open at the time, all of the user's past search history, and even the "ambient context" of conversations occurring in the room (typically unavailable to a search system). The user's information need may occur within any of these contexts. It may be difficult to tell which of the available contexts (if any) is relevant to the query. There has been some work in this area (Horvitz et al. 1998), but the problem has not been solved.

Once the correct contextual information is selected, it may not be usable with all search systems. Some search engines (including Google) only return documents that contain all of the terms in the query. If the entire context is used as a query, the results may be artificially limited. Queries with more than a few terms are unlikely to return many results from such a system. To mitigate this problem, Google places an absolute limit on the number of terms allowed in a query (currently 32). These constraints work well for general-purpose collections containing billions of documents. With more focused collections, constraints like this make the use of contextual information much more difficult.
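One practical safeguard, sketched below, is to send only the highest-weighted terms when an engine enforces a cap on query length. The default of 32 mirrors the limit mentioned above, but the right value depends on the engine.

    # Sketch: trimming a (possibly context-expanded) weighted query so it
    # fits within a search engine's term limit, keeping the heaviest terms.
    def trim_query(weighted_terms, max_terms=32):
        ranked = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
        return [term for term, _weight in ranked[:max_terms]]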
7. Putting It All Together

The previous chapters have given some indication that under certain conditions, distributed information retrieval can be successful. However, determining these conditions requires knowledge of essential properties of the collections being evaluated. Measuring these properties can be problematic. A distributed information retrieval (DIR) system will seldom have access to the collection backing a search engine, for reasons including:

1. The web is large, and there are many search engines to deal with. It is not a process that can be accomplished manually.

2. Search engine operators have little incentive to cooperate with a DIR system, and may in fact see the system as competition. For example, in March 2005, Agence France-Presse (AFP) sued Google for $17.5 million because Google was making AFP's photos and headlines available on the aggregated Google News service.

3. Even in cases where search engine operators want to cooperate (e.g., with OAI-PMH), doing so can take significant effort (Lagoze et al. 2006).

Now that we know how the properties of a collection can be used to predict retrieval performance, we will examine a prototype DIR system that demonstrates how well this prediction works in a realistic situation. Chapter 2 described DIR in some detail. In this chapter, we will briefly discuss how the steps of a DIR system are implemented for testing. The steps are, in preparation for queries:

1. Locate search engines.
2. Determine a method for communicating with each search engine.
3. Form a representation of each search engine to store in the selection system.

And when a query is received:

1. Apply a selection algorithm to the set of collection representatives, calculating the suitability of each collection to the query.
2. Forward the query to the collections that are expected to return the best results.
3. Apply a merging algorithm to combine results from the selected collections, and present the results to the user.

Detecting the presence of a search system is an interesting problem. Users could submit pages for detection, or a crawling algorithm could be used to search the web for pages that have searches. The algorithm for taking a page and locating a possible search system would involve parsing the page and looking for a text entry box near the word "search". We will not address the problem of detecting search systems here. If the task of automatically locating search systems is too difficult, we can rely on manually-added systems.

7.1 Communicating With a Search Engine

Once a search engine has been located, there must be a way to interact with it. A DIR system interacts with each of its component search engines through a piece of code called a "wrapper". The wrapper knows how to translate between the search system's native input/output formats and the format required by the DIR system. Depending on the search engine, the wrapper may use one of three communication methods:

1. A standardized search protocol.
2. A protocol specific to the search engine.
3. "Screen scraping".

The need for interoperability between search engines has been recognized for some time, and some standard protocols for interacting with search engines have emerged. One of the oldest standard protocols is Z39.50 [1], originally developed for communication between library catalogs. SRU (Search and Retrieve via URL) [2] is a re-engineered version of Z39.50 that has been gaining in popularity for digital library systems. The online store Amazon.com has recently published a standardized method of communicating with search engines, called OpenSearch [3], in conjunction with its search engine, A9.com. These protocols provide standard ways to format queries and standardized result formats, allowing seamless interaction with the search engine.

Some search systems make their services available through proprietary communication protocols. The most notable example of this is the popular search engine Google, which allows programs to access services via the Google APIs [4].

[1] http://www.niso.org/z39.50/z3950.html
[2] http://www.loc.gov/standards/sru/
[3] http://opensearch.a9.com/
[4] http://www.google.com/apis/
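The wrapper idea can be made concrete with a small interface sketch. The class and method names are inventions for illustration, and the URL-template example is a simplified stand-in for a real engine description (it assumes, hypothetically, that the engine returns JSON results).

    # Sketch: a minimal wrapper interface.  Each wrapper hides one engine's
    # query syntax and result format behind a common search() call.
    import json
    import urllib.parse
    import urllib.request

    class SearchEngineWrapper:
        def search(self, query_terms, max_results=20):
            """Return a list of (document_id, title, summary) tuples."""
            raise NotImplementedError

    class UrlTemplateWrapper(SearchEngineWrapper):
        """Wrapper for a hypothetical engine described by an OpenSearch-style
        URL template that returns JSON (format assumed for illustration)."""

        def __init__(self, url_template):
            self.url_template = url_template  # e.g. ".../search?q={searchTerms}"

        def search(self, query_terms, max_results=20):
            url = self.url_template.replace(
                "{searchTerms}", urllib.parse.quote(" ".join(query_terms)))
            with urllib.request.urlopen(url) as response:
                hits = json.load(response)
            return [(h["id"], h["title"], h.get("summary", ""))
                    for h in hits[:max_results]]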
When a search engine does not provide a protocol for direct use, the only option available to a DIR system is a "screen scraping" method. Screen scraping involves sending queries to the search engine using a URL of the same form that the search engine's graphical interface generates. The search engine returns results in its normal manner, using an HTML format that is appropriate for human viewing. This HTML must then be parsed (or scraped) to identify individual result items. One of the earliest Web-based DIR systems, Apple's Sherlock (Montbriand 1999), relied on manually-generated "screen scraping" wrappers. Screen scraping is error-prone, and the parsing algorithm must be updated any time the search engine's display format changes. Methods of automatic wrapper generation for screen scraping systems (Kushmerick, Weld, & Doorenbos 1997; Ashish & Knoblock 1997; Adelberg & Denny 1999; Fan & Gauch 1999; Zhao et al. 2005) are just beginning to achieve acceptable levels of accuracy.
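A stripped-down sketch of the scraping step, using only the standard-library HTML parser; a real wrapper would need engine-specific rules to separate result links from navigation links.

    # Sketch: pulling result links out of an HTML result page.  A production
    # scraper needs engine-specific filtering; this collects every anchor.
    from html.parser import HTMLParser

    class ResultLinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []            # (url, anchor text) pairs
            self._current_href = None
            self._text_parts = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._current_href = dict(attrs).get("href")
                self._text_parts = []

        def handle_data(self, data):
            if self._current_href is not None:
                self._text_parts.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._current_href is not None:
                self.links.append((self._current_href,
                                   "".join(self._text_parts).strip()))
                self._current_href = None

    def scrape_results(html_text):
        parser = ResultLinkParser()
        parser.feed(html_text)
        return parser.links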
A problem common to proprietary protocols and screen scraping methods is that query interpretation is dependent on the characteristics of the individual search system. Some search engines allow special operators to boost the weight of a query term, while others weight each term equally. Some search engines allow an unlimited number of terms to be used in a single query, while others limit the number of terms. Even when the number of terms is unlimited, practical constraints may limit the number of terms that are sent to a search engine. For example, while some search engines employ a straightforward TFIDF algorithm (as described in Section 3.2), others limit result sets by only returning documents that contain all terms in the query. In this type of search engine, a query of more than 5 search terms is unlikely to return many results. This is one of the primary reasons automatic systems like Watson (Budzik & Hammond 2000) do not use the popular Google search engine.

We will not worry about the problems of interacting with search systems here. The usage of standardized protocols like SRU and OpenSearch is increasing as more search engine operators realize the benefits of making their data available for search by others. For purposes of the prototype DIR system, we will assume that an appropriate communication method is available. However, we will not assume that we have unlimited access to the collection of documents held by a search engine, because this type of access is very uncommon. Even when search engines provide a standardized interface for searching, they are often unwilling to provide direct access to their documents. Therefore, the contents of a search engine must often be estimated.

7.2 Building a Collection Representation

So far, we have examined collections that allow us full knowledge of their contents. In "real" collections on the Web, this will not be the case. The method for building a representation of a search engine must be completely automated. Search engine contents change frequently, and the representation must be regenerated on a regular schedule to maintain accurate information about the search engine's contents.

On the Web, since we will be unable to analyze the contents of a collection directly, we must do it indirectly. An effective way to do this is by sampling the documents that the collection contains and using the sample as a representative for the full collection. A sample from a collection can be treated as a smaller copy of that collection, to which all of the regular algorithms can be applied.

Query-based sampling (Callan, Connell, & Du 1999) is a method of sampling a database to retrieve a small set of representative documents. These documents can be used to calculate statistics of the full collection without the explicit cooperation of the database. Query-based sampling works by sending a random dictionary term to a search engine. If no documents are returned in response to this query, another term is selected. Once at least one document is retrieved from the search engine, all of the terms in the documents are added to a set of "known terms" for the collection. From this point on, single-term queries are randomly selected from the list of known terms. Callan and Connell (2001) found that by sampling 500 documents from a collection, a highly accurate representation of the entire collection was formed.

Note that when collections are small, the sampling process may not be able to retrieve all 500 documents required for true convergence. A sampling process may take a long time to identify 500 unique documents from a collection of 4000, and when the collection size is 400, it will not be possible to retrieve 500 unique documents at all. To avoid infinite searches, the process is limited to examining a total of 5000 documents (including duplicates) from any single collection.

Query-based sampling takes some time. Hundreds of searches must be performed, and hundreds of documents must be downloaded and processed. However, this process only needs to be done once for each collection (or once each time a collection's representative needs to be updated); after that, queries can be sent to the collection fairly rapidly.

Figure 7.1 shows the convergence of three collections as they are sampled. The CACM (Stemmed) collection is one of the collections used by Callan in the initial development of query-based sampling. In this collection, a stemming algorithm has been applied to combine similar terms. The CACM collection is the same collection, without stemming applied. Finally, the WT2g collection is shown.

Figure 7.1: Convergence of query-based sampling. (Collection term frequency ratio vs. number of documents examined, for the CACM (Stemmed), CACM, and WT2g collections.)

The values on the x-axis represent the number of unique documents seen at each point in the sampling process. Values on the y-axis represent the percentage of total term occurrences in the collection that are represented by the terms seen in the sampled documents (the collection term frequency ratio). For a collection C, a dictionary of terms V, and a dictionary of sampled terms V':

ctf ratio = ( Σ_{t ∈ V'} ctf(t) ) / ( Σ_{t ∈ V} ctf(t) )    (7.1)

ctf(t) = number of times term t occurs in collection C    (7.2)

In a large collection, 500 documents may contain only a small percentage of the total terms in the collection. However, the documents gathered by query-based sampling contain the terms that are used most throughout the collection, as evidenced by the high ctf ratios in Figure 7.1.
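A sketch of the sampling loop and of the ctf ratio calculation from Equation 7.1. The engine object is assumed to return (document id, document text) pairs, and the tokenizer is left abstract; the 500/5000 limits are the ones described above, and the query cap is an added safeguard against engines that keep returning nothing.

    # Sketch: query-based sampling of a search engine, plus the ctf ratio.
    import random

    def query_based_sample(engine, seed_terms, tokenize,
                           max_unique=500, max_examined=5000, max_queries=10000):
        known = set(seed_terms)          # start from random dictionary terms
        candidates = list(known)
        sampled = {}                     # doc_id -> document text
        examined = queries = 0
        while (len(sampled) < max_unique and examined < max_examined
               and queries < max_queries and candidates):
            queries += 1
            term = random.choice(candidates)
            for doc_id, text in engine.search([term]):
                examined += 1            # counts duplicates as well
                if doc_id not in sampled:
                    sampled[doc_id] = text
                    new_terms = [t for t in tokenize(text) if t not in known]
                    known.update(new_terms)
                    candidates.extend(new_terms)
        return sampled

    def ctf_ratio(sampled_terms, full_collection_ctf):
        """Equation 7.1: fraction of all term occurrences in the collection
        covered by the terms seen in the sample.  Needs full-collection
        counts, so it is only computable in an evaluation setting."""
        covered = sum(full_collection_ctf[t] for t in sampled_terms
                      if t in full_collection_ctf)
        return covered / sum(full_collection_ctf.values())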
Note that sampling is affected by stemming. The WT2g data set takes a bit longer to learn than the CACM data set. Whereas 300 sampled documents is sufficient for a stemmed database, 500 documents is a more reasonable number for databases where stemmed terms are not available.

Estimating Topic

How does sampling affect the properties we have examined previously? Since sampled collections have high ctf ratios, we expect the sampled documents to be highly representative of term usage within the full collection.

The 36 topic-specific collections (3 topics, 3 sizes, 4 similarity levels) used in previous chapters were sampled using query-based sampling. Query-collection similarity scores were computed using the modified CORI measure, as in Chapter 3, and the similarity scores were compared. The results, shown in Figure 7.2, reveal some interesting patterns. If the sampled collection provided a perfect representation of the full collection, all values would lie along a straight line. Overall, the correlation is moderate, but when the various sizes of collections are separated, the picture becomes clearer. As would be expected, there is a near-perfect correlation between the small collections and their sampled versions. This is due to the fact that the sampled collections contain almost all of the documents in the original collections. With the typical-size collections, the correlation drops a bit, and the correlation is barely evident for the large collections. In other words, as a collection grows larger, our ability to accurately sample its contents diminishes.

Figure 7.2: Effect of sampling on topical similarity scores. (Similarity computed from the sampled collection vs. similarity computed from the full collection, for the small, typical, and large collections.)

At first glance, this appears to be a problem. But, upon closer inspection, we find that all of the points that are "out of line" come from collections based on the difficult topic. When the four large t437 collections are excluded, the large-collection correlation becomes much more acceptable. That is, the topic for which we had difficulty retrieving quality results is the same topic that proves most difficult to sample. And, on the difficult topic, the sampled collections always have lower similarity scores than the original collections, which will improve performance of the selection algorithm (the topic-specific collections will not be selected as often for queries involving the difficult topic). Whether this result will generalize to other topics is still an open question.

Estimating Size

Very little work has been done on automatically estimating the size of a search engine's collection. In (Liu, Yu, & Meng 2001), a method for determining a collection's size based on sampling is described. The size estimate is based on the frequency with which duplicate documents are retrieved during the sampling process. Accuracy can be quite good (within 5 percent) when 2000 documents are sampled out of a database of 300,000.
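For concreteness, a capture-recapture style estimate in this spirit is sketched below. It is a generic illustration of estimating size from the rate of duplicates between samples, not the specific estimator published by Liu, Yu, and Meng.

    # Sketch: estimating collection size from the overlap between two
    # independent samples of document ids (capture-recapture).  This is a
    # generic illustration, not the published estimator referenced above.
    def estimate_size(sample_a_ids, sample_b_ids):
        a, b = set(sample_a_ids), set(sample_b_ids)
        overlap = len(a & b)
        if overlap == 0:
            return None  # too little overlap to produce an estimate
        return len(a) * len(b) / overlap

    # Example: two 2000-document samples sharing 20 ids suggest ~200,000 docs.
    print(estimate_size(range(0, 2000), range(1980, 3980)))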
Since our analysis has indicated that size is not worth including in the calculations, we will not try to determine a collection's size for purposes of the prototype.

Estimating Focus

The sampled versions of our 36 collections were measured with the term-entropy2 measure (described in Section 5.2). The results are shown in Figure 7.3. As with the topical similarity measure, correspondence was quite good for the small collections, and not as good but still strong for the typical-size collections and for the large collections. It is interesting to note that the focus scores overall are much more consistent than the topical similarity scores.

Figure 7.3: Effect of sampling on term-entropy2 scores. (Focus computed from the sampled collection vs. focus computed from the full collection, for the small, typical, and large collections.)

Surprisingly, the term-entropy2 score does not fare as well when large, general-purpose collections are sampled. When the full WT2g collection and the WT2g-half collection were sampled, the resultant term-entropy2 scores fell in the middle of the ranges exhibited by the focused collections. This creates a problem for using term-entropy2 in a practical setting. An effective DIR algorithm must be able to distinguish between topically-focused collections and general-purpose collections. The potential problems with focused collections make it desirable to query at least one general-purpose collection for every information need. Because of this, the prototype DIR algorithm will need to make use of both the term-entropy2 and the retrieved-focus scores.

7.3 Result Merging

Merging result sets from multiple search systems should be straightforward, but no result merging algorithm has yet been recognized as having a clear advantage over other algorithms in all situations. Voorhees has suggested several merging algorithms (Voorhees, Gupta, & Johnson-Laird 1995; Voorhees 1997), but all require training data on documents known to be relevant and non-relevant from each collection being used. Callan (2000) describes several merging algorithms that were used with the INQUERY system, but all of these are shown to have flaws. The best-performing algorithm requires cooperation of the target search engines. Other merging strategies (like that described in Manmatha & Sever 2002) require only minimal cooperation from the individual search engines. This typically requires reporting of internal similarity scores, which are not always available and can be misleading.

Two recent comparisons of merging algorithms (Tsikrika & Lalmas 2001; Lu et al. 2005) came to the same conclusion: merging algorithms perform best when each result document's title and summary are used to compute its position in the final ordering, rather than using the document's rank ordering alone. However, these results were obtained from studies of systems that merge results from multiple general-purpose search engines, not the more focused search engines we are concerned with. While document titles are almost always available from topical search engines, in practice document summaries are often not available.
Many topical search engines provide very brief summaries, or no summaries at all, with their result lists. In initial tests, it was determined that the use of titles alone to merge documents was not adequate (all P@20 scores were at or near 0).

The unavailability of summary information and the poor performance of title information alone force us to use the rank-based merging algorithm described in (Yuwono & Lee 1997). This algorithm assumes that the first result from each search engine has an equal chance of being relevant, and then spreads apart subsequent documents from each search engine in a manner inversely proportional to the engine's goodness score. For result document d from selected collection C, the ranking score is:

score(d, C) = 1 - (rank(d, C) - 1) × spread(C)    (7.3)

rank(d, C) = position of document d in collection C's result list    (7.4)

spread(C) = (minimum goodness score of any collection for this query) / (documents being retrieved × goodness of collection C)    (7.5)

Duplicate documents must be removed from the result list. Often, this can be done by comparing URLs, but since some documents can be referenced by multiple URLs, it is often best to compare the document summaries. In our case, it is possible to unambiguously identify documents by their WT2g ID number. When duplicate documents are located, their scores are added, under the presumption that documents returned by multiple search engines are more likely to be relevant than documents returned by a single search engine.
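A sketch of this merging step, following Equations 7.3 through 7.5 and the duplicate handling just described. Result lists are assumed to be lists of document ids in rank order, and the goodness scores are those computed during selection.

    # Sketch: rank-based result merging (Equations 7.3-7.5) with duplicate
    # handling.  `result_lists` maps collection name -> ranked list of doc
    # ids; `goodness` maps collection name -> goodness score for this query.
    def merge_results(result_lists, goodness, num_retrieved=20):
        min_goodness = min(goodness[c] for c in result_lists)
        merged = {}                                    # doc id -> combined score
        for collection, docs in result_lists.items():
            spread = min_goodness / (num_retrieved * goodness[collection])
            for rank, doc_id in enumerate(docs, start=1):
                score = 1.0 - (rank - 1) * spread      # Equation 7.3
                # Duplicates across collections have their scores added.
                merged[doc_id] = merged.get(doc_id, 0.0) + score
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)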
7.4 Testing the Combined System

As a practical test for the combined system, a set of collections was gathered. This set contains a general-purpose collection (full WT2g) and several more focused collections. Four highly-focused collections were included: the three highly-focused t1 collections, as well as a highly-focused collection based on topic 413 (this collection had previously been used as an "off-topic" t4 collection). The six moderately focused collections included the "water" and "germany" collections from Chapter 5. These were supplemented with four new collections, built using a single prominent term from each of topics 402, 403, 404, and 405: "genetics", "osteoporosis", "ireland", and "cosmic". Each of these four terms matched less than 4000 documents in the WT2g collection, so no artificial restrictions were placed on the size of these collections. The full set consists of one general-purpose collection and 10 topic-specific collections: 4 highly focused on one of the target topics, 5 moderately focused on a given term from a target topic, and one moderately focused on a general term ("water").

An alternative set of tests was run, replacing the full WT2g collection with the WT2g-half collection, in which half of the documents have been uniformly removed. This simulates a situation common in actual Web search: the centralized collection has some documents relevant to the query, but the topical collections contain many relevant documents that are not available to the centralized collection.

The DIR Algorithm

Distributed retrieval was performed according to the following algorithm:

1. Each collection was sampled using query-based sampling to create a collection representative that contained 500 documents or less.

2. The term-entropy2 score and the retrieved-focus score (both from Section 5.2) for each collection representative are pre-computed.

3. For each query received,

(a) Calculate each collection's base "goodness score" using the sampled collection representative, according to the formula:

goodness(q, C) = α × similarity_CORI(q, C) + β × term-entropy2(C)    (7.6)

(b) If retrieved-focus(C) < φ, increase the goodness score by θ.

(c) If goodness(q, C) ≥ θ, submit the query to the actual search engine and collect the result documents. Otherwise, ignore the engine.

(d) When results from all matching search engines have been collected, merge them in order according to the scores from Equation 7.3 above.

In the initial experiments, α was set to 0.8967 and β was set to 0.1509, based on the results of the multiple linear regression shown at the end of Chapter 5. However, this proved to give too much weight to collection focus, and allowed many off-topic collections to match the queries. The value of β was halved, to 0.0755, which provided much better results.

Collections with extremely low focus are considered to be general-purpose collections, and are consulted for every query. The threshold for determining these collections (φ) was set to 0.05, based on manual examination of the retrieved-focus scores for sampled collections. The threshold goodness score for an engine to be selected (θ) was set to 1.32, based on manual examination of the goodness scores that resulted from this algorithm.
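A sketch of the selection portion of this algorithm, with the similarity and focus measures assumed to be available as functions over the sampled representatives; the parameter values are the ones quoted above.

    # Sketch: scoring and selecting collections for one query (Equation 7.6
    # plus the two thresholds).  cori_similarity, term_entropy2, and
    # retrieved_focus are assumed to operate on the sampled representatives.
    ALPHA = 0.8967           # weight on query-collection similarity
    BETA = 0.0755            # weight on term-entropy2 (halved regression value)
    FOCUS_THRESHOLD = 0.05   # retrieved-focus below this => general-purpose
    GOODNESS_THRESHOLD = 1.32

    def select_collections(query, representatives,
                           cori_similarity, term_entropy2, retrieved_focus):
        selected = []
        for name, rep in representatives.items():
            goodness = (ALPHA * cori_similarity(query, rep)
                        + BETA * term_entropy2(rep))
            if retrieved_focus(rep) < FOCUS_THRESHOLD:
                # Very unfocused collections are treated as general-purpose
                # and boosted so that they are consulted for every query.
                goodness += GOODNESS_THRESHOLD
            if goodness >= GOODNESS_THRESHOLD:
                selected.append((name, goodness))
        return selected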
Results

Figure 7.4 compares the performance of retrieval from centralized collections (WT2g and WT2g-half) with retrieval from a distributed system that contains the 10 focused collections and one of the centralized collections. Selector-full indicates the distributed system that contains the full WT2g collection, while selector-half indicates the distributed system with WT2g-half.

Figure 7.4: Comparison of the source selector vs. the centralized collections over all 50 topics. (Precision vs. number of documents for WT2g, WT2g-half, selector-full, and selector-half.)

It is clear that WT2g-half does not perform as well as the full WT2g collection, and this is to be expected, because it only contains half of the relevant documents for each topic. It is surprising, though, that the selector-full and selector-half scores track so closely along their associated WT2g scores. Using a distributed system does not greatly help retrieval performance. Neither does it greatly harm the retrieval performance, which is important for topics that do not have any matching focused search engines. By setting the goodness threshold high enough, we were able to avoid unnecessary selection of "off-topic" collections that would have contributed poor documents to the final result set.

Even though the topical collections on their own can outperform the WT2g-half collection (see Section 3.8), using the distributed system to combine documents from the topical collections and WT2g-half does not work as well, presumably because of the difficulties in choosing the best document ordering in the merging process. Even for the four topics that were explicitly covered by focused collections, the selector-half system did no better than the WT2g-half collection alone. Admittedly, the number of collections in this sample is small, but it gives a general idea of the performance we can expect of a DIR system using this approach. It is clear that much more work needs to be done before a DIR system can provide substantial benefits over a large, centralized collection.

8. Conclusions and Future Work

This dissertation has explored the process of distributed information retrieval (DIR) from the standpoint of a system that must automatically select search engines on the Web. The most important result from these explorations is a better understanding of the factors in a search engine's underlying collection that affect retrieval performance, and how these factors may be measured.

The experiments in this dissertation have focused on TREC topics 401 through 450 and the WT2g collection, using simple keyword-based representations of queries and collections, and using vector-space matching algorithms. All results described below should be interpreted in that context. While it is expected that the results will apply to other vector-space retrieval systems, they are unlikely to apply to systems based on different retrieval algorithms. Additional experiments on a broader range of datasets would be desirable to further examine the hypotheses arising from this dissertation.

8.1 Summary of Results

The experiments presented in this dissertation support a number of hypotheses. When a collection is topically focused, the best results will be obtained when there is a high degree of similarity between the central focus of the collection and the user's need. Regardless of a collection's topical focus, collections that contain more documents relevant to the query are more likely to perform well than collections that contain few documents relevant to the query. This was shown in the experiments with the "best case" and "worst case" collections, and was further supported when the topically focused collections produced better results than the WT2g-half collection.

As collections become larger (and more general), there is a greater chance that they will perform well. Small collections only perform well when they are centered exactly on the topic of the query. If a collection is small, the similarity score must be high to produce good results. Larger collections can perform well despite lower similarity scores. Very large (e.g., centralized) collections always perform well. Therefore, if an engine is known to be very large (with respect to the universe of documents), there is little point in finding its "central" topic.

Collection focus does not appear to affect retrieval performance as much as central topic, but collection focus does have an effect on performance. Highly focused collections do not perform as well as more general collections. A collection's central topic and focus can be accurately estimated by using a simple sampling technique to retrieve a subset of the full collection. While there is no simple method for estimating the size of a collection, knowledge of collection size is not necessary for accurate selection.
When detailed information about a user's search preferences, search process, or current information need is available, retrieval performance may be improved by adding a small amount of that contextual information to the query. However, contextual information is unlikely to aid in the process of selecting topically focused search engines.

Retrieval from a distributed system is much more difficult than retrieval from a centralized system, and the overall performance may not be better. Distributed information retrieval is a complex process, and each step in the process is a place where noise may be introduced. If a topical collection matching the target query does not exist in the current set of collections, the system can do no better than sending the query to one or more general-purpose search engines. If a topical collection does exist, that collection will only provide a benefit if it contains relevant documents that the available general-purpose engines do not contain. Even then, the additional documents may get "drowned out" by documents from general-purpose engines or other (less relevant) topically focused sources.

Although the current results indicate that centralized systems can provide more reliable results, it is still possible for a centralized system to provide links to more focused collections when it detects that the focused collections will provide good results. Smaller, more focused collections are still useful, since they can provide novel results. Perhaps the best approach is to start off using a general-purpose search engine and, if results are not adequate, get suggestions of successively more focused engines, knowing that the very small engines are a big risk. This approach mirrors the "ramping interface" approach suggested by Rhodes (2000) as a means of assisting the user in maintaining the best cost/benefit ratio for the search process.

8.2 Future Directions

The obvious next step for this work is to confirm the results with "real" search engines on the Web. We have previously performed some initial work with real search engines (Leake & Scherle 2001), but not to the extent required to verify the results presented here. To fully verify these results would require gathering a set of topic-specific search engines, writing wrappers to communicate with the engines, and recruiting human judges to evaluate the relevance of hundreds of documents.
Care would need to be taken to ensure the target topics had enough overlap with the topical search engines being used. We are only aware of one previous experiment that has attempted to use distributed retrieval techniques with topically-focused Web search engines (Ipeirotis & Gravano 2002). In this experiment, the resultant precision scores were much lower than expected. This is likely due to minimal overlap between the topics covered by the 50 topic-specific engines in the test set and the 50 TREC topics used for testing.

Throughout this dissertation, we have seen that some critical pieces of DIR have still not been the object of adequate research. It would be useful to flesh these out. The largest holes are methods for automatic discovery of search engines, methods for communicating with search engines, and methods for estimating the size of search engines. Many of these problems could be solved by wider adoption of standard search protocols, but there will always be some search engines that choose not to follow these protocols. Having better automated methods to handle these systems would greatly increase the number of search engines that are available in a distributed system, providing better coverage of the topics users may search for.

It would be useful to look more closely at tagging and other methods of "folk cataloging". Since contextual information can improve retrieval performance, can user commentary about documents also improve performance? A wealth of user-contributed data is being cataloged at sites like Delicious [1] and the Open Directory Project [2]. In addition to expanding the number of keywords associated with a document, tags may be used to describe features that are orthogonal to topic (e.g., dates or generic descriptors like "useful"). This user-submitted information has the potential to greatly improve retrieval performance, because the human-generated descriptions of documents are likely to be more precise than simply examining all keywords in each document. However, tags and user-generated categories can also contain a great deal of noise, which may be difficult for a retrieval system to filter out.

[1] http://del.icio.us
[2] http://dmoz.org/

In addition to informal "folk cataloging", a DIR system may be able to take advantage of the implicit information contained in a document's popularity on the Web. This popularity may be measured directly (Zhu & Gauch 2000), or more indirectly, using a link-citation algorithm like Google's PageRank (Brin & Page 1998). Is there a good way to determine the most popular search engine within a particular topic? Is there a way to determine whether a search engine makes use of linking or popularity calculations within its internal document ranking? While popular pages will not always be the best results, it is worthwhile to examine the effects of popularity on the ranking process.

Popularity is one factor that affects the "difficulty" of a topic. Topic difficulty clearly affects the usefulness of results. More work needs to be done to determine whether it is possible to predict the difficulty of a query, and how best to handle this knowledge in a distributed system. Our results initially point towards avoiding distributed retrieval for difficult topics, but there may be types of topics for which general-purpose search engines will perform poorly, indicating that topical search engines would be a better first choice.

Although the majority of human searchers rely on general-purpose search engines (Graham & Metaxas 2003), searchers sometimes use topically-focused search engines. As we have seen, selection of topically-focused search engines that will perform well for a given query is a difficult task. An interesting question for future research is whether human searchers who use topic-specific search engines (manually performing distributed retrieval) produce better results than could be produced by performing the same search in a general-purpose search engine. If the human searchers do produce better results, it would be worthwhile to spend more research effort on accurately modeling the decision processes that expert human searchers use and determining whether such a model would improve automatic distributed retrieval systems.
More work needs to be done in determining when contextual information can usefully be applied to the retrieval process. Methods need to be developed that can distinguish between a context related to the query and a completely unrelated context. More sophisticated methods for parsing contextual information need to be developed, to boost the weights of terms that are important for topic discrimination and to remove terms that are likely to cause spurious matches. One possible source of improvement in this area is to use more sophisticated methods for tracking contextual information. Instead of storing context as a single set of keywords, contextual information could be divided into sets of keywords (e.g., positive and negative terms) that could be handled differently. Some sets of keywords would be added to the query as in the experiments described here, but other sets of keywords would be modified before being added to the query. For example, if the search engine's language supports negation, negative terms could be used to exclude documents from the search results. Richer contextual information may also aid the process of merging results in a distributed retrieval system.

A. Appendix: Topics and Queries

This appendix details the topics and queries that were used to generate the topically-focused collections. Each of the three target topics is listed below, along with the actual queries that were used to generate collections at various similarity levels for the topic. When the queries are listed, an "x" is used as a placeholder. For example, the query labeled t441-x1 was used to select 4000 documents for the t441-t1 collection, while the same query was used to select 40,000 documents for the t441-l1 collection.

Easy topic - 441

Query difficulty can be affected by two major factors. A query can be "easy" because the information is widely available, or because there is a unique word in the query that would only occur in relevant documents. This topic was relatively easy due to the rare word "Lyme". However, since the desired result set was a subset of the information available on Lyme disease (prevention and treatment only), this topic proved to be of medium difficulty when using title-only queries.

<top>
<num> Number: 441
<title> Lyme disease
<desc> Description:
How do you prevent and treat Lyme disease?
<narr> Narrative:
Documents that discuss current prevention and treatment techniques for Lyme disease are relevant. Reports of research on new treatments of the disease are also relevant.
</top>

The actual queries used in generating the collections were:

• t441-x1: "prevent treat Lyme disease"
• t441-x2: "prevent treat tick illness"
• t441-x3: "prevent treat illness"
• t441-x4: "methods producing steel"

Medium difficulty topic - 401

This topic was of medium difficulty. Again, since the title did not fully specify the intended documents, the title-only queries proved more difficult than the description-based queries. Interestingly, for this query, the t2 set contained more relevant documents than the t1 set. The only query terms repeated were "integration" and "germany". When these terms are used as a query by themselves, 44 of the 45 relevant documents are included in the (4000-document) result set.
This had no obvious effect on the results.

<top>
<num> Number: 401
<title> foreign minorities, Germany
<desc> Description:
What language and cultural differences impede the integration of foreign minorities in Germany?
<narr> Narrative:
A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant.
</top>

The actual queries used in generating the collections were:

• t401-x1: "what language cultural differences impede integration foreign minorities germany"
• t401-x2: "what dialect conflicts slow alien integration germany"
• t401-x3: "conflicts slow integration"
• t401-x4: "methods producing steel"

Difficult topic - 437

Difficult queries can be difficult either because they are inherently ambiguous (a poorly-specified query), because the terms of the query are ambiguous ("to be or not to be"), or because little information on the topic is available. This topic appears to fall into the last category.

<top>
<num> Number: 437
<title> deregulation, gas, electric
<desc> Description:
What has been the experience of residential utility customers following deregulation of gas and electric?
<narr> Narrative:
Documents that discuss privatization of government-owned utilities alone are not relevant. Also not relevant are documents that discuss the deregulation of utilities for commercial customers.
</top>

The actual queries used in generating the collections were:

• t437-x1: "experience residential utility customers following deregulation gas electric"
• t437-x2: "experience residential people following deregulation utilities"
• t437-x3: "experience people less regulation"
• t437-x4: "methods producing steel"

B. Appendix: Results of Topic and Size Testing

This appendix contains the similarity and precision scores that serve as a basis for the discussions in Chapters 3 and 4. Collections in the following tables are designated by the name of the topic from which they were derived, with the form tXXX-YN. In the place of XXX is the number of the target topic. Y is replaced by the letter t, s, or l to indicate typical, small, or large size. N is replaced by a number from 1 to 4 that indicates how similar the collection is to the target topic (see Appendix A for details). A suffix of 'rel' indicates that the collection has been modified to include all documents relevant to the target topic. A suffix of 'wst' indicates that the collection has been modified to include none of the documents relevant to the target topic.

For example, the following designations are given to the typical size collections based on topic 401:

• t401-t1 = documents retrieved using topic 401's description as a query
• t401-t2 = documents retrieved by modifying the topic description slightly
• t401-t3 = documents retrieved by modifying the topic description greatly
• t401-t4 = documents retrieved using the description of a completely different topic
In all cases, statistics for the full WT2g collection are included to aid in comparison, although WT2g should not be considered a "small" or "typical" size collection.

Typical size, topic 441 (easy)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          0.9279     0.0202   0.0165         29        0.6500  0.5055
t441-t1rel    0.9174     0.1286   0.1102         29        0.6500  0.6643
t441-t2rel    0.9035     0.0968   0.0709         29        0.8000  0.7447
t441-t3rel    0.9031     0.0995   0.0714         29        0.9000  0.8119
t441-t4rel    0.8378     0.0244   0.0103         29        0.8500  0.8800
t441-t1       0.9173     0.1290   0.1104         29        0.6500  0.6645
t441-t2       0.9013     0.0967   0.0702         16        0.6500  0.4438
t441-t3       0.8946     0.0990   0.0709         10        0.4500  0.3045
t441-t4       0.7807     0.0235   0.0092         3         0.1500  0.0897
t441-t1wst    0.8837     0.1280   0.1083         0         0.0000  0.0000
t441-t2wst    0.8557     0.0961   0.0689         0         0.0000  0.0000
t441-t3wst    0.8505     0.0988   0.0694         0         0.0000  0.0000
t441-t4wst    0.7726     0.0235   0.0090         0         0.0000  0.0000

Typical size, topic 401 (medium)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          0.9167     0.0301   0.0289         45        0.4000  0.2094
t401-t1rel    0.8967     0.0941   0.0945         45        0.6500  0.6194
t401-t2rel    0.8461     0.0707   0.0743         45        0.3000  0.2089
t401-t3rel    0.8389     0.0651   0.0659         45        0.5000  0.3938
t401-t4rel    0.7803     0.0298   0.0196         45        0.9000  0.6547
t401-t1       0.8961     0.0947   0.0913         42        0.4500  0.2267
t401-t2       0.8462     0.0707   0.0742         43        0.3000  0.2057
t401-t3       0.8374     0.0651   0.0617         22        0.4500  0.2070
t401-t4       0.7800     0.0294   0.0155         1         0.0500  0.0019
t401-t1wst    0.8953     0.0952   0.0872         0         0.0000  0.0000
t401-t2wst    0.8440     0.0707   0.0678         0         0.0000  0.0000
t401-t3wst    0.8330     0.0647   0.0564         0         0.0000  0.0000
t401-t4wst    0.7794     0.0294   0.0155         0         0.0000  0.0000

Typical size, topic 437 (difficult)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          0.9909     0.0486   0.0447         14        0.3000  0.1878
t437-t1rel    0.9777     0.1374   0.1435         14        0.2500  0.1590
t437-t2rel    0.9778     0.1052   0.1154         14        0.3000  0.1900
t437-t3rel    0.9332     0.0664   0.0600         14        0.4500  0.4875
t437-t4rel    0.9169     0.0530   0.0529         14        0.6000  0.8022
t437-t1       0.9778     0.1373   0.1433         14        0.2500  0.1590
t437-t2       0.9779     0.1054   0.1151         12        0.3000  0.1857
t437-t3       0.9230     0.0661   0.0547         3         0.1500  0.0637
t437-t4       0.8929     0.0525   0.0490         0         0.0000  0.0000
t437-t1wst    0.9769     0.1375   0.1402         0         0.0000  0.0000
t437-t2wst    0.9767     0.1052   0.1110         0         0.0000  0.0000
t437-t3wst    0.9221     0.0660   0.0526         0         0.0000  0.0000
t437-t4wst    0.8925     0.0525   0.0490         0         0.0000  0.0000
Small size, topic 441 (easy)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          -          0.0202   0.0165         29        0.6500  0.5055
t441-s1rel    0.8486     0.1868   0.1956         29        0.6500  0.5734
t441-s2rel    0.8072     0.1422   0.1130         29        0.7000  0.6629
t441-s3rel    0.8247     0.1458   0.1247         29        0.8000  0.7864
t441-s4rel    0.6447     0.0306   0.0261         29        1.0000  0.9881
t441-s1       0.8501     0.1881   0.2014         26        0.7000  0.5640
t441-s2       0.7963     0.1442   0.1106         9         0.4500  0.2745
t441-s3       0.803      0.1457   0.1169         2         0.1000  0.0690
t441-s4       0.5245     0.0189   0.0102         0         0.0000  0.0000
t441-s1wst    0.8104     0.1868   0.1946         0         0.0000  0.0000
t441-s2wst    0.749      0.1392   0.0982         0         0.0000  0.0000
t441-s3wst    0.7643     0.1452   0.1123         0         0.0000  0.0000
t441-s4wst    0.5384     0.0193   0.0109         0         0.0000  0.0000

Small size, topic 401 (medium)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          -          0.0301   0.0289         45        0.4000  0.2094
t401-s1rel    0.828      0.1030   0.1528         45        0.4500  0.2713
t401-s2rel    0.7287     0.0790   0.0998         45        0.4500  0.3951
t401-s3rel    0.7134     0.0731   0.0938         45        0.8000  0.6182
t401-s4rel    0.6497     0.0406   0.0867         45        1.0000  0.9397
t401-s1       0.8332     0.1036   0.1536         32        0.4500  0.2301
t401-s2       0.7356     0.0794   0.1043         35        0.4000  0.3046
t401-s3       0.6752     0.0770   0.0613         4         0.1000  0.0486
t401-s4       0.5321     0.0267   0.0225         0         0.0000  0.0000
t401-s1wst    0.8263     0.1087   0.1392         0         0.0000  0.0000
t401-s2wst    0.7049     0.0819   0.0707         0         0.0000  0.0000
t401-s3wst    0.6638     0.0739   0.0512         0         0.0000  0.0000
t401-s4wst    0.5333     0.0255   0.0208         0         0.0000  0.0000

Small size, topic 437 (difficult)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  P@20    Avg. Prec.
wt2g          -          0.0486   0.0447         14        0.3000  0.1878
t437-s1rel    0.9032     0.1749   0.2636         14        0.2000  0.1630
t437-s2rel    0.9219     0.1352   0.1873         14        0.2500  0.1788
t437-s3rel    0.8034     0.0764   0.1087         14        0.7000  0.9614
t437-s4rel    0.7784     0.0637   0.0971         14        0.6500  0.9635
t437-s1       0.904      0.1737   0.2610         12        0.1500  0.1569
t437-s2       0.9226     0.1331   0.1838         10        0.2500  0.1488
t437-s3       0.6479     0.0727   0.0499         0         0.0000  0.0000
t437-s4       0.6391     0.0579   0.0420         0         0.0000  0.0000
t437-s1wst    0.8967     0.1810   0.2582         0         0.0000  0.0000
t437-s2wst    0.9161     0.1361   0.1687         0         0.0000  0.0000
t437-s3wst    0.644      0.0719   0.0496         0         0.0000  0.0000
t437-s4wst    0.6459     0.0578   0.0426         0         0.0000  0.0000

Large size, topic 441 (easy)

Collection    CORI Sim.  DF Sim.  Centroid Sim.  Relevant  Size     P@20    Avg. Prec.
wt2g          -          0.0202   0.0165         29        247,491  0.6500  0.5055
t441-l1rel    -          -        -              29        26,454   -       -
t441-l2rel    -          -        -              29        22,434   -       -
t441-l3rel    -          -        -              29        22,100   -       -
t441-l4rel    -          -        -              29        24,199   -       -
t441-l1       0.9279     0.0714   0.0391         29        26,454   0.7000  0.6623
t441-l2       0.9186     0.0638   0.0341         18        22,423   0.6500  0.4713
t441-l3       0.916      0.0642   0.0343         16        22,087   0.7000  0.4623
t441-l4       0.8971     0.0246   0.0179         7         24,177   0.3500  0.2288
t441-l1wst    -          -        -              0         26,425   -       -
t441-l2wst    -          -        -              0         22,405   -       -
t441-l3wst    -          -        -              0         22,071   -       -
t441-l4wst    -          -        -              0         24,170   -       -
Large size, topic 401 (medium)

Collection    CORI sim.  Df sim.  Centroid sim.  Relevant  Size     P@20    Avg. prec.
wt2g          -          0.0301   0.0289         45        247,491  0.4000  0.2094
t401-l1rel    -          0.0780   0.0611         45        40,000   0.3500  0.1896
t401-l2rel    -          0.0529   0.0402         45        23,903   0.2000  0.1342
t401-l3rel    -          -        -              45        15,981   -       -
t401-l4rel    -          -        -              45        24,204   -       -
t401-l1       0.9164     0.0781   0.0612         45        40,000   0.3500  0.1897
t401-l2       0.9020     0.0529   0.0402         45        23,903   0.2000  0.1342
t401-l3       0.8968     0.0451   0.0366         36        15,972   0.3500  0.2159
t401-l4       0.8982     0.0334   0.0304         18        24,177   0.2000  0.1063
t401-l1wst    -          -        -              0         40,000   -       -
t401-l2wst    -          -        -              0         23,858   -       -
t401-l3wst    -          -        -              0         15,936   -       -
t401-l4wst    -          -        -              0         24,159   -       -

Large size, topic 437 (difficult)

Collection    CORI sim.  Df sim.  Centroid sim.  Relevant  Size     P@20    Avg. prec.
wt2g          -          0.0486   0.0447         14        247,491  0.3000  0.1878
t437-l1rel    -          -        -              14        40,000   -       -
t437-l2rel    -          -        -              14        40,000   -       -
t437-l3rel    -          -        -              14        40,000   -       -
t437-l4rel    -          -        -              14        24,180   -       -
t437-l1       0.9905     0.0982   0.0791         14        40,000   0.2500  0.1853
t437-l2       0.9892     0.0760   0.0616         14        40,000   0.3000  0.2056
t437-l3       0.9821     0.0633   0.0506         12        40,000   0.3500  0.2286
t437-l4       0.9789     0.0523   0.0530         11        24,177   0.3500  0.2937
t437-l1wst    -          -        -              0         40,000   -       -
t437-l2wst    -          -        -              0         40,000   -       -
t437-l3wst    -          -        -              0         40,000   -       -
t437-l4wst    -          -        -              0         24,166   -       -

C. Appendix: Results of Focus Testing

The following tables display the scores received by various collections on the four focus measures, as described in Chapter 5. For ease of comparison, collections are grouped by size.

Varying focus

Collection    term-entropy  term-entropy2  retrieved-focus  average-similarity
WT2g          1167          10.0756        0.0657           0.7793
all440        0             8.0637         0.6048           0.9089
all440-435    1076          8.0633         0.4699           0.8454
all440-400    1107          8.0743         0.5013           0.8789
all10         2390          8.9497         0.2149           0.8034
germany       1311          9.7498         0.0836           0.6289
water         930           9.4238         0.0878           0.6182

Typical size

Collection    term-entropy  term-entropy2  retrieved-focus  average-similarity
t441-t1       2056          9.7201         0.1691           0.6859
t441-t2       1996          9.6875         0.1533           0.6804
t441-t3       2000          9.6708         0.1556           0.6825
t441-t4       2174          9.6080         0.1419           0.7061
t401-t1       3005          9.8518         0.2059           0.7297
t401-t2       2491          9.9003         0.1623           0.6952
t401-t3       1911          9.7440         0.1395           0.6817
t401-t4       2174          9.6080         0.1419           0.7061
t437-t1       2341          9.7228         0.1910           0.7231
t437-t2       2354          9.6796         0.1761           0.7127
t437-t3       2111          9.5764         0.1823           0.7012
t437-t4       2174          9.6080         0.1419           0.7061

Small size

Collection    term-entropy  term-entropy2  retrieved-focus  average-similarity
t441-s1       1536          9.2984         0.1691           0.5242
t441-s2       1225          9.1414         0.1450           0.5080
t441-s3       1364          9.1246         0.1868           0.5241
t441-s4       1858          9.4835         0.1374           0.5420
t401-s1       3047          9.5457         0.2444           0.5979
t401-s2       2566          9.5309         0.2135           0.5684
t401-s3       1723          9.2254         0.1841           0.5392
t401-s4       1858          9.4835         0.1374           0.5420
t437-s1       1971          9.2211         0.2622           0.5875
t437-s2       2296          9.2841         0.2372           0.5891
t437-s3       1637          9.2105         0.1722           0.5482
t437-s4       1858          9.4835         0.1374           0.5420
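The four focus measures are defined precisely in Chapter 5. Purely to illustrate the kind of computation involved, and not necessarily the exact definition used in these experiments, the sketch below computes a Shannon-entropy score (in bits) over a collection's term-frequency distribution, which is in the spirit of the term-entropy2 column; the function name and input format are assumptions.

    import math
    from collections import Counter

    def term_entropy(documents):
        """Shannon entropy (bits) of a collection's term-frequency distribution.

        `documents` is an iterable of token lists, one per document in the
        collection; a lower score suggests a more tightly focused vocabulary.
        """
        counts = Counter(token for doc in documents for token in doc)
        total = float(sum(counts.values()))
        return -sum((c / total) * math.log2(c / total) for c in counts.values())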
Large size

Collection    term-entropy  term-entropy2  retrieved-focus  average-similarity
t441-l1       2366          9.8765         0.1423           0.7827
t441-l2       2457          9.8497         0.1455           0.7807
t441-l3       2460          9.8444         0.1470           0.7809
t441-l4       2424          9.9159         0.1325           0.7861
t401-l1       2004          9.9487         0.1309           0.7815
t401-l2       2666          10.0932        0.1301           0.7797
t401-l3       2696          9.9404         0.1440           0.7753
t401-l4       2424          9.9159         0.1325           0.7861
t437-l1       1923          9.8490         0.1203           0.7835
t437-l2       2073          9.8021         0.1328           0.7856
t437-l3       1842          9.7671         0.1272           0.7759
t437-l4       2424          9.9159         0.1325           0.7861

Although not reported in the main text, the best- and worst-case collections for the t401 typical-size collections were also scored on the focus measures, in the hope that they would shed more light on the observed trends. Results are shown below:

Best & worst cases

Collection    term-entropy  term-entropy2  retrieved-focus  average-similarity
t401-t1rel    3035          9.8491         0.2073           0.7305
t401-t2rel    2494          9.9058         0.1651           0.6950
t401-t3rel    1924          9.7465         0.1423           0.6820
t401-t4rel    2377          9.5710         0.1752           0.6963
t401-t1wst    2980          9.8527         0.2041           0.7287
t401-t2wst    2462          9.9074         0.1618           0.6935
t401-t3wst    1880          9.7408         0.1367           0.6800
t401-t4wst    2175          9.6110         0.1406           0.7056

D. Appendix: Results of Context Testing

This appendix details the similarity and performance scores obtained by varying the amount of context available in a query, as described in Chapter 6.

CORI Similarities with queries at various context levels

Collection    title   desc    narr    narr (mod)  title-desc
t441-t1       0.8616  0.9173  0.9240  0.9314      0.9173
t441-t2       0.8303  0.9013  0.9206  0.9193      0.9013
t441-t3       0.8160  0.8946  0.9180  0.9154      0.8946
t441-t4       0.6860  0.7807  0.8940  0.8671      0.7807
t401-t1       0.9801  0.8961  0.8356  0.8349      0.8961
t401-t2       0.9392  0.8462  0.8151  0.8190      0.8462
t401-t3       0.9266  0.8374  0.7994  0.8018      0.8374
t401-t4       0.8343  0.7800  0.7702  0.7542      0.7800
t437-t1       0.9735  0.9778  0.9474  0.9685      0.9778
t437-t2       0.9724  0.9779  0.9506  0.9674      0.9779
t437-t3       0.9123  0.9230  0.9241  0.9341      0.9230
t437-t4       0.8550  0.8929  0.8625  0.8472      0.8929

DF Similarities with queries at various context levels

Collection    title   desc    narr    narr (mod)  title-desc
t441-t1       0.0766  0.1290  0.1294  0.1283      0.1061
t441-t2       0.0345  0.0967  0.0966  0.0862      0.0650
t441-t3       0.0345  0.0990  0.0968  0.0865      0.0661
t441-t4       0.0072  0.0235  0.0689  0.0460      0.0151
t401-t1       0.0633  0.0947  0.0604  0.0503      0.0871
t401-t2       0.0587  0.0707  0.0657  0.0605      0.0723
t401-t3       0.0302  0.0651  0.0657  0.0577      0.0522
t401-t4       0.0194  0.0294  0.0451  0.0376      0.0271
t437-t1       0.0867  0.1373  0.0727  0.0828      0.1217
t437-t2       0.0423  0.1054  0.0730  0.0793      0.0786
t437-t3       0.0161  0.0661  0.0480  0.0448      0.0429
t437-t4       0.0289  0.0525  0.0379  0.0311      0.0440
Differences between the t1 and t4 collections, using CORI similarities

Topic         title   desc    narr    narr (mod)  title-desc
Easy          0.1756  0.1366  0.0300  0.0643      0.1366
Medium        0.1458  0.1161  0.0654  0.0807      0.1161
Difficult     0.1185  0.0849  0.0849  0.1213      0.0849

Differences between the t1 and t4 collections, using DF similarities

Topic         title   desc    narr    narr (mod)  title-desc
Easy          0.0694  0.1054  0.0605  0.0823      0.0910
Medium        0.0439  0.0653  0.0153  0.0127      0.0600
Difficult     0.0578  0.0848  0.0348  0.0517      0.0777

P@20 with Queries at Varying Levels of Context

The following table was used as the basis for evaluating how changing the amount of context in a query affects retrieval from a topically focused collection.

Collection    title   desc    narr    narr (mod)  title-desc
t441-t1       0.6000  0.6500  0.0500  0.4000      0.4500
t441-t2       0.6000  0.6500  0.1000  0.4500      0.6000
t441-t3       0.4000  0.4500  0.1000  0.3500      0.4500
t441-t4       0.1500  0.1500  0.1500  0.1500      0.1500
t401-t1       0.2000  0.4500  0.5500  0.5500      0.4500
t401-t2       0.2500  0.3000  0.5500  0.5500      0.4500
t401-t3       0.3500  0.4500  0.7500  0.6000      0.4500
t401-t4       0.0000  0.0500  0.0500  0.0500      0.0500
t437-t1       0.1000  0.2500  0.1000  0.1000      0.2000
t437-t2       0.0500  0.3000  0.1000  0.1000      0.3000
t437-t3       0.1500  0.1500  0.1000  0.1000      0.1500
t437-t4       0.0000  0.0000  0.0000  0.0000      0.0000
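The column labels in this appendix correspond to increasing amounts of TREC topic text used as the query, as described in Chapter 6. The rough sketch below shows how such query variants might be assembled from the standard TREC topic fields; the dict-based topic format is an assumption, the title-desc variant is assumed to concatenate the title and description, and the manually modified narrative ("narr (mod)") is not reproduced here.

    def query_variants(topic):
        """Build query strings at several of the context levels tabulated above.

        `topic` is assumed to be a dict with the standard TREC topic fields
        ("title", "desc", "narr"). The "narr (mod)" variant involves manual
        edits described in Chapter 6 and is omitted.
        """
        return {
            "title": topic["title"],
            "desc": topic["desc"],
            "narr": topic["narr"],
            "title-desc": topic["title"] + " " + topic["desc"],
        }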
Curriculum Vitae

Ryan E. Scherle

EDUCATION

Ph.D. Computer Science and Cognitive Science, Indiana University, 2006
  Advisors: Dr. David B. Leake and Dr. Michael Gasser
M.S. Computer Science, Indiana University, 1998
B.S. Computer Science, summa cum laude, Rose-Hulman Institute of Technology, 1996
  Concentration certificate: Imaging systems

TECHNICAL EXPERIENCE

Software Analyst/Programmer: Digital Library Infrastructure, Digital Library Program, Indiana University, Bloomington, IN, 2005-present.
  Design and develop a repository for digital collections. Coordinate the design and development of tools to manage collections and deliver them to users, including cataloging tools, presentation tools, and tools for archival storage management.

Software Analyst/Programmer: Digital Music Library, Digital Library Program, Indiana University, Bloomington, IN, 2002-2005.
  Designed and developed software for storing, searching, and interacting with musical information. Maintained server software, improving its speed, functionality, and stability. Designed and integrated changes to the data model. Coordinated the development group's documentation and automated quality-control system.

Independent Consultant, Scherle Consulting, Bloomington, IN, 1995-2002.
  Provided computer support and service for small businesses and organizations. Projects included conversion of data from legacy systems, creation of custom database systems, conducting surveys, and selection of business software.

Web Developer, Jasper Communications, Jasper, IN, 1996-1997.
  Designed and implemented commercial websites, including retail sales sites and a real-time weather tracking system.

Novell Manager, Waters Computing Center, Terre Haute, IN, 1993-1996.
  Installed network operating systems and applications. Maintained public computing labs and faculty computers. Trained three network administrators and over 60 help desk operators.

TEACHING EXPERIENCE

Instructor and Course Developer, Introduction to Computing, 1996-1997, 2000-present.
  Taught 90 students computer applications in hands-on labs. Taught over 600 students from a wide variety of backgrounds via distance learning.

Instructor, Introduction to Programming, 2002.
  Taught 130 students basic programming techniques in Java. Supervised seven associate instructors.

Departmental Tutor, various courses, 1998-2001.
  Provided individual tutoring to Computer Science students. Assisted students with physical disabilities and learning disabilities.

Instructor, Discrete Structures for Computer Science, 1998.
  Taught 40 students mathematical techniques used in computer science. Supervised one associate instructor.

Associate Instructor, Introduction to Computer Science, 1997-1998.
  Assisted in teaching and grading for a course on basic programming techniques.

PUBLICATIONS

Looking for a Haystack: Selecting Data Sources in a Distributed Retrieval System. Ryan Scherle. Ph.D. Dissertation, Indiana University. 2006.

Variations2: Retrieving and Using Music in an Academic Setting. Jon W. Dunn, Donald Byrd, Mark Notess, Jenn Riley, and Ryan Scherle. Communications of the ACM 49(8). 2006.

The Anatomy of a Bibliographic Search System for Music. Ryan Scherle and Donald Byrd. In Proceedings of the 2004 International Conference on Music Information Retrieval (ISMIR 2004).

V2V: A Second Variation on Query-by-Humming. William P. Birmingham, Kevin O'Malley, Jon W. Dunn, and Ryan Scherle. In Proceedings of the 2003 Joint Conference on Digital Libraries (JCDL2003).

Introduction to Computing, course learning guide. Rod Stark and Ryan Scherle. Indiana University Independent Study Program. 2002.

Towards Context-Based Search Engine Selection. David B. Leake and Ryan Scherle. In Proceedings of the 2001 Conference on Intelligent User Interfaces (IUI2001).

Selecting Task-Relevant Sources for Just-in-Time Retrieval. David Leake, Ryan Scherle, Jay Budzik, and Kristian Hammond. In Proceedings of the AAAI-99 Workshop on Intelligent Information Systems, 1999.

PRESENTATIONS

Searching Digital Collections via SRU. Ryan Scherle and Randall Floyd. Indiana University Digital Library Brown Bag Series. October 2006.

A Fedora Architecture to Support Diverse Collections. Ryan Scherle and Jon W. Dunn. Fedora Users Conference 2006.

Reexamining Digital Library Infrastructure at IU. Jon W. Dunn, Ryan Scherle, and Eric Peters. Indiana University Digital Library Brown Bag Series. November 2005.
Flipping the Switch: Lessons Learned from a Major Digital Library Migration Project. Jon W. Dunn, Ryan Scherle, and Mark Notess. Digital Library Federation Fall Forum, 2005.

PROFESSIONAL MEMBERSHIPS

American Association for Artificial Intelligence, 1997-present
Association for Computing Machinery, 1992-present

PROFESSIONAL SERVICE

Reviewer, AI Magazine, 2002, 2006
Reviewer, International Conference on Intelligent Information Technology, 2002
Reviewer, Conference on Intelligent User Interfaces, 2002
Reviewer, International Conference on Case-Based Reasoning, 2001
Associate Editor, Van Nostrand's Scientific Encyclopedia, 2000
Reviewer, Journal of the Learning Sciences, 2000
Volunteer, National Conference of the American Association for Artificial Intelligence, 1998, 1999
Volunteer, Second International Conference on Case-Based Reasoning, 1997

SELECTED HONORS AND AWARDS

Fellowship, US Department of Education, Graduate Assistance in Areas of National Need (GAANN), 1998-2002
Scholarship, American Association for Artificial Intelligence (AAAI), 1998, 1999
Honorable Mention, National Science Foundation Graduate Fellowship Program, 1997
Scholarship, International Conference on Case-Based Reasoning (ICCBR), 1997
Scholarship, Cognitive Science Department, Indiana University, 1996, 1997
Addison-Wesley Book Award, Rose-Hulman Institute of Technology, 1996
Microsoft Senior Award, Rose-Hulman Institute of Technology, 1996
Presidential Scholarship, Rose-Hulman Institute of Technology, 1992-1996
Member, Upsilon Pi Epsilon (Computer Science honorary)
Member, Pi Mu Epsilon (Math honorary)
Member, Iota Nu Phi (Informatics honorary)

DEPARTMENTAL SERVICE

President, Computer Science Graduate Student Association, 1999-2000
Secretary, Computer Science Graduate Student Association, 1996-1997
Member, Departmental Graduate Admissions Committee, 2001
Member, Departmental Undergraduate Curriculum Committee, 1999-2000
Graduate Student Mentor, 1997-1999
President, Student Chapter of the Association for Computing Machinery, 1995
Coordinator, Midwest Invitational Programming Contest, 1995

COMMUNITY SERVICE

Volunteer Coordinator, Computer Tutoring Project, Area 10 Agency on Aging, 1997-2000.
  Taught computing techniques to senior citizens. Managed a team of 15 volunteer tutors.

Member, Secretary, and President, Tau Lambda chapter of Alpha Phi Omega, National Service Fraternity, 1993-1996.
  Presided over a chapter with 45 members. Wrote and presented a bid to host the regional conference. Participated in a variety of service projects.
TECHNICAL SKILLS

Languages: Java, C/C++, SQL, Visual Basic, Perl, PHP, Scheme, Ada, Common Lisp, Pascal, various assembly languages, XML
Operating systems: Windows, Linux, Unix, OS X, AIX, Novell, MS-DOS, NeXTStep, VMS
Databases: DB2, Microsoft Access, PGSQL, SQL Server