Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Bhaskar Mitra
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London.
Department of Computer Science University College London
March 19, 2021
I, Bhaskar Mitra, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work.
Dedicated to Ma and Bapu
Abstract
Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents—or short passages—in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms—such as a person's name or a product model number—not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections—such as the document index of a commercial Web search engine—containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve quickly from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers.
In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature. Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle, which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework that incorporates query term independence into any arbitrary deep model, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets and explores the application of neural methods to other IR tasks, such as query auto-completion.

Impact Statement
The research presented in this thesis was conducted while the author was employed at Bing, a commercial web search engine. Many of the research questions investigated here have consequently been motivated by real-world challenges in building large-scale retrieval systems. The insights gained from these studies have informed, and continue to influence, the design of retrieval models in both industry and academia.
Duet was the first to demonstrate the usefulness of deep representation learning models for document ranking. Since then, the academic community has continued to build on those early results, culminating in large improvements in retrieval quality from neural methods over traditional IR approaches at the TREC 2019 Deep Learning track. Many of these deep models have subsequently been deployed at the likes of Bing and Google. Similarly, our foray into efficient neural architectures is both of academic interest and critical to increasing the scope of impact of these computation-heavy models.
Search systems do not just exist in laboratory environments but are inherently sociotechnical instruments that mediate what information is accessible and consumed. This places an enormous responsibility on these systems to ensure that retrieved results are representative of the collections being searched, and that retrieval is performed in a manner fair to both content producers and consumers. Lack of attention to these facets may lead to serious negative consequences—e.g., the formation of information filter bubbles or visible demographic bias in search results. The capability to directly optimize for expected exposure may be key to addressing some of these concerns.
Finally, to accurately qualify the impact of our research, we must look beyond published empirical and theoretical results for contributions. Science is not an individual endeavour, and therefore any evaluation of research must also encompass the impact of the artifacts produced by said research on its immediate field of study and the academic community around it. A more meaningful evaluation of our contributions, therefore, requires juxtaposing how the field of neural information retrieval has evolved during the course of this research with how that evolution has been supported directly by our work. In 2016, when neural IR was still an emerging area, we organized the first workshop focused on this topic at the ACM SIGIR conference. That year, approximately 8% of the published papers at SIGIR were related to this topic. In contrast, this year at the same conference about 79% of the publications employed neural methods. Our research has directly contributed to this momentum in several ways—including building standard task definitions and benchmarks in the form of MS MARCO and the TREC Deep Learning track, which have been widely adopted as primary benchmarks by the community working on deep learning for search. More recently, we have also released a large user behavior dataset, called ORCAS, that may enable training even larger and more sophisticated neural retrieval models. We have organized several workshops and tutorials to bring together researchers whose work cuts across the information retrieval, natural language processing, and machine learning domains to build a community around this topic. Our early survey of neural IR methods also resulted in an instructive manuscript that both summarized important progress and charted out key directions for the field.

Acknowledgements
It takes a proverbial village to teach the skills necessary to do good science and become an independent researcher. So, when I pause and look back at my academic journey thus far, I am filled with gratitude towards every person who professionally and personally supported, mentored, and inspired me on this journey. This thesis is as much a product of the last four years of my own research as it is the fruit of the time and labor that others invested in me on the way.
I am grateful to my supervisor Emine Yilmaz not just for her mentorship and guidance during my PhD but also for her initial encouragement to pursue a doctorate. I remember my apprehension about pursuing a part-time degree while employed full-time. Emine did not just believe that I could succeed in this endeavour, but also helped me find the confidence to pursue it. My PhD journey would not even have started without her support and guidance. I am grateful to her for placing that enormous trust in me and then guiding me with patience and kindness. I thank her for the countless technical discussions and insights, and for all the close collaboration, including organizing the TREC Deep Learning track.
A few years before the start of my PhD journey, while I was still a ranking engineer at Bing, I walked into Nick Craswell's office one day and expressed my interest in joining his applied science team in Cambridge (UK) to “learn the ropes” of how to do research. The trust Nick placed in me that day, when he encouraged me to pursue that dream, altered the course of my professional career. Seven years later, I am still proud to be part of Nick's team and grateful for having grown under his mentorship. Nick has constantly encouraged and supported my dreams of pursuing a career in research, often going out of his way to do so. I co-authored my
first book with Nick. We co-organized the first deep learning workshop at SIGIR together. We have collaborated on numerous projects and papers over the years. His insightful comments and thoughtful feedback have always served as important lessons for me on how to do good science.
I moved to Cambridge in the summer of 2013 with limited understanding of what it takes to be a researcher or to publish. I am grateful that in those early formative years of my research career, Milad Shokouhi, Filip Radlinski, and Katja Hofmann took me under their wings. I co-authored my first two peer reviewed publications with the three of them. All the advice and lessons I received during that time stuck with me and continue to influence and guide my research to this day. I am grateful to all three of them for their patient and thoughtful mentorship.
I am indebted to Fernando Diaz for his invaluable mentorship over the years and for the many critical lessons on doing good science and good scholarship. All the projects or publications we have collaborated on have been incredibly instructive. Fernando continues to shape my research agenda and my personal vision of what kind of a researcher I want to be. I want to thank Fernando for the enormous trust that he placed in me by taking me under his wing. I have cherished him both as a mentor and a collaborator, and hope for continued collaborations in the future.
David Hawking deserves a special mention in this thesis. I was fortunate to receive David's mentorship around the time when I was still juggling personal doubts on whether (and how) to do a PhD. His recommendations and guidance had a strong influence on my eventual plans for the doctorate and even how I structure this thesis. David has made no secret of his eagerness for the day when he can finally congratulate me on a successful completion of my PhD. In fact, he has expressed that excitement since the day University College London (UCL) accepted my PhD application. So, not only am I filled with immense gratitude for David's guidance in helping me realize this important professional milestone, but I am also excited about finally celebrating this day with him that has been four years in the making.
Over the years, I have been privileged to collaborate with and receive coaching from many other incredible members of the research community. Susan Dumais is a pillar of the computer science community and I have long considered her both a mentor and a role model. Rich Caruana is another incredible mentor and role model whose passion, be it for science or sailing, is highly contagious. Over the years, I have also benefited from valuable feedback and advice from Peter Bailey, Paul Bennett, Bodo Billerbeck, Gabriella Kazai, Emre Kiciman, Widad Machmouchi, Vanessa Murdock, Paul Thomas, and Ryen White.
I joined Microsoft straight after my undergraduate degree and spent the last 14 years working across different labs in India, USA, UK, and Canada. I cherished all my colleagues across these teams. But two people, Himabindu Thota and Elbio Abib, deserve special gratitude for their influential leadership and mentorship early on in my career.
I have learnt important lessons from all my collaborators and academic colleagues, including but not limited to Bruno Abrahao, Amit Agarwal, Jaime Arguello, Ahmed Hassan Awadallah, Payal Bajaj, Nicholas Belkin, Asia J. Biega, Alexey Borisov, Daniel Campos, Nicola Cancedda, Ben Carterette, Jason Ingyu Choi, Charles Clarke, Daniel Cohen, W. Bruce Croft, Li Deng, Maarten de Rijke, Mostafa Dehghani, Laura Dietz, Shiri Dori-Hacohen, Michael D. Ekstrand, Benjamin Fish, Emery Fine, Adam Ferguson, Jiafeng Guo, Jianfeng Gao, Gargi Ghosh, Michael Golebiewski, Tereza Iofciu, Rosie Jones, Damien Jose, Surya Kallumadi, Tom Kenter, Grzegorz Kukla, Jianxiang Li, Ruohan Li, Xiaodong Liu, Xue Liu, Gord Lueck, Chen Ma, Rangan Majumder, Clemens Marschner, Rishabh Mehrotra, Piotr Mirowski, Eric Nalisnick, Federico Nanni, Sudha Rao, Navid Rekabsaz, Roy Rosemarin, Mir Rosenberg, Corby Rosset, Frank Seide, Chirag Shah, Grady Simon, Xia Song, Alessandro Sordoni, Luke Stark, Hussein Suleman, Rohail Syed, Saurabh Tiwary, Christophe Van Gysel, Matteo Venanzi, Manisha Verma, Ellen Voorhees, Haolun Wu, Chenyan Xiong, Dong Yu, and Hamed Zamani.
I would also like to thank UCL for awarding me the Dean's Prize Scholarship, and all the faculty and staff at UCL for their support. I am grateful to my secondary supervisor David Barber and to Jun Wang, who served as my internal examiner for the first-year viva and the transfer viva, for their invaluable feedback. A special thanks goes out to Sarah Turnbull for her vital support in enabling me to pursue this doctorate as a part-time student, and as a remote student for the final year. I am thankful to Claudia Hauff and Mounia Lalmas, who served as examiners for my final viva examination, for their invaluable feedback and advice that helped me to significantly improve this manuscript. I am grateful to Mona Soliman Habib and Kathleen McGrow for their vital encouragement to pursue a doctorate long before I started on this journey.

Finally, my parents have sacrificed more towards my education and academic training than I personally can in this lifetime. I am grateful for the gifts of patience, kindness, and love that I received from them while growing up. My parents are without doubt the two happiest people I have ever met and I am grateful that they infected me (a little) with their undiluted enthusiasm for life. I thank them for their love and support, but most importantly for teaching me what is important in life.

I want to conclude by acknowledging the unprecedented circumstances facing the world at the time of writing this thesis. The world is still gripped by the coronavirus disease 2019 (COVID-19) pandemic. I am grateful to all the medical and essential workers who are risking their lives daily to keep everyone safe and alive, and I grieve with everyone who has lost loved ones during this pandemic. At the same time, millions of voices are protesting on the streets calling for gender equality, racial justice, immigration justice, and climate justice in the face of unprecedented existential threats to our planet. Many of us who work in science and technology are reevaluating the impact and influence of our own work on society and the world. These are uncertain times but also a historic moment when people around the world are coming together in unprecedented numbers fighting to build a kinder and more just world for everyone, and for that I am most grateful.

Contents
1 Introduction
  1.1 Contributions
  1.2 Evaluation tasks
    1.2.1 Ad hoc retrieval
    1.2.2 Question-answering
  1.3 Notation
  1.4 Metrics

2 Motivation
  2.1 Desiderata of IR models
    2.1.1 Semantic matching
    2.1.2 Robustness to rare inputs
    2.1.3 Robustness to variable length text
    2.1.4 Efficiency
    2.1.5 Parity of exposure
    2.1.6 Sensitivity to context
    2.1.7 Robustness to corpus variance
  2.2 Designing neural models for IR

3 Background
  3.1 IR Models
    3.1.1 Traditional IR models
    3.1.2 Anatomy of neural IR models
  3.2 Unsupervised learning of term representations
    3.2.1 A tale of two representations
    3.2.2 Notions of similarity
    3.2.3 Observed feature spaces
    3.2.4 Embeddings
  3.3 Term embeddings for IR
    3.3.1 Query-document matching
    3.3.2 Query expansion
  3.4 Supervised learning to rank
    3.4.1 Input features
    3.4.2 Loss functions
  3.5 Deep neural networks
    3.5.1 Input text representations
    3.5.2 Architectures
    3.5.3 Neural toolkits
  3.6 Deep neural models for IR
    3.6.1 Document auto-encoders
    3.6.2 Siamese networks
    3.6.3 Interaction-based networks
    3.6.4 Lexical matching networks
    3.6.5 BERT
  3.7 Conclusion

4 Learning to rank with Duet networks
  4.1 The Duet network
    4.1.1 Local subnetwork
    4.1.2 Distributed subnetwork
    4.1.3 Optimization
  4.2 Experiments
    4.2.1 Data
    4.2.2 Training
    4.2.3 Baselines
    4.2.4 Evaluation
  4.3 Results
  4.4 Further improvements
    4.4.1 Duet on MS MARCO
    4.4.2 Duet on TREC Deep Learning track
  4.5 Discussion
  4.6 Conclusion

5 Retrieve, not just rerank, using deep neural networks
  5.1 Query term independence assumption
  5.2 Related work
  5.3 Model
  5.4 Experiments
    5.4.1 Task description
    5.4.2 Baseline models
  5.5 Results
  5.6 Conclusion

6 Stochastic learning to rank for target exposure
  6.1 Related work
  6.2 Expected exposure metrics
  6.3 Optimizing for target exposure
    6.3.1 Individual exposure parity
    6.3.2 Group exposure parity
  6.4 Experiments
    6.4.1 Models
    6.4.2 Data
    6.4.3 Evaluation
  6.5 Results
  6.6 Conclusion

7 Learning to Rank for Query Auto-Completion
  7.1 Query Auto-Completion for Rare Prefixes
    7.1.1 Related work
    7.1.2 Model
    7.1.3 Method
    7.1.4 Experiments
    7.1.5 Results
    7.1.6 Conclusion
  7.2 Session Context Modelling for Query Auto-Completion
    7.2.1 Related work
    7.2.2 Model
    7.2.3 Experiments
    7.2.4 Features
    7.2.5 Results
    7.2.6 Discussion
    7.2.7 Conclusion

8 Benchmarking for neural IR
  8.1 TREC Deep Learning track
  8.2 Datasets
  8.3 Results and analysis
  8.4 Conclusion

9 General Conclusions
  9.1 A summary of our contributions
  9.2 The Future of neural IR

Appendices

A Published work

Bibliography

List of Figures
2.1 Zipfian distributions in search data
2.2 Document length distribution

3.1 Anatomy of a neural IR model
3.2 Taxonomy of different neural approaches to IR
3.3 Local and distributed representations of terms
3.4 Feature-based distributed representations
3.5 Geometric interpretation of vector space representations
3.6 Notions of similarity in vector representations
3.7 Analogies using vector algebra
3.8 Architecture of the word2vec model
3.9 Architecture of the paragraph2vec model
3.10 Evidence of relevance from non-query terms
3.11 Strengths and weaknesses of term embedding based matching
3.12 Global vs. query-specific embeddings in query expansion
3.13 A simple neural network
3.14 Demonstration of the need for hidden layers
3.15 Input representations of text for DNNs
3.16 Shift-invariant neural architectures
3.17 Auto-encoder and Siamese Networks
3.18 Variational autoencoder
3.19 Interaction matrix
3.20 Lexical and semantic term matching for ranking

4.1 Word importance for the Duet
4.2 Architecture of the Duet network
4.3 Interaction matrix of query-document exact matches
4.4 Effect of judged vs. random negatives on the Duet model
4.5 Duet with multiple fields (DuetMF)
4.6 Performance of Duet by segments
4.7 Principal component analysis of IR models
4.8 Effect of training data on the performance of Duet

5.1 Incorporating QTI assumption in black-box models

6.1 Optimizing for target exposure

7.1 CDSSM architecture
7.2 Candidate generation for QAC
7.3 QAC performance by prefix popularity
7.4 Visualization of similar intent transitions from search logs
7.5 Popularity of intent transitions in the search logs
7.6 CDSSM performance by query length
7.7 CDSSM performance by embedding size
7.8 Exploring or struggling?

8.1 Growth of neural IR papers
8.2 Comparison of nnlm, nn, and trad runs
8.3 Per query comparison for document retrieval task
8.4 Per query comparison for passage retrieval task
8.5 Visualizing inter-run similarity using t-SNE
8.6 Analysis of “fullrank” vs. “rerank” settings
8.7 Metrics agreement by group
8.8 Metrics agreement for the document retrieval task
8.9 Metrics agreement for the passage retrieval task

9.1 Duckupine

List of Tables
1.1 Notation

3.1 Toy corpus of short documents
3.2 Different notions of similarity in the word2vec embedding space
3.3 Different notions of similarity in the DSSM embedding space

4.1 Statistics of the datasets for the document ranking task
4.2 Performance of the Duet model on the document ranking task
4.3 Performance of Duet on TREC CAR
4.4 Duet on MS MARCO
4.5 Duet on TREC 2019 Deep Learning track

5.1 Deep models with QTI assumption for re-ranking
5.2 Duet with QTI for full retrieval

6.1 Optimizing for exposure parity

7.1 Suffix recommendation based QAC
7.2 Typical and Topical CDSSM
7.3 Popular query suffixes
7.4 Results on QAC for rare prefixes
7.5 Clusters of intent transitions from search logs
7.6 Analogies using query embeddings
7.7 CDSSM results on session modelling for QAC
7.8 CDSSM performance by length of history
7.9 Win-loss analysis of CDSSM on session modelling for QAC

8.1 TREC 2019 Deep Learning track datasets
8.2 TREC 2019 Deep Learning track runs

Chapter 1
Introduction
Over the last decade, there have been dramatic improvements in performance in computer vision, speech recognition, and machine translation tasks, witnessed in research and in real-world applications [20–24]. These breakthroughs were largely fuelled by recent advances in neural network models, usually with multiple hidden layers, known as deep architectures, combined with the availability of large datasets [25] and cheap compute power for model training. Exciting novel applications, such as conversational agents [26, 27], have also emerged, as well as game-playing agents with human-level performance [28, 29]. Work has now begun in the information retrieval (IR) community to apply these neural methods, leading to the possibility of advancing the state of the art or even achieving breakthrough performance as in these other fields.
Retrieval of information can take many forms [30]. Users can express their information need in the form of a text query—by typing on a keyboard, by selecting a query suggestion, or by voice recognition—or the query can be in the form of an image, or in some cases the need can be implicit. Retrieval can involve ranking existing pieces of content, such as documents or short-text answers, or composing new responses incorporating retrieved information. Both the information need and the retrieved results may use the same modality (e.g., retrieving text documents in response to keyword queries), or be different (e.g., image search using text queries). The information within the document text may be semi-structured, and the organization scheme may be shared between groups of documents in the collection—e.g., web pages from the same domain [31]. If the query is ambiguous, the retrieval system may consider user history, physical location, temporal changes in information, or other context when ranking results. IR systems may also help users formulate their intent (e.g., via query auto-completion or query suggestion) and can extract succinct summaries of results that take the user's query into account. Neural IR refers to the application of shallow or deep neural networks to these retrieval tasks.
We note that many natural language processing (NLP) tasks exist that are not IR. Machine translation of text from one human language to another is not an IR task. However, translation could be used in an IR system, to enable cross-language retrieval on a multilingual corpus [32]. Inferring attributes of a named entity [33], from text or graph-structured data, is not an IR task in itself. However, an IR system could use inferred entity attributes to enhance its performance on IR tasks. In general, many NLP tasks do not involve information access and retrieval, so are not IR tasks, but some can still be useful as part of a larger IR system.
In this thesis, we focus on neural methods that employ deep architectures to retrieve and rank documents in response to a query, an important IR task. A search query may typically contain a few terms, while the document length, depending on the scenario, may range from a few terms to hundreds of sentences or more. Neural models for IR use learned latent representations of text, and usually contain a large number of parameters that need to be tuned. ML models with large sets of parameters typically benefit from large quantities of training data [34–38]. Unlike traditional learning to rank (LTR) approaches [39] that train ML models over a set of hand-crafted features, recent neural models for IR typically accept the raw text of a query and document as input. Learning suitable representations of text also demands large-scale datasets for training [7]. Therefore, unlike classical IR models, these neural approaches tend to be data hungry, with performance that improves with more training data.
In other fields, the design of neural network models has been informed by characteristics of the application and data. For example, the datasets and successful architectures are quite different in visual object recognition, speech recognition, and game playing agents. While IR shares some common attributes with the field of NLP, it also comes with its own set of unique challenges. Effective IR systems must deal with the query-document vocabulary mismatch problem, by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms—such as a person's name or a product model number—not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections—such as the document index of a commercial Web search engine—containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve quickly from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers.
In our work, we focus on methods using deep neural networks for document ranking, and to a lesser extent other retrieval tasks. We identify key challenges and principles which motivate our design of novel neural approaches to ranking. We study these proposed methods with respect to retrieval quality, query response time, and exposure disparity. In Section 1.1, we summarize our key contributions. The remainder of this chapter is dedicated to describing the problem formulation. In Section 1.2, we provide an overview of the IR tasks that we use for evaluation. In Section 1.3, we describe a set of common notations that we use in the remainder of this thesis. Finally, we describe relevant IR metrics in Section 1.4.
1.1 Contributions
In this thesis, we explore neural network based approaches to IR. This section summarizes the key research contributions of this thesis by chapter. Where appropriate, we also cite the publications that form the basis of that chapter.
• Chapters 1-3 are based on the book by Mitra and Craswell [1]—and corresponding tutorials [2–6]. The current chapter introduces key IR tasks, evaluation metrics, and mathematical notations that are referenced throughout this thesis. Chapter 2 presents our motivation for exploring neural IR methods. Chapter 3 provides a survey of existing literature on neural and traditional non-neural approaches to IR. Key concepts related to IR models and neural representation learning are explained. These three chapters have no novel theoretical or algorithmic contributions, but provide a detailed overview of the field that also serves as the background for the remaining sections.
• Chapter 4 is based on [7–10] and emphasizes the importance of incorporating evidence based on both patterns of exact query term matches in the document as well as the similarity between query and document text based on learned latent representations for retrieval. We operationalize this principle by proposing a deep neural network architecture, called Duet, that jointly learns two deep neural networks focused on matching using lexical and latent representations of text, respectively. We benchmark the proposed model on: (i) Bing document ranking task, (ii) TREC Complex Answer Retrieval task, (iii) MS MARCO passage ranking task, and (iv) TREC 2019 Deep Learning track document and passage ranking tasks and demonstrate that estimating relevance by inspecting both lexical and latent matches performs better than considering only one of those aspects for retrieval.
• Chapter 5 is based on [13, 40] and studies neural methods in the context of retrieval from the full collection, instead of just reranking. In particular, we study the impact of incorporating the query term independence (QTI) assumption in neural architectures. We find that incorporating the QTI assumption in several deep neural ranking models results in minimal (or no) degradation in ranking effectiveness. However, under the QTI assumption, the learned ranking functions can be combined with specialised IR data structures, e.g., the inverted index, for fast and scalable candidate generation in the first stage of retrieval. We benchmark on: (i) the MS MARCO passage ranking task and (ii) the TREC 2019 Deep Learning track to demonstrate that neural methods can be employed for not only more effective but also more efficient candidate generation.
• Chapter 6 is based on [14] and studies learning to rank with neural networks in the context of stochastic ranking. Due to presentation bias, a static ranked list of results may cause large differences in exposure between items of similar relevance. We present a stochastic ranking framework that can optimize towards exposure targets under different constraints—e.g., individual and group exposure parity. While the original study [14] is a collaborative project focusing on the expected exposure metric, this chapter summarizes our key contributions related to the framework of model optimization for individual and groupwise parity of expected exposure.
• Chapter 7, based on [18, 19], looks at the application of neural IR beyond ad hoc retrieval—to the query auto-completion (QAC) task. The ranking task, in the case of QAC, poses challenges that are different from those in query-document or query-passage matching. In this chapter, we study two applications of deep models for QAC: (i) recommending completions for rare query prefixes and (ii) modeling query reformulations for session context-aware QAC.
• Chapter 8 summarizes findings from our recent efforts on large-scale benchmarking of deep neural IR methods at TREC [15]. The TREC Deep Learning track [15, 16] provides a strict blind evaluation for IR methods that take advantage of large supervised training datasets, and has been instrumental in demonstrating the superior retrieval quality of many recent neural methods proposed by the research community.
• Finally, in Chapter 9, we conclude with a discussion on the future of neural IR research. In this chapter, we reflect on the progress we have already made as a field and provide some personal perspectives on the road ahead.

1.2 Evaluation tasks
We focus on text retrieval in IR, where the user enters a text query and the system returns a ranked list of search results. Search results may be passages of text or full text documents. The system’s goal is to rank the user’s preferred search results at the top. This problem is a central one in the IR literature, with well-understood challenges and solutions. Text retrieval methods for full text documents and for short text passages have application in ad hoc retrieval systems and question answering systems, respectively. We describe these two tasks in this section.
1.2.1 Ad hoc retrieval
Ranked document retrieval is a classic problem in information retrieval, as in the main task of the Text Retrieval Conference [41], and performed by commercial search engines such as Google, Bing, Baidu, and Yandex. TREC tasks may offer a choice of query length, ranging from a few terms to a few sentences, whereas search engine queries tend to be at the shorter end of the range. In an operational search engine, the retrieval system uses specialized index structures to search potentially billions of documents. The ranked results are presented in a search engine results page (SERP), with each result appearing as a summary and a hyperlink. The engine can instrument the SERP, gathering implicit feedback on the quality of search results such as click decisions and dwell times.

A ranking model can take a variety of input features. Some ranking features may depend on the document alone, such as how popular the document is with users, how many incoming links it has, or to what extent the document seems problematic according to a Web spam classifier. Other features depend on how the query matches the text content of the document. Still more features match the query against document metadata, such as the referred text of incoming hyperlink anchors, or the text of queries from previous users that led to clicks on this document. Because anchors and click queries are a succinct description of the document, they can be a useful source of ranking evidence, but they are not always available. A newly created document would not have much link or click text. Also, not every document is popular enough to have past links and clicks, but it still may be the best search result for a user's rare or tail query. In such cases, when text metadata is unavailable, it is crucial to estimate the document's relevance primarily based on its text content.

In the text retrieval community, retrieving documents for short-text queries by considering the long body text of the document is an important challenge. These ad hoc retrieval tasks have been an important part of the Text REtrieval Conference (TREC) [42], starting with the original tasks searching newswire and government documents, and later with the Web track1 among others. The TREC participants are provided a set of, say fifty, search queries and a document collection containing 500-700K newswire and other documents. Top ranked documents retrieved for each query from the collection by different competing retrieval systems are assessed by human annotators based on their relevance to the query. Given a query, the goal of the IR model is to rank documents with better assessor ratings higher than the rest of the documents in the collection. In Section 1.4, we describe standard IR metrics for quantifying model performance given the ranked documents retrieved by the model and the corresponding assessor judgments for a given query.

1 http://www10.wwwconference.org/cdrom/papers/317/node2.html
1.2.2 Question-answering
Question-answering tasks may range from choosing between multiple choices (typically entities or binary true-or-false decisions) [43–46] to ranking spans of text or passages [47–51], and may even include synthesizing textual responses by gathering evidence from one or more sources [52, 53]. TREC question-answering experiments [47] have participating IR systems retrieve spans of text, rather than documents, in response to questions. IBM's DeepQA [51] system—behind the Watson project that famously demonstrated human-level performance on the American TV quiz show, “Jeopardy!”—also has a primary search phase, whose goal is to find as many potentially answer-bearing passages of text as possible. With respect to the question-answering task, the scope of this thesis is limited to ranking answer-containing passages in response to natural language questions or short query texts.

Retrieving short spans of text poses different challenges than ranking documents. Unlike the long body text of documents, single sentences or short passages tend to be on point with respect to a single topic. However, answers often tend to use different vocabulary than the one used to frame the question. For example, the span of text that contains the answer to the question “what year was Martin Luther King Jr. born?” may not contain the term “year”. However, the phrase “what year” implies that the correct answer text should contain a year—such as ‘1929’ in this case. Therefore, IR systems that focus on the question-answering task need to model the patterns expected in the answer passage based on the intent of the question.
The focus of this thesis is on ad hoc retrieval, and to a lesser extent on question-answering. However, neural approaches have shown interesting applications to other existing retrieval scenarios, including query recommendation [54], modelling diversity [55], modelling user click behaviours [56], entity ranking [57, 58], knowledge-based IR [59], and even optimizing for multiple IR tasks [60]. In addition, recent trends suggest that advancements in deep neural network methods are also fuelling emerging IR scenarios such as proactive recommendations [61–63], conversational IR [64, 65], and multi-modal retrieval [66]. Neural methods may have an even bigger impact on some of these other IR tasks. To demonstrate that neural methods are useful in IR—beyond the document and passage ranking tasks—we also present, in this thesis, a brief study on employing deep models for the QAC task in Chapter 7.
1.3 Notation
We adopt some common notation for this thesis, shown in Table 1.1. We use lowercase to denote vectors (e.g., \vec{x}) and uppercase for tensors of higher dimensions (e.g., X). The ground truth rel_q(d) in Table 1.1 may either be based on manual relevance annotations or be implicitly derived from user behaviour on the SERP (e.g., from clicks).
Table 1.1: Notation used in this thesis.
Meaning                                              Notation
Single query                                         q
Single document                                      d
Set of queries                                       Q
Collection of documents                              D
Term in query q                                      t_q
Term in document d                                   t_d
Full vocabulary of all terms                         T
Set of ranked results retrieved for query q          R_q
Result tuple (document d at rank i)                  ⟨i, d⟩, where ⟨i, d⟩ ∈ R_q
Relevance label of document d for query q            rel_q(d)
d_i is more relevant than d_j for query q            rel_q(d_i) > rel_q(d_j), or d_i ≻_q d_j
Frequency of term t in document d                    tf(t, d)
Number of documents that contain term t              df(t)
Vector representation of text z                      \vec{v}_z
Probability function for an event E                  p(E)
1.4 Metrics
A large number of IR studies [67–74] have demonstrated that users of retrieval systems tend to pay attention mostly to top-ranked results. IR metrics, therefore, focus on rank-based comparisons of the retrieved result set R to an ideal ranking of documents, as determined by manual judgments or implicit feedback from user behaviour data. These metrics are typically computed at a rank position, say k, and then averaged over all queries in the test set. Unless otherwise specified, R refers to the top-k results retrieved by the model. Next, we describe a few standard metrics used in IR evaluations.
Precision and recall Precision and recall both compute the fraction of relevant documents retrieved for a query q, but with respect to the total number of documents in the retrieved set R_q and the total number of relevant documents in the collection D, respectively. Both metrics assume that the relevance labels are binary.
\mathrm{Precision}_q = \frac{\sum_{\langle i,d \rangle \in R_q} \mathrm{rel}_q(d)}{|R_q|} \qquad (1.1)

\mathrm{Recall}_q = \frac{\sum_{\langle i,d \rangle \in R_q} \mathrm{rel}_q(d)}{\sum_{d \in D} \mathrm{rel}_q(d)} \qquad (1.2)
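As a concrete illustration, the following is a minimal Python sketch of these two metrics for a single query, given a ranked result list and a set of binary relevance judgments; the function and variable names are ours, introduced only for this example.

```python
def precision_recall(ranked_docs, relevant_docs, k=None):
    """Precision and recall for one query, assuming binary relevance.

    ranked_docs: documents in ranked order (the retrieved set R_q).
    relevant_docs: set of all documents judged relevant for the query.
    """
    retrieved = ranked_docs[:k] if k is not None else ranked_docs
    num_relevant_retrieved = sum(1 for d in retrieved if d in relevant_docs)
    precision = num_relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = num_relevant_retrieved / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall

# Example: two of the three retrieved documents are relevant.
print(precision_recall(["d1", "d7", "d3"], {"d1", "d3", "d9"}))  # (0.666..., 0.666...)
```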
Mean reciprocal rank (MRR) Mean reciprocal rank [75] is also computed over binary relevance judgments. It is given as the reciprocal rank of the first relevant document averaged over all queries.
RR_q = \max_{\langle i,d \rangle \in R_q} \frac{\mathrm{rel}_q(d)}{i} \qquad (1.3)
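A minimal sketch of reciprocal rank for a single query, again with hypothetical names; MRR is simply the mean of this value over all queries in the test set.

```python
def reciprocal_rank(ranked_docs, relevant_docs):
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            return 1.0 / rank
    return 0.0

# The first relevant document appears at rank 2, so RR = 0.5.
print(reciprocal_rank(["d7", "d1", "d3"], {"d1", "d3"}))  # 0.5
```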
Mean average precision (MAP) The average precision [76] for a ranked list of documents R is given by,
\mathrm{AveP}_q = \frac{\sum_{\langle i,d \rangle \in R_q} \mathrm{Precision}_{q,i} \times \mathrm{rel}_q(d)}{\sum_{d \in D} \mathrm{rel}_q(d)} \qquad (1.4)
where Precision_{q,i} is the precision computed at rank i for the query q. The average precision metric is generally used when relevance judgments are binary, although variants using graded judgments have also been proposed [77]. The mean of the average precision over all queries gives the MAP score for the whole set.
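The sketch below computes average precision for one query under binary relevance, following Equation 1.4; as before, the helper name is ours and not part of the thesis.

```python
def average_precision(ranked_docs, relevant_docs):
    """Average precision for one query with binary relevance labels."""
    if not relevant_docs:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Denominator is the total number of relevant documents in the collection.
    return precision_sum / len(relevant_docs)

# Relevant documents found at ranks 1 and 3: AP = (1/1 + 2/3) / 3 ≈ 0.556.
print(average_precision(["d1", "d7", "d3"], {"d1", "d3", "d9"}))
```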
Normalized discounted cumulative gain (NDCG) There are a few different variants of the discounted cumulative gain (DCG_q) metric [78] which can be used when graded relevance judgments are available for a query q—say, on a five-point scale from zero to four. One incarnation of this metric is as follows.
\mathrm{DCG}_q = \sum_{\langle i,d \rangle \in R_q} \frac{\mathrm{gain}_q(d)}{\log_2(i + 1)} \qquad (1.5)
The ideal DCG (IDCG_q) is computed the same way but by assuming an ideal rank order for the documents up to rank k. The normalized DCG (NDCG_q) is then given by,
\mathrm{NDCG}_q = \frac{\mathrm{DCG}_q}{\mathrm{IDCG}_q} \qquad (1.6)
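A minimal sketch of DCG and NDCG over graded relevance labels (e.g., on a zero-to-four scale). Here the graded labels are used directly as gains and the ideal ordering is computed over the same retrieved list—both simplifications introduced for this example rather than choices prescribed by the thesis.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a list of gains in ranked order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """NDCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = gains[:k] if k is not None else gains
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Graded labels of the retrieved documents, in ranked order.
print(ndcg([3, 0, 2, 1]))  # < 1.0 because the ideal order is [3, 2, 1, 0]
```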
Normalized cumulative gain (NCG) A metric related to NDCG but suitable for evaluating the quality of retrieval for first stage candidate generation methods is NCG—i.e., NDCG without the position discounting.
\mathrm{CG}_q = \sum_{\langle i,d \rangle \in R_q} \mathrm{gain}_q(d) \qquad (1.7)

\mathrm{NCG}_q = \frac{\mathrm{CG}_q}{\mathrm{ICG}_q} \qquad (1.8)
NCG has been employed in the literature [15, 16, 79] to measure how many relevant items are recalled as part of candidate generation, without paying attention to the exact order in which the candidates appear in the retrieved set.
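Continuing the sketches above, NCG simply drops the position discount; in the sketch below the ideal cumulative gain (ICG_q) is computed from the full set of judged gains for the query, which we pass in explicitly as an assumed input.

```python
def ncg(retrieved_gains, all_judged_gains, k=None):
    """Normalized cumulative gain: recall-oriented, ignores result ordering."""
    retrieved = retrieved_gains[:k] if k is not None else retrieved_gains
    cg = sum(retrieved)
    icg = sum(sorted(all_judged_gains, reverse=True)[:len(retrieved)])
    return cg / icg if icg > 0 else 0.0

# The candidate set recovers gains [3, 0, 2] out of judged gains [3, 2, 1, 0].
print(ncg([3, 0, 2], [3, 2, 1, 0]))  # 5 / 6 ≈ 0.833
```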
Chapter 2
Motivation
We expect a good retrieval system to exhibit certain general attributes. We highlight some of them in this chapter. The design of any neural methods for IR should be informed by these desired properties. We operationalize these intuitions later in Chapters 4-7. In this chapter, we also introduce a general taxonomy of neural approaches for document ranking by categorizing them based on the step of the retrieval process they influence. This discussion on a general taxonomy should serve as a common lens through which we can inspect both existing neural IR approaches as well as the new deep neural ranking models described in the rest of this thesis.
2.1 Desiderata of IR models
For any IR system, the relevance of the retrieved items to the input query is of foremost importance. But to evaluate the effectiveness of an IR system in isolation, without considering critical dimensions such as the efficiency of the system or its robustness to collections with different properties, can be tantamount to a theoretical exercise without practical usefulness. An IR system mediates what information its users are exposed to and consume. It is, therefore, also important to quantify and limit any systematic disparity that the retrieval system may inadvertently cause with respect to exposure of information artifacts of similar relevance, or their publishers. These concerns not only serve as yardsticks for comparing the different neural and non-neural approaches but also guide our model designs. Where appropriate, we connect these motivations to our contributions in this area, some of which form the basis for subsequent chapters in this thesis.
2.1.1 Semantic matching
Most traditional approaches to ad hoc retrieval count repetitions of the query terms in the document text. Exact term matching between query and document text, while simple, serves as a foundation for many IR systems. Different weighting and normalization schemes over these counts lead to a variety of TF-IDF models, such as BM25 [80]. However, by only inspecting the query terms the IR model ignores all the evidence of aboutness from the rest of the document. So, when ranking for the query “Australia” only the occurrences of “Australia” in the document are considered—although the frequency of other terms like “Sydney” or “kangaroo” may be highly informative [81, 82]. In the case of the query “what channel are the seahawks on today”, the query term “channel” provides hints to the IR model to pay attention to the occurrences of “ESPN” or “Sky Sports” in the document text—none of which appears in the query itself.
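To make the exact-matching baseline concrete, the following is a minimal sketch of a BM25-style scoring function of the kind cited above [80]; the parameter values (k1 = 1.2, b = 0.75) are conventional defaults rather than values prescribed by this thesis, and the IDF variant shown is one of several in common use.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query using exact term matches only."""
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in query_terms:
        if t not in tf or doc_freqs.get(t, 0) == 0:
            continue  # query terms absent from the document contribute nothing
        idf = math.log(1 + (num_docs - doc_freqs[t] + 0.5) / (doc_freqs[t] + 0.5))
        norm_tf = (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score
```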
For IR tasks, such as QAC, the lexical similarity between the input (e.g., the query prefix) and candidate items (e.g., the possible completions) is minimal. In such scenarios, understanding the relationship between the query prefix and suffix requires going beyond inspecting lexical overlap.
Semantic understanding, however, goes further than mapping query terms to document terms [83]. A good IR model may consider the terms “hot” and “warm” related, as well as the terms “dog” and “puppy”—but must also distinguish that a user who submits the query “hot dog” is not looking for a “warm puppy” [84]. At the more ambitious end of the spectrum, semantic understanding would involve logical reasoning by the IR system—so for the query “concerts during SIGIR” it associates a specific edition of the conference (the upcoming one) and considers both its location and dates when recommending concerts nearby during the correct week. These examples motivate that IR models should have some latent representations of intent as expressed by the query and of the different topics in the document text—so that inexact matching can be performed that goes beyond lexical term counting.
Figure 2.1: Log-log plots of frequency versus rank for (a) query impressions and (b) document clicks in the AOL query logs [85]. The plots highlight that these quantities follow a Zipfian distribution.
2.1.2 Robustness to rare inputs
Query frequencies in most IR tasks follow a Zipfian distribution [86] (see Figure 2.1). In the publicly available AOL query logs [85], for example, more than 70% of the distinct queries are seen only once in the period of three months from which the queries are sampled. In the same dataset, more than 50% of the distinct documents are clicked only once. Downey et al. [87] demonstrate that typical web search engines may struggle to retrieve these infrequently searched-for documents and perform poorly on queries containing terms that appear extremely rarely in their historical search logs. Improving the robustness of retrieval systems in the context of rare queries and documents is an important challenge in IR.

Many IR models that learn latent representations of text from data often naively assume a fixed-size vocabulary. These models perform poorly when the query consists of terms rarely (or never) seen during training. Even if the model does not assume a fixed vocabulary, the quality of the latent representations may depend heavily on how often the terms under consideration appear in the training dataset. In contrast, exact matching models, like BM25 [80], can take advantage of the specificity [88–93] of rare terms to retrieve the few documents from the index that contain these terms. Kangassalo et al. [94] demonstrated that users of search systems also tend to consider term specificity in their query formulation process and may add rare terms to help discriminate between relevant and nonrelevant documents.
Semantic understanding in an IR model cannot come at the cost of poor retrieval performance on queries containing rare terms. When dealing with a query such as “pekarovic land company” the IR model will benefit from considering exact matches of the rare term “pekarovic”. In practice an IR model may need to effectively trade-off exact and inexact matching for a query term. However, the decision of when to perform exact matching can itself be informed by semantic understanding of the context in which the terms appear in addition to the terms themselves.
2.1.3 Robustness to variable length text
Depending on the task, the IR system may be expected to retrieve documents, passages, or even short sequences consisting of only a few terms. The design of a retrieval model for long documents is likely to share some similarities to a passage or short text retrieval system, but also be different to accommodate distinct challenges associated with retrieving long text. For example, the challenge of vocabulary mismatch, and hence the importance of semantic matching, may be amplified when retrieving shorter text [95–97]. Similarly, when matching the query against longer text, it is informative to consider the positions of the matches [98–100], but may be less so in the case of short text matching. When specifically dealing with long text, the compute and memory requirements may be significantly higher for machine learned systems (e.g., [101]) and require careful design choices for mitigation.
Typical text collections contain documents of varied lengths (see Figure 2.2). Even when constrained to document retrieval, a good IR system must be able to deal with documents of different lengths without over-retrieving either long or short documents [102, 103]. Relevant documents may also contain irrelevant sections, and the relevant content may either be localized, or spread over multiple sections in the document [104]. Document length normalization is well-studied in the context of IR models (e.g., [92, 93, 102, 105, 106]), and this existing research should inform the design of any new IR models.
Figure 2.2: Distribution of document length (in bytes) of Wikipedia featured articles as of June 30, 2014. Source: https://en.wikipedia.org/wiki/Wikipedia:Featured_articles/By_length.
2.1.4 Efficiency
Efficiency is one of the salient points of any retrieval system [107–109]. A typical commercial Web search engine may deal with tens of thousands of queries per second1—retrieving results for each query from an index containing billions of documents. Search engines typically involve specialised data structures, such as the inverted index, and large multi-tier architectures—and the retrieval process generally consists of multiple stages of pruning the candidate set of documents [110–112]. The IR model at the bottom of this telescoping setup may need to sift through billions of documents—while the model at the top may only need to re-rank tens of promising documents. The retrieval approaches that are suitable at one level of the stack may be highly impractical at a different step—models at the bottom need to be fast but mostly focus on eliminating irrelevant or junk results, while models at the top tend to develop more sophisticated notions of relevance, and focus on distinguishing between documents that are much closer on the relevance scale. So far, much of the focus on neural IR approaches has been limited to re-ranking the top-n documents, which considerably constrains the impact of these methods.

1 http://www.internetlivestats.com/one-second/#google-band
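As an illustration of this telescoping setup, the following minimal sketch pairs a cheap inverted-index candidate generation stage with a more expensive re-ranking stage; the function names (and the cheap_score and expensive_score placeholders) are hypothetical and not part of the thesis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t].add(doc_id)
    return index

def retrieve_candidates(query_terms, index, cheap_score, k=1000):
    """Stage 1: cheap scoring restricted to documents sharing a query term."""
    candidates = set().union(*(index.get(t, set()) for t in query_terms))
    ranked = sorted(candidates, key=lambda d: cheap_score(query_terms, d), reverse=True)
    return ranked[:k]

def rerank(query_terms, candidates, expensive_score, k=10):
    """Stage 2: a slower, more accurate model re-orders only the candidates."""
    return sorted(candidates, key=lambda d: expensive_score(query_terms, d), reverse=True)[:k]
```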
2.1.5 Parity of exposure
IR systems mediate what information its users are exposed to. Under presentation bias [67, 68, 72, 73, 113, 114], a static ranking may disproportionately distribute exposure between items of similar relevance raising concerns about producer-side fairness [115–118]. Exposure optimization has been proposed as a means of achieving fairness in ranking for individuals [116] or groups defined by sensitive attributes such as gender or race [117, 118]. Stochastic ranking policies that optimize for individual or group parity of exposure in expectation may be more appropriate under these settings.
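A small sketch of the underlying idea, under assumptions we introduce only for illustration: each item's expected exposure is estimated by averaging a position-based discount (a patience parameter gamma, as in rank-biased precision) over rankings sampled from a stochastic policy. The helper names are ours.

```python
import random
from collections import defaultdict

def expected_exposure(sample_ranking, items, num_samples=10000, gamma=0.5):
    """Estimate each item's expected exposure under a stochastic ranking policy.

    sample_ranking: callable returning one ranked list of items per call.
    gamma: position discount, i.e., exposure at rank i is gamma ** (i - 1).
    """
    exposure = defaultdict(float)
    for _ in range(num_samples):
        for rank, item in enumerate(sample_ranking(), start=1):
            exposure[item] += gamma ** (rank - 1)
    return {item: exposure[item] / num_samples for item in items}

# A policy that randomly permutes three equally relevant items gives each of
# them roughly (1 + gamma + gamma^2) / 3 exposure in expectation, unlike any
# single static ranking, which concentrates exposure on the top-ranked item.
items = ["a", "b", "c"]
policy = lambda: random.sample(items, k=3)
print(expected_exposure(policy, items))
```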
2.1.6 Sensitivity to context
Retrieval in the wild can leverage many sources of implicit and explicit context information [119–132]. The query “weather” may refer to the weather in Seattle or in London depending on where the user is located. An IR model may retrieve different results for the query “decorations” depending on the current season. The query “giants match highlights” may be better disambiguated if the system knows whether the user is a fan of baseball or American football, whether she is located on the East or the West coast of the USA, or if the model has knowledge of recent sport fixtures. In conversational IR systems [133], the correct response to the question “When did she become the prime minister?” would depend on disambiguating the correct entity based on the context of references made in the previous turns of the conversation. In proactive retrieval scenarios [134–137], the retrieval can even be triggered based solely on implicit context without any explicit query submission from the user. Relevance in many applications is, therefore, situated in the user and task context, and is an important consideration in the design of IR systems.
2.1.7 Robustness to corpus variance
An interesting consideration for IR models is how well they perform on corpora whose distributions are different from the data that the model was trained on. Models like BM25 [80] have very few parameters and often demonstrate reasonable performance “out of the box” on new corpora with little or no additional tuning of parameters. Supervised deep learning models containing millions (or even billions) of parameters, on the other hand, are known to be more sensitive to distributional differences between training and evaluation data, and have been shown to be especially vulnerable to adversarial inputs [138].

The application of unsupervised term embeddings to collections and tasks that are different from the original data the representations were trained on is common in the literature. While these can be seen as examples of successful transfer learning, we also find evidence [139] that term embeddings trained on collections distributionally closer to the test samples perform significantly better.

Some of the variance in performance of deep models on new corpora is offset by better retrieval on test corpora that are distributionally closer to the training data, where the model may have picked up crucial corpus-specific patterns. For example, it may be understandable if a model that learns term representations based on the text of Shakespeare's Hamlet is effective at retrieving passages relevant to a search query from The Bard's other works, but performs poorly when the retrieval task involves a corpus of song lyrics by Jay-Z. However, poor performance on a new corpus can also be indicative that the model is overfitting, or suffering from the Clever Hans2 effect [140]. For example, an IR model trained on a recent news corpus may learn to associate “Theresa May” with the query “uk prime minister” and as a consequence may perform poorly on older TREC datasets where the connection to “John Major” may be more appropriate.

ML models that are hyper-sensitive to corpus distributions may be vulnerable when faced with unexpected changes in distributions in the test data. This can be particularly problematic when the test distributions naturally evolve over time due to underlying changes in the user population or behaviour. The models may need to be re-trained periodically or designed to be invariant to such changes (e.g., [141]).
While this list of desired attributes of an IR model is in no way complete, it serves as a reference for comparing many of the neural and non-neural approaches described in the rest of this thesis.
2https://en.wikipedia.org/wiki/Clever_Hans

2.2 Designing neural models for IR
In the previous section, we discussed several important desiderata of IR models. These expectations inform the design of the neural architectures described in this thesis.
Machine learning models—including neural networks—are employed for learning to rank [39] in IR, as we discuss in Section 3.4. However, unlike traditional LTR methods that depend on manually crafted features, our work focuses on neural ranking models that accept raw text as input and learn latent representations of text appropriate for the ranking task. Learning good representations of text is central to effective semantic matching in IR and is a key ingredient for all methods proposed in Chapters 4-7.
Section 2.1.2 highlights the importance of exact matching when dealing with rare terms. In Chapter 4 and Section 7.1, we operationalize this intuition and demonstrate that neural methods that combine lexical and semantic matching achieve more robustness to rare inputs for different retrieval tasks.
An important IR task, in the context of this thesis, is ranking documents that may be hundreds of sentences long. As far as we are aware, the Duet model—described in Chapter 4—is the first to consider deep neural network based representation learning to rank documents. The different shift-invariant architectures discussed in Section 3.5.2 may also be appropriate for dealing with documents of different lengths. In more recent work [103], we have specifically emphasized the challenges of dealing with long document text and demonstrated that, without careful design, neural models can under-retrieve longer documents.
In most real IR tasks—such as Web search—retrieval involves collections with billions of documents. In traditional IR, efficient data structures such as the inverted index [142] or prefix-trees [143] are commonly employed. When designing neural ranking models, it is important to consider how they may interact with traditional IR data structures, such as the inverted index. In Chapter 5, we propose a strategy that allows using deep networks—in combination with a standard inverted index—to retrieve from the full collection using predominantly offline precomputation without sacrificing fast query response time. We show that this strategy generalizes effectively to several recent state-of-the-art deep architectures for IR. The learning to rank literature has traditionally focused on generating static rankings of items given user intent. In Chapter 6, we argue that stochastic ranking policies are crucial when optimizing for fair distribution of exposure over items (or groups of items) of similar relevance. We demonstrate that learning to rank models can be trained towards exposure parity objectives. Neural models can incorporate side-information on the task or user context. In Section 7.2, we explore neural representation learning in the context of session modeling for more effective QAC. Generalizing neural models across different corpora continues to be an important open problem. Neural models with a large number of learnable parameters risk overfitting to the distributions observed in the training data. These models may underperform when the properties of the test dataset are significantly different from the training corpus. While we do not discuss this particular topic in this thesis, we refer the interested reader to our recent work [141, 144] related to regularization of neural ranking models.
Chapter 3
Background
In this chapter, we introduce the fundamentals of neural IR, in the context of traditional retrieval research, with visual examples to illustrate key concepts and a consistent mathematical notation for describing key models. Section 3.1 presents a survey of IR models. Section 3.2 introduces neural and non-neural methods for learning term embeddings, without the use of supervision from IR labels, and with a focus on the notion of similarity. Section 3.3 surveys some specific approaches for incorporating such embeddings in IR. Section 3.4 introduces supervised learning to rank models. Section 3.5 introduces the fundamentals of deep models—including standard architectures and toolkits—before Section 3.6 surveys some specific approaches for incorporating deep neural networks (DNNs) in IR.
3.1 IR Models
3.1.1 Traditional IR models
In this section, we introduce a few of the traditional IR approaches. The decades of insights from these IR models not only inform the design of our new neural based approaches, but these models also serve as important baselines for compari- son. They also highlight the various desiderata that we expect the neural IR models to incorporate.
BM25 There is a broad family of statistical functions in IR that consider the number of occurrences of each query term in the document—i.e., term-frequency (TF)—and the corresponding inverse document frequency (IDF) of the same terms in the full collection (as an indicator of the informativeness of the term). One theoretical basis for such formulations is the probabilistic model of IR that yielded the BM25 [80] ranking function.
BM25(q,d) = \sum_{t_q \in q} idf(t_q) \cdot \frac{tf(t_q,d) \cdot (k_1 + 1)}{tf(t_q,d) + k_1 \cdot \big(1 - b + b \cdot \frac{|d|}{avgdl}\big)} \qquad (3.1)
Where, avgdl is the average length of documents in the collection D, and k_1 and b are parameters that are usually tuned on a validation dataset. In practice, k_1 is sometimes set to a default value in the range [1.2, 2.0] and b to 0.75. The idf(t) is computed as,
idf(t) = \log \frac{|D| - df(t) + 0.5}{df(t) + 0.5} \qquad (3.2)
BM25 aggregates the contributions from individual terms but ignores any phrasal or proximity signals between the occurrences of the different query terms in the document. A variant of BM25 [145, 146] also considers documents as composed of several fields (such as title, body, and anchor texts).
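To make the scoring function concrete, the following is a minimal sketch of Equations 3.1 and 3.2 over a small in-memory collection. The tokenized toy corpus, the helper name bm25_score, and the default parameter values (k_1 = 1.2, b = 0.75) are illustrative assumptions rather than part of the formulation above.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    # corpus is a list of documents, each a list of terms;
    # doc_terms is the document being scored.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter()
    for d in corpus:
        df.update(set(d))                                   # document frequencies
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))   # Equation 3.2
        numerator = tf[t] * (k1 + 1)
        denominator = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * numerator / denominator              # Equation 3.1
    return score

corpus = [["seattle", "seahawks", "highlights"],
          ["seattle", "weather"],
          ["denver", "broncos"]]
print(bm25_score(["seahawks", "highlights"], corpus[0], corpus))
```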
Language modelling (LM) In the language modelling based approach [90, 147, 148], documents are ranked by the posterior probability p(d|q).
p(d|q) = \frac{p(q|d) \cdot p(d)}{\sum_{\bar{d} \in D} p(q|\bar{d}) \cdot p(\bar{d})} \qquad (3.3)
\propto p(q|d) \cdot p(d) \qquad (3.4)
= p(q|d), \text{ assuming } p(d) \text{ is uniform} \qquad (3.5)
= \prod_{t_q \in q} p(t_q|d) \qquad (3.6)

\hat{p}(E) is the maximum likelihood estimate (MLE) of the probability of event E, and p(q|d) indicates the probability of generating query q by randomly sampling terms from document d. In its simplest form, we can estimate p(t_q|d) by,
p(t_q|d) = \frac{tf(t_q,d)}{|d|} \qquad (3.7)
However, most formulations of language modelling based retrieval typically employ some form of smoothing [148] by sampling terms from both the document d and the full collection D. The two common smoothing methods are:
1. Jelinek-Mercer smoothing [149]
p(t_q|d) = \lambda \frac{tf(t_q,d)}{|d|} + (1 - \lambda) \frac{\sum_{\bar{d} \in D} tf(t_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|} \qquad (3.8)
2. Dirichlet Prior Smoothing [150]
p(t_q|d) = \frac{tf(t_q,d) + \mu \frac{\sum_{\bar{d} \in D} tf(t_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|}}{|d| + \mu} \qquad (3.9)
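The sketch below implements the query likelihood of Equation 3.6 with the Dirichlet prior smoothing of Equation 3.9; the toy corpus, the function name lm_score, and the choice of μ = 2000 are illustrative assumptions.

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, corpus, mu=2000):
    # log p(q|d) under the query-likelihood model with Dirichlet smoothing.
    collection = Counter()
    for d in corpus:
        collection.update(d)
    collection_len = sum(collection.values())
    tf = Counter(doc_terms)
    log_p = 0.0
    for t in query_terms:
        p_collection = collection[t] / collection_len
        p_smoothed = (tf[t] + mu * p_collection) / (len(doc_terms) + mu)  # Eq. 3.9
        log_p += math.log(p_smoothed) if p_smoothed > 0 else float("-inf")
    return log_p  # documents are ranked by this value

corpus = [["seattle", "seahawks", "highlights"],
          ["seattle", "weather"],
          ["denver", "broncos"]]
print(lm_score(["seattle", "highlights"], corpus[0], corpus))
```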
Both TF-IDF and language modelling based approaches estimate document relevance based on the count of only the query terms in the document. The position of these occurrences and the relationship with other terms in the document are ignored.

Translation models Berger and Lafferty [151] proposed an alternative method to estimate p(t_q|d) in the language modelling based IR approach (Equation 3.6), by assuming that the query q is being generated via a “translation” process from the document d.
p(t_q|d) = \sum_{t_d \in d} p(t_q|t_d) \cdot p(t_d|d) \qquad (3.10)
The p(t_q|t_d) component allows the model to garner evidence of relevance from non-query terms in the document. Berger and Lafferty [151] propose to estimate p(t_q|t_d) from query-document paired data similar to techniques in statistical machine translation [152, 153]—but other approaches for estimation have also been explored [154].
Dependence model None of the three IR models described so far consider proximity between query terms. To address this, Metzler and Croft [155] proposed a linear model over proximity-based features.
DM(q,d) = (1 - \lambda_{ow} - \lambda_{uw}) \sum_{t_q \in q} \log\Big((1 - \alpha_d) \frac{tf(t_q,d)}{|d|} + \alpha_d \frac{\sum_{\bar{d} \in D} tf(t_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|}\Big)
+ \lambda_{ow} \sum_{c_q \in ow(q)} \log\Big((1 - \alpha_d) \frac{tf_{\#1}(c_q,d)}{|d|} + \alpha_d \frac{\sum_{\bar{d} \in D} tf_{\#1}(c_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|}\Big)
+ \lambda_{uw} \sum_{c_q \in uw(q)} \log\Big((1 - \alpha_d) \frac{tf_{\#uwN}(c_q,d)}{|d|} + \alpha_d \frac{\sum_{\bar{d} \in D} tf_{\#uwN}(c_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|}\Big) \qquad (3.11)
Where, ow(q) and uw(q) are the set of all contiguous n-grams (or phrases) and the set of all bags of terms that can be generated from query q. tf_{\#1} and tf_{\#uwN} are the ordered-window and unordered-window operators from Indri [156]. Finally, \lambda_{ow} and \lambda_{uw} are the tuneable parameters of the model.
Pseudo relevance feedback (PRF) PRF-based methods—e.g., Relevance Models (RM) [157, 158]—typically demonstrate strong performance at the cost of executing an additional round of retrieval. The set of ranked documents R1 from the first round of retrieval is used to select expansion terms to augment the query which is used to retrieve a new ranked set of documents R2 that is presented to the user. The underlying approach to scoring a document in RM is by computing the KL divergence [159] between the query language model θq and the document language model θd.
score(q,d) = -\sum_{t \in T} p(t|\theta_q) \log \frac{p(t|\theta_q)}{p(t|\theta_d)} \qquad (3.12)
Without PRF,
p(t|\theta_q) = \frac{tf(t,q)}{|q|} \qquad (3.13)
Figure 3.1: Document ranking typically involves query and document representation steps, followed by a matching stage. Neural models can be useful either for generating good representations or for estimating relevance, or both.
But under the RM3 [160] formulation the new query language model θ¯q is estimated by,
p(t|\bar{\theta}_q) = \alpha \frac{tf(t,q)}{|q|} + (1 - \alpha) \sum_{d \in R_1} p(t|\theta_d)\, p(d) \prod_{\bar{t} \in q} p(\bar{t}|\theta_d) \qquad (3.14)
Besides language models, PRF based query expansion has also been explored in the context of other retrieval approaches (e.g., [161, 162]). By expanding the query using the results from the first round of retrieval, PRF based approaches tend to be more robust to the vocabulary mismatch problem plaguing many other traditional IR methods.
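As a rough illustration of the RM3 estimate in Equation 3.14, the sketch below builds an expanded query language model from a handful of feedback documents. The function name, the uniform p(d), the unsmoothed maximum likelihood document models, and the toy inputs are all assumptions made for the example.

```python
from collections import Counter

def rm3_expanded_query_model(query_terms, feedback_docs, alpha=0.5):
    # feedback_docs play the role of R1, the documents from the first retrieval round.
    def doc_lm(doc):
        tf = Counter(doc)
        return {t: tf[t] / len(doc) for t in tf}

    relevance_model = Counter()
    p_d = 1.0 / len(feedback_docs)          # uniform document prior
    for doc in feedback_docs:
        lm = doc_lm(doc)
        query_likelihood = 1.0
        for t in query_terms:               # product over query terms in Eq. 3.14
            query_likelihood *= lm.get(t, 0.0)
        for t, p in lm.items():
            relevance_model[t] += p * p_d * query_likelihood
    total = sum(relevance_model.values()) or 1.0
    relevance_model = {t: v / total for t, v in relevance_model.items()}

    # interpolate with the original query model
    q_tf = Counter(query_terms)
    terms = set(relevance_model) | set(q_tf)
    return {t: alpha * q_tf[t] / len(query_terms)
               + (1 - alpha) * relevance_model.get(t, 0.0) for t in terms}

model = rm3_expanded_query_model(
    ["seattle", "map"],
    [["seattle", "map", "downtown"], ["seattle", "map", "neighborhoods"]])
print(sorted(model.items(), key=lambda kv: -kv[1])[:4])
```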
3.1.2 Anatomy of neural IR models
Document ranking comprises three primary steps—generate a representation of the query that specifies the information need, generate a representation of the document that captures the distribution over the information contained, and match the query and the document representations to estimate their mutual relevance.
[Figure 3.2 diagrams: (a) learning to rank using manually designed features (e.g., Liu [39]); (b) estimating relevance from patterns of exact matches (e.g., [163]); (c) learning query and document representations for matching (e.g., [164, 165]); (d) query expansion using neural embeddings (e.g., [139, 166]).]
Figure 3.2: Examples of different neural approaches to IR. In (a) and (b) the neural network is only used at the point of matching, whereas in (c) the focus is on learning effective representations of text using neural methods. Neural models can also be used to expand or augment the query before applying traditional IR techniques, as shown in (d).

All existing neural approaches to IR can be broadly categorized based on whether they influence the query representation, the document representation, or the estimation of relevance. A neural approach may impact one or more of these stages shown in Figure 3.1.
Neural networks are useful as learning to rank models as we will discuss in Section 3.4. In these models, a joint representation of query and document is gen- erated using manually designed features and the neural network is used only at the point of match to estimate relevance, as shown in Figure 3.2a. In Section 3.6.4, we will discuss DNN models, such as [7, 163], that estimate relevance based on patterns of exact query term matches in the document. Unlike traditional learning to rank models, however, these architectures (shown in Figure 3.2b) depend less on manual feature engineering and more on automatically detecting regularities in good match- ing patterns. More recent deep learning methods, such as [167], consume query and document as single concatenated sequence of terms, instead of representing them as separate term vectors.
In contrast, many (shallow and deep) neural IR models depend on learning useful low-dimensional vector representations—or embeddings—of query and doc- ument text, and using them within traditional IR models or in conjunction with sim- ple similarity metrics (e.g., cosine similarity). These models shown in Figure 3.2c may learn the embeddings by optimizing directly for the IR task (e.g.,[164]), or in an unsupervised setting (e.g.,[165]). Finally, Figure 3.2d shows IR approaches where the neural models are used for query expansion [139, 166].
While the taxonomy of neural approaches described in this section is rather simple, it does provide an intuitive framework for comparing the different neural approaches in IR and highlights the similarities and distinctions between these different techniques.
Figure 3.3: Under local representations the terms “banana”, “mango”, and “dog” are dis- tinct items. But distributed vector representations may recognize that “banana” and “mango” are both fruits, but “dog” is different.
3.2 Unsupervised learning of term representations
3.2.1 A tale of two representations
Vector representations are fundamental to both information retrieval and machine learning. In IR, terms are typically the smallest unit of representation for indexing and retrieval. Therefore, many IR models—both non-neural and neural—focus on learning good vector representations of terms. Different vector representations exhibit different levels of generalization—some consider every term as a distinct entity while others learn to identify common attributes. Different representation schemes derive different notions of similarity between terms from the definition of the corresponding vector spaces. Some representations operate over fixed-size vocabularies, while the design of others obviates such constraints. They also differ on the properties of compositionality that define how representations for larger units of information, such as passages and documents, can be derived from individual term vectors. These are some of the important considerations for choosing a term representation suitable for a specific task.
Local representations Under local (or one-hot) representations, every term in a fixed size vocabulary T is represented by a binary vector ⃗v ∈ {0,1}^{|T|}, where only one of the values in the vector is one and all the others are set to zero. Each position in the vector ⃗v corresponds to a term. The term “banana”, under this representation, is given by a vector that has the value one in the position corresponding to “banana” and zero everywhere else. Similarly, the terms “mango” and “dog” are represented by setting different positions in the vector to one. Figure 3.3a highlights that under this scheme each term is a unique entity, and “banana” is as distinct from “dog” as it is from “mango”. Terms outside of the vocabulary either have no representation or are denoted by a special “UNK” symbol under this scheme.
[Figure 3.4 diagrams: (a) in-document features; (b) neighbouring-term features; (c) neighbouring-term w/ distance features; (d) character-trigraph features.]
Figure 3.4: Examples of different feature-based distributed representations of the term “banana”. The representations in (a), (b), and (c) are based on external contexts in which the term frequently occurs, while (d) is based on properties intrinsic to the term. The representation scheme in (a) depends on the documents containing the term while the schemes shown in (b) and (c) depend on other terms that appear in its neighbourhood. The scheme (b) ignores inter-term distances. Therefore, in the sentence “Time flies like an arrow; fruit flies like a banana”, the feature “fruit” describes both the terms “banana” and “arrow”. However, in the representation scheme of (c) the feature “fruit−4” is positive for “banana”, and the feature “fruit+1” for “arrow”.
Distributed representations Under distributed representations every term is represented by a vector ⃗v ∈ ℝ^k. ⃗v can be a sparse or a dense vector—a vector of hand-crafted features or a latent representation in which the individual dimensions are not interpretable in isolation. The key underlying hypothesis for any distributed representation scheme, however, is that representing a term by its attributes allows for defining some notion of similarity between the different terms based on the chosen properties. For example, in Figure 3.3b “banana” is more similar to “mango” than “dog” because they are both fruits, yet different because of other properties that are not shared between the two, such as shape.
A key consideration in any feature based distributed representation is the choice of the features themselves. One approach involves representing terms by features that capture their distributional properties. This is motivated by the distributional hypothesis [168] that states that terms that are used (or occur) in similar context tend to be semantically similar. Firth [169] famously articulated this idea of distributional semantics1 by stating “a word is characterized by the company it keeps”. However, the distribution of different types of context may model different semantics of a term. Figure 3.4 shows three different sparse vector representations of the term “banana” corresponding to different distributional feature spaces—documents containing the term (e.g., LSA [170]), neighbouring terms in a window (e.g., HAL [171], COALS [172], and [173]), and neighbouring terms with distance (e.g., [174]). Finally, Figure 3.4d shows a vector representation of “banana” based on the character trigraphs in the term itself—instead of external contexts in which the term occurs. In Section 3.2.2 we will discuss how choosing different distributional features for term representation leads to different nuanced notions of semantic similarity between them. When the vectors are high-dimensional, sparse, and based on observable features we refer to them as observed (or explicit) vector representations [174]. When the vectors are dense, small (k ≪ |T|), and learnt from data then we instead refer to them as latent vector spaces, or embeddings. In both observed and latent vector spaces, several distance metrics can be used to define the similarity between terms, although cosine similarity is commonly used.
sim(\vec{v}_i, \vec{v}_j) = \cos(\vec{v}_i, \vec{v}_j) = \frac{\vec{v}_i^{\,\intercal} \vec{v}_j}{\|\vec{v}_i\| \|\vec{v}_j\|} \qquad (3.15)
Most embeddings are learnt from observed features, and hence the discussions in Section 3.2.2 about different notions of similarity are also relevant to the embedding models. In Section 3.2.3 and Section 3.2.4 we discuss observed and latent space representations. In the context of neural models, distributed representations generally refer to learnt embeddings.
1Readers should take note that while many distributed representations take advantage of distributional properties, the two concepts are not synonymous. A term can have a distributed representation based on non-distributional features—e.g., parts of speech classification and character trigraphs in the term.
Figure 3.5: A vector space representation of terms puts “banana” closer to “mango” because they share more common attributes than “banana” and “dog”.

The idea of ‘local’ and ‘distributed’ representations has a specific significance in the context of neural networks. Each concept, entity, or term can be represented within a neural network by the activation of a single neuron (local representation) or by the combined pattern of activations of several neurons (distributed representation) [175]. Finally, with respect to compositionality, it is important to understand that distributed representations of items are often derived from local or distributed representations of their parts. For example, a document can be represented by the sum of the one-hot vectors or embeddings corresponding to the terms in the document. The resultant vector, in both cases, corresponds to a distributed bag-of-terms representation. Similarly, the character trigraph representation of terms in Figure 3.4d is simply an aggregation over the one-hot representations of the constituent trigraphs.
3.2.2 Notions of similarity
Any vector representation inherently defines some notion of relatedness between terms. Is “Seattle” closer to “Sydney” or to “Seahawks”? The answer depends on the type of relationship we are interested in. If we want terms of similar type to be closer, then “Sydney” is more similar to “Seattle” because they are both cities. However, if we are interested in finding terms that co-occur in the same document or passage, then “Seahawks”—Seattle’s football team—should be closer. The former represents a typical, or type-based, notion of similarity while the latter exhibits a more topical sense of relatedness.
Table 3.1: A toy corpus of short documents that we consider for the discussion on different notions of similarity between terms under different distributed representations. The choice of the feature space that is used for generating the distributed rep- resentation determines which terms are closer in the vector space, as shown in Figure 3.6.
Sample documents
doc 01  Seattle map                 doc 09  Denver map
doc 02  Seattle weather             doc 10  Denver weather
doc 03  Seahawks jerseys            doc 11  Broncos jerseys
doc 04  Seahawks highlights         doc 12  Broncos highlights
doc 05  Seattle Seahawks Wilson     doc 13  Denver Broncos Lynch
doc 06  Seattle Seahawks Sherman    doc 14  Denver Broncos Sanchez
doc 07  Seattle Seahawks Browner    doc 15  Denver Broncos Miller
doc 08  Seattle Seahawks Ifedi      doc 16  Denver Broncos Marshall
If we want to compare “Seattle” with “Sydney” and “Seahawks” based on their respective vector representations, then the underlying feature space needs to align with the notion of similarity that we are interested in. It is, therefore, important for the readers to build an intuition about the choice of features and the notion of similarity they encompass. This can be demonstrated by using a toy corpus, such as the one in Table 3.1. Figure 3.6a shows that the “in documents” features naturally lend to a topical sense of similarity between the terms, while the “neighbouring terms with distances” features in Figure 3.6c give rise to a more typical notion of relatedness. Using “neighbouring terms” without the inter-term distances as features, however, produces a mixture of topical and typical relationships. This is because when the term distances (denoted as superscripts) are considered in the feature definition then the document “Seattle Seahawks Wilson” produces the bag-of-features {Seahawks+1, Wilson+2} for “Seattle” which is non-overlapping with the bag-of-features {Seattle−1, Wilson+1} for “Seahawks”. However, when the feature definition ignores the term-distances then there is a partial overlap between the bag-of-features {Seahawks, Wilson} and {Seattle, Wilson} corresponding to “Seattle” and “Seahawks”, respectively. The overlap increases when a larger window-size over the neighbouring terms is employed, pushing the notion of similarity closer to a topical definition. This effect of the window size on the latent vector space was reported by Levy and Goldberg [176] in the context of term embeddings. Readers should note that the set of all inter-term relationships goes beyond the two notions of typical and topical that we discuss in this section. For example, vector representations could cluster terms closer based on linguistic styles—e.g., terms that appear in thriller novels versus in children’s rhymes, or in British versus American English. However, the notions of typical and topical similarities frequently come up in discussions in the context of many IR and NLP tasks—sometimes under different names such as Paradigmatic and Syntagmatic relations2 [178–181]—and the idea itself goes back at least as far as Saussure [182–185].
3.2.3 Observed feature spaces
Observed feature space representations can be broadly categorized based on their choice of distributional features (e.g., in documents, neighbouring terms with or without distances, etc.) and different weighting schemes (e.g., TF-IDF, positive pointwise mutual information, etc.) applied over the raw counts. We direct the readers to [186, 187] which are good surveys of many existing observed vector representation schemes. Levy et al. [174] demonstrated that explicit vector representations are amenable to the term analogy task using simple vector operations. A term anal- ogy task involves answering questions of the form “man is to woman as king is to ____?”—the correct answer to which in this case happens to be “queen”. In NLP, term analogies are typically performed by simple vector operations of the following form followed by a nearest-neighbour search,
⃗vSeahawks −⃗vSeattle +⃗vDenver ≈⃗vBroncos (3.16)
2Interestingly, the notion of Paradigmatic (typical) and Syntagmatic (topical) relationships show up almost universally—not just in text. In vision, for example, the different images of “nose” are typically similar to each other, while sharing topical relationship with images of “eyes” and “ears”. Curiously, Barthes [177] extended the analogy to garments. Paradigmatic relationships exist between items of the same type (e.g., different style of boots) and the proper Syntagmatic juxtaposition of items from these different Paradigms—from hats to boots—forms a fashionable ensemble.
[Figure 3.6 diagrams of the vectors for “Seattle”, “Seahawks”, “Denver”, and “Broncos” under: (a) “in-documents” features; (b) “neighbouring terms” features; (c) “neighbouring terms w/ distances” features.]
Figure 3.6: The figure shows different distributed representations for the four terms—“Seattle”, “Seahawks”, “Denver”, and “Broncos”—based on the toy corpus in Table 3.1. Shaded circles indicate non-zero values in the vectors—the darker shade highlights the vector dimensions where more than one vector has a non-zero value. When the representation is based on the documents that the terms occur in then “Seattle” is more similar to “Seahawks” than to “Denver”. The representation scheme in (a) is, therefore, more aligned with a topical notion of similarity. In contrast, in (c) each term is represented by a vector of neighbouring terms—where the distances between the terms are taken into consideration—which puts “Seattle” closer to “Denver”, demonstrating a typical, or type-based, similarity. When the inter-term distances are ignored, as in (b), a mix of typical and topical similarities is observed. Finally, it is worth noting that neighbouring-terms based vector representations lead to similarities between terms that do not necessarily occur in the same document, and hence the term-term relationships are less sparse than when only in-document features are considered.
Figure 3.7: A visual demonstration of term analogies via simple vector algebra. The shaded circles denote non-zero values. Darker shade is used to highlight the non-zero values along the vector dimensions for which the output of⃗vSeahawks −⃗vSeattle + ⃗vDenver is positive. The output vector is closest to ⃗vBroncos as shown in this toy example.
It may be surprising to some readers that the vector obtained by the simple algebraic operations ⃗v_{Seahawks} − ⃗v_{Seattle} + ⃗v_{Denver} produces a vector close to the vector ⃗v_{Broncos}. We present a visual intuition of why this works in practice in Figure 3.7, but we refer the readers to [174, 188] for a more rigorous mathematical handling of this subject.
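The following toy sketch mirrors the analogy in Equation 3.16: it forms the offset vector and looks up the nearest remaining term by cosine similarity. The four-dimensional vectors are made up purely for illustration; a real system would use learnt embeddings and a proper nearest-neighbour index.

```python
import numpy as np

# Hand-crafted toy vectors; the first two dimensions loosely encode "city-ness"
# and the last two "team-ness" so that the analogy works out.
vectors = {
    "seattle":  np.array([1.0, 0.9, 0.0, 0.1]),
    "seahawks": np.array([1.0, 0.1, 0.9, 0.0]),
    "denver":   np.array([0.0, 0.9, 0.1, 1.0]),
    "broncos":  np.array([0.1, 0.0, 0.9, 1.0]),
    "weather":  np.array([0.9, 0.8, 0.0, 0.8]),
    "jerseys":  np.array([0.1, 0.0, 0.8, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, exclude):
    candidates = {t: v for t, v in vectors.items() if t not in exclude}
    return max(candidates, key=lambda t: cosine(candidates[t], target))

target = vectors["seahawks"] - vectors["seattle"] + vectors["denver"]
print(nearest(target, exclude={"seahawks", "seattle", "denver"}))  # -> broncos
```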
3.2.4 Embeddings
While observed vector spaces based on distributional features can capture interesting relationships between terms, they have one big drawback—the resultant representations are highly sparse and high-dimensional. The number of dimensions, for example, may be the same as the vocabulary size, which is unwieldy for most practical tasks. An alternative is to learn lower dimensional representations that retain useful attributes from the observed feature spaces. An embedding is a representation of items in a new space such that the properties of—and the relationships between—the items are preserved. Goodfellow et al. [189] articulate that the goal of an embedding is to generate a simpler representation—where simplification may mean a reduction in the number of dimensions, a decrease in the sparseness of the representation, disentangling the principle components of the vector space, or a combination of these goals. In the context of term embeddings, the explicit feature vectors—like those discussed in
Section 3.2.3—constitute the original representation. An embedding trained from these features assimilates the properties of the terms and the inter-term relationships observable in the original feature space.
Common approaches for learning embeddings include either factorizing the term-feature matrix (e.g., LSA [170]) or using gradient descent based methods that try to predict the features given the term (e.g., [190, 191]). Baroni et al. [192] empirically demonstrate that these feature-predicting models that learn lower dimensional representations, in fact, also perform better than explicit counting based models on different tasks—possibly due to better generalization across terms—although some counter-evidence to the claim of better performance from embedding models has also been reported in the literature [193].
The sparse feature spaces of Section 3.2.3 are easier to visualize and lead to more intuitive explanations—while their latent counterparts may be more practically useful. Therefore, it may be useful to think sparse, but act dense in many scenarios. In the rest of this section, we will describe some of these neural and non-neural latent space models.
Latent Semantic Analysis (LSA) LSA [170] involves performing singular value decomposition (SVD) [194] on a term-document (or term-passage) matrix X to obtain its low-rank approximation [195]. SVD on X involves solving X = UΣV^⊺, where U and V are orthogonal matrices and Σ is a diagonal matrix.3
\underbrace{\begin{bmatrix} x_{1,1} & \dots & x_{1,|D|} \\ \vdots & \ddots & \vdots \\ x_{|T|,1} & \dots & x_{|T|,|D|} \end{bmatrix}}_{X \;(\text{rows } \vec{t}_i^{\,\intercal},\; \text{columns } \vec{d}_j)} = \underbrace{\begin{bmatrix} \vec{u}_1 & \dots & \vec{u}_l \end{bmatrix}}_{U} \cdot \underbrace{\begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_l \end{bmatrix}}_{\Sigma} \cdot \underbrace{\begin{bmatrix} \vec{v}_1 \\ \vdots \\ \vec{v}_l \end{bmatrix}}_{V^{\intercal}} \qquad (3.17)
3The matrix visualization is taken from https://en.wikipedia.org/wiki/Latent_semantic_analysis.
σ_1,...,σ_l, ⃗u_1,...,⃗u_l, and ⃗v_1,...,⃗v_l are the singular values, and the left and the right singular vectors, respectively. The k largest singular values—and the corresponding singular vectors from U and V—give the rank k approximation of X (X_k = U_k Σ_k V_k^⊺), and Σ_k ⃗t_i is the embedding for the i-th term.
While LSA operates on a term-document matrix, matrix factorization based approaches can also be applied to term-term matrices [172, 196, 197].
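A small sketch of the LSA factorization in Equation 3.17, using NumPy’s SVD on a toy term-document count matrix and keeping the top-k singular dimensions as term embeddings; the matrix values and the choice of k = 2 are illustrative.

```python
import numpy as np

# Rows are terms, columns are documents; counts are made up for the example.
X = np.array([
    [2, 0, 1, 0],   # "seattle"
    [1, 0, 2, 0],   # "seahawks"
    [0, 2, 0, 1],   # "denver"
    [0, 1, 0, 2],   # "broncos"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_embeddings = U[:, :k] * s[:k]   # rank-k term vectors (Sigma_k applied to U_k)
print(term_embeddings)
```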
Probabilistic Latent Semantic Analysis (PLSA) PLSA [198] learns low- dimensional representations of terms and documents by modelling their co- occurrence p(t,d) as follows,
p(t,d) = p(d) \sum_{c \in C} p(c|d) P(t|c) \qquad (3.18)
Where, C is the set of latent topics—and the number of topics |C| is a hyperparameter of the model. Both p(c|d) and P(t|c) are modelled as multinomial distributions and their parameters are typically learned using the EM algorithm [199]. After learning the parameters of the model, a term t_i can be represented as a distribution over the latent topics [p(c_0|t_i), ..., p(c_{|C|-1}|t_i)]. In a related approach called Latent Dirichlet Allocation (LDA) [200], each document is represented by a Dirichlet prior instead of a fixed variable.
Neural term embedding models are typically trained by setting up a prediction task. Instead of factorizing the term-feature matrix—as in LSA—neural models are trained to predict the term from its features. The model learns dense low-dimensional representations in the process of minimizing the prediction error. These approaches are based on the information bottleneck method [201]—discussed more in Section 3.5.2—with the low-dimensional representations acting as the bottleneck. The training data may contain many instances of the same term-feature pair proportional to their frequency in the corpus (e.g., word2vec [191]), or their counts can be pre-aggregated (e.g., GloVe [202]).
Figure 3.8: The (a) skip-gram and the (b) continuous bag-of-words (CBOW) architectures of word2vec. The architecture is a neural network with a single hidden layer whose size is much smaller than that of the input and the output layers. Both models use one-hot representations of terms in the input and the output. The learnable parameters of the model comprise the two weight matrices W_in and W_out that correspond to the embeddings the model learns for the input and the output terms, respectively. The skip-gram model trains by minimizing the error in predicting a term given one of its neighbours. The CBOW model, in contrast, predicts a term from a bag of its neighbouring terms.
Word2vec For word2vec [191, 203–206], the features for a term are made up of its neighbours within a fixed size window over the text. The skip-gram architecture (see Figure 3.8a) is a simple one hidden layer neural network. Both the input and the output of the model are one-hot vectors and the loss function is as follows,
\mathcal{L}_{skip\text{-}gram} = -\frac{1}{|S|} \sum_{i=1}^{|S|} \sum_{-c \le j \le +c, j \ne 0} \log\big(p(t_{i+j}|t_i)\big) \qquad (3.19)
\text{where, } p(t_{i+j}|t_i) = \frac{\exp\big((W_{out}\vec{v}_{t_{i+j}})^{\intercal} (W_{in}\vec{v}_{t_i})\big)}{\sum_{k=1}^{|T|} \exp\big((W_{out}\vec{v}_{t_k})^{\intercal} (W_{in}\vec{v}_{t_i})\big)} \qquad (3.20)
S is the set of all windows over the training text and c is the number of neighbours we want to predict on either side of the term t_i. The denominator for the softmax function for computing p(t_{i+j}|t_i) sums over all the terms in the vocabulary. This is prohibitively costly and in practice either hierarchical-softmax [207] or negative sampling is employed, which we discuss more in Section 3.4.2. Note that the model has two different weight matrices W_in and W_out that constitute the learnable parameters of the model. W_in gives us the IN embeddings corresponding to the input terms and W_out corresponds to the OUT embeddings for the output terms. Generally, only
W_in is used and W_out is discarded after training. We discuss an IR application that makes use of both the IN and the OUT embeddings in Section 3.3.1.
The continuous bag-of-words (CBOW) architecture (see Figure 3.8b) is similar to the skip-gram model, except that the task is to predict the middle term given all the neighbouring terms in the window. The CBOW model creates a single training sample with the sum of the one-hot vectors of the neighbouring terms as input and the one-hot vector ⃗v_{t_i}—corresponding to the middle term—as the expected output. Contrast this with the skip-gram model that creates 2 × c samples by individually pairing each neighbouring term with the middle term. During training, the skip-gram model trains slower than the CBOW model [191] because it creates more training samples from the same windows of text.
\mathcal{L}_{CBOW} = -\frac{1}{|S|} \sum_{i=1}^{|S|} \log\big(p(t_i | t_{i-c},...,t_{i-1},t_{i+1},...,t_{i+c})\big) \qquad (3.21)
Word2vec gained particular popularity for its ability to perform term analogies using simple vector algebra, similar to what we discussed in Section 3.2.3. For domains where the interpretability of the embeddings is important, Sun et al. [208] intro- duced an additional constraint in the loss function to encourage more sparseness in the learnt representations.
\mathcal{L}_{sparse\text{-}CBOW} = \mathcal{L}_{CBOW} - \lambda \sum_{t \in T} \|\vec{v}_t\|_1 \qquad (3.22)
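Below is a minimal skip-gram sketch in the spirit of Equations 3.19 and 3.20, using the full softmax because the toy vocabulary is tiny; practical implementations use negative sampling or hierarchical softmax instead. The toy corpus, embedding size, learning rate, and number of epochs are all illustrative choices.

```python
import numpy as np

corpus = "seattle seahawks highlights denver broncos highlights".split()
vocab = sorted(set(corpus))
index = {t: i for i, t in enumerate(vocab)}
V, dim, lr, window = len(vocab), 8, 0.05, 1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, dim))    # IN embeddings
W_out = rng.normal(scale=0.1, size=(V, dim))   # OUT embeddings

for _ in range(200):
    for i, term in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i == j:
                continue
            t_in, t_out = index[term], index[corpus[j]]
            scores = W_out @ W_in[t_in]               # logits over the vocabulary
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                      # softmax of Equation 3.20
            grad = probs.copy()
            grad[t_out] -= 1.0                        # gradient of -log p(t_out|t_in)
            grad_in = W_out.T @ grad
            W_out -= lr * np.outer(grad, W_in[t_in])
            W_in[t_in] -= lr * grad_in

# The term with the highest IN-OUT score for "seattle" should be a frequent neighbour.
print(vocab[int(np.argmax(W_out @ W_in[index["seattle"]]))])
```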
GloVe The skip-gram model trains on individual term-neighbour pairs. If we aggregate all the training samples such that x_{ij} is the frequency of the pair ⟨t_i, t_j⟩ in the training data, then the loss function changes to,
\mathcal{L}_{skip\text{-}gram} = -\sum_{i=1}^{|T|} \sum_{j=1}^{|T|} x_{ij} \log\big(p(t_j|t_i)\big) \qquad (3.23)
= -\sum_{i=1}^{|T|} x_i \sum_{j=1}^{|T|} \frac{x_{ij}}{x_i} \log\big(p(t_j|t_i)\big) \qquad (3.24)
= -\sum_{i=1}^{|T|} x_i \sum_{j=1}^{|T|} \bar{p}(t_j|t_i) \log\big(p(t_j|t_i)\big) \qquad (3.25)
= \sum_{i=1}^{|T|} x_i H\big(\bar{p}(t_j|t_i), p(t_j|t_i)\big) \qquad (3.26)
H(...) is the cross-entropy error between the actual co-occurrence probability \bar{p}(t_j|t_i) and the one predicted by the model p(t_j|t_i). This is similar to the loss function for GloVe [202] if we replace the cross-entropy error with a squared-error and apply a saturation function f(...) over the actual co-occurrence frequencies.
\mathcal{L}_{GloVe} = -\sum_{i=1}^{|T|} \sum_{j=1}^{|T|} f(x_{ij}) \big(\log(x_{ij}) - \vec{v}_{w_i}^{\,\intercal}\vec{v}_{w_j}\big)^2 \qquad (3.27)
\text{where, } f(x) = \begin{cases} (x/x_{max})^{\alpha}, & \text{if } x \le x_{max} \\ 1, & \text{otherwise} \end{cases} \qquad (3.29)
GloVe is trained using AdaGrad [209]. Similar to word2vec, GloVe also generates two different (IN and OUT) embeddings, but unlike word2vec it generally uses the sum of the IN and the OUT vectors as the embedding for each term in the vocabulary.
Paragraph2vec Following the popularity of word2vec [191, 203], similar neural architectures [181, 210–214] have been proposed that train on term-document co-occurrences. The training typically involves predicting a term given the ID of a document or a passage that contains the term. In some variants, as shown in Figure 3.9, neighbouring terms are also provided as input. The key motivation for training on term-document pairs is to learn an embedding that is more aligned with a topical notion of term-term similarity—which is often more appropriate for IR tasks. The term-document relationship, however, tends to be more sparse [215]—including neighbouring term features may compensate for some of that sparsity. In the context of IR tasks, Ai et al. [213, 214] proposed a number of IR-motivated changes to the original Paragraph2vec [210] model training—including, document frequency based negative sampling and document length based regularization.
3.3 Term embeddings for IR
Traditional IR models use local representations of terms for query-document matching. The most straight-forward use case for term embeddings in IR is to enable inexact matching in the embedding space.
Figure 3.9: The paragraph2vec architecture as proposed by Le and Mikolov [210] trains by predicting a term given a document (or passage) ID containing the term. By trying to minimize the prediction error, the model learns an embedding for the term as well as for the document. In some variants of the architecture, optionally the neighbouring terms are also provided as input—as shown in the dotted box.

In Section 2.1, we argued the importance of inspecting non-query terms in the document for garnering evidence of relevance. For example, even from a shallow manual inspection, it is possible to conclude that the passage in Figure 3.10a is about Albuquerque because it contains “metropolitan”, “population”, and “area” among other informative terms. On the other hand, the passage in Figure 3.10b contains “simulator”, “interpreter”, and “Altair” which suggest that the passage is instead more likely related to computers and technology. In traditional term counting based IR approaches these signals are often ignored.
Unsupervised term embeddings can be incorporated into existing IR approaches for inexact matching. These approaches can be broadly categorized as those that compare the query with the document directly in the embedding space, and those that use embeddings to generate suitable query expansion candidates from a global vocabulary and then perform retrieval based on the expanded query. We discuss both these classes of approaches in the remainder of this section.
Albuquerque is the most populous city in the U.S. state of New Mexico. The high-altitude city serves as the county seat of Bernalillo County, and it is sit- uated in the central part of the state, straddling the Rio Grande. The city pop- ulation is 557,169 as of the July 1, 2014 population estimate from the United States Census Bureau, and ranks as the 32nd-largest city in the U.S. The Al- buquerque metropolitan statistical area (or MSA) has a population of 907,301 according to the United States Census Bureau’s most recently available esti- mate for 2015. (a) About Albuquerque Allen suggested that they could program a BASIC interpreter for the device; after a call from Gates claiming to have a working interpreter, MITS requested a demonstration. Since they didn’t actually have one, Allen worked on a sim- ulator for the Altair while Gates developed the interpreter. Although they de- veloped the interpreter on a simulator and not the actual device, the interpreter worked flawlessly when they demonstrated the interpreter to MITS in Albu- querque, New Mexico in March 1975; MITS agreed to distribute it, marketing it as Altair BASIC. (b) Not about Albuquerque
Figure 3.10: Two passages both containing exactly a single occurrence of the query term “Albuquerque”. However, the passage in (a) contains other terms such as “population” and “area” that are relevant to a description of the city. In contrast, the terms in passage (b) suggest that it is unlikely to be about the city, and only mentions the city potentially in a different context.
3.3.1 Query-document matching
One strategy for using term embeddings in IR involves deriving a dense vector representation for the query and the document from the embeddings of the individual terms in the corresponding texts. The term embeddings can be aggregated in different ways, although using the average word (or term) embeddings (AWE) is quite common [165, 210, 216–220]. Non-linear combinations of term vectors—such as using the Fisher Kernel Framework [221]—have also been explored, as well as other families of aggregate functions of which AWE has been shown to be a special case [222]. The query and the document embeddings themselves can be compared using a variety of similarity metrics, such as cosine similarity or dot-product. For example,
sim(q,d) = \cos(\vec{v}_q, \vec{v}_d) = \frac{\vec{v}_q^{\,\intercal} \vec{v}_d}{\|\vec{v}_q\| \|\vec{v}_d\|} \qquad (3.30)
\text{where, } \vec{v}_q = \frac{1}{|q|} \sum_{t_q \in q} \frac{\vec{v}_{t_q}}{\|\vec{v}_{t_q}\|} \qquad (3.31)
\vec{v}_d = \frac{1}{|d|} \sum_{t_d \in d} \frac{\vec{v}_{t_d}}{\|\vec{v}_{t_d}\|} \qquad (3.32)
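A minimal sketch of Equations 3.30-3.32: the query and the document are each represented by the average of their unit-normalized term embeddings and compared with cosine similarity. The three-dimensional toy embedding table is an assumption made purely for the example.

```python
import numpy as np

embeddings = {
    "albuquerque": np.array([0.9, 0.1, 0.0]),
    "population":  np.array([0.8, 0.3, 0.1]),
    "interpreter": np.array([0.0, 0.2, 0.9]),
}

def avg_embedding(terms):
    # average of unit-normalized term vectors (Equations 3.31 and 3.32)
    unit = [embeddings[t] / np.linalg.norm(embeddings[t])
            for t in terms if t in embeddings]
    return np.mean(unit, axis=0)

def awe_similarity(query_terms, doc_terms):
    v_q, v_d = avg_embedding(query_terms), avg_embedding(doc_terms)
    return float(v_q @ v_d / (np.linalg.norm(v_q) * np.linalg.norm(v_d)))  # Eq. 3.30

print(awe_similarity(["albuquerque"], ["population", "albuquerque"]))  # higher
print(awe_similarity(["albuquerque"], ["interpreter"]))                # lower
```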
An important consideration here is the choice of the term embeddings that is appropriate for the retrieval scenario. While, LSA [170], word2vec [203], and GloVe [202] are commonly used—it is important to understand how the notion of inter-term similarity modelled by a specific vector space may influence its perfor- mance on a retrieval task. In the example in Figure 3.10, we want to rank documents that contains related terms—such as “population” or “area”—higher. These terms are topically similar to the query term “Albuquerque”. Intuitively, a document about “Tucson”—which is typically similar to “Albuquerque”—is unlikely to satisfy the user intent. The discussion in Section 3.2.2 on how input features influence the notion of similarity in the learnt vector space is relevant here.
Models, such as LSA [170] and Paragraph2vec [210], that consider term- document pairs generally capture topical similarities in the learnt vector space. On the other hand, word2vec [203] and GloVe [202] embeddings may incorpo- rate a mixture of topical and typical notions of relatedness. The inter-term re- lationships modelled in these latent spaces may be closer to type-based similari- ties when trained with short window sizes or on short text, such as on keyword queries [165, 176].
In Section 3.2.4, we note that the word2vec model learns two different embeddings—IN and OUT—corresponding to the input and the output terms. In retrieval, if a query contains a term t_i then—in addition to the frequency of occurrences of t_i in the document—we may also consider the presence of a different term t_j in the document to be a supporting evidence of relevance if the pair of terms ⟨t_i, t_j⟩ frequently co-occurs in the collection. As shown in Equation 3.20,
Table 3.2: Different nearest neighbours in the word2vec embedding space based on whether we compute IN-IN, OUT-OUT, or IN-OUT similarities between the terms. The examples are from [165, 218] where the word2vec embeddings are trained on search queries. Training on short query text, however, makes the inter-term similarity more pronouncedly typical (where, “Yale” is closer to “Harvard” and “NYU”) when both terms are represented using their IN vectors. In contrast, the IN-OUT similarity (where, “Yale” is closer to “faculty” and “alumni”) mirrors more the topical notions of relatedness.
yale                                      seahawks
IN-IN      OUT-OUT    IN-OUT             IN-IN      OUT-OUT    IN-OUT
yale       yale       yale               seahawks   seahawks   seahawks
harvard    uconn      faculty            49ers      broncos    highlights
nyu        harvard    alumni             broncos    49ers      jerseys
cornell    tulane     orientation        packers    nfl        tshirts
tulane     nyu        haven              nfl        packers    seattle
tufts      tufts      graduate           steelers   steelers   hats
in the skip-gram model this probability of co-occurrence p(t_j|t_i) is proportional to (W_{out}\vec{v}_{t_j})^{\intercal}(W_{in}\vec{v}_{t_i})—i.e., the dot product between the IN embeddings of t_i and the
OUT embeddings of t_j. Therefore, Nalisnick et al. [218] point out that when using word2vec embeddings for estimating the relevance of a document to a query, it is more appropriate to compute the IN-OUT similarity between the query and the document terms. In other words, the query terms should be represented using the IN embeddings and the document terms using the OUT embeddings. Table 3.2 highlights the difference between IN-IN and IN-OUT similarities between terms.
The proposed Dual Embedding Space Model (DESM)4 [165, 218] estimates the query-document relevance as follows,
DESM_{in\text{-}out}(q,d) = \frac{1}{|q|} \sum_{t_q \in q} \frac{\vec{v}_{t_q,in}^{\,\intercal} \vec{v}_{d,out}}{\|\vec{v}_{t_q,in}\| \|\vec{v}_{d,out}\|} \qquad (3.33)
\vec{v}_{d,out} = \frac{1}{|d|} \sum_{t_d \in d} \frac{\vec{v}_{t_d,out}}{\|\vec{v}_{t_d,out}\|} \qquad (3.34)
4The dual term embeddings trained on Bing queries is available for download at https://www.microsoft.com/en-us/download/details.aspx?id=52597
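The sketch below follows the DESM_in-out score of Equations 3.33 and 3.34: document terms are averaged using their OUT vectors and each query term contributes the cosine similarity between its IN vector and that document centroid. The small IN/OUT lookup tables are made up for illustration.

```python
import numpy as np

in_embeddings = {
    "yale":    np.array([0.9, 0.1, 0.0]),
    "faculty": np.array([0.2, 0.8, 0.1]),
}
out_embeddings = {
    "yale":    np.array([0.3, 0.7, 0.1]),
    "faculty": np.array([0.8, 0.2, 0.0]),
    "alumni":  np.array([0.7, 0.3, 0.1]),
}

def desm_in_out(query_terms, doc_terms):
    unit = lambda v: v / np.linalg.norm(v)
    # document centroid over unit-normalized OUT embeddings (Equation 3.34)
    v_d = np.mean([unit(out_embeddings[t]) for t in doc_terms], axis=0)
    score = 0.0
    for t in query_terms:
        v_q = in_embeddings[t]
        score += v_q @ v_d / (np.linalg.norm(v_q) * np.linalg.norm(v_d))
    return score / len(query_terms)      # Equation 3.33

print(desm_in_out(["yale"], ["faculty", "alumni"]))
```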
An alternative to representing queries and documents as an aggregate of their term embeddings is to incorporate the term representations into existing IR models, such as the ones we discussed in Section 3.1. Zuccon et al. [154] proposed the Neural Translation Language Model (NTLM) that uses the similarity between term embeddings as a measure for the term-term translation probability p(t_q|t_d) in Equation 3.10.
p(t_q|t_d) = \frac{\cos(\vec{v}_{t_q}, \vec{v}_{t_d})}{\sum_{t \in T} \cos(\vec{v}_t, \vec{v}_{t_d})} \qquad (3.35)
On similar lines, Ganguly et al. [223] proposed the Generalized Language Model (GLM) which extends the Language Model based approach in Equation 3.9 to,
p(d|q) = \prod_{t_q \in q} \Bigg( \lambda \frac{tf(t_q,d)}{|d|} + \alpha \frac{\sum_{t_d \in d} \big(sim(\vec{v}_{t_q}, \vec{v}_{t_d}) \cdot tf(t_d,d)\big)}{\sum_{t_{d_1} \in d} \sum_{t_{d_2} \in d} sim(\vec{v}_{t_{d_1}}, \vec{v}_{t_{d_2}}) \cdot |d|^2}
+ \beta \frac{\sum_{\bar{t} \in N_t} \big(sim(\vec{v}_{t_q}, \vec{v}_{\bar{t}}) \cdot \sum_{\bar{d} \in D} tf(\bar{t},\bar{d})\big)}{\sum_{t_{d_1} \in N_t} \sum_{t_{d_2} \in N_t} sim(\vec{v}_{t_{d_1}}, \vec{v}_{t_{d_2}}) \cdot \sum_{\bar{d} \in D} |\bar{d}| \cdot |N_t|}
+ (1 - \alpha - \beta - \lambda) \frac{\sum_{\bar{d} \in D} tf(t_q,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|} \Bigg) \qquad (3.36)
Where, N_t is the set of nearest-neighbours of term t. Ai et al. [214] incorporate paragraph vectors [210] into the query-likelihood model [90].
Another approach, based on the Earth Mover’s Distance (EMD) [224], involves estimating similarity between pairs of documents by computing the minimum distance in the embedding space that each term in the first document needs to travel to reach the terms in the second document. This measure, commonly referred to as the Word Mover’s Distance (WMD), was originally proposed by Wan et al. [225, 226], but used WordNet and topic categories instead of embeddings for defining the distance between terms. Term embeddings were later incorporated into the model by Kusner et al. [227, 228]. Finally, Guo et al. [229] incorporated a similar notion of distance into the Non-linear Word Transportation (NWT) model that estimates relevance between a query and a document. The NWT model involves solving the following constrained optimization problem,
\max \sum_{t_q \in q} \log\Bigg( \sum_{t_d \in u(d)} f(t_q,t_d) \cdot \max\big(\cos(\vec{v}_{t_q}, \vec{v}_{t_d}), 0\big)^{idf(t_q)+b} \Bigg) \qquad (3.37)
\text{subject to } f(t_q,t_d) \ge 0, \quad \forall t_q \in q, t_d \in d \qquad (3.38)
\text{and } \sum_{t_q \in q} f(t_q,t_d) = \frac{tf(t_d,d) + \mu \frac{\sum_{\bar{d} \in D} tf(t_d,\bar{d})}{\sum_{\bar{d} \in D} |\bar{d}|}}{|d| + \mu}, \quad \forall t_d \in d \qquad (3.39)
\text{where, } idf(t) = \frac{|D| - df(t) + 0.5}{df(t) + 0.5} \qquad (3.40)

u(d) is the set of all unique terms in document d, and b is a constant.

Another term-alignment based distance metric was proposed by Kenter and de Rijke [230] for computing short-text similarity. The design of the saliency-weighted semantic network (SWSN) is motivated by the BM25 [80] formulation.
swsn(s_l, s_s) = \sum_{t_l \in s_l} idf(t_l) \cdot \frac{sem(t_l, s_s) \cdot (k_1 + 1)}{sem(t_l, s_s) + k_1 \cdot \big(1 - b + b \cdot \frac{|s_s|}{avgsl}\big)} \qquad (3.41)
\text{where, } sem(t,s) = \max_{\bar{t} \in s} \cos(\vec{v}_t, \vec{v}_{\bar{t}}) \qquad (3.42)
Here s_s is the shorter of the two sentences to be compared, and s_l the longer sentence.
Figure 3.11 highlights the distinct strengths and weaknesses of matching using local and distributed representations of terms for retrieval. For the query “Cambridge”, a local representation (or exact matching) based model can easily distinguish between the passage on Cambridge (Figure 3.11a) and the one on Oxford (Figure 3.11b). However, the model is easily duped by a non-relevant passage that has been artificially injected with the term “Cambridge” (Figure 3.11c). The embedding space based matching, on the other hand, can spot that the other terms in the passage provide clear indication that the passage is not about a city, but fails to realize that the passage about Oxford (Figure 3.11b) is inappropriate for the same query.
the city of cambridge is a university city and the county town of cambridgeshire , england . it lies in east anglia , on the river cam , about 50 miles ( 80 km ) north of london . according to the united kingdom
census 2011 , its population was 123867 ( including 24488 students ) . this makes cambridge the second largest city in cambridgeshire after peterborough , and the 54th largest in the united kingdom . there is archaeological evidence of settlement in the area during the bronze age and roman times ; under viking rule cambridge became an important trading centre . the first town charters were granted in the 12th century , although city status was not conferred until 1951 . (a) Passage about the city of Cambridge
oxford is a city in the south east region of england and the county town of oxfordshire . with a population of 159994 it is the 52nd largest city in the united kingdom , and one of the fastest growing and most ethnically diverse . oxford has a broad economic base . its industries include motor manufacturing , education , publishing and a large number of information
technology and sciencebased businesses , some being academic offshoots . the city is known worldwide as the home of the university
of oxford , the oldest university in the englishspeaking world . buildings in oxford demonstrate examples of every english architectural
period since the arrival of the saxons , including the mid18thcentury radcliffe camera . oxford is known as the city of dreaming spires , a term coined by poet matthew arnold . (b) Passage about the city of Oxford
the cambridge ( giraffa camelopardalis ) is an african eventoed ungulate mammal , the tallest living terrestrial animal and
the largest ruminant . its species name refers to its camellike shape and its leopardlike colouring . its chief distinguishing characteristics are its extremely long neck and
legs , its hornlike ossicones , and its distinctive coat patterns . it is classified under the family giraffidae , along with its closest extant relative , the okapi . the nine subspecies are distinguished by their coat patterns . the scattered range of giraffes extends from chad in the north to south africa in the south , and from niger in the west to somalia in the east . giraffes usually inhabit savannas , grasslands , and open woodlands . (c) Passage about giraffes, but ’giraffe’ is replaced by ’Cambridge’
Figure 3.11: A visualization of IN-OUT similarities between terms in different passages with the query term “Cambridge”. The visualization reveals that, besides the term “Cambridge”, many other terms in the passages about both Cambridge and Oxford have high similarity to the query term. The passage (c) is adapted from a passage on giraffes by replacing all the occurrences of the term “gi- raffe” with “cambridge”. However, none of the other terms in (c) are found to be relevant to the query term. An embedding based approach may be able to determine that passage (c) is non-relevant to the query “Cambridge”, but fail to realize that passage (b) is also non-relevant. A term counting-based model, on the other hand, can easily identify that passage (b) is non-relevant but may rank passage (c) incorrectly high. 3.3. Term embeddings for IR 73
(a) Global embedding (b) Local embedding
Figure 3.12: A two-dimensional visualization of term embeddings when the vector space is trained on a (a) global corpus and a (b) query-specific corpus, respectively. The grey circles represent individual terms in the vocabulary. The white circle represents the query “ocean remote sensing” as the centroid of the embeddings of the individual query terms, and the light grey circles correspond to good expansion terms for this query. When the representations are query-specific then the meaning of the terms are better disambiguated, and more likely to result in the selection of good expansion terms.
Embedding based models often perform poorly when the retrieval is performed over the full document collection [165]. However, as seen in the example of Figure 3.11, the errors made by embedding based models and exact matching models may be different—and the combination of the two is often preferred [165, 198, 214, 223]. Another technique is to use the embedding based model to re-rank only a subset of the documents retrieved by a different—generally an exact matching based—IR model. The chaining of different IR models where each successive model re-ranks a smaller number of candidate documents is called Telescoping [111]. Telescoping evaluations are common in the neural IR literature [7, 163–165, 231] and the results are representative of performances of these models on re-ranking tasks. However, as Mitra et al. [165] demonstrate, good performances on re-ranking tasks may not be indicative of how the model would perform if the retrieval involves larger document collections.

3.3.2 Query expansion
Instead of comparing the query and the document directly in the embedding space, an alternative approach is to use term embeddings to find good expansion candidates from a global vocabulary, and then retrieve documents using the expanded query. Different functions [139, 166, 232] have been proposed for estimating the relevance of candidate terms to the query—all of them involve comparing the candidate term individually to every query term using their vector representations, and then aggregating the scores. For example, [139, 166] estimate the relevance of candidate term t_c as,
score(t_c, q) = \frac{1}{|q|} \sum_{t_q \in q} \cos(\vec{v}_{t_c}, \vec{v}_{t_q}) \qquad (3.43)
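A minimal sketch of this scoring function (Eq. 3.43), assuming the term embeddings are available as a dictionary of numpy vectors; the tokenized query terms and the embeddings lookup are simplifications rather than part of the cited methods:

```python
import numpy as np

def expansion_score(candidate, query_terms, embeddings):
    # Eq. 3.43: average cosine similarity between the candidate term
    # and every query term, computed in the embedding space.
    v_c = embeddings[candidate]
    sims = []
    for t in query_terms:
        v_q = embeddings[t]
        sims.append(np.dot(v_c, v_q) /
                    (np.linalg.norm(v_c) * np.linalg.norm(v_q)))
    return float(np.mean(sims))
```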
Term embedding based query expansion on its own performs worse than pseudo-relevance feedback [166]. But, like the models in the previous section, it shows better performance when used in combination with PRF [232].
Diaz et al. [139] explored the idea of query-specific term embeddings and found that they are more effective in identifying good expansion terms than a global representation (see Figure 3.12). The local model proposed by Diaz et al. [139] incorporates relevance feedback in the process of learning the term embeddings—a set of documents is retrieved for the query and a query-specific term embedding model is trained on them. This local embedding model is then employed for identifying expansion candidates for the query for a second round of document retrieval.
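The overall local-embedding procedure can be sketched roughly as follows. This is a simplified illustration, assuming gensim's Word2Vec (4.x API) as the embedding learner and hypothetical retrieve and tokenize helpers; it is not the implementation of Diaz et al. [139].

```python
from gensim.models import Word2Vec

def local_expansion_terms(query, retrieve, tokenize, n_docs=1000, n_terms=10):
    # 1. Pseudo-relevance feedback: retrieve a query-specific document set.
    feedback_docs = retrieve(query, k=n_docs)

    # 2. Train a query-specific (local) embedding model on those documents.
    corpus = [tokenize(doc) for doc in feedback_docs]
    local_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2)

    # 3. Score candidate terms against the query terms (as in Eq. 3.43)
    #    using the local embeddings, and keep the top candidates.
    query_terms = [t for t in tokenize(query) if t in local_model.wv]
    scores = {
        term: sum(local_model.wv.similarity(term, qt) for qt in query_terms)
              / max(len(query_terms), 1)
        for term in local_model.wv.index_to_key
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_terms]
```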
Term embeddings have also been explored for re-weighting query terms [233] and finding relevant query re-writes [211], as well as in the context of other IR tasks such as cross-lingual retrieval [217] and entity retrieval [57, 58]. In Section 3.6, we will discuss neural network models with deeper architectures and their applications to retrieval.
3.4 Supervised learning to rank
Learning to rank (LTR) for IR uses training data relq(d), such as human relevance labels and click data, to train towards an IR objective. Unlike traditional IR approaches, these models typically have a large number of learnable parameters that require many training samples to be tuned [35]. LTR models represent a rankable item—e.g., a query-document pair—as a feature vector ⃗x ∈ R^n. The ranking model f : ⃗x → R is trained to map the vector to a real-valued score such that, for a given query, more relevant documents are scored higher and some chosen rank-based metric is maximized. The model training is said to be end-to-end if the parameters of f are learned all at once rather than in parts, and if the vector ⃗x contains simple features rather than models. Liu [39] categorizes the different LTR approaches based on their training objectives.
• In the pointwise approach, the relevance information relq(d) is in the form of a numerical value associated with every query-document pair with input vector ⃗xq,d. The numerical relevance label can be derived from binary or graded relevance judgments or from implicit user feedback, such as a clickthrough rate. A regression model is typically trained on the data to predict the numerical value relq(d) given ⃗xq,d.
• In the pairwise approach, the relevance information is in the form of preferences between pairs of documents with respect to individual queries (e.g., di ≻q dj). The ranking problem in this case reduces to that of a binary classification to predict the more relevant document (a minimal sketch of such a pairwise loss follows this list).
• Finally, the listwise approach involves directly optimizing for a rank-based metric such as NDCG—which is more challenging because these metrics are often not continuous (and hence not differentiable) with respect to the model parameters.
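As a concrete illustration of the pairwise case referenced above, here is a minimal PyTorch sketch; the cross-entropy over score differences follows the spirit of RankNet [235], though the exact losses used in the cited models differ.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(score_pos, score_neg):
    # The preference d_i >_q d_j is reduced to a binary classification:
    # the preferred document should receive the higher model score.
    # Cross-entropy over the score difference (RankNet-style).
    diff = score_pos - score_neg
    return F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))
```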
Many machine learning models—including support vector machines [234], neural networks [235], and boosted decision trees [236]—have been employed over the years for the LTR task, and a correspondingly large number of different loss functions have been explored.
3.4.1 Input features
Traditional LTR models employ hand-crafted features [39] for representing query-document pairs in ⃗x. The design of these features typically encodes key IR insights, and each feature belongs to one of three categories (a small illustrative sketch follows the list).
• Query-independent or static features (e.g., incoming link count and document length)
• Query-dependent or dynamic features (e.g., BM25)
• Query-level features (e.g., query length)
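To make the three categories concrete, the following sketch assembles a toy feature vector ⃗x; the specific fields and the bm25_score helper are hypothetical examples rather than a prescribed feature set.

```python
import numpy as np

def make_feature_vector(query, doc, bm25_score):
    # Each entry illustrates one of the three feature categories above.
    return np.array([
        doc["inlink_count"],      # query-independent (static)
        doc["length"],            # query-independent (static)
        bm25_score(query, doc),   # query-dependent (dynamic)
        len(query.split()),       # query-level
    ], dtype=np.float32)
```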
In contrast, in recently proposed neural LTR models the deep architecture is responsible for feature learning5 from simple vector representations of the input, which may resemble the schemes described in Section 3.5.1 (e.g., [164]) or the interaction-based representations that we discuss later in Section 3.6.3 (e.g., [7, 238]). These features, learnt from the query and document texts, can be combined with other features that may not be possible to infer from the content, such as document popularity [239].
3.4.2 Loss functions
In ad hoc retrieval, the LTR model needs to rank the documents in a collection D in response to a query. When training a neural model for this task, the ideal ranking of documents for a query q from the training dataset can be determined based on the relevance labels relq(d) associated with each document d ∈ D. In the pointwise approach, the neural model is trained to directly estimate relq(d), which can be a numeric value or a categorical label.
5In the literature, when the model is responsible for feature learning the task is sometimes categorized as "learning to match" [83, 237]. However, from a machine learning viewpoint, this distinction between whether ⃗x is a vector of hand-engineered features or a vector encoding of query-document text makes little difference to the LTR formulation described here. We, therefore, avoid making this distinction in favor of a more general definition.
Regression loss Given ⃗xq,d, the task of estimating the relevance label relq(d) can be cast as a regression problem, and a standard loss function—such as the square loss—can be employed.
\mathcal{L}_{squared} = \left\| rel_q(d) - s(\vec{x}_{q,d}) \right\|^2 \qquad (3.44)
where s(⃗xq,d) is the score predicted by the model, and relq(d) can either be the value of the relevance label [240] or the one-hot representation when the label is categorical [241].
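A minimal PyTorch sketch of this pointwise objective (Eq. 3.44), assuming batched model scores and numeric relevance labels:

```python
import torch

def squared_loss(scores, relevance_labels):
    # Pointwise regression (Eq. 3.44): the model score s(x_{q,d}) is
    # regressed towards the relevance label rel_q(d), averaged over a batch.
    return torch.mean((relevance_labels - scores) ** 2)
```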
Classification loss When the relevance labels in the training data are categorical, it makes more sense to treat the label prediction problem as a multiclass classification.
Under this setting, the neural model estimates the probability of a label y given ⃗xq,d.
The probability of the correct label yq,d (= relq(d)) can be obtained by the softmax function,