Pretrained Transformers for Text Ranking: BERT and Beyond

Jimmy Lin,1 Rodrigo Nogueira,1 and Andrew Yates2,3
1 David R. Cheriton School of Computer Science, University of Waterloo
2 University of Amsterdam
3 Max Planck Institute for Informatics
Version 0.99 — August 20, 2021

Abstract

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task. Although the most common formulation of text ranking is search, instances of the task can also be found in many text processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. For text ranking, transformer-based models produce high quality results across many domains, tasks, and settings. This survey provides a synthesis of existing work as a single point of entry for practitioners who wish to deploy transformers for text ranking and researchers who wish to pursue work in this area.

We cover a wide range of techniques, grouped into two categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. Examples in the first category include approaches based on relevance classification, evidence aggregation from multiple segments of text, and document and query expansion. The second category involves using transformers to learn dense representations of texts, where ranking is formulated as comparisons between query and document representations that take advantage of nearest neighbor search.

At a high level, two themes pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Much effort has been devoted to developing ranking models that address the mismatch between document lengths and the length limitations of existing transformers. The computational costs of inference with transformers have led to alternatives and variants that aim for different tradeoffs, both within multi-stage architectures as well as with dense learned representations.

Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

Contents

1 Introduction
  1.1 Text Ranking Problems
  1.2 A Brief History
    1.2.1 The Beginnings of Text Ranking
    1.2.2 The Challenges of Exact Match
    1.2.3 The Rise of Learning to Rank
    1.2.4 The Advent of Deep Learning
    1.2.5 The Arrival of BERT
  1.3 Roadmap, Assumptions, and Omissions

2 Setting the Stage
  2.1 Texts
  2.2 Information Needs
  2.3 Relevance
  2.4 Relevance Judgments
  2.5 Ranking Metrics
  2.6 Community Evaluations and Reusable Test Collections
  2.7 Descriptions of Common Test Collections
  2.8 Keyword Search
  2.9 Notes on Parlance

3 Multi-Stage Architectures for Reranking
  3.1 A High-Level Overview of BERT
  3.2 Simple Relevance Classification: monoBERT
    3.2.1 Basic Design of monoBERT
    3.2.2 Exploring monoBERT
    3.2.3 Investigating How BERT Works
    3.2.4 Nuances of Training BERT
  3.3 From Passage to Document Ranking
    3.3.1 Document Ranking with Sentences: Birch
    3.3.2 Passage Score Aggregation: BERT–MaxP and Variants
    3.3.3 Leveraging Contextual Embeddings: CEDR
    3.3.4 Passage Representation Aggregation: PARADE
    3.3.5 Alternatives for Tackling Long Texts
  3.4 From Single-Stage to Multi-Stage Rerankers
    3.4.1 Reranking Pairs of Texts
    3.4.2 Reranking Lists of Texts
    3.4.3 Efficient Multi-Stage Rerankers: Cascade Transformers
  3.5 Beyond BERT
    3.5.1 Knowledge Distillation
    3.5.2 Ranking with Transformers: TK, TKL, CK
    3.5.3 Ranking with Sequence-to-Sequence Models: monoT5
    3.5.4 Ranking with Sequence-to-Sequence Models: Query Likelihood
  3.6 Concluding Thoughts

4 Refining Query and Document Representations
  4.1 Query and Document Expansion: General Remarks
  4.2 Pseudo-Relevance Feedback with Contextualized Embeddings: CEQE
  4.3 Document Expansion via Query Prediction: doc2query
  4.4 Term Reweighting as Regression: DeepCT
  4.5 Term Reweighting with Weak Supervision: HDCT
  4.6 Combining Term Expansion with Term Weighting: DeepImpact
  4.7 Expansion of Query and Document Representations
  4.8 Concluding Thoughts

5 Learned Dense Representations for Ranking
  5.1 Task Formulation
  5.2 Nearest Neighbor Search
  5.3 Pre-BERT Text Representations for Ranking
  5.4 Simple Transformer Bi-encoders for Ranking
    5.4.1 Basic Bi-encoder Design: Sentence-BERT
    5.4.2 Bi-encoders for Dense Retrieval: DPR and ANCE
    5.4.3 Bi-encoders for Dense Retrieval: Additional Variations
  5.5 Enhanced Transformer Bi-encoders for Ranking
    5.5.1 Multiple Text Representations: Poly-encoders and ME-BERT
    5.5.2 Per-Token Representations and Late Interactions: ColBERT
  5.6 Knowledge Distillation for Transformer Bi-encoders
  5.7 Concluding Thoughts

6 Future Directions and Conclusions
  6.1 Notable Content Omissions
  6.2 Open Research Questions
  6.3 Final Thoughts

Acknowledgements

Version History

References

1 Introduction

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task. The most common formulation of text ranking is search, where the search engine (also called the retrieval system) produces a ranked list of texts (web pages, scientific papers, news articles, tweets, etc.) ordered by estimated relevance with respect to the user's query. In this context, relevant texts are those that are "about" the topic of the user's request and address the user's information need. Information retrieval (IR) researchers call this the ad hoc retrieval problem.1

With keyword search, also called keyword querying (for example, on the web), the user typically types a few query terms into a search box (for example, in a browser) and gets back results containing representations of the ranked texts. These results are called ranked lists, hit lists, hits, "ten blue links",2 or search engine results pages (SERPs). The representations of the ranked texts typically comprise the title, associated metadata, "snippets" extracted from the texts themselves (for example, an extractive keyword-in-context summary where the user's query terms are highlighted), as well as links to the original sources. While there are plenty of examples of text ranking problems (see Section 1.1), this particular scenario is ubiquitous and undoubtedly familiar to all readers.

This survey provides an overview of text ranking with a family of neural network models known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2019], an invention of Google, is the best-known example. These models have been responsible for a paradigm shift in the fields of natural language processing (NLP) and information retrieval (IR), and more broadly, human language technologies (HLT), a catch-all term that includes technologies to process, analyze, and otherwise manipulate (human) language data. There are few endeavors involving the automatic processing of natural language that remain untouched by BERT.3

In the context of text ranking, BERT provides results that are undoubtedly superior in quality to what came before. This is a robust and widely replicated empirical result, across many text ranking tasks, domains, and problem formulations. A casual skim through paper titles in recent proceedings from NLP and IR conferences will leave the reader without a doubt as to the extent of the "BERT craze" and how much it has come to dominate the current research landscape.

However, the impact of BERT, and more generally, transformers, has not been limited to academic research. In October 2019, a Google blog post4 confirmed that the company had improved search "by applying BERT models to both ranking and featured snippets". Ranking refers to "ten blue links" and corresponds to most users' understanding of web search; "featured snippets" represent examples of question answering5 (see additional discussion in Section 1.1). Not to be outdone, in November 2019, a Microsoft blog post6 reported that "starting from April of this year, we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year". As a specific instance of transformer architectures, BERT has no doubt improved how users find relevant information. Beyond search, transformer models have left their marks as well.
For example, transformers dominate approaches to machine translation, which is the automatic translation of natural language text7 from one human language to another, say, from English to French.

1 There are many footnotes in this survey. Since nobody reads footnotes, we wanted to take one opportunity to inform the reader here that we've hidden lots of interesting details in the footnotes. But this message is likely to be ignored anyway.
2 Here's the first interesting tidbit: The phrase "ten blue links" is sometimes used to refer to web search and has a fascinating history. Fernando Diaz helped us trace the origin of this phrase to a BBC article in 2004 [BBC, 2004], where Tony Macklin, director of product at Ask UK, was quoted saying "searching is going to be about more than just 10 blue links". Google agreed: in 2010, Jon Wiley, Senior User Experience Designer for Google, said, "Google is no longer just ten blue links on a page, those days are long gone" [ReadWrite, 2010].
3 And indeed, programming languages as well [Alon et al., 2020, Feng et al., 2020]!
4 https://www.blog.google/products/search/search-language-understanding-bert/
5 https://support.google.com/websearch/answer/9351707
6 https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/
7 A machine translation system can be coupled with an automatic speech recognition system and a speech synthesis system to perform speech-to-speech translation—like a primitive form of the universal translator from Star Trek or (a less annoying version of) C-3PO from Star Wars!

Blog posts by both Facebook8 and Google9 tout the effectiveness of transformer-based architectures. Of course, these are just the high-profile announcements. No doubt many organizations—from startups to Fortune 500 companies, from those in the technology sector to those in financial services and beyond—have already deployed or are planning to deploy BERT (or one of its siblings or intellectual descendants) in production.

Transformers were first presented in June 2017 [Vaswani et al., 2017] and BERT was unveiled in October 2018.10 Although both are relatively recent inventions, we believe that there is a sufficient body of research such that the broad contours of how to apply transformers effectively for text ranking have begun to emerge, from high-level design choices to low-level implementation details. The "core" aspects of how BERT is used—for example, as a relevance classifier—are relatively mature. Many of the techniques we present in this survey have been applied across domains, tasks, and settings, and the improvements brought about by BERT (and related models) are usually substantial and robust. It is our goal to provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply BERT to text ranking problems and researchers who wish to pursue further advances in this area.

Like nearly all scientific advances, BERT was not developed in a vacuum, but built on several previous innovations, most notably the transformer architecture itself [Vaswani et al., 2017] and the idea of self-supervised pretraining based on language modeling objectives, previously explored by ULMFiT [Howard and Ruder, 2018] and ELMo (Embeddings from Language Models) [Peters et al., 2018]. Both ideas initially came together in GPT (Generative Pretrained Transformer) [Radford et al., 2018], and the additional innovation of bidirectional training culminated in BERT (see additional discussion about the history of these developments in Section 3.1). While it is important to recognize previous work, BERT is distinguished in bringing together many crucial ingredients to yield tremendous leaps in effectiveness on a broad range of natural language processing tasks.

Typically, "training" BERT (and pretrained models in general) to perform a downstream task involves starting with a publicly available pretrained model (often called a "model checkpoint") and then further fine-tuning the model using task-specific labeled data. In general, the computational and human effort involved in fine-tuning is far less than pretraining. The commendable decision by Google to open-source BERT and to release pretrained models supported widespread replication of the impressive results reported by the authors and additional applications to other tasks, settings, and domains. The rapid proliferation of these BERT applications was in part due to the relatively lightweight fine-tuning process.

BERT supercharged subsequent innovations by providing a solid foundation to build on. The germinal model, in turn, spawned a stampede of other models that differ to various extents in architecture, but that can nevertheless be viewed as variations on its main themes. These include ERNIE [Sun et al., 2019b], RoBERTa [Liu et al., 2019c], Megatron-LM [Shoeybi et al., 2019], XLNet [Yang et al., 2019f], DistilBERT [Sanh et al., 2019], ALBERT [Lan et al., 2020], ELECTRA [Clark et al., 2020b], Reformer [Kitaev et al., 2020], DeBERTa [He et al., 2020], Big Bird [Zaheer et al., 2020], and many more.

8 https://engineering.fb.com/ai-research/scaling-neural-machine-translation
9 https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html
10 The nature of academic publishing today means that preprints are often available (e.g., on arXiv) several months before the formal publication of the work in a peer-reviewed venue (which is increasingly becoming a formality). For example, the BERT paper was first posted on arXiv in October 2018, but did not appear in a peer-reviewed venue until June 2019, at NAACL 2019 (a top conference in NLP). Throughout this survey, we attribute innovations to their earliest known preprint publication dates, since that is the date when a work becomes "public" and available for other researchers to examine, critique, and extend. For example, the earliest use of BERT for text ranking was reported in January 2019 [Nogueira and Cho, 2019], a scant three months after the appearance of the original BERT preprint and well before the peer-reviewed NAACL publication. The rapid pace of progress in NLP, IR, and other areas of computer science today means that by the time an innovation formally appears in a peer-reviewed venue, the work is often already "old news", and in some cases, as with BERT, the innovation had already become widely adopted. In general, we make an effort to cite the peer-reviewed version of a publication unless there is some specific reason otherwise, e.g., to establish precedence. At the risk of bloating this already somewhat convoluted footnote even more, there's the additional complication of a conference's submission deadline. Clearly, if a paper got accepted at a conference, then the work must have existed at the submission deadline, even if it did not appear on arXiv. So how do we take this into account when establishing precedence? Here, we just throw up our hands and shrug; at this point, "contemporaneous" would be a fair characterization.

Additional pretrained sequence-to-sequence transformer models inspired by BERT include T5 [Raffel et al., 2020], UniLM [Dong et al., 2019], PEGASUS [Zhang et al., 2020c], and BART [Lewis et al., 2020b]. Although a major focus of this survey is BERT, many of the same techniques we describe can be (and have been) applied to its descendants and relatives as well, and BERT is often incorporated as part of a larger neural ranking model (as Section 3 discusses in detail).

While BERT is no doubt the "star of the show", there are many exciting developments beyond BERT being explored right now: the application of sequence-to-sequence transformers, transformer variants that yield more efficient inference, ground-up redesigns of transformer architectures, and representation learning with transformers—just to name a few (all of which we will cover). The diversity of research directions being actively pursued by the research community explains our choice of subtitle for this survey ("BERT and Beyond").

While many aspects of the application of BERT and transformers to text ranking can be considered "mature", there remain gaps in our knowledge and open research questions yet to be answered. Thus, in addition to synthesizing the current state of knowledge, we discuss interesting unresolved issues and highlight where we think the field is going. Let us begin!

1.1 Text Ranking Problems

While our survey opens with search (specifically, what information retrieval researchers call ad hoc retrieval) as the motivating scenario due to the ubiquity of search engines, text ranking appears in many other guises. Beyond typing keywords into a search box and getting back “ten blue links”, examples of text ranking abound in scenarios where users desire access to relevant textual information, in a broader sense. Consider the following examples:

Question Answering (QA). Although there are many forms of question answering, the capability that most users have experience with today appears in search engines as so-called "infoboxes" or what Google calls "featured snippets"11 that appear before (or sometimes to the right of) the main search results. In the context of a voice-capable intelligent agent such as Siri or Alexa, answers to user questions are directly synthesized using text-to-speech technology. The goal is for the system to identify (or extract) a span of text that directly answers the user's question, instead of returning a list of documents that the user must then manually peruse. In "factoid" question answering, systems primarily focus on questions that can be answered with short phrases or named entities such as dates, locations, organizations, etc. Although the history of question answering systems dates back to the 1960s [Simmons, 1965], modern extractive approaches (i.e., techniques focused on extracting spans of text from documents) trace their roots to work that began in the late 1990s [Voorhees, 2001]. Most architectures that adopt an extractive approach break the QA challenge into two steps: first, select passages of text from a potentially large corpus that are likely to contain answers, and second, apply answer extraction techniques to identify the answer spans. In the modern neural context, Chen et al. [2017a] called this the retriever–reader framework. The first stage (i.e., the "retriever") is responsible for tackling the text ranking problem. Although question answering encompasses more than just extractive approaches or a focus on factoid questions, in many cases methods for approaching these challenges still rely on retrieving texts from a corpus as a component.

Community Question Answering (CQA). Users sometimes search for answers not by attempting to find relevant information directly, but by locating another user who has asked the same or a similar question, for example, in a frequently-asked questions (FAQ) list or in an online forum such as Quora or Stack Overflow. Answers to those questions usually address the user's information need. This mode of searching, which dates back to the late 1990s [Burke et al., 1997], is known as community question answering (CQA) [Srba and Bielikova, 2016]. Although it differs from traditional keyword-based querying, CQA is nevertheless a text ranking problem. One standard approach formulates the problem as estimating semantic similarity between two pieces of text—more specifically, whether two natural language questions are paraphrases of each other. A candidate list of questions (for example, based on keyword search) is sorted by the estimated degree of "paraphrase similarity" (for example, the output of a machine-learned model) and the top-k results are returned to the user.

11 https://blog.google/products/search/reintroduction-googles-featured-snippets/

Information Filtering. In search, queries are posed against a (mostly) static collection of texts. Filtering considers the opposite scenario, where a (mostly) static query is posed against a stream of texts. Two examples of this mode of information seeking might be familiar to many readers: push notifications that are sent to a user's mobile device whenever some content of interest is published (which could be a news story or a social media post); and, in a scholarly context, email digests that are sent to users whenever a paper that matches the user's interests is published (a feature available in Google Scholar today). Not surprisingly, information filtering has a long history, dating back to the 1960s, when it was called "selective dissemination of information" (SDI); see Housman and Kaskela [1970] for a survey of early systems. The most recent incarnation of this idea is "real-time summarization" in the context of social media posts on Twitter, with several community-wide evaluations focused on notification systems that inform users in real time about relevant content as it is being generated [Lin et al., 2016]. Before that, document filtering was explored in the context of the TREC Filtering Tracks, which ran from 1995 [Lewis, 1995] to 2002 [Robertson and Soboroff, 2002], and the general research area of topic detection and tracking, also known as TDT [Allan, 2002]. The relationship between search and filtering has been noted for decades: Belkin and Croft [1992] famously argued that they represented "two sides of the same coin". Models that attempt to capture relevance for ad hoc retrieval can also be adapted for information filtering.

Text Recommendation. When a search system is displaying a search result, it might suggest other texts that may be of interest to the user, for example, to assist in browsing [Smucker and Allan, 2006]. This is frequently encountered on news sites, where related articles of interest might offer background knowledge or pointers to related news stories [Soboroff et al., 2018]. In the context of searching the scientific literature, the system might suggest papers that are similar in content: An example of this feature is implemented in the PubMed search engine, which provides access to the scientific literature in the life sciences [Lin and Wilbur, 2007]. Citation recommendation [Ren et al., 2014, Bhagavatula et al., 2018] is another good example of text recommendation in the scholarly context. All of these challenges involve text ranking.

Text Ranking as Input to Downstream Modules. The output of text ranking may not be intended for direct user consumption, but may rather be meant to feed downstream components: for example, an information extraction module to identify key entities and relations [Gaizauskas and Robertson, 1997], a summarization module that attempts to synthesize information from multiple sources with respect to an information need [Dang, 2005], a clustering module that organizes texts based on content similarity [Vadrevu et al., 2011], or a browsing interface for exploration and discovery [Sadler, 2009]. Even in cases where a ranked list of results is not directly presented to the user, text ranking may still form an important component technology in a larger system.

We can broadly characterize ad hoc retrieval, question answering, and the different tasks described above as “information access”—a term we use to refer to these technologies collectively. Text ranking is without a doubt an important component of information access. However, beyond information access, examples of text ranking abound in natural language processing. For example:

Semantic Similarity Comparisons. The question of whether two texts “mean the same thing” is a fundamental problem in natural language processing and closely related to the question of whether a text is relevant to a query. While there are some obvious differences, researchers have explored similar approaches and have often even adopted the same models to tackle both problems. In the context of learned dense representations for ranking, the connections between these two problems have become even more intertwined, bringing the NLP and IR communities closer and further erasing the boundaries between text ranking, question answering, paraphrase detection, and many related problems. Since Section 5 explores these connections in detail, we will not further elaborate here.

Distant Supervision and Data Augmentation. Training data form a crucial ingredient in NLP approaches based on supervised machine learning. All things being equal, the more data the better,12 and so there is a never-ending quest for practitioners and researchers to acquire more, more, and more! Supervised learning requires training examples that have been annotated for the specific task, typically by humans, which is a labor-intensive process.

12 A well-known observation dating back at least two decades; see, for example, Banko and Brill [2001].

For example, to train a sentiment classifier, we must somehow acquire a corpus of texts in which each instance has been labeled with its sentiment (e.g., positive or negative). There are natural limits to the amount of data that can be acquired via human annotation: in the sentiment analysis example, we can automatically harvest various online sources that have "star ratings" associated with texts (e.g., reviews), but even these labels are ultimately generated by humans. This is a form of crowdsourcing, which merely shifts the source of the labeling effort but does not change the fundamental need for human annotation.

Researchers have extensively explored many techniques to overcome the data bottleneck in supervised machine learning. At a high level, distant supervision and data augmentation represent two successful approaches, although in practice they are closely related. Distant supervision involves training models using low-quality, "weakly" labeled examples that are gathered using heuristics and other simple but noisy techniques. One simple example, for training a spam classifier, is to assume that all emails mentioning Viagra are spam; obviously, there are "legitimate" non-spam emails (called "ham") that use the term, but the heuristic may be a reasonable way to build an initial classifier [Cormack et al., 2011]. We give this example because it is easy to convey, but the general idea of using heuristics to automatically gather training examples to train a classifier in NLP dates back to Yarowsky [1995], in the context of word sense disambiguation.13

Data augmentation refers to techniques that exploit a set of training examples to gather or create additional training examples. For example, given a corpus of English sentences, we could translate them automatically using a machine translation (MT) system, say, into French, and then translate those sentences back into English (this is called back-translation).14 With a good MT system, the resulting sentences are likely paraphrases of the original sentences, and using this technique we can automatically increase the quantity and diversity of the training examples that a model is exposed to.

Text ranking lies at the heart of many distant supervision and data augmentation techniques for natural language processing. We illustrate with relation extraction, which is the task of identifying and extracting relationships in natural language text. For example, from the sentence "Albert Einstein was born in Ulm, in the Kingdom of Württemberg in the German Empire, on 14 March 1879", a system could automatically extract the relation birthdate(Albert Einstein, 1879/03/14); these are referred to as "tuples" or extracted facts. Relations usually draw from a relatively constrained vocabulary (dozens at most), but can be domain specific, for example, indicating that a gene regulates a protein (in the biomedical domain). One simple technique for distant supervision is to search for specific patterns or "cue phrases" such as "was born in" and take the tokens occurring to the left and to the right of the phrase as participating in the relation (i.e., they form the tuple). These tuples, together with the source documents, can serve as noisy training data. One simple technique for data augmentation is to take already known tuples, e.g., Albert Einstein and his birthdate, and search a corpus for sentences that contain those tokens (e.g., by exact or approximate string matching).
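To make these two heuristics concrete, here is a minimal sketch of pattern-based distant supervision and tuple-based data augmentation for the birthdate relation. It is our own illustration rather than code from any of the cited systems; the toy corpus, the regular expression, and the function names are all assumptions made for the example.

```python
import re

# A toy corpus; in practice this would be a large document collection.
corpus = [
    "Albert Einstein was born in Ulm on 14 March 1879.",
    "Marie Curie was born in Warsaw on 7 November 1867.",
    "The violin was sold to Mary by John.",
]

# Distant supervision: use a cue phrase to harvest noisy (entity, date) tuples.
PATTERN = re.compile(r"(.+?) was born in .*? on (\d{1,2} \w+ \d{4})")

def harvest_tuples(sentences):
    """Extract noisy birthdate tuples by matching a cue phrase."""
    tuples = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            tuples.append((match.group(1), match.group(2)))
    return tuples

# Data augmentation: given known tuples, find more sentences that mention both
# elements; these become additional (noisy) training examples.
def find_supporting_sentences(known_tuples, sentences):
    examples = []
    for entity, date in known_tuples:
        for sentence in sentences:
            if entity in sentence and date in sentence:
                examples.append((sentence, (entity, date)))
    return examples

seed_tuples = harvest_tuples(corpus)
training_examples = find_supporting_sentences(seed_tuples, corpus)
```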
Furthermore, we can combine the two techniques iteratively: search with a pattern, identify tuples, find texts with those tuples, and from those learn more patterns, going around and around.15 Proposals along these lines date back to the late 1990s [Riloff, 1996, Brin, 1998, Agichtein and Gravano, 2000].16 Obviously, training data and extracted tuples gathered in this manner are noisy, but studies have empirically shown that such approaches are cheap when used alone and effective in combination with supervised techniques. See Smirnova and Cudré-Mauroux [2018]

13 Note that the term "distant supervision" was coined in the early 2000s, so it would be easy to miss these early papers by keyword search alone; Yarowsky calls his approach "unsupervised".
14 The "trick" of translating a sentence from one language into another and then back again is nearly as old as machine translation systems themselves. An apocryphal story from the 1960s goes that with an early English–Russian MT system, the phrase "The spirit is willing, but the flesh is weak" translated into Russian and back into English again became "The whisky is strong, but the meat is rotten" [Hutchins, 1995] (in some accounts, whisky is replaced with vodka). The earliest example we could find of using this trick to generate synthetic training data is Alshawi et al. [1997]. Bannard and Callison-Burch [2005] is often cited for using "pivot languages" (the other language we translate into and back) as anchors for automatically extracting paraphrases from word alignments.
15 The general idea of training a machine learning model on its own output, called self-training, dates back to at least the 1960s [Scudder, 1965].
16 Although, once again, they did not specifically use the modern terminology of distant supervision and data augmentation.

for a survey of distant supervision techniques applied to relation extraction, and Snorkel [Ratner et al., 2017] for a modern implementation of these ideas.

Wrapped inside these distant supervision and data augmentation techniques are usually variants of text ranking problems, centered around the question of "is this a good training example?" For example, given a collection of sentences that match a particular pattern, or when considering multiple patterns, which ones are "good"? Answering this question requires ranking texts with respect to the quality of the evidence, and many scoring techniques proposed in the above-cited papers share similarities with the probabilistic framework for relevance [Robertson and Zaragoza, 2009].

An entirely different example comes from machine translation: In modern systems, such as those built by Facebook and Google referenced in the introduction, translation models are learned from a parallel corpus (also called bitext), consisting of pairs of sentences in two languages that are translations of each other [Tiedemann, 2011]. Some parallel corpora can be found "naturally" as the byproduct of an organization's deliberate effort to disseminate information in multiple languages, for example, proceedings of the Canadian Parliament in French and English [Brown et al., 1990], and texts produced by the United Nations in many different languages. In modern data-driven approaches to machine translation, these pairs serve as the input for training translation models. Since there are limits to the amount of parallel corpora available, researchers have long explored techniques that can exploit comparable data, or texts in different languages that are topically similar (i.e., "talk about the same thing") but are not necessarily translations of each other [Resnik and Smith, 2003, Munteanu and Marcu, 2005, Smith et al., 2010]. Techniques that can take advantage of comparable corpora expand the scope and volume of data that can be thrown at the machine translation problem, since the restriction of semantic equivalence is relaxed. Furthermore, researchers have developed techniques for mining comparable corpora automatically at scale [Uszkoreit et al., 2010, Ture and Lin, 2012]. This can be viewed as a cross-lingual text ranking problem [Ture et al., 2011], where the task is to estimate the semantic similarity between sentences in different languages, i.e., whether they are mutual translations.

Selecting from Competing Hypotheses. Many natural language tasks that involve selecting from competing hypotheses can be formulated as text ranking problems, albeit on shorter segments of text, possibly integrated with additional features. The larger the hypothesis space, the more crucial text ranking becomes as a method to first reduce the number of candidates under consideration.

There are instances of text ranking problems in "core" NLP tasks that at first glance have nothing to do with text ranking. Consider the semantic role labeling problem [Gildea and Jurafsky, 2001, Palmer et al., 2010], where the system's task is to populate "slots" in a conceptual "frame" with entities that fill the "semantic roles" defined by the frame. For example, the sentence "John sold his violin to Mary" depicts a COMMERCIALTRANSACTION frame, where "John" is the SELLER, "Mary" is the BUYER, and the violin is the GOODS transacted. One strategy for semantic role labeling is to identify all entities in the sentence, and for each slot, rank the entities by the likelihood that each plays that role. For example, is "John", "Mary", or "the violin" most likely to be the SELLER? This ranking formulation can be augmented by attempts to perform joint inference to resolve cases where the same entity is identified as the most likely filler of more than one slot; for example, resolving the case where a model (independently) identifies "John" erroneously as both the most likely buyer and the most likely seller (which is semantically incoherent). Although the candidate entities are short natural language phrases, they can be augmented with a number of features, in which case the problem begins to share characteristics with ranking in a vector space model. While the number of entities to be ranked is not usually very large, what's important is the amount of evidence (i.e., different features) used to estimate the probability that an entity fills a role, which isn't very different from relevance classification (see Section 3.2).

Another problem that lends itself naturally to a ranking formulation is entity linking, where the task is to resolve an entity with respect to an external knowledge source such as Wikidata [Vrandečić and Krötzsch, 2014]. For example, in a passage of text that mentions Adam Smith, which exact person is being referenced? Is it the famous 18th century Scottish economist and moral philosopher, or one of the lesser-known individuals who share the same name? An entity linking system "links" the instance of the entity mention (in a piece of text) to a unique id in the knowledge source: the Scottish economist has the unique id Q9381,17 while the other individuals have different ids.

17 https://www.wikidata.org/wiki/Q9381

Entity linking can be formulated as a ranking problem, where candidates from the knowledge source are ranked in terms of their likelihood of being the actual referent of a particular mention [Shen et al., 2015]. This is an instance of text ranking because these candidates are usually associated with textual descriptions—for example, a short biography of the individual—which form crucial evidence. Here, the "query" is the entity to be linked, represented not only by its surface form (i.e., the mention string), but also by the context in which the entity appears. For example, if the text discusses the Wealth of Nations, it's likely referencing the famous Scot.

Yet another example of text ranking in a natural language task that involves selecting from competing hypotheses is the problem of fact verification [Thorne et al., 2018], for example, to combat the spread of misinformation online. Verifying the veracity of a claim requires fetching supporting evidence from a possibly large corpus and assessing the credibility of those sources. The first step of gathering possible supporting evidence is a text ranking problem. Here, the hypothesis space is quite large (passages from an arbitrarily large corpus), and thus text ranking plays a critical role.

In the same vein, for systems that engage in or assist in human dialogue, such as intelligent agents or "chatbots", one common approach to generating responses (beyond question answering and information access discussed above) is to retrieve possible responses from a corpus (and then perhaps modify them) [Henderson et al., 2017, Dinan et al., 2019, Roller et al., 2020]. Here, the task is to rank possible responses with respect to their appropriateness.

The point of this discussion is that while search is perhaps the most visible instance of the text ranking problem, there are manifestations everywhere—not only in information retrieval but also in natural language processing. This exposition also explains our rationale for intentionally using the term "text ranking" throughout this survey, as opposed to the more popular term "document ranking". In many applications, the "atomic unit" of text to be ranked is not a document, but rather a sentence, a paragraph, or even a tweet; see Section 2.1 and Section 2.9 for more discussion.

To better appreciate how BERT and transformers have revolutionized text ranking, it is first necessary to understand "how we got here". We turn our attention to this next in a brief exposition of important developments in information retrieval over the past three quarters of a century.

1.2 A Brief History

The vision of exploiting computing machines for information access is nearly as old as the invention of computing machines themselves, long before computer science emerged as a coherent discipline. The earliest motivation for developing information access technologies was to cope with the explosion of scientific publications in the years immediately following World War II.18 Vannevar Bush's often-cited essay in The Atlantic in July 1945, titled "As We May Think" [Bush, 1945], described a hypothetical machine called the "memex" that performs associative indexing to connect arbitrary items of content stored on microfilm, as a way to capture insights and to augment the memory of scientists. The article describes technologies that we might recognize today as capturing aspects of personal computers, hypertext, the Semantic Web, and online encyclopedias.19 A clearer description of what we might more easily identify today as a search engine was provided by Holmstrom [1948], although discussed in terms of punch-card technology!

1.2.1 The Beginnings of Text Ranking

Although the need for machines to improve information access was identified as early as the mid-1940s, interestingly, the conception of text ranking was still a decade away. Libraries, of course, have existed for millennia, and the earliest formulations of search were dominated by the approach that human librarians had been using for centuries: matching based on human-extracted descriptors of content stored on physical punch-card representations of the texts to be searched (books, scientific articles, etc.).

18 Scholars have been complaining about there being more information than can be consumed since shortly after the invention of the printing press. "Is there anywhere on earth exempt from these swarms of new books? Even if, taken out one at a time, they offered something worth knowing, the very mass of them would be an impediment to learning from satiety if nothing else", the philosopher Erasmus complained in the 16th century.
19 Bush talks about naming "trails", which are associations between content items. Today, we might call these subject–verb–object triples. Viewed from this perspective, the memex is essentially a graph store! Furthermore, he envisioned sharing these annotations, such that individuals can build on each others' insights. Quite remarkably, the article mentions text-to-speech technology and speech recognition, and even speculates on brain–computer interfaces!

These descriptors (also known as "index terms") were usually assigned by human subject matter experts (or at least trained human indexers) and typically drawn from thesauri, "subject headings", or "controlled vocabularies"—that is, a predefined vocabulary. This process was known as "indexing"—the original sense of the activity involved humans, and is quite foreign to modern notions that imply automated processing—or is sometimes referred to as "abstracting".20 Issuing queries to search content required librarians (or at least trained individuals) to translate the searcher's information need into these same descriptors; search occurred by matching these descriptors in a Boolean fashion (hence, no ranking).

As a (radical at the time) departure from this human-indexing approach, Luhn [1958] proposed considering "statistical information derived from word frequency and distribution . . . to compute a relative measure of significance", thus leading to "auto-abstracts". He described a precursor of what we would recognize today as tf–idf weighting (that is, term weights based on term frequency and inverse document frequency). However, Luhn neither implemented nor evaluated any of the techniques he proposed.

A clearer articulation of text ranking was presented by Maron and Kuhns [1960], who characterized the information retrieval problem (although they didn't use these words) as receiving requests from the user and "to provide as an output an ordered list of those documents which most probably satisfy the information needs of the user". They proposed that index terms ("tags") be weighted according to the probability that a user desiring information contained in a particular document would use that term in a query. Today, we might call this query likelihood [Ponte and Croft, 1998]. The paper also described the idea of a "relevance number" for each document, "which is a measure of the probability that the document will satisfy the given request". Today, we would call these retrieval scores. Beyond laying out these foundational concepts, Maron and Kuhns described experiments to test their ideas.

We might take for granted today the idea that automatically extracted terms from a document can serve as descriptors or index terms for describing the contents of those documents, but this was an important conceptual leap in the development of information retrieval. Throughout the 1960s and 1970s, researchers and practitioners debated the merits of "automatic content analysis" (see, for example, Salton [1968]) vs. "traditional" human-based indexing. Salton [1972] described a notable evaluation comparing the SMART retrieval system based on the vector space model with human-based indexing in the context of MEDLARS (Medical Literature Analysis and Retrieval System), which was a computerized version of the Index Medicus, a comprehensive print bibliographic index of medical articles that the U.S. National Library of Medicine (NLM) had been publishing since 1879. SMART was shown to produce higher-quality results, and Salton concluded "that no technical justification exists for maintaining controlled, manual indexing in operational retrieval environments". This thread of research has had significant impact, as MEDLARS evolved into MEDLINE (short for MEDLARS onLINE).
In the internet era, MEDLINE became publicly accessible via the PubMed search engine, which today remains the authoritative bibliographic database for the life sciences literature.

The mode of information access we take for granted today—based on ranking automatically constructed representations of documents and queries—gradually gained acceptance, although the history of information retrieval showed this to be an uphill battle. Writing about the early history of information retrieval, Harman [2019] goes as far as to call these the "indexing wars": the battle between human-derived and automatically-generated index terms. This is somewhat reminiscent of the rule-based vs. statistical NLP "wars" that raged beginning in the late 1980s and into the 1990s, and goes to show how foundational shifts in thinking are often initially met with resistance. Thomas Kuhn would surely find both of these cases to be great examples supporting his views on the structure of scientific revolutions [Kuhn, 1962].

Bringing all the major ideas together, Salton et al. [1975] is frequently cited for the proposal of the vector space model, in which documents and queries are both represented as "bags of words" using sparse vectors according to some term weighting scheme (tf–idf in this case), where document–query similarity is computed in terms of cosine similarity (or, more generally, inner products). However, this development did not happen all at once, but represented innovations that gradually accumulated over

20 Thus, an indexer is a human who performs indexing, not unlike the earliest uses of the term "computer" to refer to humans who performed computations by hand.

the two preceding decades. For additional details about early historical developments in information retrieval, we refer the reader to Harman [2019].
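As a minimal illustration of the vector space model just described, the following sketch represents documents and a query as sparse bag-of-words vectors with a simple tf–idf weighting and ranks documents by cosine similarity. The toy corpus, the particular tf–idf variant, and the smoothing are our own choices for the example; real systems use more refined weighting schemes.

```python
import math
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a tragic love story of star crossed lovers".split(),
]
query = "tragic love story".split()

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(terms):
    """Sparse tf-idf vector: term -> tf * idf (smoothed to avoid division by zero)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log((N + 1) / (df[t] + 1)) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query)
ranked = sorted(range(N), key=lambda i: cosine(q_vec, tfidf_vector(docs[i])), reverse=True)
```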

1.2.2 The Challenges of Exact Match

For the purposes of establishing a clear contrast with neural network models, the most salient feature of all approaches up to this point in history is their exclusive reliance on what we would today call exact term matching—that is, terms from documents and terms from queries had to match exactly to contribute to a relevance score. Since systems typically perform stemming—that is, the elimination of suffixes (in English)—matching occurs after terms have been normalized to some extent (for example, stemming would ensure that "dog" matches "dogs"). Nevertheless, with techniques based on exact term matching, a scoring function between a query q and a document d could be written as:

S(q, d) = \sum_{t \in q \cap d} f(t)    (1)

where f is some function of a term and its associated statistics, the three most important of which are term frequency (how many times a term occurs in a document), document frequency (the number of documents that contain at least one instance of the term), and document length (the length of the document that the term occurs in). It is from the first two statistics that we derive the ubiquitous scoring function tf–idf, which stands for term frequency, inverse document frequency. In the vector space model, cosine similarity has a length normalization component that implicitly handles issues related to document length.

A major thread of research in the 1980s and into the 1990s was the exploration of different term weighting schemes in the vector space model [Salton and Buckley, 1988a], based on easily computed term-based statistics such as those described above. One of the most successful of these methods, Okapi BM25 [Robertson et al., 1994, Crestani et al., 1999, Robertson and Zaragoza, 2009], still provides the starting point of many text ranking approaches today, both in academic research as well as in commercial systems.21

Given the importance of BM25, the exact scoring function is worth repeating to illustrate what a ranking model based on exact term matching looks like. The relevance score of a document d with respect to a query q is defined as:

BM25(q, d) = \sum_{t \in q \cap d} \log \frac{N - df(t) + 0.5}{df(t) + 0.5} \cdot \frac{tf(t, d) \cdot (k_1 + 1)}{tf(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{l_d}{L}\right)}    (2)

As BM25 is based on exact term matching, the score is derived from a sum of contributions from each query term that appears in the document. In more detail:

• The first component of the summation (the log term) is the idf (inverse document frequency) component: N is the total number of documents in the corpus, and df(t) is the number of documents that contain term t (i.e., its document frequency).
• In the second component of the summation, tf(t, d) represents the number of times term t appears in document d (i.e., its term frequency). The expression in the denominator involving b is responsible for performing length normalization, since collections usually have documents that differ in length: ld is the length of document d while L is the average document length across all documents in the collection.

Finally, k1 and b are free parameters. Note that the original formulation by Robertson et al. [1994] includes additional scoring components with parameters k2 and k3, but they are rarely used and are often omitted from modern implementations. In addition to the original scoring function described above, there are several variants that have been discussed in the literature, including the one implemented in the popular open-source Lucene search library; see Section 2.8 for more details.
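To make Equation (2) concrete, here is a minimal sketch of the scoring function in Python. The whitespace tokenization, the toy corpus in the usage example, and the default parameter values are our own illustrative choices (k1 and b must be tuned per collection); production implementations, such as the Lucene variant mentioned above, differ in details.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against a query using the BM25 function of Eq. (2)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in set(query_terms):
        if tf[t] == 0:
            continue  # exact term matching: only query terms present in the document contribute
        df = doc_freqs.get(t, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))  # the log term in Eq. (2)
        score += idf * (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

# Toy usage: rank three "documents" against a query.
docs = [d.split() for d in ["a tragic love story", "star crossed lovers in verona", "a quick brown fox"]]
doc_freqs = Counter(t for d in docs for t in set(d))
avg_doc_len = sum(len(d) for d in docs) / len(docs)
query = "tragic love story".split()
ranked = sorted(range(len(docs)),
                key=lambda i: bm25_score(query, docs[i], doc_freqs, len(docs), avg_doc_len),
                reverse=True)
```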

21 Strictly speaking, BM25 derives from the probabilistic retrieval framework, but its ultimate realization is a weighting scheme based on a probabilistic interpretation of how terms contribute to document relevance. Retrieval is formulated in terms of inner products on sparse bag-of-words vectors, which is operationally identical to the vector space model; see, for example, Crestani et al. [1999].

While term weighting schemes can model term importance (sometimes called "salience") based on statistical properties of the texts, exact match techniques are fundamentally powerless in cases where terms in queries and documents don't match at all. This happens quite frequently when searchers use different terms to describe their information needs than what authors of the relevant documents used. One way of thinking about search is that an information seeker is trying to guess the terms (i.e., the terms posed as the query) that authors of relevant texts would have used when they wrote the text (see additional discussion in Section 2.2). We're looking for a "tragic love story" but Shakespeare wrote about "star-crossed lovers". To provide a less poetic, but more practical example, what we call "information filtering" today was known as "selective dissemination of information (SDI)" in the 1960s (see Section 1.1). Imagine the difficulty we would face trying to conduct a thorough literature review without knowing the relationship between these key terms. Yet another example, also from Section 1.1: early implementations of distant supervision did not use the term "distant supervision". In both these cases, it would be easy to (falsely) conclude that no prior work exists beyond recent papers that use contemporary terminology!

These are just two examples of the "vocabulary mismatch problem" [Furnas et al., 1987], which represents a fundamental challenge in information retrieval. There are three general approaches to tackling this challenge: enriching query representations to better match document representations, enriching document representations to better match query representations, and attempting to go beyond exact term matching:

• Enriching query representations. One obvious approach to bridging the gap between query and document terms is to enrich query representations with query expansion techniques [Carpineto and Romano, 2012]. In relevance feedback, the representation of the user's query is augmented with terms derived from documents that are known to be relevant (for example, documents that have been presented to the user and that the user has indicated are relevant): two popular formulations are based on the vector space model [Rocchio, 1971] and the probabilistic retrieval framework [Robertson and Sparck Jones, 1976]. In pseudo-relevance feedback [Croft and Harper, 1979], also called "blind" relevance feedback, top-ranking documents are simply assumed to be relevant, thus providing a source for additional query terms (a minimal sketch of this idea follows the list below). Query expansion techniques, however, do not need to involve relevance feedback: examples include Xu and Croft [2000], who introduced global techniques that identify word relations from the entire collection as possible expansion terms (this occurs in a corpus preprocessing step, independent of any queries), and Voorhees [1994], who experimented with query expansion using lexical-semantic relations from WordNet [Miller, 1995]. A useful distinction when discussing query expansion techniques is the dichotomy between pre-retrieval techniques, where expansion terms can be computed without examining any documents from the collection, and post-retrieval techniques, which are based on analyses of documents from an initial retrieval. Section 4 discusses query expansion techniques in the context of transformers.
• Enriching document representations. Another obvious approach to bridging the gap between query and document terms is to enrich document representations. This strategy works well for noisy transcriptions of speech [Singhal and Pereira, 1999] and short texts such as tweets [Efron et al., 2012]. Although not as popular as query expansion techniques, researchers nevertheless explored this approach throughout the 1980s and 1990s [Salton and Buckley, 1988b, Voorhees and Hou, 1993]. The origins of document expansion trace even earlier to Kwok [1975], who took advantage of bibliographic metadata for expansion, and finally, Brauen et al. [1968], who used previously issued user queries to modify the vector representation of a relevant document. Historically, document expansion techniques have not been as popular as query expansion techniques, but we have recently witnessed a resurgence of interest in document expansion in the context of transformers, which we cover in Section 4.
• Beyond exact term matching. Researchers have investigated models that attempt to address the vocabulary mismatch problem without explicitly enriching query or document representations. A notable attempt is the statistical translation approach of Berger and Lafferty [1999], who modeled retrieval as the translation of a document into a query in a noisy channel model. Their approach learns translation probabilities between query and document terms, but these nevertheless represent mappings between terms in the vocabulary space of the documents. Other examples of attempts to go beyond exact match include techniques that attempt to perform matching in some semantic space induced from data, for example, based on latent semantic analysis [Deerwester et al., 1990] or latent Dirichlet allocation [Wei and Croft, 2006]. However,

neither approach has gained widespread adoption as serious competition to keyword-based querying. Nevertheless, there are clear connections between this thread of work and learned dense representations for ranking, which we detail in Section 5.
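To illustrate the pseudo-relevance feedback idea in the first bullet above, here is a minimal Rocchio-style sketch: the query representation is expanded with terms drawn from the top-k documents of an initial retrieval, which are simply assumed to be relevant. This is our own simplification for illustration (the function name, the weighting, and the parameter values are assumptions), not the exact formulation of Rocchio [1971].

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=10, num_expansion_terms=10,
                              alpha=1.0, beta=0.75):
    """Expand a query with terms that are frequent in the top-k documents of an
    initial retrieval ("blind" relevance feedback), Rocchio-style."""
    # Start from the original query terms, each with weight alpha.
    query_vec = Counter({t: alpha for t in query_terms})
    feedback_docs = ranked_docs[:k]
    if not feedback_docs:
        return list(query_terms)
    # Centroid of the assumed-relevant documents (average term frequency).
    centroid = Counter()
    for doc_terms in feedback_docs:
        centroid.update(doc_terms)
    for t, freq in centroid.items():
        query_vec[t] += beta * freq / len(feedback_docs)
    # Keep the original query plus the highest-weighted expansion terms.
    expansion = [t for t, _ in query_vec.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:num_expansion_terms]
```

The expanded query is then issued to the same exact-match ranking function (e.g., BM25) in a second retrieval pass.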

At a high level, retrieval models up until this point in history stand in contrast with the "soft" or semantic matching enabled by continuous representations in neural networks, where query terms do not have to match document terms exactly in order to contribute to relevance. Semantic matching refers to techniques that attempt to address a variety of linguistic phenomena, including synonymy, paraphrase, term variation, and different expressions of similar intents, specifically in the context of information access [Li and Xu, 2014]. Following this usage, "relevance matching" is often used to describe the correspondences between queries and texts that account for a text being relevant to a query (see Section 2.2). Thus, relevance matching is generally understood to comprise both exact match and semantic match components. However, there is another major phase in the development of ranking techniques before we get to semantic matching and how neural networks accomplish it.

1.2.3 The Rise of Learning to Rank

BM25 and other term weighting schemes are typically characterized as unsupervised, although they contain free parameters (e.g., k1 and b) that can be tuned given training data. The next major development in text ranking, beginning in the late 1980s, is the application of supervised machine-learning techniques to learn ranking models: early examples include Fuhr [1989], Wong et al. [1993], and Gey [1994]. This approach, known as “learning to rank”, makes extensive use of hand-crafted, manually-engineered features, based primarily on statistical properties of terms contained in the texts as well as intrinsic properties of the texts:

• Statistical properties of terms include functions of term frequencies, document frequencies, document lengths, etc., the same components that appear in a scoring function such as BM25. In fact, BM25 scores between the query and various document fields (as well as scores based on other exact match scoring functions) are typically included as features in a learning-to-rank setup. Often, features incorporate proximity constraints, such as the frequency of a term pair co-occurring within five positions. Proximity constraints can be localized to a specific field in the text, for example, the co-occurrence of terms in the title of a web page or in anchor texts.

• Intrinsic properties of texts, ranging from very simple statistics, such as the amount of JavaScript code on a web page or the ratio between HTML tags and content, to more sophisticated measures, such as the editorial quality or spam score as determined by a classifier. In the web context, features of the hyperlink graph, such as the count of inbound and outgoing links and PageRank scores, are common as well. A sketch of a few such hand-crafted features appears after this list.
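As an illustration of what such hand-crafted features can look like, below is a minimal sketch assuming tokenized fields and precomputed corpus statistics. The particular BM25 variant, the five-term proximity window, and the pagerank field are illustrative choices on our part, not features prescribed by any specific system.

```python
import math
from collections import Counter

def bm25(query_terms, doc_tokens, df, num_docs, avg_len, k1=0.9, b=0.4):
    """One common BM25 variant; df maps term -> document frequency in the corpus."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        idf = math.log(1 + (num_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len))
        score += idf * norm
    return score

def pairs_within_window(query_terms, doc_tokens, window=5):
    """Count co-occurrences of query-term pairs within `window` positions of each other."""
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t] for t in set(query_terms)}
    terms = sorted(positions)
    count = 0
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            count += sum(1 for p in positions[terms[i]] for q in positions[terms[j]]
                         if abs(p - q) <= window)
    return count

def extract_features(query_terms, doc, df, num_docs, avg_len):
    """doc: dict with tokenized 'title' and 'body' fields plus any precomputed
    intrinsic signals (e.g., 'pagerank') supplied by upstream processing."""
    return [
        bm25(query_terms, doc["body"], df, num_docs, avg_len),   # exact-match score, body
        bm25(query_terms, doc["title"], df, num_docs, avg_len),  # exact-match score, title
        len(doc["body"]),                                        # document length
        pairs_within_window(query_terms, doc["body"]),           # term-proximity feature
        doc.get("pagerank", 0.0),                                # intrinsic/link-graph feature
    ]
```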

A real-world search engine can have hundreds of features (or even more).22 For systems with a sufficiently large user base, features based on user behavior—for example, how many times users issued a particular query or clicked on a particular link (in different contexts)—are very valuable relevance signals and are thoroughly integrated into learning-to-rank methods. The rise of learning to rank was driven largely by the growth in importance of search engines as indispensable tools for navigating the web, as earlier approaches based on human-curated directories (e.g., Yahoo!) quickly became untenable with the explosion of available content. Log data capturing behavioral traces of users (e.g., queries and clicks) could be used to improve machine-learned ranking models. A better search experience led to user growth, which yielded even more log data and behavior-based features to further improve ranking quality—thus closing a self-reinforcing virtuous cycle (what Jeff Bezos calls “the flywheel”). Noteworthy innovations that played an important role in enabling this growth included the development and refinement of techniques for interpreting noisy user clicks and converting them into training examples that could be fed into machine-learning algorithms [Joachims, 2002, Radlinski and Joachims, 2005].

As we lack the space for a detailed treatment of learning to rank, we refer interested readers to two surveys [Liu, 2009, Li, 2011] and focus here on highlights that are most directly relevant for text ranking with transformers. At a high level, learning-to-rank methods can be divided into three basic types, based on the general form of their loss functions:

22https://googleblog.blogspot.com/2008/03/why-data-matters.html

• A pointwise approach only considers losses on individual documents, transforming the ranking problem into classification or regression.

• A pairwise approach considers losses on pairs of documents, and thus focuses on preferences, that is, the property wherein A is more relevant than (or preferred over) B.

• A listwise approach considers losses on entire lists of documents, for example, directly optimizing a ranking metric such as normalized discounted cumulative gain (see Section 2.5 for a discussion of metrics). A sketch of one common instantiation of each loss type appears after this list.
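To make the three loss families concrete, here is a minimal sketch of one common instantiation of each: binary cross-entropy (pointwise), a RankNet-style pairwise loss, and a ListNet-style listwise loss. The choice of PyTorch and of these particular formulations is ours, for illustration only.

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores, labels):
    """Pointwise: binary relevance classification on individual documents."""
    return F.binary_cross_entropy_with_logits(scores, labels.float())

def pairwise_loss(pos_scores, neg_scores):
    """Pairwise (RankNet-style): the relevant document should outscore the non-relevant one."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def listwise_loss(scores, labels):
    """Listwise (ListNet-style): match the score distribution over a list to the label distribution."""
    return -(F.softmax(labels, dim=-1) * F.log_softmax(scores, dim=-1)).sum(dim=-1).mean()
```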

Since this basic classification focuses on the form of the loss function, it can also be used to describe ranking techniques with transformers. Learning to rank reached its zenith in the early 2010s, on the eve of the deep learning revolution, with the development of models based on tree ensembles [Burges, 2010].23 At that time, there was an emerging consensus that tree-based models, and gradient-boosted decision trees [Ganjisaffar et al., 2011] in particular, represented the most effective solution to learning to rank. By that time, tree ensembles had been deployed to solve a wide range of problems; one notable success story is their important role in winning the Netflix Prize, a high-profile competition that aimed to improve the quality of movie recommendations.24

Note that “learning to rank” should not be understood as being synonymous with “supervised machine-learning approaches to ranking”. Rather, learning to rank refers to techniques that emerged during a specific period in the history of information retrieval. Transformers for text ranking can be characterized as a supervised machine-learning approach, but would not generally be regarded as a learning-to-rank method. In particular, there is one key characteristic that distinguishes learning to rank from the deep learning approaches that came after. What’s important is not the specific supervised machine-learning model: in fact, neural networks have been used since the early 1990s [Wong et al., 1993], and RankNet [Burges et al., 2005], one of the most influential and well-known learning-to-rank models, adopted a basic feedforward neural architecture. Instead, learning to rank is characterized by its use of numerous sparse, usually hand-crafted features. However, to muddy the waters a bit, the phrase “deep learning to rank” has recently emerged in the discourse to describe deep learning approaches that also incorporate sparse features [Pasumarthi et al., 2019].

1.2.4 The Advent of Deep Learning

For text ranking, after learning to rank came deep learning, following initial excitement in the computer vision and then the natural language processing communities. In the context of information retrieval, deep learning approaches were exciting for two reasons: First, continuous vector representations freed text retrieval from the bounds of exact term matching (as already mentioned above; we’ll see exactly how below). Second, neural networks promised to obviate the need for laboriously hand-crafted features (addressing a major difficulty with building systems using learning to rank).

In the space of deep learning approaches to text ranking, it makes sense to further distinguish “pre-BERT” models from BERT-based models (and more generally, transformer models). After all, the “BERT revolution” is the motivation for this survey to begin with. In the Deep Learning Track at TREC 2019,25 the first large-scale evaluation of retrieval techniques following the introduction of BERT, its impact, and more generally, the impact of pretrained neural language models, was clear from the effectiveness of the submissions [Craswell et al., 2020]. Analysis of the results showed that, taken as a family of techniques, BERT-based models achieved substantially higher effectiveness than pre-BERT models, across implementations by different teams. The organizers of the evaluation recognized this as a meaningful distinction that separated two different “eras” in the development of deep neural approaches to text ranking.

This section provides a high-level overview of pre-BERT models. Needless to say, we do not have sufficient space to thoroughly detail roughly half a dozen years of model progression, and therefore refer the reader to existing surveys devoted to the topic [Onal et al., 2018, Mitra and Craswell, 2019a, Xu et al., 2020].

23Although a specific thread of work in the learning-to-rank tradition, called “counterfactual learning to rank” [Agarwal et al., 2019], remains active today.
24https://www.netflixprize.com/
25See Section 2.6 for an overview of what TREC is.

Figure 1: Two classes of pre-BERT neural ranking models: (a) a generic representation-based neural ranking model and (b) a generic interaction-based neural ranking model. Representation-based models (left) learn vector representations of queries and documents that are compared using simple metrics such as cosine similarity to compute relevance scores. Interaction-based models (right) explicitly model term interactions in a similarity matrix that is further processed to compute relevance scores.

Note that here we focus specifically on models designed for document ranking and leave aside another vast body of literature, mostly from the NLP community, on the closely related problem of computing the semantic similarity between two sentences (for example, to detect if two sentences are paraphrases of each other). Models for these tasks share many architectural similarities, and indeed there has been cross-fertilization between the NLP and IR communities in this regard. However, there is one major difference: inputs to a model for computing semantic similarity are symmetric, i.e., Rel(s1, s2) = Rel(s2, s1), whereas queries and documents are obviously different and cannot be swapped as model inputs. The practical effect is that architectures for computing semantic similarity are usually symmetric, but may not be for modeling query–document relevance. Interestingly, recent developments in learned dense representations for ranking are erasing the distinction between these two threads of work, as we will see in Section 5.

Pre-BERT neural ranking models are generally classified into two classes: representation-based models and interaction-based models. Their high-level architectures are illustrated in Figure 1. Representation-based models (left) focus on independently learning dense vector representations of queries and documents that can be compared to compute relevance via a simple metric such as cosine similarity or inner products. Interaction-based models (right) compare the representations of terms in the query with terms in a document to produce a similarity matrix that captures term interactions. This matrix then undergoes further analysis to arrive at a relevance score. In both cases, models can incorporate many different neural components (e.g., convolutional neural networks and recurrent neural networks) to extract relevance signals. Both representation-based and interaction-based models are usually trained end-to-end with relevance judgments (see Section 2.4), using only the embeddings of query and document terms as input. Notably, additional features (hand-crafted or otherwise) are typically not used, which is a major departure from learning to rank. Below, we provide more details, with illustrative examples:

Representation-based models. This class of models (Figure 1, left) learns vector representations of queries and documents that can be compared at ranking time to compute query–document relevance scores. Since the query and document “arms” of the network are independent, this approach allows document representations to be computed offline. One of the earliest neural ranking models in the deep learning era, the Deep Structured Semantic Model (DSSM) [Huang et al., 2013] constructs character n-grams from an input (i.e., query or document) and passes the results to a series of fully-connected layers to produce a vector representation. At retrieval time, query and document representations can then be compared with cosine similarity. Shen et al. [2014] improved upon DSSM by using CNNs to capture context. Rather than learning text representations as part of the model, the Dual Embedding Space Model (DESM) [Mitra et al., 2016, Nalisnick et al., 2016] represents texts using pre-trained word2vec embeddings [Le and Mikolov, 2014] and computes relevance scores by aggregating cosine similarities across all query–document term pairs. Language models based on word embeddings [Ganguly et al., 2015] can also be categorized as representation-based models. Interestingly, we are witnessing a resurgence of interest in representation-based approaches, albeit using transformer architectures. The entirety of Section 5 is devoted to this topic.
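The following minimal sketch illustrates the representation-based pattern with an intentionally simple encoder (averaged word embeddings standing in for DSSM-style networks). What matters is the structure: documents are encoded once offline, and ranking reduces to cosine comparisons against the query vector.

```python
import numpy as np

def encode(text, embeddings, dim=300):
    """Average word embeddings as a stand-in encoder; `embeddings` maps term -> vector."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-9
    return float(np.dot(u, v) / denom)

def rank(query, doc_vectors, embeddings):
    """Score precomputed document vectors against the query and sort descending."""
    q = encode(query, embeddings)
    scored = [(doc_id, cosine(q, d)) for doc_id, d in doc_vectors.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```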

Interaction-based models. This class of models (Figure 1, right) explicitly captures “interactions” between terms from the query and terms from the document. These interactions are typically operationalized using a similarity matrix with rows corresponding to query terms and columns corresponding to document terms. Each entry mi,j in the matrix is usually populated with the cosine similarity between the embedding of the i-th query term and the embedding of the j-th document term.26 At a high level, these models operate in two steps: feature extraction and relevance scoring.

• In the feature extraction step, the model extracts relevance signals from the similarity matrix. By exploiting continuous vector representations of terms, these models can potentially overcome the vocabulary mismatch problem. Unigram models like DRMM [Guo et al., 2016] and KNRM [Xiong et al., 2017] aggregate the similarities between each query term and each document term, which can be viewed as histograms. DRMM creates explicit histograms, while KNRM uses Gaussian kernels to create differentiable “soft histograms” that allow the embeddings to be learned during training. Position-aware models like MatchPyramid [Pang et al., 2016], PACRR [Hui et al., 2017], Co-PACRR [Hui et al., 2018], and ConvKNRM [Dai et al., 2018] use additional architectural components to identify matches between sequences of query and document terms.27

• In the relevance scoring step, features extracted from above are combined and processed to produce a query–document relevance score. This step often consists of applying pooling operations, concatenating extracted features together, and then passing the resulting representation to a feedforward network that computes the relevance score. A sketch of both steps appears after this list.
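The sketch below illustrates both steps with static term embeddings: a cosine similarity matrix as the interaction structure, Gaussian kernel pooling in the spirit of KNRM’s soft histograms for feature extraction, and a linear layer standing in for the learned relevance-scoring component. It is a simplified illustration, not a faithful reimplementation of any of the models cited above.

```python
import numpy as np

def similarity_matrix(query_vecs, doc_vecs):
    """Cosine similarity between every query term and every document term."""
    q = query_vecs / (np.linalg.norm(query_vecs, axis=1, keepdims=True) + 1e-9)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-9)
    return q @ d.T                                   # shape: (num_query_terms, num_doc_terms)

def kernel_pool(sim, mus=np.linspace(-0.9, 1.0, 11), sigma=0.1):
    """Soft-count similarities around each kernel mean, then log-sum over query terms."""
    kernels = np.exp(-((sim[..., None] - mus) ** 2) / (2 * sigma ** 2))
    per_query_term = kernels.sum(axis=1)             # soft histogram per query term
    return np.log(per_query_term + 1e-10).sum(axis=0)  # one pooled feature per kernel

def relevance_score(sim, weights, bias=0.0):
    """Relevance scoring step: a linear layer over the pooled kernel features."""
    return float(kernel_pool(sim) @ weights + bias)
```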

While interaction-based models generally follow this high-level approach, many variants have been proposed that incorporate additional components. For example, POSIT-DRMM [McDonald et al., 2018] uses an LSTM to contextualize static embeddings before comparing them. EDRM [Liu et al., 2018b] extends ConvKNRM by incorporating entity embeddings. HiNT [Fan et al., 2018b] splits the document into passages, creates a similarity matrix for each, and then combines passage-level signals to predict a single document-level relevance score. The NPRF [Li et al., 2018] framework incorporates feedback documents by using a neural ranking method like KNRM to predict their similarity to a target document being ranked.

In general, studies have shown pre-BERT interaction-based models to be more effective but slower than pre-BERT representation-based models. The latter reduce text ranking to simple similarity comparisons between query vectors and precomputed document vectors, which can be performed quickly on large corpora using nearest neighbor search techniques (see Section 5.2). In contrast, interaction-based models are typically deployed as rerankers over a candidate set of results retrieved by keyword search. Interaction-based models also preserve the ability to explicitly capture exact match signals, which remain important in relevance matching (see discussion in Section 3.2.3).

Hybrid models. Finally, representation-based and interaction-based approaches are not mutually exclusive. A well-known hybrid is the DUET model [Mitra et al., 2017, Mitra and Craswell, 2019b], which augments a representation-learning component with an interaction-based component responsible for identifying exact term matches.

26Although other distance metrics can be used as well, for example, see He and Lin [2016], Pang et al. [2016].
27One might argue that, with this class of models, we have simply replaced feature engineering (from learning to rank) with network engineering, since in some cases there are pretty clear analogies between features in learning to rank and the relevance signals that different neural architectural components are designed to identify. While this is not an unfair criticism, it can be argued that different network components more compactly capture the intuitions of what makes a document relevant to a query. For example, bigram relations can be compactly expressed as convolutions, whereas in learning to rank distinct bigram features would need to be enumerated explicitly.

Method                                                     MS MARCO Passage Development MRR@10    Test MRR@10
BM25 (Microsoft Baseline)                                  0.167                                  0.165
IRNet (Deep CNN/IR Hybrid Network), January 2nd, 2019      0.278                                  0.281
BERT [Nogueira and Cho, 2019], January 7th, 2019           0.365                                  0.359

Table 1: The state of the leaderboard for the MS MARCO passage ranking task in January 2019, showing the introduction of BERT and the best model (IRNet) just prior to it. This large gain in effectiveness kicked off the “BERT revolution” in text ranking.

There has undeniably been significant research activity throughout the 2010s exploring a wide range of neural architectures for document ranking, but how far has the field concretely advanced, particularly since approaches based on deep learning require large amounts of training data? Lin [2018] posed the provocative question of whether neural ranking models were actually better than “traditional” keyword-matching techniques in the absence of vast quantities of training data available from behavior logs (i.e., queries and clickthroughs). This is an important question because academic researchers have faced a perennial challenge in obtaining access to such data, which are available only to researchers in industry (with rare exceptions). To what extent do neural ranking models “work” on the limited amounts of training data that are publicly available?

Yang et al. [2019b] answered this question by comparing several prominent interaction-based and representation-based neural ranking models to a well-engineered implementation of bag-of-words search with well-tuned query expansion on the dataset from the TREC 2004 Robust Track [Voorhees, 2004]. Under this limited data condition, most of the neural ranking methods were unable to beat the keyword search baseline. Yates et al. [2020] replicated the same finding for an expanded set of neural ranking methods with completely different implementations, thus increasing confidence in the original findings. While many of the papers cited above report significant improvements when trained on large, proprietary datasets (many of which include behavioral signals), the results are difficult to validate and the benefits of the proposed methods are not broadly accessible to the community. With BERT, though, everything changed, nearly overnight.

1.2.5 The Arrival of BERT

BERT [Devlin et al., 2019] arrived on the scene in October 2018. The first application of BERT to text ranking was reported by Nogueira and Cho [2019] in January 2019 on the MS MARCO passage ranking test collection [Bajaj et al., 2018], where the task is to rank passages (paragraph-length extracts) from web pages with respect to users’ natural language queries, taken from Bing query logs (see more details in Section 2.7). The relevant portion of the leaderboard at the time is presented in Table 1, showing Microsoft’s BM25 baseline and the effectiveness of IRNet, the best system right before the introduction of BERT (see Section 2.5 for the exact definition of the metric). Within less than a week, effectiveness shot up by around eight points28 absolute, which corresponds to a ∼30% relative gain. Such a big jump in effectiveness that can be directly attributed to an individual model is rarely seen in either academia or industry, which led to immediate excitement in the community.

The simplicity of the model led to rapid widespread replication of the results. Within a few weeks, at least two other teams had confirmed the effectiveness of BERT for passage ranking, and exploration of model variants built on the original insights of Nogueira and Cho [2019] had already begun.29 The skepticism expressed by Lin [2018] was retracted in short order [Lin, 2019], as many researchers quickly demonstrated that with pretrained transformer models, large amounts of relevance judgments were not necessary to build effective models for text ranking. The availability of the MS MARCO passage ranking test collection further mitigated data availability issues. The combination of these factors meant that, nearly overnight, exploration at the forefront of neural models for text ranking was within reach of academic research groups, and was no longer limited to researchers in industry who had the luxury of access to query logs.

28A change of 0.01 is often referred to as a “point”; see Section 2.5.
29https://twitter.com/MSMarcoAI/status/1095035433375821824

Nogueira and Cho [2019] kicked off the “BERT revolution” for text ranking, and the research community quickly set forth to build on their results—addressing limitations and expanding the work in various ways. Looking at the leaderboard today, the dominance of BERT remains evident, just by looking at the names of the submissions. The rest, as they say, is history. The remainder of this survey is about that history.

1.3 Roadmap, Assumptions, and Omissions

The target audience for this survey is a first-year graduate student or perhaps an advanced undergraduate. As this is not intended to be a general introduction to natural language processing or information retrieval, we assume that the reader has basic background in both. For example, we discuss sequence-to-sequence formulations of text processing problems (to take an example from NLP) and query evaluation with inverted indexes (to take an example from IR) assuming that the reader has already encountered these concepts before. Furthermore, we expect that the reader is already familiar with neural networks and deep learning, particularly pre-BERT models (for example, CNNs and RNNs). Although we do provide an overview of BERT and transformer architectures, that material is not designed to be tutorial in nature, but merely intended to provide the setup of how to apply transformers to text ranking problems. This survey is organized as follows:

• Setting the Stage (Section 2). We begin with a more precise characterization of the problem we are tackling in the specific context of information retrieval. This requires an overview of modern evaluation methodology, involving discussions about information needs, notions of relevance, ranking metrics, and the construction of test collections.

• Multi-Stage Architectures for Reranking (Section 3). The most straightforward application of transformers to text ranking is as reranking models to improve the output quality of candidates generated by keyword search. This section details various ways this basic idea can be realized in the context of multi-stage ranking architectures.

• Refining Query and Document Representations (Section 4). One fundamental challenge in ranking is overcoming the vocabulary mismatch problem, where users’ queries and documents use different words to describe the same concepts. This section describes expansion techniques for query and document representations that bring them into closer “alignment”.

• Learned Dense Representations for Ranking (Section 5). Text ranking can be cast as a representation learning problem in terms of efficient comparisons between dense vectors that capture the “meaning” of documents and queries. This section covers different architectures as well as training methods for accomplishing this.

• Future Directions and Conclusions (Section 6). We have only begun to scratch the surface in applications of transformers to text ranking. This survey concludes with discussions of interesting open problems and our attempts to prognosticate where the field is heading.

Given limits in both time and space, it is impossible to achieve comprehensive coverage, even in a narrowly circumscribed topic, both due to the speed at which research is progressing and the wealth of connections to related topics. This survey focuses on what might be characterized as “core” text ranking. Noteworthy intentional omissions include other aspects of information access such as question answering, summarization, and recommendation, despite their close relationship to the material we cover. Adequate treatments of each of these topics would occupy an equally lengthy survey! Our focus on “core” text ranking means that we do not elaborate on how ranked results might be used to directly supply answers (as in typical formulations of question answering), how multiple results might be synthesized (as in summarization), and how systems might suggest related texts based on more than just content (as in recommendations).

2 Setting the Stage

This section begins by more formally characterizing the text ranking problem, explicitly enumerating our assumptions about characteristics of the input and output, and more precisely circumscribing the scope of this survey. In this exposition, we will adopt the perspective of information access, focusing specifically on the problem of ranking texts with respect to their relevance to a particular query—what we have characterized as the “core” text ranking problem (and what information retrieval researchers would refer to as ad hoc retrieval). However, most of our definitions and discussions carry straightforwardly to other ranking tasks, such as the diverse applications discussed in Section 1.1. From the evaluation perspective, this survey focuses on what is commonly known as the Cranfield paradigm, an approach to systems-oriented evaluation of information retrieval (IR) systems based on a series of experiments by Cyril Cleverdon and his colleagues in the 1960s. For the interested reader, Harman [2011] provides an overview of the early history of IR evaluation. Also known as “batch evaluations”, the Cranfield paradigm has come to dominate the IR research landscape over the last half a century. Nevertheless, there are other evaluation paradigms worth noting: interactive evaluations place humans “in the loop” and are necessary to understand the important role of user behavior in information seeking [Kelly, 2009]. Online services with substantial numbers of users can engage in experimentation using an approach known as A/B testing [Kohavi et al., 2007]. Despite our focus on the Cranfield paradigm, primarily due to its accessibility to the intended audience of our survey, evaluations from multiple perspectives are necessary to accurately characterize the effectiveness of a particular technique.

2.1 Texts

The formulation of text ranking assumes the existence of a collection of texts or a corpus C = {di} comprised of mostly unstructured natural language text. We say “mostly unstructured” because texts are, of course, typically broken into paragraphs, with section headings and other discourse markers—these can be considered a form of “structure”. This stands in contrast to, for example, tabular data or semi-structured logs (e.g., in JSON), which are comprised of text as well. We specifically consider such types of textual data out of scope in this survey. Our collection C can be arbitrarily large (but finite)—in the case of the web, countless billions of pages. This means that issues related to computational efficiency, for example the latency and throughput of text ranking, are important considerations, especially in production systems. We mostly set aside issues related to multilinguality and focus on English, although there are straightforward extensions to some of the material discussed in this survey to other languages that serve as reasonable baselines and starting points for multilingual IR.30

It is further assumed that the corpus is provided “ahead of time” to the system, prior to the arrival of queries, and that a “reasonable” amount of offline processing may be conducted on the corpus. This constraint implies that the corpus is mostly static, in the sense that additions, deletions, or modifications to texts happen in batch or at a pace that is slow compared to the amount of preprocessing required by the system for proper operation.31 This assumption becomes important in the context of document expansion techniques we discuss in Section 4.

Texts can vary in length, ranging from sentences (e.g., searching for related questions in a community question answering application) to entire books, although the organization of the source texts, how they are processed, and the final granularity of ranking can be independent. To illustrate: in a collection of full-text scientific articles, we might choose to only search the article titles and abstracts. That is, the ranking model only considers selected portions of the articles; experiments along these lines date back to at least the 1960s [Salton and Lesk, 1968]. An alternative might be to segment full-text articles into paragraphs and consider each paragraph as the unit of retrieval, i.e., the system

30With respect to multilinguality, IR researchers have explored two distinct problem formulations: mono-lingual retrieval in languages other than English (where one major challenge is mitigating the paucity of training data), and cross-lingual retrieval, where queries are in a different language than the corpus (for example, searching Telugu documents with English queries). A worthy treatment of multilinguality in IR would occupy a separate survey, and thus we consider these issues mostly out of scope. See additional discussions in Section 6.2.
31For example, daily updates to the corpus would likely meet this characterization, but not streams of tweets that require real-time processing. See, for example, Busch et al. [2012] for an overview of techniques for real-time indexing and search.

returns a list of paragraphs as results. Yet another alternative might be to rank articles by aggregating evidence across paragraphs—that is, the system treats paragraphs as the atomic unit of analysis, but for the goal of producing a ranking of the articles those paragraphs are drawn from. Zhang et al. [2020a] provided a recent example of these different schemes in the context of the biomedical literature. Approaches to segmenting documents into passages for ranking purposes and integrating evidence from multiple document granularities—commonly referred to as passage retrieval—were an active area of research in the 1990s [Salton et al., 1993, Hearst and Plaunt, 1993, Callan, 1994, Wilkinson, 1994, Kaszkiel and Zobel, 1997, Clarke et al., 2000].

Note that for certain types of text, the “right level” of granularity may not be immediately obvious: For example, when searching email, should the system results be comprised of individual emails or email threads? What about when searching (potentially long) podcasts based on their textual transcripts? What about chat logs or transcriptions of phone calls? In this survey, we have little to say about the internal structure of texts other than applying the most generic treatments (e.g., segmenting by paragraphs or overlapping windows). Specific techniques are often domain-specific (e.g., reconstructing and segmenting email threads) and thus orthogonal to our focus.

However, the issue of text length is an important consideration in applications of transformer architectures to text ranking (see Section 3.3). There are two related issues: transformers are typically pretrained with input sequences up to a certain maximum length, making it difficult to meaningfully encode longer sequences, and feeding long texts into transformers results in excessive memory usage and inference latency. These limitations have necessitated the development of techniques to handle ranking long texts. In fact, many of these techniques draw from work in passage retrieval referenced above, dating back nearly three decades (see Section 3.3.2).
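As a concrete illustration of the generic treatments just mentioned, here is a minimal sketch of segmenting a long text into overlapping windows and aggregating passage-level evidence into a document score. The window and stride sizes, and the use of max pooling for aggregation, are illustrative choices rather than recommendations.

```python
def sliding_window_passages(doc_tokens, window=150, stride=75):
    """Return overlapping token windows; each window becomes a unit of analysis."""
    passages = []
    for start in range(0, max(len(doc_tokens), 1), stride):
        passages.append(doc_tokens[start:start + window])
        if start + window >= len(doc_tokens):
            break
    return passages

def doc_score_from_passages(passage_scores):
    """Aggregate passage-level evidence into a document score (max pooling here)."""
    return max(passage_scores) if passage_scores else float("-inf")
```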

2.2 Information Needs

Having sufficiently characterized the corpus, we now turn our attention to queries. In the web context, short keyword queries that a user types into a search box are merely the external manifestations of an information need, which is the motivation that compelled the user to seek information in the first place. Belkin [1980] calls this an “anomalous state of knowledge” (ASK), where searchers perceive gaps in their cognitive states with respect to some task or problem; see also Belkin et al. [1982a,b]. Strictly speaking, queries are not synonymous with information needs [Taylor, 1962]. The same information need might give rise to different manifestations with different systems: for example, a few keywords are typed into the search box of a web search engine, but a fluent, well-formed natural language question is spoken to a voice assistant.32 In this survey, we are not concerned with the cognitive processes underlying information seeking, and focus on the workings of text ranking models only after they have received a tangible query to process. Thus, we somewhat abuse the terminology and refer to the query as “the thing” that the ranking is computed with respect to (i.e., the input to the ranking model), and use it as a metonym for the underlying information need. In other words, although the query is not the same as the information need, we only care about what is fed to the ranking model (for the purposes of this survey), in which case this distinction is not particularly important.33 We only consider queries that are expressed in text, although in principle queries can be presented in different modalities, for example, speech34 or images, or even “query by humming” [Ghias et al., 1995].

Nevertheless, to enable automated processing, information needs must be encoded in some representation. In the Text Retrieval Conferences (TRECs), an influential series of community evaluations in information retrieval (see Section 2.6), information needs are operationalized as “topics”.35 Figure 2 provides an example from the TREC 2004 Robust Track. A TREC topic for ad hoc retrieval is comprised of three fields:

32In the latter case, researchers might refer to these as voice queries, but it is clear that spoken utterances are very different from typed queries, even if the underlying information needs are the same.
33Note, however, that this distinction may be important from the perspective of relevance judgments; see more discussion in Section 2.3.
34Spoken queries can be transcribed into text with the aid of automatic speech recognition (ASR) systems.
35Even within TREC, topic formats have evolved over time, but the structure we describe here has been stable since TREC-7 in 1998 [Voorhees and Harman, 1998].


<top>
<num> Number: 336
<title> Black Bear Attacks

<desc> Description:
A relevant document would discuss the frequency of vicious black bear attacks worldwide and the possible causes for this savage behavior.

<narr> Narrative:
It has been reported that food or cosmetics sometimes attract hungry black bears, causing them to viciously attack humans. Relevant documents would include the aforementioned causes as well as speculation preferably from the scientific community as to other possible causes of vicious attacks by black bears. A relevant document would also detail steps taken or new methods devised by wildlife officials to control and/or modify the savageness of the black bear.
</top>

Figure 2: An example ad hoc retrieval “topic” (i.e., representation of an information need) from the TREC 2004 Robust Track, comprised of “title”, “description”, and “narrative” fields.

• the “title”, which consists of a few keywords that describe the information need, close to a query that a user would type into a search engine;

• the “description”, typically a well-formed natural language sentence that describes the desired information; and,

• the “narrative”, a paragraph of prose that details the characteristics of the desired information, particularly nuances that are not articulated in the title or description.

In most information retrieval evaluations, the title serves as the query that is fed to the system to generate a ranked list of results (that are then evaluated). Some papers explicitly state “title queries” or something to that effect, but many papers omit this detail, in which case it is usually safe to assume that the topic titles were used as queries. Although in actuality the narrative is a more faithful description of the information need, i.e., what the user really wants, in most cases feeding the narrative into a ranking model leads to poor results because the narrative often contains terms that are not important to the topic. These extraneous terms serve as distractors to a ranking model based on exact term matches, since such a model will try to match all query terms.36 Although results vary by domain and the specific set of topics used for evaluation, one common finding is that either the title or the title and description concatenated together yields the best results with bag-of-words queries; see, for example, Walker et al. [1997]. However, the differences in effectiveness between the two conditions are usually small. Nevertheless, the key takeaway here is that the expression of the information need that is fed to a ranking model often has a substantive effect on retrieval effectiveness. We will see that this is particularly the case for BERT (see Section 3.3.2).

Having more precisely described the inputs, we can now formally define the text ranking problem: Given an information need expressed as a query q, the text ranking task is to return a ranked list of k texts {d1, d2, . . . , dk} from an arbitrarily large but finite collection of texts C = {di} that maximizes a metric of interest, for example, nDCG, AP, etc. Descriptions of a few common metrics are presented in Section 2.5, but at a high level they all aim to quantify the “goodness” of the results with respect to the information need. The ranking task is also called top-k retrieval (or ranking), where k is the length of the ranked list (also known as the ranking or retrieval depth).
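The definition above can be read directly as code. The following minimal sketch assumes some scoring function (a stand-in for any ranking model discussed in this survey) and returns the top k texts as (document, score) pairs in decreasing score order.

```python
import heapq

def rank_top_k(query, corpus, score_fn, k=1000):
    """Top-k text ranking: score every text in the corpus and keep the k best."""
    scored = ((score_fn(query, text), doc_id) for doc_id, text in corpus.items())
    top = heapq.nlargest(k, scored)                  # avoids sorting the whole corpus
    return [(doc_id, score) for score, doc_id in top]
```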
The “thing” that performs the ranking is referred to using different terms in the literature: {ranking, retrieval, scoring} × {function, model, method, technique ... }, or even just “the system” when discussed in an end-to-end context. In this survey, we tend to use the term “ranking model”, but consider all these variations roughly interchangeable. Typically, the ranked texts are associated with scores, and thus the output of a ranking model can be more explicitly characterized as {(d1, s1), (d2, s2), . . . , (dk, sk)} with the constraint that s1 > s2 > . . . > sk.37

A distinction worth introducing here: ranking usually refers to the task of constructing a ranked list of texts selected from the corpus C. As we will see in Section 3.2, it is impractical to apply transformer-based models to directly rank all texts in a (potentially large) corpus to produce the top k. Instead, models are often used to rerank a candidate list of documents, typically produced by keyword search. More formally, in reranking, the model takes as input a list of texts R = {d1, d2, . . . , dk} and produces another list of texts R′ = {d′1, d′2, . . . , d′k}, where R′ is a permutation of R. Ranking becomes conceptually equivalent to reranking if we feed a reranker the entire corpus, but in practice they involve very different techniques: Section 3 and Section 4 primarily focus on reranking with transformer-based models, while Section 5 covers nearest neighbor search techniques for directly ranking dense representations generated by transformer-based models. Nevertheless, in this survey we adopt the expository convention of referring to both as ranking unless the distinction is important. Similarly, we refer to ranking models even though a particular model may, in fact, be performing reranking. We believe this way of writing improves clarity by eliminating a distinction that is usually clear from context.

Finally, as information retrieval has a rich history dating back well over half a century, the parlance can be confusing and inconsistent, especially in cases where concepts overlap with neighboring sub-disciplines of computer science such as natural language processing or data mining. An example here is the usage of “retrieval” and “ranking” in an interchangeable fashion. These issues are for the most part not critical to the material presented in this survey, but we devote Section 2.9 to untangling terminological nuances.

36Prior to the advent of neural networks, researchers have attempted to extract “key terms” or “key phrases” from so-called “verbose” queries, e.g., Bendersky and Croft [2008], though these usually refer to sentence-length descriptions of information needs as opposed to paragraph-length narratives.

2.3 Relevance

There is one final concept necessary to connect the query, as an expression of the information need, to the “goodness” of the ranked texts according to some metric: Ultimately, the foundation of all ranking metrics rests on the notion of relevance,38 which is a relation between a text and a particular information need. A text is said to be relevant if it addresses the information need, otherwise it is not relevant. However, this binary treatment of relevance is a simplification, as it is more accurate, for example, to characterize relevance using ordinal scales in multiple dimensions [Spink and Greisdorf, 2001].
Discussions and debates about the nature of relevance are almost as old as the quest for building automated search systems itself (see Section 1.2), since relevance figures into discussions of what such systems should return and how to evaluate the quality of their outputs. Countless pages have been written about relevance, from different perspectives ranging from operational considerations (i.e., for designing search systems) to purely cognitive and psychological studies (i.e., how humans assimilate and use information acquired from search systems). We refer the reader to Saracevic [2017] for a survey that compiles accumulated wisdom on the topic of relevance spanning many decades [Saracevic, 1975].

While seemingly intuitive, relevance is surprisingly difficult to precisely define. Furthermore, the information science literature discusses many types of relevance; for the purposes of measuring search quality, information retrieval researchers are generally concerned with topical relevance, or the “aboutness” of the document—does the topic or subject of the text match the information need? There are other possible considerations as well: for example, cognitive relevance, e.g., whether the text is understandable by the user, or situational relevance, e.g., whether the text is useful for solving the problem at hand. To illustrate these nuances: A text might be topically relevant, but is written for experts whereas the searcher desires an accessible introduction; thus, it may not be relevant from the cognitive perspective. A text might be topically relevant, but the user is searching for information to aid in making a specific decision—for example, whether to send a child to public or private school—and while the text provides helpful background information, it offers no actionable advice. In this case, we might say that the document is topically relevant but not useful, i.e., from the perspective of situational relevance. Although it has been well understood for decades that relevance is a complex phenomenon, there remains a wide gap between studies that examine these nuances and the design of search systems and ranking models, as it is not clear how such insights can be operationalized.

More to the task at hand: in terms of developing ranking models, the most important lesson from many decades of information retrieval research is that relevance is in the eye of the beholder, that it is a user-specific judgment about a text that involves complex cognitive processes. To put it more simply: for my information need, I am the ultimate arbiter of what’s relevant or not; nobody else’s opinion counts or matters. Thus, relevance judgments represent a specific person’s assessment of what’s relevant or not—this person is called the assessor (or sometimes the annotator). In short, all relevance judgments are opinions, and thus are subjective. Relevance is not a “truth” (in a platonic sense) or an “inherent property” of a piece of text (with respect to an information need) that the assessor attempts to “unlock”.

37A minor complication is that ranking models might produce score ties, which need to be resolved at evaluation time since many metrics assume monotonically increasing ranks; see Section 2.5 for more details.

38“Relevancy” is sometimes used, often by industry practitioners. However, information retrieval researchers nearly always use the term “relevance” in the academic literature.
Put differently, unlike facts and reality, everyone can have different notions of relevance, and they are all “correct”. In this way, relevance differs quite a bit from human annotations in NLP applications, where (arguably) there is, for example, a true part-of-speech tag of a word or a true dependency relation between two words. Trained annotators can agree on a word’s part of speech nearly all the time, and disagreements are interpreted as the result of a failure to properly define the subject of annotation (i.e., what a part of speech is). It would be odd to speak of an annotator’s opinion of a word’s part of speech, but that is exactly what relevance is: an assessor’s opinion concerning the relation between a text and an information need.

With this understanding, it shouldn’t be a surprise then that assessor agreement on relevance judgments is quite low: 60% overlap is a commonly cited figure [Voorhees, 2000], but the values reported in the literature vary quite a bit (from around 30% to greater than 70%), depending on the study design, the information needs, and the exact agreement metric; see [Harman, 2011] for a discussion of this issue across studies spanning many decades. The important takeaway message is that assessor agreement is far lower than values an NLP researcher would be comfortable with for a human annotation task (κ > 0.9 is sometimes used as a reference point for what “good” agreement means). The reaction from an NLP researcher would be, “we need better annotation guidelines”. This, however, is fundamentally not possible, as we explain below.

Why is agreement so low among relevance judgments provided by different assessors? First, it is important to understand the setup of such experiments. Ultimately, all information needs arise from a single individual. In TREC, a human assessor develops the topic, which represents a best effort articulation of the information need relatively early in the information seeking process. Topics are formulated after some initial exploratory searches, but before in-depth perusal of texts from the corpus. The topics are then released to teams participating in the evaluation, and the same individual who created the topic then assesses system outputs (see Section 2.6 for more details). Thus, if we ask another assessor to produce an independent set of relevance judgments (for example, in the same way we might ask multiple annotators to assign part-of-speech tags to a corpus in an NLP setting in order to compute inter-annotator agreement), such a task is based on a particular external representation of that information need (e.g., a TREC topic, as in Figure 2).39 Thus, the second individual is judging relevance with respect to an interpretation of that representation. Remember, the actual characteristics of the desired information form a cognitive state that lies in the user’s head, i.e., Belkin’s anomalous state of knowledge. Furthermore, in some cases, the topic statements aren’t even faithful representations of the true information need to begin with: details may be missing and inconsistencies may be present in the representations themselves. The paradox of relevance is that if a user were able to fully and exhaustively articulate the parameters of relevance, there would likely be no need to search in the first place—for the user would already know the information desired.
We can illustrate with a concrete example based on the TREC topic shown in Figure 2 about “black bear attacks”: consider, would documents about brown (grizzly) bears be relevant?40 It could be the case that the user is actually interested in attacks by bears (in general), and just happens to have referenced black bears as a starting point. It could also be the case that the user specifically wants only attacks by black bears, perhaps to contrast with the behavior of brown bears. Or, it could be the case that the user isn’t familiar with the distinction, started off by referencing black bears, and only during the process of reading initial results is a decision made about different types of bears. All three scenarios are plausible based on the topic statement, and it can be seen now how different interpretations might give rise to very different judgments.

Beyond these fundamental issues, which center around representational deficiencies of cognitive states, there are issues related to human performance. Humans forget how they interpreted a previously encountered text and may judge two similar texts inconsistently. There may be learning effects that carry across multiple texts: for example, one text uses terminology that the assessor does not recognize as being relevant until a second text is encountered (later) that explains the terminology. In this case, the presentation order of the texts matters, and the assessor may or may not reexamine previous texts to adjust the judgments. There are also more mundane factors: Assessors may get tired and misread the material presented. Sometimes, they just make mistakes (e.g., clicked on the wrong button in an assessment interface). All of these factors further contribute to low agreement.

One obvious question that arises from this discussion is: With such low inter-annotator agreement, how are information retrieval researchers able to reliably evaluate systems at all? Given the critical role that evaluation methodology plays in any empirical discipline, it should come as no surprise that researchers have examined this issue in detail. In studies where we have multiple sets of relevance judgments (i.e., from different assessors), it is easy to verify that the score of a system does indeed vary (often, quite a bit) depending on which set of relevance judgments the system is evaluated with (i.e., whose opinion of relevance). However, the ranking of a group of systems is usually stable with respect to assessor variations [Voorhees, 2000].41 How stable? Exact values depend on the setting, but measured in terms of Kendall’s τ, a standard rank correlation metric, values consistently above 0.9 are observed. That is, if system A is better than system B, then the score of system A will likely be higher than the score of system B, regardless of the relevance judgments used for evaluation.42 This is a widely replicated and robust finding, and these conclusions have been shown to hold across many different retrieval settings [Sormunen, 2002, Trotman and Jenkinson, 2007, Bailey et al., 2008, Wang et al., 2015].

39As far as we know, assessors cannot Vulcan mind meld with each other.

40In TREC “lore”, this was a serious debate that was had “back in the day”. The other memorable debate along similar lines involved Trump and the Taj Mahal in the context of question answering.
This means that while the absolute value of an evaluation metric must be interpreted cautiously, comparisons between systems are generally reliable given a well-constructed test collection; see more discussions in Section 2.6. The inability to quantify system effectiveness in absolute terms is not a limitation outside of the ability to make marketing claims.43 As most research is focused on the effectiveness of a particular proposed innovation, the desired comparison is typically between a ranking model with and without that innovation, for which a reusable test collection can serve as an evaluation instrument.

2.4 Relevance Judgments

Formally, relevance judgments, also called qrels, comprise a set of (q, d, r) triples, where the relevance judgment r is a (human-provided) annotation on (q, d) pairs. Relevance judgments are also called relevance labels or human judgments. Practically speaking, they are contained in text files that can be downloaded as part of a test collection and can be treated like “ground truth”.44 In Section 2.6, we describe a common way in which test collections are created via community evaluations, but for now it suffices to view them as the product of (potentially large-scale) human annotation efforts.

In the simplest case, r is a binary variable—either document d is relevant to query q, or it is not relevant. A three-way scale of not relevant, relevant, and highly-relevant is one common alternative, and in web search, a five-point scale is often used—perfect, excellent, good, fair, and bad—which even has an acronym: PEGFB.45 Non-binary relevance judgments are called graded relevance judgments: “graded” is used in the sense of “grade”, defined as “a position in a scale of ranks or qualities” (from the Merriam–Webster Dictionary).

Relevance judgments serve two purposes: they can be used to train ranking models in a supervised setting and they can also be used to evaluate ranking models. To a modern researcher or practitioner of applied machine learning, this distinction might seem odd, since these are just the roles of the training, development, and test split of a dataset, but historically, information retrieval test collections have not been large enough to meaningfully train ranking models (with the exception of simple parameter tuning).

41Note that while studies of assessor agreement predated this paper by several decades at least, for example, Lesk and Salton [1968], the work of Voorhees is generally acknowledged as establishing these findings in the context of modern test collections.

42Conflated with this high-level summary is the effect size, i.e., the “true” difference between the effectiveness of systems, or an inferred estimate thereof. With small effect sizes, system A vs. system B comparisons are less likely to be consistent across different assessors. Not surprisingly, Voorhees [2000] studied this as well; see Wang et al. [2015] for a more recent examination in a different context.

43Occasionally on the web, one stumbles upon a statement like “our search engine achieves 90% accuracy” without references to the corpus, information needs, or users. Such marketing slogans are utterly meaningless.

44However, IR researchers tend to avoid the term “ground truth” because relevance judgments are opinions, as we discussed in Section 2.2.
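The qrels files described above are plain text. Assuming the common TREC layout of one “qid iteration docid relevance” record per line (the iteration column is generally unused), a minimal loader that recovers the (q, d, r) view might look as follows.

```python
from collections import defaultdict

def load_qrels(path):
    """Load TREC-style qrels into a nested dict: qid -> {docid: relevance grade}."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels
```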
However, with the release of the MS MARCO datasets, which we introduced in Section 1.2.5 and will further discuss in Section 2.7, the community has gained public access to a sufficiently large collection of relevance judgments for training models in a supervised setting. Thus, throughout this survey, we use the terms relevance judgments, test collections, and training data roughly interchangeably.

Researchers describe datasets for supervised learning of ranking models in different ways, but they are equivalent. It makes sense to explicitly discuss some of these variations to reduce possible confusion: Our view of relevance judgments as (q, d, r) triples, where r is a relevance label on query–document pairs, is perhaps the most general formulation. However, documents may in fact refer to paragraphs, passages, or some other unit of retrieval (see discussion in Section 2.9). Most often, d refers to the unique id of a text from the corpus, but in some cases (for example, some question answering datasets), the “document” may be just a span of text, without any direct association to the contents of a corpus. When the relevance judgments are binary, i.e., r is either relevant or non-relevant, researchers often refer to the training data as comprising (query, relevant document) pairs. In some papers, the training data are described as (query, relevant document, non-relevant document) triples, but this is merely a different organization of (q, d, r) triples.

It is important to note that non-relevant documents are often qualitatively different from relevant documents. Relevant documents are nearly always judged by a human assessor as being so. Non-relevant documents, however, may either come from explicit human judgments or they may be heuristically constructed. For example, in the MS MARCO passage ranking test collection, non-relevant documents are sampled from BM25 results not otherwise marked as relevant (see Section 2.7 for details). Here, we have a divergence in data preparation for training versus evaluation: heuristically sampling non-relevant documents is a common technique when training a model. However, such sampling is almost never used during evaluation. Thus, there arises the distinction between documents that have been explicitly judged as non-relevant and “unjudged” documents, which we discuss in the context of ranking metrics below.

2.5 Ranking Metrics

Ranking metrics quantify the quality of a ranking of texts and are computed from relevance judgments (qrels), described in the previous section. The set of ranked lists produced by a system (using a particular approach) for a set of queries (in TREC, topics) is called a “run”, or sometimes a “submission”, in that files containing these results represent the artifacts submitted for evaluation, for example, in TREC evaluations (more below). The qrels and the run file are fed into an evaluation program such as trec_eval, the most commonly used program by information retrieval researchers, which automatically computes a litany of metrics. These metrics define the hill to climb in the quest for effectiveness improvements.

Below, we describe a number of common metrics that are used throughout this survey. To be consistent with the literature, we largely follow the notation and convention of Mitra and Craswell [2019a]. We rewrite a ranked list R = {(d1, s1), (d2, s2), . . . , (dl, sl)} of length l as {(1, d1), (2, d2), . . . , (l, dl)}, retaining only the rank i induced by the scores si.
2.5 Ranking Metrics

Ranking metrics quantify the quality of a ranking of texts and are computed from relevance judgments (qrels), described in the previous section. The ranked lists produced by a system (using a particular approach) for a set of queries (in TREC, topics) are called a “run”, or sometimes a “submission”, in that the files containing these results represent the artifacts submitted for evaluation, for example, in TREC evaluations (more below). The qrels and the run file are fed into an evaluation program such as trec_eval, the program most commonly used by information retrieval researchers, which automatically computes a litany of metrics. These metrics define the hill to climb in the quest for effectiveness improvements. Below, we describe a number of common metrics that are used throughout this survey. To be consistent with the literature, we largely follow the notation and conventions of Mitra and Craswell [2019a]. We rewrite a ranked list R = \{(d_i, s_i)\}_{i=1}^{l} of length l as \{(i, d_i)\}_{i=1}^{l}, retaining only the rank i induced by the scores s_i.

Many metrics are computed at a particular cutoff (or have variants that do so), which means that the ranked list R is truncated to a particular length k, \{(d_i, s_i)\}_{i=1}^{k}, where k ≤ l; this is notated as Metric@k. The primary difference between l and k is that the system decides l (i.e., how many results to return), whereas k is a property of the evaluation metric, typically set by the organizers of an evaluation or the authors of a paper. Sometimes, l and k are left unspecified, in which case it is usually the case that l = k = 1000. In most TREC evaluations, runs contain up to 1000 results per topic, and the metrics evaluate the entirety of the ranked lists (unless an explicit cutoff is specified). From a ranked list R, we can compute the following metrics:

Precision is defined as the fraction of documents in ranked list R that are relevant, or:

\text{Precision}(R, q) = \frac{\sum_{(i,d) \in R} \text{rel}(q, d)}{|R|},    (3)

where rel(q, d) indicates whether document d is relevant to query q, assuming binary relevance. Graded relevance judgments are binarized with some relevance threshold, e.g., in a three-grade scale, we might set rel(q, d) = 1 for “relevant” and “highly relevant” judgments. Often, precision is evaluated at a cutoff k, notated as Precision@k or abbreviated as P@k. If the cutoff is defined in terms of the number of relevant documents for a particular topic (i.e., a topic-specific cutoff), the metric is known as R-precision. Precision has the advantage that it is easy to interpret: of the top k results, what fraction are relevant?46 There are two main downsides: First, precision does not take into account graded relevance judgments and, for example, cannot separate “relevant” from “highly relevant” results, since the distinction is erased in rel(q, d). Second, precision does not take into account rank positions (beyond the cutoff k). For example, consider P@10: relevant documents appearing at ranks one and two (with no other relevant documents) would receive a precision of 0.2; P@10 would be exactly the same if those two relevant documents appeared at ranks nine and ten. Yet, clearly, the first ranked list would be preferred by a user.

46 There is a corner case here if l < k: for example, what is P@10 for a ranked list that only has five results? One possibility is to always use k in the denominator, in which case the maximum possible score is 0.5; this has the downside of averaging per-topic scores that have different ranges when summarizing effectiveness across a set of topics. The alternative is to use l as the denominator. Unfortunately, treatment is inconsistent in the literature.

Recall is defined as the fraction of relevant documents (in the entire collection C) for q that are retrieved in ranked list R, or:

\text{Recall}(R, q) = \frac{\sum_{(i,d) \in R} \text{rel}(q, d)}{\sum_{d \in C} \text{rel}(q, d)},    (4)

where rel(q, d) indicates whether document d is relevant to query q, assuming binary relevance. Graded relevance judgments are binarized in the same manner as for precision. Mirroring precision, recall is often evaluated at a cutoff k, notated as Recall@k or abbreviated R@k. This metric has the same advantages and disadvantages as precision: it is easy to interpret, but does not take into account relevance grades or the rank positions in which relevant documents appear.47

47 Note that since the denominator in the recall equation is the total number of relevant documents, the symmetric situation of what happens when l < k does not exist as it does with precision. However, a different issue arises when k is smaller than the total number of relevant documents, in which case perfect recall is not possible. Therefore, it is inadvisable to set k to a value smaller than the smallest total number of relevant documents for a topic across all topics in a test collection. While in most formulations, k is fixed for all topics in a test collection, there exist variant metrics (though less commonly used) where k varies per topic, for example, as a function of the number of (known) relevant documents for that topic.

Reciprocal rank (RR) is defined as:

\text{RR}(R, q) = \frac{1}{\text{rank}_i},    (5)

where rank_i is the smallest rank at which a relevant document appears. That is, if a relevant document appears in the first position, reciprocal rank = 1; 1/2 if it appears in the second position; 1/3 if it appears in the third position; and so on. If a relevant document does not appear in the top k, then that query receives a score of zero. Like precision and recall, RR is computed with respect to binary judgments. Although RR has an intuitive interpretation, it only captures the appearance of the first relevant result. For question answering or tasks in which the user may be satisfied with a single answer, this may be an appropriate metric, but reciprocal rank is usually a poor choice for ad hoc retrieval because users usually desire more than one relevant document. As with precision and recall, reciprocal rank can be computed at a particular rank cutoff, denoted with the same @k convention.
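The following is a minimal sketch of these three metrics for a single query, assuming binary judgments; in practice one would rely on trec_eval or a comparable evaluation program. Following one of the two conventions in footnote 46, the precision denominator is always k, and the function names are our own.

```python
def precision_at_k(ranked_docs, relevant_docs, k):
    """Precision@k (Equation 3, truncated at k): fraction of the top k that are relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant_docs) / k

def recall_at_k(ranked_docs, relevant_docs, k):
    """Recall@k (Equation 4): fraction of all relevant documents retrieved in the top k."""
    return sum(1 for d in ranked_docs[:k] if d in relevant_docs) / len(relevant_docs)

def reciprocal_rank(ranked_docs, relevant_docs, k=1000):
    """RR@k (Equation 5): 1/rank of the first relevant document, or 0 if none in the top k."""
    for rank, d in enumerate(ranked_docs[:k], start=1):
        if d in relevant_docs:
            return 1.0 / rank
    return 0.0

# Toy illustration of the rank-insensitivity of P@10: relevant documents at
# ranks 1 and 2 score the same as relevant documents at ranks 9 and 10.
ranked_a = ["R1", "R2"] + [f"N{i}" for i in range(8)]
ranked_b = [f"N{i}" for i in range(8)] + ["R1", "R2"]
assert precision_at_k(ranked_a, {"R1", "R2"}, 10) == precision_at_k(ranked_b, {"R1", "R2"}, 10)
```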
Average Precision (AP) is defined as:

\text{AP}(R, q) = \frac{\sum_{(i,d) \in R} \text{Precision@}i(R, q) \cdot \text{rel}(q, d)}{\sum_{d \in C} \text{rel}(q, d)},    (6)

where all of the notation used has already been defined. The intuitive way to understand average precision is that it is the average of precision scores at cutoffs corresponding to the appearance of every relevant document; rel(q, d) can be understood as a binary indicator variable, where non-relevant documents contribute nothing. Since the denominator is the total number of relevant documents, relevant documents that don’t appear in the ranked list at all contribute zero to the average. Once again, relevance is assumed to be binary. Typically, average precision is measured without an explicit cutoff, over the entirety of the ranked list; since the default length l used in most evaluations is 1000, the practical effect is that AP is computed at a cutoff of rank 1000, although it is almost never written as AP@1000. Since the metric factors in retrieval of all relevant documents, a cutoff would artificially reduce the score (i.e., it has the effect of including a bunch of zeros in the average for relevant documents that do not appear in the ranked list). Evaluations use average precision when the task requires taking recall into account, so imposing a cutoff usually doesn’t make sense. The implied cutoff of 1000 is a compromise between accurate measurement and practicality: in practice, relevant documents appearing below rank 1000 contribute negligibly to the final score (which is usually reported to four digits after the decimal point), and run submissions with 1000 hits per topic are still manageable in size. Average precision is more difficult to interpret, but it is a single summary statistic that captures aspects of both precision and recall, while favoring the appearance of relevant documents towards the top of the ranked list. The downside of average precision is that it does not distinguish between relevance grades; that is, “marginally” relevant and “highly” relevant documents make equal contributions to the score.
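A corresponding sketch of average precision for a single query, again assuming binary judgments and computed over the entire ranked list; relevant documents that are never retrieved simply contribute zero through the denominator.

```python
def average_precision(ranked_docs, relevant_docs):
    """AP (Equation 6): mean of Precision@i at the rank of every relevant document."""
    if not relevant_docs:
        return 0.0  # topics with no relevant documents are typically excluded
    hits, precision_sum = 0, 0.0
    for rank, d in enumerate(ranked_docs, start=1):
        if d in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this relevant hit
    return precision_sum / len(relevant_docs)
```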
Normalized Discounted Cumulative Gain (nDCG) is a metric that is most frequently used to measure the quality of web search results. Unlike the other metrics above, nDCG was specifically designed for graded relevance judgments. For example, if relevance were measured on a five-point scale, rel(q, d) would return r ∈ {0, 1, 2, 3, 4}. First, we define Discounted Cumulative Gain (DCG):

\text{DCG}(R, q) = \sum_{(i,d) \in R} \frac{2^{\text{rel}(q,d)} - 1}{\log_2(i + 1)}.    (7)

Gain is used here in the sense of utility, i.e., how much value does a user derive from a particular result. There are two factors that go into this calculation: (1) the relevance grade (i.e., highly relevant results are “worth” more than relevant results) and (2) the rank at which the result appears (relevant results near the top of the ranked list are “worth” more). The discounting refers to the decay in the gain (utility) as the user consumes results lower and lower in the ranked list, i.e., factor (2). Finally, we introduce normalization:

\text{nDCG}(R, q) = \frac{\text{DCG}(R, q)}{\text{IDCG}(R, q)},    (8)

where IDCG represents the DCG of an “ideal” ranked list: this would be a ranked list that begins with all of the documents of the highest relevance grade, then the documents with the next highest relevance grade, and so on. Thus, nDCG represents DCG normalized to a range of [0, 1] with respect to the best possible ranked list. Typically, nDCG is associated with a rank cutoff; a value of 10 or 20 is common. Since most commercial web search engines present ten results on a page (on the desktop, at least), these two settings represent nDCG with respect to the first or first two pages of results. For similar reasons, nDCG@3 or nDCG@5 are often used in the context of mobile search, given the much smaller screen sizes of phones. This metric is popular for evaluating the results of web search for a number of reasons: First, nDCG can take advantage of graded relevance judgments, which provide finer distinctions on output quality. Second, the discounting and cutoff represent a reasonably accurate (albeit simplified) model of real-world user behavior, as revealed through eye-tracking studies; see, for example, Joachims et al. [2007]. Users do tend to scan results linearly, with increasing probability of “giving up” and “losing interest” as they consume more and more results (i.e., proceed further down the ranked list). This is modeled in the discounting, and there are variants of nDCG that apply different discounting schemes to model this aspect of user behavior. The cutoff value models a hard stop when users stop reading (i.e., give up). For example, nDCG@10 quantifies the result quality of the first page of search results in a browser, assuming the user never clicks “next page” (which is frequently the case).
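A sketch of nDCG@k using the gain and discount in Equations (7) and (8); here graded_qrels is assumed to map judged document ids to their relevance grades, with unjudged documents receiving a gain of zero.

```python
import math

def dcg(grades):
    """DCG (Equation 7) over relevance grades listed in rank order."""
    return sum((2 ** g - 1) / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(ranked_docs, graded_qrels, k):
    """nDCG@k (Equation 8): DCG of the top k normalized by the DCG of an ideal ordering."""
    gains = [graded_qrels.get(d, 0) for d in ranked_docs[:k]]
    ideal_gains = sorted(graded_qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0
```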
All of the metrics we have discussed above quantify the quality of a single ranked list with respect to a specific topic (query). Typically, the arithmetic mean across all topics in a test collection is used as a single summary statistic to denote the quality of a run for those topics.48 We emphasize that it is entirely meaningless to compare effectiveness scores from different test collections (since scores do not control for differences due to corpora, topic difficulty, and many other issues), and even comparing a run that participated in a particular evaluation with a run that did not can be fraught with challenges (see next section). A few additional words of caution: aggregation can hide potentially big differences in per-topic scores. Some topics are “easy” and some topics are “difficult”, and it is certainly possible that a particular ranking model has an affinity towards certain types of information needs. These nuances are all lost in a simple arithmetic mean across per-topic scores.

48 Other approaches for aggregation have been explored, such as the geometric and harmonic means [Ravana and Moffat, 2009].

There is one frequently unwritten detail, critical to the interpretation of metrics, that is worth discussing. What happens if the ranked list R contains a document for which no relevance judgment exists, i.e., the document does not appear in the qrels file for that topic? This is called an “unjudged document”, and the standard treatment (by most evaluation programs) is to consider unjudged documents not relevant. Unjudged documents are quite common because it is impractical to exhaustively assess the relevance of every document in a collection with respect to every information need; the question of how to select documents for assessment is discussed in the next section, but for now let’s just take this observation as a given. The issue of unjudged documents is important because of the assumption that unjudged documents are not relevant. Thus, a run may score poorly not because the ranking model is poor, but because the ranking model produces many results that are unjudged (again, assume this as a given for now; we discuss why this may be the case in the next section). The simplest way to diagnose potential issues is to compute the fraction of judged documents at cutoff k (Judged@k or J@k). For example, if we find that 80% of the results in the top 10 hits are unjudged, Precision@10 is capped at 0.2. There is no easy fix to this issue beyond diagnosing and noting it: assuming that unjudged documents are not relevant is perhaps too pessimistic, but the alternative of assuming that unjudged documents are relevant is also suspect. While information retrieval researchers have developed metrics that explicitly account for unjudged documents, e.g., bpref [Buckley and Voorhees, 2004], the condensed list approach [Sakai, 2007], and rank-biased precision (RBP) [Moffat and Zobel, 2008], in our opinion these metrics have yet to reach widespread adoption by the community.
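Computing the fraction of judged documents is straightforward; a minimal sketch, where judged_docs is assumed to be the set of all document ids that appear in the qrels for the topic, regardless of their labels:

```python
def judged_at_k(ranked_docs, judged_docs, k):
    """J@k: fraction of the top k results that have any relevance judgment at all."""
    return sum(1 for d in ranked_docs[:k] if d in judged_docs) / k
```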
There is a final detail worth explicitly mentioning. All of the above metrics assume that document scores are strictly decreasing, and that there are no score ties. Otherwise, the evaluation program must arbitrarily make some decision to map identical scores to different ranks (necessary because metrics are defined in terms of rank order). For example, trec_eval breaks ties based on the reverse lexicographical order of the document ids. These arbitrary decisions introduce potential differences across alternative implementations of the same metric. Most recently, Lin and Yang [2019] quantified the effects of score ties from the perspective of experimental repeatability and found that score ties can be responsible for metric differences up to the third place after the decimal point. While the overall effects are small and not statistically significant, to eliminate this experimental confound, they advocated that systems should explicitly ensure that there are no score ties in the ranked lists they produce, rather than let the evaluation program make arbitrary decisions.49 Of course, Lin and Yang were not the first to examine this issue; see, for example, Cabanac et al. [2010] and Ferro and Silvello [2015] for additional discussions.

49 This can be accomplished by first defining a consistent tie-breaking procedure and then subtracting a small ε from the tied scores to induce the updated rank ordering.
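The footnote above suggests one way to eliminate ties before evaluation; below is a minimal sketch under those assumptions (a deterministic ordering by score and then reverse lexicographic document id, followed by subtracting a small ε). The function name and the particular value of ε are illustrative choices.

```python
def strictify_scores(run, epsilon=1e-6):
    """Return (doc_id, score) pairs with strictly decreasing scores for one query."""
    # Fix a deterministic order: doc ids in reverse lexicographic order,
    # then a stable sort by score descending.
    ordered = sorted(run, key=lambda pair: pair[0], reverse=True)
    ordered = sorted(ordered, key=lambda pair: pair[1], reverse=True)

    strict, prev = [], None
    for doc_id, score in ordered:
        if prev is not None and score >= prev:
            score = prev - epsilon  # nudge tied scores down to break the tie
        strict.append((doc_id, score))
        prev = score
    return strict
```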
We conclude this section with a number of remarks, some of which represent conventions and tacit knowledge in the community that are rarely explicitly communicated:

• Naming metrics. Mean average precision, abbreviated MAP, represents the mean of average precision scores across many topics. Similarly, mean reciprocal rank, abbreviated MRR, represents the mean of reciprocal rank scores across topics.50 In some papers, the phrase “early precision” is used to refer to the quality of top-ranked results—as measured by a metric such as Precision@k or nDCG@k with a relatively small cutoff (e.g., k = 10). It is entirely possible for a system to excel at early precision (i.e., identify a few relevant documents and place them near the top of the ranked list) but not necessarily be effective when measured using recall-oriented metrics (which require identifying all relevant documents).

• Reporting metrics. Most test collections or evaluations adopt an official metric, or sometimes, a few official metrics. It is customary when reporting results to at least include those official metrics; including additional metrics is usually fine, but the official metrics should not be neglected. The choice of metric is usually justified by the creators of the test collection or the organizers of the evaluation (e.g., we aim to solve this problem, and the quality of the solution is best captured by this particular metric). Unless there is a compelling reason otherwise, follow established conventions; otherwise, results will not be comparable. It has been a convention, for example, at TREC, that metrics are usually reported to four places after the decimal, e.g., 0.2932. In prose, a unit of 0.01 in score is often referred to as a point, as in, an improvement from 0.19 to 0.29 is a ten-point gain. In some cases, particularly in NLP papers, metrics are reported in these terms, e.g., multiplied by 100, so 0.2932 becomes 29.32.51 We find this convention acceptable, as there is little chance for confusion. Finally, recognizing that a difference of 0.001 is just noise, some researchers opt to only report values to three digits after the decimal point, so 0.2932 becomes 0.293.

• Comparing metrics. Entire tomes have been written about proper evaluation practices when comparing results, for example, what statistical tests of significance to use and when. As we lack the space for a detailed exposition, we refer readers to Sakai [2014] and Fuhr [2017] as starting points into the literature.

50 Some texts use MAP to refer to the score for a specific topic, which is technically incorrect. This is related to a somewhat frivolous argument on metric names that has raged on in the information retrieval community for decades now: there are those who argue that even the summary statistic across multiple topics for AP should be referred to as AP. They point, as evidence, to the fact that no researcher would ever write “MP@5” (i.e., mean precision at rank cutoff 5), and thus, to be consistent, every metric should be prefixed by “mean”, or none at all. Given the awkwardness of “mean precision”, the most reasonable choice is to omit “mean” from average precision as well. We do not wish to take part in this argument, and use “MAP” and “MRR” simply because most researchers do.
51 This likely started with BLEU scores in machine translation.

Having defined metrics for measuring the quality of a ranked list, we have now described all components of the text ranking problem: Given an information need expressed as a query q, the text ranking task is to return a ranked list of k texts {d_1, d_2, ..., d_k} from an arbitrarily large but finite collection of texts C = {d_i} that maximizes a metric of interest. Where are the resources we need to concretely tackle this challenge? We turn our attention to this next.

2.6 Community Evaluations and Reusable Test Collections

Based on the discussions above, we can enumerate the ingredients necessary to evaluate a text ranking model: a corpus or collection of texts to search, a set of information needs (i.e., topics), and relevance judgments (qrels) for those needs. Together, these comprise the components of what is known as a test collection for information retrieval research. With a test collection, it becomes straightforward to generate rankings with a particular ranking model and then compute metrics to quantify the quality of those rankings, for example, using any of those discussed in the previous section. And having quantified the effectiveness of results, it then becomes possible to make measurable progress in improving ranking models. We have our hill and we know how high up we are. And if we have enough relevance judgments (see Section 2.4), we can directly train ranking models. In other words, we have a means to climb the hill.

Although conceptually simple, the creation of resources to support reliable, large-scale evaluation of text retrieval methods is a costly endeavor involving many subtle nuances that are not readily apparent, and is typically beyond the resources of individual research groups. Fortunately, events such as the Text Retrieval Conferences (TRECs), organized by the U.S. National Institute of Standards and Technology (NIST), provide the organizational structure as well as the resources necessary to bring together multiple teams in community-wide evaluations. These exercises serve a number of purposes: First, they provide an opportunity for the research community to collectively set its agenda through the types of tasks that are proposed and evaluated; participation serves as a barometer to gauge interest in emerging information access tasks. Second, they provide a neutral forum to evaluate systems in a fair and rigorous manner.
Third, typical byproducts of evaluations include reusable test collections that are capable of evaluating systems that did not participate in the evaluation (more below). Some of these test collections are used for many years, some even decades, after the original evaluations that created them. Finally, the evaluations may serve as testbeds for advancing novel evaluation methodologies themselves; that is, the goal is not only to evaluate systems, but the processes for evaluating systems. TREC, which has been running for three decades, kicks off each spring with a call for participation. The evaluation today is divided into (roughly half a dozen) “tracks” that examine different information access problems. Proposals for tracks are submitted the previous year in the fall, where groups of volunteers (typically, researchers from academia and industry) propose to organize tracks. These proposals are then considered by a committee, and selected proposals define the evaluation tasks that are run. Over its history, TREC has explored a wide range of tasks beyond ad hoc retrieval, including search in a variety of different languages and over speech; in specialized domains such as biomedicine and chemistry; different types of documents such as blogs and tweets; different modalities of querying such as filtering and real-time summarization; as well as interactive retrieval, conversational search, and other user-focused issues. For a general overview of different aspects of TREC (at least up until the middle of the first decade of the 2000s), the “TREC book” edited by Voorhees and Harman [2005] provides a useful starting point. Tracks at TREC often reflect emerging interests in the information retrieval community; explorations there often set the agenda for the field and achieve significant impact beyond the academic ivory tower. Writing in 2008, Hal Varian, chief economist at Google, acknowledged that in the early days of the web, “researchers used industry-standard algorithms based on the TREC research to find documents on the web”.52 Another prominent success story of TREC is IBM’s Watson question answering system that resoundingly beat two human champions on the quiz show Jeopardy! in 2011. There is a direct lineage from Watson, including both the techniques it used and the development team behind the scenes, to the TREC question answering tracks held in the late 1990s and early 2000s. Participation in TREC is completely voluntary with no external incentives (e.g., prize money),53 and thus researchers “vote with their feet” in selecting tracks that are of interest to them. While track organizers begin with a high-level vision, the development of individual tracks is often a collaboration between the organizers and participants, aided by guidance from NIST. System submissions for the tasks are typically due in the summer, with evaluation results becoming available in the fall time frame. Each TREC cycle concludes with a workshop held on the grounds of the National Institute of Standards and Technology in Gaithersburg, Maryland, where participants convene to discuss the evaluation results and present their solutions to the challenges defined in the different tracks.54 The cycle then begins anew with planning for the next year. Beyond providing the overarching organizational framework for exploring different tracks at TREC, NIST also contributes evaluation resources and expertise, handling the bulk of the “mechanics” of the evaluation. 
Some of this was already discussed in Section 2.2: Unless specialized domain expertise is needed, for example, in biomedicine, NIST assessors perform topic development, or the creation of the information needs, and provide the relevance assessments as well. Historically, most of the NIST assessors are retired intelligence analysts, which means that assessing, synthesizing, and otherwise drawing conclusions from information was, literally, their job. Topic development is usually performed in the spring, based on initial exploration of the corpus used in the evaluation. To</p><p>52https://googleblog.blogspot.com/2008/03/why-data-matters.html 53An exception is that sometimes a research sponsor (funding agency) uses TREC as an evaluation vehicle, in which case teams that receive funding are compelled to participate. 54In the days before the COVID-19 pandemic, that is.</p><p>31 the extent possible, the assessor who created the topic (and wrote the topic statement) is the person who provides the relevance judgments (later that year, generally in the late summer to early fall time frame). This ensures that the judgments are as consistent as possible. To emphasize a point we have already made in Section 2.2: the relevance judgments are the opinion of this particular person.55 What do NIST assessors actually evaluate? In short, they evaluate the submissions (i.e., “runs”) of teams who participated in the evaluation. For each topic, using a process known as pooling [Sparck Jones and van Rijsbergen, 1975, Buckley et al., 2007], runs from the participants are gathered, with duplicates removed, and presented to the assessor. To be clear, a separate pool is created for each topic. The most common (and fair) way to construct the pools is to select the top k results from each participating run, where k is determined by the amount of assessment resources available. This is referred to as top-k pooling or pooling to depth k. Although NIST has also experimented with different approaches to constructing the pools, most recently, using bandit techniques [Voorhees, 2018], top-k pooling remains the most popular approach due to its predictability and well-known properties (both advantages and disadvantages). System results for each query (i.e., from the pools) are then presented to an assessor in an evaluation interface, who supplies the relevance judgments along the previously agreed scale (e.g., a three-way relevance grade). To mitigate systematic biases, pooled results are not associated with the runs they are drawn from, so the assessor only sees (query, result) pairs and has no explicit knowledge of the source. After the assessment process completes, all judgments are then gathered to assemble the qrels for those topics, and these relevance judgments are used to evaluate the submitted runs (e.g., using one or a combination of the metrics discussed in the previous section). Relevance judgments created from TREC evaluations are used primarily in one of two ways:</p><p>1. They are used to quantify the effectiveness of systems that participated in the track. The evaluation of the submitted runs using the relevance judgments created from the pooling process accomplishes this goal, but the results need to be interpreted in a more nuanced way than just comparing the value of the metrics. 
Whether system differences can be characterized as significant or meaningful is more than just a matter of running standard significance tests, but must consider a multitude of other factors, including all the issues discussed in Section 2.2 and more [Sanderson and Zobel, 2005]. Details of how this is accomplished depend on the task and vary from track to track; for an interested reader, Voorhees and Harman [2005] offer a good starting point. For more details, in each year’s TREC proceedings, each track comes with an overview paper written by the organizers that explains the task setup and summarizes the evaluation results. 2. Relevance judgments contribute to a test collection that can be used as a standalone evaluation instrument by researchers beyond the original TREC evaluation that created them. These test collections can be used for years and even decades; for example, as we will describe in more detail in the next section, the test collection from the TREC 2004 Robust Track is still widely used today!</p><p>In the context of using relevance judgments from a particular test collection, there is an important distinction between runs that participated in the evaluation vs. those that did not. These “after-the-fact” runs are sometimes called “post hoc” runs. First, the results of official submissions are considered by most researchers to be more “credible” than post-hoc runs, due to better methodological safeguards (e.g., less risk of overfitting). We return to discuss this issue in more detail in Section 2.7. Second, relevance judgments may treat participating systems and post-hoc submissions differently, as we explain. There are two common use cases for test collections: A team that participated in the TREC evaluation might use the relevance judgments to further investigate model variants or perhaps conduct ablation studies. A team that did not participate in the TREC evaluation might use the relevance judgments to evaluate a newly proposed technique, comparing it against runs submitted to the evaluation. In the former case, a variant technique is likely to retrieve similar documents as a submitted run, and therefore less likely to encounter unjudged documents—which, as we have previously mentioned, are treated as not relevant by standard evaluation tools (see Section 2.5). In</p><p>55The NIST assessors are invited to the TREC workshop, and every year, some subset of them do attend. And they’ll sometimes even tell you what topic was theirs. Sometimes they even comment on your system.</p><p>32 the latter case, a newly proposed technique may encounter more unjudged documents, and thus score poorly—not necessarily because it was “worse” (i.e., lower quality), but simply because it was different. That is, the new technique surfaced documents that had not been previously retrieved (and thus never entered the pool to be assessed). In other words, there is a danger that test collections encourage researchers to search only “under the lamplight”, since the universe of judgments is defined by the participants of a particular evaluation (and thus represents a snapshot of the types of techniques that were popular at the time). Since many innovations work differently than techniques that came before, old evaluation instruments may not be capable of accurately quantifying effectiveness improvements associated with later techniques. 
As a simple (but contrived) example, if the pools were constructed exclusively from techniques based on exact term matches, the resulting relevance judgments would be biased against systems that exploited semantic match techniques that did not rely exclusively on exact match signals. In general, old test collections may be biased negatively against new techniques, which is particularly undesirable because they may cause researchers to prematurely abandon promising innovations simply because the available evaluation instruments are not able to demonstrate their improvements. Fortunately, IR researchers have long been cognizant of these dangers and evaluations usually take a variety of steps to guard against them. The most effective strategy is to ensure a rich and diverse pool, where runs adopt a variety of different techniques, and to actively encourage “manual” runs that involve humans in the loop (i.e., users interactively searching the collection to compile results). Since humans obviously do more than match keywords, manual runs increase the diversity of the pool. Furthermore, researchers have developed various techniques to assess the reusability of test collections, characterizing their ability to fairly evaluate runs from systems that did not participate in the original evaluation [Zobel, 1998, Buckley et al., 2007]. The literature describes a number of diagnostics, and test collections that pass this vetting are said to be reusable. From a practical perspective, there are several steps that researchers can take to sanity check their evaluation scores to determine if a run is actually worse, or simply different. One common technique is to compute and report the fraction of unjudged documents, as discussed in the previous section. If two runs have very different proportions of unjudged documents, this serves as a strong signal that one of those runs may not have been evaluated fairly. Another approach is to use a metric that explicitly attempts to account for unjudged documents, such as bpref or RBP (also discussed in the previous section). Obviously, different proportions of unjudged documents can be a sign that effectiveness differences might be attributable to missing relevance judgments. However, an important note is that the absolute proportion of unjudged documents is not necessarily a sign of unreliable evaluation results in itself. The critical issue is bias, in the sense of Buckley et al. [2007]: whether the relevance judgments represent a random (i.e., non-biased) sample of all relevant documents. Consider the case where two runs have roughly the same proportion of unjudged documents (say, half are unjudged). There are few firm conclusions that can be drawn in this situation without more context. Unjudged documents are inevitable, and even a relatively high proportion of unjudged isn’t “bad” per se. This could happen, for example, when two runs that participated in an evaluation are assessed with a metric at a cutoff larger than the number of documents each run contributed to the pool. For example, the pool was constructed with top-100 pooling, but MAP is measured to rank 1000. In such cases, there is no reason to believe that the unjudged documents are systematically biased against one run or the other. However, in other cases (for example, the bias introduced by systems based on exact term matching), there may be good reason to suspect the presence of systematic biases. 
TREC, as a specific realization of the Cranfield paradigm, has been incredibly influential, both on IR research and more broadly in the commercial sphere; for example, see an assessment of the economic impact of TREC conducted in 2010 [Rowe et al., 2010]. TREC’s longevity—2021 marks the thirtieth iteration—is just one testament to its success. Another indicator of success is that the “TREC model” has been widely emulated around the world. Examples include CLEF in Europe and NTCIR and FIRE in Asia, which are organized in much the same way.</p><p>With this exposition, we have provided a high-level overview of modern evaluation methodology for information retrieval and text ranking under the Cranfield paradigm—covering inputs to and outputs of the ranking model, how the results are evaluated, and how test collections are typically created.</p><p>33 We conclude with a few words of caution already mentioned in the introductory remarks: The beauty of the Cranfield paradigm lies in a precise formulation of the ranking problem with a battery of quantitative metrics. This means that, with sufficient training data, search can be tackled as an optimization problem using standard supervised machine-learning techniques. Beyond the usual concerns with overfitting, and whether test collections are realistic instances of information needs “in the wild”, there is a fundamental question regarding the extent to which system improvements translates into user benefits. Let us not forget that the latter is the ultimate goal, because users seek information to “do something”, e.g., decide what to buy, write a report, find a job, etc. A well-known finding in information retrieval is that better search systems (as evaluated by the Cranfield methodology) might not lead to better user task performance as measured in terms of these ultimate goals; see, for example, Hersh et al. [2000], Allan et al. [2005]. Thus, while evaluations using the Cranfield paradigm undoubtedly provide useful signals in characterizing the effectiveness of ranking models, they do not capture “the complete picture”.</p><p>2.7 Descriptions of Common Test Collections</p><p>Supervised machine-learning techniques require data, and the community is fortunate to have access to many test collections, built over decades, for training and evaluating text ranking models. In this section, we describe test collections that are commonly used by researchers today. Our intention is not to exhaustively cover all test collections used by every model in this survey, but to focus on representative resources that have played an important role in the development of transformer-based ranking models. When characterizing and comparing test collections, there are a few key statistics to keep in mind:</p><p>• Size of the corpus or collection, in terms of the number of texts |C|, the mean length of each text L(C), the median length of each text Le(C), and more generally, the distribution of the lengths. The size of the corpus is one factor in determining the amount of effort required to gather sufficient relevance judgments to achieve “good” coverage. The average length of a text provides an indication of the amount of effort required to assess each result, and the distribution of lengths may point to ranking challenges.56 • Size of the set of evaluation topics, both in terms of the number of queries |q| and the average length of each query L(q). 
Obviously, the more queries, the better, from the perspective of accurately quantifying the effectiveness of a particular approach. Average query length offers clues about the expression of the information needs (e.g., amount of detail).

• The number of relevance judgments available, both in terms of positive and negative labels. We can quantify this in terms of the average number of judgments per query |J|/q as well as the number of relevant labels per query |Rel|/q.57 Since the amount of resources (assessor time, money for paying assessors, etc.) that can be devoted to performing relevance judgments is usually fixed, there are different strategies for allocating assessor effort. One choice is to judge many queries (say, hundreds), but examine relatively few results per query, for example, by using a shallow pool depth. An alternative is to judge fewer queries (say, dozens), but examine more texts per query, for example, by using a deeper pool depth. Colloquially, these are sometimes referred to as “shallow but wide” (or “sparse”) judgments vs. “narrow but deep” (or “dense”) judgments. We discuss the implications of these different approaches in the context of specific test collections below. In addition, the number of relevant texts (i.e., positive judgments) per topic is an indicator of difficulty. Generally, evaluation organizers prefer topics that are neither too difficult nor too easy. If the topics are too difficult (i.e., too few relevant documents), systems might all perform poorly, making it difficult to discriminate system effectiveness, or systems might perform well for idiosyncratic reasons that are difficult to generalize. On the other hand, if the topics are too easy (i.e., too many relevant documents), then all systems might obtain high scores, also making it difficult to separate “good” from “bad” systems.

56 Retrieval scoring functions that account for differences in document lengths, e.g., Singhal et al. [1996], constituted a major innovation in the 1990s. As we shall see in Section 3, long texts pose challenges for ranking with transformer-based models. In general, collections with texts that differ widely in length are more challenging, since estimates of relevance must be normalized with respect to length.
57 In the case of graded relevance judgments, there is typically a binarization scheme to separate relevance grades into “relevant” and “not relevant” categories for metrics that require binary judgments.

A few key statistics of the MS MARCO passage ranking test collection, the MS MARCO document ranking test collection, and the Robust04 test collection are summarized in Table 2 and Table 3. The distributions of the lengths of texts from these three corpora are shown in Figure 3. In these analyses, token counts are computed by splitting texts on whitespace,58 which usually yields values that differ from lengths computed from the perspective of keyword search (e.g., due to stopword removal and de-compounding) and lengths from the perspective of input sequences to transformers (e.g., due to subword tokenization).

58 Specifically, Python’s split() method for strings.

Corpus                              |C|         L(C)     Le(C)
MS MARCO passage corpus             8,841,823   56.3     50
MS MARCO document corpus            3,213,835   1131.3   584
Robust04 corpus (TREC disks 4&5)    528,155     548.6    348

Table 2: Summary statistics for three corpora used by many text ranking models presented in this survey: number of documents |C|, mean document length L(C), and median document length Le(C). The MS MARCO passage corpus was also used for the TREC 2019/2020 Deep Learning Track passage ranking task and the MS MARCO document corpus was also used for the TREC 2019/2020 Deep Learning Track document ranking task.

Dataset                                    |q|       L(q)                                          |J|       |J|/q    |Rel|/q
MS MARCO passage ranking (train)           502,939   6.06                                          532,761   1.06     1.06
MS MARCO passage ranking (development)     6,980     5.92                                          7,437     1.07     1.07
MS MARCO passage ranking (test)            6,837     5.85                                          -         -        -
MS MARCO document ranking (train)          367,013   5.95                                          367,013   1.0      1.0
MS MARCO document ranking (development)    5,193     5.89                                          5,193     1.0      1.0
MS MARCO document ranking (test)           5,793     5.85                                          -         -        -
TREC 2019 DL passage                       43        5.40                                          9,260     215.4    58.2
TREC 2019 DL document                      43        5.51                                          16,258    378.1    153.4
TREC 2020 DL passage                       54        6.04                                          11,386    210.9    30.9
TREC 2020 DL document                      45        6.31                                          9,098     202.2    39.3
Robust04                                   249       2.67 (title) / 15.32 (desc.) / 40.22 (narr.)  311,410   1250.6   69.9

Table 3: Summary statistics for select queries and relevance judgments used by many text ranking models presented in this survey. For Robust04, we separately provide average lengths of the title, narrative, and description fields of the topics. Note that for the TREC 2019/2020 DL data, relevance binarization is different for passages vs. documents; here we simply count all judgments that have a non-zero grade.

[Figure 3: Histograms capturing the distribution of the lengths of texts (based on whitespace tokenization) in three commonly used corpora.]
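The corpus-level statistics in Table 2 can be reproduced (approximately) with a few lines of Python under the whitespace-tokenization convention noted above; the function below is an illustrative sketch, not the exact script used to generate the table.

```python
import statistics

def corpus_length_stats(texts):
    """Number of documents, mean length, and median length under whitespace tokenization."""
    lengths = [len(text.split()) for text in texts]
    return {
        "num_docs": len(lengths),
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
    }
```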
We describe a few test collections in more detail below:

MS MARCO passage ranking test collection. This dataset, originally released in 2016 [Nguyen et al., 2016], deserves tremendous credit for jump-starting the BERT revolution for text ranking. We’ve already recounted the story in Section 1.2.5: Nogueira and Cho [2019] combined the two critical ingredients (BERT and training data for ranking) to make a “big splash” on the MS MARCO passage ranking leaderboard. The dataset was created to allow academic researchers to explore information access in the large-data regime—in particular, to train neural network models [Craswell et al., 2021a]. Initially, it was designed to study question answering on web passages, but it was later adapted into traditional ad hoc ranking tasks. Here, we focus only on the passage ranking task [Bajaj et al., 2018]. The corpus comprises 8.8 million passage-length extracts from web pages; these passages are typical of the “answers” that many search engines today show at the top of their result pages (these are what Google calls “featured snippets”, and Bing has a similar feature). The information needs are anonymized natural language questions drawn from Bing’s query logs, where users were specifically looking for an answer; queries with navigational and other intents were discarded. Since these questions were drawn from user queries “in the wild”, they are often ambiguous, poorly formulated, and may even contain typographical and other errors. Nevertheless, these queries reflect a more “natural” distribution of information needs, compared to, for example, existing question answering datasets such as SQuAD [Rajpurkar et al., 2016].
For each query, the test collection contains, on average, one relevant passage (as assessed by human annotators). In the training set, there are a total of 532.8K (query, relevant passage) pairs over 502.9K unique queries. The development (validation) set contains 7437 pairs over 6980 unique queries. The test (evaluation) set contains 6837 queries, but relevance judgments are not publicly available; scores on the test queries can only be obtained via a submission to the official MS MARCO leaderboard.59 The official evaluation metric is MRR@10. One notable feature of this resource worth pointing out is the sparsity of judgments—there are many queries, but on average, only one relevant judgment per query. This stands in contrast to most test collections constructed by pooling, such as those from TREC evaluations. As we discussed above, these judgments are often referred to as “shallow” or “sparse”, and this design has two important consequences:

1. Model training requires both positive as well as negative examples. For this, the task organizers have prepared “triples” files comprising (query, relevant passage, non-relevant passage) triples. However, these negative examples are heuristically-induced pseudo-labels: they are drawn from BM25 results that have not been marked as relevant by human annotators. In other words, the negative examples have not been explicitly vetted by human annotators as definitely being not relevant. The absence of a positive label does not necessarily mean that the passage is non-relevant.

2. As we will see in Section 3.2, the sparsity of judgments holds important implications for the ability to properly assess the contribution of query expansion techniques. This is a known deficiency, but there may be other yet-unknown issues as well. The lack of “deep” judgments per query in part motivated the need for complementary evaluation data, which are supplied by the TREC Deep Learning Tracks (discussed below).

These flaws notwithstanding, it is difficult to exaggerate the important role that the MS MARCO dataset has played in advancing research in information retrieval and information access more broadly. Never before had such a large and realistic dataset been made available to the academic research community.60 Previously, such treasures were only available to researchers inside commercial search engine companies and other large organizations with substantial numbers of users engaged in information seeking. Today, this dataset is used by many researchers for diverse information access tasks, and it has become a common starting point for building transformer-based ranking models. Even for ranking in domains that are quite distant, for example, biomedicine (see Section 6.2), many transformer-based models are first fine-tuned with MS MARCO data before further fine-tuning on domain- and task-specific data (see Section 3.2.4). Some experiments have even shown that ranking models fine-tuned on this dataset exhibit zero-shot relevance transfer capabilities, i.e., the models are effective in domains and on tasks without having been previously exposed to in-domain or task-specific labeled data (see Section 3.5.3 and Section 6.2). In summary, the impact of the MS MARCO passage ranking test collection has been no less than transformational. The creators of the dataset (and Microsoft lawyers) deserve tremendous credit for their contributions to broadening the field.

MS MARCO document ranking test collection.
Although in reality the MS MARCO document test collection was developed in close association with the TREC 2019 Deep Learning Track [Craswell et al., 2020] (see below), and a separate MS MARCO document ranking leaderboard was established</p><p>59http://www.msmarco.org/ 60Prior to MS MARCO, a number of learning-to-rank datasets comprising features values were available to academic researchers, but they did not include actual texts.</p><p>37 only in August 2020, it makes more sense conceptually to structure the narrative in the order we present here. The MS MARCO document ranking test collection was created as a document ranking counterpart to the passage ranking test collection. The corpus, which comprises 3.2M web pages with URL, title, and body text, contains the source pages of the 8.8M passages from the passage corpus [Bajaj et al., 2018]. However, the alignment between the passages and the documents is imperfect, as the extraction was performed on web pages that were crawled at different times. For the document corpus, relevance judgments were “transferred” from the passage judgments; that is, for a query, if the source web page contained a relevant passage, then the corresponding document was considered relevant. This data preparation possibly created a systematic bias in that relevant information was artificially centered on a specific passage within the document, more so than they might occur naturally. For example, we are less likely to see a relevant document that contains short relevant segments scattered throughout the text; this has implications for evidence aggregation techniques that we discuss in Section 3.3. In total, the MS MARCO document dataset contains 367K training queries and 5193 development queries; each query has exactly one relevance judgment. There are 5793 test queries, but relevance judgments are withheld from the public. As with the MS MARCO passage ranking task, scores for the test queries can only be obtained by a submission to the leaderboard. The official evaluation metric is MRR@100. Similar comments about the sparsity of relevance judgments, made in the context of the passage dataset above, apply here as well.</p><p>TREC 2019/2020 Deep Learning Tracks. Due to the nature of TREC planning cycles, the organi- zation of the Deep Learning Track at TREC 2019 [Craswell et al., 2020] predated the advent of BERT for text ranking. Coincidentally, though, it represented the first large-scale community evaluation that provided a comparison of pre-BERT and BERT-based ranking models, attracting much attention and participation from researchers. The Deep Learning Track continued in TREC 2020 [Craswell et al., 2021b] with the same basic setup. From the methodological perspective, the track was organized to explore the impact of large amounts of training data, both on neural ranking models as well as learning-to-rank techniques, compared to “traditional” exact match techniques. Furthermore, the organizers wished to investigate the impact of different types of training labels, in particular, sparse judgments (many queries but very few relevance judgments per query) typical of data gathered in an industry setting vs. dense judgments created by pooling (few queries but many more relevance judgments per query) that represent common practice in TREC and other academic evaluations. For example, what is the effectiveness of models trained on sparse judgments when evaluated with dense judgments? 
The evaluation had both a document ranking and a passage ranking task; additionally, the organizers shared a list of results for reranking if participants did not wish to implement initial candidate generation themselves. The document corpus and the passage corpus used in the track were exactly the same as the MS MARCO document corpus and the MS MARCO passage corpus, respectively, discussed above. Despite the obvious connections, the document and passage ranking tasks were evaluated independently with separate judgment pools. Based on pooling, NIST assessors evaluated 43 queries for both the document ranking and passage ranking tasks in TREC 2019; in TREC 2020, there were 54 queries evaluated for the passage ranking task and 45 queries evaluated for the document ranking task. In all cases relevance judgments were provided on a four-point scale, although the binarization of the grades (e.g., for the purposes of computing MAP) differed between the document and passage ranking tasks; we refer readers to the track overview papers for details [Craswell et al., 2020, 2021b]. Statistics of the relevance judgments are presented in Table 3. It is likely the case that these relevance judgments alone are insufficient to effectively train neural ranking models (too few labeled examples), but they serve as a much richer test set compared to the MS MARCO datasets. Since there are many more relevant documents per query, metrics such as MAP are (more) meaningful, and since the relevance judgments are graded, metrics such as nDCG make sense. In contrast, given the sparse judgments in the original MS MARCO datasets, options for evaluation metrics are limited. In particular, evaluation of document ranking with MRR@100 is odd and rarely seen. Mackie et al. [2021] built upon the test collections from the TREC 2019 and 2020 Deep Learning Tracks to create a collection of challenging queries called “DL-HARD”. The goal of this resource was</p><p>38 to increase the difficulty of the Deep Learning Track collections using queries that are challenging for the “right reasons”. That is, queries that express complex information needs rather than queries that are, for example, factoid questions (“how old is vanessa redgrave”) or queries that would typically be answered by a different vertical (“how is the weather in jamaica”). DL-HARD combined difficult queries judged in the TREC 2019 and 2020 Deep Learning Track document and passage collections (25 from the document collection and 23 from the passage collection) with additional queries with new sparse judgments (25 for the document collection and 27 for the passage collection). The authors assessed query difficulty using a combination of automatic criteria derived from a web search engine (e.g., whether the query could be answered with a dictionary definition infobox) and manual criteria like the query’s answer type (e.g., definition, factoid, or long answer). The resource also includes entity links for the queries and annotations of search engine result type, query intent, answer type, and topic domain.</p><p>TREC 2004 Robust Track (Robust04). Although nearly two decades old, the test collection from the Robust Track at TREC 2004 [Voorhees, 2004] is widely considered one of the best “general purpose” ad hoc retrieval test collections available to academic researchers, with relevance judgments drawn from diverse pools with contributions from different techniques, including manual runs. 
It is able to fairly evaluate systems that did not participate in the original evaluation (see Section 2.6). Robust04 is large as academic test collections go in terms of the number of topics and the richness of relevance judgments, and created in a single TREC evaluation cycle. Thus, this test collection differs from the common evaluation practice where test collections from multiple years are concatenated together to create a larger resource. Merging multiple test collections in this way is possible when the underlying corpus is the same, but this approach may be ignoring subtle year-to-year differences. For example, there may be changes in track guidelines that reflect an evolving understanding of the task, which might, for example, lead to differences in how the topics are created and how documents are judged. The composition of the judgment pools (e.g., in terms of techniques that are represented) also varies from year to year, since they are constructed from participants’ systems. The TREC 2004 Robust Track used the corpus from TREC Disks 4 & 5 (minus Congressional Records),61 which includes material from the Financial Times Limited, the Foreign Broadcast Information Service, and the Los Angeles Times totaling approximately 528K documents. Due to its composition, this corpus is typically referred to as containing text from the newswire domain. The test collection contains a total of 249 topics with around 311K relevance judgments, with topics ids 301–450 and 601–700.62 Due to its age, this collection is particularly well-studied by researchers; for example, a meta-analysis by Yang et al. [2019b] identified over 100 papers that have used the collection up until early 2019.63 This resource provides the context for interpreting effectiveness results across entire families of approaches and over time. However, the downside is that the Robust04 test collection is particularly vulnerable to overfitting. Unlike most TREC test collections with only around 50 topics, researchers have had some success training ranking models using Robust04. However, for this use, there is no standard agreed-upon split, but five-fold cross validation is the most common configuration. It is often omitted in papers, but researchers typically construct the splits by taking consecutive topic ids, e.g., the first fifty topics, the next fifty topics, etc.</p><p>Additional TREC newswire test collections. Beyond Robust04, there are two more recent newswire test collections that have been developed at TREC:</p><p>• Topics and relevance judgments from the TREC 2017 Common Core Track [Allan et al., 2017], which used 1.8M articles from the New York Times Annotated Corpus.64 Note that this evaluation experimented with a pooling methodology based on bandit techniques, which was found after-the-fact to have a number of flaws [Voorhees, 2018], making it less reusable than desired. Evaluations conducted on this test collection should bear in mind this caveat.</p><p>61https://trec.nist.gov/data/cd45/index.html 62In the original evaluation, 250 topics were released, but for one topic no relevant documents were found in the collection. 
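Since there is no standard split, the following sketch illustrates the common convention described above of building five folds from consecutive Robust04 topic ids; the function name and the exact fold boundaries it produces are our own choices, not an established benchmark split.

```python
def consecutive_folds(topic_ids, num_folds=5):
    """Split sorted topic ids (e.g., the 249 Robust04 topics) into folds of consecutive ids."""
    topic_ids = sorted(topic_ids)
    fold_size, remainder = divmod(len(topic_ids), num_folds)
    folds, start = [], 0
    for i in range(num_folds):
        # Earlier folds absorb any remainder so that fold sizes differ by at most one.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(topic_ids[start:end])
        start = end
    return folds
```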
63https://github.com/lintool/robust04-analysis 64https://catalog.ldc.upenn.edu/LDC2008T19</p><p>39 • Topics and relevance judgments from the TREC 2018 Common Core Track [Allan et al., 2018], which used a corpus of 600K articles from the TREC Washington Post Corpus.65</p><p>Note that corpora for these two test collections are small by modern standards, so they may not accurately reflect search scenarios today over large amounts of texts. In addition, both test collections are not as well-studied as Robust04. As a positive, this means there is less risk of overfitting, but this also means that there are fewer effective models to compare against.</p><p>TREC web test collections. There have been many evaluations at TREC focused on searching collections of web pages. In particular, the following three are commonly used:</p><p>• Topics and relevance judgments from the Terabyte Tracks at TREC 2004–2006, which used the GOV2 corpus, a web crawl of the .gov domain comprising approximately 25.2M pages by CSIRO (Commonwealth Scientific and Industrial Research Organisation), distributed by the University of Glasgow.66 • Topics and relevance judgments from the Web Tracks at TREC 2010–2012. The evaluation used the ClueWeb09 web crawl,67 which was gathered by Carnegie Mellon University in 2009. The complete corpus contains approximately one billion web pages in 10 different languages, totaling 5 TB compressed (25 TB uncompressed). Due to the computational requirements of working with such large datasets, the organizers offered participants two conditions: retrieval over the entire English portion of the corpus (503.9M web pages), or just over a subset comprising 50.2M web pages, referred to as ClueWeb09b. For expediency, most researchers, even today, report experimental results only over the ClueWeb09b subset. • Topics and relevance judgments from the Web Tracks at TREC 2013 and TREC 2014. Typically, researchers use the ClueWeb12-B13 web crawl, which is a subset comprising 52.3M web pages taken from the full ClueWeb12 web crawl, which contains 733M web pages (5.54 TB compressed, 27.3 TB uncompressed).68 This corpus was also gathered by Carnegie Mellon University, in 2012, as an update of ClueWeb09. Unlike ClueWeb09, ClueWeb12 only contains web pages in English.</p><p>Unfortunately, there is no standard agreed-upon evaluation methodology (for example, training/test splits) for working with these test collections, and thus results reported in research papers are frequently not comparable (this issue applies to many other TREC collections as well). Additionally, unjudged documents are a concern, particularly with the ClueWeb collections, because the collection is large relative to the amount of assessment effort that was devoted to evaluating the judgment pools. Furthermore, due to the barrier of entry in working with large collections, there were fewer participating teams and less diversity in the retrieval techniques deployed in the run submissions.</p><p>We end this discussion with a caution, that as with any data for supervised machine learning, test collections can be abused and there is the ever-present danger of overfitting. When interpreting evaluation results, it is important to examine the evaluation methodology closely—particularly issues related to training/test splits and how effectiveness metrics are aggregated (e.g., if averaging is performed over topics from multiple years). 
For these reasons, results from the actual evaluation (i.e., participation in that year’s TREC) tend to be more “credible” in the eyes of many researchers than “post hoc” (after-the-fact) evaluations using the test collections, since there are more safeguards to prevent overfitting and (inadvertently) exploiting knowledge from the test set. Section 2.6 mentioned this issue in passing, but here we elaborate in more detail: Participants in a TREC evaluation only get “one shot” at the test topics, and thus the test set can be considered blind and unseen. Furthermore, TREC evaluations limit the total number of submissions that are allowed from each research group (typically three), which prevents researchers from evaluating many small model variations (e.g., differing only in tuning parameters), reporting</p><p>65https://trec.nist.gov/data/wapost/ 66http://ir.dcs.gla.ac.uk/test_collections/ 67https://lemurproject.org/clueweb09/ 68https://lemurproject.org/clueweb12/</p><p>40 only the best result, and neglecting to mention how many variants were examined. This is an example of so-called “p-hacking”; here, in essence, tuning on the test topics. More generally, it is almost never reported in papers how many different techniques the researchers had tried before obtaining a positive result. Rosenthal [1979] called this the “file drawer problem”—techniques that “don’t work” are never reported and simply stuffed away metaphorically in a file drawer. With repeated trials, of course, comes the dangers associated with overfitting, inadvertently exploiting knowledge about the test set, or simply “getting lucky”. Somewhat exaggerating, of course: if you try a thousand things, something is likely to work on a particular set of topics.69 Thus, post-hoc experimental results that show a technique beating the top submission in a TREC evaluation should be taken with a grain of salt, unless the researchers answer the question: How many attempts did it take to beat that top run? To be clear, we are not suggesting that researchers are intentionally “cheating” or engaging in any nefarious activity; quite the contrary, we believe that researchers overwhelmingly act in good faith all the time. Nevertheless, inadvertent biases inevitably creep into our methodological practices as test collections are repeatedly used. Note that leaderboards with private held-out test data70 mitigate, but do not fundamentally solve this issue. In truth, there is “leakage” any time researchers evaluate on test data—at the very least, the researchers obtain a single bit of information: Is this technique effective or not? When “hill climbing” on a metric, this single bit of information is crucial to knowing if the research is “heading in the right direction”. However, accumulated over successive trials, this is, in effect, training on the test data. One saving grace with most leaderboards, however, is that they keep track of the number of submissions by each team. For more discussion of these issues, specifically in the context of the MS MARCO leaderboards, we refer the reader to Craswell et al. [2021a]. There isn’t a perfect solution to these issues, because using a test collection once and then throwing it away is impractical. However, one common way to demonstrate the generality of a proposed innovation is to illustrate its effectiveness on multiple test collections. 
If a model is applied in a methodologically consistent manner across multiple test collections (e.g., the same parameters, or at least the same way of tuning parameters without introducing any collection-specific “tricks”), the results might be considered more credible.</p><p>2.8 Keyword Search</p><p>Although there are active explorations of alternatives (the entirety of Section 5 is devoted to this topic), most current applications of transformers for text ranking rely on keyword search in a multi-stage ranking architecture, which is the focus of Section 3 and Section 4. In this context, keyword search provides candidate generation, also called initial retrieval or first-stage retrieval. The results are then reranked by transformer-based models. Given the importance of keyword search in this context, we offer some general remarks to help the reader understand the role it plays in text ranking. By keyword search or keyword querying, we mean a large class of techniques that rely on exact term matching to compute relevance scores between queries and texts from a corpus, nearly always with an inverted index (sometimes called inverted files or inverted lists); see Zobel and Moffat [2006] for an overview. This is frequently accomplished with bag-of-words queries, which refers to the fact that evidence (i.e., the relevance score) from each query term is considered independently. A bag-of-words scoring function can be cast into the form of Equation (1) in Section 1.2, or alternatively, as the inner product between two sparse vectors (where the vocabulary forms the dimension of the vector). However, keyword search does not necessarily imply bag-of-words queries, as there is a rich body of literature in information retrieval on so-called “structured queries” that attempt to capture relationships between query terms—for example, query terms that co-occur in a window or are contiguous (i.e., n-grams) [Metzler and Croft, 2004, 2005]. Nevertheless, one popular choice for keyword search today is bag-of-words queries with BM25 scoring (see Section 1.2),71 but not all BM25 rankings are equivalent. In fact, there are many examples of putative BM25 rankings that differ quite a bit in effectiveness. One prominent example appears on the leaderboard of the MS MARCO passage ranking task: a BM25 ranking produced by the Anserini</p><p>69https://xkcd.com/882/ 70And even those based on submitting code, for example, in a Docker image. 71However, just to add to the confusion, BM25 doesn’t necessarily imply bag-of-words queries, as there are extensions of BM25 to phrase queries, for example, Wang et al. [2011]</p><p>41 system [Yang et al., 2017, 2018] scores 0.186 in terms of MRR@10, but the Microsoft BM25 baseline scores two points lower at 0.165. Non-trivial differences in “BM25 rankings” have been observed by different researchers in multiple studies [Trotman et al., 2014, Mühleisen et al., 2014, Kamphuis et al., 2020]. There are a number of reasons why different implementations of BM25 yield different rankings and achieve different levels of effectiveness. First, BM25 should be characterized as a family of related scoring functions: Beyond the original formulation by Robertson et al. [1994], many researchers have introduced variants, as studied by Trotman et al. [2014], Mühleisen et al. [2014], Kamphuis et al. [2020]. Thus, when researchers refer to BM25, it is often not clear which variant they mean. 
Second, document preprocessing—which includes document cleaning techniques, stopwords lists, tokenizers, and stemmers—all have measurable impact on effectiveness. This is particularly the case with web search, where techniques for removing HTML tags, JavaScript, and boilerplate make a big difference [Roy et al., 2018]. The additional challenge is that document cleaning includes many details that are difficult to document in a traditional publication, making replicability difficult without access to source code. See Lin et al. [2020a] for an effort to tackle this challenge via a common interchange format for index structures. Finally, BM25 (like most ranking functions) has free parameters that affect scoring behavior, and researchers often neglect to properly document these settings. All of these issues contribute to differences in “BM25”, but previous studies have generally found that the differences are not statistically significant. Nevertheless, in the context of text ranking with transformers, since the BM25 rankings are used as input for further reranking, prudent evaluation methodology dictates that researchers carefully control for these differences, for example with careful ablation studies. In addition to bag-of-words keyword search, it is also widely accepted practice in research papers to present ranking results with query expansion using pseudo-relevance feedback as an additional baseline. As discussed in Section 1.2.2, query expansion represents one main strategy for tackling the vocabulary mismatch problem, to bring representations of queries and texts from the corpus into closer alignment. Specifically, pseudo-relevance feedback is a widely studied technique that has been shown to improve retrieval effectiveness on average; this is a robust finding supported by decades of empirical evidence. Query expansion using the RM3 pseudo-relevance feedback technique [Abdul-Jaleel et al., 2004], on top of an initial ranked list of documents scored by BM25, is a popular choice (usually denoted as BM25 + RM3) [Lin, 2018, Yang et al., 2019b]. To summarize, it is common practice to compare neural ranking models against both a bag-of-words baseline and a query expansion technique. Since most neural ranking models today (all of those discussed in Section 3) act as rerankers over a list of candidates, these two baselines also serve as the standard candidate generation approaches. In this way, we are able to isolate the contributions of the neural ranking models. A related issue worth discussing is the methodologically poor practice of comparisons to low baselines. In a typical research paper, researchers might claim innovations based on beating some baseline with a novel ranking model or approach. Such claims, however, need to be carefully verified by considering the quality of the baseline, in that it is quite easy to demonstrate improvements over low or poor quality baselines. This observation was made by Armstrong et al. [2009], who conducted a meta-analysis of research papers between 1998 and 2008 from major IR research venues that reported results on a diverse range of TREC test collections. Writing over a decade ago in 2009, they concluded: “There is, in short, no evidence that ad-hoc retrieval technology has improved during the past decade or more”. The authors attributed much of the blame to the “selection of weak baselines that can create an illusion of incremental improvement” and “insufficient comparison with previous results”. On the eve of the BERT revolution, Yang et al. 
[2019b] conducted a similar meta-analysis and showed that pre-BERT neural ranking models were not any more effective than non-neural ranking techniques, at least with limited amounts of training data; but see a follow-up by Lin [2019] discussing BERT-based models. Nevertheless, the important takeaway message remains: when assessing the effectiveness of a proposed ranking model, it is necessary to also assess the quality of the comparison conditions, as it is always easy to beat a poor model. There are, of course, numerous algorithmic and engineering details to building high-performance and scalable keyword search engines. However, for the most part, readers of this survey—researchers and practitioners interested in text ranking with transformers—can treat keyword search as a “black box” using a number of open-source systems. From this perspective, keyword search is a mature technology</p><p>42 that can be treated as reliable infrastructure, or in modern “cloud terms”, as a service.72 It is safe to assume that this infrastructure can robustly deliver high query throughput at low query latency on arbitrarily large text collections; tens of milliseconds is typical, even for web-scale collections. As we’ll see in Section 3.5, the inference latency of BERT and transformer models form the performance bottleneck in current reranking architectures; candidate generation is very fast in comparison. There are many choices for keyword search. Academic IR researchers have a long history of building and sharing search systems, dating back to Cornell’s SMART system [Buckley, 1985] from the mid 1980s. Over the years, many open-source search engines have been built to aid in research, for example, to showcase new ranking models, query evaluation algorithms, or index organizations. An incomplete list, past and present, includes (in an arbitrary order) Lemur/Indri [Metzler and Croft, 2004, Metzler et al., 2004], Galago [Cartright et al., 2012], Terrier [Ounis et al., 2006, Macdonald et al., 2012], ATIRE [Trotman et al., 2012], Ivory [Lin et al., 2009], JASS [Lin and Trotman, 2015], JASSv2 [Trotman and Crane, 2019], MG4J [Boldi and Vigna, 2005], Wumpus, and Zettair.73 Today, only a few organizations—mostly commercial web search engines such as Google and Bing—deploy their own custom infrastructure for search. For most other organizations building and deploying search applications—in other words, practitioners of information retrieval—the open- source Apache Lucene search library74 has emerged as the de facto standard solution, usually via either OpenSearch,75 Elasticsearch,76 or Apache Solr,77 which are popular search platforms that use Lucene at their cores. Lucene powers search in production deployments at numerous companies, including Twitter, Bloomberg, Netflix, Comcast, Disney, Reddit, Wikipedia, and many more. Over the past few years, there has been a resurgence of interest in using Lucene for academic research [Azzopardi et al., 2017b,a], to take advantage of its broad deployment base and “production-grade” features; one example is the Anserini toolkit [Yang et al., 2017, 2018].</p><p>2.9 Notes on Parlance</p><p>We conclude this section with some discussion of terminology used throughout this survey, where we have made efforts to be consistent in usage. As search is the most prominent instance of text ranking, our parlance is unsurprisingly dominated by information retrieval. 
However, since IR has a long and rich history stretching back well over half a century, parlance has evolved over time, creating inconsistencies and confusion, even among IR researchers. These issues are compounded by conceptual overlap with neighboring sub-disciplines of computer science such as natural language processing or data mining, which sometimes use different terms to refer to the same concept or use a term in a different technical sense. To start, IR researchers tend to favor the term “document collection” or simply “collection” over “corpus” (plural: corpora), which is more commonly used by NLP researchers. We use these terms interchangeably to refer to the “thing” containing the texts to be ranked. In the academic literature (both in IR and across other sub-disciplines of computer science), the meaning of the term “document” is overloaded: In one sense, it refers to the units of texts in the raw corpus. For example, a news article from the Washington Post, a web page, a journal article, a PowerPoint presentation, an email, etc.—these would all be considered documents. However, “documents” can also refer generically to the “atomic” unit of ranking (or equivalently, the unit of retrieval). For example, if Wikipedia articles are segmented into paragraphs for the purposes of ranking, each paragraph might be referred to as a document. This may appear odd and may be a source of confusion as a researcher might continue to discuss document ranking, even though the documents to be ranked are actually paragraphs. In other cases, document ranking is explicitly distinguished from passage ranking—for example, there are techniques that retrieve documents from an inverted index (documents form the unit of retrieval), segment those documents into passages, score the passages, and then accumulate the scores to produce a document ranking, e.g., Callan [1994]. To add to the confusion, there are also examples</p><p>72Indeed, many of the major cloud vendors do offer search as a service. 73http://www.seg.rmit.edu.au/zettair/ 74https://lucene.apache.org/ 75https://opensearch.org/ 76https://github.com/elastic/elasticsearch 77https://solr.apache.org/</p><p>43 where passages form the unit of retrieval, but passage scores are aggregated to rank documents, e.g., Hearst and Plaunt [1993] and Lin [2009]. We attempt to avoid this confusion by using the term “text ranking”, leaving the form of the text underspecified and these nuances to be recovered from context. The compromise is that text ranking may sound foreign to a reader familiar with the IR literature. However, text ranking more accurately describes applications in NLP, e.g., ranking candidates in entity linking, as document ranking would sound especially odd in that context. The information retrieval community often uses “retrieval” and “ranking” interchangeably, although the latter is much more precise. They are not, technically, the same: it would be odd refer to boolean retrieval as ranking, since such operations are manipulations of unordered sets. In a sense, retrieval is more generic, as it can be applied to situations where no ranking is involved, for example, fetching values from a key–value store. However, English lacks a verb that is more precise than to retrieve, in the sense of “to produce a ranking of texts” from, say, an inverted index,78 and thus in cases where there is little chance for confusion, we continue to use the verbs “retrieve” and “rank” as synonyms. 
Next, discussions about the positions of results in a ranked list can be a source of confusion, since rank monotonically increases but lower (numbered) ranks (hopefully) represent better results. Thus, a phrase like “high ranks” is ambiguous between rank numbers that are large (e.g., a document at rank 1000) or documents that are “highly ranked” (i.e., high scores = low rank numbers = good results). The opposite ambiguity occurs with the phrase “low ranks”. To avoid confusion, we refer to texts that are at the “top” of the ranked list (i.e., high scores = low rank numbers = good results) and texts that are near the “bottom” of the ranked list or “deep” in ranked list. A note about the term “performance”: Although the meaning of performance varies across different sub-disciplines of computer science, it is generally used to refer to measures related to speed such as latency, throughput, etc. However, NLP researchers tend to use performance to refer to output quality (e.g., prediction accuracy, perplexity, BLEU score, etc.). This can be especially confusing in a paper (for example, about model compression) that also discusses performance in the speed sense, because “better performance” is ambiguous between “faster” (e.g., lower inference latency) and “better” (e.g., higher prediction accuracy). In the information retrieval literature, “effectiveness” is used to refer to output quality,79 while “efficiency” is used to refer to properties such a latency, throughput, etc.80 Thus, it is common to discuss effectiveness/efficiency tradeoffs. In this survey, our use of terminology is more closely aligned with the parlance in information retrieval—that is, we use effectiveness (as opposed to “performance”) as a catch-all term for output quality and we use efficiency in the speed sense. Finally, “reproducibility”, “replicability”, and related terms are often used in imprecise and confusing ways. In the context of this survey, we are careful to use the relevant terms in the sense defined by ACM’s Artifact Review and Badging Policy.81 Be aware that a previous version of the policy had the meaning of “reproducibility” and “replicability” swapped, which is a source of great confusion. We have found the following short descriptions to be a helpful summary of the differences:</p><p>• Repeatability: same team, same experimental setup • Reproducibility: different team, same experimental setup • Replicability: different team, different experimental setup</p><p>For example, if the authors of a paper have open-sourced the code to their experiments, and another individual (or team) is able to obtain the results reported in their paper, we can say that the results have be successfully reproduced. The definition of “same results” can be sometimes fuzzy, as it is frequently difficult to arrive at exactly the same evaluation figures (say, nDCG@10) as the original paper, especially in the context of experiments based on neural networks, due to issues such as random seed selection, the stochastic nature of the optimizer, different versions of the underlying software toolkit, and a host of other complexities. Generally, most researchers would consider a</p><p>78“To rank text from an inverted index” sounds very odd. 79Although even usage by IR researchers is inconsistent; there are still plenty of IR papers that use “performance” to refer to output quality. 80Note that, however, efficiency means something very different in the systems community or the high- performance computing community. 
81https://www.acm.org/publications/policies/artifact-review-and-badging-current</p><p>44 result to be reproducible as long as others were able to confirm the veracity of the claims at a high level, even if the experimental results do not perfectly align. If the individual (or team) was able to obtain the same results reported in a paper, but with an indepen- dent implementation, then we say that the findings are replicable. Here though, the definition of an “independent implementation” can be somewhat fuzzy. For example, if the original implementation was built using TensorFlow and the reimplementation used PyTorch, most researchers would consider it a successful replication effort. But what about two different TensorFlow implementations where there is far less potential variation? Would this be partway between reproduction and replication? The answer isn’t clear. The main point of this discussion is that while notions of reproducibility and replicability may seem straightforward, there are plenty of nuance and complexities that are often swept under the rug. For the interested reader, see Lin and Zhang [2020] for additional discussions of these issues.</p><p>Okay, with the stage set and all these terminological nuances out of the way, we’re ready to dive into transformers for text ranking!</p><p>45 3 Multi-Stage Architectures for Reranking</p><p>The simplest and most straightforward formulation of text ranking is to convert the task into a text classification problem, and then sort the texts to be ranked based on the probability that each item belongs to the desired class. For information access problems, the desired class comprises texts that are relevant to the user’s information need (see Section 2.2), and so we can refer to this approach as relevance classification. More precisely, the approach involves training a classifier to estimate the probability that each text belongs to the “relevant” class, and then at ranking (i.e., inference) time sort the texts by those estimates.82 This approach represents a direct realization of the Probability Ranking Principle, which states that documents should be ranked in decreasing order of the estimated probability of relevance with respect to the information need, first formulated by Robertson [1977]. Attempts to build computational models that directly perform ranking using supervised machine-learning techniques date back to the late 1980s [Fuhr, 1989]; see also Gey [1994]. Both these papers describe formulations and adopt terminological conventions that would be familiar to readers today. The first application of BERT to text ranking, by Nogueira and Cho [2019], used BERT in exactly this manner. However, before describing this relevance classification approach in detail, we begin the section with a high-level overview of BERT (Section 3.1). Our exposition is not meant to be a tutorial: rather, our aim is to highlight the aspects of the model that are important for explaining its applications to text ranking. Devlin et al. [2019] had already shown BERT to be effective for text classification tasks, and the adaptation by Nogueira and Cho—known as monoBERT—has proven to be a simple, robust, effective, and widely replicated model for text ranking. It serves as the starting point for text ranking with transformers and provides a good baseline for subsequent ranking models. 
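To make the relevance classification formulation concrete, the following minimal sketch (ours, not from the original papers) shows how an ordered list falls out of per-candidate probability estimates; the toy scoring function is only a placeholder for any trained relevance classifier, such as the monoBERT model described below.

```python
from typing import Callable, List, Tuple

def rank_by_relevance(query: str,
                      candidates: List[str],
                      prob_relevant: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Rank candidate texts by a classifier's estimate of P(Relevant = 1 | text, query)."""
    scored = [(text, prob_relevant(query, text)) for text in candidates]
    # Probability Ranking Principle: order by decreasing estimated probability of relevance.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in for a trained relevance classifier (e.g., monoBERT, introduced below):
# here, simply the fraction of candidate terms that overlap with the query.
def toy_classifier(q: str, d: str) -> float:
    overlap = set(q.lower().split()) & set(d.lower().split())
    return len(overlap) / (len(d.split()) + 1)

ranking = rank_by_relevance("black bear attacks",
                            ["bears are omnivores",
                             "reported black bear attacks on humans"],
                            toy_classifier)
print(ranking)
```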
The progression of our presentation takes the following course:

• We present a detailed study of monoBERT, starting with the basic relevance classification design proposed by Nogueira and Cho [2019] (Section 3.2.1). Then:
– A series of contrastive and ablation experiments demonstrate monoBERT's effectiveness under different conditions, including the replacement of BERT with simple model variants (Section 3.2.2). This is followed by a discussion of a large body of research that investigates how BERT works (Section 3.2.3).
– The basic "recipe" of applying BERT (and other pretrained transformers) to perform a downstream task is to start with a pretrained model and then fine-tune it further using labeled data from the target task. This process, however, is much more nuanced: Section 3.2.4 discusses many of these techniques, which are broadly applicable to transformer-based models for a wide variety of tasks.
• The description of monoBERT introduces a key limitation of BERT for text ranking: its inability to handle long input sequences, and hence difficulty in ranking texts whose lengths exceed the designed model input (e.g., "full-length" documents such as news articles, scientific papers, and web pages). Researchers have devised multiple solutions to overcome this challenge, which are presented in Section 3.3. Three of these approaches—Birch [Akkalyoncu Yilmaz et al., 2019b], BERT–MaxP [Dai and Callan, 2019b], and CEDR [MacAvaney et al., 2019a]—are roughly contemporaneous and represent the "first wave" of transformer-based neural ranking models designed to handle longer texts.
• After presenting a number of BERT-based ranking models, we turn our attention to discuss the architectural context in which these models are deployed. A simple retrieve-and-rerank approach can be elaborated into a multi-stage ranking architecture with reranker pipelines, which Section 3.4 covers in detail.
• Finally, we describe a number of efforts that attempt to go beyond BERT, to build ranking models that are faster (i.e., achieve lower inference latency), are better (i.e., obtain higher ranking effectiveness), or realize an interesting tradeoff between effectiveness and efficiency (Section 3.5). We cover ranking models that exploit knowledge distillation to train more compact student models and other transformer architectures, including ground-up redesign efforts and adaptations of pretrained sequence-to-sequence models.

82Note that treating relevance as a binary property is already an over-simplification. Modeling relevance on an ordinal scale (e.g., as nDCG does) represents an improvement, but whether a piece of text satisfies an information need requires considerations from many facets; see discussion in Section 2.2.

Figure 4: The architecture of BERT. Input vectors comprise the element-wise summation of token embeddings, segment embeddings, and position embeddings. The output of BERT is a contextual embedding for each input token. The contextual embedding of the [CLS] token is typically taken as an aggregate representation of the entire sequence for classification-based downstream tasks.

By concluding this section with efforts that attempt to go "beyond BERT", we set up a natural transition to ranking based on learned dense representations, which is the focus of Section 5.

3.1 A High-Level Overview of BERT

At its core, BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2019] is a neural network model for generating contextual embeddings for input sequences in English, with a multilingual variant (often called "mBERT") that can process input in over 100 different languages. Here we focus only on the monolingual English model, but mBERT has been extensively studied as well [Wu and Dredze, 2019, Pires et al., 2019, Artetxe et al., 2020]. BERT takes as input a sequence of tokens (more specifically, input vector representations derived from those tokens, more details below) and outputs a sequence of contextual embeddings, which provide context-dependent representations of the input tokens.83 This stands in contrast to context-independent (i.e., static) representations, which include many of the widely adopted techniques that came before such as word2vec [Mikolov et al., 2013a] or GloVe [Pennington et al., 2014]. The input–output behavior of BERT is illustrated in Figure 4, where the input vector representations are denoted as:

[E_{[CLS]}, E_1, E_2, \ldots, E_{[SEP]}], \quad (9)

and the output contextual embeddings are denoted as:

[T_{[CLS]}, T_1, T_2, \ldots, T_{[SEP]}], \quad (10)

after passing through a number of transformer encoder layers. In addition to the text to be processed, input to BERT typically includes two special tokens, [CLS] and [SEP], which we explain below. BERT can be seen as a more sophisticated model with the same aims as ELMo [Peters et al., 2018], from which BERT draws many important ideas: the goal of contextual embeddings is to capture complex characteristics of language (e.g., syntax and semantics) as well as how meanings vary across linguistic contexts (e.g., polysemy). The major difference is that BERT takes advantage of transformers, as opposed to ELMo's use of LSTMs. BERT can be viewed as the "encoder half" of the full transformer architecture proposed by Vaswani et al. [2017], which was designed for sequence-to-sequence tasks (i.e., where both the input and output are sequences of tokens) such as machine translation.

83The literature alternately refers to "contextual embeddings" or "contextualized embeddings". We adopt the former in this survey.

BERT is also distinguished from GPT [Radford et al., 2018], another model from which it traces intellectual ancestry. If BERT can be viewed as an encoder-only transformer, GPT is the opposite: it represents a decoder-only transformer [Liu et al., 2018a], or the "decoder half" of a full sequence-to-sequence transformer model. GPT is pretrained to predict the next word in a sequence based on its past history; in contrast, BERT uses a different objective, which leads to an important distinction discussed below.
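As a concrete illustration of the input–output behavior just described (cf. Equations 9 and 10), the sketch below uses the Hugging Face Transformers library (discussed later in this section) to obtain one contextual embedding per input token; the choice of the bert-base-uncased checkpoint and the example sentence are ours, and any BERT checkpoint would behave analogously.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # checkpoint choice is illustrative
model = AutoModel.from_pretrained("bert-base-uncased")

# Input: [CLS] + tokens + [SEP], mapped to input vector representations (Equation 9).
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Output: one contextual embedding per input token (Equation 10).
contextual_embeddings = outputs.last_hidden_state  # shape: [1, sequence_length, hidden_size]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
print(contextual_embeddings.shape)
```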
BERT and GPT are often grouped together (along with a host of other models) and referred to collectively as pretrained language models, although this characterization is somewhat misleading because, strictly speaking, a language model in NLP provides a probability distribution over arbitrary sequences of text tokens; see, for example, Chen and Goodman [1996]. In truth, coaxing such probabilities out of BERT requires a bit of effort [Salazar et al., 2020], and transformers in general can do much more than "traditional" language models! The significant advance that GPT and BERT represent over the original transformer formulation [Vaswani et al., 2017] is the use of self supervision in pretraining, whereas in contrast, Vaswani et al. began with random initialization of model weights and proceeded to directly train on labeled data, i.e., (input sequence, output sequence) pairs, in a supervised manner. This is an important distinction, as the insight of pretraining based on self supervision is arguably the biggest game changer in improving model output quality on a multitude of language processing tasks. The beauty of self supervision is two-fold:

• Model optimization is no longer bound by the chains of labeled data. Self supervision means that the texts provide their own "labels" (in GPT, the "label" for a sequence of tokens is the next token that appears in the sequence), and that loss can be computed from the sequence itself (without needing any other external annotations). Since labeled data derive ultimately from human effort, removing the need for labels greatly expands the amount of data that can be fed to models for pretraining. Often, computing power and available data instead become the bottleneck [Kaplan et al., 2020].
• Models optimized based on one or more self-supervised objectives, without reference to any specific task, provide good starting points for further fine-tuning with task-specific labeled data. This led to the "first pretrain, then fine-tune" recipe of working with BERT and related models, as introduced in Section 1. The details of this fine-tuning process are task specific, but experiments have shown that a modest amount of labeled data is sufficient to achieve a high level of effectiveness. Thus, the same pretrained model can serve as the starting point for performing multiple downstream tasks after appropriate fine-tuning.84

In terms of combining the two crucial ingredients of transformers and self supervision, GPT predated BERT. However, they operationalize the insight in different ways. GPT uses a traditional language modeling objective: given a corpus of tokens U = {u_1, u_2, \ldots, u_n}, the objective is to maximize the following likelihood:

L(U) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta), \quad (11)

where k is the context window size and the conditional probability is modeled by a transformer with parameters \Theta. In contrast, BERT introduced the so-called "masked language model" (MLM) pretraining objective, which is inspired by the Cloze task [Taylor, 1953], dating from over half a century ago.
MLM is a fancy name for a fairly simple idea, not much different from peek-a-boo games that adults play with infants and toddlers: during pretraining, we randomly "cover up" (more formally, "mask") a token from the input sequence and ask the model to "guess" (i.e., predict) it, training with cross entropy loss.85 The MLM objective explains the "B" in BERT, which stands for bidirectional: the model is able to use both a masked token's left and right contexts (preceding and succeeding contexts) to make predictions. In contrast, since GPT uses a language modeling objective, it is only able to use preceding tokens (i.e., the left context in a language written from left to right; formally, this is called "autoregressive"). Empirically, bidirectional modeling turns out to make a big difference—as demonstrated, for example, by higher effectiveness on the popular GLUE benchmark.

84With adaptors [Houlsby et al., 2019], it is possible to greatly reduce the number of parameters required to fine-tune the same "base" transformer for many different tasks.
85The actual procedure is a bit more complicated, but we refer the reader to the original paper for details.

Figure 5: Illustration of how BERT is used for different NLP tasks: (a) Single-Input Classification, (b) Two-Input Classification, (c) Single-Input Token Labeling, (d) Two-Input Token Labeling. The inputs are typically, but not always, sentences.

While the MLM objective was an invention of BERT, the idea of pretraining has a long history.
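Before turning to that history, the masked language model idea can be seen in action with a few lines of code. The sketch below uses the Hugging Face fill-mask pipeline with a pretrained BERT checkpoint (our choice; any BERT MLM checkpoint would do) to predict a masked token from both its left and right contexts; the exact predictions and scores will vary by checkpoint.

```python
from transformers import pipeline

# Mask a token and ask a pretrained BERT to "guess" it: the essence of the MLM objective.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left context ("The cat") and the right context ("on the mat")
# to fill in the blank, in contrast to a left-to-right (autoregressive) language model.
for prediction in unmasker("The cat [MASK] on the mat."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```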
ULMFiT (Universal Language Model Fine-tuning) [Howard and Ruder, 2018] likely deserves the credit for popularizing the idea of pretraining using language modeling objectives and then fine-tuning on task-specific data—the same procedure that has become universal today—but the application of pretraining in NLP can be attributed to Dai and Le [2015]. Tracing the intellectual origins of this idea even back further, the original inspiration comes from the <a href="/tags/Computer_vision/" rel="tag">computer vision</a> community, dating back at least a decade [Erhan et al., 2009]. Input sequences to BERT are usually tokenized with the WordPiece tokenizer [Wu et al., 2016], although BPE [Sennrich et al., 2016] is a common alternative, used in GPT as well as RoBERTa [Liu et al., 2019c]. These tokenizers have the aim of reducing the vocabulary space by splitting words into “subwords”, usually in an unsupervised manner. For example, with the WordPiece vocabulary used by BERT,86 “scrolling” becomes “scroll” + “##ing”. The convention of prepending two hashes (##) to a subword indicates that it is “connected” to the previous subword (i.e., in a language usually written with spaces, there is no space between the current subword and the previous one). For the most part, any correspondence between “wordpieces” and linguistically meaningful units should be considered accidental. For example, “walking” and “talking” are not split into subwords, and “biking” is split into “bi” + “##king”, which obviously do not correspond to morphemes. Even more extreme examples are “biostatistics” (“bio” + “##sta” + “##tist” + “##ics”) and “adversarial” (“ad”, “##vers”, “##aria”, “##l”). Nevertheless, the main advantage of WordPiece tokenization (and related methods) is that a relatively small vocabulary (e.g., 30,000 wordpieces) is sufficient to model large, naturally-occurring corpora that may have millions of unique tokens (based on a simple method like tokenization by whitespace).</p><p>86Specifically, bert-base-cased.</p><p>49 While BERT at its core converts a sequence of input embeddings into a sequence of corresponding contextual embeddings, in practice it is primarily applied to four types of tasks (see Figure 5):</p><p>• Single-input classification tasks, for example, sentiment analysis on a single segment of text. BERT can also be used for regression, but we have decided to focus on classification to be consistent with the terminology used in the original paper. • Two-input classification tasks, for example, detecting if two sentences are paraphrases. In principle, regression is possible here also. • Single-input token labeling tasks, for example, named-entity recognition. For these tasks, each token in the input is assigned a label, as opposed to single-input classification, where the label is assigned to the entire sequence. • Two-input token labeling tasks, e.g., question answering (or more precisely, machine reading comprehension), formulated as the task of labeling the begin and end positions of the answer span in a candidate text (typically, the second input) given a question (typically, the first input).</p><p>The first token of every input sequence to BERT is a special token called [CLS]; the final representa- tion of this special token is typically used for classification tasks. The [CLS] token is followed by the input or inputs: these are typically, but not always, sentences—indeed, as we shall see later, the inputs comprise candidate texts to be ranked, which are usually longer than individual sentences. 
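The behavior of WordPiece tokenization, including subword splits like the examples above and the special tokens discussed in the surrounding text, can be inspected directly. The sketch below assumes the Hugging Face Transformers library and the bert-base-cased vocabulary (the one referenced in this section); exact splits may differ for other vocabularies, and the query–passage pair is our own illustrative example.

```python
from transformers import AutoTokenizer

# WordPiece vocabulary used by bert-base-cased; splits are vocabulary-specific.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

for word in ["walking", "scrolling", "biking", "biostatistics", "adversarial"]:
    print(f"{word:>14} -> {tokenizer.tokenize(word)}")

# Packing two inputs into one sequence adds the [CLS] and [SEP] special tokens.
encoded = tokenizer("how old is vanessa redgrave",
                    "Vanessa Redgrave was born on 30 January 1937.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```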
For tasks involving a single input, another special delimiter token [SEP] is appended to the end of the input sequence. For tasks involving two inputs, both are packed together into a single contiguous sequence of tokens separated by the [SEP] token, with another [SEP] token appended to the end. For token labeling tasks over single inputs (e.g., named-entity recognition), the contextual embedding of the first subword is typically used to predict the correct label that should be assigned to the token (e.g., in a standard BIO tagging scheme). Question answering or machine reading comprehension (more generically, token labeling tasks involving two inputs) is treated in a conceptually similar manner, where the model attempts to label the beginning and end positions of the answer span.

To help the model understand the relationship between different segments of text (in the two-input case), BERT is also pretrained with a "next sentence prediction" (NSP) task, where the model learns segment embeddings, a kind of indicator used to differentiate the two inputs. During pretraining, after choosing a sentence from the corpus (segment A), half of the time the actual next sentence from the corpus is selected for inclusion in the training instance (as segment B), while the other half of the time a random sentence from the corpus is chosen instead. The NSP task is to predict whether the second sentence indeed follows the first. Devlin et al. [2019] hypothesized that NSP pretraining is important for downstream tasks, especially those that take two inputs. However, subsequent work by Liu et al. [2019c] questioned the necessity of NSP; in fact, on a wide range of NLP tasks, they observed no effectiveness degradation in models that lacked such pretraining.

Pulling everything together, the input representation to BERT for each token comprises three components, shown at the bottom of Figure 4:

• the learned token embedding of the token from the WordPiece tokenizer [Wu et al., 2016] (i.e., lookup from a dictionary);
• the segment embedding, which is a learned embedding indicating whether the token belongs to the first input (A) or the second input (B) in tasks involving two inputs (denoted EA and EB in Figure 4);
• the position embedding, which is a learned embedding capturing the position of the token in a sequence, allowing BERT to reason about the linear sequence of tokens (see Section 3.2 for more details).

The final input representation to BERT for each token comprises the element-wise summation of its token embedding, segment embedding, and position embedding. It is worth emphasizing that the three embedding components are summed, not assembled via vector concatenation (this is a frequent point of confusion). The representations comprising the input sequence to BERT are passed through a stack of transformer encoder layers to produce the output contextual embeddings. The number of layers, the hidden dimension size, and the number of attention heads are hyperparameters in the model architecture.

Size     Layers   Hidden Size   Attention Heads   Parameters
Tiny     2        128           2                 4M
Mini     4        256           4                 11M
Small    4        512           4                 29M
Medium   8        512           8                 42M
Base     12       768           12                110M
Large    24       1024          16                340M

Table 4: The hyperparameter settings of various pretrained BERT configurations. Devlin et al. [2019] presented BERTBase and BERTLarge, the two most commonly used configurations today; other model sizes by Turc et al.
[2019] support explorations in effectiveness/efficiency tradeoffs.</p><p>However, there are a number of “standard configurations”. While the original paper [Devlin et al., 2019] presented only the BERTBase and BERTLarge configurations, with 12 and 24 transformer encoder layers, respectively, in later work Turc et al. [2019] pretrained a greater variety of model sizes with the help of knowledge distillation; these are all shown in Table 4. In general, size correlates with effectiveness in downstream tasks, and thus these configurations are useful for exploring effectiveness/efficiency tradeoffs (more in Section 3.5.1). We conclude our high-level discussion of BERT by noting that its popularity is in no small part due to wise decisions by the authors (and approval by Google) to not only open source the model implementation, but also publicly release pretrained models (which are quite computationally ex- pensive to pretrain from scratch). This led to rapid reproduction and replication of the impressive results reported in the original paper and provided the community with a reference implementation to build on. Today, the Transformers library87 by Hugging Face [Wolf et al., 2020] has emerged as the de facto standard implementation of BERT as well as many transformer models, supporting both PyTorch [Paszke et al., 2019] and TensorFlow [Abadi et al., 2016], the two most popular deep learning libraries today. While open source (sharing code) and open science (sharing data and models) have become the norms in recent years, as noted by Lin [2019], the decision to share BERT wasn’t necessarily a given. For example, Google could have elected not to share the source code or the pretrained models. There are many examples of previous Google innovation that were shared in academic papers only, without a corresponding open-source code release; MapReduce [Dean and Ghemawat, 2004] and the Google File System [Ghemawat et al., 2003] are two examples that immediately come to mind, although admittedly there are a number of complex considerations that factor into the binary decision to release code or not. In cases where descriptions of innovations in papers were not accompanied by source code, the broader community has needed to build its own open-source implementations from scratch (Hadoop in the case of MapReduce and the Google File System). This has generally impeded overall progress in the field because it required the community to rediscover many “tricks” and details from scratch that may not have been clear or included in the original paper. The community is fortunate that things turned out the way they did, and Google should be given credit for its openness. Ultimately, this led to an explosion of innovation in nearly all aspect of natural language processing, including applications to text ranking.</p><p>3.2 Simple Relevance Classification: monoBERT</p><p>The task of relevance classification is to estimate a score si quantifying how relevant a candidate text di is to a query q, which we denote as:</p><p>P (Relevant = 1|di, q). (12) Before describing the details of how BERT is adapted for this task, let us first address the obvious question of where the candidate texts come from: Applying inference to every text in a corpus for every user query is (obviously) impractical from the computational perspective, not only due to costly neural network inference but also the linear growth of query latency with respect to corpus size. 
While such a brute-force approach can be viable for small corpora, it quickly runs into scalability challenges. It is clearly impractical to apply BERT inference to, say, a million texts for every query.88

87https://github.com/huggingface/transformers
88Even if you're Google!

Figure 6: A retrieve-and-rerank architecture, which is the simplest instantiation of a multi-stage ranking architecture. In the candidate generation stage (also called initial retrieval or first-stage retrieval), candidate texts are retrieved from the corpus, typically with bag-of-words queries against inverted indexes. These candidates are then reranked with a transformer-based model such as monoBERT.

Although architectural alternatives are being actively explored by many researchers (the topic of Section 5), most applications of BERT for text ranking today adopt a retrieve-and-rerank approach, which is shown in Figure 6. This represents the simplest instance of a multi-stage ranking architecture, which we detail in Section 3.4. In most designs today, candidate texts are identified from the corpus using keyword search, usually with bag-of-words queries against inverted indexes (see Section 2.8). This retrieval stage is called candidate generation, initial retrieval, or first-stage retrieval, the output of which is a ranked list of texts, typically ordered by a scoring function based on exact term matches such as BM25 (see Section 1.2). This retrieve-and-rerank approach dates back to at least the 1960s [Simmons, 1965] and this architecture is mature and widely adopted (see Section 3.4).

BERT inference is then applied to rerank these candidates to generate a score s_i for each text d_i in the candidates list. The BERT-derived scores may or may not be further combined or aggregated with other relevance signals to arrive at the final scores used for reranking. Nogueira and Cho [2019] used the BERT scores directly to rerank the candidates, thus treating the candidate texts as sets, but other approaches take advantage of, for example, the BM25 scores from the initial retrieval (more details later). Naturally, we expect that the ranking induced by these final scores has higher quality than the ranking from the initial retrieval stage (for example, as measured by the metrics discussed in Section 2.5). Thus, many applications of BERT to text ranking today (including everything we present in this section) are actually performing reranking. However, for expository clarity, we continue to refer to text ranking unless the distinction between ranking and reranking is important (see additional discussion in Section 2.2).

This two-stage retrieve-and-rerank design also explains the major difference between Nogueira and Cho [2019] and the classification tasks described in the original BERT paper. Devlin et al. [2019] only tackled text classification tasks that involve comparisons of two input texts (e.g., paraphrase detection), as opposed to text ranking, which requires multiple inferences. Nogueira and Cho's original paper never gave their model a name, but Nogueira et al. [2019a] later called the model "monoBERT" to establish a contrast with another model they proposed called "duoBERT" (described in Section 3.4.1).
Thus, throughout this survey we refer to this basic model as monoBERT.

3.2.1 Basic Design of monoBERT

The complete monoBERT ranking model is shown in Figure 7. For the relevance classification task, the model takes as input a sequence comprised of the following:

[[CLS], q, [SEP], d_i, [SEP]], \quad (13)

where q comprises the query tokens and d_i comprises tokens from the candidate text to be scored. This is the same input sequence configuration as in Figure 5(b) for classification tasks involving two inputs. Note that the query tokens are taken verbatim from the user (or from a test collection); this detail will become important when we discuss the effects of feeding BERT different representations of the information need (e.g., "title" vs. "description" fields in TREC topics) in Section 3.3. Additionally, the segment A embedding is added to query tokens and the segment B embedding is added to the candidate text (see Section 3.1). The special tokens [CLS] and [SEP] are exactly those defined by BERT.

Figure 7: The monoBERT ranking model adapts BERT for relevance classification by taking as input the query and a candidate text to be scored (surrounded by appropriate special tokens). The input vector representations comprise the element-wise summation of token embeddings, segment embeddings, and position embeddings. The output of the BERT model is a contextual embedding for each input token. The final representation of the [CLS] token is fed to a fully-connected layer that produces the relevance score s of the text with respect to the query.

The final contextual representation of the [CLS] token is then used as input to a fully-connected layer that generates the document score s (more details below). Collectively, this configuration of the input sequence is sometimes called the "input template" and each component has (a greater or lesser) impact on effectiveness; we empirically examine variations in Section 3.2.2. This general style of organizing task inputs (query and candidate texts) into an input template to feed to a transformer for inference is called a "cross-encoder". This terminology becomes particularly relevant in Section 5, when it is contrasted with a "bi-encoder" design where inference is performed on queries and texts from the corpus independently.

Since BERT was pretrained with sequences of tokens that have a maximum length of 512, tokens in an input sequence that is longer will not have a corresponding position embedding, and thus cannot be meaningfully fed to the model. Without position embeddings, BERT has no way to model the linear order and relative positions between tokens, and thus the model will essentially treat input tokens as a bag of words. In the datasets that Nogueira and Cho explored, this limitation was not an issue because the queries and candidate texts were shorter than the maximum length (see Figure 3 in Section 2.7). However, in the general case, the maximum sequence length of 512 tokens presents a challenge to using BERT for ranking longer texts. We set aside this issue for now and return to discuss solutions in Section 3.3, noting, however, that the simplest solution is to truncate the input.
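In code, constructing this input template amounts to a single tokenizer call. The sketch below is our own illustration with a made-up query and passage, using the Hugging Face Transformers tokenizer and an arbitrary BERT checkpoint; it shows the [CLS]/[SEP] layout and the segment A/B indicators, exposed as token_type_ids.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # checkpoint choice is illustrative

query = "what causes hurricanes"
passage = ("Hurricanes form over warm ocean waters when moist air rises, "
           "creating an area of low pressure near the surface.")

# monoBERT-style input template: [CLS] query [SEP] candidate text [SEP],
# truncating the candidate text so the sequence fits within BERT's 512-token limit.
encoded = tokenizer(query, passage,
                    truncation="only_second", max_length=512,
                    return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
# Segment indicators: 0 for the query (segment A), 1 for the candidate text (segment B).
print(encoded["token_type_ids"][0].tolist())
```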
Since transformers exhibit quadratic complexity in both time and space with respect to the input length, it is common practice in production deployments to truncate the input sequence to a length that is shorter than the maximum length to manage latency. This might be a practical choice independent of BERT’s input length limitations. An important detail to note here is that the length limitation of BERT is measured in terms of the WordPiece tokenizer [Wu et al., 2016]. Because many words are split into subwords, the number of actual WordPiece tokens is always larger than the output of a simple tokenization method such as splitting on whitespace. The practical consequence of this is that analyses of document lengths based on whitespace tokenization such as Figure 3 in Section 2.7, or tokenization used by standard search</p><p>53 engines that include stopword removal, can only serve as a rough guide of whether a piece of text will “fit into” BERT. The sequence of input tokens constructed from the query and a candidate text is then passed to BERT, which produces a contextual vector representation for each token (exactly as the model was designed to do). In monoBERT, the contextual representation of the [CLS] token (TCLS) as input to a single-layer, fully-connected neural network to obtain a probability si that the candidate di is relevant to q. The contextual representations of the other tokens are not used by monoBERT, but later we will discuss models that do take advantage of those representations. More formally:</p><p>∆ P (Relevant = 1|di, q) = si = softmax(T[CLS]W + b)1, (14) D D×2 2 where T[CLS] ∈ R , D is the model embedding dimension, W ∈ R is a weight matrix, b ∈ R is a bias term, and softmax(·)i denotes the i-th element of the softmax output. Since the last dimension of the matrix W is two, the softmax output has two dimensions (that is, the single-layer neural network has two output neurons), one for each class, i.e., “relevant” and “non-relevant”. BERT and the classification layer together comprise the monoBERT model. Following standard practices, the entire model is trained end-to-end for the relevance classification task using cross- entropy loss: X X L = − log(sj) − log(1 − sj), (15)</p><p> j∈Jpos j∈Jneg where Jpos is the set of indexes of the relevant candidates and Jneg is the set of indexes of the non-relevant candidates, which is typically part of the training data. Since the loss function takes into account only one candidate text at a time, this can be characterized as belonging to the family of pointwise learning-to-rank methods [Liu, 2009, Li, 2011]. We refer the interested reader to the original paper by Nogueira and Cho for additional details, including hyperparameter settings. To be clear, “training” monoBERT starts with a pretrained BERT model, which can be downloaded from a number of sources such as the Hugging Face Transformers library [Wolf et al., 2020]. This is often referred to as a “model checkpoint”, which encodes a specific set of model parameters that capture the results of pretraining. From this initialization, the model is then fine-tuned with task-specific labeled data, in our case, queries and relevance judgments. This “recipe” has emerged as the standard approach of applying BERT to perform a wide range of tasks, and ranking is no exception. 
In the reminder of this survey, we take care to be as precise as possible, distinguishing pretraining from fine-tuning;89 Section 3.2.4 introduces additional wrinkles such as “further pretraining” and “pre–fine-tuning”. However, we continue to use “training” (in a generic sense) when none of these terms seem particularly apt.90 Before presenting results, it is worthwhile to explicitly point out two deficiencies of this approach to monoBERT training:</p><p>• The training loss makes no reference to the metric that is used to evaluate the final ranking (e.g., MAP), since each training example is considered in isolation; this is the case with all pointwise approaches. Thus, optimizing cross-entropy for classification may not necessarily improve an end-to-end metric such as mean average precision; in the context of ranking, this was first observed by Morgan et al. [2004], who called this phenomenon “metric divergence”. In practice, though, more accurate relevance classification generally leads to improvements as measured by ranking metrics, and ranking metrics are often correlated with each other, e.g., improving MRR tends to improve MAP and vice versa. • Texts that BERT sees at inference (reranking) time are different from examples fed to it during training. During training, examples are taken directly from labeled examples, usually as part of an information retrieval test collection. In contrast, at inference time, monoBERT sees candidates ranked by BM25 (for example), which may or may not correspond to how the training examples were selected to begin with, and in some cases, we have no way of knowing since this detail may not have been disclosed by the creators of the test collection. Typically,</p><p>89And indeed, according to a totally scientific poll, this is what the interwebs suggest: https://twitter.com/ lintool/status/1375064796912087044. 90For example, it seems odd to use “fine-tuning” when referring to a model that uses a pretrained BERT as a component, e.g., “to fine-tune a CEDR model” (see Section 3.3.3).</p><p>54 MS MARCO Passage Development Test Method MRR@10 Recall@1k MRR@10 (1) IRNet (best pre-BERT) 0.278 - 0.281 (2a) BM25 (Microsoft Baseline, k = 1000) 0.167 - 0.165 (2b) + monoBERTLarge [Nogueira and Cho, 2019] 0.365 - 0.359 (2c) + monoBERTBase [Nogueira and Cho, 2019] 0.347 - - (3a) BM25 (Anserini, k = 1000) 0.187 0.857 0.190 (3b) + monoBERTLarge [Nogueira et al., 2019a] 0.372 0.857 0.365 (4a) BM25 + RM3 (Anserini, k = 1000) 0.156 0.861 - (4b) + monoBERTLarge 0.374 0.861 - Table 5: The effectiveness of monoBERT on the MS MARCO passage ranking test collection.</p><p> during training, monoBERT is exposed to fewer candidates per query than at inference time, and thus the model may not accurately learn an accurate distribution of first-stage retrieval scores across a pool of candidates varying in quality. Furthermore, the model usually does not see a realistic distribution of positive and negative examples. In some datasets, for example, positive and negative examples are balanced (i.e., equal numbers), so monoBERT is unable to accurately estimate the prevalence of relevant texts (i.e., build a prior) in BM25-scored texts; typically, far less than half of the texts from first-stage retrieval are relevant.</p><p>Interestingly, even without explicitly addressing these two issues, the simple training process described above yields a relevance classifier that works well as a ranking model in practice.91</p><p>Results. 
The original paper by Nogueira and Cho [2019] evaluated monoBERT on two datasets: the MS MARCO passage ranking test collection and the dataset from the Complex Answer Retrieval (CAR) Track at TREC 2017. We focus here on results from MS MARCO, the more popular of the two datasets, shown in Table 5. In addition to MRR@10, which is the official metric, we also report recall at cutoff 1000, which helps to quantify the upper bound effectiveness of the retrieve-and-rerank strategy. That is, if first-stage retrieval fails to return relevant passages, the reranker cannot conjure relevant results out of thin air. Since we do not have access to relevance judgments for the test set, it is only possible to compute recall for the development set. The original monoBERT results, copied from Nogueira and Cho [2019] as row (2b) in Table 5, was based on reranking baseline BM25 results provided by Microsoft, row (2a), with BERTLarge. This is the result that in January 2019 kicked off the “BERT craze” for text ranking, as we’ve already discussed in Section 1.2. The effectiveness of IRNet in row (1), the best system right before the introduction of monoBERT, is also copied from Table 1. The effectiveness of ranking with BERTBase is shown in row (2c), also copied from the original paper. We see that, as expected, a larger model yields higher effectiveness. Nogueira and Cho [2019] did not compute recall, and so the figures are not available for the conditions in rows (2a)–(2c). Not all BM25 implementations are the same, as discussed in Section 2.8. The baseline BM25 results from Anserini (at k = 1000), row (3a), is nearly two points higher in terms of MRR@10 than the results provided by Microsoft’s BM25 baseline, row (2a). Reranking Anserini results using monoBERT is shown in row (3b), taken from Nogueira et al. [2019a], a follow-up paper; note that reranking does not change recall. We see that improvements to first-stage retrieval do translate into more effective reranked results, but the magnitude of the improvement is not as large as the difference between Microsoft’s BM25 and Anserini’s BM25. The combination of Anserini BM25 + monoBERTLarge, row (3b), provides a solid baseline for comparing BERT-based reranking models. These results can be reproduced with PyGaggle,92 which provides the current reference implementation of monoBERT recommended by the model’s authors.</p><p>91Many feature-based learning-to-rank techniques [Liu, 2009, Li, 2011] are also quite effective without explicitly addressing these issues, and so this behavior of BERT is perhaps not surprising. 92http://pygaggle.ai/</p><p>55 monoBERTBase Effectiveness vs. Training Data Size on MS MARCO Passage 0.4</p><p>0.35</p><p>0.3</p><p>0.25</p><p>MRR@10 0.2</p><p>0.15</p><p>0.1 BERT-BASE BM25</p><p>1 2.5 10 530 # relevant query–passage training instances (thousands)</p><p>Figure 8: The effectiveness of monoBERTBase on the development set of the MS MARCO passage ranking test collection varying the amount of training data used to fine-tune the model and reranking k = 1000 candidate texts provided by first-stage retrieval using BM25. Results report means and 95% confidence intervals over five trials.</p><p>3.2.2 Exploring monoBERT To gain a better understanding of how monoBERT works, we present a series of additional experi- ments that examine the effectiveness of the model under different contrastive and ablation settings. Specifically, we investigate the following questions:</p><p>1. How much data is needed to train an effective model? 2. 
What is the effect of different candidate generation approaches? 3. How does retrieval depth k impact effectiveness? 4. Do exact match scores from first-stage retrieval contribute to overall effectiveness? 5. How important are different components of the input template? 6. What is the effect of swapping out BERT for another model that is a simple variant of BERT?</p><p>We answer each of these questions in turn, and then move on to discuss efforts that attempt to understand why the model “works” so well.</p><p>Effects of Training Data Size. How much data do we need to train an effective monoBERT model? The answer to this first question is shown in Figure 8, with results taken from Nogueira et al. [2020]. In these experiments, BERTBase was fine-tuned with 1K, 2.5K, and 10K positive query–passage instances and an equal number of negative instances sampled from the training set of the MS MARCO passage ranking test collection. Effectiveness on the development set is reported in terms of MRR@10 with the standard setting of reranking k = 1000 candidate texts provided by Anserini’s BM25; note that the x-axis is in log scale. For the sampled conditions, the experiment was repeated five times, and the plot shows the 95% confidence intervals. The setting that uses all training instances was only run once due to computational costs. Note that these figures come from a different set of experimental trials than the results reported in the previous section, and thus MRR@10 from fine-tuning with all data is slightly different from the comparable condition in Table 5. The dotted horizontal black line shows the effectiveness of BM25 without any reranking. As we expect, effectiveness improves as monoBERT is fine-tuned with more data. Interestingly, in a “data poor” setting, that is, without many training examples, monoBERT actually performs worse than BM25; this behavior has been noted by other researchers as well [Zhang et al., 2020g, Mokrii</p><p>56 TREC 2019 DL Passage Method nDCG@10 MAP Recall@1k (3a) BM25 (Anserini, k = 1000) 0.5058 0.3013 0.7501 (3b) + monoBERTLarge 0.7383 0.5058 0.7501 (4a) BM25 + RM3 (Anserini, k = 1000) 0.5180 0.3390 0.7998 (4b) + monoBERTLarge 0.7421 0.5291 0.7998 Table 6: The effectiveness of monoBERT on the TREC 2019 Deep Learning Track passage ranking test collection, where the row numbers are consistent with Table 5.</p><p> et al., 2021]. As a rough point of comparison, the TREC 2019 Deep Learning Track passage ranking test collection comprises approximately 9K relevance judgments (both positive and negative); see Table 3. This suggests that monoBERT is quite “data hungry”: with 20K total training instances, monoBERT barely improves upon the BM25 baseline. The log–linear increase in effectiveness as a function of data size is perhaps not surprising, and consistent with previous studies that examined the effects of training data size [Banko and Brill, 2001, Brants et al., 2007, Kaplan et al., 2020].</p><p>Effects of Candidate Generation. Since monoBERT operates by reranking candidates from first- stage retrieval, it makes sense to investigate its impact on end-to-end effectiveness. Here, we examine the effects of query expansion using pseudo-relevance feedback, which is a widely studied technique for improving retrieval effectiveness on average (see Section 2.8). The effectiveness of keyword retrieval using BM25 + RM3, a standard pseudo-relevance feedback baseline, is presented in row (4a) of Table 5, with the implementation in Anserini. 
We see that MRR@10 decreases with pseudo- relevance feedback, although there isn’t much difference in terms of recall. Further reranking with BERT, shown in row (4b), yields MRR@10 that is almost the same as reranking BM25 results, shown in row (3b). Thus, it appears that starting with worse quality candidates in terms of MRR@10 (BM25 + RM3 vs. BM25), monoBERT is nevertheless able to identify relevant texts and bring them up into top-ranked positions. What’s going on here? These unexpected results can be attributed directly to artifacts of the relevance judgments in the MS MARCO passage ranking test collection. It is well known that pseudo-relevance feedback has a recall enhancing effect, since the expanded query is able to capture additional terms that may appear in relevant texts. However, on average, there is only one relevant passage per query in the MS MARCO passage relevance judgments; we have previously referred to these as sparse judgments (see Section 2.7). Recall that unjudged texts are usually treated as not relevant (see Section 2.5), as is the case here, so a ranking technique is unlikely to receive credit for improving recall. Thus, due to the sparsity of judgments, the MS MARCO passage ranking test collection appears to be limited in its ability to detect effectiveness improvements from pseudo-relevance feedback. We can better understand these effects by instead evaluating the same experimental conditions, but with the TREC 2019 Deep Learning Track passage ranking test collection, which has far fewer topics, but many more judged passages per topic (“dense judgments”, as described in Section 2.7). These results are shown in Table 6, where the rows have been numbered in the same manner as Table 5. We can see that these results support our explanation above: in the absence of BERT-based reranking, pseudo-relevance feedback does indeed increase effectiveness, as shown by row (3a) vs. row (4a). In particular, recall increases by around five points. The gain in nDCG@10 is more modest than the gain in MAP because, by definition, nDCG@10 is only concerned with the top 10 hits, and the recall-enhancing effects of RM3 have less impact in improving the top of the ranked list. Furthermore, an increase in the quality of the candidates does improve end-to-end effectiveness after reranking, row (3b) vs. row(4b), although the magnitude of the gain is smaller than the impact of pseudo-relevance feedback over simple bag-of-word queries. An important takeaway here is the importance of recognizing the limitations of a particular evaluation instrument (i.e., the test collection) and when an experiment exceeds its assessment capabilities.</p><p>Effects of Reranking Depth. Within a reranking setup, how does monoBERT effectiveness change as the model is provided with more candidates? This question is answered in Figure 9, where we show end-to-end effectiveness (MRR@10) of monoBERT with BM25 supplying different numbers of candidates to rerank. It is no surprise that end-to-end effectiveness increases as retrieval depth k</p><p>57 monoBERTLarge Effectiveness vs. Reranking Depth on MS MARCO Passage 0.4</p><p>0.38</p><p>0.36</p><p>0.34</p><p>MRR@10 0.32</p><p>0.3</p><p>0.28</p><p>0.26 10 100 1,000 10,000 50,000 Number of Candidate Documents</p><p>Figure 9: The effectiveness of monoBERTLarge on the development set of the MS MARCO passage ranking test collection varying the number of candidate documents k provided by first-stage retrieval using BM25. End-to-end effectiveness grows with reranking depth. 
increases, although there is clearly diminishing returns: going from 1000 hits to 10000 hits increases MRR@10 from 0.372 to 0.377. Further increasing k to 50000 does not measurably change MRR@10 at all (same value). Due to computation costs, experiments beyond 50000 hits were not performed. Quite interestingly, the effectiveness curve does not appear to be concave. In other words, it is not the case (at least out to 50000 hits) that effectiveness decreases with more candidates beyond a certain point. This behavior might be plausible because we are feeding BERT increasingly worse results, at least from the perspective of BM25 scores. However, it appears that BERT is not “confused” by such texts. Furthermore, these results confirm that first-stage retrieval serves primarily to increase computational efficiency (i.e., discarding obviously non-relevant texts), and that there are few relevant texts that have very low BM25 exact match scores. Since latency increases linearly with the number of candidates processed (in the absence of intra- query parallelism), this finding also has important implications for real-world deployments: system designers should simply select the largest k practical given their available hardware budget and latency targets. There does not appear to be any danger in considering k values that are “too large” (which would be the case if the effectiveness curve were concave, thus necessitating more nuanced tuning to operate at the optimal setting). In other words, the tradeoff between effectiveness and latency appears to be straightforward to manage.</p><p>Effects of Combining Exact Match Signals. Given the above results, a natural complementary question is the importance of exact match signals (e.g., BM25 scores) to end-to-end effectiveness. One obvious approach to combining evidence from initial BM25 retrieval scores and monoBERT scores is linear interpolation, whose usage in document ranking dates back to at least the 1990s [Bartell et al., 1994]: ∆ si = α · sˆBM25 + (1 − α) · sBERT, (16) where si is the final document score, sˆBM25 is the normalized BM25 score, sBERT is the monoBERT score, and α ∈ [0..1] is a weight the indicates their relative importance. Since monoBERT scores are sBERT ∈ [0, 1], we also normalize BM25 scores to be in the same range via linear scaling:</p><p> sBM25 − smin sˆBM25 = , (17) smax − smin where sBM25 is the original score, sˆBM25 is the normalized score, and smax and smin are the maximum and minimum scores, respectively, in the ranked list.</p><p>58 monoBERTLarge Effectiveness with BM25 Interpolation on MS MARCO Passage 0.4</p><p>0.35</p><p>0.3 MRR@10</p><p>0.25</p><p>0.2</p><p>0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 α</p><p>Figure 10: The effectiveness of monoBERTLarge on the development set of the MS MARCO passage ranking test collection varying the interpolation weight of BM25 scores: α = 0.0 means that only the monoBERT scores are used and α = 1.0 means that only the BM25 scores are used. BM25 scores do not appear to improve end-to-end effectiveness using this score fusion technique.</p><p>Experimental results are presented in Figure 10, which shows that MRR@10 monotonically decreases as we increase the weight placed on BM25 scores. This finding seems consistent with the reranking depth analysis in Figure 9. It stands to reason that if increasing k from 10000 to 50000 still improves MRR@10 (albeit slightly), then the BM25 score has limited value, i.e., it is unlikely that the BM25 score has much discriminative power between those ranks. 
Put differently, monoBERT doesn’t appear to need “help” from BM25 to identify relevant texts. So, do exact match scores contribute relevance signals that are not already captured by transformers? We are careful to emphasize that this experiment alone does not definitively answer the question: it only shows that with a simple interpolation approach, BM25 scores do not appear to provide additional value to monoBERT on the MS MARCO passage ranking task. In contrast, Birch [Akkalyoncu Yilmaz et al., 2019b] (see Section 3.3.1) as well as experiments with CEDR [MacAvaney et al., 2019a] (see Section 3.3.3) both incorporate BM25 scores, and evidence on question answering tasks is fairly conclusive that retrieval scores are helpful in boosting end-to-end effectiveness [Yang et al., 2019c, Yang and Seo, 2020, Karpukhin et al., 2020b, Ma et al., 2021c].</p><p>Effects of Input Template Variations. As explained in the previous sections, the input to monoBERT is comprised of three different sequences of dense vectors summed together at the token level (token, segment, and position embeddings). The sequence contains the inputs as well as the special tokens [CLS] and [SEP] that need to be positioned at specific locations. Together, these elements define the “input template” of how queries and candidate texts are fed to BERT. How important are each of these components? Here, we investigate which parts of the input are essential to monoBERT’s effectiveness. Table 7 summarizes the results of these experiments. We began by confirming that monoBERT is actually making use of relevance signals from token positions to aid in ranking. If we remove the position embeddings but keep everything else in the input template the same, which essentially ablates the model to relying only on a bag of words, MRR@10 drops nearly six points, see rows (1) vs. (2). This suggests that token positions are clearly an important relevance signal in monoBERT. Yet, interestingly, even without position information, monoBERT remains much more effective than the BM25 baseline, which suggests that the model is able to extract token-level signals in a bag-of-words setting (e.g., synonym, polysemy, semantic relatedness, etc.). This can be interpreted as evidence that monoBERT is performing “soft” semantic matching between query terms and terms in the candidate text.</p><p>59 MS MARCO Passage Development Method Input Template MRR@10</p><p>(1) BERTLarge, no modification [CLS] q [SEP] d [SEP] 0.365 (2) w/o positional embeddings [CLS] q [SEP] d [SEP] 0.307 (3) w/o segment type embeddings [CLS] q [SEP] d [SEP] 0.359 (4) swapping query and document [CLS] d [SEP] q [SEP] 0.366 (5) No [SEP] [CLS] Query: q Document: d 0.358</p><p>Table 7: The effectiveness of different monoBERTLarge input template variations on the development set of the MS MARCO passage ranking test collection.</p><p>For tasks involving two inputs, we face the issue of how to “pack” the disparate inputs into a single sequence (i.e., the input template) to feed to BERT. The standard solution devised by Devlin et al. [2019] uses a combination of the [SEP] tokens and segment embeddings. The monoBERT model inherits this basic design, but here we investigate different techniques to accomplish the goal of “marking” disparate inputs so that the model can distinguish different parts of the task input. As a simple ablation, we see that removing the segment embeddings has little impact, with only a small loss in MRR@10. 
This shows that monoBERT can distinguish query and document tokens using only the separator tokens and perhaps the absolute positions of the tokens. Since most queries in MS MARCO have less than 20 tokens, could it be the case that monoBERT simply memorizes the fact that query tokens always occur near the beginning of the input sequence, effectively ignoring the separator tokens? To test this hypothesis, we swapped the order in which the query and the candidate text are fed to monoBERT. Since the candidate texts have a much larger variation in terms of length than the queries, the queries will occur in a larger range of token positions in the input sequence, thus making it harder for monoBERT to identify query tokens based solely on their absolute positions. Rows (1) vs. (4) show minimal difference in MRR@10 under this swapped treatment, which adds further evidence that monoBERT is indeed using separator tokens and segment type embeddings to distinguish between the query and the candidate text (in the default input template). Given that the [SEP] token does seem to be playing an important role in segmenting the input sequence to monoBERT, a natural follow-up question is whether different “delimiters” might also work. As an alternative, we tried replacing [SEP] with the (literal) token “Query:” prepended to the query and the token “Document:” prepended to the candidate text. This design is inspired by “text only” input templates that are used in T5, described later in Section 3.5.3. The results are shown in row (5) in Table 7, where we observe a drop in MRR@10. This suggests that [SEP] indeed does have a special status in BERT, likely due to its extensive use in pretraining. Clearly, the organization of the input template is important, which is an observation that has been noted by other researchers as well across a range of NLP tasks [Haviv et al., 2021, Le Scao and Rush, 2021]. Specifically for ranking, Boualili et al. [2020] suggested that BERT might benefit from explicit exact match cues conveyed using marker tokens. However, the authors reported absolute scores that do not appear to be competitive with the results reported in this section, and thus it is unclear if such explicit cues continue to be effective with stronger baselines. Nevertheless, it is clear that the organization of the input sequence can make a big difference in terms of effectiveness (in ranking and beyond), and there is no doubt a need for more thorough further investigations.</p><p>Effects of Simple monoBERT variants. As discussed in the introduction of this survey, the public release of BERT set off a stampede of follow-up models, ranging from relatively minor tweaks to simple architectural variants to entirely new models inspired by BERT. Of course, the distinction between a “variant” and a new model is somewhat fuzzy, but many researchers have proposed models that are compatible with BERT in the sense that they can easily be “swapped in” with minimal changes.93 In many cases, a BERT variant takes the same input template as monoBERT and operates as a relevance classifier in the same way. 
One notable BERT variant is RoBERTa [Liu et al., 2019c], which can be described as Facebook’s replication study of BERT’s pretraining procedures “from scratch”, with additional explorations of</p><p>93In some cases, when using the Hugging Face Transformer library, swapping in one of these alternative models is, literally, a one-line change.</p><p>60 MS MARCO Passage (Dev) Method MRR@10</p><p>(1) monoBERTLarge 0.372 (2) monoRoBERTa Large 0.365</p><p>Table 8: The effectiveness of monoRoBERTa Large on the development set of the MS MARCO passage ranking test collection. The monoBERTLarge results are copied from Table 5.</p><p> many design choices made in Devlin et al. [2019]. The authors of RoBERTa argued that Google’s original BERT model was significantly under-trained. By modifying several hyperparameters and by removing the next sentence prediction (NSP) task (see Section 3.1), RoBERTa is able to match or exceed the effectiveness of BERT on a variety of natural language processing tasks. Table 8 shows the results of replacing BERTLarge with RoBERTa Large in monoBERT, evaluated on the MS MARCO passage ranking test collection. These results have not been previously published, but the experimental setup is the same as in Section 3.2 and the monoBERTLarge results are copied from row (3b) in Table 5. We see that although RoBERTa achieves higher effectiveness across a range of NLP tasks, these improvements do not appear to carry over to text ranking, as monoRoBERTa Large reports a slightly lower MRR@10. This finding suggests that information access tasks need to be examined independently from the typical suite of tasks employed by NLP researchers to evaluate their models. Beyond RoBERTa, there is a menagerie of BERT-like models that can serve as drop-in replacements of BERT for text ranking, just like monoRoBERTa. As we discuss models that tackle ranking longer texts in the next section (Section 3.3), in which BERT serves as a component in a larger model, these BERT alternatives can likewise be “swapped in” seamlessly. Because these BERT-like models were developed at different times, the investigation of their impact on effectiveness has been mostly ad hoc. For example, we are not aware of a systematic study of monoX, where X spans the gamut of BERT replacements. Nevertheless, researchers have begun to experimentally study BERT variants in place of BERT “classic” for ranking tasks. We will interleave the discussion of ranking models and adaptations of BERT alternatives in the following sections. At a high level, these explorations allow researchers to potentially “ride the wave” of model advancements at a relatively small cost. However, since improvements on traditional natural language processing tasks may not translate into improvements in information access tasks, the effectiveness of each BERT variant must be empirically validated.</p><p>Discussion and Analysis. Reflecting on the results presented above, it is quite remarkable how monoBERT offers a simple yet effective solution to the text ranking problem (at least for texts that fit within its sequence length restrictions). The simplicity of the model has contributed greatly to its widespread adoption. These results have been widely replicated and can be considered robust findings—for example, different authors have achieved comparable results across different implementations and hyperparameter settings. 
Indeed, monoBERT has emerged as the baseline for transformer-based approaches to text ranking, and some variant of monoBERT serves as the baseline for many of the papers cited throughout this survey.</p><p>3.2.3 Investigating How BERT Works While much work has empirically demonstrated that BERT can be an effective ranking model, it is not clear exactly why this is the case. As Lin [2019] remarked, it wasn’t obvious that BERT, specifically designed for NLP tasks, would “work” for text ranking; in fact, the history of IR is littered with ideas from NLP that intuitively “should work”, but never panned out, at least with the implementations of the time. In this section, we present several lines of work investigating why BERT performs well for both NLP tasks in general and for information access tasks in particular.</p><p>What is the relationship between BERT and “pre-BERT” neural ranking models? Figure 11 tries to highlight important architectural differences between BERT and pre-BERT neural ranking models: for convenience, we repeat the high-level designs of the pre-BERT representation-based and interaction-based neural ranking models, taken from Figure 1 in Section 1.2.4. As a high-level recap, there is experimental evidence suggesting that interaction-based approaches (middle) are generally more effective than representation-based approaches (left) because the similarity matrix explicitly</p><p>61 s</p><p> s E</p><p> s q1 1 … T[CLS] U1 U2 U3 T[SEP1] V1 V2 … Vm T[SEP2] E</p><p> q2 2 …</p><p>… … … … … … … … … … E</p><p> q3 3 …</p><p>F1 F2 Fm E1 E2 E3 F1 F2 Fm E[CLS] E1 E2 E3 E[SEP1] F1 F2 … Fm E[SEP2]</p><p> d d d q q q d d d 1 2 … m 1 2 3 1 2 … m [CLS] q1 q2 q3 [SEP] d1 d2 … dm [SEP] (a) Representation-Based (b) Interaction-Based (c) monoBERT</p><p>Figure 11: Side-by-side comparison between high-level architectures of the two main classes of pre-BERT neural ranking models with monoBERT, where all-to-all attention at each transformer layer captures interactions between and within terms from the query and the candidate text. captures exact as well as “soft” semantic matches between individual terms and sequences of terms in the query and the candidate text. In BERT, all-to-all interactions between and within query terms and terms from the candidate text are captured by multi-headed attention at each layer in the transformer. Attention appears to serve as a one-size-fits-all approach to extracting signal from term interactions, replacing the various techniques used by pre-BERT interaction-based models, e.g., different pooling techniques, convolutional filters, etc. Furthermore, it appears that monoBERT does not require any specialized neural architectural components to model different aspects of relevance between queries and a candidate text, since each layer of the transformer is homogeneous and the same model architecture is used for a variety of natural language processing tasks. However, it also seems clear that ranking is further improved by incorporating BERT as a component to extract relevance signals that are further processed by other neural components, for example, PARADE (see Section 3.3.4). In other words, BERT can be used directly for ranking or as a building block in a larger model.</p><p>What does BERT learn from pretraining? There has been no shortage of research that attempts to reveal insights about how BERT “works” in general. 
Typically, this is accomplished through visualization techniques (for example, of attention and activation patterns), probing classifiers, and masked word prediction. We discuss a small subset of findings in the context of NLP here and refer the reader to a survey by Rogers et al. [2020] for more details. Probing classifiers have been used in many studies to determine whether something can be predicted from BERT’s internal representations. For example, Tenney et al. [2019] used probes to support the claim that “BERT rediscovers the classical NLP pipeline” by showing that the model represents part-of-speech tagging, parsing, named-entity recognition, semantic role labeling, and coreference (in that order) in an interpretable and localizable way. That is, internal representations encode information useful for these tasks, and some layers are better than others at producing representations that are useful for a given task. However, Elazar et al. [2021] used “amnesic probing” to demonstrate that such linguistic information is not necessarily used when performing a downstream task. Other researchers have examined BERT’s attention heads and characterized their behavior. For example, Clark et al. [2019] categorized a few frequently observed patterns such as attending to delimiter tokens and specific position offsets, and they were able to identify attention heads that correspond to linguistic notions (e.g., verbs attending to direct objects). Kovaleva et al. [2019] specifically focused on self-attention patterns and found that a limited set of attention patterns are repeated across different heads, suggesting that the model is over-parameterized. Indeed, manually disabling attention in certain heads leads to effectiveness improvements in some NLP tasks [Voita et al., 2019]. Rather than attempting to train probing classifiers or to look “inside” the model, others have investigated BERT’s behavior via a technique called masked term prediction. Since BERT was pretrained with the masked language model (MLM) objective, it is possible to feed the masked token [MASK] to the model and ask it to predict the masked term, as a way to probe what the model has learned. Ettinger [2020] found that BERT performs well on some tasks like associating a term with its hypernym (broader category) but performs much worse on others like handling negations. For example, BERT’s top three predictions remained the same when presented with both “A hammer is an [MASK]” and “A hammer is not an [MASK]”.</p><p>62 While these studies begin to shed light on the inner workings of BERT, they do not specifically examine information access tasks, so they offer limited insight on how notions of relevance are captured by BERT.</p><p>How does BERT perform relevance matching? Information retrieval researchers have attempted to specifically investigate relevance matching by BERT in ranking tasks [Padigela et al., 2019, Qiao et al., 2019, Câmara and Hauff, 2020, Zhan et al., 2020b, Formal et al., 2021b, MacAvaney et al., 2020b]. For example, Qiao et al. [2019] argued that BERT should be understood as an “interaction- based sequence-to-sequence matching model” that prefers semantic matches between paraphrase tokens. Furthermore, the authors also found that BERT’s relevance matching behavior differs from neural rankers that are trained from user clicks in query logs. Zhan et al. 
[2020b] attributed the processes of building semantic representations and capturing interaction signals to different layers, arguing that the lower layers of BERT focus primarily on extracting representations, while the higher layers capture interaction signals to ultimately predict relevance. Câmara and Hauff [2020] created diagnostic datasets to test whether BERT satisfies a range of IR axioms [Fang et al., 2004, 2011] describing how retrieval scores should change based on occurrences of query terms, the discriminativeness (idf) of matched terms, the number of non-query terms in a document, semantic matches against query terms, the proximity of query terms, etc. Using these diagnostic datasets, they found that a distilled BERT model [Sanh et al., 2019] satisfies the axioms much less frequently than Indri’s query likelihood model despite being much more effective, leading to the conclusion that the axioms alone cannot explain BERT’s effectiveness. Similarly, in the context of the ColBERT ranking model (described later in Section 5.5.2), Formal et al. [2021b] investigated whether BERT has a notion of term importance related to idf. They found that masking low idf terms influences the ranking less than masking high idf terms, but the importance of a term does not necessarily correlate with its idf. Furthering this thread of research on creating “diagnostics” to investigate ranking behavior, Mac- Avaney et al. [2020b] proposed using “textual manipulation tests” and “dataset transfer tests” in addition to the diagnostic tests used in earlier work. They applied these tests to monoBERT as well as to other models like T5 (described later in Section 3.5.3). The authors found that monoBERT is better than BM25 at estimating relevance when term frequency is held constant, which supports the finding from Câmara and Hauff [2020] that monoBERT does not satisfy term frequency axioms. Using textual manipulation tests in which existing documents are modified, MacAvaney et al. [2020b] found that shuffling the order of words within a sentence or across sentences has a large negative effect, while shuffling the order of sentences within a document has a modest negative effect. However, shuffling only prepositions had little effect. Surprisingly, in their experiments, monoBERT increases the score of texts when non-relevant sentences are added to the end but decreases the score when relevant terms from doc2query–T5 (described later in Section 4.3) are added to the end. Using dataset transfer tests, which pair together two versions of the same document, MacAvaney et al. [2020b] found that monoBERT scores informal text slightly higher than formal text and fluent text slightly higher than text written by non-native speakers. While progress has been made in understanding exactly how BERT “works” for text ranking, the explanations remain incomplete, to some extent inconsistent, and largely unsatisfying. BERT shows evidence of combining elements from both representation-based models as well as interaction-based models. Furthermore, experimental results from input template variations above show that monoBERT leverages exact match, “soft” semantic match, as well as term position information. 
How exactly these different components combine—for different types of queries, across different corpora, and under different settings, etc.—remains an open question.</p><p>3.2.4 Nuances of Training BERT</p><p>With transformers, the “pretrain then fine-tune” recipe has emerged as the standard approach of applying BERT to specific downstream tasks such as classification, sequence labeling, and ranking. Typically, we start with a “base” pretrained transformer model such as the BERTBase and BERTLarge checkpoints directly downloadable from Google or the Hugging Face Transformers library. This model is then fine-tuned on task-specific labeled data drawn from the same distribution as the target task. For ranking, the model might be fine-tuned using a test collection comprised of queries and relevance judgments under a standard training, development (validation), and test split.</p><p>63 However, there are many variations of this generic “recipe”, for example:</p><p>• Additional unsupervised pretraining. • Fine-tuning on one or more out-of-domain (or more generally, out-of-distribution) labeled data with respect to the target task. • Fine-tuning on synthetically generated labeled data or data gathered via distant supervision techniques (also called weak supervision). • Specific fine-tuning strategies such as curriculum learning.</p><p>An important distinction among these techniques is the dichotomy between those that take advantage of self supervision and those that require task-specific labeled data. We describe these two approaches separately below, but for the most part our discussions occur at a high level because the specific techniques can be applied in different contexts and on different models. Thus, it makes more sense to introduce the general ideas here, and then interweave experimental results with the contexts or models they are applied to (throughout this section). This is the narrative strategy we have adopted, but to introduce yet another layer of complexity, these techniques can be further interwoven with knowledge distillation, which is presented later in Section 3.5.1.</p><p>Additional Unsupervised Pretraining. The checkpoints of publicly downloadable models such as BERTBase and BERTLarge are pretrained on “general domain” corpora: for example, BERT uses the BooksCorpus [Zhu et al., 2015] as well as Wikipedia. While there may be some overlap between these corpora and the target corpus over which ranking is performed, they may nevertheless differ in terms of vocabulary distribution, genre, register, and numerous other factors. Similarly, while the masked language model (MLM) and next sentence prediction (NSP) pretraining objectives lead to a BERT model that performs well for ranking, neither objective is closely related to the ranking task. Thus, it may be helpful to perform additional pretraining on the target corpus or with a new objective that is tailored for ranking. It is important here to emphasize that pretraining requires only access to the corpus we are searching and does not require any queries or relevance judgments. In order to benefit from additional pretraining on a target corpus, the model should be given the chance to learn more about the distribution of the vocabulary terms and their co-occurrences prior to learning how to rank them. Put differently, the ranking model should be given an opportunity to “see” what texts in a corpus “look like” before learning relevance signals. To our knowledge, Nogueira et al. 
[2019a] was the first to demonstrate this idea, which they called target corpus pretraining (TCP), specifically for ranking in the context of their multi-stage architecture (discussed in Section 3.4.1). Here, we only present their results with monoBERT. Instead of using Google’s BERT checkpoints as the starting point of fine tuning, they began by additional pretraining on the MS MARCO passage corpus using the same objectives from the original BERT paper, i.e., masked language modeling and next sentence prediction. Only after this additional pretraining stage was the model then fine-tuned with the MS MARCO passage data. This technique has also been called “further pretraining”, and its impact can be shown by comparing row (2a) with row (2b). Although the improvement is modest, the gain is “free” in the sense of not requiring any labeled data, and so adopting this technique might be worthwhile in certain scenarios. These results are in line with findings from similar approaches for a variety of natural language processing tasks [Beltagy et al., 2019, Raffel et al., 2020, Gururangan et al., 2020]. However, as a counterpoint, Gu et al. [2020] argued that for domains with abundant unlabeled text (such as biomedicine), pretraining language models from scratch is preferable to further pretraining general- domain language models. This debate is far from settled and domain adaptation continues to be an active area of research, both for text ranking and NLP tasks in general. Other researchers have proposed performing pretraining using a modified objective, with the goal of improving BERT’s effectiveness on downstream tasks. For example, ELECTRA (described later in Section 3.3.1) replaces the masked language model task with a binary classification task that involves predicting whether each term is the original term or a replacement. Specifically for information retrieval, Ma et al. [2021b] proposed a new “representative words prediction” (ROP) task that involves presenting the model with two different sets of terms and asking the model to predict which set is more related to a given document. A pretraining instance comprises two segments: segment A consists of one set of terms (analogous to the query in monoBERT)</p><p>64 MS MARCO Passage Development Test Method MRR@10 MRR@10 (1) Anserini (BM25) = Table 5, row (3a) 0.187 0.190 (2a) + monoBERT = Table 5, row (3b) 0.372 0.365 (2b) + monoBERT + TCP 0.379 -</p><p>Table 9: The effectiveness of target corpus pretraining (TCP) for monoBERT on the MS MARCO passage ranking test collection.</p><p> and segment B contains a document. Given this input, the [CLS] token is provided as input to a feedforward network to predict a score. This is performed for a “relevant” and a “non-relevant” set of terms, and their scores are fed into a pairwise hinge loss. To choose the two sets of terms, a set size is first sampled from a Poisson distribution, and then two sets of terms of the sampled size are randomly sampled from a single document with stopwords removed. A multinomial query likelihood model with Dirichlet smoothing [Zhai, 2008] is then used to calculate a score for each set of terms; the set with the higher score is treated as the “relevant” set.</p><p>Ma et al. [2021b] evaluated the impact of performing additional pretraining with BERTBase on Wikipedia and the MS MARCO document collection with the MLM objective, their proposed ROP objective, and a combination of the two. 
They found that pretraining with ROP improves effectiveness over pretraining with MLM on the Robust04, ClueWeb09B, and GOV2 test collections when reranking BM25. Each document in these datasets was truncated to fit into monoBERT. Combining the MLM and ROP objectives yielded little further improvement. However, the models reported in this work do not appear to yield results that are competitive with many of the simple models we describe later in this section, and thus it is unclear if this pretraining technique can yield similar gains on better ranking models.</p><p>“Multi-Step” Supervised Fine-Tuning Strategies. In the context of pretrained transformers, fine- tuning involves labeled data drawn from the same distribution as the target downstream task. However, it is often the case that researchers have access to labeled that is not “quite right” with respect to the target task. In NLP, for example, we might be interested in named-entity recognition (NER) in scientific articles in the biomedical domain, but we have limited annotated data. Can NER data on news articles, for example, nevertheless be helpful? The same train of thought can be applied to text ranking. Often, we are interested in a slightly different task or a different domain than the ones we have relevance judgments for. Can we somehow exploit these data? Not surprisingly, the answer is yes and researchers have experimented with different “multi-step” fine-tuning strategies for a range of NLP applications. The idea is to leverage out-of-task or out-of- domain labeled data (or out-of-distribution labeled data that’s just not “right” for whatever reason) to fine-tune a model before fine-tuning on labeled data drawn from the same distribution as the target task. Since there may be multiple such datasets, the fine-tuning process may span multiple “stages” or “phases”. In the same way that target corpus pretraining gives the model a sense of what the texts “look like” before attempting to learn relevance signals, these technique attempts to provide the model with “general” knowledge of the task before learning from task-specific data. To our knowledge, the first reported instance of sequential fine-tuning with multiple labeled datasets is by Phang et al. [2018] on a range of natural language inference tasks. This technique of sequentially fine-tuning on multiple datasets, as specifically applied to text ranking, has also been explored by many researchers: Akkalyoncu Yilmaz et al. [2019b] called this cross- domain relevance transfer. Garg et al. [2020] called this the “transfer and adapt” (TANDA) approach. Dai and Callan [2019a] first fine-tuned on data from search engine logs before further fine-tuning on TREC collections. Zhang et al. [2021] called this “pre–fine-tuning”, and specifically investigated the effectiveness of pre–fine-tuning a ranking model on the MS MARCO passage ranking test collection before further fine-tuning on collection-specific relevance judgments. Mokrii et al. [2021] presented another study along similar lines. Applied to question answering, Yang et al. [2019d] called this “stage-wise” fine-tuning, which is further detailed in Xie et al. [2020]. For consistency in presentation, in this survey we refer to such sequential or multi-step fine-tuning strategies as pre–fine-tuning, with the convenient abbreviation of pFT (vs. FT for fine tuning). This technique is widely adopted—</p><p>65 obviously applicable to monoBERT, but can also be used in the context of other models. 
We do not present any experimental results here, and instead examine the impact of pre–fine-tuning in the context of specific ranking models presented in this section. One possible pitfall when fine-tuning with multiple labeled datasets is the phenomenon known as “catastrophic forgetting”, where fine-tuning a model on a second dataset interferes with its ability to perform the task captured by the first dataset. This is undesirable in many instances because we might wish for the model to adapt “gradually”. For example, if the first dataset captured text ranking in the general web domain and the second dataset focuses on biomedical topics, we would want the model to gracefully “back off’ to general web knowledge if the query was not specifically related to biomedicine. Lovón-Melgarejo et al. [2021] studied catastrophic forgetting in neural ranking models: Compared to pre-BERT neural ranking models, they found that BERT-based models seem to be able to retain effectiveness on the pre–fine-tuning dataset after further fine-tuning. Pre–fine-tuning need not exploit human labeled data. For example, relevance judgments might come from distant (also called weak) supervision techniques. Zhang et al. [2020d] proposed a method for training monoBERT with weak supervision by using reinforcement learning to select (anchor text, candidate text) pairs during training. In this approach, relevance judgments are used to compute the reward guiding the selection process, but the selection model does not use the judgments directly. To apply their trained monoBERT model to rerank a target collection, the authors trained a learning-to-rank method using coordinate ascent with features consisting of the first-stage retrieval score and monoBERT’s [CLS] vector. The authors found that these extensions improved over prior weak supervision approaches used with neural rankers [Dehghani et al., 2017, MacAvaney et al., 2019b]. Beyond weak supervision, it might even been possible to leverage synthetic data, similar to the work of Ma et al. [2021a] (who applied the idea to dense retrieval), but this thread has yet to be fully explored. The multi-step fine-tuning strategies discussed here are related to the well-studied notion of curriculum learning. MacAvaney et al. [2020e] investigated whether monoBERT can benefit from a training curriculum [Bengio et al., 2009] in which the model is presented with progressively more difficult training examples as training progresses. Rather than excluding training data entirely, they calculate a weight for each training example using proposed difficulty heuristics based on BM25 ranking. As training progresses, these weights become closer to uniform. MacAvaney et al. [2020e] found that this weighted curriculum learning approach can significantly improve the effectiveness of monoBERT. While both pre–fine-tuning and curriculum learning aim to sequence the presentation of examples to a model during training, the main difference between these two methods is that pre–fine-tuning generally involves multiple distinct datasets. In contrast, curriculum learning strategies can be applied even on a single (homogeneous) dataset. One main goal of multi-step fine-tuning strategies is to reduce the amount of labeled data needed in the target domain or task by exploiting existing “out-of-distribution” datasets. This connects to the broad theme of “few-shot” learning, popular in natural language processing, computer vision, and other fields as well. 
Taking this idea to its logical conclusion, researchers have explored zero-shot approaches to text ranking. That is, the model is trained on (for example) out-of-domain data and directly applied to the target task. Examples include Birch (see Section 3.3.1) and monoT5 (see Section 3.5.3), as well as zero-shot domain adaptation techniques (see Section 6.2). We leave details to these specific sections.</p><p>To wrap up the present discussion, researchers have explored many different techniques to “train” BERT and other transformers beyond the “pretrain then fine-tune” recipe. There is a whole litany of tricks to exploit “related” data, both in an unsupervised as well as a supervised fashion (and to even “get away” with not using target data at all in a zero-shot setting). While these tricks can indeed be beneficial, details of how to properly apply them (e.g., how many epochs to run, how many and what order to apply out-of-domain datasets, how to heuristically label and select data, when zero-shot approaches might work, etc.) remain somewhat of an art, and their successful application typically involves lots of trial and error. Some of these issues are discussed by Zou et al. [2021] in the context of applying transformer-based models in Baidu search, where they cautioned that blindly fine-tuning risks unstable predictions, poor generalizations, and deviations from task metrics, especially when the training data are noisy. While we understand at a high level why various fine-tuning techniques work, more research is required to sharpen our understanding so that expected gains can be accurately predicted and modeled without the need to conduct extensive experiments repeatedly.</p><p>66 These are important issues that remain unresolved, and in particular, pretraining and pre–fine-tuning become important when transformers are applied to domain-specific applications, such as legal texts and scientific articles; see additional discussions in Section 6.2.</p><p>3.3 From Passage to Document Ranking</p><p>One notable limitation of monoBERT is that it does not offer an obvious solution to the input length restrictions of BERT (and of simple BERT variants). Nogueira and Cho [2019] did not have this problem because the test collections they examined did not contain texts that overflowed this limit. Thus, monoBERT is limited to ranking paragraph-length passages, not longer documents (e.g., news articles) as is typically found in most ad hoc retrieval test collections. This can be clearly seen in the histogram of text lengths from the MS MARCO passage corpus, shown in Figure 3 from Section 2.7. The combination of BERT’s architecture and the pretraining procedure means that the model has difficulty handling input sequences longer than 512 tokens, both from the perspective of model effectiveness and computational requirements on present-day hardware. Let us begin by understanding in more detail what the issues are. Since BERT was pretrained with only input sequences up to 512 tokens, learned position embeddings for token positions past 512 are not available. Because position embeddings inform the model about the linear order of tokens, if the input sequence lacks this signal, then everything the model has learned about the linear structure of language is lost (i.e., the input will essentially be treated as a bag of words). We can see from the experimental results in Table 7 that position embeddings provide important relevance signals for monoBERT. 
Henderson [2020] explained this by pointing out that BERT can be thought of as a “bag of vectors”, where structural cues come only from the position embeddings. This means that the vectors in the bag are exchangeable, in that renumbering the indices used to refer to the different input representations will not change the interpretation of the representation (provided that the model is adjusted accordingly as well). While it may be possible to learn additional position embeddings during fine-tuning with sufficient training data, this does not seem like a practical general-purpose solution. Without accurate position embeddings, it is unclear how we would prepare input sequences longer than 512 tokens for inference (more details below). From the computational perspective, the all-to-all nature of BERT’s attention patterns at each transformer encoder layer means that it exhibits quadratic complexity in both time and space with respect to input length. Thus, simply throwing more hardware at the problem (e.g., GPUs with more RAM) is not a practical solution; see Beltagy et al. [2020] for experimental results characterizing resource consumption on present-day hardware with increasing sequence lengths. Instead, researchers have tackled this issue by applying some notion of sparsity to the dense attention mechanism. See Tay et al. [2020] for a survey of these attempts, which date back to at least 2019 [Child et al., 2019]. We discuss modifications to the transformer architecture that replace all-to-all attention with more efficient alternatives later in Section 3.3.5. The length limitation of BERT (and transformers in general) breaks down into two distinct but related challenges for text ranking:</p><p>Training. For training, it is unclear what to feed to the model. The key issue is that relevance judgments for document ranking (e.g., from TREC test collections) are provided at the document level, i.e., they are annotations on the document as a whole. Obviously, a judgment of “relevant” comes from a document containing “relevant material”, but it is unknown how that material is distributed throughout the document. For example, there could be a relevant passage in the middle of the document, a few relevant passages scattered throughout the document, or the document may be relevant “holistically” when considered in its entirety, but without any specifically relevant passages. If we wish to explicitly model different relevance grades (e.g., relevant vs. highly relevant), then this “credit assignment” problem becomes even more challenging. During training, if the input sequence (i.e., document plus the query and the special tokens) exceeds BERT’s length limitations, it must be truncated somehow, lest we run into exactly the issues discussed above. Since queries are usually shorter than documents, and it make little sense to truncate the query, we must sacrifice terms from the document text. While we could apply heuristics, for example, to feed BERT only spans in the document that contain query terms or even disregard this issue completely (see Section 3.3.2), there is no guarantee that training passages from the document fed to BERT are actually relevant. Thus, training will be noisy at best.</p><p>67 Inference. At inference time, if a document is too long to feed into BERT in its entirety, we must decide how to preprocess it. We could segment the document into chunks, but there are many design choices: For example, fixed-width spans or natural units such as sentences? 
How wide should these segments be? Should they be overlapping? Furthermore, applying inference over different chunks from a document still requires some method for aggregating evidence. It is possible to address the inference challenge by aggregating either passage scores or passage representations. Methods that use score aggregation predict a relevance score for each chunk, and these scores are then aggregated to produce a document relevance score (e.g., by taking the maximum score across the chunks). Methods that perform representation aggregation first combine passage representations before predicting a relevance score. With a properly designed aggregation technique, even if each passage is independently processed, the complete ranking model can be differentiable and thus amenable to end-to-end training via back propagation. This solves the training challenge as well, primarily by letting the model figure out how to allocate “credit” by itself.</p><p>Breaking this “length barrier” in transitioning from passage ranking to full document ranking was the next major advance in applying BERT to text ranking. This occurred with three proposed models that were roughly contemporaneous, dating to Spring 2019, merely a few months after monoBERT: Birch [Akkalyoncu Yilmaz et al., 2019b], which was first described by Yang et al. [2019e], BERT– MaxP [Dai and Callan, 2019b], and CEDR [MacAvaney et al., 2019a]. Interestingly, these three models took different approaches to tackle the training and inference challenges discussed above, which we detail in turn. We then present subsequent developments: PARADE [Li et al., 2020a], which incorporates and improves on many of the lessons learned in CEDR, and a number of alternative approaches to ranking long texts. All of these ranking models are still based on BERT or a simple BERT variant at their cores; we discuss efforts to move beyond BERT in Section 3.5.</p><p>3.3.1 Document Ranking with Sentences: Birch The solution presented by Birch [Akkalyoncu Yilmaz et al., 2019b] can be summarized as follows:</p><p>• Avoid the training problem entirely by exploiting labeled data where length issues don’t exist, and then transferring the learned relevance matching model on those data to the domain or task of interest. • For the inference problem, convert the task of estimating document relevance into the task of estimating the relevance of individual sentences and then aggregating the resulting scores.</p><p>In short, Birch solved the training problem above by simply avoiding it. Earlier work by the same research group [Yang et al., 2019e] that eventually gave rise to Birch first examined the task of ranking tweets, using test collections from the TREC Microblog Tracks [Ounis et al., 2011, Soboroff et al., 2012, Lin and Efron, 2013, Lin et al., 2014]. These evaluations focused on information seeking in a microblog context, where users desire relevant tweets with respect to an information need at a particular point in time. As tweets are short (initially 140 characters, now 280 characters), they completely avoid the length issues we discussed above. Not surprisingly, fine-tuning monoBERT on tweet data led to large and statistically significant gains on ranking tweets. However, Yang et al. [2019e] discovered that a monoBERT model fine-tuned with tweet data was also effective for ranking documents from a newswire corpus. This was a surprising finding: despite similarities in the task (both are ad hoc retrieval problems), the domains are completely different. 
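The distinction between the two aggregation families can be made concrete with a small sketch: in score aggregation, per-chunk scores are combined (for example, by taking the maximum); in representation aggregation, per-chunk [CLS] vectors are pooled before a single score is predicted. The NumPy code below is a minimal illustration under assumed inputs (per-chunk scores or vectors already produced by some BERT-based model), with a plain linear layer standing in for a learned scoring head.

```python
import numpy as np

def score_aggregation(chunk_scores: np.ndarray, how: str = "max") -> float:
    """Score aggregation: each chunk is scored independently and the
    per-chunk scores are combined into a single document score."""
    return float(chunk_scores.max() if how == "max" else chunk_scores.sum())

def representation_aggregation(chunk_cls: np.ndarray,
                               w: np.ndarray, b: float = 0.0) -> float:
    """Representation aggregation: per-chunk [CLS] vectors are combined first
    (here, mean pooling), then a single relevance score is predicted (here, a
    linear classifier standing in for a learned scoring head)."""
    doc_vector = chunk_cls.mean(axis=0)   # (hidden_dim,)
    return float(doc_vector @ w + b)      # scalar document score

# Toy example: 3 chunks, hidden size 4 (real models use 768+ dimensions).
scores = np.array([0.2, 1.3, 0.7])
cls_vectors = np.random.randn(3, 4)
w = np.random.randn(4)
print(score_aggregation(scores), representation_aggregation(cls_vectors, w))
```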
Newswire articles comprise well-formed and high-quality prose written by professional journalists, whereas tweets are composed by social media users, often containing misspellings, ungrammatical phrases, and incoherent meanings, not to mention genre-specific idiosyncrasies such as hashtags and @-mentions. In other words, Yang et al. [2019e] discovered that, for text ranking, monoBERT appears to have very strong domain transfer effects for relevance matching. Training on tweet data and performing inference on articles from a newswire corpus is an instance of zero-shot cross-domain learning, since the model had never been exposed to annotated data from the specific task.94 This finding predated many of the papers discussed in Section 3.2.4, but in truth Birch had begun to explore some of the ideas presented there (e.g., pre–fine-tuning as well as zero-shot approaches).

94 There is no doubt, of course, that BERT had been exposed to newswire text during pretraining.

                               Robust04              Core17                Core18
Method                         MAP       nDCG@20     MAP       nDCG@20     MAP       nDCG@20
(1) BM25 + RM3                 0.2903    0.4407      0.2823    0.4467      0.3135    0.4604
(2a) 1S: BERT(MB)              0.3408†   0.4900†     0.3091†   0.4628      0.3393†   0.4848†
(2b) 2S: BERT(MB)              0.3435†   0.4964†     0.3137†   0.4781      0.3421†   0.4857†
(2c) 3S: BERT(MB)              0.3434†   0.4998†     0.3154†   0.4852†     0.3419†   0.4878†
(3a) 1S: BERT(MSM)             0.3028†   0.4512      0.2817†   0.4468      0.3121    0.4594
(3b) 2S: BERT(MSM)             0.3028†   0.4512      0.2817†   0.4468      0.3121    0.4594
(3c) 3S: BERT(MSM)             0.3028†   0.4512      0.2817†   0.4468      0.3121    0.4594
(4a) 1S: BERT(MSM → MB)        0.3676†   0.5239†     0.3292†   0.5061†     0.3486†   0.4953†
(4b) 2S: BERT(MSM → MB)        0.3697†   0.5324†     0.3323†   0.5092†     0.3496†   0.4899†
(4c) 3S: BERT(MSM → MB)        0.3691†   0.5325†     0.3314†   0.5070†     0.3522†   0.4899†

Table 10: The effectiveness of Birch on the Robust04, Core17, and Core18 test collections. The symbol † denotes significant improvements over BM25 + RM3 (paired t-tests, p < 0.01, with Bonferroni correction).

This domain-transfer discovery was later refined by Akkalyoncu Yilmaz et al. [2019b] in Birch. To compute a document relevance score s_f, inference is applied to each individual sentence in the document, and then the top n scores are combined with the original document score s_d (i.e., from first-stage retrieval) as follows:

    s_f ≜ α · s_d + (1 − α) · Σ_{i=1}^{n} w_i · s_i        (18)

where s_i is the score of the i-th top-scoring sentence according to BERT. Inference on individual sentences proceeds in the same manner as in monoBERT, where the input to BERT is comprised of the concatenation of the query q and a sentence p_i ∈ D into the sequence:

    [[CLS], q, [SEP], p_i, [SEP]]        (19)

In other words, the final relevance score of a document comes from the combination of the original candidate document score s_d and evidence contributions from the top sentences in the document as determined by the BERT model. The parameters α and the w_i’s can be tuned via cross-validation.

Results and Analysis. Birch results are reported in Table 10 with BERTLarge on the Robust04, Core17, and Core18 test collections (see Section 2.7), with metrics directly copied from Akkalyoncu Yilmaz et al. [2019b]. To be explicit, the query tokens q fed into BERT come from the “title” portion of the TREC topics (see Section 2.2), i.e., short keyword phrases. This distinction will become important when we discuss Dai and Callan [2019b] next. The results in the table are based on reranking the top k = 1000 candidates using BM25 from Anserini for first-stage retrieval using the topic titles as bag-of-words queries.
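As a concrete illustration of Eq. (18), the sketch below combines a first-stage document score with the top-n sentence scores. It is a minimal sketch under assumed inputs: the per-sentence scores would come from a monoBERT-style relevance classifier applied to each sentence (not shown), and in practice the raw BM25 and BERT scores would typically be normalized before interpolation.

```python
import numpy as np

def birch_score(doc_score: float,
                sentence_scores: np.ndarray,
                alpha: float,
                w: np.ndarray) -> float:
    """Combine the first-stage document score with the top-n BERT sentence
    scores, following Eq. (18):
        s_f = alpha * s_d + (1 - alpha) * sum_i w_i * s_i
    """
    k = min(len(w), len(sentence_scores))
    top_k = np.sort(sentence_scores)[::-1][:k]   # top-k sentence scores, descending
    return alpha * doc_score + (1.0 - alpha) * float(np.dot(w[:k], top_k))

# Toy example: a BM25 document score plus scores for five sentences; alpha and
# the per-sentence weights w would be tuned by cross-validation in practice.
s_d = 12.4                                       # first-stage (e.g., BM25) score
s_i = np.array([0.91, 0.35, 0.78, 0.10, 0.66])   # per-sentence BERT scores
print(birch_score(s_d, s_i, alpha=0.6, w=np.array([1.0, 0.5, 0.25])))
```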
See the authors’ paper for detailed experimental settings. Note that none of these collections were used to fine-tune the BERT relevance models; the only learned parameters are the weights in Eq. (18). The top row shows the BM25 + RM3 query expansion baseline. The column groups present model effectiveness on the Robust04, Core17, and Core18 test collections. Each row describes an experimental condition: nS indicates that inference was performed on the top n scoring sentences from each document. Up to three sentences were considered; the authors reported that more sentences did not yield any improvements in effectiveness. The notation in parentheses describes the fine-tuning procedure: MB indicates that BERT was fine-tuned on data from the TREC Microblog Tracks; MSM indicates that BERT was fine-tuned on data from the MS MARCO passage retrieval test collection; MSM → MB refers to a model that was first pre–fine-tuned on the MS MARCO passage data and then further fine-tuned on MB.95 Table 10 also includes results of significance testing using paired t-tests, comparing each condition to the BM25 + RM3 baseline. Statistically significant differences (p < 0.01), with appropriate Bonferroni correction, are denoted by the symbol † next to the result. Birch fine-tuned on microblog data (MB) alone significantly outperforms the BM25 + RM3 baseline for all three metrics on Robust04. On Core17 and Core18, significant increases in MAP are observed</p><p>95Akkalyoncu Yilmaz et al. [2019b] did not call this pre–fine-tuning since the term was introduced later.</p><p>69 as well (and other metrics in some cases). In other words, the relevance classification model learned from labeled tweet data successfully transferred over to news articles despite the large aforementioned differences in domain. Interestingly, Akkalyoncu Yilmaz et al. [2019b] reported that fine-tuning on MS MARCO alone yields smaller gains over the baselines compared to fine-tuning on tweets. The gains in MAP are statistically significant for Robust04 and Core17, but not Core18. In her thesis, Akkalyoncu Yilmaz [2019] conducted experiments that offered an explanation: this behavior is attributable to mismatches in input text length between the training and test data. The average length of the tweet training examples is closer to the average length of sentences in Robust04 than the passages in the MS MARCO passage corpus (which are longer). By simply truncating the MS MARCO training passages to the average length of sentences in Robust04 and fine-tuning the model with these new examples, Akkalyoncu Yilmaz reported a large boost in effectiveness: 0.3300 MAP on Robust04. While this result is still below fine-tuning only with tweets, simply truncating MS MARCO passages also degrades the quality of the dataset, in that it could have discarded the relevant portions of the passages, thus leaving behind an inaccurate relevance label. The best condition in Birch is to pre–fine-tune with MS MARCO passages, and then further fine-tune with tweet data, which yields effectiveness that is higher than fine-tuning with either dataset alone. Looking across all fine-tuning configurations of Birch, it appears that the top-scoring sentence of each candidate document alone is a good indicator of document relevance. Additionally considering the second ranking sentence yields at most a minor gain, and in some cases, adding a third sentence actually causes effectiveness to drop. 
In all cases, however, contributions from BM25 scores remain important—the model places non-negligible weight on α in Eq. (18). This result does not appear to be consistent with the monoBERT experiments described in Figure 10, which shows that beyond defining the top k candidates fed to monoBERT, BM25 scores do not provide any additional relevance signal, and in fact interpolating BM25 scores hurts effectiveness. The two models, of course, are evaluated on different test collections, but the question of whether exact term match scores are still necessary for relevance classification with BERT remains not completely resolved. The thesis of Akkalyoncu Yilmaz [2019] described additional ablation experiments that reveal interesting insights about the behavior of BERT for document ranking. It has long been known (see discussion in Section 1.2.2) that modeling the relevance between queries and documents requires a combination of exact term matching (i.e., matching the appearance of query terms in the text) as well as “semantic matching”, which encompasses attempts to capture a variety of linguistic phenomena including synonymy, paraphrases, etc. What is the exact role that each plays in BERT? To answer this question, Akkalyoncu Yilmaz [2019] performed an ablation experiment where all sentences that contain at least one query term were discarded; this had the effect of eliminating all exact match signals and forced BERT to rely only on semantic match signals. As expected, effectiveness was much lower, reaching only 0.3101 MAP on Robust04 in the best model configuration, but the improvement over the BM25 + RM3 baseline (0.2903 MAP) remained statistically significant. This result suggests that with BERT, semantic match signals make important contributions to relevance matching. As an anecdotal example, for the query “international art crime”, in one relevant document, the following sentence was identified as the most relevant: “Three armed robbers take 21 Renaissance paintings worth more than $5 million from a gallery in Zurich, Switzerland.” Clearly, this sentence contains no terms from the query, yet provides information relevant to the information need. An analysis of the attention patterns shows strong associations between “art” and “paintings” and between “crime” and “robbers” in the different transformer encoder layers. Here, we see that BERT accurately captures semantically important matches for the purposes of modeling query–document relevance, providing qualitative evidence supporting the conclusion above. To provide some broader context for the level of effectiveness achieved by Birch: Akkalyoncu Yilmaz et al. [2019b] claimed to have reported the highest known MAP at the time of publication on the Robust04 test collection. This assertion appears to be supported by the meta-analysis of Yang et al. [2019b], who analyzed over 100 papers up until early 2019 and placed the best neural model at 0.3124 [Dehghani et al., 2018]. These results also exceeded the previous best known score of 0.3686, a non-neural method based on ensembles [Cormack et al., 2009] reported in 2009. On the same dataset, CEDR [MacAvaney et al., 2019a] (which we discuss in Section 3.3.3) achieved a slightly higher nDCG@20 of 0.5381, but the authors did not report MAP. BERT–MaxP (which we discuss next in Section 3.3.2) reported 0.529 nDCG@20. 
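Returning to the ablation of Akkalyoncu Yilmaz [2019] described above, which removes all sentences containing at least one query term so that only semantic match signals remain, a minimal sketch of that filtering step might look as follows. Tokenization is deliberately simplified here; the real experiments would additionally handle stemming, stopwords, and punctuation.

```python
from typing import List

def remove_exact_match_sentences(query: str, sentences: List[str]) -> List[str]:
    """Keep only sentences that share no terms with the query, eliminating
    exact term match signals so that any remaining relevance evidence must
    come from 'semantic' matching."""
    query_terms = set(query.lower().split())
    return [s for s in sentences
            if not query_terms & set(s.lower().split())]

sentences = [
    "Three armed robbers take 21 Renaissance paintings from a gallery in Zurich.",
    "The international art market reacted swiftly to the crime.",
]
print(remove_exact_match_sentences("international art crime", sentences))
# Only the first sentence survives: it contains no query terms.
```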
It seems clear that the “first wave” of text ranking</p><p>70 models based on BERT was able to outperform pre-BERT models and at least match the best non- neural techniques known at the time.96 These scores, in turn, have been bested by even newer ranking models such as PARADE [Li et al., 2020a] (Section 3.3.4) and monoT5 (Section 3.5.3). The best Birch model also achieved a higher MAP than the best TREC submissions that did not use past labels or involve human intervention for both Core17 and Core18, although both test collections were relatively new at the time and thus had yet to receive much attention from researchers.</p><p>Additional Studies. Li et al. [2020a] introduced a Birch variant called Birch–Passage, which differs in four ways: (1) the model is trained end-to-end, (2) it is fine-tuned with relevance judgments on the target corpus (with pre–fine-tuning on the MS MARCO passage ranking test collection) rather than being used in a zero-shot setting, (3) it takes passages rather than sentences as input, and (4) it does not combine retrieval scores from the first-stage ranker. In more detail: Passages are formed by taking sequences of 225 tokens with a stride of 200 tokens. As with the original Birch design, Birch–Passage combines relevance scores from the top three passages. To train the model end-to-end, a fully-connected layer with all weights initially set to one is used to combine the three scores; this is equivalent to a weighted summation. Instead of BERTLarge as in the original work, Li et al. [2020a] experimented with BERTBase as well as the ELECTRA Base variant. ELECTRA [Clark et al., 2020b] can be described as a BERT variant that attempts to improve pretraining by substituting its masked language model pretraining task with a replaced token detection task, in which the model predicts whether a given token has been replaced with a token produced by a separate generator model. The contextual representations learned by ELECTRA were empirically shown to outperform those from BERT on various natural language processing tasks given the same model size, data, and compute. Results copied from Li et al. [2020a] are shown in row (4) in Table 11. The “Title” and “Description” columns denote the effectiveness of using different parts of a TREC topic in the input template fed to the model for reranking; the original Birch model only experimented with topic titles. The effectiveness differences between these two conditions were first observed by Dai and Callan [2019b] in the context of MaxP, and thus we defer our discussions until there. Comparing these results to the original Birch experiments, repeated in row (3) from row (4c) in Table 10, it seems that one or more of the changes in Birch–Passage increased effectiveness. However, due to differences in experimental design, it is difficult to isolate the source of the improvement. To better understand the impact of various design decisions made in Li et al. [2020a] and Akka- lyoncu Yilmaz et al. [2019b], we conducted additional experiments with Birch–Passage using the Capreolus toolkit [Yates et al., 2020]; to date, these results have not be reported elsewhere. In addition to the various conditions examined by Li et al., we also considered the impact of linear interpolation with first-stage retrieval scores and the impact of pre–fine-tuning. These experiments used the same first-stage ranking, folds, hyperparameters, and codebase as Li et al. [2020a], thus enabling a fair and meaningful comparison. 
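The end-to-end trainable score combination in Birch–Passage can be illustrated with a short PyTorch sketch: a single fully-connected layer over the top-three passage scores, with weights initialized to one so that it starts out as a plain summation. The module name and the toy scores are our own; this is a sketch of the idea rather than the Capreolus implementation.

```python
import torch
import torch.nn as nn

class TopKScoreCombiner(nn.Module):
    """Combine the top-k passage scores with a single fully-connected layer
    whose weights start at one, i.e., an (initially unweighted) summation
    that can be adjusted during end-to-end training."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.linear = nn.Linear(k, 1, bias=False)
        nn.init.ones_(self.linear.weight)

    def forward(self, passage_scores: torch.Tensor) -> torch.Tensor:
        # passage_scores: (batch, num_passages) relevance scores from the encoder
        k = self.linear.in_features
        top_k, _ = passage_scores.topk(k, dim=1)
        return self.linear(top_k).squeeze(-1)   # (batch,) document scores

combiner = TopKScoreCombiner(k=3)
scores = torch.tensor([[0.2, 0.9, 0.4, 0.7, 0.1]])
print(combiner(scores))  # initially 0.9 + 0.7 + 0.4 = 2.0 (before any training)
```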
Results are shown in Table 11, grouped into “no interpolation” and interpolation “with BM25 + RM3” columns. These model configurations provide a bridge that allows us to compare the results of Li et al. [2020a] and Akkalyoncu Yilmaz et al. [2019b] in a way that lets us better attribute the impact of different design choices. Rows (5a) and (5b) represent Birch–Passage using either BERTBase or ELECTRA Base, without pre–fine-tuning in both cases. It seems clear that a straightforward substitution of ELECTRA Base for BERTBase yields a gain in effectiveness. Here, model improvements on general NLP tasks reported by Clark et al. [2020b] do appear to translate into effectiveness gains in document ranking. Comparing the interpolated results on title (keyword) queries, we see that Birch–Passage performs slightly worse than the original Birch model, row (3), when using BERTBase, row (5a), and slightly better than the original Birch model when using ELECTRA Base, row (5b). While ELECTRA Base is about one-third the size of BERTLarge, it is worth noting that Birch–Passage has the advantage of being fine-tuned on Robust04. These results can be viewed as a replication (i.e., independent implementation) of the main ideas behind Birch, as well as a demonstration of their generalizability, since we see that a number of different design choices lead to comparable levels of effectiveness.

96 The comparison to Cormack et al. [2009], however, is not completely fair due to its use of ensembles, whereas Birch, BERT–MaxP, and CEDR are all individual ranking models.

                                                                 Robust04
                                                      No interpolation      with BM25 + RM3
                                                      nDCG@20               nDCG@20
Method                                                Title      Desc       Title      Desc
(1) BM25                                              0.4240     0.4058     -          -
(2) BM25 + RM3                                        0.4514     0.4307     -          -
(3) Birch (MS→MB, BERTLarge) = Table 10, row (4c)     -          -          0.5325     -
(4) Birch–Passage (ELECTRA Base w/ MSM pFT)
    [Li et al., 2020a]                                0.5454     0.5931     -          -
(5a) Birch–Passage (BERTBase, no pFT)                 0.4959†    0.5502†    0.5260     0.5723
(5b) Birch–Passage (ELECTRA Base, no pFT)             0.5259     0.5611     0.5479     0.5872

Table 11: The effectiveness of Birch variants on the Robust04 test collection using title and description queries with and without BM25 + RM3 interpolation. Statistically significant decreases in effectiveness from Birch–Passage (ELECTRA Base) are indicated with the symbol † (two-tailed paired t-test, p < 0.05, with Bonferroni correction).

Also from rows (5a) and (5b), we can see that both Birch–Passage variants benefit from linear interpolation with BM25 + RM3 as the first-stage ranker. Comparing title and description queries, Birch–Passage performs better with description queries regardless of the interpolation setting and which BERT variant is used (more discussion next, in the context of MaxP). Row (5b) vs. row (4) illustrates the effects of pre–fine-tuning, which is the only difference between those two conditions. It should be no surprise that first fine-tuning with a very large, albeit out-of-domain, dataset has a beneficial impact on effectiveness. In Section 3.3.2, we present additional experimental evidence supporting the effectiveness of this technique.

Takeaway Lessons. Summarizing, there are two important takeaways from Birch:

1. BERT exhibits strong zero-shot cross-domain relevance classification capabilities when used in a similar way as monoBERT. That is, we can train a BERT model using relevance judgments from one domain (e.g., tweets) and directly apply the model to relevance classification in a different domain (e.g., newswire articles) and achieve a high level of effectiveness.
2. The relevance score of the highest-scoring sentence in a document is a good proxy for the relevance of the entire document. In other words, it appears that document-level relevance can be accurately estimated by considering only a few top sentences.

The first point illustrates the power of BERT, likely attributable to the wonders of pretraining. The finding with Birch is consistent with other demonstrations of BERT’s zero-shot capabilities, for example, in question answering [Petroni et al., 2019]. We return to elaborate on this observation in Section 3.5.3 in the context of ranking with sequence-to-sequence models and also in Section 6.2 in the context of domain-specific applications. The second point is consistent with previous findings in the information retrieval literature as well as the BERT–MaxP model that we describe next. We defer a more detailed discussion of this takeaway until after presenting that model.

3.3.2 Passage Score Aggregation: BERT–MaxP and Variants

Another solution to the length limitations of BERT is offered by Dai and Callan [2019b], which can be summarized as follows:

• For training, don’t worry about it! Segment documents into overlapping passages: treat all segments from a relevant document as relevant and all segments from a non-relevant document as not relevant.
• For the inference problem, segment documents in the same way, estimate the relevance of each passage, and then perform simple aggregation of the passage relevance scores (taking the maximum, for example; see more details below) to arrive at the document relevance score.

In more detail, documents are segmented into passages using a 150-word sliding window with a stride of 75 words. Window width and stride length are hyperparameters, but Dai and Callan [2019b] did not report experimental results exploring the effects of different settings. Inference on the passages is the same as in Birch and in monoBERT, where for each passage p_i ∈ D, the following sequence is constructed and fed to BERT as the input template:

    [[CLS], q, [SEP], p_i, [SEP]]        (20)

where q is the query. The [CLS] token is then fed into a fully-connected layer (exactly as in monoBERT) to produce a score s_i for passage p_i.97 The passage relevance scores {s_i} are then aggregated to produce the document relevance score s_d according to one of three approaches:

• BERT–MaxP: take the maximum passage score as the document score, i.e., s_d = max_i s_i.
• BERT–FirstP: take the score of the first passage as the document score, i.e., s_d = s_1.
• BERT–SumP: take the sum of all passage scores as the document score, i.e., s_d = Σ_i s_i.

97 According to the original paper, this was accomplished with a multi-layer perceptron; however, our description is more accurate, based on personal communications with the first author.

                               Robust04              ClueWeb09b
                               nDCG@20               nDCG@20
Model                          Title      Desc       Title      Desc
(1) BoW                        0.417      0.409      0.268      0.234
(2) SDM                        0.427      0.427      0.279      0.235
(3) LTR                        0.427      0.441      0.295      0.251
(4a) BERT–FirstP               0.444†     0.491†     0.286      0.272†
(4b) BERT–MaxP                 0.469†     0.529†     0.293      0.262†
(4c) BERT–SumP                 0.467†     0.524†     0.289      0.261
(5) BERT–FirstP (Bing pFT)     -          -          0.333†     0.300†

Table 12: The effectiveness of different passage score aggregation approaches on the Robust04 and ClueWeb09b test collections. The symbol † denotes significant improvements over LTR (p < 0.05).

Another interesting aspect of this work is an exploration of different query representations that are fed into BERT, which is the first study of its type that we are aware of.
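A minimal sketch of the segmentation and score aggregation scheme described above is shown below. The sliding-window parameters follow Dai and Callan [2019b] (150-word windows with a 75-word stride); the per-passage scoring function is a stand-in for a fine-tuned monoBERT-style classifier and is our own placeholder.

```python
from typing import Callable, List

def segment(document: str, window: int = 150, stride: int = 75) -> List[str]:
    """Split a document into overlapping passages of `window` words,
    advancing `stride` words at a time and always covering the document tail."""
    words = document.split()
    passages, start = [], 0
    while True:
        passages.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
        start += stride
    return passages

def document_score(query: str, document: str,
                   score_passage: Callable[[str, str], float],
                   how: str = "max") -> float:
    """Aggregate per-passage relevance scores into a document score."""
    scores = [score_passage(query, p) for p in segment(document)]
    if how == "max":      # BERT-MaxP
        return max(scores)
    if how == "first":    # BERT-FirstP
        return scores[0]
    return sum(scores)    # BERT-SumP

# Placeholder scorer (hypothetical); a real system would call a fine-tuned
# BERT relevance classifier on the (query, passage) pair here.
def toy_scorer(query: str, passage: str) -> float:
    return float(sum(passage.lower().split().count(t)
                     for t in query.lower().split()))

doc = "a long news article with many hundreds of words about air traffic"
print(document_score("air traffic controller", doc, toy_scorer, how="max"))
```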
Recall that in Birch, BERT input is composed from the “title” portion of TREC topics, which typically comprises a few keywords, akin to queries posed to web search engines today (see Section 2.2). In addition to using these as queries, Dai and Callan [2019b] also investigated using the sentence-long natural language “description” fields as query representations fed to BERT. As the experimental results show, this choice has a large impact on effectiveness.

Results and Analysis. Main results on the Robust04 and ClueWeb09b test collections (see Section 2.7), in terms of nDCG@20, copied from Dai and Callan [2019b], are presented in Table 12. Just like in Birch and monoBERT, the retrieve-and-rerank strategy was used—in this case, the candidate documents were supplied by bag-of-words default ranking with the Indri search engine.98 These results are shown in row (1) as “BoW”. The top k = 100 results, with either title or description queries, were reranked with BERTBase; for comparison, note that Birch reranked with k = 1000. Different aggregation techniques were compared against two baselines: SDM, shown in row (2), refers to the sequential dependence model [Metzler and Croft, 2005]. On top of bag-of-words queries (i.e., treating all terms as independent unigrams), SDM contributes evidence from query bigrams that occur in the documents (both ordered and unordered). Previous studies have validated the empirical effectiveness of this technique, and in this context SDM illustrates how keyword queries can take advantage of simple “structure” present in the query (based purely on linear word order). As another point of comparison, the effectiveness of a simple learning-to-rank approach was also examined, shown in row (3) as “LTR”. The symbol † denotes improvements over LTR that are statistically significant (p < 0.05). Without pre–fine-tuning, the overall gains coming from BERT on ranking web pages (ClueWeb09b) are modest at best, and for title queries none of the aggregation techniques even beat the LTR baseline.

98 The ranking model used was query-likelihood with Dirichlet smoothing (µ = 2500); this detail was omitted from the original paper, filled in here based on personal communications with the authors.

                                           Robust04 nDCG@20
Method                                     Avg. Length    SDM      MaxP
(1) Title                                  3              0.427    0.469
(2a) Description                           14             0.404    0.529
(2b) Description, keywords                 7              0.427    0.503
(3a) Narrative                             40             0.278    0.487
(3b) Narrative, keywords                   18             0.332    0.471
(3c) Narrative, negative logic removed     31             0.272    0.489

Table 13: The effectiveness of SDM and BERT–MaxP using different query types on the Robust04 test collection.

Pre–fine-tuning BERT–FirstP on a Bing query log significantly improves effectiveness, row (5), demonstrating that BERT can be effective in this setting with sufficient training data.99 Since it is unclear what conclusions can be drawn from the web test collections, we focus the remainder of our analysis on Robust04. Comparing the different aggregation techniques, the MaxP approach appears to yield the highest effectiveness. The low effectiveness of FirstP on Robust04 is not very surprising, since it is not always the case that relevant material appears at the beginning of a news article.

99 As a historical note, although Dai and Callan [2019b] did not use the terminology of pre–fine-tuning, this work represents one of the earliest examples of the technique, as articulated in Section 3.2.4.
Results show that SumP is almost as effective as MaxP, despite having the weakness that it performs no length normalization; longer documents will tend to have higher scores, thus creating a systematic bias against shorter documents. Looking at the bag-of-words baseline, row (1), the results are generally consistent with the literature: We see that short title queries are more effective than sentence-length description queries; the drop is bigger for ClueWeb09b (web pages) than Robust04 (newswire articles). However, reranking with the descriptions as input to BERT is significantly more effective than reranking with titles, at least for Robust04. This means that BERT is able to take advantage of richer natural language descriptions of the information need. This finding appears to be robust, as the Birch–Passage experimental results shown in Table 11 confirm the higher effectiveness of description queries over title queries as well.

Dai and Callan [2019b] further investigated the intriguing finding that reranking documents using description queries is more effective than using title queries, as shown in Table 12. In addition to considering the description and narrative fields from the Robust04 topics, they also explored a “keyword” version of those fields, stripped of punctuation as well as stopwords. For the narrative, they also discarded “negative logic” that may be present in the prose. For example, consider topic 697:

Title: air traffic controller
Description: What are working conditions and pay for U.S. air traffic controllers?
Narrative: Relevant documents tell something about working conditions or pay for American controllers. Documents about foreign controllers or individuals are not relevant.

In this topic, the second sentence in the narrative states relevance in a negative way, i.e., what makes a document not relevant. These are removed in the “negative logic removed” condition. Results of these experiments are shown in Table 13, where the rows show the different query conditions described above.100 For each of the conditions, the average length of the query is provided: as expected, descriptions are longer than titles, and narratives are even longer. It is also not surprising that removing stopwords reduces the average length substantially. In these experiments, SDM (see above) is taken as a point of comparison, since it represents a simple attempt to exploit “structure” that is present in the query representations. Comparing the “title” query under SDM and the BoW results in Table 12, we can confirm that SDM does indeed improve effectiveness. The MaxP figures in the first two rows of Table 13 are identical to the numbers presented in Table 12 (same experimental conditions, just arranged differently). For SDM, we see that using description queries decreases effectiveness compared to the title queries, row (2a). In contrast, BERT is able to take advantage of the linguistically richer description field to improve ranking effectiveness, also row (2a). If we use only the keywords that are present in the description (only about half of the terms), SDM is able to “gain back” its lost effectiveness, row (2b).

100 For these experiments, stopword filtering in Indri (used for first-stage retrieval) was disabled (personal communication with the authors).
We also see from row (2b) that removing stopwords and punctuation from the description decreases effectiveness with BERT–MaxP. This is worth restating in another way: stopwords (that is, non-content words) contribute to ranking effectiveness in the input sequence fed to BERT for inference. These terms, by definition, do not contribute content; instead, they provide the linguistic structure to help the model estimate relevance. This behavior makes sense because BERT was pretrained on well-formed natural language text, and thus removing non-content words during fine-tuning and inference creates distributional mismatches that degrade model effectiveness. Looking at the narratives, which on average are over ten times longer than the title queries, we see the same general pattern.101 SDM is not effective with long narrative queries, as it becomes “confused” by extraneous words present that are not central to the information need, row (3a). By focusing only on the keywords, SDM performs much better, but still worse than title queries, row (3b). Removing negative logic has minimal impact on effectiveness compared to the full narrative queries, as the queries are still quite long, row (3c). For BERT–MaxP, reranking with full topic narratives beats reranking with only topic titles, but this is still worse than reranking with topic descriptions, row (3a). As is consistent with the descriptions case, retaining only keywords hurts effectiveness, demonstrating the important role that non-content words play. For BERT, removing the negative logic has negligible effect overall, just as with SDM; there doesn’t seem to be sufficient evidence to draw conclusions about each model’s ability to handle negations. To further explore these findings, Dai and Callan [2019b] conducted some analyses of attention patterns in their model, similar to some of the studies discussed in Section 3.2.2, although not in a systematic manner. Nevertheless, they reported a few intriguing observations: for the description query “Where are wind power installations located?”, a high-scoring passage contains the sentence “There were 1,200 wind power installations in Germany.” Here, the preposition in the document “in” received the strongest attention from the term “where” in the topic description. The preposition appears in the phrase “in Germany”, which precisely answers a “where” question. This represents a concrete example where non-content words play an important role in relevance matching: these are exactly the types of terms that would be discarded with exact match techniques! Are we able to make meaningful comparisons between Birch and BERT–MaxP based on available experimental evidence? Given that they both present evaluation results on Robust04, there is a common point for comparison. However, there are several crucial differences that make this comparison difficult: Birch uses BERTLarge whereas BERT–MaxP uses BERTBase. All things being equal, a larger (deeper) transformer model will be more effective. There are more differences: BERT–MaxP only reranks the top k = 100 results from first-stage retrieval, whereas Birch reranks the top k = 1000 hits. For this reason, computing MAP (at the standard cutoff of rank 1000) for BERT–MaxP would not yield a fair comparison to Birch; however, as nDCG is an early-precision metric, it is less affected by reranking depth. Additionally, Birch combines evidence from the original BM25 document scores, whereas BERT–MaxP does not consider scores from first-stage retrieval (cf. 
results of interpolation experiments in Section 3.2.2). Finally, there is the issue of training. Birch operates in a zero-shot transfer setting, since it was fine-tuned on the MS MARCO passage ranking test collection and TREC Microblog Track data; Robust04 data was used only to learn the sentence weight parameters. In contrast, the BERT–MaxP results come from fine-tuning directly on Robust04 data in a cross-validation setting. Obviously, in-domain training data should yield higher effectiveness, but the heuristic of constructing overlapping passages and simply assuming that they are relevant leads inevitably to noisy training examples. In contrast, Birch benefits from far more training examples from MS MARCO (albeit out of domain). It is unclear how to weigh the effects of these different training approaches. In short, there are too many differences between Birch and BERT–MaxP to properly isolate and attribute effectiveness differences to specific design choices, although as a side effect of evaluating PARADE, a model we discuss in Section 3.3.4, Li et al. [2020a] presented experiment results that try to factor away these differences. Nevertheless, on the whole, the effectiveness of the two approaches is quite comparable: in terms of nDCG@20, 0.529 for BERT–MaxP with description, 0.533 for Birch</p><p>101Here, BERT is reranking results from title queries in first-stage retrieval.</p><p>75 with three sentences (MS MARCO → MB fine-tuning) reported in row (4c) of Table 10. Nevertheless, at a high level, the success of these two models demonstrates the robustness and simplicity of BERT-based approaches to text ranking. This also explains the rapid rise in the popularity of such models—they are simple, effective, and easy to replicate.</p><p>Additional Studies. Padaki et al. [2020] followed up the work of Dai and Callan [2019b] to explore the potential of using query expansion techniques (which we cover in Section 4.2) to generate better queries for BERT-based rankers. In one experiment, they scraped Google’s query reformulation suggestions based on the topic titles, which were then manually filtered to retain only those that were well-formulated natural language questions semantically similar to the original topic descriptions. While reranking using these suggestions was not as effective as reranking using the original topic descriptions, they still improved over reranking with titles (keywords) only. This offers additional supporting evidence that BERT not only exploits relevance signals in well-formed natural language questions, but critically depends on them to achieve maximal effectiveness. The work of Dai and Callan [2019b] was successfully replicated by Zhang et al. [2021] on Robust04 starting from an independent codebase. They performed additional experiments evaluating BERT– MaxP on another dataset (Gov2) and investigated the effects of using a simple BERT variant in place of BERTBase (see Section 3.2.2). The authors largely followed the experimental setup used in the original work, with two different design choices intended to examine the generalizability of the original results: a different set of folds was used and the first-stage retrieval results were obtained using BM25 with RM3 expansion rather than using query likelihood. Results on Gov2 are in agreement with those on Robust04 using both title and description queries: MaxP aggregation outperformed FirstP and SumP, as well as a newly introduced AvgP variant that takes the mean of document scores. 
In terms of BERT variants, Zhang et al. [2021] experimented with RoBERTa (introduced in Sec- tion 3.2.2), ELECTRA (introduced in Section 3.3.1), and another model called ALBERT [Lan et al., 2020]. ALBERT reduces the memory footprint of BERT by tying the weights in its transformer layers together (i.e., it uses the same weights in every layer). Results of Zhang et al. [2021] combining MaxP aggregation with different BERT variants are shown in Table 14, copied directly from their paper. For convenience, we repeat the reference MaxP condition of Dai and Callan [2019b] from row (4b) in Table 12 as row (1). Row group (2) shows the effect of replacing BERT with one of its variants; none of these conditions used pre–fine-tuning. While these model variants sometimes outperform BERTBase, Zhang et al. [2021] found that none of the improvements were statistically significant according to a two-tailed t-test (p < 0.01) with Bonferroni correction. It is worth noting that this includes the comparison between BERTBase and BERTLarge; BERTBase appears to be more effective on Gov2 (although the difference is not statistically significant either). Rows (3a) and (3b) focus on the comparison between BERTBase and ELECTRA Base with pre–fine-tuning on the MS MARCO passage ranking task (denoted “MSM pFT”). Zhang et al. [2021] reported that the improvement in this case of ELECTRA Base over BERTBase is statistically significant in three of the four settings based on two-tailed t-test (p < 0.01) with Bonferroni correction. If we combine this finding with the Birch–Passage results presented in Table 11, row (5b), there appears to be multiple sources of evidence suggesting that ELECTRA Base is more effective than BERTBase for text ranking tasks.</p><p>Takeaway Lessons. There are two important takeaways from the work from Dai and Callan [2019b]:</p><p>• Simple maximum passage score aggregation—taking the maximum of all the passage relevance scores as the document relevance score—works well. This is a robust finding that has been replicated and independently verified. • BERT can exploit linguistically rich descriptions of information needs that include non-content words to estimate relevance, which appears to be a departure from previous keyword search techniques.</p><p>The first takeaway is consistent with Birch results. Conceptually, MaxP is quite similar to the “1S” condition of Birch, where the score of the top sentence is taken as the score of the document. Birch reported at most small improvements, if any, when multiple sentences are taken into account, and no improvements beyond the top three sentences. The effectiveness of both techniques is also consistent with previous results reported in the information retrieval literature. There is a long thread of work,</p><p>76 Robust04 Gov2 nDCG@20 nDCG@20 Model Title Desc Title Desc (1) BERT–MaxP = Table 12, row (4b) 0.469 0.529 - -</p><p>(2a) BERTBase 0.4767 0.5303 0.5175 0.5480 (2b) BERTLarge 0.4875 0.5448 0.5161 0.5420 (2c) ELECTRA Base 0.4959 0.5480 0.4841 0.5152 (2d) RoBERTa Base 0.4938 0.5489 0.4679 0.5370 (2e) ALBERT Base 0.4632 0.5400 0.5354 0.5459</p><p>(3a) BERTBase (MSM pFT) 0.4857 0.5476 0.5473 0.5788 † † † (3b) ELECTRA Base (MSM pFT) 0.5225 0.5741 0.5624 0.6062</p><p>Table 14: The effectiveness of different BERT variants using MaxP passage score aggregation on the Robust04 and Gov2 test collections. 
Statistically significant increases in effectiveness over the corresponding BERTBase model are indicated with the symbol † (two-tailed t-test, p < 0.01, with Bonferroni correction). dating back to the 1990s, that leverages passage retrieval techniques for document ranking [Salton et al., 1993, Hearst and Plaunt, 1993, Callan, 1994, Wilkinson, 1994, Kaszkiel and Zobel, 1997, Clarke et al., 2000]—that is, aggregating passage-level evidence to estimate the relevance of a document. In fact, both the “Max” and “Sum” aggregation techniques were already explored over a quarter of a century ago in Hearst and Plaunt [1993] and Callan [1994], albeit the source of passage-level evidence was far less sophisticated than the transformer models of today. Additional evidence from user studies suggest why BERT–MaxP and Birch work well: it has been shown that providing users concise summaries of documents can shorten the amount of time required to make relevance judgments, without adversely affecting quality (compared to providing users with the full text) [Mani et al., 2002]. This finding was recently replicated and expanded upon by Zhang et al. [2018], who found that showing users only document extracts reduced both assessment time and effort in the context of a high-recall retrieval task. In a relevance feedback setting, presenting users with sentence extracts in isolation led to comparable accuracy but reduced effort compared to showing full documents [Zhang et al., 2020b]. Not only from the perspective of ranking models, but also from the perspective of users, well-selected short extracts serve as good proxies for entire documents for the purpose of assessing relevance. There are caveats, however: results presented later in Section 3.3.5 suggest that larger portions of documents need to be considered to differentiate between different grades of relevance (e.g., relevant vs. highly relevant).</p><p>3.3.3 Leveraging Contextual Embeddings: CEDR Just as in applications of BERT to classification tasks in NLP (see Section 3.1), monoBERT, Birch, and BERT–MaxP use only the final representation of the [CLS] token to compute query–document relevance scores. Specifically, all of these models discard the contextual embeddings that BERT produces for both the query and the candidate text. Surely, representations of these terms can also be useful for ranking? Starting from this question, MacAvaney et al. [2019a] were the first to explore the use of contextual embeddings from BERT for text ranking by incorporating them into pre-BERT interaction-based neural ranking models. Their approach, Contextualized Embeddings for Document Ranking (CEDR), addressed BERT’s input length limitation by performing chunk-by-chunk inference over the document and then assembling relevance signals from each chunk. From the scientific perspective, MacAvaney et al. [2019a] investigated whether BERT’s contextual embeddings outperform static embeddings when used in a pre-BERT neural ranking model and whether they are complementary to the more commonly used [CLS] representation. They hypothe- sized that since interaction-based models rely on the ability of the underlying embeddings to capture semantic term matches, using richer contextual embeddings to construct the similarity matrix should improve the effectiveness of interaction-based neural ranking models. Specifically, CEDR uses one of three neural ranking models as a “base”: DRMM [Guo et al., 2016], KNRM [Xiong et al., 2017], and PACRR [Hui et al., 2017]. 
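To make the notion of a similarity matrix built from contextual embeddings concrete, the sketch below constructs one cosine-similarity matrix per encoder layer between query and document term embeddings, which is the kind of input an interaction-based model such as KNRM consumes. The tensor shapes, the einsum formulation, and the use of random embeddings are our own illustrative choices; CEDR's chunking and [CLS] handling (described below) are omitted.

```python
import torch
import torch.nn.functional as F

def similarity_matrices(query_emb: torch.Tensor,
                        doc_emb: torch.Tensor) -> torch.Tensor:
    """Build one cosine-similarity matrix per encoder layer.

    query_emb: (num_layers, query_len, hidden) contextual query term embeddings
    doc_emb:   (num_layers, doc_len, hidden) contextual document term embeddings
    returns:   (num_layers, query_len, doc_len) similarity matrices
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    return torch.einsum("lqh,ldh->lqd", q, d)

# Toy shapes: 13 "layers" (embedding layer + 12 encoder layers), a short query
# and document, and hidden size 8 instead of BERT's 768.
query_emb = torch.randn(13, 4, 8)
doc_emb = torch.randn(13, 60, 8)
sims = similarity_matrices(query_emb, doc_emb)
print(sims.shape)  # torch.Size([13, 4, 60]); these matrices, together with the
                   # averaged [CLS] vector, feed the base interaction model
```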
Instead of static embeddings (e.g., from GloVe), the embeddings that feed these models now come from BERT. In addition, the aggregate [CLS] representation from BERT is concatenated to the other signals consumed by the feedforward network of each base model. Thus, the query–document relevance scores are derived from two main sources: the [CLS] token (as in monoBERT, Birch, and BERT–MaxP) and from signals derived from query–document term similarities (as in pre-BERT interaction-based models). This overall design is illustrated in Figure 12. The model is more complex than can be accurately captured in a diagram, and thus we only attempt to highlight high-level aspects of the design.

Figure 12: The architecture of CEDR, which comprises two main sources of relevance signals: the [CLS] representation and similarity matrix signals computed from the contextual embeddings of the query and the candidate text. This illustration contains a number of intentional simplifications in order to clearly convey the model’s high-level design.

To handle inputs longer than 512 tokens, CEDR splits documents into smaller chunks, as evenly as possible, such that the length of each input sequence (complete with the query and special tokens) is not longer than the 512-token maximum. BERT processes each chunk independently and the output from each chunk is retained. Once all of a document’s chunks have been processed, CEDR creates a document-level [CLS] representation by averaging the [CLS] representations from each chunk (i.e., average pooling). Unlike monoBERT, Birch, and BERT–MaxP, which discard the contextual embeddings of the query and candidate texts, CEDR concatenates the contextual embeddings of each chunk to form the complete sequence of contextual term embeddings for the entire document. Similarity matrices are then constructed by computing the cosine similarity between each query term embedding (taken from the first document chunk) and each document term embedding. The document-level [CLS] representation and the similarity matrices computed from the contextual embeddings are the relevance signals that are fed to the underlying interaction-based neural ranking model. Note that in this design, BERT is incorporated into interaction-based neural ranking models in a way that retains the differentiability of the overall model. This allows end-to-end training with relevance judgments and provides a solution to the length limitations of BERT.

Given that the input size in a transformer encoder is equal to its output size, each layer in BERT can be viewed as producing some (intermediate) contextual representation. Rather than using only the term embeddings generated by BERT’s final transformer encoder layer, CEDR constructs one similarity matrix for each layer. Analogously to how the [CLS] representation is handled, the relevance signals from each matrix are concatenated together. Unlike the contextual embeddings, though, only the final [CLS] representation is used. With the [CLS] representation and similarity matrix signals, CEDR produces a final document relevance score using the same series of fully-connected layers used by the underlying base neural ranking model. In more detail:

• CEDR–DRMM uses a fully-connected layer with five output nodes and a ReLU non-linearity followed by a fully-connected layer with a single output node.
• CEDR–KNRM uses one fully-connected layer with a single output node.
• CEDR–PACRR uses two fully-connected layers with 32 output nodes and ReLU non-linearities followed by a fully-connected layer with a single output node.

All variants are trained using a pairwise hinge loss and initialized with BERTBase. The final query–document relevance scores are then used to rerank a list of candidate documents.

As a baseline model for comparison, MacAvaney et al. [2019a] proposed what they called “Vanilla BERT”, which is an ablated version of CEDR that uses only the signals from the [CLS] representations. Specifically, documents are split into chunks in exactly the same way as the full CEDR model and the [CLS] representations from each chunk are averaged before feeding a standard relevance classifier (as in monoBERT, Birch, and BERT–MaxP). This ablated model quantifies the effectiveness impact of the query–document term interactions.

Results and Analysis. CEDR was evaluated using Robust04 and a non-standard combination of datasets from the TREC 2012–2014 Web Tracks that we simply denote as “Web” (see Section 2.7 and the original paper for details). Results in terms of nDCG@20 are shown in Table 15, with figures copied directly from MacAvaney et al. [2019a]. CEDR was deployed as a reranker over BM25 results from Anserini, the same as Birch. However, since CEDR only reranks the top k = 100 hits (as opposed to k = 1000 hits in Birch), the authors did not report MAP. Nevertheless, since nDCG@20 is an early-precision metric, the scores can be meaningfully compared. Copying the conventions used by the authors, the prefix before each result in brackets denotes significant improvements over BM25, Vanilla BERT, the corresponding model trained with GloVe embeddings, and the corresponding non-CEDR model (i.e., excluding [CLS] signals), based on paired t-tests (p < 0.05).

                                                    Robust04        Web
Method              Input Representation            nDCG@20         nDCG@20
(1) BM25            n/a                             0.4140          0.1970
(2) Vanilla BERT    BERT (fine-tuned)               [B] 0.4541      [B] 0.2895
(3a) PACRR          GloVe                           0.4043          0.2101
(3b) PACRR          BERT                            0.4200          0.2225
(3c) PACRR          BERT (fine-tuned)               [BVG] 0.5135    [BG] 0.3080
(3d) CEDR–PACRR     BERT (fine-tuned)               [BVG] 0.5150    [BVGN] 0.3373
(4a) KNRM           GloVe                           0.3871          [B] 0.2448
(4b) KNRM           BERT                            [G] 0.4318      [B] 0.2525
(4c) KNRM           BERT (fine-tuned)               [BVG] 0.4858    [BVG] 0.3287
(4d) CEDR–KNRM      BERT (fine-tuned)               [BVGN] 0.5381   [BVG] 0.3469
(5a) DRMM           GloVe                           0.3040          0.2215
(5b) DRMM           BERT                            0.3194          [BG] 0.2459
(5c) DRMM           BERT (fine-tuned)               [G] 0.4135      [BG] 0.2598
(5d) CEDR–DRMM      BERT (fine-tuned)               [BVGN] 0.5259   [BVGN] 0.3497

Table 15: The effectiveness of CEDR variants on Robust04 and the test collections from the TREC 2012–2014 Web Tracks. Significant improvements (paired t-tests, p < 0.05) are indicated in brackets, over BM25 (B), Vanilla BERT (V), the corresponding model trained with GloVe embeddings (G), and the corresponding non-CEDR model, i.e., excluding [CLS] signals (N).

In Table 15, each row group represents a particular “base” interaction-based neural ranking model, where the rows with the “CEDR–” prefix denote the incorporation of the [CLS] representations. The “Input Representation” column indicates whether static GloVe embeddings [Pennington et al., 2014] or BERT’s contextual embeddings are used.
When using contextual embeddings, the original versions from BERT may be used or the embeddings may be fine-tuned on the ranking task along with the underlying neural ranking model. When BERT is fine-tuned on the ranking task, a Vanilla BERT model is first fine-tuned before training the underlying neural ranking model. That is, BERT is first fine-tuned in the Vanilla BERT configuration for relevance classification, and then it is fine-tuned further in conjunction with a particular interaction-based neural ranking model. This is another example of the multi-step fine-tuning strategy discussed in Section 3.2.4.</p><p>79 Robust04 Method Configuration Reference nDCG@20 Birch 3S: BERT(MS MARCO → MB) Table 10, row (4c) 0.533 BERT–MaxP Description Table 12, row (4b) 0.529 CEDR–KNRM BERT (fine-tuned) Table 15, row (4d) 0.538</p><p>Table 16: The effectiveness of the best Birch, BERT–MaxP, and CEDR configurations on the Robust04 test collection.</p><p>Let us examine these results. First, consider whether contextual embeddings improve over static GloVe embeddings: the answer is clearly yes.102 Even without fine-tuning on the ranking task, BERT embeddings are more effective than GloVe embeddings across all models and datasets, which is likely attributable to their ability to better capture term context. This contrast is shown in the (b) rows vs. the (a) rows. Fine-tuning BERT yields additional large improvements for most configurations, with the exception of DRMM on the Web data. These results are shown in the (c) rows vs. the (b) rows. Next, consider the effectiveness of using only contextual embeddings in an interaction-based neural ranking model compared to the effectiveness of using only the [CLS] representation, represented by Vanilla BERT in row (2). When using contextual embeddings, the PACRR and KNRM models perform substantially better than Vanilla BERT; see the (c) rows vs. row (2). DRMM does not appear to be effective in this configuration, however, as shown in row (5c). This may be caused by the fact that DRMM’s histograms are not differentiable, which means that BERT is fine-tuned using only the relevance classification task (i.e., BERT weights are updated when Vanilla BERT is first fine-tuned, but the weights are not updated when DRMM is further fine-tuned). Nevertheless, there is some reason to suspect that the effectiveness of Vanilla BERT is under-reported, perhaps due to some training issue, because an equivalent approach by Li et al. [2020a] is much more effective (more details below). Finally, consider whether the [CLS] representation from BERT is complementary to the contextual embeddings from the remaining tokens in the input sequence. The comparison is shown in the (d) rows vs. the (c) rows, where CEDR–PACRR, CEDR–KNRM, and CEDR–DRMM represent the full CEDR model that incorporates the [CLS] representations on top of the models that use fine-tuned contextual embeddings. In all cases, incorporating the [CLS] representations improve effectiveness and the gains are significant in the majority of cases. A natural question that arises is how CEDR compares to Birch (Section 3.3.1) and BERT–MaxP (Section 3.3.2), the two other contemporaneous models in the development of BERT for ranking full documents. Fortunately, all three models were evaluated on Robust04 and nDCG@20 was reported for those experiments, which offers a common reference point. Table 16 summarizes the best configuration of each model. 
While the experimental setups are different, which prevents a fair direct comparison, we can see that the effectiveness scores all appear to be in the same ballpark. This point has already been mentioned in Section 3.3.2 but is worth repeating: it is quite remarkable that three ranking models with different designs, by three different research groups with experiments conducted on independent implementations, all produce similar results. This provides robust evidence that BERT does really "work" for text ranking.

The connection between Birch and BERT–MaxP has already been discussed in the previous section, but both models are quite different from CEDR, which has its design more firmly rooted in pre-BERT interaction-based neural ranking models. Specifically, Birch and BERT–MaxP are both entirely missing the explicit similarity matrix between query–document terms that forms a central component in CEDR, and instead depend entirely on the [CLS] representations. The CEDR experiments unequivocally show that contextual embeddings from BERT improve the quality of the relevance signals extracted from interaction-based neural ranking models and increase ranking effectiveness, but the experiments are not quite so clear on whether the explicit interactions are necessary to begin with. In fact, there is evidence to suggest that with BERT, explicit interactions are not necessary: from the discussion in Section 3.2, it might be the case that BERT's all-to-all attention patterns at each transformer layer, in effect, already capture all possible term interactions.

102 Apart from contextualization, GloVe embeddings also differ in that some terms may be out-of-vocabulary. MacAvaney et al. [2019a] attempted to mitigate this issue by ensuring that terms always have a similarity of one with themselves.

Robust04 (nDCG@20)                                            No interpolation     with BM25 + RM3
Method                                                        Title     Desc       Title     Desc
(1)  BM25                                                     0.4240    0.4058     -         -
(2)  BM25 + RM3                                               0.4514    0.4307     -         -
(3a) KNRM w/ FT BERTBase (no pFT) = Table 15, row (3c)        0.4858    -          -         -
(3b) CEDR–KNRM w/ FT BERTBase (no pFT) = Table 15, row (3d)   0.5381    -          -         -
(4a) KNRM w/ FT ELECTRA Base (MSM pFT)                        0.5470    0.6113     -         -
(4b) CEDR–KNRM w/ FT ELECTRA Base (MSM pFT)                   0.5475    0.5983     -         -
(5a) KNRM w/ FT BERTBase (no pFT)                             0.5027†‡  0.5409†‡   0.5183†‡  0.5532†‡
(5b) KNRM w/ FT ELECTRA Base (no pFT)                         0.5505    0.5954     0.5454    0.6016
(6a) CEDR–KNRM w/ FT BERTBase (no pFT)                        0.5060†‡  0.5661†‡   0.5235†‡  0.5798†‡
(6b) CEDR–KNRM w/ FT ELECTRA Base (no pFT)                    0.5326    0.5905     0.5536    0.6010

Table 17: The effectiveness of CEDR variants on the Robust04 test collection using title and description queries with and without BM25 + RM3 interpolation. In rows (5a) and (6a), statistically significant decreases in effectiveness from row (5b) and row (6b) are indicated with the symbol † and the symbol ‡, respectively (two-tailed paired t-test, p < 0.05, with Bonferroni correction).

Additional Studies. As already noted above, the Vanilla BERT experimental results by MacAvaney et al. [2019a] are not consistent with follow-up work reported by Li et al. [2020a] (more details in the next section). Researchers have also reported difficulties reproducing results for CEDR–KNRM and ablated variants using the authors' open-source code with BERTBase.103 In response, the CEDR authors have recommended resolving these issues by replacing BERTBase with ELECTRA Base and also adopting the Capreolus toolkit [Yates et al., 2020] as the reference implementation of CEDR.

103 See https://github.com/Georgetown-IR-Lab/cedr/issues/22.
Further experiments by Li et al. [2020a] with Capreolus have confirmed that CEDR is effective when combined with ELECTRA Base, but they have not affirmed the finding by MacAvaney et al. [2019a] that the [CLS] token is complementary to the contextual embeddings. Experimental results copied from Li et al. [2020a] are shown in rows (4a) and (4b) of Table 17. Comparing these rows, there does not appear to be any benefit to using the [CLS] token with title queries, and using the [CLS] token actually reduces effectiveness with description queries. Note that the results in row groups (3) and (4) are not comparable because the latter configurations have the additional benefit of pre–fine-tuning on the MS MARCO passage dataset, indicated by "MSM pFT".

To better understand the reproduction difficulties with the CEDR codebase, we replicated some of the important model configurations using the Capreolus toolkit [Yates et al., 2020] to obtain new results with the different CEDR–KNRM conditions; these results have to date not been reported elsewhere. In particular, we consider the impact of linear interpolation with the first-stage retrieval scores, the impact of using different BERT variants, and the impact of using title vs. description queries. These experiments used the same first-stage ranking, folds, hyperparameters, and codebase as Li et al. [2020a], allowing meaningful comparisons. Results are shown in row groups (5) and (6) in Table 17 and are directly comparable to the results in row group (4), but note that these results do not benefit from pre–fine-tuning.

We can view row (5a) as a replication attempt of the CEDR results in row (3a), and row (6a) as a replication attempt of the CEDR results in row (3b), since the latter in each pair of comparisons is based on an independent implementation. The results do appear to confirm the reported issues with reproducing CEDR using the original codebase by MacAvaney et al. However, this concern also appears to be assuaged by the authors' recommendation of replacing BERT with ELECTRA. While the original CEDR paper found that including the [CLS] token improved over using only contextual embeddings, row (3a) vs. (3b), the improvement is inconsistent in our replication, as seen in row (5a) vs. (6a) and row (5b) vs. (6b). Thus, to be clear, our results here support the finding by MacAvaney et al. that incorporating contextual embeddings in a pre-BERT interaction-based model can be effective (i.e., outperforms non-contextual embeddings), but our experiments do not appear to support the finding that incorporating the [CLS] token further improves effectiveness.

Comparing row (5a) with (5b) and row (6a) with (6b) in Table 17, we see that variants using ELECTRA Base consistently outperform those using BERTBase. Moreover, considering the results reported by Li et al. [2020a], in row group (4), we see that the improvements from pre–fine-tuning are less consistent than those reported by Zhang et al. [2021] (see Section 3.3.2). Pre–fine-tuning ELECTRA–KNRM slightly reduces effectiveness on description queries but improves effectiveness on title queries, row (4a) vs. row (5b). CEDR–KNRM benefits from pre–fine-tuning with both query types, but the improvement is larger for title queries, rows (4b) and (6b). With the exception of row (5b), interpolating the reranker's scores with scores from first-stage retrieval improves effectiveness.
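To make the interpolation condition concrete, here is a minimal sketch of the kind of linear score combination referred to above; the mixing weight and the per-query min-max normalization are illustrative assumptions rather than the exact procedure used in these experiments:

def interpolate_scores(first_stage: dict, reranker: dict, weight: float = 0.5) -> dict:
    """Linearly combine first-stage (e.g., BM25 + RM3) and reranker scores for one query.

    Both dictionaries map doc_id -> score. Scores are min-max normalized per query
    before mixing, since the two rankers score on different scales; the mixing
    weight is a hyperparameter tuned on held-out folds.
    """
    def normalize(scores: dict) -> dict:
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

    fs, rr = normalize(first_stage), normalize(reranker)
    # Documents missing from the reranked list keep only their first-stage evidence.
    return {d: weight * fs[d] + (1 - weight) * rr.get(d, 0.0) for d in fs}

Reranking then amounts to sorting the candidates by the combined score.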
Takeaway Lessons. Although the experimental results presented by MacAvaney et al. [2019a] do not allow us to unequivocally attribute effectiveness gains to specific architectural components of the overall ranking model, CEDR is, to our knowledge, the first end-to-end differentiable BERT-based ranking model for full-length documents. While Birch and BERT–MaxP could have been modified to be end-to-end differentiable (for example, as Li et al. [2020a] have done with Birch–Passage, presented in Section 3.3.1), neither Akkalyoncu Yilmaz et al. [2019b] nor Dai and Callan [2019b] made this important leap. This strategy of handling long documents by aggregating contextual term embeddings was later adopted by Boytsov and Kolter [2021]. The CEDR design has two important advantages: the model presents a principled solution to the length limitations of BERT and allows uniform treatment of both training and inference (reranking). Our replication experiments confirm the effectiveness of using contextual embeddings to handle ranking long texts, but the role of the [CLS] token in the complete CEDR architecture is not quite clear.

3.3.4 Passage Representation Aggregation: PARADE

PARADE [Li et al., 2020a], which stands for Passage Representation Aggregation for Document Reranking, is a direct descendant of CEDR that also incorporates lessons learned from Birch and BERT–MaxP. The key insight of PARADE, building on CEDR, is to aggregate the representations of passages from a long text rather than aggregating the scores of individual passages, as in Birch and BERT–MaxP. As in CEDR, this design yields an end-to-end differentiable model that can consider multiple passages in unison, which also unifies training and inference. However, PARADE abandons CEDR's connection to pre-BERT neural ranking models by discarding explicit term-interaction similarity matrices. The result is a ranking model that is simpler than CEDR and generally more effective.

More precisely, PARADE is a family of models that splits a long text into passages and performs representation aggregation on the [CLS] representation from each passage. Specifically, PARADE splits a long text into a fixed number of fixed-length passages. When texts contain fewer passages, the passages are padded and masked out during representation aggregation. When texts contain more passages, the first and last passages are always retained, but the remaining passages are randomly sampled. Consecutive passages partially overlap to minimize the chance of separating relevant information from its context.
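A minimal sketch of this splitting scheme is shown below; the helper name is hypothetical, the input is assumed to be already tokenized, and the passage length, stride, and maximum passage count default to the settings reported later for the PARADE experiments (225, 200, and 16, respectively):

import random
from typing import List

def split_into_passages(doc_tokens: List[str],
                        passage_len: int = 225,
                        stride: int = 200,
                        max_passages: int = 16) -> List[List[str]]:
    """Split a tokenized document into overlapping fixed-length passages.

    Consecutive passages overlap by (passage_len - stride) tokens; a short final
    passage would be padded (and masked) downstream. If the document yields more
    than max_passages passages, the first and last are always kept and the rest
    are randomly sampled, preserving document order.
    """
    overlap = passage_len - stride
    starts = [s for s in range(0, len(doc_tokens), stride)
              if s == 0 or s + overlap < len(doc_tokens)]
    passages = [doc_tokens[s:s + passage_len] for s in starts]
    if len(passages) > max_passages:
        keep = sorted(random.sample(range(1, len(passages) - 1), max_passages - 2))
        passages = [passages[0]] + [passages[i] for i in keep] + [passages[-1]]
    return passages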
A passage representation pi^cls is computed for each passage Pi given a query q using a pretrained transformer encoder:

  pi^cls = ELECTRA Base(q, Pi)    (21)

Note that instead of BERT, the authors opted to use ELECTRA [Clark et al., 2020b] with pre–fine-tuning on the MS MARCO passage ranking test collection (which Zhang et al. [2021] also experimented with in their investigation of MaxP, described in Section 3.3.2). The six PARADE variants proposed by Li et al. [2020a] each take a sequence of passage representations p1^cls, ..., pn^cls as input and aggregate them to produce a document representation d^cls. In more detail, they are as follows, where d^cls[i] refers to the i-th component of the d^cls vector:

• PARADE Avg performs average pooling across passage representations. That is,

  d^cls[i] = avg(p1^cls[i], ..., pn^cls[i]).    (22)

• PARADE Sum performs additive pooling across passage representations. That is,

  d^cls[i] = Σ_{j=1}^{n} pj^cls[i].    (23)

• PARADE Max performs max pooling across passage representations. That is,

  d^cls[i] = max(p1^cls[i], ..., pn^cls[i]).    (24)

• PARADE Attn computes a weighted average of the passage representations by using a feedforward network to produce an attention weight for each passage. That is,

  w1, ..., wn = softmax(W · p1^cls, ..., W · pn^cls),    (25)

  d^cls = Σ_{i=1}^{n} wi · pi^cls.    (26)

• PARADE CNN uses a stack of convolutional neural networks (CNNs) to repeatedly aggregate pairs of passage representations until only one representation remains; weights are shared across all CNNs. Each CNN takes two passage representations as input and produces a single, combined representation as output. That is, the CNNs have a window size of two, a stride of two, and a number of filters equal to the number of dimensions in a single passage representation. The number of passages used as input, which is a hyperparameter, must be a power of two in order for the model to generate a single representation after processing. PARADE CNN produces one relevance score s^j for each CNN layer. Let m_j be the number of representations after the j-th CNN, m_0 = n, and r_i^0 = pi^cls. Then,

  r_1^j, ..., r_{m_j}^j = CNN(r_1^{j−1}, ..., r_{m_{j−1}}^{j−1}),    (27)

  s^j = max(FFN(r_1^j), ..., FFN(r_{m_j}^j)).    (28)

• PARADE (i.e., the full model, or PARADE Transformer) aggregates the passage representations using a small stack of two randomly-initialized transformer encoders that take the passage representations as input. Similar to BERT, a [CLS] token (although with its own token embedding different from BERT's) is prepended to the passage representations that are fed to the transformer encoder stack; there is, however, no comparable [SEP] token for terminating the sequence. The [CLS] output representation of the final transformer encoder is used as the document representation d^cls:

  d^cls, d1, ..., dn = TransformerEncoder2(TransformerEncoder1([CLS], p1^cls, ..., pn^cls)).    (29)

The architecture of the full PARADE model is shown in Figure 13.

Figure 13: The architecture of the full PARADE model, showing the [CLS] representations from each passage, which are aggregated by another transformer to produce the final relevance score. Note that the [CLS] token of the upper transformer is not the same as the [CLS] token of BERT.

Note that the first four approaches treat each dimension of the passage representation as an independent feature. That is, pooling is performed across passage representations. With all variants except for PARADE CNN, the final document representation d^cls is fed to a fully-connected layer with two output nodes that then feeds a softmax to produce the final relevance score. In the case of PARADE CNN, the final relevance score is the sum of the CNN scores s^j and the maximum passage score s^0.
This makes PARADE CNN more interpretable than the full PARADE model because each document passage is associated with a relevance score (similar to MaxP, Birch, and BERT-KNRM). All PARADE variants are trained end-to-end. PARADE’s approach follows a line of prior work on hierarchical modeling of natural language text, which, to our knowledge, began in the context of deep learning with Hierarchical Attention Networks (HANs) for document classification [Yang et al., 2016]. Their architecture uses two layers of RNNs to model text at the word level and at the sentence level. Jiang et al. [2019] extended this basic strategy to three levels (paragraphs, sentences, and words) and applied the resulting model to semantic text matching of long texts. PARADE’s approach is most similar to that by Liu and Lapata [2019] and Zhang et al. [2019], who proposed a hierarchical transformer for document classification. Nevertheless, to our knowledge, PARADE represents the first application of hierarchical models to ad hoc retrieval.</p><p>Results and Analysis. Li et al. [2020a] evaluated the PARADE models on the Robust04 and Gov2 test collections using both title (keyword) and description (sentence) queries. Each PARADE model was built on top of ELECTRA Base, which was pre–fine-tuned on the MS MARCO passage ranking task. The entire model was then trained on the target test collection using cross-validation. Both the underlying ELECTRA model and the full PARADE models used pairwise hinge loss during training. Documents were split into passages of 225 terms with a stride of 200 terms. The maximum number of passages per document was set to 16. Candidate documents for each query were obtained with BM25 + RM3 using Anserini, and the top k = 1000 documents were reranked. However, note that the final results do not include interpolation with scores from first-stage retrieval. Results copied from Li et al. [2020a] are shown in Table 18 for Robust04 and Table 19 for Gov2. We refer the reader to the original paper for additional experiments, which include investigations of the impact of the underlying BERT model used and the number of candidate documents reranked. In order to evaluate the impact of passage representation aggregation, the PARADE models were compared with ELECTRA–MaxP (i.e., BERT–MaxP built on top of ELECTRA Base) and Birch, which both aggregate passage scores, and CEDR, which aggregates term representations. Li et al. [2020a] reported results on the improved Birch–Passage variant (described in Section 3.3.1) in row (3) that takes passages rather than sentences as input and is fine-tuned end-to-end on the target dataset.104 Like the PARADE variants, the ELECTRA–MaxP, Birch–Passage, and CEDR models shown in rows (3), (4), and (5) are built on top of an ELECTRA Base model that has already been fine-tuned on MS MARCO. The CEDR–KNRM model uses “max” rather than “average” aggregation to combine the [CLS] representations, which the authors found to perform slightly better. Statistically significant differences between the full PARADE model (i.e., PARADE Transformer) and other methods based on paired t-tests (p < 0.05) are indicated by the symbol † next to the scores. We see that, in general, ranking effectiveness increases with more sophisticated representation aggregation approaches. 
The experimental results suggest the following conclusions:</p><p>• PARADE (6f), which performs aggregation using transformer encoders, and PARADE CNN (6e) are consistently the most effective across different metrics, query types, and test collections. PARADE CNN usually performs slightly worse than the full PARADE model, but the differences are not statistically significant.</p><p>• PARADE Avg (6a) is usually the least effective.</p><p>• PARADE Sum (6b) and PARADE Attn (6d) perform similarly; PARADE Sum is slightly more effective on Robust04 and PARADE Attn is slightly more effective on Gov2. PARADE Sum can be viewed as PARADE Attn with uniform attention weights, so this result suggests that the attention scores produced by PARADE Attn may not be necessary.</p><p>• PARADE Max outperforms both PARADE Sum and PARADE Attn on Robust04, but its effective- ness varies on Gov2; MAP is higher than both but nDCG@20 is lower than both.</p><p>104This approach could also be considered an end-to-end “ELECTRA–KMaxP”.</p><p>84 Robust04 Title Description Method MAP nDCG@20 MAP nDCG@20 (1) BM25 0.2531† 0.4240† 0.2249† 0.4058† (2) BM25 + RM3 0.3033† 0.4514† 0.2875† 0.4307† (3) Birch–Passage = Table 11, row (4) 0.3763 0.5454† 0.4009† 0.5931† (4) ELECTRA–MaxP 0.3183† 0.4959† 0.3464† 0.5540† (5a) ELECTRA–KNRM = Table 17, row (4a) 0.3673† 0.5470† 0.4066 0.6113 (5b) CEDR–KNRM = Table 17, row (4b) 0.3701† 0.5475† 0.4000† 0.5983† † † † † (6a) PARADE Avg 0.3352 0.5124 0.3640 0.5642 † † † † (6b) PARADE Sum 0.3526 0.5385 0.3789 0.5878 † † † (6c) PARADE Max 0.3711 0.5442 0.3992 0.6022 † † † † (6d) PARADE Attn 0.3462 0.5266 0.3797 0.5871 † (6e) PARADE CNN 0.3807 0.5625 0.4005 0.6102 (6f) PARADE 0.3803 0.5659 0.4084 0.6127</p><p>Table 18: The effectiveness of PARADE variants on the Robust04 test collection using title and description queries. Statistically significant differences in effectiveness between a given method and the full PARADE model are indicated with the symbol † (two-tailed paired t-test, p < 0.05).</p><p>Gov2 Title Description Method MAP nDCG@20 MAP nDCG@20 (1) BM25 0.3056† 0.4774† 0.2407† 0.4264† (2) BM25 + RM3 0.3350† 0.4851† 0.2702† 0.4219† (3) Birch–Passage = Table 11, row (4) 0.3406† 0.5520† 0.3270 0.5763† (4) ELECTRA–MaxP = Table 14, row (6) 0.3193† 0.5265† 0.2857† 0.5319† (5a) ELECTRA–KNRM = Table 17, row (4a) 0.3469† 0.5750† 0.3269 0.5864† (5b) CEDR–KNRM = Table 17, row (4b) 0.3481† 0.5773† 0.3354† 0.6086 † † † † (6a) PARADE Avg 0.3174 0.5741 0.2924 0.5710 † † † † (6b) PARADE Sum 0.3268 0.5747 0.3075 0.5879 † † † † (6c) PARADE Max 0.3352 0.5636 0.3160 0.5732 † † † (6d) PARADE Attn 0.3306 0.5864 0.3116 0.5990 † (6e) PARADE CNN 0.3555 0.6045 0.3308 0.6169 (6f) PARADE 0.3628 0.6093 0.3269 0.6069</p><p>Table 19: The effectiveness of PARADE models on the Gov2 test collection using title and description queries. Statistically significant differences in effectiveness between a given method and the full PARADE model are indicated with the symbol † (two-tailed paired t-test, p < 0.05).</p><p>Compared to the baselines, the full PARADE model and PARADE CNN consistently outperforms ELECTRA–MaxP, row (4), and almost always outperforms Birch, row (3), and CEDR, row (5). In addition to providing a point of comparison for PARADE, these experiments also shed additional insight about differences between Birch, ELECTRA–MaxP, and CEDR in the same experimental setting. Here, it is worth spending some time discussing these results, independent of PARADE. 
Confirming the findings reported by Dai and Callan [2019b], the effectiveness of all models increases when moving from Robust04 title queries to description queries. However, the results are more mixed on Gov2, and description queries do not consistently improve across metrics. ELECTRA–KNRM (5a) and CEDR–KNRM (5b) are comparable in terms of effectiveness to Birch–Passage on Robust04 but generally better on Gov2. All Birch and CEDR variants are substantially more effective than ELECTRA–MaxP, providing further support for the claim that considering multiple passages from a single document passage can improve relevance predictions.</p><p>Takeaway Lessons. We see two main takeaways from PARADE, both building on insights initially demonstrated by CEDR: First, aggregating passage representations appears to be more effective than aggregating passages scores. By the time a passage score is computed, a lot of the relevance signal has already been “lost”. In contrast, passage representations are richer and thus allow higher-level components to make better decisions about document relevance. Second, chunking a long text and performing chunk-level inference can be an effective strategy to addressing the length restrictions of</p><p>85 BERT. In our opinion, this approach is preferable to alternative solutions that try to directly increase the maximum length of input sequences to BERT [Tay et al., 2020] (see next section). The key to chunk-wise inference lies in properly aggregating representations that emerge from inference over the individual chunks. Pooling, particularly max pooling, is a simple and effective technique, but using another transformer to aggregate the individual representations appears to be even more effective, suggesting that there are rich signals present in the sequence of chunk-level representations. This hierarchical approach to relevance modeling retains the important model property of differentiability, enabling the unification of training and inference.</p><p>3.3.5 Alternatives for Tackling Long Texts In addition to aggregating passage scores or representations, two alternative strategies have been proposed for ranking long texts: making use of passage-level relevance labels and modifying the transformer architecture to consume long texts more efficiently. We discuss both approaches below.</p><p>Passage-level relevance labels. As an example of the first strategy, Wu et al. [2020b] considered whether having graded passage-level relevance judgments at training time can lead to a more ef- fective ranking model. This approach avoids the label mismatch at training time (for example, with MaxP) since passage-level judgments are used. To evaluate whether this approach improves effectiveness, the authors annotated a corpus of Chinese news articles with passage-level cumulative gain, defined as the amount of relevant information a reader would encounter after having read a document up to a given passage. Here, the authors operationalized passages as paragraphs. The document-level cumulative gain is then, by definition, the highest passage-level cumulative gain, which is the cumulative gain reached after processing the entire document. Based on these human annotations, Wu et al. [2020b] made the following two observations: • On average, highly-relevant documents are longer than other types of documents, measured both in terms of the number of passages and the number of words. 
• The higher the document-level cumulative gain, the more passages that need to be read by a user before the passage-level cumulative gain reaches the document-level cumulative gain. These findings suggest that whether a document is relevant can be accurately predicted from its most relevant passage—which is consistent with BERT–MaxP and Birch, as well as the user studies discussed in Section 3.3.2. However, to accurately distinguish between different relevance grades (e.g., relevant vs. highly-relevant), a model might need to accumulate evidence from multiple passages, which suggests that BERT–MaxP might not be sufficient. Intuitively, the importance of observing multiple passages is related to how much relevance information accumulates across the full document. To make use of their passage-level relevance labels, Wu et al. [2020b] proposed the Passage-level Cumulative Gain model (PCGM), which begins by applying BERT to obtain individual query–passage representations (i.e., the final representation of the [CLS] token). The sequence of query–passage representations is then aggregated with an LSTM, and the model is trained to predict the cumulative gain after each passage. An embedding of the previous passage’s predicted gain is concatenated to the query–passage representation to complete the model. At inference time, the gain of a document’s final passage is used as the document-level gain. One can think of PCGM as a principled approach to aggregating evidence from multiple passages, much like PARADE, but adds the requirement that passage-level gain labels are available. PCGM has two main advantages: the LSTM is able to model and extract signal from the sequence of passages, and the model is differentiable and thus amenable to end-to-end training. The PCGM model was evaluated on two Chinese test collections. While experimental results demonstrate some increase in effectiveness over BERT–MaxP, the increase was not statistically significant. Unfortunately, the authors did not evaluate on Robust04, and thus a comparison to other score and passage aggregation approaches is difficult. However, it is unclear whether the lack of significant improvements is due to the design of the model, the relatively small dataset, or some issue with the underlying observations about passage-level gains. Nevertheless, the intuitions of Wu et al. [2020b] in recognizing the need to aggregate passage representations do appear to be valid, as supported by the experiments with PARADE in Section 3.3.4.</p><p>Transformer architectures for long texts. Researchers have proposed a variety of techniques to directly apply the transformer architecture to long documents by reducing the computational cost of</p><p>86 MS MARCO Doc (Dev) TREC 2019 DL Doc Method MRR@10 nDCG@10 MAP (1) Birch (BM25 + RM3) - 0.640 0.328 (2) Sparse-Transformer 0.328 0.634 0.257 (3) Longformer-QA 0.326 0.627 0.255 (4) QDS-Transformer 0.360 0.667 0.278 Table 20: The effectiveness of efficient transformer variants on the development set of the MS MARCO document ranking task and the TREC 2019 Deep Learning Track document ranking test collection.</p><p> its attention mechanism, which is quadratic with respect to the sequence length (see discussion in Section 3.3). Kitaev et al. 
[2020] proposed the Reformer, which replaces standard dot-product attention by a design based on locality-sensitive hashing to efficiently compute attention only against the most similar tokens, thus reducing model complexity from O(L2) to O(L log L), where L is the length of the sequence. Another solution, dubbed Longformer by Beltagy et al. [2020], addressed the blow-up in computational costs by sparsifying the all-to-all attention patterns in the basic transformer design through the use of a sliding window to capture local context and global attention tokens that can be specified for a given task. Researchers have begun to apply Longformer-based models to ranking long texts [Sekulic´ et al., 2020, Jiang et al., 2020]. Jiang et al. [2020] proposed the QDS-Transformer, which is a Longformer model where the query tokens are global attention tokens (i.e., each query term attends to all query and document terms). The authors evaluated the QDS-Transformer on the MS MARCO document ranking test collection and on the TREC 2019 Deep Learning Track document ranking test collection where they reranked the BM25 results provided by the track organizers. QDS-Transformer was compared against Longformer-QA, which adds a special token to the query and document for global attention, as proposed by Beltagy et al. [2020], and Sparse-Transformer [Child et al., 2019], which uses local attention windows with no global attention. Experimental results are shown in Table 20. The Sparse-Transformer and Longformer-QA models perform similarly, rows (2) and (3), suggesting that the global token approach used by Longformer- QA does not represent an improvement over the local windows used by Sparse-Transformer. QDS- Transformer, row (4), outperforms both approaches, which suggests that treating the query tokens as global attention tokens is important. For context, we present the closest comparable Birch condition we could find in row (1); this corresponds to run bm25_marcomb submitted to the TREC 2019 Deep Learning Track [Craswell et al., 2020], which reranked the top 1000 hits from BM25 + RM3 as first-stage retrieval. The higher MAP of Birch is likely due to a deeper reranking depth, but the effectiveness of QDS-Transformer is only a little bit higher. For Robust04, Jiang et al. [2020] reported an nDCG@20 of 0.457, which is far lower than many of the figures reported in this section. Although there aren’t sufficient common reference points, taken as a whole, it is unclear if QDS-Transformer is truly competitive compared to many of the models discussed earlier.</p><p>Takeaway Lessons. While replacing all-to-all attention lowers the computational complexity in the alternative transformer architectures discussed in this section, it is not clear whether they can match the effectiveness of reranking methods based either on score or representation aggregation. Note that the strategy of sparsifying attention patterns leads down the road to an architecture that looks quite like PARADE. In PARADE’s hierarchical model, a second lightweight transformer is applied to the [CLS] representations from the individual passages, but this design is operationally identical to a deeper transformer architecture where the top few layers adopt a special attention pattern (e.g., via masking). 
In fact, we might go so far as to say that hierarchical transformers and selective sparsification of attention are two ways of describing the same idea.

3.4 From Single-Stage to Multi-Stage Rerankers

The applications of BERT to text ranking that we have covered so far operate as rerankers in a retrieve-and-rerank setup, which as we have noted dates back to at least the 1960s [Simmons, 1965].

Figure 14: A retrieve-and-rerank design (top) is the simplest instantiation of a multi-stage ranking architecture (bottom). In multi-stage ranking, the candidate generation stage (also called initial retrieval or first-stage retrieval) is followed by more than one reranking stage.

An obvious extension of this design is to incorporate multiple reranking stages as part of a multi-stage ranking architecture, as shown in Figure 14. That is, following candidate generation or first-stage retrieval, instead of having just a single reranker, a system could have an arbitrary number of reranking stages, where the output of each reranker serves as the input to the next. This basic design goes by a few other names as well: reranking pipelines, ranking cascades, or "telescoping".

We formalize the design as follows: a multi-stage ranking architecture comprises N reranking stages, denoted H1 to HN. We refer to the candidate generation stage (also called initial retrieval or first-stage retrieval) as H0, which retrieves k0 texts from the corpus to feed the rerankers. Candidate generation is typically accomplished using an inverted index, but may exploit dense retrieval techniques or dense–sparse hybrids as well (see Section 5). Each stage Hn, n ∈ {1, ..., N}, receives a ranked list Rn−1 comprising kn−1 candidates from the previous stage. Each stage, in turn, provides a ranked list Rn comprising kn candidates to the subsequent stage, with the requirement that kn ≤ kn−1.105 The ranked list generated by the final stage HN is the output of the multi-stage ranking architecture. This description intentionally leaves unspecified the implementation of each reranking stage, which could be anything ranging from decisions made based on the value of a single hand-crafted feature (known as a "decision stump") to a sophisticated machine-learned model (for example, based on BERT). Furthermore, each stage could decide how to take advantage of scores from the previous stage: one common design is that scores from each stage are additive, or a reranker can decide to completely ignore previous scores, treating the previous candidate texts as an unordered set.

105 We leave aside a minor detail here in that a stage can return a ranked list of a particular length, and the next stage may choose to truncate that list prior to processing. The net effect is the same; a single parameter kn is sufficient to characterize such a design.
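To make this formalization concrete, the following is a minimal sketch of a generic multi-stage pipeline; the names are hypothetical, and for simplicity each stage here replaces (rather than adds to) the scores of the previous stage:

from dataclasses import dataclass
from typing import Callable, List, Tuple

Candidate = Tuple[str, str, float]  # (doc_id, text, score)

@dataclass
class Stage:
    """A reranking stage Hn: rescores its input and keeps the top kn candidates."""
    scorer: Callable[[str, str], float]  # (query, text) -> relevance score
    k: int                               # depth kn, with kn <= k(n-1)

def multi_stage_rank(query: str,
                     first_stage: Callable[[str, int], List[Candidate]],  # H0
                     k0: int,
                     stages: List[Stage]) -> List[Candidate]:
    # H0: candidate generation, e.g., BM25 over an inverted index.
    results = first_stage(query, k0)
    for stage in stages:
        # Hn receives the k(n-1) candidates from the previous stage...
        rescored = [(doc_id, text, stage.scorer(query, text))
                    for doc_id, text, _ in results]
        # ...and passes only its top kn candidates to the next stage.
        results = sorted(rescored, key=lambda c: c[2], reverse=True)[:stage.k]
    return results  # the ranked list produced by the final stage HN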
One practical motivation for the development of multi-stage ranking is to better balance tradeoffs between effectiveness (most of the time, referring to the quality of the ranked lists) and efficiency (for example, retrieval latency or query throughput). Users, of course, demand systems that are both "good" and "fast", but in general, there is a natural tradeoff between these two desirable characteristics. Multi-stage ranking evolved in the context of learning to rank (see Section 1.2.3): for example, compared to unigram features (i.e., of individual terms) such as BM25 scores, many n-gram features are better signals of relevance, but they are also more computationally expensive to compute, in both time and space. To illustrate: one helpful feature is the count of query n-grams that occur in a text (that is, the ranking model checks whether matching query terms are contiguous). This is typically accomplished by storing the positions of terms in the text (which consumes space) and intersecting lists of term positions (within individual documents) to determine whether the terms appear contiguously (which takes time). Thus, we see a common tradeoff between feature cost and output quality, and more generally, between effectiveness and efficiency.

Thus, a ranking model (e.g., learning to rank) that takes advantage of "expensive" features will often be slow, since inference must be performed on every candidate. Latency increases linearly with the number of candidates considered and can be managed by varying the depth of first-stage retrieval, much like the experiments presented in Section 3.2.2 in the context of monoBERT. However, it is desirable that the candidate pool contains as many relevant texts as possible (i.e., has high recall), to maximize the opportunities for a reranker to identify relevant texts; obviously, rerankers are useless if there are no relevant texts in the output of first-stage retrieval to process. Thus, designers of real-world production systems are faced with an effectiveness/efficiency tradeoff.

The intuition behind the multi-stage design is to exploit expensive features only when necessary: earlier stages in the reranking pipeline can use "cheap" features to discard candidates that are easy to distinguish as not relevant; "expensive" features can then be brought to bear after the "easy" non-relevant candidates have been discarded. Latency can be managed because increasingly expensive features are computed on fewer and fewer candidates. Furthermore, reranking pipelines can exploit "early exits" that bypass later stages if the results are "good enough" [Cambazoglu et al., 2010]. In general, the multi-stage design provides system designers with tools to balance effectiveness and efficiency, often leading to systems that are both "good" and "fast".106

The development of this idea in modern times has an interesting history. It had been informally known by many in the information retrieval community since at least the mid-2000s that Microsoft's Bing search engine adopted a multi-stage design; for one, it was the most plausible approach for deploying the learning-to-rank models they were developing at the time [Burges et al., 2005]. However, the earliest "official" public acknowledgment we are aware of appears to be in a SIGIR 2010 Industry Track keynote by Jan Pedersen, whose presentation included a slide that explicitly showed this multi-stage architecture. Bing named these stages "L0" through "L4", with "L0" being "Boolean logic" (understood to be conjunctive query processing, i.e., the "ANDing" of query terms), "L1" being "IR score" (understood to be BM25), and "L2/L3/L4" being machine-learned models. Earlier that year, a team of authors from Yahoo! [Cambazoglu et al., 2010] described a multi-stage ranking architecture in the form of additive ensembles (the score of each stage is added to the score of the previous stages).
However, the paper did not establish a clear connection to production systems. In the academic literature, Matveeva et al. [2006] described the first known instance of multi-stage ranking (“nested” rankers, as the authors called it). The term “telescoping” was used to describe the pruning process where candidates were discarded between stages. Interestingly, the paper was motivated by high-accuracy retrieval and did not discuss the implications of their techniques on system latency. Furthermore, while four of the five co-authors were affiliated with Bing, the paper provided no indications of or connections to the design of the production search engine. One of the earliest academic papers to include efficiency objectives in learning to rank was by Wang et al. [2010], who explicitly modeled feature costs in a framework to jointly optimize effectiveness and efficiency; cf. [Xu et al., 2012]. In a follow-up, Wang et al. [2011] proposed a boosting algorithm for learning ranking cascades to directly optimize this quality/speed tradeoff. Within the academic literature, this is the first instance we are aware of that describes learning the stages in a multi-stage ranking architecture. Wang et al. coined the term “learning to efficiently rank” to describe this thread of research. Nevertheless, it is clear that industry led the way in explorations of this design, but since there is paucity of published material about production systems, we have no public record of when various important innovations occurred and when they were deployed. Since the early 2010s, multi-stage ranking architectures have received substantial interest in the academic literature [Tonellotto et al., 2013, Asadi and Lin, 2013, Capannini et al., 2016, Clarke et al., 2016, Chen et al., 2017c, Mackenzie et al., 2018] as well as industry. Beyond Bing, publicly documented production deployments of such an architecture at scale include Alibaba’s e-commerce search engine [Liu et al., 2017] and elsewhere within Alibaba as well [Yan et al., 2021], Baidu’s web search engine [Zou et al., 2021], and Facebook search [Huang et al., 2020]. In fact, Facebook writes:</p><p>Facebook search ranking is a complex multi-stage ranking system where each stage progressively refines the results from the preceding stage. At the very bottom of</p><p>106Note an important caveat here is the assumption that users only desire a few relevant documents, as is typical in web search and operationalized in terms of early-precision metrics. Multi-stage architectures might not be as useful if users desire high recall, which is important for many scenarios in the medical domain (for example, systematic reviews) or the legal domain (for example, patent search).</p><p>89 this stack is the retrieval layer, where embedding based retrieval is applied. Results from the retrieval layer are then sorted and filtered by a stack of ranking layers.</p><p>We see that multi-stage ranking remains very much relevant in the neural age. While keyword-based retrieval has been replaced with retrieval using learned dense representations (see Section 5) as the first stage in this case, and subsequent reranking stages are now primarily driven by neural models, the general multi-stage design has not changed. Having provided sufficient background, the remainder of this section presents a few multi-stage ranking architectures specifically designed around transformer models. 
Section 3.4.1 describes a reranking approach that explicitly compares the relevance of pairs of texts in a single inference step; this idea can be logically extended to assessing the relevance of lists of texts, which we describe in Section 3.4.2. We then present cascade transformers in Section 3.4.3, which treat transformer layers as reranking stages.

3.4.1 Reranking Pairs of Texts

The first application of transformers in a multi-stage ranking architecture was described by Nogueira et al. [2019a] as a solution for mitigating the quadratic computational costs associated with a ranking model that applies inference on an input template that incorporates pairs of texts, as we explain below.

Recall that monoBERT turns ranking into a relevance classification problem, where we sort texts by P(Relevant = 1 | di, q) given a query q and candidates {di}. In the terminology of learning to rank, this model is best described as a "pointwise" approach since each text is considered in isolation during training [Liu, 2009, Li, 2011]. An alternative is a "pairwise" approach, which focuses on comparisons between pairs of documents. Intuitively, pairwise ranking has the advantage of harnessing signals present in other candidate texts to decide if a text is relevant to a given query; these comparisons are also consonant with the notion of graded relevance judgments (see Section 2.5).

The "duoBERT" model proposed by Nogueira et al. [2019a] operationalizes this intuition by explicitly considering pairs of texts. In this ranking model, BERT is trained to estimate the following:

  P(di ≻ dj | di, dj, q),    (30)

where di ≻ dj is a commonly adopted notation for stating that di is more relevant than dj (with respect to the query q). Before going into details, there are two conceptual challenges to realizing this ranking strategy:

1. The result of model inferences comprises a set of pairwise comparisons between candidate texts. Evidence from these pairs still needs to be aggregated to produce a final ranked list.

2. One simple implementation is to compare each candidate to every other candidate (e.g., from first-stage retrieval), and thus the computational costs increase quadratically with the size of the candidate set. Since monoBERT's effectiveness increases with the size of the candidate set (see Section 3.2), there emerges an effectiveness/efficiency tradeoff that needs to be controlled.

Nogueira et al. [2019a] proposed a number of evidence aggregation strategies (described below) to tackle the first challenge and adopted a multi-stage ranking architecture to address the second challenge. In summary, in a multi-stage design, a relevance classifier can be used to select a smaller set of candidates from first-stage retrieval to be fed to the pairwise reranker.

The duoBERT model is trained to estimate pi,j, the probability that di ≻ dj, i.e., that candidate di is more relevant than dj. It takes as input a sequence comprising a query and two texts, arranged in the following input template:

  [[CLS], q, [SEP], di, [SEP], dj, [SEP]]    (31)

Similar to the implementation of monoBERT, each input token in q, di, and dj is represented by the element-wise sum of the token, segment type, and position embeddings. In the duoBERT model, there are three segment types: type A for q tokens, and types B and C for the di and dj tokens, respectively. Type embeddings A and B are learned during pretraining, but the new segment type C embedding is learned from scratch during fine-tuning.
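As an illustration, a minimal sketch of constructing this input template with a Hugging Face tokenizer might look as follows; the helper name is hypothetical, the truncation limits are the ones discussed immediately below, and a stock BERT checkpoint would need its token type embedding table resized from two to three segment types before fine-tuning:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def duo_input(query: str, text_i: str, text_j: str,
              max_q: int = 62, max_d: int = 223) -> dict:
    """Pack [CLS] q [SEP] di [SEP] dj [SEP] into one sequence with
    segment types A/B/C encoded as 0/1/2."""
    q_ids = tokenizer.encode(query, add_special_tokens=False)[:max_q]
    di_ids = tokenizer.encode(text_i, add_special_tokens=False)[:max_d]
    dj_ids = tokenizer.encode(text_j, add_special_tokens=False)[:max_d]
    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id

    input_ids = [cls] + q_ids + [sep] + di_ids + [sep] + dj_ids + [sep]
    token_type_ids = ([0] * (len(q_ids) + 2)     # segment A: [CLS] q [SEP]
                      + [1] * (len(di_ids) + 1)  # segment B: di [SEP]
                      + [2] * (len(dj_ids) + 1)) # segment C: dj [SEP]
    assert len(input_ids) <= 512
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}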
Due to the length limitations of BERT, the query and the candidates di and dj are truncated to 62, 223, and 223 tokens, respectively, so that the entire sequence has at most 512 tokens when concatenated with the [CLS] token and the three [SEP] tokens. Using the above length limits, for the MS MARCO passage ranking test collection, Nogueira et al. [2019a] did not have to truncate any of the queries, and less than 1% of the candidate texts were truncated.

MS MARCO Passage                                   Development   Test
Method                                             MRR@10        MRR@10
(1) Anserini BM25 = Table 5, row (3a)              0.187         0.190
(2)  + monoBERT (k0 = 1000) = Table 5, row (3b)    0.372         0.365
     + monoBERT (k0 = 1000)
(3a)   + duoBERTMAX (k1 = 50)                      0.326         -
(3b)   + duoBERTMIN (k1 = 50)                      0.379         -
(3c)   + duoBERTSUM (k1 = 50)                      0.382         0.370
(3d)   + duoBERTBINARY (k1 = 50)                   0.383         -
(4a) + monoBERT + TCP                              0.379         -
(4b) + monoBERT + duoBERTSUM + TCP                 0.390         0.379

Table 21: The effectiveness of the monoBERT/duoBERT pipeline on the MS MARCO passage ranking test collection. TCP refers to target corpus pretraining.

Similar to monoBERT, the final representation of the [CLS] token is used as input to a fully-connected layer to obtain the probability pi,j. For k candidates, k × (k − 1) probabilities are computed. The model is trained end-to-end with the following loss:

  Lduo = − Σ_{i∈Jpos, j∈Jneg} log(pi,j) − Σ_{i∈Jneg, j∈Jpos} log(1 − pi,j),    (32)

where Jpos and Jneg denote the sets of relevant and non-relevant candidates, respectively. Note that in the equation above, candidates di and dj are never both relevant or both non-relevant. Since this loss function considers pairs of candidate texts, it can be characterized as belonging to the family of pairwise learning-to-rank methods [Liu, 2009, Li, 2011] (but see additional discussions below). For details about the training procedure, including hyperparameter settings, we refer the reader to the original paper.

At inference time, the pairwise scores pi,j are aggregated so that each document receives a single score si. Nogueira et al. [2019a] investigated a number of different aggregation methods:

  MAX:     si = max_{j∈Ji} pi,j,    (33)
  MIN:     si = min_{j∈Ji} pi,j,    (34)
  SUM:     si = Σ_{j∈Ji} pi,j,    (35)
  BINARY:  si = Σ_{j∈Ji} 1[pi,j > 0.5],    (36)

where Ji = {0 ≤ j < |D|, j ≠ i} and m is the number of samples drawn without replacement from the set Ji. The SUM method measures the pairwise agreement that candidate di is more relevant than the rest of the candidates {dj}, j ≠ i. The BINARY method is inspired by the Condorcet method [Montague and Aslam, 2002], which serves as a strong aggregation baseline [Cormack et al., 2009]. The MIN (MAX) method measures the relevance of di only against its strongest (weakest) "competitor". The final ranked list (for evaluation) is obtained by reranking the candidates according to their scores si.

Before presenting experimental results, it is worthwhile to clarify a possible point of confusion. In "traditional" (i.e., pre-neural) learning to rank, "pairwise" and "pointwise" refer to the form of the loss, not the form of the inference mechanism. For example, RankNet [Burges et al., 2005] is trained in a pairwise manner (i.e., loss is computed with respect to pairs of texts), but inference (i.e., at query time) is still performed on individual texts.
In duoBERT, both training and inference are performed on pairs of texts in a cross-encoder design where all three inputs (the query and the two texts to be compared) are “packed” into the input template fed to BERT.</p><p>91 Results on the MS MARCO passage ranking test collection are shown in Table 21, organized in the same manner as Table 5; the experimental conditions are directly comparable. Row (1) reports the effectiveness of Anserini’s initial candidates using BM25 scoring. In row (2), BM25 results reranked with monoBERT using BERTLarge (k0 = 1000) are shown, which is exactly the same as row (3b) in Table 5. Rows (3a)–(3d) report results from reranking the top 50 results from the output of monoBERT (i.e., k1 = 50) using the various aggregation techniques presented above. Effectiveness in terms of the official metric MRR@10 is reported on the development set for all aggregation methods (i.e., duoBERT using BERTLarge), but Nogueira et al. [2019a] only submitted results from the SUM condition for evaluation on the test set. We see that MAX aggregation is not as effective as the other three techniques, but the difference between MIN,SUM, and BINARY are all quite small. In the same paper, Nogueira et al. [2019a] also introduced the target corpus pretraining (TCP) technique presented in Section 3.2.4. Rows (4a) and (4b) in Table 5 report results of applying TCP with monoBERT and monoBERT + duoBERT. Here, we see that the gains are relatively modest, but as discussed earlier, unsupervised pretraining can be viewed as a source of “free” improvements in that these gains do not require any additional labeled data. In all the experimental conditions above, duoBERT considers the top 50 candidates from monoBERT (i.e., k1 = 50), and thus requires an additional 50 × 49 BERT inferences to compute the final ranking (the time required for aggregation is negligible). For simplicity, Nogueira et al. [2019a] used the total number of BERT inferences as a proxy to capture overall query latency. Based on this metric, since monoBERT with k0 = 1000 requires 1000 BERT inferences, a monoBERT + duoBERT pipeline represents a 3.5× increase in latency. While it is true that each pair of texts in duoBERT takes longer to process than a single text in monoBERT due to the longer input length, this detail does not change the argument qualitatively (although the actual tradeoff point in our analysis below might change if we were to measure wall-clock latency; there are GPU batching effects to consider as well). From this perspective, duoBERT does not seem compelling because the gain from monoBERT + duoBERT vs. monoBERT alone is far more modest than the gain from monoBERT vs. BM25 (at the k0 and k1 settings shown in Table 21). However, the more pertinent question is as follows: Given a fixed budget for neural inference, how should we allocate resources between monoBERT and duoBERT? In this scenario, the pairwise reranking approach becomes much more compelling. We demonstrate this below: In general, a two-stage configuration provides a richer design space for selecting a desirable operating point to balance effectiveness and efficiency under a certain computational budget. With a single reranking stage (monoBERT), the only choice is to vary the k0 parameter, but with two rerankers, it is possible to simultaneously tune k0 and k1. These tradeoff curves are shown in Figure 15, with duoBERTSUM for aggregation. This experiment was not reported in Nogueira et al. 
[2019a]; here we present results that have not been published elsewhere. In the plot, the gray line shows effectiveness with different values of k0 for monoBERT in a single-stage setup (this is the same as the curve in Figure 9, just across a narrower range). The other lines show settings of k1 ∈ {10, 30, 50}, and with each k1 setting, points in each tradeoff curve represent k0 = {50, 100, 200, 500, 1000}. In the two-stage configuration, the number of inferences per query is calculated as k0 + k1(k1 − 1). Thus, the x axis is a reasonable proxy of the total computational budget. Hypothetical vertical lines intersecting with each curve denote the best effectiveness that can be achieved with a particular computational budget: these results suggest that if a system designer were willing to expend more than a couple of hundred BERT inferences per query, then a two-stage configuration is more effective overall. That is, rather than simply increasing the reranking depth of single-stage monoBERT, it is better to reallocate some of the computational budget to a pairwise approach that examines pairs of candidate texts. The Pareto frontier in the effectiveness/efficiency tradeoff space is shown in Figure 15 as the dotted black line. For each point on the frontier, there exists no other setting that achieves higher MRR@10 while requiring fewer inferences. This frontier serves as a guide for system designers in choosing desirable operating points in the effectiveness/efficiency design space.

Figure 15: Effectiveness/efficiency tradeoff curves for different monoBERT and monoBERT + duoBERTSUM settings on the development set of the MS MARCO passage ranking test collection. Efficiency is measured in the number of BERT inferences per query. For monoBERT, the tradeoff curve plots different values of k0 (the same as in Figure 9). For monoBERT + duoBERTSUM, each curve plots a different k1, and points on each curve correspond to k0 = {50, 100, 200, 500, 1000}; the number of inferences per query is calculated as k0 + k1(k1 − 1). The Pareto frontier is shown as the dotted black line.

Takeaway Lessons. Multi-stage ranking architectures represent a straightforward generalization of the retrieve-and-rerank approach adopted in monoBERT. Introducing multiple rerankers in a pipeline greatly expands the possible operating points of an end-to-end system in the effectiveness/efficiency tradeoff space, potentially leading to settings that are both better and faster than what can be achieved with a single-stage reranker. One potential downside, however, is that multi-stage pipelines introduce additional "tuning knobs" that need to be properly adjusted to achieve a desired tradeoff. In the monoBERT/duoBERT design, these parameter settings (k0, k1) are difficult to learn as the pipeline is not differentiable end-to-end. Thus, the impact of different parameter settings must be empirically determined from a test collection.

3.4.2 Reranking Lists of Texts

Given a query, the duoBERT model described in the previous section estimates the relevance of a text relative to another text, where both texts are directly fed into BERT for consideration in a single inference pass.
This pairwise approach can be more effective than pointwise rerankers based on relevance classification such as monoBERT because the pairwise approach allows the reranker to "see" what else is in the set of candidates. One natural extension of the pairwise approach is the "listwise" approach, in which the relevance of a text is estimated jointly with multiple other candidates. Here we describe two proposed listwise reranking methods.

Before proceeding, two important caveats: First, the labels "pairwise" and "listwise" here explicitly refer to the form of the input template for inference (which necessitates, naturally, modifications to the loss function during model training). Thus, our usage of these terms diverges from "traditional" (i.e., pre-neural) learning to rank, which describes only the form of the loss; see, for example, ListNet [Cao et al., 2007]. We do not cover these listwise learning-to-rank methods here and instead refer the reader to existing surveys [Liu, 2009, Li, 2011]. Second, while listwise approaches may not have been proposed explicitly in the context of multi-stage ranking architectures, they are a natural fit for the same reasons as duoBERT. Given the length limitations of many neural models and the blow-up in terms of input permutations that need to be considered, a stage-wise reranking approach makes a lot of sense.

We begin with Ai et al. [2019], who proposed a listwise reranking approach based on learning what they called a groupwise multivariate scoring function. In their approach, each text di is represented by a hand-crafted feature vector xi, which can include signals designed to capture query–text interactions. The concatenation of n such feature vectors is fed to a fully-connected neural network that outputs n relevance scores, one for each text. Depending on the query, the number of candidate texts k can be quite large (e.g., k = 1000). Consequently, it is not practical to feed all candidates to the model at once since the input sequence would become prohibitively long, thus making the model difficult to effectively train. Instead, the authors proposed to compute size-n permutations of the k candidate texts and independently feed each group of n feature vectors to the model. At inference time, the final score of each text is the sum of the scores in each group it was part of. The model is trained with the following cross-entropy loss:

  L = − Σ_{i=1}^{k} wi yi log pi,    (37)

where wi is the Inverse Propensity Weight [Joachims et al., 2017, Liu, 2009] of the i-th result and yi = 1 if the text is relevant and zero otherwise. The probability pi is obtained by applying a softmax to the logits ti of the candidate texts:

  pi = e^{ti} / Σ_{j=1}^{k} e^{tj}    (38)

Results on publicly available datasets are encouraging, but the effectiveness of this approach is not clearly superior to pointwise or pairwise approaches. The authors identified possible improvements, including the design of the feedforward network and a better way to organize model input than a simple concatenation of features from the candidate texts.

Instead of feeding hand-crafted features to a fully-connected neural network as in Ai et al. [2019], Zhang et al. [2020f] proposed to directly feed raw candidate texts into pretrained transformers. Due to model length limitations, however, candidate texts are truncated until they fit into a 512-token sequence.
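For reference, a minimal sketch of the listwise softmax cross-entropy of Equations (37) and (38), for a single query's candidate list, might look like this (tensor names are illustrative):

import torch
import torch.nn.functional as F

def listwise_loss(logits: torch.Tensor, labels: torch.Tensor, ipw: torch.Tensor) -> torch.Tensor:
    """Softmax cross-entropy over one query's candidate list (Equations 37-38).

    logits: [k] raw scores ti, labels: [k] binary relevance yi,
    ipw: [k] inverse propensity weights wi.
    """
    log_p = F.log_softmax(logits, dim=-1)  # log pi = ti - log sum_j exp(tj)
    return -(ipw * labels * log_p).sum()

# Example with k = 4 candidates, of which the second is relevant:
loss = listwise_loss(torch.tensor([0.2, 1.5, -0.3, 0.1]),
                     torch.tensor([0.0, 1.0, 0.0, 0.0]),
                     torch.ones(4))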
Instead of feeding hand-crafted features to a fully-connected neural network as in Ai et al. [2019], Zhang et al. [2020f] proposed to directly feed raw candidate texts into pretrained transformers. Due to model length limitations, however, candidate texts are truncated until they fit into a 512-token sequence. The resulting listwise reranker showed small improvements over its pairwise counterpart on two ranking datasets: the first is a non-public dataset in Chinese, while the second is a modified version of the MS MARCO passage ranking test collection. Unfortunately, modifications to the latter render the results not comparable to other papers, so we lack meaningful points of comparison.

Takeaway Lessons. Listwise rerankers represent a natural extension of pairwise rerankers and are intuitively appealing because relevance scores can be estimated jointly. However, the necessity of feeding multiple candidate texts into a neural model in each inference pass leads to potentially long input sequences and thus presents a major technical challenge, for all the reasons already discussed throughout this section. For the problem of label prediction in a fact verification setting, Pradeep et al. [2021a] demonstrated the effectiveness of a listwise approach in which multiple claims are presented to a pretrained transformer model in a single input template. In this case, the candidate sentences are shorter than typical texts to be ranked, and thus the work highlights the potential of the listwise approach, as long as we can overcome the model length limitations. This remains an open problem in the general case, and despite encouraging results, in our opinion, ranking models that consider lists of candidates have not been conclusively demonstrated to be more effective than models that consider pairs of candidates.

3.4.3 Efficient Multi-Stage Rerankers: Cascade Transformers

Multi-stage ranking pipelines exploit faster (and possibly less effective) models in earlier stages to discard likely non-relevant documents so that there are fewer candidates under consideration by more expensive models in later stages. In the case of the mono/duoBERT architecture described above, the primary goal was to make a more inference-heavy model (i.e., duoBERT) more practical. Indeed, experimental results in the previous section offer a guide for how to optimally allocate resources to monoBERT and duoBERT inference given a computational budget. In other words, the goal is to improve the quality of a single-stage monoBERT design while maintaining acceptable effectiveness/efficiency tradeoffs. However, the mono/duoBERT architecture isn't particularly useful if we desire a system that is even faster (but perhaps less effective) than the baseline (single-stage) monoBERT design. In this case, one possibility is to use a standard telescoping pipeline that potentially includes pre-BERT neural ranking methods, as suggested by Matsubara et al. [2020]. Given monoBERT as a starting point, another obvious solution is to leverage the large body of research on model pruning and compression, which is not specific to text ranking or even natural language processing. In Section 3.5, we cover knowledge distillation and other threads of research in this broad space. Here, we discuss a solution that shares similar motivations, but is clearly inspired by multi-stage ranking architectures.

Soldaini and Moschitti [2020] began with the observation that a model like monoBERT is already akin to a multi-stage ranking architecture if we consider each layer of the transformer encoder as a separate ranking stage. In the monoBERT design, inference is applied to all input texts (for example, k0 = 1000).
This seems like a "waste", and we could accelerate inference if the model could somehow predict partway through the layers that a particular text is not likely to be relevant. Therefore, a sketch of the solution might look like the following: start with a pool of candidate texts, apply inference on the entire batch using the first few layers, discard the least promising candidates, continue inference with the next few layers, discard the least promising candidates, and so on, until the end, when only the most promising candidates have made it all the way through the layers. With cascade transformers, Soldaini and Moschitti [2020] did exactly this.

More formally, with cascade transformers, intermediate classification decision points (which we'll call "early exits" for reasons that will become clear in a bit) are built in at layers j = λ0 + λ1 · (i − 1), ∀i ∈ {1, 2, ...}, where λ0, λ1 ∈ N are hyperparameters. Specifically, Soldaini and Moschitti [2020] build on the base version of RoBERTa [Liu et al., 2019c], which has 12 layers; they used a setting of λ0 = 4 and λ1 = 2, which yields five rerankers,107 with decision points at layers 4, 6, 8, 10, and 12. The rationale for skipping the first λ0 layers is that relevance classification effectiveness there is too poor for the model to be useful; this observation is consistent with findings across many NLP tasks [Houlsby et al., 2019, Lee et al., 2019a, Xin et al., 2020]. The [CLS] vector representation at each of the j layers (i.e., each of the cascade rerankers) is then fed to a fully-connected classification layer that computes the probability of relevance for the candidate text; this remains a pointwise relevance classification design. At inference time, at each of the j layers, the model scores P candidate documents and retains only the top (1 − α) · P scoring candidates, where α ∈ [0, 1] is a hyperparameter, typically between 0.3 and 0.5. That is, α · P candidates are discarded at each stage.

In practice, neural network inference is typically conducted on GPUs in batches. Soldaini and Moschitti [2020] worked through a concrete example of how these settings play out in practice: Consider a setting of α = 0.3 with a batch size b = 128. With the five-reranker cascade design described above, after layer 4, the size of the batch is reduced to 90, i.e., ⌊0.3 · 128⌋ = 38 candidates are discarded after the first classifier. At layer 6, after the second classification, 27 additional candidates are discarded, with only 63 remaining. At the end, only 31 candidates are left. Thus, cascade transformers have the effect of reducing the average batch size, which increases throughput on GPUs compared to a monolithic design, where inference must be applied to all input instances. In the example above, suppose that based on a particular hardware configuration we can process a maximum batch size of 84 using a monolithic model. With cascade transformers, we can instead process batches of 128 instances within the same memory constraints, since (4 · 128 + 2 · 90 + 2 · 63 + 2 · 44 + 2 · 28)/12 = 80.2 < 84. This represents a throughput increase of 52%.
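The arithmetic above is easy to reproduce with a small script. The following is a minimal sketch (ours, not the authors' code) of the pruning schedule; the exact rounding behavior at each exit point is an assumption, so the later batch sizes differ slightly from the numbers quoted from the paper.

import math

def exit_layers(num_layers: int = 12, lam0: int = 4, lam1: int = 2):
    """Decision points at layers j = lam0 + lam1 * (i - 1), up to the final layer."""
    return [lam0 + lam1 * i for i in range((num_layers - lam0) // lam1 + 1)]

def batch_schedule(batch_size: int, alpha: float, num_layers: int = 12,
                   lam0: int = 4, lam1: int = 2):
    """Batch size processed by each transformer layer when a fraction alpha of the
    surviving candidates is discarded at every early-exit point (no pruning after
    the final layer). Rounding here is a simple floor, an assumption on our part."""
    exits = set(exit_layers(num_layers, lam0, lam1)[:-1])
    sizes, current = [], batch_size
    for layer in range(1, num_layers + 1):
        sizes.append(current)
        if layer in exits:
            current -= math.floor(alpha * current)   # discard alpha * P candidates
    return sizes

sizes = batch_schedule(128, alpha=0.3)
print(sizes)                       # 128 for layers 1-4, then 90, 90, 63, 63, ...
print(sum(sizes) / len(sizes))     # average batch size across the 12 layers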
The cascade transformer architecture requires training all the classifiers at each of the individual rerankers (i.e., early exit points). The authors described a procedure wherein for each training batch, one of the rerankers is sampled (including the final output reranker): its loss against the target labels is computed and back-propagated through the entire model, down to the embedding layers. This simple uniform sampling strategy was found to be more effective than alternative techniques such as round-robin selection and biasing the early rerankers.

Soldaini and Moschitti [2020] evaluated their cascade transformers on the answer selection task in question answering, where the goal is to select from a pool of candidate sentences the ones that contain the answer to a given natural language question. This is essentially a text ranking task on sentences, where the ranked output provides the input to downstream modules that identify answer spans. The authors reported results on multiple answer selection datasets, but here we focus on two: Answer Sentence Natural Questions (ASNQ) [Garg et al., 2020], which is a large dataset constructed by extracting sentence candidates from the Google Natural Questions (NQ) dataset [Kwiatkowski et al., 2019], and the General Purpose Dataset (GPD), which is a proprietary dataset comprising questions submitted to Amazon Alexa with answers annotated by humans. In both cases, the datasets include the candidates to be reranked (i.e., first-stage retrieval is fixed and part of the test collection itself).

107 In truth, Soldaini and Moschitti [2020] describe their architecture in terms of reranking with multiple transformer stacks, e.g., first with a 4-layer transformer, then a 6-layer transformer, then an 8-layer transformer, etc. However, since in their design all common transformer layers have shared weights, it is entirely equivalent to a monolithic 12-layer transformer with five intermediate classification decision points (or early exits). We find this explanation more intuitive and better aligned with the terminology used by other researchers. Nevertheless, we retain the authors' original description of calling this design a five-reranker cascade.

                            ASNQ                        GPD
Method                      MAP     nDCG@10  MRR        MAP     nDCG@10  MRR      Cost Reduction
(1)  TANDABASE              0.655   0.651    0.647      0.580   0.722    0.768    -
(2a) CT (α = 0.0)           0.663   0.661    0.654      0.578   0.719    0.769    -
(2b) CT (α = 0.3)           0.653   0.653    0.653      0.557   0.698    0.751    −37%
(2c) CT (α = 0.4)           0.648   0.650    0.648      0.528   0.686    0.743    −45%
(2d) CT (α = 0.5)           0.641   0.650    0.645      0.502   0.661    0.729    −51%

Table 22: The effectiveness and cost reduction of cascade transformers on the ASNQ and GPD datasets. The parameter α controls the proportion of candidates discarded at each pipeline stage.

Results copied from the authors' paper are shown in Table 22. The baseline is TANDABASE [Garg et al., 2020], which is monoBERT with a multi-stage fine-tuning procedure that uses multiple datasets—what we introduced as pre–fine-tuning in Section 3.2.4. For each dataset, effectiveness results in terms of standard metrics are shown; the final column denotes an analytically computed cost reduction per batch. The cascade transformer architecture is denoted CT, in row group (2). In row (2a), with α = 0.0, all candidate sentences are scored using all layers of the model (i.e., no candidates are discarded). This model performs slightly better than the baseline, and these gains can be attributed to the training of the intermediate classification layers, since the rest of the CT architecture is exactly the same as the TANDA baseline. Rows (2b), (2c), and (2d) report effectiveness with different α settings. On the ASNQ dataset, CT with α = 0.5 is able to decrease inference cost per batch by around half with a small decrease in effectiveness. On the GPD dataset, inference cost can be reduced by 37% (α = 0.3) with a similarly modest decrease in effectiveness.
These experiments clearly demonstrated that cascade transformers provide a way for system designers to control effectiveness/efficiency tradeoffs in multi-stage ranking architectures. As with the mono/duoBERT design, the actual operating point depends on many considerations, but the main takeaway is that these designs provide the knobs for system designers to express their desired tradeoffs.

At the intersection of model design and the practical realities of GPU-based inference, Soldaini and Moschitti [2020] discussed a point that is worth repeating here. In their design, a fixed α is crucial to obtaining the performance gains observed, although in theory one could devise other approaches to pruning. For example, candidates could be discarded based on a score threshold (that is, discard all candidates with scores below a given threshold). Alternatively, it might even be possible to separately learn a lightweight classifier that dynamically decides which candidates to discard. The challenge with these alternatives, however, is that it becomes difficult to determine batch sizes a priori, and therefore to efficiently exploit GPU resources (which depend critically on regular computations).

It is worth noting that cascade transformers were designed to rank candidate sentences in a question answering task, and cannot be directly applied to document ranking, even with relatively simple architectures like Birch and BERT–MaxP. There is the practical problem of packing sentences (from Birch) or passages (from BERT–MaxP) into batches for GPU processing. As we can see from the discussion above, cascade transformers derive their throughput gains from the ability to more densely pack instances into the same batch for efficient inference. However, for document ranking, it is important to distinguish between scores of segments within documents as well as across documents. The simple filtering decision in terms of α cannot preserve both relationships at the same time if segments from multiple documents are mixed together, but since documents have variable numbers of sentences or passages, strictly segregating batches by document would reduce the regularity of the computations and hence the overall efficiency. To our knowledge, these issues have not been tackled, and cascade transformers have not been extended to ranking texts that are longer than BERT's 512-token length limit. Such extensions would be interesting future work.

To gain a better understanding of cascade transformers, it is helpful to situate this work within the broader context of other research in NLP. The insight that not all layers of BERT are necessary for effectively performing a task (e.g., classification) was shared independently and contemporaneously by a number of different research teams. While Soldaini and Moschitti [2020] operationalized this idea for text ranking in cascade transformers, other researchers applied the same intuition to other natural language processing tasks. For example, DeeBERT [Xin et al., 2020] proposed building early exit "off ramps" in BERT to accelerate inference for test instances based on an entropy threshold; two additional papers, Schwartz et al. [2020] and Liu et al. [2020], implemented the same idea with only minor differences in detail. Quite amazingly, these three papers, along with the work of Soldaini and Moschitti, were all published at the same conference, ACL 2020!
Although this remarkable coincidence suggests early exit was an idea "whose time had come", it is important to recognize that, in truth, the idea had been around for a while—just not in the modern context of neural networks. Over a decade ago, Cambazoglu et al. [2010] proposed early exits in additive ensembles for ranking, but in the context of gradient-boosted decision trees, which exhibit the same regular, repeating structure (at the "block" level) as transformer layers. Of course, BERT and pretrained transformers offer a "fresh take" that opens up new design choices, but many of the lessons and ideas from (much older) previous work remain applicable.

A final concluding thought before moving on: the above discussion suggests that the distinction between monolithic ranking models and multi-stage ranking is not clear cut. For example, is the cascade transformer a multi-stage ranking pipeline or a monolithic ranker with early exits? Both seem apt descriptions, depending on one's perspective. However, the mono/duoBERT combination can only be accurately described as multi-stage ranking, since the two rerankers are quite different. Perhaps the distinction lies in the "end-to-end" differentiability of the model (and hence how it is trained)? But differentiability stops at the initial candidate generation stage since all the architectures discussed in this section still rely on keyword search. Learned dense representations, which we cover in Section 5, can be used for single-stage direct ranking, but can also replace keyword search for candidate generation, further muddling these distinctions. Indeed, the relationship between these various architectures remains an open question and the focus of much ongoing research activity, which we discuss in Section 6.

Takeaway Lessons. Cascade transformers represent another example of a multi-stage ranking pipeline. Compared to the mono/duoBERT design, the approach is very different, which illustrates the versatility of the overall architecture. Researchers have only begun to explore this vast and interesting design space, and we expect more interesting future work to emerge.

3.5 Beyond BERT

All of the ranking models discussed so far in this section are still primarily built around BERT or a simple BERT variant, even if they incorporate other architectural components, such as interaction matrices in CEDR (see Section 3.3.3) or another stack of transformers in PARADE (see Section 3.3.4). There are, however, many attempts to move beyond BERT to explore other transformer models, which is the focus of this section.

At a high level, efforts to improve ranking models can be characterized as attempts to make ranking better, attempts to make ranking faster, attempts to accomplish both, or attempts to find other operating points in the effectiveness/efficiency tradeoff space. Improved ranking effectiveness is, of course, a perpetual quest and needs no elaboration. Attempts to make text ranking models faster can be motivated by many sources. Here, we present results by Hofstätter and Hanbury [2019], shown in Figure 16. The plot captures the effectiveness vs. query latency (milliseconds per query) of different neural ranking models on the development set of the MS MARCO passage ranking test collection. Note that the x axis is in log scale!
Pre-BERT models can be deployed for real-world applications with minimal modifications, but it is clear that naïve production deployments of BERT are impractical or hugely expensive in terms of required hardware resources. In other words, BERT is good but slow: Can we trade off a bit of quality for better performance?

Figure 16: Effectiveness/efficiency tradeoffs comparing BERT with pre-BERT models (using FastText embeddings) on the development set of the MS MARCO passage ranking test collection, taken from Hofstätter and Hanbury [2019]. Note that the x-axis is in log scale.

This section is organized roughly in increasing "distance from BERT". Admittedly, what's BERT and what's "beyond BERT" is a somewhat arbitrary distinction. These classifications represent primarily our judgment for expository purposes and shouldn't be taken as any sort of definitive categorization. Building on our previous discussion of simple BERT variants in Section 3.2.2, we begin by discussing efforts to distill BERT into smaller models in Section 3.5.1. Distilled models are similar to the simple BERT variants in that they can easily be "swapped in" as a replacement for BERT "classic". Attempts to design transformer-based architectures specifically for text ranking from the ground up—the Transformer Kernel (TK) and Conformer Kernel (CK) models—are discussed next in Section 3.5.2. Finally, we turn our attention to ranking with pretrained sequence-to-sequence transformers in Section 3.5.3 and Section 3.5.4, which are very different from the transformer encoder design of BERT and BERT variants.

3.5.1 Knowledge Distillation

Knowledge distillation refers to a general set of techniques where a smaller student model learns to mimic the behavior of a larger teacher model [Ba and Caruana, 2014, Hinton et al., 2015]. The goal is for the student model to achieve comparable effectiveness on a particular task but more efficiently (e.g., lower inference latencies, fewer model parameters, etc.). While knowledge distillation is model agnostic and researchers have explored this approach for many years, to our knowledge Tang et al. [2019] were the first to apply the idea to BERT, demonstrating knowledge transfer between BERT and much simpler models such as single-layer BiLSTMs. A much simpler RNN-based student model, of course, cannot hope to achieve the same level of effectiveness as BERT, but if the degradation is acceptable, inference can be accelerated by an order of magnitude or more. These ideas have been extended by many others [Sun et al., 2019a, Liu et al., 2019b, Sanh et al., 2019, Hofstätter et al., 2020], with a range of different student models, including smaller versions of BERT.

Unsurprisingly, knowledge distillation has been applied to text ranking. Researchers have investigated whether the efficiency of BERT can be improved by distilling a larger trained (BERT) model into a smaller (but still BERT-based) one [Gao et al., 2020c, Li et al., 2020a, Chen et al., 2021, Zhang et al., 2020f]. To encourage the student model to mimic the behavior of the teacher model, one common distillation objective is the mean squared error between the student's and teacher's logits [Tang et al., 2019, Tahami et al., 2020].
The student model can be fine-tuned with a linear combination of the student model's cross-entropy loss and the distillation objective as the overall loss:

L = \alpha \cdot L_{CE} + (1 - \alpha) \cdot \| r^t - r^s \|^2    (39)

where L_CE is the cross-entropy loss, r^t and r^s are the logits from the teacher and student models, respectively, and α is a hyperparameter. As another approach, TinyBERT proposed a distillation objective that additionally considers the mean squared error between the two models' embedding layers, transformer hidden states, and transformer attention matrices [Jiao et al., 2019]. In the context of text ranking, Chen et al. [2021] reported that this more complicated objective can improve effectiveness.

Gao et al. [2020c] observed that distillation can be applied both to a BERT model that has already been fine-tuned for relevance classification ("ranker distillation") and to pretrained but not yet fine-tuned BERT itself ("LM distillation"). Concretely, this yields three possibilities:

1. apply distillation so that a (randomly initialized) student model learns to directly mimic an already fine-tuned teacher model using the distillation objective above ("ranker distillation"),

2. apply LM distillation into a student model followed by fine-tuning the student model for the relevance classification task ("LM distillation + fine-tuning"), or

3. apply LM distillation followed by ranker distillation ("LM + ranker distillation").

Operationally, the third approach is equivalent to the first approach, except with a better initialization of the student model. The relative effectiveness of these three approaches is an empirical question.
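Before turning to the experiments, the following is a minimal sketch (ours) of the combined objective in Equation (39), written in PyTorch with hypothetical logits; the mean reduction used for the squared-error term is an assumption, since it only rescales the loss.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Equation (39): a weighted sum of the student's cross-entropy loss against
    the relevance labels and the MSE between student and teacher logits."""
    ce = F.cross_entropy(student_logits, labels)        # L_CE
    mse = F.mse_loss(student_logits, teacher_logits)    # ||r^t - r^s||^2 (mean-reduced)
    return alpha * ce + (1.0 - alpha) * mse

# Hypothetical relevance-classification logits for a batch of two query-text pairs.
student = torch.tensor([[0.3, 1.1], [0.8, -0.2]])
teacher = torch.tensor([[0.1, 2.0], [1.5, -0.9]])
labels = torch.tensor([1, 0])
print(distillation_loss(student, teacher, labels, alpha=0.5))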
To answer this question, Gao et al. [2020c] used the TinyBERT distillation objective to distill a BERTBase model into smaller transformers: a six-layer model with a hidden dimension of 768 or a four-layer model with a hidden dimension of 312. Both the student and teacher models are designed as relevance classifiers (i.e., monoBERT). Evaluations on the development set of the MS MARCO passage ranking test collection and the TREC 2019 Deep Learning Track passage ranking test collection are shown in Table 23, with results copied from Gao et al. [2020c]. The six-layer and four-layer student models are shown in row groups (2) and (3), respectively, and the monoBERTBase teacher model is shown in row (1). The (a), (b), (c) rows of row groups (2) and (3) correspond to the three approaches presented above. The final column shows inference latency measured on an NVIDIA RTX 2080 Ti GPU.

                                              MS MARCO Passage   TREC 2019 DL Passage
Method                               Layers   MRR@10             MRR     nDCG@10       Latency (ms/doc)
(1)  monoBERTBase                    12       0.353              0.935   0.703         2.97
(2a) Ranker Distillation             6        0.338              0.927   0.686         1.50
(2b) LM Distillation + Fine-Tuning   6        0.356              0.965   0.719         1.50
(2c) LM + Ranker Distillation        6        0.360              0.952   0.692         1.50
(3a) Ranker Distillation             4        0.329              0.935   0.669         0.33
(3b) LM Distillation + Fine-Tuning   4        0.332              0.950   0.681         0.33
(3c) LM + Ranker Distillation        4        0.350              0.929   0.683         0.33

Table 23: The effectiveness of distilled monoBERT variants on the development set of the MS MARCO passage ranking test collection and the TREC 2019 Deep Learning Track passage ranking test collection. Inference times were measured on an NVIDIA RTX 2080 Ti GPU.

We see that ranker distillation alone performs the worst; the authors reported a statistically significant decrease in effectiveness from the teacher model across all metrics and both test collections. Both LM distillation followed by fine-tuning and LM distillation followed by ranker distillation led to student models comparable to the teacher in effectiveness. We see that in terms of MRR, "LM + ranker distillation" outperforms "LM distillation + fine-tuning" on the MS MARCO passage ranking test collection, but the other way around on the TREC 2019 Deep Learning Track passage ranking test collection; note, though, that the first has far more queries than the second and thus might provide a more stable characterization of effectiveness. Overall, the six-layer distilled model can perform slightly better than the teacher model while being twice as fast,108 whereas the four-layer distilled model gains a 9× speedup in exchange for a small decrease in effectiveness.

108 We suspect that the slightly higher effectiveness is due to a regularization effect, but this finding needs more detailed investigation.

As another example of explorations in knowledge distillation, Li et al. [2020a] investigated how well their PARADE model performs when distilled into student models that range in size. Specifically, they examined two approaches:

1. train the full PARADE model using a smaller BERT variant distilled from BERTLarge by Turc et al. [2019] in place of BERTBase, and

2. apply ranker distillation with the MSE distillation objective, where PARADE trained with BERTBase is used as the teacher model, and the student model is PARADE with a smaller BERT variant (i.e., one of the pre-distilled models provided by Turc et al. [2019]).

Experimental results for Robust04 title queries are shown in Table 24, with figures copied from Li et al. [2020a]. Row (1) presents the effectiveness of the teacher model, which is the same model shown in row (6f) in Table 18. However, in order to reduce the computational requirements, the experimental setup here differs from that used in Table 18 in two ways: fewer terms per document are considered (1650 rather than 3250) and fewer documents are reranked (100 rather than 1000); thus, the starting effectiveness is lower. Rows (2–8) present the distillation results.

                            Robust04 nDCG@20
Method         L / H        Train     Distill      Parameters   Latency (ms/doc)
(1) Base       12 / 768     0.5252    -            123M         4.93
(2) (unnamed)  10 / 768     0.5168    0.5296†      109M         4.19
(3) (unnamed)  8 / 768      0.5168    0.5231       95M          3.45
(4) Medium     8 / 512      0.5049    0.5110       48M          1.94
(5) Small      4 / 512      0.4983    0.5098†      35M          1.14
(6) Mini       4 / 256      0.4500    0.4666†      13M          0.53
(7) (unnamed)  2 / 512      0.4673    0.4729       28M          0.74
(8) Tiny       2 / 128      0.4216    0.4410†      5M           0.18

Table 24: The effectiveness of training PARADE using a smaller BERT vs. distilling a BERTBase PARADE teacher into smaller BERT models on Robust04 title queries. Inference times were measured on a Google TPU v3-8. The symbol † indicates a significant improvement of a "Distill" model over the corresponding "Train" model (paired t-test, p < 0.05).

The "Train" column shows the results of training PARADE with BERT models of different sizes. This corresponds to the LM distillation plus fine-tuning setting from Gao et al. [2020c] (except that the full PARADE model involves more than just fine-tuning). The "Distill" column shows the results of distilling PARADE from a teacher using BERTBase into smaller, already distilled students. This corresponds to the LM distillation plus ranker distillation setting from Gao et al. [2020c]. Inference times were measured using a Google TPU v3-8 with a batch size of 32.
The symbol † indicates a significant improvement of a "Distill" model over the corresponding "Train" model, as determined by a paired t-test (p < 0.05). Comparing the "Train" and "Distill" columns, it is clear that distilling into a smaller model is preferable to training a smaller model directly. The models under the ranker distillation condition are always more effective than the models that are trained directly, and this increase is statistically significant in most cases. These results are consistent with the findings of Gao et al. [2020c], at least on the MS MARCO passage ranking test collection. In rows (2) and (3) of Table 24, we see that reducing the number of transformer encoder layers in BERTBase under the "Train" condition sacrifices only a tiny bit of nDCG@20 for noticeably faster inference. However, the "Distill" versions of these models perform comparably to the original BERTBase version, indicating that distillation into a "slightly smaller" model can improve efficiency without harming effectiveness. The same trends continue with smaller BERT variants, with effectiveness decreasing as the model size decreases. We also see that ranker distillation is consistently more effective than directly training smaller models. The difference between the teacher and ranker-distilled models becomes statistically significant from row (4) onwards. This indicates that ranker distillation can be used to eliminate about a quarter of PARADE's parameters and reduce inference latency by about a third without significantly harming the model's effectiveness.

The papers of Gao et al. [2020c] and Li et al. [2020a], unfortunately, explored different datasets and different metrics with no overlap—thus preventing a direct comparison. Furthermore, there are technical differences in their approaches: Gao et al. [2020c] began with TinyBERT's distillation objective [Jiao et al., 2019] to produce their smaller BERT models. On the other hand, Li et al. [2020a] used as starting points the pre-distilled models provided by Turc et al. [2019]. Since the starting points differ, it is not possible to separate the impact of the inherent quality of the smaller BERT models from the impact of the PARADE aggregation mechanisms in potentially compensating for a smaller but less effective BERT model. Nevertheless, both papers seem to suggest that fine-tuning a smaller model directly is less effective than distilling into a smaller model from a fine-tuned (larger) teacher, although the evidence is equivocal from Gao et al. [2020c] because only one of the two test collections supports this observation. However, beyond text ranking, we find broader complementary support for this conclusion: results on NLP tasks show that training a larger model and then compressing it is more computationally efficient than spending comparable resources directly training a smaller model [Li et al., 2020b]. We also note the connection here with the so-called "Lottery Ticket Hypothesis" [Frankle and Carbin, 2019, Yu et al., 2020a], although more research is needed here to fully reconcile all these related threads of work.
While knowledge distillation inevitably degrades effectiveness, the potentially large increases in efficiency make the tradeoffs worthwhile under certain operating scenarios. Emerging evidence suggests that the best practice is to distill a large teacher model that has already been fine-tuned for ranking into a smaller pretrained student model.</p><p>3.5.2 Ranking with Transformers: TK, TKL, CK Empirically, BERT has proven to be very effective for many NLP and information access tasks. Combining this robust finding with the observation that BERT appears to be over-parameterized (for example, Kovaleva et al. [2019]) leads to the interesting question of whether smaller models might be just as effective, particularly if limited to a specific task such as text ranking. Knowledge distillation from larger BERT models into smaller BERT models represents one approach to answering this question (discussed above), but could we arrive at better effectiveness/efficiency tradeoffs if we redesigned neural architectures from scratch? Hofstätter et al. [2019] and Hofstätter et al. [2020] tried to answer this question by proposing a text ranking model called the Transformer Kernel (TK) model, which might be characterized as a “clean-slate” redesign of transformer architectures specifically for text ranking. The only common feature between monoBERT and the TK model is that both use transformers to compute contextual representations of input tokens. Specifically, TK uses separate transformer stacks to compute contextual representations of query terms and terms from the candidate text, which are then used to construct a similarity matrix that is consumed by a KNRM variant [Xiong et al., 2017] (discussed in Section 1.2.4). Since the contextual representations of texts from the corpus can be precomputed and stored, this approach is similar to KNRM in terms of the computational costs incurred at inference time (plus the small amount of computation needed to compute a query representation). The idea of comparing precomputed term embeddings within an interaction-based model has been well explored in the pre-BERT era, with models like POSIT-DRMM [McDonald et al., 2018], which used RNNs to produce contextual embeddings but consumed those embeddings with a more complicated architecture involving attention and pooling layers. The main innovation in the Transformer Kernel model is the use of transformers as encoders to produce contextual embeddings, which we know are generally more effective than CNNs and RNNs on a wide range of NLP tasks.</p><p>In more detail, given a sequence of term embeddings t1, . . . , tn, TK uses a stack of transformer encoder layers to produce a sequence of contextual embeddings T1,...,Tn:</p><p>T1,...,Tn = Encoder3(Encoder2(Encoder1(t1, . . . , tn)). (40) This is performed independently for terms from the query and terms in the texts from the corpus (the latter, as we note, can be precomputed). The contextual embeddings from the query and candidate text are then used to construct a similarity matrix that is passed to the KNRM component, which produces a relevance score that is used for reranking. Hofstätter et al. [2019] pointed out that the similarity matrix constitutes an information bottleneck, which provides a straightforward way to analyze the term relationships learned by the transformer stack. An intentionally simplified diagram of the TK architecture is shown in Figure 17, where we focus on the high-level design and elide a number of details. In a follow-up, Hofstätter et al. 
In a follow-up, Hofstätter et al. [2020] proposed the TK with local attention (TKL) model, which replaced the transformer encoder layers' self-attention with local self-attention, meaning that the attention for a distant term (defined as more than 50 tokens away) is always zero and thus does not need to be computed. TKL additionally uses a modified KNRM component that performs pooling over windows of document terms rather than over the entire document. Pushing these ideas further, Mitra et al. [2020] extended TK into the Conformer Kernel (CK) model, which adds an explicit term-matching component based on BM25 and two efficiency improvements: assuming query term independence [Mitra et al., 2019] and replacing the transformer encoder layers with new "conformer" layers. The query term independence assumption is made by applying the encoder layers only to the document (i.e., using non-contextual query term embeddings) and applying KNRM's aggregation to score each query term independently; the per-term scores are then summed. Similar to TKL's local attention, the proposed conformer layer is a transformer encoder layer in which self-attention is replaced with separable self-attention and a grouped convolution is applied before the attention layer.
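The local self-attention idea is easy to illustrate: attention is simply restricted to a band around each position. The sketch below (ours, not from the TKL implementation) builds such a mask; the boolean convention follows PyTorch, where True marks positions that may not be attended to, and the actual TKL model handles windowing and pooling differently.

import torch

def local_attention_mask(seq_len: int, window: int = 50) -> torch.Tensor:
    """Boolean mask for local self-attention: position i may only attend to positions j
    with |i - j| <= window; True marks pairs that are masked out (ignored)."""
    idx = torch.arange(seq_len)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() > window

mask = local_attention_mask(seq_len=200, window=50)
print(mask.shape, mask[0, 51].item(), mask[0, 50].item())   # torch.Size([200, 200]) True False

A mask of this form can be passed as the attention mask of a standard transformer encoder to zero out interactions between distant terms, which is the effect TKL exploits to reduce computation on long documents.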
The TK, TKL, and CK models are trained from scratch (yes, from scratch) with an embedding layer initialized using context-independent embeddings. The TK and TKL models use GloVe embeddings [Pennington et al., 2014], and the CK model uses a concatenation of the "IN" and "OUT" variants of word2vec embeddings [Mitra et al., 2016]. This design choice is very much in line with the motivation of rethinking transformers for text ranking from the ground up. However, these designs also mean that the models do not benefit from the self-supervised pretraining that is immensely beneficial for BERT (see discussion in Section 3.1). While the models do make use of trained embeddings, the transformer layers used to contextualize the embeddings are randomly initialized.

Experiments demonstrating the effectiveness of TK, TKL, and CK are shown in Table 25, pieced together from a number of sources. Results on the development set of the MS MARCO passage ranking test collection for TK, rows (4b)–(4d), are taken from Hofstätter et al. [2020], as well as their replication of the monoBERT baselines, row group (3). For reference, row (1) repeats the effectiveness of monoBERTLarge from Table 5. Although Hofstätter et al. [2020] additionally reported results on the MS MARCO document ranking test collection, we do not include those results here since the TKL and CK papers did not evaluate on that test collection. For TKL, results are copied from Hofstätter et al. [2020] on the TREC 2019 Deep Learning Track document ranking test collection, and for CK, results are copied from Mitra et al. [2020] for the same test collection. Fortunately, TK model submissions to the TREC 2019 Deep Learning Track [Craswell et al., 2020] provide a bridge to help us understand the relationship between these models. From what we can tell, the TK (3 layer, FastText, window pooling) results, row (4a), correspond to run TUW19-d3-re and the TK (3 layer) results, row (4b), correspond to run TUW19-d2-re.

                                                      MS MARCO Passage   TREC 2019 DL Doc
Method                                                MRR@10             MRR     nDCG@10
(1)  monoBERTLarge = Table 5, row (3b)                0.372              -       -
(2a) Co-PACRR [Hofstätter et al., 2020]               0.273              -       -
(2b) ConvKNRM [Hofstätter et al., 2020]               0.277              -       -
(2c) FastText + ConvKNRM [Hofstätter et al., 2019]    0.278              -       -
(3a) monoBERTBase [Hofstätter et al., 2020]           0.376              -       -
(3b) monoBERTLarge [Hofstätter et al., 2020]          0.366              -       -
(4a) TK (3 layer, FastText, window pooling)           -                  0.946   0.644
(4b) TK (3 layer)                                     0.314              0.942   0.605
(4c) TK (2 layer)                                     0.311              -       -
(4d) TK (1 layer)                                     0.303              -       -
(5)  TKL (2 layer)                                    -                  0.957   0.644
(6a) CK (2 layer)                                     -                  0.845   0.554
(6b) CK (2 layer) + Exact Matching                    -                  0.906   0.603

Table 25: The effectiveness of the TK, TKL, and CK models on the development set of the MS MARCO passage ranking test collection and the TREC 2019 Deep Learning Track document ranking test collection.

To aid in the interpretation of these results, row group (2) shows results from a few pre-BERT interaction-based neural ranking models. Rows (2a) and (2b) are taken directly from Hofstätter et al. [2020] for Co-PACRR [Hui et al., 2018] and ConvKNRM [Dai et al., 2018]. Row (2c) is taken from Hofstätter et al. [2019], which provides more details on the ConvKNRM design. These three results might be characterized as the state of the art in pre-BERT interaction-based neural ranking models, just prior to the community's shift over to transformer-based approaches.

We see that the TK model is more effective than these pre-BERT models, but still much less effective than monoBERT. Thus, TK can be characterized as a less effective but more efficient transformer-based ranking model, compared to monoBERT. This can be seen in Figure 18, taken from Hofstätter et al. [2020], which plots the effectiveness/efficiency tradeoffs of different neural ranking models. With a latency budget of less than around 200ms, TK is more effective than monoBERTBase, and TK represents strictly an improvement over pre-BERT models across all latency budgets. Interestingly, there is little difference between the two-layer and three-layer TK models, which is consistent with the results presented in the context of distillation above.

Figure 18: Effectiveness/efficiency tradeoffs of the TK model compared to BERT and other pre-BERT interaction-based neural ranking models on the MS MARCO passage ranking test collection, taken from Hofstätter et al. [2020].

On the TREC 2019 Deep Learning Track document ranking test collection, the TK and TKL models perform substantially better than the CK model,109 though the conformer layers used by CK are more memory-efficient. That is, the design of CK appears to further trade effectiveness for efficiency. Incorporating an exact matching component consisting of BM25 with learned weights improves the effectiveness of the CK model, but it does not reach the effectiveness of TK or TKL. There does not appear to be much difference in the effectiveness of TK vs. TKL. Unfortunately, it is difficult to quantify the effectiveness/efficiency tradeoffs of TKL and CK compared to TK, as we are not aware of a similar analysis along the lines of Figure 18.

109 Note that TK and TKL in these experiments performed reranking on a fixed candidate set (what the TREC 2019 Deep Learning Track organizers called the "reranking" condition), whereas CK reranked the output of its own first-stage retrieval (the "full ranking" condition).

Takeaway Lessons. What have we learned from these proposed transformer architectures for text ranking? Thus far, the results are a bit mixed.
On the one hand, we believe it is very important for the community to explore a diversity of approaches, and to rethink how we might redesign transformers for text ranking given a blank slate. The TK/TKL/CK models have tackled this challenge head on, but it is too early to draw any definitive conclusions from these efforts. Furthermore, CK represents an exploration of the space between pre-BERT interaction-based neural ranking models and TK, i.e., even more computationally efficient, but also even less effective. There is, in our opinion, an even more interesting tradeoff space between TK and monoBERT. That is, can we give up a bit more of TK's efficiency to close its effectiveness gap with monoBERT?

On the other hand, it is unclear whether the current design of the TK/TKL/CK models can benefit from the massive amounts of self-supervised pretraining that is the hallmark of BERT and that, based on the discussion in Section 3.1, is the main source of the big leaps in effectiveness we've witnessed on a variety of NLP tasks. In other words, what is more important, pretraining (to produce high-quality contextual representations) or the model architecture (to capture relevance based on the query and document representations)? Boytsov and Kolter [2021] explored architectures that use a pre-neural lexical translation model to aggregate evidence from BERT-based contextualized embeddings; this deviates from the standard cross-encoder design by eliminating attention-based interactions between terms from the queries and the documents. Their results were able to isolate the contributions of contextual representations and thus highlight the importance of pretraining. One possible interpretation is that given sufficiently good representations of the queries and texts from the corpus, the "relevance matching machinery" is perhaps not very important. Currently, we still lack definitive answers, but this represents an interesting future direction worth exploring.

3.5.3 Ranking with Sequence-to-Sequence Models: monoT5

All of the transformer models for text ranking that we have discussed so far in this section can be characterized as encoder-only architectures. At a high level, these models take vector representations derived from a sequence of input tokens and emit relevance scores. However, the original transformer design [Vaswani et al., 2017] is an encoder–decoder architecture, where an input sequence of tokens is converted into vector representations, passed through transformer encoder layers to compute an internal representation (the encoding phase), which is then used in transformer decoder layers to generate a sequence of tokens (the decoding phase).
While the alignment is imperfect, it is helpful to characterize previous models in terms of this full encoder–decoder transformer architecture. GPT [Radford et al., 2018] described itself as a transformer decoder, and to fit with this analogy, Raffel et al. [2020] characterized BERT as being an "encoder-only" design. In NLP, encoder–decoder models are also referred to as sequence-to-sequence models because a sequence of tokens comes in and a sequence of tokens comes out. This input–output behavior intuitively captures tasks such as machine translation—where the input sequence is in one language and the model output is the input sequence translated into a different language—and abstractive summarization—where the input sequence is a long(er) segment of text and the output sequence comprises a concise summary of the input sequence capturing key content elements.
For example, to translate a text from English to German, the model is fed the following template: translate English to German: [input] (41) where the sentence to be translated replaces [input] and “translate English to German:” is a literal string, which the model learns to associate with a specific task during the fine-tuning process. In other words, a part of the input sequence consists of a string that informs the model what task it is to perform. To give another example, a classification task such as sentiment analysis (with the SST2 dataset) has the following template: sst2 sentence: [input] (42) where, once again, [input] is replaced with the actual input sentence and “sst2 sentence:” is a literal string indicating the task. For this task, the “ground truth” (i.e., output sequence) for the sequence-to-sequence model is a single token, either “positive” or “negative” (i.e., the literal string). In other words, given training examples processed into the above template, the model is trained to generate either the token “positive” or “negative”, corresponding to the prediction. This idea is pushed even further with regression tasks such as the Semantic Textual Similarity Benchmark [Cer et al., 2017], where the target outputs are human-annotated similarity scores between one and five. In this case, the target output is quantized to the nearest tenth, and the model is trained to emit that literal token. Raffel et al. [2020] showed that this “everything as sequence-to- sequence” formulation is not only tenable, but achieves state-of-the-art effectiveness (at the time the model was introduced) on a broad range natural language processing tasks. Although it can seem unnatural for certain tasks, this formulation has proven to be quite powerful; later work extended this approach to even more tasks, including commonsense reasoning [Khashabi et al., 2020, Yang et al., 2020a] and fact verification [Pradeep et al., 2021a]. Inspired by the success of the sequence-to-sequence formulation, Nogueira et al. [2020] investigated if the T5 model could also be applied to text ranking. It is, however, not entirely straightforward how this could be accomplished. There are a number of possible formulations: As text ranking requires</p><p>105 MS MARCO Passage Development Test Method # Params MRR@10 MRR@10 (1) BM25 - 0.184 0.186 (2) + BERT-large 340 M 0.372 0.365 (3a) + T5-base 220 M 0.381 - (3b) + T5-large 770 M 0.393 - (3c) + T5-3B 3 B 0.398 0.388 Table 26: The effectiveness of monoT5 on the MS MARCO passage ranking task. a score for each document to produce a ranked list, T5 could be trained to directly produce scores as strings like in the STS-B task, if we found the right test collection. Graded relevance judgments might work, but unfortunately most test collections of this type are quite small; the MS MARCO passage ranking test collection provides only binary relevance. An alternative would be to encode all the candidate texts (from initial retrieval) into a single input template and train the model to select the most relevant ones. This would be similar to the listwise approach presented in Section 3.4.2, but as we have discussed, documents can be long, so this is not feasible given the length limitations of current transformer models. Thus, ranking necessitates multiple inference passes with the model and somehow aggregating the outputs. Nogueira et al. 
Nogueira et al. [2020] ultimately solved these challenges by exploiting internal model representations just prior to the generation of an output token for relevance classification. Their model, dubbed "monoT5" (mirroring "monoBERT"), uses the following input template:

Query: [q] Document: [d] Relevant:    (43)

where [q] and [d] are replaced with the query and document texts, respectively, and the other parts of the template are verbatim string literals. The model is fine-tuned to produce the tokens "true" or "false" depending on whether the document is relevant or not to the query. That is, "true" and "false" are the "target tokens" (i.e., ground truth predictions in the sequence-to-sequence transformation). At inference time, to compute probabilities for each query–text pair, a softmax is applied only to the logits of the "true" and "false" tokens in the first decoding step.110 Specifically, the final estimate we are after in relevance classification, P(Relevant = 1 | q, d), is computed as the probability assigned to the "true" token normalized in this manner. Similar to monoBERT, monoT5 is deployed as a reranker.

110 The T5 model tokenizes sequences using the SentencePiece model [Kudo and Richardson, 2018], which might split a word into subwords. The selected targets ("true" and "false") are represented as single tokens; thus, each class is represented by a single logit.

How does monoT5 compare against monoBERT? Results on the MS MARCO passage ranking test collection are presented in Table 26, copied from Nogueira et al. [2020]. Interestingly, monoT5-base has higher effectiveness than monoBERT-large, row (2) vs. row (3a), but it has fewer parameters (220M vs. 340M) and it is approximately two times faster at inference. Using larger models increases effectiveness but at increased costs in memory and computation: monoT5-3B is 1.6 points better than monoT5-base, but it is approximately 14 times larger and 10 times slower, row (3a) vs. row (3c).
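As a concrete illustration of the scoring procedure just described, the following is a minimal sketch using the Hugging Face Transformers library. The checkpoint name is a placeholder (in practice a monoT5 model fine-tuned as described above would be loaded), and the way the "true"/"false" token ids are looked up is an assumption on our part.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" is the generic pretrained checkpoint; a fine-tuned monoT5 checkpoint
# would be substituted here in practice.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def monot5_score(query: str, document: str) -> float:
    """Return P(Relevant = 1 | q, d) as the softmax over the 'true'/'false' logits
    at the first decoding step, following the input template in (43)."""
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # T5's decoder starts generation from its decoder start token (the pad token).
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]    # single token in T5
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()

print(monot5_score("what causes tides", "Tides are caused by the gravitational pull of the moon."))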
Not only does monoT5 appear to be more effective overall, but it also appears to be more data efficient to train. The effectiveness of monoT5-base vs. monoBERTBase on the development set of the MS MARCO passage ranking task as a function of the amount of training data provided during fine-tuning is shown in Figure 19. The monoBERTBase values are exactly the same as in Figure 8, since both are copied from Nogueira et al. [2020]. In these experiments, both models were fine-tuned with 1K, 2.5K, and 10K positive query–passage instances and an equal number of negative instances sampled from the full training set of the MS MARCO passage ranking test collection. Effectiveness on the development set is reported in terms of MRR@10 with the standard setting of reranking k = 1000 candidate texts from BM25; note that the x-axis is in log scale. For the sampled conditions, the experiment was repeated five times, and the plot shows the 95% confidence intervals. The setting that used all training instances was only run once due to computational costs. The dotted horizontal black line shows the effectiveness of BM25 without any reranking.

Figure 19: The effectiveness of monoT5-base and monoBERTBase on the development set of the MS MARCO passage ranking test collection, varying the amount of training data used to fine-tune the models. Results report means and 95% confidence intervals over five trials. Note that the x-axis is in log scale.

Given the same amount of training data, monoT5 appears to consistently outperform monoBERT. With just 1K positive query–passage instances, monoT5 is able to exceed the effectiveness of BM25; in contrast, monoBERT exhibits the "a little bit is worse than none" behavior [Zhang et al., 2020g], and doesn't beat BM25 until it has been provided around 10K positive training instances. The effectiveness gap between the two models narrows as the amount of training data grows, which suggests that monoT5 is more data efficient and able to "get more" out of limited training data.

Another way to articulate the findings in Figure 19 is that monoT5 appears to excel at few-shot learning. Taking this idea to its logical end, how might the model perform in a zero-shot setting? We have already seen that monoBERT exhibits strong cross-domain transfer capabilities, for example, in the context of pre–fine-tuning techniques (see Section 3.2.4) and Birch (see Section 3.3.1), so might we expect monoT5 to also perform well? Indeed, Nogueira et al. [2020] explored exactly this zero-shot approach with monoT5, fine-tuning the model on the MS MARCO passage ranking test collection and directly evaluating on other test collections: Robust04, Core17, and Core18. These results are shown in Table 27, with the best configuration of Birch, copied from Table 10 and shown here in row (1a). The authors used Birch as their baseline for comparison, which we have retained here.

                                     Robust04            Core17              Core18
Method                               MAP      nDCG@20    MAP      nDCG@20    MAP      nDCG@20
(1a) Birch = Table 10, row (4b)      0.3697   0.5325     0.3323   0.5092     0.3522   0.4953
(1b) PARADE = Table 18, row (6f)     0.4084   0.6127     -        -          -        -
(2a) BM25                            0.2531   0.4240     0.2087   0.3877     0.2495   0.4100
(2b) + T5-base                       0.3279   0.5298     0.2758   0.5180     0.3125   0.4741
(2c) + T5-large                      0.3288   0.5345     0.2799   0.5356     0.3330   0.5057
(2d) + T5-3B                         0.3876   0.6091     0.3193   0.5629     0.3749   0.5493
(3a) BM25 + RM3                      0.2903   0.4407     0.2823   0.4467     0.3135   0.4604
(3b) + T5-base                       0.3340   0.5532     0.3067   0.5203     0.3364   0.4698
(3c) + T5-large                      0.3382   0.5287     0.3109   0.5299     0.3557   0.5007
(3d) + T5-3B                         0.4062   0.6122     0.3564   0.5612     0.3998   0.5492

Table 27: The effectiveness of monoT5 on the Robust04, Core17, and Core18 test collections. Note that the T5 models are trained only on the MS MARCO passage ranking test collection and thus these represent zero-shot results.

Row group (2) presents results reranking first-stage candidates from BM25 using Anserini, and row group (3) presents results reranking first-stage candidates from BM25 + RM3.
Perhaps it is no surprise that larger models are more effective, but how exactly does monoT5 “work”? One salient difference is the encoder-only vs. encoder–decoder design, and Nogueira et al. [2020] argued that the decoder part of the model makes important contributions to relevance modeling. They investigated how the choice of target tokens, i.e., the prediction targets or the ground truth “output sequence”, impacts the effectiveness of the model. Instead of training the model to generate “true” and “false”, they reported a number of different conditions, e.g., swapping “true” and “false” so they mean the opposite, mapping relevance to arbitrary words such as “hot” vs. “cold” or “apple” vs. “orange”, and even mapping relevance to meaningless subwords. Perhaps not surprisingly, when the model is fine-tuned with sufficient training data, this choice doesn’t really matter (i.e., it has little impact on effectiveness). However, in a low-resource setting (fewer training examples), the authors noticed that the choice of target tokens matters quite a bit. That is, with fewer training examples, coaxing the model to associate arbitrary tokens with relevance labels becomes more difficult than with the “true”/“false” defaults, which suggests that the model is leveraging the decoder part of the network to assist in building a relevance matching model. As to exactly how, Nogueira et al. [2020] offered no further explanation.

These results support the finding that with transformer-based ranking models, the design of the input template (i.e., how the various components of the input are “packed” together and fed to the model) can have a large impact on effectiveness [Puri and Catanzaro, 2019, Schick and Schütze, 2021, Haviv et al., 2021, Le Scao and Rush, 2021]. Some of these explorations in the context of monoBERT were presented in Section 3.2.2. Those experiments showed that the [SEP] token plays an important role in separating the query from the candidate text, and using this special token is (slightly) more effective than using natural language tokens such as “Query:” and “Document:” as delimiters. However, this strategy cannot be directly applied to T5 because the model was not pretrained with a [SEP] token. In the original formulation by Raffel et al. [2020], all tasks were phrased in terms of natural language templates (without any special tokens), and unlike BERT, segment embeddings were not used in the pretraining of T5. Hence, monoT5 relies solely on the literal token “Document:” as a separator between the query and document segments. This raises the interesting question of whether there are “optimal” input and output templates for sequence-to-sequence models. And if so, how might we automatically find these templates to help the model learn more quickly, using fewer training examples? These remain open research questions awaiting further exploration.

Extensions to duoT5. The parallel between monoBERT and monoT5, both trained as relevance classifiers, immediately suggests the possibility of a pairwise approach built on T5. Indeed, T5 can also be used as a pairwise reranker, similar to duoBERT (see Section 3.4.1). This approach, dubbed duoT5, was proposed by Pradeep et al. [2021b].
The model takes as input a query q and two texts (documents), d_i and d_j, in the following input template:

Query: [q] Document0: [d_i] Document1: [d_j] Relevant:    (44)

where [q], [d_i], and [d_j] are replaced with the corresponding texts. The model is fine-tuned to produce the token “true” if d_i is more relevant than d_j to query q, and “false” otherwise, just like in monoT5.

At inference time, the model estimates p_{i,j} = P(d_i ≻ d_j | d_i, d_j, q), i.e., the probability of text d_i being more relevant than text d_j. Exactly as with monoT5, this probability is computed by applying a softmax to the logits of the “true” and “false” tokens. Similar to duoBERT, duoT5 is deployed as a second-stage reranker in a multi-stage reranking pipeline, in this case over the results of monoT5. All unique pairs (d_i, d_j) of candidate texts (at a particular cutoff) are fed into the model, and the resulting pairwise probabilities p_{i,j} are aggregated to form a single relevance score s_i for each text d_i; the candidate texts are then reranked by this score. We refer the reader to Pradeep et al. [2021b] for more details on how duoT5 is fine-tuned and how the aggregated scores are computed.
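As an illustration, the sketch below aggregates pairwise probabilities into per-document scores using a simple symmetric sum, one of several aggregation schemes studied by Pradeep et al. [2021b]; the helper `pairwise_prob`, which computes p_{i,j} from the “true”/“false” logits exactly as in monoT5, is assumed rather than shown.

```python
# Hedged sketch of duoT5-style pairwise aggregation over monoT5's top candidates.
from itertools import permutations

def duo_rerank(query, docs, pairwise_prob, cutoff=50):
    """Rerank the top `cutoff` candidates by aggregated pairwise probabilities."""
    pool = docs[:cutoff]
    n = len(pool)
    p = {(i, j): pairwise_prob(query, pool[i], pool[j])
         for i, j in permutations(range(n), 2)}
    # Symmetric-sum aggregation: s_i = sum_j [ p_ij + (1 - p_ji) ]
    scores = [sum(p[(i, j)] + (1.0 - p[(j, i)]) for j in range(n) if j != i)
              for i in range(n)]
    order = sorted(range(n), key=lambda i: -scores[i])
    return [pool[i] for i in order]
```

Note that scoring all pairs requires a number of inference passes that grows quadratically with the cutoff, which is why duoT5 is applied only to a small number of candidates that have already been reranked by monoT5.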
The mono/duoT5 pipeline was evaluated at the TREC 2020 Deep Learning Track. Combined with the doc2query document expansion technique (presented later in Section 4.3), the complete architecture obtained the top result on document ranking and the second best result on passage ranking [Craswell et al., 2021b]. The effectiveness of configurations without document expansion is shown in Table 28, copied from Pradeep et al. [2021b]. Here, we clearly see a large gain from mono/duoT5 reranking, but unfortunately, the official submissions did not include additional ablation conditions that untangle the contributions of monoT5 and duoT5.

                                        TREC 2020 DL Passage      TREC 2020 DL Doc
Method                                  MAP       nDCG@10         MAP       nDCG@10
BM25 + RM3                              0.3019    0.4821          0.4006    0.5248
BM25 + RM3 + monoT5-3B + duoT5-3B       0.5355    0.7583          0.5270    0.6794

Table 28: The effectiveness of mono/duoT5 on the TREC 2020 Deep Learning Track passage and document ranking test collections.

Takeaway Lessons. Full encoder–decoder transformers are quite a bit different from encoder-only architectures such as BERT, and are designed for very different tasks (i.e., sequence-to-sequence transformations, as opposed to classification and sequence labeling). It is not immediately obvious how such models can be adapted for ranking tasks, and the “trick” for coaxing relevance scores out of sequence-to-sequence models is the biggest contribution of monoT5. Experiments demonstrated that monoT5 is indeed more effective than monoBERT: while larger model sizes play a role, empirical evidence suggests that size alone isn’t the complete story. The generation (decoder) part of the transformer model clearly impacts ranking effectiveness, and while Nogueira et al. [2020] presented some intriguing findings, there remain many open questions.

3.5.4 Ranking with Sequence-to-Sequence Models: Query Likelihood

Language modeling approaches have a long history in information retrieval, dating back to the technique proposed by Ponte and Croft [1998] known as query likelihood. Query likelihood is simple and intuitive: it says that we should rank documents based on \hat{p} = P(Q|M_d), the probability that the query is “generated” by a model M_d of document d (hence, this is also called a generative approach to text ranking).

The original formulation was based on unigram language models (multiple Bernoulli, to be precise), and over the years, many researchers have explored richer language models as well as more sophisticated model estimation techniques. However, the query likelihood variant based on a multinomial distribution with Dirichlet smoothing has proven to be the most popular; see discussion and comparisons in Zhai [2008]. This ranking model, for example, is the default in the Indri search engine [Metzler et al., 2004], which was highly influential around that time. In the context of (feature-based) learning to rank, features based on language modeling became one of many signals that were considered in ranking.

With the advent of neural language models and pretrained transformers, we have witnessed the resurgence of generative approaches to retrieval, monoT5 in the previous section being a good example. An alternative was proposed by dos Santos et al. [2020], which can be characterized as the first attempt to implement query likelihood with pretrained transformers. The authors investigated both encoder–decoder designs (BART) [Lewis et al., 2020b] as well as decoder-only designs (GPT) [Radford et al., 2018] to model the process of generating a query q given a relevant text d as input. In the context of GPT, the approach uses the following template for each (query, relevant text) pair:

<bos> text <boq> query <eoq>    (45)

where everything before the special token <boq> (“beginning of query”) is considered the prompt, and the model is fine-tuned (via teacher forcing) to generate the query, ending with <eoq>. At inference time, the relevance score s_i of text d_i is the probability estimated by the model of generating q:

s_i = P(q | d_i) = \prod_{j=1}^{|q|} P(q_j | q_{<j}, d_i),    (46)

where q_j is the j-th query term and q_{<j} are the query terms before q_j. Once trained, the model is deployed in a standard reranking setting, where k candidate texts \{d_i\}_{i=1}^{k} are fed to the model to compute \{P(q|d_i)\}_{i=1}^{k}, and these scores are used to rank the candidate texts.

Since BART is a sequence-to-sequence model (like T5), each (query, relevant text) pair becomes a training instance directly. That is, the relevant text is the input sequence and the query is the target output sequence. To fine-tune their models, dos Santos et al. [2020] experimented with three different losses, but found that a hinge loss produced the best results on average:

L = \sum_{(q, d^+, d^-) \in D} \max\{0, -\log P(q|d^+) + \log P(q|d^-)\},    (47)

where d^+ and d^- are relevant and non-relevant texts for the query q, respectively, and D is the training dataset. The authors also compared their proposed generative model with a discriminative approach. Using BART, the vector representation generated by the decoder for the final query token is fed to a relevance classification layer that is trained using the same pairwise ranking loss used to train its generative counterpart.
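A minimal sketch of the generative scoring in Equation (46), using a sequence-to-sequence model from the Hugging Face transformers library, is shown below; the checkpoint name is a placeholder for a model fine-tuned on (relevant text, query) pairs, and special tokens added by the tokenizer are included in the sum for simplicity.

```python
# Hedged sketch of query likelihood scoring with a seq2seq model (Equation 46).
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")   # placeholder
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

def query_log_likelihood(query: str, document: str) -> float:
    """Return log P(q | d) = sum_j log P(q_j | q_<j, d)."""
    enc = tokenizer(document, return_tensors="pt", truncation=True)
    labels = tokenizer(query, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels runs the decoder with teacher forcing, so logits[0, j]
        # is the distribution over the j-th query token.
        logits = model(**enc, labels=labels).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# Reranking: score each of the k candidates and sort by log P(q | d_i).
```

Log probabilities are used instead of raw probabilities for numerical stability; since the logarithm is monotonic, the induced ranking is the same.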
Results for both methods on four publicly available answer selection datasets are presented in Table 29, directly copied from dos Santos et al. [2020]. The table shows results from the generative and discriminative methods fine-tuned on a BART-large model, rows (2) and (4), as well as the generative method using GPT2-large, row (3). The authors additionally compared their proposed methods against a discriminative BERT baseline, row (1), that uses the [CLS] vector as input to a binary classification layer, similar to monoBERT but using a different loss function.

                            YahooQA           WikiQA            WikipassageQA     InsuranceQA
Method                      MAP      MRR      MAP      MRR      MAP      MRR      MAP      MRR
(1) Discriminative (BERT)   0.965    0.965    0.844    0.856    0.775    0.838    0.410    0.492
(2) Discriminative (BART)   0.967    0.967    0.845    0.861    0.803    0.866    0.435    0.518
(3) Generative (GPT2)       0.954    0.954    0.819    0.834    0.755    0.831    0.408    0.489
(4) Generative (BART)       0.970    0.970    0.849    0.861    0.808    0.867    0.444    0.529

Table 29: The effectiveness of reranking using query likelihood based on BART and GPT2 on various test collections.

The generative BART model gives slightly better results than the discriminative one, row (4) vs. row (2), on almost all metrics across the four datasets the authors evaluated on. Comparing the two query likelihood implementations, we see that GPT2 is less effective than BART, row (3) vs. row (4), thus providing additional evidence that MLM pretraining results in better models than LM pretraining (see Section 3.1). Since the authors did not use any of the datasets monoT5 was evaluated with, it is difficult to directly compare the two approaches.

Takeaway Lessons. The query likelihood approach of dos Santos et al. [2020] complements Nogueira et al. [2020] in demonstrating the effectiveness of sequence-to-sequence transformers for ranking. Additionally, this work draws nice connections to the language modeling approach to IR that dates back to the late 1990s, providing a fresh “twist” on well-studied ideas. Unfortunately, we are not able to directly compare the effectiveness of these two methods since they have not been evaluated on common test collections. Nevertheless, ranking with generative models appears to be a promising future direction.

3.6 Concluding Thoughts

We have arrived at the research frontier of text ranking using transformers in the context of reranking approaches. At a very high level, we can summarize the current developments as follows: First came the basic relevance classification approach of monoBERT, followed by model enhancements to address the model’s input length limitations (Birch, MaxP, CEDR, PARADE, etc.) as well as exploration of BERT variants. In parallel with better modeling, researchers have investigated more sophisticated training techniques (e.g., pre–fine-tuning) to improve effectiveness.

Following these initial developments, the design space of transformer architectures for ranking opened up into a diversity of approaches, with researchers branching off in many different directions. The TK, TKL, and CK models represent a reductionist approach, rethinking the design of transformer architectures from the ground up. Nogueira et al. [2019b] opted for the “more pretraining, bigger models” approach, taking advantage of broader trends in NLP. GPT-3 [Brown et al., 2020] is perhaps the most extreme expression of this philosophy to date. The insight of exploiting generative approaches for ranking was shared by dos Santos et al. [2020] as well, and together they highlight the potential of sequence-to-sequence models for text ranking.

Where do we go from here? Direct ranking with learned dense representations is an emerging area that we cover in Section 5, but beyond that lies unexplored ground. There are a number of promising future paths, which we return to discuss in Section 6.
However, we first turn our attention to techniques for enriching query and document representations.

4 Refining Query and Document Representations

The vocabulary mismatch problem [Furnas et al., 1987]—where searchers and the authors of the texts to be searched use different words to describe the same concepts—was introduced in Section 1.2.2 as a core problem in information retrieval. Any ranking technique that depends on exact matches between queries and texts suffers from this problem, and researchers have been exploring approaches to overcome the limitations of exact matching for decades.

Text ranking models based on neural networks, by virtue of using continuous vector representations, offer a potential solution to the vocabulary mismatch problem because they are able to learn “soft” or semantic matches—this was already demonstrated by pre-BERT neural ranking models (see Section 1.2.4). However, in the architectures discussed so far—either the simple retrieve-and-rerank approach or multi-stage ranking—the initial candidate generation stage forms a critical bottleneck since it still depends on exact matching (for example, using BM25). A relevant text that has no overlap with query terms will not be retrieved, and hence will never be encountered by any of the downstream rerankers. In the best case, rerankers can only surface candidate texts that are deep in the ranked list (and as we’ve seen, transformers are quite good at that). They, of course, cannot conjure relevant results out of thin air if none exist in the pool of candidates to begin with!

In practice, it is not likely that a relevant text has no overlap with the query,111 but it is common for relevant documents to be missing a key term from the query (for example, the document might use a synonym). Thus, the vocabulary mismatch problem can be alleviated in a brute-force manner by simply increasing the depth of the candidate list that is generated in first-stage retrieval. Relevant texts will show up, just deeper in the ranked list. We see this clearly in Figure 9 (from Section 3.2.2), where monoBERT is applied to increasing candidate sizes from bag-of-words queries scored with BM25: effectiveness increases as more candidates are examined. Nevertheless, this is a rather poor solution. The most obvious issue is that reranking latency increases linearly with the size of the candidate list under consideration, since inference needs to be applied to every candidate—although this can be mitigated by multi-stage rerankers that prune the candidates successively, as discussed in Section 3.4.

The solution, naturally, is to refine (or augment) query and document representations to bring them into closer “alignment” with respect to the user’s information need. In this section, we present a number of such techniques based on pretrained transformers that operate on the textual representations of queries and documents. These can be characterized as query and document expansion techniques, which have a rich history in information retrieval, dating back many decades [Carpineto and Romano, 2012].112 We begin with a brief overview in Section 4.1, but our treatment is not intended to be comprehensive. Instead, we focus only on the preliminaries necessary to understand query and document expansion in the context of pretrained transformer models.
Our discussion of query and document expansion in this section proceeds as follows: Following high-level general remarks, in Section 4.2 we dive into query expansion techniques using pseudo-relevance feedback that take advantage of transformer-based models. We then present four document expansion techniques: doc2query [Nogueira et al., 2019b], DeepCT [Dai and Callan, 2019a], HDCT [Dai and Callan, 2020], and DeepImpact [Mallia et al., 2021]. All of these techniques focus on manipulating term-based (i.e., textual) representations of queries and texts from the corpus. In Section 4.7 we turn our attention to techniques that manipulate query and text representations that are not based directly on textual content.

The discussions in this section, particularly ones involving non-textual representations in Section 4.7, set up a nice segue to learned dense representations, the topic of Section 5. Here, query and document expansion can be viewed as attempts to tackle the vocabulary mismatch problem primarily in terms of textual representations—by augmenting either queries or documents with “useful” terms to aid in relevance matching. A potentially “smarter” approach is, of course, to use transformers to learn dense representations that attempt to directly overcome these challenges. This, however, we’ll get to later.

111 Although it is possible in principle for texts that contain zero query terms to be relevant to an information need, there is a closely related methodological issue of whether test collections contain such judgments. With the pooling methodology that underlies the construction of most modern test collections (see Section 2.6), only the results of participating teams are assessed. Thus, if participating systems used techniques that rely on exact term matching, it is unlikely that a relevant document with no query term overlap will ever be assessed to begin with. For this reason, high-quality test collections require diverse run submissions.

112 In this section, we intentionally switch from our preferred terminology of referring to “texts” (see Section 2.9) back to “documents”, as “document expansion” is well known and the alternative “text expansion” is not a commonly used term.

4.1 Query and Document Expansion: General Remarks

Query expansion and document expansion techniques provide two potential solutions to the vocabulary mismatch problem. The basic idea behind document expansion is to augment (i.e., expand) texts from the corpus with additional terms that are representative of their contents or with query terms for which those texts might be relevant. As an example, a text discussing automobile sales might be expanded with the term “car” to better match the query “car sales per year in the US”. In the simplest approach, these expansion terms can be appended to the end of the document, prior to indexing, and retrieval can proceed exactly as before, but on the augmented index. A similar effect can be accomplished with query expansion, e.g., augmenting the query “car sales per year in the US” with the term “automobile”. An augmented query increases the likelihood of matching a relevant text from the corpus that uses terms not present in the original query. Note that some of the techniques we present in this section are, strictly speaking, reweighting techniques, in that they do not add new terms, but rather adjust the weights of existing terms to better reflect their importance.
However, for expository convenience we will use “expansion” to encompass reweighting as well.

Both query and document expansion fit seamlessly into multi-stage ranking architectures. Query expansion is quite straightforward—conceptually, various techniques can be organized as modules that take an input query and output a (richer) expanded query. These are also known as “query rewriters”; see, for example, public discussions in the context of the Bing search engine.113 Strictly speaking, query rewriting is more general than query expansion, since, for example, a rewriter might remove terms deemed extraneous in the user’s query.114 As another example, a query rewriter might annotate named entities in a query and link them to entities in a knowledge graph for special handling by downstream modules. Nevertheless, both “expansion” and “rewriting” techniques share the aim of better aligning query and document representations, and in an operational context, this distinction isn’t particularly important. Query expansion modules can be placed at any stage in a multi-stage ranking architecture: one obvious place is to provide a richer query for first-stage retrieval, but in principle, query expansion (or rewriting) can be applied to any reranking stage.

Similarly, document expansion fits neatly into multi-stage ranking. An index built on the expanded corpus can serve as a drop-in replacement for first-stage retrieval to provide a richer set of candidate documents for downstream reranking. This might lead to an end-to-end system that achieves higher effectiveness, or alternatively, the same level of effectiveness might be achieved at lower latency costs (for example, using less computationally intensive rerankers). In other words, document expansion presents system developers with more options in the effectiveness/efficiency tradeoff space. Selecting the desired operating point, of course, depends on many organization-, domain-, and task-specific factors that are beyond the scope of this present discussion.

In some ways, query expansion and document expansion are like “yin” and “yang”. The advantages of document expansion precisely complement the shortcomings of query expansion, and vice versa. In more detail, there are two main advantages to document expansion:

• Documents are typically much longer than queries, and thus offer more context for a model to choose appropriate expansion terms. As we have seen from the work of Dai and Callan [2019b] (see Section 3.3.2), BERT benefits from richer contexts and, in general, transformers are able to better exploit semantic and other linguistic relations present in a fluent piece of natural language text (compared to a bag of keywords).

• Most document expansion techniques are embarrassingly parallel, i.e., they are applied independently to each document.
Thus, the associated computations can be distributed over arbitrarily large clusters to achieve a desired throughput for corpus processing.

In contrast, there are three main advantages of query expansion:

• Query expansion techniques lend themselves to much shorter experimental cycles and provide much more rapid feedback, since trying out a new technique does not usually require any changes to the underlying index. In contrast, exploring document expansion techniques takes much longer, since each new model (or even model variant) must be applied to the entire collection, the results of which must be reindexed before evaluations can be conducted. This means that even simple investigations such as parameter tuning can take a long time.

• Query expansion techniques are generally more flexible. For example, it is easy to switch on or off different features at query time (for example, selectively apply expansion only to certain intents or certain query types). Similarly, it is quite easy to combine evidence from multiple models without building and managing multiple (possibly large) indexes for document expansion techniques.

• Query expansion techniques can potentially examine multiple documents to aggregate evidence. At a high level, they can be categorized into “pre-retrieval” and “post-retrieval” approaches. Instances of the latter class of techniques perform expansion based on the results of an initial retrieval, and thus they can aggregate evidence from multiple documents potentially relevant to the query.115 Obviously, such techniques are more computationally expensive than pre-retrieval approaches that do not exploit potentially relevant documents from the corpus, but previous work has shown the potential advantages of post-retrieval approaches [Xu and Croft, 2000].

113 https://blogs.bing.com/search-quality-insights/May-2018/Towards-More-Intelligent-Search-Deep-Learning-for-Query-Semantics
114 For example, given the query “I would like to find information about black bear attacks”, removing the phrase “I would like to find information about” would likely improve keyword-based retrieval.
115 Note that while it is possible, for example, to perform cluster analysis on the corpus in a preprocessing step (possibly even informed by a query log), it is much more difficult to devise document expansion methods that aggregate evidence from multiple documents in a query-specific manner.

Query expansion and document expansion have histories that date back many decades, arguably to the 1960s. Neither is by any means new, but the use of neural networks, particularly transformers, has been game-changing.

Operational Considerations. For query expansion (or more generally, query rewriting), there’s not much to be said. If we view different techniques as “query-in, query-out” black boxes, operational deployment is straightforward. Of course, one needs to consider the latency of the technique itself (e.g., computationally intensive neural models might not be practically deployable since inference needs to be applied to every incoming query). Furthermore, expanded queries tend to be longer, and thus lead to higher latencies in first-stage retrieval (even if the query expansion technique itself is computationally lightweight). Nevertheless, these effects are usually modest in comparison to the computational demands of neural inference (e.g., in a reranking pipeline). Of course, all these factors need to be balanced, but such decisions are dependent on the organization, task, domain, and a host of other factors—and thus we are unable to offer more specific advice.

How might document expansion be implemented in practice?
In the text ranking scenarios we consider in this survey, the assumption is that the corpus is “mostly” static and provided to the system “ahead of time” (see Section 2.1). Thus, it is feasible to consider document expansion as just another step in a system’s document preprocessing pipeline, conceptually no different from structure (e.g., HTML) parsing, boilerplate and junk removal, etc. As mentioned above, document expansion in most cases is embarrassingly parallel—that is, model inference is applied to each document independently—which means that inference can be distributed over large clusters. This means that computationally expensive models with long inference latencies may still be practical given sufficient resources. Resource allocation, of course, depends on a cost/benefit analysis that is organization specific.

From the technical perspective, a common design for production systems is nightly updates to the corpus (e.g., addition, modification, or removal of texts), where the system would process only the portion of the corpus that has changed, e.g., apply document expansion to only the new and modified content. The underlying indexes would then need to be updated and redeployed to production. See Leibert et al. [2011] for an example of production infrastructure designed along these lines. At search time, it is worth noting that first-stage retrieval latencies might increase with document expansion due to the expanded texts being longer, but usually the differences are modest, especially compared to the demands of neural inference for rerankers.

4.2 Pseudo-Relevance Feedback with Contextualized Embeddings: CEQE

Pseudo-relevance feedback (sometimes called blind relevance feedback) is one of the oldest post-retrieval query expansion techniques in information retrieval, dating back to the 1970s [Croft and Harper, 1979]. This technique derives from the even older idea of relevance feedback, where the goal is to leverage user input to refine queries so that they better capture the user’s idea of relevant content. In a typical setup, the system performs an initial retrieval and presents the user with a (usually, short) list of texts, which the user assesses for relevance. The system then uses these judgments to refine the user’s query. One of the earliest and simplest approaches, the Rocchio algorithm [Rocchio, 1971], performs these manipulations in the vector space model: starting with the representation of the query, the system adds the aggregate representation of the relevant documents and subtracts the aggregate representation of the non-relevant documents. Thus, the expanded query becomes “more like” the relevant documents and “less like” the non-relevant documents. The obvious downside of relevance feedback, of course, is the need for a user in the loop to make the relevance judgments. For pseudo-relevance feedback, in contrast, the system takes the top documents from initial retrieval, simply assumes that they are relevant, and then proceeds to expand the query accordingly.
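A minimal sketch of the Rocchio update described above is shown below; the alpha/beta/gamma weighting is the standard parameterization of the algorithm and is not specified in the text, so the values here are purely illustrative. For pseudo-relevance feedback, `rel_docs` would simply be the top-k documents from initial retrieval and `nonrel_docs` would typically be empty.

```python
# Hedged sketch of the Rocchio update in the vector space model.
import numpy as np

def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(rel_docs) > 0:
        q = q + beta * np.mean(rel_docs, axis=0)      # aggregate of relevant docs
    if len(nonrel_docs) > 0:
        q = q - gamma * np.mean(nonrel_docs, axis=0)  # aggregate of non-relevant docs
    return q
```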
Empirically, pseudo-relevance feedback has been shown to be a robust method for increasing retrieval effectiveness on average.116 The intuition is that if the initial retrieved results are “reasonable” in terms of quality, an analysis of their contents will allow a system to refine its query representation, for example, by identifying terms not present in the original query that are discriminative of relevant texts. In other words, the expanded query more accurately captures the user’s information need based on a “peek” at the corpus.

Thus, “traditional” (pre-BERT, even pre-neural) query expansion with pseudo-relevance feedback is performed by issuing an initial query to gather potentially relevant documents, identifying new keywords (or phrases) from these documents, adding them to form an expanded query (typically, with associated weights), and then reissuing this expanded query to obtain a new ranked list. Examples of popular methods include RM3 (Relevance Model 3) [Lavrenko and Croft, 2001, Abdul-Jaleel et al., 2004, Yang et al., 2019b], axiomatic semantic matching [Fang and Zhai, 2006, Yang and Lin, 2019], and Bo1 [Amati and van Rijsbergen, 2002, Amati, 2003, Plachouras et al., 2004]. As this is not intended to be a general tutorial on pseudo-relevance feedback, we recommend that interested readers use the above cited references as entry points into this vast literature.

Nevertheless, there are two significant issues when trying to implement the standard pseudo-relevance feedback “recipe” (presented above) using transformers:

• One obvious approach would be to apply BERT (and in general, transformers) to produce a better representation of the information need for downstream rerankers—that is, to feed into the input template of a cross-encoder. As we’ve seen in Section 3, BERT often benefits from using well-formed natural language queries rather than bags of words or short phrases. This makes traditional query expansion methods like RM3 a poor fit, because they add new terms to a query without reformulating the output into natural language. The empirical impact is demonstrated by the experiments of Padaki et al. [2020] (already mentioned in Section 3.3.2), who found that expansion with RM3 actually reduced the effectiveness of BERT–MaxP. In contrast, replacing the original keyword query with a natural language query reformulation improved effectiveness.

• Another obvious approach would be to apply transformers to produce a better query for the “second round” of keyword-based retrieval. Traditional query expansion methods such as RM3 produce weights to be used with the new expansion terms, and due to these weights, expansion terms tend to have less influence on the ranking than the original query terms. This reduces the impact of bad expansion terms from imperfect algorithms. With existing BERT-based methods, query terms are not associated with explicit weights, so it is not possible to “hedge our bets”.

To the first point, Padaki et al. [2020] and Wang et al. [2020b] devised query reformulation (as opposed to query expansion) methods, which ensure that queries are always in natural language.
The gains from both approaches are small, however, and neither modifies the keyword query for the “second round” retrieval, so they are not the focus of this section.

116 The “on average” qualification here is important, as pseudo-relevance feedback can be highly effective for some queries, yet spectacularly fail on others.

Given the difficulty of providing a BERT-based reranker (i.e., cross-encoder) with an expanded query, Naseri et al. [2021] instead explored how BERT could be used to improve the selection of expansion terms for the “second round” keyword retrieval. That is, rather than improving the downstream cross-encoder input for reranking, the authors used BERT’s contextual embeddings to improve the query expansion terms selected by a first-stage retriever with pseudo-relevance feedback. Their CEQE model (“Contextualized Embeddings for Query Expansion”) is intended to be a replacement for a purely keyword-based first-stage retriever, to generate better candidate texts to feed a downstream transformer-based reranker. This approach avoids the above issues with term expansion because the downstream reranker continues to use the original query in its input template.

CEQE can be viewed as an extension of the RM3 pseudo-relevance feedback technique that uses contextual embeddings to compute the probability of a candidate expansion term w given both a query Q and a document D, drawn from initial retrieval:

\sum_D p(w, Q, D) = \sum_D p(w|Q, D)\, p(Q|D)\, p(D)    (48)

As with RM3, p(Q|D) is calculated using a query likelihood model and p(D) is assumed to be a uniform distribution. The remaining quantity, p(w|Q, D), is calculated using contextual embeddings produced by monoBERT. To produce these contextual embeddings, documents are first split into passages of 128 tokens: the query and each passage are then fed to monoBERT, and the contextual embeddings produced by the eleventh transformer layer in BERT-base are retained.117 From these embeddings, p(w|Q, D) is calculated using either a centroid representation of the query or a term-based representation with pooling. Let M_w^D be the set of all mentions (occurrences) of term w in document D and M_*^D be the set of all mentions of any term in the document. Both approaches calculate a score for each unique term w by taking the cosine similarity between the query representation and each mention of the term in the document, m_w^D, and normalizing by the sum over all term mentions in the document:

p(w|q, D) = \frac{\sum_{m_w^D \in M_w^D} \cos(q, m_w^D)}{\sum_{m_t^D \in M_*^D} \cos(q, m_t^D)}    (49)

Using the centroid approach, the contextual embeddings for each query term are averaged to form a query centroid that serves as the q in the above equation. With the term-based approach, p(w|q, D) is calculated for each query term separately, and max pooling or multiplicative pooling is applied to aggregate the per-term scores.

Naseri et al. [2021] evaluated the three variants of their approach (referred to as CEQE-Centroid, CEQE-MulPool, and CEQE-MaxPool) using BERT-base on the Robust04 collection and the TREC 2019 Deep Learning Track document ranking task. They performed retrieval using the Galago118 query likelihood implementation with stopword removal and Krovetz stemming, which leads to first-stage results that differ slightly from those obtained with Anserini (for example, in Section 3.2.2). The authors considered up to the top-ranked 100 documents when calculating p(w|Q, D); this has similar computational costs as reranking the top 100 documents using monoBERT. Thus, CEQE can be characterized as a computationally expensive extension of RM3 that trades off efficiency for an improved estimate of p(w, Q, D).

117 In the original BERT paper [Devlin et al., 2019], embeddings from this layer were more effective for named entity recognition than embeddings from the twelfth (last) layer (see Table 7).
118 http://www.lemurproject.org/galago.php
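The sketch below illustrates the term-based variant with max pooling (CEQE-MaxPool) for a single feedback document, given precomputed contextual embeddings; how those embeddings are produced (passage splitting, layer selection) follows the description above, and the array shapes and helper names here are assumptions for illustration only.

```python
# Hedged sketch of CEQE-MaxPool term scoring (Equation 49) for one document.
import numpy as np
from collections import defaultdict

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def ceqe_maxpool(query_term_vecs, doc_tokens, doc_token_vecs):
    """query_term_vecs: one contextual embedding per query term, shape (|Q|, dim).
    doc_tokens / doc_token_vecs: every term mention in D and its embedding.
    Returns p(w | q, D) per unique term w, max-pooled over query terms."""
    per_query = []
    for q_vec in query_term_vecs:
        sims = np.array([cos(q_vec, m) for m in doc_token_vecs])
        denom = sims.sum()                      # sum over all mentions in D
        term_score = defaultdict(float)
        for tok, s in zip(doc_tokens, sims):
            term_score[tok] += s / denom        # sum over mentions of w
        per_query.append(term_score)
    terms = set().union(*per_query)
    return {w: max(ts.get(w, 0.0) for ts in per_query) for w in terms}
```

The centroid variant would instead average `query_term_vecs` into a single vector before applying the same computation, and the resulting per-document term scores are then combined across the top-ranked feedback documents following Equation (48).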
Results from the authors’ original paper are shown in Table 30. In addition to BM25 with RM3 expansion, CEQE was compared against Static-Embed [Kuzi et al., 2016], which is an RM3 extension that uses GloVe embeddings [Pennington et al., 2014] rather than contextual embeddings. Naseri et al. [2021] also considered a “Static-Embed-PRF” variant of Static-Embed that is restricted to expansion terms found in the feedback documents; here, we report the better variant on each dataset, which is Static-Embed-PRF on Robust04 and Static-Embed on the TREC 2019 Deep Learning Track document ranking task.

                                  Robust04                            TREC 2019 DL Doc
Method                            MAP       Recall@100   Recall@1k    MAP       Recall@100   Recall@1k
(1) BM25 + RM3                    0.3069    0.4610‡      0.7588‡      0.3975‡   0.4434‡      0.7750‡
(2) Static-Embed                  0.2703    0.4324       0.7231       0.3373    0.3973       0.7179
(3a) CEQE-Centroid                0.3019‡   0.4593‡      0.7653†‡     0.4144‡   0.4464‡      0.7804‡
(3b) CEQE-MulPool                 0.2845‡   0.4517‡      0.7435‡      0.3724‡   0.4295‡      0.7560‡
(3c) CEQE-MaxPool                 0.3086‡   0.4651‡      0.7689†‡     0.4161†‡  0.4506‡      0.7832‡
(4) CEQE-MaxPool (fine-tuned)     0.3071‡   0.4647‡      0.7626‡      -         -            -

Table 30: The effectiveness of CEQE on the Robust04 test collection (using title queries) and the TREC 2019 Deep Learning Track document ranking test collection. Statistically significant increases in effectiveness over BM25 + RM3 and over Static-Embed are indicated with the symbols † and ‡, respectively (p < 0.05, two-tailed paired t-test).

The CEQE variants, row group (3), significantly outperform Static-Embed, row (2), across datasets and metrics, suggesting that the advantages of using contextual embeddings when reranking also improve effectiveness when choosing expansion terms. However, results appear to be more mixed when compared against BM25 + RM3, row (1), with CEQE showing small but consistent improvements across datasets and metrics. These gains are only statistically significant for Recall@1k on Robust04 and for MAP on TREC 2019 DL Doc. Among the CEQE variants, max pooling is consistently the most effective. In row group (3), the BERT-base model was not fine-tuned; row (4) shows the effectiveness of CEQE-MaxPool when using a monoBERT ranking model that was first fine-tuned on Robust04. Somewhat surprisingly, this approach performs slightly worse than CEQE-MaxPool without fine-tuning, row (3c), suggesting that improvements in reranking do not necessarily translate to improvements in selecting expansion terms.

In addition to considering the question of whether contextual embeddings can be used to improve RM3, Naseri et al. [2021] performed experiments measuring the impact of combining CEQE’s first-stage retrieval with CEDR (see Section 3.3.3). Here, CEDR can be applied in two ways: first, integrated into query expansion, and second, as a downstream reranker. In the first approach, BM25 results are first reranked with CEDR before either CEQE or RM3 is applied to extract the query expansion terms; the new expanded query is then used for the “second round” keyword retrieval. Experiments confirm that CEDR improves both CEQE and RM3 when used in this manner.
In the second approach, the CEDR-augmented query expansion (CEQE or RM3) results can then be reranked by CEDR again. That is, when reranking BM25 results with CEDR, performing query expansion based on these results to obtain a new keyword-based ranking, and then reranking the top 1000 documents again with CEDR, CEQE-MaxPool reaches 0.5621 nDCG@20 on Robust04 whereas RM3 reaches only 0.5565 nDCG@20. In both approaches (i.e., with or without a second round of CEDR reranking), CEQE consistently outperforms RM3, but the improvements are small and significant only for recall. However, these increases in effectiveness come at the cost of requiring multiple rounds of reranking with a computationally expensive model, and thus it is unclear if such a tradeoff is worthwhile in a real-world setting. We refer the reader to the original work for additional details on these experiments.

Takeaway Lessons. At a high level, there are two ways to integrate transformer-based models into pseudo-relevance feedback techniques:

• We can use existing query expansion methods to produce an augmented query that is fed to a transformer-based reranker. As demonstrated by Padaki et al. [2020], this approach is not effective since models like monoBERT work better when given natural language input, and most existing query expansion methods do not produce fluent queries.

• Transformer-based models can aid in the selection of better query expansion terms, as demonstrated by CEQE [Naseri et al., 2021]. While CEQE’s use of contextual embeddings substantially improves over expansion with static embeddings, improvements over RM3 are smaller and come at a high computational cost, since CEQE requires BERT inference over the top-k candidates.

While we believe it is clear that contextual embeddings are superior to static embeddings for pseudo-relevance feedback, it remains unclear whether the straightforward applications of transformers discussed in this section are compelling when considering effectiveness/efficiency tradeoffs. Since pseudo-relevance feedback is a post-retrieval query expansion technique, it necessitates a round of retrieval and analysis of the retrieved texts. Thus, in order to be practical, these analyses need to be lightweight yet effective. However, it does not appear that researchers have devised a method that meets these requirements yet.

4.3 Document Expansion via Query Prediction: doc2query

Switching gears, let’s discuss document expansion techniques, which contrast with the query expansion techniques presented in the previous section. While document expansion dates back many decades, the first successful application of neural networks to this task, to our knowledge, was introduced by Nogueira et al. [2019b], who called their technique doc2query. The basic idea is to train a sequence-to-sequence model that, given a text from a corpus, produces queries for which that document might be relevant. This can be thought of as “predictively” annotating a piece of text with relevant queries. Given a dataset of (query, relevant text) pairs, which are just standard relevance judgments, a sequence-to-sequence model can be trained to generate a query given a text from the corpus as input. While in principle one can use these predicted queries in a variety of ways, doc2query takes perhaps the most straightforward approach: the predictions are appended to the original texts from the corpus without any special markup to distinguish the original text from the expanded text, forming the “expanded document”.
This expansion procedure is performed on every text from the corpus, and the results are indexed as usual. The resulting index can then serve as a drop-in replacement for first-stage retrieval in a multi-stage ranking pipeline, compatible with any of the reranking models described in this survey.

It should be no surprise that the MS MARCO passage ranking test collection can be used as training data; thus, doc2query was designed to make query predictions on passage-length texts. In terms of modeling choices, it should also be no surprise that Nogueira et al. [2019b] exploited transformers for this task. Specifically, they examined two different models:

• doc2query–base: the original proposal of Nogueira et al. [2019b] used a “vanilla” transformer model trained from scratch (i.e., not pretrained).

• doc2query–T5: in a follow-up, Nogueira and Lin [2019] replaced the “vanilla” non-pretrained transformer with T5 [Raffel et al., 2020], a pretrained transformer model.

To train both models (more accurately, to fine-tune, in the case of T5), the following loss is used:

L = -\sum_{i=1}^{M} \log P(q_i | q_{<i}, d),    (50)

where a query q consists of tokens q_0, . . . , q_M, and P(y_i|x) is the probability assigned by the model at the i-th decoding step to token y_i given the input x. Note that at training time the correct tokens q_{<i} are always provided as input at the i-th decoding step. That is, even though the model might have predicted another token at the (i − 1)-th step, the correct token q_{i−1} will be fed as input to the current decoding step. This training scheme is called teacher forcing or maximum likelihood learning and is commonly used in text generation tasks such as machine translation and summarization.

At inference time, given a piece of text as input, multiple queries can be sampled from the model using top-k random sampling [Fan et al., 2018a]. In this sampling-based decoding method, at each decoding step a token is sampled from the k tokens with the highest probability under the model. Decoding stops when a special “end-of-sequence” token is sampled. In contrast to other decoding methods such as greedy or beam search, top-k sampling tends to generate more diverse texts, with diversity increasing with greater values of k [Holtzman et al., 2019]. Note that the k parameter is independent of the number of sampled queries; for example, we can set k = 10 and sample 40 queries from the model. In other words, each inference pass with the model generates one predicted query, and typically, each text from the corpus is expanded with many predicted queries.
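A minimal sketch of this query prediction step using the Hugging Face transformers `generate` API is shown below; the checkpoint name is a placeholder for a T5 model fine-tuned on MS MARCO (relevant text, query) pairs, not necessarily the checkpoint released by the authors.

```python
# Hedged sketch of doc2query-T5 style query prediction with top-k sampling.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")          # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def predict_queries(text: str, num_queries: int = 40, k: int = 10):
    """Sample `num_queries` predicted queries for one passage."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,                 # top-k random sampling, not beam search
        top_k=k,
        num_return_sequences=num_queries,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def expand(text: str) -> str:
    """Append predicted queries to the original text prior to indexing."""
    return text + " " + " ".join(predict_queries(text))
```

The expanded texts are then indexed exactly as before; no changes to the retrieval engine itself are required.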
Results on the MS MARCO passage ranking test collection are shown in Table 31, with figures copied from Nogueira et al. [2019b] for doc2query–base and from Nogueira and Lin [2019] for doc2query–T5 (which used the T5-base model). In the case of doc2query–base, each text was expanded with 10 queries, and in the case of doc2query–T5, each text was expanded with 40 queries. The expanded texts were then indexed with Anserini, and retrieval was performed either with BM25, in row group (1), or BM25 + RM3, in row group (2). For additional details such as hyperparameter settings and the effects of expanding the texts with different numbers of predicted queries, we refer the reader to the original papers. In addition to the usual metrics for the MS MARCO passage ranking test collection, the results table also presents query latencies for some of the conditions where comparable figures are available.

                                                      MS MARCO Passage
                                                      Development             Test       Latency
Method                                                MRR@10     Recall@1k    MRR@10     (ms/query)
(1a) BM25                                             0.184      0.853        0.186      55
(1b)   w/ doc2query–base [Nogueira et al., 2019b]     0.218      0.891        0.215      61
(1c)   w/ doc2query–T5 [Nogueira and Lin, 2019]       0.277      0.947        0.272      64
(2a) BM25 + RM3                                       0.156      0.861        -          -
(2b)   w/ doc2query–base                              0.194      0.892        -          -
(2c)   w/ doc2query–T5                                0.214      0.946        -          -
(3) Best non-BERT [Hofstätter et al., 2019]           0.290      -            0.277      -
(4) BM25 + monoBERT-large [Nogueira et al., 2019a]    0.372      0.853        0.365      3,500

Table 31: The effectiveness of doc2query on the MS MARCO passage ranking test collection.

The effectiveness differences between doc2query with the “vanilla” (non-pretrained) transformer and the (pretrained) T5 model under BM25 retrieval are clearly seen in row (1c) vs. row (1b). Note that both models are trained using the same dataset. It should come as no surprise that T5 is able to make better query predictions. While the T5 condition used a larger model with more parameters than the base transformer, over-parameterization of the base transformer can lead to poor predictions, and it appears clear that pretraining makes the crucial difference, not model size per se. With BM25 + RM3, row (2c) vs. row (2b), the gap between doc2query–T5 and doc2query–base is reduced, but these experiments exhibit the same issues as the monoBERT experiments (see Section 3.2), where sparse judgments are not able to properly evaluate the benefits of query expansion (more below).

Table 31 shows two additional points of reference: monoBERT, shown in row (4), as well as the best contemporaneous non-BERT model, shown in row (3). The effectiveness of doc2query is substantially below monoBERT reranking, but it is about 50× faster, since the technique is still based on keyword search with inverted indexes and does not require inference with neural networks at query time. The modest increase in query latency is due to the fact that the expanded texts are longer. The comparison to row (3) shows that doc2query is able to approach the effectiveness of non-BERT neural models (at the time the work was published) solely with document expansion. Results also show that doc2query improves Recall@1k, which means that more relevant texts are available to downstream rerankers when used in a multi-stage ranking architecture, thus potentially improving end-to-end effectiveness.

Evaluation results of doc2query on the TREC 2019 Deep Learning Track passage ranking test collection are shown in Table 32; these results have not been reported elsewhere. The primary goal of this experiment is to quantify the effectiveness of doc2query using non-sparse judgments, similar to the experiments reported in Section 3.2.

                                        TREC 2019 DL Passage
Method                                  nDCG@10    MAP        Recall@1k
(1a) BM25                               0.506      0.301      0.750
(1b)   w/ doc2query–base                0.514      0.324      0.749
(1c)   w/ doc2query–T5                  0.642      0.403      0.831
(2a) BM25 + RM3                         0.518      0.339      0.800
(2b)   w/ doc2query–base                0.564      0.368      0.801
(2c)   w/ doc2query–T5                  0.655      0.449      0.886
(3) BM25 + RM3 + monoBERT-large         0.742      0.505      -
(4) TREC Best [Yan et al., 2019]        0.765      0.503      -

Table 32: The effectiveness of doc2query on the TREC 2019 Deep Learning Track passage ranking test collection.
As we discussed previously, sparse judgments from the MS MARCO passage ranking test collection are not sufficient to capture improvements attributable to RM3, whereas with the TREC 2019 Deep Learning Track passage ranking test collection, it becomes evident that pseudo-relevance feedback with RM3 is more effective than simple bag-of-words queries with BM25, row (2a) vs. (1a); this is repeated from Table 6 in Section 3.2. Similarly, results show that on an index that has been augmented with doc2query predictions (based on either the “vanilla” transformer or T5), BM25 + RM3 is more effective than just BM25 alone; compare row (2b) vs. row (1b) and row (2c) vs. row (1c). In other words, the improvements from document expansion and query expansion with pseudo-relevance feedback are additive. Overall, doc2query–T5 with BM25 + RM3 achieves the highest effectiveness.

Table 32 shows two additional comparison conditions: row (3), which applies monoBERT to rerank BM25 + RM3 results, and row (4), the top-scoring submission to the TREC 2019 Deep Learning Track passage ranking task [Yan et al., 2019]. While the effectiveness of doc2query falls well short of monoBERT reranking, row (2c) vs. row (3), this is entirely expected, and the much faster query latency of doc2query has already been pointed out. The two techniques target different parts of the multi-stage pipeline, so we see them as complementary. We further note that the work of Yan et al. [2019] adopted a variant of doc2query (and further exploits ensembles), which provides independent evidence supporting the effectiveness of document expansion via query prediction.

Input: July is the hottest month in Washington DC with an average temperature of 27°C (80°F) and the coldest is January at 4°C (38°F) with the most daily sunshine hours at 9 in July. The wettest month is May with an average of 100mm of rain.
Target query: what is the temperature in washington
doc2query–base: weather in washington dc
doc2query–T5: what is the weather in washington dc

Input: The Delaware River flows through Philadelphia into the Delaware Bay. It flows through and (sic) aqueduct in the Roundout Reservoir and then flows through Philadelphia and New Jersey before emptying into the Delaware Bay.
Target query: where does the delaware river start and end
doc2query–base: what river flows through delaware
doc2query–T5: where does the delaware river go

Input: sex chromosome - (genetics) a chromosome that determines the sex of an individual; mammals normally have two sex chromosomes chromosome - a threadlike strand of DNA in the cell nucleus that carries the genes in a linear order; humans have 22 chromosome pairs plus two sex chromosomes.
Target query: which chromosome controls sex characteristics
doc2query–base: definition sex chromosomes
doc2query–T5: what determines sex of someone

Figure 20: Examples of predicted queries on passages from the MS MARCO passage corpus compared to user queries from the relevance judgments.

Where exactly are the gains of doc2query coming from? Figure 20 presents three examples from the MS MARCO passage corpus, showing query predictions by both the vanilla transformer as well as T5. The predicted queries seem quite reasonable based on manual inspection. Interestingly, both models tend to copy some words from the input text (e.g., “washington dc” and “river”), meaning that the models are effectively performing term reweighting (i.e., increasing the importance of key terms).
Nevertheless, the models also produce words not present in the input text (e.g., “weather”), which can be characterized as expansion by adding synonyms and semantically related terms. To quantify these effects more accurately, it is possible to measure the proportion of terms predicted by doc2query–T5 that already exist in the original text (i.e., are copied) vs. terms that do not exist in the original text (i.e., are new). Here, we describe such an analysis, which has not been previously published. Excluding stopwords, which correspond to 51% of the predicted query terms, we find that 31% of the remaining terms are new while the rest (69%) are copied. The sequence-to-sequence model learned to generate these new terms based on the training data, to “connect the dots” between queries and relevant passages that might not contain query terms. In other words, doc2query is learning exactly how to bridge the vocabulary mismatch.

Table 33 presents the results of an ablation analysis: starting with the original text, we add only the new terms, row (2a); only the copied terms, row (2b); and both, row (2c). Each variant of the expanded corpus was then indexed as before, and results of bag-of-words keyword search with BM25 are reported. The final condition is the same as row (1c) in Table 31, repeated for convenience.

                                                    MS MARCO Passage (Dev)
Method                                              MRR@10     Recall@1k
(1) Original text                                   0.184      0.853
(2a) + Expansion w/ new terms                       0.195      0.907
(2b) + Expansion w/ copied terms                    0.221      0.893
(2c) + Expansion w/ copied terms + new terms        0.277      0.944
(3) Expansion terms only (without original text)    0.263      0.927

Table 33: The effectiveness of ablated variants of doc2query–T5 on the development set of the MS MARCO passage ranking test collection.

We see that expansion with only new terms yields a small improvement over just the original texts. Expanding with copied terms alone provides a bigger gain, indicating that the effects of term reweighting appear to be more impactful than attempts to enrich the vocabulary. However, combining both types of terms yields a big jump in effectiveness, showing that the two sources of signal are complementary. Interestingly, the gain from both types of terms together is greater than the sum of the gains from each individual contribution in isolation. This can be characterized with the popular adage, “the whole is greater than the sum of its parts”, and suggests complex interactions between the two types of terms that we do not fully understand yet. In most IR experiments, gains from the combination of two innovations are usually smaller than the sum of the gains from each applied independently; see Armstrong et al. [2009] for a discussion of this observation.

Finally, row (3) in Table 33 answers an interesting question: What if we discarded the original texts and indexed only the expansion terms (i.e., the predicted queries)? We see that effectiveness is surprisingly high, only slightly worse than the full expansion condition. In other words, it seems like the original texts can, to a large extent, be replaced by the predicted queries from the perspective of bag-of-words search.

Takeaway Lessons. To sum up, document expansion with doc2query augments texts with potential queries, thereby mitigating vocabulary mismatch and reweighting existing terms based on predicted importance. The expanded collection can be indexed and used exactly as before—either by itself or as part of a multi-stage ranking architecture.
Takeaway Lessons. To sum up, document expansion with doc2query augments texts with potential queries, thereby mitigating vocabulary mismatch and reweighting existing terms based on predicted importance. The expanded collection can be indexed and used exactly as before—either by itself or as part of a multi-stage ranking architecture. Perhaps due to its simplicity and effectiveness, doc2query has been successfully replicated for text ranking independently on the MS MARCO test collections [Yan et al., 2019], and according to Yan et al. [2021], has been deployed in production at Alibaba. Furthermore, the technique has been adapted and successfully applied to other tasks, including scientific document retrieval [Boudin et al., 2020], creating artificial in-domain retrieval data [Ma et al., 2021a], and helping users find answers in product reviews [Yu et al., 2020b].

Document expansion with doc2query shifts computationally expensive inference with neural networks from query time to indexing time. As a drop-in replacement for the original corpus, keyword search latency increases only modestly due to the increased length of the texts. The tradeoff is much more computationally intensive data preparation prior to indexing: for each text in a corpus, multiple inference passes are needed to generate the expanded queries. If the corpus is large (e.g., billions of documents), this method can be prohibitively expensive.119

119 Unless you're Google. Or even if you're Google?

For researchers working on the MS MARCO corpora, however, this is usually not an issue because Nogueira and Lin [2019] have made their query predictions on standard corpora publicly available for download, making doc2query pretty close to a "free boost" that can be integrated with other techniques (for example, DeepImpact, discussed in Section 4.6). However, the MS MARCO corpora are relatively small compared to other commonly used academic test collections such as the ClueWeb web crawls. Applying doc2query to these larger collections would require significantly more compute resources, and thus presents barriers to academic research.

Finally, the astute reader might have noticed that this section only presents doc2query results on passages and not on longer spans of text. This leads to the obvious question: How do we apply doc2query to longer texts? We defer this discussion to Section 4.5 in the context of HDCT.
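To make the indexing-time workflow concrete, here is a minimal sketch of the expansion step using the Hugging Face transformers library. It is not the reference implementation; the checkpoint name below is our assumption about the publicly released model, so consult the doc2query documentation for the exact identifier, sampling settings, and number of predicted queries.

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    MODEL = "castorini/doc2query-t5-base-msmarco"   # assumed checkpoint name
    tokenizer = T5Tokenizer.from_pretrained(MODEL)
    model = T5ForConditionalGeneration.from_pretrained(MODEL)

    def expand(passage: str, num_queries: int = 10) -> str:
        """Append sampled query predictions to a passage before BM25 indexing."""
        inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
        outputs = model.generate(
            inputs.input_ids,
            max_length=64,
            do_sample=True,              # top-k sampling yields diverse predictions
            top_k=10,
            num_return_sequences=num_queries,
        )
        predicted = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
        return passage + " " + " ".join(predicted)

The expanded texts are then indexed exactly like the original corpus; no changes to the retrieval engine are required.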
[Figure 21 (passages omitted): Motivating examples for DeepCT, which show passages containing query terms that appear in both relevant and non-relevant contexts, taken from Dai and Callan [2019a]. The figure pairs three queries ("who is susan boyle", "what values do zoos serve", and "do atoms make up dna") with a relevant and a non-relevant passage each, shading every term by its weight (legend from 0.0 to > 0.5).]

4.4 Term Reweighting as Regression: DeepCT

Results from doc2query show that document expansion has two distinct but complementary effects: the addition of novel expansion terms that are not present in the original text and copies of terms that are already present in the text. The duplicates have the effect of reweighting terms in the original text, but using a sequence-to-sequence model to generate terms seems like an inefficient and roundabout way of achieving this effect. What if we were able to directly estimate the importance of a term in the context that the term appears in? This is the premise of the Deep Contextualized Term Weighting (DeepCT) framework [Dai and Callan, 2019a].

Consider a BM25 score (see Section 1.2.2), which at a high level comprises a term frequency and a document frequency component. Setting aside length normalization, the term frequency (i.e., the number of times the term appears in a particular text) is the primary feature that attempts to capture the term's importance in the text, since the document frequency component of BM25 is the same for that term across different texts (with the same length). Quite obviously, terms can have the same term frequency but differ in their "importance". A few motivating examples taken from Dai and Callan [2019a] are presented in Figure 21. In the first example, the non-relevant passage actually has more occurrences of the query terms "susan" and "boyle", yet it is clear that the first passage provides a better answer. The second and third examples similarly reinforce the observation that term frequencies alone are often insufficient to separate relevant from non-relevant passages. Specifically, in the third example, "atoms" appears twice in both passages, but it seems clear that the first passage is relevant while the second is not.

To operationalize these intuitions, the first and most obvious question that must be addressed is: How should term importance weights or scores (we use these two terms interchangeably) be defined? Dai and Callan [2019a] proposed a simple measure called query term recall, or QTR:

    QTR(t, d) = |Q_{d,t}| / |Q_d|,    (51)

where Q_d is the set of queries for which document d is relevant, and Q_{d,t} is the subset of Q_d that contains term t. The importance score y_{t,d} for each term t in d can then be defined as follows:

    y_{t,d} ≜ QTR(t, d).    (52)

The score y_{t,d} is in the range [0, 1]. At the extremes, y_{t,d} = 1 if t occurs in all queries for which d is relevant, and y_{t,d} = 0 if t does not occur in any query relevant to d. Going back to the examples in Figure 21, "susan" and "boyle" would receive lower importance weights in the second passage because those terms come up far less often in the queries for which that passage is relevant than in the queries for which the first passage is relevant.
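The definition in Eq. (51) is simple enough to compute directly from training data. Below is a minimal sketch, assuming relevant_queries holds the training queries for which a given document is judged relevant (a hypothetical input format, not the authors' code).

    def query_term_recall(doc_terms, relevant_queries):
        """Return y_{t,d} = |Q_{d,t}| / |Q_d| for each term t in the document."""
        weights = {}
        num_queries = len(relevant_queries)
        if num_queries == 0:
            return {t: 0.0 for t in set(doc_terms)}
        tokenized = [set(q.lower().split()) for q in relevant_queries]
        for t in set(doc_terms):
            containing = sum(1 for q_terms in tokenized if t in q_terms)
            weights[t] = containing / num_queries   # fraction of relevant queries containing t
        return weights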
With appropriate scaling, these weights can be converted into drop-in replacements for term frequencies, replacing the term frequency values that are stored in a standard inverted index. In turn, a DeepCT index can be used in the same way as any other standard bag-of-words inverted index, for example, to generate candidate texts in a multi-stage ranking architecture.

Having thus defined term importance weights using query term recall, it then becomes relatively straightforward to formulate the prediction of these weights as a regression problem. Not surprisingly, BERT can be exploited for this task. More formally, DeepCT uses a BERT-based model that receives as input a text d and outputs a predicted importance score ŷ_{t,d} for each term t in d. The goal is to assign high scores to terms that are central to the text, and low scores to less important terms. These scores are computed by a regression layer as:

    ŷ_{t,d} = w · T_{t,d} + b,    (53)

where w is a weight vector, b is a bias term, and T_{t,d} is the contextual embedding of term t in the text. Like doc2query, DeepCT is trained using (query, relevant text) pairs from the MS MARCO passage ranking test collection. The BERT model and the regression layer are trained end-to-end to minimize the following mean squared error (MSE) loss:

    L = Σ_t (ŷ_{t,d} − y_{t,d})²,    (54)

where ŷ_{t,d} and y_{t,d} have already been defined. Note that the BERT tokenizer often splits terms from the text into subwords (e.g., "adversarial" is tokenized into "ad", "##vers", "##aria", "##l"). DeepCT uses the weight for the first subword as the weight of the entire term; other subwords are ignored when computing the MSE loss.

Once the regression model has been trained, inference is applied to compute ŷ_{t,d} for each text d from the corpus. These weights are then rescaled from [0, 1] to integers between 0 and 100 so they resemble term frequencies in standard bag-of-words retrieval methods. Finally, the texts are indexed using these rescaled term weights via a simple trick that does not require changing the underlying indexing algorithm to support custom term weights. New "pseudo-documents" are created in which terms are repeated the same number of times as their importance weights. For example, if the term "boyle" is assigned a weight of four, it is repeated four times, becoming "boyle boyle boyle boyle" in this new pseudo-document. A new corpus comprising these pseudo-documents, in which the repeated terms are concatenated together, is then indexed like any other corpus. Retrieval is performed on this index as with any other bag-of-words query,120 although it is important to retune parameters in the scoring function.

120 Note that phrase queries are no longer meaningful since the pseudo-documents corrupt any positional relationship between the original terms.
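The pseudo-document trick described above requires nothing more than repeating terms in proportion to their predicted weights. A minimal sketch follows, assuming weights maps each term to a predicted score in [0, 1] (a hypothetical input format, not the authors' code).

    def to_pseudo_document(weights, scale=100):
        """Build an indexable pseudo-document from per-term weights in [0, 1]."""
        tokens = []
        for term, y_hat in weights.items():
            tf = round(scale * y_hat)        # rescale [0, 1] to an integer in [0, 100]
            tokens.extend([term] * tf)       # repeat the term tf times
        return " ".join(tokens)

    # e.g., {"boyle": 0.04, "singer": 0.01} -> "boyle boyle boyle boyle singer"

The resulting pseudo-documents can be fed to any standard indexer, which is exactly what makes the approach a drop-in replacement for keyword search.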
Experimental results for DeepCT using BERTBase for regression on the MS MARCO passage ranking test collection are presented in Table 34, copied from Dai and Callan [2019a]. The obvious point of comparison is doc2query, and thus we have copied appropriate comparisons from Table 31 and Table 33. Note that doc2query–base, row (1b), predated DeepCT and is included in the authors' comparison, but doc2query–T5 was developed after DeepCT.

                                                 MS MARCO Passage
                                              Development              Test
Method                                     MRR@10    Recall@1k       MRR@10
(1a) BM25                                   0.184      0.853          0.186
(1b) w/ doc2query–base                      0.218      0.891          0.215
(1c) w/ doc2query–T5                        0.277      0.947          0.272
(1d) w/ doc2query–T5 (copied terms only)    0.221      0.893            -
(2)  DeepCT                                 0.243      0.913          0.239

Table 34: The effectiveness of DeepCT on the MS MARCO passage ranking test collection.

How do the two approaches compare? It appears that DeepCT is more effective than the "vanilla" (i.e., non-pretrained) version of doc2query but is not as effective as doc2query based on T5, which benefits from pretraining. Evaluation in terms of Recall@1k tells a consistent story: all three techniques increase the number of relevant documents that are available to downstream rerankers, and the effectiveness of DeepCT lies between doc2query–base and doc2query–T5. In row (1d), we repeat the results of the doc2query–T5 ablation analysis from Table 33, where only repeated expansion terms are included. This discards the effects of new terms, bringing the comparison into closer alignment with DeepCT. Comparing row (2) with row (1d), we see that DeepCT's principled approach to reweighting terms is more effective than relying on a sequence-to-sequence model to reweight terms indirectly by generating multiple copies of the terms in independent query predictions.

It is worth noting that a comparison between the two methods is not entirely fair since doc2query's T5-base model is twice the size of DeepCT's BERTBase model, and it was pretrained on a larger corpus. Thus, we cannot easily separate the impact on effectiveness of simply having a bigger model, as opposed to fundamental characteristics of the underlying techniques.

While not as effective as the best variant of doc2query, DeepCT does have a number of advantages: its model is more lightweight in terms of neural network inference, and thus preprocessing an entire corpus with DeepCT (which is necessary prior to indexing) is much faster. DeepCT uses an encoder-only model (e.g., BERT), which tends to be faster than the encoder–decoder (i.e., sequence-to-sequence) models used by doc2query, since the latter require an additional output sequence generation phase. Furthermore, DeepCT requires only one inference pass per text to compute term importance weights for all terms in the text, whereas doc2query requires an inference pass to generate each query prediction, which must be repeated multiple times (typically tens of times).121

121 Although this is easily parallelizable on a cluster.

The other major difference between DeepCT and doc2query is that DeepCT is restricted to reweighting terms already present in a text, whereas doc2query can augment the existing text with new terms, thus potentially helping to bridge the vocabulary mismatch gap. The higher recall observed with doc2query–T5 in Table 34 is perhaps attributable to these expansion terms. The addition of new terms not present in the original texts, however, increases keyword search latency by a modest amount due to the increased length of the texts. In contrast, the performance impact of DeepCT is negligible, as experimentally validated by Mackenzie et al. [2020].122

122 Note that it is not a foregone conclusion that term reweighting will retain the same performance profile in bag-of-words querying (i.e., query latencies and their distributions) compared to "normal" term frequencies. While the terms have not changed, the term weights have, which could affect early-exit and other optimizations in modern query evaluation algorithms (which critically depend on the relative weights between terms in the same text). Thus, the performance impact of term weighting requires empirical examination and cannot be derived from first principles; see Mackenzie et al. [2020] for an in-depth and nuanced look at these effects. Interestingly, in the case of DeepImpact, a document expansion and reweighting technique we discuss in Section 4.6, the distribution of weights does substantially increase query latency.

Takeaway Lessons. At a high level, doc2query and DeepCT represent two different realizations of the insight that transformers can be applied to preprocess a corpus in a manner that improves retrieval effectiveness. Both techniques share two key features: they eliminate the need for expensive neural network inference at query time (as inference is pushed into the preprocessing stage), and they provide drop-in replacements for keyword search. For certain applications, we might imagine that bag-of-words keyword retrieval over doc2query or DeepCT indexes might be "good enough", and results can be directly returned to users (without additional reranking).
In this case, we have completely eliminated query-time dependencies on inference using neural networks (and their associated hardware requirements). Alternatively, either doc2query or DeepCT can be used for candidate generation in a multi-stage reranking pipeline to improve recall, thus providing downstream rankers with more relevant documents to process and potentially improving end-to-end effectiveness.

4.5 Term Reweighting with Weak Supervision: HDCT

In follow-up work building on DeepCT, Dai and Callan [2020] proposed HDCT, a context-aware hierarchical document term weighting framework. Similar to DeepCT, the goal is to estimate a term's context-specific importance based on contextual embeddings from BERT, which are able to capture complex syntactic and semantic relations within local contexts. Like DeepCT, these term importance weights (or scores) are mapped into integers so that they can be directly interpreted as term frequencies, replacing term frequencies in a standard bag-of-words inverted index.

Like much of the discussion in Section 3.3, HDCT was designed to address the length limitations of BERT. DeepCT did not encounter this issue because it was only applied to paragraph-length texts such as those in the MS MARCO passage corpus. As we've already discussed extensively, BERT has challenges with input sequences longer than 512 tokens for a number of reasons. The obvious solution, of course, is to split texts into passages and process each passage individually. Later in this section, we discuss similarly straightforward extensions of doc2query to longer texts as a point of comparison.

To process long texts, HDCT splits them into passages comprising consecutive sentences, up to about 300 words per passage. After processing each passage with BERT, the contextual embedding of each term is fed into a linear layer to map the vector representation into a scalar weight:

    ŷ_{t,p} = w · T_BERT(t, p) + b,    (55)

where T_BERT(t, p) is the contextual embedding produced by BERT for term t in passage p, w is the weight vector, and b is the bias. Like DeepCT, predicting the importance weight of term t in passage p, denoted ŷ_{t,p}, is formulated as a regression problem.123 By construction, ground truth labels are in the range [0, 1] (see below), and thus so are the predictions, ŷ_{t,p} ∈ [0, 1].
They are then scaled into integers as follows:

    tf_BERT(t, p) = round(N · sqrt(ŷ_{t,p})),    (56)

where N = 100 retains two-digit precision and taking the square root has a smoothing effect.124 The weight tf_BERT(t, p) captures the importance of term t in passage p according to the BERT regression model, rescaled to a term frequency–like value.

123 Note that although DeepCT and HDCT are by the same authors, the two papers use slightly different notation, in some cases, for the same ideas; for example, Eq. (55) and Eq. (53) both express term importance prediction as regression. Nevertheless, we preserve the notation used in each of the original papers for clarity.
124 Note that DeepCT is missing this square root.

There are still a few more steps before we arrive at document-level tf_BERT weights. So far, we have a bag-of-words vector representation for each passage p:

    P-BoW_HDCT(p) = [tf_BERT(t_1, p), tf_BERT(t_2, p), ..., tf_BERT(t_m, p)].    (57)

Gathering the results from each passage yields a sequence of bag-of-words passage vectors:

    {P-BoW_HDCT(p_1), P-BoW_HDCT(p_2), ..., P-BoW_HDCT(p_n)}.    (58)

Finally, the importance weight for each term t in document d is computed as:

    D-BoW_HDCT(d) = Σ_{i=1}^{n} pw_i × P-BoW_HDCT(p_i),    (59)

where pw_i is the weight for passage p_i. Dai and Callan [2020] experimented with two ways of computing the passage weights: in the "sum" approach, pw_i = 1, and in the "decay" approach, pw_i = 1/i. The first approach considers all passages equal, while the second discounts passages based on their position, i.e., passages near the beginning of the text are assigned a higher weight. Although "decay" is slightly more effective on newswire documents than "sum", the authors concluded that "sum" appears to be more robust and also works well with web pages. At the end of these processing steps, each (potentially long) text is converted into a bag of terms, where each term is associated with an integer importance weight.
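Putting Eqs. (56)–(59) together, the passage-to-document aggregation can be sketched as follows. This is a simplified illustration rather than the authors' implementation; it assumes passage_weights is a list of per-passage {term: ŷ} dictionaries with scores in [0, 1], and it rounds the final document-level weights so they can stand in for term frequencies.

    import math
    from collections import defaultdict

    def hdct_document_weights(passage_weights, scheme="sum", N=100):
        """Aggregate per-passage regression scores into document-level integer weights."""
        doc_bow = defaultdict(float)
        for i, passage in enumerate(passage_weights, start=1):
            pw = 1.0 if scheme == "sum" else 1.0 / i      # "decay" favors earlier passages
            for term, y_hat in passage.items():
                tf_bert = round(N * math.sqrt(y_hat))     # Eq. (56): rescale and smooth
                doc_bow[term] += pw * tf_bert             # Eq. (59): weighted sum over passages
        return {term: int(round(w)) for term, w in doc_bow.items()}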
Given this setup, the only remaining issue is the "ground truth" y_{t,p} labels for the term importance weights. Recall that in DeepCT, these scores are derived from query term recall based on (query, relevant text) pairs from the MS MARCO passage ranking test collection. There are two issues with this approach:

1. Labeled datasets at this scale are costly to build.
2. Relevance judgments are made at the document level, but the HDCT regression problem is formulated at the passage level; see Eq. (55).

Thus, Dai and Callan [2020] explored weak supervision techniques to automatically generate training labels. Note that the second motivation is exactly the same issue Akkalyoncu Yilmaz et al. [2019b] dealt with in Birch, and the findings here are consistent (see Section 3.3.1). In the end, experiments with HDCT found that automatically deriving global (document-level) labels appears to be sufficient for training local (passage-level) term importance predictors; BERT's contextual embeddings appear to generate high-quality local weights at the passage level. This is similar to the "don't worry about it" approach adopted by BERT–MaxP (see Section 3.3.2). Dai and Callan [2020] proposed two techniques for generating term importance weights for training:

• If (query, relevant text) pairs are not available, simply use an existing retrieval system (e.g., BM25 ranking) to collect pseudo-relevant documents (by assuming that the top retrieved results are relevant). This, though, still requires access to a collection of queries. From this synthetic dataset, QTR in Eq. (51) can be computed and used as y_{t,p}.

• Analogously, document fields that are commonly used in search—for example, titles and anchor texts—can provide an indication of what terms are important in the document's text. This idea of using document metadata as distant supervision signals to create synthetic datasets dates back to the early 2000s [Jin et al., 2002].

Having defined the target labels y_{t,p}, the BERT regression model can be trained. As with DeepCT, HDCT is trained end-to-end to minimize mean squared error (MSE) loss.

An evaluation of HDCT using BERTBase on the development set of the MS MARCO document ranking test collection is shown in Table 35, copied from Dai and Callan [2020]. Their paper presented evaluation on web collections as well as a number of detailed analyses and ablation studies, but for brevity here we only convey the highlights. Statistically significant differences are denoted by the superscripts, e.g., row (2a) is significantly better than row (1) and row (2b).

                                                            MS MARCO Doc (Dev)
Method                                                           MRR@100
(1)  BM25FE                                                       0.283
(2a) w/ HDCT title                                                0.300^{1,2b}
(2b) w/ HDCT PRF (AOL queries)                                    0.291^{1}
(2c) w/ HDCT PRF (MS MARCO queries)                               0.307^{1,2ab}
(3)  w/ HDCT supervision (MS MARCO doc)                           0.320^{1,2abc}
(4a) BM25 (tuned) [Lin et al., 2021a]                             0.277
(4b) BM25 + doc2query–T5 (tuned) [Lin et al., 2021a]              0.327

Table 35: The effectiveness of HDCT on the development set of the MS MARCO document ranking test collection. Statistically significant differences are denoted by the superscripts.

As the baseline, Dai and Callan [2020] built an ensemble of BM25 rankers on different document fields: title, body, and URL in the case of MS MARCO documents. This is shown in row (1). The effectiveness of the HDCT passage regression model for predicting term importance, trained on the MS MARCO document ranking test collection, which contains approximately 370K (query, relevant document) pairs, is shown in row (3). This condition captures the upper bound of the weak supervision techniques, since the labels are provided by humans. Row (2a) shows the effectiveness of using document titles for weak supervision. Rows (2b) and (2c) show the effectiveness of using pseudo-relevant documents, with different queries. In row (2b), the AOL query log [Pass et al., 2006] is used, which might be characterized as "out of domain" queries. In row (2c), queries from the training set of the MS MARCO document ranking test collection were used (but without the corresponding relevant documents); this can be characterized as weak supervision using "in domain" queries. We see that weak supervision with MS MARCO queries (i.e., "in domain" queries) is more effective than using document metadata, but using the AOL query log (i.e., "out of domain" queries) is worse than simply using document metadata.

Drawing results from Lin et al. [2021a], we are able to provide a comparison between HDCT and doc2query–T5. In Table 35, row (4a) shows their reported BM25 results on the MS MARCO document ranking test collection, which are on par with the results in row (1). Row (4b) shows document expansion using doc2query–T5, using a model trained on the passage dataset.
In these experiments, the expansion was performed as follows: first, each document was segmented into passages; then, expansion was performed on each passage independently to generate the predicted queries; and finally, all the predictions were concatenated together and appended to the original document. For additional details, see Pradeep et al. [2021b] and documentation in the reference implementation at doc2query.ai. The appropriate comparison condition is row (3), since doc2query–T5 was trained on MS MARCO data in a supervised way.125

125 A minor detail here: doc2query–T5 was trained with MS MARCO passage data, while HDCT was trained with MS MARCO document data.

Interestingly, whereas Table 34 shows that doc2query–T5 is more effective than DeepCT for passage retrieval, results in Table 35 suggest that the effectiveness of HDCT is on par with doc2query–T5 for document retrieval, even though it only performs term weighting. We suspect that the simple document expansion adaptation of doc2query–T5 is not an entirely adequate solution, because not all parts of a long text are a priori equally likely to be relevant. In other words, some parts of a long text are more important than others. With the simple expansion approach described above, doc2query indiscriminately generates expansions for all passages, even lower quality ones; this might dilute the impact of high-quality predictions from "important" passages. HDCT attempts to capture similar intuitions using passage weights, as in Eq. (59), but the model is hampered by the lack of passage-level judgments.

Takeaway Lessons. Building on DeepCT, HDCT provides three additional important lessons. First, it offers relatively simple solutions to the length limitations of BERT, thus allowing the same ideas behind DeepCT to be applied to longer texts. Second, while an accurate term weighting model can be learned with manual relevance judgments, weak supervision with labels from pseudo-relevant documents gets us around 65% of the gains from a fully-supervised approach. Finally, term reweighting alone with HDCT yields effectiveness that is on par with a simple extension of doc2query to longer texts, suggesting that there remains more work to be done on refining document expansion techniques for full-length documents.

4.6 Combining Term Expansion with Term Weighting: DeepImpact

One of the advantages of doc2query compared to DeepCT (and HDCT) is that it can generate terms that are not present in the original text, which increases the likelihood that the text will be retrieved in response to queries formulated in different ways. This tackles the core of the vocabulary mismatch challenge. However, to produce these diverse terms, we need to sample multiple query predictions from the sequence-to-sequence model, which is not only computationally expensive but may also result in spurious terms that are unrelated to the original text. One advantage of DeepCT (and HDCT) over doc2query is its ability to precisely control the importance weights of individual terms. In contrast, term weighting in doc2query is primarily a side effect of repeat occurrences of duplicate and novel terms in the predicted queries. Since multiple queries are sampled from the sequence-to-sequence model independently, doc2query is not able to explicitly control term weights.

What if we could obtain the best of both worlds by combining DeepCT and doc2query? Mallia et al. [2021] did exactly this, in what they called DeepImpact, which combines the two techniques in a straightforward yet effective manner.
DeepImpact first performs document expansion using doc2query and then uses a scoring model to estimate the importance of terms in the expanded document (i.e., their term weights). This two-step process allows the model to filter out (or at least down-weight) non-relevant terms produced by doc2query while appropriately reweighting relevant existing and new terms.

To compute term weights, DeepImpact begins by feeding the original text and expansion terms from doc2query–T5 into BERTBase to generate contextual embeddings. The first occurrence of each unique term is then used as input to a two-layer MLP with ReLU activations to predict the term's weight. Differently from DeepCT, which is trained with a regression loss (based on query term recall, see Section 4.4), the DeepImpact scoring model is trained with a pairwise cross-entropy loss, based on (query, positive passage, negative passage) triples from the MS MARCO passage ranking test collection. The objective is to maximize the difference between the query–document scores of a positive example and a negative example, where query–document scores are computed as the sum of the scores of document and expansion terms that occur in the queries.

The trained model is then used to compute the term weights of the document and expansion terms for each text in a corpus. These real-valued weights are then quantized into the range [1, 2^b − 1], where b = 8. Recall that in DeepCT, integer weights are indexed using a standard search engine by creating pseudo-documents where a term is repeated a number of times equal to its weight (see Section 4.4). Instead of adopting this approach, Mallia et al. [2021] indexed the expansion results by directly storing the quantized weight in the term frequency position of a standard inverted index in the open-source PISA search engine [Mallia et al., 2019] via a custom data ingestor. This yields what the literature calls an impact index [Anh et al., 2001]; these quantized scores are called "impacts". At query time, query–document scores are computed as the sum of the integer weights (computed from the DeepImpact scoring model) of document and expansion terms that match query terms. This approach to ranking builds on a long line of research dating back decades that exploits query evaluation optimizations based on integer arithmetic [Anh et al., 2001, Anh and Moffat, 2002, Trotman et al., 2012, Crane et al., 2013, Lin and Trotman, 2015, Crane et al., 2017], as opposed to the floating point operations required for BM25.
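The quantization and query-time scoring described above can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the authors' implementation: it uses one simple linear quantization scheme (the paper's exact scheme may differ) and assumes weights maps each document or expansion term to a real-valued score from the trained model.

    def quantize(weights, b=8):
        """Map real-valued term scores into integer impacts in [1, 2^b - 1]."""
        w_max = max(weights.values())
        levels = (1 << b) - 1                          # 2^b - 1 quantization levels
        return {t: max(1, round(w / w_max * levels)) for t, w in weights.items()}

    def impact_score(query_terms, impacts):
        """Query-document score: sum of integer impacts of matching terms."""
        return sum(impacts.get(t, 0) for t in set(query_terms))

Because scoring reduces to summing small integers, the inverted index can store impacts directly in the term frequency slot and query evaluation never touches floating point arithmetic.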
Experiments on the MS MARCO passage ranking test collection demonstrate that DeepImpact is more effective than both DeepCT and doc2query–T5, as shown in Table 36, with the effectiveness figures copied from the authors' original paper. The latency figures for the (a) rows are based on the PISA system [Mallia et al., 2019], which implements highly optimized query evaluation algorithms that can be quite a bit faster than Lucene. The latency figures for reranking, i.e., the (b) rows, are taken from Figure 16 in Section 3.5; these numbers are representative of the typical latencies associated with BERT-based reranking.

                          MS MARCO Passage (Dev)          Latency
Method                    MRR@10     Recall@1k          (ms/query)
(1a) BM25                  0.184       0.853                 13
(1b) + monoBERT            0.355       0.853             10,700
(2a) DeepCT                0.244       0.910                 10
(2b) + monoBERT            0.360       0.910             10,700
(3a) doc2query–T5          0.278       0.947                 12
(3b) + monoBERT            0.362       0.947             10,700
(4a) DeepImpact            0.326       0.948                 58
(4b) + monoBERT            0.362       0.948             10,700

Table 36: The effectiveness of DeepImpact on the development set of the MS MARCO passage ranking test collection.

Results show that while DeepImpact is certainly more effective, it is also slower than doc2query–T5 at query time (although we are still squarely in the realm of latencies adequate to support interactive retrieval). This is a curious finding, as the two techniques differ only in the weights assigned to the terms; both are still based on bag-of-words keyword retrieval. The authors trace this to the query processing strategy: the distribution of scores induced by DeepImpact cannot be efficiently exploited by the underlying MaxScore query evaluation algorithm used by PISA in these experiments. These results also show that the effectiveness of DeepImpact alone is only around three points less than that of BM25 + monoBERT on the MS MARCO passage ranking test collection, as seen in row (1b) vs. row (4a). This is quite impressive and worth emphasizing: DeepImpact is more than an order of magnitude faster than BM25 + monoBERT reranking and furthermore does not require neural inference (e.g., with GPUs) at query time. However, since DeepImpact's Recall@1k is similar to that of doc2query–T5, both methods yield similar effectiveness when combined with a monoBERT reranker, see row (3b) vs. (4b). That is, although DeepImpact used alone is much more effective than doc2query–T5, in terms of end-to-end effectiveness as part of a reranking pipeline, there doesn't seem to be any noticeable difference in output quality as a first-stage ranker.

Takeaway Lessons. DeepImpact is an effective document expansion and term weighting method that combines the strengths of doc2query and DeepCT. On the MS MARCO passage ranking task, with only keyword-based retrieval and no neural inference at query time, it achieves a level of effectiveness that approaches a simple monoBERT reranker.

4.7 Expansion of Query and Document Representations

All the techniques presented thus far have involved manipulations of term-based representations of queries and documents. That is, the query expansion techniques involve augmenting the original query with additional terms (with associated weights), and similarly, document expansion techniques involve adding terms to the documents (or reweighting existing terms).

In contrast to these term expansion approaches, researchers have considered the problem of expanding query and document representations that are non-textual in nature (as one might expect, leveraging the output of transformers). In this section, we discuss two techniques that create additional query representations using pseudo-relevance feedback [Zheng et al., 2020, Yu et al., 2021] and a technique based on augmenting document representations [MacAvaney et al., 2020d].

Expansion of query representations. The BERT-QE approach proposed by Zheng et al. [2020] extends the pre-BERT NPRF (Neural Pseudo Relevance Feedback) approach [Li et al., 2018] to take advantage of BERT-based relevance classification. Given a monoBERT model fine-tuned for ranking on a target dataset, BERT-QE consists of three steps:

1. The top-1000 documents from a first-stage retrieval method are reranked with monoBERT to produce a set of k_d = 10 top-ranked feedback documents.
2. The feedback documents are divided into separate passages using a sliding window of size m = 10, and monoBERT is used to produce a relevance score rel(q, p_i) for each passage p_i with respect to the query q. The top k_c = 10 passages are retained to produce a set of feedback passages.

3. A monoBERT model is used to compare each feedback passage to a candidate document d that is being ranked, i.e., rel(p_i, d). This is performed for each document d from the top-1000 documents in the initial reranking (step 1). Given these scores, an overall score rel(P, d) is produced that represents how similar the candidate document is to the complete set of feedback passages P:

       rel(P, d) = Σ_{p_i ∈ P} softmax(rel(q, p_i)) · rel(p_i, d).    (60)

Each document's final relevance score is computed as the interpolation of the query–document relevance score after reranking, rel(q, d), and the overall feedback passage–document relevance score, rel(P, d); a small sketch of this aggregation follows.
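The sketch below illustrates Eq. (60) and the final interpolation. It is a simplified illustration rather than the authors' code: rel_q_p and rel_p_d are assumed to hold monoBERT scores for the feedback passages (with respect to the query and to the candidate document, respectively), and the interpolation weight alpha is a hyperparameter whose value we do not take from the paper.

    import math

    def bert_qe_score(rel_q_p, rel_p_d, rel_q_d, alpha=0.5):
        """Combine feedback-passage evidence with the query-document score."""
        # softmax over the query-passage scores of the feedback passages
        exp_scores = [math.exp(s) for s in rel_q_p]
        z = sum(exp_scores)
        rel_P_d = sum((e / z) * r for e, r in zip(exp_scores, rel_p_d))   # Eq. (60)
        # final score: interpolate rel(q, d) with rel(P, d)
        return alpha * rel_q_d + (1 - alpha) * rel_P_d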
Zheng et al. [2020] evaluated their approach using BERT on the Robust04 and Gov2 test collections (using title queries). To rerank long documents, the authors used a variation of BERT–MaxP where each document was represented by its highest-scoring passage according to a monoBERT model that was pre-fine-tuned on the MS MARCO passage ranking test collection. After applying this procedure as a preprocessing step, the monoBERT model was fine-tuned on the target collection to rerank results from the DPH + KL query expansion method [Amati et al., 2007]. According to Zheng et al., this preprocessing technique reduced training time without harming effectiveness. The authors trained the monoBERT model used in step (1) using a cross-entropy loss; the model was not fine-tuned end-to-end with steps (2) and (3). Here, we present results using two BERT-QE variants: BERT-QE-Large uses a BERTLarge model with 340M parameters for all three steps, whereas BERT-QE-Medium uses a BERTLarge model for step (1) and a smaller BERTMedium model with only 42M parameters for steps (2) and (3). See the original paper for detailed analyses of the effectiveness/efficiency tradeoffs when different BERT models are used in the various steps.

Experimental results are shown in Table 37, directly copied from Zheng et al. [2020].

                                   Robust04                          Gov2
Method                   P@20     nDCG@20    MAP          P@20     nDCG@20    MAP
(1a) DPH + KL           0.3924    0.4397    0.3046       0.5896    0.5122    0.3605
(1b) BM25 + RM3         0.3821    0.4407    0.2903       0.5634    0.4851    0.3350
(2a) BERTBase MaxP      0.4653    0.5278    0.3652       0.6591    0.5851    0.3971
(2b) BERTLarge MaxP     0.4769    0.5397    0.3743       0.6638    0.5932    0.4082
(3a) BERT-QE-Large      0.4888†   0.5533†   0.3865†      0.6748†   0.6037†   0.4143†
(3b) BERT-QE-Medium     0.4888†   0.5569†   0.3829†      0.6732†   0.6002    0.4131†

Table 37: The effectiveness of BERT-QE on the Robust04 and Gov2 test collections using title queries. Statistically significant increases in effectiveness over BERTLarge are indicated with the symbol † (p < 0.01, two-tailed paired t-test).

DPH + KL, row (1a), is the first-stage retrieval method for BERT-QE, but BM25 + RM3 results are also presented for context in row (1b). Rows (2a) and (2b) present the MaxP baselines from BERTBase and BERTLarge, respectively. BERT-QE-Large, row (3a), consistently achieves significant improvements in effectiveness compared to the BERT model it is built upon, row (2b). This comes at the cost of requiring about 11× more computations than the underlying BERTLarge model. BERT-QE-Medium, row (3b), performs almost as well, with significant improvements over BERTLarge in all cases except for nDCG@20 on Gov2. This configuration requires only 2× more computations than BERTLarge, and thus may represent a better tradeoff between efficiency and effectiveness. Comparing rows (2a) and (2b), BERTLarge obtains improvements over BERTBase, which differs from the results previously observed in Section 3.3.2. The source of this difference is unclear: at a minimum, the first-stage ranking method, folds, and implementation differ from those used in the previous experiments.

Another work that takes an approach similar to BERT-QE is the PRF Graph-based Transformer (PGT) of Yu et al. [2021], where feedback documents are also compared to each candidate document. In their most effective variant, PGT applies Transformer-XH [Zhao et al., 2019] to feedback documents from a first-stage ranking method, where each feedback document is placed into the following input template:

    [[CLS], query, [SEP], candidate document, [SEP], feedback document, [SEP]].    (61)

This step produces a vector composed of the weighted sum of the [CLS] tokens from the feedback documents, which is then used to predict a relevance score. The model is trained with a cross-entropy loss and evaluated on the TREC 2019 Deep Learning Track passage ranking task. When combined with BM25 for first-stage retrieval, it significantly improved over monoBERT in terms of MAP@10, but yielded only a small improvement in terms of nDCG@10 and performed worse in terms of MAP@100. Yu et al. [2021] also evaluated other, less effective PGT variants that make changes to the feedback document representations (e.g., by not prepending the query and candidate document) or to the graph structure (e.g., by including a node for the query and candidate document). We do not discuss these variants here, and instead refer readers to the authors' original paper.

Expansion of document representations. Rather than creating additional query representations like the papers discussed above, the EPIC model (short for "Expansion via Prediction of Importance with Contextualization") proposed by MacAvaney et al. [2020d] creates expanded dense document representations. At its core, EPIC is a bi-encoder model that expands dense document representations directly, without considering the query or feedback documents (bi-encoders and dense representations will be detailed in Section 5). EPIC represents both queries and texts from the corpus as vectors with |V| dimensions, where V is the WordPiece vocabulary. Queries are represented as sparse vectors in which only tokens appearing in the query have non-zero values, while documents are represented as dense vectors. Query vectors contain term importance weights that are computed from the corresponding contextual term embeddings using a feedforward layer. Document vectors are produced by first projecting each contextual term embedding to |V| dimensions, which the authors described as an expansion step. The expanded document term vectors are then weighted with a document quality score (using a feedforward layer that takes the [CLS] token of the document as input) and a term importance weight, which is computed analogously to the query term importance weights, and then combined into a single document representation with max pooling. Finally, EPIC computes relevance scores by taking the inner product between query and document representations. The model is trained using a cross-entropy loss.

In their experiments, MacAvaney et al. [2020d] applied EPIC as a reranker on top of documents retrieved by BM25 or doc2query–T5.
While EPIC was able to significantly outperform these first-stage retrieval approaches, when reranking BM25 it was less effective than variants of the efficient TK reranking model (described in Section 3.5.2), which is the appropriate point of comparison because low query latency was one of the authors' selling points.

Takeaway Lessons. To sum up, expanding query representations rather than expanding the query directly can be effective. While these are interesting ideas, it is not clear whether they are compelling when compared to dense retrieval techniques in terms of effectiveness/efficiency tradeoffs (as we'll see shortly). We wrap up this section with a few concluding thoughts and then proceed to focus on ranking with learned dense representations.

4.8 Concluding Thoughts

Query and document expansion techniques have a long history in information retrieval, dating back many decades. Prior to the advent of BERT, and even neural networks, expansion techniques focused on bringing queries and texts from the corpus into "better alignment" by manipulating sparse (i.e., keyword-based) representations. That is, query and document expansion techniques literally added terms to the query and documents, respectively (possibly with weights). Indeed, many initial attempts at transformer-based query and document expansion techniques largely mimicked this behavior, focusing on term-based manipulations. On the document end, techniques such as doc2query, DeepCT, and HDCT have been shown to be simple and effective. On the query end, the results are mixed (i.e., modest gains in effectiveness, but at great computational cost) and do not appear to be unequivocally compelling.

More recently, researchers have begun to explore expansion methods that move beyond manipulations of term-based representations, like the work discussed in Section 4.7. Conceptually, these techniques begin to blur the lines between transformer-based reranking models and expansion methods, and serve as a nice segue to ranking with dense representations, the topic of the next section. Operationally, post-retrieval query expansion methods (which include techniques based on pseudo-relevance feedback) behave no differently from rerankers in a multi-stage reranking pipeline, except that the module involves another round of keyword-based retrieval. But internally, if the model is manipulating transformer-based representations, isn't it just another kind of transformer-based reranking? Document expansion approaches that directly manipulate non-keyword representations begin to take on some of the characteristics of transformer-based dense representations.

The blurring of these distinctions allows us to draw connections between methods that have very different motivations and offers a lens through which to evaluate effectiveness/efficiency tradeoffs. For example, if the goal of query expansion is to provide better candidate texts for a downstream reranker, then the end-to-end tradeoffs must be considered. It could be the case that an improved query expansion method only modestly improves the output quality of the downstream rerankers, but requires an increase in computational costs that makes adoption impractical. We see hints of this in CEQE from Section 4.2, where a BM25 → CEQE → CEDR pipeline is only slightly more effective than a similar pipeline using RM3 in place of CEQE. In some other cases, improvements in first-stage retrieval don't have much effect on downstream rerankers.
Consider DeepImpact from Section 4.6: monoBERT reranking of first-stage retrieval with DeepImpact is only slightly better than monoBERT reranking of BM25 results, even though, in isolation, DeepImpact is far more effective. In fact, with monoBERT reranking, end-to-end effectiveness appears to be similar with either doc2query–T5 or DeepImpact as first-stage retrieval. We suspect that this happens because of a mismatch between the texts that the rerankers see during training and at inference. Typically, monoBERT is trained on candidates from BM25 initial retrieval (as indeed are most ranking models discussed in Section 3), but at query time the rerankers may be presented with candidates produced by a different approach. Thus, independent stage-wise optimizations may not translate into increased end-to-end effectiveness.

Regardless, document and query expansion techniques that focus on manipulating representations instead of terms appear to be, at a high level, a very promising direction for tackling the vocabulary mismatch problem. Such an approach brings us quite close to directly ranking with learned dense representations. That, we turn to next.

5 Learned Dense Representations for Ranking

Arguably, the single biggest benefit brought about by modern deep learning techniques to text ranking is the move away from sparse signals, mostly limited to exact matches, to continuous dense representations that are able to capture semantic matches to better model relevance (see Section 1.2). With so-called dense retrieval techniques, the topic of this section, we can perform ranking directly on vector representations (naturally, generated by transformers). This approach has the potential to address the vocabulary mismatch problem by directly performing relevance matching in a representation space that "captures meaning"—as opposed to reranking the output of keyword-based first-stage retrieval, which still relies on sparse exact match signals (document and query expansion techniques discussed in Section 4 notwithstanding).

The potential of dense representations for analyzing natural language was first demonstrated with word embeddings on word analogy tasks, which is generally viewed as the beginning of the "neural revolution" in natural language processing. However, as soon as we try to build continuous representations for any larger spans of text (phrases, sentences, paragraphs, and documents), many of the same issues that arise in text ranking come into focus. Here, as we will see, there is a close relationship between notions of relevance from information retrieval and notions of textual similarity from natural language processing.

The focus of this section is the application of transformers to generate representations of texts that are suitable for ranking in a supervised setting; this is a special case of what machine learning researchers would call representation learning. We begin with a more precise formulation of what we mean by text ranking using learned dense representations (also called dense retrieval), and identify connections between relevance and textual similarity problems. In particular, while we adopt a ranking perspective, the core challenge remains the problem of estimating the relation between two pieces of text.
In the same way that keyword search requires inverted indexes and associated infrastructure to support top-k ranking using exact matches on a large corpus, top-k ranking in terms of simple vector comparison operations such as inner products on dense representations requires dedicated infrastructure as well. We present an overview of this problem, known as nearest neighbor search, in Section 5.2. Efficient, scalable solutions are available today in open-source libraries. As with neural reranking techniques, it is helpful to discuss historical developments in terms of "pre-BERT" and "post-BERT" models: Section 5.3 overviews ranking based on dense representations prior to BERT. We can clearly see connections from recent work to similar ideas that have been explored for many years, the main difference being the type of neural model applied. After this setup, our survey of dense retrieval techniques is divided into three parts:

• Section 5.4 introduces the so-called bi-encoder design, which is contrasted with rerankers based on a cross-encoder design (all of the models presented in Section 3). This section focuses on "simple" bi-encoders, where each text from the corpus is represented by a single vector, and ranking is based on simple comparison operations such as inner products.

• Section 5.5 presents techniques that enhance the basic bi-encoder design in two ways: each text from the corpus can be represented by multiple vectors and ranking can be performed using more complex comparisons between the representations. These techniques aim for different effectiveness/efficiency tradeoffs compared to "simple" bi-encoders.

• Section 5.6 discusses dense retrieval techniques that take advantage of knowledge distillation. Instead of directly training dense retrieval models, we first train larger or more effective models (e.g., cross-encoders), and then transfer their knowledge into bi-encoder models.

Finally, we conclude our treatment of learned dense representations in Section 5.7 with a discussion of open challenges and some speculation on what's to come.

5.1 Task Formulation

We begin by more precisely defining the family of techniques covered in this section. Because text ranking with dense representations, or dense retrieval, is an emerging area of research, the literature has not yet converged on consistent terminology. In this survey, we try to synthesize existing work and harmonize different definitions without unnecessarily introducing new terms.

The core problem of text ranking remains the same as the setup introduced in Section 2: We assume the existence of a corpus C = {d_i} comprised of an arbitrary number of texts. Given a query q, the task is to generate a top-k ranking of texts from C that maximizes some metric of quality. In the multi-stage ranking architectures covered in Section 3, this is accomplished by first-stage retrieval using keyword search (i.e., based on sparse bag-of-words representations), followed by one or more rerankers (based on BERT or some other transformer architecture operating on dense representations). Dense retrieval, in contrast, has a different setup.
In the basic problem formulation, we would like to learn some transformation η : [t_1 ... t_n] → R^n on queries and texts from the corpus,126 denoted η_q(·) and η_d(·), respectively, that converts sequences of tokens into fixed-width vectors,127 such that, under a particular similarity comparison function φ, the similarity between η_q(q) and η_d(d) is maximized for texts relevant to a query and minimized for texts that are not relevant to that query. At query (search) time, for a given query q, we wish to retrieve the top k texts from the corpus C with the highest similarity given the same encoders η_q and η_d and the comparison function φ. In the case where φ is defined in terms of a small number of simple vector comparison operations such as the inner product, efficient and scalable off-the-shelf solutions exist in libraries for nearest neighbor search (see Section 5.2). More complex comparison functions are also possible, representing tradeoffs between effectiveness and efficiency.

126 In the context of dense retrieval, we refer generically to "texts from the corpus" as the retrieval units fed to η_d. Although this terminology can be slightly unwieldy at times, it avoids confusion as to whether these retrieval units are passages, spans, paragraphs, documents, etc.
127 Note that this is a simplification, as we present later in this section dense retrieval models where the encoders generate multiple vectors and matrices.

Specifically, in dense retrieval, we wish to estimate the following:

    P(Relevant = 1 | d_i, q) ≜ φ(η_q(q), η_d(d_i)),    (62)

that is, the relevance of a text with respect to a query. Since there is currently no agreed-upon symbol in the literature for the transformation that maps token sequences to vectors (also called a representation function), we introduce the symbol η (eta) as a mnemonic for "encoder". We use this notation throughout this section since it appropriately evokes the notion of feeding the input sequence into a deep neural network. Encoders for queries and texts from the corpus could either share the same model or use separate models; we discuss this design choice in more detail below.

The output of the encoder is a dense representation (typically, a fixed-width vector). One intuitive way to think about these representations is "like word embeddings, but for sequences of tokens". These representations are dense in the commonly understood sense, typically having hundreds of dimensions, with each dimension taking on non-zero values, as opposed to sparse representations, where the number of dimensions is equal to the vocabulary size and most of the elements are zero. Thus, dense representations establish a poignant contrast to sparse representations, a term that has entered the lexicon to describe bag-of-words representations such as BM25-weighted document vectors. Similarly, sparse retrieval is often used today to characterize keyword search based on exact match, even though the term itself is a recent invention.

What about the similarity function? Generally, φ is assumed to be symmetric, i.e., φ(u, v) = φ(v, u). Furthermore, φ should be "fast to compute". There is, unfortunately, no precise, widely agreed upon definition of what this means, except by illustration. Most commonly, φ is defined to be the inner product between the representation vectors (or cosine similarity, where the only difference is length normalization), although other metrics such as (one minus) L1 or L2 distance are sometimes used. While in principle φ could be a deep neural network, it is understood that the comparison function must be lightweight—otherwise, we could simply define φ to be inference with BERT, and we would be back to something like the monoBERT model again. Nevertheless, as we will discuss, there are interesting options for φ that occupy the middle ground between these extremes (see Section 5.5).
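To make the formulation in Eq. (62) concrete, the brute-force version of dense retrieval with an inner-product φ fits in a few lines of numpy. This is a sketch, not any particular system: eta_q and eta_d are assumed to be functions that map a string to a fixed-width numpy vector (standing in for fine-tuned transformer encoders), and at corpus scale the exhaustive matrix product would be replaced by an approximate nearest neighbor library.

    import numpy as np

    def build_index(corpus, eta_d):
        """Precompute document representations offline; shape (|C|, dim)."""
        return np.stack([eta_d(text) for text in corpus])

    def search(query, eta_q, index, k=10):
        """Return the top-k (doc_id, score) pairs under phi = inner product."""
        scores = index @ eta_q(query)          # one encoder call at query time
        top = np.argsort(-scores)[:k]
        return [(int(i), float(scores[i])) for i in top]

The key properties discussed below follow directly from this structure: the expensive encoding of the corpus happens once, offline, and the query-time work is a single encoder call plus fast vector comparisons.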
Thus, dense retrieval techniques need to address two challenges:

• the representation problem, or the design of the encoders η_q and η_d, to accurately capture the "meaning" of queries and texts from the corpus for the purposes of ranking; and,

• the comparison problem, or the design of φ, which involves a balance between what can be efficiently computed at scale and what is necessary to capture relevance in terms of the dense representations.

As we'll discuss in Section 5.3, both challenges predate BERT, although transformers broaden the design space of the encoders and φ.

The complete model comprised of η_q, η_d, and φ is usually developed in a supervised manner. In the transformer context, the encoders (and φ in some cases as well) are trained (or fine-tuned, to be more accurate) with labeled data capturing the target task. The outputs of η_q and η_d in the supervised scenario are called learned representations, and thus the problem formulation is an instance of representation learning. In principle, the encoders may have never been exposed to labeled training data for the target task. When using pretrained transformers, however, the models may have been exposed to the target corpus during pretraining, but it seems odd to call the encoders "unsupervised" in this context. More common is the case where the models are fine-tuned on out-of-distribution data (e.g., different queries, different corpora, or both) and directly applied to previously unseen texts in a "zero-shot" manner.

Another way to think about dense representations for ranking is in terms of the evolution of broad classes of neural ranking models, dating back to pre-BERT approaches discussed in Section 1.2.4. A side-by-side comparison between pre-BERT representation-based models, pre-BERT interaction-based models, and BERT is shown in Figure 11 in Section 3.2.3 and repeated here as Figure 22. The dense retrieval approaches we focus on in this section are architecturally similar to representation-based approaches, Figure 22(a), except that more powerful transformer-based encoders are used to model queries and texts from the corpus. In previous models, the "arms" of the network that generate the vector representations (i.e., the encoders) are based on CNNs or RNNs. Today, these have been replaced with BERT and other transformers.

[Figure 22 (diagrams omitted): The evolution of neural models for text ranking, copied from Figure 11 in Section 3.2.3: (a) representation-based approaches (left), (b) interaction-based approaches (middle), and (c) monoBERT (right). Dense representations for ranking are most similar to representation-based approaches, except that more powerful transformer-based encoders are used to model queries and texts from the corpus.]
For the choice of the comparison function, pre-BERT representation-based neural ranking models adopt a simple φ such as the inner product. With transformer-based representations, such simple comparison functions remain common. However, researchers have also explored more complex formulations of φ, as we will see in Section 5.5. Some of these approaches incorporate interactions between terms in the queries and texts from the corpus, reminiscent of pre-BERT interaction-based models.

What are the motivations for exploring this formulation of the text ranking problem? We can point to two main reasons:

• BERT inference is slow. This fact, as well as potential solutions, was detailed in Section 3.5. The formulation of text ranking in terms of φ(ηq(q), ηd(d)) has two key properties: First, note that ηd(d) is not dependent on queries. This means that text representations can be precomputed and stored, thus pushing potentially expensive neural network inference into a preprocessing stage—similar to doc2query and DeepCT (see Section 4). Although ηq(q) still needs to be computed at query time, only a single inference is required, and over a relatively short sequence of tokens (since queries are usually much shorter than texts from the corpus). Second, the similarity function φ is fast by design, and ranking in terms of φ over a large (precomputed) collection of dense vectors is typically amenable to solutions based on nearest neighbor search (see Section 5.2).

• Multi-stage ranking architectures are inelegant. Initial candidate retrieval is based on keyword search operating on sparse bag-of-words representations, while all subsequent neural reranking models operate on dense representations. This has a number of consequences, the most important of which is the inability to perform end-to-end training. In practice, the different stages in the pipeline are optimized separately. Typically, first-stage retrieval is optimized for recall, to provide the richest set of candidates to feed downstream rerankers. However, increased recall in candidate generation may not translate into higher end-to-end effectiveness. One reason is that there is often a mismatch between the data used to train the reranker (a static dataset, such as the MS MARCO passage ranking test collection) and the candidate texts that are seen at inference time (e.g., the output of BM25 ranking or another upstream reranker). Although this mismatch can be mitigated by data augmentation and sampling tricks, these remedies are heuristic at best. Alternatively, if the text ranking problem can be boiled down to the comparison function φ, we would no longer need multi-stage ranking architectures. This is exactly the promise of representation learning: that it is possible to learn encoders whose output representations are directly optimized in terms of similarity according to φ.128

Before describing ranking techniques for learned dense representations, it makes sense to discuss some high-level modeling choices. The ranking problem we have defined in Eq. (62) shares many similarities with, but is nevertheless distinct from, a number of natural language processing tasks that are functions of two input sequences:

• Semantic equivalence.
Research papers are often imprecise in claiming to work on computing “semantic similarity” between two texts, as semantic similarity is a vague notion.129 Most research, in fact, uses semantic similarity as a shorthand to refer to a series of tasks known as the Semantic Textual Similarity (STS) tasks [Agirre et al., 2012, Cer et al., 2017]. Thus, semantic similarity is operationally defined by the annotation guidelines of those tasks, which revolve around the notion of semantic equivalence, i.e., “Do these two sentences mean the same thing?” While these concepts are notoriously hard to pin down, the task organizers have carefully thought through and struggled with the associated challenges; see, for example, Agirre et al. [2012]. Ultimately, these researchers have built a series of datasets that reasonably capture operational definitions amenable to computational modeling.130

• Paraphrase. Intuitively, paraphrase can be understood as synonymy, but at the level of token sequences. For example, “John sold the violin to Mary” and “Mary bought the violin from John” are paraphrases, but “Mary sold the violin to John” is not a paraphrase of either. We might formalize these intuitions in terms of substitutability, i.e., two texts (phrases, sentences, etc.) are paraphrases if one can be substituted for the other without significantly altering the meaning. From this, it is possible to build computational models that classify text pairs as either being paraphrases or not.131

128 Note that as a counterpoint, dense retrieval results can still be reranked, which puts us back in exactly this same position again.
129 As a simple example, are apples and oranges similar? Clearly not, because otherwise we wouldn’t use the phrase “apples and oranges” colloquially to refer to different things. However, from a different perspective, apples and oranges are similar in that they’re both fruits. The only point we’re trying to make here is that “semantic similarity” is an ill-defined notion that is highly context dependent.
130 Formally, semantic equivalence is better conceptualized on an interval scale, so the problem is properly that of regression. However, most models convert the problem into classification (i.e., equivalent or not) and then reinterpret (e.g., renormalize) the estimated probability into the final scale.
131 In practice, paraphrase tasks are much more nuanced. Substitutability needs to be defined in some context, and whether two texts are acceptable paraphrases can be strongly context dependent. Consider a community question answering application: “What are some cheap hotels in New York?” is clearly not a paraphrase of “What are cheap lodging options in London?” A user asking one question would not find the answer to the other acceptable. However, in a slightly different context, “What is there to do in Hawaii?” and “I’m looking for fun activities in Fiji.” might be good “paraphrases”, especially for a user who is in the beginning stages of planning for a vacation and has not yet decided on a destination (and hence open to suggestions). As an even more extreme example, “Do I need a visa to travel to India?” and “What immunizations are recommended for travel to India?” would appear to have little to do with each other. However, for a user whose underlying intent was “I’m traveling to India, what preparations are recommended?”, answers to both questions are certainly relevant, making them great “paraphrases” in a community question answering application. In summary, there are subtleties that defy simple characterization and are very difficult to model.

• Entailment. The notion of entailment is formalized in terms of truth values: a text t entails another text h if, typically, a human reading t would infer that h is most likely true [Giampiccolo et al., 2007]. Thus, “John sold the violin to Mary” entails “Mary now owns the violin”.
Typically, entailment tasks involve a three-way classification of “entailment”, “contradiction”, or “neutral” (i.e., neither). Building on the above example, “John then took the violin home” would contradict “John sold the violin to Mary”, and “Jack plays the violin” would be considered “neutral” since the original sentence tells us nothing about Jack.

Thus, relevance, semantic equivalence, paraphrase, and entailment are all similar tasks (pun intended), yet they differ in important respects. One main difference is that semantic equivalence and paraphrase are both symmetric relations, i.e., R(u, v) = R(v, u), but relevance and entailment are clearly not. Relevance is distinguished from the others in a few more respects: queries are usually much shorter than the units of retrieval (for example, short keyword queries vs. long documents), whereas the two inputs for semantic equivalence, paraphrase, and entailment are usually comparable in length (or at the very least, both are sentences). Furthermore, queries can either be short keyword phrases that are rather impoverished in terms of linguistic structure or well-formed natural language sentences (e.g., in the case of question answering); but for the other three tasks, it is assumed that all inputs are well-formed natural language sentences.

When faced with these myriad tasks, a natural question would be: Do these distinctions matter? With BERT, the answer is, likely not. Abstractly, these are all classification tasks over two input texts132 (see Section 3.1) and can be fed to BERT using the standard input template:

[[CLS], s1, [SEP], s2, [SEP]]    (63)

where s1 and s2 are the two inputs. Provided that BERT is fine-tuned with annotated data that capture the nuances of the target task, the model should be able to “figure out” how to model the relevant relationship, be it entailment, paraphrase, or query–document relevance. In fact, there is strong empirical evidence that this is the case, since BERT has been shown to excel at all these tasks.

However, for ranking with learned dense representations, these task differences may very well be important and have concrete implications for model design choices. For text ranking, recall that we are trying to estimate:

P(Relevant = 1 | d, q) ≜ φ(ηq(q), ηd(d))    (64)

Does it make sense to use a single η(·) for both q and d, given the clear differences between queries and texts from the corpus (in terms of length, linguistic well-formedness, etc.)? Perhaps we should instead learn separate ηq(·) and ηd(·) encoders? Specifically, in Figure 22(a), should the two “arms” of the network not share model parameters, or perhaps not even share the same architecture? As we will see, different models make different choices in this respect.

Now, consider reusing much of the same machinery to tackle paraphrase detection, which can also be formulated as an estimation problem:

P(Paraphrase = 1 | s1, s2) ≜ φ(η(s1), η(s2))    (65)

Here, it would make sense to use the same encoder for both input sentences. Does this mean that models for relevance and paraphrase need to be different? Completely different architectures, or the same design but different model parameters? What about entailment, where the relationship is not symmetric? Researchers have grappled with these issues and offer different solutions. However, it remains an open question whether model-specific adaptations are necessary and which design choices are actually consequential.
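To make the mechanical difference concrete, the sketch below contrasts the cross-encoder input template of Eq. (63) with a bi-encoder that encodes each input separately, using the HuggingFace transformers library. This is only a rough illustration: the checkpoint is an off-the-shelf bert-base-uncased model with no fine-tuning, so the resulting scores are not meaningful estimates of relevance, and a separate ηq could be obtained simply by instantiating a second encoder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

query = "who wrote hamlet"
passage = "Hamlet is a tragedy written by William Shakespeare."

# Cross-encoder style: one concatenated sequence, [CLS] s1 [SEP] s2 [SEP] (Eq. 63),
# which would normally feed a fine-tuned relevance classification head.
pair_inputs = tok(query, passage, return_tensors="pt")

# Bi-encoder style: two independent forward passes; here the [CLS] vector serves
# as the fixed-width representation (mean pooling is a common alternative).
with torch.no_grad():
    u = enc(**tok(query, return_tensors="pt")).last_hidden_state[:, 0]
    v = enc(**tok(passage, return_tensors="pt")).last_hidden_state[:, 0]

score = (u * v).sum(dim=-1)  # phi = inner product between the two representations
```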
Estimating the relevance of a piece of text to a query is clearly an integral part of the text ranking problem. However, in the context of dense representations, we have found it useful to conceptualize semantic equivalence, paraphrase, and entailment (and broadly, sentence similarity tasks) as ranking problems also. In certain contexts, this formulation is natural: in a community question answering application, for example, we might wish to find the entry from a corpus of question–answer pairs where the question is the closest paraphrase to the user’s query. Thus, we would need to compute a ranking of questions with respect to the degree of “paraphrase closeness”. However, other applications do not appear to fit a ranking formulation: for example, we might simply wish to determine if two sentences are paraphrases of each other, which certainly doesn’t involve ranking. Operationally, though, these two tasks are addressed in the same manner: we wish to estimate the probability defined in Eq. (65); the only difference is how many pairs we perform the estimation over. In other words, in our problem formulation, ranking is simply probability estimation over a set of candidates followed by sorting by those estimated probabilities. We adopt a ranking conceptualization in this section because it allows us to provide a uniform treatment of these different phenomena. However, note that historically, these ideas developed mostly as separate, independent threads—for example, most research on sentence similarity tasks did not specifically tackle retrieval problems; we present more details about the development of these ideas in Section 5.4.

132 In the case of the Semantic Textual Similarity (STS) tasks, the regression problem can be converted into classification.

5.2 Nearest Neighbor Search

There is one important implementation detail necessary for ranking with dense representations: solving the nearest neighbor search problem. Recall that in the setup of the dense retrieval problem we assume the existence of a corpus of texts C = {di}. Since a system is provided C “in advance”, it is possible to precompute the output of ηd(·) for all di; slightly abusing notation, we refer to these as ηi’s. Although this may be computationally expensive, the task is embarrassingly parallel and can be distributed across an arbitrarily large cluster of machines. The counterpart, ηq(q), must be computed at query time; also slightly abusing notation, we refer to this as ηq. Thus, the ranking problem is to find the top k most similar ηi vectors measured in terms of φ. Similar to search using inverted indexes, this is also a top-k retrieval problem. When φ is defined in terms of inner products or a handful of other simple metrics, this is known as the nearest neighbor search problem.

The simplest solution to the nearest neighbor search problem is to scan all the ηi vectors and compute φ(ηq, ηi) by brute force. The top k ηi’s can be stored in a heap and returned to the user after the scan completes.
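A minimal sketch of this brute-force scan, assuming the ηi’s have been stacked into a single matrix; in practice the same matrix could instead be handed to a nearest neighbor search library such as Faiss, discussed below.

```python
import heapq
import numpy as np

def brute_force_topk(eta_q: np.ndarray, etas: np.ndarray, k: int = 10):
    """Exhaustive top-k search with phi = inner product.

    eta_q has shape (dim,) and etas has shape (n, dim), one row per text in the
    corpus. A heap keeps the k largest (score, index) pairs from the full scan.
    """
    scores = etas @ eta_q  # one inner product per text in the collection
    return heapq.nlargest(k, zip(scores.tolist(), range(len(scores))))
```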
For small collections, this approach is actually quite reasonable, especially with modern hardware that can exploit vectorized processing with SIMD instructions on the CPU [Wang and Lin, 2015] or exploit the parallelism of GPUs for this task. However, this brute-force approach becomes impractical for collections beyond a certain size. Multi-dimensional indexes (e.g., KD-trees) offer solutions to the nearest neighbor search problem, but their standard use case is for geospatial applications, and they typically do not scale to the size (in the number of dimensions) of the representations that our encoders generate.

Modern efficient and scalable solutions to the nearest neighbor search problem are based on approximations, hence approximate nearest neighbor (ANN) search. There are a number of ways this can be formalized: for example, Indyk and Motwani [1998] define the k-nearest neighbor search problem as finding the k closest vectors {η1, η2, . . . , ηk} such that the distance of each ηi to ηq is at most (1 + ε) times the distance from the actual ith nearest point to ηq. This is typically referred to as the approximate nearest neighbor search problem.133 The approximation in this context is acceptable in practical applications because φ does not model the task perfectly to begin with. In search, we are ultimately interested in capturing relevance, and φ is merely a proxy.

The earliest solutions to approximate nearest neighbor search were based on locality-sensitive hashing [Indyk and Motwani, 1998, Gionis et al., 1999, Bawa et al., 2005], but proximity graph methods are generally acknowledged as representing the best approach today. Methods based on hierarchical navigable small world (HNSW) graphs [Malkov and Yashunin, 2020] represent the current state of the art in ANN search based on a popular benchmark.134 A popular open-source library for ANN search is Faiss135 by Facebook [Johnson et al., 2017], which provides implementations of both brute-force scans and HNSW. Many of the techniques discussed in this section use Faiss.

133 Historically, these developments were based on minimizing distance, as opposed to maximizing similarity. We retain the terminology of the original formulation here, but since both similarity and distance are in the range [0, 1], similarity can be defined as one minus distance. This makes maximizing similarity and minimizing distance equivalent.
134 http://ann-benchmarks.com/
135 https://github.com/facebookresearch/faiss

Throughout this section, we assume the use of some library that efficiently solves the (approximate) nearest neighbor search problem for an arbitrarily large collection of dense vectors, in the same way that we assume the existence of efficient, scalable keyword search using inverted indexes (see Section 2.8). There are, of course, numerous algorithmic and engineering details to making such capabilities a reality, but they are beyond the scope of this survey.

5.3 Pre-BERT Text Representations for Ranking

While the ideas behind word embeddings and continuous representations of words go back decades, word2vec [Mikolov et al., 2013b,a] is often regarded as the first successful implementation that heralded the beginning of the neural revolution in natural language processing. Although the paper was primarily about word representations and similarities between words, the authors also attempted to tackle compositionality and phrase representations.
As we have discussed, a ranking problem emerges as soon as we try to build and compare dense representations of text beyond individual words. With word embeddings, word representations are static vectors and similarity comparisons are typically performed via cosine similarity. However, for any unit of text beyond individual words, there are many options for tackling the representation problem and the comparison problem. Researchers grappled with these two challenges long before transformers were invented, and in fact, many recent advances can be characterized as adaptations of old ideas, but with transformers. Thus, it makes sense to survey these pre-BERT techniques.

After the initial successes of word embeddings, the next burst of research activity focused on building sentence representations (and in general, representations of longer segments of text). To be clear, here we are concerned with deriving representations from novel, previously unseen sentences; thus, for example, the paragraph vector representation of Le and Mikolov [2014] is beyond the scope of this discussion since the technique requires training on a corpus to derive representations of the paragraphs contained in it. Since natural language has a hierarchical structure, many researchers adopted a hierarchical approach to composing word representations into sentence representations, for example, recursive neural networks [Socher et al., 2013], and later, Tree-LSTMs [Tai et al., 2015]. Even later (but pre-BERT) models incorporated attention and interaction modeling in complex architectures with many distinct architectural components; examples include He and Lin [2016], Chen et al. [2017b], and Lan and Xu [2018].

As an alternative, Iyyer et al. [2015] proposed Deep Averaging Networks, which disregarded hierarchical structure to compute both sentence- as well as document-level representations by averaging the embeddings of individual words and then passing the results through feedforward layers. The authors demonstrated that, for classification tasks, these simple networks were competitive with, and in some cases outperformed, more sophisticated models while taking far less time to train.

To our knowledge, the first comprehensive evaluation of different aggregation techniques for sentence similarity tasks was the work of Wieting et al. [2016], who examined six different architectures for generating sentence embeddings, ranging from simple averaging of individual word representations (i.e., mean pooling) to an LSTM-based architecture. The authors examined both an in-domain supervised setting, where models were trained with annotated semantic similarity data drawn from the same distribution as the test data, as well as general-purpose, domain-independent embeddings for word sequences, using data from a wide range of other domains. While LSTMs worked well with in-domain data, simple averaging vastly outperformed LSTMs in out-of-domain settings.

Later work examined other simple approaches for aggregating individual word representations into representations of larger segments of text: weighted averages of word embeddings with learned weights [De Boom et al., 2016], weighted averages of word embeddings followed by modification with SVD [Arora et al., 2017], random walks [Ethayarajh, 2018], and different pooling techniques [Shen et al., 2018]. In our framework, these can be viewed as explorations of η.
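As a concrete (if deliberately simplistic) illustration of this family of encoders, the sketch below implements mean pooling of static word embeddings as η; the embeddings argument is a hypothetical {token: vector} lookup, for example loaded from pretrained GloVe or word2vec vectors.

```python
import numpy as np

def mean_pooling_encoder(text, embeddings, dim=300):
    """A bare-bones eta: average the static embeddings of the tokens in a text.

    Tokens missing from the embedding table are skipped; an all-zeros vector is
    returned if nothing matches. Comparison (phi) would typically be cosine
    similarity between the resulting vectors.
    """
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```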
The high-level conclusion seems to be that simple aggregation and comparison methods are robust, fast to compute, and effective, either competitive with or outperforming more complex models.

The references cited above draw mostly from the NLP literature, where researchers are mostly concerned with textual similarity and related tasks. Contemporaneously, IR researchers had been exploring similar ideas for document ranking with various representation-based models (see Section 1.2.4). For example, the Deep Structured Semantic Model (DSSM) [Huang et al., 2013] constructs vector representations of queries and documents using feedforward networks. For ranking, query and document representations are directly compared using cosine similarity. In fact, the models we present in Section 5.4 all adopt this basic design, except that the feedforward networks are replaced with transformers. As another example, the Dual Embedding Space Model (DESM) [Mitra et al., 2016, Nalisnick et al., 2016] computes query–document relevance scores by aggregating cosine similarities across all query–document term pairs.

There are many other instances of learned representations for ranking similar to DSSM in the literature. Henderson et al. [2017] examined the problem of suggesting email responses in Gmail: given a training corpus of (message, response) pairs, encoders using feedforward networks were trained to maximize the inner product between the representations of the training pairs. Similar ideas for end-to-end retrieval with learned representations were later explored by Gillick et al. [2018]. With an expansive scope, Wu et al. [2018a] proposed StarSpace, with the tagline of “embed all the things”, which tried to unify a wide range of tasks (classification, ranking, recommendation, and more) as simple similarity comparisons of learned representations. Zamani et al. [2018] proposed the Standalone Neural Ranking Model (SNRM), which learned sparse query and document representations that could be stored in a standard inverted index for efficient retrieval.

Finally, in addition to explorations of different encoder models, there has also been work on different comparison functions, i.e., φ, beyond simple operations such as inner products. For example, Wang and Jiang [2017] explored the use of different comparison functions in text matching tasks and concluded that some simple formulations based on element-wise operations can work better than neural networks. Another noteworthy innovation is word mover’s distance (WMD), which defines the distance between two texts as the minimum amount of distance that the word representations of one text need to “travel” to reach the corresponding word representations of the other text [Kusner et al., 2015]. This computation implicitly involves “aligning” semantically similar words from the two texts, which differs from the designs discussed above that compare aggregate representations. However, WMD is expensive to compute, and despite follow-up work specifically tackling this issue (e.g., Wu et al. [2018b]), this approach does not appear to have gained widespread adoption for dense retrieval.

5.4 Simple Transformer Bi-encoders for Ranking

In presenting the first class of methods for ranking with learned dense representations—dense retrieval with simple transformer bi-encoders—let us begin with a recap of the problem formulation presented in Section 5.1.
Given an encoder ηq for queries, an encoder ηd for texts from the corpus, and a comparison function φ, dense retrieval involves estimating the following over a corpus C = {di}:

P(Relevant = 1 | di, q) ≜ φ(ηq(q), ηd(di)).    (66)

Based on these estimates of relevance, the ranker returns the top k texts from the corpus. No surprise, transformers form the basis of the encoders ηq and ηd. We refer to this as a “bi-encoder” design, a term introduced by Humeau et al. [2019], and schematically illustrated in Figure 22(a).136 This contrasts with a “cross-encoder”, which is the standard BERT design that benefits from all-to-all attention across tokens in the input sequence, corresponding to Figure 22(c). All the models we discussed in Section 3 can be considered cross-encoders. That is, a bi-encoder takes two inputs and generates two representations via ηq and ηd (which may, in fact, be the same) that can be compared with φ, whereas a cross-encoder takes two inputs concatenated into a single sequence that comprises an input template and generates an estimate of relevance directly. Note that, critically, computing ηd(di) does not depend on queries, i.e., the output of ηq(q), which means that representations of texts from the corpus can be computed “ahead of time” and indexed to facilitate low-latency querying.

In this section, we focus on “simple” bi-encoders, where (1) each query or text from the corpus is represented by a single fixed-width vector, and (2) the similarity comparison function φ is defined as a simple operation such as the inner product. Given these two constraints, retrieval can be cast as a nearest neighbor search problem with computationally efficient off-the-shelf solutions (see Section 5.2). In the next section (Section 5.5), we cover bi-encoders that relax both of these constraints.

We begin by illustrating the basic design of bi-encoders with Sentence-BERT [Reimers and Gurevych, 2019] in Section 5.4.1. Sentence-BERT, however, focused on sentence similarity tasks and did not specifically tackle retrieval problems. In Section 5.4.2, we present DPR [Karpukhin et al., 2020b] and ANCE [Xiong et al., 2021] as exemplary instances of dense retrieval implementations built on the basic bi-encoder design. Additional bi-encoder variants that help us better understand the design space and key research issues are discussed in Section 5.4.3.

Before getting started, however, we present some historical background on the development of dense retrieval techniques in order to recognize precedence and the important contributions of many researchers. Since our overall presentation does not necessarily focus on the earliest known work, we feel it is important to explicitly acknowledge how these ideas evolved.

Transformer-based dense representations for semantic equivalence, paraphrase, entailment, and other sentence similarity tasks can be traced back to the Universal Sentence Encoder (USE) [Cer et al., 2018a,b], which dates to March 2018, even before BERT was introduced! The Universal Sentence Encoder aspired to be just that: to encode “sentences into embedding vectors that specifically target transfer learning to other NLP tasks”.

136 The bi-encoder design is sometimes referred to as a Siamese architecture or “twin towers”; both terms are potentially problematic in that the former is considered by some to be derogatory and the latter evokes negative images of 9/11. The term bi-encoder seems both technically accurate and not associated with negative connotations (that we are aware of).
USE was trained in an unsupervised manner using data from a variety of web sources, including Wikipedia, web news, web question-answer pages, and discussion forums, and augmented with supervised data from the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015]. The goal of USE and much follow-up work was to compute embeddings of segments of text (sentences, paragraphs, etc.) for similarity comparisons.

Work on BERT-based dense representations for similarity comparisons emerged in 2019 from a few sources. To our knowledge, the earliest paper is by Humeau et al. [2019], dating from April 2019; we use the bi-encoder vs. cross-encoder terminology that they introduced. Although the work examined retrieval tasks, the setup was limited in scope (see Section 5.5.1 for more details). Several roughly contemporaneous papers appeared shortly thereafter. Sentence-BERT [Reimers and Gurevych, 2019] applied the bi-encoder design to a number of sentence similarity tasks. At the same time, Barkan et al. [2020] investigated how well a BERT-based cross-encoder could be distilled into a BERT-based bi-encoder, also in the context of sentence similarity tasks.137 However, neither Reimers and Gurevych [2019] nor Barkan et al. [2020] explicitly examined retrieval tasks.

In terms of explicitly applying transformer-based bi-encoders to retrieval tasks, we believe precedence goes to Lee et al. [2019b].138 However, instead of direct retrieval supervision using labeled data, they elected to focus on pretraining using weak supervision techniques derived from the Inverse Cloze Task (ICT) [Taylor, 1953]. Related work by Guu et al. [2020] folded dense retrieval directly into the pretraining regime. As later demonstrated by Karpukhin et al. [2020b] on some of the same question answering benchmarks, these approaches did not appear to be as effective as direct retrieval supervision: Lee et al. [2019b] reported uneven gains over previous approaches based on BM25 + BERT such as BERTserini [Yang et al., 2019c], and the techniques proposed by Guu et al. [2020] appeared to be more complex, more computationally expensive, and less effective. However, as explained by Kenton Lee (based on personal communication), these two papers aimed to tackle a different problem, a setup where annotated data for direct supervision was unavailable, and thus required different solutions. For this reason, it might not be fair to compare these techniques only in terms of effectiveness.

Shortly thereafter, Yang et al. [2019a] proposed PairwiseBERT, which applied bi-encoders to align cross-lingual entities in knowledge graphs by comparing textual descriptions of those entities; this was formulated as a cross-lingual ranking problem. Also contemporaneous was the “two-tower retrieval model” of Chang et al. [2020], which focused on different weakly supervised pretraining tasks, like Lee et al. [2019b].139

137 The first arXiv submission of Humeau et al. [2019] unambiguously pre-dated Sentence-BERT, as the latter cites the former. However, Humeau et al.’s original arXiv paper did not appear in a peer-reviewed venue until April 2020, at ICLR [Humeau et al., 2020]. The arXiv versions of Reimers and Gurevych [2019] and Barkan et al. [2020] appeared within two weeks of each other in August 2019.
138 As an interesting historical side note, similar ideas (but not using transformers) date back at least a decade [Yih et al., 2011], and arguably even further back in the context of supervised dimensionality reduction techniques [Yu et al., 2006]. What’s even more remarkable is that some of the co-authors of Yih et al. [2011] are also co-authors on recent dense retrieval papers, which suggests that these ideas had been “brewing” for many years and, with pretrained transformers, the “technical machinery” finally “caught up” to enable the successful execution of much older ideas and insights. See additional discussion in Section 6, where we wonder if everything’s a remix.
139 Reimers and Gurevych [2019] and Yang et al. [2019a] both appeared at the same conference (EMNLP 2019, in November). Lee et al. [2019b] appeared a few months earlier, at ACL in July 2019. Chang et al. [2020] was submitted for review at ICLR 2020 in September 2019.
The next major development was a parade of dense retrieval papers in rapid succession in 2020: TwinBERT [Lu et al., 2020] in February; CLEAR [Gao et al., 2020d], DPR [Karpukhin et al., 2020a], and MatchBERT [Yang et al., 2020c] in April; RepBERT [Zhan et al., 2020c] in June; and ANCE [Xiong et al., 2020b] in July. By around mid-2020, the promise and potential of dense retrieval had been firmly established in the literature.

5.4.1 Basic Bi-encoder Design: Sentence-BERT

We present a more detailed description of Sentence-BERT [Reimers and Gurevych, 2019] as the canonical example of a bi-encoder design for generating semantically meaningful sentence embeddings to be used in large-scale textual similarity comparisons (see Section 5.1). The overall architecture is shown in Figure 23, redrawn from the authors’ paper.

[Figure 23: left, the training architecture, with a softmax classifier over (u, v, |u − v|); right, the inference architecture, computing cosine similarity between u and v; in both, each “arm” applies BERT followed by pooling to Sentence A or Sentence B.]
Figure 23: The architecture of Sentence-BERT, redrawn from Reimers and Gurevych [2019]. The training architecture for the classification objective is shown on the left. The architecture for inference, to compute similarity scores, is shown on the right.

The diagram on the left shows how Sentence-BERT is trained: each “arm” of the network corresponds to η(·) in our terminology, which is responsible for producing a fixed-width vector for the inputs (sentences in this case). Reimers and Gurevych [2019] experimented with both BERT and RoBERTa as the basis of the encoder and proposed three options for generating the representation vectors:

• Take the representation of the [CLS] token.
• Mean pooling across all contextual output representations.
• Max pooling across all contextual output representations.

The first option is obvious, while the other two draw from previous techniques discussed in Section 5.3. The result is η(Sentence A) = u and η(Sentence B) = v, providing the solution to the representation problem discussed in Section 5.1. Each “arm” of the bi-encoder uses the same model since the target task is textual similarity, which is a symmetric relationship.

Depending on the specific task formulation, the entire architecture is trained end-to-end as follows:

• For classification tasks, the representation vectors u, v, and their element-wise absolute difference |u − v| are concatenated and fed to a softmax classifier:

o = softmax(Wt · [u ⊕ v ⊕ |u − v|])    (67)

where ⊕ denotes vector concatenation and Wt represents the trainable weights; standard cross-entropy loss is used.
• For regression tasks, the mean squared error between the ground truth and the cosine similarity of the two sentence embeddings u and v is used as the loss.

Reimers and Gurevych [2019] additionally proposed a triplet loss structure, which we do not cover here because it was only applied to one of the evaluation datasets.

At inference time, the trained encoder η is applied to both sentences, producing sentence vectors u and v. The cosine similarity between these two vectors is directly interpreted as a similarity score; this is shown in Figure 23, right. That is, in our terminology, φ(u, v) = cos(u, v). This provides the answer to the comparison problem discussed in Section 5.1.

Sentence-BERT was evaluated in three different ways for textual similarity tasks:

• Untrained. BERT (or RoBERTa) can be directly applied “out of the box” for semantic similarity computation.
• Fine-tuned on out-of-domain datasets. Sentence-BERT was fine-tuned on a combination of the SNLI and Multi-Genre NLI datasets [Bowman et al., 2015, Williams et al., 2018]. The trained model was then evaluated on the Semantic Textual Similarity (STS) benchmark [Cer et al., 2017].
• Fine-tuned on in-domain datasets. Sentence-BERT was first fine-tuned on the SNLI and Multi-Genre NLI datasets (per above), then further fine-tuned on the training set of the STS benchmark before evaluation on its test set. This is similar to the multi-step fine-tuning approaches discussed in Section 3.2.4.

Below, we present a few highlights summarizing the experimental results, but refer readers to the authors’ original paper for details. Sentence-BERT was primarily evaluated on sentence similarity tasks, not actual retrieval tasks, and since we do not present results on these tasks elsewhere in this survey, reporting evaluation figures here would be of limited use without points of comparison. Nevertheless, there are a number of interesting findings worth discussing:

• Without any fine-tuning, average pooling of BERT’s contextual representations appears to be worse than average pooling of static GloVe embeddings, based on standard metrics for semantic similarity datasets. Using the [CLS] token was even worse than average pooling, suggesting that it is unable to serve as a good representation “out of the box” (that is, without fine-tuning on task-specific data).
• Not surprisingly, out-of-domain fine-tuning leads to large gains on the STS benchmark over the untrained condition. Also as expected, further in-domain fine-tuning provides an additional boost in effectiveness, consistent with the multi-step fine-tuning approaches discussed in Section 3.2.4. In this setting, although the bi-encoder remained consistently worse than the cross-encoder, in some cases the differences were relatively modest.
• Ablation studies showed that with fine-tuning, average pooling was the most effective design for η, slightly better than max pooling or using the [CLS] token. Although the effectiveness of the [CLS] token was quite low “out of the box” (see above), after fine-tuning, it was only slightly worse than average pooling.
• For classification tasks, an interesting finding is the necessity of including |u − v| in the input to the softmax classifier (see above).
If the input to the softmax omits |u − v|, effectiveness drops substantially.

Closely related to Sentence-BERT, the contemporaneous work of Barkan et al. [2020] investigated how well a BERT-based cross-encoder can be distilled into a BERT-based bi-encoder for sentence similarity tasks. To do so, the authors trained a BERTLarge cross-encoder to perform a specific task and then distilled the model into a BERTLarge bi-encoder that produces a dense representation of its input by average pooling the outputs of its final four transformer layers. The experimental results were consistent with the general findings of Sentence-BERT: after distillation for a specific task, the bi-encoder student performs competitively but remains consistently less effective than the cross-encoder teacher. However, as expected, the bi-encoder is significantly more efficient.

Takeaway Lessons. Sentence-BERT provides a good overview of the basic design of bi-encoders, but its focus was on textual similarity and not ranking. For a range of sentence similarity tasks, the empirical results are clear: a bi-encoder design is less effective than a comparable cross-encoder design, but far more efficient since similarity comparisons can be captured in simple vector operations. However, we need to look elsewhere for empirical validation of dense retrieval techniques.

5.4.2 Bi-encoders for Dense Retrieval: DPR and ANCE

With the stage set by Sentence-BERT [Reimers and Gurevych, 2019], we can proceed to discuss transformer-based bi-encoders specifically designed for dense retrieval. In this section, we present the dense passage retriever (DPR) of Karpukhin et al. [2020b] and the approximate nearest neighbor negative contrastive estimation (ANCE) technique of Xiong et al. [2021]. Interestingly, while these two techniques emerged separately from the NLP community (DPR) and the IR community (ANCE), we are seeing the “coming together” of both communities to tackle dense retrieval. While neither DPR nor ANCE represents the earliest example of dense retrieval, considering a combination of clarity, simplicity, and technical innovation, they are, in our opinion, exemplary instances of dense retrieval techniques based on simple bi-encoders and thus suitable for pedagogical presentation.

In terms of technical contributions, both techniques grappled successfully with a key question in bi-encoder design: How do we select negative examples during training? Recall that our goal is to maximize the similarity between queries and relevant texts and minimize the similarity between queries and non-relevant texts. Relevant texts, of course, come from human relevance judgments, usually as part of a test collection. But where do the non-relevant texts come from? DPR’s in-batch negative sampling provides a simple yet effective baseline, and ANCE demonstrates the benefits of selecting “hard” negative examples, where “hard” is operationalized in terms of the encoder itself (i.e., non-relevant texts that are similar to the query representation).

The dense passage retriever (DPR) of Karpukhin et al. [2020b], originally presented in April 2020 [Karpukhin et al., 2020a], adopts a standard “retriever–reader” architecture for question answering [Chen et al., 2017a]. In this design, a passage retriever selects candidate texts from a corpus, which are then passed to a reader to identify the exact answer spans.
This architecture, of course, represents an instance of multi-stage ranking, which, as we discussed extensively in Section 3.4, has a long history dating back decades. Here, we focus only on the retriever, which adopts a bi-encoder design for dense retrieval. DPR uses separate encoders for the query and for texts from the corpus, which in our notation correspond to ηq and ηd, respectively; both encoders take the [CLS] representation from BERTBase as their output representation. DPR was specifically designed for passage retrieval, so ηd takes relatively small spans of text as input (the authors used 100-word segments of text in their experiments).

In DPR, relevance between the query representation and the representations of texts from the corpus, i.e., the comparison function φ, is defined in terms of inner products:

φ(ηq(q), ηd(di)) = ηq(q)⊺ ηd(di)    (68)

The model is trained as follows: let D = {⟨qi, d+i, d−i,1, d−i,2, . . . , d−i,n⟩}, i = 1, . . . , m, be the training set comprising m instances. Each instance contains a question q, a positive passage d+ that contains the answer to q, and n negative passages d−1, d−2, . . . , d−n. DPR is trained with the following loss function:

L(q, d+, d−1, d−2, . . . , d−n) = − log [ exp(φ(ηq(q), ηd(d+))) / ( exp(φ(ηq(q), ηd(d+))) + Σj=1..n exp(φ(ηq(q), ηd(d−j))) ) ].    (69)

The final important design decision in training DPR—and in general, a critical component of any dense retrieval technique—lies in the selection of negative examples. If our goal is to train a model that maximizes the similarity between queries and relevant texts while at the same time minimizing the similarity between queries and non-relevant texts (with respect to the comparison function φ), then we need to define the composition of the non-relevant texts more precisely. Karpukhin et al. [2020b] experimented with three different approaches: (1) random, selecting random passages from the corpus; (2) BM25, selecting passages returned by BM25 that don’t contain the answer; and (3) in-batch negative sampling, or selecting passages from other examples in the same training batch together with a mix of passages retrieved by BM25. Approach (2) can be viewed as selecting “difficult” negatives using BM25, since the negative samples are passages that score highly according to BM25 (i.e., contain terms from the question) but nevertheless do not contain the answer. With approach (3), the idea of training with in-batch negatives can be traced back to at least Henderson et al. [2017], who also applied the technique to train a bi-encoder for retrieval, albeit with simple feedforward networks over n-grams instead of transformers. Empirically, approach (3) proved to be the most effective, and it is efficient as well since the negative examples are already present in the batch during training. Furthermore, effectiveness increases as the batch size grows, and thus the quality of the encoders improves as we are able to devote more computational resources during training. We refer interested readers to the original paper for details regarding the exact experimental settings and the results of contrastive experiments that examine the impact of different negative sampling approaches.

DPR was evaluated on a number of standard question answering datasets in the so-called “open-domain” (i.e., retrieval-based) setting, where the task is to extract answers from a large corpus of documents—in this case, a snapshot of English Wikipedia.
Following standard experimental settings, passages were constructed from Wikipedia articles by taking 100-word segments of text; these formed the units of retrieval and served as inputs to ηd. The five QA datasets used were Natural Questions [Kwiatkowski et al., 2019], TriviaQA [Joshi et al., 2017], WebQuestions [Berant et al., 2013], CuratedTREC [Baudiš and Šedivý, 2015], and SQuAD [Rajpurkar et al., 2016]. Here, we are only concerned with retrieval effectiveness, as opposed to end-to-end QA effectiveness. The commonly accepted metric for this task is top-k accuracy, k ∈ {20, 100}, which measures the fraction of questions for which the retriever returns at least one passage containing a correct answer. This is akin to measuring recall in a multi-stage ranking architecture (see Section 3.4): in a pipeline design, these metrics quantify the upper-bound effectiveness of downstream components. In the case of question answering, if the retriever doesn’t return candidate texts containing answers, there’s no way for a downstream reader to recover. Note that in the NLP community, metrics are often reported in “points”, i.e., values are multiplied by 100, so 0.629 is shown as 62.9.

Instead of directly reporting results from Karpukhin et al. [2020b], we share results from Ma et al. [2021c], which is a replication study of the original paper. Ma et al. were able to successfully replicate the dense retrieval results and obtained scores that were very close to those in the original paper (in most cases, within a tenth of a point). However, their experiments led to a substantive contrary finding: according to the original paper, there is little to be gained from a hybrid technique combining DPR (dense) with BM25 (sparse) results via linear combination. In some cases, DPR alone was more effective than combining DPR with BM25, and even if the hybrid achieved a higher score, the improvements were marginal at best. The experiments of Ma et al., however, reported higher BM25 scores than the original paper.140 This, in turn, led to higher effectiveness for the hybrid technique, and thus Ma et al. concluded that DPR + BM25 was more effective than DPR alone. In other words, dense–sparse hybrids appear to offer benefits over dense retrieval alone.

Table 38 shows the DPR replication results, copied from Ma et al. [2021c]. The authors applied paired t-tests to determine the statistical significance of the differences (p < 0.01), with the Bonferroni correction as appropriate. The symbol † on a BM25 result indicates that the effectiveness difference vs. DPR is significant; the symbol ‡ indicates that the hybrid technique is significantly better than BM25 (for SQuAD) or DPR (for all remaining collections). We see that in four of the five datasets, dense retrieval alone (DPR) is more effective than sparse retrieval (BM25); in these cases, the differences are statistically significant for both top-20 and top-100 accuracy.141 Ma et al. [2021c] experimented with two different approaches for combining DPR and BM25 scores; as there were no significant differences between the two, we report the technique they called Hybridnorm (see their paper for details). According to their results, in most cases the dense–sparse hybrid was more effective than BM25 (for SQuAD) or DPR (for all remaining collections). The improvements were statistically significant in nearly all cases.
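The top-k accuracy figures in Table 38 (reproduced below) are simple to compute; the following sketch shows one way to do so, assuming answer matching by case-insensitive substring containment, which is only an approximation of the exact answer-matching rules used in these evaluations.

```python
def top_k_accuracy(retrieved, answers, k):
    """Fraction of questions with at least one correct answer in the top k.

    retrieved[i] is the ranked list of passage texts returned for question i;
    answers[i] is the set of acceptable answer strings for question i.
    """
    hits = 0
    for passages, golds in zip(retrieved, answers):
        if any(any(g.lower() in p.lower() for g in golds) for p in passages[:k]):
            hits += 1
    return hits / len(retrieved)
```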
Collection / Method     Top-20   Top-100
NaturalQuestions
  (1a) DPR               79.5     86.1
  (1b) BM25              62.9†    78.3†
  (1c) Hybridnorm        82.6‡    88.6‡
TriviaQA
  (2a) DPR               78.9     84.8
  (2b) BM25              76.4†    83.2†
  (2c) Hybridnorm        82.6‡    86.5‡
WebQuestions
  (3a) DPR               75.0     83.0
  (3b) BM25              62.4†    75.5†
  (3c) Hybridnorm        77.1‡    84.4‡
CuratedTREC
  (4a) DPR               88.8     93.4
  (4b) BM25              80.7†    89.9†
  (4c) Hybridnorm        90.1     95.0‡
SQuAD
  (5a) DPR               52.0     67.7
  (5b) BM25              71.1†    81.8†
  (5c) Hybridnorm        75.1‡    84.4‡

Table 38: The effectiveness of DPR (dense retrieval), BM25 (sparse retrieval), and dense–sparse hybrid retrieval on five common QA datasets. The symbol † on a BM25 result indicates effectiveness that is significantly different from DPR. The symbol ‡ indicates that the hybrid technique is significantly better than BM25 (for SQuAD) or DPR (for all remaining collections).

140 This finding has been confirmed by the original authors (personal communication).
141 The exception appears to be SQuAD, where BM25 effectiveness is higher, likely due to two reasons: First, the dataset was created from only a few hundred Wikipedia articles, and thus the distribution of the training examples is highly biased. Second, questions were created by human annotators based on the articles, thus leading to question formulations with high lexical overlap, giving an unnatural and unfair advantage to an exact match technique like BM25.

Building on the basic bi-encoder design, Xiong et al. [2021] made the observation that non-relevant texts ranked highly by an exact match method such as BM25 are likely to be different from non-relevant texts ranked highly by a BERT-based bi-encoder. Thus, selecting negative examples from BM25 results may not be the best strategy. Instead, to train more effective bi-encoder models, the authors proposed using approximate nearest neighbor (ANN) techniques to identify negative examples that are ranked highly by the bi-encoder model being trained. Xiong et al. [2021] argued that their approach, called ANCE for “Approximate nearest neighbor Negative Contrastive Estimation”, is theoretically more effective than both sampling BM25 results, which biases the model to mimic sparse retrieval, and in-batch negative sampling, which yields uninformative negative examples.

ANCE adopts a basic bi-encoder design just like DPR. It takes the [CLS] representation from RoBERTaBase as the encoder η, and (unlike DPR) uses a single encoder for both the query and the document (i.e., ηq = ηd). During training, hard negative examples are selected via ANN search on an index over the representations generated by the encoder being trained. Instead of maintaining a fully up-to-date index, which is computationally impractical, the ANN index is updated asynchronously: every m batches, the entire corpus is re-encoded with η and the ANN index is rebuilt. This is still computationally expensive, but workable in practice. The training process begins with a “BM25 warm up” in which the model is first trained with BM25 negatives. The index refresh rate (together with the learning rate) can be viewed as hyperparameters for trading off effectiveness and training efficiency, but the authors noted that a poor setting makes training unstable. Given positive training examples, i.e., (query, relevant passage) pairs from the MS MARCO passage ranking test collection, and negative training examples (from ANN search), the ANCE bi-encoder is trained with a negative log likelihood loss.
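The in-batch negative sampling strategy that both DPR and ANCE build on can be expressed compactly as a cross-entropy loss over a batch-level similarity matrix; the PyTorch sketch below is a minimal illustration of this idea (a batched form of Eq. (69)), not the authors' actual training code. Mined hard negatives (from BM25 or from ANN search over the current index) would simply be encoded and appended as additional candidate columns.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    q_vecs: (B, dim) query representations eta_q(q_i).
    p_vecs: (B, dim) representations of each query's positive passage eta_d(d_i+).
    For query i, passage i is the positive and the other B - 1 passages in the
    batch act as negatives; cross-entropy over the inner-product matrix is the
    negative log likelihood of the positive passage, as in Eq. (69).
    """
    scores = q_vecs @ p_vecs.t()                               # (B, B) similarity matrix
    labels = torch.arange(q_vecs.size(0), device=q_vecs.device)
    return F.cross_entropy(scores, labels)
```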
Results on the development set of the MS MARCO passage ranking task and the TREC 2019 Deep Learning Track passage ranking task are presented in Table 39, copied from Xiong et al. [2021]. To provide a basis for comparison for the MS MARCO passage ranking task, we include effectiveness results from a standard cross-encoder design, i.e., BM25 (k = 1000) + monoBERT, taken from Nogueira and Cho [2019], shown in rows (1b) and (1c) for different BERT model sizes. The effectiveness of the corresponding first-stage retrieval using Microsoft’s BM25 implementation (prior to monoBERT reranking) is shown in row (1a). These are exactly the same figures reported in Table 5 in Section 3.2.1. Since ANCE uses RoBERTaBase, BM25 + monoBERTBase, row (1c), is the more appropriate reference condition.142 For the TREC 2019 Deep Learning Track passage ranking task, in row (2b) we report results from run p_bert submitted by the team h2oloo, which also represents BM25 (k = 1000) + monoBERT [Akkalyoncu Yilmaz et al., 2019a]; the corresponding first-stage retrieval with BM25 is reported in row (2a). Row (3a) presents the effectiveness of the full ANCE model.

142 Note that while Nogueira et al. [2019a] reported a slightly higher monoBERT effectiveness due to better first-stage retrieval, they only presented results for BERTLarge and not BERTBase.

                                              MS MARCO Passage (Dev)    TREC 2019 DL Passage
Method                                        MRR@10     Recall@1k      nDCG@10
(1a) BM25 (Microsoft Baseline)                0.167      -              -
(1b) BM25 + monoBERTLarge                     0.365      -              -
(1c) BM25 + monoBERTBase                      0.347      -              -
(2a) TREC 2019 run: baseline/bm25base_p       -          -              0.506
(2b) TREC 2019 run: h2oloo/p_bert             -          -              0.738
(3a) ANCE                                     0.330      0.959          0.648
(3b) DR w/ in-batch                           0.261      0.949          0.552
(3c) DR w/ BM25                               0.299      0.928          0.591
(3d) DR w/ in-batch + BM25 (≈ DPR)            0.311      0.952          0.600

Table 39: The effectiveness of ANCE and cross-encoder baselines on the development set of the MS MARCO passage ranking test collection and the TREC 2019 Deep Learning Track passage ranking test collection.

                                                  MS MARCO Doc (Dev)        TREC 2019 DL Doc
Method                                            MRR@100    Recall@1k      nDCG@10
(1a) ANCE (MaxP) + BERT Base MaxP                 0.432      -              -
(2a) TREC 2019 run: baseline/bm25base             -          -              0.519
(2b) TREC 2019 run: h2oloo/bm25_marcomb           -          -              0.640
(3a) ANCE (FirstP)                                0.334      -              0.615
(3b) ANCE (MaxP)                                  0.384      -              0.628
(3c) DR (FirstP) w/ in-batch                      -          -              0.543
(3d) DR (FirstP) w/ BM25                          -          -              0.529
(3e) DR (FirstP) w/ in-batch + BM25 (≈ DPR)       -          -              0.557

Table 40: The effectiveness of ANCE and cross-encoder baselines on the development set of the MS MARCO document ranking test collection and the TREC 2019 Deep Learning Track document ranking test collection.

It is clear from Table 39 that a bi-encoder design is not as effective as a cross-encoder design (i.e., reranking first-stage BM25 results with monoBERT). The differences between the comparable conditions in row groups (1) and (2) vs. row (3a) quantify the importance of attention between query and passage terms, as these interactions are eliminated in the bi-encoder design and reduced to an inner product (note, though, that bi-encoders preserve self-attention among the terms within the query and among the terms within each passage). This, alas, is the cost of direct ranking with learned dense representations. Closing the effectiveness gap between cross-encoders and bi-encoders is the goal of much subsequent work and research activity to this day.

Rows (3b) to (3d) in Table 39 represent ablations of the complete ANCE model.
Dense retrieval (DR) “w/ in-batch”, row (3b), uses in-batch negative sampling but otherwise adopts the ANCE bi-encoder design. Dense retrieval (DR) “w/ BM25”, row (3c), uses BM25 results as negative examples, and combining both “in-batch” and “BM25” yields the DPR design, row (3d). Not surprisingly, the techniques presented in rows (3b) and (3c) are less effective than ANCE, and ANCE appears to be more effective than the DPR training scheme, row (3d). For detailed hyperparameter and other configuration settings, we advise the reader to consult Xiong et al. [2021] directly.

In addition to passage retrieval, ANCE was also evaluated on document retrieval. Results on the MS MARCO document ranking task and the TREC 2019 Deep Learning Track document ranking task are presented in Table 40. Extending ANCE from passage to document retrieval necessitated one important change to cope with the inability of transformers to process long input sequences (which we discussed at length in Section 3.3). Here, Xiong et al. [2021] adopted the approaches of Dai and Callan [2019b] (see Section 3.3.2): FirstP, where the encoder only takes the first 512 tokens of the document, and MaxP, where each document is split into 512-token passages (maximum 4) and the highest passage similarity is used for ranking (these settings differ from Dai and Callan [2019b]). These two configurations are shown in row (3a) and row (3b), respectively. In the table, the results on the TREC 2019 Deep Learning Track document ranking task are copied from Xiong et al. [2021], but the paper did not report results on the MS MARCO document ranking task; instead, those figures are copied from the official leaderboard. In Table 40, rows (3c)–(3e) denote the same ablation conditions as in Table 39, with FirstP.143

In this case, unfortunately, comparable cross-encoder conditions are a bit harder to come by. For the MS MARCO document ranking task, note that the original MaxP work of Dai and Callan [2019b] predated the task itself. The closest condition we could find is reported in row (1a), which uses ANCE (MaxP) itself for first-stage retrieval, followed by reranking with a BERT cross-encoder.144 For the TREC 2019 Deep Learning Track document ranking task, the closest comparable condition we could find is run bm25_marcomb by team h2oloo, shown in row (2b), which represents BM25 (k = 1000) reranked by Birch, reported in Akkalyoncu Yilmaz et al. [2019a]. This run combines evidence from the top three sentences but is trained on MS MARCO passage data, thus muddling the comparison. The corresponding BM25 first-stage retrieval results are shown in row (2a).

While the contrastive comparisons are not perfect, these document ranking results are consistent with the passage ranking results. Dense retrieval with bi-encoders does not appear to be as effective as reranking sparse retrieval results with cross-encoders, and the full ANCE model is more effective than the ablation conditions, i.e., rows (3c)–(3e). Also consistent with Dai and Callan [2019b], MaxP is more effective than FirstP.
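The FirstP and MaxP strategies just described can be sketched as follows; encode_passage is a hypothetical passage-level encoder standing in for ηd, and the 512-token and four-passage limits mirror the settings described above.

```python
import numpy as np

def firstp_score(q_vec, doc_tokens, encode_passage, max_len=512):
    """FirstP: encode only the first max_len tokens of the document."""
    return float(np.dot(q_vec, encode_passage(doc_tokens[:max_len])))

def maxp_score(q_vec, doc_tokens, encode_passage, max_len=512, max_passages=4):
    """MaxP: split the document into max_len-token passages (at most max_passages)
    and take the highest passage similarity as the document score."""
    passages = [doc_tokens[i:i + max_len] for i in range(0, len(doc_tokens), max_len)]
    return max(float(np.dot(q_vec, encode_passage(p))) for p in passages[:max_passages])
```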
One key ingredient in making ANCE "work" is the synchronous ANN index update to supply informative negative samples. Xiong et al. [2021] reported that for the MS MARCO document collection, index refresh takes approximately 10 hours on a multi-GPU server. This quantifies the additional computational costs of ANCE compared to a simpler technique such as in-batch negative sampling. Indeed, there doesn't appear to be a "free lunch": the reported effectiveness gains of ANCE come at the cost of slower training due to the expensive index refreshes.

In addition to evaluation on the MS MARCO datasets, Xiong et al. [2021] also evaluated ANCE on some of the same datasets used in the DPR experiments, NaturalQuestions and TriviaQA. As the authors directly compared ANCE with figures reported in Karpukhin et al. [2020b], we copy those evaluation results directly into Table 41, in rows (1) and (3). For reference, we also share the comparable conditions from the replication study of Ma et al. [2021c]. These experiments provide a fair head-to-head comparison between ANCE and DPR. Focusing only on DPR, rows (1a) and (2a), and comparing against ANCE, row (3), the results confirm that ANCE is indeed more effective than DPR, although the differences are smaller for top-100 than for top-20. Nevertheless, the gaps between ANCE and DPR appear to be smaller than the "≈ DPR" ablation conditions in Tables 39 and 40 would suggest. However, a hybrid combination of DPR and BM25 results, as reported by Ma et al. [2021c], appears to beat ANCE alone. Although Xiong et al. [2021] did not report any dense–sparse hybrid results, we would expect BM25 to improve ANCE as well if the results were combined.

                                 NaturalQuestions          TriviaQA
Method                           Top-20     Top-100        Top-20     Top-100
from Karpukhin et al. [2020b]
(1a) DPR                         79.4       86.0           78.8       84.7
(1b) BM25                        59.1       73.7           66.9       76.7
(1c) Hybrid                      78.0       83.9           79.9       84.4
from Ma et al. [2021c]
(2a) DPR                         79.5       86.1           78.9       84.8
(2b) BM25                        62.9       78.3           76.4       83.2
(2c) Hybridnorm                  82.6       88.6           82.6       86.5
(3) ANCE                         82.1       87.9           80.3       85.2

Table 41: The effectiveness of ANCE and DPR on two QA datasets.

Finally, Xiong et al. [2021] studied the effectiveness of ANCE as first-stage retrieval in a production commercial search engine.145 Changing the training scheme of the dense retrieval model over to ANCE yielded offline gains of 16% on a corpus of 8 billion documents using 64-dimensional representations with approximate nearest neighbor search. The authors were rather vague about the exact experimental settings, but it does appear that ANCE yields demonstrable gains in "real world" retrieval scenarios.

145 Since the authors reported Microsoft affiliations, presumably this refers to Bing.

Takeaway Lessons. Building on Sentence-BERT, we presented DPR and ANCE as two canonical examples of a bi-encoder design specifically applied to dense retrieval. DPR presents a simple yet effective approach to training encoders with in-batch negative sampling, and ANCE further demonstrates the benefits of picking "difficult" negative examples. Together, they provide a good exploration of one key issue in the design of dense retrieval techniques: how do we select negative examples so that the learned encoders, under the comparison function φ, maximize the similarity between queries and relevant documents and minimize the similarity between queries and non-relevant documents? In terms of the "bottom line", empirical results from DPR and ANCE suggest that while bi-encoders for dense retrieval based on simple inner-product comparisons are not as effective as cross-encoders, they are generally more effective than sparse retrieval (e.g., BM25). Since in a bi-encoder we lose attention-based interactions between queries and texts from the corpus, this effectiveness degradation is to be expected.
However, the benefit of bi-encoders is the ability to perform ranking directly on precomputed representations of texts from the corpus, in contrast to a retrieve-and-rerank architecture with cross-encoders. Finally, there appear to be synergies between dense and sparse retrieval, as combining evidence in dense–sparse hybrids usually leads to higher effectiveness than dense retrieval (or sparse retrieval) alone.

5.4.3 Bi-encoders for Dense Retrieval: Additional Variations

Roughly contemporaneously with DPR and ANCE, there was a flurry of activity exploring bi-encoders for dense retrieval during the spring and summer of 2020. In this section, we discuss some of these model variants. We emphasize that it is not our intention to exhaustively survey every proposed model, but rather to focus on variations that help us better understand the impact of different design choices. The MS MARCO passage ranking task provides a common point of comparison: results are summarized in Table 42, with figures copied from the original papers. For convenience, we repeat the BM25, monoBERT, and ANCE conditions from Table 39.

CLEAR, short for "Complementing Lexical Retrieval with Semantic Residual Embedding" [Gao et al., 2021c], was first proposed in March 2020 [Gao et al., 2020d] and can be described as a jointly-trained sparse–dense hybrid. Unlike DPR, where the dense retrieval component was trained in isolation and then combined with sparse retrieval results (BM25) using linear combination, the intuition behind CLEAR is to exploit a bi-encoder to capture semantic matching absent in the lexical model (BM25), instead of having the dense retrieval model "relearn" aspects of lexical matching. Thus, "residual" in CLEAR refers to the goal of using the bi-encoder to "fix" what BM25 gets wrong. Like ANCE but unlike DPR, CLEAR uses the same encoder (i.e., η) for both queries and texts from the corpus. However, before the usual [CLS] token, another special token, either <QRY> or <DOC>, is prepended to indicate the query or document, respectively. The final vector representation is produced by average pooling the output contextual representations. The dense retrieval score (i.e., the φ function) is computed as the inner product between encoder outputs. As CLEAR is a sparse–dense hybrid, the final relevance score is computed by a linear combination of the lexical retrieval score (produced by BM25) and the dense retrieval score.
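The score combination just described can be sketched as follows. This is a minimal illustration of linear dense–sparse fusion as it recurs throughout this section (DPR, CLEAR, and others); the interpolation weight lam and the handling of documents missing from one run are illustrative assumptions, not any particular paper's exact parameterization.

```python
def hybrid_ranking(dense_hits, sparse_hits, lam=1.0, k=10):
    """Fuse a dense run and a sparse (BM25) run by linear score combination.
    Each run is a dict mapping doc_id -> score. A document absent from one run
    contributes 0 from that run -- a simplifying assumption; real systems may
    instead re-score the union of candidates with both models."""
    doc_ids = set(dense_hits) | set(sparse_hits)
    fused = {d: lam * sparse_hits.get(d, 0.0) + dense_hits.get(d, 0.0) for d in doc_ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example: two tiny runs over the same corpus (scores are made up).
dense = {"d1": 71.2, "d2": 69.8, "d4": 65.0}
sparse = {"d1": 12.4, "d3": 11.9, "d4": 7.2}
print(hybrid_ranking(dense, sparse, lam=1.0, k=3))
```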
CLEAR is trained using a pairwise hinge loss to maximize the similarity between a given query q and a relevant document d+ while minimizing the similarity between the query and a non-relevant document d−, subject to a minimum margin:

L(q, d+, d−) = max(0, m − s(q, d+) + s(q, d−))    (70)

However, instead of using a fixed margin (e.g., setting m = 1 for all training triples), m is dynamically computed based on the BM25 scores of the relevant and non-relevant documents, along with two parameters, c and λ:

m(q, d+, d−) = c − λ · (BM25(q, d+) − BM25(q, d−))    (71)

This is where the notion of "Semantic Residual Embedding" in CLEAR is operationalized. Because little loss is incurred when BM25 is able to accurately identify the relevant document, the dense retrieval model is steered to focus on cases where lexical matching fails. During training, negative examples are selected from the non-relevant texts retrieved by BM25.

                                                MS MARCO Passage (Dev)
Method                                          MRR@10      Recall@1k
(1a) BM25 (Microsoft Baseline)                  0.167       -
(1b) BM25 + monoBERTLarge                       0.365       -
(1c) BM25 + monoBERTBase                        0.347       -
(2a) ANCE                                       0.330       0.959
(2b) DR w/ in-batch                             0.261       0.949
(2c) DR w/ BM25                                 0.299       0.928
(2d) DR w/ in-batch + BM25 (≈ DPR)              0.311       0.952
(3a) CLEAR (full model)                         0.338       0.969
(3b) CLEAR, dense only                          0.308       0.928
(3c) CLEAR, random negatives                    0.241       0.926
(3d) CLEAR, constant margin                     0.314       0.955
(4a) RocketQA (batch size = 4096) + DNS + DA    0.370       -
(4b) RocketQA (batch size = 4096)               0.364       -
(4c) RocketQA (batch size = 128)                0.310       -
(5a) STAR (≈ ANCE)                              0.340       -
(5b) STAR + ADORE                               0.347       -

Table 42: The effectiveness of various bi-encoder models on the development set of the MS MARCO passage ranking test collection.

Results from CLEAR are shown in row group (3) of Table 42, copied from Gao et al. [2021c]. The effectiveness of the full CLEAR model is reported in row (3a). Although it appears to be more effective than ANCE, row (2a), this is not a fair comparison because CLEAR is a sparse–dense hybrid while ANCE relies on dense retrieval only. Xiong et al. [2021] did not evaluate hybrid combinations of dense and sparse retrieval, but the DPR experiments of Ma et al. [2021c] suggest that dense–sparse hybrids are more effective than dense retrieval alone. Fortunately, Gao et al. [2021c] reported results from an ablation condition of CLEAR with only dense retrieval, shown in row (3b). This result suggests that when considering only the quality of the learned dense representation, ANCE appears to be more effective. However, it is not clear exactly what characteristics of the approaches are responsible for this effectiveness gap, since there are many differences between the two.

Additionally, rows (3c) and (3d) in Table 42 present ablation analyses on the full CLEAR model (which includes both dense and sparse components). In row (3c), the error-based negative samples were replaced with random negative samples, and in row (3d), the residual margin in the loss function was replaced with a constant margin, which is equivalent to the fusion of BM25 results and the results in row (3b). These ablation conditions illustrate the contributions of the two main ideas behind CLEAR: training on "mistakenly-retrieved" texts from lexical retrieval improves effectiveness in a sparse–dense fusion setting, as does coaxing the bi-encoder to compensate for lexical retrieval failures via residual margins.
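As a concrete reading of Eqs. (70) and (71), the sketch below computes the residual-margin hinge loss for a single training triple. The constants c and lam are placeholders rather than the values used by Gao et al. [2021c], and s_dense denotes the bi-encoder inner-product score.

```python
def clear_residual_margin_loss(s_dense_pos, s_dense_neg, bm25_pos, bm25_neg, c=1.0, lam=0.5):
    # Eq. (71): the margin shrinks when BM25 already separates the relevant
    # document from the non-relevant one, so the dense model incurs little loss
    # on pairs that lexical matching handles well.
    margin = c - lam * (bm25_pos - bm25_neg)
    # Eq. (70): standard pairwise hinge loss with that (dynamic) margin.
    return max(0.0, margin - s_dense_pos + s_dense_neg)
```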
RocketQA [Qu et al., 2021] is a dense retrieval technique that further investigates DPR's in-batch negative sampling method by pushing its technical limits to answer the question: What would happen if we just continued to increase the batch size? The answer is shown in row (4b) of Table 42, with a batch size of 4096. For reference, row (4c) shows the effectiveness of a more "typical" batch size of 128, which is consistent with other dense retrieval models. Qu et al. [2021] also proposed two other innovations: using a cross-encoder to remove top-retrieved passages that are likely to be false negatives during sampling (what they called "denoised negative sampling") and data augmentation using high-confidence automatically labeled examples from a cross-encoder. Experimental results suggest, however, that increasing the batch size has the largest benefit to effectiveness. The full model, with denoised negative sampling (= DNS) and data augmentation (= DA), achieves an MRR@10 of 0.370, shown in row (4a). To our knowledge, this is the best single (i.e., non-fusion, non-ensemble) dense retrieval result reported on the development set of the MS MARCO passage ranking task.

Another proposed dense retrieval model is the work of Zhan et al. [2020a] (later published as Zhan et al. [2021]), which extends ANCE to additionally fine-tune the query encoder ηq. Recall that in ANCE, the same encoder is used for both the query and texts from the corpus (i.e., ηd = ηq). With their technique, called ADORE (Algorithm for Directly Optimizing Ranking pErformance), the authors demonstrated that additional fine-tuning of the query encoder ηq (while fixing the passage encoder ηd after an initial training regime similar to ANCE, in which the same encoder is used for both) can further increase retrieval effectiveness. For details, we refer the reader to Zhan et al. [2021], but summarize key results here. Their baseline technique, called STAR (Stable Training Algorithm for dense Retrieval), is shown in row (5a) of Table 42. It can be characterized as a variant of ANCE and achieves a slightly higher level of effectiveness. Further fine-tuning of the query encoder with ADORE, shown in row (5b), leads to another modest increase in effectiveness.

So far, all of the bi-encoder designs we've discussed adopt BERT (or a closely related variant such as RoBERTa) as the base model of their encoders (i.e., η). This, however, need not be the case. For example, the BISON [Shan et al., 2020] ("BM25-weighted Self-Attention Framework") bi-encoder model follows a similar approach to ANCE. However, rather than building the encoder using BERT, BISON uses a stack of modified "BISON encoder layers" that are trained directly on Bing query log data. This is best described as a transformer encoder variant in which self-attention computations are weighted by term importance, calculated using a variant of tf–idf. The model is trained with a standard cross-entropy loss. Unfortunately, BISON was not evaluated on the MS MARCO passage ranking task, and thus a comparison to the techniques in Table 42 is not possible.

The final bi-encoder variant we cover in this section is the work of Yang et al. [2020b], who considered the problem of matching long texts (e.g., using entire documents both as the query and the texts to be searched). They introduced MatchBERT, which can be characterized as a Sentence-BERT variant, as a building block in their hierarchical SMITH model.
SMITH, short for "Siamese Multi-depth Transformer-based Hierarchical Encoder", creates sentence-level representations with a stack of two transformer encoder layers; a stack of three transformer encoder layers then converts these sentence representations into a document representation, which is the output of η. Document representations are then compared with cosine similarity. As there are no common points of comparison between this work and the others discussed above, we do not present results here.

Takeaway Lessons. Beyond DPR and ANCE, which in our opinion are the two most representative dense retrieval techniques, there are many possible variations in bi-encoder designs. For the most part, these different design choices have only a modest impact on effectiveness; taken together, however, these variants can be viewed as a series of independent replication studies on dense retrieval methods.

5.5 Enhanced Transformer Bi-encoders for Ranking

In the "simple" bi-encoder designs discussed above, the representation vectors derived from the encoders ηq and ηd are compared using a simple operation such as inner product. Top-k ranking in this context can be recast as nearest neighbor search, with efficient off-the-shelf solutions (see Section 5.2). While usually much faster (often by orders of magnitude compared to reranking), bi-encoders are less effective than cross-encoder rerankers because the latter can exploit relevance signals derived from attention between the query and candidate texts at each transformer encoder layer. Thus, the tradeoff with bi-encoders is invariably sacrificing effectiveness for efficiency gains.

Are different tradeoffs possible? For example, could we enhance φ to better capture the complexities of relevance (perhaps in conjunction with the design of the encoders) to increase effectiveness at some acceptable loss in efficiency? The design of φ, however, is constrained by current nearest neighbor search techniques if we wish to take advantage of off-the-shelf libraries to perform ranking directly. Put differently, the transformation of dense retrieval into a nearest neighbor search problem that can be tackled at scale critically depends on the choice of φ—using commonly available techniques today, dense retrieval is only possible for a small family of comparison functions such as inner product. Alternatively, researchers would need to build custom nearest neighbor search capabilities from scratch to support a specific comparison function. Therein lies the challenge.

The PreTTR (Precomputing Transformer Term Representations) model [MacAvaney et al., 2020c] illustrates a hybrid design between a bi-encoder and a cross-encoder. Starting with monoBERT, the authors modified the all-to-all attention patterns of BERT to eliminate attention between the query and the candidate text. That is, terms in the candidate text cannot attend to terms in the query, and vice versa; this is accomplished by a mask. If this mask is applied to all the layers in BERT, we have essentially "cleaved" monoBERT into disconnected networks for the query and the candidate text. In this case, the representations of the candidate texts (i.e., all texts from the corpus) can be precomputed, and the overall design is essentially a bi-encoder. However, the attention mask can be applied to only some of the transformer encoder layers. Suppose we apply it to all but the final layer: this means that the representation of the candidate text just before the final transformer encoder layer can be precomputed.
At inference time, the model can look up the precomputed representation and only needs to apply inference with the final layer; inference on the query, however, needs to proceed through all the layers. Since the candidate texts are usually much longer than the queries, this yields large savings in inference latency. By controlling the number of layers the attention mask is applied to, it is possible to trade effectiveness for efficiency. Explained in terms of our framework, in PreTTR, the choice of φ is the "upper layers" of a monoBERT model, while ηd for texts from the corpus comes from the "lower layers" of the same monoBERT model (via attention masking). Contemporaneously, Gao et al. [2020a] had similar intuitions as well, and later, Gao et al. [2020b] as well as Chen et al. [2020] elaborated on these ideas, where encoders generate multiple embeddings that are then fed to a second transformer "head" to compute relevance scores. While these papers illustrate hybrid models that lie between bi-encoders and cross-encoders, their designs remain mostly tied to a reranking setup, with candidate texts coming from a first-stage retrieval technique (presumably based on keyword search).

There is, however, a path forward. In the previous section, we defined "simple" bi-encoders as a class of techniques where (1) ηq and ηd produce fixed-width vectors, and (2) φ is a simple operation such as inner product. As it turns out, both constraints can be relaxed. Researchers have explored approaches that represent each text from the corpus with multiple representation vectors: in Section 5.5.1, we discuss poly-encoders and ME-BERT, which operationalized this intuition in different ways. In Section 5.5.2, we describe ColBERT, which took this idea to what might be considered the logical extreme—by generating, storing, and comparing per-token representations with a richer comparison function φ that is amenable to existing nearest neighbor search libraries.

5.5.1 Multiple Text Representations: Poly-encoders and ME-BERT

As discussed in Section 5.4, Humeau et al. [2020] were, to our knowledge, the first to have proposed successful neural architectures for ranking using transformer-based dense representations. In fact, they introduced the bi-encoder and cross-encoder terminology that we have adopted in this survey; in their work, these two designs served as baselines for their proposed innovation, the poly-encoder model. The poly-encoder model aimed to improve the effectiveness of bi-encoders at the cost of a (modest) decrease in efficiency, using a comparison function φ that takes advantage of multiple representations146 of texts from the corpus. In contrast to bi-encoders, where ηd converts a text from the corpus into a single fixed-width vector, poly-encoders generate m vector representations by learning m "context codes" that "view" a text from the corpus in different ways. At search (query) time, these m representations are aggregated into a single vector via an attention mechanism with the query vector. The final ranking score is computed via an inner product between the query vector and this aggregated vector. In other words, φ remains defined in terms of inner products, but the m representations of texts from the corpus are given an opportunity to interact with the query vector before the final score computation.

146 Confusingly, Humeau et al. [2020] called their query the "candidate" and a text from the corpus a "context"; here, we have translated their terminology into the terminology used in this survey.
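The aggregation step just described can be sketched as follows, under the simplifying assumption that the m vectors for a text and a single query vector are already available; this follows the high-level description above rather than the exact formulation of Humeau et al. [2020].

```python
import numpy as np

def poly_encoder_score(query_vec, doc_vecs):
    # doc_vecs: (m, D) array of "context code" views of one text; query_vec: (D,).
    # Attention of the query over the m views...
    logits = doc_vecs @ query_vec
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # ...aggregates them into a single vector; the final score is an inner product.
    aggregated = weights @ doc_vecs
    return float(np.dot(query_vec, aggregated))
```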
Humeau et al. [2020] compared poly-encoders with bi-encoders and cross-encoders in the context of response selection, which is the task of retrieving appropriate responses to an utterance in a conversation [Lowe et al., 2015, Yoshino et al., 2019, Dinan et al., 2019]. That is, conversational utterances serve as queries and the model's task is to identify the most appropriate piece of text to "say next". While this task differs from ad hoc retrieval, it is nevertheless a retrieval task. We omit results from their paper here since few of the other techniques presented in this survey use those datasets, and thus there is little context for meaningful comparisons.

Unfortunately, Humeau et al. [2020] did not integrate poly-encoders with nearest neighbor search techniques to perform end-to-end retrieval experiments. Their evaluation of efficiency only included reports of inference latency over fixed sets of candidates from their datasets.147 In this limited setting, the experimental results showed that poly-encoders were more effective than bi-encoders and more efficient than cross-encoders.

147 This aspect of experimental design was not clear from the paper, but our interpretation was confirmed via personal communications with the authors.

Other researchers have explored the idea of using multiple representations for dense retrieval. Luan et al. [2021] proposed the ME-BERT (Multi-Vector Encoding from BERT) model, where instead of generating a single representation for each text from the corpus, m representations are produced by the encoder (m = 8 is a typical value). The proposed technique for generating these different representations is quite simple: take the contextual representations of the first m tokens from the BERT output as the m representations. That is, if m = 1, the text would be represented by the contextual representation of the [CLS] token (much like DPR); if m = 2, additionally include the contextual representation of the first token in the text; if m = 3, the contextual representations of the first and second tokens; and so on. At search (query) time, the score between the query and a text from the corpus is simply the largest inner product between the query and any of these m representations. Since the comparison function φ remains the inner product, this operation can be efficiently implemented with standard nearest neighbor search techniques by simply adding m entries for each text from the corpus to the index. Additionally, Luan et al. [2021] combined the results of dense retrieval with sparse retrieval (i.e., BM25) using a linear combination of scores to arrive at dense–sparse hybrids; this is similar to DPR [Karpukhin et al., 2020b]. The ME-BERT model was trained with a combination of sampled negatives from precomputed BM25 results as well as in-batch negatives, similar to DPR, but using cross-entropy loss instead of DPR's contrastive loss. In addition, one round of hard negative mining was applied in some settings. We refer interested readers to the original paper for details.
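The ME-BERT scoring rule described above reduces to a max over m inner products, as in this minimal sketch (assuming the m token vectors per text are precomputed). Because φ is still an inner product, the same nearest neighbor index can be used by simply indexing all m vectors per text and mapping hits back to their source document.

```python
import numpy as np

def multi_vector_score(query_vec, doc_vecs):
    # doc_vecs: (m, D) array holding the contextual embeddings of the first m
    # tokens of one text; the document score is the largest inner product
    # between the query vector and any of these m vectors.
    return float(np.max(doc_vecs @ query_vec))
```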
Experimental results on the development sets of the MS MARCO passage and document ranking tasks, copied from Luan et al. [2021], are shown in Table 43. These models were trained on the training splits of the respective MS MARCO datasets.

                                                   MS MARCO Passage (Dev)    MS MARCO Doc (Dev)
Method                                             MRR@10                    MRR@100
(1) BM25 (Anserini, top 1000)                      0.187                     0.209
(2) DR w/ in-batch + BM25 = Table 39, row (3d)     0.311                     -
(3a) DE-BERT                                       0.302                     0.288
(3b) ME-BERT                                       0.334                     0.333
(3c) BM25 + DE-BERT                                0.309                     0.315
(3d) BM25 + ME-BERT                                0.343                     0.339

Table 43: The effectiveness of ME-BERT on the development sets of the MS MARCO passage and document ranking test collections.

To provide some historical context for interpreting these results, the original arXiv paper that proposed ME-BERT [Luan et al., 2020] was roughly contemporaneous with DPR and predated ANCE. The peer-reviewed version of the paper was not published until nearly a year later, and during this gap, innovations in dense retrieval continued.

The effectiveness of the bi-encoder baseline (called DE-BERT) from Luan et al. [2021], where each text from the corpus is represented by a single vector, is shown in row (3a). The closest comparison we have to another paper is in the context of the ANCE ablation experiments, corresponding to row (3d) in Table 39, repeated in Table 43 as row (2); recall that DPR was not evaluated on MS MARCO data. While training details differ (e.g., loss function, hyperparameters, etc.), the MRR@10 scores on the development set of the MS MARCO passage ranking test collection are comparable, which offers independent verification of the effectiveness of single-vector dense retrieval approaches in general.

As expected, the multi-representation ME-BERT approach outperforms the single-representation DE-BERT baseline, row (3b) vs. (3a). There is, however, an associated efficiency cost (query latency and larger indexes); Luan et al. [2021] reported these tradeoffs in graphical form, which does not lend itself to a concise summary here, so we refer interested readers directly to the paper for details. Furthermore, it is not surprising that dense–sparse hybrids are more effective than dense retrieval alone, with both ME-BERT and DE-BERT. This is shown in (3c) vs. (3a) and (3d) vs. (3b), and the finding is consistent with results from DPR and elsewhere.

While the results of Luan et al. demonstrated the effectiveness of multi-vector representational approaches, the effectiveness of ME-BERT appears to lag behind other dense retrieval techniques in absolute terms. For example, the full ANCE model is comparable in effectiveness to ME-BERT while only requiring a single representation vector per text from the corpus; RocketQA [Qu et al., 2021] also achieves higher effectiveness with a single vector representation.

Takeaway Lessons. If individual vectors are not sufficient to represent texts from the corpus for dense retrieval, then why not use multiple vectors? This appears to be a simple method to improve the effectiveness of dense retrieval while retaining compatibility with off-the-shelf nearest neighbor search techniques. Researchers have only begun to investigate this general approach, and there appears to be a lot of room for further innovations.

5.5.2 Per-Token Representations and Late Interactions: ColBERT

If generating multiple representations from each text from the corpus is a promising approach, then why not take it to the logical extreme and generate a dense vector representation for each token? This, in fact, is what Khattab and Zaharia [2020] accomplished with their ColBERT model!
The authors' core contribution is a clever formulation of the comparison function φ that supports rich interactions between terms in the query and terms in the texts from the corpus in a manner that is compatible with existing nearest neighbor search techniques. This approach, called "late interactions", explicitly contrasts with the all-to-all interactions at each transformer layer in the standard cross-encoder design. With ColBERT, Khattab and Zaharia [2020] demonstrated that ranking methods based on dense representations can achieve levels of effectiveness that are competitive with a cross-encoder design, but at a fraction of the query latency. While still slower than pre-BERT neural models, ColBERT substantially narrows the gap in terms of query-time performance.

More formally, given a text t consisting of a sequence of tokens [t1, ..., tn], ColBERT computes a matrix η([t1, ..., tn]) ∈ R^(n×D), where n is the number of tokens in the text and D is the dimension of each token representation. In other words, the output of the η encoder is a matrix, not just a vector. ColBERT uses the same BERT model to encode queries and texts from the corpus; to distinguish them, however, a special token [Q] is prepended to queries and another special token [D] to texts from the corpus. As with other dense retrieval techniques, the corpus representations can be computed offline since they do not depend on the query. To control the vector dimension D, a linear layer without activation is added on top of the last layer of the BERT encoder. This reduces the storage and hence memory requirements of the token representations, which is an issue for low-latency similarity comparisons (more discussion of this later). Additionally, the vector representation of each token is normalized to unit L2 norm; this makes computing inner products equivalent to computing cosine similarity.

At search (query) time, a query q with terms [q1, ..., qm] is converted to η([q1, ..., qm]) ∈ R^(m×D). A similarity (relevance) score sq,d is computed for each text d from the corpus as follows:

sq,d = Σ_{i∈η(q)} max_{j∈η(d)} η(q)i · η(d)j    (72)

where η(t)i is the vector representing the i-th token of the text t (either the query or a text from the corpus). Since each of these vectors has unit length, the similarity is the sum of maximum cosine similarities between each query term and the "best" matching term contained in the text from the corpus; the authors called this the "MaxSim" operator.148 The scoring function described above assumes that relevance scores are computed over all texts from the corpus; retrieving the top k can be accomplished by sorting the results in decreasing order according to sq,d.

148 An alternative way of explaining MaxSim is that the operator constructs a query–document similarity matrix, performs max pooling over the document tokens for each query token, followed by a summation over the query tokens to arrive at the relevance score. Such a description establishes obvious connections to pre-BERT interaction-based neural ranking models (see Section 1.2.4).

To directly perform top-k ranking against all texts in a large corpus, ColBERT adopts an efficient two-stage retrieval method, since a brute-force computation of the similarity values sq,d, ∀d ∈ C is not practical. As a preprocessing step, the representation of each token from the corpus is indexed using Facebook's Faiss library for nearest neighbor search [Johnson et al., 2017], where each vector retains a pointer back to its source (i.e., the text from the corpus that contains it).
At query time, ranking proceeds as follows:

1. In the first stage, each query term embedding η(q)i is issued concurrently as a query and the top k′ texts from the corpus are retrieved (e.g., k′ = k/2) by following the pointer of each retrieved term vector back to its source. The total number of candidate texts is thus m × k′ (where m is the number of query terms), with K ≤ m × k′ of those being unique. The intuition is that these K documents are likely to be relevant to the query because representations of their constituent tokens are highly similar to at least one of the query tokens.

2. In the second stage, these K candidate texts gathered in the manner described above are scored using all query token representations according to the MaxSim operator in Eq. (72).

As an additional optimization, ColBERT takes advantage of a cluster-based feature inside Faiss to increase the efficiency of the vector searches. Somewhat ironic here is that in order for ColBERT to scale to real-world corpora, a multi-stage architecture is required, which breaks the elegance of single-stage ranking with nearest neighbor search based on bi-encoders. In effect, the authors have replaced first-stage retrieval using an inverted index with first-stage retrieval using a nearest neighbor search library followed by MaxSim reranking (which is much more lightweight than a transformer-based reranker).

The ColBERT model is trained end-to-end using the following loss:

L(q, d+, d−) = − log ( e^{sq,d+} / (e^{sq,d+} + e^{sq,d−}) )    (73)

where d+ and d− are relevant and non-relevant documents to the query q, respectively. The non-relevant documents are directly taken from the training data in triples format. An additional trick used by ColBERT is to append [MASK] tokens to queries that are shorter than a predefined length. According to the authors, this provides a form of query augmentation, since these extra tokens allow the model to learn to expand queries with new terms or to reweight existing terms based on their importance to matching texts from the corpus.

Khattab and Zaharia [2020] evaluated ColBERT on the development set of the MS MARCO passage ranking test collection, which enables a fair comparison to the other techniques presented in this survey. Results from their paper are presented in Table 44. Latency measurements were performed on an NVIDIA V100 GPU.

                                   MS MARCO Passage (Dev)
Method                             MRR@10      Recall@1k     Latency (ms)
(1a) BM25 (Anserini, top 1000)     0.187       0.861         62
(1b)   + monoBERTLarge             0.374       0.861         32,900
(2) FastText + ConvKNRM            0.290       -             90
(3) doc2query–T5                   0.277       0.947         87
(4) ColBERT (with BERTBase)        0.360       0.968         458

Table 44: The effectiveness of ColBERT on the development set of the MS MARCO passage ranking test collection. Query latencies for ColBERT and monoBERTLarge are measured on a V100 GPU.

Rows (1a) and (1b) report the standard BM25 baseline and with monoBERTLarge reranking, respectively. Row (2) copies the authors' report of FastText + ConvKNRM, which can be characterized as a competitive pre-BERT neural ranking model. Row (3) reports the result of doc2query–T5. Row (4) reports effectiveness and query latency figures for ColBERT. We see that ColBERT approaches the effectiveness of monoBERTLarge, row (1b), in terms of MRR@10 but is approximately 70× faster on a modern GPU. While ColBERT is more effective than doc2query–T5 and ConvKNRM, it is still 5× slower. Note that ConvKNRM is evaluated on a GPU, whereas doc2query–T5 runs on a CPU.

To summarize, the results show that in terms of query latency, ColBERT has indeed closed much of the gap between monoBERT and pre-BERT neural ranking models. It is able to accomplish this with only modest degradation in effectiveness compared to monoBERT reranking. However, although more effective, ColBERT is still many times slower than pre-BERT neural models and doc2query. Nevertheless, these results show that ColBERT represents a compelling point in the effectiveness/efficiency tradeoff space.
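The MaxSim operator in Eq. (72) is compact enough to state directly in code. The sketch below scores one (query, passage) pair from per-token embedding matrices, assuming the rows have already been L2-normalized as described above.

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    # query_embs: (m, D); doc_embs: (n, D); rows are unit-normalized, so the
    # inner product equals cosine similarity.
    sim = query_embs @ doc_embs.T        # (m, n) term-by-term similarity matrix
    return float(sim.max(axis=1).sum())  # max over doc tokens, summed over query tokens

# Toy example with random normalized embeddings:
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((60, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```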
However, in terms of multi-stage architectures, monoBERTLarge is only a baseline. There exist even more effective reranking models, for example, duoBERT (see Section 3.4.1), and the top leaderboard entries for the MS MARCO passage ranking task now report MRR@10 above 0.400; see, for example, Qu et al. [2021]. Thus, dense retrieval techniques by themselves still have some way to go to catch up to the effectiveness of the best multi-stage reranking pipelines. However, they can be and are being used as replacements for first-stage retrieval based on sparse (keyword) search to feed downstream rerankers; see, for example, Qu et al. [2021] and Hofstätter et al. [2021].

Finally, there is one major drawback of ColBERT: the space needed to store the per-token representations of texts from the corpus. For example, the MS MARCO passage corpus contains 8.8M passages. To illustrate using round numbers, suppose that each passage has on average 50 tokens, each token is represented by a 128-dimensional vector, and we use 4 bytes to encode each dimension. We would need 8.8M passages × 50 tokens × 128 dims × 4 bytes ≈ 225 GB of space! This accounting represents only the space required to store the raw representation vectors and does not include the overhead of index structures to facilitate efficient querying. In practice, however, space usage can be reduced by using fewer bits to represent each dimension and by compressing the vectors. Khattab and Zaharia reported that "only" 156 GB is required to store their index due to some of these optimizations. Nevertheless, this is still orders of magnitude larger than the 661 MB required by the bag-of-words index of the same collection with Lucene (see further discussion in Section 5.7). Since Faiss loads all index data into RAM to support efficient querying, we are trading off the cost of neural inference for reranking (e.g., using GPUs) against the cost of large amounts of memory to support efficient nearest neighbor search. We can imagine that these large memory requirements make ColBERT less attractive, and perhaps even impractical, for certain applications, particularly on large corpora.

Takeaway Lessons. Bi-encoder and cross-encoder designs lie at opposite ends of the spectrum in terms of the richness of interaction between queries and texts from the corpus. Multi-vector approaches can preserve some level of interaction while remaining amenable to efficient retrieval. Specifically, ColBERT's MaxSim operator supports rich token-level "late interactions" in a manner that remains compatible with efficient nearest neighbor search capabilities provided by existing libraries. The result is a "single-stage" dense retrieval technique whose effectiveness approaches monoBERT reranking, but at a fraction of the query latency.

5.6 Knowledge Distillation for Transformer Bi-encoders

Distillation methods are commonly used to decrease model size, thus reducing overall inference costs, including memory requirements as well as inference latency.
As we've seen in Section 3.5.1, this is desirable for reranking models, where inference needs to be applied over all candidate texts from first-stage retrieval. One might wonder: why would knowledge distillation be desirable for training dense retrieval models? After all, the advantages of smaller and faster models are less compelling in the dense retrieval setting, as applying inference over the entire corpus with a particular encoder can be considered a preprocessing step that is easy to parallelize.149 Nevertheless, there is a thread of research focused on distilling "more powerful" cross-encoders into "less powerful" bi-encoders. Empirically, this two-step procedure seems to be more effective than directly training a bi-encoder; this finding appears to be consistent with the reranker distillation results presented in Section 3.5.1.

To our knowledge, Lu et al. [2020] were the first to apply distillation in the dense retrieval context. However, their work can be characterized as first training a bi-encoder with BERT, and then distilling into smaller encoder models—which is fundamentally different from the techniques that followed. Furthermore, the authors' proposed TwinBERT model was not evaluated on public datasets, and thus there is no way to compare its effectiveness to other techniques.

149 Although, admittedly, at "web scale" (i.e., for commercial web search engines), applying inference over the entire collection would still be quite costly.

                                                          MS MARCO Passage (Dev)
Method                                                    MRR@10      Recall@1k
(1a) DistilBERTdot Margin-MSE w/ ensemble teacher         0.323       0.957
(1b) DistilBERTdot wo/ distillation                       0.299       0.930
(2a) TCT-ColBERT (v1)                                     0.335       0.964
(2b) TCT-ColBERT (v1) + BM25                              0.352       0.970
(2c) TCT-ColBERT (v1) + doc2query–T5                      0.364       0.973
(3a) TCT-ColBERT w/ HN+ (v2)                              0.359       0.970
(3b) TCT-ColBERT w/ HN+ (v2) + BM25                       0.369       -
(3c) TCT-ColBERT w/ HN+ (v2) + doc2query–T5               0.375       -
(4a) DistilBERTdot TAS-Balanced                           0.347       0.978
(4b) DistilBERTdot TAS-Balanced + doc2query–T5            0.360       0.979

Table 45: The effectiveness of various bi-encoder models trained with knowledge distillation on the development set of the MS MARCO passage ranking test collection.

The first instance of distilling cross-encoders into bi-encoders that we are aware of is by Hofstätter et al. [2020]. Their work established a three-step procedure that provides a reference point for this thread of research:

1. Standard (query, relevant text, non-relevant text) training triples, for example, from the MS MARCO passage ranking test collection, are used to fine-tune a teacher model (in this case, a cross-encoder).

2. The teacher model is then used to score all the training triples, in essence generating a new training set.

3. The training triples with the teacher scores are used to train a student model (in this case, a bi-encoder based on DistilBERT) via standard knowledge distillation techniques.

Note that the inference required in step (2) only needs to be performed once and can be cached as static data for use in step (3). A noteworthy aspect of this procedure is that relevance labels are not explicitly used in the training of the student model. Knowledge distillation is performed by optimizing the margin between the scores of relevant and non-relevant texts with respect to a query. Concretely, this is accomplished by what Hofstätter et al. [2020] call Margin Mean Squared Error (Margin-MSE).
Given a training triple comprising the query q, relevant text d+, and non-relevant text d−, the output margin of the teacher model is used to optimize the student model as follows:

L(q, d+, d−) = MSE( Ms(q, d+) − Ms(q, d−), Mt(q, d+) − Mt(q, d−) )    (74)

where Ms(q, d) and Mt(q, d) are the scores from the student model and teacher model for d, respectively. MSE is the standard Mean Squared Error loss function between scores S and targets T across each training batch:

MSE(S, T) = (1/|S|) Σ_{s∈S, t∈T} (s − t)²    (75)

Another nice property of this setup is support for distilling knowledge from multiple teacher models via ensembles. Putting all these elements together, effectiveness on the development set of the MS MARCO passage ranking test collection is shown in row (1a) of Table 45, copied from Hofstätter et al. [2020]. This condition used the Margin-MSE loss, DistilBERT as the student model, and a teacher ensemble comprising three cross-encoders; the subscript "dot" is used by the authors to indicate a bi-encoder model. The same DistilBERTdot model trained without knowledge distillation is shown in row (1b), which exhibits lower effectiveness. This finding supports the idea that distilling from more powerful models (cross-encoders) into less powerful models (bi-encoders) is more effective than training less powerful models (bi-encoders) directly. Hofstätter et al. [2020] performed additional ablation analyses and contrastive experiments examining the impact of different loss functions and teacher models; we direct readers to their paper for details.
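The Margin-MSE objective in Eqs. (74) and (75) can be sketched directly, here with NumPy over a batch of precomputed student and teacher scores; in practice the student scores would come from a differentiable model inside a training framework.

```python
import numpy as np

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    # Each argument is an array of per-triple scores for one training batch.
    # The student is optimized so that its score margin between the relevant
    # and non-relevant text matches the (precomputed) teacher margin.
    student_margin = np.asarray(student_pos) - np.asarray(student_neg)
    teacher_margin = np.asarray(teacher_pos) - np.asarray(teacher_neg)
    return float(np.mean((student_margin - teacher_margin) ** 2))
```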
As a point of contrast, Lin et al. [2020b] approached distillation in a different manner. Note that in step (2) from Hofstätter et al. [2020], teacher scores are precomputed and stored; herein lies the key difference. The main idea of Lin et al. is to use in-batch negatives whose soft labels are computed by a fast teacher model on the fly during knowledge distillation. Due to high inference costs, a teacher model based on a BERT cross-encoder would be impractical for this role, but ColBERT is both sufficiently efficient and effective to serve as the teacher model in this design. The authors called this model TCT-ColBERT, where TCT stands for "Tightly Coupled Teacher". The student model is trained with a loss function comprising two terms: the first term corresponds to the softmax cross entropy over relevance labels (thus, differing from Hofstätter et al. [2020], this approach does make direct use of the original training data) and the second term captures the KL-divergence between the score distributions of the teacher and student models with respect to all instances in the batch. We refer readers to Lin et al. [2020b] for additional details.

The effectiveness of TCT-ColBERT (v1) on the development set of the MS MARCO passage ranking test collection is shown in row (2a) of Table 45. The student model in this case was BERTBase, which was the same as the teacher model, so we are distilling into a student model that is the same size as the teacher model. However, the key here is that the ColBERT teacher is more effective, so we are still distilling from a more powerful model into a less powerful model. While it appears that TCT-ColBERT (v1) achieves higher effectiveness than Hofstätter et al. [2020], the comparison is not fair because TCT-ColBERT used a larger student model with more layers and more parameters (BERTBase vs. DistilBERT). Nevertheless, the technique yields a bi-encoder on par with ANCE in terms of effectiveness (see Table 42). The dense retrieval model can be further combined with sparse retrieval results, either bag-of-words BM25 or doc2query–T5; these conditions are shown in rows (2b) and (2c), respectively. As expected, dense–sparse hybrids are more effective than dense retrieval alone.

In follow-up work, Lin et al. [2021b] further improved TCT-ColBERT in their "v2" model. The additional trick, denoted "HN+", incorporates the hard-negative mining idea from ANCE, with the main difference that ANCE's negatives are dynamic (i.e., they change during training) while negatives from HN+ are static. An initially trained TCT-ColBERT model is used to encode the entire corpus, and new training triples are created by using hard negatives retrieved from these representations (replacing the BM25-based negatives). The ColBERT teacher is then fine-tuned with this augmented training dataset (containing the hard negatives), and finally, the improved ColBERT teacher is distilled into a bi-encoder student BERTBase model. The effectiveness of this technique is shown in row (3a) of Table 45. We see that improvements from hard-negative mining are additive with the basic TCT-ColBERT design. Comparing with results in Table 42, the effectiveness of TCT-ColBERT w/ HN+ (v2) is second only to RocketQA; for reference, Lin et al. [2021b] reported training with a modest batch size of 96, compared to 4096 for RocketQA. Rows (3b) and (3c) report hybrid combinations of dense retrieval with BM25 and doc2query–T5, respectively. We see that the model further benefits from integration with sparse retrieval signals, particularly with document expansion.

As a follow-up to Hofstätter et al. [2020], and incorporating ideas from Lin et al. [2020b], Hofstätter et al. [2021] focused on increasing the training efficiency of bi-encoder dense retrieval models via distillation. Their main insight is that training batches assembled via random sampling (as is the typical procedure) are likely to contain many low-information training samples—for example, (query, non-relevant text) pairs that are "too easy" and thus unhelpful in teaching the model to separate relevant from non-relevant texts. As pointed out by Xiong et al. [2021], most in-batch negatives are uninformative because the sampled queries are very different, thus also making the constructed contrastive pairs "too easy". RocketQA gets around this with large batch sizes, thus increasing the likelihood of obtaining informative training examples. Recognizing these issues, Hofstätter et al. [2021] proposed a more principled solution. The authors first clustered the training queries using k-means clustering (based on an initial bi-encoder). Instead of randomly selecting queries to form a batch, queries are sampled from the topic clusters so that the contrastive examples are more informative; the authors called this topic-aware sampling (TAS). As an additional refinement, this sampling can be performed in a "balanced" manner to select query–passage pairs that range from "easy" to "difficult" (defined in terms of the margin from the teacher model). Without balanced sampling, non-relevant passages would be over-represented since they are more prevalent, once again likely leading to uninformative training examples. Putting both these ideas together, the authors arrived at the TAS-B ("B" for "Balanced") technique. Beyond this high-level description, we refer readers to Hofstätter et al. [2021] for additional details.
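The batch-construction idea behind TAS-B can be sketched as follows, under simplifying assumptions: the query clusters and teacher margins are precomputed, and the balancing step here (binning candidate pairs by teacher margin and drawing from a random bin) is only an approximation of the procedure described above, not the authors' implementation.

```python
import random
from collections import defaultdict

def tas_balanced_batch(cluster_of_query, pairs_by_query, teacher_margin,
                       batch_size=32, n_bins=4):
    # cluster_of_query: query_id -> cluster_id (k-means over query representations).
    # pairs_by_query:   query_id -> list of (pos_id, neg_id) candidate training pairs.
    # teacher_margin:   (query_id, pos_id, neg_id) -> teacher score margin.
    clusters = defaultdict(list)
    for q, c in cluster_of_query.items():
        clusters[c].append(q)

    # Topic-aware sampling: all queries in the batch come from one cluster, so
    # in-batch negatives are topically related rather than trivially "easy".
    cluster = random.choice(list(clusters.values()))
    queries = random.sample(cluster, k=min(batch_size, len(cluster)))

    batch = []
    for q in queries:
        pairs = sorted(pairs_by_query[q], key=lambda p: teacher_margin[(q, *p)])
        # Margin-balanced selection: bin pairs by teacher margin and draw from a
        # uniformly chosen bin, so easy and hard pairs are represented evenly.
        step = max(1, len(pairs) // n_bins)
        bins = [pairs[i:i + step] for i in range(0, len(pairs), step)]
        batch.append((q, *random.choice(random.choice(bins))))
    return batch
```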
Results of TAS-B on the development set of the MS MARCO passage ranking test collection are shown in row (4a) of Table 45, copied from Hofstätter et al. [2021]. In these experiments, DistilBERT served as the student model and the teacher model was an ensemble comprising a cross-encoder and ColBERT. Since the student models are the same, this result can be compared to row (1) from Hofstätter et al. [2020]; however, comparisons to results in row groups (2) and (3) are not fair since TCT-ColBERT used BERTBase as the student (which has more layers and more parameters), and TAS-B uses an ensemble of cross-encoder and bi-encoder models as teachers. Nevertheless, we can see that TAS-B improves upon the earlier distillation work of Hofstätter et al. [2020]. Furthermore, the model is trainable on a single consumer-grade GPU in under 48 hours, compared to, for example, ANCE and DPR, both of which were trained on 8× V100 GPUs. Beyond these specific experimental settings, we note that TAS-B can be viewed as a general approach to constructing training batches, which is to some extent orthogonal to the dense retrieval model being trained. Although we are not aware of any other applications of TAS-B, this would be interesting future work.

Takeaway Lessons. All of the techniques surveyed in this section adopt a basic bi-encoder design for the student models, similar to the models discussed in Section 5.4. However, instead of directly training the bi-encoder, distillation techniques are applied to transfer knowledge from more effective but slower models (e.g., cross-encoders and ColBERT) into the bi-encoder. Empirically, this approach appears to be more effective: setting aside RocketQA, which achieves its effectiveness through "brute force" via large batch sizes and a cross-encoder to eliminate false negatives, the most effective dense retrieval models to date appear to be based on knowledge distillation. Nevertheless, it seems fair to say that our understanding of the underlying mechanisms is incomplete. As a starting point for future work, we end with this observation: the findings here appear to be consistent with investigations of knowledge distillation in the context of reranking (see Section 3.5.1). In both cases, distilling from a more powerful model into a less powerful model appears to be more effective than directly fine-tuning a less powerful model. In the reranking context, since all the designs are based on cross-encoders, the "power" of the model is mostly a function of its size (number of layers, parameters, etc.). In the dense retrieval context, cross-encoders are clearly more "powerful" than bi-encoders, even though the models themselves may be the same size. We believe that this is the key insight, but more research is needed.

5.7 Concluding Thoughts

There has been much excitement and progress in ranking with learned dense representations, which we have covered in this section. Despite the potential of dense retrieval, there remain many challenges, which we discuss below.

First, all dense retrieval models discussed in this section are trained in a supervised setting using human relevance judgments such as labels from the MS MARCO passage ranking test collection (either directly or indirectly via knowledge distillation). As with all supervised approaches, there's the important question of what happens when the model is presented with an out-of-distribution sample at inference time.
In our case, this can mean that the encoder ηd for representing texts from the corpus is presented with texts from a different domain, genre, etc. than what the model was trained with, that the query encoder ηq is fed queries that are different from the training queries, or both. For example, what would happen if an encoder ηd trained with the MS MARCO passage ranking test collection were applied to texts from the biomedical domain? In fact, there is existing experimental evidence demonstrating that dense retrieval techniques are often ineffective in a zero-shot transfer setting to texts in different domains, different types of queries, etc. Thakur et al. [2021] constructed a benchmark called BEIR by organizing over a dozen existing datasets spanning diverse retrieval tasks in different domains into a single, unified framework. The authors evaluated a number of dense retrieval techniques in a zero-shot setting and found that they were overall less effective than BM25. In contrast to BM25, which generally "just works" regardless of the corpus and queries, dense retrieval models trained on MS MARCO data can lead to terrible results when directly applied to other datasets. Addressing the generalizability of dense retrieval techniques for "out of distribution" texts and queries is an important future area of research.

Second, dense retrieval techniques highlight another aspect of effectiveness/efficiency tradeoffs that we have not paid much attention to. For the most part, our metrics of effectiveness are fairly straightforward, such as those discussed in Section 2.5; there are literally decades of research in information retrieval on evaluation metrics. In terms of efficiency, we have mostly focused on query latency. However, there is another aspect of efficiency that we have not seriously considered until now—the size of the index structures necessary to support efficient retrieval at scale. For inverted indexes to support, say, BM25 retrieval, the requirements are modest compared to the capabilities of servers today and not sufficiently noteworthy to merit explicit discussion. However, space becomes an important consideration with dense retrieval techniques. We present some figures for comparison: a minimal Lucene index in Anserini, sufficient to support bag-of-words querying on the MS MARCO passage corpus (8.8M passages), only takes up 661 MB.150 A comparable HNSW index with 768-dimensional vectors in Faiss occupies 42 GB (with typical parameter settings), which is substantially larger. As noted in Section 5.5.2, Khattab and Zaharia [2020] reported that the comparable ColBERT index occupies 156 GB (since they need to store per-token representations). These index sizes often translate into memory (RAM) requirements since many existing nearest neighbor search libraries require memory-resident indexes to support efficient querying. Clearly, space is an aspect of performance (efficiency) that we need to consider when evaluating dense retrieval techniques. While researchers have begun to explore different techniques for compressing dense representations, for example Izacard et al. [2020] and Yamada et al. [2021], there is much more work to be done. Moving forward, we believe that an accurate characterization of the tradeoff space of retrieval techniques must include quality (effectiveness of the results), time (i.e., query latency), as well as space (i.e., index size).

150 This index configuration is minimal in that it only stores term frequencies and does not include positions (to support phrase queries), document vectors (to enable relevance feedback), and a copy of the corpus text (for convenient access). Even with all these additional features, the complete index is only 2.6 GB (and this includes a compressed copy of the corpus).
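The index-size figures above can be roughly reproduced with back-of-the-envelope arithmetic, as in the snippet below; these are raw-vector estimates only, and the overhead added by ANN data structures (e.g., HNSW graph links) and any compression will move the actual numbers.

```python
# Rough storage estimates for the MS MARCO passage corpus (8.8M passages),
# assuming 32-bit floats; illustrative only.
NUM_PASSAGES = 8_800_000
BYTES_PER_DIM = 4

single_vector = NUM_PASSAGES * 768 * BYTES_PER_DIM    # one 768-dim vector per passage
per_token = NUM_PASSAGES * 50 * 128 * BYTES_PER_DIM   # ~50 tokens/passage at 128 dims (ColBERT-style)

print(f"single-vector raw storage: {single_vector / 1e9:.0f} GB")  # ~27 GB, before ANN overhead
print(f"per-token raw storage:     {per_token / 1e9:.0f} GB")      # ~225 GB, before compression
```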
Third, dense retrieval techniques today have largely sidestepped, but have not meaningfully addressed, the length limitations of transformers. For the most part, the various techniques presented in this section rely on encoders that are designed for processing relatively short segments of text—sentences, maybe paragraphs, but definitely not full-length documents all at once. Luan et al. [2021] provided a theoretical analysis of the relationship between document length and representation vector size with respect to fidelity, i.e., the ability of the fixed-width representations to preserve distinctions made by sparse bag-of-words retrieval models. Tu et al. [2020] empirically demonstrated that with USE [Cer et al., 2018a,b], the quality of the output representations for retrieval degrades as the length of the text increases. These theoretical and empirical results match our intuitions—it becomes increasingly difficult to "squeeze" the meaning of texts into fixed-width vectors as the length increases.

Many of the dense retrieval techniques discussed in this section have not been applied to full-length documents. In many cases, researchers presented results on the MS MARCO passage ranking task, but not the document ranking counterpart. Those that do tackle longer texts primarily adopt the (simple and obvious) strategy of breaking long texts into shorter segments and encoding each segment independently. In the case of question answering (for example, in DPR), this is an acceptable solution because retriever output is sent to the reader model for answer extraction. Furthermore, many natural language questions can be answered by only considering relatively small text spans. In the case of document retrieval (for example, in the MaxP variant of ANCE), a document is represented by multiple dense vectors, each corresponding to a segment of text in the document and independently encoded, and the representation most similar to the query representation is taken as a proxy for the entire document for ranking. We are not aware of any dense retrieval techniques on full-length documents that integrate evidence from multiple parts of a document, for example, in the same way that PARADE (see Section 3.3.4) does in a reranking setting. SMITH might be an exception [Yang et al., 2020b], although it was not designed for ad hoc retrieval. In fact, it is unclear how exactly this could be accomplished while retaining compatibility with the technical infrastructure that exists today for nearest neighbor search. Unlike question answering, where answer extraction can often be accomplished with only limited context, document-level relevance judgments may require assessing a document "holistically" to determine its relevance, which is a fundamental limitation of techniques that independently consider document segments.

Finally, there are large areas in the design space of dense retrieval techniques that remain unexplored. This is not a research challenge per se, just an observation that much more work still needs to be done. There are many obvious extensions and examples of techniques that can be "mixed-and-matched"
For example, Luan et al. [2021] demonstrated the effectiveness of multi-vector representations, but they evaluated only one specific approach to creating such representations. There are many alternatives that have not been tried. As another example, topic-aware sampling in the construction of training batches [Hofstätter et al., 2021] was developed in the context of knowledge distillation, but can be applied broadly to other models as well. Another research direction now receiving attention can be characterized as the dense retrieval “counterparts” to the techniques discussed in Section 3.2.4. In the context of cross-encoders, researchers have examined additional pretraining and multi-step fine-tuning strategies, and there is work along similar lines, but specifically for dense retrieval [Lu et al., 2021, Gao and Callan, 2021a,b].

There is no doubt that dense retrieval—specifically, using learned dense representations from transformers for ranking—is an exciting area of research. For over half a century, exact match techniques using inverted indexes have remained a central and indispensable component in end-to-end information access systems. Advances in the last couple of decades, such as feature-driven learning to rank and, more recently, neural networks, still mostly rely on exact match techniques for candidate generation, since they primarily serve as rerankers. Dense retrieval techniques, however, seem poised to at least supplement decades-old exact match “sparse” techniques for generating top-k rankings from a large corpus efficiently: learned representations have been shown to consistently outperform unsupervised bag-of-words ranking models such as BM25.151 Furthermore, dense–sparse hybrids appear to be more effective than either alone, demonstrating that they provide complementary relevance signals. Large-scale retrieval using dense vector representations can often be recast as a nearest neighbor search problem, for which inverted indexes designed for sparse retrieval do not offer the best solution. This necessitates a new class of techniques such as HNSW [Malkov and Yashunin, 2020], which have been implemented in open-source libraries such as Faiss [Johnson et al., 2017]. Thus, dense retrieval techniques require a different “software stack” alongside sparse retrieval with inverted indexes.

Coming to the end of our coverage of ranking with learned dense representations, we cautiously venture that describing dense retrieval techniques as a paradigm shift in retrieval might not be an exaggeration. We know of at least two instances of dense retrieval techniques deployed in production, by Bing (from a blog post152 and according to Xiong et al. [2021]) and Facebook [Huang et al., 2020]. However, we don’t foresee sparse retrieval and inverted indexes being completely supplanted, at least in the near future, as there remains substantial value in dense–sparse hybrids. While challenges still lie ahead, some of which we’ve sketched above, dense retrieval techniques represent a major advance in information access.

151 Although there is recent work on learned sparse representations that seems exciting as well [Bai et al., 2020, Gao et al., 2021b, Zhao et al., 2021, Lin and Ma, 2021, Mallia et al., 2021, Formal et al., 2021a, Lassance et al., 2021]; see additional discussions in Section 6.2.

152 https://blogs.bing.com/search-quality-insights/May-2018/Towards-More-Intelligent-Search-Deep-Learning-for-Query-Semantics
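Before moving on, a concrete illustration of the dense–sparse hybrids mentioned above: a minimal sketch that linearly interpolates min-max-normalized scores from a sparse (e.g., BM25) run and a dense retrieval run for a single query. The interpolation weight, the toy score dictionaries, and the document ids are placeholders rather than settings drawn from any particular system.

```python
def hybrid_scores(sparse_run: dict, dense_run: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate min-max-normalized sparse and dense scores.

    sparse_run, dense_run: {docid: score} for one query; a document missing
    from one run simply contributes zero from that run after normalization.
    """
    def normalize(run):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        return {docid: (score - lo) / span for docid, score in run.items()}

    s, d = normalize(sparse_run), normalize(dense_run)
    docids = set(s) | set(d)
    return {docid: alpha * s.get(docid, 0.0) + (1 - alpha) * d.get(docid, 0.0)
            for docid in docids}

# Toy usage with made-up scores:
fused = hybrid_scores({"d1": 12.3, "d2": 7.1}, {"d1": 0.62, "d3": 0.80}, alpha=0.4)
ranking = sorted(fused, key=fused.get, reverse=True)
```

An alternative that avoids score normalization altogether is rank-based fusion, such as reciprocal rank fusion.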
6 Future Directions and Conclusions

It is quite remarkable that BERT debuted in October 2018, only around three years ago. Taking a step back and reflecting, the field has seen an incredible amount of progress in a short amount of time. As we have noted in the introduction and demonstrated throughout this survey, the foundations of how to apply BERT and other transformer architectures to ranking are already quite sturdy—the improvements in effectiveness attributable to, for example, the simple monoBERT design, are substantial, robust, and have been widely replicated in many tasks. We can confidently assert that the state of the art has significantly advanced over this time span [Lin, 2019], a period notable for the amount of interest, attention, and activity that transformer architectures have generated. These are exciting times! We are nearing the end of this survey, but we are still far from the end of the road in this line of research—there are still many open questions, unexplored directions, and much more work to be done. The remaining pages represent our attempt to prognosticate on what we see in the distance, but we begin with some remarks on material we didn’t get a chance to cover.

6.1 Notable Content Omissions

Despite the wealth of obvious connections between transformer-based text ranking models and other NLP tasks and beyond, there are a number of notable content omissions in this survey. As already mentioned at the outset in Section 1.3, we intentionally neglected coverage of other aspects of information access such as question answering, summarization, and recommendation. The omission of question answering might seem particularly glaring, since at a high level the differences between document retrieval, passage retrieval, and question answering can be viewed as granularity differences in the desired information. Here we draw the line between span extraction and ranking explicitly defined segments of text. Standard formulations of question answering (more precisely, factoid question answering) require systems to identify the precise span of the answer (for example, a named entity or a short phrase) within a larger segment of text. These answer spans are not predefined, rendering the problem closer to sequence labeling than to ranking. Given this perspective, we have intentionally omitted coverage of work in question answering focused on span extraction. This decision is consistent with the breakdown of the problem in the literature. For example, Chen et al. [2017a] outlined a “retriever–reader” framework: the “retriever” is responsible for retrieving candidates from a corpus that are likely to contain the answer, and the “reader” is responsible for identifying the answer span. This is just an instance of the multi-stage ranking architectures we have discussed in depth; one can simply imagine adding a reader to any existing multi-stage design to convert a search system into a question answering system. The design of retrievers squarely lies within the scope of this survey, and indeed we have interwoven instances of such work in our narrative, e.g., DPR [Karpukhin et al., 2020b] in Section 5.4.2 and Cascade Transformers [Soldaini and Moschitti, 2020] in Section 3.4.3.
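To make the retriever–reader decomposition concrete, here is a minimal sketch that attaches a span-extraction reader to a BM25 retriever. It is illustrative only: it assumes Pyserini's prebuilt MS MARCO passage index is available and uses an off-the-shelf extractive QA model from the Hugging Face hub; neither choice corresponds to a specific system described in this survey.

```python
from pyserini.search.lucene import LuceneSearcher
from transformers import pipeline

# Retriever: BM25 over a prebuilt index (assumes Pyserini and its Java
# dependencies are installed and that this prebuilt index name is available).
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

# Reader: an off-the-shelf extractive QA model (an illustrative choice).
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def answer(question: str, k: int = 10):
    hits = searcher.search(question, k=k)
    candidates = []
    for hit in hits:
        # Assumes the index stores the passage text; otherwise a separate
        # docid-to-text lookup would be needed here.
        passage = searcher.doc(hit.docid).raw()
        span = reader(question=question, context=passage)
        candidates.append((span["score"], span["answer"], hit.docid))
    # Return the answer span the reader is most confident about.
    return max(candidates)

print(answer("who proposed the theory of relativity?"))
```

More sophisticated designs rerank the retrieved candidates before reading, or combine reader and retriever scores, but the basic pipeline structure is the same.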
Nevertheless, the impact of BERT and other transformer architectures on span extraction in question answering (i.e., the “reader”) has been at least as significant as the impact of transformers on text ranking. Paralleling Nogueira and Cho [2019], BERTserini [Yang et al., 2019c] was the first instance of applying a BERT-based reader to the output of a BM25-based retriever to perform question answering directly on Wikipedia. Prior to this work, BERT had been applied only in a reading comprehension setup where the task is to identify the answer in a given document (i.e., there was no retriever component), e.g., Alberti et al. [2019]. A proper treatment of the literature here would take up another volume,153 but see Chen and Yih [2020] for a tutorial on recent developments.

153 Perhaps the topic for our next survey?

Another closely related emerging thread of work that we have not covered lies at the intersection of question answering and document summarization. Like search and question answering, summarization research has been heavily driven by transformers in recent years, particularly sequence-to-sequence models given their natural fit (i.e., full-length document goes in, summary comes out). Recent work includes Liu and Lapata [2019], Zhang et al. [2019], Subramanian et al. [2019], Zhang et al. [2020c]. In the query-focused summarization variant of the task [Dang, 2005], target summaries are designed specifically to address a user’s information need. Techniques based on passage retrieval can be viewed as a (strong) baseline for this task, e.g., selecting the most relevant sentence(s) from the input text(s). Along similar lines, although most recent work on question answering is extractive in nature (i.e., identifying a specific answer span in a particular piece of text), researchers have begun to explore abstractive question answering, where systems may synthesize an answer that is not directly contained in any source document [Izacard and Grave, 2020, Hsu et al., 2021]. Abstractive approaches have the potential advantage of allowing the underlying model to synthesize evidence from multiple sources. At this point, the distinction between query-focused summarization, passage retrieval, and abstractive question answering becomes quite muddled—but in a good way, because they present an exciting melting pot of closely related ideas, from which interesting future work is bound to emerge.

A final glaring omission in this survey is coverage of interactive information access techniques. Nearly all of the techniques we have discussed can be characterized as “one shot”, i.e., an information seeker poses a query to a system... and that’s it. Throughout this survey, we have been focused on measuring and optimizing the quality of system output in this setting and have for the most part neglected to discuss “what comes next”. Indeed, what happens after this initial query? Typically, if the desired relevant information is not obtained, the user will try again, for example, with a different formulation of the query. Even if the information need is satisfied, the user may continue to engage in subsequent interactions as part of an information seeking session, for example, to ask related or follow-up questions. Studies of interactive information retrieval systems date back to the 1980s, but there has been a resurgence of interest in the context of intelligent personal assistants such as Siri and “smart” consumer devices such as Alexa.
No surprise, neural models (particularly transformers) have been applied to tackle many aspects of the overall challenge. While researchers use many terms today to refer to this burgeoning research area, the terms “conversational search” and “conversational information seeking” have been gaining currency. As we lack the space for a thorough treatment of the literature in this survey, we refer readers to a few entry points: a theoretical framework for conversational search by Radlinski and Craswell [2017] and a recent survey of conversational AI more broadly (encompassing dialogue systems, conversational agents, and chatbots) by McTear [2020]. In the information retrieval community, one recent locus of activity has been the Conversational Assistance Tracks (CAsT) at TREC, which have been running since 2019 [Dalton et al., 2019] with the goal of advancing research on conversational search systems by building reusable evaluation resources. In the natural language processing community, there is substantial parallel interest in information seeking dialogues, particularly in the context of question answering [Choi et al., 2018, Elgohary et al., 2019]. There exist many datasets that capture typical linguistic phenomena observed in naturally occurring dialogues, such as anaphora, ellipsis, and topic shifts.

6.2 Open Research Questions

Looking into the future, we are able to identify a number of open research questions, which we discuss below. These correspond to threads of research that are being actively pursued right now, and given the rapid pace of progress in the field, we would not be surprised if there are breakthroughs in answering these questions by the time a reader consumes this survey.

Transformers for Ranking: Apply, Adapt, or Redesign? At a high level, reranking models based on transformers can be divided into three approaches:

1. apply existing transformer models with minimal modifications—exemplified by monoBERT and ranking with T5;
2. adapt existing transformer models, perhaps adding additional architectural elements—exemplified by CEDR and PARADE; or,
3. redesign transformer-based architectures from scratch—exemplified by the TK/CK models.

Which is the “best” approach? And to what end? Are we seeking the most effective model, without any considerations regarding efficiency? Or alternatively, are we searching for some operating point that balances effectiveness and efficiency?

There are interesting and promising paths forward with all three approaches: The first approach (“apply”) allows researchers to take advantage of innovations in natural language processing (that may not have anything to do with information access) “for free” and fits nicely with the “more data, larger models” strategy. The last approach (“redesign”), on the other hand, requires researchers to reconsider each future innovation specifically in the context of text ranking and assess its applicability. However, this approach has the advantage of potentially stripping away all elements unnecessary for the problem at hand, thereby possibly achieving better effectiveness/efficiency tradeoffs (for example, the TK/CK models). The second approach (“adapt”) tries to navigate the middle ground, retaining a “core” that can be swapped for a better model that comes along later (for example, PARADE swapping out BERT for ELECTRA).
In the design of transformer models for ranking, it is interesting to observe that the evolution of techniques follows a trajectory resembling the back-and-forth swing of a pendulum. Pre-BERT neural ranking models were characterized by a diversity of designs, utilizing a wide range of convolutional and recurrent components. In the move from pre-BERT interaction-based ranking models to monoBERT, all these architectural components became subsumed by the all-to-all attention mechanisms in BERT. For example, convolutional filters with different widths and strides no longer appeared to be necessary, replaced in monoBERT by architecturally homogeneous transformer layers. However, we are now witnessing the reintroduction of specialized components to explicitly capture intuitions important for ranking—for example, the hierarchical design of PARADE (see Section 3.3.4) and the reintroduction of similarity matrices in TK/CK (see Section 3.5.2).

These points apply equally to ranking with learned dense representations. Current models either “apply” off-the-shelf transformers with minimal manipulations of their output (e.g., mean pooling in Sentence-BERT) or “adapt” the output of off-the-shelf transformers with other architectural components (e.g., poly-encoders). In principle, it would be possible to completely “redesign” transformer architectures for ranking using dense representations, similar to the motivation of TK/CK for reranking. This would be an interesting path to pursue.

So, does the future lie with apply, adapt, or redesign? All three approaches are promising, and we see the community continuing to pursue all three paths moving forward. Finally, there is the possibility that the answer is actually “none of the above”! The very premise of this survey (i.e., transformer models) has been called into question: echoing the “pendulum” theme discussed above, some researchers are re-examining CNNs [Tay et al., 2021] and even MLPs [Liu et al., 2021] for NLP tasks. Specifically for text ranking, Boytsov and Kolter [2021] explored the use of a pre-neural lexical translation model for evidence aggregation, arguing for improved interpretability as well as a better effectiveness/efficiency tradeoff. We don’t see transformers becoming obsolete in the near future, but it is likely that one day we will move beyond such architectures.

Multi-Stage Ranking and Representation Learning: What’s the Connection? While the organization of this survey might suggest that multi-stage ranking and dense retrieval are distinct threads of work, we believe that moving forward these two threads will become increasingly intertwined. Recall that one motivation for ranking with learned dense representations is to replace an entire multi-stage ranking pipeline with a single retrieval stage that can be trained end to end. To some extent, this is a convenient fiction: for a comparison function φ more complex than inner products or a handful of other similarity functions, ranking is already multi-stage. ColBERT in the end-to-end setting, for example, uses an ANN library to first gather candidates that are then reranked, albeit with the authors’ proposed lightweight MaxSim operator (see Section 5.5.2). Furthermore, with any design based on inner products or a simple φ, we can further improve effectiveness by reranking its output with a cross-encoder, since by definition cross-encoders support more extensive query–document interactions than bi-encoders and thus can exploit richer relevance signals. In this case, we’re back to multi-stage ranking architectures!
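To make the retrieve-then-rerank pattern concrete, the sketch below uses the sentence-transformers library: a bi-encoder produces single-vector representations for first-stage retrieval by inner product, and a cross-encoder rescores the shortlist. The model names and the tiny in-memory corpus are illustrative placeholders, not the specific systems discussed above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

# Illustrative model choices (any bi-encoder/cross-encoder pair would do).
bi_encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["The Manhattan Project produced the first nuclear weapons.",
          "BM25 is a classic bag-of-words ranking function.",
          "Dense retrieval encodes queries and passages into vectors."]

doc_vectors = bi_encoder.encode(corpus)   # precomputed offline in practice

def search(query: str, k: int = 3):
    # Stage 1: bi-encoder retrieval by inner product (brute force here;
    # an ANN index such as HNSW would be used at scale).
    q_vec = bi_encoder.encode(query)
    candidates = np.argsort(-doc_vectors @ q_vec)[:k]
    # Stage 2: cross-encoder reranking of the shortlist.
    pairs = [(query, corpus[i]) for i in candidates]
    scores = cross_encoder.predict(pairs)
    order = np.argsort(-scores)
    return [(corpus[candidates[i]], float(scores[i])) for i in order]

print(search("what is dense retrieval?"))
```

At scale, the reranking depth k becomes one of the main knobs for trading effectiveness against query latency.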
Empirically, the best dense retrieval techniques to date are less effective than the best reranking architectures, for the simple reason discussed above—the output from dense retrieval techniques can be further reranked to improve effectiveness. RocketQA [Qu et al., 2021], near the top of the leaderboard for the MS MARCO passage ranking task, provides a great example: it starts with a state-of-the-art dense retrieval model (discussed in Section 5.4.3) and then applies further reranking. Put differently, in a multi-stage ranking architecture, we can replace first-stage retrieval based on sparse representations (e.g., bag-of-words BM25) with a dense retrieval model, or better yet, a hybrid approach that combines both dense and sparse relevance signals, such as many of the techniques discussed in Section 5.

In fact, replacing candidate generation using inverted indexes with candidate generation using approximate nearest neighbor search is an idea that can be applied independent of BERT. For example, Nakamura et al. [2019] began with a standard multi-stage design where BM25-based first-stage retrieval feeds DRMM for reranking and investigated replacing the first stage with approximate nearest-neighbor search based on representations from a deep averaging network [Iyyer et al., 2015]. Unfortunately, the end-to-end effectiveness was worse, but this was “pre-BERT”, prior to the advent of the latest transformer models. More recently, Tu et al. [2020] had more success replacing candidate generation using BM25 with candidate generation using dense vectors derived from the transformer-based Universal Sentence Encoder (USE) [Cer et al., 2018b]. They demonstrated that a multi-stage architecture with an ANN first stage can offer better tradeoffs between effectiveness and efficiency for certain tasks, particularly those involving shorter segments of text. We believe that there will always be multi-stage ranking architectures, since they can incorporate any innovation that adopts a single-stage approach and then attempt to improve upon its results with further reranking. In real-world applications, when the elegance of single-stage models and the advantages of end-to-end training bump up against the realities of requirements to deliver the best output quality under resource constraints, we suspect that the latter will generally win, “beauty” be damned.

There has been much research on learned dense representations for ranking, as we have covered in Section 5, and dense retrieval techniques have been demonstrated to be more effective than sparse retrieval techniques such as BM25 on standard benchmark datasets. However, this comparison is unfair, because we are comparing learned representations against representations that did not exploit training data; BM25 can be characterized as unsupervised. To better understand and categorize emerging retrieval techniques, Lin and Ma [2021] proposed a conceptual framework that identifies two dimensions of interest: the contrast between sparse and dense vector representations and the contrast between unsupervised and learned (supervised) representations. DPR, ANCE, and the techniques discussed in Section 5 can be classified as learned dense representations. BM25 can be classified as unsupervised sparse representations.
But of course, it is possible to learn sparse representations as well!154 One way to think about this idea is to understand learned dense representations as letting transformers “pick” the basis for their vector space to capture the “meaning” of texts. The dimensions of the resulting vectors can be thought of as capturing some latent semantic space. What if, as an alternative, we forced the encoder (still using transformers) to use the vocabulary of the corpus it is being trained on as the basis of its output representation? This is equivalent to learning weights on sparse bag-of-words representations. DeepCT (see Section 4.4) is one possible implementation, but its weakness is that terms that do not occur in the text receive a weight of zero, and thus the model cannot overcome vocabulary mismatch issues. This limitation was later addressed by DeepImpact (see Section 4.6), but there are other recent papers that build on the same intuitions—learning weights for sparse bag-of-words representations [Bai et al., 2020, Gao et al., 2021b, Zhao et al., 2021, Formal et al., 2021a, Lassance et al., 2021]. In the future, we suspect that learned representations (using transformers) will become the emphasis, while sparse vs. dense representations can be thought of as design choices manifesting different tradeoffs (and not the most important distinction). Once again, hybrids that combine sparse and dense signals might offer the best of both worlds.

In multi-stage ranking architectures, first-stage retrieval based on learned dense representations is already common. There is, however, nothing to prevent dense representations from being used in reranking models. In fact, there are already many such examples: Khattab and Zaharia [2020], Hofstätter et al. [2020], and others have already reported such reranking experimental conditions in their papers. EPIC [MacAvaney et al., 2020d] is a reranking model explicitly designed around dense representations. Such approaches often manifest different tradeoffs from rerankers based on cross-encoders: representations of texts from the corpus can be precomputed, and they support comparison functions (i.e., φ in our framework) that are more complex than a simple inner product. Such formulations of φ enable richer query–document interactions, but are usually more lightweight than transformer-based multi-layer all-to-all attention. Thus, rerankers based on dense representations present another option in a practitioner’s toolbox to balance effectiveness/efficiency tradeoffs.

154 Of course, this idea isn’t exactly new either! Zamani et al. [2018] explored learning sparse representations in the context of pre-BERT neural models. Going much further back, Wilbur [2001] attempted to learn global term weights using TREC data.

This brings us to a final direction for future work. In multi-stage approaches that mix sparse and dense representations—both in first-stage retrieval and downstream rerankers—mismatches between the distribution of the representations from different stages remain an issue (see discussion in Section 5.1). That is, the types of texts that a model is trained on (in isolation) may be very different from the types of texts it sees when inserted into a multi-stage architecture. We raised this issue in Section 3.2, although the mismatch between BM25-based first-stage retrieval and BERT’s contextual representations does not seem to have negatively impacted effectiveness.
In truth, however, the design of most experiments today does not allow us to effectively quantify the potential gains that can come from better aligning the stages, since we haven’t observed them in the first place. While there is previous work that examines how multi-stage ranking pipelines can be learned [Wang et al., 2010, Xu et al., 2012], there is little work in the context of transformer architectures specifically. A notable exception is the study by Gao et al. [2021a], who proposed simple techniques that allow downstream rerankers to more effectively exploit better first-stage results, but more studies are needed.

How to Rank Out-of-Distribution Data? Nearly all of the techniques presented in this survey are based on supervised learning, with the supervision signals ultimately coming from human relevance judgments (see Section 2.4). Although we have discussed many enhancements based on distant supervision, data augmentation, and related techniques, newly generated or gathered data still serve primarily as input to supervised learning methods for training reranking or dense retrieval models. Thus, a natural question to ponder: What happens if, at inference (query) time, the models are fed input that doesn’t “look like” the training data? These inputs can be “out-of-distribution” in at least three different ways:

• Different queries. The queries fed into the model differ from those the model encountered during training. For example, the training data could comprise well-formed natural language questions, but the model is applied to short keyword queries.
• Different texts from the corpus. The texts that comprise the units of retrieval are very different from those fed to the model during training. For example, a bi-encoder trained on web documents is fed scientific articles or case law.
• Different tasks. For example, a model trained with (query, relevant text) pairs might be applied in a community question answering context to retrieve relevant questions from a FAQ repository. This task is closer to paraphrase detection between two sentences (questions) than query–document relevance. Task mismatch often occurs when there is no training data available for the target task of interest (for example, in a specialized domain).

In many cases, the answer is: The model doesn’t perform very well on out-of-distribution data! Thus, there is a large body of work in NLP focused on addressing these challenges, falling under the banner of domain adaptation or transfer learning. Recently, “zero-shot learning” and “few-shot learning” have come into vogue. In the first case, trained models are directly applied to out-of-distribution data; in the second case, the model gets a few examples to learn from. Given that the standard “BERT recipe” consists of pretraining followed by fine-tuning, methods for addressing out-of-distribution challenges immediately present themselves. In fact, we have already discussed many of these approaches in Section 3.2.4 in the context of reranking models—for example, additional pretraining on domain-specific corpora to improve the base transformer and strategies for multi-step fine-tuning, perhaps enhanced with data augmentation. These techniques have been explored both for NLP tasks such as part-of-speech tagging and named-entity recognition and for information access tasks.
Specifically for information retrieval, the TREC-COVID challenge has provided a forum where many proposed solutions for domain adaptation have been deployed and evaluated. In 2020, the most significant event disrupting all aspects of life worldwide was, of course, the COVID-19 pandemic. Improved information access capabilities have an important role to play in the fight against this disease by providing stakeholders with high-quality information from the scientific literature to inform evidence-based decision making and to support insight generation. In the early stages of the pandemic, examples included public health officials assessing the efficacy of population-level interventions such as mask ordinances, physicians conducting meta-analyses to update care guidelines based on emerging clinical studies, and virologists probing the genetic structure of the virus to develop vaccines. As our knowledge of COVID-19 evolved and as the results of various studies became available, stakeholders needed to constantly re-assess current practices against the latest evidence, necessitating high-quality information access tools to sort through the literature.

One prerequisite to developing and rigorously evaluating these capabilities is a publicly accessible corpus that researchers can work with. As a response to this need, in March 2020 the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset (CORD-19) [Wang et al., 2020a], which is a curated corpus of scientific articles about COVID-19 and related coronaviruses (e.g., SARS and MERS) gathered from a variety of sources such as PubMed as well as preprint servers. The corpus is regularly updated as the literature grows.

The NIST-organized TREC-COVID challenge [Voorhees et al., 2020, Roberts et al., 2020],155 which began in April 2020 and lasted until August 2020, brought TREC-style evaluations to the CORD-19 corpus. The stated goal of the effort was to provide “an opportunity for researchers to study methods for quickly standing up information access systems, both in response to the current pandemic and to prepare for similar future events”. The challenge was organized into a series of “rounds”, each of which used a particular snapshot of the CORD-19 corpus. The evaluation topics comprised a broad range of information needs, from those that were primarily clinical in nature (e.g., “Are patients taking Angiotensin-converting enzyme inhibitors (ACE) at increased risk for COVID-19?”) to those focused on public health (e.g., “What are the best masks for preventing infection by Covid-19?”).

155 https://ir.nist.gov/covidSubmit/index.html

From a methodological perspective, TREC-COVID implemented a few distinguishing features that set it apart from other TREC evaluations. Each round contained both topics that were persistent (i.e., carried over from previous rounds) as well as new topics—the idea was to consider existing information needs in light of new evidence as well as to address emerging information needs. The TREC-COVID organizers adopted a standard pooling strategy for evaluating runs, but once an article was assessed, its judgment was never revised (even if contrary evidence later emerged). To avoid duplicate effort, the evaluation adopted a residual collection methodology, where previously judged articles were automatically removed from consideration. Thus, each round only considered articles that had not been examined before by a human assessor (on a per-topic basis); these were either newly published articles or existing articles that had not been previously submitted as part of a run.
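The residual collection methodology is straightforward to implement when preparing a run for submission; a minimal sketch (the document ids and judged sets are placeholders):

```python
def residual_run(ranked_docids, judged_docids, depth=1000):
    """Remove previously judged documents (for this topic) from a ranked list,
    keeping the original order and truncating to the submission depth."""
    residual = [d for d in ranked_docids if d not in judged_docids]
    return residual[:depth]

# Toy usage: documents judged in earlier rounds are filtered out.
print(residual_run(["d3", "d9", "d1", "d7"], judged_docids={"d9", "d1"}))
# -> ['d3', 'd7']
```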
Round 1 began with 30 topics, and each subsequent round introduced five additional topics, for a total of 50 topics in round 5. This evaluation methodology had some interesting implications. On the one hand, each round essentially stood as a “mini-evaluation”, in the sense that scores across rounds are not comparable: both the corpora and the topics were different. On the other hand, partial overlaps in both topics and corpora across rounds connected them. In particular, for the persistent information needs, relevance judgments from previous rounds could be exploited to improve the effectiveness of systems in future rounds on the same topic. Runs that took advantage of these relevance judgments were known as “feedback” runs, in contrast to “automatic” runs that did not.

Overall, the TREC-COVID challenge was a success in terms of participation. The first round had over 50 participating teams from around the world, and although the participants dwindled somewhat as the rounds progressed, round 5 still had close to 30 participating teams. For reference, a typical “successful track” at TREC might draw around 20 participating teams.

The TREC-COVID challenge is of interest because it represented the first large-scale evaluation of information access capabilities in a specialized domain following the introduction of BERT. As expected, the evaluation showcased a variety of transformer-based models. Since all participants began with no in-domain relevance judgments, the evaluation provided an interesting case study in rapid domain adaptation. The multi-round setup allowed teams to improve system output based on previous results, to train their models using newly available relevance judgments, and to refine their methods based on accumulated experience. The biggest challenge was the paucity of labeled training examples: on a per-topic basis, there were only a few hundred total judgments (both positive and negative) per round. Overall, the evaluation realistically captured information access challenges in a rapidly evolving specialized domain. The nature of the pandemic and the task design meant that research, system development, and evaluation efforts were intense and compressed into a short time span, thus leading to rapid advances. As a result, innovations diffused from group to group much faster than under normal circumstances. We summarize some of the important lessons learned below:

Ensembles and fusion techniques work well. Many teams submitted runs that incorporated the output of different retrieval methods. Some of these were relatively simple, for example, exact match scoring against different representations of the articles (e.g., abstracts, full texts, and paragraphs from the full text). Other sources of fusion involved variants of BERT-based models or transformer-based rerankers applied to different first-stage retrieval approaches, e.g., Bendersky et al. [2020]. Simple fusion techniques such as reciprocal rank fusion [Cormack et al., 2009] or linear combinations [Vogt and Cottrell, 1999] were effective and robust, with few or no “knobs” to tune and therefore less reliant on training data. In the earlier rounds, this was a distinct advantage as all the teams were equally inexperienced in working with the corpus.
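Reciprocal rank fusion itself is simple to implement and has no learned parameters; a minimal sketch, with k = 60 being the constant suggested by Cormack et al. [2009]:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (best first) with RRF:
    each document receives 1 / (k + rank) from every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse a sparse run, a dense run, and a reranked run.
print(reciprocal_rank_fusion([["d1", "d2", "d3"],
                              ["d3", "d1", "d4"],
                              ["d2", "d3", "d1"]]))
```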
In the first round, for example, the best automatic run was submitted by the sabir team, who combined evidence from bag-of-words vector-space retrieval against abstracts and full text using a linear combination. Even in the later rounds, ensembles and fusion techniques still provided a boost over individual transformer-based ranking models. Some sort of fusion technique was adopted by nearly all of the top-scoring runs across all rounds. While the effectiveness of ensemble and fusion techniques is well known, e.g., [Bartell et al., 1994, Montague and Aslam, 2002], replicated findings in new contexts still contribute to our overall understanding of the underlying techniques.

Simple domain adaptation techniques work well with transformers. Even prior to the COVID-19 pandemic, NLP researchers had already built and shared variants of BERT that were pretrained on scientific literature. SciBERT [Beltagy et al., 2019] and BioBERT [Lee et al., 2020b] are two well-known examples, and many TREC-COVID participants built on these models. Xiong et al. [2020a] demonstrated that the target corpus pretraining (TCP) technique described in Section 3.2.4 also worked for TREC-COVID. In terms of fine-tuning BERT-based reranking models for TREC-COVID, MacAvaney et al. [2020a] proposed an approach to automatically create (pseudo) in-domain training data from a larger general dataset. The idea was to filter the MS MARCO passage ranking test collection and retain only queries that contain at least one term from the MedSyn lexicon [Yates and Goharian, 2013]. That is, the authors used simple dictionary filtering to create a “medical subset” of the MS MARCO passage ranking test collection, dubbed Med-MARCO, which was then used to fine-tune a monoBERT model based on SciBERT [Beltagy et al., 2019]. In the first round, this run was the second highest scoring automatic run, but alas, it was still not as effective as the simple bag-of-words fusion run from the sabir team mentioned above. Data selection tricks for domain adaptation are not new [Axelrod et al., 2011], but MacAvaney et al. demonstrated a simple and effective technique that was quickly adopted by many other participants in subsequent rounds. Reinforcement learning has also been proposed to select better examples to train rerankers [Zhang et al., 2020d]. The technique, dubbed ReInfoSelect, was successfully applied by Xiong et al. [2020a], helping the team achieve the best feedback submission in round 2. Another interesting method to create synthetic in-domain labeled data was used by team unique_ptr, who generated (query, relevant text) pairs from CORD-19 articles using a model similar to doc2query and then trained a dense retrieval model using these generated pairs [Ma et al., 2021a]. The team submitted the best feedback runs (and top-scoring runs overall) in rounds 4 and 5, which incorporated this data generation approach in hybrid ensembles [Bendersky et al., 2020].

Or just train a bigger model? As an alternative to the domain adaptation techniques discussed above, we could just build bigger models. For a wide range of NLP tasks, the GPT family [Brown et al., 2020] continues to push the frontiers of larger models, more compute, and more data.
While this approach has a number of obvious problems that are beyond the scope of this discussion, it nevertheless demonstrates impressive effectiveness on a variety of natural language tasks, both in a zero-shot setting and when prompted with only a few examples. For TREC-COVID, the covidex team [Zhang et al., 2020a] deployed an architecture comprising doc2query–T5 for document expansion (see Section 4.3) and a reranking pipeline comprising monoT5/duoT5 [Pradeep et al., 2021b] (see Section 3.5.3). Their approach with T5-3B (where 3B refers to 3 billion parameters) yielded the best automatic runs for rounds 4 and 5, accomplished in a zero-shot setting since the models were trained only on MS MARCO passage data. In other words, they just trained a larger model with out-of-distribution data. Could this be another successful approach to domain adaptation?

Learning with limited data remains a weakness of transformers. In the later rounds, we see that automatic runs based on transformers outperformed non-transformer runs by large margins, whereas in many cases feedback runs based on transformers barely beat their non-transformer competition. In fact, simple relevance feedback techniques were quite competitive with transformer-based approaches. For example, in round 2, a feedback run by the UIowaS team, which can be characterized as off-the-shelf relevance feedback, achieved the third highest score in that run category. Although two BERT-based feedback runs from the mpiid5 team outperformed this relevance feedback approach, the margins were quite slim. One possible explanation for these small differences is that we are reaching the inter-annotator agreement “limit” of this corpus with this set of topics, i.e., that results from top-performing systems are already good enough that relevance judgments from human annotators cannot confidently distinguish which is better. As another example, the covidex team [Zhang et al., 2020a, Han et al., 2021] implemented an approach that treated relevance feedback as a document classification problem using simple linear classifiers [Cormack and Mojdeh, 2009, Grossman and Cormack, 2017, Yu et al., 2019]. In both rounds 4 and 5, it was only narrowly beaten by the large-scale hybrid ensembles of team unique_ptr [Bendersky et al., 2020]. It seems that researchers have yet to figure out how to exploit small numbers of labeled examples to improve effectiveness. How to fine-tune BERT and other transformer models with limited data remains an open question, not only for text ranking, but across other NLP tasks as well [Zhang et al., 2020e, Lee et al., 2020a].

With a few notable exceptions, participants in the TREC-COVID challenge focused mostly on reranking architectures. However, as we have already discussed in Section 5.7, the same out-of-distribution issues are present with learned dense representations as well. The recent BEIR benchmark [Thakur et al., 2021], already discussed, has shown that, when applied in a zero-shot manner to diverse domains, dense retrieval techniques trained on MS MARCO data are overall less effective than BM25. Addressing the generalizability and robustness of both reranking and dense retrieval techniques for out-of-distribution texts and queries is an important future area of research.

How to Move Beyond Ranking in English? It goes without saying that the web is multilingual and that speakers of all languages have information needs that would benefit from information access technologies.
Yet, the techniques discussed in this survey have focused on English. As a research community, we should broaden the scope of exploration: not only would studies focused on multilinguality be technically interesting, they could also be impactful in improving the lives of users around the world.

Attempts to break the language barrier in information access can be divided into two related efforts: mono-lingual retrieval in non-English languages and cross-lingual retrieval. In the first scenario, we would like to support non-English speakers searching in their own languages—for example, Urdu queries retrieving from Urdu documents. Of course, Urdu ranking models can be built if there are sufficient resources (test collections) in Urdu, as many supervised machine-learning techniques for information retrieval are language agnostic. However, as we have already discussed (see Section 2.1), building test collections is an expensive endeavor, and thus constructing such resources language by language is not a cost-effective solution if we wish to support the six thousand languages that are spoken in the world today. Can we leverage relevance judgments and data that are available in high-resource languages (English, for example) to benefit languages for which we lack sufficient resources?

The second information access scenario is cross-lingual retrieval, where the language of the query and the language of the documents differ. Such technology, especially coupled with robust machine translation, can unlock stores of knowledge that users would not otherwise have access to. For example, Bengali speakers in India can search for information in English web pages, and a machine translation system can then translate the pages into Bengali for the users to consume. Even with imperfect translations, it is still possible to convey the gist of the English content, which is obviously better than nothing if the desired information doesn’t exist in Bengali. Note that cross-lingual retrieval techniques can also benefit speakers of English and other high-resource languages: for example, in Wikipedia, it is sometimes the case that “localized versions” of articles contain more information than the English versions. The Hungarian-language article about a not-very-well-known Hungarian poet or a location in Hungary might contain more information than the English version of the article. In this case, English speakers can benefit from cross-lingual retrieval techniques searching in Hungarian.

Explorations of multilingual applications of BERT for information access are well underway. Google’s October 2019 blog post156 announcing the deployment of BERT (which we referenced in the introduction) offered some tantalizing clues:

We’re also applying BERT to make Search better for people across the world. A powerful characteristic of these systems is that they can take learnings from one language and apply them to others. So we can take models that learn from improvements in English (a language where the vast majority of web content exists) and apply them to other languages. This helps us better return relevant results in the many languages that Search is offered in.
For featured snippets, we’re using a BERT model to improve featured snippets in the two dozen countries where this feature is available, and seeing significant improvements in languages like Korean, Hindi and Portuguese.

156 https://www.blog.google/products/search/search-language-understanding-bert/

Regarding the first point, what Google was referring to may be something along the lines of what Shi and Lin [2019] (later appearing as Shi et al. [2020]) and MacAvaney et al. [2020f] demonstrated around November 2019. For example, the first paper presented experimental results using an extension of Birch (see Section 3.3.1) showing that multilingual BERT is able to transfer models of relevance across languages. Specifically, it is possible to train BERT ranking models with English data to improve ranking quality in (non-English) mono-lingual retrieval as well as cross-lingual retrieval, without any special processing. These findings were independently verified by the work of MacAvaney et al. The second point in Google’s blog post likely refers to multi-lingual question answering, where the recent introduction of new datasets has helped spur renewed interest in this challenge [Cui et al., 2019, Liu et al., 2019a, Clark et al., 2020a, Asai et al., 2021]. Although some of the early neural ranking approaches did explore cross-lingual retrieval [Vulić and Moens, 2015] and new research on this topic continues to emerge in the neural context [Yu and Allan, 2020, Phang et al., 2020], we have not found enough references in the context of transformers to warrant a detailed treatment in a dedicated section. However, moving forward, this is fertile ground for exploration.

From Transformers for Ranking to Ranking for Transformers? This survey is mostly about applications of transformers to text ranking, that is, how pretrained models can be adapted in service of information access tasks. However, there is an emerging thread of work, exemplified by REALM [Guu et al., 2020], that seeks to integrate text retrieval and text ranking directly into model pretraining. The idea is based on the observation that BERT and other pretrained models capture a surprisingly large number of facts, simply as a side effect of the masked language model objective [Petroni et al., 2019]. Is it possible to better control this process so that facts are captured in a more modular and interpretable way? The insight of REALM is that prior to making a prediction about a masked token, the model can retrieve and attend over related documents from a large corpus such as Wikipedia. Retrieval is performed using dense representations like those discussed in Section 5. Similar intuitions have also been explored by others. For example, Wang and McAllester [2020] viewed information retrieval techniques as a form of episodic memory for augmenting GPT-2. In the proposal of Wu et al. [2020a], a “note dictionary” saves the context of a rare word during pretraining, such that when the rare word is encountered again, the saved information can be leveraged. Other examples building on similar intuitions include the work of Lewis et al. [2020a] and Du et al. [2021]. Thus, the question is not only “What can transformers do for text ranking?” but also “What can text ranking do for transformers?” We have some initial answers already, and no doubt, future developments will be exciting.

Is Everything a Remix? We have seen again and again throughout this survey that much recent work seems to be primarily adaptations of old ideas, many of which are decades old.
For example, monoBERT, which heralded the BERT revolution for text ranking, is just pointwise relevance classification—dating back to the late 1980s [Fuhr, 1989]—but with more powerful models. To be clear, we don’t think there is anything “wrong” (or immoral, or unethical, etc.) with recycling old ideas: in fact, the filmmaker Kirby Ferguson famously claimed that “everything is a remix”. He primarily referred to creative endeavors such as music, but the observation applies to science and technology as well. Riffing off Picasso’s quote “Good artists copy, great artists steal”, Steve Jobs once said, “We have always been shameless about stealing great ideas”.157

The concern arises, however, when we lose touch with the rich body of literature that defines our past, simply because that previous work didn’t use deep learning. In “water cooler conversations” around the world and discussions on social media, (more senior) researchers who were trained before the advent of deep learning often complain, and only partly tongue-in-cheek, that most students today don’t believe that natural language processing existed before neural networks. It is not uncommon to find deep learning papers today that cite nothing but other deep learning papers, and nothing before the early 2010s. Isaac Newton is famous for saying “If I have seen further than others, it is by standing upon the shoulders of giants.” We shouldn’t forget whose shoulders we’re standing on, but unfortunately, often we do.158

On a practical note, this means that there are likely still plenty of gems in the literature hidden in plain sight; that is, old ideas that everyone has forgotten but that have acquired new relevance in the modern context. It is likely that many future innovations will be remixes!

6.3 Final Thoughts

At last, we have come to the end of our survey. Information access problems have challenged civilizations since shortly after the invention of writing, when humankind’s collective knowledge outgrew the memory of its elders. Although the technologies have evolved over the millennia, from clay tablets to scrolls to books, and now electronic information that is “born” and stored digitally, the underlying goals have changed little: we desire to develop tools, techniques, and processes to address users’ information needs. The academic locus of this quest with computers, which resides in the information retrieval and natural language processing communities, has only been around for roughly three quarters of a century—a baby in comparison to other academic disciplines (say, physics or chemistry). We can trace the evolution of information retrieval through major phases of development (exact match, learning to rank, pre-BERT neural networks), as described in the introduction. No doubt we are currently in the “age” of BERT and transformers.159 Surely, there will emerge new technologies that completely supplant these models, ushering in the dawn of a new age. Nevertheless, while we wait for the next revolution to happen, there is still much exploration left to be done with transformers; these explorations may plant the seeds of, or inspire, what comes next. We hope that this survey provides a roadmap for these explorers.

157 That is, until his innovations get stolen. Steve Jobs is also reported to have said, “I’m going to destroy Android, because it’s a stolen product. I’m willing to go thermonuclear war on this.”
158 For this reason, we have taken care throughout this survey not just to cite the most recent (and conveniently locatable) reference for a particular idea, but to trace back its intellectual history. In some cases, this has involved quite extensive and interesting “side quests” involving consultations with senior researchers who have firsthand knowledge of the work (e.g., worked in the same lab where the idea was developed)—in essence, oral histories. We are confident to differing degrees that we have properly attributed various ideas, and welcome feedback from readers to the contrary. We believe it is important to “get this right”.

159 Final footnote: or the “age of muppets”, as some have joked.

Acknowledgements

This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. In addition, we would like to thank the TPU Research Cloud for resources used to obtain new results in this work. We’d like to thank the following people for comments on earlier drafts of this work: Chris Buckley, Danqi Chen, Maura Grossman, Sebastian Hofstätter, Kenton Lee, Sheng-Chieh Lin, Xueguang Ma, Bhaskar Mitra, Jheng-Hong Yang, Scott Yih, and Ellen Voorhees. Special thanks go out to two anonymous reviewers for their insightful comments and helpful feedback.

Version History

Version 0.90 — October 14, 2020
Initial release.

Version 0.95 — July 28, 2021
Major revisions include:

• Broke out preprocessing and document expansion techniques from Section 3 into a new Section 4 titled “Refining Query and Document Representations”, which includes (new) discussion of query expansion techniques.
• Rewrote Section 5 “Learned Dense Representations for Ranking” to incorporate new developments in dense retrieval.
• Reduced emphasis on “Domain-Specific Applications” as a standalone subsection in Section 3, with most of the content now interwoven throughout the rest of the survey.
• Redrew all diagrams and figures throughout the book for a consistent look.

Substantive differences from version 0.90 are marked with the majorchange custom LaTeX command. Readers specifically interested in these edits can recompile this survey with an alternate definition of the command that renders the changes in blue. Note that it is not the case that parts not marked with majorchange remain unchanged from the previous version. The entire survey went through several rounds of copy editing, but we have not marked changes unless the text was in our opinion substantially altered.

Version 0.99 — August 19, 2021
This is the final preproduction version shipped to the publisher. There are no major changes in content compared to the previous version; we added references to a few more recently published papers and further copy edited the manuscript to refine the prose. The majorchange “annotations” remain largely unchanged from version 0.95.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pages 265–283, 2016.

N. Abdul-Jaleel, J. Allan, W. B. Croft, F. Diaz, L. Larkey, X. Li, D. Metzler, M. D.
Smucker, T. Strohman, H. Turtle, and C. Wade. UMass at TREC 2004: Novelty and HARD. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland, 2004.

A. Agarwal, K. Takatsu, I. Zaitsev, and T. Joachims. A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 5–14, Paris, France, 2019.

E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the 5th ACM International Conference on Digital Libraries (DL 2000), pages 85–94, San Antonio, Texas, 2000.

E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada, 2012.

Q. Ai, X. Wang, S. Bruch, N. Golbandi, M. Bendersky, and M. Najork. Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 85–92, Santa Clara, California, 2019.

Z. Akkalyoncu Yilmaz. Cross-domain sentence modeling for relevance transfer with BERT. Master’s thesis, University of Waterloo, 2019.

Z. Akkalyoncu Yilmaz, S. Wang, and J. Lin. H2oloo at TREC 2019: Combining sentence and document evidence in the deep learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019), Gaithersburg, Maryland, 2019a.

Z. Akkalyoncu Yilmaz, W. Yang, H. Zhang, and J. Lin. Cross-domain modeling of sentence-level evidence for document retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3490–3496, Hong Kong, China, 2019b.

C. Alberti, K. Lee, and M. Collins. A BERT baseline for the natural questions. arXiv:1901.08634, 2019.

J. Allan. Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

J. Allan, B. Carterette, and J. Lewis. When will information retrieval be “good enough”? User effectiveness as a function of retrieval accuracy. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), pages 433–440, Salvador, Brazil, 2005.

J. Allan, D. Harman, E. Kanoulas, D. Li, C. V. Gysel, and E. Voorhees. TREC 2017 common core track overview. In Proceedings of the Twenty-Sixth Text REtrieval Conference (TREC 2017), Gaithersburg, Maryland, 2017.

J. Allan, D. Harman, E. Kanoulas, and E. Voorhees. TREC 2018 common core track overview. In Proceedings of the Twenty-Seventh Text REtrieval Conference (TREC 2018), Gaithersburg, Maryland, 2018.

U. Alon, R. Sadaka, O. Levy, and E. Yahav. Structural language models of code. arXiv:1910.00577, 2020.

H. Alshawi, A. L. Buchsbaum, and F. Xia. A comparison of head transducers and transfer for a limited domain translation application. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 360–365, Madrid, Spain, 1997.

G. Amati. Probabilistic Models of Information Retrieval Based on Divergence from Randomness.
PhD thesis, University of Glasgow, 2003. G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4):357–389, 2002. G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. FUB, IASI-CNR and University of Tor Vergata at TREC 2007 blog track. In Proceedings of the Sixteenth Text REtrieval Conference (TREC 2007), Gaithersburg, Maryland, 2007. V. N. Anh and A. Moffat. Impact transformation: Effective and efficient web retrieval. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 3–10, Tampere, Finland, 2002. V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Pro- ceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), pages 35–42, New Orleans, Louisiana, 2001. T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Improvements that don’t add up: Ad-hoc retrieval results since 1998. In Proceedings of the 18th International Conference on Information and Knowledge Management (CIKM 2009), pages 601–610, Hong Kong, China, 2009. S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 2017. M. Artetxe, S. Ruder, and D. Yogatama. On the cross-lingual transferability of monolingual rep- resentations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, 2020. N. Asadi and J. Lin. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pages 997–1000, Dublin, Ireland, 2013. A. Asai, J. Kasai, J. Clark, K. Lee, E. Choi, and H. Hajishirzi. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564, June 2021. A. Axelrod, X. He, and J. Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, 2011. L. Azzopardi, M. Crane, H. Fang, G. Ingersoll, J. Lin, Y. Moshfeghi, H. Scells, P. Yang, and G. Zuccon. The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), pages 1429–1430, Tokyo, Japan, 2017a. L. Azzopardi, Y. Moshfeghi, M. Halvey, R. S. Alkhawaldeh, K. Balog, E. Di Buccio, D. Cecca- relli, J. M. Fernández-Luna, C. Hull, J. Mannix, and S. Palchowdhury. Lucene4IR: Developing information retrieval evaluation resources using Lucene. SIGIR Forum, 50(2):58–75, 2017b. J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montréal, Canada, 2014. Y. Bai, X. Li, G. Wang, C. Zhang, L. Shang, J. Xu, Z. Wang, F. Wang, and Q. Liu. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv:2010.00768, 2020.</p><p>174 P. Bailey, N. Craswell, I. Soboroff, P. Thomas, A. P. 
de Vries, and E. Yilmaz. Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), pages 667–674, Singapore, 2008. P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3, 2018. M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France, 2001. C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597–604, Ann Arbor, Michigan, 2005. O. Barkan, N. Razin, I. Malkiel, O. Katz, A. Caciularu, and N. Koenigstein. Scalable attentive sentence-pair modeling via distilled sentence embedding. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pages 3235–3242, New York, New York, 2020. B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pages 173–181, Dublin, Ireland, 1994.</p><p>P. Baudiš and J. Šedivý. Modeling of the question answering task in the YodaQA system. In Pro- ceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228, Toulouse, France, 2015. M. Bawa, T. Condie, and P. Ganesan. LSH forest: Self tuning indexes for similarity search. In Proceedings of the 14th International World Wide Web Conference (WWW 2005), pages 651–660, Chiba, Japan, 2005. BBC. Ask Jeeves bets on smart search, 2004. URL http://news.bbc.co.uk/2/hi/technology/ 3686978.stm. N. J. Belkin. Anomalous states of knowledge as a basis for information retrieval. Canadian Journal of Information Science, 5:133–143, 1980. N. J. Belkin and W. B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29–38, 1992. N. J. Belkin, R. N. Oddy, and H. M. Brooks. ASK for information retrieval: Part I. Background and theory. Journal of Documentation, 38(2):61–71, 1982a. N. J. Belkin, R. N. Oddy, and H. M. Brooks. ASK for information retrieval: Part II. Results of a design study. Journal of Documentation, 38(3):145–164, 1982b. I. Beltagy, K. Lo, and A. Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611, 2019. I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020. M. Bendersky and W. B. Croft. Discovering key concepts in verbose queries. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), pages 491–498, Singapore, 2008. M. Bendersky, H. Zhuang, J. Ma, S. Han, K. Hall, and R. McDonald. RRF102: Meeting the TREC-COVID challenge with a 100+ runs ensemble. arXiv:2010.00200, 2020.</p><p>175 Y. Bengio, J. Louradour, R. Collobert, and J. 
Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 41–48, Montréal, Canada, 2009. J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question–answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1533–1544, Seattle, Washington, 2013. A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), pages 222–229, Berkeley, California, 1999. C. Bhagavatula, S. Feldman, R. Power, and W. Ammar. Content-based citation recommendation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 238–251, New Orleans, Louisiana, 2018. P. Boldi and S. Vigna. MG4J at TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), Gaithersburg, Maryland, 2005. L. Boualili, J. G. Moreno, and M. Boughanem. MarkedBERT: Integrating traditional IR cues in pre-trained language models for passage retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pages 1977–1980, 2020. F. Boudin, Y. Gallina, and A. Aizawa. Keyphrase generation for scientific document retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1118–1126, 2020. S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, 2015. L. Boytsov and Z. Kolter. Exploring classic and neural lexical translation models for information retrieval: Interpretability, effectiveness, and efficiency benefits. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, pages 63–78, 2021. T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, Prague, Czech Republic, 2007. T. L. Brauen, R. C. Holt, and T. R. Wilcox. Document indexing based on relevance feedback. In G. Salton, editor, Scientific Report No. ISR-14: Information Storage and Retrieval. Cornell University, Ithaca, New York, 1968. S. Brin. Extracting patterns and relations from the World Wide Web. In Proceedings of the WebDB Workshop—International Workshop on the Web and Databases, at EDBT ’98, 1998. P. F. Brown, J. Cocke, S. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few- shot learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 1877–1901, 2020. C. 
Buckley. Implementation of the SMART information retrieval system. Department of Computer Science TR 85-686, Cornell University, 1985.</p><p>176 C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), pages 25–32, Sheffield, United Kingdom, 2004. C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Information Retrieval, 10(6):491–508, 2007. C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical Report MSR-TR-2010-82, Microsoft Research, 2010. C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 89–96, Bonn, Germany, 2005. R. D. Burke, K. J. Hammond, V. A. Kulyukin, S. L. Lytinen, N. Tomuro, and S. Schoenberg. Question answering from frequently-asked question files: Experiences with the FAQ Finder system. Technical Report TR-97-05, University of Chicago, 1997. M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at Twitter. In Proceedings of the 28th International Conference on Data Engineering (ICDE 2012), pages 1360–1369, Washington, D.C., 2012. V. Bush. As we may think. Atlantic Monthly, 176(1):101–108, 1945. G. Cabanac, G. Hubert, M. Boughanem, and C. Chrisment. Tie-breaking bias: Effect of an uncon- trolled parameter on information retrieval evaluation. In CLEF 2010: Multilingual and Multimodal Information Access Evaluation, LNCS 6360, pages 112–123, Padua, Italy, 2010. J. P. Callan. Passage-level evidence in document retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pages 302–310, Dublin, Ireland, 1994. A. Câmara and C. Hauff. Diagnosing BERT with retrieval heuristics. In Proceedings of the 42nd European Conference on Information Retrieval, Part I (ECIR 2020), pages 605–618, 2020. B. B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411–420, New York, New York, 2010. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 129–136, Corvalis, Oregon, 2007. G. Capannini, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, and N. Tonellotto. Quality versus efficiency in document scoring with learning-to-rank models. Information Processing and Management, 52(6):1161–1177, 2016. C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44(1):Article No. 1, 2012. M.-A. Cartright, S. Huston, and H. Feild. Galago: A modular distributed processing and retrieval system. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval, pages 25–31, Portland, Oregon, 2012. D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. 
In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, 2017. D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil. Universal sentence encoder. arXiv:1803.11175, 2018a.</p><p>177 D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil. Universal sentence encoder for English. In Pro- ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium, 2018b.</p><p>W.-C. Chang, F. X. Yu, Y.-W. Chang, Y. Yang, and S. Kumar. Pre-training tasks for embedding- based large-scale retrieval. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020.</p><p>D. Chen and W.-t. Yih. Open-domain question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 34–37, July 2020.</p><p>D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, 2017a.</p><p>J. Chen, L. Yang, K. Raman, M. Bendersky, J.-J. Yeh, Y. Zhou, M. Najork, D. Cai, and E. Emadzadeh. DiPair: Fast and accurate distillation for trillion-scale text matching and pair modeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2925–2937, 2020.</p><p>Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada, 2017b.</p><p>R.-C. Chen, L. Gallagher, R. Blanco, and J. S. Culpepper. Efficient cost-aware cascade ranking in multi-stage retrieval. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), pages 445–454, Tokyo, Japan, 2017c.</p><p>S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pages 310–318, Santa Cruz, California, 1996.</p><p>X. Chen, B. He, K. Hui, L. Sun, and Y. Sun. Simplified TinyBERT: Knowledge distillation for document retrieval. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part II, pages 241–248, 2021.</p><p>R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019.</p><p>E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, and L. Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium, 2018.</p><p>J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020a.</p><p>K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT’s attention. 
In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, 2019.</p><p>K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020b.</p><p>C. L. A. Clarke, G. Cormack, and E. Tudhope. Relevance ranking for one to three term queries. Information Processing and Management, 36(2):291–311, 2000.</p><p>C. L. A. Clarke, J. S. Culpepper, and A. Moffat. Assessing efficiency–effectiveness tradeoffs in multi-stage retrieval systems without using relevance judgments. Information Retrieval, 19(4): 351–377, 2016.</p><p>178 G. V. Cormack and M. Mojdeh. Machine learning for information retrieval: TREC 2009 web, relevance feedback and legal tracks. In Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), Gaithersburg, Maryland, 2009. G. V. Cormack, C. L. A. Clarke, and S. Büttcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 758–759, Boston, Massachusetts, 2009. G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011. M. Crane, A. Trotman, and R. O’Keefe. Maintaining discriminatory power in quantized indexes. In Proceedings of 22nd International Conference on Information and Knowledge Management (CIKM 2013), pages 1221–1224, San Francisco, California, 2013. M. Crane, J. S. Culpepper, J. Lin, J. Mackenzie, and A. Trotman. A comparison of document-at-a-time and score-at-a-time query evaluation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM 2017), pages 201–210, Cambridge, United Kingdom, 2017. N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees. Overview of the TREC 2019 deep learning track. arXiv:2003.07820, 2020. N. Craswell, B. Mitra, D. Campos, E. Yilmaz, and J. Lin. MS MARCO: Benchmarking ranking models in the large-data regime. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 1566– 1576, 2021a. N. Craswell, B. Mitra, E. Yilmaz, and D. Campos. Overview of the TREC 2020 deep learning track. arXiv:2102.07662, 2021b. F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. “Is this document relevant?... probably”: A survey of probabilistic models in information retrieval. ACM Computing Surveys, 30(4):528–552, 1999. W. B. Croft and D. J. Harper. Probabilistic models of document retrieval with relevance information. Journal of Documentation, 35(4):285–295, 1979. Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu. Cross-lingual machine reading comprehension. arXiv:1909.00361, 2019. A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 3079–3087, Montréal, Canada, 2015. Z. Dai and J. Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv:1910.10687, 2019a. Z. Dai and J. Callan. Deeper text understanding for IR with contextual neural language model- ing. 
In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 985–988, Paris, France, 2019b. Z. Dai and J. Callan. Context-aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020 (WWW 2020), pages 1897–1907, 2020. Z. Dai, C. Xiong, J. Callan, and Z. Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM 2018), pages 126–134, Marina Del Rey, California, 2018. J. Dalton, C. Xiong, and J. Callan. CAsT 2019: The conversational assistance track overview. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019), Gaithersburg, Maryland, 2019. H. T. Dang. Overview of DUC 2005. In Proceedings of the 2005 Document Understanding Conference (DUC 2005), 2005.</p><p>179 C. De Boom, S. V. Canneyt, T. Demeester, and B. Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80(C):150–156, 2016. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150, San Francisco, California, 2004. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the Association for Information Science, 41(6):391–407, 1990. M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), pages 65–74, Tokyo, Japan, 2017. M. Dehghani, A. Mehrjou, S. Gouws, J. Kamps, and B. Schölkopf. Fidelity-weighted learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada, 2018. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, S. Prabhumoye, A. W. Black, A. Rudnicky, J. Williams, J. Pineau, M. Burtsev, and J. Weston. The Second Conversational Intelligence Challenge (ConvAI2). arXiv:1902.00098, 2019. L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 13042–13054, Vancouver, Canada, 2019. C. dos Santos, X. Ma, R. Nallapati, Z. Huang, and B. Xiang. Beyond [CLS] through ranking by generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1722–1727, 2020. J. Du, E. Grave, B. Gunel, V. Chaudhary, O. Celebi, M. Auli, V. Stoyanov, and A. Conneau. Self- training improves pre-training for natural language understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5408–5418, 2021. M. Efron, P. Organisciak, and K. Fenlon. 
Improving retrieval of short texts through document expansion. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), pages 911–920, Portland, Oregon, 2012. Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9: 160–175, 2021. A. Elgohary, D. Peskov, and J. Boyd-Graber. Can you unpack that? Learning to rewrite questions- in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 5918–5924, Hong Kong, China, Nov. 2019. D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS 2009), pages 153–160, Clearwater Beach, Florida, 2009. K. Ethayarajh. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 91–100, Melbourne, Australia, 2018.</p><p>180 A. Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48, 2020. A. Fan, M. Lewis, and Y. Dauphin. Hierarchical neural story generation. arXiv:1805.04833, 2018a. Y. Fan, J. Guo, Y. Lan, J. Xu, C. Zhai, and X. Cheng. Modeling diverse relevance patterns in ad-hoc retrieval. In Proceedings of the 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), pages 375–384, Ann Arbor, Michigan, 2018b. H. Fang and C. Zhai. Semantic term matching in axiomatic approaches to information retrieval. In Pro- ceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 115–122, Seattle, Washington, 2006. H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), pages 49–56, Sheffield, United Kingdom, 2004. H. Fang, T. Tao, and C. Zhai. Diagnostic evaluation of information retrieval models. ACM Transac- tions on Information Systems, 29(2):1–42, 2011. Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou. CodeBERT: A pre-trained model for programming and natural languages. arXiv:2002.08155, 2020. N. Ferro and G. Silvello. Rank-biased precision reloaded: Reproducibility and generalization. In Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), pages 768–780, Vienna, Austria, 2015. T. Formal, B. Piwowarski, and S. Clinchant. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2288–2292, 2021a. T. Formal, B. Piwowarski, and S. Clinchant. A white box analysis of ColBERT. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part II, pages 257–263, 2021b. J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. 
In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, Louisiana, 2019. N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183–204, 1989. N. Fuhr. Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum, 51 (3):32–41, 2017. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971, 1987. R. Gaizauskas and A. M. Robertson. Coupling information retrieval and information extraction: A new text technology for gathering information from the web. In Proceedings of RIAO 97: Computer-Assisted Information Searching on the Internet, pages 356–370, Montréal, Canada, 1997. D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word embedding based generalized language model for information retrieval. In Proceedings of the 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), pages 795–798, Santiago, Chile, 2015. Y. Ganjisaffar, R. Caruana, and C. V. Lopes. Bagging gradient-boosted trees for high precision, low variance ranking models. In Proceedings of the 34rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 85–94, Beijing, China, 2011.</p><p>181 L. Gao and J. Callan. Is your language model ready for dense representation fine-tuning? arXiv:2104.08253, 2021a. L. Gao and J. Callan. Unsupervised corpus aware language model pre-training for dense passage retrieval. arXiv:2108.05540, 2021b. L. Gao, Z. Dai, and J. Callan. EARL: Speedup transformer-based rankers with pre-computed representation. arXiv:2004.13313, 2020a. L. Gao, Z. Dai, and J. Callan. Modularized transfomer-based ranking framework. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4180–4190, Nov. 2020b. L. Gao, Z. Dai, and J. Callan. Understanding BERT rankers under distillation. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval (ICTIR 2020), pages 149–152, 2020c. L. Gao, Z. Dai, Z. Fan, and J. Callan. Complementing lexical retrieval with semantic residual embedding. arXiv:2004.13969, 2020d. L. Gao, Z. Dai, and J. Callan. Rethink training of BERT rerankers in multi-stage retrieval pipeline. arXiv:2101.08751, 2021a. L. Gao, Z. Dai, and J. Callan. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3030–3042, 2021b. L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, and J. Callan. Complementing lexical retrieval with semantic residual embedding. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, pages 146–160, 2021c. S. Garg, T. Vu, and A. Moschitti. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pages 7780–7788, New York, New York, 2020. F. C. Gey. Inferring probability of relevance using the method of logistic regression. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pages 222–231, Dublin, Ireland, 1994. S. Ghemawat, H. 
Gobioff, and S.-T. Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 29–43, Bolton Landing, New York, 2003. A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming: Musical information retrieval in an audio database. In Proceedings of the Third ACM International Conference on Multimedia, pages 231–236, San Francisco, California, 1995. D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, Czech Republic, 2007. D. Gildea and D. Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3): 245–288, 2001. D. Gillick, A. Presta, and G. S. Tomar. End-to-end retrieval in continuous space. arXiv:1811.08008, 2018. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB 1999), pages 518–529, Edinburgh, Scotland, 1999. M. R. Grossman and G. V. Cormack. MRG_UWaterloo and WaterlooCormack participation in the TREC 2017 common core track. In Proceedings of the Twenty-Sixth Text REtrieval Conference (TREC 2017), Gaithersburg, Maryland, 2017.</p><p>182 Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain- specific language model pretraining for biomedical natural language processing. arXiv:2007.15779, 2020. J. Guo, Y. Fan, Q. Ai, and W. B. Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016), pages 55–64, Indianapolis, Indiana, 2016. S. Gururangan, A. Marasovic,´ S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020. K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM: Retrieval-augmented language model pre-training. arXiv:2002.08909, 2020. X. Han, Y. Liu, and J. Lin. The simplest thing that can possibly work: (pseudo-)relevance feedback via text classification. In Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2021), 2021. D. Harman. Information Retrieval Evaluation. Morgan & Claypool Publishers, 2011. D. Harman. Information retrieval: The early years. Foundations and Trends in Information Retrieval, 13(5):425–577, 2019. A. Haviv, J. Berant, and A. Globerson. BERTese: Learning to speak to BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3618–3623, 2021. H. He and J. Lin. Pairwise word interaction modeling with neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 937–948, San Diego, California, 2016. P. He, X. Liu, J. Gao, and W. Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv:2006.03654, 2020. M. A. Hearst and C. Plaunt. Subtopic structuring for full-length document access. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), pages 56–68, Pittsburgh, Pennsylvania, 1993. J. 
Henderson. The unstoppable rise of computational linguistics in deep learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6294–6306, 2020. M. Henderson, R. Al-Rfou, B. Strope, Y.-h. Sung, L. Lukacs, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil. Efficient natural language response suggestion for Smart Reply. arXiv:1705.00652, 2017. W. R. Hersh, A. Turpin, S. Price, B. Chan, D. Kramer, L. Sacherek, and D. Olson. Do batch and user evaluations give the same results? In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 17–24, Athens, Greece, 2000. G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015. S. Hofstätter and A. Hanbury. Let’s measure run time! Extending the IR replicability infrastructure to include performance aspects. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019): CEUR Workshop Proceedings Vol-2409, pages 12–16, Paris, France, 2019. S. Hofstätter, N. Rekabsaz, C. Eickhoff, and A. Hanbury. On the effect of low-frequency terms on Neural-IR models. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1137–1140, Paris, France, 2019.</p><p>183 S. Hofstätter, M. Zlabinger, and A. Hanbury. TU Wien @ TREC deep learning ’19 – simple contextualization for re-ranking. In Proceedings of the Twenty-Eight Text REtrieval Conference (TREC 2019), Gaithersburg, Maryland, 2019. S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, and A. Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv:2010.02666, 2020. S. Hofstätter, H. Zamani, B. Mitra, N. Craswell, and A. Hanbury. Local self-attention over long text for efficient document retrieval. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pages 2021–2024, 2020. S. Hofstätter, M. Zlabinger, and A. Hanbury. Interpretable & time-budget-constrained contextual- ization for re-ranking. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), pages 513–520, Santiago de Compostela, Spain, 2020. S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, and A. Hanbury. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 113–122, 2021. J. E. Holmstrom. Section III. Opening plenary session. In The Royal Society Scientific Information Conference, 1948. A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, Louisiana, 2019. N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), pages 2790–2799, Long Beach, California, 2019. E. M. Housman and E. D. Kaskela. State of the art in selective dissemination of information. IEEE Transactions on Engineering Writing and Speech, 13(2):78–83, 1970. J. Howard and S. Ruder. Universal language model fine-tuning for text classification. 
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, 2018. C.-C. Hsu, E. Lind, L. Soldaini, and A. Moschitti. Answer generation for retrieval-based question answering systems. arXiv:2106.00955, 2021. J.-T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang. Embedding-based retrieval in Facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2020), pages 2553–2561, 2020. P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of 22nd International Conference on Information and Knowledge Management (CIKM 2013), pages 2333–2338, 2013. K. Hui, A. Yates, K. Berberich, and G. de Melo. PACRR: A position-aware neural IR model for relevance matching. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1049–1058, Copenhagen, Denmark, 2017. K. Hui, A. Yates, K. Berberich, and G. de Melo. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM 2018), pages 279–287, Marina Del Rey, California, 2018. S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv:1905.01969v1, 2019. S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020.</p><p>184 J. Hutchins. “The whisky was invisible”, or persistent myths of MT. MT News International, 11: 17–18, 1995.</p><p>P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimen- sionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613, Dallas, Texas, 1998.</p><p>M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July 2015.</p><p>G. Izacard and E. Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv:2007.01282, 2020.</p><p>G. Izacard, F. Petroni, L. Hosseini, N. D. Cao, S. Riedel, and E. Grave. A memory efficient baseline for open domain question answering. arXiv:2012.15156, 2020.</p><p>J.-Y. Jiang, M. Zhang, C. Li, M. Bendersky, N. Golbandi, and M. Najork. Semantic text matching for long-form documents. In Proceedings of the 2019 World Wide Web Conference (WWW 2019), pages 795–806, San Francisco, California, 2019.</p><p>J.-Y. Jiang, C. Xiong, C.-J. Lee, and W. Wang. Long document ranking with query-directed sparse transformer. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4594–4605, 2020.</p><p>X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv:1909.10351, 2019.</p><p>R. Jin, A. G. Hauptmann, and C. X. Zhai. Title language model for information retrieval. 
In Proceed- ings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 42–48, Tampere, Finland, 2002.</p><p>T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002), pages 133–142, Edmonton, Canada, 2002.</p><p>T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM Transactions on Information Systems, 25(2):1–27, 2007.</p><p>T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM 2017), pages 781–789, Cambridge, United Kingdom, 2017.</p><p>J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. arXiv:1702.08734, 2017.</p><p>M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017.</p><p>C. Kamphuis, A. de Vries, L. Boytsov, and J. Lin. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020), pages 28–34, 2020.</p><p>J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.</p><p>V. Karpukhin, B. Oguz,˘ S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv:2004.04906, 2020a.</p><p>185 V. Karpukhin, B. Oguz,˘ S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020b. M. Kaszkiel and J. Zobel. Passage retrieval revisited. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1997), pages 178–185, Philadelphia, Pennsylvania, 1997. D. Kelly. Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2):1–224, 2009.</p><p>D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computa- tional Linguistics: EMNLP 2020, pages 1896–1907, 2020. O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pages 39–48, 2020. N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020. R. Kohavi, R. M. Henne, and D. Sommerfield. Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO. 
In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2007), pages 959–967, San Jose, California, 2007. O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China, 2019. T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018. T. S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, 1962. M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966, Lille, France, 2015. S. Kuzi, A. Shtok, and O. Kurland. Query expansion using word embeddings. In Proceedings of 25th International Conference on Information and Knowledge Management (CIKM 2016), pages 1929–1932, Indianapolis, Indiana, 2016. T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. K. L. Kwok. The use of title and cited titles as document representation for automatic classification. Information Processing and Management, 11(8-12):201–206, 1975. W. Lan and W. Xu. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3890–3902, Santa Fe, New Mexico, 2018. Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020. C. Lassance, T. Formal, and S. Clinchant. Composite code sparse autoencoders for first stage retrieval. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2136–2140, 2021.</p><p>186 V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), pages 120–127, New Orleans, Louisiana, 2001. Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1188–1196, Beijing, China, 2014. T. Le Scao and A. Rush. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2627–2636, 2021. C. Lee, K. Cho, and W. Kang. Mixout: Effective regularization to finetune large-scale pretrained language models. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020a. J. Lee, R. Tang, and J. Lin. What would Elsa do? 
Freezing layers during transformer fine-tuning. arXiv:1911.03090, 2019a. J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang. BioBERT: A pre-trained biomedical language representation model for biomedical <a href="/tags/Text_mining/" rel="tag">text mining</a>. Bioinformatics, 36(4):1234–1240, 2020b. K. Lee, M.-W. Chang, and K. Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, 2019b. F. Leibert, J. Mannix, J. Lin, and B. Hamadani. Automatic management of partitioned, replicated search services. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC ’11), Cascais, Portugal, 2011. M. E. Lesk and G. Salton. Relevance assessments and retrieval system evaluation. Information Storage and Retrieval, 4(4):343–359, 1968. D. D. Lewis. The TREC-4 filtering track. In Proceedings of the Fourth Text REtrieval Conference (TREC-4), pages 165–180, Gaithersburg, Maryland, 1995. M. Lewis, M. Ghazvininejad, G. Ghosh, A. Aghajanyan, S. Wang, and L. Zettlemoyer. Pre-training via paraphrasing. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 18470–18481, 2020a. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettle- moyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020b. C. Li, Y. Sun, B. He, L. Wang, K. Hui, A. Yates, L. Sun, and J. Xu. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4482–4491, Brussels, Belgium, 2018. C. Li, A. Yates, S. MacAvaney, B. He, and Y. Sun. PARADE: Passage representation aggregation for document reranking. arXiv:2008.09093, 2020a. H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers, 2011. H. Li and J. Xu. Semantic matching in search. Foundations and Trends in Information Retrieval, 7 (5):343–469, 2014. Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. E. Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 5958–5968, 2020b. J. Lin. Is searching full text more effective than searching abstracts? BMC Bioinformatics, 10:46, 2009.</p><p>187 J. Lin. The neural hype and comparisons against weak baselines. SIGIR Forum, 52(2):40–51, 2018.</p><p>J. Lin. The neural hype, justified! A recantation. SIGIR Forum, 53(2):88–93, 2019.</p><p>J. Lin and M. Efron. Overview of the TREC-2013 microblog track. In Proceedings of the Twenty- Second Text REtrieval Conference (TREC 2013), Gaithersburg, Maryland, 2013.</p><p>J. Lin and X. Ma. A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. arXiv:2106.14807, 2021.</p><p>J. Lin and A. Trotman. Anytime ranking for impact-ordered indexes. In Proceedings of the ACM International Conference on the Theory of Information Retrieval (ICTIR 2015), pages 301–304, Northampton, Massachusetts, 2015.</p><p>J. Lin and W. J. Wilbur. PubMed related articles: A probabilistic topic-based model for content similarity. 
BMC Bioinformatics, 8:423, 2007.</p><p>J. Lin and P. Yang. The impact of score ties on repeatability in document ranking. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1125–1128, Paris, France, 2019.</p><p>J. Lin and Q. Zhang. Reproducibility is a process, not an achievement: The replicability of IR reproducibility experiments. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020), pages 43–49, 2020.</p><p>J. Lin, D. Metzler, T. Elsayed, and L. Wang. Of Ivory and Smurfs: Loxodontan MapReduce experiments for web search. In Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), Gaithersburg, Maryland, 2009.</p><p>J. Lin, M. Efron, Y. Wang, and G. Sherman. Overview of the TREC-2014 microblog track. In Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014), Gaithersburg, Maryland, 2014.</p><p>J. Lin, A. Roegiest, L. Tan, R. McCreadie, E. Voorhees, and F. Diaz. Overview of the TREC 2016 real-time summarization track. In Proceedings of the Twenty-Fifth Text REtrieval Conference (TREC 2016), Gaithersburg, Maryland, 2016.</p><p>J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, and A. de Vries. Supporting interoperability between open-source search engines with the Common Index File Format. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pages 2149–2152, 2020a.</p><p>J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, and R. Nogueira. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362, 2021a.</p><p>S.-C. Lin, J.-H. Yang, and J. Lin. Distilling dense representations for ranking using tightly-coupled teachers. arXiv:2010.11386, 2020b.</p><p>S.-C. Lin, J.-H. Yang, and J. Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 163–173, 2021b.</p><p>H. Liu, Z. Dai, D. R. So, and Q. V. Le. Pay attention to MLPs. arXiv:2105.08050, 2021.</p><p>J. Liu, Y. Lin, Z. Liu, and M. Sun. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368, Florence, Italy, 2019a.</p><p>L. Liu, H. Wang, J. Lin, R. Socher, and C. Xiong. Attentive student meets multi-task teacher: Improved knowledge distillation for pretrained models. arXiv:1911.03588, 2019b.</p><p>188 P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, and N. Shazeer. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada, 2018a. S. Liu, F. Xiao, W. Ou, and L. Si. Cascade ranking for operational e-commerce search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), pages 1557–1565, Halifax, Canada, 2017. T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009. W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju. FastBERT: A self-distilling BERT with adaptive inference time. 