Ad-hoc Document Retrieval using Weak-Supervision with BERT and GPT2

Yosi Mass
IBM Research
Haifa University, Mount Carmel, Haifa, HA 31905, Israel
[email protected]

Haggai Roitman∗
eBay Research
Netanya, Israel
[email protected]

∗ Work done while affiliated with IBM.

Abstract

We describe a weakly-supervised method for training models for the task of ad-hoc document retrieval. Our method is based on generative and discriminative models that are trained using weak supervision derived solely from the documents in the corpus. We present an end-to-end retrieval system that starts with traditional information retrieval methods, followed by two deep learning re-rankers. We evaluate our method on three different datasets: a COVID-19 related scientific literature dataset and two news datasets. We show that our method outperforms state-of-the-art methods, without the need for the expensive process of manually labeling data.

1 Introduction

The ad-hoc retrieval task has been extensively studied by the Information Retrieval (IR) community. Traditional IR models evaluate ad-hoc queries against documents mainly on a syntactic (exact) word-matching basis (Manning et al., 2008). Recent advances in Deep Learning (DL) methods have led to further improvements in IR tasks, among them ad-hoc document retrieval (Guo et al., 2019). DL methods add a semantic dimension to IR methods; however, they usually require large amounts of labeled data for model training.

In this work, we describe a novel weakly-supervised method for training DL models for ad-hoc document retrieval. Motivated by the recent work of (Mass et al., 2020) on Frequently Asked Questions (FAQ) retrieval, we assume that documents have at least three fields, namely title, abstract and content. Such documents are quite common nowadays in the scientific and news domains.

Our main hypothesis is that titles and abstracts can take the role of FAQ questions and answers, respectively. Whenever a document is missing a title, we consider its first sentence as its augmented title. In a similar way, whenever a document is missing an abstract, we consider the first 512 words of its content as the abstract (a short illustrative sketch of this augmentation is given at the end of this section).

The three fields are used for retrieving candidate documents. Inspired by (Mass et al., 2020), the title and abstract fields are further used as a weak-supervision data source for training two independent BERT (Devlin et al., 2019) models, which are then used to re-rank those candidate documents. The first model matches user queries to documents' abstracts; here we use the title-to-abstract associations to fine-tune a BERT model to semantically match queries to abstracts. The second model matches user queries to titles; here our assumption is that by generating title paraphrases, we can train a model to match user queries to titles. To this end, we use GPT2 (Radford et al., 2018) to generate title paraphrases, which are then utilized for fine-tuning the second BERT model.

While our work is closely related to (Mass et al., 2020), in the absence of human-curated questions (such as in FAQs) we resort to title paraphrases as (noisy) pseudo-questions, thus transferring the method of (Mass et al., 2020) to the more general task of ad-hoc document retrieval. Moreover, compared to FAQs, which are relatively short, the current task deals with documents that can be quite long. Thus, in the current paper we use three fields (title, abstract, content) and present a strong IR baseline, instead of the two fields and simple IR baseline used in (Mass et al., 2020).

As a proof of concept, we evaluate our method on three benchmarks: TREC-COVID, a scientific literature dataset on COVID-19 topics, and TREC's newswire corpora: Associated Press (AP) and Wall Street Journal (WSJ). By combining the two weakly-supervised BERT models with an existing strong IR baseline, we demonstrate that the former can help to elevate the performance of the latter. Our approach further outperforms state-of-the-art methods on these benchmarks.
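As a concrete illustration of the field augmentation described above, the following minimal Python sketch fills in a missing title with the document's first sentence and a missing abstract with the first 512 words of its content. The naive period-based sentence split and the dictionary layout of `doc` are assumptions made for this sketch, not part of the method itself.

```python
def augment_fields(doc):
    """Fill missing title/abstract fields as described in the Introduction:
    a missing title becomes the document's first sentence, and a missing
    abstract becomes the first 512 words of the content."""
    content = doc.get("content", "")
    if not doc.get("title"):
        # Naive split on '.'; a real pipeline would use a proper sentence
        # segmenter (an assumption made for this sketch).
        doc["title"] = content.split(".")[0].strip()
    if not doc.get("abstract"):
        doc["abstract"] = " ".join(content.split()[:512])
    return doc
```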

2 Related Work

With the lack of training data, several weakly-supervised alternatives have been explored for the task at hand. (Dehghani et al., 2017b,a) and (Nie et al., 2018) utilized rankings produced by the BM25 model as training samples. (MacAvaney et al., 2019) used pseudo query-document pairs that already exhibit relevance (e.g., newswire headline-content pairs). (Frej et al., 2019) utilized Wikipedia's internal linkage to define automated query topics. (Zhang et al., 2020) used anchor texts and their linked web pages as query-document pairs. Our work differs from all those works in that we train a model to generate title paraphrases that are used to enable query-to-title (question) matching and not only query-to-abstract (answer) matching.

(Ma et al., 2020) proposed a zero-shot retrieval approach using synthetic query generation, by training a generative model on different Community QA data. Our work differs from (Ma et al., 2020) in three main aspects. First, (Ma et al., 2020) focuses on QA, where answers are very short, while we generate title paraphrases from full abstracts. Second, we train a model to generate title paraphrases which are used to enable not only query-to-abstract (answer) matching, but also query-to-title (question) matching. Third, (Ma et al., 2020) filters the input QA pairs used to train the generative model by taking only pairs that were voted for by at least one user on those Community QA (CQA) sites. We do not have such votes, so we instead apply filtering to the output data (namely, to the generated title paraphrases), as described in Section 3.3.

The work in (Chang et al., 2020) suggests an efficient neural method for initial retrieval of candidates. Their method uses a two-tower architecture which learns a different representation for passages and for queries. While their method can be used for initial retrieval (instead of our IR method), the authors of (Chang et al., 2020) still require an additional re-ranking step; thus it does not replace our two weakly-supervised BERT re-ranking models. Moreover, our two BERT models learn a joint attention-based representation for (query, abstract) and (query, title) pairs, while in (Chang et al., 2020) a separate representation is learned for queries and passages.

3 Method

Inspired by (Mass et al., 2020), we consider the ad-hoc document retrieval problem as an instance of FAQ retrieval, where a document's title represents the question and its abstract the answer.

Our proposed retrieval approach enhances existing state-of-the-art ad-hoc retrieval methods with weakly-supervised neural models that are trained entirely from the document collection itself, without the need to supply manual relevance labels. Following the common approach (Guo et al., 2019), these neural models are utilized for re-ranking candidate documents retrieved by a given IR baseline.

In what follows, the initial candidate-document retrieval uses pure IR similarities and relevance models (Section 3.1). The re-ranking step exploits two independent weakly-supervised BERT models, namely BERT-Q-a (Section 3.2) for matching queries to abstracts and BERT-Q-t (Section 3.3) for matching queries to titles. The final re-ranking is obtained by combining the outcome of the baseline IR method and the two BERT-based re-rankers using an unsupervised late-fusion step (Section 3.4). The components of our approach are described in the rest of this section.

3.1 Initial retrieval

We first obtain for each query a reasonable pool of candidate documents to be re-ranked by our weakly-supervised models. To this end, we retrieve several ranked lists from an Apache Lucene [1] index using various state-of-the-art IR similarities that are available in Lucene. The retrieved lists are then combined into a single pool of top-k candidates for re-ranking by employing the PoolRank (Roitman, 2018) fusion method. We refer to this IR pipeline as IR-Base.

The IR similarities and the PoolRank method have a few free parameters that are tuned to optimize Mean Average Precision (MAP@1000). Details are given in the experimental setup (Section 4.2) below.

[1] https://lucene.apache.org/
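To illustrate the pooling step, the sketch below fuses several ranked lists into a single top-k candidate pool using sum-normalized CombSUM-style aggregation (the base fusion named in Section 4.2). The full PoolRank method of (Roitman, 2018) adds a pseudo-relevance re-ranking step on top of this, so the code is only an approximation of IR-Base's pooling, not its exact implementation.

```python
from collections import defaultdict

def fuse_ranked_lists(ranked_lists, k=1000):
    """Fuse several ranked lists (dicts of doc_id -> score) into one pool.

    Scores in each list are sum-normalized and then added per document
    (CombSUM-style). This only approximates the PoolRank fusion used in
    the paper, which additionally applies pseudo-relevance re-ranking.
    """
    fused = defaultdict(float)
    for scores in ranked_lists:
        total = sum(scores.values()) or 1.0
        for doc_id, score in scores.items():
            fused[doc_id] += score / total
    # Keep the top-k documents as the candidate pool for re-ranking.
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]

# Toy example: fusing three runs (e.g., BM25, LM Dirichlet, DFR).
pool = fuse_ranked_lists([
    {"d1": 12.3, "d2": 9.1, "d3": 4.2},
    {"d2": 0.8, "d4": 0.5, "d1": 0.3},
    {"d3": 5.0, "d1": 2.0},
], k=3)
print(pool)
```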

3.2 BERT-Q-a

We use title-abstract pairs (t, a) of documents in the collection as a weak-supervision data source for fine-tuning a pre-trained BERT model, which is then used to match user queries to abstracts.

Similar to (Mass et al., 2020), we fine-tune the BERT model (denoted BERT-Q-a) using a triplet network (Hoffer and Ailon, 2015). This network is adopted for BERT fine-tuning (Mass et al., 2019) using triplets (t, a, a′), where (t, a) constitutes a document title and its abstract, and a′ is a negatively sampled abstract, obtained as follows. We run t as a query against the index (using the title and abstract fields) and sample n random abstracts from the top-k retrieved documents as negative examples (excluding a); in our setup we used k=100 and n=2. At run time, given a user query Q, BERT-Q-a re-ranks the top-k candidate documents by matching Q to their abstracts (a) only. (A minimal code sketch of this triplet fine-tuning is given at the end of this section.)

3.3 BERT-Q-t

Similar to (Mass et al., 2020), we fine-tune a generative pre-trained (GPT-2) neural network model (Radford et al., 2018) for generating title paraphrases (instead of the question paraphrases generated in (Mass et al., 2020)).

Using N (t_i, a_i) pairs, we concatenate titles and their abstracts into one long text U = a_1 [SEP] t_1 [EOS] ... a_N [SEP] t_N [EOS], where [SEP] and [EOS] are special tokens. The GPT-2 fine-tuning samples sequences of l consecutive tokens in U (in our setup we used l=256), aiming to maximize the Language Model (LM) probability of generating the last token of each sequence given its l − 1 preceding tokens.

Once the model is fine-tuned, we feed it with the text "a [SEP]" (where a is an abstract) and let it generate tokens until [EOS] is generated. We take all generated tokens, excluding [EOS], as a paraphrase of a's title t. We repeat the generation process n times (e.g., n=10) to generate n paraphrases for each title. The generated paraphrases are filtered to ensure high-quality paraphrases (Mass et al., 2020): each paraphrase is run as a query against the Lucene index, and only paraphrases that return the exact same documents as their original title are kept. (An illustrative sketch of the generation and filtering steps is also given at the end of this section.)

The filtered paraphrases are then used to fine-tune a second BERT model (denoted BERT-Q-t), using a triplet network (similar to BERT-Q-a), with triplets (p, t, t′), where p is a paraphrase of t and t′ is a randomly selected title from the corpus. At run time, given a user query Q, BERT-Q-t re-ranks the top-k candidate documents by matching Q to titles (t) only.

3.4 Enhanced ad-hoc retrieval using Fusion

To enhance ad-hoc retrieval quality, we propose to combine the two weakly-supervised fine-tuned BERT models with the baseline IR method (IR-Base, see again Section 3.1). To this end, following (Roitman, 2018), we utilize the Two-Step PoolRank (denoted TSPR) unsupervised fusion method, an extended PoolRank method that estimates document relevance using the three ranked lists (obtained by IR-Base, BERT-Q-a and BERT-Q-t) as pseudo-relevance evidence sources.
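To make the fine-tuning of Sections 3.2 and 3.3 more concrete, the following minimal PyTorch/Hugging Face sketch scores a (text, passage) pair with BERT's [CLS] representation and trains it so that a true (title, abstract) pair outscores a negatively sampled one. The PairScorer class, the margin value and the optimizer settings are illustrative assumptions; the actual models follow the triplet-network setup of (Mass et al., 2019), whose exact architecture and loss may differ.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class PairScorer(nn.Module):
    """Scores a (query-like text, passage) pair using BERT's [CLS] vector.

    A minimal stand-in for the BERT-Q-a / BERT-Q-t re-rankers; not the
    paper's exact model.
    """
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, **enc):
        cls = self.bert(**enc).last_hidden_state[:, 0]  # [CLS] token
        return self.score(cls).squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = PairScorer()
loss_fn = nn.MarginRankingLoss(margin=1.0)  # margin is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(title, abstract, neg_abstract):
    """One triplet step: (title, abstract) should outscore (title, neg)."""
    pos = tokenizer(title, abstract, truncation=True, max_length=512,
                    return_tensors="pt")
    neg = tokenizer(title, neg_abstract, truncation=True, max_length=512,
                    return_tensors="pt")
    s_pos, s_neg = model(**pos), model(**neg)
    loss = loss_fn(s_pos, s_neg, torch.ones_like(s_pos))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```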

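The paraphrase generation and filtering of Section 3.3 can be sketched in a similarly illustrative way. The code assumes a GPT-2 model already fine-tuned on the concatenated "abstract [SEP] title [EOS]" text; the sampling parameters, the plain-string handling of [SEP]/[EOS], and the `search` callback used for filtering are assumptions made for the sketch rather than the paper's exact settings.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # assume: already fine-tuned
model.eval()

def generate_title_paraphrases(abstract, n=10, max_new_tokens=40):
    """Prompt the fine-tuned GPT-2 with '<abstract> [SEP]' and collect n
    sampled continuations, cutting each one at the [EOS] marker."""
    inputs = tokenizer(abstract + " [SEP]", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling settings are illustrative
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    paraphrases = []
    prompt_len = inputs["input_ids"].shape[1]
    for seq in outputs:
        text = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        # Keep only the tokens generated before the [EOS] marker.
        paraphrases.append(text.split("[EOS]")[0].strip())
    return paraphrases

def keep_consistent(paraphrases, title, search):
    """Filtering step: keep paraphrases that retrieve exactly the same
    documents as the original title. `search` is a placeholder for a
    query against the Lucene index (not shown here)."""
    gold = search(title)
    return [p for p in paraphrases if search(p) == gold]
```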
4 Evaluation

4.1 Datasets and Indexing

We evaluated our proposed approach using three different benchmarks. The first benchmark, TREC-COVID [2], is based on the CORD-19 dataset [3], which contains scientific documents related to the recent Coronavirus pandemic. We used the Round-1 challenge, which consists of 43K documents [4] and 30 topics (queries) with their query relevance sets (qrels). Documents in this dataset have three fields (title, abstract and content). The two other benchmarks are based on news-article datasets: AP (Associated Press, about 242K docs) and WSJ (Wall Street Journal, about 160K docs). These datasets are part of the TREC ad-hoc retrieval newswire collection [5]. Here we used topics 51-150 and topics 151-200 (with their respective qrels) for the AP and WSJ datasets, respectively. Those two datasets have only title and content, so we created the abstract by taking the first 512 tokens of the content.

We used Apache Lucene to process and index the (multi-field) documents, employing English analysis (tokenization, lower-casing, Porter stemming and stopword removal). Each indexed document has three main fields: title, abstract and content.

4.2 Experimental Setup

We used an initial candidate pool of k = 1000 documents retrieved by IR-Base and re-ranked by the two BERT models. We detail below the setup of each of the three rankers and their fusion.

IR-Base. The following Lucene similarity configurations were used: i) BM25Similarity (Robertson and Zaragoza, 2009) with k1 = 1.2 and b = 0.7; ii) LMDirichletSimilarity (Zhai, 2009) with Dirichlet-smoothing parameter µ = 200 and µ = 1000 for the TREC-COVID and news datasets, respectively; iii) DFRSimilarity (Amati and Van Rijsbergen, 2002) with BasicModelIF, AfterEffectB and NormalizationH3; iv) AxiomaticF1LOG (Fang and Zhai, 2005) with growth parameter s = 0.25 and s = 0.1 for the TREC-COVID and news datasets, respectively.

BERT models. We used the huggingface implementation of BERT and GPT2 [6]. For the two BERT models we used BERT-base-uncased (12 layers, 768 hidden units, 12 heads, 110M parameters). Fine-tuning was done with a learning rate of 2e-5 and 3 training epochs. For training BERT-Q-a on each of the three datasets, we used a subset of their first 20K documents. For TREC-COVID, we used the SciBERT model (Beltagy et al., 2019), which was pre-trained on 1M scientific documents, as it yields better results than the vanilla pre-trained BERT model; this is mainly due to the scientific nature of the documents in this benchmark.

GPT2. For generating title paraphrases we used the GPT2 small model (12 layers, 768 hidden units, 12 heads, 110M parameters). For fine-tuning we used (title, abstract) pairs from all documents of TREC-COVID and a subset of the first 20K documents of the other two datasets. We generated 10 paraphrases for the first 20K documents of each of the three datasets. After filtering the generated paraphrases, we were left with 18K, 4.5K and 3.5K paraphrases for TREC-COVID, WSJ and AP, respectively [7].

Fusion. We tuned the PoolRank (Roitman, 2018) method's parameters for all datasets as follows. For base fusion we used CombSUM (Nuray and Can, 2006) with sum-normalization. The other parameters were set as: pseudo-relevance set size: 5 documents; term clip size: 100; document re-ranking using the KL-score (equally interpolated with the CombSUM score) with Dirichlet-smoothing parameter µ = 200 and µ = 1000 for the TREC-COVID and news datasets, respectively.

We assessed retrieval quality using the following metrics: Precision (P@5), Normalized Discounted Cumulative Gain (NDCG@10) and Mean Average Precision (MAP@1000). All experiments were run on two 32GB V100 GPUs. The re-ranking time for 1000 documents per query was 11 sec for BERT-Q-a (using BERT's maximum sequence length of 512) and 5 sec for BERT-Q-t (maximum sequence length of 256).

4.3 Results

We report the evaluation results on the TREC-COVID benchmark and on the two news benchmarks (AP and WSJ) in Table 2 and Table 3, respectively. We compared our three rankers (IR-Base, BERT-Q-a and BERT-Q-t) and their fusion (TSPR). We further evaluated two additional TSPR versions, namely TSPR-Q-a and TSPR-Q-t, where we only fused the IR-Base ranked list with either BERT-Q-a or BERT-Q-t, respectively.

To demonstrate the relative effectiveness of our proposed approach, we compared its quality to state-of-the-art alternative baselines. On TREC-COVID, we directly compared against the three best-performing automatic systems [8] (out of 141 system runs submitted to the Round-1 challenge by 56 different teams), namely: sabir, IRIT markers and unipd.it.

On the news benchmarks (AP and WSJ), we compared against quality metrics (when available) that were previously reported for the following state-of-the-art unsupervised and semi-supervised IR methods: ClustMRF (Raiber and Kurland, 2013), NVSM (Gysel et al., 2018), LBDM (Wei and Croft, 2006), PGR (Krikon et al., 2011) and CRM (Gelfer Kalmanovich and Kurland, 2009).

The superscripts M and N in both tables denote a statistically significant (p < 0.05) result with respect to IR-Base and to the best alternative baseline, respectively.

4.3.1 Retrieval enhancement

The first and most important observation is that, consistently over the three benchmarks, the proposed method TSPR, which fuses the initial IR retrieval (IR-Base) and the two weakly-supervised BERT models, performs significantly better than each of the three separately, on all measures. As a second observation, we note that TSPR employed with both BERT models significantly outperforms TSPR-Q-a and TSPR-Q-t.

[2] https://bit.ly/2ApmLcz
[3] https://bit.ly/3dxyZ1i
[4] Round-1 contained about 51K documents, but we kept only those that have a non-empty content.
[5] https://bit.ly/3gJcF6X
[6] https://bit.ly/2Me0Gk1
[7] The filtered paraphrases can be downloaded from https://github.com/YosiMass/ad-hoc-retrieval
[8] The details of these systems, as well as other competing systems, are available at https://bit.ly/2XjkE2T

Table 2: Retrieval quality on TREC-COVID.

Method         P@5       NDCG@10   MAP
IR-Base        .753      .597      .297
BERT-Q-a       .466      .373      .148
BERT-Q-t       .620      .506      .186
TSPR-Q-a       .693      .555      .270
TSPR-Q-t       .747      .625      .254
TSPR           .827^MN   .652^MN   .315^M
sabir          .780      .608      .313
IRIT markers   .733      .586      .248
unipd.it       .727      .572      .208

Table 3: Retrieval quality on news benchmarks.

               --------- AP ---------    --------- WSJ --------
Method         P@5      NDCG@10  MAP     P@5      NDCG@10  MAP
IR-Base        .480     .460     .237    .564     .557     .319
BERT-Q-a       .406     .411     .179    .444     .455     .204
BERT-Q-t       .380     .382     .168    .452     .470     .195
TSPR-Q-a       .528     .527     .268    .592     .609     .361
TSPR-Q-t       .512     .507     .267    .592     .588     .342
TSPR           .592^M   .570^M   .275^M  .676^M   .664^M   .368^M
ClustMRF       .559     -        -       -        -        -
NVSM           -        -        .257    -        -        .208
LBDM           -        -        .265    -        -        -
PGR            .537     -        -       .612     -        -
CRM            .521     -        .301    .620     -        .409

These two observations confirm our hypothesis that: 1) BERT contributes a semantic understanding of the data and thus improves the ad-hoc retrieval task over pure IR methods; and 2) each of the two BERT models contributes a different semantic aspect. BERT-Q-a, which was trained on the relation between titles and abstracts, allows considering the semantic similarity between a user query and abstracts. Moreover, BERT-Q-t, which was trained on titles and their paraphrases, can successfully match a user query to titles.

To examine the semantic differences between our three rankers, we report their P@1 performance. On TREC-COVID, there were 12 queries in which BERT-Q-t and IR-Base differed in their P@1, and 9 queries in which BERT-Q-a and IR-Base differed. On AP, differences from IR-Base were observed on 32 and 31 queries for BERT-Q-t and BERT-Q-a, respectively, and on WSJ, differences were observed on 11 and 23 queries for BERT-Q-t and BERT-Q-a, respectively.

Table 1 shows some example queries from TREC-COVID where BERT-Q-t returned a correct top-1 answer (showing its title), while IR-Base returned a wrong one.

Table 1: Example queries and titles of correct top-1 documents retrieved by BERT-Q-t on TREC-COVID.

Query:     how does the coronavirus respond to changes in the weather
BERT-Q-t:  The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus

Query:     how long can the coronavirus live outside the body
BERT-Q-t:  Microbes, Transmission Routes and Survival Outside the Body

Looking further at the effect of each of the two BERT models as a standalone ranker, we can see that on TREC-COVID, BERT-Q-t performed better than BERT-Q-a, while on the two news datasets it was the other way around. This can be attributed to the length of the titles: TREC-COVID titles are much longer (13 words on average, compared to 9.8 and 8.2 words on WSJ and AP, respectively) and hence carry more information.

4.3.2 Comparison with alternative baselines

Looking further down the tables, we notice that our proposed method, TSPR, outperforms all alternative baselines in most of the cases and metrics. On the TREC-COVID benchmark, TSPR provides better retrieval quality than the best systems. Interestingly, some systems (such as IRIT markers) fine-tuned a BERT model (including SciBERT) using a large auxiliary annotated dataset such as MS-Marco, yet still fall behind TSPR's quality. This serves as further strong empirical evidence of the importance of our weakly-supervised BERT fine-tuning directly on the domain's data.

Finally, on the two news benchmarks, TSPR surpasses most of the quality metrics that were previously reported for state-of-the-art alternatives.

5 Conclusions and Future work

We have cast a solution for FAQ retrieval into a solution for ad-hoc document retrieval, where titles and abstracts took the role of FAQ questions and answers. We have shown that, using the corpus itself, we could generate weakly-supervised title paraphrases for training a BERT model that matches queries to titles. Coupled with a second BERT model that was trained to match queries to abstracts, we have experimentally shown on three different benchmarks that our proposed method outperformed state-of-the-art alternatives.

As future work, we plan to utilize automatic summarization for missing abstracts, instead of taking the first 512 content tokens.

References

Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained language model for scientific text. In EMNLP.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval.

Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. 2017a. Avoiding your teacher's mistakes: Training neural networks with controlled weak supervision.

Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017b. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pages 65–74, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hui Fang and ChengXiang Zhai. 2005. An exploration of axiomatic approaches to information retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 480–487, New York, NY, USA. Association for Computing Machinery.

Jibril Frej, Didier Schwab, and Jean-Pierre Chevallet. 2019. WIKIR: A python toolkit for building a large-scale Wikipedia-based English information retrieval dataset.

Inna Gelfer Kalmanovich and Oren Kurland. 2009. Cluster-based query expansion. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 646–647, New York, NY, USA. Association for Computing Machinery.

Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2019. A deep look into neural ranking models for information retrieval.

Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2018. Neural vector spaces for unsupervised information retrieval. ACM Trans. Inf. Syst., 36(4).

Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings.

Eyal Krikon, Oren Kurland, and Michael Bendersky. 2011. Utilizing inter-passage and inter-document similarities for reranking search results. ACM Trans. Inf. Syst., 29(1).

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2020. Zero-shot neural retrieval via domain-targeted synthetic query generation. CoRR, abs/2004.14503.

Sean MacAvaney, Andrew Yates, Kai Hui, and Ophir Frieder. 2019. Content-based weak supervision for ad-hoc re-ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, pages 993–996, New York, NY, USA. Association for Computing Machinery.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Yosi Mass, Boaz Carmeli, Haggai Roitman, and David Konopnicki. 2020. Unsupervised FAQ retrieval with question generation and BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 807–812, Online. Association for Computational Linguistics.

Yosi Mass, Haggai Roitman, Shai Erera, Or Rivlin, Bar Weiner, and David Konopnicki. 2019. A study of BERT for non-factoid question-answering under passage length constraints. CoRR, abs/1908.06780.

Yifan Nie, Alessandro Sordoni, and Jian-Yun Nie. 2018. Multi-level abstraction convolutional model with weak supervision for information retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, pages 985–988, New York, NY, USA. Association for Computing Machinery.

Rabia Nuray and Fazli Can. 2006. Automatic ranking of information retrieval systems using data fusion. Information Processing & Management, 42(3):595–614.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Fiana Raiber and Oren Kurland. 2013. Ranking document clusters using markov random fields. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 333–342, New York, NY, USA. Association for Computing Machinery.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.

Haggai Roitman. 2018. Utilizing pseudo-relevance feedback in fusion-based retrieval. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '18, pages 203–206, New York, NY, USA. ACM.

Xing Wei and W. Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 178–185, New York, NY, USA. Association for Computing Machinery.

C. Zhai. 2009. Statistical language models for information retrieval. Morgan & Claypool Publishers.

Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2020. Selective weak supervision for neural information retrieval. In Proceedings of The Web Conference 2020, WWW '20, pages 474–485, New York, NY, USA. Association for Computing Machinery.
