Bridging the Gap Between Relevance Matching and Semantic Matching for Short Text Similarity Modeling
Jinfeng Rao,1 Linqing Liu,2 Yi Tay,3 Wei Yang,2 Peng Shi,2 and Jimmy Lin2
1 Facebook Assistant
2 David R. Cheriton School of Computer Science, University of Waterloo
3 Nanyang Technological University
[email protected]

Abstract

A core problem of information retrieval (IR) is relevance matching, which is to rank documents by relevance to a user's query. On the other hand, many NLP problems, such as question answering and paraphrase identification, can be considered variants of semantic matching, which is to measure the semantic distance between two pieces of short text. While at a high level both relevance and semantic matching require modeling textual similarity, many existing techniques for one cannot be easily adapted to the other. To bridge this gap, we propose a novel model, HCAN (Hybrid Co-Attention Network), that comprises (1) a hybrid encoder module that includes ConvNet-based and LSTM-based encoders, (2) a relevance matching module that measures soft term matches with importance weighting at multiple granularities, and (3) a semantic matching module with co-attention mechanisms that capture context-aware semantic relatedness. Evaluations on multiple IR and NLP benchmarks demonstrate state-of-the-art effectiveness compared to approaches that do not exploit pretraining on external data. Extensive ablation studies suggest that relevance and semantic matching signals are complementary across many problem settings, regardless of the choice of underlying encoders.

1 Introduction

Neural networks have achieved great success in many NLP tasks, such as question answering (Rao et al., 2016; Chen et al., 2017a), paraphrase detection (Wang et al., 2017), and textual semantic similarity modeling (He and Lin, 2016). Many of these tasks can be treated as variants of a semantic matching (SM) problem, where two pieces of text are jointly modeled through distributed representations for similarity learning. Various neural network architectures, e.g., Siamese networks (He et al., 2016) and attention (Seo et al., 2017; Tay et al., 2019b), have been proposed to model semantic similarity using diverse techniques.

A core problem of information retrieval (IR) is relevance matching (RM), where the goal is to rank documents by relevance to a user's query. Though at a high level semantic and relevance matching both require modeling similarities in pairs of texts, there are fundamental differences. Semantic matching emphasizes "meaning" correspondences by exploiting lexical information (e.g., words, phrases, entities) and compositional structures (e.g., dependency trees), while relevance matching focuses on keyword matching. It has been observed that existing approaches for textual similarity modeling in NLP can produce poor results for IR tasks (Guo et al., 2016), and vice versa (Htut et al., 2018).

Specifically, Guo et al. (2016) point out three distinguishing characteristics of relevance matching: exact match signals, query term importance, and diverse matching requirements. In particular, exact match signals play a critical role in relevance matching, more so than the role of term matching in, for example, paraphrase detection. Furthermore, in document ranking there is an asymmetry between queries and documents in terms of length and the richness of signals that can be extracted; thus, symmetric models such as Siamese architectures may not be entirely appropriate.
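To make the first two characteristics concrete, the toy sketch below (our illustration, not code from the paper) scores a pair of texts purely by IDF-weighted exact term overlap, so rare query terms carry more weight than common ones. A document that is topically related but shares no query terms receives a score of zero, which is precisely the kind of case where semantic matching signals are needed. All names and the tiny corpus are invented for this sketch.

```python
# Toy illustration of exact match signals with query term importance (IDF weighting).
import math

def idf(term, corpus):
    # Smoothed inverse document frequency over a (here, tiny) collection.
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1)) + 1.0

def exact_match_score(query, doc, corpus):
    # Sum the IDF weights of query terms that appear verbatim in the document:
    # exact matches drive the score, and rare terms dominate common ones.
    doc_terms = set(doc)
    return sum(idf(t, corpus) for t in set(query) if t in doc_terms)

corpus = [
    "florida storm causes widespread damage along the coast".split(),
    "the hurricane devastated several coastal towns this week".split(),
    "local elections scheduled for november".split(),
]
query = "storm damage florida".split()
print(exact_match_score(query, corpus[0], corpus))  # high: three exact matches
print(exact_match_score(query, corpus[1], corpus))  # 0.0 despite being related
```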
To better demonstrate these differences, we present examples from relevance and semantic matching tasks in Table 1. Column 'Label' denotes whether sentences A and B are relevant or duplicates.

Task | Label | Sentence A | Sentence B
Tweet Search | 1 | 2022 FIFA soccer | 2022 world cup FIFA could be held at the end of year in Qatar
Duplicate Detection | 0 | Does RBI send its employees for higher education, like MBA? | Does EY send its employees for higher education, like MBA?
Question Answering | 1 | What was the monetary value of the Nobel peace prize in 1989? | Each Nobel prize is worth $469,000.

Table 1: Sample sentence pairs from TREC Microblog 2013, Quora, and TrecQA.

The first example from tweet search shares many common keywords and is identified as relevant, while the second pair from Quora shares all words except for the subject and is not considered a duplicate pair. An approach based on keyword matching alone is unlikely to be able to distinguish between these cases. In contrast, the third example is judged as a relevant QA pair because different terms convey similar semantics.

These divergences motivate different architectural choices. Since relevance matching is fundamentally a matching task, most recent neural architectures, such as DRMM (Guo et al., 2016) and Co-PACRR (Hui et al., 2018), adopt an interaction-based design. They operate directly on the similarity matrix obtained from products of query and document embeddings and build sophisticated modules on top to capture additional n-gram matching and term importance signals. On the other hand, many NLP problems, such as question answering and textual similarity measurement, require more semantic understanding and contextual reasoning rather than specific term matches. Context-aware representation learning, such as co-attention methods (Seo et al., 2017), has proven effective in many benchmarks. Though improvements have been shown from adding exact match signals into representation learning, for example, the Dr.QA model of Chen et al. (2017a) concatenates exact match scores to word embeddings, it remains unclear to what extent relevance matching signals can further improve models primarily designed for semantic matching.
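The interaction-based design just described can be reduced to a few lines. The sketch below is a minimal illustration under our own assumptions, not the exact formulation of DRMM, Co-PACRR, or HCAN: it builds a cosine similarity matrix between query and document word embeddings and pools it into per-query-term match signals.

```python
# Minimal sketch of an interaction-based matcher: score signals are derived
# directly from the word-level similarity matrix of the two inputs.
import torch
import torch.nn.functional as F

def interaction_signals(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (n, d) and doc_emb: (m, d) word embeddings."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.t()                     # (n, m) cosine similarity matrix
    best_match = sim.max(dim=1).values  # strongest document match per query term
    avg_match = sim.mean(dim=1)         # average match per query term
    return torch.cat([best_match, avg_match])  # fed to downstream scoring layers

signals = interaction_signals(torch.randn(3, 300), torch.randn(20, 300))
print(signals.shape)  # torch.Size([6])
```

Exact and near-exact matches dominate these signals, whereas a representation-focused model compares whole-sentence encodings; the two designs surface different evidence about the same pair.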
To this end, we examine two research questions: (1) Can existing approaches to relevance matching and semantic matching be easily adapted to the other? (2) Are signals from relevance and semantic matching complementary? We present a novel neural ranking approach to jointly model both the relevance matching process and the semantic matching process. Our model, HCAN (Hybrid Co-Attention Network), comprises three major components:

1. A hybrid encoder module that explores three types of encoders: deep, wide, and contextual, to obtain contextual sentence representations.

2. A relevance matching module that measures soft term matches with term weightings between pairs of texts, starting from word-level to phrase-level, and finally to sentence-level.

3. A semantic matching module with co-attention mechanisms applied at each encoder layer to enable context-aware representation learning at multiple semantic levels.

Finally, all relevance and semantic matching signals are integrated using a fully-connected layer to yield the final classification score.

Contributions. We see our work as making the following contributions:

• We highlight and systematically explore important differences between relevance matching and semantic matching on short texts, which lie at the core of many IR and NLP problems.

• We propose a novel model, HCAN (Hybrid Co-Attention Network), to combine best practices in neural modeling for both relevance and semantic matching.

• Evaluations on multiple IR and NLP tasks, including answer selection, paraphrase identification, semantic similarity measurement, and tweet search, demonstrate state-of-the-art effectiveness compared to approaches that do not exploit pretraining on external data. Ablation studies show that relevance and semantic matching signals are complementary in many problems, and combining them can be more data efficient.

2 HCAN: Hybrid Co-Attention Network

The overview of our model is shown in Figure 1. It comprises three major components: (1) a hybrid encoder module that explores three types of encoders: deep, wide, and contextual (Sec. 2.1); (2) a relevance matching module with external weights for learning soft term matching signals (Sec. 2.2); (3) a semantic matching module with co-attention mechanisms for context-aware representation learning (Sec. 2.3). Note that the relevance and semantic matching modules are applied at each encoder layer, and all signals are finally aggregated for classification.

Figure 1: Overview of our Hybrid Co-Attention Network (HCAN). The model consists of three major components: (1) a hybrid encoder module that explores three types of encoders: deep, wide, and contextual; (2) a relevance matching module with external weights for learning soft term matching signals; (3) a semantic matching module with co-attention mechanisms for context-aware representation learning.

2.1 Hybrid Encoders

Without loss of generality, we assume that inputs to our model are sentence pairs (q, c), where (q, c) can refer to a (query, document) pair in a search setting, a (question, answer) pair in a QA setting, etc. The query q and context c are denoted by their words {w^q_1, w^q_2, ..., w^q_n} and {w^c_1, w^c_2, ..., w^c_m}, respectively.

In the deep (convolutional) encoder, each convolutional filter slides over the input embedding incrementally as a sliding window (with window size k) to capture the compositional representation of k neighboring terms. Assuming a convolutional layer has F filters, this CNN layer (with padding) produces an output matrix U^o ∈ R^{|U| × F}.

For notational simplicity, we drop the superscript o from all output matrices and add a superscript h to denote the output of the h-th convolutional layer. Stacking N CNN layers therefore corresponds to obtaining the output matrix of the h-th layer U^h ∈ R^{|U| × F} via:

U^h = CNN^h(U^{h-1}),  h = 1, ..., N,

where U^{h-1} is the output matrix of the (h-1)-th convolutional layer.
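As a concrete reference for this recurrence, the sketch below implements a stack of N convolutional layers over word embeddings and returns the per-layer outputs U^1, ..., U^N, since the matching modules are applied at every encoder layer. It is a minimal illustration under our own assumptions about unspecified details such as the activation function and padding scheme; the filter count F, window size k, and depth N shown are hypothetical defaults.

```python
# Minimal sketch of the stacked convolutional ("deep") encoder: U^h = CNN^h(U^{h-1}).
import torch
import torch.nn as nn

class StackedConvEncoder(nn.Module):
    def __init__(self, embed_dim: int, num_filters: int = 256,
                 kernel_size: int = 3, num_layers: int = 4):
        super().__init__()
        convs, in_channels = [], embed_dim
        for _ in range(num_layers):
            # Padding keeps the sequence length |U| (for odd kernel sizes)
            # and every layer emits F = num_filters channels.
            convs.append(nn.Conv1d(in_channels, num_filters, kernel_size,
                                   padding=kernel_size // 2))
            in_channels = num_filters
        self.convs = nn.ModuleList(convs)

    def forward(self, U0: torch.Tensor) -> list[torch.Tensor]:
        # U0: (batch, seq_len, embed_dim) word embeddings of the query or context.
        U = U0.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        outputs = []
        for conv in self.convs:                 # U^h = CNN^h(U^{h-1}), h = 1..N
            U = torch.relu(conv(U))
            outputs.append(U.transpose(1, 2))   # back to (batch, seq_len, F)
        return outputs                          # one representation per encoder layer

encoder = StackedConvEncoder(embed_dim=300)
layer_reps = encoder(torch.randn(2, 30, 300))
print(len(layer_reps), layer_reps[0].shape)  # 4 torch.Size([2, 30, 256])
```

Returning every intermediate U^h rather than only the final one mirrors the design in Figure 1, where relevance and semantic matching signals are computed at each encoder layer before being aggregated for classification.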