Improving Passage Re-Ranking with Word N-Gram Aware Coattention Encoder
Chaitanya Sai Alaparthi and Manish Shrivastava
Language Technologies Research Centre (LTRC), Kohli Centre on Intelligent Systems (KCIS), International Institute of Information Technology, Hyderabad, India

Abstract

In text matching applications, coattention has proved to be a highly effective attention mechanism: it enables a model to learn to attend based on word-level affinity scores computed between two texts. In this paper, we propose two improvements to the coattention mechanism in the context of passage ranking (re-ranking). First, we extend the coattention mechanism by applying it across all word n-grams of the query and passage. We show that these word n-gram coattentions can capture local context in the query and passage and thus better judge the relevance between them. Second, we further improve model performance by proposing a query-based attention pooling over the passage encodings. We evaluate these two methods on the MS MARCO passage re-ranking task. The experimental results show that together they yield a relative increase of 8.04% in Mean Reciprocal Rank @10 (MRR@10) over the naive coattention mechanism. At the time of writing, our methods are the best non-transformer models on the MS MARCO passage re-ranking task and are competitive with BERT base while having less than 10% of its parameters.

1 Introduction

Passage ranking (or re-ranking) is a key information retrieval (IR) task in which a model has to rank (or re-rank) a set of passages according to how relevant they are to a given query. It is an integral part of conversational search systems, automated question answering (QA) systems, and similar applications. The typical answer extraction process in these systems consists of two main phases: the first ranks passages from the collection that most likely contain the answers, and the second extracts answers from those passages. The performance of the first phase significantly impacts both answer extraction and the overall system, so it is important for a QA system to rank passages effectively.

Attention mechanisms have brought tremendous improvements to deep learning based NLP models (Bahdanau et al., 2014; Wang et al., 2016; Yang et al., 2016; Lu et al., 2016; Vaswani et al., 2017). Attention allows a model to dynamically focus only on those parts of the input that help it perform the task at hand. Coattention (Xiong et al., 2016) is a class of attention mechanisms applicable to text matching problems. It has proved highly effective because it lets the model learn to attend based on word-level affinity scores computed between two texts, which helps in deciding the relevance between them.

This paper builds on previous work on the coattention mechanism (Alaparthi, 2019), which we call the naive coattention encoder, to tackle the problem of passage re-ranking. We recall that the coattention encoder attends across query and passage by computing word-level affinity scores. Similar to (Hui et al., 2018), we argue that attending at the word level limits the ability to capture local context in the query-passage interactions. As an example (which we explain later in section 5.3), for the query what is January birthstone color, the naive coattention encoder can also relate passages describing November birthstone color, April birthstone color, etc. This is likely because of the common matching terms birthstone and color and the semantically similar words January, November, April, etc. We demonstrate that extending coattention to words and n-grams improves the matching signals, which contribute to the final relevance scores.
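As a minimal sketch of how such word-level coattention can be realized (following the general formulation of Xiong et al. (2016), not the authors' released code; the function name, the tensor names `Q` and `P`, and the toy dimensions are assumptions):

```python
import torch

def word_level_coattention(Q, P):
    """Toy word-level coattention.
    Q: (n, L) query word encodings, P: (m, L) passage word encodings.
    Returns an (m, 2L) co-dependent passage representation built from
    the word-by-word affinity matrix."""
    A = Q @ P.T                                   # (n, m) word-level affinity scores
    attn_over_passage = torch.softmax(A, dim=1)   # each query word attends over passage words
    attn_over_query = torch.softmax(A, dim=0)     # each passage word attends over query words
    C_q = attn_over_passage @ P                   # (n, L) passage summary per query word
    C_p = attn_over_query.T @ torch.cat([Q, C_q], dim=1)   # (m, 2L) co-dependent passage repr.
    return C_p

# toy usage: 5 query words, 8 passage words, encoding size 4
print(word_level_coattention(torch.randn(5, 4), torch.randn(8, 4)).shape)  # torch.Size([8, 8])
```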
In the naive coattention encoder, max-pooling was applied to the coattention encodings to obtain the co-dependent representation of the passage, which forms the basis for deciding relevance. We argue that max-pooling limits the ability to compose complex co-dependent representations. Intuitively, we can instead leverage the query to supervise the co-dependent representation derived from the coattention encodings of the passage. With this intuition, we propose a simple query-based attention pooling. We show that query-based attention pooling can pick up the appropriate clauses that are distributed across the coattention encodings. This allows the model to focus only on those parts of the passage coattention encodings that are appropriate for judging relevance. Additionally, the final passage representation is supervised by the query, which helps the model to better judge relevance.

To address these challenges, we first extend the coattention encoder to words and phrases by applying it across all word n-grams of the query and passage. For this purpose, similar to C-LSTM (Zhou et al., 2015) and Conv-KNRM (Dai et al., 2018), we generate word n-gram representations from the word embeddings of the query and passage using convolutional layers with different heights and multiple filters. We then encode these n-gram sequences with a BiLSTM to capture long-term dependencies in the n-gram encodings. A coattention encoder is then applied to the n-gram encodings of the query and passage to obtain the coattention encoding of the passage: it first generates co-dependent representations of the query and passage by attending across all word n-grams from both, and these co-dependent representations of the passage are then fused with the passage's n-gram encodings using a BiLSTM. We show that the resulting coattention encoding better captures the local context between query and passage, thus improving the overall judging power of the model.

To get the final representation of the passage, Alaparthi (2019) applied max-pooling over time to the coattention encoding of the passage (note that the coattention encoding is the output of a BiLSTM). This final representation forms the basis for deciding the relevance between query and passage. In this paper, we instead apply a query-based attention pooling to the coattention encoding to pick up the appropriate clauses distributed across it. We argue that query-based attention pooling allows the model to focus only on those parts of the passage coattention encoding that are appropriate for judging relevance. Additionally, the final passage representation is supervised by the query, which helps the model to better judge relevance.
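The n-gram step described above can be sketched roughly as follows, assuming PyTorch and illustrative dimensions; the class name `NGramEncoder`, the filter count, the way the two n-gram sequences are combined, and the choice to share one BiLSTM across n-gram sizes are ours and may differ from the paper's implementation:

```python
import torch
import torch.nn as nn

class NGramEncoder(nn.Module):
    """Sketch: Conv1d filters of heights 1 and 2 build uni/bi-gram
    representations from word embeddings; a BiLSTM then encodes each
    n-gram sequence."""
    def __init__(self, emb_dim=300, n_filters=256, hidden=128):
        super().__init__()
        self.heights = (1, 2)                     # n-gram sizes (uni, bi)
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, n_filters, kernel_size=h, padding=h - 1)
            for h in self.heights
        ])
        self.bilstm = nn.LSTM(n_filters, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, emb):                       # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)                   # Conv1d expects (batch, channels, seq_len)
        outputs = []
        for conv in self.convs:
            grams = torch.relu(conv(x))[:, :, :emb.size(1)]  # (batch, n_filters, seq_len)
            enc, _ = self.bilstm(grams.transpose(1, 2))      # (batch, seq_len, 2*hidden)
            outputs.append(enc)
        return torch.cat(outputs, dim=1)          # n-gram encodings stacked along time

# toy usage: 2 texts of 6 tokens with 300-dim embeddings
enc = NGramEncoder()
print(enc(torch.randn(2, 6, 300)).shape)          # torch.Size([2, 12, 256])
```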
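One simple way to realize the query-based attention pooling described above is to score each position of the passage coattention encoding against a query summary vector and take a weighted sum instead of the max; the sketch below is such a variant (the names `U` and `q_vec` and the dot-product scoring are assumptions, not the paper's exact formulation):

```python
import torch

def query_attention_pooling(U, q_vec):
    """Sketch of query-based attention pooling (replacing max-pooling).
    U: (m, d) coattention encoding of the passage, q_vec: (d,) query
    summary vector. Returns a single (d,) passage representation."""
    scores = U @ q_vec                     # (m,) relevance of each position to the query
    alpha = torch.softmax(scores, dim=0)   # attention weights over passage positions
    return alpha @ U                       # weighted sum instead of max over time

# toy usage: 10 passage positions, hidden size 6
print(query_attention_pooling(torch.randn(10, 6), torch.randn(6)).shape)  # torch.Size([6])
```

Unlike max-pooling, the weights here depend explicitly on the query, which is the supervision effect argued for above.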
We evaluate our methods on the MS MARCO passage re-ranking task¹ (Bajaj et al., 2016). Making the coattention encoder n-gram aware (uni- and bi-grams) increased the Mean Reciprocal Rank @10 (MRR@10) from 28.6 to 29.9, a +4.5% relative increase over the naive coattention encoder. Replacing the max-pooling layer with query-based attention pooling further improved the MRR@10 to 30.9, a +8.08% overall relative increase, resulting in the best non-transformer based model. We show that our methods are competitive with BERT base despite having far fewer parameters. Moreover, our methods are easy to train and require far less computational resources.

To summarize, the key contributions of this work are as follows. First, we extend the naive coattention encoder to words and phrases, enabling the coattention to capture local context; we call this the n-gram coattention encoder. Second, we further extend the n-gram coattention encoder with query-based attention pooling to pick up the appropriate clauses distributed across the coattention encoding of the passage; we call this the n-gram coattention encoder with attention pooling. We show that this further improves the model's ability to decide relevance. Third, we apply our methods to the MS MARCO passage re-ranking task and show that they outperform all the baselines, including the previous best non-BERT model, and are competitive with BERT base. Last, we use examples to compare and discuss our methods against the naive coattention encoder.

In section 2, we discuss related work. In section 3, we describe our two methods for improving the naive coattention encoder. In section 4, we describe the dataset, baselines and the settings used in all our experiments. In section 5, we analyze and discuss the results. Finally, we conclude with future plans in section 6.

¹ https://github.com/microsoft/MSMARCO-Passage-Ranking

2 Related Work

Deep learning methods have been successfully applied to a variety of language and information retrieval tasks. By exploiting deep architectures, deep learning techniques are able to discover from training data the hidden structures and features, at different levels of abstraction, that are useful for these tasks. A new direction, Neural IR, has therefore been proposed that resorts to deep learning to tackle the feature engineering problem of learning to rank, by directly using only automatically learned features from the raw text of the query and passage. The first successful model of this type is the Deep Structured Semantic Model (DSSM) (Huang et al.,

and

$$p_t = \mathrm{BiLSTM}(p_{t-1}, x_t^{P}) \qquad (2)$$

Similar to (Merity et al., 2016; Xiong et al., 2016), we also add sentinel vectors $q_\phi$ and $p_\phi$ to allow the query to not attend to any particular word in the passage. So $Q = (q_1, \ldots, q_n, q_\phi) \in \mathbb{R}^{(n+1) \times L}$ and similarly $P = (p_1, \ldots, p_m, p_\phi) \in \mathbb{R}^{(m+1) \times L}$.
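To make the excerpted equation concrete, the sketch below (with assumed dimensions; the variable names and the zero-initialized sentinel are ours, not the paper's) encodes the passage inputs with a BiLSTM as in Eq. (2) and appends a sentinel vector so that a query word may attend to no passage word at all:

```python
import torch
import torch.nn as nn

L = 200                                       # assumed encoding size (2 * LSTM hidden)
bilstm = nn.LSTM(input_size=128, hidden_size=L // 2,
                 batch_first=True, bidirectional=True)
p_phi = nn.Parameter(torch.zeros(1, L))       # trainable sentinel vector

x_P = torch.randn(1, 30, 128)                 # (batch=1, m=30 passage n-gram inputs, input dim)
P, _ = bilstm(x_P)                            # Eq. (2): p_t = BiLSTM(p_{t-1}, x_t^P) -> (1, m, L)
P = torch.cat([P.squeeze(0), p_phi], dim=0)   # (m + 1, L): passage encodings plus sentinel
print(P.shape)                                # torch.Size([31, 200])
```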