Dr.Quad at MEDIQA 2019: Towards Textual Inference and Question Entailment using contextualized representations

Vinayshekhar Bannihatti Kumar∗, Ashwin Srinivasan∗, Aditi Chaudhary∗, James Route, Teruko Mitamura, Eric Nyberg
{vbkumar, ashwinsr, aschaudh, jroute, teruko, ehn}@cs.cmu.edu
Language Technologies Institute, Carnegie Mellon University
∗ Equal contribution

Abstract

This paper presents the submissions by Team Dr.Quad to the ACL-BioNLP 2019 shared task on Textual Inference and Question Entailment in the Medical Domain. Our system is based on the prior work of Liu et al. (2019), which uses a multi-task objective function for textual entailment. In this work, we explore different strategies for generalizing state-of-the-art language understanding models to the specialized medical domain. Our results on the shared task demonstrate that incorporating domain knowledge through data augmentation is a powerful strategy for addressing the challenges posed by specialized domains such as medicine.

1 Introduction

The ACL-BioNLP 2019 (Ben Abacha et al., 2019) shared task focuses on improving the following three tasks for the medical domain: 1) Natural Language Inference (NLI), 2) Recognizing Question Entailment (RQE), and 3) Question-Answering (QA) re-ranking. Our team made submissions to all three tasks. We note that in this work we focus more on Task 1 and Task 2, as improvements in these two tasks reflect directly on Task 3. However, as per the shared task guidelines, we do submit one model for Task 3 to complete our submission.

Our approach for both Task 1 and Task 2 is based on the state-of-the-art natural language understanding model MT-DNN (Liu et al., 2019), which combines the strengths of multi-task learning (MTL) and language model pre-training. MTL in deep networks has shown performance gains when related tasks are trained together, resulting in better generalization to new domains (Ruder, 2017). Recent works such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018) have shown the efficacy of learning universal language representations, which provide a decent warm start to a task-specific model by leveraging large amounts of unlabeled data. MT-DNN uses BERT as the encoder and uses MTL to fine-tune the multiple task-specific layers. This model has obtained state-of-the-art results on several natural language understanding tasks such as SNLI (Bowman et al., 2015) and SciTail (Khot et al., 2018), and hence forms the basis of our approach. For Task 3, we use a simple model that combines the Task 1 and Task 2 models, as shown in §2.5.

As discussed above, state-of-the-art models using deep neural networks have shown significant performance gains across various natural language processing (NLP) tasks. However, their generalization to specialized domains such as the medical domain still remains a challenge. Romanov and Shivade (2018) introduce MedNLI, a natural language inference dataset for the medical domain, and show the importance of incorporating domain-specific resources. Inspired by their observations, we explore several techniques for augmenting domain-specific features into state-of-the-art methods. We expect the deep neural networks to help the model learn the task itself, while the domain-specific features assist the model in tackling the issues associated with such specialized domains. For instance, the medical domain has a distinct sublanguage (Friedman et al., 2002) and presents challenges such as abbreviations, inconsistent spellings, and relationships between drugs, diseases, and symptoms.

Our resulting models perform fairly well on the unseen test data of the ACL-MediQA shared task. On Task 1, our best model achieves a +14.1 gain over the baseline. On Task 2, our five-model ensemble achieves a +12.6 gain over the baseline, and for Task 3 our model achieves a +4.9 gain.

2 Approach

In this section, we first present our base model, MT-DNN (Liu et al., 2019), which we use for both Task 1 and Task 2, followed by a discussion of the different approaches taken for natural language inference (NLI) (§2.3), recognizing question entailment (RQE) (§2.4), and question answering (QA) (§2.5).

Label           Train   Validation   Test
Entailment      3744    465          474
Contradiction   3744    465          474
Neutral         3744    465          474

Table 1: The number of train, validation and test instances in each of the categories of the NLI dataset.

2.1 Task 1 and Task 2 Formulation

Formally, we define the problem of textual entailment as a multi-class classification task. Given two sentences a = a_1, a_2, ..., a_n and b = b_1, b_2, ..., b_m, the task is to predict the correct label. For NLI, a refers to the Premise and b refers to the Hypothesis, and the label set comprises entailment, neutral, and contradiction. For RQE, a refers to the CHQ and b refers to the FAQ, and the label set comprises True and False.

2.2 Model Architecture

A brief depiction of our system is shown in Figure 1. Components used for both NLI and RQE are shown in orange; an example is the Data Pre-processing component. RQE-only components are shown in yellow (e.g., Data Augmentation), and components used only for the NLI module are shown in pink (e.g., Dataset Prior). We base our model on the state-of-the-art natural language understanding model MT-DNN (Liu et al., 2019). MT-DNN is a hierarchical neural network model which combines the advantages of both multi-task learning and pre-trained language models. Below we describe the different components in detail.

[Figure 1: System overview for the NLI and RQE tasks. Premise/CHQ and Hypothesis/FAQ pairs are pre-processed (with data augmentation for RQE), passed through a BERT encoder and a pairwise text classification model, and post-processed (dataset prior for NLI, ensembling for RQE).]

Encoder: Following BERT (Devlin et al., 2018), each sentence pair is separated by a [SEP] token. The pair is then passed through a lexicon encoder, which represents each token as the sum of its word, segment, and positional embeddings. A multi-layer bidirectional transformer encoder (Vaswani et al., 2017) transforms the input token representations into contextual embedding vectors. This encoder is shared across multiple tasks.

Decoder: We use the pairwise text classification output layer (Liu et al., 2019) as our decoder. Given a sentence pair (a, b), the above encoder first encodes them into u and v respectively. Then a K-step reasoning is performed on these representations to predict the final label. The initial state is given by $s^0 = \sum_j \alpha_j u_j$, where $\alpha_j = \frac{\exp(w_1^T u_j)}{\sum_i \exp(w_1^T u_i)}$. On subsequent iterations $k \in [1, K-1]$, the state is $s^k = \mathrm{GRU}(s^{k-1}, x^k)$, where $x^k = \sum_j \beta_j v_j$ and $\beta_j = \mathrm{softmax}(s^{k-1} W_2^T v_j)$. A single-layer classifier then predicts the label at each iteration k:

$$P^k = \mathrm{softmax}\big(W_3^T \, [s^k; x^k; |s^k - x^k|; s^k \cdot x^k]\big)$$

Finally, the scores across the K iterations are averaged for the final prediction. We now describe the modifications made to this model for each respective task.
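
To make the K-step reasoning concrete, the following is a minimal PyTorch sketch of the pairwise classification decoder described above. It is an illustrative reconstruction from the equations, not the authors' released code; the tensor shapes, hidden size handling, and the use of GRUCell are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseDecoderSketch(nn.Module):
    """Illustrative K-step reasoning decoder (after Liu et al., 2019)."""

    def __init__(self, hidden: int, num_labels: int, k_steps: int = 5):
        super().__init__()
        self.k_steps = k_steps
        self.w1 = nn.Linear(hidden, 1, bias=False)       # attention over premise tokens
        self.w2 = nn.Linear(hidden, hidden, bias=False)  # bilinear attention over hypothesis tokens
        self.gru = nn.GRUCell(hidden, hidden)
        self.w3 = nn.Linear(4 * hidden, num_labels)      # classifier over [s; x; |s-x|; s*x]

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u: (batch, len_a, hidden) premise token states; v: (batch, len_b, hidden)
        alpha = F.softmax(self.w1(u).squeeze(-1), dim=-1)           # (batch, len_a)
        s = torch.bmm(alpha.unsqueeze(1), u).squeeze(1)             # s^0, (batch, hidden)
        scores = []
        for _ in range(self.k_steps):
            beta = F.softmax(torch.bmm(self.w2(v), s.unsqueeze(-1)).squeeze(-1), dim=-1)
            x = torch.bmm(beta.unsqueeze(1), v).squeeze(1)          # x^k
            s = self.gru(x, s)                                      # s^k = GRU(s^{k-1}, x^k)
            feats = torch.cat([s, x, (s - x).abs(), s * x], dim=-1)
            scores.append(F.softmax(self.w3(feats), dim=-1))        # P^k
        return torch.stack(scores).mean(dim=0)                      # average over the K steps
```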

2.3 Natural Language Inference

This task consists of identifying three inference relations between two sentences: Entailment, Neutral, and Contradiction.

Data: The data is based on the MedNLI dataset introduced by Romanov and Shivade (2018). The statistics of the dataset can be seen in Table 1.

Data Pre-Processing: On manual inspection of the data, we observe the presence of abbreviations in the premise and hypothesis. Since lexical overlap is a strong indicator of entailment by virtue of pre-trained embeddings on large corpora, the presence of abbreviations makes the task challenging. Therefore, we expand the abbreviations using the following two strategies:

1. Local Context: We observe that an abbreviation is often composed of the first letters of contiguous words. Therefore, we first construct potential abbreviations by concatenating the first letter of each word in a sequence, after tokenization (a code sketch of this heuristic is given below). For instance, for the premise shown below we get {CXR, CXRS, XRS, CXRSI, XRSI, RSI, ...}. This is done for both the premise and the hypothesis. We then check whether this n-gram exists in the hypothesis (or the premise). If it does, we replace that abbreviation with the words that make up the n-gram, giving the model more scope for matching the two strings lexically. We demonstrate an example below:

Premise: Her CXR was clear and it did not appear she had an infection.
Hypothesis: Chest X-Ray showed infiltrates.
Premise Modified: Her Chest X-Ray was clear and it did not appear she had an infection.

2. Gazetteer: If the premise/hypothesis does not contain the abbreviation expansion, or contains only a partial expansion, the Local Context technique will fail. Hence, we use an external gazetteer extracted from CAMC (https://www.camc.org/) to expand commonly occurring medical terms. There were 1373 entries in the gazetteer, covering common medical and clinical expansions. For instance,

Premise: On arrival to the MICU, patient is hemodynamically stable.
Premise Modified: On arrival to the Medical Intensive Care Unit, patient is hemodynamically stable.

We first perform the local context replacement, as it is more specific to a given premise-hypothesis pair. If there is no local context match, we then do a gazetteer lookup. It should be noted that one abbreviation can have multiple expansions in the gazetteer, and thus we hypothesized that local context should get preference when expanding an abbreviation.
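
Below is a minimal sketch of the local-context expansion heuristic, assuming simple whitespace tokenization with hyphenated words split for acronym building; the function names and candidate-generation details are our own illustrative choices, not the authors' exact implementation.

```python
import re

def _acronym(words):
    # First letters of each word, also splitting hyphenated words ("X-Ray" -> "X", "Ray").
    parts = [p for w in words for p in re.split(r"[-/]", w) if p]
    return "".join(p[0] for p in parts).upper()

def candidate_expansions(sentence, max_span=6):
    """Map candidate acronyms built from contiguous word spans to the span text."""
    tokens = sentence.split()
    table = {}
    for i in range(len(tokens)):
        for j in range(i + 2, min(i + max_span, len(tokens)) + 1):
            table.setdefault(_acronym(tokens[i:j]), " ".join(tokens[i:j]))
    return table

def expand_local_context(sentence, other_sentence):
    """Expand all-caps abbreviations in `sentence` using spans found in `other_sentence`."""
    table = candidate_expansions(other_sentence)
    out = []
    for tok in sentence.split():
        core = tok.strip(".,;:?")
        if len(core) >= 2 and core.isalpha() and core.isupper() and core in table:
            out.append(tok.replace(core, table[core]))
        else:
            out.append(tok)
    return " ".join(out)

premise = "Her CXR was clear and it did not appear she had an infection."
hypothesis = "Chest X-Ray showed infiltrates."
print(expand_local_context(premise, hypothesis))
# Her Chest X-Ray was clear and it did not appear she had an infection.
```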

Training Procedure: For training the MT-DNN model, we use the same hyper-parameters provided by the authors (Liu et al., 2019). We train the model for 4 epochs and early stop when the model reaches its highest validation accuracy.

Baselines: We use the following baselines, similar to Romanov and Shivade (2018):

• CBOW: We use a Continuous-Bag-Of-Words (CBOW) model as our first baseline. We take both the premise and the hypothesis and sum the word embeddings of the respective statements to form the input layer to our CBOW model. We use 2 hidden layers and a softmax decision layer (a sketch is given after this list).

• Infersent: Infersent is a sentence encoding model which encodes a sentence by max-pooling over all the hidden states of an LSTM across time steps. Following Romanov and Shivade (2018), we use a shared-weight LSTM cell to obtain the sentence representations of the premise (U) and the hypothesis (V). We feed these representations U and V to an MLP to perform a 3-way prediction. For our experiments, we use the pre-trained embeddings trained on the MIMIC dataset by Romanov and Shivade (2018), with the same hyperparameters.

• BERT: Since MT-DNN uses BERT (Devlin et al., 2018) as its encoder, we also compare results using just the pre-trained BERT model. We used the bert-base-uncased model, trained for 3 epochs with a learning rate of 2e-5, a batch size of 16, and a maximum sequence length of 128. We used the last 12 pre-trained layers of the model.
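
As an illustration, the CBOW baseline described above could look roughly like the sketch below; the embedding dimension, hidden sizes, and the choice to concatenate the premise and hypothesis vectors are our assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn

class CBOWBaseline(nn.Module):
    """Sketch of a CBOW NLI baseline: sum word embeddings, then a small MLP."""

    def __init__(self, vocab_size, emb_dim=300, hidden=200, num_labels=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, premise_ids, hypothesis_ids):
        # Bag-of-words sentence vectors: sum of word embeddings.
        p = self.emb(premise_ids).sum(dim=1)
        h = self.emb(hypothesis_ids).sum(dim=1)
        return self.mlp(torch.cat([p, h], dim=-1))  # logits; softmax applied in the loss
```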

2.3.1 Results and Discussion

In this section we discuss the results of all of our experiments on the NLI task.

Ablation Study: First, we conduct an ablation study of the effect of abbreviation expansion. Table 2 shows the results of the two abbreviation expansion techniques for the Infersent model. We observe the best performance with the Gazetteer strategy. This is because most sentences in the dataset did not have the abbreviation matched through the local context strategy. Since expanding abbreviations helped increase lexical overlap, we use the expanded-abbreviation data for all our experiments henceforth.

Model                                    Accuracy
Infersent                                78.8 +/- 0.06
Infersent + Local-Context                78.8 +/- 0.02
Infersent + Local-Context + Gazetteer    78.5 +/- 0.36
Infersent + Gazetteer                    79.1 +/- 0.14

Table 2: Mean and variance of accuracy, averaged over 3 runs with different random seeds.

Table 3 shows the confusion matrix for the Infersent model. The rows represent the ground truth and the columns represent the model predictions. We can see that the model is most confused about the entailment and neutral classes: 82 times the model predicts neutral for entailment, and 85 times vice versa. In order to address this issue, we add a prior on the dataset as a post-processing step.

                Contradiction   Entailment   Neutral
Contradiction   396             43           26
Entailment      30              353          82
Neutral         23              85           357

Table 3: Confusion matrix for NLI classes for the Infersent model. Rows denote the true labels and columns denote the model predictions.

Prior on the dataset: Our analysis of the validation set revealed that there are three hypotheses for a given premise, with mutually exclusive labels. Since we know that for a given premise there can only be one entailment, because of the nature of the dataset, we post-process the model predictions to add this constraint. For each premise we collect the prediction probability for each of the three hypotheses and pick the hypothesis with the highest probability for entailment. We perform the same selectional preference procedure on the remaining two classes. Such post-processing ensures that each premise always has three hypotheses with mutually exclusive labels; a sketch of this step is shown below.

Table 4 documents the results of the different models on the validation set. We observe that our method gives the best performance, outperforming all three baselines. Based on these results, our final submission on the unseen data can be seen in the last rows.

Model                                      Accuracy
CBOW                                       74.7
Infersent                                  79.1
BERT                                       80.4
Ours                                       82.1
(Ben Abacha et al., 2019) (Unseen Test)    71.4
Ours (Unseen Test)                         79.6
Ours (Unseen Test) + Prior                 85.5

Table 4: NLI results on the validation set (top) and the unseen test set (bottom).
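
The dataset prior described above can be implemented as a simple assignment step. The sketch below assumes the per-class probabilities for each premise's three hypotheses are available as a 3x3 array (hypotheses x classes); this framing of the procedure is our own.

```python
import numpy as np

LABELS = ["contradiction", "entailment", "neutral"]

def apply_dataset_prior(probs):
    """Assign mutually exclusive labels to the 3 hypotheses of one premise.

    probs: array of shape (3, 3); probs[h, c] is the model probability that
    hypothesis h has class c (columns ordered as in LABELS).
    """
    assignment = {}
    remaining_hyps = list(range(3))
    # Resolve entailment first, then the remaining classes, as described in the text.
    for cls in ["entailment", "contradiction", "neutral"]:
        c = LABELS.index(cls)
        best = max(remaining_hyps, key=lambda h: probs[h, c])
        assignment[best] = cls
        remaining_hyps.remove(best)
    return [assignment[h] for h in range(3)]

# Example: hypothesis 1 is most confidently the entailment.
print(apply_dataset_prior(np.array([[0.6, 0.2, 0.2],
                                    [0.1, 0.8, 0.1],
                                    [0.2, 0.3, 0.5]])))
# ['contradiction', 'entailment', 'neutral']
```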

2.3.2 Error Analysis

We perform a qualitative analysis of our model's errors and bucket them into the following categories:

1. Lexical Overlap: In the first example in Table 6, there is high lexical overlap between the premise and the hypothesis, prompting our model to falsely predict entailment.

2. Disease-Symptom relation: In the second example, our model lacks sufficient domain knowledge to relate hyperglycemia (a symptom) to diabetes (a disease). The model interprets these as two unrelated entities and labels the pair as neutral.

3. Drug-Disease relation: In the final example, our model does not detect that the drug names in the premise actually entail the condition in the hypothesis.

These examples show that NLI in the medical domain is very challenging and requires the integration of domain knowledge for understanding complex drug-disease or symptom-disease relations.

2.4 Recognizing Question Entailment

This task focuses on identifying entailment between two questions and is referred to as recognizing question entailment (RQE). The task is defined as: "a question A entails a question B if every answer to B is also a complete or partial answer to A". One of the questions is called the CHQ and the other the FAQ.

Data: The data is based on the RQE dataset collected by Abacha and Dina (2016). The dataset statistics can be seen in Table 7.

Pre-Processing: Similar to the NLI task, we pre-process the data to expand any abbreviations in the CHQ and FAQ.

Type: Train | CHQ: What is the treatment for tri-iodothyronine thyrotoxicosis? | FAQ: What is the treatment for T3 (triiodothyronine) thyrotoxicosis? | Label: True
Type: Train | CHQ: Do Coumadin and Augmentin interact? | FAQ: How do you inject the bicipital tendon? | Label: False
Type: Validation | CHQ: sepsis. Can sepsis be prevented. Can someone get this from a hospital? | FAQ: Who gets sepsis? | Label: True
Type: Validation | CHQ: medicine and allied. I LIKE TO KNOW RECENT THERAPY ON ARRHYTHMIA OF HEART | FAQ: What is an Arrhythmia? | Label: False

Table 5: Examples of question entailment pairs from the training and validation sets.

Category: Lexical Overlap
Premise: She is on a low fat diet
Hypothesis: She said they also have her on a low salt diet.
Ground truth: Neutral | Prediction: Entailment

Category: Disease-Symptom relation
Premise: Patient has diabetes
Hypothesis: The patient presented with a change in mental status and hyperglycemia.
Ground truth: Entailment | Prediction: Neutral

Category: Drug-Disease relation
Premise: She was treated with Magnesium Sulfate, Labetalol, Hydralazine and bedrest as well as betamethasone.
Hypothesis: The patient is pregnant
Ground truth: Entailment | Prediction: Neutral

Table 6: Qualitative analysis of the outputs produced by our model. We categorize the errors into different buckets and provide cherry-picked examples to demonstrate each category.

Label   Train set   Validation set
True    4655        129
False   3933        173

Table 7: The number of train and validation instances in each of the categories of the RQE dataset.

Training Procedure: The multi-task MT-DNN model gave the best performance on the NLI task, which motivated us to use it for the RQE task as well. We use the same hyperparameters as Liu et al. (2019) and train the model for 3 epochs.

Baselines: We compare our model with the following baselines:

• SVM: Similar to Abacha and Dina (2016), we use feature-based SVM and Logistic Regression models for the task of question entailment. We extract the features presented in Abacha and Dina (2016) to the best of our abilities. Their model uses lexical features such as word overlap proportion, Named Entity Recognition (NER) features, and features from the Unified Medical Language System (UMLS) repository. Due to access issues, we only use the i2b2 corpus (https://www.i2b2.org/NLP/DataSets/) for extracting the NER features.

• BERT: As before, we compare our model with the pre-trained BERT model. For this task, we used the bert-base-uncased model and fine-tuned the last 12 layers for 4 epochs with a learning rate of 2e-5 and a batch size of 16.

2.4.1 Distribution Mismatch Challenges

The RQE dataset posed many unique challenges, the main one being a distribution mismatch between the training and validation sets. Table 5 shows some examples from the training and validation sets which illustrate these challenges. We observe that in the training set, entailing examples always have high lexical overlap; there were about 1543 data points in the training set where the CHQ and FAQ were exact duplicates.

The non-entailing examples in the training set are completely unrelated, and hence the negative examples are not strong, whereas in the validation set the negative examples also have lexical overlap. Furthermore, the text in the validation set is more informal, with inconsistent casing, punctuation and spellings, whereas the training set is more structured. The CHQs in the validation set are also much longer than those observed in the training set. We therefore design our experimental settings based on these observations.

2.4.2 Data Augmentation

In order to address these challenges, we attempt to create synthetic data which is similar to our validation set. Another motivation for data augmentation was to increase the training size, because neural networks are data hungry. Since most deep neural models rely on lexical overlap as a strong indicator of entailment, we use the UMLS features to augment our training set in a way that helps disambiguate the false positives. We use the following procedure for data augmentation (a sketch appears after Figure 2):

1. We retrieve UMLS features for each question in the training, validation and test datasets, using the MetaMap classifier (https://metamap.nlm.nih.gov).

2. We use the retrieved concept types and canonical names to create a new question pair with the same label, as shown in Figure 2, where the phrase "primary ciliary dyskinesia" has been replaced by its canonical name "kartagener syndrome" and concept type "Disease or Syndrome". Since BERT and MT-DNN have been trained on vast amounts of English data, including Wikipedia, the models are sensitive to language structure. Therefore, while augmenting data with UMLS features, we attempt to maintain the language structure, as demonstrated in Figure 2. Since UMLS provides the canonical name for each phrase in the sentence, we replace the found phrase with the template "<UMLS canonical name>, a <UMLS concept type>,".

Along with the synthetic data, we also experiment with another question entailment dataset, Quora Question Pairs (QQP). We describe the different training data used in our experiments:

1. Orig: Using only the provided training data.

2. DataAug: Using the validation set augmented with the UMLS features as discussed above. The provided training data was not used in this setting because of the distribution mismatch. Despite the validation set being low-resource (300 sentences), MT-DNN has shown the capability of domain adaptation even in low-resource settings.

3. QQP: Quora Question Pairs (QQP, https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a dataset released to identify duplicate questions on Quora. Questions are considered duplicates if the answer to one question can be used as the answer to the other. We hypothesized that jointly training the model with the QQP dataset should help, as it is the closest to our RQE dataset in terms of online forum data. Since QQP has 400k training pairs, we choose a subset of approximately 9k data points in order to match the size of the RQE training data. Along with this, we use the validation set to train our model.

4. Paraphrase: Paraphrases of the DataAug data generated using an off-the-shelf tool (https://paraphrasing-tool.com). This was inspired by the observation that the validation set is in-domain but low-resource; the tool provides a cheap way of creating an additional artificial dataset.

2.4.3 Results and Discussion

The results on the validation set are shown in Table 9. We see that the MT-DNN model performs the best among all the models. The addition of the QQP dataset did not add extra value; we hypothesize that this is due to the lack of in-domain medical data in QQP.

The results of the MT-DNN model with the different training settings can be seen in Table 10. The test set comprises 230 question pairs. We observe that the DataAug setting, where the MT-DNN model is trained on the in-domain validation set, performs the best among all the strategies. As with the validation set, in this setting we also modify the test set with the UMLS features, augmenting it using the data augmentation procedure described above.

[Figure 2: Data augmentation using domain knowledge for RQE. The original pair — CHQ: "I am suffering from Kartagener's syndrome ... wanted information ... for this syndrome. ..." and FAQ: "What is primary ciliary dyskinesia?" — is passed through the UMLS feature extractor and data processing steps, yielding the augmented pair — CHQ: "I am suffering from Kartagener's syndrome, a Disease or Syndrome, ... wanted information ... for this syndrome, a Disease or Syndrome. ..." and FAQ: "what is kartagener syndrome, a Disease or Syndrome, ?".]
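
A minimal sketch of the augmentation template from §2.4.2 is shown below. It assumes UMLS concept annotations (surface phrase, canonical name, semantic type) have already been obtained, e.g. from MetaMap; the data structures and function names are our own illustrative choices.

```python
from dataclasses import dataclass

@dataclass
class UMLSConcept:
    phrase: str          # surface form found in the question
    canonical: str       # UMLS canonical (preferred) name
    semantic_type: str   # UMLS semantic type, e.g. "Disease or Syndrome"

def augment_with_umls(question, concepts, replace_with_canonical=False):
    """Insert ', a <semantic type>,' after each matched phrase, optionally
    swapping the phrase for its canonical name (as in the FAQ of Figure 2)."""
    out = question
    for c in concepts:
        surface = c.canonical if replace_with_canonical else c.phrase
        out = out.replace(c.phrase, f"{surface}, a {c.semantic_type},")
    return out

faq = "What is primary ciliary dyskinesia ?"
concepts = [UMLSConcept("primary ciliary dyskinesia", "kartagener syndrome",
                        "Disease or Syndrome")]
print(augment_with_umls(faq, concepts, replace_with_canonical=True))
# What is kartagener syndrome, a Disease or Syndrome, ?
```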

Category: Lexical Overlap
CHQ: Please i want to know the cure to Adenomyosis... I want to see a specialist doctor to help me out.
FAQ: Do I need to see a doctor for Adenomyosis ?
Ground truth: False | Prediction: True

Category: Multiple Questions
CHQ: Bipolar and Generalized Anxiety Disorder I read about Transcranial magnetic stimulation Therapy. Do you know anything about it? Has it had success? Also wondering about ECT? ... Is that true for mixed bipolar and generalized anxiety disorder along with meds? Have you ever heard of this?
FAQ: How effective is Transcranial magnetic stimulation for GAD?
Ground truth: True | Prediction: False

Category: Co-reference
CHQ: spina bifida; vertbral fusion;syrinx tethered cord. can u help for treatment of these problem.
FAQ: Does Spina Bifida cause vertebral fusion?
Ground truth: True | Prediction: True

Table 8: Qualitative analysis of the outputs produced by our RQE model. We categorize the errors into different buckets and provide cherry-picked examples to demonstrate each category.

Therefore, the test set now comprises 460 question pairs. We refer to the provided test set of 230 pairs as original and the augmented test set as UMLS. We submitted outputs on both the original test set and the UMLS-augmented test set and observe that the latter gives a +4.3 F1 gain over the original. We hypothesize that the addition of the UMLS-augmented data in the training process helped the model disambiguate false negatives.

Model                      Accuracy   F1
Abacha and Dina (2016)     -          75.0
SVM                        71.9       70.0
BERT                       76.2       76.2
MT-DNN + Orig              78.1       77.4
MT-DNN + QQP               80.8       77.2

Table 9: Results on the RQE validation set.

Despite the provided training data being about medical questions, it has a different data distribution and language structure. Adding it actually harms the model, as seen in the + Orig + DataAug + QQP row. For our final submission, we took an ensemble of all our submissions using a majority-vote strategy (sketched below). The ensemble model gave us the best performance.

Model                       F1
Ben Abacha et al. (2019)    54.1
MT-DNN + Orig               58.9
+ Orig + DataAug + QQP      60.6
+ DataAug (UMLS)            64.9
+ DataAug (original)        61.5
+ DataAug + QQP (UMLS)      64.9
Ensemble                    65.8

Table 10: Results on the RQE test set.
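
The majority-vote ensembling mentioned above can be sketched as follows; tie-breaking by the most frequent label seen first is our own assumption.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-example predictions from several models by majority vote.

    predictions_per_model: list of lists, one inner list of labels per model,
    all of equal length (one label per test example).
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

# Example with three model runs over four question pairs.
runs = [["true", "false", "true", "true"],
        ["true", "true",  "false", "true"],
        ["false", "false", "true", "true"]]
print(majority_vote(runs))  # ['true', 'false', 'true', 'true']
```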

                 Questions   Avg answer count   Avg answer length
Train set 1      104         8                  434.8
Train set 2      104         8                  432.5
Validation set   25          9                  420.4
Test set         150         7                  418.0

Table 11: Dataset statistics for the re-ranking task.

2.4.4 Error Analysis

Since we used the validation set for training the model, we cannot directly perform a standard error analysis. However, we manually analyze 100 question pairs from the test set and look at the different model predictions. We categorize the errors into the following categories, as shown in Table 8:

1. Lexical Overlap: Most of the models we used rely strongly on lexical overlap of tokens. Therefore, question pairs with high orthographic overlap carry a strong prior for the True label denoting entailment.

2. Multiple Questions: CHQs often contain multiple sub-questions. We hypothesize that multiple questions tend to confuse the model. Furthermore, as seen in Table 8, the FAQ entails from two sub-questions in the CHQ, which shows that the model lacks the ability to perform multi-hop reasoning.

3. Co-reference: The model is required to perform entity co-reference as part of the entailment. In the example shown in Table 8, the majority of our models marked this pair as entailment purely because of lexical overlap. However, the model would need to identify the co-reference between "these problem" and the problems mentioned in the previous sentence.

2.5 Question-Answering

In this section, we focus on building a re-ranker for question-answering systems. In particular, we attempt to use the NLI and RQE models for this task. In the ACL MediQA challenge, the question-answering system CHiQA (https://chiqa.nlm.nih.gov/) provides a possible set of answers, and the task is to rank them in order of relevance.

Data: The Task 3 dataset comprises 2 training sets and a validation set. The distribution of the data across the train, validation and test sets was consistent in terms of the average number of answer candidates and average answer length per question, as can be seen in Table 11.

2.5.1 Our Method

We implement the following re-ranking methods.

BM25: This is a ranking algorithm used for relevance-based ranking given a query. The formulation is given below:

$$\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\cdot(k_1+1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgd}\right)} \qquad (1)$$

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \qquad (2)$$

Here D is the answer and Q is the list of all words in the question; q_i refers to a single word. f(q_i, D) is the term frequency of q_i in document D, avgd is the average answer length, N is the total number of answers, and n(q_i) is the number of answers containing q_i. The hyper-parameters used for this experiment were b = 0.75 and k_1 = 1.2. As shown in Table 12, this gave an accuracy of 66.6 on the validation set. A sketch of this scoring function follows below.
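
A minimal sketch of the BM25 scoring in Equations (1)-(2), assuming lowercased whitespace tokenization; the tokenization and variable names are our own simplifications.

```python
import math
from collections import Counter

def bm25_scores(question, answers, k1=1.2, b=0.75):
    """Score each candidate answer against the question with BM25."""
    docs = [a.lower().split() for a in answers]
    avgd = sum(len(d) for d in docs) / len(docs)            # average answer length
    n_docs = len(docs)
    df = Counter()                                          # n(q_i): answers containing term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in question.lower().split():
            if q not in tf:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgd))
        scores.append(score)
    return scores

answers = ["Uveitis is caused by inflammatory responses inside the eye .",
           "Glaucoma damages the optic nerve .",
           "Cataracts cloud the lens of the eye ."]
print(bm25_scores("what causes uveitis", answers))  # the first answer scores highest
```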

NLI-RQE based model: In our second approach, we leverage the pre-built NLI and RQE models from Task 1 and Task 2 by including the NLI and RQE scores for each question-answer pair as features. Given a question, for each answer snippet we compute NLI scores between the question and each sentence in the answer. Since the answer snippet also contains sub-questions, we additionally use RQE scores to compute entailment with the question. This is illustrated below:

Question: "about uveitis. IS THE UVEITIS, AN AUTOIMMUNE DISEASE"

For the NLI scoring, we consider statements from the answer which might entail, contradict, or be neutral with respect to the question, such as "Uveitis is caused by inflammatory responses inside the eye." Similarly, we use the question phrases from the answer, such as "Facts About Uveitis (What Causes Uveitis?)", to give the answer an RQE score based on the number of entailments.

Finally, we take the BM25 score for the given answer, concatenate it with the above features, and use an SVM as the classifier; a sketch of this combination is given below.
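
The feature combination described above might look roughly like the following sketch, where the NLI and RQE scorers are stand-ins for the Task 1 and Task 2 models; the exact feature set (mean class probabilities, entailment counts) is our assumption rather than the authors' specification.

```python
import numpy as np
from sklearn.svm import SVC

def answer_features(question, answer_sentences, answer_questions,
                    bm25_score, nli_model, rqe_model):
    """Build one feature vector for a (question, answer) pair.

    nli_model(premise, hypothesis) -> [p_entail, p_neutral, p_contradict]
    rqe_model(chq, faq)            -> probability of entailment
    """
    nli = np.array([nli_model(sent, question) for sent in answer_sentences])
    rqe = np.array([rqe_model(question, q) for q in answer_questions]) \
        if answer_questions else np.zeros(1)
    return np.concatenate([
        [bm25_score],
        nli.mean(axis=0),        # average NLI class probabilities over sentences
        [(rqe > 0.5).sum()],     # number of entailed sub-questions
    ])

# Training: X stacks feature vectors for all question-answer pairs,
# y holds relevance labels; the trained SVM's scores are used to re-rank.
# clf = SVC(probability=True).fit(X, y)
```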

Model                                      Accuracy %
BM-25                                      66.6
RQE+NLI+Source                             67.5
Ben Abacha et al. (2019) (Unseen Test)     51.7
Ours (Unseen Test)                         56.5

Table 12: Accuracy for Task 3 on the validation set (top) and the test set (bottom).

2.5.2 Results

Table 12 documents the results of our experiments. We observe that adding the NLI and RQE scores as features gives some improvement over the BM25 model.

3 Conclusion and Future Work

In this work, we present a multi-task learning approach for textual inference and question entailment tailored to the medical domain. We observe that incorporating domain knowledge is necessary for specialized domains such as medicine, because models such as BERT and MT-DNN have been pre-trained on large amounts of generic-domain data, leading to a possible domain mismatch. In order to achieve domain adaptation, we explore techniques such as data augmentation using UMLS features and abbreviation expansion, and observe a gain of +10.8 F1 for RQE. There are still many standing challenges, such as incorporating common-sense knowledge in addition to domain knowledge, and multi-hop reasoning, which pose interesting future directions.

In the future, we also plan to explore other ranking methods based on relevance feedback or priority ranking for Task 3. We believe that using MedQuAD (Ben Abacha and Demner-Fushman, 2019) as a training set could further improve performance.

Acknowledgement

We are thankful to the anonymous reviewers for their valuable suggestions.

References

Asma Ben Abacha and Demner-Fushman Dina. 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, volume 2016, page 310. American Medical Informatics Association.

Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. arXiv e-prints.

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the BioNLP 2019 Workshop, Florence, Italy, August 1, 2019. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Carol Friedman, Pauline Kra, and Andrey Rzhetsky. 2002. Two biomedical sublanguages: a description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35(4):222–235.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
