The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Pavan Kapanipathi,† Veronika Thost,∗† Siva Sankalp Patel,† Spencer Whitehead,§ Ibrahim Abdelaziz,† Avinash Balakrishnan,† Maria Chang,† Kshitij Fadnis,† Chulaka Gunasekara,† Bassem Makni,† Nicholas Mattei,‡ Kartik Talamadupula,† Achille Fokoue†

†IBM Research, ∗MIT-IBM Watson AI Lab, §University of Illinois at Urbana-Champaign, ‡Tulane University

{kapanipa, avinash.bala, kpfadnis, krtalamad, achille}@us.ibm.com
{veronika.thost, siva.sankalp.patel, ibrahim.abdelaziz1, maria.chang, chulaka.gunasekara, bassem.makni}@ibm.com
[email protected] [email protected]

Abstract

Textual entailment is a fundamental task in natural language processing. Most approaches for solving this problem use only the textual content present in training data. A few approaches have shown that information from external knowledge sources like knowledge graphs (KGs) can add value, in addition to the textual content, by providing background knowledge that may be critical for a task. However, the proposed models do not fully exploit the information in the usually large and noisy KGs, and it is not clear how it can be effectively encoded to be useful for entailment. We present an approach that complements text-based entailment models with information from KGs by (1) using Personalized PageRank to generate contextual subgraphs with reduced noise and (2) encoding these subgraphs using graph convolutional networks to capture the structural and semantic information in KGs. We evaluate our approach on multiple textual entailment datasets and show that the use of external knowledge helps the model to be robust and improves prediction accuracy. This is particularly evident in the challenging BreakingNLI dataset, where we see an absolute improvement of 5-20% over multiple text-based entailment models.

1 Introduction

Given two natural language sentences, a premise P and a hypothesis H, the textual entailment task – also known as natural language inference (NLI) – consists of determining whether the premise entails, contradicts, or is neutral with respect to the given hypothesis (MacCartney and Manning 2009). In practice, this means that textual entailment is characterized as either a three-class (ENTAILS/NEUTRAL/CONTRADICTS) or a two-class (ENTAILS/NEUTRAL) classification problem (Bowman et al. 2015; Khot, Sabharwal, and Clark 2018).

Performance on the textual entailment task can be an indicator of whether a system, and the models it uses, are able to reason over text. This has tremendous value for modeling the complexities of human-level natural language understanding, and in aiding systems tuned for downstream tasks such as question answering (Harabagiu and Hickl 2006).

Figure 1: A premise and hypothesis pair along with a relevant subgraph from ConceptNet. Blue concepts occur in the premise, green in the hypothesis, and purple connect them.

Most existing textual entailment models focus only on the text of premise and hypothesis to improve classification accuracy (Parikh et al. 2016; Liu et al. 2019a). A recent and promising line of work has turned towards extracting and harnessing relevant semantic information from knowledge graphs (KGs) for each textual entailment pair (Chen et al. 2018; Wang et al. 2019). These approaches map terms in the premise and hypothesis text to concepts in a KG, such as WordNet (Miller 1995), ConceptNet (Speer, Chin, and Havasi 2017), or DBpedia (Auer et al. 2007), and use these mapped concepts for the textual entailment task. Figure 1 shows an example of such a mapping, where select terms from the premise and hypothesis are mapped to concepts from a knowledge graph (blue and green nodes, respectively). However, these models suffer from one or more of the following drawbacks: (1) they do not possess the ability to explicitly capture the semantic and structural information from the KG. For example, in Figure 1, the ability for models to encode information from paths between blue and green nodes

via purple nodes provides better context, facilitating the system to more correctly judge entailment; (2) they are not easily integrated with existing NLI models that exploit only the text of the premise and hypothesis; and (3) they are not flexible with respect to the type of KG that is used.

Contributions: We present an approach to the NLI problem that can augment any existing text-based entailment model with external knowledge. We specifically address the aforementioned challenges by: (1) introducing a neighbor-based expansion strategy in combination with subgraph filtering using Personalized PageRank (PPR) (Jeh and Widom 2003); this approach reduces noise and selects contextually relevant subgraphs from larger external knowledge sources for premise and hypothesis texts; and (2) encoding subgraphs using Graph Convolutional Networks (GCNs) (Kipf and Welling 2017), which are initialized with knowledge graph embeddings to capture structural and semantic information. This general approach to graph encoding allows us to use any external knowledge source that can be represented as a graph, such as WordNet, ConceptNet, or DBpedia. We show that the additional knowledge can improve textual entailment performance on four standard benchmarks: SciTail, SNLI, MultiNLI, and BreakingNLI. In particular, our experiments on the BreakingNLI dataset, where we see an absolute improvement of 3-20% over four text-based models, show that our technique is robust and resilient.

2 Related Work

We categorize the related approaches for NLI into: (1) approaches that take only the premise and hypothesis text as input, and (2) approaches that utilize external knowledge.
Neural models focusing solely on the textual information (Wang and Jiang 2016a; Yang et al. 2019) explore sentence representations of the premise structure and max pooling layers. match-LSTM (Wang and Jiang 2016a) and Decomposable Attention (Parikh et al. 2016) learn cross-sentence correlations using attention mechanisms, where the former uses an asymmetric network structure to learn a premise-attended representation of the hypothesis, and the latter uses a symmetric attention to decompose the problem into sub-problems. The latest NLI models use transformer architectures such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019b). These models perform exceedingly well on many NLI leaderboards (Zhang et al. 2018; Liu et al. 2019a). In this work, we show that the performance of text-based entailment models that use pre-trained BERT embeddings can be augmented with external knowledge.

Utilizing external knowledge has shown improvement in performance on many natural language processing (NLP) tasks (Huang et al. 2019; Moon et al. 2019; Musa et al. 2019). Recently, for NLI, Li et al. (2019) have shown that features from pre-trained language models and external knowledge complement each other. However, approaches that do utilize external knowledge for NLI are very few (Wang et al. 2019; Chen et al. 2018). In particular, the best model of Wang et al. (2019) combines rudimentary node information – in the form of concepts mentioned in premise and hypothesis text (blue and green nodes in Figure 1) – with the text information. However, this approach misses the rich subgraph structure that connects premise and hypothesis entities (purple nodes in Figure 1). Chen et al. (2018) have developed a model with WordNet-based co-attention that uses five engineered features from WordNet for each pair of words from premise and hypothesis. Being tightly integrated with WordNet, this model has the following drawbacks: (1) it is inflexible with respect to other external knowledge sources such as ConceptNet or DBpedia, and (2) it is non-trivial to integrate with other state-of-the-art text-based entailment systems. Our work addresses the drawbacks of each of these approaches while achieving competitive performance on many NLI datasets.

The availability of large-scale datasets (Bowman et al. 2015; Williams, Nangia, and Bowman 2018; Khot, Sabharwal, and Clark 2018) has fostered the advancement of neural NLI models in recent years. However, it is important to discuss the characteristics of these datasets to understand what they intend to evaluate (Glockner, Shwartz, and Goldberg 2018). In particular, datasets such as SNLI, SciTail, and MultiNLI (Bowman et al. 2015; Khot, Sabharwal, and Clark 2018; Williams, Nangia, and Bowman 2018) contain language artifacts that serve as significant cues for text-based neural models. These artifacts bias the models and make it harder to evaluate the impact of external knowledge (Chen et al. 2018; Wang et al. 2019). In order to evaluate approaches that are more robust and not susceptible to such biases, Glockner, Shwartz, and Goldberg (2018) created BreakingNLI – an adversarial test set on which most of the common text-based approaches show a significant drop in performance. It is important to note that this test set is generated using a subset of relationships from online resources for English learning, making it more suitable for models exploiting KGs with a lexical focus, such as WordNet. Nevertheless, BreakingNLI represents a first and important step in the evaluation of models that utilize external knowledge sources.

One of the core contributions of this work is the application of Graph Convolutional Networks for encoding knowledge graphs. While (graph-)structured knowledge represents a significant challenge for classical machine learning models, graph convolutional networks (Kipf and Welling 2017) offer an effective framework for representation learning on graphs. In parallel, relational GCNs (R-GCNs) (Schlichtkrull et al. 2018) have been designed to accommodate the highly multi-relational data characteristics of large KGs. Inspired by these works, we explore the use of R-GCNs for infusing information from KGs into NLI models.

3 KG-Augmented Entailment Model

In this section, we describe the central contribution of the paper – the KG-augmented Entailment System (KES). As shown in Figure 2, KES consists of two main components. The first component is a standard text-based model that creates a fixed-size representation of the premise and hypothesis texts. The second component selects contextual subgraphs for the premise and the hypothesis from a given KG, and encodes them using a GCN. The fixed-size representations from the two components are used as input to a standard feedforward layer for classification.

We opted for a combined graph and text approach because the noise and incompleteness of KGs renders a purely graph-based approach insufficient as a standalone solution. However, we show that the KG-augmented model provides valuable context and additional knowledge that may be missing in text-only representations.

3.1 A Standard Text-based Model

Given the premise P = (p_1, ..., p_n) and hypothesis H = (h_1, ..., h_m), let p_i and h_j be the embeddings of words occurring in sequence in the premise and hypothesis texts. These embeddings are input to a neural network T_NLI that outputs a fixed-size representation $t_{\text{out}} \in \mathbb{R}^K$:

$t_{\text{out}} = T_{\text{NLI}}(P, H)$  (1)

where T_NLI can be any of the existing state-of-the-art text-based NLI models (Wang and Jiang 2016a; Talman, Yli-Jyrä, and Tiedemann 2019; Liu et al. 2019a).

3.2 Contextual Subgraphs and their Representation using GCNs

This component uses an external KG to obtain a subgraph that is relevant with respect to the premise and the hypothesis, and then applies a GCN to encode this subgraph into a fixed-size representation g_out (Figure 3).

Subgraph Extraction: In order to retrieve a subgraph from the KG, we first map the terms in premise and hypothesis text to concepts in the KG by performing a max-substring match. For example, given the premise and hypothesis in Figure 1, the extracted and mapped concepts are shown in blue and green. Next, this initial set of concepts is expanded to include (one-hop) neighbor concepts, along with all the edges between them (the initial set and their neighbors) from the KG. In the example in Figure 1, we extract a subgraph that includes the purple nodes because they are directly connected to green and/or blue nodes.

Personalized PageRank (PPR) to Filter Context: KGs are typically very large, and concept expansion by just one hop can introduce a significant amount of noise (Wang et al. 2019; Lalithsena et al. 2017). For example, the concept girl is directly connected to over 1000 other concepts in ConceptNet. For this reason, we create a contextual subgraph by further filtering the one-hop subgraph.

To obtain the most relevant neighbor nodes given the premise and hypothesis texts, we use Personalized PageRank (PPR) (Page et al. 1999). PPR adds a bias to the PageRank algorithm by scoring the nodes conditioned on an initial subset of nodes in the graph. The bias is introduced by changing the uniformly distributed jump probability vector p of PageRank to a non-uniform distribution with respect to the initial subset of nodes (Equation 2). In our setting, this initial subset of nodes S consists of the concepts mentioned in the premise and hypothesis:

$p_i = \begin{cases} \frac{1}{|S|} & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases}$  (2)

The PPR-scores R are then computed as follows:

$R = (1 - \alpha) A \times R + \alpha p$  (3)

where R is a vector with scores for each node (post convergence); A is a normalized adjacency matrix (transition probability matrix); and α is the damping factor.

We normalize the PPR-scores by the maximum PPR-score of a node in the subgraph. We then choose a filtering threshold θ and exclude all nodes that are not in the initial subset S and have a normalized PPR-score below θ; we also exclude the edges that link to the deleted nodes. The remaining nodes and edges make up the contextual subgraph for the premise-hypothesis pair under consideration.
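To make this filtering step concrete, here is a minimal sketch of the PPR computation and the θ-cut, assuming a dense NumPy adjacency matrix for the one-hop subgraph. The function name, the column-stochastic normalization of A, the value α = 0.15, and the fixed iteration count are our assumptions, not details taken from the paper.

```python
import numpy as np

def ppr_contextual_subgraph(adj, seed_idx, theta, alpha=0.15, iters=100):
    """Filter a one-hop subgraph with Personalized PageRank (Eqs. 2-3).

    adj      -- (n, n) adjacency matrix of the expanded one-hop subgraph
    seed_idx -- indices of concepts mentioned in premise/hypothesis (set S)
    theta    -- filtering threshold on the max-normalized PPR scores
    alpha    -- weight of the jump vector in Eq. (3) (assumed value)
    """
    n = adj.shape[0]
    # Transition probability matrix: column-normalize the adjacency matrix.
    A = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1)
    # Non-uniform jump probability vector p, biased towards S (Eq. 2).
    p = np.zeros(n)
    p[list(seed_idx)] = 1.0 / len(seed_idx)
    # Iterate R = (1 - alpha) * A x R + alpha * p to (approximate) convergence.
    R = np.full(n, 1.0 / n)
    for _ in range(iters):
        R = (1.0 - alpha) * (A @ R) + alpha * p
    # Normalize by the maximum score; keep S plus all nodes scoring >= theta.
    R = R / R.max()
    seeds = set(seed_idx)
    return {i for i in range(n) if i in seeds or R[i] >= theta}
```

Edges incident to removed nodes are dropped along with them; the surviving nodes and edges form the contextual subgraph that is encoded next.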
Encoding Contextual Subgraphs: The contextual subgraph for premise and hypothesis is encoded using a relational graph convolutional network (R-GCN) (Schlichtkrull et al. 2018). GCNs compute node embeddings by iteratively aggregating the embeddings of neighbor nodes. R-GCNs extend standard GCNs (Kipf and Welling 2017) to deal with the multi-relational data of KGs: they learn different weight matrices for each type of relation occurring in the graph. We use an R-GCN to compute node embeddings, and then aggregate these embeddings to obtain a fixed-size representation for the contextual subgraph.

We first extend the contextual subgraph by adding a self-loop edge for each node; this retains the information of the node during convolution. Previous work (Wang et al. 2019) showed that the concepts mentioned in premise and hypothesis play an important role in improving NLI performance. Inspired by this, we retain information about the concepts (nodes) that occur in the premise and hypothesis text by adding a premise supernode v_p and a hypothesis supernode v_h. The premise supernode is connected to the concepts mentioned in the premise using bi-directional edges, and similarly the hypothesis supernode is connected to the concepts mentioned in the hypothesis.

Figure 2: Primary components of KES: standard text-based model, GCN-based graph embedder, and final feedforward classifier.
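As an illustration, the self-loop and supernode construction just described can be written as a small preprocessing step over relation triples. This is our own sketch, not code from the paper: since the paper specifies only two edge types overall (ConceptNet edges and self-loops), the supernode links are given the generic KG edge type here, and the node identifiers v_p and v_h are illustrative.

```python
def build_contextual_graph(kept_edges, premise_concepts, hypothesis_concepts):
    """Extend a PPR-filtered subgraph with self-loops and supernodes.

    kept_edges          -- (head, edge_type, tail) triples surviving PPR filtering
    premise_concepts    -- concepts max-substring-matched in the premise
    hypothesis_concepts -- concepts max-substring-matched in the hypothesis
    """
    nodes = {h for h, _, _ in kept_edges} | {t for _, _, t in kept_edges}
    nodes |= set(premise_concepts) | set(hypothesis_concepts)

    graph = list(kept_edges)
    # Self-loop per node so each node retains its own state during convolution.
    graph += [(v, "self_loop", v) for v in nodes]
    # Bi-directional edges between each supernode and its mentioned concepts.
    for supernode, mentioned in (("v_p", premise_concepts),
                                 ("v_h", hypothesis_concepts)):
        for concept in mentioned:
            graph += [(supernode, "kg", concept), (concept, "kg", supernode)]
    return graph
```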

Figure 3: Overview of the KES approach. KES links terms in the premise and hypothesis to concepts in the KG, creates contextual subgraphs via Personalized PageRank filtering, encodes those subgraphs with an R-GCN, and finally combines the aggregated node embeddings with text representations in a feedforward classifier. h_p and h_h in the figure denote $h^L_{v_p}$ and $h^L_{v_h}$ in Equation (6), respectively.

We then apply the algorithm suggested by Nguyen and Grishman (2018) – which uses a simple sum as the aggregation function – but we include a normalization factor and disregard the bias (similar to Schlichtkrull et al. (2018)):

$h_u^{l+1} = \rho\left( \sum_{r \in \mathcal{R}} \sum_{v \in N_{u,r}} \frac{1}{c_{u,r}} W_r^l h_v^l \right)$  (4)

Here, $\mathcal{R}$ is the set of edge types; $N_{u,r}$ is the set of neighbors connected to node u through the edge type r; $c_{u,r}$ is a normalization constant; $W_r^l$ are the learnable weight matrices, one per edge type $r \in \mathcal{R}$; and ρ is a non-linear activation function. We use the (symmetric) normalized Laplacian as the normalization constant (Kipf and Welling 2017).

The final node embeddings are aggregated using a summation-based graph-level readout function (Xu et al. 2019):

$s_G = \rho\left( \sum_{v \in V} W h_v^L \right)$  (5)

where V is the set of nodes in our contextual graph, W is a learnable weight matrix, and ρ is an activation function. This summation-based readout function allows the encoder to learn representations that encode the structure of the graph.

The final representation of the contextual subgraph is obtained by concatenating $s_G$ – the aggregated embeddings of all the nodes – with the embeddings of the premise and hypothesis supernodes:

$g_{\text{out}} = [s_G; h_{v_p}^L; h_{v_h}^L]$  (6)

3.3 Final Classifier

The final feedforward classifier takes as input the text encoding from Equation (1) and the graph encoding from Equation (6) to classify the premise and hypothesis as entailment/contradiction/neutral:

$E_{\text{pred}} = \text{FFN}([t_{\text{out}}; g_{\text{out}}])$  (7)
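Putting Equations (4)-(7) together, the following PyTorch sketch shows one relational convolution, the summation readout, and the final classifier for a single premise-hypothesis pair. It is a schematic under our own assumptions: the per-edge-type adjacency matrices are dense and already carry the $c_{u,r}$ normalization, and the classifier's hidden width is illustrative rather than a value reported in the paper.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """One relational convolution (Eq. 4): per-edge-type linear transforms of
    neighbor embeddings, summed over edge types, then a non-linearity."""
    def __init__(self, dim, num_edge_types):
        super().__init__()
        self.w = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                               for _ in range(num_edge_types))

    def forward(self, h, adjs):
        # h: (n, dim) node embeddings; adjs: list of normalized (n, n) matrices,
        # one per edge type, with the c_{u,r} normalization folded in.
        return torch.relu(sum(a @ w(h) for a, w in zip(adjs, self.w)))

class GraphEncoder(nn.Module):
    """R-GCN encoding, summation readout (Eq. 5), concatenation (Eq. 6)."""
    def __init__(self, dim, num_edge_types):
        super().__init__()
        self.rgcn = RGCNLayer(dim, num_edge_types)
        self.w_readout = nn.Linear(dim, dim, bias=False)

    def forward(self, h0, adjs, vp_idx, vh_idx):
        h = self.rgcn(h0, adjs)                          # final embeddings h^L
        s_g = torch.relu(self.w_readout(h).sum(dim=0))   # s_G (Eq. 5)
        return torch.cat([s_g, h[vp_idx], h[vh_idx]])    # g_out (Eq. 6)

class FinalClassifier(nn.Module):
    """Feedforward classifier over [t_out; g_out] (Eq. 7)."""
    def __init__(self, text_dim, graph_dim, hidden=300, num_classes=3):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(text_dim + graph_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, t_out, g_out):
        return self.ffn(torch.cat([t_out, g_out]))       # E_pred logits
```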

4 Experiments & Results

In this section, we describe the experiments that we performed to evaluate our approach; the setup, including datasets, models, and implementations; and the results.

4.1 Datasets

We considered the most popular NLI datasets: SNLI (Bowman et al. 2015), SciTail (Khot, Sabharwal, and Clark 2018), and MultiNLI (Williams, Nangia, and Bowman 2018). While SNLI and MultiNLI are prominent datasets covering a wide range of topics, SciTail offers an in-depth focus on science-domain questions. Since this difference is also reflected in linguistic variations, the datasets allow evaluating very different settings. As mentioned in Section 2, these datasets carry linguistic cues that are easily captured by neural text-based models. Hence, to show the impact of knowledge graphs, we primarily evaluate our approach on the BreakingNLI dataset (Glockner, Shwartz, and Goldberg 2018).

4.2 Knowledge Graphs

Prior work on NLI has shown that ConceptNet contains information more useful to this problem than DBpedia and WordNet (Wang et al. 2019). Furthermore, Speer, Chin, and Havasi (2017) showed that, when ConceptNet is combined with embeddings acquired from distributional semantics, it provides applications with a richer understanding than narrower resources like the latter KGs. We therefore focus on ConceptNet for now, leaving experiments with other KGs as future work.

4.3 Models for Text Representations

We experimented with four different text-based models to obtain numerical representations of premise and hypothesis text (Equation (1)). Our selection criteria were: (1) performance on leaderboards, (2) relevance for NLP in general, and (3) ease of implementation and availability. Our goal is to augment each of these models with external knowledge and hence test the generalizability of KES, which also shows the benefits of its modularity. We used the AllenNLP library¹ to implement the models described below (see also Section 2).

Decomposable Attention Model (DecompAttn). One of the earlier and most common baseline models used for NLI (Parikh et al. 2016; Wang et al. 2019; Glockner, Shwartz, and Goldberg 2018; Chen et al. 2018). Hence, our hypothesis is that KES can add more value and yield a larger delta in performance.

match-LSTM. An NLI model with good performance not only on multiple NLI leaderboards such as SciTail and SNLI, but also applicable to other NLP tasks such as question answering (Wang and Jiang 2016b).

BERT + match-LSTM. A version of match-LSTM using BERT embeddings instead of the GloVe embeddings in the former. We opted for this model to take advantage of the improvements BERT embeddings have generated for numerous NLP tasks.

Hierarchical BiLSTM Max Pooling (HBMP). Shows superior performance on multiple NLI benchmarks including SciTail, SNLI, and MultiNLI.

¹https://github.com/allenai/allennlp. AllenNLP includes the DecompAttn model.

4.4 Models using External Knowledge

There are two other models exploiting external knowledge for NLI. We compare them to KES:

KIM (Chen et al. 2018) uses five different features for every pair of terms from premise and hypothesis. The features are extracted from WordNet and are infused into the model via a knowledge-based co-attention mechanism.

ConSeqNet (Wang et al. 2019) takes the concepts mentioned in premise and hypothesis as input to a match-LSTM model (with a GRU encoding). It is important to note that the match-LSTM model better suits text than graph structure, because it uses a seq2seq encoder to account for the inherently sequential nature of text, which is not present in graphs.

4.5 Experimental Setup and Implementation

To evaluate the impact of KES on NLI in general and its compatibility with various existing models, we compared all text-based models described above (Section 4.3) to the corresponding combined text+graph models. Because the BreakingNLI test set is derived from the SNLI training set, all models trained on SNLI were evaluated on both the SNLI and BreakingNLI test sets.

Text Model Parameters. We chose hyperparameters as reported in related works. For match-LSTM and BERT-match-LSTM, we refer to Wang et al. (2019). For HBMP and DecompAttn, we used the parameters from Talman, Yli-Jyrä, and Tiedemann (2019) and Parikh et al. (2016), respectively.

KES Setup and Training. As initial graph embeddings, we considered TransH (Wang et al. 2014) and ComplEx (Trouillon et al. 2016). For each model (i.e., text-only + graph model combination), we experimented with both embedding approaches and selected the one that performed best on the validation sets. All GCNs were configured as follows: two edge types (one for edges in ConceptNet and one for the self-loops); 300 dimensions for each embedding across all layers; one convolutional layer; one additional linear layer (after the convolution); and ReLU for all activations. These parameters yielded the best average accuracy on the validation sets, so we chose them uniformly for all models for consistency across our approaches.

The Personalized PageRank threshold θ for filtering the subgraphs was also tuned as a hyperparameter. We experimented with θ values of 0.2, 0.4, 0.6, and 0.8. We did not experiment with whole one-hop graphs (θ = 0.0), as they have been shown to increase in size very rapidly over single hops in ConceptNet (Wang et al. 2019).

Training (of the combined models) consisted of 140 epochs with a patience of 20 epochs. Batch size and learning rate were kept at 64 and 0.0001, respectively, over all experiments to make the models comparable to each other.
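For reference, the setup described in this subsection can be collected into one configuration block; the key names below are our own naming, while the values are the ones reported above.

```python
# Hypothetical configuration fragment summarizing Section 4.5; key names are ours.
KES_CONFIG = {
    "initial_graph_embeddings": ["TransH", "ComplEx"],  # picked per model on validation
    "gcn": {
        "num_edge_types": 2,           # ConceptNet edges + self-loops
        "embedding_dim": 300,          # all layers
        "num_conv_layers": 1,
        "linear_layers_after_conv": 1,
        "activation": "relu",
    },
    "ppr_threshold_grid": [0.2, 0.4, 0.6, 0.8],  # theta, tuned per model
    "training": {
        "num_epochs": 140,
        "patience": 20,
        "batch_size": 64,
        "learning_rate": 1e-4,
    },
}
```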

Models                          SciTail             MultiNLI            SNLI                BreakingNLI
                                Text   KES          Text   KES          Text   KES          Text   KES
match-LSTM                      82.54  82.22 (0.6)  71.32  71.67 (0.8)  83.60  83.94 (0.6)  65.11  78.72
BERT+match-LSTM                 89.13  90.68 (0.2)  77.96  76.73 (0.6)  85.78  85.97 (0.6)  59.42  77.59
HBMP                            81.37  83.49 (0.2)  69.27  68.42 (0.6)  84.61  83.84 (0.2)  60.31  63.60
DecompAttn                      76.57  72.43 (0.8)  64.89  71.93 (0.6)  79.28  85.56 (0.6)  51.3*  59.83

Existing Models with KG         Text   Text+Graph   Text   Text+Graph   Text   Text+Graph   Text   Text+Graph
KIM (Chen et al. 2018)          -      NE           -      76.4*        -      88.6*        -      83.1*
ConSeqNet (Wang et al. 2019)    84.2*  85.2*        71.32  70.9         83.60  83.34        65.11  61.12

Table 1: Entailment accuracy results of KES with different text models compared to text-only entailment models (Text). Bold values indicate where KES improves performance. PPR θ-values are shown in parentheses. *Reported values from related work.

4.6 Results

Table 1 gives an overview of our results. They demonstrate that KES, and thus external knowledge, has the biggest impact on the BreakingNLI test set. The accuracy of the text-only models is improved: the BERT-based match-LSTM model by 18 percentage points, match-LSTM by 13 percentage points, HBMP by 3 percentage points, and DecompAttn by 8 percentage points. Notably, the most dramatic impact of KES is on the BERT-based match-LSTM model, which is generally the strongest text-only model on the other datasets.

Despite their competitive performance on SNLI, all text-only models perform significantly worse on the BreakingNLI test set than on the SNLI test set, which is consistent with observations from the original BreakingNLI paper. The DecompAttn text-only model shows the biggest drop in performance (28 percentage points) between SNLI and BreakingNLI. The match-LSTM text-only model shows the smallest drop in performance between SNLI and BreakingNLI – still a substantial 18 percentage points. In contrast, KES shows only modest decreases in performance between SNLI and BreakingNLI when a GloVe- or BERT-based match-LSTM text model is used, with accuracy decreasing only 5 and 8 percentage points, respectively. However, there is a significant decrease in performance between SNLI and BreakingNLI when KES uses HBMP or DecompAttn as its text model (20 and 26 percentage points, respectively), suggesting a potentially complex interaction between text and external knowledge features. Overall, while the KES models perform comparably to their text-based counterparts on SNLI, SciTail, and MultiNLI, they perform significantly better on the BreakingNLI dataset.

These results support three important claims. First, they demonstrate that KES is modular in that it can be combined with existing text models with different architectures. Second, the KES approach effectively infuses external knowledge into existing entailment models to improve performance on the challenging BreakingNLI dataset. Third, KES is robust to dataset changes that dramatically decrease the performance of other NLI models.

Comparison to Other KG-based Models. Table 1 also contains the results for the graph-based models KIM and ConSeqNet. Both show comparable performance to match-LSTM KES, with KIM performing best on all datasets. We discuss important differences between KES, KIM, and ConSeqNet below.

KIM introduces an external knowledge-based co-attention mechanism, using five manually engineered features from WordNet for every pair of words in premise and hypothesis. These features are specific to WordNet relations, which means that the model can only be used with WordNet or comparable KGs with the same set of relations. One can argue that, because KIM depends on WordNet, it is especially suited to BreakingNLI, as WordNet contains exactly the type of lexical information that is targeted by BreakingNLI. Another difference between KIM and KES is that it is not clear how to adapt KIM's five engineered features to a different textual entailment system. In contrast, KES is not tied to any particular KG, KG vocabulary, or existing entailment system. One of the practical goals of KES is to develop an approach that is easily adaptable to different datasets, knowledge graphs, and existing entailment models. Although we tune only the PPR threshold as a hyperparameter, our knowledge-augmented approaches perform almost on par with KIM on SNLI and MultiNLI, with the exception of the BreakingNLI dataset (-4.4 percentage points).

ConSeqNet, similar to our model and unlike KIM, does provide an architecture to plug in any text-based entailment model. However, there are two primary differences between our work and ConSeqNet. First, we are the first to encode the graph structure of the knowledge graph, whereas ConSeqNet uses only the concepts mentioned in the text, encoding them using RNNs. Second, in comparison to ConSeqNet, our approach performs better with different entailment models over all the datasets. In particular, on the BreakingNLI dataset, our implementation of ConSeqNet shows a drop in performance in comparison to its text-based counterpart. This is surprising and may need further investigation.

In summary, in addition to the performance goals of KIM and ConSeqNet, KES seeks to infuse entailment models with knowledge in a way that is modular and sensitive to graph structure, independent of a specific KG.

Harnessing External Knowledge. Table 2 shows the average number of concepts (nodes) and relations (edges) in contextual subgraphs generated by KES, ConSeqNet, and KIM, excluding those that were explicitly mentioned in the premise and hypothesis texts. Unlike ConSeqNet and KIM, KES is able to use a large amount of external knowledge that is related to the premise and hypothesis but not explicitly mentioned. As observed in prior work (Wang et al. 2019), expanding subgraphs by even one hop results in very large graphs, making PPR filtering very important.

5 Discussion

Negative Results: In Table 1, we observe two results that did not confirm our hypotheses: (1) the reduced text+graph improvement on BreakingNLI for HBMP, and (2) lower text+graph performance for DecompAttn on SciTail (> 2 percentage points). We are investigating these issues, but one possible explanation for the reduced improvement on HBMP is that it is one of the few text-based models with a large final hidden layer (a 14K feature vector), compared to the 900 features from the GCN model, which possibly biases the final classifier towards the text-based features.

Personalized PageRank Threshold: Our initial plan for using PPR thresholds was to make thresholding a preprocessing step and fix one threshold per dataset on a base model. However, as shown in Table 1, tuning the PPR threshold as a hyperparameter for each trained model yielded better performance. Also, a high PPR threshold, in particular 0.8, retains very few concepts that are not mentioned in the premise and hypothesis text, whereas contextual subgraphs at 0.2 can contain as many concepts from external knowledge as are mentioned in the text (Table 2). PPR filtering is just one possible method for reducing the noise that results from neighborhood-based expansion techniques. In our future work, we intend to investigate a different filtering approach where only those paths that connect the premise and hypothesis are included.

           SciTail (17.74*)     SNLI (11.5*)        MultiNLI (17.5*)
PPR θ      Edges    Nodes       Edges    Nodes      Edges    Nodes
0.2        42.65    10.14       80.29    19.83      76.27    16.15
0.4        26.72    7.48        25.70    8.15       33.82    6.48
0.6        15.53    4.35        14.08    4.65       23.97    3.44
0.8        11.67    3.04        9.98     3.18       20.27    2.05
ConSeqNet  0        0 (17.74*)  0        0 (11.5*)  0        0 (17.5*)   No edges or new concepts are added from ConceptNet.
KIM        Features based on fixed WordNet relations. No new concepts are added.

Table 2: Average number of nodes and edges (not explicitly mentioned in text) in the combined premise and hypothesis subgraphs, by PPR threshold. *Average number of concepts explicitly mentioned in each premise and hypothesis text.

Dataset Characteristics: We evaluated our KES approach on NLI datasets that are widely used in the literature. However, there has been criticism regarding the way these datasets are created and the resulting biases that can be exploited by learning algorithms (Glockner, Shwartz, and Goldberg 2018; Li et al. 2019; Gururangan et al. 2018; Poliak et al. 2018). Even in our work, while Table 1 shows that the DecompAttn model is consistently improved by KES on SNLI, MultiNLI, and BreakingNLI, we also see the opposite effect on SciTail. Some qualitative analysis of the SciTail dataset showed us that the use of a KG can negatively impact performance because of the high overlap between premise and hypothesis terms.

Text-based models trained on SNLI perform significantly worse on the BreakingNLI test set, consistent with the results reported above. Notably, the estimated human performance on the BreakingNLI test set is higher than that on the original SNLI test set, providing further evidence that models that perform well on SNLI but poorly on BreakingNLI are poor approximations of human inference. On the other hand, NLI models that generalize well to BreakingNLI are more likely to be better approximations of human-like inference. The complexity of the BreakingNLI test set and its characteristics make it the most interesting evaluation set.

Complexity of Knowledge Graphs and their Usage: As mentioned above, the current state of the art for BreakingNLI is the KIM model, which achieves 83% accuracy, while our best-performing KES model (KES with the match-LSTM text model) achieves an accuracy of 79%. This difference can be attributed to aspects of the KIM model that make it particularly well suited to the BreakingNLI dataset at the expense of model flexibility and generality. KIM relies on WordNet, which has lexical information that aligns very closely with the challenging aspects of BreakingNLI. This focus clearly benefits performance on the task. However, WordNet is relatively small (117k triples, i.e., edges) compared to ConceptNet (3.15M triples) and has a very specific scope that is unlikely to cover the broad classes of entailment that occur in natural language. For example, recognizing textual entailment may depend on world knowledge that is not lexical in nature. In such cases it would be necessary to invoke a model that is not primarily focused on lexical knowledge. This is one of the motivations behind the KES approach: to support very large KGs (e.g., ConceptNet) and to avoid dependencies on any single KG or domain area. An important topic for future work will be to understand the shortcomings of various knowledge sources, how to manage choosing the appropriate knowledge sources for a given task, and to continue exploring graph filtering and selection methods that leverage large-scale KGs while minimizing noise. KIM mitigates the noise issue by using a restricted set of relations to provide greater focus and minimize the intrusion of potentially irrelevant knowledge. Again, this is a characteristic of KIM that will not necessarily generalize well to other NLI datasets, such as SciTail, which may depend less on hyper- and hyponym relations and more on knowledge about everyday physical objects and processes.

6 Conclusion

In this paper, we presented a systematic approach for infusing external knowledge into the textual entailment task using contextually relevant subgraphs extracted from a KG and encoded with graph convolutional networks. These graph representations are combined with standard text-based representations in a KG-augmented entailment system that yields significant improvement on the challenging BreakingNLI dataset. Additionally, the KES approach is modular, can be used with any knowledge graph, and generalizes to multiple datasets. In our future work, we plan to consider other KGs and to investigate alternative graph representations. Furthermore, it would be interesting to see how KES performs on popular question answering datasets.

References

Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web. Springer. 722–735.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Empirical Methods in Natural Language Processing Conference, 632–642.

Chen, Q.; Zhu, X.; Ling, Z.-H.; Inkpen, D.; and Wei, S. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2406–2417.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.

Glockner, M.; Shwartz, V.; and Goldberg, Y. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 650–655.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference: Human Language Technologies, Volume 2 (Short Papers), 107–112.

Harabagiu, S., and Hickl, A. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 905–912. Association for Computational Linguistics.

Huang, X.; Zhang, J.; Li, D.; and Li, P. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 105–113. ACM.

Jeh, G., and Widom, J. 2003. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, 271–279. ACM.

Khot, T.; Sabharwal, A.; and Clark, P. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence.

Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Lalithsena, S.; Perera, S.; Kapanipathi, P.; and Sheth, A. 2017. Domain-specific hierarchical subgraph extraction: A recommendation use case. In International Conference on Big Data (Big Data), 666–675. IEEE.

Li, T.; Zhu, X.; Liu, Q.; Chen, Q.; Chen, Z.; and Wei, S. 2019. Several experiments on investigating pretraining and knowledge-enhanced models for natural language inference. arXiv preprint arXiv:1904.12104.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4487–4496. Association for Computational Linguistics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

MacCartney, B., and Manning, C. D. 2009. Natural Language Inference. Ph.D. Dissertation, Stanford University.

Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

Moon, S.; Shah, P.; Kumar, A.; and Subba, R. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 845–854.

Musa, R.; Wang, X.; Fokoue, A.; Mattei, N.; Chang, M.; Kapanipathi, P.; Makni, B.; Talamadupula, K.; and Witbrock, M. 2019. Answering science exam questions using query rewriting with background knowledge. Automated Knowledge Base Construction.

Nguyen, T. H., and Grishman, R. 2018. Graph convolutional networks with argument-aware pooling for event detection. In Proceedings of the AAAI Conference on Artificial Intelligence.

Page, L.; Brin, S.; Motwani, R.; and Winograd, T. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Parikh, A.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. In Proceedings of the Empirical Methods in Natural Language Processing Conference, 2249–2255.

Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme, B. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference: Human Language Technologies, 180–191. Association for Computational Linguistics.

Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference, 593–607. Springer.

Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence.

Talman, A.; Yli-Jyrä, A.; and Tiedemann, J. 2019. Sentence embeddings in NLI with iterative refinement encoders. Natural Language Engineering 25(4):467–482.

Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In Proceedings of the International Conference on Machine Learning, 2071–2080.

Wang, S., and Jiang, J. 2016a. Learning natural language inference with LSTM. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference: Human Language Technologies, 1442–1451.

Wang, S., and Jiang, J. 2016b. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.

Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence.

Wang, X.; Kapanipathi, P.; Musa, R.; Yu, M.; Talamadupula, K.; Abdelaziz, I.; Chang, M.; Fokoue, A.; Makni, B.; Mattei, N.; and Witbrock, M. 2019. Improving natural language inference using external knowledge in the science questions domain. In Proceedings of the AAAI Conference on Artificial Intelligence.

Williams, A.; Nangia, N.; and Bowman, S. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics Conference: Human Language Technologies, Volume 1, 1112–1122.

Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR).

Yang, X.; Zhu, X.; Zhao, H.; Zhang, Q.; and Feng, Y. 2019. Enhancing unsupervised pretraining with external knowledge for natural language inference. In Proceedings of the Canadian Conference on Artificial Intelligence, 413–419.

Zhang, Z.; Wu, Y.; Li, Z.; He, S.; Zhao, H.; Zhou, X.; and Zhou, X. 2018. I know what you want: Semantic learning for text comprehension. arXiv preprint arXiv:1809.02794.
