
SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference

Martin Schmitt and Hinrich Schütze
Center for Information and Language Processing (CIS)
LMU Munich, Germany
[email protected]

Abstract

We present SherLIiC, a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types. Due to our candidate selection process based on strong distributional evidence, SherLIiC is much harder than existing testbeds because distributional evidence is of little utility in the classification of InfCands. We also show that, due to its construction, many of SherLIiC's correct InfCands are novel and missing from existing rule bases. We evaluate a number of strong baselines on SherLIiC, ranging from semantic vector space models to state of the art neural models of natural language inference (NLI). We show that SherLIiC poses a tough challenge to existing NLI systems.

(1) troponymy:              ORGF[A] is granting to EMPL[B] ⇒ ORGF[A] is giving to EMPL[B]
(2) synonymy + derivation:  ORGF[A] is supporter of ORGF[B] ⇒ ORGF[A] is backing ORGF[B]
(3) typical actions:        AUTH[A] is president of LOC[B] ⇒ AUTH[A] is representing LOC[B]
(4) script knowledge:       PER[A] is interviewing AUTH[B] ⇒ PER[A] is asking AUTH[B]
(5) common sense knowledge: ORGF[A] claims LOC[B] ⇒ ORGF[A] is wanting LOC[B]

Table 1: Examples of SherLIiC InfCands and NLI challenges they cover. ORGF=organization founder, EMPL=employer, AUTH=book author, LOC=location, POL=politician, PER=person.

1 Introduction

Lexical inference (LI) can be seen as a focused variant of natural language inference (NLI), also called recognizing textual entailment (Dagan et al., 2013). Recently, Gururangan et al. (2018) showed that annotation artifacts in current NLI testbeds distort our impression of the performance of state of the art systems, giving rise to the need for new evaluation methods for NLI. Glockner et al. (2018) investigated LI as a way of evaluating NLI systems and found that even simple cases are challenging to current systems. In this paper, we release SherLIiC,¹ a testbed specifically designed for evaluating a system's ability to solve the hard problem of modeling lexical entailment in context.

Levy and Dagan (2016) identified context-sensitive – as opposed to "context-free" – entailment as an important evaluation criterion and created a dataset for LI in context (LIiC). In their data, WordNet (Miller, 1995; Fellbaum, 2005) synsets serve as context for one side of a binary relation, but the other side is still instantiated with a single concrete expression. We aim to improve this setting in two ways.

First, we type our relations on both sides, thus making them more general. Types provide a context that can help in disambiguation and at the same time allow generalization over contexts because arguments of the same type are represented abstractly. An example for the need for disambiguation is the verb "run": "run" entails "lead" in the context of PERSON / COMPANY ("Bezos runs Amazon"), but in the context of COMPUTER / SOFTWARE, "run" entails "execute"/"use" ("my mac runs macOS"). Here, types help find the right interpretation.

Second, we only consider relations between named entities (NEs). Inference mining based on non-NE types such as WordNet synsets (e.g., ANIMAL, PLANT LIFE) primarily discovers facts like "parrotfish feed on algae". In contrast, the focus on NEs makes it more likely that we will capture events like "Walmart closes gap with Amazon" and thus knowledge about event entailment like ["A is closing gap with B" ⇒ "B is having lead over A"] that is substantially different from knowledge about general facts.

In more detail, we create SherLIiC as follows. First, we extract verbal relations between Freebase (Bollacker et al., 2008) entities from the entity-linked web corpus ClueWeb09 (Gabrilovich et al., 2013).² We then divide these relations into typable subrelations based on the most frequent Freebase types found in their extensions. We then create a large set of inference rule candidates (InfCands), i.e., premise-hypothesis pairs of verbally expressed relations. Finally, we use Amazon Mechanical Turk to classify each InfCand in a randomly sampled subset as entailment or non-entailment.

In summary, our contributions are the following: (1) We create SherLIiC, a new resource for LIiC, consisting of 3985 manually annotated InfCands. Additionally, we provide ~960k unlabeled InfCands (SherLIiC-InfCands) and the typed event graph SherLIiC-TEG, containing ~190k typed textual binary relations between Freebase entities. (2) SherLIiC is harder than existing testbeds because distributional evidence is of limited utility in the classification of InfCands. Thus, SherLIiC is a promising and challenging resource for developing NLI systems that go beyond shallow semantics. (3) Human-interpretable knowledge graph types serve as context for both sides of InfCands. This makes InfCands more general and boosts the number of event-like relations in SherLIiC. (4) SherLIiC is complementary to existing collections of inference rules as evidenced by the low recall these resources achieve (cf. Table 3). (5) We evaluate a large number of baselines on SherLIiC. The best-performing baseline makes use of typing. (6) We demonstrate that existing NLI systems do poorly on SherLIiC.

¹ https://github.com/mnschmit/SherLIiC
² http://lemurproject.org/clueweb09

2 Generation of InfCands

This section describes the creation (§ 2.1) and typing (§ 2.2) of the typed event graph SherLIiC-TEG and then the generation of SherLIiC-InfCands (§ 2.3).

2.1 Relation Extraction

For each sentence s in ClueWeb09 that contains at least two entity mentions, we use MaltParser (Nivre et al., 2007) to generate a dependency graph, where nodes are labeled with their lemmas and edges with dependency types. We take all shortest paths between all combinations of two entities in s and represent them by alternating edge and node labels. As we want to focus on relations that express events, we only keep paths with a nominal subject on one end. We also apply heuristics to filter out erroneous parses. See Appendix A for the heuristics and Table 5 for examples of relations.
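The following is a minimal sketch of this extraction step (illustrative only, not the released implementation; the edge-list input format, the use of networkx and all helper names are our assumptions, and the Appendix A heuristics are omitted):

import networkx as nx

def extract_paths(lemmas, edges, entity_positions):
    """Shortest dependency paths between entity pairs, represented by
    alternating edge (dependency) and node (lemma) labels."""
    g = nx.Graph()
    for head, label, child in edges:
        g.add_edge(head, child, label=label)
    paths = []
    for i, e1 in enumerate(entity_positions):
        for e2 in entity_positions[i + 1:]:
            try:
                nodes = nx.shortest_path(g, e1, e2)
            except nx.NetworkXNoPath:
                continue
            repr_ = []
            for a, b in zip(nodes, nodes[1:]):
                repr_.append(g.edges[a, b]["label"])
                if b != nodes[-1]:          # inner nodes contribute lemmas
                    repr_.append(lemmas[b])
            # keep only event-like paths: nominal subject on one end
            if repr_[0].startswith("nsubj") or repr_[-1].startswith("nsubj"):
                paths.append("-".join(repr_))
    return paths

# toy sentence "Bezos runs Amazon": entities at token positions 0 and 2
print(extract_paths(["bezos", "run", "amazon"],
                    [(1, "nsubj", 0), (1, "dobj", 2)], [0, 2]))
# -> ['nsubj-run-dobj']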

Notation. Let ℛ denote the set of extracted relations. A relation R ∈ ℛ is represented as a set of pairs of Freebase entities (its extension): R ⊆ E × E, with E the set of Freebase entities. Let π_1, π_2 be functions that map a pair to its first or second entry, respectively. By abuse of notation, we also apply them to sets of pairs. Finally, let T be the set of Freebase types and τ : E → 2^T the function that maps an entity to the set of its types.

2.2 Typing

We define a typable subrelation of R ∈ ℛ as a subrelation whose entities in each argument slot share at least one type, i.e., an S ⊆ R such that

  ∀i ∈ {1, 2} : ∃t ∈ T : t ∈ ⋂_{e ∈ π_i(S)} τ(e).

We compute the set Type_k²(R) of the (up to) k² largest typable subrelations of R and use them instead of R. First, for each argument slot i of the binary relation R, the k types t^i_j (with 1 ≤ j ≤ k) are computed that occur most often in this slot:

  t^i_j := argmax_t |{ p ∈ R : t ∈ τ^i_j(π_i(p)) }|

with

  τ^i_1(e) = τ(e)
  τ^i_{j+1}(e) = τ^i_j(e) − {t^i_j}.

Then, for each pair

  (s, u) ∈ { (t^1_j, t^2_l) : j, l ∈ {1, ..., k} }

of these types, we construct a subrelation

  R_{s,u} := { (e_1, e_2) ∈ R : s ∈ τ(e_1), u ∈ τ(e_2) }.

If |R_{s,u}| ≥ r_min, R_{s,u} is included in Type_k²(R). In our experiments, we set k = r_min = 5.

The type signature (tsg) of a typed relation T is defined as the pair of sets of types that is common to the first (resp. second) entities in the extension:

  tsg(T) = ( ⋂_{e ∈ π_1(T)} τ(e), ⋂_{e ∈ π_2(T)} τ(e) )

Incomplete type information. Like all large knowledge bases, Freebase suffers from incompleteness: many entities have no type. To avoid losing information about relations associated with such entities, we introduce a special type ⊤ and define the argmax over an empty type set to be ⊤. We define the relations R_{s,⊤}, R_{⊤,u} and R_{⊤,⊤} to have no type restriction on entities in a ⊤ slot. This concerns approximately 17.6% of the relations in SherLIiC-TEG.
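A compact sketch of this typing step under stated assumptions (the dict-based type lookup and helper names are ours; taking the k most frequent distinct types via most_common is equivalent to the iterative τ^i_j construction above):

from collections import Counter

TOP = "⊤"  # special type for entities without Freebase types

def top_k_types(pairs, slot, tau, k):
    """The k types occurring most often in argument slot 0 or 1."""
    counts = Counter(t for p in pairs for t in (tau.get(p[slot]) or [TOP]))
    return [t for t, _ in counts.most_common(k)]

def matches(entity, t, tau):
    return t == TOP or t in tau.get(entity, ())  # a ⊤ slot imposes no restriction

def type_relation(pairs, tau, k=5, r_min=5):
    """Split a relation extension `pairs` into its typed subrelations."""
    typed = {}
    for s in top_k_types(pairs, 0, tau, k):
        for u in top_k_types(pairs, 1, tau, k):
            r_su = {p for p in pairs
                    if matches(p[0], s, tau) and matches(p[1], u, tau)}
            if len(r_su) >= r_min:
                typed[(s, u)] = r_su
    return typed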

2.3 Entailment Discovery

Our discovery procedure is based on Sherlock (Schoenmackers et al., 2010). For the InfCand A ⇒ B (A, B ∈ ℛ), we define the relevance score Relv, a metric expressing Sherlock's statistical relevance criterion P(B | A) ≫ P(B) (cf. Salmon, 1971):

  Relv(A, B) := P(B | A) / P(B) = (|A ∩ B| / |A|) · (|E × E| / |B|)

Our significance score σ(A, B) is a scaled version of the significance test lrs used by Sherlock:

  σ(A, B) := (|A ∩ B| / |A|) · lrs(A, B) = P(B | A) · lrs(A, B)

with lrs(A, B) (likelihood ratio statistic) defined as

  lrs(A, B) = 2 |A| Σ_{H ∈ {B, ¬B}} P(H | A) log(Relv(A, H)).

Additionally, we introduce the entity support ratio:

  esr(A, B) := | ⋃_{i ∈ {1,2}} π_i(A ∩ B) | / (2 |A ∩ B|)

This score measures the diversity of entities in A ∩ B. We found that many InfCands involve a few frequent entities and so obtain high Relv and σ scores even though the relations of the rule are semantically unrelated. esr penalizes such InfCands.

We apply our three scores defined above to all possible pairs of relations (A, B) ∈ ℛ × ℛ and accept a rule iff all of the following criteria are met:

1. ∀i ∈ {1, 2} : π_i(tsg(A ⇒ B)) ≠ ∅
2. |A ∩ B| ≥ r_min
3. ∀i ∈ {1, 2} : |π_i(A ∩ B)| ≥ r_min
4. Relv(A, B) ≥ ϑ_relv
5. σ(A, B) ≥ ϑ_σ
6. esr(A, B) ≥ ϑ_esr

where tsg(A ⇒ B) is the component-wise intersection of tsg(A) and tsg(B), and ϑ_relv = 1000, ϑ_σ = 15, ϑ_esr = 0.6. We found these numbers by randomly sampling InfCands and inspecting their scores. Typing lets us set these thresholds higher, benefitting the quality of SherLIiC-InfCands.

Lastly, we apply Schoenmackers et al. (2010)'s heuristic to only accept the 100 best-scoring premises for each hypothesis. For each hypothesis B, we rank all possible premises A by the product of the three scores and filter out cases where A and B only differ in their types.
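The three scores are straightforward to compute over relation extensions. A sketch under stated assumptions (variable names are ours; we assume |B| < |E × E| so that P(¬B) > 0):

from math import log

def relv(a, b, n_pairs):
    """Relv(A,B) = P(B|A) / P(B), with P(B) = |B| / |E x E|."""
    return (len(a & b) / len(a)) * (n_pairs / len(b))

def sigma(a, b, n_pairs):
    """sigma(A,B) = P(B|A) * lrs(A,B), with Sherlock's likelihood ratio
    statistic lrs(A,B) = 2|A| sum_{H in {B,¬B}} P(H|A) log Relv(A,H)."""
    p = len(a & b) / len(a)       # P(B|A)
    p_b = len(b) / n_pairs        # P(B)
    terms = [(p, p / p_b), (1 - p, (1 - p) / (1 - p_b))]
    lrs = 2 * len(a) * sum(ph * log(r) for ph, r in terms if ph > 0)
    return p * lrs

def esr(a, b):
    """Entity support ratio: diversity of the entities in A ∩ B."""
    inter = a & b
    entities = {e for pair in inter for e in pair}
    return len(entities) / (2 * len(inter)) if inter else 0.0

An InfCand A ⇒ B would then be accepted iff the count criteria hold and relv(...) ≥ 1000, sigma(...) ≥ 15 and esr(...) ≥ 0.6.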

3 Crowdsourced Annotation

SherLIiC-InfCands contains ~960k InfCands. We take a random sample of size 5267 and annotate it using Amazon Mechanical Turk (MTurk).

3.1 Task Formulation

We are asking our workers the same kind of questions as Levy and Dagan (2016) did. We likewise form batches of sentence pairs to reduce annotation cost. Instead of framing the task as judging the appropriateness of answers, however, we state the premise as a fact and ask workers about its entailed consequences, i.e., we ask for each candidate hypothesis whether it is certain that it is also true. Fig. 1 shows the annotation interface schematically.

  Fact: location[B] is annexing location[A].
  Examples for location[B]: Russia / USA / Indonesia
  Examples for location[A]: Cuba / Algeria / Crimea
  [ ] fact incomprehensible
  Please answer the following questions:
  Is it certain that location[B] is taking control of location[A]?   yes / no / incomprehensible
  Is it certain that location[B] is taking location[A]?              yes / no / incomprehensible
  Is it certain that location[B] is bordered by location[A]?         yes / no / incomprehensible

Figure 1: Annotation interface on Amazon MTurk

We use a morphological generator (XTAG Research Group, 2001) and form the present tense of a dependency path's predicate. If a sentence is flagged incomprehensible (e.g., due to a parse error), it is excluded from further evaluation.

We put premise and hypothesis in the present (progressive if suitable) based on the intuition that a pair is only to be considered an entailment if the premise makes it necessary for the hypothesis to be true at the same time. This condition of simultaneity follows the tradition of datasets such as SNLI (Bowman et al., 2015) – in contrast to more lenient evaluation schemes that consider a rule to be correct if the hypothesis is true at any time before or after the reference time of the premise (cf. Lin and Pantel, 2001; Lewis and Steedman, 2013).

3.2 Annotation Quality

We imposed several qualification criteria on crowd workers: number of previous jobs on MTurk, overall acceptance rate and a test that each worker had to pass. Some workers still had frequent low agreement with the majority. However, in most cases we obtained a clear majority annotation. These annotations were then used to automatically detect workers with low trust, where trust is defined as the ratio of submissions agreeing with the majority answer to a worker's total number of submissions. We excluded workers with a trust of less than 0.8 and collected replacement annotations until we had at least five annotations per InfCand.
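The trust filter amounts to a few lines; the following sketch assumes a simple (worker, item, label) data layout, which is our assumption, not the authors' setup:

from collections import Counter, defaultdict

def low_trust_workers(annotations, threshold=0.8):
    """annotations: iterable of (worker_id, item_id, label) triples."""
    by_item = defaultdict(list)
    for worker, item, label in annotations:
        by_item[item].append(label)
    majority = {i: Counter(ls).most_common(1)[0][0]
                for i, ls in by_item.items()}
    agree, total = Counter(), Counter()
    for worker, item, label in annotations:
        total[worker] += 1
        agree[worker] += label == majority[item]
    return {w for w in total if agree[w] / total[w] < threshold}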

  Annotated Subset of SherLIiC-InfCands
  Validated InfCands                      3985
  Balance yes/no                          33% / 67%
  Pairs with unanimous gold label         53.0%
  Pairs with 1 disagreeing annotation     27.4%
  Pairs with 2 disagreeing annotations    19.6%
  Individual label = gold label           86.7%

Table 2: Statistics for crowd-annotated InfCands. The gold label is the majority label.

Table 2 shows that workers agreed unanimously for 53% of the pairs and that the maximal number of two disagreements only occurs for 19.6%. The high number of times an individual label agrees with the gold label suggests that humans can do the task reliably. Interestingly, the number of disagreements is not evenly distributed among the two classes entailment / non-entailment. If the majority agrees on entailment, it is comparatively much more likely that at least one of the workers disagrees (cf. Fig. 2). This suggests that our workers were strict and keen on achieving high precision in their annotations.

Figure 2: Number of disagreements with the majority per class label (no/yes) over all annotations.

4 Baselines

We split our annotated data 25:75 into SherLIiC-dev and SherLIiC-test, stratifying on annotated label (yes/no) and number of disagreements with the majority (0/1/2). If unanimity of annotations marks particularly clear cases, we figure they should be evenly distributed on dev and test.

To establish the state of the art for SherLIiC, we evaluate a number of baselines. Input to these baselines are either the dependency paths or the sentences that were presented to the crowd workers. Baselines that require a threshold to be used as binary classifiers are tuned on SherLIiC-dev.

Lemma baseline. Following Levy and Dagan (2016), this baseline classifies an InfCand as valid if the following holds true for the premise p and hypothesis h after lemmatization: (1) p contains all of h's content words,³ (2) p's and h's predicates are identical, and (3) the relations' active/passive voice matches their arguments' alignment. (A code sketch follows at the end of this section.)

Rule collection baselines. Berant I (Berant et al., 2011) and Berant II (Berant, 2012)⁴ are entailment graphs. PPDB is the largest collection (XXXL) of PPDB 2.0 (Pavlick et al., 2015).⁵ Patty is a collection of relational patterns, consisting of ontological types, POS placeholders and words. We use the version extracted from Wikipedia with Freebase types (Nakashole et al., 2012). Schoenmackers is the rule collection released by Schoenmackers et al. (2010). Chirps is an ever-growing⁶ predicate paraphrase database extracted via event coreference in news Tweets (Shwartz et al., 2017b). All Rules denotes the union of all of these rule bases. For rules with type (or POS) constraints, we ignore these constraints to boost recall. We will see that even with these recall-enhancing measures, the majority of our correct InfCands is not covered by existing rule bases.

³ We use the stop list of nltk (Loper and Bird, 2002).
⁴ We use default threshold 0.
⁵ We ignore stop words and punctuation for phrases.
⁶ We use the version downloaded on May 28, 2019.
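A minimal sketch of the Lemma baseline under stated assumptions (we approximate the voice/alignment check with a precomputed flag and use nltk for lemmatization and stop words; the real baseline operates on the lemmatized dependency paths):

# requires: nltk.download("stopwords"); nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatize = WordNetLemmatizer().lemmatize

def content_lemmas(tokens):
    return {lemmatize(t.lower(), "v") for t in tokens if t.lower() not in STOP}

def lemma_baseline(premise, hypothesis, p_pred, h_pred, aligned=True):
    """premise/hypothesis: token lists; *_pred: main predicate lemma;
    aligned: placeholder for the voice/argument-alignment check."""
    return (content_lemmas(hypothesis) <= content_lemmas(premise)
            and p_pred == h_pred and aligned)

print(lemma_baseline(["A", "is", "giving", "to", "B"],
                     ["A", "gives", "B"], "give", "give"))  # True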

Word2vec baselines. word2vec is based on Mikolov et al. (2013b)'s pre-trained word embeddings of size 300. We average them to obtain a vector representation of relations consisting of multiple words and use cosine similarity to judge a relation pair. typed rel emb (resp. untyped rel emb) is obtained by training word2vec skip-gram (Mikolov et al., 2013a) with vector size 300 and otherwise default parameters on a synthetic corpus representing the extensions of typed (resp. untyped) relations. The corpus is obtained by writing out one entity-relation-entity triple per line, where each entity is prefixed with the argument slot it belongs to (a sketch of this corpus construction is given at the end of this section). w2v+typed rel (resp. w2v+untyped rel) produces its score by summing the scores of word2vec and typed rel emb (resp. untyped rel emb).

Some type signatures (tsgs) benefit more from type-informed methods than others. For example, the correct inference [INFLUENCER is explaining in WRITTEN WORK ⇒ INFLUENCER is writing in WRITTEN WORK] is detected by w2v+typed rel, but not by w2v+untyped rel. We therefore combine these two methods by using, for each tsg, the method that works better for that tsg on dev. (For tsgs not occurring in dev, we take the method that works better for the individual types occurring in the tsg. We use untyped embeddings if all else fails.) We refer to this combination as w2v+tsg rel emb.

Knowledge graph embedding baselines. As SherLIiC-TEG has the structure of a knowledge graph (KG), we also evaluate the two KG embedding methods TransE (Bordes et al., 2013) and ComplEx (Trouillon et al., 2016), as provided by the OpenKE framework (Han et al., 2018).

Asymmetric baselines. Entailment models built upon cosine similarity are symmetric whereas entailment is not. Therefore many asymmetric measures based on the distributional inclusion hypothesis have been proposed (Kotlerman et al., 2010; Santus et al., 2014; Shwartz et al., 2017a; Roller et al., 2018). We consider WeedsPrec (Weeds et al., 2004) and invCL (Lenci and Benotto, 2012), which have strong empirical results on hypernym detection. We use cooccurrence counts with entity pairs as distributional representation of a relation.

Supervised NLI models. As LIiC is a special case of NLI, our dataset can also be used to evaluate the generalization capabilities of supervised models trained on large NLI datasets. We pick ESIM (Chen et al., 2017), a state-of-the-art supervised NLI model, trained on MultiNLI (Williams et al., 2018) as provided by the framework Jack the Reader (Weissenborn et al., 2018). Input to ESIM are the sentences from the annotation process with placeholders instantiated by entities randomly picked from the example lists that had also been shown to the crowd workers (cf. Fig. 1). As we want to measure ESIM's capacity to detect entailment, we map its prediction of both neutral and contradiction to our non-entailment class.

Sherlock+ESR. We also evaluate the candidate scoring method inspired by Schoenmackers et al. (2010) that created the data in the first place. We again combine the three scores described in § 2 by multiplication. The low performance of Sherlock+ESR (cf. Table 3) is evidence that the dataset is not strongly biased in its favor and thus is promising as a general evaluation benchmark.
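The synthetic corpus for the relation embeddings can be sketched as follows (illustrative; the in-memory corpus, the slot-prefix format and the relation-id naming are our assumptions, consistent with the description above):

from gensim.models import Word2Vec

def triples_to_sentences(typed_relations):
    """typed_relations: dict mapping a relation id to its extension."""
    for rel_id, pairs in typed_relations.items():
        for e1, e2 in pairs:
            yield [f"arg1:{e1}", rel_id, f"arg2:{e2}"]

corpus = list(triples_to_sentences({
    "nsubj-run-dobj@PERSON,COMPANY": {("Bezos", "Amazon")},
}))
# skip-gram, vector size 300, otherwise default parameters
model = Word2Vec(corpus, vector_size=300, sg=1, min_count=1)
relation_vector = model.wv["nsubj-run-dobj@PERSON,COMPANY"]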
5 Experimental Results and Discussion

Quantitative observations. Table 3 summarizes the performance of our baselines on predicting the entailment class for SherLIiC-dev and -test.

Rule collections (lines 1–6) have recall between 0.119 and 0.308; the recall of their union (line 7) is only 0.483 on dev and 0.493 on test, showing that we indeed found new valid inferences missing from existing rule bases.

The state-of-the-art neural model ESIM does not generalize well from MultiNLI (its training set) to LIiC. In fact, it hardly improves on the baseline that always predicts entailment (Always yes). Our dataset was specifically designed to only contain good InfCands based on distributional features. So it poses a challenge to models that cannot make the fine semantic distinctions necessary for LIiC.

Turning to vector space models (lines 11–24), dense relation representations (lines 12, 13) predict entailment better than sparse models (lines 17–20) although they cannot use asymmetric measures.

KG embeddings (lines 21–24) do not seem at all appropriate for measuring the similarity of relations. First, their performance is very close to Always yes. Second, their F1-optimal thresholds are very low – even negative. This suggests that their relation vectors do not contain any helpful information for the task. These methods were not developed to compare relations; the lack of useful information is still surprising and thus is a promising direction for future work on KG embeddings.

                                 dev                    test
   Baseline               θ*     P     R     F1     P     R     F1
 1 Berant I               –    0.699 0.154 0.252  0.762 0.126 0.216
 2 Berant II              –    0.800 0.181 0.296  0.774 0.186 0.300
 3 PPDB                   –    0.631 0.211 0.317  0.621 0.240 0.347
 4 Patty                  –    0.795 0.187 0.303  0.779 0.153 0.256
 5 Schoenmackers          –    0.780 0.139 0.236  0.849 0.119 0.208
 6 Chirps                 –    0.370 0.308 0.336  0.341 0.295 0.316
 7 All Rules              –    0.418 0.483 0.448  0.404 0.493 0.444
 8 Lemma                  –    0.900 0.109 0.194  0.907 0.089 0.161
 9 Always yes             –    0.332 1.000 0.499  0.333 1.000 0.499
10 ESIM                   –    0.391 0.831 0.532  0.390 0.833 0.531
11 word2vec             0.321  0.556 0.625 0.589  0.520 0.606 0.559
12 typed rel emb        0.864  0.561 0.568 0.565  0.532 0.486 0.508
13 untyped rel emb      0.613  0.511 0.740 0.605  0.499 0.672 0.572
14 w2v+typed rel        1.106  0.549 0.710 0.619  0.523 0.688 0.594
15 w2v+untyped rel      0.884  0.565 0.740 0.641  0.528 0.695 0.600
16 w2v+tsg rel emb      0.884  0.566 0.776 0.655  0.518 0.727 0.605
17 WeedsPrec (typed)    0.073  0.335 0.994 0.501  0.333 0.988 0.498
18 WeedsPrec (untyped)  0.057  0.403 0.807 0.538  0.386 0.783 0.517
19 invCL (typed)        0.000  0.332 1.000 0.499  0.333 1.000 0.499
20 invCL (untyped)      0.148  0.362 0.876 0.512  0.357 0.863 0.505
21 TransE (typed)      −0.922  0.336 1.000 0.503  0.333 0.991 0.498
22 TransE (untyped)    −0.476  0.340 0.964 0.503  0.332 0.942 0.491
23 ComplEx (typed)     −0.033  0.339 0.955 0.500  0.337 0.949 0.497
24 ComplEx (untyped)   −0.030  0.340 0.952 0.501  0.334 0.939 0.493
25 Sherlock+ESR      9.460·10⁵ 0.504 0.592 0.544  0.491 0.526 0.508

Table 3: Precision, recall and F1 score on SherLIiC-dev and -test. All baselines run on top of Lemma. Thresholds (θ*) are F1-optimized on dev. Best result per column is set in bold.

General purpose dense representations (word2vec, line 11) perform comparatively well, showing that, in principle, they cover the information necessary for LIiC. Embeddings trained on our relation extensions SherLIiC-TEG (line 13), however, can alone already achieve better performance than word2vec embeddings.

In general, type-informed relation embeddings seem to have a disadvantage compared to unrestricted ones (e.g., cf. lines 12 and 13) – presumably because type-informed baselines have training sets that are smaller (due to filtering) and sparser (since relations are split up according to type signatures). The combination of general word2vec and specialized relation embeddings (lines 14–16), however, consistently brings gains. This indicates that distributional word properties are complementary to the relation extensions our method extracts. So using both sources of information is promising for future research on modeling relational semantics.

w2v+tsg rel emb is the best-performing method. It combines typed and untyped relation embeddings as well as general-purpose word2vec embeddings. Even though one cannot rely on typed extensions only, this shows that incorporating type information is beneficial for good performance.

We use w2v+tsg rel emb to provide a noisy annotation for SherLIiC-InfCands. This is a useful resource because learning from noisy labels has been well studied (Frénay and Verleysen, 2014; Hendrycks et al., 2018) and is often beneficial.

Qualitative observations. Although SherLIiC's creation is based on the same method that was used to create Schoenmackers, SherLIiC is fundamentally different for several reasons: (1) The rule sets are complementary (cf. the low recall of 0.139 and 0.119 in Table 3). (2) The majority of rules in Schoenmackers has more than one premise, leaving only ~13k InfCands in Schoenmackers compared to ~960k in SherLIiC-InfCands that fit the format of NLI. (3) Schoenmackers is filtered more aggressively with the goal of maximizing the number of correct rules. This, however, makes it inadequate as a challenging benchmark because the performance of Always yes would be close to 100%. (4) SherLIiC is focused on events. When linking the relations from SherLIiC-TEG back to their surface forms in the corpus, 80% of them occur at least once in the progressive, which suggests that the large majority of our relations indeed represent events.

Taking a closer look at SherLIiC, we see that the data require a large variety of lexical knowledge even though their creation has been entirely automatic. Table 1 shows five positively labeled examples from SherLIiC-dev, each highlighting a different challenge for statistical models that is crucial for NLI. (1) is an instance of troponymy: "granting" is a manner or kind of "giving". This is the verbal equivalent to nominal hyponymy. (2) combines synonymy ("support" ⇔ "back") with morphological derivation. (3) can only be classified correctly if one knows that it is one of the typical actions of a president to represent their country. (4) requires knowledge about the typical course of events when interviewing someone. A typical interview involves asking questions. (5) can only be detected with common sense knowledge that goes even beyond that: you generally only claim something if you want it.

An error analysis of the three best-performing baselines (lines 14–16 in Table 3) reveals that none of them was able to detect the five correct InfCands from Table 1. Explicit modeling of one of the phenomena described above seems a promising direction for future research to improve recall. Table 4 shows five cases where InfCands were incorrectly labeled as entailment. (1) shows the importance of modeling directionality: every "dictator" is a "ruler" but not vice versa. (2) shows a well-known problem in representation learning from cooccurrence: antonyms tend to be close in the embedding space (Mohammad et al., 2008; Mrkšić et al., 2016). The other examples show other types of correlation that models relying entirely on distributional information will fall for: the outcome of events like "coming into a country" or "seeking something from someone" is in general uncertain although possible outcomes like "remaining in said country" (3) or "being given the object of desire" (4) will be highly correlated with them. Finally, better models will also take into account the simultaneity constraint: "winning a war" and "declaring a war" (5) rarely happen at the same time.

(1) PERSON[A] is REGION[B]'s ruler ⇒ PERSON[A] is dictator of REGION[B]
(2) LOCATION[A] is fighting with ORGF[B] ⇒ LOCATION[A] is allied with ORGF[B]
(3) ORGF[A] is coming into LOCATION[B] ⇒ ORGF[A] is remaining in LOCATION[B]
(4) ORGF[A] is seeking from ORGF[B] ⇒ ORGF[B] is giving to ORGF[A]
(5) LOCATION[A] is winning war against LOCATION[B] ⇒ LOCATION[A] is declaring war on LOCATION[B]

Table 4: False positives for each of the three best-performing baselines taken from SherLIiC-dev. ORGF=organization founder.
6 Meta Rules and Implicative Verbs

In addition to the annotated data, we also make available all ~960k SherLIiC-InfCands found by our unsupervised algorithm. SherLIiC-InfCands's distribution is similar enough to our labeled dataset to be useful for domain adaptation, representation learning and other techniques when working on LIiC. It can also be investigated on its own in a purely unsupervised fashion as we will show now.

We can find easily interpretable patterns by looking for cases where premise and hypothesis of an InfCand have common parts. By masking these parts (X), we can abstract away from concrete instances and interesting meta rules emerge (Table 5; a sketch of this masking procedure is given at the end of this section). The most common patterns represent reasonable equivalent formulations, e.g., active/passive voice or "be Y's X ⇔ be X of Y" (the in-variant coming from a lot of location-typed rule instances). The fifth most frequent pattern could still be formulated in an even more abstract way but shows already that the general principle of a conjunction Y ∧ X implying one of its components X can be learned.

If we search for meta rules whose X is part of a lemma (rather than a longer dependency path), we discover cases of derivational morphology such as agent nouns (e.g., ruler, leader) and sense preserving verb prefixes (e.g., re-write, over-react).

Finally, we observe several implicative verbs (verbs that entail their complement clauses) in their typical pattern V to X ⇒ X. A lot of these verbs are not traditional implicatives, but are called de facto implicatives by Pavlick and Callison-Burch (2016) – who argue for the importance of data-driven approaches to detecting de facto implicatives. The meta rule discovery method just described is such a data-driven approach.

Most frequent meta rules:
  nsubj–X–prep–of–obj ⇔ nsubj–X–poss          A is an ally of B ⇔ A is B's ally
  nsubj–X–prep–in–obj ⇔ nsubj–X–poss          A is the capital in B ⇔ A is B's capital
  nsubjpass–X–prep–by–obj ⇔ obj–X–nsubj       A is followed by B ⇔ B follows A
  nsubj–one–prep–of–obj–X–obj ⇔ nsubj–X–obj   A is one of the countries in B ⇔ A is a country in B
  nsubj–capital–conj–X–obj ⇒ nsubj–X–obj      A is the capital and biggest city in B ⇒ A is a city in B

Character level meta rules:
  nsubj–Xer–prep–of–obj ⇔ nsubj–X–obj         A is a teacher of B ⇔ A teaches B
  nsubj–co-Xer–prep–of–obj ⇒ nsubj–X–obj      A is a co-founder of B ⇒ A founds B
  nsubj–reX–obj ⇒ nsubj–X–obj                 A rewrites B ⇒ A writes B
  nsubj–overX–obj ⇒ nsubj–X–obj               A overtakes B ⇒ A takes B

Implicative verb meta rules:
  nsubj–agree–xcomp–X–obj ⇒ nsubj–X–obj       A agrees to buy B ⇒ A buys B
  nsubjpass–force–xcomp–X–obj ⇒ nsubj–X–obj   A is forced to leave B ⇒ A leaves B
  nsubjpass–elect–xcomp–X–obj ⇔ nsubj–X–obj   A is elected to be governor of B ⇔ A is governor of B
  nsubj–go–xcomp–X–obj ⇒ nsubj–X–obj          A is going to beat B ⇒ A beats B
  nsubj–try–xcomp–X–obj ⇒ nsubj–X–obj         A tries to compete with B ⇒ A competes with B
  nsubj–decide–xcomp–X–obj ⇒ nsubj–X–obj      A decides to move to B ⇒ A moves to B
  nsubjpass–expect–xcomp–X–obj ⇒ nsubj–X–obj  A is expected to visit B ⇒ A visits B

Table 5: Most frequent meta rules (top), character level meta rules (middle), and implicative verb meta rules (bottom). Bold (here: X): words corresponding to X.
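A sketch of the masking procedure under stated assumptions (the "-"-separated path format and the crude stop set of dependency labels and function words are ours):

from collections import Counter

def mask_common(premise, hypothesis):
    """Replace lemmas shared by both paths with X; None if nothing shared."""
    p, h = premise.split("-"), hypothesis.split("-")
    labels = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj", "obj", "poss",
              "prep", "xcomp", "conj", "of", "in", "by", "to", "be", "a"}
    shared = (set(p) & set(h)) - labels
    if not shared:
        return None
    def mask(toks):
        return "-".join("X" if t in shared else t for t in toks)
    return mask(p), mask(h)

def meta_rules(infcands):
    """Count masked (premise, hypothesis) patterns over many InfCands."""
    return Counter(filter(None, (mask_common(p, h) for p, h in infcands)))

print(meta_rules([("nsubj-ally-prep-of-obj", "nsubj-ally-poss"),
                  ("nsubj-friend-prep-of-obj", "nsubj-friend-poss")]))
# -> Counter({('nsubj-X-prep-of-obj', 'nsubj-X-poss'): 2})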
SherLIiC-InfCands’s problem in representation learning from cooccur- distribution is similar enough to our labeled dataset to be useful for domain adaptation, representation (co-)hyponym, hypernym or antonym to build a learning and other techniques when working on test set that requires NLI systems to use lexical LIiC. It can also be investigated on its own in a knowledge. They rely on WordNet’s lexical taxon- purely unsupervised fashion as we will show now. omy. This, however, is difficult for verbs because We can find easily interpretable patterns by look- their semantics depends more on context. Finally, ing for cases where premise and hypothesis of an Glockner et al.(2018)’s dataset has a strong bias InfCand have common parts. By masking these for contradiction whereas our dataset is specifically parts (X), we can abstract away from concrete in- designed to contain cases of entailment. stances and interesting meta rules emerge (Table 5). Our work is more closely related to the dataset The most common patterns represent reasonable by Levy and Dagan(2016), who frame relation en- equivalent formulations, e.g., active/passive voice tailment as the task of judging the appropriateness or “be Y ’s X ⇔ be X of Y ” (the in-variant coming of candidate answers. Their hypothesis is that an from a lot of location-typed rule instances). The answer is only appropriate if it entails the predi- fifth most frequent pattern could still be formulated cate of the question. This is often but by no means in an even more abstract way but shows already always true; certain questions imply additional in- that the general principle of a conjunction Y ∧ X formation. Consider: “Which country annexed implying one of its components X can be learned. country[B]?” The answer candidate “country[A] If we search for meta rules whose X is part of a administers country[B]” might be considered valid, lemma (rather than a longer dependency path), we given that it is unlikely that one country annexes B discover cases of derivational such as and another country administers it. The inference agent nouns (e.g., ruler, leader) and sense preserv- administer ⇒ annex, however, does not hold. Be- ing verb prefixes (e.g., re-write, over-react). cause of these difficulties, we follow the more tra- Finally, we observe several implicative verbs ditional approach (Zeichner et al., 2012) of asking (verbs that entail their complement clauses) in their about consequences of a given fact (the premise). typical pattern V to X ⇒ X. A lot of these verbs Relation extraction. Some works (Schoenmack- are not traditional implicatives, but are called de ers et al., 2010; Berant, 2012; Zeichner et al., 2012) facto implicatives by Pavlick and Callison-Burch rely on the output from open information extrac- (2016) – who argue for the importance of data- tion systems (Banko et al., 2007; Fader et al., 2011). driven approaches to detecting de facto implica- A more flexible approach is to represent relations tives. The meta rule discovery method just de- as lexicalized paths in dependency graphs (Lin scribed is such a data-driven approach. and Pantel, 2001; Szpektor et al., 2004), some- times with semantic postprocessing (Shwartz et al., 7 Related Work 2017b) and/or retransforming into textual patterns NLI challenge datasets. A lot of work exists that (Nakashole et al., 2012). We, too, choose the latter. aims at uncovering weaknesses in state-of-the-art Relation typing. Typing relations has become NLI models. 
Relation typing. Typing relations has become standard in inference mining because of its usefulness for sense disambiguation (Schoenmackers et al., 2010; Nakashole et al., 2012; Yao et al., 2012; Lewis and Steedman, 2013). Still, some resources only provide types for one argument slot of their binary relations (Levy and Dagan, 2016) or no types at all (Zeichner et al., 2012; Berant, 2012; Shwartz et al., 2017b). Our InfCands are typed in both argument slots, which both facilitates disambiguation and makes them more general.

Some works (Yao et al., 2012; Lewis and Steedman, 2013) learn distributions over latent type signatures for their relations via topic modeling. A large disadvantage of latent types is their lack of intuitive interpretability. By design, our KG types are meaningful and human-interpretable.

Schoenmackers et al. (2010) type common nouns based on cooccurrence with class nouns identified by Hearst patterns (Hearst, 1992) and later try to filter out unreasonable typings by using frequency thresholds and PMI. As KG entities are manually labeled with their correct types, we do not need this kind of heuristics. Furthermore, in contrast to this ad-hoc type system, KG types are the result of a KG design process. Notably, Freebase types function as interfaces, i.e., they permit type-specific properties to be added, and are thus inherently motivated by relations between entities.

Lexical ontologies, such as WordNet (as used by Levy and Dagan, 2016), likewise lack this connection between relations and types. Moreover, relations between real-world entities are more often events than relations between common nouns. Thus, in contrast to existing resources that do not restrict relations to KG entities, SherLIiC contains more event-like relations.

Nakashole et al. (2012) also use KG types as context for their textual patterns. They simply create a new relation for each possible type combination, for each entity occurring with a pattern and each possible type of this entity. It is unclear how the combinatorial explosion and the resulting sparsity affect pattern quality. Our approach of successively splitting a typewise heterogeneous relation into its k largest homogeneous subrelations aims at finding only the most typical types for an action, and our definition of the type signature as the intersection of all common types avoids unnecessary redundancy.
allowed us to harness more event-like relations than Nakashole et al.(2012) also use KG types as con- previous similar collections as these naturally occur text for their textual patterns. They simply create more often with NEs. The distributional similar- a new relation for each possible type combination ity of both positive and negative examples makes for each entity occurring with a pattern and each SherLIiC a promising benchmark to track future possible type of this entity. It is unclear how the NLI models’ ability to go beyond shallow seman- combinatorial explosion and the resulting sparsity tics relying primarily on distributional evidence. affects pattern quality. Our approach of succes- We showed that existing rule bases are complemen- sively splitting a typewise heterogenous relation tary to SherLIiC and that current semantic vector into its k largest homogenous subrelations aims at space models as well as SOTA neural NLI mod- finding only the most typical types for an action and els cannot achieve at the same time high precision our definition of type signature as intersection of and high recall on SherLIiC. Although SherLIiC’s all common types avoids unnecessary redundancy. creation is entirely data-driven, it shows a large Entailment candidate collection. Distributional variety of linguistic challenges for NLI, ranging features are a common choice for paraphrase detec- from lexical relations like troponymy, synonymy or tion and relation clustering (Lin and Pantel, 2001; morph. derivation to typical actions and common Szpektor et al., 2004; Sekine, 2005; Yao et al., sense knowledge (cf. Table 1). The large unlabeled 2012; Lewis and Steedman, 2013). resources, SherLIiC-InfCands and SherLIiC-TEG, are potentially useful for further linguistic analysis The two most important alternatives are bilin- (as we showed in § 6), as well as for data-driven gual pivoting (Ganitkevitch et al., 2013) – which models of , e.g., techniques such identifies identically translated phrases in bilingual as representation learning and domain adaptation. corpora – and event coreference in the news (Xu We hope that SherLIiC will foster better modeling et al., 2014; Zhang et al., 2015; Shwartz et al., of lexical inference in context as well as progress 2017b) – which relies on lexical variability in two in NLI in general. articles or headlines referring to the same event. We specifically focus on distributional information Acknowledgments for our InfCand collection because current models of lexical semantics are also mainly based on that We gratefully acknowledge a Ph.D. scholarship (e.g., Grave et al., 2017). Our goal is not to build awarded to the first author by the German Aca- a resource free of typical mistakes made by distri- demic Scholarship Foundation (Studienstiftung des butional approaches but to provide a benchmark to deutschen Volkes). This work was supported by the study the progress on overcoming them (cf.§5). BMBF as part of the project MLWin (01IS18050). References Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information ex- Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, traction. In Proceedings of the 2011 Conference on Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. Empirical Methods in Natural Language Processing, 2018. Generating natural language adversarial ex- pages 1535–1545, Edinburgh, Scotland, UK. Associ- amples. In Proceedings of the 2018 Conference on ation for Computational . 
Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545, Edinburgh, Scotland, UK.

Christiane Fellbaum. 2005. WordNet and wordnets. In Encyclopedia of Language and Linguistics, second edition, pages 665–670. Elsevier, Oxford.

Benoît Frénay and Michel Verleysen. 2014. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, Version 1 (Release date 2013-06-26, Format version 1, Correction level 0).

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of tricks for efficient text classification. In EACL (2), pages 427–431.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.

Xu Han, Shulin Cao, Lv Xin, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. 2018. OpenKE: An open toolkit for knowledge embedding. In Proceedings of EMNLP.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING '92, pages 539–545.

Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems 31, pages 10477–10486.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

John Lalor, Hao Wu, and Hong Yu. 2016. Building an evaluation scale using item response theory. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 648–657, Austin, Texas.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, pages 75–79, Montréal, Canada.

Omer Levy and Ido Dagan. 2016. Annotating relation inference in context via question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Berlin, Germany.
Mike Lewis and Mark Steedman. 2013. Combining distributional and logical semantics. Transactions of the Association for Computational Linguistics, 1:179–192.

Dekang Lin and Patrick Pantel. 2001. DIRT: Discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 323–328, New York, NY, USA.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Saif Mohammad, Bonnie Dorr, and Graeme Hirst. 2008. Computing word-pair antonymy. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 982–991, Honolulu, Hawaii.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148, San Diego, California.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1135–1145, Jeju Island, Korea.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.

Ellie Pavlick and Chris Callison-Burch. 2016. Tense manages to predict implicative behavior in verbs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2225–2229, Austin, Texas.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430, Beijing, China.

Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 358–363, Melbourne, Australia.

Wesley C. Salmon. 1971. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press.
Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 38–42, Gothenburg, Sweden.

Stefan Schoenmackers, Jesse Davis, Oren Etzioni, and Daniel Weld. 2010. Learning first-order horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1088–1098.

Satoshi Sekine. 2005. Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017a. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 65–75, Valencia, Spain.

Vered Shwartz, Gabriel Stanovsky, and Ido Dagan. 2017b. Acquiring predicate paraphrases from news tweets. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 155–160, Vancouver, Canada.

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of EMNLP 2004, pages 41–48, Barcelona, Spain.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2071–2080.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of COLING 2004, pages 1015–1021, Geneva, Switzerland.

Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, and Sebastian Riedel. 2018. Jack the Reader – A machine reading framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

XTAG Research Group. 2001. A lexicalized tree adjoining grammar for English. Technical Report IRCS-01-03, IRCS, University of Pennsylvania.

Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan, and Yangfeng Ji. 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2:435–448.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 712–720.

Naomi Zeichner, Jonathan Berant, and Ido Dagan. 2012. Crowdsourcing inference-rule evaluation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 156–160, Jeju Island, Korea.

Congle Zhang, Stephen Soderland, and Daniel S. Weld. 2015. Exploiting parallel news streams for unsupervised event extraction. Transactions of the Association for Computational Linguistics, 3:117–129.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In International Conference on Learning Representations.

A Relation Filter Heuristics

In order to be kept as a relation, a dependency path must fulfill all of the following criteria (a sketch implementing these checks follows the list):

1. It starts or ends with nsubj or nsubjpass.

2. It starts or ends with one of the following labels: nsubj, nsubjpass, iobj, dobj, pobj, appos, poss, rcmod, infmod, partmod.

3. It is not longer than 7 words and 8 dependency labels.

4. At least one of the presumable lemmas contains at least 3 letters.

5. It does not have the same dependency label at both ends.

6. It does not contain any of the following labels: parataxis, pcomp, csubj, advcl, ccomp.

7. It does not contain immediate repetitions of words or dependency labels.
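A sketch of this filter under stated assumptions (we take a path as the alternating label/lemma list from § 2.1 with labels at even positions, and read criterion 7 as "no adjacent repetitions within the label sequence or within the lemma sequence"):

ARG_LABELS = {"nsubj", "nsubjpass", "iobj", "dobj", "pobj",
              "appos", "poss", "rcmod", "infmod", "partmod"}
FORBIDDEN = {"parataxis", "pcomp", "csubj", "advcl", "ccomp"}

def keep_path(path):
    labels, lemmas = path[::2], path[1::2]
    def no_repeats(seq):
        return all(a != b for a, b in zip(seq, seq[1:]))
    return (
        (labels[0].startswith("nsubj") or labels[-1].startswith("nsubj"))  # 1
        and (labels[0] in ARG_LABELS or labels[-1] in ARG_LABELS)          # 2
        and len(lemmas) <= 7 and len(labels) <= 8                          # 3
        and any(len(lem) >= 3 for lem in lemmas)                           # 4
        and labels[0] != labels[-1]                                        # 5
        and not FORBIDDEN & set(labels)                                    # 6
        and no_repeats(labels) and no_repeats(lemmas)                      # 7
    )

print(keep_path(["nsubj", "run", "dobj"]))  # True, e.g., for "A runs B"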