
Combined Distributional and Logical Semantics

Mike Lewis
School of Informatics
University of Edinburgh
Edinburgh, EH8 9AB, UK
[email protected]

Mark Steedman
School of Informatics
University of Edinburgh
Edinburgh, EH8 9AB, UK
[email protected]

Transactions of the Association for Computational Linguistics, 1 (2013) 179–192. Action Editor: Johan Bos. Submitted 1/2013; Revised 3/2013; Published 5/2013. © 2013 Association for Computational Linguistics.

Abstract

We introduce a new approach to semantics which combines the benefits of distributional and formal logical semantics. Distributional models have been successful in modelling the meanings of content words, but logical semantics is necessary to adequately represent many function words. We follow formal semantics in mapping language to logical representations, but differ in that the relational constants used are induced by offline distributional clustering at the level of predicate-argument structure. Our clustering algorithm is highly scalable, allowing us to run on corpora the size of Gigaword. Different senses of a word are disambiguated based on their induced types. We outperform a variety of existing approaches on a wide-coverage question answering task, and demonstrate the ability to make complex multi-sentence inferences involving quantifiers on the FraCaS suite.

1 Introduction

Mapping natural language to meaning representations is a central challenge of NLP. There has been much recent progress in unsupervised distributional semantics, in which the meaning of a word is induced based on its usage in large corpora. This approach is useful for a range of key applications including question answering and relation extraction (Lin and Pantel, 2001; Poon and Domingos, 2009; Yao et al., 2011). Because such a semantics can be automatically induced, it escapes the limitation of depending on relations from hand-built training data, knowledge bases or ontologies, which have proved of limited use in capturing the huge variety of meanings that can be expressed in language.

However, it has largely developed in isolation from the formal semantics literature. Whilst distributional semantics has been effective in modelling the meanings of content words such as nouns and verbs, it is less clear that it can be applied to the meanings of function words. Semantic operators, such as determiners, negation, conjunctions, modals, tense, mood, aspect, and plurals are ubiquitous in natural language, and are crucial for high performance on many practical applications—but current distributional models struggle to capture even simple examples. Conversely, computational models of formal semantics have shown low recall on practical applications, owing to their reliance on ontologies such as WordNet (Miller, 1995) to model the meanings of content words (Bobrow et al., 2007; Bos and Markert, 2005).

For example, consider what is needed to answer a question like Did Google buy YouTube? from the following sentences:

1. Google purchased YouTube
2. Google's acquisition of YouTube
3. Google acquired every company
4. YouTube may be sold to Google
5. Google will buy YouTube or Microsoft
6. Google didn't take over YouTube

All of these require knowledge of lexical semantics (e.g. that buy and purchase are synonyms), but some also need interpretation of quantifiers, negatives, modals and disjunction. It seems unlikely that distributional or formal approaches can accomplish the task alone.



We propose a method for mapping natural language to first-order logic representations capable of capturing the meanings of function words such as every, not and or, but which also uses distributional statistics to model the meaning of content words. Our approach differs from standard formal semantics in that the non-logical symbols used in the logical form are cluster identifiers. Where standard semantic formalisms would map the verb write to a write′ symbol, we map it to a cluster identifier such as relation37, which the synonym author may also map to. This mapping is learnt by offline clustering.

Unlike previous distributional approaches, we perform clustering at the level of predicate-argument structure, rather than syntactic dependency structure. This means that we abstract away from many syntactic differences that are not present in the semantics, such as conjunctions, passives, relative clauses, and long-range dependencies. This significantly reduces sparsity, so we have fewer predicates to cluster and more observations for each.

Of course, many practical inferences rely heavily on background knowledge about the world—such knowledge falls outside the scope of this work.

  Every : NP↑/N : λpλq.∀x[p(x) ⇒ q(x)]
  dog : N : λx.dog′(x)
  barks : S\NP : λx.bark′(x)
  Every dog : NP↑ : λq.∀x[dog′(x) ⇒ q(x)]          (>)
  Every dog barks : S : ∀x[dog′(x) ⇒ bark′(x)]     (>)

Figure 1: A standard derivation using CCG. The NP↑ notation means that the noun phrase is type-raised, taking the verb-phrase as an argument—so is an abbreviation of S/(S\NP). This is necessary in part to support a correct semantics for quantifiers.

  Input sentence:              Shakespeare wrote Macbeth
  Initial semantic analysis:   write_{arg0,arg1}(shakespeare, macbeth)
  Entity typing:               write_{arg0:PER,arg1:BOOK}(shakespeare:PER, macbeth:BOOK)
  Distributional semantics:    relation37(shakespeare:PER, macbeth:BOOK)

Figure 2: Layers used in our model.

2 Background

Our approach is based on Combinatory Categorial Grammar (CCG; Steedman, 2000), a strongly lexicalised theory of language in which lexical entries for words contain all language-specific information. The lexical entry for each word contains a syntactic category, which determines which categories the word may combine with, and a semantic interpretation, which defines the compositional semantics. For example, the lexicon may contain the entry:

  write ⊢ (S\NP)/NP : λyλx.write′(x,y)

Crucially, there is a transparent interface between the syntactic category and the semantics. For example the entry above defines the transitive verb syntactically as a function mapping two noun-phrases to a sentence, and semantically as a binary relation between its two argument entities. This means that it is relatively straightforward to deterministically map parser output to a logical form, as in the Boxer system (Bos, 2008). This form of semantics captures the underlying predicate-argument structure, but fails to license many important inferences—as, for example, write and author do not map to the same predicate.

In addition to the lexicon, there is a small set of binary combinators and unary rules, which have a syntactic and semantic interpretation. Figure 1 gives an example CCG derivation.

3 Overview of Approach

We attempt to learn a CCG lexicon which maps equivalent words onto the same logical form—for example learning entries such as:

  author ⊢ N/PP[of] : λxλy.relation37(x,y)
  write ⊢ (S\NP)/NP : λxλy.relation37(x,y)

The only change to the standard CCG derivation is that the symbols used in the logical form are arbitrary relation identifiers. We learn these by first mapping to a deterministic logical form (using predicates such as author′ and write′), using a process similar to Boxer, and then clustering predicates based on their arguments. This lexicon can then be used to parse new sentences, and integrates seamlessly with CCG theories of formal semantics.
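To make the target representation concrete, the following is a minimal sketch (illustrative only, not the system's actual data structures) of a lexicon in which equivalent words share a relation identifier:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalEntry:
    """A CCG lexical entry: a word, its syntactic category, and the
    relation symbol (cluster identifier) used in its logical form."""
    word: str
    category: str   # e.g. "(S\\NP)/NP" or "N/PP[of]"
    relation: str   # cluster identifier standing in for the lambda-term body

# After clustering, write and author map to the same relation:
lexicon = [
    LexicalEntry("write",  "(S\\NP)/NP", "relation37"),
    LexicalEntry("author", "N/PP[of]",   "relation37"),
]
```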


Typing predicates—for example, determining that writing is a relation between people and books—has become standard in relation clustering (Schoenmackers et al., 2010; Berant et al., 2011; Yao et al., 2012). We demonstrate how to build a typing model into the CCG derivation, by subcategorizing all terms representing entities in the logical form with a more detailed type. These types are also induced from text, as explained in Section 5, but for convenience we describe them with human-readable labels, such as PER, LOC and BOOK.

A key advantage of typing is that it allows us to model ambiguous predicates. Following Berant et al. (2011), we assume that different type signatures of the same predicate have different meanings, but given a type signature a predicate is unambiguous. For example a different lexical entry for the verb born is used in the contexts Obama was born in Hawaii and Obama was born in 1961, reflecting a distinction in the semantics that is not obvious in the syntax¹. Typing also greatly improves the efficiency of clustering, as we only need to compare predicates with the same type during clustering (for example, we do not have to consider clustering a predicate between people and places with predicates between people and dates).

In this work, we focus on inducing binary relations. Many existing approaches have shown how to produce good clusterings of (non-event) nouns (Brown et al., 1992), any of which could be simply integrated into our semantics—but relation clustering remains an open problem (see Section 9).

N-ary relations are binarized, by creating a binary relation between each pair of arguments. For example, for the sentence Russia sold Alaska to the United States, the system creates three binary relations—corresponding to sellToSomeone(Russia, Alaska), buyFromSomeone(US, Alaska), sellSomethingTo(Russia, US). This transformation does not exactly preserve meaning, but still captures the most important relations. Note that this allows us to compare semantic relations across different syntactic types—for example, both transitive verbs and argument-taking nouns can be seen as expressing binary semantic relations between entities.

Figure 2 shows the layers used in our model.

4 Initial Semantic Analysis

The initial semantic analysis maps parser output onto a logical form, in a similar way to Boxer. The semantic formalism is based on Steedman (2012).

The first step is syntactic parsing. We use the C&C parser (Clark and Curran, 2004), trained on CCGBank (Hockenmaier and Steedman, 2007), using the refined version of Honnibal et al. (2010) which brings the syntax closer to the predicate-argument structure. An automatic post-processing step makes a number of minor changes to the parser output, which converts the grammar into one more suitable for our semantics. PP (prepositional phrase) and PR (phrasal verb complement) categories are sub-categorised with the relevant preposition. Noun compounds with the same MUC named-entity type (Chinchor and Robinson, 1997) are merged into a single non-compositional node² (we otherwise ignore named-entity types). All argument NPs and PPs are type-raised, allowing us to represent quantifiers. All prepositional phrases are treated as core arguments (i.e. given the category PP, not adjunct categories like (N\N)/NP or ((S\NP)\(S\NP))/NP), as it is difficult for the parser to distinguish arguments and adjuncts.

Initial semantic lexical entries for almost all words can be generated automatically from the syntactic category and POS tag (obtained from the parser), as the syntactic category captures the underlying predicate-argument structure. We use a Davidsonian-style analysis of arguments (Davidson, 1967), which we binarize by creating a separate predicate for each pair of arguments of a word. These predicates are labelled with the lemma of the head word and a Propbank-style argument key (Kingsbury and Palmer, 2002), e.g. arg0, argIn. We distinguish noun and verb predicates based on POS tag—so, for example, we have different predicates for effect as a noun or verb.

¹Whilst this assumption is very useful, it does not always hold—for example, the genitive in Shakespeare's book is ambiguous between ownership and authorship relations even given the types of the arguments.

²For example, this allows us to give Barack Obama the semantics λx.barack_obama(x) instead of λx.barack(x) ∧ obama(x), which is more convenient for collecting distributional statistics.
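The pairwise binarization can be sketched in a few lines. This is an illustrative reconstruction under assumed naming conventions, not the authors' code:

```python
from itertools import combinations

def binarize(lemma, args):
    """Split an n-ary predicate-argument structure into one binary
    predicate per pair of arguments, labelled with the head lemma
    and Propbank-style argument keys.

    args: list of (arg_key, entity) pairs.
    """
    return [(f"{lemma}.{k1},{k2}", e1, e2)
            for (k1, e1), (k2, e2) in combinations(args, 2)]

# "Russia sold Alaska to the United States" yields three binary relations:
print(binarize("sell", [("arg0", "russia"), ("arg1", "alaska"), ("argTo", "us")]))
# [('sell.arg0,arg1', 'russia', 'alaska'),
#  ('sell.arg0,argTo', 'russia', 'us'),
#  ('sell.arg1,argTo', 'alaska', 'us')]
```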


  Word     Category         Semantics
  Automatic:
  author   N/PP[of]         λxλy.author_{arg0,argOf}(y,x)
  write    (S\NP)/NP        λxλy.write_{arg0,arg1}(y,x)
  Manual:
  every    NP↑/N            λpλq.∀x[p(x) → q(x)]
  not      (S\NP)/(S\NP)    λpλx.¬p(x)

Figure 3: Example initial lexical entries.

This algorithm can be overridden with manual lexical entries for specific closed-class function words. Whilst it may be possible to learn these from data, our approach is pragmatic as there are relatively few such words, and the complex logical forms required would be difficult to induce from distributional statistics. We add a small number of lexical entries for words such as negatives (no, not etc.), and quantifiers (numbers, each, every, all, etc.). Some example initial lexical entries are shown in Figure 3.

5 Entity Typing Model

Our entity-typing model assigns types to nouns, which is useful for disambiguating polysemous predicates. Our approach is similar to O'Seaghdha (2010) in that we aim to cluster entities based on the noun and unary predicates applied to them (it is simple to convert from the binary predicates to unary predicates). For example, we want the pair (born_{argIn}, 1961) to map to a DAT type, and (born_{argIn}, Hawaii) to map to a LOC type. This is non-trivial, as both the predicates and arguments can be ambiguous between multiple types—but topic models offer a good solution (described below).

5.1 Model

We assume that the type of each argument of a predicate depends only on the predicate and argument, although Ritter et al. (2010) demonstrate an advantage of modelling the joint probability of the types of multiple arguments of the same predicate. We use the standard Latent Dirichlet Allocation model (Blei et al., 2003), which performs comparably to more complex models proposed in O'Seaghdha (2010).

In topic-modelling terminology, we construct a document for each unary predicate (e.g. born_{argIn}), based on all of its argument entities (words). We assume that these arguments are drawn from a small number of types (topics), such as PER, DAT or LOC³. Each type j has a multinomial distribution φ_j over arguments (for example, a LOC type is more likely to generate Hawaii than 1961). Each unary predicate i has a multinomial distribution θ_i over topics, so the born_{argIn} predicate will normally generate a DAT or LOC type. Sparse Dirichlet priors α and β on the multinomials bias the distributions to be peaky. The parameters are estimated by Gibbs sampling, using the Mallet implementation (McCallum, 2002).

The generative story to create the data is:

  For every type k:
    Draw the p(arg|k) distribution φ_k from Dir(β)
  For every unary predicate i:
    Draw the p(type|i) distribution θ_i from Dir(α)
    For every argument j:
      Draw a type z_{ij} from Mult(θ_i)
      Draw an argument w_{ij} from Mult(φ_{z_{ij}})

³Types are induced from the text, but we give human-readable labels here for convenience.
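The generative story above can be made concrete as forward sampling. The following sketch uses assumed dimensions and is purely illustrative; the actual parameters are estimated with Mallet's Gibbs sampler, not sampled like this:

```python
import numpy as np

rng = np.random.default_rng(0)

n_types, n_preds, vocab_size = 15, 50, 1000   # assumed sizes
alpha, beta = 0.1, 0.01                       # sparse Dirichlet priors

# phi[k]: distribution over argument words for type k
phi = rng.dirichlet([beta] * vocab_size, size=n_types)
# theta[i]: distribution over types for unary predicate i
theta = rng.dirichlet([alpha] * n_types, size=n_preds)

def generate_argument(i):
    """Sample one argument of unary predicate i: draw a type, then a word."""
    z = rng.choice(n_types, p=theta[i])     # type z_ij ~ Mult(theta_i)
    w = rng.choice(vocab_size, p=phi[z])    # word w_ij ~ Mult(phi_z)
    return z, w
```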

5.2 Typing in Logical Form

In the logical form, all constants and variables representing entities x can be assigned a distribution over types p_x(t) using the type model. An initial type distribution is applied in the lexicon, using the φ distributions for the types of nouns, and the θ_i distributions for the types of arguments of binary predicates (inverted using Bayes' rule). Then at each β-reduction in the derivation, we update the distributions of the types to be the product of the type distributions of the terms being reduced.

Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00219 by guest on 03 October 2021 file a suit

(S NP)/NP NP↑ DOC =0.5 \ CLOTHES = 0.6 LEGAL=0.4 PER = 0.7 LEGAL = 0.3 λy: λx: ORG = 0.2 . filearg0,arg1(x,y) λ p. y: [suit0(y) p(y)] CLOTHES=0.01 ... ∃ DOC =0.001 ∧ ( ... )   ( ... ) < S NP LEGAL\= 0.94 PER = 0.7 CLOTHES = 0.05 λx: ORG = 0.2 y: )[suit0(y) filearg0,arg1(x,y)] ... ∃ DOC = 0.004 ∧   ( ... ) Figure 4: Using the type model for disambiguation in the derivation of file a suit. Type distributions are shown after the declarations. Both suit and the of file are lexically ambiguous between different types, but after the β-reduction only one interpretation is likely. If the verb were wear, a different interpretation would be preferred.

If two terms x and y combine to a term z, the combined type distribution is:

  p_z(t) = p_x(t) p_y(t) / Σ_{t′} p_x(t′) p_y(t′)

For example, in wore a suit and file a suit, the variable representing suit may be lexically ambiguous between CLOTHES and LEGAL types, but the variables representing the objects of wear and file will have type preferences that allow us to choose the correct type when the terms combine. Figure 4 shows an example derivation using the type model for disambiguation⁴.

⁴Our implementation follows Steedman (2012) in using Generalized Skolem Terms rather than existential quantifiers, in order to capture quantifier scope alternations monotonically, but we omit these from the example to avoid introducing new notation.
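The update is a renormalized pointwise product, as in this small sketch (type labels are illustrative; a real implementation would need to handle distributions with no overlapping types):

```python
def combine_types(p_x, p_y):
    """Combine two type distributions at a beta-reduction."""
    product = {t: p_x.get(t, 0.0) * p_y.get(t, 0.0)
               for t in set(p_x) | set(p_y)}
    z = sum(product.values())   # assumes at least one type overlaps
    return {t: p / z for t, p in product.items() if p > 0.0}

# The object of "file" prefers LEGAL readings; "suit" is ambiguous:
file_obj = {"LEGAL": 0.4, "DOC": 0.5, "CLOTHES": 0.01}
suit     = {"CLOTHES": 0.6, "LEGAL": 0.3, "DOC": 0.001}
print(combine_types(file_obj, suit))   # LEGAL ~ 0.95, CLOTHES ~ 0.05
```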

6 Distributional Relation Clustering Model

The typed binary predicates can be grouped into clusters, each of which represents a distinct semantic relation. Note that because we cluster typed predicates, born_{arg0:PER,argIn:LOC} and born_{arg0:PER,argIn:DAT} can be clustered separately.

6.1 Corpus statistics

Typed binary predicates are clustered based on the expected number of times they hold between each argument-pair in the corpus. This means we create a single vector of argument-pair counts for each predicate (not a separate vector for each argument). For example, the vector for the typed predicate write_{arg0:PER,arg1:BOOK} may contain non-zero counts for entity-pairs such as (Shakespeare, Macbeth), (Dickens, Oliver Twist) and (Rowling, Harry Potter). The entity-pair counts for author_{arg0:PER,argOf:BOOK} may be similar, on the assumption that both are samples from the same underlying semantic relation.

To find the expected number of occurrences of argument-pairs for typed binary predicates in a corpus, we first apply the type model to the derivation of each sentence, as described in Section 5.2. This outputs untyped binary predicates, with distributions over the types of their arguments. The type of the predicate must match the type of its arguments, so the type distribution of a binary predicate is simply the joint distribution of the two argument type distributions.

For example, if the arguments in a born_{arg0,argIn}(obama, hawaii) derivation have the respective type distributions (PER=0.9, LOC=0.1) and (LOC=0.7, DAT=0.3), the distribution over binary typed predicates is (born_{arg0:PER,argIn:LOC}=0.63, born_{arg0:PER,argIn:DAT}=0.27, etc.). The expected counts for (obama, hawaii) in the vectors for born_{arg0:PER,argIn:LOC} and born_{arg0:PER,argIn:DAT} are then incremented by these probabilities.
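The expected-count bookkeeping can be sketched as follows (predicate naming is illustrative):

```python
from collections import defaultdict

# counts[typed_predicate][(entity1, entity2)] -> expected count
counts = defaultdict(lambda: defaultdict(float))

def add_observation(pred, e1, e2, p_t1, p_t2):
    """Spread one observed argument-pair of `pred` over its typed variants,
    weighted by the joint of the two argument type distributions."""
    for t1, w1 in p_t1.items():
        for t2, w2 in p_t2.items():
            counts[f"{pred}[{t1},{t2}]"][(e1, e2)] += w1 * w2

add_observation("born.arg0,argIn", "obama", "hawaii",
                {"PER": 0.9, "LOC": 0.1}, {"LOC": 0.7, "DAT": 0.3})
# born.arg0,argIn[PER,LOC] is incremented by 0.63, [PER,DAT] by 0.27
```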


6.2 Clustering

Many algorithms have been proposed for clustering predicates based on their arguments (Poon and Domingos, 2009; Yao et al., 2012). The number of relations in the corpus is unbounded, so the clustering algorithm should be non-parametric. It is also important that it remains tractable for very large numbers of predicates and arguments, in order to give us a greater coverage of language than can be achieved by hand-built ontologies.

We cluster the typed predicate vectors using the Chinese Whispers algorithm (Biemann, 2006)—although somewhat ad-hoc, it is both non-parametric and highly scalable⁵. This has previously been used for noun-clustering by Fountain and Lapata (2011), who argue it is a cognitively plausible model for language acquisition. The collection of predicates and arguments is converted into a graph with one node per predicate, and edge weights representing the similarity between predicates. Predicates with different types have zero similarity; otherwise similarity is computed as the cosine-similarity of the tf-idf vectors of argument-pairs. We prune nodes occurring fewer than 20 times, edges with weights less than 10⁻³, and a short list of stop predicates.

The algorithm proceeds as follows:

1. Each predicate p is assigned to a different semantic relation r_p
2. Iterate over the predicates p in a random order
3. Set r_p = argmax_r Σ_{p′} sim(p, p′) · 1[r = r_{p′}], where sim(p, p′) is the distributional similarity between p and p′, and 1[r = r′] is 1 iff r = r′ and 0 otherwise
4. Repeat (2.) to convergence

⁵We also experimented with a Dirichlet Process Mixture Model (Neal, 2000), but even with the efficient A* search algorithms introduced by Daumé III (2007), the cost of inference was found to be prohibitively high when run at large scale.
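A minimal sketch of the Chinese Whispers updates above (illustrative; the real system additionally applies the typing constraint, tf-idf weighting and the pruning described above):

```python
import random
from collections import defaultdict

def chinese_whispers(nodes, weight, iterations=20, seed=0):
    """Cluster graph nodes by iterated label propagation.

    nodes:  list of predicate identifiers
    weight: dict mapping (p, q) pairs to a symmetric edge weight
            (absent or zero for predicates with different types)
    """
    rng = random.Random(seed)
    label = {p: p for p in nodes}      # 1. every predicate is its own relation
    for _ in range(iterations):        # 4. repeat to (approximate) convergence
        order = nodes[:]
        rng.shuffle(order)             # 2. visit predicates in random order
        for p in order:
            votes = defaultdict(float)
            for q in nodes:
                if q != p:
                    votes[label[q]] += weight.get((p, q), 0.0)
            if votes:                  # 3. adopt the best-supported label
                label[p] = max(votes, key=votes.get)
    return label
```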

7 Semantic Parsing using Relation Clusters

The final phase is to use our relation clusters in the lexical entries of the CCG semantic derivation. This is slightly complicated by the fact that our predicates are lexically ambiguous between all the possible types they could take, and hence the relations they could express. For example, the system cannot tell whether born in is expressing a birthplace or birthdate relation until later in the derivation, when it combines with its arguments. However, all the possible logical forms are identical except for the symbols used, which means we can produce a packed logical form capturing the full distribution over logical forms. To do this, we make the predicate a function from argument types to relations.

For each word, we first take the lexical semantic definition produced by the algorithm in Section 4. For binary predicates in this definition (which will be untyped), we perform a deterministic lookup in the cluster model learned in Section 6, using all possible corresponding typed predicates. This allows us to represent the binary predicates as packed predicates: functions from argument types to relations. For example, if the clustering maps born_{arg0:PER,argIn:LOC} to rel49 ("birthplace") and born_{arg0:PER,argIn:DAT} to rel53 ("birthdate"), our lexicon contains the following packed lexical entry (type-distributions on the variables are suppressed):

  born ⊢ (S\NP)/PP[in] : λyλx. { (x:PER, y:LOC) ⇒ rel49(x,y) ; (x:PER, y:DAT) ⇒ rel53(x,y) }

The distributions over argument types then imply a distribution over relations. For example, if the packed predicate for born_{arg0,argIn} is applied to arguments Obama and Hawaii, with respective type distributions (PER=0.9, LOC=0.1) and (LOC=0.7, DAT=0.3)⁶, the distribution over relations will be (rel49=0.63, rel53=0.27, etc.).

If 1961 has a type-distribution (LOC=0.1, DAT=0.9), the output packed logical form for Obama was born in Hawaii in 1961 will be:

  {rel49=0.63, rel53=0.27, ...}(ob, hw) ∧ {rel49=0.09, rel53=0.81, ...}(ob, 1961)

The probability of a given logical form can be read from this packed logical form.

⁶These distributions are composed from the type-distributions for both the predicate and argument, as explained in Section 5.
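A sketch of how a packed predicate resolves to a distribution over relations, using the cluster identifiers of the running example (code illustrative):

```python
def relation_distribution(packed, p_t1, p_t2):
    """packed: dict mapping (type1, type2) signatures to relation ids.
    Returns the distribution over relations implied by the argument types."""
    dist = {}
    for (t1, t2), rel in packed.items():
        dist[rel] = dist.get(rel, 0.0) + p_t1.get(t1, 0.0) * p_t2.get(t2, 0.0)
    return dist

born = {("PER", "LOC"): "rel49",   # "birthplace"
        ("PER", "DAT"): "rel53"}   # "birthdate"
print(relation_distribution(born,
                            {"PER": 0.9, "LOC": 0.1},    # Obama
                            {"LOC": 0.7, "DAT": 0.3}))   # Hawaii
# {'rel49': 0.63, 'rel53': 0.27}  (remaining mass falls on other signatures)
```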

8 Experiments

Our approach aims to offer a strong model of both formal and lexical semantics. We perform two evaluations, aiming to target each of these separately, but using the same semantic representations in each.

We train our system on Gigaword (Graff et al., 2003), which contains around 4 billion words of newswire. The type-model is trained using 15 types⁷, and 5,000 iterations of Gibbs sampling (using the distributions from the final sample).

⁷This number was chosen by examination of models trained with different numbers of types. The algorithm produces semantically coherent clusters for much larger numbers of types, but many of these are fine-grained categories of people, which introduces sparsity in the relation clustering.

  Type   Top Words
  1      suspect, assailant, fugitive, accomplice
  2      author, singer, actress, actor, dad
  5      city, area, country, region, town, capital
  8      subsidiary, automaker, airline, Co., GM
  10     musical, thriller, sequel, special

Table 1: Most probable terms in some clusters induced by the Type Model.

Table 1 shows some example types. The relation clustering uses only proper nouns, to improve precision (sparsity problems are partly offset by the large input corpus). Aside from parsing, the pipeline takes around a day to run using 12 cores.

8.1 Question Answering Experiments

As yet, there is no standard way of evaluating lexical semantics. Existing tasks like Recognising Textual Entailment (RTE; Dagan et al., 2006) rely heavily on background knowledge, which is beyond the scope of this work. Intrinsic evaluations of entailment relations have low inter-annotator agreement (Szpektor et al., 2007), due to the difficulty of evaluating relations out of context.

Our evaluation is based on that performed by Poon and Domingos (2009). We automatically construct a set of questions by sampling from text, and then evaluate how many answers can be found in a different corpus. From dependency-parsed newswire, we sample either X ←nsubj– verb –dobj→ Y, X ←nsubj– verb –pobj→ Y or X ←nsubj– be –dobj→ noun –pobj→ Y patterns, where X and Y are proper nouns and the verb is not on a list of stop verbs, and deterministically convert these to questions (see the sketch below). For example, from Google bought YouTube we create the questions What did Google buy? and What bought YouTube?. The task is to find proper-noun answers to these questions in a different corpus, which are then evaluated by human annotators based on the sentence the answer was retrieved from⁸. Systems can return multiple answers to the same question (e.g. What did Google buy? may have many valid answers), and all of these contribute to the result. As none of the systems model tense or temporal semantics, annotators were instructed to annotate answers as correct if they were true at any time. This approach means we evaluate on relations in proportion to corpus frequency. We sample 1000 questions from the New York Times subset of Gigaword from 2010, and search for answers in the New York Times from 2009.

We evaluate the following approaches:

• CCG-Baseline: the logical form produced by our CCG derivation, without the clustering.
• CCG-WordNet: the CCG logical form, plus WordNet as a model of lexical semantics.
• CCG-Distributional: the logical form including the type model and relation clusters.
• Relational LDA: an LDA-based model for clustering dependency paths (Yao et al., 2011). We train on the New York Times subset of Gigaword⁹, using their setup of 50 iterations with 100 relation types.
• Reverb: a sophisticated Open Information Extraction system (Fader et al., 2011).

Unsupervised Semantic Parsing (USP; Poon and Domingos, 2009; Poon and Domingos, 2010; Titov and Klementiev, 2011) would be another obvious baseline. However, memory requirements mean it is not possible to run at this scale (our system is trained on 4 orders of magnitude more data than the USP evaluation). Yao et al. (2011) found it had comparable performance to Relational LDA.

For the CCG models, rather than performing full first-order inference on a large corpus, we simply test whether the question predicate subsumes a candidate answer predicate, and whether the arguments match¹⁰.
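The pattern-to-question conversion described above is deterministic; an illustrative sketch (the real system works over dependency parses, and verb inflections are assumed to be given):

```python
def make_questions(x, verb_past, verb_base, y):
    """From an 'X verb Y' match such as 'Google bought YouTube',
    generate the two questions used in the evaluation."""
    return [f"What did {x} {verb_base}?",   # queries the object
            f"What {verb_past} {y}?"]       # queries the subject

print(make_questions("Google", "bought", "buy", "YouTube"))
# ['What did Google buy?', 'What bought YouTube?']
```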

⁸Common nouns are filtered automatically. To focus on evaluating the semantics, annotators ignored garbled sentences due to errors pre-processing the corpus (these are excluded from the results). We also automatically exclude weekday and month answers, which are overwhelmingly syntax errors for all systems—e.g. treating Tuesday as an object in Obama announced Tuesday that...

⁹This is around 35% of Gigaword, and was the largest scale possible on our resources.

¹⁰We do this as it is much more efficient than full first-order theorem-proving. We could in principle make additional inferences with theorem-proving, such as answering What did Google buy? from Google bought the largest video website and YouTube is the largest video website.


In the case of CCG-Distributional, we calculate the probability that the two packed predicates are in the same cluster, marginalizing over their argument types (see the sketch below). Answers are ranked by this probability. For CCG-WordNet, we check if the question predicate is a hypernym of the candidate answer predicate (using any WordNet sense of either term).

  System                     Answers   Correct
  Relational-LDA             7046      11.6%
  Reverb                     180       89.4%
  CCG-Baseline               203       95.8%
  CCG-WordNet                211       94.8%
  CCG-Distributional@250     250       94.1%
  CCG-Distributional@500     500       82.0%

Table 2: Results on the wide-coverage Question Answering task. CCG-Distributional ranks question/answer pairs by confidence—@250 means we evaluate the top 250 of these. It is not possible to give a recall figure, as the total number of correct answers in the corpus is unknown.

Results are shown in Table 2. Relational-LDA induces many meaningful clusters, but predicates must be assigned to one of 100 relations, so results are dominated by large, noisy clusters (it is not possible to take the N-best answers, as the cluster assignments do not have a confidence score). The CCG-Baseline errors are mainly caused by parser errors, or relations in the scope of non-factive operators. CCG-WordNet adds few answers to CCG-Baseline, reflecting the limitations of hand-built ontologies. CCG-Distributional substantially improves recall over the other approaches whilst retaining good precision, demonstrating that we have learnt a powerful model of lexical semantics. Table 3 shows some correctly answered questions. The system improves over the baseline by mapping expressions such as merge with and acquisition of to the same relation cluster. Many of the errors are caused by conflating predicates where the entailment only holds in one direction, such as was elected to with ran for. Hierarchical clustering could be used to address this.
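The ranking score for CCG-Distributional can be sketched by marginalizing the two relation distributions (building on relation_distribution above; values illustrative):

```python
def same_cluster_probability(q_dist, a_dist):
    """Probability that the question predicate and a candidate answer
    predicate denote the same relation, marginalizing over types."""
    return sum(p * a_dist.get(rel, 0.0) for rel, p in q_dist.items())

q = {"rel49": 0.63, "rel53": 0.27}    # question predicate
a = {"rel49": 0.80, "rel53": 0.05}    # candidate answer predicate
print(same_cluster_probability(q, a)) # 0.5175
```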
8.2 Experiments on the FraCaS Suite

We are also interested in evaluating our approach as a model of formal semantics—demonstrating that it is possible to integrate the formal semantics of Steedman (2012) with our distributional clusters. The FraCaS suite (Cooper et al., 1996)¹¹ contains a hand-built set of entailment problems designed to be challenging in terms of formal semantics. We use Section 1, which contains 74 problems requiring an interpretation of quantifiers¹². They do not require any knowledge of lexical semantics, meaning we can evaluate the formal component of our system in isolation. However, we use the same representations as in our previous experiment, even though the clusters provide no benefit on this task. Figure 5 gives an example problem.

The only previous work we are aware of on this dataset is by MacCartney and Manning (2007). This approach learns the monotonicity properties of words from a hand-built training set, and uses this to transform a sentence into a polarity-annotated string. The system then aims to transform the premise string into the hypothesis. Positively polarized words can be replaced with less specific ones (e.g. by deleting adjuncts), whereas negatively polarized words can be replaced with more specific ones (e.g. by adding adjuncts). Whilst this is high-precision and often useful, this logic is unable to perform inferences with multiple premise sentences (in contrast to our first-order logic).

Development consists of adding entries to our lexicon for quantifiers. For simplicity, we treat multi-word quantifiers like at least a few as multi-word expressions—although a more compositional analysis may be possible. Following MacCartney and Manning (2007), we do not use held-out data—each problem is designed to test a different issue, so it is not possible to generalize from one subset of the suite to another. As we are interested in evaluating the semantics, not the parser, we manually supply gold-standard lexical categories for sentences with parser errors (any syntactic mistake causes incorrect semantics). Our derivations produce a distribution over logical forms—we license the inference if it holds in any interpretation with non-zero probability. We use the Prover9 (McCune, 2005) theorem prover for inference, returning yes if the premise implies the hypothesis, no if it implies the negation of the hypothesis, and unknown otherwise.

¹¹We use the version converted to machine-readable format by MacCartney and Manning (2007).

¹²Excluding 6 problems without a defined solution.
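As an illustration of the inference step, the following sketch runs Prover9 on a first-order rendering of the Figure 5 problem. The predicate names are our own paraphrase, not the system's induced symbols, and the code assumes a prover9 binary on the PATH; a full implementation would also attempt to prove the negated hypothesis, returning no on success and unknown if neither proof is found:

```python
import subprocess
import tempfile

PROBLEM = """\
formulas(assumptions).
  all x (european(x) -> rightToLiveIn(x, europe)).
  all x (european(x) -> person(x)).
  all x (person(x) & rightToLiveIn(x, europe) -> travelFreelyIn(x, europe)).
end_of_list.

formulas(goals).
  all x (european(x) -> travelFreelyIn(x, europe)).
end_of_list.
"""

def proves(problem):
    """Return True if Prover9 finds a proof of the goal."""
    with tempfile.NamedTemporaryFile("w", suffix=".in", delete=False) as f:
        f.write(problem)
        path = f.name
    result = subprocess.run(["prover9", "-f", path],
                            capture_output=True, text=True)
    return "THEOREM PROVED" in result.stdout

print(proves(PROBLEM))   # True: the example resolves to "yes"
```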


  Question                      Answer      Sentence
  What did Delta merge with?    Northwest   The 747 freighters came with Delta's acquisition of Northwest
  What spoke with Hu Jintao?    Obama       Obama conveyed his respect for the Dalai Lama to China's president Hu Jintao during their first meeting...
  What arrived in Colorado?     Zazi        Zazi flew back to Colorado...
  What ran for Congress?        Young       ...Young was elected to Congress in 1972

Table 3: Example questions correctly answered by CCG-Distributional.

Premises:
  Every European has the right to live in Europe.
  Every European is a person.
  Every person who has the right to live in Europe can travel freely within Europe.
Hypothesis:
  Every European can travel freely within Europe.
Solution: Yes

Figure 5: Example problem from the FraCaS suite.

  System                      Single Premise   Multiple Premises
  MacCartney & Manning 07     84%              –
  MacCartney & Manning 08     98%              –
  CCG-Dist (parser syntax)    70%              50%
  CCG-Dist (gold syntax)      89%              80%

Table 4: Accuracy on Section 1 of the FraCaS suite. Problems are divided into those with one premise sentence (44) and those with multiple premises (30).

Results are shown in Table 4. Our system improves on previous work by making multi-sentence inferences. Causes of errors include missing a distinct lexical entry for plural the, only taking existential interpretations of bare plurals, failing to interpret mass-noun determiners such as a lot of, and not providing a good semantics for non-monotone determiners such as most. We believe these problems will be surmountable with more work. Almost all errors are due to incorrectly predicting unknown—the system makes just one error on yes or no predictions (with or without gold syntax). This suggests that making first-order logic inferences in applications will not harm precision. We are less robust than MacCartney and Manning (2007) to syntax errors but, conversely, we are able to attempt more of the problems (i.e. those with multi-sentence premises). Other approaches based on distributional semantics seem unable to tackle any of these problems, as they do not represent quantifiers or negation.

9 Related Work

Much work on semantics has taken place in a supervised setting—for example the GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) semantic parsing tasks. This approach makes sense for generating queries for a specific database, but means the semantic representations do not generalize to other datasets. There have been several attempts to annotate larger corpora with semantics—such as Ontonotes (Hovy et al., 2006) or the Groningen Meaning Bank (Basile et al., 2012). These typically map words onto senses in ontologies such as WordNet, VerbNet (Kipper et al., 2000) and FrameNet (Baker et al., 1998). However, limitations of these ontologies mean that they do not support inferences such as X is the author of Y → X wrote Y.

Given the difficulty of annotating large amounts of text with semantics, various approaches have attempted to learn meaning without annotated text. Distant Supervision approaches leverage existing knowledge bases, such as Freebase (Bollacker et al., 2008), to learn semantics (Mintz et al., 2009; Krishnamurthy and Mitchell, 2012). Dependency-based Compositional Semantics (Liang et al., 2011) learns the meaning of questions by using their answers as supervision—but this appears to be specific to question parsing. Such approaches can only learn the pre-specified relations in the knowledge base.

The approaches discussed so far in this section have all attempted to map language onto some pre-specified set of relations.


Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00219 by guest on 03 October 2021 specified set of relations. Various attempts have use distributional statistics to determine the proba- been made to instead induce relations from text by bility that a WordNet-derived inference rule is valid clustering predicates based on their arguments. For in a given context. Our approach differs in that example, Yao et al. (2011) propose a series of LDA- we learn inference rules not present in WordNet. based models which cluster relations between en- Our lexical semantics is integrated into the lexicon, tities based on a variety of lexical, syntactic and rather than being implemented as additional infer- semantic features. Unsupervised Semantic Pars- ence rules, meaning that inference is more efficient, ing (Poon and Domingos, 2009) recursively clusters as equivalent statements have the same logical form. fragments of dependency trees based on their argu- Natural Logic (MacCartney and Manning, 2007) ments. Although USP is an elegant model, it is too offers an interesting alternative to symbolic , computationally expensive to run on large corpora. and has been shown to be able to capture complex It is also based on frame semantics, so does not clus- logical inferences by simply identifying the scope of ter equivalent predicates with different frames. To negation in text. This approach achieves similar pre- our knowledge, our work is the first such approach cision and much higher recall than Boxer on the RTE to be integrated within a linguistic theory supporting task. Their approach also suffers from such limita- formal semantics for logical operators. tions as only being able to make inferences between Vector space models represent words by vectors two sentences. It is also sensitive to , so based on co-occurrence counts. Recent work has cannot make inferences such as Shakespeare wrote tackled the problem of composing these matrices Macbeth = Macbeth was written by Shakespeare. ⇒ to build up the semantics of phrases or sentences (Mitchell and Lapata, 2008). Another strand (Co- 10 Conclusions and Future Work ecke et al., 2010; Grefenstette et al., 2011) has This is the first work we are aware of that combines a shown how to represent meanings as tensors, whose distributionally induced lexicon with formal seman- order depends on the syntactic category, allowing tics. Experiments suggest our approach compares an elegant correspondence between syntactic and favourably with existing work in both areas. semantic types. Socher et al. (2012) train a com- Many potential areas for improvement remain. position function using a neural network—however Hierachical clustering would allow us to capture their method requires annotated data. It is also not hypernym relations, rather than the synonyms cap- obvious how to represent logical relations such as tured by our flat clustering. There is much potential quantification in vector-space models. Baroni et al. for integrating existing hand-built resources, such (2012) make progress towards this by learning a as Ontonotes and WordNet, to improve the accu- classifier that can recognise entailments such as all racy of clustering. There are cases where the ex- dogs = some dogs, but this remains some way ⇒ isting CCGBank grammar does not match the re- from the power of first-order theorem proving of the quired predicate-argument structure—for example kind required by the problem in Figure 5. in the case of light verbs. 
It may be possible to re- An alternative strand of research has attempted bank CCGBank, in a way similar to Honnibal et al. to build computational models of linguistic theories (2010), to improve it on this point. based on formal compositional semantics, such as the CCG-based Boxer (Bos, 2008) and the LFG- Acknowledgements based XLE (Bobrow et al., 2007). Such approaches convert parser output into formal semantic repre- We thank Christos Christodoulopoulos, Tejaswini sentations, and have demonstrated some ability to Deoskar, Mark Granroth-Wilding, Ewan Klein, Ka- model complex phenomena such as negation. For trin Erk, Johan Bos and the anonymous reviewers lexical semantics, they typically compile lexical re- for their helpful comments, and Limin Yao for shar- sources such as VerbNet and WordNet into inference ing code. This work was funded by ERC Advanced rules—but still achieve only low recall on open- Fellowship 249520 GRAMPLUS and IP EC-FP7- domain tasks, such as RTE, mostly due to the low 270273 Xperience. coverage of such resources. Garrette et al. (2011)


Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00219 by guest on 03 October 2021 Proceedings, Research in , pages 277–286. College Publications. C.F. Baker, C.J. Fillmore, and J.B. Lowe. 1998. The P.F. Brown, P.V. Desouza, R.L. Mercer, V.J.D. Pietra, and berkeley project. In Proceedings of the J.C. Lai. 1992. Class-based n-gram models of natural 36th Annual Meeting of the Association for Computa- language. Computational linguistics, 18(4):467–479. tional Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pages 86–90. N. Chinchor and P. Robinson. 1997. Muc-7 named entity Association for Computational Linguistics. task definition. In Proceedings of the 7th Conference on Message Understanding. M. Baroni, R. Bernardi, N.Q. Do, and C. Shan. 2012. Entailment above the word level in distributional se- Stephen Clark and James R. Curran. 2004. Parsing the mantics. In Proceedings of EACL, pages 23–32. Cite- WSJ using CCG and log-linear models. In Proceed- seer. ings of the 42nd Annual Meeting on Association for Computational Linguistics V. Basile, J. Bos, K. Evang, and N. Venhuizen. 2012. , ACL ’04. Association for Developing a large semantically annotated corpus. In Computational Linguistics. Proceedings of the Eighth International Conference Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. on Language Resources and Evaluation (LREC12). To 2010. Mathematical foundations for a compositional appear. distributional model of meaning. Linguistic Analysis: Jonathan Berant, Ido Dagan, and Jacob Goldberger. A Festschrift for Joachim Lambek, 36(1-4):345–384. 2011. Global learning of typed entailment rules. In Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Proceedings of the 49th Annual Meeting of the Asso- Johan Van Genabith, Jan Jaspars, , David ciation for Computational Linguistics: Human Lan- Milward, Manfred Pinkal, Massimo Poesio, et al. guage Technologies - Volume 1, HLT ’11, pages 610– 1996. Using the framework. FraCaS Deliverable D, 619. Association for Computational Linguistics. 16. C. Biemann. 2006. Chinese whispers: an efficient graph Ido Dagan, O. Glickman, and B. Magnini. 2006. The clustering algorithm and its application to natural lan- PASCAL recognising challenge. guage processing problems. In Proceedings of the Machine Learning Challenges. Evaluating Predictive First Workshop on Graph Based Methods for Natural Uncertainty, Visual Object Classification, and Recog- Language Processing, pages 73–80. Association for nising Tectual Entailment, pages 177–190. Computational Linguistics. D.A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke- D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Smith, D. Pallett, C. Pao, A. Rudnicky, and dirichlet allocation. the Journal of machine Learning E. Shriberg. 1994. Expanding the scope of the ATIS research, 3:993–1022. task: The ATIS-3 corpus. In Proceedings of the work- D. G. Bobrow, C. Condoravdi, R. Crouch, V. De Paiva, shop on Human Language Technology, pages 43–48. L. Karttunen, T. H. King, R. Nairn, L. Price, and Association for Computational Linguistics. A. Zaenen. 2007. Precision-focused textual inference. Hal Daume´ III. 2007. Fast search for dirichlet process In Proceedings of the ACL-PASCAL Workshop on Tex- mixture models. In Proceedings of the Eleventh In- tual Entailment and Paraphrasing, RTE ’07, pages ternational Conference on Artificial Intelligence and 16–21. Association for Computational Linguistics. Statistics (AIStats), San Juan, Puerto Rico. 
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim D. Davidson. 1967. 6. the logical form of action sen- Sturge, and Jamie Taylor. 2008. Freebase: a col- tences. Essays on actions and events, 1(9):105–149. laboratively created graph database for structuring hu- Anthony Fader, Stephen Soderland, and Oren Etzioni. man knowledge. In Proceedings of the 2008 ACM 2011. Identifying relations for open information SIGMOD international conference on Management of extraction. In Proceedings of the Conference on data, SIGMOD ’08, pages 1247–1250, New York, NY, Empirical Methods in Natural Language Processing, USA. ACM. EMNLP ’11, pages 1535–1545. Association for Com- J. Bos and K. Markert. 2005. Recognising textual en- putational Linguistics. tailment with logical inference. In Proceedings of T. Fountain and M. Lapata. 2011. Incremental models of the conference on Human Language Technology and natural language category acquisition. In Proceedings Empirical Methods in Natural Language Processing, of the 32st Annual Conference of the pages 628–635. Association for Computational Lin- Society. guistics. D. Garrette, K. Erk, and R. Mooney. 2011. Integrating Johan Bos. 2008. Wide-coverage semantic analysis with logical representations with probabilistic information boxer. In Johan Bos and Rodolfo Delmonte, editors, using markov logic. In Proceedings of the Ninth In- Semantics in . STEP 2008 Conference ternational Conference on Computational Semantics,

189

Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00219 by guest on 03 October 2021 pages 105–114. Association for Computational Lin- Andrew Kachites McCallum. 2002. Mal- guistics. let: A machine learning for language toolkit. D. Graff, J. Kong, K. Chen, and K. Maeda. 2003. English http://mallet.cs.umass.edu. gigaword. Linguistic Data Consortium, Philadelphia. W. McCune. 2005. Prover9 and Mace4. http://cs.unm.edu/˜mccune/mace4/ Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen . Clark, Bob Coecke, and Stephen Pulman. 2011. Con- G.A. Miller. 1995. Wordnet: a lexical database for en- crete sentence spaces for compositional distributional glish. of the ACM, 38(11):39–41. models of meaning. Computational Semantics IWCS M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Dis- 2011, page 125. tant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the Julia Hockenmaier and Mark Steedman. 2007. CCG- 47th Annual Meeting of the ACL and the 4th Interna- bank: a corpus of CCG derivations and dependency tional Joint Conference on Natural Language Process- structures extracted from the penn . Compu- ing of the AFNLP: Volume 2-Volume 2, pages 1003– tational Linguistics, 33(3):355–396. 1011. Association for Computational Linguistics. M. Honnibal, J.R. Curran, and J. Bos. 2010. Rebanking J. Mitchell and M. Lapata. 2008. Vector-based models of CCGbank for improved NP interpretation. In Proceed- semantic composition. proceedings of ACL-08: HLT, ings of the 48th Annual Meeting of the Association for pages 236–244. Computational Linguistics, pages 207–215. Associa- R.M. Neal. 2000. Markov chain sampling methods for tion for Computational Linguistics. dirichlet process mixture models. Journal of compu- E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and tational and graphical statistics, 9(2):249–265. R. Weischedel. 2006. Ontonotes: the 90% solution. D.O. O’Seaghdha. 2010. Latent variable models of se- In Proceedings of the Human Language Technology lectional preference. In Proceedings of the 48th An- Conference of the NAACL, Companion Volume: Short nual Meeting of the Association for Computational Papers, pages 57–60. Association for Computational Linguistics, pages 435–444. Association for Compu- Linguistics. tational Linguistics. P. Kingsbury and M. Palmer. 2002. From treebank to Hoifung Poon and Pedro Domingos. 2009. Unsuper- . In Proceedings of the 3rd International vised semantic parsing. In Proceedings of the 2009 Conference on Language Resources and Evaluation Conference on Empirical Methods in Natural Lan- (LREC-2002), pages 1989–1993. Citeseer. guage Processing: Volume 1 - Volume 1, EMNLP ’09, K. Kipper, H.T. Dang, and M. Palmer. 2000. Class-based pages 1–10. Association for Computational Linguis- construction of a verb lexicon. In Proceedings of the tics. National Conference on Artificial Intelligence, pages Hoifung Poon and Pedro Domingos. 2010. Unsuper- 691–696. Menlo Park, CA; Cambridge, MA; London; vised ontology induction from text. In Proceedings of AAAI Press; MIT Press; 1999. the 48th Annual Meeting of the Association for Com- putational Linguistics, ACL ’10, pages 296–305. As- Jayant Krishnamurthy and Tom M. Mitchell. 2012. sociation for Computational Linguistics. Weakly supervised training of semantic parsers. In A. Ritter, O. Etzioni, et al. 2010. A latent dirichlet allo- Proceedings of the 2012 Joint Conference on Empir- cation method for selectional preferences. 
In Proceed- ical Methods in Natural Language Processing and ings of the 48th Annual Meeting of the Association for Computational Natural Language Learning, EMNLP- Computational Linguistics, pages 424–434. Associa- CoNLL ’12, pages 754–765. Association for Compu- tion for Computational Linguistics. tational Linguistics. Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, P. Liang, M.I. Jordan, and D. Klein. 2011. Learning and Jesse Davis. 2010. Learning first-order horn dependency-based compositional semantics. In Proc. clauses from web text. In Proceedings of the 2010 Association for Computational Linguistics (ACL). Conference on Empirical Methods in Natural Lan- Dekang Lin and Patrick Pantel. 2001. DIRT - Discovery guage Processing, EMNLP ’10, pages 1088–1098. of Inference Rules from Text. In In Proceedings of the Association for Computational Linguistics. ACM SIGKDD Conference on Knowledge Discovery R. Socher, B. Huval, C.D. Manning, and A.Y. Ng. 2012. and , pages 323–328. Semantic compositionality through recursive matrix- Bill MacCartney and Christopher D. Manning. 2007. vector spaces. In Proceedings of the 2012 Joint Natural logic for textual inference. In Proceedings Conference on Empirical Methods in Natural Lan- of the ACL-PASCAL Workshop on Textual Entailment guage Processing and Computational Natural Lan- and Paraphrasing, RTE ’07, pages 193–200. Associa- guage Learning, pages 1201–1211. Association for tion for Computational Linguistics. Computational Linguistics.

190

Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00219 by guest on 03 October 2021 Mark Steedman. 2000. The Syntactic Process. MIT Press. Mark Steedman. 2012. Taking Scope: The Natural Se- mantics of Quantifiers. MIT Press. Idan Szpektor, Eyal Shnarch, and Ido Dagan. 2007. Instance-based evaluation of entailment rule acquisi- tion. In In Proceedings of ACL 2007, volume 45, page 456. Ivan Titov and Alexandre Klementiev. 2011. A bayesian model for unsupervised semantic parsing. In Proceed- ings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Tech- nologies, pages 1445–1455, Portland, Oregon, USA, June. Association for Computational Linguistics. Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of the Conference on Empirical Methods in Natural Language Process- ing, EMNLP ’11, pages 1456–1466. Association for Computational Linguistics. Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense dis- ambiguation. In ACL (1), pages 712–720. J.M. Zelle and R.J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artifi- cial Intelligence, pages 1050–1055.
