Unsupervised Semantic

Hoifung Poon Pedro Domingos Department of Computer Science and Engineering University of Washington Seattle, WA 98195-2350, U.S.A. {hoifung,pedrod}@cs.washington.edu

Abstract In this paper we develop the first unsupervised approach to semantic parsing, using Markov logic We present the first unsupervised approach (Richardson and Domingos, 2006). Our USP sys- to the problem of learning a semantic tem starts by clustering tokens of the same type, parser, using Markov logic. Our USP and then recursively clusters expressions whose system transforms dependency trees into subexpressions belong to the same clusters. Ex- quasi-logical forms, recursively induces periments on a biomedical corpus show that this lambda forms from these, and clusters approach is able to successfully translate syntac- them to abstract away syntactic variations tic variations into a logical representation of their of the same meaning. The MAP semantic common meaning (e.g., USP learns to map active parse of a sentence is obtained by recur- and passive voice to the same logical form, etc.). sively assigning its parts to lambda-form This in turn allows it to correctly answer many clusters and composing them. We evalu- more questions than systems based on TextRun- ate our approach by using it to extract a ner (Banko et al., 2007) and DIRT (Lin and Pantel, knowledge base from biomedical abstracts 2001). and answer questions. USP substantially We begin by reviewing the necessary back- outperforms TextRunner, DIRT and an in- ground on semantic parsing and Markov logic. We formed baseline on both precision and re- then describe our Markov logic network for un- call on this task. supervised semantic parsing, and the learning and inference algorithms we used. Finally, we present 1 Introduction our experiments and results. Semantic parsing maps text to formal meaning representations. This contrasts with semantic role 2 Background labeling (Carreras and Marquez, 2004) and other 2.1 Semantic Parsing forms of shallow semantic processing, which do not aim to produce complete formal meanings. The standard language for formal meaning repre- Traditionally, semantic parsers were constructed sentation is first-order logic. A term is any ex- manually, but this is too costly and brittle. Re- pression representing an object in the domain. An cently, a number of machine learning approaches atomic formula or atom is a predicate symbol ap- have been proposed (Zettlemoyer and Collins, plied to a tuple of terms. Formulas are recursively 2005; Mooney, 2007). However, they are super- constructed from atomic formulas using logical vised, and providing the target logical form for connectives and quantifiers. A lexical entry de- each sentence is costly and difficult to do consis- fines the logical form for a lexical item (e.g., a tently and with high quality. Unsupervised ap- word). The semantic parse of a sentence is de- proaches have been applied to shallow semantic rived by starting with logical forms in the lexi- tasks (e.g., paraphrasing (Lin and Pantel, 2001), cal entries and recursively composing the mean- (Banko et al., 2007)), but ing of larger fragments from their parts. In tradi- not to semantic parsing. tional approaches, the lexical entries and meaning- composition rules are both manually constructed. The major limitation of supervised approaches Below are sample rules in a definite clause gram- is that they require meaning annotations for ex- mar (DCG) for parsing the sentence: “Utah bor- ample sentences. Even in a restricted domain, ders Idaho”. doing this consistently and with high quality re- quires nontrivial effort. For unrestricted text, the V erb[λyλx.borders(x, y)] → borders complexity and subjectivity of annotation render it NP [Utah] → Utah essentially infeasible; even pre-specifying the tar- NP [Idaho] → Idaho get predicates and objects is very difficult. There- V P [rel(obj)] → V erb[rel] NP [obj] fore, to apply semantic parsing beyond limited do- S[rel(obj)] → NP [obj] V P [rel] mains, it is crucial to develop unsupervised meth- ods that do not rely on labeled meanings. The first three lines are lexical entries. They are In the past, unsupervised approaches have been fired upon seeing the individual words. For exam- applied to some semantic tasks, but not to seman- ple, the first rule applies to the word “borders” and tic parsing. For example, DIRT (Lin and Pan- generates syntactic category Verb with the mean- tel, 2001) learns paraphrases of binary relations ing λyλx.borders(x, y) that represents the next- based on distributional similarity of their argu- to relation. Here, we use the standard lambda- ments; TextRunner (Banko et al., 2007) automati- calculus notation, where λyλx.borders(x, y) cally extracts relational triples in open domains us- represents a function that is true for any (x, y)- ing a self-trained extractor; SNE applies relational pair such that borders(x, y) holds. The last two clustering to generate a from rules compose the meanings of sub-parts into that TextRunner triples (Kok and Domingos, 2008). of the larger part. For example, after the first While these systems illustrate the promise of un- and third rules are fired, the fourth rule fires and supervised methods, the semantic content they ex- generates V P [λyλx.borders(x, y)(Idaho)]; this tract is nonetheless shallow and does not constitute meaning simplifies to λx.borders(x, Idaho) by the complete formal meaning that can be obtained the λ-reduction rule, which substitutes the argu- by a semantic parser. ment for a variable in a functional application. Another issue is that existing approaches to se- A major challenge to semantic parsing is syn- mantic parsing learn to parse syntax and semantics tactic variations of the same meaning, which together.1 The drawback is that the complexity abound in natural languages. For example, the in syntactic processing is coupled with semantic aforementioned sentence can be rephrased as parsing and makes the latter even harder. For ex- “Utah is next to Idaho,”“Utah shares a border with ample, when applying their approach to a different Idaho,” etc. Manually encoding all these varia- domain with somewhat less rigid syntax, Zettle- tions into the grammar is tedious and error-prone. moyer and Collins (2007) need to introduce new Supervised semantic parsing addresses this issue combinators and new forms of candidate lexical by learning to construct the grammar automati- entries. Ideally, we should leverage the enormous cally from sample meaning annotations (Mooney, progress made in syntactic parsing and generate 2007). Existing approaches differ in the meaning semantic parses directly from syntactic analysis. representation languages they use and the amount of annotation required. In the approach of Zettle- 2.2 Markov Logic moyer and Collins (2005), the training data con- In many NLP applications, there exist rich rela- sists of sentences paired with their meanings in tions among objects, and recent work in statisti- lambda form. A probabilistic combinatory cate- cal relational learning (Getoor and Taskar, 2007) gorial grammar (PCCG) is learned using a log- and structured prediction (Bakir et al., 2007) has linear model, where the probability of the final shown that leveraging these can greatly improve logical form L and meaning-derivation tree T accuracy. One of the most powerful representa- conditioned on the sentence S is P (L, T |S) = 1 tions for this is Markov logic, which is a proba- exp (P wifi(L,T,S)). Here Z is the normal- Z i bilistic extension of first-order logic (Richardson ization constant and fi are the feature functions and Domingos, 2006). Markov logic makes it with weights wi. Candidate lexical entries are gen- erated by a domain-specific procedure based on 1The only exception that we are aware of is Ge and the target logical forms. Mooney (2009). possible to compactly specify probability distri- not require domain-specific procedures for gener- butions over complex relational domains, and has ating candidate lexicons, as is often needed by su- been successfully applied to unsupervised corefer- pervised methods. ence resolution (Poon and Domingos, 2008) and The input to our USP system consists of de- other tasks. A Markov logic network (MLN) is pendency trees of training sentences. Compared a set of weighted first-order clauses. Together to phrase-structure syntax, dependency trees are with a set of constants, it defines a Markov net- the more appropriate starting point for semantic work with one node per ground atom and one fea- processing, as they already exhibit much of the ture per ground clause. The weight of a feature relation-argument structure at the lexical level. is the weight of the first-order clause that origi- USP first uses a deterministic procedure to con- nated it. The probability of a state x in such a vert dependency trees into quasi-logical forms network is given by the log-linear model P (x) = (QLFs). The QLFs and their sub-formulas have 1 Z exp (Pi wini(x)), where Z is a normalization natural lambda forms, as will be described later. constant, wi is the weight of the ith formula, and Starting with clusters of lambda forms at the atom ni is the number of satisfied groundings. level, USP recursively builds up clusters of larger lambda forms. The final output is a probability 3 Unsupervised Semantic Parsing with distribution over lambda-form clusters and their Markov Logic compositions, as well as the MAP semantic parses of training sentences. Unsupervised semantic parsing (USP) rests on In the remainder of the section, we describe three key ideas. First, the target predicate and ob- the details of USP. We first present the procedure ject constants, which are pre-specified in super- for generating QLFs from dependency trees. We vised semantic parsing, can be viewed as clusters then introduce their lambda forms and clusters, of syntactic variations of the same meaning, and and show how semantic parsing works in this set- can be learned from data. For example, borders ting. Finally, we present the Markov logic net- represents the next-to relation, and can be viewed work (MLN) used by USP. In the next sections, we as the cluster of different forms for expressing this present efficient algorithms for learning and infer- relation, such as “borders”, “is next to”, “share the ence with this MLN. border with”; Utah represents the state of Utah, and can be viewed as the cluster of “Utah”, “the 3.1 Derivation of Quasi-Logical Forms beehive state”, etc. A dependency tree is a tree where nodes are words Second, the identification and clustering of can- and edges are dependency labels. To derive the didate forms are integrated with the learning for QLF, we convert each node to an unary atom with meaning composition, where forms that are used the predicate being the lemma plus POS tag (be- in composition with the same forms are encour- low, we still use the word for simplicity), and each aged to cluster together, and so are forms that are edge to a binary atom with the predicate being composed of the same sub-forms. This amounts to the dependency label. For example, the node for a novel form of relational clustering, where clus- Utah becomes Utah(n1) and the subject depen- tering is done not just on fixed elements in rela- dency becomes nsubj(n1, n2). Here, the ni are tional tuples, but on arbitrary forms that are built Skolem constants indexed by the nodes. The QLF up recursively. for a sentence is the conjunction of the atoms for Third, while most existing approaches (manual the nodes and edges, e.g., the sentence above will or supervised learning) learn to parse both syn- become borders(n1) ∧ Utah(n2) ∧ Idaho(n3) ∧ tax and semantics, unsupervised semantic pars- nsubj(n1, n2) ∧ dobj(n1, n3). ing starts directly from syntactic analyses and fo- cuses solely on translating them to semantic con- 3.2 Lambda-Form Clusters and Semantic tent. This enables us to leverage advanced syn- Parsing in USP tactic parsers and (indirectly) the available rich re- Given a QLF, a relation or an object is repre- sources for them. More importantly, it separates sented by the conjunction of a subset of the atoms. the complexity in syntactic analysis from the se- For example, the next-to relation is represented mantic one, and makes the latter much easier to by borders(n1) ∧ nsubj(n1, n2) ∧ dobj(n1, n3), perform. In particular, meaning composition does and the states of Utah and Idaho are represented by Utah(n2) and Idaho(n3). The meaning com- forms, and then assigning each form to a cluster position of two sub-formulas is simply their con- or an argument type. The final logical form is de- junction. This allows the maximum flexibility in rived by composing the abstract lambda forms of learning. In particular, lexical entries are no longer the parts using the λ-reduction rule.2 limited to be adjacent words as in Zettlemoyer and Collins (2005), but can be arbitrary fragments in a 3.3 TheUSPMLN dependency tree. Formally, for a QLF Q, a semantic parse L par- For every sub-formula F , we define a corre- titions Q into parts p1, p2, · · · , pn; each part p is sponding lambda form that can be derived by re- assigned to some lambda-form cluster c, and is placing every Skolem constant ni that does not further partitioned into core form f and argument appear in any unary atom in F with a unique forms f1, · · · , fk; each argument form is assigned lambda variable xi. Intuitively, such constants to an argument type a in c. The USP MLN de- represent objects introduced somewhere else (by fines a joint probability distribution over Q and L the unary atoms containing them), and corre- by modeling the distributions over forms and ar- spond to the arguments of the relation repre- guments given the cluster or argument type. sented by F . For example, the lambda form Before presenting the predicates and formu- for borders(n1) ∧ nsubj(n1, n2) ∧ dobj(n1, n3) las in our MLN, we should emphasize that they is λx2λx3. borders(n1) ∧ nsubj(n1, x2) ∧ should not be confused with the atoms and formu- dobj(n1, x3). las in the QLFs, which are represented by reified Conceptually, a lambda-form cluster is a set of constants and variables. semantically interchangeable lambda forms. For To model distributions over lambda forms, example, to express the meaning that Utah bor- we introduce the predicates Form(p, f!) and ders Idaho, we can use any form in the cluster ArgForm(p, i, f!), where p is a part, i is the in- representing the next-to relation (e.g., “borders”, dex of an argument, and f is a QLF subformula. is true iff part has core form , and “shares a border with”), any form in the cluster Form(p, f) p f is true iff the th argument in representing the state of Utah (e.g., “the beehive ArgForm(p, i, f) i p has form .3 The “ ” notation signifies that each state”), and any form in the cluster representing f f! the state of Idaho (e.g., “Idaho”). Conditioned part or argument can have only one form. on the clusters, the choices of individual lambda To model distributions over arguments, we in- forms are independent of each other. troduce three more predicates: ArgType(p, i, a!) signifies that the ith argument of p is assigned to To handle variable number of arguments, we ′ argument type a; Arg(p, i, p ) signifies that the follow Davidsonian semantics and further de- ′ ith argument of p is p ; Number(p, a, n) signifies compose a lambda form into the core form, that there are n arguments of p that are assigned which does not contain any lambda variable to type a. The truth value of Number(p, a, n) is (e.g., borders(n1)), and the argument forms, determined by the ArgType atoms. which contain a single lambda variable (e.g., Unsupervised semantic parsing can be captured x2 nsubj n1 x2 and x3 dobj n1 x3 ). Each λ . ( , ) λ . ( , ) by four formulas: lambda-form cluster may contain some number of argument types, which cluster distinct forms of the p ∈ +c ∧ Form(p, +f) same argument in a relation. For example, in Stan- ArgType(p, i, +a) ∧ ArgForm(p, i, +f) ford dependencies, the object of a verb uses the de- Arg(p, i, p′) ∧ ArgType(p, i, +a) ∧ p′ ∈ +c′ pendency dobj in the active voice, but nsubjpass Number(p, +a, +n) in passive. Lambda-form clusters abstract away syntactic All free variables are implicitly universally quan- variations of the same meaning. Given an in- tified. The “+” notation signifies that the MLN stance of cluster T with arguments of argument contains an instance of the formula, with a sep- types A1, · · · , Ak, its abstract lambda form is given arate weight, for each value combination of the k by λx1 · · · λxk.T(n) ∧ Vi=1 Ai(n, xi). 2Currently, we do not handle quantifier scoping or se- Given a sentence and its QLF, semantic pars- mantics for specific closed-class words such as determiners. These will be pursued in future work. ing amounts to partitioning the atoms in the QLF, 3There are hard constraints to guarantee that these assign- dividing each part into core form and argument ments form a legal partition. We omit them for simplicity. variables with a plus sign. The first formula mod- Algorithm 1 USP-Parse(MLN, QLF) els the mixture of core forms given the cluster, and Form parts for individual atoms in QLF and as- the others model the mixtures of argument forms, sign each to its most probable cluster argument clusters, and argument numbers, respec- repeat tively, given the argument type. for all parts p in the current partition do To encourage clustering and avoid overfitting, for all partitions that are λ-reducible from we impose an exponential prior with weight α on p and feasible do the number of parameters.4 Find the most probable cluster and argu- The MLN above has one problem: it often ment type assignments for the new part clusters expressions that are semantically oppo- and its arguments site. For example, it clusters antonyms like “el- end for derly/young”, “mature/immature”. This issue also end for occurs in other semantic-processing systems (e.g., Change to the new partition and assignments DIRT). In general, this is a difficult open problem with the highest gain in probability that only recently has started to receive some at- until none of these improve the probability tention (Mohammad et al., 2008). Resolving this return current partition and assignments is not the focus of this paper, but we describe a general heuristic for fixing this problem. We ob- called feasible if the core form of the new part is serve that the problem stems from the lack of nega- contained in some cluster. For example, consider tive features for discovering meanings in contrast. the QLF of “Utah borders Idaho” and assume In natural languages, parallel structures like con- that the current partition is λx2x3.borders(n1) ∧ junctions are one such feature.5 We thus introduce nsubj(n1, x2) ∧ dobj(n1, x3), Utah(n2), an exponential prior with weight β on the number Idaho(n3). Then the following partition is of conjunctions where the two conjunctive parts λ-reducible from the first part in the above are assigned to the same cluster. To detect con- partition: λx3.borders(n1) ∧ nsubj(n1, n2) ∧ junction, we simply used the Stanford dependen- Utah(n2) ∧ dobj(n1, x3), Idaho(n3). Whether cies that begin with “conj”. This proves very ef- this new partition is feasible depends on whether fective, fixing the majority of the errors in our ex- the core form of the new part λx3.borders(n1) ∧ periments. nsubj(n1, n2) ∧ Utah(n2) ∧ dobj(n1, x3) (i.e. borders(n1) ∧ nsubj(n1, n2) ∧ Utah(n2)) is 4 Inference contained in some lambda-form cluster. Given a sentence and the quasi-logical form Q Algorithm 1 gives pseudo-code for our algo- derived from its dependency tree, the conditional rithm. Given part p, finding partitions that are λ- probability for a semantic parse L is given by reducible from p and feasible can be done in time P r(L|Q) ∝ exp (P wini(L,Q)). The MAP se- i O(ST ), where S is the size of the clustering in mantic parse is simply arg maxL P wini(L,Q). i the number of core forms and T is the maximum Enumerating all L’s is intractable. It is also un- number of atoms in a core form. We omit the proof necessary, since most partitions will result in parts here but point out that it is related to the unordered whose lambda forms have no cluster they can be subtree matching problem which can be solved in assigned to. Instead, USP uses a greedy algorithm linear time (Kilpelainen, 1992). Inverted indexes to search for the MAP parse. First we introduce (e.g., from p to eligible core forms) are used to fur- some definitions: a partition is called λ-reducible ther improve the efficiency. For a new part p and from p if it can be obtained from the current parti- a cluster that contains p’s core form, there are km tion by recursively λ-reducing the part containing ways of assigning p’s arguments to the argu- p with one of its arguments; such a partition is m k ment types of the cluster. For larger k and m, this 4Excluding weights of ∞ or −∞, which signify hard con- is very expensive. We therefore approximate it by straints. 5For example, in the sentence “IL-2 inhibits X in A and assigning each argument to the best type, indepen- induces Y in B”, the conjunction between “inhibits” and “in- dent of other arguments. duces” suggests that they are different. If “inhibits” and “in- duces” are indeed synonyms, such a sentence will sound awk- ward and would probably be rephrased as “IL-2 inhibits X in This algorithm is very efficient, and is used re- A and Y in B”. peatedly in learning. 5 Learning Algorithm 2 USP-Learn(MLN, QLFs) Create initial clusters and semantic parses The learning problem in USP is to maximize the Merge clusters with the same core form log-likelihood of observing the QLFs obtained Agenda ← ∅ from the dependency trees, denoted by Q, sum- repeat ming out the unobserved semantic parses: for all candidate operations O do

Lθ(Q) = log Pθ(Q) Score O by log-likelihood improvement if score is above a threshold then = log PL Pθ(Q,L) Add O to agenda Here, L are the semantic parses, θ are the MLN pa- end if rameters, and Pθ(Q,L) are the completion likeli- end for hoods. A serious challenge in unsupervised learn- Execute the highest scoring operation O∗ in ing is the identifiability problem (i.e., the opti- the agenda mal parameters are not unique) (Liang and Klein, Regenerate MAP parses for affected QLFs 2008). This problem is particularly severe for and update agenda and candidate operations log-linear models with hard constraints, which are until agenda is empty common in MLNs. For example, in our USP return the MLN with learned weights and the MLN, conditioned on the fact that p ∈ c, there is semantic parses exactly one value of f that can satisfy the formula p ∈ c ∧ Form(p, f), and if we add some constant Another major challenge in USP learning is the number to the weights of p ∈ c ∧ Form(p, f) for summation in the likelihood, which is over all pos- all f, the probability distribution stays the same.6 sible semantic parses for a given dependency tree. The learner can be easily confused by the infinitely Even an efficient sampler like MC-SAT (Poon and many optima, especially in the early stages. To Domingos, 2006), as used in Poon & Domingos address this problem, we impose local normaliza- (2008), would have a hard time generating accu- tion constraints on specific groups of formulas that rate estimates within a reasonable amount of time. are mutually exclusive and exhaustive, i.e., in each On the other hand, as already noted in the previous k wi group, we require that Pi=1 e = 1, where wi section, the lambda-form distribution is generally are the weights of formulas in the group. Group- sparse. Large lambda-forms are rare, as they cor- ing is done in such a way as to encourage the respond to complex expressions that are often de- intended mixture behaviors. Specifically, for the composable into smaller ones. Moreover, while rule p ∈ +c ∧ Form(p, +f), all instances given ambiguities are present at the lexical level, they a fixed c form a group; for each of the remain- quickly diminish when more words are present. ing three rules, all instances given a fixed a form a Therefore, a lambda form can usually only belong group. Notice that with these constraints the com- to a small number of clusters, if not a unique one. pletion likelihood P (Q,L) can be computed in We thus simplify the problem by approximating closed form for any L. In particular, each formula the sum with the mode, and search instead for the group contributes a term equal to the weight of the L and θ that maximize log Pθ(Q,L). Since the op- currently satisfied formula. In addition, the opti- timal weights and log-likelihood can be derived in mal weights that maximize the completion likeli- closed form given the semantic parses L, we sim- hood P (Q,L) can be derived in closed form us- ply search over semantic parses, evaluating them ing empirical relative frequencies. E.g., the opti- using log-likelihood. mal weight of p ∈ c ∧ Form(p, f) is log(nc,f/nc), Algorithm 2 gives pseudo-code for our algo- where nc,f is the number of parts p that satisfy rithm. The input consists of an MLN without both p ∈ c and Form(p, f), and nc is the number weights and the QLFs for the training sentences. 7 of parts p that satisfy p ∈ c. We leverage this fact Two operators are used for updating semantic for efficient learning in USP. parses. The first is to merge two clusters, denoted 6Regularizations, e.g., Gaussian priors on weights, allevi- by MERGE(C1, C2) for clusters C1, C2, which does ate this problem by penalizing large weights, but it remains the following: true that weights within a short range are roughly equivalent. 7 c wc,f = 1 To see this, notice that for a given , the total contribu- and there is the local normalization constraint Pf e . tion to the completion likelihood from all groundings in its The optimal weights wc,f are easily derived by solving this = formula group is Pf wc,fnc,f. In addition, Pf nc,f nc constrained optimization problem. 1. Create a new cluster C and add all core forms above a threshold.9 The operation with the highest in C1, C2 to C; score is executed, and the parameters are updated 2. Create new argument types for C by merg- with the new optimal values. The QLFs which contain an affected part are reparsed, and opera- ing those in C1, C2 so as to maximize the log- likelihood; tions in the agenda whose score might be affected are re-evaluated. These changes are done very ef- 3. Remove C1, C2. ficiently using inverted indexes. We omit the de- Here, merging two argument types refers to pool- tails here due to space limitations. USP terminates ing their argument forms to create a new argument when the agenda is empty, and outputs the current type. Enumerating all possible ways of creating MLN parameters and semantic parses. new argument types is intractable. USP approxi- USP learning uses the same optimization objec- mates it by considering one type at a time and ei- tive as hard EM, and is also guaranteed to find a ther creating a new type for it or merging it to types local optimum since at each step it improves the already considered, whichever maximizes the log- log-likelihood. It differs from EM in directly opti- likelihood. The types are considered in decreasing mizing the likelihood instead of a lower bound. order of their numbers of occurrences so that more information is available for each decision. MERGE 6 Experiments clusters syntactically different expressions whose meanings appear to be the same according to the 6.1 Task model. Evaluating unsupervised semantic parsers is dif- The second operator is to create a new clus- ficult, because there is no predefined formal lan- ter by composing two existing ones, denoted by guage or gold logical forms for the input sen- COMPOSE(CR, CA), which does the following: tences. Thus the best way to test them is by using 1. Create a new cluster C; them for the ultimate goal: answering questions 2. Find all parts r ∈ CR, a ∈ CA such that a is based on the input corpus. In this paper, we ap- an argument of r, compose them to r(a) by plied USP to extracting knowledge from biomedi- λ-reduction and add the new part to C; cal abstracts and evaluated its performance in an- swering a set of questions that simulate the in- 3. Create new argument types for from the ar- C formation needs of biomedical researchers. We gument forms of so as to maximize the r(a) used the GENIA dataset (Kim et al., 2003) as log-likelihood. the source for . It contains COMPOSE creates clusters of large lambda-forms 1999 PubMed abstracts and marks all mentions if they tend to be composed of the same sub- of biomedical entities according to the GENIA forms (e.g., the lambda form for “is next to”). ontology, such as cell, protein, and DNA. As a These lambda-forms may later be merged with first approximation to the questions a biomedi- other clusters (e.g., borders). cal researcher might ask, we generated a set of At learning time, USP maintains an agenda that two thousand questions on relations between enti- contains operations that have been evaluated and ties. Sample questions are: “What regulates MIP- are pending execution. During initialization, USP 1alpha?”, “What does anti-STAT 1 inhibit?”. To forms a part and creates a new cluster for each simulate the real information need, we sample the unary atom u(n). It also assigns binary atoms of relations from the 100 most frequently used verbs the form b(n, n′) to the part as argument forms (excluding the auxiliary verbs be, have, and do), and creates a new argument type for each. This and sample the entities from those annotated in forms the initial clustering and semantic parses. GENIA, both according to their numbers of occur- USP then merges clusters with the same core form rences. We evaluated USP by the number of an- (i.e., the same unary predicate) using MERGE.8 At swers it provided and the accuracy as determined 10 each step, USP evaluates the candidate operations by manual labeling. and adds them to the agenda if the improvement is 9We currently set it to 10 to favor precision and guard 8Word-sense disambiguation can be handled by including against errors due to inexact estimates. a new kind of operator that splits a cluster into subclusters. 10The labels and questions are available at We leave this to future work. http://alchemy.cs.washington.edu/papers/poon09. 6.2 Systems resolves coreferent relation and argument strings. On the GENIA data, using the default parameters, Since USP is the first unsupervised semantic RESOLVER produces only a few trivial relation parser, conducting a meaningful comparison of it clusters and no argument clusters. This is not sur- with other systems is not straightforward. Stan- prising, since RESOLVER assumes high redun- dard question-answering (QA) benchmarks do not dancy in the data, and will discard any strings with provide the most appropriate comparison, be- fewer than 25 extractions. For a fair compari- cause they tend to simultaneously emphasize other son, we also ran RESOLVER using all extractions, aspects not directly related to semantic pars- and manually tuned the parameters based on eye- ing. Moreover, most state-of-the-art QA sys- balling of clustering quality. The best result was tems use supervised learning in their key compo- obtained with 25 rounds of execution and with the nents and/or require domain-specific engineering entity multiple set to 200 (the default is 30). To an- efforts. The closest available system to USP in swer questions, the only difference from TextRun- aims and capabilities is TextRunner (Banko et al., ner is that a question string can match any string 2007), and we compare with it. TextRunner is the in its cluster. As in TextRunner, we report results state-of-the-art system for open-domain informa- for both exact match (RS-EXACT) and substring tion extraction; its goal is to extract knowledge (RS-SUB). from text without using supervised labels. Given DIRT: The DIRT system inputs a path and returns that a central challenge to semantic parsing is re- a set of similar paths. To use DIRT in question solving syntactic variations of the same meaning, answering, we queried it to obtain similar paths we also compare with RESOLVER (Yates and Et- for the relation of the question, and used these zioni, 2009), a state-of-the-art unsupervised sys- paths while matching sentences. We first used tem based on TextRunner for jointly resolving en- MINIPAR (Lin, 1998) to parse input text using tities and relations, and DIRT (Lin and Pantel, the same dependencies as DIRT. To determine a 2001), which resolves paraphrases of binary rela- match, we first check if the sentence contains the tions. Finally, we also compared to an informed question path or one of its DIRT paths. If so, and if baseline based on keyword matching. the available argument slot in the question is con- Keyword: We consider a baseline system based tained in the one in the sentence, it is a match, and on keyword matching. The question substring we return the other argument slot from the sen- containing the verb and the available argument is tence if it is present. Ideally, a fair comparison will directly matched with the input text, ignoring case require running DIRT on the GENIA text, but we and morphology. We consider two ways to derive were not able to obtain the source code. We thus the answer given a match. The first one (KW) sim- resorted to using the latest DIRT database released ply returns the rest of sentence on the other side of by the author, which contains paths extracted from the verb. The second one (KW-SYN) is informed a large corpus with more than 1GB of text. This by syntax: the answer is extracted from the subject puts DIRT in a very advantageous position com- or object of the verb, depending on the question. If pared with other systems. In our experiments, we the verb does not contain the expected argument, used the top three similar paths, as including more the sentence is ignored. results in very low precision. TextRunner: TextRunner inputs text and outputs USP: We built a system for knowledge extrac- relational triples in the form (R, A1, A2), where tion and on top of USP. It R is the relation string, and A1, A2 the argument generated Stanford dependencies (de Marneffe et strings. Given a triple and a question, we first al., 2006) from the input text using the Stan- match their relation strings, and then match the ford parser, and then fed these to USP-Learn11, strings for the argument that is present in the ques- which produced an MLN with learned weights tion. If both match, we return the other argument and the MAP semantic parses of the input sen- string in the triple as an answer. We report results tences. These MAP parses formed our knowledge when exact match is used (TR-EXACT), or when base (KB). To answer questions, the system first the triple string can contain the question one as a parses the questions12 using USP-Parse with the substring (TR-SUB). RESOLVER: RESOLVER (Yates and Etzioni, 11α and β are set to −5 and −10. 2009) inputs TextRunner triples and collectively 12The question slot is replaced by a dummy word. syntactic difference between active and passive Table 1: Comparison of question answering re- voices. It successfully identifies many distinct ar- sults on the GENIA dataset. gument forms that mean the same (e.g., “X stimu- # Total # Correct Accuracy lates Y” ≈ “Y is stimulated with X”, “expression KW 150 67 45% of X” ≈ “X expression”). It also resolves many KW-SYN 87 67 77% nouns correctly and forms meaningful groups of TR-EXACT 29 23 79% relations. Here are some sample clusters in core TR-SUB 152 81 53% forms: RS-EXACT 53 24 45% {investigate, examine, evaluate, analyze, study, RS-SUB 196 81 41% assay} DIRT 159 94 59% {diminish, reduce, decrease, attenuate} USP 334 295 88% {synthesis, production, secretion, release} {dramatically, substantially, significantly} learned MLN, and then matches the question parse An example question-answer pair, together with to parses in the KB by testing subsumption (i.e., a the source sentence, is shown below: question parse matches a KB one iff the former is Q: What does IL-13 enhance? subsumed by the latter). When a match occurs, our A: The 12-lipoxygenase activity of murine system then looks for arguments of type in accor- macrophages. dance with the question. For example, if the ques- Sentence: The data presented here indicate tion is “What regulates MIP-1alpha?”, it searches that (1) the 12-lipoxygenase activity of murine for the argument type of the relation that contains macrophages is upregulated in vitro and in vivo the argument form “nsubj” for subject. If such an by IL-4 and/or IL-13, . . . argument exists for the relation part, it will be re- turned as the answer. 7 Conclusion

6.3 Results This paper introduces the first unsupervised ap- proach to learning semantic parsers. Our USP Table 1 shows the results for all systems. USP system is based on Markov logic, and recursively extracted the highest number of answers, almost clusters expressions to abstract away syntactic doubling that of the second highest (RS-SUB). variations of the same meaning. We have suc- It obtained the highest accuracy at 88%, and cessfully applied USP to extracting a knowledge the number of correct answers it extracted is base from biomedical text and answering ques- three times that of the second highest system. tions based on it. The informed baseline (KW-SYN) did surpris- Directions for future work include: better han- ingly well compared to systems other than USP, in dling of antonyms, subsumption relations among terms of accuracy and number of correct answers. expressions, quantifier scoping, more complex TextRunner achieved good accuracy when exact lambda forms, etc.; use of context and discourse match is used (TR-EXACT), but only obtained a to aid expression clustering and semantic parsing; fraction of the answers compared to USP. With more efficient learning and inference; application substring match, its recall substantially improved, to larger corpora; etc. but precision dropped more than 20 points. RE- SOLVER improved the number of extracted an- 8 Acknowledgements swers by sanctioning more matches based on the clusters it generated. However, most of those ad- We thank the anonymous reviewers for their comments. This ditional answers are incorrect due to wrong clus- research was partly funded by ARO grant W911NF-08-1- tering. DIRT obtained the second highest number 0242, DARPA contracts FA8750-05-2-0283, FA8750-07-D- of correct answers, but its precision is quite low 0185, HR0011-06-C-0025, HR0011-07-C-0060 and NBCH- because the similar paths contain many errors. D030010, NSF grants IIS-0534881 and IIS-0803481, and ONR grant N00014-08-1-0670. The views and conclusions 6.4 Qualitative Analysis contained in this document are those of the authors and Manual inspection shows that USP is able to re- should not be interpreted as necessarily representing the offi- solve many nontrivial syntactic variations with- cial policies, either expressed or implied, of ARO, DARPA, out user supervision. It consistently resolves the NSF, ONR, or the United States Government. References Dekang Lin. 1998. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on G. Bakir, T. Hofmann, B. B. Sch¨olkopf, A. Smola, the Evaluation of Parsing Systems, Granada, Spain. B. Taskar, S. Vishwanathan, and (eds.). 2007. Pre- ELRA. dicting Structured Data. MIT Press, Cambridge, MA. Saif Mohammad, Bonnie Dorr, and Graeme Hirst. 2008. Computing word-pair antonymy. In Proceed- Michele Banko, Michael J. Cafarella, Stephen Soder- ings of the 2008 Conference on Empirical Methods land, Matt Broadhead, and Oren Etzioni. 2007. in Natural Language Processing, pages 982–991, Open information extraction from the web. In Pro- Honolulu, HI. ACL. ceedings of the Twentieth International Joint Con- ference on Artificial Intelligence, pages 2670–2676, Raymond J. Mooney. 2007. Learning for semantic Hyderabad, India. AAAI Press. parsing. In Proceedings of the Eighth International Conference on Computational Linguistics and Intel- Xavier Carreras and Luis Marquez. 2004. Introduction ligent Text Processing, pages 311–324, Mexico City, to the CoNLL-2004 shared task: Semantic role la- Mexico. Springer. beling. In Proceedings of the Eighth Conference on Computational Natural Language Learning, pages Hoifung Poon and Pedro Domingos. 2006. Sound and 89–97, Boston, MA. ACL. efficient inference with probabilistic and determin- istic dependencies. In Proceedings of the Twenty Marie-Catherine de Marneffe, Bill MacCartney, and First National Conference on Artificial Intelligence, Christopher D. Manning. 2006. Generating typed pages 458–463, Boston, MA. AAAI Press. dependency parses from phrase structure parses. In Hoifung Poon and Pedro Domingos. 2008. Joint unsu- Proceedings of the Fifth International Conference pervised coreference resolution with Markov logic. on Language Resources and Evaluation, pages 449– In Proceedings of the 2008 Conference on Empiri- 454, Genoa, Italy. ELRA. cal Methods in Natural Language Processing, pages 649–658, Honolulu, HI. ACL. Ruifang Ge and Raymond J. Mooney. 2009. Learning a compositional semantic parser using an existing Matt Richardson and Pedro Domingos. 2006. Markov syntactic parser. In Proceedings of the Forty Sev- logic networks. Machine Learning, 62:107–136. enth Annual Meeting of the Association for Compu- tational Linguistics, Singapore. ACL. Alexander Yates and Oren Etzioni. 2009. Unsuper- vised methods for determining object and relation Lise Getoor and Ben Taskar, editors. 2007. Introduc- synonyms on the web. Journal of Artificial Intelli- tion to Statistical Relational Learning. MIT Press, gence Research, 34:255–296. Cambridge, MA. Luke S. Zettlemoyer and Michael Collins. 2005. Pekka Kilpelainen. 1992. Tree Matching Prob- Learning to map sentences to logical form: Struc- lems with Applications to Structured Text databases. tured classification with probabilistic categorial Ph.D. Thesis, Department of Computer Science, grammers. In Proceedings of the Twenty First University of Helsinki. Conference on Uncertainty in Artificial Intelligence, pages 658–666, Edinburgh, Scotland. AUAI Press. Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. GENIA corpus - a seman- Luke S. Zettlemoyer and Michael Collins. 2007. On- tically annotated corpus for bio-textmining. Bioin- line learning of relaxed CCG grammars for parsing formatics, 19:180–82. to logical form. In Proceedings of the Joint Con- ference on Empirical Methods in Natural Language Stanley Kok and Pedro Domingos. 2008. Extract- Processing and Computational Natural Language ing semantic networks from text via relational clus- Learning, pages 878–887, Prague, Czech. ACL. tering. In Proceedings of the Nineteenth European Conference on Machine Learning, pages 624–639, Antwerp, Belgium. Springer.

Percy Liang and Dan Klein. 2008. Analyzing the er- rors of unsupervised learning. In Proceedings of the Forty Sixth Annual Meeting of the Association for Computational Linguistics, pages 879–887, Colum- bus, OH. ACL.

Dekang Lin and Patrick Pantel. 2001. DIRT - dis- covery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, pages 323–328, San Francisco, CA. ACM Press.