
It’s a Contradiction—No, it’s Not: A Case Study using Functional Relations

Alan Ritter, Doug Downey, Stephen Soderland and Oren Etzioni
Turing Center, Department of Computer Science and Engineering
University of Washington, Box 352350, Seattle, WA 98195, USA
{aritter,ddowney,soderlan,etzioni}@cs.washington.edu

Abstract

Contradiction Detection (CD) in text is a difficult NLP task. We investigate CD over functions (e.g., BornIn(Person)=Place), and present a domain-independent algorithm that automatically discovers phrases denoting functions with high precision. Previous work on CD has investigated hand-chosen sentence pairs. In contrast, we automatically harvested from the Web pairs of sentences that appear contradictory, but were surprised to find that most pairs are in fact consistent. For example, "Mozart was born in Salzburg" does not contradict "Mozart was born in Austria" despite the functional nature of the phrase "was born in". We show that background knowledge about meronyms (e.g., Salzburg is in Austria), synonyms, functions, and more is essential for success in the CD task.

1 Introduction and Motivation

Detecting contradictory statements is an important and challenging NLP task with a wide range of potential applications including analysis of political discourse, of scientific literature, and more (de Marneffe et al., 2008; Condoravdi et al., 2003; Harabagiu et al., 2006). De Marneffe et al. present a model of CD that defines the task, analyzes different types of contradictions, and reports on a CD system. They report 23% precision and 19% recall at detecting contradictions in the RTE-3 data set (Voorhees, 2008). Although RTE-3 contains a wide variety of contradictions, it does not reflect the prevalence of seeming contradictions and the paucity of genuine contradictions, which we have found in our corpus.

1.1 Contradictions and World Knowledge

Our paper is motivated in part by de Marneffe et al.'s work, but with some important differences. First, we introduce a simple logical foundation for the CD task, which suggests that extensive world knowledge is essential for building a domain-independent CD system. Second, we automatically generate a large corpus of apparent contradictions found in arbitrary Web text. We show that most of these apparent contradictions are actually consistent statements due to meronyms (Alan Turing was born in London and in England), synonyms (George Bush is married to both Mrs. Bush and Laura Bush), hypernyms (Mozart died of both renal failure and kidney disease), and ambiguity (one John Smith was born in 1997 and a different John Smith in 1883). Next, we show how background knowledge enables a CD system to discard seeming contradictions and focus on genuine ones.

De Marneffe et al. introduced a typology of contradiction in text, but focused primarily on contradictions that can be detected from linguistic evidence (e.g., negation, antonymy, and structural or lexical disagreements). We extend their analysis to a class of contradictions that can only be detected utilizing background knowledge. Consider for example the following sentences:

1) "Mozart was born in Salzburg."
2) "Mozart was born in Vienna."
3) "Mozart visited Salzburg."
4) "Mozart visited Vienna."

Sentences 1 & 2 are contradictory, but 3 & 4 are not. Why is that? The distinction is not syntactic. Rather, sentences 1 and 2 are contradictory because

Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 11–20, Honolulu, October 2008. © 2008 Association for Computational Linguistics

the relation expressed by the phrase "was born in" can be characterized here as a function from people's names to their unique birthplaces. In contrast, "visited" does not denote a functional relation.¹

We cannot assume that a CD system knows, in advance, all the functional relations that might appear in a corpus. Thus, a central challenge for a function-based CD system is to determine which relations are functional based on a corpus. Intuitively, we might expect that "functional phrases" such as "was born in" would typically map person names to unique place names, making function detection easy. But, in fact, function detection is surprisingly difficult because ambiguity (e.g., John Smith), common nouns (e.g., "dad" or "mom"), definite descriptions (e.g., "the president"), and other linguistic phenomena can mask functions in text. For example, the two sentences "John Smith was born in 1997." and "John Smith was born in 1883." can be viewed as either evidence that "was born in" does not denote a function or, alternatively, that "John Smith" is ambiguous.

1.2 A CD System Based on Functions

We report on the AUCONTRAIRE CD system, which addresses each of the above challenges. First, AUCONTRAIRE identifies "functional phrases" statistically (Section 3). Second, AUCONTRAIRE uses these phrases to automatically create a large corpus of apparent contradictions (Section 4.2). Finally, AUCONTRAIRE sifts through this corpus to find genuine contradictions using knowledge about synonymy, meronymy, argument types, and ambiguity (Section 4.3).

Instead of analyzing sentences directly, AUCONTRAIRE relies on the TEXTRUNNER Open Information Extraction system (Banko et al., 2007; Banko and Etzioni, 2008) to map each sentence to one or more tuples that represent the entities in the sentences and the relationships between them (e.g., was born in(Mozart, Salzburg)). Using extracted tuples greatly simplifies the CD task, because numerous syntactic problems (e.g., anaphora, relative clauses) and semantic challenges (e.g., quantification, counterfactuals, temporal qualification) are delegated to TEXTRUNNER or simply ignored. Nevertheless, extracted tuples are a convenient approximation of sentence content, which enables us to focus on function detection and function-based CD.

Our contributions are the following:

• We present a novel model of the Contradiction Detection (CD) task, which offers a simple logical foundation for the task and emphasizes the central role of background knowledge.

• We introduce and evaluate a new EM-style algorithm for detecting whether phrases denote functional relations and whether nouns (e.g., "dad") are ambiguous, which enables a CD system to identify functions in arbitrary domains.

• We automatically generate a corpus of seeming contradictions from Web text, and report on a set of experiments over this corpus, which provide a baseline for future work on statistical function identification and CD.²

2 A Logical Foundation for CD

On what basis can a CD system conclude that two statements T and H are contradictory? Logically, contradiction holds when T ⊨ ¬H. As de Marneffe et al. point out, this occurs when T and H contain antonyms, negation, or other lexical elements that suggest that T and H are directly contradictory. But other types of contradictions can only be detected with the help of a body of background knowledge K: in these cases, T and H alone are mutually consistent. That is,

  T ⊭ ¬H ∧ H ⊭ ¬T

A contradiction between T and H arises only in the context of K. That is:

  ((K ∧ T) ⊨ ¬H) ∨ ((K ∧ H) ⊨ ¬T)

Consider the example of Mozart's birthplace in the introduction. To detect a contradiction, a CD system must know that A) "Mozart" refers to the same entity in both sentences, B) "was born in" denotes a functional relation, and C) Vienna and Salzburg are inconsistent locations.

¹ Although we focus on function-based CD in our case study, we believe that our observations apply to other types of CD as well.
² The corpus is available at http://www.cs.washington.edu/research/aucontraire/
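The role of the knowledge base K above can be illustrated with a toy checker. This is a sketch, not the authors' implementation: the knowledge tables (FUNCTIONAL, SYNONYMS, MERONYMS) and function names are illustrative assumptions populated with examples from the text.

```python
# Sketch: contradiction detection for functional relations, where the
# background knowledge K supplies synonym and meronym (part-of) facts.

FUNCTIONAL = {"was born in"}                       # R maps each x to a unique y
SYNONYMS = {("renal failure", "kidney failure")}   # co-referring y values
MERONYMS = {("Salzburg", "Austria")}               # (part, whole) pairs

def consistent(y1, y2):
    """y1 and y2 do not conflict if K says they co-refer or nest (part-of)."""
    if y1 == y2:
        return True
    if (y1, y2) in SYNONYMS or (y2, y1) in SYNONYMS:
        return True
    return (y1, y2) in MERONYMS or (y2, y1) in MERONYMS

def contradicts(r, x1, y1, x2, y2):
    """R(x, y1) and R(x, y2) contradict only if R is functional, the x
    arguments co-refer, and K cannot reconcile y1 with y2."""
    return r in FUNCTIONAL and x1 == x2 and not consistent(y1, y2)

# "Mozart was born in Salzburg" vs. "Mozart was born in Vienna": genuine.
assert contradicts("was born in", "Mozart", "Salzburg", "Mozart", "Vienna")
# "... born in Salzburg" vs. "... born in Austria": seeming only (meronym).
assert not contradicts("was born in", "Mozart", "Salzburg", "Mozart", "Austria")
# "visited" is not functional, so no contradiction arises.
assert not contradicts("visited", "Mozart", "Salzburg", "Mozart", "Vienna")
```

The point of the sketch is that the last two checks cannot be made from the sentence pairs alone; they require the facts stored in K.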

Of course, world knowledge, and reasoning about text, are often uncertain, which leads us to associate probabilities with a CD system's conclusions. Nevertheless, the knowledge base K is essential for CD. We now turn to a probabilistic model that helps us simultaneously estimate the functionality of relations (B in the above example) and ambiguity of argument values (A above). Section 4 describes the remaining components of AUCONTRAIRE.

3 Detecting Functionality and Ambiguity

This section introduces a formal model for computing the probability that a phrase denotes a function based on a set of extracted tuples. An extracted tuple takes the form R(x, y) where (roughly) x is the subject of a sentence, y is the object, and R is a phrase denoting the relationship between them. If the relation denoted by R is functional, then typically the object y is a function of the subject x. Thus, our discussion focuses on this possibility, though the analysis is easily extended to the symmetric case.

Logically, a relation R is functional in a variable x if it maps it to a unique value y: ∀x, y1, y2 : R(x, y1) ∧ R(x, y2) ⇒ y1 = y2. Thus, given a large random sample of ground instances of R, we could detect with high confidence whether R is functional. In text, the situation is far more complex due to ambiguity, polysemy, synonymy, and other linguistic phenomena. Deciding whether R is functional becomes a probabilistic assessment based on aggregated textual evidence.

The main evidence that a relation R(x, y) is functional comes from the distribution of y values for a given x value. If R denotes a function and x is unambiguous, then we expect the extractions to be predominantly a single y value, with a few outliers due to noise. We aggregate the evidence that R is locally functional for a particular x value to assess whether R is globally functional for all x.

We refer to a set of extractions with the same relation R and argument x as a contradiction set R(x, ·). Figure 1 shows three example contradiction sets. Each example illustrates a situation commonly found in our data. Example A in Figure 1 shows strong evidence for a functional relation. 66 out of 70 TEXTRUNNER extractions for was born in(Mozart, PLACE) have the same y value. An ambiguous x argument, however, can make a functional relation appear non-functional. Example B depicts a distribution of y values that appears less functional, due to the fact that "John Adams" refers to multiple, distinct real-world individuals with that name. Finally, example C exhibits evidence for a non-functional relation.

A. was born in(Mozart, PLACE): Salzburg(66), Germany(3), Vienna(1)
B. was born in(John Adams, PLACE): Braintree(12), Quincy(10), Worcester(8)
C. lived in(Mozart, PLACE): Vienna(20), Prague(13), Salzburg(5)

Figure 1: Functional relations such as example A have a different distribution of y values than non-functional relations such as C. However, an ambiguous x argument, as in B, can make a functional relation appear non-functional.

3.1 Formal Model of Functions in Text

To decide whether R is functional in x for all x, we first consider how to detect whether R is locally functional for a particular value of x. The local functionality of R with respect to x is the probability that R is functional, estimated solely on evidence from the distribution of y values in a contradiction set R(x, ·). To decide the probability that R is a function, we define global functionality as the average local functionality score for each x, weighted by the probability that x is unambiguous. Below, we outline an EM-style algorithm that alternately estimates the probability that R is functional and the probability that x is ambiguous.

Let R*_x indicate the event that the relation R is locally functional for the argument x, and that x is locally unambiguous for R. Also, let D indicate the set of observed tuples, and define D_R(x,·) as the multi-set containing the frequencies for extractions of the form R(x, ·). For example, the distribution of extractions from Figure 1 for example A is D_was born in(Mozart,·) = {66, 3, 1}.

Let θ^f_R be the probability that R(x, ·) is locally functional for a random x, and let Θ^f be the vector of these parameters across all relations R. Likewise, θ^u_x represents the probability that x is locally unambiguous for a random R, and Θ^u the vector for all x.
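The construction of contradiction sets and their frequency multi-sets can be sketched as follows; the tuples are illustrative toy data whose counts match example A of Figure 1, not actual TEXTRUNNER output:

```python
from collections import Counter, defaultdict

# Sketch: group TextRunner-style tuples R(x, y) into contradiction sets
# R(x, .) and summarize each set by its frequency multi-set D_R(x,.).
tuples = ([("was born in", "Mozart", "Salzburg")] * 66
          + [("was born in", "Mozart", "Germany")] * 3
          + [("was born in", "Mozart", "Vienna")])

sets = defaultdict(Counter)
for r, x, y in tuples:
    sets[(r, x)][y] += 1

# D_was born in(Mozart,.) = {66, 3, 1}
D = sorted(sets[("was born in", "Mozart")].values(), reverse=True)
print(D)  # [66, 3, 1]

# Summary statistics used by the model in Section 3.1:
k, n = max(D), sum(D)   # k: count of the most frequent y; n: total extractions
print(k, n)  # 66 70
```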

We wish to determine the maximum a posteriori (MAP) functionality and ambiguity parameters given the observed data D, that is, argmax_{Θ^f, Θ^u} P(Θ^f, Θ^u | D). By Bayes Rule:

  P(Θ^f, Θ^u | D) = P(D | Θ^f, Θ^u) P(Θ^f, Θ^u) / P(D)    (1)

We outline a generative model for the data, P(D | Θ^f, Θ^u). Let us assume that the event R*_x depends only on θ^f_R and θ^u_x, and further assume that given these two parameters, local ambiguity and local functionality are conditionally independent. We obtain the following expression for the probability of R*_x given the parameters:

  P(R*_x | Θ^f, Θ^u) = θ^f_R θ^u_x

We assume each set of data D_R(x,·) is generated independently of all other data and parameters, given R*_x. From this and the above we have:

  P(D | Θ^f, Θ^u) = ∏_{R,x} [ P(D_R(x,·) | R*_x) θ^f_R θ^u_x + P(D_R(x,·) | ¬R*_x) (1 − θ^f_R θ^u_x) ]    (2)

These independence assumptions allow us to express P(D | Θ^f, Θ^u) in terms of distributions over D_R(x,·) given whether or not R*_x holds. We use the URNS model as described in (Downey et al., 2005) to estimate these probabilities based on binomial distributions. In the single-urn URNS model that we utilize, the extraction process is modeled as draws of labeled balls from an urn, where the labels are either correct extractions or errors, and different labels can be repeated on varying numbers of balls in the urn.

Let k = max D_R(x,·), and let n = Σ D_R(x,·); we will approximate the distribution over D_R(x,·) in terms of k and n. If R(x, ·) is locally functional and unambiguous, there is exactly one correct extraction label in the urn (potentially repeated multiple times). Because the probability of correctness tends to increase with extraction frequency, we make the simplifying assumption that the most frequently extracted element is correct.³ In this case, k is the number of correct extractions, which by the URNS model has a binomial distribution with parameters n and p, where p is the precision of the extraction process. If R(x, ·) is not locally functional and unambiguous, then we expect k to typically take on smaller values. Empirically, the underlying frequency of the most frequent element in the ¬R*_x case tends to follow a Beta distribution.

³ As this assumption is invalid when there is not a unique maximal element, we default to the prior P(R*_x) in that case.

Under the model, the probability of the evidence given R*_x is:

  P(D_R(x,·) | R*_x) ≈ P(k, n | R*_x) = C(n, k) p^k (1 − p)^{n−k}

And the probability of the evidence given ¬R*_x is:

  P(D_R(x,·) | ¬R*_x) ≈ P(k, n | ¬R*_x)
    = C(n, k) ∫₀¹ [p′^{k+α_f−1} (1 − p′)^{n+β_f−1−k} / B(α_f, β_f)] dp′
    = C(n, k) Γ(n − k + β_f) Γ(α_f + k) / [B(α_f, β_f) Γ(α_f + β_f + n)]    (3)

where n is the sum over D_R(x,·), Γ is the Gamma function and B is the Beta function. α_f and β_f are the parameters of the Beta distribution for the ¬R*_x case. These parameters and the prior distributions are estimated empirically, based on a sample of the data set of relations described in Section 5.1.

3.2 Estimating Functionality and Ambiguity

Substituting Equation 3 into Equation 2 and applying an appropriate prior gives the probability of parameters Θ^f and Θ^u given the observed data D. However, Equation 2 contains a large product of sums—with two independent vectors of coefficients, Θ^f and Θ^u—making it difficult to optimize analytically.

If we knew which arguments were ambiguous, we would ignore them in computing the functionality of a relation. Likewise, if we knew which relations were non-functional, we would ignore them in computing the ambiguity of an argument. Instead, we initialize the Θ^f and Θ^u arrays randomly, and then execute an algorithm similar to Expectation-Maximization (EM) (Dempster et al., 1977) to arrive at a high-probability setting of the parameters.

Note that if Θ^u is fixed, we can compute the expected fraction of locally unambiguous arguments x for which R is locally functional, using D_R(x,·) and

Equation 3. Likewise, for fixed Θ^f, for any given x we can compute the expected fraction of locally functional relations R that are locally unambiguous for x. Specifically, we repeat until convergence:

1. Set θ^f_R = (1/s_R) Σ_x P(R*_x | D_R(x,·)) θ^u_x for all R.

2. Set θ^u_x = (1/s_x) Σ_R P(R*_x | D_R(x,·)) θ^f_R for all x.

In both steps above, the sums are taken over only those x or R for which D_R(x,·) is non-empty. Also, the normalizer s_R = Σ_x θ^u_x and likewise s_x = Σ_R θ^f_R.

As in standard EM, we iteratively update our parameter values based on an expectation computed over the unknown variables. However, we alternately optimize two disjoint sets of parameters (the functionality and ambiguity parameters), rather than just a single set of parameters as in standard EM. Investigating the optimality guarantees and convergence properties of our algorithm is an item of future work. By iteratively setting the parameters to the expectations in steps 1 and 2, we arrive at a good setting of the parameters. Section 5.2 reports on the performance of this algorithm in practice.

4 System Overview

AUCONTRAIRE identifies phrases denoting functional relations and utilizes these to find contradictory assertions in a massive, open-domain corpus of text. AUCONTRAIRE begins by finding extractions of the form R(x, y), and identifies a set of relations R that have a high probability of being functional. Next, AUCONTRAIRE identifies contradiction sets of the form R(x, ·). In practice, most contradiction sets turned out to consist overwhelmingly of seeming contradictions—assertions that do not actually contradict each other for a variety of reasons that we enumerate in Section 4.3. Thus, a major challenge for AUCONTRAIRE is to tease apart which pairs of assertions in R(x, ·) represent genuine contradictions.

Figure 2: AUCONTRAIRE architecture

Here are the main components of AUCONTRAIRE, as illustrated in Figure 2:

Extractor: Create a set of extracted assertions E from a large corpus of Web pages or other documents. Each extraction R(x, y) has a probability p of being correct.

Function Learner: Discover a set of functional relations F from among the relations in E. Assign to each relation in F a probability p_f that it is functional.

Contradiction Detector: Query E for assertions with a relation R in F, and identify sets C of potentially contradictory assertions. Filter out seeming contradictions in C by reasoning about synonymy, meronymy, argument types, and argument ambiguity. Assign to each potential contradiction a probability p_c that it is a genuine contradiction.

4.1 Extracting Factual Assertions

AUCONTRAIRE needs to explore a large set of factual assertions, since genuine contradictions are quite rare (see Section 5). We used a set of extractions E from the Open Information Extraction system TEXTRUNNER (Banko et al., 2007), which was run on a set of 117 million Web pages.

TEXTRUNNER does not require a pre-defined set of relations, but instead uses shallow linguistic analysis and a domain-independent model to identify phrases from the text that serve as relations and phrases that serve as arguments to that relation. TEXTRUNNER creates a set of extractions in a single pass over the Web page collection and provides an index to query the vast set of extractions. Although its extractions are noisy, TEXTRUNNER provides a probability that the extractions are correct, based in part on corroboration of facts from different Web pages (Downey et al., 2005).

4.2 Finding Potential Contradictions

The next step of AUCONTRAIRE is to find contradiction sets in E. We used the methods described in Section 3 to estimate the functionality of the most frequent relations in E. For each relation R that AUCONTRAIRE has judged to be functional, we identify contradiction sets R(x, ·), where a relation R and domain argument x have multiple range arguments y.

4.3 Handling Seeming Contradictions

For a variety of reasons, a pair of extractions R(x, y1) and R(x, y2) may not be actually contradictory. The following is a list of the major sources of false positives—pairs of extractions that are not genuine contradictions—and how they are handled by AUCONTRAIRE. The features indicative of each condition are combined using Logistic Regression, in order to estimate the probability that a given pair, {R(x, y1), R(x, y2)}, is a genuine contradiction.

Ambiguity: As pointed out in Section 3, false contradictions arise when a single x value refers to multiple real-world entities. For example, if the contradiction set born in(John Sutherland, ·) includes birth years of both 1827 and 1878, is one of these a mistake, or do we have a grandfather and grandson with the same name? AUCONTRAIRE computes the probability that an x value is unambiguous as part of its Function Learner (see Section 3). An x value can be identified as ambiguous if its distribution of y values is non-functional for multiple functional relations.

Synonyms: The set of potential contradictions died from(Mozart, ·) may contain assertions that Mozart died from renal failure and that he died from kidney failure. These are distinct values of y, but do not contradict each other, as the two terms are synonyms. AUCONTRAIRE uses a variety of knowledge sources to handle synonyms. WordNet is a reliable source of synonyms, particularly for common nouns, but has limited recall. AUCONTRAIRE also utilizes synonyms generated by RESOLVER (Yates and Etzioni, 2007)—a system that identifies synonyms from TEXTRUNNER extractions. Additionally, AUCONTRAIRE uses edit-distance and token-based string similarity (Cohen et al., 2003) between apparently contradictory values of y to identify synonyms.

Meronyms: For some relations, there is no contradiction when y1 and y2 share a meronym, i.e., a "part of" relation. For example, in the set born in(Mozart, ·) there is no contradiction between the y values "Salzburg" and "Austria", but "Salzburg" conflicts with "Vienna". Although this is only true in cases where y occurs in an upward monotone context (MacCartney and Manning, 2007), in practice genuine contradictions between y-values sharing a meronym relationship are extremely rare. We therefore simply assigned contradictions between meronyms a probability close to zero. We used the Tipster Gazetteer⁴ and WordNet to identify meronyms, both of which have high precision but low coverage.

Argument Typing: Two y values are not contradictory if they are of different argument types. For example, the relation born in can take a date or a location for the y value. While a person can be born in only one year and in only one city, a person can be born in both a year and a city. To avoid such false positives, AUCONTRAIRE uses a simple named-entity tagger⁵ in combination with large dictionaries of person and location names to assign high-level types (person, location, date, other) to each argument. AUCONTRAIRE filters out extractions from a contradiction set that do not have matching argument types.

If a pair of extractions {R(x, y1), R(x, y2)} does not fall into any of the above categories and R is functional, then it is likely that the sentences underlying the extractions are indeed contradictory. We combined the various knowledge sources described above using Logistic Regression, and used 10-fold cross-validation to automatically tune the weights associated with each knowledge source. In addition, the learning algorithm also utilizes the following features:

• Global functionality of the relation, θ^f_R
• Global unambiguity of x, θ^u_x

⁴ http://crl.nmsu.edu/cgi-bin/Tools/CLR/clrcat
⁵ http://search.cpan.org/˜simon/Lingua-EN-NamedEntity-1.1/NamedEntity.pm
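The edit-distance half of the string-similarity synonym filter described above can be sketched as follows. This is an illustration only: the normalization and the 0.8 threshold are assumed choices, not values from the paper.

```python
# Sketch: treat two apparently contradictory y values as likely synonyms
# when a normalized edit-distance similarity is high.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def likely_synonyms(y1: str, y2: str, threshold: float = 0.8) -> bool:
    """Similarity = 1 - distance / max length; threshold is illustrative."""
    sim = 1 - edit_distance(y1.lower(), y2.lower()) / max(len(y1), len(y2))
    return sim >= threshold

print(likely_synonyms("renal failure", "renal failures"))  # True
print(likely_synonyms("Salzburg", "Vienna"))               # False
```

A fuller filter along the lines of Cohen et al. (2003) would combine this with token-based similarity, e.g. overlap of word sets.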

• Local functionality of R(x, ·)
• String similarity (a combination of token-based similarity and edit-distance) between y1 and y2
• The argument types (person, location, date, or other)

The learned model is then used to estimate how likely a potential contradiction {R(x, y1), R(x, y2)} is to be genuine.

5 Experimental Results

We evaluated several aspects of AUCONTRAIRE: its ability to detect functional relations and to detect ambiguous arguments (Section 5.2); its precision and recall in contradiction detection (Section 5.3); and the contribution of AUCONTRAIRE's key knowledge sources (Section 5.4).

5.1 Data Set

To evaluate AUCONTRAIRE we used TEXTRUNNER's extractions from a corpus of 117 million Web pages. We restricted our data set to the 1,000 most frequent relations, in part to keep the experiments tractable and also to ensure sufficient statistical support for identifying functional relations.

We labeled each relation as functional or not, and computed an estimate of the probability that it is functional as described in Section 3.2. Section 5.2 presents the results of the Function Learner on this set of relations. We took the top 2% (20 relations) as F, the set of functional relations in our experiments. Out of these, 75% are indeed functional. Some examples include: was born in, died in, and was founded by.

There were 1.2 million extractions for all thousand relations, and about 20,000 extractions in 6,000 contradiction sets for all relations in F. We hand-tagged 10% of the contradiction sets R(x, ·) where R ∈ F, discarding any sets with over 20 distinct y values, since the x argument for such a set is almost certainly ambiguous. This resulted in a data set of 567 contradiction sets containing a total of 2,564 extractions and 8,844 potentially contradictory pairs of extractions.

We labeled each of these 8,844 pairs as contradictory or not. In each case, we inspected the original sentences, and if the distinction was unclear, consulted the original source Web pages, Wikipedia articles, and Web search engine results. In our data set, genuine contradictions over functional relations are surprisingly rare. We found only 110 genuine contradictions in the hand-tagged sample, only 1.2% of the potential contradiction pairs.

5.2 Detecting Functionality and Ambiguity

We ran AUCONTRAIRE's EM algorithm on the thousand most frequent relations. Performance converged after 5 iterations, resulting in estimates of the probability that each relation is functional and each x argument is unambiguous. We used these probabilities to generate the precision-recall curves shown in Figure 3.

The graph on the left shows results for functionality, while the graph on the right shows precision at finding unambiguous arguments. The solid lines are results after 5 iterations of EM, and the dashed lines are from computing functionality or ambiguity without EM (i.e., assuming uniform values of Θ^u when computing Θ^f and vice versa). The EM algorithm improved results for both functionality and ambiguity, increasing area under the curve (AUC) by 19% for functionality and by 31% for ambiguity.

Of course, the ultimate test of how well AUCONTRAIRE can identify functional relations is how well the Contradiction Detector performs on automatically identified functional relations.

5.3 Detecting Contradictions

We conducted experiments to evaluate how well AUCONTRAIRE distinguishes genuine contradictions from false positives. The bold line in Figure 4 depicts AUCONTRAIRE's performance on the distribution of contradictions and seeming contradictions found in actual Web data. The dashed line shows the performance of AUCONTRAIRE on an artificially "balanced" data set that we constructed to contain 50% genuine contradictions and 50% seeming ones.

Previous research in CD presented results on manually selected data sets with a relatively balanced mix of positive and negative instances. As Figure 4 suggests, this is a much easier problem than CD "in the wild". The data gathered from the Web is badly skewed, containing only 1.2% genuine contradictions.
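The logistic combination of knowledge sources and features described in Section 4.3 can be sketched as below. The feature values, weights, and bias are hypothetical; in the paper the weights are tuned by 10-fold cross-validation.

```python
import math

# Sketch: score P(genuine contradiction) for one extraction pair by a
# logistic function over hand-picked (hypothetical) features and weights.

def logistic_score(features, weights, bias):
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

features = {
    "global_functionality": 0.9,   # theta^f_R: relation looks functional
    "global_unambiguity": 0.8,     # theta^u_x: argument looks unambiguous
    "local_functionality": 0.7,
    "string_similarity": 0.1,      # y values are dissimilar strings
    "same_argument_type": 1.0,     # both y values are, e.g., locations
}
weights = {"global_functionality": 2.0, "global_unambiguity": 1.5,
           "local_functionality": 1.0, "string_similarity": -3.0,
           "same_argument_type": 1.0}

p = logistic_score(features, weights, bias=-3.0)
print(round(p, 3))  # probability that the pair is a genuine contradiction
```

Note the negative weight on string similarity: near-identical y values suggest synonyms, hence a seeming rather than genuine contradiction.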


Figure 3: After 5 iterations of EM, AUCONTRAIRE achieves a 19% boost to area under the precision-recall curve (AUC) for functionality detection, and a 31% boost to AUC for ambiguity detection.
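The alternating estimation procedure of Section 3.2 can be sketched as follows. This is an illustration, not the authors' code: the extraction precision p and the Beta parameters are assumed values (the paper fits them empirically), and the three contradiction sets are toy data patterned on Figure 1, each summarized by (k, n).

```python
import math
from math import lgamma

P, ALPHA, BETA = 0.9, 2.0, 2.0   # assumed precision p and Beta prior (alpha_f, beta_f)

def log_choose(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def lik_functional(k, n):
    # P(k, n | R*_x): binomial with parameters n and p
    return math.exp(log_choose(n, k) + k * math.log(P) + (n - k) * math.log(1 - P))

def lik_not_functional(k, n):
    # P(k, n | not R*_x): beta-binomial form of Equation 3
    log_b = lgamma(ALPHA) + lgamma(BETA) - lgamma(ALPHA + BETA)
    return math.exp(log_choose(n, k) + lgamma(n - k + BETA) + lgamma(ALPHA + k)
                    - log_b - lgamma(ALPHA + BETA + n))

def posterior(k, n, prior):
    # P(R*_x | D_R(x,.)) with prior theta^f_R * theta^u_x
    a = prior * lik_functional(k, n)
    b = (1 - prior) * lik_not_functional(k, n)
    return a / (a + b)

# Toy contradiction sets, summarized as (k, n): most-frequent count, total.
data = {("was born in", "Mozart"): (66, 70),
        ("was born in", "John Adams"): (12, 30),
        ("lived in", "Mozart"): (20, 38)}

theta_f = {r: 0.5 for r, _ in data}   # functionality per relation
theta_u = {x: 0.5 for _, x in data}   # unambiguity per argument

for _ in range(5):   # alternate steps 1 and 2 until (here: 5 iterations)
    post = {(r, x): posterior(k, n, theta_f[r] * theta_u[x])
            for (r, x), (k, n) in data.items()}
    theta_f = {r: sum(post[(r2, x)] * theta_u[x] for (r2, x) in data if r2 == r)
                  / sum(theta_u[x] for (r2, x) in data if r2 == r)
               for r in {r for r, _ in data}}
    theta_u = {x: sum(post[(r, x2)] * theta_f[r] for (r, x2) in data if x2 == x)
                  / sum(theta_f[r] for (r, x2) in data if x2 == x)
               for x in {x for _, x in data}}

print(sorted(theta_f.items()))  # "was born in" should score far above "lived in"
```

Under these assumed hyperparameters, the sharply peaked set for Mozart drives θ^f for "was born in" up, while the ambiguous John Adams set is discounted because its θ^u falls; the flat "lived in" distribution stays non-functional.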

Figure 4: Performance of AUCONTRAIRE at distinguishing genuine contradictions from false positives. The bold line is results on the actual distribution of data from the Web. The dashed line is from a data set constructed to have 50% positive and 50% negative instances.

5.4 Contribution of Knowledge Sources

We carried out an ablation study to quantify how much each knowledge source contributes to AUCONTRAIRE's performance. Since most of the knowledge sources do not apply to numeric argument values, we excluded the extractions where y is a number in this study. As shown in Figure 5, performance of AUCONTRAIRE degrades with no knowledge of synonyms (NS), with no knowledge of meronyms (NM), and especially without argument typing (NT). Conversely, improvements to any of these three components would likely improve the performance of AUCONTRAIRE.

The relatively small drop in performance from no meronyms does not indicate that meronyms are not essential to our task, only that our knowledge sources for meronyms were not as useful as we hoped. The Tipster Gazetteer has surprisingly low coverage for our data set. It contains only 41% of the y values that are locations. Many of these are matches on a different location with the same name, which results in incorrect meronym information. We estimate that a gazetteer with complete coverage would increase area under the curve by approximately 40% compared to a system with meronyms from the Tipster Gazetteer and WordNet.

Figure 5: Area under the precision-recall curve for the full AUCONTRAIRE and for AUCONTRAIRE with knowledge removed. NS has no synonym knowledge; NM has no meronym knowledge; NT has no argument typing.

To analyze the errors made by AUCONTRAIRE, we hand-labeled all false-positives at the point of maximum F-score: 29% Recall and 48% Precision.

18 Figure 6 reveals the central importance of world relatively few of the “functional contradictions” that knowledge for the CD task. About half of the errors AUCONTRAIRE tackles. On our Web-based data (49%) are due to ambiguous x-arguments, which we sets, we achieved a precision of 0.62 at recall 0.19, found to be one of the most persistent obstacles to and precision 0.92 at recall 0.51 on the balanced data discovering genuine contradictions. A sizable por- set. Of course, comparisons across very different tion is due to missing meronyms (34%) and missing data sets are not meaningful, but merely serve to un- synonyms (14%), suggesting that lexical resources derscore the difficulty of the CD task. with broader coverage than WordNet and the Tipster In contrast to previous work, AUCONTRAIRE is Gazetteer would substantially improve performance. the first to do CD on data automatically extracted Surprisingly, only 3% are due to errors in the extrac- from the Web. This is a much harder problem than tion process. using an artificially balanced data set, as shown in Figure 4.

Missing Meronyms (34%) Automatic discovery of functional relations has Missing Synonyms (14%) been addressed in the database literature as Func- tional Dependency Mining (Huhtala et al., 1999;

Figure 6: Sources of errors in contradiction detection. [Chart; recovered labels: Ambiguity (49%), Extraction Errors (3%).]

All of our experimental results are based on the automatically discovered set of functions F. We would expect AUCONTRAIRE's performance to improve substantially if it were given a large set of functional relations as input.

6 Related Work

Condoravdi et al. (2003) first proposed contradiction detection as an important NLP task, and Harabagiu et al. (2006) were the first to report results on contradiction detection using negation, although their evaluation corpus was a balanced data set built by manually negating entailments in a data set from the Recognizing Textual Entailment (RTE) conferences (Dagan et al., 2005). De Marneffe et al. (2008) reported experimental results on a contradiction corpus created by annotating the RTE data sets.

RTE-3 included an optional task, requiring systems to make a 3-way distinction: {entails, contradicts, neither} (Voorhees, 2008). The average performance for contradictions on RTE-3 was precision 0.11 at recall 0.12, and the best system had precision 0.23 at recall 0.19. We did not run AUCONTRAIRE on the RTE data sets because they contained […] Yao and Hamilton, 2008). This work focuses on discovering functional relationships between sets of attributes, and does not address the ambiguity inherent in natural language.

7 Conclusions and Future Work

We have described a case study of contradiction detection (CD) based on functional relations. In this context, we introduced and evaluated the AUCONTRAIRE system and its novel EM-style algorithm for determining whether an arbitrary phrase is functional. We also created a unique "natural" data set of seeming contradictions based on sentences drawn from a Web corpus, which we make available to the research community.

We have drawn two key lessons from our case study. First, many seeming contradictions (approximately 99% in our experiments) are not genuine contradictions. Thus, the CD task may be much harder on natural data than on RTE data, as suggested by Figure 4. Second, extensive background knowledge is necessary to tease apart seeming contradictions from genuine ones. We believe that these lessons are broadly applicable, but verification of this claim is a topic for future work.

Acknowledgements

This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, ONR grant N00014-08-1-0431 as well as gifts from the Utilika Foundation and Google, and was carried out at the University of Washington's Turing Center.
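The contrast drawn in Related Work can be illustrated with a toy example. The sketch below is ours (a minimal dependency check, not the algorithm of Yao and Hamilton or TANE): it tests whether one attribute functionally determines another in a table, i.e., whether any two rows agree on the first attribute but differ on the second. On clean relational data the check behaves as expected; on assertions extracted from Web text it misfires, because meronyms such as "Salzburg" and "Austria" are distinct strings that are nonetheless mutually consistent.

```python
def is_functional(rows, x, y):
    """Return True if attribute x functionally determines attribute y,
    i.e., no two rows agree on x while disagreeing on y."""
    seen = {}
    for row in rows:
        key, val = row[x], row[y]
        if key in seen and seen[key] != val:
            return False  # two rows agree on x but differ on y
        seen[key] = val
    return True

# Clean relational data: birthplace is functional in person.
clean = [
    {"person": "Mozart", "birthplace": "Salzburg"},
    {"person": "Turing", "birthplace": "London"},
]
assert is_functional(clean, "person", "birthplace")

# Extracted Web assertions: the same check reports a violation even
# though no genuine contradiction exists (Salzburg is in Austria).
web = clean + [{"person": "Mozart", "birthplace": "Austria"}]
assert not is_functional(web, "person", "birthplace")
```

Handling the second case correctly is precisely where the background knowledge discussed above (meronyms, synonyms, ambiguity) becomes necessary.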

References

M. Banko and O. Etzioni. 2008. The tradeoffs between traditional and open relation extraction. In Proceedings of ACL.

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the Web. In Procs. of IJCAI.

W.W. Cohen, P. Ravikumar, and S.E. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In IIWeb.

Cleo Condoravdi, Dick Crouch, Valeria de Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning, pages 38–45, Morristown, NJ, USA. Association for Computational Linguistics.

I. Dagan, O. Glickman, and B. Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, pages 1–8.

Marie-Catherine de Marneffe, Anna Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In ACL 2008.

A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1):1–38.

D. Downey, O. Etzioni, and S. Soderland. 2005. A Probabilistic Model of Redundancy in Information Extraction. In Procs. of IJCAI.

Sanda Harabagiu, Andrew Hickl, and Finley Lacatusu. 2006. Negation, contrast and contradiction in text processing. In AAAI.

Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111.

B. MacCartney and C.D. Manning. 2007. Natural logic for textual inference. In Workshop on Textual Entailment and Paraphrasing.

Ellen M. Voorhees. 2008. Contradictions and justifications: Extensions to the textual entailment task. In Proceedings of ACL-08: HLT, pages 63–71, Columbus, Ohio, June. Association for Computational Linguistics.

Hong Yao and Howard J. Hamilton. 2008. Mining functional dependencies from data. Data Min. Knowl. Discov., 16(2):197–219.

A. Yates and O. Etzioni. 2007. Unsupervised resolution of objects and relations on the Web. In Procs. of HLT.
