Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference

Andrew McCallum†    Ben Wellner†∗
†Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003 USA
∗The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730 USA
[email protected]    [email protected]

Abstract

Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present experiments on proper noun coreference in two text data sets, showing results in which we reduce error by nearly 28% or more over traditional thresholded record-linkage, and by up to 33% over an alternative coreference technique previously used in natural language processing.

1 Introduction

In many domains—including computer vision, databases and natural language processing—we find multiple views, descriptions, or names for the same underlying object. Correctly resolving these references is a necessary precursor to further processing and understanding of the data. In computer vision, solving object correspondence is necessary for counting or tracking. In databases, performing record linkage or de-duplication creates a clean set of data that can be accurately mined. In natural language processing, coreference analysis finds the nouns, pronouns and phrases that refer to the same entity, enabling the extraction of relations among entities as well as more complex propositions.

Consider, for example, the text in a news article; it may discuss the entities George Bush, Colin Powell, and Donald Rumsfeld. The article would contain multiple mentions of Colin Powell by different strings—"Secretary of State Colin Powell," "he," "Mr. Powell," "the Secretary"—and would also refer to the other two entities with sometimes overlapping strings. The coreference task is to use content and context of all the mentions to determine how many entities are discussed in the article, and which mention corresponds to which entity.

This task is most frequently solved by examining individual pair-wise distance measures between mentions independently of each other. For example, database record-linkage and citation reference matching have been performed by learning a pairwise distance metric between records, and setting a distance threshold below which records are merged [Monge and Elkan, 1997; Borthwick et al., 2000; McCallum et al., 2000b; Bilenko and Mooney, 2002; Cohen and Richman, 2002]. Coreference in NLP has also been performed with distance thresholds or pairwise classifiers [McCarthy and Lehnert, 1995; Ge et al., 1998; Soon et al., 2001; Ng and Cardie, 2002].

But these distance measures are inherently noisy, and the answer to one pair-wise coreference decision may not be independent of another. For example, if we measure the distances between all three possible pairs among three mentions, two of the distances may be below threshold, but one above—an inconsistency due to noise and imperfect measurement. For example, "Mr. Powell" may be correctly coresolved with "Powell," but particular grammatical circumstances may make the model incorrectly believe that "Powell" is coreferent with a nearby occurrence of "she". Inconsistencies might be better resolved if the coreference decisions are made in dependent relation to each other, and in a way that accounts for the values of the multiple distances, instead of a threshold on single pairs independently. Valuable distinctions in real data are also often even more subtle. For example, in noun coreference, the gender attribute of one mention might combine with the job-title attribute of another coreferent mention, and this combination, in turn, might help coresolve a third mention. This has been recognized, and there are some examples of techniques that use heuristic feature merging [Morton, 1997].

Recently Pasula et al. [2003] have proposed a formal, relational approach to the problem of identity uncertainty using a type of Bayesian network called a Relational Probabilistic Model [Friedman et al., 1999]. A great strength of this model is that it explicitly captures the dependence among multiple coreference decisions. However, it is a generative model of the entities, mentions and all their features, and thus has difficulty using many features that are highly overlapping, non-independent, at varying levels of granularity, and with long-range dependencies. For example, we might wish to use features that capture the phrases, words and character n-grams in the mentions, the appearance of keywords anywhere in the document, the parse-tree of the current, preceding and following sentences, as well as 2-d layout information. To produce accurate generative probability distributions, the dependencies between these features should be captured in the model; but doing so can lead to extremely complex models in which parameter estimation is nearly impossible.

Similar issues arise in sequence modeling problems. In this area significant recent success has been achieved by replacing a generative model—hidden Markov models—with a conditional model—conditional random fields (CRFs) [Lafferty et al., 2001]. CRFs have reduced part-of-speech tagging errors by 50% on out-of-vocabulary words in comparison with HMMs [Ibid.], matched champion noun phrase segmentation results [Sha and Pereira, 2003], and significantly improved the segmentation of tables in government reports [Pinto et al., 2003]. Relational Markov networks [Taskar et al., 2002] are similar models, and have been shown to significantly improve classification of Web pages.

This paper introduces three conditional undirected graphical models for identity uncertainty. The models condition on the mentions and generate the coreference decisions (and in some cases also generate attributes of the entities). Clique potential functions can measure, for example, the degree of consistency between "Mr. Powell", "he" and our assignment of the attribute MALE to their common entity. In the first, most general model, the dependency structure is unrestricted, and the number of underlying entities explicitly appears in the model structure. The second and third models have no structural dependence on the number of entities, and fall into a class of Markov random fields in which inference exactly corresponds to graph partitioning [Boykov et al., 1999].

We show preliminary experimental results using the third model on a proper noun coreference problem in two different newswire text domains: broadcast news stories from the DARPA Automatic Content Extraction (ACE) program, and newswire articles from the DARPA MUC corpus. In both domains we take advantage of the ability to use arbitrary, overlapping features of the input, including string equality, substring, and acronym matches. Using the same features, in comparison with an alternative natural language processing

2 Three Conditional Models of Identity Uncertainty

We now describe three possible configurations for conditional models of identity uncertainty, each progressively simpler and more specific than its predecessor. All three are based on conditionally-trained, undirected graphical models.

Undirected graphical models, also known as Markov networks or Markov random fields, are a type of probabilistic model that excels at capturing interdependent data in which causality among attributes is not apparent. We begin by introducing notation for mentions, entities and attributes of entities, then in the following subsections describe the likelihood, inference and estimation procedures for the specific undirected graphical models.

Let E = (E_1, ..., E_m) be a collection of classes or "entities". Let X = (X_1, ..., X_n) be a collection of random variables over observations or "mentions"; and let Y = (Y_1, ..., Y_n) be a collection of random variables over integer identifiers, unique to each entity, specifying to which entity a mention refers. Thus the y's are integers ranging from 1 to m, and if Y_i = Y_j, then mention X_i is said to refer to the same underlying entity as X_j. For example, some particular entity e_4, U.S. Secretary of State, Colin L. Powell, may be mentioned multiple times in a news article that also contains mentions of other entities: x_6 may be "Colin Powell"; x_9 may be "Powell"; x_17 may be "the Secretary of State." In this case, the unique integer identifier for this entity, e_4, is 4, and y_6 = y_9 = y_17 = 4.

Furthermore, entities may have attributes. Let A be a random variable over the collection of all attributes for all entities. Borrowing the notation of Relational Markov Networks [Taskar et al., 2002], we write the random variable over the attributes of entity E_s as E_s.A = {E_s.A_1, E_s.A_2, E_s.A_3, ...}. For example, these three attributes may be gender, birth year, and surname. Continuing the above example, then e_4.a_1 = MALE, e_4.a_2 = 1937, and e_4.a_3 = Powell. One can interpret the attributes as the values that should appear in the fields of a database record for the given entity. Attributes such as surname may take on one of the finite number of values that appear in the data set.

We may examine many features of the mentions, x, but since a conditional model doesn't generate them, we don't need random variable notation for them. Separate measured features of the mentions and entity-assignments, y, are captured in different feature functions, f(·), over cliques in the graphical model.
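To make the feature functions concrete: the sketch below (our illustration, not the authors' implementation; all function names are hypothetical) computes binary features of a mention pair of the kinds named above—string equality, substring, and acronym match. Because the model is conditional, such features may overlap freely without any need to model their interdependence.

```python
def acronym(text: str) -> str:
    """First letters of the capitalized words, e.g.
    'Automatic Content Extraction' -> 'ACE'."""
    return "".join(w[0] for w in text.split() if w[0].isupper())

def pair_features(xi: str, xj: str) -> dict:
    """Binary feature functions f(x_i, x_j) over a pair of mention strings.
    The features are deliberately overlapping and non-independent."""
    return {
        "string-equal": xi == xj,
        "substring": xi in xj or xj in xi,
        "acronym-match": xi == acronym(xj) or xj == acronym(xi),
    }

print(pair_features("Powell", "Mr. Powell"))
print(pair_features("ACE", "Automatic Content Extraction"))
```

Here "Powell" / "Mr. Powell" fires the substring feature, while "ACE" / "Automatic Content Extraction" fires the acronym-match feature; a learned model would attach a weight to each.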
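The claim that joint inference over entity assignments amounts to graph partitioning can be illustrated with a toy example (the pairwise scores below are invented for illustration; this brute-force search is a sketch, not the paper's inference procedure). Each partition of the mentions is scored by summing pairwise compatibility scores over mention pairs placed in the same entity, and inference picks the best partition:

```python
from itertools import product

# Hypothetical pairwise compatibility scores: positive favors coreference,
# negative opposes it. Note the noisy positive score between "Powell" and
# "she", as in the inconsistency example from the introduction.
mentions = ["Mr. Powell", "Powell", "she"]
score = {(0, 1): 2.0, (0, 2): -1.5, (1, 2): 0.5}

def objective(y):
    """Sum of pairwise scores over mention pairs assigned the same entity id."""
    return sum(s for (i, j), s in score.items() if y[i] == y[j])

# Inference = graph partitioning: with only three mentions we can simply
# enumerate every entity assignment y and keep the highest-scoring one.
best = max(product(range(len(mentions)), repeat=len(mentions)), key=objective)
print(best, objective(best))
```

With these scores the optimal partition merges "Mr. Powell" with "Powell" and keeps "she" separate, even though a simple per-pair threshold at zero would also have merged "Powell" with "she": scoring whole partitions resolves the pairwise inconsistency jointly.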