Understanding the Value of Features for Coreference Resolution

Eric Bengtson and Dan Roth
Department of Computer Science
University of Illinois
Urbana, IL 61801
{ebengt2,danr}@illinois.edu

Abstract

In recent years there has been substantial work on the important problem of coreference resolution, most of which has concentrated on the development of new models and algorithmic techniques. These works often show that complex models improve over a weak pairwise baseline. However, less attention has been given to the importance of selecting strong features to support learning a coreference model.

This paper describes a rather simple pairwise classification model for coreference resolution, developed with a well-designed set of features. We show that this produces a state-of-the-art system that outperforms systems built with complex models. We suggest that our system can be used as a baseline for the development of more complex models – which may have less impact when a more robust set of features is used. The paper also presents an ablation study and discusses the relative contributions of various features.

1 Introduction

Coreference resolution is the task of grouping all the mentions [1] of entities in a document into equivalence classes so that all the mentions in a given class refer to the same discourse entity. For example, given the sentence (where the head noun of each mention is subscripted)

An American1 official2 announced that American1 President3 Bill Clinton3 met his3 Russian4 counterpart5, Vladimir Putin5, today.

the task is to group the mentions so that those referring to the same entity are placed together into an equivalence class.

[1] We follow the ACE (NIST, 2004) terminology: A noun phrase referring to a discourse entity is called a mention, and an equivalence class is called an entity.

Many NLP tasks detect attributes, actions, and relations between entities. In order to discover all information about a given entity, textual mentions of that entity must be grouped together. Thus coreference is an important prerequisite to such tasks as textual entailment and information extraction, among others.

Although coreference resolution has received much attention, that attention has not focused on the relative impact of high-quality features. Thus, while many structural innovations in the modeling approach have been made, those innovations have generally been tested on systems with features whose strength has not been established, and compared to weak pairwise baselines. As a result, it is possible that some modeling innovations may have less impact or applicability when applied to a stronger baseline system.

This paper introduces a rather simple but state-of-the-art system, which we intend to be used as a strong baseline to evaluate the impact of structural innovations. To this end, we combine an effective coreference classification model with a strong set of features, and present an ablation study to show the relative impact of a variety of features.

As we show, this combination of a pairwise model and strong features produces a 1.5 percentage point increase in B-Cubed F-Score over a complex model in the state-of-the-art system by Culotta et al. (2007), although their system uses a complex, non-pairwise model, computing features over partial clusters of mentions.

2 A Pairwise Coreference Model

Given a document and a set of mentions, coreference resolution is the task of grouping the mentions into equivalence classes, so that each equivalence class contains exactly those mentions that refer to the same discourse entity. The number of equivalence classes is not specified in advance, but is bounded by the number of mentions.

In this paper, we view coreference resolution as a graph problem: Given a set of mentions and their context as nodes, generate a set of edges such that any two mentions that belong in the same equivalence class are connected by some path in the graph. We construct this entity-mention graph by learning to decide for each mention which preceding mention, if any, belongs in the same equivalence class; this approach is commonly called the pairwise coreference model (Soon et al., 2001). To decide whether two mentions should be linked in the graph, we learn a pairwise coreference function pc that produces a value indicating the probability that the two mentions should be placed in the same equivalence class.

The remainder of this section first discusses how this function is used as part of a document-level coreference decision model and then describes how we learn the pc function.

2.1 Document-Level Decision Model

Given a document d and a pairwise coreference scoring function pc that maps an ordered pair of mentions to a value indicating the probability that they are coreferential (see Section 2.2), we generate a coreference graph Gd according to the Best-Link decision model (Ng and Cardie, 2002b) as follows: For each mention m in document d, let Bm be the set of mentions appearing before m in d. Let a be the highest scoring antecedent:

a = argmax_{b ∈ Bm} pc(b, m).

If pc(a, m) is above a threshold chosen as described in Section 4.4, we add the edge (a, m) to the coreference graph Gd.

The resulting graph contains connected components, each representing one equivalence class, with all the mentions in the component referring to the same entity. This technique permits us to learn to detect some links between mentions while being agnostic about whether other mentions are linked, and yet via the transitive closure of all links we can still determine the equivalence classes.

We also require that no non-pronoun can refer back to a pronoun: If m is not a pronoun, we do not consider pronouns as candidate antecedents.
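The Best-Link decision procedure can be summarized compactly. The following is a minimal sketch, assuming a trained scorer pc(a, m), a per-mention is_pronoun flag, and a chosen threshold; the names and data structures are illustrative rather than those of the actual LBJ-based implementation.

```python
def best_link_clusters(mentions, pc, threshold):
    """Group mentions into equivalence classes using the Best-Link rule.

    mentions: list of mention objects in document order, each with .is_pronoun
    pc: function (antecedent, mention) -> probability of coreference
    threshold: minimum score required to add a link
    """
    parent = list(range(len(mentions)))      # union-find forest over mentions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for j, m in enumerate(mentions):
        # Candidate antecedents precede m; if m is not a pronoun,
        # pronouns are excluded as antecedents.
        candidates = [i for i in range(j)
                      if m.is_pronoun or not mentions[i].is_pronoun]
        if not candidates:
            continue
        best = max(candidates, key=lambda i: pc(mentions[i], m))
        if pc(mentions[best], m) > threshold:
            union(best, j)                   # add the edge (a, m)

    # Connected components of the link graph are the equivalence classes.
    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i])
    return list(clusters.values())
```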
2.1.1 Related Models

For pairwise models, it is common to choose the best antecedent for a given mention (thereby imposing the constraint that each mention has at most one antecedent); however, the method of deciding which is the best antecedent varies.

Soon et al. (2001) use the Closest-Link method: They select as an antecedent the closest preceding mention that is predicted coreferential by a pairwise coreference module; this is equivalent to choosing the closest mention whose pc value is above a threshold. Best-Link was shown to outperform Closest-Link in an experiment by Ng and Cardie (2002b). Our model differs from that of Ng and Cardie in that we impose the constraint that non-pronouns cannot refer back to pronouns, and in that we use as training examples all ordered pairs of mentions, subject to the constraint above.

Culotta et al. (2007) introduced a model that predicts whether a pair of equivalence classes should be merged, using features computed over all the mentions in both classes. Since the number of possible classes is exponential in the number of mentions, they use heuristics to select training examples. Our method does not require determining which equivalence classes should be considered as examples.

2.2 Pairwise Coreference Function

Learning the pairwise scoring function pc is a crucial issue for the pairwise coreference model. We apply machine learning techniques to learn from examples a function pc that takes as input an ordered pair of mentions (a, m) such that a precedes m in the document, and produces as output a value that is interpreted as the conditional probability that m and a belong in the same equivalence class.

2.2.1 Training Example Selection

The ACE training data provides the equivalence classes for mentions. However, for some pairs of mentions from an equivalence class, there is little or no direct evidence in the text that the mentions are coreferential. Therefore, training pc on all pairs of mentions within an equivalence class may not lead to a good predictor. Thus, for each mention m we select from m's equivalence class the closest preceding mention a and present the pair (a, m) as a positive training example, under the assumption that there is more direct evidence in the text for the existence of this edge than for other edges. This is similar to the technique of Ng and Cardie (2002b).

For each m, we generate negative examples (a, m) for all mentions a that precede m and are not in the same equivalence class. Note that in doing so we generate more negative examples than positive ones.

Since we never apply pc to a pair where the first mention is a pronoun and the second is not a pronoun, we do not train on examples of this form.

2.2.2 Learning Pairwise Coreference Scoring Model

We learn the pairwise coreference function using an averaged perceptron learning algorithm (Freund and Schapire, 1998) – we use the regularized version in Learning Based Java [2] (Rizzolo and Roth, 2007).

[2] LBJ code is available at http://L2R.cs.uiuc.edu/~cogcomp/asoftware.php?skey=LBJ
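The example selection policy of Section 2.2.1 is simple to write down. A minimal sketch, assuming each mention carries a gold entity_id and an is_pronoun flag (attribute names are illustrative):

```python
def generate_training_pairs(mentions):
    """Yield (antecedent, mention, label) training pairs per Section 2.2.1.

    Positive: the closest preceding mention from the same equivalence class.
    Negative: every preceding mention from a different equivalence class.
    Pairs whose first mention is a pronoun while the second is not are
    skipped, since pc is never applied to that configuration.
    """
    for j, m in enumerate(mentions):
        closest_positive = None
        for i in range(j - 1, -1, -1):           # scan backwards from m
            a = mentions[i]
            if a.is_pronoun and not m.is_pronoun:
                continue                          # never train on this form
            if a.entity_id == m.entity_id:
                if closest_positive is None:
                    closest_positive = a          # closest coreferent antecedent
            else:
                yield (a, m, 0)                   # negative example
        if closest_positive is not None:
            yield (closest_positive, m, 1)        # single positive example
```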

3 Features

The performance of the document-level coreference model depends on the quality of the pairwise coreference function pc. Beyond the training paradigm described earlier, the quality of pc depends on the features used.

We divide the features into categories, based on their function. A full list of features and their categories is given in Table 2. In addition to these boolean features, we also use the conjunctions of all pairs of features [3].

[3] The package of all features used is available at http://L2R.cs.uiuc.edu/~cogcomp/asoftware.php?skey=LBJ#features

In the following description, the term head means the head noun phrase of a mention; the extent is the largest noun phrase headed by the head noun phrase.

3.1 Mention Types

The type of a mention indicates whether it is a proper noun, a common noun, or a pronoun. This feature, when conjoined with others, allows us to give different weight to a feature depending on whether it is being applied to a proper name or a pronoun. For our experiments in Section 5, we use gold mention types as is done by Culotta et al. (2007) and Luo and Zitouni (2005).

Note that in the experiments described in Section 6 we predict the mention types as described there and do not use any gold data. The mention type feature is used in all experiments.

3.2 String Relation Features

String relation features indicate whether two strings share some property, such as one being the substring of another or both sharing a modifier word. Features are listed in Table 1. Modifiers are limited to those occurring before the head.

Feature           Definition
Head match        head_i == head_j
Extent match      extent_i == extent_j
Substring         head_i substring of head_j
Modifiers Match   mod_i == (head_j or mod_j)
Alias             acronym(head_i) == head_j or lastname_i == lastname_j

Table 1: String Relation Features

3.3 Semantic Features

Another class of features captures the semantic relation between two words. Specifically, we check whether gender or number match, or whether the mentions are synonyms, antonyms, or hypernyms. We also check the relationship of modifiers that share a hypernym. The methods for computing these features are described next.

Gender Match: We determine the gender (male, female, or neuter) of the two phrases, and report whether they match (true, false, or unknown). For a proper name, gender is determined by the existence of mr, ms, mrs, or the gender of the first name. If only a last name is found, the phrase is considered to refer to a person. If the name is found in a comprehensive list of cities or countries, or ends with an organization ending such as inc, then the gender is neuter. In the case of a common noun phrase, the phrase is looked up in WordNet (Fellbaum, 1998), and it is assigned a gender according to whether male, female, person, artifact, location, or group (the last three correspond to neuter) is found in the hypernym tree. The gender of a pronoun is looked up in a table.
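As an illustration of how these heuristics compose, here is a minimal sketch of the gender assignment just described; the resource contents (FIRST_NAME_GENDER, PLACE_NAMES, ORG_ENDINGS, PRONOUN_GENDER) and the WordNet lookup stub are placeholders, not the lists used in the actual system.

```python
# Placeholder resources; the real system uses comprehensive lists and WordNet.
FIRST_NAME_GENDER = {"bill": "male", "hillary": "female"}
PLACE_NAMES = {"france", "chicago"}
ORG_ENDINGS = {"inc", "corp"}
PRONOUN_GENDER = {"he": "male", "she": "female", "it": "neuter"}

def wordnet_gender(head):
    """Placeholder for the WordNet hypernym-tree lookup described above."""
    return "unknown"

def gender_of(mention):
    """Return 'male', 'female', 'neuter', or 'unknown' for a mention."""
    tokens = [t.lower().rstrip(".") for t in mention.tokens]
    if mention.type == "PRONOUN":
        return PRONOUN_GENDER.get(tokens[0], "unknown")
    if mention.type == "NAME":
        if "mr" in tokens:
            return "male"
        if "ms" in tokens or "mrs" in tokens:
            return "female"
        if tokens[0] in FIRST_NAME_GENDER:
            return FIRST_NAME_GENDER[tokens[0]]
        if any(t in PLACE_NAMES for t in tokens) or tokens[-1] in ORG_ENDINGS:
            return "neuter"
        return "unknown"       # bare last name: a person of unknown gender
    # Common noun phrase: gender is read off the WordNet hypernym tree.
    return wordnet_gender(mention.head)

def gender_match(m1, m2):
    g1, g2 = gender_of(m1), gender_of(m2)
    if "unknown" in (g1, g2):
        return "unknown"
    return "true" if g1 == g2 else "false"
```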

Category                 Feature                         Source
Mention Types            Mention Type Pair               Annotation and tokens
String Relations         Head Match                      Tokens
                         Extent Match                    Tokens
                         Substring                       Tokens
                         Modifiers Match                 Tokens
                         Alias                           Tokens and lists
Semantic                 Gender Match                    WordNet and lists
                         Number Match                    WordNet and lists
                         Synonyms                        WordNet
                         Antonyms                        WordNet
                         Hypernyms                       WordNet
                         Both Speak                      Context
Relative Location        Apposition                      Positions and context
                         Relative Pronoun                Positions and tokens
                         Distances                       Positions
Learned                  Anaphoricity                    Learned
                         Name Modifiers Predicted Match  Learned
Aligned Modifiers        Aligned Modifiers Relation      WordNet and lists
Memorization             Last Words                      Tokens
Predicted Entity Types   Entity Types Match              Annotation and tokens
                         Entity Type Pair                WordNet and tokens

Table 2: Features by Category

Number Match: Number is determined as follows: Phrases starting with the words a, an, or this are singular; those, these, or some indicate plural. Names not containing and are singular. Common nouns are checked against extensive lists of singular and plural nouns – words found in neither or both lists have unknown number. Finally, if the number is unknown yet the two mentions have the same spelling, they are assumed to have the same number.

WordNet Features: We check whether any sense of one head noun phrase is a synonym, antonym, or hypernym of any sense of the other. We also check whether any sense of the phrases share a hypernym, after dropping entity, abstraction, physical entity, object, whole, artifact, and group from the senses, since they are close to the root of the hypernym tree.

Modifiers Match: Determines whether the text before the head of a mention matches the head or the text before the head of the other mention.

Both Mentions Speak: True if both mentions appear within two words of a verb meaning to say. Being in a window of size two is an approximation to being a syntactic subject of such a verb. This feature is a proxy for having similar semantic types.
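A minimal sketch of the WordNet checks described in this section, using the NLTK WordNet interface as a stand-in for the lexical resources of the actual system; the helper names are illustrative, and only direct hypernym links are checked here.

```python
from nltk.corpus import wordnet as wn

# Senses too close to the root of the hypernym tree to be informative.
DROPPED = {"entity", "abstraction", "physical_entity", "object",
           "whole", "artifact", "group"}

def hypernym_names(word):
    """All hypernym lemma names reachable from any sense of word."""
    names = set()
    for sense in wn.synsets(word):
        for path in sense.hypernym_paths():
            for synset in path:
                names.update(synset.lemma_names())
    return names - DROPPED

def wordnet_features(head_i, head_j):
    senses_i, senses_j = set(wn.synsets(head_i)), set(wn.synsets(head_j))
    synonyms = bool(senses_i & senses_j)          # the two words share a sense
    antonyms = any(lem2.synset() in senses_j
                   for s in senses_i
                   for lem1 in s.lemmas()
                   for lem2 in lem1.antonyms())
    hypernyms = any(h in senses_j for s in senses_i for h in s.hypernyms()) or \
                any(h in senses_i for s in senses_j for h in s.hypernyms())
    share_hypernym = bool(hypernym_names(head_i) & hypernym_names(head_j))
    return {"synonyms": synonyms, "antonyms": antonyms,
            "hypernyms": hypernyms, "share_hypernym": share_hypernym}
```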

3.4 Relative Location Features

Additional evidence is derived from the relative location of the two mentions. We thus measure distance (quantized as multiple boolean features of the form [distance ≥ i]) for all i up to the distance and less than some maximum, using units of compatible mentions, and whether the mentions are in the same sentence. We also detect apposition (mentions separated by a comma). For details, see Table 3.

Feature            Definition
Distance           In same sentence
                   # compatible mentions
Apposition         m1, m2 found
Relative Pronoun   Apposition and m2 is PRO

Table 3: Location Features. Compatible mentions are those having the same gender and number.
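A minimal sketch of the quantized distance features just described; the cap MAX_DIST and the attribute names are illustrative assumptions.

```python
MAX_DIST = 10  # illustrative maximum; one boolean feature per i < MAX_DIST

def distance_features(antecedent, mention, mentions_between):
    """Quantized distance features of the form [distance >= i].

    Distance is counted in units of compatible mentions (same gender and
    number) occurring between the two mentions; a same-sentence feature is
    also included.
    """
    dist = sum(1 for x in mentions_between
               if x.gender == mention.gender and x.number == mention.number)
    feats = {"same_sentence": antecedent.sentence_id == mention.sentence_id}
    for i in range(1, MAX_DIST):
        if dist >= i:
            feats["distance>=%d" % i] = True     # active for all i up to dist
    return feats
```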

3.5 Learned Features

Modifier Names: If the mentions are both modified by other proper names, use a basic coreference classifier to determine whether the modifiers are coreferential. This basic classifier is trained using Mention Types, String Relations, Semantic Features, Apposition, Relative Pronoun, and Both Speak. For each mention m, examples are generated with the closest antecedent a to form a positive example, and every mention between a and m to form negative examples.

Anaphoricity: Ng and Cardie (2002a) and Denis and Baldridge (2007) show that when used effectively, explicitly predicting anaphoricity can be helpful. Thus, we learn a separate classifier to detect whether a mention is anaphoric (that is, whether it is not the first mention in its equivalence class), and use that classifier's output as a feature for the coreference model. Features for the anaphoricity classifier include the mention type, whether the mention appears in a quotation, the text of the first word of the extent, the text of the first word after the head (if that word is part of the extent), whether there is a longer mention preceding this mention and having the same head text, whether any preceding mention has the same extent text, and whether any preceding mention has the same text from beginning of the extent to end of the head. Conjunctions of all pairs of these features are also used. This classifier predicts anaphoricity with about 82% accuracy.

3.6 Aligned Modifiers

We determine the relationship of any pair of modifiers that share a hypernym. Each aligned pair may have one of the following relations: match, substring, synonyms, hypernyms, antonyms, or mismatch. Mismatch is defined as none of the above. We restrict modifiers to single nouns and adjectives occurring before the head noun phrase.

3.7 Memorization Features

We allow our system to learn which pairs of nouns tend to be used to mention the same entity. For example, President and he often refer to Bush but she and Prime Minister rarely do, if ever. To enable the system to learn such patterns, we treat the presence or absence of each pair of final head nouns, one from each mention of an example, as a feature.

3.8 Predicted Entity Type

We predict the entity type (person, organization, geo-political entity, location, facility, weapon, or vehicle) as follows: If a proper name, we check a list of personal first names, and a short list of honorary titles (e.g. mr) to determine if the mention is a person. Otherwise we look in lists of personal last names drawn from US census data, and in lists of cities, states, countries, organizations, corporations, sports teams, universities, political parties, and organization endings (e.g. inc or corp). If found in exactly one list, we return the appropriate type. We return unknown if found in multiple lists because the lists are quite comprehensive and may have significant overlap.

For common nouns, we look at the hypernym tree for one of the following: person, political unit, location, organization, weapon, vehicle, industrial plant, and facility. If any is found, we return the appropriate type. If multiple are found, we sort as in the above list.

For personal pronouns, we recognize the entity as a person; otherwise we specify unknown.

This computation is used as part of the following two features.

Entity Type Match: This feature checks to see whether the predicted entity types match. The result is true if the types are identical, false if they are different, and unknown if at least one type is unknown.

Entity Type Conjunctions: This feature indicates the presence of the pair of predicted entity types for the two mentions, except that if either word is a pronoun, the word token replaces the type in the pair. Since we do this replacement for entity types, we also add a similar feature for mention types here. These features are boolean: For any given pair, a feature is active if that pair describes the example.
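A minimal sketch of the type-prediction procedure of Section 3.8; the list resources (FIRST_NAMES, TITLES, TYPED_LISTS) and the WordNet hypernym lookup are small placeholders for the census, gazetteer, and organization lists described above.

```python
# Placeholder resources standing in for the census, gazetteer, and
# organization lists described in Section 3.8.
FIRST_NAMES = {"bill", "vladimir"}
TITLES = {"mr", "ms", "mrs"}
TYPED_LISTS = {
    "person": {"clinton", "putin"},              # last names from census data
    "geo-political entity": {"france", "illinois"},
    "organization": {"boeing", "inc", "corp"},
}
# Priority order used when a common noun's hypernym tree matches several types.
HYPERNYM_PRIORITY = ["person", "political unit", "location", "organization",
                     "weapon", "vehicle", "industrial plant", "facility"]

def wordnet_hypernym_types(head):
    """Placeholder for the hypernym-tree lookup described in the text."""
    return set()

def predicted_entity_type(mention):
    """Heuristic entity type prediction sketch following Section 3.8."""
    tokens = [t.lower() for t in mention.tokens]
    if mention.type == "NAME":
        if tokens[0] in TITLES or tokens[0] in FIRST_NAMES:
            return "person"
        hits = [etype for etype, names in TYPED_LISTS.items()
                if any(t in names for t in tokens)]
        # Exactly one matching list -> that type; several -> too ambiguous.
        return hits[0] if len(hits) == 1 else "unknown"
    if mention.type == "PRONOUN":
        personal = {"i", "me", "you", "he", "him", "she", "her", "we", "us"}
        return "person" if tokens[0] in personal else "unknown"
    # Common noun: inspect the WordNet hypernym tree (stubbed out here) and
    # return the highest-priority matching type.
    matches = wordnet_hypernym_types(mention.head)
    for etype in HYPERNYM_PRIORITY:
        if etype in matches:
            return etype
    return "unknown"

def entity_type_match(m1, m2):
    t1, t2 = predicted_entity_type(m1), predicted_entity_type(m2)
    if "unknown" in (t1, t2):
        return "unknown"
    return "true" if t1 == t2 else "false"
```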

3.9 Related Work

Many of our features are similar to those described in Culotta et al. (2007). This includes Mention Types, String Relation Features, Gender and Number Match, WordNet Features, Alias, Apposition, Relative Pronoun, and Both Mentions Speak. The implementations of those features may vary from those of other systems. Anaphoricity has been proposed as a part of the model in several systems, including Ng and Cardie (2002a), but we are not aware of it being used as a feature for a learning algorithm. Distances have been used in e.g. Luo et al. (2004). However, we are not aware of any system using the number of compatible mentions as a distance.

4 Experimental Study

4.1 Corpus

We use the official ACE 2004 English training data (NIST, 2004). Much work has been done on coreference in several languages, but for this work we focus on English text. We split the corpus into three sets: Train, Dev, and Test. Our test set contains the same 107 documents as Culotta et al. (2007). Our training set is a random 80% of the 336 documents in their training set and our Dev set is the remaining 20%.

For our ablation study, we further randomly split our development set into two evenly sized parts, Dev-Tune and Dev-Eval. For each experiment, we set the parameters of our algorithm to optimize B-Cubed F-Score using Dev-Tune, and use those parameters to evaluate on the Dev-Eval data.

4.2 Preprocessing

For the experiments in Section 5, following Culotta et al. (2007), to make experiments more comparable across systems, we assume that perfect mention boundaries and mention type labels are given. We do not use any other gold annotated input at evaluation time. In Section 6 experiments we do not use any gold annotated input and do not assume mention types or boundaries are given. In all experiments we automatically split words and sentences using our preprocessing tools [4].

[4] The code is available at http://L2R.cs.uiuc.edu/~cogcomp/tools.php

4.3 Evaluation Scores

B-Cubed F-Score: We evaluate over the commonly used B-Cubed F-Score (Bagga and Baldwin, 1998), which is a measure of the overlap of predicted clusters and true clusters. It is computed as the harmonic mean of precision (P),

P = (1/N) Σ_{d∈D} Σ_{m∈d} (c_m / p_m),   (1)

and recall (R),

R = (1/N) Σ_{d∈D} Σ_{m∈d} (c_m / t_m),   (2)

where c_m is the number of mentions appearing both in m's predicted cluster and in m's true cluster, p_m is the size of the predicted cluster containing m, and t_m is the size of m's true cluster. Finally, d represents a document from the set D, and N is the total number of mentions in D.

B-Cubed F-Score has the advantage of being able to measure the impact of singleton entities, and of giving more weight to the splitting or merging of larger entities. It also gives equal weight to all types of entities and mentions. For these reasons, we report our results using B-Cubed F-Score.

MUC F-Score: We also provide results using the official MUC scoring algorithm (Vilain et al., 1995). The MUC F-score is also the harmonic mean of precision and recall. However, the MUC precision counts precision errors by computing the minimum number of links that must be added to ensure that all mentions referring to a given entity are connected in the graph. Recall errors are the number of links that must be removed to ensure that no two mentions referring to different entities are connected in the graph.
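A minimal sketch of the B-Cubed computation in equations (1) and (2), assuming predicted and gold clusterings cover the same mentions and are given as lists of clusters of mention identifiers:

```python
def b_cubed(predicted_docs, gold_docs):
    """B-Cubed precision, recall, and F-score over a document collection.

    predicted_docs, gold_docs: parallel lists; each element is one document's
    clustering, given as a list of clusters, each a list of mention ids.
    """
    def by_mention(clusterings):
        # Map each mention id to the set of mention ids in its cluster.
        return [{m: set(c) for c in doc for m in c} for doc in clusterings]

    pred, gold = by_mention(predicted_docs), by_mention(gold_docs)
    n = sum(len(doc) for doc in gold)             # total number of mentions N
    precision = recall = 0.0
    for p_doc, g_doc in zip(pred, gold):
        for m in g_doc:
            c_m = len(p_doc[m] & g_doc[m])        # shared cluster members
            precision += c_m / len(p_doc[m])      # inner term of eq. (1)
            recall += c_m / len(g_doc[m])         # inner term of eq. (2)
    precision, recall = precision / n, recall / n
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```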

4.4 Learning Algorithm Details

We train a regularized average perceptron using examples selected as described in Section 2.2.1. The learning rate is 0.1 and the regularization parameter (separator thickness) is 3.5. At training time, we use a threshold of 0.0, but when evaluating, we select parameters to optimize B-Cubed F-Score on a held-out development set. We sample all even integer thresholds from -16 to 8. We choose the number of rounds of training similarly, allowing any number from one to twenty.

5 Results

                 Precision   Recall   B3 F
Culotta et al.   86.7        73.2     79.3
Current Work     88.3        74.5     80.8

Table 4: Evaluation on unseen Test Data using B3 score. Shows that our system outperforms the advanced system of Culotta et al. The improvement is statistically significant at the p = 0.05 level according to a non-parametric bootstrapping percentile test.

In Table 4, we compare our performance against a system that is comparable to ours: Both use gold mention boundaries and types, evaluate using B-Cubed F-Score, and have the same training and test data split. Culotta et al. (2007) is the best comparable system of which we are aware.

Our results show that a pairwise model with strong features outperforms a state-of-the-art system with a more complex model.

MUC Score: We evaluate the performance of our system using the official MUC score in Table 5.

MUC Precision   MUC Recall   MUC F
82.7            69.9         75.8

Table 5: Evaluation of our system on unseen Test Data using MUC score.

5.1 Analysis of Feature Contributions

In Table 6 we show the relative impact of various features. We report data on Dev-Eval, to avoid the possibility of overfitting by feature selection. The parameters of the algorithm are chosen to maximize the B-Cubed F-Score on the Dev-Tune data. Note that since we report results on Dev-Eval, the results in Table 6 are not directly comparable with Culotta et al. (2007). For comparable results, see Table 4 and the discussion above.

Our ablation study shows the impact of various classes of features, indicating that almost all the features help, although some more than others. It also illustrates that some features contribute more to precision, others more to recall. For example, aligned modifiers contribute primarily to precision, whereas our learned features and our apposition features contribute to recall. This information can be useful when designing a coreference system in an application where recall is more important than precision, or vice versa.

We examine the effect of some important features, selecting those that provide a substantial improvement in precision, recall, or both. For each such feature we examine the rate of coreference amongst mention pairs for which the feature is active, compared with the overall rate of coreference. We also show examples on which the coreference systems differ depending on the presence or absence of a feature.

Apposition: This feature checks whether two mentions are separated by only a comma, and it increases B-Cubed F-Score by about one percentage point. We hypothesize that proper names and common noun phrases link primarily through apposition, and that apposition is thus a significant feature for good coreference resolution.

When this feature is active 36% of the examples are coreferential, whereas only 6% of all examples are coreferential. Looking at some examples our system begins to get right when apposition is added, we find the phrase

Israel's Deputy Defense Minister, Ephraim Sneh.

Upon adding apposition, our system begins to correctly associate Israel's Deputy Defense Minister with Ephraim Sneh. Likewise in the phrase

The court president, Ronald Sutherland,

the system correctly associates The court president with Ronald Sutherland when they appear in an appositive relation in the text.

                          Precision   Recall   B-Cubed F
String Similarity         86.88       67.17    75.76
+ Semantic Features       85.34       69.30    76.49
+ Apposition              89.77       67.53    77.07
+ Relative Pronoun        88.76       68.97    77.62
+ Distances               89.62       71.93    79.81
+ Learned Features        87.37       74.51    80.43
+ Aligned Modifiers       88.70       74.30    80.86
+ Memorization            86.57       75.59    80.71
+ Predicted Entity Types  87.92       76.46    81.79

Table 6: Contribution of Features as evaluated on a development set. Bold results are significantly better than the previous line at the p = 0.05 level according to a paired non-parametric bootstrapping percentile test. These results show the importance of Distance, Entity Type, and Apposition features.

In addition, our system begins correctly associating relative pronouns such as who with their antecedents in phrases like

Sheikh Abbad, who died 500 years ago.

although an explicit relative pronoun feature is added only later.

Although this feature may lead the system to link comma separated lists of entities due to misinterpretation of the comma, for example Wyoming and western South Dakota in a list of locations, we believe this can be avoided by refining the apposition feature to ignore lists.

Relative Pronoun: Next we investigate the relative pronoun feature. With this feature active, 93% of examples were positive, indicating the precision of this feature. Looking to examples, we find who in

the official, who wished to remain anonymous

is properly linked, as is that in

nuclear warheads that can be fitted to missiles.

Distances: Our distance features measure separation of two mentions in number of compatible mentions (quantized), and whether the mentions are in the same sentence. Distance features are important for a system that makes links based on the best pairwise coreference value rather than implicitly incorporating distance by linking only the closest pair whose score is above a threshold, as done by e.g. Soon et al. (2001).

Looking at examples, we find that adding distances allows the system to associate the pronoun it with this missile not separated by any mentions, rather than Tehran, which is separated from it by many mentions.

Predicted Entity Types: Since no two mentions can have different entity types (person, organization, geo-political entity, etc.) and be coreferential, this feature has strong discriminative power. When the entity types match, 13% of examples are positive compared to only 6% of examples in general. Qualitatively, the entity type prediction correctly recognizes the Gulf region as a geo-political entity, and He as a person, and thus prevents linking the two. Likewise, the system discerns Baghdad from ambassador due to the entity type. However, in some cases an entity type match can cause the system to be overly confident in a bad match, as in the case of a palestinian state identified with holy Jerusalem on the basis of proximity and shared entity type. This type of example may require some additional world knowledge or deeper comprehension of the document.

6 End-to-End Coreference

The ultimate goal for a coreference system is to process unannotated text. We use the term end-to-end coreference for a system capable of determining coreference on plain text. We describe the challenges associated with an end-to-end system, describe our approach, and report results below.

6.1 Challenges

Developing an end-to-end system requires detecting and classifying mentions, which may degrade coreference results. One challenge in detecting mentions is that they are often heavily nested. Additionally, there are issues with evaluating an end-to-end system against a gold standard corpus, resulting from the possibility of mismatches in mention boundaries, missing mentions, and additional mentions detected, along with the need to align detected mentions to their counterparts in the annotated data.

6.2 Approach

We resolve coreference on unannotated text as follows: First we detect mention heads following a state of the art chunking approach (Punyakanok and Roth, 2001) using standard features. This results in a 90% F1 head detector. Next, we detect the extent boundaries for each head using a learned classifier. This is followed by determining whether a mention is a proper name, common noun phrase, prenominal modifier, or pronoun using a learned mention type classifier. Finally, we apply our coreference algorithm described above.

6.3 Evaluation and Results

To evaluate, we align the heads of the detected mentions to the gold standard heads greedily based on number of overlapping words. We choose not to impute errors to the coreference system for mentions that were not detected or for spuriously detected mentions (following Ji et al. (2005) and others). Although this evaluation is lenient, given that the mention detection component performs at over 90% F1, we believe it provides a realistic measure for the performance of the end-to-end system and focuses the evaluation on the coreference component. The results of our end-to-end coreference system are shown in Table 7.

                    Precision   Recall   B3 F
End-to-End System   84.91       72.53    78.24

Table 7: Coreference results using detected mentions on unseen Test Data.

7 Conclusion

We described and evaluated a state-of-the-art coreference system based on a pairwise model and strong features. While previous work showed the impact of complex models on a weak pairwise baseline, the applicability and impact of such models on a strong baseline system such as ours remains uncertain. We also studied and demonstrated the relative value of various types of features, showing in particular the importance of distance and apposition features, and showing which features impact precision or recall more. Finally, we showed an end-to-end system capable of determining coreference in a plain text document.

Acknowledgments

We would like to thank Ming-Wei Chang, Michael Connor, Alexandre Klementiev, Nick Rizzolo, Kevin Small, and the anonymous reviewers for their insightful comments. This work is partly supported by NSF grant SoD-HCER-0613885 and a grant from Boeing.

References

A. Bagga and B. Baldwin. 1998. Algorithms for scoring coreference chains. In MUC7.

A. Culotta, M. Wick, R. Hall, and A. McCallum. 2007. First-order probabilistic models for coreference resolution. In HLT/NAACL, pages 81–88.

P. Denis and J. Baldridge. 2007. Joint determination of anaphoricity and coreference resolution using integer programming. In HLT/NAACL, pages 236–243, Rochester, New York.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Y. Freund and R. E. Schapire. 1998. Large margin classification using the Perceptron algorithm. In COLT, pages 209–217.

H. Ji, D. Westbrook, and R. Grishman. 2005. Using semantic relations to refine coreference decisions. In EMNLP/HLT, pages 17–24, Vancouver, British Columbia, Canada.

X. Luo and I. Zitouni. 2005. Multi-lingual coreference resolution with syntactic features. In HLT/EMNLP, pages 660–667, Vancouver, British Columbia, Canada.

X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos. 2004. A mention-synchronous coreference resolution algorithm based on the bell tree. In ACL, page 135, Morristown, NJ, USA.

V. Ng and C. Cardie. 2002a. Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In COLING-2002.

V. Ng and C. Cardie. 2002b. Improving machine learning approaches to coreference resolution. In ACL.

NIST. 2004. The ACE evaluation plan. www.nist.gov/speech/tests/ace/index.htm.

V. Punyakanok and D. Roth. 2001. The use of classifiers in sequential inference. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 995–1001. MIT Press.

N. Rizzolo and D. Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of the First International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California.

W. M. Soon, H. T. Ng, and C. Y. Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model-theoretic coreference scoring scheme. In MUC6, pages 45–52.
