Verb Phrase Ellipsis Resolution Using Discriminative and Margin-Infused Algorithms
Kian Kenyon-Dean    Jackie Chi Kit Cheung    Doina Precup
School of Computer Science, McGill University
[email protected], {jcheung,dprecup}@cs.mcgill.ca

Abstract

Verb Phrase Ellipsis (VPE) is an anaphoric construction in which a verb phrase has been elided. It occurs frequently in dialogue and informal conversational settings, but despite its evident impact on event coreference resolution and extraction, there has been relatively little work on computational methods for identifying and resolving VPE. Here, we present a novel approach to detecting and resolving VPE by using supervised discriminative machine learning techniques trained on features extracted from an automatically parsed, publicly available dataset. Our approach yields state-of-the-art results for VPE detection by improving F1 score by over 11%; additionally, we explore an approach to antecedent identification that uses the Margin-Infused-Relaxed-Algorithm, which shows promising results.

1 Introduction

Verb Phrase Ellipsis (VPE) is an anaphoric construction in which a verbal constituent has been omitted. In English, an instance of VPE consists of two parts: a trigger, typically an auxiliary or modal verb, that indicates the presence of a VPE; and an antecedent, which is the verb phrase to which the elided element resolves (Bos and Spenader, 2011; Dalrymple et al., 1991). For example, in the sentence, "The government includes money spent on residential renovation; Dodge does not", the trigger "does" resolves to the antecedent "includes money spent on residential renovation".

The ability to perform VPE resolution is important for tasks involving event extraction, especially in conversational genres such as informal dialogue, where VPE occurs more frequently (Nielsen, 2005). Most current event extraction systems ignore VPE and derive some structured semantic representation by reading information from a shallow dependency parse of a sentence. Such an approach would not only miss many valid links between an elided verb and its arguments, it could also produce nonsensical extractions if applied directly to an auxiliary trigger. In the example above, a naive approach might produce an unhelpful semantic triple such as (Dodge, agent, do).

There have been several previous empirical studies of VPE (Hardt, 1997; Nielsen, 2005; Bos and Spenader, 2011; Bos, 2012; Liu et al., 2016). Many previous approaches were restricted to solving specific subclasses of VPE (e.g., VPE triggered by do (Bos, 2012)), or have relied on simple heuristics for some or all of the steps in VPE resolution, such as picking the most recent previous clause as the antecedent.

In this paper, we develop a VPE resolution pipeline which encompasses a broad class of VPEs (Figure 1), decomposed into the following two steps. In the VPE detection step, the goal is to determine whether or not a word triggers VPE. The second step, antecedent identification, requires selecting the clause containing the verbal antecedent, as well as determining the exact boundaries of the antecedent, which are often difficult to define.

Our contribution is to combine the rich linguistic analysis of earlier work with modern statistical approaches adapted to the structure of the VPE resolution problem. First, inspired by earlier work, our system exploits linguistically informed features specific to VPE in addition to standard features such as lexical features or POS tags. Second, we adapt the Margin-Infused-Relaxed-Algorithm (MIRA) (Crammer et al., 2006), which has been popular in other tasks, such as machine translation (Watanabe et al., 2007) and parsing (McDonald et al., 2005), to antecedent identification. This algorithm admits a partial loss function which allows candidate solutions to overlap to a large degree. This makes it well suited to antecedent identification, as candidate antecedents can overlap greatly as well.

On VPE detection, we show that our approach significantly improves upon a deterministic rule-based baseline and outperforms the state-of-the-art system of Liu et al. (2016) by 11%, from 69.52% to 80.78%. For antecedent identification we present results that are competitive with the state of the art (Liu et al., 2016). We also present state-of-the-art results with our end-to-end VPE resolution pipeline. Finally, we perform feature ablation experiments to analyze the impact of various categories of features.
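To make this adaptation concrete, the following is a minimal sketch of a single MIRA update for ranking candidate antecedents under a linear scoring model. The overlap-based partial loss, the feature map, and the aggressiveness cap C shown here are illustrative assumptions, not the exact formulation used in our system.

```python
import numpy as np

def overlap_loss(candidate, gold):
    """Partial loss in [0, 1]: 0 when the candidate matches the gold
    antecedent exactly, approaching 1 as their token overlap shrinks.
    (Illustrative definition, not necessarily the loss used in the paper.)"""
    cand, gold_set = set(candidate), set(gold)
    if not cand and not gold_set:
        return 0.0
    overlap = len(cand & gold_set)
    return 1.0 - overlap / max(len(cand), len(gold_set))

def mira_update(w, features, candidates, gold, C=0.01):
    """One margin-infused relaxed update: the gold antecedent should
    outscore the current best candidate by a margin of at least its loss."""
    scores = [w @ features(c) for c in candidates]
    pred = candidates[int(np.argmax(scores))]
    loss = overlap_loss(pred, gold)
    delta = features(gold) - features(pred)              # direction of correction
    violation = (w @ features(pred) - w @ features(gold)) + loss
    sq_norm = delta @ delta
    if sq_norm > 0 and violation > 0:
        tau = min(C, violation / sq_norm)                 # clipped step size
        w = w + tau * delta
    return w
```

The property that matters for antecedent identification is that the required margin scales with the partial loss: a candidate that overlaps heavily with the gold antecedent incurs a small loss and therefore only a small correction.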
Figure 1: Example of the VPE resolution pipeline on an example found in WSJ file wsj_0036.

2 Related Work

VPE has been the subject of much work in theoretical linguistics (Sag, 1976; Dalrymple et al., 1991, inter alia). VPE resolution could have a significant impact on related problems such as event coreference resolution (Lee et al., 2012; Bejan and Harabagiu, 2010; Liu et al., 2014) and event extraction (Ahn, 2006; Kim et al., 2009; Ritter et al., 2012). It has, however, received relatively little attention in the computational literature.

Hardt (1992) engaged in the first study of computational and algorithmic approaches for VPE detection and antecedent identification by using heuristic, linguistically motivated rules. Hardt (1997) extracted a dataset of 260 examples from the WSJ corpus by using an algorithm that exploited null elements in the PTB parse trees. Nielsen (2005) built a dataset that combined sections of the WSJ and BNC; he showed that the more informal settings captured in the BNC corpora show significantly more frequent occurrences of VPE, especially in dialogue excerpts from interviews and plays. Using this dataset, he created a full VPE pipeline from raw input text to a full resolution by replacing the trigger with the intended antecedent.[1]

Bos and Spenader (2011) annotated the WSJ for occurrences of VPE. They found over 480 instances of VPE, and 67 instances of the similar phenomenon of do-so anaphora. Bos (2012) studied do-VPE by testing algorithmic approaches to VPE detection and antecedent identification that utilize Discourse Representation Theory.

Concurrently with the present work, Liu et al. (2016) explored various decompositions of VPE resolution into detection and antecedent identification subtasks, and they corrected the BNC annotations created by Nielsen (2005), which were difficult to use because they depended on a particular set of preprocessing tools. Our work follows a similar pipelined statistical approach. However, we explore an expanded set of linguistically motivated features and machine learning algorithms adapted for each subtask. Additionally, we consider all forms of VPE, including to-VPE, whereas Liu et al. only consider modal or light verbs (be, do, have) as candidates for triggering VPE. This represented about 7% of the dataset that they examined.

[1] e.g., the resolution of the example in Figure 1 would be "The government includes money spent on residential renovation; Dodge does not [include money spent on residential renovation]". We did not pursue this final step due to the lack of a complete dataset that explicitly depicts the correct grammatical resolution of the VPE.

3 Approach and Data

We divide the problem into two separate tasks: VPE detection (Section 4), and antecedent identification (Section 5). Our experiments use the entire dataset presented in (Bos and Spenader, 2011). For preprocessing, we used CoreNLP (Manning et al., 2014) to automatically parse the raw text of the WSJ for feature extraction. We also ran experiments using gold-standard parses; however, we did not find significant differences in our results. Thus, we only report results on automatically generated parses.

We divide auxiliaries into the six different categories shown in Table 1, which will be relevant for our feature extraction and model training process, as we will describe. This division is motivated by the fact that different auxiliaries exhibit different behaviours (Bos and Spenader, 2011). The results we present on the different auxiliary categories (see Tables 2 and 4) are obtained from training a single classifier over the entire dataset and then testing on auxiliaries from each category, with the ALL result being the accuracy obtained over all of the test data.

Auxiliary Type   Example       Frequency
Do               does, done    214 (39%)
Be               is, were      108 (19%)
Have             has, had       44 (8%)
Modal            will, can      93 (17%)
To               to             29 (5%)
So               do so/same     67 (12%)
TOTAL                          554

Table 1: Auxiliary categories for VPE and their frequencies in all 25 sections of the WSJ.

4 VPE Detection

The task of VPE detection is structured as a binary classification problem. Given an auxiliary, a, we extract a feature vector f, which is used to predict whether or not the auxiliary is a trigger for VPE. In Figure 1, for example, there is only one auxiliary present, "does", and it is a trigger for VPE. In our experiments, we used a logistic regression classifier.
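As a concrete illustration of this setup, the sketch below trains a logistic regression trigger classifier over feature dictionaries. The use of scikit-learn and the toy feature values are assumptions made for exposition; they are not a description of our exact implementation.

```python
# Minimal sketch of VPE detection as binary classification, assuming each
# candidate auxiliary has already been mapped to a feature dictionary
# (see Section 4.1). scikit-learn is an assumed, not prescribed, choice.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature dict per auxiliary occurrence,
# with a binary label indicating whether it triggers VPE.
train_feature_dicts = [
    {"word": "does", "lemma": "do", "aux_type": "do", "next_pos": "RB"},
    {"word": "is", "lemma": "be", "aux_type": "be", "next_pos": "VBG"},
]
train_labels = [1, 0]  # 1 = trigger, 0 = non-trigger

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_feature_dicts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

# At test time, every auxiliary in the automatically parsed text is
# featurized the same way and classified independently.
test_features = [{"word": "does", "lemma": "do", "aux_type": "do", "next_pos": "."}]
predictions = clf.predict(vectorizer.transform(test_features))
```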
4.1 Feature Extraction

We created three different sets of features related to the auxiliary and its surrounding context.

Auxiliary. Auxiliary features describe the characteristics of the specific auxiliary, including the following:
- word identity of the auxiliary
- lemma of the auxiliary
- auxiliary type (as shown in Table 1)

Lexical. These features represent:
- the three words before and after the trigger
- their part-of-speech (POS) tags
- their POS bigrams

Syntactic. We devise these features to encode the relationship between the candidate auxiliary and its local syntactic context. These features were determined to be useful through heuristic analysis of VPE instances in a development set. The feature set includes the following binary indicator features (a = the auxiliary):
- a c-commands a verb
- a c-commands a verb that comes after it
- a verb c-commands a
- a verb locally c-commands a
- a locally c-commands a verb
- a is c-commanded by "than", "as", or "so"
- a is preceded by "than", "as", or "so"
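The auxiliary and lexical feature groups above can be sketched as follows; the helper names and the lemma-to-category mapping are illustrative simplifications, and the syntactic (c-command) features, which require a constituency parse, are omitted.

```python
# Illustrative extraction of the Auxiliary and Lexical feature groups for one
# candidate auxiliary, given tokens, POS tags, and lemmas for its sentence.
# Names (aux_type_of, extract_features, WINDOW) are hypothetical.
WINDOW = 3  # three words before and after the trigger

def aux_type_of(lemma):
    """Map an auxiliary lemma to one of the six categories of Table 1
    (simplified: modals are the fallback category)."""
    if lemma in {"do", "be", "have", "to"}:
        return lemma
    if lemma in {"so", "same"}:
        return "so"
    return "modal"

def extract_features(tokens, pos_tags, lemmas, i):
    """Feature dictionary for the candidate auxiliary at position i."""
    feats = {
        # Auxiliary features
        "word": tokens[i].lower(),
        "lemma": lemmas[i],
        "aux_type": aux_type_of(lemmas[i]),
    }
    # Lexical features: surrounding words and their POS tags
    for offset in range(-WINDOW, WINDOW + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word_{offset}"] = tokens[j].lower()
            feats[f"pos_{offset}"] = pos_tags[j]
    # Lexical features: POS bigrams over the same window
    for j in range(i - WINDOW, i + WINDOW):
        if 0 <= j and j + 1 < len(tokens):
            feats[f"pos_bigram_{j - i}"] = pos_tags[j] + "_" + pos_tags[j + 1]
    return feats
```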