
Dependency-based Empty Category Detection via Phrase Structure Trees

Nianwen Xue
Brandeis University, Waltham, MA, USA
[email protected]

Yaqin Yang
Brandeis University, Waltham, MA, USA
[email protected]

Abstract

We describe a novel approach to detecting empty categories (EC) as represented in dependency trees as well as a new metric for measuring EC detection accuracy. The new metric takes into account not only the position and type of an EC, but also the head it is a dependent of in a dependency tree. We also introduce a variety of new features that are more suited for this approach. Tested on a subset of the Chinese Treebank, our system improved significantly over the best previously reported results even when evaluated with this more stringent metric.

1 Introduction

In modern theoretical linguistics, empty categories (ECs) are an important piece of machinery in representing the syntactic structure of a sentence, and they are used to represent phonologically null elements such as dropped pronouns and traces of dislocated elements. They have also found their way into large-scale treebanks, which have played an important role in advancing the state of the art in syntactic parsing. In phrase-structure treebanks, ECs have been used to indicate long-distance dependencies, discontinuous constituents, and certain dropped elements (Marcus et al., 1993; Xue et al., 2005). Together with labeled brackets and function tags, they make up the full syntactic representation of a sentence.

The use of ECs captures some cross-linguistic commonalities and differences. For example, while both the Penn English TreeBank (PTB) (Marcus et al., 1993) and the Chinese TreeBank (CTB) (Xue et al., 2005) use traces to represent the extraction site of a dislocated element, dropped pronouns (represented as *pro*s) are much more widespread in the CTB. This is because Chinese is a pro-drop language (Huang, 1984) that allows the subject to be dropped in more contexts than English does. While detecting and resolving traces is important to the interpretation of the syntactic structure of a sentence in both English and Chinese, the prevalence of dropped pronouns in Chinese text gives EC detection added significance and urgency. They are not only an important component of the syntactic parse of a sentence, but are also essential to a wide range of NLP applications. For example, any meaningful tracking of entities and events in natural language text would have to include those represented by dropped pronouns. If Chinese is translated into a different language, it is also necessary to render these dropped pronouns explicit if the target language does not allow pro-drop. In fact, Chung and Gildea (2010) reported preliminary work that has shown a positive impact of automatic EC detection on statistical machine translation.

Some ECs can be resolved to an overt element in the same text while others only have a generic reference that cannot be linked to any specific entity. Still others have a plausible antecedent in the text, but are not annotated due to annotation limitations. A common practice is to resolve ECs in two separate stages (Johnson, 2002; Dienes and Dubey, 2003b; Dienes and Dubey, 2003a; Campbell, 2004; Gabbard et al., 2006; Schmid, 2006; Cai et al., 2011). The first stage is EC detection, where empty categories are first located and typed.


The second stage is EC resolution, where empty categories are linked to an overt element if possible.

In this paper we describe a novel approach to detecting empty categories in Chinese, using the CTB as training and test data. More concretely, EC detection involves (i) identifying the position of the EC relative to some overt word tokens in the same sentence, and (ii) determining the type of the EC, e.g., whether it is a dropped pronoun or a trace. We focus on EC detection here because most of the ECs in the Chinese Treebank are either not resolved to an overt element or linked to another EC. For example, dropped pronouns (*pro*) are not resolved, and traces (*T*) in relative clauses are linked to an empty relative pronoun (*OP*).

In previous work, ECs are either represented linearly, where ECs are indexed to the following word (Yang and Xue, 2010), or attached to nodes in a phrase structure tree (Johnson, 2002; Dienes and Dubey, 2003b; Gabbard et al., 2006). In a linear representation where ECs are indexed to the following word, it is difficult to represent consecutive ECs, because that would mean more than one EC is indexed to the same word (making the classification task more complicated). While in English consecutive ECs are relatively rare, in Chinese this is very common. For example, it is often the case that an empty relative pronoun (*OP*) is followed immediately by a trace (*T*). Another issue with the linear representation of ECs is that it leaves unspecified where the EC should be attached, and crucial dependencies between ECs and other elements in the syntactic structure are not represented, thus limiting the utility of this task.

In a phrase structure representation, ECs are attached to a hierarchical structure, and the problem of multiple ECs indexed to the same word token can be avoided because linearly consecutive ECs may be attached to different non-terminal nodes in a phrase structure tree. In a phrase structure framework, ECs are evaluated based on their linear position as well as on their contribution to the overall accuracy of the syntactic parse (Cai et al., 2011).

In the present work, we propose to look at EC detection in a dependency structure representation, where we define EC detection as (i) determining its linear position relative to the following word token, (ii) determining the head it is a dependent of, and (iii) determining the type of the EC. Framing EC detection this way also requires a new evaluation metric. An EC is considered to be correctly detected if its linear position, its head, and its type are all correctly determined. We report experimental results that show that even using this more stringent measure, our EC detection system achieved performance that improved significantly over the state-of-the-art results.

The rest of the paper is organized as follows. In Section 2, we describe in detail how to represent ECs in a dependency structure and present our approach to EC detection. In Section 3, we describe how linguistic information is encoded as features. In Section 4, we discuss our experimental setup and present our results. In Section 5, we describe related work. Section 6 concludes the paper.

2 Approach

In order to detect ECs anchored in a dependency tree, we first convert the phrase structure trees in the CTB into dependency trees. After the conversion, each word token in a dependency tree, including the ECs, will have one and only one head (or parent). We then train a classifier to predict the position and type of ECs in the dependency tree. Let W be a sequence of word tokens in a sentence and T a syntactic parse tree for W. Our task is to predict whether there is a tuple (h, t, e) such that h and t are word tokens in W, e is an EC, h is the head of e, and t immediately follows e. When EC detection is formulated as a classification task, each classification instance is thus a tuple (h, t). The input to our classifier is T, which can either be a phrase structure tree or a dependency tree. We choose to use a phrase structure tree because phrase structure parsers trained on the Chinese Treebank are readily available, and we also hypothesize that phrase structure trees have a richer hierarchical structure that can be exploited as features for EC detection.
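
To make the formulation concrete, the sketch below (our illustration, not the authors' code; names such as generate_instances are hypothetical) enumerates candidate (h, t) tuples for a sentence and labels each one with a gold EC type, or with NONE, the one-pass labeling scheme described later in Section 2.3.

```python
# A minimal sketch (not the authors' code) of the classification formulation:
# each candidate (h, t) tuple in a sentence becomes one instance, labeled with a
# gold EC type if an EC headed by h immediately precedes t, and with NONE otherwise.
# Function and field names are illustrative.

EC_TYPES = {"*pro*", "*PRO*", "*OP*", "*T*", "*RNR*", "*", "*?*"}

def generate_instances(tokens, gold_ecs):
    """tokens: the overt word tokens of one sentence (list of strings).
    gold_ecs: dict mapping (head_index, following_token_index) -> EC type,
    extracted from the gold dependency tree with ECs preserved."""
    instances = []
    for h in range(len(tokens)):        # candidate head position
        for t in range(len(tokens)):    # position of the token that would follow the EC
            label = gold_ecs.get((h, t), "NONE")
            instances.append({"h": h, "t": t, "label": label})
    return instances

# A 20-word sentence already yields 400 candidate tuples, almost all labeled NONE,
# which is the class imbalance discussed in Section 2.3.
```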

2.1 Empty categories in the Chinese Treebank

According to the CTB bracketing guidelines (Xue and Xia, 2000), there are seven different types of ECs in the CTB. Below is a brief description of the empty categories:

1. *pro*: small pro, used to represent dropped pronouns.

2. *PRO*: big PRO, used to represent shared elements in control structures or elements that have generic references.
3. *OP*: null operator, used to represent empty relative pronouns.
4. *T*: trace left by movement such as topicalization and relativization.
5. *RNR*: right node raising.
6. *: trace left by passivization and raising.
7. *?*: missing elements of unknown category.

An example parse tree with ECs is shown in Figure 1. In the example, there are two ECs, an empty relative pronoun (*OP*) and a trace (*T*), a common syntactic pattern for relative clauses in the CTB.

[Figure 1: Empty categories in a phrase structure tree. The CTB tree (not reproduced here) is for a sentence glossed "Shanghai Pudong recently enacted 71 regulatory documents involving the economic field," in which *OP* and *T* appear inside the relative clause.]

2.2 Converting phrase structure to dependency structure

We convert the phrase structure parses in the CTB to dependency trees using the conversion tool that generated the Chinese data sets for the CoNLL 2009 Shared Task on multilingual dependency parsing and semantic role labeling (Hajič et al., 2009).¹ While the Chinese data of the CoNLL 2009 Shared Task does not include ECs, the tool has an option of preserving the ECs in the conversion process. As an example, the dependency tree in Figure 2 is converted from the phrase structure tree in Figure 1, with the ECs preserved.

¹The tool can be downloaded at http://www.cs.brandeis.edu/~clp/ctb/ctb.html.

[Figure 2: Empty categories in a dependency structure tree (the dependency conversion of the Figure 1 tree; not reproduced here).]

In previous work, EC detection has been formulated as a classification problem with the target of the classification being word tokens (Yang and Xue, 2010; Chung and Gildea, 2010) or constituents in a parse tree (Gabbard et al., 2006). When word tokens are used as the target of classification, the task is to determine whether there is an EC before each word token, and what type of EC it is. One shortcoming of that representation is that more than one EC can precede the same word token, as is the case in the example in Figure 1, where both *OP* and *T* precede 涉及 ("involve"). In fact, Yang and Xue (2010) take the last EC when there is a sequence of ECs and, as a result, some ECs never get the chance to be detected. Notice that this problem can be avoided in a dependency structure representation if we make the target of classification a tuple that consists of the following word token and the head of the EC. From Figure 2, it should be clear that while *OP* and *T* both precede the same word token 涉及 ("involve"), they have different heads, which are 的 (DE) and 涉及 ("involve"), respectively.

Dependency-based EC detection also has other nice properties. For ECs that are arguments of their verbal head, when they are resolved to some overt element, the dependency between the referent of the EC and its head will be naturally established. This can be viewed as an alternative to the approach adopted by Levy and Manning (2004), where phrase structure parses are augmented to recover non-local dependencies. Dependency structures are also easily decomposable into head/dependent pairs, and this makes the evaluation more straightforward. Each classification instance can be evaluated independently of other parts of the dependency structure.
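
As a concrete picture of the converted representation, here is a small sketch (ours, not the output format of the conversion tool) of how the two ECs in the Figure 1/Figure 2 example can be stored as ordinary dependency nodes with heads of their own. The heads of *OP* and *T* follow the text; the attachment of 涉及 to 的 is our assumption.

```python
# Illustrative only: ECs stored as ordinary dependency nodes, each with its own head.
# The heads of *OP* and *T* follow the description of Figure 2 in the text; the
# attachment of 涉及 ("involve") to 的 (DEC) is an assumption made for this sketch.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DepNode:
    form: str                # word form, or an EC symbol such as "*OP*"
    head: Optional[str]      # form of the governing node
    is_empty: bool = False   # True for empty categories

nodes = [
    DepNode("*OP*", head="的", is_empty=True),   # empty relative pronoun, headed by DEC
    DepNode("*T*", head="涉及", is_empty=True),   # trace, subject of "involve"
    DepNode("涉及", head="的"),                   # relative-clause verb (assumed attachment)
]

# Although *OP* and *T* both immediately precede 涉及, they remain distinguishable
# because they have different heads, which is exactly what the (h, t) target exploits.
```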

2.3 One pass vs. two passes

With pairs of tokens (h, t) as the classification target, all possible pairs in a sentence will have to be considered, and there will be a large number of (h, t) tuples that are not associated with an EC, leading to a highly imbalanced data set. One can conceive of a two-pass scenario where we first make a binary decision of whether there is an empty category associated with the head in the first pass, and then determine whether there is an EC associated with the tuple as well as the EC type in the second pass. The alternative is to have a one-pass model in which we add a NONE category indicating there is no EC associated with the tuple. With the seven EC types presented earlier in this section, this will be an eight-way classification problem. There are reasons for either model: the one-pass model is simpler, but in the two-pass model we can bring different sources of information to bear on each sub-problem. Ultimately, which model leads to better accuracy is an empirical question. We experimented with both models and it turned out that they led to very similar results. In this paper, we report results from the simpler one-pass model.

3 Features

We explored a wide range of features, all derived from the phrase structure parse tree (T). With each classification instance being a tuple (h, t), the "pivots" for these features are h, the head, t, the word token following the EC, and p, the word token preceding the EC. The features we tried fall into six broad groups that are all empirically confirmed to have made a positive contribution to our classification task. These are (i) horizontal features, (ii) vertical features, (iii) targeted grammatical constructions, (iv) head information, (v) transitivity features, and (vi) semantic role features. We obviously have looked at features used in previous work on Chinese EC detection, most notably (Yang and Xue, 2010), which also adopted a classification-based approach, but because we frame our classification task very differently, we have to use very different features. However, there is a subset of features we used here that has at least a partial overlap with their features, and such features are clearly indicated with ∗.

3.1 Horizontal features

The first group of features we use can be described as horizontal features that exploit the lexical context of the head (h), the word token following the EC (t), and the word token before the EC (p). These include different combinations of h, t and p, as well as their parts-of-speech. They also include various linear distance features between h and t. Below is the full list of lexical features:

1. ∗The token string representation of h, t and p, as well as their part-of-speech tags (POS).
2. ∗The POS combination of h and t, and the POS combination of t and p.
3. The normalized word distance between h and t, with possible values same, immediately before, immediately after, near before, near after, and other.
4. The verb distance between h and t, defined as the number of verbs that occur between h and t.
5. The comma distance between h and t, defined as the number of commas that occur between h and t.

3.2 Vertical features

Vertical features are designed to exploit the hierarchical structure of the syntactic tree. Our hierarchical features are based on the following observations. An empty category is always located between its left frontier and right frontier, anchored by t and p. Given the lowest common ancestor A of p and t, the right frontier is the path from t to A and the left frontier is the path from p to A. We also define a path feature from h to t, which constrains the distance between the EC and its head, just as it constrains the distance between a predicate and its argument in the semantic role labeling task (Gildea and Jurafsky, 2002). Given the lowest common ancestor A′ of h and t, the path from h to t is the path from h to A′ and from A′ to t.
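
To illustrate how such path strings can be computed, the sketch below (our own helper with hypothetical names, not the authors' feature extractor) walks from one leaf up to the lowest common ancestor and back down to the other leaf, producing strings in the ↑/↓ format used above and anticipating the Figure 3 example discussed next.

```python
# A sketch of path-feature extraction over a phrase structure tree. The Node class
# and helper names are ours; the path format (labels joined by ↑ going up and
# ↓ going down) mirrors the vertical-feature examples in the text.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def ancestors(node):
    """Return the chain of nodes from `node` up to the root (inclusive)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_feature(a, b):
    """Path from leaf a to leaf b through their lowest common ancestor."""
    up = ancestors(a)
    down = ancestors(b)
    common = next(n for n in up if n in down)               # lowest common ancestor
    up_part = up[:up.index(common) + 1]                     # a ... common
    down_part = list(reversed(down[:down.index(common)]))   # below common ... b
    path = "↑".join(n.label for n in up_part)
    if down_part:
        path += "↓" + "↓".join(n.label for n in down_part)
    return path

# Mimicking the Figure 3 configuration: t = 迅速/AD, h = 崛起/VV.
ad = Node("AD"); vv = Node("VV")
advp = Node("ADVP", [ad]); vp_low = Node("VP", [vv])
vp_mid = Node("VP", [advp, vp_low])
print(path_feature(ad, vv))   # AD↑ADVP↑VP↓VP↓VV
```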

[Figure 3: Empty category on the right frontier. The tree (not reproduced here) is for a sentence glossed "The optimization of the capital structure has led to the rapid take-off of a host of enterprises in Qingdao."]

In Figure 3, assuming that t is 迅速 ("rapidly") and h is 崛起 ("take off"), the vertical features extracted include:

1. The string representation of the right frontier, AD↑ADVP↑VP↑IP↑VP.
2. The path from t to the head h, AD↑ADVP↑VP↓VP↓VV.
3. The path from the head h to A, VV↑VP↑VP↑IP↑VP. Notice that there is not always a path from h to A.

The vertical features are really a condensed representation of a certain syntactic configuration that helps to predict the presence or absence of an empty category as well as the empty category type. For example, the right frontier of *PRO* in Figure 3, AD↑ADVP↑VP↑IP↑VP, represents a subjectless IP. Had there been an overt subject in the place of the *PRO*, the right frontier would have been AD↑ADVP↑VP↑IP. Therefore, the vertical features are discriminative features that can help detect the presence or absence of an empty category.

3.3 Targeted grammatical constructions

The third group of features targets specific, linguistically motivated grammatical constructions. The majority of features in this group hinge on the immediate IP (roughly corresponding to S in the PTB) ancestor of t headed by h. These features are only invoked when t starts (or is on the left edge of) the immediate IP ancestor, and they are designed to capture the context in which the IP ancestor is located. This context can provide discriminative clues that may help identify the types of empty category. For example, both *pro*s and *PRO*s tend to occur in the subject position of an IP, but the larger context of the IP often determines the exact empty category type. In Figure 3, the IP that has a *PRO* subject is the complement of a verb in a canonical object-control construction. An IP can also be a sentential subject, the complement of a preposition or a localizer (also called postposition in the literature), or the complement in a CP (roughly SBAR in the PTB), etc. These different contexts tend to be associated with different types of empty categories. The full list of features that exploit these contexts includes:

1. ∗Whether t starts an IP
2. ∗Whether t starts a subjectless IP
3. The left sisters of the immediate IP parent that t starts
4. The right sisters of the immediate IP parent that t starts
5. The string representation of the governing verb of the immediate IP parent that t starts
6. Whether the IP started by t is the complement of a localizer phrase
7. Whether the immediate IP parent that t starts is a sentential subject

3.4 Head information

Most ECs have a verb as their head, but when there is a coordinated VP structure where more than one VP shares an EC subject, only one such verb can be the head of this EC. The phrase structure to dependency structure conversion tool designates the first verb as the head of the coordinated VP and thus the head of the EC subject in the dependency structure. Other verbs have no chance of being the head. We use a VP head feature to capture this information. It is a binary feature indicating whether a verb can be such a head.

3.5 Transitivity features

A transitivity lexicon has been extracted from the Chinese Treebank, and it is used to determine the transitivity value of a word. A word can be transitive, intransitive, or, if it is not a verb, unknown. Ditransitive verbs are small in number and are folded into transitive verbs. Transitivity features are defined on h and constrained by word distance: they are only used when h immediately precedes t. This feature category is intended to capture transitive verbs that are missing an object.
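
A minimal sketch of how such a transitivity feature could be realized (our illustration; the paper does not specify the lexicon format or feature template names, and the lexicon entries below are toy examples):

```python
# Illustrative transitivity feature, assuming a lexicon mapping verbs to
# "transitive" or "intransitive" extracted from the treebank; words not in the
# lexicon receive "unknown". The feature only fires when the candidate head h
# immediately precedes t, the configuration where a transitive verb with a
# missing object is a strong clue for an EC.

TRANSITIVITY = {"涉及": "transitive", "崛起": "intransitive"}  # toy entries

def transitivity_feature(tokens, h, t):
    """tokens: list of word forms; h, t: token indices of the candidate head and
    of the token following the candidate EC position. Returns a feature or None."""
    if t - h != 1:                       # constrained by word distance
        return None
    value = TRANSITIVITY.get(tokens[h], "unknown")
    return "h_transitivity=" + value
```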

3.6 Semantic role features

There are apparent connections between semantic role labeling and EC detection. The task of semantic role labeling is typically defined as one of detecting and classifying arguments for verbal or nominal predicates, with more work done so far on verbal than on nominal predicates. Although empty categories are annotated as arguments to verbal predicates in linguistic resources such as the English (Palmer et al., 2005) and Chinese (Xue and Palmer, 2009) Propbanks, they are often left out in semantic role labeling systems trained on these resources. This is because the best performing semantic role labeling systems rely on syntactic features extracted from automatic parses (Gildea and Palmer, 2002; Punyakanok et al., 2005), and the parsers that produce them do not generally reproduce empty categories. As a result, current semantic role labeling systems can only recover explicit arguments. However, assuming that all the explicit arguments to a predicate are detected and classified, one can infer the empty arguments of a predicate from its explicit arguments, given a list of expected arguments for the predicate. The list of expected arguments can be found in the "frame files" that are used to guide Propbank annotation. We defined a semantic role feature category on h when it is a verb, and the value of this feature is the semantic role labels for the EC arguments. Like transitivity features, this feature category is also constrained by word distance. It is only used when h immediately precedes t.

To extract semantic role features, we retrained a Chinese semantic role labeling system on the Chinese Propbank. We divided the Chinese Propbank data into 10 different subsets, and automatically assigned semantic roles to each subset with a system trained on the other nine subsets. Using the frame files for the Chinese Propbank, we are able to infer the semantic roles for the missing arguments and use them as features.

4 Experimental Results

4.1 Experimental setup

Our EC detection models are trained and evaluated on a subset of the Chinese TreeBank 6.0. The training/development/test data split in our experiments is recommended in the CTB documentation. The CTB file IDs for training, development and testing are listed in Table 1. The development data is used for feature selection and tuning, and results are reported on the test set.

Train   81-325, 400-454, 500-554, 590-596, 600-885, 900
Dev     41-80
Test    1-40, 901-931

Table 1: Data set division.

As discussed in Section 2, the gold standard dependency structure parses are converted from the CTB parse trees, with the ECs preserved. From these gold standard parse trees, we extract triples of (e, h, t) where e is the EC type, h is (the position of) the head of the EC, and t is (the position of) the word token following the EC. During the training phase, features are extracted from automatic phrase structure parses and paired with these triples. The automatic phrase structure parses are produced by the Berkeley parser² with 10-fold cross-validation, with each fold parsed using a model trained on the other nine folds. Measured by the ParsEval metric (Black et al., 1991), the parsing accuracy on the CTB test set stands at 83.63% (F-score), with a precision of 85.66% and a recall of 81.69%. We chose to train a Maximum Entropy classifier using the Mallet toolkit³ (McCallum, 2002) to detect ECs.

²http://code.google.com/p/berkeleyparser
³http://mallet.cs.umass.edu
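
The following sketch summarizes the training step. It is not the authors' code: feature extraction is reduced to precomputed feature dictionaries for the labeled (h, t) instances of Section 2, and scikit-learn's multinomial logistic regression is used here as a stand-in for the Mallet Maximum Entropy classifier that the paper actually uses.

```python
# Sketch of classifier training for the one-pass model. Feature dicts for the
# labeled (h, t) instances are assumed to be precomputed from the automatic
# parses; LogisticRegression is a stand-in for Mallet's MaxEnt classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_ec_detector(feature_dicts, labels):
    """feature_dicts: list of {feature_name: value} dicts, one per (h, t) tuple.
    labels: the gold EC type for each tuple, or "NONE"."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)
    model = LogisticRegression(max_iter=1000)   # a MaxEnt-style log-linear model
    model.fit(X, labels)
    return vectorizer, model

def predict_ec(vectorizer, model, feature_dict):
    return model.predict(vectorizer.transform([feature_dict]))[0]
```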

4.2 Evaluation metric

We use standard metrics of precision, recall and F-measure in our evaluation. In a dependency structure representation, evaluation is very straightforward because individual arcs from the dependency tree can easily be decomposed. An EC is considered to be correctly detected if it is attached to the correct head h, correctly positioned relative to t, and correctly typed. This is a more stringent measure than the metrics proposed in previous work, which evaluate EC detection based on its position and type without considering the head it is a dependent of.
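
A minimal sketch of this metric (our code, with illustrative names): system and gold ECs are reduced to (type, head, following-token) tuples and scored with standard precision, recall and F-measure.

```python
# Dependency-based EC evaluation: an EC counts as correct only if its type, its
# head, and its linear position (the following token t) all match a gold tuple.
from collections import Counter

def prf(gold, system):
    """gold, system: lists of (ec_type, head_index, following_token_index) tuples."""
    gold_c, sys_c = Counter(gold), Counter(system)
    correct = sum((gold_c & sys_c).values())
    p = correct / sum(sys_c.values()) if system else 0.0
    r = correct / sum(gold_c.values()) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```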

4.3 Results

There are 1,838 total EC instances in the test set; if we follow (Yang and Xue, 2010) and collapse all consecutive ECs before the same word token into one, we end up with a total EC count of 1,352, and this is also the EC count used by (Cai et al., 2011) in their evaluation. In the dependency-based representation adopted here, after collapsing all consecutive ECs that occur before the same word token AND are attached to the same head into one, we end up with a total EC count of 1,765. The distribution of the ECs in the test set is presented in Table 2, with the EC count per type from (Yang and Xue, 2010) in parentheses if it is different. The number of *OP*s, in particular, has increased dramatically from 134 to 527, and this is because a null relative pronoun (*OP*) immediately followed by a trace (*T*) in the subject position of a relative clause is a very common pattern in the Chinese Treebank, as illustrated in Figure 2. In (Yang and Xue, 2010), the *OP*-*T* sequences are collapsed into one and only the *T*s are counted. That leads to the much smaller count of *OP*s.

type     count        type     count
*pro*    298 (290)    *PRO*    305 (299)
*OP*     527 (134)    *T*      584 (578)
*        19           *RNR*    32
*?*      0            total    (1352) / 1765 / (1838)

Table 2: EC distribution in the CTB test set

Our results are shown in Table 3. These results are achieved by using the full feature set presented in Section 3. The overall accuracy by F1-measure is 0.574 if we assume there can only be one EC associated with a given (h, t) tuple and hence the total EC count in the gold standard is 1,765, or 0.561 if we factor in all the EC instances and use the higher total count of 1,838, which lowers the recall. If instead we use the total EC count of 1,352 that was used in previous work (Yang and Xue, 2010; Cai et al., 2011), then the F1-measure is 0.660 because the lower total count greatly improves the recall. This is a significant improvement over the best previous result reported by Cai et al. (2011), which is an F1-measure of 0.586 on the same test set but based on a less stringent metric of just comparing the EC position and type, without considering whether the EC is attached to the correct head.

class    correct   prec    rec      F1
*pro*    46        .397    .154     .222
*PRO*    162       .602    .531     .564
*OP*     344       .724    .653     .687
*T*      331       .673    .567     .615
*        0         0       0        0
*RNR*    20        .714    .625     .667
all      903       .653    .512     .574
                           (.491)   (.561)
                           (.668)   (.660)
CCG                .660    .545     .586

Table 3: EC detection results on the CTB test set and comparison with (Cai et al., 2011) [CCG]. In the "all" row, the unparenthesized recall and F1 are based on the gold total of 1,765; the two parenthesized pairs are based on the totals of 1,838 and 1,352 discussed in the text.
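
As a sanity check, the three overall recall and F1 values in Table 3 can be reproduced from the reported 903 correctly detected ECs, the precision of .653, and the three gold totals; the short script below is ours, not part of the paper.

```python
# Reproduce the overall recall/F1 values in Table 3 from the reported counts:
# 903 correct ECs, precision .653, and gold totals of 1,765, 1,838 and 1,352.
correct = 903
precision = 0.653
for gold_total in (1765, 1838, 1352):
    recall = correct / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    print(f"gold={gold_total}: recall={recall:.3f}, F1={f1:.3f}")
# gold=1765: recall=0.512, F1=0.574
# gold=1838: recall=0.491, F1=0.561
# gold=1352: recall=0.668, F1=0.660
```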

There are several observations worth noting from these results. One is that our method performs particularly well on null relative pronouns (*OP*) and traces (*T*), indicating that our features are effective in capturing information from relative clause constructions. This accounts for most of the gain compared with previous approaches. The *OP* category, in particular, benefits most from the dependency representation because it is collapsed into the immediately following *T* in previous approaches and does not even get a chance to be detected. On the other hand, our model did poorly on dropped pronouns (*pro*). One possible explanation is that *pro*s generally occupy subject positions in a sentence and are attached as an immediate child of an IP, which is the top-level structure of a sentence that an automatic parser tends to get wrong. Unlike *PRO*, *pro* is not constrained to well-defined grammatical constructions such as subject- and object-control structures.

To evaluate the effectiveness of our features, we also did an ablation study on the contribution of different feature groups. The most effective features are the ones that, when taken out, lead to the largest drop in accuracy. As should be clear from Table 4, the most effective features are the horizontal features, followed by the vertical features. Features extracted from targeted grammatical constructions and features representing whether h is the head of a coordinated VP lead to modest improvement. Transitivity and semantic role features make virtually no difference at all. We believe it is premature to conclude that they are not useful. Possible explanations for their lack of effectiveness are that they are used in very limited contexts and that the accuracy of the semantic role labeling system is not sufficient to make a difference.

class        correct   prec    rec     F1
all          903       .653    .512    .574 (.561)
-Horizontal  827       .627    .469    .536 (.524)
-Vertical    865       .652    .490    .559 (.547)
-Gr Cons     887       .646    .483    .565 (.552)
-V head      891       .651    .505    .569 (.556)
-Trans       899       .654    .509    .573 (.560)
-SRL         900       .657    .510    .574 (.561)

Table 4: Contribution of feature groups

5 Related Work

The work reported here follows a fruitful line of research on EC detection and resolution, mostly in English. Empty categories were initially left behind in research on syntactic parsing (Collins, 1999; Charniak, 2001) for efficiency reasons, but more recent work has shown that EC detection can be effectively integrated into the parsing process (Schmid, 2006; Cai et al., 2011). In the meantime, both pre-processing and post-processing approaches have been explored in previous work as alternatives. Johnson (2002) showed that empty categories can be added to skeletal parses with reasonable accuracy with a simple pattern-matching algorithm in a postprocessing step. Dienes and Dubey (2003b; 2003a) achieved generally superior accuracy using a machine learning framework without having to refer to the syntactic structure in the skeletal parses. They described their approach as a pre-processing step for parsing because they only use as features morpho-syntactic clues (passives, gerunds and to-infinitives) that can be found in certain function words and part-of-speech tags. Even better results, however, were obtained by Campbell (2004) in a postprocessing step that makes use of rules inspired by work in theoretical linguistics. Gabbard et al. (2006) reported further improvement largely by recasting the Campbell rules as features to seven different machine learning classifiers.

We adopted a machine-learning based postprocessing approach based on insights gained from prior work in English and on Chinese-specific considerations. All things being equal, we believe that a machine learning approach that can exploit partial information is more likely to succeed than deterministic rules that have to make reference to morpho-syntactic clues such as to-infinitives and gerunds that are largely non-existent in Chinese. Without these clues, we believe a preprocessing approach that does not take advantage of skeletal parses is unlikely to succeed either. The work we report here also builds on emerging work in Chinese EC detection. Yang and Xue (2010) reported work on detecting just the presence and absence of empty categories without further classifying them. Chung and Gildea (2010) reported work on detecting just a small subset of the empty categories posited in the Chinese TreeBank. Kong and Zhou (2010) worked on Chinese zero anaphora resolution, where empty category detection is a subtask. More recently, Cai et al. (2011) successfully integrated EC detection into phrase-structure based syntactic parsing and reported state-of-the-art results in both English and Chinese.

6 Conclusions and Future Work

We described a novel approach to detecting empty categories (EC) represented in dependency trees and a new metric for measuring EC detection accuracy. The new metric takes into account not only the position and type of an EC, but also the head it is a dependent of in a dependency structure. We also proposed new features that are more suited for this new approach. Tested on a subset of the Chinese Treebank, we show that our system improved significantly over the best previously reported results despite using a more stringent evaluation metric, with most of the gain coming from an improved representation. In the future, we intend to work toward resolving ECs to their antecedents when EC detection can be done with adequate accuracy. We also plan to test our approach on the Penn (English) Treebank, with the first step being converting the Penn Treebank to a dependency representation with the ECs preserved.

Acknowledgments

This work is supported by the National Science Foundation via Grant No. 0910532 entitled "Richer Representations for Machine Translation". All views expressed in this paper are those of the authors and do not necessarily represent the view of the National Science Foundation.

References

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 306–311.

Shu Cai, David Chiang, and Yoav Goldberg. 2011. Language-independent parsing with empty elements. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 212–216, Portland, Oregon, USA, June. Association for Computational Linguistics.

Richard Campbell. 2004. Using linguistic principles to recover empty categories. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

E. Charniak. 2001. Immediate-head parsing for language models. In ACL-01.

Tagyoung Chung and Daniel Gildea. 2010. Effects of empty categories on machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 636–645, Cambridge, MA.

Michael Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Péter Dienes and Amit Dubey. 2003a. Antecedent recovery: Experiments with a trace tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan.

Péter Dienes and Amit Dubey. 2003b. Deep syntactic processing by combining shallow methods. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.

Ryan Gabbard, Seth Kulick, and Mitchell Marcus. 2006. Fully parsing the Penn Treebank. In Proceedings of HLT-NAACL 2006, pages 184–191, New York City.

D. Gildea and D. Jurafsky. 2002. Automatic labeling for semantic roles. Computational Linguistics, 28(3):245–288.

Dan Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of the 40th Meeting of the Association for Computational Linguistics, Philadelphia, PA.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4-5, Boulder, Colorado, USA.

James C.T. Huang. 1984. On the distribution and reference of empty pronouns. Linguistic Inquiry, 15:531–574.

Mark Johnson. 2002. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Fang Kong and Guodong Zhou. 2010. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts.

Roger Levy and Christopher Manning. 2004. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. In Proceedings of the ACL.

M. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Vasin Punyakanok, Dan Roth, and W. Yih. 2005. The necessity of syntactic parsing for semantic role labeling. In Proceedings of IJCAI-2005, pages 1124–1129, Edinburgh, UK.

Helmut Schmid. 2006. Trace prediction and recovery with unlexicalized PCFGs and slash features. In Proceedings of ACL.

Nianwen Xue and Martha Palmer. 2009. Adding semantic roles to the Chinese Treebank. Natural Language Engineering, 15(1):143–172.

Nianwen Xue and Fei Xia. 2000. The Bracketing Guidelines for Penn Chinese Treebank Project. Technical Report IRCS 00-08, University of Pennsylvania.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Yaqin Yang and Nianwen Xue. 2010. Chasing the ghost: Recovering empty categories in the Chinese Treebank. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China.
