
Dependency-based Empty Category Detection via Phrase Structure Trees

Nianwen Xue
Brandeis University, Waltham, MA, USA
[email protected]

Yaqin Yang
Brandeis University, Waltham, MA, USA
[email protected]

Abstract

We describe a novel approach to detecting empty categories (EC) as represented in dependency trees as well as a new metric for measuring EC detection accuracy. The new metric takes into account not only the position and type of an EC, but also the head it is a dependent of in a dependency tree. We also introduce a variety of new features that are more suited for this approach. Tested on a subset of the Chinese Treebank, our system improved significantly over the best previously reported results even when evaluated with this more stringent metric.

1 Introduction

In modern theoretical linguistics, empty categories (ECs) are an important piece of machinery in representing the syntactic structure of a sentence, and they are used to represent phonologically null elements such as dropped pronouns and traces of dislocated elements. They have also found their way into large-scale treebanks, which have played an important role in advancing the state of the art in syntactic parsing. In phrase-structure treebanks, ECs have been used to indicate long-distance dependencies, discontinuous constituents, and certain dropped elements (Marcus et al., 1993; Xue et al., 2005). Together with labeled brackets and function tags, they make up the full syntactic representation of a sentence.

The use of ECs captures some cross-linguistic commonalities and differences. For example, while both the Penn English TreeBank (PTB) (Marcus et al., 1993) and the Chinese TreeBank (CTB) (Xue et al., 2005) use traces to represent the extraction site of a dislocated element, dropped pronouns (represented as *pro*s) are much more widespread in the CTB. This is because Chinese is a pro-drop language (Huang, 1984) that allows the subject to be dropped in more contexts than English does. While detecting and resolving traces is important to the interpretation of the syntactic structure of a sentence in both English and Chinese, the prevalence of dropped pronouns in Chinese text gives EC detection added significance and urgency. They are not only an important component of the syntactic parse of a sentence, but are also essential to a wide range of NLP applications. For example, any meaningful tracking of entities and events in natural language text would have to include those represented by dropped pronouns. If Chinese is translated into a different language, it is also necessary to render these dropped pronouns explicit if the target language does not allow pro-drop. In fact, Chung and Gildea (2010) reported preliminary work that has shown a positive impact of automatic EC detection on statistical machine translation.

Some ECs can be resolved to an overt element in the same text while others only have a generic reference that cannot be linked to any specific entity. Still others have a plausible antecedent in the text, but are not annotated due to annotation limitations. A common practice is to resolve ECs in two separate stages (Johnson, 2002; Dienes and Dubey, 2003b; Dienes and Dubey, 2003a; Campbell, 2004; Gabbard et al., 2006; Schmid, 2006; Cai et al., 2011). The first stage is EC detection, where empty categories are first located and typed.


The second stage is EC resolution, where empty categories are linked to an overt element if possible.

In this paper we describe a novel approach to detecting empty categories in Chinese, using the CTB as training and test data. More concretely, EC detection involves (i) identifying the position of the EC relative to some overt word tokens in the same sentence, and (ii) determining the type of the EC, e.g., whether it is a dropped pronoun or a trace. We focus on EC detection here because most of the ECs in the Chinese Treebank are either not resolved to an overt element or linked to another EC. For example, dropped pronouns (*pro*) are not resolved, and traces (*T*) in relative clauses are linked to an empty relative pronoun (*OP*).

In previous work, ECs are either represented linearly, where ECs are indexed to the following word (Yang and Xue, 2010), or attached to nodes in a phrase structure tree (Johnson, 2002; Dienes and Dubey, 2003b; Gabbard et al., 2006). In a linear representation where ECs are indexed to the following word, it is difficult to represent consecutive ECs, because that would mean more than one EC is indexed to the same word (making the classification task more complicated). While in English consecutive ECs are relatively rare, in Chinese this is very common. For example, it is often the case that an empty relative pronoun (*OP*) is followed immediately by a trace (*T*). Another issue with the linear representation of ECs is that it leaves unspecified where the EC should be attached, and crucial dependencies between ECs and other elements in the syntactic structure are not represented, thus limiting the utility of this task.

In a phrase structure representation, ECs are attached to a hierarchical structure, and the problem of multiple ECs indexed to the same word token can be avoided because linearly consecutive ECs may be attached to different non-terminal nodes in a phrase structure tree. In a phrase structure framework, ECs are evaluated based on their linear position as well as on their contribution to the overall accuracy of the syntactic parse (Cai et al., 2011).

In the present work, we propose to look at EC detection in a dependency structure representation, where we define EC detection as (i) determining its linear position relative to the following word token, (ii) determining the head it is a dependent of, and (iii) determining the type of the EC. Framing EC detection this way also requires a new evaluation metric. An EC is considered to be correctly detected if its linear position, its head, and its type are all correctly determined. We report experimental results that show that even using this more stringent measure, our EC detection system achieved performance that improved significantly over the state-of-the-art results.

The rest of the paper is organized as follows. In Section 2, we describe in detail how to represent ECs in a dependency structure and present our approach to EC detection. In Section 3, we describe how linguistic information is encoded as features. In Section 4, we discuss our experimental setup and present our results. In Section 5, we describe related work. Section 6 concludes the paper.

2 Approach

In order to detect ECs anchored in a dependency tree, we first convert the phrase structure trees in the CTB into dependency trees. After the conversion, each word token in a dependency tree, including the ECs, will have one and only one head (or parent). We then train a classifier to predict the position and type of ECs in the dependency tree. Let W be a sequence of word tokens in a sentence and T a syntactic parse tree for W. Our task is to predict whether there is a tuple (h, t, e) such that h and t are word tokens in W, e is an EC, h is the head of e, and t immediately follows e. When EC detection is formulated as a classification task, each classification instance is thus a tuple (h, t). The input to our classifier is T, which can either be a phrase structure tree or a dependency tree. We choose to use a phrase structure tree because phrase structure parsers trained on the Chinese Treebank are readily available, and we also hypothesize that phrase structure trees have a richer hierarchical structure that can be exploited as features for EC detection.
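
To make the formulation concrete, the sketch below (our illustration, not the authors' code; names such as generate_instances are hypothetical) enumerates candidate (h, t) tuples for a sentence and labels each one with a gold EC type, or with NONE, the one-pass labeling scheme described later in Section 2.3.

```python
# A minimal sketch (not the authors' code) of the classification formulation:
# each candidate (h, t) tuple in a sentence becomes one instance, labeled with a
# gold EC type if an EC headed by h immediately precedes t, and with NONE otherwise.
# Function and field names are illustrative.

EC_TYPES = {"*pro*", "*PRO*", "*OP*", "*T*", "*RNR*", "*", "*?*"}

def generate_instances(tokens, gold_ecs):
    """tokens: the overt word tokens of one sentence (list of strings).
    gold_ecs: dict mapping (head_index, following_token_index) -> EC type,
    extracted from the gold dependency tree with ECs preserved."""
    instances = []
    for h in range(len(tokens)):        # candidate head position
        for t in range(len(tokens)):    # position of the token that would follow the EC
            label = gold_ecs.get((h, t), "NONE")
            instances.append({"h": h, "t": t, "label": label})
    return instances

# A 20-word sentence already yields 400 candidate tuples, almost all labeled NONE,
# which is the class imbalance discussed in Section 2.3.
```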

2.1 Empty categories in the Chinese Treebank

According to the CTB bracketing guidelines (Xue and Xia, 2000), there are seven different types of ECs in the CTB. Below is a brief description of the empty categories:

1. *pro*: small pro, used to represent dropped pronouns.

2. *PRO*: big PRO, used to represent shared elements in control structures or elements that have generic references.
3. *OP*: null operator, used to represent empty relative pronouns.
4. *T*: trace left by movement such as topicalization and relativization.
5. *RNR*: right node raising.
6. *: trace left by passivization and raising.
7. *?*: missing elements of unknown category.

An example parse tree with ECs is shown in Figure 1. In the example, there are two ECs, an empty relative pronoun (*OP*) and a trace (*T*), a common syntactic pattern for relative clauses in the CTB.

[Figure 1: Empty categories in a phrase structure tree. The CTB tree (not reproduced here) is for a sentence glossed "Shanghai Pudong recently enacted 71 regulatory documents involving the economic field," in which *OP* and *T* appear inside the relative clause.]

2.2 Converting phrase structure to dependency structure

We convert the phrase structure parses in the CTB to dependency trees using the conversion tool that generated the Chinese data sets for the CoNLL 2009 Shared Task on multilingual dependency parsing and semantic role labeling (Hajič et al., 2009).¹ While the Chinese data of the CoNLL 2009 Shared Task does not include ECs, the tool has an option of preserving the ECs in the conversion process. As an example, the dependency tree in Figure 2 is converted from the phrase structure tree in Figure 1, with the ECs preserved.

¹The tool can be downloaded at http://www.cs.brandeis.edu/~clp/ctb/ctb.html.

[Figure 2: Empty categories in a dependency structure tree (the dependency conversion of the Figure 1 tree; not reproduced here).]

In previous work, EC detection has been formulated as a classification problem with the target of the classification being word tokens (Yang and Xue, 2010; Chung and Gildea, 2010) or constituents in a parse tree (Gabbard et al., 2006). When word tokens are used as the target of classification, the task is to determine whether there is an EC before each word token, and what type of EC it is. One shortcoming of that representation is that more than one EC can precede the same word token, as is the case in the example in Figure 1, where both *OP* and *T* precede 涉及 ("involve"). In fact, Yang and Xue (2010) take the last EC when there is a sequence of ECs and, as a result, some ECs never get the chance to be detected. Notice that this problem can be avoided in a dependency structure representation if we make the target of classification a tuple that consists of the following word token and the head of the EC. From Figure 2, it should be clear that while *OP* and *T* both precede the same word token 涉及 ("involve"), they have different heads, which are 的 (DE) and 涉及 ("involve"), respectively.

Dependency-based EC detection also has other nice properties. For ECs that are arguments of their verbal head, when they are resolved to some overt element, the dependency between the referent of the EC and its head will be naturally established. This can be viewed as an alternative to the approach adopted by Levy and Manning (2004), where phrase structure parses are augmented to recover non-local dependencies. Dependency structures are also easily decomposable into head/dependent pairs, and this makes the evaluation more straightforward. Each classification instance can be evaluated independently of other parts of the dependency structure.
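
As a concrete picture of the converted representation, here is a small sketch (ours, not the output format of the conversion tool) of how the two ECs in the Figure 1/Figure 2 example can be stored as ordinary dependency nodes with heads of their own. The heads of *OP* and *T* follow the text; the attachment of 涉及 to 的 is our assumption.

```python
# Illustrative only: ECs stored as ordinary dependency nodes, each with its own head.
# The heads of *OP* and *T* follow the description of Figure 2 in the text; the
# attachment of 涉及 ("involve") to 的 (DEC) is an assumption made for this sketch.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DepNode:
    form: str                # word form, or an EC symbol such as "*OP*"
    head: Optional[str]      # form of the governing node
    is_empty: bool = False   # True for empty categories

nodes = [
    DepNode("*OP*", head="的", is_empty=True),   # empty relative pronoun, headed by DEC
    DepNode("*T*", head="涉及", is_empty=True),   # trace, subject of "involve"
    DepNode("涉及", head="的"),                   # relative-clause verb (assumed attachment)
]

# Although *OP* and *T* both immediately precede 涉及, they remain distinguishable
# because they have different heads, which is exactly what the (h, t) target exploits.
```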

2.3 One pass vs. two passes

With pairs of tokens (h, t) as the classification target, all possible pairs in a sentence will have to be considered, and there will be a large number of (h, t) tuples that are not associated with an EC, leading to a highly imbalanced data set. One can conceive of a two-pass scenario where we first make a binary decision of whether there is an empty category associated with the head in the first pass, and then determine whether there is an EC associated with the tuple as well as the EC type in the second pass. The alternative is to have a one-pass model in which we add a NONE category indicating there is no EC associated with the tuple. With the seven EC types presented earlier in this section, this will be an eight-way classification problem. There are reasons for either model: the one-pass model is simpler, but in the two-pass model we can bring different sources of information to bear on each sub-problem. Ultimately, which model leads to better accuracy is an empirical question. We experimented with both models and it turned out that they led to very similar results. In this paper, we report results from the simpler one-pass model.

3 Features

We explored a wide range of features, all derived from the phrase structure parse tree (T). With each classification instance being a tuple (h, t), the "pivots" for these features are h, the head, t, the word token following the EC, and p, the word token preceding the EC. The features we tried fall into six broad groups that are all empirically confirmed to have made a positive contribution to our classification task. These are (i) horizontal features, (ii) vertical features, (iii) targeted grammatical constructions, (iv) head information, (v) transitivity features, and (vi) semantic role features. We obviously have looked at features used in previous work on Chinese EC detection, most notably (Yang and Xue, 2010), which also adopted a classification-based approach, but because we frame our classification task very differently, we have to use very different features. However, there is a subset of features we used here that has at least a partial overlap with their features, and such features are clearly indicated with ∗.

3.1 Horizontal features

The first group of features we use can be described as horizontal features that exploit the lexical context of the head (h), the word token following the EC (t), and the word token before the EC (p). These include different combinations of h, t and p, as well as their parts-of-speech. They also include various linear distance features between h and t. Below is the full list of lexical features:

1. ∗The token string representation of h, t and p, as well as their part-of-speech tags (POS).
2. ∗The POS combination of h and t, and the POS combination of t and p.
3. The normalized word distance between h and t, with possible values same, immediately before, immediately after, near before, near after, and other.
4. The verb distance between h and t, defined as the number of verbs that occur between h and t.
5. The comma distance between h and t, defined as the number of commas that occur between h and t.

3.2 Vertical features

Vertical features are designed to exploit the hierarchical structure of the syntactic tree. Our hierarchical features are based on the following observations. An empty category is always located between its left frontier and right frontier, anchored by t and p. Given the lowest common ancestor A of p and t, the right frontier is the path from t to A and the left frontier is the path from p to A. We also define a path feature from h to t, which constrains the distance between the EC and its head, just as it constrains the distance between a predicate and its argument in the semantic role labeling task (Gildea and Jurafsky, 2002). Given the lowest common ancestor A′ of h and t, the path from h to t is the path from h to A′ and from A′ to t.
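
To illustrate how such path strings can be computed, the sketch below (our own helper with hypothetical names, not the authors' feature extractor) walks from one leaf up to the lowest common ancestor and back down to the other leaf, producing strings in the ↑/↓ format used above and anticipating the Figure 3 example discussed next.

```python
# A sketch of path-feature extraction over a phrase structure tree. The Node class
# and helper names are ours; the path format (labels joined by ↑ going up and
# ↓ going down) mirrors the vertical-feature examples in the text.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def ancestors(node):
    """Return the chain of nodes from `node` up to the root (inclusive)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_feature(a, b):
    """Path from leaf a to leaf b through their lowest common ancestor."""
    up = ancestors(a)
    down = ancestors(b)
    common = next(n for n in up if n in down)               # lowest common ancestor
    up_part = up[:up.index(common) + 1]                     # a ... common
    down_part = list(reversed(down[:down.index(common)]))   # below common ... b
    path = "↑".join(n.label for n in up_part)
    if down_part:
        path += "↓" + "↓".join(n.label for n in down_part)
    return path

# Mimicking the Figure 3 configuration: t = 迅速/AD, h = 崛起/VV.
ad = Node("AD"); vv = Node("VV")
advp = Node("ADVP", [ad]); vp_low = Node("VP", [vv])
vp_mid = Node("VP", [advp, vp_low])
print(path_feature(ad, vv))   # AD↑ADVP↑VP↓VP↓VV
```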

[Figure 3: Empty category on the right frontier. The tree (not reproduced here) is for a sentence glossed "The optimization of the capital structure has led to the rapid take-off of a host of enterprises in Qingdao."]

In Figure 3, assuming that t is 迅速 ("rapidly") and h is 崛起 ("take off"), the vertical features extracted include:

1. The string representation of the right frontier, AD↑ADVP↑VP↑IP↑VP.
2. The path from t to the head h, AD↑ADVP↑VP↓VP↓VV.
3. The path from the head h to A, VV↑VP↑VP↑IP↑VP. Notice that there is not always a path from h to A.

The vertical features are really a condensed representation of a certain syntactic configuration that helps to predict the presence or absence of an empty category as well as the empty category type. For example, the right frontier of *PRO* in Figure 3, AD↑ADVP↑VP↑IP↑VP, represents a subjectless IP. Had there been an overt subject in the place of the *PRO*, the right frontier would have been AD↑ADVP↑VP↑IP. Therefore, the vertical features are discriminative features that can help detect the presence or absence of an empty category.

3.3 Targeted grammatical constructions

The third group of features targets specific, linguistically motivated grammatical constructions. The majority of features in this group hinge on the immediate IP (roughly corresponding to S in the PTB) ancestor of t headed by h. These features are only invoked when t starts (or is on the left edge of) the immediate IP ancestor, and they are designed to capture the context in which the IP ancestor is located. This context can provide discriminative clues that may help identify the types of empty category. For example, both *pro*s and *PRO*s tend to occur in the subject position of an IP, but the larger context of the IP often determines the exact empty category type. In Figure 3, the IP that has a *PRO* subject is the complement of a verb in a canonical object-control construction. An IP can also be a sentential subject, the complement of a preposition or a localizer (also called postposition in the literature), or the complement in a CP (roughly SBAR in the PTB), etc. These different contexts tend to be associated with different types of empty categories. The full list of features that exploit these contexts includes:

1. ∗Whether t starts an IP
2. ∗Whether t starts a subjectless IP
3. The left sisters of the immediate IP parent that t starts
4. The right sisters of the immediate IP parent that t starts
5. The string representation of the governing verb of the immediate IP parent that t starts
6. Whether the IP started by t is the complement of a localizer phrase
7. Whether the immediate IP parent that t starts is a sentential subject

3.4 Head information

Most ECs have a verb as their head, but when there is a coordinated VP structure where more than one VP shares an EC subject, only one such verb can be the head of this EC. The phrase structure to dependency structure conversion tool designates the first verb as the head of the coordinated VP and thus the head of the EC subject in the dependency structure. Other verbs have no chance of being the head. We use a VP head feature to capture this information. It is a binary feature indicating whether a verb can be such a head.

3.5 Transitivity features

A transitivity lexicon has been extracted from the Chinese Treebank, and it is used to determine the transitivity value of a word. A word can be transitive, intransitive, or, if it is not a verb, unknown. Ditransitive verbs are small in number and are folded into transitive verbs. Transitivity features are defined on h and constrained by word distance: they are only used when h immediately precedes t. This feature category is intended to capture transitive verbs that are missing an object.
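
A minimal sketch of how such a transitivity feature could be realized (our illustration; the paper does not specify the lexicon format or feature template names, and the lexicon entries below are toy examples):

```python
# Illustrative transitivity feature, assuming a lexicon mapping verbs to
# "transitive" or "intransitive" extracted from the treebank; words not in the
# lexicon receive "unknown". The feature only fires when the candidate head h
# immediately precedes t, the configuration where a transitive verb with a
# missing object is a strong clue for an EC.

TRANSITIVITY = {"涉及": "transitive", "崛起": "intransitive"}  # toy entries

def transitivity_feature(tokens, h, t):
    """tokens: list of word forms; h, t: token indices of the candidate head and
    of the token following the candidate EC position. Returns a feature or None."""
    if t - h != 1:                       # constrained by word distance
        return None
    value = TRANSITIVITY.get(tokens[h], "unknown")
    return "h_transitivity=" + value
```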

3.6 Semantic role features

There are apparent connections between semantic role labeling and EC detection. The task of semantic role labeling is typically defined as one of detecting and classifying arguments for verbal or nominal predicates, with more work done so far on verbal than on nominal predicates. Although empty categories are annotated as arguments to verbal predicates in linguistic resources such as the English (Palmer et al., 2005) and Chinese (Xue and Palmer, 2009) Propbanks, they are often left out in semantic role labeling systems trained on these resources. This is because the best performing semantic role labeling systems rely on syntactic features extracted from automatic parses (Gildea and Palmer, 2002; Punyakanok et al., 2005), and the parsers that produce them do not generally reproduce empty categories. As a result, current semantic role labeling systems can only recover explicit arguments. However, assuming that all the explicit arguments to a predicate are detected and classified, one can infer the empty arguments of a predicate from its explicit arguments, given a list of expected arguments for the predicate. The list of expected arguments can be found in the "frame files" that are used to guide Propbank annotation. We defined a semantic role feature category on h when it is a verb, and the value of this feature is the semantic role labels for the EC arguments. Like transitivity features, this feature category is also constrained by word distance. It is only used when h immediately precedes t.

To extract semantic role features, we retrained a Chinese semantic role labeling system on the Chinese Propbank. We divided the Chinese Propbank data into 10 different subsets, and automatically assigned semantic roles to each subset with a system trained on the other nine subsets. Using the frame files for the Chinese Propbank, we are able to infer the semantic roles for the missing arguments and use them as features.

4 Experimental Results

4.1 Experimental setup

Our EC detection models are trained and evaluated on a subset of the Chinese TreeBank 6.0. The training/development/test data split in our experiments is recommended in the CTB documentation. The CTB file IDs for training, development and testing are listed in Table 1. The development data is used for feature selection and tuning, and results are reported on the test set.

Train   81-325, 400-454, 500-554, 590-596, 600-885, 900
Dev     41-80
Test    1-40, 901-931

Table 1: Data set division.

As discussed in Section 2, the gold standard dependency structure parses are converted from the CTB parse trees, with the ECs preserved. From these gold standard parse trees, we extract triples of (e, h, t) where e is the EC type, h is (the position of) the head of the EC, and t is (the position of) the word token following the EC. During the training phase, features are extracted from automatic phrase structure parses and paired with these triples. The automatic phrase structure parses are produced by the Berkeley parser² with 10-fold cross-validation, with each fold parsed using a model trained on the other nine folds. Measured by the ParsEval metric (Black et al., 1991), the parsing accuracy on the CTB test set stands at 83.63% (F-score), with a precision of 85.66% and a recall of 81.69%. We chose to train a Maximum Entropy classifier using the Mallet toolkit³ (McCallum, 2002) to detect ECs.

²http://code.google.com/p/berkeleyparser
³http://mallet.cs.umass.edu
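
The following sketch summarizes the training step. It is not the authors' code: feature extraction is reduced to precomputed feature dictionaries for the labeled (h, t) instances of Section 2, and scikit-learn's multinomial logistic regression is used here as a stand-in for the Mallet Maximum Entropy classifier that the paper actually uses.

```python
# Sketch of classifier training for the one-pass model. Feature dicts for the
# labeled (h, t) instances are assumed to be precomputed from the automatic
# parses; LogisticRegression is a stand-in for Mallet's MaxEnt classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_ec_detector(feature_dicts, labels):
    """feature_dicts: list of {feature_name: value} dicts, one per (h, t) tuple.
    labels: the gold EC type for each tuple, or "NONE"."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)
    model = LogisticRegression(max_iter=1000)   # a MaxEnt-style log-linear model
    model.fit(X, labels)
    return vectorizer, model

def predict_ec(vectorizer, model, feature_dict):
    return model.predict(vectorizer.transform([feature_dict]))[0]
```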

4.2 Evaluation metric

We use standard metrics of precision, recall and F-measure in our evaluation. In a dependency structure representation, evaluation is very straightforward because individual arcs from the dependency tree can easily be decomposed. An EC is considered to be correctly detected if it is attached to the correct head h, correctly positioned relative to t, and correctly typed. This is a more stringent measure than the metrics proposed in previous work, which evaluate EC detection based on its position and type without considering the head it is a dependent of.
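
A minimal sketch of this metric (our code, with illustrative names): system and gold ECs are reduced to (type, head, following-token) tuples and scored with standard precision, recall and F-measure.

```python
# Dependency-based EC evaluation: an EC counts as correct only if its type, its
# head, and its linear position (the following token t) all match a gold tuple.
from collections import Counter

def prf(gold, system):
    """gold, system: lists of (ec_type, head_index, following_token_index) tuples."""
    gold_c, sys_c = Counter(gold), Counter(system)
    correct = sum((gold_c & sys_c).values())
    p = correct / sum(sys_c.values()) if system else 0.0
    r = correct / sum(gold_c.values()) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```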

4.3 Results

There are 1,838 total EC instances in the test set; if we follow (Yang and Xue, 2010) and collapse all consecutive ECs before the same word token into one, we end up with a total EC count of 1,352, and this is also the EC count used by (Cai et al., 2011) in their evaluation. In the dependency-based representation adopted here, after collapsing all consecutive ECs that occur before the same word token AND are attached to the same head into one, we end up with a total EC count of 1,765. The distribution of the ECs in the test set is presented in Table 2, with the EC count per type from (Yang and Xue, 2010) in parentheses if it is different. The number of *OP*s, in particular, has increased dramatically from 134 to 527, and this is because a null relative pronoun (*OP*) immediately followed by a trace (*T*) in the subject position of a relative clause is a very common pattern in the Chinese Treebank, as illustrated in Figure 2. In (Yang and Xue, 2010), the *OP*-*T* sequences are collapsed into one and only the *T*s are counted. That leads to the much smaller count of *OP*s.

type     count        type     count
*pro*    298 (290)    *PRO*    305 (299)
*OP*     527 (134)    *T*      584 (578)
*        19           *RNR*    32
*?*      0            total    (1352) / 1765 / (1838)

Table 2: EC distribution in the CTB test set

Our results are shown in Table 3. These results are achieved by using the full feature set presented in Section 3. The overall accuracy by F1-measure is 0.574 if we assume there can only be one EC associated with a given (h, t) tuple and hence the total EC count in the gold standard is 1,765, or 0.561 if we factor in all the EC instances and use the higher total count of 1,838, which lowers the recall. If instead we use the total EC count of 1,352 that was used in previous work (Yang and Xue, 2010; Cai et al., 2011), then the F1-measure is 0.660 because the lower total count greatly improves the recall. This is a significant improvement over the best previous result reported by Cai et al. (2011), which is an F1-measure of 0.586 on the same test set but based on a less stringent metric of just comparing the EC position and type, without considering whether the EC is attached to the correct head.

class    correct   prec    rec      F1
*pro*    46        .397    .154     .222
*PRO*    162       .602    .531     .564
*OP*     344       .724    .653     .687
*T*      331       .673    .567     .615
*        0         0       0        0
*RNR*    20        .714    .625     .667
all      903       .653    .512     .574
                           (.491)   (.561)
                           (.668)   (.660)
CCG                .660    .545     .586

Table 3: EC detection results on the CTB test set and comparison with (Cai et al., 2011) [CCG]. In the "all" row, the unparenthesized recall and F1 are based on the gold total of 1,765; the two parenthesized pairs are based on the totals of 1,838 and 1,352 discussed in the text.
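
As a sanity check, the three overall recall and F1 values in Table 3 can be reproduced from the reported 903 correctly detected ECs, the precision of .653, and the three gold totals; the short script below is ours, not part of the paper.

```python
# Reproduce the overall recall/F1 values in Table 3 from the reported counts:
# 903 correct ECs, precision .653, and gold totals of 1,765, 1,838 and 1,352.
correct = 903
precision = 0.653
for gold_total in (1765, 1838, 1352):
    recall = correct / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    print(f"gold={gold_total}: recall={recall:.3f}, F1={f1:.3f}")
# gold=1765: recall=0.512, F1=0.574
# gold=1838: recall=0.491, F1=0.561
# gold=1352: recall=0.668, F1=0.660
```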

There are several observations worth noting from these results. One is that our method performs particularly well on null relative pronouns (*OP*) and traces (*T*), indicating that our features are effective in capturing information from relative clause constructions. This accounts for most of the gain compared with previous approaches. The *OP* category, in particular, benefits most from the dependency representation because it is collapsed into the immediately following *T* in previous approaches and does not even get a chance to be detected. On the other hand, our model did poorly on dropped pronouns (*pro*). One possible explanation is that *pro*s generally occupy subject positions in a sentence and are attached as an immediate child of an IP, which is the top-level structure of a sentence that an automatic parser tends to get wrong. Unlike *PRO*, *pro* is not constrained to well-defined grammatical constructions such as subject- and object-control structures.

To evaluate the effectiveness of our features, we also did an ablation study on the contribution of different feature groups. The most effective features are the ones that, when taken out, lead to the largest drop in accuracy. As should be clear from Table 4, the most effective features are the horizontal features, followed by the vertical features. Features extracted from targeted grammatical constructions and features representing whether h is the head of a coordinated VP lead to modest improvement. Transitivity and semantic role features make virtually no difference at all. We believe it is premature to conclude that they are not useful. Possible explanations for their lack of effectiveness are that they are used in very limited contexts and that the accuracy of the semantic role labeling system is not sufficient to make a difference.

class        correct   prec    rec     F1
all          903       .653    .512    .574 (.561)
-Horizontal  827       .627    .469    .536 (.524)
-Vertical    865       .652    .490    .559 (.547)
-Gr Cons     887       .646    .483    .565 (.552)
-V head      891       .651    .505    .569 (.556)
-Trans       899       .654    .509    .573 (.560)
-SRL         900       .657    .510    .574 (.561)

Table 4: Contribution of feature groups

5 Related Work

The work reported here follows a fruitful line of research on EC detection and resolution, mostly in English. Empty categories were initially left behind in research on syntactic parsing (Collins, 1999; Charniak, 2001) for efficiency reasons, but more recent work has shown that EC detection can be effectively integrated into the parsing process (Schmid, 2006; Cai et al., 2011). In the meantime, both pre-processing and post-processing approaches have been explored in previous work as alternatives. Johnson (2002) showed that empty categories can be added to skeletal parses with reasonable accuracy with a simple pattern-matching algorithm in a postprocessing step. Dienes and Dubey (2003b; 2003a) achieved generally superior accuracy using a machine learning framework without having to refer to the syntactic structure in the skeletal parses. They described their approach as a pre-processing step for parsing because they only use as features morpho-syntactic clues (passives, gerunds and to-infinitives) that can be found in certain function words and part-of-speech tags. Even better results, however, were obtained by Campbell (2004) in a postprocessing step that makes use of rules inspired by work in theoretical linguistics. Gabbard et al. (2006) reported further improvement largely by recasting the Campbell rules as features to seven different machine learning classifiers.

We adopted a machine-learning based postprocessing approach based on insights gained from prior work in English and on Chinese-specific considerations. All things being equal, we believe that a machine learning approach that can exploit partial information is more likely to succeed than deterministic rules that have to make reference to morpho-syntactic clues such as to-infinitives and gerunds that are largely non-existent in Chinese. Without these clues, we believe a preprocessing approach that does not take advantage of skeletal parses is unlikely to succeed either. The work we report here also builds on emerging work in Chinese EC detection. Yang and Xue (2010) reported work on detecting just the presence and absence of empty categories without further classifying them. Chung and Gildea (2010) reported work on detecting just a small subset of the empty categories posited in the Chinese TreeBank. Kong and Zhou (2010) worked on Chinese zero anaphora resolution, where empty category detection is a subtask. More recently, Cai et al. (2011) successfully integrated EC detection into phrase-structure based syntactic parsing and reported state-of-the-art results in both English and Chinese.

6 Conclusions and Future Work

We described a novel approach to detecting empty categories (EC) represented in dependency trees and a new metric for measuring EC detection accuracy. The new metric takes into account not only the position and type of an EC, but also the head it is a dependent of in a dependency structure. We also proposed new features that are more suited for this new approach. Tested on a subset of the Chinese Treebank, we show that our system improved significantly over the best previously reported results despite using a more stringent evaluation metric, with most of the gain coming from an improved representation. In the future, we intend to work toward resolving ECs to their antecedents when EC detection can be done with adequate accuracy. We also plan to test our approach on the Penn (English) Treebank, with the first step being converting the Penn Treebank to a dependency representation with the ECs preserved.

Acknowledgments

This work is supported by the National Science Foundation via Grant No. 0910532 entitled "Richer Representations for Machine Translation". All views expressed in this paper are those of the authors and do not necessarily represent the view of the National Science Foundation.

References

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 306–311.

Shu Cai, David Chiang, and Yoav Goldberg. 2011. Language-independent parsing with empty elements. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 212–216, Portland, Oregon, USA, June. Association for Computational Linguistics.

Richard Campbell. 2004. Using linguistic principles to recover empty categories. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

E. Charniak. 2001. Immediate-head parsing for language models. In ACL-01.

Tagyoung Chung and Daniel Gildea. 2010. Effects of empty categories on machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 636–645, Cambridge, MA.

Michael Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Péter Dienes and Amit Dubey. 2003a. Antecedent recovery: Experiments with a trace tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan.

Péter Dienes and Amit Dubey. 2003b. Deep syntactic processing by combining shallow methods. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.

Ryan Gabbard, Seth Kulick, and Mitchell Marcus. 2006. Fully parsing the Penn Treebank. In Proceedings of HLT-NAACL 2006, pages 184–191, New York City.

D. Gildea and D. Jurafsky. 2002. Automatic labeling for semantic roles. Computational Linguistics, 28(3):245–288.

Dan Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of the 40th Meeting of the Association for Computational Linguistics, Philadelphia, PA.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4-5, Boulder, Colorado, USA.

James C.T. Huang. 1984. On the distribution and reference of empty pronouns. Linguistic Inquiry, 15:531–574.

Mark Johnson. 2002. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Fang Kong and Guodong Zhou. 2010. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts.

Roger Levy and Christopher Manning. 2004. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. In Proceedings of the ACL.

M. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Vasin Punyakanok, Dan Roth, and W. Yih. 2005. The necessity of syntactic parsing for semantic role labeling. In Proceedings of IJCAI-2005, pages 1124–1129, Edinburgh, UK.

Helmut Schmid. 2006. Trace prediction and recovery with unlexicalized PCFGs and slash features. In Proceedings of ACL.

Nianwen Xue and Martha Palmer. 2009. Adding semantic roles to the Chinese Treebank. Natural Language Engineering, 15(1):143–172.

Nianwen Xue and Fei Xia. 2000. The Bracketing Guidelines for Penn Chinese Treebank Project. Technical Report IRCS 00-08, University of Pennsylvania.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Yaqin Yang and Nianwen Xue. 2010. Chasing the ghost: Recovering empty categories in the Chinese Treebank. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China.
