Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing

by

Shumin Wu

B.S., Duke University, 1998

M.S., Duke University, 2003

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2015

This thesis entitled:
Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing
written by Shumin Wu
has been approved for the Department of Computer Science

Prof. Martha Palmer

Prof. James Martin

Prof. Wayne Ward

Prof. Daniel Gildea

Prof. Nianwen Xue

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Wu, Shumin (Ph.D., Computer Science)

Leveraging Semantic Similarity in Parallel Corpora for Natural Language Processing

Thesis directed by Prof. Martha Palmer

This thesis introduces an alignment-based approach for mapping PropBank predicate-argument structures across languages. We used Chinese and English as the language pair for the study and found our approach was able to reliably predict alignments between predicates and their arguments in parallel sentence pairs, even when faced with incorrect word alignment input. Furthermore, by modeling the predicate-to-predicate and argument-to-argument probabilities over a large unannotated parallel corpus and using an expectation maximization (EM) based approach, we were able to further improve the alignment performance of the system.

As part of the semantic mapping system, we also developed both a Chinese and an English constituent-based semantic role labeler (SRL) for arguments of verbal and non-verbal predicates.

By building topic model and distributional similarity based selectional preferences for each language, as well as using techniques such as support predicate identification and multi-stage classification, we were able to achieve what we believe is the state-of-the-art Chinese SRL performance and one of the best English SRL performances using a single constituent tree input. Moreover, by improving the performance of Chinese nominal SRL by over 2 F points, we demonstrated that selectional preferences can significantly improve SRL when the argument candidates are not well constrained by syntax.

As a case study, we successfully applied our semantic mapping system to aligning Chinese dropped pronouns to English text as a way of understanding when a pronoun would need to be inserted in the English output in place of the dropped pronoun during Chinese-English machine translation.

In the process, we also demonstrated that semantic knowledge is essential for recovering Chinese dropped pronouns by producing a state-of-the-art SRL-enhanced Chinese empty category recovery system.

Dedication

This dissertation is dedicated to my parents, Wei Xiong Wu and Guoqin Jia, who have loved, supported, and encouraged me throughout my life. It is also dedicated to my wife, Simone Liu, who has loved and patiently supported me while I struggled to finish this work.

Acknowledgements

We gratefully acknowledge the support of the following grants. Any contents expressed in this material are those of the authors and do not necessarily reflect the views of the grant agency.

• National Science Foundation CISE-IISRI-0910992, Richer Representations for Machine Translation

• DARPA FA8750-09-C-0179 (via BBN) Machine Reading: Ontology Induction: Semlink+ and AMR

• DARPA HR0011-11-C-0145 (via LDC) BOLT

This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver and the National Center for Atmospheric Research.

Contents

Chapter

1 Introduction 1

1.1 Motivation ...... 1

1.2 Semantic representation ...... 3

1.3 Semantic mapping ...... 3

1.4 Automatic semantic role labeling ...... 4

1.5 Chinese dropped pronouns ...... 5

2 Background and Related Work 8

2.1 Semantic mapping ...... 8

2.2 Leveraging parallel corpora ...... 9

2.3 PropBank Semantic role labeling ...... 10

2.3.1 Chinese SRL ...... 11

2.3.2 Selectional Preference for SRL ...... 12

2.3.3 Topic Model based Selectional Preference ...... 13

2.4 Empty category recovery ...... 14

3 Semantic mapping 16

3.1 General approach ...... 16

3.2 Mapping predicate-arguments ...... 17

3.2.1 Argument mapping ...... 17

3.2.2 Word alignment based argument mapping ...... 19

3.2.3 One-to-one predicate-argument mapping ...... 22

3.3 Building a mapping probability model ...... 23

3.3.1 Predicate-to-predicate mapping probability ...... 23

3.3.2 Argument-to-argument mapping probability ...... 23

3.4 Probabilistic mapping ...... 24

3.5 Experiment ...... 25

3.5.1 Reference predicate-argument mapping ...... 25

3.5.2 Parser, SRL, word aligner ...... 26

3.5.3 Predicate-argument mapping results ...... 26

3.5.4 Mapping coverage ...... 30

3.6 Chinese VerbNet Mapping ...... 30

3.6.1 Setup ...... 31

3.6.2 Results ...... 32

3.7 Discussion ...... 33

4 Semantic Role Labeling 35

4.1 Baseline Approach ...... 35

4.1.1 Features ...... 36

4.1.2 Classifiers ...... 39

4.2 Improvements to the Baseline SRL system ...... 39

4.2.1 Support Identification ...... 40

4.2.2 2-stage Argument Label Classification ...... 41

4.2.3 Selectional Preference ...... 42

4.3 Topic Model based Selectional Preference ...... 42

4.3.1 Representation ...... 42

4.3.2 SRL Filtering ...... 43

4.3.3 Multi-lingual SRL filtering ...... 43

4.3.4 SP with LDA-based Topic Model ...... 44

4.3.5 SP Extraction Steps ...... 45

4.3.6 Topic Model feature for SRL ...... 46

4.4 Distributional Similarity based Selectional Preference ...... 46

4.4.1 Approach ...... 46

4.4.2 Distributional Similarity Corpus ...... 47

4.4.3 Selectional Preference Measures ...... 48

4.4.4 SRL Integration ...... 49

4.5 Experiment ...... 49

4.5.1 Setup ...... 49

4.5.2 Performance ...... 50

4.5.3 Automatic SRL Error Analysis ...... 55

4.6 Discussion ...... 55

5 Chinese Empty Category Recovery 60

5.1 Chinese dropped subject mapping to English ...... 60

5.1.1 Motivation ...... 60

5.1.2 Framework ...... 61

5.1.3 Mapping semantic roles between Chinese and English ...... 61

5.1.4 Heuristics ...... 62

5.1.5 Experiment ...... 67

5.1.6 Heuristics distribution ...... 68

5.1.7 Heuristics performance ...... 69

5.1.8 Analysis ...... 72

5.2 Chinese empty category recovery with parallel corpora ...... 73

5.2.1 Monolingual implementation ...... 73

5.2.2 Parallel feature enhancement ...... 75

5.2.3 Experiment ...... 75

5.3 Dependency-based empty category recovery ...... 76

5.3.1 Implementation ...... 77

5.3.2 Experiment ...... 78

5.4 Discussion ...... 80

6 Summary 83

Bibliography 87

Appendix

A SRL Comparison 95

Tables

Table

3.1 Chinese argument type (column) to English argument type (row) mapping on triple-gold Xinhua corpus ...... 20

3.2 Predicate-argument mapping results ...... 27

3.3 Predicate-argument mapping coverage. Predicate coverage denotes the number of mapped predicates over all predicates in the corpus, word coverage denotes the number of words in the mapped predicate-arguments over all words in the corpus ...... 30

3.4 Results of English verbs with VerbNet annotation projected back to themselves through Chinese verbs. The restricted rows indicate restricting the verb pair output of the predicate-argument mapping methods to only those English/Chinese verb predicates that are also word aligned (rather than just induced through the argument mappings) ...... 32

4.1 Topics in Chinese Gigaword ...... 50

4.2 Chinese PropBank 1.0 verb results ...... 51

4.3 Chinese PropBank 1.0 results ...... 51

4.4 Chinese PropBank 1.0 argument performance ...... 52

4.5 Chinese PropBank 3.0 out-of-genre results ...... 53

4.6 Chinese SRL comparison ...... 53

4.7 Topics in English Gigaword ...... 58

4.8 English SRL comparison (CoNLL-2005 WSJ) ...... 58

4.9 English SRL argument performance (CoNLL-2005 WSJ) ...... 59

5.1 *pro* alignment (including & excluding alignment to English empty category (EC)) results on broadcast conversation. The baseline system is the Berkeley Aligner output trained with *pro* and *PRO* in place. The heuristic auto output is the heuristics applied using automatic word alignment (Berkeley Aligner). The heuristic gold output is the heuristics applied using gold standard word alignment annotation. ...... 70

5.2 Inter-annotator agreement F-score vs system F-score (averaged between 2 annotators) on *pro* alignment ...... 71

5.3 *pro* alignment heuristics F-score using GIZA++, Berkeley, and gold word alignment compared against human annotation on the entire broadcast conversation portion ...... 72

5.4 EC results of bilingual and monolingual systems (trained on the parallel sentences and all of OntoNotes 4.0) ...... 76

5.5 EC results compared to Xue and Yang (paper results and results considering all 1838 EC instances in the test corpus) ...... 79

5.6 EC results comparing using no SRL features and using different SRL system outputs 80

5.7 Empty category confusion matrix for full-featured SRL system. Each row represents the gold EC type and the prediction counts of the automatic system. ...... 81

A.1 SRL Comparisons ...... 99

Figures

Figure

3.1 Chinese predicate-arguments mapping example ...... 16

3.2 Wrong mapping caused by word alignment error ...... 28

3.3 Corrected mapping based on alignment probability ...... 29

4.1 SRL annotation for make.01 ...... 36

4.2 Chinese-English sentence pair w/ SRL annotation of let and made ...... 40

4.3 Chinese nominal predicate translated to English verb predicate ...... 41

4.4 LDA plate diagram ...... 45

5.1 Distribution of *pro* heuristics applied to broadcast conversation. “N/A” denotes the number of mapped *pro*s not applicable by any of the heuristics. ...... 68

5.2 Comparison of frequency of heuristic application between the broadcast conversation (bc) and Xinhua News (nw) corpora ...... 69

5.3 Effects of SRL: correct Arg1 labeling led the system to insert *T* as the object of participate. This in turn disambiguated the missing EC subject type of participate as *pro*. ...... 81

Chapter 1

Introduction

1.1 Motivation

With the availability of large annotated corpora, comprehensive lexical resources and advances in statistical machine learning techniques, many strides have been made in natural language processing (NLP) for English as well as some of the other popular European languages. However, not all languages have the same lexical resources or large annotated corpora for certain language specific phenomena. An example of this is Chinese. While the recent interest in Chinese-English machine translation has prompted Treebank-ing and PropBank-ing of more diverse genres of Chinese text, a large portion with parallel English text (OntoNotes release 4.0 [39]), some areas are less well addressed: Chinese lacks the rich lexical resources similar to VerbNet and FrameNet for English; there is not a sufficiently large or diverse annotated corpus for Chinese temporal relations or for dropped pronouns. These shortcomings have been a road block to higher quality automatic systems in semantic parsing, temporal resolution, empty category recovery, and coreference resolution that would be helpful for question answering, machine translation, and other NLP applications.

One approach to address these issues is through inference from another language that is more explicit or has more comprehensive resources. For example, grammatical tense in English is a very useful feature in temporal resolution, while the lack of tense in Chinese makes the task much more difficult. But with parallel bi-text, we can mark “tense” in a Chinese verb using the tense of the counterpart English verb and thereby improve Chinese temporal resolution. Similarly, compatible pronoun types are a good indicator that two entities may be coreferent. By identifying the pronoun type of a Chinese dropped pronoun through parallel bi-text, we may be able to improve Chinese coreference/zero-anaphora resolution and machine translation (whereby the correct pronoun type can be inserted for a non-pro-drop target language like English). Fortunately, parallel bi-text is typically much more abundant than annotated text (many millions of Chinese-English parallel sentences versus around 40K Treebank-ed and PropBank-ed Chinese sentences in OntoNotes release 4.0).

Although similarity inference can be performed at the word level or at the syntax level, differences between some language pairs make them less suitable for certain tasks. In the case of Chinese dropped pronoun recovery, automatic word alignment is of limited use. While it may reveal which English pronouns in the sentence are unaligned (and as we'll later show, Chinese dropped pronouns may also align to other English words), it provides little direct information as to where the dropped pronouns are in the Chinese sentence. Also, automatic word alignment tends to favor word pairs with high co-occurrence frequency and often misses rarer translation pairs even if they serve similar syntactic or semantic purposes. As we have illustrated previously [85], the syntactic structure of Chinese and the English translation can be quite different: Chinese verb predicates (especially verb adjectives (VA)) are often not the parallel of verb predicates in English. And if active voice becomes passive in the translation, then the syntactic subject of the source predicate may not parallel the syntactic subject of the translation. Therefore, we propose a framework of inducing semantic similarity, hypothesizing that comparing predicate-argument structures can abstract away from these language specific syntactic variations and provide more robust features for downstream NLP applications.

We propose Chinese and English as the two languages for a case study on how we can leverage semantic similarity in parallel corpora to improve NLP applications. Specifically, we plan to use existing English verb class resources to bootstrap a set of Chinese verb classes through semantic mapping. We also plan to study ways to improve Chinese dropped pronoun recovery/identification through semantic mapping of English, a non-pro-drop language.

1.2 Semantic representation

While there are a number of different semantic representations, such as FrameNet or Abstract Meaning Representation [2], we chose PropBank style predicate-argument structures based on the availability of a large quantity of annotated data for Chinese and English as well as the higher accuracy of automatic semantic role labeling systems for PropBank.

In the PropBank semantic representation, each predicate is annotated with arguments to which it has different semantic relations. A predicate is typically a verb, although eventive nouns as well as adjective predicates have also been annotated recently. Arguments are phrases, with the label identifying the type of semantic role (relation) to the predicate. Both English and Chinese PropBanks have a number of core argument types, with Arg0 representing the Agent and Arg1 representing the Patient or Theme of the predicate, for example. The English and Chinese PropBanks differ slightly on the modifier types. Theoretically, the core argument types (as well as many of the modifier argument types) are not language specific. Although, as we'll see, between Chinese and English, argument mapping is frequently not determined by matching argument labels.

1.3 Semantic mapping

While applications like question answering systems have enjoyed the benefit of high quality automatic semantic role labeling (to answer questions such as who did what to whom), semantic consistency has only recently become a focal point in machine translation. This is evident from the work that uses word sense disambiguation to select the correct target translation [13] and reordering/reranking of MT output based on semantic consistencies [84, 12], as well as MT evaluation metrics based on matching scores of semantic components between the source language and the translation [55]. With this new focus on multi-lingual semantics, the need for a comprehensive semantic mapping system has become apparent.

We introduce a predicate-argument mapping system based on word alignment and PropBank style semantic role labeling on both the source and target language. We begin by using GIZA++ [62] and the Berkeley Aligner [48], two of the more popular word alignment tools, to obtain automatic word alignments between parallel English/Chinese corpora. To achieve a broader coverage of semantic mappings than just those annotated in the parallel PropBank-ed corpora, we attempt to map automatically generated predicate-argument structures (using Chinese and English SRL systems developed on our own). For each Chinese and English verb predicate pair within a parallel sentence, we examine the quality of both the predicate and argument alignment (using automatic word alignment output) and devise a many-to-many argument mapping technique. From that, we pose predicate-argument mapping as a linear assignment problem (optimizing the total similarity of the mapping) and solve it with the Kuhn-Munkres method [47]. By modeling the predicate-to-predicate and argument-to-argument probabilities over a large unannotated parallel corpus, we show how we can improve the purely word alignment based mapping approach through expectation maximization (EM) of predicate-argument mapping scores over the entire corpus.

The resulting mapping system, when using automatic SRL input, achieved performance close to the one using gold standard SRL annotation on predicate-to-predicate mapping (83.87 vs 87.31 F score). Moreover, by using the mapping probability model, we were able to further improve the results with both types of SRL input, even though the probability model was generated using automatic SRL. We'll detail our mapping system in chapter 3.

1.4 Automatic semantic role labeling

Since the performance of our semantic mapping approach relies on quality SRL inputs, we also developed both a Chinese and an English constituent-based semantic role labeler (SRL) for arguments of verbal and non-verbal predicates. We introduce 3 improvements to the standard SRL implementation:

support identification: We identify the predicate hierarchy in a sentence and process the arguments from top to bottom, so that the predicted arguments for a top level predicate can be used to predict arguments of a nested predicate. This can help predict arguments outside the local clause of the nested predicate but that are shared with the top level predicate (frequent with nominal predicates as well as predicates in a relative clause).

2-stage argument classification: We classify the argument labels in 2 stages, whereby the predicted argument labels of the first stage can be used as features to identify missing or duplicate argument types that the second stage classifier can learn to correct (see the sketch after this list). By filtering out low probability argument candidates after the first stage classification, the overall overhead of the 2-stage argument classification technique is modest.

topic model based selectional preference: We perform SRL on a large unannotated corpus and extract (argument label, head word) pairs for each predicate instance. We then model the latent topic distribution of the (argument label, head word) pairs using latent Dirichlet allocation (LDA) to improve prediction of out-of-vocabulary argument candidates.
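The 2-stage idea can be made concrete with a small sketch. This is an illustrative reconstruction, not the thesis implementation: the data layout (per-predicate lists of candidate feature dicts with gold labels) and the use of scikit-learn's logistic regression are assumptions made for the sketch.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    CORE = ["ARG0", "ARG1", "ARG2", "ARG3", "ARG4"]

    def stage2_features(base_feats, stage1_labels, idx):
        # copy the candidate's own features and add global label evidence
        feats = dict(base_feats[idx])
        feats["s1_label"] = stage1_labels[idx]
        for core in CORE:
            n = stage1_labels.count(core)
            feats["missing_" + core] = n == 0   # expected core arg absent?
            feats["dup_" + core] = n > 1        # core arg predicted twice?
        return feats

    def train_two_stage(instances):
        """instances: list of (candidate feature dicts, gold labels), one per predicate."""
        v1, v2 = DictVectorizer(), DictVectorizer()
        c1 = LogisticRegression(max_iter=1000)
        c2 = LogisticRegression(max_iter=1000)
        flat = [(f, g) for feats, gold in instances for f, g in zip(feats, gold)]
        c1.fit(v1.fit_transform([f for f, _ in flat]), [g for _, g in flat])

        s2_x, s2_y = [], []
        for feats, gold in instances:
            # stage-1 predictions over this predicate's whole candidate set
            s1 = list(c1.predict(v1.transform(feats)))
            for i, g in enumerate(gold):
                s2_x.append(stage2_features(feats, s1, i))
                s2_y.append(g)
        c2.fit(v2.fit_transform(s2_x), s2_y)
        return v1, c1, v2, c2

The stage-2 features expose, per candidate, whether a core label is missing or duplicated in the stage-1 output for the same predicate, which is exactly the kind of error the second classifier can learn to correct.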

With these improvements, we improve our Chinese SRL system from 75.99 to 76.40 F score on arguments of verb predicates, which we believe is the state-of-the-art Chinese SRL performance. Our English SRL performance was similarly improved and achieved over 80 F score using a single constituent tree input. Moreover, by improving the performance of Chinese nominal SRL by over 2 F points, we demonstrated that selectional preferences can significantly improve SRL when the argument candidates are not well constrained by syntax. We'll detail our SRL systems in chapter 4.

1.5 Chinese dropped pronouns

Chinese often allows a subject pronoun to be dropped under certain syntactic and pragmatic constraints (for instance, it should be relatively clear to a human reader what the subject is at a discourse level, although it may be ambiguous at the sentence level). In the Chinese Treebank, now in release 8.0, the location of the dropped pronoun is annotated with *pro* or *PRO*, as part of broader empty category annotation in the phrase structure tree. Not until recently has there been effort directed towards annotating the types of dropped pronouns [4], although the scale of this annotation is still relatively small.

Machine translation from a pro-drop language like Chinese to English has always presented a unique challenge: not only does the system need to identify the occurrence of a dropped subject, it must also insert the correct pronoun in the translation output when necessary. The importance of Chinese dropped pronoun recovery to machine translation output has been demonstrated by Chung and Gildea [18]. In their work, Chinese *PRO* and *pro* in the training sentences were inserted automatically through a conditional random field model (sequence modeling). GIZA++ was then trained on these empty category enhanced sentences and fed to the Moses MT system, thereby creating empty category aware phrase substitution rules. Even with a Chinese *pro* recovery F-score in the 40s, the resulting MT system achieved a statistically significant improvement in BLEU score.

While dropped pronoun identification has been studied as part of empty category recovery [94, 18, 10, 93], the state of the art results [10, 93] lag far behind empty category recovery performance in English: 66.0 labeled F-score vs 86.2 labeled F-score, and the performance on just *pro* is typically much lower [94, 18, 93].

With such a large difference in performance, we study an alternative approach that leverages Chinese-English parallel corpora. Since English is not a pro-drop language (and the few cases of omitted syntactic or logical subjects, such as a non-finite clause or a passive construction, are relatively easy to detect automatically), when the English translation is available, it may provide valuable clues to the occurrence as well as to the type of dropped subjects in the parallel Chinese sentence. Since a dropped subject serves as the subject of a verb predicate, by comparing the verb predicates and syntactic structure of the parallel sentences, the translated subject may be revealed. Therefore, we present a semantic similarity based framework using features from English text to help recover Chinese dropped subjects.

For recovering Chinese dropped pronouns in a monolingual environment, we observe that syntactic subjects of a verb are typically Arg0 or Arg1 in PropBank annotation. By comparing the predicted argument sets and the expected argument sets of a verb predicate, we may be able to infer whether there is a dropped subject. Therefore, we study the effect of SRL for Chinese empty category recovery.

By developing a set of alignment heuristics and applying them through semantic mapping, we were able to dramatically improve Chinese dropped subject to English text alignment performance. And with our enhanced SRL systems, we were able to achieve new state-of-the-art Chinese empty category recovery results. We'll detail our systems and findings in chapter 5.

Chapter 2

Background and Related Work

Our approach for leveraging semantic similarity in parallel corpora borrowed ideas from many mono-lingual and multi-lingual NLP areas. These include ideas of semantic mapping, using parallel corpora to improve mono-lingual NLP performance, PropBank style semantic role labeling for English and Chinese, selectional preference, and empty category recovery. We divided the related work into the following sections based on these areas.

2.1 Semantic mapping

Resnik [71] (and later Madnani et al. [56, 57]) was an early work proposing semantic similarity with triangulation between parallel corpora, although the idea of semantic similarity there is a looser definition of semantically similar/equivalent phrases.

Mareček [59] proposed aligning tectogrammatical trees, where only content (autosemantic) words are nodes, in a parallel English/Czech corpus to improve overall word alignment and thereby improve machine translation. The Czech corpus is first lemmatized because of the rich morphology, and then the word alignment is “symmetrized” (by merging unidirectional alignments). This approach does not explicitly make use of the predicate-argument structure to confirm the alignments or to suggest new ones.

Padó and Lapata [63, 64] used word alignment and syntax based argument similarity to project English FrameNet semantic roles to German. The approach relied on annotated semantic roles on the source side only, precluding joint inference of the projection using reference or automatic target side semantic roles.

Fung et al. [33] demonstrated that there is poor semantic parallelism between Chinese-English bilingual sentences. Their technique for improving Chinese-English predicate-argument mapping (ARG_{Chinese,i} → ARG_{English,j}) consists of matching predicates with a bilingual lexicon, computing cosine-similarity (based on lexical translation) of arguments and tuning on an unannotated parallel corpus. One restriction is the system only provided one-to-one mapping of core (numbered) arguments and may not be able to detect predicate mappings with no lexical relations that are nevertheless semantically related. Later, Wu and Fung [84] used parallel semantic roles to improve MT system outputs. Given the outputs from Moses [45], a machine translation decoder, they reordered the outputs based on the best predicate-argument mapping. The resulting system showed a 0.5 point BLEU score improvement even though the BLEU metric often discounts improvement in semantic consistency of MT output.

Choi et al. [17] showed how to enhance Chinese-English verb alignments by exploring predicate-argument structure alignment using parallel PropBanks. The system, using GIZA++ word alignment, deduced alternate verb alignments that showed improvement over pure GIZA++ alignment. Some of the limitations of the system are that it operated only on gold standard parses and semantic roles and did not provide explicit argument mapping between the aligned predicate-argument structures. Most of the improvement evaporated when it was used with a better word alignment system.

2.2 Leveraging parallel corpora

The idea of using constraints in parallel text to extract a large set of noisy training data has been researched for a number of NLP building blocks, including part-of-speech tagging, morphological analyzers, noun-phrase bracketers and named entity taggers [71]. Specific to English-Chinese parallel corpora, Resnik [71] used this idea for syntactic dependency and word sense disambiguation (where the translation text often disambiguated words in the source text). Issues specific to Chinese like number prediction [3] and tense prediction [53] have also borrowed this idea: Baran and Xue [3] used annotated parse trees and word alignment to induce number information in Chinese nouns and used that to train a classifier for improving Chinese number prediction, while Liu et al. [53] produced pseudo-training data from automatic parsing and word alignment to bolster training data for Chinese tense prediction. Notably, Liu et al. iteratively added a smaller set of higher confidence data from the entire set of noisy training data based on the agreement between the label derived from the current model and the assigned label of the noisy training data.

2.3 PropBank Semantic role labeling

There are many semantic representations and automatic semantic parsing systems. For a comparison between the different representations and systems, please refer to appendix A. For this thesis, I'll focus on PropBank semantic role labeling.

Gildea and Jurafsky [37] produced one of the first constituent-based automatic SRL systems using FrameNet [1] annotations. The syntactic features have been widely adopted by most constituent-based PropBank semantic role labeling systems [46, 61, 69, 82]. The general approach for most of the top performing PropBank SRL systems consists of extracting a number of lexical and syntactic features (including constituent tree path to predicate, subcategorization frame, etc.), then performing classification on argument identification and argument labeling (either as separate steps or in a single step). Usually, a joint inference step on the final set of output labels (to ensure the arguments don't overlap, label types do not duplicate, etc.) is performed.

Koomen et al. [46] used multiple parses and argument label classifiers (each outputting a set of argument labels w/ probability) and performed this joint inference by formulating the SRL output constraints as an integer linear programming problem, optimizing the sum of the probabilities of the chosen set of argument labels. The notable constraints used were: no overlapping or embedded arguments, no duplicate core arguments, core argument sets that do not violate the PropBank frameset definition for each verb predicate, and C-argument labels (continuation) and R-argument labels (relative pronoun) must have a matching argument label (ex: C-Arg1 must also have an Arg1 in the output, and Arg1 must appear before C-Arg1). This was the highest performing SRL system in the CoNLL-2005 shared task.

Pradhan et al. [69] produced one of the most widely available and used state-of-the-art SRL systems: ASSERT (which has also been reimplemented in C and adapted to Chinese SRL as C-ASSERT [34]). Of note is the use of SVM as the argument identification and labeling classifier. Since kernel-space classification requires computing a correlation matrix (that transforms the sparse input feature matrix into a dense kernel matrix) of training samples, leading to a large memory requirement when faced with a large quantity of training data, the system selects training samples iteratively based on how well the classifiers can correctly label the training data.

Moschitti et al. [61] used tree-kernel features in their SRL system, which, unlike many features used in SRL systems, can compare samples with similar but not identical sub-tree structures. However, a naive representation of all the possible sub-tree structures would lead to an explosion of the feature space. To solve this issue, a kernel-space classifier (SVM) is used: only tree structure comparisons need to be made between pairs of samples, and no explicit enumeration of sub-trees needs to be made. Compared to other SRL systems, the use of tree-kernels produced a higher performing SRL system when the amount of training data is limited (where matching feature values are rarer).

Collobert et al. [19] proposed a fast & compact neural network model for argument identification and labeling (SENNA). While the system does take advantage of phrase chunks as well as constituent parse features for the best performance, much of the gain of the system can be attributed to semi-supervised learning using a language model collected on large corpora (Wikipedia and Reuters news).

2.3.1 Chinese SRL

Xue and Palmer [91] produced one of the first Chinese SRL systems, largely based on English SRL systems but with heuristics that prune away subtrees from being considered for argument identification. However, because of the less accurate Chinese constituent parser and the smaller amount of training data, the results lag behind English SRL systems. C-ASSERT [34] is another Chinese SRL system. It has been used for scoring machine translation output based on predicate-argument matching [55].

More recently, Sun [78] has proposed a new set of Chinese specific syntactic features for Chinese SRL. The results show improvement over existing Chinese SRL systems on gold-standard parse input. It's unclear whether these features will translate to improvement on automatic parse input.

Zhuang and Zong [100] demonstrated that by performing Chinese SRL and English SRL on parallel corpora simultaneously, and constraining the output based on argument alignment, one can improve the output of both systems, even when one system, in this case English, has a lower baseline performance. While they improved the Chinese SRL performance by 1.52 F points, this gain can only be achieved during decoding when English parallel text is available.

2.3.2 Selectional Preference for SRL

Inducing selectional preferences from corpus data was first proposed by Resnik [70] for sense disambiguation. He generalized seen words using the WordNet [31] hierarchy. In particular, given a set of classes C from the WordNet nominal hierarchies, he defined the selectional preference strength of a (predicate, role) pair (p, r) using the Kullback-Leibler distance between the prior distribution P(C|r) and the posterior distribution P(C|p, r):

    SelStr(p, r) = \sum_{c \in C} P(c|p, r) \log \frac{P(c|p, r)}{P(c|r)}    (2.1)
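As a worked illustration of equation 2.1, the following sketch computes the preference strength from raw class counts. The counts and class names are toy values invented for the example; in Resnik's setting the classes come from the WordNet nominal hierarchy.

    import math
    from collections import Counter

    def sel_strength(posterior_counts, prior_counts):
        """posterior_counts: class -> freq for one (p, r); prior_counts: class -> freq for r."""
        n_pr, n_r = sum(posterior_counts.values()), sum(prior_counts.values())
        strength = 0.0
        for c, f in posterior_counts.items():   # classes unseen with (p, r) contribute 0
            p_post = f / n_pr                   # P(c|p, r)
            p_prior = prior_counts[c] / n_r     # P(c|r); assumes every seen class is in the prior
            strength += p_post * math.log(p_post / p_prior)
        return strength

    # toy example: classes filling ARG1 of "eat" vs. classes filling ARG1 overall
    print(sel_strength(Counter(food=40, person=2),
                       Counter(food=120, person=300, artifact=200)))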

Later models based on distributional similarity (word co-occurrence in a corpus) were studied by Pantel and Lin [67], Erk [28], and others. Pantel and Lin's dependency relationship based distributional similarity and pointwise mutual information based word similarity measures were particularly successful and were adopted by Zapirain et al. [98] and others. Erk [28] defined distributional similarity based selectional preference as

    SelPref(p, r, w_0) = \sum_{w \in Seen(p,r)} Sim(w_0, w) \cdot weight(p, r, w)    (2.2)

where Seen(p, r) is a set of permissible (p, r, w) tuples typically collected from a domain specific primary (training) corpus, and weight(p, r, w) may be uniform, the frequency count of the tuple, or some other measure. In later work, Erk et al. [29] proposed normalizing the weight with Z_{p,r} = \sum_{w \in Seen(p,r)} weight(p, r, w) to make the number of seen examples not matter.
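A minimal sketch of equation 2.2 under simplifying assumptions: Sim is cosine similarity over hypothetical co-occurrence vectors (dicts), and weight is the tuple frequency, optionally normalized by Z_{p,r} as in Erk et al. [29].

    import math

    def cosine(u, v):
        dot = sum(u.get(k, 0.0) * x for k, x in v.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def sel_pref(w0, seen, vectors, normalize=True):
        """seen: head word w -> freq of (p, r, w); vectors: word -> context vector dict."""
        z = sum(seen.values()) if normalize else 1.0   # Z_{p,r}
        return sum(cosine(vectors[w0], vectors[w]) * (f / z)
                   for w, f in seen.items())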

One of the earlier applications of selectional preference to automatic SRL is the system by Gildea and Jurafsky [37], where verb-direct object pairs were extracted from a large corpus and clustered based on the co-occurrence of the verb and the head word of the direct object. The cluster features were able to modestly improve the overall system. Zapirain et al. [98] improved the end-to-end performance of an English PropBank SRL system by 0.4 F1 points using a variety of word similarity measures. The best performing ones were backed by a dependency relationship based distributional similarity. They found that for English prepositional phrases, an argument had stronger selectional preference with the preposition than with the predicate.

2.3.3 Topic Model based Selectional Preference

Ritter and Etzioni [72] reasoned that the set of hidden variables modeled by latent Dirichlet allocation (LDA) naturally represents the semantic structure of a document collection. Therefore, the topics found can be viewed as the latent set of classes that store preferences. The system differs from the other SP work in that it models two sets of distributions for each topic simultaneously using LinkLDA [77]. This loosely encodes the mutual preference of a pair of arguments for the same predicate. If used in an SRL system, this type of selectional preference would go beyond simply judging the admissibility of each argument independently.

Ó Séaghdha and Korhonen [75] proposed two other LDA variants for selectional preference: ROOTH-LDA and LEX-LDA. Compared with LinkLDA, ROOTH-LDA models the preference of pairs of arguments with two sets of topics, one for each argument in the pair. On the other hand, LEX-LDA models the selectional preference probability of a word using a mixture of lexical-based and class-based probability distributions. This allows for idiosyncratic argument patterns that are best learned by observing that predicate's co-occurrences in isolation while still being able to generalize class-based argument patterns.

2.4 Empty category recovery

There is a wealth of work in recovering empty categories for English [42, 23, 24, 11, 35, 73, 10].

Johnson [42] proposed the first English empty category recovery work using pattern extraction (about 9000 tree fragment patterns) and pattern matching. Dienes and Dubey [23, 24] formulated the problem as a tagging task: each word is tagged with whether there's an empty element preceding it. They introduced a number of local features (word token n-gram, POS tag) and some subtree structure patterns and trained a MaxEnt classifier to produce the tags. Campbell's [11] system, on the other hand, was based on principles of Government-Binding theory that underlie the Treebank annotation and did not rely on a training corpus. Even so, it performed better than earlier corpus trained systems. Gabbard et al. [35] developed a 2 stage system: modifying the (Collins) parser to recover function tags during parsing and then performing empty category recovery using function tags and other features with supervised learning on a training corpus. Schmid [73] produced the then state-of-the-art English empty category recovery system using an unlexicalized PCFG parser and slash features. While the output of the unlexicalized parser cannot equal the performance of a lexicalized parser in overall parsing accuracy, it improved upon earlier empty category recovery results. Cai et al. [10] modified the Berkeley parser with word-lattice parsing (instead of parsing on a predetermined number of terminals). The system achieved the best result on both overall parsing accuracy and accuracy on empty elements for both English and Chinese.

For Chinese, only recently has there been an equally concerted effort on empty category recovery. Yang and Xue [94] proposed inserting empty categories, after performing syntactic parsing, as a tagging task, similar to the approach of Dienes and Dubey [23] on English. While the proposed syntactic features made a big difference on gold standard parses, they weren't nearly as helpful on automatically generated parses. Chung and Gildea [18] proposed a number of approaches including pattern matching, conditional random fields (sequence modeling), and integrated parsing and empty category recovery (where the empty category is encoded in the subtree during training). They found the heavily lexicalized CRF approach to work best on *pro* while the parsing and pattern matching approaches worked better for *PRO*. More significantly, by training GIZA++ with automatically inserted Chinese *pro* and *PRO*, thereby creating empty category aware phrase tables in Moses, they achieved improved MT output. Cai et al. [10] produced the best performing system for both English and Chinese empty category recovery at the time; the performance difference between the 2 languages, however, is very large: 86.2 labeled F-score for English and 58.6 F-score for Chinese. In fairness, these numbers are not on the same parallel corpus so it may not be a valid comparison (our English SRL system performed better on Wall Street Journal but not as well on the English translation of Xinhua News when compared to our Chinese SRL system).

Xue and Yang [93] achieved a new state-of-the-art Chinese empty category recovery performance by modeling the dependency relationship of the empty element and its parent. Instead of deciding only whether an empty element should occur between 2 words, they also looked at whether the potential parent of the empty element is missing a particular type of dependent. With this structural constraint, they improved upon the previous state-of-the-art results by 7.4 F points. However, much of the gain came from better recovery of empty elements in relative clause constructions. The labeled *pro* recovery F-score is only in the low 20s.

Chapter 3

Semantic mapping

3.1 General approach

The foundation of this thesis relies on finding the corresponding semantic components between the source and target languages. We chose PropBank style predicate-argument structures as the semantic representation (as opposed to FrameNet or VerbNet) based on the availability of large quantities of annotated data for English and Chinese as well as the higher accuracy of automatic semantic role labeling systems.

Given a parallel sentence pair, we would like to find the corresponding PropBank predicate- arguments mapping between the sentences as illustrated by figure 3.1.

Figure 3.1: Chinese predicate-arguments mapping example

Our framework of semantic mapping requires the following components: monolingual automatic semantic role labelers in both the source and target languages and an automatic word aligner for the parallel corpora to link the semantic roles between the languages. We chose to develop our own constituent based semantic role labeling system while taking advantage of existing automatic word aligners such as GIZA++ and the Berkeley Aligner. We also used the Berkeley phrase-structure parser as it supports parsing of both English and Chinese text.

3.2 Mapping predicate-arguments

3.2.1 Argument mapping

To produce a good predicate-argument mapping, we needed to consider 2 things: whether a good argument mapping can be produced based on argument type only, and whether each argument maps to only one argument in the target language.

3.2.1.1 Predicate-dependent argument mapping

Theoretically, PropBank numbered arguments are supposed to be consistent across predicates: ARG0 typically denotes the agent of the predicate and ARG1 the theme. While this consistency may hold true for predicates in the same language, as Fung et al. [33] noted, it is not a reliable indicator when mapping predicate-arguments between Chinese and English. For example, when comparing the PropBank frames of the English verb arrive and the synonymous Chinese verb 抵达, we see ARG1 (entity in motion) for arrive.01 is equivalent to ARG0 (agent) of 抵达.01 while ARG4 (end point, destination) is equivalent to ARG1 (destination).

3.2.1.2 Many-to-many argument mapping

Just as there are shortcomings in assuming predicate independent argument mappings, assuming one-to-one argument mapping may also be overly restrictive. For example, in the following Chinese sentence:

大 通道 建设 搞活 了 大 西南 的 物流
big passage construction invigorated big southwest's material flow

the predicate 搞活 (invigorate) has 2 arguments:

• ARG0: 大 通道 建设 (big passage construction)

• ARG1: 大 西南 的 物流 (big southwest’s material flow)

In the parallel English sentence:

Construction of the main passage has activated the flow of materials in the great southwest

activate has 3 arguments:

• ARG0: construction of the main passage

• ARG1: the flow of materials

• ARGM-LOC: in the great southwest

In these parallel sentences, ARG1 of 搞活 should be mapped to both ARG1 and ARGM-LOC of activate.

While the English translation of 搞活, invigorate, is not a direct synonym of activate, they at least have some distant relationship as indicated by sharing the inherited hypernym make in the WordNet [31] database. The same cannot be said for all predicate-pairs. For example, in the following parallel sentence fragments:

街 上 客流 如 潮
on the street people flow like the tide

the Chinese predicate-argument structure for 如 (like) is:

• ARG0: 客流 (flow of guests)

• ARG1: 潮 (tide)

• ARGM-LOC: 街 上 (on the street)

while the English predicate-argument structure for flow is:

• ARG1: people

• ARGM-LOC: on the street

• ARGM-MNR: like the tide

Semantically, the predicate-argument pairs are equivalent. The argument mapping, however, is more complex:

• 如.ARG0 ⇐⇒ flow.ARG1, flow.V

• 如.V, 如.ARG1 ⇐⇒ flow.ARGM-MNR

• 如.ARGM-LOC ⇐⇒ flow.ARGM-LOC

Table 3.1 (page 20) details the argument mapping for the triple-gold Xinhua data. The mapping distribution for ARG0 and ARG1 is relatively deterministic (and similar to ones found by Fung et al. [33]). Mappings involving ARG2-5 and modifier arguments, on the other hand, are much more varied. Typically, when there is a many-to-many argument mapping, it's constrained to a one-to-two or two-to-one mapping. Much more rarely is there a case of a two-to-two or even more complex mapping. A closer look at the argument mapping output revealed that while a lot of the diversity can be attributed to PropBank annotation differences between English and Chinese (the current effort on nominal and adjective predicate annotation expansion for the 2 languages should bring the SRL mapping closer together), much of it is caused by inherent differences between the 2 languages as well as translation choices. We plan to perform a more detailed analysis in [87] and other possible follow-on works.

3.2.2 Word alignment based argument mapping

To achieve optimal mappings between parallel predicate-argument structures, we would like to maximize the number of words in the mapped argument set (over the entire set of arguments) while minimizing the number of unaligned words in the mapped argument set.

Let a_{i,C} and a_{j,E} denote an argument in Chinese and English respectively, A_{I,C} and A_{J,E} a set of mapped Chinese and English arguments respectively, W_{i,C} the words in argument a_{i,C}, and map_E(a_{i,C}) = W_{i,E} the word alignment function that takes the source argument and produces a set of words in the target language sentence.

arg type  Arg0  Arg1  Arg2  Arg3  Arg4  ADV  BNF  DIR  DIS  EXT  LOC  MNR  PRP  TMP  TPC     V
Arg0      1610    79    25     0     0   28    1    0    0    0    8    5    1   11    1     9
Arg1       432  2665   128    11     0   83    9   12    0    0   29   12    5   21    3   142
Arg2        43   310   140     8     3   55    6    9    0    2   20   10    1    4    1    67
Arg3         2    14    21     7     0    2    4    2    0    0    1    2    1    0    1     4
Arg4         1    37     9     3     6    0    0    0    0    0    1    0    1    0    0     4
ADV         33    36     9     6     0  307    2    5    6    0   44  121    6   11    2    19
CAU          1     0     0     0     0    1    0    0    0    0    0    0   16    0    0     1
DIR          1    13     3     2     0    1    0    3    0    0    3    0    0    0    0    20
DIS          2     0     0     0     0   69    0    0   40    0    2    1    3    3    0     0
EXT          0     4     0     0     0   26    0    0    0    0    0    0    0    0    0     2
LOC         23    65    13     1     0    3    1    0    0    0  162    0    0    5    0     4
MNR          9     9     5     0     0  260    0    0    0    1    3   34    0    0    0    25
MOD          1     0     0     0     0  159    0    0    0    0    0    0    0    0    0    84
NEG          0     0     0     0     0   24    0    0    0    0    0    0    0    0    0     5
PNC          3    23    11     4     0    1    6    1    0    0    1    2   35    2    0     8
PRD          0     0     3     0     0    0    0    0    0    0    0    0    0    0    0     1
TMP         14    21     2     0     0  235    0    3    0    1    8   16    0  647    0     6
V           25    28    22     1     0  211    1    0    1    0    2   12    0    0    0  3278

Table 3.1: Chinese argument type (column) to English argument type (row) mapping on triple-gold Xinhua corpus

We define precision as the fraction of aligned target words in the mapped argument set:

    P_{I,C} = \frac{|(\bigcup_{i \in I} map_E(a_{i,C})) \cap (\bigcup_{j \in J} W_{j,E})|}{|\bigcup_{i \in I} map_E(a_{i,C})|}    (3.1)

and recall as the fraction of source words in the mapped argument set:

    R_{I,C} = \frac{\sum_{i \in I} |W_{i,C}|}{\sum_{\forall i} |W_{i,C}|}    (3.2)

We then choose the A_{I,C} that optimizes the F1-score of P_{I,C} and R_{I,C}:

    A_{I,C} = \arg\max_I F_{I,C}, \quad F_{I,C} = \frac{2 \cdot P_{I,C} \cdot R_{I,C}}{P_{I,C} + R_{I,C}}    (3.3)

Finally, to constrain both the source and target argument sets, we optimize:

    A_{I,C}, A_{J,E} = \arg\max_{I,J} F_{IJ}, \quad F_{IJ} = \frac{2 \cdot F_{I,C} \cdot F_{J,E}}{F_{I,C} + F_{J,E}}    (3.4)

To measure similarity between a single pair of source and target arguments, we define:

    P_{ij} = \frac{|map_E(a_{i,C}) \cap W_{j,E}|}{|map_E(a_{i,C})|}, \quad R_{ij} = \frac{|map_C(a_{j,E}) \cap W_{i,C}|}{|map_C(a_{j,E})|}    (3.5)

which involves computing the proportion of the aligned words between the argument pair over all aligned words to the other language for either argument.

To generate the set of argument mapping pairs, we simply choose all pairs of a_{i,C}, a_{j,E} ∈ A_{I,C}, A_{J,E} where F_{ij} ≥ ε (ε > 0).

Directly optimizing equation 3.4 requires an exhaustive search of all argument set combinations (2^{|a_{i,C}|} \cdot 2^{|a_{j,E}|}) between the source and target. While the typical number of arguments for each predicate is relatively small, this is still quite inefficient. We performed the following greedy-based approximation with quadratic complexity:

(1) Compute the best (based on F-score of equation 3.5) pair of source-target argument mappings for each source argument (target argument may be reused)

(2) Select the remaining argument pair with the highest F-score

(3) Insert the pair in A_{I,C}, A_{J,E} if it increases F_{IJ}, else discard

(4) Repeat until all argument pairs are exhausted

(5) Repeat 1-4 reversing the source and target direction

(6) Merge the output of the 2 directions

Much like GIZA++ word alignment where the output of each direction produces only one-to-many mappings, merging the output of the two directions produces many-to-many mappings.
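The following sketch illustrates the greedy approximation under simplifying assumptions; it is not the thesis implementation. Arguments are frozensets of token indices; align_st and align_ts are hypothetical helpers that return the word-aligned indices on the other side; and the two directional passes (steps 5-6) are collapsed into a single sorted sweep over all candidate pairs.

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def pair_f(a_c, a_e, align_st, align_ts):
        m_e, m_c = align_st(a_c), align_ts(a_e)          # eq. 3.5
        p = len(m_e & a_e) / len(m_e) if m_e else 0.0
        r = len(m_c & a_c) / len(m_c) if m_c else 0.0
        return f1(p, r)

    def side_f(sel_src, sel_tgt, all_src, align_st):
        # eqs. 3.1-3.3 for one side, over the currently selected argument sets
        mapped = set().union(*map(align_st, sel_src)) if sel_src else set()
        tgt_words = set().union(*sel_tgt) if sel_tgt else set()
        p = len(mapped & tgt_words) / len(mapped) if mapped else 0.0
        r = sum(map(len, sel_src)) / sum(map(len, all_src))
        return f1(p, r)

    def greedy_map(src_args, tgt_args, align_st, align_ts):
        cand = sorted(((pair_f(a, b, align_st, align_ts), a, b)
                       for a in src_args for b in tgt_args),
                      key=lambda t: t[0], reverse=True)
        chosen, best = [], 0.0
        for f, a, b in cand:
            if f == 0.0:
                break
            trial = chosen + [(a, b)]
            f_c = side_f([x for x, _ in trial], [y for _, y in trial],
                         src_args, align_st)
            f_e = side_f([y for _, y in trial], [x for x, _ in trial],
                         tgt_args, align_ts)
            if f1(f_c, f_e) > best:                      # keep only if F_IJ rises (eq. 3.4)
                chosen, best = trial, f1(f_c, f_e)
        return chosen

Because a source argument may pair with several target arguments (and vice versa), the selected set is naturally many-to-many, mirroring the merged bidirectional output described above.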

3.2.3 One-to-one predicate-argument mapping

To find the best predicate-argument mapping between Chinese and English parallel sentences, we assume each predicate in a Chinese or English sentence can only map to one predicate in the target sentence. As noted by Wu et al. [85], this assumption is mostly valid for the Xinhua news corpus, though occasionally, a predicate from one sentence may align more naturally to two predicates in the target sentence. This typically occurs with verb conjunctions. For example, the Chinese phrase “观光 旅游” (sightseeing and tour) is often translated to the single English verb “travel”. As noted by Xue and Palmer [92], the Chinese PropBank annotates predicative adjectives, which tend not to have an equivalent in the English PropBank. Additionally, some verbs in one language are nominalized in the other. This results in a good portion of Chinese or English predicates in parallel sentences not having an equivalent in the other language.

With the one-to-one mapping constraint, we optimize the mapping by maximizing the sum of the F1-scores (as defined by equation 3.4) of the predicates and arguments in the mapping. Let pred_{I,C} and pred_{J,E} denote the sets of predicates in Chinese and English respectively. With G(pred_{I,C}, pred_{J,E}) = {g : pred_{I,C} → pred_{J,E}} as the set of possible mappings between the two predicate sets, the optimal mapping is:

    g^* = \arg\max_{g \in G} \sum_{i,j \in g} F_{ij}    (3.6)

To turn this into a classic linear assignment problem, we define Cost(pred_{i,C}, pred_{j,E}) = 1 − F_{ij}, and (3.6) becomes:

    g^* = \arg\min_{g \in G} \sum_{i,j \in g} Cost(pred_{i,C}, pred_{j,E})    (3.7)

(3.7) can be solved in polynomial time with the Kuhn-Munkres algorithm [47].
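For illustration, the assignment step can be reproduced with SciPy's implementation of the Hungarian (Kuhn-Munkres) algorithm. The score matrix and cutoff below are made-up stand-ins for the F_{ij} values of equation 3.4, not values from the thesis.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def map_predicates(pair_scores, threshold=0.2):
        """pair_scores[i][j] stands in for F_ij; rows = Chinese, cols = English."""
        cost = 1.0 - np.asarray(pair_scores)       # Cost(pred_i, pred_j) = 1 - F_ij
        rows, cols = linear_sum_assignment(cost)   # Kuhn-Munkres, polynomial time
        # discard low-similarity assignments so some predicates can stay unmapped
        return [(int(i), int(j)) for i, j in zip(rows, cols)
                if pair_scores[i][j] >= threshold]

    print(map_predicates([[0.9, 0.1, 0.0],
                          [0.2, 0.7, 0.1]]))       # -> [(0, 0), (1, 1)]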

3.3 Building a mapping probability model

With the predicate-argument mapping technique, we can collect some mapping probabilities between Chinese predicate and argument types and English predicate and argument types.

Specifically, we are interested in the following probabilities:

p(pred_{j,E} | pred_{i,C}): given the aligned Chinese predicate, the probability of an English predicate

p(a_{l,E} | a_{k,C}, pred_{i,C}, pred_{j,E}): given an aligned Chinese & English predicate pair and the Chinese argument type, the probability of an English argument type

These 2 probabilities (and the probabilities in the English-to-Chinese direction) can be used to compute the semantic similarity of a pair of parallel sentences.

3.3.1 Predicate-to-predicate mapping probability

There are over 20,000 Chinese predicates and over 10,000 English predicates (in OntoNotes 5.0 PropBank frame files). Even on a large corpus, freq_{map}(pred_{i,C}, pred_{j,E}) will be too low or zero for many predicate pairs to produce a good probability estimate. We chose the Simple Good-Turing smoothing method [36] to smooth the seen mapping frequency counts and estimate the total unseen mapping probability \sum_{j : freq_{map}(pred_{i,C}, pred_{j,E}) = 0} p(pred_{j,E} | pred_{i,C}).
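A sketch of what this estimation could look like using NLTK's Simple Good-Turing implementation rather than the thesis code. The mapping_counts input and the toy counts are hypothetical, and a reliable Good-Turing fit needs real counts from a large corpus (NLTK warns when the frequency-of-frequency data is too thin).

    from collections import defaultdict
    from nltk.probability import FreqDist, SimpleGoodTuringProbDist

    def build_mapping_models(mapping_counts, n_english_preds=10000):
        """mapping_counts: (pred_C, pred_E) -> corpus frequency (assumed input)."""
        by_chinese = defaultdict(FreqDist)
        for (pc, pe), n in mapping_counts.items():
            by_chinese[pc][pe] += n
        # one smoothed distribution per Chinese predicate; `bins` reserves
        # probability mass for English predicates never seen aligned to it
        return {pc: SimpleGoodTuringProbDist(fd, bins=n_english_preds)
                for pc, fd in by_chinese.items()}

    models = build_mapping_models({("抵达", "arrive"): 50, ("抵达", "reach"): 12,
                                   ("抵达", "land"): 1})
    print(models["抵达"].prob("arrive"))   # smoothed seen probability
    print(models["抵达"].prob("depart"))   # a share of the reserved unseen mass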

3.3.2 Argument-to-argument mapping probability

Since freq_{map}(pred_{j,E} | pred_{i,C}) is sparse, freq_{map}(a_{l,E} | pred_{i,C}, pred_{j,E}, a_{k,C}) will also be sparse. We address this using absolute discounting [15] to smooth

    p(a_{l,E} | a_{k,C}, pred_{i,C}, pred_{j,E}) = \frac{\max(freq(a_{l,E} | a_{k,C}, pred_{i,C}, pred_{j,E}) - d, 0)}{\sum_l freq(a_{l,E} | a_{k,C}, pred_{i,C}, pred_{j,E})} + (1 - \lambda) \cdot p_{backoff}(a_{l,E})    (3.8)

with a few different back-off probability distributions:

(1) p(a_{l,E} | a_{k,C}, pred_{i,C}): given the Chinese predicate and argument type, the probability of an English argument type

(2) p(a_{l,E} | a_{k,C}, pred_{j,E}): given the English predicate and Chinese argument type, the probability of an English argument type

(3) p(a_{l,E} | a_{k,C}): given the Chinese argument type, the probability of an English argument type

(1) and (2) can be further smoothed using (3), while (3) can be computed directly from the frequency count over a large corpus since there are less than 30 argument types for either Chinese or English.

To choose between (1) and (2) as the back-off probability distribution, we compute the cosine similarity between (1), (3) and (2), (3) and choose the smaller of the 2 (i.e., choose the more specific distribution that’s less similar (more informative) to the base distribution).
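A small sketch of this smoothing and back-off selection. It uses the standard absolute-discounting form, in which the mass freed by discounting (d times the number of seen types over the total count) is given to the back-off distribution; the thesis writes this weight as (1 − λ). All distributions here are plain dicts, and the helper names are my own.

    import math

    def cosine(p, q):
        keys = set(p) | set(q)
        dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
        n1 = math.sqrt(sum(v * v for v in p.values()))
        n2 = math.sqrt(sum(v * v for v in q.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def choose_backoff(base, cand1, cand2):
        # pick the candidate LESS similar to the base distribution,
        # i.e. the more specific/informative one (section 3.3.2)
        return cand1 if cosine(cand1, base) < cosine(cand2, base) else cand2

    def discount(counts, backoff, d=0.5):
        """counts: English arg label -> freq for one (a_C, pred_C, pred_E) context."""
        total = sum(counts.values())
        freed = d * sum(1 for c in counts.values() if c > 0) / total  # back-off weight
        return {a: max(counts.get(a, 0) - d, 0.0) / total + freed * backoff.get(a, 0.0)
                for a in set(counts) | set(backoff)}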

3.4 Probabilistic mapping

With the probability model described in the previous section, we attempted to improve predicate-argument mapping by integrating the model with the alignment algorithm. Because the model is computed using automatic system output, we wanted to ensure the alignment algorithm does not overly rely on it. Therefore we modify equation 3.5 to:

    P'_{kl} = (1 - \beta) \cdot P_{kl} + \beta \cdot P_{kl} \cdot w(a_{l,E} | a_{k,C}, pred_{i,C}, pred_{j,E})
    R'_{kl} = (1 - \beta) \cdot R_{kl} + \beta \cdot R_{kl} \cdot w(a_{k,C} | a_{l,E}, pred_{i,C}, pred_{j,E})    (3.9)

where 0 ≤ β ≤ 1 and

    w(a_k) = \frac{p(a_k)}{\sum_k p(a_k)^2}    (3.10)

so that the expected value of w(a_k), E(w(a_k)) = 1. If P'_{kl} > 1 or R'_{kl} > 1, we change P'_{kl} = 1, R'_{kl} = 1. We also update equation 3.3 to take into account the predicate-to-predicate mapping likelihood:

α and β values (since they are still based on SRL and word alignment input), the probability model converges after 2-3 iterations.

3.5 Experiment

3.5.1 Reference predicate-argument mapping

We used a portion of OntoNotes Release 4.0 with Chinese-English word alignment annotation¹ as the basis for evaluating semantic mapping. The word alignment corpus contains around 2000 Xinhua News parallel sentences and 3000 broadcast conversation (CCTV and Phoenix) parallel sentences. A small percentage of the sentences were discarded because of tokenization differences.

We dubbed the resulting 2092 newswire and 2944 broadcast conversation parallel sentences as the triple-gold (gold Treebank, gold PropBank, and gold word alignment annotations) corpus.

To generate reference predicate-argument mappings, we ran the mapping system described in section 3.2.2 with a cutoff threshold of F_{ij} < 0.4 (i.e., alignments with F-score below 0.4 are discarded) using all gold annotations. We selected a small random sample of the Xinhua output and found the output to have both high precision and recall, with only occasional discrepancies caused by possible word alignment errors (and was no worse than inter-annotator agreement). If one-to-one argument mapping is imposed, the reference predicate-argument mapping will lose 8.2% of the alignments. For mappings using automatic word alignment, we chose a cutoff threshold of F_{ij} < 0.2. This can easily be tuned for higher precision or recall based on application needs.

¹ LDC2009E83

3.5.2 Parser, SRL, word aligner

We trained our Chinese SRL system with Berkeley Parser output on Chinese PropBank 1.0.

Our English SRL system is also trained with Berkeley Parser output but on OntoNotes release 5.0 (excluding the triple-gold Xinhua and broadcast conversation sections). We describe detailed implementations of our SRL systems in chapter 4. We use the Berkeley Aligner for word alignment, as we found it outperformed GIZA++. The aligner is trained on a 1.6M sentence parallel corpus² collected from a variety of sources. For the word mapping functions map_E(a_C), map_C(a_E) in equation 3.5, because the Berkeley Aligner outputs bidirectional word alignment, we did not need to decide between using intersection or union of uni-directional GIZA++ alignments as Padó and Lapata [64] did. On average (from the 1.6M sentence pair corpus), an English sentence contains 28.5% more tokens than the parallel Chinese sentence (even greater at 36.2% for the Xinhua portion).

3.5.3 Predicate-argument mapping results

We ran the predicate-argument mapper on the triple-gold Xinhua News section that intersects with the standard Chinese PropBank 1.0 test section that has an English translation (sections that fall between 01-40, 242 total sentence pairs), as well as all of the triple-gold broadcast conversation sections. Using the 1.6M sentence pair corpus, we found the optimal α = 0.175 and the optimal β = 0.2. In general, the choice of β had a smaller impact on the overall mapping score of the corpus than α. We used these values for the probability enhanced mapper.

The results, detailed in table 3.2, show that using automatic SRL and word alignment, the system achieved an 83.87 predicate-argument mapping F-score on Xinhua News, 3.44 F points less than using gold standard SRL annotation. Using the probability model enhanced mapper, however, this gap is closed to 2.27 F points, a 1.17 F point improvement over the baseline model.

Similar performance enhancement was observed on the broadcast conversation data (1.12 F point improvement with the probability model enhanced aligner), albeit with a lowered overall performance.

² LDC2002E18, LDC2002L27, LDC2003E07, LDC2003E14, LDC2004T08, LDC2005E83, LDC2005T06, LDC2005T10, LDC2005T34, LDC2006E24, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E86, LDC2006E92, LDC2006E93

corpus        system        predicate pair         core argument label    all argument label
                            p      r      f1       p      r      f1       p      r      f1
broadcast     baseline      77.36  77.92  77.64    71.37  58.46  64.27    63.41  53.28  57.91
conversation  +prob model   78.51  79.01  78.76    72.01  59.17  64.96    65.09  53.76  58.89
              gold SRL      80.90  81.16  81.03    86.62  83.39  84.98    77.75  73.55  75.59
                +prob model 80.46  82.25  81.35    86.21  83.97  85.07    78.30  73.57  75.86
              gold WA       89.90  88.45  89.16    79.84  65.57  72.01    77.61  67.22  72.04
Xinhua        baseline      84.03  83.71  83.87    79.57  69.80  74.37    75.60  64.68  69.72
News          +prob model   84.49  85.61  85.04    80.20  70.55  75.07    75.91  65.12  70.10
              gold SRL      86.03  88.64  87.31    88.57  90.35  89.45    84.49  81.61  83.03
                +prob model 87.92  88.26  88.09    89.86  89.97  89.92    86.06  81.45  83.69
              gold WA       93.41  91.29  92.34    85.33  75.06  79.87    85.12  76.74  80.72

Table 3.2: Predicate-argument mapping results

Figure 3.2: Wrong mapping caused by word alignment error

The F score differences (on predicates, core arguments, and all arguments) for the broadcast conversation data were all found to be statistically significant³ (p ≤ 0.01). The Xinhua News F score differences, on the other hand, were not statistically significant due to the small test corpus (had we retrained our Chinese SRL to exclude all of triple-gold Xinhua News and tested on its entirety, they likely would have been statistically significant as well).

Surprisingly, the probability model (which was extracted from automatic SRL output) was able to improve the mapping performance of the system using gold standard SRL by 0.78 F point on Xinhua News and 0.32 F point on broadcast conversation, although these margins were not statistically significant.

Figure 3.2 (bad mapping) and figure 3.3 (good mapping, on page 29) provide an example of the probability model correcting a predicate-argument mapping error. Because the automatic word aligner erroneously aligned both 自筹/self-provide and 建设/construct to build (shown with dotted lines), as well as missing the correct word alignment of 自筹 to Using its own, 自筹 is mapped to build, since they share more aligned words amongst the arguments. However, since ARG1 in Chinese rarely maps to ARGM-MNR but frequently maps to ARG1, the enhanced mapper preferred the correct mapping of 自筹 to use. This alone would have led to the correct mapping of 建设/construct to build using the Kuhn-Munkres algorithm, but the probability model further boosted the confidence of the mapping because 建设/construct frequently maps to build.

³ SIGF (www.nlpado.de/%7esebastian/software/sigf.shtml), using stratified approximate randomization test [95]

Figure 3.3: Corrected mapping based on alignment probability

Comparing the performance with the systems using gold standard word alignment, we see the bottleneck is the word aligner performance, as using gold standard word alignment produced much better results than using gold standard SRL (92.34 vs 87.31 F score on Xinhua News and 89.16 vs 81.03 F score on broadcast conversation). As we had expected, with such a large performance gap between automatic word alignment and gold annotation, adding the probability model only degraded the output of the systems using gold standard word alignment (the margin was no more than 0.5 F point).

We also experimented with building the probability model using only 10% of the data. The improvements were generally 0.1-0.3 F points less than when using the full dataset, although it degraded the gold word alignment based system by another 0.5-1 F point over using the full dataset. The optimal α = 0.175 and β = 0.2 did not change.

When looking at core argument mapping, however, the difference between using automatic SRL and gold standard SRL is substantial: on Xinhua News, automatic SRL based output produced a 75.07 F-score for core arguments. While this is comparable to Fung et al. [33]'s 72.5 (albeit with different sections of the corpus and based on gold standard predicates from a bilingual dictionary), it is 14.85 F points lower than using gold standard SRL based output. When including all arguments, automatic SRL based output achieved 70.10% while the gold SRL based output achieved 83.69%.

The performance on broadcast conversation shows a similar drop between the 2 SRL outputs.

             output type   language   coverage
triple-gold  predicate     Chinese    50.0%
             predicate     English    81.3%
             word          Chinese    66.0%
             word          English    64.2%
automatic    predicate     Chinese    49.6%
             predicate     English    80.7%
             word          Chinese    57.4%
             word          English    55.4%

Table 3.3: Predicate-argument mapping coverage. Predicate coverage denotes the fraction of mapped predicates over all predicates in the corpus; word coverage denotes the fraction of words in the mapped predicate-arguments over all words in the corpus

The argument results are not too surprising, as the mapping system needs to deal with many sources of error, from errors introduced by the automatic Chinese SRL, English SRL, and word alignment systems to incompatibilities between English and Chinese frame files, as well as confusion arising from implicit arguments.

3.5.4 Mapping coverage

Table 3.3 provides predicate and word coverage details of the predicate-argument mapping on Xinhua News, a potentially relevant statistic for applications of predicate-argument mapping.

High coverage of predicates and words in the mappings may provide more relevant constraints to help reorder MT output or rerank word alignment. We expect increased annotation of non-verbal predicates and their arguments for English will help increase both predicate and word coverage in the mapping output.

3.6 Chinese VerbNet Mapping

Since Chinese lacks some of the manually annotated lexical resources available to English, we also explored the possibility of automatically inducing a Chinese VerbNet resource using the mapping system. Our intent was to automatically generate a set of seed classes from English VerbNet, which could then be used to accelerate the development of a hand-corrected Chinese VerbNet. As such, the results in the following section are still quite preliminary.

3.6.1 Setup

We used the triple-gold Xinhua corpus and collected a list of English-Chinese verb predicate pairs using one of the following 4 methods:

gold-WA gold-standard word alignment of verb predicates

gold-PM predicate-argument mapping using the triple-gold Xinhua corpus

auto-WA automatic word alignment with the Berkeley aligner [22]

auto-PM predicate-argument mapping using the Berkeley aligner, automatic parse output, and automatic Chinese/English SRL output

From this list, we filtered out verbs that are frequently used in light verb constructions (take, make, have, etc.), as well as be verbs, in English and Chinese. We grouped the Chinese verb instances according to their corresponding English verb's VerbNet class membership. For each Chinese verb group formed, we also trimmed verbs with only a single instance. Two of the sample verb groups (using method 2, gold-PM) are:

create-26.4: 组建(form), 组织(organize), 制定(formulate), 创造(create), 生产(produce), 兴建(build), 形成(form)

appear-48.1.1: 出现(appear), 来自(come from), 露面(show up), 形成(form), 产生(produce), 涌现(emerge)

In this mapping, 形成(form) is grouped into both the create-26.4 and appear-48.1.1 VerbNet classes. The English verb form also belongs to those 2 VerbNet classes (along with reflexive appearance-48.1.2). As with English verbs, most Chinese verbs only belonged to one VerbNet class. Of the 18% that are in multiple VerbNet classes, many have an English verb translation in multiple VerbNet classes (e.g., 形成(form), as previously mentioned, as well as 发展(develop), which are in both the build-26.1 and grow-26.2 classes). Others may be translated to different English verbs: 参加 could mean both participate (cooperate-73-2) as well as join (cooperate-73-1).

method       precision   verb pairs   VN classes
gold-WA        84.9%        2729          131
gold-PM        82.9%        2722          128
  restricted   85.3%        2600          126
auto-WA        88.0%        2476          125
auto-PM        80.4%        2709          126
  restricted   88.2%        2338          121

Table 3.4: Results of English verbs with VerbNet annotation projected back to themselves through Chinese verbs. The restricted rows indicate restricting the verb pair output of the predicate-argument mapping methods to only those English/Chinese verb predicates that are also word aligned (rather than just induced through the argument mappings)

3.6.2 Results

Using predicate-argument mappings with gold standard annotations (word alignment, Treebank, PropBank), the system found 3430 aligned English-Chinese verb pair instances (that have English VerbNet class assignments). Of these, 322 unique Chinese verbs were grouped into 138 VerbNet classes, with 59 of the Chinese verbs in multiple verb groups. A bilingual annotator was tasked with evaluating the Chinese verb groupings against the English VerbNet class definitions.

The annotator could not decide the VerbNet class membership for 4 Chinese verbs, and these were excluded from the results.

The annotator found that 81.4% of the Chinese verb types fit appropriately in the projected English VerbNet classes. The ones that did not fit typically appeared with lower frequency: 88.7% of the Chinese verb occurrences belong to the projected English verb classes.

Another potential measure is to see what happens if the Chinese verbs are mapped back to English. When a Chinese verb is found to be polysemous, the most frequent VerbNet class is assigned to all instances of the verb. Ideally, when the English verbs are projected back to themselves, they would still retain the same VerbNet class. Table 3.4 shows the results using the different methods of extracting verb pairs.

Compared with the pure word alignment (gold standard or automatic) based methods, the predicate-argument mapping methods did not provide any advantage in precision. With automatic word alignment, the auto-PM predicate-argument mapping method did propose more verb pairings, although with decreased precision. When the predicate-argument mapping method is restricted to only output verb pairings that are also word aligned, the precision becomes marginally higher than using only word alignment, but with reduced verb pairing output. In our view, the higher recall from the auto-PM mappings is a significant advantage for inducing new lexical resources, in spite of the lower precision.

As we saw earlier with the verb pair 搞活(invigorate) and activate, the results are not completely surprising: frequently co-occurring words tend to be better direct translations of each other. Since these are also more easily found by automatic word aligners, they can survive the projection cycle better. On the other hand, some semantically equivalent verb pairings in the context of parallel text, which can be inferred from predicate-argument structures, may differ slightly in their VerbNet class assignment.

3.7 Discussion

We described a word alignment based semantic mapping system that, even when using inputs from automatic tools (parsing, SRL, word alignment), can produce good predicate-argument mappings between Chinese and English. We also described a probability model enhancement that raised the performance of the system by over 1 F point.

Given that the probability model, built using all automatic system output, provides smaller improvements to (or even degrades) the system when either gold standard SRL or word alignment is used, the probability model still has room for improvement. One possible improvement would be to build a probability model predicated on verb classes/clusters. This could address the sparse mapping frequency count issue arising from the many possible Chinese-English predicate-argument pairings. For English, we can use the existing VerbNet class resource and train an automatic system for polysemous verbs. For Chinese, however, we would need to either induce verb classes through mapping (as described in the previous section) or use an automatic verb clustering method.

While we have achieved good predicate-argument mapping performance, specific argument mapping performance still lags behind. One reason is that while we can induce the correct predicate-argument mapping from the argument mapping pairs even when the predicates themselves are misaligned, our system currently does not attempt to directly correct argument labels from automatic SRL output. Therefore, any labeling error in the automatic SRL system output (made worse by having 2 languages) is propagated through the mapping system.

A joint-inference/joint-learning framework between semantic mapping, SRL, and word alignment could potentially address the shortcomings in our current implementation.

Chapter 4

Semantic Role Labeling

4.1 Baseline Approach

We approach automatic semantic role labeling like many previous systems: SRL is posed as a multi-class classification problem requiring the identification of argument candidates for each predicate and the classification of their argument types. Even when using argument filtering as suggested by Xue and Palmer [90], the number of non-arguments in the resulting candidate list is much larger than the number of arguments. Therefore, many systems using expensive learning algorithms [69][82] choose to perform separate argument identification (binary classification) and argument labeling (multi-class classification) to reduce training/decoding complexity, though such techniques have not been shown to improve (and sometimes slightly degrade) overall SRL accuracy [16].

For our system, we chose LIBLINEAR [30], a library for large linear classification problems, as the classifier. Because of its relative efficiency, we did not separate the identification and labeling stages: argument identification is trained simply by incorporating the "NOT-ARG" label into the training data. We adopted more relaxed filtering heuristics than Xue and Palmer [90]: we include all children of the argument candidates considered by Xue and Palmer as long as they have different headwords than their parents. This is done to account for argument candidates that may otherwise be filtered out due to automatic parse errors.
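A minimal sketch of this relaxed pruning heuristic follows (assumptions: each constituent node is a hypothetical object with children and headword attributes, and xue_palmer_candidates stands in for the standard pruning algorithm, which collects the siblings of the predicate and of each of its ancestors).

def relaxed_candidates(predicate_node, xue_palmer_candidates):
    candidates = list(xue_palmer_candidates(predicate_node))
    for cand in list(candidates):
        # Relaxation: also keep children whose headword differs from their
        # parent's, to recover arguments lost to automatic parse errors.
        for child in cand.children:
            if child.headword != cand.headword:
                candidates.append(child)
    return candidates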

Figure 4.1: SRL annotation for make.01

4.1.1 Features

We use many of the same types of lexical and syntactic features as previous SRL systems (and will use the make.01 example in figure 4.1 for illustration):

Predicate predicate lemma and its POS tag

Voice indicates the voice of the predicate. For English, we used the six heuristics detailed by Igo [40], which detect both ordinary and reduced passive constructions with high accuracy for gold standard trees. For Chinese, we detected the presence of passive indicator words (those with SB, LB POS tags) amongst the siblings of the predicate. However, as we will see later, Chinese passive constructions can occur without the use of any passive indicator words.

Phrase type phrase type of the constituent (for Arg1: NP)

Subcategorization phrase structure rule expanding the parent (verb phrase) of the predicate. The feature value is the enumeration of the phrase types of all children of the verb phrase (VBD→NP→NP→PP)

Headword the head word of the constituent and its POS tag

Parent headword whether the head word of the parent constituent is the same as the headword of the constituent in focus

Position whether the constituent is before or after the predicate

Path the syntactic tree path from the predicate to the constituent (for Arg0: VBD↑VP↑S↓NP).

Path Generalizations We implemented 3 of the path generalizations proposed by Pradhan et al. [69]:

Partial Path path from the constituent to the lowest common ancestor of the predicate and the constituent (for Arg0: NP↑S)

Clause-based Variations these include 4 variations:

(1) replace all the nodes in a path other than clause nodes with an "*".
(2) retain only the clause nodes in the path.
(3) binary feature that indicates whether the constituent is in the same clause as the predicate.
(4) collapse the nodes in between clause nodes

Single character phrase tags only use the first character of each phrase type (for Arg0: VBD↑V↑S↓N)

Dependency Path the dependency path from the predicate to the head of the constituent. If dependency labels are not available, use the phrase type of the constituent (for Arg0: pred→subj or pred→NP)

First word first word of the constituent and its POS tag

Last word last word of the constituent and its POS tag

Syntactic frame the position of the constituent amongst its siblings. The feature value is the enumeration of all children of the parent of the constituent, with the constituent highlighted (for Arg1 (notice the capitalization): np→NP→pp)

Constituent distance the number of potential constituents with the same phrase type between the predicate and the constituent

Roleset the permissible core arguments for the predicate as indicated by the PropBank frame files

Since we use a linear classifier for argument labeling, we have also created some bigram (and a few trigram) feature combinations that may not be expressible as a linear combination of the individual feature values. For example, neither Voice nor Position is a strong feature for a particular argument label, but Arg0 is likely to appear only with either the active-before or passive-after feature values. The list of n-gram features is:

• Voice-Position

• Path-Position

• Predicate-Voice-Position

• Roleset-Voice-Position

• Predicate-Path

• Predicate-Phrase type

• Predicate-Headword

• Predicate-Subcategorization

• Predicate-Syntactic frame

• Headword-Phrase type

• Headword-Parent headword

4.1.1.1 Chinese specific constructions

In addition to detecting SB and LB constructions for Chinese, we also detect 把/Bǎ constructions. The Chinese PropBank currently annotates just the verb with semantic content as the predicate in a 把/Bǎ construction. This results in a subject–object–verb (SOV) order sentence, in contrast with the typical subject–verb–object (SVO) order. For example, in "我 把/Bǎ 书/book 卖/sell 了/Le", 卖/sell is the predicate and 我 is the agent (ARG0). Without detecting the 把/Bǎ construction, the current SRL system often confuses 书/book as the agent instead of the patient and omits 我 as the agent.

4.1.2 Classifiers

We have experimented with different LIBLINEAR solver types and found both L2-regularized L1-loss support vector classification (SVC) and L2-regularized logistic regression to perform well for SRL argument identification and labeling. In practice, the dual form of the L2-regularized L1-loss SVC solver required the least training time, produced a relatively compact model, and was one of the top performers. The L2-regularized logistic regression solver was much slower in training and produced a larger dense feature weight model, but had competitive performance. It does, however, output class probabilities. Unless stated otherwise, most of the experiment results are obtained using the L2-regularized L1-loss SVC dual solver.
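As a hedged illustration, scikit-learn exposes the same LIBLINEAR solvers discussed above (the thesis system calls LIBLINEAR directly; the estimator names below are scikit-learn's, not ours):

from sklearn.svm import LinearSVC             # L2-regularized L1-loss SVC (dual form)
from sklearn.linear_model import LogisticRegression

svc = LinearSVC(loss="hinge", dual=True)      # fast training, compact model
lr = LogisticRegression(solver="liblinear")   # slower, but outputs class probabilities

# X: sparse binary feature matrix; y: argument labels including "NOT-ARG".
# svc.fit(X, y) for most experiments; lr.fit(X, y) / lr.predict_proba(X) when
# label probabilities are needed (e.g., for selectional preference filtering).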

By default, LIBLINEAR uses the one-vs-all approach for multi-class classification. This does not always perform well for some easily confusable class labels. Also, as noted by Xue and Palmer [90], certain features are strong discriminators for argument identification but not for argument labeling, while the reverse is true for others. Likewise, certain features may be strong discriminators between 2 argument label types but not between one label type and all other label types. To understand whether this may be the case for an SRL argument labeling type, we built a pairwise multi-class classifier (using simple majority voting) on top of LIBLINEAR. We found pairwise multi-class classification to be at least as competitive as the default one-vs-all approach, and in many instances it improved the overall SRL F-score by 0.1-0.3 points. Unless argument label probability is needed, we performed most of the experiments with pairwise multi-class classification.
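A minimal sketch of the pairwise (one-vs-one) scheme with simple majority voting, under the assumption that each binary classifier was trained only on instances of its two labels (train_binary is a hypothetical helper):

from collections import Counter
from itertools import combinations

def train_pairwise(train_binary, X, y, labels):
    # One binary classifier per unordered label pair.
    return {pair: train_binary(X, y, pair) for pair in combinations(labels, 2)}

def predict_pairwise(pairwise, x):
    votes = Counter()
    for (a, b), clf in pairwise.items():
        votes[a if clf.predict([x])[0] == a else b] += 1
    return votes.most_common(1)[0][0]   # label winning the most pairwise duels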

4.2 Improvements to the Baseline SRL system

We improved upon the baseline SRL system in 3 different ways:

Support Identification identifies the support predicate and its arguments to better classify arguments of the predicate in focus

[Arg0 香港 电影] 成为 让 了解 香港 的 窗户, [Arg0 *pro*] [V 令] [Arg1 香港 的 大都会 形象] [Arg2 在 国际 间 彰显]
Hong Kong movie become let understand Hong Kong window, let Hong Kong big city image at international between highlight
[Arg0 Hong Kong movies] have become a window for the world to see Hong Kong [R-Arg0 which] have made [Arg1 the image of metropolitan Hong Kong prominent internationally].

Figure 4.2: Chinese-English sentence pair w/ SRL annotation of let and made

2-stage Argument Label Classification uses 2 argument label classifiers in series to model SRL structural constraints

Selectional Preference models the argument label selectional preference of headwords collected from large unannotated corpora

4.2.1 Support Identification

Typically, arguments within the verb phrase or within the same clause as the verb predicate are easy to identify and label correctly, as the syntactic path between the argument and the predicate is a very informative feature. However, sometimes an argument outside the domain of locality for one predicate may be a local argument of another predicate. For example, in figure 4.2, 成为/become and 令/let both share the agent 香港/Hong Kong 电影/movie, and ARG1 of 令/let, the image of metropolitan Hong Kong, is also ARG1 of 彰显/highlight. If we perform argument labeling in the order 成为/become, 令/let, 彰显/highlight, and use the found arguments of the previous predicates as features for argument labeling of the next predicate, we may improve argument identification/labeling outside the local clause. In English, this may not be as critical, as relative pronouns make it easier to identify an embedded clause as a relative clause.

Chinese PropBank annotates certain verbs as the support of nominal predicates explicitly.

For the example in figure 4.3, 表示/express is annotated as the support of the nominal predicate 欢迎/welcome, and 3 arguments of 表示/express are also arguments of 欢迎/welcome. Very much like nominal SRL in English [41], identifying the support verb and its arguments is important to nominal SRL, since a larger portion of arguments tend to be outside the domain of locality of the nominal predicate, especially within the context of a light verb.

[Arg0 香港 长官 董建华] [AM-tmp 今天] [Arg1 对 美国 基金会 发表的 经济 报告] [Sup 表示] [V 欢迎]
Hong Kong official Dong Jianhua today toward US foundation post economic report express welcome
[AM-tmp Today], [Arg0 Hong Kong official Dong Jianhua] [V welcomed] [Arg1 the economic report released by the US foundation].

Figure 4.3: Chinese nominal predicate translated to English verb predicate

For our system, we process the verb predicates in order from the root clause to embedded clauses and use the found arguments in the parent clause as features to predicates in the embedded clauses. For nominal predicates, we follow the dependency path to the first verb predicate and use its arguments as features.
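A minimal sketch of this processing order (the helper names are illustrative, not from the actual system): verb predicates are visited root clause first, each one seeing the arguments already found for its parent-clause predicate, and nominal predicates borrow the arguments of the nearest verb predicate on their dependency path.

def label_with_support(predicates, clause_depth, label_arguments, first_verb_on_dep_path):
    found = {}
    for pred in sorted((p for p in predicates if p.is_verb), key=clause_depth):
        support_args = found.get(pred.parent_predicate, [])
        found[pred] = label_arguments(pred, support_feats=support_args)
    for pred in (p for p in predicates if not p.is_verb):   # nominal predicates
        verb = first_verb_on_dep_path(pred)
        found[pred] = label_arguments(pred, support_feats=found.get(verb, []))
    return found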

4.2.2 2-stage Argument Label Classification

The top SRL systems from the CoNLL 2005 shared task [79], as well as some subsequent systems, are combination systems using multiple parses, and they impose structural constraints (no repeating core arguments, no overlapping arguments) to improve the output. While these constraints can improve both precision and recall with multiple parses, for a single input parse the benefit would mostly be in improved precision.

To address this, we implement 2-stage argument label classification, where the argument label set found by the first classifier is used as an additional feature for the second classifier. If an expected argument type is missing from the output of the first classifier, the second stage classifier can learn to assign a constituent previously labeled NOT-ARG as the missing argument type. For example, if the stage-1 classifier erroneously labels Arg2 of 令/let in figure 4.2 as AM-loc (since 在/at is a common preposition for the head word of a location argument), given the role set feature and the output of the first classifier, the stage-2 classifier has a better chance of correctly relabeling AM-loc as the missing core argument.

By performing some argument candidate filtering (excluding constituents deemed highly unlikely to be arguments by the first stage classifier), the computational complexity of the additional classification stage is reduced.
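A minimal sketch of the 2-stage scheme (stage1/stage2 are trained classifiers and features(c) builds the base feature set for a candidate; all names, including prob_not_arg, are illustrative):

def two_stage_label(candidates, features, stage1, stage2, keep_threshold=0.01):
    labels1 = {c: stage1.predict(features(c)) for c in candidates}
    label_set = sorted(set(labels1.values()))    # stage-1 label set feature
    results = {}
    for c in candidates:
        # Filter constituents stage 1 deems highly unlikely to be arguments.
        if stage1.prob_not_arg(features(c)) > 1 - keep_threshold:
            results[c] = "NOT-ARG"
            continue
        extra = [("stage1", labels1[c])] + [("found", l) for l in label_set]
        results[c] = stage2.predict(features(c) + extra)
    return results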

4.2.3 Selectional Preference

An issue with a heavily lexicalized SRL model arises when the system encounters a word not in the training data. This is particularly problematic for arguments that are not well constrained by syntax (arguments that are not direct dependents of the predicate, which is common for arguments of non-verbal predicates). Without some form of similarity measure between the unseen word and words in the training data, the system cannot make an informed decision about the argument label of the constituent. To address this, we implemented 2 selectional preference systems: topic model based and distributional similarity based. The topic model based system models selectional preferences by clustering the constituents, represented by their headwords and label types, such that constituents in the same cluster are likely to have similar label selectional preferences for a predicate. The distributional similarity based selectional preference system computes the selectional preferences of constituent headwords from the training corpus, then infers the selectional preference of an unseen headword based on its distributional similarity (extracted from unannotated corpora) with headwords in the training corpus. We devote the next 2 sections to detailing the implementation of each system.

4.3 Topic Model based Selectional Preference

4.3.1 Representation

Some of the most discriminative SP models used by Zapirain et al. [98] relied on distributional similarity computed over dependency relationships (the data set from [52]). For example, in "John lent Mary the book.", we would extract John-nsubj, Mary-iobj, book-dobj for the predicate lend. While syntactic based similarity has proven to be of higher quality than pure word occurrence based similarity, it may not be optimal for semantic-based processing. With nominal SRL, a large portion of arguments (around 50% in Chinese PropBank) are not direct syntactic dependents of the nominal predicate. For the example in figure 4.3, because of a light verb-like construction, all the arguments of 欢迎/welcome are syntactic dependents of 表示/express.

To address this issue, we directly extract the semantic selectional preferences of the predicates by running our SRL system over the unannotated corpus. For the sample sentence, we would extract the selectional preferences of lend as John-Arg0, Mary-Arg2, book-Arg1.

4.3.2 SRL Filtering

Building selectional preferences from the output of an SRL system is unlikely to improve the same SRL system unless one filters out the lower quality labels (in earlier experiments where we performed no filtering, this was indeed the case). We ran SRL on the unannotated corpus using a logistic regression model. To balance precision and recall, we set a relatively low filtering threshold of 0.5 argument label probability, but also discounted the occurrence count based on the probability (e.g., an argument label with 0.8 probability only counts as having occurred 0.8 times).
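A minimal sketch of this probability-discounted counting (the (pred, label, headword, prob) tuples are assumed to come from the logistic regression SRL output):

from collections import defaultdict

def collect_sp_counts(srl_output, threshold=0.5):
    counts = defaultdict(float)     # (predicate, label, headword) -> soft count
    for pred, label, headword, prob in srl_output:
        if prob >= threshold:
            counts[(pred, label, headword)] += prob   # 0.8 prob counts as 0.8
    return counts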

4.3.3 Multi-lingual SRL filtering

Zhuang and Zong [100] demonstrated that by performing Chinese SRL and English SRL on parallel corpora simultaneously, and constraining the output based on argument alignment, one can improve the output of both systems, even when one system, in this case English, starts with lower performance. We chose this approach as unannotated Chinese-English parallel corpora are readily available, and the potential for extracting higher quality SP examples means a modest-sized corpus may be sufficient.

To constrain the predicate-argument set found by semantic mapping for bilingual SP, we experimented with a number of heuristics.

Predicates must align The mapping does not always produce aligned predicates (even if the overall meaning conveyed is equivalent), since adjective-like verb predicates in Chinese usually do not align to an English verb predicate. However, unaligned predicates are unlikely to have similar argument structures.

Head word of the arguments must align This is a bit restrictive in general (figure 4.2 shows an example where both Arg1 and Arg2 of 令/let align to Arg1 of made, and only one of the Chinese arguments could possibly have an aligned head word with the English argument). Since we represent an argument in SP by its head word, this is necessary.

Label type should be compatible This is less straightforward. As we have demonstrated [86], since Chinese PropBank and English PropBank were created separately, not all similar predicates have the same argument label set. While Arg0, Arg1, and certain adjuncts like temporal and location arguments align to each other with high probability, the rest are much less consistent. We compromised by discounting the alignment occurrences with incompatible labels by a factor of 2. In the future, we plan to explore using alignment probabilities to weight different alignment label pairs.

This still leaves us with a large number of unaligned predicates, some of which may not have a good English equivalent. We included these unconstrained arguments to improve recall, but heavily discounted their occurrences as they are likely to be much noisier. Similarly, because of the poor Arg0 labeling performance (especially recall) in Chinese SRL, arising from pro-drop confusion, we also added head words derived from unaligned English Arg0 arguments.
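A minimal sketch of the resulting weighting (the factor-of-2 discount for incompatible labels is from the text; the discount for unaligned or Arg0 recall-patch examples is an assumed illustrative value, since the text only says they are heavily discounted):

def sp_weight(aligned, labels_compatible, unaligned_discount=0.25):
    if not aligned:
        return unaligned_discount    # assumed value: unaligned/patched examples
    return 1.0 if labels_compatible else 0.5   # incompatible labels: factor of 2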

4.3.4 SP with LDA-based Topic Model

Our approach to modeling selectional preferences (SP) follows a relatively straightforward application of LDA modeling to a set of predicate-argument instances derived from a corpus. In the standard LDA model, a corpus is represented by a set of M documents; each document d is represented by a bag of N words and is assumed to be drawn from a multinomial distribution θd over topics (i.e., the topic distribution of the document). The model has 3 parameters: α is the Dirichlet prior on the per-document topic distributions (typically assumed to be the same for all documents), β is the Dirichlet prior on the per-topic word distribution (typically assumed to be the same for all words), and K is the total number of predetermined topics. The resulting model is a topic distribution over all words in the corpus.

Figure 4.4: LDA plate diagram

For the SRL application, we represent each "word" as an extracted argument (a (label, headword) pair), and each "document" as the collection of arguments for all instances of a particular predicate. For prepositional phrases, we used the dependent of the preposition as the head word, instead of the usual (preposition, head of dependent) combination used in many SRL systems, as the preposition can often be omitted in Chinese.
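A minimal sketch of this corpus construction (instances is assumed to yield (predicate, arguments) pairs from the filtered SRL output; the argument attributes are illustrative):

from collections import defaultdict

def build_lda_corpus(instances):
    docs = defaultdict(list)        # predicate lemma -> bag of (label, headword)
    for predicate, arguments in instances:
        for arg in arguments:
            # For PPs, use the dependent of the preposition as the headword.
            head = arg.prep_dependent_head if arg.is_pp else arg.headword
            docs[predicate].append((arg.label, head))
    return docs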

4.3.5 SP Extraction Steps

To integrate selectional preference into SRL with our approach, we perform the following steps:

(1) train SRL with probabilistic arg label model

(2) label SRL on unannotated corpus

(3) filter low probability semantic roles

(4) LDA topic model on label-word pairs

(5) retrain SRL system with topic features

Iterating steps 2-5 with the improved SRL system can potentially provide higher quality predictions for semantic roles. We reached diminishing returns after one additional iteration.

4.3.6 Topic Model feature for SRL

The LDA topic model produces a probability distribution of words (represented here by the (label, headword) pair) over topics. For the SRL task, argument candidates with topic distributions similar to those of the arguments found in the training set are likely to be permissible. Ideally, we would use these distributions directly, but since our SRL system was designed to accept lexical (binary) features only (for training/decoding performance), we pared the distribution down to at most 3 topics for each label type. Words that do not have a high affinity to a small number of topics (the top 3 probable topics do not sum to 50%) are excluded. We used the resulting list of (label, topic id) pairs for each word as the selectional preference feature for each encountered constituent in the Chinese SRL system.
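A minimal sketch of paring the topic distribution down to binary features (topic_dist maps a (label, headword) word to a {topic_id: probability} dict; this interface is assumed):

def topic_features(word, topic_dist, k=3, min_mass=0.5):
    dist = topic_dist.get(word)
    if not dist:
        return []
    top = sorted(dist.items(), key=lambda kv: -kv[1])[:k]
    if sum(p for _, p in top) < min_mass:
        return []                    # no strong affinity to a few topics
    label = word[0]                  # word is a (label, headword) pair
    return [(label, topic_id) for topic_id, _ in top]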

During the normal LDA inference stage, using the learned topic model, a predicate instance ("document") will be assigned a probability distribution over topics based on its arguments, and each argument will be assigned a specific topic (or topic distribution). This could further constrain an argument's selectional preference within the context of the predicate instance and other arguments. For our system, we experimented with performing inference on the argument label set extracted from the first stage classifier and using the constrained argument topic distribution for the second stage classifier. However, we observed no improvement, likely because there are only a few arguments for each predicate instance.

4.4 Distributional Similarity based Selectional Preference

4.4.1 Approach

For distributional similarity based selectional preference, we followed the general approach proposed by Erk [28] and successfully implemented for PropBank SRL by Zapirain et al. [98].

We collected the headword selectional preferences from a primary corpus (in this case, the annotated training corpus) using the following definition:

\[ SelPref(p, r_i, w_0) = \frac{freq(p, r_i, w_0)}{\sum_{r \in R} freq(p, r, w_0)} \tag{4.1} \]

where p is a predicate, r_i is an argument label type, and w_0 is the headword of the constituent in consideration. Essentially, the selectional preference of a constituent for a predicate to a particular argument label type, defined by its headword, is the relative frequency of the headword associated with the particular label type over the total counts of all label types. We can then predict the label type of a constituent with:

\[ r(p, w_0) = \arg\max_i SelPref(p, r_i, w_0) \tag{4.2} \]

With this definition, we can predict the label types of the WSJ training sections with 93.9% precision (and the same 93.9 F score, since there are no unseen headwords). However, on the test section, while the precision only degrades to 81.4%, the recall degrades to 42.0% due to out-of-vocabulary words, even when using separate verb-role and preposition-role selectional preferences as proposed by Zapirain et al. [98] (our numbers are slightly different from Zapirain et al.'s, possibly due to using different head rules and evaluating on different argument label types). We can replicate this level of performance drop on the training data by discounting freq(p, r_i, w_0) by 1 when computing the selectional preference (i.e., pretending we have not seen this particular constituent instance), and we used this leave-one-out technique for extracting selectional preference features in our SRL system to ensure the model does not overfit to the training data.
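A minimal sketch of equations 4.1/4.2 with the leave-one-out discount (counts is the (predicate, label, headword) -> frequency table from the training corpus):

def sel_pref_label(counts, labels, p, w0, leave_one_out_label=None):
    freqs = {r: counts.get((p, r, w0), 0.0) for r in labels}
    if leave_one_out_label is not None:
        # Pretend this constituent instance was never seen: discount the
        # frequency of its own gold label by 1 (equation 4.1's freq).
        freqs[leave_one_out_label] = max(0.0, freqs[leave_one_out_label] - 1)
    total = sum(freqs.values())
    if total == 0:
        return None                  # unseen headword: no prediction (recall loss)
    return max(labels, key=lambda r: freqs[r] / total)   # equation 4.2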

4.4.2 Distributional Similarity Corpus

To expand selectional preference over unseen words, Erk [28] proposed a selectional preference definition using a distributional word similarity database:

\[ SelPref(p, r, w_0) = \sum_{w \in Seen(p,r)} Sim(w_0, w) \cdot weight(p, r, w) \tag{4.3} \]

We do not yet have such a pre-computed resource for Chinese. For English, we decided to use a rather novel distributional similarity dataset: it is derived from VerbNet role selectional strengths computed over automatic SRL output on English Gigaword, then turned into VerbNet role selectional preferences using a co-occurrence based distributional similarity dataset computed over several larger corpora. The resulting distributional similarity dataset contains around 90K lemmatized words. In theory, this yields a semantic-based distributional similarity dataset that may be more appropriate for the SRL task than the dependency-based dataset used by Erk and Zapirain et al.

4.4.3 Selectional Preference Measures

Adopting the VerbNet selectional preference database as the distributional similarity dataset for the PropBank SRL task, we define the similarity between two words as the VerbNet role selectional preference similarity of the two words:

\[ Sim(w_i, w_j) = Sim(SelPref_{vn}(r, w_i), SelPref_{vn}(r, w_j)) \tag{4.4} \]

Because of the rather unique database, we experimented with different similarity measures, including cosine and the normalized generalized context model (nGCM) proposed by Erk et al. [29] (Jaccard similarity would be inappropriate for this dataset as most words have non-zero selectional preference for all VerbNet roles). We also tested the performance of alternative (to equation 4.3) selectional preference definitions. The best combination we found was

\[ SelPref(p, r_i, w_0) = \Big[ \arg\max_{w \in Seen(p, r_i)} Sim_{cosine^k}(w_0, w) \Big] \cdot \frac{freq(p, r_i, w)}{\sum_i freq(p, r_i, w)} \tag{4.5} \]

where

\[ Sim_{cosine^k}(w_0, w) = (Sim_{cosine}(w_0, w))^k = \Big( \frac{w_0 \cdot w}{\|w_0\| \|w\|} \Big)^k \tag{4.6} \]

We found that applying a polynomial function over the raw cosine similarity provided better value scaling for our dataset, as the raw similarity values of word pairs using VerbNet role selectional preference clustered tightly around (0.95, 1.0] in the cosine space. We tuned k on the development set and found the optimal k ≈ 20. The resulting SPVN model achieved a similar gain (∼16 F points) over lexical selectional preference as the better performing models used by Zapirain et al. (table 12 of [98], SP^pre_simLin×cos and SP^pre_simLin×Jac).
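A minimal sketch of the powered cosine of equation 4.6, which spreads out similarity values that otherwise cluster in (0.95, 1.0] (the vectors are VerbNet-role selectional preference profiles):

import math

def cosine_k(u, v, k=20):            # k tuned on the development set
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (dot / norm) ** k if norm else 0.0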

4.4.4 SRL Integration

We integrated the SPVN model with the SRL system just like Zapirain et al. [98], by adding a feature to the SRL system indicating the best argument label for the constituent found using the SPVN model. But because our SRL system performs argument identification and labeling simultaneously, the SPVN model was applied to all constituents. On the other hand, since we only used one selectional preference model, we did not need to build a meta classifier on the SRL system output.

4.5 Experiment

4.5.1 Setup

For evaluation, we trained our Chinese SRL system on Chinese TreeBank 5.1 and Chinese PropBank 1.0 (as of this writing, the current release of the Chinese PropBank is 3.0, and more BOLT PropBank annotations are available through LDC). We used the standard section splits: sections 81-885 for training, sections 41-80 for development, and sections 1-40 and 900-931 for testing. We generated the training parses (with 10-fold cross-validation) and the test parses using the Berkeley parser¹ (5 split-merge cycles). The parser F1 score on the test sections is 82.73 as measured by ParseEval [6].

We used 2 corpora for our topic model based SP approach: the monolingual Chinese Gigaword Fifth Edition² and the 1.6M sentence parallel corpus collected from a variety of sources (used by our semantic mapping system in section 3.5.2). We used the latter corpus to implement our multi-lingual SRL filtering technique.

We prepared the Chinese Gigaword corpus with the Stanford Chinese Word Segmenter³. The parallel corpus was already segmented (with an automatic segmenter).

We then performed LDA topic modeling using PLDA+ [54] and the recommended α = 50/K (where K is the topic count) and β = 0.01 values. We chose 1500 topics for the 1.6M sentence parallel corpus and 2000 topics for Chinese Gigaword (tuned on the SRL performance of the development set rather than any topic based metrics). Table 4.1 lists some of the found topics (with the most frequent, relatively interesting, and least frequent words) using Chinese Gigaword.

¹ code.google.com/p/berkeleyparser/
² LDC2011T13
³ nlp.stanford.edu/software/segmenter.shtml

topic                 headword, argument label pairs
emergency response    破坏/damage:Arg1 阻止/stop:Arg1 制造/fabricate:Arg1 寻找/search:Arg1 自杀/suicide:Arg1 ... 灭火/extinguish:Arg1 敲诈/blackmail:Arg1 挣脱/break free:Arg1 东山再起/comeback:Arg1
government agency     海关/custom:Arg0 联合会/union:Arg0 务部/work department:Arg0 旅游局/travel department:Arg0 统计局/census:Arg0 ... 部会/ministries:Arg0 边检站/checkpoint:Arg0 财政局/finance bureau:Arg0
law & order           警方/police:Arg0 嫌犯/suspect:Arg1 男子/male:Arg1 到案/court appearance:Arg1 公安/public safety:Arg0 ... 巷/alley:Argm-loc 嘉义市/Chiayi City:Argm-loc 哥伦比亚人/Columbian:Arg1
path                  道路/road:Arg1 路/path:Arg1 大道/avenue:Arg1 ... 红地毯/red carpet:Arg1 钢丝/steel wire:Arg1 独木桥/plank bridge:Arg1 ... 迷宫/maze:Arg1 侧门/side entrance:Arg1 险棋/risky move:Arg1
competition           比赛/competition:Arg1 决赛/final:Arg1 联赛/league comp:Arg1 ... 考试/exam:Arg1 大选/election:Arg1 世乒赛/world pingpong match:Arg1 ... 加赛/playoff:Arg1 分团/sub-group:Arg0
moral & ethics        精神/spirit:Arg1 传统/tradition:Arg1 作风/style:Arg1 文明/civil:Arg1 ... 校风/school spirit:Arg1 同舟共济/share hard time:Arg1 ... 幸福观/happy outlook:Arg1 博爱/universal love:Arg1

Table 4.1: Topics in Chinese Gigaword

4.5.2 Performance

Table 4.2 (page 51) compares verb SRL results of selectional preference using the multi-lingual SRL filtering (SPparallel) versus using monolingual probabilistic SRL filtering on the same corpus (SPmonolingual) and on Chinese Gigaword (SPGigaword). While all versions improved on the system without SP, the multi-lingual SRL filtering technique was only marginally better than SP using monolingual probabilistic SRL filtering, and it produced the same F score as the monolingual SP system using the larger Chinese Gigaword corpus (although the differences between the 3 SP systems are small). For the rest of the experiment results on Chinese PropBank, we use SPLDA to refer to the SP enhanced system using Chinese Gigaword.

systems         p      r      f1
no SP          82.70  70.44  76.08
SPparallel     82.10  71.43  76.40
SPmonolingual  82.75  70.90  76.37
SPGigaword     82.74  70.96  76.40

Table 4.2: Chinese PropBank 1.0 verb results

As table 4.3 shows, the addition of the SPLDA feature improved nominal SRL by 2.34 F1 points. Verb SRL improved by 0.40 F1 point and overall SRL improved by 0.66 F1 point. These F1 differences were all found to be statistically significant⁴ (p ≤ 0.05).

                nominal                 verb    all
system          p      r      f1        f1      f1
baseline       64.71  46.90  54.39     74.99   71.53
+SPLDA         65.98  48.47  55.88     75.39   72.08
+sup           64.07  47.82  54.76     75.16   71.69
+sup, stage2   64.71  48.20  55.25     75.53   72.08
All features   65.70  51.27  57.59     75.93   72.74

Table 4.3: Chinese PropBank 1.0 results

The per-argument performance (table 4.4 on page 52) shows that the selectional preference features (SPLDA) mostly improve core argument types, with a larger margin on Arg1 than Arg0. One outlier in the adjunct argument performance is the discourse modifier (Argm-DIS), where SPLDA improved the F-score by ∼10 points for both verb and nominal predicates. Examining the generated topics revealed a very homogeneous topic that contained a large portion of Argm-DIS occurrences (但/but, 但是/but, 如果/if, 而/yet, 虽然/although, 所以/therefore, etc.).

We also tested the system on Sinorama magazine and other out-of-genre sections (broadcast conversation, broadcast news, web blog) in Chinese PropBank 3.0. Only Sinorama has nominal SRL annotations. As table 4.5 (page 53) shows, even though the absolute performance is much lower, SP improved the precision and recall in all cases, raising the nominal SRL score on Sinorama by 2.30 F1 points and the verb SRL scores by 0.31-0.46 F1 point. Again, these F1 differences were statistically significant.

⁴ SIGF (www.nlpado.de/%7esebastian/software/sigf.shtml), using stratified approximate randomization test [95]

       +sup, stage2                                   All Features
       verb                  nominal                  verb                  nominal
       p      r      f1      p      r      f1         p      r      f1      p      r      f1
Arg0  75.70  66.06  70.55   70.88  50.64  59.08      76.84  65.74  70.86   70.43  53.96  61.11
Arg1  84.61  75.92  80.03   59.10  46.61  52.12      85.47  75.99  80.45   61.28  50.81  55.56
Arg2  80.90  59.34  68.46   92.73  67.11  77.86      81.09  61.26  69.80   87.30  72.37  79.14
Arg3  71.43  35.71  47.62   50.00  16.67  25.00      66.67  35.71  46.51   33.33  16.67  22.22
Arg4  83.33  100    90.91   100    50.00  66.67      80.00  80.00  80.00    0.00   0.00   0.00
ADV   88.27  79.15  83.46   42.03  42.03  42.03      88.65  78.10  83.04   41.18  40.58  40.88
BNF   72.73  69.57  71.11   ---    ---    ---        75.00  65.22  69.77   ---    ---    ---
CND   71.43  35.71  47.62    0.00   0.00   0.00      66.67  28.57  40.00    0.00   0.00   0.00
DGR   ---    ---    ---     100    25.00  40.00      ---    ---    ---     83.33  41.67  55.56
DIR   64.00  55.17  59.26   66.67  40.00  50.00      66.67  55.17  60.38   77.78  46.67  58.33
DIS   73.21  42.27  53.59   71.43  40.00  51.28      74.65  54.64  63.10   76.47  52.00  61.90
EXT    0.00   0.00   0.00    0.00   0.00   0.00       0.00   0.00   0.00    0.00   0.00   0.00
FRQ   ---    ---    ---      0.00   0.00   0.00      ---    ---    ---      0.00   0.00   0.00
LOC   77.38  62.30  69.03   61.38  49.60  54.87      82.86  64.86  72.76   61.32  52.00  56.28
MNR   72.17  67.21  69.60   69.40  53.36  60.33      70.93  65.18  67.93   70.49  54.20  61.28
NEG   ---    ---    ---     100    25.00  40.00      ---    ---    ---     100    25.00  40.00
PRP   78.38  60.42  68.24   40.00  40.00  40.00      83.87  54.17  65.82   50.00  40.00  44.44
TMP   77.28  64.70  70.43   61.29  35.19  44.71      78.95  64.84  71.20   63.16  33.33  43.64
TPC    0.00   0.00   0.00    0.00   0.00   0.00       0.00   0.00   0.00   100    100    100
ALL   81.31  70.52  75.53   64.71  48.20  55.25      82.28  70.49  75.93   65.70  51.27  57.59

Table 4.4: Chinese PropBank 1.0 argument performance

sections           system    p      r      f1
Sinorama nominal   baseline  37.58  25.10  30.10
                   SPLDA     39.72  27.36  32.40
Sinorama verb      baseline  67.13  50.37  57.55
                   SPLDA     67.56  50.59  57.86
4051-4411 (verb)   baseline  62.01  50.74  55.81
                   SPLDA     62.70  51.03  56.27

Table 4.5: Chinese PropBank 3.0 out-of-genre results

4.5.2.1 Comparison

Direct performance comparison with previous Chinese SRL systems is a bit difficult: Xue [89] and Zhuang and Zong [100] trained the syntactic parsers with an additional 250K word broadcast news corpus found in Chinese TreeBank 6.0, while Sun [78] only reported results using gold POS tags but no additional gold parses. However, as table 4.6 shows, for verb predicates, our system bests Xue's [89] system by 4-7 F1 points with less parser training data, and, when tested with (but not retrained to take full advantage of) gold POS tags, bests Sun's [78] system by 0.53 F1 point. For nominal predicates, our system bests Xue's [89] system by 1.9 F1 points on arguments of nominal predicates (since we have an integrated SRL system, the results are obtained by training on both verb and nominal predicates, then using only the nominal classifier to classify the nominal predicates).

type     system                p      r      f1
verb     Xue 2008             76.8   62.5   68.9
           w/ gold POS        79.5   65.6   71.9
         Sun 2010 (gold POS)  81.03  72.38  76.46
         SPLDA                82.74  70.96  76.40
           w/ gold POS⁵       82.81  71.93  76.99
nominal  Xue 2008             62.9   53.1   57.6
         SPLDA                67.30  53.31  59.50

Table 4.6: Chinese SRL comparison

⁵ We could not make the Berkeley parser fully utilize the gold POS tags: the parser performance improved, but the output parse trees did not strictly adhere to the gold POS input

The verb results in this table are with the SRL system trained on verb predicates only, whereas the table 4.3 results are of SRL systems trained and tested on all predicates.

4.5.2.2 English SRL

We also studied the impact of English selectional preference on the Wall Street Journal (WSJ) corpus as annotated for CoNLL-2005. The training set is composed of sections 02-21, the development set is section 24, and the test set is section 23. We again generated the input parses using the Berkeley parser (with 10-fold cross-validation on the training data). Interestingly, we obtained the best SRL results by using 5 split-merge cycles for the training parses and 6 split-merge cycles on the development set, even though 6 split-merge cycles produced the most accurate parses. We surmise that the lower quality training parses prevented the SRL model from overfitting to syntactic features extracted from the parse tree. We parsed the test set with 6 split-merge cycles.

We applied the same topic model based selectional preference techniques (SPLDA) to English SRL using the English Gigaword⁶ corpus. We used 800 topics (with lemmatized headwords), tuned on the CoNLL-2005 development set. Table 4.7 (page 58) lists some of the found topics (with the most frequent, relatively interesting, and least frequent words) using English Gigaword. We also integrated the VerbNet role based selectional preference features (SPVN) into our SRL system.

Compared to Zapirain et al. [98] (table 4.8 on page 58), our topic model based selectional preference approach had a smaller (but still statistically significant) absolute F1 improvement. But with a much higher performing baseline system, the relative error reduction rate is comparable. When we took out the support identification and 2-stage classifier enhancements, the margin of improvement for the topic model based selectional preference features did increase (0.65 F point), although that is not completely indicative of how much the selectional preference features would help in a lower performing baseline system, as those features were extracted using the support identification and 2-stage classifier enhancements. The VerbNet role based selectional preference features, on the other hand, only provided marginal gains (and were not statistically significant).

⁶ LDC2003T05

This is not too surprising as each of Zapirain et al.’s selectional preference models only improved the SRL system marginally, if at all (table 13 in [98]). We did achieve the highest English SRL F score by combining both the topic model and VerbNet role based selectional preference features.

Breaking down by argument types (table 4.9 on page 59) showed again that selectional preference mostly improved the performance of the core argument types. Arg0 did not show much improvement with the full feature set, possibly because the baseline was already very accurate. The selectional preference features did improve Arg0 by a similar margin as Arg1 when starting from a lower baseline.

4.5.3 Automatic SRL Error Analysis

A closer look at the automatic SRL outputs revealed error types very similar to the ones described in [46] (for English) and [89] (for Chinese). A significant source of error for both Chinese and English SRL is input parse error. This could be in the form of a non-disjoint argument span not mappable to a single constituent in the input parse, or of a constituent erroneously attached (in relation to the predicate) in the parse tree hierarchy. For Chinese, incorrect predicate part-of-speech identification was also a significant issue, as separately trained verb and nominal SRL classifiers both performed better when the correct predicate type inputs were used for each classifier. For English, because the predicates tend to be more polysemous, roleset classification error was a more significant issue, as incorrect roleset classification can lead to incorrect assumptions about permissible core argument types. Out-of-vocabulary words and argument role ambiguities (the latter often exacerbated by dropped pronouns and relative clauses in Chinese), which the SRL systems were able to partially correct with the addition of selectional preference features, dominated the rest of the error types.

4.6 Discussion

With the addition of support identification, 2-stage classification, and topic model based selectional preference features, we improved upon the Chinese SRL performance of argument labeling of verb predicates by 0.94 F point, argument labeling of nominal predicates by 3.2 F points, and 1.21 F points overall. In the process, we have produced a new state-of-the-art Chinese SRL system. When the same techniques are applied to English SRL, our system may have achieved one of the highest constituent based SRL F-scores (80.16) using a single input parse tree.

Still, these improvements are largely incremental: while significant in themselves, they may not result in significant improvements when used in downstream NLP applications. The issue likely lies in over-reliance on syntactic parse input, as the top SRL systems, even with structural inference on multiple (but likely very similar) input parses, are only able to produce small SRL performance gains. To make further inroads into automatic SRL, we likely need to explore joint learning of syntactic parsing and SRL.

On the other hand, there is room to improve our current multi-lingual filtering approach, as our current semantic mapping system does not attempt to directly correct argument labels from automatic SRL outputs. By using the Chinese argument label probability, the English argument label probability, and the argument mapping probabilities (instead of label compatibility heuristics), we may be able to filter out the less likely Chinese arguments from the predicted set as well as introduce new ones from the English SRL output.

Likewise, improvements can likely be made on the current implementation of distributional similarity based selectional preference, as the coarser-grained topic model based selectional preference model performed much better (and achieved a comparable performance gain to the ensemble of 11 similarity score based selectional preference models proposed by Zapirain et al. [98]). One reason may be that each selectional preference model is a poor argument label predictor on its own compared to the baseline SRL system, and when only the predicted label output is used as a feature, the overall SRL system cannot take full advantage of it. For the topic model based selectional preference implementation, the model is not used to directly predict the argument label; instead, only the topic ids of the headwords are used as features. To mirror this for the distributional similarity based models, we could directly integrate the similarity scores between different word pairs into the SRL model. We did not do this in our SRL system as it is based on a feature-space classifier, and explicitly enumerating all the similarity scores as features would lead to feature explosion. However, with a kernel space classifier, Moschitti et al. [61] demonstrated that similarity scores (in their case, sub-tree similarity) can be efficiently integrated into an SRL system without enumerating all possible features. We plan to explore this technique for distributional similarity based selectional preference in the future.

topic               headword, argument label pairs
soccer/football     Ronaldo:Arg0 Hull:Arg0 Baggio:Arg0 ... breakaway:Arg1 grounded:Argm-tmp ... Benda:Arg0 Bak:Arg0 Askins:Arg0
disease             disease:Arg2 virus:Arg2 patients:Arg1 cancer:Arg2 ... anemia:Arg2 rudely:Argm-mnr ... herpes:Arg1 fungus:Arg2 bruise:Arg2
military & conflict warplane:Arg0 village:Arg1 town:Arg1 jet:Arg0 ... Nagasaki:Arg1 wantonly:Argm-mnr ... Dresden:Arg1 detachment:Arg1 Biko:Arg0
disaster            disaster:Arg1 recession:Arg1 collapse:Arg1 catastrophe:Arg1 ... armageddon:Arg1 devaluations:Arg1 ... diamondback:Arg0 cigarette:Arg1
ailment             loss:Arg1 injury:Arg1 wound:Arg1 damage:Arg1 casualty:Arg1 ... tibia:Arg1 pox:Arg1 laryngitis:Arg1
famous athletes     Hardaway:Arg0 O'Neal:Arg0 Lemieux:Arg0 Sprewell:Arg0 Olajuwon:Arg0 ... Bergoust:Arg0 Basso:Arg0 Addison:Arg0
liquid              water:Arg1 blood:Arg1 river:Arg1 juice:Arg1 liquid:Arg1 ... pipeline:ArgM-DIR ... grease:Arg0 condensation:Arg1 aquifer:Arg0
edibles             food:Arg1 meat:Arg1 chicken:Arg1 fish:Arg1 ... carbohydrate:Arg1 grill:Argm-loc stir:Argm-mnr ... fat:Argm-mnr equip:Argm-pnc earthworm:Arg1

Table 4.7: Topics in English Gigaword

system                 p      r      f1     error∆
SwiRL                 79.7   70.9   75.0
Zapirain 2013         80.0   71.3   75.4    −1.60%
baseline+sup, stage2  82.59  77.27  79.84
  +SPVN               82.57  77.39  79.90   −0.30%
  +SPLDA              82.96  77.52  80.15   −1.54%
  +SPLDA&SPVN         82.93  77.57  80.16   −1.59%
baseline              79.93  76.60  78.23
  +SPVN               79.99  76.60  78.25   −0.09%
  +SPLDA              80.83  77.02  78.88   −2.99%

Table 4.8: English SRL comparison (CoNLL-2005 WSJ)

baseline+sup, stage2 +SPVN +SPLDA p r f1 p r f1 p r f1 Arg0 90.00 87.26 88.61 89.83 87.37 88.58 89.71 87.67 88.68 Arg1 80.93 78.49 79.69 81.09 78.61 79.83 81.51 78.95 80.21 Arg2 75.67 66.13 70.58 75.23 67.03 70.89 77.54 68.11 72.52 Arg3 74.05 56.40 64.03 74.05 56.40 64.03 76.92 58.14 66.23 Arg4 74.07 78.43 76.19 74.77 78.43 76.56 68.97 78.43 73.39 Arg5 60.00 60.00 60.00 75.00 60.00 66.67 80.00 80.00 80.00 ADV 64.60 51.58 57.36 64.65 50.59 56.76 63.87 48.22 54.95 CAU 77.36 53.95 63.57 78.85 53.95 64.06 76.47 51.32 61.42 DIR 69.09 44.71 54.29 72.55 43.53 54.41 67.92 42.35 52.17 DIS 82.37 75.94 79.02 82.31 75.63 78.83 82.09 75.94 78.90 EXT 84.21 50.00 62.75 84.21 50.00 62.75 72.73 50.00 59.26 LOC 68.20 52.73 59.48 67.48 52.73 59.20 68.40 50.27 57.95 MNR 60.83 54.89 57.70 61.32 56.03 58.56 62.78 55.75 59.06 MOD 98.18 98.00 98.09 98.18 97.82 98.00 98.17 97.46 97.81 NEG 98.67 96.52 97.58 98.67 96.52 97.58 98.24 96.96 97.59 PNC 64.86 41.74 50.79 64.86 41.74 50.79 67.14 40.87 50.81 PRD 50.00 20.00 28.57 50.00 20.00 28.57 50.00 20.00 28.57 REC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 TMP 80.93 76.01 78.39 80.76 76.28 78.45 80.88 76.10 78.41 All 82.59 77.27 79.84 82.57 77.39 79.90 82.96 77.52 80.15

Table 4.9: English SRL argument performance (CoNLL-2005 WSJ)

Chapter 5

Chinese Empty Category Recovery

We experiment with 2 Chinese empty category (EC) recovery approaches, one attempting to take advantage of semantic mapping of parallel English sentences, the other using the dependency-based EC recovery approach proposed by Xue and Yang [93]. With the machine translation application in mind, we first study mapping Chinese dropped subjects to English, both as a means to improve Chinese-English machine translation and as a basis for improving Chinese EC recovery using parallel English data.

5.1 Chinese dropped subject mapping to English

5.1.1 Motivation

In order to take full advantage of parallel corpora to recover Chinese dropped subjects, we first would like to understand what words (if any) need to be inserted in place of dropped subjects in the translation, and under what circumstances. Also, there are currently no automatic systems for classifying the type of Chinese dropped subjects; only recently has there been effort directed towards manually annotating them [4]. Through semantic mapping of parallel corpora, we may be able to annotate Chinese dropped subjects with their English equivalents automatically on a much larger corpus.

5.1.2 Framework

We begin by performing a semantic mapping on parallel corpora that have been Treebank-ed and PropBank-ed, using both human annotated word alignment and automatic word alignment. From that, the system picks the one-to-one Chinese and English verb predicate pairings that optimize the combined semantic similarity. The next step is to look for English pronoun arguments as well as other unaligned words related to the mapped English predicate to determine whether they should align to the *pro* argument in the Chinese predicate-argument structure. In cases where no English words or empty categories align to the *pro* argument, we examine the differences in syntactic and semantic structures between the parallel sentences for common patterns.
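One way to realize the one-to-one pairing step is the Hungarian algorithm (Kuhn [47]); the sketch below uses scipy's linear_sum_assignment as a stand-in, and the similarity matrix itself is assumed to come from the semantic mapping model:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_predicates(sim):
    """sim: |Chinese predicates| x |English predicates| matrix of semantic
    similarity scores. Returns (zh_idx, en_idx) pairs maximizing the total
    combined similarity."""
    # linear_sum_assignment minimizes cost, so negate to maximize similarity
    rows, cols = linear_sum_assignment(-np.asarray(sim))
    # drop pairings with no evidence of similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r][c] > 0]
```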

5.1.3 Mapping semantic roles between Chinese and English

To align Chinese dropped subjects to English translations, we rely on finding parallel predicate-argument structures between the two languages. Consider the following parallel fragments (note: the Chinese speaker used 他(he) and 他们(they) interchangeably):

他 怕 *pro* 到 城里 以后 ...
he afraid arrive inside city after
They were afraid that once they go to the city

The Chinese fragment has 2 verb predicates: 怕(afraid) and 到(arrive). The PropBank SRL for 怕(afraid) is:

• Arg0: 他 (he)

• Arg1: *pro* 到 城里 以后 (after entering the city)

while the SRL for 到(arrive) is:

• Arg0: *pro*

• Arg1: 城里 (in the city)

The English translation also has 2 verb predicates: were and go. The SRL for were is:

• Arg0: they

• Arg1: afraid that once they go to the city

while the SRL for go is:

• Arg0: they

• Arg4: to the city

If we can infer that the predicate-argument structure of 怕(afraid) is equivalent to the predicate-argument structure of were, 到(arrive) equivalent to go, and 到.Arg1 equivalent to go.Arg4, then it is easy to infer that 到.Arg0 is equivalent to go.Arg0 (or, if we are dealing with automatic parses/SRL without empty categories, that 到(arrive) has 他(he) as a dropped subject).
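The inference itself is straightforward once the predicates and their arguments are aligned; the following is a minimal sketch (with hypothetical, simplified data structures) of reading the English equivalent of a *pro* argument off an argument alignment:

```python
def infer_pro_antecedent(zh_args, en_args, arg_alignment):
    """zh_args: dict of Chinese label -> filler, e.g. {'Arg0': '*pro*', ...}
    en_args: dict of English label -> filler, e.g. {'Arg0': 'they', ...}
    arg_alignment: dict of Chinese label -> aligned English label
    Returns the English filler aligned to the Chinese *pro*, if inferable."""
    for zh_label, filler in zh_args.items():
        if filler == "*pro*":
            en_label = arg_alignment.get(zh_label)
            if en_label:
                return en_args.get(en_label)
    return None

# For the example above:
# infer_pro_antecedent({'Arg0': '*pro*', 'Arg1': '城里'},
#                      {'Arg0': 'they', 'Arg4': 'to the city'},
#                      {'Arg0': 'Arg0', 'Arg1': 'Arg4'})  ->  'they'
```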

5.1.4 Heuristics

We propose 8 heuristics for aligning Chinese *pro* to English: 1 heuristic aligns *pro* to the English empty category, 4 heuristics align *pro* to one or more English words, while the remaining 3 heuristics detect cases where the *pro* should not align to any English empty category or word.

5.1.4.1 Alignment to English *PRO*

Because Chinese grammar does not restrict the subject in what would be a non-finite clause in English, *PRO* in the English translation often becomes the equivalent of Chinese *pro*.

For example, in the following parallel sentences:

日本 把 这个 [*pro*] 入 常 的 希望 放在 今年
Japan let this enter frequent hope put this year
Japan placed its hope of [*PRO*] becoming a permanent member of the Security Council this year

*pro* is the subject of 入(enter) while *PRO* is the subject of becoming in the translation.

Heuristic 1 (PRO): if the mapped SRL pair has a Chinese *pro* subject and an English *PRO* subject, we align the *pro* to *PRO*.

5.1.4.2 Alignment to personal pronoun

A Chinese dropped subject often needs to be replaced with a personal pronoun in the English translation. For example:

他 怕 [*pro*] 到 城里 以后 ...
he afraid enter inside city after
They were afraid that once [they] go to the city

*pro* is the subject of 到(arrive) while they is the subject of go.

Heuristic 2 (PRP): if the mapped SRL pair has a Chinese *pro* subject and an English personal pronoun subject, we align the *pro* to the personal pronoun.

5.1.4.3 Alignment to existential word

An existential clause (often with the verb predicate 有(have)) in Chinese typically has a dropped subject, whereas in English, “there” needs to be in the subject position. For example:

每当 [*pro*] 有 重大 活动 或 节日 时
whenever have major event or holiday time
Whenever [there] is a major event or holiday

*pro* is the subject of 有(have) while there is the subject of is.

Heuristic 3 (EX): if the mapped SRL pair has a Chinese *pro* subject and an English existential subject, we align the *pro* to the existential word there.

5.1.4.4 Alignment to demonstrative determiner

Demonstrative determiner subjects (这(this), 那(that), etc.) in Chinese are often dropped within the sentence. For example:

没有 遭到 破坏,[*pro*] 还是 比较 正常 的
haven't suffer damage still is relatively normal
they were not damaged and [this] is quite normal

*pro* is the subject of 正常(normal) while this is the subject of is.

Sometimes in the English translation, a multi-word noun phrase (starting with a uniquitive determiner) is used to further disambiguate a dropped (personal pronoun) subject:

在 八路军 的 强大 攻势 下,[*pro*] 只 有 着 招架 之 功
at eighth route army powerful offensive under only have defend of power
Under the powerful offensive of the Eighth Route Army, [the enemy] was only able to defend itself

Here *pro* is the subject of 有(have) while the enemy (outside the context of the Chinese sentence) is the subject of was.

Heuristic 4 (DET): if the mapped SRL pair has a Chinese *pro* subject and an English subject that is (or starts with) a uniquitive determiner (the, this, that, these, those), we align the *pro* to the noun phrase subject (provided the words in the NP are not aligned to any other words in the Chinese sentence).

5.1.4.5 Alignment to relative pronoun

With two conjoined Chinese clauses, the subject of the second clause is often dropped. With a relative clause construction in the English translation, a relative pronoun must be inserted. For example:

香港 电影 更 成为 让 了解 香港 的 一 扇 窗户,[*pro*] 令 香港 的 大 都会 形象 在 国际 间 更为 彰显
Hongkong movie further become let understand Hongkong one window let Hongkong big city image at international between more highlight
Hong Kong movies have even become a window for the world to see Hong Kong, [which] have made the image of metropolitan Hong Kong more prominent internationally.

*pro* is the subject of 令(let) while which is the syntactic subject of made.

Heuristic 5 (WP): if the mapped SRL pair has a Chinese *pro* subject and an English relative pronoun subject, we align the *pro* to the relative pronoun.

5.1.4.6 Omission with passive construction

With a passive construction, the logical subject/syntactic object can often be omitted. When the English translation uses a passive construction, a Chinese dropped subject in an active construction need not align to anything. For example:

到底 [*pro*] 泄 的 什么 密 ?
in the end divulge what secret
What secret exactly was divulged?

*pro* is the subject of 泄(divulge) while the logical subject of divulged is omitted in the passive construction.

Heuristic 6 (pass): if the mapped SRL pair has a Chinese *pro* subject in an active construction and no English syntactic object in a passive construction, we align no word or empty category to the *pro*.

For English, we used the six heuristics detailed by Igo [40] to detect both ordinary and reduced passive constructions.

For Chinese, we detected the presence of passive indicator words (those with SB, LB POS tags) amongst the siblings of the predicate. However, this is not always sufficient. For example:

一 个 出奇制胜 作战 计划 产生 了
one win by surprise battle plan formulate
A battle plan based on surprise was formulated

Both the Chinese and English (产生, formulate) use a passive construction, but there are no Chinese passive indicator words. To prevent over-application of the heuristic, we check whether the order (before or after the predicate) of aligned arguments differs.
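The order check can be realized as a simple comparison over aligned argument positions; the following is a sketch with a hypothetical token-index representation, not the thesis implementation itself:

```python
def order_flipped(aligned_pairs, zh_pred_idx, en_pred_idx):
    """aligned_pairs: list of (zh_arg_idx, en_arg_idx) token positions of
    aligned argument heads; *_pred_idx: the predicate token positions.
    True if any aligned argument sits on opposite sides of its predicate
    in the two languages, suggesting a voice alternation."""
    for zh_idx, en_idx in aligned_pairs:
        zh_before = zh_idx < zh_pred_idx
        en_before = en_idx < en_pred_idx
        if zh_before != en_before:
            return True  # argument moved to the other side of the predicate
    return False
```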

5.1.4.7 Omission with coordination

With 2 Chinese clauses sharing the same logical subject, the English translation may avoid inserting the subject of the second clause by conjoining the 2 clauses with coordination. For example:

蚊子 盯 几 个 疙瘩,[*pro*] 再 回来
mosquito bite several welt again come back
The mosquito sees a few welts and comes back again

蚊子(mosquito) (subject of 盯(bite)) and the *pro* (subject of 回来(come back)) are coreferent.

The 2 clauses in the English translation share the same subject with coordination, omitting the *pro* subject.

Heuristic 7 (coord): if the mapped SRL pair has a Chinese *pro* subject and the verb phrase (VP) of the English predicate is part of a coordinated verb phrase (but not the first child VP of the coordinated VP), we align no word or empty category to the *pro*.

5.1.4.8 Omission with predicate nominalization

Another common way English translations avoid explicitly disambiguating Chinese dropped subjects is to nominalize the verb predicate. For example:

没有, 过去 [*pro*] 全 没有 能够 去 的
no past all no can go
No, it was totally inaccessible in the past

*pro* is the subject of 去(go), but in the translation 没有(no) 能够(can) 去(go) is nominalized to inaccessible.

Heuristic 8 (nom): if a Chinese predicate has a *pro* subject and the aligned English word is a noun, we align no word or empty category to the *pro*.

To prevent over-application of the heuristic caused by erroneous word alignment, we use WordNet [31] to examine only nouns that have a morphologically related verb (ex: inaccessible is related to the verb access). And unlike the previous 7 heuristics, this one does not rely on semantic mappings.
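The morphological relatedness test can be performed with WordNet's derivational links; below is a minimal sketch using NLTK's WordNet interface (the thesis used WordNet [31]; NLTK and the function name here are assumptions for illustration):

```python
# requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def has_related_verb(noun):
    """True if any sense of `noun` has a derivationally related verb form,
    e.g. 'destruction' -> 'destroy'."""
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == 'v':
                    return True
    return False
```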

5.1.5 Experiment

5.1.5.1 Setup

We used the triple-gold BC corpus to evaluate the heuristics against bilingual human annotators. The broadcast conversation portion was chosen over the Xinhua News portion because of the prevalence of *pro*s: there are 1859 *pro*s in 2944 total sentences (1335 sentences contain one or more *pro*s). The Xinhua News portion, by comparison, has only about 34% of the density of *pro*s per word.

We performed the semantic mapping using the triple-gold annotation. We also experimented with using automatic word alignment output in two ways:

(1) semantic mapping: We used GIZA++ [62] and the Berkeley Aligner [22], trained with the 400K parallel sentence corpus, as input to the semantic mapping system.

(2) establishing baseline: We trained the Berkeley Aligner on the 19K Chinese-English parallel sentence pairs from OntoNotes 4.0, with *pro*s and *PRO*s in place. The output would then contain both ordinary word alignments and *pro* alignments. This is a much smaller word alignment training corpus, as only parallel Treebank-ed sources can be used.

Two bilingual linguists were tasked with aligning the *pro*s to English empty categories or words. The annotators were instructed not to align multiple *pro*s to the same empty category or word, and vice versa (except when aligning a *pro* to a multi-word noun phrase). The annotators were, however, allowed to align *pro* to English empty categories besides *PRO*. For *pro*s without an English alignment, the annotators were asked to annotate whether the *pro* was a personal pronoun or one of the 4 abstract types (existential, unspecified, event, and pleonastic) detailed by Baran et al. [4]. Each annotator aligned about half the corpus, with a small portion double annotated to compute an inter-annotator agreement (IAA) score.

5.1.5.2 Results

There are 1859 *pro*s in the broadcast conversation section of the corpus (referred to as “all *pro*”). 1469 *pro*s appear as the subject of a verb predicate in the OntoNotes 4.0 PropBank annotation (referred to as “subject *pro*”). Of those, 1007 *pro* subjects have a Chinese SRL predicate with a mapping to a corresponding English SRL predicate from the semantic mapping system (referred to as “mapped *pro*”). The 8 heuristics, with the exception of “predicate nominalization”, can only operate on these mapped *pro*s, where there is a subject *pro* and there are matched predicates.

5.1.6 Heuristics distribution

Figure 5.1 details the distribution of the 8 heuristics. Of the 1007 mapped *pro*s (out of 1859, or 54.2%), 208 do not have any applicable alignment heuristic. While some of those are undoubtedly errors in the application of the heuristics, for others, the sentence structure of the translation is too far removed from the patterns in the heuristics to match.


Figure 5.1: Distribution of *pro* heuristics applied to broadcast conversation. “N/A” denotes the number of mapped *pro*s to which none of the heuristics apply.

We also performed *pro* alignment on the newswire (Xinhua News) portion of the corpus and compared it against the broadcast conversation portion (Figure 5.2). Because of the much more literal translation (and possibly very ambiguous dropped subjects), the newswire translations tend to avoid inserting word tokens for the dropped subject. This is evident in the less frequent application of the personal pronoun heuristic and the much more frequent application of the passive construction heuristic.


Figure 5.2: Comparison of frequency of heuristic application between the broadcast conversation (bc) and Xinhua News (nw) corpora

5.1.7 Heuristics performance

Table 5.1 details the *pro* alignment heuristics as well as the baseline automatic word alignment output comparisons against human annotation on the entire broadcast conversation corpus portion. As the alignment heuristics can only operate on mapped *pro*s, precision remains the same while recall decreases going from the smaller subset of *pro*s to all *pro*s. However, the drop-off is not as severe as one might expect (given mapped *pro*s only account for 54.2% of all *pro*s, or 1007 of 1859): 71.1% of mapped *pro*s have an empty category or word alignment while the ratio for the rest of the *pro*s drops to 31.3%.

The heuristics, even when applied with automatic word alignment, consistently outperform the baseline system in precision and overall F-score, improving 10-12 precision points and 3-8 F points for mapped *pro*s, and over 20 precision points and 1-6 F points for all *pro*s. Even though the word alignment training corpus for the heuristic system was much larger than for the baseline system (400K vs 19K sentences), the word alignment accuracy was only slightly better: 66.0 vs 65.8 F-score. Therefore, word alignment performance differences likely did not account for much of the performance increase.

*pro* type   w/ EC   system           precision   recall   F-score
mapped       yes     baseline         61.3        66.3     63.7
                     heuristic auto   71.3        65.1     68.0
                     heuristic gold   75.9        79.2     77.5
mapped       no      baseline         56.3        59.4     57.8
                     heuristic auto   68.7        62.8     65.6
                     heuristic gold   75.7        81.6     78.5
subject      yes     baseline         49.8        64.5     56.2
                     heuristic auto   68.8        59.2     63.6
                     heuristic gold   75.9        71.0     73.4
subject      no      baseline         42.3        57.7     48.8
                     heuristic auto   65.6        58.6     61.9
                     heuristic gold   75.7        75.0     75.3
all          yes     baseline         48.9        64.2     55.5
                     heuristic auto   68.8        48.1     56.6
                     heuristic gold   75.9        57.7     65.5
all          no      baseline         41.7        57.9     48.5
                     heuristic auto   65.6        46.5     54.4
                     heuristic gold   75.7        59.5     66.6

Table 5.1: *pro* alignment (including & excluding alignment to English empty categories (EC)) results on broadcast conversation. The baseline system is the Berkeley Aligner output trained with *pro* and *PRO* in place. The heuristic auto output is the heuristics applied using automatic word alignment (Berkeley Aligner). The heuristic gold output is the heuristics applied using gold standard word alignment annotation.

type      with EC   IAA    System
mapped    yes       86.5   85.0
mapped    no        88.2   85.8
subject   yes       84.8   82.4
subject   no        87.3   83.1
all       yes       83.1   73.9
all       no        86.2   73.9

Table 5.2: Inter-annotator agreement F-score vs system F-score (averaged between 2 annotators) on *pro* alignment

This lack of word alignment improvement from a larger corpus also underscores the importance of good semantic annotation for *pro* alignment. Curiously, training the Berkeley Aligner on the 19K sentences with *pro*s and *PRO*s in place also produced slightly more accurate word token alignment than training on the sentences without the empty categories (65.8 vs 65.5 F-score). The difference, however, is not statistically significant overall.

The baseline system performed worse on *pro* alignment to word token(s) than to empty categories, likely due to the more diverse alignments of word tokens (compared to just English *PRO*s and *pro*s) that must be learned from a small training corpus. The baseline system did achieve higher alignment recall on all *pro*s, since it is not limited to cases where a semantic mapping with a *pro* in the subject position can be found.

Table 5.2 compares the inter-annotator agreement F-score on *pro* alignment with the F-score of the heuristic output against double annotated (but not adjudicated) sections of broadcast conversation. The sections were chosen arbitrarily, but they were possibly easier to annotate, as the heuristics performed much better on these than on the entire set. While the heuristics output cannot match human annotators on all *pro*s, for the subset of mapped *pro*s, its performance is only about 2 F-score points lower than inter-annotator agreement.

There is a surprisingly large difference in *pro* alignment performance between using GIZA++ and the Berkeley aligner (Table 5.3). While GIZA++ did produce worse word alignment (both precision and recall around 6 points lower) than the Berkeley aligner, the semantic mapping output was quite a bit worse (61.8 vs 85.8 F-score). The resulting *pro* alignment performance drop was larger than the difference going from gold standard word alignment to the Berkeley aligner (in fact, the drop in semantic mapping F-score correlated very closely with the drop in F-score on *pro* alignment). Because of the inherent redundancy in considering the word alignment of both the predicate and its arguments, the semantic mapping system can typically tolerate a certain amount of word alignment error quite well: the Berkeley aligner achieved a 66.0 F-score, yet the semantic mapping system using its output achieved an 85.8 F-score. However, with the 60.2 F-score GIZA++ output as input, the semantic mapping system only achieved a 61.8 F-score, much closer to the word alignment performance.

type      with EC   GIZA++   Berkeley   Gold
mapped    yes       52.4     68.0       77.4
mapped    no        50.1     65.6       78.5
subject   yes       49.5     63.6       73.2
subject   no        47.3     61.9       75.3
all       yes       43.5     56.6       65.4
all       no        40.9     54.4       66.6

Table 5.3: *pro* alignment heuristics F-score using GIZA++, Berkeley, and gold word alignment, compared against human annotation on the entire broadcast conversation portion

5.1.8 Analysis

The 8 heuristics discussed (including detecting when the dropped subject has no alignment), when applied with automatic word alignment input, consistently outperformed the baseline automatic word alignment output, especially on *pro* alignment to English word tokens (where the F-score is 5.9-7.8 points higher). When they are applied using gold standard PropBank and word alignment, the results approximate human annotator performance, given that the semantic mapping system is able to map the Chinese predicate-argument structure (with *pro* subject) to an English equivalent (85.8 F-score vs 88.2 F-score for IAA). Moreover, these heuristics apply to 80% of the mapped *pro*s, regardless of whether they have an English alignment or not, pointing to their potential value as features for improving Chinese dropped subject recovery in the presence of parallel text.

On the other hand, mapped *pro*s only account for 54.2% of all *pro*s in the corpus (although they account for 72.8% of *pro*s that have an English alignment). And this figure may be lower when automatic SRL and word alignment outputs are used for semantic mapping (to be fair, the OntoNotes 4.0 PropBank has some unannotated verb predicates that automatic SRL may pick up). So while the heuristics based on semantic mapping are very useful for aligning Chinese dropped subjects to English, they also leave a sizable portion unexplained. A combined approach using both the baseline word alignment approach and the heuristics may yield a better performing *pro* alignment system overall, especially if the training of the word aligner is bootstrapped with some of the *pro* alignments from the heuristics system.

5.2 Chinese empty category recovery with parallel corpora

We now turn our attention to building a Chinese EC recovery system using the *pro* alignment heuristics we've studied.

5.2.1 Monolingual implementation

The basic implementation of the Chinese dropped subject recovery system we describe here is very similar to Dienes and Dubey [23] for English empty element recovery and Yang and Xue [94] for Chinese empty element recovery: empty element recovery is posed as a sequence classification problem requiring a decision on whether there is an empty element immediately in front of each word token in a sentence (although with the rare dropped object, a Chinese EC instance can occur at the end of a sentence as well).

We start off with a simple set of lexical features for the classifier; these include:

Current word lemma of the word in focus

Current POS part-of-speech tag of the current word

Previous word lemma of the previous word

Previous POS part-of-speech tag of the previous word 74

Next word lemma of the next word

Next POS part-of-speech tag of the next word

Word 2 lemma of the second word to the right

POS 2 part-of-speech tag of the second word to the right

We also adopt a number of syntactic features that Yang and Xue [94] have found to suggest the presence and/or the type of an EC instance; these include

first-IP-child whether the position is the start of the lowest IP constituent

1st-word-in-subjectless-IP whether the word in the current position starts a subjectless IP constituent

verb-in-NP/VP whether the word in the current position is a verb in an NP or VP

has-no-object whether the word at the previous position is an intransitive verb

For semantic features, we search for the closest verb to the current position and use its core argument label set (as well as whether the current position is adjacent to or inside the phrase boundary of any of the arguments). We also look up the frame file for the expected set of core arguments for the verb.
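The frame-file comparison amounts to a set difference between expected and found core arguments; a minimal sketch (the frame-file representation here is a hypothetical simplification):

```python
def missing_core_args(predicate, found_labels, frame_files):
    """predicate: verb lemma; found_labels: set of core argument labels the
    SRL system produced for it; frame_files: dict mapping lemma -> set of
    expected core argument labels, e.g. {'Arg0', 'Arg1'}.
    Returns the expected arguments the SRL system did not find, which can
    be emitted as features suggesting a nearby EC."""
    expected = frame_files.get(predicate, set())
    return sorted(expected - found_labels)  # e.g. ['Arg0'] for a missing subject
```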

Since we decided to use a linear classifier, we created a number of bigrams and trigrams of the feature sets. Like our SRL system, we used LIBLINEAR [30] as the classifier. Also like our SRL system, we implemented a 2-stage classification system where the EC predictions of the first stage classifier are used as additional features for the second stage classifier (a sketch follows the list below). These additional features are:

(1) predicted label at the current position

(2) predicted label at the previous position

(3) predicted label at the next position

(4) all predicted label types in the sentence 75

(5) all predicted label types left of the current position

(6) all predicted label types right of the current position
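A condensed sketch of this 2-stage scheme follows (feature extraction is elided; scikit-learn's LinearSVC, which wraps LIBLINEAR, stands in for the classifier, and the function is a simplification, not the thesis implementation):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_two_stage(token_features, labels):
    """token_features: list of feature dicts, one per word position in a
    sentence; labels: gold EC label ('' for no EC) at each position."""
    vec1 = DictVectorizer()
    stage1 = LinearSVC().fit(vec1.fit_transform(token_features), labels)
    # NOTE: for real training, stage-1 predictions over the training data
    # should come from cross-validation to avoid overly optimistic features
    preds = stage1.predict(vec1.transform(token_features))

    # stage 2: original features plus stage-1 predictions in a window,
    # plus bag-of-predicted-types over the sentence
    enriched = []
    for i, feats in enumerate(token_features):
        f = dict(feats)
        f["s1_cur"] = preds[i]
        f["s1_prev"] = preds[i - 1] if i > 0 else "<s>"
        f["s1_next"] = preds[i + 1] if i + 1 < len(preds) else "</s>"
        for p in set(preds):
            f["s1_any_%s" % p] = True
        enriched.append(f)
    vec2 = DictVectorizer()
    stage2 = LinearSVC().fit(vec2.fit_transform(enriched), labels)
    return (vec1, stage1), (vec2, stage2)
```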

5.2.2 Parallel feature enhancement

Using the 8 *pro* alignment heuristics described in section 5.1.4, we propose 8 more features (each corresponding to an activated *pro* alignment heuristic):

Relative clause the English predicate is in a relative clause

Personal pronoun the English SRL has an unaligned pronoun subject

Existential word the English predicate has an existential word as the subject

Demonstrative determiner the English SRL has an unaligned demonstrative determiner

Relative pronoun the English SRL has an unaligned relative pronoun

Passive voice the English predicate has passive voice while the Chinese predicate has active voice

Coordination the English predicate is in a VP coordination

Non-verbal predicate the English predicate is an adjective or noun

5.2.3 Experiment

5.2.3.1 Setup

We used a similar setup as section 5.1.5.1: we used both the triple-gold BC and triple-gold Xinhua News corpora for evaluation. The parallel corpora used for our systems consist of the 19K Chinese-English parallel sentence pairs in OntoNotes 4.0 (excluding the evaluation data); otherwise, systems are trained on all of OntoNotes 4.0 (excluding the evaluation data). Our training parses and SRL are both generated through 10-fold cross-validation. We used the Berkeley aligner, trained with the 400K parallel sentence corpus, as input to the semantic mapping system. We trained 3 EC types: *pro*, *PRO*, and *T*, since these are the only types likely to be affected by our heuristics-based feature enhancements.

                 monoling./parallel   bilingual            monoling./full corpus
corpus  type     unlabeled  labeled   unlabeled  labeled   unlabeled  labeled
bc      *pro*    55.7       49.9      55.7       49.9      59.9       55.2
bc      *PRO*    52.6       37.6      52.4       36.2      65.5       52.0
bc      *T*      46.2       29.1      45.7       28.7      57.3       43.6
Xinhua  *pro*    40.9       24.0      42.0       24.9      54.4       35.4
Xinhua  *PRO*    48.3       33.0      50.6       33.3      67.2       44.1
Xinhua  *T*      49.6       40.1      50.7       41.7      58.0       48.3

Table 5.4: EC results of bilingual and monolingual systems (trained on the parallel sentences and all of OntoNotes 4.0)

5.2.3.2 Results

We report the performance of our Chinese EC recovery systems in table 5.4. Comparing the monolingual and bilingual systems trained on the same 19K sentence pair parallel data, the bilingual system does show some improvement on the Xinhua News evaluation data: the labeled performance of *pro* improved by 0.9 F point and *T* by 1.6 F points. However, because there are many more annotated Chinese trees, the monolingual system significantly outperformed the bilingual system (6-16 F points) when trained with all available annotations. With such a large gap, it is unlikely that any improvement to either the semantic mapping or dropped subject mapping system would bridge the difference and give the bilingual system a practical advantage.

5.3 Dependency-based empty category recovery

Xue and Yang [93] proposed a dependency-based empty category recovery approach (operating on phrase structure parse trees). By identifying the head of the EC (which also helps with inserting the recovered EC into the phrase structure tree) and modeling its relationship with its EC dependent, they were able to achieve state-of-the-art results on Chinese EC recovery. One other advantage of this approach is that it presents a more natural way of incorporating SRL features into the EC recovery system, as most EC instances, apart from *OP* instances, are dependents of a verb predicate. The found argument set, along with the expected argument set of the verb predicates, can indicate which arguments are missing. However, Xue and Yang (table 4 in [93]) did not find that using SRL features improved the performance of their EC system.

We implemented a system based on their approach to see whether SRL features, especially from the output of a higher performing SRL system, can improve Chinese EC recovery with a dependency-based EC recovery system.

5.3.1 Implementation

We implemented most of the features detailed in [93]; these include the linear relationship between the EC position and its head (dubbed horizontal features), the tree path relationship between the EC position and its head (dubbed vertical features), as well as syntactic features identifying certain grammatical constructions (not too different from our previous system). We implemented SRL features similar to our previous system (matching the found core arguments with the expected core arguments from the frame file); however, we also indicate which of the found arguments are local (inside the immediate IP constituent headed by the predicate): a core argument that is outside the local IP but heads a CP suggests the predicate has a trace dependent, whereas an Arg0 shared with the parent suggests the predicate has a *PRO* subject dependent.

With a dependency-based EC model, without any filtering, the number of candidates we need to consider for a sentence of n words would be Θ(n²), since an EC instance can appear in any of the n + 1 positions and there are n headwords. Even after taking into account that the EC instance must be inside the constituent spanned by the headword, the number of candidates under consideration is still much larger than in a linear word sequence based EC model. From the training corpus, we discovered that typically only words that head an IP, CP, or QP constituent have any EC dependents. With automatic parse input, this may not hold, especially if the part-of-speech is labeled incorrectly; however, the EC model is unlikely to overcome these types of parse errors anyway. Therefore, we only retain candidates that head an IP, CP, or QP constituent.

The rest of the system implementation follows our linear word sequence based approach.
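The candidate filter can be expressed compactly; the sketch below assumes a hypothetical constituent-tree interface (node labels, lexical head indices, and token spans), not the actual implementation:

```python
HEAD_LABELS = {"IP", "CP", "QP"}

def ec_candidates(tree):
    """tree: constituent tree whose nodes expose .label, .head_index (token
    index of the node's lexical head) and .span() -> (start, end) offsets.
    Of the Θ(n²) (position, head) pairs, keep only heads of an IP, CP, or
    QP constituent, and only positions inside the head's span."""
    candidates = []
    for node in tree.all_nodes():
        if node.label in HEAD_LABELS:
            start, end = node.span()
            # an EC can sit immediately before any token in the span,
            # or span-finally for the rare dropped object
            for pos in range(start, end + 1):
                candidates.append((pos, node.head_index))
    return candidates
```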

5.3.2 Experiment

5.3.2.1 Setup

We used the same Chinese Treebank section splits for empty category recovery: sections 81-885 for training, sections 41-80 for development, and sections 1-40 and 900-931 for testing. We generated the training parses (with 10-fold cross-validation) and the test parses using the Berkeley parser1 (5 split-merge cycles). For reasons unknown, the parser performance (as measured by ParseEval [6]) we obtained on the test sections is an 82.73 F1 score, lower than the 83.63 F1 score reported by Xue and Yang [93]. We experimented with training on additional (broadcast news) parses to close the parser performance gap and found that EC detection performance can improve by 1-2 F1 points.

We also used our SRL systems as described in chapter 4, which are trained on the same Chinese Treebank sections (with the addition of PropBank annotation on these parses). The training SRL output is again produced with 10-fold cross-validation (although we did not generate separate topic models for each cross-validation training set).

5.3.2.2 Results

Our results, as compared to Xue and Yang's [93] in table 5.5, take into account all 1838 EC instances in the test set, even though some are in the same linear position and attach to the same head (mostly caused by an omitted 的/de in a relative construction). This is the cause of the differences in recall and F score for Xue and Yang's [93] system. Between this phenomenon and considering only heads of an IP, CP, or QP constituent (in the automatic parse tree), we are left with 85.9% of the EC instances that we can potentially identify correctly. As with Xue and Yang [93], we consider an EC to be correctly labeled if we identify the correct head attachment and linear position in the sentence (although we also report our results evaluated on linear position only).

With our SRL input and 2-stage classification system, we were able to best Xue and Yang's system by 2.1 F points overall, with a large portion of the difference made up by the 10+ F point improvement on *pro*.

1 code.google.com/p/berkeleyparser/

        Xue&Yang (paper)    Xue&Yang (all EC)   our system             evaluated on position
type    p     r     f1      p     r     f1      p      r      f1      p      r      f1
*pro*   39.7  15.4  22.2    39.7  14.6  21.4    56.56  21.90  31.58   59.02  22.86  32.95
*PRO*   60.2  53.1  56.4    60.2  53.1  56.4    65.54  51.15  57.46   68.91  53.77  60.41
*OP*    72.4  65.3  68.7    72.4  59.3  65.2    74.62  59.31  66.09   75.92  60.34  67.24
*T*     67.3  56.7  61.5    67.3  56.6  61.5    76.49  52.82  62.49   77.48  53.50  63.39
*       0     0     0       0     0/0   0/0     62.50  26.32  37.04   62.50  26.32  37.04
*RNR*   71.4  62.5  66.7    71.4  58.8  64.5    76.00  55.88  64.41   76.00  55.88  64.41
all     65.3  51.2  57.4    65.3  49.1  56.1    71.70  49.08  58.27   73.37  50.22  59.63

Table 5.5: EC results compared to Xue and Yang (paper results and results considering all 1838 EC instances in the test corpus)


To understand the impact of SRL on EC detection, we trained a system without SRL input, a system with our baseline Chinese SRL system (about 1 F point lower than the full system), as well as a system with gold standard SRL. We obtained the gold SRL results by mapping the PropBank annotation onto the automatic parse. Arguments that cannot be mapped to a single constituent in the automatic parse tree (5-8%) are discarded. The results (table 5.6) show that SRL can definitely improve EC recovery, although it is far from being a panacea: with gold SRL annotation, the overall F1 score improved from 54.76 to 65.10, and in particular, the *pro* performance improved by almost 30 F points (although 40.26 is still low). The overall improvement using our full-featured SRL system is 3.51 F points, with a 12 F point improvement on *pro*. While SRL also improved *PRO* and *T* performance by a smaller margin, *OP* showed little improvement even with gold SRL. The differences between the 2 automatic SRL systems are very modest. The full-featured SRL system did improve *pro* by 1.6 F points, although with the small *pro* sample size (315 instances in the test corpus), the improvement is not statistically significant.

The selectional preference feature on its own made an even smaller difference (the *pro* performance increased by 0.67 F point, but the overall improvement was a negligible 0.06 F point).

The latter is not too surprising, as our topic model based SP enhancement mostly improved the argument labeling performance of non-verbal predicates, which are rarely the head of an EC instance.

        no SRL                baseline SRL          sup+stage2 SRL        full-featured SRL     gold SRL
type    p      r      f1      p      r      f1      p      r      f1      p      r      f1      p      r      f1
*pro*   58.73  11.75  19.58   52.03  20.32  29.22   54.40  21.59  30.91   56.56  21.90  31.58   64.79  29.21  40.26
*PRO*   64.38  49.18  55.76   65.69  51.48  57.72   66.24  51.48  57.93   65.54  51.15  57.46   65.99  64.26  65.12
*OP*    74.17  57.93  65.05   74.57  59.66  66.28   74.57  59.66  66.28   74.62  59.31  66.09   75.00  62.59  68.23
*T*     73.22  47.69  57.76   76.28  53.33  62.78   75.85  53.16  62.51   76.49  52.82  62.49   89.60  61.88  73.21
*       0      0      0       62.50  26.32  37.04   50.00  15.79  24.00   62.50  26.32  37.04   77.78  36.84  50.00
*RNR*   66.67  52.94  59.01   76.00  55.88  64.41   73.08  55.88  63.33   76.00  55.88  64.41   87.50  61.76  72.41
all     70.87  44.61  54.76   71.21  49.13  58.15   71.29  49.18  58.21   71.70  49.08  58.27   76.54  56.64  65.10

Table 5.6: EC results comparing the use of no SRL features against different SRL system outputs

Still, we found examples where SP clearly influenced the output: in figure 5.3, the verb 参加/participate in a relative clause construction is missing both a local (within the IP) subject and a local object, making it syntactically impossible to disambiguate 世乒赛/world pingpong match between Arg0 and Arg1. Because 世乒赛/world pingpong match is shorthand for 世界/world 乒乓/pingpong 赛/match, it is a low frequency word that did not appear in the training corpus. But in Gigaword, it may have appeared as the syntactic object of 参加/participate and been learned by the topic model (table 4.1 on page 50 shows some of the other words in the same topic). The correct labeling of Arg1 as the head of the relative clause led to the correct *T* insertion as the object of 参加/participate, which in turn led to the correct insertion of *pro* as the subject of 参加/participate.

The above example already gives us a clue as to why *pro* performance may be so low. To get a better idea, we looked at the label predictions of our EC system using the full-featured SRL input (table 5.7). *pro* performance suffers on 2 fronts: with a *pro*, it is both more difficult to detect whether there is an EC instance at all and to distinguish it from the *PRO* or *T* types. For the other EC types, the difficulty lies mostly with detection.

5.4 Discussion

We presented a set of 8 heuristics for mapping Chinese dropped pronouns to English text. Using our semantic mapping system, the heuristic based system handily outperformed the baseline system using only word alignment (4-8 F points on the subset of *pro* instances to which the heuristics can be applied and 1-6 F points over all *pro* instances).

Figure 5.3: Effects of SRL: correct Arg1 labeling led the system to insert *T* as the object of participate. This in turn disambiguated the missing EC subject type of participate as *pro*.

gold type   *pro*   *PRO*   *OP*   *T*   *    *RNR*   !EC
*pro*       69      34      0      16    0    0       196
*PRO*       7       156     0      15    0    0       127
*OP*        1       0       344    0     0    0       235
*T*         4       7       0      309   0    0       265
*           0       1       0      0     5    0       13
*RNR*       0       0       0      0     0    19      15

Table 5.7: Empty category confusion matrix for full-featured SRL system. Each row represents the gold EC type and the prediction counts of the automatic system. 82 heuristics can be applied and 1-6 F points over all *pro* instances). On the other hand, using these mapping heuristics, our system could not take advantage of English parallel corpora to significantly improve Chinese EC recovery performance. As both annotated and unannotated monolingual data will likely be much more readily available, a parallel approach would need to perform much better to make it worthwhile over a monolingual approach.

For monolingual Chinese EC recovery, we presented a system that uses SRL output as features. When paired with our enhanced Chinese SRL system, we achieved a new state-of-the-art 58.27 labeled F score. While PropBank semantic knowledge (which can be enhanced/disambiguated using parallel corpora) may be essential for Chinese EC recovery, our experiments also showed that it is not sufficient by itself: even with gold standard SRL input, our Chinese EC system only achieved a 65.10 F score. To significantly improve Chinese EC recovery would likely require the following approaches: 1) joint inference/learning between syntactic parsing and semantic role labeling so that semantic knowledge can be fully utilized, since certain parse errors (such as the wrong POS tag for verbs), once introduced, are very hard for the SRL system or the EC recovery system to correct; 2) improving EC classification, possibly by leveraging unannotated data, as we have done with topic model based selectional preference for SRL. We would like to explore these areas in the future.

Chapter 6

Summary

This thesis introduces a word alignment based approach for mapping PropBank predicate-argument structures across languages in order to leverage cross-lingual semantic similarities as well as annotations from resource-rich languages for different NLP tasks. We successfully demonstrated that, using our system, one can produce quality mappings of PropBank annotations (with gold or automatic system outputs) between Chinese and English, despite the presence of errors in automatic word alignment (87.31 and 83.87 predicate mapping F scores on Xinhua News, using gold and automatic SRL respectively). We also demonstrated that the mapping performance can be further enhanced by modeling the predicate-to-predicate and argument-to-argument alignment probabilities and optimizing the results with an EM based approach, resulting in a > 1 F point improvement over an already high baseline. This technique was even able to improve mapping performance with gold standard SRL input, using a probability model generated from automatic SRL output. Moreover, using the mapping system, we demonstrated the viability of inducing Chinese verb classes from English lexical resources.

We have also made contributions in areas complementary and secondary to our main thesis focus, namely building enhanced Chinese and English semantic role labeling systems, a Chinese dropped pronoun to English text mapping system, and a Chinese empty category recovery system.

Achieving good performance using our semantic mapping approach depends on having good SRL input. Toward that end, we described techniques like support verb identification, 2-stage classification, and topic model based selectional preference features, which together improved the Chinese SRL performance on argument labeling of verb predicates by 0.94 F point and argument labeling of nominal predicates by 3.2 F points, achieving a new state-of-the-art performance of a 76.40 F score on arguments of verb predicates and a 72.74 F score overall. When the same techniques are applied to English SRL, our system may have achieved one of the highest constituent-based SRL F-scores (80.16) based on a single input parse tree. Notably, we established topic model based selectional preference as a viable technique for improving PropBank SRL performance by achieving gains comparable to distributional similarity based selectional preference enhancements on verb predicates (which required an ensemble of many similarity models, some requiring lexical resources only available for English) and a more impressive 2.34 F point gain for arguments of Chinese nominal predicates.

To demonstrate how we may use semantic mapping for NLP tasks, we studied Chinese empty category recovery. By applying heuristics for mapping Chinese dropped pronouns to English text using our semantic mapping system, we improved on the word alignment based approach by 4-8 F points on the subset of *pro* instances to which the heuristics can be applied and 1-6 F points over all *pro* instances. While our approach for using parallel English to improve Chinese EC recovery did not improve over a monolingual approach, due to the availability of larger monolingual annotations, we did demonstrate that semantic knowledge can help Chinese EC recovery by producing a new state-of-the-art (58.27 labeled F score) system using our enhanced Chinese SRL system as input. Specifically, *pro* recovery performance increased from a 21.4 to a 31.58 F score over the previous state-of-the-art system, an over 10 F point gain. At the same time, we demonstrated the limitation of Chinese EC recovery as a post-parsing task: even when using gold standard SRL, we were only able to improve the results to a 40.26 F score on *pro*s and a 65.10 F score overall.

By demonstrating the effectiveness of our systems and making all parts freely available1, we hope to encourage extensions and other applications of leveraging semantic similarity in parallel corpora. In particular, we have started looking at applying our semantic mapping systems to other language pairs, like Arabic to English and Italian to English.

1 code.google.com/p/clearsrl

Because the systems we presented in the thesis touch on a number of NLP areas, there is a lot of room for improvement as well. For our probability based enhancement to semantic mapping, we can take advantage of existing verb class resources in one language (like VerbNet for English) and map them to the other language, as well as induce corpus based classes for both languages, so that the probability model can be predicated on verb classes/clusters, alleviating the issue of modeling predicate-to-predicate probability from sparse frequency counts (currently, we resort to smoothing with a series of back-off probability models). As our semantic mapping model does not currently suggest SRL argument label alternatives, even if the likelihood of a pair of aligned argument labels is low, argument label errors from automatic SRL systems are propagated through the mapping system. Zhuang and Zong [100] have already demonstrated that it is possible to improve both Chinese and English SRL performance on parallel text. Although this may be of limited use by itself, we can potentially extract a better predicate-argument alignment probability model between the 2 languages and use that to generate a parallel corpora constrained monolingual selectional preference model, thereby improving monolingual SRL performance. We wish to study Zhuang and Zong's approach, as well as explore a joint-inference/joint-learning framework that includes semantic mapping, SRL, and word alignment, so that we can both produce a more accurate semantic mapping probability model and improve the respective SRL outputs in a parallel environment.

For monolingual SRL, we would like to explore alternative approaches to the current implementation of distributional similarity based selectional preference by integrating the selectional preference features directly in the SRL system (instead of only using the SP model's predicted label output). For topic model based selectional preferences, we would like to explore alternative topic models like LinkLDA, ROOTH-LDA, and LEX-LDA, as well as supervised LDA topic model approaches. While an initial attempt at bilingual SP was not able to outperform monolingual SP, with our probability model based semantic mapping improvements (and other potential future enhancements), we believe such an approach still holds great potential and would like to explore it in greater depth.

For Chinese EC recovery, we would like to explore joint inference/learning between syntactic parsing and semantic role labeling so that semantic knowledge can be fully utilized for EC recovery, as currently, certain wrong decisions made in syntactic parsing are very difficult for the EC system to overcome. For the actual EC classification, we would like to explore the technique of leveraging unannotated data, perhaps even jointly with English, as we have attempted with topic model based selectional preference for SRL. Lastly, we would like to study the impact of EC recovery on semantic mapping, as well as on downstream applications such as coreference resolution, machine translation, etc.

Bibliography

[1] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998.

[2] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

[3] Elizabeth Baran and Nianwen Xue. Singular or plural? exploiting parallel corpora for chinese number prediction. In Proceedings of Machine Translation Summit XIII, 2011.

[4] Elizabeth Baran, Yaqin Yang, and Nianwen Xue. Annotating dropped pronouns in chinese newswire text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 2012.

[5] Daniel Bauer and Owen Rambow. Increasing coverage of syntactic subcategorization patterns in framenet using verbnet. In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing, ICSC ’11, pages 181–184, 2011.

[6] E. Black, S. Abney, S. Flickenger, C. Gdaniec, C. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. Procedure for quantitatively comparing the syntactic coverage of english grammars. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, pages 306–311, Stroudsburg, PA, USA, 1991. Association for Computational Linguistics.

[7] Susan Windisch Brown, Dmitriy Dligach, and Martha Palmer. Verbnet class assignment as a wsd task. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS ’11, pages 85–94, 2011.

[8] Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of semantic distance. Computational Linguistics, 32(1):13–47, 2006.

[9] David Burkett and Dan Klein. Two languages are better than one (for syntactic parsing). In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 877–886, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. 88

[10] Shu Cai, David Chiang, and Yoav Goldberg. Language-independent parsing with empty elements. In Proceedings of ACL-HLT, 2011.

[11] Richard Campbell. Using linguistic principles to recover empty categories. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 645, Morristown, NJ, USA, 2004. Association for Computational Linguistics.

[12] Marine Carpuat, Yuval Marton, and Nizar Habash. Improving arabic-to-english statistical machine translation by reordering post-verbal subjects for alignment. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 178–183, 2010.

[13] Marine Carpuat and Dekai Wu. Improving statistical machine translation using word sense disambiguation. In The 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 61–72, 2007.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[15] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL ’96, pages 310–318, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.

[16] Jinho D. Choi. Optimization of Natural Language Processing Components for Robustness and Scalability. PhD thesis, University of Colorado Boulder, 2012.

[17] Jinho D. Choi, Martha Palmer, and Nianwen Xue. Using parallel propbanks to enhance word-alignments. In Proceedings of ACL-IJCNLP workshop on Linguistic Annotation (LAW’09), pages 121–124, 2009.

[18] Tagyoung Chung and Daniel Gildea. Effects of empty categories on machine translation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP’10), 2010.

[19] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November 2011.

[20] Dipanjan Das and Noah A. Smith. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of the Joint Conference of the 49th Annual Meeting of the ACL, 2011.

[21] Steve DeNeefe and Kevin Knight. Synchronous tree adjoining machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP’09), volume 2, pages 727–736, 2009.

[22] John Denero. Tailoring word alignments to syntactic machine translation. In In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007), pages 17–24, 2007. 89

[23] Péter Dienes and Amit Dubey. Antecedent recovery: experiments with a trace tagger. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 33–40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[24] Péter Dienes and Amit Dubey. Deep syntactic processing by combining shallow methods. In ACL ’03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 431–438, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[25] Zhendong Dong and Qiang Dong. Hownet-a hybrid language and knowledge resource. In Proceedings of Int’l Conf. Natural Language Processing and Knowledge Engineering, pages 820–824, 2003.

[26] Zhendong Dong, Qiang Dong, and Changling Hao. Hownet and its computation of meaning. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, COLING ’10, pages 53–56, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[27] Bonnie J. Dorr, Gina-Anne Levow, and Dekang Lin. Large-scale construction of a chinese-english semantic hierarchy, 2000.

[28] Katrin Erk. A simple, similarity-based model for selectional preferences. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 216–223. Association for Computational Linguistics, 2007.

[29] Katrin Erk, Sebastian Padó, and Ulrike Padó. A flexible, corpus-driven model of regular and inverse selectional preferences. Comput. Linguist., 36(4):723–763, December 2010.

[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[31] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[32] C. Fillmore, C. Johnson, and M. Petruck. Background to framenet. International Journal of Lexicography, pages 235–250, 2003.

[33] Pascale Fung, Zhaojun Wu, Yongsheng Yang, and Dekai Wu. Learning bilingual semantic frames: Shallow semantic parsing vs. semantic role projection. In 11th Conference on Theoretical and Methodological Issues in Machine Translation, pages 75–84, 2007.

[34] Pascale Fung, Zhaojun Wu, Yongsheng Yang, and Dekai Wu. Automatic learning of chinese-english semantic structure mapping. In Proceedings of IEEE/ACL 2006 Workshop on Spoken Language Technology (SLT 2006), 2006.

[35] Ryan Gabbard, Mitchell Marcus, and Seth Kulick. Fully parsing the penn treebank. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 184–191, Morristown, NJ, USA, 2006. Association for Computational Linguistics.

[36] William A. Gale. Good-turing smoothing without tears. Journal of Quantitative Linguistics, 2, 1995. 90

[37] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288, 2002.

[38] Ana-Maria Giuglea and Alessandro Moschitti. Semantic role labeling via framenet, verbnet and propbank. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 2006.

[39] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. Ontonotes: The 90% solution. In Proceedings of HLT-NAACL 2006, pages 57–60, 2006.

[40] Sean Paul Igo. Identifying reduced passive voice constructions in shallow parsing environments. Master’s thesis, University of Utah, 2007.

[41] Richard Johansson and Pierre Nugues. Dependency-based syntactic-semantic analysis with propbank and nombank. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL ’08, pages 183–187, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[42] Mark Johnson. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 136–143, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

[43] Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. Extending verbnet with novel verb classes. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06), 2006.

[44] Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. A large-scale classification of english verbs. Journal of Language Resources and Evaluation, pages 21–40, 2008.

[45] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’07), demonstration session, pages 177–180, 2007.

[46] P. Koomen, V. Punyakanok, D. Roth, and W. Yih. Generalized inference with multiple semantic role labeling systems (shared task paper). In Ido Dagan and Dan Gildea, editors, CoNLL, pages 181–184, 2005.

[47] Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[48] Simon Lacoste-Julien, Ben Taskar, Dan Klein, and Michael I. Jordan. Word alignment via quadratic assignment. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 112–119, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[49] Beth Levin. English verb classes and alternations: a preliminary investigation. University Of Chicago Press, 1993.

[50] Roger Levy and Christopher D. Manning. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 327, Morristown, NJ, USA, 2004. Association for Computational Linguistics.

[51] Peng Li, Maosong Sun, and Ping Xue. Fast-champollion: a fast and robust sentence alignment algorithm. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 710–718, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[52] Dekang Lin. Automatic retrieval and clustering of similar words. In Proceedings of ACL 1998, ACL ’98, pages 768–774, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics.

[53] Feifan Liu, Fei Liu, and Yang Liu. Learning from chinese-english parallel data for chinese tense prediction. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP'11), 2011.

[54] Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2(3):26:1–26:18, May 2011.

[55] Chi-kiu Lo and Dekai Wu. Meant: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), 2011.

[56] Nitin Madnani, Philip Resnik, Bonnie Dorr, and Richard Schwartz. Applying automatically generated semantic knowledge: A case study in machine translation. In NSF Symposium on Semantic Knowledge Discovery, Organization and Use, 2008.

[57] Nitin Madnani, Philip Resnik, Bonnie Dorr, and Richard Schwartz. Are multiple reference translations necessary? investigating the value of paraphrased reference translations in parameter optimization. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas (AMTA'08), 2008.

[58] David Mareček. Improving word alignment using alignment of deep structures. In Proceedings of the 12th International Conference on Text, Speech and Dialogue, pages 56–63, 2009.

[59] David Mareček. Using tectogrammatical alignment in phrase-based machine translation. In Proceedings of WDS 2009 Contributed Papers, pages 22–27, 2009.

[60] Paola Merlo and Lonneke Van Der Plas. Abstraction and generalisation in semantic role labels: Propbank, verbnet or both? In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, ACL ’09, 2009.

[61] Alessandro Moschitti, Daniele Pighin, and Roberto Basili. Tree kernels for semantic role labeling. Comput. Linguist., 34(2):193–224, June 2008.

[62] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.

[63] Sebastian Padó and Mirella Lapata. Cross-linguistic projection of role-semantic information. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 859–866, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[64] Sebastian Padó and Mirella Lapata. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 1161–1168, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[65] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, 2005.

[66] Martha Palmer and Zhibiao Wu. Verb semantics for english-chinese translation. Machine Translation, pages 59–92, 1995.

[67] Patrick Pantel and Dekang Lin. An unsupervised approach to prepositional phrase attachment using contextually similar words. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 101–108, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.

[68] Slav Petrov and Dan Klein. Improved inference for unlexicalized parsing. In HLT-NAACL '07, 2007.

[69] Sameer S. Pradhan, Wayne Ward, and James H. Martin. Towards robust semantic role labeling. Comput. Linguist., 34(2):289–310, June 2008.

[70] Philip Resnik. Selectional preference and sense disambiguation. In Proceedings of ACL SIGLEX Workshop on Tagging Text with Lexical Semantics, pages 52–57, Washington, D.C., 1997. ACL.

[71] Philip Resnik. Exploiting hidden meanings: Using bilingual text for monolingual annotation. In Alexander Gelbukh, editor, Lecture Notes in Computer Science 2945: Computational Linguistics and Intelligent Text Processing, pages 283–299. Springer, 2004.

[72] Alan Ritter and Oren Etzioni. A latent dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), 2010.

[73] Helmut Schmid. Trace prediction and recovery with unlexicalized pcfgs and slash features. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 177–184, Morristown, NJ, USA, 2006. Association for Computational Linguistics.

[74] Karin Kipper Schuler. Verbnet: a broad-coverage, comprehensive verb lexicon. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 2005. AAI3179808.

[75] Diarmuid Ó Séaghdha and Anna Korhonen. Probabilistic distributional semantics with latent variable models. Computational Linguistics, 40(3):587–631, September 2014.

[76] Lei Shi and Rada Mihalcea. Putting pieces together: Combining framenet, verbnet and wordnet for robust semantic parsing. In CICLing, pages 100–111, 2005.

[77] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed membership models of scientific publications. In Proceedings of the National Academy of Sciences, 2004.

[78] Weiwei Sun. Improving chinese semantic role labeling with rich syntactic features. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 168–172, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[79] Mihai Surdeanu, Lluís Màrquez, Xavier Carreras, and Pere R. Comas. Combination strategies for semantic role labeling. J. Artif. Int. Res., 29(1):105–151, 2007.

[80] Mihai Surdeanu and Jordi Turmo. Semantic role labeling using complete syntactic analysis. In Proceedings of CoNLL-2005 shared task, pages 221–224, 2005.

[81] Mary Swift. Towards automatic verb acquisition from verbnet for spoken dialog processing. In Proceedings of Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, pages 115–120, 2005.

[82] Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. A global joint model for semantic role labeling. Comput. Linguist., 34(2):161–191, June 2008.

[83] Dekai Wu and Pascale Fung. Can semantic role labeling improve smt? In Proceedings of the 13th Annual Conference of the EAMT, pages 218–225, Barcelona, Spain, 2009.

[84] Dekai Wu and Pascale Fung. Semantic roles for smt: A hybrid two-pass model. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT’09), pages 13–16, 2009.

[85] Shumin Wu, Jinho D. Choi, and Martha Palmer. Detecting cross-lingual semantic similarity using parallel propbanks. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas, 2010.

[86] Shumin Wu and Martha Palmer. Semantic mapping using automatic word alignment and semantic role labeling. In Proceedings of ACL-HLT workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-5), 2011.

[87] Shumin Wu and Martha Palmer. Improving chinese-english propbank alignment. In Proceedings of NAACL-HLT workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-9), 2015.

[88] Zhaojun Wu, Yongsheng Yang, and Pascale Fung. C-assert: Chinese shallow semantic parser. http://hlt030.cse.ust.hk/research/c-assert/, 2006.

[89] Nianwen Xue. Labeling chinese predicates with semantic roles. Computational Linguistics, 34(2):225–255, 2008.

[90] Nianwen Xue and Martha Palmer. Calibrating features for semantic role labeling. In Proceedings of EMNLP 2004, pages 88–94, 2004.

[91] Nianwen Xue and Martha Palmer. Automatic semantic role labeling for chinese verbs. In Proceedings of the 19th international joint conference on Artificial intelligence, IJCAI’05, pages 1160–1165, San Francisco, CA, USA, 2005. Morgan Kaufmann Publishers Inc.

[92] Nianwen Xue and Martha Palmer. Adding semantic roles to the chinese treebank. Nat. Lang. Eng., 15(1):143–172, 2009.

[93] Nianwen Xue and Yaqin Yang. Dependency-based empty category detection via phrase structure trees. In Proceedings of HLT-NAACL 2013, pages 1051–1060. The Association for Computational Linguistics, 2013.

[94] Yaqin Yang and Nianwen Xue. Chasing the ghost: recovering empty categories in the chinese treebank. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 1382–1390, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[95] Alexander Yeh. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, COLING ’00, pages 947–953, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.

[96] Annie Zaenen, Daniel Bobrow, and Cleo Condoravdi. The encoding of lexical implications in verbnet: predicates of change of locations. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008.

[97] Beñat Zapirain, Eneko Agirre, Lluís Màrquez, and Mihai Surdeanu. Improving semantic role classification with selectional preferences. In HLT-NAACL 2010, HLT '10, pages 373–376, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[98] Beñat Zapirain, Eneko Agirre, Lluís Màrquez, and Mihai Surdeanu. Selectional preferences for semantic role classification. Computational Linguistics, pages 631–663, 2013.

[99] Min Zhang, Hongfei Jiang, Ai Ti Aw, Jun Sun, Sheng Li, and Chew Lim Tan. A tree-to-tree alignment-based model for statistical machine translation. In Machine Translation Summit XI, 2007.

[100] Tao Zhuang and Chengqing Zong. Joint inference for bilingual semantic role labeling. In Proceedings of EMNLP 2010, pages 304–314, Cambridge, MA, October 2010. Association for Computational Linguistics.

Appendix A

SRL Comparison

Table A.1 compares automatic semantic analysis systems along five dimensions: representation, description, annotation coverage, techniques/implementation, and performance. The systems fall into four broad groups, summarized below.

PropBank- and FrameNet-based SRL systems. PropBank represents verbal (and nominal and adjectival) predicates and their core arguments as numbered arguments with some generalization across predicates (Arg0 agent, Arg1 patient or theme), plus adjunct arguments, typically tied to the phrase structure; annotation coverage spans English (1.5M words from OntoNotes 5.0, with GALE and BOLT adding more Treebank words), Chinese (900K words), and Arabic (300K words). The dependency-based PropBank variant represents arguments by their head words rather than constituents.

• SemLink (Colorado): maps PropBank to VerbNet, FrameNet, WordNet, and the OntoNotes sense groupings over 1M words (WSJ), aiming to use PropBank SRL as a platform for producing VerbNet roles or FrameNet frame elements; based on ClearSRL and ClearNLP. http://verbs.colorado.edu/semlink/

• SemLink+ (Colorado): the same aim with improved SRL coverage of nominalizations and light verb constructions, expanded VerbNet coverage, and added distributional information; also based on ClearSRL and ClearNLP. http://www1.icsi.berkeley.edu/~miriamp/fillmore-tribute/slides/Martha_Palmer.pdf

• Semafor (CMU, Noah Smith): FrameNet frames, schematic representations of a situation (~1,200 frames, 11,500 lexical units; core and non-core frame elements apply to nouns, verbs, adjectives, and some prepositions; full-text and lexical-sample annotation). Syntactic parse input; separate log-linear models for frame identification and argument identification; beam search or alternating directions dual decomposition over the argument set to ensure the output adheres to structural constraints; trained only on the full-text annotation. 46.49 F1 (exact argument match), 50.24 F1 (partial argument match); Collin Baker attributes these scores to FrameNet's divergence from syntax, sparse data, and the frame hierarchy not yet being exploited. https://framenet.icsi.berkeley.edu/fndrupal/

• ASSERT (Colorado): PropBank 1.0. Multiple constituent parses and NP chunk input; one-vs-all multi-class kernel SVM for BIO-style argument identification/labeling; iteratively adds correct versions of mis-classified random training examples to a seed training set to reduce total training data size. http://cemantix.org/software/assert.html

• ClearSRL (Colorado): linear SVM (pairwise multi-class) model for argument identification/labeling; processes predicates in sequence from the root clause down to subordinate clauses; two-stage argument labeling; LDA topic-model-based structural selectional preference learning. https://code.google.com/p/clearsrl/

• Senna (NEC): optional phrase chunk and syntactic parse input; a fast, compact neural network model for BIO-style argument identification/labeling; semi-supervised learning using a language model collected on Wikipedia and Reuters news. http://ml.nec-labs.com/senna/

• SwiRL: multiple constituent parses/chunks input; linear models (via AdaBoost) for argument identification/labeling; structural inference with constraint satisfaction and argument candidate scoring. http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00145

• Lund (Johansson & Nugues): dependency parse input (projective, via pseudo-projective transformation); LIBLINEAR logistic regression pipeline of parser and SRL with reranking over the top 16 trees; a global parse and SRL model trained with an online passive-aggressive algorithm. http://www.aclweb.org/anthology/D08-1008

• ClearNLP SRL (Choi): dependency parse input; linear SVM model for argument identification/labeling, with argument pruning heuristics to improve decoding performance. http://www.clearnlp.com/

• Zapirain et al.: selectional preference features, which improved CoNLL 2005 performance by 0.4 F1.

Reported performance for these PropBank systems falls roughly between 75 and 82 F1 on the WSJ CoNLL 2005/2009 evaluations, 76 and 78 F1 on WSJ+Brown, 72.74 F1 on Chinese (CoNLL 2005 setting), and 81.48 F1 on OntoNotes.

UIUC extended SRL systems:

• UIUC Verb SRL (Punyakanok et al.): standard PropBank. Multiple constituent parses/chunks input; linear models for argument identification/labeling; a Constrained Conditional Model for global inference using integer linear programming, augmented with WSD; 79.44 F1 WSJ (CoNLL 2005), 77.6 F1 with gold parse trees. http://cogcomp.cs.illinois.edu/papers/SrikumarRo11.pdf

• UIUC Prep and Verb SRL (Srikumar et al.): augments PropBank with 22 preposition relations (based on the SemEval-2007 task). Linear models for predicate argument and preposition relation classification; a Constrained Conditional Model for joint SRL and preposition role inference (an integer linear program over the joint output, which could produce conflicts with FrameNet); joint inference improved SRL by 0.85 F1 and preposition roles by 0.57 F1. http://cogcomp.cs.illinois.edu/page/resource_view/12

• Temporal systems: a statistically-driven, syntax-based Sentence Transformation Rule (STR) learner (92 F1 on five types of temporal connectives; http://cogcomp.cs.illinois.edu/demo/tempdemo/?id=29), and a system based on Heideltime (Strötgen and Gertz, 2010) for time normalization and comparison of intervals (86 F1 in TempEval-2; http://cogcomp.cs.illinois.edu/page/resource_view/29).

• Event timelines (Do et al.): 20 newswire articles from the ACE2005 dataset annotated with temporal links between events and times (324 event mentions, 232 time intervals, 324 E-T links, 5,940 E-E links). Two local classifiers (one for event-time and one for event-event links) with joint inference to enforce global constraints plus event coreference. http://cogcomp.cs.illinois.edu/page/resource_view/23

• Comma relations: five comma types annotated on about 1K sentences from WSJ section 00; soft-margin SVM with L2 loss (using Learning-Based Java) with contextual and statistical (e.g., PMI) features. http://cogcomp.cs.illinois.edu/page/resource_view/26

• Light verb constructions (Tu & Roth): the BNC was mined for the six most frequent light verbs in 'verb + noun object' patterns, yielding 1,039 positive and 1,123 negative examples; soft-margin SVM with L2 loss (using LIBLINEAR) with word, chunk, and parser features.

Other annotations and parsers:

• Quantities (Roy & Roth): newswire sentences annotated with quantity mention boundaries and units; segmentation using a bank of classifiers (Punyakanok and Roth, 2001) combined with SRL and coreference. http://cogcomp.cs.illinois.edu/page/software_view/Quantifier

• Phrasal verbs (Tu & Roth): starting from the previous set of light verbs and explicitly filtering out true light verb mentions; annotation done via crowdsourcing.

• ClearNLP dependency parser (Choi): linear SVM; transition-based parsing algorithm that bootstraps parse information for the parser. http://www.clearnlp.com/

• MALT Parser: linear, kernel, and ensemble modes over a transition-based parsing algorithm.

• Stanford Parser: a PCFG constituent parser whose output is converted to Stanford dependencies, a syntactic representation of grammatical relationships (subject, object, modifier, etc.) of the whole sentence; available in many other languages. http://nlp.stanford.edu/software/stanford-dependencies.shtml

• Turbo Parser: a graph-based non-projective dependency parser; sibling/ancestry/valency constraints modeled with integer linear programming; an alternating directions dual decomposition algorithm for third-order dependency features. Labeled attachment scores for these parsers range from roughly 84 to 86 on OntoNotes and medical data. http://www.ark.cs.cmu.edu/TurboParser/

• Abstract Meaning Representation (AMR): a whole-sentence semantic graph (including named entity recognition, coreference, conjunctions, prepositions, etc.) that collapses function words so that the "content" word becomes the direct dependent of its head; 13K English sentences (AMR 1.0); 58 F1 Smatch score for Jeff Flanigan's JAMR parser; preliminary parsing results from the Prague workshop. http://amr.isi.edu/

• MT-based semantic parsing (Andreas et al.): statistical machine translation from sentences to logical forms, using GIZA++ word alignment with phrase-based (Moses) and syntax-based (SCFG) MT systems; on GeoQuery (880 sentences), 75.8 F1 full match (phrase MT) and 81.8 F1 full match (syntax MT). http://www.cl.cam.ac.uk/~sc609/pubs/acl13jacob.pdf

Semantic parsers mapping text to logical form:

• Zettlemoyer & Collins: Combinatory Categorial Grammar (CCG) semantic parsing with a context-dependent analysis; 4,600 sentences (ATIS corpus); partial re-implementation available in UW SPF. https://homes.cs.washington.edu/~lsz/papers/zc-acl09.pdf

• Artzi & Zettlemoyer: weighted linear CCG semantic parsing learned from weak supervision for navigational direction following in a situated environment (500 sets of instructions, 2,700 sentences, SAIL corpus); code available in UW SPF. http://yoavartzi.com/pub/az-tacl.2013.pdf, https://bitbucket.org/yoavartzi/spf

• Kwiatkowski et al.: CCG semantic parsing to an underspecified logical form, with a learned ontology-matching step onto Freebase; 917 Q&A pairs (FREE917), with accuracies between 62.0 and 72.1 reported on FREE917 across these systems. http://homes.cs.washington.edu/~lsz/papers/kcaz-emnlp13.pdf

• Berant et al.: trained on Freebase Q/A pairs with no logical form annotation; aligns text phrases to knowledge base predicates and uses bridging from neighboring predicates to generate additional KB predicates; FreeBase knowledge graph (41M entities, 595M edges); 35.7 on WebQuestions (5,810 Q&A pairs). http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013-talk.pdf

• ParaSempre (Berant & Liang): generates candidate logical forms with their natural language realizations and uses association- and vector-space-based paraphrase models, built from 18M question pairs from wikianswers.com, to choose the best; 39.9 on WebQuestions. http://cs.stanford.edu/~pliang/papers/paraphrasing-acl2014.pdf

• Turbo Semantic Parser: an extension of TurboParser to non-projective semantic dependency parsing (allowing co-parents), using alternating directions dual decomposition with second-order features; broad-coverage semantic dependency data from DeepBank (DM), PAS, and the Prague Czech-English Dependency Treebank (PCEDT); 88.08 F1 WSJ (DM), 90.93 F1 WSJ (PAS). http://www.ark.cs.cmu.edu/TurboParser/

Table A.1: SRL Comparisons
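Several of the systems above (ASSERT, ClearSRL, SwiRL, UIUC Verb SRL) share a two-stage constituent pipeline: a binary classifier first identifies which parse constituents are arguments of a predicate, and a multi-class classifier then assigns role labels to the survivors. The following is a minimal, hypothetical sketch of that shared pattern using scikit-learn linear SVMs; the feature names and toy data are illustrative assumptions and are not taken from any system in Table A.1.

# Minimal sketch of a two-stage SRL pipeline: argument identification
# followed by argument classification. Features and data are toy
# illustrations, not those of any system compared above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy candidate constituents for one predicate, as
# (features, is_argument, role) triples. Real systems extract features
# such as the tree path to the predicate, head word, voice, and position.
candidates = [
    ({"phrase": "NP", "position": "before", "head": "judge"}, 1, "ARG0"),
    ({"phrase": "NP", "position": "after", "head": "case"}, 1, "ARG1"),
    ({"phrase": "PP", "position": "after", "head": "in"}, 1, "ARGM-LOC"),
    ({"phrase": "DT", "position": "before", "head": "the"}, 0, None),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _, _ in candidates])

# Stage 1: binary argument identification over all constituents.
identifier = LinearSVC().fit(X, [is_arg for _, is_arg, _ in candidates])

# Stage 2: role classification, trained only on the true arguments.
arg_idx = [i for i, (_, is_arg, _) in enumerate(candidates) if is_arg]
labeler = LinearSVC().fit(X[arg_idx], [candidates[i][2] for i in arg_idx])

# Decoding: label every constituent the identifier accepts.
for i, (feats, _, _) in enumerate(candidates):
    if identifier.predict(X[i])[0] == 1:
        print(feats["head"], "->", labeler.predict(X[i])[0])

The full systems then add a global inference step on top of this local pipeline, for example the integer linear programming constraints over the joint label assignment described for the UIUC systems.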