A Study on Bilingually Informed Coreference Resolution

S. Krajciˇ (ed.): ITAT 2018 Proceedings, pp. 130–137 CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Michal Novák A Study on Bilingually Informed Coreference Resolution Michal Novák Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics Malostranské námestíˇ 25, CZ-11800 Prague 1 [email protected] Abstract: Coreference is a basic means to retain coher- big parallel data might be leveraged in a weakly supervised ence of a text that likely exists in every language. How- manner to boost the models trained in a monolingual way. ever, languages may differ in how a coreference relation The present work is concerned with corpus-based bilin- is manifested on the surface. A possible way how to mea- gually informed CR on Czech-English texts. Specifically, sure the extent and nature of such differences is to build it focuses on resolution of pronouns and zeros, as these are a coreference resolution system that operates on a parallel the coreferential expressions whose grammatical and func- corpus and extracts information from both language sides tional properties differ considerably across the languages. of the corpus. In this work, we build such a bilingually For instance, whereas in English most of non-living ob- informed coreference resolution system and apply it on jects are referred to with pronouns in neuter gender (e.g. Czech-English data. We compare its performance with “it”, “its”), genders are distributed more evenly in Czech. the system that learns only from a single language. Our Information on Czech genders thus may be useful to fil- results show that the cross-lingual approach outperforms ter out English candidates that are highly improbable to be the monolingual one. They also suggest that a system for coreferential with the pronoun. By comparison of its per- Czech can exploit the additional English information more formance with a monolingual approach and by thorough effectively than the other way round. The work concludes analysis of the results, our work aims at discovering the with a detailed analysis that tries to reveal the reasons be- extent and nature of such differences. hind these results. The paper is structured as follows. After mentioning related work (Section 2), we introduce a coreference resolver (Section 3), both its monolingual and cross-lingual 1 Introduction variants. Section 4 describes the dataset used in experiments in Section 5. Before we conclude, the results of Cross-lingual techniques are becoming still more and experiments are thoroughly analyzed (Section 6). more popular. Even though they do not circumvent the task of Coreference Resolution (CR), the research is mostly limited to cross-lingual projection. Other cross- 2 Related Work lingual techniques remain a largely unexplored area for this task. Building a bilingually informed CR system requires a par- One of the yet neglected cross-lingual techniques is allel corpus with at least the target-language side annotated called bilingually informed resolution. It is an approach, with coreference. Even these days very few such corpora in which decisions in a particular task are made based on exist, e.g. Prague Czech-English Dependency Treebank the information from bilingual parallel data. Parallel texts 2.0 Coref [14], ParCor 1.0 [9] and parts of OntoNotes 5.0 must be available when a method is trained, but also at [19]. test time, that is when a trained model is applied to new It is thus surprising that the peak of popularity for such data. In real-world scenarios, the availability of parallel approach was reached around ten years before these cor- data at test time requires the technique to apply a machine pora had been published. Harabagiu and Maiorano [10] translation service to acquire them (MT-based bilingually present an heuristics-based approach to CR. The set of informed resolution). heuristics is expanded by exploiting the transitivity prop- Nevertheless, for limited purposes it may pay off to use erty of coreferential chains in a bootstrapping fashion. human-translated parallel data instead (corpus-based bilin- Moreover, they expand the heuristics even more, follow- gually informed resolution). If it outperforms the mono- ing mention counterparts in translations of source English lingual approach, it may be used in building automatically texts to Romanian with coreference annotation. Mitkov annotated parallel corpora. Such corpora with more reli- and Barbu [13] adjust a rule-based pronoun coreference able annotation could be useful for corpus-driven theoret- resolution system to work on a parallel corpus. After pro- ical research.1 Furthermore, it can be also used for au- viding a linguistic comparison of English and French pro- tomatic processing. For instance, improved resolution on nouns and their behavior in discourse, the authors distill their findings into a set of cross-lingual rules to be inte- grated into the CR system. In evaluation, they observe im- 1In case a cross-lingual origin of the annotation does not matter. A study on bilingually informed coreference resolution 131 provements in resolution accuracy of up to 5 percentage The tectogrammatical layer is also the place, where points compared to the monolingual approach. coreference relations should be annotated. It is technically As for more recent works, the authors of [5] address represented as a link between two coreferential nodes:3 the the task of overt pronoun resolution in Chinese. Among anaphor (the referring expression) and the antecedent (the the others they propose an MT-based bilingually informed referred expression). approach. A model is built on Chinese coreference, ex- Each input text must be first automatically pre- ploiting Chinese features. These are augmented with En- processed up to this level of linguistic annotation. The CR glish features, extracted from the Chinese texts machine- system based on supervised machine learning then takes translated to English. It allows for taking advantage of advantage of the information available in the annotation. English nouns’ gender and number lists, which according to authors correspond to the distribution of genders and Pre-processing. The input text must undergo an analysis numbers over Chinese nouns. producing a tectogrammatical representation of its sen- Experiments of Novák and Žabokrtský [17], the first tences before coreference resolution is carried out. We ones using bilingually informed CR on Czech-English use pipelines for analysis of Czech and English available data, are most relevant to the present work. With the focus in the Treex framework [18]. The analysis starts with a on English personal pronouns only, their best cross-lingual rule-based tokenization, morphological analysis and part- configuration managed to outperform the monolingual CR of-speech tagging (e.g. [21] for Czech), dependency pars- by one F-score point. Taking advantage of a more devel- ing to surface trees (e.g. MST parser [12] for English) oped version of their CR system, we extend their work in and named entity recognition [22]. In addition, the NADA several directions. First, we explore the potential of such tool [3] is applied to help distinguish referential and non- approach for a wider range of English coreferential expres- referential occurrences of the English pronoun “it”. sions. Next, we perform experiments in the opposite direc- Tectogrammatical trees are created by a transformation tion, i.e. Czech CR informed by English. And finally, we from the surface trees. All function words are made hid- provide a very detailed analysis of the results unveiling the den, morpho-syntactic information is transferred and se- nature of the cross-lingual aid. mantic roles are assigned to tectogrammatical nodes [4]. On the tectogrammatical layer, certain types of ellipsis can be restored. The automatic pre-processing focuses only on 3 Coreference Resolution System restoring nodes that might be anaphoric. Such nodes are added by heuristics based on syntactic structures. The re- For coreference resolution we adopt a more developed ver- stored nodes include Czech zero subjects and both Czech sion of the resolver utilized in [17]. This new version and English zeros in non-finite clauses, e.g. zero relative builds on the monolingual Treex CR system [15], and aug- pronouns, unexpressed arguments in infinitives, past and ments it with the cross-lingual extension presented in [17]. present participles. The difference between the current system and the system in [17] lies mostly in that it can target a wider range Model design. Treex CR models coreference in a way to of expressions, it exploits a richer feature set and the pre- be easily optimized by supervised learning. Particularly, processing stage analyzing the text to the tectogrammati- we use logistic regression with stochastic gradient descend cal representation is of higher quality. Instead of listing all optimization implemented in the Vowpal Wabbit toolkit.4 the changes, we briefly introduce the monolingual (Sec- Design of the model employs multiple concepts that have tion 3.1) and the cross-lingual component (Section 3.2) of proved to be useful and simple at the same time. Treex CR from the scratch.2 Given an anaphor and a set of antecedent candidates, mention-ranking models [6] are trained to score all the candidates at once. On the one hand a mention-ranking 3.1 Monolingual Resolution model is able to capture competition between the candi- Treex CR operates on the tectogrammatical layer. It is dates, but on the other hand features describe solely the a layer of deep syntax based on the theory of Functional actual mentions, not the whole clusters built up to the mo- Generative Description [20]. The tectogrammatical repre- ment. Antecedent candidates for an anaphor (both positive sentation of a sentence is a dependency tree with rich lin- and negative) are selected from the context window of a guistic features consisting of the content words only. Fur- predefined size.

Load more