
Pronoun Prediction with Linguistic Features and Example Weighing

Michal Novák
Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, CZ-11800 Prague 1
[email protected]

Abstract

We present a system submitted to the WMT16 shared task in cross-lingual pronoun prediction, in particular, to the English-to-German and German-to-English sub-tasks. The system is based on a linear classifier making use of features both from the target language model and from linguistically analyzed source and target texts. Furthermore, we apply example weighing in classifier learning, which proved to be beneficial for recall in less frequent pronoun classes. Compared to other shared task participants, our best English-to-German system is able to rank just below the top performing submissions.

1 Introduction

Previous works concerning translation of pronouns1 have shown that unlike other words, pronouns require a special treatment. Context and target language influence pronoun translation much more profoundly than the translation of parts-of-speech carrying lexical information.

This paper presents a system for the WMT16 shared task of cross-lingual pronoun prediction (Guillou et al., 2016),2 a task that looks at the problem of pronoun translation in a more simplified way. Here, the objective is to predict a target language pronoun from a set of possible candidates, given the source text, the lemmatized and part-of-speech-tagged target text, and automatic alignment. We address specifically the sub-tasks of English-to-German and German-to-English pronoun prediction.

We take a machine learning approach to the problem and apply a linear classifier. Our approach combines features coming from the target language model with features extracted from the linguistically analyzed source and target texts. We also introduce training example weighing, which aims at improving the prediction accuracy of less populated target pronouns. All the source codes used to build the system are publicly available.3

According to the WMT16 pronoun prediction shared task results (Guillou et al., 2016), our best German-to-English system ranks in the middle of the pack, while our English-to-German systems seem to be the poorest. However, after the shared task submission deadline, we discovered an error in the post-processing of the classifier predictions on the evaluation set for the English-to-German direction. After correcting this error, our system reaches the 2nd best result for this language direction.

The paper is structured as follows. After introducing the related work in Section 7, we describe three preprocessing components of our system that enrich the input data with additional information in Section 2.

1 Summarized by Hardmeier (2014).
2 http://www.statmt.org/wmt16/pronoun-task.html
3 https://github.com/ufal/wmt16-pronouns
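To make the prediction setting concrete, the following toy sketch shows its shape: for each target-side placeholder, a classifier scores a fixed set of candidate classes and the highest-scoring one is chosen. The candidate set (a subset of German classes mentioned later in the paper) and the scores are illustrative examples, not the official task data or our actual model.

```python
# Toy illustration of the cross-lingual pronoun prediction task:
# given a placeholder in the lemmatized, POS-tagged target sentence,
# choose the target pronoun from a fixed candidate set.

# Illustrative subset of German candidate classes (cf. Section 6).
CANDIDATES = ["er", "sie", "es", "man", "OTHER"]

def predict(scores):
    """Pick the highest-scoring candidate; `scores` maps class -> score."""
    return max(scores, key=scores.get)

# A hypothetical classifier output for one placeholder.
scores = {"er": 0.1, "sie": 0.2, "es": 0.55, "man": 0.05, "OTHER": 0.1}
print(predict(scores))  # es
```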
Section 3 then presents the features extracted from the data, whereas Section 4 gives more details about the method used to train the model. In Section 5, all our system configurations submitted to the shared task are evaluated. Finally, we examine the effect of individual features and example weighing in Section 6 before we conclude in Section 8.

2 Preprocessing components

The preprocessing stage combines three components, each of them enriching the input data with additional information: a target language model, an automatic linguistic analysis of the source sentences, and a basic automatic analysis of the target sentences.

Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 602–608, Berlin, Germany, August 11-12, 2016. © 2016 Association for Computational Linguistics

2.1 Target language model

For language modeling, we employed the KenLM Language Model Toolkit (Heafield et al., 2013), an efficient implementation of large language models with modified Kneser-Ney smoothing (Kneser and Ney, 1995). Lemmatized 5-gram models for English and German have been supplied as a baseline system by the organizers of the shared task. An integral part of the baseline system is a wrapper script4 performing the necessary preprocessing before the actual probability estimation. For instance, it selects words which may possibly belong to the OTHER class5 and it enables setting a penalty for preferring an empty word.6 We only adjusted the wrapper script so that it fits into our processing pipeline, making no modifications to the estimation machinery.

2.2 Source language analysis

In the input data supplied by the task organizers, the source text is represented as plain tokenized sentences. We have processed the source texts with tools obtaining additional linguistic analysis. However, due to the different availability of these tools for English and German, the depth of the analysis differs. We describe both analysis pipelines separately in the following:

English. English source texts have been analyzed up to the level of deep syntax using the Treex framework (Popel and Žabokrtský, 2010), incorporating several external tools. The processing pipeline consists of part-of-speech tagging with the Morče tool (Spoustová et al., 2007), dependency parsing conducted by the MST parser (McDonald et al., 2005), semantic role labeling (Bojar et al., 2016), and coreference resolution obtained as a combination of Treex coreference modules and the BART 2 toolkit (Versley et al., 2008; Uryupina et al., 2012). Prior to the last step, all instances of the pronoun it are assigned a probability of being anaphoric by the NADA tool (Bergsma and Yarowsky, 2011).

German. We utilized the MATE tools7 (Björkelund et al., 2010) to perform part-of-speech tagging, morphological analysis (necessary to obtain grammatical categories such as gender or number), and transition-based dependency parsing (Bohnet and Nivre, 2012; Seeker and Kuhn, 2012).

2.3 Target language analysis

In the data supplied by the task organizers, the format of the target language sentences differs from the source language format. Not only are the target words to be predicted replaced by a placeholder, but all other tokens are also substituted with the corresponding lemmas and coarse-grained part-of-speech tags.

For this reason, we needed to simplify the analysis of the target texts. The parsers used for the source texts do not accept the tagset used by the organizers. There are two possible solutions to fix this disagreement: either running a part-of-speech tagger producing tags that agree with the tagset required by the parser, or obtaining suitable part-of-speech tags by a transformation of the original tagset. However, both options are prone to errors. In the former option, the tags produced in this way would definitely be of low quality, as only a lemmatized text is available. This would cause problems especially for German. The latter option brings another problem. The original tagsets (12 tags in both English and German) are more coarse-grained than the tagsets required by the parsers (44 and 53 tags in English and German, respectively), which makes the transformation in this direction difficult.

Due to these obstacles, we decided to abandon any additional linguistic processing except for the identification of noun genders. We consider gender and number information among the most valuable inputs for correct pronoun translation. While the number information is hard to reconstruct from a lemmatized text with part-of-speech tags having no indication of number, gender can be reconstructed from a noun lemma itself quite satisfactorily. In each of the languages, we approached the task of obtaining gender for a given noun in a different way.

4 https://bitbucket.org/yannick/discomt_baseline/src
5 The OTHER class comprises words, not necessarily pronouns, that appear often enough in the context typical for pronouns to be resolved, but not enough to form their own class. Furthermore, it can be an empty word if the source pronoun has no target language counterpart.
6 In all experiments, we used zero penalty.
7 https://code.google.com/archive/p/mate-tools/
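A minimal sketch of this lemma-based gender assignment with the neuter fallback might look as follows. The lookup table here is a hypothetical toy stand-in for the actual language-specific resources described below (the Bergsma and Lin (2006) data for English, MATE morphological analysis for German); it is an illustration, not our implementation.

```python
# Sketch of noun-gender assignment from lemmas (cf. Section 2.3):
# look up the most probable gender of a noun lemma, defaulting to
# neuter when the lemma is unknown.

# Hypothetical lemma -> most probable gender table (single-word items only).
GENDER_LIST = {
    "woman": "fem",
    "man": "masc",
    "table": "neut",
}

def noun_gender(lemma: str) -> str:
    """Return the most probable gender of a noun lemma.

    Unknown lemmas fall back to neuter, mirroring the default used
    for nouns with no gender information.
    """
    return GENDER_LIST.get(lemma.lower(), "neut")

print(noun_gender("Woman"))    # fem
print(noun_gender("giraffe"))  # neut (fallback)
```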

English. The gender information was obtained using the data collected by Bergsma and Lin (2006).8 They used paths in dependency trees to learn the likelihood of coreference between a pronoun and a noun candidate and then applied them in a bootstrapping fashion on larger data to obtain a noun gender and number distribution in different contexts. For the sake of simplicity, we filtered their list only to single-word items. If we encounter a token with a noun tag assigned in the target text, its lemma is looked up in the list and assigned the most probable gender, if any is found. Otherwise, the neuter gender is assumed.

German. We run the MATE morphological analysis separately for every lemma labeled as a noun. If no gender information is obtained, the noun is assigned the neuter gender.

3 Feature extraction

Having both the source and the target texts enriched with additional linguistic information, we extract a set of instances that are later fed into our classifier. An instance is extracted for every target-language pronoun (placeholder) to be classified, represented by features that can be divided into several categories:

Target language model features. Using the KenLM model with the wrapper supplied by the organizers, we obtain an estimated probability value for every candidate pronoun. From this, we produce features describing the actual probability values for each candidate word, quantized into 9 bins. Furthermore, features ranking the candidate words by their probabilities, quantized in three different ways, are extracted.

Source language features. The data supplied by the organizers also contain automatic word alignment between the source and the target sentences. Therefore, when extracting features for a given placeholder in the target language, we are able to do the same for its counterparts in the source language. The deeper linguistic analysis performed for the source language (see Section 2.2) allows us to extract richer features than for the target language. For every source counterpart of a target pronoun placeholder, we extract its lemma, syntactic dependency function, the lemma of its parent in the dependency tree, and combinations of the previous features. As the analysis of English goes deeper than the surface syntax, we include the semantic function of the source counterpart. If the counterpart is an instance of the pronoun it, we add the anaphoricity probability estimated by the NADA detector, quantized in the same way as the probabilities coming from the KenLM model.

Target language features. The lemma of a parent of the target pronoun placeholder might also be a valuable feature. Even though we have not performed a syntactic analysis on the target text (see Section 2.3), we are still able to approximate it in several ways. The easiest option is to list all verb lemmas that appear in a relatively small context surrounding the placeholder (1, 3, or 5 words). Another approach is to project the parent dependency relation from the source sentence via word alignment. We also extract the part-of-speech tags of the parents collected in this way, since they might not be verbs due to possible errors.

Antecedent features. The gender of an anaphoric pronoun is often determined by the gender of its antecedent. As with syntactic trees, we have no information on coreference in the target text. Again, we approximate it in two ways. We project the coreference link via word alignment and use the gender of the projected antecedent. Note that this approach can be used only in the English-to-German direction, due to the missing coreference resolution for German. To extract similar information also for the opposite direction, we take advantage of the fact that the task is defined for subject pronouns only. The tendency of consecutive subjects to refer to the same entity inspired us to include the gender of the previous target language subject as a feature. The indicator whether a word is a subject is again projected via alignment from the source text.

4 Model

Pronoun prediction as specified by the organizers is a classification task. We address it by machine learning, building a linear model using the multi-class variant of the logistic loss and stochastic gradient descent optimization as implemented in the Vowpal Wabbit toolkit.9 To train the model, we

run the learner over the training data with the features described in Section 3, possibly in multiple passes and with various rates of L1 or L2 regularization.

Optimization with respect to the logistic loss function is a widely used approximation of optimizing the accuracy measure. However, the official scoring metric set by the task organizers is the macro-averaged recall. Macro-averaging causes improvements in recall for less frequent target pronouns to have a stronger effect than improvements for more frequent pronouns. We address this issue by weighing the training data instances based on the target class. We weigh the classes in inverse proportion to how frequently they appear in the training data. The less frequent a pronoun is, the heavier the penalty incurred if it is misclassified.

8 http://www.clsp.jhu.edu/~sbergsma/Gender/Data/
9 Available at https://github.com/JohnLangford/vowpal_wabbit/wiki. Vowpal Wabbit has been chosen due to its fast throughput among all machine learning tools known to us, as well as due to the remarkable variety of options for learning, e.g., the example weighing used in our experiments. However, there are still options that are worth examining in future experiments, for instance using other loss functions, e.g., a hinge loss, which is equivalent to the SVM algorithm.

                                                          Dev               Eval
          Name              Setting                       MACRO-R   ACC     MACRO-R   ACC
          baseline          —                               34.35   42.81     38.53   50.13
EN-to-DE  CUNI-primary      weighted, passes: 5, L1: 3×10^-7    45.63   57.72    *54.37  *64.23
          CUNI-contrastive  unweighted, passes: 1, L1: 5×10^-6  42.54   63.51    *51.74  *71.80
          baseline          —                               36.08   50.47     42.15   53.42
DE-to-EN  CUNI-primary      weighted, passes: 1, L1: 0         56.47   68.35     60.42   64.18
          CUNI-contrastive  unweighted, passes: 5, L1: 0       51.62   70.59     56.83   65.22

Table 1: Our systems submitted to the shared task and their performance compared to the baseline system. The official measure of performance is macro-averaged recall (MACRO-R), while accuracy (ACC) serves as a contrastive measure. Scores labeled with the * symbol differ from the official results of the shared task (Guillou et al., 2016), as an error was discovered after the task submission deadline.

5 Submitted systems

We submitted four systems to the shared task – two systems to each of the two sub-tasks: English-to-German and German-to-English prediction. The systems trained on the weighted examples are considered primary, while the unweighted systems were submitted as contrastive.

Training examples have been extracted from all the data supplied for training by the organizers.10 The same holds for the data designated for development and evaluation testing.

The best combination of learning parameters has been selected by a grid search over various parameter settings on the development data. Table 1 specifies the learning parameters used for all submitted systems. It also shows the macro-averaged recall and accuracy measured on both the development and the evaluation set. Moreover, it compares the performance with the baseline system based on the KenLM target language model as supplied by the organizers (see Section 2.1).

Note that the scores of our English-to-German systems achieved on the evaluation set are much better than the scores presented in the official results of the shared task (Guillou et al., 2016). This is due to an error that concerned merging of the classifier output into the test data file for submission, which was, however, discovered only after the deadline for task submissions. According to the official results, our German-to-English primary system is ranked fourth among six participating primary systems. Our English-to-German primary system, ranked last among nine systems in the official results, would place second if we took the corrected scores.

10 http://data.statmt.org/wmt16/pronoun-task/

6 Feature ablation and weighing analysis

In order to assess the effect of individual feature types, we carried out an additional experiment. For both translation directions, we trained models on various subsets of the complete feature set. All the models have been trained in both the weighted and unweighted scenarios.

The experiments were conducted with the following feature sets:

• all: the complete feature set as described in Section 3
• -src: the complete feature set, excluding source language features

Figure 1: The impact of feature ablation on per-class recall (see Section 6 for details), macro-averaged recall (MACRO-R), and accuracy (ACC) in the four systems submitted to the shared task.
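The two aggregate measures reported in the figure, accuracy (which for this task equals micro-averaged recall) and the official macro-averaged recall, can be computed as in the following sketch. This is only an illustration of the metrics with toy labels, not the official task scorer.

```python
from collections import defaultdict

def accuracy(gold, pred):
    """Micro-averaged recall, i.e. overall accuracy for this task."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_recall(gold, pred):
    """Macro-averaged recall: per-class recall, averaged over classes.

    Every class contributes equally, so gains on rare pronoun classes
    move this score as much as gains on frequent ones.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += (g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: all frequent-class items right, the single rare item wrong.
gold = ["OTHER"] * 9 + ["er"]
pred = ["OTHER"] * 9 + ["OTHER"]
print(round(accuracy(gold, pred), 2))      # 0.9
print(round(macro_recall(gold, pred), 2))  # 0.5
```

The toy example shows why the two measures diverge: missing the single rare-class item barely affects accuracy but halves the macro-averaged recall, which is what motivates the instance weighting discussed above.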

• -kenlm: the complete feature set, excluding KenLM features
• -trg_ante: the complete feature set, excluding features approximating the gender of the antecedent of the target pronoun
• -trg_par: the complete feature set, excluding features approximating the parent of the target pronoun

Figure 1 shows the performance of the weighted and unweighted models for both translation directions when trained in all of the feature settings listed above. The performance is measured by recall on each of the target classes (solid color lines, whose widths illustrate the frequency of the class in the training data), as well as by micro-averaged recall, which equals the overall accuracy for this task (dotted line), and macro-averaged recall, which is the official measure in the shared task (dashed line).

The graphs show that the impact of individual feature categories on the macro-averaged recall is generally higher in the German-English direction and for weighted models. For instance, leaving out the most valuable category of source language features decreases the performance level by just 1 percentage point for the English-German unweighted model, while degrading the performance of the German-English model by 15 percentage points. The graphs also show that the KenLM features have the strongest effect on the final recall values for the individual pronoun classes. A positive effect of English coreference resolution on determining the correct gender of a German pronoun can also be observed. Adding antecedent features to the English-to-German weighted system causes a small recall increase for the pronoun sie with almost no degradation to other classes.

The impact of instance weighing turns out to be more interesting. Focusing on the scores for individual classes, one can observe that the pronouns that benefit from weighing the most are the less frequent ones, e.g., man and er in German, and there, among others, in English. On the other hand, the effect of weighing reduces performance in frequent classes, such as OTHER, German sie, and the most frequent English pronouns. The only exception is the German pronoun es, whose recall rises for weighted models even though it is one of the most frequent pronoun classes. Overall, instance weighing fulfills our expectations: although it causes a decrease in recall for frequent pronoun classes, it improves the official macro-averaged recall score.

7 Related work

A similar problem was addressed in the DiscoMT 2015 shared task on pronoun translation (Hardmeier et al., 2015) as a cross-lingual pronoun prediction subtask. It differed from the current task in one main aspect: the manually translated target text was available in its surface form as an input, i.e., it was neither machine-translated nor lemmatized and part-of-speech-tagged, as it is in the WMT16 shared task. This aspect, far from a real-world machine translation scenario, probably caused that none of the participants was able to beat the baseline, the target language model.

Out of the DiscoMT 2015 shared task submissions, the system by Wetzel et al. (2015) is the most similar to ours. On the source side (English in their case), they extract morphological information as well as coreference relations (they use Stanford CoreNLP (Lee et al., 2013), whereas we apply the BART 2 toolkit (Uryupina et al., 2012) for this task), and they detect the anaphoricity of the pronoun it using the NADA tool (Bergsma and Yarowsky, 2011). Another common feature is that both systems take advantage of the target language model. Wetzel et al. (2015)'s maximum entropy classifier, implemented in Mallet (McCallum, 2002), uses the same logistic loss function as we do with the Vowpal Wabbit tool, but the training data handling differs in these two tools. Mallet is a batch learner, optimizing over the whole data in a single step, while Vowpal Wabbit optimizes incrementally after every example.

On the other hand, unlike us, Wetzel et al. (2015) do not use any syntactic information. The only syntax-based system in the DiscoMT 2015 shared task is the system of Loáiciga (2015). They make use of the Fips rule-based parser (Wehrli, 2007), whereas we acquire dependencies and syntactic functions using the MST parser (McDonald et al., 2005) and the MATE tools (Seeker and Kuhn, 2012) on the source side for English and German, respectively.

8 Conclusion

We presented our system submitted to the WMT16 shared task on cross-lingual pronoun prediction. It is based on Vowpal Wabbit and uses features from three sources: first, the target language model (which served as the baseline in the shared task); second, the automatic linguistic analysis of the source text up to the levels of syntax and coreference; and third, a basic morphological analysis of the target text. Our systems were able to improve on the baseline in both language directions, with source language and target language model features having the largest impact on the results. Finally, we employed instance weighing, which proved to be a successful way to compensate for the differences between the learning loss function and the official evaluation measure, and to improve recall in infrequent pronoun classes.

Acknowledgments

This work has been supported by GAUK grant 338915 of Charles University, Czech Science Foundation grant GA-16-05394S, the 7th Framework Programme of the EU grant QTLeap (No. 610516), and SVV project number 260 333. This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project No. LM2015071 of the Ministry of Education, Youth and Sports of the Czech Republic.

References

Shane Bergsma and Dekang Lin. 2006. Bootstrapping Path-Based Pronoun Resolution. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Sydney, Australia. Association for Computational Linguistics.

Shane Bergsma and David Yarowsky. 2011. NADA: A Robust System for Non-referential Pronoun Detection. In Proceedings of the 8th International Conference on Anaphora Processing and Applications, pages 12–23, Berlin, Heidelberg. Springer-Verlag.

Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues. 2010. A High-performance Syntactic and Semantic Dependency Parser. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 33–36, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bernd Bohnet and Joakim Nivre. 2012. A Transition-based System for Joint Part-of-speech Tagging and Labeled Non-projective Dependency Parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

Liane Guillou, Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, Mauro Cettolo, Bonnie Webber, and Andrei Popescu-Belis. 2016. Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction. In Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany. Association for Computational Linguistics.

Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. 2015. Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 1–16, Lisbon, Portugal. Association for Computational Linguistics.

Christian Hardmeier. 2014. Discourse in Statistical Machine Translation. Ph.D. thesis, Uppsala University, Sweden.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria. Association for Computational Linguistics.

Reinhard Kneser and Hermann Ney. 1995. Improved Backing-off for M-gram Language Modeling. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, Los Alamitos, CA, USA. IEEE Computer Society Press.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic Coreference Resolution Based on Entity-centric, Precision-ranked Rules. Computational Linguistics, 39(4):885–916.

Sharid Loáiciga. 2015. Predicting Pronoun Translation Using Syntactic, Morphological and Contextual Features from Parallel Data. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 78–85, Lisbon, Portugal. Association for Computational Linguistics.

Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective Dependency Parsing Using Spanning Tree Algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Stroudsburg, PA, USA. Association for Computational Linguistics.

Martin Popel and Zdeněk Žabokrtský. 2010. TectoMT: Modular NLP Framework. In Proceedings of the 7th International Conference on Advances in Natural Language Processing, pages 293–304, Berlin, Heidelberg. Springer-Verlag.

Wolfgang Seeker and Jonas Kuhn. 2012. Making Ellipses Explicit in Dependency Conversion for a German Treebank. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey. European Language Resources Association (ELRA).

Drahomíra Spoustová, Jan Hajič, Jan Votrubec, Pavel Krbec, and Pavel Květoň. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-based Taggers for Czech. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling Technologies, pages 67–74, Stroudsburg, PA, USA. Association for Computational Linguistics.

Olga Uryupina, Alessandro Moschitti, and Massimo Poesio. 2012. BART Goes Multilingual: The UniTN/Essex Submission to the CoNLL-2012 Shared Task. In Joint Conference on EMNLP and CoNLL – Shared Task, pages 122–128, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yannick Versley, Simone Paolo Ponzetto, Massimo Poesio, Vladimir Eidelman, Alan Jern, Jason Smith, Xiaofeng Yang, and Alessandro Moschitti. 2008. BART: A Modular Toolkit for Coreference Resolution. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session, pages 9–12, Stroudsburg, PA, USA. Association for Computational Linguistics.

Eric Wehrli. 2007. Fips, a "Deep" Linguistic Multilingual Parser. In Proceedings of the Workshop on Deep Linguistic Processing, pages 120–127, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dominikus Wetzel, Adam Lopez, and Bonnie Webber. 2015. A Maximum Entropy Classifier for Cross-Lingual Pronoun Prediction. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 115–121, Lisbon, Portugal. Association for Computational Linguistics.
