Cognate Production using Character-based Machine Translation

Lisa Beinborn, Torsten Zesch and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for International Educational Research
www.ukp.tu-darmstadt.de

Abstract

Cognates are words in different languages that are associated with each other by language learners. Thus, cognates are important indicators for the prediction of the perceived difficulty of a text. We introduce a method for automatic cognate production using character-based machine translation. We show that our approach is able to learn production patterns from noisy training data and that it works for a wide range of language pairs. It even works across different alphabets, e.g. we obtain good results on the tested language pairs English-Russian, English-Greek, and English-Farsi. Our method performs significantly better than similarity measures used in previous work on cognates.

1 Introduction

In order to improve comprehension of a text in a foreign language, learners use all possible information to make sense of an unknown word. This includes context and domain knowledge, but also knowledge from the mother tongue or any other previously acquired language. Thus, a student is more likely to understand a word if there is a similar word in a language she already knows (Ringbom, 1992). For example, consider the following German sentence:

Die internationale Konferenz zu kritischen Infrastrukturen im Februar ist eine Top-Adresse für Journalisten.

Everybody who knows English might grasp the gist of the sentence with the help of associated words like Konferenz-conference or Februar-February. Such pairs of associated words are called cognates.

A strict definition only considers two words as cognates if they have the same etymological origin, i.e. they are genetic cognates (Crystal, 2011). Language learners usually lack the linguistic background to make this distinction and will use all similar words to facilitate comprehension regardless of the linguistic derivation. For example, the English strange has the Italian correspondent strano. The two words have different roots and are therefore genetically unrelated. However, for language learners the similarity is more evident than for example in the English-Italian genetic cognate father-padre. Therefore, we aim at identifying all words that are sufficiently similar to be associated by a language learner, no matter whether they are genetic cognates. As words which are borrowed from another language without any modification (such as cappuccino) can be easily identified by direct string comparison, we focus on word pairs that do not have identical spelling.

If the two associated words have the same or a closely related meaning, they are true cognates, while they are called false cognates or false friends in case they have a different meaning. On the one hand, true cognates are instrumental in constructing easily understandable foreign language examples, especially in early stages of language learning. On the other hand, false friends are known to be a source of errors and severe confusion for learners (Carroll, 1992) and need to be practiced more frequently. For these reasons, both types need to be considered when constructing teaching materials. However, existing lists of cognates are usually limited in size and only available for very few language pairs. In order to improve language learning support, we aim at automatically creating lists of related words between two languages, containing both true and false cognates.

International Joint Conference on Natural Language Processing, pages 883–891, Nagoya, Japan, 14-18 October 2013.

In order to construct such cognate lists, we need to decide whether a word in a source language has a cognate in a target language. If we already have candidate pairs, string similarity measures can be used to distinguish cognates and unrelated pairs (Montalvo et al., 2012; Sepúlveda Torres and Aluisio, 2011; Inkpen et al., 2005; Kondrak and Dorr, 2004). However, these measures do not take into account the regular production processes that can be found for most cognates, e.g. the English suffix ~tion becomes ~ción in Spanish, as in nation-nación or addition-adición. Thus, an alternative approach is to manually extract or learn production rules that reflect the regularities (Gomes and Pereira Lopes, 2011; Schulz et al., 2004).

All these methods are based on string alignment and thus cannot be directly applied to language pairs with different alphabets. A possible workaround would be to first transliterate foreign alphabets into the Latin alphabet, but unambiguous transliteration is only possible for some languages. Methods that rely on the phonetic similarity of words (Kondrak, 2000) require a phonetic transcription that is not always available. Thus, we propose a novel production approach using statistical character-based machine translation in order to directly produce cognates. We argue that this has the following advantages: (i) it captures complex patterns in the same way machine translation captures complex rephrasing of sentences, (ii) it performs better than similarity measures from previous work on cognates, and (iii) it also works for language pairs with different alphabets.

2 Character-Based Machine Translation

Our approach relies on statistical phrase-based machine translation (MT). As we are not interested in the translation of phrases, but in the transformation of character sequences from one language into the other, we use words instead of sentences and characters instead of words, as shown in Figure 1.

Figure 1: Character-based machine translation

In the example, the English character sequence cc is mapped to a single c in Spanish and the final e becomes ar. It is important to note that these mappings only apply in certain contexts. For example, accident becomes accidente with a double c in Spanish, and not every word-final e is changed into ar. In statistical MT, the training process generates a phrase table with transformation probabilities. This information is combined with language model probabilities, and a search algorithm selects the best combination of sequences. The transformation is thus not performed on isolated characters; it also considers the surrounding sequences and can account for context-dependent phenomena. The goal of the approach is to directly produce a cognate in the target language from an input word in another language. Consequently, in the remainder of the paper, we refer to our method as COP (COgnate Production).

Exploiting the orthographic similarity of cognates to improve the alignment of words has already been analyzed as a useful preparation for MT (Tiedemann, 2009; Koehn and Knight, 2002; Ribeiro et al., 2001). As explained above, we approach the phenomenon from the opposite direction and use statistical MT for cognate production.

Previous experiments with character-based MT have been performed for different purposes. Pennell and Liu (2011) expand text message abbreviations into proper English. In Stymne (2011), character-based MT is used for the identification of common spelling errors. Several other approaches apply MT algorithms for the transliteration of named entities to increase vocabulary coverage (Rama and Gali, 2009; Finch and Sumita, 2008). For transliteration, characters from one alphabet are mapped onto corresponding letters in another alphabet; cognates follow more complex production patterns. Nakov and Tiedemann (2012) aim at improving MT quality using cognates detected by character-based alignment. They focus on the language pair Macedonian-Bulgarian and use English as a bridge language. As they use cognate identification only as an intermediary step and do not provide evaluation results, we cannot directly compare with their work. To the best of our knowledge, we are the first to use statistical character-based MT for the goal of directly producing cognates.

3 Experimental Setup

Figure 2: Architecture of our Cognate Production (COP) approach

Figure 2 gives an overview of the COP architecture. We use the existing statistical MT engine Moses (Koehn et al., 2007). The main difference of character-based MT to standard MT is the limited lexicon. Our tokens are character n-grams instead of words, therefore we need much less training data. Additionally, distortion effects can be neglected, as reordering of n-grams is not a regular morphological process for cognates.1 Thus, we deal with less variation than standard MT.

Training  As training data, we use existing lists of cognates or lists of closely related words and perform some preprocessing steps. All duplicates, multiwords, conjugated forms, and all word pairs that are identical in source and target are removed. We lowercase the remaining words and introduce # as start symbol and $ as end symbol of a word. Then all characters are divided by blanks. Moses additionally requires a language model. We build an SRILM language model (Stolcke, 2002) from a list of words in the target language converted into the format described above. On the basis of the input data, the Moses training process builds up a phrase table consisting of character sequences. As a result of the training process, we receive a cognate model that can be used to produce cognates in the target language from a list of input test words.

Cognate Production  Using the learned cognate model, Moses returns a ranked n-best list containing the n most probable transformations of each input word.2 In order to eliminate non-words, we check the n-best list against a lexicon list of the target language. The filtered list then represents our set of produced cognates. Note that, as discussed in Section 1, the list will contain true and false cognates. The distinction can be performed using a bilingual dictionary (if available) or with statistical and semantic measures for the identification of false friends (Mitkov et al., 2008; Nakov et al., 2007). For language learning, we need both types of cognates, as foreign words also trigger wrong associations in learners (see Section 5.4).

Evaluation Metrics  In order to estimate the cognate production quality without having to rely on repeated human judgment, we evaluate COP against a list of known cognates. Existing cognate lists only contain pairs of true cognates, but a word might have several true cognates. For example, the Spanish word música has at least three English cognates: music, musical, and musician. Therefore, not even a perfect cognate production process will be able to always rank the right true cognate on the top position. In order to account for this issue, we evaluate the coverage using a relaxed metric that counts a positive match if the gold standard cognate is found in the n-best list of cognate productions. We determined n = 5 to provide a reasonable approximation of the overall coverage.

We additionally calculate the mean reciprocal rank (MRR) as

    MRR = (1/|C|) · Σ_{i=1}^{|C|} 1/rank_i

where C is the set of input words and rank_i is the rank of the correct cognate production. For example, if the target cognate is always ranked second-best, then the MRR would be 0.5.

Note that in our language learning scenario, we are also interested in words that might be associated with the foreign word by learners, but are actually not true cognates (e.g. the English word muse might also be mistakenly associated with música by language learners). Unfortunately, an evaluation of the false cognates produced by COP is not covered by these metrics and is thus left to a qualitative analysis as performed in Section 5.4.
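As a concrete illustration of the two evaluation metrics, the following minimal Python sketch computes coverage in the n-best list and MRR over hypothetical ranked production lists. This is not the authors' evaluation code; the function and variable names are our own.

```python
def coverage_at_n(nbest_lists, gold, n=5):
    """Fraction of input words whose gold cognate occurs in the top-n productions."""
    hits = sum(1 for word, gold_cog in gold.items()
               if gold_cog in nbest_lists.get(word, [])[:n])
    return hits / len(gold)

def mean_reciprocal_rank(nbest_lists, gold):
    """MRR = (1/|C|) * sum over inputs of 1/rank_i.

    An input whose gold cognate never appears contributes 0 to the sum."""
    total = 0.0
    for word, gold_cog in gold.items():
        productions = nbest_lists.get(word, [])
        if gold_cog in productions:
            total += 1.0 / (productions.index(gold_cog) + 1)
    return total / len(gold)

# Toy example: the gold cognate is always ranked second-best, so MRR = 0.5,
# matching the worked example in the text.
nbest = {"nation": ["nacion", "nación"], "addition": ["adicion", "adición"]}
gold = {"nation": "nación", "addition": "adición"}
print(coverage_at_n(nbest, gold, n=5))    # 1.0
print(mean_reciprocal_rank(nbest, gold))  # 0.5
```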

1 We use these parameters: -weight-l 1 -weight-d 0 -weight-w -1 -dl 0 -weight-t 0.2 0.2 0.2 0.2 0.2
2 BLEU (Papineni et al., 2002) is the common evaluation metric for MT, but would be misleading in our setting.
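The input preprocessing described in Section 3 (lowercasing, # and $ boundary symbols, blank-separated character n-grams) can be sketched as a small Python function; this is an illustrative reimplementation under our own naming, not the authors' pipeline, and it reproduces the unigram/bigram/trigram formats shown for the word banc.

```python
def to_char_ngrams(word, n=2):
    """Wrap a word in # (start) and $ (end) markers and emit overlapping
    character n-grams separated by blanks, as required for Moses input."""
    chars = "#" + word.lower() + "$"
    if n == 1:
        return " ".join(chars)
    return " ".join(chars[i:i + n] for i in range(len(chars) - n + 1))

print(to_char_ngrams("banc", n=1))  # '# b a n c $'
print(to_char_ngrams("banc", n=2))  # '#b ba an nc c$'
print(to_char_ngrams("banc", n=3))  # '#ba ban anc nc$'
```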

4 Experiments & Results

We conducted a set of experiments that cover different aspects of the cognate production process. First, we test whether the approach is able to learn simple production rules. We select optimal parameters and test the influence of the size and quality of the available training data. We then compare our best model to previous work. For these experiments, we use the language pair English-Spanish, as a large manually collected list of cognates is available for training and evaluation.

4.1 Ability to Learn Production Rules

We train COP on a list of just ten cognates, all following the same production process, in order to test whether COP can generally learn cognate production rules. We test two different processes: i) the pattern ~tion→~ción, as in tradition-tradición, and ii) the pattern ~ance→~ancia, as in elegance-elegancia. The experiment shows that COP correctly produces the respective target cognates for new input words with the same pattern. We can conclude that COP succeeds in learning the necessary patterns for cognate production. In the following, we investigate whether our approach can also be applied to noisy training data containing a mixture of many different production processes.

4.2 Parameter Selection

We vary the following COP parameters: the character n-gram size used for tokenization, the order of the language model, the lexicon used for filtering, and tuning of Moses parameters. We collected a list of 3,403 English-Spanish cognates and split it into training set (2,403), development set (673), and test set (327).3 Table 1 shows the coverage in the 5 best productions and the MRR for each parameter.

                          Cov. (n=5)   MRR
    1) N-gram size
       Unigram            .63          .43
       Bigram             .65          .49
       Trigram            .51          .40
    2) Language model
       LM-order 5         .68          .48
       LM-order 10        .65          .49
    3) Lexicon filter
       Web1T-Filter       .68          .52
       Wordlist-Filter    .65          .54
    4) Moses Tuning       .66          .54

Table 1: Parameter selection for COP. The settings in bold are used for the subsequent experiments.

N-gram Size  We start with the n-gram size parameter that determines the tokenization of the input. The respective formats for unigrams, bigrams, and trigrams for the word banc look as follows: # b a n c $ / #b ba an nc c$ / #ba ban anc nc$. Higher-order n-grams in general increase the vocabulary and thus lead to better alignment. However, they also require a larger amount of training data, otherwise the number of unseen instances is too high. We find that bigrams produce slightly better results than unigrams and trigrams; this is in line with findings by Nakov and Tiedemann (2012). Thus, in the following experiments, we use character bigrams.

Language Model  The next parameter is the language model, which determines the probability of a sequence in the target language; e.g. a model of order 5 considers sequences of character n-grams up to a maximum length of 5. Order 5 seems to be already sufficient for capturing the regular character sequences in a language. However, the ranks for the order-10 model are slightly better, and as our "vocabulary" is very limited, we can safely decide for the language model of order 10.

Lexicon Filter  For filtering the n-best cognate productions, we tried two different lexicon filter lists: a relatively broad one extracted from the Web1T word counts (Brants and Franz, 2006), and a more restrictive corpus-based list. The more restrictive filter decreases the coverage, as it also eliminates some correct solutions, but it improves the MRR, as non-words are deleted from the n-best list and the ranking is adjusted accordingly. The choice of the filter thus adjusts the trade-off between cognate coverage and the quality of the n-best list. For our language learning scenario, we decide to use the more restrictive filter in order to assure high quality results.

Moses Parameters  Finally, we tune the Moses parameter weights by applying minimum error rate training (Och and Ney, 2003) using the development set, but it makes almost no difference in this setting. Tuning optimizes the model with respect to the BLEU score. For our data, the BLEU score is quite high for all produced cognate candidates, but it is not indicative of the usefulness of the transformation: a word containing one wrong character is not necessarily better than a word containing two wrong characters. This explains why tuning has little effect.

Generally, COP reaches a coverage of about 65%. If we consider an n-best list with the 100 best productions (instead of only 5), the coverage increases only by less than 1% on average, i.e. the majority of the correct cognates can be found in the top 5. This is also reflected by the high MRR. In the following experiments, we use the optimal parameter setting (highlighted in Table 1).

3 The cognates have been retrieved from several web resources and merged with the set used by Montalvo et al. (2012). All test cognate lists can be found at: http://www.ukp.tu-darmstadt.de/data

4.3 Training Data Size & Quality

As we have seen in the experiments in Section 4.1, COP is able to learn a production rule from only few training instances. However, the test dataset contains a variety of cognates following many different production processes. Thus, we evaluate the effect of the size of the training data on COP. The learning curve in Figure 3 shows the results. As expected, both coverage and MRR improve with increasing size of the training data, but we do not see much improvement after about 1,000 training instances. Thus, COP is able to learn stable patterns from relatively few training instances.

Figure 3: COP learning curve (coverage and MRR over the number of instances used for training)

However, even a list of 1,000 cognates is a hard constraint for some language pairs. Thus, we test if we can also produce satisfactory results with lower quality sets of training pairs that might be easier to obtain than a list of cognates. We use word pairs extracted from the freely available multilingual resources UBY (Gurevych et al., 2012) and Universal WordNet (UWN) (de Melo and Weikum, 2009). UBY combines several lexical-semantic resources; we use the translation pairs it provides. UWN is based on WordNet and provides automatically extracted translations for over 200 languages that are a bit noisier compared to the UBY translations. Additionally, we queried the Microsoft Bing translation API using all words from an English word list as query words.4 We also test a knowledge-free approach by pairing all words from the English and Spanish Web1T corpus.5 While the translation pairs always share the same meaning, this is not the case for the Web1T pairs, where the majority of pairs will be unrelated.

In order to increase the ratio of possible cognates in the training data, we apply a string similarity filter using the XDICE measure with a threshold of 0.425 on the translation pairs.6 For the knowledge-free pairs, we use a stricter threshold of 0.6 in order to account for the lower quality.

For a fair quality comparison, we first limit the number of training instances to 1,000, where (as shown above) the performance increase leveled off. The left columns for coverage and MRR in Table 2 show the results. It can be seen that the results for the translation pairs extracted from UBY, UWN, and Bing are only slightly inferior to the use of manually collected cognates for training. The small differences between the resources mirror the different levels of linguistic control that have been applied in their creation. The knowledge-free pairs from Web1T yield drastically inferior results. We can conclude that training data consisting of selected cognates is beneficial, but that a high quality list of translations in combination with a string similarity filter can also be sufficient and is usually easier to obtain.

                      Training Size      Cov. (n=5)   MRR
    Cognates          1,000 /  2,403     .57 / .65    .48 / .54
    Transl. UBY       1,000 /  6,048     .53 / .69    .47 / .56
    Transl. UWN       1,000 / 10,531     .50 / .69    .43 / .54
    Transl. Bing      1,000 /  5,567     .51 / .64    .44 / .54
    Knowledge-free    1,000 / 34,019     .21 / .47    .18 / .33

Table 2: Influence of data size and quality

In a follow-up experiment, we use the full size of each training set. As expected, coverage and MRR both increase in all settings. Even with the knowledge-free training set that introduces many noisy pairs, satisfactory results can be obtained. This shows that COP can be used for the production of cognates, even if no language-specific information beyond a lexicon list is available.

4 http://www.bing.com/translator
5 We only use every 5th word in order to limit the number of results to a manageable size.
6 The threshold was selected to cover ~80% of the test set.

4.4 Comparison to Previous Work

Previous work (Kondrak and Dorr, 2004; Inkpen et al., 2005; Sepúlveda Torres and Aluisio, 2011;

                Cov. (n=5)   MRR
    DICE        .46          .21
    XDICE       .52          .25
    LCSR        .51          .24
    SpSim       .52          .22
    COP         .65          .54

Table 3: Comparison of different approaches for cognate production.

                      Language Pair   Cov. (n=5)   MRR
    Same alphabet     en-es           .65          .54
                      es-en           .68          .48
                      en-de           .55          .46
    Cross-alphabet    en-ru           .59          .47
                      en-el           .61          .37
                      en-fa           .71          .54

Table 4: COP results for other languages

Montalvo et al., 2012) is based on similarity measures that are used to decide whether a candidate word pair is a cognate pair, while COP directly produces a target cognate from the source word. In order to compare those approaches to COP, we pair the English input words from the previous experiments with all words from a list of Spanish words7 and consider all resulting pairs as candidate pairs. For each pair, we then calculate the similarity score and rank the pairs accordingly. As the similarity measures often assign the same value to several candidate pairs, we get many pairs with tied ranks, which is problematic for computing coverage and MRR. Thus, we randomize pairs within one rank and report averaged results over 10 randomization runs.8

We compare COP to three frequently used string similarity measures (LCSR, DICE, and XDICE), which performed well in (Inkpen et al., 2005; Montalvo et al., 2012), and to SpSim, which is based on learning production rules. The longest common subsequence ratio (LCSR) calculates the ratio of the length of the longest (not necessarily contiguous) common subsequence and the length of the longer word (Melamed, 1999). DICE (Adamson and Boreham, 1974) measures the shared character bigrams, while the variant XDICE (Brew and McKelvie, 1996) uses extended bigrams, i.e. trigrams without the middle letter. SpSim (Gomes and Pereira Lopes, 2011) is based on string alignment of identical characters for the extraction and generalization of the most frequent cognate patterns. Word pairs that follow these extracted cognate patterns are considered equally similar as pairs with identical spelling.

Table 3 shows the results. The differences between the individual similarity measures are very small; plain string similarity performs on par with SpSim. The low MRR indicates that the four measures are not strict enough and consider too many candidate pairs as sufficiently similar. COP performs significantly better than all other measures for both coverage and MRR. The results for the similarity measures are comparable to the knowledge-free variant of COP (Cov = .47 and MRR = .33, compare Table 2). Obviously, COP better captures the relevant cognate patterns and is thus able to provide a better ranking of the production list. Another advantage of COP is its applicability to language pairs with different alphabets (see Section 5.2), while the similarity measures can only operate within one alphabet.

5 Multilinguality

The previous experiments showed that COP works well for the production of Spanish cognates from English source words. However, in language learning, we need to consider all languages previously acquired by a learner, which leads to a large set of language combinations. Imagine, for example, an American physician who wants to learn German. She has studied Spanish in school, and the terminology in her professional field has accustomed her to Greek and Latin roots. When facing a foreign text, she might unconsciously activate cues from any of these languages. Thus, if we want to select suitable texts for her, we need to consider cognates from many different languages. In the following experiments, we test how COP performs for other languages with the same alphabet and across alphabets. In addition, we evaluate how well the cognates produced by COP correlate with human judgments.

5.1 Same Alphabet

We first analyze whether the cognate production also works in the reverse direction and test the production of English cognates from Spanish source words. The results in Table 4 (upper part) show that COP works bi-directionally, as the scores for Spanish to English are comparable to those for English to Spanish. In addition, we train a model for another Western European language pair, namely English-German. The results show that COP also works well for other language pairs.

7 In order to ensure a fair comparison, we use the Spanish word list that is also used as lexicon filter in COP.
8 The average standard deviation is 0.01.
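The three string similarity baselines from Section 4.4 (DICE, XDICE, and LCSR) can be sketched in a few lines of Python. This is a minimal rendering of the published definitions for illustration, not the implementations used in the cited work.

```python
from functools import lru_cache

def bigrams(w):
    return [w[i:i + 2] for i in range(len(w) - 1)]

def xbigrams(w):
    # "Extended bigrams": trigrams with the middle letter removed.
    return [w[i] + w[i + 2] for i in range(len(w) - 2)]

def dice(a, b, grams=bigrams):
    """DICE coefficient over shared character bigrams."""
    ga, gb = grams(a), grams(b)
    shared = sum(min(ga.count(g), gb.count(g)) for g in set(ga))
    return 2 * shared / (len(ga) + len(gb))

def xdice(a, b):
    """XDICE: the DICE coefficient computed over extended bigrams."""
    return dice(a, b, grams=xbigrams)

def lcsr(a, b):
    """Longest common subsequence ratio (Melamed, 1999):
    |LCS(a, b)| divided by the length of the longer word."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return lcs(0, 0) / max(len(a), len(b))

# The LCS of nation/nacion is "naion" (5 characters), so LCSR = 5/6.
print(round(lcsr("nation", "nacion"), 2))  # 0.83
print(round(dice("nation", "nacion"), 2))  # 0.6
```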

English    | Spanish             | German               | Russian                 | Greek                 | Farsi
alcohol    | alcohol, alcoholar  | alkohol, alkoholisch | алкоголь, алкогольный   | αλκοολικό, αλκοολικά  | ™k˜ ,¨k˜
coffee     | café                | -                    | кофей, кофе             | -                     | £wh’
director   | director, directora | direktor, direkt     | директор                | -                     | r§ ,ry‹
machine    | machina             | maschine, machen     | махина, машина          | μηχανή, μαχίν         | ŸyJA› ,¨nyJA›
music      | músico, música      | musik, musisch       | -                       | μουσικής, μουσικές    | ¨qyFw› ,¨Fw›
optimal    | óptimo              | optimal, optimiert   | оптимальный             | -                     | wlW›
popular    | popular             | populär              | популярный              | -                     | wb› ,Cwb›
theory     | teoría              | theorie              | теория                  | θεωρία, θεωρίας       | ©Cw· ,©r\ž
tradition  | tradición           | tradition            | традиция, традиционный  | -                     | nF ,¨tnF

Table 5: Multilingual cognates for English source words produced by COP

5.2 Cross-Alphabet

Previous approaches to cognate identification only operate on languages using the same alphabet. As COP is able to learn correspondences between arbitrary symbols, it can easily be applied to cross-alphabet language pairs. In the previous experiments, we had excluded cognate pairs that have exactly the same string representation. For cross-alphabet pairs, this is not possible. Thus, the task is to tackle both standard transliteration (as in the English-Greek pair atlas-άτλας)9 and cognate production (as in archangel-αρκάγγελος)10.

We evaluate COP for Russian (ru), Greek (el), and Farsi (fa). For Russian, we use a list of UBY pairs as training data. Unfortunately, UWN and UBY contain only few examples for Greek and Farsi, so we use Bing translations of English source words. In order to filter the resulting list of words, we transliterate Russian and Greek into the Latin alphabet11 and apply a string similarity filter. We do not filter the training data for Farsi, as the transliteration is insufficient.

The lower part of Table 4 lists the results. Given that those language pairs are considered to be less related than English-Spanish or English-German, the results are surprisingly good. Especially the production of Farsi cognates works very well, although the training data has not been filtered. The low MRR for Greek indicates that our lexicon filter is not restrictive enough. COP often produces Greek words in several declinations (e.g. in the genitive case) which are not eliminated and lead to a worse rank of the correct target. We conclude that COP also works well across alphabets.

5.3 Multilingual Cognates

In order to provide the reader with some examples of cognates produced by COP, we compiled a short list of international words that are likely to occur in all languages under study. In Table 5, we give the two top-ranked productions. It can be seen that COP produces both true and false cognates (e.g. direkt for director), which is useful for language learning scenarios. Of course, some produced forms are questionable, e.g. the second Farsi match for music means Moses. Note that the gaps in the table are often cases where the absence of a cognate production is an indicator of COP's quality. For example, the Greek words for director, popular, and tradition are not cognates of the English word but have a very different form.

5.4 Human Associations

The examples in Table 5 showed that COP produces not only the correct cognate, but all target words that can be created from the input word based on the learned production processes. In order to assess how well these additional productions of COP correlate with human associations, we conducted a user study. We presented Czech words with German origin to 15 native German speakers that did not have any experience with Eastern European languages. The participants were asked to name up to 3 guesses for the German translation of the Czech source word. Table 6 gives an overview of the Czech source words together with the German associations named by more than one person (number of mentions in brackets). The table shows that some Czech words are strongly associated with their correct German translations (e.g. nudle-Nudel), while other words trigger wrong associations (e.g. talíř-Taler).

Another interesting aspect is the influence of languages besides the L1. For example, the German association himmel for the Czech word cíl is very likely rooted in the Czech-French association

9 The transliteration of άτλας is atlás.
10 The transliteration of αρκάγγελος is ark'aggelos.
11 Using ICU: http://site.icu-project.org/
14 Note that forms like stak also pass the lexicon filter, as this is an infrequent, but nevertheless valid German word. Other words like san are part of the German lexicon from city names like San Francisco.

Czech    | Human associations (German)                               | COP productions (German)
nudle    | Nudel (15)                                                | nudel, nadel, ode
švagr    | Schwager (13)                                             | sauger, schwager, berg
šlak     | Schlag (12), Schlagsahne (3), schlagen (2)                | stak
brýle    | Brille (12), brüllen (4)                                  | brille, brie
cíl      | Ziel (9), Himmel (2)                                      | set, zelle, teller
žold     | Sold (9), Zoll (5), Gold (2), verkauft (2), Schuld (2)    | sold, gold, geld
sál      | Salz (13), Saal (8)                                       | set, san, all, saal
taška    | Tasche (8), Aufgabe (4), Tasse (4), Taste (2)             | task, as, tick
skříň    | Schrein (5), Bildschirm/Screen (3), schreien (2)          | -
flétna   | Flöte (4), Flotte (4), Pfannkuchen (2), fliehen (2)       | flut, filet
muset    | Museum (11), müssen (3), Musik (3), Muse (2), Mus (2)     | mus, most, muse, mit
valčík   | Walze (4), Walzer (3), falsch (2)                         | -
talíř    | Taler (5), Teller (2), zahlen (3), teilen (2)             | teller, taler, ader
šunka    | schunkeln (2), Sonne (2), Schinken (1)                    | sun
knoflík  | Knoblauch (11), knifflig (4), Knopf (1)                   | -

Table 6: Human associations and cognate productions from Czech to German. Correct translations are in bold, underlined words are COP productions that match human associations.14

cíl-ciel.15 A similar process applies for the association aufgabe, which is task in English and therefore close to taška. These cross-linguistic cognitive processes highlight the importance of considering cognates from all languages a learner knows.

In order to examine how well COP reflects the human associations, we train it on manually collected Czech-German cognates and translation pairs from UBY. The number of training instances is rather small, as a language reform in the 19th century eliminated many Czech words with Austrian or German roots. Consequently, the model does not generalize as well as for other language pairs (see the column "COP productions" in Table 6).16 However, it correctly identifies cognates like nudel, brille, and sold, which are ranked first by the human participants. As we argued above, COP also correctly produces some of the 'wrong' associations, e.g. gold or taler. Thus, COP is to a certain extent able to mimic the association process that humans apply when identifying cognates.

15 Both words, himmel and ciel, mean sky in English.
16 Coverage (0.4) and MRR (0.32) are not representative as the test set is too small.

6 Conclusions

We introduced COP, a novel method for cognate production using character-based MT. We have shown that COP succeeds in learning the necessary patterns for producing cognates in different languages and alphabets. COP performs significantly better than similarity measures used in previous work on cognates. COP relies on training data, but we have shown that it can be applied even if no language-specific information beyond a word list is available. A user study on German-Czech cognates supports our assumption that COP productions are comparable to human associations and can be applied for language learning.

In future work, we will focus on the application of cognates in language learning. True cognates are easier to understand for learners and thus can be an important factor for readability assessment and the selection of language learning examples. False cognates, on the other hand, can be confusing and need to be practiced more frequently. They could also be used as good distractors for multiple choice questions. In addition, COP productions that do not pass the lexical filter might serve as pseudo-words in psycholinguistic experiments, as they contain very probable character sequences.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Klaus Tschira Foundation under project No. 00.133.2008.

References

George W. Adamson and Jillian Boreham. 1974. The Use of an Association Measure based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles. Information Storage and Retrieval, 10(7):253–260.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram corpus version 1. Linguistic Data Consortium.

Chris Brew and David McKelvie. 1996. Word-Pair Extraction for Lexicography. Proc. of the 2nd International Conference on New Methods in Language Processing, pages 45–55.

Susan E. Carroll. 1992. On Cognates. Second Language Research, 8(2):93–119, June.

David Crystal. 2011. A Dictionary of Linguistics and Phonetics, volume 30. Wiley-Blackwell.

Gerard de Melo and Gerhard Weikum. 2009. Towards a Universal Wordnet by Learning from Combined Evidence. In Proc. of the 18th ACM Conference on Information and Knowledge Management, pages 513–522.

Andrew Finch and Eiichiro Sumita. 2008. Phrase-based Machine Transliteration. In Proc. of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), pages 13–18.

Luís Gomes and José Gabriel Pereira Lopes. 2011. Measuring Spelling Similarity for Cognate Identification. In Progress in Artificial Intelligence, pages 624–633.

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer, and Christian Wirth. 2012. UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In Proc. of the 13th Conference of the EACL, pages 580–590.

Diana Inkpen, Oana Frunza, and Grzegorz Kondrak. 2005. Automatic Identification of Cognates and False Friends in French and English. In Proc. of the International Conference Recent Advances in NLP, pages 251–257.

Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In Proc. of the ACL Workshop on Unsupervised Lexical Acquisition, pages 9–16, July.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Chris Dyer, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Grzegorz Kondrak and Bonnie Dorr. 2004. Identification of Confusable Drug Names: A New Approach and Evaluation Methodology. In Proc. of the 20th International Conference on Computational Linguistics, pages 952–958.

Grzegorz Kondrak. 2000. A New Algorithm for the Alignment of Phonetic Sequences. In Proc. of the 1st NAACL, pages 288–295.

I. Dan Melamed. 1999. Bitext Maps and Alignment via Pattern Recognition. Computational Linguistics, 25(1):107–130.

Ruslan Mitkov, Viktor Pekar, Dimitar Blagoev, and Andrea Mulloni. 2008. Methods for Extracting and Classifying Pairs of Cognates and False Friends. Machine Translation, 21(1):29–53, May.

Soto Montalvo, Eduardo G. Pardo, Raquel Martinez, and Victor Fresno. 2012. Automatic Cognate Identification based on a Fuzzy Combination of String Similarity Measures. In IEEE International Conference on Fuzzy Systems, pages 1–8, June.

Preslav Nakov and Jörg Tiedemann. 2012. Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages. In Proc. of the ACL, pages 301–305.

Svetlin Nakov, Preslav Nakov, and Elena Paskaleva. 2007. Cognate or False Friend? Ask the Web! In Proc. of the RANLP Workshop: Acquisition and Management of Multilingual Lexicons, pages 55–62.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of the 40th Annual Meeting of the ACL, pages 311–318, July.

Deana L. Pennell and Yang Liu. 2011. A Character-Level Machine Translation Approach for Normalization of SMS Abbreviations, pages 974–982.

Taraka Rama and Karthik Gali. 2009. Modeling Machine Transliteration as a Phrase Based Statistical Machine Translation Problem. In Proc. of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 124–127, August.

António Ribeiro, Gaël Dias, Gabriel Lopes, and João Mexia. 2001. Cognates Alignment. In Proc. of the Machine Translation Summit 2001.

Håkan Ringbom. 1992. On L1 Transfer in L2 Comprehension and L2 Production. Language Learning, 42(1):85–112.

Stefan Schulz, Eduardo Sbrissia, Percy Nohama, and Udo Hahn. 2004. Cognate Mapping - A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon. In Proc. of the 20th International Conference on Computational Linguistics.

Lianet Sepúlveda Torres and Sandra Maria Aluisio. 2011. Using Machine Learning Methods to Avoid the Pitfall of Cognates and False Friends in Spanish-Portuguese Word Pairs. In Proc. of the 8th Brazilian Symposium in Information and Human Language Technology, pages 67–76.

Andreas Stolcke. 2002. SRILM - an Extensible Language Modeling Toolkit. In Proc. of the International Conference on Spoken Language Processing, volume 2, pages 901–904.

Sara Stymne. 2011. Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for SMS Translation. In Proc. of the 6th Workshop on SMT, pages 470–477.

Jörg Tiedemann. 2009. Character-based PSMT for Closely Related Languages. In Proc. of the 13th Annual Conference of the European Association for Machine Translation, volume 9, pages 12–19.
