Machine Translation from an Intercomprehension Perspective
Total Page:16
File Type:pdf, Size:1020Kb
Machine Translation from an Intercomprehension Perspective Yu Chen Tania Avgustinova Department of Language Science and Technology Saarland University, Saarbrücken, Germany {yuchen, tania}@coli.uni-sb.de Abstract ent that translating between them is never an easy task. For example, all Slavic languages have rich Within the first shared task on machine trans- lation between similar languages, we present morphology, but inflections systematically differ our first attempts on Czech to Polish machine from one language to another, which makes it im- translation from an intercomprehension per- possible to have a uniform solution for translating spective. We propose methods based on the between them or to a third language. mutual intelligibility of the two languages, We chose to work on the language pair Czech- taking advantage of their orthographic and Polish from the West Slavic subgroup. In an phonological similarity, in the hope to improve intercomprehension scenario, when participants over our baselines. The translation results are evaluated using BLEU. On this metric, none of in a multilingual communication speak their na- our proposals could outperform the baselines tive languages, Czechs and Poles are able to on the final test set. The current setups are understand each other to a considerable extent, rather preliminary, and there are several poten- mainly due to objectively recognisable and subjec- tial improvements we can try in the future. tively perceived linguistic similarities. As Czech- 1 Introduction English and Polish-English translation pairs are challenging enough, this naturally motivates the A special type of semi-communication can be ex- search for direct translation solutions instead of a perienced by speakers of similar languages, where pivot setup. all participant use their own native languages and We first briefly introduce the phenomenon of in- still successfully understand each other. On the tercomprehension between Slavic languages and other hand, in countries with more than one offi- our idea how to take advantage of it for machine cial languages, even if these languages are mutu- translation purposes. The next section spreads out ally intelligible. While it is a common practice to our plans on Czech-Polish translation by exploring use English as a pivot language for building ma- the similarities and differences between the two chine translation systems for under-resourced lan- languages. Then, we explain how we organized guage pairs. If English turns out to be typolog- the experiments that lead to our submissions to the ically quite distant from both the source and the shared task. We conclude with a discussion of the target languages, this circumstance easily results translation results and an outlook. in accumulation of errors. Hence, the interesting research question is how to put the similarity be- 2 Slavic Intercomprehension tween languages into use for translation purposes in order to alleviate the problem caused by the lack Intercomprehension is a special form of multi- of data or limited bilingual resources. lingual communication involving receptive skills Slavic languages are well-known for their close when reconstructing the meaning in inter-lingual relatedness, which may be traced to common an- contexts under concrete communicative situation cestral forms both in the oral tradition and in writ- It is common practice for millions of speakers, es- ten text communication. Sharing many common pecially those of related languages. In order to in- features, including an inventory of cognate sound- terpret the message encoded in a foreign but re- meaning pairings, they are to various degrees mu- lated language, they rely on linguistic and non- tually intelligible, being at the same time so differ- linguistic elements existing for similar situations 192 Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 3: Shared Task Papers (Day 2) pages 192–196 Florence, Italy, August 1-2, 2019. c 2019 Association for Computational Linguistics in their own linguistic repertoire. a bilingual linguist and cannot be easily applied Languages from the same family exhibit sys- to large amount of data. An alternative to the tematic degrees of mutual intelligibility which manual rewriting is to utilize the correspondence may be in many cases asymmetric. Czech and rules using Minimum Description Length (MDL) Polish belong to the West Slavic subgroup and re- principle (Grünwald, 2007). Most of the around lated both genetically and typologically. It could 3000 rules generated from a parallel cognate list be shown, for example, that the Poles understood of around 1000 words are not deterministic. We written Czech better than the Czechs understood use only the rules converting Czech letters that do written Polish, while the Czechs understood spo- not exist in Polish, as listed in Table1, to avoid ken Polish better than the Poles understood spo- over-transformation. ken Czech (Golubovic´, 2016). How can this be useful for machine translation? In order to tackle CZ PL the Czech-to-Polish machine translation problem Áá Aa ˇ from an intercomprehension point of view, we Ccˇ Cz cz ˇ currently focus on orthographic and phonological Dd’ D´zd´z ˇ similarities between the two languages that could Eeˇ Je je provide us with relevant correspondences in order Éé Ee to establish inter-lingual transparency and reveal Íí Ii cognate units and structures. Nˇ nˇ Nn Rˇ rˇ Rz rz 3 Approach Šš Sz sz Tt’ˇ C´ c´ 3.1 Orthographic correspondences U˚u˚ Óó Both orthographic systems are based on the Latin Vv Ww alphabet with diacritics, but the diacritical signs in Xx Ks ks the two languages are rather different. Czech has Ýý Yy a larger set of letters with diacritics, while Polish Žž Z´z´ uses digraphs more often. There are two basic di- acritical signs in Czech: the acute accent (´) used Table 1: Orthographic correspondence list to mark a long vowel and the hacekˇ (˘) in the con- sonants which becomes the acute accent for d’ and t’. The diacritics used in the Polish alphabet are 3.2 Phoneme correspondences the kreska (graphically similar to the acute accent) Czech and Polish are both primarily phonemic in the letters c,´ n,´ ó, s,´ ´z;the kreska ukosna´ (stroke) with regard to their writing system, which is re- in the letter ł. ; the kropka (overdot) in the letter z;˙ flected in the alphabets. That is, graphemes con- and the ogonek ("little tail") in the letters ˛a,˛e.The sistently correspond to phonemes of the language, Czech letters á, c,ˇ d’, é, e,ˇ ch, í, n,ˇ r,ˇ š, t’, ú, ˚u,ý, but the relation between spelling and pronuncia- ž as well as q, v, and x do not exist in Polish, and tion is more complex than a one-to-one correspon- the Polish letters ˛a, c,´ ˛e,ł, n,´ s,´ w, z˙ and ´zare not dence. In addition, Polish uses more digraphs, part of the Czech alphabet. such as ch, cz, dz, d´z,dz,˙ rz, and sz. In both lan- In a reading intercomprehension scenario, it guages, some graphemes have been merged due is natural for people to simply ignore unknown to historical reasons and at the same time some elements around graphemes that they are famil- changes in phonology have not been reflected in iar with. That is, when facing unknown alpha- spelling. bet with “foreign” diacritical signs, the reader is It is well-known, that people often try to pro- most likely to drop them and treat the respec- nounce the foreign texts in a way closer to their tive letters as the corresponding plain Latin ones. own language. Moreover, hearing the correct Experiments showed that efficiency of intercom- pronunciation sometimes helps them to infer the prehension is significantly improved if the text meaning more easily, in particular, loanwords / in- is manually transformed to mimic the spelling ternationalisms and the pan-Slavic vocabulary. in the reader’s language (Jágrová, 2016). How- To be able to make use of phonological infor- ever, such rewriting requires a huge effort from mation within a machine translation system, we 193 run fast_align (Dyer et al., 2013) on the paral- IPAcz IPApl lel corpora to obtain word alignments in both di- rections. Then, phrase pairs with less than 6 to- kens are extracted to construct a translation model Text Text cz pl based on the alignments. Weights for the fea- Figure 1: Phoneme-infused translation setup tures in the translation model are determined with the Minimal Error Rate Training (MERT) (Och, 2003). propose a multi-source multi-target structure as A byte pair encoding (BPE) (Sennrich et al., shown in Figure1, which considers the IPA tran- 2015) is applied to the training data to reduce the scription of the text as a text in a "new" closely vocabulary to 36,000 units for the NMT systems. related language. More specifically, for the trans- The first NMT system utilized only the parallel lation from Czech to Polish, the source languages data. It is a single sequence-to-sequence model of the multilingual system include ordinary Czech with single-layer RNNs in both the encoder and and IPA-transcribed Czech. So, three different the decoder. The embedding size is 512 and the translation paths are competing with each other to RNN state size is 1024. produce the final translations. The architecture of our second NMT baseline follows the architecture described in (Vaswani 4 Experiments et al., 2017). We first train a shallow model from 4.1 Data and baselines Polish to Czech with only the parallel corpora in order to translate the complete monolingual Pol- We used the provided parallel corpora Europarl, ish corpus into Czech for a synthesized parallel Wiki Titles, JRC-Acquis and the monolingual cor- corpus, which is concatenated with the original pus News Crawl 2018 for Polish. We extract ran- data to produce new training data (Sennrich et al., domly two disjoint subsets from the development 2016).