
Statistical Machine Translation between Myanmar (Burmese) and Dawei (Tavoyan)

Thazin Myint Oo, UCSY, Myanmar, [email protected]
Ye Kyaw Thu, NECTEC, Thailand, [email protected]
Khin Mar Soe, UCSY, Myanmar, [email protected]
Thepchai Supnithi, NECTEC, Thailand, [email protected]

Abstract

This paper contributes the first evaluation of the quality of statistical machine translation (SMT) between Myanmar (Burmese) and Dawei (Tavoyan). We also developed a Myanmar-Dawei parallel corpus (around 9K sentences) based on the Myanmar sentences of the ASEAN-MT corpus. The 10-fold cross-validation experiments were carried out using three different statistical machine translation approaches: phrase-based, hierarchical phrase-based, and the operation sequence model (OSM). In addition, two types of segmentation were studied: word and syllable segmentation. The results show that all three statistical machine translation approaches give comparable BLEU and RIBES scores for both Myanmar-to-Dawei and Dawei-to-Myanmar machine translation. The OSM approach achieved the highest BLEU and RIBES scores among the three SMT approaches for both word and syllable segmentation.

1 Introduction

Our main motivation for this research is to investigate SMT performance for the Myanmar (Burmese) and Dawei (Tavoyan) language pair. The Dawei (Tavoyan) language is closely related to the Myanmar (Burmese) language and is often considered a dialect of it. State-of-the-art SMT techniques (Koehn et al., 2003) demonstrate good performance on the translation of languages with relatively similar word orders (Koehn, 2005). To date, there have been some studies on the SMT of the Myanmar language. Thu et al. (2016) presented the first large-scale study of the translation of the Myanmar language. A total of 40 language pairs were used in the study, which included languages both similar to and fundamentally different from Myanmar. The results show that the hierarchical phrase-based SMT (HPBSMT) approach (Chiang, 2007) gave the highest translation quality in terms of both the BLEU (Papineni et al., 2002) and RIBES scores (Isozaki et al., 2010).

Win Pa Pa et al. (2016) presented the first comparative study of five major machine translation approaches applied to low-resource languages. They applied phrase-based statistical machine translation (PBSMT), HPBSMT, tree-to-string (T2S), string-to-tree (S2T) and operation sequence model (OSM) translation methods to the translation of limited quantities of travel-domain data between English and Thai, Lao and Myanmar in both directions. The experimental results indicate that in terms of adequacy (as measured by BLEU score), the PBSMT approach produced the highest quality translations. Here, the annotated tree is used only for the English language in the S2T and T2S experiments, because there is no publicly available tree parser for the Lao, Myanmar and Thai languages. To our knowledge, there is also no publicly available tree parser for either the Dawei or the Myanmar language, and thus we cannot apply the S2T and T2S approaches to the Myanmar-Dawei language pair. From their RIBES scores, we noticed that the OSM approach achieved the best machine translation performance for Myanmar-to-English translation. Moreover, we learned that the OSM approach gave the highest translation performance for translation between Khmer (the official language of Cambodia) and twenty other languages, in both directions (Thu et al., 2015).

Relating to Myanmar language dialects, Thazin Myint Oo et al. (2018) contributed the first PBSMT, HPBSMT and OSM machine translation evaluations between Myanmar and Rakhine. The experiments used an 18K Myanmar-Rakhine parallel corpus that was constructed to analyze the behavior of dialectal Myanmar-Rakhine machine translation. The results showed that high BLEU (57.88 for Myanmar-Rakhine and 60.86 for Rakhine-Myanmar) and RIBES (0.9085 for Myanmar-Rakhine and 0.9239 for Rakhine-Myanmar) scores can be achieved for the Rakhine-Myanmar language pair even with limited data. Based on the experimental results of these previous works, the machine translation experiments between Myanmar and Dawei in this paper were carried out using PBSMT, HPBSMT and OSM.

2 Related Work

Karima Meftouh et al. built the PADIC (Parallel Arabic Dialect Corpus) corpus from scratch, then conducted experiments on cross-dialect Arabic machine translation (Meftouh et al., 2015). PADIC is composed of dialects from both the Maghreb and the Middle East. Some interesting results were achieved even with the limited corpus of 6,400 parallel sentences. Using SMT for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with small training data by exploiting the relative proximity between the two varieties (Neubarth et al., 2016). Friedrich Neubarth et al. described a specific problem, and its solution, arising in translation between standard Austrian German and the Viennese dialect. They used a hybrid approach of rule-based preprocessing and PBSMT to obtain better performance. Pierre-Edouard Honnet et al. proposed solutions for the machine translation of a family of dialects, Swiss German, for which parallel corpora are scarce (Honnet et al., 2018). They presented three strategies for normalizing Swiss German input in order to address the regional and spelling diversity. The results show that character-based neural MT was the most promising one for text normalization and that, in combination with PBSMT, it achieved a 36 % BLEU score.

3 Dawei Language

The Tavoyan or Dawei dialect of Burmese is spoken in Dawei (Tavoy), in the coastal Tanintharyi Region of southern Myanmar (Burma). The large and quite distinct Dawei or Tavoyan variety is spoken in and around Dawei (formerly Tavoy) in Tanintharyi (formerly Tenasserim) by about 400,000 people; its stereotyped characteristic is the medial /l/, found in the earliest inscriptions but lost by merger nearly 800 years ago; for further information see (1933) and Okell (1995). Dawei is a city of south-eastern Myanmar and is the capital of Tanintharyi Region, formerly known as Tenasserim. The region is bounded by Mon State to the north, Thailand to the east and south, and the Andaman Sea to the west.

Tavoyan retains the /-l-/ medial that has since merged into the /-j-/ medial in standard Burmese, and it can form the following consonant clusters: /ɡl-/, /kl-/, /kʰl-/, /bl-/, /pl-/, /pʰl-/, /ml-/, /m̥l-/. Examples include “ေမလ” (/mlè/ → Standard Burmese /mjè/) for “ground” and “ေကလာင်း” (/kláʊɴ/ → Standard Burmese /tʃáʊɴ/) for “school”. Also, voicing can occur only with unaspirated consonants, whereas in standard Burmese voicing can occur with both aspirated and unaspirated consonants. There are also many loan words from Malay and Thai not found in Standard Burmese. An example is the word for goat, which is hseit “ဆိတ်” in Standard Burmese but be “ဘဲ” in Tavoyan. In the Tavoyan dialect, terms of endearment, as well as family terms, are considerably different from Standard Burmese. For instance, the terms for “son” and “daughter” are “ဖစု” (/pʰa̰ òu/) and “မိစု” (/mḭ òu/) respectively. Moreover, the honorific “ေနာင်” (Naung) is used in lieu of “ေမာင်” (Maung) for young males.

Another piece of evidence for the name “Dawei” is the “Dhommarazaka” pagoda inscription of the Bagan period. It was inscribed in AD 1196 during the reign of a Bagan king (AD 1174-1201). In lines 6 to 19 of this inscription, where the demarcation of Bagan is mentioned, “Taung-Kar-Htawei” (up to Htawei to the south) and “Taninthaye” (Tanintharyi) are included. Therefore, the name “Dawei” has appeared since the Bagan period, at the time of the first Myanmar Empire. The saying “Dawei was established at Myanmar year 1116” actually means that the present name Dawei appeared later as the name of the settlers, and that the original name of the city, which was established at Myanmar year 1116 according to the saying, is Tharyarwady. The “Dawei” nationality thus deserves to be regarded as one of the nationalities of our country. In fact, the Dawei region is a place where local people have lived since the Stone Age, after which Bronze Age and Iron Age cultures developed. Moreover, as there is sound evidence of the ancient city of Thargara, contemporary with the Pyu period, the Dawei people can be assumed to be a nationality of high culture in Myanmar.

Dawei (Tavoyan) usage and vocabulary are divided into three main groups. The first uses Myanmar vocabulary with Dawei speech, the second uses vocabulary that is the same as Myanmar vocabulary, and the third uses isolated Dawei words and vocabulary. The Myanmar words “ထို, ဟို” (“here, there”) correspond to “သယ်” (“here”) and “ေဟာက်” (“there”) in the Dawei language. For example, the Dawei word “သယ်မျိုး” is the same as “ဒီလို” in the Myanmar language, and “ေဟာက်မျိုး” means “ဟိုလို” in the Myanmar language. The question words “နည်း (သနည်း), လဲ (သလဲ)” are used in the Myanmar language; similarly, “ေလာ,ေလာ်” is used instead of “လား (သလား)” in the Dawei language. Moreover, “ဘာလဲ” (“what”) and “ဘာြဖစ်တာလဲ” (“what happened”) correspond to “ြဖာနူး” and “ြဖာြဖစ်နူး” in Dawei usage. The Myanmar negative word “ဘူး” is not usually used in Dawei. The negative Dawei words are “ဟှ (ရ)” or “ဟန်း” (“No” in English). The Myanmar words “သိပ်, အလွန်, အလွန့်အလွန်” (“very, extremely”) are expressed as “ရရာ, ရမိရရာ, ြဗင်း”. Some more examples of Dawei vocabulary are “ဝန်းရှင်း” (“ကိုယ်၀န်ေဆာင်” in Myanmar language, “pregnant” in English), “ေကာန်သား” (“ေကာင်ေလး” in Myanmar language, “boy” in English), “ဝယ်သား” (“ေကာင်မေလး” in Myanmar language, “girl” in English), “ကပ်” (“ပိုက်ဆံ” in Myanmar language, “money” in English), “ေချာ့-က်တိုအိုးသီး” (“ကေကာသီး”ဲ in Myanmar language, “pomelo” in English) and “သစ်ခတ်ကလား” (“ကျားသစ်” in Myanmar language, “leopard” in English). The following are some example parallel sentences of Myanmar (my) and Dawei (dw):

dw: သယ်၀ယ်သား က လှ ြဗင်း ဟှယ် ။
my: ဒီေကာင်မေလး က လှ လွန်း တယ် ။
(“The girl is so beautiful” in English)

dw: လတ်ဖတ်ရယ် က ရိ ြဗင်း ဟှယ် ။
my: လက်ဖက်ရည် က ချို လွန်း တယ် ။
(“The tea is so sweet” in English)

dw: ေကာန်သား ေကလာန်း မှန်းမှန် သွား ဟှယ် ။
my: ေကာင်ေလး ေကျာင်း မှန်မှန် တက် တယ် ။
(“The boy goes to school regularly” in English)

4 Methodology

In this section, we describe the methodology used in the machine translation experiments for this paper.

4.1 Phrase-Based Statistical Machine Translation

A PBSMT translation model is based on phrasal units (Koehn et al., 2003). Here, a phrase is simply a contiguous sequence of words and, generally, not a linguistically motivated phrase. A phrase-based translation model typically gives better translation performance than word-based models. We can describe a simple phrase-based translation model as consisting of phrase-pair probabilities extracted from a corpus, a basic reordering model, and an algorithm to extract the phrases to build a phrase table (Specia, 2011).

The phrase translation model is based on the noisy channel model. The goal is to find the best translation \hat{e} that maximizes the translation probability P(e|f) given the source sentence f. Here, the source language is French and the target language is English. The translation of a French sentence into an English sentence is modeled as Equation 1.

\[ \hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) \tag{1} \]

Applying Bayes' rule, we can factorize P(e|f) into three parts:

\[ P(e \mid f) = \frac{P(f \mid e)\, P(e)}{P(f)} \tag{2} \]

The final mathematical formulation of the phrase-based model is as follows:

\[ \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e) \tag{3} \]

We note that the denominator P(f) can be dropped because the probability of the source sentence remains the same for all translations. The P(f|e) term can be viewed as a bilingual dictionary with probabilities attached to each entry (the phrase table). The P(e) term governs the grammaticality of the translation, and we model it using an n-gram language model under the PBSMT paradigm.

Figure 1: Some examples of hierarchical phrase-based grammar between Dawei and Myanmar phrases
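To make the noisy-channel decision rule of Section 4.1 (Equations 1 to 3) concrete, the following minimal Python sketch scores a few candidate translations by P(f|e)P(e) and returns the argmax. The candidate strings and all probability values are invented purely for illustration; they are not taken from the paper's phrase table or language model, and the function name `noisy_channel_best` is hypothetical.

```python
import math


def noisy_channel_best(candidates):
    """Return the candidate e that maximizes log P(f|e) + log P(e).

    `candidates` maps each candidate translation e to a pair
    (P(f|e), P(e)): a translation-model and a language-model probability.
    P(f) is left out because it is the same for every candidate.
    """
    def score(e):
        p_f_given_e, p_e = candidates[e]
        return math.log(p_f_given_e) + math.log(p_e)

    return max(candidates, key=score)


# Toy, made-up probabilities for a single source sentence f.
toy_candidates = {
    "the girl is so beautiful": (0.20, 0.010),
    "girl the so beautiful is": (0.25, 0.0001),
    "the girl is beautiful": (0.05, 0.020),
}

best = noisy_channel_best(toy_candidates)
print(best)
```

In this toy setting the second candidate has the highest translation-model probability, but its low language-model probability keeps it from being selected, which is the role Section 4.1 assigns to P(e).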

4.2 Hierarchical Phrase-Based Statistical Machine Translation

The hierarchical phrase-based SMT approach is a model based on synchronous context-free grammar (Specia, 2011). The model can be learned from a corpus of unannotated parallel text. The advantage this technique offers over the phrase-based approach is that the hierarchical structure is able to represent the word re-ordering process. The re-ordering is represented explicitly rather than encoded into a lexicalized re-ordering model (commonly used in purely phrase-based approaches). This makes the approach particularly applicable to language pairs that require long-distance re-ordering during the translation process (Braune et al., 2012). Some examples of hierarchical phrase-based grammar between Dawei and Myanmar phrases are shown in Figure 1.

4.3 Operation Sequence Model

The operation sequence model combines the benefits of two state-of-the-art SMT frameworks, namely n-gram-based SMT and phrase-based SMT. This model simultaneously generates source and target units and, because it is based on minimal translation units, does not have spurious ambiguity (Durrani et al., 2011; Durrani et al., 2015). It is a bilingual language model that also integrates reordering information. OSM provides a better reordering mechanism that uniformly handles local and non-local reordering, with strong coupling of lexical generation and reordering. This means that OSM can handle both short- and long-distance reordering. The operation types include generate, insert gap, jump back and jump forward, which perform the actual reordering. The following shows an example translation process of the English sentence “Please sit here” into the Myanmar language with the OSM.

Source: Please sit here
Target: ေကျးဇူးပြပီး ဒီမှာ ထိုင်

Operation 1: Generate (Please, ေကျးဇူးပြပီး)
Operation 2: Insert Gap
Operation 3: Generate (here, ဒီမှာ)
Operation 4: Jump Back (1)
Operation 5: Generate (sit, ထိုင်)

5 Experiment

5.1 Corpus Statistics

We used 9,000 Myanmar sentences (without named entity tags) of the ASEAN-MT Parallel Corpus (Prachya and Thepchai, 2013), which is a parallel corpus in the travel domain. It contains six main categories: people (greeting, introduction and communication), survival (transportation, accommodation and finance), food (food, beverage and restaurant), fun (recreation, traveling, shopping and nightlife), resource (number, time and accuracy), and special needs (emergency and health). Manual translation into Rakhine language was done by native Rakhine students from two Myanmar universities and the translated corpus was checked by the editor of Rakhine newspaper. Word segmentation for Rakhine was done manually. We held 10-fold cross-validation experiments and used 6,883 to 6,893 sentences for training, 1,212 to 1,217 sentences for development and 890 to 922 sentences for evaluation, respectively.

5.2 Word Segmentation

In both Myanmar and Dawei text, spaces are used for separating phrases for easier reading. This is not strictly necessary, and these spaces are rarely used in short sentences. There are no clear rules for using spaces, and thus spaces may (or may not) be inserted between words, between phrases, and even between root words and their affixes. Although the Myanmar sentences of the ASEAN-MT corpus are already segmented, we had to define some rules for manual word segmentation of Dawei sentences. We defined a Dawei “word” to be a meaningful unit, and affixes, root words and suffix(es) are separated, as in “စား ဟှယ်”, “စားပီးဟှယ်”, “စား ဖို့ဟှယ်”. Here, “စား” (“eat” in English) is a root word and the others are suffixes for past and future tenses. As in the Myanmar language, Dawei plural nouns are identified by a following particle. We also put a space between a noun and the following particle; for example, the Dawei word “ဇွန်သားေဒ” (shrimps) is segmented as two words, “ဇွန်သား” and the particle “ေဒ”. In Dawei grammar, particles describe the type of noun and are used after a number or a number written as text. For example, the Dawei word “ရှီးခိုသီးတစ်လုံး” (“papaya” in English) is segmented as “ရှီးခိုသီး တစ် လုံး”. In our manual word segmentation rules, compound nouns are considered as one word, and thus the Dawei compound word “ကပ်” + “အိတ်” (“money” + “bag” in English) is written as one word “ကပ်အိတ်” (“wallet” in English). Dawei adverbs such as “ရရာ, ရမိရရာ” (“very” in English) and “ြဗင်း” (“extremely” in English) are also considered as one word. The following is an example of word segmentation for a Dawei sentence in our corpus, whose meaning is “Shrimps are very rare and bought fishes.”

Unsegmented Dawei sentence:
dw: ဇွန်သားေဒရရာရှားဟှယ်၊ငါးေဗာင်းသားဘ့ဲဝယ်လာရဟှယ်။

Word segmented Dawei sentence:
dw: ဇွန်သား ေဒ ရရာ ရှား ဟှယ် ၊ ငါးေဗာင်းသား ဘ့ဲ ဝယ် လာရဟှယ် ။

In this example, “ဇွန်သားေဒ” (shrimps) is segmented as two words, “ဇွန်သား” and the particle “ေဒ”. The Dawei adverb “ရရာ” (“very” in English) is considered as one word, and the root word “ဝယ်” and the suffix “လာရဟှယ်” are segmented as two words, “ဝယ် လာရဟှယ်” (“bought” in English).

5.3 Syllable Segmentation

Generally, Myanmar words are composed of multiple syllables, and most of the syllables are composed of more than one character. Syllables are the units from which Myanmar words are composed. If we only focus on consonant-based syllables, the structure of the syllable can be described in Backus normal form (BNF) as follows:

Syllable := C{M}{V}[CK][D]

Here, C stands for consonants, M for medials, V for vowels, K for the vowel killer character, and D for diacritic characters. Myanmar syllable segmentation can be done with a rule-based approach, a finite state automaton (FSA) or regular expressions (RE) (https://github.com/ye-kyawthu/sylbreak). The visualization of syllable breaking based on the RE for the Myanmar language is shown in Figure 2. In our experiments, we used the RE-based Myanmar syllable segmentation tool named “sylbreak”. The following is an example of syllable segmentation for a Dawei sentence in our corpus, whose meaning is “You are cute.”

Figure 2: Visualization of syllable breaking with regular expression for Myanmar language

Unsegmented Dawei sentence:
dw: နန်ရှစ်ဇရာကွန်းဇမား။

Syllable segmented Dawei sentence:
dw: နန် ရှစ် ဇ ရာ ကွန်း ဇ မား ။

5.4 Moses SMT System

We used the PBSMT, HPBSMT and OSM systems provided by the Moses toolkit (Koehn et al., 2007) for training the PBSMT, HPBSMT and OSM statistical machine translation systems. The word-segmented source language was aligned with the word-segmented target language using GIZA++.

src-tgt   PBSMT              HPBSMT             OSM
dw-my     29.143 (0.82286)   29.09 (0.82203)    29.563 (0.82369)
my-dw     21.575 (0.62624)   21.697 (0.78651)   21.701 (0.78667)

Table 1: Average BLEU and RIBES scores for PBSMT, HPBSMT and OSM using word segmentation
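As a quick arithmetic check on Table 1, the sketch below computes, for each system, how much higher the Dawei-to-Myanmar BLEU score is than the Myanmar-to-Dawei score. The BLEU values are copied from the table; the dictionary layout and the names `bleu_word` and `direction_gaps` are assumptions made for this illustration.

```python
# BLEU scores copied from Table 1 (word segmentation).
bleu_word = {
    "PBSMT": {"dw-my": 29.143, "my-dw": 21.575},
    "HPBSMT": {"dw-my": 29.09, "my-dw": 21.697},
    "OSM": {"dw-my": 29.563, "my-dw": 21.701},
}


def direction_gaps(scores):
    """Per-system BLEU(dw-my) minus BLEU(my-dw), rounded to 3 decimals."""
    return {
        system: round(pair["dw-my"] - pair["my-dw"], 3)
        for system, pair in scores.items()
    }


gaps = direction_gaps(bleu_word)
mean_gap = round(sum(gaps.values()) / len(gaps), 3)
print(gaps)
print(mean_gap)
```

The per-system gaps come out at roughly 7.4 to 7.9 BLEU points, in line with the difference of around 8 BLEU between the two directions discussed in Section 7.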

src-tgt   PBSMT              HPBSMT             OSM
dw-my     60.788 (0.94613)   60.472 (0.94476)   63.221 (0.94825)
my-dw     44.8 (0.91601)     45.441 (0.91496)   45.584 (0.91550)

Table 2: Average BLEU and RIBES scores for PBSMT, HPBSMT and OSM using Syllable Segmentation
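Table 2 relies on syllable-segmented text (Section 5.3). The following is a minimal sketch of regular-expression syllable breaking in the spirit of that section. It is a simplified illustration and not the code of the sylbreak tool itself. It assumes input in standard Unicode encoding order and handles only consonant-initial syllables: a break is placed before every consonant that is neither followed by the asat (vowel killer) or the stacking sign, nor preceded by the stacking sign. Punctuation, digits and independent vowels are not treated specially. The pattern and the function name `break_syllables` are assumptions made for this example.

```python
import re

CONSONANT = "\u1000-\u1021"  # Myanmar consonants KA .. A
ASAT = "\u103A"              # vowel killer sign
VIRAMA = "\u1039"            # stacking (subscript) sign

# A consonant starts a new syllable unless it is stacked under the previous
# consonant (preceded by VIRAMA) or closes the current syllable (followed
# by ASAT or VIRAMA).
SYLLABLE_START = re.compile(
    "(?<!" + VIRAMA + ")[" + CONSONANT + "](?![" + ASAT + VIRAMA + "])"
)


def break_syllables(text):
    """Split consonant-based Myanmar-script text into a list of syllables."""
    marked = SYLLABLE_START.sub(lambda m: " " + m.group(0), text)
    return marked.split()


# MA + medial RA + NA + ASAT, then MA + AA: the word "Myanmar" in Myanmar script.
word = "\u1019\u103C\u1014\u103A\u1019\u102C"
syllables = break_syllables(word)
print(syllables)
```

The lookahead keeps a syllable-final consonant (one carrying the vowel killer) attached to the preceding syllable, which corresponds to the optional [CK] part of the syllable structure in Section 5.3.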

The GIZA++ alignment (Och and Ney, 2000) was symmetrized with the grow-diag-final-and heuristic. The lexicalized reordering model was trained with the msd-bidirectional-fe option (Tillmann, 2004). We used KenLM (Heafield, 2011) for training the 5-gram language model with modified Kneser-Ney discounting (Chen and Goodman, 1996). Minimum error rate training (MERT) (Och, 2003) was used to tune the decoder parameters, and the decoding was done using the Moses decoder (version 2.1.1). We used the default settings of Moses for all experiments.

6 Evaluation

We used two automatic criteria for the evaluation of the machine translation output. One was the de facto standard automatic evaluation metric Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) and the other was the Rank-based Intuitive Bilingual Evaluation Measure (RIBES) (Isozaki et al., 2010). The BLEU score measures the precision of n-grams (over all n ≤ 4 in our case) with respect to a reference translation, with a penalty for short translations (Papineni et al., 2002). Intuitively, the BLEU score measures the adequacy of the translation, and larger BLEU scores are better. RIBES is an automatic evaluation metric based on rank correlation coefficients modified with precision, and special care is paid to the word order of the translation results. The RIBES score is suitable for distant language pairs such as Myanmar and English. Larger RIBES scores are better.

7 Results and Discussion

The BLEU and RIBES score results for the machine translation experiments with PBSMT, HPBSMT and OSM using word-level segmentation between the Myanmar and Dawei languages are shown in Table 1. Bold numbers indicate the highest scores among the three SMT approaches. The RIBES scores are inside the round brackets. Here, “my” stands for Myanmar, “dw” stands for Dawei, “src” stands for source language and “tgt” stands for target language. From the results, the OSM method achieved the highest BLEU and RIBES scores for both Myanmar-Dawei and Dawei-Myanmar machine translation. Interestingly, the BLEU and RIBES scores of all three methods are comparable. Our results with the current parallel corpus indicate that Dawei-to-Myanmar machine translation performs better (around 8 BLEU and 0.03 RIBES higher) than the Myanmar-to-Dawei translation direction.

The BLEU and RIBES scores for syllable segmentation between the Myanmar and Dawei languages are shown in Table 2. Our results with syllable segmentation also indicate that Dawei-to-Myanmar machine translation performs better (around 17 BLEU and 0.03 RIBES higher) than the Myanmar-to-Dawei translation direction. As we expected, the machine translation performance of all three SMT approaches between the Myanmar and Dawei languages generally achieved suitable BLEU and RIBES scores even with a limited parallel corpus. The reason is that, as we mentioned in Section 3, Myanmar and Dawei are close languages. We assume that long-distance reordering is relatively rare and that local reordering alone is enough for the Myanmar-Dawei language pair. We expect that these scores can be increased by enlarging the corpus in the near future.

8 Error Analysis

We also used the SCLITE (score speech recognition system output) program from the NIST scoring toolkit SCTK version 2.4.10 to make dynamic-programming-based alignments between reference and hypothesis strings for detailed analysis of translation errors. From our studies, the top 15 confusion pairs for Dawei-Myanmar OSM machine translation (with word segmentation) can be seen in Table 3. We also made a manual error analysis of the translated outputs of the best OSM model, and we found that the dominant errors differ at the sentence level. We introduce four frequent error patterns: “Male-Female Vocabulary Error”, “Paraphrasing Error”, “Word Segmentation Error” and “Negative Error”.

Freq   Reference ==> Hypothesis
16     သူမ ==> သူ
14     ခင်ဗျား ==> မင်း
9      ပါတယ် ==> တယ်
8      ပါဘူး ==> ဘူး
7      သလဲ ==> တယ်
5      ဘာေတွ ==> ဘာ
5      မင်းကို ==> ကို
5      မလား ==> မှာလား
5      လား ==> သလား
5      အ့ဲဒါကို ==> ကို
4      ခ့ဲဘူး ==> ဘူး
4      ဘူးလား ==> ရှိလား
4      မင်းရဲ့ ==> မင်း
4      လဲ ==> သလဲ
4      သူ့ ==> သူမ

Table 3: The top 15 confusion pairs of OSM model for Dawei-Myanmar machine translation with word segmentation

The following are some example translation mistakes for each category:

### Male-Female Vocabulary Error ###

SOURCE: သူ နန့် ဟိ ြဗင် လား ။
Scores: (#C #S #D #I) 3 2 0 1
REF: ****** သူမ မင်းကို ြမင် သလား ။
HYP: သူ မင်း ကို ြမင် သလား ။
Eval: I S S

SOURCE: သူ့ကိုယ်သူ သိ ဟှယ် ။
Scores: (#C #S #D #I) 3 1 0 0
REF: သူမကိုယ်သူမ သိ ပါတယ် ။
HYP: သူ့ကိုယ်သူ သိ ပါတယ် ။
Eval: S

### Paraphrasing Error ###

SOURCE: ငှား ဟှားဟိ အီ ေလ ။
Scores: (#C #S #D #I) 4 1 0 0
REF: ငှားရမ်း ထားတ့ဲ အိမ် ေတွ ။
HYP: ငှား ထားတ့ဲ အိမ် ေတွ ။
Eval: S

SOURCE: လူတိုင်း သတတိ ရှိ ေက့ဟှယ် ။
Scores: (#C #S #D #I) 4 1 0 0
REF: လူတိုင်း သတတိ ရှိ ြကပါတယ် ။
HYP: လူတိုင်း သတတိ ရှိ ြကတယ် ။
Eval: S

SOURCE: ကန်ေတာ် အိ ရှင်ေနဟှယ် ။
Scores: (#C #S #D #I) 3 1 0 2
REF: ကန်ေတာ် အိပ် **** ****** ချင်ေနတယ် ။
HYP: ကန်ေတာ် အိပ် ဖို့ ဆနဒရှိ တယ် ။
Eval: I I S

SOURCE: သူဟှ ရတိုင်း လှ မား ။
Scores: (#C #S #D #I) 3 2 0 0
REF: သူက အရမ်း လှ တာပဲ ။
HYP: သူက သိပ် လှ ေရာ ။
Eval: S S

### Word Segmentation Error ###

SOURCE: အဲဝယ်ဟှား ကားမွန်း ဟိ့မဝလား ။
Scores: (#C #S #D #I) 4 1 1 0
REF: သူမ ကား ေမာင်း မှာ မဟုတ်ဘူးလား ။
HYP: သူမ ********* ကားေမာင်း မှာ မဟုတ်ဘူးလား ။
Eval: D S

SOURCE: အယ်မိုဇာ ပိုဆိုး လာဟှယ် ။
Scores: (#C #S #D #I) 3 1 1 0
REF: အဲဒါ ပို ဆိုး လာတယ် ။
HYP: အဲဒါ ********* ပိုဆိုး လာတယ် ။
Eval: D S

### Negative Error ###

SOURCE: ေြဖ ေပး ဟိ့ ရှစ် ေနလား ။
Scores: (#C #S #D #I) 5 1 0 1
REF: အေြဖ *** ေပး ဖို့ ရှက် ေနသလား ။
HYP: အေြဖ မ ေပး ဖို့ ရှက် ေနတာလား ။
Eval: I S

SOURCE: ဝယ်ရား နတ်ဆက် သွား ဟှ ။
Scores: (#C #S #D #I) 5 0 1 0
REF: သူမ နတ်ဆက် မ သွား ဘူး ။
HYP: သူမ နတ်ဆက် *** သွား ဘူး ။
Eval: D

Here, “SOURCE” is the test sentence in the Dawei language, “Scores” are the operation counts of the edit distance (Miller et al., 2009), “C” is the number of correct words, “S” is the number of substitutions, “D” is the number of deletions, “I” is the number of insertions, “REF” is the reference (i.e. the Myanmar sentence), “HYP” is the hypothesis and “Eval” is the ordered sequence of edit operations.

We found that translation errors from female to male vocabulary and vice versa occur in Dawei-Myanmar translation, such as “သူမ” (“she” in English) to “သူ” (“he” in English) and “သူမကိုယ်သူမ” (“herself” in English) to “သူ့ကိုယ်သူ” (“himself” in English). The second category, paraphrasing errors, is particularly interesting and also indicates that the two languages are similar. In our paraphrasing error examples, the meanings of all reference and hypothesis pairs are the same. Some errors are just the difference between the formal (polite) and informal written forms, such as “ြကပါတယ်” (the polite form of the ending phrase “ြကတယ်” in Myanmar conversation) and “ြကတယ်”. One of the possible reasons for the word segmentation errors is inconsistent word segmentation by the translators, such as “ကားေမာင်း” and “ကား ေမာင်း” (“drive a car” in English). We also found that one more frequent translation error in both Dawei-Myanmar and Myanmar-Dawei machine translation is a change into or out of the negative form, e.g. “အေြဖေပး” (“to answer” in English) and “အေြဖမေပး” (“no answer” in English).

9 Conclusion

This paper contributes the first PBSMT, HPBSMT and OSM machine translation evaluations from Myanmar to Dawei and from Dawei to Myanmar. We used the 9K Myanmar-Dawei parallel corpus that we constructed to analyze the behavior of dialectal Myanmar-Dawei machine translation. We also investigated two types of segmentation schemes (word segmentation and syllable segmentation). We showed that reasonable BLEU and RIBES scores can be achieved for the Dawei-Myanmar language pair even with limited data. In the near future we plan to test PBSMT, HPBSMT and OSM models with other Myanmar dialects such as Myeik (Beik).

Acknowledgment

We would like to thank U Aung Myo (Leading Charge, Dawei Ethnic Organizing Committee, DEOC) for his advice, especially on the writing system of the Dawei language with Myanmar characters. We are very grateful to Daw Thiri Hlaing (Lecturer, University of Computer Studies Dawei) for leading the Myanmar-Dawei Translation Team. We would like to thank all students of the Myanmar-Dawei translation team, namely Aung Myat Shein, Aung Paing, Aye Thiri Htun, Aye Thiri Mon, Htet Soe San, Ming Maung Hein, Nay Lin Htet, Thuzar Win Htet, Win Theingi Kyaw, Zin Bo Hein and Zin Wai, for translation between Myanmar and Dawei sentences. Last but not least, we would like to thank Daw Khin Aye Than (Prorector, University of Computer Studies Dawei) for all the help and support during our stay at the University of Computer Studies Dawei.

References

Fabienne Braune, Anita Ramm, and Alexander Fraser. 2012. Long-distance reordering during search for hierarchical phrase-based SMT. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pages 177-184.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL '96, pages 310-318, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045-1054, Portland, Oregon, USA. Association for Computational Linguistics.

Nadir Durrani, Helmut Schmid, Alexander Fraser, Philipp Koehn, and Hinrich Schütze. 2015. The operation sequence model: Combining n-gram-based and phrase-based statistical machine translation. Computational Linguistics, 41(2):157-186.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197, Edinburgh, Scotland. Association for Computational Linguistics.

Pierre-Edouard Honnet, Andrei Popescu-Belis, Claudiu Musat, and Michael Baeriswyl. 2018. Machine translation of low-resource spoken dialects: Strategies for normalizing Swiss German. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944-952, Cambridge, MA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In The Tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48-54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Karima Meftouh, Salima Harrat, Salma Jamoussi, Mourad Abbas, and Kamel Smaili. 2015. Machine translation experiments on PADIC: A parallel Arabic dialect corpus. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pages 26-34, Shanghai, China.

Frederic P. Miller, Agnes F. Vandome, and John McBrewster. 2009. Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press.

Friedrich Neubarth, Barry Haddow, Adolfo Huerta, and Harald Trost. 2016. A hybrid approach to statistical machine translation between standard and dialectal varieties. Volume 9561, pages 341-353.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160-167, Stroudsburg, PA, USA. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 440-447, Stroudsburg, PA, USA. Association for Computational Linguistics.

John Okell. 1995. Three Burmese dialects. Papers in Southeast Asian Linguistics No. 13, Studies in Burmese Languages, 13:1-138.

Thazin Myint Oo, Ye Kyaw Thu, and Khin Mar Soe. 2018. Statistical machine translation between Myanmar (Burmese) and Rakhine (Arakanese). In Proceedings of ICCA2018, pages 304-311.

Win Pa Pa, Ye Kyaw Thu, Andrew M. Finch, and Eiichiro Sumita. 2016. A study of statistical machine translation methods for under resourced languages. In SLTU-2016, 5th Workshop on Spoken Language Technologies for Under-resourced Languages, 9-12 May 2016, Yogyakarta, Indonesia, pages 250-257.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Boonkwan Prachya and Supnithi Thepchai. 2013. Technical Report for The Network-based ASEAN Language Translation Public Service Project. Online Materials of Network-based ASEAN Languages Translation Public Service for Members, NECTEC.

Lucia Specia. 2011. Tutorial, Fundamental and New Approaches to Statistical Machine Translation. International Conference Recent Advances in Natural Language Processing.

Ye Kyaw Thu, Vichet Chea, Andrew M. Finch, Masao Utiyama, and Eiichiro Sumita. 2015. A large-scale study of statistical machine translation methods for Khmer language. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, October 30 - November 1, 2015.

Ye Kyaw Thu, Andrew Finch, Win Pa Pa, and Eiichiro Sumita. 2016. A large scale study of statistical machine translation methods for Myanmar language. In Proceedings of SNLP2016.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, pages 101-104, Boston, Massachusetts, USA. Association for Computational Linguistics.