Statistical Machine Translation between Myanmar (Burmese) and Dawei (Tavoyan)

Thazin Myint Oo (UCSY, Myanmar), Ye Kyaw Thu (NECTEC, Thailand), Khin Mar Soe (UCSY, Myanmar), Thepchai Supnithi (NECTEC, Thailand)

Abstract

This paper contributes the first evaluation of the quality of statistical machine translation (SMT) between Myanmar (Burmese) and Dawei (Tavoyan). We also developed a Myanmar-Dawei parallel corpus (around 9K sentences) based on the Myanmar language portion of the ASEAN MT corpus. The 10-fold cross-validation experiments were carried out using three different statistical machine translation approaches: phrase-based, hierarchical phrase-based, and the operation sequence model (OSM). In addition, two types of segmentation were studied: word and syllable segmentation. The results show that all three statistical machine translation approaches give comparable BLEU and RIBES scores for both Myanmar to Dawei and Dawei to Myanmar machine translation. The OSM approach achieved the highest BLEU and RIBES scores among the three SMT approaches for both word and syllable segmentation.

1 Introduction

Our main motivation for this research is to investigate SMT performance for the Myanmar (Burmese) and Dawei (Tavoyan) language pair. The Dawei (Tavoyan) language is closely related to the Myanmar (Burmese) language and is often considered a dialect of the Myanmar language. The state-of-the-art techniques of statistical machine translation (SMT) (Koehn et al., 2003) demonstrate good performance on the translation of languages with relatively similar word orders (Koehn, 2005). To date, there have been some studies on the SMT of the Myanmar language. Thu et al. (2016) presented the first large-scale study of the translation of the Myanmar language. A total of 40 language pairs were used in the study, which included languages both similar to and fundamentally different from Myanmar. The results show that the hierarchical phrase-based SMT (HPBSMT) (Chiang, 2007) approach gave the highest translation quality in terms of both the BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010) scores.

Win Pa Pa et al. (2016) presented the first comparative study of five major machine translation approaches applied to low-resource languages. Phrase-based statistical machine translation (PBSMT), HPBSMT, tree-to-string (T2S), string-to-tree (S2T) and operation sequence model (OSM) translation methods were applied to the translation of limited quantities of travel domain data between English and Thai, Lao, and Myanmar in both directions. The experimental results indicate that in terms of adequacy (as measured by BLEU score), the PBSMT approach produced the highest quality translations. Here, the annotated tree is used only for the English language in the S2T and T2S experiments. This is because there is no publicly available tree parser for the Lao, Myanmar and Thai languages. To our knowledge, there is no publicly available tree parser for either the Dawei or the Myanmar language, and thus we cannot apply the S2T and T2S approaches to the Myanmar-Dawei language pair. From their RIBES scores, we noticed that the OSM approach achieved the best machine translation performance for Myanmar to English translation. Moreover, we learned that the OSM approach gave the highest translation performance for translation between Khmer (the official language of Cambodia) and twenty other languages, in both directions (Thu et al., 2015).

Regarding Myanmar language dialects, Thazin Myint Oo et al. (2018) contributed the first PBSMT, HPBSMT and OSM machine translation evaluations between Myanmar and Rakhine. The experiment used an 18K Myanmar-Rakhine parallel corpus that was constructed to analyze the behavior of dialectal Myanmar-Rakhine machine translation. The results showed that high BLEU (57.88 for Myanmar-Rakhine and 60.86 for Rakhine-Myanmar) and RIBES (0.9085 for Myanmar-Rakhine and 0.9239 for Rakhine-Myanmar) scores can be achieved for the Rakhine-Myanmar language pair even with limited data. Based on the experimental results of these previous works, the machine translation experiments between Myanmar and Dawei in this paper were carried out using PBSMT, HPBSMT and OSM.

2 Related Work

Karima Meftouh et al. built the PADIC (Parallel Arabic Dialect Corpus) corpus from scratch and then conducted experiments on cross-dialect Arabic machine translation (Meftouh et al., 2015). PADIC is composed of dialects from both the Maghreb and the Middle East. Some interesting results were achieved even with the limited corpus of 6,400 parallel sentences. Using SMT for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with small training data by exploiting the relative proximity between the two varieties (Neubarth et al., 2016). Friedrich Neubarth et al. described a specific problem, and its solution, arising in translation between standard Austrian German and Viennese dialect. They used a hybrid approach of rule-based preprocessing and PBSMT to obtain better performance. Pierre-Edouard Honnet et al. proposed solutions for the machine translation of a family of dialects, Swiss German, for which parallel corpora are scarce (Honnet et al., 2018). They presented three strategies for normalizing Swiss German input in order to address regional and spelling diversity. The results show that character-based neural MT was the most promising strategy for text normalization and that, in combination with PBSMT, it achieved a BLEU score of 36%.

3 Dawei Language

The Tavoyan or Dawei dialect of Burmese is spoken in Dawei (Tavoy), in the coastal Tanintharyi Region of southern Myanmar (Burma). The large and quite distinct Dawei or Tavoyan variety is spoken in and around Dawei (formerly Tavoy) in Tanintharyi (formerly Tenasserim) by about 400,000 people; its stereotyped characteristic is the medial /l/, found in the earliest Bagan inscriptions but lost by merger there nearly 800 years ago; for further information see Pe Maung Tin (1933) and Okell (1995). Dawei is a city in south-eastern Myanmar and the capital of Tanintharyi Region, formerly known as Tenasserim, which is bounded by Mon State to the north, Thailand to the east and south, and the Andaman Sea to the west.

Tavoyan retains the /-l-/ medial that has since merged into the /-j-/ medial in standard Burmese, and it can form the following consonant clusters: /ɡl-/, /kl-/, /kʰl-/, /bl-/, /pl-/, /pʰl-/, /ml-/, /m̥l-/. Examples include “ေမလ” (/mlè/ → Standard Burmese /mjè/) for “ground” and “ေကလာင်း” (/kláʊɴ/ → Standard Burmese /tʃáʊɴ/) for “school”. Also, voicing occurs only with unaspirated consonants, whereas in standard Burmese voicing can occur with both aspirated and unaspirated consonants. In addition, there are many loan words from Malay and Thai that are not found in Standard Burmese. An example is the word for goat, which is hseit (“ဆိတ်”) in Standard Burmese but be (“ဘဲ”) in Tavoyan. In the Tavoyan dialect, terms of endearment, as well as family terms, are considerably different from Standard Burmese. For instance, the terms for “son” and “daughter” are “ဖစု” (/pʰa̰ òu/) and “မိစု” (/mḭ òu/) respectively. Moreover, the honorific “ေနာင်” (Naung) is used in lieu of “ေမာင်” (Maung) for young males.

Further evidence for the name “Dawei” is the “Dhommarazaka” pagoda inscription of the Bagan period, which was inscribed in AD 1196 during the reign of the Bagan King Narapatisithu (AD 1174-1201). In lines 6 to 19 of this inscription, where the demarcation of Bagan is described, “Taung-Kar-Htawei” (up to Htawei to the south) and “Taninthaye” (Tanintharyi) are included. Therefore, the name “Dawei” has appeared since the Bagan period, at the time of the first Myanmar Empire. The saying “Dawei was established at Myanmar year 1116” actually means that the present name Dawei came later, from the name of the settlers, and that the original name of the city is Tharyarwady, which according to the saying was established in Myanmar year 1116. The Dawei thus deserve to be regarded as one of the nationalities of our country. In fact, the Dawei region is a place where local people have lived since the very ancient Stone Age, after which Stone Age, Bronze Age and Iron Age cultures developed. Moreover, as there is sound evidence of the ancient city of Thargara, contemporary with the Phu period, the Dawei people can be assumed to be a nationality of high culture in Myanmar.

Dawei (Tavoyan) usage and vocabulary can be divided into three main groups: the first uses Myanmar vocabulary with Dawei speech, the second uses vocabulary that is the same as Myanmar vocabulary, and the third uses isolated Dawei words and vocabulary. The Myanmar words “ထို, ဟို” (“here, there”) correspond to “သယ်” (“here”) and “ေဟာက်” (“there”) in the Dawei language. For example, the Dawei word “သယ်မျ ိုး” is the same as “ဒီလို” in the Myanmar language, and “ေဟာက်မျ ိုး” means “ဟိုလို” in the Myanmar language. The question words “နည်း (သနည်း), လဲ (သလဲ)” are used in the Myanmar language; similarly, “ေလာ,ေလာ်” is used instead of “လား (သလား)” in the Dawei language. Moreover, “ဘာလဲ” (“what”) and “ဘာြဖစ်တာလဲ” (“what happened”) are the same as [...] and [...] in Dawei usage.

my: ေကာင်ေလး ေကျာင်း မှန်မှန် တက် တယ် ။
(“The boy goes to school regularly” in English)

4 Methodology

In this section, we describe the methodology used in the machine translation experiments for this paper.

4.1 Phrase-Based Statistical Machine Translation

A PBSMT translation model is based on phrasal units (Koehn et al., 2003). Here, a phrase is simply a contiguous sequence of words and, generally, not a linguistically motivated phrase. A phrase-based translation model typically gives better translation performance than word-based models. We can describe a simple phrase-based translation model consisting of phrase-pair probabilities extracted from a corpus and a basic reordering model, and an algorithm to extract the