Statistical Romanization for Abugida Scripts: Data and Experiment on Khmer and Burmese

言語処理学会第23回年次大会発表論文集 (2017年3月) Statistical Romanization for Abugida Scripts: Data and Experiment on Khmer and Burmese Chenchen Ding1 Vichet Chea2 Win Pa Pa3 Masao Utiyama1 Eiichiro Sumita1 1ASTREC, NICT, Japan 2NIPTICT, Cambodia 3UCSY, Myanmar 1fchenchen.ding, mutiyama, [email protected] [email protected] [email protected] 1 Introduction level rather than on word (or phrase) level with no (or few) reordering operations. Hence, general SMT Linguistically, romanization is the task of transform- techniques can be facilitated once training data are ing a non-Latin writing system into Latin script. It prepared. The phrase-based SMT plays a role of is a language-specific task in natural language pro- baseline in recent workshops [1, 2], whereas neural cessing (NLP). Despite standard romanization sys- network techniques provide further gains in perfor- tems, conventional transcription methods are ap- mance [4, 5]. Although neural network-based, pure plied prevalently in many languages. The spellings string-to-string approaches prove powerful on dif- are thus inconsistent and complicated in some cases. ferent transliteration tasks, there is still room for Consequently, statistical approaches are required for improvement, especially on tasks between different an efficient solution on this task. writing systems. For example, the Thai-to-English We focus on the name romanization for languages task, which is similar to our task, has relatively poor using abugida scripts. Different from an alpha- performance in NEWS 2015. The problem around betic system, in an abugida system, consonant letters diacritics in abugida is also stated in Kunchukuttan stand for a syllable with an implicit inherent vowel, and Bhattacharyya [6]. General techniques may offer and diacritics modify consonant letters to change or acceptable solutions overall, but specialized investi- depress the inherent vowel. We designed annotation gation and processing are required for further im- schemes to establish a precise, consistent, and mono- provement on tasks for specific languages. In the re- tonic correspondence between the two different writ- mainder of the paper, the Khmer and Burmese data ing systems on grapheme level, through which vari- with manual annotation are described in Secs. 3 and ous machine learning approaches are facilitated. 4, respectively. Experiments are reported in Sec. 5, Specifically, we collect and manually align 7; 658 and Sec. 6 contains the conclusion and future work. and 2; 335 person name romanization instances for Khmer and Burmese (Myanmar), respectively. Both languages are with limited resources and NLP stud- 3 Khmer Data ies. Experimental results demonstrate that standard approaches of conditional random fields (CRF) and We collect and annotate 7; 658 real Khmer name ro- support vector machine (SVM) supervised by the manization instances. An example of a Khmer name manually annotated data achieve high precision in aligned with its romanization is illustrated in Fig. 1. romanization. The annotated data have been re- A special feature of Khmer script is that there are two series of consonant letters, which have different leased under a CC-BY-NC-SA license.1 inherent vowels. Furthermore, diacritics represent different vowels when added to corresponding conso- 2 Related Work nant series, which leads to a very complicated vowel system. Another feature of Khmer script is that the In engineering practice, the romanization task is a vertically stacked consonant letters are very com- string-to-string transformation, which can be cast as mon. One reason is the abundant consonant clus- a simplified translation task working on grapheme ters in phonology; another reason is the etymologi- cal spelling of Sanskrit (Pali) derived words. Those 1The Khmer data are available at http://niptict. stacked letters are not strictly based on the syllable edu.kh/khmer-name-romanization-with-alignment-on- structure. Additionally (and more problematically), grapheme-level/. The Burmese data are available at http: the virama, i.e., the diacritic used to suppress the in- //www.nlpresearch-ucsy.edu.mm/NLP_UCSY/name-db.html. Please contact the first author to obtain the data if the herent vowel, is absent in Khmer script, which makes servers are unstable. the identification of onset and coda difficult. ― 234 ― Copyright(C) 2017 The Association for Natural Language Processing. All Rights Reserved. a character-by-character alignment can be estab- 3 1 3 10 6 11 lished. According to our overall principles, Khmer consonant letters are aligned to Latin consonant KOY CHANTRA កូយ ចន្ទ្រា graphemes; diacritics, including inserted inherent 2 5 8 vowels, are aligned to Latin vowel graphemes; and any silent part is aligned to a placeholder. ក ូូ យ . ច . ន ូ ទ ូ រ ូ 4 Burmese Data 1 2 3 4 5 6 7 8 9 10 11 We collect and annotate 2; 335 real 1 6 K O Y . CH A N . T . R A 1 6 Burmese name romanization in- 44 stances, including students and fac- လင� ulty names in a university, public Figure 1: A Khmer name composed of two words. figure names and names from mi- 3 22 55 The upper part is the raw string pair and the lower nority nationalities in Myanmar. part is the grapheme-aligned pair. In the raw Khmer Syllable are clear and integrated units in Burmese string, the bold lines show the boundary of the un- script, which can be identified by rules, i.e., all dia- breakable unit in writing and the thin lines distin- critics and consonant letters with a virama must be guish the components. The order of Khmer letters attached to the letter they modify [3]. As an exam- is marked by numbers. 4 is the space; 7 and 9 ple, a Burmese syllable composed of six characters are stacking operators. The \." stands for inserted is illustrated in the beginning of the section, where inherent vowels on the Khmer side and for a silent the numbers indicate the order of the characters in placeholder on the Latin side. the composition. 1 and 4 are consonant letters, and 2 , 3 , 5 , and 6 are diacritics. Specifically, As phonemic syllables and writing units do not 2 and 3 form consonant clusters with 1 , 5 is a match well in Khmer script. Consistent alignment tone mark, and 6 is the virama to depress the in- principles are thus difficult to establish if the sylla- herent vowel of 4 to form the coda of the syllable. bles or writing units are taken as atoms in processing. In order to provide an efficient interface for the Furthermore, there are numerous types of syllables Romanization task, we step into the syllable struc- and writing units, which are too complex for statis- ture and further extract sub-syllabic segments, , i.e., tical model learning because of sparseness. Based onset, rhyme, and tone, to achieve a precise and con- on these facts, we prefer a pure character-level align- sistent correspondence between Burmese and Latin ment, where standalone consonant letters, diacritics, scripts. Fig. 2 is a diagram of the sub-syllabic seg- and the invisible staking operator are separated and mentation scheme. It is relatively intuitive to divide treated equally as grapheme on the Khmer side. a Burmese syllable into two parts, onset and rhyme, A consequent problem is the inherent vowels, once despite the difficulties introduced by medial conso- we completely ignore the syllable structure. We can- nants. The rhyme is not further dividable except in not judge whether a standalone consonant stands to stripe tones annotated by explicit marks. for only one consonant or contains a further inher- Specifically, four diacritics are used to represent ent vowel, due to ambiguity in the writing system. the medial consonants, i.e., yapin, yayit, wahswe, We thus apply a scheme to insert a mark to repre- and hahto to represent /-j-/, /-j-/,4/-w-/, and /-h-/,5 sent the inherent vowel for all the \bare" consonant respectively. As shown in Fig. 2, yapin, yayit, and letters (i.e., consonant letters without any diacrit- hahto are placed into the onset part, but wahswe ics or stacking operators) to establish a consistent into the rhyme part. This scheme is adopted for alignment. This insertion is thus decisive based on two reasons. (1) The phonotactical constraints are the surface spelling, and the ambiguity on the pres- strict on yapin, yayit, and hahto, but loose on wah- ence or absence of the inherent vowel is converted to swe. Hahto is only used on sonorants (nasals and whether the inserted mark is silent. approximants) and yapin / yayit only on velars and In the example of Fig. 1. the consonant letters bilabials.6 In contrast, wahswe can be freely com- are 1 , 3 , 5 , 6 , 8 , and 10 , where 3 and 5 bined with all consonant phonemes with the triv- are bare consonant letters,2 after which the inher- 3 ent vowel is inserted (noted by a dot here and in- As the task is to transform Khmer script to Latin script, the graphemes are not guaranteed to be single letters on the dicated by a wedge without original index). Then romanization side, e.g., the 5 corresponds to CH here. 4Yayit originally represents /-r-/ while in the modern stan- 2 2 and 11 are diacritics for 1 and 10 , respectively, dard Burmese the phoneme /r/ has been merged into /j/. and 6 and 8 are followed by staking operators. So these 5actually a voiceless sign, e.g., changing /n-/ to /n-/. four consonant letters are not bare. 6Yapin can also be combined with /l-/. ˚ ― 235 ― Copyright(C) 2017 The Association for Natural Language Processing. All Rights Reserved. syllable wahswe-hahto aukmyit-virama swapping swapping onset rhyme လင� = လ + ွ + ွ + င + ွ + ွ� initial medial nucleus coda* tone hlwint la w h nga t -j-, -r-*, -h-, -w- လ ွ ွ င ွ� ွ Figure 2: Burmese sub-syllabic segmentation.

Statistical Romanization for Abugida Scripts: Data and Experiment on Khmer and Burmese

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support