Simplified Abugidas
Total Page:16
File Type:pdf, Size:1020Kb
Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro Sumita Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan fchenchen.ding, mutiyama, eiichiro.sumitag@nict.go.jp phonogram segmental abjad Abstract writing alphabet system An abugida is a writing system where the logogram syllabic abugida consonant letters represent syllables with Figure 1: Hierarchy of writing systems. a default vowel and other vowels are de- noted by diacritics. We investigate the fea- ណូ ន ណណន នួន ននន …ជិតណណន… sibility of recovering the original text writ- /noon/ /naen/ /nuən/ /nein/ vowel machine ten in an abugida after omitting subordi- diacritic omission learning ណន នន methods nate diacritics and merging consonant let- consonant character ters with similar phonetic values. This is merging N N … J T N N … crucial for developing more efficient in- (a) ABUGIDA SIMPLIFICATION (b) RECOVERY put methods by reducing the complexity in abugidas. Four abugidas in the south- Figure 2: Overview of the approach in this study. ern Brahmic family, i.e., Thai, Burmese, ters equally. In contrast, abjads (e.g., the Arabic Khmer, and Lao, were studied using a and Hebrew scripts) do not write most vowels ex- newswire 20; 000-sentence dataset. We plicitly. The third type, abugidas, also called al- compared the recovery performance of a phasyllabary, includes features from both segmen- support vector machine and an LSTM- tal and syllabic systems. In abugidas, consonant based recurrent neural network, finding letters represent syllables with a default vowel, that the abugida graphemes could be re- and other vowels are denoted by diacritics. Abugi- covered with 94% – 97% accuracy at the das thus denote vowels less explicitly than alpha- top-1 level and 98% – 99% at the top-4 bets but more explicitly than abjads, while being level, even after omitting most diacritics less phonetic than alphabets, but more phonetic (10 – 30 types) and merging the remain- than syllabaries. Since abugidas combine segmen- ing 30 – 50 characters into 21 graphemes. tal and syllabic systems, they typically have more symbols than conventional alphabets. 1 Introduction In this study, we investigate how to simplify and Writing systems are used to record utterances in recover abugidas, with the aim of developing a a wide range of languages and can be organized more efficient method of encoding abugidas for in- into the hierarchy shown in Fig.1. The sym- put. Alphabets generally do not have a large set of bols in a writing system generally represent either symbols, making them easy to map to a traditional speech sounds (phonograms) or semantic units keyboard, and logogram and syllabic systems need (logograms). Phonograms can be either segmen- specially designed input methods because of their tal or syllabic, with segmental systems being more large variety of symbols. Traditional input meth- phonetic because they use separate symbols (i.e., ods for abugidas are similar to those for alphabets, letters) to represent consonants and vowels. Seg- mapping two or three different symbols onto each mental systems can be further subdivided depend- key and requiring users to type each character and ing on their representation of vowels. Alphabets diacritic exactly. In contrast, we are able to sub- (e.g., the Latin, Cyrillic, and Greek scripts) are the stantially simplify inputting abugidas by encoding most common and treat vowel and consonant let- them in a lossy (or “fuzzy”) way. 491 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 491–495 Melbourne, Australia, July 15 - 20, 2018. c 2018 Association for Computational Linguistics GUTTURAL PALATE DENTAL LABIAL PRE-V. PLOSIVE PLOSIVE PLOSIVE PLOSIVE S-LIKE DE-V. R-LIKE H-LIKE APP. APP. APP. NAS. NAS. NAS. NAS. ZERO-C. LONG-A MERGED I II I II I II I II MN K G U C J I Y T D N L P B M W R S H Q A E TH กขฃ คฅฆ ง จฉ ชซฌ ญ ย ฎฏฐดตถ ฑฒทธ ณน ลฦฬ บปผฝ พฟภ ม ว รฤ ศษส หฬ อ ๅ เ แ โ ใ ไ MY ကခ ဂဃ င စဆ ဇဈ ဉ ည ယ ိ ဋဌတထ ဍဎဒဓ ဏန လဠ ပဖ ဗဘ မ ဝ ိ ရ ြိ သဿ ဟ ိ အ ိ ိ ိ ိ KM កខ គឃ ង ចឆ ជឈ ញ យ ដឋតថ ឌឍទធ ណន លឡ បផ ពភ ម វ រ ឝឞស ហ អ ិ ្ LO ກຂ ຄ ງ ຈ ຊ ຍ ຢ ດຕຖ ທ ນ ລ ບປຜຝ ພຟ ມ ວ ຣ ສ ຫຮ ອ ເ ແ ໂ ໃ ໄ ะ ั ั ั ั ั ั ั ั ិ ិ ិ ិ ិ ិ ិ ើិ ើិ ើិ ើិ ែិ ៃិ ະ ិ ិ ិ ិ ិ ិ ិ ិ OMITTED TH MY ိ ိ ိ ိ ေိ ိ ိ ိ ိ KM LO ั ั ั ั ั ั ั ั ั ើិ ើិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ិ ຽ ិ ិ ិ ិ ិ ិ Figure 3: Merging and omission for Thai (TH), Burmese (MY), Khmer (KM), and Lao (LO) scripts. The MN row lists the mnemonics assigned to graphemes in our experiment. In this study, the mnemonics can be assigned arbitrarily, and we selected Latin letters related to the real pronunciation wherever possible. Fig.2 gives an overview of this study, show- random fields (CRFs), and machine translation ing examples in Khmer. We simplify abugidas by techniques (Wang et al., 2006; Jiang et al., 2007; omitting vowel diacritics and merging consonant Li et al., 2009; Yang et al., 2012). Similar meth- letters with identical or similar phonetic values, ods have also been developed for character con- as shown in (a). This simplification is intuitive, version in Japanese (Tokunaga et al., 2011). This both orthographically and phonetically. To resolve study takes a similar approach to the research on the ambiguities introduced by the simplification, Chinese and Japanese, transforming a less infor- we use data-driven methods to recover the origi- mative encoding into strings in a natural and re- nal texts, as shown in (b). We conducted experi- dundant writing system. Furthermore, our study ments on four southern Brahmic scripts, i.e., Thai, can be considered as a specific lossy compression Burmese, Khmer, and Lao scripts, with a unified scheme on abugida textual data. Unlike images or framework, using data from the Asian Language audio, the lossy text compression has received lit- Treebank (ALT) (Riza et al., 2016). The exper- tle attention as it may cause difficulties with read- iments show that the abugidas can be recovered ing (Witten et al., 1994). However, we handle this satisfactorily by a recurrent neural network (RNN) issue within an input method framework, where using long short-term memory (LSTM) units, even the simplified encoding is not read directly. when nearly all of the diacritics (10 – 30 types) have been omitted and the remaining 30 – 50 char- 3 Simplified Abugidas acters have been merged into 21 graphemes. Thai We designed simplification schemes for several gave the best performance, with 97% top-1 ac- different scripts within a unified framework based curacy for graphemes and over 99% top-4 accu- on phonetics and conventional usages, without racy. Lao, which gave the worst performance, considering many language specific features. Our still achieved the top-1 and top-4 accuracies of primary aim was to investigate the feasibility of re- around 94% and 98%, respectively. The Burmese ducing the complexity of abugidas and to establish and Khmer results, which lay in-between the other methods of recovering the texts. We will consider two, were also investigated by manual evaluation. language-specific optimization in a future work, via both data- and user-driven studies. 2 Related Work The simplification scheme is shown in Fig.3. 1 Generally, the merges are based on the common Some optimized keyboard layout have been pro- distribution of consonant phonemes in most natu- posed for specific abugidas (Ouk et al., 2008). ral languages, as well as the etymology of the char- Most studies on input methods have focused on acters in each abugida. Specifically, three or four Chinese and Japanese characters, where thousands of symbols need to be encoded and recovered. For 1Each script also includes native punctuation marks, digit notes, and standalone vowel characters that are not repre- Chinese characters, Chen and Lee(2000) made sented by diacritics. These characters were kept in the experi- an early attempt to apply statistical methods to mental texts but not evaluated in the final results, as the usage sentence-level processing, using a hidden Markov of these characters is trivial. In addition, white spaces, Latin letters, Arabic digits, and non-native punctuation marks were model. Others have examined max-entropy mod- normalized into placeholders in the experiments, and were els, support vector machines (SVMs), conditional also excluded from evaluation. 492 graphemes are preserved for the different articu- output ជិ ត ណែ ន lation locations (i.e., guttural, palate, dental, and softmax labial), that two for plosives, one for nasal (NAS.), linear and one for approximant (APP.) if present. Addi- 512-dim. tional consonants such as trills (R-LIKE), frica- (256×2) LSTM-RNN tives (S-/H-LIKE), and empty (ZERO-C.) are also assigned their own graphemes. Although the 128-dim. simplification omits most diacritics, three types input J T N N are retained, i.e., one basic mark common to Figure 4: Structure of the RNN used in this study. nearly all Brahmic abugidas (LONG-A), the pre- posed vowels in Thai and Lao (PRE-V.), and resulting in a final representation consisting of the vowel-depressors (and/or consonant-stackers) a 512-dimensional vector that concatenates two in Burmese and Khmer (DE-V.).