Brahmic Schwa-Deletion with Neural Classifiers: Experiments with Bengali

The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages 29-31 August 2018, Gurugram, India Brahmic Schwa-Deletion with Neural Classifiers: Experiments with Bengali Cibu Johny, Martin Jansche Google AI, London, United Kingdom {cibu, mjansche}@google.com Abstract Table 1: The words car and bus as loanwords in various languages, shown in native orthography and transliteration The Brahmic family of writing systems is an alpha-syllabary, in which a consonant letter without an explicit vowel marker can Group Languages Orth. Trans. Orth. Trans. be ambiguous: it can either represent a consonant phoneme or Bengali কার kār@ বাস bās@ a CV syllable with an inherent vowel (“schwa”). The schwa- Hindi, Marathi, Nepali कार kār@ बस b@s@ deletion ambiguity must be resolved when converting from text 1 Gujarati કાર kār@ બસ b@s@ to an accurate phonemic representation, particularly for text-to- Sinhala කා kār බ b@s Malayalam കാർ kār ബസ് b@s speech synthesis. We situate the problem of Bengali schwa- Tamil கா kār ப p@s deletion in the larger context of grapheme-to-phoneme conver- 2 sion for Brahmic scripts and solve it using neural network classifiers with graphemic features that are independent of the script The phenomenon of schwa deletion arises in several lan- and the language. Classifier training is implemented using Ten- guages written in Brahmic scripts, including in Hindi [3, 4], sorFlow and related tools. We analyze the impact of both train- Punjabi [5] in Brahmic Gurmukhi (as opposed to Perso-Arabic ing data size and trained model size, as these represent real-life Shahmukhi) script, Marathi [6, 7], and others. Choudhury data collection and system deployment constraints. Our method et al. [8] approach the problem diachronically for several lan- achieves high accuracy for Bengali and is applicable to other guages, including Bengali. languages written with Brahmic scripts. Index Terms: Schwa-deletion, grapheme-to-phoneme, Bengali, Schwa deletion does not affect the entire Brahmic fam- Brahmic scripts, phonology, TensorFlow, neural networks ily. To show the absence of an inherent vowel, most Brah- mic scripts can make use of ligatures and of an explicit vowel suppression mark, generically called virama following Sanskrit/ 1. Introduction Unicode conventions. One subset of languages – including San- The English word table exists as a loanword in Bengali, where it skrit, Sinhala, and the major Dravidian languages – requires ex- is pronounced /ʈebil/1. In Bengali script – also known as Eastern plicit signaling of schwa deletion: a consonant without an inher- Nagari, a prominent member of the Brahmic family [1] – this is ent vowel must appear in ligated form or carry a visible virama; written as টিবল. It is divided into three parts, one of which is otherwise the inherent vowel is always realized. ট followed by িব and ল. The first, ট, represents the syllable The cross-linguistic situation is illustrated in Table 1. For /ʈe/ and consists of the consonant letter ট (indicating /ʈ/) and easy comparison a uniform alphabetic transliteration (based on the modifier ◌ appearing on its left and indicating /e/. This [9] and purely graphemic) is shown alongside the native abugida illustrates the workings of an alpha-syllabary [2] or abugida: in orthographies. Bengali and other languages in Group 1 do not this type of writing system, consonant letters like ট and ব can explicitly suppress the final schwa: phonetically both car and support vowel modifiers, which attach to them. A typical visual bus end in a consonant, but this fact is not indicated in the or- cluster represents a consonant-vowel (CV) syllable: ট is read thography and must be inferred. Hindi kār@ 2, and b@s@ ⟨ ⟩ ⟨ ⟩ /ʈe/ and িব is read /bi/. contrast with Sinhala kār and b@s . Sinhala and the Dra- This raises two related questions: How should a consonant vidian languages in Group⟨ ⟩ 2 consistently⟨ ⟩ use a visible virama sound be written when it is not followed by a vowel? And how (sometimes ligated in Malayalam) on the last letter of each should a consonant letter be read if no modifier is attached? word. Group 2 has a straightforward answer to our first ques- In an abugida an unmodified consonant is thought to carry tions: How should a consonant sound be written when it is not an inherent vowel and is read as a CV syllable. The precise pho- followed by a vowel? The consonant letter must be explicitly netic quality of that vowel varies across languages, and some- marked as lacking a vowel. times within the same language. Following established conven- Such simple conventions are absent from Group 1 lan- tions, we refer to the inherent vowel as schwa. We use that term guages, not only those listed in Table 1 but also Punjabi and in a purely graphemic sense, without any phonetic implications. Kashmiri in Brahmic scripts, Bhojpuri, Maithili, Assamese, Syl- The English word ball is pronounced /bɔl/ as a loanword in heti, and others. Consequently Group 1 has no straightforward Bengali. Its written representation বল consists of two unmodi- answer to our second question: How should a consonant letter fied consonant letters already familiar from above. The default be read if no modifier is attached? As the words बस b@s@ reading of the inherent vowel in Bengali is /ɔ/, and so plain is ⟨ ⟩ ব /bəs/ in Hindi and বল b@l@ /bɔl/ in Bengali illustrate, the read /bɔ/. We say that schwa has been phonetically realized. inherent vowel can either⟨ be realized⟩ or deleted. This paper de- In both বল /bɔl/ and the earlier example টিবল /ʈebil/ the scribes a general method for resolving this ambiguity. last letter ল has no modifiers attached and represents the final consonant sound. The inherent vowel of ল is not realized phonetically. We say that schwa has been deleted. 2Brahmic graphemes are transliterated to the Latin script inside . ⟨⟩ The inherent vowel is represented by @ unless it has been explicitly 1Phonetic transcriptions are written with IPA symbols inside slashes. suppressed, in which case no corresponding⟨ ⟩ vowel symbol is written. 264 10.21437/SLTU.2018-55 Table 2: Illustration of g2p stages and schwa outcomes Table 3: Comparison of schwa deletion in Hindi and Bengali Representation stage Example Language Orthography Transliteration Phonemes processing step Hindi राम rām@ ram Bengali রাম rām@ ram 1 Orthography ক ম ল ন গ র transliterate Hindi आठ सौ āṭ@ sau aʈsɔ ↓ Bengali আটশ āṭ@ś@ aʈʃo 2 Abstract graphemes, with schwa k@ m@ l@ n@ g@ r@ resolve schwa Hindi कम k@rm@ kərm ↓ Bengali k@rm@ kɔrmo 3 Abstract graphemes, no schwa k a m a l n a g a r কম convert to phonemes Hindi अपयात ap@ryāpt@ əpərjapt ↓ 4 Phonemes k ɔ m o l n ɔ g o r Bengali অপযা ap@ryāpt@ ɔpɔrdʒapto larger g2p and phonological regularities in Bengali.3 For exam- 2. Background ple the raising of /ɔ/ to /o/ generally affects all occurrences of /ɔ/, The task of relating a written representation of a word to a regardless of whether they arose from an inherent vowel or from the independent vowel letter অ a . It also parallels the raising phonemic representation is known as grapheme-to-phoneme ⟨ ⟩ conversion, or g2p for short. It is a common building block in of /ɛ/ to /e/ in Bengali. In the context of Bengali TTS [11] our several larger tasks, including transliteration [7], speech recog- preferred approach is to train sequence-to-sequence models on nition, and text-to-speech synthesis (TTS) [3, 4], where the need annotated data. For other languages, hand-written rewrite rules for highly accurate pronunciations is especially acute. go a long way. Details are beyond the scope of the current paper. 2.1. Grapheme-to-phoneme conversion for Brahmic 2.2. Schwa deletion in Bengali compared with Hindi Schwa deletion is a well-studied phenomenon of Hindi [3, 4]. The wide variety of Brahmic scripts used for writing South Bengali shares some similarities, but differs in important ways. Asian languages creates both complexities and opportunities. Table 3 compares cognates between Hindi and Bengali. The sheer diversity of scripts is an obstacle, but their similar Words like the name Ram that end in simple final consonants workings provide an opportunity for applying a single generic are written and pronounced essentially the same in Hindi and solution to many similar problems. Our larger aim – beyond the Bengali: the final schwa deletes. On the other hand, in the com- scope of this paper – is towards generic grapheme-to-phoneme pound word āṭ@ś@ (eight hundred) the middle schwa for Brahmic across similar languages, with minimal language- আটশ deletes and the final⟨ schwa is⟩ pronounced. specific customization. A generic approach must arguably be Hindi has a very strong prohibition against final schwa. able to deal with the schwa deletion phenomenon, as it affects a While this is a general preference in Bengali as well, a simi- large group of languages to various degrees. larly strong restriction does not exist there. In fact, in several We conceive of schwa resolution as one module in a mul- situations a final schwa must be pronounced in Bengali, when tistage process outlined in Table 2 which converts script char- it would be deleted in the corresponding Hindi word. One large acters (Stage 1) to phonemes (Stage 4). Table 2 illustrates and systematic class of exceptions are those that end in conso- the processing stages and steps using the Bengali place name nant clusters in Hindi and in the corresponding cluster followed (Kamalnagar), pronounced /kɔmolnɔgor/. কমলনগর by /o/ in Bengali. Table 3 lists several examples.

Brahmic Schwa-Deletion with Neural Classifiers: Experiments with Bengali

Talking Stone: Cherokee Syllabary Inscriptions in Dark Zone Caves

Cross-Language Framework for Word Recognition and Spotting of Indic Scripts

SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation

The Festvox Indic Frontend for Grapheme-To-Phoneme Conversion

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress

Writing Systems Reading and Spelling

Myanmar Script Learning Guide Burmese 1,2,3 Tone System Is Developed by Naing Tin-Nyunt-Pu Copyright © 2013-2017, Naing Tinnyuntpu & Asia Pearl Travels

Proposal to Encode 0D5F MALAYALAM LETTER ARCHAIC II

An Introduction to Old Persian Prods Oktor Skjærvø

Writing Systems: Their Properties and Implications for Reading

Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution

Proposal for Ethiopic Script Root Zone LGR