Improving the accuracy of pronunciation lexicon using Naive Bayes classifier with character n-gram as feature: for language classified pronunciation lexicon generation

Aswathy P V, Arun Gopi, Sajini T, Bhadran V K
Language Technology Section, CDAC-T, India
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper looks at improving the accuracy of the pronunciation lexicon for Malayalam by improving the quality of front end processing. A pronunciation lexicon is an inevitable component in speech research and in speech applications like TTS and ASR. The paper details the work done to improve the accuracy of the automatic pronunciation lexicon generator (APLG) with a Naive Bayes classifier using character n-grams as features. The n-grams are used to classify Malayalam native words (MLN) and Malayalam English words (MLE). Phonotactics, which is unique for Malayalam, is used as the feature for classification of MLE and MLN words, and native and non-native Malayalam words are used for generating the corresponding models. Testing is done on different text inputs collected from the news domain, where MLE frequency is high.

1 Introduction

The automatic pronunciation generator is one of the main modules in a speech application and determines the quality of the output. In speech applications, pronunciation is generated online from the given input using a dictionary or rules. For a language like Malayalam, which is agglutinative, a rule based approach is more suitable than look up. Since English is the official language, its influence is such that the usage of English words is very common in Indian language scripts and texts, so the APLG must be able to handle the pronunciation of these English words.

The inputs given to TTS are normally bilingual, with English words appearing alongside the native language. In most cases the pronunciation model for MLE is different and depends on explicit knowledge of the language. Hence, the language must be identified by the system in order to enable the correct model. Language identification is often based on written text alone, which creates an interesting problem. User intervention is always a possibility, but a completely automatic system would make this phase transparent and increase the usability of the system (William B. Cavnar and John M. Trenkle).

In this paper we describe language identification from text, which is typically a symbolic processing task. Language identification is done to classify MLN and MLE words and to apply LTS rules to generate Indian English pronunciation. We used a Naive Bayes classifier with character n-grams as features to identify whether a given word belongs to native or non-native Malayalam.

2 Structure of Malayalam

Malayalam is an offshoot of the Proto-Tamil-Malayalam branch of the Proto-Dravidian language and belongs to the southern group of Dravidian languages. There are approximately 37 million Malayalam speakers worldwide, with 33,066,392 speakers in India as of the 2001 census of India. Malayalam words are basically derived from Sanskrit and Tamil. The Malayalam script contains 51 letters, including 15 vowels and 36 consonants, which form 576 syllabic characters, and contains two additional characters named anuswara and visarga. As in other syllabic alphabets, all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after a consonant, are used to change the inherent vowel. When they appear at the beginning of a syllable, vowels are written as independent letters. When certain consonants occur together, special conjunct symbols are used which combine the essential parts of each letter (omniglot). Consonants with vowel omission are used to represent the 5 chillu, which occur in around 25% of words.

Malayalam has a canonical word order of SOV (subject-object-verb), as do other Dravidian languages. Phonetically, a syllable in Malayalam consists of an obligatory nucleus which is always characterized by a vowel or diphthongal articulation, preceded and followed optionally by a 'releasing' and an 'arresting' consonant articulation respectively (Prabodachandran, 1967, pp 39-40).

Among the vowels in Malayalam, two are front, two are back and one is a low central vowel. Front vowels are unrounded and back vowels are rounded (Prabodachandran, 1967, pp 39-40). All vowels, both short and long, occur initially, medially and finally, with a few exceptions; for instance, short |o| does not occur word finally and occurs medially only in the first syllable. Malayalam has also borrowed the Sanskrit diphthongs /äu/ (represented in Malayalam as ഔ, au) and /ai/ (represented in Malayalam as ഐ, ai), although these mostly occur in Sanskrit loanwords.

In a syllable there will be at least one consonant and a vowel. From the combination of a consonant and a vowel, the Malayalam syllable structure may be expressed by the simple formula (c) v (c). The syllable patterns occurring in the Malayalam corpus and their frequencies are shown in Figure 1; C*VC* patterns occur in English words.

Figure 1. Syllable patterns and their frequencies in the Malayalam corpus.

Harmonic sequencing of vowels is one of the main characteristics of the Dravidian family of languages (Prabodachandran); that is, one vowel can influence the sound of another.

3 Malayalam corpus collection

The main input for speech research is a huge corpus, both text and speech, properly annotated. For the major world languages, large corpora are publicly available, but for most other languages they are not. With the advent of the Web, however, corpus collection can be highly automated and thereby fast and inexpensive (Adam and G V Reddy).

The Malayalam corpus was created from online sources like newspapers, blogs and other sites, which is then used to extract data. Manual verification of the sites' content is done to ensure that it contains textual content of good enough quality, and for domain coverage additional content is manually prepared. 25 GB of raw corpus is collected for extracting the optimal text that covers the phonetic variations in the language. The collected corpus as such cannot be used for optimal text selection: the raw data contains junk characters, foreign words (frequently occurring English words in Malayalam script) and similar noise.

The optimal text selected for speech applications like ASR and TTS must represent the language well. Optimal text (OT) must cover all phonemic variations occurring in the language; it is the minimum text that covers the maximum possible units (variations) in the language.

Frequent occurrences of English words give a higher rank to sentences containing more English words, which reduces the amount of text with native content. If we have to select sentences with Malayalam words, these English words must not add to the unit count. Manually marking these words is practically not possible, and post verification and cleaning of English words from the OT is tedious and sometimes requires additional text selection. Automatic language identification can therefore be used in the text selection to improve the quality of the OT.
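Optimal text selection of this kind is essentially a coverage problem: pick the fewest sentences that contribute the most unseen phonetic units while keeping non-native words from inflating the count. The following sketch illustrates one possible greedy strategy; extract_units and is_native are hypothetical stand-ins for the paper's own unit extraction and language identification components, not part of the published system.

# Minimal sketch of greedy optimal-text (OT) selection, assuming sentences
# are scored by how many *new* phonetic units their native words contribute.
# extract_units() and is_native() are placeholders for illustration only.

def extract_units(word):
    """Placeholder: return the set of phonetic units (e.g. syllables) in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def is_native(word):
    """Placeholder: True if the word is native Malayalam (MLN)."""
    return True

def select_optimal_text(sentences, max_sentences=1000):
    covered = set()
    selected = []
    remaining = list(sentences)
    while remaining and len(selected) < max_sentences:
        # Score each sentence by the number of unseen units in its native
        # words only, so English words do not raise the sentence rank.
        def gain(sentence):
            units = set()
            for word in sentence.split():
                if is_native(word):
                    units |= extract_units(word)
            return len(units - covered)

        best = max(remaining, key=gain)
        if gain(best) == 0:          # nothing new left to cover
            break
        selected.append(best)
        for word in best.split():
            if is_native(word):
                covered |= extract_units(word)
        remaining.remove(best)
    return selected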
4 Malayalam pronunciation and letter to sound (LTS) rules

Malayalam is a phonetic language with a direct correspondence between symbols and sounds (Prabodachandran, 1967). Pronunciation lexicon generation is therefore easier for phonetic languages than for English, but a few exceptions exist in the case of Malayalam, and accurate pronunciation can be generated only by properly handling these exceptions. Pronunciation is one of the factors that determine the quality of TTS.

A detailed analysis of the speech and text corpus was carried out to identify the LTS rules for Malayalam. The identified rules were then verified and validated by language experts. The Malayalam LTS rules can be categorized as below.

• Phonetic - direct correspondence between text and sounds
  e.g.: തിരുവnപുരം → തി രു വ n പു രം
  tiruvananthapuram → ti ru va na ntha pu ram

• Pronunciation different from orthography
  e.g.: നനയുക → ന (ന) യു ക
  "nanayuka" → "n a n@ a y u k a" (n represents the dental and n@ the alveolar sound)
  This category covers, among others, the dental and alveolar sounds for NA and its gemination, the influence of y in the gemination of kk, and the retroflex plosive and its variants.

• Lexicons having multiple pronunciations - homonyms/homophones
  e.g.: "ennaal" → e nn aa l, where "nn" can be an alveolar gemination or a dental gemination. The occurrence of such cases is very low in Malayalam.

• Pronunciation by usage (speaker specific pronunciation)
  e.g.: "utsavam" → "u t s a v a m" (actual pronunciation) vs. "u l s a v a m" (by usage);
  ദയ "daya" → "d a y a" (actual pronunciation) vs. "d e y a" (by usage)

• Foreign word pronunciation (frequent usage of English words)
  For generating proper pronunciation, foreign words must be identified and separate rules or a pronunciation lexicon must be applied. The influence and dominance of English has reached such an extent that the content from some sources contains more than 25% foreign words. For handling English pronunciation, 3 additional phones are required:
  - the "a" sound as in "bank" (ബാ്)
  - the "ph" sound as in "fan" (ഫാന്)
  - the "s" sound as in "zoo" (സൂ)

In a TTS application a standard pronunciation is used at synthesis time; users can add a user specific pronunciation dictionary to generate pronunciation in their own preference. In ASR, multiple pronunciation lexicons are stored and pronunciation variations can be handled. For generating accurate pronunciation, language identification is done and specific rules are applied to handle the phone variations.

5 Pronunciation lexicon generation for Malayalam

The pronunciation lexicon for Malayalam is generated using the LTS rules and an exception list for handling the cases listed above. Exception patterns for dental and alveolar NA gemination and for the allophonic variation of KK are extracted from the corpus and added to the exception list, and pronunciation is generated based on these exception patterns. Other exceptions, such as the pronunciation of nouns, are also included in this exception list.
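To make the lexicon generation flow concrete, the sketch below shows one way an exception list could be consulted before falling back to plain LTS rules. The pattern table and rule set here (EXCEPTION_PATTERNS, LTS_RULES) are illustrative placeholders, not the actual rules used by the APLG system.

# Minimal sketch of pronunciation generation with an exception list in
# front of letter-to-sound (LTS) rules. The patterns and rules below are
# hypothetical toy examples, not the rule set described in the paper.

# Exception patterns: substrings whose pronunciation differs from the
# default LTS output (e.g. dental vs. alveolar NA gemination).
EXCEPTION_PATTERNS = {
    "nn": "n@ n@",      # hypothetical: alveolar gemination
}

# Default grapheme-to-phone rules, longest match first.
LTS_RULES = [
    ("aa", "aa"),
    ("a", "a"),
    ("n", "n"),
    ("l", "l"),
    ("e", "e"),
]

def apply_lts(word):
    """Greedy longest-match LTS application, with exception lookup first."""
    phones, i = [], 0
    while i < len(word):
        matched = False
        # The exception list takes priority over the default rules.
        for pattern, phone_seq in EXCEPTION_PATTERNS.items():
            if word.startswith(pattern, i):
                phones.extend(phone_seq.split())
                i += len(pattern)
                matched = True
                break
        if matched:
            continue
        for grapheme, phone in LTS_RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                matched = True
                break
        if not matched:
            phones.append(word[i])   # pass unknown symbols through
            i += 1
    return " ".join(phones)

print(apply_lts("ennaal"))   # "e n@ n@ aa l" under these toy rules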

5.1 Handling of English word pronunciation (English script)

For handling English words in Latin script, the encoding is checked to identify the language. Pronunciation is generated using a standard dictionary look up; the CMU dictionary with 100K words is taken as the reference for English pronunciation. A pattern based replacement is then done to modify the pronunciation to Indian English (that of a native Malayalam speaker).
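A minimal sketch of this dictionary-based path is given below: look the word up in a CMU-style dictionary and then rewrite selected phones toward an Indian English realization. The dictionary entries and the replacement table are hypothetical illustrations, not the actual data or patterns used in the paper.

# Minimal sketch of English-word pronunciation via dictionary look-up
# followed by pattern-based replacement toward Indian English.
# CMU_DICT holds toy entries in ARPAbet-like notation;
# INDIAN_ENGLISH_MAP is a hypothetical phone-replacement table.

CMU_DICT = {
    "bank": ["B", "AE", "NG", "K"],
    "fan":  ["F", "AE", "N"],
    "zoo":  ["Z", "UW"],
}

# Hypothetical replacements approximating a native Malayalam speaker's
# realization of some English phones.
INDIAN_ENGLISH_MAP = {
    "AE": "a",     # map ARPAbet AE to the lexicon's "a" phone
    "UW": "uu",    # hypothetical long-u mapping
    "NG": "ng",
}

def english_pronunciation(word):
    """Dictionary look-up plus Indian English phone substitution."""
    phones = CMU_DICT.get(word.lower())
    if phones is None:
        return None                      # fall back to other handling
    return [INDIAN_ENGLISH_MAP.get(p, p) for p in phones]

for w in ("bank", "fan", "zoo"):
    print(w, english_pronunciation(w))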

5.2 Handling of English word pronunciation (native script)

The input text contains frequent occurrences of English words written in Malayalam script. Encoding based language identification is not applicable here, since native and non-native words are in the same script.

All languages have a finite set of phonemic units from which words are built, and there are constraints on the way in which these phonemes can be arranged to form syllables. These constraints are called phonotactics, or sequence constraints, and they are language dependent.

In spoken language recognition it is observed that humans are the best identifiers of a language. During automatic recognition, however, several factors have to be considered; with respect to language identification, one language differs from another in one or more of the following (M. A. Zissman; Navratil; Mutusamy; Schultz; Mak):

• Phonology: the phone sets are different for each of the languages
• Morphology: the word roots and the lexicons may be different for different categories of languages
• Syntax: the sentence patterns with respect to grammar are different
• Prosody: duration, pitch, and stress patterns vary from language to language

In this paper an attempt has been made to identify the language using phonology, since we require word level identification. We use language specific phonotactic information to identify the words that belong to the native language.

5.3 Pattern based approach

The pattern based approach is a string matching method to classify words. A detailed analysis of the corpus was done to extract 1200 patterns that do not occur in Malayalam. A pattern search is done on the input text, and words with matching patterns are classified as English. Patterns that are common to both languages, or that occur in nouns, are not included in the list. Only 70-75% of words are covered by this approach.
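As an illustration of this string matching step, the sketch below flags a word as English if it contains any pattern from a non-Malayalam pattern list; the patterns shown are placeholders, not the 1200 patterns extracted from the corpus.

# Minimal sketch of the pattern based approach: a word is classified as
# English (MLE) if it contains any substring pattern known not to occur
# in native Malayalam. The patterns below are hypothetical placeholders.

NON_MALAYALAM_PATTERNS = {
    "ksT",      # hypothetical romanized cluster not found in native words
    "byoo",
    "jn@s",
}

def classify_by_pattern(word):
    """Return 'MLE' if any non-native pattern matches, else 'MLN'."""
    for pattern in NON_MALAYALAM_PATTERNS:
        if pattern in word:
            return "MLE"
    return "MLN"

print(classify_by_pattern("steeshan"))   # stays 'MLN' unless a pattern matches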

5.4 Naive Bayes classifier using character n-gram approach

A Naive Bayes classifier begins by calculating the prior probability of each label, which is determined by the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability to arrive at a likelihood estimate for each label, and the label whose likelihood estimate is highest is assigned to the input value.

A Naive Bayes classifier with character n-grams as features is used for categorizing MLN and MLE. Two categories of text are collected, one containing only native words (MLN) and the other non-native words (MLE). From these texts, n-gram frequency profiles for MLN and MLE are generated; the profiles hold n-grams up to order 3 together with their individual frequencies. Generation of the n-gram profiles is shown in Figure 2.

Figure 2. Generation of the n-gram frequency profile for each word in the source (Malayalam native text → n-grams → MLN frequency profile; Malayalam non-native text → n-grams → MLE frequency profile).

N-grams for the Malayalam word മലയാളം:
1-gram: മ ല യ ◌ാ ള ◌ം (_ma la y aa lxa m_)
2-gram: _മ മല ലയ യാ ◌ാള ളം ◌ം_ (_ma mala lay yaa aalxa lxam m_)
3-gram: _മല മലയ ലയാ യാള ◌ാളം ളം_ (_ma malay layaa yaalx aalxam lxam_)

As n increases, the expressiveness of the model increases, but the ability to estimate accurate parameters from sparse data decreases; this matters when categorizing non-native Malayalam words, so n-grams up to order 3 serve the purpose. N-grams decompose every string into substrings, hence any errors that are present are limited to a few n-grams (William B. Cavnar and John M. Trenkle, N-Gram-Based Text Categorization).
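A minimal sketch of this profile generation is shown below: each word is padded with boundary markers and decomposed into character n-grams of orders 1 to 3, whose frequencies are accumulated per category. The function and variable names are illustrative, not taken from the paper's implementation.

# Minimal sketch of character n-gram frequency profile generation
# (orders 1-3) for the MLN and MLE categories. The boundary marker "_"
# follows the example in the text; the names here are illustrative only.

from collections import Counter

def word_ngrams(word, max_order=3):
    """Yield character n-grams of order 1..max_order with '_' word boundaries."""
    padded = f"_{word}_"
    for n in range(1, max_order + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def build_profile(words, max_order=3):
    """Accumulate n-gram frequencies over a word list into a Counter profile."""
    profile = Counter()
    for word in words:
        profile.update(word_ngrams(word, max_order))
    return profile

# Toy usage: one profile per category (romanized words used for readability).
mln_profile = build_profile(["malayaalam", "nanayuka"])
mle_profile = build_profile(["steeshan", "kampyuuttar"])
print(mln_profile.most_common(5))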

5.4.1 Formulation of the Naive Bayes classifier

Given a word and the classes (C_MLN, C_MLE) to which it may belong, the Naive Bayes classifier evaluates the posterior probability that the word belongs to each class and assigns the word to the class with the highest probability value (Jing Bai and Jian-Yun Nie, Using Language Model for Text Classification).

c_MAP, the maximum a posteriori class, is

    c_MAP = argmax_{c in C} P(c | w)
          = argmax_{c in C} P(w | c) P(c)
          = argmax_{c in C} P(x1, x2, ..., xn | c) P(c)

where x1, ..., xn are the features of the word w. By the Naive Bayes conditional independence assumption,

    P(x1, x2, ..., xn | c) = P(x1 | c) P(x2 | c) ... P(xn | c)

so the classifier returns

    c_NB = argmax_{c in C} P(c) ∏_i P(xi | c)

Figure 3. Categorization using Naive Bayes: the word is scored against the MLN and MLE profiles and the MAP decision assigns it to C_MLN or C_MLE.

5.4.2 Applying Naive Bayes with n-gram frequency as feature

1. Generate character based n-gram profiles for each category as discussed above.
2. From the training corpus, extract the vocabulary V of the model (the set of n-grams).
3. Calculate the P(cj) terms: for each cj in C, let docs_j be the count of all words that belong to class cj and set P(cj) = |docs_j| / |total training words|.
4. Calculate the P(xi | cj) terms: for each n-gram xi in V, set P(xi | cj) = (count(xi, cj) + 1) / (Σ_x count(x, cj) + |V|), i.e. add-1 smoothing with α = 1.
5. Return c_NB = argmax_{cj in C} P(cj) ∏_{i in positions} P(xi | cj).

6 Experiment and results

200 sentences with 2000 unique words were taken as the input for verification of the pattern based and Naive Bayes based classification. First we performed word categorization based on the pattern list (for English words): 391 words were identified and 97 were misclassified. We then used the Naive Bayes classifier to classify the words. The results are given in Table 1.

Language identification   #Identified English words   #Misclassified
Pattern based             391                         97
Naive Bayes               782                         47

Table 1. English words identified and misclassified by each method.

We also tested with random input collected from online sources. Naive Bayes with n-grams showed better word identification than the pattern based approach; an average of 90% word coverage was given by the n-gram based language identification.

7 Conclusion

In this paper we describe the effort to improve the accuracy of the pronunciation lexicon for Malayalam. The method can be used to improve the quality of the corpus and of speech applications like TTS and ASR. In text corpus selection, the APLG using the Naive Bayes classifier is used to identify foreign words in native script, which reduces the manual effort required for verification and cleaning of the selected text corpus. For TTS, the pronunciation of words can be made accurate using the Naive Bayes classifier; depending on the domain, this will increase the quality of the synthetic speech.

In future we plan to improve the accuracy of the Naive Bayes classifier by including morphology rules to identify root words. Using different smoothing techniques can also improve performance (Sami Virpioja, Tommi Vatanen, Jaakko J. Vayrynen, Language Identification of Short Text Segments with N-gram Models). The Naive Bayes classifier has a wide range of applications, including text categorization and the development of multi-lingual speech recognizers.

Figure 4. Implementation of the APLG (flowchart with components: input text, tokenize, pronunciation user choice, CMU Dict, n-gram models, LTS rules, phonetic variations, pronunciation lexicon).

References

Adam and G. V. Reddy. Lexical Computing Ltd., IIIT Hyderabad and Masaryk University; United Kingdom, India and Czech Republic.

M. A. Zissman. 1996. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. IEEE Transactions on Speech and Audio Processing.

Milla Nizar. 2010. Malayalam: Dative Subject Constructions in South-Dravidian Languages.

J. Navratil. September 2001. Spoken Language Recognition - A Step toward Multilinguality in Speech Processing. IEEE Transactions on Speech and Audio Processing.

T. Sajini, K. G. Sulochana and R. Ravindra Kumar. Optimized Multi Unit Speech Database for High Quality FESTIVAL TTS.

Alan W. Black, Carolyn P. Rosé, Kishore Prahallad, Rohit Kumar, Rashmi Gangadharaiah and Sharath Rao. Building a Better Indian English Using More Data.

Prabodachandran. 1967. Malayalam Structure (Abercrombie 1, pp 39-40).

Sanghamitra Mohanty. April 2011. Phonotactic Model for Spoken Language Identification in Indian Language Perspective. International Journal of Computer Applications (0975-8887), Volume 19, No. 9.

T. Schultz et al. 1999. Language Independent and Language Adaptive Large Vocabulary Speech Recognition. Proc. EuroSpeech, Hungary.

T. Schultz and K. Kirchhoff. 2006. Multilingual Speech Processing. Academic Press.

B. Mak et al. 2002. Multilingual Speech Recognition with Language Identification. Proc. ICSLP.

William B. Cavnar and John M. Trenkle. N-Gram-Based Text Categorization.

Jing Bai and Jian-Yun Nie. Using Language Model for Text Classification.

Jaakko J. Vayrynen, Sami Virpioja and Tommi Vatanen. Language Identification of Short Text Segments with N-gram Models.