Unicode Sinhala and Phonetic English Bi-Directional Conversion for Sinhala Speech Recognizer

IEEE International Conference on Industrial and Information Systems 2015 \ ,\S» Unicode Sinhala and Phonetic English Bi-directional Conversion for Sinhala Speech Recognizer

M. Punchimudiyanse R. G. N. Meegama Department of Mathematics and Computer Science Department of Computer Science The Open University of Sri Lanka University of Sri Jayawardenapura Nawala, Sri Lanka Gangodawila, Sri Lanka [email protected] [email protected]

Abstract—An automated speech recognizer (ASR) having a build the phonetic text transcription from the original Unicode large vocabulary is yet to be developed for the Sinhala language text, a proper phonetic tag per character is necessary. because of the time consuming nature of gathering the training data to build a language model. The dictionary and building the Sinhala Unicode text to phonetic English tagging language model require non-English text, in our case, Sinhala (syllabification) is not a straightforward task because Sinhala Unicode, to be transcribed in phonetic English text Unlike text to characters have several modifiers that are typed after the speech conversions which only require transcribing the non- actual base character but appears before, after, top and the English text to phonetic English text an ASR needs correct bottom of the actual base character. In addition, a non reproduction of the original language text when the phonetic printable zero-width joiner (Unicode U+200D) is inserted English text is produced as the output of the speech recognizer. between the characters to convert certain character In the present research, newspaper articles are used to gather a combinations to one of the modifiers named yansaya, large set of sentences to build a language model having thousands rakaranshaya or repaya which are not present in Sinhala of words for the Sphinx ASR. We present a decoder algorithm Unicode. that produces phonetic English text from Sinhala Unicode text and an encoder algorithm that produces the correct reproduction For an ASR application to work, both Sinhala Unicode to of Unicode Sinhala text from phonetic English. For a near phonetic text conversion and proper reproduction of Sinhala phonetic tag set for Sinhala alphabet, results indicate 100% Unicode from phonetic English output of the speech accuracy for the decoder algorithm while for numberless text, recognizer is necessary. Phonetic tags for Sinhala alphabet has accuracy of the encoder algorithm stands at 98.61% for distinct to be searched and replaced in a specific order to prevent phonetic English words. invalid construction of words from phonetic English. For example, if phonetic tag for the characters $ is i and 8 is ii Keywords—Sinhala to Phonetic English, Phonetic English to with search order i, ii is used to encode a phonetic English Sinhala, Sinhala ASR, Sinhala Phonetic Tag set, Sphinx word giithaya (meaning song), the word is constructed as (of + $ + $ + s> + ca) @a which is misspelled. Correct spelling of I. Introduction die word Sta a will be constructed if the search order ii,i is Commercial and open source products to automatically followed. recognize spoken English is available in the market today. In The ASR applications use N-gram based techniques to terms of performance and the accuracy, ASR systems that are construct word combinations to build a language model in based on acoustic and language models utilize hidden markov phonetic English. Therefore, it is necessary for the researchers model (HMM) or Gaussian mixture models (GMM) for to identify suitable phonetic tag set to develop a Sinhala to recognition. Most popular such open source ASR systems phonetic text decoder algorithm and phonetic English to include Sphinx, HTK and Julius [1-3]. HMM based ASR Sinhala Unicode encoding algorithm. systems typically have two phases, namely, the training phase and the recognition phase [16]. Onetime training phase is used in which a single or multiple user's voice is utilized to train a II. B a c k g r o u n d a n d R e l a t e d W o r k proper speech model called an acoustic model. In the Representation of Sinhala characters in digital form recognition phase, the speech model built in the previous commenced in early 1980s to display Sinhala subtitles in phase is recurrently used to recognize spoken words of the television. The Wijesekara keyboard used in the type writers speaker. are extended to Sinhala fonts and keyboard drivers used in the computers. The Sinhala alphabet is first standardized by the The process of training a speech model requires a Sri Lanka standards (SLS) institute as Sinhala Character Code collection of acoustic data, a recorded voice sample of a for Information Interchange (SLS 1134:1996) and person, and the correct phonetic text transcription of that voice subsequently revised in 2004 and 2011 (SLS 1134:2004, SLS sample. Adopting this technique for systems built for non- 1134:2011) [5]. The Sinhala alphabet representation in English languages require a production of non-English words Unicode is first added to the Unicode standard version 3 based in phonetic English text for the speech trainer. In order to on the ISO/IEC 10646 [4]. IEEE International Conference on Industrial and Information Systems 2015

The Sinhala language has 18 vowels, 2 half vowels and 40 The pronunciation of the derived consonant varies with the consonants in modem Sinhala alphabet of 60 characters [6]. associated modifiers. Most of the modifiers are inheriting the The consonant letter eq is present to make it 61 characters in sounds of a vowel. The al modifier (<3) does not have a vowel the Unicode standard. In addition, there are several modifiers associated to its sound. A word in Sinhala language is placed after a consonant which corresponds to vowel sounds constructed from a set of phonetic sounds related to the [7]. The first well documented syllabification effort claims an alphabet. Phonologically a base consonant (termed as pure accuracy of 99.95% and employs a rule based algorithm to consonant) is denoted as a consonant with a al (

2Sf 2 3 2553 T S ii 2S>i 2 0 z S 2^5 2 ^ @255 @ 2sf @2553 @255;J ZS>£ Modifier^ yansaya zsf + ZWJ + d

2553 2S5aa 25) @ 3 © 0 © ffif 25)i @ 1 2 ^ 2 § © 25)3 © 25)^ «s M odifier repaya e + z w j + <6

Conjunct sj + zw j + a Fig. 1. Derived consonents from the consonent a . IEEE International Conference on Industrial and Information Systems 2015

B. Improved Phonetic Tag Set TABLE III. Phonetic Tags fo r M odifiers and M atching Vow els Based on the initial tag set present in [13,14], several tags Matching Phonetic Matching Phonetic are changed to eliminate ambiguity in characters with similar Vowel modifier Tag Vowel modifier Tag phonetic sounds but having different Unicode character codes. - <5 w tfo 0 3 axa The near phonetic tag set used in this research is given in

Table II. 4 i xcae 4 i aeae Several new tags are introduced to prevent actual vowels <8 o i 8 o ixi being misinterpreted as modifiers. The phonetic tag used for a C S u Co 9 uxu particular modifier for a corresponding vowel is given in 8 Table III. The character form derived by adding al modifier <5 ®o e <3 9 0 eze (<3) to a consonant is considered as the base form (pure a a O o 0 a © O i 0 X 0 consonant) for breaking down a Sinhala word to its phonetic pronunciation. a a Oa zru a a a Oaa zruu The Sinhala to phonetic English decoder algorithm is @6 aao zaik 3>t © O a zau planned to generate die phonetic form automatically from an - Oo xon - Os zkf input character sequence so that it will fulfill the requirement of the Sphinx ASR trainer tools. The encoder algorithm used to reconstruct Sinhala Unicode word from phonetic English output of the speech recognizer also utilizes the same phonetic C. Sinhala Unicode to Phonetic English Decoder Algorithm pronunciation format. The numbers in digit form (0-9) and HTML tags present in the Sinhala Unicode sentences that are copied from die online newspaper articles are removed first. The cleaned up text is TABLE II. SlNAHALA ALPHEBHET WITH IDENTIFIED PHONETIC TAGS processed using the following decoder algorithm. Numbers Phonetic Phonetic Phonetic written in textual form such as dm (one) in text are allowed. Character Tag Character Tag Character Tag DECODER ALGORITHM Vowels //This algorithm converts Sinhala Unicode text to phonetic 4 a axa 4i xcae //English text in order to use with sphinx based Sinhala ASR // abbreviations: zero width joiner - ZWJ, carriage return - CR 4t aeae <8 i 8 ixi tr = input_string e u C» uxu aa zri //sinhala character and phonetic tag placed in three categories Baa zrii o zilu O0 ziluu vowels = "q-a,ep-axa,..,o»-xon" joiners = "03-axa,o-i,o-ixi^-u,..,oaa-zruu" <5 e 6 eze @6 ai conson = "d-zch,eo-zng,..,»-f' 0 a 0X0 $« xau //processing section split vowel to two lists vowel & phoneticvowel Consonants and Zero Width Joiner split conson to two lists consonent & phoneticconsonent dh e zp 25) k split joiner to two lists modifier and phoneticmodifier remove non printable keeping ZWJ and CR © g 0 t a d repeat until end of vowellist n O p 3 b tr = tr.replace(vowel,"@"+ phoneticvowel +"#") repeat until end of consonentlist jhcn d xjhx e zndx tr = tr.replace(consonent,"$" & phoneticconsonent &"!") 4 qndh 6 zch CO zng repeat until end of modifierlist tr = tr.replace(modifier,"-" & phoneticmodifier & "~") e zon 23 txh aO zjh //zero width joiner replacement d zth a zdh ® xmb tr = tr.replace(ZWJ,"$qx#") //introduce "a" sound for non al consonents & zsh c5 zdx zn tr = tr.replace("!$", "!-a~") € zl a zt cn tr = tr.replace("!@", "!-a~@") tr = tr.replace("!", "!-a~") 6 jh C3 sh e ch tr = tr.replace("!" & CR, "!-a~" & CR) a zg 3 zk CO zb tr = tr.string.replace(CR, " " & CR & "") //remove all special characters except /, < , > © m C3 y 6 r tr = tr.replace("[\\! @#$' %A&* ."",0_+=~| {}?;:\[V\]]","") c 1 e V a s add to the start of tr remove last from the tr C9 f CO h zwj qx return tr IEEE International Conference on Industrial and Information Systems 2015 Decoder algorithm is capable of handling multiple else sentences separated by linefeed. The tags ~~and~~ is a if current character is not in ("Oo", "o%” ) then requirement to build sentence transcription of sphinx language if next 3 characters are "wqx" Then model building tool. GoTo rakaranshasection end If D. Phonetic English to Sinhala Unicode Encoder Algorithm kk = current char & "d" A longer algorithm required to encode phonetic English to else Sinhala Unicode because the features mentioned in 3A has to kk = current char be considered. end if rakaranshasection: //This algorithm converts phonetic English to Sinhala Unicode //rakaranshaya handling section //Phonetic English text is without numbers, but number written If next 4 characters are "wqxd" Then //in text is possible, eg. 1 - is not allowed, one - is allowed kk = current char + "dd" If fifth character from current char = "q" Then ra = input_phonetic_english_text Increment character count by 5 else // otherwise check the additional modi //data preparation section check 5th char from current position for vowel vowelO = vowel list in sinhala Unicode if found fifth character from current = modifier modifierO = modifier list sinhala Unicode kk += modifier consonentO = consonant list sinhala Unicode Increment character count by 3 end if csingle = Unicode character list with single letter phonetic tag Increment character count by 1 cdual = Unicode character list with two letter phonetic tag end If ctri = Unicode character list with three letter phonetic tag //end of rakaransaya handling section clarge = Unicode character list with large phonetic tag repeat rakaransha section changing 6 to ca, a for singlephone = single letter phonetic tag list yansaya and conjuncts of e,Q respectively dualphone = two letter phonetic tag list triphone = three letter phonetic tag list //repaya handling largephone = phonetic tag list with more than 3 letters for tag If next 4 characters are "dwqx" Then kk = "5” + cta(cc+4) //replace tag with Unicode character in the order of tag size If fifth character from current char = "q" Then Increment character count by 5 repeat until eof largephonelist else // otherwise check the additional modi ra = ra.replace(largephone,clarge) check 5th char from current position for vowel repeat until eof triphonelist if vowel found ra = ra.replace(triphone, ctriple) Fifth character from current char = modifier repeat until eof dualphonelist kk += modifier ra = ra.replace(dualphone,,cdual) Increment character count by 4 repeat until eof singlephonelist end if ra = ra.replace(singlephone, cingle) Increment character count by 1 end If //make a character array to do linear search to convert end if //vowels that immediately follow a consonant to modifiers sout +=kk cta() = ra.ToCharArrayO else sout = sout + current char + kkr //ch - current char, cc - counter, kk, kkr - tempcharacter else while not end of input_phonetic_english_text if current character is not in modifier then if current char = space then sout = sout + current char sout += space end if end if Increment current character count by 1 if ch is a consonant then end while check next character cta(cc+l) for in vowelO return sout //sout has the Sinhala Unicode output if vowel found next character = modifier kkr = modifier IV. Results and Discussion else Sinhala to phonetic English algorithm (decoder) and check next character for modifier without vowel phonetic English to Sinhala Unicode (encoder) algorithm if found then given in sections 3C and 3D are implemented and tested with kk = current char & next char different test cases. Increment character count by 1 IEEE International Conference on Industrial and Information Systems 2015 A. Sinhala Unicode to Phonetic English Decoder Algorithm B. Phonetic English to Sinhala Unicode Encoder Algorithm The functionality of the Sinhala Unicode to phonetic The phonetic English to Sinhala encoder algorithm is English decoder algorithm is tested with several phases. In the tested in several ways. The first test involves carrying out first phase, it is checked for accurate generation of phonetic proper reproduction of 61 alphabetical letters from their English text from a given sentence. Test Results are described respective phonetic tags each separated by a space character. without ~~and~~ tags which is a format requirement of The results indicate an accuracy of 100% in this regard. the Sphinx trainer. The next test is carried out for a sentence containing words A sentence ®8cf oo&(S a 8 a i ©®af 2S8® qjntgdiqscazoca without special modifiers. The sentence "sixonhalenw (meaning: It is risky to walk in the rail track) is processed katahazndxa haqndhunaxa gxcaeniximeze using the decoder algorithm produces the output as rezelw mzrudhukaxaxongaya saqndhahaxa mema xcaelwgoritxhama paxareze payinw gamanw kirixima anatxhurudhaxayakaya. It dheka upakaxarixi veze" (meaning: these two algorithms correctly produces the phonetic text by ordering the phonetic helpful in Sinhala ASR software) is encoded using phonetic tags of the test sentence in the following character sequence: English to Sinhala Unicode algorithm. It produces the accurate 8d(^w deptSd defcdyniw ® ©g20 c e j 25538 ©O". The word ©$ef (first word of the above sentence) is A sentence with special modifiers is tested using the same constructed as "M<^w" by decoding it to phonetic sounds of technique. The sentence "puxujhwqxya pakwzshaya the base character and its modifier. &(S becomes IS + d and ef chakwqxralezezkaya sahitxha dhuxmburu varwqxznayenw becomes cf + w in the word ©&cf. The tag w implies that the yutxhu kavaraya vzaikdhwqxyavarayaxata zbaxara consonant does not have a hidden vowel ep (a) present. dhunwnezeya" (meaning: The brown colored cover containing Therefore, the word @8cf becomes (Sddw and subsequently, the circular handed over to the doctor by the clergy) is rezelw and the rest of the words are processed with the same properly encoded by the encoding algorithm which produces notation. the output as "gt£» estates ©@®c?Scs esfltss gg<5i G-eSacasf gzg 255©dc8 ©©©gxGdc&t© 853(5 gsf@!jfc8". The second phase of testing the decoder algorithm involves generating phonetic text in the presence of special Researchers have observed that the phonetic English text modifiers (repaya, rakaranshaya and yansaya) and conjuncts generated by decoder algorithm using die Sinhala text typed (combined letters). Table IV depicts die decoding process of from real time Unicode converter application! 15] and die die words ©29a (curve), qOasod (truthful), ©«8 (colors) and Google input tools provided a 100% accurate output in the qssd (letters) tested with online Sinhala typing application encoder algorithm for the three special modifiers yansaya, "real time Unicode converter" by UCSC [15]. This algorithm rakaranshaya and repaya. The encoding algorithm accurately converts the words to proper phonetic English format. supports one conjunct that is the combined letter s» which is ad and ® added together. Testing with the text converted from The initial results indicate that the decoder algorithm other online sources produced errors which will be discussed generates phonetic English from Sinhala Unicode in the in the next section. expected format of the Sphinx ASR trainer. A test set of 500 sentences is gathered from the Divaina, Silumina and The output of Sinhala Unicode text from phonetic English Lankadeepa online editions of Sinhala newspapers which are source sentence is a continuous effort of experimental published in the Sinhala Unicode font to test bidirectional comparisons in identifying a proper near phonetic tag set for conversion between phonetic English and Sinhala Unicode the Sinhala alphabet. It is also observed that the search order text. The numbers present in the text in digits (0-9) are of the phonetic tags has to be from the largest to the smallest.. manually replaced by the word form of the numbers. C. Testing the Bi-directional Conversion A test set of 500 Sinhala Unicode sentences is processed using the decoder algorithm to produce the phonetic English At the final stage, the bidirectional capability of the two transcription within 3 seconds in a personal computer having a algorithms is tested using the following three steps. processing speed of 2.7 GHz. Step 1: A test set of Sinhala Unicode text is processed by the decoder algorithm to obtain phonetic English text. Step 2:

TABLE IV. Phonetic English conversion o f special m odifiers and The phonetic text is fed as the input to the encoder algorithm C onjuncts to obtain Sinhala Unicode text. Step 3: The sentences in the original test set are compared with the regenerated Sinhala Derived Phonetic Phonetic tag Unicode text. Modifier consonants Form separated Online editions of the Silumina, Divaina and Lankadeepa Sinhala newspapers possess the least number of typographical rakaranshaya vakwqxraya vakwqxraya errors. As such, Sinhala Unicode text of the articles published tftacJ yansaya avwqxyaxajha avwqxyaxajha in such newspapers are selected for building the test set containing 500 sentences to be used in the bidirectional e«s repaya varwqxzna v a r w qx zn a conversion. Note that the test set of 500 sentences are stripped qa»<5 conjunct akwqxzshara a k w qx zsh a r a of all the HTML tags. IEEE International Conference on Industrial and Information Systems 2015

The regenerated Sinhala Unicode output of the encoder References had 46 errors from a total of 7324 words present in the original test set and the accuracy stands at 99.37%. On the [1] K.F. Lee, H.W. Hon and R. Reddy, “An overview of the SPHINX other hand, if the accuracy is taken as the percentage of errors speech recognition system”, IEEE Transactions on Acoustics, Speech from the number of distinct words, 38 errors out of 2743 and Signal Processing, vol. 38, no. 1, pp. 35-45, January 1990. distinct words measure up to the encoding accuracy of [2] S. Young, “The HTK hidden markov model toolkit: design and 98.61%. philosophy,” Cambridge University Engineering Department, UK, Tech. Rep. CUED/F-INFENG/TR152, September 1994. It is observed that the encoding errors occur due to the [3] A. Lee, T. Kawahara and K. Shikano,"Julius - An open source real-time placement of zero-width joiners in the wrong sequence or large vocabulary recognition engine", In Proceedings of the 7th placed twice in the places with special modifiers rakaransha, European Conference on Speech Communication and Technology yansa and repaya in the original Sinhala text, which is used to (EUROSPEECH2001), Aalborg, Denmark, pp. 1691-1694, September 2001. generate phonetic English text. This is due to excessive use of zero-width joiners when typing the Sinhala text by the user or [4] V.K. Samaranayake, S.T. Nandasara, J.B. Dissanayake, A.R. Weerasinghe and H. Wijayawardhana. (2003) "An introduction to the way different keyboard / font driver interprets rakaransha, UNICODE for Sinhala characters". University of Colombo School of yansa and repaya by the Sinhala typing application. Computing, Department of Sinhala University of Colombo, UCSC technical report 03/01, Sri Lanka. Having identified the cause of errors, an accuracy of 100% [5] Sri Lanka Standards Institute, "Sri Lanka Standards SLS 1134 : 2011 - can be reached after running the set of raw sentence through Sinhala character code for information interchange - revision 3", SLSL, the decoder and the encoder and replacing the words 2011. containing invalid zero-width joiners with correctly typed [6] "Sinhala Lekhana Reethiya", National Institute of Education, Sinhala words. Maharagama, Sri Lanka, 1989. [7] The Unicode Consortium, "The Unicode standard version 7.0, Sinhala The Sphinx ASR trainer is used to build the Sinhala Range: 0D8O-ODFF", 2014. [Online], Available: language model using the phonetic English transcription http://unicode.org/diarts/PDF/U0D80.pdf/, [Accessed: 30-Nov-2014], generated by decoder algorithm taking Sinhala sentences as an [8] G.V. Dias,"Challenges of enabling IT in the Sinhala language",27th Input. Then, the language model, along with the acoustic Internationalization and Unicode Conference, Berlin, Germany, 2005. model, is used with the Sphinx ASR to produce continuous [9] R. Weerasinghe, A. Wasala and K. Gamage, "A rule based Sinhala voice recognition output in phonetic English. syllabification algtoithm for sinhala", In Proceedings of the 2nd Subsequently, the phonetic English output is fed into the International Joint Conference on Natural Language Processing (UCNLP-05), Jeju Island, Korea, pp. 438-449,2005. encoder algorithm to generate Sinhala Unicode text via a [10] J.B. Disanayaka,"^>di ho 8 g (Letters and Strokes)",Colombo, S. wrapper application. Godage & Bros., 2000. [11] J.B. Disanayaka,"National Languages of Sri lanka - I, Sinhala", V. Conclusions Colombo, Department of Cultural Affairs, 1976. [12] B. Hettige and A.S. Karunananda,"Transliteration system for English to Building a language model in phonetic English from Sinhala machine translation", In proceedings of International Conference Sinhala Unicode text is a tedious task in Sphinx based Sinhala on Industrial and Information Systems (ICIIS 2007), pp. 209-214,2007. ASR. In this regard, online newspapers are a valuable source [13] A. Wasala and K. Gamage. (2007) "Research report on phonetics and of obtaining a large set of Sinhala sentences that can be used phonology of Sinhala ”. University of Colombo School of to generate a language model. Computing,Working Papers 2004-2007, Sri Lanka [Online]. Available: http://www.columbiaedu/~kf2119/SPLTE1014/Day%203%20slides%2 The main contributions of the proposed technique are the 0and%20readings/SinhalaPhoneticsandPhonology.pdf, [Accessed: 17- decoder algorithm having an accuracy of 100% for numberless Jul-2014], text and the encoder algorithm with an accuracy of 98.61% for [14] "Sinhala syllabification tool", Language Technology Research distinct words in making successful bidirectional conversion Laboratory, University of Colombo School of Computing (UCSC), Sri Lanka [Online], Available: http://ucsc.cmb.ac. lk/ltrl/?page=downloads between Sinhala Unicode text and phonetic English &lang=en&style=default, [Accessed: 14-May-2014], Two future improvements are suggested. First [15] "grfSaax/O C o i & s ! oSOSsnocs (Unicode real time font conversion improvement is enhancement of the decoder algorithm to utility)", Language Technology Research Laboratory, University of support inline number to text conversion with appropriate Colombo School of Computing (UCSC), Sri Lanka2006. [Online]. Available: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/tl .html sensing of suffixes used in specifying days, positions and numbers. Second improvement is adding a touching letter (two [16] K. Seymore et al.,"The 1997 CMU sphinx3 English broadcast news transcription system", In Proceedings of DARPA Broadcast News consonants touched together without al modifier in the first Transcription and Understanding Workshop,1998. consonant) conversion support as an option which are [17] A.R. Weerasinghe, D.L. Herath and K. Gamage,"The Sinhala collation commonly used in classical Buddhist and Pali literature. sequence and its representation in UNICODE",International Journal of Localization (Localization Focus), vol 5, no.l, pp 13-19, March 2006. Acknowledgment This research is funded by the National Science Foundation of Sri Lanka research grant number TG/2014/Tech-D/02.