Building a Pronunciation Dictionary for Indonesian Speech Recognition System

Amalia Zahra, Sadar Baskoro, Mirna Adriani
Faculty of Computer Science, University of Indonesia
Depok 16424, Indonesia

Abstract

This paper reports the development of a pronunciation dictionary generator for Bahasa Indonesia. The pronunciation dictionary is an important component in developing an automatic speech recognition (ASR) system. We compare two kinds of pronunciation dictionary: one that is generated automatically and another that is manually prepared. Both dictionaries are used in an Indonesian ASR system. The result shows that both dictionaries have similar levels of accuracy.

1. Introduction

In recent years the availability of multimedia data such as text, image, video, and speech has increased tremendously through the internet, radio, television, and telephone. This has motivated many studies on organizing such huge amounts of data. One application that is needed in many areas is one that converts speech data into text. Such a system is called an Automatic Speech Recognition (ASR) system.

The performance of ASR is influenced by many factors, namely the pronunciation, the input signal, and the transducer. There are two ways of pronouncing words: isolated and continuous. Isolated word recognizers give better results than continuous speech recognizers because continuous speech recognizers use more complex methods for recognizing the speech. The second factor is the input signal, which depends on the way the speakers speak, ranging from low frequency with slow pronunciation to high frequency with fast pronunciation. In addition, the transducer also affects the recognition result. Speech recorded through a telephone has different recognition accuracy compared to speech recorded using a high-quality microphone. The environment where the speakers speak also determines the performance of ASR.

The development of ASR systems has advanced tremendously in recent years, and systems can now recognize English speech with an accuracy of over 90%. Much research has been conducted for other languages too, with similar accuracy. For example, Arabic (Satori, et al., 2007) achieved around 85% accuracy, Estonian (Alumae, 2004) achieved around 92% accuracy, and Indian (Kishore, et al., 2005) achieved around 83% accuracy. An ASR system has been developed for Bahasa Indonesia (Adriani and Baskoro, 2008) that produces around 80% accuracy. Nevertheless, the research on speech recognition for Bahasa Indonesia is still very limited.

In order to get good performance, ASR systems need a huge speech corpus. However, such collections are rarely available, especially for unpopular languages such as Bahasa Indonesia and Javanese.

An ASR system has three main components, namely an acoustic model, a pronunciation dictionary, and a language model. A large corpus is needed to create the acoustic model and the language model in order to get acceptable ASR results. The pronunciation dictionary also plays an important role in recognizing spoken data. All words need to be phonetically transcribed correctly. Incorrect phonetic transcription may affect the acoustic model adversely by polluting the samples of one phone with those of other phones.

In order to produce a correct pronunciation dictionary, one needs to check all the phonetic transcriptions manually. This is, obviously, very expensive because it requires a lot of time, resources, and energy to check the pronunciation of every word in the dictionary. Therefore, there has been a research effort to generate the pronunciation dictionary automatically.

A pronunciation dictionary and lexicon generator for English, named Bob (available for download at www.webasr.com) (Wan, et al., 2008), has been developed. The pronunciation dictionary has 89% word accuracy and 98% phone accuracy. The ASR system using that dictionary has a word error rate (WER) of 39.5%, which is 4.2 percentage points worse than the 35.3% WER obtained with a manually checked pronunciation dictionary.

Basically, Indonesian pronunciation (Chaer, 2003) is simpler than that of other languages, for example English and Chinese. Indonesian words are pronounced exactly as they are spelled. For example, the letters “ea” are always pronounced as /e y a/ (e.g. the words “beasiswa”, “dea”), while in English the letters “ea” can be pronounced differently, for example in the words “read” and “break”. In addition, Indonesian pronunciation has no tone, unlike Chinese, where a tone carries meaning. Because of the pronunciation simplicity of Indonesian, we conduct research to develop an Indonesian pronunciation dictionary generator, and the dictionary produced will be used in our Indonesian speech recognition system.

2. Pronunciation Dictionary

The pronunciation dictionary is one of the ASR components. It consists of words along with their pronunciations, which are written as sequences of phone symbols. There are three standard phone symbol sets mainly used in linguistics and ASR systems. They are IPA (International Phonetic Alphabet, http://en.wikipedia.org/wiki/International_Phonetic_Alphabet), SAMPA (Speech Assessment Methods Phonetic Alphabet, http://www.phon.ucl.ac.uk/home/sampa/), and ARPABET (http://en.wikipedia.org/wiki/Arpabet). IPA is the mother of phone symbols, meaning that IPA is used as a guideline to identify a sound in various languages in the world. SAMPA and ARPABET are ASCII representations of the symbols in IPA, so that their symbols can be processed by a computer.

In this study we use phone symbols that were produced in our previous research on Indonesian phonetic dictionary development (Zahra, 2008). Basically, they are based on ARPABET with a little modification of some phones for simplification. The phone set consists of 33 phone symbols, which contain 23 single symbols and 10 symbol pairs, as shown in Table 1.

Table 1. Indonesian Phone Symbols

Phone       Symbols                                             #Symbols
Single      a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,     23
            r, s, t, u, w, y, z
Pair        ax, ng, ny, kh, kk, sy                              6
Diphthong   aw, ay, ey, oy                                      4
Total                                                           33

All words in the dictionary are taken from an online Indonesian newspaper. There are 199,648 sentences. Those sentences are classified into two sets: the first set contains only Indonesian words (156,718 sentences) and the second set contains some foreign words (42,930 sentences), such as English, Arabic, etc. In this study we only use the first set, which consists of 50,988 distinct words. We then generate a pronunciation dictionary automatically by running a letter-to-phone program (Table 2).

Table 2. The Pronunciation Dictionary

Word        Pronunciation
indonesia   /i n d o n e s i a/
indoensia   /i n d o e n s i a/

However, we realize that not all the produced pronunciations are correct, so manual work is still needed to improve the dictionary. Some of the errors come from misspelled words that appear in the newspaper, and others from the pronunciations themselves. An example of a misspelled word is “indoensia” instead of “indonesia”. Because the pronunciations are produced by a simple letter-to-phone program, many of them are incorrect. Additional errors occur because some words have several pronunciations, e.g. the word “indonesia”, so we revise them manually (Table 3). Our final version of the pronunciation dictionary consists of 51,622 words along with their pronunciations.

Table 3. The Revised Pronunciation Dictionary

Word         Pronunciation
indonesia    /i n d o n e s i y a/
indonesia2   /i n d o n e sy a/
indonesia3   /e n d o n e sy a/

The manual checking requires a lot of time, energy, and resources. Therefore, we study the errors that occur in the pronunciation dictionary and create rules (Table 4) based on Indonesian phonetics (Chaer, 2003) to produce correct pronunciations automatically.

Table 4. Rules of Indonesian Pronunciation

No  Rule                                                    Example
1   Letter “e” in the prefixes “me-”, “pe-”, “ke-”,         “menari” /m ax n a r i/
    “ter-”, “ber-”, and “se-” is pronounced as "ax".
2   Statistical approach on diphthong and non-diphthong     “berurai” /b ax r u r ay/
    vowels (“ai”, “au”, “ei”, “oi”).                        “bermain” /b ax r m a y i n/
3   Insertion of phone “y” into the vowel chains “ae”,      “biasa” /b i y a s a/
    “ia”, “iu”, “ie”, “io”, “ea”, “eo”.
4   Transform the sound of letters “b”, “d”, and “g” at     “abad” /a b a t/
    the end of a word into “p”, “t”, and “k”.
5   The remaining letter(s) are replaced by their phones    “batang” /b a t a ng/
    by looking at the phone set.

The generator starts by splitting the word into chains of consecutive vowel(s) and chains of consecutive consonant(s). Every chain is processed further based on the number of letters in the chain. If a chain consists of one to four letters, then the generator applies the rules shown in Table 4 to produce its phonetic transcription.

Figure 1. The Algorithm of Pronunciation Dictionary Generator
[Flowchart: Indonesian word list -> split every word -> array of letters (chain of vowel(s) or chain of consonant(s)). Every chain of one to four consecutive letters goes to the subroutines that create the phonetic transcription of one to four consecutive letters; every chain of five or more consecutive letters goes to the subroutine that phonetically transcribes the word as an abbreviation. Inputs: knowledge of the pronunciation rules of Indonesian, the mapping of the phone set (33 symbols), and the list of letters with their pronunciation as abbreviation. Output: Indonesian pronunciation dictionary.]

At this moment, we have two kinds of pronunciation dictionaries: the dictionary developed manually (“manually revised dictionary”) and the dictionary generated automatically by our pronunciation dictionary generator (“automatically generated dictionary”). Both dictionaries are used in our ASR experiments in order to find out how they affect the ASR performance.

3. The ASR Experiments

There are three components of an ASR
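
As an illustration of the generator described in Section 2, the chain-splitting step and rules 1, 3, 4, and 5 of Table 4 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the prefix test is a naive string match, rule 2 (the statistical treatment of “ai”, “au”, “ei”, “oi”) and the abbreviation branch for chains of five or more letters are left out because the paper does not specify them, and input is assumed to consist of the letters a to z only.

```python
import re

# Assumption: a word is split into maximal runs of vowels or non-vowels.
CHAIN = re.compile(r"[aeiou]+|[^aeiou]+")
PREFIXES = ("ter", "ber", "me", "pe", "ke", "se")            # rule 1
DIGRAPHS = ("ng", "ny", "kh", "sy")                          # consonant pairs in Table 1
GLIDE_PAIRS = {"ae", "ia", "iu", "ie", "io", "ea", "eo"}     # rule 3
FINAL_DEVOICE = {"b": "p", "d": "t", "g": "k"}               # rule 4


def split_chains(word):
    """Split a word into alternating chains of vowels and consonants."""
    return CHAIN.findall(word)


def consonant_phones(chain):
    """Rule 5 for consonants: map letter pairs such as 'ng' to one phone."""
    phones, i = [], 0
    while i < len(chain):
        pair = chain[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(pair)
            i += 2
        else:
            phones.append(chain[i])  # letters outside Table 1 pass through unchanged
            i += 1
    return phones


def vowel_phones(chain):
    """Rule 3: insert 'y' between the vowel pairs listed in Table 4."""
    phones = [chain[0]]
    for prev, cur in zip(chain, chain[1:]):
        if prev + cur in GLIDE_PAIRS:
            phones.append("y")
        phones.append(cur)
    return phones


def transcribe(word):
    """Return the phone list for one word (rule 2 is not applied)."""
    word = word.lower()
    phones = []
    for chain in split_chains(word):
        if chain[0] in "aeiou":
            phones.extend(vowel_phones(chain))
        else:
            phones.extend(consonant_phones(chain))
    # Rule 1: every listed prefix is consonant + 'e', so the 'e' is phone 1.
    if any(word.startswith(p) and len(word) > len(p) for p in PREFIXES):
        phones[1] = "ax"
    # Rule 4: devoice a final b, d, or g ('ng' is a single phone and is kept).
    if phones and phones[-1] in FINAL_DEVOICE:
        phones[-1] = FINAL_DEVOICE[phones[-1]]
    return phones
```

With these assumptions, `" ".join(transcribe("menari"))` gives `m ax n a r i` and `" ".join(transcribe("abad"))` gives `a b a t`, matching the examples in Table 4. Because the prefix test is a plain string match, words that merely begin with the same letters as a prefix would also receive "ax", which is one kind of error a rule of this form can introduce.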
