
Building a Pronunciation Dictionary for Indonesian Speech Recognition System

Amalia Zahra, Sadar Baskoro, Mirna Adriani
Faculty of Computer Science, University of Indonesia
Depok 16424, Indonesia

Abstract

This paper reports the development of a pronunciation dictionary generator for Bahasa Indonesia. The pronunciation dictionary is an important component in developing an automatic speech recognition (ASR) system. We compare two kinds of pronunciation dictionary: one that is generated automatically and another that is manually prepared. Both dictionaries are used in an Indonesian ASR system. The results show that both dictionaries yield similar levels of accuracy.

1. Introduction

In recent years the availability of multimedia data such as text, image, video, and speech has increased tremendously through the internet, radio, television, and telephone. This motivates many studies on organizing such huge amounts of data. One application needed in many areas is the conversion of speech data into text. Such a system is called an Automatic Speech Recognition (ASR) system.

The performance of ASR is influenced by many factors, namely the pronunciation, the input signal, and the transducer. There are two ways of pronouncing words: isolated and continuous. Isolated word recognizers give better results than continuous speech recognizers because continuous speech recognizers use more complex methods for recognizing the speech. The second factor is the input signal, which depends on the way the speakers speak, ranging from low frequency with slow pronunciation to high frequency with fast pronunciation. In addition, the transducer also affects the recognition result. Speech recorded through a telephone has different recognition accuracy compared to speech recorded using a high-quality microphone. The environment where the speakers speak also determines the performance of ASR.

The development of ASR systems has advanced tremendously in recent years, and systems can now recognize English speech with an accuracy of over 90%. Much research has been conducted for other languages too, with similar accuracy. For example, Arabic (Satori, et al., 2007) achieved around 85% accuracy, Estonian (Alumae, 2004) achieved around 92% accuracy, and Indian languages (Kishore, et al., 2005) achieved around 83% accuracy. An ASR system has been developed for Bahasa Indonesia (Adriani and Baskoro, 2008) that produces around 80% accuracy. Nevertheless, the research on speech recognition for Bahasa Indonesia is still very limited.

In order to get good performance, ASR systems need a huge speech corpus. However, such collections are very rare, especially for less popular languages such as Bahasa Indonesia and Javanese.

An ASR system has three main components, namely an acoustic model, a pronunciation dictionary, and a language model. A large corpus is needed to create the acoustic model and the language model in order to get acceptable ASR results. The pronunciation dictionary also plays an important role in recognizing spoken data. All words need to be phonetically transcribed correctly. Incorrect transcriptions may affect the acoustic model adversely by polluting the samples of one phone with those of other phones.

In order to produce a correct pronunciation dictionary, one needs to check all the phonetic transcriptions manually. This is, obviously, very expensive because it requires a lot of time, resources, and energy to check the pronunciation of every word in the dictionary. Therefore, there has been a research effort to generate the pronunciation dictionary automatically.

A pronunciation dictionary and lexicon generator for English, named Bob [1] (Wan, et al., 2008), has been developed. The pronunciation dictionary has 89% word accuracy and 98% phone accuracy. The ASR system using that dictionary has 39.5% WER, which is 4.2 percentage points worse than the 35.3% WER obtained with a manually checked pronunciation dictionary.

[1] Available for download at www.webasr.com

Basically, Indonesian pronunciation (Chaer, 2003) is simpler than that of other languages, for example English and Chinese. Indonesian words are pronounced exactly as they are spelled. For example, the letters “ea” are always pronounced as /e y a/ (e.g. the words “beasiswa”, “dea”), while in English the letters “ea” can be pronounced differently, for example in the words “read” and “break”. In addition, Indonesian pronunciation has no tones like Chinese, where a tone carries meaning. Because of the pronunciation simplicity of Indonesian, we conduct research to develop an Indonesian pronunciation dictionary generator, and the dictionary produced will be used in our Indonesian speech recognition system.

2. Pronunciation Dictionary

The pronunciation dictionary is one of the ASR components. It consists of words along with their pronunciations, which are written as sequences of phone symbols. There are three standard phone symbol sets mainly used in linguistics and ASR systems. They are IPA [2] (International Phonetic Alphabet), SAMPA [3] (Speech Assessment Methods Phonetic Alphabet), and ARPABET [4]. IPA is the mother of phone symbols: it is used as a guideline to identify a sound in various languages in the world. SAMPA and ARPABET are ASCII representations of IPA symbols so that their symbols can be processed by a computer.

[2] http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
[3] http://www.phon.ucl.ac.uk/home/sampa/
[4] http://en.wikipedia.org/wiki/Arpabet

In this study we use phone symbols that were produced in our previous research on Indonesian phonetic dictionary development (Zahra, 2008). Basically, they are based on ARPABET with a little modification of some phones for simplification. The phone set consists of 33 phone symbols, which contain 23 single symbols and 10 symbol pairs, as shown in Table 1.

Table 1. Indonesian Phone Symbols
Phone      Symbols                                                                #Symbols
Single     a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, r, s, t, u, w, y, z   23
Pair       ax, ng, ny, kh, kk, sy                                                 6
Diphthong  aw, ay, ey, oy                                                         4
Total                                                                             33

All words in the dictionary are taken from an online Indonesian newspaper. There are 199,648 sentences. Those sentences are classified into two sets: the first set contains only Indonesian words (156,718 sentences) and the second set contains some foreign words (42,930 sentences), such as English, Arabic, etc. In this study we only use the first set, which consists of 50,988 distinct words. We then generate a pronunciation dictionary automatically by running a letter-to-phone program (Table 2).

Table 2. The Pronunciation Dictionary
Word       Pronunciation
indonesia  /i n d o n e s i a/
indoensia  /i n d o e n s i a/

However, we realize that not all the produced pronunciations are correct, so manual work is still needed to improve the dictionary. Some of the errors are misspelled words that appear in the newspaper, together with their pronunciations. An example of a misspelled word is “indoensia” instead of “indonesia”. Because the pronunciations are produced by a simple letter-to-phone program, many of them are incorrect. Additional errors occur because a word can have several pronunciations, e.g. the word “indonesia”, so we revise them manually (Table 3). Our final version of the pronunciation dictionary consists of 51,622 words along with their pronunciations.

Table 3. The Revised Pronunciation Dictionary
Word        Pronunciation
indonesia   /i n d o n e s i y a/
indonesia2  /i n d o n e sy a/
indonesia3  /e n d o n e sy a/

The manual checking requires a lot of time, energy, and resources. Therefore, we study the errors that occur in the pronunciation dictionary and create rules (Table 4) based on Indonesian phonetics (Chaer, 2003) to produce correct pronunciations automatically.

Table 4. Rules of Indonesian Pronunciation
No  Rule (with example)
1   Letter “e” in the prefixes “me-”, “pe-”, “ke-”, “ter-”, “ber-”, and “se-” is pronounced as /ax/. Example: “menari” /m ax n a r i/
2   Statistical approach to the pronunciation rules of Indonesian diphthong and non-diphthong vowels (“ai”, “au”, “ei”, “oi”). Examples: “berurai” /b ax r u r ay/, “bermain” /b ax r m a y i n/
3   Insertion of phone “y” into the vowel chains “ae”, “ia”, “iu”, “ie”, “io”, “ea”, “eo”. Example: “biasa” /b i y a s a/
4   The sounds of the letters “b”, “d”, and “g” at the end of a word are transformed into “p”, “t”, and “k”. Example: “abad” /a b a t/
5   Each remaining letter is replaced by its phone by looking it up in the phone set. Example: “batang” /b a t a ng/

The generator starts by splitting the word into chains of consecutive vowels and chains of consecutive consonants. Every chain is processed further based on the number of letters in the chain. If a chain consists of one to four letters, the generator applies the rules shown in Table 4 to produce its phonetic transcription. Otherwise, if a chain consists of more than four letters, the original word is phonetically transcribed as an abbreviation. This procedure proceeds iteratively until all chains, and then all words, are processed. The algorithm does not cover the variations in pronunciation that a word may have. Examples of words and their pronunciations produced by the generator are shown in Table 5.

Table 5. The Result of the Pronunciation Generator
Word         Pronunciation
bahagia      /b a h a g i y a/
cerita       /c ax r i t a/
indonesia    /i n d o n e s i y a/
menguraikan  /m ax ng u r ay k a n/
puasa        /p u w a s a/

[Figure 1 flowchart: Indonesian word list -> split every word into an array of letter chains (chains of vowels or chains of consonants) -> every chain of one to four consecutive letters is transcribed by subroutines that use knowledge of Indonesian pronunciation rules and the mapping of letters to the phone set (33 symbols); a word with a chain of five or more consecutive letters is phonetically transcribed as an abbreviation, using a list of letters with their pronunciations as abbreviations -> Indonesian pronunciation dictionary.]

Figure 1. The Algorithm of Pronunciation Dictionary Generator

At this moment, we have two kinds of pronunciation dictionaries. They are the dictionary developed manually (“manually revised dictionary”) and the one generated automatically by our pronunciation dictionary generator (“automatically generated dictionary”). Both dictionaries are used in our ASR experiments in order to find out how they affect the ASR performance.

3. The ASR Experiments

There are three components of an ASR system, namely the acoustic model, the language model, and the pronunciation dictionary. For the acoustic model, we use speech corpora from various sources (Table 6). The first speech corpus is taken from the daily news of an Indonesian radio station. The second one is taken from the daily news of an Indonesian television station. The third one is a speech corpus containing speech recorded using a mobile telephone; it consists of short sentences for daily conversation. The last one contains speech recorded using an audio recorder; it contains sentences taken from an online Indonesian newspaper.

Each speech corpus is divided into 75% for training and 25% for testing purposes. We also conduct a study by combining all of the training sets and testing sets.

Table 6. The Duration and Number of Audio Files Used in the Experiments
                  Training          Testing
Sources of Files  #Files  #Hours    #Files  #Hours
Audio recorder    1,181   7.56      394     2.5
Radio             302     0.2       101     0.11
Television        1,383   1.36      479     0.52
Telephone         2,626   2.38      876     0.84
Total (Mix)       5,492   11.5      1,850   3.97

The speech corpora contain 5,492 audio files (11.5 hours), which are used for building the acoustic model (AM), and 1,850 audio files (3.97 hours) for testing purposes. We use a text corpus containing newspaper articles and manual transcriptions of the audio files for building the language model (LM). The text corpus contains 156,718 sentences taken from a daily newspaper and 7,342 sentences taken from the transcriptions of the speech corpora. Thus, we have 164,060 sentences with an average length of about 15 words each. Those sentences are used to produce a trigram language model.

Table 7. Experiment Design
#Exp.  #Audio files for AM  #Sentc. for LM  #Audio files for testing  Pronunc. Dict.
1      5,492 (11.5 hours)   164,060         1,850 (3.97 hours)        manually revised dictionary
2      5,492 (11.5 hours)   164,060         1,850 (3.97 hours)        automatically generated dictionary

We develop the Indonesian ASR system using Sphinx-4 (Walker, et al., 2004). The acoustic model is built using SphinxTrain (Walker, et al., 2004) and the language model is built using the CMU Language Model Toolkit (Clarkson and Rosenfeld, 1997). To evaluate the performance of the ASR system, we use the Speech Recognition Scoring Toolkit (SCTK) [5] to measure Word Error Rate (WER). In this study we conduct two experiments using two different pronunciation dictionaries (Table 7).

4. The Indonesian ASR System Evaluation

In order to measure the performance of our pronunciation dictionary generator, we calculate the word level and phone level accuracy of the automatically generated dictionary. The word level accuracy is determined by looking at the whole pronunciation of a particular word. If a word is transcribed into its phones correctly, it contributes to the percentage of word level accuracy. If an error occurs in even one phone, the word contributes nothing to the percentage. The phone level accuracy counts the number of correct phones in every word. We use the Levenshtein distance (Heeringa, 2004) to measure it.

Table 8. The Pronunciation Dictionary Generator Performance
Word Level Accuracy  *Phone Level Accuracy
83.8%                96.8%
* The phone level accuracy is calculated using Levenshtein distance.

From the phone level accuracy of 96.8%, we obtain a phone error rate of 3.2%. The phone error rate measures error in terms of three components: insertion, deletion, and substitution (see Table 9). Substitution contributes 2.1% to the phone error rate, while deletion contributes 0.4% and insertion contributes 0.7%.

Table 9. The Phone Error Rate
                  Percentage
Substitution      2.1%
Deletion          0.4%
Insertion         0.7%
Phone Error Rate  3.2%
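The word level and phone level accuracy measures can be illustrated with a short Python sketch. The paper does not spell out its scoring procedure, so the following is only one plausible reading: errors are counted against the manually revised (reference) pronunciation with a unit-cost Levenshtein alignment, and phone level accuracy is taken as one minus the phone error rate, normalized by the number of reference phones. The function names `align_counts` and `score_dictionary` are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost
    Levenshtein alignment of the generated phones `hyp` against the
    reference phones `ref` (all edit costs are 1)."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # reference phone missing
                             dist[i][j - 1] + 1)         # extra generated phone
    # Walk back through the table to classify each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def score_dictionary(reference, generated):
    """Score a generated dictionary against a reference dictionary.
    Both arguments map a word to its list of phones."""
    correct_words = ref_phones = subs = dels = ins = 0
    for word, ref in reference.items():
        s, d, i = align_counts(ref, generated[word])
        subs, dels, ins = subs + s, dels + d, ins + i
        ref_phones += len(ref)
        correct_words += int(s + d + i == 0)  # one wrong phone fails the word
    phone_error_rate = (subs + dels + ins) / ref_phones
    return {
        "word_accuracy": correct_words / len(reference),
        "phone_accuracy": 1.0 - phone_error_rate,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }
```

For instance, scoring the generated /b a g a y i m a n a/ against the reference /b a g ay m a n a/ yields one substitution and two insertions under this convention, and scoring /b a p/ against /b e a p e/ yields two deletions.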

[5] Available for download at http://www.itl.nist.gov/iad/mig/tools/

Figure 2. The Top 20 Pairs of Phone Substitutions

The substitutions (Figure 2) mostly happen between phones /ax/ and /e/, which account for 5,812 occurrences (1.45%). Both phones are sounds of the letter “e”. Based on the knowledge mentioned in Table 4, the letter “e” appearing in a prefix, such as “me-”, “pe-”, “ke-”, “ter-”, “ber-”, and “se-”, must be pronounced as /ax/. The pronunciation of a letter “e” that is not part of such a prefix is determined as phone /e/ or /ax/ based on a statistical calculation for these phones in the manually revised dictionary. For example, the letter “e” that follows the letter “c” is more frequently pronounced as /ax/ than /e/ in the manually revised dictionary. So, the pronunciation dictionary generator should assign phone /ax/ to a letter “e” that follows the letter “c”.

Figure 3. The Statistics of Phone Deletions

The deletions (Figure 3) mostly happen for phone /e/, which has 726 occurrences (0.18%). Most of the cases happen in the pronunciation of abbreviations. For example, according to the algorithm illustrated in Figure 1, the word “bap” is transcribed as /b a p/. However, it is actually pronounced as /b e a p e/, so the phone /e/ is counted as deleted.

Figure 4. The Statistics of Phone Insertions

The number of insertions is shown in Figure 4. It shows that the phone /e/ has 670 occurrences (0.17%), /a/ has 630 occurrences (0.16%), and /y/ has 527 occurrences (0.13%) of phone insertion. Most of the cases involve diphthong and non-diphthong vowel sequences. For example, the pronunciation of the word “bagaimana” is generated automatically as /b a g a y i m a n a/, whereas it is supposed to be pronounced as /b a g ay m a n a/. In this case, phones /a/ and /y/ are the inserted phones.

Table 10. The ASR Performance
Pronunciation Dictionary            WER    OOV
manually revised dictionary         20.7%  2.38%
automatically generated dictionary  22.4%  2.38%

The performance of the ASR system using both dictionaries is shown in Table 10. The manually revised dictionary produces 20.7% Word Error Rate (WER), while the automatically generated dictionary produces 22.4% WER. The manually revised dictionary is 1.7 percentage points better than the automatically generated dictionary because it maps every phone in every word correctly. In addition, the manually revised dictionary is enriched with pronunciation variants, so that it covers the possible pronunciations of every word. However, the difference is small, so we can still use the automatically generated dictionary without manual revision to get similar results. We can thus reduce the time and resources required to revise the dictionary and still obtain acceptable results.

5. Conclusion

Our study evaluates the ASR system for Bahasa Indonesia using two different kinds of pronunciation dictionaries. The first dictionary is checked manually and the second one is generated automatically. The performance of the ASR system using the manually revised dictionary is only slightly better than that of the system using the automatically generated dictionary. So, the automatically generated pronunciation dictionary is an alternative to the manually revised one, since the resulting ASR performance is comparable.

In the future we intend to include informal Indonesian words, which are often used in daily conversation. Such additional words will enrich our pronunciation dictionary so that our ASR system can be expected to recognize both formal and informal Indonesian speech.
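As a supplementary illustration, the rule-based generator of Section 2 can be sketched in a few lines of Python. This is not the authors' implementation: it covers only rules 1, 3, 4, and 5 of Table 4, leaves out the statistical rule 2 and the abbreviation branch of Figure 1, and works letter by letter rather than on vowel and consonant chains. The name `to_phones` and the constants are hypothetical, and the digraph list is an assumption drawn from the pair symbols in Table 1.

```python
# Hypothetical, simplified letter-to-phone sketch for Indonesian words.
# Covers rules 1, 3, 4 and 5 of Table 4 only; rule 2 (statistical choice
# for "ai", "au", "ei", "oi") and abbreviation handling are omitted.

PREFIXES = ("ter", "ber", "me", "pe", "ke", "se")          # rule 1
GLIDE_PAIRS = {"ae", "ia", "iu", "ie", "io", "ea", "eo"}   # rule 3
FINAL_DEVOICING = {"b": "p", "d": "t", "g": "k"}           # rule 4
DIGRAPHS = ("ng", "ny", "kh", "sy")                        # assumed from Table 1


def to_phones(word):
    """Return a list of phone symbols for `word`."""
    word = word.lower()
    # Rule 1: locate the "e" of a listed prefix, if the word starts with one.
    schwa_index = -1
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            schwa_index = prefix.index("e")
            break

    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(pair)              # rule 5: two letters, one phone
            i += 2
            continue
        letter = word[i]
        if i == schwa_index:
            phones.append("ax")              # rule 1
        elif i == len(word) - 1 and letter in FINAL_DEVOICING:
            phones.append(FINAL_DEVOICING[letter])   # rule 4
        else:
            phones.append(letter)            # rule 5: one letter, one phone
        if pair in GLIDE_PAIRS:
            phones.append("y")               # rule 3: glide between the vowels
        i += 1
    return phones
```

With these rules the sketch reproduces the Table 4 examples for “menari”, “biasa”, “abad”, and “batang”. It does not reproduce entries that depend on the statistical rule, such as the /ax/ in “cerita” in Table 5.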

References

Abdul Chaer. 2003. Linguistik Umum. PT. Rineka Cipta. Jakarta, Indonesia.

Amalia Zahra. 2008. Penyusunan Kamus Fonetik dalam Pengembangan Sistem Pengenalan Suara Otomatis untuk Bahasa Indonesia (Developing Indonesian Phonetic Dictionary for Automatic Speech Recognition). Bachelor Thesis. Depok, Indonesia.

H. Satori, M. Harti, and N. Chenfour. 2007. Introduction to Arabic Speech Recognition Using CMUSphinx System. http://dblp.unitrier.de/rec/bibtex/journals/corr/abs-0704-2083.

Mirna Adriani and Sadar Baskoro. 2008. Developing an Indonesian Speech Recognition System. Proceedings of Second Malindo Workshop, Cyberjaya.

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical Language Modeling Using the CMU-Cambridge Toolkit. Proceedings of ESCA Eurospeech.

S. P. Kishore, et al. 2005. Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems. Proceedings of International Conference on Speech and Computer (SPECOM), Patras, Greece.

Tanel Alumae. 2004. Estonian Speech Recognition Experiments Using The SpeechDAT-like Database. Fonetiikan Päivät, The Phonetics Symposium 2004: 65-68.

Vincent Wan, John Dines, Asmaa El Hannani, and Thomas Hain. 2008. Bob: A lexicon and pronunciation dictionary generator. Proceedings of Workshop on Spoken Language Technology (SLT). Goa, India.

Wilbert Jan Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. PhD thesis. Rijksuniversiteit, Groningen.

Willie Walker, et al. 2004. Sphinx-4: A Flexible Open Source Framework for Speech Recognition. SMLI TR 2004-0811.