
PronouncUR: An Urdu Pronunciation Lexicon Generator

Haris Bin Zia¹, Agha Ali Raza¹, Awais Athar²
¹ Information Technology University, 6th Floor, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan
² EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
{haris.zia, agha.ali.raza}@itu.edu.pk [email protected]

Abstract

State-of-the-art speech recognition systems rely heavily on three basic components: an acoustic model, a pronunciation lexicon and a language model. To build these components, a researcher needs linguistic as well as technical expertise, which is a barrier in low-resource domains. Techniques to construct these three components without expert domain knowledge are in great demand. Urdu, despite having millions of speakers all over the world, is a low-resource language in terms of standard publicly available linguistic resources. In this paper, we present a grapheme-to-phoneme conversion tool for Urdu that takes a list of Urdu words and generates a pronunciation lexicon in a form suitable for use with speech recognition systems. The tool predicts the pronunciation of words using an LSTM-based model trained on a handcrafted expert lexicon of around 39,000 words and shows an accuracy of 64% upon internal evaluation. For external evaluation on a speech recognition task, we obtain a word error rate comparable to the one achieved using a fully handcrafted expert lexicon.

Keywords: Pronunciation Lexicon, Pronunciation Modeling, Lexicon Learning, Speech Recognition, Urdu

1. Introduction

Automatic Speech Recognition (ASR) for resource-scarce languages has been an active research area in the past few years (Sherwani, 2009; Qiao et al., 2010; Chan & Rosenfeld, 2012). Modern speech recognition systems usually require three resources: transcribed speech for acoustic modeling, a large text corpus for language modeling, and a pronunciation lexicon that maps words to sub-word units known as phonemes. The pronunciation lexicon acts as the link connecting the language model with the acoustic model.

While it is comparatively easy to gather transcribed speech waveforms and large text datasets, developing a pronunciation dictionary is quite expensive and requires a tremendous amount of manual effort and linguistic expertise. Therefore, development of a pronunciation lexicon is the bottleneck when building ASR systems for low-resource languages. Techniques that reduce the need for expert knowledge in the design and development of pronunciation lexicons are in great demand.

We are interested in developing a pronunciation lexicon generation tool for Urdu, an Indo-Aryan language spoken widely, with over 100 million speakers (https://www.ethnologue.com/language/urd). Urdu is an official language of Pakistan. Its writing system is segmental and, more specifically, an abjad, i.e. only consonants are marked while vowels (diacritics) are optional. Urdu follows the Perso-Arabic script, written from right to left. A sentence written in Urdu along with its English translation is given below:

اردو پاکستان کی قومی زبان ہے ۔
Urdu is the national language of Pakistan.

ASR research for Urdu exhibits a number of challenges, which are discussed in detail in subsequent sections. Despite being spoken by millions of speakers all over the world, Urdu is low-resource in terms of standard publicly available linguistic resources.

To the best of our knowledge, our Urdu pronunciation lexicon generation tool is the first of its kind, making it easier for researchers to work on Urdu speech recognition systems without prior linguistic knowledge.

The remainder of the paper is structured as follows. Section 2 reviews similar work for other world languages. We then present Urdu orthography and the Urdu phonetic inventory in Section 3. Section 4 briefly discusses challenges in Urdu pronunciation modeling. We present our tool in Section 5 and conclude in Section 6.

2. Literature Review

There exists a range of research focusing on lexical resources and tools available for different world languages for pronunciation modeling in speech recognition tasks.

• CMUdict (https://github.com/cmusphinx/cmudict), the Carnegie Mellon pronunciation dictionary, is an open-source pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations (Weide, 1998). There is also a lexicon generation tool (http://www.speech.cs.cmu.edu/tools/lextool.html) available that uses CMUdict.

• Tan & Ranaivo-Malançon (2009) proposed a rule-based grapheme-to-phoneme tool generating a pronunciation dictionary for Malay. Their ASR system, trained on a Malay read speech corpus using the tool-generated pronunciation dictionary, achieved a word error rate (WER) of 16.5%.

• A Bengali pronunciation dictionary (https://github.com/googlei18n/language-resources/blob/master/bn/data/lexicon.tsv) was developed under the Google Internationalization Project (https://developers.google.com/international/) (Gutkin et al., 2016). The dictionary contains around 65,000 words that were manually transcribed into their phonemic representation by a team of five linguists.

• Pronunciation lexicons were developed for the Amharic, Swahili and Wolof languages under the ALFFA Project (http://alffa.imag.fr/) and were made publicly available (https://github.com/besacier/ALFFA_PUBLIC) (Gauthier et al., 2016).

• Mandarin Chinese Phonetic Segmentation and Tone is a publicly available corpus (https://catalog.ldc.upenn.edu/LDC2015S05) of 7,849 Mandarin Chinese utterances and their phonetic segmentation. The corpus can be used for pronunciation modeling of Mandarin Chinese.

• Arabic Speech Recognition Pronunciation Dictionary is a publicly available pronunciation dictionary (https://catalog.ldc.upenn.edu/LDC2017L01) for Modern Standard Arabic (MSA) that contains 526,000 words and two million pronunciations.

• Masmoudi et al. (2014) presented a Tunisian Arabic phonetic dictionary based on a set of phonetic rules and a manually tagged lexicon of exceptions (for words that do not follow the phonetic rules).

• Egyptian Colloquial Arabic Lexicon (https://catalog.ldc.upenn.edu/LDC99L22) is a publicly available pronunciation dictionary of Egyptian Colloquial Arabic (ECA); it contains 51,202 words and their pronunciations.

• The Georgetown Dictionary of Iraqi Arabic (http://press.georgetown.edu/book/languages/georgetown-dictionary-iraqi-arabic) is a modern, up-to-date, publicly available dialectal Arabic language resource that can be used for pronunciation modeling of Iraqi Arabic. It contains 17,500 Iraqi-Arabic entries along with their IPA pronunciations.

• Bonaventura et al. (1998) presented a grapheme-to-phone conversion system for Spanish that can be used to supply phonetic transcriptions to a speech recognizer.

• Mendonça & Aluisio (2014) proposed a hybrid approach based on manual transcription rules and machine learning algorithms to build a machine-readable pronunciation dictionary for Brazilian Portuguese. The dictionary as well as the algorithms used to build it were made publicly available (https://github.com/gustavoauma/aeiouado_g2p).

Pronunciation dictionaries developed under the GlobalPhone Project (Schultz & Schlippe, 2014) are also available for research and commercial purposes in 20 different languages, including German, French, Russian, Korean, Turkish, Chinese and Thai.

3. Urdu Language

3.1 Orthography

Urdu is written in the Perso-Arabic script in a cursive format (Nastaliq style) from right to left using an extended Arabic character set. The character set includes 37 basic and 4 secondary letters, 7 diacritics, punctuation marks and special symbols (Hussain & Afzal, 2001; Afzal & Hussain, 2001; Hussain, 2004) (see Appendix A).

3.2 Phonetics

Urdu has a very rich phonetic inventory (http://www.cle.org.pk/Downloads/ling_resources/phoneticinventory/UrduPhoneticInventory.pdf): combinations of Urdu letters and diacritics realize 44 consonants (28 non-aspirated and 16 aspirated), 7 long vowels, 7 nasalized long vowels, 3 half-long vowels, 3 short vowels and 3 nasalized short vowels (Saleem et al., 2002; Hussain, 2007; Hussain, 2004). Since speech recognition systems require sounds to be represented in a phonemic notation such as IPA (https://www.internationalphoneticassociation.org/) or SAMPA (http://www.phon.ucl.ac.uk/home/sampa/), we have used CISAMPA (Case Insensitive Speech Assessment Methods Phonetic Alphabet), proposed by Raza et al. (2010), to represent Urdu phonemes (see Appendix B).

4. Challenges in Urdu Pronunciation Modeling

Pronunciation modeling for Urdu exhibits a number of challenges:

Dialects: Due to the large user base and variety of speakers, there are variations in dialect, leading to large variations in pronunciation and phonetics.

Script: In Urdu, diacritics inform the reader of the short vowels accompanying each written consonant, but commonly used Urdu script generally does not contain diacritics. Speakers can distinguish words through context and experience, but some constructions may still be ambiguous. For instance, the word اس can mean either 'this' (اِس) or 'that' (اُس), with IPA representations /ɪs/ and /ʊs/ respectively.

Morphology: Urdu is a morphologically rich language; combinations of affixes and stems result in a large vocabulary of words.

Dual Behavior: Three Urdu characters show dual behavior, i.e. both consonantal and vocalic, depending on their position of occurrence (Hussain, 2004).

5. PronouncUR

We have developed PronouncUR, an Urdu grapheme-to-phoneme tool based on a G2P model (cf. Section 5.2) that takes a list of Urdu words and generates a pronunciation lexicon in a form suitable for use with speech recognition systems. PronouncUR is freely available online (http://lextool.csalt.itu.edu.pk).

5.1 Lexicon

To train our model we have developed a lexicon of approximately 46K words. The lexicon has been tagged by trained transcription experts, carefully considering the letter-to-sound rules for Urdu proposed by Hussain (2004).

The format of the training lexicon is straightforward. Each line consists of one word form and its pronunciation, separated by a tab. A small portion of the training lexicon is given in Table 1.

فولاد    F O L A_A D_D
عَلامات    A L A_A M A_A T_D
جائیداد    D_Z A_A I_I D_D A_A D_D
لَڑکِیوں    L A R_R K I J O_O_N
درویشی    D_D A R V A_Y S_H I_I
الجھاؤ    U L D_Z_H A_A O_O
رکوا    R U K V A_A
ایران    I_I R A_A N
خریدی    X A R I_I D_D I_I
آفات    A_A F A_A T_D
فریاد    F A R J A_A D_D
عراقی    I R A_A Q I_I

Table 1: Training Lexicon

5.2 G2P Model

Grapheme-to-phoneme (G2P) conversion is the task of translating an input sequence of graphemes (letters) into an output sequence of phonemes.

Graphemes    ب َ ن
Phonemes    B A N

Table 3: An example of grapheme-to-phoneme translation

Given the success of sequence-to-sequence learning (Sutskever et al., 2014) and the power of LSTMs for sequence modeling (Hochreiter & Schmidhuber, 1997), we chose an LSTM for grapheme-to-phoneme conversion, as proposed by Yao & Zweig (2015). We used an open-source G2P toolkit (https://github.com/cmusphinx/g2p-seq2seq) to train our G2P model with 2 LSTM layers and 512 hidden units in each layer.
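The grapheme and phoneme sequences of Table 3 can be derived from a lexicon entry roughly as sketched below. This is an illustrative sketch only, not the preprocessing of the toolkit used in the paper: treating every Unicode code point (diacritics included) as one grapheme, the special token names `<pad>`, `<s>`, `</s>`, and the helper names `to_training_pair`, `build_vocab` and `encode` are all assumptions made for this example.

```python
def to_training_pair(word, pronunciation):
    """Split one lexicon entry into (grapheme sequence, phoneme sequence)."""
    graphemes = list(word)            # assumption: one grapheme per code point
    phonemes = pronunciation.split()  # CISAMPA symbols are space-separated
    return graphemes, phonemes


def build_vocab(sequences):
    """Map every symbol to an integer id; ids 0-2 are reserved special tokens."""
    symbols = sorted({sym for seq in sequences for sym in seq})
    specials = ["<pad>", "<s>", "</s>"]  # hypothetical token names
    return {sym: idx for idx, sym in enumerate(specials + symbols)}


def encode(sequence, vocab):
    """Wrap a sequence in start/end tokens and convert it to integer ids."""
    return [vocab["<s>"]] + [vocab[sym] for sym in sequence] + [vocab["</s>"]]


# Table 3 example: be + zabar + noon -> B A N
graphemes, phonemes = to_training_pair("\u0628\u064e\u0646", "B A N")
phoneme_vocab = build_vocab([phonemes])
encoded = encode(phonemes, phoneme_vocab)
print(graphemes, phonemes, encoded)
```

A sequence-to-sequence model would consume the encoded grapheme ids on the encoder side and be trained to emit the encoded phoneme ids on the decoder side.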

Figure 1: An encoder-decoder LSTM with two layers.

Figure 1 shows a sample of the model, where the encoder LSTM is on the left of the dotted line while the decoder is on the right.

Out of the 67 phonemes in the Urdu phonetic inventory (see Appendix B), the training lexicon of Section 5.1 currently covers 64, while work is in progress to include the 3 nasalized short vowels. The phonemes M_H and J_H occur very rarely in Urdu and thus have only one entry each in the training lexicon; for the rest of the phonemes, the frequency of occurrence is given in Table 2.
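A phoneme frequency count of the kind reported for the training lexicon can be computed directly from the tab-separated lexicon format. The sketch below is a minimal illustration under stated assumptions: the three sample entries are taken from Table 1, while `INVENTORY_SUBSET` is a hypothetical slice of the inventory chosen for the example, and the function names are not from the paper.

```python
from collections import Counter


def parse_lexicon(lines):
    """Parse 'word<TAB>pronunciation' lines into (word, [phonemes]) pairs."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        word, pronunciation = line.split("\t", 1)
        entries.append((word, pronunciation.split()))
    return entries


def phoneme_frequencies(entries):
    """Count how often each phoneme symbol occurs across all entries."""
    counts = Counter()
    for _, phonemes in entries:
        counts.update(phonemes)
    return counts


# Three entries from Table 1, written with escapes to keep code points explicit.
SAMPLE = [
    "\u0627\u06cc\u0631\u0627\u0646\tI_I R A_A N",  # Iran
    "\u0631\u06a9\u0648\u0627\tR U K V A_A",
    "\u0622\u0641\u0627\u062a\tA_A F A_A T_D",
]

# Hypothetical subset of the 67-phoneme inventory, for illustration only.
INVENTORY_SUBSET = {"I_I", "R", "A_A", "N", "U", "K", "V", "F", "T_D", "M_H", "J_H"}

entries = parse_lexicon(SAMPLE)
counts = phoneme_frequencies(entries)
missing = sorted(INVENTORY_SUBSET - set(counts))  # phonemes with no coverage

print(counts.most_common())
print("not covered:", missing)
```

Run against a full lexicon and the full inventory, the `missing` list would expose phonemes that the lexicon does not yet cover, and low counts would flag rarely occurring phonemes.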

Table 2: Frequency Distribution of Phonemes in Training Lexicon

17 https://github.com/cmusphinx/g2p-seq2seq
18 http://csalt.itu.edu.pk/PRUSCorpus/index.html
19 https://cmusphinx.github.io/

[...] gram language model using the training data transcripts was applied during decoding. Using the lexicon generated by our lexicon tool, we obtained a word error rate (~19%) that approaches the rate achieved using a fully handcrafted expert lexicon. We used the same train/test split as Raza et al. (2010), so the results are directly comparable.

6. Conclusion and Future Work

We presented an online pronunciation lexicon generation tool for Urdu that can be used to generate a pronunciation lexicon for use with speech recognition systems. Experimental results showed that a pronunciation lexicon generated by the tool performs comparably to a handcrafted expert lexicon in speech recognition tasks.

As a future direction, we will look into ways to decrease the WER obtained with the lexicon tool, e.g. increasing coverage in the training lexicon, increasing the size of the training lexicon, adding support for nasalized short vowels and increasing the coverage of rarely occurring phonemes.

7. Acknowledgements

We would like to thank Atique-ur-Rehman for providing us with cloud hosting and Murtaza Azam Khan for his help with the frontend.

8. Bibliographical References

Afzal, M., & Hussain, S. (2001). Urdu computing standards: development of Urdu Zabta Takhti (UZT) 1.01. In Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century. Proceedings. IEEE International (pp. 216-222). IEEE.

Aminzadeh, A. R., & Shen, W. (2008, December). Low-resource speech translation of Urdu to English using semi-supervised part-of-speech tagging and transliteration. In Spoken Language Technology Workshop, 2008. SLT 2008. IEEE (pp. 265-268). IEEE.

Bonaventura, P., Giuliani, F., Garrido, J. M., & Ortin, I. (1998, August). Grapheme-to-phoneme transcription rules for Spanish, with application to automatic speech recognition and synthesis. In Proceedings of the Workshop on Partially Automated Techniques for Transcribing Naturally Occurring Continuous Speech (pp. 33-39). Association for Computational Linguistics.

Chan, H. Y., & Rosenfeld, R. (2012, March). Discriminative pronunciation learning for speech recognition for resource scarce languages. In Proceedings of the 2nd ACM Symposium on Computing for Development (p. 12). ACM.

Gutkin, A., Ha, L., Jansche, M., Pipatsrisawat, K., & Sproat, R. (2016, May). TTS for Low Resource Languages: A Bangla Synthesizer. In LREC.

Gauthier, E., Besacier, L., Voisin, S., Melese, M., & Elingui, U. P. (2016, May). Collecting resources in sub-saharan african languages for automatic speech recognition: a case study of wolof. In 10th Language Resources and Evaluation Conference (LREC 2016).

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

Hussain, S., & Afzal, M. (2001). Urdu computing standards: Urdu zabta takhti (uzt) 1.01. In Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century. Proceedings. IEEE International (pp. 223-228). IEEE.

Hussain, S. (2004, August). Letter-to-sound conversion for Urdu text-to-speech system. In Proceedings of the workshop on computational approaches to Arabic script-based languages (pp. 74-79). Association for Computational Linguistics.

Hussain, S. (2007). Phonetic correlates of lexical stress in Urdu (Doctoral dissertation, UMI Ann Arbor).

Masmoudi, A., Khmekhem, M. E., Esteve, Y., Belguith, L. H., & Habash, N. (2014, May). A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition. In LREC (pp. 306-310).

Mendonça, G., & Aluisio, S. (2014). Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese. In Fifteenth Annual Conference of the International Speech Communication Association.

Qiao, F., Sherwani, J., & Rosenfeld, R. (2010, December). Small-vocabulary speech recognition for resource-scarce languages. In Proceedings of the First ACM Symposium on Computing for Development (p. 3). ACM.

Raza, A. A., Hussain, S., Sarfraz, H., Ullah, I., & Sarfraz, Z. (2009, August). Design and development of phonetically rich Urdu speech corpus. In Speech Database and Assessments, 2009 Oriental COCOSDA International Conference on (pp. 38-43). IEEE.

Raza, A. A., Hussain, S., Sarfraz, H., Ullah, I., & Sarfraz, Z. (2010). An ASR system for spontaneous Urdu speech. The Proc. of Oriental COCOSDA, 24-25.

Saleem, A. M., Kabir, H. A. S. A. N., Riaz, M. K., Rafique, M. M., Khalid, N. A. U. M. A. N., & Shahid, S. R. (2002). Urdu consonantal and vocalic sounds. CRULP Annual Student Report.

Sherwani, J. (2009). Speech interfaces for information access by low literate users (Doctoral dissertation, Carnegie Mellon University).

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

Schultz, T., & Schlippe, T. (2014, May). GlobalPhone: Pronunciation Dictionaries in 20 Languages. In LREC (pp. 337-341).

Tan, T. P., & Ranaivo-Malançon, B. (2009). Malay grapheme to phoneme tool for automatic speech recognition. In Proc. Workshop of Malaysia and Indonesia Language Engineering (MALINDO) 2009.

Weide, R. L. (1998). The CMU pronouncing dictionary. URL: http://www.speech.cs.cmu.edu/cgibin/cmudict.

Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196.

9. Language Resource References

Ali, Ahmed. Arabic Speech Recognition Pronunciation Dictionary LDC2017L01. Web Download. Philadelphia: Linguistic Data Consortium, 2017.

Kilany, Hanaa, et al. Egyptian Colloquial Arabic Lexicon LDC99L22. Web Download. Philadelphia: Linguistic Data Consortium, 1997.

Yuan, Jiahong, Neville Ryant, and Mark Liberman. Mandarin Chinese Phonetic Segmentation and Tone LDC2015S05. Web Download. Philadelphia: Linguistic Data Consortium, 2015.

Appendix A

ا ب پ ت ٹ ث ج چ
ح خ د ڈ ذ ر ڑ ز
ژ س ش ص ض ط ظ ع
غ ف ق ک گ ل م ن
و ہ ء ی ے

Table A1: Basic Urdu Letters

آ ں ۃ ھ

Table A2: Secondary Urdu Letters

◌َ  ◌ِ  ◌ُ  [...]

Table A3: Urdu Diacritics

Appendix B

Sr. No.    Urdu Letter    IPA    CISAMPA

Consonants
1    پ    p    P
2    پھ    pʰ    P_H
3    ب    b    B
4    بھ    bʰ    B_H
5    م    m    M
6    مھ    mʰ    M_H
7    ت،ط    t̪    T_D
8    تھ    t̪ʰ    T_D_H
9    د    d̪    D_D
10    دھ    d̪ʰ    D_D_H
11    ٹ    t    T
12    ٹھ    tʰ    T_H
13    ڈ    d    D
14    ڈھ    dʰ    D_H
15    ن    n    N
16    نھ    nʰ    N_H
17    ک    k    K
18    کھ    kʰ    K_H
19    گ    ɡ    G
20    گھ    ɡʰ    G_H
21    ن in نک،نکھ،نگ،نگھ    ŋ    N_G
22    ق    q    Q
23    ع    ʔ    Y
24    ف    f    F
25    و    v    V
26    س    s    S
27    ذ،ز،ض،ظ    z    Z
28    ش    ʃ    S_H
29    ژ    ʒ    Z_Z
30    خ    x    X
31    غ    ɣ    G_G
32    ح،ہ    h    H
33    ل    l    L
34    لھ    lʰ    L_H
35    ر    r    R
36    رھ    rʰ    R_H
37    ڑ    ɽ    R_R
38    ڑھ    ɽʰ    R_R_H
39    ی    j    J
40    یھ    jʰ    J_H
41    چ    tʃ    T_S
42    چھ    tʃʰ    T_S_H
43    ج    dʒ    D_Z
44    جھ    dʒʰ    D_Z_H

Vowels
45    ◌ُو    uː    U_U
46    و    oː    O_O
47    ◌َو    ɔː    O
48    آ،ا    aː    A_A
49    ی    iː    I_I
50    ے    eː    A_Y
51    ◌َے    æː    A_E
52    ◌ُوں    ũː    U_U_N
53    وں    õː    O_O_N
54    ◌َوں    ɔ̃ː    O_N
55    آں،اں    ãː    A_A_N
56    ◌ِیں    ĩː    I_I_N
57    یں    ẽː    A_Y_N
58    ◌َیں    æ̃ː    A_E_N
59    ◌ِہ    eˑ    A_Y_H
60    ◌َہ    æˑ    A_E_H
61    ◌ُہ    oˑ    O_O_H
62    ◌ِ    ɪ    I
63    ◌ُ    ʊ    U
64    ◌َ، ء    ə    A
65    ◌ِں    ɪ̃    I_N
66    ◌ُں    ʊ̃    U_N
67    ◌َں    ə̃    A_N

Table B1: Urdu Letters with IPA and CISAMPA
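The CISAMPA-to-IPA correspondence of Table B1 lends itself to a simple lookup, sketched below. This is a minimal illustration rather than part of the published tool: the dictionary holds only a dozen rows of the table, the function name `cisampa_to_ipa` is hypothetical, and upper-casing the input is an assumption motivated by the notation being case insensitive.

```python
# A small subset of Table B1 (CISAMPA symbol -> IPA), for illustration only.
CISAMPA_TO_IPA = {
    "R": "r",
    "K": "k",
    "V": "v",
    "N": "n",
    "L": "l",
    "J": "j",
    "U": "\u028a",         # short u
    "I": "\u026a",         # short i
    "A": "\u0259",         # schwa
    "R_R": "\u027d",       # retroflex flap
    "A_A": "a\u02d0",      # long a
    "I_I": "i\u02d0",      # long i
}


def cisampa_to_ipa(pronunciation):
    """Convert a space-separated CISAMPA pronunciation to an IPA string."""
    symbols = pronunciation.upper().split()  # assumption: case is ignored
    unknown = [sym for sym in symbols if sym not in CISAMPA_TO_IPA]
    if unknown:
        raise KeyError(f"symbols missing from the lookup table: {unknown}")
    return "".join(CISAMPA_TO_IPA[sym] for sym in symbols)


# Two Table 1 pronunciations; the second is given in lower case on purpose.
print(cisampa_to_ipa("I_I R A_A N"))
print(cisampa_to_ipa("r u k v a_a"))
```

Splitting on whitespace before the lookup matters because multi-character symbols such as A_A and R_R share prefixes with single-character ones.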