Pashto Spoken Digits Database for the Automatic Speech Recognition Research
Proceedings of the 18th International Conference on Automation & Computing, Loughborough University, Leicestershire, UK, 8 September 2012

Arbab Waseem Abbas¹, Nasir Ahmad¹, Hazrat Ali²
¹ Department of Computer Systems Engineering
² Department of Electrical & Electronics Engineering
University of Engineering and Technology, Peshawar, Pakistan
[email protected], [email protected]

Abstract--This paper presents the development of a Pashto Spoken Digits database for automatic speech recognition research. This is, to the best of the authors' knowledge, the first Pashto isolated digits database. The database consists of Pashto digits from zero (sefer) to hundred (sul) uttered by sixty speakers, 30 male and 30 female, aged 18 to 60 years. The recordings were made in a noise-free environment using a Sony PCM-M10 Linear Recorder. The audio is stored in .wav format and transferred to a laptop via a USB cable. After editing, the audio is split into individual digits using Adobe Audition ver. 1.0. Isolated digit recognition experiments are then performed on a subset of the database containing the first 11 digits and 18 speakers. Mel Frequency Cepstral Coefficients (MFCC) were used as the feature vector, and a Linear Discriminant Analysis (LDA) based classifier was used for classification.

Keywords-Pashto speech recognition; Pashto acoustic database; Pashto isolated word database.

I. INTRODUCTION

The world is moving towards computerization, and humans increasingly interact with machines in their daily lives, so there is a strong need to make this human-computer interaction more natural and pervasive [1]. To achieve this goal, the development of listening [2], speaking [3] and viewing [4], [5] machines has become a very active area of research. Research on listening machines is popularly known as Automatic Speech Recognition (ASR). Obtaining such capabilities in local languages is the next stage of this technological advancement, so that everyone can communicate with computers in a natural way, and a number of research works on ASR for local languages have been reported in recent years. Hejazi [2] presented an isolated digit recognition system for the Persian language. He used a Hidden Markov Model (HMM) to decompose a word into small parts, to solve the problem arising from the pronunciation similarity of some Persian digits that are composed of very similar phonetic and spectral components. For recognition, a Support Vector Machine (SVM) is used to segment the input word and to find an entry with the maximum number of similar segments. In Arabic, automatic recognition of spoken digits was described by Yousef Ajami [6]. In that paper, the system was designed to recognize isolated whole-word speech of the ten digits, zero through nine, in Arabic. The Hidden Markov Model Toolkit (HTK) was used to implement the isolated word recognizer with phoneme-based HMM models. In the training and testing phases of this system, the isolated digit data sets were taken from the telephony Arabic speech database SAAVB. This standard database was developed by KACST and is classified as a noisy speech database. A hidden Markov model based speech recognition system was designed and tested with automatic Arabic digit recognition. For Urdu, corpus development was discussed by Huda Sarfraz [7], while the development of an isolated word database is presented in [8]. In [7], the development of an Urdu speech database for speaker-independent spontaneous speech recognition is presented, containing data from 82 speakers (42 male and 40 female) recorded over a microphone and a telephone line. The speech was collected from speakers ranging from 20 to 55 years of age. Recording sessions were conducted in office and home environments. For Indian languages, speech databases in Tamil, Telugu and Marathi were described by Anumanchipalli [9]. In that work, speech data was collected from about 560 speakers in these three languages. Preliminary speech recognition results using acoustic models created with the Sphinx 2 speech recognition toolkit were also presented. A speech translation system for American English and Pashto was developed by Andreas Kathol [10]. That paper discusses the issues encountered for Pashto in the areas of written representation, corpus creation, speech recognition, speech synthesis, and grammar development for translation.

General rules for corpus development are discussed by Sue Atkins [11]. That paper explains the principal aspects of corpus creation and the major decisions to be made when creating an electronic text corpus, as well as basic standards and features for corpus design in the initial stages of corpus building.

For research on natural language processing of any language, the first step is the development of a standard language resource, both for training and as a basic platform for performance evaluation. In ASR research, isolated word and digit recognition tasks are generally the first stages, posing relatively simple and well-defined tasks and a baseline for onward research [2], [6], [8]. This paper presents the development of an isolated digits database in the Pashto language. Initial ASR experiments have been carried out on the database using Mel-Frequency Cepstral Coefficients (MFCC) based features and a Linear Discriminant Analysis (LDA) classifier. The contents of the database, the recording setup and the results of initial experiments on the database are discussed in detail in the following sections.

II. PASHTO DIGITS DATABASE

Pashto, the national language of Afghanistan and one of the major languages of Pakistan, has an estimated 50-60 million speakers all over the world [12]. Pashto is divided into three dialects: Northern Pashto, Central Pashto and Southern Pashto [13]. Northern Pashto is spoken in most of Khyber Pakhtoonkhwa and its adjoining areas in Afghanistan, the provinces of Kunar and Nangarhar. It is also the most common dialect among the Pashtun diaspora, mostly in Canada, India, UAE, UK and the USA. Its alternate names are Pakhto, Pukhto and Yusufzai Pukhto. Central Pashto is spoken in the Waziristan, Bannu and Karak regions of the Khyber Pakhtoonkhwa province of Pakistan. Its alternate names are Waciri (Waziri) and Bannuchi (Bannochi, Bannu). Southern Pashto is spoken in the Quetta area of Balochistan province and also in Afghanistan, Iran, Tajikistan, United Arab Emirates, and United Kingdom. In this paper, Northern Pashto (Yusufzai dialect) has been used.

A. Database content

The first step in database development is the selection of content. A database for general-purpose research should contain all the sounds of the language, called phonemes.

A phoneme is the smallest meaningful unit of sound utterance [14]. Although no standard set of Pashto phonemes has yet been identified, 41 phonemes in Pashto have been reported in [15], including 35 consonants and 6 vowels. For an ASR database, all phonemes of the language need to be included while maintaining a statistical balance between their occurrences. However, for the digit recognition application the set of words is fixed, i.e. the digits, which may or may not contain all the phonemes of the language.

TABLE I. PASHTO DIGITS PATTERN

Digits   Pronunciation
0        Sefer
1        Yaw
2        Dwa
3        Dray
4        Celour
5        Penza
6        Shpeg
7        Owa
8        Ata
9        Naha
10       Las
11       Yawo-Las
12       Do-Las
17       Owa-Las
18       Ata-Las
19       Noo-Las
20       Shul
21       Yaw-Vesht
22       Dwa-Vesht
23       Dray-Vesht
24       Celour-Vesht
25       Penza-Vesht
26       Shpag-Vesht
27       Owa-Vesht
28       Ata-Vesht
29       Naha-Vesht
30       Dairsh
40       Celwaikht
50       Panzoos
60       Shpeta
70       Away
80       Atya
90       Nawi
100      Sul

The pattern of Pashto digits from zero to hundred is shown in Table I.

In the number system of Pashto, as in most other languages such as Persian [2], Arabic [6] and others [16], each digit from zero (sefer) to ten (las) has a different name. The numbers following las, yawo-las (11) to noo-las (19), contain the common postfix las attached to each digit from yaw (one) to naha (nine) with slight modification, while from yaw-vesht (21) to naha-vesht (29) the common postfix is vesht, with the digits from yaw (1) to naha (9) as a prefix. The same pattern is followed for all digits up to sul (100). There are some variations of this pattern among different regions and dialects. The numbers included here are those of the Yousafzai dialect.

III. DATABASE DEVELOPMENT

For the digits database the content is naturally fixed. As Pashto speakers also use numbers from other languages, such as Urdu and English, in their daily lives, the numbers from sefer (zero) to sul (100) were written on a sheet using the Liwal Pashto/Dari support software [17] to make recording easier for them.

A. Speaker selection

For the purpose of recording, speakers who could fluently and accurately pronounce the written digits in Pashto were selected. A page with the Pashto digits written on it was given to each speaker, who was asked to read it. The speaker's pronunciation was checked to confirm that the reading was correct and that the pronunciation was in the Yousafzai dialect.

B. Recording environment

The recording environment, such as an office or a car, is an important aspect of the database. In this database, the recording was performed in an office environment, either in a faculty room or in the library of the University. Further, the signal-to-noise ratio was controlled by adjusting the recorder sensitivity.
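The text does not say how the signal-to-noise ratio was measured. As an illustration only, the sketch below uses the common power-ratio definition, SNR = 10 log10(P_speech / P_noise), comparing a segment containing speech with a noise-only segment taken from the same recording. The function names `mean_power` and `snr_db` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def mean_power(samples):
    """Return the average power (mean squared amplitude) of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Estimate the SNR in dB from a speech segment and a noise-only segment.

    Assumption (not from the paper): the speech segment's power is compared
    directly with the noise segment's power. The speech segment also contains
    background noise, so this slightly overestimates the true SNR.
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech <= 0 or p_noise <= 0:
        raise ValueError("both segments must have non-zero power")
    return 10.0 * math.log10(p_speech / p_noise)


# Hypothetical amplitudes: speech at 0.5, background noise at 0.05.
speech_segment = [0.5, -0.5] * 100
noise_segment = [0.05, -0.05] * 100
print(round(snr_db(speech_segment, noise_segment), 2))
```

A ten-fold amplitude difference, as in these made-up samples, corresponds to a hundred-fold power difference, i.e. about 20 dB. In practice the noise-only segment could be taken from the silence before a speaker starts reading.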