Automatic Phoneme Recognition for Bangla Spoken Language
B.Sc. in Computer Science and Engineering Thesis

Automatic Phoneme Recognition for Bangla Spoken Language

Submitted by
Israt Jahan Mouri 201014040
Tahsin Sultana 201014023
Md. Wahiduzzaman Khan 201014063

Supervised by
Dr. Mohammad Nurul Huda
Professor & MSCSE Coordinator
Department of Computer Science and Engineering
United International University (UIU), Dhaka, Bangladesh.

Department of Computer Science and Engineering
Military Institute of Science and Technology

CERTIFICATION

This thesis paper titled “Automatic Phoneme Recognition for Bangla Spoken Language”, submitted by the group mentioned below, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of B.Sc. in Computer Science and Engineering in December 2013.

Group Members:
Israt Jahan Mouri
Tahsin Sultana
Md. Wahiduzzaman Khan

Supervisor:
Dr. Mohammad Nurul Huda
Professor & MSCSE Coordinator
United International University (UIU), Dhaka, Bangladesh.

CANDIDATES’ DECLARATION

This is to certify that the work presented in this thesis paper is the outcome of the investigation and research carried out by the following students under the supervision of Dr. Mohammad Nurul Huda, Professor & MSCSE Coordinator, United International University (UIU), Dhaka, Bangladesh.

It is also declared that neither this thesis paper nor any part thereof has been submitted anywhere else for the award of any degree, diploma or other qualifications.

Israt Jahan Mouri 201014040
Tahsin Sultana 201014023
Md. Wahiduzzaman Khan 201014063

ACKNOWLEDGEMENT

We are thankful to Almighty Allah for his blessings for the successful completion of our thesis. Our heartiest gratitude, profound indebtedness and deep respect go to our supervisor, Dr. Mohammad Nurul Huda, Professor & MSCSE Coordinator, United International University (UIU), Dhaka, Bangladesh, for his constant supervision, affectionate guidance and great encouragement and motivation.
His keen interest in the topic and valuable advice throughout the study were of great help in completing the thesis. We are especially grateful to the Department of Computer Science and Engineering (CSE) of the Military Institute of Science and Technology (MIST) for providing their all-out support during the thesis work. Finally, we would like to thank our families and our course mates for their appreciable assistance, patience and suggestions during the course of our thesis.

Dhaka
December 2013

Israt Jahan Mouri
Tahsin Sultana
Md. Wahiduzzaman Khan

ABSTRACT

Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term “voice recognition” is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech.

For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small-vocabulary keyword recognition over dial-up telephone lines, to medium-size-vocabulary voice-interactive command and control systems on personal computers, to large-vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation.

In this paper, we prepare a Bangla phoneme recognition system for Bangla automatic speech recognition (ASR). Most Bangla ASR systems use a small number of speakers, but 30 speakers selected from a wide area of Bangladesh, where Bangla is used as a native language, are involved here. In the experiments, mel-frequency cepstral coefficients (MFCCs) are input to hidden Markov model (HMM) based classifiers to obtain phoneme recognition performance.
The experimental results show that the MFCC-based method with 39 dimensions provides a higher phoneme correct rate and accuracy. Moreover, it requires fewer mixture components in the HMMs. In addition, in this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of the Bangla language. Finally, we elaborate on the requirements for designing successful real-world applications and discuss the technical challenges that must be overcome in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.

TABLE OF CONTENTS

CERTIFICATION
CANDIDATES’ DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Introduction
  1.2 Background of research
  1.3 Goal of the thesis
  1.4 Significance of the Study

2 Speech Recognition
  2.1 Introduction
  2.2 Introduction to ASR Systems
  2.3 Applications of ASR Systems
  2.4 Speech Recognition Basics
  2.5 Overview of the Full System
  2.6 Types of Speech Recognition
  2.7 Conclusion

3 Hidden Markov Models (HMMs)
  3.1 Introduction
  3.2 Hidden Markov Models
  3.3 HMMs for Word Modeling
  3.4 Emission Probability Density of a Vector Sequence
  3.5 The Viterbi Algorithm
  3.6 Gaussian Probability Density Function
    3.6.1 Continuous Mixture Densities
  3.7 Conclusion

4 Bangla IPA Table
  4.1 Introduction
  4.2 Bangla Script
  4.3 Bangla IPA Table
    4.3.1 Vowels & Consonants
  4.4 Bangla Phoneme Schemes
    4.4.1 Bangla Phoneme
  4.5 Conclusion

5 Our contribution
  5.1 Introduction
  5.2 Why is ASR hard to accomplish?
  5.3 Preparation of Bangla Speech Corpus
  5.4 Feature Extraction and Phonemes
  5.5 Bangla Phonemes with Phonetic Symbols
  5.6 Bangla Words
  5.7 HTK Tool
    5.7.1 HCopy
    5.7.2 HList
    5.7.3 HCompV
    5.7.4 HRest
    5.7.5 HERest
    5.7.6 HHEd
    5.7.7 HParse
    5.7.8 HVite
    5.7.9 HResults
  5.8 Steps of the Experiment (Data preparation)
  5.9 Steps of the Experiment (Training)
    5.9.1 Mixture 1
    5.9.2 Mixture 2
    5.9.3 Mixture 4
    5.9.4 Mixture 8
    5.9.5 Mixture 16
    5.9.6 Mixture 32
  5.10 Steps of the Experiment (Testing)
  5.11 Conclusion

6 Conclusions and Future works
  6.1 Conclusion
  6.2 Future Works

References
A Batch HList.m

LIST OF FIGURES

2.1 General Solution
2.2 Overview of the Project
3.1 An HMM example state diagram. Three states with corresponding output distributions are interconnected by transitions with transition probabilities.
3.2 Typical left-to-right HMM for word recognition
3.3 Possible state sequences through the HMM
3.4 Scatterplot of vectors emitted by a Gaussian mixture density process
4.1 A chart of the full International Phonetic Alphabet
4.2 Bangla phonetic scheme in IPA (Vowels)
4.3 Bangla phonetic scheme in IPA (Consonants)
5.1 HCopy Command
5.2 HERest Command
5.3 HTK Processing Stages

LIST OF TABLES

2.1 Typical parameters used to characterize the capability of speech recognition systems
5.1 Bangla Vowels
5.2 Bangla Consonants
5.3 Some Bangla words with their orthographic transcriptions and IPA
LIST OF ABBREVIATIONS

AF : Articulatory Features
AM : Acoustic Model
ASR : Automatic Speech Recognition
BPF : Band Pass Filter
CMN : Cepstral Mean Normalization
DARPA : Defense Advanced Research Projects Agency
DCT : Discrete Cosine Transform
DPF : Distinctive Phonetic Feature
DCR : DPF Correct Rate
DSR : Distributed Speech Recognition
ETSI : European Telecommunications Standards Institute
EM : Expectation Maximization
FFT : Fast Fourier Transform
GS : Gram-Schmidt
GMM : Gaussian Mixture Model
GI : Gender-Independent
HNN : Hybrid Neural Network
HMM : Hidden Markov Model
HL : Hidden Layer
In/En : Inhibition/Enhancement
BNAS : Bangla Newspaper Article Sentences
JEIDA : Japan Electronic Industries Development Association
KLT : Karhunen-Loève Transform
LF : Local Feature
LM : Language Model
LR : Linear Regression
MFCC : Mel-Frequency Cepstral Coefficient
MLN : Multilayer Neural Network
OL : Output Layer
OOV : Out-of-vocabulary
PCR : Phoneme Correct Rate
PRA : Phoneme Recognition Accuracy
RNN : Recurrent Neural Network
SPINE : Speech recognition In Noisy Environments
SNR : Signal-to-noise Ratio

CHAPTER 1
INTRODUCTION

1.1 Introduction

Speech is one of the oldest and most natural means of information exchange between human beings. As humans, we speak and listen to each other through a human-to-human interface. For centuries people have tried to develop machines that can understand and produce speech as naturally as humans do [1] [2]. Such an interface would clearly yield great benefits [3]. Attempts have been made to develop vocally interactive computers to realise voice/speech recognition. In this case a computer can recognize text and give out a speech output [3]. Voice/speech recognition is a field of computer science that deals with designing computer systems that recognize spoken words. It is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone.
Speech recognition can be defined as the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words [4] [5]. Automatic speech recognition (ASR) is one of the fastest developing fields in speech science and engineering. As part of the new generation of computing technology, it is the next major innovation in man-machine interaction after text-to-speech (TTS) functionality, supporting interactive voice response (IVR) systems. ASR has evolved and improved because it has many applications in daily life, for example, telephone applications, applications for people with physical disabilities and for people who cannot read or write, and many others in the area of computer science.