A Spbech Rpcocnrrron Sysrbvr Usrxc a Npunnl Nbtworx Modbl Fon Vocal Shaprnc
Total Page:16
File Type:pdf, Size:1020Kb
A SpBECH RpcocNrrroN SysrBvr Usrxc A Npunnl NBTwoRx MoDBL Fon Vocal SHaprNc by Christopher D. Love A Thesis presented to the University of Manitoba in paniat fulfillment of the requirements for the degree of Master of Science in the Department of Electrical and Compuær Engineering V/innþg, Manitoba March 1991 A SPEECH RECOGNITION SYSTEM USING A NEURAL NETWORK MODEL FOR VOCA-L SHAPING BY CHRISTOPHER D. LOVE A thesis subnrined to thc Faculty of Craduate Studies of the University of Manitoba in partial fulfìllment of the requirenterìrs of the degree of MASTER OF SCIENCE @ 1991 Permissio¡r has been grarrted to the LIBRARY OF THE UNIVER- S¡TY OF MANITOBA to lend or sell copies of this thesis. to the NATIONAL LIBRARY OF CANADA to rnicrolilm this thesis and to lend or sell copies oí the film, and UNIVERSITY MICROF¡LMS to publisir an abstract of rhis thesis. The author rescryes other publication rights, and neither thc thesis nor extensive extracts frorn it may be pnnteC or other- wise reproduced without the author's written permission. ABSTRACT Computerized vocal shaping systems a¡e of considerable value and interest. This study is motivated by the need to improve: (i) recognition of previous vocal shaping systenìs in order to reduce disagreement between compuær and human evaluator, (ü) multi- level classification using five levels that correspond to human assessment, (iii) real-time recognition through the use of neural networks, and (iv) the human interface to achieve efficient interaction through the Macintosh graphical interface. A system has been developed to recognize a limited set of isolated utrerances on a finite scale in a quiet environment' A speech signal is first sampled and analyzed to obtain linear predictive coding (LPc) filter coefficiens which a¡e then presented as æmplates to a bacþropagation neural network' The LPC coefficients provide a sufficient spectral feature for high accuracy isolated speech recognition. A standard bacþropagation neu¡al network is used to improve recognition accuracy, to reduce classification time, and to improve multi-level grading results from the past vocal shaping system. The bacþropagation neural nerwork uses an adaptive learning rule, which is based on the total internal network error measure and has been shown to reduce the network raining time in obtaining a global solution. The experimental system includes a20Mfr12,68030 based Macintosh IIsi computer with a Mc68882 floating point coprocessor. The development software is c language (an object- oriented version). The new human interface developed for vocal shaping is an improvement from the previous model due to the configuration flexibiliry and ease of operation' The multi-level grading scheme and classification accuracy were evaluated using a vowel and consonant set which a¡e based on the relative locations of the fi¡st two formants' The vowel set provided good recognition classification results (6l.gvo)while the consonant data sets resulted in slightly lower classification results (4gvo). The muhi- level classification scheme provided good separation between utterance versions for the vowel set. ACKNOWLEDGEMENTS The author would like to acknowledge the rnany people who who conributed to this research. In particular, credit is given to my advisor Dr. W. Kinsner who helped guide me through the thesis process and helped me to understand the true meaning of research. Also,I would like to tÌ¡ank my fellow colleagues, Ken Ferens, Adi Ind¡ayanro, and Amlein I-angi, who, without thei¡ suppon and encouragement, would have made this task so much more difficult. I also wish to thank Mr. J. Schnabl of Manitoba Telephone System for providing the noise cancelling microphone. Pa¡tial support for this work came f¡om the University of Manitoba and NSERC is also acknowledged üi TABLE OF CONTENTS Page ABSTRACT ü ACKNOWLEDGEMENTS üi LIST OF FIGURES vü I-IST OF TABLES x LIST OFABBREVIATIONS AND SYMBOIJ xi INTRODUCTION Purpose I Motivation 2 Thesis Organization 3 II BACKGROUND ON SPEECH PROCESSING Introduction . 5 Human Speech hoduction 6 Speech Processing Techniques t4 Featu¡e Technique Seleaion 19 The Fast Fourier Transform r9 Linea¡ hedictive Coding 23 Recognition Method Selection 26 Templaæ Maæhing )1 Neural Networks 29 Summary 31 IU REVIEW OF BACKPROPAGATION Introduction 33 Errsr Bacþropagation 33 Background 33 Training 36 Batch Update 40 Error Measures 43 Recognition 44 Recognition a¡d Multi-level Grading 44 Summary 46 IV SYSTEM DESCRIPTION Introduction 47 The Speech Recognition System 47 lv Page Resident System Tools 49 System Ha¡dwa¡e 49 System Firmwa¡e 50 New Tools 52 The Human Interface 52 Data Acquisition and Featr¡¡e Exffaction 53 Training and Recognition 54 Summary 55 V SOFTWARE IMPLEMEI{TATION Introduction . 56 Softwa¡e Background 56 Human Interface 66 Initialization 66 Recording 67 Training 68 Testing 72 Recognition 74 Data Acquisition and heprocessing 76 Interfacing Software 80 The BP Neura] Network 8r Utility Routines 84 Summary 87 VI SPEECH RECOGNITION EXPERIMENTS Introduction . 88 System Verification 88 Software Verification 88 Ha¡dwa¡e Verifîcation 89 Descriptions and Results of Experiments 93 Data Acquisition and Training 93 Nenvork Verifi cation Test 96 Recognition Tests 96 Vowel Tests 98 Consonant Tests r00 Summary r03 Page VII CONCLUSIONS AND RECOMMEIYDATIONS Conclusions 105 Recommendations t07 REFERENCES . 108 APPENDICES A: Speech Recognition System Softwa¡e Dalog Boxes lll B: Speech Recognition System Ha¡dwa¡e 118 c: source code Listing of Macintosh speech Recognition sysæm sofrware 122 vl LIST OF FIGURES Figure Page 2.t Human speech apparatus ,r) 6 Vocal tract natr:¡al frequencies 8 2.3 speech wavefonr¡ "This is a test of the new sound on the Macintosh IIsi" 10 2.4 Frequency spectrum of the sentence, "This is a test of the new sound on the Macinosh IIsi" . r3 2.5 Speech encoding æchniques 15 2.6 Analog and associated digital representation of: (a) PCM, (b) DpCM, and (c) ADPCI{ 16 2.7 Parametric coding analysis for LPC l8 2.8 Reprcsentation of a trapezoidal waveform in the (a) Time domain and (b) Frequency domain 20 2.9 The radix 2 bunerfly 2l 2.t0 Flow diagram describing the radix 2 FFT for N=4 22 2.tt DTV/æmplate matching . 28 2.12 Dependence of the oulputs of a nenvork with recurrent links at earlier time frames 30 3.1 Ravine in three dimensional energy space 35 3.2 Energy surface topology for global solution in 3 dimensions 36 3.3 Three layer bacþropagation a¡chitecture 37 3.4 Representation of an a¡tificial neuron and its logistic activation function 37 3.5 Classification scheme used in recognition 45 4.t Speech recognition sysrem block diagram . 48 4.2 Speech recogni tion system h ardwa¡e confi guration 50 5.t Functional overview of the speech recognition system menus 57 5.2 Think C class hierarchy . 58 5.3 Flow of control in object-uienæd programming environment 59 5.4 Structure of an object sending a message 60 5.5 Global softwa¡e structure 61 5.6 Application object 62 5.7 DoCommandmethod 63 5.8 OPElrl command & vu Figure Page 5.9 SAVE command g 5.10 Updating the menus & 5.11 RWERT TO SAVED command 65 5.12 DOSAVEAS command 65 5.r3 Build window method 65 5.14 NEW command 67 5.15 ACQUIRE command . 69 5.16 Calculation of the neural networrk otal internal enor 70 5.17 NETWORK SETUP command 7l 5.18 Updating of the neural network modal dialog box during training . 72 5.19 TRAIN command 73 5.20 TEST NETWORK command 74 5.2t Unerance playback . 75 5.22 Recognition results display 76 5.23 RECOGNIZE command 77 5.24 Acquisition of the speech uttenrnces by the IBM computer 77 5.25 (a) Approach to rcsonance of the vowel i (first LPC frame) (b) Sample of tt¡e resonant period of vowel i (second LPC frame) (c) Fall from resonance of the vowel t (third LPC frame) 79 5.26 Management of the LFC daø 81 5.27 Data record stn¡cture 83 5.28 Neural network weighs . 83 5.29 Check boxes 84 s.30 Setting the values in the SEIUP NETWORK dialog box 84 5.31 Radio button management 85 5.32 Loading of the utterance librrary . 85 5.33 Display of tl¡e utterance library 85 5.34 Button outline display 86 5.35 Daædiqplay 86 6.1 Ha¡dwa¡everification of the NEC DSP A/D and D/A conversion process and filter bandwidth 90 6.2 Frequency response of the NEC DSP using the spectrum analyzer 90 6.3 Ha¡dwa¡e verification test of microphone distortion in the signal path 91 6.4 Frequency response of the NEC DSP afær passing through the microphone using tlre spectrum analyza 92 6.5 Frequency response plot of the SMl0A microphone . 94 vtu Figure Page 6.6 Polar panern of cardioid (unidirectional) microphone response 94 6.7 The vowel riangle . 97 6.8 Loci of vowels for a wide range of speakers 97 6.9 Pattern playback diagrams showing effect of F2 ransitions perceived on consonant qæe l0l D( LIST OF TABLES Table Page 2.1 Phonetic Alphabet l1 2.2 Formant Frequencies of Ten Vowels t2 6.1 c-onfi,¡sion Matrix representing Recognition Results from l1 speakers Evaluated on Six Vowel Sounds . gg 6.2 c-onfusion Marix representing Recognition Resuls from l t speaken Evaluated on Four Consonant Combination using the Vowel /a/ l0Z 6.3 conñ¡sion Matrix representing Recognition Results from l l speaken Evaluated on Four Consonant C-ombinæion using the Vowel /e/ 103 LIST OF ABBREVIATIONS AND SYMBOLS AID Analog o digital converter ADPG{ AdaFtive differential pulse code modulation c Reprcsentation of the momentum ARll4A Außo fegressive moving average BAM Bidi¡ectional associative memory BP Bacþropagation CPU Central processing unit ô Representation of the network error signal DPCM Dfferential pulse code modulæion DFT Discrete Fourier transform DSP Digital signal processor FFT Fast Fourier tra¡rsfonn LMS Least mean square LPC Linea¡predictive coding q Representation of the learning raæ PARCOR Panial correlation PCM Pulse code modulation RAM Random access memory SPB Sound speech blocks xt .