CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

SPEECH SYNTHESIS AND VOICE RECOGNITION USING SMALL COMPUTERS

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Electronic Engineering

by

Nopakorn Hiranrat

January, 1984

The thesis of Nopakorn Hiranrat is approved:

Laurence S. Caretto, Chairman

California State University, Northridge

ACKNOWLEDGEMENT

I wish to thank Dr. Robert Hong, Dr. Yuh Sun and Dr. Laurence Caretto for their critical readings of the drafts and their helpful suggestions. A special thanks to my colleague, Jan Rampacek, for his support during the writing of this thesis.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT

Chapter 1  THE SPEECH PROCESS
    1.1  SPEECH GENERATION
    1.2  HISTORICAL REVIEW

Chapter 2  SPEECH SYNTHESIS
    2.1  DELTA MODULATION
    2.2  PHONEME CODING
    2.3  LINEAR PREDICTIVE CODING
    2.4  EFFICIENCY COMPARISON
    2.5  VOTRAX SC-01 SPEECH CHIP

Chapter 3  SPEECH RECOGNITION
    3.1  RECOGNITION PROBLEMS
    3.2  FEATURE EXTRACTION
    3.3  FEATURE EXTRACTION BY FILTER BANKS
    3.4  FEATURE EXTRACTION BY ZERO CROSSING DETECTORS
    3.5  ADVANCED FEATURE EXTRACTION
    3.6  INTERSTATE VRC008 VOICE RECOGNITION SYSTEM

Chapter 4  DESIGNING A VOICE RECOGNITION SYSTEM FOR THE VIC-20 COMPUTER
    4.1  DESIGN OBJECTIVE
    4.2  HARDWARE DESIGN
    4.3  SOFTWARE DESIGN
    4.4  THE RESULTS

Chapter 5  CONCLUSION

REFERENCES
APPENDIX A
APPENDIX B

ABSTRACT

SPEECH SYNTHESIS AND VOICE RECOGNITION USING SMALL COMPUTERS

by Nopakorn Hiranrat

Master of Science in Engineering

Voice recognition involves the design of electronic systems capable of accepting spoken words and deriving meaning from them. The study of speech synthesis involves not one but a multitude of disciplines. Sometimes called Speech Science, the field includes acoustics, linguistics, engineering, physiology, phonetics, statistics, communication theory, prosodics, forensics and semantics. Recently, a great deal of interest has been generated in the understanding of the human voice, as well as in speech synthesis by computer, as part of the development of artificial intelligence.

In this thesis, advances and techniques of speech synthesis and voice recognition are described. The relative performance, advantages and disadvantages of incorporating these techniques in the design of speech generation and recognition systems are analyzed. A voice recognition system was designed based on the VIC-20 microcomputer. The hardware and software design are described in detail and experimental results are presented.

These results indicated that although considerable advances have been made in the area of voice recognition, much research remains to be done before a system can be built that recognizes a large vocabulary spoken by many individuals with relatively high recognition accuracy.

Chapter 1
THE SPEECH PROCESS

1.1 SPEECH GENERATION

Speaking occurs as a by-product of respiration. When the air flow from the lungs to the mouth is unobstructed, no sound is made and only breathing is produced. By introducing obstructions into that path, one produces vibrations of acoustic sound waves. Different sounds are produced by the different ways in which the obstructions are introduced.

Speech is produced by exciting the vocal tract with either glottal pulses or noise while the shape of the vocal tract is varied by controlling the position of the tongue and jaws. Noise excitation produces unvoiced, noise-like sounds; glottal excitation produces voiced sounds. The movement of the articulators (tongue and jaws) changes the resonant frequencies of the vocal tract. These groups of frequencies are called formants. The first three formants, designated F1 (200 to 800 Hz), F2 (800 to 2300 Hz) and F3 (2300 to 3000 Hz), carry most if not all of the meaning in speech.
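This source-filter view can be illustrated with a short program. The following sketch is only a minimal illustration, written here in Python; the 8 kHz sampling rate, the 125 Hz pitch and the formant frequencies and bandwidths are assumed nominal values for an /a/-like vowel, not parameters taken from this thesis. It drives three second-order resonators, one per formant, with either a glottal-like pulse train (voiced) or white noise (unvoiced):

import numpy as np

FS = 8000  # assumed sampling rate in Hz

def excitation(voiced, pitch_hz=125, duration=0.5):
    """Glottal-like pulse train for voiced sounds, white noise for unvoiced."""
    n = int(FS * duration)
    if voiced:
        e = np.zeros(n)
        period = int(FS / pitch_hz)   # 125 Hz -> one pulse every 64 samples
        e[::period] = 1.0
    else:
        e = 0.1 * np.random.randn(n)
    return e

def resonator(x, freq_hz, bandwidth_hz):
    """Second-order digital resonator: one formant of the 'electrical' vocal tract."""
    r = np.exp(-np.pi * bandwidth_hz / FS)
    theta = 2 * np.pi * freq_hz / FS
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = x[i] - a1 * (y[i-1] if i >= 1 else 0.0) - a2 * (y[i-2] if i >= 2 else 0.0)
    return y

# Nominal formant (frequency, bandwidth) pairs chosen inside the F1-F3 ranges above.
formants = [(700, 100), (1200, 120), (2500, 150)]
signal = excitation(voiced=True)
for f, bw in formants:
    signal = resonator(signal, f, bw)
signal /= np.max(np.abs(signal))          # normalize before playback or plotting

Moving the resonator center frequencies within the F1-F3 ranges quoted above changes the perceived vowel, while replacing the pulse train with noise excitation gives a whispered, unvoiced quality.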
In the field of phonetics, sounds are dissected into basic units called phonemes. English has approximately 40 vowel and consonant phonemes which are used to construct the entire vocabulary. Table 1-1 illustrates how the English phonemes are classified into five groups (1). Phonemes are not pronounced literally, but rather serve as symbols for their respective sounds. The phonemic codes are created to represent the neuromuscular signals that control physical speech. "Voiced" phonemes are the vowel-like sounds which are produced in the vocal cords as periodic bursts of air from 125 to 200 times a second. "Unvoiced" phonemes, or "fricatives," are noise-like sounds made by forcing air through constrictions in the mouth, such as between the tongue and teeth. "Stops" and "voiced stops" result from a momentary blockage of the air flow.

TABLE 1-1: FIVE GROUPS OF SPEECH PHONEMES

1. VOICED:        AE bad, AH father, AI bite, AW bought, AY bay, EH bet,
                  EE feet, ER bird, IX fit, OU boast, UX book, UH but,
                  UU boot, WX win, YX yes, RX rip, LX lit, MX man,
                  NX not, NG ring
2. FRICATIVES:    FX fan, TH bath, SX sip, SH ship, CH church, HX hand
3. COMBINATION OF VOICED AND FRICATIVE:
                  VX van, DH than, ZX zip, ZH measure
4. VOICED STOP:   BX ban, GX gap
5. STOP:          JH jump, TX tan, KX can

The vocal tract, shown in Figure 1-1, acts as a time-varying filter that changes the resonant characteristics of the air's path (2). Control over the size and shape of the vocal tract determines the type of voice produced. Figure 1-2 shows a block diagram representing the speech-producing system, and Figure 1-3 represents the vocal tract as a time-varying filter. Based on this model, Texas Instruments (maker of the Speak & Spell) designed and produced several integrated circuit chips for voice generation (3). In this model, a time description and a spectral description are used to define how the speech signal varies over time and at each frequency. Each spectral peak represents a frequency at which the signal contains more energy than at neighboring frequencies in the spectrum.

Figure 1-1. A cross section of the human vocal tract.
Figure 1-2. A block diagram of the speech system.
Figure 1-3. The vocal tract as a time-varying filter.
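The time description and spectral description just mentioned can also be sketched briefly. The following Python fragment is again only an illustration; the 8 kHz rate, 256-sample frame and the synthetic test frame are assumptions, not values from this thesis. It windows one frame of a signal, computes its magnitude spectrum, and reports the frequencies whose energy exceeds that of their neighbors, which is essentially the notion of a spectral peak used above:

import numpy as np

FS = 8000      # assumed sampling rate in Hz
FRAME = 256    # 32 ms analysis frame at 8 kHz

def frame_spectrum(frame):
    """Magnitude spectrum of one windowed frame: the 'spectral description'."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.arange(len(spectrum)) * FS / len(frame)
    return freqs, spectrum

def spectral_peaks(freqs, spectrum, count=3):
    """Frequencies whose energy exceeds both neighbors, strongest first."""
    peaks = [i for i in range(1, len(spectrum) - 1)
             if spectrum[i] > spectrum[i-1] and spectrum[i] > spectrum[i+1]]
    peaks.sort(key=lambda i: spectrum[i], reverse=True)
    return [freqs[i] for i in peaks[:count]]

# Example with a synthetic frame; in practice the frame would come from digitized speech.
t = np.arange(FRAME) / FS
test = np.sin(2*np.pi*700*t) + 0.5*np.sin(2*np.pi*1200*t) + 0.25*np.sin(2*np.pi*2500*t)
freqs, spectrum = frame_spectrum(test)
print(spectral_peaks(freqs, spectrum))    # roughly 700, 1200 and 2500 Hz

Tracking how these peaks move from frame to frame is the time description of the same signal; the two descriptions together characterize the speech waveform.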
Synthesizing a male voice speaking a normal English vowel would require three peaks, the first at 200 to 800 Hz, the second at 900 to 2300 Hz, and the third at 2400 to 3000 Hz. Since the vocal tract acts as a time-varying filter, it makes sense to model the vocal tract as an electrical filter with the proper filter coefficients. Figure 1-4 shows the human vocal tract represented by an electrical filter.

Figure 1-4. The electrical equivalent of the vocal tract.

1.2 HISTORICAL REVIEW

In 1949 Dr. Dunn of Bell Laboratories introduced a model representing vowel resonances. In this model, the vocal tract was modeled as two resonant cavities formed from a series of cylindrical sections placed end to end. This is shown in Figure 1-5. The piston represents the source of air at the vocal cords. The four cylinders represent (1) the throat cavity, (2) the mouth cavity, (3) a constriction due to the lips and (4) a baffle that acts as the speaker's face. This model worked accurately enough that Dr. Dunn and his colleagues were able to build an electric vocal tract and demonstrate it at the Massachusetts Institute of Technology Speech Communication Conference in 1950. (A simplified numerical illustration of acoustic-tube resonances is given at the end of this section.)

Figure 1-5. Bell Labs' two-tube resonance model.

Other early synthesizers were designed to simulate the speech process, with varying results. In 1791, Wolfgang von Kempelen built an apparatus that generated connected utterances. As shown in Figure 1-6, a bellows was connected to a reed-excited, hand-controlled resonator. The operator controlled four separate constricted passages with one hand while controlling the resonator with the other hand. The limitation of the device lay in the ability of the operator to coax subtle speech sounds from it. In this sense, the device acted as a musical instrument rather than a true voice generator, and a fairly long training period was required before the operator could manage anything resembling speech.

Figure 1-6. Von Kempelen's talking bellows.
Figure 1-7. The Voder.
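As a simplified numerical illustration of the acoustic-tube reasoning behind resonance models such as Dr. Dunn's, the sketch below treats the vocal tract as a single uniform tube, closed at the glottis and open at the lips, rather than as his two coupled cavities; the speed of sound and tract length are assumed textbook values, not measurements from this thesis. For such a tube the resonances fall at the quarter-wavelength frequencies f_k = (2k - 1)c / 4L:

# Simplified single-tube illustration (not Dunn's two-cavity model).
SPEED_OF_SOUND = 34000.0   # cm/s, approximate value at room temperature
TRACT_LENGTH = 17.0        # cm, a typical adult male vocal-tract length

def tube_resonances(length_cm, count=3):
    """Quarter-wavelength resonances f_k = (2k - 1) * c / (4 * L)."""
    return [(2*k - 1) * SPEED_OF_SOUND / (4.0 * length_cm)
            for k in range(1, count + 1)]

print(tube_resonances(TRACT_LENGTH))   # about 500, 1500 and 2500 Hz

For a 17 cm tract this gives resonances near 500, 1500 and 2500 Hz, which fall within the F1, F2 and F3 ranges quoted in Section 1.1; shaping and coupling the cavities, as Dunn's model does, moves these resonances to produce the different vowels.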