<<

B.Gold A History of Research at Lincoln Laboratory

During the 1950s, Lincoln Laboratory conducted a study of the detection of pitch in . This work led to the design ofvoice coders ()-devices that reduce the bandwidth needed to convey speech. The bandwidth reduction has two benefits: it lowers the cost oftransmission and reception ofspeech, and it increases the potential privacy. This article reviews the past 30 years ofvocoder research with an emphasis on the work done at Lincoln Laboratory.

ifI could detennine what there is in the very rapidly basically a transmitter of waveforms. In fact, changingcomplexspeechwavethatcorrespondstothe telephone technology to date has been almost simple motions ojthe Zips and tongue, I could then totally oriented toward transmission while the analyze speechjorthesequantities, I wouldhaveaset ojspeech defining signals that could be handled as subject of speech modeling has remained of low-jrequency telegraph currents with resulting ad­ peripheral practical interest, in spite ofyears of vantages ojsecrecy, and more telephone channels in research by engineers and scientists from Lin­ the sameJrequency space as well as a basic under­ coln Laboratory, Bell Laboratories, and other standingoJthecarriernatureojspeechbywhichthe lip research centers. reader interprets speechjrom simple motions. -, 1935 Modern methods ofspeech processing really beganinthe UnitedStateswiththe development Historical Background oftwo devices: the voice operated demonstrator () [4] and channel voice coder (vocoder) [51. More than 200 years ago, the Hungartan Both devices were pioneered by Homer Dudley, inventorW. von Kempelen was the first to build whose philosophy ofspeech processing is quot­ a mechanical talking contrivance that could ed at the beginning of this article. mimic, albeit in a rudimentary fashion, the The appearance of the voder at the 1939 vartoils sounds of human speech [1, 2]. The World's FairinSanFranciscoandNewYorkCity device, constructed around 1780, proved that elicited intense curiosity. The device was con­ human speech production could be replicated. trolled by a human operator who operated a The next major development in the field was console similar to a piano keyboard (Fig. 1). The undoubtedly the most important: Alexander keys on the keyboard switched the appropriate Graham Bell's invention of the telephone. Al­ bandpass fIlters into the system; by depressing though there is no need to chronicle the well­ two or three of the keys and setting the wrist known events that led to the invention, nor any barto thebuzz(Le., voicing) condition, anopera­ need to discuss its subsequent impact on tor could produce and nasal sounds. If humancommunications, itisinterestingto note the Wlist bar was set to the hiss (Le., voice­ that Bell's primary profession was that of a less) condition, sounds such as the voiceless speech scientist, Le., someone with knowledge fricatives could be generated. Special keys ofthe intricacies ofthe humanvocal apparatus. were used to produce the plosive and affrica­ Reference 3, which contains a description of tive sounds (the chin cheese and) injaw), and Bell's Harp telephone, reveals the inventor's a pitch pedalwas used to control the intonation keen understanding of the rudiments of the of the sound. spectral envelope ofspeech. While the voder was the first electronic syn­ But the telephone that Bell invented was thesizer, the channel vocoder was the first

The Lincoln Laboratory Journal. Volume 3, Number 2 (1990) 163 Gold - A History oJVocoder Research at Lincoln Laboratory

dard). Thus the channel vocoder was qUickly recognized as a means ofreducing the speech bit rate to some number that could be handled through the average telephone channel. Even­ tually the military set the standard rate at 2.4 kb/sec. To understand Dudley's conceptofmore tele­ phone channels in the same frequency space, it is important to realize that human speech pro­ duction depends on relatively slow changes in the vocal-tract articulators such as the tongue and the lips. Thus, ifwe could develop accurate models of the articulators and estimate the parameters of their motion, we could create an analysis-synthesis system that has a low data rate. Dudley managed to bypass the difficult task of modeling the individual articulators by realizing that he could lump all articulator motion into one time-varying spectral envelope. Furthermore, Dudley understood that, to a first approximation, the excitation produced by vo­ cal-cord vibrations (for voiced sounds) and re­ gions ofvocal-tractconstriction (for ) could be separated from the actual speech Fig. 1-0riginal sketch of voder by S. W. Watkins (8 Sep­ tember 1937). sounds that are produced by the motion of the tongue, the lips, and the participation of the nasal passage. Both the spectrum envelope and analysis-synthesis system: that is, the channel excitation parameters must vary slowly (be­ vocoder derived certain parameters from a cause the articulators themselves vary slowly) speech wave, and the parameters were then and thus transmission ofthese derived parame­ used to control a that reproduced ters requires only several hundred hertz of the speech. bandwidth rather than the usual 3 to 10 kHz To paraphraseDudley, vocoders could lead to needed to transmitspeechvia normal telephony advantages of more secure communications, methods. and a greater number of telephone channels in In an interesting aside, an informative and the same frequency space. Both of Dudley's entertaining paper by W.R. Bennett (6) gives a predictions were correct, but the exact ways in historical survey of the X-System of secret te­ which they came to pass (or are coming to pass) lephony that high-level Allied personnel used probably differed from what he imagined. extensively dUring World War II. For example, In 1929, digital communications was un­ during the invasion of Normandy in 1944, the known. When digitization did become feasible, system was used for communications between researchers realized that the process could London and Washington. Details about X-Sys­ provide a means of making communications tem have only recentlybecome declassified, and less vulnerable to eavesdropping. Digitization, from thatinformationX-Systemappears to have however, required wider transmission band­ been a sophisticated version of Dudley's chan­ widths. For example, the current 3-kHz band­ nel vocoder. X-System included such features width ofa local telephone line cannottransmita as pulse-coded transmission, loga­ pulse-coded modulation speech signal that is rithmic encoding of the channel signal, and, of coded to 64 kb/sec (the present telephone stan- course, enciphered speech.

164 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory

Advances in Telephony tain that the growth ofcellular phone traffic will require vocoding techniques. In an article that appeared in National Geo­ graphic magazine [7], F.B. Colton gives an en­ grossing account of the history of telephone Work at Lincoln Laboratory transmission. Figure 2, which is from that ar­ Lincoln Laboratory's history in vocoding ticle, shows the telephone wires on lower Broad­ commenced in the 1950s with extensive re­ way in New York City in 1887. It is clear that search in pattern recognition, which led to progress in telephony could easily have been work on detecting the pitch of speech. Some broughtto a haltifnotfor suchimprovementsas of the Laboratory's contributions in vocoding underground cables and tech­ include the niques. At present, fiber optical transmission 1. development ofpitch-detection algorithms and lasers are allowing another great leap for­ and hardware that have served as stan­ ward in capacity as telephone traffic continues dards for almost 30 years, to increase. 2. development ofthe first vocoder hardware Such a technological advance brings to mind for satellite and aircraft narrowband the following question: Will optical fibers with speech communications, their enormous bandwidth make vocoding 3. creation of the first real-time technology obsolete? The author believes the programs that used linear predictive cod­ answer is no because of the great potential ing (LPC) and homomorphic vocoding, growth of radio telephones, e.g., cellular 4. use ofvocoders for packet speech commu­ phones. The permissible bandwidths of such nications, devices are limited by nature; thus it seems cer- 5. use of high-frequency (HF) channels for vocoding, 6. invention of a novel analysis-synthesis system (the Sine Transform Coder), and 7. introductionofmethods thatallowvocoder research and development to be carried out in real time on general-purpose com­ puters. This article discusses the first, second, third, and seventh contributions, and gives references to the remaining items.

The Vocoder

Figure 3 is a schematic diagram ofa vocoder. In the left ofthefigure, thevocal-source analyzer finds the correct time-Varying excitation para­ meters of the input speech (discussed in the following section, "Pitch Detection"). A parame­ ter is modeled as either a quasi-periodic pulse train or a source. The vocal-tract analyzer finds the time-varying shape of the spectral envelope of the input speech by means of an appropriate algorithm (discussed in the subse­ quent section, ""). The excita­ Fig. 2-Telephone wires on lower Broadway in New York tion parameters and spectral envelope are then City in 1887. (Courtesy ofNational Geographic.) transmitted and the received signals are used to

The Lincoln Laboratory Joumal. Volume 3. Number 2 (1990) 165 Gold - A History oJVocoder Research at Lincoln Laboratory control a speech-synthesis system, as shown in (Details ofthe research will be discussed later.) the right portion of Fig. 3. This section begins with an overview of pitch detection, discusses some of the difficulties, Pitch Detection summarizes the connections between human pitch perception and automatic pitch detection, In the late 1950s, vocoder researchers real­ and reviews the ideas that have been applied to ized that the lack of an effective pitch detector the latter problem. was a major stumblingblockto adequatevocod­ ing. Pitch detection is defined in the following Overview ofPitch Detection way. A speech signal can be modeled as the output of a time-varying linear filter that is During the 1950s, Lincoln Laboratory was excited by one or more source functions. The researching pattern-recognition algorithms for human vocal tractacts as the time-varying filter automatic Morse code detection and the auto­ and the excitation signals derive from either the matic recognition of handwritten alphanumer­ quasi-periodic vibrations ofthe , the ics. One ofthe interesting concepts in this area turbulence created by points ofstricture, or the was the use ofparallel processing to increase the pressurebuildup and suddendischarge atsome probability ofmaking the right decision. During pointofclosure in the vocal tract. Thus analysis the same time period, the computer group at ofthe speech signal involves finding parameters Lincoln Laboratory developed the TX-2 com­ related to both the vocal-cord and vocal-tract puter, a powerful and highly interactive ma­ models. The vocal-cord vibrations control the chine that greatly facilitated advanced research time-varying fundamental frequency of the on difficult pattern-recognition problems. This speech signal. The analysis ofthis parameter is fortuitous symbiosis between the parallel-pro­ referred to as pitch detection. cessingconceptand theTX-2 computerled to an Early research at Lincoln Laboratory used a innovative contribution to the art ofpitch detec­ parallel-processing pattern-recognition algo­ tion and created the environment for Lincoln rithm to model the pitch-detection process. Laboratory's entry into the vocoder field.

Vocal-Tract Analyzer Vocal-Tract Synthesizer CD Short-Time Spectral Analysis CD Short-Time Spectral Input Synthesis Output Speech ...... ® Speech ..,..... (LPC) - ... ® LPC Synthesis ...... ,..... ® Homomorphic Analysis ® Homomorphic Synthesis @ Analysis o Formant Synthesis ~ Coded ~ Transmission ~ of Parameters '- Vocal-Source Analyzer

CD Fundamental Frequency ~ Extraction ..... Excitation - Generator ® Voiced-Unvoiced Decision

Fig. 3-Schematic ofa vocoder.

166 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory

In the world, pitch is the fundamental bring into evidence some obvious event that frequency of a note, a definition that is not am­ signifies the start ofa period. This strategyleads biguous as long as the note consists of evenly to approaches thatarebestcategorized as signal spaced . But what if the signal con­ processing. tains aharmonic frequencies? In addition, the Mter signal processing, it is necessary to perception of pitch by people (see the box, make explicit voiced-unvoiced (or buzz-hiss) "Human Pitch Perception: The Debateover Place decisions and to assign specific pitch numbers versus Periodicity") is a function of various to the voicing periods. (The pitch numbers parameters such as intensity and duration as denote either the periods or frequencies of the well as fundamental frequency. For these rea­ pitches.) This partofthe pitch-detectionprocess sons, audiologists preferto define pitch as a per­ can rightly be called pattern recognition. ceptual entity. For the purposes of this article, Lastly, we can use a priori knowledge to fix however, the terms pitch and voice fundamental mistakes. Suchknowledge includesinformation frequency will be used interchangeably. Thus about the range of pitch values to be detected pitch can be interpreted as the response of a and the observation that the duration ofvoicing device called a pitch detector to a speech signal. rarely lasts less than 50 msec. And correctpitch corresponds to speech synthe­ Thus pitch detection canbe divided into three sis dUring which no error is detected by the cascaded operations: signal processing, pattern human auditory system. recognition, and editing. All speech sounds require the positioning of the human vocal cords. During voiced sounds Difficulties ofEstimating Pitch and Making (e.g., and nasals), the vocal cords vibrate the Voiced-Unvoiced Decision quasi-periodically; during thevoiceless sounds, the vocal cords are open. The vibrations of the Abasic task ofvocodersystems is the separa­ vocal cords create quasi-periodic pressure tion of speech into a source and a vocal-tract waves that can be modeled as the input to a filter. The separation implies that the pitch and time-varying linear filter (Le., the vocal tract). spectrum canbe individually manipulated prior When a speaker wishes to stress a word or a to synthesis. Exciting the synthesizer with a syllable, the speaker's pitch will usually rise; pure hiss source can result in a sound like that is, the frequency ofthe vocal-cord vibration whispered speech; exciting with constant­ will increase. Speech analysis-synthesis sys­ period pulses produces a strong monotone ef­ tems depend greatly on the pitch detector's fect. Even a single error, such as a missing or ability to distinguish between voiced and un­ added pulse, canbe perceived ifitoccurs dUring voiced sounds, and its ability to determine the a steady-statevowel. Ifthe error occurs near the time-varying fundamental frequency of the beginning or end of voicing, however, the error voiced sounds. may not be perceived [8, 9). Excessive smooth­ ing of the measured fundamental frequency Signal Processing, Pattern creates a monotone effect. Recognition, and Editing The fundamental frequency ofan adult voice canvaryfrom 60 to 600 Hz. The large range often Insight is gained by studying the mechanism results in the pitch estimate being off by a fac­ of human speech production. During voicing, tor of two to three. Such errors are annoying to thevocal cords open and close at the fundamen­ listeners, but even more annoying is a hiss tal frequency. When we look at a speech oscillo­ decision dUring voicing. The inverse of that er­ gram we see this periodicity, but the precise ror (a buzz decision dUring hissing) appears, instantsatwhich openingand closingtake place on the average, to be less annoying. are generally masked by the filtering effects of In developing an algorithm for voiced-un­ thevocal tract. Itis thus appealingto investigate voiced decision making, one should note that ways of processing the speech signal so as to most English sounds are pure buzz (quasi-

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 167 Gold - A History oJVocoder Research at Lincoln Laboratory

Human Pitch Perception: The Debate over Place versus Periodicity

Although the automatic ex­ frequencies. As the vibrations point of view, this model corres­ traction of the fundamental fre­ penetrate more deeply into the ponds to a filter bank that covers quency of speech is a relatively cochlea (Fig. A[ 1)). the response the audiO range (Fig. A(2)). The recent discipline, the study of becomes more sluggish, corre­ pitch of the tone can be deter­ human pitch perceptionhasbeen sponding to filters with lower mined by locating those filters a lively field for manyyears. E. de center frequencies. Thus, for ex­ which contain energy. Boer wrote an excellent review of ample. a pure tone would cause a Although the place theory the subject [IJ. specific place on the basilar described above supplies a cred­ H. von Helmholtz [2J conceived membrane to vibrate most vigor­ ible explanation for pure tones, ofthe auditory system as a bank ously, which would lead to per­ the theory has trouble accommo­ of many overlapping bandpass ception of that tone. Tones of dating the perception ofcomplex filters. ear the entrance to the different frequencies stimulate periodic signals that consist of cochlea, the membraneand asso­ different places on the mem­ many harmonics. An important ciated hair cells respond to high brane. From an engineering fact to note in these cases is that

(1 ) t

Stapes Basilar Membrane

(2) Signal Acoustic Mechanical Tube System I 1 1 ~,

I Outer Middle 1 I Ear Ear I I Inner Ear /~~ ~~- (Linear Filter .---__--L__...L..__L-_--L__...L..-, Bank) Auditory Nerve

To Higher Auditory Centers

Fig. A-The auditory system: (1) schematic ofthe earand (2) engineering model.

168 The Lincoln Laboratory Journal. Volume 3, Number 2 (1990) Gold - A History oJVocoderResearch at Lincoln Laboratory

even when the fundamental fre­ of periodicity. The experiments. tion of Houtsma and Goldstein's quency is missing. the perceived however, did not lead to an an­ results is that the ear and brain pitch is that of the fundamental swer to the question -How does perform a high-resolution spec­ frequency. Fletcher [3) referred to the auditory system work?- trum analysis followed by a pat­ this phenomenon as the missing Further inSight was obtained tern-recognition procedure to jimdamental, whileJ.F. Schouten from the experiments of A.J.M. detect pitch based on this spec­ [4) named it the residue. Results Houtsma and J.L. Goldstein [6]. tral structure. J.L. Goldstein [7] ofthis sort have led to theories of In Houtsma and Goldstein's frrst proposed a model ofthis sort and human pitch perception as being experiment. musically trained H. Duifhuis. L.F. Willems, and influenced by the actual period of subjects were asked to recognize RJ. Sluyter [8] implemented the the signal rather than by a place the interval between two succes­ model to extract pitch from a mechanism. These theories are sively played signals. (An interval speech wave. thus called periodicity theories. is defmed as the difference be­ Forthe perception ofthe miss­ tween the fundamental frequen­ References ing fundamental to be explained cies of two signals.) Each signal 1. E. de Boer. -On the 'Residue' and by a place theory would require contained two successive har­ Auditory Pitch Perception.- Hand­ some nonlinear phenomenon in monies of a given fundamental book oJSensory Physiology 5, ed. the ear to cause the place in the frequency. When the two signals W. Keidel (Springer-Verlag. ew basilar membrane that corre­ were presented to both ears, the York. 1976)pp.479-583. sponds to the fundamental fre­ trained subjects had no trouble 2. H. von Helmholtz. Die Lehre von den Tonempfmdungen als physi­ quency to vibrate despite the identifying the intervals. ologischeGrundtagejUrdieTheOrie physical absence of the funda­ The second experiment was a derMusik(Vieweg. Braunschweig. mental. J.C.R. Licklider [5] repeat of the frrst but with one 1862) (5th edition 1896): On the proved thatthe above didnothap­ notable exception; for each sig­ SensationsofToneas a Physiologi­ cal BasisJor the Theory ojMusic. pen via the following clever ex­ nal. only one was pre­ 1st English ed. (1897). paperback periment: he alternately played a sented to one ear while the other ed. (Dover. ewYork.1954). pure tone (at the fundamental) harmonic was presented to the 3. H. Fletcher. Speech and Hearing and a harmonicseriesofthe same other ear. Again. pitch intervals in Communications (D. Van os­ fundamental frequency but with trand. Princeton. J. 1953). were correctly identified. The re­ 4. J.F. Schouten. !he Residue. a the actual fundamental physi­ sultindicates that the perception ew Component in Subjective cally absent. The listener then of pitch is centrally located; that Sound Analysis.- Proc. Kon. Acad. perceived two sounds of equal is. perception took place after the Wetensch. The Netherlands 43, pitch but different timbre. Then auditorysignalsfrom the two ears p. 356 (1940a). 5. J .C.R. Licklider. -Periodicity Pitch noise with a bandwidth centered had been combined. What actu­ and Place Pitch.- J. AcousL Soc. at the fundamental was added to ally happens physiologically is Am. 26, 945 (A) (1954). the above sequence and the fol­ still a mystery. but these experi­ 6. A.J.M. Houtsma and J.L. Gold­ lowing was discovered: the pure ments must now be incorporated stein. !he Central Origin of the tone was completely masked into any proposed model. Also Pitch ofComplexTones: Evidence from Musical Interval Recogni­ while the harmonic series was very significant is the fact that an tion.- J. AcousL Soc. Am. 51,520 still perceived to have the same alteredversion ofthe place theory (1972). pitch. If perception of the har­ sneaks back in. We can imagine 7. J.L. Goldstein. -An Optimum monic series were dependent on thattheappropriate placesonthe Processor Theory for the Central Formation ofthe Pitch ofComplex the combination tones appearing basilar membrane vibrate for all Tones.- J. AcousL Soc. Am. 54, at the fundamental-frequency existing harmonics of the signal 1496 (1973). place in the basilar membrane, and the central processor, or 8. H. Duifhuis. L.F. Willems. and R.J. then the combination tones brain. judiciously combines this Sluyter. -Measurement ofPitch in would have been masked. too. Speech: An Implementation of knowledge to produce a pitch Goldstein'sTheoryofPitchPercep­ Licklider's experiments cast a value. tion.- J.Acoust. Soc. Am. 71,1568 vote in favor ofSchouten's theory A straightforward interpreta- (1982).

periodic signals) or pure hiss (noiselike signals). which the constriction creates aperiodic excita­ The exceptions are thevoiced fricatives (e.g., the tion while the vibrating vocal cords simultane­ vin van, the zin zeroand azure, and the thin the) ously generate quasi-periodic excitation. Thus and thevoiced affricative (e.g., the dgin edge), in the speech source (or excitation) function is

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 169 Gold - A History oJVocoder Research at Lincoln Laboratory neither strictly pure buzz nor pure hiss. Never­ Similarly, Fig. 4(c) shows a rapid change theless, most vocoder systems work on the in the spectrum. For example, a sudden closure premise that the voiced-unvoiced decision is as in a vowel-to-nasal transition can cause binary. The binary assumption is based on the such a change. Although the fundamental fre­ conviction (perhaps implicit) that the resul­ quency in Fig. 4(c) has not changed drastically, tant vocoded speech quality is not perceived pitch detection based on waveform analysis to degrade. can suffer. On the other hand, a spectrally With good-quality speech in a reasonably based pitch extractor would presumably not quiet environment, the occasional errors made be sensitive to such a perturbation. by sophisticated pitch- and voicing-detection Figure 4(d) shows a transition region from algorithms are generally not a problem. Major aperiodic excitation (hiss) to quasi-periodic difficulties, however, arise when the environ­ excitation (buzz). It would appear that a fast­ ment is less benign. For example, in vocoding acting time-domain pitch extractor would be speech spoken in a high-speed jet aircraft, the best to catch the precise transition instant. background noise is loud enough to causemany Finally, Figs. 4(e) and 4(f) respectively showhow buzz-to-hiss errors. Also, when speechis passed telephone transmission and added acoustic through the carbon-button of an noise can degrade the speech signal. It is appre­ ordinary telephone handset and then through a ciably more difficult to extract the pitch from typical telephone channel, enough noise and such signals. distortion are added to produce substantial excitation errors. Methods ojPitch Estimation The annoyance level of perceived pitch and (Signal Processing) voicing errors depends not only on the pitch estimator and voicing detector, but also on the R.J. Ritsma [10] demonstrated that low fre­ vocoder synthesizer. (This point will be dis­ quencies dominate in the perception of pitch. cussed in the section "Speech Synthesis.") Indeed, it seems established that human pitch Figure 4 illustrates why automatic pitch de­ perception of speech and probably for all peri­ tection is difficult. Figure 4(a) shows two speech odic sounds is most strongly influenced by the waveforms: the top signal has a long period, frequencies below about 1500 Hz. about four times that of the bottom signal. The It is interesting to note that many pitch­ example illustrates the large dynamic range of detection algorithms use a low-pass-filtered the 's fundamental frequency. In version of the speech as their input. Thus fact, the pitch ofsome male voices can be as low vocoder developers, perhaps without being as 60 Hz while the pitch ofchildren's voices can aware of Ritsma's result, found that similar be as high as 800 Hz. Any vocoder that is regions of dominance hold for both pitch detec­ dedicated to vocoding all human voices must be tion (by vocoder analysis) and pitch perception able to process this order-of-magnitude pitch (by people). range. In many practical situations, however, Although low-pass filtering is effective in the range can be greatly reduced. For example, removing extraneous influences, it still requires the pitch ofmost male voices ranges from 100 to a formidable pattern-recognition algorithm to 200 Hz. extract pitch. This requirement has led re­ Figure 4(b) shows how the period ofa speech searchers in the field to develop more complex signal can fluctuate drastically and almost in­ signal processing methods, including spectral stantaneously. Note that the leftmost periods of flattening and correlation [11], inverse filtering the curveare short, and the following periodsare [12], and cepstral analysis [13]. abruptly more than twice as long. The right­ The use of cepstral analysis as the signal most periods then snap back to shorter dura­ processing portion of a pitch estimator arises tions. This kind ofbehavior makes pitch track­ from the algorithm's ability to deconvolute two ing difficult. signals. As discussed earlier, speech can be

170 The Lincoln Laboratory Journal. Volume 3. Number 2 (l990j Gold - A History oJVocoder Research atLincoln Laboratory

Fig. 4-The detection ofpitch in speech is difficultbecause (a) the pitch ofthe human voice can range from 60 Hz for some adultmales to 800 Hz for some children, (b) fluctuations in the pitch ofa person speaking can be dramatic and abrupt, (c) variations in the vocal tract can cause sudden changes in the speech spectrum, (d) although the transition from voiced (hiss) to unvoiced (buzz) excitation might not alter the speech's fundamental frequency, the transition does greatlyalterthe waveform ofthe signal, (e) telephone transmission degrades the speech signal, and (f) other acoustic background noise can also degrade the speech signal. modeled as the excitation function convolved becauseanytechnique thatcan separate outthe with the vocal-tract function, and cepstral excitationfunction from thevocal-tract (or spec­ analysis can in principle separate these two tral) function is, in a sense, performingdeconvo­ functions. lution. The unique property ofcepstral analysis It should be stressed that all vocoder algo­ is that deconvolution is attained almost entirely rithms incorporate the conceptofdeconvolution through linear filtering and a simple memory-

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 171 Gold - A History ofVocoder Research atLincoln Laboratory

Excitation Pattern Recognition for Pitch

(a) (b) (c) (d) 512-Point Log 512-Point Separation Spectral FFT Magnitude FFT in Time Function

Windowed Speech Wave

(a) ---l~;A--+t+-f-++-if-\-+T+-+-H-t-\+lr+lrJ-1"------1~ Time

,. _, / Spectrum "" (b) '"'"

Q) -0 Q)~ (c) 0 c: .....J0l ell ~

--;-.L-__.L-__...... __...L..__.....L.__----lL.-_.x...__....1..t~ Frequency

(d)

--;----...... -+-.....L.-->O'---__.....c.-->O' ~...... --i~ Time

Fig. 5-Cepstralanalysis. less nonlinearoperation (in our case, a logarith­ log spectrum. One method of accomplishing mic function). this separation is through another Fourier The technique follows straightforwardlyfrom transform, as shown in Fig. 5. Thus cepstral the following argument. The Fourier transform analysis brings about the desired deconvolu­ of the convolution of two time functions yields tion. A.V. Oppenheim pioneered early work the product oftwo frequency functions, and the on deconvolution by cepstral methods; his logarithm of this product yields the sum of the work will be discussed in the section "Speech logarithms of the two frequency functions. The Synthesis." excitation function. thus transformed. is still a rapidly changing. quasi-periodic function, Pattern-Recognition Methods whereas the transformed spectral-envelope for Pitch Detection function varies relatively slowly. Therefore. these two transformed functions can be sepa­ In the early 1960s. Lincoln Laboratory de­ rated by the appropriate linear filtering of the signed an algorithm that used parallel process-

172 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoderResearch atLincoln Laboratory ing to estimate the voice fundamental period the peak factor even when the polarity of the

(14). The algorithm (Fig. 6[all consisted of four signal is reversed, m4 has been introduced to major steps: accommodate high negative peak factors. In 1. a filter for the speech signal, many instances in which the peak factors are 2. a processor thatgenerated sixfunctions of not high, we can still perceive large peak-to­ the peaks of the filtered speech signal, peak swings by measuring m and m ' Finally, 2 s 3. six identical elementary pitch-period esti­ strong second harmonics in the signal can cre­ mators (PPE), each working on one of the ate two peaks ofnearly equal size per period; the

six functions, and measurements m 3 and m 6 have been intro­ 4. a global statistically oriented computation duced to deal with this effect. based on the results of step 3. Figure 7 shows the operation of an elemen­ Figure 6(b) depicts the six measurements on tary PPE, a device whose function has much in the filtered speech. Measurement m1 follows the common with the firing pattern of an auditory largest peak in the pitch period so that the pitch nerve fiber. In an elementary PPE, a sufficiently of any signal with a high positive peak factor is large pulse will cause the detector to fire. A accurately tracked. Since we also want to track blanking interval (called a refractory interval in

(a)

Step 1 Step 2 Step 4 Processor Final Pitch- Pitch Speech Period Period Filter of Signal Computa- Peaks tion

(b)

Time

--r ms ms ------

Fig. 6-Estimation ofpitchperiod: (a) schematic ofalgorithm, and(b) diagram showing , ... , ) basic measurements (m m2 m6 made on the filtered speech. " The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 173 Gold - A History oJVocoder Research at Lincoln Laboratory neural nomenclature) follows, dUring which greater detail.) If the score of a winner is below subsequentlarge inputs have no effect. Mter the a certain empirically determined threshold, blanking interval, the device becomes asymp­ then the signal is considered unvoiced. (Refer­ totically more sensitive to external stimuli and ence 15 provides a detailed description of will fire againwhenany inputexceeds the dimin­ the buzz-hiss detection rule.) ishing threshold. Lincoln Laboratory's device Lincoln Laboratory's pitch-detection algo­ incorporates a form of learning in that both rithm has an interesting history. Its earliest blanking intervals and rundown times adapt to implementation was on the Laboratory's TX-2 the measured periods. computer. Some speech scientists initially Sixelementary pitch detectors are employed, thought the algorithm too complex to be of one for each of the measurements of Fig. 6(b). practical interest. But N.L. Daggett [16] built a The three most recent measured periods in hardwareversionwith emerging LSI technology, addition to all possible sums of the three are and by 1980 the entire algorithm was incorpo­ saved to insure that the true period is found. rated on a single chip [17]. With current VLSI Thus each elementary detector presents six technology, a complete vocoder can be imple­ measurements to the final estimator (as mented on a single digital signal processing (DSP) chip on which the pitch detector would Variable Exponential account for a small percentage of the total Variable Blanking Decay processing. Time (Rundown Time) Lincoln Laboratory's algorithm analyzes the .--I-~I temporal propertiesofthe filtered speech signal. Some researchers assert, however, that pitch perception by people is based on a frequency (as opposed to a temporal) analysis of the signal. Firings Thus designers of pitch detectors have also looked atfrequency-based approaches. While at Fig. 7-0peration of the detection circuit. After a firing, a Lincoln Laboratory, S. Seneff designed a pitch variable blanking time occurs during which subsequent detector based on the harmonics obtained from pulses are ignored. The blanking time is followed by a variable exponential decay (rundown time) during which a short-time, high-resolution spectral analysis the device fires wheneveranypulse exceeds the threshold. of the speech wave [18]. Quoting Seneff, shown in Fig. 8[all , and a total of 36 measure­ the algorithm described here uses an iterative ments are available for further processing. technique which begins by considering only However, to minimize delays the final esti­ the two largest peaks. It then adds each peak mator considers only six candidates, namely, in tum, from largest to smallest, and after the addition ofa newpeakdetermines a newlistof the most recent periods measured by each potential pitches as the distance between detector. adjacent peaks under consideration. Such a The final estimator compares each of the technique results in a built-in weighting six candidates (the first row of the matrix in mechanism, whereby the largest peak is in­ Fig. 8[bll with the other 35 measurements, cluded inevery iteration, butthe smallestonly in the last. The final decision algorithm deter­ coincidences and the number of are noted. mines the pitch from a list which includes all A coincidence occurs when the absolute dif­ of the estimates from each iteration. ference between a candidate and a measure­ ment does not exceed certain empirically deter­ Although it did not work as well as the time­ mined thresholds. The candidate with the domain system described previously, Seneffs highest score (Le., the largest number of frequency-domain algorithm was more robust, biased coincidences) is declared the winner. especially when it analyzed speech degraded by (Reference 14 describes this algorithm in telephone-line transmission.

174 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History ojVocoder Research at Lincoln Laboratory

Editing ofthe Pitch Contour (a)

Constraints of the human-speech-produc­ I I P,,6 =p,., + P'.2 + P'.3 tion mechanism dictate that pitch should be a I.....I------:----....,.~ reasonably smooth function of time. This piece : P,.5 = P,,2 + P,.3 b : 1'4 • III" I PPE 1 of knowledge has been applied in various ways : I P,.4 = p,'., + P1.2 to remove ambiguities in the pitchcontour. J.W. ,,,I :.. I .. Tukey [19) proposed a nonlineartype ofsmooth­ P,.3 P,.2 P1.1 ingbest classified as median smoothing, a tech­ J. .t. .t. nique that Seneff describes in Ref. 18. ·L Time

Neural Encoding ofPitch

M. Miller and M. Sachs [20) showed that the : P2 5 = P22 + P23: : I If . .' .~ I I auditory neurons in cats will behave either as PPE 2 : I P2.4 =P2'., + P2.2 b:, time- or frequency-domain pitch detectors, de­ I :.. I .. I I I f pending on the relationship between the funda­ 2 P2.2 2 t·;----'-....,·~P ., mental frequency of the signal and the charac­ J. P ,3 ·t·....--...... I... L teristic frequency of the auditory fiber. Thus it ------Time appears that the next step in designing an •• • accurate and robust pitch estimator should •• • encompass both domains. •• •

Speech Synthesis

G. Fant [21) gives a concise statement that defines a model of human speech: ''The speech wave is the response ofthe vocal tract to one or more excitation signals." In engineering terms, Fant's definition means that human speech production can be modeled as a time-varying filter excited bybuzzes, hisses, ora combination ofthe two. J.L. Flanagan [3) and L. Rabiner and (b) PPE No. R. Schafer [22) supply in-depth discussions of 2 3 4 5 6 the subject. From this starting point, we can describe the P", P2"P3" P4 ,1 P5"P6" spectrum of the speech wave as the product of P,,2 P2,2 P3.2 P4 ,2 P5,2 P6,2 an excitation spectrum and a vocal-tract spec­ P,,3 P ,3 P ,3 P ,3 P ,3 P ,3 trum. In particular, during speech sounds that Estimates 2 3 4 5 6 involvevocal-cordvibration, the excitation spec­ of Period P,.4 P2,4 P3,4 P4 ,4 P5,4 P6,4 trum can be approximated by a harmonic spec­ trum. This approximation requires that the P,,5 P2,5 P3,5 P4 ,5 P5,5 P6,5 speech spectrum itself be resolved into a fine P,,6 P ,6 structure and a spectral envelope as shown in 2 P3,6 P4 ,6 P5,6 P6,6 Fig. 9. Many problems are hidden in Fig. 9. First, we note that the fine structure is time variable as Fig. 8-Final estimation ofpitch period: (a) outputs of the six individual pitch-period estimators, and (b) matrix ofthe the pitch changes. In fact, given a sufficiently outputs. Each of the entries in the first row of the matrix high variation in pitch, it is wrong to assume is a candidate for the final estimate.

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 175 Gold - A History oJVocoder Research at Lincoln Laboratory

10,------,------,-----r----, (a) In Fig. 9, we have drawn imaginary curves o that pass through the peaks of the harmonics / Spectral Envelope but we cannot say what the true spectral en­ -10 velope is. especially at frequencies away from ~ -20 Fine Structure the harmonics. Furthermore. the term spectrum in our case -30 necessarily means short-time spectrum. Thus -40 ourresults are critically dependent on how long an interval of speech is processed to produce a -50 L.J...J.....L...I....L..J....I....Ul....L...I....L..J....L..L..J....u..L..L..WL..l....L...... Il..L.J-l...J...... o 2 3 4 single spectrum. kHz Issues such as the above problemstend to get 10...------,-----,----..------, resolved empirically. The analysis of speech (b) yields valuable insightbuteven more important o is the experience gained by actually listening to -10 a vocoder. Dudley's first vocoder contained 10 channels and its quality was found greatly ~ -20 wanting; a laterversion that employed 30 chan­ -30 nels was far more satisfactory. -40 Speech-Synthesis Models -50 Wlllll.Ll.u..w.l.l.W~w.l.LLLl.U.LLl.U.WllWll.LL.Ull.LlW.llW.llLllU 023 4 The ideas introduced in the previous section kHz lead to a simplified model of speech synthesis. 10,------,----,----.------, According to the model, speech (c) consist of three major blocks: a buzz generator o that generates a quasi-periodic pulse train. a -10 hiss generator that produces white noise. and a linear filter with time-Varying parameters. In ~ -20 this section we will focus on the filter and -30 describe several of its variations. Speech synthesizers can be divided into two -40 broad classes: universal in which the synthe­ -50 Wll.LJ.Ww.l.LLWlLJJ.W..LJJ.JJWJ.L.UJ.LlJ.LU.Ju.LU.Wll.w.wu.w.Wll.LLLLI sizer structure remains the same for all sounds o 2 3 4 (Le., only the parameters vary). and sound spe­ kHz cific in which significant structural as well as Fig. 9-Spectral envelope and fine structure of several parametric changes occur that depend on the vowels: (a) vowel atrelatively high pitch, (b) same vowel as parta but with lowerpitch, and (c) different vowel but with speech itself. Let us first describe the universal the same pitch as part b. structures. Since such structures are linear filters. they can be described in terms of their thatthe spacing between adjacent harmonics is poles and zeros. We distinguish among the constant. Also, the exact harmonic structure following types of synthesizers: (a) fixed poles may be compromised by irregularities in the andvariable zeros. (b) variable poles, (c) variable vocal-cord vibration and by various forms of zeros, and (d) variable poles and zeros. external sources ofspeech degradation such as Figure 10 shows a classic channel-vocoder noise or the speech's transmission through a synthesizer. Each of the fixed filters shown is telephone system. Because ofthese effects, but typically a 4-pole or a 2-pole filter. Zeros are perhaps more so because the pitch of speech created by the parallel addition of all filter out­ variesgreatly (from 60 Hz to 800 Hz), itis difficult puts. The poles of the overall transfer function to define the spectral envelope consistently. are simply the poles of all the individual filters.

176 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History ofVocoder Research at Lincoln Laboratory

Channel Signal 1

Channel Signal 2

Excitation t \ Synthes;zed Signal ~speeCh o .. : L .. (Buzz • or Hiss)

Fig. 10-The classical channel-vocoder synthesizer, which has fixed poles and variable zeros.

x(n) •••

p y(n) = LakY (n - k) + x(n) k=1

Fig. 11-Linear predictive coding (LPG) vocoder synthesizer, which has variable poles (variable a coefficients).

while the zeros vary with the time-varying para­ these parameters can be derived from a smaller meters applied to the modulators. Thus the number of more basic parameters. channel-vocoder synthesizer has fixed poles Figure 13 shows an example ofa synthesizer and variable zeros. in which both poles and zeros are time varying. Figure 11 shows a variable-pole speech syn­ This particular device is called a parallel­ thesizer in which the pole variations are con­ formant synthesizer; the variations are cre­ trolled by the parameters shown. ated by adjusting both the poles of the filters Figure 12 shows a variable-zero filter. Struc­ and the modulation signals. turally, this is a tapped delay line or, in digital Sound-specific models (as distinguished terms, a finite impulse response (FIR) filter. from universal models) are really the results of Sucha synthesizerrequires manymorevariable research on synthesizers with human rather parameters than that of Figs. 10 or 11, but, as than automatic analysis. One example is Fant's we shall see when discussing cepstral methods, OVE II synthesizer (23) (Fig. 14). Note that there

The LincoLn Laboratory Journal. VoLume 3. Number 2 (1990) 177 Gold - A History oJVocoderResearch atLincoln Laboratory are three separate networks: the top network succeeded in designing a completely automatic generates vowels and semivowels, the middle analysis-synthesis system based on OVE II. network generates nasals, and the bottom net­ work generates fricatives and plosives. Also, the Spectral Flattening of excitationfunctions are connected in a varietyof the Excitation Function ways so that. for example, whisperedvowels can begenerated. OVE II can producegood synthetic Figure 9 leads one to believe that our model speechwhen a skilled humananalyzerperforms assumes a clearseparationofthe excitation and the analysis. The usual procedure is to enter the spectrum. Thus for voiced speech the exci­ parametersbased on a spectrogramofthe utter­ tation is postulated to consist ofequally spaced ance to be synthesized. Thus far no one has harmonics. all ofthe same amplitude. The spec-

L..-_-. Synthesized Speech

Fig. 12-Homomorphic vocoder synthesizer, which has variable zeros.

Channel Formant 1 Signal 1 , Variable Bandpass Filter 1 Channel Formant 2 Signal 2

Variable Bandpass Excitation Filter 2 0 • •• ••

Channel Formant k Signal k

Variable Bandpass Filter k

Fig. 13-Parallel-formant vocoder synthesizer, which has variable poles andzeros.

178 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoderResearch atLincoln Laboratory

Pitch

~~§--- Nasal Network

LEGEND

A = Control Amplitude K = Fricative Poles F = Formant ® = Multiplier FH = Higher-Pole Correction N = Nasal Poles (Fixed) • = External Control

Fig. 14-The aVE 1/ synthesizer ofGunnar Fant [23).

trum, or more explicitly the spectral envelope, is would synthesize speech with the spectrum assumed to be a smooth curve. Much evidence Eif)Sif) = Eif)2V(f). This result is clearly wrong exists that the physical situation is far from this ideal. Figure 15 shows a few examples of glottal pulses obtained by various methods. Note that t~~. the pressure wave is often not perfectly periodic nor is the waveshape identical from period to period. Therefore, an excitation signal can have a spectrum that deviates greatly from the ideal, and any spectral analysis on the speech signal 1AA~ needs to incorporate the deviation. Let us denote the excitation spectrum byElf) and the vocal-tract spectrum by V{f). Then the measured speech spectrum is given by S(f) = Eif)V(f). Now, let us imagine that by some magical trick the system were capable ofknow­ ing Elf) and thereby capable of knowing the precise shape of the excitation wave. Then all synthesizers that we have thus far described Fig. 15-Samples ofglottal pulses.

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 179 Gold - A History oJVocoder Research at Lincoln Laboratory

Input Output -----­Signal Signal

Fig. 16-Spectralflattening. unless the spectrum E(f) is constant with fre­ speech signal, the frequencies would encom­ quency, which for voiced speech means equality pass about 10 kHz, or about three times as at the harmonics. much as the present American telephone sys­ The solution to this problem is spectraljlat­ tem allows. tening. in which all harmonics of the excitation Another important parameter is the actual signal are set equal when the voicing block is width ofeach bandpass fIlter; this width in tum used. When the voiceless (or hiss) block is used. specifies the number of filters. Here we are a short-term whitening of the spectrum is per­ dealing with the question of frequency resolu­ formed (Le., the spectrumis madeconstantwith tion. Ifwe desire to resolve individual harmonics respect to frequency). Figure 16 shows one ofall speech signals. we need to design narrow­ technique for approximate spectral flattening. band filters. On the other hand. if we are inter­ The technique was derived from vocoder re­ ested in the spectrum envelope of the speech. search at Lincoln Laboratory [24]. the filters would be wider. Figure 17 shows how the measured spectrum changes as the band­ Filter-Bank Spectral Analysis width of the uniform-fIlter bank changes (com­ pare Fig. 17[c) to Fig. 17[e)). A first design might consist of a set of band­ Finally, the designer has to specify the type of pass filters, each of the same fixed bandwidth filter desired. A wide variety of fIlter designs and with crossovers at the 3-dB points. Each exist: bandpass-filteroutputis applied to eithera full­ a. Besselfilters have a desirable response of wave or half-wave linear rectifier that. in tum, linearphaseversus frequency. Such fIl ters feeds a low-pass filter. All low-pass filters are have fairly shallow cutoff characteristics. also ofthe same bandwidth, and the bandwidth (Bessel filters are so named because the is fixed in advance. Such a system is perhaps denominator of the analog transfer func­ simplest to design and implement. For a digital tion is a Bessel polynomial.) implementation, the system has the advantage b. Elliptic filters are the opposite extreme of that the gains of all fIlters are equal. so that no Bessel filters. Elliptic fIlters have sharp gain adjustment is needed for the different cutoffcharacteristics, but because ofhigh channels. reverberations the filters are generally not Given the imposition ofequal bandwidth. we recommended for speech-spectrumanaly­ still have several parameters to consider. First, sis. there is the overall bandwidth of the spectrum c. Chebyshevfilters have undesirable phase analyzer. This disarmingly simple parameter properties similar to elliptic filters. actually has led to much soul searching on the d. Butterworthfilters appear to be a decent part ofsystem builders. Consider, for example. compromise between sharp cutoffand lin­ that channel capacity is directly proportional to ear-phaseresponse. Here the parameterto bandwidth. Thus. if the telephone system in­ consider is the order of the fIlter: too Wgh sisted on transmitting all frequencies in the an ordercanproduceserious phase distor-

180 The Lincoln Laboratory Journal. Volume 3. Number 2 (J 990) Gold - A History oJVocoder Research at Lincoln Laboratory

tion. Second- or third-order Butterworth filters are good analysis filters. /" Spectrum Envelope e. Lincoln Laboratory discovered Lerner filters by adding sUitablyweighted resona­ tor outputs to obtain both an excellent linear phase and sharp amplitude cutoff. J Frequency-sampling filters are derived by Frequency cascading a comb filter with sUitably weighted resonators in parallel, such that the resonator poles are canceled by the comb-filter zeros. Frequency-sampling fil­ (b)

ters have perfectly linear phase and excel­ Q) ••• •• "0 lent cutoffcharacteristics. .3L...U~u..... U-"~ -. 'c g. Digital filters derived from window func­ 0> ctl tions are called nonrecursive, or FIR, fil­ ~ ters. The phase of such filters can always be made perfectlylinearwhile many differ­ (c) ent amplitude characteristics can be de­ rived. h. The bandwidths of otherfilters can be de­ Frequency rived from auditory system analysis. Both physiological and psychophysical data in­ dicate that the ear performs some sort of spectrum analysis. The physiological data (d) include G. von Bekesy's measurements of ••• •• ~ III If',\'\'',\, basilar membrane motion [25) and N.Y.S. .~ IU.L.L.L-~~j' ..LLJ.'..L.-~~ -. Kiang's tuningcurves for thecat'sauditory c:: 0> nerve [26). ctl ~ Filters a through d are of the minimum-phase variety; Le., there are no zeros in the right-half (e) s plane (for analog filters) or no zeros outside the unit circle (for digital filters). Dispensing with these constraints makes other designs (filters e Frequency through h) interesting. The author and C.M. Rader [27) give detailed descriptions of filters a Fig. 17-Spectra obtained with different filter banks: (a) through J Most of the filter designs were first idealized steady-state speech spectrum, (b) bank of nar­ tested as digital implementations in a vocoder rowband filters, (c) spectrum obtained from narrowband­ filter bank, (d) bank of wideband filters, and (e) spectrum context at Lincoln Laboratory. obtained from wideband-filter bank. As the center frequency increases, the effec­ tive bandwidth of the ear's auditory filters also Discrete Fourier transforms (DFT) of the increases. Figure 18 shows three examples of speech signal can be processed to create a filter proposed filter-bank designs based on various bank. It is well known that there are important psychoacoustic data. mathematical relationships between the filter It is important to remember that the ear's bank and the DFT spectrum analysis. As an frequency resolution at higher frequencies is example, any filter bank consisting ofFIR filters poorer than at lower ones. Taking advantage of can be exactly emulated by a sliding DFT (in this fact is easily accomplished in a channel which a new DFT is taken for every sample) with vocoderbythe use offewer analysis channels for a proper window function. higher frequencies. From an implementational viewpoint, filter-

The Lincoln Laboratory Journal. Volume 3. Number 2 (l990) 181 Gold - A History oJVocoderResearch atLincoln Laboratory

(a) 11111111 II I Critical Bands

(b) 1111111 IIIIII I Equal-Articulation Index

(c) 11111 I II Third-Octave Bands

o 500 1000 1500 2000 2500 3000 3500 4000 Frequency (Hz)

Fig. 18-Various concepts for defining filter bandwidth versus center frequency: (a) critical bands, (b) equal-articulation index, and (c) third-octave bands.

bankanalysis seemsmore straightforward than a popular technique; in 1975 the U.S. govern­ the use of DFT. If. however, a very high-resolu­ ment recognized the LPC vocoder as the stan­ tion analysis is desired, fast Fourier transform dardvocodersystem. Onereasonfor the device's (FFT) methods can save considerable time. And popularity was its appreciably smaller size if one desires to resolve the harmonics of the compared, for example, to Dudley's traditional speech spectrum, DFT is the appropriate channelvocoder. Recent LPC research has been mechanism. For example. if the bandwidth directed toward efficient coding of the error of interest is 7 kHz and the desired fre­ signalwith the intentofdesigninga high-quality quency resolution is 7 Hz, then no less than a vocoder capable ofoperating in the range of2.4 2048-point spectrum is required. to 4.8 kbjsec. Since the original work ofAtal and Hanauer. Linear Prediction many variants of the original algorithm have been proposed and implemented. In this article, References 28. 29. and 30 provide detailed we briefly describe the autocorrelation method discussions oflinear predictive coding (LPC). In ofJ.D. Markel and A.H. Gray [30]. this article, we summarize the concept of LPC In an LPC vocoder. transmission ofthe error without the inclusion offormulas. signal and the coefficients is sufficient to recon­ LPC commences bypredicting the nth sample structthe exactspeech signalatthe receiver. Let of a signal from a linear, weighted sum of the us assume thatwe can faithfully transmitthese previous p samples. Developing a good method parameters. If our prediction method is any of computing the coefficients of the linear pre­ good, much of the speech information will be dictor is the essence of LPC. Note that the transferred to the coefficients and therefore the predictor is really a linear fIlter and because all error signal should contain substantially less operations are linear the true signal can be information than the originalsignal. This reduc­ recovered from the error signal with inverse tion in information can, in principle, lead to the fIltering. An error can be defined as the differ­ desired bandwidth reduction. ence between the predicted and measured nth We also note that the predictor is an FIR ftl­ sample. ter, which contains only zeros in the complex B.S. Atal and S.L. Hanauer [29] developed z-plane. Thus the must be an all­ LPC as an alternative method of speech analy­ pole fIlter (Fig. 11). sis-synthesis for bandwidth compression. Soon The next step is to introduce a criterion that after, E.M. Hofstetter created the first real-time leads to a reasonable computational procedure implementation ofan LPC vocoder. LPC became for fmding the coefficients; the procedure must

182 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research atLincoln Laboratory also meetcertain standards for accurate predic­ nulls in the spectrum are less accurately tion. One such procedure is the minimum tracked by the LPC approximation. mean-square criterion, formulated as follows. d. Another point of interest concerns audio A time interval is chosen dUring which the preprocessing. For example, let us choose original signal is sufficiently stationary (from 20 to filter the speech with a sharp-cutofflow­ to 40 msec for speech). Next, a set ofcoefficients pass filter at 3 kHz. Then, prior to our LPC is found to minimize the mean-squared value computation, letussample the signal at 12 of the error signal over that interval. Such a kHz. The LPC algorithmwill valiantly try to formulation leads to a set of p linear equa­ createa good spectralmatchnotonlyofthe tions. Standard mathematical methods are speech spectrum but also of the sharp then applied to find the desired coefficients of cutoff filter. The filter thus degrades the the original linear predictor. ability of the analysis to represent the Let us summarize the results and ideas that speech spectrumaccurately. The relatively have been developed up to this point. We will narrow 3-kHz bandwidth will cause the then determine what further steps are needed. resultant autocorrelation curve to stretch First, we have associated a specific structure with respect to the x-axis. Therefore, the with a speech wave such that a new speech initial p + 1 points of the measured corre­ sample can be predicted on the basis ofp previ­ lation will change more slowly, which ous measured samples. (In statistics, this type could lead to numerical errors in the solu­ of prediction model is called autoregressive.) tion of the autocorrelation-matrix equa­ The parameters ofthe model can thus be found tion. by solving a set of linear equations. We now discuss several consequences ofthis Homomorphic Vocoding version ofthe LPC algorithm. Most ofour state­ ments will be made without proof. In the early 1960s, researchers at Bell Labo­ a. When we compute the autocorrelation ratories studied the process of deconvolution. function of the impulse response of the Given a signal thatwas assumed tobe created by LPC synthesis filter, the first p samples of the convolution of two other signals, by what the function are equal to the correspond­ technique could the two other signals be sepa­ ing samples of the autocorrelation func­ rated and examined individually? An important tion of the speech itself. Therefore, LPC practical application of this problem is earth­ analysis can be viewed as an approximate quake analysis if one assumes that observed method of matching the correlation func­ seismic signals are responses of the earth to tion of the output speech to that of the sudden movements of faults [31). A.V. Oppen­ input speech, much as a channel vocoder heim independently explored the mathematics matches output and input spectra. of a specific class of nonlinear transformations b. lt can be shown that minimizing the z­ [32). Interestingly, the cepstral analysis devel­ transform of the error signal leads to the oped by the researchers and the same set of equations as that obtained by homomorphicfiltenng ofOppenheim turned out minimizing the error signal. Thus the to be the same scheme. mathematical formalism can be carried During a two-year stay at Lincoln Laboratory outin the spectral domain and leads to the in the late 1960s, Oppenheim developed the same synthesis structure as before. Fur­ homomorphic vocoder algorithm [33). The algo­ thermore, the autocorrelation function rithm takes the log magnitude ofthe OFT ofthe canbe computed by performingan inverse original speech and thus transforms the decon­ Fouriertransform onthe measured square volution problem into a quasi-linear problem. of the spectrum. This transformation allows for good separation c. The , or , of the vocal of the excitation function from the vocal-tract tract are faithfully preserved, whereas the transfer function. Figure 19 is a block diagram

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 183 Gold - A History oJVocoderResearch at Lincoln Laboratory

Input Speech Samples t Fourier Transform ~ I Log -.-TL I Magnitude Time ---. I I t t I I Inverse Fourier Inverse Multiplier ~ ~ Exponential I-- Fourier Transform r---. Fourier Transform I Transform I I Synthetic I Speech Convolve Cepstral I Pitch I Detector I I Excitation I Excitation Generator I Parameters I Analyzer I Synthesizer .. I ~

Fig. 19-5chematic ofthe homomorphic vocoder.

of the homomorphic vocoder. Vocoder Hardware In the 1970s, P. Blankenship implemented the first real-time simulation of the homomor­ By the early 1960s. many laboratories were phic vocoder on the Lincoln Digital Signal Pro­ actively designing and building channel vocod­ cessor (LDSP), which is described in the section ers. In November 1967. theAir Force Cambridge "Speech-Processing Facilities." Research Laboratories (AFCRL) sponsored a As stated earlier. all vocodersystems perform speech analysis-synthesis survey that invited deconvolution in that they separate the excita­ participants to exhibit their vocoder hardware. tion from the vocal-tract filter. The channel There were nine displays of2400-b / sec channel vocoder uses wide-bandpass filters and energy vocoders. from Collins Radio Company. L.M. detectors that effectively remove most of the Ericsson (Sweden), Lincoln Laboratory, Bell excitation signal's contributions. LPC performs Laboratories, Texas Instruments Inc., the U.S. the same function by rmding parameters of a Army. AFCRL. and Philco, which displayed two vocal-tract (all-pole) model. One interesting systems. Itisinteresting to note thatmany ofthe feature of homomorphic analysis is its use of a vocoders employed spectral flattening. high-resolution OFT. Thus with homomorphic The Lincoln Laboratory channelvocoder (Fig. vocoders. as distinct from eitherchannel or LPC 20) performed spectral analysis with a bank of vocoders, the spectral envelope and fine struc­ 16 analog Bessel bandpass filters. 16 half-wave ture are both available for subsequent spectral rectifiers. and 16 fifth-order low-pass Bessel and pitch-detection processing. filters that had 20-Hz cutoff frequencies. The

184 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory pitch and voicing hardware was based on the 1970s, the concurrent development of micro­ author's algorithm that is described both in the processor technology and DSP algorithms section "Pitch Detection" and in Ref. 15. The radically changed that ratio. By the mid­ digital hardware could produce a new estimate 1970s, the most efficient hardware approach every few milliseconds. A single 8-bit period and was with a programmable microprocessor. two 4-bit increments were transmitted for each Using such microprocessors, Hofstetter, J. 20-msec time frame. The spectral information Tierney, and O. Wheeler [341 implemented real­ was sampled every 20 msec, log-encoded to 5 time LPC hardware. After some study, the chip bits, and transformed via a Hadamard matrix. set chosen was a group of bit-slice-oriented Theinformation ratewas thus reduced to 33bits components. Although the original design called per 20-msec frame. for three separate microprocessors that would During the 1970s, a noteworthy change oc­ each perform a subtask, the final design re­ curred in the philosophy of real-time algo­ qUiredjust a single microprocessor. Hofstetter, rithmic implementations. In the 1950s, a Tierney, andWheelerfelt thatthe lone micropro­ special-purpose implementation of a vocoder cessor could most efficiently satiSfY all signal algorithm could be about 100 times faster processing requirements if it were augmented than a computer simulation of the same algo­ by a hardware multiplier. The overall device, rithm. But dUring the late 1960s and early called the LinearPredictive CodingMicroproces-

Fig. 20-Earlychannel vocoderbuiltatLincoln Laboratory.

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 185 Gold - A History oJVocoder Research at Lincoln Laboratory

ADR

Speech From In Modem Speech To Out Modem

Data Memory 1.5 k x16 ROM 0.5 k x16 RAM 6

12 8 Data ADR 10

CPE MBR MAR

16 11 16 16

LEGEND

AID Analog to Digital MPR Multiplier CPE Central Processing Element PIS Parallel to Serial DIA Digital to Analog RAM Random-Access Memory LP Lower Product ROM Read-Only Memory MAR Memory Address Register SIP Serial to Parallel MBR Memory Buffer Register UP Upper Product MCD Multiplicand JlIR Micro-Instruction Register MOR Memory Output Register

Fig. 21-Schematic ofLinear Predictive Coding Microprocessor (LPCM). sor (LPCM), was then the most compact real­ qUired to test the hardware gets larger. When time version of the LPC algorithm. Figure 21 is Dudley built his vocoder, he did so without the a block diagram of the LPCM and Fig. 22 is a aid of computer simulation. In contrast, the photograph ofthe completed hardware. Indeed, LPCM algorithm was first programmed by LPCM was a leap forward in compactness when Hofstetter on the Lincoln Fast Digital Processor compared to the channel-vocoder hardware of (FDP) (the first real-time LPC program) [35) and the 1960s. then programmed again on the Lincoln Digital The box of Fig. 22 contains two standard Voice Terminal (LDVT). Furthermore, an impor­ 16" x 7" universal wirewrap boards. The boards tant subtask of the LPCM program was the were chosen to accommodate 14- to 40-pin design and fabrication ofthe LPCM tester, which packages and the total hardware consisted of was umbilically connected to the LPCM dUring 162 dual in-line packages located on approxi­ the debugging phase. The main component of mately 1 Y2 boards. The power consumption the tester was a 1024-word x 48-bit random­ was less than 45 W. access memory that effectively replaced the It appears that as hardware becomes more programmable read-only memory (PROM) des­ compact, the equipment superstructure re- tined to reside in the LPCM. The tester also

186 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory duplicated some of the functions of the LPCM were eachused for specificfunctions: one imple­ program-control chip so thatthe LPCM could be mented the LPC analysis, another the pitch extensively tested. detection, and a third the LPC synthesis. The Another means of testing the hardware and remaining ICs included an Intel 8085-based flrmware was with a simulation program on 8-bit microprocessor chip set that performed Lincoln Laboratory's general-purpose facility, the control and communications functions. whichwas centered ona Univac 1219computer. Figures 23 and 24 show Feldman and Hof­ In addition, an assembler program that under­ stetter's device. stood LPCM mnemonics and symbolic ad­ Thus Lincoln Laboratory's pitch detector, dresses was written for the Univac machine. which used the same algorithm that had been The binary output produced by the assembler used in previous implementations and that had could be loaded into the LPCM tester, and the been invented by the author 20years ago, could PROMs were later burned with the LPCM's now fit onto a single programmable chip. This program memory. feat was especially fulfilling to the author, who By the end of the 1970s, technological ad­ had observed the original real-time implementa­ vances allowed even smaller vocoders. A design tion by V.J. Sferrino. The implemention had byJ.A. Feldman and Hofstetter (17) ofthe same occupied a complete rack of equipment (the LPC algorithm used in the LPCM required only middle rack of Fig. 20). 16 ICs (as opposed to the LPCM's 165 ICs), Itis interesting to note thatvocoderhardware occupied one-half of a 7" x 7" wirewrap board, was once described as taking up a substantial and dissipated 8.6W (compared to 45 W). Three portion ofthe space ofa large boat (6). Today the signal processing chips (NEC's PD7720s) were same functions can be embodied on a single actually individual microprocessors, but they silicon chip.

Fig. 22-The Linear Predictive Coding Microprocessor (LPCM) of Fig. 21.

The Uncoln Laboratory Journal, Volume 3. Number 2 (l990j 187 Gold - A History oJVocoder Research at Lincoln Laboratory

Speech In Pre- emphasis Serial NEC 6 -----..In /J PD7720 /8 ,-- LPC / Analyzer Protocol ---1------' Processor Transmit Filter

Single­ ~ / Chip NEC I Coder/ Coder /JPD7720 /8 Gold Pitch / Decoder Serial Detector with In Filter 4-Chip /8 '/ Control Ioe0t' I Microcomputer

NEC Receive '--- /J PD7720 /8 Filter / Serial LPC 8-Bit ------Out Synthesizer Micro- computer Analog Data Bus Synthe­ De­ sized Emphasis Speech Out

Fig. 23-lmplementation of linear predictive coding (LPG) with three programmable signal processing chips.

Low-Rate Vocoder Systems state, and prosodic nuances all count as infor­ mation. Although it is not clear how much of In an infonnal exercise, the author recently such extra infonnation actually exists or is timed himselfreading a paragraph aloud for 20 really pertinent, it seems reasonable to assume seconds and then counted the total number of that a good 2400-b/sec vocoder can contain letters, numbers, spaces, and punctuations in nearly all ofthis extra infonnation. Thus we can the paragraph. By considering each of these to conservatively deduce that the limit of band­ be eqUivalent to a 5-bit Teletype character, the width compression is somewhere between 75 author counted a total of 258 characters, or and 2400 b/sec. This section will study systems 1290 bits. A Teletype machine transmitting the with bit rates within that range. same paragraph at 75 b/sec would have com­ An obvious way to reduce the rate ofa 2400­ pleted the job in 17.2 sec. Thus, to a first ap­ b/sec vocoder is to lower the frame rate and proximation, a Teletype system can transmit quantize the parameters more coarsely. How­ textual infonnation at almost the same rate as a ever, if a brute-force approach is taken, quality person speaking. and intelligibility deteriorate rapidly. The frame Speech, however, has much more infonna­ rate canbeeffectively reduced and fewer bits can tion than text. The speaker's identity, emotional be allocated for parameters, but some degree of

188 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory

Fig. 24-Hardware for three-chip linearpredictive coding (LPG) vocoder of Fig. 23.

sophistication is required. We shall first con­ tracking methods have a long and complicated sider the frame-rate problem and then the pa­ history. To summarize, manyversions ofthe ap­ rameter-quantization problem. Such methods proach have been proposed and tried for the have proved to be useful in achieving rates as past 30 years, but they have failed in moving low as 1200 b/sec. from the research environment into a more For a further reduction. the methods become practical setting. more powerful. Notable among these methods is the concept of pattern matching that was in­ Reduction ojBit Rate by Effective vented by C.P. Smith [36. 37) and reinvented Lowering ojFrame Rate more recently under the name vector quantiza­ tion by A. Buzo. A.H. Gray. Jr.. R.M. Gray. and E. McLarnon's fixed-rate method for lowering J.D. Markel [38). Pattern-matching techniques the frame rate [39J has been widely used in the appear to give good results in the range of 600 speechresearch group at Lincoln Laboratory. In to 1000 b/sec. Further reductions entail this method. onlyalternateframes are transmit­ some form of phonetic vocoding. which implies ted. (A variation ofthe method callsfor transmit­ some type of . and the anal­ ting every third frame.) A 2-bit code that is also ysis problem becomes correspondingly more transmitted commands the receiver in the fol­ difficult. lowing way: It is also worth mentioning formant-tracking Assume frame m is the frame that is not to be methods. which are oriented more toward pat­ transmitted. Then if frame m is sufficiently similar to frame (m - 1). the receiver will tern recognition than toward pattern match­ reproduce frame (m - 1) as frame m. On the ing. Here the game is to analyze the speech into other hand. if frame m is similar to frame fewer parameters than the number used by (m + 1), then frame (m + 1) is reproduced either channel or LPC vocoders. Formant- as frame m. Finally. if frame m occurs dur-

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 189 Gold - A History ofVocoderResearch atLincoln Laboratory

ing a rapid spectral change, frame m Pattern Matching might be more similar to the interpolated valuebetweenframes (m+ 1) and (m-1). Since In the late 1950s, C.P. Smith [36] introduced these two frames are available at the receiver, the concept of pattern matching for bandwidth the necessary interpolation can be computed there. reduction. His reasoning was very straightfor­ In summary, the above method, which is ward. Imagine that you have a channel vocoder called framefill-in, achieves a savings of nearly with 16 channels and you arbitrarily devote two to one. Blankenship and M.L. Malpass 3 bits to quantize each channel. Thus a total of applied the method to both a channel vocoder 48 bits is used. This arrangement implies that and an LPC vocoder [40). Here are their sum­ your coding strategy is capable of distin­ 48 mary and conclusions: gUishing among 2 spectral measurements. It Methods have been described based on the is clear that the human ear cannot tell that principle of frame fill-in for developing re­ many patterns apart. Therefore, even ifa person duced-rate transmission systems from stan­ could tell any two patterns apart from a total of dard 2400 bits-per-second backbones. Itwas 220 (Le., 1,048,576) patterns, 3 bits per chan­ shown that both channel vocoders and LPC types ofvocoders could be adapted with virtu­ nel for a 16-channel system is still highly re­ ally no increase in computational complexity dundant. Now, given a strategy that permits the to operate at 1200 or 800 bits-per-second. identification of the storage location ofanyone It was found through forinal and informal of the million or so patterns, we see that the evaluation methods that both channel transmitter needs to transmit only that storage vocoder and LPC-based systems perform location. Because the receiver has the same set quite well at 1200 bls and would probably be usable in most environments where the of stored patterns as the transmitter, the re­ 2400-b/s parent could be successfully oper­ ceiver will generate the correct spectrum ated. At 800 bls both systems were consider­ upon receiving the address. ed marginal and usable only in limited At Lincoln Laboratory, these concepts circumstances. However, the channel vo­ were applied to D.B. Paul's Spectral Envelope coderwas seen to perform incrementally bet­ Estimation Vocoder (SEEVOC) system [43, ter, probably due to the uniquely efficient parameter coding scheme employed which 44) and to the author's channel vocoder [45). tends to be less sensitive to quantization Thus, in the late 1970s and early 1980s, inaccuracies. there were at least three implementations of vector quantization. Each used a different Reduction ofData Rate by Reduction of vocoder configuration, a different comparison the Number ofBits per Frame mechanism for measuring differences (e.g., the use of a weighted rms measure), and The British were the first to implementa very a different strategy for building a system. simpleand quite successful transformation that used differential coding [41). Their resulting Adaptive Pattern Matching Belgard vocoder coded the lowest-frequency channel with 3 bits and all other channels with To deal with the large number of patterns, 2-bit differential coding. Since the major corre­ Paul developed another approach [44) that lations in a channel vocoder are between adja­ could adapt to new speakers by continuously cent channels, the differential coding was very altering the pattern table to match the current effective. The technique was also employed by speaker and environment. In Paul's approach, the author to develop a new spectrally flattened an incoming spectrum is compared with all channel-vocoder algorithm [42). Differential existing reference spectra. If the best match codingimproved the qualityofthe algorithm and fails to satisfy a fixed criterion, the new spec­ led to a coding strategy that used 6 bits for the trum is incorporated into the pattern table. lowest channel and 3-bit differential coding for The new spectrum replaces that reference subsequent channels. spectrum which has not been matched for

190 The Lincoln Laboratory Journal. Volume 3. Number 2 (l990) Gold - A History oJVocoderResearch atLincoln Laboratory the longest time period. relatively slowdigital computercould make only The performance of Paul's system was im­ limited contributions in the field. Since then two pressive. When a new speaker began to talk, a majorevents-the developmentofa comprehen­ briefperiod ofquasi-intelligible vocoded speech sive theory ofdigital signal processing and rap­ was followed by adaptation: the system qUickly id progress in the integrated circuit field­ tuned in on the new speech. It should be noted have propelled digital technology and computer that the system required the periodic transmis­ processing methods to their current status sion of updated reference-pattern sets to the as integral aspects of both speech-processing receiver. Also, the maintenance ofa low bit rate research facilities and specialized speech required the detection ofsilence intervalsdUring hardware devices. which new pattern sets could be sent. The During the 1940s and 1950s, the human­ system performed well at 800 b/sec. Lincoln speech-production mechanism was success­ Laboratory's SEEVOC algorithm (43) was used fully modeled via transmission-line and analog­ in the experiment. network theory. Spectrum analysis of the speech wave was conceived as the outputs of a Fonnant Analysis and Synthesis bank ofanalog bandpass filters and associated for Bit-Rate Reduction energy detectors. Formants were analog tuned circuits. It seemed natural to assume that de­ Thus far our methods for reducing the vices such as the channel vocoder and various vocoder data rate have depended on finding a formant synthesizers were intrinsically analog good parametricdescription ofthe speech short­ devices. Yet it was recognized very qUickly that time spectrum, followed by attempts to remove the digital computer did have a role to play in the redundancies in the parameters without speech research. unduly disturbingthe synthesized speech. Ifthe By the mid-1950s, in fact, digital attempts to remove the redundancies are in any were used as tools for speech-recognition re­ way successful, it will imply the existence of a search at Lincoln Laboratory and at the MIT parameter set that is less redundant than the Research Laboratory of Electronics. In early parametersets thatcan be derived, for example, experiments, the speech had to be spectrum­ from channel-vocoder, LPC, or homomorphic analyzed first-a task that seemed too formi­ analyses. Indeed, such a parameter set, con­ dable for computers-so an analog filter bank sisting of formants, has long been a popular had to be incorporated into the system (Fig. 25). topic of speech research. Using formants, J.L. In the figure, each output channel is a relatively Flanagan built a device that he called a narrowband signal that has significant spectral tenninal-analog synthesizer along with an as­ components in the region of 0 to 25 Hz. There­ sociated analyzer (46). fore, sampling each output at 50 Hz preserved Despite many ingenious efforts, however, spectral information. If, for example, the spec­ formant analysis and synthesis has not led trum analyzer contained 36 channels (as was to any practical bandwidth-compression the case for an early analyzer that was con­ systems. J.N. Holmes (47), R.J. McAulay (48), nected to the Lincoln TX-2 computer, which will and G.S. Kang and D.C. Coulter (49) have be discussed in the next section), then the A/D spearheaded some of the more sophisti­ converter would need to receive the incoming cated recent research. samples at a rate of 1800 samples/secwith per­ haps 8-bit accuracy. This requirement could be Speech-Processing Facilities fulfilled in the mid-1950s. Once the speech samples were safely in the More than 30 years ago, researchers in computer, a variety of speech-recognition algo­ speech processing knew that the digital com­ rithms couldbe tested. Usingthe combinationof puter would one day playa major role in speech analog spectrum analyzer and digital computer research. At the time, however, the large and to performresearchonvowel and fricative recog-

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 191 Gold - A History oJVocoder Research at Lincoln Laboratory

Bandpass Low-Pass ..... ~ Rectifier L/. Filter .... ~ F;lter Sampler • Speech Analog-to- Computer ....~ • Multiplexer Digital ...... • • r+ Interface • Converter •

..... BandpassFilter ectl ler LowPasSL/J f-8 Filter Sampler

Fig. 25-Analog spectrum analyzer feeding a digital computer. nition, J.W. Forgie and C. D. Forgie [50-52] made In the extreme right of Fig. 26 is a panel of inroads into the massive problem of automatic lights that are associated with the arithmetic recognition ofconnected speech. element. Between the two people in the photo In the 1960s, computer methods greatly are 24 toggle-switch registers that could be augmented research and development on chan­ accessed by the computer as ifthey were part of nel vocoders. Lincoln Laboratory and MIT were the main memory. also heavily involved in the development ofDSP Unlike most computers of that period, TX-2 theories and techniques, and Lincoln Labora­ was very flexible and was often adapted to the tory engineers took advantage of the latest IC needs ofan individual user. Itwas designedwith technology to build computers that would be a large number of available input and output more effective as speech research tools. Four ports. Thus, in addition to accommodating distinct systems were developed: the TX-2, the conventional peripheral devices such as tape Fast Digital Processor (FDP), the Lincoln Digital units, drums, and printers, one could also de­ Voice Terminal (LDVT), and the Lincoln Digital velop special hardware to interface with TX-2. Signal Processor (LDSP). Each computer had an During its time, TX-2 served as a test bed for impacton speech research and eachwas in turn new memory developments, new computers, influenced by the continuing development of experimental display facilities, and various speech algorithms. other specialized user-interface devices. TX-2 had many properties and features that The TX-2 Computer made the computer useful for speech process­ ing. The following is a partial list. The TX-2 computer [53-55], designed at Lin­ a. TX-2 was managed in a nonbureaucratic coln Laboratory, followed previous efforts that way. A user could interact directly with led to computers suchasWhirlwind [561 and the TX-2 without any help from an intermedi­ Memory Test Computer [57]. TX-2 was a very ary. The degree ofinteraction was equiva­ large computer dedicated primarily to Lincoln lent to present-day workstations (such as Laboratory staff members for research in com­ those manufactured bySun Microsystems puter technology and in algorithmic work in [58]) and the SPIRE system [59]. such fields as pattern recognition and the simu­ b. Results could be given to a user on a CRT, lation of nervous-system functions. Figure 26 on a printer, or by acoustic means. shows the console ofTX-2. It is interesting that c. A large core memory permitted the storage the entire system could be run by a single user. of about 10 seconds of speech.

192 The Lincoln Laboratory Journal. Volume 3. Number 2 (J 990) Gold - A History oJVocoder Research atLincoln Laboratory d. Programs could beqUickly modified on line hours. With TX-2, the time was reduced to via the toggle switches and knobs men­ several minutes. tioned above. This feature permitted users Of course, DSP algorithms, such as digital to set parameters and control programs filtering and FIT, improved the vocoding per­ while the programs were running. formance ofall computers. Butthe advantage of The TX-2 facility played a key role in vocoder theTX-2 facility as a vehicle for speech-process­ research at Lincoln Laboratory and enabled the ing research stemmed from its mode of opera­ following accomplishments: tion as a user-interactive, private workstation. a. development of the parallel-processing TX-2's large memory, excellent display facility, pitch-detection algorithm described in the and flexible I/O, coupled with new DSP algo­ section "Pitch Detection," rithmic methods, resulted in a very sophisti­ b. testing ofthe pitch-detection algorithm by cated speech-processingfacility for thatera (the the connection ofTX-2 to real-timevocoder early 1960s). hardware, and However, perhaps the most critical property c. simulation ofa complete channel vocoder. ofan adequate facility-that of real-time simu­ With the development ofDSP theory, it be­ lation-had not been attained. Experience has camefeasible to simulatea complete chan­ taught us that a new speech-processing system nel vocoder on TX-2. Real-time simulation has not really been fully evaluated until it has was out ofthe question but the enormous processed an immense amount ofspeech mate­ increase in algorithmic speed turned rial of different speakers and different environ­ simulation into a practical alternative for ments. The only practical method of fulfilling research on a complete vocoder. In 1959, such criteria is to build real-time hardware or to M. Mathews [60] computed the time to have available a facility capable of real-time process one second ofspeech to be several simulation.

Fig. 26-TX-2 computer.

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 193 Gold - A History oJVocoderResearch at Lincoln Laboratory

The Fast Digital Processor (FDP) display was designed for fast scanning so that a real-time spectrogram could be implemented. During the 1960s, researchers at Lincoln Ifwe estimate the degree ofparallelisminFDP Laboratory felt that recent advances in high­ to be about eight times that of a standard speed circuits coupled with OSP theoretical sequential computer. then we can approximate advances made it possible to perform real-time the FOP throughput to be about 50 MIPS. simulation of software, such as vocoding algo­ rithms. that required great computational The FDP-Univac Signal speeds. Thus in 1967 Lincoln Laboratory initi­ Processing Facility ated a studythat led to the design and construc­ tion of the signal processing computer named FOP was a physically large device. Figure 28. the FastDigital Processor [35]. FOPwas the first in which the computer stands in the foreground of the class of computers that are now called ofthe photo. gives a slightly exaggerated view of array processors. FOP's size. Inside the two sets of large double The FOP circuitry was second-generation doors are the four arithmetic elements and the emitter-coupled logic (ECL) with gate-switching control element. To the right of the doors are times of about 3.5 nsec. Even with such fast drawers that house the memory, and surround­ switching times, Lincoln Laboratory research­ ing the drawers are the power supplies. The ers discovered that many useful speech algo­ Univac-1219 computer (the main frame to rithms could not be simulated in real time if a which the FOP was attached) isjust to the right straightforward sequential computer structure of the FOP. was adopted. Consequently, computing strate­ At the assembly-language level, itwas appre­ gies that used parallelism and pipelining were ciably more difficult to program FOP than itwas implemented. Figure 27 is an architectural to program a simple Von Neumann computer. sketch of FOP that helps to illustrate that Furthermore, array processors are not only a. four parallel arithmetic elements and dual more difficult to program but also more difficult data memories were implemented to speed to debug because the user must keep in mind up complex arithmetic operations for FIT many more states. No attemptwas made to write programs. Digital filtering and autocorre­ a high-level language for the FOP even though lation programs were also speeded up. such an effort might have led to more wide­ b. program memory and data memory were spread use ofthe facility. Nevertheless, valuable physically separated so that both of the research was conducted at the facility: system's memories could be simultane­ • the first real-time simulation of an LPC ously accessed (referred to as the Harvard vocoder was implemented [61], architectural style). • a real-time speech spectrogram was com­ c. a horizontal microcode structure allowed puted on FOP and displayed with the help for the simultaneous manipulation of ad­ of special-purpose hardware [62], and dress and data. • a high-speed non-real-time spectrally flat­ d. instructions were pipelined. A typical in­ tened channel vocoder was simulated on struction required three cycles (each cycle FOP [63]. took 150 nsec) so that three instructions The facility also proved to be useful in several were usually in the pipeline at any instant. radar projects. Because FOP's input and output capabilities The next generation of ECL was more than were limited, a Univac-1219 computer, to which twice as fast as the circuitry used for FOP. This FOP served as a peripheral. was in charge of increase led to new ideas toward improving the I/O control. For real-time processing of analog facility; one ofthe evolving concepts was that of signals suchas speech. A/O and 0/A converters a speech-processing device that could serve wereincorporated. Later, a core memory ofl28k both as a facility and as a field device. The 18-bitwords and a display were connected. The concept led to the design and construction of

194 The Lincoln Laboratory Joumal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory

8 Channels ••• ! ! UNIVAC 1219 Multiplexer

Channel 1 Channel 0 Channel 0 Channel 1 (In) (In) (Out) (Out)

Eo Register

18 18 18 18 Mt(Left) M~(Light) M~(Left) M~(Right) 256 256

Mo Mo

1024 1024

18 18

LEGEND M Memory X Index Register 18 Me Control Memory XAU Index Arithmetic Unit ~ =256x 18 bits SWR Switch Register

Fig. 27-Schematic of Fast Digital Processor (FOP).

The Lincoln Laboratory Joumal. Volume 3. Number 2 (1990) 195 Gold - A History oJVocoder Research at Lincoln Laboratory several Lincoln Digital Voice Terminals. formed better for most problems. Size con­ straints, however, limited the amount of mem­ The Lincoln Digital Voice ory in an LDVT. Another important limitation Terminals (LDVT) was that ofI/O capability. (Because the mission ofLDVTs was to be real-time hardware devices, Early design studies (64) showed that with there was no need to include a truly flexible ECL technology and a careful design, a simple system that could easily communicate with Harvard-style uniprocessor could support a various peripherals, including other comput­ basic cycle time of 55 nsec. The fast cycle time ers.) However, our experience with LDVTs and meant that with appropriate pipelining the raw the programs theyimplemented broughtto light speed ofan LDVT would be approximately three the many advantages of small, fast, easy-to­ times greater than that of FOP. It was also program computers. Thus, when the opportu­ estimated that the proposed structure could nity arose to construct a new generation of result in real-time simulation of many types of signal processors, we used LDVT as the starting vocoders, including LPC. And, in fact, a wide point and added more memory along with a variety ofvocoder algorithms was implemented more capable I/O system. inreal-time LDVTs, bothin the laboratoryand in the field. Hofstetter, Blankenship, Malpass, and The Lincoln Digital Signal Seneffdescribe the algorithms and their imple­ Processor (LDSP) mentations (65), and Ref. 66 describes the LDVT architecture. Figure 29 shows the architecture of the suc­ LDVTs had severaladvantages overFDP: they cessor to LDVT: the LDSP. Figure 30 is a photo were substantially smaller and cheaper, they of the rack-mounted LDSP hardware, which were much easier to program, and they per- consisted offour LDSPs controlled by a PDP-II

Fig. 28-Fast Digital Processor (FOP) facility.

196 TIle Lincoln Laboratory Journal. Volume 3. Number 2 (J 990) Gold - A History oJVocoder Research atLincoln Laboratory

16 / To I Console +l MPXR ~ :::.:::: :::.:::: 1--12 ....J ..... 16 ....J 1--16 () () n--±1r t t t t t t t x « IM~XRI .... 16 "0 ADR c To, ell Data t t E ... ALU/Multiplier Memory 0 II X III A II P I To E 4' (4kx16) 0 .... 16 1--16 .... 1--12 ~ () DO 01 Adder• 7 12 r-- 6 6...... ~ + + / Q;~ I : 5 7-f~5...... ~ 1/ 4 4 ...... ~ ~ T • "0 I/O I "0 3...... () I : 3 R I I MIR II MOR I +1--' « 2 2...... :5 I ,...... 9- --- I : , f----. 0 0"""" <3 t t t L Input To t .-- ...... MIRCLK To 7 Channels ~ 6'" 16 a:5 1J El:4 , :2 3 i;j2 ----+- To , Timing 0 Generator ADR a. ----+- T, 2 Q; T a: Bootstrap C t 4- x Loader 0- Operator Program Memory y Console Ol L..... (2 k x 16) ..... p-Code r .=i- DO Microcontrols PROM • 6 1 9 • (64 x 48) Console • To OP Code X Y I-

LEGEND

A Accumulator MOR Memory Output Register ACLK Accumulator Clock MPXR Multiplexer ADR Address OP Code Operation Code ALU Arithmetic and Logic Unit P Pulse Register 0 Data Register PROM Programmable Read-Only Memory 01 Data Input R Register DO Data Output X Index Register MIR Memory Input Register XCLK Index Register Clock MIRCLK Memory Input Register Clock

Fig. 29-Schematic ofLincoln Digital Signal Processor (LDSP). computer, and a signal processorfor each LDSP. time mode required downloading a program As a result of the development of the LDVfs from the general-purpose computer. The new and LDSP, Lincoln Laboratory established a facility used time sharing but also functioned facility where the signal processing machines effectively for real-time simulation of vocoder were connected to a single general-purpose algorithms. For fast, non-real-time processing, computer. LDVf or LDSP operation in a real- speech had to be entered into the disk system in

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 197 Gold - A History oJVocoder Research at Lincoln Laboratory

Fig. 30-The Lincoln Digital Signal Processor (LDSP). real time at a sampling rate as high as 15,000 packets are then reassembled at the receiving samples/sec. Lincoln Laboratory designed an end. intricate piece of real-time software to perform An attractive feature of packetized speech is this task [67). Another important addition its inherent flexibility in regard to data rates for was audio signal processing such as filtering, either speech or data. This flexibility inspired sampling, preemphasis and deemphasis of the idea that variable-rate vocoders could be of high frequencies, and A/D and D/A conversion benefit in a packet environment. The goal ofone [68,69). of the earliest simulation efforts on the LDSP Among other duties, the LDSP facility has was to create a useful variable-rate algorithm. conducted The vehicles chosen were the channel vocoder a. new algorithm research, and the subband coder [71). This combination b. diagnostic rhyme testing of vocoder sys- was particularly convenient because the band­ tems, pass fIlters in each channel could serve as a c. psychophysics testing, carrier of either vocoding information or wave­ d. testing ofnewsignal processing chips, and form-coding information. e. systems testing using LDSPs as compo­ LPC algorithms and variations were also im­ nents. plemented in real time on one or more LDSPs. Several examples of research are described be­ The 2400-b/sec government standard [72) and low in more detail. the Lincoln Laboratory 2400-b / secversion were Packet communications, as exemplified by implemented, each on a single LDSP. Avariation ARPANET [70], is a new method of transmit­ on the Lincoln LPC added spectral flattening. ting and receiving information. The data to be To improve the system's performance in the transmitted are divided into digitized blocks, or noisy environment of an advanced fighter air­ packets, and transmitted in a burst mode. The craft, an updated version of the LPC algorithm

198 The Lincoln Laboratory Joumal. Volume 3. Number 2 (1990) Gold - A History oJVocoder Research at Lincoln Laboratory

was implemented on the Advanced Linear Pre­ Summary and Prospects dictive Coding Microprocessor [73]. In this ver­ for the Future sion, the speech bandwidth was increased to 5 kHz so that more parameters could be trans­ Lincoln Laboratory's research on speech mitted and the analysis rate was doubled bandwidthcompressionbeganwith the applica­ [74]. Frame-fill techniques (described in the tion ofparallel processing to the fOrmidable task section "Low-Rate Vocoder Systems") were of detecting the fundamental frequency of used to achieve a 2400-b/sec data rate. speech. This successful effort led to several McAulay experimented with an LPC analyzer years of vocoder design and implementation and a spectrally flattened channel-vocoder and featured the novel ideas ofspectral flatten­ synthesizer [48]. He also implemented a noise­ ing and the use of banks of linear-phase­ reduction scheme [75] that has been used as a bandpass filters. Innovation in vocoder preprocessorfor both LPC vocoders and a simple hardware design was best exemplified by the word-recognition device. construction of the first pitch detector that Using software based on the Belgard chan­ used integrated circuits. nel-vocoder algorithm, several Lincoln re­ Lincoln Laboratory also pioneered the use of searchers wrote a program to improve vocoder computer simulation in vocoder research. This performance over a high-frequency link. The early work culminated in the late 1960s with idea was that the receiving modem would re­ the Lincoln Experimental Terminal Vocoder quest that the block of data be repeated when­ (Letvoc), the first practical vocoder for satellite ever the channel error rate exceeded a thresh­ communications (80). old. During the repeat mode the receiving Following a brief period dUring which the vocoder remained silent. When the channel efforts of speech researchers were directed returned to normal mode the vocoder was made toward the emerging theory and technology of to speak at a faster rate so that the system could DSP, FOP was designed, built, and used to catch up with the transmitted signal. Reference program the first real-time LPC vocoder. Follow­ 76 describes the work in greater detail. ing that period, speech and DSP research ad­ McAulay and T.F. Quatieri have pioneered a vanced in parallel, and Lincoln Laboratory built newanalysis-synthesis systembased on a sinu­ LDVT and LDSP and invented and implemented soidal representation of the speech signal. many different vocoder systems via real-time At high data rates the system produces speech programs. Recent efforts have included the Sine (or music) that is indistinguishable from the Transform Coder and its implementation on ad­ input [77, 78]. The incorporation of high­ vanced signal processing chips. speed programmable microprocessors en­ The descriptions of facilities in this article abled the system to operate on a real-time basis have taken us chronologically from some of the [79]. Researchers are currently trying to very early results of computers as speech pro­ develop a good, low-rate model ofthe excitation cessors to the present. The history should help parameters. us to determinewhatkind offacility is mostsuit­ In some cases, an algorithm was so com­ able for speech research. In this respect, two putation intensive that real-time implementa­ items are of major significance: tion of the algorithm on a single LDSP was not 1. The facility should have the features of a possible. Because the Lincoln Laboratory facil­ personal workstation, with good displays, ity comprised four LDSPs connected to a gen­ flexible I/O, and versatile audio-signal eral-purpose processor (the LDSPs were also conditioning. The user should be able to able to communicate with each other), it enter material easily from tapes or cas­ was often possible to handle these computa­ settes into computer memory and to dis­ tion-intensive algorithms with multiple-LDSP play waveforms, spectra, and processing real-time simulation. results.

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 199 Gold - A History oJVocoder Research atLincoln Laboratory

2. The facility should be capable of high­ Problems persist in the vocoder field. For speed processing so that real-time opera­ instance, present-day vocoders have difficulty tion can be implemented when necessary. handling adverse environmental conditions Experience has taught us that a large such as acoustically noisy backgrounds. Be­ number of vocoder algorithms cannot be cause the human auditory system does a good adequately tested unless real-time opera­ job ofcomprehendingspeech undersuch condi­ tion is available. tions, the question arises as to the feasibility of Electronic circuits are becoming faster and building a vocoder analyzer that performs com­ denser, a trend that makes the observerwonder parable functions. what to expect offuture generations of speech­ At the very outset of modeling the human processing facilities. For example, perhaps a auditory system, we are faced with a giant single LDSP that is substantially more powerful computational problem. The auditory nerve than the presentLDSP would permitimplemen­ contains about 30,000 fibers, each of which tation ofthe vocoder algorithm described above has properties not unlike that of a tuned on a single LDSP rather than three. Because circuit. The faithful simulation of such a sys­ speed (not memory) is the only limitation that tem is a large undertaking, but it would not prevents real-time simulation ofthis particular be surprising if future technology brought algorithm from being implemented onto a single us close to such capabilities. The larger chal­ LDSP, we see that either faster chips or a more lenge is to understand in greater detail the highly parallel architecture is needed. Present intricate workings of the human auditory sys­ VLSI technology has yet to produce faster chips tem. than those already in use in the LDSP. Conse­ quently, greater parallelism is called for, which Acknowledgments leads to the tentative conclusion that advances must be made both in parallel-processinghard­ The contributions made by Lincoln Labora­ ware and, perhaps most importantly, software tory for the past 30 years in the art and science to accommodate such hardware. of vocoders includes the work and efforts of Although real-time processingis often essen­ many staff members. Here is at best an incom­ tial, the detailed diagnosis ofspeech-processing plete list: PeterBlankenship, Paul Demko, Norm results is also vital. These two requirements are Daggett, John Harris, Joel Feldman, Ed conflicting because, for example, one cannot Hofstetter, Marilyn Malpass, Bob McAulay, Jim simultaneously be looking at all spectral cross Forgie, Al McLaughlin, AI Oppenheim, Doug sections in a real-time system. There are two Paul, Tom Quatieri, Charles Rader, Paul Rosen, possible solutions. With sufficient speed and Stephanie Seneff, Vinnie Sferrino, and Joe Ti­ memory, computers canbe programmed so that erney. The author would like to offer his special diagnostic information is compiled in real time thanks to Paul Rosen, whose initiative created while an algorithm is running. A researcher can the atmosphere for this extended period of re­ then scan back and examine the microscopic search; toJoeTierney, whose continued involve­ results. The second approach is to use non-real­ ment contributed much substance to the work; time algorithms (identical to the real-time algo­ and to Peter Blankenship, whose encourage­ rithms) to study the results on a non-real-time ment and thoughtful editing enhanced this basis. article.

200 The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) Gold - A History ojVocoder Research atLincoln Laboratory

28. E.M. Hofstetter, "An Introduction to the Mathematics References of Linear Predictive Filtering as Applied to Speech Analysis and Synthesis," Technical Note TN-1973-36, 1. H. Dudley and T. H. Tarnoczy, "The Speaking Machine Revision I, Lincoln Laboratory (12 Apr. 1974), of ," J. Acoust. Soc. Am 22. OTIC #AD-777579. 151 (1950). 29. B.S. Atal and S.L. Hanauer, "SpeechAnalysis by Linear 2. W. von Kempelen, Mechanismus der menschlichen Prediction ofthe Speech Wave," J. Acoust. Soc. Am. 50. Sprache nebst der Beschreibung seiner sprechenden 637 (1971). Maschine(179l). 30. J.D. Markel and AH. Gray, Jr., "On Autocorrelation 3. J.L. Flanagan, SpeechAnalysis: Synthesis andPercep­ Equations as Applied to Speech Analysis," IEEETrans. tion, 2nd ed. (Springer-Verlag, New York, 1972). Audio Electroacoust. AU-21. 69 (1973). 4. H. Dudley, RR Riesz. and S.A. Watkins, "A Synthetic 31. B. Bogert, M. Healy. and J. Tukey, ''The Quefrency Speaker," J. Franklin Inst. 227. 739 (1939). Analysis of Time Series for Echoes," Proc. Symp. on 5. H. Dudley, "TheVocoder," BellLab. Re.17. 122 (1939). Time Series Analysis, ed. M. Rosenblott (John Wiley, 6. W.R Bennett, "Secret Telephony as a Historical Ex­ New York, 1963), p. 209. ample of Spread-Spectrum Communication," IEEE 32. AV. Oppenheim, "Superposition in a Class of Non­ Trans. Comm. COM-31. 98 (1983). Linear Systems," Technical Report 432, MIT R.L. E. (31 7. F.B. Colton, "The Miracle of Talking by Telephone," Mar. 1965), OTIC #AD-815344. National Geographic, 395 (Oct. 1937). 33. AV. Oppenheim, "A SpeechAnalysis-SynthesisSystem 8. P. Lieberman, "Perturbations in Vocal Pitch," J. Acoust. Based on Homomorphic Filtering," J. Acoust. Soc. Am. Soc. Am 33.597 (1961). 45.458 (1969). 9. P. Lieberman, private communication. 34. E.M. Hofstetter, J. Tierney, and O. Wheeler, "Micropro­ 10. RJ. Ritsma, "Frequencies Dominant in the Perception cessorRealization ofa LinearPredictiveVocoder," IEEE ofthe Pitch ofComplexSounds,"J. Acoust. Soc. Am 42. Trans. Acoust. Speech Signal Process. ASSP-25. 379 191 (1967). (1977). 11. M.M. Sondhi. "New Methods ofPitch Extraction" IEEE 35. B. Gold, I.L. Lebow, P.G. McHugh, and C.M. Rader, ''The Trans. Audio and Electroacoust. AU-16. 262 (June FOP, a Fast Programmable Signal Processor," IEEE 1968). Trans. Comput. E-20. 33 (1971). 12. J.D. Markel, "The SIFT Algorithm for Fundamental 36. C.P. Smith, "Voice Communications Methods Using Frequency Estimation," IEEE Trans. Audio Electro­ Pattern Matching for ," J. Acoust. acoust. AU-20. 367 (Dec. 1972). Soc. Am. 35. 805(A) (1963). 13. AM. Noll, "Cepstrum Pitch Determination," J. Acoust. 37. C.P. Smith, "Perception ofVocoder Speech Processed Soc. Am 41. 293 (1967). by Pattern Matching," J. Acoust. Soc. Am. 46. 1562 14. B. Gold and L.R Rabiner. "Parallel Processing Tech­ (1969). niques for Estimating Pitch Periods of Speech in the 38. A. Buzo, A.H. Gray, Jr., R.M. Gray, and J.D. Markel, Time Domain" J. Acoust. Soc. Am. 46. 442 (1969). " Based upon Vector Quantization," 15. B. Gold "Note on Buzz-Hiss Detection," J. Acoust. Soc. IEEE Trans. Acoust. Speech. Signal Process. ASSP-28. Am. 36. 1659 (1964). 562 (1980). 16. N.L. Daggett, "A Computer for Vocoder Pitch 39. E. McLamon, "A Method for Reducing the Frame Extraction," TechnicalReportTN-1966-3, Lincoln Labo­ Rate of a Channel by Using Frame Interpolation," ratory (18 Feb. 1966), OTIC #AD-629361. Proc. ICASSP '78, Washington, 10-12 Apr. 1978, 17. J.A. Feldman, "A Compact Digital Channel Vocoder p.458. Using Commercial Devices," Proc. ICASSP '823. Paris, 40. P. Blankenship and M.L. Malpass, "Frame-Fill Tech­ 3-5 May 1982, p. 1960. niques for Reducing Vocoder Data Rates" Technical 18. S. Seneff "Real-Time Harmonic Pitch Detector," IEEE Report 556, Lincoln Laboratory (26 Feb. 1981), OTIC Trans. Acoust. Speech Signal Process. ASSP-26. 358 #AD-A-099395. (1978). 41. J.N. Holmes, ''The JSRU Channel Vocoder," lEE Proc. 19. J .W. Tukey, "Nonlinear (Nonsuperposable) Methods for Communications, Radar, and Signal Processing 127. Smoothing Data," Proc. EASCON '74, Washington, 7-9 Feb. 1980, Part F, No.1, p. 53. Oct. 1974, p. 673. 42. B. Gold, P.E. Blankenship, and RJ. McAulay, "New 20. M. Miller and M. Sachs, "Representation ofVoice Pitch Applications of Channel Vocoders," IEEE Trans. in Discharge Patterns ofAuditory-Nerve Fibers," Hear­ Acoust. Speech Signal Process. ASSP-20. 13 (1981). ingResearch 14. 257 (1984). 43. D.B. Paul, "TheSpectralEnvelopeEstimationVocoder," 21. G. Fant. Acoustic Theory ojSpeechProduction (Mouton IEEE Trans. Acoust. Speech Signal Process. ASSP-29. & Co., The Hague, 1960). 786(1981). 22. L. Rabiner and R Schafer, DigitalProcessingojSpeech 44. D.B. Paul, "An 800 BPS Adaptive Vector Quantization Signals (Prentice-Hall, Englewood Cliffs, NJ, 1978). Vocoder Using a Perceptual Distance Measure," Proc. 23. G. Fant,J. Mcirtony, U. Rengman, and A Risberg, ''The ICASSP '831. Boston, 14-16Apr. 1983, p. 73. OVE II Speech Synthesizer," Speech Communications 45. B. Gold, "Experiments with a Pattern-Matching Chan­ Seminar, , Sept. 1963. nel Vocoder," Proc. ICASSP '81, Atlanta, 30Mar.-1 Apr. 24. B. Gold andJ. Tierney, "Pitch-Induced Spectral Distor­ 1981, p. 32. tion in Channel Vocoders," J. Acoust. Soc. Am. 35. 730 46. J.L. Flanagan, "Noteon the Design of'Terminal-Analog' (L) (1963). Speech Synthesizers," J. Acoust. Soc. Am. 29. 306 25. G. von Bekesy, Experiments in Hearing (McGraw-Hili, (1957). New York, 1960). 47. J.N. Holmes, "Parallel Formant Vocoders," IEEE EAS­ 26. N.Y.S. Kiang, "DischargePatternsofSingleFibers in the CON '78 Washington, 25-27 Sept. 1978, p. 713. Cat's Auditory Nerve," Research Monograph No. 35 48. RJ. McAulay, "A Low-RateVocoderBased on anAdap­ (MIT Press, Cambridge. MA., 1965). tive Subband Formant Analysis," Proc. ICASSP '81, 27. B. Gold and C.M. Rader, Digital Processing ojSignals Atlanta, 30 Mar.-1 Apr. 1981, p. 28. (McGraw-Hili, New York, 1969). 49. G.S. Kangand D.C. Coulter, "600 BPSVoice Digitizer,"

The Lincoln Laboratory Journal. Volume 3. Number 2 (1990) 201 Gold - A History ofVocoderResearch atLincoln Laboratory

?roc. ICASSP '76, Philadelphia, 12-14Apr. 1976, p. 91. 66. P.E. Blankenship, "LDVT: High Performance Minicom­ 50. J.W. Forgie and C.D. Forgie, "Results Obtained from a puterfor Real-Time Speech Processing,"?roc. EASCON, Vowel Recognition Computer Program," J. Acoust. Soc. Washington, 29 Sept.-1 Oct. 1975, p. 214A. Am 31. 1480 (1959). 67. D. James, personal communication, 1974. 51. J.W. Forgie and C.D. Forgie, "A Computer Program for 68. J. Tierney, personal communication, 1975. Recognizing the English Fricative Consonants I fI and 69. D. Chapman, 'The LDSP Signal Conditioner," Lincoln I eI ," Fourth Int!. Congress on Acoustics. Copenhagen, Laboratory Memo (1978). Aug. 1962. 70. J .W. Forgie, "SpeechTransmission in Packet Switched 52. J .W. Forgie, personal communication. Store and Forward Networks," ?roc. National Computer 53. J .M. Frankovich and H.P. Peterson, "A Functional De­ Corif'. 1975. p. 137. scription ofthe LincolnTX-2 Computer," ?roc. Westem 71. T. Bially, B. Gold, and S. Seneff, "A Technique for Joint Computer Conj, Los Angeles, 26-28 Feb. 1957, Adaptive Voice Flow Control in Integrated Packet p.146. Networks," IEEETrans. onComm COM-28. 325 (1980). 54. J.W. Forgie, 'The LincolnTX-2 Input-Output System." 72. T.F. Tremain, "The Government Standard Linear Pre­ ?roc. WestemJointComputerConj, LosAngeles, 26-28 dictive Coding Algorithm LPC-lO," Speech Technology Feb. 1957, p. 156. 1.40 (1982). 55. W.A. Clark, "The Lincoln TX-2 Computer 73. E.M. Hofstetter, E. Singer, andJ. Tierney, "A Program­ Development," Proc. WestemJoint ComputerConj, Los mable Voice Processor for Fighter Aircraft Angeles, 26-28 Feb. 1957, p. 143. Applications," TechnicalReport653, Lincoln Laboratory 56. K.S. Redmond and T.M. Smith, ?rojectWhirlwind: The (18Aug. 1983), DTIC #AD-A-133780. History ofa Pioneer Computer (Digital Press, Bedford, 74. E. Singer and J. Tierney, "Implementation ofa Robust MA,1980). 2400 bls LPC AlgOrithm for Operation in Noise 57. P. Bagley, "MemoryTestComputerProgrammingRefer­ Environments," TechnicalReport 766, ESD-TR-86-166, ence Manual," Memo 6M-2527, Lincoln Laboratory (25 Lincoln Laboratory (1 Apr. 1987), DTIC #AD-A-180644. Nov. 1953), revised 9 May 1955. 75. G. Neben, RJ. McAulay, and C.J. Weinstein, "Experi­ 58. System Manager's Manual for the Sun Workstation, ments in Isolated Word Recognition Using Noisy Models 100UI150U (Sun Microsystems, Inc., CA, Speech." ?roc. ICASSP '83 3. Boston, 14-16Apr. 1983, 1983). p.1156. 59. V. Zue, personal communication, 1983. 76. B. Gold, J. Lynch, and J. Tierney, "Vocoded Speech 60. M. Mathews, "The Effective Use ofDigitalSimulationfor through FadingChannels," ?roc. ICASSP '831. Boston, SpeechProcesSing," ?roc. Sem on SpeechCompression 14-16Apr. 1983, p. 101. and Processing, AirForce Cambridge Research Center, 77. RJ. McAulay and T.F. Quatieri, "Speech Analysis­ Sept. 1959. SyntheSiS Based on a Sinusoidal Representation," 61. E.M. Hofstetter, personal communication, 1971. Technical Report 693, Lincoln Laboratory (17 May 62. T. Bially, personal communication, 1971. 1985), DTIC #AD-A-157023. 63. P. Demko and J. Tierney, personal communication, 78. T.F. Quatieri and RJ. McAulay, "Speech Transforma­ 1971. tions Based ona Sinusoidal Representation," Technical 64. AJ. McLaughlin and P.E. Blankenship, personal com­ Report 717, Lincoln Laboratory (16 May 1986), DTIC munication, 1975. #AD-A-169740. 65. E.M. Hofstetter, P.E. Blankenship, M.L. Malpass, and 79. R J. McAulay, personal communication. S. Seneff, "Vocoder Implementations on the Lincoln 80. J. Tierney, B. Gold, V. Sferrino, J.A Dumanian, and E. Digital Voice Terminal," ?roc. EASCON, Washington, Aho, "ChannelVocoderwith Digital Pitch Extractor," J. 29 Sept.-1 Oct. 1975, p. 32A Acoust. Soc. Am 36,1901 (1964).

BERNARD GOLD is a senior staffmemberoftheMachine Intelligence Technology Group, where he specializes in speech processing. Before joining Lincoln Laboratory 37 years ago, Ben worked for Hughes Aircraft Co. in Los Angeles. He received the follOwing degrees in electrical engineering: a bachelor's from the City College ofNew York, and a master's and Ph.D. from Brooklyn Polytechnic Insti­ tute. He is a fellow oftheASA and IEEE, and a memberofthe National Academy ofEngineering.

202 The Lincoln Laboratory Journal. Volume 3. Number 2 {l990j