[dsp HISTORY] James L. Flanagan

Parametric Representation of Speech Signals

EDITOR'S INTRODUCTION

Our guest in this column is Dr. James L. Flanagan. Dr. Flanagan holds the doctor of science degree in electrical engineering from the Massachusetts Institute of Technology (MIT), the master of science degree from MIT, and the bachelor of science degree from Mississippi State University. Dr. Flanagan is Professor Emeritus at Rutgers University, serving earlier as director of the Rutgers Center for Advanced Information Processing and Board of Governors Professor of Electrical and Computer Engineering. He was Rutgers University's vice president for research until retirement in 2005. Dr. Flanagan spent 33 years at Bell Laboratories before joining Rutgers University. At Bell Laboratories he led Acoustics Research and later served as director of Information Principles Research.

Over the course of his impressive career, Dr. Flanagan has had a long list of inventions and contributions to the signal processing field in several areas including psychoacoustics, array microphone processing, and digital loudspeakers. Most notably, many of his pioneering achievements were reduced to practice with an impact on our current daily lives including speech coding in MP3 and [...]. Dr. Flanagan has published approximately 200 technical papers in scientific journals. He is the author of a research text, Speech Analysis, Synthesis and Perception (Springer Verlag), which has appeared in five printings and two editions, and has been translated into Russian. He holds 50 U.S. patents.

Dr. Flanagan is an IEEE Life Fellow, a long-time member of the IEEE Signal Processing Society, which he served as president in the earlier formative stages. Among his awards are the IEEE Medal of Honor (2005) and the U.S. National Medal of Science (1996), presented at the White House by the President of the United States. A special pride is the Signal Processing Society's creation and sponsorship of the IEEE James L. Flanagan Speech and Audio Processing Technical Field Award. He was chosen as the 2005 recipient of the Research and Development Council of New Jersey's Science/Technology Medal. Dr. Flanagan is a member of the National Academy of Engineering and the National Academy of Sciences.

In the past, Dr. Flanagan has enjoyed deep-sea fishing, swimming, sailing, hiking, and flying as an instrument-rated pilot. He currently lives in New Jersey with his wife, Mildred, and they have three sons, all married and with families.

In October 2009, the Marconi Foundation in Italy combined with the Marconi Society based at Columbia University celebrated the centennial of the Nobel Prize to Guglielmo Marconi for his contribution in advancing wireless telegraphy. The occasion, in Bologna, Italy, was also the platform for the 2009 Marconi Fellowship Award. A main part of the program was a technical symposium, which additionally was joined by the Italian Federation of Industry Leaders. Several Marconi Fellows were asked to make presentations in the symposium. Dr. Flanagan chose to talk about efficient digital speech communication, one area favored in his research at AT&T Bell Labs. Specifically, Dr. Flanagan offered a perspective that highlighted junctures from conventional analog telephony to ambitions for the future.

In this article, Dr. Flanagan gives a condensed summary of his Marconi presentation, devoted to parametric representation of speech signals. We have arranged for his audio demonstrations to be available at http://www.signalprocessingsociety.org/publications/periodicals/spm/columns-resources/, as well as in IEEE Xplore. Regarding the future of speech coding, Dr. Flanagan says "The future is certain to prove interesting!" I am confident that you, our readers, will find this column interesting and you will enjoy reading this perspective from a long-term innovator and expert in the signal processing field.

Ghassan AlRegib

Digital Object Identifier 10.1109/MSP.2010.936028
1053-5888/10/$26.00 © 2010 IEEE. IEEE Signal Processing Magazine, May 2010

Telephony was conceived as the electrical transmission of a facsimile of the sound pressure waveform radiated from a talker's mouth. A microphone performed the acoustic-to-electrical conversion, and a low-pass filter typically confined the signal to a bandwidth adequate for intelligibility, about 3,000 Hz. Electrical noise might intrude in transmission. When needed, electronic amplification strengthened the signal to compensate for its attenuation over distance. But accumulated noise would also be amplified along with the signal, hence the signal-to-noise ratio could diminish with transmission distance.

GENESIS

Even with these analog deficiencies, this principle has served voice communication, both by wire and by radio, for more than 100 years.

Despite the success and utility of this principle, it was recognized early that it was not efficient. Neural-activated vocal musculatures can exert only finite force, so the velocities and displacements of
the massive articulators are continuous functions of time. Further, the articulators change relatively slowly in producing a sequence of distinctive sounds, something at the rate of ten phonemes/s, not nearly at the rate of 3,000 cycles/s typical of telephone bandwidth.

BANDWIDTH CONSERVATION

An early step towards bandwidth saving was the cogent observation that the vocal sound source, and the intelligence modulated upon it by the resonant vocal system, were largely linearly separable functions (Figure 1). This raised the possibility for parametric description of the radiated signal more in terms of the slowly changing vocal motions. This notion led to the Bell Labs Vocoder [1], where a frequency-modulated pulse generator and a broad-spectrum noise generator could approximate vocal-cord vibration and turbulent frication, and the modulated intelligence could be approximated by values of the short-time amplitude spectrum taken at ten frequencies over the audible frequency range. Implicitly, this development suggested that while waveform facsimile transmission was sufficient, it was not necessary. Rather, perceptually, preservation of the short-time amplitude spectrum was central to speech intelligibility.

[FIG1] Source-resonator representation of the speech process. (Blocks: mental formulation of speech message; neuromuscular controls; articulatory motion; sound generation; sound source; resonators; acoustic speech output.)

The Vocoder was demonstrated in 1939. (And, a keyboard-operated version of the synthesizer, the Voder [2], provided a popular display at the New York World's Fair.) The time-varying parameters that described the source and resonant system occupied a bandwidth less than 300 Hz, or one-tenth that of the telephone channel. This was almost small enough to transmit speech over the transatlantic telegraph cable, laid in 1866! (The first transatlantic telephone cable had to await development of submersible amplifiers, and didn't become a reality until 1956.) But the parametric description was too coarse to provide good speech quality when synthesized at the receiver. Additionally, the analog parameters were susceptible to noise interference. The issues of how to compress speech bandwidth and resist interference continued to command attention.

QUALITY INDEPENDENT OF DISTANCE

Resistance to analog noise was dramatically impacted by expanded understanding of sampled-data theory and by the advent of digital technology. An initial step, pulse code modulation (PCM), was simply the conversion of the 3 kHz sound waveform into digital form (Figure 2). This entailed sampling a band-limited signal, quantizing the amplitude samples, and converting the quantized values into time-framed binary "words" by an encoder. Any noise accumulated in transmission could be "stripped away" by detecting the binary pulses and regenerating them before they were overwhelmed by interference. At the receiver, the binary words were decoded, converted to pulse amplitudes, and low-pass filtered to recover the original signal (along with quantizing noise, which could be made negligible with enough steps in the quantizer, or enough binary digits, i.e., bits per word).

[FIG2] An example of PCM. (Transmitter: talker, microphone, low-pass filter, sampler with interval T, quantizer, encoder. Channel: binary words with added noise. Receiver: regenerator R, decoder, low-pass filter, speaker, listener.)

Although conceived by Rainey in 1926 and rediscovered independently by Reeves in 1937 [3], PCM had to await electronic progress. The first commercial deployment was in 1962, when Illinois Bell introduced the T1 carrier, employing 8 kHz sampling and 8-bit log-amplitude quantization. This process was still a waveform transmission system. But it gave the world noise-free telephonic transmission whose quality was essentially independent of the transmission distance.

DIFFERENTIAL CODING

Continued progress aimed to exploit the slowly changing nature of the speech
signal and its low-pass character. (The ratio of the frequency of the upper band edge to the centroid of the speech spectrum is about six, with the bulk of spectral energy in the lower frequencies, by virtue of the characteristics of vocal sound generation and radiation.) Adjacent sample values of the waveform are consequently similar. Differential PCM (DPCM) (Figure 3) was therefore proposed, whereby, at the transmitter, a local estimate of each signal sample is made based upon correlation statistics of past values [4], [5]. This parameterized estimate, or prediction, is subtracted from the input signal. If the estimate is good, the difference signal is greatly reduced in power and requires fewer bits of quantization. After transmission, with regeneration and then decoding to pulse amplitude form, the signal is recovered by an accumulator with the same predictor, and finally desampled by a low-pass filter.

[FIG3] DPCM: open-loop quantizing. (Transmitter: low-pass filter, sampler, predictor P with coefficients $a_1 \ldots a_k$ subtracted from the input, quantizer, encoder. Receiver: regenerator, decoder, accumulator with the same predictor P, low-pass filter.)

The nature of the typical predictor is a weighted linear sum of some number, $k$, of past samples. This essentially is a transversal filter, whose time domain impulse response is the sum of weighted delta functions of delay equal to $k$ times the sampling interval $T$. The weights are the coefficients $\{a_k\}$. In the sampled-data frequency domain, the filter response, $P(z)$, is the sum over $k$ of the product of the predictor coefficients and their corresponding delay operator $z^{-k}$. The transmitter operates on the spectral input as $[1 - P(z)]$ and the receiver operates on the difference spectrum as $1/[1 - P(z)]$ which, in the absence of quantizing, exactly recovers the input.

A body of mathematics provides a closed-form computation of the predictor coefficients, $\{a_k\}$, that minimizes the power of the difference signal [6]. The computation requires inversion of a matrix of correlation values, and hence comprises the main processing requirement. If long-term statistics are used, the coefficients of the predictor can be fixed at both the transmitter and receiver, and only the difference signal is transmitted. Its power is reduced by about 10 dB (Figure 4). If short-term statistics are used to make the predictor adaptive (and hence achieve better prediction), the coefficients are typically computed every 20–30 ms. The time-varying coefficients are then separately transmitted to the receiver along with the difference signal. Refinements in prediction techniques permit encoding delays significantly shorter than 20–30 ms. For predictors of order greater than about six, adaptive prediction reduces the power of the difference signal by about another 4 dB, requiring still fewer bits for quantization [7].

[FIG4] Prediction gain as a function of predictor order [7]. (Vertical axis: prediction gain in dB, 0 to 14. Horizontal axis: number of coefficients, 0 to 20. Curves: adaptive predictor, fixed predictor.)

Commonly, traditional DPCM employs fixed predictors that accommodate the low-pass nature of the speech spectrum. Linear predictive coding (LPC) additionally transmits time-varying predictor coefficients that follow the slow changes in the amplitude spectrum of the signal [8], [9]. The difference signal retains much of the characteristics of the vocal sound source.

While shown in Figure 3 as open loop and fixed quantizing at the transmitter, there are advantages to closed-loop quantization (Figure 5). In this arrangement, the predicted signal derives from the quantized difference, and is the same as that generated at the receiver. Quantizing noise at transmitter and receiver are the same and do not accumulate [10]. (One can confirm that the closed-loop transmitter operation, in absence of quantizing, is $[1 - P(z)]$, as for the open-loop case.)

[FIG5] DPCM: closed-loop quantizing. (The quantizer $Q[\cdot]$ sits inside the prediction loop, and the transmitter's predictor $P(z)$ with coefficients $a_1 \ldots a_k$ is fed the same signal as the receiver's; without quantizing, $D(z) = [1 - P(z)]S(z)$.)

Further, arranging for the
quantizer step size to be adaptive in time additionally reduces the number of bits required in the quantizer. This involves specifying the desired number of bits and simple processing logic to constantly examine the code words issued from the encoder, and expand or contract the step size in accordance with whether the signal is consistently occupying the highest or lowest quantal value. In the absence of transmission error, the receiver "sees" the same code words and has the same logic to modify step size for digital-to-analog recovery [11].

Linear prediction can be extended to characterize the vocal sound source as well as the resonator spectrum [12]. This parameterization of vocal excitation allows further reduction in transmission rate, especially into the Vocoder range. Various approaches have been established for this, embracing pitch and voiced/unvoiced estimation at one extreme, and "code books" of excitation at the other. If employed, parameters for these analyses must be additionally transmitted to the receiver. The price of this progressive reduction in transmission rate is increased complexity.

At this time, ITU standards have been promulgated for a range of coding rates from 64 kb/s down to the cell phone speed of 8 kb/s, based upon differential predictive coding. Military standards have been established for the low rates of 4.8 and 2.4 kb/s. The latter are greatly refined and more complex derivatives of the original concept of the Vocoder. Complexity and cost are essentially in inverse relation to the coding rate. If complexity is expressed in millions of instructions per second (MIPS), processing grows from a fractional MIP for 64 kb/s PCM up to the order of 100 MIPS for 2.4 kb/s Vocoders. For error-free transmission, good quality and talker recognition can be maintained by increased complexity down to cell phone speeds of about 8 kb/s (requiring about 20 MIPS). In the Vocoder range, even with greater complexity, some degradation in quality and talker recognition typically remains evident.

RESEARCH OUTLOOK

What is the outlook for ultimately achieving good performance into the very low coding rates? The most refined present-day Vocoders at 2.4 kb/s provide useful intelligibility, but do not achieve good talker recognition. Is there a fundamental limit to the minimum information rates that meet the joint objectives of speech intelligibility and high quality? Informal experiments have indeed demonstrated an "existence proof" of transparency (where the synthesis is indistinguishable from the natural input) for coding at rates as low as 2,000 b/s. But this compression has required lengthy and laborious human intervention in the analysis. No practical solution yet exists for this goal.

If interest focuses solely upon the written equivalent of the information in a speech signal, a greater gap is evident. For a given language, the number of distinctive speech sounds is of the order of 32–64. Typically, about ten phonemes are uttered per second, corresponding to an information rate of only 50–60 b/s. Some perceptual experiments suggest that the human is capable of processing and making decisions on pattern features at information rates only of the order of 100 b/s [13], [14]. This seems incredibly low, but the task of making as many as 100 yes/no decisions per second could indeed be taxing.

These observations provoke attempts to look at information representation higher up the chain of acoustic, muscular, and neural processes. That is, not focus on the radiated sound, but on the factors that produce the sound. A small step in this direction is to examine speech-sound generation from first principles of fluid flow, incorporating the physiology and dynamic constraints of the vocal mechanism (Figure 6). Here, the control factors relate to the subglottal air reservoir, the vocal-cord source, turbulence generation, and the time-varying vocal resonator system. With enough computation, controls for this formulation can be sought by having a physiological model "mimic" a continuous speech input. Control parameters can be computed by gradient descent to minimize the difference between the short-time amplitude spectra of the original speech and that of the "mimic." In simplest form, excitation information is separately measured [15].

[FIG6] Components of the "speech mimic" system. (Blocks: lung reservoir, vocal-cord oscillator, vocal tract, nasal tract, mouth and nostril sound radiation. Controls: rib cage muscle force, cord tension, cord position, shape coefficients. At the transmitter, parameters are optimized for least squares fit to the short-time amplitude spectrum of the input.)

ARTICULATORY REPRESENTATION

A main intelligence-bearing component here is the shape of the vocal tract, which can be parameterized [16]. Some of the inherent constraints and dynamics can be incorporated into a model of the sagittal plane cross-sectional area of the vocal conduit. A model of the whole system allows computation of sound for one-dimensional wave propagation in the conduit, when it is excited by nonlinear valving of air flow by the vocal cords and turbulence generation at
positions where the Reynolds number exceeds a critical value. The acoustic volume velocities can be obtained for the glottal excitation and for the mouth and nostril radiation. The output volume currents act through radiation impedances and encounter atmospheric pressure. The sound pressure in front of the speaker's mouth is determined as the superposition of the radiation of pistons (mouth and nostril) set in a spherical baffle (the head). All controls are related to the dynamic physiology. Initial implementations are exceedingly primitive. But, by appealing to an articulatory domain for parameterization, we are able to focus on the speech-producing mechanism, rather than on the sound output itself.

Preliminary experiments with such a "mimic" suggest that information rates in the range of 1,000–2,000 b/s may preserve quality and personal characteristics. But so far, deep studies of the fluid flow approach have not been made (hampered in part by the fact that even a "stripped down" model runs over 100 times real time on a mainframe computer, mainly to compute solutions of the Navier-Stokes fluid flow equations).

Because this discussion has focused primarily on applied commercial voice transmission, it has not touched on a variety of related topics that partake of common fundamental components. Automatic speech recognition (What was said?) and talker verification (Who said it?) are cases in point, and are being brought into commercial telecom services. The continuing interests in formant analysis/synthesis seek automatic extraction of the time-varying eigenfrequencies of the vocal system. These contribute the prominent maxima in the short-time amplitude spectrum and, perceptually, promise even more parsimonious description of speech information. All these factors underlie the transmission techniques emphasized here.

EPILOGUE

So, practical speech compression has advanced a distance, but likely has not reached the limit of efficiency. Implied is even the possibility for obtaining fundamental speech coding parameters at the neural level. That is, just think what you want to say! And, there are ambitious studies commencing in this sector. The future is certain to prove interesting!

ACKNOWLEDGMENTS

This review is an abbreviated form of a presentation to the Marconi Foundation Symposium honoring the centennial of G. Marconi's Nobel Prize for radio telegraphy in Bologna, Italy, 9 October 2009. I am indebted to Prof. Lawrence Rabiner, Dr. Richard Cox, Dr. Joseph Hall, and Ann-Marie Flanagan for their advice and assistance in preparing this article.

AUTHOR

James L. Flanagan ([email protected]) is a Professor Emeritus at Rutgers University.

REFERENCES

[1] H. Dudley, "The vocoder," Bell Labs Rec., vol. 17, pp. 122–126, 1939.
[2] H. Dudley, R. Riesz, and S. Watkins, "A synthetic speaker," J. Franklin Inst., vol. 227, pp. 739–764, 1939.
[3] E. O'Neill, Ed., in A History of Engineering and Science in the Bell System: Transmission Technology (1925–1975). AT&T Bell Laboratories, 1985, ch. 18, p. 527.
[4] C. Cutler, "Differential quantization of communications," U.S. Patent 2 605 361, July 1952.
[5] F. de Jager, "Delta modulation, a method of PCM transmission using a 1-unit code," Philips Res. Rep., vol. 7, pp. 442–466, 1952.
[6] P. Elias, "Predictive coding," IRE Trans. Inform. Theory, vol. IT-1, pp. 16–33, 1955.
[7] P. Noll, "A comparative study of various schemes for speech encoding," Bell Syst. Tech. J., vol. 54, pp. 1597–1611, 1975.
[8] B. Atal and S. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, pp. 637–655, 1971.
[9] F. Itakura and S. Saito, "An analysis-synthesis telephony based on maximum likelihood method," in Proc. Int. Congr. Acoustics, Tokyo, Japan, 1968, Paper C-5-5.
[10] R. McDonald, "Signal-to-noise and idle channel performance of differential pulse code modulation systems," Bell Syst. Tech. J., vol. 45, pp. 1123–1151, 1966.
[11] P. Cummiskey, N. Jayant, and J. Flanagan, "Adaptive quantization in differential PCM coding of speech," Bell Syst. Tech. J., vol. 52, pp. 1105–1118, 1973.
[12] B. Atal and M. Schroeder, "Predictive coding of speech signals," in Proc. Int. Congr. Acoustics, Tokyo, Japan, 1968, Paper C-5-4.
[13] J. Pierce and J. Karlin, "Information rate of the human channel," Proc. IRE, vol. 45, p. 368, 1957.
[14] W. Keidel, "Information processing by sensory modalities in man," in Cybernetic Problems in Bionics, H. Oestreicher and D. Moore, Eds. New York: Gordon and Breach, 1968, pp. 277–300.
[15] J. Flanagan, K. Ishizaka, and K. Shipley, "Signal models for low bit-rate coding of speech," J. Acoust. Soc. Amer., vol. 68, pp. 780–791, 1980.
[16] C. Coker, "Speech synthesis with a parametric articulatory model," in Proc. Kyoto Speech Symp., Kyoto, Japan, 1968, pp. A-4-1–A-4-6.
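
As an illustration of the closed-loop DPCM arrangement discussed under DIFFERENTIAL CODING (Figure 5), the following Python sketch may help. It is a minimal example under stated assumptions, not the coder of any cited system: the first-order predictor coefficient 0.9, the uniform quantizer with step 0.02, the 200 Hz test tone standing in for speech, and all function names are hypothetical choices made only for this example.

```python
import math


def quantize(value, step):
    """Uniform quantizer: snap value to the nearest multiple of step."""
    return step * round(value / step)


def dpcm_encode(samples, coeffs, step):
    """Closed-loop DPCM: the prediction uses past *reconstructed* samples,
    i.e. exactly what the receiver will also have available."""
    history = [0.0] * len(coeffs)          # most recent reconstruction first
    quantized_diffs = []
    for s in samples:
        prediction = sum(a * h for a, h in zip(coeffs, history))
        d_hat = quantize(s - prediction, step)
        quantized_diffs.append(d_hat)
        reconstructed = prediction + d_hat
        history = [reconstructed] + history[:-1]
    return quantized_diffs


def dpcm_decode(quantized_diffs, coeffs):
    """Receiver: accumulate the difference signal with the same predictor."""
    history = [0.0] * len(coeffs)
    output = []
    for d_hat in quantized_diffs:
        prediction = sum(a * h for a, h in zip(coeffs, history))
        reconstructed = prediction + d_hat
        output.append(reconstructed)
        history = [reconstructed] + history[:-1]
    return output


# Hypothetical test signal: a 200 Hz tone sampled at 8 kHz (not real speech).
SAMPLE_RATE = 8000
STEP = 0.02
COEFFS = [0.9]                              # assumed fixed first-order predictor
signal = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]

diffs = dpcm_encode(signal, COEFFS, STEP)
recovered = dpcm_decode(diffs, COEFFS)

max_error = max(abs(s - r) for s, r in zip(signal, recovered))
signal_power = sum(s * s for s in signal) / len(signal)
diff_power = sum(d * d for d in diffs) / len(diffs)
gain_db = 10 * math.log10(signal_power / diff_power)

print(f"largest reconstruction error: {max_error:.4f} (half step = {STEP / 2})")
print(f"difference signal is {gain_db:.1f} dB below the input in power")
```

Because the predictor sits inside the quantizing loop, the reconstruction error in this sketch stays within half a quantizer step at every sample instead of building up over time, which is the non-accumulation property the text attributes to closed-loop quantization.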
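
The closed-form computation of predictor coefficients from correlation values, mentioned under DIFFERENTIAL CODING, can be sketched as follows. The article speaks only of inverting a matrix of correlation values; the sketch substitutes the Levinson-Durbin recursion, a widely used solver for such Toeplitz systems, and does not claim this is the procedure of [6]. The function names and the toy autocorrelation sequence are assumptions made for illustration.

```python
import math


def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame of samples."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor s_hat[n] = sum a_i s[n-i].

    Returns (coefficients a_1..a_order, residual power of the difference signal).
    """
    a = []                                  # a[i] holds a_(i+1)
    error = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / error                     # reflection coefficient for order m
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        error *= 1.0 - k * k
    return a, error


def prediction_gain_db(r, error):
    """Power of the input relative to the power of the difference signal."""
    return 10 * math.log10(r[0] / error)


# Toy autocorrelation of a first-order low-pass process (assumed values).
toy_r = [1.0, 0.9, 0.81]
coeffs, residual_power = levinson_durbin(toy_r, order=2)
gain = prediction_gain_db(toy_r, residual_power)

print("predictor coefficients:", [round(c, 6) for c in coeffs])
print(f"prediction gain: {gain:.2f} dB")
```

For an adaptive predictor, `autocorrelation` would be applied to each 20–30 ms frame (160–240 samples at 8 kHz) before calling `levinson_durbin`; a fixed predictor would instead use long-term correlation values computed once.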
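
The adaptive step-size logic described in the text (expand the step when the signal sits in the highest quantal value, contract it when it sits in the lowest) can be illustrated with a small sketch. It is applied here to plain PCM samples rather than to a DPCM difference signal, to keep the example short. The multipliers 1.5 and 0.8, the step limits, the 3-bit word size, and the function names are hypothetical values chosen for this example and are not taken from [11].

```python
def adapt_step(step, level, num_levels, grow=1.5, shrink=0.8,
               min_step=1e-4, max_step=1.0):
    """Update the step size from the code word just issued (assumed rule)."""
    if level == num_levels - 1:      # outermost level: signal is overloading
        step *= grow
    elif level == 0:                 # innermost level: step is too coarse
        step *= shrink
    return min(max(step, min_step), max_step)


def encode(samples, bits=3, initial_step=0.01):
    """Quantize each sample to (sign, magnitude level) with an adaptive step."""
    num_levels = 2 ** (bits - 1)     # magnitude levels per polarity
    step = initial_step
    codes = []
    for s in samples:
        sign = -1 if s < 0 else 1
        level = min(int(abs(s) / step), num_levels - 1)
        codes.append((sign, level))
        step = adapt_step(step, level, num_levels)
    return codes


def decode(codes, bits=3, initial_step=0.01):
    """Receiver: sees only the code words and repeats the same step logic."""
    num_levels = 2 ** (bits - 1)
    step = initial_step
    output = []
    for sign, level in codes:
        output.append(sign * (level + 0.5) * step)   # mid-point of the cell
        step = adapt_step(step, level, num_levels)
    return output


# A constant input far larger than the initial step forces the step to grow.
samples = [0.5] * 20
codes = encode(samples)
decoded = decode(codes)

print("first decoded value:", decoded[0])
print("last decoded value: ", decoded[-1])
```

No step-size information is transmitted in this sketch: the decoder derives every step from the code words alone, which mirrors the point in the text that, absent transmission errors, the receiver can apply the same logic as the transmitter.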
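
Several of the rates quoted in the article follow from simple arithmetic, which the short sketch below reproduces. It assumes, for the phoneme estimate, that all phonemes are equally likely and independent, an idealization the article does not state explicitly; the function names are likewise introduced only for this example.

```python
import math


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a PCM channel in bits per second."""
    return sample_rate_hz * bits_per_sample


def phoneme_information_rate(phonemes_per_second, inventory_size):
    """Bits per second if every phoneme in the inventory is equally likely."""
    return phonemes_per_second * math.log2(inventory_size)


t1_rate = pcm_bit_rate(8000, 8)                      # 8 kHz sampling, 8-bit words
low_rate = phoneme_information_rate(10, 32)          # 32 distinctive sounds
high_rate = phoneme_information_rate(10, 64)         # 64 distinctive sounds
vocoder_ratio = t1_rate / 2400                       # PCM versus a 2.4 kb/s Vocoder

print(f"T1-style PCM channel: {t1_rate} b/s")
print(f"written-equivalent rate: {low_rate:.0f} to {high_rate:.0f} b/s")
print(f"64 kb/s PCM carries about {vocoder_ratio:.1f} times the bits of a 2.4 kb/s Vocoder")
```

The first figure matches the 64 kb/s PCM rate cited for the ITU standards, and the second matches the 50–60 b/s range given for ten phonemes per second drawn from 32–64 distinctive sounds.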

[from the GUEST EDITORS] continued from page 19

an important role in system design. The article by Zhang et al. gives a broad overview of the spectrum sharing approach for cognitive radio networks and describes in detail various convex optimization formulations and solutions for the design of cognitive radio systems.

Finally, the article by Jiang and Li focuses on the applications of convex optimization to discriminative training in speech and language processing. For many widely used statistical models, discriminative training for speech processing normally leads to nonconvex optimization problems. This article shows how convex relaxation techniques (such as linear programming relaxation or SDR) can be used in this context.

In closing, we would like to thank all of our colleagues who have contributed to this special issue, including the authors of submitted papers. We also thank the reviewers for their quality work, and the editorial board for their support, without which this special issue would not have been possible.
