Parametric Representation of Speech Signals
dsp HISTORY
James L. Flanagan

Digital Object Identifier 10.1109/MSP.2010.936028
1053-5888/10/$26.00 ©2010 IEEE. IEEE SIGNAL PROCESSING MAGAZINE, MAY 2010

EDITOR’S INTRODUCTION
Our guest in this column is Dr. James L. Flanagan. Dr. Flanagan holds the doctor of science degree in electrical engineering from the Massachusetts Institute of Technology (MIT), the master of science degree from MIT, and the bachelor of science degree from Mississippi State University. Dr. Flanagan is Professor Emeritus at Rutgers University, serving earlier as director of the Rutgers Center for Advanced Information Processing and Board of Governors Professor of Electrical and Computer Engineering. He was Rutgers University’s vice president for research until retirement in 2005. Dr. Flanagan spent 33 years at Bell Laboratories before joining Rutgers University. At Bell Labs he led Acoustics Research and later served as director of Information Principles Research. Over the course of his impressive career, Dr. Flanagan has had a long list of inventions and contributions to the signal processing field in several areas including psychoacoustics, array microphone processing, and digital loudspeakers. Most notably, many of his pioneering achievements were reduced to practice with an impact on our current daily lives including speech coding in MP3 and speech recognition. Dr. Flanagan has published approximately 200 technical papers in scientific journals. He is the author of a research text Speech Analysis, Synthesis and Perception (Springer Verlag), which has appeared in five printings and two editions, and has been translated in Russian. He holds 50 U.S. patents.

Dr. Flanagan is an IEEE Life Fellow, a long-time member of the Signal Processing Society, which he served as president in the earlier formative stages. Among his awards are the IEEE Medal of Honor (2005) and the U.S. National Medal of Science (1996), presented at the White House by the President of the United States. A special pride is the Signal Processing Society’s creation and sponsorship of the IEEE James L. Flanagan Speech and Audio Processing Technical Field Award. He was chosen as the 2005 recipient of the Research and Development Council of New Jersey’s Science/Technology Medal. Dr. Flanagan is a member of the National Academy of Engineering and the National Academy of Sciences.

In the past, Dr. Flanagan has enjoyed deep-sea fishing, swimming, sailing, hiking, and flying as an instrument-rated pilot. He currently lives in New Jersey with his wife, Mildred, and they have three sons, all married and with families.

In October 2009, the Marconi Foundation in Italy combined with the Marconi Society based at Columbia University celebrated the centennial of the Nobel Prize to Guglielmo Marconi for his contribution in advancing wireless telegraphy. The occasion, in Bologna, Italy, was also the platform for the 2009 Marconi Fellowship Award. A main part of the program was a technical symposium, which additionally was joined by the Italian Federation of Industry Leaders. Several Marconi Fellows were asked to make presentations in the symposium. Dr. Flanagan chose to talk about efficient digital speech communication, one area favored in his research at AT&T Bell Labs. Specifically, Dr. Flanagan offered a perspective that highlighted junctures from conventional analog telephony to ambitions for the future.

In this article, Dr. Flanagan gives a condensed summary of his Marconi presentation, devoted to parametric representation of speech signals. We have arranged for his audio demonstrations to be available at http://www.signalprocessingsociety.org/publications/periodicals/spm/columns-resources/, as well as in IEEE Xplore. Regarding the future of speech coding, Dr. Flanagan says “The future is certain to prove interesting!” I am confident that you, our readers, will find this column interesting and you will enjoy reading this perspective from a long-term innovator and expert in the signal processing field.

Ghassan AlRegib

Telephony was conceived as the electrical transmission of a facsimile of the sound pressure waveform radiated from a talker’s mouth. A microphone performed the acoustic to electrical conversion, and a low-pass filter typically confined the signal to a bandwidth adequate for intelligibility, about 3,000 Hz. Electrical noise might intrude in transmission. When needed, electronic amplification strengthened the signal to compensate for its attenuation over distance. But, accumulated noise would also be amplified along with the signal, hence signal-to-noise ratio could diminish with transmission distance.

GENESIS
Even with these analog deficiencies, this principle has served voice communication, both by wire and by radio, for more than 100 years.

Despite the success and utility of this principle, it was recognized early that it was not efficient. Neural-activated vocal musculatures can exert only finite force, so the velocities and displacements of the massive articulators are continuous functions of time. Further, the articulators change relatively slowly in producing a sequence of distinctive sounds, something at the rate of ten phonemes/s, not nearly at the rate of 3,000 cycles/s, typical of telephone bandwidth.

BANDWIDTH CONSERVATION
An early step towards bandwidth saving was the cogent observation that the vocal sound source, and the intelligence modulated upon it by the resonant vocal system, were largely linearly separable functions (Figure 1). This raised the possibility for parametric description of the radiated signal more in terms of the slowly changing vocal motions. This notion led to the Bell Labs Vocoder [1], where a frequency-modulated pulse generator and a broad-spectrum noise generator could approximate vocal-cord vibration and turbulent frication, and the modulated intelligence could be approximated by values of the short-time amplitude spectrum taken at ten frequencies over the audible frequency range. Implicitly, this development suggested that while waveform facsimile transmission was sufficient, it was not necessary. Rather, perceptually, preservation of the short-time amplitude spectrum was central to speech intelligibility.

[FIG1] Source-resonator representation of the speech process. (Block labels: mental formulation of speech message; neuromuscular controls; sound generation; articulatory motion; sound source; resonators; acoustic speech output.)

The Vocoder was demonstrated in 1939. (And, a keyboard-operated version of the synthesizer, the Voder [2], provided a popular display at the New York World’s Fair.) The time-varying parameters that described the source and resonant system occupied a bandwidth less than 300 Hz, or one-tenth that of the telephone channel. This was almost small enough to transmit speech over the transatlantic telegraph cable, laid in 1866! (The first transatlantic telephone cable had to await development of submersible amplifiers, and didn’t become a reality until 1956.) But the parametric description was too coarse to provide good speech quality when synthesized at the receiver. Additionally, the analog parameters were susceptible to noise interference. The issues of how to compress speech bandwidth and resist interference continued to command attention.

QUALITY INDEPENDENT OF DISTANCE
Resistance to analog noise was dramatically impacted by expanded understanding of sampled-data theory and by the advent of digital technology. An initial step, pulse code modulation (PCM), was simply the conversion of the 3 kHz sound waveform into digital form (Figure 2). This entailed sampling a band-limited signal, quantizing the amplitude samples, and converting the quantized values into time-framed binary “words” by an encoder. Any noise accumulated in transmission could be “stripped away” by detecting the binary pulses and regenerating them before they were overwhelmed by interference. At the receiver, the binary words were decoded, converted to pulse amplitudes, and low-pass filtered to recover the original signal (along with quantizing noise, which could be made negligible with enough steps in the quantizer, or enough binary digits, i.e., bits per word).

[FIG2] An example of PCM. (Block labels: talker; microphone; LP filter; sampler; quantizer; encoder; noise; regenerator; decoder; LP filter; speaker; listener.)

Although conceived by Rainey in 1926 and rediscovered independently by Reeves in 1937 [3], PCM had to await electronic progress. The first commercial deployment was in 1962, when Illinois Bell introduced the T1 carrier, employing 8 kHz sampling and 8-bit log-amplitude quantization. This process was still a waveform transmission system. But it gave the world noise-free telephonic transmission whose quality was essentially independent of the transmission distance.

DIFFERENTIAL CODING
Continued progress aimed to exploit the slowly changing nature of the speech signal and its low-pass character.
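As a brief aside on the PCM chain described earlier (quantize, encode into binary words, regenerate the pulses, decode), the following is a minimal Python sketch and not the T1 implementation. The 8 kHz sampling rate and 8-bit word length come from the text; everything else is an assumption made for illustration: the continuous mu-law curve with mu = 255 stands in for "log-amplitude quantization", amplitudes are normalized to [-1, 1], pulses are sent as 0/1 levels with a 0.5 decision threshold, and all function names are hypothetical.

```python
import math

FS = 8000            # samples per second (from the text)
BITS = 8             # bits per word (from the text)
LEVELS = 2 ** BITS   # 256 quantizer steps
MU = 255.0           # assumption: continuous mu-law curve, not the segmented tables

def compress(x):
    """Log-compress an amplitude in [-1, 1] onto [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode(x):
    """Quantize one sample to an integer code word in 0..255."""
    x = max(-1.0, min(1.0, x))                  # clip to the quantizer range
    y = compress(x)
    return min(LEVELS - 1, int((y + 1.0) / 2.0 * LEVELS))

def decode(code):
    """Map a code word to the amplitude at the centre of its quantizer step."""
    y = (code + 0.5) / LEVELS * 2.0 - 1.0
    return expand(y)

def to_bits(code):
    """Binary word, most significant bit first."""
    return [(code >> k) & 1 for k in range(BITS - 1, -1, -1)]

def from_bits(bits):
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def regenerate(pulse, threshold=0.5):
    """Decide a noisy pulse as 0 or 1, discarding the accumulated noise."""
    return 1 if pulse > threshold else 0

def transmit(code, noise):
    """Send a word as 0/1 pulses, add per-pulse noise, then regenerate."""
    pulses = [b + n for b, n in zip(to_bits(code), noise)]
    return from_bits([regenerate(p) for p in pulses])

def bit_rate():
    """Bits per second for one voice channel: samples/s times bits/sample."""
    return FS * BITS

if __name__ == "__main__":
    noise = [0.4, -0.4, 0.3, -0.3, 0.2, -0.2, 0.1, -0.1]  # stays below the threshold margin
    sample = 0.5
    received = transmit(encode(sample), noise)
    print("bit rate:", bit_rate(), "bits/s")
    print("sent", sample, "recovered", round(decode(received), 4))
```

In this sketch the only error left after regeneration is the quantizing error, which is the sense in which quality becomes independent of distance as long as the noise on each pulse stays below the decision margin. With the figures from the text, one channel needs 8,000 × 8 = 64,000 bits/s.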
(The ratio of the frequency of the upper band edge to the centroid