Linear Statistical Modeling of Speech and Its Applications -- Over 36 Year History of LPC

Total Pages: 16

File Type: PDF, Size: 1020 KB

Linear Statistical Modeling of Speech and its Applications -- Over 36 Year History of LPC --

Fumitada ITAKURA
Graduate School of Engineering, Nagoya University
Furo-cho 1, Chikusa-ku, Nagoya 464-8603, Japan
[email protected]

Abstract

Linear prediction of the speech signal is now common knowledge for students, engineers, and researchers interested in speech science and technology. The first introduction of the concept to the international speech community was in two presentations [6],[7] at the 6th ICA held in Tokyo in 1968. While Atal and Schroeder used the concept of linear prediction to encode the speech waveform efficiently, Itakura and Saito used the concept of maximum likelihood estimation of the speech spectrum to implement a new type of vocoder. Since then, these concepts have been investigated widely in order to achieve accurate speech analysis, efficient transmission of speech, and speech recognition. In this paper, the history of LPC is surveyed, based mainly on my personal experience.

1. Introduction

Homer Dudley is a foremost pioneer of the analysis and synthesis of speech. He invented the "Channel Vocoder" as early as 1939 at Bell Telephone Laboratories [1] and established the principles of the carrier nature of speech. Following in his footsteps, various speech analysis and synthesis methods have been proposed for the purpose of transmitting the speech signal as efficiently as possible. Most research has concentrated on finding feature parameters that express speech spectral characteristics efficiently. Among these methods, the most effective and successful one is based on the autoregressive, or all-pole, model of the speech signal. From 1966 through 1968 [2],[3],[4] we proposed the maximum likelihood estimation of the speech signal (MLES) and its application to the vocoder. However, the MLES approach itself was not very efficient because of its poor quantization characteristics. The PARCOR scheme was proposed in 1969 by F. Itakura, in which the all-pole model is specified by a set of parameters called PARCOR coefficients, or reflection coefficients. It is shown that a significant information reduction is accomplished through optimal non-linear quantization and bit allocation of these parameters. In 1975, the line spectral pair (LSP) representation of the all-pole model was proposed by F. Itakura; it is now widely used in many speech compression systems [14]. This paper describes the speech analysis and synthesis methods developed in the 1960's through the 1980's at the Electrical Communications Laboratories of the Nippon Telegraph and Telephone Public Corporation. These methods involve MLES, PARCOR, LSP and CSM speech analysis/synthesis.

2. All-pole speech production model based on an autoregressive (AR) stochastic process

Speech is a continuously time-varying process. By making reasonable assumptions, it is possible to develop linear models which are locally time-invariant for describing important speech events. A time series obtained by sampling a speech signal shows a significant autocorrelation between adjacent samples. The short-time autocorrelation function is related to the running spectrum, which plays the most important role in speech perception. Let $\{x[n],\ n \text{ integer}\}$ be a discrete-time speech waveform sampled every $\Delta T$ seconds. Assume that $\{x[n-i],\ i = 0, \ldots, p\}$ are $(p+1)$-dimensional random variables taken from a stochastic process which is stationary over a short interval such as 30 milliseconds. When $x[n]$ is a real sampled value, the following autoregressive (AR) relation is assumed:

$$\sum_{i=0}^{p} \alpha[i]\, x[n-i] = \sigma\, \varepsilon[n] \qquad (1)$$

where the $\alpha[i]$'s (with $\alpha[0] = 1$) are the AR or linear predictive coefficients, $\varepsilon[n]$ is the residual, and $\sigma$ is its rms value. An explicit representation of Eq. (1) in digital filter form is shown in Fig. 1, where $D$ is the unit time delay $\Delta T$.

Fig. 1. All-pole (AR) model of speech production.

The transfer function $H(z)$ equivalent to Eq. (1) is

$$H(z) = \frac{X(z)}{E(z)} = \frac{\sigma}{1 + \sum_{i=1}^{p} \alpha[i]\, z^{-i}} \qquad (2)$$

where $z = \exp(j\omega\Delta T)$, $(-\pi < \omega\Delta T < \pi)$.
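To make the model concrete, the following is a minimal sketch (not from the paper) of Eqs. (1)-(2) in Python with NumPy/SciPy: a stable all-pole denominator is built from illustrative formant-like pole positions, and a 30 ms frame of speech-like samples is produced by driving the filter with white Gaussian noise. The sampling rate, formant frequencies, pole radius and residual level are assumptions chosen only for illustration.

```python
# Minimal sketch of the all-pole production model of Eqs. (1)-(2); the
# sampling rate, formant frequencies, pole radius and sigma are illustrative.
import numpy as np
from scipy.signal import lfilter

fs = 8000.0                                   # sampling rate, 1 / Delta_T
formants = [500.0, 1500.0]                    # assumed formant frequencies (Hz)
poles = []
for f in formants:
    z_i = 0.97 * np.exp(2j * np.pi * f / fs)  # pole just inside the unit circle
    poles += [z_i, np.conj(z_i)]

# Coefficients of 1 + alpha[1] z^-1 + ... + alpha[p] z^-p from the chosen poles
a = np.real(np.poly(poles))
sigma = 0.05                                  # rms value of the residual

rng = np.random.default_rng(0)
eps = rng.standard_normal(int(0.030 * fs))    # 30 ms of unit-variance residual
x = lfilter([sigma], a, eps)                  # Eq. (1) realized as an AR filter
```

Filtering white noise through H(z) in this way corresponds to the random-excitation case assumed in Section 3; a periodic pulse train could be substituted for eps to mimic voiced excitation in a vocoder setting.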
3. Maximum likelihood estimation of speech spectrum

An approach to estimating the parameters in the frequency domain was first reported in [2]. It is assumed that the speech signal has the following characteristics:

(1) The speech production system can be represented by a transfer function with poles only.
(2) The speech signal is generated by exciting the system mentioned above with a random white Gaussian signal. The average value of the random signal is zero, and its variance is assumed to be $\sigma^2$.

Based on these assumptions, the parameters $\Theta = (\sigma^2, \alpha[1], \alpha[2], \ldots, \alpha[p])$ can be estimated from $X = (x[1], x[2], \ldots, x[N])$. The all-pole speech envelope is assumed to be

$$S_{AR}(\omega) = \frac{\sigma^2}{2\pi}\left|H(\exp(j\omega\Delta T))\right|^2
= \frac{\sigma^2}{2\pi}\,\frac{1}{\prod_{i=1}^{p}\left|1 - z_i z^{-1}\right|^2}
= \frac{\sigma^2}{2\pi}\,\frac{1}{\left|\sum_{i=0}^{p} \alpha[i]\exp(-j\omega i\Delta T)\right|^2}
= \frac{\sigma^2}{2\pi}\,\frac{1}{\sum_{i=-p}^{p} A[i]\cos(\omega i \Delta T)} \qquad (3)$$

where $z_i$ represents the roots of the polynomial $1 + \alpha[1]z^{-1} + \alpha[2]z^{-2} + \cdots + \alpha[p]z^{-p} = 0$ and $A[i]$ is given by

$$A[i] = \sum_{j=0}^{p-i} \alpha[j]\,\alpha[j+i]. \qquad (4)$$

The logarithmic likelihood function is

$$L(X\,|\,\Theta) = \frac{-N}{2}\left[\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\sum_{k=-p}^{p} A[k]\,V[k]\right]
= \frac{-N}{2}\left[\log(2\pi) + \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\log S_{AR}(\omega) + \frac{S_{DFT}(\omega)}{S_{AR}(\omega)}\right\} d\omega\right] \qquad (5)$$

where $V[k]$ is the short-time autocorrelation function of the samples $X$:

$$V[i] = \frac{1}{N}\sum_{j=1}^{N-i} x[j]\,x[j+i] \qquad (6)$$

and $S_{DFT}(\omega)$ is the short-time power spectrum obtained by the discrete Fourier transformation of $X$:

$$S_{DFT}(\omega) = \frac{1}{2\pi N}\left|\sum_{k=1}^{N} x[k]\exp(-j\omega k\Delta T)\right|^2. \qquad (7)$$

On the basis of these equations, when $X$ is given, the logarithmic likelihood is represented by $V[k]$ alone. The approach which determines $\Theta$ by maximizing $L(X\,|\,\Theta)$ is called the maximum likelihood method. $L(X\,|\,\Theta)$ is first maximized with respect to $\sigma^2$ by computing $\hat{\sigma}^2 = \arg\max_{\sigma^2} L(X\,|\,\Theta)$. Consequently, the maximization of $L(X\,|\,\Theta)$ is equivalent to the minimization of

$$\hat{\sigma}^2 = J(\alpha[1], \alpha[2], \ldots, \alpha[p]) = \sum_{i=0}^{p}\sum_{j=0}^{p} \alpha[i]\,V[i-j]\,\alpha[j] \qquad (8)$$

and the corresponding likelihood function is given by

$$L(X\,|\,\hat{\Theta}) = \frac{-N}{2}\log\left[2\pi e \sum_{i=0}^{p}\sum_{j=0}^{p} \alpha[i]\,V[i-j]\,\alpha[j]\right]. \qquad (9)$$

By maximizing the above expression with respect to $\{\alpha[i],\ i = 1, 2, \ldots, p\}$, we get the following Yule-Walker (or normal) equations:

$$\sum_{j=0}^{p} V[i-j]\,\alpha[j] = 0, \qquad (i = 1, 2, \ldots, p). \qquad (10)$$

It has been shown that this maximization of the likelihood function is equivalent to the minimization of the following spectral distance measure:

$$E\{d(\omega)\} = \int_{-\pi}^{\pi}\left[\exp(-d(\omega)) + d(\omega) - 1\right] d\omega
= \int_{-\pi}^{\pi}\left[\frac{d^2(\omega)}{2} - \frac{d^3(\omega)}{3!} + \frac{d^4(\omega)}{4!} - \cdots\right] d\omega \qquad (11)$$

where $d(\omega) = \log\{S_{AR}(\omega)/S_{DFT}(\omega)\}$ is the log spectral difference between the AR spectrum and the observed DFT spectrum. The measure is non-negative and zero if and only if $S_{AR}(\omega) \equiv S_{DFT}(\omega)$ for all $\omega$; it is a measure of the goodness of fit between the two spectra $S_{AR}(\omega)$ and $S_{DFT}(\omega)$. The integrand of Eq. (11) is the penalty function against the mismatch between $S_{AR}(\omega)$ and $S_{DFT}(\omega)$, as shown in Fig. 2. Because of its asymmetry, it is very sensitive to underestimation, $S_{AR}(\omega) \ll S_{DFT}(\omega)$, but tolerant of overestimation, $S_{AR}(\omega) \gg S_{DFT}(\omega)$. For this reason, MLES is well suited for extracting spectral peaks such as the formants of speech.

Fig. 2. Distance measure.
Fig. 3. Maximum likelihood vocoder.
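As a concrete illustration of this estimation step, the sketch below (a reconstruction for illustration, not the paper's code) computes the short-time autocorrelation of Eq. (6), solves the Yule-Walker equations (10) for alpha[1..p] with scipy.linalg.solve_toeplitz, evaluates sigma^2 via the quadratic form of Eq. (8), and exposes the penalty integrand of Eq. (11) for comparing an AR spectrum with an observed DFT spectrum. The helper names are hypothetical, and the frame x, order p and spectra passed in are assumed to be supplied by the caller.

```python
# Sketch of the ML / Yule-Walker estimation of Section 3 (Eqs. (6), (8), (10))
# plus the spectral penalty of Eq. (11); function names are illustrative.
import numpy as np
from scipy.linalg import solve_toeplitz

def mles_ar_estimate(x, p):
    N = len(x)
    # Eq. (6): V[i] = (1/N) sum_j x[j] x[j+i],  i = 0..p
    V = np.array([np.dot(x[:N - i], x[i:]) / N for i in range(p + 1)])

    # Eq. (10): sum_{j=0}^{p} V[i-j] alpha[j] = 0 for i = 1..p, with alpha[0] = 1
    alpha_tail = solve_toeplitz(V[:p], -V[1:p + 1])
    alpha = np.concatenate(([1.0], alpha_tail))

    # Eq. (8): sigma^2 = sum_i sum_j alpha[i] V[|i-j|] alpha[j]
    Vmat = np.array([[V[abs(i - j)] for j in range(p + 1)]
                     for i in range(p + 1)])
    sigma2 = float(alpha @ Vmat @ alpha)
    return V, alpha, sigma2

def spectral_penalty(S_ar, S_dft):
    # Integrand of Eq. (11): exp(-d) + d - 1 with d = log(S_ar / S_dft).
    # It grows sharply when S_ar underestimates S_dft and slowly when it
    # overestimates, which is why MLES favours fitting the spectral peaks.
    d = np.log(S_ar / S_dft)
    return np.exp(-d) + d - 1.0
```

For a 30 ms frame, a call such as V, alpha, sigma2 = mles_ar_estimate(frame, p=10) yields the AR parameters whose spectrum S_AR best fits S_DFT in the sense of Eq. (11).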
4. PARCOR speech analysis and synthesis method

The direct-form speech synthesis filter $H(z)$ with parameters $\{\alpha[i]\}$ requires a high quantization accuracy to maintain the stability of $H(z)$. To overcome this problem, the partial autocorrelation (PARCOR) lattice filter was invented in 1969 [4]. Since then, the optimum coding of the PARCOR coefficients has been studied extensively in order to improve the synthetic speech quality [13],[14].

4.1 Definition of PARCOR coefficients

The conventional autocorrelation coefficients $V[k]$ can be regarded as a measure of linear time-shift dependency, but this set of parameters is still redundant since there is significant dependency among them. The notion of partial autocorrelation was introduced to reduce the redundancy using successive linear prediction techniques. Suppose that $\hat{x}[n]$ and $\hat{x}[n-p]$ are the predicted values of $x[n]$ and $x[n-p]$ from the $p-1$ intermediate samples $x[n-p+1], \ldots, x[n-2], x[n-1]$:

$$\hat{x}^{(p-1)}[n] = -\sum_{i=1}^{p-1} \alpha^{(p-1)}[i]\, x[n-i], \qquad
\hat{x}^{(p-1)}[n-p] = -\sum_{i=1}^{p-1} \beta^{(p-1)}[i]\, x[n-i] \qquad (12)$$

where the prediction coefficients $\alpha^{(p-1)}[i]$ and $\beta^{(p-1)}[i]$ are determined to minimize the residuals $E\{(x[n] - \hat{x}^{(p-1)}[n])^2\}$ and $E\{(x[n-p] - \hat{x}^{(p-1)}[n-p])^2\}$.

Fig. 4. Definition of the PARCOR coefficient.

Physically, these $k$ parameters correspond to the reflection coefficients in the acoustic tube model of the vocal tract. The following relations can be obtained by substituting Eqs. (12), (13) and (14):

$$k_m = \frac{w[m-1]}{u[m-1]} = \frac{\sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[m-i]}{\sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[i]}, \qquad
u[m] = u[m-1]\,(1 - k_m^2) \qquad (15)$$

$$\alpha^{(m)}[i] = \alpha^{(m-1)}[i] - k_m\, \beta^{(m-1)}[i], \qquad
\beta^{(m)}[i] = \beta^{(m-1)}[i-1] - k_m\, \alpha^{(m-1)}[i-1], \qquad
\beta^{(m-1)}[i] = \alpha^{(m-1)}[m-i] \qquad (16)$$

4.2 Direct PARCOR coefficient derivation

The PARCOR coefficients can also be derived directly from the speech signal. This is called the PARCOR lattice analysis method. The PARCOR coefficients have already been defined as the correlation coefficient between the residuals of the forward and of the backward predictions. The forward and backward residual linear operators are introduced:

$$\varepsilon_f^{(p-1)}[n] = \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^i x[n] = A^{(p-1)}(D)\, x[n],
\qquad A^{(p-1)}(D) = \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^i,$$
$$\varepsilon_b^{(p-1)}[n-p] = \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^i x[n] = B^{(p-1)}(D)\, x[n],
\qquad B^{(p-1)}(D) = \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^i, \qquad (17)$$

where $D$ is the delay operator for unit time.
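The order recursion of Eqs. (15)-(16) can be written compactly. The following minimal sketch (an illustrative reconstruction, not the paper's implementation) derives the PARCOR coefficients k_1, ..., k_p from the short-time autocorrelation V[0..p] of Eq. (6), updating the predictor alpha^(m) and the residual energy u[m] at each order; the function name is hypothetical.

```python
# Sketch of the PARCOR recursion of Eqs. (15)-(16); V[0..p] is the short-time
# autocorrelation of Eq. (6), computed elsewhere.
import numpy as np

def parcor_from_autocorrelation(V, p):
    k = np.zeros(p)
    alpha = np.array([1.0])                  # alpha^{(0)}, with alpha[0] = 1
    u = V[0]                                 # zeroth-order residual energy u[0]
    for m in range(1, p + 1):
        w = np.dot(alpha, V[m:0:-1])         # Eq. (15): w[m-1] = sum_i alpha[i] V[m-i]
        k[m - 1] = w / u                     # PARCOR coefficient k_m
        ext = np.concatenate((alpha, [0.0]))
        alpha = ext - k[m - 1] * ext[::-1]   # Eq. (16), using beta^{(m-1)}[i] = alpha^{(m-1)}[m-i]
        u *= 1.0 - k[m - 1] ** 2             # Eq. (15): u[m] = u[m-1] (1 - k_m^2)
    return k, alpha
```

Since the k_m correspond to reflection coefficients of the acoustic tube model, |k_m| < 1 for every m guarantees a stable synthesis filter, which is what makes them easier to quantize than the direct-form alpha[i].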
Recommended publications
  • The History of Linear Prediction
    [dsp HISTORY] Bishnu S. Atal: The History of Linear Prediction

    In 1965, while attending a seminar on information theory as part of my Ph.D. course work at the Polytechnic Institute of Brooklyn, New York, I came across a paper [1] that introduced me to the concept of predictive coding. At the time, there would have been no way to foresee how this concept would influence my work over the years. Looking back, that paper and the ideas that it generated must have been the force that started the ball rolling. My story, told next, recollects the events that led to proposing the linear prediction coding (LPC) method, then the multipulse LPC and the code-excited LPC.

    PREDICTION AND PREDICTIVE CODING
    The concept of prediction was at least a quarter of a century old by the time I learned about it. In the 1940s, Norbert Wiener developed a mathematical theory for calculating the best filters and predictors for detecting signals hidden in noise. Wiener worked during the Second World War on the problem of aiming antiaircraft guns to shoot down an airplane. Since the plane moves, one must predict its position at the time the shell will reach the plane. Wiener's work appeared in his famous monograph [2] published in 1949. At about the same time, Claude Shannon made a major contribution [3] to the theory of communication of signals. His work established a mathematical framework for coding and transmission of signals. Shannon also described a system for efficient encoding of English text based on the predictability of the English language.
  • Multichannel Speech Enhancement Based on Generalized Gamma Prior Distribution with Its Online Adaptive Estimation
    IEICE TRANS. INF. & SYST., VOL.E91-D, NO.3, MARCH 2008, p. 439. PAPER, Special Section on Robust Speech Processing in Realistic Environments: Multichannel Speech Enhancement Based on Generalized Gamma Prior Distribution with Its Online Adaptive Estimation. Tran HUY DAT, Nonmember, Kazuya TAKEDA, Member, and Fumitada ITAKURA, Fellow.

    SUMMARY: We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from actual noisy speech in a frame-by-frame manner. The utilization of a more general prior distribution with its online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, the multi-channel information in terms of cross-channel statistics is shown to be useful for better adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioner and open window. Key words: multi-channel speech enhancement, speech recognition, generalized gamma distribution, moment matching.

    ... developed an approach based on speech estimation using adaptive generalized gamma distribution modeling in the spectral domain [9], [10] and were able to achieve better performance than conventional methods. The point of this method is that using a more general prior distribution with online adaptation can more accurately estimate the signal and noise statistics and therefore improve the speech estimation, resulting in better speech enhancement performance. Furthermore, the generalized gamma model is suitable for implementation, yielding tractable-form solutions for the estimation and adaptation. In this work, we extend this method to the multi-channel case to improve the speech recognition performance.
  • Curriculum Vitae: Sadaoki Furui, Ph.D.
    Curriculum Vitae Sadaoki Furui, Ph.D. Chair of the Board of Trustees, Toyota Technological Institute at Chicago 6045 S. Kenwood Ave., Chicago, IL 60637 USA [email protected] Chief Research Director, National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430 Japan [email protected] Professor Emeritus, Tokyo Institute of Technology 2-36-6 Shinmachi, Setagaya-ku, Tokyo, 154-0014 Japan [email protected] Tel/Fax: +81-3-3426-0087, Cell: +81-80-4088-1199 1. Education Ph.D., Mathematical Engineering and Instrumentation Physics, University of Tokyo, Tokyo, Japan, 1978 M. S., Mathematical Engineering and Instrumentation Physics, University of Tokyo, Tokyo, Japan, 1970 B. S., Mathematical Engineering and Instrumentation Physics, University of Tokyo, Tokyo, Japan, 1968 2. Honors and Awards 1975 Yonezawa Prize (IEICE) 1985 Sato Paper Award (ASJ) 1987 Sato Paper Award (ASJ) 1988 Best Paper Award (IEICE) 1989 Senior Award (Best Paper Award) (IEEE) 1989 Achievement Award from the Japanese Minister of Science and Technology 1990 Book Award (IEICE) 1993 Best Paper Award (IEICE) 1993 Distinguished Lecturer (IEEE) 1993 Fellow (IEEE) 1996 Fellow (Acoustical Society of America) 2001 Fellow (IEICE) 2001 Mira Paul Memorial Award (Acoustical Foundation, India) 2003 Best Paper Award (IEICE) 2003 Achievement Award (IEICE) 2005 Signal Processing Society Award (IEEE) 2006 Achievement Award from the Japanese Minister of Education, Culture, Sports, Science and Technology 2006 Purple Ribbon Medal from Japanese Emperor 2008 Distinguished Achievement
  • Blind Signal Separation for Recognizing Overlapped Speech
    J. Acoust. Soc. Jpn. (E) 19, 6 (1998). Blind signal separation for recognizing overlapped speech. Tomohiko Taniguchi, Shouji Kajita, Kazuya Takeda, and Fumitada Itakura, Department of Information Electronics, Graduate School of Engineering, Nagoya University, Furo-cho 1, Chikusa-ku, Nagoya, 464-8603 JAPAN. (Received 26 February 1998)

    Utilization of a blind signal separation method for the enhancement of speech recognition accuracy under multi-speaker conditions is discussed. The separation method is based upon ICA (Independent Component Analysis) and makes very few assumptions about the mixing process of the speech signals. The recognition experiments are performed under various conditions concerned with (1) acoustic environment, (2) interfering speakers and (3) recognition systems. The results obtained under those conditions are summarized as follows. (1) The separation method can improve recognition accuracy by more than 20% when the SNR of the interfering signal is 0 to 6 dB in a soundproof room. In a reverberant room, however, the improvement of the performance is degraded to about 10%. (2) In general, recognition accuracy deteriorated more as the number of interfering speakers increased, and the greater the improvement obtained by the separation. (3) There is no significant difference between DTW isolated word discrimination and HMM continuous speech recognition results, except for the fact that saturation in improvement is observed in the high-SNR condition in HMM CSR. Keywords: ICA, Blind signal separation, Overlapped speech. PACS number: 43.72.Ew, 43.72.Ne

    1. INTRODUCTION
    Since the multi-speaker situation is quite usual in speech communication, recognizing overlapped speech is one of the most typical and important problems of speech recognition in the real world. ... always apart from each other, has not been clarified yet. Tracking the speech structure hidden in the acoustic signal, for example the harmonic structure or formant frequency, has also been studied in Refs. 5-7). The method, however, has inherent difficulty in ...
  • Linear Predictive Coding and the Internet Protocol: A Survey of LPC and a History of Realtime Digital Speech on Packet Networks
    Foundations and Trends (R) in sample, Vol. xx, No. xx (2010) 1-147, (c) 2010, DOI: xxxxxx. Linear Predictive Coding and the Internet Protocol: A Survey of LPC and a History of Realtime Digital Speech on Packet Networks. Robert M. Gray, Stanford University, Stanford, CA 94305, USA, [email protected]

    Abstract: In December 1974 the first realtime conversation on the ARPAnet took place between Culler-Harrison Incorporated in Goleta, California, and MIT Lincoln Laboratory in Lexington, Massachusetts. This was the first successful application of realtime digital speech communication over a packet network and an early milestone in the explosion of realtime signal processing of speech, audio, images, and video that we all take for granted today. It could be considered as the first voice over Internet Protocol (VoIP), except that the Internet Protocol (IP) had not yet been established. In fact, the interest in realtime signal processing had an indirect, but major, impact on the development of IP. This is the story of the development of linear predictive coded (LPC) speech and how it came to be used in the first successful packet speech experiments. Several related stories are recounted as well. The history is preceded by a tutorial on linear prediction methods which incorporates a variety of views to provide context for the stories.

    Contents: Preface; Part I: Linear Prediction and Speech; 1 Prediction; 2 Optimal Prediction; 2.1 The Problem and its Solution; 2.2 Gaussian vectors; 3 Linear Prediction; 3.1 Optimal Linear Prediction; 3.2 Unknown ...
  • DSP for In-Vehicle and Mobile Systems
    DSP FOR IN-VEHICLE AND MOBILE SYSTEMS. Edited by Hüseyin Abut, Department of Electrical and Computer Engineering, San Diego State University, San Diego, California, USA <[email protected]>; John H.L. Hansen, Robust Speech Processing Group, Center for Spoken Language Research, Dept. Speech, Language & Hearing Sciences, Dept. Electrical Engineering, University of Colorado, Boulder, Colorado, USA <[email protected]>; Kazuya Takeda, Department of Media Science, Nagoya University, Nagoya, Japan <[email protected]>. Springer eBook ISBN: 0-387-22979-5; Print ISBN: 0-387-22978-7. ©2005 Springer Science + Business Media, Inc., Boston. All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher. Created in the United States of America. Visit Springer's eBookstore at http://ebooks.kluweronline.com and the Springer Global Website Online at http://www.springeronline.com

    Dedication: To Professor Fumitada Itakura. This book, "DSP for In-Vehicle and Mobile Systems", contains a collection of research papers authored by prominent specialists in the field. It is dedicated to Professor Fumitada Itakura of Nagoya University. It is offered as a tribute to his sustained leadership in Digital Signal Processing during a professional career that spans both industry and academe. In many cases, the work reported in this volume has directly built upon or been influenced by the innovative genius of Professor Itakura.
  • Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions
    Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2007, Article ID 16921, 10 pages, doi:10.1155/2007/16921. Research Article: Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions. Weifeng Li,1 Kazuya Takeda,1 and Fumitada Itakura2. 1 Graduate School of Information Science, Nagoya University, Nagoya 464-8603, Japan. 2 Department of Information Engineering, Faculty of Science and Technology, Meijo University, Nagoya 468-8502, Japan. Received 31 January 2006; Revised 10 August 2006; Accepted 29 October 2006. Recommended by S. Parthasarathy. We address issues for improving handsfree speech recognition performance in different car environments using a single distant microphone. In this paper, we propose a nonlinear multiple-regression-based enhancement method for in-car speech recognition. In order to develop a data-driven in-car recognition system, we develop an effective algorithm for adapting the regression parameters to different driving conditions. We also devise a model compensation scheme by synthesizing the training data using the optimal regression parameters and by selecting the optimal HMM for the test speech. Based on isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive regression approach shows an advantage in average relative word error rate (WER) reductions of 52.5% and 14.8%, compared to original noisy speech and the ETSI advanced front end, respectively. Copyright © 2007 Weifeng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 1. INTRODUCTION: ... normalization (CMN) [8], codeword-dependent cepstral normalization (CDCN) [9], and so on.