Linear Predictive Coding (LPC)


Linear Predictive Coding (LPC): Introduction
Digital Speech Processing— Lecture 13

LPC Methods
• LPC methods are the most widely used methods in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage
  – LPC methods provide extremely accurate estimates of speech parameters, and do so extremely efficiently
  – basic idea of Linear Prediction: the current speech sample can be closely approximated as a linear combination of past samples, i.e.,
        s(n) ≈ Σ_{k=1..p} α_k s(n−k)   for some value of p and coefficients α_k

LPC Methods
• LP is based on speech production and synthesis models
  – speech can be modeled as the output of a linear, time-varying system, excited by either quasi-periodic pulses or noise
  – assume that the model parameters remain constant over the speech analysis interval
• LP provides a robust, reliable and accurate method for estimating the parameters of the linear system (the combined vocal tract, glottal pulse, and radiation characteristic for voiced speech)

LPC Methods
• for periodic signals with period N_p, it is obvious that s(n) ≈ s(n − N_p); but that is not what LP is doing: it is estimating s(n) from the p (<< N_p) most recent values of s(n) by linearly predicting its value
• for LP, the predictor coefficients (the α_k's) are determined (computed) by minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones

LPC Methods
• LP methods have also been used in control and information theory, where they are called methods of system estimation and system identification
  – used extensively in speech under a group of names, including:
    1. covariance method
    2. autocorrelation method
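The basic prediction idea can be sketched in a few lines of Python (an illustrative demo, not from the slides; the 100 Hz tone, 8 kHz sampling rate, and order p = 2 are assumptions chosen for the example): a pure sinusoid satisfies an exact second-order recursion, so a p = 2 linear predictor reproduces it with essentially zero error.

```python
import math

# Sketch of the basic LP idea (the 100 Hz tone, 8 kHz rate, and p = 2 are
# illustrative assumptions): a pure sinusoid obeys the exact recursion
#   s(n) = 2 cos(w) s(n-1) - s(n-2),
# so a second-order linear predictor reproduces it with essentially zero error.
w = 2 * math.pi * 100 / 8000              # 100 Hz tone at 8 kHz sampling
s = [math.sin(w * n) for n in range(50)]
alpha = [2 * math.cos(w), -1.0]           # predictor coefficients, p = 2

def predict(n):
    # s~(n) = sum_{k=1..p} alpha_k s(n-k)
    return sum(a * s[n - k] for k, a in enumerate(alpha, start=1))

errors = [abs(s[n] - predict(n)) for n in range(2, 50)]
print(max(errors))                        # on the order of machine epsilon
```

Real speech is not exactly predictable like this, which is why the coefficients must be estimated by minimizing the squared error, as the following slides develop.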
    3. lattice method
    4. inverse filter formulation
    5. spectral estimation formulation
    6. maximum likelihood method
    7. inner product method

Basic Principles of LP
• the speech production model is
      s(n) = Σ_{k=1..p} a_k s(n−k) + G u(n)
• the time-varying digital filter represents the combined effects of the glottal pulse shape, the vocal tract impulse response, and radiation at the lips
• the system is excited by an impulse train for voiced speech, or a random noise sequence for unvoiced speech
• the transfer function is
      H(z) = S(z) / (G U(z)) = 1 / (1 − Σ_{k=1..p} a_k z^{−k})
• this "all-pole" model is a natural representation for non-nasal voiced speech, but it also works reasonably well for nasals and unvoiced sounds

LP Basic Equations
• a p-th order linear predictor is a system of the form
      s̃(n) = Σ_{k=1..p} α_k s(n−k)  ⇔  P(z) = S̃(z)/S(z) = Σ_{k=1..p} α_k z^{−k}
• the prediction error, e(n), is of the form
      e(n) = s(n) − s̃(n) = s(n) − Σ_{k=1..p} α_k s(n−k)
• the prediction error is the output of a system with transfer function
      A(z) = E(z)/S(z) = 1 − P(z) = 1 − Σ_{k=1..p} α_k z^{−k}
• if the speech signal obeys the production model exactly, and if α_k = a_k for 1 ≤ k ≤ p, then e(n) = G u(n) and A(z) is an inverse filter for H(z), i.e., H(z) = 1/A(z)

LP Estimation Issues
• need to determine the {α_k} directly from the speech such that they give good estimates of the time-varying spectrum
• need to estimate the {α_k} from short segments of speech
• need to minimize the mean-squared prediction error over short segments of speech
• the resulting {α_k} are assumed to be the actual {a_k} in the speech production model
• intend to show that all of this can be done efficiently, reliably, and accurately for speech

Solution for {α_k}
• the short-time average prediction squared error is defined as
      E_n̂ = Σ_m e_n̂²(m) = Σ_m ( s_n̂(m) − s̃_n̂(m) )² = Σ_m ( s_n̂(m) − Σ_{k=1..p} α_k s_n̂(m−k) )²
• select a segment of speech s_n̂(m) = s(n̂ + m) in the vicinity of sample n̂
• the key issue to resolve is the range of m for the summation (to be discussed later)

Solution for {α_k}
• can find the values of α_k that minimize E_n̂ by setting
      ∂E_n̂/∂α_i = 0,   i = 1, 2, ..., p
• giving the set of equations
      −2 Σ_m s_n̂(m−i) [ s_n̂(m) − Σ_{k=1..p} α̂_k s_n̂(m−k) ] = 0,   1 ≤ i ≤ p
  or, equivalently,
      Σ_m s_n̂(m−i) e_n̂(m) = 0,   1 ≤ i ≤ p
• where the α̂_k are the values of α_k that minimize E_n̂ (from now on just use α_k rather than α̂_k for the optimum values)
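The inverse-filter relation e(n) = G u(n) can be checked numerically. This sketch uses made-up model coefficients and a single-impulse excitation (none of these values come from the slides): it synthesizes s(n) through the all-pole model and then applies A(z) = 1 − Σ a_k z^{−k} to recover the excitation.

```python
# Numerical check of the inverse-filter relation (a made-up example, not from
# the slides): synthesize s(n) through the all-pole model with known a_k and
# gain G, then apply A(z) = 1 - sum_k a_k z^{-k}; the output equals G u(n).
a = [0.9, -0.2]                # assumed model coefficients, p = 2 (stable)
G = 0.5
u = [1.0] + [0.0] * 29         # single-impulse excitation

s = []                         # s(n) = G u(n) + sum_k a_k s(n-k)
for n in range(30):
    sn = G * u[n]
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            sn += ak * s[n - k]
    s.append(sn)

e = []                         # e(n) = s(n) - sum_k a_k s(n-k)
for n in range(30):
    en = s[n]
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            en -= ak * s[n - k]
    e.append(en)

print(all(abs(e[n] - G * u[n]) < 1e-12 for n in range(30)))   # prints True
```

With α_k = a_k the residual is exactly the scaled excitation; with estimated coefficients it is only approximately so, which is what the minimization in the following slides addresses.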
• the prediction error e_n̂(m) is orthogonal to the signal s_n̂(m−i) for delays i of 1 to p

Solution for {α_k}
• the minimum mean-squared prediction error has the form
      E_n̂ = Σ_m s_n̂²(m) − Σ_{k=1..p} α_k Σ_m s_n̂(m) s_n̂(m−k)
• defining
      φ_n̂(i,k) = Σ_m s_n̂(m−i) s_n̂(m−k)
  we get
      Σ_{k=1..p} α_k φ_n̂(i,k) = φ_n̂(i,0),   i = 1, 2, ..., p
  and the minimum error can be written as
      E_n̂ = φ_n̂(0,0) − Σ_{k=1..p} α_k φ_n̂(0,k)
• this is a set of p equations in p unknowns that can be solved in an efficient manner for the {α_k}
• Process:
  1. compute φ_n̂(i,k) for 1 ≤ i ≤ p, 0 ≤ k ≤ p
  2. solve the matrix equation for the α_k
• need to specify the range of m used to compute φ_n̂(i,k)
• need to specify s_n̂(m)

Autocorrelation Method
• assume s_n̂(m) exists for 0 ≤ m ≤ L−1 and is exactly zero everywhere else, i.e., a window of length L samples (Assumption #1):
      s_n̂(m) = s(n̂ + m) w(m),   0 ≤ m ≤ L−1
  where w(m) is a finite-length window of length L samples
• if s_n̂(m) is non-zero only for 0 ≤ m ≤ L−1, then
      e_n̂(m) = s_n̂(m) − Σ_{k=1..p} α_k s_n̂(m−k)
  is non-zero only over the interval 0 ≤ m ≤ L−1+p, giving
      E_n̂ = Σ_{m=−∞..∞} e_n̂²(m) = Σ_{m=0..L+p−1} e_n̂²(m)
• at values of m near 0 (i.e., m = 0, 1, ..., p−1) we are predicting the signal from zero-valued samples outside the window range, so e_n̂(m) will be (relatively) large
• at values of m near L (i.e., m = L, L+1, ..., L+p−1) we are predicting zero-valued samples (outside the window range) from non-zero samples, so e_n̂(m) will again be (relatively) large; prediction errors are therefore largest at the two ends of the window
• for these reasons, windows that taper the segment to zero (e.g., a Hamming window) are normally used

The "Autocorrelation Method"
• the short-time autocorrelation of the windowed segment is
      R_n̂(k) = Σ_{m=0..L−1−k} s_n̂(m) s_n̂(m+k),   k = 1, 2, ..., p
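The windowing and short-time autocorrelation steps can be sketched as follows (the test tone, frame position, L = 240, and p = 10 are illustrative choices, not values from the lecture):

```python
import math

# Short-time autocorrelation for the autocorrelation method (sketch; the
# test tone, frame position, L = 240 and p = 10 are illustrative choices).
L, p, n_hat = 240, 10, 300
x = [math.sin(2 * math.pi * 0.02 * n) for n in range(1000)]

# Hamming window, tapering the segment to zero at both ends
w = [0.54 - 0.46 * math.cos(2 * math.pi * m / (L - 1)) for m in range(L)]
s = [x[n_hat + m] * w[m] for m in range(L)]     # s_n(m), zero outside 0..L-1

def R(k):
    # R(k) = sum_{m=0}^{L-1-k} s(m) s(m+k)
    return sum(s[m] * s[m + k] for m in range(L - k))

r = [R(k) for k in range(p + 1)]
print(r[0] >= max(abs(v) for v in r))           # R(0) dominates: prints True
```

R(0) is the energy of the windowed segment and, by the Cauchy-Schwarz inequality, bounds every other lag, which is what makes the resulting normal-equation matrix well behaved.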
Autocorrelation Method
• for the calculation of φ_n̂(i,k): since s_n̂(m) = 0 outside the range 0 ≤ m ≤ L−1,
      φ_n̂(i,k) = Σ_{m=0..L−1+p} s_n̂(m−i) s_n̂(m−k),   1 ≤ i ≤ p, 0 ≤ k ≤ p
• which is equivalent to the form
      φ_n̂(i,k) = Σ_{m=0..L−1−(i−k)} s_n̂(m) s_n̂(m+i−k),   1 ≤ i ≤ p, 0 ≤ k ≤ p
• there are L − |i−k| non-zero terms in the computation of φ_n̂(i,k) for each value of i and k; it is easily shown that
      φ_n̂(i,k) = R_n̂(i−k),   1 ≤ i ≤ p, 0 ≤ k ≤ p
  where R_n̂(i−k) is the short-time autocorrelation of s_n̂(m) evaluated at lag i−k, with
      R_n̂(k) = Σ_{m=0..L−1−k} s_n̂(m) s_n̂(m+k)

Autocorrelation Method
• since R_n̂(k) is even, φ_n̂(i,k) = R_n̂(|i−k|), 1 ≤ i ≤ p, 0 ≤ k ≤ p
• thus the basic equation becomes
      Σ_{k=1..p} α_k R_n̂(|i−k|) = R_n̂(i),   1 ≤ i ≤ p
• with the minimum mean-squared prediction error of the form
      E_n̂ = R_n̂(0) − Σ_{k=1..p} α_k R_n̂(k)

Autocorrelation Method
• expressed in matrix form (writing R(k) for R_n̂(k)), ℜα = r:

      [ R(0)    R(1)    ...  R(p−1) ] [ α_1 ]   [ R(1) ]
      [ R(1)    R(0)    ...  R(p−2) ] [ α_2 ] = [ R(2) ]
      [  ...     ...    ...   ...   ] [ ... ]   [ ...  ]
      [ R(p−1)  R(p−2)  ...  R(0)   ] [ α_p ]   [ R(p) ]

  with solution α = ℜ^{−1} r
• ℜ is a p × p Toeplitz matrix: symmetric, with all elements along each diagonal equal; consequently there exist algorithms for solving for the {α_k} that are more efficient than simple matrix inversion

Covariance Method
• there is a second basic approach to defining the speech segment s_n̂(m) and the limits on the sums, namely to fix the interval over which the mean-squared error is computed (Assumption #2):
      E_n̂ = Σ_{m=0..L−1} e_n̂²(m) = Σ_{m=0..L−1} ( s_n̂(m) − Σ_{k=1..p} α_k s_n̂(m−k) )²
      φ_n̂(i,k) = Σ_{m=0..L−1} s_n̂(m−i) s_n̂(m−k),   1 ≤ i ≤ p, 0 ≤ k ≤ p
• the key difference from the autocorrelation method is that the limits of summation include terms before m = 0, so the window effectively extends p samples backwards, covering s(n̂ − p) to s(n̂ + L − 1)
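The slides only note that the Toeplitz structure admits solutions more efficient than matrix inversion; the classic such algorithm is the Levinson-Durbin recursion, sketched below. The recursion itself is standard, but the AR(2) test signal and its coefficients are invented here for illustration and are not part of the original lecture.

```python
import random

def levinson_durbin(r, p):
    # Solve  sum_k alpha_k R(|i-k|) = R(i),  i = 1..p,
    # in O(p^2) operations instead of O(p^3) matrix inversion.
    a = [0.0] * (p + 1)
    E = r[0]                               # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, E = new_a, E * (1.0 - k * k)
    return a[1:], E

# recover the coefficients of a synthetic AR(2) signal from its
# short-time autocorrelation (signal parameters are made up for the demo)
random.seed(0)
true_a = [1.3, -0.6]
s = [0.0, 0.0]
for n in range(2, 20000):
    s.append(true_a[0] * s[-1] + true_a[1] * s[-2] + random.gauss(0, 1))
L = len(s)
r = [sum(s[m] * s[m + k] for m in range(L - k)) for k in range(3)]
alpha, E = levinson_durbin(r, 2)
print([round(v, 2) for v in alpha])        # close to [1.3, -0.6]
```

Because the sample autocorrelation sequence is positive semidefinite, each reflection coefficient k stays inside (−1, 1), so the residual energy E remains positive throughout the recursion.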
• since the window extends backwards rather than being truncated, there is no need to taper it with a Hamming window: there are no transitions at the window edges
• changing the summation index gives
      φ_n̂(i,k) = Σ_{m=−i..L−1−i} s_n̂(m) s_n̂(m+i−k),   1 ≤ i ≤ p, 0 ≤ k ≤ p
               = Σ_{m=−k..L−1−k} s_n̂(m) s_n̂(m+k−i),   1 ≤ i ≤ p, 0 ≤ k ≤ p
• the autocorrelation formulation cannot be used here: this is a true cross-correlation

Covariance Method
• need to solve a set of equations of the form
      Σ_{k=1..p} α_k φ_n̂(i,k) = φ_n̂(i,0),   i = 1, 2, ..., p
  with minimum error
      E_n̂ = φ_n̂(0,0) − Σ_{k=1..p} α_k φ_n̂(0,k)
• in matrix form (writing φ(i,k) for φ_n̂(i,k)), Φα = ψ:

      [ φ(1,1)  φ(1,2)  ...  φ(1,p) ] [ α_1 ]   [ φ(1,0) ]
      [ φ(2,1)  φ(2,2)  ...  φ(2,p) ] [ α_2 ] = [ φ(2,0) ]
      [  ...     ...    ...   ...   ] [ ... ]   [  ...   ]
      [ φ(p,1)  φ(p,2)  ...  φ(p,p) ] [ α_p ]   [ φ(p,0) ]

  with solution α = Φ^{−1} ψ
• since φ_n̂(i,k) = φ_n̂(k,i), the matrix is symmetric but not Toeplitz; its diagonal elements are related as
      φ_n̂(i+1, k+1) = φ_n̂(i,k) + s_n̂(−i−1) s_n̂(−k−1) − s_n̂(L−1−i) s_n̂(L−1−k)
  e.g., φ_n̂(2,2) = φ_n̂(1,1) + s_n̂(−2)² − s_n̂(L−2)²
• every term φ_n̂(i,k) has a fixed number (L) of terms contributing to its computation

Summary of LP
• use a p-th order linear predictor to predict s(n̂) from the p previous samples
• minimize the mean-squared error, E_n̂, over an analysis window of duration L samples
• the solution for the optimum predictor coefficients, {α_k}, is based on solving a matrix equation; two solutions have evolved:
  – autocorrelation method: the signal is windowed by a tapering window in order to minimize the discontinuities at the beginning (predicting speech from zero-valued samples) and end (predicting zero-valued samples from speech samples) of the interval; the matrix φ_n̂(i,k) is shown to be an autocorrelation function
  – covariance method: all terms φ_n̂(i,k) have a fixed number of terms contributing to the computation; the matrix is a true cross-correlation, symmetric but not Toeplitz
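The covariance method can be sketched end to end as follows. This is an illustrative implementation under stated assumptions (the AR(2) test signal, frame position n_hat, and L are invented for the demo); since Φ is not Toeplitz, the system is solved here by plain Gaussian elimination rather than a Toeplitz-specific recursion.

```python
import random

# Sketch of the covariance method (illustrative, not from the slides):
# fix the error interval to m = 0..L-1, build phi(i,k) using the p samples
# before the frame as well, and solve the symmetric (non-Toeplitz) system.
random.seed(1)
p, L, n_hat = 2, 1800, 100
true_a = [1.3, -0.6]                   # assumed AR(2) coefficients
x = [0.0, 0.0]
for n in range(2, 2000):
    x.append(true_a[0] * x[-1] + true_a[1] * x[-2] + random.gauss(0, 1))

def s(m):                              # frame access; m may reach back to -p
    return x[n_hat + m]

def phi(i, k):
    # phi(i,k) = sum_{m=0}^{L-1} s(m-i) s(m-k), no windowing needed
    return sum(s(m - i) * s(m - k) for m in range(L))

# assemble  sum_k alpha_k phi(i,k) = phi(i,0)  and solve by elimination
A = [[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)]
b = [phi(i, 0) for i in range(1, p + 1)]
for col in range(p):                   # tiny Gaussian elimination (no pivoting)
    for row in range(col + 1, p):
        f = A[row][col] / A[col][col]
        A[row] = [ar - f * ac for ar, ac in zip(A[row], A[col])]
        b[row] -= f * b[col]
alpha = [0.0] * p
for i in range(p - 1, -1, -1):
    alpha[i] = (b[i] - sum(A[i][j] * alpha[j] for j in range(i + 1, p))) / A[i][i]
print(alpha)                           # recovered coefficients, close to true_a
```

Because no tapering window is applied and every φ(i,k) sums exactly L products, the covariance estimates track the underlying coefficients closely even for modest frame lengths.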