ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)  IJCST Vol. 5, Issue Spl-2, Jan-March 2014

Speech Coding Using Code Excited Linear Prediction

1Hemanta Kumar Palo, 2Kailash Rout
1ITER, Siksha 'O' Anusandhan University, Bhubaneswar, Odisha, India
2Gandhi Institute For Technology, Bhubaneswar, Odisha, India

Abstract
The main problem in a speech coding system is the optimum utilization of channel bandwidth, so the speech signal is coded using as few bits as possible to obtain low bit-rate speech coders. As the bit rate of the coder goes down, the intelligibility, SNR and overall quality of the speech signal decrease. Hence, a comparative analysis of two different types of speech coders is presented in this paper in order to understand the utility of these coders in various applications, the aim being to reduce the bandwidth through different speech coding techniques and through a reduction in the number of bits, without any appreciable compromise on the quality of speech. Hindi has a different number of stop consonants than English; hence the performance of the coders must be checked on different languages. The main objective of this paper is to develop speech coders capable of producing high-quality speech at low data rates, and its focus is the development and testing of voice coding systems which cater for the above needs.

Keywords
PCM, DPCM, ADPCM, LPC, CELP

I. Introduction
Speech coding or speech compression is the compact digital representation of voice signals [1-3] for the purpose of efficient storage and transmission. Most coders incorporate mechanisms to represent the spectral properties of the speech signal. These spectral properties are useful for speech waveform matching and 'optimize' the coder's performance for the human ear.
An analog speech signal for telephone communication is band-limited below 4 kHz by passing it through a low-pass filter with a 3.4 kHz cutoff frequency and then sampling at an 8 kHz sampling frequency to represent it as an integer stream. For terrestrial telephone networks, simple Pulse-Code Modulation (PCM) [2] is used to convert the information into a 64 kb/s (kilobits/sec) binary stream. Despite the heavy user demand, the terrestrial network is able to accommodate many 64 kb/s signals by installing a sufficient number of transmission lines. In mobile communications, on the other hand, very low bit-rate coders [3-5] are desired because of the bandwidth constraints.

II. Speech Analysis
A coder compresses the source data by removing redundant information. The receiver must know the compression mechanism to decompress the data. Coders intended specifically for speech signals employ schemes which model the characteristics of human speech to achieve an economical representation of the speech. A simple scheme of the speech production process is shown in fig. 1 [11]. The human vocal tract, as shown in fig. 1, is an acoustic tube which has one end at the glottis and the other end at the lips. The vocal tract shape continuously changes with time, creating an acoustic filter with a time-varying frequency response.

Fig. 1: Speech Production Process

The two types of speech sounds are voiced and unvoiced [1]. They produce different sounds and spectra due to their differences in sound formation. With voiced speech, air pressure from the lungs forces the normally closed vocal cords to open and vibrate. The vibration frequencies (pitch) vary from about 50 to 400 Hz (depending on the person's age and sex) and form resonances in the vocal tract at odd harmonics. Unvoiced sounds, called fricatives (e.g., s, f, sh), are formed by forcing air through an opening (hence the term, derived from the word "friction"). Fricatives do not vibrate the vocal cords and therefore do not produce as much periodicity as is seen in the formant structure of voiced speech; unvoiced sounds appear more noise-like.
We know that the speech signal is a time-varying signal, but over a short period of time (0 to 50 ms) its characteristics may be assumed almost stationary. When the speech signal is sampled or digitized, we can therefore analyze the discrete-time representation over a short time interval, such as 10-30 ms.

III. Methodology
To establish a fair means of comparing speech coding or enhancement algorithms, a variety of quality assessment techniques have been formulated. The signal-to-noise ratio (SNR) is one of the most common objective measures for evaluating the performance of a compression algorithm.

A. Speech Signal Database
The first step in modelling a system is the preparation of the required database. For this purpose, a database of the Hindi digits from "shunya" to "nau" spoken by one speaker is recorded using the "Cool" software; the sampling rate has been taken as 16 kHz, 16-bit, mono. The speech signal is digitized and segmented by the same software. An "Intex" microphone was used for recording the samples.
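The SNR measure mentioned above can be computed directly from the original and decoded waveforms. The sketch below is illustrative only (it is not the authors' Matlab code); the function names, the 10 ms frame length at 16 kHz, and the crude 8-bit quantizer used for the sanity check are all assumptions for the example.

```python
import numpy as np

def snr_db(original, decoded):
    """Overall SNR in dB between an original and a coded/decoded signal."""
    original = np.asarray(original, dtype=float)
    noise = original - np.asarray(decoded, dtype=float)
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

def segmental_snr_db(original, decoded, frame_len=160):
    """Average per-frame SNR (segmental SNR); frame_len=160 is 10 ms at 16 kHz."""
    vals = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        seg = slice(start, start + frame_len)
        vals.append(snr_db(original[seg], decoded[seg]))
    return float(np.mean(vals))

# Sanity check: uniform 8-bit quantization of a full-scale sine should give
# roughly 50 dB, consistent with the ~6 dB/bit rule of thumb.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
q = np.round(x * 127) / 127          # crude 8-bit uniform quantizer
print(round(snr_db(x, q), 1))
```

Segmental SNR averages the per-frame ratios and so tracks perceived quality better than the overall SNR when the signal level varies, which is why both are common in coder evaluations.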

B. Speech Coding Techniques
Here one waveform coder, Adaptive Differential PCM (ADPCM), and one hybrid coder, Code-Excited Linear Prediction (CELP) [6-7], are discussed. The goal of waveform coding is to reproduce the original waveform as accurately as possible. It is sample-by-sample coding and often not speech-specific, so waveform coding can deal with non-speech signals without difficulty. However, the cost of this fidelity is a relatively high bit rate; these coders work best at bit rates of 32 kbps and higher. This class includes Pulse Code Modulation (PCM), Differential Pulse Code Modulation (DPCM), Adaptive Differential PCM (ADPCM) and sub-band coders.
Hybrid coders combine features from both waveform coders and parametric coders to provide good-quality, efficient speech coding. Like a parametric coder, a hybrid coder relies on a speech model. During encoding, parameters of the model are estimated, and additional parameters of the model are optimized in such a way that the decoded speech is as close as possible to the original waveform, with the closeness often measured by a perceptually weighted error signal.

C. Adaptive Differential Pulse Code Modulation (ADPCM)
The number of bits per sample can be reduced from 8 to 4 by using adaptive quantization and adaptive prediction. A digital coding scheme that uses both adaptive quantization and adaptive prediction is called Adaptive Differential Pulse Code Modulation (ADPCM). ADPCM is a variant of DPCM that varies the size of the quantization step, to allow a further reduction of the required bandwidth for a given signal-to-noise ratio. ADPCM [8] provides greater levels of prediction gain than simple DPCM, depending on the sophistication of the adaptation logic and the number of past samples used to predict the next sample. The prediction gain of ADPCM is ultimately limited by the fact that only a few past samples are used to predict the input and that the adaptation logic adapts only the quantizer, not the prediction weighting coefficients.

D. Code-Excited Linear Prediction (CELP)
In CELP coders, speech is segmented into frames (typically 10-30 ms long) and, for each frame, an optimum set of linear prediction and pitch filter parameters is determined and quantized. Each speech frame is further divided into a number of subframes (typically 5 ms), and for each subframe an excitation codebook is searched to find the input vector to the quantized predictor system that gives the best reproduction of the speech signal.
A step-by-step procedure for the excitation codebook search is given below:
• Filter the input speech subframe with the perceptual weighting filter.
• For each code vector in the excitation codebook: calculate the optimal gain (described later) and scale the code vector using the value found.
• Filter the scaled excitation code vector with the pitch synthesis filter.
• Filter the pitch synthesis filter's output with the modified formant synthesis filter.
• Subtract the perceptually filtered input speech from the modified formant synthesis filter's output; the result represents an error sequence. Calculate the energy of the error sequence.
• The index of the excitation code vector associated with the lowest error energy is retained as information on the input subframe.
A block diagram of the CELP encoder and decoder is shown in figs. 2 and 3. The decoder unpacks and decodes the various parameters from the bit-stream, which are directed to the corresponding blocks so as to synthesize the speech. A post-filter is added at the end to enhance the quality of the resultant signal.
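The two ADPCM ingredients just described, a predictor the decoder can track and a step size that adapts to the code stream, can be sketched as follows. This is a toy illustration, not ITU-T G.726 or IMA ADPCM: the first-order predictor, the 1.5/0.9 step-adaptation rule and the step-size limits are all invented for the example.

```python
import numpy as np

def adpcm_encode(x, bits=4):
    """Toy ADPCM encoder: first-order predictor (the previous reconstructed
    sample) plus an adaptive uniform quantizer whose step grows after large
    codes and shrinks after small ones."""
    levels = 2 ** (bits - 1)
    step, pred, codes = 0.01, 0.0, []
    for sample in x:
        code = int(np.clip(round((sample - pred) / step), -levels, levels - 1))
        codes.append(code)
        pred += code * step          # reconstruction the decoder can reproduce
        step = min(max(step * (1.5 if abs(code) > levels // 2 else 0.9), 1e-4), 1.0)
    return codes

def adpcm_decode(codes, bits=4):
    """Mirror of the encoder: same predictor, same step adaptation."""
    levels = 2 ** (bits - 1)
    step, pred, out = 0.01, 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
        step = min(max(step * (1.5 if abs(code) > levels // 2 else 0.9), 1e-4), 1.0)
    return np.array(out)

# Demo: code a 440 Hz tone sampled at 16 kHz with 4 bits/sample
fs = 16000
x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
y = adpcm_decode(adpcm_encode(x))
```

Because the decoder regenerates the step size from the code stream alone, no side information needs to be transmitted; real ADPCM standards exploit the same property, though with carefully designed adaptation tables rather than the ad-hoc rule above.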

Fig. 2: Analysis-by-Synthesis Loop of a CELP Encoder

Fig. 3: Block Diagram of a Generic CELP Decoder
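The analysis-by-synthesis loop of figs. 2 and 3 can be sketched in a few lines. The example below is a deliberately reduced illustration, not a complete CELP coder: the pitch synthesis filter and the perceptual weighting filter from the search procedure are omitted, and the order-1 synthesis filter, the 16-entry random codebook and all variable names are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth(excitation, a, zi):
    """All-pole synthesis filter 1/A(z): s[n] = e[n] + sum_k a[k]*s[n-1-k].
    `zi` holds the previous p output samples (filter memory)."""
    p, mem = len(a), list(zi)
    out = []
    for e in excitation:
        s = e + sum(a[k] * mem[-(k + 1)] for k in range(p))
        out.append(s)
        mem = mem[1:] + [s]          # shift memory: mem[-1] is s[n-1]
    return np.array(out), mem

def celp_search(target, codebook, a, zi):
    """Analysis-by-synthesis search: filter each code vector through the
    synthesis filter, compute the optimal gain in closed form, and keep the
    index with the lowest residual error energy."""
    best = (None, 0.0, np.inf)
    for idx, cv in enumerate(codebook):
        y, _ = synth(cv, a, zi)
        g = float(np.dot(target, y) / (np.dot(y, y) + 1e-12))   # optimal gain
        err = float(np.sum((target - g * y) ** 2))
        if err < best[2]:
            best = (idx, g, err)
    return best          # (index, gain, error energy)

# Demo: build a "speech" subframe from a known codebook entry, then check
# that the search recovers its index and gain.
a = np.array([0.5])                        # hypothetical order-1 LPC coefficients
zi = [0.0]                                 # zero filter memory
codebook = rng.standard_normal((16, 40))   # 16 random excitation vectors
target, _ = synth(2.5 * codebook[7], a, zi)
idx, gain, err = celp_search(target, codebook, a, zi)
```

Only the winning index and quantized gain are transmitted per subframe, which is what lets CELP reach bit rates far below waveform coders.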

The idea of code-excited linear prediction (CELP) was born as another attempt to improve on the existing LPC coder [17-18]. The principle of linear prediction used in ADPCM and CELP is explained below.

The basic idea behind the LPC model [19] is that a given speech sample s(n) at time n can be approximated as a linear combination of the past p speech samples, such that

s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)    (1)

where u(n) is a normalized excitation and G is the gain of the excitation. By expressing the above equation in the z-domain we get the relation

S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + G U(z)    (2)

leading to the transfer function

H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (3)

The interpretation of Eq. (3) is given in fig. 4, which shows the normalized excitation source u(n) being scaled by the gain G and acting as input to the all-pole system H(z) to produce the speech signal s(n). Based on the knowledge that the actual excitation function for speech is either a quasi-periodic pulse train (for voiced speech sounds) or a random noise source (for unvoiced sounds), the appropriate synthesis model for speech, corresponding to the LPC [16] analysis, is as shown in fig. 4.

Fig. 4: Synthesis Model Based on LPC (an impulse-train generator driven by the pitch period for voiced speech and a random-noise generator for unvoiced speech feed, via a voiced/unvoiced switch and the gain G, a time-varying digital filter controlled by the vocal-tract parameters, producing s(n) from u(n))

Based on the model of fig. 4, the exact relation between s(n) and u(n) is

s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)    (4)

We consider the linear combination of the past speech samples as the estimate \tilde{s}(n):

\tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (5)

We now form the prediction error e(n), defined as

e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (6)

with error transfer function

A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}    (7)

The basic problem of linear prediction analysis is to determine the set of predictor coefficients {a_k} directly from the speech signal, so that the spectral properties of the digital filter in fig. 4 match those of the speech waveform within the analysis window. Since the spectral characteristics of speech vary over time, the predictor coefficients at a given time n must be estimated from a short segment of the speech signal occurring around time n. Thus the basic approach is to find a set of predictor coefficients that minimize the mean-squared error over a short segment of the speech waveform.

To set up the equations that must be solved to determine the predictor coefficients, we define short-term speech and error segments at time n as

s_n(m) = s(n + m)    (8)

e_n(m) = e(n + m)    (9)

and we seek to minimize the mean-squared error signal at time n:

E_n = \sum_m \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2    (10)

To solve Eq. (10) for the predictor coefficients, we differentiate E_n with respect to each a_k and set the result to zero:

\frac{\partial E_n}{\partial a_k} = 0, \quad k = 1, 2, \ldots, p    (11)

giving

\sum_m s_n(m-i) s_n(m) = \sum_{k=1}^{p} \hat{a}_k \sum_m s_n(m-i) s_n(m-k)    (12)

In Eq. (12), \hat{a}_k are the values of a_k that minimize E_n. Since the \hat{a}_k are unique, we drop the hat and use the notation a_k to denote the values that minimize E_n. Defining the covariance function

\Phi_n(i,k) = \sum_m s_n(m-i) s_n(m-k)    (13)

we can express Eq. (12) in the compact notation

\Phi_n(i,0) = \sum_{k=1}^{p} a_k \Phi_n(i,k)    (14)

which describes a set of p equations in p unknowns. It is readily shown that the minimum mean-squared error E_n can be expressed as

E_n = \sum_m s_n^2(m) - \sum_{k=1}^{p} a_k \sum_m s_n(m) s_n(m-k)    (15)

= \Phi_n(0,0) - \sum_{k=1}^{p} a_k \Phi_n(0,k)    (16)

To solve Eq. (14) for the optimum predictor coefficients (the a_k's), we have to compute \Phi_n(i,k) for 1 \le i \le p and 0 \le k \le p and then solve the resulting set of p simultaneous equations. In practice, the method of solving the equations is a strong function of the range of m used in defining both the section of speech for analysis and the region over which the mean-squared error is computed.

The Autocorrelation Method: A fairly simple and straightforward way of defining the limits on m in the summations is to assume that the speech segment s_n(m) is identically zero outside the range 0 \le m \le N-1. This is equivalent to assuming that the speech signal s(m+n) is multiplied by a finite-length window w(m) which is identically zero outside the range 0 \le m \le N-1. Thus the speech segment for minimization can be expressed as

s_n(m) = s(m+n) \, w(m), \quad 0 \le m \le N-1; \qquad s_n(m) = 0 \text{ otherwise}    (17)

Based on the windowed signal of Eq. (17), the mean-squared error becomes

E_n = \sum_{m=0}^{N-1+p} e_n^2(m)    (18)

and \Phi_n(i,k) can be expressed as

\Phi_n(i,k) = \sum_{m=0}^{N-1+p} s_n(m-i) s_n(m-k), \quad 1 \le i \le p, \; 0 \le k \le p    (19)

or

\Phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m) s_n(m+i-k), \quad 1 \le i \le p, \; 0 \le k \le p    (20)

Since Eq. (20) is only a function of (i-k), the covariance function \Phi_n(i,k) reduces to the simple autocorrelation function, i.e.,

\Phi_n(i,k) = r_n(i-k) = \sum_{m=0}^{N-1-(i-k)} s_n(m) s_n(m+i-k)    (21)

Since the autocorrelation function is symmetric, i.e., r_n(-k) = r_n(k), the LPC equations can be expressed as

\sum_{k=1}^{p} r_n(|i-k|) a_k = r_n(i), \quad 1 \le i \le p    (22)

IV. Results

A. Graphical Results of Speech Coding Techniques
This section presents the graphical results of ADPCM and CELP. In the results of ADPCM, the input and output signals are plotted; the CELP waveform is also plotted. It can be observed from a comparative study of the waveforms of the two coders that CELP reproduces the signal more closely to the original than the other coders. This is due to the fact that CELP uses both long-term and short-term linear prediction models, where an excitation sequence is selected from the codebook through an index. The performance of CELP is superior to LPC-based coders because it uses both magnitude and phase information during synthesis.

Fig. 6: CELP Waveforms for Hindi Digit Shunya

B. Analysis of Results in Terms of SNR and Complexity
PCM has an output SNR of around 49 dB, while the SNR for DPCM falls to around 25 dB. However, ADPCM maintains an SNR of around 49 dB with fewer bits, due to its use of adaptive quantization and adaptive prediction, and CELP has an SNR of around 49.5 dB. This is shown in Table 1. The complexity of the speech coders PCM, DPCM, ADPCM and CELP in terms of bit rate is shown in Table 2. We note that as the bit rate goes down, the computational complexity increases on a large scale. This introduces a delay as well as an increase in the cost of implementation. The bandwidth, as it depends on the bit rate, is also reduced far more drastically in CELP than in ADPCM.

Table 1: Signal-to-Noise Ratio
Method    SNR (dB)    Number of Bits
PCM       49.9257     8
DPCM      25.8433     5
ADPCM     49.9257     4
CELP      49.7256     3

Table 2: Complexity Figure
Algorithm    Bit Rate (bit/sec)    Complexity
PCM          64 kbps               Simple
DPCM         40 kbps               Complex
ADPCM        32 kbps               More complex
CELP         4.8 kbps              Most complex
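The autocorrelation method described above translates directly into code: window the frame, form the autocorrelation values r_n(k), and solve the Toeplitz normal equations of Eq. (22). The sketch below is illustrative, not the paper's Matlab implementation; it solves the system directly with a general linear solver, whereas a Levinson-Durbin recursion would exploit the Toeplitz structure more efficiently. The Hamming window and the order-2 test recursion are assumptions for the demo.

```python
import numpy as np

def lpc_autocorr(frame, p, window=None):
    """Predictor coefficients a_1..a_p by the autocorrelation method:
    window the frame (Eq. 17), form r_n(k) (Eq. 21), and solve the
    Toeplitz normal equations sum_k r(|i-k|) a_k = r(i) (Eq. 22)."""
    s = frame * (window if window is not None else np.hamming(len(frame)))
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Toy check: a frame generated by a known stable order-2 recursion
# s[n] = 1.3 s[n-1] - 0.6 s[n-2] + e[n] should be recovered approximately
# (the analysis window introduces a small bias).
rng = np.random.default_rng(1)
e = 0.01 * rng.standard_normal(400)
s = np.zeros(400)
for n in range(2, 400):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]
a_hat = lpc_autocorr(s, 2)
```

Solving Eq. (22) this way is what both the ADPCM predictor design and the CELP short-term synthesis filter rely on; in a real coder it is repeated for every 10-30 ms analysis frame.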

V. Conclusion
The ultimate goal of this paper is to design a speech coder that achieves the best possible quality of the speech signal at a low bit rate, with constraints on complexity and delay. In this paper, two types of speech coders were studied, each with its own advantages and weaknesses. In order to arrive at a useful conclusion, a database of the Hindi digits "Shunya" to "Nau" was prepared using the "Cool" software in a female voice, and Matlab programs for the various speech coders were written. CELP is a hybrid coder which has shown its superiority over waveform coders in terms of bit rate and reduction in bandwidth requirement.

Fig. 5: ADPCM Waveforms for Hindi Digit Shunya

References
[1] B. Gold, N. Morgan, "Speech and Audio Signal Processing", John Wiley and Sons, 2003.
[2] W. C. Chu, "Speech Coding Algorithms: Foundation and Evolution of Standardized Coders", John Wiley & Sons, 2003.
[3] A. M. Kondoz, "Digital Speech: Coding for Low Bit Rate Communications Systems", John Wiley and Sons Publishing, England, 1994.
[4] Thomas F. Quatieri, "Discrete-Time Speech Signal Processing", Prentice-Hall, third edition, 1996.
[5] B. S. Atal, M. R. Schroeder, "Predictive coding of speech signals and subjective error criteria", IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-27, pp. 247-254, June 1979.
[6] B. Atal, R. Cox, P. Kroon, "Spectral quantization and interpolation for CELP coders", In Proc. Int. Conf. Acoust., Speech, Signal Proc., (Glasgow), pp. 69-72, 1989.
[7] B. S. Atal, M. R. Schroeder, "Stochastic coding of speech at very low bit rates", In Proc. IEEE Int. Conf. Communications, Amsterdam, The Netherlands, May 1984, pp. 1610-1613.
[8] A. S. Spanias, "Speech coding: A tutorial review", Proc. IEEE, Vol. 82, pp. 1541-1582, Oct. 1994.
[9] Jacob Benesty, M. Mohan Sondhi, Yiteng Huang (Eds.), "Springer Handbook of Speech Processing", Springer-Verlag, Berlin Heidelberg, 2008.
[10] L. R. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition", 1st ed., Pearson Education, Delhi, 2003.
[11] J. M. Tribolet, M. P. Noll, B. J. McDermott, "A study of complexity and quality of speech waveform coders", In Proc. Int. Conf. Acoust., Speech, Signal Proc., (Tulsa), April 1978, pp. 586-590.
[12] B. S. Atal, S. L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", J. Acoust. Soc. Am., Aug. 1971, Vol. 50, No. 2, pp. 637-655.
[13] T. P. Barnwell III, K. Nayebi, C. H. Richardson, "Speech Coding, A Computer Laboratory Textbook", John Wiley & Sons, Inc., 1996.
[14] T. Ohya, H. Suda, T. Miki, "5.6 kbits/s PSI-CELP of the half-rate PDC speech coding standard", In Proc. IEEE Vehic. Tech. Conf., pp. 1680-1684, 1994.
[15] Erik Ordentlich, Yair Shoham, "Low-delay code-excited linear-predictive coding of wideband speech at 32 kbps", In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Ontario, Canada, 1991, pp. 9-12.
[16] J. Makhoul, "Linear prediction: A tutorial review", Proceedings of the IEEE, Vol. 63, pp. 561-580, April 1975.
[17] K. Samudravijaya, "Hindi Speech Recognition", Journal of Acoustical Society of India, Vol. 29, No. 1, 2001, pp. 385-393.
[18] M. Kumar, N. Rajput, A. Verma, "A large-vocabulary continuous speech recognition system for Hindi", IBM Journal of Research and Development, Vol. 48, No. 5/6, 2004, pp. 703-715.

Hemanta Kumar Palo, Assistant Professor in the Department of ECE, ITER, Siksha "O" Anusandhan University, Bhubaneswar, Odisha, India.

Kailash Rout, Asst. Professor of Electronic & Communication Engg. at Gandhi Institute for Technology, Bhubaneswar, with 8 years of vast experience.
