
Chapter 8: Speech Coding

School of Information Science and Engineering, SDU

• The performance of speech coders determines the quality of the recovered speech and the capacity of the system.
• In mobile communication systems, bandwidth is a precious commodity, and service providers are continuously challenged to accommodate more users within a limited allocated bandwidth.
• The lower the bit rate at which the coder can deliver toll-quality speech, the more speech channels can be accommodated within a given bandwidth.

For this reason, manufacturers and service providers are continuously searching for speech coders that provide toll-quality speech at lower bit rates.

8.1 Introduction

• The goal of all speech coding systems is to transmit speech with the highest possible quality using the least possible channel capacity, while maintaining the required levels of implementation complexity and communication delay.
• In general, there is a positive correlation between coder bit-rate efficiency and the algorithmic complexity required to achieve it. A balance needs to be struck between these conflicting factors.

Two categories of coders (based on the means by which they achieve compression):
• Waveform coders
• Vocoders

(1) Waveform coders reproduce the time waveform of the speech signal as closely as possible.
• Source independent: they code a variety of signals equally well.
• Robust over a wide range of speech characteristics and in noisy environments.
• Minimal complexity.
• Achieve only moderate economy in transmission bit rate.

Examples:
1. Pulse code modulation (PCM)
2. Differential pulse code modulation (DPCM)
3. Adaptive differential pulse code modulation (ADPCM)
4. Delta modulation (DM)
5. Continuously variable slope delta modulation (CVSDM)
6. Adaptive predictive coding (APC)

8.2 Characteristics of Speech Signals

• Speech waveforms have a number of useful properties that can be exploited when designing efficient coders:
  • Nonuniform probability distribution of speech amplitude
  • Nonzero autocorrelation between successive speech samples
  • Nonflat nature of the speech spectra
  • Existence of voiced and unvoiced segments in speech
  • Quasiperiodicity of voiced speech signals
• The most basic property is that speech waveforms are bandlimited. They can therefore be time-discretized (sampled) at a finite rate and reconstructed completely from their samples.

1) Probability Density Function (pdf)

• Characteristics of the speech signal pdf:
  • Very high probability of near-zero amplitudes
  • Significant probability of very high amplitudes
  • Monotonically decreasing function of amplitude between these extremes
• The exact distribution depends on the input bandwidth and recording conditions.
• Nonuniform quantizers, including vector quantizers, attempt to match the distribution of quantization levels to the pdf of the input speech signal.

• An approximation to the long-term pdf of telephone-quality speech signals is the two-sided exponential (Laplacian) function

$$p(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\left(-\frac{\sqrt{2}\,|x|}{\sigma_x}\right)$$

  where $\sigma_x$ is the standard deviation of the speech amplitude.
• There is a distinct peak at zero due to frequent pauses and low-level speech segments.
• Short-time pdfs of speech segments are also single-peaked functions and are usually approximated by a Gaussian distribution.

2) Autocorrelation Function (ACF)

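• There is much correlation between adjacent samples of a speech segment, so each sample can be predicted fairly well from the preceding ones. All differential and predictive coding schemes are based on this property.

• Definition:

$$C(k) = \frac{1}{N} \sum_{n=0}^{N-|k|-1} x(n)\,x(n+|k|)$$

  where $x(n)$ is the $n$th speech sample and $N$ is the number of samples in the segment. The ACF is usually normalized to the signal variance, so that $C(0) = 1$.

• The ACF gives a quantitative measure of the closeness between samples as a function of their separation in time.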
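The normalized ACF can be computed directly from its definition. The following is a minimal sketch, not a speech-specific tool: the function names are illustrative, and a synthetic first-order recursive signal with coefficient 0.9 (an assumed value) stands in for a real speech segment.

```python
import random


def normalized_acf(x, max_lag):
    """Return [C(0), ..., C(max_lag)], normalized so that C(0) = 1."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    c0 = sum(v * v for v in centered) / n
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / (n * c0)
        for k in range(max_lag + 1)
    ]


def synthetic_correlated_signal(num_samples, rho=0.9, seed=0):
    """Stand-in for speech: each sample is rho times the previous one plus noise."""
    rng = random.Random(seed)
    samples, previous = [], 0.0
    for _ in range(num_samples):
        previous = rho * previous + rng.gauss(0.0, 1.0)
        samples.append(previous)
    return samples


acf = normalized_acf(synthetic_correlated_signal(20000), max_lag=3)
print([round(c, 2) for c in acf])
```

The closer C(1) is to 1, the smaller the variance of the difference between adjacent samples, which is what differential coders exploit.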

• Typical speech signals have an adjacent-sample correlation, C(1), as high as 0.85 to 0.9.

3) Power Spectral Density Function (PSD)

• The PSD of speech is nonflat. High-frequency components contribute very little to the total speech energy.
• This property can be exploited to obtain significant compression in the frequency domain.
• Coding speech separately in different frequency bands can lead to significant coding gain. Although the high-frequency components are insignificant in energy, they are very important carriers of speech information and hence need to be adequately represented.

• A qualitative measure of the theoretical maximum coding gain that can be obtained by exploiting the nonflat characteristics of the PSD is given by the spectral flatness measure (SFM).
• The SFM is defined as the ratio of the arithmetic mean to the geometric mean of the samples of the PSD taken at uniform intervals in frequency.

8.3 Quantization Techniques

(1) Uniform Quantization
• Quantization is the process of mapping a continuous range of signal amplitudes into a finite set of discrete amplitudes.
• The operation is irreversible.
• It introduces distortion. The quantization technique used determines to a great extent the overall distortion.
• One of the most frequently used measures of distortion is the mean square error (MSE).
• The distortion introduced by a quantizer is often modeled as additive quantization noise.
• The performance of a quantizer is measured as the output signal-to-quantization noise ratio (SQNR).
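A uniform quantizer and its measured SQNR can be sketched as follows. This is an illustration under stated assumptions, not a PCM codec: the quantizer is a mid-rise design over the range [-1, 1], the input is uniformly distributed over the full range, and all names are illustrative. Under these assumptions the measured SQNR should rise by roughly 6 dB per added bit.

```python
import math
import random


def uniform_quantize(x, num_bits, x_max=1.0):
    """Mid-rise uniform quantizer with 2**num_bits levels over [-x_max, x_max]."""
    levels = 2 ** num_bits
    step = 2.0 * x_max / levels
    index = math.floor(x / step)
    # Clamp so that out-of-range inputs map to the outermost levels.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def sqnr_db(signal, num_bits):
    """Ratio of signal power to quantization noise power, in dB."""
    signal_power = sum(v * v for v in signal)
    noise_power = sum((v - uniform_quantize(v, num_bits)) ** 2 for v in signal)
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(1)
samples = [rng.uniform(-1.0, 1.0) for _ in range(50000)]
sqnr_by_bits = {bits: sqnr_db(samples, bits) for bits in (6, 7, 8)}
for bits, value in sqnr_by_bits.items():
    print(bits, "bits:", round(value, 2), "dB")
```

Speech is far from uniformly distributed, so a real coder would see a lower SQNR at the same number of bits, which motivates the nonuniform and adaptive quantizers discussed next.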

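• The SQNR of a PCM encoder with n bits per sample:

$$(\mathrm{SQNR})_{\mathrm{dB}} = 6.02\,n + a$$

  where a = 4.77 for the peak SQNR and a = 0 for the average SQNR. With each additional bit, the output SQNR improves by about 6 dB.

(2) Nonuniform Quantization
• Distribute the quantization levels in accordance with the pdf of the input waveform.
• Mean square distortion:

$$D = E\left[(x - f_Q(x))^2\right] = \int_{-\infty}^{\infty} \left[x - f_Q(x)\right]^2 p(x)\,dx$$

  where $p(x)$ is the pdf of the input and $f_Q(x)$ is the output of the quantizer.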

• To design an optimal nonuniform quantizer, we need to determine the quantization levels that minimize the distortion of a signal with a given pdf.
• The Lloyd-Max algorithm provides a method to determine the optimum quantization levels by iteratively changing the levels in a manner that minimizes the mean square distortion.
• A simple and robust implementation is the logarithmic quantizer.
• Different companding techniques:

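• μ-law (U.S.), for an input normalized to $|x| \le 1$:

$$F(x) = \operatorname{sgn}(x)\,\frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}$$

• A-law (Europe), for an input normalized to $|x| \le 1$:

$$F(x) = \operatorname{sgn}(x) \begin{cases} \dfrac{A|x|}{1 + \ln A}, & 0 \le |x| \le \dfrac{1}{A} \\[2ex] \dfrac{1 + \ln(A|x|)}{1 + \ln A}, & \dfrac{1}{A} \le |x| \le 1 \end{cases}$$

(3) Adaptive Quantization
• There is a distinction between the long-term and short-term pdf of speech waveforms, a consequence of the nonstationary character of speech. The dynamic range of speech is usually 40 dB or more.
• A time-varying quantization technique is therefore useful: it varies the step size in accordance with the input signal power.

(4) Vector Quantization
• Shannon's Rate-Distortion Theorem: There exists a mapping from a source waveform to output code words such that for a given distortion D, R(D) bits per sample are sufficient to reconstruct the waveform with an average distortion arbitrarily close to D.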

• R(D) is called the rate-distortion function. It represents a fundamental limit on the achievable rate for a given distortion.
• The actual rate R has to be greater than R(D).
• Shannon predicted that better performance can be achieved by coding many samples at a time instead of one sample at a time.
• Vector quantization (VQ) is a delayed-decision coding technique that maps a group of input samples (typically a speech frame), called a vector, to a code book index.
• A code book is set up consisting of a finite set of vectors covering the entire anticipated range of values.
• In each quantizing interval, the code book is searched and the index of the entry that gives the best match to the input signal frame is selected.
• VQ can yield better performance even when the samples are independent of one another.
• The number of samples in a block (vector) is called the dimension L of the vector quantizer.
• The rate R of the vector quantizer is defined as:

$$R = \frac{\log_2 n}{L} \ \text{bits/sample}$$

  where n is the size of the VQ code book. R may take fractional values.
• Quantization vectors are used instead of quantization levels.
• Distortion is measured as the squared Euclidean distance between the quantization vector and the input vector.
• VQ is most efficient at very low bit rates (R = 0.5 bits/sample or less).
• VQ is, however, computationally intensive.
• It is not often used to code speech signals directly.
• It is usually used to quantize speech analysis parameters, such as:
  • Linear prediction coefficients
  • Spectral coefficients
  • Filter bank energies, etc.

8.4 Adaptive Differential Pulse Code Modulation (ADPCM)

• A more efficient coding scheme.
• Exploits the redundancies present in the speech signal between adjacent samples.
• The difference between adjacent samples is transmitted.
• Allows speech to be encoded at a bit rate of 32 kbps. The CCITT standard G.721 ADPCM algorithm for 32 kbps speech coding is used in cordless telephone systems such as CT2 and DECT.
• Signal prediction techniques are used.
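The idea of transmitting a quantized prediction difference can be sketched with a plain, non-adaptive DPCM loop. This is not the G.721 algorithm, which adapts both the quantizer and the predictor; the fixed first-order predictor coefficient, the step size, and the test tone below are all assumptions chosen for illustration.

```python
import math


def dpcm_encode(samples, step, predictor_coeff=0.9):
    """Quantize the difference between each sample and its prediction."""
    codes, prediction = [], 0.0
    for sample in samples:
        code = round((sample - prediction) / step)
        codes.append(code)
        # Track the decoder's reconstruction so both sides stay in step.
        reconstructed = prediction + code * step
        prediction = predictor_coeff * reconstructed
    return codes


def dpcm_decode(codes, step, predictor_coeff=0.9):
    """Rebuild the waveform from the transmitted difference codes."""
    output, prediction = [], 0.0
    for code in codes:
        reconstructed = prediction + code * step
        output.append(reconstructed)
        prediction = predictor_coeff * reconstructed
    return output


# Hypothetical input: a 200 Hz tone sampled at 8 kHz.
signal = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(400)]
codes = dpcm_encode(signal, step=0.02)
decoded = dpcm_decode(codes, step=0.02)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
largest_code = max(abs(c) for c in codes)
print("max reconstruction error:", max_error)
print("largest difference code:", largest_code)
```

Because the encoder predicts from the reconstructed signal rather than from the original, quantization error does not accumulate at the decoder. The difference codes also span a much smaller range than the roughly ±50 steps that direct quantization of this tone would need at the same step size, which is where the bit-rate saving comes from.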

8.5 Frequency Domain Coding of Speech

• The speech signal is divided into a set of frequency components, which are quantized and encoded separately.

• Different frequency bands can be preferentially encoded according to some perceptual criteria for each band.

• The quantization noise can be contained within bands and prevented from creating harmonic distortions outside the band.

Advantage: The number of bits used to encode each frequency component can be dynamically varied and shared among the different bands.

(1) Sub-band Coding

• The human ear does not detect quantization distortion equally well at all frequencies.

• It is therefore possible to achieve substantial improvement in quality by coding the signal in narrower bands.

• In a sub-band coder, speech is typically divided into four or eight sub-bands by a bank of filters, and each sub-band is sampled at a bandpass Nyquist rate and encoded with a different accuracy in accordance with a perceptual criterion.

Ways of band-splitting:
• Divide the entire speech band into unequal sub-bands that contribute equally to the articulation index.

Method suggested by Crochiere:

Sub-band Number   Frequency Range
1                 200-700 Hz
2                 700-1310 Hz
3                 1310-2020 Hz
4                 2020-3200 Hz

• Divide the band into equal sub-bands and assign to each sub-band a number of bits proportional to its perceptual significance. Octave band splitting is often employed instead of equal splitting. Because the sensitivity of the human ear decreases exponentially with frequency, this kind of splitting is more in tune with the perception process.
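The bit rate implied by a band split can be estimated with a few lines of arithmetic. The sketch below uses the four Crochiere bands listed above and assumes that each band is sampled at twice its bandwidth (its bandpass Nyquist rate) with no side information. The bit allocation of 4, 3, 2 and 1 bits per sample is hypothetical and is not taken from the source.

```python
# Band edges in Hz, from the Crochiere band-splitting table.
CROCHIERE_BANDS_HZ = [(200, 700), (700, 1310), (1310, 2020), (2020, 3200)]

# Hypothetical allocation: more bits for the perceptually important low bands.
HYPOTHETICAL_BITS_PER_SAMPLE = [4, 3, 2, 1]


def subband_bit_rate(bands_hz, bits_per_sample):
    """Sum over bands of (2 x bandwidth) samples/s times bits per sample."""
    total = 0
    for (low, high), bits in zip(bands_hz, bits_per_sample):
        sampling_rate = 2 * (high - low)  # bandpass Nyquist rate
        total += sampling_rate * bits
    return total


rate_bps = subband_bit_rate(CROCHIERE_BANDS_HZ, HYPOTHETICAL_BITS_PER_SAMPLE)
print(rate_bps)  # 12860 bits per second
```

Shifting bits from one band to another changes the total rate only through the band's bandwidth, which is why the allocation can be varied dynamically among bands.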

Method for processing the sub-band signals: make a low-pass translation of the sub-band signal to zero frequency by a modulation process equivalent to single-sideband modulation.

• The low-pass translation technique is straightforward and takes advantage of a bank of nonoverlapping bandpass filters.
• Perceptible aliasing effects exist unless sophisticated bandpass filters are used. Some techniques have been developed to deal with this problem.

Sub-band coding is useful for lower bit rates in the range 9.6 to 32 kbps, and especially for bit rates below 16 kbps. The CD-900 cellular telephone system uses sub-band coding for speech compression.

(2) Adaptive Transform Coding

• Apply a block transformation to windowed input segments of the speech waveform. Each segment is represented by a set of transform coefficients, which are separately quantized and transmitted. This technique is more complex than sub-band coding.

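• Successfully used to encode speech at bit rates in the range 9.6 kbps to 20 kbps.
• The discrete cosine transform (DCT) is usually used to implement the transform.
• The DCT of an N-point sequence x(n):

DCT:

$$X_c(k) = \sum_{n=0}^{N-1} x(n)\,g(k)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad k = 0, 1, \ldots, N-1$$

  where $g(0) = 1$ and $g(k) = \sqrt{2}$ for $k = 1, 2, \ldots, N-1$.

IDCT:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_c(k)\,g(k)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad n = 0, 1, \ldots, N-1$$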
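A direct, unoptimized evaluation of the transform pair can serve as a check on the definitions. The sketch assumes the normalization g(0) = 1, g(k) = √2 for k ≥ 1, with a factor of 1/N in the inverse; other texts and libraries scale the DCT differently. The function names and the sample frame are illustrative.

```python
import math


def _g(k):
    """Scale factor: 1 for the DC coefficient, sqrt(2) otherwise."""
    return 1.0 if k == 0 else math.sqrt(2.0)


def dct(x):
    """O(N^2) DCT of an N-point sequence."""
    n_points = len(x)
    return [
        _g(k) * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


def idct(coeffs):
    """O(N^2) inverse DCT matching dct() above."""
    n_points = len(coeffs)
    return [
        sum(
            _g(k) * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for k in range(n_points)
        ) / n_points
        for n in range(n_points)
    ]


frame = [0.10, 0.35, 0.52, 0.61, 0.58, 0.44, 0.21, -0.05]  # hypothetical samples
coefficients = dct(frame)
recovered = idct(coefficients)
roundtrip_error = max(abs(a - b) for a, b in zip(frame, recovered))
print([round(c, 3) for c in coefficients])
```

For a smooth frame like this one, most of the energy lands in the first few coefficients, so an adaptive transform coder can spend its bits there and quantize the rest coarsely.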

Fast algorithms have been developed to compute the DCT and IDCT.

8.6 Vocoders

• Vocoders are a class of speech coding systems that analyze the voice signal at the transmitter, transmit parameters derived from the analysis, and then synthesize the voice at the receiver using those parameters. All vocoder systems attempt to model the speech generation process as a dynamic system and try to quantify certain physical constraints of the system.

• Characteristics:
  • Much more complex than waveform coders
  • Achieve very high economy in transmission bit rate
  • Less robust
  • Tend to be talker dependent

• Types:
  • Linear predictive coder (LPC)
  • Channel vocoder
  • Formant vocoder
  • Cepstrum vocoder
  • Voice-excited vocoder

• All vocoding systems are based on a speech generation model.
• The sound-generating mechanism forms the source and is linearly separated from the intelligence-modulating vocal tract filter, which forms the system.

8.7 Linear Predictive Coders (LPCs)

8.7.1 LPC Vocoders

• Belong to the time-domain class of vocoders.
• Attempt to extract the significant features of speech from the time waveform.
• Computationally intensive, but the most popular among the class of low bit rate vocoders.
• Transmit good-quality voice at 4.8 kbps and poorer-quality voice at even lower rates.
• Model the vocal tract as an all-pole linear filter.
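One common way to fit an all-pole model to a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is an assumption about technique, since the slides do not name a specific algorithm, and a synthetic second-order recursive signal with known coefficients (1.3 and -0.6, chosen arbitrarily) stands in for a speech frame. The predictor convention is x̂(n) = a₁x(n-1) + a₂x(n-2) + ...

```python
import random


def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r(0..max_lag) of a frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order.

    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for j in range(1, i):
            updated[j] = a[j] - reflection * a[i - j]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:], error


def synthetic_ar2_frame(num_samples, a1=1.3, a2=-0.6, seed=0):
    """Stand-in for speech: white noise passed through a known all-pole filter."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(num_samples):
        x.append(a1 * x[-1] + a2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


frame = synthetic_ar2_frame(20000)
r = autocorrelation(frame, 2)
coeffs, residual_energy = levinson_durbin(r, order=2)
print([round(c, 3) for c in coeffs])
```

The recovered coefficients should be close to the ones used to generate the signal. A practical coder would use a higher order (around 10 is typical for 8 kHz speech) and short windowed frames rather than one long record.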

• The excitation to the filter is either a pulse train at the pitch frequency or random white noise, depending on whether the speech segment is voiced or unvoiced.
• The coefficients of the all-pole filter are obtained in the time domain using linear prediction techniques.
• The prediction principles are similar to those in ADPCM, but the LPC vocoder transmits only selected characteristics of the error signal, including:
  • Gain factor
  • Pitch information
  • Voiced/unvoiced decision information
• At the receiver, the received information about the error signal is used to determine the appropriate excitation for the synthesis filter. That is, the error signal is the excitation to the decoder.
• The synthesis filter is designed at the receiver using the received predictor coefficients.
• Various LPC schemes differ in the way they recreate the error signal (excitation) at the receiver. There are three alternatives.
• The first one is the most popular.
  • It uses two sources at the receiver, one of white noise and the other a series of pulses at the current pitch rate.
  • The selection of either excitation is based on the voiced/unvoiced decision made at the transmitter and communicated to the receiver along with the other information.
  • This technique requires the transmitter to extract pitch frequency information, which is often very difficult.
  • Moreover, the phase coherence between the harmonic components of the excitation pulse tends to produce a buzzy twang in the synthesized speech.

These problems are mitigated in the other two approaches: multi-pulse excited LPC and stochastic or code-excited LPC.

8.7.2 Multi-pulse Excited LPC

No matter how well the pulse is positioned, excitation by a single pulse per pitch period produces audible distortion.
• Atal suggested using more than one pulse, typically eight per period, and adjusting the individual pulse positions and amplitudes sequentially to minimize a spectrally weighted mean square error.
• This technique is called multi-pulse excited LPC (MPE-LPC).
• It can result in better speech quality, because:
  • The prediction residual is better approximated by several pulses per pitch period.
  • The multi-pulse algorithm does not require pitch detection.

8.7.3 Code-Excited LPC

• In this method, the coder and decoder have a predetermined code book of stochastic (zero-mean white Gaussian) excitation signals.
• For each speech signal, the transmitter searches through its code book of stochastic signals for the one that gives the best perceptual match to the sound when used as an excitation to the LPC filter.
• The index of the code book entry where the best match was found is then transmitted.
• The receiver uses this index to pick the correct excitation signal for its synthesizer filter.
• Extremely complex, but can provide high quality even when the excitation is coded at only 0.25 bits per sample. Advances in DSP and VLSI technology have made real-time implementation of CELP codecs possible.

Example: The CDMA digital cellular standard (IS-95) uses variable-rate CELP at 1.2 to 14.4 kbps, and QCELP13 at 13.4 kbps.
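The code book search at the heart of CELP is an analysis-by-synthesis loop: each stored excitation is passed through the LPC synthesis filter and compared with the target frame. The sketch below is a simplified illustration under stated assumptions. It uses plain mean square error instead of a perceptually weighted error, ignores filter memory between frames, and the code book size (1024), frame length (40 samples), and filter coefficients are hypothetical values chosen so that the arithmetic reproduces 0.25 bits per sample.

```python
import math
import random

FRAME_LEN = 40          # samples per excitation vector (assumed)
CODEBOOK_SIZE = 1024    # number of stored excitation vectors (assumed)


def synthesize(excitation, lpc_coeffs):
    """All-pole synthesis filter: s(n) = e(n) + sum_k a_k * s(n - k)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


def make_codebook(size, frame_len, seed=0):
    """Zero-mean white Gaussian excitation vectors, shared by coder and decoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(frame_len)] for _ in range(size)]


def search_codebook(target, codebook, lpc_coeffs):
    """Return (index, gain, error) of the entry that best matches the target."""
    best = None
    for index, codeword in enumerate(codebook):
        synth = synthesize(codeword, lpc_coeffs)
        energy = sum(v * v for v in synth)
        gain = sum(t * v for t, v in zip(target, synth)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


lpc = [1.3, -0.6]  # hypothetical predictor coefficients
codebook = make_codebook(CODEBOOK_SIZE, FRAME_LEN)
# Build a target frame from a known entry so the search result can be checked.
target = [2.5 * v for v in synthesize(codebook[37], lpc)]
best_index, best_gain, best_error = search_codebook(target, codebook, lpc)
bits_per_sample = math.log2(CODEBOOK_SIZE) / FRAME_LEN
print(best_index, round(best_gain, 3), bits_per_sample)
```

Only the 10-bit index (plus a gain, which is not counted here) is transmitted for each 40-sample vector. The exhaustive search over every entry is what makes the coder so computationally demanding.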

8.7.4 Residual Excited LPC

• In this class of LPC coders, after the model parameters (LP coefficients or related parameters) and excitation parameters (voiced/unvoiced decision, pitch, gain) have been estimated from a speech frame, the speech is synthesized at the transmitter and subtracted from the original speech signal to form a residual signal.
• The residual signal is quantized, coded, and transmitted to the receiver along with the LPC model parameters.

• At the receiver, the residual error signal is added to the signal generated using the model parameters to synthesize an approximation of the original speech signal.

• The quality of the synthesized speech is improved by the addition of the residual error.

8.8 Choosing Speech Codecs for Mobile Communications

• Choosing the right speech codec is an important step in the design of a digital mobile communication system.
• A balance must be struck between the perceived quality of the speech resulting from compression and the overall system cost and capacity.
• Other criteria include:
  • The end-to-end encoding delay
  • The algorithmic complexity of the coder
  • The d.c. power requirements
  • Compatibility with existing standards
  • Robustness of the encoded speech to transmission errors. Different speech coders show varying degrees of immunity to transmission errors.
• The choice of the speech coder also depends on the cell size used.

Cordless telephone systems:
• When the cell size is sufficiently small, high spectral efficiency is achieved through frequency reuse, so a simple high-rate speech codec is enough.
• In CT2 and DECT, which use very small cells (microcells), 32 kbps ADPCM coders are used to achieve acceptable performance even without channel coding and equalization.

Cellular systems:
• Poorer channel conditions make error correction coding necessary,
• which requires the speech codecs to operate at lower bit rates.

Mobile satellite communications:
• Cell sizes are very large and the available bandwidth is very small.
• The speech rate must be of the order of 3 kbps, requiring the use of vocoder techniques.

• The type of multiple access technique used, being an important factor in determining the spectral efficiency of the system, strongly influences the choice of speech codec.
• The type of modulation employed also has considerable impact on the choice of speech codec.

8.9 The GSM Codec

The speech coder used in the pan-European digital cellular standard GSM was chosen after exhaustive subjective tests on various competing codecs.
• The name is rather grandiose: regular pulse excited long-term prediction (RPE-LTP) codec.
• Net bit rate: 13 kbps.
• RPE-LTP is a combination of two proposed codecs:
  • Baseband RELP codec, proposed by France.
    • Advantage: provides good-quality speech at low complexity.
    • Drawback: affected by channel errors.
  • Multi-pulse excited long-term prediction (MPE-LTP) codec, proposed by Germany.
    • Advantage: produces excellent speech quality and is not much affected by bit errors in the channel.
    • Drawback: high complexity.
• By modifying the RELP codec to incorporate certain features of the MPE-LTP codec, the net bit rate was reduced from 14.77 kbps to 13.0 kbps without loss of quality. The most important modification was the addition of a long-term prediction loop.

• The GSM codec is relatively complex and power hungry.

Abbreviations:
STP: short-term prediction
LTP: long-term prediction
RPE: regular pulse excitation

8.10 The USDC Codec (Skip)

8.11 Performance Evaluation of Speech Coders

There are two approaches to evaluating the performance of a speech coder in terms of its ability to preserve the signal quality.

(1) Objective measures
• Examples:
  • Mean square error (MSE) distortion
  • Frequency-weighted MSE
  • Segmented SNR
  • Articulation index, etc.
• These measures have the general nature of a signal-to-noise ratio and provide a quantitative value of how well the reconstructed speech approximates the original speech.
• They are useful in the initial design and simulation of coding systems.
• They do not necessarily give an indication of speech quality as perceived by the human ear, and it is the listener who is the ultimate judge of signal quality.
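As an illustration of one objective measure, the segmented SNR (often called segmental SNR) averages the per-frame SNR in dB instead of computing one ratio over the whole utterance. The sketch below assumes 160-sample frames (20 ms at 8 kHz) and skips frames with zero signal or zero noise power; both choices, and the toy signal, are illustrative.

```python
import math


def overall_snr_db(original, decoded):
    """Single SNR over the whole signal, in dB."""
    signal_power = sum(v * v for v in original)
    noise_power = sum((a - b) ** 2 for a, b in zip(original, decoded))
    return 10.0 * math.log10(signal_power / noise_power)


def segmental_snr_db(original, decoded, frame_len=160):
    """Average of per-frame SNR values in dB."""
    frame_snrs = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        ref = original[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        signal_power = sum(v * v for v in ref)
        noise_power = sum((a - b) ** 2 for a, b in zip(ref, dec))
        if signal_power == 0.0 or noise_power == 0.0:
            continue  # silent or error-free frame: SNR in dB is undefined
        frame_snrs.append(10.0 * math.log10(signal_power / noise_power))
    return sum(frame_snrs) / len(frame_snrs) if frame_snrs else float("nan")


# Toy signal: one loud frame followed by one quiet frame, same absolute error.
original = [1.0] * 160 + [0.1] * 160
decoded = [v + 0.01 for v in original]
seg_snr = segmental_snr_db(original, decoded)
full_snr = overall_snr_db(original, decoded)
print(round(seg_snr, 2), round(full_snr, 2))
```

The overall SNR is dominated by the loud frame, while the segmented measure gives the quiet frame equal weight, so it penalizes coders that handle low-level speech poorly.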

(2) Subjective listening tests
• The speech sample is played to a number of listeners, who are asked to judge the quality of the speech.
• Speech coders are highly speaker dependent, in that the quality varies with the age and gender of the speaker, the speed at which the speaker speaks, and other factors.
• Tests are carried out in different environments to simulate real-life conditions, such as noisy backgrounds and multiple speakers.
• Terms used to describe the results:
  • Overall quality
  • Listening effort
  • Intelligibility, which measures the listener's ability to identify the spoken word
  • Naturalness
• The results of such tests are difficult to rank and hence require a reference system.
• The most popular ranking system is the mean opinion score (MOS) ranking.
  • A five-point quality ranking scale.
  • Each point is associated with a standardized description: excellent (5), good (4), fair (3), poor (2), and bad or unsatisfactory (1).
• In general, the MOS rating of a speech codec decreases with decreasing bit rate.