<<

AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING

Pratik R. Bhatt B.E., C.U. Shah College of Engineering and Technology, India, 2006

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

ELECTRICAL AND ELECTRONIC ENGINEERING

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING

A Project

by

Pratik R. Bhatt

Approved by:

______, Committee Chair Jing Pang, Ph.D.

______, Second Reader Preetham Kumar, Ph.D.

______Date


Student: Pratik R. Bhatt

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to

be awarded for the project.

______, Graduate Coordinator Preetham Kumar, Ph.D.

______Date

Department of Electrical and Electronic Engineering


Abstract

of

AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING

by

Pratik R. Bhatt

An audio coder using perceptual linear predictive coding is a new technique for compressing and decompressing audio. It codes only the information in an audio signal that is audible to the human ear; all other information is discarded during coding. The audio signal is given to a psychoacoustic model, which generates the information that controls the linear prediction filter. At the encoder side, the signal is passed through a prediction filter; the filtered signal contains less information because it is coded according to perceptual criteria. At the decoder side, the encoded signal is decoded using a linear prediction filter that is the inverse of the filter used at the encoder side. The decoded signal is similar to the original signal. Because only the information relevant to the human ear is coded, the signal contains less data and has a high signal-to-noise ratio.


This project is about designing and implementing the MATLAB code for an audio

coder using perceptual linear predictive coding. The following tasks were performed in

this project.

(i) Process the audio signal according to a perceptual model.

(ii) Encode and decode the signal using the linear predictor.

(iii) Check simulation results for the original audio signal as well as the retrieved

audio signal after signal processing.

(iv) Listen to the wave file of the original as well as the reconstructed signal.

______, Committee Chair Jing Pang, Ph.D.

______Date


ACKNOWLEDGMENTS

I would like to thank my project adviser, Dr. Jing Pang, for giving me an opportunity to implement the Audio coder based on perceptual linear predictive coding. I have learned many concepts of signal processing during the implementation of this project. Dr. Pang provided great support and guidance throughout the development of this project. She has been a great listener and problem solver throughout the implementation.

Her valuable research experience has helped me to develop fundamental research skills.

I would also like to thank Dr. Preetham Kumar for his valuable guidance and support in writing this project report. In addition, I would like to thank Dr. Suresh

Vadhva, Chair of the Electrical and Electronic Engineering Department, for his support in completing the requirements for a Master’s degree at California State University,

Sacramento.


TABLE OF CONTENTS

Acknowledgments

List of Tables

List of Figures

Chapter

1. INTRODUCTION
1.1 Background
1.2 Types of Audio Coding
1.3 Concept
1.4 Overview of Design

2. PSYCHOACOUSTIC MODEL
2.1 Threshold of Hearing
2.2 Critical Bands
2.3 Simultaneous Masking
2.3.1 Noise Masking Tone
2.3.2 Tone Masking Noise
2.4 Non-simultaneous Masking
2.5 Calculate Overall Masking Threshold

3. LINEAR PREDICTIVE CODING
3.1 Linear Prediction
3.2 Linear FIR Predictor as Pre-filter
3.3 Linear IIR Predictor as Post-filter
3.4 Calculation of LPC Filter Coefficients

4. MATLAB IMPLEMENTATION

5. SIMULATION RESULTS

6. CONCLUSION
6.1 Future Improvements

Appendix

References


LIST OF TABLES

1. Table 1 List of critical bands

LIST OF FIGURES

1. Figure 1 Audio coder using perceptual linear predictive coding
2. Figure 2 Threshold of hearing for average human
3. Figure 3 Frequency to place transformation along the basilar membrane
4. Figure 4 Narrow band noise sources masked by two tones
5. Figure 5 Decrease in threshold after critical bandwidth
6. Figure 6 Noise masking tone
7. Figure 7 Tone masking noise
8. Figure 8 Non-simultaneous masking properties of human ear
9. Figure 9 Steps to calculate masking threshold of signal
10. Figure 10 FIR linear predictor structure
11. Figure 11 IIR linear predictor structure
12. Figure 12 Determination of linear predictor coefficients
13. Figure 13 Input signal
14. Figure 14 Pre-filtered signal
15. Figure 15 Reconstructed signal from post-filter
16. Figure 16 Spectrogram for masking threshold of input signal
17. Figure 17 Spectrogram for estimated masking threshold
18. Figure 18 Masking threshold and LPC response for frame of signal

Chapter 1

INTRODUCTION

1.1 Background

In our day-to-day lives, we use various kinds of audio equipment, such as cell phones, music players, and wireless radio receivers. These devices process audio in different ways. The basic form of an audio signal is analog, but it is easier to process a signal in digital form than in analog form, and audio processing has become significantly more digitized in recent years. The process of converting an analog signal into a digital representation and encoding it is described as audio coding. The general procedure is as follows: an analog signal is sampled at regular intervals and the sampled data is encoded with a coding scheme; at the decoder side, the data is converted back to its original form. As the sampling rate of the audio signal increases, the bandwidth required by the signal increases and so does the audio quality. The main goal of audio coding is therefore to minimize the data rate while achieving a high signal-to-noise ratio, reduced transmission bandwidth, and high audio quality.

1.2 Types of Audio Coding

There are various types of coding schemes available for audio coding. Each coding scheme has its benefits and drawbacks depending on the technique used to process the audio signal. For any coding technique, the main concerns are the quality of the signal and the achievable bit rate, which determines the required transmission bandwidth. Based on these criteria, different coding techniques are used to code the audio signal. One of


the coding techniques is waveform coding, which codes the signal waveform itself and produces a reconstructed signal that is as close as possible to the original signal.

Pulse coded modulation (PCM), Differential pulse coded modulation (DPCM), and

Adaptive differential pulse coded modulation (ADPCM) are known as waveform coders.

The source coding technique, or parametric coding, uses a model of the signal source to code

the signal; this source model is represented by a mathematical representation to reduce

signal redundancies. Linear predictive coding is an example of source coding. In linear

predictive coding, signal parameters are coded as well as transmitted and these

parameters are used to produce a signal at the decoder side. In this technique, the

reconstructed signal does not have the same quality as the original signal. Sub-band coding and transform coding are examples of coders that work in the frequency domain [1].

Transform-based coding uses a transform length to analyze and process the audio

signal; therefore, the resolution of the signal is dependent on the transform length. If the

transform length is long, it provides good resolution for audio signals such as music because they change very little over time. If the transform length is short, it provides good resolution for speech signals because they change abruptly within a short time. Therefore, the required transform length depends on the signal type to be coded [2]. Another

concept is to code the signal according to perceptual criteria. In this method, information

or the sound of a signal that cannot be heard by the human ear is removed and

quantization is performed according to a perceptual model. This allows quantization


noise to be kept below the masking threshold of the signal such that it cannot be

perceived by the human ear; then, appropriate transform coding is used to achieve

compression by removing redundant information from the signal. The purpose of an

audio coder using perceptual linear predictive coding is to use a linear predictive filter at

the encoder side as well as at the decoder side. A psychoacoustic model is used to control

the linear predictor filter according to a perceptual criterion of the human auditory

system. The linear predictive filter at the encoder side codes the signal based on a

psychoacoustic model and the linear predictive filter at the decoder side decodes the

signal to reconstruct the original signal.

1.3 Concept

The basic idea of an audio coder based on perceptual linear predictive coding is to

combine a linear predictive filter with a psychoacoustic model. The combination of a

psychoacoustic model and linear predictive coding is used to reduce irrelevant

information of the input signal. The linear predictor at the encoder side is known as the pre-filter and the linear predictor at the decoder side is known as post-filter. The pre- and

post-filter are controlled by a psychoacoustic model which shapes the quantization noise

such that it stays below the masking threshold of the signal and it becomes inaudible to

the human ear. This technique provides good resolution for speech as well as audio

signals and the resolution of signals is independent of the coding technique used [2].


1.4 Overview of Design

The project design consists of a pre-filter and post-filter with a psychoacoustic model. Figure 1 shows a basic block diagram of an audio coder using perceptual linear predictive coding. The psychoacoustic model takes the input audio signal and generates a masking threshold value for each frame of signal. It utilizes properties of the human auditory system to generate the masking threshold of the input audio signal.

[Figure: block diagram — the input audio signal passes through the pre-filter to give the pre-filtered signal, which passes through the post-filter to give the post-filtered signal; a psychoacoustic model controls both filters]

Figure 1. Audio coder using perceptual linear predictive coding [2]

The pre- and post-filters are based on the linear predictive filter structure and are

adapted by the masking threshold generated by a psychoacoustic model. The masking

threshold of an audio signal is used to estimate the linear predictive filter coefficients,

and these estimated coefficients change with time making a linear predictive filter

adaptive.

At the transmitter side, a psychoacoustic pre-filter generates a time varying

magnitude response according to the masking threshold of the input signal, and this is

achieved by time varying filter coefficients from a masking threshold for that particular


frame of signal. A linear predictor as a pre-filter takes the input audio signal and

processes it for each frame. Adaptation of the pre-filter using a psychoacoustic model

removes any inaudible components from the input signal and reduces information that is

irrelevant to the human ear. This pre-filtered audio signal is now coded, as it contains less

information than the original signal. During pre-filtering, the signal is processed or coded according to perceptual criteria.

At the receiver side, a linear predictor works as a post-filter. It takes an encoded

signal to reconstruct the original signal according to a perceptual model. This

psychoacoustic post-filter also shapes noise according to the masking threshold level of a

signal. Filter coefficients are estimated from the masking threshold for each frame of

signal. Therefore, the post-filter provides a time varying magnitude response for each

frame of signal. The post-filter has a magnitude response inverse to that of the pre-filter used at the

encoder side, and it predicts the output using the same predictor coefficients used by the

pre-filter. In this design, the predictor coefficients are not transmitted but are stored in a file for each frame of signal such that a post-filter has access to these coefficients to reconstruct the signal.


Chapter 2

PSYCHOACOUSTIC MODEL

The psychoacoustic model is based on the characteristics of the human ear. The

human ear does not hear all frequencies in the same manner. The human ear typically

hears sounds in the 20 Hz to 20 kHz range. Sounds outside of this range are not perceived by the human ear. The psychoacoustic model is used to eliminate information which is not audible to the average human ear. This basic property is used in audio coding to reduce the amount of information, or data, in the audio signal. When coding audio based on a psychoacoustic model, signal components within the range of the average human ear are coded and the others are discarded, since they are not perceived. By coding this way, fewer bits are required to code an audio signal and the efficiency of the coding scheme is improved.

2.1 Threshold of Hearing

In the frequency range of 20 Hz to 20 kHz, a sound must have a minimum sound pressure level (SPL) to be heard by the human ear. This minimum sound pressure level is known as the threshold of hearing for the average human. If a sound has a pressure level below the absolute threshold of hearing, then it will not be heard even if it is in the frequency range of the human ear. The equation for the threshold of hearing is as follows

[3]:

Th(f) = 3.64 (f/1000)^−0.8 − 6.5 e^(−0.6 (f/1000 − 3.3)^2) + 10^−3 (f/1000)^4   (dB SPL)   (1)


For example, a signal with a frequency of 1 kHz is going to be heard only when its

pressure level satisfies the above equation (1). This level is defined as the threshold level

of hearing for this particular frequency. Each frequency has its own threshold level at which it begins to affect the human ear. Figure 2 shows the value of the absolute threshold of hearing for different values of frequency. It also shows that the

level for the threshold of hearing is more sensitive in the middle range of the human

auditory system. While coding the signal, if the signal frequency has a pressure level less

than the threshold of hearing, then that component can be removed since it cannot be

heard by the average human [4].
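As an illustration, the curve in Figure 2 can be reproduced directly from equation (1); a minimal MATLAB sketch (not part of the project code):

f = 20:10:20000; % frequency axis in Hz
Th = 3.64*(f/1000).^(-0.8) - 6.5*exp(-0.6*(f/1000 - 3.3).^2) ...
     + 1e-3*(f/1000).^4; % threshold of hearing in dB SPL
semilogx (f, Th); grid on;
xlabel ('Frequency (Hz)'); ylabel ('Sound Pressure Level (dB)');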

[Figure: sound pressure level (dB) versus frequency (Hz)]

Figure 2. Threshold of hearing for average human [4]


2.2 Critical Bands

The basic idea of critical bands is to analyze the human auditory system. As signals vary over time, the threshold of hearing also changes with time. By analyzing the human auditory system, determination of the actual threshold becomes possible. Figure 3 shows how the frequency to place transformation occurs along the basilar membrane of the human inner ear. Inside the inner ear (cochlea), the wave travels along the length of basilar membrane and it reaches the maximum amplitude at a certain point of the basilar membrane because each point of basilar membrane is associated with a particular frequency [3].

[Figure: basilar membrane displacement versus distance from the oval window (mm)]

Figure 3. Frequency to place transformation along the basilar membrane [4]

To imitate this behavior of the human ear (cochlea) for audio spectrum analysis, band-pass filters are used; they are also known as cochlear filters. The bandwidth of each filter is known as the critical bandwidth. This bandwidth is smaller at low frequencies and increases at higher frequencies. It is a non-linear function of frequency and can be described as follows [4].


Critical Bandwidth = 25 + 75 [1 + 1.4 (f/1000)^2]^0.69   (Hz)   (2)

[Figure: sound pressure level (dB) versus frequency]

Figure 4. Narrow band noise sources masked by two tones [3]

[Figure: audibility threshold versus tone separation Δf relative to the critical bandwidth Δf_cb]

Figure 5. Decrease in threshold after critical bandwidth [3]

One way to see the effect on the threshold in critical bandwidth is shown in

Figure 4. It shows narrowband noise located between two masking tones separated within the critical band. Figure 5 shows that within the critical bandwidth the threshold remains constant, independent of the location of the masking tones, but it starts to decrease as the separation between the tones increases beyond the critical bandwidth. To perform spectrum analysis, the human auditory range is divided into 25 critical bands as shown in Table 1, and each critical band distance is known as one “bark.” It is calculated from a given frequency by the following equation [4].

z(f) = 13 arctan (0.00076 f) + 3.5 arctan [(f/7500)^2]   (bark)   (3)

Band No.   Center Frequency (Hz)   Bandwidth (Hz)
1          50                      0 – 100
2          150                     100 – 200
3          250                     200 – 300
4          350                     300 – 400
5          450                     400 – 510
6          570                     510 – 630
7          700                     630 – 770
8          840                     770 – 920
9          1000                    920 – 1080
10         1175                    1080 – 1270
11         1370                    1270 – 1480
12         1600                    1480 – 1720
13         1850                    1720 – 2000
14         2150                    2000 – 2320
15         2500                    2320 – 2700
16         2900                    2700 – 3150
17         3400                    3150 – 3700
18         4000                    3700 – 4400
19         4800                    4400 – 5300
20         5800                    5300 – 6400
21         7000                    6400 – 7700
22         8500                    7700 – 9500
23         10500                   9500 – 12000
24         13500                   12000 – 15500
25         19500                   15500 –

Table 1. List of critical bands [4]

Table 1 lists the 25 critical bands with their center frequencies and bandwidths; the bandwidth grows non-linearly with frequency.
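The encoder code in the appendix (encode.m) calls a helper hz2bark that is not listed there; a minimal sketch, assuming it implements equation (3):

function b = hz2bark (f)
% converts frequency f in Hz to the bark scale of equation (3)
b = 13*atan (0.00076*f) + 3.5*atan ((f/7500).^2);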

2.3 Simultaneous Masking

In an audio signal, there is more than one sound or signal present at a particular

time or frequency. When the presence of a strong signal makes a weaker signal inaudible

to the human ear, the event is known as a masking effect. This effect increases the

threshold value of hearing above the threshold value in quiet. The stronger signal is known as the masker, while the weaker signal is known as the masked signal. When two signals are close in

frequency, or within a critical bandwidth, and are perceived by the human ear at the same

time, then this type of masking is known as simultaneous masking. This type of masking

describes most of the masking effects in an audio signal. In the frequency domain, a

magnitude spectrum determines which components will be detected and which


components will be masked. The theory of simultaneous masking is explained by the fact

that the basilar membrane is excited by a strong signal much more than by a weaker

signal; therefore, a weaker signal is not detected [3]. In an audio signal, simultaneous

masking is described mainly by Noise masking tone and Tone masking noise scenarios.

2.3.1 Noise Masking Tone

[Figure: SPL (dB) versus frequency (Hz); noise masker of one critical bandwidth centered at 410 Hz]

Figure 6. Noise masking tone [4]

Figure 6 shows a noise masker of one critical bandwidth centered at 410 Hz with a sound pressure level of 80 dB, masking a tone that is also centered at 410 Hz with a sound pressure level of 76 dB. The masked tone threshold level depends on the center frequency and the sound pressure level (SPL) of the masker noise. When the tone frequency equals the center frequency of the masking noise, the minimum signal-to-mask ratio (SMR) is reached, which gives the threshold value for noise masking a tone within a critical band. The signal-to-mask ratio is defined as the level difference between the noise masker and the masked tone.


The scenario represented in Figure 6 shows the minimum signal-to-mask ratio, with a value of 4 dB (80 dB − 76 dB). The signal-to-mask ratio increases and the masking power decreases as the tone frequency moves above or below the center frequency [4].

2.3.2 Tone Masking Noise

[Figure: SPL (dB) versus frequency (Hz); tone masker of one critical bandwidth centered at 1000 Hz]

Figure 7. Tone masking noise [4]

Figure 7 shows a tone masker with a center frequency of 1000 Hz within a critical bandwidth and a sound pressure level of 80 dB. This tone signal masks a noise centered at 1000 Hz with a sound pressure level of 56 dB. The masked noise threshold level depends on the center frequency and sound pressure level of the masker tone. When the center frequency of the masked noise is close to the frequency of the masker tone, the minimum signal-to-mask ratio (SMR) is reached, which gives the threshold value for noise masked by a tone. In this case, a minimum SMR of 24 dB (80 dB − 56 dB) is obtained for the threshold value of the noise. The SMR value increases and the masking power decreases as the noise moves above or below the center frequency of the masking tone. Here the minimum SMR is much higher than in the noise-masking-tone case [4].

2.4 Non-simultaneous Masking

Simultaneous masking takes place when a masker and masked signal are perceived at the same time. However, a masking effect also takes place before the masker appears as well as after the masker is removed. This type of masking is known as non- simultaneous or temporal masking [3][4].

[Figure: increase in audibility threshold (dB) versus time after masker appearance (ms) and time after masker removal (ms)]

Figure 8. Non-simultaneous masking properties of human ear [3]

Pre-masking is the effect observed before the masker appears, and post-masking is the effect observed after the masker is removed. Figure 8 shows the pre- and post-masking regions, and it also shows how the threshold of hearing changes within these regions under non-simultaneous masking. Pre-masking extends roughly 1-2 ms before the masker appears, and post-masking can last in the range of 100-300 ms depending on


intensity as well as the duration of the masker. This masking effect raises the threshold of hearing for the masked signal above the threshold of hearing in quiet [3].

2.5 Calculate Overall Masking Threshold [4]

For this project, a basic psychoacoustic model of MPEG layer-1 is used to generate the masking threshold of input signal. These steps are as follows.

Step 1: Get the power spectrum of signal.

In this step, an input signal is first normalized such that its sound pressure level stays at the appropriate threshold of hearing for the average human. After normalization,

the Fast Fourier Transform (FFT) is applied to frames of the input signal, and the power spectrum of the input signal is obtained for each frame.

Step 2: Determination of tone and noise maskers.

For a given power spectrum of the signal, a component that is a local maximum and exceeds its neighboring components by at least 7 dB is considered a tone masker; each detected tone masker is then added to generate a new power spectrum of maskers. A given frequency bin i is tonal if its power satisfies both of the conditions below:

(i) P[i] > P[i−1] and P[i] > P[i+1]

(ii) P[i] > P[i+k] + 7 dB and P[i] > P[i−k] + 7 dB for every k in the neighborhood Nk

where the neighborhood Nk is defined as:

[i−2 .. i+2] for 0.17 kHz < f < 5.5 kHz
[i−3 .. i+3] for 5.5 kHz ≤ f < 11 kHz
[i−6 .. i+6] for 11 kHz ≤ f < 20 kHz


Noise maskers are calculated after determining tone maskers for a given spectrum of signal.
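A minimal MATLAB sketch of the tonal test above (a hypothetical helper, not part of the appendix code; P is the power spectrum in dB, i an FFT bin index assumed to lie far enough from the spectrum edges, and the 7 dB test applied to the neighbors beyond the immediate ones):

function tonal = isToneMasker (P, i, fs, Nfft)
% bin frequency in Hz
f = (i-1)*fs/Nfft;
% half-width of the neighborhood Nk
if f < 5500
    d = 2;
elseif f < 11000
    d = 3;
else
    d = 6;
end
% condition (i): local maximum
tonal = P(i) > P(i-1) && P(i) > P(i+1);
% condition (ii): at least 7 dB above the remaining neighbors
for k = 2:d
    tonal = tonal && P(i) > P(i-k) + 7 && P(i) > P(i+k) + 7;
end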

[Figure: flowchart — input signal → get the power spectrum of signal → determine tone maskers and noise maskers → reduction and determination of maskers → determine individual masking threshold → get the overall masking threshold]

Figure 9. Steps to calculate masking threshold of signal [4]

Step 3: Reduce and find maskers in critical bandwidth.

In the masker power spectrum, if the masker is below the threshold of hearing, then it is discarded since it is not heard by the average human. After that, if there is more


than one masker in a critical bandwidth for a given power spectrum, then the value of the

strongest masker is kept while the value of the weaker one is discarded.

Step 4: Determine individual masking threshold.

After the tone and noise maskers are determined, the individual masking thresholds for noise and tone are calculated using a spreading function. The spreading function gives the masking level contributed at neighboring frequencies, showing how masking spreads into nearby critical bands. It

depends on masker location j, masked signal location i, power spectrum PTM for the tone

masker and the difference between the masker and masked signal location in barks. It is described as follows.

Spreading function (i,j) = 17Δ − 0.4 PTM(j) + 11,   for −3 ≤ Δ < −1   (4)
= [0.4 PTM(j) + 6] Δ,   for −1 ≤ Δ < 0   (5)
= −17Δ,   for 0 ≤ Δ < 1   (6)
= [0.15 PTM(j) − 17] Δ − 0.15 PTM(j),   for 1 ≤ Δ < 8   (7)

where Δ = z(i) − z(j) is the bark distance between the masked signal and the masker.
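A direct MATLAB transcription of equations (4)-(7) (a hypothetical helper for illustration; dz = z(i) − z(j) in barks and Ptm is the tone masker power in dB):

function sf = spread (dz, Ptm)
% spreading function value in dB for a masker-to-maskee distance dz
if dz >= -3 && dz < -1
    sf = 17*dz - 0.4*Ptm + 11;
elseif dz >= -1 && dz < 0
    sf = (0.4*Ptm + 6)*dz;
elseif dz >= 0 && dz < 1
    sf = -17*dz;
elseif dz >= 1 && dz < 8
    sf = (0.15*Ptm - 17)*dz - 0.15*Ptm;
else
    sf = -Inf; % no masking contribution outside the modeled range
end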

Step 5: Find overall masking threshold.

To determine the overall masking threshold, the contributions of all individual tone and noise maskers at each frequency component are added together.


Chapter 3

LINEAR PREDICTIVE CODING

3.1 Linear Prediction

A speech signal is continuous and varies with time; therefore, it becomes difficult

to analyze and more complex to code. After coding, the transmission of coded signal

requires more bandwidth and a higher bit rate as speech contains more information. At

the receiver side, it also becomes much more complex to reconstruct the original signal

based on the coded speech signal. For this reason, it becomes necessary to code a signal such that it requires a lower bit rate while the reconstructed signal still resembles the original signal.

The basic idea behind linear prediction is that neighboring samples of a signal are highly correlated: the values of consecutive samples are approximately the same and the difference between them is small, so it becomes easy to predict a sample from a linear combination of previous samples. For example, let x be a continuous time-varying signal whose value at time t is x(t). When this signal is converted into discrete time-domain samples, the value of the nth sample is x(n). To determine the value of x(n) using linear prediction, the values of past samples x(n−1), x(n−2), x(n−3), …, x(n−p) are used, where p is the predictor order of the filter [8]. This coding

based on the values of previous samples is known as linear predictive coding. This

coding technique is used to code speech signals. The normal speech signal is sampled at

8000 samples/sec and 8 bits are used to code each sample. This gives a bit rate of around


64,000 bits/sec. By using linear predictive coding, a bit rate of 2400 bits/sec is achieved.

In a linear prediction system or coding scheme, prediction error e(n) is transmitted rather

than a sampled signal x(n) because the error is relatively smaller than a sample of signal;

therefore, it requires fewer bits to code [5]. At the receiver side, this error signal is used

to generate and reconstruct the original signal using previous output samples.

3.2 Linear FIR Predictor as Pre-filter

The FIR linear predictor is a finite impulse response prediction filter. This structure is used to build the psychoacoustic pre-filter at the encoder side.

The pre-filter takes an input sample from a current frame of signal and the linear

predictor structure determines the predicted sample using previous samples of the input signal [6].

[Figure: tapped delay line with delays z^−1, coefficients a1 … ap, input x(n), predicted sample x̂(n), and prediction error e(n)]

Figure 10. FIR linear predictor structure [6]


Figure 10 shows the basic FIR linear predictor structure. The predicted signal is

generated by multiplying predictor coefficients with corresponding previous values of a

signal and then adding these values together. Here a1, a2, a3……ap are known as linear

prediction coefficients. They are explained in the following section. The value of p is

known as the predictor order, and its value determines how closely the signal can be estimated.

The predicted signal x̂(n) is a function of these LPC coefficients and previous input samples, and it is represented by the following equation.

x̂(n) = Σk=1..p ak x(n − k)   (8)

Prediction error = original signal − predicted signal   (9)

e(n) = x(n) − x̂(n)   (10)

Substituting equation (8),

e(n) = x(n) − Σk=1..p ak x(n − k)   (11)

Taking the z-transform to convert the above equation into the frequency domain,

E(z) = X(z) − Σk=1..p ak X(z) z^−k   (12)

E(z) = X(z) [1 − Σk=1..p ak z^−k]   (13)

E(z) = X(z) H(z)   (14)

where H(z) = 1 − Σk=1..p ak z^−k is the transfer function of the linear predictor used as a pre-filter. Here the error signal E(z) is the product of the input signal X(z) and the transfer function of the FIR predictor filter H(z). From the transfer function H(z), it can be seen that this is an all-zero filter. It is also called the analysis filter in linear predictive coding.


3.3 Linear IIR Predictor as Post-filter

The IIR linear predictor is an infinite impulse response prediction filter. This structure is used for the psychoacoustic post-filter at the decoder side, and its filter response is the inverse of that of the pre-filter. The linear IIR predictor takes the error signal generated at the encoder side and generates predicted values from previous output samples. To reconstruct the original signal, the predicted values are added to the values of the error signal [6].

[Figure: feedback structure with delays z^−1 and coefficients a1 … ap; the error e(n) plus the predicted value ŷ(n) gives the output y(n)]

Figure 11. IIR linear predictor structure [6]

The above figure shows the basic structure of the IIR linear predictor. In this structure, previous output samples are used to predict the current output sample, so the reconstructed signal is a function of the error signal and the predicted output samples.

y(n) = e(n) + Σk=1..p ak y(n − k)   (15)

Taking the z-transform of the above equation,

Y(z) = E(z) + Σk=1..p ak Y(z) z^−k   (16)

Y(z) [1 − Σk=1..p ak z^−k] = E(z)   (17)

Y(z)/E(z) = 1/(1 − Σk=1..p ak z^−k) = 1/H(z)   (18)

The above equation gives the transfer function of the linear predictor at the decoder side. The post-filter transfer function is the inverse of the pre-filter transfer function, and it represents an all-pole filter. It is also called the synthesis filter in linear predictive coding, since it is used to reproduce the signal at the decoder side.
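Again for a fixed coefficient set, the whole-frame equivalent of the post-filter is the all-pole inverse of the pre-filter, which exactly undoes the operation shown in the sketch of Section 3.2:

% a is the same coefficient vector used by the pre-filter
y = filter (1, a, e); % reconstructs x(n) from the error signal e(n)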

3.4 Calculation of LPC Filter Coefficients

In a normal linear predictor, coefficients are obtained from a signal for each frame, and they are calculated using linear equations to get a minimum prediction error.

These coefficients make the linear predictor response vary with time according to the signal. The calculated coefficients are transmitted with the coded speech so they can be used to reconstruct the signal at the receiver side [9].

Figure 12 shows the method to determine linear predictor coefficients from the masking threshold of each frame of the signal. The psychoacoustic pre- and post-filters are linear predictors, but they are adapted by the masking threshold generated by the psychoacoustic model. The masking threshold is estimated for each frame of the signal, and the filter coefficients are determined from the masking threshold spectrum rather than from the signal spectrum as in a normal linear predictor [6].


[Figure: flowchart — audio signal → psychoacoustic model → masking threshold → IFFT to get auto-correlation values → Levinson's recursion → linear predictor coefficients]

Figure 12. Determination of linear predictor coefficients [6]

In a conventional linear predictor, the coefficients are extracted from the signal spectrum through the auto-correlation values of the signal. Here, the auto-correlation values are instead obtained from the masking threshold of the signal, so that the predictor coefficients approximate the masking threshold rather than the signal itself.

According to [8], linear predictor coefficients are obtained such that the predicted sample value is close to the original sample; therefore, prediction error should be small.


The overall error is defined as the sum of the squared errors over the samples, which can be expressed by the equation below.

E = Σn e²(n)   (19)

From equation (11),

e(n) = x(n) − Σk=1..p ak x(n − k)   (20)

E = Σn [x(n) − Σk=1..p ak x(n − k)]²   (21)

To get the filter coefficients, the derivative of E with respect to each coefficient ak, k = 1, 2, …, p, is taken and set to zero:

∂E/∂a1 = 0, ∂E/∂a2 = 0, …, ∂E/∂ap = 0

The above linear equations can be represented in matrix form as

[ R(0)     R(1)     …   R(p−1) ] [ a1 ]   [ R(1) ]
[ R(1)     R(0)     …   R(p−2) ] [ a2 ] = [ R(2) ]   (22)
[  ⋮        ⋮             ⋮    ] [ ⋮  ]   [  ⋮   ]
[ R(p−1)   R(p−2)   …   R(0)   ] [ ap ]   [ R(p) ]

where R(k) = Σn x(n) x(n − k) is the auto-correlation function of the sequence x(n). For this project, the predictor order p = 10 is taken, so the above equation becomes a 10×10 system of the same form (23). The above equation can be represented compactly by

R a = r   (24)

where R is the p×p Toeplitz auto-correlation matrix, a = [a1 … ap]^T, and r = [R(1) … R(p)]^T.

According to [7][9], the optimum predictor coefficients are obtained by solving the above system with Levinson's recursion (the Levinson-Durbin algorithm), because it requires less computation than other methods. From equation (11), the sum of the squared errors is

E = Σn e²(n) = Σn [x(n) − Σk=1..p ak x(n − k)]²   (25)

For a predictor of order p, the minimum error can be written as

Ep = R(0) − a1 R(1) − a2 R(2) − … − ap R(p)   (26)

Ep = R(0) − Σk=1..p ak R(k)   (27)

To compute the coefficients of order P, the following steps are performed iteratively for p = 1, 2, …, P:

(i) Bp = R(p) − Σi=1..p−1 ai^(p−1) R(p − i)

(ii) ap^(p) = kp = Bp / Ep−1

(iii) ai^(p) = ai^(p−1) − kp a(p−i)^(p−1), for i = 1, 2, …, p−1

(iv) Ep = Ep−1 (1 − kp²)

Therefore, the final solution for the LPC coefficients is given by ai = ai^(P) for i = 1, 2, …, P. For each analysis frame, the psychoacoustic model calculates the masking threshold of the audio signal. According to [6], the masking threshold is expressed in the frequency domain and is treated as a power spectral density function of the signal; the auto-correlation function can then be obtained as the inverse Fourier transform of this power spectral density.
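The recursion above is implemented in levi.m in the appendix. Where the Signal Processing Toolbox is available, it can be cross-checked against MATLAB's built-in levinson routine, which solves the same Toeplitz system; a minimal sketch with made-up values:

r = [1.0 0.5 0.2 0.1]; % example auto-correlation values R(0)..R(3)
a1 = levi (r, 3); % coefficients from the appendix code
a2 = levinson (r, 3); % built-in Levinson-Durbin recursion; a1 and a2 should agree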

To convert the masking threshold from the frequency domain to the time domain, the Inverse Fast Fourier Transform (IFFT) of the masking threshold power spectrum is performed, using the built-in MATLAB function ifft. By the Wiener-Khinchin theorem, the inverse Fourier transform of a power spectral density gives the auto-correlation function. When taking the inverse FFT of the masking threshold, only the real part of the result is kept, because only real auto-correlation values can approximate the predictor coefficients. After the auto-correlation values for the masking threshold are obtained, Levinson's recursion is used to determine the linear predictive coefficients.
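This chain is what getMTest.m in the appendix implements; stripped to its essentials (assuming th holds the masking threshold in dB for the positive frequencies and p is the predictor order):

P = 10.^(th/10); % dB threshold back to a power spectral density
acf = real (ifft ([P fliplr(P)])); % mirror to cover the negative frequencies
a = levi (acf (1:p+1), p); % linear predictor coefficients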


Chapter 4

MATLAB IMPLEMENTATION

The audio coder using a perceptual linear predictive filter is implemented in MATLAB.

For this project, the perceptual model is the basic MPEG layer-1 psychoacoustic model, taken from Rice University. It is used to determine the overall masking threshold of the audio signal. The psychoacoustic model takes the current portion of the signal and performs a Fast Fourier Transform (FFT), converting the signal from the time domain to the frequency domain. To determine the masking effect, tone maskers, noise maskers, and the overall maskers within each critical band are calculated.

After considering the overall effects of maskers, the masking threshold is obtained for each window frame of input signal.

In the encoder, the signal spectrum is computed over a 2048-point Fast Fourier Transform (FFT). The FFT provides the masking threshold from the psychoacoustic model for each frame. This FFT length gives a large number of samples for the masking threshold; therefore, a close approximation of the predictor coefficients is possible for each frame of the signal. The linear predictor codes the audio signal using these predictor coefficients. At the encoder side, the following steps are performed to code the signal.

• A window size of 25ms is used in the encoder as well as in the decoder because it

provides enough resolution to perform analysis for speech and audio signal.


• A process window size of 5ms is used in the encoder as well as in the decoder to

update predictor coefficients within each window frame.

• First the input audio signal is read from WAV file (Waveform Audio Format) to

process using perceptual model.

• Initially, for the first window size, the masking threshold is calculated and it is

used to estimate predictor coefficients for that signal frame. The linear prediction

filter processes the portion of the signal using predictor coefficients.

• Once enough data is available to process, the current window is shifted by a

window size of 25ms.

• The masking threshold is calculated for the current window size of a signal.

• The calculated masking threshold is used to get linear predictor coefficients for

the current portion of signal.

• The estimated coefficients are stored in a matrix for each processing window of

5ms for the current window size.

• The pre-filter takes a current sample, predictor coefficients, and the previous

values stored in a buffer to compute filter output.

• At every process window size, predictor coefficients are updated and they are

stored in a matrix.

• The above steps are repeated for the length of the input signal.

• For each frame of signal, estimated predictor coefficients are stored in a MAT

(Matlab) file so the predictive filter at the decoder side has access to it.


• Once the pre-filter output is computed, it is stored in a WAV file. (A simplified skeleton of the encoder loop is sketched below.)
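A simplified skeleton of the encoder loop follows (see encode.m in the appendix for the full version, including the initialization of x, f, p, and the delay buffer buff; fs = 16000 is assumed for the sample counts):

ps = round (fs*0.005); % 5 ms process step (80 samples)
ws = round (fs*0.025); % 25 ms analysis window (400 samples)
for n = ws:ps:length (x)-ps
    frame = x (n-ws+1:n); % current 25 ms analysis window
    th = findThreshold (frame, 2048, f); % masking threshold from the model
    a = getMTest (th, p, f); % predictor coefficients from the threshold
    for m = n+1:n+ps % pre-filter the next 5 ms of samples
        [y(m), buff] = prefilter (x(m), buff, a);
    end
end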

In the predictor coefficient estimation, the masking threshold spectrum is represented in the frequency domain. It is defined for positive as well as negative frequencies; to obtain a real-valued estimate of the coefficients, the negative-frequency half is reconstructed by mirroring the masking threshold spectrum. The Inverse Fast Fourier Transform

(IFFT) is performed on the power spectrum of the masking threshold to convert from

frequency domain to time domain. While converting masking threshold spectrum to time

domain values, only the real part is used to approximate values. These values are

represented as auto-correlation values for the masking threshold of signal. In this project,

Levinson's recursion is implemented in MATLAB. It is used to estimate the linear predictive

coefficients from auto-correlation values of the masking threshold. These estimated linear

predictive coefficients generate a predictor response according to the masking threshold

of signal.

At the decoder side, the linear predictive filter is used to decode or reconstruct the

signal. In decoder implementation, the following steps are performed to decode the

signal.

• Processed or coded audio signal is read from a WAV (Waveform Audio Format)

file.

• Predictor coefficients from MAT file (Matlab file) are stored into the matrix.

These coefficients correspond to the masking threshold of input audio signal and

it is generated by the psychoacoustic model.


• The initial window of a signal is processed. Initial predictor coefficients are taken

from the matrix to generate a post-filtered signal and initial data for the next

window of signal.

• After initial window processing, the current window is shifted by window size

(25 ms) to process signal.

• Within each current window, predictor coefficients are extracted from a matrix for

every 5ms and it is updated to the post-filter.

• For each window size, the post-filter takes a current sample, predictor

coefficients, and previous samples to determine output sample.

• After processing the signal, post-filter output is computed over the length of input

signal.

• The reconstructed original signal is stored in the audio (wav) file.

After performing the post-filter operation, the output audio signal is played back and compared with the original audio signal. The output audio quality is the same as that of the original signal.


Chapter 5

SIMULATION RESULTS

To verify the project design, the input signal is taken from a WAV (Waveform Audio Format) file recorded in a quiet room with no noise effects.

[Figure: amplitude (V) versus time (s), 0 to 3 s]

Figure 13. Input signal

Figure 13 shows the input signal at the encoder side. The signal is sampled at 16 kHz and is taken from the ARCTIC database of the CMU (Carnegie Mellon University) speech group. The audio signal has audible quality.


[Figure: amplitude (V) versus time (s), 0 to 3 s]

Figure 14. Pre-filtered signal

Figure 14 shows the output signal from the pre-filter. It contains less information than the original signal since this information is removed based on the perceptual model.

The masking threshold of the input signal is extracted by the psychoacoustic model, and the linear predictor coefficients estimated from it shape the pre-filter frequency response. The pre-filtered signal is thus coded according to perceptual criteria.


[Figure: amplitude (V) versus time (s), 0 to 3 s]

Figure 15. Reconstructed signal from post-filter

Figure 15 shows a reconstructed signal from a post-filter, and it is similar to the input signal. A post-filter generates a frequency response inverse to the pre-filter to decode the signal. The post-filter uses the same filter coefficients used by the pre-filter for that particular frame of signal. Predictor coefficients are stored in a MAT file so the post-filter has access to it to generate a frequency response according to perceptual criteria to reconstruct the signal.


[Figure: spectrogram, frequency (Hz, 0-8000) versus time (s, 0-3)]

Figure 16. Spectrogram for masking threshold of input signal

Figure 16 shows the masking threshold of the input signal extracted by the psychoacoustic model. The spectrogram shows the magnitude levels of the masking threshold at each frequency, using color to represent magnitude, and it is computed for each frame of the signal. Red refers to the highest magnitude level of the masking threshold, while blue refers to the lowest.


[Figure: spectrogram, frequency (Hz, 0-8000) versus time (s, 0-3)]

Figure 17. Spectrogram for estimated masking threshold

Figure 17 shows the estimated masking threshold of the input signal, which is also the frequency response of the linear predictor. The color bars show how closely the LPC coefficients approximate the masking threshold of the input signal. Red refers to the highest magnitude level of the estimated masking threshold, while blue refers to the lowest.


[Figure: magnitude (dB) versus frequency (Hz, 0-8000); masking threshold and estimated magnitude response curves]

Figure 18. Masking threshold and LPC response for frame of signal

Figure 18 shows the masking threshold for one of the frames of the audio signal.

The red plot shows the extracted masking threshold, while the blue plot shows the response of the linear predictor for that frame of the input signal. The plot makes it clear that the predictor response closely matches the extracted masking threshold of the input signal, so the approximation by the LPC coefficients is accurate.


Chapter 6

CONCLUSION

The implementation of an audio coder based on perceptual linear predictive coding represents another way to code an audio signal. This project combines a perceptual model

and a linear predictive technique to perform audio coding. The project is successfully implemented in MATLAB. The linear predictor coefficients are obtained using Levinson's recursion. During coding, information is removed from the audio signal based on a perceptual model, achieving irrelevance reduction. After decoding, the reconstructed audio signal has the same quality as the original audio signal.

This project provided me a great opportunity to study and learn the fundamentals of digital signal processing. During implementation, I have used these fundamentals to get the best results for this project. I have finished all the requirements for the success of this project. After developing this project, I have learned the fundamentals of a psychoacoustic model and how it is combined with a linear predictor filter to process the audio signal.

6.1 Future Improvements

This project uses a perceptual model that removes irrelevant information from an audio signal. Redundancy reduction is still needed to achieve further compression, so this design can be extended or combined with appropriate quantization and coding techniques, operating either in the frequency domain or in the time domain, to reach a lower bit rate while maintaining signal quality. In this project, the filter coefficients are not transmitted to the decoder side; to control the post-filter according to perceptual criteria in a transmission setting, the coefficients would instead be quantized, efficiently coded, and transmitted along with the coded signal as side information.


APPENDIX

encode.m

% It reads the audio signal from a wave file and codes the input signal according to
% perceptual criteria using a linear predictive filter (pre-filter).

% function [y, A, fs] = encode_file (filename, plotting)
filename = 'cmu_us_arctic_slt_a0001.wav';
plotting = 1;

% if plotting is not set, no plotting should be done

if nargin == 1, plotting = 0; end

% create figure for masking threshold and estimate filter evolution

if plotting == 2, fig = figure; end

% reading the signal from the file

[x, fs] = wavread (filename);

%------
% system initialization

% assigning the number of point over which the FFT is computed

FFTlength = 2048;

% set the LPC analysis order
p = 10;

% computing the processing step, window size and frequency values (in Hz % and Barks)


ps = round (fs * 0.005); % 5 ms
ws = round (fs * 0.025); % 25 ms

N = length (x); % the length of the signal in samples
f = fs/FFTlength:fs/FFTlength:fs/2;
b = hz2bark (f);

% initializing the filter coefficients matrix

A = zeros (floor ((N-ws)/ps)+1, p+1);

% if plotting is enabled, initialize the filter response and MT % spectrograms

if plotting

H = []; TH = []; end

%------
% estimating initial values for the pre-filter

% selecting the first window from the signal
crtW = x (1:ws);

% extracting the masking threshold
th = findThreshold (crtW, FFTlength, f);

% get an estimate filter for the extracted threshold
a = getMTest (th, p, f);

% save the filter coefficients into matrix A
A (1, :) = a;

% if plotting is enabled, save the filter response and the MT % if plotting

H = [H 10*log10 (abs (freqz (1, a, FFTlength/2)))]; TH = [TH; th];


% end

%------
% initialize the channel output
y = zeros (size (x));

%------

% start processing the signal

% initialize the filter delay chains
prebuff = zeros (p+1, 1);

% first process and transmit the initial frame

for i=1:ws

% computing the prefilter output
[y (i), prebuff] = prefilter (x (i), prebuff, a);

crtW = [crtW (2:ws); x (i)];
end

% sufficient new data has been gathered and now the normal processing steps % can be taken

lastUpdated = 1;

crtW = [crtW (ps+1:ws); x (i+1:i+ps)]; th = findThreshold (crtW, FFTlength, f);

a = getMTest (th, p, f); A (floor ((i-ws)/ps)+1, :) = a;

% if plotting is enabled, save the filter response and the MT

if plotting

H = [H 10*log10 (abs (freqz (1, a, FFTlength/2)))]; TH = [TH; th];

end

for i=i+1:N-ps

% computing the prefilter output
[y (i), prebuff] = prefilter (x (i), prebuff, a);

% adding new values to the analysis window % crtW = [crtW (2:ws); x (i)];

% if the coefficients were last updated ps samples ago, it's time to
% compute new coefficients
if lastUpdated == ps

lastUpdated = 0;
crtW = [crtW (ps+1:ws); x (i+1:i+ps)];
% th = get_AMT (crtW, FFTlength, f, b);
th = findThreshold (crtW, FFTlength, f);
a = getMTest (th, p, f);
A (floor ((i-ws)/ps)+2, :) = a;

% if plotting is enabled, save the filter response and the MT
if plotting

h = 10*log10 (abs (freqz (1, a, FFTlength/2)).^2); H = [H h-mean (h)]; TH = [TH; th-mean (th)];

if plotting == 2

figure (fig); plot (f, th-mean (th), 'r', f, h-mean (h), 'b'); pause (0.1);

end

end
end

lastUpdated = lastUpdated+1;


end

for i=i+1:N

% computing the prefilter output
[y (i), prebuff] = prefilter (x (i), prebuff, a);

end

time = (1:N)/fs;

wavwrite (y, fs, [filename (1:end-3) 'pre.wav']);

% LPC coefficients are stored for each frame of signal and updated for every process
% step within the window. They are saved so they can be passed to the post-filter.

save ('filter_coeffs.mat', 'A');

if plotting
close all;

figure;
plot (time, x), title ('Original signal'), xlim ([time (1) time (N)]);
figure;
plot (time, y), title ('Pre-filtered signal'), xlim ([time (1) time (N)]);

figure;
imagesc (time, f, TH'); title ('Masking threshold spectrogram');
xlim ([time (1) time (N)]); ylim ([f (1) f (FFTlength/2)]);

figure;
imagesc (time, f, H); title ('Masking threshold estimate spectrogram');
xlim ([time (1) time (N)]); ylim ([f (1) f (FFTlength/2)]);
end


prefilter.m

function [y, buff] = prefilter (x, buff, a)

% determine the filter order
p = length (a) - 1;

% push the current value in the delay chain
buff = [x; buff (1:p)];

% compute the output sample
y = a*buff;

getMtest.m

function [a, e] = getMTest (th, p, f)

% shift the threshold and mirror it to cover the negative frequencies
th = th - min (th);
th = [th fliplr (th)];
% plot (th);
% pause (1);

% the IFFT of the power spectrum gives the auto-correlation values
acf = real (ifft (power (10, th/10)));

[a, e] = levi (acf (1:round (length (acf)/2)), p);

% [b a] = invfreqz (th, f/max(f)*pi, 1, p);

Decode.m

% function y = decode (x, A, fs, plotting)


clear all;

filename = 'cmu_us_arctic_slt_a0001.pre.wav';
[x, fs] = wavread (filename);

load ('filter_coeffs.mat');

% if plotting is not set, no plotting should be done

% if nargin == 3
plotting = 1;
% end

%------
% system initialization

% set the LPC analysis order
p = 10;

% computing the processing step, window size and frequency values (in Hz % and Barks)

ps = round (fs * 0.005); % 5 ms
ws = round (fs * 0.025); % 25 ms
N = length (x); % the length of the signal in samples

%------
% initialize the postfilter output
y = zeros (size (x));

%------

% start processing the signal

% initialize the filter delay chains
postbuff = zeros (p+1, 1);

% get the first filter coefficients
a = A (1,:);

% process the first frame

for i=1:ws

% compute the post-filter output
[y (i), postbuff] = postfilter (x (i), postbuff, a);
end

lastUpdated = 1;
a = A (2, :);

for i=i+1:N-ps

% computing the postfilter output
[y (i), postbuff] = postfilter (x (i), postbuff, a);

% if the coefficients were last updated ps samples ago, it's time to % compute new coefficients

if lastUpdated == ps
lastUpdated = 0;
a = A (floor ((i-ws)/ps)+2, :);
end

lastUpdated = lastUpdated+1;
end

for i=i+1:N

% computing the postfilter output

[y (i), postbuff] = postfilter (x (i), postbuff, a);
end

time = (1:N)/fs;

if plotting

figure;
plot (time, x), title ('Pre-filtered signal'), xlim ([time (1) time (N)]);
figure;
plot (time, y), title ('Reconstructed signal from post-filter'), xlim ([time (1) time (N)]);


end

wavwrite (y, fs, [filename (1:end-7) 'post.wav']);

levi.m

% It takes the predictor order and auto-correlation values for each frame of signal
% and calculates the LPC coefficients from the auto-correlation values.
function [a, E] = levi(r, p)

% initializing the LP coefficients
a = zeros(1, p+1);
a(1) = 1;

% running the first iteration

k = -r(2) / r(1); a(2) = k; E = r(1) * (1-k^2);

% iterating the remaining steps
for i=2 : p
% setting the j index
j = 1:i-1;

% computing the reflection coefficient
k = r(i+1);
k = k + sum(a(j+1) .* r(i-j+1));
k = -k/E;

% saving the previous LP coefficients
ap = a;

% computing the new LP coefficients from 1 to i-1


a(j+1) = ap(j+1) + k * ap(i-j+1);

% saving the reflection coefficient as the i'th LP coef
a(i+1) = k;

% computing the new prediction gain
E = E * (1-k^2);
end

Postfilter.m

% It takes the stored LPC coefficients from the matrix for each frame of signal and
% generates a frequency response inverse to that of the pre-filter, using previous
% output samples to decode the signal.

function [y, buff] = postfilter (x, buff, a)

% determine the filter order
p = length (a) - 1;

% compute the output sample
y = x - [a (2:end) 0]*buff;

% push the current value in the delay chain
buff = [y; buff (1:p)];


REFERENCES

[1] A. Gersho, “Advances in Speech and Audio Compression.” Proceedings of the IEEE,

Vol. 82, Issue 6, 1994, pp. 900-918.

[2] B. Edler and G. Schuller, “Audio Coding Using a Psychoacoustic Pre- and Post-filter.” Proceedings of the IEEE International Conference on Acoustics, Speech, Signal

Processing, Istanbul, Turkey, 2000, pp. 881–884.

[3] E. Zwicker and H. Fastl, “Psychoacoustics: Facts and Models.” New York, Springer, 1999. Ch. 2, 4, and 6.

[4] Ted Painter and Andreas Spanias, “Perceptual Coding of Digital Audio.” Proceedings of the IEEE, Vol. 88, No. 4, April 2000, pp. 451–512.

[5] Jeremy Bradbury, “Linear Predictive Coding.” December 5, 2000.

[6] B. Edler, C. Faller, and G. Schuller, “Perceptual Audio Coding Using a Time-Varying

Linear Pre- and Post-filter.” Proceedings of AES Symposium, Los Angeles, CA,

September 2000.

[7] Nagarajan and S. Sankar, “Efficient Implementation of Linear Predictive Coding

Algorithms.” Proceedings of the IEEE, Southeastcon 1998, pp. 69-72.

[8] John Makhoul, “Linear Prediction: A Tutorial Review.” Proceedings of the IEEE,

Vol. 63, No. 4, April 1975.

[9] Douglas O’Shaughnessy, “Linear Predictive Coding.” IEEE Potentials, Vol. 7, Issue

1, 1988, pp. 19-32.