AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING
Pratik R. Bhatt B.E., C.U. Shah College of Engineering and Technology, India, 2006
PROJECT
Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in
ELECTRICAL AND ELECTRONIC ENGINEERING
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL 2010
AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING
A Project
by
Pratik R. Bhatt
Approved by:
______, Committee Chair Jing Pang, Ph.D.
______, Second Reader Preetham Kumar, Ph.D.
______Date
Student: Pratik R. Bhatt
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
, Graduate Coordinator Preetham Kumar, Ph.D. Date
Department of Electrical and Electronic Engineering
Abstract
of
AUDIO CODER USING PERCEPTUAL LINEAR PREDICTIVE CODING
by
Pratik R. Bhatt
Audio coder using perceptual linear predictive coding is a new technique for compressing and decompressing audio signals. This technique codes only the information in an audio signal that is audible to the human ear; all other information is discarded during coding. The audio signal is given to a psychoacoustic model, which generates information to control a linear prediction filter. At the encoder side, the signal is passed through a prediction filter, and the filtered signal contains less information because it is coded according to perceptual criteria. At the decoder side, the encoded signal is decoded using a linear prediction filter that is the inverse of the filter used at the encoder side, and the decoded signal is similar to the original signal. Since only the information in the audio signal that is relevant to the human ear is coded, the signal contains less data and has a high signal-to-noise ratio.
This project is about designing and implementing the MATLAB code for an audio
coder using perceptual linear predictive coding. The following tasks were performed in
this project.
(i) Process the audio signal according to a perceptual model.
(ii) Encode and decode the signal using the linear predictor.
(iii) Check simulation results for the original audio signal as well as the retrieved
audio signal after signal processing.
(iv) Listen to the wave file of the original as well as the reconstructed signal.
, Committee Chair Jing Pang, Ph.D.
______Date
ACKNOWLEDGMENTS
I would like to thank my project adviser, Dr. Jing Pang, for giving me an opportunity to implement the Audio coder based on perceptual linear predictive coding. I have learned many concepts of signal processing during the implementation of this project. Dr. Pang provided great support and guidance throughout the development of this project. She has been a great listener and problem solver throughout the implementation.
Her valuable research experience has helped me to develop fundamental research skills.
I would also like to thank Dr. Preetham Kumar for his valuable guidance and support in writing this project report. In addition, I would like to thank Dr. Suresh
Vadhva, Chair of the Electrical and Electronic Engineering Department, for his support in completing the requirements for a Master’s degree at California State University,
Sacramento.
TABLE OF CONTENTS
Page
Acknowledgments ...... vi
List of Tables ...... ix
List of Figures ...... x
Chapter
1. INTRODUCTION ...... 1
1.1 Background ...... 1
1.2 Types of Audio Coding ...... 1
1.3 Concept ...... 3
1.4 Overview of Design ...... 4
2. PSYCHOACOUSTIC MODEL ...... 6
2.1 Threshold of Hearing ...... 6
2.2 Critical Bands...... 8
2.3 Simultaneous Masking ...... 11
2.3.1 Noise Masking Tone ...... 12
2.3.2 Tone Masking Noise ...... 13
2.4 Non-simultaneous Masking ...... 14
2.5 Calculate Overall Masking Threshold ...... 15
3. LINEAR PREDICTIVE CODING ...... 18
3.1 Linear Prediction ...... 18
3.2 Linear FIR Predictor as Pre-filter ...... 19
3.3 Linear IIR Predictor as Post-filter ...... 21
3.4 Calculation of LPC Filter Coefficients ...... 22
4. MATLAB IMPLEMENTATION ...... 27
5. SIMULATION RESULTS ...... 31
6. CONCLUSION ...... 37
6.1 Future Improvements ...... 37
Appendix ...... 39
References ...... 49
LIST OF TABLES
Page
1. Table 1 List of critical bands ...... 11
LIST OF FIGURES
Page
1. Figure 1 Audio coder using perceptual linear predictive coding ...... 4
2. Figure 2 Threshold of hearing for average human ...... 7
3. Figure 3 Frequency to place transformation along the basilar membrane ...... 8
4. Figure 4 Narrow band noise sources masked by two tones ...... 9
5. Figure 5 Decrease in threshold after critical bandwidth ...... 9
6. Figure 6 Noise masking tone ...... 12
7. Figure 7 Tone masking noise ...... 13
8. Figure 8 Non-simultaneous masking properties of human ear ...... 14
9. Figure 9 Steps to calculate masking threshold of signal ...... 16
10. Figure 10 FIR linear predictor structure ...... 19
11. Figure 11 IIR linear predictor structure ...... 21
12. Figure 12 Determination of linear predictor coefficients ...... 23
13. Figure 13 Input signal ...... 31
14. Figure 14 Pre-filtered signal ...... 32
15. Figure 15 Reconstructed signal from post-filter ...... 33
16. Figure 16 Spectrogram for masking threshold of input signal ...... 34
17. Figure 17 Spectrogram for estimated masking threshold ...... 35
18. Figure 18 Masking threshold and LPC response for frame of signal ...... 36
Chapter 1
INTRODUCTION
1.1 Background
In our day-to-day lives, we use various kinds of audio equipment, such as cell phones, music players, and wireless radio receivers. These devices process audio signals in different ways. The basic form of an audio signal is analog, but it is easier to process a signal in digital form than in analog form. The process of converting an analog signal into a coded digital representation is described as audio coding, and audio processing has become increasingly digitized in recent years. The general procedure of audio processing is as follows: an analog signal is sampled at regular intervals, and the sampled data is encoded with a coding scheme; at the decoder side, the data is converted back to its original form. As the sampling rate of the audio signal increases, the bandwidth required by that signal increases and so does the audio quality. The main goal of audio coding is therefore to achieve high audio quality at a reduced bit rate while maintaining a high signal-to-noise ratio.
1.2 Types of Audio Coding
There are various types of coding schemes available for audio coding. Each coding scheme has its benefits and drawbacks depending on the technique used to process the audio signal. For any coding technique, an area of concern is the quality of the signal and the achievable bit rate used to reduce the transmission bandwidth. Based on these criteria, different coding techniques are used to code the audio signal. One of
the coding techniques is waveform coding, which uses a signal to perform a coding
operation and produces a reconstructed signal that is the same as the original signal.
Pulse coded modulation (PCM), Differential pulse coded modulation (DPCM), and
Adaptive differential pulse coded modulation (ADPCM) are known as waveform coders.
Source coding technique or parametric coder uses a model of the signal source to code
the signal; this source model is represented by a mathematical representation to reduce
signal redundancies. Linear predictive coding is an example of source coding. In linear
predictive coding, signal parameters are coded as well as transmitted and these
parameters are used to produce a signal at the decoder side. In this technique, the
reconstructed signal does not have the same quality as the original signal. Sub-band
coding and transform coding are examples of source coders that work in the frequency
domain [1].
Transform-based coding uses a transform length to analyze and process the audio
signal; therefore, the resolution of the signal is dependent on the transform length. If the
transform length is longer, then it provides good resolution for speech signals because
they change very little over time. If the transform length is short, it provides good
resolution for audio signals because they change abruptly within a short time. Therefore,
required transform length is dependent on the signal type to be coded [2]. Another
concept is to code the signal according to perceptual criteria. In this method, information
or the sound of a signal that cannot be heard by the human ear is removed and
quantization is performed according to a perceptual model. This allows quantization
noise to be kept below the masking threshold of the signal such that it cannot be
perceived by the human ear; then, appropriate transform coding is used to achieve
compression by removing redundant information from the signal. The purpose of an
audio coder using perceptual linear predictive coding is to use a linear predictive filter at
the encoder side as well as at the decoder side. A psychoacoustic model is used to control
the linear predictor filter according to a perceptual criterion of the human auditory
system. The linear predictive filter at the encoder side codes the signal based on a
psychoacoustic model and the linear predictive filter at the decoder side decodes the
signal to reconstruct the original signal.
1.3 Concept
The basic idea of an audio coder based on perceptual linear predictive coding is to
combine a linear predictive filter with a psychoacoustic model. The combination of a
psychoacoustic model and linear predictive coding is used to reduce irrelevant
information of the input signal. The linear predictor at the encoder side is known as the pre-filter and the linear predictor at the decoder side is known as the post-filter. The pre- and
post-filter are controlled by a psychoacoustic model which shapes the quantization noise
such that it stays below the masking threshold of the signal and it becomes inaudible to
the human ear. This technique provides good resolution for speech as well as audio
signals and the resolution of signals is independent of the coding technique used [2].
1.4 Overview of Design
The project design consists of a pre-filter and post-filter with a psychoacoustic model. Figure 1 shows a basic block diagram of an audio coder using perceptual linear predictive coding. The psychoacoustic model takes the input audio signal and generates a masking threshold value for each frame of signal. It utilizes properties of the human auditory system to generate the masking threshold of the input audio signal.
[Block diagram: input audio signal → pre-filter → pre-filtered signal → post-filter → post-filtered signal, with the psychoacoustic model controlling the pre- and post-filter]
Figure 1. Audio coder using perceptual linear predictive coding [2]
Pre- and post-filter are based on the linear predictive filter structure and are
adapted by the masking threshold generated by a psychoacoustic model. The masking
threshold of an audio signal is used to estimate the linear predictive filter coefficients,
and these estimated coefficients change with time making a linear predictive filter
adaptive.
At the transmitter side, a psychoacoustic pre-filter generates a time varying
magnitude response according to the masking threshold of the input signal, and this is
achieved by time varying filter coefficients from a masking threshold for that particular
frame of signal. A linear predictor as a pre-filter takes the input audio signal and
processes it frame by frame. Adaptation of the pre-filter using a psychoacoustic model
removes any inaudible components from the input signal and reduces information that is
irrelevant to the human ear. This pre-filtered audio signal is now coded, as it contains less
information than the original signal. During pre-filtering, the signal is processed or coded according to perceptual criteria.
At the receiver side, a linear predictor works as a post-filter. It takes an encoded
signal to reconstruct the original signal according to a perceptual model. This
psychoacoustic post-filter also shapes noise according to the masking threshold level of a
signal. Filter coefficients are estimated from the masking threshold for each frame of
signal. Therefore, the post-filter provides a time varying magnitude response for each
frame of signal. The post-filter's magnitude response is the inverse of that of the pre-filter used at the encoder side, and it predicts the output using the same predictor coefficients used by the pre-filter. In this design, the predictor coefficients are not transmitted but are stored in a file for each frame of signal, so that the post-filter has access to these coefficients to reconstruct the signal.
Chapter 2
PSYCHOACOUSTIC MODEL
The psychoacoustic model is based on the characteristics of the human ear. The human ear does not hear all frequencies in the same manner; it typically hears sounds in the 20 Hz to 20 kHz range, and sounds outside of this range are not perceived. The psychoacoustic model is used to eliminate information which is not audible to the average human ear. This basic phenomenon of the psychoacoustic model is used in audio signal processing to reduce the amount of information or data in the audio signal. When audio coding is based on a psychoacoustic model, signal components within the range of the average human ear are coded and the others are discarded, since they are not perceived. By coding this way, fewer bits are required to code an audio signal and the efficiency of the coding scheme is improved.
2.1 Threshold of Hearing
Within the frequency range of 20 Hz to 20 kHz, a sound must have a minimum sound pressure level (SPL) to be heard by the human ear. This minimum sound pressure level is known as the threshold of hearing for the average human. If a sound has a pressure level below the absolute threshold of hearing, it will not be heard even if it is within the frequency range of the human ear. The equation for the threshold of hearing is as follows
[3]:
Th(f) = 3.64 (f/1000)^(-0.8) - 6.5 e^(-0.6 (f/1000 - 3.3)^2) + 10^(-3) (f/1000)^4  (dB SPL)   (1)
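The project itself is implemented in MATLAB; purely as an illustration, equation (1) can be evaluated with a short function like the following sketch (the function name is hypothetical):

```python
import math

def threshold_in_quiet(f_hz):
    """Absolute threshold of hearing of equation (1), in dB SPL."""
    f = f_hz / 1000.0  # work in kHz, as in equation (1)
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 3-4 kHz, where the curve reaches its
# minimum; the threshold rises again toward both ends of the range.
for f in (100, 1000, 3300, 15000):
    print(f, round(threshold_in_quiet(f), 2))
```

A component whose sound pressure level falls below this curve can be discarded during coding without audible effect.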
For example, a signal with a frequency of 1 kHz is going to be heard only when its pressure level satisfies the above equation (1). This level is defined as the threshold of hearing for that particular frequency; each frequency has its own threshold level at which it begins to affect the human ear. Figure 2 shows the value of the absolute threshold of hearing for different frequencies. It also shows that the ear is most sensitive, and the threshold of hearing lowest, in the middle range of the human auditory system. While coding the signal, if a frequency component has a pressure level less than the threshold of hearing, that component can be removed, since it cannot be heard by the average human [4].
Figure 2. Threshold of hearing for average human [4]
2.2 Critical Bands
The concept of critical bands comes from analyzing the human auditory system. As signals vary over time, the threshold of hearing also changes with time, and analyzing the auditory system makes it possible to determine the actual threshold. Figure 3 shows how the frequency-to-place transformation occurs along the basilar membrane of the human inner ear. Inside the inner ear (cochlea), a wave travels along the length of the basilar membrane and reaches its maximum amplitude at a certain point, because each point of the basilar membrane is associated with a particular frequency [3].
Figure 3. Frequency to place transformation along the basilar membrane [4]
To imitate this behavior of the human cochlea for audio spectrum analysis, band-pass filters, also known as cochlear filters, are used. The bandwidth of each filter is known as the critical bandwidth. This bandwidth is smaller at low frequencies and increases at higher frequencies; it is a non-linear function of frequency and can be described as follows [4].
Critical Bandwidth(f) = 25 + 75 [1 + 1.4 (f/1000)^2]^0.69  (Hz)   (2)
Figure 4. Narrow band noise sources masked by two tones [3]

Figure 5. Decrease in threshold after critical bandwidth [3]
One way to see the effect on the threshold in critical bandwidth is shown in
Figure 4. It shows that the narrowband noise is located between two masking tones separated within the critical band. Figure 5 shows that within the critical bandwidth, the threshold remains constant and independent of the location of the masking tones, but it starts to decrease as the separation between the tones increases beyond the critical bandwidth. To perform spectrum analysis, the human auditory range is divided into 25 critical bands as shown in Table 1, and the width of each critical band is known as one "bark." The bark value is calculated from a given frequency by the following equation [4].

z(f) = 13 arctan(0.00076 f) + 3.5 arctan[(f/7500)^2]  (bark)   (3)
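Equations (2) and (3) translate directly into small helper functions. The sketch below is an illustrative Python version (the project's code itself is MATLAB, and the function names here are hypothetical):

```python
import math

def critical_bandwidth(f_hz):
    """Critical bandwidth around frequency f_hz, equation (2), in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def hz_to_bark(f_hz):
    """Frequency-to-bark mapping of equation (3)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Consistent with Table 1: roughly 100 Hz wide bands at low
# frequencies, and about 24 barks at the 15.5 kHz band edge.
print(round(critical_bandwidth(100), 1))
print(round(hz_to_bark(15500), 2))
```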
Band No   Center Frequency (Hz)   Bandwidth (Hz)
1         50                      0 - 100
2         150                     100 - 200
3         250                     200 - 300
4         350                     300 - 400
5         450                     400 - 510
6         570                     510 - 630
7         700                     630 - 770
8         840                     770 - 920
9         1000                    920 - 1080
10        1175                    1080 - 1270
11        1370                    1270 - 1480
12        1600                    1480 - 1720
13        1850                    1720 - 2000
14        2150                    2000 - 2320
15        2500                    2320 - 2700
16        2900                    2700 - 3150
17        3400                    3150 - 3700
18        4000                    3700 - 4400
19        4800                    4400 - 5300
20        5800                    5300 - 6400
21        7000                    6400 - 7700
22        8500                    7700 - 9500
23        10500                   9500 - 12000
24        13500                   12000 - 15500
25        19500                   15500 -

Table 1. List of critical bands [4]
Table 1 shows how the human auditory range is divided into 25 critical bands; the critical bandwidth is a non-linear function of frequency.
2.3 Simultaneous Masking
In an audio signal, there is more than one sound or signal present at a particular
time or frequency. When the presence of a strong signal makes a weaker signal inaudible
to the human ear, the event is known as a masking effect. This effect increases the
threshold value of hearing above the threshold value in quiet. The strong signal is known as the masker while the weaker signal is known as the masked signal. When two signals are close in
frequency range or within critical bandwidth and perceived by the human ear at same
time, then this type of masking is known as simultaneous masking. This type of masking
describes most of the masking effects in an audio signal. In the frequency domain, a
magnitude spectrum determines which components will be detected and which
components will be masked. The theory of simultaneous masking is explained by the fact
that the basilar membrane is excited by a strong signal much more than by a weaker
signal; therefore, a weaker signal is not detected [3]. In an audio signal, simultaneous
masking is described mainly by Noise masking tone and Tone masking noise scenarios.
2.3.1 Noise Masking Tone
Figure 6. Noise masking tone [4]
The above figure shows a noise masker of critical bandwidth centered at 410 Hz with a sound pressure level of 80 dB. The masked tone is also centered at 410 Hz, with a sound pressure level of 76 dB. The masked tone threshold level depends on the center frequency and the sound pressure level (SPL) of the masker noise. The signal-to-mask ratio (SMR) is defined as the level difference between the noise masker and the masked tone. When the tone frequency equals the center frequency of the masking noise, the minimum signal-to-mask ratio is reached, which gives the threshold value for a tone masked by critical-band noise. The scenario represented in Figure 6 shows a minimum signal-to-mask ratio of 4 dB. The signal-to-mask ratio increases and the masking power decreases as the tone frequency moves above or below the center frequency [4].
2.3.2 Tone Masking Noise
Figure 7. Tone masking noise [4]
The above figure shows a tone signal with a center frequency of 1000 Hz within a critical bandwidth and a sound pressure level of 80 dB. This tone masks a noise centered at 1000 Hz with a sound pressure level of 56 dB. The masked noise threshold level depends on the center frequency and sound pressure level of the masker tone. When the center frequency of the masked noise is close to the frequency of the masker tone, the minimum signal-to-mask ratio (SMR) is reached, which gives the threshold value for noise masked by a tone. In this case, a minimum SMR of 24 dB is obtained for the threshold value of the noise. The SMR increases and the masking power decreases as the noise moves above or below the center frequency of the masking tone. Here the minimum SMR is much higher than in the noise-masking-tone case [4].
2.4 Non-simultaneous Masking
Simultaneous masking takes place when a masker and masked signal are perceived at the same time. However, a masking effect also takes place before the masker appears as well as after the masker is removed. This type of masking is known as non- simultaneous or temporal masking [3][4].
Figure 8. Non-simultaneous masking properties of human ear [3]
Pre-masking is described as the masking effect before the masker appears, and post-masking as the effect after the masker is removed. Figure 8 shows the pre- and post-masking regions, and it also shows how the threshold of hearing changes within these regions under non-simultaneous masking. Pre-masking can range from 1-2 ms before the masker appears, and post-masking can be in the range of 100-300 ms, depending on the intensity as well as the duration of the masker. This masking effect raises the threshold of hearing for the masked signal above the threshold of hearing in quiet [3].
2.5 Calculate Overall Masking Threshold [4]
For this project, the basic psychoacoustic model of MPEG Layer 1 is used to generate the masking threshold of the input signal. The steps are as follows.

Step 1: Get the power spectrum of the signal.

In this step, the input signal is first normalized so that its sound pressure level is referenced to the threshold of hearing for the average human. After normalization, the input signal is divided into frames, the Fast Fourier Transform (FFT) is applied to each frame, and the power spectrum of the input signal is obtained for each frame.
Step 2: Determination of tone and noise maskers.

For a given power spectrum of the signal, a component that exceeds the level of its neighboring components by at least 7 dB is considered a tone masker and is added to the set used to generate the new power spectrum. A given frequency component with power P[i] is a tone masker if it satisfies both of the conditions below.

(i) P[i] > P[i-1] and P[i] > P[i+1]
(ii) P[i] > P[i+k] + 7 dB and P[i] > P[i-k] + 7 dB for each offset k in the neighborhood Nk

Where the neighborhood Nk is defined as:

[i-2 .. i+2] for 0.17 kHz < f < 5.5 kHz
[i-3 .. i+3] for 5.5 kHz ≤ f < 11 kHz
[i-6 .. i+6] for 11 kHz ≤ f < 20 kHz
Noise maskers are calculated after determining tone maskers for a given spectrum of signal.
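The tone-masker test of Step 2 can be sketched as follows (an illustrative Python fragment rather than the project's MATLAB code; the uniform bin spacing and the function name are assumptions of this sketch):

```python
def find_tone_maskers(P, bin_hz):
    """Return bin indices of tone maskers in a power spectrum.

    P is a list of per-bin power levels in dB; bin_hz is the spacing of
    the uniformly spaced frequency bins (an assumption of this sketch).
    A bin is tonal if it is a local maximum and exceeds its wider
    neighborhood by at least 7 dB; the neighborhood width grows with
    frequency, as described above.
    """
    tones = []
    for i in range(1, len(P) - 1):
        f = i * bin_hz
        if f < 170 or f >= 20000:
            continue
        if f < 5500:
            offsets = (2,)
        elif f < 11000:
            offsets = (2, 3)
        else:
            offsets = (2, 3, 4, 5, 6)
        if not (P[i] > P[i - 1] and P[i] > P[i + 1]):
            continue  # condition (i): local maximum
        # condition (ii): at least 7 dB above the wider neighborhood
        if all(0 <= i - k and i + k < len(P)
               and P[i] > P[i + k] + 7 and P[i] > P[i - k] + 7
               for k in offsets):
            tones.append(i)
    return tones
```

For example, an isolated 80 dB peak over a 40 dB floor passes the test, while a peak only 6 dB above its neighborhood fails condition (ii).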
[Flow: input signal → get the power spectrum of the signal → determine tone and noise maskers → reduction and determination of maskers → determine individual masking thresholds → get the overall masking threshold]

Figure 9. Steps to calculate masking threshold of signal [4]
Step 3: Reduce and find maskers in critical bandwidth.
In the masker power spectrum, if a masker is below the threshold of hearing, it is discarded, since it cannot be heard by the average human. After that, if there is more than one masker within a critical bandwidth for a given power spectrum, only the strongest masker is kept and the weaker ones are discarded.
Step 4: Determine individual masking threshold.
After the tone and noise maskers are determined, the individual masking thresholds for the noise and tone maskers are calculated using a spreading function. A spreading function determines the minimum level a neighboring frequency must have in order to be perceived by the human ear; it describes the effect of masking on nearby critical bands. It depends on the masker location j, the masked signal location i, the power spectrum PTM of the tone masker, and the difference between the masker and masked signal locations in barks. It is described as follows.

Spreading function (i, j) = 17 Δ - 0.4 PTM(j) + 11,              for -3 ≤ Δ < -1   (4)
                          = (0.4 PTM(j) + 6) Δ,                  for -1 ≤ Δ < 0    (5)
                          = -17 Δ,                               for 0 ≤ Δ < 1     (6)
                          = (0.15 PTM(j) - 17) Δ - 0.15 PTM(j),  for 1 ≤ Δ < 8     (7)

Where Δ = z(i) - z(j).
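The piecewise spreading function of equations (4)-(7) can be sketched directly (illustrative Python; the project's implementation is MATLAB, and the function name is hypothetical):

```python
def spreading_function(delta_bark, p_tm):
    """Spread masking contribution in dB, equations (4)-(7).

    delta_bark: masked location minus masker location, z(i) - z(j), in barks.
    p_tm: tone-masker power PTM(j) in dB.
    Returns None outside the modeled range -3 <= delta < 8.
    """
    d, p = delta_bark, p_tm
    if -3 <= d < -1:
        return 17 * d - 0.4 * p + 11
    if -1 <= d < 0:
        return (0.4 * p + 6) * d
    if 0 <= d < 1:
        return -17 * d
    if 1 <= d < 8:
        return (0.15 * p - 17) * d - 0.15 * p
    return None
```

The four pieces join continuously at Δ = -1, 0, and 1, with a steeper slope on the lower-frequency side of the masker than on the upper side.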
Step 5: Find overall masking threshold.
To determine the overall masking threshold at a given frequency component, the individual masking thresholds of all tone and noise maskers at that component are added together.
Chapter 3
LINEAR PREDICTIVE CODING
3.1 Linear Prediction
A speech signal is continuous and varies with time; therefore, it becomes difficult
to analyze and more complex to code. After coding, the transmission of coded signal
requires more bandwidth and a higher bit rate as speech contains more information. At
the receiver side, it also becomes much more complex to reconstruct the original signal
based on the coded speech signal. For this reason, it becomes necessary to code the signal so that it requires a lower bit rate while the reconstructed signal still resembles the original signal.
The basic idea behind linear prediction is that successive samples of a signal are correlated. The values of consecutive samples are approximately the same and the difference between them is small, so it becomes easy to predict a sample from a linear combination of previous samples. For example, let x be a continuous time-varying signal whose value at time t is x(t). When this signal is converted into discrete time-domain samples, the value of the nth sample is denoted x(n). To determine the value of x(n) using linear prediction techniques, the values of the past samples x(n-1), x(n-2), x(n-3), …, x(n-p) are used, where p is the predictor order of the filter [8]. This coding
based on the values of previous samples is known as linear predictive coding. This
coding technique is used to code speech signals. The normal speech signal is sampled at
8000 samples/sec and 8 bits are used to code each sample. This yields a bit rate of around 64,000 bits/sec. By using linear predictive coding, a bit rate of 2400 bits/sec can be achieved.
In a linear prediction system or coding scheme, prediction error e(n) is transmitted rather
than a sampled signal x(n) because the error is relatively smaller than a sample of signal;
therefore, it requires fewer bits to code [5]. At the receiver side, this error signal is used
to generate and reconstruct the original signal using previous output samples.
3.2 Linear FIR Predictor as Pre-filter
The pre-filter is based on a finite impulse response (FIR) linear predictor. This structure is used to build the psychoacoustic pre-filter at the encoder side.
The pre-filter takes an input sample from a current frame of signal and the linear
predictor structure determines the predicted sample using previous samples of the input signal [6].
[Structure: the input x(n) feeds a chain of unit delays z^-1; the delayed samples are weighted by coefficients a1, a2, …, ap and summed to form the predicted value x̂(n), which is subtracted from x(n) to produce the prediction error e(n)]
Figure 10. FIR linear predictor structure [6]
Figure 10 shows the basic FIR linear predictor structure. The predicted signal is generated by multiplying the predictor coefficients with the corresponding previous values of the signal and then adding these products together. Here a1, a2, a3, …, ap are known as the linear prediction coefficients; they are explained in the following section. The value p is known as the predictor order, and its value determines how closely the signal is estimated. The predicted signal x̂(n) is a function of these LPC coefficients and the previous input samples, and it is represented by the following equation.

x̂(n) = Σ_{k=1}^{p} a_k x(n-k)   (8)

Prediction error = original signal - predicted signal   (9)

e(n) = x(n) - x̂(n)   (10)

Substituting equation (8),

e(n) = x(n) - Σ_{k=1}^{p} a_k x(n-k)   (11)

To convert the above equation into the frequency domain, taking the z-transform,

E(z) = X(z) - Σ_{k=1}^{p} a_k X(z) z^(-k)   (12)

E(z) = X(z) [1 - Σ_{k=1}^{p} a_k z^(-k)]   (13)

E(z) = X(z) H(z)   (14)

Where H(z) = 1 - Σ_{k=1}^{p} a_k z^(-k) is the transfer function of the linear predictor as a pre-filter. Here the error signal E(z) is represented by the product of the input signal X(z) and the transfer function of the FIR predictor filter H(z). From the transfer function H(z), it can be seen that it is an all-zero filter. It is also called an analysis filter in linear predictive coding.
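The analysis filter of equation (11) can be sketched as follows (illustrative Python rather than the project's MATLAB code; samples before the start of the frame are simply taken as zero here):

```python
def fir_prefilter(x, a):
    """FIR analysis filter: e(n) = x(n) - sum_{k=1..p} a[k-1] * x(n-k)."""
    p = len(a)
    e = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k]
                   for k in range(p) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e

# With a single coefficient a1 = 1 the predictor reduces to a simple
# first difference: e(n) = x(n) - x(n-1).
print(fir_prefilter([1, 2, 3, 4], [1]))  # prints [1, 1, 1, 1]
```

Note how the error sequence is much "flatter" than the ramp at the input, which is why it needs fewer bits to code.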
3.3 Linear IIR Predictor as Post-filter
The basic infinite impulse response (IIR) linear predictor is used for the psychoacoustic post-filter at the decoder side, and its filter response is the inverse of that of the pre-filter. The linear IIR predictor takes the error signal generated at the encoder side and forms predictions from previous output samples. To reconstruct the original signal, the predicted values from previous samples are added to the values of the error signal [6].
[Structure: the error input e(n) is added to the prediction x̂(n), formed from the delayed outputs y(n-1), …, y(n-p) weighted by coefficients a1, a2, …, ap, to produce the output y(n)]

Figure 11. IIR linear predictor structure [6]
Figure 11 shows the basic structure of the IIR linear predictor. In this structure, previous output samples are used to predict the current output sample. Here, the reconstructed signal is a function of the error signal and the predicted output samples.

y(n) = e(n) + Σ_{k=1}^{p} a_k y(n-k)   (15)

Taking the z-transform of the above equation,

Y(z) = E(z) + Σ_{k=1}^{p} a_k Y(z) z^(-k)   (16)

Y(z) [1 - Σ_{k=1}^{p} a_k z^(-k)] = E(z)   (17)

Y(z)/E(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^(-k)) = 1/H(z)   (18)

The above equation represents the transfer function of the linear predictor at the decoder side. The post-filter transfer function is the inverse of the pre-filter transfer function.
This transfer function represents an all-pole filter. It is also defined as a synthesis filter in linear predictive coding since it is used to reproduce speech signals at the decoder side.
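A matching sketch of the synthesis filter of equation (15) follows (again illustrative Python; in the absence of quantization it exactly inverts the pre-filter, since 1/H(z) cancels H(z)):

```python
def iir_postfilter(e, a):
    """IIR synthesis filter: y(n) = e(n) + sum_{k=1..p} a[k-1] * y(n-k)."""
    p = len(a)
    y = []
    for n in range(len(e)):
        pred = sum(a[k] * y[n - 1 - k]
                   for k in range(p) if n - 1 - k >= 0)
        y.append(e[n] + pred)
    return y

# Feeding the first-difference error sequence from the pre-filter
# example back through the post-filter recovers the original ramp.
print(iir_postfilter([1, 1, 1, 1], [1]))  # prints [1, 2, 3, 4]
```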
3.4 Calculation of LPC Filter Coefficients
In a normal linear predictor, the coefficients are obtained from the signal for each frame, and they are calculated by solving a set of linear equations that minimize the prediction error. These coefficients make the linear predictor's response vary with time according to the signal. The calculated coefficients are transmitted with the coded speech so that they can be used to reconstruct the signal at the receiver side [9].
Figure 12 shows the method used to determine the linear predictor coefficients from the masking threshold of each frame of the signal. The psychoacoustic pre- and post-filters are linear predictors, but they are adapted by the masking threshold generated by a psychoacoustic model. The masking threshold is estimated for each frame of the signal, and the filter coefficients are determined from the masking threshold spectrum of the signal, whereas in a normal linear predictor they are determined from the signal spectrum [6].
[Flow: audio signal → psychoacoustic model → masking threshold → IFFT to obtain autocorrelation values → Levinson's recursion → linear predictor coefficients]
Figure 12. Determination of linear predictor coefficients [6]
Normally, predictor coefficients for linear prediction are extracted from the signal spectrum through the autocorrelation values of the signal. Here, the autocorrelation values are instead obtained from the masking threshold of the signal, so that the predictor coefficients represent an approximation of the masking threshold rather than of the signal.
According to [8], linear predictor coefficients are obtained such that the predicted sample value is close to the original sample; therefore, prediction error should be small.
The overall error is defined as the sum of the squared errors over the samples of a frame:

E = Σ_n e²(n)    (19)

From equation (11), the prediction error is

e(n) = x(n) − Σ_{k=1}^{p} a_k x(n−k)    (20)

so that

E = Σ_n [ x(n) − Σ_{k=1}^{p} a_k x(n−k) ]²    (21)

To get the filter coefficients, the derivative of E is taken with respect to each coefficient a_k, where k = 1, 2, …, p, and set to zero:

dE/da_1 = 0, dE/da_2 = 0, …, dE/da_p = 0

These linear equations can be written as the normal equations

Σ_{k=1}^{p} a_k R(|i − k|) = R(i),  i = 1, 2, …, p    (22)

where R(k) = Σ_n x(n) x(n−k) is the auto-correlation function of the sequence x(n).
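As an illustrative cross-check (a Python/NumPy sketch, not part of the project's MATLAB code; `autocorr` and `lpc_normal_equations` are hypothetical helper names), the normal equations of equation (22) can be formed from sample auto-correlations and solved directly:

```python
import numpy as np

def autocorr(x, p):
    """Biased sample auto-correlation R(0)..R(p) of one signal frame."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])

def lpc_normal_equations(x, p):
    """Form the Toeplitz system of equation (22) and solve it directly."""
    r = autocorr(x, p)
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

# Sanity check: a first-order autoregressive signal x(n) = 0.9 x(n-1) + w(n)
# should yield a_1 close to 0.9 and a_2 close to 0.
rng = np.random.default_rng(0)
x = np.zeros(10000)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
a = lpc_normal_equations(x, 2)
```

Direct solution costs O(p³); the Toeplitz structure of R is what Levinson's recursion below exploits to reduce this to O(p²).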
For this project, a predictor order of p = 10 is used, so the above equation becomes a system of ten linear equations in the ten unknowns a_1, …, a_10:

Σ_{k=1}^{10} a_k R(|i − k|) = R(i),  i = 1, 2, …, 10    (23)

The above system can be represented compactly as

R · a = r

where a = [a_1, a_2, …, a_10]ᵀ, r = [R(1), R(2), …, R(10)]ᵀ, and R is the symmetric Toeplitz matrix whose (i, k) entry is R(|i − k|):

R = [ R(0) R(1) … R(9); R(1) R(0) … R(8); … ; R(9) R(8) … R(0) ]    (24)
According to [7], [9], the above system is solved for the optimum predictor coefficients by Levinson's theorem, because it requires less computation than other methods. From equation (11), the sum of squared errors is

E = Σ_n e²(n) = Σ_n [ x(n) − Σ_{k=1}^{p} a_k x(n−k) ]²    (25)

Solving this for a pth-order predictor gives the minimum prediction error

E_p = R(0) − a_1 R(1) − a_2 R(2) − … − a_p R(p)    (26)

E_p = R(0) − Σ_{k=1}^{p} a_k R(k)    (27)

To compute the pth-order coefficients, the following steps are performed recursively for p = 1, 2, …, P:

(i) B_p = R(p) − Σ_{i=1}^{p−1} a_i^(p−1) R(p−i)

(ii) a_p^(p) = k_p = B_p / E_{p−1}

(iii) a_i^(p) = a_i^(p−1) − k_p a_{p−i}^(p−1), where i = 1, 2, …, p−1

(iv) E_p = E_{p−1} (1 − k_p²)

The final solution for the LPC coefficients is given by a_i = a_i^(P), where i = 1, 2, …, P. For each analysis frame, the psychoacoustic model calculates the masking threshold of the audio signal. According to [6], the masking threshold is expressed in the frequency domain as a power spectral density of the signal, and the auto-correlation function of a signal can be obtained from its power spectral density.
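The recursion in steps (i)–(iv) can be written compactly. The following is a Python sketch (the project itself implements this in MATLAB as levi.m; the function name `levinson` is illustrative):

```python
import numpy as np

def levinson(r, p):
    """Levinson-Durbin recursion following steps (i)-(iv) in the text:
    returns predictor coefficients a_1..a_p and the final error E_p,
    given auto-correlation values r[0]..r[p]."""
    a = np.zeros(p + 1)          # a[i] holds a_i; a[0] is unused
    E = r[0]                     # E_0 = R(0)
    for m in range(1, p + 1):
        # steps (i)-(ii): B_m and the reflection coefficient k_m
        B = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k = B / E
        # step (iii): update the lower-order coefficients
        prev = a.copy()
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        a[m] = k
        # step (iv): update the prediction error
        E = E * (1 - k * k)
    return a[1:], E

# Exact AR(1) auto-correlation with coefficient 0.9: R(k) = 0.9^k
a, E = levinson([1.0, 0.9, 0.81], 2)   # a -> [0.9, 0.0], E -> 0.19
```

Each order m reuses the order m−1 solution, which is what makes the recursion cheaper than solving the full Toeplitz system from scratch.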
To convert the masking threshold from the frequency domain to the time domain, the Inverse Fast Fourier Transform (IFFT) of the masking threshold is performed, using the built-in IFFT function in MATLAB. By the Wiener–Khinchin theorem, the inverse Fourier transform of a power spectral density gives the auto-correlation function. When taking the IFFT of the masking threshold, only the real part of the result is kept, because only the real part is needed to approximate the predictor coefficients. After the auto-correlation values for the masking threshold are obtained, Levinson's theorem is used to determine the linear predictive coefficients.
Chapter 4
MATLAB IMPLEMENTATION
The audio coder using a perceptual linear predictive filter is implemented in MATLAB.
For this project, the perceptual model is the basic psychoacoustic model for MPEG Layer 1, taken from Rice University. It is used to determine the overall masking threshold of the audio signal. The psychoacoustic model takes the current portion of the signal and performs a Fast Fourier Transform (FFT), converting the signal from the time domain to the frequency domain. To determine the masking effect, tone maskers, noise maskers, and the overall maskers within each critical band are calculated.
After considering the overall effects of the maskers, the masking threshold is obtained for each window frame of the input signal.
In the encoder, the signal is analyzed over a 2048-point Fast Fourier Transform (FFT) length, and the psychoacoustic model provides the masking threshold for each frame. This FFT length gives a large number of samples for the masking threshold; therefore, a close approximation of the predictor coefficients is possible for each frame of the signal. The linear predictor codes the audio signal using these predictor coefficients. At the encoder side, the following steps are performed to code the signal.
• A window size of 25 ms is used in the encoder as well as in the decoder because it provides enough resolution to perform analysis for speech and audio signals.
• A processing window size of 5 ms is used in the encoder as well as in the decoder to update the predictor coefficients within each window frame.
• First, the input audio signal is read from a WAV (Waveform Audio Format) file for processing by the perceptual model.
• Initially, for the first window, the masking threshold is calculated and used to estimate the predictor coefficients for that signal frame. The linear prediction filter processes that portion of the signal using these predictor coefficients.
• Once enough data is available to process, the current window is shifted by the window size of 25 ms.
• The masking threshold is calculated for the current window of the signal.
• The calculated masking threshold is used to get the linear predictor coefficients for the current portion of the signal.
• The estimated coefficients are stored in a matrix for each 5 ms processing window within the current window.
• The pre-filter takes the current sample, the predictor coefficients, and the previous values stored in a buffer to compute the filter output.
• At every processing window, the predictor coefficients are updated and stored in a matrix.
• The above steps are repeated over the length of the input signal.
• For each frame of the signal, the estimated predictor coefficients are stored in a MAT (MATLAB) file so the predictive filter at the decoder side has access to them.
• Once the pre-filter output is computed, it is stored in a WAV file.
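The encoder steps above can be sketched as the following frame loop (a Python outline, not the project's MATLAB code; `masking_threshold` and `threshold_to_lpc` are placeholders for the project's findThreshold and getMTest routines, and the special handling of the very first window is simplified away):

```python
import numpy as np

def encode(x, fs, masking_threshold, threshold_to_lpc, p=10):
    """Frame loop of the encoder: 25 ms analysis window, coefficients
    refreshed every 5 ms, pre-filter e(n) = x(n) - sum_k a_k x(n-k)."""
    ps = round(fs * 0.005)            # 5 ms processing step
    ws = round(fs * 0.025)            # 25 ms analysis window
    y = np.zeros(len(x))
    buff = np.zeros(p)                # delay chain of previous input samples
    coeffs = []                       # one coefficient set per processing step
    for start in range(0, len(x) - ws + 1, ps):
        th = masking_threshold(x[start:start + ws])   # psychoacoustic model
        a = threshold_to_lpc(th, p)                   # threshold -> LPC coefficients
        coeffs.append(a)
        for i in range(start, start + ps):            # filter the next 5 ms
            y[i] = x[i] - np.dot(a, buff)
            buff = np.concatenate(([x[i]], buff[:-1]))
    return y, np.array(coeffs)

# With all-zero coefficients the pre-filter is transparent over the
# processed samples, which makes the bookkeeping easy to check.
fs = 1000
x = np.arange(100.0)
y, A = encode(x, fs, lambda frame: None, lambda th, p: np.zeros(p))
```

The returned coefficient matrix plays the role of the A matrix that encode.m saves to filter_coeffs.mat for the decoder.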
In the predictor coefficient estimation, the masking threshold spectrum is represented in the frequency domain and is defined for positive as well as negative frequencies. To get a real-valued estimate of the coefficients, the negative frequencies are reconstructed by mirroring the masking threshold spectrum. The Inverse Fast Fourier Transform (IFFT) is then performed on the power spectrum of the masking threshold to convert it from the frequency domain to the time domain. While converting the masking threshold spectrum to time-domain values, only the real part is used. These values serve as the auto-correlation values for the masking threshold of the signal. In this project, Levinson's theorem is implemented in MATLAB and used to estimate the linear predictive coefficients from the auto-correlation values of the masking threshold. The estimated coefficients generate a predictor response that follows the masking threshold of the signal.
At the decoder side, the linear predictive filter is used to decode or reconstruct the
signal. In decoder implementation, the following steps are performed to decode the
signal.
• The processed or coded audio signal is read from a WAV (Waveform Audio Format) file.
• The predictor coefficients from the MAT (MATLAB) file are stored in a matrix. These coefficients correspond to the masking threshold of the input audio signal generated by the psychoacoustic model.
• The initial window of the signal is processed. The initial predictor coefficients are taken from the matrix to generate the post-filtered signal and the initial data for the next window of the signal.
• After the initial window is processed, the current window is shifted by the window size (25 ms) to process the signal.
• Within each current window, predictor coefficients are extracted from the matrix every 5 ms and the post-filter is updated with them.
• For each window, the post-filter takes the current sample, the predictor coefficients, and previous samples to determine the output sample.
• After processing the signal, the post-filter output is computed over the length of the input signal.
• The reconstructed signal is stored in an audio (WAV) file.
After performing the post-filter operation, the output audio signal is played back and compared with the original audio signal. The output audio signal quality is the same as that of the original signal.
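The pre-/post-filter relationship can be demonstrated with a simplified filter pair using fixed coefficients (a Python sketch mirroring the structure of prefilter.m and postfilter.m in the Appendix; the project's 5 ms coefficient updates are omitted here):

```python
import numpy as np

def prefilter(x, a):
    """FIR analysis filter: e(n) = x(n) - sum_k a_k x(n-k)."""
    buff = np.zeros(len(a))
    e = np.zeros(len(x))
    for n in range(len(x)):
        e[n] = x[n] - np.dot(a, buff)
        buff = np.concatenate(([x[n]], buff[:-1]))   # delay past inputs
    return e

def postfilter(e, a):
    """Inverse IIR synthesis filter: x(n) = e(n) + sum_k a_k x(n-k)."""
    buff = np.zeros(len(a))
    x = np.zeros(len(e))
    for n in range(len(e)):
        x[n] = e[n] + np.dot(a, buff)
        buff = np.concatenate(([x[n]], buff[:-1]))   # delay past outputs
    return x

a = np.array([0.9, -0.2])                  # example predictor coefficients
rng = np.random.default_rng(1)
x = rng.standard_normal(256)
x_rec = postfilter(prefilter(x, a), a)     # exact reconstruction
```

With identical coefficients on both sides, the cascade A(z) · 1/A(z) is the identity, which is why the reconstructed signal matches the input so closely.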
Chapter 5
SIMULATION RESULTS
To verify the project design, the input signal is taken from a WAV (Waveform Audio Format) file recorded in a quiet room with no noise effects.
Figure 13. Input signal (amplitude in volts versus time in seconds)
Figure 13 shows the input signal at the encoder side. The signal is sampled at 16 kHz and is taken from the ARCTIC database of the CMU (Carnegie Mellon University) speech group. The audio signal has audible quality.
Figure 14. Pre-filtered signal (amplitude in volts versus time in seconds)
Figure 14 shows the output signal from the pre-filter. It contains less information than the original signal, since information has been removed based on the perceptual model.
The masking threshold of the input signal is extracted by the psychoacoustic model, and linear predictor coefficients are estimated from it to generate the filter's frequency response. The pre-filtered signal is thus coded according to perceptual criteria.
Figure 15. Reconstructed signal from post-filter (amplitude in volts versus time in seconds)
Figure 15 shows the reconstructed signal from the post-filter, and it is similar to the input signal. The post-filter generates a frequency response inverse to that of the pre-filter to decode the signal, using the same filter coefficients that the pre-filter used for that particular frame of the signal. The predictor coefficients are stored in a MAT file so the post-filter has access to them to generate a frequency response according to the perceptual criteria and reconstruct the signal.
Figure 16. Spectrogram for masking threshold of input signal (frequency in Hz versus time in seconds)
Figure 16 shows the masking threshold of the input signal extracted by the psychoacoustic model. The spectrogram shows the magnitude levels of the masking threshold at each frequency, using color to represent magnitude; the threshold is extracted for each frame of the signal. Red refers to the highest magnitude level of the masking threshold, while blue refers to the lowest.
Figure 17. Spectrogram for estimated masking threshold (frequency in Hz versus time in seconds)
Figure 17 shows the estimated masking threshold of the input signal, which is also the frequency response of the linear predictor. The color bars show how well the LPC coefficients approximate the masking threshold of the input signal. Red refers to the highest magnitude level of the estimated masking threshold, while blue refers to the lowest.
Figure 18. Masking threshold and LPC response for one frame of signal (magnitude in dB versus frequency in Hz; legend: masking threshold in red, estimated magnitude response in blue)
Figure 18 shows the masking threshold for one of the frames of the audio signal.
The red plot shows the extracted masking threshold, while the blue plot shows the response of the linear predictor for that frame of the input signal. The plot shows that the predictor response closely follows the extracted masking threshold, so the approximation of the LPC coefficients is accurate.
Chapter 6
CONCLUSION
The implementation of an audio coder based on perceptual linear predictive coding represents another way to code an audio signal. This project combines a perceptual model and a linear predictive technique to perform audio coding, and it was successfully implemented in MATLAB. The linear predictor coefficients are obtained using Levinson's theorem. During coding, information is removed from the audio signal based on the perceptual model, achieving irrelevance reduction. After decoding, the reconstructed audio signal has the same quality as the original audio signal.
This project provided me a great opportunity to study and learn the fundamentals of digital signal processing. During implementation, I used these fundamentals to get the best results for this project, and I finished all the requirements for its success. Through this project, I learned the fundamentals of a psychoacoustic model and how it can be combined with a linear prediction filter to process an audio signal.
6.1 Future Improvements
This project uses a perceptual model that removes irrelevant information from an audio signal. Redundancy reduction is still needed to achieve further compression, so this design can be extended or combined with appropriate quantization and coding techniques to reach a lower bit rate while maintaining signal quality. Such a coding technique can be applied in either the frequency domain or the time domain. In this project, the filter coefficients are not transmitted to the decoder side. To control the post-filter according to perceptual criteria in another way, the coefficients can be quantized, efficiently coded, and transmitted along with the coded signal as side information.
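The side-information idea can be sketched with a uniform scalar quantizer (illustrative only: the project does not implement this, and practical coders quantize reflection coefficients or line spectral frequencies rather than raw predictor coefficients; the range limits here are arbitrary assumptions):

```python
import numpy as np

def quantize(a, bits=8, lo=-2.0, hi=2.0):
    """Uniform scalar quantizer: map each coefficient to an integer index."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    idx = np.clip(np.round((a - lo) / step), 0, levels - 1).astype(int)
    return idx, step

def dequantize(idx, step, lo=-2.0):
    """Reconstruct coefficient values from the transmitted indices."""
    return lo + idx * step

a = np.array([1.2, -0.5, 0.03])       # one frame of predictor coefficients
idx, step = quantize(a)               # indices would be sent as side information
a_hat = dequantize(idx, step)         # decoder-side reconstruction
```

The reconstruction error is bounded by half a quantization step, so the bit allocation directly trades side-information rate against filter accuracy.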
APPENDIX
encode.m
% It reads the audio signal from a wave file and codes the input signal
% according to perceptual criteria using the linear predictive filter (pre-filter).
% function [y, A, fs] = encode_file (filename, plotting)
filename = 'cmu_us_arctic_slt_a0001.wav';
plotting = 1;

% if plotting is not set, no plotting should be done
if nargin == 1
    plotting = 0;
end

% create figure for masking threshold and estimate filter evolution
if plotting == 2
    fig = figure;
end

% reading the signal from the file
[x, fs] = wavread (filename);

%----------------------------------------------------------------------
% system initialization

% assigning the number of points over which the FFT is computed
FFTlength = 2048;

% set the LPC analysis order
p = 10;

% computing the processing step, window size and frequency values (in Hz
% and Barks)
ps = round (fs * 0.005);   % 5 ms
ws = round (fs * 0.025);   % 25 ms
N = length (x);            % the length of the signal in samples
f = fs/FFTlength:fs/FFTlength:fs/2;
b = hz2bark (f);

% initializing the filter coefficients matrix
A = zeros (floor ((N-ws)/ps)+1, p+1);

% if plotting is enabled, initialize the filter response and MT
% spectrograms
if plotting
    H = [];
    TH = [];
end

%----------------------------------------------------------------------
% estimating initial values for the pre-filter

% selecting the first window from the signal
crtW = x (1:ws);

% extracting the masking threshold
th = findThreshold (crtW, FFTlength, f);

% get an estimate filter for the extracted threshold
a = getMTest (th, p, f);

% save the filter coefficients into matrix A
A (1, :) = a;

% if plotting is enabled, save the filter response and the MT
if plotting
    H = [H 10*log10 (abs (freqz (1, a, FFTlength/2)))];
    TH = [TH; th];
end
%----------------------------------------------------------------------
% initialize the channel output
y = zeros (size (x));

%----------------------------------------------------------------------
% start processing the signal

% initialize the filter delay chains
prebuff = zeros (p+1, 1);

% first process and transmit the initial frame
for i=1:ws
    % computing the prefilter output
    [y (i), prebuff] = prefilter (x (i), prebuff, a);
    crtW = [crtW (2:ws); x (i)];
end

% sufficient new data has been gathered and now the normal processing steps
% can be taken
lastUpdated = 1;
crtW = [crtW (ps+1:ws); x (i+1:i+ps)];
th = findThreshold (crtW, FFTlength, f);
a = getMTest (th, p, f);
A (floor ((i-ws)/ps)+1, :) = a;

% if plotting is enabled, save the filter response and the MT
if plotting
    H = [H 10*log10 (abs (freqz (1, a, FFTlength/2)))];
    TH = [TH; th];
end

for i=i+1:N-ps
    % computing the prefilter output
    [y (i), prebuff] = prefilter (x (i), prebuff, a);

    % adding new values to the analysis window
    % crtW = [crtW (2:ws); x (i)];

    % if the coefficients were last updated ps samples ago, it's time to
    % compute new coefficients
    if lastUpdated == ps
        lastUpdated = 0;
        crtW = [crtW (ps+1:ws); x (i+1:i+ps)];
        % th = get_AMT (crtW, FFTlength, f, b);
        th = findThreshold (crtW, FFTlength, f);
        a = getMTest (th, p, f);
        A (floor ((i-ws)/ps)+2, :) = a;

        % if plotting is enabled, save the filter response and the MT
        if plotting
            h = 10*log10 (abs (freqz (1, a, FFTlength/2)).^2);
            H = [H h-mean (h)];
            TH = [TH; th-mean (th)];
            if plotting == 2
                figure (fig);
                plot (f, th-mean (th), 'r', f, h-mean (h), 'b');
                pause (0.1);
            end
        end
    end
    lastUpdated = lastUpdated+1;
end
for i=i+1:N
    % computing the prefilter output
    [y (i), prebuff] = prefilter (x (i), prebuff, a);
end

time = (1:N)/fs;
wavwrite (y, fs, [filename (1:end-3) 'pre.wav']);

% LPC coefficients are stored for each frame of the signal and updated for
% every processing step within the window; the file passes them to the
% post-filter.
save ('filter_coeffs.mat', 'A');

if plotting
    close all;
    figure;
    plot (time, x), title ('Original signal'), xlim ([time (1) time (N)]);
    plot (time, y), title ('Pre-filtered signal'), xlim ([time (1) time (N)]);
    figure;
    imagesc (time, f, TH');
    title ('Masking threshold spectrogram');
    xlim ([time (1) time (N)]);
    ylim ([f (1) f (FFTlength/2)]);
    imagesc (time, f, H);
    title ('Masking threshold estimate spectrogram');
    xlim ([time (1) time (N)]);
    ylim ([f (1) f (FFTlength/2)]);
end
prefilter.m
function [y, buff] = prefilter (x, buff, a)

% determine the filter order
p = length (a) - 1;

% push the current value in the delay chain
buff = [x; buff (1:p)];

% compute the output sample
y = a*buff;
getMTest.m

function [a, e] = getMTest (th, p, f)

% normalize and mirror the threshold to cover negative frequencies
th = th - min (th);
th = [th fliplr (th)];
% plot (th);
% pause (1);

% auto-correlation values from the IFFT of the power spectrum
acf = real (ifft (power (10, th/10)));

% estimate the LPC coefficients with Levinson's recursion
[a e] = levi (acf (1:round (length (acf)/2)), p);
% [b a] = invfreqz (th, f/max(f)*pi, 1, p);
decode.m

% function y = decode (x, A, fs, plotting)
clear all;
filename = 'cmu_us_arctic_slt_a0001.pre.wav';
[x, fs] = wavread (filename);
load ('filter_coeffs.mat');

% if plotting is not set, no plotting should be done
% if nargin == 3
plotting = 1;
% end

%----------------------------------------------------------------------
% system initialization

% set the LPC analysis order
p = 10;

% computing the processing step, window size and frequency values (in Hz
% and Barks)
ps = round (fs * 0.005);   % 5 ms
ws = round (fs * 0.025);   % 25 ms
N = length (x);            % the length of the signal in samples

%----------------------------------------------------------------------
% initialize the postfilter output
y = zeros (size (x));

%----------------------------------------------------------------------
% start processing the signal

% initialize the filter delay chains
postbuff = zeros (p+1, 1);

% get the first filter coefficients
a = A (1,:);

% process the first frame
for i=1:ws
    % compute the post-filter output
    [y (i), postbuff] = postfilter (x (i), postbuff, a);
end

lastUpdated = 1;
a = A (2, :);

for i=i+1:N-ps
    % computing the postfilter output
    [y (i), postbuff] = postfilter (x (i), postbuff, a);

    % if the coefficients were last updated ps samples ago, it's time to
    % compute new coefficients
    if lastUpdated == ps
        lastUpdated = 0;
        a = A (floor ((i-ws)/ps)+2, :);
    end
    lastUpdated = lastUpdated+1;
end

for i=i+1:N
    % computing the postfilter output
    [y (i), postbuff] = postfilter (x (i), postbuff, a);
end

time = (1:N)/fs;

if plotting
    figure;
    plot (time, x), title ('Pre-filtered signal'), xlim ([time (1) time (N)]);
    plot (time, y), title ('Reconstructed signal from post-filter'), xlim ([time (1) time (N)]);
end

wavwrite (y, fs, [filename (1:end-7) 'post.wav']);
levi.m

% It takes the predictor order and the auto-correlation values for each frame
% of the signal, and calculates the LPC coefficients using Levinson's recursion.
function [a, E] = levi(r, p)

% initializing the LP coefficients
a = zeros(1, p+1);
a(1) = 1;

% running the first iteration
k = -r(2) / r(1);
a(2) = k;
E = r(1) * (1-k^2);

% iterating the remaining steps
for i=2 : p
    % setting the j index
    j = 1:i-1;

    % computing the reflection coefficient
    k = r(i+1);
    k = k + sum(a(j+1) .* r(i-j+1));
    k = -k/E;

    % saving the previous LP coefficients
    ap = a;

    % computing the new LP coefficients from 1 to i-1
    a(j+1) = ap(j+1) + k * ap(i-j+1);

    % saving the reflection coefficient as the i'th LP coef
    a(i+1) = k;

    % computing the new prediction gain
    E = E * (1-k^2);
end
postfilter.m

% It takes the stored LPC coefficients from the matrix for each frame of the
% signal and generates a frequency response inverse to the pre-filter, using
% previous output samples to decode the signal.
function [y, buff] = postfilter (x, buff, a)

% determine the filter order
p = length (a) - 1;

% compute the output sample
y = x - [a (2:end) 0]*buff;

% push the current value in the delay chain
buff = [y; buff (1:p)];
REFERENCES
[1] A. Gersho, "Advances in Speech and Audio Compression," Proceedings of the IEEE, Vol. 82, Issue 6, 1994, pp. 900-918.
[2] B. Edler and G. Schuller, "Audio Coding Using a Psychoacoustic Pre- and Post-filter," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000, pp. 881-884.
[3] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. New York: Springer, 1999, Ch. 2, 4, and 6.
[4] T. Painter and A. Spanias, "Perceptual Coding of Digital Audio," Proceedings of the IEEE, Vol. 88, No. 4, April 2000, pp. 451-512.
[5] J. Bradbury, "Linear Predictive Coding," December 5, 2000.
[6] B. Edler, C. Faller, and G. Schuller, "Perceptual Audio Coding Using a Time-Varying Linear Pre- and Post-filter," Proceedings of the AES Convention, Los Angeles, CA, September 2000.
[7] Nagarajan and S. Sankar, "Efficient Implementation of Linear Predictive Coding Algorithms," Proceedings of the IEEE Southeastcon, 1998, pp. 69-72.
[8] J. Makhoul, "Linear Prediction: A Tutorial Review," Proceedings of the IEEE, Vol. 63, No. 4, April 1975.
[9] D. O'Shaughnessy, "Linear Predictive Coding," IEEE Potentials, Vol. 7, Issue 1, 1988, pp. 19-32.