NORTHWESTERN UNIVERSITY

Computational Auditory Scene Induction

A DISSERTATION

SUBMITTED TO THE GRADUATE SCHOOL

AND THE DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER

SCIENCE

OF NORTHWESTERN UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

for the degree

DOCTOR OF PHILOSOPHY

Field of Computer Science

By

Jinyu Han

EVANSTON, ILLINOIS

August 2012

© Copyright by Jinyu Han 2012

All Rights Reserved

ABSTRACT

Computational Auditory Scene Induction

Jinyu Han

Real-world sound is a mixture of different sources. The sound scene of a busy coffeehouse, for example, usually consists of several conversations, music playing, laughter and maybe a baby crying, a door being slammed, different machines operating in the background, and more. When humans are confronted with these sounds, they rapidly and automatically adjust themselves in this complex sound environment, paying attention to the sound source of interest. This ability has been labeled in psychoacoustics under the name of Auditory Scene Analysis (ASA).

The counterpart to ASA in machine listening is called Computational Auditory Scene Analysis (CASA) - the effort to build computer models that perform auditory scene analysis. Research on CASA has led to great advances in machine systems capable of analyzing complex sound scenes, such as audio source separation and multiple pitch estimation. Such systems often fail, however, in the presence of corrupted or incomplete sound scenes. In a real-world sound scene, different sounds overlap in time and frequency, interfering with and canceling each other. Sometimes, the sound of interest may have some critical information missing entirely; examples include an old recording from a scratched CD or a band-limited telephone speech signal. In a real world filled with incomplete sounds, the human auditory system has the ability, known as Auditory Scene Induction (ASI), to estimate the missing parts of a continuous auditory scene briefly covered by noise or other interference, and to perceptually resynthesize them. Since humans are able to infer the missing elements in an auditory scene, it is desirable for machine systems to have the same function. However, there have been very few efforts in computer audition to computationally realize this ability.

This thesis focuses on the computational realization of auditory scene induction - Computational Auditory Scene Induction (CASI). More specifically, the goal of my research is to build computer models that are capable of resynthesizing the missing information of an audio scene. Building upon existing statistical models (NMF, PLCA, HMM and N-HMM) for audio representation, I formulate this ability as a model-based spectrogram analysis and inference problem under the expectation–maximization (EM) framework with missing data in the observation. Various sources of information, including the spectral and temporal structure of audio and top-down knowledge about speech, are incorporated into the proposed models to produce accurate reconstructions of the missing information in an audio scene. The effectiveness of the proposed machine systems is demonstrated on three audio tasks: singing melody extraction, audio imputation and audio bandwidth expansion. Each system is assessed through experiments on real-world audio data and compared to the state of the art. Although far from perfect, the proposed systems show many advantages and significant improvements over existing systems. In addition, this thesis shows that different applications related to missing audio data can be considered under the unified framework of CASI. This opens a new avenue of research in the computer audition community.

Acknowledgements

First and foremost, I would like to thank my advisor, Professor Bryan Pardo, for creating the group in which I was able to do this work, for inviting me to join his little ensemble back in 2007, and for supporting me since then. Bryan opened the door for me to a whole new world of knowledge and practice. Without his unabated trust and unwavering commitment to providing me a creative and protected environment, this work would not have been accomplished. His passion for scientific exploration and his philosophy of research will continue to inspire me in the future.

I owe an immense amount of gratitude to Gautham J. Mysore, who has been an excellent mentor and collaborator over the last year. He sets an example as a scholar and taught me the qualities a researcher should possess, for which I am particularly grateful. He has taught me a great deal about research, from general approaches to problem solving to specifics of machine learning and signal processing.

Special thanks go to my thesis readers, Jorge Nocedal and Thrasyvoulos N. Pappas, for serving on my dissertation committee and for providing valuable feedback on this dissertation. Their insightful reading and suggestions on my original proposal have greatly improved the final work. I thank Professor Thrasyvoulos N. Pappas for his enjoyable class on Digital Signal Processing, which built the foundations of my thesis work. I thank Professor Jorge Nocedal for his excellent lectures, from which I learned a great deal about optimization and machine learning. I am also grateful to Professor Doug Downey for participating in my PhD qualifying exam.

I would like to thank all of the members of the Media Technology Lab, Gracenote. I am extremely grateful to Markus Cremer and Bob Coovor for their inspiration and encouragement in research and my personal life. Special thanks go to Ching-Wei Chen, with whom collaboration has been a great joy.

I would like to thank my wonderful former and present labmates who make the Interactive Audio (IA) Lab a pleasant place to work. Particular honors go to Zhiyao Duan, Zafar Rafii, Mark Cartwright, David Little, and Michael Skalak, with whom I have had particularly enlightening discussions and fruitful collaborations. Without John Woodruff’s foundational work, my research would have been much more difficult. Many thanks also go to Arefin Huq, Rui Jiang, Sara Laupp, Anda Bereczky and Dominik Kaeser for making my time at the IA Lab particularly enjoyable.

I would like to thank Prof. Yuan Dong for giving me my first opportunity to conduct research and for encouraging me to pursue graduate study. It was at his lab at Orange Labs (Beijing), France Telecom, that I discovered my love and passion for audio-related research.

I would also like to acknowledge the financial support provided to me through two NSF grants (IIS-0812314 and IIS-0643752).

I dedicate this thesis to Jiayi Han, Feng Li and Jin Xu

Table of Contents

ABSTRACT

Acknowledgements

List of Tables

List of Figures

Chapter 1. Introduction
1.1. Contribution
1.2. Outline
1.3. Structure in Audio
1.4. Auditory Scene Analysis and Induction
1.5. Motivation

Chapter 2. Singing Melody Extraction
2.1. Related work
2.2. Modeling of Audio
2.3. System description
2.4. Illustrative example
2.5. Experiment
2.6. Contributions and Conclusion

Chapter 3. Audio Imputation
3.1. Related work
3.2. Non-negative Hidden Markov Model
3.3. Audio Imputation by Non-negative Spectrogram Factorization
3.4. System description
3.5. Experiment
3.6. Contribution and Conclusion

Chapter 4. Language Informed Audio Bandwidth Expansion
4.1. Related work
4.2. System Overview
4.3. Word Models
4.4. Speaker Level Model
4.5. Estimation of incomplete data
4.6. Experimental results
4.7. Contribution and Conclusion

Chapter 5. Conclusion and Future Research
5.1. Future Directions

References

List of Tables

2.1 The expectation–maximization (EM) algorithm of PLCA learning

2.2 Performance comparison of the proposed algorithm against DHP and LW, averaged across 9 songs of 270 seconds from the MIREX melody extraction dataset.

3.1 The parameters of the Non-negative Hidden Markov Model. These parameters can be estimated using the Expectation-Maximization algorithm. q and z range over the sets of spectral component indices and dictionary indices respectively. f ranges over the set of analysis frequencies in the FFT.

3.2 The generative process of an audio spectrogram using the N-HMM.

3.3 The EM process of N-HMM learning

3.4 Algorithm I for Audio Imputation

3.5 Algorithm II for Audio Bandwidth Expansion

3.6 Audio excerpts dataset used for evaluations

3.7 Performance of the audio imputation results by the proposed Algorithm I and PLCA. There is no statistical difference at a significance level of 0.05 between the two methods, with a p-value of 0.76.

3.8 Performance of the audio bandwidth expansion results by the proposed Algorithm II and PLCA. There is a statistical difference at a significance level of 0.05 between the two methods, with a p-value of 0.01.

4.1 Algorithm III for Language Informed Speech Bandwidth Expansion

4.2 Scale of the Mean Opinion Score used by the objective measure OVRL.

4.3 Performance of audio BWE results by the proposed method and PLCA in Con-A. Numbers in bold font indicate that the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.

4.4 Performance of audio BWE results by the proposed method and PLCA in Con-B. Numbers in bold font indicate that the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.

List of Figures

1.1 Illustration of the (a) waveform and (b) spectrogram of an audio clip of a male speaker saying, “She had your dark suit in greasy wash water all year”. The level of the signal at a given time-frequency bin is indicated by a color value as explained in the (c) colorbar.

1.2 Fragments that are parts of a number of familiar objects: character “B”. The fragments were obtained by taking the objects and laying an irregularly shaped mask over them. Then the parts that were underneath the mask were eliminated, leaving visible only those parts that had not been covered by the mask. a) Fragments do not organize themselves when there is no information for occlusion. b) The same fragments as shown in (a) except that information for occlusion has been added, causing the fragments on the boundaries of the occluding form to be grouped. (From [Bregman 90])

1.3 Illustration of Masking (adapted from [Bregman 90]) in idealized spectrograms. The vertical axis is frequency and the horizontal axis is time. Pattern A is the tonal sound that is softer. B is the sound mask that is louder. In the task of singing voice extraction, A can be viewed as the singing voice and B the accompaniment music.

1.4 Illusion of continuity in idealized spectrograms. The vertical axis is frequency and the horizontal axis is time. (a) The stimulus with gaps; (b) the stimulus when the gaps are filled with noise. (Adapted from [Bregman 90])

1.5 a) Original image. b) The region corresponding to the foreground person has been removed and filled in with synthesized textures. (From [Criminisi 04])

2.1 Illustration of NMF on the spectrogram of a speech clip, ‘Bad dog’. When NMF is applied to the spectrogram (with K = 4), four distinct spectral components are learned. Additionally, the weights of these spectral components at each time frame are learned.

2.2 Spectral components learned from singing voice (top), piano (middle) and snare drum (bottom).

2.3 Overview of the proposed Singing Melody Extraction System.

2.4 Illustration of the Accompaniment Model Training stage.

2.5 Illustration of the Accompaniment Reduction stage.

2.6 Melody extraction results on a clip of “Simple Man” by Lynyrd Skynyrd.

2.7 Melody detection result. Ground Truth (black solid lines) is obtained by applying the pitch tracker to the singing voice before mixing (Fig. 2.6 (b)). Estimation 1 (blue solid lines) is obtained by applying the same pitch tracker to the audio mixture directly (Fig. 2.6 (a)). Estimation 2 (red dots) is obtained by applying the same pitch tracker to the extracted singing voice (Fig. 2.6 (c)).

3.1 Comparison of dictionaries learned by non-negative models. PLCA uses a single large dictionary to explain a sound source, whereas the N-HMM uses multiple small dictionaries and a Markov chain. Here, each column represents a single spectral component in the dictionary.

3.2 A comparison between PLCA and the N-HMM. We start with a single large dictionary that is learned by PLCA (b) to explain everything in an audio spectrogram (a), and work up to several small dictionaries and a Markov chain (c) jointly learned from the given spectrogram by the N-HMM. Each dictionary corresponds to a state of the Markov chain.

3.3 Illustration of the N-HMM on the spectrogram of a speech clip, ‘Bad dog’, as shown in (a). ‘States’ represent small dictionaries learned by the N-HMM. In this example, four dictionaries (states) with five spectral components per state are learned, as shown in (d). The state posterior given the observation and the state transition matrix are plotted in (b) and (c) respectively.

3.4 Graphical model for an HMM with multiple draws at every time frame (from [Mysore 10]). {Q, F} is a set of random variables. vt represents the number of draws at time t from a distribution. The shaded variable indicates the observed variable.

3.5 Graphical model for the N-HMM ([Mysore 10]). {Q, Z, F} is a set of random variables. vt represents the number of draws at time t from a distribution. The shaded variable indicates observed data. Q and Z range over the sets of spectral component indices and dictionary indices respectively. F ranges over the set of analysis frequencies in the FFT.

3.6 General Procedure of Supervised Audio Imputation

3.7 Supervised Audio Imputation using an N-HMM

3.8 Example reconstruction of a music signal with a binary mask occluding roughly 50% of the samples. The first plot shows the original signal, the second plot shows the masked input we used for the reconstruction, the third plot shows the reconstruction using PLCA and the fourth one shows the reconstruction using our model.

3.9 Example reconstruction of a music signal with a binary mask occluding roughly 60% of the samples. The first plot shows the original signal, the second plot shows the masked input we used for the reconstruction, the third plot shows the reconstruction using PLCA and the fourth one shows the reconstruction using our model.

4.1 Procedure of the source-filter based BWE method illustrated in the frequency domain (adapted from [Kornagel 02]).

4.2 Block diagram of the proposed system. Our current implementation includes modules with solid lines. Modules with dashed lines indicate possible extensions in order to make the system more feasible for large vocabulary BWE.

4.3 Example of speech BWE. a) Original speech. b) Narrowband speech. Frequencies below 300 Hz or above 3400 Hz are removed. c) Result using PLCA. d) Result using the proposed method.

4.4 Example of speech BWE. The x-axis is time and the y-axis frequency. a) Original speech. b) Narrowband speech. The lower 1000 Hz of the spectrogram are removed. c) Result using PLCA. Regions marked with white-edge boxes are regions in which PLCA performed poorly. d) Result using the proposed method. The lower 4000 Hz are plotted in log scale.

4.5 Boxplot of audio BWE SNR results in Con-A (top plot) and Con-B (bottom plot). Each boxplot is generated from 500 SNR results from 10 speakers.

4.6 Boxplot of audio BWE Covl results in Con-A (top plot) and Con-B (bottom plot). Each boxplot is generated from 500 Covl results from 10 speakers.

4.7 SNR boxplot of audio BWE results for each speaker by the proposed method and PLCA in Con-A. Each boxplot is generated from 50 SNR results from one speaker.

4.8 Covl boxplot of audio BWE results for each speaker by the proposed method and PLCA in Con-A. Each boxplot is generated from 50 Covl results from one speaker.

4.9 SNR boxplot of audio BWE results for each speaker by the proposed method and PLCA in Con-B. Each boxplot is generated from 50 SNR results from one speaker.

4.10 Covl boxplot of audio BWE results for each speaker by the proposed method and PLCA in Con-B. Each boxplot is generated from 50 Covl results from one speaker.

CHAPTER 1

Introduction

In a concert, how can we quickly focus on the performer’s singing voice when it reaches our ears mixed with dozens of people’s conversations and the background music? On a busy street, how do we recognize a person’s speech over the phone when his or her voice is only partially transmitted by the communication channel and corrupted by street noise such as passing cars?

The first question relates to Auditory Scene Analysis (ASA) [Bregman 90] - the process by which the human auditory system organizes sound into perceptually meaningful auditory streams. An auditory stream is our perceptual grouping of the parts of the auditory scene that go together. With dozens of people speaking at the same time in a music venue (auditory scene), we are able to follow a particular singing voice even though others’ voices and accompaniment music are present. In this example, the ear is segregating the singing voice from other sounds, and the mind ‘streams’ these segregated sounds into different auditory streams.

The counterpart to the ability of auditory scene analysis in machines is called Computational Auditory Scene Analysis (CASA) [Wang 06]. Research on CASA has led to the development of many audio processing methods such as multipitch estimation and tracking, music transcription, source separation, and melody extraction.

The second question is related to another remarkable ability of the human auditory system, Auditory Scene Induction (ASI) [Warren 70, Warren 72] - the process by which the human auditory system resynthesizes the missing parts of a continuous auditory stream.

“Auditory induction” was illustrated in the well-known “phonemic restoration illusion” of [Warren 70]. Phonemes were removed (leaving silence) from words and listeners correctly identified their absence. However, when masking noise was present in the silences, listeners reported hearing both the noise and the “masked” phonemes. A study by [Repp 92] indicates that the listener’s auditory system does not segregate the restored speech from the extraneous sound, but instead uses an abstract phonological-phonetic representation (top-down knowledge of speech acoustics) that is activated in the process of word recognition to make up the missing phonemes “in the mind’s ear”.

Most of the work on machine listening focuses on developing computer models for auditory scene analysis tasks on intact sound; little emphasis has been given to the computational realization of the human Auditory Scene Induction ability - what we term Computational Auditory Scene Induction (CASI). The development of CASI systems will lead to great advances in audio imputation, noise reduction, audio bandwidth expansion, sound enhancement and restoration, speech recognition, audio de-clipping and more.

There are compelling reasons for machines to realize the auditory scene induction ability. In many cases, the sound of interest is not intact, with important parts of the audio missing. This makes many traditional CASA systems that are developed for analyzing intact sound less effective. I will show that many important audio processing tasks can be unified and addressed under the framework of computational auditory scene induction. Examples include restoring heavily corrupted signals caused by bad connections in cordless phones or VoIP systems, speech distorted by limited telephone bandwidth, large scratches on a CD, poorly separated audio signals and more. Reconstructing the missing information from corrupted audio can be a very challenging problem. In order to perform a successful reconstruction, we need to both fill in the corrupted regions with plausible information and also keep the original meaning of the reconstructed speech or music, which calls for solutions informed by our understanding of how humans solve this problem, as well as sophisticated computer models for audio representation.

1.1. Contribution

Broadly, the goal of my dissertation is to build computer models that realize some key abilities related to Computational Auditory Scene Induction (CASI). Building upon existing statistical models (NMF, PLCA, HMM and N-HMM) for audio representation, I formulate this ability as a model-based spectrogram analysis and inference problem under the expectation–maximization (EM) framework with missing data.
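To make this missing-data formulation concrete, the following sketch illustrates the core idea on a magnitude spectrogram using a mask-weighted NMF fit with multiplicative updates. This is only a simplified, hedged stand-in for the PLCA/N-HMM models and EM derivations developed in later chapters; the variable names (V, M, W, H) and the component count are illustrative and do not come from the thesis.

    import numpy as np

    def impute_spectrogram(V, M, n_components=20, n_iter=200, eps=1e-9):
        # V: magnitude spectrogram (frequencies x frames); M: binary mask, 1 = observed, 0 = missing.
        # Mask-weighted NMF with multiplicative updates (Euclidean cost): the model V ~ W @ H is
        # fit only on observed bins, and the fitted model fills in the missing ones.
        rng = np.random.default_rng(0)
        F, T = V.shape
        W = rng.random((F, n_components)) + eps   # dictionary of spectral components
        H = rng.random((n_components, T)) + eps   # per-frame activation weights
        for _ in range(n_iter):
            R = W @ H
            H *= (W.T @ (M * V)) / (W.T @ (M * R) + eps)
            R = W @ H
            W *= ((M * V) @ H.T) / ((M * R) @ H.T + eps)
        V_hat = W @ H
        return M * V + (1.0 - M) * V_hat          # keep observed bins, impute the rest

In the actual systems described later, this plain non-negative factorization is replaced by probabilistic models (PLCA, N-HMM) that also capture temporal dynamics and, in Chap. 4, top-down speech knowledge.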

More specifically, I will show that many important audio processing applications can be approached under the unified framework of computational auditory scene induction. I will first design a machine system that is capable of performing singing voice segregation in an auditory scene. Building upon the same computer model, I will further extend it to reconstruct the underlying acoustic events of audio signals in the presence of missing regions, as if the information inside parts of the reconstruction had not been occluded or corrupted. I will then investigate how the performance of existing CASI systems can be improved by considering both the spectral and temporal structure of audio. So far the proposed systems are based on statistical modeling of audio signals. In the last part of this thesis, I will show that the performance of the proposed CASI system can be further improved by incorporating top-down knowledge of speech, specifically syntactic knowledge in the form of language models and acoustic knowledge in the form of word models.

Three machine systems that are closely related to CASI are developed in my thesis. Accordingly, the contributions of each system are as follows:

• Singing Melody Extraction

Singing melody extraction is the process of extracting the pitch contour of the singing voice from a polyphonic audio mixture consisting of the singing voice and the accompaniment music, including harmonic and percussive instruments.

Instead of relying on source-filter models of speech production, as many existing singing melody extraction systems do, the proposed system takes an alternative approach of adaptively modeling and removing the accompaniment music based on recent developments in statistical audio modeling techniques.

More specifically, Probabilistic Latent Component Analysis (PLCA) is applied to learn a dictionary of local spectral components¹ that explain the nearby music accompaniment. The learned dictionary is then used to remove the accompaniment components, leaving mainly the singing components of the music.

The proposed system is semi-supervised, assuming no prior information on the type or the number of instruments in the mixture. This approach is an advance because it can adjust the learned accompaniment model adaptively from the identified nearby non-vocal music. It has the flexibility of extending itself to extract the main melody of the audio regardless of whether the melody comes from the singing voice or another type of leading instrument.

¹For now, a spectral component can be viewed as a normalized magnitude spectrum. A more precise definition of spectral components is given in Sec. 1.3.

• Audio Imputation

Audio imputation is the process of resynthesizing the missing parts of an audio signal so that, after reconstruction, the information inside the missing parts is seamlessly recovered. It is the closest endeavor to realizing the human auditory scene induction ability in machine listening - also known as CASI.

Particular attention in this thesis will be given to the development of a CASI system that is capable of reconstructing the missing data in an audio signal. While most previous audio imputation methods are based on non-structured, non-constrained models which do not comply well with the characteristics of audio signals, I will consider a more structured and constrained model for audio imputation. This structured, constrained model allows recovery of missing spectrogram elements that have fewer artifacts and are more temporally coherent with the original signal.

More specifically, a recent development in statistical modeling of audio, the Non-negative Hidden Markov Model (N-HMM), will be adapted for the computational realization of Auditory Scene Induction. While existing work only models the spectral structure explicitly, the proposed model takes into account both the spectral and temporal information (the non-stationarity and temporal dynamics) of the audio signal, making it more suitable for imputation of temporally varying audio signals (including speech and music) with large portions of missing data.

• Audio Bandwidth Expansion

After addressing the general problem of audio imputation, I further concentrate on a particular problem in missing audio data - audio bandwidth expansion.

Audio bandwidth expansion is the process of increasing the bandwidth of a signal that has been band-limited by the transmission channel, at either high or low frequencies. This task is one of the most popular and widely studied problems in the area of audio imputation [Larsen 04].

Most bandwidth expansion methods are based on the source-filter model for speech production. However, these methods need to be trained on parallel wideband and narrowband corpora to learn a specific mapping between narrowband features and wideband spectral envelopes. Thus, the usage of these systems is limited. For example, a system trained on telephony and wideband speech cannot be readily applied to expand the bandwidth of a low-quality loudspeaker.

To address this issue, I treat bandwidth expansion as a missing data imputation problem that directly models the audio spectrogram. By framing the bandwidth expansion problem as an imputation problem, the proposed system only needs to be trained once on a wideband corpus. Once the system is trained, it can be used to expand any missing frequencies of narrowband signals, despite never having been trained on the mapping between narrowband and wideband corpora.

The proposed bandwidth expansion system takes into account both the spectral and temporal structures. Inspired by the phenomenon that the human auditory system uses top-down knowledge of speech to achieve auditory induction, high-level information about speech in the form of a language model is incorporated into the process of speech bandwidth expansion. More specifically, language models are used to train a non-negative hidden Markov model that incorporates the syntactic knowledge of speech.

The idea of language models can be easily generalized to deal with expanding the bandwidth of different types of audio signals. For example, the language model can be used to learn the relationships among different notes in musical signals according to musical rules.

Before going into details of each proposed system, I first give an outline of this thesis in the next section.

1.2. Outline

In this section, I discuss the overall organization of this thesis and the focus of the individual chapters.

The contributions of my thesis have been summarized in Sec. 1.1. The remainder of Chap. 1 is organized as follows:

In Sec. 1.3, I first discuss structures in audio that are exploited throughout this thesis. While this thesis is not trying to simulate how the human auditory system performs auditory scene analysis and induction, the proposed work is partly inspired and motivated by the remarkable ability of human audition. Sec. 1.4 gives an introduction to the background knowledge on auditory scene analysis and induction, elaborating on how the human auditory system solves the “Cocktail Party Problem” and the “phonemic restoration illusion”. In Sec. 1.5, I discuss in detail how the work of this thesis is motivated by the phenomenon of continuity in human audition and relate my work to image inpainting in computer vision (CV).

In Chap. 2, I discuss the proposed work on Singing melody extraction, one of our attempts to realize a machine system that extracts the singing melody from a polyphonic musical mixture consisting of the singing voice and accompaniment instruments.

After giving an overview of the existing approaches in Sec. 2.1, I introduce in Sec. 2.2 some background knowledge on the probabilistic models that are used for the proposed singing melody extraction system: Non-negative Matrix Factorization (NMF) and Probabilistic Latent Component Analysis (PLCA). This is followed by a detailed description of the proposed method in Sec. 2.3 and an illustrative example in Sec. 2.4. Experimental results are presented in Sec. 2.5. I conclude with the contributions and future directions in Sec. 2.6.

In Chap. 3, I discuss our proposed work directly related to Computational Auditory Scene Induction (CASI): an Audio imputation system that automatically fills in the missing values of an audio spectrogram.

I start in Sec. 3.1 by giving an overview of existing audio imputation approaches, with a particular concentration on methods that operate in the time-frequency domain to reconstruct missing regions of the spectrogram. In Sec. 3.2, I point out the disadvantages of the non-negative spectrogram factorization models (NMF and PLCA, which are described in detail in Sec. 2.2), and show how to address these disadvantages with the recently proposed non-negative hidden Markov model (N-HMM). In Sec. 3.3, I describe an audio imputation system using PLCA as the baseline system to which my proposed system will be compared. The proposed audio imputation system and experimental results are discussed in detail in Sec. 3.4 and Sec. 3.5 respectively. This chapter is concluded in Sec. 3.6.

In Chap. 4, I further extend the proposed algorithm to incorporate high-level knowledge of speech. The proposed algorithm is evaluated on a particular use case of audio imputation: Language Informed Audio Bandwidth Expansion.

I first introduce the problem of audio bandwidth expansion, followed by a survey of existing work in Sec. 4.1. An overview of the proposed system is given in Sec. 4.2. I then discuss how to train a speaker-level N-HMM that incorporates high-level acoustic knowledge in the form of word models in Sec. 4.3 and syntactic knowledge in the form of language models in Sec. 4.4. The detailed description of the proposed system is presented in Sec. 4.5. Illustrative examples and quantitative experimental results are presented in Sec. 4.6. The contributions and directions for future research are summarized in Sec. 4.7.

I conclude this thesis with Chap. 5. This chapter has some closing remarks on the contributions of this thesis as well as future directions for research.

1.3. Structure in Audio

Although there is some randomness in audio, it has a great deal of structural regularity.

In order to discover this structural regularity, the first step is to find the right representation of audio. In this section, I first describe how audio is commonly represented using time-amplitude and time-frequency representations. I then discuss the important audio structures that are exploited throughout this thesis.

When analyzing an audio signal, we are interested in the signal waveform, which gives us the sound pressure or amplitude of the signal versus time. The waveform in Fig. 1.1 (a) is an example of such a signal. From the waveform we can compute parameters like the average level, the beginning and end of speech segments, pauses, etc.

More complicated questions can be answered more easily if we transform the signal using the Fast Fourier Transform (FFT) [Oppenheim 75] to the time-frequency domain, i.e., by computing the magnitude spectrogram of the signal. We apply a moving window to the waveform, and the FFT is applied to each windowed waveform to get each frame of the spectrogram. Thus each column of a spectrogram is the magnitude of the FFT over a fixed window of the waveform. To display the spectrogram we usually use a two-dimensional diagram with the horizontal (X) axis for time and the vertical (Y) axis for frequency, as shown in Fig. 1.1 (b). In the spectrogram, the level of the signal at a given time and frequency is displayed as a color or gray value, as explained by the colorbar shown in Fig. 1.1 (c).

Figure 1.1. Illustration of the (a) waveform and (b) spectrogram of an audio clip of a male speaker saying, “She had your dark suit in greasy wash water all year”. The level of the signal at a given time-frequency bin is indicated by a color value as explained in the (c) colorbar.

The time-domain signal can be reconstructed from the spectrogram and the phase information using overlap-add techniques [Oppenheim 75].
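As a concrete illustration of this representation (not the exact analysis settings used in the experiments of this thesis), the short sketch below computes a magnitude spectrogram with a moving window and the FFT, and resynthesizes the time-domain signal from the complex spectrogram by overlap-add.

    import numpy as np
    from scipy.signal import stft, istft

    fs = 16000                                  # sampling rate in Hz (example value)
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t)             # one second of a 440 Hz tone as a stand-in signal

    # Moving-window FFT: a 64 ms Hann window advanced in 16 ms hops.
    f, frames, X = stft(x, fs=fs, nperseg=1024, noverlap=768)
    S = np.abs(X)                               # magnitude spectrogram (frequency bins x time frames)

    # Overlap-add resynthesis from the complex spectrogram (magnitude plus phase).
    _, x_rec = istft(X, fs=fs, nperseg=1024, noverlap=768)
    n = min(len(x), len(x_rec))
    print(S.shape, np.max(np.abs(x[:n] - x_rec[:n])))   # reconstruction error is near zero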

In this thesis, I use “spectrogram” as the basic representation of an audio signal. By visual inspection of the spectrogram, we can see a great deal of structure. For example, in Fig. 1.1 (b) we can roughly count how many words (or phonemes) there are in this clip of speech. We can also see some of the fundamental frequencies and harmonic structures of the phonemes.

Overall, the aspects of audio structure exploited in this thesis are the following:

(1) Spectral structure – Each column of the spectrogram tells us the spectral content for a given time frame. As seen in Fig. 1.1 (b), there are very clear spectral patterns within the spectrogram that are repeated over several time frames. Almost all types of audio (including speech, music and environmental sounds) have some amount of regularity in spectral structure (a short illustrative sketch follows this list).

(2) Non-stationarity – It is well known that the statistics of the spectral structure of audio signals change over time. Even within a very short period of time, the spectrum of an audio signal can change dramatically. Therefore it is important to take this non-stationarity into consideration when we model audio.

(3) Temporal dynamics – Although the statistics of audio change with time, there is a structure to the non-stationarity, especially in the case of speech and music. For example, the statistics of speech are constrained by various knowledge sources such as acoustic, lexical, syntactic, semantic and even pragmatic constraints. We use the term temporal dynamics to indicate the structures of temporal changes in audio.
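The following short sketch illustrates the first kind of structure: because spectral patterns recur across frames, a magnitude spectrogram can be summarized by a small dictionary of spectral components and their per-frame weights. It uses scikit-learn's NMF purely as a stand-in for the NMF/PLCA models introduced in Chap. 2, and the spectrogram here is a random placeholder rather than real audio.

    import numpy as np
    from sklearn.decomposition import NMF

    # S: magnitude spectrogram (frequency bins x time frames), e.g. produced as in Sec. 1.3.
    S = np.abs(np.random.randn(513, 200))       # placeholder standing in for a real spectrogram

    model = NMF(n_components=4, init='random', max_iter=500, random_state=0)
    W = model.fit_transform(S)                  # columns of W: recurring spectral components
    H = model.components_                       # rows of H: how strongly each component is active per frame

    # Every frame is approximated as a non-negative mix of a few recurring spectra.
    print(W.shape, H.shape)                     # (513, 4) and (4, 200)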

1.4. Auditory Scene Analysis and Induction

The sound of a busy environment, such as a city street, gives rise to the perception of numerous distinct events in a human listener - the auditory scene analysis of the acoustic information [Ellis 96]. Auditory scenes generally contain multiple sound sources that add together to produce the mixed signal that enters the ears. In many cases, it is the sources (such as people, music, cars, birds), not the mixture, that are of interest.

The human auditory system is able to focus on a single source within a mixture of sound sources. For example, when conversing at a noisy party, most people can still listen to and understand the person they are talking with, and simultaneously ignore the background noise and conversations. In human perception, this ability is commonly referred to as the “cocktail party effect”. The “cocktail party effect” is believed to be solved by the human auditory system via a combination of three kinds of information: bottom-up grouping cues such as assumptions regarding the temporal or spectral structure of the sounds [Bregman 90], top-down knowledge of specific sound classes such as prior knowledge of speech acoustics [Bregman 90, Warren 70], and the more recently discovered sound source repetition cue [McDermott 11]. The work proposed in this thesis is closely related to the first two kinds of information for auditory analysis.

The bottom-up grouping cues utilized by the human auditory system are derived from the statistical regularities of natural sounds. They are assumptions, or prior information built into our auditory system, about what sound sources are like. For instance, listeners assume that frequency components that are regularly spaced in frequency (Harmonicity), begin and end simultaneously in time (Common onsets), or have similar distributions of binaural spatial cues (amplitude ratio and phase difference) belong to the same sound source [Bregman 90].

The bottom-up grouping cues have been studied extensively and have motivated intensive research on CASA, leading to great advances in multi-pitch estimation & tracking and source separation, with notable work by [Klapuri 03] for multi-pitch estimation and [Duan 09, Duan 10a] for multi-pitch tracking, both based on harmonicity, [Yilmaz 04, Han 09] for stereo source separation based on spatial cues and sparsity, and [Li 09, Han 10, Han 11b] for monaural source separation based on common amplitude modulation and pitch cues. A comprehensive introduction to CASA can be found in [Wang 06]. In particular, the temporal information of audio provides important information for detecting periodicity and pitch in speech and music, and has been utilized in many CASA systems. Audio signals are non-stationary: the set of active sound sources changes over time. Furthermore, the evolution of the different sound sources over time is captured by the temporal dynamics of audio. The temporal dynamics and non-stationarity of audio signals have never been considered together in the computational realization of auditory induction (CASI). This thesis incorporates such important information about audio to design more effective CASI systems.

The second kind of information is related to knowledge of specific sounds, i.e., the top-down cues of speech acoustics. In the “phonemic restoration illusion”, [Warren 70] showed that listeners use knowledge of specific familiar sound classes (speech phonemes), filling in masked syllable segments in ways that are consistent with known speech acoustics. In human perception, the ability to fill in masked brief segments is usually referred to as Auditory Scene Induction (ASI).

Top-down knowledge of speech acoustics is believed to be one of the pieces of information utilized by the human auditory system to perform auditory scene induction. The computational realization of auditory scene induction (CASI) has led to the development of different audio imputation techniques [Le Roux 10, Smaragdis 11, Han 12a]. None of these techniques has used domain knowledge (e.g., language models) to aid reconstruction. While top-down knowledge of speech acoustics has led to great advances in speech recognition based on language models [Rabiner 93], it has not been employed in CASI. This thesis is one of the first attempts to utilize top-down speech acoustic cues to develop better CASI systems for audio imputation and bandwidth extension.

The prior knowledge of different sounds needs to be somehow acquired by the auditory system; however, natural environments rarely feature isolated sound sources from which it could be readily learned. It is possible that these priors are at least partially built into the auditory system by evolution, or that listeners learn them from occasionally hearing sound sources in isolation. [McDermott 11] found that sound source repetition serves as a third kind of information used to parse sound mixtures. It provides an alternate, complementary solution for the auditory system to obtain prior knowledge of a sound source: listeners might detect sources as repeating spectral-temporal patterns embedded in the acoustic mixture. For instance, listeners can identify novel sounds that occur more than once across different mixtures, even when the same sounds are impossible to identify in single mixtures. There is a computational realization of the source repetition cue [Rafii 11]; however, it is beyond the scope of the work proposed in this thesis.

1.5. Motivation


Figure 1.2. Fragments that are parts of a number of familiar objects: character “B”. The fragments were obtained by taking the objects and laying an irregularly shaped mask over them. Then the parts that were underneath the mask were eliminated, leaving visible only those parts that had not been covered by the mask. a) Fragments do not organize themselves when there is no information for occlusion. b) The same fragments as shown in (a) except that information for occlusion has been added, causing the fragments on the boundaries of the occluding form to be grouped. (From [Bregman 90])

One of the motivations for this thesis comes from the “phenomenon of continuity” [Bregman 90]. It can be considered the ability of our perceptual systems to connect fragmented views of a sequence of events in plausible ways. The “phenomenon of continuity” exists in both human visual and auditory perception. This is not surprising because our senses of vision and audition often face similar problems and thus possibly use similar approaches to overcome them.

An example of our “perceived continuity” or “perceptual closure” in vision is shown in Fig. 1.2 [Bregman 90]. In this example, Fig. 1.2 (a) shows a number of fragments that are really parts of familiar objects. When the mask is not present, these fragments do not close up perceptually in our vision because the visual system does not know where the evidence is incomplete. However, when looking at Fig. 1.2 (b), with the mask present, our visual system quickly joins the fragments without having to consciously think about it. The perceived continuity usually occurs in an interrupted form if the contour is “strong” at the point of interruption. This would be true when the contours of the form continued smoothly on both sides of the interruption so that a smooth continuation could be perceived.

Figure 1.3. Illustration of Masking (adapted from [Bregman 90]) in idealized spectrograms. The vertical axis is frequency and the horizontal axis is time. Pattern A is the tonal sound that is softer. B is the sound mask that is louder. In the task of singing voice extraction, A can be viewed as the singing voice and B the accompaniment music.

There is evidence that the same phenomenon occurs in our auditory system. One example is the sound “masking” effect. “Masking” occurs when a louder sound covers up a softer one, as illustrated in Fig. 1.3. Despite the masking from sound B, if the softer sound A is longer, and can be heard both before and after the interfering sound B, A can still be “heard” by the auditory system to continue behind the louder one.

The “phenomenon of continuity” in human audition partly motivated my work on Singing melody extraction that is proposed in Chap. 2. Sound A in Fig. 1.3 can be considered the singing voice and B the background music, and vice versa. While most of the frequency components of A overlap with those of B, our auditory system can still hear A during the masking period regardless of whether the masked singing voice is softer or louder. In Chap. 2, I will show how to use statistical modeling techniques to learn a dictionary of spectral components describing the characteristics of the spectral structure of the background music, and then use this dictionary of spectral components to remove the background music from the polyphonic music mixture, leaving mainly the singing voice in the residual audio.

The “phenomenon of continuity” even occurs when the masked sound is completely removed “behind” the louder one in the region of masking. An illustration is shown in Fig. 1.4.

The example illustrated in Fig. 1.4 shows an alternately rising and falling pure-tone glide which is periodically interrupted by a short loud burst of broad-band noise. Gaps are introduced as silent spaces in the pure-tone glide, as shown in Fig. 1.4 (a). When the masking noise is introduced so as to exactly cover the silent spaces, the ear hears the glide as one continuous rising and falling sound passing right through the interrupting noise. This phenomenon is usually referred to as the “illusion of continuity”.

Figure 1.4. Illusion of continuity in idealized spectrograms. The vertical axis is frequency and the horizontal axis is time. (a) The stimulus with gaps; (b) The stimulus when the gaps are filled with noise. (adapted from [Bregman 90])

“Illusion of continuity” is an example of the “auditory induction” ability in human perception of sound in noisy environments. In the well-known “phonemic restoration illusion” presented in [Warren 70], “auditory induction” was illustrated by the fact that listeners believe they hear the deleted phonemes masked by an extraneous sound. The “auditory induction” ability of the human auditory system is the major motivation for my Audio imputation work proposed in Chap. 3. If the human auditory system resynthesizes the missing parts of a continuous auditory stream, it is reasonable for machines to have the same ability.

The information that is used by our auditory system to fill in the missing regions of an auditory stream includes both the bottom-up grouping cues (grouping and connecting the sound elements that are similar before and after the interfering sound) and the top-down knowledge of specific sound classes [Bregman 90].

A study by [Repp 92] indicates that human induction of missing elements in speech relies heavily on our high-level knowledge of speech acoustics. In the “phonemic restoration illusion”, the listener’s auditory system does not segregate the restored speech from the extraneous sound, but instead uses an abstract phonological-phonetic representation (top-down knowledge of speech) that is activated in the process of word recognition to make up the missing phonemes “in the mind’s ear”.

The same phenomenon in vision is also illustrated in Fig. 1.2 . Obviously, our previous language knowledge of character “B” helps our visual system to reconstruct the missing parts of ‘B’ that are underneath the mask.

The realization of top-down knowledge about speech in computational auditory scene induction is rare, despite this information being widely used in the area of speech recognition. This motivated me to introduce high-level knowledge about speech, in the form of word models and language models, to audio imputation. This work on Audio bandwidth expansion is described in detail in Chap. 4.

The work in computer audition described in this thesis is reminiscent of the object removal and region filling techniques, known as image inpainting [Bertalmio 00, Criminisi 04, Hays 08], in the field of computer vision. As shown in Fig. 1.5, removing the foreground object in an image can be related to the process of removing the background music in an audio mixture. The process of filling in the background region of a removed object in an image corresponds to the reconstruction of missing elements in regions of an audio spectrogram. In addition, in the human visual or auditory system, the reconstruction of missing information in an image or audio spectrogram usually involves high-level knowledge about the image or audio, which further justifies our efforts in introducing this high-level knowledge to CASI.


Figure 1.5. a) Original image. b) The region corresponding to the foreground person has been removed and filled in with synthesized textures. (From [Criminisi 04])

CHAPTER 2

Singing Melody Extraction

Melody is one of the most basic and easily recognizable traits of musical signals. The main melody of a song is usually defined as the pitch sequence that a human listener is most likely to perceive and associate with that piece of music. Knowing the melody of a song is useful in numerous applications, including music recognition, analysis of musical structure, automatic music transcription [Klapuri 03, Duan 10a], content-based music search [Pardo 08, Skalak 08], audio source separation [Han 09, Han 11b] and more. It is especially useful in applications such as Query-By-Humming (QBH). Current QBH systems depend on humans to listen to polyphonic audio files (song recordings) and build machine-searchable melodies from them. This is problematic for large databases of audio (e.g. the millions of songs on iTunes) and has greatly limited their deployment. A system that is able to automatically extract main melodies from audio would make such systems much more broadly useful [Cartwright 11].

Although humans have a natural ability to identify and isolate the main melody from polyphonic music, automatic extraction of melody by a machine remains a challenging task.

In polyphonic music, there are multiple instruments and sound sources playing simultaneously. Some of these sounds may be pitched, such as those from a vocalist or harmonic instrument, while others may be un-pitched, such as those from a rhythm instrument or sound effects. Determining the main melody from such an audio recording involves extracting a single dominant pitch contour out of a mixture of concurrent spectral events. In this chapter, melody is defined as the pitch contour of the lead vocal in a song. This is a reasonable assumption since, when music contains a singing voice, many people remember and recognize that piece of music by the melody line of the lead vocal part.

In this chapter, I focus on improving a machine’s ability to perform singing melody extraction - the process of extracting the pitch contour of the singing voice from the accompaniment instruments that are playing simultaneously.

The ability to extract the singing melody from an audio mixture shares some resemblance to the auditory segregation ability, because both abilities in some way segregate one sound source (i.e., the singing voice in the case of singing melody extraction) from other sources (i.e., the background music accompaniment). This relates the proposed task to the “cocktail party effect”, i.e., the ability of the human auditory system to focus on a single talker among a mixture of conversations and background noises.

I address this problem by formulating extraction of the melody from an audio mixture as a predominant source separation problem. The proposed algorithm is based on adaptively learning a statistical model for each component of the music from the mixture itself. In this task, I concentrate on polyphonic music containing singing voice and accompaniment. Based on the assumption that the sound produced by the accompaniment is similar during both the non-vocal and vocal parts of the song, a probabilistic model for the accompaniment is learned from the non-vocal segments of the mixture and then used to remove the accompaniment from the polyphonic mixture. After the accompaniment is suppressed in the mixture, the melody line of the music can be more easily extracted from the remaining singing components of the signal.

In the remainder of this chapter, I first give an review of the existing singing melody extraction system in Sec. 2.1 . The probabilistic model used in this chapter for audio representation is discuessed in detail in Sec. 2.2 . I then describe in Sec. 2.3 the proposed system on singing meldoy extraction. This is followed by experimental evaluation of our proposed system and comparing to other two methods in Sec. 2.5 . I conclude in the end of this chapter and point out the direct of future work in Sec. 2.6 .

2.1. Related work

Before going into the details of the proposed system, I first give an overview of the existing melody extraction systems.

Melody extraction used to be treated as a multi-pitch estimation & tracking (MPE&T) problem. If the pitch track of every instrument in the polyphonic music can be obtained, it is trivial to get the main melody by picking the most salient pitch track.

Multi-pitch estimation & tracking systems usually employ different probabilistic mod- els for pitch candidate selection [Klapuri 03, Duan 10b], followed by a pitch tracker 43 that connects individual pitch candidate of the same source into trajectories across the time [Duan 09, Duan 10a].

[Klapuri 03] works in an iterative fashion by estimating the most significant F0 from the spectrum of the current mixture and then removing its harmonics from the mixture spectrum. [Duan 10b] is a multiple fundamental frequency (F0) estimation approach that tries to maximize the likelihood models of both spectral peaks and non-peak regions.

No-peak regions are the frequencies further than a musical quarter tone from all observed peaks. The first detected pitch of the multi-pitch estimation methods can be considered as the predominant melody in the sense that the score of this pitch hypothesis is the highest. However, there is no guarantee the estimated most predominant pitches are from the same source, since these algorithms only output individual pitch estimates in every time frame independently but not the pitch trajectories of each source across the time.

[Duan 09, Duan 10a] is one of the first systems that estimate the pitch estimates in individual frames and connect them into pitch trajectory of each monophonic source in a mixture of harmonic sounds.

Since estimating all the pitch contours in an polyphonic audio mixture is an extremely difficult problem, [Goto 04] has proposed a less difficult task called predominant melody extraction – the process of extracting the predominant melody line (usually the vocal melody in case of pop music containing singing voice) from polyphonic audio. This is the

first time that melody extraction was treated as a separate problem.

Since then, many melody extraction algorithms have been proposed. Although these algorithms have their differences, generally speaking they can be classified into three categories. 44

A majority of melody extraction methods such as those in [Goto 04, Paiva 05,

Li 05, Ryynanen 06, Rao 10, Joo 11] follow a common framework in that they obtain the melody line in two steps:

(1) Extract multiple pitch candidates for each time frame.

(2) Construct melody line based on the assumptions such as that successive pitches

of melody are highly correlated or predominant melody is most predominant.

For example, [Goto 04] uses maximum a posteriori probability (MAP) to extract mul- tiple pitches, and afterwards, incorporates simple rules concerning the temporal continuity of melody in obtaining the melody line. [Li 05] outputs several singing pitch candidates for each time frame using a channel/peak selection scheme that exploits the salience of singing voice and the beating phenomenon in high frequency channels. The most pre- dominant pitch trajectory is picked by a hidden Markov model (HMM). [Ryynanen 06] uses an HMM trained for each note to transcribe melody, based on multiple-F0 estima- tion followed by acoustic and musicological modeling. [Joo 11] uses pre-coded harmonic structure estimate possible pitch candidates under a minimum mean-square estimation

(MMSE) framework, and the melody line is extracted by a rule-based procedure.

The primary concern with these algorithms is extracting multiple pitches with high recall and precision. However, multi-pitch estimation continues to be far from being solved for the use of melody extraction. Due to this, some of the melody extraction systems [Poliner 05, Hus 09, Jo 11] try to directly model the relation between the melodic line and the polyphonic audio.

[Poliner 05] performs dominant melodic note classification via a support vector ma- chine (SVM) classier trained directly from labeled data. [Hus 09] employs a pre-trained 45

HMM to model the relationship between the adjacent melody pitches and their corre- sponding audio context. [Jo 11] models the melody pitch and harmonic amplitudes as two uncoupled first-order Markov process, and treat each frame of polyphonic audio con- ditionally independent given the melody pitch and harmonic amplitudes. A sequential

Bayesian model is used to represent the probabilistic relation among melody pitch, har- monic amplitudes, and polyphonic audio.

The aforementioned algorithms [Poliner 05, Hus 09, Jo 11] still suffer from the problem of accompaniment interference. Accompaniment sounds from the harmonic and percussive instruments act as interference during melody pitch estimation, usually making algorithms that derive the predominant melody from the audio (without first extracting all the concurrent pitches) much less effective. This leads to the third category of melody extraction algorithms, inspired by source separation.

Systems based on source separation techniques use probabilistic models to represent the lead singing components and the background accompaniment separately. The main melody line is then extracted from the singing components, based on the assumption that the main melody is usually the vocal melody. Note that separation is not the ultimate purpose here, so the source separation techniques used by melody extraction algorithms need not perfectly separate the singing voice from its accompaniment. As long as the accompaniment components can be suppressed in the mixture to the extent that the singing melody can be accurately extracted, the system serves as a good singing melody extraction system.

[Tachibana 10] proposed to enhance the melodic component of the music by separating out the percussive component, and then to use a simple dynamic programming algorithm to obtain a smooth melody line. However, the background of polyphonic music usually contains more than one harmonic source, which makes this approach less effective.

In [Durrieu 10], the leading vocal part is explicitly represented by a specific source/filter model. The proposed representation is investigated in the framework of two statistical models: a Gaussian Scaled Mixture Model (GSMM) and an extended Instantaneous Mixture Model (IMM). For both models, the estimation of the different parameters is done within a maximum-likelihood framework adapted from single-channel source separation techniques. Compared to Gaussian mixture models, non-negative matrix factorization provides richer representations for musical structures. This thesis will therefore consider using NMF to model music signals.

The singing melody extraction algorithm [Han 11a] proposed in this thesis belongs to the third category, focusing on the problem of removing the accompaniment interference.

This problem is approached by building a model for the accompaniment and then applying it to suppress the background musical components in the polyphonic music. Assume that we have a polyphonic audio signal featuring a singing voice and multiple instruments.

Previous work [Grindlay 10] used a set of training instruments to learn a model space which fits each individual instrument based on Probabilistic Latent Component Analysis.

In contrast to [Grindlay 10], the work in this dissertation does not use pre-trained models. Instead, the models for the accompaniment and singing voice are learned adaptively from the music mixture itself.

Compared to [Tachibana 10], which suppresses the percussive instruments only, the proposed algorithm takes into consideration all the accompaniment instruments. Instead of relying on very constrained models such as the GMM and/or source-filter model described in [Durrieu 10], the proposed system employs spectral vectors learned from the polyphonic music itself to model the singing and accompaniment components of the same piece of music. This makes our system more suitable for music signals with complex structure and various kinds of accompaniment interference. Furthermore, the proposed system can be easily extended to extracting the predominant melody of a leading instrument from its accompaniment.

2.2. Modeling of Audio

In this section I introduce the statistical models that are commonly used to represent audio signals for the purpose of source separation. These models are known collectively under the name of "Non-negative Spectrogram Factorization": Non-negative Matrix Factorization (NMF) and its probabilistic extension, Probabilistic Latent Component Analysis (PLCA).

2.2.1. Non-negative Spectrogram Factorization

Non-negative spectrogram factorization refers to a class of techniques that includes non-negative matrix factorization (NMF) [Lee 99] and its probabilistic counterparts such as probabilistic latent component analysis (PLCA) [Smaragdis 06]. Audio spectrograms are often low-rank non-negative matrices and can therefore be compactly represented by a few spectral patterns. As illustrated in Fig. 2.1, these spectral patterns can be interpreted as a dictionary of spectral components, each of which can be seen as a normalized magnitude spectrum.

In matrix notation, this can be represented as:

V = WH    (2.1)

where the spectrogram V is an F × T matrix. The dictionary W is an F × K matrix where each column is a spectral component. The mixture weight H is a K × T matrix where each row represents the weights for a given component. The goal of NMF is to learn a set of spectral components that can explain a sound source (or group of sources) of interest.

The values of the W and H matrices are usually estimated in NMF iteratively by optimization methods that minimize the distance between the observed spectrogram V and the reconstruction WH. The Euclidean distance, Kullback-Leibler (KL) divergence [Lee 99], Bregman divergence [Dhillon 05], and Itakura-Saito divergence [Févotte 09] have been used as distance functions between V and WH.
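As a concrete illustration, the following is a minimal NumPy sketch of the multiplicative-update rules for NMF under the KL divergence, in the spirit of [Lee 99]; the variable names V, W and H follow Eq. (2.1), and the random matrix in the usage example is only a stand-in for a real magnitude spectrogram.

import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-12, seed=0):
    """Factor a non-negative F x T spectrogram V into W (F x K) and H (K x T)
    by minimizing the KL divergence with multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps   # dictionary of spectral components
    H = rng.random((K, T)) + eps   # per-frame activation weights
    for _ in range(n_iter):
        # H <- H * (W^T (V / WH)) / (W^T 1)
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.T @ np.ones_like(V) + eps)
        # W <- W * ((V / WH) H^T) / (1 H^T)
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# Toy usage: a random non-negative "spectrogram" decomposed with K = 4 components.
V = np.abs(np.random.randn(257, 100))
W, H = nmf_kl(V, K=4)
print(W.shape, H.shape)  # (257, 4) (4, 100)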

2.2.2. Probabilistic Latent Component Analysis

Probabilistic Latent Component Analysis (PLCA) [Smaragdis 06] is a probabilistic extension of Non-negative Matrix Factorization (NMF). It has been shown that PLCA is numerically identical to NMF for two-dimensional input, and to Non-negative Tensor Factorization (NTF) for arbitrary dimensions [Shashanka 08]. However, PLCA presents a much more straightforward way to build easily extensible models: it allows the use of statistical techniques while keeping the general ideas of NMF.

In PLCA, an audio spectrogram is modeled as a histogram of “sound quanta”. The amount of sound quanta in a given time-frequency bin indicates the Fourier magnitude

Figure 2.1. Illustration of NMF on the spectrogram of a speech clip ‘Bad dog’. When NMF is applied to the spectrogram (with K = 4), four distinct spectral components are learned. Additionally, the weights of these spectral components at each time frame are learned.

of that bin and is given by Vft 1 . Once normalized, the spectrogram can be thought of as a joint probability distribution P(f, t) over time and frequency. For every time-frequency bin in the spectrogram, we can use P(f, t) to represent the relative magnitude of the spectrogram at time-frequency position (f, t).

1 In theory, this would involve a scaling of the spectrogram (a single scale factor for all time-frequency bins) such that each time-frequency bin has a whole number of sound quanta.

Please note that the same kind of modeling philosophy (treating the audio spectrogram as a joint probability distribution) will be used again when I introduce the non-negative hidden Markov model in Sec. 3.2.

Once the spectrogram of interest is treated as a joint probability distribution P (f, t), the spectrogram can be explicitly modeled as a two-dimensional distribution in time and frequency using PLCA:

P(f, t) = Σ_z P(z) P(f|z) P(t|z)    (2.2)

where P(f|z) and P(t|z) are conditional distributions along the frequency and time dimensions. P(f|z) are the latent components that correspond to spectral components. For a given value of z, P(f|z) is a multinomial distribution.

The collection of all P (f|z) form a dictionary, analogous to the dictionary of spectral components in NMF. P (t|z) are also modeled as latent components and correspond to the occurrences of the spectral components in time. For a given value of z, P (t|z) is a multinomial distribution. P (z) is a distribution of gains and is also a multinomial distribution.

In NMF, there are explicit constraints to enforce non-negativity of components. In

PLCA, non-negativity is implicitly enforced by mapping the values of the individual spec- tral magnitudes and time activations to parameters of multinomial distributions, which are by definition non-negative.

Given the spectrogram, the values of P(z), P(f|z) and P(t|z) can be estimated using the expectation–maximization (EM) algorithm described in Tab. 2.1.

In the expectation step, the posterior of the latent variable z is estimated:

E Step:

P(z|f, t) = P(z) P(f|z) P(t|z) / Σ_{z'} P(z') P(f|z') P(t|z')    (2.3)

In the maximization step, the marginals are re-estimated as follows:

M Step:

P(z) = Σ_f Σ_t P(f, t) P(z|f, t)    (2.4)

P(f|z) = Σ_t P(f, t) P(z|f, t) / P(z)    (2.5)

P(t|z) = Σ_f P(f, t) P(z|f, t) / P(z)    (2.6)

Table 2.1. The expectation–maximization (EM) algorithm of PLCA learning
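The following is a small NumPy sketch of the EM iteration of Tab. 2.1 (Eqs. (2.3)-(2.6)). The function name and the dense (F, T, K) posterior array are implementation choices of this sketch, not part of the original formulation; a practical implementation would typically also add a convergence check.

import numpy as np

def plca(V, K, n_iter=100, eps=1e-12, seed=0):
    """EM updates of Tab. 2.1: estimate P(z), P(f|z), P(t|z) from a
    spectrogram V normalized and treated as a joint distribution P(f, t)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Pft = V / (V.sum() + eps)                        # spectrogram as P(f, t)
    Pz = np.full(K, 1.0 / K)                         # P(z)
    Pf_z = rng.random((F, K)); Pf_z /= Pf_z.sum(0)   # P(f|z), columns sum to 1
    Pt_z = rng.random((T, K)); Pt_z /= Pt_z.sum(0)   # P(t|z), columns sum to 1
    for _ in range(n_iter):
        # E step (Eq. 2.3): posterior P(z|f, t), stored as an (F, T, K) array
        joint = Pz[None, None, :] * Pf_z[:, None, :] * Pt_z[None, :, :]
        post = joint / (joint.sum(-1, keepdims=True) + eps)
        # M step (Eqs. 2.4-2.6)
        Pz = np.einsum('ft,ftk->k', Pft, post)
        Pf_z = np.einsum('ft,ftk->fk', Pft, post) / (Pz[None, :] + eps)
        Pt_z = np.einsum('ft,ftk->tk', Pft, post) / (Pz[None, :] + eps)
    return Pz, Pf_z, Pt_z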

The frequency axis marginals P (f|z) contain a dictionary of the spectral components which best describes the sound represented by the input. PLCA uses a single dictionary of spectral components to model a given sound source. Specifically, each time frame of the spectrogram is explained by a linear combination of spectral components from the dictionary. The frequency marginals can be used as a model for certain kinds of sounds such as singing voice, speech or particular instruments.

Fig. 2.2 shows three sets of spectral components learned from three different kinds of sounds: singing voice, piano and snare drum. The top plots display the spectrogram of the singing voice and a set of derived spectral components. Likewise, the middle and bottom plots display the same information for a piano and a snare drum sound. In each case, a set of four latent variables z is used; conditioned on z, frequency and time are independent. Note how the derived spectral components in the different cases extract representative spectra for each sound.

As seen in Fig. 2.2, the extracted spectral components capture a unique energy distribution along the frequency dimension for each sound. For example, the frequency marginals extracted from the singing voice display clear harmonic structures for the vowel sounds and a high-frequency energy distribution for the fricative at the end, while the marginals from the snare drum have a flatter and more uniform distribution.

Once the frequency marginals are known for a certain sound in a mixture, they can be used to extract this kind of sound from the mixture, as demonstrated first by [Smaragdis 07b] in a manual mode. In the next section, I describe how this model can be used in a semi-supervised way for melody extraction.

Figure 2.2. Spectral components learned from singing voice (top), piano (middle) and snare drum (bottom).

2.3. System description

In this section, I describe a new approach for automatic melody extraction from polyphonic audio using Probabilistic Latent Component Analysis (PLCA). The proposed algorithm is based on adaptively learning a statistical model for each component of the music from the mixture itself. The overall structure of the proposed system is illustrated in Fig. 2.3.

An audio signal is first divided into vocal and non-vocal segments using a trained Gaussian Mixture Model (GMM) classifier, in a similar manner to [Li 07]. Based on the observation that the music accompaniment evolves smoothly before, during and after the introduction of the singing voice within a short period of time, usually within musical section boundaries, it is reasonable to assume that the accompaniment in adjacent non-vocal and vocal parts has similar spectral patterns. A spectral dictionary for the accompaniment can therefore be learned adaptively from the non-vocal segments of the mixture by the probabilistic model PLCA, and then used to remove the accompaniment components from the nearby vocal segments, leaving mainly the singing voice components. After the accompaniment is suppressed in the mixture, the melody pitch line of the music can be easily extracted from the remaining singing voice.

The system deals with the melody extraction problem in four stages. In the first stage, Singing Voice Detection, the mixture is divided into vocal and non-vocal segments using a pre-trained Gaussian Mixture Model (GMM) classifier similar to the one used in [Li 07].

In the Accompaniment Model Training stage, a dictionary of spectral components for the accompaniment music is learned from the mixture itself. Let Xv be the spectrogram

Figure 2.3. Overview of the proposed Singing Melody Extraction System (Audio Signal → Singing Voice Detection → Non-Vocal/Vocal Segments → Accompaniment Model Training → Accompaniment Reduction → Accompaniment-Reduced Audio Signal → Pitch Estimation → Melody).

of the vocal segments containing the singing voice and Xnv be the spectrogram of the non-vocal segments with only the accompaniment. The frequency marginal distributions Pnv(f|z) for the accompaniment can then be learned from Xnv using the PLCA model described in Sec. 2.2.2. The latent variables z ∈ Znv in Pnv(f|z) can be thought of as the set of spectral components extracted from Xnv that fit the accompaniment music well. The procedure of accompaniment model training is illustrated in Fig. 2.4.

Figure 2.4. Illustration of the Accompaniment Model Training stage.

In the third stage, Accompaniment Reduction, the accompaniment in the mixture is removed (or suppressed to some extent) and the singing voice is extracted from the mixture as follows.

Assuming that the spectral structure of the accompaniment music stays stable during both the non-vocal and vocal segments of the music, Xv(f, t) can be decomposed into two sets of frequency marginals by the following equation:

Xv(f, t) = Σ_{z∈Znv} P(z) Pnv(f|z) P(t|z) + Σ_{z∈Zv} P(z) Pv(f|z) P(t|z)    (2.7)

where Znv is the same set of latent variables, learned from Xnv in the previous stage, which describes the non-vocal music extracted from the non-vocal portions of the mixture, and

Zv is the set of additional latent variables we added to explain the singing voice in the mixture of Xv(f, t) that cannot be well described by Znv.

We perform PLCA on the vocal portions of the audio Xv, but the frequency marginals corresponding to Znv are kept fixed to Pnv(f|z); we update only the time marginals and the remaining frequency marginals, using the same EM procedure described in Tab. 2.1.

The additional frequency marginals Pv(f|z) we learn will best explain the lead singing voice in the mixture, which is not present in the non-vocal portions Xnv. Once the marginals of the singing voice have been learned, we can reconstruct the magnitude spectrogram of the singing components Xs(f, t) using only the distributions associated with Zv:

Xs(f, t) = Σ_{z∈Zv} P(z) Pv(f|z) P(t|z)    (2.8)
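A minimal sketch of this semi-supervised decomposition is given below, assuming that the accompaniment frequency marginals Pnv(f|z) have already been learned (for example with a PLCA routine like the one sketched in Sec. 2.2.2); the voice marginals and all time marginals are updated while the accompaniment marginals stay fixed, and the singing-voice magnitude is then reconstructed as in Eq. (2.8). Function and variable names are illustrative.

import numpy as np

def accompaniment_reduction(Xv, Pnv_f, Kv, n_iter=100, eps=1e-12, seed=0):
    """Decompose the vocal-segment spectrogram Xv (F x T) as in Eq. (2.7),
    keeping the accompaniment marginals Pnv_f (F x Knv) fixed, and return the
    singing-voice magnitude reconstructed as in Eq. (2.8)."""
    rng = np.random.default_rng(seed)
    F, T = Xv.shape
    Knv = Pnv_f.shape[1]
    Pv_f = rng.random((F, Kv)); Pv_f /= Pv_f.sum(0)   # voice marginals (learned)
    Pf_z = np.hstack([Pnv_f, Pv_f])                   # columns: accompaniment, then voice
    K = Knv + Kv
    Pz = np.full(K, 1.0 / K)
    Pt_z = rng.random((T, K)); Pt_z /= Pt_z.sum(0)
    Pft = Xv / (Xv.sum() + eps)
    for _ in range(n_iter):
        joint = Pz[None, None, :] * Pf_z[:, None, :] * Pt_z[None, :, :]
        post = joint / (joint.sum(-1, keepdims=True) + eps)      # P(z|f, t)
        Pz = np.einsum('ft,ftk->k', Pft, post)
        new_Pf = np.einsum('ft,ftk->fk', Pft, post) / (Pz[None, :] + eps)
        Pf_z[:, Knv:] = new_Pf[:, Knv:]                          # update voice marginals only
        Pt_z = np.einsum('ft,ftk->tk', Pft, post) / (Pz[None, :] + eps)
    # Eq. (2.8): keep only the voice components, rescaled to the input magnitude
    Xs = (Pz[None, None, Knv:] * Pf_z[:, None, Knv:] * Pt_z[None, :, Knv:]).sum(-1)
    return Xs * Xv.sum()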

The procedure of accompaniment reduction is illustrated in Fig. 2.5:

Figure 2.5. Illustration of the Accompaniment Reduction stage.

We assume the phase of the singing components is the same as the phase of the polyphonic audio, since the human ear is relatively insensitive to phase variations. Xs combined with the original phase of the mixture can then be converted to a time-domain signal by a simple overlap-add technique [Oppenheim 75]. The converted time-domain signal can be considered the singing voice signal separated from the accompaniment.
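A possible realization of this step, using SciPy's STFT/ISTFT for the overlap-add resynthesis, is sketched below; it assumes Xs was estimated on the same STFT grid (window length and hop) as the mixture.

import numpy as np
from scipy.signal import stft, istft

def resynthesize(mixture, Xs, sr, nperseg=1024, noverlap=768):
    """Combine the estimated singing-voice magnitude Xs with the phase of the
    mixture and convert back to the time domain by overlap-add (ISTFT)."""
    _, _, Zxx = stft(mixture, fs=sr, nperseg=nperseg, noverlap=noverlap)
    phase = np.angle(Zxx)                          # reuse the mixture phase
    # Xs must lie on the same F x T grid as Zxx for this to work.
    _, voice = istft(Xs * np.exp(1j * phase), fs=sr,
                     nperseg=nperseg, noverlap=noverlap)
    return voice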

Furthermore, we can obtain the melody line of the singing voice by passing the time-domain signal to the fourth stage, Pitch Estimation, for final melody extraction. Given the singing voice extracted from the mixture, the main pitch sequence can be easily estimated by a simple auto-correlation technique similar to [Boersma 93].
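A simplified frame-wise autocorrelation pitch estimator in this spirit is sketched below; it omits the refinements of [Boersma 93] (such as the window-autocorrelation correction and candidate path tracking), and the frame length, hop and pitch range are illustrative values only.

import numpy as np

def autocorr_pitch(x, sr, frame_len=2048, hop=441, fmin=80.0, fmax=800.0):
    """Estimate one F0 value per frame from the peak of the autocorrelation
    within the allowed lag range; returns 0 for (near-)silent frames."""
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    pitches = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        if np.sqrt(np.mean(frame ** 2)) < 1e-3:        # crude silence gate
            pitches.append(0.0)
            continue
        ac = np.correlate(frame, frame, mode='full')[frame_len - 1:]
        ac /= ac[0] + 1e-12                            # normalize by frame energy
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        pitches.append(sr / lag)
    return np.array(pitches)        # with sr = 44100 and hop = 441, one F0 per 10 ms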

2.4. Illustrative example

To illustrate the effectiveness of the PLCA model for accompaniment removal, I obtained a clip of rock music which contains a mix of four sources including singing voice, electric guitar, electric bass and drum kit, as well as a separate track of the singing voice.

I manually divided the clip into a 15-second non-vocal segment and 14-second vocal segment. A spectral dictionary for the accompaniment is learned from the non-vocal segment by PLCA and then applied to reduce the accompaniment from the mixture as described in Sec. 2.3 .

Fig. 2.6 and Fig. 2.7 show the result of this process on the vocal segment of the mixture. The magnitude spectrogram of the mixture is shown in Fig. 2.6(a). The spectrogram of the original singing vocal track before mixing it with the accompaniment is plotted in Fig. 2.6(c), and the spectrogram of the signal extracted from the mixture is plotted in Fig. 2.6(b).

Comparing the spectrogram of the extracted singing voice (Fig. 2.6(b)) to that of the polyphonic mixture (Fig. 2.6(a)), we notice that many of the harmonic components from the accompaniment in the low and middle frequency range are gone. An example of these harmonic components is marked using dashed blue boxes in Fig. 2.6(a), and the same region, with many harmonic components removed, is marked using blue boxes in Fig. 2.6(b).

There is still energy from the accompaniment left in the extracted signal, mainly from the percussion instruments. The percussive instruments appear in the spectrogram as vertical spectral patterns. Some of the percussion components are marked using green boxes in Fig. 2.6(a), (b) and (c) for comparison. One reason for the percussion residual in the extracted singing voice is that the percussion instruments are not well represented in the non-vocal training segment, so the proposed model does not have enough examples to learn from during training.

Fig. 2.7 shows the melody detection results on the same clip of the song from Fig. 2.6. The ground-truth singing melody is obtained from the original singing voice before mixing it with the accompaniment. It is plotted as black solid lines in both Fig. 2.7(a) and (b). In Fig. 2.7(a), the blue line represents the pitch estimates obtained by applying the pitch tracker to the mixture. The melody pitch estimates from the extracted singing voice using the same pitch tracker are plotted as red dots against the ground-truth pitch (black solid lines) in Fig. 2.7(b).

The pitch tracker from [Boersma 93] is used. [Boersma 93] is a robust algorithm for periodicity detection in periodic signals and has been commonly used to obtain ground-truth pitches from single-source speech or music. In this work, [Boersma 93] is applied to both the audio mixture and the extracted singing voice to obtain the estimated melody lines plotted in Fig. 2.7. [Boersma 93] is also used to obtain the ground-truth melody line from the original singing voice before mixing it with the accompaniment.

For each estimated pitch trajectory, a pitch estimate in a frame is called correct if it deviates less than a quarter-tone from the pitch in the ground-truth pitch trajectory.

As shown in Fig. 2.7, the pitch track detected by the proposed method matches the ground-truth track well. About 80% of the ground-truth pitches are correctly identified by the proposed method. In contrast, the pitch contour estimated from the mixture without the proposed Accompaniment Reduction stage is far from the singing melody pitch; instead, it follows the pitches of the accompaniment, due to the strong interference from the harmonic accompaniment instruments.

This example shows that our proposed system works well to remove the accompaniment when we have a perfect Singing Voice Detection module.

2.5. Experiment

In this section we show a quantitative evaluation of the proposed system with an automatic singing voice detector.

The GMM-based singing voice detector is trained on a data set of 51 commercial songs across various genres. The ground-truth vocal/non-vocal segments are manually annotated by the author. Mel-frequency cepstral coefficients (MFCCs) are used as the input feature for the classifier. We performed three-fold cross validation on this data set. The average precision of the classifier is 76% for vocal detection and 73% for non-vocal detection.

The parameters for the best GMM classifier are used for the Singing Voice Detection module of the singing voice extraction system.
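For illustration, a frame-level vocal/non-vocal detector of this general kind could be sketched as follows with librosa and scikit-learn; the feature and model settings here (13 MFCCs, 32 diagonal-covariance components) are assumptions of this sketch and not necessarily those used in the experiments.

import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(y, sr, n_mfcc=13):
    """Frame-level MFCC features, shape (n_frames, n_mfcc)."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_detector(vocal_audio, nonvocal_audio, sr, n_components=32):
    """Fit one diagonal-covariance GMM per class on annotated excerpts."""
    gmm_v = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm_n = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm_v.fit(mfcc_frames(vocal_audio, sr))
    gmm_n.fit(mfcc_frames(nonvocal_audio, sr))
    return gmm_v, gmm_n

def detect_vocal_frames(test_audio, sr, gmm_v, gmm_n):
    """Label a frame as vocal when the vocal GMM gives the higher likelihood."""
    feats = mfcc_frames(test_audio, sr)
    return gmm_v.score_samples(feats) > gmm_n.score_samples(feats)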

The overall melody extraction system is tested on part of the MIREX 2005 training data set of 13 songs for audio melody extraction [Poliner 05]. We only considered the songs in the database containing lead vocals, i.e., 9 songs, totaling about 270 seconds of audio, in two musical styles: jazz and pop. All test songs are single-channel PCM data with a 44.1 kHz sample rate and 16-bit quantization.

For each estimated pitch trajectory, a pitch estimate in a frame is called correct if it deviates less than a quarter-tone from the pitch in the ground-truth pitch trajectory. The metrics Precision, Recall, F-measure and overall Accuracy for each pitch trajectory are computed as

Precision = #corP / #estP,    Recall = #corP / #refP    (2.9)

F-measure = 2 · Precision · Recall / (Precision + Recall),    Accuracy = (#corP + #corS) / (#refP + #refS)    (2.10)

where #corP, #estP, #refP, #corS and #refS are the number of correctly estimated pitches, estimated pitches, reference pitches, correctly estimated silent frames and reference silent frames, respectively.
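A direct implementation of these metrics, using the quarter-tone (50 cent) criterion and 0 Hz to denote silent frames, might look as follows; the array-based conventions are assumptions of this sketch.

import numpy as np

def melody_metrics(est, ref, cents_tol=50.0):
    """Precision, Recall, F-measure and Accuracy of Eqs. (2.9)-(2.10).
    est, ref: per-frame F0 arrays in Hz, with 0 denoting silence."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    est_voiced, ref_voiced = est > 0, ref > 0
    both = est_voiced & ref_voiced
    cents = np.full(est.shape, np.inf)
    cents[both] = 1200.0 * np.abs(np.log2(est[both] / ref[both]))
    corP = np.sum(both & (cents < cents_tol))      # correctly estimated pitches
    estP, refP = est_voiced.sum(), ref_voiced.sum()
    corS = np.sum(~est_voiced & ~ref_voiced)       # correctly estimated silence
    refS = np.sum(~ref_voiced)
    precision = corP / max(estP, 1)
    recall = corP / max(refP, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    accuracy = (corP + corS) / max(refP + refS, 1)
    return precision, recall, f_measure, accuracy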

We compared our proposed system to two recent pitch/melody estimation systems: DHP [Duan 09] and LW [Li 07]. DHP is a state-of-the-art multi-pitch estimation algorithm based on spectral peak and non-peak region selection. It outputs a likelihood score for each pitch hypothesis to indicate the confidence of the estimate; the first pitch detected is considered the predominant pitch, in the sense that the score of this pitch hypothesis is the highest. LW is a predominant pitch detection algorithm based on channel/peak selection and an HMM, specially designed for extracting the singing voice melody from polyphonic audio. For both algorithms, we used the source code and recommended parameters provided by the authors. The pitch value is estimated every 10 milliseconds.

             Precision   Recall   F-measure   Accuracy
  DHP          0.52       0.48      0.50        0.48
  LW           0.09       0.086     0.09        0.19
  Proposed     0.43       0.80      0.55        0.61

Table 2.2. Performance comparison of the proposed algorithm against DHP and LW, averaged across 9 songs (270 seconds) from the MIREX melody extraction dataset.

The results are summarized in Tab. 2.2. The LW algorithm performs poorly on all metrics. We believe the strong accompaniment energy in the music causes the poor performance of LW, because its estimated pitches are found to match many pitches from the pitched accompaniment instruments. We also speculate that the parameters of LW may need to be specially tuned for a given data set, even though we used the recommended values for all parameters.

DHP has the best precision, but it fails to output pitch estimates in the singing voice regions where there is strong interference from the percussion instruments, producing a much lower recall than our system. In the presence of other instrumental sounds, our proposed system achieves the best recall, F-measure and overall accuracy on this data set. The high recall of our system indicates that the proposed Accompaniment Reduction stage successfully suppresses the background instruments, leaving the singing voice as the predominant component in the extracted spectrogram.

The relatively low precision is because the background music is not completely removed from the mixture, partly due to an imperfect Singing Voice Detection stage.

The singing voice detector used in this work is very preliminary and can be improved in various ways, such as those proposed in [Ramona 08, Regnier 10]. In the illustrative example of Sec. 2.4 (Fig. 2.7), I showed that an overall accuracy of about 80% can be achieved with a perfect singing voice detector. We believe the proposed system can perform significantly better with an improved singing voice detection module.

2.6. Contributions and Conclusion

In this chapter I developed a semi-supervised algorithm for singing melody extraction from single-channel polyphonic music. The contributions of the proposed algorithm are as follows. The proposed system assumes no prior information on the type or the number of instruments in the mixture. Compared to existing melody extraction work, the proposed system adaptively learns the accompaniment model from the identified nearby non-vocal music to better fit the audio of interest.

Inspired by the continuity phenomenon described in auditory scene analysis of the human auditory system, the proposed system is based on the observation that the spectral pattern of the music accompaniment is stable and consistent before, during and after the vocal segments. Probabilistic Latent Component Analysis (PLCA) is applied to learn a dictionary of spectral vectors from the accompaniment identified before and after the segment of singing voice. This learned dictionary is then used to remove the accompaniment components from the vocal segment.

Experimental results illustrated that the PLCA model successfully suppressed the background music in the mixture audio. Quantitative evaluation showed that our proposed algorithm is significantly better than two other melody extraction algorithms. The proposed system can be easily extended to extract the melody of a lead instrument from its accompaniment, or to serve as a predominant source separation system.

Although the proposed method does not require pre-trained instrument models, its performance does depend on the performance of the singing voice detection. Future directions of research include more advanced singing voice detection, pitch estimation techniques robust to noise interference, and incorporating the structure of music to make the algorithm unsupervised.

Figure 2.6. Melody extraction results on a clip of “Simple Man” by Lynyrd Skynyrd.

Figure 2.7. Melody detection results. Ground Truth (black solid lines) is obtained by applying the pitch tracker to the singing voice before mixing (Fig. 2.6 (c)). Estimation 1 (blue solid line) is obtained by applying the same pitch tracker to the audio mixture directly (Fig. 2.6 (a)). Estimation 2 (red dots) is obtained by applying the same pitch tracker to the extracted singing voice (Fig. 2.6 (b)).

CHAPTER 3

Audio Imputation

The problem of missing data in an audio spectrogram occurs in many scenarios. For example, the problem is common in signal transmission, where the signal quality is degraded by limited bandwidth due to linear or non-linear filtering operations. In other cases, audio compression and editing techniques often introduce spectral holes into the spectrogram of audio. Missing values also occur frequently in the output of audio source separation algorithms, due to time-frequency component masking.

It is well known that the human auditory system is able to resynthesize the missing parts of a continuous auditory stream – an ability called “auditory scene induction (ASI)” [Warren 70, Warren 72] – which enables us to recognize sound in noisy environments.

The computational realization of ASI is called computational auditory scene induction (CASI).

Audio imputation is the task of filling in missing values of an audio signal to improve the perceived quality of the resulting signal. In this thesis, I treat the terms “audio imputation” and “CASI” as interchangeable, in the sense that the goal of both is to resynthesize the missing values of an audio signal. However, the term “audio imputation” does not imply that the missing data is estimated in the same way the human auditory system does it. More often, audio imputation techniques rely on advanced statistical signal models and machine learning methods. An effective approach to audio imputation could benefit many important applications, such as speech recognition in noisy environments, audio bandwidth expansion, sound restoration and enhancement, audio declipping, audio source separation, and more.

In this chapter, I propose an audio imputation algorithm using the Non-negative Hidden Markov Model (N-HMM) [Mysore 10]. Compared to previous methods, which mainly make use of the spectral structures of audio to perform imputation, the proposed work takes both the spectral and temporal structures of audio into consideration.

I first consider an existing audio imputation method using PLCA (details of PLCA are discussed in Sec. 2.2) as a starting point. I then point out the problems of using PLCA, and introduce the N-HMM model to address these problems. These materials are presented in Sec. 3.2 and Sec. 3.3.

In Sec. 3.4 , I show how to apply the N-HMM model to estimate the missing values of audio spectrogram. More specifically, the missing data is estimated iteratively during the N-HMM learning process via an expectation–maximization (EM) framework. The proposed framework allows both unsupervised and supervised imputation. In Sec. 3.5 , I show the proposed algorithm has promising performance by comparing it to an imputation algorithm using PLCA, on real-world polyphonic music audio. This chapter is concluded in Sec. 3.6 .

Before going into details of the proposed algorithm, I first give an overview of the existing audio imputation techniques.

3.1. Related work

Audio imputation from corrupted recordings can be a challenging problem. Missing or corrupted data in audio occurs in different scenarios, such as during signal acquisition, transmission, manipulation, and separation. Because of the different scenarios that cause the missing data, this problem has usually been treated differently depending on the context. Popular missing data problems include audio gap interpolation, audio declipping, feature compensation for speech recognition, audio bandwidth expansion and more. Audio imputation has also been dubbed “audio inpainting”, the counterpart to image inpainting in computer vision.

The work in this thesis addresses the general scenario where the missing data can occur randomly in the spectrogram, in regions of any shape, or as band-limited regions at any frequencies. However, the proposed approach is limited to estimating the missing values in the time-frequency representation (also known as the “spectrogram”) of audio. Thus, we will only consider audio imputation approaches that operate in the time-frequency domain when comparing the proposed system to existing work. Such work includes audio bandwidth expansion, feature compensation for speech recognition, and more general audio imputation methods that can deal with more than one missing-data scenario.

In this section, I will also give a brief summary of the problems of audio gap interpolation and declipping, which are usually approached in the time-amplitude domain; we do not directly compare the proposed method to these approaches. The advantage of considering the time-frequency representation of audio is that more information about the structure of the audio can be obtained from the spectrogram, and more complex audio modeling techniques can be employed to model the signals of interest.

In this section, I first give an overview of popular missing data imputation techniques according to different scenarios that cause these problems to occur, and then narrow down to the particular problem that is considered in this thesis.

3.1.1. Audio gap interpolation

Audio gap interpolation is the process of reconstructing a missing or corrupted segment of samples in an audio signal. Such a need arises when, for example, an audio signal contains impulsive noise introduced by scratches on an old CD, or a transmitted signal loses packets of samples due to bad connections over a VoIP system.

The reconstruction across gaps of missing samples in audio signals has been approached through several means, among them auto-regressive (AR) modeling [Janssen 86, Etter 96, Esquef 06]; sub-band interpolation [Cocchi 02, Clark 08]; sinusoidal modeling [Maher 94, Lagrange 05]; and amplitude and phase estimation (APES) filtering [Ofir 07].

Most of the approaches, such as those in [Maher 94, Cocchi 02, Esquef 06, Clark 08], are only appropriate for reconstructing missing data with gap lengths of up to 100 ms. The sinusoidal-modeling approach of [Lagrange 05] achieves good interpolation for gap lengths of up to 1600 ms, but it is limited to interpolation of harmonic sources only, which makes it unsuitable for music signals consisting of both harmonic and percussive sources.

3.1.2. Audio declipping

Audio declipping is the process of restoring a waveform that has been clipped at a threshold because the maximum range of an acquisition system was exceeded. It can be considered another use case of audio imputation in the time domain.

First considered in [Abel 91], audio declipping is usually approached in a manner similar to a time-domain interpolation problem. For example, [Godsill 01] applied a time-varying auto-regressive (TVAR) model to restore clipped speech signals. [Dahimene 08] proposed an algorithm specially designed for clipped speech, under the assumption that the clipped speech is voiced and can be linearly predicted with high accuracy. Both methods are only suitable for declipping speech signals, based on the assumption that speech signals are highly predictable; this is usually not the case for music.

[Adler 12] introduced a more advanced algorithm for declipping of both speech and music signals using the constrained Matching Pursuit. This algorithm imputes each time frame of clipped signal independently, ignoring the temporal information of audio.

Despite their differences, audio gap interpolation and declipping share some common properties. Most of the techniques operate on a time-amplitude representation of the signal to recover the missing values. The underlying assumption is that the signal is highly predictable, such as a single speech source. As the audio of interest becomes more complex and contains multiple sources active simultaneously, time-amplitude representations are less effective at providing useful information for signal recovery, and the assumption that the signal is stationary over time no longer holds.

In this thesis, the missing data problem is approached in the time-frequency representation of audio signals. The proposed algorithm considers temporal changes of audio signals, which makes it more suitable for imputation of complex audio structures.

Next, I introduce existing imputation methods that operate in the time-frequency representation of the audio signal.

3.1.3. Audio bandwidth expansion

The most popular and widely studied missing data problem is audio bandwidth expansion, owing to the bandwidth limitation present in many communication channels, such as telephone transmission systems.

Audio Bandwidth Expansion (BWE) [Larsen 04] refers to methods that increase the frequency bandwidth of narrowband audio signals. Such frequency expansion is desirable if at some point the bandwidth of the signal has been reduced, as can happen during signal recording, transmission, storage, or reproduction.

Most BWE methods are based on the source-filter model of speech production [Jax 02]. Such methods generate an excitation signal and modify it with an estimated spectral envelope that simulates the characteristics of the vocal tract. An artificially generated highband signal is then combined with the original narrowband signal to form a speech signal with an extended bandwidth. The main focus has been on spectral envelope estimation. Classical techniques for spectral envelope estimation include codebook mapping [Enbom 99], Gaussian mixture models (GMM) [Park 00], hidden Markov models (HMM) [Bauer 09], and neural networks [Pulakka 11a].

A sizeable literature exists on the topic of audio bandwidth expansion. In this thesis, I dedicate a separate chapter to this particular problem; a more comprehensive coverage of BWE techniques can be found in Chapter 4.

As we can see, the main focus of BWE has been on the bandwidth expansion of speech, since this is the case that most often suffers from the bandwidth limitation problem. These methods need to be trained on parallel wideband and narrowband corpora to learn a specific mapping between narrowband features and wideband spectral envelopes. The BWE algorithms in this thesis (discussed in Chap. 3 and Chap. 4) treat the expansion of bandwidth as a missing data problem, which makes the proposed algorithms more flexible in their training requirements and extensible to many more types of band-limiting situations.

3.1.4. Feature-compensation for ASR

Audio imputation has been employed in the area of automatic speech recognition (ASR). ASR systems perform poorly when the speech to be recognized is corrupted by noise, especially when the system has been trained on clean speech. Feature-compensation methods [Cooke 01, Raj 00, R. 04, Kim 09, Gemmeke 11] reconstruct complete spectrograms from incomplete ones prior to recognition, effectively improving the performance of ASR systems. To achieve this, the true values of the unreliable time-frequency components of the spectrogram are estimated using the information from the reliable components of the spectrogram as well as the known statistical relationships between the various components.

[Raj 00, R. 04] proposed two feature-compensation algorithms for ASR. The first algorithm, called the correlation-based method, assumes that the spectrogram of a clean speech signal is the output of a Gaussian wide-sense stationary (GWSS) random process [Papoulis 91]. The parameters of the GWSS random process are first learned from a training corpus of clean speech. The unreliable components of the corrupted speech spectrogram are then estimated from their correlations (characterized by the GWSS random process) with the reliable components using maximum a posteriori (MAP) estimation.

The second algorithm is referred to as the cluster-based method. It clusters spectral vectors of clean training speech. Corrupt components of noisy speech are estimated from the distribution of each spectral cluster by a weighted mean estimate. In [Smaragdis 11], it has been shown that the cluster-based method is not suitable for imputation of highly complex signals, such as music.

[Kim 09] combines the correlation-based method with the cluster-based method for missing-feature reconstruction, by including additional reliable time-frequency components that are highly correlated, across the time and frequency axes, with the missing frequency region.

Instead of learning a generative model for the clean speech signal, [Gemmeke 11] uses example speech segments themselves, called exemplars, as the spectral dictionary. The corrupted speech spectrogram is then approximated using a sparse combination of atoms from this dictionary. However, this approach requires a large amount of clean speech for training. The work in this thesis, in contrast, can be used in an unsupervised way, while additional training data will further improve its performance.

In [Cooke 01], a state-based data imputation algorithm is proposed to estimate values for the unreliable regions by conditioning on the reliable parts of the spectrogram and the recognition hypothesis. This algorithm is similar to the proposed imputation method in the sense that both methods utilize a hidden Markov model structure to learn the temporal dynamics of different speech states. However, the approach in [Cooke 01], like many other speech recognition algorithms, assumes that the density function in each state can be modeled using Gaussian mixture models, which greatly constrains the algorithm to highly predictable signals such as single-source speech.

The aforementioned ASR feature-compensation algorithms are suitable for imputation of speech for the purpose of speech recognition. Since these algorithms are only evaluated in terms of speech recognition accuracy, they are presumably optimized for this purpose, and the reconstructed spectrograms are never evaluated in terms of speech quality. My goal, on the other hand, is a high-quality reconstructed audio signal, and the design decisions here are made for this goal. Furthermore, the effectiveness of these systems has never been tested on musical or general audio recordings. Real-world recordings include a variety of sounds, many of which are non-stationary and concurrently active at any time, each with its own typical patterns; they are hence much harder to model and reconstruct using techniques for ASR feature-compensation.

3.1.5. Generic imputation algorithm

Designing a more generic algorithm for imputation of general audio signals with various types of corruption is a challenging task; it is the focus of the work proposed in this chapter. In recent years, there has been increasing interest in using statistical models for audio imputation, such as singular value decomposition, hidden Markov models and non-negative spectrogram decomposition.

[Brand 02] introduced a generic imputation algorithm based on Singular Value Decomposition (SVD). This algorithm is claimed to handle arbitrary missing values because it only takes into consideration generic properties of the incomplete matrix. However, as shown in [Smaragdis 11], it is usually ill-suited for audio signals and results in audible distortions.

[Moussallam 10] uses sparse approximations with a learned dictionary to model the main components of the undamaged source spectra. It aims at being generic across sources, provided it has been trained on undamaged samples of the source. This algorithm, however, is only applicable to reconstructing stationary spectral patterns that do not change with time.

Algorithms such as those in [Le Roux 10, Smaragdis 11] are based on non-negative spectrogram factorization techniques such as Non-negative Matrix Factorization (NMF) or Probabilistic Latent Component Analysis (PLCA) (both are discussed in detail in Sec. 2.2). They are more suitable for imputation of general audio signals such as music, and are the current state of the art for audio imputation. In this section, we consider the work proposed in [Smaragdis 11] as the starting point and as the baseline system to which the proposed approach is compared.

The biggest issue with these algorithms is that they treat individual time frames of the spectrogram as independent of adjacent time frames, which makes them less effective at modeling complex audio scenes and reconstructing severely corrupted audio. The proposed algorithm (detailed in Sec. 3.4) addresses this issue by taking into consideration the non-stationarity and temporal dynamics of audio signals (both terms are further explained in Sec. 3.2).

To be more specific, the work in this chapter shares some resemblance with the approaches in [Cooke 01, Le Roux 10, Smaragdis 11], but addresses the following problems that those methods have not addressed:

• [Cooke 01] utilized a hidden Markov model to learn the temporal dynamics of the audio signal, but the use of a Gaussian mixture model as the observation model greatly constrains the representational ability of the model, making it less effective for audio with complex structure.

• [Le Roux 10, Smaragdis 11] utilized non-negative spectrogram factorization to provide rich representations of the spectral structure of sound sources; however, the dictionary learned by these methods ignores the non-stationarity and temporal dynamics of the audio signal.

The proposed algorithm uses multiple dictionaries, such that each time frame is explained by any one of several dictionaries, to account for the non-stationarity of audio. Additionally, it uses a Markov chain to explain the transitions between dictionaries, to account for the temporal dynamics of the audio signal. The state observation model is a mixture multinomial model instead of a Gaussian mixture model. This makes it more suitable for imputation of complex audio signals.

In a very general sense, the proposed algorithm can be viewed as combining the merits of [Cooke 01], which uses a hidden Markov model, and [Le Roux 10, Smaragdis 11], which use non-negative spectrogram factorization techniques. However, the combination of these modeling strategies is not trivial. The proposed algorithm is based on the so-called non-negative hidden Markov model (N-HMM). In the next section, I explain in detail how the N-HMM model extends non-negative spectrogram factorization and the hidden Markov model.

3.2. Non-negative Hidden Markov Model

In this section, I describe the non-negative hidden Markov model (N-HMM). In order to better explain this model, I first give a general discussion of the concept of N-HMM.

This is followed by detailed description of the model parametrization.

3.2.1. Conceptual Explanation

In Chap. 2, I introduced non-negative spectrogram factorization (NMF and PLCA), and showed how PLCA can provide a rich representation to model the vocal and non-vocal parts in singing melody extraction. This technique, however, has some issues in modeling audio, described as follows.

PLCA or NMF (Fig. 3.1 (a) ) learns a single dictionary to describe the statistics of the sound source over all time frames. Given the spectrogram of another audio signal, the same dictionary is used to model every time frame (column) of the spectrogram. This modeling philosophy has taken the spectral structures into consideration but ignores the following properties of audio.

On one hand, audio is non-stationary: the statistics of its spectrum change over time. Therefore, amalgamating the statistics over all time frames into a single dictionary (as PLCA or NMF does) is perhaps not the best strategy.

On the other hand, the statistics of the spectral structure are quite consistent over segments of time. Moreover, there is a structure to the non-stationarity of audio which we call temporal dynamics.

A method that conforms better to non-stationary signals is to learn several small dictionaries to explain different aspects of the sound. Each time frame of the spectrogram can then be explained by a linear combination of the spectral components from one (out of the many) dictionaries.

The Non-negative Hidden Markov Model (N-HMM) [Mysore 10] does this by jointly learning several small dictionaries, as shown in Fig. 3.1(b), to explain different aspects of the sound. Different dictionaries are learned automatically, and they often correspond to intuitive temporal parts of the audio. For example, for spoken words, each dictionary is likely to correspond to a phoneme or part of a phoneme. In a sequence of music notes, each dictionary is likely to correspond to a note.

Learning several small dictionaries rather than a single large dictionary is consistent with the non-stationarity of audio. However, there is still more structure we can use.

Since each dictionary corresponds to a different aspect of the sound source, the transitions between dictionaries correspond to the temporal dynamics of the source, relating different temporal parts of the audio. This can be modeled by a Markov chain learned from the data as well, as shown in Fig. 3.1(b).

A Markov chain also has intuitive interpretations. For example, if the given spectrogram corresponds to a sequence of notes, the transitions between notes could conform to music theory rules: the chain tells us the probability of a note given the note in the previous time frame. For speech, the Markov chain could be learned using linguistic rules for transitions between phonemes or words. We will return to this property of the N-HMM in Chap. 4, to show how to take advantage of this Markov chain structure once we have more knowledge about the sound source that we are going to model.

Figure 3.1. Comparison of dictionaries learned by non-negative models: (a) non-negative spectrogram factorization; (b) non-negative hidden Markov model with an ergodic transition model. PLCA uses a single large dictionary to explain a sound source, whereas the N-HMM uses multiple small dictionaries and a Markov chain. Here, each column represents a single spectral component in the dictionary.

A comparison between an N-HMM and PLCA is illustrated in Fig. 3.2. We start with a single large dictionary learned by PLCA, illustrated in Fig. 3.2(b), to explain everything in the audio spectrogram plotted in Fig. 3.2(a), and work up to several small dictionaries and a Markov chain, as shown in Fig. 3.2(c), jointly learned from the given spectrogram by an N-HMM. Each dictionary corresponds to a state of the Markov chain. The N-HMM takes into account the temporal dynamics of the audio signal via the Markov chain. Instead of using one large dictionary to explain everything in the audio, the N-HMM learns several small dictionaries, each of which explains a particular temporal portion of the spectrogram, to better account for the non-stationarity of audio.

An N-HMM learned from real speech is shown in Fig. 3.3. In this example, four dictionaries (states) with five spectral components per state are learned. The state posterior given the observations shows how likely an audio frame is to be explained by each dictionary. As shown in the state posterior plot, only one state is active at a particular time frame. This model uses multiple dictionaries such that each time frame of the spectrogram is explained by any one of the several dictionaries (accounting for non-stationarity). Additionally, it uses a Markov chain to explain the transitions between its dictionaries (accounting for temporal dynamics).

Figure 3.2. A comparison between PLCA and N-HMM. We start with a single large dictionary that is learned by PLCA (b) to explain everything in an audio spectrogram (a), and work up to several small dictionaries and a Markov chain (c) jointly learned from the given spectrogram by N-HMM. Each dictionary corresponds to a state of the Markov chain.

Figure 3.3. Illustration of N-HMM on the spectrogram of a clip speech ‘Bad dog’, as shown in (a). ‘States’ represent small dictionaries learned by N-HMM. In this example, four dictionaries (states) with five spectral components per state are learned as shown in (d). The state posterior given the observation and state transition matrix are plotted in (b) and (c) respectively.

3.2.2. Probabilistic Model description

The work in this thesis is built on probabilistic models of audio. Therefore we encounter various probability distributions throughout the thesis. Before going into details of the probabilistic model, I first introduce some notations.

Let {f, z, q} be a set of discrete random variables 1 . Each random variable can take on values in a finite set of discrete real values. For example, we assume f takes values from the set of analysis frequencies of the FFT. We will introduce in a moment the sets of values that z and q range over.

Audio is a temporal signal. Thus we have a set of random variables {ft, zt, qt} at time frame t. Note that ft, zt and qt range over the same sets of values as f, z and q, respectively. For example, ft and f both take values from the analysis frequencies of the FFT.

We denote time-varying distributions with a subscript t. For example, Pt(z|q) indicates that we have a separate distribution for each time frame, whereas P(z|q) indicates that we have a single distribution for all time frames. If we come across P(zt|qt), it means the same distribution P(z|q), but in the context of time frame t. On the other hand, if we encounter Pt(zt|qt), it means that the conditional distribution of z given q is time-dependent.

I now describe the probabilistic model of the N-HMM. Let us first start with a simple HMM and build up from there in a step-by-step manner.

1 In graphical models, upper case usually indicates a random variable and lower case its realization. We will respect this notation when we introduce graphical models. When we talk about distributions, lower case means random variables.

The graphical model of an HMM is shown in Fig. 3.4 2 . We use the standard convention of representing random variables with nodes and conditional probability distributions with arrows. The direction of the arrows indicates the direction of dependence of the random variables. Shaded nodes indicate observed random variables and clear nodes indicate hidden random variables. The same notation is used henceforth.

In the HMM shown in Fig. 3.4 , we can use a multinomial distribution P (f|q) as the observation model. The use of multinomial distribution is analogous to a spectral component in non-negative spectrogram factorizations (see PLCA in Sec. 2.2.2 ). Each state corresponds to one spectral component from a dictionary. The collection of all states form a dictionary of spectral components.

Figure 3.4. Graphical model for an HMM with multiple draws at every time frame (from [Mysore 10]). {Q, F} is a set of random variables; vt represents the number of draws at time t from the distributions. The shaded variable indicates an observed variable.

2 In graphical models, upper case usually indicates a random variable. We respect this notation only when we introduce graphical models.

An audio spectrogram is modeled as a histogram of “sound quanta”. The amount of sound quanta in a given time-frequency bin indicates the Fourier magnitude of that bin and is given by Vft. If we draw a single frequency from the multinomial at every time frame, our observation would then be a single “frequency quantum” at every time frame 3 .

In order to model a magnitude spectrum, we need to have multiple draws at every time frame. Therefore we explicitly model the number of draws vt = Σ_f Vft for a given state with a distribution P(vt|qt). The introduction of vt is necessary: once the spectrum is expressed in “sound quanta”, the magnitude of the spectrum at each time is the number of “frequency quanta” observed in that frame. Thus the number of draws made to explain a given time frame intuitively corresponds to the magnitude of the spectrogram at that time frame.

The use of a single spectral component per state can be quite limiting. We can extend this to a dictionary (mixture) of spectral components per state, modeled using a multinomial mixture model. The spectral components for state q are given by the distribution P(f|z, q), where z is the (index of the) spectral component. Since we have multiple spectral components per state, we need a set of mixture weights Pt(z|q) for each state. Notice that we use time-dependent distributions for the mixture weights: if we did not, then every occurrence of state q would have the same mixture weights, and P(f|q) would be fixed for each q, which would greatly reduce the expressive power of the model.

These expansions give us the N-HMM. Its graphical model is shown in Fig. 3.5. Each state q of the N-HMM corresponds to a dictionary. Each dictionary contains a number of spectral components that can be indexed by z. Therefore, a spectral component (indexed by z) from a dictionary (indexed by q) is represented by P(f|z, q).

3 This will be clearer once the generative process of the N-HMM is introduced in Tab. 3.2.

Figure 3.5. Graphical model for the N-HMM (from [Mysore 10]). {Q, Z, F} is a set of random variables; vt represents the number of draws at time t from the distributions. Shaded variables indicate observed data. Q and Z range over the sets of dictionary (state) indices and spectral component indices, respectively; F ranges over the set of analysis frequencies of the FFT.

f takes values from the analysis frequencies of the FFT. Both q and z are random variables that take on values from the sets of dictionary indices and spectral component indices, respectively. For example, if we have ten dictionaries (states) in our model with indices from 1 to 10, then q (or qt) takes on a value from the set {1, ..., 10}.

The observation model at time t, which corresponds to a linear combination of the spectral components from dictionary q, is given by:

Pt(ft|qt) = Σ_{zt} Pt(zt|qt) P(ft|zt, qt)    (3.1)

where Pt(zt|qt) is the distribution of mixture weights at time t. The transitions between states are modeled with a Markov chain, given by P(qt+1|qt). The initial state probabilities of the Markov chain are denoted by P(q1). All distributions are discrete.
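For concreteness, the observation model of Eq. (3.1) can be computed from the per-state dictionaries and time-varying mixture weights as in the sketch below; the array shapes used to store the distributions are an assumption of this sketch.

import numpy as np

def observation_model(Pf_zq, Pz_q_t):
    """Eq. (3.1): P_t(f|q) = sum_z P_t(z|q) P(f|z, q).

    Pf_zq  : array (F, Z, Q), spectral components of each state's dictionary
    Pz_q_t : array (T, Z, Q), time-varying mixture weights for each state
    Returns an array (T, F, Q) with P_t(f|q) for every frame and state."""
    return np.einsum('tzq,fzq->tfq', Pz_q_t, Pf_zq)

# Toy check: with normalized inputs, each P_t(.|q) sums to one over f.
F, Z, Q, T = 257, 5, 4, 10
Pf_zq = np.random.rand(F, Z, Q); Pf_zq /= Pf_zq.sum(0, keepdims=True)
Pz_q_t = np.random.rand(T, Z, Q); Pz_q_t /= Pz_q_t.sum(1, keepdims=True)
Pft_q = observation_model(Pf_zq, Pz_q_t)
assert np.allclose(Pft_q.sum(1), 1.0)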

N-HMM parameter    Description
P(f|z, q)          Spectral components (multinomial distributions)
Pt(zt|qt)          Mixture weights (multinomial distributions)
P(qt+1|qt)         Transition matrix (multinomial distributions)
P(q1)              Initial state probabilities (multinomial distribution)
P(v|q)             Energy distributions (Gaussian distributions)

Table 3.1. The parameters of the Non-negative Hidden Markov Model. These parameters can be estimated using the Expectation-Maximization algorithm. q and z range over the sets of dictionary indices and spectral component indices, respectively; f ranges over the set of analysis frequencies of the FFT.

In our model, we assume V_t, the spectrogram at time t, is generated by repeated draws from a distribution P(f_t|q_t), given that we are at state q_t. We explicitly model the number of draws for a given state with a Gaussian distribution P(v|q). The number of draws that were made to explain a given time frame intuitively corresponds to the energy of the spectrogram at that frame. We therefore call it the “energy distribution” 4.

So far, we have described all the parameters of the N-HMM. The model includes the spectral components, weights distributions, energy distributions, transition matrix, and initial state probabilities. The parameters of the model are listed in Tab. 3.1.

Given an N-HMM, the generative process of an audio spectrogram of T time frames is described in Tab. 3.2. Here, we adopt the same modeling philosophy used by PLCA in Sec. 2.2: an audio spectrogram is modeled as a histogram of “sound quanta”.

4 v corresponds to the magnitude of the spectrum; we call it the “energy distribution” to be consistent with [Mysore 10].

(1) Choose an initial state according to P(q_1).
(2) Set t = 1.
(3) Choose the number of draws v_t for the given time frame according to P(v_t|q_t).
(4) Repeat the following steps v_t times:
    • Choose a spectral component according to P_t(z_t|q_t).
    • Choose a frequency according to P(f_t|z_t, q_t).
    • Add one “sound quantum” to the time-frequency bin picked in the last step at time t in the spectrogram.
(5) Transition to a new state q_{t+1} according to P(q_{t+1}|q_t).
(6) Set t = t + 1 and go to step 3 if t < T.

Table 3.2. The generative process of an audio spectrogram using the N-HMM.
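The following Python sketch follows the steps of Tab. 3.2 to sample a toy spectrogram. It is only an illustration under simplifying assumptions: the mixture weights are held fixed over time, the energy distribution is approximated by a unit-variance Gaussian around a per-state mean, and all array and function names are hypothetical.

```python
import numpy as np

def sample_spectrogram(P_q1, P_trans, P_f_given_zq, P_z_given_q, energy_mean, T, seed=0):
    """Minimal sampler for the generative process of Tab. 3.2 (illustrative only).

    P_q1         : (Q,)       initial state probabilities P(q_1)
    P_trans      : (Q, Q)     row-stochastic transitions, P_trans[i, j] = P(q_{t+1}=j | q_t=i)
    P_f_given_zq : (F, Z, Q)  spectral components P(f | z, q)
    P_z_given_q  : (Z, Q)     mixture weights (time-independent here for brevity)
    energy_mean  : (Q,)       mean of the per-state energy distribution P(v | q)
    """
    rng = np.random.default_rng(seed)
    F, Z, Q = P_f_given_zq.shape
    V = np.zeros((F, T))
    q = rng.choice(Q, p=P_q1)                               # step (1): initial state
    for t in range(T):                                      # steps (3)-(6)
        v_t = max(1, int(rng.normal(energy_mean[q], 1.0)))  # number of draws for this frame
        for _ in range(v_t):
            z = rng.choice(Z, p=P_z_given_q[:, q])          # choose a spectral component
            f = rng.choice(F, p=P_f_given_zq[:, z, q])      # choose a frequency
            V[f, t] += 1                                    # add one "sound quantum"
        q = rng.choice(Q, p=P_trans[q])                     # transition to the next state
    return V
```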

During the N-HMM generative process of an audio spectrogram, the random variables {q, z, f, v} are drawn multiple times from their distributions. At every time step, v_t indicates the number of draws of frequencies at that time. Both f and z are drawn v_t times at each time frame t. Thus at every draw of every time frame, we have an observation of f and z. We use f_{t,v} and z_{t,v}, t = 1, · · · , T, v = 1, · · · , v_t, to represent the specific instances of observations. We also use q_t to indicate the state we are in at time t. In the end, we have a sequence of draws for each random variable. We use \bar{f} = {f_{t,v} | t = 1, · · · , T, v = 1, · · · , v_t} to indicate the sequence of all of the draws of frequencies over all time frames, which is the whole spectrogram. We use \bar{v} = {v_t | t = 1, · · · , T} to indicate the total number of draws over all time frames (the sum of the magnitude of the spectrum at each time frame). These are the observed variables in our model.

In a similar way, \bar{z} = {z_{t,v} | t = 1, · · · , T, v = 1, · · · , v_t} and \bar{q} = {q_t | t = 1, · · · , T} indicate the sequences of draws of all the hidden variables. For example, if we have 10 dictionaries (states) with indices from {1, · · · , 10}, then q_t takes on a value from {1, · · · , 10}.

The complete data likelihood is given by:

(3.2)    P(\bar{f}, \bar{v}, \bar{z}, \bar{q}) = P(q_1) \left( \prod_{t=1}^{T-1} P(q_{t+1}|q_t) \right) \left( \prod_{t=1}^{T} P(v_t|q_t) \right) \left( \prod_{t=1}^{T} \prod_{v=1}^{v_t} P_t(z_{t,v}|q_t) P(f_{t,v}|z_{t,v}, q_t) \right)

The parameters of the N-HMM are estimated by maximizing the log-likelihood of the data. Since some of the random variables (z, q) are hidden, the expectation-maximization (EM) algorithm is used to perform the parameter estimation. It is an iterative algorithm in which the log-likelihood increases with each iteration and tends to converge after a certain number of iterations.

Specifically, we iterate between the following steps:

(1) Expectation step (E step) – Compute the posterior distribution using the estimated parameters.
(2) Maximization step (M step) – Estimate the parameters by maximizing the expected value of the complete data log likelihood with respect to the posterior distribution.

The posterior distribution of the model is the probability of the hidden variables given the observations and is given by:

(3.3)    P(\bar{z}, \bar{q} | \bar{f}, \bar{v}) = \frac{P(\bar{f}, \bar{v}, \bar{z}, \bar{q})}{\sum_{\bar{z}, \bar{q}} P(\bar{f}, \bar{v}, \bar{z}, \bar{q})}

The complete data log likelihood is obtained by taking the log of the complete data likelihood given in Eq. 3.2:

(3.4)    \log P(\bar{f}, \bar{v}, \bar{z}, \bar{q}) = \log P(q_1) + \sum_{t=1}^{T-1} \log P(q_{t+1}|q_t) + \sum_{t=1}^{T} \log P(v_t|q_t) + \sum_{t=1}^{T} \sum_{v=1}^{v_t} \log P_t(z_{t,v}|q_t) + \sum_{t=1}^{T} \sum_{v=1}^{v_t} \log P(f_{t,v}|z_{t,v}, q_t).

As we can see, in the log domain, several terms in the likelihood are decoupled. Furthermore, taking the log of the data likelihood helps us avoid the underflow problems that occur in the linear domain.

The expected value of the complete data log likelihood with respect to the posterior distribution is given by:

(3.5)    L = E_{\bar{z}, \bar{q} | \bar{f}, \bar{v}} \left[ \log P(\bar{f}, \bar{v}, \bar{z}, \bar{q}) \right]

Since the N-HMM itself is not a contribution of this thesis, I do not describe the full procedure for the estimation of its parameters. I summarize the final results of the parameter estimation in Tab. 3.3. Please refer to [Mysore 10] for the full formulation.

The parameter estimation process of the N-HMM described above assumes that the spectrogram to be modeled is complete. I extend the N-HMM to the case where the spectrogram is incomplete (i.e., some elements in the spectrogram are missing or corrupted). In this process, the parameters as well as the missing data are estimated.

I will discuss the details of the proposed algorithm in Sec. 3.4.

E step – Intermediate Computations

(3.6)    P(f_t | q_t) = P(v_t|q_t) \prod_{f_t} \left( \sum_{z_t} P(f_t|z_t, q_t) P_t(z_t|q_t) \right)^{V_{ft}}

(3.7)    \alpha_1(q_1) = P(f_1, v_1|q_1) P(q_1)

(3.8)    \alpha_{t+1}(q_{t+1}) = \left( \sum_{q_t} \alpha_t(q_t) P(q_{t+1}|q_t) \right) P(f_{t+1}, v_{t+1}|q_{t+1})

(3.9)    \beta_T(q_T) = 1

(3.10)   \beta_t(q_t) = \sum_{q_{t+1}} \beta_{t+1}(q_{t+1}) P(q_{t+1}|q_t) P(f_{t+1}, v_{t+1}|q_{t+1})

(3.11)   P_t(z_t | f_t, q_t) = \frac{P_t(z_t|q_t) P(f_t|z_t, q_t)}{\sum_{z_t} P_t(z_t|q_t) P(f_t|z_t, q_t)}

E Step – Marginalized Posteriors

(3.12)   \gamma_t(q_t) = \frac{\alpha_t(q_t) \beta_t(q_t)}{\sum_{q_t} \alpha_t(q_t) \beta_t(q_t)}

(3.13)   P_t(z_t, q_t | f_t, \bar{f}, \bar{v}) = \gamma_t(q_t) P_t(z_t | f_t, q_t)

(3.14)   P_t(q_t, q_{t+1} | \bar{f}, \bar{v}) = \frac{\alpha_t(q_t) P(q_{t+1}|q_t) \beta_{t+1}(q_{t+1}) P(f_{t+1}, v_{t+1}|q_{t+1})}{\sum_{q_t} \sum_{q_{t+1}} \alpha_t(q_t) P(q_{t+1}|q_t) \beta_{t+1}(q_{t+1}) P(f_{t+1}, v_{t+1}|q_{t+1})}

M Step

(3.15)   P(f | z, q) = \frac{\sum_t V_{ft} P_t(z, q | f, \bar{f}, \bar{v})}{\sum_f \sum_t V_{ft} P_t(z, q | f, \bar{f}, \bar{v})}

(3.16)   P_t(z_t | q_t) = \frac{\sum_{f_t} V_{f_t t} P_t(z_t, q_t | f_t, \bar{f}, \bar{v})}{\sum_{z_t} \sum_{f_t} V_{f_t t} P_t(z_t, q_t | f_t, \bar{f}, \bar{v})}

(3.17)   P(q_{t+1} | q_t) = \frac{\sum_{t=1}^{T-1} P_t(q_t, q_{t+1} | \bar{f}, \bar{v})}{\sum_{q_{t+1}} \sum_{t=1}^{T-1} P_t(q_t, q_{t+1} | \bar{f}, \bar{v})}

(3.18)   P(q_1) = \gamma_1(q_1)

Table 3.3. The EM process of N-HMM learning.
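The sketch below walks through one simplified EM iteration over the quantities in Tab. 3.3, assuming the per-frame likelihoods P(f_t, v_t | q_t) have already been collected into an array B and ignoring the energy distribution P(v|q) for brevity; all shapes and names are illustrative assumptions rather than the exact implementation used in this thesis.

```python
import numpy as np

def forward_backward(B, P_q1, A):
    """Scaled forward-backward pass (Eq. 3.7-3.10, 3.12, 3.14).

    B    : (T, Q) per-frame observation likelihoods P(f_t, v_t | q_t)
    P_q1 : (Q,)   initial state probabilities
    A    : (Q, Q) transition matrix, A[i, j] = P(q_{t+1}=j | q_t=i)
    Returns gamma (T, Q) and xi (T-1, Q, Q), the marginalized state posteriors.
    """
    T, Q = B.shape
    alpha = np.zeros((T, Q)); beta = np.zeros((T, Q)); c = np.zeros(T)
    alpha[0] = P_q1 * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]          # scaling avoids underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)            # Eq. 3.12
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)             # Eq. 3.14
    return gamma, xi

def m_step(V, gamma, xi, P_f_given_zq, P_z_given_q_t, eps=1e-12):
    """M-step updates (Eq. 3.15-3.17).  V is (F, T); other shapes as above."""
    # Component posteriors P_t(z | f, q) (Eq. 3.11) times gamma (Eq. 3.13).
    num = P_f_given_zq[None] * P_z_given_q_t[:, None]    # (T, F, Z, Q)
    num /= num.sum(axis=2, keepdims=True) + eps
    post = num * gamma[:, None, None, :]                 # joint posterior over (z, q)
    weighted = V.T[:, :, None, None] * post              # V_{ft} * posterior
    new_dict = weighted.sum(axis=0)                      # Eq. 3.15
    new_dict /= new_dict.sum(axis=0, keepdims=True) + eps
    new_weights = weighted.sum(axis=1)                   # Eq. 3.16, per frame
    new_weights /= new_weights.sum(axis=1, keepdims=True) + eps
    new_A = xi.sum(axis=0)                               # Eq. 3.17
    new_A /= new_A.sum(axis=1, keepdims=True) + eps
    return new_dict, new_weights, new_A
```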

3.3. Audio Imputation by Non-negative Spectrogram Factorization

The work in this dissertation is an N-HMM based imputation technique for the estimation of missing frequencies in the spectrogram. The proposed method shares some similarities with audio imputation methods using non-negative spectrogram factorization techniques (NMF, PLCA, etc.).

Audio imputation methods can be classified as unsupervised and supervised methods.

The method proposed in this thesis can be employed in both ways.

Given an audio recording with missing frequencies, unsupervised methods try to estimate the missing frequencies using only the information inside the audio spectrogram itself. Supervised methods instead turn to similar training data for extra information.

In this section, I give a description of supervised audio imputation using non-negative spectrogram factorization techniques. The unsupervised method follows intuitively and will be discussed later.

Supervised audio imputation is needed when the corrupted audio cannot provide enough information to reconstruct the missing frequencies. For example, expanding the frequency bandwidth of a narrowband telephony audio signal (300 Hz – 3400 Hz) to wideband audio (up to 8000 Hz) requires us to make up the extra bandwidth that does not exist in the narrowband audio. In this case, a dictionary of spectral components can be learned by non-negative spectrogram factorization techniques from a corpus of clean audio examples that are similar to the corrupted audio. With the pre-learned spectral components, the goal of audio imputation is essentially to estimate the mixture weights for the spectral components so that a combination of these spectral components fits the observed frequencies of the spectrogram. Since the spectral components are trained from clean audio spectrograms, the missing frequencies are also “re-generated” in the process of fitting these spectral components to the observed frequencies of the spectrogram.

The general procedure of supervised audio imputation methods is as follows. First learn a dictionary of spectral components from the training data using a non-negative spectrogram factorization technique, such as Non-negative Matrix Factorization (NMF) or Probabilistic Latent Component Analysis (PLCA). Each frame of the spectrogram is then modeled as a linear combination of the spectral components from the dictionary.

Given the spectrogram of a corrupted audio recording, we estimate the weights for each spectral vector as well as the expected values for the missing entries of the spectrogram using an EM algorithm.
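The following sketch illustrates this general supervised procedure with a fixed, pre-learned dictionary and a masked multiplicative update for the weights. It is a simplified stand-in for the factorization-based methods discussed here, not the exact algorithm of [Smaragdis 11]; the mask convention, the KL-style NMF update, and all names are assumptions made for illustration.

```python
import numpy as np

def supervised_impute(V, mask, W, n_iter=100, eps=1e-12):
    """Fill missing spectrogram entries using a fixed dictionary of spectral components.

    V    : (F, T) corrupted magnitude spectrogram (missing entries may hold arbitrary values)
    mask : (F, T) boolean, True where V is observed
    W    : (F, K) spectral components learned from clean training audio (kept fixed)
    Only the mixture weights H are estimated, and only observed entries drive the fit.
    """
    F, T = V.shape
    K = W.shape[1]
    rng = np.random.default_rng(0)
    H = rng.random((K, T)) + eps
    M = mask.astype(float)
    for _ in range(n_iter):
        R = W @ H + eps                                  # current reconstruction
        # Multiplicative KL-NMF update for H, restricted to observed bins.
        H *= (W.T @ (M * V / R)) / (W.T @ M + eps)
    R = W @ H
    return np.where(mask, V, R)                          # keep observed values, impute the rest
```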

Figure 3.6. General Procedure of Supervised Audio Imputation

Fig. 3.6 shows an example of audio imputation using PLCA. In this example, a dictionary of 30 spectral vectors is learned from an intact audio spectrogram by PLCA.

Given a corrupted audio recording that is similar to the training audio, the missing values of the spectrogram can be estimated by a linear combination of the spectral components from the dictionary.

Unsupervised audio imputation follows a procedure very similar to that described above. The difference is that the spectral components are learned from the corrupted spectrogram itself in the process of the EM algorithm.

3.4. System description

Previous audio imputation methods [Le Roux 10, Smaragdis 11] are based on NMF or PLCA and learn a single dictionary of spectral components to represent the entire signal. These approaches treat individual time frames independently. The dictionary learned by these approaches does not take into consideration the non-stationarity and temporal dynamics of the audio signal (as explained in Sec. 3.2).

Figure 3.7. Supervised Audio Imputation using an N-HMM

The proposed approach is illustrated in Fig. 3.7 as a supervised approach. We will see later that the proposed system can also be used as an unsupervised approach. In this example, we use an N-HMM to learn several (five in this particular illustration) dictionaries from the training audio. Different dictionaries usually correspond to different temporal aspects of the audio signal (to account for the non-stationarity of audio).

Dictionaries are associated with states by a Markov chain that incorporates the temporal dynamics of the given audio signal. During the imputation process, the spectral components from usually one of the many dictionaries are used to reconstruct a given frame of the corrupted spectrogram.

Next I describe in detail the proposed imputation methods.

3.4.1. Estimation of incomplete data

When the spectrogram is incomplete, many of the entries in the spectrogram could be missing. In this work, we assume the locations of the corrupted bins are known.

Identifying the corrupted region is beyond the scope of this work. In many cases the locations of the corrupted bins can be treated as prior information once we know the filtering process the signals have been through. Examples include telephony bandwidth expansion, speaker enhancement, or enhancing the outputs of audio source separation using binary masks. Our objective is to estimate missing values in the magnitude spectrogram of audio signals.

In the rest of this chapter we use the following notation: we denote the observed regions of any spectrogram V as V^o. The missing regions V^m can be seen as the relative complement of V^o in V, denoted by V^m = V \ V^o. Within any magnitude spectrum V_t at time t, we represent the set of observed entries of V_t as V_t^o and the missing entries as V_t^m. F_t^o will refer to the set of frequencies for which the values of V_t are known, i.e. the set of frequencies in V_t^o. F_t^m will similarly refer to the set of frequencies for which the values of V_t are missing, i.e. the set of frequencies in V_t^m. V_t^o(f) and V_t^m(f) will refer to the magnitude at frequency f of V_t^o and V_t^m respectively.

To reconstruct the spectrogram, we first reconstruct the contribution of each dictionary P_t(f_t|q_t) using Eq. 3.1. We then reconstruct each time frame t as follows:

(3.19)   P_t(f) = \sum_{q_t} P(f_t|q_t) \gamma_t(q_t)

where \gamma_t(q_t) is the posterior distribution over the states, conditioned on all the observations over all time frames.

We compute \gamma_t(q_t) using the forward-backward algorithm [Rabiner 93], as in traditional HMMs, when performing the EM iterations. Note that in practice \gamma_t(q_t) tends to have a probability of nearly 1 for one of the dictionaries and 0 for all others, so there is usually effectively only one active dictionary per time frame. A plot of \gamma_t(q_t) can be found in Fig. 3.3 (b) of Sec. 3.2. As shown in the plot, only one dictionary (state) is active at each time frame.

Here, the resulting value P_t(f) can also be viewed as an estimate of the relative magnitude of the frequencies at frequency f and time t. However, we need to estimate the absolute magnitudes of the missing frequencies so that they are consistent with the observed frequencies. We therefore need to estimate a scaling factor for P_t(f).

We do not know the total amplitude at time t because some values are missing. In order to estimate a scaling factor, we sum the values of the uncorrupted frequencies in the original audio to get n_t^o = \sum_{f \in F_t^o} V_t(f). We sum the values of P_t(f) for f \in F_t^o to get p_t^o = \sum_{f \in F_t^o} P_t(f). The expected amplitude at time t is obtained by dividing n_t^o by p_t^o. This gives us a scaling factor.

As we discussed in Sec. 2.2 , we treat the normalized spectrogram as a distribution of frequency and time. Therefore in the sense of probabilistic modeling, Pt(f) can be thought of as a probability distribution of frequencies at each time frame.

The expected value of any missing term V_t^m(f), whose probability P_t(f) is specified by Eq. 3.19, can be estimated by:

(3.20)   E[V_t^m(f)] = \frac{n_t^o}{p_t^o} P_t(f)

Then we can update the corrupted spectrogram by:

(3.21)   \bar{V}_t(f) = \begin{cases} V_t(f) & \text{if } f \in F_t^o \\ E[V_t^m(f)] & \text{if } f \in F_t^m \end{cases}

This work focuses on reconstructing the missing magnitude values in the spectrogram, so it does not address the problem of missing phase recovery. Instead, I use the recovered magnitude spectrogram with the phase from the original uncorrupted signal to re-synthesize the time domain signal.

Standard phase recovery algorithms such as [Nawab 83, Bouvrie 06] can be used in real-life applications. However, such phase estimation algorithms will introduce extra errors when we evaluate our algorithms in the time domain. In order to accurately compare the reconstructed magnitude spectrograms of different audio imputation algorithms, we use the original phase during the experimental evaluations in Sec. 3.5. Of course, the original phase will produce more perceptually pleasing results than a standard phase recovery method [Nawab 83, Bouvrie 06].

I now describe the actual learning procedures to estimate the model parameters and the missing elements in the spectrogram. I identify two situations, stated in order of increasing levels of corruption in the audio: Sec. 3.4.2, where the spectral dictionaries are unknown but can be estimated from the corrupted audio itself during the imputation process, and Sec. 3.4.3, where the spectral component dictionaries need to be learned from training data.

3.4.2. Algorithm I

The first algorithm considers the case where the missing regions of the spectrogram have arbitrary shapes (scattered or coherent) and are spread over all possible frequencies.

The proposed algorithm can be carried out either in an unsupervised way, without prior training data, or in a supervised way, with training. However, in the case without training, we assume that for every FFT analysis frequency there are at least some uncorrupted bins available for the algorithm to learn from during the imputation process. Otherwise, the algorithm turns to training data to learn the dictionaries.

In this algorithm, the missing values in the spectrogram are initialized randomly, and then an EM algorithm is carried out to iteratively estimate the expected values of the missing elements as well as the parameters of the N-HMM. Please note that although random initialization is chosen in this work, any initialization method can be used, for example the output of another imputation method.

The procedure of the first algorithm is described in Tab. 3.4.

The algorithm in Tab. 3.4 is proposed to reconstruct the corrupted spectrogram when the missing regions spread over all the frequencies of the spectrogram. When the missing

(1) IF training data is available: learn an N-HMM from the training data, as described in Sec. 3.2. ELSE initialize the N-HMM with random values a. END
(2) Initialize the missing entries of the corrupted spectrogram V to random values. Call the new spectrogram \bar{V}.
(3) Perform the N-HMM learning on \bar{V}:
    • During the E step, calculate Eq. 3.6 – Eq. 3.14 with \bar{V}_{ft} replacing V_{ft} in Eq. 3.6.
    • During the M step, IF an N-HMM was learned in step (1), fix all parameters of the N-HMM except for the mixture weights and update the mixture weights with Eq. 3.16; ELSE update all the parameters with Eq. 3.15 – Eq. 3.18. END
    • Then update P_t(f) using Eq. 3.19.
    • Update every missing entry in the spectrogram \bar{V} with its expected value using Eq. 3.21.
    Repeat until the EM algorithm converges or the specified number of iterations is reached.
(4) Convert the estimated spectrogram to the time domain.

a There are better ways (e.g., K-means) to initialize the spectral components when the spectrogram is complete.

Table 3.4. Algorithm I for Audio Imputation

frequencies are limited to a certain range of the spectrogram and the available frequencies are also bandlimited, a second algorithm specially designed for audio bandwidth expansion is proposed next.

3.4.3. Algorithm II

In the case of bandwidth expansion, all the time frames share the same set of observed and missing frequencies. Therefore there is nothing to learn from the narrowband audio about the extra frequencies of the wideband audio. In this case, we turn to a training corpus of wideband audio to learn the spectral dictionaries first.

We use F^o to indicate the set of observed frequencies in the narrowband spectrogram, F^m the set of frequencies to be expanded to, and F all the frequencies of the wideband spectrogram.

Given the wideband audio training data, we first train the N-HMMs on the wideband spectrogram. Now we have N-HMM dictionaries P (f|z, q) that span the full frequency range of the wideband spectrogram. We call them wideband dictionaries.

Given the narrowband audio, we perform N-HMM parameter estimation on the narrowband spectrogram. With pre-trained N-HMMs, the only parameters that we need to estimate from the narrowband audio are the mixture weights. Thus we keep the dictionaries and transition matrix of the N-HMMs fixed. One issue is that the dictionaries are learned on wideband audio, but we are trying to fit them to narrowband audio. We therefore only consider the frequencies of the dictionaries that are present in the narrowband spectrogram for the purposes of mixture weight estimation.

Given the wideband dictionaries learned from the wideband spectrogram, the narrowband dictionaries are obtained by:

(3.22)   \tilde{P}(f|z, q) = \frac{P(f|z, q)}{\sum_{f \in F^o} P(f|z, q)}, \quad f \in F^o.
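A short sketch of this renormalization (Eq. 3.22) is given below; the array layout follows the earlier illustrative sketches and is an assumption, not part of the original implementation.

```python
import numpy as np

def narrowband_dictionaries(P_f_given_zq, narrowband_f, eps=1e-12):
    """Restrict wideband dictionaries to the observed band and renormalize (Eq. 3.22).

    P_f_given_zq : (F, Z, Q) wideband spectral components P(f | z, q)
    narrowband_f : (F,) boolean mask of frequencies present in the narrowband audio (F^o)
    """
    P_nb = P_f_given_zq[narrowband_f]                        # keep only observed frequencies
    return P_nb / (P_nb.sum(axis=0, keepdims=True) + eps)    # renormalize over f in F^o
```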

Once the mixture weights are estimated, we can reconstruct the wideband spectro- gram from the narrowband spectrogram using the wideband dictionaries and the learned mixture weights.

The second algorithm is described in Tab. 3.5.

Let F^o and F denote the sets of narrowband and wideband frequencies.
(1) Learn an N-HMM model from the wideband audio. We now have a set of spectral components P(f|z, q), f ∈ F, for the wideband spectrogram. Call P(f|z, q) the wideband dictionaries.
(2) Given the narrowband spectrogram, construct the narrowband dictionaries \tilde{P}(f|z, q) by keeping only the frequencies of the wideband dictionaries that are present in the narrowband spectrogram, as described in Eq. 3.22.
(3) Perform N-HMM parameter estimation on the narrowband spectrogram. During the learning process,
    • Perform N-HMM learning using the narrowband dictionaries \tilde{P}(f|z, q).
    • During the E step, calculate Eq. 3.6 – Eq. 3.14.
    • During the M step, only update the mixture weights P_t(z_t|q_t) with Eq. 3.16.
(4) Calculate P_t(f) using Eq. 3.19 with the wideband dictionaries P(f|z, q) and the mixture weights estimated from the narrowband spectrogram.
(5) Reconstruct the wideband audio spectrogram using Eq. 3.21.
(6) Convert the estimated spectrogram to the time domain.

Table 3.5. Algorithm II for Audio Bandwidth Expansion

3.5. Experiment

In this section we will evaluate the algorithms described in Sec. 3.4 with real-world music audio examples and quantitative experiments. In our experiments, the examples are all taken from real-world music songs, most of which have complex musical sources with multiple, concurrent spectral patterns.

All songs are single channel PCM data with a 44.1 kHz sample rate and 32-bit quantization. An FFT size of 1024 points and a hop size of 256 points are used to obtain the time-frequency values. The proposed methods are compared to a recent audio imputation method using PLCA [Smaragdis 11], which is the current state-of-the-art for audio imputation.

3.5.1. Illustrative examples

We first evaluate the proposed algorithm I described in Tab. 3.4 . The original audio is a 3-second clip from “Bad Day” by Daniel Powter with male singing voice and piano accompaniment. The missing data is evenly and randomly distributed across the input.

For this test a smoothed random binary mask was applied to the original sound to remove about 50% of the spectrogram elements.

We modeled the data with an N-HMM with 10 dictionaries and 10 spectral components per dictionary, totaling 100 spectral components. As a comparator, a PLCA model with 100 spectral components was also used to reconstruct the spectrogram. The only information available to both methods is the corrupted input data itself. No training was conducted in this case.

The results of this experiment are shown in Fig. 3.8. We note that PLCA results in a reconstruction with grainier energy in the missing parts, especially in the high frequency region, whereas the proposed model results in a smoother output, with some of the harmonic structure recovered in the missing region.

Figure 3.8. Example reconstruction of a music signal with a binary mask occluding roughly 50% of the samples. The first plot shows the original signal, the second plot shows the masked input we used for the reconstruction, the third plot shows the reconstruction using PLCA and the fourth one shows the reconstruction using our model.


In the next example we present an even more challenging case where about 60% of the original spectrogram elements were randomly removed with a smoothed random binary mask. The input audio is an 11-second jazzy blues recording of the song “Cry Me A River” by Justin Timberlake. The audio consists of the sound of “rain”, an electric guitar throughout, and a male singing voice entering the song near the end.

Again, we modeled the data with an N-HMM with 10 dictionaries and 10 spectral com- ponents per dictionary. As a comparator, a PLCA with 100 spectral components was also used to reconstruct the spectrogram. The only information available to both methods is the corrupted spectrogram itself.

The results of this experiment are shown in Fig. 3.9. The proposed algorithm has again produced a better reconstruction than PLCA. More specifically, PLCA has added additional energy in the mid-frequency band before the human singing voice enters the song. As we examine the spectrogram more closely, we can further see that the distribution of the temporal energy of the reconstruction produced by PLCA is different from the original spectrogram. In contrast, the proposed method produced a smoother reconstruction with respect to the temporal dynamics.

Figure 3.9. Example reconstruction of a music signal with a binary mask occluding roughly 60% of the samples. The first plot shows the original signal, the second plot shows the masked input we used for the reconstruction, the third plot shows the reconstruction using PLCA and the fourth one shows the reconstruction using our model.


ID   Name                  Artist                   Duration (s)
1    Free Falling          Tom Petty                6.08
2    Dream On              Aerosmith                7.00
3    No Woman No Cry       Bob Marley               6.25
4    Cry Me A River        Justin Timberlake        11.2
5    I Shot The Sheriff    Bob Marley               4.60
6    Dangerous             Kardinal Offishall       6.00
7    Viva La Vida          Coldplay                 5.00
8    One Step At A Time    Jordin Sparks            5.05
9    Scar Tissue           Red Hot Chili Peppers    5.40
10   Against The Wind      The Original Masters     5.55
11   Born To Be Wild       Steppenwolf              5.45
12   Other Side            Red Hot Chili Peppers    4.95
13   Adams Song            Blink-182                5.10
14   Remedy                Seether                  5.55

Table 3.6. Audio excerpt dataset used for evaluations

3.5.2. Quantitative Evaluation

In this section I show the quantitative results of the proposed methods. A test dataset consisting of 14 real-world music excerpts was compiled. Tab. 3.6 lists the details of the dataset. The excerpts were selected so as to cover different music styles, such as pop, rock, jazz and heavy metal, as well as different instrumentation and dynamics.

Examples of sound sources include male and female singing voice, electronic piano and guitar, percussive instruments such as drums and cymbals, environmental sound like the rain, and more.

All songs are single channel PCM data with a 44.1 kHz sample rate and 32-bit quantization. The average length of the recordings is about 6 seconds. An FFT size of 1024 points and a hop size of 256 points are used in this experiment.

Two experiments are designed to evaluate Algorithms I and II separately. In both cases, the proposed method is compared to the audio imputation method using PLCA. We modeled the data with an N-HMM with 10 dictionaries and 10 components per dictionary.

As a comparator, a PLCA with 100 spectral components was also used to reconstruct the spectrogram.

As shown in [Mysore 10], there is no universal standard for model selection for the N-HMM and PLCA. Model selection should be determined with respect to the application. The numbers of dictionaries and spectral components for both the N-HMM and PLCA were determined empirically by the author.

For the N-HMM, I limit the number of spectral vectors per dictionary to 10 so that each dictionary only models the predominant components of the audio. In this way, it is less likely for extra noise to be introduced into the reconstructed audio.

We limit the number of dictionaries for the N-HMM to 10 because we expect the input spectrogram to have roughly 10 different temporal parts. A more careful choice of the number of dictionaries should depend on the specific properties of the audio of interest. However, that would mean each of the excerpts in our dataset would be modeled using a different number of dictionaries, which is quite time consuming. Furthermore, this would raise the issue of overfitting for the N-HMM. Thus the number of dictionaries is fixed for all excerpts in this section. In Chap. 4, I will further show how the number of dictionaries can be determined appropriately according to the characteristics of the audio signal itself.

Based on the above reasons, we choose 10 dictionaries and 10 spectral components per dictionary for the N-HMM for all audio excerpts in the dataset. This choice is also a good compromise between performance and speed of the N-HMM for the musical data used in this experiment. As a comparator, 100 spectral components are used to give PLCA the same amount of modeling power as the N-HMM has with 10 dictionaries and 10 spectral components per dictionary.

Signal-to-Noise-Ratio (SNR) is used to measure the outputs of both imputation methods:

(3.23)   SNR = 10 \log_{10} \frac{\sum_t s(t)^2}{\sum_t (\bar{s}(t) - s(t))^2}

where s(t) and \bar{s}(t) are the original and the corrupted/reconstructed signals respectively.

Notice that SNR here measures the “signal to difference (between original and reconstructed)” ratio. The higher the number is, the closer the reconstructed signal is to the original one. We call it “SNR” because that is the name widely used in signal processing.
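A one-line helper for this measure, with illustrative names, could look as follows.

```python
import numpy as np

def snr_db(s, s_hat):
    """SNR of Eq. 3.23: ratio of signal energy to the energy of the reconstruction error (in dB)."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s_hat - s) ** 2))
```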

To evaluate the performance of Algorithm I, we randomly removed 60% of the audio spec- trogram elements with a smoothed random binary mask. The masked audio spectrogram is the corrupted input to both the proposed algorithm and PLCA. Both algorithms are evaluated in an unsupervised way, so the only information available to both algorithms is the corrupted input data itself.

Table 3.7 presents the performance of PLCA and the proposed Algorithm I on 14 clips of real-world music recordings using the SNR measurement. Both methods have significantly improved the SNR of the input audio. PLCA does slightly better than Algorithm I by about 0.39 dB, which is not statistically significant (this will be further discussed later in this section).

We perform bandwidth expansion with Algorithm II on the same dataset of 14 recordings. A bandwidth expansion method using PLCA [Smaragdis 07a] is implemented as a

SNR (dB)
Song ID   Input   PLCA   Algorithm I
1         1.58    6.19   5.14
2         2.76    8.88   8.00
3         2.55    4.81   4.53
4         2.17    6.29   5.99
5         2.16    2.35   3.85
6         2.80    3.43   4.82
7         2.27    8.02   6.85
8         2.78    3.97   4.49
9         2.24    3.75   3.85
10        2.26    4.72   4.13
11        2.91    8.92   8.18
12        1.00    8.29   5.90
13        2.21    9.01   8.44
14        3.26    6.88   5.84
Average   2.35    6.11   5.72

Table 3.7. Performance of the audio imputation results of the proposed Algorithm I and PLCA. There is no statistically significant difference between the two methods at a significance level of 0.05 (p-value 0.76).

comparator. The corrupted audio is obtained from the original audio by removing all the frequencies between 860 Hz and 6000 Hz in the spectrogram. For each corrupted audio recording in our dataset, another clip (that does not contain the test audio) of about 12 seconds is taken from the same song as the training data.

We learn the N-HMM parameters for each song from the training data. Specifically, we learned 10 dictionaries and 10 spectral components per dictionary, as well as the transition matrix from the training data. We then learn new mixture weights from the narrowband audio as described in Tab. 3.5 . When using PLCA, we learn one dictionary of 100 spectral components. 116

SNR (dB)
Song ID   Input   PLCA    Algorithm II
1         4.57    8.11    8.72
2         -0.43   2.23    4.20
3         13.2    14.57   15.28
4         3.20    7.17    7.81
5         6.95    8.54    9.69
6         10.99   19.87   21.15
7         7.32    10.57   11.28
8         2.43    4.81    9.69
9         7.44    12.91   9.69
10        12.05   16.97   17.41
11        1.02    2.83    5.27
12        7.65    13.53   14.67
13        1.53    5.20    8.39
14        5.87    8.66    8.87
Average   5.99    9.71    10.86

Table 3.8. Performance of the audio bandwidth expansion results of the proposed Algorithm II and PLCA. The difference between the two methods is statistically significant at a significance level of 0.05 (p-value 0.01).

Table 3.8 presents the performance of PLCA and the proposed Algorithm II. The average performance of the proposed method is 10.86 dB SNR, improving 4.87 dB over the narrowband audio and 1.15 dB over the output of PLCA. The proposed method has a better SNR measurement than PLCA on 13 out of 14 song recordings. This improvement is statistically significant.

In order to draw statistically significant conclusions about the results of the different methods, we perform paired, two-sided statistical tests on the results obtained by the proposed algorithms and PLCA. The null hypothesis is that the difference between the matched SNR scores obtained by PLCA and Algorithm I (or II) comes from a distribution whose median is zero. We use a significance level α = 0.05, which is commonly used in statistical significance testing. When the null hypothesis is rejected, the result is said to be statistically significant.
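A paired, two-sided test of a zero-median null is consistent with a Wilcoxon signed-rank test; the snippet below assumes that choice (the text does not name the specific test) and uses as input only the first five rows of Tab. 3.7 for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-song SNR scores for the two methods (first five rows of Tab. 3.7, for illustration only).
snr_plca = np.array([6.19, 8.88, 4.81, 6.29, 2.35])
snr_algo1 = np.array([5.14, 8.00, 4.53, 5.99, 3.85])

stat, p_value = wilcoxon(snr_plca, snr_algo1, alternative='two-sided')
print(f"p = {p_value:.3f}: reject the null at alpha = 0.05" if p_value < 0.05
      else f"p = {p_value:.3f}: no significant difference at alpha = 0.05")
```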

For the results in Tab. 3.7, the null hypothesis is accepted with a p-value of 0.76, which means that PLCA and the proposed Algorithm I produce statistically indistinguishable results when both algorithms are run in an unsupervised way. For the results in Tab. 3.8, the null hypothesis is rejected with a p-value of 0.01. This indicates that the proposed Algorithm II has produced significantly better results than PLCA.

3.6. Contribution and Conclusion

In this chapter I present an approach that allows us to estimate the missing values in the time-frequency representation of audio signals. The contributions of this chapter are as follows. The proposed audio imputation algorithm, based on the Non-negative Hidden Markov Model (N-HMM), enables us to learn the spectral information as well as the temporal information of the audio signal. Compared to existing work using non-negative spectrogram factorization, the proposed approach takes into consideration the non-stationarity and temporal dynamics of the audio signal. The proposed system can benefit many important applications, such as speech recognition in noisy environments, sound restoration and enhancement, and improvement of audio source separation results.

I propose two algorithms, one for general imputation of arbitrary missing values in the spectrogram, the other for audio bandwidth expansion. Experimental results showed that the proposed algorithms are quite effective in reconstructing missing values from a highly corrupted audio spectrogram. The proposed methods have shown some advantages over performing imputation using PLCA [Smaragdis 11, Smaragdis 07a]. As illustrated with the real-world music examples, the proposed audio imputation algorithm is able to learn better temporal dynamics for the reconstruction and introduces fewer artifacts in the missing frequency regions.

Quantitative experiments also showed that the proposed algorithm is comparable to a state-of-the-art method using PLCA when performing audio imputation of random missing regions. Furthermore, the proposed algorithm outperformed PLCA with statistical significance when performing bandwidth expansion with data available for training.

Future work to improve the methods described in this dissertation includes developing techniques for automatically identifying the missing or corrupted regions in the audio spectrogram and better missing phase recovery techniques.

CHAPTER 4

Language Informed Audio Bandwidth Expansion

Audio Bandwidth Expansion (BWE) [Larsen 04] refers to methods that increase the frequency bandwidth of narrowband audio signals. Such frequency expansion is desirable if at some point the bandwidth of the signal has been reduced, as can happen during signal recording, transmission, storage, or reproduction. By expanding the bandwidth of audio signals from narrowband to wideband, a more natural sounding audio reproduction and an increased intelligibility can be achieved.

A typical application of BWE is telephone speech enhancement [Jax 02]. The degra- dation of speech quality is caused by the bandlimiting filters with a passband from ap- proximately 300 Hz to 3400 Hz, due to the use of analogue frequency-division multiplex transmission. Other applications include bass enhancement on small loudspeakers and high-quality reproduction of historical recordings.

In this chapter, I will limit the discussion to speech signals, since this is the case that usually incurs the bandwidth limitation problem.

Effective bandwidth expansion methods for speech signals are very necessary. In certain situations we clearly become aware of the impact of bandwidth limitation. For example, the limited intelligibility of syllables becomes apparent when we try to understand unknown words or names on the phone, especially when distinguishing between certain unvoiced or plosive utterances, such as /s/ and /f/ or /p/ and /t/, because the differences between these sounds are only apparent in the energy above 3400 Hz.

Another drawback caused by band-limited speech is that many speaker-specific charac- teristics are not retained transparently in the narrowband speech signal. Therefore, it is sometimes difficult to distinguish on the phone a mother from her daughter.

Most BWE methods are based on the source-filter model of speech production [Jax 02].

Such methods generate an excitation signal and modify it with an estimated spectral envelope that simulates the characteristics of the vocal tract. The main focus has been on the spectral envelope estimation. Classical techniques for spectral envelope estimation include codebook mapping [Enbom 99], Gaussian mixture models (GMM) [Park 00], hidden Markov models (HMM) [Bauer 09], and neural networks [Pulakka 11a]. How- ever, these methods need to be trained on parallel wideband and narrowband corpora to learn a specific mapping between narrowband features and wideband spectral envelopes.

Thus, a system trained on telephony and wideband speech cannot be readily applied to expand the bandwidth of a low-quality loudspeaker.

Another way to estimate the missing frequency bands is based on directly modeling the au- dio signal by learning a dictionary of spectral vectors that explains the audio spectrogram.

By directly modeling the audio spectrogram, BWE can be framed as a missing data impu- tation problem. Such methods only need to be trained once on wideband corpora. Once the system is trained, it can be used to expand any missing frequencies of narrowband signals, despite never having been trained on the mapping between the narrowband and wideband corpus. To the best of our knowledge, the only existing work based on directly modeling the audio is [Smaragdis 07a] using non-negative spectrogram factorization.

The use of high-level knowledge about language is very common in human audition.

The human auditory scene induction ability relies heavily on high-level knowledge of speech [Warren 70, Repp 92]. In the well-known “phonemic restoration illusion” experiment [Warren 70] (described in detail in Chap. 1), “auditory induction” was illustrated by the fact that listeners believe they hear the deleted phonemes masked by an extraneous sound. A further study by [Repp 92] showed that our auditory system does not segregate the restored speech from the extraneous sound, but instead uses an abstract phonological/phonetic representation (top-down knowledge about speech) that is activated in the process of word recognition to make up the missing phonemes “in the mind’s ear”. Since humans use this high-level information to complete corrupted speech, it is reasonable for machine systems to employ the same kind of information to perform audio bandwidth expansion (and, more generally, imputation).

In this chapter, I will show that the performance of BWE can be improved by introducing speech recognition machinery that relies on the high-level knowledge of language. Specif- ically, if it is known that the given speech conforms to certain syntactic constraints, this high level information could be useful to constrain the model. In automatic speech recog- nition (ASR), such constraints are typically enforced in the form of a language model

(constrained sequences of words) [Rabiner 93]. It has more recently been applied to source separation [Mysore 12]. However, we are not aware of any existing BWE meth- ods that explicitly explore syntactic knowledge about speech.

Note that there has recently been an approach [Bauer 09] that used language information to improve the performance of source-filter models for BWE. However, this approach requires an a-priori transcription of the given speech. In contrast, our technique does not require any information about the content of the specific instance of speech, but rather uses syntactic constraints in the form of a language model.

In the remainder of this chapter, I first give an overview of existing work on audio bandwidth extension in Sec. 4.1. An overview of the proposed system is given in Sec. 4.2. I then discuss how to train a speaker-level N-HMM that incorporates high-level acoustic knowledge in the form of word models in Sec. 4.3 and syntactic knowledge in the form of language models in Sec. 4.4. The detailed description of the proposed system is presented in Sec. 4.5. Illustrative examples and quantitative experimental results are presented in Sec. 4.6. I summarize this chapter and point out future directions in Sec. 4.7.

4.1. Related work

There exists a sizeable literature on the topic of audio bandwidth expansion. In this chapter I limit the discussion to only speech bandwidth extension. A more comprehensive coverage of BWE techniques can be found in [Larsen 04].

Bandwidth extension can be realized by employing a statistical estimation scheme.

Thereby, certain “features” are extracted from the narrowband speech signal. These features allow, in conjunction with the statistical model, identification of the parameters of a wideband speech production model. These model parameters are typically spectral (or also temporal) envelopes of the speech signal. The respective speech fine structure can either be reproduced from the narrowband signal or completely synthesized.

Most of the bandwidth extension methods are based on the source-filter model of speech production. The procedure of source-filter model based BWE is illustrated in Fig. 4.1. Such methods usually decompose the narrowband signal into a spectral envelope and a residual signal. An excitation signal is then generated from the residual signal, which corresponds to a spectrally flat excitation produced by a source model. The excitation is modified with an estimated wideband spectral envelope to produce the source-filtered signal. The spectral envelope here corresponds to a filter that simulates the spectral shaping characteristics of the vocal tract. The artificially generated highband and/or lowband signal is then combined with the original narrowband signal to form a speech signal with an extended bandwidth.

Figure 4.1. Procedure of the source-filter based BWE method illustrated in the frequency domain (adapted from [Kornagel 02]).

Commonly used techniques for generating a wideband excitation signal (source model) include spectral folding, spectral translation, and nonlinear processing of an excitation derived from the narrowband signal [Jax 02]. Alternatively, sinusoidal synthesis [Iser 08] or modulated noise [Unno 05] can be used.

In [Jax 03], it was shown that the quality of the estimated wideband spectral envelope is far more important for the subjective quality of the extended speech signal than the extension of the excitation signal. Therefore, most BWE methods concentrate on the estimation of the wideband spectral envelope.

In [Kornagel 02], the spectral wideband envelope is selected from a pre-trained code- book based on a two-step classification scheme. In [Jax 03], the wideband spectral envelope is estimated using the pre-trained hidden Markov model (HMM) which takes into account several features of the band-limited speech. Other commonly used wide- band spectral envelope estimation methods include Gaussian mixture model (GMM)

[Park 00, Pulakka 11b] and neural networks [Kontio 07].

In particular, [Pulakka 11b] describes a BWE method that combines a filter bank technique with GMM-based estimation of the highband mel-spectrum. The GMM predictor used in this work is based on the reconstruction method proposed in [R. 04] for missing data imputation in automatic speech recognition (ASR). However, the GMM limits the expressive power of this method.

The source-filter based methods usually need to be trained on parallel wideband and nar- rowband corpora to learn a specific mapping between narrowband features and wideband spectral envelopes. This limits these methods from broader usage. A system trained on telephony and wideband speech cannot be readily applied to expand the bandwidth of a low-quality loudspeaker.

Another way to estimate the missing frequency bands is based on directly modeling the au- dio signal by learning a dictionary of spectral vectors that explains the audio spectrogram.

By directly modeling the audio spectrogram, BWE can be framed as a missing data impu- tation problem. Such methods only need to be trained once on wideband corpora. Once the system is trained, it can be used to expand any missing frequencies of narrowband signals, despite never having been trained on the mapping between the narrowband and wideband corpus. To the best of our knowledge, the only existing work based on directly modeling the audio is [Smaragdis 07a] using non-negative spectrogram factorization.

The proposed approach [Han 12b] belongs to the second category of BWE methods. Thus, once trained on the wideband corpus, our approach does not need to see the narrowband speech before the BWE process, which differentiates it from most of the methods using source-filter models.

Furthermore, the proposed algorithm differs from [Smaragdis 07a] (and most of the source-filter based methods) by explicitly modeling the high-level knowledge of language for bandwidth extension. The syntactic knowledge of speech is incorporated in the form of language models, which enables high quality reconstruction of the missing frequency bands.

4.2. System Overview

Figure 4.2. Block diagram of the proposed system. Our current imple- mentation includes modules with solid lines. Modules with dashed lines indicate possible extensions in order to make the system more feasible for large vocabulary BWE.

A block diagram of the proposed system is shown in Fig. 4.2. The goal is to learn an N-HMM (described in detail in Sec. 3.2) for each speaker from training data of that speaker and syntactic knowledge common to all speakers (in the form of a language model). We construct each speaker-level N-HMM in two steps. We first learn an N-HMM for each word in the vocabulary, as detailed in Sec. 4.3. We then build a speaker-level model by concatenating all the word models together according to the word transitions specified by the language model, as elaborated in Sec. 4.4. Given the narrowband speech, the learned speaker-level N-HMM can be utilized to perform bandwidth expansion by estimating the missing frequencies, framed as an audio spectrogram imputation problem. This is described in Sec. 4.5.

Learning a word model for each word in a vocabulary is suitable for small vocabularies.

However, it is not likely to be feasible for larger vocabularies. In this dissertation we are simply establishing that the use of a language model does improve BWE, rather than selecting the most scalable modeling strategy for large vocabulary situations. This work can be extended to use subword models such as phonelike units (PLUs) [Rabiner 93], which have been quite successful in ASR. In Fig. 4.2 , we illustrate these extensions using dashed lines.

4.3. Word Models

For each word in our vocabulary, we learn the parameters of an N-HMM from multiple instances (recordings) of that word as routinely done with HMMs in small vocabulary speech recognition [Rabiner 93]. The N-HMM parameters are learned using the EM algorithm described in Tab. 3.3 .

Let V^{(k)}, k = 1, · · · , N, be the k-th spectrogram instance of a given word. We compute the E step of the EM algorithm separately for each instance using Eq. 3.6 – Eq. 3.14. This gives us the marginalized posterior distributions P_t^{(k)}(z_t, q_t | f_t, \bar{f}, \bar{v}) and P_t^{(k)}(q_t, q_{t+1} | \bar{f}, \bar{v}) for each word instance k. Here, \bar{f} = {f_{t,v} | t = 1, · · · , T, v = 1, · · · , v_t} denotes the sequence of all of the draws of frequencies over all time frames, which is the whole spectrogram. \bar{v} = {v_t | t = 1, · · · , T} denotes the total number of draws over all time frames (the sum of the magnitude of the spectrum at each time frame).

We use these marginalized posterior distributions in the M step of the EM algorithm. Specifically, we learn a single N-HMM to account for all instances of a given word, except for the mixture weights: we compute a separate weights distribution for each word instance k. We use the k-th spectrogram instance V^{(k)} to replace V in Eq. 3.16 as follows:

(4.1)   P_t^{(k)}(z_t | q_t) = \frac{\sum_{f_t} V_{ft}^{(k)} P_t^{(k)}(z_t, q_t | f_t, \bar{f}, \bar{v})}{\sum_{z_t} \sum_{f_t} V_{ft}^{(k)} P_t^{(k)}(z_t, q_t | f_t, \bar{f}, \bar{v})}

where V_{ft}^{(k)} is the magnitude (at time t and frequency f) of the spectrogram V^{(k)} of word instance k.

For all instances of a given word, we estimate a single set of dictionaries of spectral components using the marginalized posterior distributions of all instances of a given word as follows:

(4.2)   P(f | z, q) = \frac{\sum_k \sum_t V_{ft}^{(k)} P_t^{(k)}(z, q | f, \bar{f}, \bar{v})}{\sum_f \sum_k \sum_t V_{ft}^{(k)} P_t^{(k)}(z, q | f, \bar{f}, \bar{v})}

We estimate a single transition matrix using the marginalized posterior distributions of all instances of a given word as follows:

(4.3)   P(q_{t+1} | q_t) = \frac{\sum_k \sum_{t=1}^{T-1} P_t^{(k)}(q_t, q_{t+1} | \bar{f}, \bar{v})}{\sum_{q_{t+1}} \sum_k \sum_{t=1}^{T-1} P_t^{(k)}(q_t, q_{t+1} | \bar{f}, \bar{v})}
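The sketch below illustrates how the weighted posteriors of all instances of one word can be pooled for the shared dictionary (Eq. 4.2) and transition matrix (Eq. 4.3), while per-instance mixture weights (Eq. 4.1) are kept separate. It assumes the per-instance posteriors have already been computed, with the same illustrative array shapes used in the Chapter 3 sketches.

```python
import numpy as np

def pooled_word_updates(V_list, post_list, xi_list, eps=1e-12):
    """Shared dictionary and transition updates over all instances of one word.

    V_list    : list of (F, T_k) spectrograms, one per word instance k
    post_list : list of (T_k, F, Z, Q) posteriors P_t^(k)(z, q | f, ...)
    xi_list   : list of (T_k - 1, Q, Q) posteriors P_t^(k)(q_t, q_{t+1} | ...)
    """
    dict_num, trans_num = None, None
    per_instance_weights = []
    for V, post, xi in zip(V_list, post_list, xi_list):
        weighted = V.T[:, :, None, None] * post           # V^(k)_{ft} * posterior
        acc = weighted.sum(axis=0)                        # (F, Z, Q), instance contribution
        dict_num = acc if dict_num is None else dict_num + acc        # Eq. 4.2 numerator
        w_k = weighted.sum(axis=1)                        # (T_k, Z, Q)
        per_instance_weights.append(w_k / (w_k.sum(axis=1, keepdims=True) + eps))  # Eq. 4.1
        t_acc = xi.sum(axis=0)                            # (Q, Q)
        trans_num = t_acc if trans_num is None else trans_num + t_acc # Eq. 4.3 numerator
    P_f_given_zq = dict_num / (dict_num.sum(axis=0, keepdims=True) + eps)
    P_trans = trans_num / (trans_num.sum(axis=1, keepdims=True) + eps)
    return P_f_given_zq, P_trans, per_instance_weights
```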

The remaining parameters are estimated as described in Tab. 3.3.

Once we have learned the set of dictionaries and the transition matrix for each word of a given speaker, we need to combine them into a single speaker-dependent N-HMM.

4.4. Speaker Level Model

A large vocabulary speech recognition system is critically dependent on the linguistic knowledge embedded in the speech. Incorporation of knowledge of the language, in the form of a language model, has been shown to be essential for the performance of speech recognition systems [Rabiner 93].

The goal of the language model is to provide an estimate of the probability of a word sequence W for a given task. If we assume that W is a specified sequence of Q words, i.e.,

(4.4)   W = w_1 w_2 \ldots w_Q,

then P(W) can be computed as:

(4.5)   P(W) = P(w_1 w_2 \ldots w_Q) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) \ldots P(w_Q | w_1 w_2 \ldots w_{Q-1}).

N-gram word models are used to approximate the term P(w_j | w_1 \ldots w_{j-1}) as:

(4.6)   P(w_j | w_1 \ldots w_{j-1}) \approx P(w_j | w_{j-N+1} \ldots w_{j-1})

i.e., based only on the preceding N − 1 words.

The conditional probabilities P(w_j | w_{j-N+1} \ldots w_{j-1}) can be estimated by the relative frequency approach:

(4.7)   \hat{P}(w_j | w_{j-N+1} \ldots w_{j-1}) = \frac{R(w_j, w_{j-1}, \ldots, w_{j-N+1})}{R(w_{j-1}, \ldots, w_{j-N+1})}

where R(·) is the number of occurrences of the string in its argument in the given training corpus.

In practice, N = 2 (bigram) or N = 3 (trigram) is usually used for reliable estimation of Eq. 4.7.
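A minimal bigram (N = 2) estimator following the relative-frequency rule of Eq. 4.7 is sketched below; the toy corpus and function names are purely illustrative and are not the corpus used in this thesis.

```python
from collections import Counter

def bigram_model(sentences):
    """Relative-frequency bigram estimates of Eq. 4.7 (N = 2).

    sentences : list of word lists
    Returns P(w_j | w_{j-1}) as a nested dict.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words[:-1])                   # histories R(w_{j-1})
        bigrams.update(zip(words[:-1], words[1:]))    # pair counts R(w_j, w_{j-1})
    P = {}
    for (w_prev, w), count in bigrams.items():
        P.setdefault(w_prev, {})[w] = count / unigrams[w_prev]
    return P

# Toy usage with a made-up two-sentence corpus.
model = bigram_model([["lay", "blue", "at", "c", "two", "now"],
                      ["lay", "red", "at", "c", "two", "now"]])
print(model["lay"])   # {'blue': 0.5, 'red': 0.5}
```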

In an N-HMM, we learn a Markov chain that explains the temporal dynamics between the dictionaries. Each dictionary corresponds to a state in the N-HMM. Since we use an HMM structure, we can readily use the idea of a language model to constrain the Markov chain to explain a valid grammar.

Once we learn an N-HMM for each word of a given speaker, we combine them into a single speaker-dependent N-HMM according to the language model. We do this by constructing a large transition matrix that consists of each individual word transition matrix. The transition matrix within each individual word stays the same as specified in Eq. 4.3. However, the language model dictates the transitions between words.
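One way to picture this construction is sketched below: the per-word transition matrices occupy diagonal blocks, and between-word transitions follow the bigram language model. The block layout, the routing of word-final states to word-initial states, and the exit_prob parameter are illustrative assumptions, not the exact construction used in the proposed system.

```python
import numpy as np

def speaker_transition_matrix(word_trans, bigram, word_order, exit_prob=0.1, eps=1e-12):
    """Build a block transition matrix from per-word N-HMM transition matrices.

    word_trans : dict word -> (Q_w, Q_w) within-word transition matrix (Eq. 4.3)
    bigram     : dict word -> {next_word: P(next_word | word)} language model
    word_order : list of words fixing the block layout
    exit_prob  : probability of leaving a word from its final state (an assumption)
    """
    sizes = [word_trans[w].shape[0] for w in word_order]
    offsets = dict(zip(word_order, np.cumsum([0] + sizes[:-1])))
    N = sum(sizes)
    A = np.zeros((N, N))
    for w in word_order:
        o, Qw = offsets[w], word_trans[w].shape[0]
        A[o:o + Qw, o:o + Qw] = word_trans[w]            # within-word dynamics
        last = o + Qw - 1
        A[last] *= (1.0 - exit_prob)                     # leave some mass for word exits
        for w_next, p in bigram.get(w, {}).items():      # language-model word transitions
            A[last, offsets[w_next]] += exit_prob * p    # enter the first state of the next word
    return A / (A.sum(axis=1, keepdims=True) + eps)      # keep rows stochastic
```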

In this dissertation, the syntax to which every sentence in the corpus conforms is provided in [Cooke 10]. However, when this is not the case, one can learn the language model as described above.

4.5. Estimation of incomplete data

So far, we have shown how to learn a speaker-level N-HMM that combines the acoustic knowledge of each word, and syntactic knowledge in the form of language model, from the wideband speech.

With respect to wideband speech, we can consider narrowband speech as incomplete data since certain frequency bands are missing. We generally know which frequency bands are missing and consequently which entries of the spectrogram of narrowband speech are missing. Our objective is to estimate these entries. Intuitively, once we have a speaker-level N-HMM, we estimate the mixture weights for the spectral components of each dictionary, as well as the expected values for the missing entries of the spectrogram.

The proposed language model informed BWE method is an N-HMM based imputation technique for the estimation of missing frequencies in the spectrogram, and it works in a similar way to the BWE Algorithm II described in Chap. 3. The main difference between the language informed BWE algorithm and Algorithm II proposed in the last chapter is how we train the system. To keep the algorithm proposed in this chapter self-contained, I repeat some of the procedures that were described in the last chapter.

In this chapter, we use the same notation as in Chap. 3. We denote the observed regions of a spectrogram V as V^o and the missing regions as V^m = V \ V^o. Within a magnitude spectrum V_t at time t, we represent the set of observed entries as V_t^o and the missing entries as V_t^m. F_t^o will refer to the set of frequencies for which the values of V_t are known, i.e. the set of frequencies in V_t^o. F_t^m will similarly refer to the set of frequencies for which the values of V_t are missing, i.e. the set of frequencies in V_t^m. V_t^o(f) and V_t^m(f) will refer to specific frequency entries of V_t^o and V_t^m respectively. For narrowband telephone speech, we set F_t^o = {f | 300 ≤ f ≤ 3400} and F_t^m = {f | f < 300 or f > 3400} for all t.

In this method, we perform N-HMM parameter estimation on the narrowband spectrogram. However, the only parameters that we estimate are the mixture weights. We keep the dictionaries and transition matrix of the speaker-level N-HMM fixed. One issue is that the dictionaries are learned on wideband speech (Sec. 4.3) but we are trying to fit them to narrowband speech. We therefore only consider the frequencies of the dictionaries that are present in the narrowband spectrogram, F_t^o, for the purposes of mixture weight estimation. However, once we estimate the mixture weights, we reconstruct the wideband spectrogram using all of the frequencies of the dictionaries.

The resulting value P_t(f) in Eq. 3.19 (the counterpart for PLCA is Eq. 2.2) can be viewed as an estimate of the relative magnitude of the frequencies at time t. However, we need estimates of the absolute magnitudes of the missing frequencies so that they are consistent with the observed frequencies. We therefore need to estimate a scaling factor for P_t(f). In order to do this, we sum the values of the uncorrupted frequencies in the original audio to get n_t^o = \sum_{f \in F_t^o} V_t(f). We then sum the values of P_t(f) for f \in F_t^o to get p_t^o = \sum_{f \in F_t^o} P_t(f). The expected magnitude at time t is obtained by dividing n_t^o by p_t^o, which gives us a scaling factor.

The expected value of any missing term V_t^m(f) can then be estimated by:

(4.8)   E[V_t^m(f)] = \frac{n_t^o}{p_t^o} P_t(f)

Then we can update the corrupted spectrogram by:

(4.9)   \bar{V}_t(f) = \begin{cases} V_t(f) & \text{if } f \in F_t^o \\ E[V_t^m(f)] & \text{if } f \in F_t^m \end{cases}

The audio BWE process is summarized in Tab. 4.1.

Let F^o and F denote the sets of narrowband and wideband frequencies.

(1) Learn an N-HMM word model for each word in the training data set using the EM algorithm, as described in Sec. 4.3, from the wideband speech corpus. We now have a set of dictionaries of spectral components P(f|z, q), f ∈ F. Each of the dictionaries corresponds roughly to a phoneme in the training data. We call these wideband dictionaries.

(2) Combine the word models into one single speaker dependent N-HMM model, dictated by a language model as described in Sec. 4.4.

(3) Given the narrowband spectrogram, construct the narrowband dictionaries P̃(f|z, q) by keeping only the frequencies of the wideband dictionaries that are present in the narrowband spectrogram, as described in Eq. 3.22.

(4) Perform N-HMM parameter estimation on the narrowband spectrogram. During the learning process:
    • Perform N-HMM learning using P̃(f|z, q).
    • During the E step, calculate Eq. 3.6 – Eq. 3.14.
    • During the M step, only update the mixture weights P_t(z_t|q_t) using Eq. 3.16.

(5) Calculate P_t(f) using Eq. 3.19 with the wideband dictionaries learned in step (1) from the wideband spectrogram, and the mixture weights estimated in step (4) from the narrowband spectrogram.

(6) Reconstruct the corrupted audio spectrogram using Eq. 4.9.

(7) Convert the estimated spectrogram to the time domain.

Table 4.1. Algorithm III for Language Informed Speech Bandwidth Expansion
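Read as pseudocode, the steps of Tab. 4.1 chain together roughly as follows. Every function name here is a placeholder for the corresponding step rather than an actual implementation, except restrict_dictionaries and fill_missing, which reuse the sketches given earlier in this chapter.

    def language_informed_bwe(narrowband_spec, word_corpus, language_model):
        # (1)-(2): train word-level N-HMMs and merge them under the language model
        word_models = [train_word_nhmm(w) for w in word_corpus]           # Sec. 4.3
        speaker_model = combine_word_models(word_models, language_model)  # Sec. 4.4
        # (3): restrict the wideband dictionaries to the observed (narrowband) bins
        observed = observed_mask(narrowband_spec)        # boolean, freqs x frames
        narrow_dicts = restrict_dictionaries(speaker_model.dictionaries,
                                             observed[:, 0])
        # (4): estimate only the mixture weights on the narrowband spectrogram
        weights = estimate_mixture_weights(narrowband_spec, narrow_dicts,
                                           speaker_model.transitions)
        # (5)-(6): reconstruct P_t(f) with the full wideband dictionaries,
        # then rescale and fill in the missing entries (Eqs. 4.8-4.9)
        P = reconstruct_relative_magnitudes(speaker_model.dictionaries, weights)
        V_bar = fill_missing(narrowband_spec, P, observed)
        # (7): back to the time domain
        return spectrogram_to_audio(V_bar)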

4.6. Experimental results

We performed BWE experiments on a subset of the speech separation challenge training data set [Cooke 10], available for download at http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm. We selected 10 speakers (5 male and 5 female), with 500 sentences per speaker. Each sentence contains 6 words. Specifically, each speaker has 51 different words and there are 20–150 instances of each word. We learned N-HMMs for each speaker using 450 of the 500 sentences, and used the remaining 50 sentences as the test set.

All data are single-channel PCM with a 25000 Hz sampling rate and 16-bit quantization. A window size of 1600 points and a hop size of 400 points are used in all experiments conducted in this section.
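For reference, these analysis settings correspond to an STFT such as the following. The use of SciPy here is purely illustrative; the original implementation is not specified in the text, and "speech.wav" is a hypothetical input file.

    from scipy.io import wavfile
    from scipy.signal import stft

    fs, x = wavfile.read("speech.wav")   # 25000 Hz, 16-bit PCM assumed
    assert fs == 25000
    # 1600-point window with a 400-point hop (i.e., 1200 points of overlap)
    f, t, X = stft(x, fs=fs, nperseg=1600, noverlap=1200)
    V = abs(X)                            # magnitude spectrogram used by the model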

We segmented the training sentences into words in order to learn individual word models as described in Sec. 4.3 . We used one state per phoneme. This is less than what is typically used in speech recognition. However, we did not want to excessively constrain the model in order to obtain high quality reconstructions. We then combined the word models of a given speaker into a single N-HMM according to the language model, as described in Sec. 4.4 .

We performed speech BWE using the language-model constrained N-HMM on 50 sentences per speaker in the test set, totaling 500 sentences. As a comparison, we performed BWE using PLCA [Smaragdis 07a], with the scaling factor described in [Smaragdis 11]. We found that the use of the scaling factor for PLCA produces better results than the original algorithm proposed in [Smaragdis 07a]. When using PLCA, we used the same training and test sets that we used with the proposed model. However, we simply concatenated all of the training data of a given speaker and learned a single dictionary for that speaker, which is customary when using non-negative spectrogram factorization approaches such as PLCA.

We considered two different conditions. The first is to expand the bandwidth of telephony speech signals, referred to as Con-A. The input narrowband signal has a bandwidth of 300 Hz to 3400 Hz, which simulates bandlimited telephone speech.

In the second condition, referred to as Con-B, we removed all the frequencies below 1000 Hz. In Con-B, the speech is considerably more corrupted than the telephony speech, since speech usually has strong energy in the low frequencies. For both conditions, we reconstructed the signal to its full bandwidth of 0 to 12500 Hz.
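The two conditions amount to different frequency masks over the STFT bins. A sketch, with bin spacing following the 25000 Hz / 1600-point analysis above and all variable names illustrative:

    import numpy as np

    freqs = np.fft.rfftfreq(1600, d=1.0 / 25000.0)   # bin centre frequencies in Hz

    observed_con_a = (freqs >= 300) & (freqs <= 3400)   # Con-A: telephone band
    observed_con_b = freqs >= 1000                       # Con-B: low band removed

    # e.g., simulate Con-A from a wideband magnitude spectrogram V
    V_narrow = V * observed_con_a[:, None]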

Signal-to-Noise Ratio (SNR) is used to evaluate the narrowband speech and the outputs of both methods against the original wideband signal.

    SNR = 10 × log10 ( Σ_t s(t)^2 / Σ_t (s̄(t) − s(t))^2 )                (4.10)

where s(t) and s̄(t) are the original and the reconstructed signals respectively.

In this context, SNR measures the “signal to difference” ratio between original and reconstructed signals. The higher the number, the closer the reconstructed signal is to the original one.
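Eq. 4.10 in code, assuming s and s_bar are time-aligned floating-point NumPy arrays holding the original and reconstructed signals:

    import numpy as np

    def snr_db(s, s_bar):
        """Signal-to-noise ratio of Eq. 4.10, in dB."""
        noise = s_bar - s
        return 10.0 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))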

To measure the perceptual quality of the reconstructed speech, we use the overall rating for speech enhancement (COVL) [Hu 06] as a second measurement. COVL is a composite measure of the predicted overall quality of speech, formed by linearly combining three widely used objective speech quality measures: PESQ (perceptual evaluation of speech quality), LLR (log likelihood ratio), and WSS (weighted-slope spectral distance).

[Hu 06] shows that COVL is well correlated with the subjective rating OVRL, which uses the scale of the Mean Opinion Score depicted in the following table:

    Scale   Description of the scale
    1       Bad
    2       Poor
    3       Fair
    4       Good
    5       Excellent

Table 4.2. Scale of the Mean Opinion Score used by the subjective rating OVRL.

Note that obtaining subjective ratings (OVRL) for our experiment (500 sentences in total) would be time-consuming. Because of the high correlation between COVL and OVRL, we can use the value of COVL as a reliable prediction of the perceptual speech quality on the Mean Opinion Score scale (OVRL). For example, a score of COVL = 3 for a speech clip means that it is very likely the measured speech would be rated as “Fair” by a human listener.
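The mapping of Tab. 4.2 can be read as a simple lookup. The helper below is purely illustrative; it rounds a composite COVL score to the nearest point on the MOS scale and returns its descriptor.

    MOS_SCALE = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}

    def mos_description(covl):
        """Map a COVL score to the nearest Mean Opinion Score descriptor."""
        return MOS_SCALE[min(5, max(1, round(covl)))]

    # mos_description(3.0) -> "Fair", as in the example above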

We first illustrate the proposed method with an example in Fig. 4.3. The x-axis of all plots is time (linear scale) and the y-axis is frequency (log scale). The original audio is a 1.27-second speech clip of a male speaker saying, “set red at h five soon”. We removed all the frequencies below 300 Hz and all the frequencies above 3400 Hz of the spectrogram to construct the narrowband speech. Both methods are applied to expand the bandwidth of the narrowband speech from 300–3400 Hz to 0–12500 Hz. To better illustrate the missing regions, we only plot the frequencies above 800 Hz of the spectrogram in Fig. 4.3.

Compared to PLCA, the proposed method provides a higher-quality reconstruction, as can be clearly seen in the high frequencies. PLCA tends to be problematic with the reconstruction of high-end energy, especially for phonemes (such as /s/ and /h/) with noticeable energy in high frequencies. Such regions are marked with white-edge boxes in Fig. 4.3 (c). The proposed method, on the other hand, has recovered most of the high-end energy quite accurately.

The second example illustrates the performance of both methods on the reconstruction of low frequencies. The reconstructed signals are shown in Fig. 4.4. The original audio is a 2.20-second speech clip of a male speaker saying, “bin blue with s seven soon”. We removed all the frequencies below 1000 Hz of the spectrogram. Only the lower 4000 Hz of the spectrogram are plotted, in log scale, to better illustrate the missing regions.

Compared to PLCA, the proposed method provides a high-quality reconstruction as can be clearly seen in the low frequencies. PLCA tends to be problematic with the reconstruction of low-end energy, especially the harmonic structures. I have marked with white-edge boxes some of the regions in Fig. 4.4 (c) where PLCA performed poorly. The proposed method, on the other hand, has recovered most of the harmonics in the low frequency regions in the spectrogram.

Sound examples of both methods, demonstrating the perceptual quality of the reconstructed signals, are available at music.cs.northwestern.edu/research.php?project=imputation.

The boxplots of both metrics over all speakers in both conditions are plotted in Fig. 4.5 and Fig. 4.6. Each boxplot is generated from 500 results from 10 speakers. The sample minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and sample maximum are depicted. The boxplots of the BWE results for each speaker are detailed in Fig. 4.7 – 4.10 at the end of this chapter.

Both PLCA and the proposed method have significantly improved the SNR and overall speech quality relative to the bandlimited audio. Furthermore, the proposed method has achieved better performance than PLCA on both metrics.

The detailed results of both methods on each of the 10 speakers are listed in Tab. 4.3 for Con-A and Tab. 4.4 for Con-B. The score for each speaker is averaged over all 50 sentences for that speaker.

Both methods produce results that have significantly better audio quality than the given narrowband speech. The proposed method, however, outperforms PLCA in both conditions on both metrics. The improvements in both metrics in both conditions are statistically significant between the proposed method and PLCA by a Student's t-test, with p-values smaller than 0.01.
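The significance test reported here compares the two methods over the same 500 test sentences. With SciPy it could be run roughly as follows; the thesis does not state whether the test was paired, but a paired test is the natural choice when both methods are evaluated on identical sentences, and the array names are illustrative.

    from scipy import stats

    # snr_proposed and snr_plca: per-sentence SNR values for the same 500 sentences
    t_stat, p_value = stats.ttest_rel(snr_proposed, snr_plca)
    significant = p_value < 0.01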

In Con-A, PLCA has improved the speech quality (in terms of the COVL metric) of the input narrowband signals from “bad” to “between fair and good”. The proposed method has further improved the rating to “above good”. In Con-B, the OVRL metric of the corrupted speech signal is improved from “bad” to “above poor” by PLCA, and further to “between fair and good” by the proposed method.

The improvement is clearly more apparent in Con-B than in Con-A. The reason is that the speech input in Con-B is more heavily corrupted, so the spectral information alone is not enough to get reasonable results. The syntactic knowledge in the form of language models incorporated in the proposed algorithm is able to boost the quality of the reconstructed speech signals.

                    SNR (dB) - Con-A               OVRL (1 – 5) - Con-A
    Speaker ID   Input    PLCA    Proposed     Input    PLCA    Proposed
    1             1.39     5.32      9.98       1.07     3.61      4.24
    2             1.94     3.51      7.60       1.02     3.23      4.05
    3             5.29     8.30     11.12       1.26     3.56      4.25
    4             4.43     8.10     10.74       1.04     3.72      4.26
    5             7.29    10.07     13.26       1.14     3.58      4.41
    6             3.98     7.05     11.08       1.06     3.63      4.34
    7             4.58     8.45     10.69       1.16     3.58      4.17
    8             2.64     5.71      8.92       1.01     3.37      4.14
    9             5.60    10.75     13.64       1.42     3.73      4.38
    10            4.86     8.53     11.65       1.34     3.75      4.36
    Average       4.2      7.58     10.87       1.15     3.58      4.26

Table 4.3. Performances of audio BWE results by the proposed method and PLCA in Con-A. Numbers in bold font indicate the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.

                    SNR (dB) - Con-B               OVRL (1 – 5) - Con-B
    Speaker ID   Input    PLCA    Proposed     Input    PLCA    Proposed
    1            -0.38     0.44      5.02       1.00     2.30      3.29
    2             1.15     1.30      5.84       1.00     2.32      3.45
    3            -0.10     1.12      5.08       1.00     2.12      3.57
    4            -0.11     1.61      6.30       1.00     2.20      3.39
    5             0.87     2.15      6.76       1.00     2.28      3.47
    6            -0.12     1.04      5.15       1.00     2.11      3.41
    7             0.21     1.82      6.07       1.00     2.18      3.45
    8             0.14     1.22      4.60       1.00     1.95      2.98
    9            -0.30     2.17      7.61       1.00     2.75      3.68
    10            0.39     1.47      6.56       1.00     2.36      3.48
    Average       0.17     1.43      5.90       1.00     2.26      3.41

Table 4.4. Performances of audio BWE results by the proposed method and PLCA in Con-B. Numbers in bold font indicate the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.

4.7. Contribution and Conclusion

I presented a method to perform bandwidth expansion for speech signals. The contributions of the proposed algorithm are as follows. The high-level syntactic knowledge about speech in the form of language models is incorporated in the N-HMM framework for BWE. Experimental results have shown that the use of language models to constrain non-negative models has led to improved speech BWE performance when compared to a non-negative spectrogram factorization method.

The main contribution of this chapter is to show that the use of high-level structural modeling for the BWE problem is promising. In the proposed system, acoustic knowledge is incorporated in the form of word models and syntactic knowledge in the form of a language model. The methodology was shown with respect to speech and language models, but it can be used in other contexts in which high-level structural information is available. One such example is incorporating music theory rules into the N-HMM framework for BWE of music.

The current system can be extended in several ways, following practices established in speech recognition. As discussed in Sec. 4.3, our system can be extended to use sub-word models, in order for it to be feasible for large-vocabulary speech BWE. The current algorithm is an offline method since we used the forward-backward algorithm. In order for it to work online, we can simply use the forward algorithm [Rabiner 93].

Figure 4.3. Example of speech BWE. (a) Original speech. (b) Narrowband speech; frequencies below 300 Hz or above 3400 Hz are removed. (c) Result using PLCA. (d) Result using the proposed method.

Figure 4.4. Example of speech BWE. The x-axis is time and the y-axis is frequency. (a) Original speech. (b) Narrowband speech; the lower 1000 Hz of the spectrogram are removed. (c) Result using PLCA; regions marked with white-edge boxes are regions in which PLCA performed poorly. (d) Result using the proposed method. The lower 4000 Hz are plotted in log scale.

Figure 4.5. Boxplots of the audio BWE SNR results in Con-A (top plot) and Con-B (bottom plot). Each boxplot is generated from 500 SNR results from 10 speakers.

Figure 4.6. Boxplots of the audio BWE COVL results in Con-A (top plot) and Con-B (bottom plot). Each boxplot is generated from 500 COVL results from 10 speakers.

Figure 4.7. SNR boxplots of the audio BWE results for each speaker by the proposed method and PLCA in Con-A. Each boxplot is generated from 50 SNR results from one speaker.

Figure 4.8. COVL boxplots of the audio BWE results for each speaker by the proposed method and PLCA in Con-A. Each boxplot is generated from 50 COVL results from one speaker.

Figure 4.9. SNR boxplots of the audio BWE results for each speaker by the proposed method and PLCA in Con-B. Each boxplot is generated from 50 SNR results from one speaker.

Figure 4.10. COVL boxplots of the audio BWE results for each speaker by the proposed method and PLCA in Con-B. Each boxplot is generated from 50 COVL results from one speaker.

CHAPTER 5

Conclusion and Future Research

In this thesis, I have introduced and addressed the problem of computational auditory scene induction (CASI). I have formulated this problem as a model-based spectrogram analysis and factorization problem with missing data. I have shown how to solve this model-based factorization problem using the expectation–maximization (EM) algorithm.

More specifically, three CASI systems are presented in this thesis. The contributions of each system are as follows:

In Chap. 2, I have presented a semi-supervised singing melody extraction system. The proposed system, based on probabilistic latent component analysis (PLCA), assumes no prior information on the type or the number of instruments, and can adjust the learned accompaniment model adaptively. I have shown the effectiveness of the proposed algorithm at suppressing the accompaniment with illustrative examples and quantitative experiments compared to other methods.

Current query-by-humming (QBH) systems depend on humans to listen to polyphonic audio files (song recordings) and build machine-searchable melodies from them. The melody extraction system proposed in this thesis would make such QBH systems much more broadly useful by enabling the deployment of large databases of audio.

In Chap. 3, I have presented audio imputation and bandwidth expansion methods using the non-negative hidden Markov model (N-HMM). The proposed system has accounted for the non-stationarity and temporal dynamics of audio, which have never been considered in previous audio imputation systems. Illustrative examples and quantitative experiments have shown the advantages of the proposed algorithms over a state-of-the-art audio imputation algorithm.

An effective approach to audio imputation could benefit many important applications, such as speech recognition in noisy environments, sound restoration and enhancement, and improvement of audio source separation results.

In Chap. 4, I have further extended the proposed audio bandwidth expansion system by incorporating high-level knowledge about speech. Syntactic knowledge in the form of language models and acoustic knowledge in the form of word models are incorporated into the proposed CASI system. The proposed algorithm has improved the bandwidth expansion results, over a BWE method without the use of high-level speech knowledge, in terms of both reconstruction accuracy and perceived speech quality.

The proposed BWE system is especially useful for high-quality reproduction of historical recordings. With future extensions (discussed later in this chapter), the proposed system will have broader applications for telephone speech and loudspeaker enhancement.

5.1. Future Directions

The work in this thesis can be extended in many ways. I now discuss the possible extensions in detail.

• Singing Melody Extraction

There are several aspects of the current algorithms that can be improved. First, the singing voice detection algorithm can be improved in various ways. For example, the current singing voice detector is based on two procedures: dividing the mixture into segments based on spectral peaks, and then classifying each segment into vocal or non-vocal using a majority vote by the pre-trained GMM classifier. As shown in [Duxbury 03], a better way to segment the mixture is to use the onset information. This will improve the singing detection accuracy and, in turn, the melody extraction accuracy.

Secondly, the current pitch estimation algorithm is a simple single-pitch estimation algorithm based on auto-correlation [Boersma 93]. The extracted singing voice still contains interferences from the accompaniment. Therefore the results of the single-pitch estimation algorithm can be degraded by the residual interference. A more advanced algorithm that takes the interferences into consideration should be used in the future.

On the algorithm side, I plan to investigate whether the proposed algorithm can be extended to be completely unsupervised, by taking the repetitive structures [Rafii 11] of the accompaniment into consideration. To account for the repetitive structures in the mixture, convolutive NMF [OGrady 06] can be used in the proposed algorithm to model the accompaniment.

• Audio Imputation

On the application side, the current algorithm is offline and computationally expensive. It can be extended to an online algorithm by using the forward algorithm in the EM process. In order to speed up the proposed algorithm, we can apply variational inference to replace the EM algorithm during the parameter estimation process of the N-HMM.

On the theory side, it has been shown that the success of the N-HMM greatly depends on proper training on large datasets. Therefore it is crucial to develop a principled way of training N-HMMs from large datasets in an appropriate and efficient way. For example, advanced clustering algorithms can be used to first divide a large dataset into clusters of appropriate size with similar spectral structures, and then N-HMM (or PLCA) learning can be performed independently on each cluster.

Another way to make the training more efficient is to impose appropriate prior distributions on the parameters of the N-HMM.

• Language Informed Audio Bandwidth Expansion

The possible extensions of the proposed speech bandwidth expansion algorithm are illustrated in Fig. 4.2. In order for the proposed system to be feasible for large-vocabulary speech BWE, the system needs to be extended to use sub-word models. Furthermore, it is worth investigating how to train the sub-word models without explicitly segmenting the training data into sub-word units. This idea has been studied in automatic speech recognition [Rabiner 93] and can greatly reduce the computational cost of training the proposed algorithm.

When the syntactic knowledge about the language is not readily available, it can be learned using the language models as described in Sec. 4.4. The language models can be learned from either written text (such as New York Times articles) or from the actual transcript of the speech. It is worth investigating how much the choice of the language models and training formats will affect the system performance.

The idea of language models can be easily generalized to deal with the expansion of the bandwidth of different types of audio signals. For example, music theory rules can be encoded in our system in the form of language models. If the music theory rules are unknown for the type of music of interest, a language model can be used to learn the relationships among different notes for that particular type of music signal.

References

[Abel 91] J. S. Abel & J. O. Smith III. Restoring a clipped signal. In IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1745–1748, 1991.

[Adler 12] A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribonval & M. D. Plumbley. Audio Inpainting. IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 3, pages 922–932, 2012.

[Bauer 09] P. Bauer & T. Fingscheidt. A Statistical Framework for Artificial Band- width Extension Exploiting Speech Waveform and Phonetic Transcrip- tion. In European Signal Processing Conference (EUSIPCO), pages 1839–1843, 2009.

[Bertalmio 00] M. Bertalmio, G. Sapiro, V. Caselles & C. Ballester. Image inpainting. In International conference on Computer graphics and interactive tech- niques (SIGGRAPH), pages 417–424, 2000.

[Boersma 93] P. Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proc. the Institute of Phonetic Sciences, volume 17, pages 97–110, 1993.

[Bouvrie 06] J. Bouvrie & T. Ezzat. An Incremental Algorithm for Signal Reconstruc- tion from Short-Time Fourier Transform Magnitude. In International Conference on Spoken Language Processing (Interspeech), 2006.

[Brand 02] Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. European Conference on Computer Vision (ECCV), pages 707–720, 2002.

[Bregman 90] A. S. Bregman. Auditory scene analysis: The perceptual organization of sound. The MIT Press, 1990.

[Cartwright 11] M. Cartwright, Z. Rafii, J. Han & B. Pardo. Making Searchable Melodies: Human versus Machine. In Workshops at the Twenty-Fifth AAAI Con- ference on Artificial Intelligence, 2011.

[Clark 08] P. Clark & L. Atlas. Modulation decompositions for the interpolation of long gaps in acoustic signals. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3741–3744, 2008.

[Cocchi 02] G. Cocchi & A. Uncini. Subband neural networks prediction for on-line audio signal recovery. IEEE Trans. Neural Networks, vol. 13, no. 4, pages 867–876, 2002.

[Cooke 01] M. Cooke, P. Green, L. Josifovski & A. Vizinho. Robust Automatic Speech Recognition with Missing and Unreliable Acoustic Data. Speech Communication, vol. 34, pages 267–285, 2001.

[Cooke 10] M. Cooke, J. R. Hershey & S. J. Rennie. Monaural speech separation and recognition challenge. Computer Speech and Language, vol. 24, no. 1, pages 1–15, 2010.

[Criminisi 04] A. Criminisi, P. Pérez & K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Processing, vol. 13, no. 9, pages 1200–1212, 2004.

[Dahimene 08] A. Dahimene, M. Noureddine & A. Azrar. A Simple Algorithm for the Restoration of Clipped Speech Signal. Informatica (Slovenia), pages 183– 188, 2008.

[Dhillon 05] I. Dhillon & S. Sra. Generalized nonnegative matrix approximations with bregman divergences. In Neural Information Processing Systems Confer- ence (NIPS), pages 283–290, 2005.

[Duan 09] Z. Duan, J. Han & B. Pardo. Harmonically informed multi-pitch track- ing. In International Society on Music Information Retrieval conference (ISMIR), pages 333–338, 2009.

[Duan 10a] Z. Duan, J. Han & B. Pardo. Song-level multi-pitch tracking by heavily constrained clustering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60, 2010.

[Duan 10b] Z. Duan, B. Pardo & C. Zhang. Multiple fundamental frequency estima- tion by modeling spectral peaks and non-peak regions. IEEE Trans. Audio, Speech and Language Processing, vol. 18, no. 8, pages 2121–2133, 2010.

[Durrieu 10] J. Durrieu, G. Richard, B. David & C. Fevotte. Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic Audio Signals. IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 3, pages 564 –575, 2010.

[Duxbury 03] C. Duxbury, J. P. Bello, M. Davies & M. Sandler. Complex Domain On- set Detection for Musical Signals. In International Conference on Digital Audio Effects (DAFx), 2003.

[Ellis 96] D. Ellis. Prediction-driven computational auditory scene analysis. Ph.d. dissertation, MIT, 1996.

[Enbom 99] N. Enbom & W. B. Kleijn. Bandwidth expansion of speech based on vector quantization of the mel frequency cepstral coefficients. In IEEE Workshop on Speech Coding, pages 171–173, 1999.

[Esquef 06] P. Esquef & L. Biscainho. An efficient model-based multirate method for reconstruction of audio signals across long gaps. IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pages 1391–1400, 2006.

[Etter 96] W. Etter. Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters. IEEE Trans. Signal Processing, vol. 44, no. 5, pages 1124–1135, 1996.

[Févotte 09] C. Févotte, N. Bertin & J. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Computation, vol. 21, no. 3, pages 793–830, 2009.

[Gemmeke 11] J. F. Gemmeke. Noise robust ASR: Missing data techniques and beyond. PhD thesis, Radboud Universiteit Nijmegen, The Netherlands, 2011.

[Godsill 01] S. Godsill, P. Wolfe & W. Fong. Statistical model-based approaches to audio restoration and analysis. Journal of New Music Research, vol. 30, no. 4, pages 323–338, 2001.

[Goto 04] M. Goto. A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, vol. 43, no. 4, pages 311–329, 2004.

[Grindlay 10] G. Grindlay & D. Ellis. A Probabilistic Model for Multi-instrument Poly- phonic Transcription. In International Society on Music Information Re- trieval conference (ISMIR), 2010.

[Han 09] J. Han & B. Pardo. Improving Separation of Harmonic Sources with Iterative Estimation of Spatial Cues. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 77–80, 2009.

[Han 10] J. Han & B. Pardo. Reconstructing individual monophonic instruments from musical mixtures using scene completion. Journal of the Acoustical Society of America, vol. 128, page 2309, 2010.

[Han 11a] J. Han & C. Chen. Improving melody extraction using Probabilistic La- tent Component Analysis. In IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 33–36, 2011.

[Han 11b] J. Han & B. Pardo. Reconstructing completely overlapped notes from mu- sical mixtures. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 249–252, 2011.

[Han 12a] J. Han, G. J. Mysore. & B. Bryan. Audio Imputation using the Non- negative Hidden Markov Model. In Lecture Notes in Computer Science: Latent Variable Analysis and Signal Separation (LVA/ICA), volume 7191, pages 347–355, 2012.

[Han 12b] J. Han, G. J. Mysore. & B. Bryan. Language Informed Bandwidth Expan- sion. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012.

[Hays 08] J. Hays & A. Efros. Scene Completion Using Millions of Photographs. Communications of the ACM, vol. 51, pages 87–94, 2008.

[Hu 06] Y. Hu & P. C. Loizou. Evaluation of objective measures for speech en- hancement. In International Conference on Spoken Language Processing (Interspeech), 2006.

[Hus 09] C. Hus, L. Chen, R. Jang & H. Li. Singing pitch extraction from monaural polyphonic songs by contextual audio modeling and singing harmonic enhancement. In International Society on Music Information Retrieval conference (ISMIR), 2009.

[Iser 08] B. Iser & G. Schmidt. Bandwidth Extension of Telephony Speech. In Speech and Audio Processing in Adverse Environments, Signals and Communication Technology, pages 135–184. Springer Berlin Heidelberg, 2008.

[Janssen 86] A. Janssen, R. Veldhuis & L. Vries. Adaptive interpolation of discrete- time signals that can be modeled as autoregressive processes. IEEE Trans. Acoustics, Speech and Signal Processing, vol. 34, no. 2, pages 317–330, 1986.

[Jax 02] P. Jax. Enhancement of Bandlimited Speech Signals: Algorithms and Theoretical Bounds. Ph.d. dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, 2002.

[Jax 03] P. Jax & P. Vary. On artificial bandwidth extension of telephone speech. Signal Processing, vol. 83, no. 8, pages 1707–1719, 2003.

[Jo 11] S. Jo, C. D. Yoo & A. Doucet. Melody Tracking Based on Sequential Bayesian Model. IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pages 1216–1227, 2011.

[Joo 11] S. Joo, S. Park, S. Jo & C. D. Yoo. Melody Extraction based on Harmonic Coded Structure. In International Society on Music Information Retrieval conference (ISMIR), pages 227–232, 2011.

[Kim 09] Wooil Kim & J.H.L. Hansen. Time-Frequency Correlation-Based Missing-Feature Reconstruction for Robust Speech Recognition in Band- Restricted Conditions. IEEE Trans. Audio, Speech, and Language Pro- cessing, vol. 17, no. 7, pages 1292–1304, 2009.

[Klapuri 03] A.P. Klapuri. Multiple fundamental frequency estimation based on har- monicity and spectral smoothness. IEEE Trans. Speech and Audio Pro- cessing, vol. 11, no. 6, pages 804–816, 2003.

[Kontio 07] J. Kontio, L. Laaksonen & P. Alku. Neural Network-Based Artificial Bandwidth Expansion of Speech. IEEE Trans. Audio, Speech, and Lan- guage Processing, vol. 15, no. 3, pages 873–881, 2007.

[Kornagel 02] U. Kornagel. Spectral widening of telephone speech using an extended classification approach. In European Signal Processing Conference (EUSIPCO), 2002.

[Lagrange 05] M. Lagrange & S. Marchand. Long interpolation of audio signals using linear prediction in sinusoidal modeling. J. Audio Eng. Soc, vol. 53, pages 891–905, 2005.

[Larsen 04] E. Larsen & R. M. Aarts. Audio bandwidth extension: Application of psychoacoustics, signal processing and loudspeaker design. Wiley, 2004.

[Le Roux 10] J. Le Roux, H. Kameoka, N. Ono, A. de Cheveigné & S. Sagayama. Computational auditory induction as a missing-data model-fitting problem with Bregman divergences. Speech Communication, vol. 53, no. 5, pages 658–676, 2010.

[Lee 99] D. Lee & S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, vol. 401, no. 6755, pages 788–791, 1999.

[Li 05] Y. Li & D. Wang. Detecting pitch of singing voice in polyphonic au- dio. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[Li 07] Y. Li & D. Wang. Separation of Singing Voice From Music Accompani- ment for Monaural Recordings. IEEE Trans. Audio, Speech, and Lan- guage Processing, vol. 15, pages 1475–1487, 2007.

[Li 09] Y. Li, J. Woodruff & D. Wang. Monaural Musical Sound Separation Based on Pitch and Common Amplitude Modulation. IEEE Trans. Au- dio, Speech & Language Processing, vol. 17, no. 7, pages 1361–1371, 2009.

[Maher 94] R. Maher. A Method for Extrapolation of Missing Digital Audio Data. J. Audio Eng. Soc, vol. 42, pages 350–357, 1994.

[McDermott 11] J. H. McDermott, D. Wrobleski & A. J. Oxenham. Recovering sound sources from embedded repetition. Proceedings of the National Academy of Sciences, vol. 108, pages 1188–1193, 2011.

[Moussallam 10] M. Moussallam, P. Leveau & S. M. Aziz Sbai. Sound enhancement using sparse approximation with speclets. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 221–224, 2010.

[Mysore 10] G. J. Mysore. A Non-negative Framework for Joint Modeling of Spectral Structure and Temporal Dynamics in Sound Mixtures. Ph.d. dissertation, Stanford University, 2010.

[Mysore 12] G. J. Mysore & P. Smaragdis. A Non-negative Approach to Language Informed Speech Separation. In Lecture Notes in Computer Science: La- tent Variable Analysis and Signal Separation (LVA/ICA), volume 7191, pages 356–363, 2012.

[Nawab 83] S. Nawab, T. Quatieri & J. Lim. Signal reconstruction from short-time Fourier transform magnitude. IEEE Trans. Acoustics, Speech and Signal Processing, vol. 31, pages 986–998, 1983.

[Ofir 07] H. Ofir, D. Malah & I. Cohen. Audio Packet Loss Concealment in a Com- bined MDCT-MDST Domain. IEEE Signal Processing Letters, vol. 14, no. 12, pages 1032–1035, 2007.

[Oppenheim 75] A. V. Oppenheim & R. W. Schafer. Digital signal processing. Prentice- Hall, 1975.

[OGrady 06] P. D. O'Grady & B. A. Pearlmutter. Convolutive Non-Negative Matrix Factorisation with a Sparseness Constraint. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 427–432, 2006.

[Paiva 05] R. P. Paiva, T. Mendes & A. Cardoso. On the Detection of Melody Notes in Polyphonic Audio. In International Society on Music Information Re- trieval conference (ISMIR), pages 175–182, 2005.

[Papoulis 91] A. Papoulis. Probability, random variables, and stochastic processes. McGraw-Hill, 1991.

[Pardo 08] B. Pardo, D. Little, R. Jiang, H. Livni & J. Han. The vocalsearch music search engine. In ACM/IEEE-CS joint conference on Digital libraries (JCDL), 2008.

[Park 00] K. Park & H. Kim. Narrowband to wideband conversion of speech us- ing GMM based transformation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 1843–1846, 2000.

[Poliner 05] G. Poliner & D. Ellis. A Classification Approach to Melody Transcrip- tion. In International Society on Music Information Retrieval conference (ISMIR), pages 161–166, 2005.

[Pulakka 11a] H. Pulakka & P. Alku. Bandwidth Extension of Telephone Speech Using a Neural Network and a Filter Bank Implementation for Highband Mel Spectrum. IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 7, pages 2170–2183, 2011.

[Pulakka 11b] H. Pulakka, U. Rentes, K. Palomaki, M. Kurimo & P. Alku. Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum. In IEEE International Conference on Acous- tics Speech and Signal Processing (ICASSP), pages 5100–5103, 2011.

[R. 04] B. Raj, M. L. Seltzer & R. M. Stern. Reconstruction of missing features for robust speech recognition. Speech Communication, vol. 43, no. 4, pages 275–296, 2004.

[Rabiner 93] L. Rabiner & B-H. Juang. Fundamentals of speech recognition. Prentice Hall, 1993.

[Rafii 11] Z. Rafii & B. Pardo. A Simple Music/Voice Separation Method based on the Extraction of the Repeating Musical Structure. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 221–224, 2011.

[Raj 00] B. Raj. Reconstruction of Incomplete Spectrograms for Robust Speech Recognition. Ph.d. dissertation, Carnegie Mellon University, 2000.

[Ramona 08] M. Ramona, G. Richard & B. David. Vocal detection in music with Sup- port Vector Machines. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 1885–1888, 2008.

[Rao 10] V. Rao & P. Rao. Vocal Melody Extraction in the Presence of Pitched Accompaniment in Polyphonic Music. IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 8, pages 2145–2154, 2010.

[Regnier 10] L. Regnier & G. Peeters. Partial clustering using a time-varying fre- quency model for singing voice detection. In IEEE International Confer- ence on Acoustics Speech and Signal Processing (ICASSP), pages 441– 444, 2010.

[Repp 92] B. H. Repp. Perceptual restoration of a “missing” speech sound: auditory induction or illusion? Attention, Perception, & Psychophysics, vol. 51, no. 1, pages 14–32, 1992.

[Ryynanen 06] M. Ryynanen & A. Klapuri. Transcription of the Singing Melody in Polyphonic Music. In International Society on Music Information Retrieval conference (ISMIR), 2006.

[Shashanka 08] M. Shashanka, B. Raj & P. Smaragdis. Probabilistic Latent Variable Models as Nonnegative Factorizations. Computational Intelligence and Neuroscience, 2008.

[Skalak 08] M. Skalak, J. Han & B. Pardo. Speeding Melody Search With Vantage Point Trees. In International Society on Music Information Retrieval conference (ISMIR), pages 95–100, 2008.

[Smaragdis 06] P. Smaragdis, M. Shashanka & B. Raj. Probabilistic latent variable model for acoustic modeling. In Advances in models for acoustic processing workshop, NIPS, 2006.

[Smaragdis 07a] P. Smaragdis, B. Raj & M. Shashanka. Example-Driven Bandwidth Ex- pansion. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2007.

[Smaragdis 07b] P. Smaragdis, B. Raj & M. Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In Lecture Notes in Computer Science: Independent Component Analysis and Signal Sepa- ration, pages 414–421, 2007.

[Smaragdis 11] P. Smaragdis, B. Raj & M. Shashanka. Missing Data Imputation for Time-Frequency Representations of Audio Signals. Journal of Signal Pro- cessing Systems, vol. 65, no. 3, pages 361–370, 2011.

[Tachibana 10] H. Tachibana, T. Ono, N. Ono & S. Sagayama. Melody line estima- tion in homophonic music audio signals based on temporal-variability of melodic source. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 425 –428, 2010.

[Unno 05] T. Unno & A. McCree. A Robust Narrowband to Wideband Extension System Featuring Enhanced Codebook Mapping. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 805–808, 2005.

[Wang 06] D. Wang & G. J. Brown. Computational auditory scene analysis: Prin- ciples, algorithms and applications. Wiley, 2006.

[Warren 70] R. M. Warren. Perceptual restoration of missing speech sounds. Science, vol. 167, no. 917, pages 392–393, 1970.

[Warren 72] R. M. Warren, C. J. Obusek & J. M. Ackroff. Auditory Induction: Per- ceptual Synthesis of Absent Sounds. Science, vol. 176, no. 4039, pages 1149–1151, 1972.

[Yilmaz 04] O. Yilmaz & S. Rickard. Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Trans. Signal Processing, vol. 52, pages 1830–1847, 2004.