
PITCH TRACKING AND SPEECH ENHANCEMENT IN NOISY AND REVERBERANT ENVIRONMENTS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Mingyang Wu, M.S., B.Eng.

* * * * *

The Ohio State University

2003

Dissertation Committee:

Professor DeLiang Wang, Advisor

Professor Donna Byron

Professor Lawrence L. Feth

Professor Ashok Krishnamurthy

Approved by

_________________________
Advisor

Department of Computer and Information Science

© Copyright by

Mingyang Wu

2003

ABSTRACT

Two causes of speech degradation exist in practically all listening situations: noise interference and room reverberation. This dissertation investigates three particular aspects of speech processing in noisy and reverberant environments: multipitch tracking for noisy speech, measurement of reverberation time based on pitch strength, and reverberant speech enhancement using one microphone (or monaurally).

An effective multipitch tracking algorithm for noisy speech is critical for speech analysis and processing. However, the performance of existing algorithms is not satisfactory. We present a robust algorithm for multipitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new method for extracting periodicity information across different channels, and a hidden Markov model

(HMM) for forming continuous pitch tracks. The resulting algorithm can reliably track single and double pitch tracks in a noisy environment. We suggest a pitch error measure for the multipitch situation. The proposed algorithm is evaluated on a database of speech utterances mixed with various types of interference. Quantitative comparisons show that our algorithm significantly outperforms existing ones.

Reverberation corrupts harmonic structure in voiced speech. We observe that the pitch strength of voiced speech segments is indicative of the degree of reverberation.

Consequently, we present a pitch-based measure for reverberation time (T60) utilizing our new pitch determination algorithm. The pitch strength is measured by deriving the statistics of relative time lags, defined as the distances from the detected pitch periods to the closest peaks in correlograms. The monotonic relationship between the measured pitch strength and reverberation time is learned from a corpus of reverberant speech with known reverberation times.

Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. We propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects so that SRR is increased. The second stage utilizes spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. Our algorithm is quantitatively compared with a recent one-microphone reverberant speech enhancement algorithm on a corpus of speech utterances in a number of reverberant conditions. The results show that our algorithm performs substantially better.


Dedicated to my parents, Dingyi Shen and Qinjin Wu.


ACKNOWLEDGMENTS

I wish to thank my advisor, Dr. DeLiang Wang, for his scientific insights and guidance in this research. Not only did he teach me the principles of scientific thinking and research, but he also took every opportunity to educate me to be a well-rounded researcher. I have always marveled at his extensive knowledge in many scientific and engineering fields, which makes his advice invaluable. His passion for scientific exploration will always inspire me in the future.

I am grateful to Dr. Larry Feth, who taught me and is always willing to offer me his advice and insights. Thanks are due to Dr. Ashok Krishnamurthy, from whom I learned much of my knowledge in speech processing. Dr. Osamu Fujimura not only taught me Phonetics, but also provided me a broader view of speech and hearing science.

Along with Dr. Wang, Dr. Delwin Lindsey, Dr. James Todd, and Tjeerd Dijkstra founded the OSU Club, which introduced me to a broad range of issues in perception and is a constant source of inspiration. Special thanks are due to Dr. Lindsey and Dr. Angela Brown. Their knowledge and devotion to science always impress me, and their intellectual as well as emotional support is invaluable. Dr. Todd showed me many aspects of visual perception in his wonderful talks and has been generous in offering me valuable

advice. Dr. Dijkstra and Dr. Stijn Oomes offered me help and support in many ways and are greatly appreciated.

I am grateful to have Dr. Donna Byron, Dr. Feth, Dr. Krishnamurthy, and Dr. Kim

Boyer on my dissertation committee and Dr. Eric Fosler-Lussier to review my dissertation. Their support and critique have been very helpful.

I also wish to thank my officemates for their help and productive discussions. After sharing an office with Nicoleta Roman for four years, I will miss her as I move out.

Guoning Hu always impresses me with his opinions not only on scientific subjects but also on life. Without Soundararajan Srinivasan, lunch would have been a routine affair.

Late night working with Yang Shao will be remembered for a long time.

I would like to thank my former officemates: Dr. Xiuwen Liu, Dr. Shannon

Campbell, and Dr. Erdogan Cesmeli. Even after their graduation, they continue to offer me their help and encouragement.

Last but not least, I wish to acknowledge the financial support provided to me by an ONR Young Investigator Award to Dr. Wang, an AFOSR grant (F49620-01-1-0027), and an NSF grant (IIS-0081058).


VITA

March 24, 1972 ...... Born in Jiangsu Province, China

July, 1995 ...... B.Eng. Electrical Engineering, Tsinghua University, Beijing, China

June, 1999 ...... M.S. Computer and Information Science, The Ohio State University

PUBLICATIONS

Journal Article

Mingyang Wu, DeLiang Wang and Guy J. Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 229-241, May 2003.

Conference Papers

Mingyang Wu, DeLiang Wang and Guy J. Brown, “Pitch tracking based on statistical anticipation,” Proc. International Joint Conference on Neural Networks (IJCNN), Vol. 2, pp. 866-871, 2001.

Mingyang Wu, DeLiang Wang and Guy J. Brown, “A multipitch tracking algorithm for noisy speech,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002, vol. 1, pp. 369-372.

Mingyang Wu and DeLiang Wang, “A one-microphone algorithm for reverberant speech enhancement”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2003, vol.1, pp. 844-847.

Technical Report

Mingyang Wu, DeLiang Wang and Guy J. Brown, “A multipitch tracking algorithm for noisy speech,” Technical Report OSU-CISRC-12/01-TR25, Department of Computer and Information Science, The Ohio State University, Dec. 2001.

FIELDS OF STUDY

Major Field: Computer and Information Science


TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xi

List of Figures ...... xiii

Chapters:

1. Introduction ...... 1

1.1 Motivations ...... 1
1.2 Objectives ...... 5
1.3 Organization of Dissertation ...... 7

2. A Multipitch Tracking Algorithm for Noisy Speech ...... 9

2.1 Introduction ...... 10
2.2 Proposed Algorithm ...... 19
2.2.1 Multichannel Front-End ...... 23
2.2.2 Pitch Tracking ...... 28
2.3 Results and Comparisons ...... 43
2.4 Discussion and Conclusion ...... 55

3. Room Reverberation and a Pitch-based Measure for Reverberation Time ...... 59

3.1 Room Reverberation ...... 59

3.2 Effects of Room Reverberation on Speech Perception ...... 67
3.3 A Pitch-based Measure for Reverberation Time ...... 70
3.3.1 Proposed Measure ...... 73
3.4 Conclusion ...... 77

4. A One-microphone Algorithm for Reverberant Speech Enhancement ...... 79

4.1 Introduction ...... 79
4.2 Proposed Algorithm ...... 91
4.2.1 Inverse Filtering ...... 92
4.2.2 Spectral Subtraction ...... 100
4.3 Results and Comparisons ...... 107
4.3.1 Objective Speech Quality Measures ...... 107
4.3.2 Evaluation Results ...... 111
4.4 Discussion and Conclusion ...... 125

5. Contributions and Future Work ...... 129

5.1 Contributions ...... 129
5.2 Insights Gained ...... 131
5.3 Future Work ...... 133
5.4 Concluding Remarks ...... 134

Bibliography ...... 136


LIST OF TABLES

Table Page

2.1 Four sets of estimated model parameters ...... 32

2.2 Transition probabilities between state spaces of pitch ...... 40

2.3 Categorization of interference signals ...... 44

2.4 Error rates (in percentage) for Category 1 interference ...... 49

2.5 Error rates (in percentage) for Category 2 interference ...... 50

2.6 Error rates (in percentage) for Category 3 interference ...... 51

4.1 The first stage of the proposed algorithm for finding an inverse filter to the reverberant speech ...... 97

4.2 The normalized correlation coefficients between the early- and late impulse components computed from four female and four male speakers randomly selected from TIMIT database [36] ...... 106

4.3 Articulation index filter bands (reproduced from [41])...... 110

4.4 The systematic results of reverberant speech enhancement for speech utterances of four female and four male speakers randomly selected from TIMIT database [36] ...... 114

4.5 The systematic results of reverberant speech enhancement for speech utterances of four female and four male speakers randomly selected from TIMIT database [36]. All signals are sampled at 8 kHz ...... 118

4.6 The systematic enhancement results of speech degraded by reverberation (T60 = 0.3 s) and noise (SNR = 20 dB) for speech utterances from four female and four male speakers randomly selected from the TIMIT database [36] ...... 125


LIST OF FIGURES

Figure Page

2.1 Schematic diagram of the proposed model. A mixture of speech and interference is processed in four main stages. In the first stage, the normalized correlogram is obtained within each channel after the mixture is decomposed into a multi-channel representation by cochlear filtering. Channel/peak selection is performed in the second stage. In the third stage, the periodicity information is integrated across different channels using a statistical method. Finally, an HMM is utilized to form continuous pitch tracks...... 21

2.2 Examples of normalized correlograms: (a) normalized correlogram of a clean low-frequency channel, (b) that of a noisy low-frequency channel, (c) that of a clean high-frequency channel, and (d) that of a noisy high-frequency channel. Solid lines represent the correlogram using the original time window of 16 ms and dashed lines represent the correlogram using a longer time window of 30 ms. Dotted lines indicate the maximum height of non-zero peaks. All correlograms are computed from the mixture of two simultaneous utterances of a male and a female speaker. The utterances are “Why were you all weary” and “Don’t ask me to carry an oily rag like that.” ...... 26

2.3 (a) Summary normalized correlogram of all channels in a time frame from a speech utterance mixed with white noise. The utterance is “Why were you all weary.” (b) Summary normalized correlogram of only selected channels in the same time frame as shown in (a). (c) Summary normalized correlogram of selected channels in a time frame from the speech utterance “Don’t ask me to carry an oily rag like that.” (d) Summary normalized correlogram of selected channels where the removed peaks are excluded in the same time frame as shown in (c). To exclude a removed peak means that the segment of correlogram between the two adjacent minima surrounding the peak is not considered. Dashed lines represent the delay corresponding to the true pitch periods. Dotted lines indicate the peak heights at pitch periods. ...... 29

2.4 Histogram and estimated distribution of relative time lags for a single pitch in channel 22. The bar graph represents the histogram and the solid line represents the estimated distribution...... 31

2.5 Schematic diagram of an HMM for forming continuous pitch tracks. The hidden nodes represent possible pitch states in each time frame. The observation nodes represent the set of selected peaks in each frame. The temporal links in the Markov model represent the probabilistic pitch dynamics. The link between a hidden and an observation node is called observation probability...... 37

2.6 Histogram and estimated distribution of pitch period changes in consecutive time frames. The bar graph represents the histogram and the solid line represents the estimated distribution...... 39

2.7 (a) Time-frequency energy plot for a mixture of two simultaneous utterances of a male and a female speaker. The utterances are “Why were you all weary” and “Don’t ask me to carry an oily rag like that.” The brightness in a time-frequency cell indicates the energy of the corresponding gammatone filter output in the corresponding time frame. For better display, energy is plotted as the square of the logarithm. (b) Result of tracking the mixture. The solid lines indicate the true pitch tracks. The ‘×’ and ‘ο’ tracks represent the pitch tracks estimated by our algorithm...... 47

2.8 (a) Time-frequency energy plot for a mixture of a male utterance and white noise. The utterance is “Why were you all weary.” The brightness in a time-frequency cell indicates the energy of the corresponding gammatone filter output in the corresponding time frame. For better display, energy is plotted as the square of the logarithm. (b) Result of tracking the mixture. The solid lines indicate the true pitch tracks. The ‘×’ tracks represent the pitch tracks estimated by our algorithm...... 48

2.9 Results of tracking the same signal as in Figure 2.7 using (a) the TK PDA, (b) the GB PDA, and (c) the R-GB PDA. The solid lines indicate the true pitch tracks. The ‘×’ and ‘ο’ tracks represent the estimated pitch tracks...... 52

2.10 Result of tracking the same signal as in Figure 2.8 using (a) the TK PDA, (b) the GB PDA, (c) the R-GB PDA, and (d) the PDA proposed by Rouat et al. [31]. The solid lines indicate the true pitch tracks. The ‘×’ and ‘ο’ tracks represent the estimated pitch tracks. In subplot (d), time frames with negative pitch period estimates indicate the decision of voiced with unknown period...... 54

3.1 Plot of a room impulse response function measured in a typical office employing a cross-correlation method. The data is obtained from [1]...... 62

3.2 (a) The geometry of an office-size room of the dimensions 6 by 4 by 3 meters (length by width by height), and (b) the corresponding room impulse response function generated by the image model [3]. Wall reflection coefficients are 0.75 for all walls, ceiling and floor. The loudspeaker and the microphone are at (2, 3, 1.5) and (4, 1, 2) meters (length, width, height), respectively...... 64

3.3 Energy decay curve for the room impulse response function in Figure 3.2(b) using the Schroeder integration method [144]. The horizontal dotted line represents the –60 dB energy decay level. The left dashed line indicates the starting time of the impulse response and the right dashed line the time at which the decay curve crosses –60 dB...... 66

3.4 Frequency response of the room impulse response in Figure 3.2(b)...... 70

3.5 Histograms and estimated distributions of relative time lags in channel 22 (center frequency = 264 Hz) of (a) clean speech, and (b) reverberant speech with reverberation time of 0.3 s. The bar graphs represent histograms and the solid lines represent the estimated distributions...... 74

3.6 Average distribution spread λ of relative time lag with respect to reverberation time...... 77

4.1 The equalized impulse response derived by convolving the room impulse response in Figure 3.2(b) with its matched filter...... 85

4.2 Schematic diagrams of (a) an ideal one-microphone dereverberation algorithm maximizing the kurtosis of LP residual of inverse-filtered signal, and of (b) the algorithm employed in the section...... 93

4.3 (a) A room impulse response function generated by an image method [3]. (b) The equalized impulse response derived from the reverberant speech generated by the room impulse response in (a) as the result of the first stage of our algorithm...... 98

4.4 Energy decay curves computed from (a) the room impulse response function in Figure 4.3(a) and from (b) the equalized impulse response in Figure 4.3(b). The horizontal dotted line represents the –60 dB energy decay level. The left dashed lines indicate the starting times of the impulse responses and the right dashed lines the times at which the decay curves cross –60 dB...... 99

4.5 Smoothing function for estimating the late-impulse components...... 103

4.6 The average autocorrelation function of speech utterances of 4 male and 4 female speakers randomly selected from the TIMIT database [36]...... 104

4.7 Results of reverberant speech enhancement: (a) clean speech, (b) spectrogram of clean speech, (c) reverberant speech, (d) spectrogram of reverberant speech, (e) inverse-filtered speech, (f) spectrogram of inverse-filtered speech, (g) speech processed using our algorithm, and (h) spectrogram of the processed speech of a female utterance “She had your dark suit in greasy wash water all year,” all sampled at 16 kHz...... 112

4.8 Results of reverberant speech enhancement of the same speech utterance in Figure 4.7(a) (downsampled to 8 kHz): (a) clean speech, (b) spectrogram of clean speech, (c) reverberant speech, (d) spectrogram of reverberant speech, (e) speech processed using the YM algorithm, (f) spectrogram of (e), (g) speech processed using our algorithm, and (h) spectrogram of (g)...... 116

4.9 The results of the proposed algorithm in reverberant environments with different reverberation times. The dotted, dashed, and solid lines represent the frequency-weighted segmental SNR values of reverberant speech, inverse-filtered speech, and the processed speech, respectively...... 119

4.10 Performance of the optimal versus the fixed scaling factors: the optimal scaling factors are represented by the dashed line, and the frequency-weighted segmental SNR gains from the performance using the fixed scaling factor of 0.32 are represented by the solid line...... 121

4.11 Results of enhancement of speech degraded by reverberation and noise: (a) speech degraded by reverberation (T60 = 0.3 s) and noise (SNR = 20 dB), (b) spectrogram of speech degraded by reverberation and noise, (c) inverse- filtered speech as the result of the first stage of our algorithm, (d) spectrogram of inverse-filtered speech, (e) speech processed using the proposed algorithm, and (f) spectrogram of the processed speech...... 123

4.12 (a) The equalized impulse response derived from the room impulse response in Figure 4.3(b) using linear least-square inverse filtering of length 1024 (64 ms), and (b) its energy decay curve. The horizontal dotted line represents the -60 dB energy decay level. The left dashed line indicates the starting time of the impulse response and the right dashed line the time at which the decay curve crosses -60 dB...... 128


CHAPTER 1

INTRODUCTION

1.1 Motivation

Speech is the most important mode of human communication. The essential speech communication process includes a talker, who utters a speech sound, and a receiver, who listens to the sound and then decodes the meaning. This process, however, is subject to interference in realistic acoustic environments; the acoustic waveform reaching the listener’s ears is usually composed of sound energy from multiple environmental sources.

The interfering sound can be a stationary noise, such as ambient noise from an air conditioner, or a nonstationary interference, such as door slams, music, and other speech utterances. A rather interesting example occurs at a crowded party, where many people talk simultaneously with a variety of interfering noises in the background. Nonetheless, we are able to attend to and understand a particular voice in these situations. Cherry [26] first referred to this as the “cocktail party phenomenon.” Later, Bregman, in his seminal book, explained this aspect of auditory perception as auditory scene analysis (ASA) [22].

He further argues that auditory scene analysis consists of two stages. The first stage

decomposes the acoustic signal into elementary sensory components. The second stage subsequently groups them into streams: a stream consists of sensory components that are likely to originate from the same acoustic event.

In sharp contrast to the human ability to deal with acoustic interference, current speech processing systems perform poorly in noisy environments. For example, existing computational auditory scene analysis (CASA) systems achieve only moderate success in extracting a speech utterance from background noise and other speech utterances. The performance of current automatic speech recognition (ASR) systems degrades a great deal in noisy environments [70].

Many natural sounds, including speech and music, have characteristic pitch, and the variations of pitch give the sensation of melody. The prosodic information of a speech utterance, for instance, is predominantly determined by its pitch contours, and the major feature of a musical tone is its pitch. Indeed, detection of pitch is a fundamental task in auditory processing and has been the subject of extensive computational research.

Many pitch determination algorithms (PDAs) have been specifically designed for detecting a single pitch track in noisy speech. The algorithmic restriction of a single pitch track, however, puts limitations on the background noise in which PDAs are able to perform, and an ideal PDA should perform robustly in a variety of acoustic environments. For example, if the background contains harmonic structures such as background music or voiced speech, more than one pitch is present, and a pitch tracker that can yield multiple pitches at a given frame is required. On the other hand, the

performance of current multipitch detecting systems is very limited in tracking speech mixed with broadband interference (see Chapter 2).

A robust algorithm for multipitch tracking in noisy speech is valuable for many speech-processing systems operating in noisy environments. For example, in a number of

CASA systems, such as the multistage neural oscillator model of Wang and Brown [161], sound segregation is achieved by two processing stages: segment formation and simultaneous organization. In the first stage, the mixtures of speech and noise are processed by a multi-band auditory front-end and auditory segments are constructed by forming connected regions of acoustic energy in the time-frequency map of the front-end output. These segments represent the basic elements of an auditory scene and are computed by analyzing the autocorrelation and cross-channel correlation of filter activity.

At the second stage, simultaneous organization, the simultaneous segments across different frequency channels are grouped together to form auditory streams by analyzing the periodicity cue. A robust multipitch tracking algorithm would be very useful for the second stage; correct grouping depends on correct pitch tracks in periodicity-based grouping methods. Also, reliable multipitch tracking can be used for co-channel speaker identification. Shao and Wang [148] identify the speakers in co-channel speech, a combination of utterances from two speakers, by retaining the speech segments that have only one detected pitch contour and discarding the others. As a result, only speech segments in which one speaker dominates are used for the identification task.

Another important cause of degradation for the speech communication process is reverberation, which occurs almost everywhere except in a wide-open space and an anechoic chamber.¹ When a wave front encounters a boundary, part of the energy is reflected, while the remaining part is absorbed. In an enclosure, such as an office, a classroom, or a church, a wave front typically reflects many times before it attenuates to an undetectable energy level. Consequently, what a listener hears in a reverberant environment is the sum of many time-delayed repetitions of an original sound.

Reverberation has various effects on the speech signal, most significantly smearing its temporal structure, and is thus deleterious to speech perception in a severely reverberant environment [89]. A moderate level of reverberation, however, usually has little effect on speech intelligibility. The same is often not true for speech processing [49]. For example, moderate reverberation significantly decreases the performance of current ASR systems (see Chapter 4).

The reverberation characteristics of a room are usually unknown a priori, and it is desirable to estimate the characteristics from reverberant speech directly. Reverberation has a number of effects on the pitch periods of a speech signal. For example, reverberation elongates the harmonic structure of speech and, therefore, produces elongated pitch tracks. Also, we find that reverberation corrupts the harmonic structure and, thus, reduces pitch strength. The latter phenomenon can be utilized to estimate a most important characteristic of reverberation: the reverberation time.

As noted earlier in this section, two causes of speech degradation exist in practically all listening situations: background noise and room reverberation. Many techniques such as spectral subtraction, adaptive noise cancellation, and comb filtering have been

¹ An anechoic chamber is a room whose walls, ceiling, and floor are covered with special sound-absorbing materials so that no reflection of acoustic energy occurs.

developed to improve the perceived quality of speech degraded by background noise, and they are effective at low to moderate noise levels [70]. Alternatively, CASA systems treat background noise as composed of distinct sound sources and segregate the acoustic waveform into different streams representing different sources; they are therefore capable of segregating speech from noise interference and from other speech utterances (for example, see [23, 69, 161]). These CASA systems, however, are based on single-pitch trackers and are therefore incapable of extracting multiple simultaneous speech utterances from a mixture. On the other hand, existing reverberant speech enhancement systems achieve only partial success in recovering the original speech signal. An effective reverberant speech enhancement system would be invaluable to many applications such as hands-free telecommunication, ASR, speaker recognition, and hearing aid design (see Chapter 4).

1.2 Objectives

In the preceding section, we have described two major causes of speech degradation in realistic acoustic environments: noise interference and room reverberation. Our investigation in this dissertation will center on these two challenges. The ideal goal is to recover the original clean speech in a noisy and reverberant environment. Perhaps a more realistic goal is to achieve a performance level comparable to that of humans. There is a long way to go even to realize this goal as the human ability to deal with acoustic interference and reverberation is truly remarkable. Thus, it would be unrealistic to embark on the entire problem in this dissertation. Instead, we limit our investigation to

three particular aspects of speech processing in noisy and reverberant environments, described as follows.

First, we study multipitch tracking for noisy speech. Three objectives are established for this investigation: to understand how noise interference degrades the periodicity of the speech signal, to identify an appropriate computational paradigm for improving the robustness of multipitch tracking, and finally to propose an algorithm that works reliably.

Next, we study the relationship between reverberation and pitch. Here, two goals are set: to understand the influence of reverberation on speech pitch periods, and to uncover a pitch-based measure for the degree of reverberation.

Finally, we study reverberant speech enhancement using one microphone. We choose to study one-microphone scenarios for the following reasons. First, many real-world applications, such as telecommunication and audio retrieval, demand one-microphone solutions. Second, although binaural listening somewhat improves the intelligibility of reverberant speech for normal listeners, moderately reverberant speech is highly intelligible in monaural listening conditions (see Chapter 3). Hence, how to achieve this monaural capability arouses scientific curiosity. We establish two objectives for this investigation: to understand how reverberation degrades speech quality, and based on this understanding, to come up with a one-microphone reverberant speech enhancement algorithm.

1.3 Organization of Dissertation

The rest of this dissertation is organized as follows. Chapter 2 presents our study on multipitch tracking for noisy speech. After reviewing pitch perception and existing pitch determination algorithms, we propose an algorithm that consists of four main stages. In the first stage, a normalized correlogram is computed within each channel after a mixture of speech and interference is decomposed into a multiband representation by cochlear filtering. Channel/peak selection comprises the second stage, where only weakly corrupted channels and valid peaks are retained. In the third stage, the periodicity information is integrated across different channels using a statistical method. The last stage of the algorithm is to form continuous pitch tracks using a hidden Markov model

(HMM). Model parameters are determined and issues of efficient implementation are discussed. Finally, we suggest a pitch error measure for the multipitch situations and evaluate the algorithm on a database of speech utterances mixed with various types of interference.

Chapter 3 first provides an introduction to the acoustic phenomenon of room reverberation. After discussing its effects on speech perception, we present a measure of reverberation time based on pitch strength. Reverberation corrupts harmonic structure in voiced speech. By estimating the pitch strength of reverberant speech and, therefore, the degree of reverberation, reverberation time can be estimated.

In Chapter 4, we study reverberant speech enhancement using one microphone. After reviewing applications and existing algorithms for reverberant speech enhancement, we propose a two-stage algorithm. In the first stage, an inverse filter is estimated in order to

reduce coloration effects so that signal-to-reverberant energy ratio is increased. The second stage utilizes spectral subtraction to minimize the influence of long-term reverberation. Finally, the algorithm is quantitatively evaluated on a corpus of reverberant speech.

Chapter 5 summarizes the contributions presented in this dissertation, discusses the insights gained from my doctoral research, and outlines future research directions.


CHAPTER 2

A MULTIPITCH TRACKING ALGORITHM FOR NOISY SPEECH

The design of a reliable algorithm for multipitch tracking in a variety of speech and noise conditions is a challenging and yet unresolved issue. In this chapter, we propose a robust algorithm for multipitch tracking of noisy speech. By using a statistical approach, the algorithm can maintain multiple hypotheses with different probabilities, making the model more robust in the presence of acoustic noise. Moreover, the modeling process incorporates the statistics extracted from a corpus of natural sound sources. Finally, a hidden Markov model (HMM) is incorporated for detecting continuous pitch tracks. A database consisting of mixtures of speech and a variety of interfering sounds (white noise, “cocktail party” noise, rock music, etc.) is used to evaluate the proposed algorithm, and very good performance is obtained. In addition, we have carried out quantitative comparison with related algorithms and the results show that our model performs significantly better. A version of this algorithm can be found in [167].

The chapter is organized as follows. Section 2.1 provides an introduction to pitch and

PDAs. Detailed explanations of our model are given in Section 2.2. Section 2.3 describes

evaluation experiments and shows the results. Finally, we discuss related issues and conclude the chapter in Section 2.4.

2.1 Introduction

Pitch is defined as “that attribute of auditory sensation in terms of which sounds may be ordered on a musical scale” [9]. The perception of pitch for a pure tone is related to its frequency. The sensitivity of our ears to changes in pitch is remarkable. For example, the smallest detectable change in pitch frequency of a 1-kHz tone at a moderate sound level is only 2-3 Hz. The fundamental frequency is mainly responsible for the perceived pitch of a stationary complex harmonic sound. However, it is possible to filter out the fundamental frequency component in the harmonic structure and find the pitch to be unaltered. This phenomenon is termed “residue pitch” [112]. Flanagan and Saslow [38] measured the smallest detectable change in fundamental frequency for stable synthetic vowels and found it to be in the range of 0.3% to 0.5%.

A normal listener can often hear a pure tone with a frequency as high as 15 kHz [112], while the lowest frequency which evokes a pitch sensation in a normal sense is approximately 16 Hz; sounds below this frequency are heard by virtue of the distortion they produce after passing through the middle ear [78]. The human voice has a narrower range of fundamental frequencies: from 33 to 3100 Hz for arbitrary human voice utterances (singing voice included) [113]. The fundamental frequency range of conversational speech is even narrower. Different authors have reported different ranges, but they are mostly within the range of 80 to 500 Hz (for example, see [66, 110, 147]).

Although some sounds evoking the sensation of pitch are neither periodic nor quasi-periodic, for speech signals, it is assumed that pitch and fundamental frequency closely correspond to each other [64]. Consequently, pitch detection in speech is equivalent to fundamental frequency detection.

Pitch detection techniques have been heavily utilized in speech technology. It has been demonstrated that, in speech vocoders (voice coders), reliable detection of pitch contours is essential for maintaining the quality and preserving speaker identity [48].

Determination of pitch also can be used for speech enhancement [25, 52]. Moreover, pitch contours and therefore the prosodic information can be extracted and employed in many applications such as speech understanding [95, 123], speaker verification [100,

129], and language identification [153].

Harmonicity is employed as a primary cue in various CASA systems (for example, see, [23, 29, 35, 161, 163]). Reliable pitch contours are critical for segregating harmonic structures, such as voiced speech, from other noise intrusions and for segregating simultaneous speakers. In musical signal processing, monophonic and polyphonic pitch tracking is important in converting the performance of single or multiple instruments to symbolic representation [132, 155].

The task of detecting the periodicity in a speech signal, seemingly simple, is in fact considered a very difficult problem in speech processing (see [65, 105]). The complexity of the problem stems mostly from the variability of a speech signal and the unpredictability of environmental noise. Speech production is a nonstationary process; the spectrum and intensity of a clean speech signal change constantly due to the

articulation of different sounds, and pitch fluctuates about 2-10% between two successive periods [64]. Unvoiced sounds complicate the situation further. Also, speech communication systems distort or band-limit the signal. For example, the telephone network typically limits the bandwidth of speech signal to between 300 and 3400 Hz. In this case, the spectral components corresponding to fundamental frequencies may not appear in telephone speech. Finally, the task of pitch determination in a noisy environment is much more challenging due to the interference from noise intrusions and mutual interference among multiple harmonic structures.

A. Classification of pitch determination algorithms

Numerous PDAs have been proposed. Good reviews of the topic can be seen in Hess

[65] and Hermes [62]. Generally, PDAs can be classified into three categories: time-domain, frequency-domain and time-frequency domain algorithms.

Time-domain PDAs detect the pitch period by analyzing the temporal structure of a quasi-periodic signal. For example, after extensive low-pass filtering, the residue waveform is dominated by the fundamental harmonic and has only two zero-crossings with a defined polarity in every period. By detecting these zero-crossings, pitch periods can be identified [104]. Some PDAs (for example, see [68]) utilize the fact that the vocal tract approximates a lossy linear system and the response of one glottal impulse consists of a sum of exponentially damped oscillations. Therefore, the magnitude of peaks in the beginning of a period is greater than that in the end. The peak patterns and therefore the periodicity can be extracted. The system proposed by Hess [63] obtains the laryngeal excitation function by inverse filtering the speech signal using a vocal tract model. The

excitation function approximates a pulse train and the pitch period is obtained from it.

Comb filters are used by de Cheveigné [27] to cancel the periodic sound of a given period in a signal. The average squared difference function or the average magnitude difference function [142] is calculated to characterize the residue signal energy or average magnitude and its minimum indicates the pitch period. The super-resolution PDA proposed by Medan et al. [105] utilizes the normalized cross-correlation of consecutive waveform segments to characterize the similarity between them. The segments with greatest similarity indicate the pitch period.

Frequency-domain PDAs identify the fundamental frequency by utilizing the harmonic structure in the short-term spectrum. For example, the PDA described by

Martin [99] uses a frequency-domain comb filter to extract and summarize the harmonics of the speech spectrum. The subharmonic summation method proposed by Hermes

[61] employs a logarithmic frequency abscissa. The subharmonics are summarized by repetitively shifting and summing up the speech spectrum. The cepstrum, i.e., the Fourier transform of the logarithm of the power spectrum, is also utilized by a number of PDAs

(for example, [124]) to flatten the spectrum, due to the fact that strong formants tend to overemphasize some harmonics.

Time-frequency domain PDAs perform time-domain analysis on the band-filtered signals obtained via a multi-channel front-end. Many algorithms of this type are biologically inspired and can trace their roots to Licklider’s “duplex” model [93] for pitch perception. He proposed a network of delay lines and coincidence detectors after multi-channel cochlear filtering, arranged along two dimensions: frequency and delay. The

coincidence detectors were realized by running an autocorrelation function in each channel. The periodicity information from each channel would then be integrated to give a single sensation of pitch. In principle, the pitch period is identified by looking across all channels and finding a time delay common for peaks in most channels. However,

Licklider did not provide explicit details of this integration.

Later, Moore [111] proposed a similar schematic model for the perception of the pitch of complex tones. His model consists of five stages: multi-channel cochlear filtering, neural transduction, analysis of spike intervals, combining intervals across frequency channels, and picking the most prominent intervals. However, no computational implementation is provided in his model.

Many instantiations of this schematic model have been proposed since then. Meddis and Hewitt’s model [106] is among the best known. In their model, autocorrelations

(termed “correlogram”) are computed from simulated auditory nerve activity in each channel, and the cross-band integration of periodicity, i.e., the summary autocorrelation function, is explicitly calculated by simple summation of the autocorrelation functions across all channels. Then the pitch period is derived from the position of the largest peak in the summary autocorrelation function. Although autocorrelation had been used to extract periodicity information in each frequency channel before the publication of their paper (for example, [94, 97]), the use of a summary autocorrelation function as a basis for the prediction of pitch perception is a novel contribution. Their model can qualitatively explain a number of psychophysical phenomena such as the missing fundamental and pitch shifts of inharmonic complexes. In many other time-frequency domain algorithms, such

as recent algorithms proposed by Van Immerseel and Martens [157] and Rouat et al.

[143], the approach of integrating periodicity across channels by simple summation is also used.

B. Pitch Tracking in Noisy Speech

Among the numerous PDAs proposed, some have been specifically designed for detecting a single pitch track with voiced/unvoiced decisions in noisy speech. The majority of these algorithms were tested on clean speech and speech mixed with different levels of white noise. For example, Krubsack and Niederjohn [87] proposed a robust system providing pitch and voiced/unvoiced decisions with confidence measures based on analyzing the characteristics of the autocorrelation function of a signal. A method employing autocorrelation on the log spectrum was proposed by Kunieda et al. [88] and was shown to be more robust compared with standard autocorrelation and cepstrum methods.

Some systems also have been tested in other speech and noise conditions. For example, Wang and Seneff [160] proposed a PDA employing the discrete logarithmic

Fourier transform and their algorithm is particularly robust for telephone speech. The system by Rouat et al. [143] was tested on telephone speech, vehicle speech, and speech mixed with white noise. By computing autocorrelation functions using multiple lengths of the analysis windows, Takagi et al. [150] developed a system detecting pitch from multiple time scales. They tested their single-pitch-track PDA on speech mixed with noise, music, and a male voice. In their study, multiple pitches in the mixtures are ignored and a single pitch decision is given.

C. Multipitch PDAs

The tracking of multiple pitches also has been investigated and many multi-pitch

PDAs are based on the same principles of designing PDAs discussed in Section 2.1-A.

A time-domain algorithm for tracking more than one pitch, for instance, is the cancellation model proposed by de Cheveigné and Kawahara [28]. After identifying the first pitch period, their model obtains the residue signal by canceling the waveform responsible for the first pitch period. The second pitch is then estimated from the residue signal. When this method is applied iteratively, more than two pitch periods can be identified. The cancellation model was tested on concurrent vowels and mixtures of artificially generated periodic signals. The system proposed by Chazan et al. [25] finds the optimal combination of two quasi-periodic signals in the maximum likelihood sense using the Expectation-Maximization method, and it is used for co-channel speech separation.

Parsons [130] constructed a table listing all peaks in the signal spectrum. Then the

Schroeder histogram [145] is formed by all peaks in the table and their subharmonics.

The first pitch period is estimated from the largest peak of the histogram, and then harmonics of that period are removed from the histogram. The second histogram is constructed and used to estimate the second pitch period, and so on. An alternative approach is provided by Kwon et al. [90]. A two-dimensional map of the frequency of each spectral peak and the possible harmonic number corresponding to that peak is formed first. For peaks belonging to a quasi-periodic signal, there is a linear relationship between these two parameters. The first period is identified by finding the line passing through the most peaks. The second period is likewise identified by finding the line passing through the most remaining peaks. This system was tested on tracking mixed speech signals.

Fernández-Cid and Casajús-Quirós [132] used multiple FFTs with different window lengths to obtain partials at different time scales. The partials are then grouped to form pitch tracks. They tested their system on polyphonic musical signals.

Assmann and Summerfield [8] extended the summary autocorrelation method to identify the second pitch period by finding the second largest peak in the summary autocorrelation function. In Meddis and Hewitt’s model [107] for concurrent vowel identification, the first pitch period is derived from the largest peak in the summary autocorrelation function. Channels with peaks having the same delay as the first period are then removed. The residue summary autocorrelation function is formed and the second pitch period can be estimated from the residue function. Tolonen and Karjalainen

[155] proposed a computationally simplified version of Meddis and Hewitt’s model and tested their system on mixtures of two single-pitch signals.

D. Hidden Markov Models for Pitch Tracking

Markov models have been widely used to capture many temporal stochastic processes. Notably, hidden Markov models were introduced to the speech recognition field by Baker [11, 12] and Jelinek and colleagues [74, 76] and resulted in a number of successful algorithms.

In several studies, hidden Markov models (HMMs) have been employed to model pitch track continuity. Weintraub [163] utilized a Markov model to determine whether zero, one or two pitches were present. Gu and van Bokhoven [52] used an HMM to group pitch candidates proposed by a bottom-up PDA and form continuous pitch tracks.

Tokuda et al. [154] modeled pitch patterns using an HMM based on a multi-space probability distribution. In both of the studies, pitch is treated as the observation and the

HMM has to be trained.

E. Several issues in pitch tracking

Several critical issues for designing a PDA have been repeatedly studied in the literature and influence the performance significantly. In this section, we review some of them.

Pitch track continuity

Temporal continuity of the pitch track has been employed in many PDAs and was shown to improve performance substantially. For example, a low-pass filter can be used to reduce small detection inaccuracies and a median smoothing filter can be used to reduce gross detection errors, i.e., drastic errors of pitch detection [135]. Another common approach is to form a continuous pitch track from a number of candidates proposed by a bottom-up PDA. For instance, dynamic programming has been used to find an optimal path with smooth contours through a list of pitch estimate candidates (see [146] and

[160]). Postprocessing rules were used by Rouat et al. [143] for forming continuous pitch tracks. The algorithm proposed by Medan et al. [105] searches only the neighborhood of the estimated pitch value of the previous time frame after the onset transients have settled down. Van Immerseel and Martens [157] scored a pitch candidate based on pitch candidates in neighboring time frames. Also, an HMM can be used to model the continuity of pitch tracks, as discussed earlier.

Determination of the number of pitch tracks

A PDA should be able to identify the number of pitch tracks at a given time frame.

For single-track PDAs, this is reduced to the voiced/unvoiced determination problem.

Generally, there are three categories of time frames in speech signals: voiced, unvoiced, and silence. Practically, most single-track PDAs identify a time frame as either voiced or unvoiced with one or no pitch period respectively. A number of parameters, such as energy, amplitude, zero-crossings count, and autocorrelation coefficients, can be used for the decision. Some algorithms separate the voiced/unvoiced decision from the determination of pitch periods. For example, Krubsack and Niederjohn [87] not only made the voiced/unvoiced decision but also gave a confidence measure of the decision by comprehensively analyzing the characteristics of the summary autocorrelation. Many

PDAs integrate the pitch tracker with the decision mechanism. Medan et al. [105] used an adaptive threshold for voiced/unvoiced decision. The algorithm by Van Immerseel and

Martens [157] thresholds the pitch evidence obtained not only from bottom-up analysis but also from the analysis of nearby time frames employing a continuity constraint.

The determination of the number of pitch tracks in multi-pitch tracking systems is more challenging. Mistakes are easily made, for example, by identifying two or no pitch periods in single-pitch time frames of a noisy speech utterance.

2.2 Proposed Algorithm

In this section, we first give an overview of the algorithm and stages of processing.

As shown in Figure 2.1, the proposed algorithm consists of four stages. In the first stage,

the front-end, the signals are filtered into channels by an auditory peripheral model and the envelopes in high-frequency channels are extracted. Then, normalized correlograms

[23, 161] are computed.

Channel and peak selection comprises the second stage. In noisy speech, some channels are significantly corrupted by noise. By selecting the less corrupted channels, the robustness of the system is improved. Rouat et al. [143] suggested this idea and implemented it on mid- and high-frequency channels with center frequencies greater than

1270 Hz (see also [71] in the context of speech recognition). We extend the channel selection idea to low-frequency channels and propose an improved method that applies to all channels. Furthermore, we employ the idea for peak selection as well. Generally speaking, peaks in normalized correlograms indicate periodicity of the signals. However, some peaks give misleading information and should be removed. The details of this stage are given in Section 2.2.1.

Figure 2.1. Schematic diagram of the proposed model. A mixture of speech and interference is processed in four main stages. In the first stage, the normalized correlogram is obtained within each channel after the mixture is decomposed into a multi-channel representation by cochlear filtering. Channel/peak selection is performed in the second stage. In the third stage, the periodicity information is integrated across different channels using a statistical method. Finally, an HMM is utilized to form continuous pitch tracks.

The third stage integrates periodicity information across all channels. Most time-frequency domain PDAs stem from Licklider’s duplex model for pitch perception [93], which extracts periodicity in two steps. First, the contribution of each frequency channel to a pitch hypothesis is calculated. Then, the contributions from all channels are combined into a single score. In the multi-band autocorrelation method, the conventional approach for integrating the periodicity information in a time frame is to sum the

(normalized) autocorrelations across all channels. Though simple, this summation under-utilizes the periodicity information contained in each channel. By studying the statistical relationship between the true pitch periods and the time lags of the selected peaks obtained in the previous stage, we first formulate the probability of a channel supporting a pitch hypothesis and then employ a statistical integration method for producing the conditional probability of observing the signal in a time frame given the hypothesized pitch. The relationship between true pitch periods and time lags of selected peaks is obtained in Section 2.2.2-A and the integration method is described in Section 2.2.2-B.

The last stage of the algorithm is to form continuous pitch tracks using an HMM. In several previous studies, HMMs have been employed to model pitch track continuity.

Weintraub [163] utilized a Markov model to determine whether zero, one or two pitches were present. Gu and van Bokhoven [52] used an HMM to group pitch candidates proposed by a bottom-up PDA and form continuous pitch tracks. Tokuda et al. [154] modeled pitch patterns using an HMM based on a multi-space probability distribution. In these studies, pitch is treated as an observation and both transition and observation probabilities of the HMM must be trained. In our formulation, pitch is explicitly modeled

as hidden states and hence only transition probabilities need to be specified by extracting pitch statistics from natural speech. Finally, optimal pitch tracks are obtained by using the

Viterbi algorithm. This stage is described in Section 2.2.2-C.
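As a rough illustration of how the final stage could recover optimal tracks, the sketch below runs a standard log-domain Viterbi recursion over a discretized set of pitch states. The actual state space (zero, one, or two pitches per frame), transition probabilities, and observation probabilities are those defined later in Section 2.2.2; here they are simply assumed to be available as arrays, and all names are ours.

```python
import numpy as np

def viterbi_pitch_states(log_obs, log_trans, log_prior):
    """Most probable sequence of pitch states under an HMM.

    log_obs   : (T, S) log observation probabilities per frame and state
    log_trans : (S, S) log transition probabilities between states
    log_prior : (S,)   log prior over states in the first frame
    """
    T, S = log_obs.shape
    delta = log_prior + log_obs[0]              # best score ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[prev, cur]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):               # backtrace
        path[t - 1] = back[t, path[t]]
    return path
```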

2.2.1 Multichannel Front-End

The input signal is sampled at a rate of 16 kHz and then passed through a bank of fourth-order gammatone filters [131], which is a standard model for cochlear filtering.

The bandwidth of each filter is set according to its equivalent rectangular bandwidth

(ERB) and we use a bank of 128 gammatone filters with center frequencies equally distributed on the ERB scale between 80 Hz and 5 kHz [29, 161]. After the filtering, the signals are re-aligned according to the delay of each filter.
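For concreteness, the following sketch (our own, not the dissertation's code) computes 128 center frequencies equally spaced on the ERB-rate scale between 80 Hz and 5 kHz, using the Glasberg and Moore ERB-rate formula; the exact ERB formula and gammatone filter implementation used with [131] may differ.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """Glasberg & Moore ERB-rate (ERB number) for a frequency in Hz."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 4.37e-3

def gammatone_center_freqs(n_channels=128, f_lo=80.0, f_hi=5000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erbs = np.linspace(hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi), n_channels)
    return erb_rate_to_hz(erbs)

cfs = gammatone_center_freqs()
# Split into the low-/high-frequency groups at 800 Hz, as in Section 2.2.1.
low_channels = cfs[cfs < 800.0]
```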

The rest of the front-end is similar to that described by Rouat et al. [143]. The channels are classified into two categories. Channels with center frequencies lower than

800 Hz (channels 1-55) are called low-frequency channels. Others are called high-frequency channels (channels 56-128). The Teager energy operator [79] and a low-pass filter are used to extract the envelopes in high-frequency channels. The Teager energy operator is defined as $E_n = s_n^2 - s_{n+1}s_{n-1}$ for a digital signal $s_n$. Then, the signals are low-pass filtered at 800 Hz using a third-order Butterworth filter.
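A minimal sketch of this envelope extraction for one high-frequency channel, assuming the channel output is already available as an array sampled at 16 kHz; the Teager operator follows the definition above, while the zero-phase SciPy Butterworth smoothing is our simplification and may differ from the original implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def teager_energy(s):
    """Teager energy operator: E[n] = s[n]**2 - s[n+1]*s[n-1]."""
    e = np.zeros_like(s, dtype=float)
    e[1:-1] = s[1:-1] ** 2 - s[2:] * s[:-2]
    return e

def high_freq_envelope(channel_out, fs=16000, cutoff=800.0):
    """Envelope of one high-frequency channel: Teager energy followed by a
    third-order low-pass Butterworth filter at 800 Hz (zero-phase here)."""
    b, a = butter(3, cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, teager_energy(channel_out))
```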

In order to remove the distortion due to very low frequencies, the outputs of all channels are further high-pass filtered to 64 Hz (FIR, window length of 16 ms). Then, at a given time step j, which indicates the center step of a 16 ms long time frame, the

normalized correlogram A(c, j, τ) for channel c with a time lag τ is computed by running the following normalized autocorrelation in every 10-ms interval:

$$A(c, j, \tau) = \frac{\displaystyle\sum_{n=-N/2}^{N/2} r(c, j+n)\, r(c, j+n+\tau)}{\sqrt{\displaystyle\sum_{n=-N/2}^{N/2} r^2(c, j+n)}\;\sqrt{\displaystyle\sum_{n=-N/2}^{N/2} r^2(c, j+n+\tau)}}, \qquad (2.1)$$

where r is the filter output. Here, N = 256 corresponds to the 16 ms window size (one frame) and the normalized correlograms are computed for τ = 0, 1, ..., 200.

In low-frequency channels, the normalized correlograms are computed directly from filter outputs, while in high-frequency channels, they are computed from envelopes. Due to their distinct properties, separate methods are employed for channel and peak selection in the two categories of frequency channels.
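The following sketch evaluates Eq. (2.1) for one channel and one frame; the names are ours, and boundary handling is simplified by assuming the signal extends far enough beyond the frame for the largest lag.

```python
import numpy as np

def normalized_correlogram(r, j, N=256, max_lag=200):
    """Normalized correlogram A(c, j, tau) of Eq. (2.1) for one channel.

    r       : filter output (or envelope) of this channel, 1-D array
    j       : center sample of the 16 ms frame (10 ms hop -> j advances by 160)
    N       : frame length in samples (256 at 16 kHz)
    max_lag : largest lag in samples (200, i.e. 12.5 ms)
    """
    n = np.arange(-N // 2, N // 2)              # summation range of Eq. (2.1)
    x = r[j + n]                                # frame centered at j
    denom_x = np.sqrt(np.sum(x ** 2))
    A = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        y = r[j + n + tau]                      # frame shifted by the lag
        A[tau] = np.sum(x * y) / (denom_x * np.sqrt(np.sum(y ** 2)) + 1e-12)
    return A
```

With the 10-ms frame interval used above, successive frames correspond to j values 160 samples apart at the 16 kHz sampling rate.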

A. Low-frequency channels

Figure 2.2(a) and (b) show the normalized correlograms in the low-frequency range for a clean and a noisy channel, respectively. As can be seen, normalized correlograms are range limited (−1 ≤ A(c, j, τ) ≤ 1) and set to 1 at the zero time lag. A value of 1 at a non-zero time lag implies a perfect repetition of the signal with a certain scale factor. For a quasi-periodic signal with period T, the greater the normalized correlogram is at time lag T, the stronger the periodicity of the signal. Therefore, the maximum value of all peaks at non-zero lags indicates the noise level of this channel. If the maximum value is greater than a threshold of 0.945, the channel is considered clean and thus selected. Only the

time lags of peaks in selected channels are included in the set of selected peaks, which is denoted as Φ.
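A sketch of this low-frequency selection rule: a channel is kept only when its largest non-zero-lag peak exceeds 0.945, and the lags of its local peaks are then added to the candidate set Φ. The simple local-maximum peak picker is our own assumption.

```python
import numpy as np

def local_peak_lags(A):
    """Lags of local maxima of a correlogram A(tau), excluding tau = 0."""
    interior = (A[1:-1] > A[:-2]) & (A[1:-1] >= A[2:])
    return np.where(interior)[0] + 1

def select_low_freq_channel(A, threshold=0.945):
    """Return (selected, peak_lags) for one low-frequency channel."""
    lags = local_peak_lags(A)
    if lags.size == 0 or A[lags].max() <= threshold:
        return False, np.array([], dtype=int)   # channel considered corrupted
    return True, lags                           # these lags join the set Phi
```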

B. High-frequency channels

As suggested by Rouat et al. [143], if a channel is not severely corrupted by noise, the original normalized correlogram computed using a window size of 16 ms and the normalized correlogram A′(c, j, τ) using a longer window size of 30 ms should have similar shapes. This is illustrated in Figure 2.2(c) and (d), which show the normalized correlograms of a clean and a noisy channel in the high-frequency range, respectively. For every local peak of A(c, j, τ), we search for the closest local peak in A′(c, j, τ). If the difference between the two corresponding time lags is greater than 2 lag steps, the channel is removed.

Two methods are employed to select peaks in a selected channel. The first method is motivated by the observation that, for a peak suggesting true periodicity in the signal, another peak should be found at around double the time lag of the first one. This second peak is thus checked, and if it is outside θ3 = ±5 lag steps around the predicted double time lag of the first peak, the first peak is removed.


Figure 2.2. Examples of normalized correlograms: (a) normalized correlogram of a clean low-frequency channel, (b) that of a noisy low-frequency channel, (c) that of a clean high-frequency channel, and (d) that of a noisy high-frequency channel. Solid lines represent the correlogram using the original time window of 16 ms and dashed lines represent the correlogram using a longer time window of 30 ms. Dotted lines indicate the maximum height of non-zero peaks. All correlograms are computed from the mixture of two simultaneous utterances of a male and a female speaker. The utterances are "Why were you all weary" and "Don't ask me to carry an oily rag like that."

It is well known that a high-frequency channel responds to multiple harmonics, and the nature of beats and combinational tones dictates that the response envelope fluctuates at the fundamental frequency [59]. Therefore, the occurrence of strong peaks at time lag T and its multiples in a high-frequency channel suggests a fundamental period of T. In the second method of peak selection, if the value of the peak at the smallest non-zero time lag is greater than θ4 = 0.6, all of its multiple peaks are removed. The second method is critical for reducing the errors caused by multiple and sub-multiple pitch peaks in autocorrelation functions.

The selected peaks in all high-frequency channels are added to Φ.
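The two peak-selection steps for a selected high-frequency channel might be sketched as follows (θ3 = 5 lag steps, θ4 = 0.6); the helper name is ours, and applying the θ3 tolerance when discarding multiples is our assumption:

import numpy as np
from scipy.signal import find_peaks

def select_high_peaks(A, theta3=5, theta4=0.6, max_lag=200):
    peaks, _ = find_peaks(A[1 : max_lag + 1])
    lags = list(peaks + 1)
    # Step 1: a peak suggesting true periodicity should have a companion peak
    # near double its lag; drop it otherwise (checked only when 2*lag is in range).
    kept = [l for l in lags
            if 2 * l > max_lag
            or any(abs(m - 2 * l) <= theta3 for m in lags)]
    # Step 2: if the peak at the smallest non-zero lag is strong (> theta4),
    # remove its multiples (tolerance of theta3 lag steps is our assumption).
    if kept and A[kept[0]] > theta4:
        first = kept[0]
        kept = [l for l in kept
                if l == first
                or min(l % first, first - l % first) > theta3]
    return kept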

To demonstrate the effects of channel selection, Figure 2.3(a) shows the summary normalized correlograms of a speech utterance mixed with white noise from all channels, and Figure 2.3(b) from only selected channels. As can be seen, selected channels are much less noisy and their summary correlogram reveals the most prominent peak near the true pitch period whereas the summary correlogram of all channels fails to indicate the true pitch period. To further demonstrate the effects of peak selection, Figure 2.3(c) shows the summary normalized correlogram of a speech utterance from selected channels, and Figure 2.3(d) that from selected channels where removed peaks are excluded. To exclude a removed peak means that the segment of the correlogram between the two adjacent minima surrounding the peak is not considered. As can be seen, without peak selection, the height of the peak that is around double the time lag of the true pitch period is comparable or even slightly greater than the height of the peak that is

around the true pitch period. With peak selection, the height of the peak at the double of the true pitch period has been significantly reduced.

2.2.2 Pitch Tracking

A. Pitch Period and Time Lags of Selected Peaks

The alignment of peaks in the normalized correlograms across different channels signals a pitch period. By studying the difference between the true pitch period and the time lag from the closest selected peaks, we can derive the evidence of the normalized correlogram in a particular channel supporting a pitch period hypothesis.

More specifically, consider channel c. We denote the true pitch period as d, and the relative time lag is defined as

δ = l − d , (2.2) where l denotes the time lag of the closest peak.

The statistics of the relative time lag are extracted from a corpus of 5 clean utterances of male and female speech, which is part of the sound mixture database collected by Cooke [29]. A true pitch track is estimated by running a correlogram-based

PDA on clean speech before mixing, followed by manual correction. The speech signals are passed through the front-end and the channel/peak selection method described in

Section 2.2.1. The statistics are collected for every channel separately from the selected channels across all voiced frames.


Figure 2.3. (a) Summary normalized correlogram of all channels in a time frame from a speech utterance mixed with white noise. The utterance is "Why were you all weary." (b) Summary normalized correlogram of only selected channels in the same time frame as shown in (a). (c) Summary normalized correlogram of selected channels in a time frame from the speech utterance "Don't ask me to carry an oily rag like that." (d) Summary normalized correlogram of selected channels where the removed peaks are excluded in the same time frame as shown in (c). To exclude a removed peak means that the segment of correlogram between the two adjacent minima surrounding the peak is not considered. Dashed lines represent the delay corresponding to the true pitch periods. Dotted lines indicate the peak heights at pitch periods.

As an example, the histogram of relative time lags for channel 22 (center frequency:

264 Hz) is shown in Figure 2.4. As can be seen, the distribution is sharply centered at zero, and can be modeled by a mixture of a Laplacian and a uniform distribution. The

Laplacian represents the majority of channels "supporting" the pitch period and the uniform distribution models the "background noise" channels, whose peaks distribute uniformly in the background. The distribution in channel c is defined as

p_c(\delta) = (1 - q_c)\, L(\delta; \lambda_c) + q_c\, U(\delta; \eta_c) ,    (2.3)

where 0 < qc < 1 is a partition coefficient of the mixture model. The Laplacian

distribution with parameter λc has the formula

L(\delta; \lambda_c) = \frac{1}{2\lambda_c} \exp\!\left(-\frac{|\delta|}{\lambda_c}\right) .

The uniform distribution U(δ; ηc) with range ηc is fixed in a channel according to the possible range of the peak. In a low-frequency channel, multiple peaks may be selected and the average distance between the neighboring peaks is approximately the reciprocal of the center frequency. As a result, we set the length of the range in the uniform distribution to this period, that is, ηc = (−Fs/(2Fc), Fs/(2Fc)), where Fs is the sampling frequency and Fc is the center frequency of channel c. In a high-frequency channel, however, ideally only one peak is selected. Therefore, U(δ; ηc) is the uniform distribution over all possible pitch periods. In other words, it ranges between 2 ms and 12.5 ms, or 32 to 200 lag steps, in our system.
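A sketch of this channel-level density (Equation 2.3), with the uniform range chosen as described above; the constants follow Table 2.1 and Equation 2.4, while the function name and evaluation conventions are ours:

import numpy as np

FS = 16000.0            # sampling frequency in Hz
HF_RANGE = 200 - 32     # length (in lag steps) of the pitch range for HF channels

def relative_lag_pdf(delta, c, fc, a0, a1, qc, low_freq=True):
    # Mixture density p_c(delta) of Equation 2.3; (a0, a1, qc) from Table 2.1
    # and Equation 2.4.  Meaningful only for delta inside the uniform range.
    lam = a0 + a1 * c                              # Laplacian parameter (Eq. 2.4)
    laplace = np.exp(-abs(delta) / lam) / (2.0 * lam)
    width = FS / fc if low_freq else float(HF_RANGE)
    return (1.0 - qc) * laplace + qc / width       # uniform density = 1 / width

# Example: one-pitch statistics for low-frequency channel 22 (fc = 264 Hz)
p = relative_lag_pdf(1.5, c=22, fc=264.0, a0=1.21, a1=-0.011, qc=0.016)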


Figure 2.4. Histogram and estimated distribution of relative time lags for a single pitch in channel 22. The bar graph represents the histogram and the solid line represents the estimated distribution.

The Laplacian distribution parameter λc and the partition parameter qc can be estimated independently for each channel. However, some channels have too few data

points to have accurate estimations. We observe that λc estimated this way decreases slowly as the channel center frequency increases. In order to have more robust and

smooth estimation across all channels, we assume qc to be constant across channels and a linear relationship between the frequency channel index and the Laplacian distribution

parameter λc :

λc = a0 + a1 c .    (2.4)

Model parameters        a0        a1        qc
One pitch (LF)          1.21     −0.011     0.016
One pitch (HF)          2.60     −0.008     0.063
Two pitches (LF)        1.56     −0.018     0.016
Two pitches (HF)        3.58     −0.016     0.108

Table 2.1. Four sets of estimated model parameters.

A maximum likelihood method is utilized to estimate the three parameters a0 , a1 ,

and qc . Due to the different properties for low- and high-frequency channels, the parameters were estimated on each set of channels separately and the resulting parameters are shown in the upper half of Table 2.1, where LF and HF indicate low- and high-frequency channels respectively. The estimated distribution of channel 22 is shown in Figure 2.4. As can be seen, the distribution fits the histogram very well.

Similar statistics are extracted for time frames with two pitch periods. For a selected channel with signals coming from two different harmonic sources, we assume that the energy from one of the sources is dominant. This assumption holds because otherwise, the channel is likely to be noisy and rejected by the selection method in Section 2.2.1. In this case, we define the relative time lags as relative to the pitch period of the dominant source. The statistics are extracted from the mixtures of the 5 speech utterances used earlier. For a particular time frame and channel, the dominant source is decided by

comparing the energy of the two speech utterances before mixing. The probability distribution of relative time lags with two pitch periods is denoted as p′c(δ) and has the same formulation as in Equations 2.3-2.4. Likewise, the parameters are estimated for low- and high-frequency channels separately and shown in the lower half of Table 2.1, where LF and HF again indicate low- and high-frequency channels respectively.

B. Integration of Periodicity Information

As noted in Tokuda et al. [154], the state space of pitch is not a discrete or continuous state space in a conventional sense. Rather, it is a union space Ω consisting of three spaces:

Ω = Ω 0 ∪ Ω1 ∪ Ω 2 , (2.5)

where Ω0, Ω1, Ω2 are zero, one, and two dimensional spaces representing zero, one, and two pitches, respectively. A state in the union space is represented as a pair ω = (o, O), where o ∈ R^O and O ∈ {0, 1, 2} is the space index. This section derives the conditional probability p(Φ | ω) of observing the set of selected peaks given a pitch state ω.

The hypothesis of a single pitch period d is considered first. For a selected channel, the closest selected peak relative to the period d is identified and the relative time lag is

denoted as δ (Φ c , d) , where Φ c is the set of selected peaks in channel c.

The channel conditional probability is derived as

p(\Phi_c \mid \omega_1) =
\begin{cases}
p_c(\delta(\Phi_c, d)), & \text{if channel } c \text{ is selected} \\
q_1(c)\, U(0; \eta_c), & \text{otherwise}
\end{cases}     (2.6)

where ω1 = (d, 1) ∈ Ω1 and q1(c) is the parameter qc estimated from one-pitch frames as shown in Table 2.1. Note that, if a channel has not been selected, the probability of background noise is assigned.

The channel conditional probability can be easily combined into the frame conditional probability if the mutual independence of the responses of all channels is assumed. However, the responses are usually correlated due to the wideband nature of speech signals and the independence assumption produces very "spiky" distributions.

This is known as the probability overshoot phenomenon and can be partially remedied by smoothing the combined probability estimates by taking a root greater than 1 [55].

Hence, we propose the following formula with a smoothing operation to combine the information across the channels:

p(\Phi \mid \omega_1) = \kappa \sqrt[b]{\prod_{c=1}^{C} p(\Phi_c \mid \omega_1)} ,     (2.7)

where C = 128 is the number of all channels, the parameter b = 6 is the smoothing factor (see Section 2.2.2-D for more discussion), and κ is a normalization constant for probability definition.
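For illustration, Equations 2.6 and 2.7 can be combined as in the following sketch (the log domain is avoided here for clarity; see Section 2.2.2-E). The data structures and names are ours:

def frame_prob_one_pitch(d, peak_sets, pdf, q1, eta, b=6.0):
    # Un-normalized p(Phi | omega_1) for a single pitch period hypothesis d
    # (Equations 2.6 and 2.7, without the constant kappa).
    # peak_sets[c]: selected peak lags in channel c, or None if rejected;
    # pdf(c, delta): evaluates p_c(delta); q1[c] / eta[c]: background term.
    prob = 1.0
    for c, peaks in enumerate(peak_sets):
        if peaks is not None and len(peaks) > 0:
            delta = min(peaks, key=lambda l: abs(l - d)) - d   # closest peak
            prob *= pdf(c, delta)
        else:
            prob *= q1[c] / eta[c]                             # background noise term
    return prob ** (1.0 / b)                                   # smoothing root of Eq. 2.7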

Then we consider the hypothesis of two pitch periods, d1 and d 2 , corresponding to

two different harmonic sources. Let d1 correspond to the stronger source. The channels

are labeled as the d1 source if the relative time lags are small. More specifically, channel

c belongs to the d1 source if |δ(Φc, d1)| < βλc, where β = 5.0 and λc denotes the

Laplacian parameter for channel c calculated from Equation 2.4. The combined probability is defined as

p_2(\Phi, d_1, d_2) = \sqrt[b]{\prod_{c=1}^{C} p_2'(\Phi_c, d_1, d_2)} ,     (2.8)

where

p_2'(\Phi_c, d_1, d_2) =
\begin{cases}
q_2(c)\, U(0; \eta_c), & \text{if channel } c \text{ is not selected} \\
p_c'(\delta(\Phi_c, d_1)), & \text{if channel } c \text{ belongs to } d_1 \\
\max\!\big(p_c'(\delta(\Phi_c, d_1)),\, p_c'(\delta(\Phi_c, d_2))\big), & \text{otherwise}
\end{cases}     (2.9)

where q2(c) denotes the parameter qc estimated from two-pitch frames as shown in Table 2.1.

The conditional probability for the time frame is the larger of assuming either d1 or

d 2 to be the stronger source:

p(\Phi \mid \omega_2) = \kappa\, \alpha_2 \max\!\big[\, p_2(\Phi, d_1, d_2),\; p_2(\Phi, d_2, d_1) \,\big] ,     (2.10)

where ω2 = ((d1, d2), 2) ∈ Ω2 and α2 = 1.7 × 10^−5.

Finally, we fix the probability of zero pitches occurring,

p(\Phi \mid \omega_0) = \kappa\, \alpha_0 ,     (2.11)

where ω0 ∈ Ω0 and α0 = 2.3 × 10^−33.
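Equations 2.8-2.11 can be assembled analogously; a hedged sketch (un-normalized, with illustrative names, pdf2 standing for p′c and lam[c] for λc of Equation 2.4):

def frame_prob_two_pitch(d1, d2, peak_sets, pdf2, q2, eta, lam, beta=5.0, b=6.0):
    # Un-normalized p_2(Phi, d1, d2) of Equations 2.8-2.9, with d1 taken as
    # the putative stronger source.
    prob = 1.0
    for c, peaks in enumerate(peak_sets):
        if peaks is None or len(peaks) == 0:
            prob *= q2[c] / eta[c]                            # background term
            continue
        delta1 = min(peaks, key=lambda l: abs(l - d1)) - d1
        delta2 = min(peaks, key=lambda l: abs(l - d2)) - d2
        if abs(delta1) < beta * lam[c]:                       # channel belongs to d1
            prob *= pdf2(c, delta1)
        else:
            prob *= max(pdf2(c, delta1), pdf2(c, delta2))
    return prob ** (1.0 / b)

ALPHA_2, ALPHA_0 = 1.7e-5, 2.3e-33
# Equation 2.10: take the better of the two orderings of (d1, d2), scaled by alpha_2;
# Equation 2.11 fixes the zero-pitch hypothesis at alpha_0 (both up to kappa).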

In many time-frequency domain PDAs (e.g., [107]), the score of a pitch hypothesis is computed by weighting the contributions of frequency channels according to, say, energy

levels. Our formulation treats every frequency channel equally. Several considerations are in order. First, in principle, the periodicity information extracted from different channels should be integrated so that greater weights are assigned to channels providing more reliable information. For speech mixed with a moderate level of interference, the channels with higher energy tend to indicate more reliable periodicity information.

However, for speech mixed with comparable or higher levels of interference, high-energy channels can be significantly corrupted and give unreliable periodicity information. The channel selection method described in Section 2.2.1 serves to choose channels that are not strongly corrupted by noise. As a result, selected channels should provide relatively reliable information on periodicity, and hence we allow selected channels to contribute equally to pitch estimation. Second, the source with dominant energy tends to mask other weaker sources. Our integration scheme maintains the sensitivity of pitch detection to weaker sources.

C. Pitch Tracking using an HMM

We propose to use a hidden Markov model for approximating the generation process of harmonic structure in natural environments. The model is illustrated in Figure 2.5. In each time frame, the hidden node indicates the pitch state space, and the observation node the observed signal. The temporal links between neighboring hidden nodes represent the probabilistic pitch dynamics. The link between a hidden node and an observation node describes observation probabilities, which have been formulated in the previous section

(bottom-up pitch estimation).


Figure 2.5. Schematic diagram of an HMM for forming continuous pitch tracks. The hidden nodes represent possible pitch states in each time frame. The observation nodes represent the set of selected peaks in each frame. The temporal links in the Markov model represent the probabilistic pitch dynamics. The link between a hidden node and an observation node is called observation probability.

Pitch dynamics have two aspects. The first is the dynamics of a continuous pitch track. The statistics of the changes of the pitch periods in consecutive time frames can be extracted from the true pitch contours of 5 speech utterances extracted earlier and their histogram is shown in Figure 2.6. This is once again indicative of a Laplacian distribution. Thus, we model it by the following Laplacian distribution

p(\Delta) = \frac{1}{2\lambda_p} \exp\!\left(-\frac{|\Delta - m_p|}{\lambda_p}\right) ,     (2.12)

where Δ represents pitch period changes, and mp and λp are distribution parameters.

Using a maximum likelihood method, we have estimated that λ p = 2.4 lag steps and

mp = 0.4 lag steps. A positive mp indicates that, in natural speech, pitch periods have a tendency to increase; conversely, pitch frequencies tend to decrease. This is consistent with the declination phenomenon [125] that in natural speech pitch frequencies slowly drift down where no abrupt change in pitch occurs, which has been observed in many languages including English. The distribution is also shown in

Figure 2.6 and it fits the histogram very well.

The second aspect concerns jump probabilities between the state spaces of zero pitch, one pitch, and two pitches. We assume that a single speech utterance is present in the mixtures approximately half of the time and two speech utterances are present in the remaining time. The jump probabilities are estimated from the pitch tracks of the same 5 speech utterances analyzed above and the values are given in Table 2.2.

Finally, the state spaces of one and two pitches are discretized and the standard Viterbi algorithm [75] is employed for finding the optimal sequence of states. Note that the sequence can be a mixture of zero-, one-, or two-pitch states.
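For illustration, the transition side of the HMM (Equation 2.12 and Table 2.2) might be sketched as follows; pairing continuing pitch tracks by sorted period and treating them independently are our simplifications:

import numpy as np

LAMBDA_P, M_P = 2.4, 0.4                            # Equation 2.12, in lag steps
SPACE_TRANS = np.array([[0.9250, 0.0750, 0.0000],   # rows/columns: Omega_0, Omega_1, Omega_2
                        [0.0079, 0.9737, 0.0184],
                        [0.0000, 0.0323, 0.9677]])

def pitch_change_prob(delta):
    # Laplacian model of the pitch period change between consecutive frames
    return np.exp(-abs(delta - M_P) / LAMBDA_P) / (2.0 * LAMBDA_P)

def transition_prob(prev, state):
    # Each state is (space_index, tuple_of_pitch_periods)
    p = SPACE_TRANS[prev[0], state[0]]
    for d_prev, d in zip(sorted(prev[1]), sorted(state[1])):
        p *= pitch_change_prob(d - d_prev)
    return p

The standard Viterbi recursion then combines these transition scores with the observation probabilities of the previous section to find the optimal state sequence.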


Figure 2.6. Histogram and estimated distribution of pitch period changes in consecutive time frames. The bar graph represents the histogram and the solid line represents the estimated distribution.

D. Parameter Determination

The frequency separating the low- and high-frequency channels is chosen according to several criteria. First, the separation frequency should be greater than possible pitch frequencies of speech, and the bandwidth of any high-frequency channel should be large enough to contain at least two harmonics of a certain harmonic structure so that amplitude modulation due to beating at the fundamental frequency is possible. Second, as long as such envelopes can be extracted, the normalized correlograms calculated from the envelopes give better indication of pitch periods than those calculated from the filtered signals directly. That is because envelope correlograms reveal pitch periods around the first peaks, whereas direct correlograms have many peaks in the range of possible pitch periods. Therefore, the separation frequency should be as low as possible so long as reliable envelopes can be extracted. By considering these criteria, we choose the separation frequency of 800 Hz.

          → Ω0       → Ω1       → Ω2
Ω0        0.9250     0.0750     0.0000
Ω1        0.0079     0.9737     0.0184
Ω2        0.0000     0.0323     0.9677

Table 2.2. Transition probabilities between state spaces of pitch.

In our model, there are a total of eight free parameters: four for channel/peak selection and four for bottom-up estimation of observation probability (their values are

given). The parameters θ1, θ2, θ3, and θ4 are introduced in the channel/peak selection method and they are chosen by examining the statistics from sample utterances mixed with interferences. The true pitch tracks are known for these mixtures. In every channel, the closest correlogram peak relative to the true pitch period is identified. If this peak is off from the true pitch period by more than 7 lag steps, we label this channel "noisy". Otherwise, the channel is labeled "clean". Parameter θ1 is selected so that more than half of the noisy channels in low-frequency channels are rejected. Parameters θ2 and θ3 are chosen so that the majority of the noisy channels are rejected while minimizing the chance that a clean channel is rejected. Finally, parameter θ4 is chosen so that, for almost all selected channels in high-frequency channels, the multiple peaks are removed.

Parameters β, α0, α2, and b are employed for bottom-up estimation of observation probability. Parameter β is used to specify the criterion for identifying the channels that belong to the dominant pitch period. It is chosen so that, in clean speech samples, almost all selected channels belong to the true pitch periods. Parameters α0 and α2 are employed to tune the relative strengths of the hypotheses of zero, one, or two pitch periods. The smoothing factor b can be understood as tuning the relative influence of bottom-up and top-down processes. α0, α2, and b are optimized with respect to the combined total detection error for the training mixtures. We find that b can be chosen in a considerable range without influencing the outcome.

We note that in the preliminary version of this model [166], a different set of parameters has been employed and good results were obtained. In fact, there is a considerable range of appropriate values for these parameters, and overall system performance is not very sensitive to the specific parameter values used.

E. Efficient Implementation

The computational expense of the proposed algorithm can be reduced significantly by employing several efficient implementations. First, a logarithm can be taken on both sides of Equations 2.6-2.11 and in the Viterbi algorithm [75]. Instead of computing multiplications and roots, which are time-consuming, only summations and divisions need to be calculated. Moreover, the number of pitch states is quite large and checking all of them using the Viterbi algorithm requires an extensive use of computational resources.

Several techniques have been proposed in the literature to alleviate the computational load while achieving almost identical results [75].

1) Pruning has been used to reduce the number of pitch states to be searched for finding the current candidates of a pitch state sequence. Since pitch tracks are continuous, the differences of pitch periods in consecutive time frames in a sequence can be restricted to a reasonable range. Therefore, only pitch periods within the range need to be searched.

2) Beam search has been employed to reduce the total number of pitch state sequences considered in evaluation. In every time frame, only a limited number of the most probable pitch state sequences are maintained and considered in the next frame.

3) The highest computational load comes from searching the pitch states corresponding to two pitch periods. In order to reduce the search effort, we only check the pitch periods in the neighborhood of the local peaks of bottom-up observation probabilities.

By using the above efficient implementation techniques, we find that the computational load of our algorithm is drastically reduced. Meanwhile, our experiments show that the results from the original formulation and those from the efficient implementation have negligible differences.
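A log-domain Viterbi step with the pruning and beam-search ideas above might look like this sketch; the pruning window and beam width shown are illustrative values, not parameters reported in the text:

def viterbi_step(beam, candidates, log_obs, log_trans, max_jump=20, beam_width=50):
    # beam: dict mapping a previous state to (log score, path);
    # candidates: pitch states considered in the current frame.
    new_beam = {}
    for state in candidates:
        best = None
        for prev, (score, path) in beam.items():
            # Pruning: pitch periods of a continuous track may only change
            # within a reasonable range between consecutive frames.
            if period_jump(prev, state) > max_jump:
                continue
            s = score + log_trans(prev, state)
            if best is None or s > best[0]:
                best = (s, path)
        if best is not None:
            new_beam[state] = (best[0] + log_obs(state), best[1] + [state])
    # Beam search: keep only the most probable partial state sequences.
    ranked = sorted(new_beam.items(), key=lambda kv: kv[1][0], reverse=True)
    return dict(ranked[:beam_width])

def period_jump(prev, state):
    # Largest change of any paired pitch period between two states
    # (0 if either state carries no pitch); a helper we assume for pruning.
    if not prev[1] or not state[1]:
        return 0
    return max(abs(a - b) for a, b in zip(sorted(prev[1]), sorted(state[1])))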

2.3 Results and Comparisons

A corpus of 100 mixtures of speech and interference [29], commonly used for CASA research [23, 34, 161], has been used for system evaluation and model parameter estimation. The mixtures are obtained by mixing 10 voiced utterances with 10 interference signals representing a variety of acoustic sounds. As shown in Table 2.3, the interferences are further classified into three categories: 1) those with no pitch, 2) those with some pitch qualities, 3) other speech. Five speech utterances and their mixtures, which represent approximately half of the corpus, have been employed for model parameter estimation. The other half of the corpus is used for performance evaluation.

Evaluating our algorithm (or any algorithm for that matter) requires a reference pitch contour corresponding to true pitch. However, such a reference is probably impossible to obtain [64], even with instrument support [86]. Therefore, our method of obtaining reference pitch contours starts from pitch tracks computed from clean speech and is followed by a manual correction as mentioned before. Reference pitch contours obtained this way are far more accurate than those without manual correction, or those obtained from noisy speech.

              Interference signals
Category 1    White noise and noise bursts
Category 2    1 kHz tone, "cocktail party" noise, rock music, siren, and trill telephone
Category 3    Female utterance 1, male utterance and female utterance 2

Table 2.3. Categorization of interference signals.

To measure progress, it is important to provide a quantitative assessment of PDA performance. The guidelines for the performance evaluation of PDAs with single pitch track were established by Rabiner et al. [134]. However, there are no generally accepted guidelines for multiple pitch periods that are simultaneously present. Extending the classical guidelines, we measure pitch determination errors separately for the three interference categories documented in Table 2.3 because of their distinct pitch properties.

We denote Ex→y as the error rate of time frames where x pitch points are misclassified as y pitch points. The pitch frequency deviation Δf is calculated by

\Delta f = \frac{\mathrm{PDA}_{\mathrm{output}} - f_0}{f_0} \times 100\% ,     (2.13)

where PDAoutput is the closest pitch frequency estimated by the PDA to be evaluated and

f 0 is the reference pitch frequency. Note that PDAoutput may yield more than one pitch

point for a particular time frame. The gross detection error rate EGross is defined as the

percentage of time frames where ∆f > 20% and the fine detection error E Fine is defined

as the average frequency deviation from the reference pitch contour for those time frames without gross detection errors.

For speech signals mixed with Category 1 interferences, a total gross error is indicated by

ETotal = E0→1 + E0→2 + E1→0 + EGross . (2.14)

Since the main interest in many contexts is to detect the pitch contours of speech

utterances, for Category 2 mixtures only E1→0 is measured and the total gross error ETotal

is indicated by the sum of E1→0 and EGross. Category 3 interferences are also speech utterances and therefore all possible decision errors should be considered. For time frames with a single reference pitch, gross and fine determination errors are defined as earlier. For time frames with two reference pitches, a gross error occurs if either one exceeds the 20% limit, and the fine error is the sum of the fine errors for the two reference pitch periods. For many applications, the accuracy with which the dominating pitch is determined is of primary interest. Therefore, the total gross error E^Dom_Gross and the fine error E^Dom_Fine for dominating pitch periods are also measured.
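The frame-level measures of Equation 2.13 and the 20% criterion can be computed as in the following sketch; taking the magnitude of the deviation and the list-based bookkeeping are our choices:

def frequency_deviation(estimates_hz, f0_hz):
    # Equation 2.13: deviation of the estimated pitch closest to the reference f0
    closest = min(estimates_hz, key=lambda f: abs(f - f0_hz))
    return (closest - f0_hz) / f0_hz * 100.0

def gross_and_fine_errors(all_estimates, all_refs, limit=20.0):
    # Gross error rate: percentage of frames with |deviation| > limit;
    # fine error: mean |deviation| over the remaining frames.
    devs = [abs(frequency_deviation(est, f0))
            for est, f0 in zip(all_estimates, all_refs) if est and f0]
    gross = [d for d in devs if d > limit]
    fine = [d for d in devs if d <= limit]
    gross_rate = 100.0 * len(gross) / len(devs) if devs else 0.0
    fine_error = sum(fine) / len(fine) if fine else 0.0
    return gross_rate, fine_error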

Our results show that the proposed algorithm reliably tracks pitch points in various situations, such as one speaker, speech mixed with other acoustic sources, and two speakers. For instance, Figure 2.7(a) shows the time-frequency energy plot for a mixture of two simultaneous utterances (a male speaker and a female speaker with signal-to-signal energy ratio = 9 dB) and Figure 2.7(b) shows the result of tracking the mixture. As another example, Figure 2.8(a) shows the time-frequency energy plot for a mixture of a male utterance and white noise (signal-to-noise ratio = −2 dB). Note here that the white noise is very strong. Figure 2.8(b) shows the result of tracking the signal. In both cases, our algorithm robustly tracks either one or two pitches. Systematic performance of our algorithm for the three interference categories is given in Tables 2.4-2.6 respectively. As can be seen, our algorithm achieves total gross errors of 7.17% and 3.50% for Category 1 and 2 mixtures respectively. For Category 3 interferences, a total gross error rate of

0.93% for the dominating pitch is obtained.


Figure 2.7. (a) Time-frequency energy plot for a mixture of two simultaneous utterances of a male and a female speaker. The utterances are "Why were you all weary" and "Don't ask me to carry an oily rag like that." The brightness in a time-frequency cell indicates the energy of the corresponding gammatone filter output in the corresponding time frame. For better display, energy is plotted as the square of the logarithm. (b) Result of tracking the mixture. The solid lines indicate the true pitch tracks. The '×' and 'ο' tracks represent the pitch tracks estimated by our algorithm.


Figure 2.8. (a) Time-frequency energy plot for a mixture of a male utterance and white noise. The utterance is "Why were you all weary." The brightness in a time-frequency cell indicates the energy of the corresponding gammatone filter output in the corresponding time frame. For better display, energy is plotted as the square of the logarithm. (b) Result of tracking the mixture. The solid lines indicate the true pitch tracks. The '×' tracks represent the pitch tracks estimated by our algorithm.

                E0→1    E0→2    E1→0    E1→2    EGross   ETotal   EFine
Proposed PDA    0.36    Nil     6.81    Nil     Nil      7.17     0.43
TK PDA          1.96    0.05    23.3    9.10    2.38     27.66    1.76
GB PDA          0.26    Nil     49.5    Nil     0.36     50.10    1.06
R-GB PDA        1.56    Nil     10.81   Nil     2.13     14.50    0.78

Table 2.4. Error rates (in percentage) for Category 1 interference.

To put the above performance into perspective, we compare with two recent multipitch detection algorithms proposed by Tolonen and Karjalainen [155] and Gu and van Bokhoven [52]. In the Tolonen and Karjalainen model, the signal is first passed through a pre-whitening filter and then divided into two channels, below and above 1000

Hz. Generalized autocorrelations are computed in the low-frequency channel directly and those of the envelope are computed in the high-frequency channel. Then, enhanced summary autocorrelation functions are generated and the decisions on the number of pitch points as well as their pitch periods are based on the most prominent and the second most prominent peaks of such functions. We choose this study for comparison because it is a recent time-frequency domain algorithm based on a similar correlogram representation. We refer to this PDA as the TK PDA.

Gu and van Bokhoven's multipitch PDA is chosen for comparison because it is an

HMM-based algorithm, and an HMM is also used in our system. The algorithm can be separated into two parts. The first part is a pseudo perceptual estimator [51] that provides coarse pitch candidates by analyzing the envelopes and carrier frequencies from the

responses of a multi-channel front-end. Such pitch candidates are then fed into an HMM-based pitch contour estimator [51] for forming continuous pitch tracks. Two HMMs are trained for female and male speech utterances separately and are capable of tracking a single pitch track without voiced/unvoiced decisions at a time. In order to have voicing decisions, we add one more state representing unvoiced time frames to their original 3-state HMM. Knowing the number and types of the speech utterances present in a mixture in advance (e.g., a mixture of a male and a female utterance), we can find the two pitch tracks by applying the male and female HMMs separately. For a mixture of two male utterances, after the first male pitch track is obtained, the pitch track is subtracted from the pitch candidates and the second track is identified by applying the male HMM again. We refer to this PDA as the GB PDA.

                E1→0    EGross   ETotal   EFine
Proposed PDA    3.18    0.32     3.50     0.44
TK PDA          7.70    4.53     12.23    1.41
GB PDA          22.10   2.10     24.21    2.20
R-GB PDA        5.94    4.48     10.04    0.70

Table 2.5. Error rates (in percentage) for Category 2 interference.

                E0→1   E0→2   E1→0   E1→2   E2→0   E2→1    EGross   EFine   E^Dom_Gross   E^Dom_Fine
Proposed PDA    0.68   Nil    0.88   0.16   Nil    27.08   0.21     0.33    0.93          0.21
TK PDA          0.47   0.10   2.64   4.55   1.19   26.84   2.33     0.99    4.28          0.69
GB PDA          0.41   Nil    2.65   4.20   4.20   34.54   3.89     2.04    7.70          1.34
R-GB PDA        0.57   Nil    2.28   2.78   0.57   11.80   9.09     2.11    3.63          0.53

Table 2.6. Error rates (in percentage) for Category 3 interference.

Our experiments show that sometimes the GB PDA provides poor results, especially for speech mixed with a significant amount of white noise. Part of the problem is caused by its bottom-up pitch estimator, which is not as good as ours. To directly compare our

HMM-based pitch track estimator with their HMM method, we substitute our bottom-up pitch estimator for theirs but still use their HMM model for forming continuous pitch tracks. The revised algorithm is referred to as the R-GB PDA.

Figure 2.9 shows the multipitch tracking results using the TK, the GB, and the R-GB

PDAs, respectively, from the same mixture of Figure 2.7. As can be seen, our algorithm performs significantly better than all those algorithms. Figure 2.10(a)-(c) give the results of extracting pitch tracks from the same mixture of Figure 2.8 using the TK, the GB, and the R-GB PDAs, respectively. As can be seen, our algorithm has much less detection error.


Figure 2.9. Results of tracking the same signal as in Figure 2.7 using (a) the TK PDA, (b) the GB PDA, and (c) the R-GB PDA. The solid lines indicate the true pitch tracks. The '×' and 'ο' tracks represent the estimated pitch tracks.

Quantitative comparisons are shown in Tables 2.4-2.6. For Category 1 interferences, our algorithm has a total gross error of 7.17% while others have errors varying from

14.50% to 50.10%. The total gross error for Category 2 mixtures is 3.50% for ours, and for others it ranges from 10.04% to 24.21%. Our algorithm yields the total gross error rate of 0.93% for the dominating pitch. The corresponding error rates for the others range from 3.63% to 7.70%.

Note in Table 2.6 that the error rate E2→1 of the R-GB PDA is considerably lower than ours. This, however, does not imply the R-GB PDA outperforms our algorithm. As shown in Figure 2.9(c), the R-GB PDA tends to mistake harmonics of the first pitch period as the second pitch period. As a result, the overall performance is much worse.

Finally, we compare our algorithm with a single-pitch determination algorithm for noisy speech proposed by Rouat et al. [143]2. Figure 2.10(d) shows the result of tracking the same mixture as in Figure 2.8. As can be seen, our algorithm yields less error. We do not compare with this PDA quantitatively because it is designed as a single-pitch tracker and cannot be applied to Category 2 and 3 interferences.

In summary, these results show that our algorithm outperforms the other algorithms significantly in almost all the error measures.

2 Results provided by J. Rouat.


Figure 2.10. Result of tracking the same signal as in Figure 2.8 using (a) the TK PDA, (b) the GB PDA, (c) the R-GB PDA, and (d) the PDA proposed by Rouat et al. [143]. The solid lines indicate the true pitch tracks. The '×' and 'ο' tracks represent the estimated pitch tracks. In subplot (d), time frames with negative pitch period estimates indicate the decision of voiced with unknown period.

2.4 Discussion and Conclusion

A common problem in PDAs is harmonic and subharmonic errors, in which the harmonics or subharmonics of a pitch are detected instead of the real pitch itself. Several techniques have been proposed to alleviate this problem. For example, a number of algorithms check sub-multiples of the time lag for the highest peak of the summary autocorrelations to ensure the detection of the real pitch period (for example, see [87]).

Shimamura and Kobayashi [149] proposed a weighted autocorrelation method discounting the periodicity score of the multiples of a potential pitch period. The system by Rouat et al. [143] checks the sub-multiples of the two largest peaks in normalized summary autocorrelations and further utilizes the continuity constraint of pitch tracks to reduce these errors. Liu and Lin [96] compensate two pitch measures to reduce the scores of harmonic and subharmonic pitch periods. Medan et al. [105] disqualify such candidates by checking the normalized autocorrelation using a larger time window and pick the pitch candidate that exceeds a certain threshold and has the smallest pitch period.

In our time-frequency domain PDA, several measures contribute to alleviating these errors. First, the probabilities of subharmonic pitch periods are significantly reduced by selecting only the first correlogram peaks calculated from envelopes in high-frequency channels. Second, noisy channels tend to have random peak positions, which can reinforce harmonics or subharmonics of the real pitch. By eliminating these channels using channel selection, harmonic and subharmonic errors are greatly reduced. Third, the HMM for forming continuous pitch tracks contributes to decreasing these errors.

The HMM in our model plays a similar role (utilizing pitch track continuity) as post-processing in many PDAs. Some algorithms, such as [143], employ a number of post-processing rules. These ad hoc rules introduce new free parameters. Although there are parameters in our HMM, they are learned from training samples. Also, in many algorithms (for example, see [157]), pitch tracking only considers several candidates proposed by the bottom-up algorithm and composed of peaks in bottom-up pitch scores.

Our tracking mechanism considers all possible pitch hypotheses and therefore performs in a wider range of conditions.

There are several major differences in forming continuous pitch tracks between our

HMM model and that of Gu and van Bokhoven [52]. Their approach is essentially for single pitch tracking while ours is for multipitch tracking. Theirs uses two different

HMMs for modeling male and female speech while ours uses the same model. Their model needs to know the number and types of speech utterances in advance, and has difficulty tracking a mixture of two utterances of the same type (e.g. two male utterances). Our model does not have these difficulties.

Many models estimate multiple pitch periods by directly extending single-pitch detection methods, and they are called the one-dimensional paradigm. A common one-dimensional representation is a summary autocorrelation. Multiple pitch periods can be extracted by identifying the largest peak, the second largest peak, and so on. However, this approach is not very effective in a noisy environment, because harmonic structures often interact with each other. Cheveigné and Kawahara [28] have pointed out that a multi-step "estimate-cancel-estimate" approach is more effective. Their pitch perception model cancels the first harmonic structure using an initial estimate of the pitch, and the second pitch is estimated from the comb-filtered residue. Also, Meddis and Hewitt's [107] model of concurrent vowel separation uses a similar paradigm. A multi-dimensional paradigm is used in our model, where the scores of single and combined pitch periods are explicitly given. Interactions among the harmonic structures are formulated explicitly, and our results show that this multi-dimensional paradigm is effective for dealing with noise intrusions and mutual interference among multiple harmonic structures.

As stated previously, approximately half of the mixture database is employed for estimating (learning) relative time lag distributions in a channel (see Figure 2.4) and pitch dynamics (see Figure 2.6), while the other half is utilized for evaluation. It is worth emphasizing that such statistical estimations reflect general speech characteristics, not specific to either speaker or utterance. Hence, estimated distributions and parameters are expected to generalize broadly, and this is confirmed by our results. We have also tested our system on different kinds of utterance and different speakers, including digit strings from TIDigit [92], after the system is trained, and we observe equally good performance.

The proposed model can be extended to track more than two pitch periods. To do so, the union space described in Section 2.2.2-B would be augmented to include more than three pitch spaces. The conditional probability for the hypotheses of more than two pitch periods may be formulated using the same principles as for formulating up to two pitch periods.

There are two aspects of our proposed algorithm: multipitch tracking and robustness.

Rather than considering these two aspects separately, we treat them as a single problem.

As mentioned in the Introduction, the ability to track multiple pitch periods increases the robustness of an algorithm by allowing it to deal with other voiced interferences.

Conversely, the ability to operate robustly improves the reliability of detecting the pitch periods of weaker sources. More specifically, the channel/peak selection method mainly contributes to the robustness of the system. The cross-channel integration method and the

HMM for pitch tracking are formulated for detecting multiple pitch periods, although considerations are also given to the robustness of our system.

In summary, we have shown that our algorithm performs reliably for tracking single and double pitch tracks in a noisy acoustic environment. A combination of several novel ideas enables the algorithm to perform well. First, an improved channel and peak selection method effectively removes corrupted channels and invalid peaks. Second, a statistical integration method utilizes the periodicity information across different channels. Finally, an HMM realizes the pitch continuity constraint.


CHAPTER 3

ROOM REVERBERATION AND A PITCH-BASED MEASURE FOR REVERBERATION TIME

Reverberation is a complex acoustical phenomenon, and it degrades speech intelligibility and quality. In this chapter, we first review room reverberation and its effects on speech perception. Then, with an understanding of the deleterious effects of reverberation on harmonic structure in voiced speech, we propose a measure for the reverberation time.

The chapter is organized as follows. The following section provides the general background of room reverberation. Section 3.2 reviews the effects of room reverberation on speech perception. In Section 3.3, we introduce a pitch-based measure for reverberation time. Finally, Section 3.4 concludes this chapter.

3.1 Room Reverberation

A sound field in a natural environment can be considered as a superposition of numerous simple sound waves, such as plane waves. A simple plane wave propagates in

air along a single direction with constant speed, and is governed by the following partial differential equation [89]:

v^2\, \frac{\partial^2 p}{\partial x^2} = \frac{\partial^2 p}{\partial t^2} ,     (3.1)

where we assume the plane wave propagates along the x-direction of the Cartesian coordinate system, and p, t, and v denote pressure, time, and the velocity of sound, respectively. Note that the velocity of sound in air depends on temperature as described in the following equation:

v = (331.4 + 0.6 Θ) m/s,     (3.2)

where Θ is the temperature in degrees centigrade.

In contrast to the phenomenon of plane wave propagation, which occurs only in an unbounded homogeneous medium, the sound fields in rooms are restricted on all sides by walls, ceiling and floor. These room boundaries usually reflect a certain fraction of the acoustic energy impinging on them, while the remaining fraction is absorbed. When a plane wave strikes a wall, the changes in amplitude and phase during the reflection depend on the properties of the wall and the direction of the incident wave.

Another phenomenon, called scattering, occurs when a sound wave hits any obstacle of limited extent, such as a pillar or a piece of furniture. Additional waves are brought about by diffraction and spread more or less in all directions.

The combination of numerous reflected and scattered components gives rise to the complexity of the sound field in a room. However, due to the linearity of wave

propagation, reflection, and scattering processes, the reverberation process can be characterized as a linear transmission system. Specifically, the relationship between the sound pressure measured at one point in space and the sound source pressure can be expressed as:

y(t) = \int_0^{\infty} s(\tau)\, h(t - \tau)\, d\tau ,     (3.3)

where y(t), s(t), and h(t) are the reverberant signal, the source signal, and the room impulse response function, respectively. The room impulse response function yields a complete description of room reverberation from the talker to the receiver. This linearization approximation is valid for most situations except for a very loud sound [89], which causes nonlinearity in the reflection process. Also, in a large room, the impulse response sometimes varies with time due to air circulation and temperature fluctuation.

This effect, however, is usually small and can be safely ignored.
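In discrete time, Equation 3.3 is simply a convolution of the speech samples with the measured or simulated impulse response; a minimal sketch (the normalization step is our choice):

import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    # Convolve clean speech with a room impulse response (discrete Equation 3.3)
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-12)     # normalize to avoid clipping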


Figure 3.1. Plot of a room impulse response function measured in a typical office employing a cross-correlation method. The data is obtained from [1].

Several techniques have been developed to measure room impulse response functions. For example, a cross-correlation method [89] derives the impulse response from the cross-correlation function of a known "pseudo-random" test signal emitted by a loudspeaker and the signal received by a microphone. However, it is difficult to obtain high-quality room impulse response functions because not only does this procedure require high-quality loudspeakers and microphones, but also, in field measurements, background noise is hard to eliminate. As an example, Figure 3.1 shows a room impulse response measured in a typical office employing a cross-correlation method. As can be seen, the energy of the response first attenuates quickly with increasing time delays. It is followed by a long segment with relatively low amplitudes, which resembles white noise and is caused by the background noise in the measurement.

Computer simulation provides an alternative to field measurement. Allen and Berkley [3] proposed an image method for simulating room reverberation by artificially generating room impulse response functions. This method assumes that sound waves propagate in a room of rectangular enclosure and a virtual acoustic image is generated by every reflection from the walls, the floor and the ceiling. The speaker-to-microphone impulse response function is computed by summing the impulses generated by the virtual acoustic images. This image model provides an exact solution to the full wave equation formulation of acoustics when the walls of the room are "hard" surfaces. Even when the walls are not perfect reflectors, this model still offers a good approximation of the physical results [16]. As a result, the Allen-Berkley model is widely used. A variety of room impulse responses can be generated using this method by changing the room dimensions and reflection coefficients of the walls. For example, Figure 3.2(a) illustrates the geometry of an office-size room and the locations of the loudspeaker and the microphone. The corresponding room impulse response function generated by this image model is shown in Figure 3.2(b). Compared with the room impulse response in Figure 3.1, the long tail segments resembling white noise are eliminated.


Figure 3.2. (a) The geometry of an office-size room of the dimensions 6 by 4 by 3 meters (length by width by height), and (b) the corresponding room impulse response function generated by the image model [3]. Wall reflection coefficients are 0.75 for all walls, ceiling and floor. The loudspeaker and the microphone are at (2, 3, 1.5) and (4, 1, 2) meters (length, width, height), respectively.

A room impulse response consists of the direct sound, which arrives by a straight-line path, and numerous reflections, which yield a multipath sound transmission from source to receiver. The first few reflections come from nearby walls and are relatively well defined, while later ones, representing multiple reflections from the walls, are numerous and blur together. Instead of being treated individually, the reflections are usually summarized to characterize the impulse response.

A sound wave loses a fraction of its energy in every reflection according to the reflection coefficient (a property of the wall material) and the incident angle. The more times a sound wave reflects from the walls before reaching the receiver, the weaker the received signal. Assuming the impulse response h(t), the energy decay D(t) is computed according to Schroeder [144]:

D(t) = \int_t^{\infty} [h(\tau)]^2\, d\tau = \int_0^{\infty} [h(\tau)]^2\, d\tau - \int_0^{t} [h(\tau)]^2\, d\tau .     (3.4)

As an example, Figure 3.3 shows the energy decay curve for the impulse response function in Figure 3.2(b). As can be seen, the decay curve in dB approximates a straight line, and therefore, shows an exponential energy decay pattern (a straight decay line in dB indicates an exponential decay in energy values). It is typical of most room impulse response functions. In order to quantify this pattern, reverberation time T60 is defined as the duration at the end of which the decay level reaches -60 dB. Typical values of reverberation times run from 0.3 s (living room) to 10 s (large churches or reverberation chambers) [89]; the longer the reverberation time, the more severe the room reverberation. For instance, the reverberation time T60 of the impulse response in Figure


Figure 3.3. Energy decay curve for the room impulse response function in Figure 3.2(b) using Schroeder's integration method [144]. The horizontal dotted line represents the −60 dB energy decay level. The left dashed line indicates the starting time of the impulse response and the right dashed line the time at which the decay curve crosses −60 dB.

3.2(b) is 0.3 s; see Figure 3.3. In practice, reverberation time is used to characterize not only the room impulse response but also the overall reverberation in a room, for it changes rather little when talker and microphone locations vary within the room.
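Equation 3.4 and the definition of T60 translate directly into code; reading T60 from the first −60 dB crossing is the simplest variant (in practice the decay is often fit over a higher-level range and extrapolated), and the function names are ours:

import numpy as np

def schroeder_decay_db(h):
    # Schroeder energy decay curve of an impulse response h, in dB (Equation 3.4)
    tail_energy = np.cumsum((h ** 2)[::-1])[::-1]     # integral from t to infinity
    return 10.0 * np.log10(tail_energy / tail_energy[0] + 1e-30)

def reverberation_time(h, fs):
    # T60: time at which the decay curve falls to -60 dB
    decay = schroeder_decay_db(h)
    below = np.nonzero(decay <= -60.0)[0]
    return below[0] / fs if below.size else None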

3.2 Effects of Room Reverberation on Speech Perception

Reverberation affects speech perception every day. For instance, when having a conversation with a friend in a stairwell, not only do we feel that the speech sounds "echoic", but we also have more difficulty in understanding the speech.

Under certain situations, such as listening to the sound reflected from mountainsides, the reflected portions of sound are heard as distinct "echoes." However, in a typical room, even though the reverberant sound we hear is composed of numerous repetitions of the original sound signal, we only hear one fused sound rather than distinctive repetitions.

This is called the Haas effect [53].

The other common experience is our ability to localize a sound source in a closed room, even though there are numerous reflections coming from directions other than the source direction [159].

Although the reflections caused by room reverberation usually do not stand out as

"echoes," reverberation does impair speech quality and intelligibility. Two causes of degradation of speech sounds in reverberation are identified (for example, see [117]):

1) the energy overlap of a preceding phoneme on the following phoneme (overlap-masking), and

2) the internal temporal smearing of energy within each phoneme (self-masking).

The severity of overlap-masking depends on the relative intensity of adjacent phonemes and their spectra. Consonants mask following vowels only a little [116], while vowels or high-intensity consonants mask following consonants substantially [83, 117]. On the other hand, self-masking is responsible for vowel perception errors in reverberant speech.

Some early studies of the relationship between a room impulse response function and the intelligibility of reverberant speech are based on the observation that, due to the spectral continuity of speech signals, the early reflections in the reverberation mainly increase the intensity of the reverberant speech, whereas the late ones are deleterious to speech quality and intelligibility. A few measures were proposed to describe the relative strength of early and late reflections. For example, Thiele [152] proposed a measure called "definition" as follows:

I = \frac{\int_0^{50\,\mathrm{ms}} [h(t)]^2\, dt}{\int_0^{\infty} [h(t)]^2\, dt} \times 100\% .     (3.5)

The higher the "definition" value, the more distinct the reverberant speech. Borè [17] later showed that there is a monotonically increasing relationship between "definition" and syllable intelligibility.
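Equation 3.5 is simply the early-to-total energy ratio of the impulse response; a direct sketch (the function name is ours):

import numpy as np

def definition(h, fs, early_ms=50.0):
    # "Definition" of Equation 3.5: energy in the first 50 ms over total energy
    n_early = int(round(early_ms * 1e-3 * fs))
    e = np.asarray(h, dtype=float) ** 2
    return 100.0 * e[:n_early].sum() / e.sum()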

Monaural listening tests on normal-hearing subjects using the Modified Rhyme Test

[85] show that the recognition accuracy degrades from 99.7% for anechoic conditions to

97.0%, 92.5%, and 88.7% for reverberation times of 0.4 s, 0.8 s, and 1.2 s, respectively

[119]. Compared with monaural listening in the same conditions, binaural listening increases the recognition accuracy to some extent for young subjects, while reducing it for children and elderly listeners [119].

For a person with normal hearing in a typical room, reverberation does not significantly reduce the intelligibility of perceived speech but causes a noticeable change in speech quality [16].

Berkley and Allen [16] identified two physical variables, reverberation time T60 and the talker-listener distance, as important for reverberant speech quality. Consider the impulse response as a combination of three parts: the direct sound and the early and late reflections.

While late reflections smear the speech spectra and reduce the intelligibility and quality of speech signals, early reflections cause another distortion of the speech signal called coloration; the non-flat frequency response of the early reflections distorts the speech spectrum. The coloration can be characterized by a spectral deviation defined as the standard deviation of the room frequency response. For example, Figure 3.4 shows the frequency response of the room impulse response in Figure 3.2(b), and the corresponding spectral deviation is 6.2 dB.

Allen [2] reported a formula derived from a nonlinear regression to predict the quality of reverberant speech as measured by subjective preference:

\frac{P}{P_{MAX}} = 1 - 0.3\, \sigma\, T_{60} ,     (3.6)

where PMAX is the maximum preference, σ is the spectral deviation in dB, and T60 is the reverberation time, in seconds. According to this formula, increasing either spectral deviation or reverberation time results in decreased reverberant speech quality. Jetzt [77] shows that spectral deviation is determined by signal-to-reverberant energy ratio (SRR).

The relative reverberant energy in a room is approximately constant. Therefore, in the same room the spectral deviation is determined by the talker-to-microphone distance. A shorter talker-to-microphone distance results in a higher SRR value and less spectral deviation and, hence, less distortion or coloration.
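The spectral deviation and Allen's regression (Equation 3.6) can be evaluated from an impulse response as sketched below; taking the magnitude spectrum of the full impulse response as the room frequency response is our simplification:

import numpy as np

def spectral_deviation_db(h, n_fft=8192):
    # Standard deviation of the room frequency response, in dB
    mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(h, n_fft)) + 1e-12)
    return float(np.std(mag_db))

def relative_preference(h, t60):
    # Allen's formula (Equation 3.6): predicted preference P / P_MAX
    return 1.0 - 0.3 * spectral_deviation_db(h) * t60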


Figure 3.4. Frequency response of the room impulse response in Figure 3.2(b).

3.3 A Pitch-based Measure for Room Reverberation Time

For reasons noted in Section 3.1, reverberation time is an important quantity characterizing room reverberation. Many techniques that measure this parameter have been proposed. Reverberation time, of course, can be inferred directly from the measured room impulse response function. Simpler methods, however, are usually employed [89]. For example, traditional methods typically utilize a loudspeaker as the excitation generator, and a microphone for receiving reverberant sounds. The excitation signal from the loudspeaker is either a modulated sinusoidal signal or a band-passed random noise.

Sometimes, a pistol shot can be used as an alternative excitation. At a given moment the excitation is interrupted and a recorder begins to record the sound intensity level picked up by the microphone. By employing Schroeder's integration method in Equation 3.4, the intensity decay curve is smoothed and the reverberation time is then obtained.

Traditional room reverberation time measurements require the generation of unnatural sound sources. On the other hand, experienced acousticians are able to make rather precise judgments concerning reverberation times through listening to speech or music in a room. Human subjects are quite sensitive to the change of reverberation time when listening to natural sound sources. For example, Niaounakis and Davies [122] report that the just noticeable difference of reverberation time when listening to music in a room is around 0.042 s.

Cox et al. [31] proposed a blind reverberation time estimation algorithm from reverberant speech utterances using artificial neural networks. Reverberation smears the temporal structure in speech and, therefore, flattens its energy envelopes; the flatness indicates the degree of reverberation. Taking advantage of this fact, a neural network model is trained on speech samples with known reverberation times and later used to determine the reverberation time in a room. However, speech utterances are restricted to individually pronounced digits and uncontrolled situations are not considered.

A recent blind reverberation time estimation algorithm outlined in an abstract by

Ratnam et al. [138] does not have this restriction on the speech source. A reverberation tail is modeled as an exponentially damped Gaussian white noise process, and the reverberation time is estimated employing a maximum-likelihood procedure. The authors report that the estimated reverberation times are in good agreement with the real ones. The detailed results, however, have not been provided.

Many speech processing tasks require a robust measure on degraded speech that indicates the degree of reverberation. For example, Yegnanarayana and Murthy [168] employ the kurtosis of the LP residual signal as a measure to estimate the signal-to-reverberation component ratio within a time frame. Extending this idea, Gillespie et al. [45] utilize the kurtosis as an optimization criterion to derive an inverse filter and therefore to dereverberate the degraded speech signal.

Brandstein [19] employs a criterion of signal periodicity for time-delay estimation using microphone arrays. The criterion, indicating the degree to which the speech signal is influenced by the detrimental effects of noise and reverberation, is used to weight generalized cross-correlations across all time frames. As a result, the weights of time frames with less degradation are relatively increased and the robustness of the system is improved.

Besides its other manifestations, reverberation corrupts harmonic structure in voiced speech, and we find that the degree of corruption can be used as an indication of reverberation. Our goal in this section is to develop a pitch-based measure of the degree of reverberation. It is robust to noise and can be used to estimate key parameters of the room impulse response such as the reverberation time (T60). A version of this measure can be found in [165].

3.3.1 Proposed Measure

A speech signal contains three types of segments: voiced, unvoiced, and silent.

Obviously, a pitch-based measure of reverberation can be based only on voiced time frames. Moreover, in a noisy background, some frequency channels in a voiced frame may be severely corrupted by noise. The measure should thus be based on the signals from "clean" frequency channels.

In order to satisfy these criteria, our measure, detailed below, is extended from the multipitch tracking algorithm described in Chapter 2. That algorithm can track pitch periods reliably and can also be used to provide voiced/unvoiced labeling. In addition, it has a channel selection method for identifying weakly corrupted frequency channels on which the pitch-based measure is based. A simplified version of the algorithm restricted to only one pitch track is used in this section, which deals with single speech sources.

Our novel observation is that the differences between the pitch periods determined by the pitch tracker and the time lags of the closest peaks of normalized correlograms in selected channels indicate the level of degradation in the harmonic structure. More specifically, the relative time lag δ is defined as the distance from the detected pitch period to the closest peak in the correlogram. We then collect the δ statistics from the selected channels across all voiced frames of 16 clean speech utterances chosen from the

TIMIT database, for every channel separately. As a typical example, the δ histogram for channel 22 is shown in Figure 3.5(a). As can be seen, the distribution is sharply centered at zero.



Figure 3.5. Histograms and estimated distributions of relative time lags in channel 22 (center frequency = 264 Hz) of (a) clean speech, and (b) reverberant speech with reverberation time of 0.3 s. The bar graphs represent histograms and the solid lines represent the estimated distributions.

We propose to use the spread of the distribution as an indication of reverberation because it measures the "cleanness" of harmonic structure in speech signals. A signal composed of an ideal stationary harmonic structure is extremely clean. In this case, the relative time lags collected from the signal have the same value of zero, and the distribution has zero spread. Due to the nonstationary nature of speech, the distribution spread of clean speech shown in Figure 3.5(a) is greater than zero.

Room reverberation corrupts harmonic structure, and echoes from natural speech tend to spread the distribution of relative time lags. To illustrate this, we collect the statistics of relative time lags from reverberant speech generated by convolving clean speech with a room impulse response function with T60 = 0.3 s. The histogram is shown in Figure 3.5(b). The spread is wider than that of clean speech.

In order to measure the distribution spread, we employ a mixture of a Laplacian and a uniform distribution for modeling the distribution in channel c (see Chapter 2):

$$p_c(\delta) = (1-q)\,\frac{1}{2\lambda_c}\exp\!\left(-\frac{|\delta|}{\lambda_c}\right) + q\,U(\delta;\eta_c) \qquad (3.7)$$

where $0 < q < 1$ is a partition coefficient of the mixture and $\lambda_c$ is the Laplacian

distribution parameter. $U(\delta;\eta_c)$ is a uniform distribution with range $\eta_c$. In a low-frequency channel (channels 1-55), we set the length of the range as the reciprocal of the center frequency.

We also assume a linear relationship between the frequency channel index c and the

Laplacian distribution parameter $\lambda_c$,

$$\lambda_c = a_0 + a_1 c. \qquad (3.8)$$

The maximum likelihood method is utilized to estimate the three parameters $a_0$, $a_1$, and $q$ in low-frequency channels. The estimated distributions of relative time lags in clean and reverberant speech are also shown in Figure 3.5(a) and (b). As can be seen, the model distributions fit the histograms very well.
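One way to carry out this fit is to minimize the negative log-likelihood of Equations 3.7 and 3.8 over a_0, a_1, and q with a general-purpose optimizer, as sketched below; the dictionary-based inputs, the starting point, and the assumption that all collected lags fall inside each channel's uniform range are our own simplifications, not the exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_relative_lag_mixture(deltas_by_channel, eta_by_channel):
    """Maximum-likelihood fit of the Laplacian/uniform mixture (Eqs. 3.7-3.8).
    deltas_by_channel: {channel index c: array of relative time lags delta}
    eta_by_channel:    {channel index c: length eta_c of the uniform range}"""
    def neg_log_likelihood(params):
        a0, a1, q = params
        if not (0.0 < q < 1.0):
            return np.inf
        nll = 0.0
        for c, deltas in deltas_by_channel.items():
            lam = a0 + a1 * c                          # Equation 3.8
            if lam <= 0.0:
                return np.inf
            laplacian = np.exp(-np.abs(deltas) / lam) / (2.0 * lam)
            uniform = 1.0 / eta_by_channel[c]          # density inside the range
            nll -= np.sum(np.log((1.0 - q) * laplacian + q * uniform))
        return nll

    res = minimize(neg_log_likelihood, x0=[1.0, 0.02, 0.1], method="Nelder-Mead")
    a0, a1, q = res.x
    return a0, a1, q
```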

Finally, the measure of distribution spread λ is defined as the average of the parameters

$\lambda_c$ in low-frequency channels, since the harmonic structure of clean speech is more stable in low-frequency channels than in high-frequency ones. Figure 3.6 shows the relationship between λ and reverberation time. Here, the reverberant signals are generated by convolving the same 16 clean speech signals with room impulse response functions of various reverberation times obtained from the image model [3]. As can be seen, the plot is monotonic and therefore the relative time lag spread λ can be used to estimate the reverberation time.

As shown in Figure 3.6, the distribution spread λ rises almost linearly with increasing reverberation time up to 0.6 s, and it saturates beyond this value. As a result, our method is not able to measure T60 accurately when it is longer than 0.6 s.
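Given this monotonic relationship, a measured spread can be converted to a reverberation time by interpolating a calibration curve learned from reverberant speech with known T60 values; the calibration pairs below are placeholders for illustration only, not the values underlying Figure 3.6.

```python
import numpy as np

# Hypothetical (lambda, T60) calibration pairs; in practice these would be learned
# from the corpus of reverberant speech with known reverberation times.
CALIB_LAMBDA = np.array([0.40, 0.55, 0.75, 0.95, 1.10])
CALIB_T60 = np.array([0.10, 0.20, 0.30, 0.45, 0.60])

def t60_from_spread(lambda_spread):
    """Map the average distribution spread to T60 by piecewise-linear interpolation.
    Estimates saturate at 0.6 s, mirroring the behavior described in the text."""
    return float(np.interp(lambda_spread, CALIB_LAMBDA, CALIB_T60))
```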

Our method is perhaps the first blind reverberation time estimation algorithm. To the best of our knowledge, the algorithm later outlined in an abstract by Ratnam et al. [138] is the only other published reverberation time measure utilizing arbitrary speech sources. A comparison with their algorithm, however, is not possible since the details of their algorithm have not been provided.


Figure 3.6. Average distribution spread λ of relative time lag with respect to reverberation time.

3.4 Conclusion

We have shown that reverberation is a complex acoustic phenomenon. Speech processing in reverberant environments is very challenging because the speech signal is non-stationary and has a large dynamic range, and reverberant environments impose additional variations.

Although reverberation degrades speech quality, normal-hearing human listeners perform remarkably well in reverberant environments. This performance, however, is not matched by current speech processing systems. In the next chapter, we will examine approaches for speech processing in reverberant environments.

Finally, we have presented a pitch-based reverberation measure that gives a blind estimate of the reverberation time in a room utilizing only the reverberant speech signal.

This measure, which plays a role in the reverberant speech enhancement algorithm presented in the next chapter, may be useful for many speech processing tasks performed in reverberant environments, since prior knowledge of reverberation characteristics is lacking in many practical situations.


CHAPTER 4

A ONE-MICROPHONE ALGORITHM FOR REVERBERANT SPEECH ENHANCEMENT

In this chapter, we present an algorithm for reverberant speech enhancement using one microphone. The proposed algorithm utilizes inverse filtering to increase SRR and then spectral subtraction to lessen the influence of long-term reverberation. A preliminary version of the proposed algorithm, which enhances the reverberant speech by estimating and subtracting effects of late reflections, can be found in [165].

The chapter is organized as follows. The following section provides the background of reverberant speech enhancement. The details of the proposed algorithm are explained in Section 4.2. Section 4.3 describes the evaluation experiments and shows the results.

Finally, discussion and conclusion are given in Section 4.4.

4.1 Introduction

Speech technology applications are ubiquitous in modern societies. For instance, telephony has long been an inseparable part of people's lives; ASR and text-to-speech synthesis technologies are employed for various tasks such as computer command and

control, dictation systems, and voice portals [70]; hearing-impaired listeners wear hearing aids to make speech signals more audible.

As discussed in Chapter 3, reverberation often degrades the quality of speech; it affects various speech technology applications in different ways. For example, many contemporary telecommunication technologies, such as videoconferencing and multimedia conferencing, demand hands-free audio communication, which is subject to degradation such as reverberation. In order to achieve high audio quality, it is essential to develop systems that recover the original speech signal from the reverberant signal received by a microphone.

Reverberation substantially degrades the performance of current ASR systems, whereas normal-hearing human listeners have little difficulty in recognizing speech in moderately reverberant conditions. Almost all existing ASR systems work on input data that is compressed into feature vectors typically based on linear prediction (LP) or cepstral analysis. These features, however, are sensitive to reverberation degradation. A recent study [44] reports that the recognition accuracy of an ASR system (Microsoft

Speech SDK version 5.0 with no additional training [108]) drops from 58.8% with no reverberation to 28.7% in a reverberant environment with an SRR of -8.23 dB and a reverberation time of 0.22 s. Giuliani et al. [47] also show that, using continuous-density

HMMs and a normalized mel-cepstral front-end, the recognition accuracy of their ASR system degrades from 80% for T60 = 0.1 s to around 50% for T60 = 0.3 s, and around 10% for T60 = 0.5 s. As a comparison, normal-hearing human listeners maintain an accuracy over 90% for reverberation times up to 0.5 s on even more difficult test materials [119].

Several strategies have been proposed to improve ASR performance in reverberant environments (for an overview, see [126]): training speech models in the presence of reverberation, the use of more robust feature vectors, reverberant speech recognition using missing data techniques, and reverberant speech enhancement. High performance

ASR systems can be developed where there is a good match between testing and training conditions. Training speech models using reverberant speech increases the robustness in the presence of the same level of reverberation. Yet these systems still suffer from the variability of acoustic environments. The second strategy seeks feature vectors that are robust to reverberation. For example, inspired by the human auditory system, reverberation-resistant features, such as modulation-filtered spectrogram (MSG) [81] and

RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Prediction) [60], have been proposed. Employing combined MSG and RASTA-PLP features, an ASR system proposed by Kingsbury [80] achieves error rates of around 8%, 13%, and 37%, respectively, for reverberation times of 0.3 s, 0.5 s, and 0.9 s, compared with an error rate of 4.7% with no reverberation. In a recent study, Palomäki et al. [128] employed missing data speech recognition to improve ASR performance in reverberant environments by estimating and utilizing only the least reverberation-contaminated time-frequency regions of reverberant speech. Missing data speech recognition [30] utilizes the time-frequency redundancy in speech to make optimal decisions. In degraded speech, some time-frequency components are severely corrupted by noise or reverberation and, therefore, unreliable. Consequently, the posterior probability is estimated by utilizing only the reliable components. The system by Palomäki et al. [128] achieves the

recognition accuracy of 90% with a reverberation time of 0.7 s. A comparable ASR system, using mean-normalized mel-cepstral coefficients as feature vectors, achieves a recognition accuracy of 99.7% with no reverberation and of 60% with a reverberation time of 0.7 s. The system, however, is tested on a small vocabulary; the performance of a missing data speech recognition system degrades substantially with larger vocabularies [24]. Although these reverberation-resistant ASR systems improve the recognition accuracy, the performance is still unsatisfactory for many real-world applications and is substantially inferior to human performance.

Reverberation also has detrimental effects on speaker recognition systems, such as automatic speaker verification systems; the performance is strongly dependent on speaker location, room size, reverberation time, and training conditions [43]. Several methods for improving the performance in reverberant environments have been proposed. For example, González-Rodríguez et al. [50] utilize microphone arrays to enhance the robustness of Gaussian mixture model speaker recognition systems; some performance improvement is observed. A successful reverberant speech enhancement algorithm can be used as a front-end for ASR and speaker recognition systems and thus improve the recognition accuracy.

While moderately reverberant speech is highly intelligible for normal listeners, hearing-impaired listeners are much more affected by reverberation (for a comprehensive review, see [115]). Nábělek and Mason [118] studied the perception of consonants in two conditions: one with T60 = 0.1 s and the other with T60 = 0.5 s. Although the normal-hearing subjects maintain high Modified Rhyme Test [85] scores (over 90%) in both

reverberant conditions for binaural and monaural listening, the hearing-impaired listeners never achieve high scores (only around 80%) in the less reverberant environment; their scores in the more reverberant environment are reduced to about 60%. The perception of vowels in reverberation has also been investigated. Although vowels are more identifiable than

consonants in reverberation due to their higher intensity, Nábělek [114] shows that, under reverberant conditions, two groups of hearing-impaired listeners have error rates of around 35% and 55%, compared with error rates of around 8% and 25% with no reverberation, respectively. In conclusion, hearing-impaired listeners suffer from room reverberation much more than normal listeners. An effective reverberant speech enhancement algorithm is critical for designing next-generation digital hearing aids.

Two general approaches, pre- and post-processing algorithms, exist for reverberant speech enhancement: while post-processing algorithms deal with the reverberant signals received by microphones, pre-processing algorithms modify the source signals emitted by the speakers. Several pre-processing methods for boosting intelligibility in reverberant environments have been proposed. For example, Arai et al. [5] attenuate the steady-state portions of source speech, which consist of vowel segments and have high intensity, to reduce the masking of the following weaker segments and therefore improve the intelligibility of reverberant speech.

Some post-processing algorithms assume that room impulse response functions are known. For instance, a delay-sum beamformer [37] reduces reverberation effects by summing the received signals from multiple microphones after compensating for appropriate delays. Consequently, it amplifies the sound energy coming from the direct

source while attenuating that from other directions. However, Gillespie and Atlas [44] point out that delay-sum beamformers improve the SRR values while leaving the reverberation times hardly changed, and performance gains of ASR systems using delay-sum beamformers as the front-end are only modest. Flanagan et al. [39] show that matched filters, the time-reversed room impulse responses, are theoretically superior to delay-sum beamformers. Figure 4.1 shows the equalized impulse response derived by convolving the room impulse response in Figure 3.2 with its matched filter. The existence of pre-echoes, caused by the long tail before the most prominent impulse as shown in

Figure 4.1, however, degrades the speech quality [139]. Consequently, truncated matched filters [136] are proposed and also shown to improve ASR performance [137].


Figure 4.1. The equalized impulse response derived by convolving the room impulse response in Figure 3.2(b) with its matched filter.

A reverberation process can be viewed as a convolution of the original signal emitted by the speaker with a room impulse response function, as described in Section 3.1. As a result, one method for removing reverberation effects is to pass the reverberant signal through a second filter that inverts the reverberation process and recovers the original signal. An inverse filter may be inferred from the known room impulse response. A perfect reconstruction of the original signal exists, however, only if the room impulse response function is a minimum-phase filter, whose poles and zeros are all inside the unit circle [127]. In this case, its inverse is guaranteed to be also minimum-phase, stable, and

causal. However, as pointed out by Neely and Allen [121], room impulse responses are often not minimum-phase. One solution is to use multiple microphones. By assuming no common zeros among the room impulse responses, exact inverse filtering can be realized using FIR filters [109]. For one-microphone systems, some methods, such as linear least-square equalizers, which partially reconstruct the original signal, have also been proposed [44].

It is often impractical to assume the prior knowledge of room impulse responses; most post-processing algorithms are designed to perform in unknown acoustic environments and many of them utilize more than one microphone. As pointed out by

Koenig et al. [84], the reverberation tails of the impulse responses, characterizing the reverberation process in a room with multiple microphones and one speaker, are uncorrelated. Several multi-microphone algorithms for reverberant speech enhancement take advantage of this fact. For example, an algorithm proposed by Allen et al. [4] first filters the individual microphone signals into frequency bands. Next, the filtered outputs are compensated for delay differences and added. Finally, this algorithm reduces the long-term reverberation effects by attenuating each frequency band according to cross- correlation between corresponding microphone signals in that band.

Microphone-array based methods [21], such as beamforming techniques [32, 158], attempt to suppress the sound energy coming from directions other than that of the direct source and therefore enhance target speech. Extending from the idea proposed by Allen et al. [4] and described in the last paragraph, a family of techniques, called microphone array with postfiltering, utilize time-varying postfilters to further reduce reverberation

86 and noise effects by removing the incoherent parts of received signals (for example, see

[98]). The postfiltering techniques have also been used in hearing-aids applications (for example, see [54]).

With multiple sound sources in a room, the signals received by microphones can be viewed as convolutive mixtures of original signals emitted by the sources. Several methods (for example, see [15, 57]) have been proposed to achieve blind source separation (BSS) of convolutive mixtures, estimating the original signals using only the information of the convolutive mixtures received by the microphones. Unlike microphone-array based methods, which assume the knowledge of the geometry and the directional characteristics of microphones, BSS algorithms make no such assumptions.

Some methods consider unmixing systems as FIR filters, while others convert the problem into the frequency domain and solve an instantaneous BSS for every frequency channel. Araki et al. [6] point out a fundamental performance limitation of the frequency domain BSS algorithms. When the room impulse responses are long, the frame size of

FFT used for a frequency-domain BSS algorithm is preferred to be long because it needs to cover the long reverberation. However, when the length of the mixture signal is short, the lack of data in each frequency channel caused by the longer frame size undermines the assumption of independence of the source signals. This explains the poor performance of frequency-domain BSS algorithms in a realistic acoustic environment with moderate reverberation time.

Many BSS algorithms require that the number of microphones is greater than or equal to that of sources. A recent system [13], however, is capable of extracting the dominant

speech from a mixture of reverberant sounds, regardless of the number of sources, utilizing two microphones. This system first computes the dominant fundamental frequencies of the mixture and then applies adaptive wavelet band-pass filters centered at the fundamental frequency and its harmonics. After that, an independent component analysis algorithm extracts the signal most correlated with the fundamental frequency.

The results show that the dominant speech signal is enhanced.

As noted earlier in this section, inverse filtering proposed by Miyoshi and Kaneda

[109] is able to achieve nearly perfect dereverberation using multiple microphones with perfect knowledge of room impulse response functions. However, room impulse responses are seldom known in practical situations. A number of algorithms, called blind deconvolution, have been proposed to obtain the inverse filters without prior knowledge of room impulse responses. For example, Furuya and Kaneda [42] proposed a two-microphone blind deconvolution system. Assuming no common zeros between the room impulse responses, their method estimates the room impulse responses by computing the eigenvector corresponding to the smallest eigenvalue of the input correlation matrix and using a cost function to determine the order of the impulse responses. Their results show that this method increases the SRR by 5 dB on a room impulse response with T60 = 0.5 s.

Other methods, such as the system developed by Gillespie et al. [45], employ prior knowledge of speech signal distribution. Their algorithm estimates an inverse filter of the room impulse response by maximizing the kurtosis of the LP residual of speech. Later,

Gillespie and Atlas [46] derive an inverse filter by minimizing the long-term correlation in the LP residual of reverberant speech, and reduce the long-term reverberation energy at

the expense of decreasing the SRR. They show that their algorithm provides better ASR recognition accuracy than those maximizing SRR.

Brandstein [18] argues for the use of explicit speech models to improve the performance of microphone arrays for distant-talker speech acquisition. He employs the

Dual Excitation Speech Model [56] to represent a windowed segment of speech as the sum of two components: a voiced and an unvoiced signal, and then eliminates the unvoiced components in strongly voiced segments. The method is shown to be relatively insensitive to varying sound source locations. Brandstein and Griebel [20] assume that reverberation primarily affects the LP residual, not the LP coefficients. The LP residual is modeled employing a class of wavelets: quadratic spline wavelets. By locating the extrema of wavelet coefficients well clustered across all microphone channels, the original non-reverberant LP residual can be reconstructed.

Reverberant speech enhancement using one microphone is significantly more challenging than that using multiple microphones. A number of one-microphone algorithms have been proposed. For example, a cepstrum-based method is employed by

Bees et al. [14] to estimate the cepstrum of reverberation impulse response, and then its inverse is used to dereverberate the signal. Informal listening tests show that the processed speech is less reverberant. However, it has audible tone-like distortions.

Several one-microphone dereverberation algorithms are motivated by the study of the effects of reverberation on the Modulation Transfer Function (MTF) [67], which indicates the changes in the modulation depth of a modulated sinusoid in different frequency channels. Under reverberant conditions, the modulation depths of energy envelopes in

high-frequency channels are more attenuated than those in low-frequency channels.

Consequently, Langhans and Strube [91] proposed an enhancement algorithm that performs nonlinear filtering on energy envelopes in critical bands, from which the enhanced speech is then resynthesized. As a result, the reverberant speech is somewhat enhanced. Avendano and Hermansky [10] attempt to recover the energy envelope of the original speech by applying a theoretically derived inverse MTF and an optimal filter trained from clean and reverberant speech. The results show an audible reduction of reverberation, but artifacts in the processed speech appear to be rather severe.

Yegnanarayana and Murthy [168] point out that the LP residual signals of voiced clean speech have damped sinusoidal patterns within each glottal cycle, while those of reverberant speech are smeared and resemble Gaussian noise. With this observation, gross weights are applied to LP residual signals so that more severely reverberant speech segments are attenuated, thereby reducing reverberant artifacts such as reverberation tails immediately after a speech segment. Also, fine weights are applied to the residual signals so that they resemble more closely the damped sinusoidal patterns of the LP residual of clean speech. Moreover, the authors observe that the envelope spectrum of clean speech is flatter than that of reverberant speech. Thus the LP coefficients are manipulated to flatten the spectrum. Finally, the enhanced speech is resynthesized from the processed LP residual signal and coefficients.

Nakatani and Miyoshi [120] proposed a system capable of blind dereverberation of one-microphone speech by employing the harmonic structure of speech. In their system, a sinusoidal representation is used to approximate the direct sound in a reverberant

environment, and adaptive harmonic filters are first employed to estimate the voiced clean speech from the reverberant speech signals. This estimate, although crude, is then used to derive a dereverberation filter. As the number of reverberant speech data sets increases, the estimate of the dereverberation filter becomes more precise. Good results are obtained, especially for female speech trained on 5240 Japanese word utterances, but this algorithm requires a large amount of reverberant speech produced using the same room impulse response function.

Reverberation presents itself as an important challenge for many speech processing systems. Existing reverberant speech enhancement algorithms, however, do not reach the performance level demanded by many real-world applications. In the following section, we present a novel one-microphone algorithm for reverberant speech enhancement.

4.2 Proposed Algorithm

As identified in Chapter 3, two types of degradation, coloration and long-term reverberation, exist in a reverberant environment. Consequently, our algorithm consists of two stages to deal with these two types of degradation. In the first stage, an inverse filter is estimated to reduce coloration effects so that SRR is increased. The second stage utilizes spectral subtraction to minimize the influence of long-term reverberation.

Detailed explanations of the two stages of our algorithm are given in Sections 4.2.1 and 4.2.2.

4.2.1 Inverse Filtering

As described in the introduction of this chapter, inverse filtering can be utilized to reconstruct the original signal. In the first stage of our algorithm, we derive an inverse filter to reduce reverberation effects; this stage is adapted from a multi-microphone inverse filtering algorithm proposed by Gillespie et al. [45].

Assuming that $\hat{g} = [g(1), g(2), \ldots, g(L)]$ is an inverse filter of length L, the inverse-filtered speech is

$$z(t) = \hat{g}\,\hat{y}(t), \qquad (4.1)$$

where $\hat{y}(t) = [y(t-L+1), \ldots, y(t-1), y(t)]^T$ and $y(t)$ is the reverberant speech, sampled at 16 kHz.

The LP residual of clean speech has higher kurtosis than that of reverberant speech

[168]. Consequently, an inverse filter can be sought by maximizing the kurtosis of the LP residual of the inverse-filtered signal [45]. A schematic diagram of a direct implementation of such a system is shown in Figure 4.2(a). However, due to the LP analysis in the feedback loop, the optimization problem is not trivial. As a result, an alternative system is employed for inverse filtering [45], shown in Figure 4.2(b).

Here, the LP residual of the processed speech is approximated by the inverse-filtered LP residual of the reverberant speech, $\tilde{z}(t)$. Consequently, we have:


[Figure 4.2 consists of two block diagrams: in (a), the reverberant speech y(t) passes through the inverse filter $\hat{g}$ and LP analysis, with the gradient of the kurtosis fed back to the inverse filter; in (b), y(t) passes through the inverse filter $\hat{g}$ to give the inverse-filtered speech z(t), while its LP residual $y_r(t)$ passes through a copy of $\hat{g}$ to give $\tilde{z}(t)$, whose kurtosis gradient updates the inverse filter.]

Figure 4.2. Schematic diagrams of (a) an ideal one-microphone dereverberation algorithm maximizing the kurtosis of the LP residual of the inverse-filtered signal, and (b) the algorithm employed in this section.

$$\tilde{z}(t) = \hat{g}\,\hat{y}_r(t), \qquad (4.2)$$

where $\hat{y}_r(t) = [y_r(t-L+1), \ldots, y_r(t-1), y_r(t)]^T$ and $y_r(t)$ is the LP residual of the reverberant speech. The optimal inverse filter $\hat{g}$ is derived so that the kurtosis of $\tilde{z}(t)$ is maximized. The optimization process can be carried out using adaptive-filter-like algorithms as follows.

The kurtosis of the inverse-filtered LP residual of the reverberant speech $\tilde{z}(t)$ is defined as:

$$J = \frac{E[\tilde{z}^4(t)]}{E^2[\tilde{z}^2(t)]} - 3. \qquad (4.3)$$

The gradient of the kurtosis with respect to the inverse filter $\hat{g}$ can be derived as:

$$\frac{\partial J}{\partial \hat{g}} = \frac{4\left(E[\tilde{z}^2(t)]\,E[\tilde{z}^3(t)\hat{y}_r(t)] - E[\tilde{z}^4(t)]\,E[\tilde{z}(t)\hat{y}_r(t)]\right)}{E^3[\tilde{z}^2(t)]}. \qquad (4.4)$$

To develop an estimate of this gradient, we substitute the expectations $E[\tilde{z}^3(t)\hat{y}_r(t)]$ and $E[\tilde{z}(t)\hat{y}_r(t)]$ with their instantaneous estimates $\tilde{z}^3(t)\hat{y}_r(t)$ and $\tilde{z}(t)\hat{y}_r(t)$, respectively, and obtain a stochastic approximation [151]:

$$\frac{\partial J}{\partial \hat{g}} \approx \left(\frac{4\left(E[\tilde{z}^2(t)]\,\tilde{z}^3(t) - E[\tilde{z}^4(t)]\,\tilde{z}(t)\right)}{E^3[\tilde{z}^2(t)]}\right)\hat{y}_r(t). \qquad (4.5)$$

With the definition of

$$f(t) = \frac{4\left(E[\tilde{z}^2(t)]\,\tilde{z}^3(t) - E[\tilde{z}^4(t)]\,\tilde{z}(t)\right)}{E^3[\tilde{z}^2(t)]}, \qquad (4.6)$$

the optimization problem can be formulated as a time-domain adaptive filter, and the update equation of the inverse filter becomes:

$$\hat{g}(t+1) = \hat{g}(t) + \mu f(t)\,\hat{y}_r(t), \qquad (4.7)$$

where $\mu$ denotes the learning rate, for every time step.
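For illustration, Equations 4.6 and 4.7 correspond to the sample-by-sample update sketched below, with exponentially weighted running moments standing in for the expectations; as noted next, this time-domain formulation is not the one used in practice, and the step size and smoothing constant here are arbitrary choices of ours.

```python
import numpy as np

def time_domain_inverse_filter(y_r, L=1024, mu=1e-9, alpha=0.999):
    """Sample-by-sample kurtosis-maximizing update of an inverse filter (Eqs. 4.6-4.7).
    y_r: LP residual of the reverberant speech."""
    g = np.zeros(L)
    g[-1] = 1.0                              # identity-like start: passes the current sample
    m2, m4 = 1e-6, 1e-6                      # running estimates of E[z^2] and E[z^4]
    for t in range(L - 1, len(y_r)):
        y_vec = y_r[t - L + 1:t + 1]         # [y_r(t-L+1), ..., y_r(t)], as in Eq. 4.1
        z = np.dot(g, y_vec)                 # inverse-filtered residual sample
        m2 = alpha * m2 + (1.0 - alpha) * z ** 2
        m4 = alpha * m4 + (1.0 - alpha) * z ** 4
        f = 4.0 * (m2 * z ** 3 - m4 * z) / m2 ** 3   # Equation 4.6
        g = g + mu * f * y_vec                       # Equation 4.7
    return g
```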

According to Haykin [58], however, the time-domain adaptive filter formulation is not recommended, because large variations in the eigenvalues of the autocorrelation matrices of the input signals may lead to very slow convergence, or no convergence at all. Consequently, a block frequency-domain structure is used for optimization. In this formulation, the signal is processed block by block using the FFT, and the filter length L is also used as the block length. The new update equations for the inverse filter are:

$$G'(n+1) = G(n) + \frac{\mu}{M}\sum_{m=1}^{M} F(m)\,Y_r^{*}(m), \quad\text{and} \qquad (4.8)$$

$$G(n+1) = \frac{G'(n+1)}{\left\lVert G'(n+1)\right\rVert}, \qquad (4.9)$$

where $F(m)$ and $Y_r(m)$ denote, respectively, the FFT of $f(t)$ and $\hat{y}_r(t)$ for the mth block. The superscript $*$ denotes complex conjugation. $G(n)$ is the FFT of $\hat{g}$ at the nth iteration and M is the number of blocks. Equation 4.9 ensures that the inverse filter is normalized. Finally, the inverse-filtered speech z(t) is obtained by convolving the reverberant speech with the inverse filter.

The detailed algorithm is given in Table 4.1. Specifically, we use 20 s of reverberant speech to derive the inverse filter, and we run 500 iterations, which are needed for good results.

An informal listening test shows that our inverse filtering algorithm improves the speech quality. A room impulse response function with T60 = 0.3 s is shown in Figure

4.3(a), and the equalized impulse response, the result of the room impulse response convolved with the obtained inverse filter, is shown in Figure 4.3(b). As can be seen, the equalized impulse response is far more impulse-like than the room impulse response. In fact, the SRR value of the room impulse response is -9.8 dB in comparison with 2.4 dB for that of the equalized impulse response.

However, the above inverse filtering method does not improve the tail part of reverberation. Figure 4.4(a) and (b) show the energy decay curves of the room impulse response and the equalized impulse response, respectively. As can be seen, except for the first 50 ms, the energy decay patterns are almost identical, and thus the estimated reverberation times are almost the same, around 0.3 s. While the coloration distortion is reduced due to the increase of SRR, the degradation due to reverberation tails is not alleviated. In other words, the effect of inverse filtering is similar to that of moving the sound source closer to the receiver. In the next subsection, we introduce a method to reduce the effects of long-term reverberation, thereby further enhancing reverberant speech.

Notation: $\mathbf{0}$ = L-by-1 null vector; FFT = fast Fourier transform; IFFT = inverse fast Fourier transform.

Step 1: initialization. Compute the LP residual $y_r(t)$ of the reverberant speech $y(t)$ employing 10th-order LP analysis, and set $G(0)$ to a 2L-by-1 random vector, subject to $\lVert G(0)\rVert = 1$, where L = 1024 (64 ms).

Step 2: iteration. For each iteration n, compute Steps 3-5.

Step 3: filtering. For each time frame m, compute:
$Y_r(m) = \mathrm{diag}\{\mathrm{FFT}[y_r(mL-L), \ldots, y_r(mL-1), y_r(mL), \ldots, y_r(mL+L-1)]^T\}$
$\tilde{z}(m)$ = last L elements of $\mathrm{IFFT}[Y_r(m)\,G(n)]$

Step 4: feedback function computation. Compute $E[\tilde{z}^2(t)]$ and $E[\tilde{z}^4(t)]$ every 512-sample block (32 ms). Compute $\hat{f}(m) = [f(mL-L), f(mL-L+1), \ldots, f(mL-1)]^T$ according to Equation 4.6, and
$F(m) = \mathrm{FFT}\!\begin{bmatrix}\mathbf{0} \\ \hat{f}(m)\end{bmatrix}$

Step 5: updating the inverse filter. Compute
$\Delta = \frac{1}{M}\sum_{m=1}^{M}$ (first L elements of $\mathrm{IFFT}[Y_r^{*}(m)\,F(m)]$)
$G'(n+1) = G(n) + \mu\,\mathrm{FFT}\!\begin{bmatrix}\Delta \\ \mathbf{0}\end{bmatrix}$ and $G(n+1) = \dfrac{G'(n+1)}{\lVert G'(n+1)\rVert}$,
where $\mu = 3\times 10^{-9}$. Go to Step 2.

Table 4.1. The first stage of the proposed algorithm for deriving an inverse filter.
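The following sketch follows Table 4.1 under simplifying assumptions: the moment estimates are taken over each output block rather than every 512 samples, numerical guards are added, and the function and variable names are ours rather than those of the original implementation.

```python
import numpy as np

def estimate_inverse_filter(y_r, L=1024, mu=3e-9, n_iter=500, seed=0):
    """Block frequency-domain inverse filter estimation by kurtosis maximization
    (after Table 4.1). y_r: LP residual of the reverberant speech."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal(2 * L) + 1j * rng.standard_normal(2 * L)
    G /= np.linalg.norm(G)

    block_starts = range(L, len(y_r) - L, L)         # each block covers samples [s-L, s+L)
    for _ in range(n_iter):
        grad = np.zeros(L)
        for s in block_starts:
            Yr = np.fft.fft(y_r[s - L:s + L])        # 2L-point FFT of the block
            z = np.real(np.fft.ifft(Yr * G))[L:]     # last L samples: filtered residual
            m2 = np.mean(z ** 2) + 1e-12
            m4 = np.mean(z ** 4)
            f = 4.0 * (m2 * z ** 3 - m4 * z) / m2 ** 3         # Equation 4.6
            F = np.fft.fft(np.concatenate([np.zeros(L), f]))
            grad += np.real(np.fft.ifft(np.conj(Yr) * F))[:L]  # Step 5 accumulation
        grad /= len(block_starts)
        G_new = G + mu * np.fft.fft(np.concatenate([grad, np.zeros(L)]))
        G = G_new / np.linalg.norm(G_new)

    return np.real(np.fft.ifft(G))[:L]               # time-domain inverse filter g
```

The inverse-filtered speech would then be obtained by convolving the reverberant signal with the returned filter, e.g. z = np.convolve(y, g).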


Figure 4.3. (a) A room impulse response function generated by an image method [3]. (b) The equalized impulse response derived from the reverberant speech generated by the room impulse response in (a) as the result of the first stage of our algorithm.


Figure 4.4. Energy decay curves computed from (a) the room impulse response function in Figure 4.3(a) and from (b) the equalized impulse response in Figure 4.3(b). The horizontal dotted line represents the -60 dB energy decay level. The left dashed lines indicate the starting times of the impulse responses and the right dashed lines the times at which the decay curves cross -60 dB.

4.2.2 Spectral Subtraction

As described in Chapter 3, the late reflections in a room impulse response function smear speech spectrum and degrade speech intelligibility and quality. Likewise, the equalized impulse response can be decomposed into two parts: early and late impulses.

Resembling the effects of the late reflections in a room impulse response, the late impulses have deleterious effects on the quality of the inverse-filtered speech; by estimating the effects of the late impulses and subtracting them, we can expect to enhance the speech quality.

Several methods have been proposed to reduce the effects of late reflections in a room impulse response. For example, Wu and Wang [165] enhance the reverberant speech by estimating and subtracting effects of late reflections. As discussed in Section 4.1,

Palomäki et al. [128] employ a missing data speech recognition technique for ASR in reverberant environments by estimating and utilizing only the least reverberation-contaminated time-frequency regions. First, the simulated auditory nerve firing rates in each frequency band are filtered by a reverberation masking filter, and the outputs of the filter indicate the severity of the corruption by reverberation. A threshold is then applied and the least reverberation-contaminated regions are estimated.

Reverberation causes the elongation of harmonic structure in voiced speech (also see

Figure 4.7) and, therefore, produces elongated pitch tracks. In order to obtain more accurate pitch estimation in reverberant environments, Nakatani and Miyoshi [120]

employ a filter $f_p = (1, -e, -e, \ldots, -e)$ to pre-filter the amplitude spectrum in the time

domain before applying a PDA. This technique reduces some elongated pitch tracks in reverberant speech.

The short-term spectrum of the effects of late impulses has two components: magnitude and phase. The details of the late impulses, which consist of numerous impulses, are very complex, while an overall characteristic, the general energy decay pattern, is known. As a result, it is impractical to estimate the phase spectrum, whereas the magnitude spectrum can be estimated.

The smearing effects of late impulses lead to the smoothing of the signal spectrum in the time domain, and we assume that the power spectrum of the late-impulse components is a smoothed and shifted version of the power spectrum of the inverse-filtered speech z(t):

$$|S_l(k;i)|^2 = \gamma\, w(i-\rho) * |S_z(k;i)|^2, \qquad (4.10)$$

where $|S_z(k;i)|^2$ and $|S_l(k;i)|^2$ are, respectively, the short-term power spectra of the inverse-filtered speech and the late-impulse components. The symbol $*$ denotes convolution. Indexes k and i refer to frequency bin and time frame, respectively. The convolution is in the time domain and $w(i)$ is the smoothing function. The short-term speech spectrum is obtained by using Hamming windows of length 16 ms with 8 ms overlap for short-term Fourier analysis. The shift delay $\rho$ indicates the relative delay of the late-impulse components. The distinction between early and late reflections for speech is commonly set at a delay of 50 ms in a room impulse response function [89]. This translates to approximately 7 frames for a shift interval of 8 ms, and we choose $\rho = 7$ as

a result. Finally, the scaling factor $\gamma$ specifies the relative strength of the late-impulse components and is set to 0.35.

Considering the shape of the equalized impulse response, we choose an asymmetrical smoothing function in the form of a Rayleigh distribution³:

$$w(i) = \begin{cases} \dfrac{i+a}{a^2}\exp\!\left(-\dfrac{(i+a)^2}{2a^2}\right), & \text{if } i > -a \\ 0, & \text{otherwise} \end{cases} \qquad (4.11)$$

where we choose a = 5 and the smoothing function is shown in Figure 4.5. This smoothing function goes down to zero quickly on the left side but tails off slowly on the right side; the right side of the smoothing function is similar to the shape of reverberation tails in equalized impulse responses.

The inverse-filtered speech z(t) can be expressed as the convolution of the clean speech s(t) and the equalized impulse response $h_e(t)$:

$$z(t) = \int_0^{\infty} s(t-\tau)\,h_e(\tau)\,d\tau. \qquad (4.12)$$

By separating the contributions from early and late impulses in the equalized impulse response, we rewrite Equation 4.12 as:

$$z(t) = \int_0^{T_l} s(t-\tau)\,h_e(\tau)\,d\tau + \int_{T_l}^{\infty} s(t-\tau)\,h_e(\tau)\,d\tau, \qquad (4.13)$$

³ The Rayleigh distribution is defined as $f(x) = \frac{x}{a^2}\exp\!\left(-\frac{x^2}{2a^2}\right)$ for $x \geq 0$; $f(x) = 0$ otherwise.


Figure 4.5. Smoothing function for estimating the late-impulse components.

where $T_l$ indicates the separation between early and late impulses. The first and the second terms in Equation 4.13 represent the early- and late-impulse components, respectively, and are computed from different segments of the original clean speech: the early-impulse component is calculated from $s(\tau_1)$, where $t - T_l \leq \tau_1 \leq t$, and the late-impulse component from $s(\tau_2)$, where $\tau_2 \leq t - T_l$.
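To make this decomposition concrete, the sketch below splits a discrete equalized impulse response at $T_l$ and convolves the clean speech with the two parts separately, and also computes a zero-lag normalized correlation between the two components of the kind reported later in Table 4.2; the function names and the use of a zero-lag coefficient are our own choices.

```python
import numpy as np

def early_late_components(s, h_e, fs, t_l=0.050):
    """Split the equalized impulse response at T_l and form the early- and late-impulse
    components of Equation 4.13 by separate convolutions."""
    n_l = int(round(t_l * fs))
    h_early = h_e[:n_l]
    h_late = np.concatenate([np.zeros(n_l), h_e[n_l:]])   # keep the original delay
    z_early = np.convolve(s, h_early)
    z_late = np.convolve(s, h_late)
    n = min(len(z_early), len(z_late))
    return z_early[:n], z_late[:n]

def normalized_correlation(x, y):
    """Zero-lag normalized correlation coefficient between two components."""
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2) + 1e-12))
```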


Figure 4.6. The average autocorrelation function of speech utterances of four male and four female speakers randomly selected from the TIMIT database [36].

Figure 4.6 shows the average autocorrelation function of speech utterances from four female and four male speakers randomly selected from the TIMIT database [36]. As can be seen, the autocorrelations are large around zero lag but fall off rapidly; they are almost zero with lags larger than 30 ms. As noted earlier, the early- and late-impulse components

are separately derived from two adjacent segments of clean speech: $s(\tau_1)$ and $s(\tau_2)$. The

correlation between these two speech signals is small when the time difference $\tau_1 - \tau_2$ is relatively large (not close to the border between the two segments). Consequently, we assume the early- and late-impulse components to be mutually uncorrelated. To verify this, in

Speaker/Gender    Normalized correlation coefficient
Female#1          0.012
Female#2          0.005
Female#3          0.006
Female#4          0.010
Male#1            0.013
Male#2            0.009
Male#3            0.017
Male#4            0.003

Table 4.2. The normalized correlation coefficients between the early- and late-impulse components computed from four female and four male speakers randomly selected from the TIMIT database [36].

Table 4.2, we show the normalized correlation coefficients between the early- and late-

impulse components derived from speech utterances from the eight speakers mentioned

before, assuming the separation between the early and late impulses $T_l$ = 50 ms. As can

be seen, the normalized correlation coefficients are less than 0.02. Consequently, the

power spectrum of the early-impulse components can be estimated by subtracting the

power spectrum of the late-impulse components from that of the inverse-filtered speech.

The results are further used as an estimate of the power spectrum of original speech.

Specifically, spectral subtraction [33] is employed to estimate the power spectrum of

original speech, $|S_{\tilde{x}}(k;i)|^2$:

 2 2  2 2 S z (k;i) − γw(i − ρ)∗ S z (k;i) ~ S x (k;i) = S z (k;i) max 2 ,ε  , (4.14)  S z (k;i) 

where $\varepsilon = 0.001$ is the floor and corresponds to the maximum attenuation of 30 dB.
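A compact sketch of this stage is given below: the Rayleigh-shaped window of Equation 4.11 is turned into a causal kernel w(i - ρ), the late-impulse power of Equation 4.10 is obtained by convolving each frequency bin's power envelope with that kernel, and Equation 4.14 is applied with the floor ε. The kernel length and the operation on a precomputed power spectrogram are our own implementation choices.

```python
import numpy as np

def late_impulse_kernel(a=5, rho=7, length=60):
    """Causal kernel w(j - rho) combining Equations 4.10 and 4.11."""
    j = np.arange(length)
    i = j - rho
    return np.where(i > -a,
                    (i + a) / a ** 2 * np.exp(-(i + a) ** 2 / (2.0 * a ** 2)),
                    0.0)

def spectral_subtract(power_z, gamma=0.35, eps=0.001, a=5, rho=7):
    """Estimate and subtract the late-impulse power spectrum (Eqs. 4.10 and 4.14).
    power_z: |S_z(k;i)|^2 with shape (n_bins, n_frames); returns |S_x(k;i)|^2."""
    kernel = late_impulse_kernel(a, rho)
    n_bins, n_frames = power_z.shape
    power_x = np.empty_like(power_z)
    for k in range(n_bins):
        late = gamma * np.convolve(power_z[k], kernel)[:n_frames]   # Equation 4.10
        gain = np.maximum((power_z[k] - late) / (power_z[k] + 1e-12), eps)
        power_x[k] = power_z[k] * gain                              # Equation 4.14
    return power_x
```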

Natural speech utterances contain silent gaps between words and sentences, and reverberation fills some of the gaps right after high-intensity speech sections. Identifying and then attenuating these silent gaps is desirable.

Two characteristics of a silent time frame are recognized. First, even with reverberation filling, the energy of a silent frame in the inverse-filtered speech is

relatively low. Consequently, a threshold ϑ1 is established to identify the possibility of a silent frame. Secondly, for a silent frame, the energy is substantially reduced after the spectral subtraction process described earlier in this section. As a result, a second

threshold ϑ2 is established for the energy reduction ratio. Specifically, the signal is first normalized so that the maximum frame energy is 1. A time frame i is identified as a

silent frame only if $E_z(i) < \vartheta_1$ and $E_z(i)/E_{\tilde{x}}(i) > \vartheta_2$, where $E_z(i)$ and $E_{\tilde{x}}(i)$ are the energy values in frame i for the inverse-filtered speech z(t) and the spectral-subtracted speech $\tilde{x}(t)$. We choose $\vartheta_1 = 0.0125$ and $\vartheta_2 = 5$. For identified silent frames, all frequency bins are attenuated by 30 dB. Finally, the short-term phase spectrum of the enhanced speech is set to that of the inverse-filtered speech, and the processed speech is reconstructed from the short-term magnitude and phase spectra.
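The frame-level decision just described can be sketched as follows, operating on the power spectrograms before and after spectral subtraction; working in the power domain and returning a modified copy are our own choices rather than details given in the text.

```python
import numpy as np

def attenuate_silent_frames(power_z, power_x, theta1=0.0125, theta2=5.0, atten_db=30.0):
    """Identify silent frames of the inverse-filtered speech and attenuate them by 30 dB.
    power_z: power spectrogram before spectral subtraction, shape (n_bins, n_frames).
    power_x: power spectrogram after spectral subtraction, same shape."""
    e_z = power_z.sum(axis=0)
    e_x = power_x.sum(axis=0) + 1e-12
    e_z_norm = e_z / e_z.max()                       # maximum frame energy normalized to 1
    silent = (e_z_norm < theta1) & (e_z / e_x > theta2)
    out = power_x.copy()
    out[:, silent] *= 10.0 ** (-atten_db / 10.0)     # 30 dB attenuation in power
    return out
```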

4.3 Results and Comparisons

4.3.1 Objective Speech Quality Measures

The selection of an objective speech quality measure for systematic evaluation is not trivial. An appropriate measure should be well correlated with subjective measures for a specific speech processing task.

Subjective speech intelligibility tests were introduced by Fletcher and Steinberg [40] to quantitatively measure the percentage of transmitted speech units that are correctly perceived by the listener. However, speech intelligibility tests are only appropriate for moderately to severely degraded speech signals, since these tests are unable to distinguish between speech signals that are highly intelligible. As a result, many subjective speech quality tests have been developed. For example, the category judgment method requires the listeners to rate the speech under test on a five-point scale: unsatisfactory, poor, fair, good, and excellent, and results in a mean opinion score (MOS) [72].

Due to the high costs and inflexibility of administering subjective tests, many objective quality measures have also been developed (see [133]). Ideally, an objective measure should replicate human performance. In reality, however, different objective measures are appropriate for different conditions.

Although the changes in the phases of sinusoidal signals or those in the relative phases of sinusoidal components of a signal are generally considered less important perceptually compared with the changes in the amplitudes, Weiss et al. [164] show that rapid fluctuations in the relative phases of the sinusoidal components of a speech signal lead to significant degradation in speech quality. When speech is mixed with a moderate

level of white noise, the phases of strong spectral components of speech are not distorted significantly due to the large dynamic range of the speech signal; Wang and Lim [162] studied the importance of phase information in the context of enhancing speech mixed with white noise and concluded that phase distortion is not important for speech enhancement applications. As a result, many objective speech quality measures, such as

the Itakura distance [73] and the weighted spectral slope measure [82], have been proposed that focus only on the magnitude of the short-term speech spectrum.

Although ignoring the phase information is appropriate for enhancement of noisy speech, it is not appropriate for enhancement of reverberant speech. We have conducted an informal experiment by substituting the phase of clean speech with that of reverberant speech while retaining the magnitude of clean speech. A clear reduction in speech quality is heard in comparison with the original speech.

A signal-to-noise ratio (SNR) is a speech quality measure that takes phase information into account, and is defined as:

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_n s^2(n)}{\sum_n \left[s(n)-\hat{s}(n)\right]^2}, \qquad (4.15)$$

where s(n) is the original noise- and reverberation-free signal, and $\hat{s}(n)$ is the processed signal. However, this simple measure implicitly weights every time frame by its energy, and speech energy varies substantially in the time domain. Therefore, time frames with high energy are significantly over-weighted, and it is a poor estimator of subjective

speech quality for a variety of speech distortions [101-103, 156]. As a result, a frame-based measure called segmental SNR ($\mathrm{SNR}_{seg}$) has been proposed as follows [133]:

$$\mathrm{SNR}_{seg} = \frac{1}{M}\sum_{j=1}^{M} 10\log_{10}\left(\frac{\sum_{n=m_j-N+1}^{m_j} s^2(n)}{\sum_{n=m_j-N+1}^{m_j}\left[s(n)-\hat{s}(n)\right]^2}\right), \qquad (4.16)$$

where $m_j$ is the end time of the jth frame and the summation is over M frames, each of length N. SNR values are computed for each frame and finally averaged. Some frames are silent and their SNR values are large in the negative direction. Consequently, a lower-bound threshold of -10 dB is selected to replace negative values less than the threshold.

Frames with SNR values greater than 35 dB are not perceived by the listener as significantly different. Therefore, an upper threshold of 35 dB is selected to represent

SNR values higher than 35 dB. The two thresholds prevent the final SNR measure from being dominated in either a positive or negative direction by a few time frames.
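For reference, segmental SNR with the clamping just described can be computed as sketched below; the frame length (30 ms at 16 kHz) is an illustrative choice.

```python
import numpy as np

def segmental_snr(s, s_hat, frame_len=480, lo=-10.0, hi=35.0):
    """Segmental SNR (Eq. 4.16) with -10 dB and 35 dB clamping.
    s: clean reference; s_hat: processed signal of the same length."""
    n_frames = len(s) // frame_len
    values = []
    for j in range(n_frames):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        num = np.sum(s[seg] ** 2)
        den = np.sum((s[seg] - s_hat[seg]) ** 2) + 1e-12
        values.append(np.clip(10.0 * np.log10(num / den + 1e-12), lo, hi))
    return float(np.mean(values))
```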

Although segmental SNR is much more correlated with speech quality, it weights all frequency components equally in a time frame. The variance of energy in different frequency bands is large and in order to weight them more appropriately, frequency-

weighted segmental SNR ($\mathrm{SNR}_{fw}$) has been proposed [156]. One particular implementation of frequency-weighted segmental SNR is defined as:

$$\mathrm{SNR}_{fw} = \frac{1}{M}\sum_{j=1}^{M}\left(\frac{1}{K}\sum_{k=1}^{K} 10\log_{10}\frac{\sum_{n=m_j-N+1}^{m_j} s_k^2(n)}{\sum_{n=m_j-N+1}^{m_j}\left[s_k(n)-\hat{s}_k(n)\right]^2}\right), \qquad (4.17)$$

where the signals are first filtered into K frequency bands corresponding to the 20 classical articulation bands listed in Table 4.3, reproduced from [41]. These bands are

Band   Frequency Limits (Hz)      Band   Frequency Limits (Hz)
1      250 - 375                  11     1930 - 2140
2      375 - 505                  12     2140 - 2355
3      505 - 645                  13     2355 - 2600
4      645 - 795                  14     2600 - 2900
5      795 - 955                  15     2900 - 3255
6      955 - 1130                 16     3255 - 3680
7      1130 - 1315                17     3680 - 4200
8      1315 - 1515                18     4200 - 4860
9      1515 - 1720                19     4860 - 5720
10     1720 - 1930                20     5720 - 7000

Table 4.3. Articulation index filter bands (reproduced from [41]).

unequally spaced and have varying bandwidths; however, they contribute equally to the intelligibility of processed speech. Normally these articulation bands are associated with the critical bands of the human auditory system. The articulation bands range from

250 to 7000 Hz as shown in Table 4.3, since frequencies below 250 Hz or above 7000 Hz are found not to contribute significantly to speech intelligibility [41].

Experiments show that frequency-weighted segmental SNR is highly correlated with subjective speech quality and is superior to conventional SNR or segmental SNR [133,

156].

We use frequency-weighted segmental SNR for evaluation of our reverberant speech enhancement system. Specifically, we use a window length of 30 ms, and the filters for the AI bands are implemented by using FIR filters of 10 ms length.
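Putting these pieces together, a frequency-weighted segmental SNR of the kind used here might be computed as sketched below with the band edges of Table 4.3; the FIR design routine, the clamping thresholds carried over from segmental SNR, and the default parameters are our own assumptions rather than the exact implementation.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Articulation-index band edges in Hz (Table 4.3).
AI_EDGES = [250, 375, 505, 645, 795, 955, 1130, 1315, 1515, 1720, 1930,
            2140, 2355, 2600, 2900, 3255, 3680, 4200, 4860, 5720, 7000]

def fw_segmental_snr(s, s_hat, fs=16000, frame_ms=30, fir_ms=10, lo=-10.0, hi=35.0):
    """Frequency-weighted segmental SNR (Eq. 4.17): filter both signals into the 20 AI
    bands, compute a clamped per-frame SNR in each band, and average over bands and frames."""
    frame_len = int(frame_ms * 1e-3 * fs)
    n_taps = int(fir_ms * 1e-3 * fs)
    n_frames = len(s) // frame_len
    band_snrs = np.zeros((len(AI_EDGES) - 1, n_frames))
    for k in range(len(AI_EDGES) - 1):
        h = firwin(n_taps, [AI_EDGES[k], AI_EDGES[k + 1]], pass_zero=False, fs=fs)
        s_k = lfilter(h, 1.0, s)
        s_hat_k = lfilter(h, 1.0, s_hat)
        for j in range(n_frames):
            seg = slice(j * frame_len, (j + 1) * frame_len)
            num = np.sum(s_k[seg] ** 2)
            den = np.sum((s_k[seg] - s_hat_k[seg]) ** 2) + 1e-12
            band_snrs[k, j] = np.clip(10.0 * np.log10(num / den + 1e-12), lo, hi)
    return float(band_snrs.mean())
```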

4.3.2 Evaluation Results

A corpus of speech utterances from eight speakers (four females and four males) randomly selected from the TIMIT database [36] has been used for system evaluation.

Informal listening tests show that the proposed algorithm achieves a substantial reduction of reverberation and has few audible artifacts. To illustrate typical performance, we show the enhancement result for a speech signal corresponding to the sentence "She had your dark suit in greasy wash water all year" from the TIMIT database in Figure 4.7.

Figure 4.7(a) and (c) show the clean and the reverberant signal, and Figure 4.7(b) and (d) the corresponding spectrograms, respectively. The reverberant signal is produced by convolving the clean signal with a room impulse response function with a 0.3 s reverberation time. As can be seen, while the clean signal has fine harmonic structure and silence gaps between the words, the reverberant speech is smeared and its harmonic structure is elongated. The inverse-filtered speech, resulting from the first stage of our algorithm, and its spectrogram are shown in Figure 4.7(e) and (f), respectively. Compared with the reverberant speech, inverse filtering restores some detailed harmonic structure of the original speech, although the smearing and silence gaps are not much improved. This is consistent with our understanding that coloration mostly degrades the detailed spectrum and phase information. Finally, the processed speech using the entire algorithm and its spectrogram are shown in Figure 4.7(g) and (h), respectively. As can be seen, the effects of reverberation have been significantly reduced in the processed speech. The smearing is lessened and many silence gaps are clearer.


Figure 4.7. Results of reverberant speech enhancement: (a) clean speech, (b) spectrogram of clean speech, (c) reverberant speech, (d) spectrogram of reverberant speech, (e) inverse-filtered speech, (f) spectrogram of inverse-filtered speech, (g) speech processed using our algorithm, and (h) spectrogram of the processed speech, for a female utterance "She had your dark suit in greasy wash water all year," all sampled at 16 kHz.


Speaker/Gender   SNR_fw^rev (dB)   SNR_fw^inv (dB)   SNR_fw^processed (dB)   SNR_fw^(inv-rev) (dB)   SNR_fw^(processed-rev) (dB)
Female#1         -2.62             0.01              1.84                    2.63                    4.46
Female#2         -2.07             0.01              1.56                    2.17                    3.63
Female#3         -4.28             -1.69             0.74                    2.60                    5.02
Female#4         -3.02             -0.90             1.07                    2.12                    4.09
Male#1           -4.47             -0.30             1.74                    4.17                    6.21
Male#2           -4.42             -0.50             1.07                    3.92                    5.49
Male#3           -3.23             0.66              2.01                    3.90                    5.24
Male#4           -3.04             -0.06             1.41                    2.99                    4.45
Average          -3.39             -0.33             1.43                    3.06                    4.82

Table 4.4. The systematic results of reverberant speech enhancement for speech utterances of four female and four male speakers randomly selected from TIMIT database [36].

Table 4.4 shows the systematic results for the utterances from the eight speakers.

$\mathrm{SNR}_{fw}^{rev}$, $\mathrm{SNR}_{fw}^{inv}$, and $\mathrm{SNR}_{fw}^{processed}$ denote the frequency-weighted segmental SNRs for

reverberant speech, inverse-filtered speech, and processed speech, respectively. The SNR

gains on inverse-filtered speech and the processed speech are represented by

$\mathrm{SNR}_{fw}^{inv-rev} = \mathrm{SNR}_{fw}^{inv} - \mathrm{SNR}_{fw}^{rev}$ and $\mathrm{SNR}_{fw}^{processed-rev} = \mathrm{SNR}_{fw}^{processed} - \mathrm{SNR}_{fw}^{rev}$, respectively. As

can be seen, the quality of the processed speech is substantially improved, with an

average SNR gain of 4.82 dB on reverberant speech.

To put the performance of our algorithm in perspective, we compare with a recent

one-microphone reverberant speech enhancement algorithm proposed by Yegnanarayana

and Murthy [168], described in the introduction section of this chapter. We refer to this

algorithm as the YM algorithm. Since the YM algorithm is implemented for speech signals sampled at 8 kHz, we downsample the speech signals from 16 kHz and adapt our algorithm to perform at 8 kHz. The results of processing the downsampled signal from

Figure 4.7 are shown in Figure 4.8. Figure 4.8(a) and (c) show the clean and the reverberant signal sampled at 8 kHz and Figure 4.8(b) and (d), the corresponding spectrograms, respectively. Figure 4.8(e) and (f) show the processed speech using the

YM algorithm and its spectrogram, respectively. As can be seen, some silence gaps are attenuated. The processed speech using our algorithm and its spectrogram are shown in

Figure 4.8(g) and (h). It is clear that our algorithm performs significantly better than the

YM algorithm, and this observation was confirmed by informal listening tests.

Quantitative comparisons are also obtained from the speech utterances of the eight

speakers separately and presented in Table 4.5. $\mathrm{SNR}_{fw-8k}^{rev}$, $\mathrm{SNR}_{fw-8k}^{YM}$, and $\mathrm{SNR}_{fw-8k}^{processed}$ represent the frequency-weighted segmental SNR values of reverberant speech, the processed speech using the YM algorithm, and the processed speech using our algorithm, respectively. The SNR gains by employing the YM algorithm and our algorithm are

denoted by $\mathrm{SNR}_{fw-8k}^{YM-rev} = \mathrm{SNR}_{fw-8k}^{YM} - \mathrm{SNR}_{fw-8k}^{rev}$ and

$\mathrm{SNR}_{fw-8k}^{processed-rev} = \mathrm{SNR}_{fw-8k}^{processed} - \mathrm{SNR}_{fw-8k}^{rev}$, respectively. As can be seen, the YM algorithm obtains an average SNR gain of 0.74 dB compared with 4.15 dB by our algorithm.

Our algorithm outperforms the YM algorithm substantially.


Figure 4.8. Results of reverberant speech enhancement of the same speech utterance in Figure 4.7(a) (downsampled to 8 kHz): (a) clean speech, (b) spectrogram of clean speech, (c) reverberant speech, (d) spectrogram of reverberant speech, (e) speech processed using the YM algorithm, (f) spectrogram of (e), (g) speech processed using our algorithm, and (h) spectrogram of (g).


Speaker/Gender   SNR_fw-8k^rev (dB)   SNR_fw-8k^YM (dB)   SNR_fw-8k^processed (dB)   SNR_fw-8k^(YM-rev) (dB)   SNR_fw-8k^(processed-rev) (dB)
Female#1         -3.64                -3.06               0.92                       0.58                      4.56
Female#2         -3.51                -3.05               0.74                       0.46                      4.25
Female#3         -3.86                -3.19               -0.20                      0.68                      3.66
Female#4         -4.12                -3.29               0.73                       0.83                      4.84
Male#1           -3.86                -2.65               -0.92                      1.21                      2.94
Male#2           -3.33                -2.68               1.77                       0.65                      5.10
Male#3           -3.30                -2.53               1.20                       0.76                      4.49
Male#4           -3.50                -2.76               -0.13                      0.75                      3.38
Average          -3.64                -2.90               0.51                       0.74                      4.15

Table 4.5. The systematic results of reverberant speech enhancement for speech utterances of four female and four male speakers randomly selected from TIMIT database [36]. All signals are sampled at 8 kHz.

Our algorithm has also been tested in reverberant environments with different reverberation times. The first stage of our algorithm, inverse filtering, is able to perform reliably with reverberation times ranging from 0.2 s to 0.4 s, which covers the range of reverberation times of typical living rooms. When reverberation times are greater than 0.4 s, the length of the inverse filter (64 ms) is too short to cover the long room impulse responses. On the other hand, when reverberation times are less than 0.2 s, the quality of reverberant speech is rather high without processing. Unless the inverse filter is precisely estimated, inverse filtering may even degrade the reverberant speech rather than improve it. Figure 4.9 shows the performance of our algorithm under different reverberation times. The dotted, dashed, and solid lines represent the frequency-weighted


Figure 4.9. The results of the proposed algorithm in reverberant environments with different reverberation times. The dotted, dashed, and solid lines represent the frequency-weighted segmental SNR values of reverberant speech, inverse-filtered speech, and the processed speech, respectively.

segmental SNR values of reverberant speech, inverse-filtered speech, and the enhanced speech, respectively. As can be seen, our algorithm consistently improves the quality of reverberant speech within this range of reverberation times.

While some impulse responses have heavy reverberation tails, others possess light tails, as specified by reverberation times. The scaling factor $\gamma$ in Equation 4.10 indicates the relative strength of the late-impulse components, and ideally it should change according to the reverberation time. The optimal scaling factors are identified by finding the maxima

of the frequency-weighted segmental SNR values, and are shown in Figure 4.10. The optimal frequency-weighted segmental SNR gains in comparison with those derived by using the fixed scaling factor of 0.32 are also shown in Figure 4.10. As can be seen, even with the optimal scaling factors ranging from 0.1 to 0.6, the performance gain from using these optimal factors is no greater than 0.3 dB. This strongly suggests that the selection of the scaling factor is not sensitive and does not affect final performance much.

Reverberation time can be estimated by using algorithms such as the one proposed in

Chapter 3. If the reverberation time is outside the range of 0.2 s to 0.4 s, the reverberant speech should be handled differently. For reverberation times from 0.1 s to 0.2 s, the second stage of our algorithm, which estimates and subtracts the late-impulse components, can be applied directly without passing through the first stage. Speech utterances from the eight speakers described before are employed for evaluation. Our experiments show that, under reverberation times of 0.12 s and 0.17 s, the second stage of our algorithm with a scaling factor of 0.05 improves the average frequency-weighted segmental SNR from 3.89 dB and 1.36 dB for reverberant speech to 4.38 dB and 2.55 dB for the processed speech, respectively. For reverberation times lower than 0.1 s, the reverberant speech has very high quality and no enhancement is necessary.


Figure 4.10. Performance of the optimal versus fixed scaling factors: the optimal scaling factors are given by the dashed line, and the frequency-weighted segmental SNR gains over those obtained with the fixed scaling factor of 0.32 are given by the solid line.

For reverberation times greater than 0.4 s, we could also directly use the second stage of our algorithm. To see its effects, we perform further experiments using a scaling factor of 2.0 on the utterances from the same eight speakers. Our experiments show that, with T60 = 0.58 s, the average frequency-weighted segmental SNR improves from -5.7 dB for reverberant speech to -1.4 dB for the processed speech.
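Taken together, these observations suggest a simple dispatch on the estimated reverberation time. The sketch below is only an illustration of that policy; the T60 estimator of Chapter 3 and the two processing stages are represented by assumed callables, and the thresholds and scaling factors are the ones reported in the experiments above.

```python
def enhance_by_t60(reverberant, t60, stage1_inverse_filter, stage2_subtract):
    """Choose a processing path based on the estimated reverberation time (s).

    stage1_inverse_filter and stage2_subtract are assumed callables standing
    in for the two stages of the proposed algorithm.
    """
    if t60 < 0.1:
        return reverberant                                 # quality already high; no processing
    if t60 < 0.2:
        return stage2_subtract(reverberant, scale=0.05)    # second stage only, small scaling factor
    if t60 <= 0.4:
        filtered = stage1_inverse_filter(reverberant)      # both stages, fixed scaling factor
        return stage2_subtract(filtered, scale=0.32)
    return stage2_subtract(reverberant, scale=2.0)         # long T60: second stage only
```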

In the preceding examples, we assume that the experiments are conducted in noise-free reverberant environments. In practical situations, however, ambient noise commonly exists. To illustrate the performance in this situation, a reverberant utterance is further corrupted by white noise so that the SNR of the reverberant speech, with the reverberant speech treated as the signal, is 20 dB. Figure 4.11 shows the performance. The reverberant speech in Figure 4.7(c) is mixed with white noise and the result is shown in Figure 4.11(a). The corresponding spectrogram is shown in Figure 4.11(b). Figure 4.11(c) and (d) show the inverse-filtered speech produced by the first stage of our algorithm and its spectrogram, respectively. Figure 4.11(e) and (f) show the speech processed by the entire algorithm and its spectrogram, respectively. The improvement can be seen clearly from Figure 4.11(e) and (f); the reverberant energy in some silence gaps has been significantly attenuated.
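The noisy reverberant condition is constructed by scaling white noise against the reverberant signal. A minimal numpy sketch of that mixing step, with the reverberant speech treated as the signal in the SNR computation, is given below.

```python
import numpy as np

def add_white_noise(reverberant, snr_db=20.0, seed=0):
    """Mix white Gaussian noise with a signal at a prescribed SNR (dB).

    Here the reverberant speech itself is treated as the signal, so an SNR of
    20 dB means the noise energy is 20 dB below the reverberant-speech energy.
    """
    reverberant = np.asarray(reverberant, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(reverberant))
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that places the noise snr_db below the signal in average power.
    gain = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```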

Systematic evaluation results for the eight speakers are provided in Table 4.6.

SNR_fw^degraded, SNR_fw^inv, and SNR_fw^processed represent the frequency-weighted segmental SNR values of the speech degraded by reverberation and noise, the inverse-filtered speech produced by the first stage of our algorithm, and the speech processed by the entire algorithm, respectively. The SNR gains of the inverse-filtered speech and of the fully processed speech are denoted as SNR_fw^(inv-degraded) = SNR_fw^inv - SNR_fw^degraded and SNR_fw^(processed-degraded) = SNR_fw^processed - SNR_fw^degraded, respectively. Comparing the degraded speech with the speech enhanced by our algorithm, we obtain an average SNR gain of 4.79 dB.


Figure 4.11. Results of enhancement of speech degraded by reverberation and noise: (a) speech degraded by reverberation (T60 = 0.3 s) and noise (SNR = 20 dB), (b) spectrogram of speech degraded by reverberation and noise, (c) inverse-filtered speech as the result of the first stage of our algorithm, (d) spectrogram of inverse-filtered speech, (e) speech processed using the proposed algorithm, and (f) spectrogram of the processed speech.


Speaker/Gender   SNR_fw^degraded   SNR_fw^inv   SNR_fw^processed   SNR_fw^(inv-degraded)   SNR_fw^(processed-degraded)
                 (dB)              (dB)         (dB)               (dB)                    (dB)
Female#1          -4.93             -2.60        -0.18               2.33                    4.75
Female#2          -6.15             -3.61        -1.13               2.54                    5.02
Female#3          -4.94             -3.23        -0.29               1.71                    4.65
Female#4          -5.08             -2.16         0.11               2.91                    5.19
Male#1            -5.50             -2.44         0.13               3.06                    5.63
Male#2            -4.38             -3.25        -1.14               1.13                    3.24
Male#3            -4.75             -1.01         0.27               3.74                    5.01
Male#4            -5.11             -2.63        -0.28               2.48                    4.83
Average           -5.10             -2.62        -0.31               2.49                    4.79

Table 4.6. The systematic enhancement results of speech degraded by reverberation (T60 = 0.3 s) and noise (SNR = 20 dB) for speech utterances from four female and four male speakers randomly selected from the TIMIT database [36].

4.4 Discussion and Conclusion

Many algorithms for reverberant speech enhancement utilize FIR filters for inverse filtering. The length of an FIR inverse filter, however, limits the system performance. For example, Figure 4.12(a) shows the equalized impulse response derived from the room impulse response in Figure 4.3(a) (T60 = 0.3 s) using linear least-square inverse filtering [44]. This technique derives an FIR inverse filter of length 1024 (64 ms) that is optimal in the least-square sense, given perfect knowledge of the room impulse response. The corresponding energy decay curve is shown in Figure 4.12(b). As can be seen, the impulses beyond 70 ms from the starting time of the equalized impulse response are not much attenuated. Some remedies have been investigated. For example, Gillespie and Atlas proposed a binary-weighted linear least-square equalizer [44], which attenuates more long-term reverberation at the expense of lower SRR values. However, because the length of the inverse filter is shorter than the length of the reverberation, reverberation longer than the filter cannot be effectively reduced in principle.
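For reference, a least-squares FIR inverse filter of the kind discussed above can be obtained by solving a linear system built from the known room impulse response. The sketch below uses a plain convolution-matrix formulation with a modeling delay; it is a generic least-squares design, not the binary-weighted variant of Gillespie and Atlas [44], and the default filter length of 1024 taps simply mirrors the setting used here.

```python
import numpy as np

def ls_inverse_filter(h, length=1024, delay=None):
    """Least-squares FIR inverse filter for a known room impulse response h.

    Solves min_g || C g - d ||^2, where C is the convolution matrix of h and
    d is a delayed unit impulse; a modeling delay of roughly half the filter
    length usually helps for non-minimum-phase responses.
    """
    h = np.asarray(h, dtype=float)
    n = len(h) + length - 1                 # length of the equalized response h * g
    if delay is None:
        delay = length // 2
    # Convolution matrix: column k holds h shifted down by k samples.
    C = np.zeros((n, length))
    for k in range(length):
        C[k:k + len(h), k] = h
    d = np.zeros(n)
    d[delay] = 1.0                          # target: a delayed unit impulse
    g, *_ = np.linalg.lstsq(C, d, rcond=None)
    return g

# Equalized response (ideally close to a delayed impulse):
# e = np.convolve(h, ls_inverse_filter(h))
```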

In theory, longer FIR inverse filters may achieve better performance. But long inverse filters introduce many more free parameters that are often difficult to estimate in practice.

Estimating so many parameters sometimes leads to unstable convergence and often requires large amounts of data. A few algorithms have been proposed to derive long FIR inverse filters.

For example, Nakatani and Miyoshi [120] proposed a system capable of blind dereverberation of one-microphone speech using long FIR filters (2 s, personal communication, 2003). Good results, however, are obtained using large amounts of speech data (trained on 5240 Japanese words). In many practical situations, only relatively short FIR inverse filters can be derived. In this case, the second stage of our algorithm can be used as an add-on to many inverse-filtering based algorithms.

Although our algorithm is designed for enhancing reverberant speech using one microphone, it is straightforward to extend it to multi-microphone scenarios. Many inverse filtering algorithms, such as the algorithm by Gillespie et al. [45], were originally proposed using multiple microphones, as described in the introduction of this chapter. After inverse filtering using multiple microphones, the second stage of our algorithm, the spectral subtraction method, can be utilized effectively to reduce long-term reverberation effects.

As described in Section 4.1, Araki et al. [6] point out a fundamental limitation of frequency-domain BSS algorithms. An optimal frame length of FFT exists for a

frequency domain BSS system, and this optimal length is comparatively short when room impulse responses are long. In one of their experiments, the optimal frame length is 1024

(64 ms) for a convolutive BSS system in a room with the reverberation time of 0.3 s.

Similar to the argument we offered earlier, BSS systems employing the optimal frame sizes are unable to attenuate long-term reverberation effects from either target or interfering sound sources. Again, the second stage of our algorithm can be extended to deal with multiple sound sources by applying a convolutive BSS system and then reducing long-term reverberation effects.

In this chapter, we have presented a one-microphone reverberant speech enhancement algorithm based on inverse filtering and spectral subtraction. The qualitative and quantitative evaluations show that our algorithm enhances the quality of reverberant speech effectively and performs significantly better than a recent reverberant speech enhancement algorithm.


Figure 4.12. (a) The equalized impulse response derived from the room impulse response in Figure 4.3(b) using linear least-square inverse filtering of length 1024 (64 ms), and (b) its energy decay curve. The horizontal dotted line represents the -60 dB energy decay level. The left dashed line indicates the starting time of the impulse responses, and the right dashed line the time at which the decay curve crosses -60 dB.
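The energy decay curve in Figure 4.12(b) is conventionally obtained by Schroeder backward integration of the (equalized) impulse response, and the reverberation time can be read off where the curve crosses -60 dB. A small sketch of that computation, under the usual definition, follows.

```python
import numpy as np

def energy_decay_curve(h, fs):
    """Schroeder backward-integrated energy decay curve in dB.

    Returns the decay curve and the time (in seconds) at which it first
    crosses -60 dB, which is a direct reading of the reverberation time
    from the impulse response h sampled at rate fs.
    """
    h = np.asarray(h, dtype=float)
    energy = np.cumsum(h[::-1] ** 2)[::-1]            # backward integration of energy
    edc_db = 10.0 * np.log10(energy / energy[0] + 1e-12)
    below = np.where(edc_db <= -60.0)[0]
    t60 = below[0] / fs if below.size else None       # None if the decay never reaches -60 dB
    return edc_db, t60
```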


CHAPTER 5

CONTRIBUTIONS AND FUTURE WORK

5.1 Contributions

We have studied three aspects of speech processing in noisy and reverberant environments. First, we have proposed an algorithm for multipitch tracking in noisy speech. Next, we have suggested a pitch-based measure for reverberation time. Finally, we have presented a one-microphone reverberant speech enhancement algorithm.

In Chapter 2, our multipitch tracking algorithm for noisy speech consists of four main stages. In the first stage, a mixture of speech and interference is decomposed into a multichannel representation by cochlear filtering, and then a normalized correlogram is computed. The second stage, channel/peak selection, retains only weakly corrupted channels and valid peaks in the correlogram. A statistical method is employed to integrate the periodicity information across different frequency channels in the third stage. Finally, continuous pitch tracks are formed using a hidden Markov model (HMM). In order to facilitate a better comparison, we have suggested a pitch error measure for the multipitch situation and evaluated the algorithm on a database of speech utterances mixed with

various types of interference (white noise, "cocktail party" noise, rock music, other speech utterances, etc.). Quantitative comparisons show that our algorithm outperforms existing ones by a substantial margin.
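As a structural reminder of how these four stages fit together, the skeleton below strings them into a single pipeline. Every stage is a placeholder callable, so this is an outline of the data flow rather than an implementation of the Chapter 2 algorithm.

```python
def multipitch_track(mixture, cochlear_filterbank, normalized_correlogram,
                     select_channels_and_peaks, integrate_periodicity, hmm_track):
    """Data-flow skeleton of the four-stage multipitch tracker.

    All arguments after `mixture` are assumed callables implementing the
    stages summarized above; only their ordering is shown here.
    """
    channels = cochlear_filterbank(mixture)            # stage 1a: multichannel front end
    correlogram = normalized_correlogram(channels)     # stage 1b: periodicity per channel
    selected = select_channels_and_peaks(correlogram)  # stage 2: drop corrupted channels / invalid peaks
    frame_probs = integrate_periodicity(selected)      # stage 3: cross-channel statistical integration
    return hmm_track(frame_probs)                      # stage 4: continuous pitch tracks via HMM
```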

Three main factors contribute to the performance of our multipitch PDA. First, an improved channel and peak selection method effectively removes corrupted channels and invalid peaks. Second, a common issue in designing time-frequency domain PDAs is how to integrate periodicity information across all channels. A novel statistical integration method is proposed, and our approach is shown to be more effective than summary correlogram methods. Also, by utilizing a statistical approach, our algorithm is able to maintain multiple hypotheses with different probabilities, making the model more robust to noise interference. Finally, an HMM realizes the pitch continuity constraint, and the parameters of the HMM are learned from natural speech utterances.

In Chapter 3, our pitch-based measure for reverberation time quantifies the deleterious effects of reverberation on harmonic structure in voiced speech. By estimating pitch strength from the statistics of relative time lags in reverberant speech, our method provides an estimate of the reverberation time.

Our study explores the connection between room reverberation and a well-established notion in psychoacoustics, namely pitch, and demonstrates the monotonic relationship between the degree of reverberation and pitch strength. Also, we have shown that the statistics of relative time lags can be used as a measure of pitch strength.
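To make the relative-time-lag idea concrete, the sketch below computes, for each voiced frame, the distance from the detected pitch period to the closest correlogram peak and suggests a simple spread statistic. The peak picker and the monotonic mapping from this statistic to T60, which is learned from reverberant speech with known reverberation times, are outside the sketch.

```python
import numpy as np

def relative_time_lags(pitch_periods, correlogram_peaks):
    """Distances from detected pitch periods to the closest correlogram peaks.

    pitch_periods: detected pitch period (in lag samples) per voiced frame.
    correlogram_peaks: one array of peak lags per frame.
    A tighter spread of these lags indicates stronger pitch, and hence less
    reverberation.
    """
    lags = []
    for period, peaks in zip(pitch_periods, correlogram_peaks):
        peaks = np.asarray(peaks, dtype=float)
        if peaks.size:
            closest = peaks[np.argmin(np.abs(peaks - period))]
            lags.append(period - closest)
    return np.asarray(lags)

# Example statistic: the spread of relative time lags can serve as a
# pitch-strength feature to be mapped monotonically to reverberation time.
# strength_feature = np.std(relative_time_lags(periods, peaks_per_frame))
```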

In Chapter 4, we have presented a two-stage algorithm for reverberant speech enhancement using one microphone. The first stage employs inverse filtering to reduce

the harmful effects of coloration. Then, in the second stage, the influence of long-term reverberation is lessened by utilizing spectral subtraction.

Two main factors contribute to the performance of our reverberant speech enhancement algorithm. First, our two-stage approach recognizes the distinct properties of early and late reflections in a room impulse response and treats them separately. While the early reflections give rise to coloration distortion, the late reflections cause long-term reverberation effects. Second, a novel spectral subtraction technique is employed for estimating and subtracting the effects of long-term reverberation. Quantitative measures show that our method is effective in reducing the distortion caused by reverberation.
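For illustration, a generic spectral-subtraction style suppressor of long-term reverberation is sketched below: the late-reverberant power in each frame is approximated by a scaled, delayed copy of earlier frame power and subtracted in the power domain. This is only a schematic stand-in for the estimator of Section 4.2.2 (Equation 4.10 is not reproduced here); the frame settings, the delay, and the spectral floor are illustrative assumptions, and the default scaling factor of 0.32 merely echoes the fixed factor used in the experiments above.

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs=8000, scale=0.32, delay_frames=4,
                         nperseg=256, noverlap=128):
    """Schematic spectral subtraction of long-term reverberation.

    The late-reverberant power in each frame is approximated by `scale` times
    the power of a frame `delay_frames` earlier and subtracted, with a small
    spectral floor to avoid negative power.
    """
    x = np.asarray(x, dtype=float)
    f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(X) ** 2
    late = np.zeros_like(power)
    late[:, delay_frames:] = scale * power[:, :-delay_frames]  # delayed, scaled power estimate
    clean_power = np.maximum(power - late, 0.01 * power)       # floor 20 dB below the observed power
    Y = np.sqrt(clean_power) * np.exp(1j * np.angle(X))        # keep the observed phase
    _, y = istft(Y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y[:len(x)]
```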

5.2 Insights Gained

At the beginning of this dissertation, we set out to investigate three particular aspects of speech processing in noisy and reverberant environments: multipitch tracking for noisy speech, the bearing of reverberation on pitch strength, and reverberant speech enhancement using one microphone. A number of insights have proven to be very valuable for our research; these are summarized below.

In the study of multipitch tracking of noisy speech, we started our investigation with an attempt to understand how noise interference degrades the periodicity of the speech signal. Although the mechanism is not completely understood, we have obtained some insights from our study. As noted in Chapter 2, there are three categories of PDAs: time-domain, frequency-domain, and time-frequency domain algorithms. Due to the wide dynamic range of the speech signal in the frequency domain, we observed that, while some

channels are heavily corrupted by interference, others are hardly affected. As a result, by employing the time-frequency domain approach, not only can we compensate for the large energy differences among the frequency channels, but we can also identify the corrupted channels that impact robustness. Second, we chose a statistical approach as the computational paradigm for our algorithm. The choice is not accidental. Speech and many other natural sounds are nonstationary and complex, and they probably cannot be accurately modeled deterministically. Consequently, we apply a statistical model and learn its parameters from natural sounds such as speech. This novel approach serves us well; indeed, our statistical approach contributes substantially to the performance of our algorithm.

In the study of the relationship between reverberation and pitch, we found that reverberation corrupts harmonic structure in voiced speech and, therefore, degrades a measure of pitch strength. This understanding directly led to our pitch-based measure for reverberation time. On the other hand, the full impact of reverberation on pitch is not yet completely understood.

Finally, in the study of reverberant speech enhancement using one microphone, we set out to understand how reverberation degrades speech quality. Although this objective has not been fully achieved, we realized that the early and late reflections in a room impulse response have almost independent effects on reverberant speech. Based on this understanding, we proposed a novel algorithm for reverberant speech enhancement.

5.3 Future Work

A number of issues remain to be addressed in future research. In our studies, except for one experiment on speech degraded by reverberation and a low level of ambient white noise (see Section 4.3.2), we assume either reverberation- or noise-free conditions.

Realistic acoustic environments, however, tend to be both noisy and reverberant.

Several concerns exist when we extend our algorithm for multipitch tracking to reverberant conditions. First, reverberation further degrades the speech signal and the degradation is not uniform in low- and high-frequency channels. Due to the fast-changing nature of speech signal in high-frequency channels, harmonic structure in these channels tends to be severely degraded in moderate to severe reverberant conditions. An effective algorithm for multipitch tracking in noisy and reverberant environments needs to overcome this challenge. Here, it is interesting to note that the lower harmonics of a complex sound dominate the perceptual judgment of pitch [140, 141]. Second, as briefly described in the introduction chapter and in Chapter 4, room reverberation elongates the harmonic structure and, therefore, the pitch tracks in reverberant speech. A study by

Nakatani and Miyoshi [120] suggests that this problem may be alleviated by employing the second stage of our one-microphone reverberant speech enhancement algorithm, spectral subtraction, described in Section 4.2.2. By suppressing the effects of reverberation tails, the elongated pitch tracks may be corrected.

Next, our pitch-based reverberation measure can potentially be extended to multiple simultaneous sound sources. A multipitch tracking algorithm, such as the algorithm described in Chapter 2, may be used for obtaining multiple pitch tracks in a mixture of

reverberant signals, and pitch strength may be extracted by collecting the statistics of relative time lags based on the detected multiple pitch tracks. Chapter 3 represents a first step, and further performance improvements are expected.

Finally, our one-microphone reverberant speech enhancement algorithm, presented in

Chapter 4, can potentially be extended to deal with multiple acoustic sources. As discussed in Section 4.4, a convolutive BSS system may be used as the first stage of the extended algorithm to increase the target-to-interference energy ratio and reduce coloration effects. The second stage, extended from the second stage of our reverberant speech enhancement algorithm, may then be utilized to reduce the long-term reverberation effects from both target and interfering sound sources.

5.4 Concluding Remarks

As laid out in Section 1.2, our investigation strives to achieve a performance level comparable to that of humans. This goal, however, has only been partially accomplished.

It is hard to quantify human performance of tracking pitch in noisy speech; complex pitch patterns in speech utterances and simultaneous pitch contours make it difficult to conduct perceptual experiments. Consequently, most psychoacoustic experiments on pitch perception utilize acoustic signals with steady pitch or with simple pitch patterns, such as pitch glides. Some indirect evidence on multipitch perception is given by "double-vowel" experiments: subjects identify concurrent steady-state synthetic vowels better when they differ in pitch than when they are the same (for example, see [7, 8, 169]). Tracking multiple speakers in a noisy environment, however, is considerably more difficult for

human listeners. We are probably only able to track a single speaker in a noisy environment, such as at a cocktail party. Our multipitch tracking algorithm can reliably track the dominant pitch contours. If we equate pitch tracking and speaker tracking, our algorithm may have reached a comparable performance level.

Human ability to understand speech in a reverberant environment is remarkable; moderately reverberant speech is highly intelligible for normal listeners. Although reverberation effects are substantially reduced by our reverberant speech enhancement algorithm, we think that more progress is needed to achieve human-level performance.

Looking back now, I can say that this dissertation represents a very rewarding journey of scientific exploration for me. Although there are clearly unsolved issues in the topics addressed in this dissertation, I am satisfied by the progress made and the new understanding achieved. I can only hope that some of the ideas and algorithms proposed in the dissertation will also prove to be worthwhile for the speech processing community at large.


BIBLIOGRAPHY

[1] http://www-sigproc.eng.cam.ac.uk/~jrh1008/Resources/ImpulseResponses/NF_Welcome.html.

[2] J. B. Allen, "Effects of small room reverberation on subjective preference," J. Acoust. Soc. Amer., vol. 71, pp. S5, 1982.

[3] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small- room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943-950, 1979.

[4] J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals," J. Acoust. Soc. Amer., vol. 62, pp. 912-915, 1977.

[5] T. Arai, K. Kinoshita, N. Hodoshima, A. Kusumoto, and T. Kitamura, "Effects of suppressing steady-state portions of speech on intelligibility in reverberant environments," Acoustical Science and Technology, vol. 23, pp. 229-232, 2002.

[6] S. Araki, R. Mukai, S. Makino, T. Nishikara, and H. Saruwatari, "The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech," IEEE Trans. Speech Audio Processing, vol. 11, pp. 109-116, 2003.

[7] P. F. Assmann and Q. Summerfield, "Modelling the perception of concurrent vowels: vowels with the same fundamental freqency.," J. Acoust. Soc. Amer., vol. 85, pp. 327-338, 1989.

[8] P. F. Assmann and Q. Summerfield, "Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Amer., vol. 88, pp. 680-697, 1990.

[9] American Standards Association, Acoustical Terminology SI, New York, NY: American Standards Association, 1960.


[10] C. Avendano and H. Hermansky, "Study on the dereverberation of speech based on temporal envelope filtering," in Proc. ICSLP, 1996, pp. 889-892.

[11] J. K. Baker, "The DRAGON system - An overview," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 23, pp. 23-29, 1975.

[12] J. K. Baker, "Stochastic modeling for automatic speech understanding," in Speech Recognition, D. E. Reddy, Ed., New York, NY: Academic Press, 1975, pp. 521- 542.

[13] A. K. Barros, T. Rutkowski, F. Itakura, and N. Ohnishi, "Estimation of speech embedded in a reverberant and noisy environment by independent component analysis and Wavelets," IEEE Trans. Neural Networks, vol. 13, pp. 888-893, 2002.

[14] D. Bees, M. Blostein, and P. Kabal, "Reverberant speech enhancement using cepstral processing," in Proc. IEEE ICASSP, 1991, pp. 977-980.

[15] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind source separation and blind deconvolution," Neural Computation, vol. 7, pp. 1129-1159, 1995.

[16] D. A. Berkley and J. B. Allen, "Normal listening in typical rooms: The physical and psychophysical correlates of reverberation," in Acoustical factors affecting hearing aid performance, G. A. Studebaker and I. Hochberg, Eds., 2nd ed., Needham Heights, MA: Allyn and Bacon, 1993, pp. 3-14.

[17] G. Borè, "Kurzton-meßverfahren zur punktweisen ermittlung der sprachverständlichkeit in lautsprecherbeschallten räumen," Technische Hochshule, Aachen, 1956.

[18] M. S. Brandstein, "On the use of explicit speech modeling in microphone array applications," in Proc. IEEE ICASSP, 1998, pp. 3613-3616.

[19] M. S. Brandstein, "Time-delay estimation of reverberated speech exploiting harmonic structure," J. Acoust. Soc. Amer., vol. 105, pp. 2914-2919, 1999.

[20] M.S. Brandstein and S. Griebel, "Explicit speech modeling for microphone array applications," in Microphone arrays: Signal processing techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., New York, NY: Springer Verlag, 2001, pp. 133-153.

[21] M. S. Brandstein and D. B. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, New York, NY: Springer Verlag, 2001.

[22] A. S. Bregman, Auditory scene analysis: The perceptual organization of sound, Cambridge, MA: The MIT Press, 1990.

[23] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8, pp. 297-336, 1994.

[24] C. Cerisara, "Towards missing data recognition with cepstral features," in Proc. EUROSPEECH, 2003, pp. 3057-3060.

[25] D. Chazan, Y. Stettiner, and D. Malah, "Optimal multi-pitch estimation using the EM algorithm for co-channel speech separation," in Proc. IEEE ICASSP, 1993, pp. II-728-II-731.

[26] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Amer., vol. 25, pp. 975-979, 1953.

[27] A. de Cheveigné, "Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing," J. Acoust. Soc. Amer., vol. 93, pp. 3271-3290, 1993.

[28] A. de Cheveigné and H. Kawahara, "Multiple period estimation and pitch perception model," Speech Communication, vol. 27, pp. 175-185, 1999.

[29] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.

[30] M.P. Cooke, P. Green, L. Losifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, pp. 267-285, 2001.

[31] T. J. Cox, F. Li, and P. Darlington, "Extracting room reverberation time from speech using artificial neural networks," J. Audio Eng. Soc., vol. 49, pp. 219-230, 2001.

[32] D. B. Ward, R. Kennedy, and R. C. Williamson, "Constant directivity beamforming," in Microphone arrays: Signal processing techniques and applications, M. S. Brandstein and D. B. Ward, Eds., New York, NY: Springer Verlag, 2001, pp. 3-17.

[33] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-time processing of speech signals, Upper Saddle River, NJ: Prentice-Hall, 1987.


[34] L. A. Drake, "Sound source separation via computational auditory scene analysis (CASA)-enhanced beamforming," Ph.D. Dissertation, Department of Electrical Engineering, Northwestern University, 2001.

[35] D. P. W. Ellis, "Prediction-driven computational auditory scene analysis," Ph.D. dissertation, Department of Electrical and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1996.

[36] W. M. Fisher, G. R. Doddington, and K. M. Goudie-Marshall, "The DARPA speech recognition research database: Specifications and status," in Proc. the DARPA Speech Recognition Workshop, 1986, pp. 93-99.

[37] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," J. Acoust. Soc. Amer., vol. 78, pp. 1508-1518, 1985.

[38] J. L. Flanagan and M. G. Saslow, "Pitch discrimination for synthetic vowels," J. Acoust. Soc. Amer., vol. 30, pp. 435-442, 1958.

[39] J. L. Flanagan, A. Surendran, and E. Jan, "Spatially selective sound capture for speech and audio processing," Speech Communication, vol. 13, pp. 207-222, 1993.

[40] H. Fletcher and J. C. Steinberg, "Articulation testing methods," Bell Sys. Tech. Journal, vol. 8, pp. 806-854, 1929.

[41] N. R. French and J. C. Steinberg, "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Amer., vol. 19, pp. 90-119, 1947.

[42] K. Furuya and Y. Kaneda, "Two-channel blind deconvolution for non-minimum phase impulse responses," in Proc. IEEE ICASSP, 1997, pp. 1315-1318.

[43] P. J. Gastellano, S. Sridharan, and D. Cole, "Speaker recognition in reverberant enclosures," in Proc. ICASSP, 1996, pp. 117-120.

[44] B. W. Gillespie and L. E. Atlas, "Acoustic diversity for improved speech recognition in reverberant environments," in Proc. IEEE ICASSP, 2002, pp. 557- 560.

[45] B. W. Gillespie, H. S. Malvar, and D. A. F. Florêncio, "Speech dereverberation via maximum-kurtosis subband adaptive filtering," in Proc. IEEE ICASSP, 2001, pp. 3701-3704.

[46] B. W. Gillespie and L. E. Atlas, "Strategies for improving audible quality and speech recognition accuracy of reverberant speech," in Proc. IEEE ICASSP, 2003, pp. 676-679.

[47] D. Giuliani, M. Omologo, and P. Svaizer, "Experiments of speech recognition in a noisy and reverberant environment using a microphone array and HMM adaptation," in Proc. ICSLP, 1996, pp. 1329-1332.

[48] B. Gold, "Digital speech networks," Proc. IEEE, vol. 65, pp. 1636-1658, 1977.

[49] B. Gold and N. Morgan, Speech and Audio Signal Processing, New York, NY: John Wiley & Sons, 2000.

[50] J. González-Rodríguez, J. Ortega-García, C. Martín, and L. Hernández, "Increasing robustness in GMM speaker recognition systems for noisy and reverberant speech with low complexity microphone arrays," in Proc. ICASSP, 1996, pp. 1333-1336.

[51] Y. H. Gu, "Linear and nonlinear adaptive filtering and their application to speech intelligibility enhancement," Ph.D. Dissertation, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands, 1992.

[52] Y. H. Gu and W. M. G. van Bokhoven, "Co-channel speech separation using frequency bin non-linear adaptive filter," in Proc. IEEE ICASSP, 1991, pp. 949- 952.

[53] H. Haas, "Über den Einfluss des Einfachechos auf die Hörsamkeit von Sprache," Acustica, vol. 1, pp. 49-58, 1951.

[54] V. Hamacher, "Comparison of advanced monaural and binaural noise reduction algorithms for hearing aids," in Proc. IEEE ICASSP, 2002, pp. 4008-4011.

[55] D. J. Hand and K. Yu, "Idiot's Bayes - not so stupid after all?," International Statistical Review, vol. 69, pp. 385-398, 2001.

[56] J. Hardwick, "The dual excitation speech model," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1992.

[57] S. Haykin, Unsupervised adaptive filtering, New York: Wiley, 2000.

[58] S. Haykin, Adaptive filter theory, 4th ed., Upper Saddle River, New Jersey: Prentice Hall, 2002.

[59] H. Helmholtz, On the Sensations of Tone as a Physiological Basis for the Theory of Music, A. J. Ellis, Ed., New York, NY: Dover, 1954.

[60] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Processing, vol. 2, pp. 578-589, 1994.

[61] D. J. Hermes, "Measurement of pitch subharmonic summation," J. Acoust. Soc. Amer., vol. 83, pp. 257-264, 1988.

[62] D. J. Hermes, "Pitch analysis," in Visual Representations of Speech Signals, M. P. Cooke and M. Crawford, Eds., London, UK: Wiley, 1992.

[63] W. J. Hess, "A pitch-synchronous digital feature extraction system for phonemic recognition of speech," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 24, pp. 14-25, 1976.

[64] W.J. Hess, Pitch Determination of Speech Signals, New York, NY: Springer, 1983.

[65] W.J. Hess, "Pitch and voicing determination," in Advances in speech signal processing, S. Furui and M. M. Sondhi, Eds., New York, NY: Marcel Dekker, 1992, pp. 3-48.

[66] H. Hollien, "Three major vocal registers: a proposal," in Proc. the 7th International Congress on Phonetic Sciences, 1972, pp. 320-331.

[67] T. Houtgast and H. J. M. Steeneken, "A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria," J. Acoust. Soc. Amer., vol. 77, pp. 1069-1077, 1985.

[68] D. M. Howard, "Peak-picking fundamental period estimation for hearing prostheses," J. Acoust. Soc. Amer., vol. 86, pp. 902-910, 1989.

[69] G. Hu and D. L. Wang, "Monaural speech separation," in Proc. NIPS, 2002, pp. 1221-1228.

[70] X. Huang, A. Acero, and H.W. Hon, Spoken language processing: a guide to theory, algorithm, and system development, Upper Saddle River, NJ: Prentice Hall, 2001.

[71] M. J. Hunt and C. Lefèbvre, "Speaker dependent and independent speech recognition experiments with an auditory model," in Proc. IEEE ICASSP, 1988, pp. 215-218.

[72] IEEE, "IEEE recommended practice for speech quality measurements," IEEE Trans. on Audio and Electroacoust., pp. 227-246, 1969.

[73] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 67- 72, 1975.

[74] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, vol. 64, pp. 532-556, 1976.

[75] F. Jelinek, Statistical methods for speech recognition, Cambridge, Massachusetts U.S.A.: The MIT Press, 1997.

[76] F. Jelinek, L. R. Bahl, and R. L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Trans. Info. Theory, vol. 21, pp. 250-256, 1975.

[77] J. J. Jetzt, "Critical distance measurement of rooms from the sound energy spectral response," J. Acoust. Soc. Amer., vol. 65, pp. 1204-1211, 1979.

[78] D. L. Johnson and H. V. Gierke, "Audibility of infrasound," J. Acoust. Soc. Amer., vol. 56, pp. S37, 1974.

[79] J. F. Kaiser, "On a simple algorithm to calculate the `energy' of a signal," in Proc. IEEE ICASSP, 1990, pp. 381-384.

[80] B. E. D. Kingsbury, "Perceptually inspired signal processing strategies for robust speech recognition in reverberant environments," Ph.D. dissertation, the Department of Computer Science, University of California, Berkeley, Berkeley, CA, 1998.

[81] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using modulation spectrogram," Speech Communication, vol. 25, pp. 117-132, 1998.

[82] D. Klatt, "Prediction of perceived phonetic distance from critical-band spectra: A first step," in Proc. IEEE ICASSP, Paris, 1982, pp. 1278-1281.

[83] V.O. Knudsen, "The hearing of speech in auditoriums," J. Acoust. Soc. Amer., vol. 1, pp. 56-82, 1929.

[84] A. H. Koenig, J. B. Allen, D. A. Berkley, and T. H. Curtis, "Determination of masking level differences in an reverberant environment," J. Acoust. Soc. Amer., vol. 61, pp. 1374-1376, 1977.


[85] E. J. Kreul, J. C. Nixon, K. D. Kryter, D. W. Bell, J. L. Land, and E. D. Schubert, "A proposed clinical test of speech discrimination," Journal of Speech and Hearing Research, vol. 11, pp. 536-552, 1968.

[86] A. K. Krishnamurthy and D. G. Childers, "Two-channel speech analysis," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, pp. 730-743, 1986.

[87] D. A. Krubsack and R. J. Niederjohn, "An autocorrelation pitch detector and voicing decision with confidence measures developed for noise-corrupted speech," IEEE Trans. Signal Processing, vol. 39, pp. 319-329, 1991.

[88] N. Kunieda, T. Shimamura, and J. Suzuki, "Pitch extraction by using autocorrelation function on the log spectrum," Electronics and Communications in Japan, Part 3, vol. 83, pp. 90-98, 2000.

[89] H. Kuttruff, Room Acoustics, 4th ed., New York, NY: Spon Press, 2000.

[90] Y.-H. Kwon, D. J. Park, and B. C. Ihm, "Simplified pitch detection algorithm of mixed speech signals," in Proc. IEEE ISCAS, 2000, pp. III-722-III-725.

[91] T. Langhans and H. W. Stube, "Speech enhancement by nonlinear multiband envelope filtering," in Proc. IEEE ICASSP, 1982, pp. 156-159.

[92] R. G. Leonard, "A database for speaker-independent digit recognition," in Proc. IEEE ICASSP, 1984, pp. 111-114.

[93] J. D. R. Licklider, "A duplex theory of pitch perception," Experientia, vol. 7, pp. 128-134, 1951.

[94] J. D. R. Licklider, "Three auditory theories," in Psychology: A Study of a Science, S. Koch, Ed., New York, NY: McGraw-Hill, 1959.

[95] C. Lieske, B. Bos, B. Gambäck, M. Emele, and C. J. Rupp, "Giving prosody a meaning," in Proc. European Conference on Speech Communication and Technology, 1997, pp. 1431-1434.

[96] D. J. Liu and C. T. Lin, "Fundamental frequency estimation based on the joint time-frequency analysis of harmonic spectral structure," IEEE Trans. Speech Audio Processing, vol. 9, pp. 609-621, 2001.

[97] R. F. Lyon, "Computational models of neural auditory processing," IEEE Proc., vol. 36, pp. 1-4, 1984.

[98] C. Marro, Y. Mahieux, and K. U. Simmer, "Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering," IEEE Trans. Speech Audio Processing, vol. 6, pp. 240-259, 1998.

[99] Ph. Martin, "Détection de F0 par intercorrélation avec une fonction peigne," Journées d'Étude sur la Parole, vol. 12, pp. 221-232, 1981.

[100] T. Matsui and S. Furui, "Text-independent speaker recognition using vocal tract and pitch information," in Proc. ICSLP, 1990, pp. 137-140.

[101] B. J. McDermott, "Multidimensional analysis of circuit quality judgments," J. Acoust. Soc. Amer., vol. 45, pp. 774-781, 1969.

[102] B. J. McDermott, C. Scagliola, and D. J. Goodman, "Perceptual and objective evaluation of speech processed by adaptive differential PCM," in Proc. IEEE ICASSP, Tulsa, OK, 1978, pp. 581-585.

[103] B. J. McDermott, C. Scagliola, and D. J. Goodman, "Perceptual and objective evaluation of speech processed by adaptive differential PCM," Bell Sys. Tech. Journal, vol. 57, pp. 1597-1619, 1978.

[104] N. P. McKinney, "Laryngeal frequency analysis for linguistic research," University of Michigan, Ann Arbor, Michigan 1965.

[105] Y. Medan, E. Yair, and D. Chazan, "Super resolution pitch determination of speech signals," IEEE Trans. Signal Processing, vol. 39, pp. 40-48, 1991.

[106] R. Meddis and M. J. Hewitt, "Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification," J. Acoust. Soc. Am., vol. 89, pp. 2866-2882, 1991.

[107] R. Meddis and M. J. Hewitt, "Modeling the identification of concurrent vowels with different fundamental frequencies," J. Acoust. Soc. Amer., vol. 91, pp. 233- 244, 1992.

[108] http://www.microsoft.com/speech.

[109] M. Miyoshi and Y. Kaneda, "Inverse filtering of room impulse response," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 145-152, 1988.

[110] R. B. Monsen and A. M. Engebretson, "Study of variations in the male and female glottal wave," J. Acoust. Soc. Amer., vol. 62, pp. 981-993, 1977.

[111] B. C. J. Moore, An Introduction to the Psychology of Hearing, 1st ed., New York, NY: Academic, 1977.

[112] B. C. J. Moore, An introduction to the psychology of hearing, 4th ed., San Diego, CA: Academic Press, 1997.

[113] M. Morner, F. Fransson, and C. Fant, Voice register terminology and standard pitch, vol. 4, Stockholm, Sweden: Royal Institute of Technology, 1964.

[114] A. K. Nábelek, "Identification of vowels in quiet, noise, and reverberation: Relationships with age and hearing loss," J. Acoust. Soc. Amer., vol. 84, pp. 476- 484, 1988.

[115] A. K. Nábelek, "Communication in noisy and reverberant environments," in Acoustical factors affecting hearing aid performance, G. A. Stubebaker and I. Hochberg, Eds., 2nd ed., Needham Height, MA: Allyn and Bacon, 1993.

[116] A. K. Nábelek and P. A. Dagenais, "Vowel errors in noise and in reverberation by hearing-impaired listeners," J. Acoust. Soc. Amer., vol. 80, pp. 741-748, 1986.

[117] A. K. Nábelek, T. R. Letowski, and F. M. Tucker, "Reverberant overlap and self- masking in consonant identification," J. Acoust. Soc. Amer., vol. 86, pp. 1259- 1265, 1989.

[118] A. K. Nábelek and D. Mason, "Effect of noise and reverberation on binaural and monaural word identification by subjects with various audiograms," Journal of Speech and Hearing Research, vol. 24, pp. 375-383, 1981.

[119] A. K. Nábelek and P. K. Robinson, "Monaural and binaural speech perception in reverberation for listeners of various ages," J. Acoust. Soc. Amer., vol. 71, pp. 1242-1248, 1982.

[120] T. Nakatani and M. Miyoshi, "Blind dereverberation of single channel speech signal based on harmonic structure," in Proc. IEEE ICASSP, 2003, pp. 92-95.

[121] S. T. Neely and J. B. Allen, "Invertibility of a room impulse response," J. Acoust. Soc. Amer., vol. 66, pp. 165-169, 1979.

[122] T. I. Niaounakis and W. J. Davies, "Perception of reverberation time in small listening rooms," J. Audio Eng. Soc., vol. 50, pp. 343-350, 2002.

[123] H. Niemann, E. Nöth, A. Keißling, and R. Batliner, "Prosodic processing and its use in Verbmobil," in Proc. IEEE ICASSP, 1997, pp. 75-78.

[124] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Amer., vol. 41, pp. 293-309, 1967.

[125] S. Nooteboom, "The prosody of speech: melody and rhythm," in The Handbook of Phonetic Science, W. J. Hardcastle and J. Laver, Eds., Cambridge, MA: Blackwell Publishers, 1997, pp. 640-673.

[126] M. Omologo, P. Svaizer, and M. Matassoni, "Environmental conditions and acoustic transduction in hands-free speech recognition," Speech Communication, vol. 25, pp. 75-95, 1998.

[127] A. V. Oppenheim and R. W. Schafer, Discrete-time signal processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.

[128] K. J. Palomäki, G. J. Brown, and J. Barker, "Missing data speech recognition in reverberant conditions," in Proc. IEEE ICASSP, Orlando, FL, 2002, pp. 65-68.

[129] E. Parris and M. Carey, "Language independent gender identification," in Proc. IEEE ICASSP, 1996, pp. 685-688.

[130] T. W. Parsons, "Separation of speech from interfering speech by means of harmonic selection," J. Acoust. Soc. Amer., vol. 60, pp. 911-918, 1976.

[131] R.D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Price, "APU Report 2341: An Efficient Auditory Filterbank Based on the Gammatone Function," Applied Psychology Unit, Cambridge 1988.

[132] P. Fernández-Cid and F. J. Casajús-Quirós, "Multi-pitch estimation for polyphonic musical signals," in Proc. IEEE ICASSP, 1998, pp. 3565-3568.

[133] S. R. Quackenbush, T. P. Barnwell, III, and M. A. Clements, Objective measures of speech quality, Englewood Cliffs, NJ: Prentice Hall, 1988.

[134] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and A. McGonegal, "A comparative study of several pitch detection algorithms," IEEE Trans. Acoust., Speech, Signal Processing, vol. 23, pp. 552-557, 1976.

[135] L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, "Applications of nonlinear smoothing algorithm to speech processing," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 23, pp. 552-557, 1975.

[136] D. V. Rabinkin, R. J. Renomeron, J. L. Flanagan, and D. F. Macomber, "Optimal truncation time for matched filter array processing," in Proc. IEEE ICASSP, 1998, pp. 3629-3632.


[137] P. Raghavan, R. J. Renomeron, C. Che, D-S. Yuk, and J. L. Flanagan, "Speech recognition in a reverberant environment using matched filter array (MFA) processing and linguistic-tree maximum likelihood linear regression (LT-MLLR) adaptation," in Proc. IEEE ICASSP, 1999, pp. 777-780.

[138] R. Ratnam, D. L. Jones, B. C. Wheeler, and A. S. Feng, "Online estimation of room reverberation time," J. Acoust. Soc. Amer., vol. 113, pp. 2269, 2003.

[139] R. J. Renomeron, D. V. Rabinkin, J. C. French, and J. L. Flanagan, "Small-scale matched filter array processing for spatially selective sound capture," J. Acoust. Soc. Amer., vol. 102, pp. 3208-3217, 1997.

[140] R. J. Ritsma, "Frequencies dominant in the perception of the pitch of complex sounds," J. Acoust. Soc. Amer., vol. 42, pp. 191-198, 1967.

[141] R. J. Ritsma, "Periodicity detection," in Frequency Analysis and Periodicity Detection in Hearing, R. Plomp and G. F. Smoorenburg, Eds., Sijthoff, Leiden, 1970.

[142] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 22, 1974.

[143] J. Rouat, Y. C. Liu, and D. Morissette, "A pitch determination and voiced/unvoiced decision algorithm for noisy speech," Speech Communication, vol. 21, pp. 191-207, 1997.

[144] M. R. Schroeder, "New method of measuring reverberation time," J. Acoust. Soc. Amer., vol. 37, pp. 409-412, 1965.

[145] M. R. Schroeder, "Period histogram and product spectrum: new methods for fundamental-frequency measurement," J. Acoust. Soc. Amer., vol. 43, pp. 829- 834, 1968.

[146] B. G. Secrest and G. R. Doddington, "Postprocessing techniques for voice pitch trackers," in Proc. IEEE ICASSP, 1982, pp. 172-175.

[147] H. L. Shaffer, "Information rate necessary to transmit pitch-period durations for connected speech," J. Acoust. Soc. Amer., vol. 36, pp. 1895-1900, 1964.

[148] Y. Shao and D. L. Wang, "Co-channel speaker identification using usable speech extraction based on multi-pitch tracking," in Proc. ICASSP, 2003, pp. 205-208.

[149] T. Shimamura and J. Kobayashi, "Weighted autocorrelation for pitch extraction of noisy speech," IEEE Trans. Speech Audio Processing, vol. 9, pp. 727-730, 2001.

[150] T. Takagi, N. Seiyama, and E. Miyasaka, "A method for pitch extraction of speech signals using autocorrelation functions through multiple window lengths," Electronics and Communications in Japan, Part 3, vol. 83, pp. 67-79, 2000.

[151] O. Tanrikulu and A. G. Constantinides, "Least-mean kurtosis: a novel higher- order statistics based adaptive filtering algorithm," Electronics Letters, vol. 30, pp. 189-190, 1994.

[152] R. Thiele, "Richtungsverteilung und zeitfolge der schallrückwürfe in räumen," Acustica, vol. 3, pp. 291-302, 1953.

[153] A. Thymé-Gobbel and S. Hutchins, "On using prosodic cues in automatic language identification," in Proc. ICSLP, 1996, pp. 1768-1771.

[154] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in Proc. IEEE ICASSP, 1999, pp. 229-232.

[155] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Processing, vol. 8, pp. 708-716, 2000.

[156] J. M. Tribolet, P. Noll, and B. J. McDermott, "A study of complexity and quality of speech waveform coders," in Proc. IEEE ICASSP, Tulsa, OK, 1978, pp. 586- 590.

[157] L. M. Van Immerseel and J. P. Martens, "Pitch and voiced/unvoiced determination with an auditory model," J. Acoust. Soc. Amer., vol. 91, pp. 3511- 3526, 1992.

[158] B. D. Van Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," in IEEE ASSP Mag., vol. 5, 1988, pp. 4-24.

[159] H. Wallach, E. B. Newman, and M. R. Rosenzweig, "The precedence effect in sound localization," Amer. J. Psych., vol. 62, pp. 315-337, 1949.

[160] C. Wang and S. Seneff, "Robust pitch tracking for prosodic modeling in telephone speech," in Proc. IEEE ICASSP, 2000, pp. 1343-1346.

[161] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Networks, vol. 10, pp. 684-697, 1999.


[162] D. L. Wang and J. S. Lim, "The unimportance of phase in speech enhancement," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 30, pp. 679-681, 1982.

[163] M. Weintraub, "A computational model for separating two simultaneous talkers," in Proc. IEEE ICASSP, 1986, pp. 81-84.

[164] M. R. Weiss, A. E. Aschkenasy, and T. W. Parsons, "Study and development of INTEL technique for improving speech intelligibility," Nicolet Scientific Corp. Final Rep. NSC-FR/4023, 1974.

[165] M. Wu and D. L. Wang, "A one-microphone algorithm for reverberant speech enhancement," in Proc. IEEE ICASSP, 2003, pp. 844-847.

[166] M. Wu, D. L. Wang, and G. J. Brown, "Pitch tracking based on statistical anticipation," in Proc. IJCNN, 2001, pp. 866-871.

[167] M. Wu, D. L. Wang, and G. J. Brown, "A multi-pitch tracking algorithm for noisy speech," IEEE Trans. Speech Audio Processing, vol. 11, pp. 229-241, 2003.

[168] B. Yegnanarayana and P. S. Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Processing, vol. 8, pp. 267-281, 2000.

[169] U. T. Zwicker, "Auditory recognition of diotic and dichotic vowel pairs," Speech Communication, vol. 3, pp. 265-277, 1984.
