
Factors Influencing the Prediction of Intelligibility

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Sarah Yoho Leopold, B.A.

Graduate Program in Speech and Hearing Science

The Ohio State University

2016

Dissertation Committee:

Eric W. Healy, Advisor

Rachael Frush Holt

DeLiang Wang

© Copyright by

Sarah Yoho Leopold

2016

Abstract

The three manuscripts presented here examine the relative importance of various ‘critical bands’ of speech, as well as their susceptibility to the corrupting influence of background noise. In the first manuscript, band-importance functions derived using a novel technique are compared to the standard functions given by the Speech Intelligibility Index (ANSI, 1997). The functions derived with the novel technique show a complex ‘microstructure’ not present in previous functions, possibly indicating an increased accuracy of the new method. In the second manuscript, this same technique is used to examine the effects of individual talkers and types of speech material on the shape of the band-importance functions. Results indicate a strong effect of speech material, but a smaller effect of talker. In addition, the use of ten talkers of different genders appears to greatly diminish any effect of individual talker. In the third manuscript, the susceptibility to noise of individual critical bands of speech was determined by systematically varying the signal-to-noise ratio in each band. The signal-to-noise ratio that resulted in a criterion decrement in intelligibility for each band was determined. Results from this study indicate that noise susceptibility is not equal across bands, as has been assumed. Further, noise susceptibility appears to be independent of the relative importance of each band. Implications for future applications of these data are discussed.

Dedication

Dedicated to my mother, who instilled in me strength, resilience, and a lifetime love of learning.

Acknowledgements

Foremost, I would like to thank my advisor, Eric Healy, for his immeasurable and invaluable support and guidance throughout my graduate studies. Without his encouragement and direction, I would not be where I am today. I can never adequately express my gratitude.

I would also like to extend a special thank you to my dissertation committee members, Rachael Frush Holt and DeLiang Wang. They were both integral in my growth and success during my graduate career.

I have a very special place in my heart for my laboratory colleagues, who have been there for me both professionally and personally. Carla Youngdahl and Frederic Apoux have been great colleagues and even better friends. A sincere thank you to Brittney Carter, Jordan Vasko, and Shuang Liu.

There have been several other individuals who have made my time at Ohio State both successful and truly enjoyable. There are many, but my gratitude goes especially to Janet Weisenberger, Robert Fox, Jason Johnson, Gail Whitelaw, Christy Goodman, Pete Eichel, Christina Roup, Lawrence Feth, Yuxuan Wang, and Jitong Chen.

Lastly, and no less importantly, I want to thank my family. Their support through the years has allowed me to follow my heart and spend the time and effort necessary to succeed. To my husband, Jordan, who has prioritized my career and happiness above all else: thank you. My love and endless gratitude to my father, Brad, my brother, Matt, my uncle, John, and my grandparents John and Barbara. Finally, I want to acknowledge the decades of love and unwavering encouragement shown to me by my late mother, Lynn. She always made me believe that what I did was special and worthwhile, and I would be nowhere without the confidence and strength she instilled in me.

Vita

January 6, 1988……………………………………...Born

2009…………………………………………………B.A. Speech and Hearing Science, The Ohio State University, Columbus, Ohio

2009 - 2015…………………………………………Graduate Research Associate, Speech Psychoacoustics Laboratory, The Ohio State University, Columbus, Ohio

2009 - 2015…………………………………………Graduate Teaching Associate, Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio

2015…………………………………………………Presidential Fellow, The Ohio State University, Columbus, Ohio

Publications

Healy, E.W., Yoho, S.E., Chen, J., Wang, Y., & Wang, D.L. (2015). An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type. Journal of the Acoustical Society of America, 138, 1660-1669.

Apoux, F., Youngdahl, C.L., Yoho, S.E., & Healy, E.W. (2015). Dual-carrier processing to convey temporal fine structure cues: Implications for cochlear implants. Journal of the Acoustical Society of America, 138, 1469-1480.

Healy, E. W., Yoho, S. E., Wang, Y., Apoux, F., & Wang, D.L. (2014). Speech-cue transmission by an algorithm to increase consonant recognition in noise for hearing-impaired listeners. Journal of the Acoustical Society of America, 136, 3325-3336.

Mandel, M. I., Yoho, S. E., & Healy, E. W. (2014). Generalizing time-frequency importance functions across noises, talkers, and phonemes. Proceedings of INTERSPEECH 2014, 2016-2020.

Healy, E. W., Yoho, S. E., Wang, Y., & Wang, D. L. (2013). An algorithm to improve speech recognition in noise for hearing-impaired listeners. Journal of the Acoustical Society of America, 134, 3029-3038.

Apoux, F., Yoho, S. E., Youngdahl, C. L., & Healy, E. W. (2013). Role and relative contribution of temporal envelope and fine structure cues in sentence recognition by normal-hearing listeners. Journal of the Acoustical Society of America, 134, 2205-2212.

Healy, E. W., Yoho, S. E., Youngdahl, C. L., & Apoux, F. (2013). Talker effects in speech band importance functions. Proceedings of Meetings on Acoustics, 19, 050066, pp. 1-6.

Apoux, F., Yoho, S. E., Youngdahl, C. L., & Healy, E. W. (2013). Can envelope recovery account for speech recognition based on temporal fine structure? Proceedings of Meetings on Acoustics, 19, 050072, pp. 1-6.

Healy, E. W., Yoho, S. E., & Apoux, F. (2013). Band importance for sentences and words reexamined. Journal of the Acoustical Society of America, 133, 463-473.

Field of Study

Major Field: Speech and Hearing Science

Table of Contents

Abstract
Dedication
Acknowledgements
Vita
List of Tables
List of Figures

Chapter 1: Introduction

Chapter 2: Review of Literature
    I. Normal Speech Perception
    II. Auditory Filters
    III. Spectral and Temporal Speech Information
    IV. Glimpsing Theory of Speech Perception
    V. Effects of Sensorineural Hearing Impairment
        A. Impact of Spectral Processing Deficits
        B. Other Deficits and Their Impacts
        C. Specific Factors for Speech Perception in Hearing Impaired Listeners
    VI. Articulation Index/Band Importance Functions
        A. Assumptions and Considerations
        B. Band Importance Functions
        C. Techniques for Deriving Band Importance Functions

Chapter 3: Overview of Current Studies

Chapter 4: Manuscript 1: Band Importance Functions Reexamined
    I. Introduction
    II. Experiment 1. High- and Low-Predictability SPIN Sentences
        A. Method
            1. Subjects
            2. Stimuli
            3. Procedure
        B. Results
    III. Experiment 2. Phonetically-Balanced Words
        A. Method
            1. Subjects
            2. Stimuli and Procedure
        B. Results
    IV. Discussion
    V. Summary and Conclusions

Chapter 5: Manuscript 2: Talker and Speech Material Effects in Band Importance Functions
    I. Introduction
    II. Experiment 1. Single vs. Ten-Talker Sentences
        A. Method
            1. Subjects
            2. Stimuli and Procedure
        B. Results
    III. Experiment 2. Different Talkers
        A. Method
            1. Subjects
            2. Stimuli and Procedure
        B. Results
    IV. Experiment 3. Different Materials
        A. Method
            1. Subjects
            2. Stimuli and Procedure
        B. Results
    V. Discussion

Chapter 6: Manuscript 3: Noise Susceptibility of Speech Critical Bands
    I. Introduction
    II. Method
        A. Subjects
        B. Stimuli
        C. Procedure
            1. Band Importance Method
            2. Noise Susceptibility Method
    III. Results
    IV. Discussion
    V. Summary and Conclusions

Chapter 7: General Summary and Discussion

Cumulative References

List of Tables

Table 4.1. Band divisions employed in the SII and here.

Table 4.2. Band importance values obtained in the current study for SPIN sentences (Exp. 1) and CID W-22 phonetically balanced words (Exp. 2).

Table 5.1. Band divisions employed for all functions, as given by the SII.

Table 6.1. Band divisions for ten critical bands examined here.

Table 6.2. Noise susceptibility (in dB SNR) by band. Values for bands 2 and 20 are from the control conditions.

List of Figures

Figure 2.1. From Glasberg and Moore, 1987. Critical bandwidth in Hertz as a function of center frequency. The dotted line indicates the traditional critical-band function. The symbols are critical band values measured in various studies. The solid line is the curve fitted to the equation given in the figure.

Figure 2.2. From Moore, 2007. Width of the auditory filter in ERBs (Equivalent Rectangular Bandwidths) for hearing-impaired subjects relative to those of normal-hearing subjects as a function of audiometric threshold.

Figure 2.3. From Glasberg and Moore, 1986. Shape of the auditory filter for a subject with normal hearing and a subject with hearing loss, both at 1000 Hz.

Figure 2.4. From Baer and Moore, 1993. Percent correct intelligibility as a function of amount of spectral smearing at three different signal-to-noise ratios. Black bar = upper side broader. Hashed bar = lower side broader. Open bar = lower side much broader.

Figure 2.5. From Studebaker et al., 1999. Intelligibility (in rationalized arcsine units) as a function of speech level for various signal-to-noise ratios.

Figure 2.6. From Summers and Molis, 2004. Speech reception threshold for speech in noise, forward talker, and reverse talker maskers. Normal hearing = filled symbols; Hearing-impaired = open symbols.

Figure 2.7. From Egan and Wiener, 1946. Equal-articulation contours for various bandwidths, all with center frequencies of 1500 Hz.

Figure 2.8. From French and Steinberg, 1947. Functions relating Articulation Index values to intelligibility for various types of speech materials.

Figure 2.9. From French and Steinberg, 1947. Syllable recognition as a function of cutoff frequency for both high- and low-pass stimuli at two signal-to-noise ratios. The crossover frequencies relating to A = .5 and A = .25 are labeled.

Figure 2.10. From Doherty and Turner, 1996. Individual frequency-weighting functions for six listeners derived using the Correlational Method.

Figure 4.1. Responses of the high-order FIR filters used to create the 21 speech bands. Shown are long-term average spectra of a 60-sec white noise filtered using parameters for bands 2, 3, 4; 10, 11, 12; and 18, 19, 20.

Figure 4.2. Band importance values for SPIN sentences, for each of 21 speech bands. High- and Low-Predictability sentences were pooled. Shown are functions for (i) the first randomly-selected subgroup of 10 subjects in each frequency region, (ii) the first 15 subjects, (iii) all 20 subjects, and (iv) the second randomly-selected subgroup of 10 subjects in each frequency region.

Figure 4.3. Band importance values for SPIN sentences obtained in the current experiment (High- and Low-Predictability sentences pooled) versus that described in the SII for identical speech materials. The first three formant frequencies are indicated by inverted triangles at the top of the panel.

Figure 4.4. The top panel shows absolute deviations from SII importance values for each band in band-importance units. The bottom panel shows these deviations in percent (│current importance value - SII importance value│ / SII importance value). Because the SII importance for band 21 is zero, a percent difference could not be calculated in the bottom panel.

Figure 4.5. Band-importance functions for High-Predictability and Low-Predictability SPIN sentences. Also shown is the SII function for identical speech materials.

Figure 4.6. The top panel shows absolute deviations from SII importance values for High- and Low-Predictability SPIN sentences in band-importance units. The bottom panel shows these deviations in percent, as in Fig. 4.4. Again, percent difference could not be calculated in the bottom panel for band 21.

Figure 4.7. Band-importance functions for CID W-22 phonetically-balanced words. As in Fig. 4.2, functions are shown based on the first 10, 15, and 20 subjects run in each of three frequency regions, as well as for the second subgroup of 10 subjects run.

Figure 4.8. The band importance function obtained here for CID W-22 words versus that described in the SII for the identical speech materials. The first three formant frequencies are indicated by inverted triangles at the top of the panel.

Figure 4.9. As Fig. 4.4, but for W-22 words. Unlike Fig. 4.4, percent deviations could be calculated for all bands because SII importance is non-zero for all bands.

Figure 4.10. Shown are band-importance functions obtained in the current study following smoothing across bands using a triangular weighting window. Corresponding functions from the SII are also displayed. The top panel shows the functions for the SPIN sentences, and the bottom panel shows functions for the CID W-22 words.

Figure 5.1. Band-importance functions created using IEEE sentences. The closed symbols show functions for a single male talker and the open symbols show functions for 10 talkers (half male).

Figure 5.2. Band-importance functions created using sentence materials and three different individual male talkers.

Figure 5.3. Band-importance functions created using CID W-22 words (open symbols) and SPIN sentences (closed symbols), both spoken by the same talker.

Figure 6.1. Sentence intelligibility in percent correct as a function of target band signal-to-noise ratio for the ten critical bands tested. The top dashed line in each panel is average ‘band-present’ sentence intelligibility for that target band from Manuscript 2. The bottom dashed line in each panel is average ‘band-absent’ sentence intelligibility for that target band from Manuscript 2. The dotted line in each panel is the midway point between band present and band absent for that target band.

Figure 6.2. Same as for Fig. 6.1, but data from a new group of five subjects. The dashed ‘band-present’ and ‘band-absent’ scores were obtained from these same five subjects.

Figure 6.3. Noise susceptibility as equivalent signal-to-noise ratios for the target bands indicated. Values for bands having center frequencies of 250 and 7000 Hz are from the group of five subjects shown in Figure 6.2.

Figure 6.4. Relationship between noise susceptibility (in dB SNR) and band importance (from Manuscript 2) for the ten bands tested here (r = .008, p = .982).

CHAPTER 1: INTRODUCTION

Accurate estimation of speech intelligibility, when that speech is transmitted under various conditions, is crucial to many aspects of communication science. The study of intelligibility prediction is one of the oldest in the field, largely prompted by the development of the telephone in the early 20th century. One of the first applications of these methods was the estimation of the intelligibility of speech transmitted through the telephone, which at the time was fraught with distortions and severely limited in bandwidth. Since that time, many more applications have been established, including the design of speech communication devices such as public address systems and hearing aids.

These speech intelligibility predictions are currently provided by the American National Standards Institute (ANSI) standard referred to as the Speech Intelligibility Index (SII; ANSI, 1997), formerly known as the Articulation Index (AI; ANSI, 1969). The SII consists of various ‘band-importance functions,’ which provide values for the relative contribution of individual frequency bands to overall speech intelligibility. These band-importance functions exist for many different types of speech including monosyllables, word lists, sentence corpora, and continuous discourse. In addition, corrections are given for various types of distortion such as a listener’s hearing impairment.

The aim of the current research is to identify ways to improve current techniques to predict speech intelligibility. Specifically, this work employs a novel technique to assess band importance to examine particular aspects of the speech signal and the corrupting influence of noise on that speech signal. Subsequently, this information may be used in many aspects of hearing science, including the development of more efficient and effective speech amplification systems, such as loudspeakers and hearing aids, and the improvement of speech-noise segregation algorithms for use in hearing devices.

Although the studies described here examine speech band importance and speech band susceptibility to noise in normal-hearing listeners, there are substantial implications for hearing-impaired listeners as well. Those implications are discussed.

The first manuscript describes a study in which a new technique to develop band importance functions is described and tested. This new technique circumvents several of the limitations of previous methods, and accounts for the complex interactions and synergistic relationships that are now known to occur across speech bands.

In the second manuscript, the effects of individual talker characteristics and speech material characteristics are assessed to determine the relative importance of each when deriving band-importance functions. Additionally, the use of a single talker versus a multi-talker speech corpus is assessed.

The final manuscript describes an experiment to examine the corrupting influence of noise on particular bands of speech. In this study, the detrimental effect of noise is systematically evaluated by measuring speech intelligibility at various signal-to-noise ratios within particular ‘critical’ bands of speech.


CHAPTER 2: REVIEW OF LITERATURE

I. NORMAL SPEECH PERCEPTION

Understanding and comprehending speech is a complex and multifaceted process. Many aspects of speech must be captured, encoded, and translated by the auditory system in order for the message to be intelligible. The specific ways in which the auditory system accomplishes this are still not fully understood, because there are many complex acoustic, phonemic, and linguistic cues within the speech signal.

The acoustic features of speech can be classified primarily as either temporal, spectral, or spectro-temporal. The acoustic cues of speech contribute to the ‘bottom-up’ aspects of speech perception: the cues that must be effectively transmitted from the periphery to the higher cortical processing centers. Alternatively, the linguistic features of speech, such as semantic structure and prosody, contribute to the ‘top-down’ aspects: the high-level information that is gained from experience with language, and which allows the auditory system to ‘fill in gaps’ when the bottom-up acoustic signal is deficient.

The identification of an isolated phoneme has effectively no context to provide top-down information, and thus relies near-exclusively on bottom-up cues. These bottom-up cues can take the form of spectral or frequency information such as the frequency components present in the signal and their relative levels, temporal information such as the onset duration of a phoneme, or spectro-temporal information such as the relative timing of different frequency components of a sound. This information is processed by the normal auditory system by a series of bandpass filters in the auditory periphery, as described in the following section. In contrast, the linguistic context of a sentence may allow a listener to fill in what is lacking acoustically. Through the use of contextual information, it is possible for an individual to process a spoken sentence even when much of the acoustic signal is distorted or obscured by extraneous noise. As an illustration, consider this incomplete sentence: “The boy hit the baseball with the ______”, in which the final word is entirely disrupted. Although the final word is missing, it is strongly constrained by the surrounding context and so is very likely ‘bat’.

II. AUDITORY FILTERS

The normal auditory system is capable of decomposing the frequency content of a signal with considerable precision. The peripheral auditory system can be characterized as a series of bandpass filters that decode the incoming sound. The size of these analysis filters is commonly referred to as the critical band (Fletcher, 1940). The width of the auditory filter is dependent on center frequency, with lower-frequency filters being narrower in Hz than higher-frequency filters (Moore and Glasberg, 1983), though bandwidths are more constant on a log-frequency scale. Although the auditory filter shapes do not have infinitely steep slopes, and in fact are generally asymmetric (Patterson, 1974; Moore and Glasberg, 1983; Glasberg and Moore, 1990), they are commonly modeled using an ‘equivalent rectangular bandwidth’ (ERBN), in which the filters are represented as rectangular in shape and contiguous.
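This dependence of filter width on center frequency can be made concrete with the widely used fit of Glasberg and Moore (1990), ERB = 24.7(4.37F + 1), where F is the center frequency in kHz. The minimal sketch below simply evaluates that published equation; the example frequencies are arbitrary illustrative choices.

```python
# Minimal sketch: ERB of the normal auditory filter (Glasberg and Moore, 1990).
# The example center frequencies are arbitrary illustrative choices.
def erb_n(fc_hz: float) -> float:
    """Equivalent rectangular bandwidth (Hz) at center frequency fc_hz."""
    f_khz = fc_hz / 1000.0
    return 24.7 * (4.37 * f_khz + 1.0)

for fc in (250, 1000, 4000):
    print(f"{fc} Hz: ERB of about {erb_n(fc):.1f} Hz")
# Prints roughly 52, 133, and 456 Hz, illustrating that filters are narrower
# in Hz at low center frequencies and broader at high ones.
```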

One common technique to measure the size of the auditory filter is the ‘notched-noise’ method (Patterson, 1976; Glasberg and Moore, 1990). In this technique, detection of a tone at a particular frequency is measured in the presence of a noise having a frequency ‘notch’ around the tone. As the width of the notch increases, less noise falls within the auditory filter centered on the tone, so the masked threshold of the tone decreases, until a point at which a further increase in notch width no longer lowers the threshold of the tone. The notch widths at which additional widening no longer affects tone detection are considered the edges of the auditory filter for that center frequency.
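The logic of the method can be illustrated with the power-spectrum model of masking, in which the predicted threshold tracks the noise power passing through a rounded-exponential (roex) filter, W(g) = (1 + pg)e^(-pg) with g = |f - fc|/fc, a shape used by Patterson, Moore, and colleagues. The sketch below is illustrative only: the slope value p = 25 is an arbitrary assumption rather than a fitted parameter.

```python
import numpy as np

# Illustrative power-spectrum-model sketch of the notched-noise method.
# W(g) = (1 + p*g) * exp(-p*g) is the roex filter; p = 25 is an assumed slope.
fc, p = 1000.0, 25.0
f = np.linspace(0.0, 2.0 * fc, 4001)       # frequency axis (Hz)
df = f[1] - f[0]
g = np.abs(f - fc) / fc                    # normalized distance from center
W = (1.0 + p * g) * np.exp(-p * g)         # filter weighting
total = np.sum(W) * df                     # noise power admitted with no notch

for notch in (0.0, 0.1, 0.2, 0.4):         # notch half-width re: fc
    admitted = np.sum(W[g >= notch]) * df  # noise power passing the filter
    print(f"notch +/-{notch:.1f}*fc: predicted threshold shift "
          f"{10 * np.log10(admitted / total):.1f} dB")
# The predicted threshold falls as the notch widens, then changes little once
# the notch edges pass the filter skirts; those edges estimate the filter width.
```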

The asymmetric shape of the auditory filter results in a phenomenon known as ‘upward spread of masking,’ which was first noted by Wegel and Lane (1924). That is, tones of a particular frequency will more easily mask tones that are above them in frequency, rather than below them. This is due to the shallower slope on the high-frequency side of the auditory filter, resulting in a greater spread of excitation above a given frequency than below. One functional consequence is that the lower frequency components of a speech signal can obscure higher frequency components. However, it is important to recognize that the relationships between the various spectral components of speech and their effects on speech perception are often quite complex.

III. SPECTRAL AND TEMPORAL SPEECH INFORMATION

It is widely accepted that spectral speech cues can be highly redundant. For example, Warren et al. (1995) showed that speech could be highly intelligible even with minimal spectral content. In this study, listeners were presented with sentences that had been filtered into very narrow (1/3 octave) bands. When mid-frequency bands were presented in isolation, intelligibility was very high (over 95% correct). However, when very high- or very low-frequency bands were presented in isolation, scores dropped considerably, in accord with the relative importance of the mid-frequency bands of speech to intelligibility. Of particular relevance to the current study is the finding that when two spectrally remote, narrow bands of speech that contributed little to intelligibility in isolation were presented in combination, intelligibility increased substantially, indicating a synergistic relationship between the bands.

Although somewhat less widely studied than spectral speech information, temporal speech information also plays an important role in intelligibility. Temporal information can generally be classified according to the rate of acoustic signal fluctuation. For example, Rosen (1992) classified temporal information into envelope cues (rates of roughly 2-50 Hz), periodicity cues (rates of roughly 50-500 Hz) and fine structure cues (high rates corresponding to the carrier). Envelope cues provide information related to the intensity of a signal, the duration, and the particular rise and fall time of sounds. Periodicity primarily encodes voicing of phonemes. Temporal fine structure, although still a subject of much debate, is currently believed to assist in sound source segregation and detecting speech in the presence of background noise (see Apoux and Healy, 2011).

In accord with the redundancy of spectral cues and the importance of temporal cues, speech can be intelligible even when it is processed to remove most spectral detail. Work by Van Tasell and colleagues (Van Tasell et al., 1987; Van Tasell et al., 1992) examined the intelligibility of the gross temporal envelope alone (no spectral detail). The authors found that listeners were able to identify isolated, closed-set consonants at a rate above chance with solely the temporal envelope present. Additionally, when the maximum temporal rate of the envelope was increased from 20 to 200 Hz, performance increased, indicating that important temporal cues are provided by rates in this range. However, no further increase in performance was observed as the rate was increased from 200 to 2000 Hz.

Much of our knowledge of spectro-temporal speech cues has come from investigations involving cochlear implant processing. Cochlear implants are devices that are implanted into the inner ear to provide a sense of hearing to individuals with substantial sensorineural hearing loss. To do this, they directly stimulate the auditory nerve through electrodes that present electrical pulses representing the temporal envelope of the incoming sound. This type of processing has been studied extensively in ‘vocoder’ simulations with normal-hearing individuals. In vocoder simulations, the signal is filtered into a series of frequency bands, then the temporal envelope of each band is extracted via half-wave rectification and low-pass filtering, and lastly the temporal envelopes are imposed on either tone or noise carriers having the same filtering configurations as the original speech bands. Therefore, the spectral detail is largely removed from the signal and the listener is required to rely primarily on the temporal envelope information at various loci to understand speech.
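The processing chain just described can be summarized in a short sketch. The band count, edge frequencies, filter orders, and envelope cutoff below are illustrative assumptions, not parameters taken from the studies discussed here.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Minimal noise-vocoder sketch: analysis filter bank, envelope extraction by
# half-wave rectification plus low-pass filtering, and envelope imposition on
# band-limited noise carriers. All parameter values are illustrative.
def vocode(x, fs, n_bands=4, f_lo=80.0, f_hi=6000.0, env_cut=300.0):
    x = np.asarray(x, dtype=float)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)      # log-spaced band edges
    env_sos = butter(2, env_cut, btype="low", fs=fs, output="sos")
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(band_sos, x)                    # analysis band
        env = sosfilt(env_sos, np.maximum(band, 0.0))  # rectify and smooth
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * carrier       # envelope on noise carrier
    return out / (np.max(np.abs(out)) + 1e-12)         # simple normalization

# Usage (hypothetical input): y = vocode(speech_samples, fs=16000, n_bands=4)
```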

Using this type of processing, Shannon et al. (1995) found that normal-hearing adult listeners could understand speech when presented with as few as three to four vocoded bands in quiet. Sentence intelligibility reached nearly 100% correct with only four bands of temporal envelopes, whereas isolated vowel and consonant identification was near-perfect with as few as three bands. Furthermore, only two bands were required for effective transmission of voicing and manner of articulation cues.

In accord with the results of Van Tasell, the low-pass smoothing filter cutoff employed during envelope extraction plays a role in vocoded-speech intelligibility. Shannon et al. (1995) found significant increases in intelligibility between 16 and 50 Hz low-pass filter cutoffs, but no increase from 50 to 500 Hz. To further examine this effect of low-pass filtering, or temporal smoothing, Healy and Steinbach (2007) systematically varied the slope of the low-pass filter used to create three-band vocoded sentences. In this experiment, two cutoff rates were used: 16 and 100 Hz. For each cutoff rate, the slope was varied from 6 dB/octave to 192 dB/octave. As the slope increased, the amount of temporal detail was reduced. For a given slope, intelligibility was lower for the 16 Hz cutoff than for the 100 Hz cutoff, in accord with results previously observed by Shannon et al. (1995). In addition, it was observed that as the slope of the cutoff filter increased, intelligibility decreased. In fact, performance in the 100 Hz, 96 dB/octave condition was nearly equivalent to the 16 Hz, 6 dB/octave condition. Thus, despite the higher cutoff value, the steep slope of the filter effectively reduced the temporal information present such that performance dropped substantially. The next section describes a theory that extends spectro-temporal analysis of speech to involve its perception when background noise is present.

IV. GLIMPSING THEORY OF SPEECH PERCEPTION

The mechanisms responsible for a listener’s ability to identify and understand a target in a complex background have been a topic of examination for many decades. One of the earliest studies was performed by E. Colin Cherry (1953), in which he presented two competing voices in both diotic and dichotic conditions. In the dichotic conditions, when listeners were instructed to repeat the speech presented in the target ear, performance was high. However, they were unable to repeat much if any of the ‘distractor’ speech presented to the opposite ear. These results indicate that there is some underlying function of the auditory system that allows the listener to follow and understand speech, even in the presence of a competing message with similar acoustic and linguistic characteristics.

One possible factor that could contribute to this ability to attend to a speech signal of interest is the integration of temporal patterns across frequency channels. The fluctuations of speech tend to be highly correlated across channels, and the auditory system may take advantage of this similarity. One theory concerning this spectro-temporal integration of speech is the ‘glimpsing’ theory (Cooke, 2006). The idea is quite simple: listeners combine spectro-temporal samples that are correlated in some way to produce an image of the speech signal. When noise is present, listeners take advantage of the spectro-temporal units containing the least noise to reconstruct the signal. Cooke tested this hypothesis both behaviorally and using a computer model, and found strong relationships between the size of ‘glimpses’ and performance in a consonant recognition task. In the behavioral data, it was observed that performance dropped as the number of talkers in the babble-modulated noise masker was increased, reaching asymptote at 32 talkers. Thus, performance was best in conditions in which fewer talkers were competing (the background was more modulated) and therefore more opportunities existed to glimpse a clean signal across the spectro-temporal units.
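A glimpse-based account can be quantified as the proportion of spectro-temporal units whose local SNR exceeds a criterion, in the spirit of Cooke (2006). In the sketch below, the 3 dB criterion and the STFT settings are illustrative assumptions, and the speech and noise signals are assumed to be available separately before mixing.

```python
import numpy as np
from scipy.signal import stft

# Sketch of a glimpse proportion: the fraction of spectro-temporal units in
# which the local SNR exceeds a criterion. The 3 dB criterion and the STFT
# settings (25 ms frames at a 16 kHz sampling rate) are assumptions.
def glimpse_proportion(speech, noise, fs, criterion_db=3.0):
    _, _, S = stft(speech, fs=fs, nperseg=400, noverlap=200)
    _, _, N = stft(noise, fs=fs, nperseg=400, noverlap=200)
    local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + 1e-12) /
                                   (np.abs(N) ** 2 + 1e-12))
    return float(np.mean(local_snr_db > criterion_db))

# A more modulated background leaves more low-energy dips, so a larger
# fraction of units exceeds the criterion, predicting higher intelligibility.
```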

This remarkable ability of the normal-hearing auditory system to ‘glimpse’ or find units of relatively undistorted speech, and reconstruct them to form an intelligible signal, is greatly reduced with the introduction of sensorineural hearing impairment. The mechanisms responsible for the loss of this ability, as well as many other auditory functions, are described below.

V. EFFECTS OF SENSORINEURAL HEARING IMPAIRMENT

Hearing impairment of sensorineural origin is detrimental to many aspects of auditory processing. Consequently, listeners with hearing loss often have multiple difficulties, including with tasks that involve understanding speech in the presence of background noise. While a reduction in audibility may account for some of the associated difficulties, deficits related to basic spectral and temporal resolution likely contribute as well. These ‘suprathreshold’ deficits (challenges that remain even when hearing thresholds are accounted for) can interact in complex ways, and isolating the individual functional consequences of each can be cumbersome.

A. Spectral Processing Deficits

There is evidence to suggest that much of the primary functional deficit associated with hearing impairment (i.e., speech understanding in noise) is due to spectral processing difficulties (Van Schijndel, Houtgast and Festen, 2001). As such, understanding the basic underlying mechanisms associated with this deficit is imperative. Much work has been dedicated to examining the spectral processing abilities of both normal-hearing and hearing-impaired individuals.

Spectral resolution is largely determined by the size of the auditory filter, commonly referred to as the critical band (Fletcher, 1940). For normal-hearing individuals, this critical band is relatively constant in size below 500 Hz, and then increases (in Hz) with increasing center frequency (Moore and Glasberg, 1983; Fig. 2.1).

Figure 2.1. From Glasberg and Moore, 1987. Critical bandwidth in Hertz as a function of center frequency. The dotted line indicates the traditional critical-band function. The symbols are critical band values measured in various studies. The solid line is the curve fitted to the equation given in the figure.


A common effect of sensorineural hearing impairment is a broadening of the auditory filter. Estimates of auditory filter width are typically measured via procedures such as the ‘notched-noise’ method to determine the shape of tuning curves (Glasberg and Moore, 1990; Stone, Glasberg and Moore, 1992; Glasberg and Moore, 2000). For listeners with sensorineural hearing impairment, auditory filter width generally increases with increasing pure-tone audiometric threshold (Glasberg and Moore, 1986; Dubno and Dirks, 1989). However, there is often large variability between subjects in terms of the relationship between pure-tone threshold and auditory-filter width (Moore, 2007; Fig. 2.2). This may be due in part to variability in the underlying physiologic mechanism of impairment across subjects.


Figure 2.2. From Moore, 2007. Width of the auditory filter in ERBs (Equivalent Rectangular Bandwidths) for hearing-impaired subjects relative to those of normal-hearing subjects as a function of audiometric threshold.

Another factor underlying spectral processing in hearing-impaired listeners is a pronounced asymmetry in auditory filter shape. Although auditory filter shape for normal ears tends to be somewhat asymmetric, particularly at mid frequencies (i.e., 1000 and 2000 Hz), with the lower frequency side less steep than the higher frequency side, this aspect can be much more pronounced in impaired ears. Glasberg and Moore (1986) examined filter shapes in listeners with either unilateral or bilateral cochlear impairment. Although large variability was observed between subjects, in general the lower frequency side of the tuning curves for impaired ears was much less steep relative to those of the normal ears. The higher frequency side had a tendency to be a bit less steep, but for some listeners it matched those of normal ears (Fig. 2.3).


Figure 2.3. From Glasberg and Moore, 1986. Shape of the auditory filter for a subject with normal hearing (left) and a subject with hearing loss (right), both at 1000 Hz.

A further physiologic consequence of hearing loss that results in impaired spectral processing is the presence of ‘dead regions’ in the cochlea (Baer, Moore and Kluk, 2002). Dead regions are locations along the cochlea in which there are either very few or no surviving inner hair cells and/or neurons, and thus information cannot be transmitted from the periphery to more central structures (Moore, 2004). In instances of dead regions, the spectral information typically encoded by that region will either be completely missing from the signal transduced via the auditory nerve, or will be encoded by fibers tuned to different frequencies, causing a place-frequency mismatch.

B. Impact of Spectral Processing Deficits

One technique for studying the effects of broadened tuning is to simulate widened auditory filters in normal-hearing subjects (Baer and Moore, 1993; ter Keurs, Festen and Plomp, 1992; 1993). In the study by Baer and Moore, the simulated broadening was shown to negatively affect speech understanding in background noise but not in quiet. In addition, conditions with simulated broadening of the low-frequency side of the auditory filter had a greater effect than those with a broadened high-frequency side, indicating a possible influence of increased upward spread of masking (Fig. 2.4). When the lower frequency side is widened, more low-frequency noise is able to pass through that filter and mask the higher frequency components. As previously discussed, listeners with hearing loss tend to present with highly asymmetric auditory filters, in which the low frequency side is shallower. Therefore, it is reasonable to conclude that listeners with hearing loss are more susceptible to the effects of upward spread of masking than normal-hearing listeners.

Figure 2.4. From Baer and Moore, 1993. Percent correct intelligibility as a function of amount of spectral smearing at three different signal-to-noise ratios. Black bar = upper side broader. Hashed bar = lower side somewhat broader. Open bar = lower side much broader.

Other difficulties may result from decreased frequency selectivity or place-frequency mismatch (Moore, 2003). Widened auditory filters may obscure many spectral features of a signal. For example, spectral peaks may be less prominent when passed through broader auditory filters, and the ‘dips’ or valleys in the spectrum may be further reduced by background noise. The wider the auditory filter, the more background noise is allowed through that filter. Additionally, increased noise due to broadened tuning may disrupt the precise timing of the various individual filter outputs. Speech processing may be very sensitive to this timing, and thus disruptions to this mechanism may be particularly detrimental.

As discussed, dead regions in the cochlea potentially introduce a place-frequency mismatch in which the frequency information of a signal is received by the ‘wrong’ region of the cochlea. If there are no surviving inner hair cells or neurons specifically tuned to a particular frequency, this information may be processed through neural pathways that are tuned to other frequencies. In addition to the detrimental influence of the mismatch, this ‘informational overload’ could impair the analysis of the frequency components for which that neural pathway is tuned (Moore, 2003).

As stated previously, sensorineural hearing loss often results in both reduced sensitivity to sound (increased thresholds) and reduced auditory processing abilities (suprathreshold deficits). Although the lack of audibility resulting from increased thresholds can often be overcome through amplification technology, many suprathreshold deficits remain. Oftentimes, these suprathreshold deficits are particularly detrimental to the functional communication abilities of the affected individual.

Perhaps the most undesirable consequence of this impairment is a severe reduction of speech intelligibility in the presence of background noise. This difficulty appears to be most strongly influenced by suprathreshold, rather than sensitivity, factors. Noordhoek, Houtgast and Festen (2001) examined the relationship between various factors related to hearing impairment and speech perception. Whereas audibility could account for difficulties understanding speech in quiet, it could not entirely account for difficulties in noise. Both temporal and spectral resolution, including the influence of upward spread of masking, seem to play a role.

C. Other Deficits and Their Impacts

The numerous impairments and deficits involved in sensorineural hearing loss result in many challenges to speech intelligibility. In fact, the very process of making speech audible for some listeners can introduce distortions. Studebaker et al. (1999) examined the relationship between increasing speech level and intelligibility in normal-hearing listeners. They showed that at a fixed signal-to-noise ratio (SNR), the intelligibility of speech decreases as a function of increasing level above 69 dB SPL (Fig. 2.5). They also found that when the level of speech is high, the additional noise is more detrimental to intelligibility than when the level of speech is lower. Therefore, the high presentation levels required by hearing-impaired listeners can themselves be detrimental.


Figure 2.5. From Studebaker et al. 1999. Intelligibility (in rationalized arcsine units) as a function of speech level for various signal-to-noise ratios.

As previously mentioned, background noise is particularly detrimental to listeners with hearing loss. Not only is an increased susceptibility to upward spread of masking common, but listeners often suffer from an inability to ‘listen in the dips’ of temporally modulated noise. This second issue results in increased difficulty in situations involving competing talkers. The term used to describe the better performance obtained when a background is modulated relative to when it is steady is ‘masking release.’

Masking release has been studied extensively in both normal-hearing and impaired subjects. As one example, Summers and Molis (2004) examined the relationship between hearing impairment and the reduced release from masking in temporally modulated backgrounds. They tested intelligibility in time-reversed speech distractors (providing little or no linguistic information) as well as in forward (normal) speech distractors. They found that not only was the hearing-impaired group less able to achieve masking release, they were slightly more affected by the forward talker than the reverse talker (Fig. 2.6). This indicates that some ‘informational masking’ also occurs for these listeners, in which it is difficult to ignore the linguistically-relevant content of a competing talker’s message. However, due to the fact that many hearing-impaired individuals are elderly, and there are data to indicate that susceptibility to informational masking increases with age, there is some possibility that this observed effect is due to factors related more to aging than hearing loss per se (Tun, O’Kane and Wingfield, 2002).

Figure 2.6. From Summers and Molis, 2004. Speech reception threshold for speech in noise, forward talker, and reverse talker maskers. Normal hearing = filled symbols; Hearing-impaired = open symbols.


Another factor that may contribute to speech difficulties in the impaired population is a lack of phonemic restoration (Warren, 1970). Phonemic restoration is a phenomenon in which portions of a signal that are inaudible can be perceptually restored when noise is present during the temporal interruptions. Başkent, Eiler and Edwards (2010) presented data indicating that hearing-impaired listeners receive less phonemic restoration as a function of degree of hearing loss. This finding could not be attributed to a lack of audibility, thus indicating other processing deficits.

In addition to increasing the susceptibility to noise, hearing impairment also increases the detrimental effects of reverberation and temporal asynchronies. Although individuals with hearing loss seem to utilize the temporal envelope of speech as well as individuals with normal hearing (Turner, Souza and Forget, 1995), they appear to have difficulty integrating temporal speech envelopes across different spectral frequencies (Turner, Chi and Flock, 1999). Healy and Bacon (2002) examined the influence of across-frequency asynchrony on the integration of two speech envelope patterns. They found that normal-hearing listeners could tolerate relatively large asynchronies across the two patterns, but hearing-impaired listeners were much more sensitive to even small asynchronies. Further, intelligibility fell more rapidly as a function of temporal offset for this group. Further analysis indicated that this effect cannot be attributed to signal level or age of the listener, but rather appears to be a function of degree of hearing loss.

It is very possible that this lack of robustness to across-frequency asynchrony may be of pragmatic importance. Many speech-processing strategies in hearing aid technology introduce delays across frequency channels, and therefore could introduce distortions that may be particularly challenging for listeners with large degrees of hearing loss (e.g., Stone and Moore, 2003).

The combined effects of decreases in both spectral and temporal processing, along with other suprathreshold deficits, make communication particularly challenging for many listeners with hearing loss. The exact mechanisms underlying these difficulties are still under investigation, but the detrimental influence of background noise is clear. Therefore, treatments for hearing loss should focus on the remediation of difficulties associated with noisy situations. Techniques to overcome these limitations will likely need to take into account two critical aspects: 1) effective transmission to the listener of those acoustic cues that contribute the most to speech intelligibility, and 2) reduction of the most detrimental extraneous noise without the introduction of large distortions into the signal. One way to determine the aspects of the speech signal most crucial to intelligibility is through an examination of frequency-importance functions, which will be discussed in the following section.

VI. ARTICULATION INDEX/BAND IMPORTANCE FUNCTIONS

The Speech Intelligibility Index (SII; ANSI, 1997), formerly referred to as the Articulation Index (AI; ANSI, 1969), was developed to allow predictions of speech intelligibility over communication systems such as telephones. Systems such as these introduce various distortions into the speech signal, which may reduce the clarity and intelligibility of the message. Thus, when designing a communication device, it is beneficial to be able to predict the effects of various factors inherent to the system.

Researchers at Bell Labs were some of the first to conceive of the idea of the AI. In its original state, the telephone was fraught with numerous transduction issues, including highly peaked frequency responses, distortion, and internal noise. As a result, articulation tests were used to measure the effects of various types of distortion on the intelligibility of speech (Fletcher and Galt, 1950). An early example of articulation testing was performed by Egan and Wiener (1946). They tested several speech communication systems of varying spectral bandwidths and established equal-articulation contours showing the gain required to produce equal intelligibility for different systems and conditions (Fig. 2.7).

Figure 2.7. From Egan and Wiener, 1946. Equal-articulation contours for various bandwidths, all with center frequencies of 1500 Hz.


French and Steinberg (1947) were among the first to use the term ‘articulation index’ to describe how the characteristics of the target speech and background relate to intelligibility. Specifically, they explained that the signal received by the listener is the aggregate result of multiple factors such as the characteristics of the speech (acoustic cues), hearing (sensitivity), electric and acoustic properties of the communication device (telephone or hearing aid), and the conditions of the environment in which the communication is occurring. With regard to hearing, the authors commented that one must account for any hearing difficulties on the part of the listener. However, they do suggest that over a rather large range of absolute thresholds for individual listeners there will be no appreciable effect of hearing level, due to the relative level of the speech and noise being independent of listener sensitivity.

The general concept of the AI is that the information transmitted by a communication system is a consequence of the SNR in each frequency band. In the calculation, each frequency band is assigned an importance value, reflecting its relative contribution to total information. The SII value is simply the sum of these band-importance values, each weighted according to its SNR. The SNR weight is a scalar ranging from 0.0 at an SNR of -18 dB to 1.0 at an SNR of +12 dB. An SII value of 1.0 indicates that all speech information is present and a value of 0.0 indicates that no information is present.
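This weighted sum can be written compactly. The sketch below follows the weighting stated above (0.0 at -18 dB SNR, rising linearly to 1.0 at +12 dB SNR); the three-band importance values and SNRs are hypothetical numbers chosen only for illustration.

```python
# Sketch of the band-weighted sum described above; the linear SNR weighting
# spans -18 dB (weight 0.0) to +12 dB (weight 1.0), per the text.
def sii(importances, snrs_db):
    assert abs(sum(importances) - 1.0) < 1e-6   # importances sum to one
    total = 0.0
    for imp, snr in zip(importances, snrs_db):
        weight = min(max((snr + 18.0) / 30.0, 0.0), 1.0)
        total += imp * weight
    return total

# Hypothetical example: three bands, with the middle band most important.
# 0.25*1.0 + 0.50*0.6 + 0.25*0.0 = 0.55
print(sii([0.25, 0.50, 0.25], [12.0, 0.0, -18.0]))
```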

More specifically, the AI can be calculated in one of two main ways. The first is termed the ‘20-Band Method’ and is based on the relative spectrum levels of speech and noise in each of 20 contiguous, equally-contributing bands (French and Steinberg, 1947; Kryter, 1962a). In this method, band importance is fixed at 0.05 (1/20) for every band. The other methods of calculation, termed the 1/3-octave or 1-octave methods, are similar to the 20-band method. In this case, speech is divided into a contiguous series of 1/3-octave or 1-octave bands. However, as band divisions are no longer equated for their information contribution, a different importance value is used for each band.

A. Assumptions and Considerations

Although the AI is considered generally reliable, many factors must be corrected for in order to achieve an accurate estimation of intelligibility. Aspects including the shape of the speech spectrum (particularly for shapes that differ widely from average, such as whispered speech), the response characteristics of the system, and other distortions such as peak clipping and reverberation must be included in articulation estimations (Fletcher and Galt, 1950).

Pavlovic and Studebaker (1984) examined some of the assumptions of the AI. They found that the assumptions of a 30 dB dynamic range for speech, as well as a linear weighting for intensity, and the exclusion of natural pauses in speech when calculating speech level (using peak levels) all gave good predictions. They also examined the ability of the AI to predict performance of hearing-impaired (HI) listeners. They did this by simulating increased thresholds in a group of normal-hearing listeners. Perhaps not surprisingly, this indicated a strong ability of the AI to predict HI listener performance, as it simply simulated increased auditory thresholds and did not include any suprathreshold deficits inherent to many individuals in this population.

Kryter (1962b) tested the accuracy of the AI under conditions of noise masking, high- and low-pass filtering and narrow bandpass filtering and found that the AI predicted intelligibility well for all. One notable finding was that the AI under-predicted intelligibility for a system with three pass bands, perhaps reflecting flaws in the assumption of the independence of individual speech bands.

Kryter did find that the AI must account for issues such as nonlinear growth of masking; otherwise predictions can be very poor. In a review paper appearing in the same journal issue as the previously discussed paper, Kryter (1962a) found that the AI is capable of addressing issues such as different types of distortion, noises, and speech spectra. Specifically, it can provide corrections for peak clipping, reverberation, modulated noise, narrow bands of noise, and frequency distortion. It is also possible to account for the addition of visual cues that augment auditory cues. The AI can accurately predict intelligibility for speech in the range of 55-85 dB, but outside of that range the different speech spectrum shapes must be addressed. Kryter also noted that it is unclear whether the AI functions can be applied to female talkers, as they have been established using male voices. Lastly, he discussed the relationship between the AI and speech intelligibility. This relationship seems to be highly dependent on the information content of the message, such as the linguistic characteristics. For a given AI value, intelligibility will tend to be higher if the linguistic aspects of the speech content are constrained to a limited set (e.g., a closed set of words; Fig. 2.8).

Figure 2.8. From French and Steinberg, 1947. Functions relating Articulation Index values to intelligibility for various types of speech materials.

As indicated previously, the AI does contain a correction to make it applicable to combined auditory/visual speech perception. Grant and Braida (1991) tested the predictions for auditory-only, auditory plus visual, and visual-only speech. They found generally good fits for a variety of noise and filtering conditions; however, when AI scores were below .25, intelligibility was underpredicted, and when AI scores were above .25, intelligibility was overpredicted. Interestingly, they also noted a non-additive relationship when presenting multiple disparate bands simultaneously, in accord with early reports from Pollack (1948), and in violation of the AI’s assumption of independence.

B. Band-Importance Functions

There are many underlying assumptions in the derivation of the band-importance functions. One such assumption that has persisted for many decades is the idea that total information content is a simple sum of the content of the individual frequency bands.

French and Steinberg (1947) state:

“The articulation index is based on the concept that any narrow band of speech frequencies… carries a contribution to the total index which is independent of the other bands with which it is associated and that the total contribution of all bands is the sum of the contributions of the separate bands,” (p. 101, emphasis added).

A footnote is included suggesting that this may not be entirely true, as at very intense levels speech can produce masking effects by energy from one band spilling into another band. Regardless, no mention of complex interactions between bands is made.

Here, it is argued that the contribution of various bands of speech to total information is far from independent. Instead, there exist large redundant and synergistic interactions among various speech frequency bands. This important topic is addressed in the first manuscript.

The characteristics of speech with regard to frequency content and spectral cues have long been a subject of interest. In particular, the differences between vowels and consonants in terms of spectral features have been examined thoroughly. It has been shown that, generally, vowels are more driven by fundamental frequency and related harmonics, whereas consonants have frequency cues that are more dispersed across the spectrum. When these isolated sounds combine to form words and sentences, the frequency content of the speech continually varies with time. Thus, the intelligibility of speech is influenced by the frequencies present, with some frequencies contributing more or less overall. French and Steinberg (1947) described the possibility of characterizing the general shape of this frequency-importance function via the articulation-index procedure.

Since then, numerous band-importance functions have been derived using the methods outlined in the AI/SII or through other techniques. Functions for numerous types of speech materials have been developed, and many of them are incorporated into the revision of the AI, the SII. Large differences in the shape of functions have been observed depending on the speech material under consideration. For example, the original importance function using nonsense syllables from French and Steinberg (1947) had a peak at approximately 2500 Hz. Comparatively, when Studebaker, Pavlovic, and Sherbecoe (1987) derived a function for continuous speech, they observed a peak at 450 Hz. This discrepancy could be explained by differences in linguistic content between the two speech materials, as more redundant speech (continuous discourse) allows listeners to utilize context to detect low-frequency perceptual cues (Miller and Nicely, 1955). This indicates a tendency for shifts in phonemic content to alter the shape of importance functions (Pavlovic, 1987).

C. Techniques for Deriving Band-Importance Functions

The original technique to derive a band-importance function, described by French and Steinberg (1947), utilized a series of high- and low-pass filtered speech conditions presented to listeners to obtain intelligibility scores. For example, Studebaker, Pavlovic, and Sherbecoe (1987) obtained band-importance functions for continuous discourse spoken by two male talkers and one female talker using this method. They presented a total of 135 filtered speech conditions, half to one group of listeners and half to a second group. The filtering conditions consisted of broadband speech plus seven low-pass and seven high-pass speech bands with different frequency cutoffs. They then presented each of these conditions at various SNRs (from 4.5 to 12.5 dB in 1-dB steps), with all speech presented at a conversational level.

After obtaining intelligibility scores for each of the conditions, the data were used to derive a transfer function relating intelligibility scores to the AI. Prior to analysis, the data were smoothed across SNRs for each filtering condition. The relationship between intelligibility and AI was obtained using the method of French and Steinberg (1947; Fig. 2.9).


Figure 2.9. From French and Steinberg, 1947. Syllable recognition as a function of cutoff frequency for both high- and low-pass stimuli at two signal-to-noise ratios. The crossover frequencies relating to A = .5 and A = .25 are labeled.

In this method, the crossover frequency, which is the frequency that divides the spectrum into two equally intelligible halves at a high SNR, must be established. This is the frequency at which the high-pass and low-pass curves intersect. The intelligibility score at that intersection point is assigned an AI value of A = .5. From this, the (less favorable) SNR that degrades the speech enough to produce a broadband score equal to the A = .5 value is determined. The crossover point of the high-pass and low-pass curves for this function is assigned a value of A = .25. This 'halving' procedure is continued until many AI values have been assigned (French and Steinberg, 1947; DePaolis, 1992).

Subsequently, these values are used to estimate the importance of the frequency bands having cutoffs corresponding to the various high- and low-pass filtering conditions. The difference between the AI values at the upper and lower cutoffs of a band is taken as the importance of that band.1 After these values have been established, interpolation is used to create 20 equally-contributing bands.

Recently, several new techniques have been developed for the calculation of band-importance functions. This has become particularly relevant, as the literature concerning the complex redundancies and interactions between frequency bands has expanded greatly since Pollack's original observation in 1948 (Warren et al., 1995; Greenberg, Arai, and Silipo, 1998; Warren, Bashford, and Lenz, 2005). In addition to theoretical concerns with the standard technique, there are important practical limitations as well, as the process is quite laborious and time-consuming.

One such method to overcome these limitations is the 'Correlational Method' (Doherty and Turner, 1996; Turner et al., 1998; Calandruccio and Doherty, 2007). The general technique is to divide speech into a number of bands (typically 3-5) and systematically vary the SNR in each of the bands relative to the overall speech level. From this, correlations between trial-by-trial performance and the SNR in each band can be computed, and bands whose SNR correlates more strongly with performance are assigned greater weight.
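A minimal sketch of this logic in Python may help fix ideas; all data below are simulated placeholders (the 'listener' is a logistic rule with arbitrary weights), not values from any of the cited studies:

    import numpy as np

    rng = np.random.default_rng(0)
    n_trials, n_bands = 400, 5

    # Each band receives an SNR drawn in 3-dB steps, roughly following
    # the 13-dB range with 3-dB steps of Calandruccio and Doherty (2007).
    band_snr = rng.choice(np.arange(0.0, 15.0, 3.0),
                          size=(n_trials, n_bands))

    # Placeholder listener: probability correct grows with a weighted sum
    # of band SNRs ('true' weights are arbitrary, for illustration only).
    true_w = np.array([0.10, 0.30, 0.10, 0.20, 0.30])
    p_correct = 1.0 / (1.0 + np.exp(-((band_snr @ true_w) - 4.0) / 2.0))
    correct = (rng.random(n_trials) < p_correct).astype(float)

    # Point-biserial correlation between each band's SNR and accuracy.
    r = np.array([np.corrcoef(band_snr[:, b], correct)[0, 1]
                  for b in range(n_bands)])

    # One simple normalization: each correlation over the summed
    # correlations gives relative weights across the five bands.
    weights = r / r.sum()
    print(np.round(weights, 2))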

For example, Calandruccio and Doherty (2007) developed a frequency-importance function for the IEEE sentences (IEEE, 1969) using the Correlational Method. To do so, they filtered the sentences into 5 'equally-contributing' bands, each with a calculated SII of .2. For each of the 5 bands, SNR was allowed to vary within a 13-dB range with 3-dB step sizes. On a trial-by-trial basis without repeat, an SNR was assigned to a particular band. After correlations between performance and the relative SNR of each band were computed, it was found that the greatest weight was placed on bands 2 (562-1113 Hz) and 5 (2807-11000 Hz). Further analysis revealed that these weighting functions were stable across subjects and across time, as half of the subjects heard the stimuli across two different test sessions, one week apart.

There are certain pitfalls and drawbacks to this method, however. First, although there are obvious benefits to presenting broadband speech, which allows all of the natural interactions to take place across the spectrum, this necessitates the use of noise to degrade the signal below ceiling performance levels. The presence of noise may have a differential influence across frequency regions, with lower-amplitude regions, such as the high-frequency energy of frication, being disproportionately disrupted. This limitation is not exclusive to the Correlational Method, as the original SII method also requires the use of noise. Second, the spectrum is divided into only 3, 4, or 5 gross divisions and thus does not provide much frequency resolution for importance. The method does, however, allow for analysis on a subject-by-subject basis, something that would be difficult using the technique employed in the SII (see Fig. 2.10).


Figure 2.10. From Doherty and Turner, 1996. Individual frequency-weighting functions for six listeners derived using the Correlational Method.

More recently, a technique has been developed that is capable of eliminating the need for extraneous noise to degrade the signal while still taking into account the multitude of interactions across the spectrum. This technique is termed the 'compound method' (Apoux and Healy, 2012). Furthermore, the technique is capable of providing highly detailed frequency resolution, examining as many as 30 auditory-filter-wide bands.

In this method, the band of interest is presented along with a set number of other bands, randomly distributed across the spectrum from trial to trial. Performance with the band of interest present is then compared to performance with the same number of 'other' bands present but with the band of interest absent. The importance of that frequency band is the difference between the 'band present' and the 'band absent' scores, relative to the same metric for all of the other bands. As each band is presented along with multiple combinations of all other spectral bands, a more 'global' view of the importance of that band is obtained, regardless of which other speech bands may be available at any given moment in time.
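The trial structure is easy to make concrete. The sketch below (Python; the band counts follow the description above, but the function name and layout are ours) draws the band sets for one paired band-present/band-absent trial:

    import numpy as np

    rng = np.random.default_rng(1)

    def paired_trial(target, n_bands=21, n_other=4):
        # Draw n_other accompanying bands at random from the rest of the
        # spectrum; the two members of the pair differ only in whether
        # the target band is included.
        pool = [b for b in range(n_bands) if b != target]
        others = rng.choice(pool, size=n_other, replace=False)
        present = np.sort(np.append(others, target))  # target + 4 others
        absent = np.sort(others)                      # same 4 others only
        return present, absent

    present, absent = paired_trial(target=9)  # e.g., one mid-frequency band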

There is evidence to suggest that assessing band importance in this way can reveal the influence of spectral regions previously believed to provide little information. Lippmann (1996) showed that when consonants were filtered to remove a large portion of mid-frequency information (band-reject filter, 800-4000 Hz), not only were listeners able to integrate speech bands over very disparate regions to achieve high levels of intelligibility, but an effect of very high-frequency cues was also observed: when a speech band above 8000 Hz was added to the filtered stimuli, further improvements in performance were noted. This suggests that frequency content that may be rendered unimportant under favorable conditions, such as broadband speech, may gain additional importance when critical mid-frequency cues are lacking.

The benefits of the Compound Method are many. First, it allows the many different interactions to take place across the entire spectrum. Also, the method more accurately models the natural 'glimpsing' which occurs in everyday speech reception, in which various portions of the speech spectrum may be more versus less available at any given moment in time. In contrast, the SII method assesses information about a particular frequency region while all of the information above it or below it in frequency is intact.

In addition, the SII method requires the use of masking noise, whereas the compound method can present speech in quiet. The ability to present speech in quiet can be quite beneficial, as interactions could exist between the effects of masking and the importance of speech in a particular frequency region, thus confounding the two factors.

One of the most striking results of using this technique is the vastly different shape of the resulting band-importance functions relative to those in the SII. Noticeable in the new functions are prominent peaks and valleys across the spectrum, a so-called 'microstructure'.

There appears to be strong evidence to indicate that the band-importance functions in the SII may not accurately represent the true distribution of speech cues, given the various interactions and redundancies known to exist. A seemingly endless number of factors may play a role in intelligibility, including the linguistic and semantic content of the message, the particular distortions and disruptions to the signal, and the characteristics of the talker used to derive the functions. Add to that the effects of complex masking patterns, the effects of noise on individual bands of speech, and the suprathreshold deficits related to sensorineural hearing impairment, and the picture becomes even less clear. It is evident that much is left to be examined with regard to speech intelligibility and the development of future Speech Intelligibility Indices.

NOTE:

1. Studebaker et al. (1987) discarded the values assigned to the lowest signal-to-noise ratio functions due to large variability and the presence of negative importance values.

REFERENCES

American National Standards Inst. (1969). ANSI S3.5, American National Standard Methods for the Calculation of the Articulation Index (American National Standards Inst., New York).

American National Standards Inst. (1997). ANSI S3.5 (R2007), American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

Apoux, F., and Healy, E. W. (2011). "Relative contribution of target and masker temporal fine structure to the unmasking of consonants in noise," J. Acoust. Soc. Am. 130, 4044-4052.

Apoux, F., and Healy, E. W. (2012). "Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants," J. Acoust. Soc. Am. 132, 1078-1087.

Baer, T., and Moore, B. C. (1993). "Effects of spectral smearing on the intelligibility of sentences in noise," J. Acoust. Soc. Am. 94, 1229-1241.

Baer, T., Moore, B. C., and Kluk, K. (2002). "Effects of low pass filtering on the intelligibility of speech in noise for people with and without dead regions at high frequencies," J. Acoust. Soc. Am. 112, 1133-1144.

Başkent, D., Eiler, C. L., and Edwards, B. (2010). "Phonemic restoration by hearing-impaired listeners with mild to moderate sensorineural hearing loss," Hear. Res. 260, 54-62.

Calandruccio, L., and Doherty, K. A. (2007). "Spectral weighting strategies for sentences measured by a correlational method," J. Acoust. Soc. Am. 121, 3827-3836.

Cherry, E. C. (1953). "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am. 25, 975-979.

Cooke, M. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119, 1562-1573.

DePaolis, R. A. (1992). The intelligibility of words, sentences, and continuous discourse using the articulation index (No. TR-92-04). Applied Research Laboratory, Pennsylvania State University, University Park, PA.

Doherty, K. A., and Turner, C. W. (1996). "Use of a correlational method to estimate a listener's weighting function for speech," J. Acoust. Soc. Am. 100, 3769-3773.

Dubno, J. R., and Dirks, D. D. (1989). "Auditory filter characteristics and consonant recognition for hearing-impaired listeners," J. Acoust. Soc. Am. 85, 1666-1675.

Egan, J. P., and Wiener, F. M. (1946). "On the intelligibility of bands of speech in noise," J. Acoust. Soc. Am. 18, 435-441.

Fletcher, H. (1940). "Auditory patterns," Rev. Mod. Phys. 12, 47-65.

Fletcher, H., and Galt, R. H. (1950). "The perception of speech and its relation to telephony," J. Acoust. Soc. Am. 22, 89-151.

French, N. R., and Steinberg, J. C. (1947). "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am. 19, 90-119.

Glasberg, B. R., and Moore, B. C. (1986). "Auditory filter shapes in subjects with unilateral and bilateral cochlear impairments," J. Acoust. Soc. Am. 79, 1020-1033.

Glasberg, B. R., and Moore, B. C. (1990). "Derivation of auditory filter shapes from notched-noise data," Hear. Res. 47, 103-138.

Glasberg, B. R., and Moore, B. C. (2000). "Frequency selectivity as a function of level and frequency measured with uniformly exciting notched noise," J. Acoust. Soc. Am. 108, 2318-2328.

Grant, K. W., and Braida, L. D. (1991). "Evaluating the articulation index for auditory-visual input," J. Acoust. Soc. Am. 89, 2952-2960.

Greenberg, S., Arai, T., and Silipo, R. (1998). "Speech intelligibility derived from exceedingly sparse spectral information," in Proc. Intl. Conf. on Spoken Lang. Proc., Sydney, pp. 2803-2806.

Healy, E. W., and Bacon, S. P. (2002). "Across-frequency comparison of temporal speech information by listeners with normal and impaired hearing," J. Speech Lang. Hear. Res. 45, 1262-1275.

Healy, E. W., and Steinbach, H. M. (2007). "The effect of smoothing filter slope and spectral frequency on temporal speech information," J. Acoust. Soc. Am. 121, 1177-1181.

IEEE (1969). "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust. 17, 225-246.

Kryter, K. D. (1962a). "Methods for the calculation and use of the articulation index," J. Acoust. Soc. Am. 34, 1689-1697.

Kryter, K. D. (1962b). "Validation of the articulation index," J. Acoust. Soc. Am. 34, 1698-1702.

Lippmann, R. P. (1996). "Accurate consonant perception without mid-frequency speech energy," IEEE Trans. Speech Audio Process. 4, 66-69.

Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338-352.

Moore, B. C. (2003). "Speech processing for the hearing-impaired: successes, failures, and implications for speech mechanisms," Speech Commun. 41, 81-91.

Moore, B. C. (2004). "Dead regions in the cochlea: conceptual foundations, diagnosis, and clinical applications," Ear Hear. 25, 98-116.

Moore, B. C. (2007). Cochlear Hearing Loss: Physiological, Psychological and Technical Issues, 2nd ed. (Wiley, Chichester, UK).

Moore, B. C., and Glasberg, B. R. (1983). "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am. 74, 750-753.

Noordhoek, I. M., Houtgast, T., and Festen, J. M. (2001). "Relations between intelligibility of narrow-band speech and auditory functions, both in the 1-kHz frequency region," J. Acoust. Soc. Am. 109, 1197-1212.

Patterson, R. D. (1974). "Auditory filter shape," J. Acoust. Soc. Am. 55, 802-809.

Patterson, R. D. (1976). "Auditory filter shapes derived with noise stimuli," J. Acoust. Soc. Am. 59, 640-654.

Pavlovic, C. V. (1987). "Derivation of primary parameters and procedures for use in speech intelligibility predictions," J. Acoust. Soc. Am. 82, 413-422.

Pavlovic, C. V., and Studebaker, G. A. (1984). "An evaluation of some assumptions underlying the articulation index," J. Acoust. Soc. Am. 75, 1606-1612.

Pollack, I. (1948). "Effects of high-pass and low-pass filtering on the intelligibility of speech in noise," J. Acoust. Soc. Am. 20, 259-266.

Rosen, S. (1992). "Temporal information in speech: acoustic, auditory and linguistic aspects," Philos. Trans. R. Soc. Lond. B 336, 367-373.

Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303-304.

Stone, M. A., and Moore, B. C. (2003). "Tolerable hearing aid delays. III. Effects on speech production and perception of across-frequency variation in delay," Ear Hear. 24, 175-183.

Stone, M. A., Glasberg, B. R., and Moore, B. C. (1992). "Simplified measurement of auditory filter shapes using the notched-noise method," Br. J. Audiol. 26, 329-334.

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81, 1130-1138.

Studebaker, G. A., Sherbecoe, R. L., McDaniel, D. M., and Gwaltney, C. A. (1999). "Monosyllabic word recognition at higher-than-normal speech and noise levels," J. Acoust. Soc. Am. 105, 2431-2444.

Summers, V., and Molis, M. R. (2004). "Speech recognition in fluctuating and continuous maskers: Effects of hearing loss and presentation level," J. Speech Lang. Hear. Res. 47, 245-256.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1992). "Effect of spectral envelope smearing on speech reception I," J. Acoust. Soc. Am. 91, 2872-2880.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1993). "Effect of spectral envelope smearing on speech reception II," J. Acoust. Soc. Am. 93, 1547-1552.

Tun, P. A., O'Kane, G., and Wingfield, A. (2002). "Distraction by competing speech in young and older adult listeners," Psychol. Aging 17, 453-467.

Turner, C. W., Chi, S. L., and Flock, S. (1999). "Limiting spectral resolution in speech for listeners with sensorineural hearing loss," J. Speech Lang. Hear. Res. 42, 773-784.

Turner, C. W., Kwon, B. J., Tanaka, C., Knapp, J., Hubbartt, J. L., and Doherty, K. A. (1998). "Frequency-weighting functions for broadband speech as estimated by a correlational method," J. Acoust. Soc. Am. 104, 1580-1585.

Turner, C. W., Souza, P. E., and Forget, L. N. (1995). "Use of temporal envelope cues in speech recognition by normal and hearing-impaired listeners," J. Acoust. Soc. Am. 97, 2568-2576.

van Schijndel, N. H., Houtgast, T., and Festen, J. M. (2001). "Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners," J. Acoust. Soc. Am. 110, 529-542.

Van Tasell, D. J., Greenfield, D. G., Logemann, J. J., and Nelson, D. A. (1992). "Temporal cues for consonant recognition: Training, talker generalization, and use in evaluation of cochlear implants," J. Acoust. Soc. Am. 92, 1247-1257.

Van Tasell, D. J., Soli, S. D., Kirby, V. M., and Widin, G. P. (1987). "Speech waveform envelope cues for consonant recognition," J. Acoust. Soc. Am. 82, 1152-1161.

Warren, R. M. (1970). "Perceptual restoration of missing speech sounds," Science 167, 392-393.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2005). "Intelligibilities of 1-octave rectangular bands spanning the speech spectrum when heard separately and paired," J. Acoust. Soc. Am. 118, 3261-3266.

Warren, R. M., Riener, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). "Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits," Percept. Psychophys. 57, 175-182.

Wegel, R., and Lane, C. E. (1924). "The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear," Phys. Rev. 23, 266-285.


CHAPTER 3: OVERVIEW OF CURRENT STUDIES

Here we present three manuscripts concerning the contributions of various frequency bands to speech intelligibility, as well as the detrimental impact of noise on those contributions.

The first manuscript describes a newly developed method to measure band-importance functions (the compound method; Apoux and Healy, 2012), and examines new functions derived from two types of speech materials: words and sentences. The objective is to compare these observed functions to those found in the Speech Intelligibility Index for the identical recordings.

The second manuscript examines the effect of using multiple talkers of different genders versus one single male talker to generate band-importance functions. It also examines the relative influence of talker effects (whether the function reflects acoustic aspects of a particular voice) versus speech-material effects (whether the function reflects aspects of the type of speech under examination, such as sentences versus words). The objective is to determine whether a function may be able to generalize across different talkers or speech materials, or whether a function derived from one single talker may reflect particular idiosyncrasies of that speaker’s voice.

The third manuscript examines directly the corrupting influence of noise on various 'critical' bands of speech. Currently, effects of noise are generally believed to be equal across frequency. However, it may be that different bands of speech have different tolerances or susceptibilities to masking noise, potentially due to the type of acoustic cues residing within each band. Therefore, a systematic examination of this influence will be conducted by varying the signal-to-noise ratio within each target band.


CHAPTER 4: MANUSCRIPT 1

Band-importance for sentences and words reexamined

Published in The Journal of the Acoustical Society of America

Eric W. Healy, Sarah E. Yoho, Frédéric Apoux

Department of Speech and Hearing Science

The Ohio State University

Columbus, OH 43210

Abstract

Band-importance functions were created using the 'compound' technique [F. Apoux and E. W. Healy, J. Acoust. Soc. Am. 132, 1078-1087 (2012)] that accounts for the multitude of synergistic and redundant interactions that take place among speech bands. Functions were created for standard recordings of the SPIN sentences and the CID W-22 words using 21 critical-band divisions and steep filtering to eliminate the influence of filter slopes. On a given trial, a band of interest was presented along with four other bands having spectral locations determined randomly on each trial. In corresponding trials, the band of interest was absent and only the four other bands were present. The importance of the band of interest was determined by the difference between paired band-present and band-absent trials. Because the locations of the other bands changed randomly from trial to trial, various interactions occurred between the band of interest and other speech bands, which provided a general estimate of band importance. Obtained band-importance functions differed substantially from those currently available for identical speech recordings. In addition to differences in the overall shape of the functions, especially for the W-22 words, a complex microstructure was observed in which the importance of adjacent frequency bands often varied considerably. This microstructure may result in better predictive power of the current function.

I. INTRODUCTION

Much of what is known about the spectral distribution of speech information is reflected in the Speech Intelligibility Index (SII; ANSI, 1997) and its predecessor, the Articulation Index (AI; ANSI, 1969). These indexes provide a method for estimating the intelligibility of various communication systems based on acoustic measures and can alleviate the need for extensive testing of human listeners. One of the key components of the Index (henceforth, the SII) is the set of band-importance functions, which describe the relative contribution to total speech information provided by each spectral band. Data are provided for bands ranging in width from critical bands to octaves. The sum of these importance values, once each is scaled to reflect audibility, provides an SII value from 0.0 to 1.0, reflecting the proportion of total speech information available to the listener.
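As a minimal sketch of this bookkeeping (not the full ANSI S3.5 procedure, which also specifies how band audibility is derived from the speech, noise, and threshold spectra), the summation can be written as:

    import numpy as np

    importance = np.full(21, 1 / 21)  # hypothetical: 21 equally important bands
    audibility = np.ones(21)          # 1.0 = fully audible, 0.0 = fully masked
    audibility[15:] = 0.0             # suppose the six highest bands are masked

    sii = float(np.sum(importance * audibility))  # here, 15/21, approx. 0.71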

The band-importance functions of the SII provide not only the practical means required to calculate SII values; they are also of substantial theoretical importance. These values reflect our understanding of the speech-information content of each spectral band, an understanding that is seemingly critical to our overall understanding of speech processing. Further, these values have been used in numerous empirical studies. Examples include the design of spectral bands having different frequency locations but equal a priori intelligibility (e.g., Grant and Walden, 1996) or the estimation of factors beyond acoustic speech information that impact intelligibility, such as cognitive factors involved in aging (e.g., Dubno et al., 2008).

Existing band-importance functions are based on recognition in background noise as the speech signal is subjected to successive low-pass or high-pass filtering. The importance of a band is then determined by comparing the recognition scores across two successive cutoff frequencies. A consequence of this procedure is that the importance of any spectral band is assessed when information either above it or below it in frequency is entirely intact, while the complementary information (below it or above it) is entirely missing (e.g., French and Steinberg, 1947; Fletcher and Galt, 1950; Studebaker and Sherbecoe, 1991).

Recent years have brought an increased understanding of the multitude of redundancies and synergistic interactions that exist among speech bands (e.g., Breeuwer and Plomp, 1984, 1985; Warren et al., 1995; Lippmann, 1996; Müsch and Buus, 2001; Healy and Warren, 2003; Healy and Bacon, 2007). A simple example of this potentially profound synergy was provided by Healy and Warren (2003), who showed that speech-modulated bands that provide essentially no intelligibility when presented individually (0 or 1%) can combine to provide substantial intelligibility (81%). That same study showed that the intelligibility of band pairs was a function of their spacing, reflecting the extent to which the information provided by the two bands was complementary or redundant. Consider the following: when a particular "target" band is presented along with another band that is juxtaposed in frequency, the information provided by the target may be redundant and its importance low. Alternatively, when that same target band is presented along with a band that is more spectrally distant, its importance may increase due to the complementary nature of the information it provides. Finally (as found by Healy and Warren, 2003), if that same target band is presented along with a band that is too disparate in frequency, the complementary nature of the information (or perhaps its integration) may be limited, and the importance of the target may again be diminished. This is suggestive of a complex interaction between redundancy and synergy that takes place among various speech frequencies.

It should be clear from the above that the stand-alone intelligibility of isolated bands cannot be used to predict the contribution to total intelligibility that a given band provides when other bands are present. But it is also suggested that the contribution of a speech band cannot be accurately assessed based on its contribution to contiguous frequencies above or below it, as in the SII procedure. Instead, it is argued that the contribution of a particular speech band is a complex function of the extent to which it provides information that is redundant or complementary with that of other speech bands.

It should also be clear that it is difficult to predict which speech frequencies will be spared and which will be masked when speech is presented in a spectro-temporally complex background, as in many everyday environments. Indeed, the concept of "glimpsing" speech in background noise involves the integration of glimpses of clean speech that change in frequency position from moment to moment, analogous to a checkerboard pattern on a spectrogram (e.g., Brungart et al., 2006; Cooke, 2006; Li and Loizou, 2007; Apoux and Healy, 2009, 2010, 2012).

Fortunately, a method has been developed to account for this potential limitation in the traditional method used to create band-importance functions. Apoux and Healy (2012) demonstrated that the importance of a speech band could be measured in a more general sense, one that takes into account synergistic and redundant interactions. In this method, a given target band is presented along with n other bands having frequency positions determined randomly. In a comparison trial, the target band is absent, but the positions of the other bands remain the same. In subsequent pairs of trials, the target band is assessed using new random frequency positions of the n other bands. The difference between performance on the band-present versus band-absent trials reflects the importance of the target band, irrespective of the location of information elsewhere in the spectrum. In other words, the resulting importance represents the manner in which the target band interacts with other bands to contribute to overall intelligibility. This method has been referred to as the "compound" approach.1 Apoux and Healy (2012) used this approach to assess the importance of individual auditory-filter (ERBN) wide bands using vowel and consonant materials. In the current study, the compound approach is extended to create band-importance functions using SII band divisions and sentences and words, for which published functions exist.

A second issue that must be considered when assessing band importance involves the role of the filter slopes. Although there was some awareness of the influence of transition bands on filtered-speech intelligibility when the existing ANSI band-importance functions were created, the steepness of slopes required to mitigate this influence was severely underestimated. For example, Studebaker and Sherbecoe (1991) suggested that slopes of 96 dB/oct should be sufficient to eliminate the influence of transition bands. More recent studies do not support this suggestion. In particular, Healy (1998) demonstrated that much of the high intelligibility of sentences filtered to a narrow "spectral slit" (Warren et al., 1995) can be attributed to information contained in the transition bands created by the filter skirts. The 100 CID everyday-speech sentences (Silverman and Hirsch, 1955; Davis and Silverman, 1978) were filtered to a 1/3-octave band centered at 1500 Hz. When this band had filter slopes of 96 dB/octave (using Butterworth or finite-duration impulse response [FIR] filtering), normal-hearing listeners produced an intelligibility score of 98%. However, when the nominal 1/3-octave bandwidth was maintained but the filter slopes were increased to approximately 300 dB/octave (using a 275-order FIR filter), mean intelligibility fell to 55%. Essentially removing the transition bands through an increase in slope to approximately 1,700 dB/octave (using a 2000-order FIR filter) resulted in a mean score of only 16%.

Subsequent work by Warren and colleagues confirmed the strong role that filter slopes play in the intelligibility of filtered speech. Warren and Bashford (1999) replicated the relatively low intelligibility of a 1/3-octave band centered at 1500 Hz, created using a 2000-order FIR filter. They also showed that isolated 96 dB/octave triangular skirts produced far higher intelligibility than did the 1/3-octave rectangular passband. Another experiment showed that 1/3-octave CID sentence intelligibility dropped as filter slopes increased; a value of 4800 dB/octave was needed to eliminate the contribution of the skirts (Warren et al., 2004). Thus, restriction of the acoustic signal using sharply defined boundaries is critical.

From the above, it may be assumed that the contribution of transition bands was not eliminated in existing ANSI band-importance functions. This contribution is clearly a limitation of the SII, as the contribution to intelligibility provided by specific frequency bands within the acoustic speech spectrum is of interest for band importance. In their study, Apoux and Healy (2012) used interpolated bands of speech and noise to reduce the influence of transition bands. This technique, however, requires the relative levels of speech and noise to be selected carefully to limit masking of the target speech by spectrally adjacent noise (cf. Apoux and Healy, 2009). In the present study, a refinement of the compound approach is introduced, which involves the use of steep filter slopes to eliminate the contribution of transition bands.

The compound method provides a procedure for measuring directly the importance of clearly defined bands, while accounting for the multiple interactions that exist among speech frequencies. The purpose of the current study was to use the refined compound method to create band-importance functions for the standard recordings of the SPIN sentences (Kalikow, Stevens, and Elliot, 1977) and CID W-22 phonetically balanced words (Hirsh et al., 1952), using standard band divisions, and to compare these functions with those available in the SII for these same speech materials.

II. EXPERIMENT 1. HIGH- AND LOW-PREDICTABILITY SPIN SENTENCES

A. Method

1. Subjects

Sixty normal-hearing listeners between the ages of 18 and 40 years (mean = 20.4) participated. Fifty-five were female. They were recruited from courses at The Ohio State University and received a monetary incentive. All had pure-tone audiometric thresholds at or below 20 dB HL at octave frequencies from 250 to 8000 Hz (ANSI, 2004, 2010). None had any prior exposure to the sentence materials employed here.

2. Stimuli

The materials were sentences from the revised version (Bilger et al., 1984) of the Speech Perception in Noise test (SPIN; Kalikow, Stevens, and Elliot, 1977). They were extracted from the original Bolt, Beranek and Newman recordings and are therefore the same materials specified in the SII. The audio was extracted at 44.1 kHz sampling and 16-bit resolution from an authorized CD version of the test (Authorized Version, Revised SPIN Test [Audio Recording]. Champaign, IL: Department of Speech and Hearing Science). The test consists of 200 High-Predictability sentences, in which the final words used for scoring are cued by the semantic content of the sentence (e.g., "Stir your coffee with a spoon."). There are also 200 Low-Predictability sentences, in which the final scoring keyword is not signaled by context (e.g., "He would think about the rag."). The sentences are five to eight words and six to eight syllables in length, and the scoring keywords are phonetically balanced monosyllables of moderate familiarity. They were produced by a male speaker having a standard American dialect. The interested reader is directed to Kalikow, Stevens, and Elliot (1977) and to Elliot (1995) for comprehensive histories of test development and recording.

The 21 critical-band divisions specified in the SII were employed (see Table 4.1). An FIR filter having an order ranging from 2,000 (for the highest-frequency bands) to 20,000 (for the lowest-frequency bands) was employed. Filter order was adjusted for each band to produce approximately equal 8,000 dB/octave slopes across the spectrum. Filter slopes were measured from cutoff to noise floor. Due to limitations associated with filtering in the low spectral region, slope values decreased somewhat below 500 Hz. However, values remained over several thousand dB/octave at 300 Hz and were approximately 1,000 dB/octave at 100 Hz. Transition bandwidths below 500 Hz remained in the 3-5 Hz range.2 Figure 4.1 displays the output of several band-pass filters used in the present experiments. After filtering, the various group delays associated with filtering at different orders (delay = order/2, in samples) were corrected to ensure that all bands were presented in exact temporal synchrony. This processing and analysis was performed primarily in MATLAB.
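Although the original processing was performed in MATLAB, the core operations (linear-phase FIR band-pass filtering followed by group-delay compensation of order/2 samples) can be sketched in Python; the filter order and band edges below are chosen for illustration only:

    import numpy as np
    from scipy.signal import firwin, lfilter

    def extract_band(x, fs, lo_hz, hi_hz, order):
        # Linear-phase FIR band-pass filter; its group delay is order/2
        # samples, which is removed so all bands stay in temporal synchrony.
        taps = firwin(order + 1, [lo_hz, hi_hz], pass_zero=False, fs=fs)
        y = lfilter(taps, 1.0, np.concatenate([x, np.zeros(order // 2)]))
        return y[order // 2:]

    fs = 44100
    x = np.random.randn(fs)  # placeholder for a speech waveform
    band10 = extract_band(x, fs, 1270.0, 1480.0, order=8000)  # band 10, Table 4.1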

Table 4.1 Band divisions employed in the SII and here.

Band    Center Frequency (Hz)    Band Limits (Hz)
1       150                      100-200
2       250                      200-300
3       350                      300-400
4       450                      400-510
5       570                      510-630
6       700                      630-770
7       840                      770-920
8       1000                     920-1080
9       1170                     1080-1270
10      1370                     1270-1480
11      1600                     1480-1720
12      1850                     1720-2000
13      2150                     2000-2320
14      2500                     2320-2700
15      2900                     2700-3150
16      3400                     3150-3700
17      4000                     3700-4400
18      4800                     4400-5300
19      5800                     5300-6400
20      7000                     6400-7700
21      8500                     7700-9500


Figure 4.1. Responses of the high-order FIR filters used to create the 21 speech bands. Shown are long-term average spectra of a 60-sec white noise filtered using parameters for bands 2, 3, 4; 10, 11, 12; and 18, 19, 20.

3. Procedure

The 21 spectral bands formed 21 target-band conditions. These were distributed across three subject groups: the first randomly assigned group received bands 1-7, the second group bands 8-14, and the third group bands 15-21. As stated earlier, the importance of each spectral band was assessed as information elsewhere in the spectrum was distributed randomly. The spectral band of interest (target band) was presented along with four other bands, and the location of these other bands was determined randomly from trial to trial. This number of bands was selected to place performance in the steep portion of the psychometric function relating intelligibility to number of bands, as established during pilot testing. To establish the importance of each target band, trials were paired. In one member of the pair, the target band was present, along with the four other randomly selected bands. In the other member of the pair, the target band was absent, but the same four other spectral bands were present. Thus, the "fixed" number of bands technique of Apoux and Healy (2012) was employed.

Subjects heard 56 sentences in each of the seven target-band conditions. Half of those were High-Predictability and half were Low-Predictability. The sentences forming each paired trial were always of the same predictability. This arrangement therefore required a total of 392 sentences ([14 sentences band present + 14 sentences band absent] x 2 predictabilities x 7 target-band conditions). The first 196 of the 200 sentences in each predictability subset were used. The presentation order of High- versus Low-Predictability sentences alternated, and all 56 sentences in one target-band condition were completed before moving to the next. The order in which target-band conditions appeared was randomized for each subject, as was the presentation of band-present versus band-absent conditions and the condition-to-sentence correspondence within each target-band condition.

The level of each broadband sentence was set to play at 70 dBA (±2 dB) at each earphone using a flat-plate coupler (Larson Davis AEC 101) and a Type 1 sound level meter (Larson Davis 824). The level of each individual filtered band was not modified, so that the relative spectrum level of each band was maintained. Stimuli were converted to analog form using a PC and Echo Gina 3G D/A converters and presented diotically over Sennheiser HD 280 headphones.

Testing was performed in a double-walled sound booth. It began with a familiarization in which the eight unused sentences (four from each predictability subset) were presented first broadband, then again as five bands having frequencies selected randomly for each trial. Subjects responded after each trial by typing the final word of the sentence,3 and received visual correct/incorrect feedback. Following this familiarization, subjects heard the seven blocks of 56 sentences each and responded as during familiarization, but did not receive feedback. Presentation of stimuli and collection of responses was performed using custom MATLAB scripts running on a PC. The total duration of testing was approximately 2 hours, and subjects were required to take a break after each block of 56 sentences.

B. Results

Group mean intelligibility scores (%) were as follows: High-Predictability band present = 72.1 (SD = 6.7), High-Predictability band absent = 54.8 (SD = 4.4), Low-Predictability band present = 45.0 (SD = 8.7), Low-Predictability band absent = 31.5 (SD = 6.1). Band-importance values were established following Apoux and Healy (2012): The intelligibility difference between band-present and band-absent conditions was calculated for each target band for each subject, and these differences were averaged across subjects to create a mean difference for each band. These mean intelligibility differences were summed across bands, and the importance of each band corresponded to the intelligibility difference of the band over the sum of the intelligibility differences.
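A short sketch of this calculation (Python; the per-subject differences below are simulated placeholders, not the data of this experiment) is:

    import numpy as np

    rng = np.random.default_rng(2)
    # diff[s, b]: band-present minus band-absent intelligibility for
    # subject s and target band b (hypothetical values).
    diff = rng.normal(loc=0.10, scale=0.05, size=(20, 21))

    mean_diff = diff.mean(axis=0)             # average across subjects
    importance = mean_diff / mean_diff.sum()  # normalize to sum to 1.0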

Figure 4.2 shows importance for each of the 21 bands. High- and Low-Predictability sentences were pooled in this view, as the single SII band-importance function represents both predictability subsets. Shown are data based on the first 10 subjects run in each of the three frequency regions, the first 15 subjects, all 20 subjects, and the second subgroup of 10 subjects run. Apparent are large variations across successive frequency bands that are relatively stable across different numbers of subjects and similar across the first and second subgroups of 10 randomly selected subjects.

Figure 4.3 shows the band-importance function obtained in the current experiment (based on all 20 subjects in each spectral region and both High- and Low-Predictability sentences) plotted against that for the identical speech materials in the SII. As can be seen, the overall shapes of the two functions are similar. However, the variations across successive bands observed here cause the importance of individual bands to differ substantially across the two functions. Also shown are values for the first three formant frequencies, based on the average of final words from 50 randomly selected sentences. Values were determined in PRAAT using LPC and a maximum formant frequency limit of 5000 Hz (Boersma and Weenink, 2011). Figure 4.4 (top panel) displays absolute deviations from SII values for each band. These deviations range up to an importance of 0.054 (corresponding to 5.4% of the total information in speech). The bottom panel displays these deviations in percent, such that observed importance values that were double those in the SII would be assigned a deviation of 100% ([|current importance value - SII importance value| / SII importance value] x 100). These deviations range up to 193% and average 43%. They are greatest for the lowest and highest bands, and for band 10 (1370 Hz).


Figure 4.2. Band importance values for SPIN sentences, for each of 21 speech bands. High- and Low-Predictability sentences were pooled. Shown are functions for (i) the first randomly-selected subgroup of 10 subjects in each frequency region, (ii) the first 15 subjects, (iii) all 20 subjects, and (iv) the second randomly-selected subgroup of 10 subjects in each frequency region.

Figure 4.3. Band importance values for SPIN sentences obtained in the current experiment (High- and Low-Predictability sentences pooled) versus that described in the SII for identical speech materials. The first three formant frequencies are indicated by inverted triangles at the top of the panel.


Figure 4.4. The top panel shows absolute deviations from SII importance values for each band in band-importance units. The bottom panel shows these deviations in percent (|current importance value - SII importance value| / SII importance value). Because the SII importance for band 21 is zero, a percent difference could not be calculated in the bottom panel.

Figure 4.5 displays band-importance functions for High-Predictability and Low-Predictability sentences separately. Although these two functions are generally similar in shape, differences in the importance of individual bands are again evident. Figure 4.6 shows absolute deviations from SII importance values for High- and Low-Predictability sentences separately. The top panel displays deviations in band-importance units and the bottom panel shows these deviations in percent. As can be seen, the deviations from the SII are generally larger for the Low-Predictability subset of sentences. The absolute deviations for the High- and Low-Predictability sentences average 0.012 and 0.025 respectively, and 31 and 68% respectively.

Figure 4.5. Band-importance functions for High-Predictability and Low-Predictability SPIN sentences. Also shown is the SII function for identical speech materials.


Figure 4.6. The top panel shows absolute deviations from SII importance values for High- and Low-Predictability SPIN sentences in band-importance units. The bottom panel shows these deviations in percent, as in Fig. 4.4. Again, percent difference could not be calculated in the bottom panel for band 21.

III. EXPERIMENT 2. PHONETICALLY-BALANCED WORDS

A. Method

1. Subjects

Sixty normal-hearing listeners (55 female) between the ages of 18 and 41 years (mean = 21.1) participated. Of these, 32 had participated in Experiment 1. Their recruitment, compensation, and audiometric characteristics were the same as in Experiment 1, except that two subjects had a threshold of 25 dB HL at 8 kHz in the left ear. None had any prior exposure to the word materials employed here.

2. Stimuli and Procedure

The materials were drawn from the phonetically balanced lists of the CID W-22 test (Hirsh et al., 1952). The test consists of 200 words produced by a male speaker having a general American dialect in the carrier phrase, "You will say ___." The materials were extracted from the original recordings by Technisonic Studios and are therefore the particular recordings specified in the SII. The 44.1 kHz, 16-bit digital signal was extracted from CD (Auditory Research Laboratory, VA Medical Center, Mountain Home, TN, 2006), which in turn originated from original Technisonic tape that was digitized at 20 kHz and 16-bit resolution. The processing of these words into 21 critical bands followed the procedures of Experiment 1.

As in Experiment 1, subjects were divided randomly into three groups and assigned target bands 1-7, 8-14, or 15-21. Again, trials in which the target band was present were paired with trials in which the target was absent, and four other randomly located bands appeared along with the target band. Subjects heard 26 words (13 words band present/13 words band absent) in each of the 7 target-band conditions. Fifteen words were reserved for practice, necessitating a total of 197 words ("mew," "two," and "dull" were omitted). Target-band conditions were blocked, such that all 26 trials in one condition were completed before moving on to the next. The order of target-band conditions, the appearance of band present versus absent, and the word-to-condition correspondence were randomized for each listener.

Testing began with familiarization consisting of 15 words heard first broadband, then as five randomly located bands. As in Experiment 1, subjects typed responses after each trial,4 and received trial-by-trial feedback during familiarization, but not during formal testing. All other procedures and apparatus were identical to those in Experiment 1. The total duration of testing was approximately 1 hour, and subjects were required to take a break after every 52 words.

B. Results

Band importance was calculated as in Exp. 1. Group mean intelligibility scores (%) were 42.2 (SD = 8.3) for band present and 29.3 (SD = 5.0) for band absent. Figure 4.7 shows band-importance functions for phonetically balanced words based on the first 10, 15, and 20 subjects run in each of the three frequency regions, as well as the last 10 randomly selected subjects run in each frequency region. As was the case for the SPIN sentences, substantial variation across successive frequency bands exists, and these variations appear stable across various numbers of subjects and independent subgroups of listeners.

Figure 4.8 displays the band-importance function obtained in the current study against that described for the identical speech materials in the SII. Whereas the SII function is relatively flat below 2900 Hz and gradually sloping above that value, the function obtained here has the inverted-U shape that is characteristic of the SPIN sentences and most other speech materials. Substantial variation across individual bands is also apparent. The first three formant frequencies are indicated in Fig. 4.8, based on the average of 50 randomly chosen final words. The deviations from the SII function are plotted in Fig. 4.9. The top panel shows values in band-importance units and the bottom panel shows values in percent. Absolute deviations in importance of individual bands ranged as high as 0.083 (corresponding to 8.3% of total speech information). Expressed in percent, deviation values ranged up to 189% and averaged 71% across bands.

Figure 4.7. Band-importance functions for CID W-22 phonetically-balanced words. As in Fig. 4.2, functions are shown based on the first 10, 15, and 20 subjects run in each of three frequency regions, as well as for the second subgroup of 10 subjects run.


Figure 4.8. The band importance function obtained here for CID W-22 words versus that described in the SII for the identical speech materials. The first three formant frequencies are indicated by inverted triangles at the top of the panel.

Figure 4.9. As Fig. 4.4, but for W-22 words. Unlike Fig. 4.4, percent deviations could be calculated for all bands because SII importance is non-zero for all bands.

Finally, the impact of smoothing the band-importance functions was assessed. A triangular window was employed, in which the smoothed importance of band n, I[n], was defined as:

I[n] = .25 I[n-1] + .50 I[n] + .25 I[n+1].    (1)

The highest and lowest bands were included in this process, to provide a greater impact of smoothing. Band 1 was smoothed using:

I[1] = .67 I[1] + .33 I[2],    (2)

and band 21 was smoothed using:

I[21] = .33 I[20] + .67 I[21].    (3)

Thus, the smoothed importance of band n always included the importance of the adjacent band(s) at half weight. Figure 4.10 shows the band-importance functions obtained here following this smoothing, as well as corresponding functions from the SII. Although the discrepancies between the functions obtained here and those in the SII are reduced somewhat by smoothing, differences remain, especially for the W-22 words (bottom panel).
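Equations (1)-(3) amount to the following small routine (Python; written here to match the stated weights, not code from the original study):

    import numpy as np

    def smooth_triangular(I):
        # Interior bands: weights (.25, .50, .25); edge bands: (.67, .33)
        # with their single neighbor, per Eqs. (1)-(3).
        I = np.asarray(I, dtype=float)
        out = np.empty_like(I)
        out[1:-1] = 0.25 * I[:-2] + 0.50 * I[1:-1] + 0.25 * I[2:]
        out[0] = 0.67 * I[0] + 0.33 * I[1]
        out[-1] = 0.33 * I[-2] + 0.67 * I[-1]
        return out

    # e.g., applied to the first few pooled SPIN importances of Table 4.2
    print(smooth_triangular([0.0315, 0.0265, 0.0558, 0.0420, 0.0370]))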


Figure 4.10. Shown are band-importance functions obtained in the current study following smoothing across bands using a triangular weighting window. Corresponding functions from the SII are also displayed. The top panel shows the functions for the SPIN sentences, and the bottom panel shows functions for the CID W-22 words.

IV. DISCUSSION

The primary advantages of the compound method as implemented here include (i) resulting band-importance functions that account for the multitude of interactions that take place among various spectral regions and (ii) the restriction of speech information to sharply defined spectral regions. Figures 4.3 and 4.8 show that the resulting functions differ considerably from those in the SII, despite the use of identical speech recordings. While the overall shape of the function obtained here for the SPIN sentences is generally similar to the corresponding SII function, the function obtained here for the W-22 words bears little resemblance to that in the SII. The current functions are also differentiated from those in the SII by the existence of a complex microstructure in which the importance of adjacent bands may differ substantially. This microstructure is apparent in the band-importance functions for both the SPIN sentences and the W-22 words. As the numerical band-importance values obtained in the current study may be of some utility, they are provided in Table 4.2.5

Table 4.2 Band importance values obtained in the current study for SPIN sentences (Exp. 1) and CID W-22 phonetically balanced words (Exp. 2).

Band    High-Pred. SPIN    Low-Pred. SPIN    High- and Low-Pred. SPIN Pooled    W-22
1       0.0216             0.0441            0.0315                             0.0326
2       0.0354             0.0151            0.0265                             -0.0156
3       0.0423             0.0732            0.0558                             0.0184
4       0.0560             0.0240            0.0420                             0.0752
5       0.0344             0.0404            0.0370                             0.0681
6       0.0757             0.0391            0.0597                             0.0724
7       0.0718             0.0618            0.0674                             0.0525
8       0.0560             0.0681            0.0613                             0.0993
9       0.0521             0.0290            0.0420                             0.0468
10      0.0905             0.1349            0.1099                             0.1149
11      0.0560             0.0214            0.0409                             0.1191
12      0.0472             0.0378            0.0431                             0.0383
13      0.0551             0.0681            0.0608                             0.0241
14      0.0679             0.0870            0.0763                             0.1106
15      0.0570             0.0744            0.0646                             0.0213
16      0.0619             0.0391            0.0519                             -0.0057
17      0.0492             0.0542            0.0514                             0.0482
18      0.0315             -0.0063           0.0149                             0.0128
19      -0.0148            0.0177            -0.0006                            -0.0071
20      0.0266             0.0542            0.0387                             0.0270
21      0.0266             0.0227            0.0249                             0.0468

Figures 4.2 and 4.7 show the functions generated by the first group of 10 subjects in each condition relative to those generated by the final 10. Because subjects were assigned to groups randomly, these "first 10" and "last 10" subgroups can be considered separate estimates by independent groups. The functions generated by these independent groups are highly similar, including the characteristic peaks and valleys in each function. This clearly indicates that the microstructure present for both sentences and words is not attributable to random variation and is instead reflective of the truly differing contributions of various bands. While the predictive power of the current band-importance values, relative to that of the SII, has yet to be determined, the microstructure present in the current functions suggests increased sensitivity to the contribution of individual bands, which may in turn result in greater predictive power.

The existence of substantial microstructure that is reliable across independent estimates argues against the use of smoothing across frequency bands. This is the reason that smoothing was not employed in the majority of the analyses or to derive the values in Table 4.2. However, smoothed functions were displayed in Fig. 4.10 to assess whether the lack of smoothing in the current study caused the dissimilarities observed between the current functions and those in the SII. The successive high-pass/low-pass filtering technique on which the SII functions are based employs smoothing of raw recognition scores, sometimes performed simply by eye. But as Fig. 4.10 shows, smoothing cannot account for the differences observed. It is also important to note that the deviations reported here between the current band-importance values and those in the SII are likely increased by the different use of smoothing across the two techniques. However, it can be argued that these differences in smoothing form a portion of the overall difference between the two techniques, so the deviations as currently measured capture this important difference.

The deviations from SII values are detailed in Fig. 4.4 for the SPIN sentences. The function is relatively stable across frequencies at approximately 0.02 when expressed as deviations in units of importance (top panel). An exception appears for band 10 (1370 Hz), where the deviation is 0.054. This difference in importance between the current function and that in the SII is quite substantial, as the current importance value of 0.1099 indicates that band 10 contributes 10.99% of the total information in speech, whereas the SII importance of 0.0556 indicates a contribution of only 5.56%. The bottom panel of Fig. 4.4 provides percent deviations from SII values. The U shape results from the fact that importance values reported in the SII for the lowest- and highest-frequency bands approach zero, whereas those obtained here are higher. Figure 4.9 provides the same analysis for the W-22 words. The deviations, expressed both in importance units and in percent, are relatively flat across frequency. However, as Fig. 4.9 suggests, these deviations are quite large, even larger than for the SPIN sentences. Differences in importance as high as 0.083 suggest that current ANSI functions underestimate individual band contributions to total speech information by as much as 8.3%.

As stated earlier, the band-importance values provided in the SII represent both the High-Predictability and Low-Predictability sentences of the SPIN test. Figure 4.5 shows that the High- and Low-Predictability importance functions share the same general shape, including substantial and somewhat similar microstructure. This result supports the SII assertion that a single band-importance function may be used for both predictability subsets of the SPIN test (also see Bell, Dirks, and Trine, 1992). However, the deviations from SII values are considerably larger for the Low-Predictability sentences than for the High. In fact, the deviations are roughly double for the Low-Predictability sentences, relative to the High, when expressed in importance units or percent. This result suggests that the SII band-importance function may better characterize the High-Predictability subset of the SPIN test.

Shown in Figs. 4.3 and 4.8 are the locations of the first three formants for the corresponding speech materials. Formants 1 and 3 appear to align reasonably well with modest peaks in the importance function for the SPIN sentences. However, the prominent peak at 1370 Hz (band 10) does not align well with Formant 2. The correspondence is somewhat better in Experiment 2 for the W-22 words. Formants 2 and 3 appear to align well with prominent peaks in the function, and Formant 1 aligns with a more modest peak. These results suggest that the microstructure observed in the current band-importance functions can be related to acoustic aspects of the particular speech recordings employed, perhaps formant frequencies. Accordingly, it is suggested that band importance may need to be estimated using materials spoken by numerous talkers, if it is desired to have the resulting functions represent speech more generally. Although this concept is clear in early writings (Fletcher and Steinberg, 1929; Fletcher and Galt, 1950; both in Fletcher, 1995, pp. 278-279), it is not in practice today. Rather, the SII provides importance functions for several different speech tests based primarily on particular single-talker recordings. We can refer to this dependence of band importance on specific recordings as a talker effect.

There is a related point, which we can refer to as a speech-material effect. In conceptualizing the Articulation Index, French and Steinberg (1947) considered speech to consist of articulation units – a succession of individual sounds received by the ear in their initial order and spacing in time. This view is reflected in their focus on acoustic speech and noise levels, and their use of meaningless CVC syllables and “syllable articulation,” defined as the percentage of syllables for which all three component sounds were perceived correctly. But we are now more aware that the use of these strictly bottom-up speech cues might change as the amount of top-down information changes.

Certainly, less information is needed to maintain communication as contextual information increases. This is reflected by the increasing slopes of transfer functions representing syllables versus words versus sentences. A question that remains is the extent to which band importance is determined by talker effects, and whether it is also affected by speech-material effects (i.e., syllables versus words versus sentences).

A primary advantage of the compound method as currently implemented involves the use of steep filter slopes to restrict speech information to well-defined regions. The SII importance values for SPIN sentences were estimated using slopes of 96 dB/octave (Bell, Dirks, and Trine, 1992). It is unknown to what extent this aspect of processing may have influenced the shape of the function. But it is clear that considerable amounts of speech information can reside in the transition bands created by such slopes, and that listeners can use this information that exists outside of the band of interest (the passband) to recognize sentences (Healy, 1998; Warren and Bashford, 1999; Warren et al., 2004).

Functions for the CID W-22 words were estimated using generally steeper slopes (0.86 dB/Hz, Studebaker and Sherbecoe, 1991). However, the use of constant dB/Hz values results in widely-varying slopes in terms of dB/octave. Further, these slopes could be considered quite shallow in the lower frequencies. The use of steep and consistent filtering (see Fig. 4.1) allows the importance of clearly-defined regions of the spectrum to be assessed, without the complicating influences of information contained in transition bands. However, it is important to note that the use of steep filtering alone is not sufficient to assess band importance (e.g., Warren et al., 2005; 2011). The intelligibility of isolated or paired speech bands, even those having steep slopes, cannot approximate the multitude of interactions that take place among various speech frequencies.
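
The relation between the two slope conventions can be illustrated with a small calculation. Because the octave above frequency f spans f Hz, a constant per-Hz slope s corresponds to approximately s × f dB/octave; a sketch (the example frequencies are chosen purely for illustration):

    % Approximate dB/octave equivalent of a constant dB/Hz slope: the octave
    % above f spans f Hz, so the per-octave slope is roughly s * f
    s = 0.86;                    % dB/Hz (Studebaker and Sherbecoe, 1991)
    f = [250 1000 4000];         % example frequencies in Hz
    slopeOct = s .* f            % approx. 215, 860, and 3440 dB/octave

This makes plain why a 0.86 dB/Hz slope is quite shallow at low frequencies yet very steep at high frequencies.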

The current study involved a large number of subjects (N = 120) each committing a substantial amount of time. In order to examine whether stable results can be obtained with fewer subjects, importance functions were created using data from the first 10, the first 15, then all 20 subjects in each of the three frequency regions in each experiment.

As Figs. 4.2 and 4.7 show, the functions generated by the first 10 or 15 subjects are quite similar to that generated by all 20. It may therefore be concluded that another advantage of the compound method is that it requires fewer subjects than other approaches to accurately estimate band-importance functions. However, this limited subject requirement may depend upon a number of factors (e.g., number of trials, number of talkers, speech materials, etc.). For instance, Apoux and Healy (2012) found that band-importance functions for multitalker CVC and VCV phonemes continued to stabilize after the first 10-20 subjects.

The traditional high-pass/low-pass filtering technique of estimating band-importance functions requires independent control of speech bandwidth and overall level of performance. To accomplish this goal, background noise is added at various levels to adjust overall level of performance. Although noise has the effect of reducing to some extent the influence of shallow filter slopes, it selectively masks lower-amplitude portions of the signal and reduces modulation depth. Further, although early reports found a generally linear relation between the contribution of a band and the effective SNR within that band (French and Steinberg, 1947; also see Steeneken and Houtgast, 1980), more recent work has shown that background noise can affect the shape of the band-importance function. For example, Apoux and Bacon (2004) estimated functions for consonants using the hole technique (Shannon et al., 2001; Kasturi et al., 2002; Apoux and Bacon, 2004), with and without background noise. It was found that the shapes of the functions obtained in quiet and in noise differed substantially. Apoux and Bacon attributed this effect to the differential effect of noise on various acoustic speech cues. It may be desirable to obtain the relative contribution of each speech band to overall intelligibility without these complicating influences of noise, as in the current study, and to subsequently examine the degrading influence of noise on each band.

V. SUMMARY AND CONCLUSIONS

In the current study, band-importance functions based on 21 critical bands were established for SPIN sentences and CID W-22 words using the compound technique and an additional refinement involving extremely steep filter slopes. These functions were compared to those for identical recordings in the SII (ANSI, 1997). The current method provides importance estimates for strictly-defined spectral regions, while accounting for the multitude of synergistic and redundant interactions that take place across the speech spectrum. It is also computationally simpler and more efficient than that traditionally used to evaluate band importance. Substantial differences were observed in the shapes of the functions obtained here relative to those in the SII, especially for the W-22 words.

The current importance functions also appear more sensitive to the contribution of particular spectral bands, as reflected in the substantial microstructure observed here.

NOTES:

1. Other techniques to create band-importance functions exist, which may also account to some degree for between-band interactions. These techniques include the correlational method, in which the importance of a given band is determined by the correlation between the amount of noise in that band and speech recognition (Doherty and Turner, 1996; Turner et al., 1998; Apoux and Bacon, 2004; Calandruccio and Doherty, 2007), and the hole technique, in which the importance of a given band is determined by removing that band (or pair of bands) from the spectrum and assessing speech recognition (Shannon et al., 2001; Kasturi et al., 2002; Apoux and Bacon, 2004).

Reasons why the correlational and hole techniques could not be used are discussed in Apoux and Healy (2012).

2. Although filter slope (rather than, e.g., transition bandwidth in Hz) was held approximately constant in the current study, it should be noted that slope angle values become less meaningful as they become very steep. This is because extremely small differences in transition bandwidth can lead to extremely large numerical differences in filter slope, as the slope value approaches infinity. Further, these values can depend heavily upon measurement accuracy (e.g., FFT size and measurement point selection). For example, a negligible decrease in transition bandwidth from 5 Hz to 4 Hz (assuming a cutoff of 1500 Hz and 70 dB SNR) yields an increase in slope value of 3,800 dB/octave. A further decrease from 4 Hz to 3 Hz yields a further increase of 5,700 dB/octave.
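
The arithmetic behind this note can be sketched as follows; the 1500-Hz cutoff and 70-dB range are taken from the note, while the exact resulting values depend on where within the transition band the octave span is measured, so this sketch reproduces the note's figures only approximately:

    % Slope in dB/octave for a transition band of width bw Hz above cutoff fc,
    % spanning dr dB of attenuation (values from Note 2)
    fc = 1500; dr = 70;
    bw = [5 4 3];                            % transition bandwidths in Hz
    slope = dr ./ log2((fc + bw) ./ fc)      % approx. 14,600 / 18,200 / 24,300 dB/oct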


3. Only exact case-insensitive matches were accepted in this experiment. Homophones and misspellings were not accepted because any such responses would transpose overall scores up slightly in each condition, but would not affect the band-present/band-absent difference. Further, misspellings require subjective evaluation of inaccurate responses. An analysis involving a subset of data indicated that the influence of accepting alternative responses was slight and similar in both band-present and band-absent conditions – the mean increase across six conditions examined (2 band present/absent x 3 target-band frequencies) was 1%, with a range of 0.4 - 1.6%.
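
For concreteness, the strict scoring rule amounts to a single case-insensitive string comparison; a minimal MATLAB sketch with illustrative variable names:

    % Exact case-insensitive match; homophones and misspellings score as incorrect
    response = 'Bred'; targetWord = 'bred';              % illustrative values
    isCorrect = strcmpi(strtrim(response), targetWord)   % true: case is ignored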

4. Unlike the SPIN sentences in Experiment 1, for which even the Low-Predictability subset narrowed the possible responses to specific parts of speech, the W-22 words lacked context needed to distinguish homophones. Accordingly, pilot testing indicated large numbers of homophone responses, relative to the numbers observed in Experiment 1. Although band-present/band-absent difference scores should remain unaffected, strict response criteria may have reduced recognition scores close to floor values, where difference scores could have been compressed. Thus, homophone responses (e.g., bread, bred) were accepted in this experiment.

5. A small proportion of importance estimates are slightly negative. This is especially apparent for the W-22 words in Fig. 4.8. This could potentially result from two sources. One possibility is that the presence of these few bands is truly detrimental. This could result from these bands masking especially important bands higher in frequency. Alternatively, some type of systematically misleading information could have been provided by these bands. For example, they may have been misinterpreted as specific formants, when in fact they were not. However, it is far simpler to interpret this small number of slightly negative values simply as noise in the data, resulting from slightly lower scores in those band-present conditions, relative to the corresponding band-absent conditions. The one negative importance value in Experiment 1 (Fig. 4.3) resulted from a 0.2% difference between band-present and band-absent conditions, and the three negative values in Experiment 2 (Fig. 4.8) resulted from differences that averaged 2.6%. The decision was made to allow these negative values to remain in Table 4.2, as this allows the decision to either use the values as empirically derived, or convert them to zero and recalculate importance.
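
Where the zero-and-recalculate option is preferred, the adjustment is a two-step operation; a minimal sketch with illustrative values:

    % Optional handling of slightly negative importance estimates (Note 5):
    % set them to zero, then renormalize so the function again sums to 1
    imp = [0.10 -0.002 0.15 0.07];   % illustrative values with one negative estimate
    imp(imp < 0) = 0;
    imp = imp ./ sum(imp)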

ACKNOWLEDGEMENTS

This work was supported in part by grants from the National Institute on Deafness and Other Communication Disorders (DC8594 to EWH and DC9892 to FA). We are grateful for the data collection and analysis assistance of Carla Berg and Mandi Grumm, and the manuscript preparation assistance of Kelsey Richardson and Lorie D'Elia.

REFERENCES

American National Standard Inst. (1969). ANSI S3.5. American National Standard Methods for the Calculation of the Articulation Index (American National Standards Inst., New York).

American National Standard Inst. (1997). ANSI S3.5 (R2007). American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

American National Standard Inst. (2004). ANSI S3.21 (R2009). American National Standard Methods for Manual Pure-Tone Threshold Audiometry (American National Standards Inst., New York).

American National Standard Inst. (2010). ANSI S3.6-2010. American National Standard Specification for Audiometers (American National Standards Inst., New York).

Apoux, F., and Bacon, S. P. (2004). "Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise," J. Acoust. Soc. Am. 116, 1671-1680.

Apoux, F., and Healy, E. W. (2009). "On the number of auditory filter outputs needed to understand speech: Further evidence for auditory channel independence," Hear. Res. 255, 99-108.

Apoux, F., and Healy, E. W. (2010). “Relative contribution of off- and on-frequency spectral components of background noise to the masking of unprocessed and vocoded speech,” J. Acoust. Soc. Am. 128, 2075-2084.

Apoux, F., and Healy, E. W. (2012). "Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants," J. Acoust. Soc. Am. 132, 1078-1087.

Bell, T. S., Dirks, D. D., and Trine, T. D. (1992). “Frequency-importance functions for words in high- and low-context sentences,” J. Speech Hear. Res. 35, 950-959.

Bilger, R. C., Nuetzel, J. M., Rabinowitz, W. M., and Rzeczkowski, C. (1984). "Standardization of a test of speech perception in noise," J. Speech Hear. Res. 27, 32-48.

Boersma, P., and Weenink, D. (2011). Praat: Doing phonetics by computer (Version 4.3.22) [Computer program]. Retrieved from http://www.praat.org/ (last viewed online April 2011).

Breeuwer, M., and Plomp, R. (1984). "Speechreading supplemented with frequency-selective sound-pressure information," J. Acoust. Soc. Am. 76, 686-691.

Breeuwer, M., and Plomp, R. (1985). "Speechreading supplemented with formant-frequency information from voiced speech," J. Acoust. Soc. Am. 77, 314-317.

Brungart, D. S., Chang, P. S., Simpson, B. D., and Wang, D. (2006). “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Am. 120, 4007-4018.

Cooke, M. P. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119, 1562-1573.

Davis, H., and Silverman, S. R. (1978). Hearing and Deafness, 4th ed. (Holt, Rinehart, and Winston, New York), pp. 492-495.

Dubno, J. R., Lee, F-S., Matthews, L. J., Ahlstrom, J. B., Horwitz, A. R., and Mills, J. H. (2008). "Longitudinal changes in speech recognition in older persons," J. Acoust. Soc. Am. 123, 462-475.

Elliott, L. L. (1995). "Verbal auditory closure and the Speech Perception in Noise (SPIN) Test," J. Speech Hear. Res. 38, 1363-1376.

Fletcher, H. (1995). Speech and Hearing in Communication, J. B. Allen, Ed. (Acoustical Society of America, Woodbury, New York), pp. 278-279.

Fletcher, H., and Galt, R. H. (1950). “The perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89-151.

French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119.

Grant, K. W., and Walden, B. E. (1996). "Spectral distribution of prosodic information," J. Speech Hear. Res. 39, 228-238.

Healy, E. W. (1998). "A minimum spectral contrast rule for speech recognition: Intelligibility based upon contrasting pairs of narrow-band amplitude patterns" [Ph.D. dissertation]. The University of Wisconsin - Milwaukee; Available from: http://www.proquest.com/; Publication Number: AAT 9908202, pp. 56-73.

Healy, E. W., and Bacon, S. P. (2007). “Effect of spectral frequency range and separation on the perception of asynchronous speech,” J. Acoust. Soc. Am. 121, 1691-1700.

Healy, E. W., and Warren, R. M. (2003). "The role of contrasting temporal amplitude patterns in the perception of speech," J. Acoust. Soc. Am. 113, 1676-1688.

Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., and Benson, R. W. (1952). "Development of materials for speech audiometry," J. Speech Hear. Disord. 17, 321-337.

Kalikow, D. N., Stevens, K. N., and Elliott, L.L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337-1351.

Kasturi, K., Loizou, P. C., Dorman, M., and Spahr, T. (2002). “The intelligibility of speech with ‘holes’ in the spectrum,” J. Acoust. Soc. Am. 112, 1102-1111.

Li, N., and Loizou, P. C. (2007). "Factors influencing glimpsing of speech in noise," J. Acoust. Soc. Am. 122, 1165-1172.

Lippmann, R. P. (1996). “Accurate consonant perception without mid-frequency speech energy,” IEEE Trans. Speech Audio. Process. 4, 66-69.

Müsch, H. and Buus, S. (2001). “Using statistical decision theory to predict speech intelligibility. I. Model structure,” J. Acoust. Soc. Am. 109, 2896-2909.

Shannon, R. V., Galvin, J. J., III, and Baskent, D. (2001). "Holes in hearing," J. Assoc. Res. Otolaryngol. 3, 185-199.

Silverman, S. R. and Hirsh, I. J. (1955). ‘‘Problems related to the use of speech in clinical audiometry,’’ Ann. Otol. Rhinol. Laryngol. 64, 1234–1245.

Steeneken, H. J. M., and Houtgast, T. (1980). “A physical method for measuring speech- transmission quality,” J. Acoust. Soc. Am. 67, 318-326.

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81, 1130-1138.

Studebaker, G. A., and Sherbecoe, R. L. (1991). “Frequency-importance and transfer functions for recorded CID W-22 word lists,” J. Speech Hear. Res. 34, 427-438.

Warren, R. M., and Bashford, J. A., Jr. (1999). "Intelligibility of 1/3-octave speech: Greater contribution of frequencies outside than inside the nominal passband," J. Acoust. Soc. Am. 106, L47-L52.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2004). "Intelligibility of bandpass filtered speech: Steepness of slopes required to eliminate transition band contributions," J. Acoust. Soc. Am. 115, 1292-1295.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2005). "Intelligibilities of 1-octave rectangular bands spanning the speech spectrum when heard separately and paired," J. Acoust. Soc. Am. 118, 3261-3266.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2011). "An alternative to the computational Speech Intelligibility Index estimates: Direct measurement of rectangular passband intelligibilities," J. Exp. Psychol. [Hum. Percept.] 37, 296-302.

Warren, R. M., Riener, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). "Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits," Percept. Psychophys. 57, 175-182.

CHAPTER 5: MANUSCRIPT 2

Speech-material and talker effects in speech band-importance functions

Sarah E. Yoho, Eric W. Healy, Carla L. Youngdahl, Frederic Apoux

Department of Speech and Hearing Science

The Ohio State University

Columbus, OH 43210

Abstract

Band-importance functions were created using the compound method [F. Apoux and E. W. Healy (2012), J. Acoust. Soc. Am. 132, 1078-1087] and compared to examine possible influences on the shape of those functions. Specifically evaluated were talker and speech-material effects. Functions were created for sentences using the 21 critical-band divisions of the Speech Intelligibility Index. The functions were based on standard recordings from either one male talker or ten talkers of both genders. In addition, new recordings were created for the Central Institute for the Deaf W-22 words and the Speech Perception in Noise sentences using the same male talker for both. Comparisons were then made between the functions for sentences derived with different talkers and between the functions for different speech materials spoken by the same talker. A substantial similarity was observed among the functions for the sentences, despite the fact that they were all spoken by different talkers. In contrast, the two functions derived using the same talker but different speech materials differed more considerably. This suggests a weaker effect of talker, and a stronger effect of speech material, on the shape of the band-importance function. Results from the single- versus multiple-talker functions for sentences indicate that the use of multiple talkers largely diminishes any residual effect of talker, as the ten-talker function was smoother than the single-talker function.

I. INTRODUCTION

Recently, a novel technique for deriving speech band-importance functions has been developed, which allows a large amount of detail to be observed. This 'compound method' (Apoux and Healy, 2012; Healy et al., 2013) has distinct advantages over the traditional ANSI standard method, which underlies the functions present in the Speech Intelligibility Index (SII; ANSI, 1997; formerly the Articulation Index, AI; ANSI, 1969). First, the compound method is more efficient and less time consuming to execute than the ANSI method. For example, the previous method involves many steps, including the establishment of a 'crossover frequency,' which divides the spectrum into equally intelligible halves, the use of multiple signal-to-noise ratios and filtering conditions, and a smoothing of the function (French and Steinberg, 1947; DePaolis, 1992). Second, in the current implementation of the compound method (Healy et al., 2013), steep filtering is employed to strictly restrict speech information to the frequency region of interest and to eliminate the confounding influence of information contained in the filter skirts (see Healy, 1998; Warren et al., 1999; 2004). Third, the compound method allows the examination of importance in either quiet or background noise, whereas the ANSI method necessitates the use of background noise. Background noise is necessary to reduce ceiling effects in the ANSI method, where the band of interest is presented along with large amounts of the spectrum. The ability to present speech in quiet can be beneficial, because it eliminates interactions that have been shown to exist between the effects of masking and the importance of speech in different frequency regions (Apoux and Bacon, 2004).

Finally, and importantly, the compound method allows the importance of a band to be observed in a way that takes into account the various complex interactions across speech frequencies. The ANSI method involves systematically varying the cutoffs during high- and low-pass filtering. Thus, the importance of a band is assessed when the entire frequency spectrum either above or below it is intact. This method relies on the assumption that the total speech information present is the simple sum of the contributions of the individual frequency bands (French and Steinberg, 1947). Although this assumption was originally intended to be applied to articulation testing of isolated phonemes and nonsense syllables, it has been extended to all forms of speech.

A large body of literature exists to indicate that the intelligibility of speech bands is not simply additive (e.g., Breeuwer and Plomp, 1984; Warren et al., 1995; Lippmann, 1996; Healy and Warren, 2003; Healy and Bacon, 2007). When the AI was first being developed, Pollack (1948) showed that although the intelligibility of individual high- and low-pass narrow bands could be very low, the intelligibility when the bands were presented together could far exceed the sum of the two individually. This fact is also reflected in the transfer functions relating intelligibility to AI value (ANSI, 1969), which have a slope exceeding one.

In addition to this well-established synergistic nature of intelligibility, it has been argued that the contribution to total information that a band makes likely depends to a large extent on what other speech bands are present (see Healy et al., 2013). Thus, speech information content is also likely not simply additive, but instead displays dramatic synergistic and redundant interactions. For example, if a speech band is presented along with other bands that are similar in frequency and provide information that is redundant with that band, then the information contributed by that speech band will be low. In contrast, that very same band will contribute heavily to total information if the other bands that are present are complementary in the information they provide. The compound method allows these interactions to occur and, over multiple trials, accounts for their influence.

A potentially interesting consequence of the compound method is an ability to observe structure, sometimes highly detailed 'microstructure,' within a band-importance function. The ANSI method does not provide this degree of detail. Healy et al. (2013) suggested two possible influences on these detailed band-importance functions: "speech-material" and "talker" effects. In the first possibility, importance function shape and/or structure is affected by the type of speech material employed, including the particular phonetic and semantic composition of sentences or words. In the latter, importance function shape and/or structure is affected by the particular individual talker, including the specific idiosyncratic acoustic aspects of his or her voice. However, the extent to which the function and its microstructure reflect these influences has yet to be determined.

Much work has gone into developing functions for different speech materials using the ANSI method. These include the CID W-22 word lists (Studebaker and Sherbecoe, 1991), high- and low-context sentences (Bell, Dirks, and Trine, 1992), and 'continuous discourse' (Studebaker et al., 1987). These different functions were created with the assumption that a main factor underlying their shape is the particular linguistic content of the speech (i.e., the speech-material effect). Accordingly, they have typically been derived using a standard (usually male-voice) recording, and are often assumed to be accurate for any recording of that material. For example, Bell, Dirks, and Trine (1992) developed functions for both low- and high-context sentences, anticipating that the higher-context speech would have greater influence from low-frequency bands.

Although the authors did find a small difference in the crossover frequency, they did not find differences between the shapes of the functions for the high- and low-context sentences.

The idea that the phonetic composition of a signal can influence the shape of its importance function led to the development of phonemically balanced lists, such as the CNC lists (Lehiste and Peterson, 1962). It was assumed that the best way to measure the intelligibility of a system was to utilize common-occurrence words that contained the correct ratio of English phonemes found in everyday usage. The use of a restricted set of common words was employed so that articulation testing would not be influenced by 'top-down' factors, and would therefore be more sensitive to the acoustic features of the speech.

There is good reason to believe that the frequency distribution of importance differs across phonemes. Whereas vowels and other voiced sounds have strong cues to their identity in the lower formant-frequency regions, fricative and sibilant sounds, as well as consonant-release bursts, have identifying features at much higher frequencies. In contrast, these influences should be substantially diminished once the set of different phonemes becomes sufficiently large. This should be especially true if the distribution of these speech items reasonably represents the balance in everyday speech. Accordingly, it is reasonable that band-importance functions are different for different sets of isolated speech sounds (e.g., vowels versus consonants; see Apoux and Healy, 2012). But it also seems reasonable to assume that the difference in importance-function shape attributable to speech material should be considerably diminished or even absent once a reasonably large array of different phonemes is included in the corpus.

With regard to a potential talker effect, it is well known that the acoustic manifestation of speech depends largely on vocal fold and tract size and therefore varies across individuals (Peterson and Barney, 1952). One might expect the acoustic characteristics of an individual's voice to play a substantial role in the particular frequencies that are most important for understanding his or her speech. The SII asserts that a function for a certain speech material is appropriate for any recording of that speech material, regardless of the fact that it was developed from a single-talker production. This conclusion seems to assume that the speech-material effect dominates and that the talker effect is absent. But it also may reflect the fact that the ANSI technique involves smoothing of the obtained intelligibility functions, often simply by eye, and may not be detailed enough to reveal the impact of the individual talker. This contrasts with the highly detailed band-importance functions generated using the compound method. The level of detail provided by the compound method both necessitates and allows an investigation of these potential influences.

If the speech-material influence is of greater importance, as has been traditionally suggested, then it is essential that band-importance functions be established for each specific type of speech corpus. However, if the talker effect is of greater impact, then the band-importance function created using one talker may be unable to generalize to recordings from other talkers, even of the same speech corpus. A solution to a strong talker effect would involve the creation of band-importance functions based on materials spoken by numerous talkers. This clearly articulated intention is present in early writings ("To obtain a desirable precision in the measurement of articulation, it is advisable to use at least five different voices…," Fletcher and Steinberg, 1929; Fletcher and Galt, 1950; both in Fletcher, 1995, pp. 278-279). However, it is not in practice today.

Here, the contributions of specific speech-material and talker effects to the shape of speech band-importance functions are examined. In Experiment 1, the use of a single talker versus multiple talkers to derive a function was assessed. In Experiment 2, the influence of three different talkers all producing sentence materials was assessed. In Experiment 3, the effect of a single talker speaking two different types of speech materials was assessed.

II. EXPERIMENT 1. SINGLE VS. TEN-TALKER SENTENCES

A. Method

1. Subjects

Sixty normal-hearing listeners between the ages of 19 and 37 (mean=21.8) years participated in this experiment. Fifty-five were female. The listeners were recruited from courses at The Ohio State University and received course credit or a monetary incentive.

All had pure-tone audiometric thresholds at or below 20 dB HL at octave frequencies from 250 to 8000 Hz (ANSI, 2004, 2010). None had previous exposure to the sentence materials utilized in this study.

2. Stimuli and procedure

The speech materials were sentences from the IEEE database (IEEE, 1969). The original 22 kHz, 16-bit recordings spoken by 10 different talkers judged to have a general American dialect (5 male, 5 female) were used. The corpus contains 720 sentences, and each sentence contains five key words for scoring. For the single-talker condition, one male talker was chosen from the ten. This particular talker was chosen due to his clarity of articulation.

The stimuli were filtered into the 21 critical-band divisions specified in the SII (see Table 5.1). The steep filtering technique of Healy et al. (2013) was employed. Filter orders were chosen such that the filter slope for each band was held relatively constant, with slopes at or exceeding 1000 dB/octave. This steep filtering allowed for a high degree of acoustic band independence, as band overlap was minimized. The stimuli were filtered forward and backward in direction so that no phase distortion was introduced. This processing was performed primarily in MATLAB.
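
A minimal sketch of this zero-phase bandpass step, assuming the Signal Processing Toolbox functions fir1 and filtfilt; the order shown is illustrative rather than the value actually used:

    % Steep FIR bandpass with forward-backward (zero-phase) filtering
    fs = 22050;                          % Hz; the IEEE recordings were 22 kHz
    x  = randn(2*fs, 1);                 % placeholder for a sentence waveform
    edges = [1270 1480];                 % band 10 limits from Table 5.1
    order = 4000;                        % high order yields very steep skirts
    b = fir1(order, edges / (fs/2));     % linear-phase FIR bandpass design
    y = filtfilt(b, 1, x);               % forward and backward: no phase distortion

Note that filtering twice (forward and backward) doubles the stopband attenuation achieved by the designed filter.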


Table 5.1 Band divisions employed for all functions, as given by the SII.

Band   Center Frequency (Hz)   Band Limits (Hz)
1      150                     100-200
2      250                     200-300
3      350                     300-400
4      450                     400-510
5      570                     510-630
6      700                     630-770
7      840                     770-920
8      1000                    920-1080
9      1170                    1080-1270
10     1370                    1270-1480
11     1600                    1480-1720
12     1850                    1720-2000
13     2150                    2000-2320
14     2500                    2320-2700
15     2900                    2700-3150
16     3400                    3150-3700
17     4000                    3700-4400
18     4800                    4400-5300
19     5800                    5300-6400
20     7000                    6400-7700
21     8500                    7700-9500

Subjects were randomly divided into three separate groups. Each group heard a subset of the 21 target-band conditions. The first group was assigned target bands 1-7, the second group was assigned target bands 8-14, and the third group was assigned target bands 15-21. The target band was always presented with four other bands. The frequency location of the other bands was selected randomly for each trial. The number of bands was determined during pilot testing to ensure intelligibility scores that avoided floor and ceiling effects. Trials were paired such that in one trial the target band was present along with the four other randomly distributed bands, and in the other trial the same four other bands were presented without the target band. This allowed the importance of the target band to be assessed more globally, and in the presence of any combination of other spectral bands. Please see Apoux and Healy (2012) and Healy et al. (2013) for a more detailed discussion of the technique.

There were 14 conditions heard by each listener (7 target bands x 2 numbers of talkers). Subjects heard 20 sentences in each of these conditions for a total of 280 sentences (IEEE sentences 1-200 and 501-580). Half of these 20 sentences in each condition were target-band present and half were target-band absent. To obtain the 10 sentences in the multi-talker conditions, one sentence was presented from each of the ten talkers in random order. The order of conditions was as follows: half of the subjects heard the entire single-talker context first, followed by the multi-talker context, and the other half heard the reverse order. Target-band conditions were blocked so that all sentences in one target-band condition were completed before moving on to the next. Trials were paired and randomized such that a band-present trial and band-absent trial with the same 'other' bands were contiguous. The order in which target-band conditions appeared and the condition-to-sentence correspondence were randomized for each subject.
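
To make the pairing concrete, one band-present/band-absent trial pair could be constructed as follows (the names and the randperm-based sampling are assumptions for illustration, not the laboratory's actual script):

    % One paired trial: target band plus 4 random 'other' bands, then the
    % same 4 'other' bands without the target (21 SII bands in total)
    target = 10;
    pool = setdiff(1:21, target);             % candidate 'other' bands
    others = pool(randperm(numel(pool), 4));  % choose 4 at random
    bandsPresent = sort([target, others]);    % band-present stimulus
    bandsAbsent  = sort(others);              % matched band-absent stimulus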

Broadband sentences were set to play back at 70 dBA at each earphone. The level was set using a flat plate coupler (Larson Davis AEC 101) and an ANSI Class 1 sound level meter (Larson Davis 824). The relative spectrum level of each band was maintained. The speech stimuli were converted to analog form using a PC and Echo Gina 3G D/A converters. Stimuli were presented diotically via Sennheiser HD 280 circumaural headphones.

Testing was performed in a double-walled IAC sound booth. A brief familiarization was conducted in which 20 multiple-talker sentences not used for formal testing (IEEE 701-720) were presented. The first five sentences were presented broadband, followed by five sentences consisting of 11 randomly selected bands, and finally ten sentences consisting of four randomly selected bands. Component bands and the individual talker were randomly selected for each listener and trial. Subjects responded after each trial by repeating as much of the sentence as possible to the experimenter, and were given correct/incorrect feedback during familiarization but not during formal testing.

Subsequent to familiarization, subjects heard the 14 blocks of 20 sentences. The experimenter recorded on a computer interface which, if any, of the five key words the subject correctly repeated in each trial. Presentation of stimuli and collection of responses were performed using custom MATLAB scripts running on a PC. The total duration of testing was approximately one hour and subjects were required to take a break halfway through the experiment.

B. Results

The average intelligibility score for the single-talker conditions for band present was 62.7% (st. dev. = 4.5) and for band absent was 46.8% (st. dev. = 3.1). The average intelligibility score for the multiple-talker conditions for band present was 53.8% (st. dev. = 4.2) and for band absent was 37.8% (st. dev. = 3.6). The importance of each target band was calculated according to the method of Apoux and Healy (2012). First, the band-absent and band-present scores were averaged across subjects. Then, the band-absent score was subtracted from the band-present score to create a mean difference score for each target band. The difference scores were then normalized by dividing each by the sum of the band difference scores for that talker condition.
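
A minimal sketch of this calculation, assuming per-band scores have already been averaged across subjects (the vectors shown are illustrative, not the measured data):

    % Band importance from paired scores (Apoux and Healy, 2012): difference
    % scores, normalized to sum to 1 within a talker condition
    presentMean = [62.7 60.1 58.3];                % illustrative mean % correct
    absentMean  = [46.8 47.0 45.2];
    diffScores  = presentMean - absentMean;        % contribution of each target band
    importance  = diffScores ./ sum(diffScores)    % normalized band importance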

Figure 5.1 shows band importance for the single-talker and ten-talker IEEE stimuli. In general, the function for the ten-talker stimuli is smoother than the function for the single-talker stimuli, especially in the upper half of the spectrum. In addition, there is a slight up-shift in the frequencies of greatest importance in the ten-talker function, perhaps reflecting the inclusion of female talkers having higher formant frequencies.

Fig 5.1. Band-importance functions created using IEEE sentences. The closed symbols show functions for a single male talker and the open symbols show functions for 10 talkers (half male).


To assess the relative smoothness of the one- versus ten-talker functions, a Gaussian stochastic process was used (Santner et al., 2003). This model fits a curve across the frequency bands and computes the point-to-point correlation across subsequent bands for each function. The scale parameter theta was used to indicate the overall smoothness of each band-importance function. A smaller scale parameter indicates a stronger correlation between neighboring points, which in turn indicates a smoother function. The Gaussian correlation function is given below in Equation (1).

R(x_i - x_j) = \exp\left[ -\sum_k \theta_k \, (x_{ik} - x_{jk})^2 \right]    (1)

The estimated scale parameter for the single-talker condition (θ = .659) was larger than that for the multi-talker condition (θ = .493), indicating that the multi-talker function is smoother than the single-talker function.
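
For reference, Equation (1) can be rendered as a one-line function; the example call is purely illustrative:

    % Gaussian correlation of Eq. (1) between points xi and xj with scale theta;
    % a smaller theta gives correlations nearer 1, i.e., a smoother fitted function
    R = @(xi, xj, theta) exp(-sum(theta .* (xi - xj).^2));
    R(1, 2, 0.493)    % correlation between adjacent bands at the ten-talker scale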

III. EXPERIMENT 2. DIFFERENT TALKERS

A. Method

1. Subjects

Sixty normal-hearing listeners between the ages of 19 and 31 (mean=20.9) years participated in this experiment. Fifty-eight were female. None had participated in Experiment 1. The recruitment procedures, incentives, hearing criteria, and previous exposure characteristics were all the same as in Experiment 1.

2. Stimuli and procedure

The speech materials used here were from the revised version of the Speech Perception in Noise test (SPIN; Kalikow, Stevens, and Elliott, 1977). The test consists of 200 key words, each positioned as the final word in both a high- and a low-predictability context sentence. For the purposes of the current study, a recording was created in the laboratory using a male speaker judged to have a general American dialect. The recordings were made in a double-walled IAC sound booth using a condenser microphone having a flat frequency response (AKG C2000B) and a commercial windscreen. The microphone was preamplified (Mackie 1202-VLZ) and digitally recorded (Echo Gina 3G) at 44.1 kHz and 16-bit resolution. The talker sat 12 inches from the microphone and read the list of sentences twice. Recordings were monitored to ensure adequate digital gain and that no peak clipping occurred. A single production was selected for each sentence, and the level of each sentence was equated within 1 dB.

A band-importance function was created for these materials using band divisions identical to those of Experiment 1. The processing of stimuli and the experimental procedure were identical to those for the same speech materials in Healy et al. (2013). FIR filters with orders ranging from 2000 (for the highest-frequency band) to 20000 (for the lowest-frequency band) were employed in MATLAB. These orders were chosen to produce approximately equal filter slopes (8000 dB/oct) across bands. Filter slopes were slightly decreased for bands below 500 Hz due to limitations in filtering in the low region; however, they remained at or above 1000 dB/oct. Subsequent to filtering, the group delays introduced by the different filter orders were corrected with sample-point accuracy so that all bands were presented in synchrony.
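
A sketch of this correction under the assumption that each band used a single pass through a linear-phase FIR filter: such a filter of order N delays its output by exactly N/2 samples, so alignment reduces to removing a fixed shift (names and values are illustrative):

    % Compensate the N/2-sample group delay of a linear-phase FIR so that bands
    % filtered with different orders remain time-aligned
    fs = 44100;                              % Hz; SPIN recordings were 44.1 kHz
    x  = randn(2*fs, 1);                     % placeholder for a sentence waveform
    N  = 20000;                              % filter order for the lowest band
    b  = fir1(N, [100 200] / (fs/2));        % band 1 limits from Table 5.1
    y  = filter(b, 1, [x; zeros(N/2, 1)]);   % zero-pad, then filter
    y  = y(N/2 + 1 : end);                   % discard the N/2-sample delay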

As in Experiment 1, subjects were randomly assigned to one of three groups: target bands 1-7, 8-14, or 15-21. Half of the trials were ‘band present,’ in which the target band was presented along with four other randomly distributed bands, and the other half were ‘band absent,’ in which only the four other bands were presented. Trials were paired and randomized such that a band-present trial and band-absent trial with the same ‘other’ bands were contiguous.

Subjects heard 56 sentences in each of the seven target-band conditions for a total of 392 sentences. Half of the 56 were high-predictability and half were low-predictability, and for each paired band-present/band-absent trial, the predictability was the same. As in Experiment 1, the order in which target-band conditions appeared, the order of presentation of band-present versus band-absent conditions, and the condition-to-sentence correspondence within each condition were all randomized for each subject.

Broadband sentences were set to play back at 70 dBA at each earphone. The level was set using a flat plate coupler (Larson Davis AEC 101) and ANSI Class 1 sound level meter (Larson Davis 824). The relative spectrum level of each band was maintained. The speech stimuli were converted to analog form using a PC and Echo Gina 3G D/A converters, and stimuli were presented diotically via Sennheiser HD 280 circumaural headphones.

Testing was performed in a double-walled IAC sound booth. A brief familiarization consisted of eight sentences not used for formal testing. The sentences were presented first broadband, and then the same sentences were repeated as five spectral bands, randomly selected from trial to trial. Subsequent to familiarization, subjects heard the seven blocks of test sentences. Subjects responded after each trial by typing the final word of the sentence on a custom MATLAB computer interface, and correct/incorrect feedback was given for familiarization only. The decision was made to have subjects type their own responses during this experiment, because only the final word was reported and scored, in accord with the established SPIN format. The total duration of testing was approximately 2 hours with breaks offered after every block.

B. Results

The average intelligibility score for band present was 65.5% (st. dev. = 4.5) and for band absent was 54.6% (st. dev. = 2.6). The importance of each band was calculated according to the method used in Experiment 1. Figure 5.2 shows the band-importance function created for the novel SPIN talker in the current experiment. Also plotted are the IEEE single-talker function from Experiment 1 and the function representing the standard male-talker SPIN sentences from Healy et al. (2013).


Fig 5.2. Band-importance functions created using sentence materials and three different individual male talkers.

Of interest is the considerable similarity between the three functions, despite the use of different talkers. Most notably, all share a peak in importance at approximately 1500 Hz (and by definition have reduced importance on either side), another peak at approximately 2500 Hz, a broad drop in importance across 2500-6000 Hz, and a final peak at approximately 7000 Hz.

IV. EXPERIMENT 3: DIFFERENT MATERIALS

A. Methods

1. Subjects

Sixty normal-hearing listeners between the ages of 19 and 31 (mean=20.8) years participated in this experiment. Fifty-seven were female. One previously participated in Experiment 1 and 28 previously participated in Experiment 2. The recruitment and compensation methods and the audiometric characteristics were the same as in Experiments 1 and 2. None had previous exposure to the word materials utilized in this study.

2. Stimuli and procedure

The speech materials were the phonetically balanced words of the CID W-22 test (Hirsh et al., 1952). The original corpus contains 200 words produced by a male speaker with a general American dialect and set in the carrier phrase, "Say the word ___". For the purposes of the current experiment, a new recording was made using the same talker and procedure as for the SPIN sentences in Experiment 2.

A band-importance function was created for these materials using the filtering procedure and band divisions of Experiment 2. The processing of stimuli and the experimental procedure were identical to those for the same speech materials in Healy et al. (2013). As in the previous experiments, subjects were randomly assigned to target bands 1-7, 8-14, or 15-21. Each subject thus heard seven target-band blocks presented in random order, with 26 words presented in each block. Half (thirteen) of the trials were 'band present,' in which the target band was presented along with four other randomly distributed bands, and the other half were 'band absent,' in which only the four other bands were presented. Trials were paired and randomized such that a band-present trial and band-absent trial with the same 'other' bands were contiguous. The order in which target-band conditions appeared, the presentation order of band-present versus band-absent conditions, and the condition-to-word correspondence within each condition were all randomized for each subject.

Prior to testing, a familiarization occurred in which subjects heard 15 words not heard during the test. They first heard the 15 words broadband and then the same words as five bands randomly distributed in frequency. Subjects typed responses into a custom MATLAB interface and received feedback on response accuracy for the familiarization stage only. Due to the open-set nature of monosyllabic word identification, homophones of the target word were accepted for this experiment.

Testing was performed in a double-walled IAC sound booth. Broadband words were set to play back at 70 dBA at each earphone. The level was set using a flat plate coupler (Larson Davis AEC 101) and an ANSI Class 1 sound level meter (Larson Davis 824). The relative spectrum level of each band was maintained. The speech stimuli were converted to analog form using a PC and Echo Gina 3G D/A converters, and stimuli were presented diotically via Sennheiser HD 280 circumaural headphones. Duration of testing was approximately 1 hour.

B. Results

The average intelligibility score for band present was 57.6% (st. dev. = 4.9) and for band absent was 44.8% (st. dev. = 3.6). The importance of each band was calculated according to the method used in Experiments 1 and 2. Figure 5.3 shows band importance for the CID W-22 words spoken by the novel talker, as well as the function representing the SPIN sentences produced by the same talker from Experiment 2. Of note are the considerable differences between the functions. Specifically, the peak in importance present at approximately 700 Hz for the words is absent for the sentences; a smaller peak at approximately 1000 Hz exists for words, whereas sentences display a valley; and a peak at approximately 1500 Hz for sentences is absent for words. Substantial differences in importance exist across materials in the 2000- to 4000-Hz region, and opposite peaks and valleys exist across the two speech materials from approximately 5000 to 8500 Hz. Accordingly, the band of greatest importance for the CID W-22 words had a center frequency of 700 Hz, whereas the band of greatest importance for the SPIN sentences had a center frequency of 1600 Hz.

Fig 5.3. Band-importance functions created using CID W-22 words (open symbols) and SPIN sentences (closed symbols), both spoken by the same talker.

V. DISCUSSION

The compound method differs in numerous ways from traditional techniques for evaluating band importance, and therefore allows for a more thorough examination of factors underlying the resulting functions. The current method accounts for the multiplicity of interactions across bands, and reveals functions that are much more detailed and structured than those resulting from the traditional method. The current method is also more efficient, which makes it well suited to testing band importance with a multiple-talker stimulus set. These characteristics facilitate the current evaluation of potential influences on band importance.

The functions derived currently using a single talker display a considerable amount of microstructure, in accord with previous examinations (Apoux and Healy, 2012; Healy et al., 2013). In contrast, the function for the ten-talker IEEE stimuli is smoother and less jagged. The bands of highest importance are also slightly shifted up in frequency in the ten-talker function, perhaps reflecting the inclusion of female voices in the stimulus set. There are two possible interpretations for the difference in smoothness between functions. In the first, the more pronounced peaks and valleys of the single-talker function reflect particular characteristics or features of that individual's voice, and the inclusion of multiple talkers averages out those idiosyncrasies across talkers. In the second interpretation, the variability in talker from trial to trial resulted in listeners monitoring frequencies more broadly, and placing less emphasis on any individual band. Data from Assgari and Stilp (2015) suggest that listeners are less sensitive to modest differences in spectral peaks when forced to continually recalibrate to new talkers. There are also differences in higher-level processing of single- versus multiple-talker sentence lists, as exemplified by the improved intelligibility of novel utterances spoken by a familiar versus a less familiar talker (Nygaard et al., 1994) and an increased demand on working memory in a multiple-talker context (Martin et al., 1989).

To specifically examine speech-material and talker effects on band importance, additional band-importance functions were created having same or different speech-material types, and same or different talkers. Evident from Fig. 5.2 is a high degree of similarity between the three functions reflecting the same speech-material type, but different talkers. This suggests that there is a particular characteristic shape to importance functions for sentences, regardless of the voice or recording. Further evidence that the effect is one of speech-material type, and not specific to the particular speech corpus, comes from the fact that this similarity in functions was observed despite the use of two different sentence corpora (IEEE and SPIN).

One caveat is that a slight transposition can be observed in one of the three major peaks across these functions. This is the peak of maximum importance, around 1500 Hz, which approximately corresponds to the average second formant of speech (Hillenbrand et al., 1995). This shift could potentially reflect a difference in the acoustics of the talkers, particularly since it is observed in functions for identical materials, indicating that a slight effect of talker cannot be entirely excluded. This small effect is in accord with the small upward transposition in frequency of importance for male and female voices, relative to only male voices (Fig. 5.1). However, it remains that the three functions in Fig. 5.2 are highly similar in shape, despite the use of different talkers, suggesting a primary effect of speech-material type.

Additional evidence supporting this speech-material effect is a considerable dissimilarity in the functions for the same talker but different materials. As Fig. 5.3 shows, the two functions for the same individual talker producing different speech materials (sentences versus monosyllables) vary considerably in the location of excursions. Indeed, for multiple regions of the spectrum the locations of peaks and valleys seem to be in direct opposition. In addition, the location of greatest importance is approximately one octave higher for the SPIN sentences than for the CID W-22 words. These results indicate that the functions are not simply reflecting characteristics of the speaker's acoustics, but rather depend primarily on the speech material under evaluation.

One possible explanation for this observed difference for functions from the same talker is the contribution of top-down processing. Early articulation-testing work (Miller et al., 1951) showed large differences in the articulation functions for sentences, nonsense syllables, and digits. The authors attributed these differences to the number of alternatives to the target in the testing set. In other words, stimuli such as digits and sentences have a restricted number of possible answers: digits due to their limited number of examples, and sentences due to their grammatical structure and contextual constraints. Monosyllables or nonsense syllables, on the other hand, have a larger number of alternatives, therefore requiring more bottom-up acoustic information to be correctly identified. These differences in the reliance on acoustic properties to identify a speech target could be at least in part responsible for the differences observed in Experiment 3.

The results from the current study are in accord with a result from Healy et al. (2013), in which differences were observed in the functions for the low- versus high-predictability subsets of the SPIN sentences, a "speech-material" effect. Despite the fact that both sets of sentences arose from the same talker and recording, and were created using the same group of subjects within the same experimental session, there were deviations in the peaks of the functions indicating an effect of the difference in contextual information between them. This differs from the findings of Bell, Dirks, and Trine (1992), who used the ANSI technique and found no difference in the overall shape of functions for low- and high-context SPIN sentences. It is likely that this discrepancy between studies can be attributed to the use of different measuring techniques, and either the higher level of detail available in the compound-method functions, or the accounting for band interactions in this method.

It may be considered somewhat surprising that a speech-material effect dominates the shape of the band-importance function, and that the talker effect is slight, if present. Although this is a common modern assumption when the less-detailed ANSI technique is employed, recall that early formulators of the band-importance concept stressed the need for multiple talkers. Further, both the sentences and words employed currently contain considerable phonetic diversity. As was described in Section I, one might reasonably assume that speech band-importance functions, which are different for different sets of isolated phonemes, will become more highly similar once a threshold amount of phonetic diversity is achieved. However, the current results suggest that this assumption does not hold.

Taken together, these findings suggest the need for different functions for different speech materials, but not for every individual sentence corpus. Further, the ability to generalize from one talker to others using the compound technique appears to be relatively strong. However, the use of multiple talkers appears to smooth the functions slightly.


ACKNOWLEDGEMENTS

This work was drawn from a dissertation submitted to The Ohio State University Graduate School by the first author, under the direction of the last author. It was supported in part by the National Institute on Deafness and Other Communication Disorders (R01 DC08594 to EWH). We are grateful for the data-analysis and manuscript-preparation assistance of Jordan Vasko and Brittney Carter.

REFERENCES

American National Standard Inst. (1969). ANSI S3.5. American National Standard Methods for the Calculation of the Articulation Index (American National Standards Inst., New York).

American National Standard Inst. (1997). ANSI S3.5 (R2007). American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

American National Standard Inst. (2004). ANSI S3.21 (R2009). American National Standard Methods for Manual Pure-Tone Threshold Audiometry (American National Standards Inst., New York).

American National Standard Inst. (2010). ANSI S3.6-2010. American National Standard Specification for Audiometers (American National Standards Inst., New York).

Apoux, F., and Bacon, S. P. (2004). "Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise," J. Acoust. Soc. Am. 116, 1671-1680.

Apoux, F., and Healy, E. W. (2012). "Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants," J. Acoust. Soc. Am. 132, 1078-1087.

Assgari, A. A., and Stilp, C. E. (2015). "Talker information influences spectral contrast effects in speech categorization," J. Acoust. Soc. Am. 138, 3023-3032.

Baer, T., and Moore, B. C. (1993). "Effects of spectral smearing on the intelligibility of sentences in noise," J. Acoust. Soc. Am. 94, 1229-1241.

Bell, T. S., Dirks, D. D., and Trine, T. D. (1992). "Frequency-importance functions for words in high- and low-context sentences," J. Speech Hear. Res. 35, 950-959.

Bench, J., Kowal, A., and Bamford, J. (1979). "The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children," Brit. J. Audiol. 13, 108-112.

Bilger, R. C., Nuetzel, J. M., Rabinowitz, W. M., and Rzeczkowski, C. (1984). "Standardization of a test of speech perception in noise," J. Speech Hear. Res. 27, 32-48.

Breeuwer, M., and Plomp, R. (1984). "Speechreading supplemented with frequency-selective sound-pressure information," J. Acoust. Soc. Am. 76, 686-691.

DePaolis, R. A. (1992). The Intelligibility of Words, Sentences, and Continuous Discourse Using the Articulation Index (Report No. TR-92-04). Applied Research Laboratory, The Pennsylvania State University, University Park.

Fletcher, H. (1995). Speech and Hearing in Communication, J. B. Allen, Ed. (Acoustical Society of America, Woodbury, New York), pp. 278-279.

Fletcher, H., and Galt, R. H. (1950). “The perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89-151.

French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119.

Gaudrain, E., Grimault, N., Healy, E. W., and Béra, J. C. (2007). "Effect of spectral smearing on the perceptual segregation of vowel sequences," Hear. Res. 231, 32-41.

Healy, E. W. (1998). "A minimum spectral contrast rule for speech recognition: Intelligibility based upon contrasting pairs of narrow-band amplitude patterns" [Ph.D. dissertation]. The University of Wisconsin - Milwaukee; Available from: http://www.proquest.com/; Publication Number: AAT 9908202, pp. 56-73.

Healy, E. W., and Bacon, S. P. (2007). “Effect of spectral frequency range and separation on the perception of asynchronous speech,” J. Acoust. Soc. Am. 121, 1691-1700.

Healy, E. W., and Warren, R. M. (2003). “The role of contrasting temporal amplitude patterns in the perception of speech,” J. Acoust. Soc. Am. 113, 1676-1688.

Healy, E. W., Yoho, S. E., and Apoux, F. (2013). "Band importance for sentences and words reexamined," J. Acoust. Soc. Am. 133, 463-473.

Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97, 3099-3111.

Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., and Benson, R. W. (1952). "Development of materials for speech audiometry," J. Speech Hear. Disord. 17, 321-337.

IEEE (1969). "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust. 17, 225-246.

Kalikow, D. N., Stevens, K. N., and Elliott, L.L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337-1351.

Lippmann, R. P. (1996). “Accurate consonant perception without mid-frequency speech energy,” IEEE Trans. Speech Audio. Process. 4, 66-69.

Miller, G. A., Heise, G. A., and Lichten, W. (1951). "The intelligibility of speech as a function of the context of the test materials," J. Exp. Psychol. 41, 329-335.

Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). "Some effects of talker variability on spoken word recognition," J. Acoust. Soc. Am. 85, 365-378.

Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1994). "Speech perception as a talker-contingent process," Psych. Science 5, 42-46.

Peterson, G. E., and Barney, H. L. (1952). "Control methods used in a study of the vowels," J. Acoust. Soc. Am. 24, 175-184.

Peterson, G. E., and Lehiste, I. (1962). "Revised CNC lists for auditory tests," J. Speech Hear. Disord. 27, 62-70.

Pollack, I. (1948). "Effects of high-pass and low-pass filtering on the intelligibility of speech in noise," J. Acoust. Soc. Am. 20, 259-266.

Santner, T. J., Williams, B. J., and Notz, W. (2003). The Design and Analysis of Computer Experiments (Springer, New York).

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81, 1130-1138.

Studebaker, G. A., and Sherbecoe, R. L. (1991). "Frequency-importance and transfer functions for recorded CID W-22 word lists," J. Speech Hear. Res. 34, 427-438.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1992). "Effect of spectral envelope smearing on speech reception. I," J. Acoust. Soc. Am. 91, 2872-2880.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1993). "Effect of spectral envelope smearing on speech reception. II," J. Acoust. Soc. Am. 93, 1547-1552.

Warren, R. M., and Bashford, J. A., Jr. (1999). "Intelligibility of 1/3-octave speech: Greater contribution of frequencies outside than inside the nominal passband," J. Acoust. Soc. Am. 106, L47-L52.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2004). "Intelligibility of bandpass filtered speech: Steepness of slopes required to eliminate transition band contributions," J. Acoust. Soc. Am. 115, 1292-1295.

Warren, R. M., Riener, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). "Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits," Percept. Psychophys. 57, 175-182.


CHAPTER 6: MANUSCRIPT 3

Noise Susceptibility of Speech Critical Bands

Sarah E. Yoho, Frederic Apoux, and Eric W. Healy

Department of Speech and Hearing Science

The Ohio State University

Columbus, OH 43210


Abstract

The particular impact of noise on critical bands of speech is assessed here. Many current techniques for measuring the relative importance of various bands of speech do so in the presence of background noise. However, there are reasons to believe that the importance of a band and the way in which noise corrupts its contribution to speech intelligibility may be different factors. It is argued here that this use of background noise to degrade performance in band-importance methods may confound band importance with susceptibility to noise. To test this, a modified version of the compound method [F. Apoux and E. W. Healy (2012), J. Acoust. Soc. Am. 132, 1078-1087] was employed to evaluate the noise susceptibility of critical bands of speech. A Gaussian noise was added to the target speech band at various signal-to-noise ratios to determine the amount of noise required to reduce the importance of that band by 50%. Results indicate that band susceptibility to noise varies greatly across bands, in sharp contrast to current assumption, and that no obviously systematic relationship exists between band importance and noise susceptibility.

I. INTRODUCTION

The relative contributions of various bands of speech to overall intelligibility have been studied extensively and are reflected in the band-importance functions that make up the ANSI standard Speech Intelligibility Index (SII; ANSI, 1997). However, this work is generally limited in its ability to assess band importance independently from the detrimental influence of background noise on bands of speech. In fact, the standard technique for assessing band importance (Studebaker, Pavlovic, and Sherbecoe, 1987; Studebaker and Sherbecoe, 1991) requires the use of background noise to reduce ceiling effects during intelligibility testing. There is reason to believe that the relative susceptibility to noise of various frequency bands of speech varies across the spectrum independently of each band's contribution to overall intelligibility, and thus the two factors may be confounded unless assessed separately.

The masking effects of steady-state white or speech-shaped noise on speech perception are often assumed to be stable and consistent. French and Steinberg (1947) describe the amount of masking from individual bands of noise as being dependent solely on the signal-to-noise ratio (SNR) and the threshold of that band in quiet. However, the detection of noise is much more straightforward than the reception and use of speech bands, which are influenced by, among other factors, their interactions with other bands in the spectrum and the particular acoustic and linguistic content of each band. Further, Zaar and Dau (2015) found large variability in consonant-identification tasks as an effect of the different randomly generated Gaussian noise maskers used. This result stands in stark contrast with the assumption of previous work that consonant identification in steady-state masking noise is invariant to the particular sample of white noise employed (e.g., Miller and Nicely, 1955).

The idea that different regions of the speech spectrum may be differentially susceptible to noise is supported by the general fact that acoustic cues vary greatly across frequency, and these cues may be more or less susceptible to noise. French and Steinberg (1947) showed that as the SNR is decreased, the lower-frequency portions of speech are the last to be masked by white noise. In terms of the Articulation Index (ANSI, 1969), the 'crossover' frequency (the frequency dividing the spectrum into two equally intelligible halves) is dependent on SNR, as Webster and Klumpp (1963) found this frequency to decrease by as much as one octave at 5 dB SNR from the value in quiet observed by French and Steinberg (1947).

A common assumption, one that Miller (1947) stated nearly seven decades ago, is that noise matching the long-term average spectrum of speech is the most effective masker of speech. However, this may not be strictly true, as evidence exists to indicate that different regions of the speech spectrum are differentially affected by noise. Apoux and Bacon (2004) examined the relative importance of envelope information across the spectrum, both in quiet and in noise. They employed two different band-importance methods, the 'correlational method' (Doherty and Turner, 1996) and the 'hole method' (Shannon et al., 2002; Kasturi et al., 2002), to evaluate the relative contribution of the envelope information across four vocoded bands of speech. Apoux and Bacon found that in quiet, all four vocoded bands contributed equally to intelligibility, whereas in noise, the highest-frequency band increased in importance. This difference between the quiet and noise conditions could not be attributed to the masking of information present in the filter transition bands, and was instead attributed to the difference in modulation depth across the spectrum. More specifically, the fact that modulation depth was substantially lower in band 3 may have forced listeners to rely more heavily on band 4 when noise was introduced, as the noise obscured ('filled in') the shallower modulations in band 3.

Fogerty (2011) also provided evidence that different cues in the speech signal contribute differently to intelligibility depending on listening conditions and the array of cues available to the listener. Subjects were presented with sentences filtered into three broad frequency bands. It was found that when individual acoustic speech cues, such as envelope or temporal fine structure, were presented independently from one another, normal-hearing listeners weighted each cue equally. However, when the cues were presented simultaneously, listeners weighted envelope cues and mid-frequency information more heavily. Thus, the availability of certain speech cues, which may be influenced by the particular characteristics of a masker, could influence which aspects of the speech signal contribute the most to intelligibility.

To our knowledge there has been no direct evaluation of the particular noise susceptibility of individual bands of speech. In addition, many methods to estimate the relative importance of various bands of speech confound susceptibility to noise with importance. For example, the correlational method of Doherty and Turner (1996) directly measures importance by assessing the impact of noise. In this method, it is assumed that a band is important if it is resistant to the influence of noise.

A recently developed technique to derive band importance, the 'compound method' (Apoux and Healy, 2012; Manuscript 1; Manuscript 2), does not require the use of background noise, and therefore the resulting band-importance functions are not limited or confounded by any potential effects of noise across the spectrum. In general, the method consists of two speech intelligibility measurements, one in which each band of interest is presented with n other randomly-distributed bands (band present), and another in which the n other randomly-distributed bands are presented without the band of interest (band absent). The importance of a given band is then determined by its band-present minus band-absent score, relative to the sum of all other bands' scores.
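
To make this computation concrete, the following MATLAB sketch derives importance values from hypothetical band-present and band-absent scores (the numbers and variable names are illustrative only):

    % Hypothetical mean scores (proportion of keywords correct), one per band
    bandPresent = [0.42 0.55 0.61 0.58 0.66];
    bandAbsent  = [0.35 0.40 0.43 0.44 0.47];

    % A band's importance is its band-present minus band-absent difference,
    % normalized by the sum of these differences across all bands
    diffScore  = bandPresent - bandAbsent;
    importance = diffScore / sum(diffScore);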

The current study used the compound method to examine the particular noise susceptibility of 10 critical bands of speech, as specified in the SII. Specifically, the importance of each band was evaluated under various levels of noise. The amount of noise required to reduce that band's contribution halfway from band present in quiet to band absent in quiet was determined. This amount of noise required to disrupt the contribution of each individual band was then compared across bands. For example, if the contribution of band x is entirely eliminated by a certain amount of noise, but the contribution of band y is eliminated by a smaller amount of noise, then band y has a greater susceptibility to noise than band x. The use of the halfway point between band-present and band-absent scores provides a sensitive measure of this relative susceptibility.

II. METHOD

A. Subjects

Twenty normal-hearing listeners between the ages of 19 and 33 yrs (mean=21.7 yrs) participated in this experiment. Sixteen listeners were female. The listeners were recruited from courses at The Ohio State University and received either course credit or a monetary incentive for participation. All had pure tone audiometric thresholds at or below 20 dB HL at octave frequencies from 250 to 8000 Hz (ANSI, 2004, 2010). None had previous exposure to the sentence materials utilized in this study.

B. Stimuli

The speech materials were sentences from the IEEE database (IEEE, 1969). The corpus contains 720 sentences, and each sentence contains five key words for scoring. The original 22-kHz, 16-bit recordings spoken by 10 different talkers judged to have General American dialects (5 male and 5 female) were used.

The stimuli were filtered into the 21 critical band divisions specified in the SII (see Table 6.1). Filter orders were chosen to approximately equate slopes across bands in dB/oct and to ensure minimal acoustic band overlap. This was accomplished through the use of high-order FIR filters (orders 1000-10000), which preserved the amplitude and phase within the passbands (see Healy, 1998). These filters produced slopes exceeding 1000 dB/oct. The stimuli were filtered forward and backward so that no phase distortion was introduced. A band of Gaussian noise was filtered using the same cutoffs and orders as each speech band and was added to each target speech band. No noise was added to the 'other' bands in the compound method (described below). The noise had a 10-ms raised-cosine rise/fall and started at least 300 ms prior to the speech to avoid possible overshoot effects (Bacon and Liu, 2000). This processing was performed in MATLAB.
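
A simplified MATLAB sketch of this style of processing is given below. The sampling rate, filter order, band limits, and variable names are illustrative (the actual orders ranged from 1000 to 10000), and the onset/offset ramps and 300-ms noise lead time are omitted for brevity:

    % Illustrative parameters; 'speech' is assumed to be a column-vector waveform
    fs    = 22050;            % sampling rate (Hz)
    order = 2000;             % FIR filter order
    fLo   = 920;  fHi = 1080; % band limits for band 8 (see Table 6.1)

    % Zero-phase bandpass filtering: filtfilt runs the linear-phase FIR filter
    % forward and backward, removing phase distortion and steepening the slopes
    b = fir1(order, [fLo fHi] / (fs/2));
    speechBand = filtfilt(b, 1, speech);
    noiseBand  = filtfilt(b, 1, randn(size(speech)));

    % Scale the filtered noise to a target SNR relative to the speech band
    snrdB = -4;
    gain  = sqrt(mean(speechBand.^2)) / sqrt(mean(noiseBand.^2)) * 10^(-snrdB/20);
    mixedBand = speechBand + gain * noiseBand;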

Table 6.1 Band divisions for the ten critical bands examined here.

Band    Center Frequency (Hz)    Band Limits (Hz)
2       250                      200-300
4       450                      400-510
6       700                      630-770
8       1000                     920-1080
10      1370                     1270-1480
12      1850                     1720-2000
14      2500                     2320-2700
16      3400                     3150-3700
18      4800                     4400-5300
20      7000                     6400-7700

C. Procedure

1. Band Importance Method

Baseline intelligibility scores in quiet that are used currently for comparison (band-present and band-absent scores for each target band) were drawn from Manuscript 2. These data were determined as follows: Three groups of 20 normal-hearing subjects were each assigned to one of three subsets of target-band conditions (bands 1-7, bands 8-14, or bands 15-21). Stimuli were the IEEE sentences (IEEE, 1969) spoken by five male and five female talkers. These stimuli were filtered into the 21 critical bands given by the SII and shown in Table 6.1. This filtering was performed in MATLAB, and the filtering characteristics were the same as described in the Stimuli section above. For each of the three target-band subsets, subjects heard a total of 140 sentences. For each target-band condition, the band of interest was presented with four other bands having frequency positions randomly determined for each trial and subject ("band present"). Each of these trials was paired with a contiguously presented trial in which the same 'other' bands were presented without the target band ("band absent"). Conditions were blocked by target band and randomized. Sentence-to-condition correspondence and the order of band present/band absent in each paired trial were randomized. Each of the seven target-band conditions included two (one band-present, one band-absent) sentences from each of the ten talkers, for a total of 20 sentences, with the order of talkers randomized within each block. The level of the broadband speech (all 21 bands) was set to 70 dBA. Subjects were instructed to repeat back as much of each sentence as possible to the experimenter, and each sentence had five keywords for scoring. For each target-band condition, a band-present and a band-absent score was calculated and averaged across subjects. These scores in quiet from Manuscript 2 were used for the current noise-susceptibility calculation to determine the point at which the importance of each band was halved, as described in the Results and Discussion section.
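
For concreteness, a minimal MATLAB sketch of a single band-present/band-absent trial pair follows, assuming the 21 filtered band waveforms are stored as equal-length column vectors in a cell array bandSignals (a hypothetical name):

    nBands = 21; target = 8; nOther = 4;

    % Draw four 'other' bands at random, excluding the target band
    pool   = setdiff(1:nBands, target);
    others = pool(randperm(numel(pool), nOther));

    % Band-absent stimulus: the four 'other' bands alone;
    % band-present stimulus: the same four bands plus the target band
    bandAbsentSig  = sum(cat(2, bandSignals{others}), 2);
    bandPresentSig = bandAbsentSig + bandSignals{target};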

2. Noise Susceptibility Method

In the current study, a modified version of the band-importance method was employed to determine noise susceptibility. In this modification, the target band was always 'present,' but was presented at different SNRs, while the 'other' bands were presented in quiet. One group of subjects heard all ten target bands. Due to the limited number of sentences available, and the large number of conditions, a subset of the 21 total critical bands was tested here. The even-numbered bands were chosen in order to cover a wide range of the spectrum, as well as to avoid the inclusion of the lowest band (band 1) due to its very small relative importance (and therefore very small band-present minus band-absent difference). The target band was always presented with four 'other' bands, randomly selected from the full group of 21 bands on a trial-by-trial basis. For each trial, noise was introduced to the target band to achieve one of six SNRs: -12, -8, -4, 0, 4, or 8 dB. Conditions were blocked by target band such that all six SNRs for a target band were heard in random order before moving on to a different target band. In addition, the order of target bands, as well as the sentence-to-condition correspondence, was randomized for each subject.1 A total of 10 sentences was heard per condition, with one sentence from each talker presented in random order. A total of 60 conditions was presented (10 bands × 6 SNRs). Test duration was approximately 2 hrs, with most individuals completing the experiment over two sessions.
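
The blocked, randomized design just described can be sketched in MATLAB as follows (a schematic of the condition ordering only, with presentation details omitted):

    targetBands = 2:2:20;             % the ten even-numbered critical bands
    snrs        = [-12 -8 -4 0 4 8];  % within-band SNR conditions (dB)

    % Band order is randomized per subject, and the six SNR conditions are
    % randomized within each band block (60 conditions in total)
    for band = targetBands(randperm(numel(targetBands)))
        for snrdB = snrs(randperm(numel(snrs)))
            % ... present 10 sentences (one per talker, in random order) ...
        end
    end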

Subjects were seated in a double-walled IAC sound booth with the experimenter. Stimuli were presented diotically via Sennheiser HD 280 circumaural headphones. Broadband sentences (21 summed speech bands) were set to play back at 70 dBA at each earphone. The level was set using a flat-plate coupler (Larson Davis AEC 101) and an ANSI Class 1 sound level meter (Larson Davis 824). The relative spectrum level of each band was maintained. The speech stimuli were converted to analog form using a PC and Echo Gina 3G D/A converters. Processing of stimuli was performed in MATLAB. Subjects were instructed to repeat back as much of each sentence as possible to the experimenter, who recorded responses with the assistance of a custom MATLAB script.

Prior to testing, subjects completed a familiarization session in which they heard 20 sentences spoken by a male and a female talker not heard during testing. Presented were 5 broadband sentences, 5 sentences as 11 randomly selected critical bands, and finally 10 sentences as four randomly selected critical bands. Correct/incorrect feedback was given during this familiarization stage only.

III. RESULTS

Figure 6.1 shows sentence intelligibility in percent correct as a function of target-band SNR for each of the ten target-band conditions. The top and bottom dashed lines are band-present and band-absent scores in quiet, respectively, from Manuscript 2. The dotted line represents the halfway point (arithmetic mean) between the band-present and band-absent scores for that band. In addition, a third-order polynomial regression was computed to fit a curve to the function in each panel. The point at which the regression line intersected the midpoint between the band-present and band-absent lines, as given by Eq. (1), was determined for each curve (values given in Table 6.2). This intersection reflects the SNR at which the contribution of that band to overall speech intelligibility drops halfway from its maximum (band present) to its minimum (band absent).

$\hat{Y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3$   (1)
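
A minimal MATLAB sketch of this fitting and intersection step follows, assuming snr and score are vectors holding the tested SNRs and the mean intelligibility scores for one band, and bandPresentQuiet and bandAbsentQuiet hold that band's reference scores (hypothetical variable names):

    % Fit the third-order polynomial of Eq. (1); polyfit returns [b3 b2 b1 b0]
    p = polyfit(snr, score, 3);

    % Midpoint between band-present and band-absent scores in quiet
    mid = (bandPresentQuiet + bandAbsentQuiet) / 2;

    % SNR at which the fitted curve crosses the midpoint: solve p(x) = mid
    r = roots([p(1) p(2) p(3) p(4) - mid]);
    r = r(abs(imag(r)) < 1e-9);                  % keep only real roots
    susceptibilitySNR = r(r >= min(snr) & r <= max(snr));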


Figure 6.1. Sentence intelligibility in percent correct as a function of target-band signal-to-noise ratio for the ten critical bands tested. The top dashed line in each panel is average 'band-present' sentence intelligibility for that target band from Manuscript 2. The bottom dashed line in each panel is average 'band-absent' sentence intelligibility for that target band from Manuscript 2. The dotted line in each panel is the midway point between band present and band absent for that target band.

The intersection point varies greatly across bands, from as low as -4 dB SNR for the band with a center frequency of 450 Hz to as high as 4.3 dB SNR for the band with a center frequency of 250 Hz. It is important to note that the functions displayed in Figure 6.1 were derived using a group of subjects different from that used to determine the band-present and band-absent reference lines. Whereas most functions involving SNRs of -12 to 8 dB span the distance between the reference lines with reasonable accuracy, the functions for the two most extreme bands (bands 2 and 20) matched these previous data with less accuracy. This is likely due to the narrow range of band-present and band-absent scores resulting from their relatively low importance to overall speech intelligibility (these were the lowest-importance bands tested). As a result, within-subjects control conditions were implemented in which five normal-hearing subjects, who did not take part in the previous experiment, heard bands 2 and 20 in the current six SNR conditions, as well as three additional conditions: +12 dB SNR, band present in quiet, and band absent in quiet. Each subject heard 30 sentences in each of these 18 conditions, with the sentence-to-condition correspondence and order of conditions randomized for each subject. Results are shown in Figure 6.2. The data obtained in these conditions displayed a greater degree of agreement between the band-present/band-absent in quiet values and the function relating intelligibility to SNR of the target band. However, the resulting band-susceptibility values were within 1.6 dB of the values displayed in Figure 6.1 for the across-subjects comparison. Because of the closer correspondence obtained in the control conditions, these susceptibility values are used to represent bands 2 and 20.

Individual noise-susceptibility values are given in Table 6.2 and Figure 6.3.


Figure 6.2. Same as for Fig. 6.1, but data from a new group of five subjects. The dashed ‘band-present’ and ‘band-absent’ scores were obtained from these same five subjects.

Table 6.2 Noise susceptibility (in dB SNR) by band. Values for bands 2 and 20 are from the control conditions.

Band    Center Frequency (Hz)    SNR (dB)
2       250                      4.3
4       450                      -4.0
6       700                      -3.8
8       1000                     -3.2
10      1370                     -0.6
12      1850                     1.7
14      2500                     -3.4
16      3400                     1.8
18      4800                     -3.0
20      7000                     0.2

Figure 6.3. Noise susceptibility as equivalent signal-to-noise ratios for the target bands indicated. Values for bands having center frequencies of 250 and 7000 Hz are from the group of five subjects shown in Figure 6.2.

IV. DISCUSSION

These data demonstrate the noise susceptibility of various critical bands of speech. The degree of vulnerability, or alternatively the degree of robustness, to the detrimental influence of extraneous noise was systematically evaluated for ten critical bands spanning the speech spectrum. To examine this, an adaptation of the compound method (Apoux and Healy, 2012; Manuscript 1) was employed. Sentence intelligibility was measured while the band of interest was presented along with four other bands randomly distributed in frequency from trial to trial. In addition, noise was added to the target band at different SNRs. The SNR required to reduce sentence intelligibility halfway from band present in quiet to band completely absent was determined. This 50% reduction point was then compared across bands to evaluate each band's susceptibility to noise.

Apparent from Figure 6.3 are large differences in the noise susceptibilities of the ten bands tested. Thus, arguably the most important finding here is that susceptibility is not equal across the spectrum. In fact, the difference between bands was found to be as large as 8.3 dB SNR (band 2 versus band 4). Further, these large differences occur within a single region of the spectrum. Whereas the lower portion of the spectrum (bands 4, 6, and 8) appears relatively consistent with respect to degree of noise susceptibility, the lowest band tested (band 2) differs considerably. In particular, bands 4, 6, and 8 appear to be quite robust to noise, whereas band 2 demonstrates a large vulnerability to noise. Further, the higher region of the spectrum (bands 12, 14, 16, and 18) displays a large degree of variability across bands. Accordingly, it can be concluded that no steady trend of susceptibility across the frequency range exists. This is true despite the fact that these data were obtained utilizing a speech corpus of ten different talkers (half of each gender). Therefore, these results do not simply reflect any particular acoustic idiosyncrasy of an individual talker; rather, the differences observed here are likely to be more global and generalizable across talkers.

Another finding of interest is the above-average susceptibility of bands 10 and 12 observed here. These bands, which have center frequencies of 1370 Hz and 1850 Hz, respectively, represent regions of the spectrum that are among the most important to speech intelligibility. In fact, of the bands examined here, these two contributed most to sentence intelligibility (from Manuscript 2). This region roughly corresponds to the area of the average second and third formants of speech, and therefore its lack of robustness to the corrupting influence of noise is troublesome. This is a possible explanation for earlier work showing a shift of as much as one octave downward from 1900 Hz in the frequencies of greatest importance when testing in noise relative to testing in quiet (Webster and Klumpp, 1963; Webster, 1973).

The current results also have implications for the investigation of band importance. Most current techniques utilize noise to degrade and control overall intelligibility during the derivation of band importance (e.g., ANSI, 1969; Studebaker and Sherbecoe, 1991; Doherty and Turner, 1996; ANSI, 1997; Calandruccio and Doherty, 2007; Kates, 2013). Accordingly, the current findings have immediate relevance. Band importance and band susceptibility to noise appear to be separate factors, as there is no systematic relationship between the two, even when compared using the same speech materials, recordings, and technique from Manuscript 2. Figure 6.4 displays this lack of relationship through a scatterplot of band importance versus noise susceptibility for each of the 10 bands tested. A Pearson correlation between noise susceptibility (in dB SNR) and importance was non-significant (r = .008, p = .982). Of the five bands showing the greatest susceptibility to noise in this study, two had low overall importance to intelligibility, one had moderate importance, and two were the most important bands examined. These findings clearly illustrate the difficulty of evaluating speech-band importance in the presence of background noise. Band-importance techniques that rely on altering the signal-to-noise ratio to evaluate importance may therefore be measuring a band's vulnerability to noise rather than its importance.
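
The reported correlation can be computed in base MATLAB along the following lines, assuming importance and susceptibility are ten-element vectors holding the values from Manuscript 2 and Table 6.2:

    % Pearson correlation between band importance and noise susceptibility
    [R, P] = corrcoef(importance(:), susceptibility(:));
    r = R(1, 2);   % correlation coefficient
    p = P(1, 2);   % two-tailed p-value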

Figure 6.4. Relationship between noise susceptibility (in dB SNR) and band importance (from Manuscript 2) for the ten bands tested here (r = .008, p = .982).

One possible solution to this confound is the use of a technique that allows the examination of speech-band importance either in quiet or in background noise. The technique employed here, the compound method, allows for this flexibility. In previous work (Apoux and Healy, 2012; Manuscript 1; Manuscript 2), speech-band importance was examined for speech in quiet. This was possible due to the paucity of speech information required in each trial. Unlike previous techniques, which require a large amount of the spectrum to be present on any given trial, the compound method is able to systematically vary the amount of information presented with the target band (the number of 'other' bands) to assess performance in the steep portion of the psychometric function relating intelligibility to the information present. Therefore, the use of noise to degrade performance is not necessary. However, if the desire is to evaluate band importance for speech in noise, the only manipulation necessary may be the addition of non-target 'other' band(s) to adjust overall performance as the overall SNR is varied.

Another possible application of this technique involves the evaluation of the effect of sensorineural hearing impairment on the relative importance of various regions of speech. The interacting effects of noise and broadened auditory tuning on speech intelligibility can be quite complex. Accordingly, it may be advantageous to evaluate speech-band importance and noise susceptibility separately for this population. In fact, data such as these may reveal additional information about the way in which listeners with hearing loss process and decode speech, particularly in noise.

Although the current study employed steady-state speech-shaped noise to examine susceptibility, there are data to indicate that the results would hold for modulated noises as well. Stone et al. (2011) showed that the random fluctuations inherent to a 'steady' noise, when that noise is band-pass filtered, can play a substantial role in masking, particularly in cases where the available acoustic cues are limited. In fact, it was argued that energetic masking as a concept largely does not exist, and that all masking is essentially modulation masking.

V. SUMMARY AND CONCLUSIONS

In the current study, the susceptibility to noise of critical bands of speech was examined. The amount of noise within the band of interest was systematically varied to determine the SNR at which the contribution of that band dropped by a certain criterion. A multi-talker sentence corpus, the IEEE sentences spoken by five male and five female talkers, was utilized so that band susceptibility could be examined more generally, without the potential influence of a particular individual's voice. Observed here were large variations across bands in the degree of tolerance or robustness to the degrading influence of noise. This was true even for bands residing within the same general region of the spectrum. In addition, the bands of greatest importance were found to be among the most susceptible to noise. This combination of high importance and high susceptibility to noise is of concern with regard to the effective transmission to the listener of the speech cues within these bands. In addition, these results call into question the common practice of evaluating the relative importance of various bands of speech (the derivation of band-importance functions) in the presence of background noise. This is particularly true given the lack of a systematic relationship observed between band importance and band susceptibility to noise. These results, when considered in combination with results from previous studies (Manuscript 1; Manuscript 2), indicate that speech-band importance and speech-band susceptibility to noise may play very different roles in intelligibility, a difference that may be underappreciated in previous work.


NOTE

1. Due to an error in the processing script, 3 of the 20 subjects heard a small number of sentences twice. This corresponded to at most 1.3 sentences/condition.


ACKNOWLEDGEMENTS

This work was drawn from a dissertation submitted to The Ohio State University Graduate School by the first author, under the direction of the last author. It was supported in part through a grant from The Ohio State University Alumni Grants for Graduate Research and Scholarship Program, and by the National Institute on Deafness and Other Communication Disorders (R01 DC08594 to EWH). We are grateful for the data-collection assistance of Brittney Carter and Jordan Vasko, and the data-analysis assistance of Shuang Liu.

REFERENCES

American National Standard Inst. (1969). ANSI S3.5. American National Standard Methods for the Calculation of the Articulation Index (American National Standards Inst., New York).

American National Standard Inst. (1997). ANSI S3.5 (R2007). American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

American National Standard Inst. (2004). ANSI S3.21 (R2009). American National Standard Methods for Manual Pure-Tone Threshold Audiometry (American National Standards Inst., New York).

American National Standard Inst. (2010). ANSI S3.6-2010. American National Standard Specification for Audiometers (American National Standards Inst., New York).

Apoux, F., and Bacon, S. P. (2004). "Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise," J. Acoust. Soc. Am. 116, 1671-1680.

Apoux, F., and Healy, E. W. (2012). "Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants," J. Acoust. Soc. Am. 132, 1078-1087.

Bacon, S. P., and Liu, L. (2000). “Effects of ipsilateral and contralateral precursors on overshoot,” J. Acoust. Soc. Am., 108, 1811-1818.

Buus, S. (1985). "Release from masking caused by envelope fluctuations," J. Acoust. Soc. Am. 78, 1958-1965.

Calandruccio, L., and Doherty, K. A. (2007). “Spectral weighting strategies for sentences measured by a correlational method,” J. Acoust. Soc. Am., 121, 3827-3836.

Christiansen, T. U., Dau, T., and Greenberg, S. (2007). "Spectro-temporal processing of speech: An information-theoretic framework," in Hearing: From Sensory Processing to Perception (Springer), pp. 517-523.

Doherty, K. A., and Turner, C. W. (1996). “Use of a correlational method to estimate a listener’s weighting function for speech,” J. Acoust. Soc. Am., 100, 3769-3773.

Festen, J. M., and Plomp, R. (1990). "Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing," J. Acoust. Soc. Am. 88, 1725-1736.

Fogerty, D. (2011). "Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure," J. Acoust. Soc. Am. 129, 977-988.

French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am., 19, 90-119.

Healy, E. W. (1998). "A minimum spectral contrast rule for speech recognition: Intelligibility based upon contrasting pairs of narrow-band amplitude patterns" [Ph.D. dissertation]. The University of Wisconsin - Milwaukee; Available from: http://www.proquest.com/; Publication Number: AAT 9908202, pp. 56-73.

Healy, E. W., Yoho, S. E., and Apoux, F. (2013). "Band importance for sentences and words reexamined," J. Acoust. Soc. Am. 133, 463-473.

IEEE (1969). "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust. 17, 225-246.

Kasturi, K., Loizou, P. C., Dorman, M., and Spahr, T. (2002). "The intelligibility of speech with 'holes' in the spectrum," J. Acoust. Soc. Am. 112, 1102-1111.

Kates, J. M. (2013). "Improved estimation of frequency importance functions," J. Acoust. Soc. Am. 134, EL459-EL464.

Miller, G. A. (1947). “The masking of speech,” Psychological Bulletin, 44, 105-129.

Miller, G. A., and Nicely, P. E. (1955). “An analysis of perceptual confusions among some English consonants,” J. Acoust. Soc. Am. 27, 338-352.

Shannon, R. V., Galvin, J. J., III, and Baskent, D. (2002). "Holes in hearing," J. Assoc. Res. Otolaryngol. 3, 185-199.

Stone, M. A., Füllgrabe, C., Mackinnon, R. C., and Moore, B. C. (2011). "The importance for speech intelligibility of random fluctuations in 'steady' background noise," J. Acoust. Soc. Am. 130, 2874-2881.

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81, 1130-1138.

Studebaker, G. A., and Sherbecoe, R. L. (1991). "Frequency-importance and transfer functions for recorded CID W-22 word lists," J. Speech Hear. Res. 34, 427-438.

Webster, J. C. (1973). "The effects of noise on the hearing of speech," in Proceedings of the International Congress on Noise as a Public Health Problem (Dubrovnik, Yugoslavia), pp. 25-42.

Webster, J. C., and Klumpp, R. G. (1963). “Articulation index and average curve‐fitting methods of predicting speech interference,” J. Acoust. Soc. Am., 35, 1339-1344.

Zaar, J., and Dau, T. (2015). “Sources of variability in consonant perception and their auditory correlates,” J. Acoust. Soc. Am., 137, 2306.


CHAPTER 7: GENERAL SUMMARY AND DISCUSSION

The three manuscripts contained here examined and discussed the importance of critical bands of speech, as well as the particular corrupting influence of noise on those bands. The first manuscript tested a new technique to derive band-importance functions and compared those newly derived functions to those found in the Speech Intelligibility Index (ANSI, 1997). The second manuscript examined the influence of talker and speech-material effects on band-importance functions, as well as the impact of using a single talker versus multiple talkers to derive the functions. The third manuscript described the degree of noise susceptibility of critical bands of speech and the relationship between noise susceptibility and importance.

In the first manuscript, the novel compound method was implemented to derive band-importance functions for sentences from the Speech Perception in Noise test (SPIN; Kalikow, Stevens, and Elliott, 1977) and for words from the CID W-22 lists (Hirsh et al., 1952). The compound method evaluates the importance of a band of speech in the presence of random combinations of other frequency bands, thus allowing importance to be established in a more global and comprehensive manner that takes into account interactions with other bands. In addition, due to the limited amount of spectral information required for each trial, importance can be examined without the presence of performance-degrading background noise. A further advantage of the compound method is that it employs very steep slopes in the filtering stage of frequency-band processing. Therefore, the importance of a single band reflects only the frequency content of the passband and is devoid of influences of filter skirts. These skirts may cause adjacent bands to overlap acoustically and have been shown to contribute strongly to intelligibility (Healy, 1998; Warren et al., 1999; 2004).

The newly derived band-importance functions differ considerably from the functions currently given by the Speech Intelligibility Index. Specifically, a detailed 'microstructure' was observed, in which adjacent bands were shown to contribute vastly different amounts to overall intelligibility. This microstructure was found to be consistent across separate groups of listeners, indicating that it is reliable and reproducible. In addition, an analysis was performed to compare the formant frequencies of the talker's voice to the peaks observed in the function. There was reasonable consistency between the measured formants and the observed peaks for both the sentence and word materials.

A restricted analysis of the influence of contextual information was also performed, in which separate functions for the high- and low-predictability subsets of the SPIN sentences were compared. Despite the fact that these sentences were from the same recording of the same male talker and were tested on the same group of listeners, differences were observed between the two functions. This may indicate an effect of the linguistic content of the speech message on the importance of the component frequency bands.

The second manuscript directly examined the influence of both talker and speech material on band-importance functions. In addition, the use of a single talker versus multiple talkers to derive a function was assessed. Again, the compound method was employed to derive band-importance functions for the various speech materials and recordings. For the first experiment, two functions were derived for the IEEE sentences (IEEE, 1969). One utilized a single male talker of the sentences. The second utilized a total of ten talkers, half of whom were male and half of whom were female. These functions were then compared. In the second experiment, a novel recording of the SPIN test sentences was made. Subsequently, a band-importance function was derived via the compound method for this novel recording and compared to the function for the same speech material but a different talker. Lastly, a third experiment was conducted in which a band-importance function was derived for a novel recording of the CID W-22 word lists, spoken by the same talker as in Experiment 2. Thus, a comparison could be performed between functions from the same talker speaking different speech materials.

In Experiment 1, the function derived from sentences spoken by multiple talkers was smoother in overall shape than the function derived from one talker. In Experiment 2, the functions for the same speech materials spoken by different talkers showed considerable similarity. In contrast, the functions compared in Experiment 3, from different materials spoken by the same talker, differed substantially. Thus, any effect of individual talker on the shape of the functions seems to be limited, whereas the effect of speech materials appears to be much stronger. Results from Experiment 1 suggest that any relatively small effect of talker may be mitigated by the use of multiple talkers of different genders to derive the function.

The third manuscript examined systematically the degree to which various critical bands are susceptible to the corrupting influence of noise. Specifically, the SNR required to reduce the contribution of each band of speech by half was determined. This noise susceptibility was found to vary considerably across bands, by as much as 8.3 dB in SNR. Of note was the large susceptibility of bands in the important mid-frequency range of the spectrum. This manuscript also showed that little relationship exists between noise susceptibility and band importance, supporting the idea that these are different phenomena.

Taken together, these three manuscripts indicate a need to consider many separate factors when deriving band-importance functions, and subsequently applying those functions to audiologic and research applications. These factors include interactions between speech bands, as well as talkers, speech material, and background noise.

First, the data from Manuscript 1 indicate that band-importance methods that take into account the many interactions and redundancies across the speech spectrum may reveal complexities and intricacies within the importance function, whereas other methods may be insensitive to that level of detail. There is a theoretical implication of these differences between the techniques, as the normal function of auditory processing is perhaps more in line with the design of the compound method than with other methods. This is best illustrated by the glimpsing model of speech perception (Cooke, 2006), in which the auditory system locates and integrates those portions of a signal that are most spared by noise and therefore most available to the listener. The compound method evaluates a band's importance while varying the position of the other available portions of the spectrum randomly from trial to trial, which closely mimics the 'glimpses' that the auditory system integrates at any given moment in time.

Second, the data from Manuscript 2 indicate that for a band-importance function to be truly generalizable to any recording or spoken version of a speech corpus, more than one talker should be utilized in the derivation of that function. Although this idea was put forth in the early literature (Fletcher and Steinberg, 1929; Fletcher and Galt, 1950), it is rarely implemented in contemporary work. The function for ten talkers was significantly smoother than that for a single talker, and many of the large peak-to-peak excursions were greatly reduced, which suggests that the use of ten talkers of mixed gender may be sufficient for a function to generalize. However, a more systematic evaluation of the effect of the number of talkers, as well as the effect of talker gender, may reveal additional considerations.

The data from the second and third experiments of the second manuscript also have implications for the application of a band-importance function derived for one type of speech material to a different type of speech material. These data suggest that there is an effect of the type of speech material under examination. In fact, functions for sentences spoken by three different talkers were substantially more similar to one another than functions for sentences versus words spoken by the same talker. Therefore, although the issue of any potential talker effects on the shape of a band-importance function may be mitigated by the use of multiple talkers, a speech-material effect appears to be the more dominant influence on these functions and therefore should be considered primary.

The results from the third manuscript call into question the common practice of evaluating band importance for speech in the presence of background noise. Currently, many techniques require the use of background noise to reduce intelligibility to below-ceiling values. This is largely because most techniques evaluate the importance of a frequency band when a large portion of the spectrum is intact. A theoretical consideration for this distinction between evaluating importance in quiet or in noise is that the underlying factors that determine band importance and band susceptibility to noise appear to be very different. The vastly different susceptibility to noise of the ten bands tested in this study, as well as the lack of a systematic relationship between the degree of importance and the degree of susceptibility across bands, indicates that these two factors may be confounded in the previous literature. One consideration arising from this difference in the contributions of speech bands and their susceptibility to noise is the potential need for separate functions for speech in noise and speech in quiet within a future revision of the Speech Intelligibility Index.

One of the most concerning findings here is that the bands with the highest importance, or overall contribution to speech intelligibility, also have a high degree of susceptibility to extraneous noise. These bands span 1270-1480 Hz and 1720-2000 Hz, roughly corresponding to the area around the average second formant of speech.

In addition, the pattern of susceptibility across the spectrum varies, with bands all residing within the high region of the spectrum demonstrating vastly different degrees of susceptibility from one another. It is yet to be determined whether this results from some difference in acoustic cues across these bands, as Apoux and Bacon (2004) postulated in their work on the relative contributions of vocoded speech bands in background noise. A further analysis involving the modulation depth of the speech residing within each band may give insight into this possibility. However, it is evident that these incongruities across bands do not reflect any particular aspect of an individual talker's acoustics, as the data in the current study were derived from a set of ten voices of both genders.

Although the studies presented here represent an expansion of fundamental work involving normal speech perception, there are implications for our understanding of the impact of hearing impairment. First, these data provide an indication of where and how listeners may look for and utilize relatively undistorted 'glimpses' of a signal under less-than-ideal conditions. It appears that when restricted portions of the spectrum are available to the listener, certain bands (generally in the mid-frequency region) provide the most effective (important) glimpses to reconstruct the speech signal. However, those same bands also appear to be some of the most susceptible to the corrupting influence of noise, and therefore may not often be accessible in noisy backgrounds. It is well known that hearing-impaired listeners are highly susceptible to noise in general. But how the complications of sensorineural hearing impairment affect these listening strategies is still largely unknown.

The 'release from masking' that normal-hearing listeners receive in modulated backgrounds (such as speech) is greatly reduced or even entirely eliminated by broadened tuning (ter Keurs et al., 1993). The upward spread of masking, whereby a lower-frequency masker more effectively masks frequencies above it than below it, is also exaggerated for listeners with hearing impairment (Baer and Moore, 1993). These factors, among others, may result in changes or shifts to the noise-susceptibility findings of the third manuscript when the same phenomenon is examined in the hearing-impaired population. One possibility is that noise susceptibility is also influenced by the auditory tuning of the listener; therefore, bands in regions of the spectrum where tuning is substantially broadened relative to normal may show greater susceptibility to noise than regions where tuning is relatively intact.

In addition, it remains to be seen how well these new band-importance functions apply to predictions of intelligibility for listeners with sensorineural hearing impairment. The importance functions presented here, as well as those currently given by the Speech Intelligibility Index, were derived from studies involving normal-hearing listeners. The band divisions for all of the studies presented here were selected based on 'critical bands,' or an approximation of the critical bandwidth within the normal ear. Sensorineural hearing impairment, especially for hearing loss in excess of approximately 30 dB HL, often results in a broadening of auditory filters (see Moore, 2007). The microstructure observed in the compound-method functions, particularly for a single talker, has large band-to-band excursions. In other words, the importance of one critical band may be very different from the importance of an adjacent critical band. Broadened peripheral auditory filters may reduce the differences in importance for contiguous critical bands. Further, the relative lack of access to spectral cues due to broad auditory tuning can increase the dependence upon temporal cues in the speech signal (e.g., Apoux and Bacon, 2004; Souza and Gallun, 2011). The distribution of temporal cues across the spectrum may cause the overall shape of the importance function to differ. However, these interactions could be quite complex, and systematic investigation is needed to fully describe the influence of hearing loss on band importance.

Although the data presented here may differ when examined in the impaired population, the possibility exists that those differences could give valuable insight into how the impaired auditory system functions. In particular, they may reveal important ways in which the impaired system performs glimpsing, or aspects of this processing that break down with the introduction of hearing loss. These methods could also give insight into which acoustic cues and portions of the speech signal hearing-impaired listeners do or do not have access to.

One promising implication of these results involves incorporation into hearing technologies such as hearing aids and cochlear implants. The remediation of the impacts of hearing loss is challenging, and the need is great. In fact, some prevalence estimates indicate that as many as 90% of individuals 80 years of age and older have some degree of hearing loss (Cruickshanks et al., 1998).

The standard treatment for most individuals with sensorineural hearing loss is the use of amplification devices such as hearing aids. There are two primary goals of hearing aids. The first is to amplify sounds so that they are audible to the listener. The second is to make the target speech signal more intelligible to the listener. This is typically attempted through some form of signal processing that attempts to reduce extraneous background noise in the sound mixture. Hearing aids are generally effective at amplifying the signal adequately, even for individuals with a large degree of hearing loss. However, current signal processing techniques for reducing unwanted background noise from the target speech are limited in effectiveness.

A possible consequence of our greater understanding of speech cues, and of the corrupting influence of noise, involves ways to improve current speech segregation techniques, and thus the performance of hearing aids and cochlear implants. One such algorithm is that of Wang and Wang (2013) and Healy et al. (2013). This algorithm is based on the Ideal Binary Mask (IBM), which retains time-frequency (T-F) units dominated by speech and discards units dominated by noise. Although IBM processing is capable of producing large gains in intelligibility, it requires a priori knowledge of the separate speech and noise signals, which is not available in real-world scenarios. Thus, the algorithm estimates the IBM by training to recognize units dominated by noise and those dominated by speech.

A potential future study could involve the inclusion of band importance, as well as noise susceptibility, in the algorithm's decision to retain particular T-F units. Currently, IBM processing employs a fixed criterion across the spectrum for whether or not to retain a particular unit of the signal. This criterion, called the 'local criterion' (LC), is the SNR below which the unit is deemed too noisy and thus discarded. However, given the results of the current studies, there is evidence to suggest that allowing this local criterion to vary across the spectrum may be advantageous. As not all speech bands contribute equally to intelligibility, the criterion could be less strict for the bands that contribute most and more strict for the relatively unimportant bands. Thus, the amount of speech information retained would be maximized, while the amount of noise, particularly that residing in lower-importance bands, would be minimized. An additional weighting could be introduced by incorporating the current noise-susceptibility data. In this way, a weight could be applied to each frequency channel that takes into account the local signal-to-noise ratio, the importance of that frequency band, and the susceptibility to noise of that band.
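A minimal sketch of one way such a band-dependent criterion could be computed follows. The equal treatment of the two factors, the signs of the adjustments, and the dB spans are all assumptions made for illustration; the present studies motivate the idea but do not prescribe a formula.

    import numpy as np

    def band_dependent_lc(importance, susceptibility,
                          base_lc_db=-6.0, imp_span_db=6.0, sus_span_db=3.0):
        """Illustrative band-dependent local criterion (LC), in dB.
        importance and susceptibility are per-band values scaled to
        [0, 1]; both are hypothetical stand-ins for empirical data."""
        importance = np.asarray(importance, dtype=float)
        susceptibility = np.asarray(susceptibility, dtype=float)

        # More lenient (lower) LC for high-importance bands, so more
        # speech information survives; a stricter LC for bands assumed
        # to be more susceptible to noise. Both signs are assumptions.
        return (base_lc_db
                - imp_span_db * (importance - 0.5)
                + sus_span_db * (susceptibility - 0.5))

    # Each T-F unit would then be compared against the LC of its own
    # frequency band rather than a single spectrum-wide value, e.g.:
    #     mask = (snr_db > lc[:, None]).astype(float)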

In conclusion, the data presented here have both theoretical and practical implications. From a theoretical perspective, these data provide additional insight into the relative contributions of various bands of speech to overall intelligibility. They also suggest that the effect of noise on the reception of a band of speech differs considerably across the spectrum, and even within particular regions of the spectrum. Practical implications include suggestions for developing more accurate band-importance functions, as well as the potential for incorporating these data into current speech-in-noise segregation techniques.


REFERENCES

American National Standard Inst. (1997). ANSI S3.5 (R2007). American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

Apoux, F., and Bacon, S. P. (2004). “Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise,” J. Acoust. Soc. Am. 116, 1671-1680.

Cooke, M. (2006). “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am. 119, 1562-1573.

Fletcher, H., and Steinberg, J. C. (1929). “Articulation testing methods,” Bell System Technical Journal 8, 806-854.

Fletcher, H., and Galt, R. H. (1950). “The perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89-151.

Healy, E. W., Yoho, S. E., Wang, Y., and Wang, D. (2013). “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoust. Soc. Am. 134, 3029-3038.

Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., and Benson, R. W. (1952). “Development of materials for speech audiometry,” J. Speech Hear. Disord. 17, 321-337.

IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 225-246.

Kalikow, D. N., Stevens, K. N., and Elliott, L. L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337-1351.

Moore, B. C. (2007). Cochlear hearing loss: physiological, psychological and technical issues. (2nd ed.). Chichester, UK: Wiley.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1993). “Effect of spectral envelope smearing on speech reception II,” J. Acoust. Soc. Am. 93, 1547-1552.

Wang, Y., and Wang, D. L. (2013). “Towards scaling up classification-based speech separation,” IEEE Trans. Audio Speech Lang. Proc. 21, 1381-1390.


CUMULATIVE REFERENCES

American National Standard Inst. (1969). ANSI S3.5. American National Standard Methods for the Calculation of the Articulation Index (American National Standards Inst., New York).

American National Standard Inst. (1997). ANSI S3.5 (R2007). American National Standard Methods for the Calculation of the Speech Intelligibility Index (American National Standards Inst., New York).

American National Standard Inst. (2004). ANSI S3.21 (R2009). American National Standard Methods for Manual Pure-Tone Threshold Audiometry (American National Standards Inst., New York).

American National Standard Inst. (2010). ANSI S3.6-2010. American National Standard Specification for Audiometers (American National Standards Inst., New York).

Apoux, F., and Bacon, S. P. (2004). “Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise,” J. Acoust. Soc. Am. 116, 1671-1680.

Apoux, F., and Healy, E. W. (2009). “On the number of auditory filter outputs needed to understand speech: Further evidence for auditory channel independence,” Hear. Res. 255, 99-108.

Apoux, F., and Healy, E. W. (2010). “Relative contribution of off- and on-frequency spectral components of background noise to the masking of unprocessed and vocoded speech,” J. Acoust. Soc. Am. 128, 2075-2084.

Apoux, F., and Healy, E. W. (2011). “Relative contribution of target and masker temporal fine structure to the unmasking of consonants in noise,” J. Acoust. Soc. Am. 130, 4044-4052.

Apoux, F., and Healy, E. W. (2012). “Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants,” J. Acoust. Soc. Am. 132, 1078-1087.

Assgari, A. A., and Stilp, C. E. (2015). “Talker information influences spectral contrast effects in speech categorization,” J. Acoust. Soc. Am., 138, 3023-3032.

Bacon, S. P., and Liu, L. (2000). “Effects of ipsilateral and contralateral precursors on overshoot,” J. Acoust. Soc. Am. 108, 1811-1818.

Baer, T., and Moore, B. C. (1993). “Effects of spectral smearing on the intelligibility of sentences in noise,” J. Acoust. Soc. Am. 94, 1229-1241.

Baer, T., Moore, B. C., and Kluk, K. (2002). “Effects of low pass filtering on the intelligibility of speech in noise for people with and without dead regions at high frequencies,” J. Acoust. Soc. Am., 112, 1133-1144.

Başkent, D., Eiler, C. L., and Edwards, B. (2010). “Phonemic restoration by hearing-impaired listeners with mild to moderate sensorineural hearing loss,” Hear. Res. 260, 54-62.

Bell, T. S., Dirks, D. D., and Trine, T. D. (1992). “Frequency-importance functions for words in high- and low-context sentences,” J. Speech Hear. Res. 35, 950-959.

Bench, J., Kowal, A., and Bamford, J. (1979). “The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children,” Brit. J. Audiol. 13, 108-112.

Bilger, R. C., Nuetzel, J. M., Rabinowitz, W. M., and Rzeczkowski, C. (1984). “Standardization of a test of speech perception in noise,” J. Speech Hear. Res. 27, 32-48.

Boersma, P., and Weenink, D. (2011). Praat: Doing phonetics by computer (Version 4.3.22) [Computer program]. Retrieved from http://www.praat.org/ (last viewed online April 2011).

Breeuwer, M., and Plomp, R. (1984). “Speechreading supplemented with frequency-selective sound-pressure information,” J. Acoust. Soc. Am. 76, 686-691.

Brungart, D. S., Chang, P. S., Simpson, B. D., and Wang, D. (2006). “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Am. 120, 4007-4018.

Buus, S. (1985). “Release from masking caused by envelope fluctuations,” J. Acoust. Soc. Am. 78, 1958-1965.

Calandruccio, L., and Doherty, K. A. (2007). “Spectral weighting strategies for sentences measured by a correlational method,” J. Acoust. Soc. Am. 121, 3827-3836.

Cherry, E. C. (1953). “Some experiments on the recognition of speech, with one and with two ears,” J. Acoust. Soc. Am., 25, 975-979.

Christiansen, T. U., Dau, T., and Greenberg, S. (2007). “Spectro-temporal processing of speech – An information-theoretic framework,” in Hearing – From Sensory Processing to Perception (Springer), pp. 517-523.

Cooke, M. (2006). “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am. 119, 1562-1573.

Davis, H., and Silverman, S. R. (1978). Hearing and Deafness, 4th ed. (Holt, Rinehart, and Winston, New York), pp. 492-495.

DePaolis, R. A. (1992). The intelligibility of words, sentences, and continuous discourse using the articulation index (No. TR-92-04). Applied Research Laboratory, The Pennsylvania State University, University Park.

Doherty, K. A., and Turner, C. W. (1996). “Use of a correlational method to estimate a listener’s weighting function for speech,” J. Acoust. Soc. Am. 100, 3769-3773.

Dubno, J. R., and Dirks, D. D. (1989). “Auditory filter characteristics and consonant recognition for hearing‐impaired listeners,” J. Acoust. Soc. Am. 85, 1666-1675.

Dubno, J. R., Lee, F-S., Matthews, L. J., Ahlstrom, J. B., Horwitz, A. R., and Mills, J. H. (2008). “Longitudinal changes in speech recognition in older persons,” J. Acoust. Soc. Am. 123, 462-475.

Egan, J. P., and Wiener, F. M. (1946). “On the intelligibility of bands of speech in noise,” J. Acoust. Soc. Am. 18, 435-441.

Elliott, L. L. (1995). “Verbal auditory closure and the Speech Perception in Noise (SPIN) Test,” J. Speech Hear. Res. 38, 1363-1376.

Fletcher, H. (1940). “Auditory patterns,” Reviews of Modern Phys. 12, 47-65.

Fletcher, H. (1995). Speech and Hearing in Communication, J. B. Allen, Ed. (Acoustical Society of America, Woodbury, New York), pp. 278-279.

Fletcher, H., and Steinberg, J. C. (1929). “Articulation testing methods,” Bell System Technical Journal 8, 806-854.

Fletcher, H., and Galt, R. H. (1950). “The perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89-151.

Fogerty, D. (2011). “Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure,” J. Acoust. Soc. Am. 129, 977-988.

French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119.

Gaudrain, E., Grimault, N., Healy, E. W., and Béra, J. C. (2007). “Effect of spectral smearing on the perceptual segregation of vowel sequences,” Hear. Res. 231, 32-41.

Glasberg, B. R., and Moore, B. C. (1986). “Auditory filter shapes in subjects with unilateral and bilateral cochlear impairments,” J. Acoust. Soc. Am. 79, 1020-1033.

Glasberg, B. R., and Moore, B. C. (1990). “Derivation of auditory filter shapes from notched-noise data,” Hear. Res. 47, 103-138.

Glasberg, B. R., and Moore, B. C. (2000). “Frequency selectivity as a function of level and frequency measured with uniformly exciting notched noise,” J. Acoust. Soc. Am. 108, 2318-2328.

Grant, K. W., and Braida, L. D. (1991). “Evaluating the articulation index for auditory-visual input,” J. Acoust. Soc. Am. 89, 2952-2960.

Grant, K. W., and Walden, B. E. (1996). “Spectral distribution of prosodic information,” J. Speech Hear. Res. 39, 228-238.

Greenberg, S., Arai, T., and Silipo, R. (1998). “Speech intelligibility derived from exceedingly sparse spectral information,” in Proc. of the Intl. Conf. on Spoken Lang. Proc., Sydney, pp. 2803-2806.

Healy, E. W. (1998). “A minimum spectral contrast rule for speech recognition: Intelligibility based upon contrasting pairs of narrow-band amplitude patterns” [Ph.D. dissertation]. The University of Wisconsin - Milwaukee; Available from: http://www.proquest.com/; Publication Number: AAT 9908202, pp. 56-73.

Healy, E. W., and Bacon, S. P. (2002). “Across-frequency comparison of temporal speech information by listeners with normal and impaired hearing,” J. Speech Lang. Hear. Res. 45, 1262-1275.

Healy, E. W., and Bacon, S. P. (2007). “Effect of spectral frequency range and separation on the perception of asynchronous speech,” J. Acoust. Soc. Am. 121, 1691-1700.

Healy, E. W., and Steinbach, H. M. (2007). “The effect of smoothing filter slope and spectral frequency on temporal speech information,” J. Acoust. Soc. Am. 121, 1177-1181.

Healy, E. W., and Warren, R. M. (2003). “The role of contrasting temporal amplitude patterns in the perception of speech,” J. Acoust. Soc. Am. 113, 1676-1688.

Healy, E. W., Yoho, S. E., and Apoux, F. (2013). “Band importance for sentences and words reexamined,” J. Acoust. Soc. Am. 133, 463-473.

Healy, E. W., Yoho, S. E., Wang, Y., and Wang, D. (2013). “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoust. Soc. Am. 134, 3029-3038.

Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099-3111.

Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., and Benson, R. W. (1952). “Development of materials for speech audiometry,” J. Speech Hear. Disord. 17, 321-337.

IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 225-246.

Kalikow, D. N., Stevens, K. N., and Elliott, L.L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337-1351.

Kasturi, K., Loizou, P. C., Dorman, M., and Spahr, T. (2002). “The intelligibility of speech with ‘holes’ in the spectrum,” J. Acoust. Soc. Am. 112, 1102-1111.

Kates, J. M. (2013). “Improved estimation of frequency importance functions,” J. Acoust. Soc. Am. 134, EL459-EL464.

Kryter, K. D. (1962a). “Methods for the calculation and use of the articulation index,” J. Acoust. Soc. Am. 34, 1689-1697.

Kryter, K. D. (1962b). “Validation of the articulation index,” J. Acoust. Soc. Am. 34, 1698-1702.

Li, N., and Loizou, P. C. (2007). “Factors influencing glimpsing of speech in noise,” J. Acoust. Soc. Am. 122, 1165-1172.

Lippmann, R. P. (1996). “Accurate consonant perception without mid-frequency speech energy,” IEEE Trans. Speech Audio Process. 4, 66-69.

Miller, G. A. (1947). “The masking of speech,” Psychological Bulletin, 44, 105-129.

Miller, G. A., and Nicely, P. E. (1955). “An analysis of perceptual confusions among some English consonants,” J. Acoust. Soc. Am. 27, 338-352.

Miller, G. A., Heise, G. A., and Lichten, W. (1951). “The intelligibility of speech as a function of the context of the test materials,” J. Exper. Psych., 41, 329-335.

Moore, B. C. (2003). “Speech processing for the hearing-impaired: successes, failures, and implications for speech mechanisms,” Speech Communication, 41, 81-91.

Moore, B. C. (2004). “Dead regions in the cochlea: conceptual foundations, diagnosis, and clinical applications,” Ear Hear., 25, 98-116.

Moore, B. C. (2007). Cochlear hearing loss: physiological, psychological and technical issues. (2nd ed.). Chichester, UK: Wiley.

Moore, B. C., and Glasberg, B. R. (1983). “Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,” J. Acoust. Soc. Am. 74, 750-753.


Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). “Some effects of talker variability on spoken word recognition,” J. Acoust. Soc. Am., 85, 365-378.

Müsch, H. and Buus, S. (2001). “Using statistical decision theory to predict speech intelligibility. I. Model structure,” J. Acoust. Soc. Am. 109, 2896-2909.

Noordhoek, I. M., Houtgast, T., and Festen, J. M. (2001). “Relations between intelligibility of narrow-band speech and auditory functions, both in the 1-kHz frequency region,” J. Acoust. Soc. Am. 109, 1197-1212.

Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1994). “Speech perception as a talker-contingent process,” Psych. Science, 5, 42-46.

Patterson, R. D. (1974). “Auditory filter shape,” J. Acoust. Soc. Am. 55, 802-809.

Patterson, R. D. (1976). “Auditory filter shapes derived with noise stimuli,” J. Acoust. Soc. Am. 59, 640-654.

Pavlovic, C. V. (1987). “Derivation of primary parameters and procedures for use in speech intelligibility predictions,” J. Acoust. Soc. Am. 82, 413-422.

Pavlovic, C. V., and Studebaker, G. A. (1984). “An evaluation of some assumptions underlying the articulation index,” J. Acoust. Soc. Am. 75, 1606-1612.

Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175-184.

Peterson, G. E., and Lehiste, I. (1962). “Revised CNC lists for auditory tests,” J. Speech Hear. Disord. 27, 62-70.

Pollack, I. (1948). “Effects of high-pass and low-pass filtering on the intelligibility of speech in noise,” J. Acoust. Soc. Am. 20, 259–266.

Rosen, S. (1992). “Temporal information in speech: acoustic, auditory and linguistic aspects,” Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 336, 367-373.

Santner, T. J., Williams, B. J., and Notz, W. (2003). The Design and Analysis of Computer Experiments (Springer, New York).

Shannon, R. V., Galvin III, J. J., and Baskent, D. (2002). “Holes in hearing,” J. Assoc. Res. Otolaryngol. 3, 185-199.

Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303-304.

Silverman, S. R. and Hirsh, I. J. (1955). ‘‘Problems related to the use of speech in clinical audiometry,’’ Ann. Otol. Rhinol. Laryngol. 64, 1234–1245.

Steeneken, H. J. M., and Houtgast, T. (1980). “A physical method for measuring speech-transmission quality,” J. Acoust. Soc. Am. 67, 318-326.

Stone, M. A., Füllgrabe, C., Mackinnon, R. C., and Moore, B. C. (2011). “The importance for speech intelligibility of random fluctuations in “steady” background noise,” J. Acoust. Soc. Am., 130, 2874-2881.

Stone, M. A., Glasberg, B. R., and Moore, B. C. (1992). “Simplified measurement of auditory filter shapes using the notched-noise method,” British J. of Audiology 26, 329-334.

Stone, M. A., and Moore, B. C. (2003). “Tolerable hearing aid delays. III. Effects on speech production and perception of across-frequency variation in delay,” Ear Hear. 24, 175-183.

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). “A frequency importance function for continuous discourse,” J. Acoust. Soc. Am. 81, 1130-1138.

Studebaker, G. A., and Sherbecoe, R. L. (1991). “Frequency-importance and transfer functions for recorded CID W-22 word lists,” J. Speech Hear. Res. 34, 427-438.

Studebaker, G. A., Sherbecoe, R. L., McDaniel, D. M., and Gwaltney, C. A. (1999). “Monosyllabic word recognition at higher-than-normal speech and noise levels,” J. Acoust. Soc. Am. 105, 2431-2444.

Summers, V., and Molis, M. R. (2004). “Speech recognition in fluctuating and continuous maskers: Effects of hearing loss and presentation level,” J. Speech Lang. Hear. Res. 47, 245-256.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1992). “Effect of spectral envelope smearing on speech reception I,” J. Acoust. Soc. Am. 91, 2872-2880.

Ter Keurs, M., Festen, J. M., and Plomp, R. (1993). “Effect of spectral envelope smearing on speech reception II,” J. Acoust. Soc. Am. 93, 1547-1552.

Tun, P. A., O'Kane, G., and Wingfield, A. (2002). “Distraction by competing speech in young and older adult listeners,” Psych. and Aging, 17, 453-467.


Turner, C. W., Chi, S. L., and Flock, S. (1999). “Limiting spectral resolution in speech for listeners with sensorineural hearing loss,” J. Speech, Lang., Hear. Res. 42, 773-784.

Turner, C. W., Kwon, B. J., Tanaka, C., Knapp, J., Hubbartt, J. L., and Doherty, K. A. (1998). “Frequency-weighting functions for broadband speech as estimated by a correlational method,” J. Acoust. Soc. Am. 104, 1580-1585.

Turner, C. W., Souza, P. E., and Forget, L. N. (1995). “Use of temporal envelope cues in speech recognition by normal and hearing-impaired listeners,” J. Acoust. Soc. Am. 97, 2568-2576.

van Schijndel, N. H., Houtgast, T., and Festen, J. M. (2001). “Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 110, 529-542.

Van Tasell, D. J., Greenfield, D. G., Logemann, J. J., and Nelson, D. A. (1992). “Temporal cues for consonant recognition: Training, talker generalization, and use in evaluation of cochlear implants,” J. Acoust. Soc. Am. 92, 1247-1257.

Van Tasell, D. J., Soli, S. D., Kirby, V. M., and Widin, G. P. (1987). “Speech waveform envelope cues for consonant recognition,” J. Acoust. Soc. Am. 82, 1152-1161.

Wang, Y., and Wang, D. L. (2013). “Towards scaling up classification-based speech separation,” IEEE Trans. Audio Speech Lang. Proc. 21, 1381-1390.

Warren, R. M. (1970). “Perceptual restoration of missing speech sounds,” Science 167, 392-393.

Warren, R. M., and Bashford, J. A., Jr. (1999). “Intelligibility of 1/3-octave speech: Greater contribution of frequencies outside than inside the nominal passband,” J. Acoust. Soc. Am. 106, L47-L52.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2004). “Intelligibility of bandpass filtered speech: Steepness of slopes required to eliminate transition band contributions,” J. Acoust. Soc. Am. 115, 1292-1295.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2005). “Intelligibilities of 1-octave rectangular bands spanning the speech spectrum when heard separately and paired,” J. Acoust. Soc. Am. 118, 3261-3266.

Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2011). “An alternative to the computational Speech Intelligibility Index estimates: Direct measurement of rectangular passband intelligibilities,” J. Exp. Psychol. [Hum. Percept.] 37, 296-302.

Warren, R. M., Riener, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). “Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits,” Percept. Psychophys. 57, 175-182.

Webster, J. C. (1973). “The effects of noise on the hearing of speech,” in Proceedings of the International Congress on Noise as a Public Health Problem, Dubrovnik, Yugoslavia, pp. 25-42.

Webster, J. C., and Klumpp, R. G. (1963). “Articulation index and average curve‐fitting methods of predicting speech interference,” J. Acoust. Soc. Am., 35, 1339-1344.

Wegel, R., and Lane, C. E. (1924). “The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear,” Physical Review, 23, 266-285.

Zaar, J., and Dau, T. (2015). “Sources of variability in consonant perception and their auditory correlates,” J. Acoust. Soc. Am., 137, 2306.
