This thesis has been approved by

The Honors Tutorial College and the Department of Communication Sciences &

Disorders

Dr. Li Xu

Professor, Communication Sciences & Disorders

Thesis Advisor

Dr. Chao-Yang Lee

Professor, Communication Sciences & Disorders

Director of Studies, Communication Sciences & Disorders

Dr. Donal Skinner

Dean, Honors Tutorial College

PERCEPTION OF SPECTRALLY-DEGRADED, FOREIGN-ACCENTED SPEECH

A Thesis

Presented to

The Honors Tutorial College

Ohio University

In Partial Fulfillment of the Requirements for Graduation from the Honors Tutorial College with the degree of

Bachelor of Science in Communication Sciences & Disorders

By

Jenna Barrett

April 2021


Introduction

Background on Speech

Speech perception is an incredible and complicated process that allows us to hear, interpret, and understand speech. It begins with a speech signal. Most of us understand speech as a string of letters which make up words, which make up sentences, which go on to form even larger elements of language. Physically, however, speech is an acoustic wave. This complex wave is made up of numerous sounds that run together and occasionally overlap, just as they do in words and sentences. The smallest units of speech are known as phonemes. When an acoustic wave reaches the ear, vibrations pass through the hearing structures and stimulate the auditory nerve. These nerve impulses are processed by the brain and integrated with previous knowledge of sound to allow for the comprehension of speech (Urbas, 2019).

Within the nervous system, there are specialized mechanisms for processing speech. This is evidenced by experiments demonstrating that phonemes can be perceived at rates seven times faster than nonspeech sounds. What is perhaps even more impressive is the flexibility of this system. Humans can easily understand speech in their native language, often regardless of speech rate, voice, pitch, accent, loudness, or signal distortion. Even though the acoustic signal may be changed or distorted in many of these cases, the nervous system is able to rapidly adjust and compensate for these variations.

This is known as perceptual constancy (Urbas, 2019).

There are two main types of speech processing theories involved in perception: bottom-up processing and top-down processing (Figure 1). Bottom-up processing refers to using the simplest elements of an acoustic wave to guide more complex perception.

Within the wave, there are acoustic cues to phonemes, which can be combined into larger and larger speech components. Top-down processing is essentially the opposite. Complex linguistic knowledge, including context clues and sentence predictability, is used to anticipate speech. Both types of processing are necessary for speech perception. For example, in one study, researchers presented participants with a pair of words where the initial sound was a mixture of the /d/ and /t/ American English phonemes. In one instance, the phoneme made a nonsense word (drash) and in the other instance, the phoneme made a real word (trash). Participants were more likely to perceive the phoneme that created a real word (Austermann and Yamada, 2009). This is an example of the combination of processing techniques. Bottom-up processing would inform participants that the acoustic signal contained a phoneme similar to either /d/ or /t/, but top-down processing would allow participants to use previous linguistic knowledge to choose the phoneme that produced a real word.

Figure 1. The figure above describes the interaction between top-down and bottom-up processing in the perception of a speech stimulus. The basic information derived from the acoustic wave as well as higher-level contextual information are important for speech perception (Austermann and Yamada, 2009).


In research based on non-tonal languages, it was proposed that bottom-up and top-down processing occurred in series, with application of the acoustic signal coming before the use of more complex linguistic knowledge. However, in recent research on tonal languages, it was found that both types of processing were present in early and late stages of processing. Researchers used event-related potentials (ERPs) to track the use of bottom-up and top-down processing in lexical tone perception. Based on the results, a parallel lexical tone processing model was proposed (Shuai and Gong, 2014). This research raises questions as to the different types of processing that may occur in native speakers of a language (L1) and second language learners (L2) when presented with speech that contains acoustic-phonetic deviations.

The Normal Hearing Mechanism

The hearing mechanism is an integral component of speech perception. In a normal hearing individual, the hearing process begins with sound. When an object vibrates, the pressure causes particles in the surrounding medium to move, creating a sound wave. This sound wave is funneled into the ear canal, causing the tympanic membrane to vibrate. In the middle ear, these vibrations are passed to three small ossicles known as the malleus, incus, and stapes. These bones vibrate in sequence and function to correct the impedance mismatch, or loss of sound energy, that occurs when sound travels from an air to a fluid medium. The mechanical vibrations from the ossicles are then transferred to the fluid-filled cochlea of the inner ear. The cochlea contains three membranes: Reissner’s membrane, the basilar membrane, and the tectorial membrane.

The basilar membrane is organized tonotopically, meaning different areas of the membrane respond to different frequencies. In the case of the basilar membrane, the basal portion is stimulated by higher frequencies and the apical portion is stimulated by lower frequencies. On the basilar membrane lies the organ of Corti, which contains the sensory hair cells of hearing. Ultimately, the inner hair cells are responsible for the neural impulse that travels to the brain.

There are three fluid-filled passages within the cochlea containing either perilymph or endolymph. Perilymph fills the scala vestibuli and scala tympani, the two chambers separated by the basilar membrane. Perilymph is rich in sodium ions. In contrast, endolymph fills the scala media, which is the chamber between the scala vestibuli and scala tympani, also known as the cochlear duct. Endolymph is rich in potassium ions. Inside the organ of Corti, inner hair cells are grouped in a single row, and each hair cell is linked to about ten nerve cells. These inner hair cells consist of a main body and thin stereocilia that project from the top of the cell. Fluid vibrations within the cochlea cause the stereocilia of hair cells in specific portions of the basilar membrane to bend. This bending opens ion channels at the tips of the stereocilia, altering the concentration of ions in the hair cell. In response, potassium rushes into the hair cell, depolarization of the hair cell occurs, and neurotransmitters (glutamate) are released from the bottom of the hair cell. The auditory nerves pick up the neurotransmitters and generate action potentials (Niparko, 2000). These electrical impulses are transmitted to the brain via the auditory nerve. The diagram below (Figure 2) illustrates this hearing process up to the cochlear level.

Figure 2. The figure above shows the anatomical structures involved in the mechanism of hearing. Sound waves enter the ear canal and cause the tympanic membrane to vibrate. These vibrations are amplified by the ossicles of the middle ear and transferred to the inner ear. Fluid vibrations in the cochlea cause hair cells to bend, creating electric potentials that stimulate the auditory nerve and create the perception of sound (Hawkins, 2020).

Beyond the cochlea, the neural impulse must be transmitted from the cochlear nerve to the auditory cortex for identification and perception. Information from the cochlea follows the auditory pathway, which consists of roughly four sequential areas in the brain that work to decode the signal. The pathway begins with the cochleo-vestibular nerve. Recall that electrical impulses travel here via the auditory nerves that synapse with the hair cells in the cochlea. From the cochleo-vestibular nerve, the impulses travel to the cochlear nuclei of the brain stem where information on the duration, intensity, and frequency of the signal is deciphered. Next, a neuron carries this information to the superior olivary complex of the brainstem. It is important to note that, while both brain hemispheres process sound stimuli from both ears, much of the information from the stimuli passes to the contralateral side at this junction. The signal then travels to the inferior colliculus of the brainstem. Both the superior olivary complex and inferior colliculus work to integrate information from both ears and determine sound lateralization and localization. A fourth neuron then carries the signal up to the medial geniculate body of the thalamus, which functions to focus auditory attention and prepare a response to the sound signal. The last stop is the auditory cortex in the temporal lobe; here, the message of the signal is received and translated. Similar to the basilar membrane, the auditory cortex is organized tonotopically, with certain neurons tuned to specific frequencies; this allows for time and frequency information to be decoded. In addition, the primary auditory cortex is involved in identifying the sound, employing memory, and, if necessary, generating a response (Pujol, 2020). Information from the auditory cortex may be passed on to different areas of the brain for speech processing. The diagram below (Figure 3) illustrates the auditory pathway from the cochlear nerve to the auditory cortex.

Figure 3. The figure above shows the auditory pathway from the cochlear nerve to the auditory cortex. The pathway begins at the cochleo-vestibular nerve, which receives information from the auditory nerves of hair cells. Specific neurons relay the signal sequentially to the cochlear nuclei, superior olivary complex, the inferior colliculus, the medial geniculate body in the thalamus, and, finally, the auditory cortex (Pujol, 2020).

Hearing is a complex process, and there are a variety of physical and environmental factors that can lead to the breakdown of hearing machinery and, thus, hearing loss. The most prominent cause of hearing loss is damage to the delicate hair cells of the inner ear (Niparko, 2000). In cases where these hair cells are completely destroyed, the auditory nerve cannot be stimulated, and individuals are left profoundly deaf. One option to bridge the gap between these damaged hair cells and the auditory nerve is via an auditory prosthesis known as a cochlear implant.

Cochlear Implants

A cochlear implant is a neuroprosthetic device used to treat profound sensorineural hearing loss. The surgically implanted device bypasses the normal hearing mechanism and directly stimulates the auditory nerve using electrical signals. It is important to note that, while the cochlear implant is a highly successful communication tool, it is not a cure for deafness, nor does it restore normal hearing. The cochlear implant consists of both external and internal components. The external components include a microphone, speech processor, and transmitter, and the internal components include an implanted receiver and electrode array (Young and Kirk, 2016). The brain is often referred to as the final component of the cochlear implant system, as outcomes often vary depending on individual patient characteristics. Figure 4 shows the external and internal elements of the cochlear implant system (excluding the brain).

Figure 4. The figure above shows internal and external components of the cochlear implant system. The external components include the microphone, battery pack, speech processor, and external transmitter. The internal components include the implanted receiver and intracochlear electrodes. Sound from the environment is picked up by the microphone, converted into electrical signals, and transmitted to the auditory nerve via electrodes (Young and Kirk, 2016).

The process of hearing with a cochlear implant begins with the external components. Environmental sounds are picked up by the microphone and converted from sound waves into an electrical signal. These electric signals are then sent to the speech processor, where the signal is split into different frequency channels and transformed into a pattern of electric pulses. Then, the transmitter relays this information to the receiver via radio waves. In most cases, a magnet is used to hold the external transmitter in place next to the implanted receiver. Finally, the information is sent to the electrode array implanted in the cochlea where specific electrodes are stimulated in sequence. These electrodes serve a parallel function to the hair cells in a normal-hearing ear. Similar to the basilar membrane, electrodes are organized tonotopically, with stimulation of electrodes in basal locations resulting in perception of higher pitches and stimulation of electrodes in apical locations resulting in perception of lower pitches. In the context of speech processing, these electrodes are stimulated in rapid sequence. The electrical impulses are delivered to the auditory nerve and relayed to the brain, where they are interpreted as sound (Young and Kirk, 2016). Cochlear implant characteristics can vary widely in stimulation type, signal processing strategy, transmission, and electrode design.

Moreover, patient-specific characteristics can greatly impact the success of cochlear implant technology. Thus, it is important that these factors are considered when evaluating candidacy for cochlear implantation.
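As a purely illustrative sketch of the tonotopic channel-to-electrode idea described above, the Python snippet below assigns analysis bands to electrode positions so that higher-frequency bands map to more basal contacts. The band edges, electrode count, and function name are assumptions chosen only for demonstration; they do not reflect any manufacturer's actual processing strategy.

```python
import numpy as np

def band_to_electrode_map(n_electrodes=12, f_lo=200.0, f_hi=8000.0):
    """Return (low_edge, high_edge, electrode_index) triples, treating electrode 1
    as the most basal contact (highest band) and electrode n_electrodes as the
    most apical contact (lowest band). Values are illustrative only."""
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_electrodes + 1)
    bands = list(zip(edges[:-1], edges[1:]))           # low-to-high frequency bands
    mapping = []
    for i, (lo, hi) in enumerate(reversed(bands)):     # highest band -> electrode 1
        mapping.append((lo, hi, i + 1))
    return mapping

for lo, hi, electrode in band_to_electrode_map():
    print(f"electrode {electrode:2d}: {lo:7.1f}-{hi:7.1f} Hz")
```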

Determination of cochlear implant candidacy is a complex process that involves a multidisciplinary approach and input from a variety of health professionals. A cochlear implant requires long-term commitment and extensive rehabilitation; thus, it is vital to determine whether the benefits of implantation outweigh the risks. Patients must meet certain criteria and receive counsel on realistic expectations. In general, a cochlear implant is suitable for individuals with severe to profound sensorineural hearing loss who do not receive benefit from amplification and are motivated to carry out post-operative rehabilitation. For candidacy determination in adults, a hearing, otologic, and medical assessment are standard. Prior to implantation, candidates must also undergo speech audiometry with the use of amplification. Criteria are often variable, but FDA regulations require word discrimination scores of lower than 50-60% in optimal conditions (Niparko, 2000).

Despite a push for earlier implantation, candidacy criteria for children are still somewhat more selective than those for adults. Cochlear implantation was not approved for children until 1990, and it was not until 2000 that the minimum age for implantation was lowered to 12 months. The required testing, however, may begin as early as 6 months (Young and Kirk, 2016). Children are required to receive multiple assessments of auditory function; for prelingual children, this includes behavioral measures such as visual reinforcement audiometry (VRA) and behavioral observation audiometry (BOA) as well as objective measures such as otoacoustic emissions (OAEs) and acoustic reflex testing. Children under 2 years old must be diagnosed with bilateral profound sensorineural hearing loss in order to be considered for implantation. For older children, audiologic assessment and speech recognition tests are required. Children above 3 years old must score ≤30% on open-set, age-appropriate speech recognition tests. Hearing aid trials are also common, and children must make monthly gains in auditory and language development in order for cochlear implantation to be excluded from consideration (Young and Kirk, 2016).

Parental reports, imaging of cochlear and neural structures, familial support, and certain etiologies are also considered as part of the criteria for all children. While the FDA has put these regulations in place to protect children, it is predicted that restrictions will be eased as cochlear implant technology evolves.

Outcomes for cochlear implant users have been rather remarkable. While there are several objective and subjective measures that have been used to evaluate outcomes, one of the most prominent is speech recognition scores. Often, there are large improvements in speech recognition scores after implantation and, in some cases, cochlear implant users have achieved speech recognition scores similar to those of normal hearing individuals.

In a recent study (Dornhoffer et al., 2021) of 323 adult cochlear implant users, speech recognition scores were compared before and after implantation. Tests included the consonant-nucleus-consonant (CNC) word test as well as AzBio sentences in both quiet and noise. The mean percentage of improvement in scores between pre- and post-operative testing was greater than 33% for all tests. While the majority of participants (>78% for all tests) showed improvement on each of the three tests, outcomes for individual participants were highly variable. A small percentage of participants showed no improvement, and four participants showed a decrease in speech recognition scores. In another study (Young and Kirk, 2016), 181 children who had received an implant before age 5 were followed post-implantation. In the first part of the study, children were evaluated on speech perception, speech production, language, reading, and psychosocial development 4-7 years post-implantation. For speech recognition, the children scored an average of 50% without visual cues and 80% with combined auditory and visual cues. In the second part of the study, 112 of the original children returned to assess speech, language, reading, executive function, and working memory 13 years post-implantation. Results revealed significant gains in speech perception. Similar to the previous study, however, outcomes for individual participants were highly variable, with scores ranging from 0-100% on word and sentence tests. In addition to these impressive speech recognition gains, cochlear implants have the ability to improve overall satisfaction. Severe to profound hearing loss can negatively impact communication, relationships, and a sense of community. After implantation, many individuals show improvement in self-reported measures of quality of life (Niparko, 2000).

While several studies have documented the benefits of cochlear implant technology, there is still a great amount of individual variation in cochlear implant performance. Researchers have investigated the factors that influence cochlear implant outcomes. In adults, these factors include age at implantation, duration of deafness, degree of residual hearing, and age at onset of hearing loss (Niparko, 2000). In general, the prognosis for speech perception was better when individuals were implanted earlier on, had a shorter duration of deafness, maintained some residual hearing, had previous auditory experience, and relied primarily on spoken communication. This highlights the detriment of long-term auditory deprivation to hearing centers. For children, factors influencing cochlear implant outcomes include duration of cochlear implant use, age at implantation, communication method, educational environment, and presence of accompanying disabilities (Niparko, 2000). Lack of early auditory input can be detrimental to the development of auditory pathways; thus, earlier implantation yields better outcomes for children. In addition, children must adapt to hearing through a cochlear implant, and proper communication and educational stimulation from adults is necessary to facilitate speech recognition. Recent research has also shown that cochlear implant performance is highly correlated with cognitive measures such as working memory (Niparko, 2000). Because outcomes for the population of cochlear implant users can be so variable, they are often difficult subjects of study. In the current study, we utilized vocoded speech to simulate the experience of cochlear implant users and reduce confounding subject-specific factors.

Vocoded Speech

Every sound we perceive can be split into two components: temporal and spectral.

The temporal, or time, information is contained within the sound wave’s envelope. The envelope is essentially an outline of the waveform, and it provides information on the energy of the signal, period, and maximum amplitude. The spectral, or frequency, information consists of all the details of sound. Spectral information includes fundamental frequency, harmonics, formants, and other frequency components (Xu and Pfingst, 2008). In perceptual testing, it is sometimes necessary to separate these two elements of sound.

Vocoding is a technique that transforms a sound source into synthesized speech; it is useful for audio compression, speech synthesis, and research on human speech perception capabilities. In this process, the temporal envelope is extracted from a waveform, and the majority of spectral information is discarded. The synthesis (Figure 5) begins with speech input that is converted into a signal. The signal is sent through various band pass filters that pass frequencies within a certain range and reject frequencies outside that range. Next, the envelope is extracted from each band using half-wave rectification and low pass filtering. Rectification outlines each waveform, and low pass filtering smooths the signal based on a particular cutoff frequency. For example, if the low pass cutoff is 50 Hz, the filter will remove fluctuations above 50 Hz, making the waveform less choppy. The waveform is then compressed to fit a certain range, such as the range of hearing in an individual with hearing loss. White noise is modulated by the envelope of each band. Lastly, the bands of noise are passed through the same band pass filters used at the start and recombined to create noise-vocoded speech (Chen, Zheng, and Tsao, 2017). This process allows for the preservation of temporal information in the original speech signal but degradation of the spectral information.

Figure 5. The diagram above describes the steps in creating noise-vocoded speech from an input signal. The input speech undergoes band pass filtering, rectification, low pass filtering, compression, white noise modulation, and a second round of band pass filtering. Note that the vocoder used in the current study differs slightly from the description above; however, the principle is the same (Chen, Zheng, and Tsao, 2017).
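As a rough, hedged illustration of the steps just described, the Python sketch below implements a minimal noise vocoder with numpy and scipy. It is not the custom MATLAB vocoder used in this study; the logarithmic band spacing, filter orders, and default parameter values are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def noise_vocode(signal, fs, n_channels=6, f_lo=150.0, f_hi=5500.0, env_cutoff=160.0):
    """Minimal noise vocoder: band-pass the input, extract each band's temporal
    envelope (half-wave rectification + low-pass filtering), modulate white noise
    with that envelope, band-limit the noise, and sum the channels."""
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    env_sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    rng = np.random.default_rng(0)
    out = np.zeros(len(signal), dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)                # analysis band-pass filter
        env = sosfiltfilt(env_sos, np.maximum(band, 0.0))   # rectify, then smooth the envelope
        carrier = rng.standard_normal(len(signal))          # white-noise carrier
        out += sosfiltfilt(band_sos, env * carrier)         # band-limit the modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)              # normalize to avoid clipping
```

Lowering the number of channels or the envelope cutoff removes progressively more spectral or temporal detail, which is exactly the manipulation described in the next paragraph.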

In research experiments, the amount of temporal information can be varied by changing the lowpass cutoff frequencies, and the amount of spectral information can be varied by altering the number of channels (filters). Research involving noise-vocoded speech has investigated the importance of both spectral and temporal cues for speech perception. While both aspects of sound are important for perception, it appears that high speech recognition scores can still be achieved in cases of degraded speech input. In one study of consonants, vowels, and words in simple sentences, participants achieved 90% recognition scores with only three channels of noise-vocoded speech (Shannon et al., 1995). This study, however, was limited by the easy sentence tasks and quiet conditions.

Research has shown that as the difficulty of speech material or listening condition increases (such as with noise, further spectral degradation, or more complex sentence tasks), more spectral information is required (Shannon, Fu, and Galvin, 2004).

Additionally, in studies of consonant, vowel, lexical tone, and phoneme recognition, researchers found a tradeoff in the amount of temporal and spectral information required for recognition (Xu, Thompson, and Pfingst, 2005; Xu and Pfingst, 2008). In other words, as the amount of temporal information decreased, the necessary amount of spectral information for similar speech recognition scores increased and vice versa.

This research has important implications for cochlear implant users. The signal processing in cochlear implant technology is very similar to vocoder processing; however, in cochlear implants, electrical impulses are modulated by the speech envelope and then directly stimulate the auditory nerve (Chen, Zheng, and Tsao, 2017). While this technology has proven both incredible and beneficial for deaf individuals, more research is necessary to improve speech recognition outcomes for cochlear implant users. In one study, it was found that, although cochlear implants may have as many as 22 electrodes, users are only able to utilize spectral information from 4-8 of those channels (Shannon, Fu, and Galvin, 2004). The result is that cochlear implant users must rely more heavily on temporal cues available through the signal processing. This can lead to difficulties with enjoying music or perceiving lexical tones, both of which rely primarily on spectral cues (Xu and Pfingst, 2008). Vocoded speech simulations are invaluable in cochlear implant research because they offer a way to manipulate spectro-temporal information and test speech recognition without having to consider patient variability.

Foreign-Accented Speech

Native speech differs from foreign-accented speech in a few important ways.

Accented speech differs in the segmental (consonant and vowel placement) and suprasegmental (rhythm, stress, and intonation) components, as well as syllable structure and voice quality. This could present itself as dysfluent prosodic patterns, vowel substitutions, phonetic distortions, or inappropriate placement of the articulators (Anderson-Hsieh, Johnson, and Koehler, 1992). There are numerous reasons for this divergence from native speech and several speaker factors that can impact the strength of an accent. Age of acquisition of the second language, motivation, cultural differences, and time spent with native speakers all have an influence on the strength of an individual's accent (Gilakjani and Ahmadi, 2011). Additionally, because phonemes differ between languages, L2 speakers often lack native models for sounds in a new language (Flege, 1988). This may result in an L2 speaker transferring the rules of their native language onto the second language. Mispronunciation of sounds, as well as the transposition of rules and suprasegmental features from a native language, lead to the development of a stronger accent (Gilakjani and Ahmadi, 2011).

Accented speech can lead to communicative costs for both conversational parties.

Most significantly, a speaker’s message can be easily misunderstood if certain words or speech fragments are not recognized by the listener. Although intelligibility is not directly proportional to accent severity, there is some evidence that the amount of miscommunication between speaker and listener increases with accent. In one study, researchers found that native English speakers who were able to transcribe Mandarin-accented speech correctly on the first try still judged the utterances as having low comprehensibility (Munro and Derwing, 1995). This could be due to the different processing requirements involved with accented speech versus native speech. Accented speech may require longer or more effortful processing. In some cases, a misheard phrase may even need to be retrieved from short-term memory and replayed for processing. This leads to the perception of lower comprehensibility. Other communicative costs include prejudice against accented speakers, frustration, or negative attitudes towards conversing with accented speakers.

Cochlear implant users may be expected to have a particularly difficult time with accented speech due to the combination of signal processing and non-native speech patterns. When speech is degraded in this way, visual cues may improve speech recognition; this is known as an audiovisual (AV) benefit. Researchers recently investigated this AV benefit for accented speech. Two groups of participants were tested: cochlear implant listeners using their own signal processing devices and normal hearing listeners using vocoded speech. Interestingly, both groups of participants showed similar speech recognition results with accented and unaccented speech (Waddington et al., 2020). This indicates that accented speech does not limit the ability of cochlear implant users to utilize AV cues.

Considerations for Foreign-Accented Speech Perception

There are a few considerations to be made in the testing of foreign-accented speech perception. One problem is that accent ratings can be highly subjective. In one study, researchers found that accent ratings were impacted by a multitude of factors, including the speaker’s sex, proficiency level, and native language, as well as the rater’s exposure to foreign-accented speech (Kraut and Wulff, 2013). Accentedness is often entangled with other terms such as intelligibility, comprehensibility, or fluency. Accent is defined as “a distinctive manner of expression” of language (Merriam-Webster, n.d.). A person’s accent is not always correlated with their ability to be understood. For example, there may be cases where accent is prominent, yet intelligibility and comprehensibility are still high (Munro and Derwing, 1995). The distinction between these terms is important when it comes to accent ratings. A numbered rating scale requires descriptors that do not confuse the rater into scoring on the basis of other factors.

There are a variety of scales that may be used to assess accentedness of speakers.

These include continuous scales, such as a sliding scale model, direct magnitude estimation, and more discrete measures, such as numbered Likert scales ranging from three to ten gradients. The most common approach for accent ratings in previous studies has been the Likert scale (Jesney, 2004). In a 1999 study by Southwood and Flege, experimenters attempted to discern whether or not a linear scale was suitable for foreign accent ratings. Two groups of native English speakers were asked to rate Italian accents on either a seven-point interval scale or by direct magnitude estimation. A linear relationship was found between the two groups of scores, indicating that interval scales are effective for foreign accent rating tasks. It is important to note, however, that a ceiling effect was observed on the upper end of the seven-point interval scale. This result suggested that listeners’ sensitivity to foreign accent may be greater than previously assumed, and individuals may have the ability to discriminate accent on a larger scale.

Thus, a nine- or eleven-point interval scale was recommended for future studies in perception of foreign-accented speech (Southwood and Flege, 1999). As a result of previous research, in the current study, we chose a 1-9 Likert scale where 1 denoted a speaker with “no accent” and 9 denoted a speaker who was “extremely heavily accented.”

Another important factor in foreign-accented speech perception is speech material. In a 1994 research study, second language learners were evaluated on their perceived level of accentedness using either read or extemporaneous speech materials. It was found that there was no significant difference between perceived accentedness in either condition; however, there were some important considerations in the results.

Extemporaneous material does not allow for any preparation and, thus, may result in some speech errors. Read material allows for more time to focus on form but, unlike extemporaneous material, may contain words, phrases, or sentence structure unfamiliar to a non-native speaker (Munro and Derwing, 1994). Speech errors or dysfluent reading may lead to accent scores higher than normal. This highlights the importance of choosing speech materials that are appropriate for the participants being tested.

A final consideration is that of adaptation. Human speech processing is incredibly flexible when encountering new forms of speech. Processing speed for foreign-accented speech is initially slow for native English speakers; however, listeners are able to adapt very quickly, often within one minute of accent exposure. It appears that individuals are able to characterize acoustic-phonetic deviations from their native language and use it to process a less familiar one in real time (Clarke and Garrett, 2004). In a study with older, hearing impaired adults who were briefly exposed to Mandarin-accented speech, it was found that their degree of adaptation was similar to normal hearing controls. This indicates that adaptation may remain robust even when hearing ability is less than optimal (Hau et al., 2020). While this phenomenon is consistent for a variety of foreign accents, even unfamiliar ones, it is unclear if a similar rapid adaptation effect will occur with accented, vocoded speech.

Foreign-Accented Speech-In-Noise Perception

Overlaying speech with noise may lead to perceptual difficulties related to audibility, processing, and cognition (Gordon-Salant et al., 2013). Noise refers to any sound other than the target speech, and there are several different types of noise (environmental noise, white noise, steady-state noise, speech babble, etc.) that result in one of two types of masking: energetic masking or informational masking. Energetic masking refers to noise that physically interferes with a target speech signal due to both sound sources having energy in the same critical bands. Informational masking refers to noise that perceptually interferes with a target speech signal, such as multi-talker babble (Lidestam et al., 2014). What is important to understand here is that speech-in-noise studies focus on the masking effect of competing sound and do not distort the target speech signal itself.
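For concreteness, an energetic-masking stimulus of this kind is usually constructed by scaling a masker and adding it to the intact target at a chosen signal-to-noise ratio (SNR). The short Python sketch below is my own illustration of that arithmetic, not the stimulus-preparation procedure of any study cited here; the signals used in the example are placeholders.

```python
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Add a masker to the target at the requested SNR (in dB).
    The target itself is left undistorted; only the masker is rescaled."""
    masker = masker[:len(target)]                        # trim masker to target length
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2) + 1e-12
    scale = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10.0)))
    return target + scale * masker

# Example: mix a 1-second synthetic "target" with white noise at +8 dB SNR
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)                     # placeholder target signal
noise = np.random.default_rng(1).standard_normal(fs)
mixture = mix_at_snr(target, noise, snr_db=8.0)
```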

There has been a plethora of research on perception of foreign-accented speech in noisy conditions. Studies have focused on a variety of factors, including prosody, intelligibility of speech, and the effects of age and hearing loss. Most findings, however, are similar in that foreign-accented speech in noise leads to poorer speech perception outcomes compared to unaccented speech in quiet conditions. Studies have found that noise can impact the perception of foreign-accented utterances and decrease intelligibility of accented speech (Munro, 1998). Further, listeners have shown difficulties with perception of speech that differs from the prosody of their native language (Pinet and Iverson, 2010). There are ways that listeners can overcome the masking effects of noise.

Sometimes, listeners are able to use speech cues, such as pitch and speech rate, to separate target speech from competing background noise. This is known as masking release. It is possible that the presence of a foreign accent has a negative effect on listeners’ ability to pick up on these cues. In one study, it was revealed that foreign accent limited listeners’ ability to use these masking release cues, especially for older listeners with lower hearing sensitivity (Gordon-Salant et al., 2013).

In the current study, one variable of interest is the degree of accentedness of the talkers.

While accent and intelligibility are not synonyms, they can be related. Previous research has focused on the degree of intelligibility of talkers and the impact this has on speech recognition. In a 2020 study of foreign-accented speech-in-noise, Strori et al. looked at the relationship between talker intelligibility and linguistic structure. Results revealed that increased sentence complexity decreased speech recognition scores for talkers with high intelligibility, but there was no relationship between sentence complexity and speech recognition scores for talkers with low intelligibility. This indicates that different processing strategies may be used for low- and high-intelligibility foreign-accented speech. It is possible that, in the current study, the combination of strong accent and spectral degradation will have a similar effect on sentence recognition scores. In other words, a strong accent may correlate to low intelligibility, leading to lower scores regardless of sentence difficulty.

Studies of accented speech-in-noise perception have also been done with cochlear implant users. This perceptual challenge is threefold because it involves the spectral challenges of a cochlear implant, the masking challenges of speech-in-noise, and the acoustic-phonetic challenges of accented speech. In a 2014 study, Ji et al. investigated sentence recognition thresholds in noise of native and non-native talkers by both normal hearing listeners and cochlear implant users. Results revealed that accent had a detrimental effect for both groups of subjects, especially for cochlear implant users. In addition, there was greater variability in results for cochlear implant users, perhaps due to variability in this population. Listeners also performed better with female talkers than male talkers. This study offers an interesting opportunity for comparison to the current study, as both involve recognition of accented speech in spectrally-degraded conditions.

The Rainbow Passage

The Rainbow Passage is a short, public domain paragraph written by Grant Fairbanks. It is phonetically balanced and incorporates normal vocal and motor movements used in conversational American English speech. The passage is often used in speech and language testing because it provides a rich speech sample of the American English language. It can be used to evaluate an individual’s speech proficiency, ability to connect speech, reading comprehension, and accent. Nearly all phonemes of American English are represented in the first four lines of the passage, and the remaining phonemes occur in the sixth line. One major criticism of the paragraph is that it contains some fairly advanced vocabulary words, which may be difficult for evaluating younger individuals or non-native speakers (Shull, 2014).

HINT Sentences

The Hearing in Noise Test (HINT) is a set of sentences that can be used in speech perception testing in both quiet and noisy conditions. The HINT sentences are phonetically balanced and meant to represent natural speech. Results from this test offer information on an individual’s signal-to-noise ratio (SNR) threshold, percentile in relation to a normative data set, and intelligibility (Nilsson, Soli, and Sullivan, 1994). This test is suitable for comparing normal hearing listeners to those with hearing impairments, and it provides useful information on the speech perception abilities of those with hearing aids, cochlear implants, or other assistive listening devices.

The HINT sentences consist of twenty-six sentence sets with ten sentences per set. This amounts to a total of 260 phonetically balanced sentences. The sentences are scored based on the number of correct keywords identified by a participant; thus, scores can range from 0-100% (Raman, Lee, and Chung, 2011).

R-SPIN Sentences

The Revised Speech Perception in Noise Test (R-SPIN) is a useful tool for assessing the effect of contextual information on speech recognition. Each sentence is classified as either high predictability (HP) or low predictability (LP) based on the last word of the sentence. The HP sentences contain context clues that aid in predicting the last word of the sentence. An example of an HP sentence is ‘Stir your coffee with a spoon’. The LP sentences do not provide context clues, and the last word is much more difficult to predict. An example of an LP sentence is ‘Bob could have known about the spoon’. The R-SPIN sentences are customarily presented at an 8-dB SNR, but testing at multiple different SNRs is also feasible.

The R-SPIN sentences consist of eight sentence lists with fifty sentences per list, yielding 400 sentences in total. Each list is divided evenly between HP sentences and LP sentences. The sentences are scored based on identification of the last word in the sentence (Wilson, McArdle, Watts, and Smith, 2012).

Project Description

In today’s diverse society, the ability to communicate with individuals from different language backgrounds is a relevant and necessary skill. Understanding accented speech can be a challenge for native English speakers, as it diverges from standard English in its acoustic and phonetic properties. On top of these variations, background noise and signal processing can further deteriorate speech. While much research has been done on speech perception in noise (e.g., Strori et al., 2020), vocoded speech presents an alternative opportunity for exploration. Instead of a complete speech signal that is masked by noise, vocoded speech results in degraded spectral information in the speech signal itself. As both vocoder processing and accent can change the acoustic and phonetic properties of speech, it is unclear how this combination will impact perception. The goal of this study is to investigate the perception of spectrally-degraded, foreign-accented speech in native English speakers (L1).

This study can contribute to the field of speech perception by examining the perceptual strategies of L1 listeners in response to spectrally-degraded, foreign-accented speech. This research is also important for expanding our knowledge on cochlear implants. Since vocoder-processed speech simulates sound through a cochlear implant, the results of this study may improve our understanding of perception of foreign-accented speech through these neuroprosthetic devices.


The specific research questions this study aims to answer are as follows:

1. How does the degree of accent of non-native talkers affect the perception of spectrally-degraded, foreign-accented English speech in L1 listeners?

2. How does contextual information affect the perception of spectrally-degraded, foreign-accented English speech in L1 listeners?

Procedure

Rainbow Passage Recordings

Twenty-six participants (14 males and 12 females) were recruited to record a reading of the Rainbow Passage in English. Twenty-four of these participants were native Mandarin speakers who learned English as a second language and had varying degrees of accent. Two (one male and one female) were native English speakers. Participants were seated in a sound booth approximately 10 cm away from a microphone. A paper copy of the Rainbow Passage was provided, and participants were asked to read the passage at a normal speaking volume. Sound levels on the microphone were adjusted as necessary.

After recording, the sound files were cut in preparation for the subsequent rating task.

Only the first section of the Rainbow Passage (approximately 30 seconds) was used for the rating task. This speech sample is shown below.

“When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it.”

Accented-Speech Ratings

To assess the severity of each speaker’s accent, we asked 105 Ohio University students in undergraduate speech science and hearing science classes to rate the accent of each participant. The purpose of these ratings was to identify two L2 speakers (1 male and 1 female) with slight accents and two L2 speakers (1 male and 1 female) with strong accents to record sentence sets for the subsequent perceptual task. All twenty-six recordings were used for the rating task. The two native English speakers served as controls to ensure that raters performed the rating task conscientiously.

The speech samples were presented to raters in a randomized order, and each speaker was given a corresponding letter (A-Z) based on their order. For example, the first speaker played was labeled “A”, the second speaker played was labeled “B”, the third speaker played was labeled “C” and so on. Ratings were given on a Likert scale of 1 to 9, and only whole number ratings were accepted. A rating of 1 described speech with no accent (native English speaker), and a rating of 9 described speech that was extremely heavily accented. The students in the class received a sheet of paper with a description of this rating scale and the letters A-Z representing each of the speaker’s speech samples.

There was a blank space next to each letter to record an accent rating from 1 to 9. There was an additional area on the sheet to collect rater information, including whether the rater was a native or non-native English speaker and whether the rater was studying to become an SLP or audiologist. The ratings of those who were non-native speakers were excluded from the study.

Ratings took place in a classroom setting with students seated at desks with a clear view of a projector at the front of the room. The Rainbow Passage speech sample was displayed on the projector. Students were given time to read the speech sample themselves, so they knew what to expect from participants. Ratings began when all students had finished reading the speech sample. For the first participant, the letter “A” was displayed on the projector while the speech sample of speaker A was played to students. The sound was played through speakers at the front of the classroom at a comfortable level. After the participant speech sample was played, students were given time to record their ratings on their papers. The same process was repeated for speakers B through Z, with a letter displayed on the screen and the corresponding speaker’s speech sample playing through the classroom speakers. For each classroom of students, the presentation order of the speech samples was randomized. After all of the ratings were collected, a histogram of the ratings for each speaker was plotted.
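A minimal sketch of how such per-speaker histograms and mean ratings could be tabulated is shown below. The rating values and speaker labels are placeholders, since the actual data were collected on paper and the original analysis software is not specified here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: ratings[speaker] holds the 1-9 whole-number ratings that
# speaker received from the native-English raters (real data not shown here).
ratings = {"A": [6, 7, 6, 5, 7], "B": [3, 4, 3, 4, 3]}

for speaker, scores in ratings.items():
    plt.figure()
    plt.hist(scores, bins=np.arange(0.5, 10.5, 1.0))    # one bin per rating value 1-9
    plt.title(f"Speaker {speaker} (mean rating = {np.mean(scores):.2f})")
    plt.xlabel("Accent rating (1 = no accent, 9 = extremely heavily accented)")
    plt.ylabel("Number of raters")
plt.show()
```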

Histograms

Histograms were created for each individual speaker as well as for group mean data. The first chart below (Figure 6) shows the number of speakers that received each mean rating score. Many students rated the speakers similarly, with the majority of ratings falling between a score of 6 and 7. Over half of the speakers (~61%) were scored between 6 and 7. Only five out of the twenty-six speakers were rated a score below 4, and only one speaker was rated a score above 7. This trend in scoring is indicative of two possibilities. One possibility is that the student raters chosen to rate each speaker were unable to differentiate between accent severities very accurately. For example, they may have had trouble perceiving the difference between accents with similar severities, such as a rating of 5 and 6. A second possibility is that many of the speakers chosen for the Rainbow Passage recordings had similar accent severities.

It is also important to note that, while raters were chosen from similar undergraduate courses and all were reported as native English speakers, it is possible that response bias was present in the reported ratings. Because foreign accent ratings are a matter of perception, they lack any concrete units; instead, qualitative descriptors were used for the rating scale. Raters’ previous experiences and familiarity with Mandarin-accented speech could impact subjective ratings. In other words, raters who are less familiar with accented speech may be inclined to provide higher rating scores than those who are more familiar with accented speech. This is described in Southwood and Flege (1999).

Figure 6. The diagram above shows the number of L2 speakers that received each mean rating score. The ratings were completed by 105 native-English speaking students at Ohio University. Ratings were given on a whole-number scale of 1-9. A total of twenty-six speakers were rated. The two speakers rated with a mean rating score of one are native English speakers, and the rest are native Mandarin speakers. The most common mean rating score across all speakers was between 6 and 7.

The second chart below (Figure 7) shows the mean rating score of each individual speaker. It should be noted that the average scores for the two native English speakers were 1.01 and 1.09, indicating that ratings were completed conscientiously. Scores for Mandarin speakers ranged from 3.69 to 7.41. When choosing the speakers to participate in the subsequent perceptual task, we looked at individual histogram data and chose the speakers whose mean scores were closest to 3 (slight accent) and 6 (strong accent) and whose data showed the least variability across raters. Three speakers were close to a mean rating of 3; on the chart, these speakers are labeled as speakers 3, 4, and 5.

Additionally, several speakers were close to a mean rating of 6; these speakers are labeled as speakers 14-17. The two individuals chosen for slight accent both received a mean rating of 3.69. The two individuals chosen for strong accent received mean ratings of 5.87 and 6.1. Each accent level had one male and one female speaker. It is important to note that speakers rated above 6.5 were not chosen for the sentence recording task because there was a possibility that intelligibility would be too low after vocoder processing.

Figure 7. The diagram above shows the mean rating score of accentedness and standard deviation for each individual speaker. The ratings were completed by 105 native English-speaking students at Ohio University. Speakers are labeled with numbers 1-26 and are organized on the chart from lowest mean rating score to highest mean rating score. The first two speakers are native English speakers, and the rest are native Mandarin speakers. The four speakers marked with arrowheads were selected to complete the sentence recording task.

Sentence Recordings

The four L2 speakers chosen in the rating task each recorded two sets of sentences: HINT sentences and R-SPIN sentences. Participants were seated in a sound booth approximately 10 cm away from a microphone. A copy of the sentences was provided, and participants were asked to read each sentence at a normal speaking volume.

Sound levels on the microphone were adjusted as necessary. Each of the four L2 speakers was compensated for their participation.

After all of the sentences were recorded, they were cut into individual sound files using CoolEdit software. Next, the sentences were processed using noise vocoder processing with a custom MATLAB program (Xu and Pfingst, 2008). During the vocoder processing, 6 frequency channels were used to restrict the amount of spectral information in the processed signals. The total bandwidth was 150 to 5500 Hz, and the cutoff frequency of the lowpass filter was set at 160 Hz. These vocoder-processed sentences were used in the subsequent perceptual task.
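For reference, the stated settings (6 channels, 150-5500 Hz total bandwidth, 160 Hz lowpass cutoff) would correspond to a call like the one below if the earlier illustrative noise_vocode sketch were used. The actual processing was done with the custom MATLAB program cited above, and the file names and the soundfile library here are assumptions for illustration.

```python
import soundfile as sf   # assumed I/O library; any WAV reader/writer would do

# noise_vocode is the illustrative function defined in the earlier sketch
signal, fs = sf.read("hint_sentence_001.wav")             # hypothetical input file
vocoded = noise_vocode(signal, fs, n_channels=6,
                       f_lo=150.0, f_hi=5500.0, env_cutoff=160.0)
sf.write("hint_sentence_001_vocoded.wav", vocoded, fs)    # hypothetical output file
```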

Perceptual Test

Twenty native English speakers aged between 19 and 23 years old (mean age = 20.9 years) were recruited to listen to the sentences recorded by the L2 participants.

Participants were recruited from the Ohio University student population and consisted of undergraduate and graduate students from various majors. Prior to testing, participants were asked about their hearing ability. Those who did not report normal hearing ability were excluded from the study. The perceptual test lasted approximately 1.5 hours. All twenty L1 subjects were compensated for their participation.

The experiment was conducted using a custom MATLAB program. Participants performed two types of vocoder-processed speech recognition tests that included HINT sentences and R-SPIN sentences. Before testing began, a practice session that consisted of 60 HINT sentences was completed by each participant. The purpose of this practice session was to allow listeners to adapt to accented and vocoder-processed speech. In the practice session, each of the four L2 speakers contributed 15 sentences, 5 of which were presented in the original, unprocessed condition and 10 of which were vocoder-processed at 6 frequency channels. Sentences were randomized before being presented to listeners.

Participants were allowed to listen to each sentence up to three times, type their answer, and then view the sentence on the computer screen. After the practice session, participants completed a HINT test that consisted of 80 sentences and an R-SPIN test that consisted of 400 sentences. During testing, participants were still able to listen to the sentence up to three times and type their answer, but they were not able to view the sentences. For the HINT test, eight sentence lists were randomly selected. Each of the four L2 speakers contributed 20 sentences, half of which were presented in the original, unprocessed condition and half of which were vocoder-processed at 6 frequency channels. Participants were asked to type the entire sentence they heard. For the R-SPIN test, all eight sentence lists were played for each participant. Each of the four L2 speakers contributed 100 sentences, half of which were presented in the original, unprocessed condition and half of which were vocoder-processed at 6 frequency channels. Participants were asked to type only the final word of the sentence they heard. The order of L2 speakers and of sentences presented in the processed and unprocessed conditions was randomized for the perceptual test before being presented to listeners.
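The HINT trial structure described above could be assembled along the following lines; this is a hypothetical sketch of the randomization logic only (speaker labels and sentence IDs are placeholders), not the actual MATLAB test program.

```python
import random

talkers = ["M1", "F1", "M2", "F2"]        # slight- and strong-accent talkers
trials = []
for talker in talkers:
    # 20 HINT sentences per talker: half unprocessed, half 6-channel vocoded
    sentences = [f"{talker}_hint_{i:02d}" for i in range(1, 21)]
    random.shuffle(sentences)
    trials += [(s, "unprocessed") for s in sentences[:10]]
    trials += [(s, "vocoded_6ch") for s in sentences[10:]]
random.shuffle(trials)                    # randomize talker/condition order across the test
```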

Participants were seated in a sound booth with Sennheiser HD 300 Pro headphones on. They were asked to listen to the signal through headphones at the most comfortable levels and type out their responses. Sentence recognition performance data was scored based on percent correct.

Hypotheses

The current study tested the following three hypotheses:

1. L1 listeners will perform more poorly in the perception of spectrally-degraded, foreign-accented speech;

2. L1 listeners will show greater difficulty in perception of more heavily accented speech than less heavily accented speech;

3. Recognition of degraded, accented speech will interact with listeners’ ability to use contextual information, and such an interaction will be revealed in recognition performance of the low- and high-predictability (LP and HP) sentences.

Results

HINT results

HINT sentences were scored based on the number of key words identified correctly in each sentence. The sentences were displayed on an Excel sheet with the original sentence and the participant response side by side. Sentences were scored manually for each participant, with possible scores ranging from 0 to 7 denoting the correct number of key words identified. The number of key words correctly identified in each sentence was divided by the total number of key words in that sentence, resulting in a percentage score.

This was done with each participant for all 80 sentences in the HINT testing phase.

Accuracy rate was calculated by dividing the total number of keywords correctly identified by the total number of keywords in all sentences.
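Scoring was done manually, but the percent-correct arithmetic described above amounts to the following sketch; the sentence and key-word list in the example are illustrative placeholders.

```python
def score_keywords(response, keywords):
    """Count how many target key words appear in the typed response (case-insensitive)."""
    words = response.lower().split()
    return sum(1 for kw in keywords if kw.lower() in words), len(keywords)

# Placeholder example: one sentence with three key words, scored from a typed response
scored = [score_keywords("the boy fell from the window", ["boy", "fell", "window"])]
accuracy = 100.0 * sum(hit for hit, _ in scored) / sum(total for _, total in scored)
print(f"Accuracy rate: {accuracy:.1f}%")   # total key words correct / total key words
```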

Figure 8 displays the recognition accuracies for HINT sentences. It is important to note that this figure also includes recognition accuracies from Yang et al. (2021) for native-English male talkers. This allows for better comparison between native and foreign-accented speech perception outcomes. Recognition accuracies were higher in the unprocessed condition and for less-accented talkers overall. In the unprocessed condition, listeners showed high recognition accuracy for all talkers, with only slightly lower scores for talkers with a strong accent. Slightly accented talkers in the unprocessed condition showed the same median of performance as native talkers. For all talkers in the unprocessed condition, recognition accuracy was greater than 94%. In the vocoded condition, listeners maintained high recognition accuracy for native-English speakers as well as the slightly-accented talkers. Average scores were greater than 90% for these talkers. For strongly-accented talkers, accuracy scores dropped significantly, with an average of 65% for strongly-accented male talkers and 75% for strongly-accented female talkers. Also of interest is the difference in listener performance between male and female talkers. For the strongly-accented talkers in the unprocessed condition as well as both slightly-accented and strongly-accented talkers in the vocoded condition, listeners performed better with female talkers than with male talkers.

The data for non-native talkers revealed seven outliers: one in the unprocessed condition with the slightly accented male speaker, one in the unprocessed condition with the slightly accented female speaker, one in the unprocessed condition with the heavily accented male speaker, two in the unprocessed condition with the heavily accented female speaker, and two in the processed condition with the slightly accented male speaker.

[Figure 8 (boxplot): percent correct (%) for HINT sentences, shown separately for the unprocessed and vocoded conditions and for talkers MN, M1, F1, M2, and F2. See caption below.]

Figure 8. The chart above shows recognition accuracy for HINT sentences. The data are shown for talkers with slight accent (M1 and F1) and strong accent (M2 and F2) in both unprocessed and vocoded conditions. In addition, data from Yang et al. (2021) were included to show recognition accuracies for native English male talkers (MN). The outer limits of each box represent the 25th and 75th percentiles, and the center line represents the median of performance. The whiskers of each box represent the range of performance. The crosses represent outliers.

R-SPIN results

R-SPIN sentences were scored based on identification of the final key word in each sentence. The sentences were displayed in an Excel sheet with the original sentence and the participant response side by side. Sentences were scored manually for each participant with either a 1 (correct identification of the key word) or a 0 (incorrect or no identification of the key word). Participant responses with the same root word that included suffixes or additional morphemes were counted as incorrect. Scoring was done for each participant for all 400 sentences in the R-SPIN testing phase. Accuracy rate was calculated for both LP and HP sentences by dividing the total number of final key words correctly identified by the number of sentences in each list (25 HP sentences and 25 LP sentences).
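The R-SPIN scoring rule can be sketched as follows. This is an illustrative approximation of the manual scoring; the target and response words are hypothetical examples.

```python
def score_rspin_item(target_key_word, typed_word):
    """Score one R-SPIN item: 1 for an exact match on the final key word, else 0.

    Per the rule described above, a response sharing the root word but adding a
    suffix or extra morpheme is scored as incorrect.
    """
    return 1 if typed_word.strip().lower() == target_key_word.lower() else 0

def subset_accuracy(item_scores, n_items=25):
    """Percent correct for one 25-item HP or LP subset of an R-SPIN list."""
    return 100.0 * sum(item_scores) / n_items

print(score_rspin_item("boat", "boat"), score_rspin_item("boat", "boats"))   # 1 0
```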

Figure 9 displays the recognition accuracies for R-SPIN sentences. It is important to note that this figure also includes recognition accuracies from Yang et al. (2021) for native-English male talkers. This allows for better comparison between native and foreign-accented speech perception outcomes. Recognition accuracies were highest in the unprocessed condition and for less-accented talkers. Additionally, listeners performed better in conditions with greater contextual information. In the unprocessed condition for HP sentences, listeners performed fairly well across all speakers. Listeners had nearly perfect recognition accuracies for native-English speakers and averaged 93% for all non-native talkers in the HP condition. For the unprocessed LP condition, scores stayed consistently high for native-English talkers but dropped substantially for non-native talkers, to an average of 70%. The lack of contextual information had a larger impact on recognition accuracies for the strongly-accented talkers. In the vocoded condition, scores were lower across all speakers in both HP and LP conditions. For HP sentences, listeners' recognition accuracy for native-English talkers averaged 93%, and listeners' recognition accuracy for non-native talkers averaged 61%. Scores were considerably lower in the LP condition for all speakers. For LP sentences, listeners' recognition accuracy for native-English talkers averaged 65%, and listeners' recognition accuracy for non-native talkers averaged 36%.

For the R-SPIN results, an inconsistent pattern was observed in listener performance between male and female talkers. In the unprocessed condition for talkers with strong accent, listeners performed better with male talkers than with female talkers in both the HP and LP conditions. For talkers with slight accent in the unprocessed condition, the opposite trend was seen in both HP and LP conditions, with listeners scoring higher for female talkers than male talkers. In the vocoded condition, listeners consistently performed better with female talkers, regardless of accent level or semantic context.

The data for non-native talkers revealed two outliers in the LP group: one in the unprocessed condition with the heavily accented male speaker and one in the processed condition with the slightly accented female speaker.

[Figure 9 (boxplot): percent correct (%) for R-SPIN sentences, shown for HP and LP sentences within the unprocessed and vocoded conditions and for talkers MN, M1, F1, M2, and F2. See caption below.]

Figure 9. The chart above shows recognition accuracy for R-SPIN sentences. The data are shown for talkers with slight accent (M1 and F1) and strong accent (M2 and F2) in both unprocessed and vocoded conditions. Within each condition, the data are split for HP and LP sentences. In addition, data from Yang et al. (2021) were included to show recognition accuracies for native English male talkers (MN). The outer limits of each box represent the 25th and 75th percentiles, and the center line represents the median of performance. The whiskers of each box represent the range of performance. The crosses represent outliers.

Semantic benefit

In suboptimal listening conditions, such as perception of accented, vocoded speech, listeners may benefit from additional context for speech recognition. When the smaller elements of speech are distorted, semantic cues allow listeners to identify words or phrases based on previous linguistic knowledge. In this way, listeners are able to anticipate speech. For the current study, semantic benefit was tested using HP and LP R-SPIN sentences. The magnitude of semantic benefit offered in a speech recognition task can be calculated using recognition accuracies from these two sentence sets.
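A minimal worked example of this calculation is shown below; the accuracy values are hypothetical and serve only to illustrate the HP-minus-LP computation used for Figure 10.

```python
def semantic_benefit(hp_accuracy, lp_accuracy):
    """Semantic benefit in percentage points: HP accuracy minus LP accuracy."""
    return hp_accuracy - lp_accuracy

# Hypothetical listener: 90% correct on HP sentences, 58% on LP sentences.
print(semantic_benefit(90.0, 58.0))    # 32.0 percentage points
```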

Figure 10 displays the difference in recognition performance between HP and LP R-SPIN sentences (HP recognition accuracy minus LP recognition accuracy) for all speakers in both unprocessed and processed conditions. This difference represents the benefit provided by semantic cues to listeners' speech recognition performance. It is important to note that this figure also includes data from Yang et al. (2021) for native-English male talkers. This allows for better comparison between native and foreign-accented speech perception outcomes. In the unprocessed condition, semantic cues provided no benefit to listeners in recognition of native-English speech, limited benefit for recognition of slightly-accented speech, and the greatest benefit for recognition of strongly-accented speech. In the processed condition, nearly the opposite trend was observed; semantic cues provided the greatest benefit to listeners in recognition of native-English and slightly-accented speech, but a lesser benefit for recognition of strongly-accented speech.

An inconsistent pattern was observed in semantic benefit between male and female talkers. For strongly-accented talkers in both unprocessed and processed conditions, a greater benefit of semantic cues was seen for female talkers. However, for slightly-accented talkers, a greater benefit was seen for males in the unprocessed condition, and a similar benefit was seen for both males and females in the processed condition.

The data for non-native talkers revealed two outliers: one in the unprocessed condition for the slightly-accented male talker and one in the processed condition for the slightly-accented male talker.

[Figure 10 (boxplot): HP-LP performance difference (percentage points), shown for the unprocessed and vocoded conditions and for talkers MN, M1, F1, M2, and F2. See caption below.]

Figure 10. The chart above shows the performance difference between HP and LP R-SPIN sentences (HP-LP). The data are shown for talkers with slight accent (M1 and F1) and strong accent (M2 and F2) in both unprocessed and vocoded conditions. In addition, data from Yang et al. (2021) were included for native English male talkers (MN). The outer limits of each box represent the 25th and 75th percentiles, and the center line represents the median of performance. The whiskers of each box represent the range of performance. The crosses represent outliers.

Discussion

This study investigated the combined adverse effects of Mandarin-accented English speech and vocoder processing on sentence recognition by native English listeners. There were two specific research questions this study attempted to answer: (1) how does the degree of accent of non-native talkers affect the perception of spectrally-degraded, foreign-accented English speech in L1 listeners, and (2) how does contextual information affect the perception of spectrally-degraded, foreign-accented English speech in L1 listeners? To address these research questions, recognition accuracies of unprocessed and processed sentences produced by four Mandarin-accented talkers were calculated for twenty native English listeners and compared to data from a native English talker (Yang et al., 2021).

The first research question examined how accentedness impacted perception of spectrally-degraded, foreign-accented speech in native English listeners. Across all conditions for both HINT and R-SPIN sentences, listeners performed better with unaccented or slightly-accented talkers than with strongly-accented talkers. Additionally, listeners performed better in the unprocessed condition than in the processed condition. It is important to note that the results revealed a possible ceiling effect for HINT sentences in the 6-channel vocoded condition. For slightly-accented talkers, average recognition scores in the unprocessed and processed conditions were similar. The detrimental effect of accent observed here was minimal, likely due to the ease of the speech materials and the amount of spectral information available. Nonetheless, these findings suggest that the presence of a foreign accent, as well as a greater degree of accentedness, negatively impacts listeners' ability to recognize spectrally-degraded speech stimuli. These results are similar to studies of accented speech-in-noise perception with both normal hearing listeners and cochlear implant users (Gordon-Salant et al., 2013; Pinet & Iverson, 2010; Strori et al., 2020; Ji et al., 2014). Speech-in-noise provides a masking effect on the speech signal, whereas vocoding distorts the spectral information in the speech signal itself. While speech-in-noise and vocoded speech are two different forms of distortion, the similarities in speech recognition performance indicate a comparable disadvantageous effect of foreign accent in both of these adverse listening conditions.

The second research question examined how contextual information in the speech stimuli impacted perception of spectrally-degraded, foreign-accented speech in native English listeners. This was revealed by the difference in recognition accuracies between R-SPIN HP and LP sentences. In the unprocessed condition, the greatest benefit of semantic cues was seen for the strongly-accented talkers; there was nominal benefit for native and slightly-accented talkers. In contrast, the processed condition revealed the greatest benefit for native and slightly-accented talkers and a minimal benefit for strongly-accented talkers. This is most likely due to ceiling and floor effects, respectively. In the unprocessed condition with native and slightly-accented talkers, the difference in recognition performance between HP and LP conditions was slight. Because the unprocessed sentences preserved the spectral information of the signal and the degree of accent was less severe, it is possible that listeners did not need to utilize semantic cues for speech recognition. In other words, the acoustic cues were clear enough that linguistic knowledge provided no additional benefit to speech recognition. As a result, semantic benefit was lower for these groups. Similarly, in the processed condition with strongly-accented talkers, the difference in recognition performance between HP and LP conditions was minimal. Due to the lack of spectral information and the more severe degree of accent, it is possible that listeners were unable to utilize semantic cues and top-down processing for speech recognition. In other words, the acoustic cues were so degraded that linguistic knowledge provided no additional benefit to speech recognition. As a result, semantic benefit was lower for this group. This finding suggests that the amount of benefit obtained from semantic cues in suboptimal listening conditions depends on the degree of foreign accent.

Although talker sex was not an original variable of interest for this study, the results showed an interesting pattern in the recognition accuracies for male and female talkers. For the majority of conditions, listeners performed better with female talkers than with male talkers. This was the case for all talkers in both conditions for HINT sentences, for slightly-accented talkers in the unprocessed condition for HP and LP R-SPIN sentences, and for all talkers in the processed condition for HP and LP R-SPIN sentences.

Only for strongly-accented talkers in the unprocessed condition for HP and LP R-SPIN sentences did listeners perform better with male talkers. There are several possible reasons for this trend. First, it is important to note that accent ratings for the strongly-accented talkers were not the same for the male and female talkers: the strongly-accented male talker received a rating of 6.1, while the strongly-accented female talker received a rating of 5.87. Although the difference in ratings is minimal, it is possible that the male talker in the strong-accent condition had a slightly more pronounced accent than the female talker. A more plausible explanation for this phenomenon is perceptual differences resulting from talker-specific characteristics. In a 1996 study of the correlation between speech intelligibility and talker-related characteristics, Bradlow et al. found that female talkers, as a group, were more intelligible than male talkers. Other characteristics, including larger fundamental frequency range, larger vowel spaces, timing, and articulation, also contributed to higher intelligibility. These talker-specific correlates of intelligibility could explain the differences in recognition accuracies for male and female talkers in the current study.

This study tested three hypotheses: (1) L1 listeners will perform more poorly when perceiving spectrally-degraded, foreign-accented speech; (2) L1 listeners will show greater difficulty in perceiving more heavily accented speech than less heavily accented speech; and (3) recognition of degraded, accented speech will interact with listeners' ability to use contextual information, and such an interaction will be revealed in recognition performance on the low- and high-predictability (LP and HP) sentences. All three hypotheses were supported by the results discussed above.

Limitations/Future Directions

While every effort was made to efficiently test the research questions and control for confounding variables, there are several limitations of this study that should be noted.

One important limitation is the sample size of this study. Twenty native English speakers were recruited to participate in the perceptual test. While this is a reasonable sample size, it is important to note that, with smaller sample sizes, there is a greater likelihood of variability and individual response bias in the results. The participants recruited for this study may not be representative of the entire population of native English-speaking adults (19-23 years) with normal hearing. In addition, all participants were students at Ohio University and living in the Athens area at the time of the study. This means that participants had the resources to attend college and live on campus during the COVID-19 pandemic. It is reasonable to infer that, had participants been chosen from a different area or outside of an academic environment, results may have differed. In future studies on L1 perception of accented speech, recruiting a larger sample of participants from different environments would be beneficial for further generalizing the results.

Another limitation of this study was the number of channels used in the vocoded condition. In studies of spectrally-degraded speech perception, the number of spectral channels can vary (e.g., 2, 4, 6, 8, or 12). The lower the number of spectral channels, the more degraded the speech signal; thus, it is probable that decreasing the number of spectral channels would result in a decline in sentence recognition performance. Ideally, many different channel conditions would be used to test this hypothesis. In the current study, however, sentences were only processed with a 6-channel vocoder due to the quantity of sentences available, specifically for the R-SPIN materials. Four Mandarin-accented speakers recorded all R-SPIN sentences. In order to present all speakers to the native English listeners in both the processed and unprocessed conditions, all eight R-SPIN lists were required (4 speakers x 2 conditions = 8 sentence lists). If another spectral channel condition were tested, four more sentence lists would be needed (4 speakers x 3 conditions = 12 sentence lists). As a result, only one channel condition was used. Results revealed a ceiling effect for HINT sentences in the 6-channel vocoded condition, and additional ceiling and floor effects of performance were observed for R-SPIN sentences in the 6-channel vocoded condition. With this in mind, it would be beneficial for future studies to test a variety of spectral channel numbers to better elucidate the upper and lower limits of perception.

Another future direction of this research is investigating perception of L2 listeners in response to spectrally-degraded, foreign-accented speech. It is probable that native and non-native listeners utilize different perceptual strategies in response to speech with acoustic-phonetic deviations. Repeating the current study with L2 listeners would offer better insight into these differing strategies as well as the effects of language background on sentence recognition.

Although the current study utilized both male and female talkers, the original aim was not to investigate the effects of talker sex on speech perception. Results of the current study suggest that talker sex may play a role in recognition performance; however, due to the small sample size and lack of consistent results for talker sex across conditions, it is difficult to draw any concrete conclusions. It may be beneficial for future studies to further examine the effect of talker sex on perception of spectrally-degraded, foreign-accented speech.

Conclusions

This study investigated the perception of spectrally-degraded, Mandarin-accented speech in native English listeners. Twenty-four native Mandarin speakers and two native English speakers recorded a reading of the Rainbow Passage in English. A sample of each recording was played for 105 Ohio University students who rated the accents on a scale of 1 to 9. Based on this rating task, two speakers with a slight accent and two speakers with a strong accent were chosen to record both the HINT and R-SPIN sentence lists. The R-SPIN sentence lists consisted of HP and LP sentences, which were used to assess the effect of contextual information on speech recognition by native English listeners. Sentences were presented to twenty native English listeners in both the original condition and a 6-channel vocoder-processed condition. Recognition accuracies were scored based on percent correct.

Results revealed that the combination of accented speech and vocoder processing was detrimental to speech recognition in native English listeners. Additionally, data from the R-SPIN sentence lists suggest that the presence of accent limited listeners' ability to utilize contextual information. These adverse effects increased as the degree of accent increased. Results also suggested a link between talker sex and recognition performance; recognition accuracies were higher overall for the noise-vocoded, foreign-accented speech produced by female talkers. Talker sex, however, was not an original variable of study, and further study is needed to draw any solid conclusions on talker characteristics.

Altogether, these results provide evidence for the perceptual strategies of L1 listeners in response to spectrally-degraded, foreign-accented speech. This has implications for both the field of speech perception and our knowledge of cochlear implant users, as the signal processing in cochlear implants is similar to vocoder processing.

Annotated Bibliography

Anderson-Hsieh, J., Johnson, R., & Koehler, K. (1992). The relationship between native speaker judgments of non-native pronunciation and deviance in segmentals, prosody, and syllable structure. Language Learning, 42, 529–555.

This study looked at the relationship between L2 pronunciation and acoustic-phonetic deviations in speech. Sixty oral tapes from the SPEAK test were used to assess pronunciation and non-native deviations in eleven different language groups. The tapes were first rated on their pronunciation and then analyzed for phonological errors. It was found that deviations in segmentals, prosody, and syllable structure all impacted pronunciation ratings, but prosody showed the largest effect.

Austermann, A., & Yamada, S. (2009). Learning to Understand Expressions of Approval and Disapproval through Game-Based Training Tasks. Advances in Human-Robot Interaction. doi:10.5772/6840

This paper described how user feedback could be utilized in human-robot interactions. Speech, rhythm, and touch feedback were all used as teaching mechanisms for the robot. One limitation of this method is that the robot must complete a training session with every person it learns from. In other words, the robot is unable to generalize feedback. This paper also offered information on speech and audiovisual perception in humans and gave an overview of top-down and bottom-up processing.

Bradlow, A. R., Torretta, G. M., & Pisoni, D. B. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20(3), 255.

This study investigated global and fine-grained acoustic-phonetic talker characteristics in relation to intelligibility. Intelligibility scores were collected for 20 different talkers and a total of 2,000 sentences. It was found that gender, fundamental frequency range, larger vowel space, and timing characteristics contributed to intelligibility. Conversely, fundamental frequency mean and speaking rate were not correlated with intelligibility.

Chen, F., Zheng, D., & Tsao, Y. (2017). Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language. The Journal of the Acoustical Society of America, 142(3), 1157. https://doi.org/10.1121/1.5000164

This paper described two experiments that evaluated how noise suppression and envelope compression affect intelligibility of both tone-vocoded and noise-vocoded speech. The first experiment tested the effects of noise suppression, and the second experiment tested the effects of envelope compression. It was found that tone-vocoded speech yielded higher intelligibility scores for listeners of a tonal language in general. The effects of noise suppression were dependent on masking techniques, but envelope compression had a greater negative impact on noise-vocoded speech.

Clarke, C. M., & Garrett, M. F. (2004). Rapid adaptation to foreign-accented English. The Journal of the Acoustical Society of America, 116(6), 3647–3658. https://doi.org/10.1121/1.1815131

This study looked at how exposure to non-native speech affects speech perception. Three different experiments were done to examine how well native English speakers could adapt to Spanish- and Chinese-accented speech. The first experiment looked at processing efficiency changes after short-term exposure. The second experiment explored strategies for understanding difficult speech. The third experiment tested rapid adaptation with the more unfamiliar Chinese accent. It was found that native English speakers initially processed foreign-accented speech more slowly than native speech but were able to adapt within one minute.

Dornhoffer, J. R., Reddy, P., Meyer, T. A., Schvartz-Leyzac, K. C., Dubno, J. R., & McRackan, T. R. (2021). Individual Differences in Speech Recognition Changes After Cochlear Implantation. JAMA Otolaryngol Head Neck Surg, 147(3), 280–286. doi:10.1001/jamaoto.2020.5094

This article compared the speech recognition scores of cochlear implant users before and after implantation. This cross-sectional study looked at results for word recognition (CNC), sentence recognition (AzBio), and sentence recognition in noise in 323 patients. The majority of patients showed improvement in speech recognition scores, but a small subset showed no improvement. Outcomes for individual patients were highly variable.

Flege, J. E. (1993). Production and perception of a novel, second-language phonetic contrast. The Journal of the Acoustical Society of America, 93(3), 1589–1608. https://doi.org/10.1121/1.406818

This study consisted of four experiments focused on vowel duration. The first experiment measured vowel durations of /t/ and /d/ spoken by native and non-native English speakers. The second experiment asked subjects to identify the final phoneme in a monosyllabic word as /t/ or /d/. The third experiment assessed non-native speakers' use of vowel duration as a cue for the use of /t/ versus /d/. The final experiment looked at the relationship between sensory and motor systems. The results showed that non-native speakers showed deviations in vowel duration and imitation. These deviations were greatest for late learners of English.

Gilakjani, A. P., & Ahmadi, M. R. (2011). Why is pronunciation so difficult to learn? English Language Teaching, 4(3), 74–83. http://dx.doi.org/10.5539/elt.v4n3p74

This paper offers an overview of pronunciation of a second language as well as suggestions for teachers of second language learners. Researchers discuss misconceptions of pronunciation and list several factors that can influence pronunciation and accent, including motivation, exposure, attitude, instruction, and age. Lastly, they cover the best ways to enhance the learning and pronunciation of L2 students, including conversation, drilling, critical listening, and an emphasis on suprasegmental features.

Gordon-Salant, S., Yeni-Komshian, G. H., Fitzgibbons, P. J., Cohen, J. I., & Waldroup, C. (2013). Recognition of accented and unaccented speech in different maskers by younger and older listeners. The Journal of the Acoustical Society of America, 134(1), 618–627.

This study investigated speech recognition in noise of younger and older listeners in response to both unaccented and Spanish-accented speech. Three groups of listeners (young, old, and old with hearing loss) listened to IEEE sentences recorded by four male talkers. Results revealed that accent had a detrimental effect on listeners' ability to use masking release cues. Hearing sensitivity of listeners also played a role in speech segregation performance.

Hau, J. A., Holt, C. M., Finch, S., & Dowell, R. C. (2020). The Adaptation to Mandarin-Accented English by Older, Hearing-Impaired Listeners Following Brief Exposure to the Accent. Journal of Speech, Language & Hearing Research, 63(3), 858–871. https://doi-org.proxy.library.ohio.edu/10.1044/2019_JSLHR-19-00136

The purpose of this study was to compare the language processing of foreign-accented speech between older listeners with and without hearing impairments and younger listeners with normal hearing. Three groups of participants were recruited for the study and assigned to either an experimental group, which received training from Mandarin-accented talkers, or a control group, which received training from Australian English talkers. After training, all participants listened to sentences from a novel Mandarin-accented speaker and were scored on speech recognition. Results showed that all listeners in the experimental group had higher speech recognition than the control group. In other words, they were able to adapt to Mandarin-accented speech after training with Mandarin-accented talkers. There was no difference in the degree of adaptation between participant groups.

Hawkins, J. E. (2020, October 29). Human ear. Britannica. https://www.britannica.com/science/ear

This article provided a diagram of the normal hearing mechanism.

Jesney, K. (2004). The use of global foreign accent rating in studies of L2 acquisition. Language Research Centre, University of Calgary.

This report summarizes the current literature on perception of foreign-accented speech as well as the objective characteristics of foreign accent. The report reviews the methodology of accent rating scales, the findings of accentedness studies, and details of several relevant studies. Several topics, including rating scales, listener-based variables, and talker-based variables, are examined in depth.

Ji, C., Galvin, J. J., Chang, Y. P., Xu, A., & Fu, Q. J. (2014). Perception of speech produced by native and nonnative talkers by listeners with normal hearing and listeners with cochlear implants. Journal of Speech, Language, and Hearing Research, 57(2), 532–554.

This study investigated foreign-accented sentence recognition in noise of both normal hearing listeners and cochlear implant users. The speech stimuli were HINT sentences produced by native English talkers as well as native Spanish talkers who spoke English as a second language. Results revealed that foreign accent had a greater detrimental effect on sentence recognition for cochlear implant users than normal hearing subjects.

Kraut, R., & Wulff, S. (2013). Foreign-Accented Speech Perception Ratings: A Multifactorial Case Study. Journal of Multilingual and Multicultural Development, 34(3), 249–263. https://doi-org.proxy.library.ohio.edu/10.1080/01434632.2013.767340

The purpose of this study was to examine the factors that affect foreign-accented speech perception ratings. Seventy-eight native English speakers rated the speech of twenty-four international students of various backgrounds and English fluency. Results showed that ratings of accent, comprehensibility, and communicative ability were impacted by the speaker's sex, first language, and proficiency level as well as the listener's experience with foreign-accented speech.

Lidestam, B., Holgersson, J., & Moradi, S. (2014). Comparison of informational vs. energetic masking effects on speechreading performance. Frontiers in Psychology, 5, 639. https://doi-org.proxy.library.ohio.edu/10.3389/fpsyg.2014.00639

This study compared the effects of energetic masking and informational masking on visual-only speechreading performance. Twenty-three participants were asked to speechread high frequency words and self-rate their level of distraction and effort in the presence of either steady-state noise or four-talker babble. Steady-state noise had no effect on speechreading, but four-talker babble was detrimental to speechreading accuracy. This highlights the detriment of informational masking in combination with processing tasks.

Merriam-Webster. (n.d.). Accent. In Merriam-Webster.com dictionary. Retrieved January 4, 2020, from https://www.merriam-webster.com/dictionary/accent

This website provided the definition of accent.

Munro, M. (1998). The effects of noise on the intelligibility of foreign-accented speech. Studies in Second Language Acquisition, 20(2), 139–154. doi:10.1017/S0272263198002022

This study investigated the effect of cafeteria noise on the perception of native English and Mandarin-accented speech. Twenty-four native English speakers were presented with true and false statements spoken by ESL learners and native English speakers in quiet and noisy conditions. Listeners were required to write the sentences and indicate the truth of each statement. Results revealed that noise had a detrimental effect on intelligibility of accented sentences.

Munro, M. J., & Derwing, T. M. (1994). Evaluations of Foreign Accent in Extemporaneous and Read Material. Language Testing, 11(3), 253–266.

The purpose of this study was to determine whether the delivery of speech materials affects the perceived accentedness of a speaker with a foreign accent. The two delivery methods studied were read materials and extemporaneous materials. The results showed that there was no difference between perceived foreign accentedness and the mode of delivery. However, this does not mean the two conditions do not differ. Those in the reading condition had the advantage of focusing on form and the disadvantage of producing reading errors. Those in the extemporaneous condition had the advantage of familiar sentence structure and the disadvantage of speaking without preparation.

Munro, M., & Derwing, T. (1995). Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Language and Speech, 38, 289–306.

This study investigated the effect of accent on processing time. Twenty native English speakers listened to statements by both native and non-native speakers. Response latency was used as a measure of processing time. It was found that native English speakers took longer to process statements made by non-native speakers. However, there was no evidence to show that a stronger accent was associated with longer processing times.

Nilsson, M., Soli, S., & Sullivan, J. (1994). Development of the Hearing In Noise Test for the Measurement of Speech Reception Thresholds in Quiet and in Noise. The Journal of the Acoustical Society of America, 95(2), 1085. doi:10.1121/1.408469

This journal article describes the development of the HINT sentence set. It describes the content of the sentences as well as the results of testing for reliability and practicality. Confidence intervals for speech recognition thresholds of HINT sentences are established in quiet, noise, and low pass filter conditions.

Niparko, J. K. (2000). Cochlear implants: Principles and practice. Lippincott Williams and Wilkins.

This text is focused on the basic principles of cochlear implant technology and the outcomes for cochlear implant users. A wide variety of material and research is covered, and there are seven sections that include information on hearing loss, candidacy, surgical procedures, outcomes, language acquisition, and ethical considerations.

Pinet, M., & Iverson, P. (2010). Talker-listener accent interactions in speech-in-noise recognition: Effects of prosodic manipulation as a function of language experience. The Journal of the Acoustical Society of America, 128(3), 1357–1365.

This study looked at how prosody contributed to the recognition of native and non-native speech in noise. Three groups (native English listeners, inexperienced native French listeners, and experienced native French listeners) listened to BKB sentences with varied pitch and segment durations. Results revealed that listeners were more adept at recognizing speech with the prosody of their native language. Individual results varied for subjects who had more experience with the non-native language, indicating that language experience may affect the interaction between talker accent and listener accent.

Pujol, R. (2020, February 15). Auditory Brain. Cochlea. http://www.cochlea.eu/en/auditory-brain

This article described the primary and non-primary auditory pathways in the brain.

Raman, G., Lee, J., & Chung, M. (2011). Effectiveness of Cochlear Implants in Adults with Sensorineural Hearing Loss. Rockville, MD: Agency for Healthcare Research and Quality (US).

This report contained a table describing different sentence lists used for speech perception testing. The table included sentence category (speech perception in quiet or noise), description of the sentences, and scoring information. The HINT sentences were described in this table.

Shannon, R. V., Fu, Q.-J., & Galvin III, J. (2004). The Number of Spectral Channels Required for Speech Recognition Depends on the Difficulty of the Listening Situation. Acta Oto-Laryngologica (Supplement), 124, 50–54.

It is well established that a limited amount of spectral information can result in impressive speech recognition outcomes in optimal conditions. It has also been found that, as speech materials and listening conditions become more difficult, more spectral information is required. This paper aimed to quantify the association between the difficulty of conditions and speech recognition. A sigmoid function was used to describe this relationship.

Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.

This study looked at speech recognition scores in conditions where only temporal cues were available. Noise vocoding was used to remove the majority of spectral detail from a sound signal. Participants were asked to listen to noise-vocoded consonants, vowels, and simple sentences. The results of this study showed high speech recognition scores with only three bands of modulated noise. This has implications for auditory prostheses, which rely primarily on temporal cues for transmission of sound signals.

Shuai, L., & Gong, T. (2014). Temporal Relation between Top-down and Bottom-up Processing in Lexical Tone Perception. Frontiers in Behavioral Neuroscience, 8, 97. doi:10.3389/fnbeh.2014.00097

Researchers conducted a series of experiments involving event-related potentials (ERPs) to track the use of top-down and bottom-up processing in lexical tone perception. Mandarin participants completed three experimental tasks while their ERP patterns were monitored. It was found that both types of processing were necessary for lexical tone perception and, in contrast with previous research, occurred simultaneously both early and late in processing.

Shull, T. (2014, May 20). Rainbow Passage. DailyCues. dailycues.com/learn/iqpedia/pages/rainbow-passage/

This website provides information on the author and intended purpose of the Rainbow Passage. It also provides a full text copy of the Rainbow Passage.

Southwood, M. H., & Flege, J. E. (1999). Scaling foreign accent: direct magnitude estimation versus interval scaling. Clinical Linguistics & Phonetics, 13(5), 335–349. https://doi-org.proxy.library.ohio.edu/10.1080/026992099299013

This study investigated whether or not linear partitioning is suitable for foreign accent rating scales. Two groups of native English listeners were asked to rate Italian foreign accents on either a seven-point interval scale or by direct magnitude estimation. Results showed that linear partitioning is suitable for foreign accent ratings. However, a ceiling effect was observed for the seven-point scale, and response bias was observed.

Strori, D., Bradlow, A., & Souza, P. (2020). Recognition of foreign-accented speech in noise: The interplay between talker intelligibility and linguistic structure. The Journal of the Acoustical Society of America, 147(6), 3765–3782. doi:10.1121/10.0001194

This study includes three experiments involving the recognition of foreign-accented speech in noise. The first experiment examined the relationship between talker intelligibility and sentence structure in noise. The second experiment attempted to generalize results from the first experiment. The third experiment looked at intelligibility and linguistic structure in the absence of pitch cues. The results of this study showed that increased sentence complexity decreased speech recognition for accented speech with high intelligibility. There was no effect of sentence complexity for accented speech with low intelligibility. This indicates that different processing strategies may be used for accented speech with high and low intelligibility.

Urbas, J. V. (2019). Speech Perception. Salem Press Encyclopedia of Health.

This article gives an overview of speech perception and the complex phenomena related to the processing of speech sounds. It offers a description of speech components, including an in-depth description of the distinctions between consonants and vowels. The article also describes theoretical approaches to explaining perception and related topics, such as perceptual constancy.

Waddington, E., Jaekel, B. N., Tinnemore, A. R., Gordon-Salant, S., & Goupell, M. J. (2020). Recognition of Accented Speech by Cochlear-Implant Listeners. Ear and Hearing, 41(5), 1236–1250. doi:10.1097/AUD.0000000000000842

This article investigated the audiovisual (AV) benefit for normal hearing individuals and cochlear implant users when presented with both unaccented and accented speech. The sound for normal hearing listeners was processed using a real-time vocoder, and the cochlear implant listeners used their own sound processors. Both groups of participants showed similar AV benefit for both unaccented and accented speech conditions. However, older age of CI listeners was related to decreased performance with accented speech.

Wilson, R., McArdle, R., Watts, K., & Smith, S. (2012). The Revised Speech Perception in Noise Test (R-SPIN) in a multiple signal-to-noise ratio paradigm. Journal of the American Academy of Audiology, 23(8), 590–605. doi:10.3766/jaaa.23.7.9

The purpose of this study was to revise the original R-SPIN sentences to be presented at different signal-to-noise ratios (SNRs). The R-SPIN sentences were digitally manipulated to different SNRs and tested on both participants with normal hearing and those with hearing loss. The results of the study support the practicality of R-SPIN at multiple SNRs but encourage more research on the use of R-SPIN testing for those with hearing loss.

Xu, L., & Pfingst, B. E. (2008). Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research, 242, 132–140.

The goal of this article was to identify the interaction between spectral and temporal cues in speech recognition in both quiet and noise. Researchers analyzed a series of studies that focused on phoneme and lexical tone recognition in various conditions using vocoder-processed sound and normal hearing participants. The results showed that both spectral and temporal information are important for speech recognition in quiet and noise, and there appears to be a tradeoff in the amount of spectral and temporal information required.

Xu, L., Thompson, C. S., & Pfingst, B. E. (2005). Relative contributions of spectral and temporal cues for phoneme recognition. Journal of the Acoustical Society of America, 117, 3255–3267.

The goal of this study was to examine the importance of both spectral and temporal cues in the perception of American English consonants and vowels. Seven native English speakers were asked to identify consonants and vowels that had been manipulated using a noise vocoding technique. The results showed that both spectral and temporal information are important for consonant and vowel recognition, and there appears to be a tradeoff in the amount of spectral and temporal information required.

Yang, J., Wagner, A., Zhang, Y., & Xu, L. (2021). English phoneme recognition of vocoded speech by Mandarin-speaking English-learners. Speech Communication, in revision.

Results from this study were utilized as a means of comparison for the current study. More specifically, recognition accuracies of native-English male talkers were used for comparison of native and foreign-accented speech perception outcomes.

Young, N. M., & Kirk, K. I. (2016). Pediatric cochlear implantation: Learning and the brain (2nd ed.). Springer.

This text focuses on the implications of cochlear implantation in the pediatric population. A wide variety of material and research is covered, and there are six sections that cover information on cochlear implant design, clinical management, outcomes in children, special populations, maximizing learning outcomes, and educational management.