Development of speech perception and spectro-temporal modulation processing : behavioral studies in infants Laurianne Cabrera

To cite this version:

Laurianne Cabrera. Development of speech perception and spectro-temporal modulation processing: behavioral studies in infants. Psychology. Université René Descartes - Paris V, 2013. English. NNT: 2013PA05H112. tel-01394244.

HAL Id: tel-01394244 https://tel.archives-ouvertes.fr/tel-01394244 Submitted on 9 Nov 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Université Paris Descartes
Ecole Doctorale 261 Cognition, Comportements, Conduites Humaines

DEVELOPPEMENT DE LA PERCEPTION DE LA PAROLE ET DU TRAITEMENT AUDITIF DES MODULATIONS SPECTRO-TEMPORELLES : ETUDES COMPORTEMENTALES CHEZ LE NOURRISSON

Development of speech perception and spectro-temporal modulation processing: Behavioral studies in infants

Thèse de Doctorat

présentée par Laurianne Cabrera

Laboratoire de Psychologie de la Perception, Université Paris Descartes – CNRS UMR 8158

Discipline : Psychologie

Thèse présentée et soutenue le 22 novembre 2013 devant un jury composé de :

Lynne Werner, Professeur, Rapporteur
Sven Mattys, Professeur, Rapporteur
Pascale Colé, Professeur, Examinateur
Carolyn Granier-Deferre, MDU-HDR, Examinateur
Li Xu, Professeur, Examinateur
Josiane Bertoncini, Chargé de Recherche, Co-directeur de thèse
Christian Lorenzi, Professeur, Co-directeur de thèse

To my parents


Acknowledgments - Remerciements

First of all, I wish to thank my two advisors, Josiane and Christian, for their trust, their attention, and above all their time. Thank you both for trying, each in your own way, to discourage me from doing research at our very first meeting. Those remarks in fact strengthened my desire to collaborate with you and my motivation to work on this original topic. I can never thank you enough for initiating such a stimulating project and for passing on your knowledge, each of you with so much philosophy. Josiane, thank you for those moments spent watching the Eiffel Tower sparkle at sunset from your office, and for all the hours spent discussing our research and everything else. Christian, thank you for the fried eggs early in the morning, the grammar lessons, and for taking an interest in early development. I hope I have become, at least in part, the "hybrid" you were hoping for; I have learned more than I can say thanks to you both.

I would also like to warmly thank Pr. Feng Ming Tsao for welcoming me in Taiwan. Thank you for your helpful comments and your constant availability. I also thank Ni Zi, Li Yang, You Shin, Lorin, Naishin and the other students for their welcome, for waiting on me hand and foot for three months, and for all their help with this language that sounded like "Chinese" to me!

I am also truly grateful to my thesis committee: Lynne Werner, Sven Mattys, Pascale Colé, Carolyn Granier-Deferre and Li Xu. I hope you will be satisfied when reading my work, and will forgive my "French writing style".

Next, I would like to thank everyone at the Laboratoire de Psychologie de la Perception, and in particular the speech team and the audition team. Thanks to Thierry, Judit, Ranka and Scania for their help and invaluable comments. A very big thank you to Josette, who gave me all the "keys" to the baby lab, and also to Viviane and Sylvie. Thanks to Henny, Anjali and Judit for their careful proofreading of my thesis, papers and abstracts, and for our scientific discussions (and the less scientific ones too). Thanks to Agnès, Dan, Tim, Trevor, Victor, Jonhatan, Daniel and Romain for their help and the relaxing moments at conferences. Thanks to everyone else for the lunches and Friday drinks, in particular Silvana, Arnaud, Cédric, Andrea, Arielle… (that is, the fourth floor). Finally, thanks to everyone who passed through office 608: the old guard, Louise, Nayeli, Bahia, Marion, Aurélie, and above all the diehards: Camillia, Carline, Lauriane, Louah, Léo, Nawal. I hope I will always remember our escapades, sweet little notes, laughs and "breaks".

I am deeply grateful to all the children and parents who took part in my studies, in Paris as well as in Taipei, and who taught me to explain my research simply. Thanks also to the young adults who participated in these studies, and especially to all my friends who, without quite understanding the sounds they heard, helped me greatly. Special thanks to ESPCI (among others, Léopold, François, Lucie, Hugo, Benjamin, David, Paul, Max), and finally to Anaïs, Martin, Solène and Sophie.

Finally, thanks to my parents, my brother and my aunt, who supported me during my "move up to the capital"; thank you for never doubting my choices and for nurturing my curiosity.

Lastly, thanks to Axel, who stood in the shadow of every moment of this thesis, from the doctoral-school defense to the thesis defense itself. Thank you for re-motivating me in the face of certain statistics (>.05) and for getting enthusiastic about others (<.05). Thanks for all the help with Matlab and other software. Thank you for simply being there and for discussing my research so much.


Abstract - Résumé

The goal of this doctoral research was to characterize the auditory processing of the spectro-temporal cues involved in speech perception during development. The ability to discriminate phonetic contrasts was evaluated in 6- and 10-month-old infants using two behavioral methods. The speech sounds were processed by "vocoders" designed to selectively reduce the spectro-temporal modulation content of the phonetically contrasting stimuli. The first three studies showed that fine spectro-temporal modulation cues (frequency-modulation cues and spectral details) are not required for the discrimination of voicing and place of articulation in French-learning 6-month-old infants. Like French adults, 6-month-old infants can discriminate these phonetic features on the sole basis of the slowest amplitude-modulation cues. The last two studies revealed that fine modulation cues are required for lexical-tone (pitch variations related to the meaning of monosyllabic words) discrimination in French- and Mandarin-learning 6-month-old infants. Furthermore, the results showed the influence of linguistic experience on the perceptual weight of these modulation cues in both young adults and 10-month-old infants learning either French or Mandarin. This doctoral research showed that the spectro-temporal auditory mechanisms involved in speech perception are efficient at 6 months of age, but are influenced by the linguistic environment during the following months. Finally, the present research discusses the implications of these findings for cochlear implantation in profoundly deaf infants, who only have access to impoverished speech modulation cues.



Table of Contents

ACKNOWLEDGMENTS - REMERCIEMENTS

ABSTRACT - RÉSUMÉ

ABBREVIATIONS

FOREWORD

CHAPTER 1. The mechanisms of speech perception

I. Speech perception
1.1 Are we equipped with specialized speech-processing mechanisms?
1.1.1 The search for invariants
1.1.2 Speech-specific perceptual phenomena
1.2 Investigation of infants' speech perception
1.2.1 Are young infants able to perceive phonemic differences?
1.2.2 Infants' perception as a model for the investigation of specialized speech mechanisms
1.3 Discussion and conclusions

II. Developmental psycholinguistics
2.1 Exploration of infants' abilities to discriminate speech acoustic cues
2.1.1 Perception of prosody and low-pass filtered speech signals
2.1.2 Perception of masked speech
2.2 The perceptual reorganization of speech
2.2.1 Vowel perception
2.2.2 Native and non-native consonant perception
2.2.3 Lexical-tone perception
2.3 Discussion and conclusions

III. Auditory development
3.1 Development of frequency selectivity and discrimination
3.1.1 Frequency selectivity/resolution
3.1.2 Frequency discrimination
3.2 Development of temporal resolution
3.2.1 Processing of temporal envelope
3.2.2 Processing of temporal fine structure
3.3 Discussion and conclusions

IV. Speech as a spectro-temporally modulated signal
4.1 Speech as a modulated signal
4.2 The perception of speech modulation cues
4.2.1 Perception of speech modulation cues by adults
4.2.2 Perception of speech modulation cues by children
4.3 Speech perception with cochlear implants
4.4 Discussion and conclusions

V. Objectives of the present doctoral research

References

CHAPTER 2. Discrimination of voicing on the basis of AM cues in French 6-month-old infants (head-turn preference procedure)
1. Introduction: Six-month-old infants discriminate voicing on the basis of temporal-envelope cues
2. Article: Bertoncini, Nazzi, Cabrera & Lorenzi (2011)

CHAPTER 3. Discrimination of voicing on the basis of AM cues in French 6-month-old infants: effects of frequency and temporal resolution (head-turn preference procedure)
1. Introduction: Perception of speech modulation cues by 6-month-old infants
2. Article: Cabrera, Bertoncini & Lorenzi (2013)

CHAPTER 4. Discrimination of voicing and place of articulation on the basis of AM cues in French 6-month-old infants (visual habituation procedure)
1. Introduction: Infants discriminate voicing and place of articulation with reduced spectral and temporal modulation cues
2. Article: Cabrera, Lorenzi & Bertoncini (in preparation)

CHAPTER 5. Discrimination of lexical tones on the basis of AM cues in 6- and 10-month-old infants: influence of lexical-tone expertise (visual habituation procedure)
1. Introduction: Linguistic experience shapes the perception of spectro-temporal fine structure cues (adult data)
2. Article: Cabrera, Tsao, Gnansia, Bertoncini & Lorenzi (submitted)
3. Introduction: The perception of speech modulation cues is guided by early language-specific experience (infant data)
4. Article: Cabrera, Tsao, Hu, Li, Lorenzi & Bertoncini (in preparation)

GENERAL DISCUSSION
1. Fine spectro-temporal details are not required for accurate phonetic discrimination in French-learning infants
1.1 Implications for speech perception and development
1.2 Implications for auditory perception in infants and adults
2. Linguistic experience changes the weight of temporal envelope and spectro-temporal fine structure cues in phonetic discrimination
2.1 Implications for auditory perception in infants and adults
2.2 Implications for speech perception and development
3. Behavioral methods used to assess discrimination of vocoded syllables in infants
4. Implications for pediatric cochlear implantation
4.1 Simulation of electrical hearing for French contrasts
4.2 Simulation of electrical hearing for lexical-tone contrasts
5. Conclusions

References


Abbreviations

ABR: Auditory-Brainstem Responses
ALT: Alternating (sequences)
AM: Amplitude Modulation
ANOVA: Analysis Of Variance
CI: Cochlear Implant
CV: Consonant-Vowel
E: Temporal Envelope
ERB: Equivalent Rectangular Bandwidth
F0: Fundamental Frequency
FM: Frequency Modulation
HPP: Head-turn Preference Procedure
ISI: Inter-Stimulus Interval
LT: Looking Time
NH: Normal Hearing
REP: Repeated (sequences)
RMS: Root-Mean-Square
SD: Standard Deviation
SNR: Signal-to-Noise Ratio
TFS: Temporal Fine Structure
VCV: Vowel-Consonant-Vowel
VH: Visual Habituation
VOT: Voice Onset Time


Foreword

The present research program aims to study the basic auditory mechanisms involved in speech perception during infancy. Speech perception abilities have been explored extensively in young infants, and fundamental knowledge on speech acquisition has been gained in this domain over the last decades (see Kuhl, 2004 for a review). For instance, we now know that, during the first year of life, perceptual mechanisms evolve under the influence of the environmental language. Surprisingly, the exploration of the mechanisms specialized for speech perception and their development has relegated auditory mechanisms to a position of secondary importance. Nonetheless, the ability to perceive speech sounds depends on the auditory system and its development.

Recently, psychoacoustic studies conducted with adult listeners have offered a novel description of the auditory mechanisms directly linked to the processing of speech signals (see Shamma & Lorenzi, 2013 for a review). This description emphasizes the role of spectro-temporal modulations in speech perception, in quiet and in adverse listening conditions. From a clinical perspective, these studies have also motivated the development of novel signal-processing strategies, as well as diagnostic and evaluation tools, for people suffering from cochlear damage (e.g., Shannon, 2012) and for deaf adults and infants with a cochlear implant (CI), a rehabilitation device conveying the coarse amplitude-modulation cues present in the naturally produced speech signal.

This PhD dissertation focuses on normal-hearing infants and young adults to further explore the influence of auditory processes on the development of speech perception, using this novel description emphasizing the role of modulation in speech perception. The general aim is to explore the contribution of the basic auditory mechanisms that process modulation cues (known to be crucial for adults) to speech discrimination in infants during the first year of life.
Here, signal-processing algorithms called "vocoders" have been used to selectively alter the modulation content of French and Mandarin speech signals. The behavioral experiments conducted with vocoded signals should therefore improve our knowledge about the nature and role of the low-level, spectro-temporal auditory mechanisms involved in the development of speech perception. Moreover, some experimental conditions have been designed to simulate, for normal-hearing listeners, the reception of the speech signal transmitted by current CI processors. The results of these simulations of speech processing by CI processors in normal-hearing infants should indicate to what extent certain acoustic cues (i.e., spectral and temporal modulation cues) are required for the normal development of speech-perception mechanisms.
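The general idea behind such vocoders can be illustrated with a minimal sketch. The function below is a hypothetical toy example, not the processing chain actually used in the studies reported here (which is specified in each article): it splits a signal into log-spaced frequency bands, keeps only each band's slow amplitude envelope, and reimposes that envelope on band-limited noise carriers, thereby discarding frequency-modulation and fine-structure cues.

```python
import numpy as np

def _envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    F = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(F * h))

def noise_vocoder(x, fs, n_bands=8, env_cutoff=16.0):
    """Crude noise vocoder: split x into log-spaced bands (FFT masks),
    extract each band's amplitude envelope, low-pass it to keep only slow
    AM cues, and use it to modulate band-limited noise carriers."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    edges = np.geomspace(80.0, 0.45 * fs, n_bands + 1)  # arbitrary band edges
    spec = np.fft.rfft(x)
    rng = np.random.default_rng(0)
    noise_spec = np.fft.rfft(rng.standard_normal(n))
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spec * mask, n)
        env = _envelope(band)
        # Low-pass the envelope: keep only modulations below env_cutoff Hz
        es = np.fft.rfft(env)
        es[freqs > env_cutoff] = 0.0
        env = np.clip(np.fft.irfft(es, n), 0.0, None)
        # Reimpose the envelope on a noise carrier restricted to the same band
        carrier = np.fft.irfft(noise_spec * mask, n)
        out += env * carrier
    return out
```

Lowering `env_cutoff` or `n_bands` degrades the signal further, which is the kind of parametric manipulation that lets one ask which modulation cues are necessary for a given phonetic contrast.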

The first chapter of this dissertation presents a general review of the mechanisms involved in adult and infant speech perception. This chapter also reviews current knowledge about auditory development, which has mainly been assessed using non-linguistic sounds, and then presents the most recent findings on the perception of speech modulation cues in adult listeners. The second and third chapters present two published articles on the perception of speech modulation cues in French-learning 6-month-old infants. These studies (using the head-turn preference procedure) show that young infants are able to rely on the slowest modulations in amplitude over time to discriminate a French voicing contrast (/aba/ versus /apa/). The fourth chapter is an article in preparation that extends the results obtained with 6-month-old infants using a different behavioral method and procedure (visual habituation) to test discrimination. The results indicate that French-learning 6-month-old infants can use the slow modulations in amplitude over time, but may require the fast amplitude modulations to discriminate some phonetic contrasts. The fifth and last chapter presents two articles (submitted and in preparation) showing the role of linguistic experience in the use of amplitude- and frequency-modulation cues in French- and Mandarin-learning infants of 6 and 10 months of age, and in native French- and Mandarin-speaking adults. The results reveal that Mandarin and French listeners weigh amplitude- and frequency-modulation cues differently in various acoustically degraded conditions. This bias reflects the impact of language exposure and seems to emerge between 6 and 10 months of age.


Chapter 1. The mechanisms of speech perception


Speech is a complex acoustic signal. A normally functioning auditory system is able to analyze and organize this complex signal as a sequence of linguistic units. The details of this process are not yet fully understood. For example, we still do not know which acoustic information is necessary (and sufficient) to correctly perceive speech signals, or how this process develops in humans (see Saffran, Werker & Werner, 2006). Psycholinguists first studied infants as a "model" to investigate the initial (innate) state of the speech abilities that could explain the specific speech-processing mode discovered in adult listeners (Eimas, Siqueland, Jusczyk, & Vigorito, 1971). Speech perception studies in infants have since produced an abundant literature that describes infants' early abilities to perceive speech sounds, and details perceptual development under the pressure of the linguistic environment (see Kuhl, 2004). However, speech perception mechanisms are continuously constrained by auditory mechanisms, which have mostly been assessed with non-linguistic stimuli (for reviews, see Burnham & Mattock, 2010; Saffran et al., 2006). Nevertheless, the acoustic characteristics of non-linguistic stimuli (i.e., noise, tones) differ from those of speech signals. Thus, the development of the basic auditory mechanisms contributing to speech perception has been only partly assessed.


I. Speech perception

The accurate transmission of the speech signal in diverse and challenging situations motivated the first explorations of speech perception mechanisms in humans (e.g., Licklider, 1952; Miller & Nicely, 1955). How can a speech signal be clearly conveyed from one speaker to a listener? How can we build an artificial system able to recognize spoken instructions? These general questions, related to the efficiency of speech recognition, have been addressed by exploring the correspondence between acoustic cues and phonetic variation. This investigation was audacious given the variability in human voices and in tokens from a single speech category, and the outcome of these studies is still debated (see Remez, 2008). Nevertheless, the search for acoustic invariants in speech perception has inspired the general aim of the present work. Sections 1.1 and 1.2 provide a review of the pioneering research that: 1) led to the discovery of phenomena first thought to be specific to humans, and 2) started the exploration of speech perception in infants, i.e., human beings who have not yet acquired a linguistic system, and consequently are only beginning to be influenced by their linguistic environment.

1.1 Are we equipped with specialized speech-processing mechanisms?

1.1.1 The search for invariants

In 1951, Cooper, Liberman and Borst proposed to use the visual display of speech sounds obtained with spectrograms to identify the crucial acoustic features of speech. They examined systematically the spectrograms of the same sounds pronounced by different speakers in different contexts. To corroborate their observations, they created a tool (called “pattern playback”) reconverting spectrograms into sounds. On the basis of the spectrograms, Cooper et al. (1951) were able to synthesize new sounds by changing some acoustic parameters, and tested the effect of these acoustic changes on the perception of adult listeners who had to identify or discriminate these synthetic speech stimuli.
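As a rough analogy to the pattern-playback idea (not a reconstruction of the original device), one can resynthesize sound from a spectrogram-like time-frequency grid by treating each column as a set of sinusoid amplitudes. The frequencies and frame duration below are arbitrary illustrative values.

```python
import numpy as np

def playback(mags, freqs, frame_dur, fs=16000):
    """Toy 'pattern playback': sum one phase-continuous sinusoid per
    spectrogram row, stepping the amplitudes frame by frame.
    mags: (n_freqs, n_frames) magnitudes; freqs: (n_freqs,) in Hz."""
    n = int(round(frame_dur * fs))   # samples per frame
    t = np.arange(n) / fs
    phases = np.zeros(len(freqs))    # carry phase across frame boundaries
    frames = []
    for col in mags.T:               # one column = one time frame
        frame = np.sum(
            col[:, None] * np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None]),
            axis=0)
        # Advance each sinusoid's phase by the frame duration
        phases = (phases + 2 * np.pi * freqs * n / fs) % (2 * np.pi)
        frames.append(frame)
    return np.concatenate(frames)
```

For example, a steady three-"formant" pattern at 700, 1200 and 2600 Hz over a series of 10-ms frames yields a crude, static vowel-like buzz; painting different trajectories into `mags` is the digital analogue of redrawing a spectrogram and listening to the result.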

14

Several additional experiments were then conducted with normal-hearing adults to determine the contribution of these acoustic variables to the perception of speech (e.g., Cooper, Delattre, Liberman, Borst, & Gerstman, 1952; Delattre, Liberman, & Cooper, 1955). For example, the position of the noise burst corresponding to the articulatory explosion was shown to differentiate the stop consonants (/p/, /t/, /k/). However, the authors also observed that the nature of the following vowel produced variations in the perceptual characteristics of the consonant. For example, changes in the transition of the first formant were shown to influence the perception of voicing in stop consonants, and changes in the transition of the second formant were shown to enable listeners to distinguish place of articulation (e.g., Liberman, Delattre, Cooper, & Gerstman, 1954). In sum, these studies made clear that a single set of acoustic features does not correspond systematically to a particular phonetic segment (see Rosen, 1992).

Stevens and Blumstein (1978) also searched for some invariant acoustic properties that could reliably distinguish between consonants independently of the surrounding phonetic context. They suggested that the auditory system extracts spectral energy at the stimulus onset for a stop consonant. They found that spectral properties at the consonantal release determine the distinction between certain places of articulation (among stop consonants). Furthermore, they pointed out that the trajectory of the formant transition may not be the main acoustic cue indicating place of articulation. Rather, other acoustic properties such as the rapidity of spectrum changes, voicing, and the abruptness of amplitude changes or periodicity were shown to play a role in the perception of phonetic categories (e.g., Stevens, 1980; Remez & Rubin, 1990).
These investigations of acoustic invariants in speech have led to the finding that several combinations of acoustic cues correlate with phonetic categories, and that no single cue is either necessary or sufficient to distinguish and identify phonetic categories.

1.1.2 Speech-specific perceptual phenomena

Some authors assumed that invariance is not in the signal but in the listener (Liberman & Mattingly, 1985). With synthetic sounds, it is possible to create continua of speech sounds in which the change is gradual (e.g., a change in the voice onset time [VOT], or a change in the direction of the second-formant transition). Adult listeners asked to identify and discriminate pairs of sounds along an acoustic continuum showed a specific relationship between identification and discrimination (e.g., Liberman, Harris, Hoffman, & Griffith, 1957). Adults are better at discriminating stimuli identified as belonging to two different phonemic categories (such as /b/ and /d/, or /d/ and /g/) than stimuli identified as belonging to the same category. This phenomenon has been called "categorical perception" because discrimination peaks only between categories and is at chance within categories. Continua modeling variations in formant transition, changes in place of articulation, voicing and manner of articulation are all perceived categorically by adults (Liberman, Harris, Eimas, Lisker, & Bastian, 1961; Liberman et al., 1957; Lisker & Abramson, 1970; Miyawaki et al., 1975). These pioneering studies did not report categorical perception for non-speech sounds (Liberman, Harris, Kinney, & Lane, 1961), supporting the notion that a specialized process is involved in the perception of speech sounds (Liberman, 1970; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).

Liberman and colleagues described another phenomenon, called "duplex perception", consistent with the existence of mechanisms dedicated to speech processing (Liberman, 1982; Mann & Liberman, 1983). They observed that a portion of the acoustic signal (the third-formant transition) is simultaneously integrated into a speech percept (completing an ambiguous syllable consisting of the first two formants) and a non-speech percept (that sounds like a non-speech chirp). In other words, when listeners were presented dichotically with the two sounds (i.e., the third-formant transition and the ambiguous syllable), they reported a "duplex" percept: a non-speech chirp in one ear and a complete syllable in the other.
According to Liberman and colleagues, duplex perception may reveal the existence of a specialized speech module that is different from a non-speech mode of processing (see Liberman & Mattingly, 1989). However, duplex perception was also observed for musical sounds (Hall & Pastore, 1992; Pastore, Schmuckler, Rosenblum, & Szczesiul, 1983) and for other non-speech sounds (Fowler & Rosenblum, 1990) in subsequent studies. Moreover, categorical perception of speech sounds was observed in non-human species such as chinchillas and monkeys (Kuhl & Miller, 1975, 1978; Kuhl & Padden, 1982, 1983), and was also found with complex non-speech sounds varying in time (Miller, Wier, Pastore, Kelly, & Dooling, 1976; Pisoni, 1977). Such results have therefore challenged the hypothesis of specialized mechanisms strictly devoted to speech perception.

Other findings on the perception of synthetic speech sounds were interpreted as supporting the notion of a specialized speech module. Remez and his colleagues (1981) created synthetic signals composed of three sinusoids preserving the frequency and amplitude variations of the first three formants of original speech signals. These "sine-wave speech" stimuli are perceived in two ways: as speech or as non-speech sounds. When listeners have no information about the nature of the stimuli, they do not recognize the signals as speech. However, when they are told that the stimuli are "linguistic", they are able to recognize the speech stimuli and to correctly identify words and sentences. The instructions given to the listeners are thus sufficient to engage them in a "speech mode" (Remez, Rubin, Pisoni, & Carrell, 1981). The search for acoustic invariants important to speech perception led to the belief that there exist mechanisms specialized for speech perception, but this is still under debate (e.g., Bent, Bradlow, & Wright, 2006; Leech, Holt, Devlin, & Dick, 2009).
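A sine-wave replica of the kind Remez and colleagues used can be sketched as three time-varying sinusoids whose frequencies follow the formant tracks, with phase obtained by integrating instantaneous frequency. The tracks below are made-up illustrative values, not measurements from real speech.

```python
import numpy as np

def sine_wave_speech(freq_tracks, amp_tracks, fs=16000):
    """Replace each formant with a single time-varying sinusoid.
    freq_tracks, amp_tracks: arrays of shape (n_formants, n_samples).
    Phase continuity comes from cumulatively summing the frequency track."""
    phase = 2 * np.pi * np.cumsum(freq_tracks, axis=1) / fs
    return np.sum(amp_tracks * np.sin(phase), axis=0)

# Illustrative /ba/-like pattern: rising F1 and F2 transitions into a steady vowel
fs, dur = 16000, 0.3
n = int(fs * dur)
f1 = np.concatenate([np.linspace(300, 700, n // 4), np.full(n - n // 4, 700.0)])
f2 = np.concatenate([np.linspace(900, 1200, n // 4), np.full(n - n // 4, 1200.0)])
f3 = np.full(n, 2500.0)
amps = np.tile(np.array([[1.0], [0.6], [0.3]]), (1, n))
replica = sine_wave_speech(np.stack([f1, f2, f3]), amps, fs)
```

Because the replica keeps only three slowly varying frequency trajectories, it preserves the coarse spectro-temporal pattern of the utterance while discarding the harmonic structure and broadband cues of natural speech, which is what makes its "bistable" speech/non-speech percept possible.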

1.2 Investigation of infants’ speech perception

1.2.1 Are young infants able to perceive phonemic differences?

A parallel line of research was developed to assess speech perception in newborns and young infants. Infants are individuals without prior knowledge of, or experience with, producing speech. These studies first investigated the same questions asked with adults and focused on phonetic perception (i.e., whether speech perception requires specific mechanisms). This research also led to the elaboration of different methods used to assess speech-perception abilities in infant listeners.

Pioneering work exploring speech perception in infants was conducted by Eimas et al. (1971), who tested phonetic discrimination abilities in 1- and 4-month-old infants using a "high-amplitude sucking procedure". This procedure involves measuring the rate of high-amplitude sucks produced by the babies during the presentation of a sequence of sounds. The stimuli used by Eimas et al. (1971) were synthetic versions of consonant-vowel (CV) syllables taken from a continuum varying in VOT from /ba/ to /pa/. The stimuli had different VOT values: -20, 0, +20, +40, +60 and +80 ms. Infants were familiarized with one stimulus until their sucking responses decreased. Two groups of infants were then exposed to a VOT change of 20 ms. One change corresponded to a change in category (+20 ms versus +40 ms, corresponding respectively to /ba/ and /pa/ as identified by adults). The other change did not correspond to a phonetic category difference (-20 versus 0 ms, or +60 versus +80 ms). The results showed that sucking rate increases after a between-category change but not after a within-category change. Thus, young infants seem able to perceive a minimal phonetic contrast, but are unable to discriminate two stimuli separated by the same acoustic distance within a phonetic category. In other words, 1- and 4-month-old infants demonstrate categorical perception for speech. Following this pioneering study, categorical perception has been found for a continuum of frequency variations of the second and third formants: infants discriminate two stimuli only when they belong to two different categories as identified by adults (as in /b/ versus /g/; Eimas, 1974).

Furthermore, newborns are able to discriminate the place of articulation of stop consonants and differences in vowel quality on the basis of short (34- to 44-ms-long) CV syllables containing only information about the relative onset frequency of formant transitions (Bertoncini, Bijeljac-Babic, Blumstein, & Mehler, 1987). Young infants are also able to discriminate between speech sounds from foreign languages with which they have no experience (Lasky, Syrdal-Lasky, & Klein, 1975; Streeter, 1976; Trehub, 1976). Thus, normal-hearing infants are able to discriminate small variations in speech sounds (such as VOT duration, frequency variations in the second and third formants, and the frequency at the onset of formant transitions), and they rely on syllables as a unit to segment speech sounds (Bertoncini, Floccia, Nazzi, & Mehler, 1995; Bertoncini & Mehler, 1981; Bijeljac-Babic, Bertoncini, & Mehler, 1993). Some of these findings also indicate that sounds from different phonetic categories are better discriminated.
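The logic linking labeling to discrimination in these studies can be made concrete with a classic Haskins-style model, in which discrimination performance is predicted solely from identification probabilities. The boundary location (+25 ms) and slope below are illustrative values, not fitted data.

```python
import math

def p_voiceless(vot_ms, boundary=25.0, slope=5.0):
    """Probability of labeling a stimulus as voiceless /pa/,
    modeled as a logistic function of VOT (illustrative parameters)."""
    return 1.0 / (1.0 + math.exp(-(vot_ms - boundary) / slope))

def predicted_discrimination(vot_a, vot_b):
    """Haskins-model prediction for a two-alternative discrimination task:
    performance depends only on the chance that the two stimuli receive
    different labels (0.5 = chance, 1.0 = perfect)."""
    pa, pb = p_voiceless(vot_a), p_voiceless(vot_b)
    p_different_labels = pa * (1 - pb) + pb * (1 - pa)
    return 0.5 + 0.5 * p_different_labels

# A +20 vs +40 ms pair straddles the boundary and is predicted to be easy,
# whereas -20 vs 0 ms (the same 20-ms step, within-category) stays near chance.
between = predicted_discrimination(20, 40)
within = predicted_discrimination(-20, 0)
```

Under these assumptions the model reproduces the qualitative pattern reported by Eimas et al. (1971): a discrimination peak for the pair crossing the /ba/-/pa/ boundary, and near-chance performance for an acoustically equal step within a category.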


1.2.2 Infants’ perception as a model for the investigation of specialized speech mechanisms

The sophisticated early capacity of infants to perceive speech suggested an "innate" specialization of humans for the processing of speech sounds. Categorical perception in infants (e.g., Eimas, 1974) led researchers to postulate that this population is as sensitive as adults to many types of phonetic boundaries before long exposure to language. Eimas (1974) proposed that infants perceive speech using a "linguistic mode" through the activation of a special speech processor equipped with phonetic-feature detectors. However, categorical perception was also found for non-speech sounds in infants (Jusczyk, Pisoni, Reed, Fernald, & Myers, 1983), suggesting that the early ability of infants to categorize speech sounds can be explained by general auditory processing mechanisms (e.g., Aslin, Pisoni, Hennessy, & Perey, 1981; Aslin, Werker, & Morgan, 2002). The speech processing capabilities of infants could also be biased by an "innately guided learning process" or by "probabilistic epigenesis" for speech perception (e.g., Bertoncini, Bijeljac-Babic, Jusczyk, Kennedy, & Mehler, 1988; Werker & Tees, 1992, 1999). According to this alternative hypothesis, perceptual mechanisms are sensitive to particular distributional properties present in speech sounds (such as the acoustic characteristics and their organization in the speech signal), and this may explain the rapid learning of native languages. Nowadays, a large number of studies still explore the hypothesis of a predisposition, or at least an early preference, for speech sounds in infants by comparing their preference and/or brain activations for speech sounds versus non-linguistic sounds. Neonates were shown to prefer listening to speech sounds over sine-wave speech stimuli (Vouloumanos & Werker, 2007), but to listen equally to (human) speech and monkey vocalizations (Vouloumanos, Hauser, Werker, & Martin, 2010).
With the development of electrophysiological and neuroimaging techniques, brain activation during the presentation of speech sounds has been recorded in young infants. Dissimilar responses to changes in speech sounds (phonetic changes) versus simple acoustic changes (sine tones differing in spectrum) have been observed in neonates. These observations support the idea that different neural networks and specialized modules are operational early in development (Dehaene-Lambertz, 2000). However, as pointed out by Rosen and Iverson (2007), the acoustic properties of the non-speech sounds used in these experiments may influence the infants' preferences and reactions. Brain activations in newborns were explored with more complex non-speech sounds (compared to sine-wave speech or complex tones) varying in their spectro-temporal structure (Telkemeyer et al., 2009). No difference was found in brain activity between speech and non-speech sounds. These results suggest that differences in the acoustic, rather than linguistic, properties of the stimuli may also drive the specific responses to speech sounds observed in other studies (see also Zatorre & Belin, 2001). Thus, the early abilities to process speech sounds demonstrated by young infants may be related to the acoustic properties (i.e., variations in spectral and temporal constituents) of the signals. It is therefore important to investigate the early auditory abilities involved in speech processing.

1.3 Discussion and conclusions

Early speech perception studies conducted in adults led researchers to explore the acoustic cues responsible for the robust speech perception seen in fluent listeners, and opened the way to the study of infants' abilities to perceive speech signals. Several phenomena observed in infants' speech perception were assumed to reflect the operation of innate, linguistically related mechanisms. However, this view was questioned by studies indicating that infants' capacities do not necessarily reflect the operation of phonetic processes (e.g., Jusczyk & Bertoncini, 1988). The auditory system of human and non-human mammals may simply respond to the physical properties of speech sounds (see Aslin et al., 2002; Nittrouer, 2002). Several studies are still questioning the existence of a "species-specific" system for speech processing. Nevertheless, this first literature review shows how important it is to systematically describe and control the acoustic properties of the speech signal when assessing speech perception abilities in humans. Following these pioneering studies, different research programs started to study speech perception in infants. Most of them focused more on the perceptual abilities involved in the acquisition of language (phonology, word segmentation, lexical acquisition, grammar; e.g., Mattys, Jusczyk, Luce, & Morgan, 1999; see Gervain & Mehler, 2010 for a review) than on the auditory foundations of speech perception. The latter question was mostly addressed in adults using psychoacoustic methods.


II. Developmental psycholinguistics

The pioneering studies initiated by Eimas et al. (1971) explored infants' abilities to perceive speech sounds and identified some of the acoustic cues required for the manifestation of these abilities. Moreover, certain speech sounds elicit a particular preference in neonates, such as their mother's voice compared to a stranger's voice (Querleu et al., 1986), utterances in the language of their environment compared to utterances in a foreign language (Mehler et al., 1988), and speech produced in an infant-directed manner compared to adult-directed speech (Fernald, 1985). Section 2.1 presents studies that attempted to isolate the acoustic cues underlying these preferences; two main techniques were used to do so: low-pass filtering and masking.

2.1 Exploration of infants’ abilities to discriminate speech acoustic cues

2.1.1 Perception of prosody and low-pass filtered speech signals

One hypothesis suggests that newborns prefer their mother's voice because of prenatal exposure to mother-specific low-frequency sounds. During pregnancy, the womb and amniotic fluid attenuate external sounds but transmit low audio frequencies relatively well (Armitage, Baldwin, & Vince, 1980; Granier-Deferre, Lecanuet, Cohen, & Busnel, 1985; Querleu, Renard, Versyp, Paris-Delrue, & Crèpin, 1988). This prenatal experience of "filtered" speech could facilitate the postnatal recognition of the mother's voice because low-pass filtered sounds preserve most of the prosodic information of the speech signal (i.e., gross rhythmic cues and voice pitch). Spence and DeCasper (1987) used different low-pass filtered voices (with a cutoff frequency of 1 kHz) and found that, in these conditions, neonates prefer to listen to their mother's voice compared to the voice of another woman. Spence and Freeman (1996) also examined neonates' perception of low-pass filtered voices (at 500 Hz) compared to whispered speech produced by their own mother. They found that infants prefer to listen to their mother's voice in the low-pass

filtered condition but not in the whispered condition. These results suggest that whispered speech lacks the acoustic information underlying infants' preference for their mother's voice. At the same time, neonates have been shown to be able to discriminate two unfamiliar whispered voices. Together, these findings indicate that prosodic information, and more specifically the variations in fundamental frequency (F0) that are missing in whispered speech, is required for infants' preference for their mother's voice. Another early preference in speech perception has been observed: the preference for the infant's native language (Mehler et al., 1988). The perception of discourse fragments from different languages has been investigated in neonates and 2-month-olds. The results indicate that infants prefer listening to utterances in their native language rather than utterances in a foreign (unknown) language. Two control conditions were designed to assess the acoustic cues underlying this preference. First, the speech signals were played backward to alter the prosodic information (i.e., rhythm and pitch trajectory) while keeping the gross spectral characteristics of the signal (the long-term power spectra of forward and backward signals being identical). Second, the speech signals were low-pass filtered at 400 Hz in an attempt to reduce the segmental (phonetic) information while preserving most of the prosodic cues. The results indicated that infants were equally able to discriminate their native language from a non-native one in the normal speech condition and in the low-pass filtered condition, but they failed in the backward condition. These results suggest that segmental information is not required to make this native versus non-native language distinction. Prosodic cues seem sufficient to discriminate utterances from two languages and possibly to recognize the native one.
Further investigations using low-pass filtered speech signals (400 Hz cutoff frequency) refined these results and showed that neonates discriminate languages on the basis of global rhythmic properties (syllable-timed versus stress-timed versus mora-timed languages; Nazzi, Bertoncini, & Mehler, 1998). The same pattern of results was observed in 2-month-old babies (Dehaene-Lambertz & Houston, 1998). Another preference repeatedly observed in young infants is for infant-directed speech (Fernald, 1985), that is, speech with the exaggerated pitch contours produced by mothers or caregivers when speaking to babies. Fernald and


Kuhl (1987) investigated the acoustic cues favoring this preference by removing the lexical content of original speech signals produced either in an infant- or in an adult-directed manner (using synthesized sine-wave signals). The experimental conditions were such that either the F0 variations, the amplitude variations, or the duration of each signal sample was preserved. Results showed that 4-month-old infants preferred the infant-directed speech only in the condition preserving the F0 variations of the original signals. Amplitude and duration information did not contribute to the differentiation of infant- from adult-directed speech, while F0 variations seem to play a major role. Finally, another study showed that 6-month-old infants categorize infant-directed speech utterances on the basis of low-pass filtered speech signals (with a cutoff frequency of 650 Hz), confirming that prosodic features provide necessary and sufficient information to detect and maintain auditory attention to infant-directed speech (Moore, Spence, & Katz, 1997). These studies explored the nature of the acoustic cues to which infants seem prepared to attend in the speech signal. They demonstrated that infants are sensitive to the prosodic cues conveyed by low audio frequencies (at least below 400 Hz).
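The low-pass filtering manipulation used throughout these studies can be sketched with standard digital filters. The following is a minimal illustration, not the authors' original procedure: a 400-Hz Butterworth low-pass filter (the cutoff used in the language-discrimination experiments) applied to a toy two-component signal standing in for speech.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_speech(waveform, fs, cutoff_hz=400.0, order=4):
    """Low-pass filter a waveform, preserving mostly the low-frequency
    (prosodic) information, as in the filtered-speech studies."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, waveform)

# Toy "speech-like" signal: a 200-Hz voiced component (kept by the
# filter) plus 3-kHz energy standing in for segmental detail (removed).
fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
filtered = lowpass_speech(speech_like, fs)
```

Zero-phase filtering (`sosfiltfilt`) is used here so that the temporal alignment of the remaining low-frequency cues is not distorted.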

2.1.2 Perception of masked speech

Another way to explore the processing of the acoustic cues in speech, and the robustness of speech coding, is to test speech perception abilities in masking conditions. Here, speech sounds are presented simultaneously with another speech sound or against background noise. Only a limited number of studies have used these specific listening conditions to assess speech perception in infants. The "cocktail-party" effect (illustrating the ability of human listeners to follow a given voice in the presence of other voices; Cherry, 1953) was explored in infants (Newman & Jusczyk, 1996). The participants had to recognize words pronounced by a female speaker in the presence of sentences pronounced by a "masking" or "competing" male speaker. Recognition of familiar words (presented in a familiarization phase) was observed in 7.5-month-olds at signal-to-noise ratios (SNRs) of 5 and 10 dB. However, no preference for the familiar words was observed at 0 dB SNR. The perception of a target word


(such as the baby's own name) was also tested in younger babies in the presence of a multi-talker background composed of 9 different female voices (Newman, 2005). Results indicated that 5-month-olds prefer listening to their own names when these targets are 10 dB louder than the distractors. However, at 5 dB SNR, 5- and 9-month-olds failed, and only 13-month-old infants succeeded in detecting their own names. Finally, a recent study by Newman (2009) indicated that 5- and 8.5-month-old infants are better at recognizing their own names in the presence of a multi-talker background than in the presence of a single-voice background (played forward or backward), which shares more acoustic cues with the target voice. In another set of studies, the detection of speech in noise was tested by systematically varying the SNR for adults and infants (Nozza, Wagner, & Crandell, 1988; Trehub, Schneider, & Bull, 1981). Six- and 12-month-old infants required a higher SNR than children and adults to detect the speech. The discrimination of a phonetic difference (/ba/ versus /ga/) was also assessed in infants at several SNRs (-8, 0, 8 and 16 dB) with a bandpass-filtered noise masker (100 Hz to 4 kHz; Nozza, Rossman, Bond, & Miller, 1990). Results showed that performance increases with SNR in both infant and adult groups, but infants required a higher SNR than adults to achieve comparable levels of performance. Very recently, the effects of changing the temporal structure of noise maskers (modulated versus unmodulated noise) on perceptual sensitivity to a vowel change have been investigated in infants. Werner (2013) observed that infants improve their ability to listen in the dips of the modulated masker when the modulations are more regular or slower. However, contrary to adults, no effect of modulation depth was observed on the infants' performance. This result may be related to the influence of informational masking or to the attractiveness of the masker for infants.
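In all of these masking studies, the SNR specifies the level of the target relative to the masker. As a purely illustrative sketch (the calibration procedures of the cited studies are not reproduced here), a masker can be scaled to reach a nominal SNR as follows:

```python
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so that the target-to-masker power ratio
    equals snr_db, then return the mixture."""
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    gain = np.sqrt(p_target / (p_masker * 10.0 ** (snr_db / 10.0)))
    return target + gain * masker

# Toy target ("speech") and Gaussian-noise masker, mixed at +5 dB SNR.
fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).standard_normal(fs)
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Defined this way, 0 dB SNR means target and masker have equal power; positive values (e.g., the 5 and 10 dB conditions above) make the target correspondingly more intense than the masker.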
These studies indicate that infants' ability to perceive speech is more affected by noise or competing sounds than that of adults. In other words, speech coding seems less robust in infants. This may reflect immature or inefficient coding of speech cues, a poorer capacity to "glimpse" speech in the dips of fluctuating backgrounds, and/or poorer segregation capacities. These results highlight significant differences between infants and adults that parallel the development of the auditory and speech systems.

2.2 The perceptual reorganization of speech

As described above, infants show a very early bias towards prosodic (i.e., suprasegmental) information similar to what they may have received prenatally (Nazzi et al., 1998). However, the developing speech system is marked by important changes occurring during the first year of life. Regarding segmental information, infants are able to discriminate both native and non-native phonemic contrasts early in development (e.g., see Kuhl, 2004, and Werker & Tees, 1999 for reviews). This phenomenon has been demonstrated for different segments such as consonants, vowels and lexical tones.

2.2.1 Vowel perception

The vowel space is commonly described as a two-dimensional space combining the first and second formant (F1 and F2) frequency variations for each vowel (Miller, 1989). Phonetically, vowels are described more categorically according to features such as anteriority or aperture. Vowels are more salient in the speech signal than consonants (e.g., Mehler, Dupoux, Nazzi, & Dehaene-Lambertz, 1996), and are the main carriers of prosodic information. Four-month-old infants can discriminate vowel contrasts, including non-native ones, but 6- and 10-month-olds cannot (Trehub, 1973, 1976; Polka & Werker, 1994). Kuhl proposed that vowels are perceptually organized into categories around "prototypes" specific to each language (Kuhl, 1991). English- and Swedish-learning 6-month-old infants were tested with prototypes of native and non-native vowels (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). This was done using synthetic vowels generated along a frequency continuum of the first and second formants in order to produce variants around an English or a Swedish vowel prototype. Results show that the "perceptual-magnet effect" (Kuhl, 1993) operates only for the native vowel categories: 6-month-old infants perceive more variants of the native vowel as identical to the prototype, compared to the variants of the non-native vowel category. At 6 months, then, language-specific perceptual organization has begun to outline the category of vowel sounds. Thus, the perceptual reorganization for speech occurs before 6 months of age for vowel contrasts (but see Rvachew, Mattock, Polka, & Ménard, 2006).

2.2.2 Native and non-native consonant perception

For consonants, phonetic descriptions in terms of voicing, place and manner are more widely used than acoustic descriptions in terms of burst, formant transitions, VOT, and so on. For consonant contrasts, Trehub (1976) showed that English-learning 2-month-old infants discriminate non-native (French) contrasts. However, this ability seems to disappear in adults (Eilers, Gavin, & Wilson, 1979). Werker, Gilbert, Humphrey and Tees (1981) found that English-speaking adults show poor discrimination performance for a Hindi consonant contrast, contrary to 6-8-month-old English-learning infants, who discriminate native and non-native contrasts relatively well. Werker and her colleagues further investigated this decline in discrimination performance between infancy and adulthood (Werker & Tees, 1984). They observed that the ability to discriminate the non-native Hindi contrast decreases in English learners between 6-8 months and 8-10 months of age, and that, like adult listeners, 10-12-month-olds did not discriminate this non-native contrast. Several studies replicated these results, indicating perceptual reorganization for different consonant contrasts between 6 and 10 months of age (e.g., Best & McRoberts, 2003; Best, McRoberts, LaFleur, & Silver-Isenstadt, 1995; Tsushima et al., 1994). This behavioral evidence is complemented by electrophysiological data showing that 6-7-month-old infants exhibit similar neural responses to native and non-native speech sounds, while 11-12-month-old infants show a decrease in neural activation for non-native sounds (e.g., Cheour et al., 1998; Rivera-Gaxiola, Silva-Pereyra, & Kuhl, 2005). The decline in perceptual abilities for non-native speech sounds is only one aspect of the perceptual reorganization that the native listener undergoes during this period. In other words, not only does the perception of non-native contrasts decline, but the perceptual boundaries between phonetic categories also become closer to those of the native language.
For example, Hoonhorst et al. (2009) showed that between 4 and 8 months of age, French-learning infants become more sensitive to the French VOT boundary (VOT = 0 ms) that distinguishes voiced and voiceless plosive consonants. Similarly, Kuhl et al. (2006) tested 6-8- and 10-12-month-old infants learning English or Japanese on their ability to discriminate the /r/-/l/ contrast, which does not exist in Japanese. They showed that both 6-8-month-old groups discriminated this contrast. However, in the older groups, performance diverged: it increased in English-learning babies while it decreased in Japanese-learning babies. The same improvement in discrimination performance was observed in Mandarin- and English-learning infants when they had to discriminate a native consonant contrast in Mandarin or English, respectively (Tsao, Liu, & Kuhl, 2006). Thus, at the end of the infant's first year, speech perception is marked not only by a decline in non-native contrast discrimination (e.g., Werker & Tees, 1984) but also by an improvement in native contrast discrimination. Infants seem to be sensitive to the statistical distributions of native-language phonemic units that shape the partition of phonological categories (see Kuhl, 2004; Werker & Tees, 2005 for reviews). Moreover, a positive relationship between native speech discrimination and receptive vocabulary size, and a negative relationship between non-native speech discrimination and cognitive control scores, have been shown in 11-month-old infants (Conboy, Sommerville, & Kuhl, 2008). This perceptual reorganization has mainly been explained by the building of phonological categories and a change in attentional processes. With listening experience, infants may learn to ignore certain variations in speech, especially acoustic variations that are irrelevant to the development of native-language categories. Auditory exposure to specific language input may favor the perception and selection of specific acoustic cues.

2.2.3 Lexical-tone perception

Vowels also convey voice-pitch information determined by F0 and its variations. Pitch information is important in every language, but in tonal languages it plays a major role at the lexical (word) level, since it is involved in word meaning. A lexical tone corresponds to a pitch variation over the vocalic portion of the syllable, and F0 information can vary at the initial, middle and final portions of the vowel. Moreover, as with other segments, multiple acoustic cues correlate with lexical tones, such as the duration, amplitude contour and spectral envelope of the speech signal (e.g., Whalen & Xu, 1992; but see Kuo, Rosen, & Faulkner, 2008). The number of tones used to contrast lexical entries varies among tonal languages (e.g., Mandarin includes four lexical tones, Thai includes five). In adults, lexical-tone users outperform non-users in their ability to perceive lexical tones (Burnham & Francis, 1997). However, in certain conditions, non-native adults are able to perceive lexical-tone differences, but they do not perceive them as phonological categories (Hallé, Chang, & Best, 2004). The perception of lexical tones in infancy was first assessed in Mandarin- and English-learning infants at 6 and 9 months of age (Mattock & Burnham, 2006). The results showed that both 6- and 9-month-old Mandarin-learning infants discriminate lexical tones above chance level. In English-learning infants, only the 6-month-olds showed similar discrimination performance, while English-learning 9-month-olds discriminated one type of lexical-tone contrast (involving contour variations, such as falling versus rising tones) but not a contrast between level and contour pitch values (such as low versus rising tones). As described previously, perceptual reorganization was shown to occur earlier for vowel than for consonant contrasts (e.g., Kuhl et al., 1992). Given that lexical tones are conveyed by vowels, Mattock, Molnar, Polka and Burnham (2008) explored whether a similar reorganization occurs for lexical-tone perception and whether it occurs around the age of 6 months. The authors tested English- and French-learning infants aged 4, 6 and 9 months and found that 4- and 6-month-olds of both language groups succeeded in discriminating (Thai) lexical tones. Conversely, 9-month-olds from both groups failed to discriminate the same lexical-tone contrasts, suggesting that some perceptual reorganization occurs for lexical tones at the same time as has been observed for consonants.
Nevertheless, Yeung, Chen and Werker (2013) tested the discrimination of native and non-native lexical tones in 4-month-olds learning two different tonal languages (Cantonese or Mandarin) as well as in infants learning English (a non-tonal language). Their results suggested that the specialization for lexical-tone perception starts even earlier than Mattock et al. (2008) had proposed. Indeed, 4-month-old infants not only showed discrimination but also diverging preferences related to the lexical tones of their native language. The authors concluded that "acoustic salience likely plays an important role in determining the timing of phonetic development" (Yeung et al., 2013, p. 123). Perceptual reorganization for lexical tones thus seems to occur during infancy, but its time course remains to be specified. As suggested above, the acoustic properties of lexical tones may play a major role in the course of perceptual reorganization. For this reason, studying lexical-tone perception is a logical way to explore the interaction between audition and speech processes.

2.3. Discussion and Conclusions

Over the last decade, a number of studies have assessed speech-perception abilities in infants. Only a small subset of these studies carefully examined the acoustic parameters potentially responsible for the specific perceptual preferences or discrimination abilities of infants. Psycholinguistic studies have previously used filtered signals to remove the segmental (phonetic) information that could be used to distinguish (or prefer) speech stimuli. Several investigations of infants' language discrimination made use of these non-segmental conditions and showed that prosodic information (partly conveyed by low audio frequencies) is critical. However, filtering does not allow one to assess the role of the fine (detailed) acoustic cues required by infants to perceive phonetic contrasts. Infants are able to recognize or discriminate speech sounds in the presence of noise or a speech masker; however, performance differs according to the nature of the masker, and the cause of such differences is still unclear (see Werner, 2013). Previous explorations of speech perception in infants also extend to the perception of phonetic segments. Infants become more or less efficient for some phonetic segments (native and non-native ones, respectively) but do not show the same developmental patterns for vowels, consonants and lexical tones. These phonetic segments differ in their acoustic structure. Vowels are considered more acoustically salient than consonants, which may explain the earlier improvement and perceptual organization shown for native vowels. The relationship between auditory development and tuning for speech categories may also differ between phonetic segments. However, as far as we know, the acoustic properties of these speech segments have received relatively little attention in the current literature on developmental speech perception. The perceptual reorganization observed in the first year of life, and its relation to hearing abilities and to the development of auditory processes, is central to this doctoral research.


III. Auditory development

Speech sounds are first processed, like all other sounds, by peripheral (i.e., cochlear) mechanisms involving the basilar membrane, the inner and outer hair cells, and auditory-nerve fibers (Pickles, 1988). Numerous bio-mechanical processes take place within the peripheral auditory system, but two of particular interest here are those involved in the spectral and temporal analysis of incoming sounds. The basilar membrane decomposes complex waveforms into their frequency components. This organization of frequency coding (called tonotopy) is modeled as resulting from the operation of a bank of narrowly tuned bandpass filters. Complex sounds are thus decomposed into a series of narrow frequency bands (usually described as 32 independent frequency bands) with a passband equal to one "equivalent rectangular bandwidth", 1-ERBN (Glasberg & Moore, 1990; Moore, 2003; "N" standing here for "normal-hearing listeners").
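The 1-ERBN bandwidth follows the formula of Glasberg and Moore (1990). The short sketch below computes it, and also counts how many ERB-wide filters tile a nominal 100 Hz to 8 kHz range (this range is only an illustrative assumption, chosen to show the order of magnitude behind the "32 bands" description above):

```python
import numpy as np

def erb_bandwidth(fc_hz):
    """ERB of the normal auditory filter at center frequency fc_hz
    (Glasberg & Moore, 1990): ERB_N = 24.7 * (4.37 * fc/1000 + 1)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_number(f_hz):
    """Cumulative number of ERBs below f_hz (the ERB-number scale)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

# A filter centered at 1 kHz is about 133 Hz wide...
bw_1k = erb_bandwidth(1000.0)
# ...and roughly 30 such filters tile the assumed 100 Hz - 8 kHz range,
# consistent in order of magnitude with the ~32-band description.
n_bands = erb_number(8000.0) - erb_number(100.0)
```

Note that the ERB grows with center frequency, so the filterbank is much denser (in Hz) at low frequencies, where prosodic cues are carried, than at high frequencies.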

Each 1-ERBN-wide band may be viewed as a sinusoidal carrier with superimposed amplitude modulation (AM) and frequency modulation (FM; e.g., Drullman, 1995; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995; Sheft, Ardoint, & Lorenzi, 2008; Smith, Delgutte, & Oxenham, 2002; Zeng et al., 2005). The FM, or "temporal fine structure", is determined by the dominant (instantaneous) frequencies in the signal that fall close to the center frequency of the band. The AM, or "temporal envelope", corresponds to the relatively slow fluctuations in (instantaneous) amplitude superimposed on the carrier. Both AM (envelope) and FM (temporal fine structure) cues are represented in the pattern of phase locking in auditory-nerve fibers. However, for most mammals, the accuracy of phase locking to the temporal fine structure is constant up to about 1-2 kHz and then declines, such that phase locking is no longer detectable above about 5-6 kHz (e.g., Johnson, 1980; Kiang, Pfeiffer, Warr, & Backus, 1965; Palmer, Winter, & Darwin, 1986; Rose, Brugge, Anderson, & Hind, 1967; cf. Heinz, Colburn, & Carney, 2001), whereas phase locking to temporal-envelope (AM) cues remains accurate for carrier (audio) frequencies well beyond 6 kHz (e.g., Joris, Schreiner, & Rees, 2004; Joris & Yin, 1992; Kale & Heinz, 2010).
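This envelope/fine-structure decomposition is commonly implemented with the analytic signal (Hilbert transform). The sketch below is illustrative only (it is not tied to any specific study cited here): it splits one narrowband channel into its temporal envelope (AM) and a unit-amplitude fine-structure carrier (FM).

```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(band_signal):
    """Decompose a narrowband signal into its temporal envelope (AM)
    and a unit-amplitude temporal fine structure (FM) carrier,
    using the analytic signal."""
    analytic = hilbert(band_signal)
    envelope = np.abs(analytic)       # slow amplitude fluctuations (AM)
    tfs = np.cos(np.angle(analytic))  # fast fine-structure carrier (FM)
    return envelope, tfs

# Toy narrowband channel: 1-kHz carrier with a 4-Hz sinusoidal envelope.
fs = 16000
t = np.arange(fs) / fs
x = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
env, tfs = envelope_and_tfs(x)
```

For such a toy signal the recovered envelope matches the imposed 4-Hz modulator, while `tfs` retains only the carrier's phase, mirroring the AM/FM distinction described above.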


Moreover, several additional peripheral mechanisms (e.g., synaptic adaptation) appear to limit temporal-envelope (AM) coding beyond those that limit temporal-fine-structure (FM) coding (Joris & Yin, 1992), suggesting potential dissociations between the auditory processing of temporal-envelope (AM) and temporal-fine-structure (FM) cues. Studies with human fetuses and preterm infants have indicated that the cochlea is functionally and structurally developed before birth (e.g., Abdala, 1996; Bargones & Burns, 1988; Morlet et al., 1995; Pujol & Lavigne-Rebillard, 1992). Nevertheless, the neural auditory pathway (Moore, 2002) and the primary auditory processes involved in frequency, temporal and intensity coding are not adult-like at birth and continue to mature later in childhood (see Burnham & Mattock, 2010; Saffran et al., 2006 for reviews). The development of these low-level sensory processes has been studied mainly in infancy with non-linguistic stimuli (i.e., tones, noises).

3.1 Development of frequency selectivity and discrimination

Frequency processing in the cochlea may depend on both spectral (spatial) and temporal mechanisms. Frequency selectivity (and thus, frequency resolution) corresponds to the ability to separately perceive multiple frequency components of a given complex sound. Frequency selectivity depends on spatial mechanisms in the cochlea (the tonotopic coding along the basilar membrane). Frequency discrimination corresponds to the ability to distinguish between sounds differing in frequency. At high frequencies, frequency discrimination is constrained by the width of cochlear filters (and thus, by frequency selectivity/resolution), but at low frequencies, frequency discrimination may also depend on a purely temporal code constrained by neural phase locking (e.g., Moore, 1974; see Moore, 2004 for a review). Behavioral and electrophysiological methods have been used to assess the development of frequency selectivity/resolution and frequency discrimination in infants.

3.1.1 Frequency selectivity/resolution

Masking paradigms have been used to assess the development of frequency selectivity, and thus frequency resolution, in humans. A competing sound with a specific frequency content (e.g., a narrow band of noise) is presented to the listener in order to interfere with the detection of a target pure tone of a given frequency. Higher (masked) detection thresholds for the target pure tone are well explained by the characteristics (shape, width) of the cochlear filters. In infants, several studies showed that responses to a masked pure tone become adult-like around 6 months of age at low and high frequencies. No difference was found between 4-8-month-old infants and adults in the detection of masked tones at frequencies between 500 Hz and 4 kHz (Olsho, 1985). Schneider, Morrongiello and Trehub (1990) assessed masked thresholds at 400 Hz and 4 kHz using narrowband maskers of different bandwidths in 6.5-month-olds, children of 2 and 5 years, and adults. The masked thresholds increased similarly for all groups, and no increase of the threshold was observed beyond 2 kHz, indicating that the critical width of auditory filters does not change with age. However, younger infants (3-month-olds) showed mature frequency selectivity at low frequencies (500 Hz and 1 kHz) but not at a high frequency (4 kHz; Spetner & Olsho, 1990), whereas 6-month-olds demonstrated mature frequency selectivity at all frequencies. The width of the auditory filter (and thus of the cochlear filter) therefore becomes mature during the first year of life. However, developmental studies conducted with children have shown immaturities in frequency selectivity until 4 years of age (Allen, Wightman, Kistler, & Dolan, 1989; Irwin, Stillman, & Schade, 1986), but these results are strongly dependent on the task. Hall and Grose (1991) observed that 4-year-olds had a mature auditory filter width when tested using different types of maskers (i.e., notched-noise conditions with different signal levels). Thus, frequency selectivity at low frequencies appears to be efficient at birth but develops at higher frequencies between 3 and 6 months.
Moreover, physiological studies using auditory brainstem responses (ABR) or otoacoustic emissions led to the same conclusions (see Abdala & Folsom, 1995a, 1995b; Bargones & Burns, 1988; Folsom & Wynne, 1987). This maturation seems consistent with the early capacity of infants to process and recognize low-pass filtered speech sounds (e.g., Mehler et al., 1988).


3.1.2 Frequency discrimination

Six-month-olds show mature frequency resolution. Still, other studies revealed that they require a larger frequency difference than adults to detect a change in frequency between tones below 2 kHz (Aslin, 1989; Olsho, Koch, & Halpin, 1987; Olsho, Schoon, Sakai, Turpin, & Sperduto, 1982; Sinnott & Aslin, 1985). Olsho (1984) showed that at 5-8 months, infants' frequency discrimination thresholds are higher than those of adults at low frequencies (below 2 kHz) but not at high frequencies. Three-month-old infants have higher frequency discrimination thresholds than 6-month-olds, 12-month-olds and adults at a high frequency (4 kHz; Olsho et al., 1987). At 500 Hz and 1 kHz, all three infant groups present higher frequency discrimination thresholds than adults. The early improvement in high-frequency discrimination is consistent with the improvement in frequency selectivity observed at the same ages. No difference at low frequencies was observed between 3 and 6 months, suggesting that frequency discrimination remains immature at low frequencies (Olsho, 1984; Sinnott & Aslin, 1985; Olsho et al., 1987). Discrimination of low frequencies seems to become adult-like later in childhood (Jensen & Neff, 1993; Maxon & Hochberg, 1982; Thompson, Cranford, & Hoyer, 1999). Thus, between 3 and 6 months, infants exhibit better frequency discrimination performance at high frequencies than at lower frequencies. The factors underlying this developmental trajectory are still unclear. Several explanations have been proposed to account for the differences observed between infants, children and adults, such as training (Olsho, Koch, & Carter, 1988) and testing methods (Sutcliffe & Bishop, 2005). Moreover, as described before, low-frequency coding may involve a temporal code (Moore, 1974) constrained by neural phase locking in auditory-nerve fibers.
The immaturity in the discrimination of frequency differences at low frequencies may be explained by an inaccurate temporal code (see Kettner, Feng, & Brugge, 1985). In other words, the temporal and place codes for frequency may not follow the same pattern of development.

3.2 Development of temporal resolution

"Temporal resolution" of the auditory system refers to the ability to detect changes in sounds over time (Moore, 2004). Psychoacoustical and neurophysiological studies showed that these changes are processed at different time scales in the peripheral and central auditory system. As described above, two time scales have been identified: a relatively fast one (the FM cues or “temporal fine structure”) and a relatively slow one (the AM cues or “temporal envelope”). Temporal resolution generally refers to the limits in the processing of temporal-envelope cues. The auditory processing of temporal-envelope (AM) information is now generally understood as resulting from the operation of a central modulation filterbank (e.g., Dau, Kollmeier, & Kohlrausch, 1997a, 1997b). The auditory processing of temporal fine structure (FM) cues is assumed to be constrained by neural phase locking in auditory-nerve fibers, at least for slow FM rates (< 5 Hz) and low carrier audio frequencies (< 1 kHz). For faster FM rates and higher carrier audio frequencies, FM cues may be converted into envelope (AM) and spectral (place) cues by cochlear filters (see Moore, 2004). The processing of these two kinds of information has been studied in normal-hearing infants and children with relatively simple sounds and various psychoacoustical and electrophysiological measures.

3.2.1 Processing of temporal envelope

Various methods are used to assess temporal-envelope (AM) processing. Gap detection tasks consist of finding the shortest silent gap that can be detected in a sound. For broadband noise stimuli, the gap detection threshold measured in adults is about 1-3 ms, indicating exquisite temporal resolution of the auditory system. Poor temporal resolution has been observed in 3-, 6- and 12-month-old infants (e.g., Trehub, Schneider, & Henderson, 1995; Werner, Marean, Halpin, Spetner, & Gillenwater, 1992). However, earlier maturation (by 3 or 6 months of age) has been suggested by electrophysiological measurements of gap detection (Trainor, Samuel, Desjardins, & Sonnadara, 2001; Werner, Folsom, Mancl, & Syapin, 2001). Moreover, earlier maturation of envelope processing has also been found in other behavioral studies using AM detection tasks (e.g., Levi & Werner, 1996) and forward or backward masking tasks (e.g., Levi & Werner 1996; Werner, 1996, 1999). Regarding AM processing, a single electrophysiological study

conducted with 1-month-old infants indicates immaturity in the processing of AM (Levi, Folsom, & Dobie, 1995). From the evidence above, it appears that auditory temporal resolution is mature by around 6 months. However, studies conducted with children showed differences in auditory temporal resolution. To reconcile these findings, it is important to consider that results from studies performed with children are also strongly dependent on the methods used to assess temporal resolution. For children, gap detection tasks show maturation of temporal resolution around 5-7 years (Diedler, Pietz, Bast, & Rupp, 2007; Trehub et al., 1995; Wightman, Allen, Dolan, Kistler, & Jamieson, 1989), whereas AM detection tasks indicate earlier maturation, at 4 years (e.g., Buss, Hall, Grose, & Dev, 1999; Hall & Grose, 1994). The contribution of sensory versus nonsensory factors to this long maturation is still debated (see Werner, 2007), but children’s results generally appear to be more dependent on the task or on sound complexity than adults’ (see Buss et al., 1999).

3.2.2 Processing of temporal fine structure

Very few studies have assessed the detection of FM changes in infants and children. These investigations were conducted in order to better understand the perception of speech rate or infant-directed speech in infants. Six-month-old infants require larger frequency transitions than adults to detect a FM change in a 1-kHz tone (Aslin, 1989). However, Colombo and Horowitz (1986) showed that 4-month-old infants are able to discriminate two different FM sweeps (from 150 Hz to 550 Hz or from 150 Hz to 275 Hz over a 1-s period) and do not prefer exaggerated sweeps (such as those in infant-directed speech). More recently, Leibold and Werner (2007) found that 4-month-olds are more sensitive to FM cues swept from 150 Hz to 550 Hz than to sweeps at lower frequencies. However, the infant-directed speech preference may also be based on a preference for higher frequencies (i.e., for 550 Hz compared to 150 Hz) rather than on the FM properties of this signal. In children, performance in detecting FM changes improves between 6 and 10 years, and sensitivity to a low modulation rate (2 Hz) remains poor until 9 years (Dawes & Bishop, 2008). These studies indicate that young infants are able to process differences in FM, but they fail to relate these results to speech perception abilities. Overall, it seems

that the processing of FM cues is not completely mature in infancy. At the neural level, animal data collected in kittens show that the temporal responses (i.e., phase-locking properties) of auditory-nerve fibers to pure tones improve with age (Kettner et al., 1985).

3.3 Discussion and Conclusions

Three major acoustic cues are extracted by the human auditory system: spectral cues (encoded spatially) and temporal cues (encoded via neural phase locking) at two different time scales (a slow and a fast one). Accurate coding of these three acoustic cues is required for robust speech perception in real-life listening conditions. The development of the basic auditory capacities involved in the extraction of these three cues has been assessed extensively with non-linguistic sounds. As discussed above, these three capacities appear to be “mature” by 6 months of age, but continue to develop until late in childhood (e.g., Saffran et al., 2006; Buss et al., 1999). This section has emphasized the fact that the developmental course of auditory temporal processing is less well understood than the development of other basic auditory capacities. Table 1 summarizes the main findings regarding the development of frequency and temporal processing. Infants are able to discriminate fine phonetic contrasts although they have relatively high pure-tone absolute thresholds, poor low-frequency discrimination and poor high-frequency selectivity. Auditory processing has rarely been tested using complex sounds such as speech stimuli. The development of speech analysis and synthesis algorithms/devices has recently allowed in-depth investigations of the low-level spectro-temporal auditory abilities involved in the perception of speech signals in adults. Section 4 presents this novel framework and the recent data acquired in adults and children with respect to the processing of speech AM and FM cues.


| Ability | Study | Ages | Method and stimuli | Difference across ages |
| Frequency selectivity/resolution | Olsho, 1985 | 4-8 months | Detection of a tone probe from 500 to 4000 Hz | No difference with adults |
| Frequency selectivity/resolution | Schneider et al., 1990 | 6.5 months; 2 and 5 years | Tones of 800 and 4000 Hz in maskers of different bandwidths | No change in the critical bandwidth |
| Frequency selectivity/resolution | Spetner & Olsho, 1990 | 3 and 6 months | Pulsating probe detection at 500, 1000 and 4000 Hz | 3-month-olds worse at high frequency |
| Frequency selectivity/resolution | Abdala & Folsom, 1995 | 3 and 6 months | Auditory brainstem responses | 3-month-olds worse at high frequency |
| Frequency discrimination | Olsho, 1984 | 5 and 8 months | Frequency change detection, pure tones from 250 to 8000 Hz | 5-8-month-olds’ thresholds twice as high as adults’ at low frequency |
| Frequency discrimination | Olsho et al., 1987 | 3, 6 and 12 months | Pure tones of 500, 1000 and 4000 Hz | 3-month-olds worse than adults; 6- and 12-month-olds worse at 500 and 1000 Hz |
| Frequency discrimination | Jensen & Neff, 1993 | 4 to 6 years | Frequency change detection | Variability, but similar to adults at 6 years |
| Temporal resolution | Levi et al., 1995 | 1 month | Frequency- and envelope-following responses | Same thresholds as adults at 500 and 1000 Hz, but no improvement with increasing carrier frequency |
| Temporal resolution | Werner et al., 1992 | 3, 6 and 12 months | Gap detection in high-pass maskers of 500, 2000 or 8000 Hz | Poorer thresholds than adults, but same effect of frequency across groups |
| Temporal resolution | Levi & Werner, 1996 | 3 and 6 months | AM detection from 5 to 200 Hz modulation in broadband noise | Poorer sensitivity at 3 than at 6 months, but similar effect of modulation frequency |
| Temporal resolution | Hall & Grose, 1994 | 4 to 11 years | AM detection from 5 to 200 Hz modulation in broadband noise | 4-7 years poorer, 9-10 years adultlike, but same sensitivity to modulation |
| Temporal resolution | Werner, 1999 | 3 and 6 months | Forward masking for a 1000 Hz tone | 6-month-olds more adultlike than 3-month-olds |
| Temporal resolution | Buss et al., 1999 | 5 to 11 years | Backward and forward masking | Thresholds improve, but task-dependent |

Table 1. Methods and results of studies investigating the development of frequency selectivity/resolution, frequency discrimination and temporal resolution of the auditory system. The last column indicates whether differences have been found between infants/children and adults.


IV. Speech as a spectro-temporally modulated signal

4.1 Speech as a modulated signal

Our current understanding of how speech information is represented in the auditory system has recently improved thanks to a wealth of psychophysical studies relying primarily on “vocoders” (Drullman, 1995; Dudley, 1939; Flanagan, 1972, 1980; Flanagan, Meinhart, Golden, & Sondhi, 1965) to manipulate the speech signal (but see also Remez & Rubin, 1990). The first stage in these vocoders is a filterbank (also referred to as the “analysis filterbank”) that is meant to mimic cochlear frequency analysis. Its outputs are often modeled as the product of the (Hilbert) temporal envelope (or AM function) and a FM sine-wave carrier (the temporal fine structure) at the analysis-filter center frequency. AM vocoders (e.g., Drullman, 1995; Shannon et al., 1995) preserve the original AM component and discard the original FM component by replacing it with a band of noise or a tone at the center frequency of the analysis band (e.g., Shannon et al., 1995). Conversely, FM vocoders (Smith et al., 2002; Gilbert & Lorenzi, 2006; Lorenzi, Gilbert, Carn, Garnier & Moore, 2006) preserve the original FM component and discard the AM component. By manipulating the filter bandwidths (and thus, frequency resolution), researchers can gradually change the relative importance of the AM and FM components conveying the intelligibility of the reconstituted speech. By manipulating the cutoff frequency of the filters used to demodulate speech within each analysis band (and thus, temporal resolution), researchers can also gradually change the relative importance of the modulation components conveying the intelligibility of the reconstituted speech.
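As an illustration of this processing scheme, the sketch below implements a minimal noise-excited AM vocoder in Python (NumPy/SciPy). The band edges, filter orders and 64-Hz envelope cutoff are arbitrary illustrative choices, not the parameters of any study cited here:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def am_vocode(x, fs, band_edges=(100, 500, 1500, 3500, 7000), env_cutoff=64.0):
    """Noise-excited AM vocoder: keep each band's temporal envelope (AM),
    replace its temporal fine structure (FM) with bandpass noise."""
    rng = np.random.default_rng(0)
    sos_env = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
    out = np.zeros(len(x))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                           # Hilbert (AM) envelope
        env = np.clip(sosfiltfilt(sos_env, env), 0.0, None)   # smooth the envelope
        noise = sosfiltfilt(sos, rng.standard_normal(len(x)))  # bandpass noise carrier
        out += env * noise / np.max(np.abs(noise))
    return out
```

An FM vocoder would instead keep `np.cos(np.angle(hilbert(band)))` per band (the fine structure) and discard the envelope.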

4.2 The perception of speech modulation cues

4.2.1 Perception of speech modulation cues by adults

A large number of studies used nonsense syllables, words or sentences processed by vocoders to reduce the original AM (temporal envelope) or FM
(temporal fine structure) speech cues within each analysis frequency band (e.g., Gilbert & Lorenzi, 2006; Shannon et al., 1995; Sheft et al., 2008; Smith et al., 2002; Zeng et al., 2005). High levels of speech intelligibility and phonetic-feature reception are obtained in normal-hearing adults when the vocoded speech stimuli retain mainly the AM speech cues. Shannon et al. (1995) evaluated the role of these spectro-temporal modulation cues in speech identification by using different noise-excited vocoders that replaced the FM information with a band of noise within each analysis band. The AM cues were extracted using different lowpass filters (with a cutoff frequency varying from 16 to 500 Hz) within a limited number of broad analysis frequency bands (1, 2, 3 or 4 bands). Results showed that syllable identification and phonetic-feature reception (i.e., voicing, manner and place of articulation) are poor for 1, 2, or 3 analysis frequency bands, but increase sharply with 4 bands, irrespective of the cutoff frequency of the lowpass filter used to extract AM. These studies highlight the primary role of the AM cues in speech identification, at least in quiet. Other studies with vocoded signals retaining mainly the (Hilbert) FM cues also show high levels of speech identification with nonsense syllables or sentences (e.g., Gilbert & Lorenzi, 2006; Hopkins, Moore, & Stone, 2010; Lorenzi et al., 2006; Sheft et al., 2008). However, a much longer training period is necessary for participants to identify these "FM-speech" stimuli compared to their “AM-speech” counterparts, suggesting that FM cues are less salient than AM cues. The essential role of AM cues in speech identification in quiet was also shown with chimaeric sounds combining the AM cues of a given speech signal with the FM cues of another speech signal (Smith et al., 2002).
When presented with these chimaeric stimuli, the responses of adult listeners are mostly driven by the AM cues rather than by the FM cues. These data comparing the role of AM and FM cues confirmed that AM cues play a critical role in speech identification. However, another set of studies demonstrated that the relative importance of FM cues may increase when speech sounds are distorted, or more generally, when speech redundancy is reduced (Ardoint & Lorenzi, 2010; Gilbert, Bergeras, Voillery, & Lorenzi, 2007; Gilbert & Lorenzi, 2006; Hopkins et al., 2010; Nelson, Jin, Carney, & Nelson, 2003; Qin & Oxenham, 2003; Zeng et al., 2005). Indeed, speech intelligibility is poorer for “AM speech” than for “intact speech” (that is, for speech combining AM and FM
cues within each frequency band) when stimuli are spectrally reduced or filtered, periodically interrupted or masked by interfering talkers or background noise (Eaves, Summerfield, & Kitterick, 2011; Gnansia, Péan, Meyer, & Lorenzi, 2009). Moreover, some investigations demonstrated different perceptual weighting of AM and FM cues in different listening conditions, such as steady or interrupted noise (e.g., Fogerty, 2011; Fogerty & Humes, 2012). Other studies suggested that the role of AM and FM cues may vary between languages such as English or French and tonal languages (i.e., Mandarin or Thai). Xu and Pfingst (2008) used chimaeric monosyllabic words combining the AM and the FM of two Mandarin lexical tones within each frequency band. They showed that, contrary to the results of Smith et al. (2002), the responses of adult Mandarin speakers are driven more by the FM cues than by the AM cues. Figure 1 shows the identification scores obtained with chimaeric sounds in Mandarin-speaking and English-speaking adults. Several studies found similar results (Fu, Zeng, Shannon, & Soli, 1998; Kong & Zeng, 2006; Wang, Xu, & Mannell, 2011), indicating that speech modulation cues are used differently in lexical-tone identification compared to consonant or vowel identification.

Figure 1. Identification scores of speech chimeras containing conflicting speech information in the AM (temporal envelope) and the FM (temporal fine structure) of two speech items. The responses of English adults for English items are represented by triangles, and the responses of Mandarin speakers for Mandarin items are represented by circles. Blue lines represent the percentage of correct responses consistent with AM. Red lines represent the responses consistent with FM. Redrawn from Smith et al. (2002) and Xu and Pfingst (2008).
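The chimaera construction behind these stimuli can be sketched as follows: within each analysis band, the Hilbert envelope of one signal multiplies the Hilbert fine structure of the other. The band edges and filter order below are illustrative assumptions, not the parameters of Smith et al. (2002) or Xu and Pfingst (2008):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def chimerize(x1, x2, fs, band_edges=(100, 500, 1500, 3500, 7000)):
    """Auditory chimaera: the AM (envelope) of x1 is imposed on the
    FM (fine structure) of x2 within each analysis band."""
    n = min(len(x1), len(x2))
    out = np.zeros(n)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        b1 = sosfiltfilt(sos, x1[:n])
        b2 = sosfiltfilt(sos, x2[:n])
        env1 = np.abs(hilbert(b1))             # envelope of signal 1
        tfs2 = np.cos(np.angle(hilbert(b2)))   # fine structure of signal 2
        out += env1 * tfs2
    return out
```

Swapping the roles of `x1` and `x2` yields the complementary chimaera, which is how such studies pit AM against FM cues.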


It follows from the studies discussed above that the weight of AM (temporal envelope) and FM (temporal fine structure) varies strongly across listening conditions and languages in adult listeners. In other words, the processing of AM and FM speech cues may not be hard-wired, as assumed in recent models of speech recognition (e.g., Cooke, 2006; Elhilali, Chi, & Shamma, 2003; Jørgensen & Dau, 2013).

4.2.2 Perception of speech modulation cues by children

In another pioneering study, Eisenberg, Shannon, Martinez, Wygonski and Boothroyd (2000) assessed the ability of normal-hearing 7- and 10-year-old children and adults to identify nonsense syllables, words and sentences vocoded to retain only AM cues below 160 Hz within 4 to 8 analysis frequency bands (the FM carriers were replaced by noise in each band). Results showed that children younger than 7 years require a higher frequency resolution (i.e., a greater number of analysis frequency bands) than 10-year-olds and adults to reach similar identification performance. This initial investigation was extended by Bertoncini, Serniclaes and Lorenzi (2009) to include younger children aged between 5 and 7 years. A discrimination task was used with nonsense syllables vocoded to retain only the FM cues, or the AM cues below 64 Hz, within 16 frequency bands (here, the FM carriers were replaced by pure tones with fixed frequencies). Bertoncini et al. (2009) found that normal-hearing 5-, 6-, and 7-year-old children were able to discriminate speech contrasts on the basis of AM cues or FM cues at an adult level. This study also showed no significant difference in performance across age groups for voicing, place, manner, and nasality. Thus, the perception of AM speech cues below 64 Hz, and of FM speech cues, appears to be as robust for 5-year-old children as for adults when a discrimination task is used. More recently, Newman and Chatterjee (2013) used a preferential looking paradigm with 27-month-old toddlers. Vocoded sentences were used to instruct the children to look at a target picture. Children were able to look at the picture of the target word on the basis of AM cues extracted within 8 frequency bands. With 4 channels, performance was inconsistent, and with 2 channels, children failed to
look at the target word. Moreover, children oriented towards the target picture more slowly as the number of channels decreased. Figure 2 summarizes the results of Eisenberg et al. (2000) and Newman and Chatterjee (2013) obtained with 5-7-year-olds and 2-year-olds in two word identification tasks. The cutoff frequency for envelope extraction was fixed in each study and was relatively high (160 Hz and 400 Hz, respectively). The number of frequency bands varied from 2 to 32. These studies indicate that children’s capacity to use AM speech cues varies as a function of the number of analysis frequency bands. This contrasts with data from adults (whose identification performance is above chance with 4 frequency bands; Shannon et al., 1995) and suggests that the processing of speech modulation cues in children is not as robust as in adults. This may indicate that fine modulation details are more important for children than for adults. However, it is important to note that children are clearly able to use reduced spectro-temporal information (at least in 8 frequency bands) to identify speech sounds, and this ability increases with age. Moreover, Nittrouer and Lowenstein (2010; see also Nittrouer, Lowenstein, & Packer, 2009) showed that children have more difficulty with noise-vocoded speech signals than with sine-wave speech signals, suggesting (without demonstrating it) that dynamic spectral information may play an important role in children’s speech perception (children may be more susceptible than adults to the interference or modulation-masking effect caused by the envelope fluctuations of noise carriers in vocoded speech).


Figure 2. Correct word recognition scores of 5-7-year-old children and 2-year-old children. Chance level is 50%. Vocoded speech signals contained only the amplitude modulation cues of the original speech signals (except in the “intact” condition, corresponding to non-degraded signals). The number of analysis frequency bands varied from 2 to 32. Redrawn from Eisenberg et al. (2000) and Newman and Chatterjee (2013).

To the best of our knowledge, only one study examined the perception of the AM cues of speech sounds before childhood. This study measured the cardiac activity of human fetuses (38 weeks of gestational age) in response to intact and degraded versions of spoken sentences, the latter containing only gross AM cues extracted by a single-band (broadband) noise-excited vocoder. The sentences were played through loudspeakers in front of the mothers’ abdomens. Results showed that fetuses react similarly to the intact speech signal and to its degraded version (Granier-Deferre, Ribeiro, Jacquet, & Bassereau, 2011). Thus, fetuses perceive changes in the gross AM cues of sentences.

4.3 Speech perception with cochlear implants

These psychoacoustical studies emphasized the role of spectro-temporal modulations in speech perception. Moreover, these investigations have led to the development and improvement of signal-processing strategies for hearing-impaired listeners and deaf listeners wearing a cochlear implant. A cochlear implant (CI) is an electronic device implanted in the peripheral auditory system of
patients with severe to profound sensorineural hearing loss. CIs process incoming sounds as AM vocoders do: they deliver the AM cues of the original signal via a limited number of (relatively broad) analysis frequency channels (see Shannon, 2012). The poorest CI listeners effectively use four independent channels, whereas the best CI listeners use eight channels (Friesen, Shannon, Baskent, & Wang, 2001; Fu & Nogaki, 2005). The FM cues of the original signal are not transmitted by CI processors: they are replaced by a fixed train of pulses whose amplitude is modulated by the original AM cues. Thus, the output signal delivered by CIs on each electrode (and thus, its modulation content) is severely impoverished compared to the original signal. Several psychoacoustical studies conducted with adults equipped with a CI showed that CI patients may have normal, or sometimes better-than-normal, ability to detect AM in sounds (e.g., Shannon, 1992) and that this ability is an important predictor of CI success (e.g., Cazals, Pelizzone, Saudan, & Boex, 1994; Fu, 2002; Won, Drennan, Nie, Jameyson, & Rubinstein, 2011). Additional studies also found that adult CI patients may have normal or even better-than-normal ability to discriminate complex AM patterns such as ramped/damped temporal envelopes or so-called "2nd-order AM cues" (Lorenzi et al., 2004; Lorenzi, Gallégo, & Patterson, 1997, 1998). In comparison, very few studies investigated FM perception in CI patients. These studies found that, as expected, adult CI patients have a very poor ability to detect (sine) FM (Chen & Zeng, 2004; Luo & Fu, 2007). Altogether, these studies conducted with non-linguistic stimuli and detection/discrimination tasks confirmed that CI patients receive AM (temporal envelope) information relatively well over four to eight independent channels, but receive little information (if any) regarding FM (temporal fine structure).
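The envelope-on-pulse-train principle can be illustrated for a single channel as below. The 900-pulses-per-second rate, band edges and filter order are illustrative assumptions, not the parameters of any particular CI processor:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def ci_channel(x, fs, lo, hi, pulse_rate=900.0):
    """One CI-style channel: the band's AM envelope modulates a fixed
    pulse train; the input's FM (fine structure) is discarded."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    env = np.abs(hilbert(sosfiltfilt(sos, x)))  # temporal envelope of the band
    pulses = np.zeros(len(x))
    pulses[::int(fs / pulse_rate)] = 1.0        # fixed train of pulses
    return env * pulses
```

A full processor would run several such channels in parallel, each driving one electrode, which is why the signal reaching the auditory nerve contains AM but essentially no FM information.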
Thus, CI devices were first given to postlingually deaf adults, with relatively good success (e.g., Krueger et al., 2008; Spahr & Dorman, 2004; see Moore & Shannon, 2009, for a review). Later, CIs were given to children, and finally to prelingually deaf infants (e.g., Svirsky, Robbins, Kirk, Pisoni, & Miyamoto, 2000). In adults, a CI delivers an impoverished signal to a mature language processing system. In the case of congenitally deaf young children, a CI delivers the same impoverished signal to an immature auditory system and an undeveloped language processing system. CIs are now provided to infants at increasingly younger ages. However, how the immature auditory system deals
with such a reduced speech signal is still unknown. Moreover, how language skills develop under these specific circumstances is also unknown. The perception of speech with CIs could thus be viewed as a model to further explore the link between low-level auditory processes and language development. Several studies showed that deaf children fitted with CIs before the age of 3 years develop good receptive and productive language skills (e.g., Bouton, Serniclaes, Bertoncini, & Colé, 2012; Holt & Svirsky, 2008; Miyamoto, Houston, Kirk, Perdew, & Svirsky, 2003). Furthermore, earlier age at implantation has been shown to lead to better language skills (e.g., Connor & Zwolan, 2004; Dettman, Pinder, Briggs, Dowell, & Leigh, 2007; Nicholas & Geers, 2007; Richter, Eißele, Laszig, & Löhle, 2002; Svirsky, Teoh, & Neuburger, 2004; Tomblin, Barker, Spencer, Zhang, & Gantz, 2005). These results suggest several sensitive periods for spoken language development (see Kral & Sharma, 2012, for a review). In infants equipped with a CI, language skills develop, but large inter-individual variability is still observed (e.g., Svirsky et al., 2000). Age at implantation appears to be a major factor in language outcomes. However, what remains largely unknown is how congenitally deaf infants learn the properties of their native language using the impoverished spectro-temporal information delivered by CI processors.

4.4 Discussion and conclusions

Vocoders have proved to be useful tools for investigating the auditory mechanisms underlying speech perception. Moreover, they simulate in normal-hearing listeners the perception of impoverished speech signals similar to those conveyed by CIs. Normal-hearing adults show high speech recognition abilities on the basis of reduced speech modulation cues (e.g., Shannon et al., 1995). Note that the adults tested in these studies are “experts” in their native language, and their high identification performance may be, at least partially, due to their language skills (e.g., Hervais-Adelman, Davis, Johnsrude, Taylor, & Carlyon, 2011; Sohoglu, Peelle, Carlyon, & Davis, 2012). Children who have not attained this level of expertise show poorer identification performance with spectro-temporally reduced speech signals (Eisenberg et al., 2000; Newman & Chatterjee, 2013). However, in
a less demanding task, children are able to use spectro-temporally reduced signals to discriminate phonetic contrasts as well as adults do (Bertoncini et al., 2009). Together, these data suggest that the processing of speech modulation cues is not fully mature in children. There are no available data on younger infants between 6 and 12 months of age, that is, between infants who are not yet perceptually tuned to their native language and those who have reorganized their perception of speech under the influence of the native language. The mechanisms involved in this reorganization are not well understood, but they are clearly affected by learning and exposure to the linguistic environment. A possible consequence of this perceptual reorganization for speech may be that infants become more specialized in the processing of language-specific acoustic information.


V. Objectives of the present doctoral research

Our review points out the lack of information regarding the auditory processes involved in speech perception during infancy. Although the pioneering studies of infant speech perception remain inspiring, the present research program is based on a radically different approach. Our aim is neither to search for new acoustic invariants, nor to demonstrate specialization for speech sound processing in infants. This PhD work aims instead to explore the low-level spectro-temporal auditory mechanisms underlying speech discrimination and their development in infancy. The main purpose of the present research program is to characterize these early mechanisms by using a psychoacoustic approach to speech analysis focusing on the role of spectral and temporal modulations. The development of speech perception is thus reconsidered in light of the development of the capacity of the auditory system to detect and discriminate slow and fast AM and FM cues as found in speech. These basic spectral and temporal auditory processes will be tested using vocoded speech stimuli and behavioral discrimination tasks in infants. This PhD work aims to explore the following questions:

- To what extent are infants able to use modulation cues when listening to speech sounds? The first three studies explore the perception of AM and FM speech cues using vocoded syllables when 6-month-old infants have to discriminate two French phonetic contrasts (/aba/-/apa/ and /aba/-/ada/). The effects of changing the spectral and temporal resolution of the vocoded stimuli are also studied in 6-month-old infants to explore the robustness of speech-modulation encoding in infancy.

- How do auditory development (especially the perception of speech modulation cues) and linguistic experience impact speech processing? To what extent does the linguistic environment (which shapes speech perception) affect the basic auditory capacity to use speech modulation cues?


The last two studies investigate the effects of age and language experience on the processing of speech modulation cues by comparing the discrimination of intact and vocoded lexical tones in French-native and Mandarin-native adults and infants (aged 6 and 10 months).

This PhD work focuses on normal-hearing infants. However, some of the vocoders used here have been used to simulate cochlear implant processors in adults and infants with normal hearing. The implications of this work for pediatric cochlear implantation will be discussed in the General Discussion section.


References

Abdala, C. (1996). Distortion product otoacoustic emission (2f1-f2) amplitude as a function of f2/f1 frequency ratio and primary tone level separation in human adults and neonates. The Journal of the Acoustical Society of America, 100, 3726–3740.
Abdala, C., & Folsom, R. C. (1995a). The development of frequency resolution in humans as revealed by the auditory brain-stem response recorded with notched-noise masking. The Journal of the Acoustical Society of America, 98, 921–930.
Abdala, C., & Folsom, R. C. (1995b). Frequency contribution to the click-evoked auditory brain-stem response in human adults and infants. The Journal of the Acoustical Society of America, 97(4), 2394–2404.
Allen, P., Wightman, F., Kistler, D., & Dolan, T. (1989). Frequency resolution in children. Journal of Speech and Hearing Research, 32(2), 317–322.
Ardoint, M., & Lorenzi, C. (2010). Effects of lowpass and highpass filtering on the intelligibility of speech based on temporal fine structure or envelope cues. Hearing Research, 260(1-2), 89–95.
Armitage, S. E., Baldwin, B. A., & Vince, M. A. (1980). The fetal sound environment of sheep. Science, 208(4448), 1173–1174.
Aslin, R. N. (1989). Discrimination of frequency transitions by human infants. The Journal of the Acoustical Society of America, 86(2), 582–590.
Aslin, R. N., Pisoni, D. B., Hennessy, B. L., & Perey, A. J. (1981). Discrimination of voice onset time by human infants: New findings and implications for the effects of early experience. Child Development, 52(4), 1135–1145.
Aslin, R. N., Werker, J. F., & Morgan, J. L. (2002). Innate phonetic boundaries revisited (L). The Journal of the Acoustical Society of America, 112, 1257–1260.
Bargones, J. Y., & Burns, E. M. (1988). Suppression tuning curves for spontaneous otoacoustic emissions in infants and adults. The Journal of the Acoustical Society of America, 83, 1809–1816.
Bent, T., Bradlow, A. R., & Wright, B. A. (2006). The influence of linguistic experience on the cognitive processing of pitch in speech and nonspeech sounds. Journal of Experimental Psychology: Human Perception and Performance, 32(1), 97–103.
Bertoncini, J., Bijeljac-Babic, R., Blumstein, S. E., & Mehler, J. (1987). Discrimination in neonates of very short CVs. The Journal of the Acoustical Society of America, 82(1), 31–37.
Bertoncini, J., Bijeljac-Babic, R., Jusczyk, P. W., Kennedy, L. J., & Mehler, J. (1988). An investigation of young infants’ perceptual representations of speech sounds. Journal of Experimental Psychology: General, 117(1), 21–33.


Bertoncini, J., & Mehler, J. (1981). Syllables as units in infant speech perception. Infant Behavior and Development, 4, 247–260.
Bertoncini, J., Serniclaes, W., & Lorenzi, C. (2009). Discrimination of speech sounds based upon temporal envelope versus fine structure cues in 5- to 7-year-old children. Journal of Speech, Language, and Hearing Research, 52(3), 682–695.
Best, C. C., & McRoberts, G. W. (2003). Infant perception of non-native consonant contrasts that adults assimilate in different ways. Language and Speech, 46(2-3), 183–216.
Best, C. T., McRoberts, G. W., LaFleur, R., & Silver-Isenstadt, J. (1995). Divergent developmental patterns for infants' perception of two nonnative consonant contrasts. Infant Behavior and Development, 18(3), 339–350.
Bijeljac-Babic, R., Bertoncini, J., & Mehler, J. (1993). How do 4-day-old infants categorize multisyllabic utterances? Developmental Psychology, 29(4), 711–721.
Bouton, S., Serniclaes, W., Bertoncini, J., & Cole, P. (2012). Perception of speech features by French-speaking children with cochlear implants. Journal of Speech, Language, and Hearing Research, 55(1), 139–153.
Burnham, D., & Francis, E. (1997). The role of linguistic experience in the perception of Thai tones. Southeast Asian Linguistic Studies in Honour of Vichin Panupong, 29–47.
Burnham, D., & Mattock, K. (2010). Auditory development. The Wiley-Blackwell Handbook of Infant Development, 1, 81–119.
Buss, E., Hall, J. W., III, Grose, J. H., & Dev, M. B. (1999). Development of adult-like performance in backward, simultaneous, and forward masking. Journal of Speech, Language, and Hearing Research, 42(4), 844–849.
Cazals, Y., Pelizzone, M., Saudan, O., & Boex, C. (1994). Low-pass filtering in amplitude modulation detection associated with vowel and consonant identification in subjects with cochlear implants. The Journal of the Acoustical Society of America, 96, 2048–2054.
Chen, H., & Zeng, F.-G. (2004). Frequency modulation detection in cochlear implant subjects. The Journal of the Acoustical Society of America, 116, 2269–2277.
Cheour, M., Ceponiene, R., Lehtokoski, A., Luuk, A., Allik, J., Alho, K., & Näätänen, R. (1998). Development of language-specific phoneme representations in the infant brain. Nature Neuroscience, 1(5), 351–353.
Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America, 25, 975–979.
Colombo, J., & Horowitz, F. D. (1986). Infants' attentional responses to frequency modulated sweeps. Child Development, 57(2), 287–291.
Conboy, B. T., Sommerville, J. A., & Kuhl, P. K. (2008). Cognitive control factors in speech perception at 11 months. Developmental Psychology, 44(5), 1505–1512.
Connor, C. M., & Zwolan, T. A. (2004). Examining multiple sources of influence


on the reading comprehension skills of children who use cochlear implants. Journal of Speech, Language, and Hearing Research, 47(3), 509–525.
Cooke, M. (2006). A glimpsing model of speech perception in noise. The Journal of the Acoustical Society of America, 119(3), 1562–1573.
Cooper, F. S., Delattre, P. C., Liberman, A. M., Borst, J. M., & Gerstman, L. J. (1952). Some experiments on the perception of synthetic speech sounds. The Journal of the Acoustical Society of America, 24, 597–606.
Cooper, F. S., Liberman, A. M., & Borst, J. M. (1951). The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences of the United States of America, 37(5), 318–325.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997a). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. The Journal of the Acoustical Society of America, 102(5 Pt 1), 2892–2905.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997b). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. The Journal of the Acoustical Society of America, 102(5 Pt 1), 2906–2919.
Dawes, P., & Bishop, D. V. (2008). Maturation of visual and auditory temporal processing in school-aged children. Journal of Speech, Language, and Hearing Research, 51(4), 1002–1017.
Dehaene-Lambertz, G. (2000). Cerebral specialization for speech and non-speech stimuli in infants. Journal of Cognitive Neuroscience, 12(3), 449–460.
Dehaene-Lambertz, G., & Houston, D. (1998). Faster orientation latencies toward native language in two-month-old infants. Language and Speech, 41(1), 21–43.
Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. The Journal of the Acoustical Society of America, 27, 769–773.
Dettman, S. J., Pinder, D., Briggs, R. J., Dowell, R. C., & Leigh, J. R. (2007). Communication development in children who receive the cochlear implant younger than 12 months: Risks versus benefits. Ear and Hearing, 28(2), 11S–18S.
Diedler, J., Pietz, J., Bast, T., & Rupp, A. (2007). Auditory temporal resolution in children assessed by magnetoencephalography. Neuroreport, 18(16), 1691–1695.
Drullman, R. (1995). Temporal envelope and fine structure cues for speech intelligibility. The Journal of the Acoustical Society of America, 97(1), 585–592.
Dudley, H. (1939). Remaking speech. The Journal of the Acoustical Society of America, 11, 169–177.
Eaves, J. M., Summerfield, A. Q., & Kitterick, P. T. (2011). Benefit of temporal fine structure to speech perception in noise measured with controlled


temporal envelopes. The Journal of the Acoustical Society of America, 130(1), 501–507.
Eilers, R. E., Gavin, W., & Wilson, W. R. (1979). Linguistic experience and phonemic perception in infancy: A crosslinguistic study. Child Development, 14–18.
Eimas, P. D. (1974). Auditory and linguistic processing of cues for place of articulation by infants. Attention, Perception, & Psychophysics, 16(3), 513–521.
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171(3968), 303–306.
Eisenberg, L. S., Shannon, R. V., Martinez, A. S., Wygonski, J., & Boothroyd, A. (2000). Speech recognition with reduced spectral cues as a function of age. The Journal of the Acoustical Society of America, 107(5 Pt 1), 2704–2710.
Elhilali, M., Chi, T., & Shamma, S. A. (2003). A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Communication, 41(2), 331–348.
Fernald, A. (1985). Four-month-old infants prefer to listen to motherese. Infant Behavior and Development, 8(2), 181–195.
Fernald, A., & Kuhl, P. (1987). Acoustic determinants of infant preference for motherese speech. Infant Behavior and Development, 10(3), 279–293.
Flanagan, J. L. (1972). Speech analysis synthesis and perception. Springer-Verlag.
Flanagan, J. L. (1980). Parametric coding of speech spectra. The Journal of the Acoustical Society of America, 68, 412–419.
Flanagan, J. L., Meinhart, D. I. S., Golden, R. M., & Sondhi, M. M. (1965). Phase vocoder. The Journal of the Acoustical Society of America, 38, 939–940.
Fogerty, D. (2011). Perceptual weighting of the envelope and fine structure across frequency bands for sentence intelligibility: Effect of interruption at the syllabic-rate and periodic-rate of speech. The Journal of the Acoustical Society of America, 130(1), 489–500.
Fogerty, D., & Humes, L. E. (2012). A correlational method to concurrently measure envelope and temporal fine structure weights: Effects of age, cochlear pathology, and spectral shaping. The Journal of the Acoustical Society of America, 132(3), 1679–1689.
Folsom, R. C., & Wynne, M. K. (1987). Auditory brain stem responses from human adults and infants: Wave V tuning curves. The Journal of the Acoustical Society of America, 81, 412–419.
Fowler, C. A., & Rosenblum, L. D. (1990). Duplex perception: A comparison of monosyllables and slamming doors. Journal of Experimental Psychology: Human Perception and Performance, 16(4), 742–754.
Friesen, L. M., Shannon, R. V., Baskent, D., & Wang, X. (2001). Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants. The Journal of the Acoustical Society of America, 110(2), 1150–1163.


Fu, Q.-J. (2002). Temporal processing and speech recognition in cochlear implant users. Neuroreport, 13(13), 1635–1644.
Fu, Q.-J., & Nogaki, G. (2005). Noise susceptibility of cochlear implant users: The role of spectral resolution and smearing. Journal of the Association for Research in Otolaryngology, 6(1), 19–27.
Fu, Q.-J., Zeng, F.-G., Shannon, R. V., & Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. The Journal of the Acoustical Society of America, 104, 505–515.
Gervain, J., & Mehler, J. (2010). Speech perception and language acquisition in the first year of life. Annual Review of Psychology, 61, 191–218.
Gilbert, G., Bergeras, I., Voillery, D., & Lorenzi, C. (2007). Effects of periodic interruptions on the intelligibility of speech based on temporal fine-structure or envelope cues. The Journal of the Acoustical Society of America, 122(3), 1336–1339.
Gilbert, G., & Lorenzi, C. (2006). The ability of listeners to use recovered envelope cues from speech fine structure. The Journal of the Acoustical Society of America, 119(4), 2438–2444.
Glasberg, B. R., & Moore, B. C. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1-2), 103–138.
Gnansia, D., Péan, V., Meyer, B., & Lorenzi, C. (2009). Effects of spectral smearing and temporal fine structure degradation on speech masking release. The Journal of the Acoustical Society of America, 125(6), 4023–4033.
Granier-Deferre, C., Lecanuet, J. P., Cohen, H., & Busnel, M. C. (1985). Feasibility of prenatal hearing test. Acta Oto-Laryngologica, 99(S421), 93–101.
Granier-Deferre, C., Ribeiro, A., Jacquet, A.-Y., & Bassereau, S. (2011). Near-term fetuses process temporal features of speech. Developmental Science, 14(2), 336–352.
Hall, J. W., III, & Grose, J. H. (1991). Notched-noise measures of frequency selectivity in adults and children using fixed-masker-level and fixed-signal-level presentation. Journal of Speech, Language, and Hearing Research, 34(3), 651–660.
Hall, J. W., III, & Grose, J. H. (1994). Development of temporal resolution in children as measured by the temporal modulation transfer function. The Journal of the Acoustical Society of America, 96(1), 150–154.
Hall, M. D., & Pastore, R. E. (1992). Musical duplex perception: Perception of figurally good chords with subliminal distinguishing tones. Journal of Experimental Psychology: Human Perception and Performance, 18(3), 752–762.
Hallé, P. A., Chang, Y.-C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395–421.
Heinz, M. G., Colburn, H. S., & Carney, L. H. (2001). Evaluating auditory


performance limits: I. One-parameter discrimination using a computational model for the auditory nerve. Neural Computation, 13(10), 2273–2316.
Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., & Carlyon, R. P. (2011). Generalization of perceptual learning of vocoded speech. Journal of Experimental Psychology: Human Perception and Performance, 37(1), 283–295.
Holt, R. F., & Svirsky, M. A. (2008). An exploratory look at pediatric cochlear implantation: Is earliest always best? Ear and Hearing, 29(4), 492–511.
Hoonhorst, I., Colin, C., Markessis, E., Radeau, M., Deltenre, P., & Serniclaes, W. (2009). French native speakers in the making: From language-general to language-specific voicing boundaries. Journal of Experimental Child Psychology, 104(4), 353–366.
Hopkins, K., Moore, B. C. J., & Stone, M. A. (2010). The effects of the addition of low-level, low-noise noise on the intelligibility of sentences processed to remove temporal envelope information. The Journal of the Acoustical Society of America, 128(4), 2150–2161.
Irwin, R. J., Stillman, J. A., & Schade, A. (1986). The width of the auditory filter in children. Journal of Experimental Child Psychology, 41(3), 429–442.
Jensen, J. K., & Neff, D. L. (1993). Development of basic auditory discrimination in preschool children. Psychological Science, 4(2), 104–107.
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. The Journal of the Acoustical Society of America, 68, 1115–1137.
Jørgensen, S., & Dau, T. (2013). Modelling speech intelligibility in adverse conditions. Advances in Experimental Medicine and Biology, 787, 343–351.
Joris, P. X., Schreiner, C. E., & Rees, A. (2004). Neural processing of amplitude-modulated sounds. Physiological Reviews, 84(2), 541–577.
Joris, P. X., & Yin, T. C. (1992). Responses to amplitude-modulated tones in the auditory nerve of the cat. The Journal of the Acoustical Society of America, 91(1), 215–232.
Jusczyk, P. W., & Bertoncini, J. (1988). Viewing the development of speech perception as an innately guided learning process. Language and Speech, 31(3), 217–238.
Jusczyk, P. W., Pisoni, D. B., Reed, M. A., Fernald, A., & Myers, M. (1983). Infants' discrimination of the duration of a rapid spectrum change in nonspeech signals. Science, 222(4620), 175–177.
Kale, S., & Heinz, M. G. (2010). Envelope coding in auditory nerve fibers following noise-induced hearing loss. Journal of the Association for Research in Otolaryngology, 11(4), 657–673.
Kettner, R. E., Feng, J. Z., & Brugge, J. F. (1985). Postnatal development of the phase-locked response to low frequency tones of auditory nerve fibers in the cat. The Journal of Neuroscience, 5(2), 275–283.


Kiang, N. Y., Pfeiffer, R. R., Warr, W. B., & Backus, A. S. (1965). Stimulus coding in the cochlear nucleus. Transactions of the American Otological Society, 53, 35–58.
Kong, Y.-Y., & Zeng, F.-G. (2006). Temporal and spectral cues in Mandarin tone recognition. The Journal of the Acoustical Society of America, 120(5 Pt 1), 2830–2840.
Kral, A., & Sharma, A. (2012). Developmental neuroplasticity after cochlear implantation. Trends in Neurosciences, 35(2), 111–122.
Krueger, B., Joseph, G., Rost, U., Strauss-Schier, A., Lenarz, T., & Buechner, A. (2008). Performance groups in adult cochlear implant users: Speech perception results from 1984 until today. Otology & Neurotology, 29(4), 509–512.
Kuhl, P. K. (1991). Human adults and human infants show a "perceptual magnet effect" for the prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50(2), 93–107.
Kuhl, P. K. (1993). Innate predispositions and the effects of experience in speech perception: The native language magnet theory. In Developmental neurocognition: Speech and face processing in the first year of life (pp. 259–274). Springer.
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843.
Kuhl, P. K., & Miller, J. D. (1975). Speech perception by the chinchilla: Voiced-voiceless distinction in alveolar plosive consonants. Science, 190(4209), 69–72.
Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America, 63, 905–922.
Kuhl, P. K., & Padden, D. M. (1982). Enhanced discriminability at the phonetic boundaries for the voicing feature in macaques. Perception & Psychophysics, 32(6), 542–550.
Kuhl, P. K., & Padden, D. M. (1983). Enhanced discriminability at the phonetic boundaries for the place feature in macaques. The Journal of the Acoustical Society of America, 73, 1003–1013.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608.
Kuo, Y.-C., Rosen, S., & Faulkner, A. (2008). Acoustic cues to tonal contrasts in Mandarin: Implications for cochlear implants. The Journal of the


Acoustical Society of America, 123, 2815–2864.
Lasky, R. E., Syrdal-Lasky, A., & Klein, R. E. (1975). VOT discrimination by four to six and a half month old infants from Spanish environments. Journal of Experimental Child Psychology, 20(2), 215–225.
Leech, R., Holt, L. L., Devlin, J. T., & Dick, F. (2009). Expertise with artificial nonspeech sounds recruits speech-sensitive cortical regions. The Journal of Neuroscience, 29(16), 5234–5239.
Leibold, L. J., & Werner, L. A. (2007). Infant auditory sensitivity to pure tones and frequency-modulated tones. Infancy, 12(2), 225–233.
Levi, E. C., Folsom, R. C., & Dobie, R. A. (1995). Coherence analysis of envelope-following responses (EFRs) and frequency-following responses (FFRs) in infants and adults. Hearing Research, 89(1-2), 21–27.
Levi, E. C., & Werner, L. A. (1996). Amplitude modulation detection in infancy: Update on 3-month-olds. Association for Research in Otolaryngology, 19, 142–150.
Liberman, A. M. (1970). Some characteristics of perception in the speech mode. Research Publications - Association for Research in Nervous and Mental Disease, 48, 238–254.
Liberman, A. M. (1982). On finding that speech is special. American Psychologist, 37(2), 148–167.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Liberman, A. M., Delattre, P. C., Cooper, F. S., & Gerstman, L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68(8), 1–13.
Liberman, A. M., Harris, K. S., Eimas, P., Lisker, L., & Bastian, J. (1961). An effect of learning on speech perception: The discrimination of durations of silence with and without phonemic significance. Language and Speech, 4(4), 175–195.
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54(5), 358–368.
Liberman, A. M., Harris, K. S., Kinney, J. A., & Lane, H. (1961). The discrimination of relative onset-time of the components of certain speech and nonspeech patterns. Journal of Experimental Psychology, 61(5), 379–388.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36.
Liberman, A. M., & Mattingly, I. G. (1989). A specialization for speech perception. Science, 243(4890), 489–494.
Licklider, J. C. (1952). On the process of speech perception. The Journal of the Acoustical Society of America, 24, 590–594.


Lisker, L., & Abramson, A. S. (1970). The voicing dimension: Some experiments in comparative phonetics. In Proceedings of the 6th International Congress of Phonetic Sciences (pp. 563–567).
Lorenzi, C., Gallégo, S., & Patterson, R. D. (1997). Discrimination of temporal asymmetry in cochlear implantees. The Journal of the Acoustical Society of America, 102, 482–485.
Lorenzi, C., Gallégo, S., & Patterson, R. D. (1998). Amplitude compression in cochlear implants artificially restricts the perception of temporal asymmetry. British Journal of Audiology, 32(6), 367–374.
Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., & Moore, B. C. J. (2006). Speech perception problems of the hearing impaired reflect inability to use temporal fine structure. Proceedings of the National Academy of Sciences, 103(49), 18866–18869.
Lorenzi, C., Sibellas, J., Füllgrabe, C., Gallégo, S., Fugain, C., & Meyer, B. (2004). Effects of amplitude compression on first- and second-order modulation detection thresholds in cochlear implant listeners. International Journal of Audiology, 43(5), 264–270.
Luo, X., & Fu, Q.-J. (2007). Frequency modulation detection with simultaneous amplitude modulation by cochlear implant users. The Journal of the Acoustical Society of America, 122, 1046–1054.
Mann, V. A., & Liberman, A. M. (1983). Some differences between phonetic and auditory modes of perception. Cognition, 14(2), 211–235.
Mattock, K., & Burnham, D. (2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265.
Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106(3), 1367–1381.
Mattys, S. L., Jusczyk, P. W., Luce, P. A., & Morgan, J. L. (1999). Phonotactic and prosodic effects on word segmentation in infants. Cognitive Psychology, 38(4), 465–494.
Maxon, A. B., & Hochberg, I. (1982). Development of psychoacoustic behavior: Sensitivity and discrimination. Ear and Hearing, 3(6), 301–308.
Mehler, J., Dupoux, E., Nazzi, T., & Dehaene-Lambertz, G. (1996). Coping with linguistic diversity: The infant's viewpoint. Signal to syntax: Bootstrapping from speech to grammar in early acquisition, 101–116.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2), 143–178.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27, 338–352.
Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. The Journal of the Acoustical Society of America, 85, 2114–2134.


Miller, J. D., Wier, C. C., Pastore, R. E., Kelly, W. J., & Dooling, R. J. (1976). Discrimination and labeling of noise-buzz sequences with varying noise-lead times: An example of categorical perception. The Journal of the Acoustical Society of America, 60(2), 410–417.
Miyamoto, R. T., Houston, D. M., Kirk, K. I., Perdew, A. E., & Svirsky, M. A. (2003). Language development in deaf infants following cochlear implantation. Acta Oto-Laryngologica, 123(2), 241–244.
Miyawaki, K., Jenkins, J. J., Strange, W., Liberman, A. M., Verbrugge, R., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics, 18(5), 331–340.
Moore, B. C. J. (1974). Relation between the critical bandwidth and the frequency-difference limen. The Journal of the Acoustical Society of America, 55(2), 359.
Moore, B. C. J. (2003). Speech processing for the hearing-impaired: Successes, failures, and implications for speech mechanisms. Speech Communication, 41(1), 81–91.
Moore, B. C. J. (2004). An introduction to the psychology of hearing (Vol. 4). Academic Press, San Diego.
Moore, B. C. J. (2012). Effects of bandwidth, compression speed, and gain at high frequencies on preferences for amplified music. Trends in Amplification, 16(3), 159–172.
Moore, D. R. (2002). Auditory development and the role of experience. British Medical Bulletin, 63(1), 171–181.
Moore, D. R., & Shannon, R. V. (2009). Beyond cochlear implants: Awakening the deafened brain. Nature Neuroscience, 12(6), 686–691.
Moore, D. S., Spence, M. J., & Katz, G. S. (1997). Six-month-olds' categorization of natural infant-directed utterances. Developmental Psychology, 33(6), 980–989.
Morlet, T., Lapillonne, A., Ferber, C., Duclaux, R., Sann, L., Putet, G., … Collet, L. (1995). Spontaneous otoacoustic emissions in preterm neonates: Prevalence and gender effects. Hearing Research, 90(1), 44–54.
Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756–766.
Nelson, P. B., Jin, S.-H., Carney, A. E., & Nelson, D. A. (2003). Understanding speech in modulated interference: Cochlear implant users and normal-hearing listeners. The Journal of the Acoustical Society of America, 113(2), 961–968.
Newman, R. S. (2005). The cocktail party effect in infants revisited: Listening to one's name in noise. Developmental Psychology, 41(2), 352.
Newman, R. S. (2009). Infants' listening in multitalker environments: Effect of the number of background talkers. Attention, Perception, &


Psychophysics, 71(4), 822–836.
Newman, R., & Chatterjee, M. (2013). Toddlers' recognition of noise-vocoded speech. The Journal of the Acoustical Society of America, 133(1), 483–494.
Newman, R. S., & Jusczyk, P. W. (1996). The cocktail party effect in infants. Perception & Psychophysics, 58(8), 1145–1156.
Nicholas, J. G., & Geers, A. E. (2007). Will they catch up? The role of age at cochlear implantation in the spoken language development of children with severe to profound hearing loss. Journal of Speech, Language, and Hearing Research, 50(4), 1048–1062.
Nittrouer, S. (2002). Learning to perceive speech: How fricative perception changes, and how it stays the same. The Journal of the Acoustical Society of America, 112, 711–719.
Nittrouer, S., & Lowenstein, J. H. (2010). Learning to perceptually organize speech signals in native fashion. The Journal of the Acoustical Society of America, 127(3), 1624–1635.
Nittrouer, S., Lowenstein, J. H., & Packer, R. R. (2009). Children discover the spectral skeletons in their native language before the amplitude envelopes. Journal of Experimental Psychology: Human Perception and Performance, 35(4), 1245–1253.
Nozza, R. J., Rossman, R. N., Bond, L. C., & Miller, S. L. (1990). Infant speech-sound discrimination in noise. The Journal of the Acoustical Society of America, 87, 339–350.
Nozza, R. J., Wagner, E. F., & Crandell, M. A. (1988). Binaural release from masking for a speech sound in infants, preschoolers, and adults. Journal of Speech, Language, and Hearing Research, 31(2), 212–218.
Olsho, L. W. (1984). Infant frequency discrimination. Infant Behavior and Development, 7(1), 27–35.
Olsho, L. W. (1985). Infant auditory perception: Tonal masking. Infant Behavior and Development, 8(4), 371–384.
Olsho, L. W., Koch, E. G., & Carter, E. A. (1988). Nonsensory factors in infant frequency discrimination. Infant Behavior and Development, 11(2), 205–222.
Olsho, L. W., Koch, E. G., & Halpin, C. F. (1987). Level and age effects in infant frequency discrimination. The Journal of the Acoustical Society of America, 82(2), 454–464.
Olsho, L. W., Schoon, C., Sakai, R., Turpin, R., & Sperduto, V. (1982). Preliminary data on frequency discrimination in infancy. The Journal of the Acoustical Society of America, 71, 509–511.
Palmer, A. R., Winter, I. M., & Darwin, C. J. (1986). The representation of steady-state vowel sounds in the temporal discharge patterns of the guinea pig cochlear nerve and primarylike cochlear nucleus neurons. The Journal of the Acoustical Society of America, 79(1), 100–113.
Pastore, R. E., Schmuckler, M. A., Rosenblum, L., & Szczesiul, R. (1983).


Duplex perception with musical stimuli. Perception & Psychophysics, 33(5), 469–474.
Pickles, J. O. (1988). An introduction to the physiology of hearing (Vol. 2). Academic Press, London.
Pisoni, D. B. (1977). Identification and discrimination of the relative onset time of two component tones: Implications for voicing perception in stops. The Journal of the Acoustical Society of America, 61(5), 1352–1361.
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421–435.
Pujol, R., & Lavigne-Rebillard, M. (1992). Development of neurosensory structures in the human cochlea. Acta Oto-Laryngologica, 112(2), 259–264.
Qin, M. K., & Oxenham, A. J. (2003). Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers. The Journal of the Acoustical Society of America, 114(1), 446–454.
Querleu, D., Renard, X., Versyp, F., Paris-Delrue, L., & Crèpin, G. (1988). Fetal hearing. European Journal of Obstetrics, Gynecology, and Reproductive Biology, 28(3), 191–212.
Querleu, D., Renard, X., Versyp, F., Paris-Delrue, L., Vervoort, P., & Crepin, G. (1986). Can the fetus listen and learn. British Journal of Obstetrics and Gynaecology, 93(4), 411–412.
Remez, R. E. (2008). Perceptual organization of speech. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 28–50).
Remez, R. E., & Rubin, P. E. (1990). On the perception of speech from time-varying acoustic information: Contributions of amplitude variation. Perception & Psychophysics, 48(4), 313–325.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212(4497), 947–949.
Richter, B., Eißele, S., Laszig, R., & Löhle, E. (2002). Receptive and expressive language skills of 106 children with a minimum of 2 years' experience in hearing with a cochlear implant. International Journal of Pediatric Otorhinolaryngology, 64(2), 111–125.
Rivera-Gaxiola, M., Silva-Pereyra, J., & Kuhl, P. K. (2005). Brain potentials to native and non-native speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8(2), 162–172.
Rose, J. E., Brugge, J. F., Anderson, D. J., & Hind, J. E. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. Journal of Neurophysiology, 30(4), 769–793.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 336(1278), 367–373.
Rosen, S., & Iverson, P. (2007). Constructing adequate non-speech analogues: What is special about speech anyway? Developmental Science, 10(2), 165–168.


Rvachew, S., Mattock, K., Polka, L., & Ménard, L. (2006). Developmental and cross-linguistic variation in the infant vowel space: The case of Canadian English and Canadian French. The Journal of the Acoustical Society of America, 120, 2250–2259.
Saffran, J. R., Werker, J. F., & Werner, L. A. (2006). The infant's auditory world: Hearing, speech, and the beginnings of language. In D. Kuhn & R. Siegler (Eds.), Handbook of child psychology (Vol. 2, pp. 58–108).
Schneider, B. A., Morrongiello, B. A., & Trehub, S. E. (1990). Size of critical band in infants, children, and adults. Journal of Experimental Psychology: Human Perception and Performance, 16(3), 642–652.
Shamma, S., & Lorenzi, C. (2013). On the balance of envelope and temporal fine structure in the encoding of speech in the early auditory system. The Journal of the Acoustical Society of America, 133(5), 2818–2833.
Shannon, R. V. (1992). Temporal modulation transfer functions in patients with cochlear implants. The Journal of the Acoustical Society of America, 91, 2156–2164.
Shannon, R. V. (2012). Advances in auditory prostheses. Current Opinion in Neurology, 25(1), 61–66.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Sheft, S., Ardoint, M., & Lorenzi, C. (2008). Speech identification based on temporal fine structure cues. The Journal of the Acoustical Society of America, 124(1), 562–575.
Sinnott, J. M., & Aslin, R. N. (1985). Frequency and intensity discrimination in human infants and adults. The Journal of the Acoustical Society of America, 78, 1986–1992.
Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416(6876), 87–90.
Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2012). Predictive top-down integration of prior knowledge during speech perception. The Journal of Neuroscience, 32(25), 8443–8453.
Spahr, A. J., & Dorman, M. F. (2004). Performance of subjects fit with the Advanced Bionics CII and Nucleus 3G cochlear implant devices. Archives of Otolaryngology - Head & Neck Surgery, 130(5), 624–628.
Spence, M. J., & DeCasper, A. J. (1987). Prenatal experience with low-frequency maternal-voice sounds influence neonatal perception of maternal voice samples. Infant Behavior and Development, 10(2), 133–142.
Spence, M. J., & Freeman, M. S. (1996). Newborn infants prefer the maternal low-pass filtered voice, but not the maternal whispered voice. Infant Behavior and Development, 19(2), 199–212.
Spetner, N. B., & Olsho, L. W. (1990). Auditory frequency resolution in human infancy. Child Development, 61(3), 632–652.
Stevens, K. N. (1980). Acoustic correlates of some phonetic categories. The


Journal of the Acoustical Society of America, 68, 836–842.
Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. The Journal of the Acoustical Society of America, 64(5), 1358–1368.
Streeter, L. A. (1976). Language perception of 2-mo-old infants shows effects of both innate mechanisms and experience. Nature, 259(5538), 39–41.
Sutcliffe, P., & Bishop, D. (2005). Psychophysical design influences frequency discrimination performance in young children. Journal of Experimental Child Psychology, 91(3), 249–270.
Svirsky, M. A., Robbins, A. M., Kirk, K. I., Pisoni, D. B., & Miyamoto, R. T. (2000). Language development in profoundly deaf children with cochlear implants. Psychological Science, 11(2), 153–158.
Svirsky, M. A., Teoh, S.-W., & Neuburger, H. (2004). Development of language and speech perception in congenitally, profoundly deaf children as a function of age at cochlear implantation. Audiology and Neurotology, 9(4), 224–233.
Telkemeyer, S., Rossi, S., Koch, S. P., Nierhaus, T., Steinbrink, J., Poeppel, D., … Wartenburger, I. (2009). Sensitivity of newborn auditory cortex to the temporal structure of sounds. The Journal of Neuroscience, 29(47), 14726–14733.
Thompson, N. C., Cranford, J. L., & Hoyer, E. (1999). Brief-tone frequency discrimination by children. Journal of Speech, Language, and Hearing Research, 42(5), 1061–1068.
Tomblin, J. B., Barker, B. A., Spencer, L. J., Zhang, X., & Gantz, B. J. (2005). The effect of age at cochlear implant initial stimulation on expressive language growth in infants and toddlers. Journal of Speech, Language, and Hearing Research, 48(4), 853–867.
Trainor, L. J., Samuel, S. S., Desjardins, R. N., & Sonnadara, R. R. (2001). Measuring temporal resolution in infants using mismatch negativity. Neuroreport, 12(11), 2443–2448.
Trehub, S. E. (1973). Infants' sensitivity to vowel and tonal contrasts. Developmental Psychology, 9(1), 91–96.
Trehub, S. E. (1976). The discrimination of foreign speech contrasts by infants and adults. Child Development, 466–472.
Trehub, S. E., Schneider, B. A., & Bull, D. (1981). Effect of reinforcement on infants' performance in an auditory detection task. Developmental Psychology, 17(6), 872–877.
Trehub, S. E., Schneider, B. A., & Henderson, J. L. (1995). Gap detection in infants, children, and adults. The Journal of the Acoustical Society of America, 98(5 Pt 1), 2532–2541.
Tsao, F.-M., Liu, H.-M., & Kuhl, P. K. (2006). Perception of native and non-native affricate-fricative contrasts: Cross-language tests on adults and infants. The Journal of the Acoustical Society of America, 120(4), 2285–2294.


Tsushima, T., Takizawa, O., Sasaki, M., Shiraki, S., Nishi, K., Kohno, M., … Best, C. (1994). Discrimination of English /r-l/ and /w-y/ by Japanese infants at 6-12 months: Language-specific developmental changes in speech perception abilities. In Third International Conference on Spoken Language Processing.
Vouloumanos, A., Hauser, M. D., Werker, J. F., & Martin, A. (2010). The tuning of human neonates’ preference for speech. Child Development, 81(2), 517–527.
Vouloumanos, A., & Werker, J. F. (2007). Listening to language at birth: Evidence for a bias for speech in neonates. Developmental Science, 10(2), 159–164.
Wang, S., Xu, L., & Mannell, R. (2011). Relative contributions of temporal envelope and fine structure cues to lexical tone recognition in hearing-impaired listeners. Journal of the Association for Research in Otolaryngology, 12(6), 783–794.
Werker, J. F., & Tees, R. C. (1984). Phonemic and phonetic factors in adult cross-language speech perception. The Journal of the Acoustical Society of America, 75, 1866–1878.
Werker, J. F., & Tees, R. C. (1992). The organization and reorganization of human speech perception. Annual Review of Neuroscience, 15(1), 377–402.
Werker, J. F., & Tees, R. C. (1999). Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology, 50, 509–535.
Werker, J. F., & Tees, R. C. (2005). Speech perception as a window for understanding plasticity and commitment in language systems of the brain. Developmental Psychobiology, 46(3), 233–251.
Werker, J. F., Gilbert, J. H., Humphrey, K., & Tees, R. C. (1981). Developmental aspects of cross-language speech perception. Child Development, 349–355.
Werner, L. A. (1999). Forward masking among infant and adult listeners. The Journal of the Acoustical Society of America, 105(4), 2445–2453.
Werner, L. A. (2007). Issues in human auditory development. Journal of Communication Disorders, 40(4), 275–283.
Werner, L. A. (2013). Infants’ detection and discrimination of sounds in modulated maskers. The Journal of the Acoustical Society of America, 133(6), 4156–4167.
Werner, L. A., Marean, G. C., Halpin, C. F., Spetner, N. B., & Gillenwater, J. M. (1992). Infant auditory temporal acuity: Gap detection. Child Development, 63(2), 260–272.
Werner, L. A., Folsom, R. C., Mancl, L. R., & Syapin, C. L. (2001). Human auditory brainstem response to temporal gaps in noise. Journal of Speech, Language & Hearing Research, 44(4), 737–750.
Whalen, D. H., & Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica, 49(1), 25–47.
Wightman, F., Allen, P., Dolan, T., Kistler, D., & Jamieson, D. (1989). Temporal resolution in children. Child Development, 60(3), 611–624.
Won, J. H., Drennan, W. R., Nie, K., Jameyson, E. M., & Rubinstein, J. T. (2011). Acoustic temporal modulation detection and speech perception in cochlear implant listeners. The Journal of the Acoustical Society of America, 130, 376.
Xu, L., & Pfingst, B. E. (2008). Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research, 242(1-2), 132–140.
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68(2), 123–139.
Zatorre, R. J., & Belin, P. (2001). Spectral and temporal processing in human auditory cortex. Cerebral Cortex, 11(10), 946–953.
Zeng, F.-G., Nie, K., Stickney, G. S., Kong, Y.-Y., Vongphoe, M., Bhargave, A., … Cao, K. (2005). Speech recognition with amplitude and frequency modulations. Proceedings of the National Academy of Sciences of the United States of America, 102(7), 2293–2298.


Chapter 2. Discrimination of voicing on the basis of AM cues in French 6-month-old infants (head-turn preference procedure)


Chapter 2. Discrimination of voicing on the basis of AM cues in French 6-month-old infants (head-turn preference procedure)

1. Introduction: Six-month-old infants discriminate voicing on the basis of temporal-envelope cues

This chapter presents a pilot experimental study designed to assess whether infants can discriminate syllables on the sole basis of relatively slow AM cues extracted in relatively narrow frequency bands. These discrimination abilities are assessed with the head-turn preference procedure (HPP, adapted from Hirsh-Pasek et al., 1987), a behavioral method classically used to investigate infant speech perception. The population tested in this study is composed of French-learning 6-month-olds with normal hearing. At 6 months, infants show well-developed auditory abilities (as described in Chapter 1, III). More precisely, the frequency and temporal resolution of the auditory system should be mature by around this age (see Saffran, Werker, & Werner, 2006 for a review). Infants are not yet entirely tuned to their native language for consonant perception, but they show a clear ability to discriminate and categorize speech sounds (e.g., Kuhl, 2004). Here, it is hypothesized that 6-month-olds are able to discriminate a phonetic contrast on the sole basis of the relatively slow AM speech cues extracted in 16 frequency bands. The present study evaluates infants’ ability to discriminate a French voicing contrast (/aba/-/apa/). In the studies of Eisenberg, Shannon, Martinez, Wygonski, and Boothroyd (2000) and Bertoncini, Serniclaes, and Lorenzi (2009), children’s discrimination scores were lowest for the voicing contrast, and this result motivated our choice to assess whether voicing is also difficult for infants to discriminate. In the present experiment, the same tone-excited vocoder as in Bertoncini et al. (2009) is used with 6-month-old infants. The AM and FM cues of the original speech syllables are extracted in 16 frequency bands. The cutoff frequency for AM extraction is set to 64 Hz, and the original FM cues are severely degraded by replacing the carriers with pure tones.

Voicing is signaled by several redundant spectro-temporal acoustic cues. For instance, slow (2-50 Hz) AM cues signal the existence and duration of silent intervals, which are known to be important in distinguishing voiced from voiceless plosives in intervocalic position. However, these very slow AM cues to voicing contrasts are known to be relatively weak. The presence of low-frequency (50-500 Hz) periodic (and F0-related) acoustic energy is probably the most important cue to the phonological feature of voicing. These faster, periodic fluctuations appear in both the AM and FM domains. In addition, voiced sounds have a power spectrum (i.e., a spectral envelope) heavily weighted towards low (<1 kHz) frequencies and hence tend to have relatively lower fluctuation rates than voiceless sounds. These fine spectro-temporal details appear in the FM domain only. The vocoder used in the following study differentially altered these acoustic cues signaling voicing, but preserved the duration of the silent interval and some of the F0-related cues.

In the present pilot experiment, the discrimination of vocoded syllables is assessed with the HPP. With this method, infants have to turn their head toward a visual stimulation (a blinking light) to listen to sound sequences. In this experiment, the preference for sequences of alternating stimuli (composed of /aba/ and /apa/) is tested against sequences of repeated stimuli (/aba-aba/ or /apa-apa/; e.g., Best & Jones, 1998). If infants perceive voicing, they should listen longer to the alternating sequences than to the repeated ones. The results show that 6-month-old infants have a significant preference for the alternating sequences even with reduced speech modulation cues. Thus, normal-hearing infants are able to use relatively slow AM cues alone (< 64 Hz) to perceive a voicing difference. These preliminary results are explored further in a second experiment using the HPP (Chapter 3).
It is important to note that in the present work, discrimination of voicing requires the extraction/computation of invariant modulation features. In other words, this task requires more sophisticated processing than that required for basic detection and auditory discrimination tasks (e.g., Bertoncini, Bijeljac-Babic, Jusczyk, Kennedy, & Mehler, 1988; Holt, 2011).

The first article presented in the present chapter was published in the Journal of the Acoustical Society of America in May 2011: Bertoncini, J., Nazzi, T., Cabrera, L., & Lorenzi, C. (2011). Six-month-old infants discriminate voicing on the basis of temporal envelope cues. Journal of the Acoustical Society of America, 129(5), 2761–2764. DOI: 10.1121/1.3571424.


2. Article: Bertoncini, Nazzi, Cabrera & Lorenzi (2011)

Six-month-old infants discriminate voicing on the basis of temporal envelope cues

Josiane Bertoncini Université Paris Descartes Laboratoire de Psychologie de la Perception, CNRS 45 rue des Sts Pères, 75006 Paris, France Thierry Nazzi Université Paris Descartes Laboratoire de Psychologie de la Perception, CNRS 45 rue des Sts Pères, 75006 Paris, France Laurianne Cabrera Université Paris Descartes 45 rue des Sts Pères, 75006 Paris, France Christian Lorenzi Université Paris Descartes Laboratoire de Psychologie de la Perception, ENS, 29 rue d’Ulm, 75005 Paris, France

Article submitted to the Journal of the Acoustical Society of America On: September 2010 Principal PACS numbers: 43.71.Es, 43.71.Ft, 43.71.Mk Running title: Voicing discrimination in infants


ABSTRACT

Profoundly deaf children receiving a cochlear implant (CI) under the age of 2 years perform well despite receiving degraded speech that retains mainly temporal-envelope (E) cues over a limited number of frequency bands. This suggests that very young children are able to use E cues efficiently to perceive speech. This assumption was tested by assessing voicing discrimination solely on the basis of E cues in twenty 6-month-old, normal-hearing (NH) infants. A head-turn preference procedure (HPP) was used to measure infants’ looking times during presentation of different tokens of /aba/ and /apa/, processed to retain E cues below 64 Hz while degrading temporal fine structure cues within 16 bands. The vocoded stimuli were arranged either into repeating sequences (/aba, aba, .../ or /apa, apa, .../) or into alternating sequences (/aba, apa, aba, .../). The results showed that infants have a significant preference (longer looking times) for alternating sequences over repeating ones. These results indicate that: (i) infants can discriminate voicing on the basis of E cues alone, i.e., in the absence of fine spectral and temporal structure information; (ii) behavioral methods can be used with vocoded stimuli to investigate the developmental time course of speech perception in NH and CI children.


I. INTRODUCTION

Within the cochlea, speech sounds are decomposed by the “auditory filters” into a series of narrowband signals, each evoked at a different place on the basilar membrane. Each signal can be considered as a “carrier” – the temporal fine structure (TFS), determined by the dominant frequencies in the signal that fall close to the center frequency of the band – modulated by a temporal envelope (E), which corresponds to the relatively slow fluctuations in amplitude superimposed on the carrier (Smith et al., 2002). Both E and TFS information are represented in the pattern of phase locking in auditory-nerve fibers. However, for most mammals, phase locking to TFS remains accurate only up to about 1-2 kHz (Johnson, 1980), whereas phase locking to E remains accurate for carrier frequencies beyond 6 kHz (Joris and Yin, 1992). When speech sounds are presented in quiet, E cues alone are sufficient for adults to readily recognize the speech input at different levels (phonemes, words, sentences), even with a limited number of frequency bands (4-16) and after only a brief exposure to these stimuli (Shannon et al., 1995). Indeed, voicing, nasality, place and manner of articulation are signaled by various E cues that are relatively widespread or located in the high-frequency range (Rosen, 1992). In comparison, TFS cues seem to play little role in speech identification when speech is intact and presented in quiet (Shannon et al., 1995). However, recent studies suggest that TFS cues may play a specific role in conveying phonetic information regarding voicing and nasality when E cues are severely degraded by acoustic distortions (Sheft et al., 2008). Consistent with this notion, the main segmental cues to voicing and nasality are restricted to the low-mid frequency range and are well represented in the pattern of phase locking in auditory-nerve fibers.
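The E/TFS decomposition described above can be illustrated numerically. The following is a minimal sketch (not from the article; the synthetic signal and variable names are illustrative) using SciPy's Hilbert transform on a toy narrowband signal:

```python
import numpy as np
from scipy.signal import hilbert

fs = 44100  # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)
# Toy narrowband signal: a 1-kHz carrier with a 10-Hz amplitude fluctuation,
# standing in for the output of one "auditory filter".
band_signal = (1 + 0.8 * np.sin(2 * np.pi * 10 * t)) * np.sin(2 * np.pi * 1000 * t)

analytic = hilbert(band_signal)       # analytic signal via the Hilbert transform
envelope = np.abs(analytic)           # E: slow amplitude fluctuations
tfs = np.cos(np.angle(analytic))      # TFS: unit-amplitude carrier

# Multiplying E by TFS recovers the original narrowband signal.
reconstruction = envelope * tfs
```

In a vocoder, the envelope obtained this way is additionally lowpass filtered, and the original TFS is discarded and replaced by a tone or noise carrier.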
During the last decade, profoundly deaf children have been fitted with cochlear implants (CI) at a younger and younger age, with reasonable success (Holt and Svirsky, 2008; Miyamoto et al., 2003). CI speech processors reliably deliver E cues over a small number of independent frequency channels (about 8 channels; Friesen et al., 2001), but severely degrade TFS cues. This nevertheless allows most CI users to perform well in quiet situations. However, the oral language level attained by infant CI recipients is not consistently within the “normal” range. Most studies show that a better outcome is principally correlated with early implantation, implantation within the first two years (or 18 months) being generally recommended when possible. In addition, several other factors have been shown to play a significant role, such as parental support, education level, or socioeconomic status. But what remains largely unknown is how congenitally deaf infants could learn the properties of their native language by processing the E information transmitted by the CI device in the absence of TFS cues. One way to evaluate this issue is to turn to early language acquisition and examine the potential role of E cues in early phonological acquisition. The early typical development of the segmental side of speech processing received much interest between 1980 and 2000. Many experimental studies focused on phoneme discrimination and categorization by infants (Kuhl, 1991; Werker and Tees, 1984). Later on, the role of supra-segmental prosodic cues (rhythm and pitch) was assessed at different periods of language acquisition (Bertoncini et al., 1995; Nazzi et al., 1998a, 1998b; Jusczyk et al., 1993). Taken together, these studies suggest that the early acquisition of segmental and supra-segmental properties could be driven by E cues, or at least that having access to E cues only, as for CI infants, would provide enough information for such acquisition. However, since pitch perception is considered to be related to TFS processing, the role of TFS cues during the acquisition of tonal languages by NH and CI children remains an open question (Xu and Pfingst, 2003). A first step in exploring this aspect of speech perception is to present speech signals processed by noise- or tone-excited vocoders (as in Shannon et al., 1995) to near-term fetuses and to normal-hearing children and infants, to verify whether they are sensitive to those E cues related to phonetic information, i.e., when TFS cues are removed and the stimuli are presented in quiet.
A first study suggests that as early as 38 weeks of gestational age, fetuses perceive the E cues of sentences processed by a single-band noise-excited vocoder (Granier-Deferre et al., 2010). A second study, conducted with children (5-12 years), demonstrated that recognition of speech processed by multi-band noise-excited vocoders to retain mainly E cues develops before the age of 7 and becomes adult-like around the age of 10 (Eisenberg et al., 2000). In a third study, 5- to 7-year-old children were shown to be able to discriminate nonsense syllables varying in voicing, place, manner and nasality when those syllables were processed by a 16-band tone-excited vocoder (Bertoncini et al., 2009). In addition, the response correctness and latencies demonstrated by the children were found to be similar to those of young adult controls. The discrimination scores with vocoded signals were very high (d′ > 2) and only minimally reduced compared to those obtained with intact speech. Like adults, the 5- to 7-year-old children did not receive any training before testing. These results suggest that even at an early age, NH children are able to rapidly adapt and assimilate vocoded signals to (degraded) speech sounds. The aim of the present study is to start a new line of research on the development of speech perception by presenting vocoded speech signals to NH infants at an early stage of development. As a first step in this direction, we present here one experiment indicating that 6-month-old infants successfully discriminate a voicing contrast on the sole basis of E cues.

II. METHOD

A. Participants
Twenty 6-month-old infants (8 girls, 12 boys) with normal hearing (based on newborn hearing screening and parental report) were tested, and their data were included in the analyses (mean age = 6.2 months; range: 5.8 to 6.6 months; standard deviation = 0.25 months). The data of 6 additional infants were excluded due to fussiness or crying (4) or to extreme mean looking times leading to outlier differences between the two kinds of stimulus series (2). Families were informed about the goals of the study and provided written consent before their children’s participation.

B. Stimuli
Eight exemplars of each category /aba/ and /apa/ were selected from a set of vowel-consonant-vowel (VCV) nonsense syllables uttered by a French female speaker. The stimuli were recorded in a quiet room and digitized via a 16-bit analog-to-digital converter at a 44.1-kHz sampling rate. The 16 selected stimuli were then processed to degrade TFS cues while preserving E cues at the output of a bank of analysis filters, using the same speech-processing technique as Bertoncini et al. (2009). Speech was filtered into 16 adjacent 0.35-oct wide frequency bands spanning the range 0.08 to 8.02 kHz. The temporal envelope was extracted in each frequency band using the Hilbert transform followed by lowpass filtering with a zero-phase, 6th-order Butterworth filter (cutoff frequency = 64 Hz). The filtered envelope was used to amplitude modulate a sine wave with a frequency equal to the center frequency of the band and with random starting phase. The 16 amplitude-modulated sine waves were then summed over all frequency bands. The 16 processed stimuli were finally equated in root-mean-square (rms) power. The effects of signal processing are illustrated in Fig. 1, showing spectrograms of the intact (left panels) and vocoded (right panels) versions of a given /aba/ (top panels) and /apa/ (bottom panels) stimulus. The stimuli were arranged into series of 24 items, consisting of 3 different orderings of the 8 vocoded signals in succession. There were 4 resulting series: two repeating series (/apa apa apa .../ and /aba aba aba .../) and two alternating series that differed only in the first item (/apa aba apa aba .../ and /aba apa aba apa .../). The stimuli were stored in digitized form on the computer and were delivered at a sound pressure level of 70 dB by the loudspeakers via an audio amplifier.
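This processing chain can be sketched in Python as follows. This is a rough illustration, not the authors' implementation: the band-split filter order and the log-spaced band edges are assumptions (the article specifies 0.35-oct bands, 0.08-8.02 kHz), and only the stated parameters (16 bands, 6th-order zero-phase Butterworth lowpass at 64 Hz, tone carriers with random phase, rms equalization) are taken from the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def tone_vocoder(x, fs, n_bands=16, f_lo=80.0, f_hi=8020.0, am_cutoff=64.0):
    """Tone-excited vocoder sketch: keep per-band E cues (< am_cutoff Hz)
    and replace each band's carrier by a pure tone, degrading TFS/FM cues."""
    rng = np.random.default_rng(0)
    # Log-spaced band edges spanning f_lo to f_hi (approximate band layout).
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    # Zero-phase 6th-order Butterworth lowpass for envelope extraction.
    lp = butter(6, am_cutoff, btype="low", fs=fs, output="sos")
    t = np.arange(len(x)) / fs
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        bp = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(bp, x)                        # analysis filter
        env = sosfiltfilt(lp, np.abs(hilbert(band)))     # lowpassed Hilbert envelope
        env = np.clip(env, 0.0, None)                    # filtering can overshoot below 0
        fc = np.sqrt(lo * hi)                            # band center (geometric mean)
        out += env * np.sin(2 * np.pi * fc * t + rng.uniform(0, 2 * np.pi))
    # Equate rms power with the input.
    return out * np.sqrt(np.mean(x**2) / np.mean(out**2))
```

Applied to a recorded syllable, the output retains the slow amplitude fluctuations in each band while the original fine structure is replaced by fixed-frequency tones.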

Figure 1: Spectrograms of the intact (left panels) and vocoded (right panels) versions of a /aba/ (top panels) and /apa/ (bottom panels) stimulus.


C. Procedure
The HPP used here was adapted for studying discrimination, as in Nazzi et al. (2009). Each infant was held on a caregiver’s lap, in the centre of the test booth, facing a green light. Before each trial, the infant’s attention was re-centred by blinking the central green light. As soon as the infant was correctly oriented toward it, the green light was extinguished and a red light started to flash on either the right or the left side of the booth. When the infant made a correct turn in that direction, a stimulus series began to play from the loudspeaker located behind the flashing red light. Each series was played to completion, or stopped as soon as the infant looked away for 2 consecutive seconds. The total duration of looking times was recorded on a PC via a response box controlled online by the experimenter, who was outside the booth and unaware of the stimulus series presented (both the caregiver and the experimenter listened to masking music). For each infant, the experimental session began with two musical trials, one on each side (randomly ordered), to give the infant an opportunity to practice one head turn to each side before the test itself. The test phase consisted of two test blocks, in each of which the two repeating and the two alternating series were presented. The mean duration of looking times was calculated for each type of series over the first and second blocks of test trials. In this HPP paradigm, discrimination is attested when infants demonstrate a differential response to the two kinds of stimuli they are presented with. Here, infants might show longer looking times while listening either to ALT series or to REP ones if they are sensitive to the difference between them. Furthermore, a “preference” indexed by longer looking times might be expected for the ALT series, because the alternation (if perceived) might facilitate perceptual comparison and maintain attention for longer (Best and Jones, 1998).

III. RESULTS

The mean results are shown in Fig. 2. An analysis of variance (ANOVA) for repeated measures was performed with Blocks and Series as within-subject factors. The group of 20 infants showed longer mean looking times during ALT series than during REP series across the two blocks of trials (F(1,19) = 4.75, p = .042). Fourteen of the 20 infants listened longer to ALT series than to REP ones. The duration of looking times decreased from the first to the second block of test trials (F(1,19) = 11.63, p = .003). Although the interaction between Blocks and Series was not significant (F(1,19) = 2.29, p = .15), the general tendency of looking times to decrease as the experiment progressed affected the mean size of the effect [2.88 s in the first block (t(19) = 2.29, one-tailed p = 0.017), and 0.26 s in the second block (t(19) < 1)]. Looking times also declined in the subjects who nonetheless showed a “preference” for ALT series (5.97 s and 2.99 s in the first and second block, respectively).

Figure 2. Mean looking times (in seconds) to the alternating (ALT) series versus the repeating (REP) series. The error bars indicate the standard error of the mean.

IV. CONCLUSIONS

NH 6-month-old infants presented with degraded syllables /aba/ and /apa/ are sensitive to the difference between the E cues associated with the voiced and voiceless French phonemes /b/ vs. /p/. Without any pre-exposure, in a paradigm favoring immediate comparison between two contrasting categories (recall that each category was represented by 8 different exemplars), infants were able to pick up the difference between /apa/ and /aba/ on the basis of E cues only. These data reveal that for 6-month-old infants, as for adults, fine spectral and temporal structure information, although potentially important for musical pitch and for localizing sounds in space (Smith et al., 2002), is not essential for speech discrimination, at least in quiet. They also demonstrate that auditory temporal resolution (i.e., the ability to detect E fluctuations) is sufficient to support phonetic discrimination at 6 months of age. Finally, the fact that the preference for alternating series over repeating ones was relatively immediate but did not last more than a few minutes suggests the action of low-level auditory mechanisms. This first study is also promising because, by using vocoded speech sounds, it opens a new way of studying how (peripheral) auditory mechanisms might be engaged in speech processing during language acquisition. Obviously, further studies must be conducted to determine whether E information could be used by infants not only to discriminate but also to categorize speech syllables, or to extract word-form patterns from the speech stream. For example, if infants’ sensitivity to E cues were shown to be modulated by diverse linguistic input, it would support the idea that E perception is a built-in part of speech processing. Regarding possible follow-ups in relation to spoken language acquisition by young CI users, it will be useful to extend the present study by testing NH infants on different phonetic contrasts (manner, place) while using a lower frequency resolution (e.g., 4 or 8 analysis filters), and comparing their performance with CI infants’ perception of the same contrasts. This will allow us to evaluate whether NH infants’ discrimination of phonetic contrasts on the basis of E cues is a reliable simulation of how young CI users discriminate the same contrasts during development.

V. ACKNOWLEDGMENTS

The authors wish to thank X. Li for preparing stimuli and testing infants.


VI. REFERENCES

Bertoncini, J., Floccia, C., Nazzi, T., and Mehler, J. (1995). “Morae and syllables: Rhythmical basis of speech representations in neonates,” Lang. Speech 38, 311-329.
Bertoncini, J., Serniclaes, W., and Lorenzi, C. (2009). “Discrimination of speech sounds based upon temporal envelope versus fine structure cues in 5- to 7-year-old children,” J. Speech Lang. Hear. Res. 52, 682-695.
Best, C. T., and Jones, C. (1998). “Stimulus-alternation preference procedure to test infant speech discrimination,” Infant Behav. Dev. 21 (Sup. 1), 295.
Eisenberg, L. S., Shannon, R. V., Schaefer Martinez, A., Wygonski, J., and Boothroyd, A. (2000). “Speech recognition with reduced spectral cues as a function of age,” J. Acoust. Soc. Am. 107, 2704-2710.
Friesen, L. M., Shannon, R. V., Başkent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150-1163.
Granier-Deferre, C., Ribeiro, A., Jacquet, A.-Y., and Bassereau, S. (2010). “Near-term fetuses process temporal features of speech,” Dev. Sci. 14(2), 336-352.
Holt, R. F., and Svirsky, M. A. (2008). “An exploratory look at paediatric cochlear implantation: Is earliest always best?,” Ear Hear. 29, 492-511.
Johnson, D. H. (1980). “The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones,” J. Acoust. Soc. Am. 68, 1115-1122.
Joris, P. X., and Yin, T. C. (1992). “Responses to amplitude-modulated tones in the auditory nerve of the cat,” J. Acoust. Soc. Am. 91, 215-232.
Jusczyk, P. W., Cutler, A., and Redanz, N. J. (1993). “Infants’ preference for the predominant stress patterns of English words,” Child Dev. 64, 675-687.
Kuhl, P. K. (1991). “Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech categories, monkeys do not,” Percept. Psychophys. 50, 93-107.
Miyamoto, R. T., Houston, D. M., Kirk, K. I., Perdew, A. E., and Svirsky, M. A. (2003). “Language development in deaf infants following cochlear implantation,” Acta Otolaryngol. 123, 241-244.
Nazzi, T., Bertoncini, J., and Bijeljac-Babic, R. (2009). “A perceptual equivalent of the labial-coronal effect in the first year of life,” J. Acoust. Soc. Am. 126, 1440-1446.
Nazzi, T., Bertoncini, J., and Mehler, J. (1998a). “Language discrimination by newborns: Towards an understanding of the role of rhythm,” J. Exp. Psychol. Hum. Percept. Perform. 24, 356-366.
Nazzi, T., Floccia, C., and Bertoncini, J. (1998b). “Discrimination of pitch contour by neonates,” Infant Behav. Dev. 21, 779-784.


Rosen, S. (1992). “Temporal information in speech: Acoustic, auditory and linguistic aspects,” Philos. Trans. R. Soc. Lond. B Biol. Sci. 336, 367-373.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303-304.
Sheft, S., Ardoint, M., and Lorenzi, C. (2008). “Speech identification based on temporal fine structure cues,” J. Acoust. Soc. Am. 124, 562-575.
Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature 416, 87-90.
Werker, J. F., and Tees, R. C. (1984). “Cross-language speech perception: Evidence for perceptual reorganization during the first year of life,” Infant Behav. Dev. 7, 49-63.
Xu, L., and Pfingst, B. E. (2003). “Relative importance of temporal envelope and fine structure in lexical-tone perception,” J. Acoust. Soc. Am. 114, 3024-3027.


Chapter 3. Discrimination of voicing on the basis of AM cues in French 6-month-old infants: effects of frequency and temporal resolution (head-turn preference procedure)


Chapter 3. Discrimination of voicing on the basis of AM cues in French 6-month-old infants: effects of frequency and temporal resolution (head-turn preference procedure)

1. Introduction: Perception of speech modulation cues by 6-month-old infants

The first study of this PhD work shows that French-learning 6-month-old infants discriminate a French voicing contrast when the speech signal contains only the AM cues (< 64 Hz) extracted in 16 frequency bands. In the present chapter, a second study is designed to explore further the processing of modulation cues in phonetic discrimination by French-learning 6-month-old infants. Three noise-excited vocoders, similar to those used by Shannon, Zeng, Kamath, Wygonski, and Ekelid (1995), are designed to process the target stimuli. These vocoders selectively degrade (i) the FM cues while preserving spectral- and temporal-AM cues, (ii) the FM and AM cues (AM being lowpass filtered at 16 Hz in 32 frequency bands), or (iii) the FM and spectral-AM cues (AM cues being preserved in 4 broad frequency bands). The vocoders thus differentially affect the temporal and spectral resolution of the speech signal, and hence the different modulation cues signaling voicing. The procedure used to test infants in this second study differs from that used in the previous study. The version of the HPP used previously favors a direct comparison between voiced and voiceless syllables through the presentation of alternating (containing /aba/ and /apa/) versus repeated sound sequences. However, the infants’ preference for alternating sequences was found to be very brief: after only four trials (two alternating versus two repeated sequences), the infants’ looking times decreased for both types of sequences. In the present chapter, the HPP includes a familiarization period with one sound category, followed by a test phase in which sequences of the (same) familiar stimulus and sequences of a (different) novel stimulus are presented in alternation over eight trials. With this procedure, discrimination is based on several observation intervals (i.e., multiple “looks”, see Holt, 2011), and thus on more trials than in the previous version.
This is thought to allow the observation of longer-lasting effects generated by the mechanisms coping with speech. A short or a long familiarization time with a given sound category is presented to French-learning normal-hearing infants (e.g., Bijeljac-Babic, Serres, Höhle, & Nazzi, 2012). It is assumed that the familiarization period should induce a preference for the familiar or the novel sequences if and only if infants are able to discriminate these sequences from one another. Thus, if 6-month-olds are able to use reduced spectro-temporal modulations to discriminate voicing, they should show a preference for the familiar or the novel sequences. The findings suggest that 6-month-old infants can use reduced spectro-temporal modulation information to achieve voicing discrimination. However, discrimination is found to be influenced by the time of exposure to these degraded speech signals. It is assumed that if infants have more difficulty processing some vocoded stimuli, they should require longer exposure to discriminate them than with intact stimuli (e.g., Hunter & Ames, 1988). This point is explored further in Chapter 4.
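In outline, the three vocoder conditions differ mainly in the number of analysis bands and the AM lowpass cutoff applied before the envelopes modulate noise carriers. The sketch below is schematic and not the authors' code: the band edges, filter orders, and the cutoff for condition (i) are assumptions; only the 32-band/16-Hz and 4-band settings are taken from the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def noise_vocoder(x, fs, n_bands, am_cutoff, f_lo=80.0, f_hi=8000.0, seed=0):
    """Noise-excited vocoder sketch: per-band envelopes (< am_cutoff Hz)
    applied to band-limited noise carriers, discarding the original FM cues."""
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # log-spaced band edges
    lp = butter(6, am_cutoff, btype="low", fs=fs, output="sos")
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        bp = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        env = np.clip(sosfiltfilt(lp, np.abs(hilbert(sosfiltfilt(bp, x)))), 0.0, None)
        carrier = sosfiltfilt(bp, rng.standard_normal(len(x)))  # band-limited noise
        out += env * carrier
    return out * np.sqrt(np.mean(x**2) / np.mean(out**2))  # equate rms with input

# The three conditions, schematically (the AM cutoff for condition (i) is an
# assumption; the text specifies only that AM cues are preserved):
# (i)   FM degraded, spectral- and temporal-AM preserved: many bands, high cutoff
# (ii)  FM and temporal-AM degraded: 32 bands, AM lowpassed at 16 Hz
# (iii) FM and spectral-AM degraded: 4 broad bands
```

Varying `n_bands` trades off spectral resolution, while `am_cutoff` trades off temporal resolution, which is exactly the manipulation the three conditions exploit.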

This second article has been accepted for publication in the Journal of Speech and Hearing Research: Cabrera, L., Bertoncini, J., & Lorenzi, C. (2013). Perception of speech modulation cues by 6-month-old infants.


2. Article: Cabrera, Bertoncini & Lorenzi (2013)

Perception of speech modulation cues by 6-month-old infants

Cabrera Laurianne
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes
45 rue des saints Pères, 75006 Paris, France

Bertoncini Josiane
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes
45 rue des saints Pères, 75006 Paris, France

Lorenzi Christian
Département d’Etudes Cognitives, Institut d’Etude de la Cognition
Ecole normale supérieure, Paris Sciences et Lettres
29 rue d’Ulm, 75005 Paris, France

Article submitted to the Journal of Speech and Hearing Research in May 2012


ABSTRACT

Purpose: This study assessed the capacity of 6-month-old infants to discriminate a voicing contrast (/aba/-/apa/) on the basis of amplitude-modulation cues (AM, the variations in amplitude over time within each frequency band) and frequency-modulation cues (FM, the oscillations in instantaneous frequency close to the center frequency of the band). Method: Several vocoded speech conditions were designed to (i) degrade FM cues in 4 or 32 bands, or (ii) degrade AM cues in 32 bands. Infants were familiarized with the vocoded stimuli for a period of either 1 or 2 min. Vocoded-speech discrimination was assessed using the head-turn preference procedure. Results: Infants discriminated /aba/ from /apa/ in each condition. However, familiarization time was found to strongly influence infants’ responses (i.e., their preference for novel versus familiar stimuli). Conclusions: Six-month-old infants do not require FM cues, and can use the slowest (<16 Hz) AM cues, to discriminate voicing. Moreover, six-month-old infants can use AM cues extracted from only four broad frequency bands to discriminate voicing.

Key Words: Speech perception, Vocoder, Modulation cues, Normal-hearing infants


I. INTRODUCTION

Most recent studies of speech perception in infancy have focused on how infants acquire the phonological properties of their native language in a variety of learning contexts (see Kuhl et al., 2008, for a review). The present study takes a different approach by exploring how infants process the acoustic information related to phonetic differences. More precisely, this study aims to assess to what extent infants rely on low-level (i.e., sensory) spectro-temporal modulation cues to discriminate a voicing contrast (see also Bertoncini, Nazzi, Cabrera, & Lorenzi, 2011). The auditory mechanisms underlying speech perception in adulthood have been thoroughly described. Speech signals are decomposed by the auditory filters in the cochlea into many narrow frequency bands, each with a passband equal to one

“equivalent-rectangular bandwidth”, 1-ERBN (Glasberg & Moore, 1990; Moore, 2003). Over the last two decades, several psychophysical studies reconsidered speech-perception processes based on the general assumption that each 1-ERBN wide band of speech should be viewed as a sinusoidal carrier with superimposed amplitude modulation (AM) and frequency modulation (FM) (e.g., Drullman, 1995; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995; Sheft, Ardoint, & Lorenzi, 2008; Smith, Delgutte, & Oxenham, 2002; Zeng et al., 2005). The AM component – often referred to as the “acoustic temporal envelope” - corresponds to the relatively slow modulations in amplitude over time. The FM component – often referred to as the “acoustic temporal fine structure” - represents the relatively fast fluctuations in instantaneous frequency over time with average frequency close to the center frequency of the 1-ERBN wide band. Both AM and FM features are represented in the phase-locking pattern of auditory-nerve fibers’ discharges. For most adult mammals, the accuracy of neural phase locking to instantaneous frequency (and thus, FM cues) is constant up to about 1-2 kHz and then declines so that phase locking is no longer detectable at about 5-6 kHz (e.g., Johnson, 1980; Kiang, Pfeiffer, Warr, & Backus 1965; Palmer & Russell, 1986; Rose, Brugge, Anderson, & Hind, 1967). In contrast, neural phase locking to AM cues remains accurate for carrier (audio) frequencies well beyond 6 kHz (e.g., Joris, Schreiner, & Rees, 2004; Joris & Yin, 1992; Kale & Heinz, 2010). Moreover, several additional peripheral auditory


mechanisms, such as synaptic adaptation at the inner hair-cell level, appear to limit AM coding over and above the mechanisms that limit instantaneous-frequency coding (Joris & Yin, 1992). These physiological data suggest some form of dissociation in the representation of AM and FM features in the early stages of the auditory system. However, it has been proposed that only the slowest FM (<5-10 Hz) cues are encoded in a purely temporal manner (that is, independently of AM) via phase locking in auditory-nerve fibers, whereas faster FM cues (>10 Hz) are encoded via a place (i.e., tonotopic) mechanism (e.g., Moore & Sek, 1996; Saberi & Hafter, 1995)1. Phonetic features such as voicing, nasality, place and manner of articulation are signaled by various AM cues (e.g., the existence and duration of silent intervals are important in distinguishing voiced from voiceless plosives in intervocalic position; Rosen, 1992), which are relatively widespread or located in the high audio-frequency range (Rosen, 1992). Segmental cues to voicing and nasality are restricted to the low-to-mid audio-frequency range (e.g., voiced sounds have a power spectrum heavily weighted to audio frequencies below 1 kHz; Rosen, 1992) and are well represented in the pattern of phase locking in auditory-nerve fibers (e.g., Deng & Geisler, 1987; Sinex & Geisler, 1983), suggesting that voicing and nasality may also be signaled by FM cues encoded in a purely temporal manner. A number of speech perception studies have attempted to assess the relative importance of AM and FM cues in speech identification and phonetic-feature perception for normal-hearing adults. These studies have used nonsense syllables, words or sentences processed by vocoders, that is, signal-processing algorithms that selectively extract and alter AM and FM cues within specific analysis frequency bands (Dudley, 1939).
After filtering out or scrambling the original AM or FM speech cues within each analysis frequency band, the resulting signal is assumed to retain mainly the AM or the FM speech cues (e.g., Gilbert & Lorenzi, 2006; Shannon et al., 1995; Sheft et al., 2008; Smith et al., 2002; Zeng et al., 2005). These studies indicated that normal-hearing adults can achieve high levels

1 In this case, the differential attenuation of cochlear filtering converts the frequency excursions of FM into AM fluctuations at the output of auditory filters in the cochlea, a process sometimes referred to as “temporal envelope reconstruction” (see Gilbert & Lorenzi, 2006; Zeng et al., 2004, for applications to speech perception).


of speech intelligibility and phonetic-feature perception with speech stimuli vocoded to retain mainly the AM speech cues (stimuli referred to as “AM speech” thereafter). This was initially demonstrated by Shannon et al. (1995) who evaluated English-speaking adults’ abilities to identify nonsense syllables in different speech-processing conditions. The AM-speech stimuli contained only AM cues extracted within a limited number of broad analysis frequency bands (1, 2, 3 or 4 bands) using a lowpass filter with a cutoff frequency varying from 16 to 500 Hz. Syllable identification and phonetic-feature perception (i.e., voicing, manner and place of articulation) were poor for 1, 2, or 3 analysis frequency bands, but sharply increased with 4 bands, irrespective of the cutoff frequency of the lowpass filter used to extract AM. However, in order to reach such high levels of accuracy, participants required a relatively long training period (8 to 10 hours). Other studies with vocoded signals revealed that adult listeners could also achieve high levels of speech perception with nonsense syllables or sentences while the physical signal retains mainly the FM cues (e.g., Gilbert & Lorenzi, 2006; Hopkins, Moore, & Stone, 2010; Lorenzi, Gilbert, Carn, Garnier, & Moore, 2006; Sheft et al., 2008). However, participants required a much longer training period to accurately identify these “FM-speech” stimuli than their “AM-speech” counterparts. These data suggested that AM cues play a more important role than FM cues in accurate speech recognition. However, a number of studies conducted with adults tested with vocoded speech demonstrated that the relative importance of FM cues may increase when AM cues are degraded by various acoustic distortions (e.g., Ardoint & Lorenzi, 2010; Gilbert, Bergeras, Voillery, & Lorenzi, 2007; Gilbert & Lorenzi, 2006; Hopkins, Moore, & Stone, 2008; Nelson, Jin, Carney, & Nelson, 2003; Qin & Oxenham, 2003; Zeng et al., 2005). 
Indeed, speech perception was found to be poorer for “AM speech” than for “intact speech” (that is, for speech combining AM and FM cues within each frequency band) when stimuli were spectrally reduced or filtered, periodically interrupted or masked by interfering talkers or background noise (Eaves, Summerfield, & Kitterick, 2011; Gnansia, Pean, Meyer, & Lorenzi, 2009). These data demonstrate that for adults, both AM and FM cues convey phonetic information whose relative weight may vary according to the listening conditions, and in a more general sense, as a function of speech redundancy. Thus, for adults, the loss of AM cues may be compensated for by relying more on


FM cues. While numerous studies have investigated speech perception in adults, information is still lacking regarding the ability of neonates, infants and older children to use AM and FM cues in speech. Do infants use AM and FM cues in the same way as adults when listening to speech sounds? Do the acoustic degradations of AM and FM cues impair the performance of developing auditory mechanisms as much as that of mature (coupled) auditory and speech mechanisms? To our knowledge, only a few studies have directly addressed the developmental course of modulation perception using vocoded speech stimuli. In a pioneering study, Eisenberg, Shannon, Shaefer Martinez, Wygonski and Boothroyd (2000) assessed the ability of normal-hearing 7-and 10-year-old children and adults to identify nonsense syllables, words and sentences vocoded to retain only AM cues below 160 Hz within 4 to 8 analysis frequency bands (the FM carriers were replaced by noise in each band). The results showed that children less than 7 years required a higher frequency resolution (i.e., a greater number of analysis frequency bands) than 10 year-olds and adults to reach similar identification performance with the AM-speech stimuli. Interestingly, voicing perception (and especially, perception of the /sa/ versus /za/ contrast) was the poorest across all subjects groups. This initial investigation was extended by Bertoncini, Serniclaes and Lorenzi (2009) to younger children aged between 5 and 7 years. A discrimination task was used with nonsense syllables vocoded to retain only AM cues below 64 Hz within 16 frequency bands (here, the FM carriers were replaced by pure tones with fixed frequencies). Bertoncini et al. (2009) found that normal-hearing 5-, 6-, and 7-year-old children were able to discriminate speech contrasts on the basis of AM cues at an adult level. This study also showed no significant difference in performance across age groups for voicing, place, manner, and nasality. 
Thus, the perception of AM cues below 64 Hz in speech appeared as robust for 5-year-old children as for adults when a discrimination task was used. However, consistent with the results of Eisenberg et al. (2000), children’s discrimination scores were lowest for the voicing contrast. This finding was attributed to the attenuation of fundamental-frequency (F0) energy at the onset of voicing caused by the removal of FM cues. Bertoncini et al. (2011) further explored the perception of voicing based on AM cues by testing six-month-old infants. A behavioral task (based on the head-


turn preference procedure) was used to assess the ability of infants to discriminate pairs of AM-vocoded syllables such as /aba/ and /apa/. As in Bertoncini et al. (2009), the AM cues were extracted within 16 analysis frequency bands and lowpass filtered at 64 Hz. Again, the original FM carriers were replaced by pure tones. The results showed that infants attended significantly longer to alternating sequences of /aba/ and /apa/ than to repeated sequences of either /aba/ or /apa/. This suggested that infants detected the alternation on the basis of the AM cues differentiating voicing information. These studies revealed that infants and older children are able to discriminate phonetic features on the sole basis of AM cues. However, Eisenberg et al. (2000)’s study suggested that the capacity to resist degradations of AM and/or FM cues in speech sounds in a more demanding identification task may not be entirely mature before at least the age of 7 years. This is compatible with the outcome of several psychophysical studies conducted with non-linguistic stimuli, showing that auditory sensitivity to AM and FM cues is not adult-like until around 10-11 years of age (e.g., Aslin, 1989; Colombo & Horowitz, 1986; Hall & Grose, 1994; Moore, Cowan, Riley, Edmondson-Jones, & Ferguson, 2011; for a review, see also Saffran, Werker, & Werner, 2006, and Werner & Gray, 1998). Nonetheless, other studies conducted both in infants (e.g., Abdala & Folsom, 1995; Levi, Folsom, & Dobie, 1995; Olsho, 1985; Spetner & Olsho, 1990) and in cats (Brugge, Javel, & Kitzes, 1978; Kettner, Feng, & Brugge, 1985) suggest that some aspects of frequency selectivity and neural phase locking in auditory-nerve fibers and brain-stem neurons should be mature from early in infancy. Regarding temporal resolution (that is the ability to follow changes in AM fluctuations as a function of time), the results vary according to the nature of the stimuli and according to the methods. 
In some studies, temporal resolution approaches an adult-like profile at about 6 months of age (e.g., Levi & Werner, 1996; Trainor, Samuel, Desjardins, & Sonnadara, 2001; Trehub, Schneider, & Henderson, 1995). At the same time, other studies show differences across age groups, indicating that temporal sensitivity is generally poorer in infants and more dependent on the task or on sound complexity than in adults (Buss, Hall, Grose, & Dev, 1999; Diedler, Pietz, Bast, & Rupp, 2007; Smith, Trainor, & Shore, 2006; Wightman, Allen, Dolan, Kistler, & Jamieson, 1989). Nevertheless, it is generally admitted that, under some circumstances, the mechanisms governing auditory temporal


resolution in infants operate qualitatively like those in adults (Werner, Marean, Halpin, Spetner, & Gillenwater, 1992). The above review of infants’ abilities to process spectral and temporal auditory cues reveals important disparities between psychophysical studies using non-linguistic stimuli (e.g., pure tones, complex tones, noise bursts) and speech-acquisition studies using linguistic stimuli (e.g., syllables, words). The present study attempted to address these disparities by combining psychophysical and psycholinguistic methods to explore 6-month-old infants’ abilities to process phonetic information on the basis of AM and FM speech cues. More precisely, the present study investigated whether or not 6-month-old infants discriminate a French voicing contrast (/aba/ versus /apa/) in spite of degradations of the AM and/or FM cues in speech sounds. An impressive number of studies on the development of speech perception have shown that, during the first months of life, infants can discriminate many different phonetic contrasts, including non-native ones (e.g., Mattock, Molnar, Polka, & Burnham, 2008; for a review see Kuhl, 2004, and Werker & Tees, 1999). Voicing discrimination, and more precisely discrimination of the /b/ versus /p/ contrast, was demonstrated in infants as young as one month of age by Eimas, Siqueland, Jusczyk and Vigorito (1971) using the high-amplitude sucking method. However, at the end of the first year, speech perception is marked not only by a decline in non-native contrast discrimination (e.g., Werker & Tees, 1983) but also by facilitation in native contrast discrimination (e.g., Kuhl et al., 2006). Moreover, a recent study indicated that between 4 and 8 months of age, French-learning infants become more sensitive to the French value (0 ms) of the voice-onset-time (VOT) boundary between voiced and voiceless plosive consonants (Hoonhorst et al., 2009).
Therefore, the speech-processing mechanisms responsible for the perception of phonetic features such as voicing are in place at 6 months of age, but are far from entirely tuned to the typical contrastive patterns of the native language (e.g., Mattock et al., 2008; Werker & Tees, 1983, but see Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). The present study explored the early capacity of 6-month-old French-learning infants to use AM and FM cues in discriminating a voicing contrast. Two sets of syllables (/aba/ and /apa/ stimuli) were either left intact or processed by a multi-channel, noise-excited vocoder in order to: (i) degrade FM cues by


replacing the FM carriers by bands of noise in each analysis frequency band, (ii) degrade AM cues (and especially F0-related periodic AM cues) by filtering out the AM components above 16 Hz in each frequency band, and (iii) degrade AM cues by reducing the frequency resolution of the vocoder (i.e., the number of analysis frequency bands) from 32, 1-ERBN-wide to 4, 8-ERBN-wide frequency bands. These three vocoded conditions were designed to assess whether normal-hearing 6-month-olds are able to discriminate a voicing contrast (/aba/ versus /apa/) on the sole basis of: (i) AM speech cues, (ii) the slowest (< 16 Hz) AM speech cues, and (iii) the AM cues extracted from a limited number of broad frequency bands. The intact and processed speech stimuli were presented to 6-month-old infants, and an adaptation of the head-turn preference procedure (HPP) was used to assess their discrimination ability. The procedure used in this experiment was adapted from the one introduced by Hirsh-Pasek et al. (1987). Our modified version of the HPP included a familiarization phase during which a stimulus was presented, followed by a test phase in which sequences of the (same) familiar stimulus alternated with sequences of a (different) novel stimulus. This version of HPP has been used recently in a study on melody discrimination in 2-month-old infants (Plantinga & Trainor, 2009) and in several psycholinguistic studies assessing speech discrimination for 5-to-9-month-olds (e.g., Bosch & Sebastián-Gallés, 2001; Höhle, Bijeljac-Babic, Herold, Weissenborn, & Nazzi, 2009; Nazzi, Jusczyk, & Johnson, 2000; Skoruppa et al., 2009). These studies showed that the familiarization period could induce in infants a preference for the familiar or novel sequences if and only if infants are able to discriminate these sequences2. In most studies, the duration of familiarization was set to 1 min, but in some cases this duration was extended to 2 min (e.g., Bijeljac-Babic, Serres, Höhle, & Nazzi, 2012).
In the present experiment, the duration of the familiarization phase was first set to 1 min. For each experimental condition, we expected that the novel

2 The procedure differs from a “spontaneous preference” paradigm in that the familiarization was achieved with sound A for half of the participants and with sound B for the other half, in such a way that preference for novel or for familiar stimuli cannot be confused with preference for one particular category (A or B). Second, it should be noted that while discrimination responses do not imply preference, displaying a preference necessitates discrimination between the two types of sequences.


sequences would yield longer (correctly oriented) looking times than the familiar sequences.

II. EXPERIMENT 1

A. Method

1. Participants

Six-month-old infants were recruited from a database of birth announcements. All families were informed about the goals of the current study and provided written consent before their participation, in accordance with current French ethical requirements. Data from 88 infants from French-speaking families (22 infants per condition) were analysed in this experiment (40 boys and 48 girls; age range: 5 months 26 days to 7 months 10 days; mean = 6 months 16 days; SD = 9 days). All infants were normal-hearing (based on parental report of newborn hearing screening results). The data from 37 additional infants were not included for the following reasons: fussing and crying (n=22), looking time shorter than 1500 ms for one trial (n=6), outlier number of familiarization trials (n=6), and extreme mean looking times leading to outlier differences (more than, or less than, the mean difference plus or minus 2 SD, respectively) between novel and familiar series (n=3).

2. Stimuli

Speech signals were recorded in a soundproof room and digitized (16-bit resolution) at a 44.1-kHz sampling rate. A female French native speaker who was instructed to “speak clearly” produced sequences of /aba/ and /apa/. Sixteen tokens were selected from a large sample (around 150) in each phonetic category to be comparable in duration, intensity and pitch (in order to reduce variability across categories). Mean duration was 651.5 ms for /aba/ (range: 592-692 ms; SD = 36 ms), and 630 ms for /apa/ (range: 570-702 ms; SD = 47 ms). All stimuli were equated in global root-mean-square (RMS) level. The F0 was estimated at 242 Hz using the YIN algorithm (de Cheveigné & Kawahara, 2002).
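For illustration, the core of the YIN period-estimation method can be sketched as follows: a difference function over candidate lags, its cumulative mean normalized form, and an absolute threshold followed by descent to the nearest local minimum. This is a simplified single-frame sketch (the function name and parameter defaults are ours, not from de Cheveigné & Kawahara, 2002), not the implementation actually used for the stimuli:

```python
import numpy as np

def yin_f0(x, fs, fmin=75.0, fmax=500.0, threshold=0.1):
    """Simplified single-frame YIN F0 estimate (Hz)."""
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    frame = x[: 2 * tau_max]
    # Step 1: difference function d(tau)
    d = np.array([np.sum((frame[:tau_max] - frame[tau:tau + tau_max]) ** 2)
                  for tau in range(tau_max + 1)])
    # Step 2: cumulative mean normalized difference d'(tau), with d'(0) = 1
    dprime = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    dprime[1:] = d[1:] * np.arange(1, d.size) / np.maximum(cumsum, 1e-12)
    # Step 3: first lag below the absolute threshold (else global minimum),
    # refined to the local minimum that follows the threshold crossing
    below = np.where(dprime[tau_min:] < threshold)[0]
    if below.size:
        tau = tau_min + int(below[0])
        while tau + 1 < dprime.size and dprime[tau + 1] < dprime[tau]:
            tau += 1
    else:
        tau = tau_min + int(np.argmin(dprime[tau_min:]))
    return fs / tau
```

On a synthetic 242-Hz tone sampled at 44.1 kHz, this sketch recovers the F0 to within a few Hz (the full algorithm adds parabolic interpolation for sub-sample lag precision).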


Four speech-processing conditions were used (the spectrograms of processed stimuli are shown in Figure 1)3. In the first condition (called “32-band AM+FM speech”), the original speech signal was decomposed into 32, 1-ERBN-wide frequency bands using zero-phase, 6th-order Butterworth bandpass filters (36 dB/octave rolloff) with center frequencies (CFs) ranging from 80 to 8,020 Hz. The Hilbert transform was then applied to each bandpass-filtered speech signal to extract the AM component and the FM carrier. The AM component was lowpass filtered using a zero-phase Butterworth filter (36 dB/octave rolloff) with a cutoff frequency set to ERBN/2. The final narrow-band speech signal was obtained by multiplying each sample of the FM carrier by the filtered AM function. The narrow-band speech signals were finally added up, and the level of the wideband speech signal was adjusted to have the same RMS value as the input signal. Thus, the vocoded speech signals retained the original AM and FM speech cues within each of the 32 analysis frequency bands. In the second condition (called “32-band AM speech”), the same signal-processing scheme was used as in the “32-band AM+FM speech” condition, except that the FM carrier was replaced by a band of pink noise in each analysis frequency band. Thus, the resulting vocoded speech signal retained AM speech cues within 32 bands, but discarded the original (within-channel) FM speech cues. In the third condition (called “32-band AM<16Hz speech”), the same signal-processing scheme was used as in the “32-band AM speech” condition, except that the AM component was lowpass filtered with a cutoff frequency of 16 Hz in each of the 32 bands in order to remove fast, F0-related AM cues. Thus, the resulting vocoded speech signal retained mainly the slowest (<16 Hz) AM speech cues within 32 bands, and discarded the original FM speech cues.
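The 32 center frequencies between 80 and 8,020 Hz can be generated from the ERB-number (Cam) scale of Glasberg and Moore (1990). The sketch below assumes uniform spacing on that scale (the article specifies only the frequency range, so the spacing is our assumption):

```python
import numpy as np

def erb_number(f_hz):
    """ERB-number (Cam) scale of Glasberg & Moore (1990)."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def inverse_erb_number(e):
    """Frequency (Hz) corresponding to an ERB-number value."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

# 32 center frequencies between 80 and 8020 Hz, uniformly spaced on the
# ERB-number scale (assumed spacing)
cfs = inverse_erb_number(np.linspace(erb_number(80.0), erb_number(8020.0), 32))
```

With these endpoints, adjacent center frequencies are separated by roughly one ERB-number, consistent with 1-ERBN-wide analysis bands.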

3 A pilot experiment using an ABX, forced-choice discrimination task was initially conducted on 40 normal-hearing adults using the intact and vocoded stimuli, in order to verify that the processed sounds can be discriminated. This pilot experiment showed that, after a very short practice, normal-hearing adults discriminated almost perfectly voiced and unvoiced consonants in each speech-processing condition (>90% correct discrimination). This finding is consistent with the outcome of the original speech-identification experiment conducted by Shannon et al. (1995), showing that voicing perception is nearly perfect in adults as long as AM cues below 16 Hz are presented in at least four broad spectral regions. Interestingly, all listeners reported having recognized the disyllables /aba/ and /apa/ except those presented with “4-band AM speech”. In this speech-processing condition, stimuli were not recognized as speech sounds.


In the last condition (called “4-band AM speech”), the same signal-processing scheme was used as in the “32-band AM speech” condition, except that AM cues were extracted from only 4, broad (8-ERBN-wide) frequency bands. Thus, the original FM speech cues were discarded, and AM cues were distorted substantially compared to the original AM speech cues. This vocoder also reproduces the sound processing typically achieved by current cochlear-implant sound processors (cf. Shannon et al., 1995). In each condition, 4 different sequences were created. Each sequence was composed of 4 tokens of the same phonetic category, repeated 4 times in a different random order. Thus, all sequences had the same number of stimuli. Two sequences were used for the familiarization phase and two for the test phase. The tokens used in the test phase for each phonetic category were different from the ones used in the familiarization phase. The inter-stimulus interval was varied randomly between 390 and 600 ms throughout the 16-item sequences. This random variation was introduced to prevent infants from using small variations in duration between items within and between categories. Finally, all the sound sequences were equated in duration (18 s).
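The four conditions share one analysis-synthesis scheme: band-split, Hilbert-envelope extraction, envelope lowpass filtering, and resynthesis on either the original FM carrier or a noise carrier. A minimal sketch of such a vocoder is given below, assuming band edges uniformly spaced on the ERB-number scale and simple zero-phase Butterworth filters; it approximates, but does not reproduce, the exact stimulus processing (e.g., it uses bandpass-filtered white Gaussian noise rather than pink-noise carriers):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def erb_number(f):
    return 21.4 * np.log10(4.37e-3 * f + 1.0)

def inverse_erb_number(e):
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def vocode(x, fs, n_bands=32, am_cutoff=None, noise_carrier=True,
           f_lo=80.0, f_hi=8020.0, seed=0):
    """Sketch of a noise-excited vocoder: band-split, Hilbert AM
    extraction, AM lowpass, resynthesis on new carriers."""
    edges = inverse_erb_number(
        np.linspace(erb_number(f_lo), erb_number(f_hi), n_bands + 1))
    rng = np.random.default_rng(seed)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)             # zero-phase band-split
        analytic = hilbert(band)
        am = np.abs(analytic)                  # Hilbert envelope (AM)
        cutoff = am_cutoff if am_cutoff is not None else (hi - lo) / 2.0
        sos_lp = butter(3, cutoff, btype="low", fs=fs, output="sos")
        am = np.maximum(sosfiltfilt(sos_lp, am), 0.0)
        if noise_carrier:                      # replace FM carrier by noise
            carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
            carrier /= np.sqrt(np.mean(carrier ** 2)) + 1e-12
        else:                                  # keep the original FM carrier
            carrier = np.cos(np.angle(analytic))
        out += am * carrier
    # restore the overall RMS level of the input
    out *= np.sqrt(np.mean(x ** 2) / (np.mean(out ** 2) + 1e-12))
    return out
```

Under these assumptions, `vocode(x, fs, 32, noise_carrier=False)` approximates the “32-band AM+FM” condition, `vocode(x, fs, 32)` the “32-band AM” condition, `vocode(x, fs, 32, am_cutoff=16.0)` the “32-band AM<16Hz” condition, and `vocode(x, fs, 4)` the “4-band AM” condition.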


Figure 1. Spectrograms of /aba/ (left panels) and /apa/ (right panels). From top to bottom: unprocessed, “32-band AM+FM”, “32-band AM”, “32-band AM<16Hz”, and “4-band AM” speech conditions.


3. Procedure

The experiment was conducted inside a sound-attenuated room where three lamps were fixed: a green one on the center wall, and a red one on each side wall. Below the green lamp was a hole for the lens of a video-camera. Out of the infant’s view, two loudspeakers were placed behind the red lamps and delivered the speech stimuli at 70 dB SPL (RMS) at the level of the infant’s head. The infant was seated on the caregiver’s lap in the center of the booth. The caregiver was instructed not to speak or interfere in any way with the infant’s behavior, and wore earplugs and headphones delivering masking music. The entire experimental session was controlled by a computer outside the booth, and a TV screen was connected to the camera. The experimenter sat outside the booth and watched the video of the infant on the TV screen to monitor the infant’s looking behavior. The experimenter used a response box composed of three buttons corresponding to the three lights inside the booth. The response box was connected to the computer controlling the experiment. The experimenter pressed the buttons of the response box according to the direction of the infant’s head (center, right or left). During the entire session, the experimenter wore headphones delivering masking music and was unaware of the nature of the sequences (familiar or novel) displayed during the test. During the familiarization and the test phase, each experimental trial started with the blinking of the green center lamp. When the infant oriented to the green lamp, the experimenter pressed the “center” button, switching the green lamp off and, simultaneously, one of the red lamps on. When the infant turned his/her head towards the red blinking lamp, the experimenter pressed the “right” or “left” button, triggering the presentation of the auditory stimulus from the corresponding loudspeaker.
The trial was stopped when the infant turned his/her head away for more than 2 s (at which point the experimenter released the button), or when the end of the stimulus sequence was reached. The duration of each head-turn was automatically stored on the computer. During the familiarization phase, infants heard two sequences of the same phonetic category (half of the infants heard /aba/ and the other half /apa/). The familiarization trials continued (as described above) until infants had listened to each


sequence for a cumulative duration of 30 s, for a total familiarization time of 60 s. The number of trials necessary to reach this criterion differed across participants; those who reached the criterion with a number of trials significantly larger than the group mean were not included in the analyses. In the test phase, two different sequences of the familiarized item and two sequences of the novel item were each displayed twice in random order, counterbalanced between the left and right sides of presentation. Thus, the test phase comprised 8 trials for each subject. The order of test sequences was counterbalanced between subjects (half of the subjects received a familiar-stimulus sequence as the first test trial, while the other half received a novel sequence as the first test trial).
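The looking-time bookkeeping implied by this procedure can be sketched as follows. The function and its interval-based input are a hypothetical reconstruction of the stop rule (a look-away longer than 2 s, or the end of the 18-s sequence), not the actual control software:

```python
def trial_looking_time(look_intervals, max_away=2.0, seq_dur=18.0):
    """Total looking time (s) for one trial, given (start, end) pairs
    of head-orientation intervals measured from stimulus onset.
    The trial ends when a look-away exceeds `max_away` seconds or the
    sequence finishes; orientation accumulated up to then is returned."""
    total, prev_end = 0.0, 0.0
    for start, end in look_intervals:
        if start - prev_end > max_away:   # look-away exceeded 2 s: trial over
            break
        end = min(end, seq_dur)           # sequence cannot outlast 18 s
        total += max(0.0, end - start)
        prev_end = end
        if end >= seq_dur:                # sequence finished
            break
    return total
```

For example, looks of 0-5 s and 6-10 s followed by a 4-s look-away yield a looking time of 9 s, the 4-s gap terminating the trial.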

B. Results

Mean looking times were calculated for each participant across the 8 test trials (4 novel and 4 familiar sequences). A repeated-measures analysis of variance (ANOVA) was run with Condition (4 levels) as the between-subject factor and Type of sequence (Familiar versus Novel) as the within-subject factor. This analysis indicated no main effect of Condition (F(3,84) = 0.78; p = .51) or Type of sequence (F(1,84) = 0.26; p = .61), but a significant Condition × Type of sequence interaction (F(3,84) = 3.06; p = .033). This interaction is mainly due to the fact that in the “32-band AM+FM” condition, infants listened longer to the novel sequences than to the familiar sequences (9.3 s, SD = 2.7 s, versus 8.2 s, SD = 3.2 s, respectively), whereas in the “4-band AM” condition infants listened less to the novel sequences than to the familiar ones (7.1 s, SD = 2.6 s, versus 8.1 s, SD = 2.1 s, respectively). In the other two conditions (“32-band AM” and “32-band AM<16 Hz”), the mean looking times were similar for the novel and the familiar sequences (see Figure 2). Planned comparisons confirmed that the significant interaction mentioned above is mainly due to the interaction between the “32-band AM+FM” and “4-band AM” conditions and Type of sequence (F(1,84) = 8.59; p = .004). Additionally, paired t-tests were carried out for each condition separately. A significant preference for the novel sequences was observed in the “32-band


AM+FM” condition (t(21) = 2.33, p = .03), and a significant preference for the familiar sequences was observed in the “4-band AM” condition (t(21) = 2.43, p = .02). No significant difference was observed in the other two “32-band AM” and “32-band AM<16 Hz” conditions (p > .05 in both conditions).
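The per-condition comparisons above are standard paired t-tests on per-infant mean looking times (df = n - 1). A sketch with illustrative (made-up) numbers, not the study's data:

```python
import numpy as np

def paired_t(novel, familiar):
    """Paired t statistic on per-infant looking times; returns (t, df)."""
    d = np.asarray(novel, float) - np.asarray(familiar, float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # mean diff / SE of diff
    return t, n - 1

# illustrative (made-up) looking times in seconds for 6 infants
novel    = [9.1, 10.2, 8.7, 9.9, 8.4, 9.6]
familiar = [8.0,  8.9, 8.1, 8.8, 8.5, 8.3]
t, df = paired_t(novel, familiar)
```

The same statistic is returned by `scipy.stats.ttest_rel`, which also supplies the two-tailed p-value.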

Figure 2. Mean looking times in Experiment 1 for familiar and novel stimuli during the test phase for each speech-processing condition (error bars represent standard errors).

C. Discussion

As expected, infants exhibited significantly longer looking times for the novel stimuli in the “32-band AM+FM” speech condition, indicating that they discriminated /aba/ from /apa/ when provided with intact AM and FM speech cues. This novelty preference is a classical demonstration of infants’ discrimination capacities. However, a familiarity preference can also be taken as an indication of discrimination (e.g., Hunter & Ames, 1988). In behavioral procedures including a familiarization phase, familiarization is supposed to bias stimulus preference in such a way that any significant difference (whatever its direction, since familiarized stimuli were counterbalanced across subjects) indicates that infants have processed novel and familiar stimuli differently. Here, an unexpected preference for familiar stimuli was observed in 6-month-old infants in the “4-band AM” speech condition, in which the frequency resolution of the vocoder was reduced to 4 broad (i.e., 8-ERBN-wide) frequency bands. Although there are a number of hypotheses as to which factors affect the direction of preferences, no consensus has emerged (e.g., Hunter & Ames, 1988; Rose, Gottfried, Melloy-Carminar, & Bridger, 1982; Thiessen & Saffran, 2003; Wagner & Sakovits, 1986). Hunter and Ames (1988) proposed a model of infant preferences for novel and familiar stimuli based on the interaction among three factors, namely age, familiarization time, and task difficulty. According to this model, the present results in the “4-band AM” speech condition may be explained by the difficulty of processing this relatively impoverished signal. In this speech condition, the original FM speech cues were discarded, and the AM cues were substantially distorted compared to the “32-band AM+FM” condition and to the AM speech cues that listeners typically extract from the relatively narrow (1-ERBN-wide) frequency bands corresponding to the outputs of their cochlear filters (Kates, 2011).

The results obtained in the “32-band AM” condition were even more unexpected, given that these stimuli were supposed to convey more accurate modulation information than the “16-band AM” speech stimuli previously used by Bertoncini et al. (2011). Indeed, in the present study, a higher frequency resolution was used and higher AM rates (e.g., periodic AM fluctuations at F0) were transmitted in high-CF analysis bands: the AM component was lowpass filtered at ERBN/2 within each band, versus 64 Hz in Bertoncini et al. (2011). It is thus possible that the inherent random amplitude fluctuations of the noise carriers masked to some extent the original AM speech cues (e.g., Dau, Kollmeier, & Kohlrausch, 1997a,b; Dau, Verhey, & Kohlrausch, 1999; Lorenzi et al., 2001; see also Kates, 2011). Such a “modulation masking” effect did not occur in the study of Bertoncini et al. (2011) because pure-tone carriers were used instead of bands of noise. Alternatively, another factor could account for this difference. As mentioned above, the preset duration of the familiarization period may have been inappropriate (i.e., too short) for artificially degraded stimuli such as noise-vocoded speech sounds. In Bertoncini et al. (2011), 6-month-old infants received no familiarization and were found to spontaneously prefer alternating series of tone-vocoded speech sounds over repeating ones. In the present version of the procedure, discrimination may require more extensive perceptual processing than in the previous study, where the repetitive juxtaposition of /aba/ and /apa/ could have prompted immediate comparison. It may be that the information needed to represent a given stimulus category has to be gathered and correctly formatted before being compared to the (correctly formatted) other category. This is consistent with the studies conducted by Holt and Carney (2005, 2007) on children and adults, who proposed that the “robustness” of the internal representation of speech stimuli, and hence correct discrimination of a given speech contrast, increases with increasing repetition/presentation of the stimuli. Again, the duration of the familiarization phase used in Experiment 1 might have been too short for infants to complete such a perceptual comparison and to retrieve the original AM speech cues from the random amplitude fluctuations produced by the noise carriers. Hunter and Ames (1988) indicated that the absence of any preference following a familiarization phase can be explained by an insufficient familiarization time, and the duration necessary to become properly familiarized with a given stimulus may depend on the vocoded condition. The high variability of infants’ responses observed in the “32-band AM” and “32-band AM<16 Hz” speech conditions also raises questions about the effectiveness of the familiarization period, suggesting a transitional phase in preference in these two conditions (see also Hunter & Ames, 1988; Rose et al., 1982). The absence of a significant difference between the “32-band AM+FM”, “32-band AM” and “32-band AM<16 Hz” speech conditions precludes any conclusion about the effects of degrading AM and/or FM information. Thus, when the AM and FM speech cues are severely degraded, infants may need extended exposure to the contrasted stimuli; in our view, this could also yield a more consistent preference for one type of sequence. A second experiment, in which the familiarization time was extended to a minimum of 2 min, was conducted to test this hypothesis.
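The manipulations discussed above (band-limited AM extraction, envelope lowpass filtering, noise carriers) follow the standard channel-vocoder scheme. The following is a minimal illustrative sketch only, assuming numpy/scipy; the Butterworth band edges, filter orders, and envelope cutoff are chosen for illustration and are not the parameters of the actual stimuli (which used ERBN-spaced analysis bands and an ERBN/2 envelope cutoff):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def noise_vocode(x, fs, band_edges, env_cutoff_hz=None, seed=0):
    """Noise-excited vocoder sketch: split x into analysis bands, extract the
    AM envelope of each band (discarding within-band FM), optionally lowpass
    the envelope (e.g., at 16 Hz to remove fast, F0-related AM), and re-impose
    it on a band-limited noise carrier."""
    rng = np.random.default_rng(seed)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))          # AM envelope (Hilbert magnitude)
        if env_cutoff_hz is not None:        # e.g., 16 Hz cutoff
            bl, al = butter(2, env_cutoff_hz, btype="low", fs=fs)
            env = np.maximum(filtfilt(bl, al, env), 0.0)
        carrier = filtfilt(b, a, rng.standard_normal(len(x)))  # band-limited noise
        out += env * carrier
    return out

fs = 16000
time = np.arange(int(0.5 * fs)) / fs
x = np.sin(2 * np.pi * 440 * time)           # stand-in for a speech token
edges = np.geomspace(100, 7000, 5)           # 4 broad bands ("4-band AM"-like)
y = noise_vocode(x, fs, edges)
```

Passing `env_cutoff_hz=16` mimics the “AM<16 Hz”-style manipulation, while replacing the noise carrier with a sine at each band's center frequency would approximate the tone-vocoder of Bertoncini et al. (2011), whose carriers lack intrinsic envelope fluctuations.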


III. EXPERIMENT 2

Four groups of twenty 6-month-old French-learning infants were tested with a minimum familiarization duration extended to 2 min.

A. Method

1. Participants

Data from 80 infants were analyzed (45 boys and 35 girls; age range: 5 months 23 days to 7 months 3 days; mean = 6 months 13 days; SD = 8 days). The data from 18 additional infants were not included for the following reasons: fussing and crying (n = 14), looking time shorter than 1500 ms on one trial (n = 1), outlier number of familiarization trials (n = 2), and outlier looking-time difference between novel and familiar conditions (n = 1).

2. Stimuli

Infants were presented with the same stimuli and sequences of stimuli as those used in Experiment 1.

3. Procedure

The same procedure (HPP) was used, except that the familiarization-time criterion was changed. In the familiarization phase, infants heard two sequences of the same phonetic category and had to listen to each sequence for at least 60 s; the familiarization phase therefore had a minimum cumulative duration of 120 s. The test phase was left unchanged: each infant received 8 trials, comprising two different sequences of new tokens of the familiarized item and two sequences of the novel item, each played twice in random order.
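The familiarization criterion amounts to an accumulation rule over trials: looking time is summed per sequence until each of the two familiarization sequences reaches the minimum. A schematic sketch, assuming Python; the sequence labels and the log format are hypothetical:

```python
def familiarization_complete(looks, min_per_sequence=60.0):
    """Check the Experiment 2 criterion: cumulative looking time must reach
    min_per_sequence seconds for EACH of the two familiarization sequences
    (i.e., at least 120 s in total). `looks` is a list of
    (sequence_id, looking_time_in_seconds) tuples, one per trial."""
    totals = {}
    for seq, secs in looks:
        totals[seq] = totals.get(seq, 0.0) + secs
    return all(totals.get(seq, 0.0) >= min_per_sequence for seq in ("A", "B"))

# Hypothetical looking-time log across familiarization trials
log = [("A", 25.0), ("B", 40.0), ("A", 38.0), ("B", 22.0)]
familiarization_complete(log)  # A: 63 s, B: 62 s -> True
```

Setting `min_per_sequence=30.0` would express the shorter (1 min cumulative) criterion of Experiment 1 in the same form.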

B. Results

As in Experiment 1, the mean looking times for novel and familiar stimuli were calculated for each participant across the 8 trials of the test phase. A repeated-measures ANOVA was run with Condition (4 levels) as the between-subject factor and Type of sequence (Familiar versus Novel) as the within-subject factor. The main effect of Condition was not significant (F(3,76) = 1.29; p = .28). However, the analysis showed a significant effect of Type of sequence (F(1,76) = 8.88; p = .004) and a significant interaction between Condition and Type of sequence (F(3,76) = 5.67; p = .0015). Planned comparisons indicated that only the “32-band AM<16 Hz” condition differed significantly from the three others (α = .05). Once again, separate analyses revealed differences between conditions. A significant preference for the novel sequences over the familiar ones was observed in three conditions: the “32-band AM+FM” condition (7.29 s, SD = 1.7 s versus 6.08 s, SD = 2.07 s, respectively; t(19) = 2.22, p = .039), the “32-band AM” condition (6.2 s, SD = 2.6 s versus 4.8 s, SD = 2.1 s, respectively; t(19) = 2.99, p = .014), and the “4-band AM” condition (7.34 s, SD = 3.2 s versus 5.9 s, SD = 2.5 s, respectively; t(19) = 2.50, p = .022). In the “32-band AM<16 Hz” condition, the mean looking time for novel sequences was shorter than for familiar sequences (6.3 s, SD = 3.1 s versus 7.3 s, SD = 3.1 s). This difference was significant (t(19) = -2.55; p = .028), indicating that infants listened longer to the familiar stimuli in this condition (see Figure 3).

Figure 3. Mean looking times in Experiment 2 for familiar and novel stimuli during the test phase for each speech-processing condition (error bars represent the standard errors).


C. Discussion

In this second experiment, each condition was re-tested with a preset familiarization time of 2 min instead of the 1 min used in Experiment 1. With this longer familiarization time, the 6-month-old infants showed the classical pattern of novelty preference in three conditions out of four. However, infants showed a preference for the familiar stimuli in the “32-band AM<16 Hz” condition, that is, when the fastest (>16 Hz) AM fluctuations related to F0 variations were reduced. The results of this second experiment revealed that discrimination of /aba/ and /apa/ was possible, as indicated by a classical novelty preference, in both the “32-band AM” and “4-band AM” conditions. The preference pattern in the “32-band AM<16 Hz” condition also turned out to be significant, confirming the hypothesis that extended familiarization is necessary to evidence discrimination responses in some vocoder conditions.

IV. GENERAL DISCUSSION

The present study was designed to explore how 6-month-old infants process a French voicing contrast on the basis of AM and FM cues, at an age when their perceptual mechanisms are not yet completely tuned to their native language. Two disyllables (/aba/ versus /apa/) were used, and several speech-processing conditions were designed to investigate whether infants discriminate this contrast when the modulation properties of speech sounds are severely degraded.

Discrimination of speech modulation cues. Six-month-old infants were found to discriminate the voicing contrast with the vocoded speech stimuli. In the second experiment, the results demonstrated that within-channel FM cues are not necessary to discriminate voicing at 6 months of age (cf. the “32-band AM” condition). They also demonstrated that infants could discriminate voicing in the absence of periodic AM fluctuations related to F0 (cf. the “32-band AM<16 Hz” condition). However, even with an extended familiarization time, infants in this latter condition showed the opposite pattern of discrimination compared to the “32-band AM” condition, where fast AM variations related to F0 were preserved. This suggests that infants are sensitive to a degradation of fast F0 cues in the AM domain, an indication that, as for adults (e.g., Rosen, 1992), fast, F0-related periodic information constitutes a reliable cue to voicing for infants. This is also consistent with the outcome of previous psychophysical studies showing that the ability to detect changes in AM cues over time (i.e., auditory temporal resolution) is efficient by 6 months of age (Levi & Werner, 1996; for reviews, see Saffran et al., 2006, and Werner & Gray, 1998). In Experiments 1 and 2, the results showed that infants are able to discriminate speech signals containing only AM cues extracted within a small number of frequency bands (“4-band AM” condition). Moreover, in Experiment 1, the preference pattern for familiar stimuli observed in this condition, compared to the “32-band AM+FM” and “32-band AM” conditions, suggests that infants are sensitive to a reduction in frequency resolution. This is consistent with the demonstration of adult-like frequency selectivity in infants by the age of 6 months (e.g., Abdala & Folsom, 1995; Spetner & Olsho, 1990).

The importance of auditory exposure to the stimuli. These findings suggest that at this early age, auditory processes are capable of supporting subtle speech distinctions despite severe distortions of speech modulation cues. Nevertheless, it is important to keep in mind that the demonstration of discrimination in infants was not straightforward in all conditions of degradation. In the “32-band AM+FM” condition, increasing the familiarization time did not affect the infants’ pattern of responses. However, in the “4-band AM” condition, increasing the familiarization time by 1 min changed the infants’ response pattern from a preference for familiar stimuli to a preference for novel stimuli.
In the “32-band AM” condition, the extended auditory exposure led to a preference for the novel stimuli, whereas in the “32-band AM<16 Hz” condition, infants showed a significant preference for the familiar sequences. Thus, a preset and relatively short exposure time may not have allowed the 6-month-old infants to fully process the speech cues in each vocoder condition and to build up the detailed representations of the vocoded signals required for robust discrimination (cf. Holt, 2011). This suggests that the present version of the HPP, based on a fixed familiarization period, might not have optimally assessed infants’ capacity to discriminate degraded (and quite unfamiliar) speech sounds.


The familiarization phase was intended to provide each group of infants with an equivalent amount of auditory experience with a particular sound. However, it is important to keep in mind that the vocoded stimuli differed strongly in their amount of spectro-temporal degradation. A procedure using an infant-controlled habituation time might assess discrimination of such degraded speech stimuli more efficiently. Future studies are warranted to assess whether familiarization time should be adapted to each infant in order to reduce inter-individual variability across the different conditions of degradation.

Different response patterns of infants. A non-classical preference pattern was observed when AM and FM cues were simultaneously degraded, as in the “32-band AM<16 Hz” and “4-band AM” conditions (Experiment 2 and Experiment 1, respectively). In these two conditions, infants showed a preference for the familiar stimuli instead of the more typical preference for novel stimuli observed in the “32-band AM+FM” and “32-band AM” conditions. Altogether, these results are congruent with the assumption that a familiarity preference may reflect difficulty in processing the stimuli (Hunter & Ames, 1988). This difference in response patterns suggests that the mechanisms involved in voicing perception may differ according to the severity of the modulation distortions and the consequent “difficulty” of processing these highly distorted, and thus unfamiliar, signals (Hunter & Ames, 1988; Rose et al., 1982). This suggestion is consistent with the fact that adult participants tested in a pilot study (see footnote 2) could not readily recognize the linguistic nature of the vocoded stimuli in the “4-band AM” condition, and required extensive training to identify consonants, vowels or sentences in the same speech-processing condition, as previously shown by Shannon et al. (1995).
This suggestion is also consistent with the notion that part of the processing mechanisms recruited for a given task may change depending on whether adults are informed about the nature of what they are supposed to listen to (see Liebenthal, Binder, Piorkowski, & Remez, 2003). When the nature of the stimuli is unknown and unfamiliar, the perceptual mechanisms engaged may differ from those activated by an intact speech signal. In Experiment 2, the particular pattern of results observed in the “32-band AM<16 Hz” condition may reveal the importance of the fast AM cues related to F0 fluctuations for voicing discrimination at 6 months.


Clinical implications. In the “4-band AM” condition, six-month-old infants with normal hearing were tested with speech stimuli processed by a vocoder simulating the sound processing typically achieved by current cochlear-implant sound processors (e.g., Friesen, Shannon, Baskent, & Wang, 2001; Shannon et al., 1995). The results showed that infants could still discriminate voicing despite important degradations of both AM and FM cues. However, a classical “novelty” discrimination response seems to require extended exposure to the contrasted stimuli when the speech modulation cues are severely degraded by filtering or spectral reduction. The implications of the present results for understanding the speech perception capacities of infants wearing a cochlear implant should nevertheless be treated with caution: the present results were obtained with normal-hearing infants listening to degraded speech, and with a single speech contrast in a single context. Still, these results are encouraging for understanding speech development in deaf infants wearing cochlear implants, and they point to the need to investigate the basic auditory abilities required to benefit from very early cochlear implantation (e.g., Holt & Svirsky, 2008; Miyamoto, Houston, Kirk, Perdew, & Svirsky, 2003; Nikolopoulos, Archbold, & O’Donoghue, 1999; Svirsky, Robbins, Kirk, Pisoni, & Miyamoto, 2000). Further data will be needed to understand how the degraded signals delivered by cochlear-implant processors are processed throughout the development of speech perception. Beyond their clinical relevance, these results also draw attention to the role of basic auditory processes in early speech perception before or during the development of language-specific tuning (Kuhl et al., 2008).

V. CONCLUSION

The present study explored how infants process the acoustic information related to phonetic differences using a novel approach stressing the importance of speech modulation cues. The results indicate that the relatively fast AM and FM cues in speech are not necessary for French-learning 6-month-old infants to discriminate a voicing contrast in silence. These results demonstrate that the perception of voicing is robust as early as 6 months of age. However, the duration of initial auditory exposure (i.e., familiarization time) to the vocoded stimuli had to be increased to evidence discrimination abilities in all speech-processing conditions. Finally, the results show that normal-hearing infants can use the impoverished speech modulation cues delivered by prosthetic devices such as cochlear implants to discriminate voicing, provided that they are sufficiently exposed to, and familiarized with, these degraded speech stimuli.

ACKNOWLEDGMENTS

This work was supported by the ANR and the program “Investissement d’avenir” (ANR-11-IDEX-0001-02-PSL; Labex “Institut d’Etude de la Cognition”). C. Lorenzi was supported by a grant from the ANR (HEARFIN project). The authors warmly thank the parents for their participation in this study.


REFERENCES

Abdala, C., & Folsom, R.C. (1995). The development of frequency resolution in humans as revealed by the auditory brain-stem response recorded with notched-noise masking. Journal of the Acoustical Society of America, 98, 921-930.
Ardoint, M., & Lorenzi, C. (2010). Effects of lowpass and highpass filtering on the intelligibility of speech based on temporal fine structure or envelope cues. Hearing Research, 260, 89-95.
Aslin, R.N. (1989). Discrimination of frequency transitions by human infants. Journal of the Acoustical Society of America, 86, 582-590.
Bertoncini, J., Nazzi, T., Cabrera, L., & Lorenzi, C. (2011). Six-month-old infants discriminate voicing on the basis of temporal envelope cues. Journal of the Acoustical Society of America, 129, 2761-2764.
Bertoncini, J., Serniclaes, W., & Lorenzi, C. (2009). Discrimination of speech sounds based upon temporal envelope versus fine structure cues in 5- to 7-year-old children. Journal of Speech, Language, and Hearing Research, 52, 682-695.
Bijeljac-Babic, R., Serres, J., Höhle, B., & Nazzi, T. (2012). Effect of bilingualism on lexical stress pattern discrimination in French-learning infants. PLoS ONE, 7(2), e30843.
Bosch, L., & Sebastian-Galles, N. (2001). Evidence of early language discrimination abilities in infants from bilingual environments. Infancy, 2, 29-49.
Brugge, J.F., Javel, E., & Kitzes, L.M. (1978). Signs of functional maturation of peripheral auditory system in discharge patterns of neurons in anteroventral cochlear nucleus of kitten. Journal of Neurophysiology, 41, 1557-1559.
Buss, E., Hall, J.W. III, Grose, J.H., & Dev, M.B. (1999). Development of adult-like performance in backward, simultaneous, and forward masking. Journal of Speech, Language, and Hearing Research, 42, 844-849.
Colombo, J., & Horowitz, F.D. (1986). Infants' attentional responses to frequency modulated sweeps. Child Development, 57, 287-291.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997a). Modeling auditory processing of amplitude modulation: I. Modulation detection and masking with narrow-band carriers. Journal of the Acoustical Society of America, 102, 2892-2905.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997b). Modeling auditory processing of amplitude modulation: II. Spectral and temporal integration in modulation detection. Journal of the Acoustical Society of America, 102, 2906-2919.


Dau, T., Verhey, J., & Kohlrausch, A. (1999). Intrinsic envelope fluctuations and modulation-detection thresholds for narrow-band carriers. Journal of the Acoustical Society of America, 106, 2752-2760.
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111, 1917-1930.
Deng, L., & Geisler, C.D. (1987). Responses of auditory-nerve fibers to nasal consonant-vowel syllables. Journal of the Acoustical Society of America, 82, 1977-1988.
Diedler, J., Pietz, J., Bast, T., & Rupp (2007). Auditory temporal resolution in children assessed by magnetoencephalography. Neuroreport, 18, 1691-1695.
Drullman, R. (1995). Temporal envelope and fine structure cues for speech intelligibility. Journal of the Acoustical Society of America, 97, 585-592.
Dudley, H. (1939). Remaking speech. Journal of the Acoustical Society of America, 11, 165-165.
Eaves, J.M., Summerfield, A.Q., & Kitterick, P.T. (2011). Benefit of temporal fine structure to speech perception in noise measured with controlled temporal envelopes. Journal of the Acoustical Society of America, 130, 501-507.
Eimas, P.D., Siqueland, E.R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303-306.
Eisenberg, L.S., Shannon, R.V., Schaefer Martinez, A., Wygonski, J., & Boothroyd, A. (2000). Speech recognition with reduced spectral cues as a function of age. Journal of the Acoustical Society of America, 107, 2704-2710.
Friesen, L.M., Shannon, R.V., Baskent, D., & Wang, X. (2001). Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants. Journal of the Acoustical Society of America, 110, 1150-1163.
Gilbert, G., Bergeras, I., Voillery, D., & Lorenzi, C. (2007). Effects of periodic interruption on the intelligibility of speech based on temporal fine-structure or envelope cues. Journal of the Acoustical Society of America, 122, 1336-1339.
Gilbert, G., & Lorenzi, C. (2006). The ability of listeners to use recovered envelope cues from speech fine structure. Journal of the Acoustical Society of America, 119, 2438-2444.
Glasberg, B.R., & Moore, B.C.J. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47, 103-138.
Gnansia, D., Pean, V., Meyer, B., & Lorenzi, C. (2009). Effects of spectral smearing and temporal fine structure degradation on speech masking release. Journal of the Acoustical Society of America, 125, 4023-4033.


Hall, J.W. III, & Grose, J.H. (1994). Development of temporal resolution in children as measured by the temporal modulation transfer function. Journal of the Acoustical Society of America, 96, 150-154.
Hirsh-Pasek, K., Kemler Nelson, D.G., Jusczyk, P.W., Wright, K., Druss, B., & Kennedy, L. (1987). Clauses are perceptual units for young infants. Cognition, 26, 269-286.
Höhle, B., Bijeljac-Babic, R., Herold, B., Weissenborn, J., & Nazzi, T. (2009). The development of language specific prosodic preferences during the first half year of life: Evidence from German and French. Infant Behavior and Development, 32, 262-274.
Holt, R.F. (2011). Enhancing speech discrimination through stimulus repetition. Journal of Speech, Language, and Hearing Research, 54, 1431-1447.
Holt, R.F., & Carney, A.E. (2005). Multiple looks in speech sound discrimination in adults. Journal of Speech, Language, and Hearing Research, 48, 922-943.
Holt, R.F., & Carney, A.E. (2007). Developmental effects of multiple looks in speech sound discrimination. Journal of Speech, Language, and Hearing Research, 50, 1404-1424.
Holt, R.F., & Svirsky, M.A. (2008). An exploratory look at paediatric cochlear implantation: Is earliest always best? Ear and Hearing, 29, 492-511.
Hoonhorst, I., Colin, C., Markessis, E., Radeau, M., Deltenre, P., & Serniclaes, W. (2009). French native speakers in the making: From language-general to language-specific voicing boundaries. Journal of Experimental Child Psychology, 104, 353-366.
Hopkins, K., Moore, B.C.J., & Stone, M.A. (2008). Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech. Journal of the Acoustical Society of America, 123, 1140-1153.
Hopkins, K., Moore, B.C.J., & Stone, M.A. (2010). The effects of the addition of low-level, low-noise noise on the intelligibility of sentences processed to remove temporal envelope information. Journal of the Acoustical Society of America, 128, 2150-2161.
Hunter, M.A., & Ames, E.W. (1988). A multifactor model of infant preferences for novel and familiar stimuli. Advances in Infancy Research, 5, 69-95.
Johnson, D.H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. Journal of the Acoustical Society of America, 68, 1115-1122.
Joris, P.X., & Yin, T.C. (1992). Responses to amplitude-modulated tones in the auditory nerve of the cat. Journal of the Acoustical Society of America, 91, 215-232.
Joris, P.X., Schreiner, C.E., & Rees, A. (2004). Neural processing of amplitude-modulated sounds. Physiological Reviews, 84, 541-577.


Kale, S., & Heinz, M.G. (2010). Envelope coding in auditory nerve fibers following noise-induced hearing loss. Journal of the Association for Research in Otolaryngology, 11, 657-673.
Kates, J.M. (2011). Spectro-temporal envelope changes caused by temporal fine-structure modification. Journal of the Acoustical Society of America, 129, 3981-3990.
Kettner, R.E., Feng, J.Z., & Brugge, J.F. (1985). Postnatal development of the phase-locked response to low frequency tones of the auditory nerve fibers in the cat. Journal of Neuroscience, 5, 275-283.
Kiang, N.Y., Pfeiffer, R.R., Warr, W.B., & Backus, A.S. (1965). Stimulus coding in the cochlear nucleus. Annals of Otology, Rhinology and Laryngology, 74, 463-485.
Kuhl, P.K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5, 831-843.
Kuhl, P.K., Conboy, B.T., Coffey-Corina, S., Padden, S., Rivera-Gaxiola, M., & Nelson, T. (2008). Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society of London B: Biological Sciences, 363, 979-1000.
Kuhl, P.K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9, F13-F21.
Kuhl, P.K., Williams, K.A., Lacerda, F., Stevens, K.N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606-608.
Levi, E.C., Folsom, R.C., & Dobie, R.A. (1995). Coherence analysis of envelope-following responses (EFRs) and frequency-following responses (FFRs) in infants and adults. Hearing Research, 89, 21-27.
Levi, E.C., & Werner, L.A. (1996). Amplitude modulation detection in infancy: Update on 3-month-olds. Abstracts of the Association for Research in Otolaryngology, 19, 142.
Liebenthal, E., Binder, J.R., Piorkowski, R.L., & Remez, R.E. (2003). Short-term reorganization of auditory analysis induced by phonetic experience. Journal of Cognitive Neuroscience, 15, 549-558.
Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., & Moore, B.C.J. (2006). Speech perception problems of the hearing impaired reflect inability to use temporal fine structure. Proceedings of the National Academy of Sciences USA, 103, 18866-18869.
Lorenzi, C., Simpson, M.I.G., Millman, R.E., Griffiths, T.D., Woods, W.P., Rees, A., & Green, G.G.R. (2001). Second-order modulation detection thresholds for pure-tone and narrow-band noise carriers. Journal of the Acoustical Society of America, 110, 2470-2478.


Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106, 1367-1381.
Miyamoto, R.T., Houston, D.M., Kirk, K.I., Perdew, A.E., & Svirsky, M.A. (2003). Language development in deaf infants following cochlear implantation. Acta Oto-Laryngologica, 123, 241-244.
Moore, B.C.J. (2003). An Introduction to the Psychology of Hearing (5th ed.). San Diego: Academic Press.
Moore, B.C.J., & Sek, A. (1996). Detection of frequency modulation at low modulation rates: Evidence for a mechanism based on phase locking. Journal of the Acoustical Society of America, 100, 2320-2331.
Moore, D.R., Cowan, J.A., Riley, A., Edmondson-Jones, A.M., & Ferguson, M.A. (2011). Development of auditory processing in 6- to 11-yr-old children. Ear and Hearing, 32, 269-285.
Nazzi, T., Jusczyk, P.W., & Johnson, E.K. (2000). Language discrimination by English-learning 5-month-olds: Effects of rhythm and familiarity. Journal of Memory and Language, 43(1), 1-19.
Nelson, P.B., Jin, S.H., Carney, A.E., & Nelson, D.A. (2003). Understanding speech in modulated interference: Cochlear implant users and normal-hearing listeners. Journal of the Acoustical Society of America, 113, 961-969.
Nikolopoulos, T.P., Archbold, S.M., & O'Donoghue, G.M. (1999). The development of auditory perception in children following cochlear implantation. International Journal of Pediatric Otorhinolaryngology, 49, 189-191.
Olsho, L.W. (1985). Infant auditory perception: Tonal masking. Infant Behavior and Development, 8, 371-384.
Palmer, A.R., & Russell, I.J. (1986). Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hearing Research, 24, 1-15.
Plantinga, J., & Trainor, L.J. (2009). Melody recognition by two-month-old infants. Journal of the Acoustical Society of America, 125(2), EL58-EL62.
Qin, M.K., & Oxenham, A.J. (2003). Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers. Journal of the Acoustical Society of America, 114, 446-454.
Rose, J.E., Brugge, J.F., Anderson, D.J., & Hind, J.E. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. Journal of Neurophysiology, 30, 769-793.
Rose, S.A., Gottfried, A.W., Melloy-Carminar, P., & Bridger, W.H. (1982). Familiarity and novelty preferences in infant recognition memory: Implications for information processing. Developmental Psychology, 18(5), 704-713.


Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions: Biological Sciences, 336, 367-373.
Saberi, K., & Hafter, E.R. (1995). A common neural code for frequency- and amplitude-modulated sounds. Nature, 374, 537-539.
Saffran, J., Werker, J., & Werner, L.A. (2006). The infant's auditory world: Hearing, speech and the beginnings of language. In D. Kuhn & R.S. Siegler (Eds.), Handbook of Child Psychology, Vol. 2: Cognition, Perception and Language (6th ed., pp. 58-108). New York: Wiley.
Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303-304.
Sheft, S., Ardoint, M., & Lorenzi, C. (2008). Speech identification based on temporal fine structure cues. Journal of the Acoustical Society of America, 124, 562-575.
Sinex, D.G., & Geisler, C.D. (1983). Responses of auditory-nerve fibers to consonant-vowel syllables. Journal of the Acoustical Society of America, 73, 602-615.
Skoruppa, K., Pons, F., Christophe, A., Bosch, L., Dupoux, E., et al. (2009). Language specific stress perception by nine-month-old French and Spanish infants. Developmental Science, 12, 914-919.
Smith, N.A., Trainor, L.J., & Shore, D.I. (2006). The development of temporal resolution: Between-channel gap detection in infants and adults. Journal of Speech, Language, and Hearing Research, 49, 1104-1113.
Smith, Z.M., Delgutte, B., & Oxenham, A.J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87-90.
Spetner, N.B., & Olsho, L.W. (1990). Auditory frequency resolution in human infancy. Child Development, 61, 632-652.
Svirsky, M.A., Robbins, A.M., Kirk, K.I., Pisoni, D.B., & Miyamoto, R.T. (2000). Language development in profoundly deaf children with cochlear implants. Psychological Science, 11, 153-158.
Thiessen, E.D., & Saffran, J.R. (2003). When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39, 706-716.
Trainor, L.J., Samuel, S.S., Desjardins, R.N., & Sonnadara, R.R. (2001). Measuring temporal resolution in infants using mismatch negativity. Neuroreport, 12, 2443-2448.
Trehub, S.E., Schneider, B.A., & Henderson, J.L. (1995). Gap detection in infants, children, and adults. Journal of the Acoustical Society of America, 98, 2532-2541.
Wagner, S., & Sakovits, J. (1986). A process analysis of infant visual and cross-modal recognition memory: Implications for an amodal code. In L. Lipsitt & C. Rovee-Collier (Eds.), Advances in Infancy Research, 4, 196-217.
Werker, J.F., & Tees, R.C. (1999). Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology, 50, 509-535.


Werker, J.F., & Tees, R.C. (1983). Developmental changes across childhood in the perception of non-native speech sounds. Canadian Journal of Psychology, 37, 278-286.
Werner, L.A., Marean, G.C., Halpin, C.F., Spetner, N.B., & Gillenwater, J.M. (1992). Infant auditory temporal acuity: Gap detection. Child Development, 63, 260-272.
Werner, L.A., & Gray, L. (1998). Behavioral studies of hearing development. In E.W. Rubel, A.N. Popper, & R.R. Fay (Eds.), Development of the auditory system (Springer Handbook of Auditory Research, Vol. 5, pp. 12-79). New York: Springer-Verlag.
Wightman, F., Allen, P., Dolan, T., Kistler, D., & Jamieson, D. (1989). Temporal resolution in children. Child Development, 60, 611-624.
Zeng, F.G., Nie, K., Liu, S., Stickney, G., Del Rio, E., Kong, Y.Y., & Chen, H. (2004). On the dichotomy in auditory perception between temporal envelope and fine structure cues. Journal of the Acoustical Society of America, 116, 1351-1354.
Zeng, F.G., Nie, K., Stickney, G.S., Kong, Y.Y., Vongphoe, M., Bhargave, A., Wei, C., & Cao, K. (2005). Speech recognition with amplitude and frequency modulations. Proceedings of the National Academy of Sciences USA, 102, 2293-2298.


Chapter 4. Discrimination of voicing and place of articulation on the basis of AM cues in French 6-month-old infants (visual habituation procedure)



1. Introduction: Infants discriminate voicing and place of articulation with reduced spectral and temporal modulation cues

The first two studies of this PhD work presented preliminary steps in the assessment of infants’ abilities to use speech modulation cues. The experiments described in Chapters 2 and 3 show that French-learning 6-month-old infants are able to discriminate a voicing contrast when FM cues are severely degraded and when AM cues are severely reduced by either decreasing frequency resolution or attenuating fast AM rates. However, the infants’ responses (i.e., their preference for familiar or novel stimuli) varied across vocoder conditions and as a function of the familiarization time (Cabrera, Bertoncini & Lorenzi, 2013, see Chapter 3).

Role of modulation cues in the discrimination of voicing versus place of articulation

This third study aims to investigate further the role of AM and FM cues in phonetic discrimination at 6 months, and the impact of exposure time to the degraded speech sounds. The reception of voicing is known to be robust in normal-hearing adults (for instance, it is barely affected by filtering or masking noise; e.g., Miller & Nicely, 1955). The third study attempts to extend the current investigation to another phonetic feature – place of articulation – that is known to be more susceptible to signal distortions such as those produced by filtering, noise or vocoding (e.g., Miller & Nicely, 1955; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). In a seminal study, Rosen (1992) systematically described the roles of AM (envelope/periodicity) and FM (temporal fine structure) cues in linguistic contrasts. Table 2 summarizes the temporal system proposed by Rosen (1992). It suggests that the perception of voicing is conveyed by both AM and FM features, whereas the perception of place is mostly conveyed by FM (temporal fine structure) cues. Regarding place of articulation, phonetic information is mostly conveyed by the frequency spectrum and by formant transitions contained in the fast amplitude variations. Thus, discrimination of place is expected to be more affected than discrimination of voicing in French-learning infants when target syllables are vocoded to selectively degrade FM cues.

                                 AM (envelope)    FM (fine structure)
                                 < 50-100 Hz      500-5000 Hz

Segmental cues       Manner          +++               +
                     Voicing         +++               ++
                     Place                             +++

Suprasegmental cues  Rhythm          +++
                     Syllabicity     +++
                     Intonation      +                 ++

Table 2. The potential role of AM cues (temporal envelope, corresponding to the amplitude variations over time below about 50-100 Hz) and FM cues (temporal fine structure, corresponding to the faster variations between 500-5000 Hz within each frequency band) for different speech contrasts. The number of “+” represents the extent to which modulation features operate in a linguistic contrast according to Rosen (1992).

To explore and extend the previous chapters, the discrimination of voicing (/aba/-/apa/) and place of articulation (/aba/-/ada/) is tested in French 6-month-old infants. As in Chapter 3 (Cabrera et al., 2013), the speech signals are processed by four vocoders in order to reduce FM cues, fast AM cues, and the spectral resolution of the signals. However, the speech processing (vocoding) differs slightly from that used in Chapter 3 (Cabrera et al., 2013). First, FM cues are replaced by a sine-wave tone (as in Chapter 2, Bertoncini, Nazzi, Cabrera & Lorenzi, 2011) rather than by a narrowband noise, in order to limit the modulation masking effects introduced by the random fluctuations of the noise carrier after filtering by the analysis filters (see Kates, 2011). Second, in another experimental condition, the AM cues are extracted in 8 bands (rather than 4).


Processing difficulty

Figure 3 illustrates the model of Hunter and Ames (1988), which assumes that three factors are involved in the novelty/familiarity preference in infants: age, task difficulty (i.e., the nature of the stimulus and thus the infants' difficulty in processing it), and familiarization time. Hunter and Ames (1988) proposed that, as familiarization time increases, infants' preference shifts from no preference to a familiarity preference, and then from a familiarity to a novelty preference via another period of no preference.

Figure 3. Effect of familiarization time on preference for familiar and novel stimuli in infants. Redrawn from Hunter and Ames (1988).

This model can be used to predict the infants' preferences observed in each vocoded condition during the test phase in Chapter 3 (Cabrera et al., 2013). The predictions are shown in Figure 4. In the study presented in Chapter 3, two familiarization times were given to 6-month-old infants: one versus two minutes. The familiarization time required to reach a novelty preference is expected to increase as a function of the infants' difficulty in processing the vocoded stimuli.
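The qualitative predictions of the Hunter and Ames (1988) model can be sketched as a toy function (not part of the original model; the breakpoints below are hypothetical and chosen only to reproduce the qualitative pattern), in which processing difficulty stretches the time axis so that harder stimuli reach each preference stage later:

```python
def preference(t, difficulty=1.0):
    """Toy sketch of the Hunter & Ames (1988) preference curve.

    t: familiarization time (arbitrary units).
    difficulty: stretches the time axis, so harder-to-process stimuli
    shift every transition to later times. All breakpoints are
    hypothetical illustrations, not fitted values.
    """
    t_eff = t / difficulty  # harder stimuli -> slower progression
    if t_eff < 1.0:
        return "no preference"
    elif t_eff < 3.0:
        return "familiarity preference"
    elif t_eff < 4.0:
        return "no preference"
    else:
        return "novelty preference"

# With the same clock time, a harder stimulus lags behind an easier one:
print(preference(4.5, difficulty=1.0))  # novelty preference
print(preference(4.5, difficulty=2.0))  # familiarity preference
```

This captures the logic used to interpret Figure 4: a condition that is harder to process needs more familiarization before a novelty preference can emerge.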


Figure 4. Responses of each experimental group (Cabrera et al., 2013) as a function of the two familiarization times used (solid lines). Hypothesized patterns¹ of responses derived from the model of Hunter and Ames (1988) are represented as dotted lines. It is assumed that the stimuli used in the “32-band AM+FM” speech condition should be the easiest for 6-month-old infants to process, and those used in the “32-band AM<16Hz” condition the most difficult.

The results show that the short familiarization time (1 min) increased response variability in the “32-band AM” and “32-band AM<16 Hz” conditions, leading to a “no preference” pattern. Moreover, this amount of time may not be sufficient for 6-month-olds to exhibit a novelty preference in the “4-band AM” condition. With a longer familiarization time (2 min), infants show a novelty reaction in all conditions except when the AM cues are severely reduced (“32-band AM<16 Hz”). In this last condition, infants exhibit a familiarity preference even with 2 min of sound familiarization. Thus, familiarization duration plays a major role in infants' discrimination responses for highly degraded speech signals. The results also reveal that the fastest AM cues (>16 Hz) may be required by 6-month-old infants to correctly perceive a voicing contrast.

1 The pattern of results may be different in the “32-band AM” condition: the preference for familiar sequences may occur between 1 and 2 min. In this case, the pattern of results for this condition should be closer to that obtained in the 32-band AM < 16 Hz condition.


In the present study (Chapter 4), the behavioral method used to test infants is modified. The visual habituation method (see Werker et al., 1998) is adapted to introduce an infant-controlled habituation phase. Before switching to the phase testing discrimination ability, infants have to reach a habituation criterion corresponding to a decrement in their looking times during the presentation of a repeated sound sequence. The amount of habituation time required by infants to switch to the test phase will be compared between vocoded conditions to assess processing difficulty (e.g., Hunter & Ames, 1988).

This study is presented in the following article (in preparation): Cabrera, L., Lorenzi, C., & Bertoncini, J. “Infants discriminate voicing and place of articulation with reduced spectral and temporal modulation cues.”


2. Article: Cabrera, Lorenzi & Bertoncini (in preparation)

Infants discriminate voicing and place of articulation with reduced spectral and temporal modulation cues

Laurianne Cabrera
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes
45 rue des Saints-Pères, 75006 Paris, France

Christian Lorenzi
Institut d'Etude de la Cognition, Ecole normale supérieure, Paris Sciences et Lettres
29 rue d'Ulm, 75005 Paris, France

Josiane Bertoncini
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes
45 rue des Saints-Pères, 75006 Paris, France

In preparation Running title: speech modulation cues in infants


ABSTRACT

A visual-habituation procedure was used to assess the capacity of normal-hearing 6-month-old infants to discriminate voiced versus unvoiced (/aba/ - /apa/) and labial versus dental (/aba/ - /ada/) stop consonants. The stimuli were processed by tone-excited vocoders to: (i) degrade frequency-modulation (FM) cues while preserving amplitude-modulation (AM) cues within 32 analysis frequency bands, (ii) degrade FM and fast (>16 Hz) AM cues while preserving slow AM cues within 32 bands, and (iii) degrade FM cues while preserving AM cues within 8 bands. Infants exhibited discrimination responses for both phonetic contrasts in each processing condition. However, when fast AM cues were degraded, infants required longer exposure to the vocoded stimuli to reach the habituation criterion. These results demonstrate that young infants are able to discriminate voicing and place on the sole basis of slow (<16 Hz) AM cues, provided that they have been sufficiently exposed to the reduced speech sounds. The results also suggest that AM cues faster than 16 Hz may play some role in phonetic discrimination for infants.

Key words: Speech perception, Amplitude modulation, Frequency modulation, Infants

PACS numbers: 43.71.Ft, 43.71.Rt, 43.66.Mk


I. INTRODUCTION

A large number of studies have separately investigated auditory and speech perception in infants (for a review, see Kuhl, 2004; Saffran et al., 2006), but information about the basic auditory capacities involved in the typical early development of speech processing is still lacking. The importance of amplitude-modulation cues (AM, the variations in amplitude over time) and frequency-modulation cues (FM, the oscillations in instantaneous frequency close to the center frequency of the band) in speech perception has been demonstrated repeatedly for adults (Smith et al., 2002; Zeng et al., 2005). The present study explored the role of these AM and FM cues in speech perception at an early stage of development. To the best of our knowledge, only a few studies have assessed the ability of infants and children to use modulation cues in discrimination and identification tasks using noise- or tone-excited vocoded stimuli. Among children, Newman and Chatterjee (2013) showed that 2-year-old toddlers accurately recognize words on the sole basis of AM cues extracted within 8 frequency bands. Bertoncini et al. (2009) showed that 5-year-old children discriminate nonsense bisyllables as well as older children and adults on the basis of AM cues extracted within 16 bands. However, Eisenberg et al. (2000) showed that 5- to 7-year-old children require greater spectral resolution than adults to identify speech sounds on the basis of these AM cues. Less information is available regarding infants. Bertoncini et al. (2011) studied the ability of 6-month-olds to discriminate an /apa/-/aba/ voicing contrast on the basis of AM cues extracted within 16 frequency bands. As in Bertoncini et al. (2009), the speech AM patterns (i.e., the acoustic “temporal envelopes”) were lowpass filtered at 64 Hz, therefore attenuating the fast periodic AM cues related to the fundamental frequency (F0).
A head-turn preference procedure was used to assess preference for sequences composed of alternated versus repeated /apa/ and /aba/ stimuli. The results showed that infants were able to detect the alternation of vocoded stimuli, providing evidence that voicing can be discriminated on the sole basis of AM cues below 64 Hz. This initial study suggests robust auditory processing of speech modulation cues as early as 6 months. However, this was only a first step in the investigation of the role of auditory processing of AM and FM cues in speech perception during language acquisition. Several avenues were opened that are explored in the present study. First, the investigation of the role of AM and FM cues in infants' discrimination capacities was extended to other French phonological contrasts. Two phonetic contrasts were used here: place of articulation (/aba/ versus /ada/) and voicing (/aba/ versus /apa/). For adult listeners, the perception of place of articulation was found to be more dependent on spectral and temporal resolution (Başkent, 2006; Shannon et al., 1995) and on FM cues (Rosen, 1992) than the perception of voicing. The present investigation will assess whether or not the perception of place and voicing shows a similar dependence on spectral and temporal resolution in infants. Second, the procedure used by Bertoncini et al. (2011) included a comparison of alternated versus repeated stimulus sequences. This procedure could be viewed as favoring immediate discrimination, and the results suggested relatively transient effects. Here, a different procedure was used to allow the occurrence of more lasting effects related to mechanisms coping with speech. The procedure included an infant-dependent habituation phase that could provide an indication (that is, the habituation time) of processing difficulty. According to the model proposed by Hunter and Ames (1988) to account for novelty preference in infants, habituation times reflect the interaction between several factors such as age and processing difficulty. In the present study, longer times needed to attain the habituation criterion may be indicative of a specific difficulty in processing stimuli with reduced spectro-temporal modulations. When the habituation criterion was reached, discrimination was assessed by measuring the difference in looking times for sequences of familiar versus novel stimuli (Werker et al., 1998).
Finally, the auditory processing of speech modulation cues was further explored in infants by using several tone vocoders designed to evaluate the respective roles of: (i) FM cues, (ii) fast AM cues related to bursts, formant transitions and F0 periodic fluctuations, and (iii) spectral resolution in phonetic perception. This was achieved by: (i) replacing the original FM cues within each analysis band by pure tones with a fixed frequency, (ii) lowering the cutoff frequency of the demodulation lowpass filter used to extract AM from half the bandwidth of the analysis filters to 16 Hz, and (iii) reducing the number of analysis bands from 32 to 8.

II. METHOD

1. Participants

Six-month-old infants were recruited from a birth database. All families were informed about the goals of the current study and provided written consent before their participation, in accordance with current French ethical requirements. Data from 160 infants (20 infants x 2 contrasts x 4 conditions) were analyzed (87 girls; age range: 5 months 27 days - 7 months 17 days; mean = 6 months and 12 days; standard deviation (SD) = 10 days). All infants were born full-term, with no relevant medical history and no family history of speech disorders. All infants were normal-hearing (based on parental report of newborn hearing screening results). The data from 155 additional infants were not included for the following reasons: fussing and crying (n=116), looking time shorter than 1000 ms for one trial (n=12), and failure to reach the habituation criterion (n=27).

2. Stimuli

Eight exemplars of each category /aba/, /apa/ and /ada/ were selected from a set of vowel-consonant-vowel (VCV) nonsense bisyllables uttered by a French female speaker who was asked to speak clearly in adult-directed speech. The F0 was estimated at 242 Hz using the YIN algorithm (de Cheveigné and Kawahara, 2002). The stimuli were recorded in a soundproof room and digitized via a 16-bit analog-to-digital converter at a 44.1-kHz sampling rate. The stimuli did not differ significantly in duration (634 ms (SD=68.8 ms) for /aba/, 632 ms (SD=47.5 ms) for /apa/, and 622 ms (SD=68.4 ms) for /ada/). For each phonetic category, two sound sequences were created for the habituation phase, each including 4 repetitions of 4 different tokens in two different orders. Two other sequences were created for the test phase with 4 repetitions of the other 4 tokens of each category. The inter-stimulus interval varied randomly along the 16-item sequences, between 600 and 1300 ms. This variation was introduced to make small variations in duration between items irrelevant within and between categories.
All the sequences had the same duration (26 s).
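As an illustration, the sequence structure just described (16 items per sequence, with a random 600-1300 ms inter-stimulus interval drawn for each gap) could be assembled as follows. The function and file names are hypothetical, and this sketch does not enforce the equalization of total sequence duration at 26 s used in the actual study:

```python
import random

def build_sequence(tokens, n_repeats=4, isi_range=(0.6, 1.3)):
    """Assemble one 16-item stimulus sequence (4 tokens x 4 repetitions).

    A fresh inter-stimulus interval (in seconds) is drawn for each gap.
    Returns a list of (token, following_isi) pairs.
    """
    items = tokens * n_repeats          # 4 tokens, each repeated 4 times
    random.shuffle(items)               # one randomized presentation order
    return [(tok, random.uniform(*isi_range)) for tok in items]

# Hypothetical file names for four /aba/ tokens:
seq = build_sequence(["aba_1.wav", "aba_2.wav", "aba_3.wav", "aba_4.wav"])
print(len(seq))  # 16
```

Drawing an independent interval for each gap is what makes small item-duration differences uninformative, as the text notes.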


The stimuli were processed by vocoders to alter their spectro-temporal modulations. Tone-excited vocoders were used instead of noise-excited vocoders (Eisenberg et al., 2000; Newman and Chatterjee, 2013) because they were found to distort speech AM cues less (Kates, 2011). Four different vocoder conditions were designed. In the first condition (called “32-band AM+FM speech”), the original speech signal was passed through a bank of 32 2nd-order gammatone filters (Gnansia et al., 2009; Patterson, 1987), each 1 equivalent rectangular bandwidth (ERB) wide, with center frequencies (CFs) uniformly spaced along an ERB scale ranging from 80 to 8,020 Hz. The Hilbert transform was then applied to each bandpass-filtered speech signal to extract the AM component and the FM carrier. The AM component was low-pass filtered using a zero-phase Butterworth filter (36 dB/octave rolloff) with a cutoff frequency set to ERBN/2. The final narrow-band speech signal was obtained by multiplying each sample of the FM carrier by the filtered AM function. The narrow-band speech signals were finally summed, and the level of the wideband speech signal was adjusted to have the same root-mean-square value as the input signal. Thus, the vocoded speech signals retained the original AM and FM speech cues within each of the 32 analysis frequency bands. In the second condition (called “32-band AM speech”), the same signal-processing scheme was used, except that the FM carrier was replaced by a sine-wave carrier with a frequency at the CF of the gammatone filter and a random starting phase in each analysis frequency band. Thus, the resulting vocoded speech signal retained AM speech cues within 32 bands, but discarded the original (within-channel) FM speech cues.
In the third condition (called “32-band AM<16Hz speech”), the same signal-processing scheme was used as in the “32-band AM speech” condition, except that the AM component was low-pass filtered with a cutoff frequency of 16 Hz in each of the 32 bands in order to remove the fast AM cues related to bursts, formant transitions and F0 periodic fluctuations. Thus, the resulting vocoded speech signal retained mainly the slowest (<16 Hz) AM speech cues within 32 bands, and discarded the original FM speech cues. In the last condition (called “8-band AM speech”), the same signal-processing scheme was used as in the “32-band AM speech” condition, except that AM cues were extracted from only 8 broad (4-ERBN wide) frequency bands.


Thus, the original FM speech cues were discarded, and the AM cues were substantially distorted compared to the original AM speech cues. Figure 1 shows the spectrograms of one exemplar of the /aba/ stimuli in each experimental condition.

Figure 1. Spectrograms of /aba/ stimuli in each speech-processing condition. Upper left panel: intact condition (“32-band AM+FM”); upper right panel: “32-band AM”; lower left panel: “32-band AM<16Hz”; lower right panel: “8-band AM” speech conditions.
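The per-band processing chain described above can be sketched as follows. This is a simplified illustration, not the implementation used in the study: 2nd-order Butterworth band-pass filters stand in for the gammatone filterbank, every band is approximated as 1 ERB wide (so the 8-band, 4-ERB-wide variant is not reproduced exactly), and the function names are hypothetical:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def erb_space(f_low, f_high, n):
    """Center frequencies uniformly spaced on the ERB-number scale."""
    e = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    e_inv = lambda x: (10.0 ** (x / 21.4) - 1.0) * 1000.0 / 4.37
    return e_inv(np.linspace(e(f_low), e(f_high), n))

def tone_vocode(x, fs, n_bands=32, am_cutoff=None):
    """Simplified tone-excited vocoder sketch.

    For each band: band-pass filter, take the Hilbert envelope (AM
    component), low-pass it with a zero-phase filter, and re-impose it
    on a fixed-frequency sine carrier at the band CF (random starting
    phase). am_cutoff=None uses half the band's ERB as the demodulation
    cutoff; am_cutoff=16 keeps only the slow (<16 Hz) AM cues.
    """
    cfs = erb_space(80.0, 8020.0, n_bands)
    out = np.zeros_like(x, dtype=float)
    t = np.arange(len(x)) / fs
    for cf in cfs:
        erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)
        lo = max(cf - erb / 2.0, 1.0)
        hi = min(cf + erb / 2.0, fs / 2.0 - 1.0)
        b, a = butter(2, [lo, hi], btype="band", fs=fs)
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))                   # AM component
        fc = am_cutoff if am_cutoff is not None else erb / 2.0
        bl, al = butter(6, fc, btype="low", fs=fs)    # AM demodulation filter
        env = filtfilt(bl, al, env)                   # zero-phase low-pass
        phase = np.random.uniform(0.0, 2.0 * np.pi)   # random starting phase
        out += env * np.sin(2.0 * np.pi * cf * t + phase)
    # Match the root-mean-square level of the input signal:
    return out * np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))
```

For example, `tone_vocode(x, 44100, n_bands=32)` approximates the “32-band AM speech” condition and `tone_vocode(x, 44100, n_bands=32, am_cutoff=16)` the “32-band AM<16Hz speech” condition.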

3. Procedure

A “visual habituation” method was used (Mattock et al., 2008; Werker et al., 1998) in which sound sequences were presented as a function of the infant's looking orientation toward a picture (a black-and-white checkerboard) displayed on a screen. The infants were seated on the caregiver's lap in front of the screen in a sound-treated room. The caregiver was instructed not to interfere with the infant's behavior (i.e., not to direct the infant's attention to the screen at any time) and wore earplugs and headphones delivering masking music. Two loudspeakers located on each side of the infant's monitor played the auditory stimuli at a level of approximately 70 dB SPL. The infant's looking time was monitored online via a video camera linked to a monitor in another room. The observer, blind to the audio file presented, recorded the duration of the infant's looking time by a key press and controlled stimulus presentation using Habit X.10 (Cohen et al., 2000). The experiment began with a habituation phase, during which infants heard several sequences of the same sound category. The habituation phase ended when the mean looking time on three consecutive trials decreased by 50% compared to the mean of the longest looking times registered on three preceding trials. The test phase ensued directly, during which infants heard 4 novel (N) and 4 familiar (F) sequences presented in alternation, with the order counterbalanced across subjects (N-F-N-F-N-F-N-F or F-N-F-N-F-N-F-N). In all trials, the auditory and visual presentations continued until the infant looked away for 2 s (timed automatically by the computer once the experimenter released the key press as the infant looked away) or until the end of the sound file (maximum 26 s). At the end of the trial, the checkerboard disappeared and a more attractive display (flashing balls) appeared to draw the infant's attention back to the TV monitor. No auditory stimulus was presented during this interval between trials. Once the infant looked at the screen, the experimenter initiated the next trial. Four independent groups (n=20) were tested on the voicing contrast (one group per vocoded condition): half of the subjects were habituated with /aba/ stimuli, and the other half with /apa/. Four independent groups (n=20) were tested on the place-of-articulation contrast (one group per vocoded condition): half of the subjects were habituated with /aba/ stimuli, and the other half with /ada/.
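The infant-controlled criterion described above can be sketched as a small helper function (hypothetical name; the phrase “three preceding trials” is read here as the three longest looks among all preceding trials, one plausible interpretation in line with common Habit-style criteria):

```python
def habituated(looking_times, window=3, criterion=0.5):
    """Check the infant-controlled habituation criterion.

    Returns True when the mean looking time over the last `window`
    trials has fallen below `criterion` times the mean of the `window`
    longest looking times among all preceding trials. This is a sketch
    of one plausible reading of the criterion, not the exact Habit X.10
    implementation.
    """
    if len(looking_times) < 2 * window:
        return False  # not enough trials to evaluate the criterion
    recent = looking_times[-window:]
    baseline = sorted(looking_times[:-window], reverse=True)[:window]
    return sum(recent) / window < criterion * (sum(baseline) / window)

# Looks (s) drop from ~11 s to ~5 s: the criterion is met.
print(habituated([12.0, 10.0, 11.0, 6.0, 4.0, 4.5]))  # True
```

After each trial, the running list of looking times would be checked; once the function returns True, the test phase begins.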

III. RESULTS

The cumulated looking times to reach the habituation criterion and the mean looking times in the test phase were recorded and analyzed in each condition. The discrimination reactions were assessed by comparing the looking times for novel and familiar sequences in the test phase. Figure 2 shows the mean looking time in the 8 groups of infants (2 contrasts x 4 conditions) for both the novel and familiar sequences. In all groups, infants showed longer looking times for the novel sequences during the test phase. An omnibus analysis of variance (ANOVA) was run with 4 Vocoder conditions and 2 Phonetic contrasts as between-subject factors, and 2 Types of sequences (familiar versus novel) as within-subject factor. This analysis revealed a main effect of the Type of sequences (mean novel = 7.5 s, SD = 0.81 s versus mean familiar = 6.1 s, SD = 0.91 s; F(1,152) = 40.11, p < .001). There was no significant effect of Vocoder condition (F(3,152) = 2.01, p = .12) or Phonetic contrast (F(1,152) = 1.67, p = .20), and no significant interaction between factors (F(3,152) = 1.09, p = .35). Pairwise comparisons with one-tailed Student's t tests indicated that novel sequences elicited significantly longer looking times in each vocoder condition and for each phonetic contrast (α = .05). Thus, 6-month-olds discriminated the voicing and place contrasts in each vocoded condition.

Figure 2. Mean looking times for familiar and novel stimuli during the test phase, for the voicing and place contrasts in each speech-processing condition: 32-band AM+FM speech, 32-band AM speech, 32-band AM<16Hz speech, and 8-band AM speech (error bars represent standard errors).
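The within-subject comparison reported above (novel versus familiar looking times) amounts to a paired t statistic, which can be computed directly; the numbers below are hypothetical looking times for illustration, not data from the study:

```python
import math

def paired_t(novel, familiar):
    """Paired (within-subject) t statistic for novel vs. familiar
    looking times; positive values mean longer looks to novelty."""
    d = [n - f for n, f in zip(novel, familiar)]
    m = sum(d) / len(d)                                   # mean difference
    var = sum((x - m) ** 2 for x in d) / (len(d) - 1)     # unbiased variance
    return m / math.sqrt(var / len(d))                    # t = mean / SEM

# Hypothetical looking times (s) for five infants:
novel    = [8.1, 7.4, 9.0, 6.8, 7.7]
familiar = [6.0, 6.5, 7.2, 6.1, 6.4]
print(round(paired_t(novel, familiar), 2))
```

With n infants per group, the resulting statistic would be referred to a t distribution with n-1 degrees of freedom (one-tailed, since a novelty preference predicts the direction of the difference).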

Figure 3 shows the mean cumulated habituation times among the 40 infants in each vocoded condition. The habituation time was longer in the “32-band AM<16Hz speech” condition (mean = 133.9 s; SD = 57.4 s) than in the “8-band AM speech” (mean = 108.3 s; SD = 47 s) and “32-band AM speech” (mean = 101.6 s; SD = 41.8 s) conditions. The minimum habituation time was found in the “32-band AM+FM speech” condition (mean = 96.1 s; SD = 44 s). A factorial ANOVA was conducted on mean habituation times with 4 Vocoder conditions and 3 Stimuli (/aba/, /apa/ or /ada/) as between-subject factors. This analysis revealed a main effect of Condition (F(3,148) = 4.2, p = .007). Post-hoc Scheffé tests indicated that habituation times were significantly longer in the “32-band AM<16Hz speech” condition than in the “32-band AM+FM speech” and “32-band AM speech” conditions. The remaining comparisons were not statistically significant. The analysis also showed no significant effect of Stimuli (F(2,148) = 2.46, p = .09) and no significant interaction between Condition and Stimuli (F(6,148) = 1.54, p = .17).

Figure 3. Mean cumulated habituation times in each vocoder condition during the habituation phase (error bars represent standard errors).

IV. DISCUSSION

The present study aimed to assess the perception of speech modulation cues by 6-month-old normal-hearing infants. Two French phonetic contrasts (/aba/ versus /apa/, and /aba/ versus /ada/) were used, and several speech-processing conditions were designed to investigate whether infants can discriminate these contrasts when AM and FM cues are severely degraded. The current study replicated previous results obtained by Bertoncini et al. (2011). However, some changes were introduced in the present study to explore further infants' perception: (i) different tone-excited vocoders were used to selectively degrade AM and FM speech cues, and (ii) discrimination of two phonetic contrasts (voicing and place of articulation) was examined in 6-month-old normal-hearing infants. In addition, an infant-controlled habituation procedure was used to obtain additional information about how infants process the reduced speech stimuli. Note that for all the conditions except the intact one (“32-band AM+FM speech”), the speech stimuli were completely unfamiliar to the infants. It was thus expected that the time needed by the infants to become familiarized with these stimuli would be influenced by the difficulty of extracting the residual modulation information.

Discrimination data. The results showed that 6-month-old infants discriminated the reduced speech signals in all processing conditions. This was reflected in a clear-cut novelty preference, with looking times significantly longer during the presentation of novel stimuli than during the presentation of familiar stimuli. The infants required neither FM cues, fast (>16 Hz) AM cues, nor fine spectral cues to perceive variations in voicing and place of articulation in quiet. Altogether, these results indicate that, as early as 6 months, the slowest AM cues extracted from a limited number of broad frequency bands are sufficient to discriminate phonetic contrasts. This pattern of robust discrimination is consistent with that reported for voicing reception in adults (e.g., Shannon et al., 1995). It is, however, different from that reported for the reception of place of articulation in adults, where spectral and temporal resolution (Başkent, 2006; Shannon et al., 1995) and FM cues (Rosen, 1992) were found to strongly influence identification responses.
It is unfortunately impossible to conclude whether this apparent discrepancy reflects genuine differences in sensory/linguistic processing across age groups or differences in methodology across studies (including the fact that here the original syllables were produced in French).

Habituation times. Still, differences appeared in the habituation time required to switch to the test phase, indicating that infants are sensitive to the reduction of the spectral and temporal modulation cues ordinarily present in speech signals. These differences showed that the attenuation of fast AM speech cues had a detrimental effect on speech processing (compared to conditions where fast AM cues were preserved). Previous studies (Hunter and Ames, 1988; Holt, 2011) suggested that differences in habituation time are related to: (i) the age of the infants, (ii) the nature (i.e., familiarity, complexity) of the stimuli, and (iii) the build-up of a detailed representation of the signal. The longer habituation required for temporally smeared speech signals may thus reveal: (i) the importance of fast AM cues (corresponding to bursts, formant transitions and periodic F0-related fluctuations) in the perception of phonetic contrasts, and (ii) the importance of training when speech cues are severely degraded along the temporal dimension. It is interesting to note that no difference was observed between the habituation times required when the fast AM cues were reduced and those required when the fast AM cues were preserved in a small number of frequency bands. Thus, the longer habituation required for temporally smeared speech signals may also suggest that spectral resolution plays a role in the robust perception of phonetic contrasts in infants.

V. CONCLUSION

The current study investigated the role of speech modulation cues in phonetic discrimination for young normal-hearing infants. The present results showed that the discrimination of voicing and place of articulation (that is, between the French plosives /b/, /p/ and /d/) is possible in the absence of FM and fast (>16 Hz) AM cues when the spectral resolution of the speech signals is preserved, and also when spectral information is severely reduced to 8 broad frequency bands. These results demonstrate that the slowest AM cues are sufficient for phonetic discrimination in infants. However, when the fast AM cues were attenuated, infants required a longer time to become fully habituated to the degraded stimuli, suggesting that fast AM cues contribute to phonetic discrimination in infants. The results also showed that infants discriminated both phonetic contrasts in the speech-processing condition mimicking cochlear-implant processing (“8-band AM speech”; Friesen et al., 2001). This suggests that cochlear-implant processing delivers sufficient information for phonetic discrimination in quiet, at least to the normal-hearing infant's auditory system.


ACKNOWLEDGMENTS

C. Lorenzi was supported by a grant (HEARFIN project) from ANR. This work was also supported by the ANR “Investissement d’avenir” program (ANR-11-IDEX-0001-02-PSL; Labex “Institut d’Etude de la Cognition”). The authors would like to warmly thank the parents for their participation in this study.


REFERENCES

Başkent, D. (2006). “Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels,” J. Acoust. Soc. Am., 120, 2908–2925.
Bertoncini, J., Nazzi, T., Cabrera, L., and Lorenzi, C. (2011). “Six-month-old infants discriminate voicing on the basis of temporal envelope cues,” J. Acoust. Soc. Am., 129, 2761–2764.
Bertoncini, J., Serniclaes, W., and Lorenzi, C. (2009). “Discrimination of speech sounds based upon temporal envelope versus fine structure cues in 5- to 7-year-old children,” J. Speech Lang. Hear. Res., 52, 682–695.
de Cheveigné, A., and Kawahara, H. (2002). “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111, 1917–1930.
Cohen, L. B., Atkinson, D. J., and Chaput, H. H. (2000). Habit 2000: A New Program for Testing Infant Perception and Cognition, The University of Texas, Austin.
Eisenberg, L. S., Shannon, R. V., Martinez, A. S., Wygonski, J., and Boothroyd, A. (2000). “Speech recognition with reduced spectral cues as a function of age,” J. Acoust. Soc. Am., 107, 2704–2710.
Friesen, L. M., Shannon, R. V., Baskent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am., 110, 1150–1163.
Gnansia, D., Péan, V., Meyer, B., and Lorenzi, C. (2009). “Effects of spectral smearing and temporal fine structure degradation on speech masking release,” J. Acoust. Soc. Am., 125, 4023–4033.
Holt, R. F. (2011). “Enhancing speech discrimination through stimulus repetition,” J. Speech Lang. Hear. Res., 54, 1431–1447.
Hunter, M. A., and Ames, E. W. (1988). “A multifactor model of infant preferences for novel and familiar stimuli,” Advances in Infancy Research, 5, 69–95.
Kates, J. M. (2011). “Spectro-temporal envelope changes caused by temporal fine structure modification,” J. Acoust. Soc. Am., 129, 3981–3990.
Kuhl, P. K. (2004). “Early language acquisition: cracking the speech code,” Nat. Rev. Neurosci., 5, 831–843.
Mattock, K., Molnar, M., Polka, L., and Burnham, D. (2008). “The developmental course of lexical tone perception in the first year of life,” Cognition, 106, 1367–1381.
Newman, R., and Chatterjee, M. (2013). “Toddlers’ recognition of noise-vocoded speech,” J. Acoust. Soc. Am., 133, 483–494.
Patterson, R. D. (1987). “A pulse ribbon model of monaural phase perception,” J. Acoust. Soc. Am., 82, 1560–1586.
Rosen, S. (1992). “Temporal information in speech: acoustic, auditory and linguistic aspects,” Philos. Trans. R. Soc. Lond., B, Biol. Sci., 336, 367–373.
Saffran, J. R., Werker, J. F., and Werner, L. A. (2006). “The infant’s auditory world: Hearing, speech, and the beginnings of language,” in D. Kuhn and R. Siegler (Eds.), Handbook of Child Psychology (6th ed., Vol. 2, pp. 58–108). New York, NY: Wiley.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science, 270, 303–304.
Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature, 416, 87–90.
Werker, J. F., Shi, R., Desjardins, R., Pegg, J. E., Polka, L., and Patterson, M. (1998). “Three methods for testing infant speech perception,” in A. Slater (Ed.), Perceptual Development: Visual, Auditory, and Speech Perception in Infancy (pp. 389–420). East Sussex, United Kingdom: Psychology Press.
Zeng, F.-G., Nie, K., Stickney, G. S., Kong, Y.-Y., Vongphoe, M., Bhargave, A., Wei, C., et al. (2005). “Speech recognition with amplitude and frequency modulations,” Proc. Natl. Acad. Sci. U.S.A., 102, 2293–2298.


Chapter 5. Discrimination of lexical tones on the basis of AM cues in 6- and 10-month-old infants: influence of lexical-tone expertise (visual habituation procedure)


1. Introduction: Linguistic experience shapes the perception of spectro-temporal fine structure cues (Adult data)

The experiments presented in the previous chapters have shown that French-learning 6-month-old infants are able to discriminate speech signals varying in voicing and place when modulation cues are reduced temporally and spectrally. These data are consistent with those found with adults and children (Bertoncini, Serniclaes, & Lorenzi, 2009; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Discrimination was found for the two phonetic contrasts of voicing and place of articulation in each processing condition, demonstrating robust processing of speech modulation cues by the age of 6 months. However, the results also showed that infants require the fast (>16 Hz) AM cues in each band to “easily” discriminate syllables. Altogether, these results suggest that the processing of speech modulation cues is “adult-like” in infants.

The present research program also aims to explore the perception of speech modulation cues during the acquisition of different languages. This fifth chapter attempts to assess whether the perceptual reorganization for speech observed during the first year of life (e.g., see Chapter 1, Section 2.2) is associated with a “reorganization” in the auditory processing of speech modulation cues. The previous experiments tested 6-month-old infants because they are not yet entirely tuned to their native language. The purpose of the present set of studies is to test young adults and older infants who have more experience with their native language. Two experiments are presented here: the first was conducted with young normal-hearing adults, and the second with normal-hearing infants of 6 and 10 months of age belonging to two different linguistic environments. As described in Chapter 1 (IV), the role of the modulation cues might differ according to the phonological characteristics of languages, such as
French (i.e., a syllable-timed language) and Mandarin (a tonal language). In tonal languages, a pitch change at the syllable level modifies word meaning (e.g., Liang, 1963). Mandarin-speaking adults have been shown to depend heavily on FM and spectral cues conveying mainly the fundamental frequency (F0), and thus voice-pitch information, to identify words of their native language (i.e., Xu and Pfingst, 2008). However, to the best of our knowledge, no direct comparison of French and Mandarin speakers’ ability to process vocoded lexical tones has been made. Such a comparison should indicate to what extent linguistic experience shapes the perception of speech modulation cues. The first study in this chapter evaluates adults’ performance according to their native language (French versus Mandarin) when they have to discriminate vocoded lexical tones. Three vocoder conditions were designed to evaluate the influence of linguistic experience on the perception of the speech modulation cues conveying F0 variations (and thus, voice-pitch trajectories). A first condition compares the ability of French and Mandarin speakers to discriminate “intact” lexical tones (i.e., with intact spectro-temporal cues). A second condition compares their ability to discriminate lexical tones on the sole basis of AM cues in 8 broad frequency bands (that is, in the absence of spectro-temporal fine structure cues). Finally, a third condition uses non-linguistic sounds in order to test their ability to discriminate pitch trajectories only. Differences were observed across vocoded conditions and languages, suggesting that language experience constrains the weight of speech modulation cues in adults. This study was conducted in collaboration with Prof. Feng Ming Tsao at National Taiwan University in Taipei, where Mandarin-speaking adults and infants were tested with the same setup as the one used in Paris.

The first article presented in this chapter was submitted to the Journal of the Acoustical Society of America in September 2013: Cabrera, L., Tsao, F. M., Gnansia, D., Bertoncini, J., and Lorenzi, C., “Linguistic experience shapes the perception of spectro-temporal fine structure cues.”


2. Article: Cabrera, Tsao, Gnansia, Bertoncini, and Lorenzi (submitted)

Linguistic experience shapes the perception of spectro-temporal fine structure cues

Laurianne Cabrera
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes, 45 rue des saints Pères, 75006 Paris, France

Feng Ming Tsao
Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan

Dan Gnansia
Neurelec, Vallauris, France

Josiane Bertoncini
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes, 45 rue des saints Pères, 75006 Paris, France

Christian Lorenzi
Institut d’Etude de la Cognition, Ecole normale supérieure, Paris Sciences et Lettres, 29 rue d’Ulm, 75005 Paris, France

Article submitted to the Journal of the Acoustical Society of America in September 2013
Abbreviated title: Linguistic experience and fine structure cues
PACS numbers: 43.71.Hw, 43.71.Rt, 43.66.Mk


ABSTRACT

The discrimination of lexical tones was assessed for adults speaking either Mandarin or French, using Thai tones processed by an 8-band vocoder to degrade fine spectral details and frequency-modulation cues. Discrimination was also assessed using click trains whose fundamental frequency (F0) followed the same F0 contours as the lexical tones. Mandarin speakers were more impaired by vocoding than French speakers, but showed better discrimination of the same F0 contours in a non-speech (i.e., click-train) context. These results suggest that language experience shapes the weight of the fine spectral and temporal cues conveying F0 information in speech and non-speech perception.

Key words: Amplitude modulation, Frequency modulation, Lexical tones


I. INTRODUCTION

Tonal variations at the syllable level distinguish word meaning in tonal languages (e.g., Liang, 1963). Native listeners rely mainly on fundamental-frequency (F0) cues – and thus, voice-pitch cues – to discriminate lexical tones. However, other acoustic cues such as duration, amplitude or voice quality may also play a role (e.g., Whalen and Xu, 1992; but see Kuo et al., 2008). Over the last decades, psycholinguistic studies have investigated whether expertise in a tonal language influences the relative weight of these acoustic cues in lexical-tone perception (Gandour and Harshman, 1978). Burnham and Francis (1997) showed that non-native (English-speaking) listeners are less accurate in discriminating lexical tones than native (Thai-speaking) listeners (see also Burnham and Mattock, 2007). They also showed that non-native listeners rely more on the mean F0 to perceive tones, compared to native listeners, who are able to categorize F0 patterns in spite of phonetic and tonal variability. Lee et al. (2008, 2010) explored further the influence of language expertise on the identification of lexical tones by using degraded speech sounds (i.e., “fragmented” tones obtained by removing a variable number of pitch periods at the onset, center and final part of the syllables). Tone-identification performance was found to depend on the nature of the residual (available) acoustic information for the non-native listeners (English speakers learning Mandarin) only. The results confirmed that non-native listeners rely heavily on F0 height whereas native listeners (Mandarin speakers) rely on F0 direction (see also Huang and Johnson, 2010). Altogether, these studies are consistent with the notion that expertise in a tonal language influences the relative weight of the acoustic cues involved in lexical-tone perception.
From a wider perspective, they are consistent with the now widely shared idea that linguistic experience shapes the role of speech cues in speech perception (e.g., Burnham and Mattock, 2007). The search for the acoustic cues used to discriminate or recognize lexical tones has recently been renewed by the use of “vocoders” (Dudley, 1939) to manipulate the spectral and temporal modulation components of speech signals (see Shamma and Lorenzi, 2013, for a review). These studies showed that
Chinese-speaking listeners rely more on frequency-modulation (FM5) cues compared to English- or French-speaking listeners, who rely on amplitude-modulation (AM6) cues to identify native speech sounds (e.g., Shannon et al., 1995; Smith et al., 2002; Xu and Pfingst, 2008; Wang et al., 2011). Fu et al. (1998) showed that for native Mandarin speakers, lexical-tone recognition was more affected by a reduction of temporal resolution (that is, by the selective attenuation of the fast, F0-related AM cues above 50 Hz) than by a reduction of spectral resolution (tones were vocoded using 1, 2, 3 or 4 broad frequency bands). In contrast, consonant and vowel recognition were found to be mostly affected by a reduction of spectral resolution. More recently, Kong and Zeng (2006) confirmed that for native Mandarin speakers, lexical-tone recognition with primarily AM cues was affected by a reduction of temporal resolution, but showed that it was also affected by a reduction of spectral resolution (achieved by extracting AM cues within 1, 2, 4, 8, 16 or 32 bands). Altogether, these studies indicate that the slowest AM cues play a major role in consonant recognition, whereas fast AM cues, FM cues, and fine spectral details are more important in lexical-tone recognition. These results suggest that language experience shapes the weight of the spectro-temporal modulation cues conveying F0 information in speech perception. This conclusion should however be taken with great caution because the vocoder-based studies cited above were conducted separately and thus did not directly compare lexical-tone recognition across listeners from different linguistic backgrounds using the same material and methodology. The goal of the present study was to assess the effect of two different language experiences on the ability to use fine spectral details and FM cues in lexical-tone discrimination.
As in Burnham and Francis (1997), three Thai lexical-tone contrasts (low versus rising F0 patterns, low versus falling F0 patterns, and rising versus falling F0 patterns) were used. Rising and low tones have highly similar F0 trajectories until the mid-point of the tone, after which the F0 value increases in the rising tone and slowly decreases in the low tone. This acoustic similarity between rising and low tones renders them difficult to discriminate for non-native listeners. This is not the case for

5 The FM cues correspond to the oscillations in instantaneous frequency close to the center frequency of the band.
6 The AM cues correspond to the variations in amplitude over time.


the F0 trajectories of the rising and falling tones, which are totally different. Moreover, the rising and falling tones differ on other cues such as duration, making them easier to distinguish (see Abramson, 1978). Twelve groups of adult listeners (six groups of non-native French-speaking adults and six groups of native7 Mandarin-speaking adults) had to discriminate these three lexical-tone contrasts in a same/different task. The stimuli were processed in two ways. In the first condition (the “Intact” speech condition), the AM and FM cues were preserved in 32 narrow frequency bands. In the second condition (the “Vocoded” speech condition), the fine spectral details and FM cues were severely degraded using an 8-band tone-excited vocoder. The listeners’ discrimination performance was also assessed in a third, “Non-speech” condition in which the stimuli were broadband click trains whose F0 followed the same F0 contours as the original lexical tones. Two different interstimulus intervals (ISIs) were used (500 and 1500 ms) in order to assess whether linguistic experience affects information loss in the short-term memory representation of the cues used to discriminate lexical tones and click trains. The two ISIs were also used to assess whether listeners engage an auditory or a phonetic mode to discriminate lexical tones depending upon their linguistic experience (see Burnham and Francis, 1997; Clément et al., 1999; Durlach and Braida, 1969; Werker and Tees, 1984). In the “Intact” condition, French-speaking adults were expected to be less accurate in lexical-tone discrimination than Mandarin-speaking adults (particularly with the long ISI). In the “Vocoded” condition, the native listeners (i.e., Mandarin speakers) were expected to be more affected by the degradation of fine spectral and FM cues than the non-native listeners (French speakers) if native listeners rely more on F0 direction than non-native listeners to process lexical tones.
Finally, in the “Non-speech” condition, the native listeners were expected to be more accurate in the discrimination of F0 contours than non-native listeners if linguistic experience affects the weight of the fine spectral and temporal cues conveying F0 information in both speech and non-speech perception.

7 “Native” (in italics) will be used here for listeners whose native language is a tonal language, and “non-native” for listeners whose native language is not a tonal language.


II. METHOD

A. Participants

One hundred and twenty young adults with normal hearing were tested (mean age = 24 years; standard deviation (SD) = 2.6 years; 54 women). They were split into 12 groups of 10 subjects. Sixty participants were tested in Paris; they were native French speakers and had not learned any tonal language. The other sixty were tested in Taipei, Taiwan, and were native Mandarin speakers.

B. Stimuli

All stimuli were recorded digitally via a 16-bit A/D converter at a 44.1-kHz sampling frequency and equalized in root-mean-square (rms) power. Three Thai tones (rising, falling and low) were pronounced by a native female speaker (F0 = 100–350 Hz) with the syllable /ba/. In each category, eight different tokens were chosen for their high clarity. The mean duration of the stimuli was 661.6 ms (SD = 32.3 ms) for the rising tones, 509.9 ms (SD = 36.8 ms) for the falling tones, and 636 ms (SD = 31.2 ms) for the low tones. The syllables were processed in two ways. In the first condition (the “Intact” speech condition), the original speech signal was passed through a bank of 32 2nd-order gammatone filters (Gnansia et al., 2009; Patterson, 1987) ranging from 80 to 8,020 Hz. The width of each gammatone filter was set to 1 ERBN (the average equivalent rectangular bandwidth of the auditory filter as determined for young normal-hearing listeners tested at moderate sound levels; Moore, 2007). The Hilbert transform was then applied to each bandpass-filtered speech signal to extract the AM and FM components. The AM component was low-pass filtered using a zero-phase Butterworth filter (36 dB/octave rolloff) with a cutoff frequency set to ERBN/2. In each band, the FM carrier was multiplied by the filtered AM function. Finally, the narrowband speech signals were added up and the level of the resulting speech signal was adjusted to have the same rms value as the input signal. In the second condition (the “Vocoded” speech condition), the same signal-processing scheme was used, except that AM cues were extracted from 8
broad (4-ERBN wide) gammatone filters. It is important to note that the AM component was low-pass filtered using a zero-phase Butterworth filter (36 dB/octave rolloff) with a cutoff frequency set to ERBN/2 (with the ERBN calculated at the central frequency of the 4-ERBN wide analysis filter). The original FM carriers were replaced by sine-wave carriers with frequencies at the center frequency of the gammatone filters, and with random starting phase in each analysis band. The vocoded speech signal thus contained only the original AM cues extracted within 8 broad frequency bands. Vocoding therefore resulted in a severe reduction of the F0-related voice-pitch cues. The subjects were also tested in another experimental condition, called the “Non-speech” condition. The stimuli used in this condition were generated as follows. The F0 trajectory of each original lexical tone was first extracted using the YIN algorithm (de Cheveigné and Kawahara, 2002). This F0 variation was then applied to a periodic click train over time (more precisely, the signal was a train of 88-microsecond square pulses repeated at a rate equal to 1/F0). The click trains were limited to the frequency range from 80 to 22,050 Hz and were equated in rms power.
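As an illustration, the two stimulus manipulations described above can be sketched in Python. This is a minimal sketch, not the exact processing chain used in the study: it substitutes Butterworth bandpass filters for the 4-ERBN-wide gammatone analysis filters, uses the standard Glasberg and Moore ERB formula for the envelope cutoff, and derives the click train directly from a given F0 contour rather than from the YIN output; the function names (`erb`, `tone_vocode`, `click_train`) are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def erb(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    at centre frequency f_hz (Glasberg & Moore formula)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def tone_vocode(x, fs, band_edges, seed=0):
    """Tone-excited vocoder sketch: keep only the AM (Hilbert envelope)
    of each band, low-pass it at ERB/2 of the band centre, and use it to
    modulate a fixed sine carrier with random starting phase."""
    rng = np.random.default_rng(seed)
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)                  # analysis band
        env = np.abs(hilbert(band))                 # AM component
        fc = np.sqrt(lo * hi)                       # band centre (geometric mean)
        sos_lp = butter(3, erb(fc) / 2.0, btype="lowpass", fs=fs, output="sos")
        env = sosfiltfilt(sos_lp, env)              # zero-phase envelope smoothing
        out += env * np.sin(2 * np.pi * fc * t + rng.uniform(0, 2 * np.pi))
    # restore the rms level of the input, as in the original processing
    out *= np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))
    return out

def click_train(f0_contour, fs, pulse_dur=88e-6):
    """Non-speech stimulus sketch: square pulses repeated at 1/F0(t),
    where f0_contour gives the instantaneous F0 (Hz) at every sample."""
    phase = np.cumsum(f0_contour) / fs              # elapsed cycles
    out = np.zeros(len(f0_contour))
    n_pulse = max(1, int(round(pulse_dur * fs)))
    onsets = np.searchsorted(phase, np.arange(1, int(phase[-1]) + 1))
    for i in onsets:                                # one pulse per F0 cycle
        out[i:i + n_pulse] = 1.0
    return out
```

For example, `tone_vocode(x, 44100, np.geomspace(80, 8020, 9))` keeps only the 8-band envelope of `x` while discarding its spectro-temporal fine structure, and `click_train` applied to a rising F0 trajectory yields a pulse train whose repetition rate follows that pitch contour.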

C. Procedure

A same/different discrimination task was adapted from Burnham and Francis (1997), with two different ISIs: 500 and 1500 ms. Eight trials were first presented in each condition with unrelated sounds (the unprocessed syllables /co/ and /mi/) to familiarize subjects with the task. This was followed by a test phase composed of 48 trials. Half of the trials consisted of the presentation of two stimuli of the same category, and the other half of the presentation of two stimuli belonging to two different categories. “Same” and “different” trials were presented in random order within two blocks of 24 trials. Each subject was randomly assigned to a given experimental condition and a given ISI duration. Thus, six independent groups of ten adults from each language background were tested in a soundproof booth in Paris or in Taipei. All stimuli were presented in free field using a Fostex (model PM0.5) loudspeaker at a sound pressure level of 70 dB. Subjects sat in front of a computer controlling the experiment, 50 cm from the loudspeaker, which was located on their right side (i.e., at 40 deg azimuth and 0
deg elevation). Subjects were instructed to listen carefully to the pairs of sounds. For each trial, they had to press one key when they judged that the two sounds were the same, and another key when they judged that the two sounds were different. They were asked to respond as quickly and as accurately as possible. Each subject’s accuracy was estimated by a d’ score (Macmillan and Creelman, 1991).
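As a rough illustration of this accuracy measure, the sketch below computes d′ as the z-transformed hit rate minus the z-transformed false-alarm rate, with a log-linear correction for extreme proportions. Note that this is the simple yes/no form; Macmillan and Creelman (1991) describe decision models specific to same/different designs, and the function name and correction are illustrative choices, not the exact analysis used in the study.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), using a log-linear
    correction (add 0.5 to each count) so that observed rates of
    0 or 1 do not yield infinite z-scores."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)
```

For instance, a subject who responds correctly on 20 of 24 “different” trials and 20 of 24 “same” trials obtains `d_prime(20, 4, 4, 20)` of about 1.83, while chance-level responding yields a d′ near 0.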

III. RESULTS

The d’ scores of the non-native and native participants for each tone contrast are represented in Figure 1 for the “Intact”, “Vocoded” and “Non-speech” conditions.


Figure 1. d’ scores of the native (Mandarin-speaking) and non-native (French-speaking) adult listeners in the three experimental conditions (Intact, Vocoded and Non-speech) for each lexical-tone pair (RL: Rising-Low; RF: Rising-Falling; FL: Falling-Low). The bars represent the standard errors.

Figure 1 (upper panel) shows that both French and Mandarin speakers reached near-perfect discrimination for each contrast, that is, for each pair of lexical tones. Figure 1 (middle panel) indicates that both French and Mandarin speakers showed poorer discrimination performance for each contrast when lexical tones were processed by the tone-excited vocoder. However, performance remained well above chance for each contrast and for each group of subjects. French speakers showed better discrimination scores than Mandarin speakers for two contrasts (rising versus falling tones; falling versus low tones). Finally, Figure 1 (lower panel) shows that for each contrast, both French and Mandarin speakers reached near-perfect discrimination of click trains whose F0 followed the F0 contours of the original lexical tones. Mandarin speakers showed slightly better discrimination scores than French speakers for each original contrast.

To assess the role of language and ISI in the three experimental conditions, an omnibus analysis of variance (ANOVA) was run on the d’ scores with 2 Languages, 2 ISIs and 3 Conditions as between-subject factors, and 3 Contrasts as a within-subject factor. This analysis revealed a main effect of Condition [F(2,109) = 158.33, p < .001], and a post-hoc Tukey test showed that the “Vocoded” speech condition led to lower discrimination scores than the “Intact” and “Non-speech” conditions. A main effect of Contrast was found [F(2,216) = 7.2, p < .001], and post-hoc comparisons (Scheffé test) showed that: (i) as expected, the “rising-falling” contrast was the easiest to discriminate, and (ii) no difference was observed between the other two contrasts. Moreover, a significant Condition x Contrast interaction was observed [F(4,216) = 3.3, p = .013], indicating that the “rising-falling” contrast was the easiest to discriminate in the “Vocoded” speech condition. Furthermore, the significant Condition x Contrast x Language interaction [F(4,216) = 3.18, p = .015] revealed that the higher scores obtained for the “rising-falling” contrast were mainly due to native listeners. The significant Contrast x ISI x Language interaction [F(2,216) = 4.6, p = .01] indicated that the better d’ scores for the “rising-falling” contrast were exhibited by the non-native
participants with an ISI of 500 ms and by the native ones with an ISI of 1500 ms. Finally, a significant Condition x ISI x Language interaction [F(2,109) = 3.26, p = .04] showed that non-native participants were better than native ones with a short ISI in the “Vocoded” condition. A significant Condition x Language interaction was also found [F(2,109) = 9.46, p < .001]. To explore this interaction further and to compare the discrimination performance of native and non-native subjects in each condition, separate ANOVAs were run on the total d’ scores (across contrasts) within each condition with 2 Languages as a between-subject factor. In the “Intact” condition, no main effect of Language was observed. In the “Vocoded” speech condition, a main effect of Language [F(1,38) = 6.33, p = .016] was observed, and post-hoc comparisons (Tukey test) indicated that the d’ scores of the non-native participants were significantly higher than those of the native participants. In the “Non-speech” condition, a main effect of Language was observed [F(1,38) = 6.15, p = .018], and post-hoc comparisons revealed that native participants obtained higher d’ scores than non-native ones.

IV. DISCUSSION

The present study aimed to investigate the role of language experience in the processing of spectro-temporal cues in lexical-tone discrimination. The discrimination of lexical tones was compared between native (Mandarin-speaking, and thus lexical-tone users) and non-native listeners (French-speaking, and thus non-lexical-tone users) in two experimental conditions: one with intact AM, FM and spectral cues and another with degraded FM and fine spectral cues. Moreover, the perception of pitch contours per se was also tested in a “Non-speech” condition containing the F0 variations of the lexical tones applied to a broadband click train. In apparent contrast with previously published work, French and Mandarin speakers showed similar discrimination performance in the “Intact” speech condition (but see Hallé et al., 2004). This indicates that both French and Mandarin speakers were able to correctly perceive differences in pitch contours with the present syllables. The absence of difference between language groups
results from a ceiling effect, that is, from (i) the high performance of both groups with the current discrimination task and (ii) the clarity of the present speech stimuli. In the “Vocoded” speech condition, the performance of both groups decreased significantly but remained above chance level (Student t tests; all p < .001). In this condition, the cues conveying voice-pitch information were severely degraded, and Mandarin speakers were more impaired than French speakers in the discrimination task. These results indicate an effect of language experience on the perception of F0 variations and confirm that lexical-tone users are more dependent than non-users on FM and fine spectral cues to perceive lexical tones. In the “Non-speech” condition, Mandarin speakers showed better discrimination of (F0) pitch contours than French speakers. These results are in line with several studies showing an effect of linguistic experience on the identification of pitch contours for non-linguistic signals such as sine waves, harmonic complex tones, or iterated rippled noises (e.g., Bent et al., 2006; Swaminathan et al., 2008; Xu et al., 2006). Overall, the results suggest that Mandarin speakers are more dependent on F0 variations - and thus on FM and fine spectral cues - than French speakers when discriminating lexical tones. As shown in the “Vocoded” condition, French speakers are better able than Mandarin speakers to make use of the remaining information, such as AM, duration and/or loudness. Furthermore, the duration of the ISI influenced subjects’ performance differently in that experimental condition. Better performance for the “rising-falling” contrast was observed with a short ISI for French speakers, and with a long ISI for Mandarin speakers. This difference can be interpreted in two ways.
First, it may reveal that linguistic experience affects the rate of information loss in the short-term memory representation of the voice-pitch cues used to discriminate lexical tones (i.e., Mandarin speakers may show less information loss in the short-term memory representation of the voice-pitch cues than French speakers). Alternatively, it may reveal that Mandarin speakers engage in a categorization process (e.g., Durlach and Braida, 1969) even in this degraded speech condition. Taken together, these results showed that Mandarin speakers rely more than French listeners on the fine spectral details and FM cues to recognize lexical
tones. This suggests that linguistic experience shapes the weight of spectro-temporal fine structure acoustic cues in speech sounds. The results obtained with click trains suggest that the influence of linguistic experience on the weight of spectro-temporal fine structure acoustic cues extends to non-linguistic sounds.

ACKNOWLEDGMENTS

The authors wish to thank all the participants of this study. C. Lorenzi was supported by a grant from ANR (HEARFIN project). This work was also supported by ANR-11-0001-02 PSL* and ANR-10-LABX-0087.

REFERENCES

Abramson, A. S. (1978). “Static and dynamic acoustic cues in distinctive tones,” Lang. Speech, 21, 319–325.
Bent, T., Bradlow, A. R., and Wright, B. A. (2006). “The influence of linguistic experience on the cognitive processing of pitch in speech and nonspeech sounds,” J. Exp. Psychol. Human, 32, 97–103.
Burnham, D., and Francis, E. (1997). “The role of linguistic experience in the perception of Thai tones,” in Southeast Asian linguistic studies in honour of Vichin Panupong.
Burnham, D., and Mattock, K. (2007). “The perception of tones and phones,” in Language Experience in Second Language Speech Learning: In honor of James Emil Flege.
De Cheveigné, A., and Kawahara, H. (2002). “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111, 1917–1930.
Clément, S., Demany, L., and Semal, C. (1999). “Memory for pitch versus memory for loudness,” J. Acoust. Soc. Am., 106, 2805–2811.
Dudley, H. (1939). “Remaking speech,” J. Acoust. Soc. Am., 11, 169–177.
Durlach, N. I., and Braida, L. D. (1969). “Intensity perception. I. Preliminary theory of intensity resolution,” J. Acoust. Soc. Am., 46, 372–383.
Fu, Q.-J., Zeng, F.-G., Shannon, R. V., and Soli, S. D. (1998). “Importance of tonal envelope cues in Chinese speech recognition,” J. Acoust. Soc. Am., 104, 505.
Gandour, J. T., and Harshman, R. A. (1978). “Crosslanguage differences in tone perception: A multidimensional scaling investigation,” Lang. Speech, 21, 1–33.
Gnansia, D., Péan, V., Meyer, B., and Lorenzi, C. (2009). “Effects of spectral smearing and temporal fine structure degradation on speech masking release,” J. Acoust. Soc. Am., 125, 4023–4033.
Hallé, P. A., Chang, Y. C., and Best, C. T. (2004). “Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners,” J. Phonetics, 32, 395–421.
Huang, T., and Johnson, K. (2010). “Language specificity in speech perception: Perception of Mandarin tones by native and nonnative listeners,” Phonetica, 67, 243–267.
Kong, Y.-Y., and Zeng, F.-G. (2006). “Temporal and spectral cues in Mandarin tone recognition,” J. Acoust. Soc. Am., 120, 2830–2840.
Kuo, Y.-C., Rosen, S., and Faulkner, A. (2008). “Acoustic cues to tonal contrasts in Mandarin: Implications for cochlear implants,” J. Acoust. Soc. Am., 123, 2815.
Lee, C.-Y., Tao, L., and Bond, Z. S. (2008). “Identification of acoustically modified Mandarin tones by native listeners,” J. Phonetics, 36, 537–563.
Lee, C.-Y., Tao, L., and Bond, Z. S. (2010). “Identification of acoustically modified Mandarin tones by non-native listeners,” Lang. Speech, 53, 217–243.
Liang, Z. A. (1963). “The auditory perception of Mandarin tones,” Acta Phys. Sin., 26, 85–91.
Macmillan, N. A., and Creelman, C. D. (1991). Detection theory: A user’s guide. Cambridge University Press, New York.
Patterson, R. D. (1987). “A pulse ribbon model of monaural phase perception,” J. Acoust. Soc. Am., 82, 1560–1586.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science, 270, 303–304.
Shamma, S., and Lorenzi, C. (2013). “On the balance of envelope and temporal fine structure in the encoding of speech in the early auditory system,” J. Acoust. Soc. Am., 133, 2818–2833.
Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature, 416, 87–90.
Swaminathan, J., Krishnan, A., and Gandour, J. T. (2008). “Pitch encoding in speech and nonspeech contexts in the human auditory brainstem,” Neuroreport, 19, 1163–1167.
Werker, J. F., and Tees, R. C. (1984). “Phonemic and phonetic factors in adult cross-language speech perception,” J. Acoust. Soc. Am., 75, 1866.
Whalen, D. H., and Xu, Y. (1992). “Information for Mandarin tones in the amplitude contour and in brief segments,” Phonetica, 49, 25–47.
Wang, S., Xu, L., and Mannell, R. (2011). “Relative contributions of temporal envelope and fine structure cues to lexical tone recognition in hearing-impaired listeners,” J. Assoc. Res. Otolaryngol., 12, 783–794.
Xu, L., and Pfingst, B. E. (2008). “Spectral and temporal cues for speech recognition: Implications for auditory prostheses,” Hear. Res., 242, 132–140.
Xu, Y., Gandour, J. T., and Francis, A. L. (2006). “Effects of language experience and stimulus complexity on the categorical perception of pitch direction,” J. Acoust. Soc. Am., 120, 1063–1074.


3. Introduction: The perception of speech modulation cues is guided by early language-specific experience (Infant data)

The previous experiment (Cabrera, Tsao, Gnansia, Bertoncini, & Lorenzi, submitted) suggests that, in adults, linguistic experience impacts the use of speech modulation cues in lexical-tone perception, and particularly the use of those conveying voice-pitch information (i.e., spectro-temporal fine structure cues). Mandarin speakers showed poorer discrimination of lexical tones than French speakers when fine spectral details and FM cues were severely degraded by vocoding. Compared to French participants, Mandarin speakers therefore seem more dependent on FM and spectral cues when perceiving lexical tones. The second experiment of this chapter seeks to investigate whether this influence of language experience can be evidenced during infancy, that is, in the first year of development. The ability to discriminate lexical tones is tested in 6- and 10-month-old infants from two linguistic environments (Mandarin versus French). The same vocoder conditions as those used for adults (“intact”, “vocoded” and “nonspeech”) are used with infants. A visual habituation procedure is used to assess their discrimination ability (as in Cabrera, Lorenzi, & Bertoncini, in preparation, Chapter 4). In the vocoded condition, stimuli are processed to preserve slow and fast AM cues in 8 broad frequency bands only. As mentioned before, fast AM cues (> 50 Hz) were shown to play an important role in lexical-tone perception in adults (e.g., Fu, Zeng, Shannon, & Soli, 1998; Kong & Zeng, 2006). However, the studies reported in Chapters 2 and 3 suggest that fast AM cues should be more important than slower ones for 6-month-old infants when perceiving a consonant contrast. Thus, a tone-excited vocoder extracting the AM cues with a relatively high cutoff frequency (set at ERBN/2) is chosen to evaluate the role of fine spectro-temporal cues in lexical-tone discrimination.
The following experiment reports the discrimination results of 6- and 10-month-old infants learning French or Mandarin in the three stimulus conditions. These results indicate that the perception of speech modulation cues differs according to early language exposure between 6 and 10 months of age. As in adults, the results
of 10-month-olds suggest that the weight of spectro-temporal fine structure cues is influenced by linguistic factors.

This study is presented in the following article (in preparation): Cabrera, L., Hu, Y. H., Li, L. Y., Tsao, F. M., Lorenzi, C., & Bertoncini, J. “The perception of speech modulation cues is guided by early language-specific experience.”


4. Article: Cabrera, Tsao, Hu, Li, Lorenzi & Bertoncini (in preparation)

The perception of speech modulation cues is guided by early language-specific experience

Laurianne Cabrera
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes, 45 rue des Saints-Pères, 75006 Paris, France

You Hsin Hu
Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan

Lu Yang Li
Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan

Feng-Ming Tsao
Department of Psychology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan

Christian Lorenzi
Institut d’Etude de la Cognition, Ecole normale supérieure, Paris Sciences et Lettres, 29 rue d’Ulm, 75005 Paris, France

Josiane Bertoncini
Laboratoire de Psychologie de la Perception, CNRS, Université Paris Descartes, 45 rue des Saints-Pères, 75006 Paris, France

In preparation

Running title: Speech modulation cues and language-specific experience


ABSTRACT

Although a number of studies have shown that infants reorganize their perception of speech sounds according to their native language during the first year of life, information is lacking about the contribution of basic auditory mechanisms to this process. The present study aimed to evaluate when infants’ perception of the basic acoustic cues involved in speech perception (i.e., frequency- and amplitude-modulation information) comes into line with their native-language experience. The discrimination of a lexical-tone contrast (rising versus low) was assessed in 10- and 6-month-old infants learning either French or Mandarin using a visual habituation paradigm. The lexical tones were presented in two conditions designed either to keep intact or to severely degrade the frequency-modulation and fine spectral cues needed to accurately perceive the voice-pitch trajectory. The discrimination of the same voice-pitch trajectories was also tested with non-speech signals (click trains) in French- and Mandarin-learning 10-month-old infants. Consistent with previous studies, when the lexical tones were left intact, the younger infants of both language groups and the older ones learning a tonal language discriminated the lexical-tone contrast, while French-learning 10-month-olds failed. However, only the French 10-month-olds showed a reliable discrimination response when the frequency-modulation and fine spectral cues of the lexical tones were severely degraded. In the non-speech condition, Mandarin 10-month-olds were found to discriminate the pitch trajectories of the lexical tones better than French infants. Altogether, these results reveal that the perceptual reorganization occurring for speech sounds at the end of the first year is closely connected with changes in the auditory ability to use speech modulation cues.

Key words: Lexical tones; Amplitude modulation; Frequency modulation; Infants


I. INTRODUCTION

In the first months of life, infants are able to discriminate almost all phonetic units, including non-native ones (see Kuhl, 2004, and Werker & Tees, 2005, for reviews). However, this early “universal” perceptual ability becomes more native-language-specific later in the first year of life. This was initially demonstrated by Werker and Tees (1983), who found that young English-learning infants were able to discriminate a Hindi consonant contrast between 6 and 8 months of age, but that this ability declined between 10 and 12 months. Several studies replicated these initial results with other consonant contrasts using behavioral and electrophysiological methods (e.g., Best, McRoberts, LaFleur, & Silver-Isenstadt, 1995; Cheour et al., 1998; Rivera-Gaxiola, Silva-Pereyra, & Kuhl, 2005; Tsushima et al., 1994). Infants also become more accurate in discriminating phonetic contrasts of their native language during their first year (e.g., Conboy, Sommerville, & Kuhl, 2008; Kuhl et al., 2006; Tsao, Liu, & Kuhl, 2006). This “perceptual reorganization” has been observed for consonant and vowel perception, with language-specific reorganization taking place around 6 months of age for vowel categories (e.g., Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; Polka & Werker, 1994). Together, these results demonstrate that speech perception is shaped by language experience during the first year of life, with infants becoming more and more sensitive to their native speech contrasts and less sensitive to non-native ones.

Compared to the perception of segments like consonants and vowels, the developmental time course of lexical-tone perception has received less attention. This is surprising given that more than 70% of the world’s languages can be described as tonal languages (see Yip, 2002). In tonal languages such as Thai or Mandarin, variations in fundamental frequency (F0), and thus in voice pitch, at the syllable level distinguish word meaning (e.g., Liang, 1963).
In a pioneering study, Mattock and Burnham (2006) explored the ability to discriminate Thai lexical-tone patterns in English- and Mandarin-learning infants before 12 months. The lexical-tone patterns consisted of the syllable /ba/ containing a so-called “contour-level” contrast (i.e., a low F0 syllable with a flat trajectory versus a syllable with a rising F0 trajectory) and a “contour-contour” contrast (i.e., a
syllable with a rising F0 trajectory versus a syllable with a falling F0 trajectory). Only English infants showed a decline in lexical-tone discrimination between 6 and 9 months of age, and for the contour-level contrast only. Moreover, infants of both languages showed no deterioration in their discrimination of pitch variations presented in a non-linguistic signal (i.e., pitch differences corresponding to musical tones played on a violin). In a subsequent study, Mattock, Molnar, Polka, and Burnham (2008) tested the discrimination of the same contour-level lexical-tone contrast in 4-, 6-, and 9-month-old English- and French-learning infants. English and French are both non-tonal languages, but they differ in their stress pattern (i.e., stress-timed versus syllable-timed rhythm, respectively; see Ramus, Nespor, & Mehler, 1999). Mattock et al. (2008) hypothesized that the typical stress pattern of the native language may influence the perception of lexical tones. The results showed that 4- and 6-month-old English- and French-learning infants exhibited the same ability to discriminate the contour-level lexical-tone contrast. However, this ability was not observed in 9-month-olds of either language group. Thus, when pitch variations at the syllable level have little relevance in the acquisition of a given language such as English or French, infants become less sensitive to these variations during their first year. Moreover, Yeung, Chen, and Werker (2013) suggested that some evidence for the reorganization of lexical-tone perception may be found as early as 4 months of age: 4-month-old infants learning one of two different tonal languages (either Mandarin or Cantonese) showed a preference for the lexical tones closest to those of their native language.
Thus, long before the end of their first year, infants learn to process pitch variations at the syllable level as a reliable phonological cue when their native language is a tonal language. Several factors may influence speech reorganization. Language experience (and thus, repeated exposure to the native phonetic categories and language regularities), cognitive skills (i.e., attentional processes related to executive and inhibitory control) and social skills (i.e., social interactions) have been shown to influence the development of phonetic discrimination (e.g., Conboy et al., 2008; Kuhl, Tsao & Liu, 2003; Saffran, Aslin, & Newport, 1996; Saffran, 2002; see Kuhl, 2004 for a review). The development of speech perception can also be described as being driven by the intrinsic properties of the speech signal and by
sensory constraints imposed by the human auditory system (see Aslin & Pisoni, 1980, and Nittrouer, 2002). For instance, psycholinguistic studies have shown that substantial exposure to a given language is required to accurately perceive certain phonetic contrasts. For example, the acoustic similarity between phonetic contrasts belonging to two languages impairs the ability of 8-month-old infants learning both languages to discriminate these contrasts (e.g., Catalan-Spanish; Bosch & Sebastián-Gallés, 2003; Sebastián-Gallés & Bosch, 2009). Moreover, the development of infants’ discrimination abilities has been shown to differ across speech contrasts (e.g., Aslin, Pisoni, Hennessy, & Perey, 1981; Pisoni, 1977; Mattock et al., 2006; Polka, Colantonio, & Sundara, 2001). Narayan, Werker, and Beddor (2010) studied the discrimination of a Filipino contrast, [na]-[ŋa], and an English and Filipino contrast, [ma]-[na]. They showed that the [na]-[ŋa] contrast is more difficult to discriminate than [ma]-[na] for English-learning infants between 4 and 12 months of age, and also, although native, for Filipino-learning infants until the age of 10-12 months. These results indicate that a perceptually less salient contrast requires longer language-specific experience to be perceived correctly, and thus demonstrate that language experience enhances the perception of specific acoustic parameters. Moreover, these studies suggest that speech reorganization may involve relatively basic auditory processes (i.e., the processing of spectro-temporal cues).

Over the last decades, psychoacoustic studies have developed an original and useful paradigm to explore the auditory processing of speech features. This paradigm is based on the notion that speech information is mainly conveyed by the temporal modulations appearing at the output of cochlear filters (e.g., Steeneken & Houtgast, 1980).
To test this notion, “vocoders” (Dudley, 1939), which are speech analysis and synthesis systems, can be used to manipulate the modulation components of speech in a number of frequency bands (see Shamma & Lorenzi, 2013, for a review). Studies repeatedly found that English- and French-speaking adults rely almost exclusively on amplitude-modulation (AM) cues, or acoustic “temporal envelope” cues (the slow variations in amplitude over time), to identify syllables, words and sentences (e.g., Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995; Smith, Delgutte, & Oxenham, 2002). These AM cues are known to convey variations in speech rhythm (e.g., Rosen, 1992). Not surprisingly, a different pattern of results was found for adults who use a
tonal language. Xu and Pfingst (2008) showed that the identification of lexical tones is mainly based on frequency-modulation (FM) cues or acoustic “temporal fine structure” (the oscillations in instantaneous frequency close to the center frequency of the band). These spectro-temporal fine structure cues are known to convey the most salient (F0-related) pitch information (e.g., Smith et al., 2002). These psychoacoustic data reveal that the weight of AM and FM cues in speech perception is not entirely constrained by the general properties of the human auditory system because it varies across languages. It is therefore reasonable to assume that language experience influences the weight of these modulation cues in speech perception, and contributes to the language-specific speech organization found in infants. To date, the time course of the effect of linguistic experience on the perception of modulation cues is still largely unexplored. Only two studies have investigated the perception of AM and FM speech cues in 6-month-old infants learning French (Bertoncini, Nazzi, Cabrera, & Lorenzi, 2011; Cabrera, Bertoncini & Lorenzi, 2013). These studies showed that French-learning infants are able to discriminate phonetic contrasts (voiced versus voiceless stop consonants, and labial versus coronal stop consonants) on the sole basis of the slowest (<16 Hz) AM cues in a small number of broad frequency bands. In other words, French-learning infants do not require FM and fine spectral cues to discriminate these French phonetic contrasts. These results are compatible with the idea that during the first year of life, infants may progressively rely more on the modulation features that are relevant to the phonology of their native language. However, empirical data on other (and phonologically distinct) languages are still lacking. The present study investigates to what extent language experience influences the perception of speech modulation cues. 
This is achieved by testing the discrimination abilities of French- and Mandarin-learning infants using vocoded lexical tones. More specifically, this study determines whether the perceptual reorganization observed for speech during the first year of life is associated with a reorganization in the weighting of the modulation cues relevant to native-language perception. As in Mattock and Burnham (2006) and Mattock et al. (2008), syllables /ba/ containing either rising or low Thai tones were used in the present
study. In Thai, for example, the contour-level contrast (rising versus low) is known to be more difficult to discriminate than a contour-contour contrast (rising versus falling) for non-native adults8 (see Abramson, 1978; Burnham & Francis, 1997; Gandour & Harshman, 1978). Although the tokens are non-native for both infant groups, the Thai lexical tones are close to their Mandarin counterparts. This lexical-tone contrast is known to be difficult for non-native listeners. However, experience with a tonal language using similar acoustic patterns (i.e., Mandarin) should help infants to discriminate this contrast. The stimuli were left intact or vocoded in order to selectively degrade FM cues and fine spectral details, and thus, the perception of voice-pitch trajectory. These two types of stimuli were then presented for discrimination to older infants (aged 10 months) and to younger infants (aged 6 months) learning either a syllable-timed or a tonal language (i.e., French or Mandarin, respectively). Infants’ discrimination abilities were tested in the intact and vocoded speech conditions using a visual habituation paradigm (see Werker et al., 1998). If speech reorganization is associated with a change in the perception of AM and FM cues (that is, changes in the perceptual weight of AM and FM cues), only infants older than 6 months should be expected to show a discrimination pattern systematically related to their native language. Ten-month-old infants learning either French or Mandarin were tested in the first experiment in order to test this hypothesis. It was assumed that (i) only the Mandarin 10-month-olds would discriminate the lexical tones in the Intact condition, and (ii) Mandarin 10-month-olds would be more impaired than French-learning infants of the same age by the reduction of the FM cues and fine spectral details conveying the F0 variations. In a second experiment, younger infants (6-month-olds) learning French or Mandarin were included.
Six-month-old infants were expected to show a similar pattern of discrimination across language groups if (as expected) linguistic experience had not yet shaped their perception of speech modulation cues.

8 These two lexical tones are the most difficult to distinguish for non-native speakers because rising and low tones have highly similar F0 trajectories until the mid-point of the tone, after which the F0 increases in the rising tone and slowly decreases in the low tone.


A third experiment was designed to further explore the perception of F0 trajectory by French and Mandarin 10-month-old infants using non-speech stimuli (i.e., click trains with F0 variations similar to those of the original lexical tones). Adults speaking a tonal language have been shown to have enhanced abilities to identify pitch contours in non-speech signals (e.g., Bent, Bradlow & Wright, 2006; Swaminathan, Krishnan & Gandour, 2008; Xu, Gandour & Francis, 2006). If, at 10 months, linguistic experience has already affected the weight of the FM and fine spectral cues conveying F0 information, only Mandarin-learning infants were expected to be able to discriminate the F0 variations, even in a non-linguistic condition.

II. EXPERIMENT 1

Experiment 1 tested French-learning and Mandarin-learning 10-month-old infants on the lexical-tone contrast (/ba/ rising versus /ba/ low) processed in two conditions (Intact versus Vocoded).

A. Method

1. Participants

French-learning infants were recruited from a database at the University of Paris Descartes (Paris), and Mandarin-learning infants were recruited at National Taiwan University (Taipei). Data from 64 ten-month-old infants were analyzed in this experiment: 32 French-learning infants, 16 in the Intact condition (mean = 309 days, range = 300 to 328 days) and 16 in the Vocoded condition (mean = 313 days, range = 302 to 333 days), and 32 Mandarin-learning infants, 16 in the Intact condition (mean = 316 days, range = 296 to 333 days) and 16 in the Vocoded condition (mean = 319 days, range = 300 to 340 days). All families were informed about the goals of the current study and provided written consent before their participation, in accordance with current French and Taiwanese ethical requirements. All infants were born full-term, without any history of medical complications. All infants had normal hearing (based on parental report of newborn-hearing screening results). An additional 50 infants participated in the
study, but their data were not included for the following reasons: fussing and crying (n=35), and failure to reach the habituation criterion (n=15). Infants were included in the analyses if they completed at least four trials in the habituation phase (to ensure that they heard the habituation stimulus for more than 30 s) and did not require an excessive number of habituation trials (more than 2 standard deviations above their group mean).

2. Stimuli

A female native speaker of Thai produced several utterances of the syllable /ba/ with two different lexical tones: rising and low (i.e., rising F0 trajectory versus flat F0 trajectory; F0 range: 100-350 Hz). The speaker was asked to speak clearly in an adult-directed register (so as not to accentuate acoustic differences between lexical tones; see Liu, Tsao, & Kuhl, 2007). In each category, eight different occurrences were chosen based on their clarity and duration. The mean durations were similar for rising tones (661.6 ms, SD = 32.3 ms) and low tones (636 ms, SD = 31.2 ms; t(12) = 1.67, p = .13). Figure 1 shows the mean F0 variation calculated across the eight exemplars in each category, as a function of time, with duration normalized across exemplars. Two types of audio files were generated: a repeated sequence made of low tones only, and a repeated sequence made of rising tones only. Within each audio file, tones were separated by a silent inter-stimulus interval (ISI) varying randomly from 600 to 1300 ms. This variation was introduced to make small variations in duration between items irrelevant within and between categories. The total duration of each audio file was around 26 s. Each file was constructed by taking four acoustically different tokens of a stimulus category repeated four times, for a total of 16 randomly ordered stimuli. Four different random orders were created for both /ba/ rising and /ba/ low stimuli; two were used in the habituation phase, and the other two in the test phase. In the “Intact” condition, the AM and FM cues of the original speech signal were extracted within 32 independent frequency bands (from 80 to 8020 Hz). The original speech signal was passed through a bank of 32 second-order gammatone filters, each 1 equivalent rectangular bandwidth (ERB) wide9

9 The ERB corresponds to the bandwidth of cochlear filters as measured for normal-hearing adults at moderate levels (Glasberg & Moore, 1990).


(Gnansia, Péan, Meyer, & Lorenzi, 2009; Patterson, 1987). Then, in each band, the AM and FM components were extracted using the Hilbert transform. The AM component was low-pass filtered using a zero-phase Butterworth filter (36 dB/octave rolloff) with a cutoff frequency set to ERB/2 in order to preserve the fast, F0-related AM fluctuations. Then, the AM and FM components were recombined in each frequency band and the 32 modulated signals were summed. The level of the resulting speech signal was adjusted to have the same root-mean-square (rms) value as the input signal. In this condition, signal processing resulted in near-perfect stimulus reconstruction.
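
For illustration, the per-band AM/FM decomposition described above can be sketched in a few lines of Python. This is a minimal sketch using SciPy's Hilbert transform, not the exact processing used in the study: the gammatone filtering stage is assumed to have already produced one band-limited signal `band`, and the `erb` helper follows the Glasberg and Moore (1990) formula mentioned in footnote 9.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def erb(fc):
    # Equivalent rectangular bandwidth (Hz) of a cochlear filter
    # at centre frequency fc (Glasberg & Moore, 1990).
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def am_fm_decompose(band, fc, fs):
    """Split one gammatone-filtered band into its AM (temporal envelope)
    and FM (temporal fine structure) components, with the AM low-pass
    filtered at ERB/2 as in the Intact condition."""
    analytic = hilbert(band)
    am = np.abs(analytic)             # Hilbert envelope
    fm = np.cos(np.angle(analytic))   # unit-amplitude fine structure
    # Zero-phase Butterworth low-pass: filtfilt doubles the slope, so a
    # 3rd-order filter gives ~36 dB/octave overall, as in the text.
    b, a = butter(3, (erb(fc) / 2) / (fs / 2))
    am = filtfilt(b, a, am)
    return am, fm

# Resynthesis (not shown): each band is recombined as am * fm, the 32
# bands are summed, and the rms level is matched to the input signal.
```

In the actual Intact condition this operation is applied to each of the 32 one-ERB-wide bands before recombination and rms matching.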

Figure 1. Stimuli mean F0 (Hz) as a function of time. F0 was averaged across the eight exemplars used in the low and the rising categories. Stimulus duration was normalized across exemplars.

In the “Vocoded” condition, the AM and the FM of the original input signal were extracted as described above, but from only 8 relatively broad frequency bands (4-ERB wide). The original FM carriers were replaced by sine-wave carriers with frequencies set to the center frequency of the gammatone filters, and with a random starting phase in each analysis frequency band. As for the intact stimuli, the level of the vocoded speech signal was adjusted to have the same rms
value as the input signal. Both intact and vocoded stimuli had identical bandwidth. However, the FM cues and fine spectral details of speech signals were severely degraded in the vocoded condition. Figure 2 represents the spectrograms of a lexical-tone contrast (/ba/ low and /ba/ rising) in each condition.
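
The carrier replacement performed in the Vocoded condition can be illustrated as follows. This is a hypothetical sketch, not the study's actual code: `vocode_band` resynthesizes one of the 8 broad analysis bands from its extracted envelope `am`, and `match_rms` performs the final level adjustment described above.

```python
import numpy as np

def vocode_band(am, fc, fs, rng=None):
    """Resynthesize one analysis band for the Vocoded condition: the
    original FM carrier is discarded and replaced by a pure-tone carrier
    at the band's centre frequency fc, with a random starting phase."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(len(am)) / fs
    phase = rng.uniform(0, 2 * np.pi)
    carrier = np.sin(2 * np.pi * fc * t + phase)
    return am * carrier

def match_rms(signal, reference):
    """Scale `signal` so it has the same rms level as `reference`."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return signal * (rms(reference) / rms(signal))
```

The 8 vocoded bands would then be summed and passed through `match_rms` against the original input, so that FM cues and fine spectral details are degraded while the band envelopes and overall level are preserved.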

Figure 2. Spectrograms of the Intact (top panels) and Vocoded (bottom panels) versions of a /ba/ low (left panels) and /ba/ rising (right panels) stimulus.

3. Procedure

French-learning infants were tested in Paris (France), and Mandarin-learning infants were tested in Taipei (Taiwan). In each location, infants were seated on the caregiver’s lap, approximately 2 m from the TV monitor, in a sound-proof booth. A video camera positioned below the stimulus monitor was linked to another screen outside the booth and was used to observe the infant’s looking times online. Two loudspeakers, located on each side of the infant’s monitor approximately 30° to the left and right of the centerline of the caregiver’s chair, delivered the auditory stimuli at a level of 70 dB SPL. The observer was unaware of the audio file presented. He/she recorded the duration of the infant’s looking time by pressing a key and controlled stimulus presentation using Habit X.10 (Cohen, Atkinson, & Chaput, 2000). The caregiver was instructed not to
interfere with the infant’s behavior (i.e., not to show the screen at any time) and wore earplugs and headphones delivering masking music. A “visual habituation” method was used to assess discrimination of lexical tones in infants (e.g., Houston, Horn, Qi, Ting, & Gao, 2007; Mattock et al., 2008; Narayan et al., 2010; Werker et al., 1998). Audio files were presented contingent on the infant’s looking toward a display (a black and white checkerboard) on the TV monitor. Auditory and visual presentations continued until the infant looked away for 2 s (timed automatically by the computer, the experimenter releasing the key press as soon as the infant looked away) or until the end of the audio file (maximum 26 s). At the end of the trial, the checkerboard disappeared and flashing balls appeared to draw the infant’s attention to the TV monitor. No auditory stimulus was presented during this interval between trials. Once the infant looked at the screen, the experimenter initiated the next trial. The experiment began with a habituation phase, during which infants heard the same sound category. In the present procedure, the habituation phase ended when the mean looking time (LT) over three consecutive trials decreased by 50% compared to the highest mean over three previous consecutive trials. The test phase directly ensued, during which infants received four novel (N) and four familiar (F) trials presented alternately, with the order counterbalanced across subjects (N-F-N-F-N-F-N-F or F-N-F-N-F-N-F-N). Finally, in each condition, half of the infants were habituated to rising tones and the other half to low tones. The video recording of each infant was then coded by the experimenter, who was unaware of which test trials were novel or familiar. The experimenter proceeded through the video file frame by frame (at a resolution of 25 frames per second), logging infant looks and non-looks.
This off-line coding was thought to be more reliable than the on-line coding conducted during the experiment.
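
As an illustration, the moving-window habituation criterion can be sketched as follows. This is a hypothetical helper, not the Habit software itself; it assumes the comparison window is the three most recent trials and that the reference is the highest mean over any three consecutive earlier trials, which is one plausible reading of the criterion described above.

```python
def habituated(looking_times, window=3, criterion=0.5):
    """Return True once the mean looking time (LT, in s) over the last
    `window` trials has fallen to `criterion` (50%) or less of the highest
    mean over any `window` consecutive earlier trials."""
    if len(looking_times) < 2 * window:
        return False  # need at least two non-overlapping windows
    current = sum(looking_times[-window:]) / window
    earlier = looking_times[:-window]
    peak = max(sum(earlier[i:i + window]) / window
               for i in range(len(earlier) - window + 1))
    return current <= criterion * peak
```

For example, an infant whose looking times drop from around 10 s per trial to around 4 s over three consecutive trials would meet the criterion, and the test phase would begin on the next trial.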

B. Results

The mean LT in the test phase was recorded for each group of infants. A preference score (or discrimination index) was computed for each infant (see Beach & Kitamura, 2011; Gava, Valenza, Turati, & de Schonen, 2008) to test whether infants preferred the novel sequences in the test phase. This preference score corresponded to the total LT for the four novel trials divided by the total LT
registered during the eight test trials (familiar and novel). Discrimination was assessed by comparing the preference score for novel trials against chance level (50% of total looking time in the test phase). Figure 3 shows the mean preference scores for each group. All the groups exhibited mean scores above 50%. However, only the Mandarin-learning group showed a mean score above 55% in the Intact condition, and only the French-learning group showed a mean score above 55% in the Vocoded condition.
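
The preference score defined above amounts to the following computation (a minimal sketch; `novel_lts` and `familiar_lts` are hypothetical lists holding the four looking times for each trial type):

```python
def preference_score(novel_lts, familiar_lts):
    """Novelty preference: total looking time on the four novel trials
    as a proportion of total looking time over all eight test trials.
    Chance level is 0.5 (i.e., 50%)."""
    total = sum(novel_lts) + sum(familiar_lts)
    return sum(novel_lts) / total
```

A score above 0.5 indicates longer looking during novel trials, and hence discrimination of the habituated from the novel category.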

Figure 3. Mean preference scores for novel trials in the test phase in the two signal-processing conditions (Intact and Vocoded) for 10-month-old infants learning French and Mandarin. The error bars represent standard errors. The dotted horizontal line shows chance level (50%).

An analysis of variance (ANOVA) conducted on the infants’ preference scores showed no main effect of, or interaction involving, the habituation stimulus (rising versus low). Thus, the data were collapsed across this variable in the main analyses. A between-group comparison was then run on the preference scores (i.e., the proportion of LT for the novel trials in the test phase) in a 2 (Condition: Intact versus Vocoded) × 2 (Language: French versus Mandarin) ANOVA. The analysis showed no main effect of Condition (F(1,60)=0.16; p=.70) or of Language (F(1,60)=0.99; p=.32), but a marginal interaction between Language and Condition (F(1,60)=3.58; p=.063). In other words, the preference scores varied across stimulus conditions depending on native language. One-sample t-tests were then applied in each condition and for each language group to determine whether the preference scores for novel sequences

168

were significantly above chance. In the Intact condition, the Mandarin 10-month- olds showed a preference for novelty significantly above chance level (50%) [mean = 57.13 %, SD = 11.1 ; one-tailed t(15)=2.56, p = .01], while the French 10-month-olds only showed a marginal preference for novelty [mean = 54.66%, SD = 11.2 ; one tailed t(15)=1.66, p = .058]. In the Vocoded condition, a significant preference for novelty was observed in the French 10-month-olds [mean = 58.78%, SD = 11.5; one tailed t(15)=3.04, p = .004] but not in the Mandarin infants [mean = 50.83%, SD = 10.14, one tailed t(15)=0.33, p = .37]. Thus, a reverse pattern of results was found regarding language environment and signal processing condition (these results explain the marginal interaction observed in the ANOVA between Language and Condition).
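For a balanced design like the present one (equal cell sizes), the 2 × 2 between-subjects ANOVA can be computed from first principles. The sketch below is my own illustration of the standard procedure, not the authors' analysis code; `cells` would hold the four groups' preference scores (with 16 infants per cell, the error df would be 60).

```python
import numpy as np
from scipy import stats

def two_way_anova_balanced(cells):
    """Balanced two-way between-subjects ANOVA.
    cells[i][j] = list of scores for level i of factor A, level j of factor B.
    Returns {effect: (F, p)} for both main effects and the interaction."""
    a_levels, b_levels = len(cells), len(cells[0])
    n = len(cells[0][0])                       # per-cell sample size (balanced)
    data = np.array(cells, dtype=float)        # shape (a, b, n)
    grand = data.mean()
    # Sums of squares for main effects, cells, interaction, and error
    ss_a = b_levels * n * np.sum((data.mean(axis=(1, 2)) - grand) ** 2)
    ss_b = a_levels * n * np.sum((data.mean(axis=(0, 2)) - grand) ** 2)
    cell_means = data.mean(axis=2)
    ss_cells = n * np.sum((cell_means - grand) ** 2)
    ss_ab = ss_cells - ss_a - ss_b
    ss_err = np.sum((data - cell_means[..., None]) ** 2)
    df_err = a_levels * b_levels * (n - 1)

    def f_and_p(ss, df):
        f = (ss / df) / (ss_err / df_err)
        return f, stats.f.sf(f, df, df_err)   # upper-tail p from the F distribution

    return {"A": f_and_p(ss_a, a_levels - 1),
            "B": f_and_p(ss_b, b_levels - 1),
            "AxB": f_and_p(ss_ab, (a_levels - 1) * (b_levels - 1))}
```

Calling it as `two_way_anova_balanced([[french_intact, french_vocoded], [mandarin_intact, mandarin_vocoded]])` would yield the Language and Condition main effects and their interaction.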

C. Discussion

Experiment 1 investigated the perception of lexical tones in 10-month-old lexical-tone-learning and non-learning infants. In the Intact condition, where the speech modulation cues were close to those in the original signal, only the Mandarin-learning infants exhibited a discrimination response to the contrast between low and rising lexical tones. This result is consistent with previous studies on lexical-tone perception in infants using contrasts that are difficult for non-tonal-language listeners (Mattock & Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013). Unlike French-learning infants, Mandarin-learning infants were able to discriminate a non-native Thai lexical-tone contrast because of their experience with pitch variations at the syllable level. In the Vocoded condition, the FM cues and fine spectral details conveying information about F0 variations were severely degraded. In this condition, the opposite pattern of results was observed: French-learning infants discriminated the vocoded lexical tones whereas Mandarin learners did not. Therefore, Mandarin-learning infants seem to have more difficulty than French-learning infants with the degradation of the FM cues and fine spectral details signaling F0 information and, thus, voice-pitch trajectories. Mandarin 10-month-old infants may have learned to attend to and rely specifically on these spectro-temporal fine structure cues to process lexical-tone contrasts.


III. EXPERIMENT 2

A. Method

1. Participants

Data from 64 6-month-old infants were analyzed in this experiment: 32 French-learning infants, 16 in the Intact condition (mean = 197 days, range = 179 to 213 days) and 16 in the Vocoded condition (mean = 201 days, range = 183 to 208 days), and 32 Mandarin-learning infants, 16 in the Intact condition (mean = 197 days, range = 167 to 219 days) and 16 in the Vocoded condition (mean = 192 days, range = 173 to 209 days). An additional 53 infants participated in the study but were not included, for the following reasons: fussing and crying (n=42) and failure to reach the habituation criterion (n=11).

2. Stimuli and procedure

The stimuli and procedure were exactly the same as those used in Experiment 1.

B. Results

As in Experiment 1, preference scores were computed for each 6-month-old infant in each language group and for each stimulus condition. Figure 4 represents the mean preference scores for the four groups of 6-month-olds. In the Vocoded condition, all scores are above 50% except for the group of Mandarin-learning infants. Moreover, in the Intact condition, the scores are only around 55% for both language groups.


Figure 4. Mean preference scores for novel trials in the test phase in the two signal-processing conditions (Intact and Vocoded) for 6-month-old infants learning French and Mandarin. The error bars represent the standard errors. The dotted horizontal line shows chance level (50%).

As in Experiment 1, the data were collapsed across habituation stimulus (rising versus low) because a preliminary ANOVA revealed no main effect of, or interaction with, this factor. A between-groups comparison was then run on the preference scores in a 2 (Condition: Intact versus Vocoded) × 2 (Language: French versus Mandarin) ANOVA. The analysis revealed a marginal effect of Condition (F(1,60)=3.25; p=.076), no effect of Language (F(1,60)=1.17; p=.28), and no interaction between Language and Condition (F(1,30)=1.01; p=.32). One-sample t-tests were then run in each group to compare the infants' preference scores to chance level (50%). In the Intact condition, a significant preference for novelty was observed for both language groups: the French group [mean = 54.97%, SD = 11.1; one-tailed t(15)=1.79, p = .047] and the Mandarin group [mean = 54.8%, SD = 9.3; one-tailed t(15)=2.07, p = .028]. In the Vocoded condition, the preference for novelty was not significantly above chance in either group: a marginal effect was observed for the French group [mean = 53.11%, SD = 7.7; one-tailed t(15)=1.62, p = .063], but no effect for the Mandarin group [mean = 48%, SD = 8.8; one-tailed t(15)=-0.79, p = .22]. Thus, no clear difference between language groups was observed in 6-month-old infants. Nevertheless, the data indicated that both groups of 6-month-olds were able to discriminate the lexical tones, but in the Intact condition only.

C. Discussion

Six-month-old infants learning Mandarin or French showed the same pattern of results in the two signal-processing conditions: both groups discriminated the intact lexical tones, and both groups failed to discriminate the vocoded lexical tones. This result suggests that the difference observed between the 10-month-old infants from different linguistic backgrounds is related to age and linguistic experience. Spectro-temporal fine structure cues influence lexical-tone discrimination in 6-month-old infants and in Mandarin-learning 10-month-olds. Thus, exposure to a tonal language may contribute to the weight given to the fine spectro-temporal cues conveying voice-pitch information in lexical-tone perception. Mandarin-learning 10-month-old infants may have a greater ability to use the pitch trajectory of lexical tones signaled by spectro-temporal fine structure cues. However, a replication of the 10-month-olds' results with non-speech stimuli, that is, in the absence of phonetic information, would provide stronger evidence of an influence of linguistic experience on the auditory processing of FM and fine spectral cues. A third experiment was therefore designed to verify whether Mandarin 10-month-olds have better sensitivity than their French peers to the relatively slow F0-modulation patterns found in lexical tones. More precisely, discrimination of F0 trajectories in non-speech signals corresponding to click trains was assessed in French and Mandarin 10-month-olds.

IV. EXPERIMENT 3

Experiment 3 aimed to evaluate the perception of the F0-modulation patterns, or "F0 contours", per se (that is, in the absence of any other phonetic cues), and whether native-language experience could influence performance at 10 months of age. A signal-processing condition was designed to extract specifically the F0 variations of the rising and low lexical tones. It was assumed that if exposure to a tonal language induces better perception of the F0 modulations found in lexical tones, Mandarin-learning 10-month-old infants would discriminate the contour differences between non-speech sounds better than French-learning 10-month-olds.

A. Method

1. Participants

Two other groups of 10-month-old infants learning either French or Mandarin were tested. Data from 32 10-month-old infants were analyzed in this experiment: 16 French-learning infants (mean = 316 days, range = 307-330 days) and 16 Mandarin-learning infants (mean = 311 days, range = 300-338 days). An additional 39 infants participated in the study but were not included, for the following reasons: fussing and crying (n=33) and failure to reach the habituation criterion (n=6).

2. Stimuli and Procedure

The same original stimuli (low versus rising Thai lexical tones) were used in the present experiment. The signal-processing algorithm was designed to generate stimuli in a Non-Speech condition. The F0 trajectory of each original lexical tone was first extracted using the YIN algorithm (de Cheveigné & Kawahara, 2002). This F0 trajectory was then used to modulate the periodicity of a broadband click train over time (more precisely, the signal was a train of 88-microsecond square pulses repeated at a rate equal to 1/F0). The click trains were limited to the frequency range from 80 to 22050 Hz and were equated in rms power. Figure 5 represents the spectrograms of these non-speech stimuli. New audio files were generated with these non-speech stimuli according to the procedure of Experiment 1. The procedure was the same as in Experiments 1 and 2 except that the test phase was shortened to only four trials (two novel and two familiar), because of the extremely artificial, buzz-like timbre of these stimuli.
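The click-train synthesis described above can be illustrated as follows. This is a simplified sketch of the general technique (my own, not the authors' code): it assumes the F0 trajectory has already been extracted (e.g., by YIN) and places one short square pulse per period, so that the inter-click interval tracks 1/F0. The sampling rate is an assumption (44100 Hz, consistent with the 22050-Hz upper limit), and the band-limiting to 80–22050 Hz and rms equalization used in the study are omitted.

```python
import numpy as np

def click_train_from_f0(f0_trajectory, duration, fs=44100, pulse_dur=88e-6):
    """Synthesize a broadband click train whose inter-click interval
    follows a time-varying F0 (interval = 1/F0 at each click time)."""
    f0_times = np.linspace(0.0, duration, len(f0_trajectory))
    signal = np.zeros(int(round(duration * fs)))
    pulse_len = max(1, int(round(pulse_dur * fs)))   # ~88-microsecond square pulse
    t = 0.0
    while t < duration:
        f0 = np.interp(t, f0_times, f0_trajectory)   # instantaneous F0 at time t
        if f0 <= 0:                                  # guard against unvoiced frames
            break
        start = int(round(t * fs))
        signal[start:start + pulse_len] = 1.0        # place one square pulse
        t += 1.0 / f0                                # next click one F0 period later
    return signal

# Hypothetical rising contour, 180 -> 260 Hz over 0.4 s
rising = click_train_from_f0(np.linspace(180.0, 260.0, 50), duration=0.4)
```

A rising F0 trajectory thus produces progressively shorter inter-click intervals, carrying the pitch contour without any other phonetic information.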


Figure 5. Spectrograms of the non-speech version of a /ba/ low (left panel) and /ba/ rising (right panel).

B. Results

As in Experiments 1 and 2, preference scores were calculated for each language group. Figure 6 represents these scores for each group. It shows that in this condition Mandarin infants exhibited a higher proportion of LT for novel trials than French infants.

Figure 6. Mean preference scores in the test phase in the Non-Speech condition for the 10-month-old infants learning French and Mandarin. The error bars represent the standard errors. The dotted horizontal line shows chance level (50%).

As in previous experiments, there was no effect of the habituation stimulus (rising versus low) on preference scores. Thus, data were collapsed across this variable in the main analyses. Moreover, an ANOVA revealed no effect of Language (F(1,30)=2.03; p=.16) on the preference scores.


One-sample t-tests were calculated for each group to assess whether the infants' preference scores were significantly higher than chance level. The French 10-month-olds did not show a significant preference for novelty [mean = 54.09%, SD = 14.8; one-tailed t(15)=1.11; p=.14]. However, the Mandarin-learning infants showed a significant preference for novelty in the same condition [mean = 61.1%, SD = 12.95; one-tailed t(15)=3.43; p=.002]. Thus, the same pattern of results as in the Intact condition was observed between French- and Mandarin-learning 10-month-old infants in this Non-Speech condition.

C. Discussion

The purpose of Experiment 3 was to compare the ability of Mandarin-learning and French-learning 10-month-olds to discriminate several utterances of two different categories of F0 contours per se. The F0 contours were extracted from the original lexical tones and applied to click trains. In this Non-Speech condition, only the Mandarin-learning infants were able to discriminate the F0 contours. This finding is inconsistent with the results of Mattock and Burnham (2006), who found that both English- and Mandarin-learning 10-month-old infants were able to discriminate F0 contours played by a musical instrument (a violin). The F0 trajectories of the present non-linguistic stimuli correspond precisely to those of the linguistic stimuli and thus provide a better test of the ability to discriminate F0 contours as found in lexical tones in the absence of phonetic information. Our results confirm that Mandarin-learning 10-month-olds are more sensitive to the F0 contours of lexical tones than their French peers. These results, together with those of Experiment 1 showing that Mandarin-learning 10-month-olds are less able to perceive the same F0 variations in the Vocoded condition (that is, in the absence of salient voice-pitch cues conveyed by fine spectro-temporal cues), could be considered two sides of the same coin: at 10 months of age, Mandarin-learning infants demonstrate an enhanced capacity to discriminate (and a dependence on) F0 contours compared to French-learning infants.


V. GENERAL DISCUSSION

The three experiments of the present study were designed to investigate the relatively basic auditory mechanisms (i.e., the processing of spectro-temporal modulation cues) underlying the speech reorganization observed during the first year of life. Several explanatory factors, such as computational and cognitive factors (e.g., Conboy et al., 2008; Saffran, 2002), have been explored to explain the increasing efficiency in the discrimination of native phonetic contrasts and the decline in the ability to discriminate difficult non-native contrasts. It is now well accepted that exposure to a specific language input shapes phonetic categories early in infancy (see Kuhl, 2004; Werker & Tees, 2005). On the other hand, when exploring the auditory mechanisms constraining speech perception in adults, psychoacoustic studies have found that the role of spectro-temporal modulation cues in speech identification differs according to native language. More precisely, a different use of temporal-envelope cues (AM cues) and spectro-temporal fine structure cues (i.e., FM cues and fine spectral details) has been shown for tonal versus non-tonal languages (e.g., Smith et al., 2002; Xu & Pfingst, 2008). Such results led us to hypothesize that a reorganization of the weights of these spectro-temporal modulation cues may occur during the first year of life along with the progressive specialization of infants to the sounds of their native language. In the present study, the discrimination of lexical tones (/ba/ rising versus /ba/ low), similar to those used in recent psycholinguistic developmental investigations (Mattock & Burnham, 2006; Mattock et al., 2008), was assessed using degraded speech sounds in Mandarin- and French-learning infants. The findings show that the perceptual reorganization for speech is concomitant with a reorganization of AM and FM processing by the auditory system.
When the speech signals preserved the speech modulation cues of the original tone signals (as in the Intact condition), French-learning 10-month-old infants did not discriminate a lexical-tone contrast whereas Mandarin-learning 10-month-olds and 6-month-olds did. These results support previous ones revealing a decline in responsiveness to non-native phonological characteristics, including lexical tones, after 6 months (e.g., Mattock & Burnham, 2006; Mattock et al., 2008). The results observed in a condition degrading the fine spectro-temporal modulation cues of the tone signals (the Vocoded condition) show a completely reversed pattern: only the French-learning 10-month-olds demonstrated a discrimination response to the vocoded lexical tones. In this Vocoded condition, the fine spectro-temporal cues (FM and fine spectral details) conveying F0, and thus voice-pitch information, were severely degraded, and this had a clearly detrimental effect for listeners discriminating a lexical-tone contrast. The present results indicate that fine spectro-temporal cues are necessary for discriminating lexical tones in the youngest infants, whatever their linguistic environment, and in 10-month-olds learning lexical tones. Moreover, these results suggest that French-learning infants, who become progressively less sensitive to lexical-tone variations, may remain (or become more) sensitive to the residual acoustic cues present in the Vocoded condition, such as AM cues. Furthermore, linguistic experience may have prevented Mandarin-learning 10-month-old infants from relying suitably on the remaining acoustic cues in the absence of the spectro-temporal fine structure cues. This result is also consistent with the study by Beach and Kitamura (2011) showing that 9-month-old infants attended to specific acoustic cues (i.e., the spectral profile) when discriminating a native phonetic contrast.
Moreover, the results of Experiment 3 strongly suggest that 10-month-old infants differ in their ability to process F0 contours, even for non-speech stimuli, as a result of their native-language experience. Mandarin-learning 10-month-old infants efficiently track F0 variations in both speech and non-speech contexts (F0 trajectories, rising versus low, extracted from the original lexical tones and applied to trains of clicks). Contrary to French-learning 10-month-olds, they seem to have learned to rely more on F0 variations.
The present results suggest that early linguistic experience not only determines discrimination performance on native and non-native phonetic contrasts, but also shapes the perception of basic acoustic cues such as spectral and temporal modulations. Exposure to a tonal language may improve listeners' ability to track and use the pitch trajectory of lexical tones signaled by these spectro-temporal modulations. Thus, the auditory perception of the modulation (AM and FM) cues underlying speech perception is not constrained in a rigid manner by a hard-wired architecture (e.g., Elhilali, Chi, & Shamma, 2003; Jørgensen, Ewert, & Dau, 2013). On the contrary, it is flexible and dependent on exposure to a specific auditory input. This is strongly consistent with the training effects on pitch perception observed for musicians, lexical-tone users and trained subjects (e.g., Chandrasekaran, Krishnan, & Gandour, 2007; Fitzgerald & Wright, 2011; Micheyl, Delhommeau, Perrot, & Oxenham, 2006), effects that have also been observed at a low level of the auditory neural pathway (i.e., in the brainstem; e.g., Kraus & Chandrasekaran, 2010; Wong, Skoe, Russo, Dees, & Kraus, 2007). More precisely, the detection of AM and FM cues has been shown to improve with training in adult listeners (Bruckert, Herrmann & Lorenzi, 2006; Sabin, Eddins, & Wright, 2012). Such results are also consistent with recent demonstrations of long-term plasticity for spectro-temporal modulation processing in the animal auditory cortex (e.g., Bao, Chang, Woods, & Merzenich, 2004; Kilgard & Merzenich, 1998; Ohl & Scheich, 2005). Moreover, cortical responses to spectro-temporal variations in acoustic stimuli have been shown to sharpen during development (e.g., Chang, Bao, Imaizumi, Schreiner, & Merzenich, 2005) and to be influenced by the auditory environment or task (e.g., Bao, Chang, Teng, Heiser, & Merzenich, 2013; Bao et al., 2004; Niwa et al., 2012). These psychoacoustical and neurophysiological studies highlight the plasticity of the auditory system for spectro-temporal processing and the impact of the listening environment. They are consistent with the notion that experience shapes the neural representation and perception of speech modulation cues. Our results provide another illustration of the integrated and flexible relationship between the different processing levels of auditory perception.

Finally, from a clinical perspective, this exploration in normal-hearing infants should improve our understanding of speech perception by deaf people equipped with a cochlear implant (CI). CIs are electronic devices that aim to restore hearing in profoundly deaf people by transmitting only AM information over a limited number of frequency channels (8 independent channels for the best CI users, e.g., Dorman & Loizou, 1997; Friesen, Shannon, Baskent, & Wang, 2001). Today, CIs are proposed before the age of 12 months and provide great benefits for the development of spoken language in deaf infants (e.g., Holt & Svirsky, 2008; Miyamoto, Houston, Kirk, Perdew, & Svirsky, 2003; Svirsky, Robbins, Kirk, Pisoni, & Miyamoto, 2000; Svirsky, Teoh, & Neuburger, 2004).


However, CIs do not convey the fine spectral and FM cues required for accurate voice-pitch perception and may thus be less efficient at conveying lexical-tone variations than consonant variations (e.g., Fu, Zeng, Shannon, & Soli, 1998; Kong & Zeng, 2006; Kuo, Rosen, & Faulkner, 2008). After cochlear implantation, deaf infants are able to perceive and learn a tonal language (e.g., Ciocca, Francis, Aisha, & Wong, 2002; Wei et al., 2000; Zheng et al., 2011). However, children with CIs show large individual variability in their performance and, as a group, show impaired abilities compared to normal-hearing children (e.g., Peng, Tomblin, Cheung, Lin, & Wang, 2004; Xu et al., 2004, 2011; Zhou, Huang, Chen & Xu, 2013; see Xu & Zhou, 2012 for a review). Exploring the ability of normal-hearing infants from different linguistic backgrounds to discriminate vocoded lexical-tone signals at an early age may help in understanding speech development in deaf infants wearing CIs. The present data suggest that some aspects of linguistic experience may influence the ability to discriminate lexical tones in the absence of spectro-temporal fine structure cues. The results of the French 10-month-olds in the Vocoded condition provide some evidence that the remaining available information, such as AM cues, can be used to signal lexical-tone differences. Increasing experience in processing the AM cues signaling segmental contrasts in French may have helped these 10-month-olds to use the AM speech cues efficiently in the Vocoded condition. Thus, lexical-tone discrimination might be possible even in this severely degraded condition. This result is consistent with the observation that, although poorer than normal, lexical-tone perception remains possible under CI stimulation for some children with CIs (e.g., Xu & Zhou, 2012; Zheng et al., 2011; Zhou et al., 2013).
Our results corroborate recent findings showing high word-identification scores in children with CIs learning a tonal language, findings that also reveal the large variability in lexical-tone identification abilities under CI stimulation (see Zhou et al., 2013). Note that the use of a vocoder with normal-hearing infants may help us understand the development of language skills in deaf infants wearing CIs, particularly those who have to distinguish fine pitch variations to learn a tonal language via a CI. To conclude, the present findings suggest that the perceptual reorganization occurring at the end of the first year for speech sounds also impacts the basic auditory ability to use speech modulation cues.


ACKNOWLEDGMENTS

The authors wish to thank Ni Zi Cheng for the recruitment of infants in Taipei, and all the families who participated in this research. The authors also wish to thank Dan Gnansia for designing the signal-processing algorithms and Kelly Tremblay for reviewing. C. Lorenzi was supported by a grant from the ANR (HEARFIN project). This work was also supported by ANR-11-0001-02 PSL* and ANR-10-LABX-0087.

REFERENCES

Abramson, A. S. (1978). Static and dynamic acoustic cues in distinctive tones. Language and Speech, 21(4), 319–325.
Aslin, R. N., & Pisoni, D. B. (1980). Some developmental processes in speech perception. Child Phonology, 2, 67–96.
Aslin, R. N., Pisoni, D. B., Hennessy, B. L., & Perey, A. J. (1981). Discrimination of voice onset time by human infants: New findings and implications for the effects of early experience. Child Development, 52(4), 1135–1145.
Bao, S., Chang, E. F., Teng, C.-L., Heiser, M. A., & Merzenich, M. M. (2013). Emergent categorical representation of natural, complex sounds resulting from the early post-natal sound environment. Neuroscience, 248C, 30–42.
Bao, S., Chang, E. F., Woods, J., & Merzenich, M. M. (2004). Temporal plasticity in the primary auditory cortex induced by operant perceptual learning. Nature Neuroscience, 7(9), 974–981.
Beach, E. F., & Kitamura, C. (2011). Modified spectral tilt affects older, but not younger, infants' native-language fricative discrimination. Journal of Speech, Language and Hearing Research, 54(2), 658–667.
Bent, T., Bradlow, A. R., & Wright, B. A. (2006). The influence of linguistic experience on the cognitive processing of pitch in speech and nonspeech sounds. Journal of Experimental Psychology: Human Perception and Performance, 32(1), 97–103.
Bertoncini, J., Nazzi, T., Cabrera, L., & Lorenzi, C. (2011). Six-month-old infants discriminate voicing on the basis of temporal envelope cues. The Journal of the Acoustical Society of America, 129(5), 2761–2764.
Best, C. T., McRoberts, G. W., LaFleur, R., & Silver-Isenstadt, J. (1995). Divergent developmental patterns for infants' perception of two nonnative consonant contrasts. Infant Behavior and Development, 18(3), 339–350.


Bosch, L., & Sebastián-Gallés, N. (2003). Simultaneous bilingualism and the perception of a language-specific vowel contrast in the first year of life. Language and Speech, 46(2-3), 217–243.
Bruckert, L., Herrmann, M., & Lorenzi, C. (2006). No adaptation in the amplitude modulation domain in trained listeners. The Journal of the Acoustical Society of America, 119, 3542–3545.
Burnham, D., & Francis, E. (1997). The role of linguistic experience in the perception of Thai tones. Southeast Asian Linguistic Studies in Honour of Vichin Panupong, 29–47.
Cabrera, L., Bertoncini, J., & Lorenzi, C. (2013). Perception of speech modulation cues by 6-month-old infants. Journal of Speech, Language, and Hearing Research, in press.
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128, 148–156.
Chang, E. F., Bao, S., Imaizumi, K., Schreiner, C. E., & Merzenich, M. M. (2005). Development of spectral and temporal response selectivity in the auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 102(45), 16460–16465.
Cheour, M., Ceponiene, R., Lehtokoski, A., Luuk, A., Allik, J., Alho, K., & Näätänen, R. (1998). Development of language-specific phoneme representations in the infant brain. Nature Neuroscience, 1(5), 351–353.
Ciocca, V., Francis, A. L., Aisha, R., & Wong, L. (2002). The perception of Cantonese lexical tones by early-deafened cochlear implantees. The Journal of the Acoustical Society of America, 111, 2250–2256.
Cohen, L. B., Atkinson, D. J., & Chaput, H. H. (2000). Habit 2000: A new program for testing infant perception and cognition. Austin: The University of Texas.
Conboy, B. T., Sommerville, J. A., & Kuhl, P. K. (2008). Cognitive control factors in speech perception at 11 months. Developmental Psychology, 44(5), 1505–1512.
De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Dorman, M. F., & Loizou, P. C. (1997). Speech intelligibility as a function of the number of channels of stimulation for normal-hearing listeners and patients with cochlear implants. The American Journal of Otology, 18(6 Suppl), S113–114.
Dudley, H. (1939). Remaking speech. The Journal of the Acoustical Society of America, 11, 169–177.
Elhilali, M., Chi, T., & Shamma, S. A. (2003). A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Communication, 41(2), 331–348.


Fitzgerald, M. B., & Wright, B. A. (2011). Perceptual learning and generalization resulting from training on an auditory amplitude-modulation detection task. The Journal of the Acoustical Society of America, 129, 898–906.
Friesen, L. M., Shannon, R. V., Baskent, D., & Wang, X. (2001). Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants. The Journal of the Acoustical Society of America, 110(2), 1150–1163.
Fu, Q.-J., Zeng, F.-G., Shannon, R. V., & Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. The Journal of the Acoustical Society of America, 104, 505–515.
Gandour, J. T., & Harshman, R. A. (1978). Crosslanguage differences in tone perception: A multidimensional scaling investigation. Language and Speech, 21(1), 1–33.
Gava, L., Valenza, E., Turati, C., & De Schonen, S. (2008). Effect of partial occlusion on newborns' face preference and recognition. Developmental Science, 11(4), 563–574.
Glasberg, B. R., & Moore, B. C. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1), 103–138.
Gnansia, D., Péan, V., Meyer, B., & Lorenzi, C. (2009). Effects of spectral smearing and temporal fine structure degradation on speech masking release. The Journal of the Acoustical Society of America, 125(6), 4023–4033.
Holt, R. F., & Svirsky, M. A. (2008). An exploratory look at pediatric cochlear implantation: Is earliest always best? Ear and Hearing, 29(4), 492–511.
Houston, D. M., Horn, D. L., Qi, R., Ting, J. Y., & Gao, S. (2007). Assessing speech discrimination in individual infants. Infancy, 12(2), 119–145.
Jørgensen, S., Ewert, S. D., & Dau, T. (2013). A multi-resolution envelope-power based model for speech intelligibility. The Journal of the Acoustical Society of America, 134, 436–446.
Kilgard, M. P., & Merzenich, M. M. (1998). Plasticity of temporal information processing in the primary auditory cortex. Nature Neuroscience, 1(8), 727–731.
Kong, Y.-Y., & Zeng, F.-G. (2006). Temporal and spectral cues in Mandarin tone recognition. The Journal of the Acoustical Society of America, 120(5 Pt 1), 2830–2840.
Kraus, N., & Chandrasekaran, B. (2010). Music training for the development of auditory skills. Nature Reviews Neuroscience, 11(8), 599–605.
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21.


Kuhl, P. K., Tsao, F. M., & Liu, H. M. (2003). Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences, 100(15), 9096–9101.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608.
Kuo, Y.-C., Rosen, S., & Faulkner, A. (2008). Acoustic cues to tonal contrasts in Mandarin: Implications for cochlear implants. The Journal of the Acoustical Society of America, 123, 2815–2864.
Liang, Z. A. (1963). The auditory perception of Mandarin tones. Acta Phys. Sin, 26, 85–91.
Liu, H.-M., Tsao, F.-M., & Kuhl, P. K. (2007). Acoustic analysis of lexical tone in Mandarin infant-directed speech. Developmental Psychology, 43(4), 912–917.
Mattock, K., & Burnham, D. (2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265.
Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106(3), 1367–1381.
Micheyl, C., Delhommeau, K., Perrot, X., & Oxenham, A. J. (2006). Influence of musical and psychoacoustical training on pitch discrimination. Hearing Research, 219(1), 36–47.
Miyamoto, R. T., Houston, D. M., Kirk, K. I., Perdew, A. E., & Svirsky, M. A. (2003). Language development in deaf infants following cochlear implantation. Acta Oto-Laryngologica, 123(2), 241–244.
Narayan, C. R., Werker, J. F., & Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: Evidence from nasal place discrimination. Developmental Science, 13(3), 407–420.
Nittrouer, S. (2002). Learning to perceive speech: How fricative perception changes, and how it stays the same. The Journal of the Acoustical Society of America, 112, 711–719.
Niwa, M., Johnson, J. S., O'Connor, K. N., & Sutter, M. L. (2012). Active engagement improves primary auditory cortical neurons' ability to discriminate temporal modulation. The Journal of Neuroscience, 32(27), 9323–9334.
Ohl, F. W., & Scheich, H. (2005). Learning-induced plasticity in animal and human auditory cortex. Current Opinion in Neurobiology, 15(4), 470–477.
Patterson, R. D. (1987). A pulse ribbon model of monaural phase perception. The Journal of the Acoustical Society of America, 82(5), 1560–1586.


Peng, S.-C., Tomblin, J. B., Cheung, H., Lin, Y.-S., & Wang, L.-S. (2004). Perception and production of Mandarin tones in prelingually deaf children with cochlear implants. Ear and Hearing, 25(3), 251–264.
Pisoni, D. B. (1977). Identification and discrimination of the relative onset time of two component tones: Implications for voicing perception in stops. The Journal of the Acoustical Society of America, 61(5), 1352–1361.
Polka, L., Colantonio, C., & Sundara, M. (2001). A cross-language comparison of /d/–/ð/ perception: Evidence for a new developmental pattern. The Journal of the Acoustical Society of America, 109, 2190–2201.
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421–435.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73(3), 265–292.
Rivera-Gaxiola, M., Silva-Pereyra, J., & Kuhl, P. K. (2005). Brain potentials to native and non-native speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8(2), 162–172.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 336(1278), 367–373.
Sabin, A. T., Eddins, D. A., & Wright, B. A. (2012). Perceptual learning evidence for tuning to spectrotemporal modulation in the human auditory system. The Journal of Neuroscience, 32(19), 6542–6549.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928.
Saffran, J. R. (2002). Constraints on statistical language learning. Journal of Memory and Language, 47(1), 172–196.
Sebastián-Gallés, N., & Bosch, L. (2009). Developmental shift in the discrimination of vowel contrasts in bilingual infants: Is the distributional account all there is to it? Developmental Science, 12(6), 874–887.
Shamma, S., & Lorenzi, C. (2013). On the balance of envelope and temporal fine structure in the encoding of speech in the early auditory system. The Journal of the Acoustical Society of America, 133(5), 2818–2833.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416(6876), 87–90.
Steeneken, H. J., & Houtgast, T. (1980). A physical method for measuring speech-transmission quality. The Journal of the Acoustical Society of America, 67, 318–326.
Svirsky, M. A., Robbins, A. M., Kirk, K. I., Pisoni, D. B., & Miyamoto, R. T. (2000). Language development in profoundly deaf children with cochlear implants. Psychological Science, 11(2), 153–158.


Svirsky, M. A., Teoh, S.-W., & Neuburger, H. (2004). Development of language and speech perception in congenitally, profoundly deaf children as a function of age at cochlear implantation. Audiology and Neurotology, 9(4), 224–233.
Swaminathan, J., Krishnan, A., & Gandour, J. T. (2008). Pitch encoding in speech and nonspeech contexts in the human auditory brainstem. Neuroreport, 19(11), 1163–1167.
Tsao, F.-M., Liu, H.-M., & Kuhl, P. K. (2006). Perception of native and non-native affricate-fricative contrasts: Cross-language tests on adults and infants. The Journal of the Acoustical Society of America, 120(4), 2285–2294.
Tsushima, T., Takizawa, O., Sasaki, M., Shiraki, S., Nishi, K., Kohno, M., … Best, C. (1994). Discrimination of English /r/–/l/ and /w/–/y/ by Japanese infants at 6-12 months: Language-specific developmental changes in speech perception abilities. In Third International Conference on Spoken Language Processing.
Wei, W. I., Wong, R., Hui, Y., Au, D. K., Wong, B. Y., Ho, W. K., … Chung, E. (2000). Chinese tonal language rehabilitation following cochlear implantation in children. Acta Oto-Laryngologica, 120(2), 218–221.
Werker, J. F., Shi, R., Desjardins, R., Pegg, J. E., Polka, L., & Patterson, M. (1998). Three methods for testing infant speech perception. In A. Slater (Ed.), Perceptual development: Visual, auditory, and speech perception in infancy (pp. 389–420). East Sussex, United Kingdom: Psychology Press.
Werker, J. F., & Tees, R. C. (1983). Developmental changes across childhood in the perception of non-native speech sounds. Canadian Journal of Psychology, 37(2), 278–286.
Werker, J. F., & Tees, R. C. (2005). Speech perception as a window for understanding plasticity and commitment in language systems of the brain. Developmental Psychobiology, 46(3), 233–251.
Wong, P. C., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420–422.
Xu, L., Chen, X., Lu, H., Zhou, N., Wang, S., Liu, Q., … Han, D. (2011). Tone perception and production in pediatric cochlear implant users. Acta Oto-Laryngologica, 131(4), 395–398.
Xu, L., Li, Y., Hao, J., Chen, X., Xue, S. A., & Han, D. (2004). Tone production in Mandarin-speaking children with cochlear implants: A preliminary study. Acta Oto-Laryngologica, 124(4), 363–367.
Xu, L., & Pfingst, B. E. (2008). Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research, 242(1-2), 132–140.
Xu, L., & Zhou, N. (2012). Tonal languages and cochlear implants. In Auditory Prostheses (pp. 341–364). Springer.


Xu, Y., Gandour, J. T., & Francis, A. L. (2006). Effects of language experience and stimulus complexity on the categorical perception of pitch direction. The Journal of the Acoustical Society of America, 120(2), 1063–1074.
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68, 123–139.
Yip, M. (2002). Tone. Cambridge University Press.
Zheng, Y., Soli, S. D., Tao, Y., Xu, K., Meng, Z., Li, G., … Zheng, H. (2011). Early prelingual auditory development and speech perception at 1-year follow-up in Mandarin-speaking children after cochlear implantation. International Journal of Pediatric Otorhinolaryngology, 75(11), 1418–1426.
Zhou, N., Huang, J., Chen, X., & Xu, L. (2013). Relationship between tone perception and production in prelingually deafened children with cochlear implants. Otology & Neurotology, 34(3), 499–506.


General discussion

Over the last few decades, a myriad of psycholinguistic studies has revealed sophisticated speech-perception abilities in young infants (e.g., categorical perception, a preference for the mother's voice; see Kuhl, 2004). Early studies attempted to describe the acoustic properties of speech sounds and the extent to which humans are specialized in processing these properties (e.g., Eimas, Siqueland, Jusczyk, & Vigorito, 1971). Since then, very few developmental studies have attempted to further characterize the acoustic cues used by infants in speech perception (although see Mehler et al., 1988; Newman, 2009). At the same time, a wealth of psychoacoustic and neurophysiological studies has investigated the development of the auditory system using non-linguistic sounds. They showed that the peripheral auditory system is functional and well developed around 6 months of age, although it continues to mature into late childhood (see Saffran, Werker, & Werner, 2006; and Werner, Fay, & Popper, 2012 for reviews). More specifically, these studies showed that spectral and temporal auditory resolution mature over the first six months of life. This progressive maturation should affect the development of speech perception, given that, at least for adults, resolving fine spectral and temporal details in speech sounds is indispensable to speech perception (especially in adverse listening conditions). Despite this, infants demonstrate exquisite abilities to perceive speech sounds and to learn the properties of their native language through exposure and social interaction during their first six to twelve months of life (Kuhl, 2004). Based on this review, it appears that further experimental work is required to investigate more directly the mutual dependence between spectro-temporal auditory processing on the one hand and speech processing on the other during the first year of life.

Goals and main results of this doctoral research

The main purpose of the present doctoral research was to investigate early speech perception abilities by using a psychoacoustic approach to speech perception inspired by the pioneering work of Shannon, Zeng, Kamath, Wygonski and Ekelid (1995) and Zeng et al. (2005). According to this approach, the most important cues conveyed by speech sounds are spectro-temporal modulations
(i.e., AM and FM), and the peripheral and central auditory systems operate as sophisticated demodulation systems (see Moore, 2004 for a review). The five experiments reported in this dissertation were aimed at evaluating, for the first time in the literature, infants' processing of slow and fast speech modulation cues under different conditions of vocoded speech. The last two experiments, which formed the final part of this work, were designed to explore more specifically the influence of the linguistic environment on infants' processing of speech modulation cues. In these experiments, infants' ability to discriminate speech syllables processed by vocoders was tested using two behavioral methods: the head-turn preference procedure (HPP) and the visual-habituation procedure. The main results of the experiments conducted with infants are summarized in Table 1.

[Table 1 layout. Columns (vocoded conditions by contrast): Voicing (/aba/-/apa/): 32-band AM+FM, 32-band AM, 32-band AM<16 Hz, 4-8-16-band AM; Place (/aba/-/ada/): 32-band AM+FM, 32-band AM, 32-band AM<16 Hz, 8-band AM; Lexical tone (rising-low): 32-band AM+FM, 8-band AM. Rows (groups): French 6-month-olds, French 10-month-olds, Mandarin 6-month-olds, Mandarin 10-month-olds.]

Table 1. Discrimination results for different phonetic and tone contrasts in 6- and 10-month-olds learning French or Mandarin. Successful discrimination is shown by green symbols and absence of discrimination by red symbols.

1. Fine spectro-temporal details are not required for accurate phonetic discrimination in French-learning infants

The first three studies were performed with normal-hearing French 6-month-old infants (Chapters 2, 3 and 4; Bertoncini, Nazzi, Cabrera & Lorenzi, 2011; Cabrera, Bertoncini & Lorenzi, 2013; Cabrera, Lorenzi & Bertoncini, in
preparation). The French VCVs used as stimuli differed primarily in voicing (/aba/-/apa/) or in place of articulation (/aba/-/ada/). These disyllables were processed by several vocoders designed to selectively degrade the AM and FM components within a given number of frequency bands. Overall, the results showed that normal-hearing 6-month-old infants are able to discriminate the voicing and place contrasts when: (i) FM cues are replaced by sine-wave tones or noise bands; (ii) AM cues are reduced to their slowest fluctuations (< 16 Hz); and (iii) AM cues are extracted within a reduced number of frequency bands (4 or 16 broad frequency bands, compared to 32 bands). Thus, at 6 months of age, infants are able to rely on severely reduced modulation cues to discriminate voiced versus voiceless, and labial versus coronal, stop consonants. As found in adults (e.g., Shannon et al., 1995; Zeng et al., 2005), FM cues are not required for speech perception in quiet. Moreover, as for adults, infants are able to use gross AM speech cues, extracted below 16 Hz or within a limited number of bands, to discriminate phonetic contrasts.
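The vocoder manipulations described above can be illustrated in a few lines of code. The following is a minimal, dependency-free sketch rather than the processing used in the experiments: it assumes log-spaced brick-wall analysis bands and a sine carrier at each band's geometric centre, whereas the actual studies used carefully designed filterbanks and demodulation filters.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude zero-phase brick-wall band-pass filter via the FFT."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(X, len(x))

def lowpass_fft(x, fs, cutoff):
    """Zero-phase brick-wall low-pass filter via the FFT."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[freqs > cutoff] = 0.0
    return np.fft.irfft(X, len(x))

def hilbert_envelope(x):
    """Amplitude envelope: modulus of the analytic signal."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0
    return np.abs(np.fft.ifft(X * h))

def tone_vocoder(x, fs, n_bands=8, f_lo=80.0, f_hi=7000.0, am_cutoff=None):
    """Keep each band's AM; discard FM by remodulating a fixed sine carrier.

    am_cutoff=16.0 additionally restricts each band's envelope to its
    slowest fluctuations, as in the 'AM < 16 Hz' conditions.
    """
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    t = np.arange(len(x)) / fs
    y = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass_fft(x, fs, lo, hi)
        env = hilbert_envelope(band)
        if am_cutoff is not None:
            env = np.clip(lowpass_fft(env, fs, am_cutoff), 0.0, None)
        y += env * np.sin(2.0 * np.pi * np.sqrt(lo * hi) * t)  # band-centre tone
    return y
```

Swapping the sine carriers for band-limited noise yields the noise-vocoded variant; lowering `n_bands` degrades spectral resolution, and setting `am_cutoff` degrades temporal resolution, mirroring the two dimensions manipulated in the studies.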

1.1 Implications for speech perception and development

1.1.1 Role of slow AM cues

As shown in Figure 1, the slow modulations (below 16 Hz) correspond to the syllabic rate of normal speech (3-4 Hz; see Greenberg, Hollenback, & Ellis, 1996; Rosen, 1992). The envelope features in this modulation range are mainly described by the acoustic features of intensity, duration, rise time and fall time, which are associated with the perceptual correlates of loudness, length, attack and decay (see Rosen, 1992). Faster AM cues (> 16 Hz) are related to formant transitions, bursts and the periodic fluctuations produced by the vocal folds at the F0 rate (Rosen, 1992). Our results are consistent with the vast psycholinguistic literature (cf. Chapter 1) demonstrating that infants and neonates (i) use syllables as a unit to segment speech sounds (particularly in French, which is a syllable-timed language; e.g., Bertoncini, Floccia, Nazzi, & Mehler, 1995; Bertoncini & Mehler, 1981; Bijeljac-Babic, Bertoncini, & Mehler, 1993) and (ii) are able to distinguish phonetic differences between syllables (e.g., see Kuhl, 2004 for a review).


Moreover, our results are consistent with previous studies (described in Chapter 1, section 2.1.1) demonstrating that infants are able to discriminate different languages when presented with low-pass filtered speech (< 400 Hz cutoff frequency) conveying reduced spectral cues, or with backward speech conveying reduced prosodic cues (e.g., Mehler et al., 1988). These studies emphasized the role of prosodic cues related to voice pitch in speech perception, but low-pass filtering and reversing speech in time are relatively gross manipulations. Specifically, they did not completely and selectively reduce the spectro-temporal modulation cues conveying either prosodic or phonetic information. Our approach allowed us to perform more precise and selective degradations of the speech signal along the spectral and temporal (AM/FM) dimensions.

Figure 1. A plot of the modulation spectrum of typical speech. The modulation index (m) - the modulation depth of the envelope fluctuations - is plotted against modulation frequency. Linguistic units (words and syllables) associated with different modulation rates are indicated with arrows. AM rates above about 50 Hz correspond to the periodic, F0-related vibrations of the vocal folds. AM rates between 20 and 50 Hz correspond to formant transitions. Adapted from Plomp (1983).
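A modulation spectrum of the kind plotted in Figure 1 can be approximated from a recording with a few lines of code. The following is a rough sketch under simplifying assumptions (rectify-and-low-pass envelope extraction on the broadband signal; analyses such as Plomp's typically work per auditory band and average across bands):

```python
import numpy as np

def modulation_spectrum(x, fs, env_cutoff=64.0):
    """Modulation index m as a function of modulation frequency (Hz)."""
    env = np.abs(x)                              # rectified signal
    E = np.fft.rfft(env)
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    E[freqs > env_cutoff] = 0.0                  # keep only slow envelope fluctuations
    env = np.fft.irfft(E, len(env))
    dc = np.mean(env)                            # mean envelope level
    M = np.fft.rfft(env - dc)
    # m = peak amplitude of each envelope fluctuation / mean envelope level
    m = 2.0 * np.abs(M) / (len(env) * dc)
    return freqs, m

# Sanity check: a tone with 50% AM (m = 0.5) at 4 Hz shows a peak of ~0.5 at 4 Hz
fs = 16000
t = np.arange(fs) / fs                           # 1 s of signal, 1-Hz bin spacing
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
mf, m = modulation_spectrum(x, fs)
```

For real speech, the resulting curve peaks around 3-5 Hz, the syllabic rate that dominates Figure 1.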

1.1.2 Role of fast AM cues

The results described in Chapters 3 and 4 (Cabrera et al., 2013; Cabrera et al., in preparation) show that 6-month-olds may have some difficulties in the
processing of temporally reduced syllables (AM < 16 Hz). When the same amount of familiarization time was used across the vocoded conditions, infants showed different patterns of responses across conditions (Cabrera et al., 2013). In other words, the time of exposure to the vocoded speech stimuli had to be adjusted across speech degradations. According to Hunter and Ames (1988; see also Rose, Gottfried, Melloy-Carminar, & Bridger, 1982; Thiessen & Saffran, 2003; Wagner & Sakovits, 1986), such differences in infants' responses may reflect the difficulty of processing certain stimuli. Those results were initially obtained using noise-vocoded stimuli and the HPP method. In subsequent experiments, the original FM cues of the speech stimuli were replaced, within each frequency band, by sine-wave tones rather than by noise bands (Chapter 3; Cabrera et al., in preparation). This was expected to limit the so-called "modulation-masking effects" produced by the random fluctuations induced by the noise carriers at the output of auditory filters (see Dau, Kollmeier, & Kohlrausch, 1997a, 1997b; Kates, 2011). Moreover, an infant-controlled habituation time was used in each vocoded condition. The results obtained in these additional experiments indicate that an even longer habituation time is required by infants when the AM speech cues are severely degraded (< 16 Hz), compared to the other vocoded speech conditions. Rosen (1992) proposed that the fastest AM cues (> 16 Hz) in each frequency band signal formant transitions, bursts and F0-related cues. Our data are consistent with this view by showing that, in infants, the reception of voicing and place is made more difficult when the fastest AM cues are filtered out. In addition, our findings are consistent with the notion that several modulation cues signal phonetic information, but that no single cue is either necessary or sufficient to distinguish phonetic categories.
This is compatible with the view developed in early psycholinguistic studies (see Chapter 1; e.g., Cooper, Liberman, & Borst, 1951; Liberman, Delattre, Cooper, & Gerstman, 1954; Stevens & Blumstein, 1978).

1.2. Implications for auditory perception in infants and adults

1.2.1 Development of AM processing

The auditory system of 6-month-olds is known to be well developed in terms of frequency selectivity and temporal resolution (see Saffran et al., 2006)
when tested with non-linguistic sounds. The first three studies of this PhD work confirm that, at this age, normal-hearing infants are able to encode and use the slowest AM speech cues, and gross AM cues in a limited number of bands, to discriminate phonetic features (that is, relatively abstract information). These studies also show that infants, as early as 6 months, perceive fast (> 16 Hz) speech AM cues and discriminate complex AM cues extracted from either broad (8-ERB-wide) or narrow (1-ERB-wide) bands. These data suggest that all the auditory processing stages involved in the detection, discrimination and recognition of complex AM patterns in quiet are mature by the age of 6 months. Figure 2 illustrates the different stages of the most recent perceptual model of AM processing, developed by Dau et al. (1997a, b) to account for (adult) auditory perception of temporal-envelope information (see also Ives et al., 2013; Jørgensen, Ewert, & Dau, 2013). The discrimination abilities found here in 6-month-olds may indicate that peripheral filters, peripheral nonlinearities, central modulation filters, internal noise and template matching are mature. It is not clear from the literature (see Ardoint, Lorenzi, Pressnitzer, & Gorea, 2008) whether the auditory system is equipped with high-level mechanisms extracting invariant AM features (obviously, such a mechanism should operate beyond the modulation-filtering stage described by Dau et al., 1997a, b). It is important to note that, in the current tasks, discrimination of phonetic features required infants to extract some invariant features of the AM patterns across the different speech utterances (i.e., the different tokens for each category) presented to them. The current data suggest that the extraction of invariant AM cues (irrespective of the way it is achieved) is efficient by the age of 6 months. Moreover, the effect of internal noise is not fully understood in infants and children. Internal noise may be a highly constraining factor for infants' and children's performance in masking paradigms (e.g., Buss, Hall, & Grose, 2006), yet our results indicate that 6-month-old infants succeed in discriminating the phonetic contrasts in all conditions, suggesting that internal noise does not impede discrimination in the present paradigm.


Figure 2. Block diagram of the psychoacoustical model describing the distinct stages of AM processing by the peripheral and central auditory system (Dau et al., 1997a, b; see also Ardoint et al., 2008).
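The modulation-filterbank stage of this model can be caricatured in code. The sketch below is a simplification for illustration only: it uses Gaussian-shaped, constant-Q modulation filters applied to a temporal envelope, whereas the published model uses complex first-order resonators preceded by peripheral filtering and adaptation stages.

```python
import numpy as np

def modulation_filterbank(env, fs, centers=(2.0, 4.0, 8.0, 16.0, 32.0, 64.0), q=1.0):
    """Split a temporal envelope into bands around each modulation centre frequency."""
    E = np.fft.rfft(env)
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    channels = []
    for fc in centers:
        bw = fc / q                                             # constant-Q bandwidth
        gain = np.exp(-0.5 * ((freqs - fc) / (bw / 2.0)) ** 2)  # Gaussian response
        channels.append(np.fft.irfft(E * gain, len(env)))
    return np.array(channels)
```

An envelope fluctuating at 4 Hz excites the 4-Hz channel most strongly, which is the sense in which such a bank makes the slow, syllable-rate AM cues explicit.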

1.2.2 Development of FM processing

Our literature review showed that young infants are able to perceive FM cues, but that this capacity continues to develop until about 10 years of age (e.g., Colombo & Horowitz, 1986; Dawes & Bishop, 2008). Our results confirm that at 6 months of age, and irrespective of native language, normal-hearing infants are able to encode and use FM speech cues and fine spectral details, as shown by the different
response patterns for the intact (AM + FM) and AM-vocoded conditions (see Chapters 3 and 5). These data suggest that all the auditory processing stages involved in the detection, discrimination and recognition of complex FM patterns (e.g., neural phase locking and FM-to-AM conversion; see Ives et al., 2013) are functional by the age of 6 months.

1.2.3 Spectral resolution in infants

The discrimination ability observed in 6-month-olds may appear inconsistent with the results of Eisenberg, Shannon, Martinez, Wygonski and Boothroyd (2000), showing that 7-year-old children require higher frequency resolution than adults and older children to accurately recognize words and sentences. Moreover, 24-month-olds require more than four frequency bands to identify words on the basis of AM cues (Newman & Chatterjee, 2013). However, our results, together with those of Bertoncini, Serniclaes and Lorenzi (2009) with 5-8-year-old children, indicate that 6-month-olds and 5-year-old children can discriminate speech syllables generated with low spectral resolution (4, 8 and 16 frequency bands). This discrepancy suggests that automatic speech processing requires greater redundancy in the speech signal for more sophisticated tasks, such as those testing recognition of words and sentences, and for older children, who likely engage more sophisticated lexical-semantic processing.

1.2.4 Temporal resolution in infants

Regarding the temporal resolution of the speech signal, our experiments reveal that, as for adults, the slowest (< 16 Hz) AM speech cues enable infants to achieve phonetic discrimination (at least, in the case of voicing and place of articulation). However, the results also suggest a contribution of fast (> 16 Hz) AM cues in infants’ speech discrimination in quiet. These findings are consistent with the notion that temporal acuity is adult-like by 6 months of age, but also suggest that the perceptual weight of slow and fast AM speech cues may change during development. At first glance, these results are inconsistent with those obtained in adults showing that the slow AM cues appear to play the most important role in syllable identification (e.g., Drullman, 1996), while fast AM
cues are not required for accurate speech identification in quiet (Shannon et al., 1995; see also Stone, Füllgrabe, & Moore, 2008, the only study showing a role for fast AM rates in speech identification in adults, which was conducted on speech presented against competing talkers). Nevertheless, these adult data do not establish that fast AM rates play no role in speech processing in quiet. It may be that the availability of fast AM rates facilitates speech recognition in quiet in adults, but demonstrating this would require more sensitive methods, such as measures of reaction times in speech recognition tasks. Thus, the results of the present experiments suggest that fast AM (temporal-envelope) cues may play a much greater role in speech perception in quiet for infants than for adults. Future studies might investigate this point further by comparing infants' and adults' abilities to use fast versus slow modulation cues in phoneme discrimination.

1.2.5 Effects of listening conditions

Finally, it is important to note that throughout this PhD work, the stimuli were presented in quiet and the same vocalic context was used (a/C/a). It is possible that the role of modulation cues varies according to listening conditions (i.e., noise, interfering speech, filtering), as it does for adults (e.g., Fogerty, 2011). For instance, infants have been shown to be more disturbed by noise than adults (e.g., Newman, 2009). However, such results may not be caused by poorer temporal or spectral resolution (see Werner, 2013), but rather by factors such as distraction and informational masking. Whether the weight of speech modulation cues changes across listening conditions remains an open question.

2. Linguistic experience changes the weight of temporal envelope and spectro-temporal fine structure cues in phonetic discrimination

The aim of this doctoral research was not only to assess the ability of 6-month-old infants to use AM and FM cues in phonetic discrimination, but also to investigate the development of this ability. In adults, the weight of speech modulation cues differs across phonetic contrasts (see Rosen, 1992; Sheft et al., 2008). Moreover, non-lexical-tone users (such as English or French native
speakers) and lexical-tone users (such as Mandarin native speakers) seem to organize their perception of speech modulation cues differently (e.g., Smith, Delgutte, & Oxenham, 2002; Xu & Pfingst, 2008). The last chapter of this PhD research explored whether experience with a given language influences the weight of these modulation cues (i.e., AM and FM cues) in speech discrimination. Young adults and 6- and 10-month-old infants were tested on lexical-tone contrasts in three vocoded conditions. Overall, the results showed that (i) language experience (French versus Mandarin) shapes the weight of the fine spectral and temporal cues conveying F0 (and thus voice-pitch) information, and that (ii) this shaping starts between 6 and 10 months of life, when infants reorganize their perception of speech sounds and become more tuned to their native language (e.g., Mattock & Burnham, 2006; Werker & Tees, 1984). In addition, the results of this last chapter show that the perception of fine spectro-temporal cues constrains the discrimination of lexical tones. Lower discrimination performance was observed in French- and Mandarin-speaking adults when the fine spectro-temporal details were reduced, in comparison with the intact condition. This parallels other work showing that FM and fast AM cues convey prosodic information such as intonation (see Rosen, 1992). As expected, then, our results indicate that degrading these modulation cues similarly impairs the discrimination of pitch patterns at the syllable level. The same finding is observed in younger infants and in Mandarin-learning 10-month-old infants. At the same time, the discrimination of lexical tones remains possible on the basis of gross AM cues. Although deteriorated, the performance of the adult listeners was still above chance, and the French-learning 10-month-olds exhibited a discrimination response in the vocoded condition.
Furthermore, these two experiments conducted with both adult and infant listeners indicate an influence of native-language experience on the ability to discriminate vocoded lexical tones. Thus, the last two studies (Chapter 5; Cabrera, Tsao, Gnansia, Bertoncini & Lorenzi, submitted; Cabrera, Tsao, Hu, Li, Lorenzi & Bertoncini, in preparation) demonstrate that (long-term) linguistic experience influences the perception of modulation cues in adults, and that infants may also be guided by their early (and relatively short) linguistic experience. Basic auditory processes (i.e., modulation filtering, etc.; see Jørgensen et al., 2013) constrain speech
discrimination abilities, but these auditory processes are also influenced by higher processing mechanisms fine-tuned by language expertise. These experiments illustrate the interplay between (or the mutual dependence of) basic hearing capacities and speech perception abilities at a young age. In the future, additional age groups (younger and older infants) should be tested in order to examine more precisely the language-driven process that shapes speech modulation processing. Additional vocoded conditions should be used to determine precisely the nature of the speech modulation cues that infants learn to use as their linguistic experience grows and as a function of their native language (for instance, studies systematically varying the vocoder frequency resolution from 1 to 32 bands and the vocoder demodulation-filter cutoff frequency from 4 Hz to ERB/2).

2.1. Implications for auditory perception in infants and adults

How could linguistic factors shape the processing of speech modulation cues? In the present experiments conducted with adults, the duration of the inter-stimulus interval (ISI) differentially influenced subjects' performance according to their language experience, suggesting that language factors affect either storage in short-term memory or (high-level) phonological mechanisms. The subjects' linguistic environment might have influenced the capacity of a buffer specialized for pitch (Clément, Demany, & Semal, 1999). Alternatively, it may have affected more specialized speech-processing mechanisms operating beyond the storage of modulation cues in short-term auditory memory. This hypothesis is consistent with data indicating that musical experience - and thus training - influences the perception of pitch variations (e.g., Micheyl, Delhommeau, Perrot, & Oxenham, 2006). Moreover, it is also congruent with previous data collected in tonal-language speakers showing better perception of non-speech sounds varying in pitch, compared to English speakers (e.g., Bent, Bradlow, & Wright, 2006; Chandrasekaran, Krishnan, & Gandour, 2009). The effects of native language and age observed in our experiments are also reminiscent of previous investigations that demonstrated different perceptual weighting of AM and FM cues in different listening conditions and according to the age of listeners (i.e., young and elderly adults; Fogerty & Humes, 2012; Fogerty, 2011). Furthermore, the detection of AM and FM cues in such studies was found to improve with
substantial training (several hours) in adult listeners (e.g., Bruckert, Herrmann, & Lorenzi, 2006; Sabin, Eddins, & Wright, 2012). Together, these results suggest that (i) the perception of spectro-temporal modulation cues in speech and non-speech signals may be influenced by the listener's perceptual experience and that (ii) the perception of modulation cues is plastic. Neurophysiological studies recently demonstrated such plasticity at the neuronal (i.e., single-unit) level, and contributed to the understanding of how the environment shapes the processing of speech modulation cues. Short-term plasticity in the auditory cortex was also observed in animals during AM and FM detection tasks (e.g., Johnson, Yin, O'Connor, & Sutter, 2012; Niwa, Johnson, O'Connor, & Sutter, 2012, 2013). Furthermore, the development of cortical responses to spectro-temporal variations was shown to depend on age (e.g., Chang, Bao, Imaizumi, Schreiner, & Merzenich, 2005) and on the auditory environment (i.e., the presence of specific noises; e.g., Bao, Chang, Teng, Heiser, & Merzenich, 2013; Bao, Chang, Woods, & Merzenich, 2004). Attentional mechanisms are also known to influence neural responses in the auditory cortex (e.g., Niwa et al., 2012). Thus, neural activity at a low stage of the auditory pathway can be influenced by several higher-level mechanisms (i.e., attention, linguistic processing). Evidence for long-term neural plasticity was found in the auditory cortex after long exposure and training in spectro-temporal modulation processing (e.g., Bao et al., 2004; Kilgard & Merzenich, 1998; Ohl & Scheich, 2005). Altogether, these psychoacoustic and neurophysiological studies demonstrate the plasticity of the auditory mechanisms processing spectro-temporal cues under different listening conditions, and they are compatible with the hypothesis that listening experience modifies the perceptual weight of speech modulation cues.
Our results provide another illustration of the close relationship between relatively low-level auditory processes and higher perceptual-processing stages.

2.2. Implications for speech perception and development

The last study of this PhD work confirms that a reorganization in lexical- tone perception occurs during the first year of life (Mattock & Burnham, 2006; Mattock, Molnar, Polka, & Burnham, 2008; Yeung, Chen, & Werker, 2013).


However, our results also suggest that the ages of 6 and 10 months may not be the best endpoints of this particular reorganization. It is possible that the reorganization for lexical tones occurs before the age of 6 months (as suggested by Yeung et al., 2013). The case of lexical tones highlights the interaction between basic auditory processing (i.e., pitch processing) and higher levels of speech processing in perceptual reorganization. Models of perceptual reorganization for speech (e.g., Kuhl, 2004; Werker & Curtin, 2005) have focused on high levels of speech perception (such as phonology and lexical acquisition) and emphasized the importance of neural factors linked to computational, attentional and social skills in this reorganization (see Kuhl, 2010). However, early phonetic discrimination abilities - related to relatively low levels of speech processing - predict language skills in older infants (e.g., Tsao, Liu, & Kuhl, 2004). Moreover, early auditory abilities have been shown to impact language development and may be related to language disorders (e.g., Leppänen et al., 2010). A causal relationship between basic auditory processing (i.e., pitch perception) and language-learning ability (i.e., the learning of nonadjacent dependency rules in language) was proposed by Mueller, Friederici and Männel (2012). These psycholinguistic studies emphasize the importance of the different levels of auditory processing in language acquisition (see also Beach & Kitamura, 2011). Additionally, our results indicate that this perceptual reorganization also influences non-speech processing when the stimuli share pitch characteristics with native speech stimuli. This highlights the overlap between speech and non-speech processing mechanisms. Expertise in a given language influences the ability to use certain acoustic cues even in a non-speech context (e.g., Bent et al., 2006).
It has also been shown that training adults to categorize non-speech sounds changes their brain activations, which become closer to those observed when categorizing speech sounds (e.g., Leech, Holt, Devlin, & Dick, 2009; Liu & Holt, 2011). These studies call into question the evidence in favor of "specialized" processing for speech sounds (see Bent et al., 2006), and suggest an explanation in terms of expertise with speech sounds and their acoustic components. Altogether, these results are compatible with the view that auditory mechanisms, and more specifically modulation processing, can operate on both speech and non-speech sounds.


3. Behavioral methods used to assess discrimination of vocoded syllables in infants

In the present doctoral research, the head-turn preference (HPP) and visual-habituation (VH) procedures were chosen and designed to test 6- and 10-month-old infants (e.g., Mattock et al., 2008; Narayan, Werker, & Beddor, 2010; Polka & Werker, 1994; Werker et al., 1998). The advantages of these procedures for our purpose are numerous. First, they are relatively easy to implement compared to a conditioned head-turn procedure, and they were designed with the secondary (and long-term) objective of testing the phonetic discrimination abilities of deaf infants wearing cochlear implants (CIs) in future studies (see Houston, Horn, Qi, Ting, & Gao, 2007; Houston, Pisoni, Kirk, Ying, & Miyamoto, 2003). In visual habituation, there is a single source of audio-visual stimulation (unlike in HPP), which should be easier to implement with deaf infants wearing CIs (who show some difficulty localizing the source of a sound; e.g., Godar & Litovsky, 2010). Moreover, these testing procedures may be used to assess a variety of speech perception abilities (not only discrimination) from the age of 4-5 months to 18-24 months, and thus allow the evaluation of speech perception across development. These methods can also use speech stimuli more complex than syllables, such as words and sentences. Thus, in future studies, these procedures could be adapted for categorization or word-segmentation tasks using vocoded speech stimuli. However, a strong limitation of these procedures (including visual habituation) is that they do not provide graded measures of discrimination performance that could be ranked across sound conditions. In our study, some degradations were expected to be more deleterious than others (i.e., the reduction of spectral resolution in the case of place discrimination). Chapter 2 attempted to reveal graded responses indicating more or less robust perceptual responses, or specific difficulties in processing degraded speech stimuli.
Here, we analyzed the familiarization time required by infants to exhibit a clear preference for familiarity or novelty, as proposed by Hunter and Ames (1988). The results obtained in the intact condition were used as a baseline against which infants' response patterns in each vocoded condition were compared. The longest familiarization times (expected to correspond to the most difficult condition according to our model) were obtained for the condition in which fast AM cues were reduced. On this view, the reduction of FM cues or of spectral resolution may be considered as producing experimental conditions of intermediate difficulty. Although the experiment of Chapter 4 (using the VH procedure) confirms that the condition with reduced AM cues is probably the most difficult, no scaling of difficulty is observed between the other conditions (i.e., there is no difference in the discrimination responses or total habituation time). Future studies may use the conditioned head-turn procedure (see Kuhl, 2004; Mattock & Burnham, 2006) or the observer-based procedure (e.g., Olsho, Koch, Halpin, & Carter, 1987) to assess individual performance, which could then be compared within individuals across ages and stimulus degradations. Electrophysiological methods may also be used to explore the neural mechanisms (and their development) underlying the perception of these impoverished speech signals. An oddball paradigm combined with measurements of the neural responses to vocoded speech sounds may be appropriate to explore further the discrimination of phonetic contrasts in different vocoded conditions. The number of analysis-frequency bands and the cutoff frequency used for AM extraction may be varied systematically in order to compare the amplitude of neural responses across spectrally and temporally degraded conditions. This exploration should contribute substantially to the still-ongoing debate about the early specialization of speech perception mechanisms (see Dehaene-Lambertz, Dehaene, & Hertz-Pannier, 2002; Telkemeyer et al., 2009). The exact nature of the information processing (or the extent to which vocoded sounds are perceived as speech stimuli) cannot be determined from our studies.
Electrophysiological measurements could then be used to reveal whether similar mechanisms are recruited for the perception of intact and vocoded speech stimuli. The HPP and VH procedures succeeded in revealing the abilities of young infants to discriminate spectro-temporally degraded speech signals. These procedures also reveal the importance of the time of exposure to degraded speech signals. The present studies should therefore pave the way toward further exploration of the low-level auditory processes involved in speech perception during infancy. The HPP and VH procedures may also be used in future studies to assess speech recognition with vocoded signals, and they could easily be adapted to study deaf infants wearing CIs. Other methods that can measure infants' individual performance more precisely should be used, keeping the same aims in mind.

4. Implications for pediatric cochlear implantation

4.1 Simulation of electrical hearing for French contrasts

The experiments using vocoders that extract only the AM cues in a small number of frequency bands (4 or 8) can be taken as simulations of speech discrimination in CI listeners. The results of these experiments indicate that at 6 months, normal-hearing infants are able to use the highly impoverished speech information transmitted by CI processors to discriminate phonetic contrasts such as voicing or place when the language is syllable-timed (e.g., French). This is consistent with the speech perception skills observed in deaf children wearing CIs (e.g., Bouton, Bertoncini, Serniclaes, & Colé, 2011; Bouton, Serniclaes, Bertoncini, & Colé, 2012; Henkin, Kileny, Hildesheimer, & Kishon-Rabin, 2008). Those studies confirm that children with CIs are able to discriminate phonetic features (their scores were above chance level, although poorer than those of normal-hearing children), but also that place and nasality are more difficult to perceive than voicing and manner. Our experiment conducted with normal-hearing 6-month-olds corroborated these findings by showing that discrimination of voicing and place is possible when speech cues are impoverished in conditions simulating CI processing. However, inconsistent with Bouton et al. (2012), our experiment did not show any difference between the discrimination of voicing and place features. As discussed above, the behavioral methods used in the present PhD work failed to demonstrate graded discrimination performance between voicing and place. Other methods measuring individual performance should be used to reveal this expected difference (e.g., Olsho et al., 1987).
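To make the kind of processing referred to above concrete, the following is a minimal sketch of a noise-excited vocoder that keeps only slow AM cues in a few frequency bands. It is illustrative only and is not the implementation used to generate the experimental stimuli: the filterbank design (4th-order Butterworth bands on a log-spaced scale), the band edges, and the 16-Hz AM cutoff are assumptions chosen here for the example.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocoder(x, fs, n_bands=8, f_lo=100.0, f_hi=7000.0, am_cutoff=16.0):
    """Crude AM-only (noise-excited) vocoder sketch.

    Splits x into n_bands log-spaced frequency bands, extracts each band's
    envelope (AM), low-pass filters it to discard fast AM above am_cutoff,
    and uses it to modulate band-limited noise. FM/fine-structure cues and
    fine spectral detail are thereby discarded.
    """
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # illustrative band edges
    lp = butter(4, am_cutoff, btype="low", fs=fs, output="sos")
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                 # AM cue (Hilbert envelope)
        env = sosfilt(lp, env)                      # keep only slow AM
        carrier = sosfilt(sos, rng.standard_normal(len(x)))  # band-limited noise
        out += env * carrier                        # envelope drives the carrier
    return out
```

With `n_bands=4` or `n_bands=8`, the output retains roughly the spectro-temporal information a CI processor transmits, which is the rationale for using such signals as simulations of electrical hearing.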

4.2 Simulation of electrical hearing for lexical-tone contrasts

For lexical-tone discrimination, the results of the last experiment are entirely consistent with the difficulties observed in deaf infants equipped with CIs who are learning a tonal language (see Xu et al., 2011). The pitch information (F0 pattern) of lexical tones is mostly conveyed by the spectro-temporal fine structure cues that are known to be poorly transmitted by current CI processors. Consequently, some lexical-tone contrasts are particularly difficult to perceive, and systematic perceptual confusions in CI users listening to a tonal language are likely. Moreover, the production of lexical tones by CI users may also be deficient, and this should impair oral communication in daily life. In deaf children learning a tonal language, perception and production of lexical tones are delayed compared to those of normal-hearing children. However, large variability is observed among deaf children equipped with CIs in a word-identification task (Zhou, Huang, Chen, & Xu, 2013). Our results obtained with French-learning normal-hearing 6-month-olds confirm that the discrimination of lexical tones is more difficult than that of consonants in a vocoded condition simulating CI processing. Specifically, French 6-month-old infants are able to use the gross AM cues in a limited number of bands to discriminate voicing and place (Cabrera, Lorenzi & Bertoncini, in preparation), but they are unable to do so for a lexical-tone contrast (see Cabrera, Hu, Li, Tsao, Lorenzi & Bertoncini, in preparation). The fine spectral and temporal modulation cues conveying F0 information (and thus, voice-pitch information) are therefore required to accurately discriminate a lexical-tone contrast at 6 months of age. Interestingly, the results obtained with French 10-month-olds reveal that discrimination of lexical tones remains possible in the absence of fine spectro-temporal modulation cues. This result is consistent with the ability of some children with CIs to identify lexical tones (Zhou et al., 2013). In the absence of lexical-tone experience, French 10-month-olds are able to use the remaining acoustic cues (e.g., the gross AM cues in a limited number of frequency bands) to distinguish the lexical tones.
This ability could be related to their native-language experience (French is not a tonal language), which may have increased the perceptual weight of AM cues. This result deserves careful attention and should be explored more directly in deaf infants and children wearing CIs. Moreover, rehabilitation programs for deaf infants may include specific training (auditory and cognitive, see Ingvalson & Wong, 2013) to enhance the perception of AM variations (such as those cueing consonant contrasts).
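The distinction drawn above between AM (envelope) cues and temporal fine structure can be illustrated with the analytic-signal decomposition commonly used in this literature (e.g., via the Hilbert transform). This is an illustrative sketch under stated assumptions, not the processing applied to the experimental stimuli; the function name and test signal are hypothetical.

```python
import numpy as np
from scipy.signal import hilbert

def am_fm_decompose(band_signal):
    """Split a narrow-band signal into its amplitude modulation (AM, the
    slowly varying envelope) and its temporal fine structure (the rapid,
    unit-amplitude carrier oscillation), using the analytic signal."""
    analytic = hilbert(band_signal)
    am = np.abs(analytic)              # envelope: the cue vocoders preserve
    tfs = np.cos(np.angle(analytic))   # fine structure: carries F0/pitch detail
    return am, tfs
```

For a lexical-tone stimulus, the F0 pattern that distinguishes the tones lives largely in the fine-structure component; an AM-only vocoder retains `am` but replaces `tfs` with noise, which is why tone discrimination suffers under such processing.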


No data are available regarding the abilities of deaf infants or toddlers (under 2 years) to discriminate phonetic and tone contrasts with the impoverished signal delivered by CIs. The effects of CI processing on the development of speech perception have been observed only after the age of 2 years, using tasks requiring comprehension abilities (i.e., word-identification tasks). Thus, the positive effect of early implantation can only be hypothesized from the data obtained in older children. The period before 2 years remains largely unexplored, but early speech rehabilitation methods could be improved by adapting them to the specific difficulties that young infants face when perceiving speech sounds through CI processors. Results of vocoder studies simulating speech perception under CI conditions in normal-hearing infants should help to improve some aspects of early rehabilitation in deaf infants and may help to identify the major difficulties in speech discrimination with current CI processors.

5. Conclusions

In summary, the five experimental studies reported in the present dissertation attempted to characterize the role of slow and fast speech modulation cues in infants' speech discrimination in quiet. Normal-hearing 6-month-old native French listeners require neither fine spectral cues (i.e., high spectral resolution) nor temporal fine structure (FM) cues to discriminate French plosive consonants, but phonetic discrimination seems more difficult in the absence of fast (> 16 Hz) AM cues. In contrast, fine spectral cues and temporal fine structure are required by these infants to discriminate lexical tones. Moreover, between 6 and 10 months, native Mandarin listeners become more sensitive to the spectro-temporal fine structure cues when discriminating lexical-tone contrasts. Together, these results demonstrate that the weight of speech modulation cues (AM, FM) in speech discrimination is affected by language experience and, more generally, that the processing of speech modulation cues is plastic. Undoubtedly, the present research program represents only a first step in the attempt to use vocoded speech to explore systematically the contribution of low-level auditory mechanisms to speech perception during early development. The present experiments raise new questions and should trigger additional investigations – possibly with CI users – that will have both theoretical and clinical value.


References

Ardoint, M., Lorenzi, C., Pressnitzer, D., & Gorea, A. (2008). Investigation of perceptual constancy in the temporal-envelope domain. The Journal of the Acoustical Society of America, 123(3), 1591–1601.
Bao, S., Chang, E. F., Teng, C.-L., Heiser, M. A., & Merzenich, M. M. (2013). Emergent categorical representation of natural, complex sounds resulting from the early post-natal sound environment. Neuroscience, 248C, 30–42.
Bao, S., Chang, E. F., Woods, J., & Merzenich, M. M. (2004). Temporal plasticity in the primary auditory cortex induced by operant perceptual learning. Nature Neuroscience, 7(9), 974–981.
Beach, E. F., & Kitamura, C. (2011). Modified spectral tilt affects older, but not younger, infants' native-language fricative discrimination. Journal of Speech, Language and Hearing Research, 54(2), 658–667.
Bent, T., Bradlow, A. R., & Wright, B. A. (2006). The influence of linguistic experience on the cognitive processing of pitch in speech and nonspeech sounds. Journal of Experimental Psychology: Human Perception and Performance, 32(1), 97–103.
Bertoncini, J., Bijeljac-Babic, R., Blumstein, S. E., & Mehler, J. (1987). Discrimination in neonates of very short CVs. The Journal of the Acoustical Society of America, 82(1), 31–37.
Bertoncini, J., Bijeljac-Babic, R., Jusczyk, P. W., Kennedy, L. J., & Mehler, J. (1988). An investigation of young infants' perceptual representations of speech sounds. Journal of Experimental Psychology: General, 117(1), 21–33.
Bertoncini, J., Floccia, C., Nazzi, T., & Mehler, J. (1995). Morae and syllables: Rhythmical basis of speech representations in neonates. Language and Speech, 38(4), 311–329.
Bertoncini, J., & Mehler, J. (1981). Syllables as units in infant speech perception. Infant Behavior and Development, 4, 247–260.
Bertoncini, J., Nazzi, T., Cabrera, L., & Lorenzi, C. (2011). Six-month-old infants discriminate voicing on the basis of temporal envelope cues. The Journal of the Acoustical Society of America, 129(5), 2761–2764.
Bertoncini, J., Serniclaes, W., & Lorenzi, C. (2009). Discrimination of speech sounds based upon temporal envelope versus fine structure cues in 5- to 7-year-old children. Journal of Speech, Language, and Hearing Research, 52(3), 682–695.
Best, C. T., & Jones, C. (1998). Stimulus-alternation preference procedure to test infant speech discrimination. Infant Behavior and Development, 21(Sup. 1), 295.
Bijeljac-Babic, R., Bertoncini, J., & Mehler, J. (1993). How do 4-day-old infants categorize multisyllabic utterances? Developmental Psychology, 29(4), 711–721.


Bijeljac-Babic, R., Serres, J., Höhle, B., & Nazzi, T. (2012). Effect of bilingualism on lexical stress pattern discrimination in French-learning infants. PLoS ONE, 7(2), e30843.
Bouton, S., Bertoncini, J., Serniclaes, W., & Colé, P. (2011). Reading and reading-related skills in children using cochlear implants: Prospects for the influence of cued speech. Journal of Deaf Studies and Deaf Education, 16(4), 458–473.
Bouton, S., Serniclaes, W., Bertoncini, J., & Colé, P. (2012). Perception of speech features by French-speaking children with cochlear implants. Journal of Speech, Language, and Hearing Research, 55(1), 139–153.
Bruckert, L., Herrmann, M., & Lorenzi, C. (2006). No adaptation in the amplitude modulation domain in trained listeners. The Journal of the Acoustical Society of America, 119, 3542–3545.
Buss, E., Hall III, J. W., & Grose, J. H. (2006). Development and the role of internal noise in detection and discrimination thresholds with narrow band stimuli. The Journal of the Acoustical Society of America, 120, 2777–2788.
Cabrera, L., Bertoncini, J., & Lorenzi, C. (2013). Perception of speech modulation cues by 6-month-old infants. Journal of Speech, Language, and Hearing Research, in press.
Cabrera, L., Hu, Y. H., Li, L. Y., Tsao, F. M., Lorenzi, C., & Bertoncini, J. The perception of speech modulation cues is guided by the early language-specific experience, in preparation.
Cabrera, L., Lorenzi, C., & Bertoncini, J. Infants discriminate voicing and place of articulation with reduced spectral and temporal modulation cues, in preparation.
Cabrera, L., Tsao, F. M., Gnansia, D., Bertoncini, J., & Lorenzi, C. Linguistic experience shapes the perception of spectro-temporal fine structure cues, submitted.
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2009). Relative influence of musical and linguistic experience on early cortical processing of pitch contours. Brain and Language, 108(1), 1–9.
Chang, E. F., Bao, S., Imaizumi, K., Schreiner, C. E., & Merzenich, M. M. (2005). Development of spectral and temporal response selectivity in the auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 102(45), 16460–16465.
Clément, S., Demany, L., & Semal, C. (1999). Memory for pitch versus memory for loudness. The Journal of the Acoustical Society of America, 106(5), 2805–2811.
Colombo, J., & Horowitz, F. D. (1986). Infants' attentional responses to frequency modulated sweeps. Child Development, 57(2), 287–291.
Cooper, F. S., Liberman, A. M., & Borst, J. M. (1951). The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences of the United States of America, 37(5), 318–325.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997a). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. The Journal of the Acoustical Society of America, 102(5 Pt 1), 2892–2905.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997b). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. The Journal of the Acoustical Society of America, 102(5 Pt 1), 2906–2919.
Dawes, P., & Bishop, D. V. (2008). Maturation of visual and auditory temporal processing in school-aged children. Journal of Speech, Language and Hearing Research, 51(4), 1002–1017.
Dehaene-Lambertz, G., Dehaene, S., & Hertz-Pannier, L. (2002). Functional neuroimaging of speech perception in infants. Science, 298(5600), 2013–2015.
Drullman, R. (1995). Temporal envelope and fine structure cues for speech intelligibility. The Journal of the Acoustical Society of America, 97(1), 585–592.
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171(3968), 303–306.
Eisenberg, L. S., Shannon, R. V., Martinez, A. S., Wygonski, J., & Boothroyd, A. (2000). Speech recognition with reduced spectral cues as a function of age. The Journal of the Acoustical Society of America, 107(5 Pt 1), 2704–2710.
Fogerty, D. (2011). Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure. The Journal of the Acoustical Society of America, 129(2), 977–988.
Fogerty, D., & Humes, L. E. (2012). A correlational method to concurrently measure envelope and temporal fine structure weights: Effects of age, cochlear pathology, and spectral shaping. The Journal of the Acoustical Society of America, 132(3), 1679–1689.
Fu, Q.-J., Zeng, F.-G., Shannon, R. V., & Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. The Journal of the Acoustical Society of America, 104, 505–515.
Godar, S. P., & Litovsky, R. Y. (2010). Experience with bilateral cochlear implants improves sound localization acuity in children. Otology & Neurotology, 31(8), 1287–1292.
Greenberg, S., Hollenback, J., & Ellis, D. (1996). Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. In International Conference on Spoken Language Processing (pp. S32–35).
Henkin, Y., Kileny, P. R., Hildesheimer, M., & Kishon-Rabin, L. (2008). Phonetic processing in children with cochlear implants: An auditory event-related potentials study. Ear and Hearing, 29(2), 239–249.


Hirsh-Pasek, K., Kemler Nelson, D. G., Jusczyk, P. W., Cassidy, K. W., Druss, B., & Kennedy, L. (1987). Clauses are perceptual units for young infants. Cognition, 26(3), 269–286.
Holt, R. F. (2011). Enhancing speech discrimination through stimulus repetition. Journal of Speech, Language, and Hearing Research, 54(5), 1431–1447.
Houston, D. M., Horn, D. L., Qi, R., Ting, J. Y., & Gao, S. (2007). Assessing speech discrimination in individual infants. Infancy, 12(2), 119–145.
Houston, D. M., Pisoni, D. B., Kirk, K. I., Ying, E. A., & Miyamoto, R. T. (2003). Speech perception skills of deaf infants following cochlear implantation: A first report. International Journal of Pediatric Otorhinolaryngology, 67(5), 479–495.
Hunter, M. A., & Ames, E. W. (1988). A multifactor model of infant preferences for novel and familiar stimuli. Advances in Infancy Research, 5, 69–95.
Ingvalson, E. M., & Wong, P. C. M. (2013). Training to improve language outcomes in cochlear implant recipients. Frontiers in Psychology, 4, 263.
Ives, D. T., Calcus, A., Kalluri, S., Strelcyk, O., Sheft, S., & Lorenzi, C. (2013). Effects of noise reduction on AM and FM perception. Journal of the Association for Research in Otolaryngology, 14(1), 149–157.
Johnson, J. S., Yin, P., O'Connor, K. N., & Sutter, M. L. (2012). Ability of primary auditory cortical neurons to detect amplitude modulation with rate and temporal codes: Neurometric analysis. Journal of Neurophysiology, 107(12), 3325–3341.
Jørgensen, S., Ewert, S. D., & Dau, T. (2013). A multi-resolution envelope-power based model for speech intelligibility. The Journal of the Acoustical Society of America, 134, 436–446.
Kates, J. M. (2011). Spectro-temporal envelope changes caused by temporal fine structure modification. The Journal of the Acoustical Society of America, 129(6), 3981–3990.
Kilgard, M. P., & Merzenich, M. M. (1998). Plasticity of temporal information processing in the primary auditory cortex. Nature Neuroscience, 1(8), 727–731.
Kong, Y.-Y., & Zeng, F.-G. (2006). Temporal and spectral cues in Mandarin tone recognition. The Journal of the Acoustical Society of America, 120, 2830.
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843.
Kuhl, P. K. (2010). Brain mechanisms in early language acquisition. Neuron, 67(5), 713–727.
Leech, R., Holt, L. L., Devlin, J. T., & Dick, F. (2009). Expertise with artificial nonspeech sounds recruits speech-sensitive cortical regions. The Journal of Neuroscience, 29(16), 5234–5239.
Leppänen, P. H. T., Hämäläinen, J. A., Salminen, H. K., Eklund, K. M., Guttorm, T. K., Lohvansuu, K., … Lyytinen, H. (2010). Newborn brain event-related potentials revealing atypical processing of sound frequency and the subsequent association with later literacy skills in children with familial dyslexia. Cortex, 46(10), 1362–1376.
Liang, Z. A. (1963). The auditory perception of Mandarin tones. Acta Phys. Sin, 26, 85–91.
Liberman, A. M., Delattre, P. C., Cooper, F. S., & Gerstman, L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68(8), 1–13.
Liu, R., & Holt, L. L. (2011). Neural changes associated with nonspeech auditory category learning parallel those of speech category acquisition. Journal of Cognitive Neuroscience, 23(3), 683–698.
Mattock, K., & Burnham, D. (2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265.
Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106(3), 1367–1381.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2), 143–178.
Micheyl, C., Delhommeau, K., Perrot, X., & Oxenham, A. J. (2006). Influence of musical and psychoacoustical training on pitch discrimination. Hearing Research, 219(1-2), 36–47.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27, 338–352.
Moore, B. C. (2004). An introduction to the psychology of hearing (Vol. 4). San Diego: Academic Press.
Mueller, J. L., Friederici, A. D., & Männel, C. (2012). Auditory perception at the root of language learning. Proceedings of the National Academy of Sciences of the United States of America, 109(39), 15953–15958.
Narayan, C. R., Werker, J. F., & Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: Evidence from nasal place discrimination. Developmental Science, 13(3), 407–420.
Newman, R. S. (2009). Infants' listening in multitalker environments: Effect of the number of background talkers. Attention, Perception, & Psychophysics, 71(4), 822–836.
Newman, R., & Chatterjee, M. (2013). Toddlers' recognition of noise-vocoded speech. The Journal of the Acoustical Society of America, 133(1), 483–494.
Niwa, M., Johnson, J. S., O'Connor, K. N., & Sutter, M. L. (2012). Active engagement improves primary auditory cortical neurons' ability to discriminate temporal modulation. The Journal of Neuroscience, 32(27), 9323–9334.
Niwa, M., Johnson, J. S., O'Connor, K. N., & Sutter, M. L. (2013). Differences between primary auditory cortex and auditory belt related to encoding and choice for AM sounds. The Journal of Neuroscience, 33(19), 8378–8395.
Ohl, F. W., & Scheich, H. (2005). Learning-induced plasticity in animal and human auditory cortex. Current Opinion in Neurobiology, 15(4), 470–477.
Olsho, L. W., Koch, E. G., Halpin, C. F., & Carter, E. A. (1987). An observer-based psychoacoustic procedure for use with young infants. Developmental Psychology, 23(5), 627–640.
Plomp, R. (1983). Perception of speech as a modulated signal. In Proceedings of the Tenth International Congress of Phonetic Sciences (pp. 29–40).
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421–435.
Rose, S. A., Gottfried, A. W., Melloy-Carminar, P., & Bridger, W. H. (1982). Familiarity and novelty preferences in infant recognition memory: Implications for information processing. Developmental Psychology, 18(5), 704–713.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 336(1278), 367–373.
Sabin, A. T., Eddins, D. A., & Wright, B. A. (2012). Perceptual learning evidence for tuning to spectrotemporal modulation in the human auditory system. The Journal of Neuroscience, 32(19), 6542–6549.
Saffran, J. R., Werker, J. F., & Werner, L. A. (2006a). The infant's auditory world: Hearing, speech, and the beginnings of language. In D. Kuhn & R. Siegler (Eds.), Handbook of child psychology (Vol. 2, pp. 58–108).
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Sheft, S., Ardoint, M., & Lorenzi, C. (2008). Speech identification based on temporal fine structure cues. The Journal of the Acoustical Society of America, 124(1), 562–575.
Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416(6876), 87–90.
Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. The Journal of the Acoustical Society of America, 64, 1358–1368.
Stone, M. A., Füllgrabe, C., & Moore, B. C. (2008). Benefit of high-rate envelope cues in vocoder processing: Effect of number of channels and spectral region. The Journal of the Acoustical Society of America, 124, 2272–2282.


Telkemeyer, S., Rossi, S., Koch, S. P., Nierhaus, T., Steinbrink, J., Poeppel, D., … Wartenburger, I. (2009). Sensitivity of newborn auditory cortex to the temporal structure of sounds. The Journal of Neuroscience, 29(47), 14726–14733.
Thiessen, E. D., & Saffran, J. R. (2003). When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39(4), 706–716.
Trehub, S. E. (1976). The discrimination of foreign speech contrasts by infants and adults. Child Development, 466–472.
Tsao, F.-M., Liu, H.-M., & Kuhl, P. K. (2004). Speech perception in infancy predicts language development in the second year of life: A longitudinal study. Child Development, 75(4), 1067–1084.
Wagner, S. H., & Sakovits, L. J. (1986). A process analysis of infant visual and cross-modal recognition memory: Implications for an amodal code. Advances in Infancy Research.
Werker, J. F., & Curtin, S. (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2), 197–234.
Werker, J. F., Shi, R., Desjardins, R., Pegg, J. E., Polka, L., & Patterson, M. (1998). Three methods for testing infant speech perception. In A. Slater (Ed.), Perceptual development: Visual, auditory, and speech perception in infancy (pp. 389–420). East Sussex, United Kingdom: Psychological Press.
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
Werner, L. A. (2013). Infants' detection and discrimination of sounds in modulated maskers. The Journal of the Acoustical Society of America, 133(6), 4156–4167.
Werner, L., Fay, R. R., & Popper, A. N. (2012). Human auditory development (Vol. 42). Springer.
Xu, L., Chen, X., Lu, H., Zhou, N., Wang, S., Liu, Q., … Han, D. (2011). Tone perception and production in pediatric cochlear implants users. Acta Oto-Laryngologica, 131(4), 395–398.
Xu, L., & Pfingst, B. E. (2008). Spectral and temporal cues for speech recognition: Implications for auditory prostheses. Hearing Research, 242(1-2), 132–140.
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68, 123–139.
Zeng, F.-G., Nie, K., Stickney, G. S., Kong, Y.-Y., Vongphoe, M., Bhargave, A., … Cao, K. (2005). Speech recognition with amplitude and frequency modulations. Proceedings of the National Academy of Sciences of the United States of America, 102(7), 2293–2298.


Zhou, N., Huang, J., Chen, X., & Xu, L. (2013). Relationship between tone perception and production in prelingually deafened children with cochlear implants. Otology & Neurotology, 34(3), 499–506.
