ARTICLE IN PRESS

Journal of Phonetics 37 (2009) 344–356 www.elsevier.com/locate/phonetics

Influence of native language phonetic system on audio-visual speech perception

Yue Wang a,*, Dawn M. Behne b,1, Haisheng Jiang a

a Department of Linguistics, Simon Fraser University, RCB 9224, 8888 University Drive, Burnaby, BC, Canada V5A 1S6
b Department of Psychology, Norwegian University of Science & Technology, NO-7491 Trondheim, Norway

Received 19 May 2008; received in revised form 18 February 2009; accepted 20 April 2009

Abstract

This study examines how native language (L1) experience affects auditory–visual (AV) perception of nonnative (L2) speech. Korean, Mandarin and English perceivers were presented with English CV syllables containing fricatives with three places of articulation: labiodentals, nonexistent in Korean; interdentals, nonexistent in Korean and Mandarin; and alveolars, occurring in all three L1s. The stimuli were presented as auditory-only, visual-only, congruent AV and incongruent AV. Results show that for the labiodentals, which are nonnative in Korean, the Koreans had lower accuracy in the visual domain than the English and Mandarin perceivers, but nevertheless achieved native-level perception in the auditory and AV domains. For the interdentals, nonexistent in Korean and Mandarin, both nonnative groups had lower accuracy in the auditory domain than the native English group, but benefited from the visual information with improved performance in AV perception. Comparing the two nonnative groups, the Mandarin perceivers showed poorer auditory and AV identification of the interdentals and greater AV-fusion with the incongruent AV material than did the Koreans. These results indicate that nonnative perceivers are able to use visual speech information in L2 perception, although accurate use of the auditory and visual domains may not be similarly achieved across native groups, a process influenced by L1 experience.
© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Examining the auditory–visual (AV) perception of English fricative consonants by native Korean and Mandarin Chinese perceivers, this study explores the extent to which AV perception in a second language (L2) is affected by a native language (L1) phonetic system.

1.1. Previous findings

It has been well established that visual speech information facilitates speech perception for adults (e.g., Bernstein, Auer, & Takayanagi, 2004; Davis & Kim, 2004; Erber, 1969; Sumby & Pollack, 1954; Summerfield, 1979) as well as for infants (Kuhl & Meltzoff, 1982; Patterson & Werker, 1999, 2003; Weikum et al., 2007). The relative contribution of auditory and visual speech information has also been revealed by what is known as the "McGurk effect", where an auditory stimulus (e.g., /bɑ/) dubbed onto a different visual stimulus (e.g., /gɑ/) can lead to an intermediate percept that bears features of both the auditory and the visual input (e.g., /dɑ/; McGurk & MacDonald, 1976). However, although the ability to integrate AV speech information exists cross-linguistically, language-specific patterns have also been observed. For example, whereas the McGurk effect occurs for native perceivers of a variety of languages such as English, Finnish and French (Massaro, 1998, 2001; Massaro, Tsuzaki, Cohen, Gesi, & Heredia, 1993; McGurk & MacDonald, 1976; Sams, Manninen, Surakka, Helin, & Katto, 1998), it appears to be weak for native perceivers of Chinese and Japanese (Sekiyama, 1997; Sekiyama & Tohkura, 1993). In addition, while perceivers across languages (e.g., English, Spanish) demonstrate similar auditory–visual processing patterns perceiving a /bɑ/-/dɑ/ continuum (Chen & Massaro, 2004;

* Corresponding author. Tel.: +1 778 782 6924; fax: +1 778 782 5659. E-mail addresses: [email protected] (Y. Wang), [email protected] (D.M. Behne), [email protected] (H. Jiang).
1 Tel.: +47 73 591978.

0095-4470/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.wocn.2009.04.002

Massaro et al., 1993), Chinese perceivers show less use of visual cues in the perception of /bɑ/ and /dɑ/ than Dutch perceivers (de Gelder & Vroomen, 1992). Although this lack of efficient use of visual speech cues has been attributed to the fact that Chinese as a tone language relies more on the less visually distinct prosodic information (Sekiyama, 1997; Sekiyama & Tohkura, 1993), research has also shown a significant effect of visual information on tone perception in Chinese (Burnham, Lau, Tam, & Schoknecht, 2001).

These cross-linguistic differences in L1 AV processing suggest that nonnative (especially inexperienced L2) users' AV perception may be differentially affected by their L2 and their L1. While nonnatives may be facilitated by general visual speech information (Davis & Kim, 2004; Erdener & Burnham, 2005; Hardison, 1999; Navarra & Soto-Faraco, 2007; Reisberg, McLean, & Goldfield, 1987; Sekiyama, Kanno, Miura, & Sugita, 2003), they also reveal poorer L2 AV perception than in their L1 (de Gelder & Vroomen, 1992; Sekiyama, 1997; Werker, Frost, & McGurk, 1992). For example, with only visual speech information, Italian and English natives cannot discriminate Catalan and Spanish (Soto-Faraco et al., 2007). Nonnative perceivers, such as L2 perceivers of English and Japanese, are also more susceptible to the McGurk effect for L2 stimuli (Sekiyama & Tohkura, 1993) than for L1 stimuli, although Chinese perceivers exhibit an equally weak McGurk effect for their L1 and an L2 (Japanese or English) (Sekiyama, 1997; Sekiyama, Tohkura, & Umeda, 1996).

Further research shows that while L2 perceivers do not have difficulty perceiving the visual cues common to their L1 and L2, they are more likely to be impeded in the correct use of L2 visual cues nonexistent in their L1 than of those occurring in both their L1 and L2 (Hardison, 1999, 2003, 2005a, 2005b; Hazan, Sennema, Faulkner, & Ortega-Llebaria, 2006; Ortega-Llebaria, Faulkner, & Hazan, 2001). For example, Spanish learners outperform Japanese learners in the perception of the visual cues for the English /b-v/ contrast, since /v/ exists allophonically in Spanish as the voiceless counterpart /f/, whereas it is nonexistent in Japanese (Hazan et al., 2006). Likewise, nonproficient French learners of English cannot efficiently perceive the visual information of the English interdental fricatives absent in French (Werker et al., 1992). These studies indicate that L2 learners have difficulty correctly using visual cues that do not contrast phonetically in their L1, suggesting that subsequent research should focus on new L2 sounds.

Indeed, while L2 speech perception has been studied extensively in the auditory domain, research has not adequately explored how L1 and L2 systems interact to affect the perception of new L2 visual speech categories (Hardison, 1999, 2003; Hazan, Sennema, Iba, & Faulkner, 2005; Hazan et al., 2006). Theories developed on the basis of nonnative or L2 auditory speech learning may inform patterns of visual speech learning. These theories (e.g., the Speech Learning Model, SLM, Flege, 1995, 2007; the Perceptual Assimilation Model, PAM, Best, 1995; Best & Tyler, 2007) posit an interactive relationship between L1 and L2 phonetic systems, stating that a nonnative listener's inability to accurately perceive L2 sounds is due to incorrect assimilation of L2 sounds with L1 phonetic categories and (in the case of L2 learners) a failure to establish new L2 phonetic categories. It has been proposed that category formation is most likely if an L2 speech sound is perceived to be phonetically distant from the closest L1 sound, whereas L2 sounds that are phonetically similar to those in the L1 are often incorrectly assimilated to an L1 phonemic category, and the formation of separate phonetic categories for those similar L2 sounds can be most challenging (e.g., Flege, 1995, 2007).

L2 visual speech learning may also be a matter of establishing visual speech categories. To classify visual speech categories, previous research has identified various homophonous groups based on the visual speech information in speakers' articulatory movements (in English, for example, bilabials [p, b, m], labiodentals [f, v], interdentals [θ, ð], alveolars [t, d, n, s, z], etc.; Binnie, Montgomery, & Jackson, 1974; Hardison, 1999). In terms of L2, Hazan et al. (2006) specified three types of relationships between L1 and L2 visual speech categories: a visual category occurring in both L1 and L2 (e.g., /s/ in English and Mandarin); a visual category existent in L2 but not in L1 (e.g., English /f/ for Korean perceivers); and a visual category occurring in both L1 and L2, but used for different phonetic distinctions in the L2 (e.g., English labiodental /v/ occurring in Spanish only as voiceless). While the visual categories that are common to an L1 and L2 do not present much difficulty, L2 perceivers may have lost sensitivity to visual cues that are not used in their L1 and need to learn to associate these visual cues with a visual category in the L2 to establish new L2 categories (Hazan et al., 2005, 2006). Further research has indeed shown that new visual categories may be established with linguistic experience, as L2 perceivers' use of nonnative visual speech information is positively correlated with their length of residence (LOR) in an L2 country (Wang, Behne, & Jiang, 2008) and their overall proficiency in the L2 (Werker et al., 1992), and as sensitivity to visual information in L2 sounds can be enhanced through auditory–visual training (Hardison, 2003, 2005b; Hazan et al., 2005).

It should be noted that L2 categories may be new in terms of their auditory distinctions, visual distinctions or both, and the auditory and visual distinctions of a new L2 category do not necessarily present the same challenge for a learner (e.g., Hardison, 1999; Hazan et al., 2006). Thus the connection between auditorily- and visually-based L1 and L2 phonetic distinctions needs to be examined in greater detail. For instance, similar phones with a voicing distinction such as /p/ and /b/ often cause confusion for L2 learners when the contrast is perceived auditorily (e.g., Flege, 1980), but may not necessarily be difficult in the auditory–visual domain, as has been shown with Spanish perceivers who can correctly perceive the nonnative /v/ due to the existence of /f/ in their L1 (Hazan et al., 2006), possibly because voicing is not as visually distinctive as other features such as place of articulation. Indeed, previous studies indicate that the difficulty in L2 visual speech perception may lie in places of articulation that are not used in the L1 (Hardison, 1999; Hazan et al., 2006; Werker et al., 1992). However, while most research compares the visual perception of the place of articulation of L2 consonant contrasts, those contrasts (e.g., /b/ vs. /v/) are often spread across manners of articulation (e.g., Hardison, 2005b; Hazan et al., 2006; Sekiyama & Tohkura, 1993; Werker et al., 1992). Since manner of articulation, as well as place of articulation, may affect visual speech perception (Faulkner & Rosen, 1999; Green & Kuhl, 1989), an extension for further research is to control for these differences in order to further characterize the relationships of L1 and L2 visual categories and thus address L2 auditory–visual learning patterns.

1.2. The current study

This study examines nonnative auditory–visual perception of English fricatives differing in three visually distinct places of articulation: labiodental, interdental, and alveolar. Two nonnative groups, Korean and Mandarin Chinese, as well as native English controls, are included for their differences in L1 phonetic systems, where English contains fricatives and corresponding visual cues that are nonnative to the Korean and Chinese perceivers. In particular, Mandarin does not contain interdental fricatives, but prevocalic voiceless labiodentals and alveolars both exist (Ladefoged & Wu, 1984; Svantesson, 1986; Tse, 1988), whereas Korean contains neither labiodental nor interdental fricatives, but voiceless alveolars occur (Hardison, 1999; Kang & Lee, 2002; Kim, 1972; Schmidt, 1996).

The systematic differences in the phonetic systems of these languages make it possible to explore the questions addressed in this study with regard to the effect of L1 influence. Based on the phonetic inventories of their L1s, the English alveolars, which occur in both Korean and Mandarin, are expected to be perceived in a comparable manner by the native and nonnative perceivers, while Mandarin natives may be unable to correctly perceive the interdentals nonexistent in Mandarin, and Korean natives may fail to accurately perceive both the labiodentals and the interdentals not occurring in Korean. Moreover, based on the assimilation patterns proposed by auditory speech learning theories, where an L2 sound is often incorrectly assimilated to a perceptually close L1 sound category (Best, 1995; Best & Tyler, 2007; Flege, 1995, 2007), Mandarin natives' perception of English interdental fricatives may be inferior to that of Korean natives, since the front visual categories (and thus the perceptual space) are more crowded for Mandarin in the vicinity of the interdental fricatives (i.e., with labiodentals and alveolars to which the interdentals may be assimilated).

Subsequent questions follow from the interaction of auditory and visual perception of these fricatives, with regard to how Korean and Mandarin perceivers weigh L2 auditory and visual cues under the influence of their L1s, and the extent to which auditory or visual cues contribute to the percept in the L2. To tease apart the influence of auditory and visual information in a single stimulus, this study makes use of the McGurk illusion (McGurk & MacDonald, 1976). Thus, in addition to auditory, visual, and congruent auditory–visual input modalities, incongruent auditory–visual input (with labiodental and alveolar fricatives as auditory or visual components, and the intermediate interdentals projected to be one of the major fused percepts) is included to determine the contribution of each component modality. Of particular interest are the extent to which the Mandarin perceivers either remain anchored to the familiar auditory or visual components or perceive the fused nonnative interdental percept (or other intermediate sounds), and the extent to which the Koreans use these modalities where a nonnative sound is involved in both the input (labiodentals) and one of the expected fused percepts (interdentals). Given the previous finding that nonnatives who do not rely heavily on visual cues in their L1 tend to make more use of visual information in an L2 (e.g., Sekiyama, 1994), we predict that both Korean and Mandarin perceivers will be more responsive to the visual component, or to a fused percept bearing the articulatory features of the visual component, than the English natives. Furthermore, based on the previous claim that languages with fewer visible place of articulation contrasts are less visually marked (Hazan et al., 2006), we predict that, compared to Mandarin perceivers, Koreans will use less visual information.

2. Methods

2.1. Participants

Three groups of young adults (mean age = 23, balanced for number of males and females) participated in the study: 15 native Canadian English, 15 native Korean, and 20 native Mandarin speakers. All were undergraduate or graduate students at Simon Fraser University (SFU, Vancouver, Canada) at the time of the study. The participants' language background information is summarized in Table 1. The Korean and Mandarin participants were recruited to have comparable backgrounds in their experience with English. They all had studied English as an L2 in a classroom setting 3–5 h/week since an average age of 10–11 years. As an admission requirement for SFU, they had passed a test of English, either the International English Language Testing System (IELTS) with a minimum score of 6.5 (out of 9) or the Test of English as a Foreign Language computer-based test (TOEFL CBT) with a minimum score of 230 (out of 300). None were

Table 1 Participants’ language background information.

Age (years) AOL (years) AOA (years) LOR (years) English study (years) L1 input (%) English input (%)

English (n ¼ 15) 22 (4) – – – – – – Korean (n ¼ 15) 24 (4) 10 (5) 19 (5) 4 (1) 10 (3) 66 (23) 34 (23) Mandarin (n ¼ 20) 24 (4) 11 (3) 21 (4) 2 (1) 12 (4) 60 (23) 40 (23)

Note: AOL: mean age of initial English learning; AOA: mean age of arrival in Canada; LOR: mean length of residence in Canada (years); English study: mean number of years of formal English study (as a school subject); L1 input: mean % daily input in Korean and Mandarin, respectively; English input: mean % daily input in English (L2). The standard deviations are included in parentheses.

Table 2
The set of English CV syllables containing voiceless fricatives presented in the auditory-only (A), visual-only (V), and congruent auditory–visual (AVc) modalities for each of the fricative POAs (labiodental, interdental, alveolar), and as incongruent auditory–visual stimuli (AVi), including two auditory and visual input alternatives differing in fricative POA (Alabiodental–Valveolar, Aalveolar–Vlabiodental).

Modality         A, V, or AVc                          AVi
Fricative POA    Labiodental  Interdental  Alveolar    Alabiodental–Valveolar  Aalveolar–Vlabiodental
Sample stimuli   fi           θi           si          fi-si                   si-fi
                 fɑ           θɑ           sɑ          fɑ-sɑ                   sɑ-fɑ
                 fu           θu           su          fu-su                   su-fu
Occurrence       English      English      English     –                       –
                 Mandarin                  Mandarin
                                           Korean

Note: POA = place of articulation; Alabiodental–Valveolar: auditory input = labiodental, visual input = alveolar; Aalveolar–Vlabiodental: auditory input = alveolar, visual input = labiodental.
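The factorial design in Table 2 can be sanity-checked by enumerating the stimulus set in a few lines of code. The sketch below is purely illustrative (it is not the authors' materials-generation script), and the ASCII spellings "th"/"dh" for the interdental fricatives and "a" for the low vowel are assumptions of the example.

```python
from itertools import product

# Illustrative reconstruction of the stimulus inventory described in the text.
FRICATIVES = {                      # onset fricative -> place of articulation
    "f": "labiodental", "v": "labiodental",
    "th": "interdental", "dh": "interdental",  # ASCII stand-ins for theta/eth
    "s": "alveolar", "z": "alveolar",
}
VOWELS = ["i", "a", "u"]            # "a" stands in for the low vowel

# 18 base CV syllables: 6 fricatives x 3 vowels
syllables = [f + v for f, v in product(FRICATIVES, VOWELS)]

# A, V, and AVc each use all 18 syllables.
a_stimuli = [("A", s) for s in syllables]
v_stimuli = [("V", s) for s in syllables]
avc_stimuli = [("AVc", s) for s in syllables]

# AVi: mismatched labiodental/alveolar components with matching voicing and
# vowel -> 2 AV-input orders x 2 voicing conditions x 3 vowels = 12 stimuli.
pairs = [("f", "s"), ("v", "z")]    # voiceless and voiced labiodental/alveolar pairs
avi_stimuli = []
for (lab, alv), vowel in product(pairs, VOWELS):
    avi_stimuli.append(("AVi", f"A:{lab}{vowel}+V:{alv}{vowel}"))  # A-labiodental, V-alveolar
    avi_stimuli.append(("AVi", f"A:{alv}{vowel}+V:{lab}{vowel}"))  # A-alveolar, V-labiodental

total = len(a_stimuli) + len(v_stimuli) + len(avc_stimuli) + len(avi_stimuli)
print(len(syllables), len(avi_stimuli), total)  # 18 12 66
```

The enumeration reproduces the counts reported in the Methods: 18 A, 18 V, 18 AVc and 12 AVi stimuli, 66 in total.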

English majors. They came to Canada after age 18 and had been there for a relatively short time (1–4 years), and had never resided in any other English-speaking environment prior to their arrival in Canada. They reported using and being exposed to their native Korean or Mandarin as their more dominant language since their arrival. All three groups of participants reported having normal or corrected vision and no known history of speech impairments. They were compensated for their participation.

2.2. Stimuli

Stimuli, exemplified in Table 2, were based on 18 English CV syllables having a fricative onset ([f], [v], [θ], [ð], [s], or [z]) followed by a vowel ([i], [ɑ], or [u]). The fricatives comprised auditory–visual categories (labiodental, interdental, alveolar) differing in place of articulation, from relatively front to relatively less front in the vocal tract. As described earlier, whereas these three places of articulation are contrastive in English, in Mandarin only voiceless labiodental and alveolar fricatives are phonemically distinctive, and in Korean only voiceless alveolar fricatives exist. Both voiced and voiceless fricatives were included for a broad evaluation of place of articulation identification. The vowels ([i, ɑ, u]), phonemically distinctive in English, Mandarin, and Korean, represent diverse vocal tract configurations differing in tongue height, advancement, and lip-rounding (Hazan et al., 2005, 2006; Jongman, Wang, & Kim, 2003). Thus each place of articulation was represented by six exemplars (2 voicing conditions × 3 vowels) to ensure responses to phonetic distinctions rather than acoustic idiosyncrasies.

These syllables were the basis for developing stimuli varying in AV composition of the syllable's fricative POA: (1) congruent auditory and visual syllables (AVc), (2) auditory-only syllables (A), (3) visual-only syllables (V), and (4) incongruent auditory and visual syllables (AVi; cf. McGurk & MacDonald, 1976).

In the AVi stimuli, the auditory and visual components had mismatched labiodental and alveolar fricatives such that they were either [Alabiodental Valveolar] (e.g., auditory [fɑ] dubbed onto visual [sɑ]) or [Aalveolar Vlabiodental] (e.g., auditory [sɑ] dubbed onto visual [fɑ]). Interdentals (nonnative to Mandarin and Korean perceivers) were not included in the creation of the mismatched stimuli, but are expected to be one of the major occurrences of AV-fusion, being the only fricative category intermediate to the labiodental and alveolar places of articulation. The interdentals share the "nonsibilant" acoustic feature with the labiodentals and the "nonlabial" visual feature with the alveolars, and previous research has shown that a fused percept usually combines auditory and visual information (e.g., McGurk & MacDonald, 1976; Sekiyama et al., 2003; Werker et al., 1992). In a given AVi syllable the A and V components always had the same fricative voicing and vowel. Thus 12 AVi stimuli were included (2 AV-input places [Alabiodental Valveolar, Aalveolar Vlabiodental] × 2 voicing conditions × 3 vowels). A total of 66 stimuli were used across stimulus conditions (18 A, 18 V, 18 AVc, 12 AVi).

Audio and video recordings were made with an adult male speaker of Canadian English sitting against a white background in the recording studio in the Language and Brain Lab at Simon Fraser University. While the speaker produced six randomized repetitions of the 18 English syllables (6 fricatives × 3 vowels) at a normal speaking rate, recordings of the speaker's face were made using a digital camcorder (SONY DCR-HC30/40) positioned approximately 2 m away. In addition, separate audio recordings were made simultaneously with a Shure KSM 109 condenser microphone via an audio interface (M-Audio MobilePre USB) to a PC at a 44.1 kHz sampling rate. These high-quality audio recordings were used to replace the audio track from the camcorder recording.

A best example was selected from among the six repetitions of each syllable such that the durational difference among the 18 selected syllables was under 10%, the approximate just-noticeable difference for duration (Lehiste, 1970). For these selected syllables, the audio-video recordings from the camcorder were aligned with the corresponding high-quality audio recordings by synchronizing the two waveforms using SoundForge 8.0. The audio track from the camcorder was then deleted. The audio tracks were then normalized to the same unweighted RMS value for the resulting stimuli. The video tracks were edited to have a 1.2 s neutral face before and after the stimulus; that is, before the frame where mouth opening first occurred and after the frame where the mouth was fully closed. All stimuli had a frame length of .06 s and a resolution of 640 × 480 pixels.

The resulting AV materials were used as AVc stimuli. They were also the basis for creating the A-only, V-only, and AVi stimuli. The A-only and V-only stimuli were created from the AVc stimuli by removing the video tracks or muting the audio tracks, respectively. The AVi stimuli were based on the same auditory and visual components as those used in the AVc, A-only and V-only conditions. To create the AVi stimuli, the audio and video components were aligned from syllables differing only in place of articulation (e.g., A[fɑ]V[sɑ]). Starting with the AVc stimulus that had the target video component (e.g., AVc [sɑ]), the audio signal of the target audio component (e.g., [fɑ]) was aligned with the onset of the fricative audio signal from the AVc, and the original AVc audio (e.g., audio [sɑ]) was removed, so that the resulting AVi had the original AVc video component (e.g., [sɑ]) with a new audio signal (e.g., [fɑ]).

To test the intelligibility of the audio signals and the naturalness of the AV signals, the final AVc and AVi stimuli were evaluated by two phonetically trained native speakers of English. An identification task testing intelligibility of the audio signals showed 95% correct responses (errors were 1.5% labiodental, 2% interdental, .5% alveolar, and 1% voicing and other errors) for the audio signals in the AVc stimuli and 100% correct responses for the audio signals in the AVi stimuli. These scores are comparable to (and exceed) those in previous AV studies of auditory perception of English fricatives by native English listeners (e.g., Jongman et al., 2003; Werker et al., 1992). The naturalness of the AV stimuli was tested in a 5-point goodness rating task (5 being the most natural) where the same two evaluators judged whether the audio and video signals were naturally synchronized, regardless of what they heard or saw. The AVc stimuli were rated 4.4 and the AVi stimuli 4.6.

2.3. Procedure

A perception experiment was generated using E-Prime 1.0 (Psychology Software Tools, integrating video clips imported from Microsoft PowerPoint files) to present stimuli and log participant responses. Participants were tested in an identification task using the full set of stimuli blocked by modality: A, V, and AV (with AVi and AVc stimuli in the same block). Two randomized repetitions of each stimulus were included in each block. The presentation order of the blocks was counterbalanced across participants. The test began with instructions for the task, familiarization with the stimuli (e.g., matching the symbols and the sounds they represent), and five practice trials for each modality (A, V, AVc/AVi). The practice trials were presented with the same speaker as used in the test, but no feedback was provided, to avoid any learning effect. The test was followed by a debriefing session, which included a post-test questionnaire. The full experiment, including a short break after three test blocks, lasted about one hour per participant.

Stimuli were presented auditorily over loudspeakers, visually on a computer monitor, or both. Participants were tested individually, sitting approximately 1 m from a 20-in LCD flat-panel computer monitor and two loudspeakers (Altec Lansing) positioned on each side of the monitor, so that the audio and video sources were approximately the same distance from the perceiver. Loudspeakers were used instead of headsets to avoid any bias toward the audio component. The audio signal had a comfortable level of approximately 70 dB, which has previously been shown to be an appropriate level for speech perception experiments (Nábelek & Robinson, 1982; Takata & Nábelek, 1990). For each trial, a visual fixation point was displayed in the middle of the monitor for one second, followed by the target stimulus.

Six response alternatives [f, v, θ, ð, s, z] were then shown on the monitor, together with an "other" option allowing participants to type in an alternative response. "Th" and "dh" were used to represent [θ] and [ð], respectively (Jongman et al., 2003). The participants were given the "other" option since previous research has shown that participants' responses are not limited to the given response types and, in the case of incongruent stimuli, are not limited to one (fused) response (e.g., Hardison, 1999; Sekiyama et al., 1996; Werker et al., 1992). However, these studies have also revealed that most of the alternative responses involved a difference in manner of articulation and/or voicing in addition to POA. For instance, the (fused) response for the AV incongruent ba-ga stimuli could be /dɑ/ or /ðɑ/ (McGurk & Buchanan, 1981; McGurk & MacDonald, 1976). Since most of these studies focus on POA, responses were typically analyzed in terms of POA regardless of voicing and manner (e.g., Burnham & Dodd, 2004; Hayashi & Sekiyama, 1998; Werker et al., 1992). Thus, as in other similar AV studies (e.g., Chen & Hazan, 2007; Hazan et al., 2006; Jongman et al., 2003), the current study employed a "quasi-closed-set" task (i.e., 6 target fricatives and "other") with the focus on POA.

The participants were to identify the fricative closest to the one they perceived and respond by pressing the corresponding key on the keyboard. Participants were allowed a maximum of 4 s to respond in each trial. In the blocks where visual stimuli were involved, participants were instructed to always watch the speaker's face. In addition to the visual fixation mark they were to attend to at the beginning of each trial, visual notes were placed on the computer monitor and keyboard to remind them to keep their eyes on the monitor. Only one participant was tested at a time, which allowed the experimenter to observe the testing process, ensuring that the perceiver's eyes were focused on the speaker's face.

3. Results

Correct responses were tabulated according to L1 group, AV-input modality, place of articulation of the fricatives, voicing of the fricatives, vowel context, as well as AV-input place (for the AVi condition). Separate data analyses were carried out for the A, V, and AVc conditions, and for the AVi condition.

3.1. Auditory, visual, and congruent auditory–visual conditions

The percent correct identification of the fricatives was analyzed using a 5-way mixed analysis of variance (ANOVA), with Group (English, Korean, Mandarin) as a between-subjects factor, and Modality (A, V, AVc), Place of articulation (POA) (labiodental, interdental, alveolar), Voicing (voiceless, voiced), and Vowel ([i, ɑ, u]) as repeated measures. The dependent variable was perceivers' correct identification of POA (Hazan et al., 2005, 2006; Jongman et al., 2003; Werker et al., 1992). Fig. 1 presents mean percent correct responses for the English, Korean, and Mandarin groups as a function of Modality and POA.

A significant main effect of Group was found [F(2,47) = 18.9, p < .0001], with the post hoc analysis (Tukey-HSD) showing that native English perceivers had a higher overall percent correct (84%) than the Korean group (80%), which in turn had a higher percent correct than the Mandarin group (70%). Significant main effects were also observed for Modality [F(2,47) = 179.6, p < .0001] and POA [F(2,47) = 29.1, p < .0001], along with significant interactions of Group × Modality [F(5,47) = 6.1, p < .001], Group × POA [F(5,44) = 5.2, p < .001], and Group × Modality × POA [F(8,41) = 4.1, p < .001]. However, the results showed no significant main effect of Voicing [F(1,48) = .4, p = .517]. Furthermore, although a significant main effect of Vowel was observed [F(2,47) = 22.5, p < .0001], there were no reliable interactions involving L1 group (Group × Vowel: [F(4,45) = 1.2, p = .308]; Group × Vowel × Modality: [F(8,41) = 1.7, p = .110]; Group × Vowel × POA: [F(8,41) = 1.1, p = .395]; Group × Vowel × Voicing: [F(4,45) = 1.4, p = .228]; Group × Modality × POA × Voicing × Vowel: [F(16,33) = .9, p = .463]), indicating that vowel context affected the three native groups in a similar fashion. Since the goal of the present study was to compare native group differences as a function of AV-input modality and fricative place of articulation, further analyses were conducted to focus only on those factors which revealed significant interactions.

Given the significant interactions of Group, Modality and POA, sets of two-way ANOVAs were conducted for each POA to compare how the AV modalities were perceived by the different native groups. Significant Group and Modality interactions were observed only for labiodentals [F(4,45) = 3.3, p < .015] and interdentals [F(4,45) = 4.7, p < .002]; no interaction occurred for alveolars, where all native groups exhibited significantly poorer V than A or AVc perception (see Fig. 1). Subsequent one-way repeated measures ANOVAs and post hoc analyses (Bonferroni adjustment) were carried out with Modality as a factor for each Group for labiodentals and interdentals. As shown in Fig. 1, for labiodentals, whereas the native English group did not reveal any difference across input modalities (with equally high scores in all three) [F(2,42) = .9, p = .413], both the Korean [F(2,42) = 7.8, p < .002] and Mandarin [F(2,57) = 5.3, p < .009] perceivers' identification in the V modality was significantly poorer than in the AVc modality. For interdentals, while the English natives showed poorer identification in V than in both the A and AVc conditions [F(2,42) = 15.7, p < .0001], the Koreans had lower accuracy in the A and V modalities compared to the AVc modality [F(2,42) = 11.2, p < .0001], and the Mandarin perceivers showed lower identification accuracy in the A than the AVc modality [F(2,57) = 5.3, p < .009]. These results indicate that the three native groups' perception of labiodentals and interdentals differed depending on the input modality.

To compare these group effects directly, sets of one-way ANOVAs and post hoc analyses (Tukey-HSD) were carried out with Group as a between-subjects factor for the labiodentals and interdentals at each modality.
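The POA-based scoring used in these analyses (a response counts as correct when its place of articulation matches that of the stimulus fricative, regardless of voicing) can be sketched as follows. This is an illustrative reconstruction, not the study's analysis code; all names in the snippet are assumptions, with "th"/"dh" mirroring the response labels mentioned in the Procedure.

```python
# Illustrative sketch of POA-based scoring: a response is correct if its
# place of articulation matches the stimulus, regardless of voicing.
POA = {
    "f": "labiodental", "v": "labiodental",
    "th": "interdental", "dh": "interdental",  # "th"/"dh" as in the response display
    "s": "alveolar", "z": "alveolar",
}

def percent_correct(trials):
    """trials: list of (stimulus_fricative, response) pairs."""
    scored = [POA[stim] == POA.get(resp) for stim, resp in trials]
    return 100.0 * sum(scored) / len(scored)

# Hypothetical trials: /v/ heard as /f/ still counts as correct (same POA),
# while "th" heard as /s/ does not; an "other" response maps to no POA.
trials = [("v", "f"), ("th", "s"), ("s", "s"), ("dh", "th")]
print(percent_correct(trials))  # 75.0
```

Using POA.get for the response means typed "other" answers outside the six fricatives simply score as incorrect, consistent with the quasi-closed-set focus on POA.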


Fig. 1. Mean percent correct responses for each modality (A, V, AVc) and place of articulation (labiodental, interdental, alveolar), by English, Korean, and Mandarin groups. The symbol "*" indicates statistically significant (p<.05) differences among the three participant groups for each modality and place of articulation. The error bars show ½ standard deviation of the corresponding mean.
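The response-pattern analyses in Section 3.2 (Table 3) summarize, for each presented place of articulation, the percentage of trials receiving each response alternative. As a concrete illustration of how such a confusion matrix is tabulated (a minimal Python sketch; the record format is hypothetical and not part of the study's materials):

```python
from collections import Counter

PLACES = ["labiodental", "interdental", "alveolar"]

def confusion_percent(responses):
    """responses: list of (presented, perceived) place pairs.

    Returns {presented: {perceived: percent}} over the three
    place alternatives, i.e. a Table 3-style confusion matrix.
    """
    by_presented = Counter(presented for presented, _ in responses)
    pair_counts = Counter(responses)
    return {
        pres: {
            perc: 100.0 * pair_counts[(pres, perc)] / by_presented[pres]
            for perc in PLACES
        }
        for pres in PLACES
        if by_presented[pres]  # skip places never presented
    }

# Illustrative input only, not the study's data:
resp = [("interdental", "interdental")] * 3 + [("interdental", "alveolar")]
matrix = confusion_percent(resp)
# matrix["interdental"]["alveolar"] -> 25.0
```

Rows of such a matrix sum to (approximately) 100% per presented category, with any residual reflecting excluded "other" or missed responses, as noted under Table 3.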

As shown in Fig. 1, significant group differences were observed in the following conditions. For the labiodentals, in the V condition, only the Korean perceivers showed lower identification accuracy than did the English natives [F(2,47) = 5.1, p<.010]. For the interdentals, in the A condition, the Mandarin perceivers exhibited a significantly lower identification accuracy than the Korean perceivers, who in turn had a significantly lower identification accuracy than the native English perceivers [F(2,47) = 18.9, p<.0001]; in the AVc condition, only the Mandarin perceivers' identification was lower than that of the native English [F(2,47) = 10.2, p<.0001]. No other conditions revealed significant group differences.

To summarize: (1) for the labiodentals, both the Korean and Mandarin perceivers showed higher identification accuracy in the AVc than in the V modality; however, only the Korean perceivers' V identification did not reach the native English level; (2) for the interdentals, both the Korean and Mandarin perceivers showed better identification accuracy in the AVc than in the A or V modality, and whereas the Koreans showed poorer than native English identification only in the A condition, the Mandarin perceivers showed poorer A and AVc identification than did both the English and Korean groups; and (3) for the alveolars, the three groups showed similar patterns, all having much poorer scores in the V than in the A or AVc condition, probably because the alveolars are the least visually accessible of the three types of fricatives. Overall, these results reveal that the Korean and Mandarin perceivers perceived the AVc condition better than the A or V conditions for their corresponding nonnative sounds (labiodentals and/or interdentals), although they still could not reach native performance in identifying these sounds, particularly the labiodental V identification for the Koreans, the interdental AVc identification for the Mandarin participants, and the interdental A identification for both nonnative groups.

3.2. Confusion patterns

The perceivers' response patterns for POA were analyzed for the A, V and AVc materials, as shown in Table 3. To compare group differences in perceptual confusion patterns, 2-way mixed ANOVAs were conducted for each Modality and POA, with Group (English, Korean, Mandarin) as the between-subjects factor and Response pattern (perceived as labiodental, interdental, or alveolar) as the repeated measure; the post hoc analyses were Tukey-HSD (for between-group comparisons) and Bonferroni adjusted (for within-group comparisons). Significant group differences in response patterns were found for the following conditions: labiodental responses in the V condition [F(4,45) = 3.5, p<.038]; interdental responses in the A [F(4,45) = 11.4, p<.0001], V [F(4,45) = 3.9, p<.026], and AVc [F(4,45) = 8.0, p<.0001] conditions; and alveolar responses in the A [F(4,45) = 7.5, p<.001], V [F(4,45) = 3.1, p<.020], and AVc [F(4,45) = 22.7, p<.0001] conditions.

Table 3
Confusion matrix for fricative identification by English (Eng), Korean (Kor), and Mandarin (Man) participants, with mean percent responses (%) for labiodental, interdental, and alveolar response alternatives.

                              Presented stimuli
                     Labiodental      Interdental      Alveolar
     Perceived as    Eng  Kor  Man    Eng  Kor  Man    Eng  Kor  Man
A    Labiodental      87   93   85     11   29   31      0    0    0
     Interdental      13    6    9     89   67   50      0    1   13
     Alveolar          0    1    2      0    3   19    100   98   85
V    Labiodental      90   76   82     11    5    9     38   21   34
     Interdental       8   14   12     71   67   60     26   32   37
     Alveolar          2    8    5     16   25   30     33   44   27
AVc  Labiodental      93   96   90      6   14   13      0    0    0
     Interdental       7    4    7     94   84   67      0    1   16
     Alveolar          0    0    1      0    2   18     99   98   81

Note: The "other" responses are not included since they account for less than 2% of the responses, mostly due to missed responses.

Post hoc analyses show that, for the labiodentals in the V condition, while all groups to some degree misperceived the labiodentals as interdentals, the Koreans (for whom the labiodentals are nonnative) gave a greater percentage of alveolar responses (8%) than the Mandarin group (5%), which in turn gave more alveolar responses than did the English group (2%) (p<.040), indicating that for the nonnatives, particularly the Koreans, the labiodentals and alveolars were less visually distinguishable. The three groups did not differ significantly in the A and AVc conditions in terms of the confusion patterns, all being more likely to misperceive labiodentals as interdentals than as alveolars (p<.05 for all groups).

For the interdentals in the V condition, both the Mandarin and Korean groups' interdental misperception was more biased towards the alveolars than the labiodentals (Mandarin: 30% versus 9%, p<.001; Korean: 25% versus 5%, p<.005), while the English natives misperceived the interdentals as labiodentals or alveolars to a more similar degree (16% versus 11%, p>.716). In both the A and AVc conditions, the Mandarin group misperceived 18–19% of the interdentals as alveolars, whereas for the Koreans this misperception was 2–3% (A: p<.0001, AVc: p<.001), and the English natives never confused interdentals and alveolars. Consistently, for the alveolars across conditions, the Mandarin perceivers more often perceived alveolars as interdentals than did the English and Koreans (A: p<.0001, V: p<.031, AVc: p<.0001). These results indicate that across modalities the nonnatives, particularly the Mandarin group, were more likely to mix interdentals with alveolars than were the native English perceivers.

3.3. AV incongruent condition

Responses to AVi stimuli were analyzed based on whether they matched the fricative in the stimulus' auditory component (A-match) or visual component (V-match), or were intermediate to the A and V components (AV-fusion; i.e., interdental, since no relevant "other" responses applied). Sets of 2-way mixed ANOVAs were carried out for each of these response types, with Group as a between-subjects factor and AV-place input ([A_labiodental+V_alveolar], [A_alveolar+V_labiodental]) as a repeated measure. Fig. 2 displays the mean percent responses for each response type as a function of AV-place input.

Results for the AV-fusion (interdental) responses show significant main effects of Group [F(2,97) = 4.9, p<.011] and Input place [F(1,98) = 24.4, p<.0001], and a Group × Input interaction [F(5,94) = 5.6, p<.007]. Overall, the Mandarin group had a higher mean percent of AV-fusion responses than the Korean group, whose responses were in turn greater than the English group's (Tukey-HSD, p<.05), indicating that the nonnative perceivers more easily fused the incongruent A and V components of the stimuli (McGurk effect) despite the intermediate fricatives being nonnative. Further separate one-way ANOVAs for each input place revealed that this pattern of native-group difference only reliably occurred with the A_alveolar+V_labiodental input [F(2,47) = 18.9, p<.0001], that is to say, with stimuli whose visual component was more visually accessible.

For the A-match responses, significant results were observed for Group [F(2,97) = 8.5, p<.001] and Input place [F(1,98) = 29.5, p<.0001], as well as a Group × Input interaction [F(5,94) = 3.6, p<.034]. Separate one-way ANOVAs for each input place revealed a significant group difference for the A_alveolar+V_labiodental input [F(2,47) = 17.5, p<.0001], showing that the Mandarin perceivers gave fewer responses matching the auditory component than did the English and Korean participants (Tukey-HSD, p<.05).

Finally, for the V-match responses, there was no significant main effect of Group [F(2,97) = 2.3, p = .109] or Input place [F(1,98) = .23, p = .632]. However, a significant Group × Input interaction was observed [F(5,94) = 4.4, p<.020]. Further analyses showed a reliable difference only for the A_labiodental+V_alveolar input [F(2,47) = 5.0, p<.011], with a greater number of V-match responses for the Mandarin than for the Korean and English perceivers.

Overall, all groups responded predominantly to the auditory rather than the visual input. The nonnatives, Mandarin in particular, had fewer A-match responses than the native English perceivers. While the native English group only demonstrated AV-fusion when the auditory component was labiodental and the visual component was alveolar, but not the reverse, the nonnatives, especially the Mandarin group, showed similar effects for both input places, demonstrating that the Mandarin perceivers more easily fused the incongruent auditory and visual components of the stimuli. These results show that whereas the Mandarin perceivers were more affected by the visual input, the pattern of results for the Koreans is more like that of the native English perceivers, with more use of auditory information.

4. Discussion

4.1. Perception of auditory and visual modalities

The results show that native English perceivers can in general accurately perceive the fricatives in the AVc and A conditions, agreeing with previous results on the perception of English fricatives (e.g., Jongman et al., 2003) and other consonants (e.g., Chen & Hazan, 2007; Hardison, 1999; Massaro, 1998; Sekiyama & Tohkura, 1993; Werker et al., 1992). The results are also consistent

Fig. 2. Mean percent responses for incongruent AV stimuli by English, Korean, and Mandarin groups, as a function of AV input place (A_labiodental+V_alveolar versus A_alveolar+V_labiodental). "A match" and "V match" indicate responses matching, respectively, the A and V components of the stimuli. "Fusion" refers to the intermediate interdental responses, matching neither the A nor the V component, and corresponding to the McGurk effect. The symbol "*" indicates statistically significant (p<.05) differences among the three participant groups for each response type and input place. The error bars show ½ standard deviation of the corresponding mean.

with the previous findings that even native perception of the interdentals and alveolars (particularly the latter) is poorer in the V than in the A and AVc conditions (Chen & Hazan, 2007a, 2007b; Cienkowski & Carney, 2002; McGurk & MacDonald, 1978), and that consonants involving labials are the most visually distinctive (Hardison, 1999; Woodward & Barber, 1960). Thus, it appears that the native English perceivers actively use the auditory component across fricative places of articulation. Their use of the visual information decreases as visual salience decreases (from labiodentals to alveolars) and as auditory salience increases (from the nonsibilant labiodentals to the sibilant alveolars).

Agreeing with previous findings (e.g., Hardison, 1999, 2003; Hazan et al., 2005, 2006; Navarra & Soto-Faraco, 2007; Sekiyama, 1997), group comparisons also reveal differences among the English, Korean, and Mandarin natives in the use of auditory and visual cues when perceiving English. The results show that native and nonnative perceivers vary in their accuracy perceiving the different POAs and in how they weight the auditory and visual input, depending on whether the auditory and/or visual cues are linguistically distinctive in their L1.

Of particular interest is the perception of the interdental fricatives, which are nonnative to both the Mandarin and Korean perceivers. The results support the first hypothesis raised in this study, that the nonnative perceivers would not reach native performance (particularly in the A and/or AVc modalities). However, both nonnative groups performed better in the AVc than in the A condition, suggesting an ability to exploit visual information in L2 speech perception. Given previous findings that perceivers rely more on visual speech information when intelligibility is poor (Erber, 1969; Sumby & Pollack, 1954; Summerfield, 1979), these nonnative perceivers conceivably made use of the visual information as an additional channel of input in perceiving the difficult nonnative sounds. Indeed, previous research has consistently shown that the perception of L2 stimuli (such as English /f, r/ by Japanese perceivers) improves with additional visual information (Hardison, 1999), that auditory perception of the most difficult sounds benefits most from visual cues (Hardison, 2005a, 2005b), and that visual cues enhance L2 speech comprehension (Navarra & Soto-Faraco, 2007; Reisberg et al., 1987; Soto-Faraco et al., 2007).

Results for labiodental perception show that for the Koreans, whose L1 does not contain labiodentals, perception reached the native English level in the A and AVc conditions, but not in the V condition. Similar patterns have previously been reported, with Korean learners demonstrating 92% correct perception of the English /f/ in an A condition and 97% in an AV condition (Hardison, 1999). The relatively accurate auditory perception of the labiodentals may be because, as nonsibilants, labiodentals are acoustically and perceptually very distinctive from the closest L1 sounds in Korean, which are the sibilant alveolar fricatives (e.g., Jongman, Wayland, & Wong, 2000; Jongman et al., 2003). In the V condition, perceptual accuracy with labiodentals is lower for the Korean learners than for the other language groups, indicating that they have greater difficulty fully working out the visual cues for the English labiodentals. Their lack of accurate use of the visual input may be due to the relatively distinctive auditory input, and resembles the native English group's performance, where the visual benefit is not apparent when auditory perception reaches ceiling. From an alternative angle, the Korean perceivers' unbalanced performance in the auditory and visual domains indicates that grasping the auditory cues for the labiodentals may have preceded the visual cues. Moreover, the confusion patterns for visual perception show that, compared to the native English and Mandarin perceivers, the Koreans more often confused labiodentals with alveolars, a place of articulation native for them but not immediately adjacent to the labiodentals. This may have been due to the lack of place-of-articulation contrasts for front fricatives in Korean, leading to incorrect assimilation of the L2 visual cues to an L1 category.

The visual perception of alveolars has low accuracy for nonnatives and natives alike. As sibilants, alveolars are more acoustically robust than the nonsibilant labiodentals and interdentals (e.g., Jongman et al., 2003), and it has been claimed that the degree of visual contribution is inversely related to the ambiguity of auditory information (e.g., Chen & Massaro, 2004). Thus, since the auditory (acoustic) cues of alveolars are robust, perceivers are less dependent on visual cues for them, and the results consistently show very high scores for auditory perception of alveolars. This supports previous observations that a visual benefit is in part determined by the visual salience of the place of articulation, and that consonants which are less front are less likely to show visual influence than relatively front ones (e.g., Dodd, 1977; Hardison, 1999; Hazan et al., 2006).

4.2. Incongruent auditory–visual condition

All three language groups captured auditory information more often than visual information or AV-fusion (i.e., the McGurk illusion), which is consistent with previous findings showing more auditory responses for the AV-incongruent /p-f/ (Hardison, 1999) and /b-g/ (Chen & Hazan, 2007) combinations. However, as has also been found previously (e.g., Burnham & Dodd, 1998; Chen & Hazan, 2007; Sekiyama & Tohkura, 1993; Sekiyama et al., 2003), the nonnative perceivers (Mandarin in particular), compared with the native perceivers, had more occurrences of AV-fusion, possibly because the nonnative auditory and visual cues of these sounds are less stable and therefore more susceptible to cross-modal influence. Compared to the Mandarin perceivers, the Koreans show a lower degree of fusion, and their high accuracy in the A-match condition indicates that they have come further in working out the auditory cues of the nonnative labiodentals.

Moreover, a direction effect with AV-fusion is also observed with the native English perceivers, with more fused responses for A_labiodental+V_alveolar stimuli than for A_alveolar+V_labiodental stimuli. Given that interdentals share the acoustic feature of "nonsibilant" with labiodentals and the visual feature of "nonlabial" with alveolars, it is natural that fusion (i.e., the interdental response) more easily occurs for native perceivers when the auditory input is labiodental and the visual input is alveolar. Findings consistent with these results have previously been reported, where native perceivers demonstrate greater fusion when A-labials are combined with V-nonlabials (McGurk & MacDonald, 1978; Sekiyama & Tohkura, 1991). It is worth noting that this direction effect is less apparent for the Mandarin perceivers than for the Korean and native English perceivers, even though for the Koreans both the input component (labiodental) and the expected fused percept (interdental) are nonnative. This may be due to the Mandarin perceivers' less stable use of auditory information (compared to the Koreans' and the English natives' more exclusive use of the auditory input), leading them to be more susceptible to auditory–visual fusion.

4.3. General discussion

The present results from L2 AV speech perception suggest the possibility of bridging learning patterns across speech input modalities. Indeed, the acquisition of nonnative visual cues may be analogous to that of auditory L2 speech learning (Hazan et al., 2005, 2006; Ortega-Llebaria et al., 2001; Sekiyama, 1997). Just as with auditory L2 perception, difficulty with the perception of L2 visual information may also be attributed to influence from the L1. For L2 visual information which is nonassimilable to an L1 category, learners need to learn to associate L2-specific visual cues with the corresponding L2 phones in order to establish new L2 categories.

Theories based on auditory perception (e.g., the SLM) suggest that whether new L2 categories can be created depends on the perceived phonetic distance of an L2 sound from the closest L1 speech category: category formation may be blocked under the mechanism of category assimilation, where similar L1 sounds exist in the vicinity of the target L2 sounds (Flege, 2007). Given this account, the current study hypothesized that perception of the nonnative interdentals would be more difficult for the Mandarin perceivers than for the Koreans, since the Mandarin perceptual space is more crowded in the neighborhood of the interdentals (with both labiodentals and alveolars). Indeed, the results show that the Koreans outperform the Mandarin perceivers in the AVc condition, demonstrating more efficient use of visual information when integrating it with the auditory input.

Recent AV speech learning proposals also suggest that, in addition to similarities in visual cues between an L1 and L2, L2 speech learning models should include further factors to account for AV L2 speech processing, such as the relative weighting of auditory and visual cues and the distinctiveness of visual cues (Hazan et al., 2006). Indeed, nonnative perceivers may weight the auditory and visual input differently (Sekiyama, 1997; Ortega-Llebaria et al., 2001). In the present study, the nonnative Mandarin group may have increased visual weighting in the perception of the new L2 interdentals, although they still have difficulty assimilating them to the appropriate L2 categories (e.g., the increased performance in the AVc condition compared to the A condition, despite the interdentals still often being confused with the adjacent alveolars familiar to the Mandarin perceivers from their L1). On the other hand, for the Korean perceivers, the auditory perception of these nonnative sounds is relatively good, which may lead to their lack of use, or less accurate use, of the visual domain. These findings suggest that the perception and acquisition of L2 sounds in the auditory and visual domains may not occur in parallel, and may even take place in a complementary manner.

5. Concluding remarks and future directions

In sum, while L2 learners make use of both auditory and visual information in perceiving nonnative speech sounds, their perception is influenced by the interaction of the AV speech categories in their L1 and L2. In future research, the correlation between the perceptual distance of L1 and L2 visual categories and the corresponding level of difficulty in acquisition (e.g., the more dissimilar the L1 and L2 visual categories are, the easier the formation of an L2 category) needs to be more extensively studied and even quantified, as also suggested by the auditory speech learning research (e.g., Flege, 2007; Strange, 1999; Strange, Yamada, Kubo, Trent, & Nishi, 2001). Furthermore, the current study indicates that L2 auditory and visual speech cues may not be acquired simultaneously; rather, AV learning may even occur in a complementary manner. Subsequent research and theories should take into account the visual as well as auditory relationships of L2 speech sounds with the corresponding L1 sounds.

Additionally, as the current study focuses on the effect of L1, it has left unaddressed a number of factors that may also affect the perception of L2 speech contrasts, such as L2 phonetic context and variability (Hardison, 2005b), length of residence (LOR) in an L2 country and L2 input (Flege, 2007), and attention (Guion & Pederson, 2007). One factor worth noting is phonetic context. For example, previous research has shown that vowel context influences the neighboring sounds not only in native AV perception (e.g., Benguerel & Pichora-Fuller, 1982; Daniloff & Moll, 1968) but also in nonnative perception (e.g., Hardison, 2005b). Since the vowels used in the current study exist in the native phonetic inventories of all three native groups, no interaction of L1 group and vowel context was observed. A possible future direction is to examine whether unfamiliar (nonnative) and familiar vowel contexts would similarly influence the perception of the target L2 sounds.

Second, although the current nonnative participants all had a LOR between 1 and 5 years, the average LOR for the Korean participants (4 years) was longer than that for the Mandarin participants (2 years). Given that research has demonstrated the effect of LOR on L2 AV (Wang et al., 2008) as well as auditory speech learning (e.g., Flege, Yeni-Komshian, & Liu, 1999; McAllister, Flege, & Piske, 2002; Riney & Flege, 1998), the 2-year LOR difference could have led the Koreans to outperform the Mandarin participants in the current results. On the other hand, the Koreans' reported mean daily English use (34%) was somewhat lower than that of the Mandarin natives (40%), which may have balanced out the slight LOR difference. Further longitudinal research would be needed to examine the dynamic auditory and visual learning patterns as a function of LOR (e.g., from 0 to 5 years) and L1/L2 input.

Moreover, to compare the relative auditory and visual contributions, it would be necessary to evaluate the degree of attention to the speech input from these two modalities. Techniques such as eye-tracking, which have previously been adopted in native AV processing research (e.g., Lansing & McConkie, 1999), would provide quantitative measures of where exactly the perceivers' eyes focus.

Together, the findings of the current study, along with previous L2 AV research, suggest a shared mechanism in L2 speech learning for the auditory and visual domains, providing supporting evidence to extend L2 speech learning theories developed on the basis of auditory speech perception to include visual speech perception, and thus bridging the L2 perceptual learning patterns across AV modalities.

Acknowledgments

We thank Nicole Carter, Chad Danyluck, Angela Feehan, Lan Kim, Lindsay Shaw, Rebecca Simms, and Jung-yueh Tu at Simon Fraser University (SFU) for their assistance in stimuli development, data collection and analysis. We would also like to thank Dr. Ocke-Schwen Bohn and the two anonymous reviewers for their valuable comments. Portions of this research were presented at the 2007 International Conference on Auditory–Visual Speech Processing (AVSP 2007), and published in the Proceedings of AVSP 2007 (Eds.: J. Vroomen, M. Swerts and E. Krahmer). Portions of the English and Mandarin data have been included in a previous article investigating the effect of length of residence in an L2-speaking environment on AV speech perception (Wang et al., 2008). This project was supported by a standard research grant from the Social Sciences and Humanities Research Council of Canada (SSHRC 410-2006-1034).

References

Benguerel, A. P., & Pichora-Fuller, M. K. (1982). Coarticulation effects in lipreading. Journal of Speech and Hearing Research, 25, 600–607.
Bernstein, L. E., Auer, E., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44, 5–18.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Baltimore: York Press.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second language speech perception: Commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege. Amsterdam/Philadelphia: John Benjamins.

Binnie, C. A., Montgomery, A. A., & Jackson, P. L. (1974). Auditory and visual contributions to the perception of consonants. Journal of Speech and Hearing Research, 17, 619–630.
Burnham, D., & Dodd, B. (1998). Familiarity and novelty in infant cross-language studies: Factors, problems, and a possible solution. In C. Rovee-Collier (Ed.), Advances in infancy research (pp. 170–187).
Burnham, D., & Dodd, B. (2004). Auditory–visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental Psychobiology, 45, 204–220.
Burnham, D., Lau, S., Tam, H., & Schoknecht, C. (2001). Visual discrimination of Cantonese tone by tonal but non-Cantonese speakers, and by non-tonal language speakers. In D. Massaro, J. Light, & K. Geraci (Eds.), Proceedings of the international conference on auditory–visual speech processing (pp. 155–160).
Chen, Y., & Hazan, V. (2007). Language effects on the degree of visual influence in audiovisual speech perception. In Proceedings of the 16th international congress of phonetic sciences (pp. 2177–2180). Saarbrücken, Germany.
Chen, T., & Massaro, D. W. (2004). Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics, 66, 820–836.
Cienkowski, K. M., & Carney, A. E. (2002). Auditory–visual speech perception and aging. Ear and Hearing, 23, 439–449.
Daniloff, R. G., & Moll, K. (1968). Coarticulation of lip rounding. Journal of Speech and Hearing Research, 11, 707–721.
Davis, C., & Kim, J. (2004). Audio-visual interactions with intact clearly audible speech. Quarterly Journal of Experimental Psychology, 57A, 1103–1121.
De Gelder, B., & Vroomen, J. (1992). Auditory and visual speech perception in alphabetic and non-alphabetic Chinese/Dutch bilinguals. In R. J. Harris (Ed.), Cognitive processing in bilinguals (pp. 413–426). Amsterdam: Elsevier.
Dodd, B. (1977). The role of vision in the perception of speech. Perception, 6, 31–40.
Erber, N. P. (1969). Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech and Hearing Research, 12, 423–425.
Erdener, V., & Burnham, D. (2005). The role of audiovisual speech and orthographic information in nonnative speech production. Language Learning, 55, 191–228.
Faulkner, A., & Rosen, S. (1999). Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audio-visual and auditory speech perception. Journal of the Acoustical Society of America, 106, 2063–2073.
Flege, J. E. (1980). Phonetic approximation in second language acquisition. Language Learning, 30, 117–134.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience (pp. 233–273). Baltimore: York Press.
Flege, J. E. (2007). Language contact in bilingualism: Phonetic system interactions. In J. Cole & J. Hualde (Eds.), Laboratory phonology, Vol. 9. Berlin: Mouton de Gruyter.
Flege, J. E., Yeni-Komshian, G., & Liu, S. (1999). Age constraints on second language learning. Journal of Memory and Language, 41, 78–104.
Green, K. P., & Kuhl, P. K. (1989). The role of visual information in the processing of place and manner features in speech perception. Perception and Psychophysics, 45, 34–42.
Guion, S. G., & Pederson, E. (2007). Investigating the role of attention in phonetic learning. In O.-S. Bohn & M. Munro (Eds.), Language experience in second language speech learning (pp. 57–77). Amsterdam: John Benjamins.
Hardison, D. M. (1999). Bimodal speech perception by native and nonnative speakers of English: Factors influencing the McGurk effect. Language Learning, 49, 213–283.
Hardison, D. M. (2003). Acquisition of second-language speech: Effects of visual cues, context, and talker variability. Applied Psycholinguistics, 24, 495–522.
Hardison, D. M. (2005a). Second-language spoken word identification: Effects of perceptual training, visual cues, and phonetic environment. Applied Psycholinguistics, 26, 579–596.
Hardison, D. M. (2005b). Variability in bimodal spoken language processing by native and nonnative speakers of English: A closer look at effects of speech style. Speech Communication, 46, 73–93.
Hayashi, Y., & Sekiyama, K. (1998). Native-foreign language effect in the McGurk effect: A test with Chinese and Japanese. In AVSP-1998 (pp. 61–66).
Hazan, V., Sennema, A., Iba, M., & Faulkner, A. (2005). Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Communication, 47, 360–378.
Hazan, V., Sennema, A., Faulkner, A., & Ortega-Llebaria, M. (2006). The use of visual cues in the perception of non-native consonant contrasts. Journal of the Acoustical Society of America, 119, 1740–1751.
Jongman, A., Wang, Y., & Kim, B. (2003). Contributions of semantic and facial information to perception of nonsibilant fricatives. Journal of Speech, Language, and Hearing Research, 46, 1367–1377.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.
Kang, S., & Lee, H. (2002). An experimental study on English word-initial consonant pronunciation learning by Korean students. Ohak Yonku/Language Research, 38, 385–406.
Kim, D. J. (1972). A contrastive study of English and Korean. Language Teaching, 5, 1–36.
Kuhl, P. K., & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. Science, 218, 1138–1141.
Ladefoged, P., & Wu, Z. (1984). Places of articulation: An investigation of Pekingese fricatives and affricates. Journal of Phonetics, 12, 267–278.
Lansing, C. R., & McConkie, G. W. (1999). Attention to facial regions in segmental and prosodic visual speech perception tasks. Journal of Speech, Language, and Hearing Research, 42, 526–539.
Lehiste, I. (1970). Suprasegmentals. Cambridge, MA: The MIT Press.
Massaro, D. W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: The MIT Press.
Massaro, D. W. (2001). Auditory visual speech processing. In Proceedings of the 7th European conference on speech communication and technology (Eurospeech 2001).
Massaro, D. W., Tsuzaki, M., Cohen, M. M., Gesi, A., & Heredia, R. (1993). Bimodal speech perception: An examination across languages. Journal of Phonetics, 21, 445–478.
McAllister, R., Flege, J., & Piske, T. (2002). The influence of L1 on the acquisition of Swedish quantity by native speakers of Spanish, English, and Estonian. Journal of Phonetics, 30, 229–258.
McGurk, H., & Buchanan, L. (1981). Bimodal speech perception: Vision and hearing. Unpublished manuscript, Department of Psychology, University of Surrey.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
McGurk, H., & MacDonald, J. (1978). Visual influences on speech perception processes. Perception and Psychophysics, 24, 253–257.
Nábělek, A. K., & Robinson, P. K. (1982). Monaural and binaural speech perception in reverberation for listeners of various ages. Journal of the Acoustical Society of America, 71, 1242–1248.
Navarra, J., & Soto-Faraco, S. (2007). Hearing lips in a second language: Visual articulatory information enables the perception of L2 sounds. Psychological Research, 71, 4–12.
Ortega-Llebaria, M., Faulkner, A., & Hazan, V. (2001). Auditory–visual L2 speech perception: Effects of visual cues and acoustic–phonetic context for Spanish learners of English. In D. Massaro, J. Light, & K. Geraci (Eds.), Proceedings of the international conference on auditory–visual speech processing (pp. 149–154).
Patterson, M. L., & Werker, J. F. (1999). Matching phonetic information in lips and voice is robust in 4.5-month-old infants. Infant Behavior and Development, 22, 237–247.

Patterson, M. L., & Werker, J. F. (2003). Two-month-old infants match Conference on Spoken Language Processing (pp. 1481–1484). phonetic information in lips and voice. Developmental Science, 6, Philadelphia, PA. 191–196. Soto-Faraco, S., Navarra, J., Voloumanos, A., Sebastia´n-Galle´s, N., Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to Weikum, & Werker, J. F. (2007). Discriminating languages by understand: A lip-reading advantage with intact auditory stimuli. In B. speechreading. Perception and Psychophysics, 69, 218–231. Dodd, & R. Campbell (Eds.), Hearing by eye: The psychology of lip- Strange, W. (1999). Levels of abstraction in characterizing cross-language reading (pp. 97–113). London: Erlbaum. phonetic similarity. In J. Ohala, Y. Hasegawa, M. Ohala, D. Granville, Riney, T. J., & Flege, J. E. (1998). Changes over time in global foreign & A. Bailey (Eds.), Proceedings of the XIVth international congress of accent and liquid identifiability and accuracy. Studies in Second phonetic sciences (pp. 2513–2519). University of California at Berkeley. Language Acquisition, 20, 213–244. Strange, W., Yamada, R., Kubo, R., Trent, S., & Nishi, K. (2001). Effects Sams, M., Manninen, P., Surakka, V., Helin, P., & Katto, R. (1998). of consonantal context on perceptual assimilation of American English McGurk effect in Finnish syllables, isolated words, and words in vowels by Japanese listeners. Journal of the Acoustical Society of sentences: Effects of word meaning and sentence context. Speech America, 109, 1691–1704. Communication, 26, 75–87. Sumby, W., & Pollack, I. (1954). Visual contribution to speech Schmidt, A. M. (1996). Cross-language identification of consonants, Part intelligibility in noise. Journal of the Acoustical Society of America, 1: Korean perception of English. Journal of the Acoustical Society of 26, 212–215. America, 99, 3201–3211. Summerfield, Q. (1979). Use of visual information for phonetic percep- Sekiyama, K. (1994). 
Differences in auditory–visual speech perception tion. Phonetica, 36, 314–331. between Japanese and Americans: McGurk effect as a function of Svantesson, J. O. (1986). Acoustic analysis of Chinese fricatives and incompatibility. Journal of the Acoustical Society of Japan (E), 15, affricates. Journal of Chinese Linguistics, 14, 53–70. 143–158. Takata, Y., & Na´belek, A. K. (1990). English consonant recognition in Sekiyama, K. (1997). Cultural and linguistic factors in audiovisual speech noise and in reverberation by Japanese and American listeners. Journal processing: The McGurk effect in Chinese subjects. Perception and of the Acoustical Society of America, 88, 663–666. Psychophysics, 59, 73–80. Tse, J. K. (1988). A spectrographic analysis of the coronal affricates and Sekiyama, K., & Tohkura, Y. (1991). McGurk effect in non-English fricatives in Mandarin Chinese. Ying Yu Yen Chiu Chi K’an/Studies in listeners: Few visual effects for Japanese subjects hearing Japanese English Literature and Linguistics, 14, 149–199. syllables of high auditory intelligibility. Journal of the Acoustical Wang, Y., Behne, D., & Jiang, H. (2008). Linguistic experience and Society of America, 90, 1797–1805. audio–visual perception of nonnative fricatives. Journal of the Sekiyama, K., & Tohkura, Y. (1993). Inter-language differences in the Acoustical Society of America, 124, 1716–1726. influence of visual cues in speech perception. Journal of Phonetics, 21, Weikum, W. M., Vouloumanos, A., Navarra, J., Soto-Faraco, S., 427–444. Sebastia´n-Galle´s, N., & Werker, J. F. (2007). Visual language Sekiyama, K., Kanno, I., Miura, S., & Sugita, Y. (2003). Auditory–visual discrimination in infancy. Science, 25, 1159. speech perception examined by fMRI and PET. Neuroscience Werker, J. F., Frost, P. E., & McGurk, H. (1992). La langue et les levres: Research, 47, 277–287. Cross-language influences on bimodal speech perception. Canadian Sekiyama, K., Tohkura, Y., & Umeda, M. (1996). 
A few factors Journal of Psychology, 46, 551–568. which affect the degree of incorporating lip-read information Woodward, M. F., & Barber, C. G. (1960). perception in lip into speech perception. In Proceedings of the 4th International reading. Journal of Speech and Hearing Research, 3, 212–222.