Proceedings, FONETIK 2004, Dept. of Linguistics, Stockholm University

Audiovisual perception of Swedish vowels with and without conflicting cues
Niklas Öhrström and Hartmut Traunmüller
Department of Linguistics, Stockholm University

Abstract
Auditory, visual and audiovisual syllables with and without conflicting cues (/i y e ø/) presented to men and women showed (1) most to perceive roundedness by eye rather than by ear, (2) a mostly male minority to rely less on vision, (3) presence of rounding to be noticed more easily than absence, and (4) all to perceive openness by ear rather than by eye.

Introduction
It has been known for a long time that lip reading is practiced not only by the deaf, but also by people with normal hearing. Especially in situations with an unfavorable signal-to-noise ratio, visual cues contribute substantially to speech comprehension (Erber, 1975; Mártony, 1974). Amcoff (1970) investigated lip reading of Swedish vowels and consonants by normal-hearing speakers. These investigations showed very good visual perception of labial features.

The topic of audiovisual integration in speech perception suddenly became very popular when McGurk and MacDonald (1976) had published their study in which auditory stimuli had been dubbed on visual stimuli with conflicting cues. They used repeated syllables with stop consonants and a following vowel. In stimuli with conflicting cues, they observed (1) fusions, such as when an auditory [b] presented together with a visual [ɡ] evoked the percept of a [d], and (2) combinations, such as when an auditory [ɡ] presented together with a visual [b] evoked the percept of [bɡ].

While the McGurk effect appears to work with perceivers with various language backgrounds, some differences have been observed: Japanese listeners show almost no effect of the visual signal when presented with stimuli produced by Japanese speakers. However, they do show an effect when the speaker is a foreigner (Sekiyama and Tohkura, 1993; Hayashi and Sekiyama, 1998).

In addition to cultural differences, there may also be sex differences in audiovisual perception since women have been shown to perform better in lip reading tasks (Johnson et al., 1988). This suggestion was confirmed by Aloufy, Lapidot and Myslobodsky (1996) with speakers of English, but the sex difference was much smaller among speakers of Hebrew.

Subjects exposed to stimuli with conflicting audiovisual cues typically report "hearing" what they perceive with their eyes. This illusion appears to have a neural foundation: Sams et al. (1991) observed that visual information from lip movements modifies activity in the human auditory cortex.

Audiovisual integration appears to be very robust since it works even when the visual and auditory components are from speakers different in sex (Green et al., 1991).

Although it is not far-fetched to ask oneself whether fusions analogous to those observed by McGurk and MacDonald (1976) would appear in vowel perception, we are not aware of any previous investigation of this kind. However, Summerfield and McGrath (1984) did experiments in which manipulated [bVd] syllables were presented to English subjects together with [i], [ɑ] and [u] faces. They observed that the phoneme boundaries in the auditory vowel space were moved closer towards the position of the vowels presented visually.

It is the purpose of the present study to investigate audiovisual integration in the perception of vowels in a language in which lip rounding is an independent distinctive feature. Since the auditory cues to lip rounding are not very prominent while the visual cues are, it could be expected that perceivers are heavily influenced by vision in perception of this feature. We shall also keep an eye on possible sex differences.

Method

Subjects
The speakers were 2 men (29 and 45 years of age) and 2 women (21 and 29 years), while 10 men (16 to 49 years) and 11 women (18 to 48 years) with normal hearing and vision served as listeners. The speakers were students and researchers from the Department of Linguistics, while the listeners were all phonetically naïve native speakers of Swedish.

Speech material
The materials consisted of four Swedish nonsense syllables /ɡiːɡ/, /ɡyːɡ/, /ɡeːɡ/, /ɡøːɡ/. The speakers' faces and the acoustic signal were recorded using a Panasonic NV-DS11 video camera and an AKG CK 93 microphone. The recorded material was subsequently edited and dubbed using Premiere 6.0. Each visual stimulus was synchronized with each one of the auditory stimuli. Each visual and each auditory stimulus was also presented alone. In this way, 24 final stimuli were obtained for each speaker. In the perception test these stimuli were each presented twice in random order. Thus, the perception test consisted of 192 stimuli in total. The stimuli were presented in 24 blocks of eight stimuli each, using Windows Media Player.

Procedure
The subjects participated one by one. They wore AKG K135 headphones and were seated with their faces at 60 cm from a computer screen. The height of the faces on screen was roughly 17 cm. The subjects were instructed to visually focus on the speaker's mouth while listening and to write down which vowel they had heard. They used response sheets and were allowed to choose any one of the 9 ordinary Swedish letters that represent vowels. Prior to the experimental blocks, one training block with 8 stimuli was run. The subjects were supervised during the whole session to make sure they were focused on the screen all the time. The entire session lasted about 20 min.

Results
A preliminary analysis of the results suggested that the listeners did not all agree in their behavior. In order to see whether different groups need to be distinguished, stepwise regression analyses were performed for the results of each listener, using the independent factors "auditory openness", "auditory roundedness", "visual openness" and "visual roundedness".

The result showed that "auditory openness" explained most of the variance for all 21 listeners. For the majority (Group 1: 16 listeners), "visual roundedness" explained the second-largest share, i.e., it explained more than "auditory roundedness" did. For a minority (Group 2: 5 listeners), "auditory roundedness" explained more of the variance than "visual roundedness" did.

Table 2. Confusion matrix for Group 1 (10 ♀, 6 ♂). Rows: presented, columns: perceived vowels (i y e ø ɛ ɒ o). "*" marks an absent modality. Only the nonzero cells of each row are listed, from left to right.

aud  vis  responses
i    *    117 11
y    *    4 124
e    *    108 20
ø    *    122 6
*    i    74 3 43 7 1
*    y    77 51
*    e    21 1 102 1 3
*    ø    10 1 115 1 1
i    i    127 1
y    y    128
e    e    128
ø    ø    128
i    y    7 120 1
y    i    99 28 1
e    i    127 1
ø    y    128
i    e    128
y    ø    128
e    ø    19 109
ø    e    8 59 61
i    ø    16 108 4
y    e    107 17 4
e    y    1 127
ø    i    6 79 43

Table 3. Confusion matrix for Group 2 (1 ♀, 4 ♂). Layout as in Table 2.

aud  vis  responses
i    *    37 3
y    *    2 38
e    *    40
ø    *    40
*    i    14 1 24 1
*    y    1 25 14
*    e    6 3 31
*    ø    10 1 27 2
i    i    39 1
y    y    40
e    e    40
ø    ø    40
i    y    31 9
y    i    8 32
e    i    40
ø    y    40
i    e    39 1
y    ø    40
e    ø    31 9
ø    e    34 6
i    ø    28 12
y    e    14 26
e    y    26 14
ø    i    39 1
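The stimulus counts given under Speech material, and the row totals of the confusion matrices, follow directly from the design. The sketch below enumerates that design; the function and variable names are illustrative and do not come from the study.

```python
from itertools import product

VOWELS = ["i", "y", "e", "ø"]
N_SPEAKERS = 4      # 2 men and 2 women
N_REPETITIONS = 2   # each stimulus was presented twice


def stimuli_per_speaker(vowels=VOWELS):
    """Return (auditory, visual) pairs; None marks an absent modality."""
    audiovisual = list(product(vowels, vowels))   # every auditory vowel dubbed on every visual vowel
    auditory_only = [(v, None) for v in vowels]
    visual_only = [(None, v) for v in vowels]
    return audiovisual + auditory_only + visual_only


per_speaker = len(stimuli_per_speaker())
total_stimuli = per_speaker * N_SPEAKERS * N_REPETITIONS

# Responses per stimulus type (one row of a confusion matrix) for each listener group.
row_total_group1 = 16 * N_SPEAKERS * N_REPETITIONS
row_total_group2 = 5 * N_SPEAKERS * N_REPETITIONS

print(per_speaker, total_stimuli, row_total_group1, row_total_group2)
```

With 16 and 5 listeners respectively, each row of Table 2 should sum to 128 and each row of Table 3 to 40, which the listed counts do.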

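The unimodal and congruent rows of Tables 2 and 3 can be condensed into error rates per presentation mode. The sketch below is an illustration only: which count in each row is the correct response is an assumption made here, because the tables list only nonzero cells, and the names are hypothetical.

```python
# Counts of correct responses for the stimulus vowels i, y, e, ø.
# ASSUMPTION: the cell taken as "correct" in each row is an interpretation of
# Tables 2 and 3, not something the tables state explicitly.
CORRECT = {
    "Group 1": {
        "row_total": 128,
        "auditory": [117, 124, 108, 122],
        "visual": [74, 77, 102, 115],
        "audiovisual": [127, 128, 128, 128],   # congruent stimuli only
    },
    "Group 2": {
        "row_total": 40,
        "auditory": [37, 38, 40, 40],
        "visual": [14, 25, 31, 27],
        "audiovisual": [39, 40, 40, 40],       # congruent stimuli only
    },
}


def error_rate(correct_counts, row_total):
    """Percentage of responses that differ from the presented vowel."""
    presented = row_total * len(correct_counts)
    return 100.0 * (presented - sum(correct_counts)) / presented


for group, data in CORRECT.items():
    for mode in ("auditory", "visual", "audiovisual"):
        rate = error_rate(data[mode], data["row_total"])
        print(f"{group} {mode}: {rate:.1f}%")
```

Under this reading, the computed rates agree after rounding with the error rates reported in the text for both groups.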

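The abstract's summary, roundedness by eye and openness by ear, implies a simple prediction for stimuli whose auditory and visual vowels differ in both features (i/ø, y/e, e/y, ø/i). The sketch below spells out that prediction with a binary coding of the four stimulus vowels; the function name and data layout are hypothetical, and the code describes the idealized fused percept, not the listeners' actual responses.

```python
# (roundedness, openness) for the four stimulus vowels:
# roundedness 0 = unrounded, 1 = rounded; openness 0 for [i y], 1 for [e ø].
FEATURES = {"i": (0, 0), "y": (1, 0), "e": (0, 1), "ø": (1, 1)}
VOWEL_BY_FEATURES = {features: vowel for vowel, features in FEATURES.items()}


def fused_percept(aud, vis):
    """Vowel combining the visual stimulus's roundedness with the auditory stimulus's openness."""
    roundedness = FEATURES[vis][0]
    openness = FEATURES[aud][1]
    return VOWEL_BY_FEATURES[(roundedness, openness)]


for aud, vis in [("i", "ø"), ("y", "e"), ("e", "y"), ("ø", "i")]:
    print(f"auditory [{aud}] + visual [{vis}] -> fused [{fused_percept(aud, vis)}]")
```

For congruent stimuli the function returns the presented vowel, so it only makes a distinctive prediction when the two modalities conflict.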
Subsequently, the two listener groups were kept apart and analyzed separately. The results are shown in Tables 2 and 3.

It is of interest to note that Group 2 included 4 of the 10 male listeners (40%) but only 1 of the 11 female listeners (9%).

When the stimuli were presented in (1) auditory mode alone, (2) visual mode alone, and (3) audiovisual mode, the error rates obtained were 8%, 28%, 0.2% for Group 1, and 3%, 39%, 0.6% for Group 2. Cases with conflicting cues are analyzed in Table 4.

Table 4. Response percentages for stimuli with fully conflicting cues (i/ø, y/e, e/y, ø/i). "Fused": always visual roundedness and auditory openness.

          Group 1  Group 2
Auditory  22       75
Visual    2        0
Fused     76       25

Within Group 1, the pattern of confusions in roundedness was asymmetric. Table 5 shows that a vowel was rarely identified as unrounded when the lips could be seen as rounded, but when the lips were visibly unrounded, identifications with rounded vowels were not so rare. The table also shows that rounded vowels were more often correctly perceived as rounded.

Table 5. Confusion matrix for roundedness. "0" = visually unrounded, "1" = visually rounded. Rows: intended, columns: perceived. Mean number of responses per subject listed.

             Group 1    Group 2
Roundedness  0    1     0    1
0            68   12    38   26
1            3    76    22   42

In addition to the roundedness value (0 or 1), an openness value (0 for [i y], 1 for [e ø o], 2 for [ɛ ɒ]) was assigned to each vowel for numerical evaluation. Based on the numerical values, the mean values of perceived roundedness and openness were computed for each stimulus in each one of the two groups of listeners. In stepwise linear regression analyses, the interaction factors (vis. openness * vis. roundedness and aud. openness * aud. roundedness) were also considered. The result is shown in Table 6. As can be seen, both groups relied on audition in openness perception, with no significant contribution of visual cues. However, in roundedness perception, Group 1 relied on vision, and the contribution of aud. roundedness even failed to attain significance. The other group relied mainly on audition, but did not fully neglect visual cues to roundedness.

Table 6. Result of stepwise regression analyses. Variance explained (r²) and unnormalized weights of significant independent factors (ns = not significant).

             Group 1        Group 2
             round  open    round  open
r²           .926   .965    .972   .995
Constant     0.18   0.01    0.03   0.00
aud opn      ns     0.99    ns     1.02
aud rnd      ns     ns      0.77   ns
aud opn*rnd  0.27   0.20    ns     ns
vis opn      ns     ns      ns     ns
vis rnd      0.77   ns      0.22   ns
vis opn*rnd  ns     ns      ns     ns

Discussion
As the principal result of the present investigation, we have shown that audiovisual integration of the kind that results in a new, fused percept occurs not only in consonants, but also in vowels. Thus, a visual [y] combined with an auditory [e] was mostly perceived as an [ø], while an auditory [y] combined with a visual [e] was mostly perceived as an [i]. This can be considered as analogous to the presentation of an auditory [b] together with a visual [ɡ] evoking the percept of a [d]. We can see that short segment duration is by no means a prerequisite for such fusions to occur.

Instances of double perception analogous to the "combined" results of McGurk and MacDonald (1976) have not been investigated in the present experiment. According to Colin et al. (2002), combinations are less common with voiced consonants than with unvoiced stops while fusions are more common. This would suggest that only fusions should occur with vowels, but in the present experiment some of the listeners informally reported having heard non-standard diphthongs.

Although both lip rounding and openness are easily visible features, the perceptual weight of the visible cues to openness was found to be insignificant and close to zero among all listeners, while that of visible cues to roundedness was found to be distinctly higher than that of auditory cues in the majority group of listeners. For vowels, it had not been shown before that the perception of some of their features can be dominated by the visual signal. However, this is what had been observed by McGurk and MacDonald (1976) concerning the presence or absence of a labiality feature in stop consonants. The dominant role of the visual input appears to be restricted to labiality features in both consonants and vowels, i.e., to the distinctly most easily visible features.

Since lip rounding is not a distinctive feature within the vowel system of English, such a result could not have been obtained by Summerfield and McGrath (1984), and their data did not suggest any differences of this kind.

The fact that the minority of listeners that did not rely on lip reading was composed predominantly of male listeners is in accord with the reported greater susceptibility of female listeners to visual input and their greater proficiency in lip reading (Johnson et al., 1988), which our results confirm. This has been reasonably explained by sex differences in gazing behavior, women being more likely to look at a speaker's face. In this connection, it is relevant to note that in Japanese culture it is considered more polite not to gaze at a speaker's face, which may explain why both men and women are less sensitive to visual input (Sekiyama and Tohkura, 1993; Hayashi and Sekiyama, 1998).

The insignificant perceptual weight of visual cues to openness may be due to the fact that the openness of the mouth varies with vocal effort and between speakers and also as a function of context, while the visible reflection of lip rounding and labial closure is quite robust to such influences. Moreover, the acoustic cues to labiality are feeble, while the relative position of F1 is an acoustically strong cue to openness.

It is interesting to note that an auditory [ø] was often perceived as an [ɛ] when combined with a visual [e] or even an [i]. This is reflected in the significant auditory rnd * opn interaction, and it highlights the irrelevance of visual cues for the perception of openness. This behavior can be understood only on the basis of acoustics: the frequency position of F1 in a Swedish [ø] is, typically, higher than in [e], and the F2 of [ø] is close to that of an [ɛ] (Eklund and Traunmüller, 1997).

The observed asymmetric pattern of confusions in roundedness implies that listeners notice visibly rounded lips as incompatible with the presence of an unrounded vowel, while the incompatibility of unrounded lips with the presence of a rounded vowel more often goes unnoticed. Listeners appear to consider unrounded lips as "unmarked" and therefore as less noteworthy.

References
Aloufy S., Lapidot M., and Myslobodsky M. (1996) Differences in susceptibility to the "blending illusion" among native Hebrew and English speakers. Brain Lang 53, 51–57.
Amcoff S. (1970) Visuell perception av talljud och avläsestöd för hörselskadade. Rapport 7, Lärarhögskolan i Uppsala, Pedagog. inst.
Colin C., Radeau M., Deltenre P., Demolin D., and Soquet A. (2002) The role of sound intensity and stop-consonant voicing on McGurk fusions and combinations. Eur J Cogn Psychol 14, 475–491.
Eklund I., and Traunmüller H. (1997) Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica 54, 1–21.
Erber N.P. (1975) Auditory-visual perception of speech. J Speech Hear Disord 40, 481–492.
Green K.P., Kuhl P.K., Meltzoff A.N., and Stevens E.B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Percept Psychophys 50, 524–536.
Hayashi T., and Sekiyama K. (1998) Native-foreign language effect in the McGurk effect: a test with Chinese and Japanese. AVSP'98, Terrigal, Australia. http://www.isca-speech.org/archive/avsp98/
Johnson F.M., Hicks L., Goldberg T., and Myslobodsky M. (1988) Sex differences in lipreading. B Psychonomic Soc 26, 106–108.
Mártony J. (1974) On speechreading of Swedish consonants and vowels. STL-QPSR 2-3/1974, 11–33.
McGurk H., and MacDonald J. (1976) Hearing lips and seeing voices. Nature 264, 746–748.
Sams M., Aulanko R., Hämäläinen M., Hari R., Lounasmaa O.V., Lu S-T., and Simola J. (1991) Seeing speech: visual information from lip movements modifies activity in the human auditory cortex. Neurosci Lett 127, 141–145.
Sekiyama K., and Tohkura Y. (1993) Inter-language differences in the influence of visual cues in speech perception. J Phonetics 21, 427–444.
Summerfield A.Q., and McGrath M. (1984) Detection and resolution of audio–visual incompatibility in the perception of vowels. Q J Exp Psychol-A 36, 51–74.