Audiovisual speech: Moving beyond McGurk

Kristin J. Van Engen1, Avanti Dey2, Mitchell S. Sommers1, Jonathan E. Peelle3

1 Department of Psychological and Brain Sciences, Washington University in Saint Louis
2 Center for Vital Longevity, University of Texas at Dallas
3 Department of Otolaryngology, Washington University in Saint Louis

Please address correspondence to:

Dr. Kristin Van Engen
Dept. of Psychological and Brain Sciences
Washington University in Saint Louis
Saint Louis, MO 63130
email: [email protected]

Dr. Jonathan Peelle
Dept. of Otolaryngology
Washington University in Saint Louis
Saint Louis, MO 63110
email: [email protected]

Abstract

Although listeners use both auditory and visual cues during speech perception, the cognitive and neural bases for their integration remain a matter of debate. One common approach to measuring audiovisual integration is to use McGurk tasks, in which discrepant auditory and visual cues produce auditory percepts that differ from those based solely on unimodal input. Not all listeners show the same degree of susceptibility to the McGurk effect, and these individual differences in susceptibility are frequently used as a measure of audiovisual integration ability. However, despite their popularity, we argue that McGurk tasks are ill-suited for studying the kind of multisensory speech perception that occurs in real life: McGurk stimuli are often based on isolated syllables (which are rare in conversations) and necessarily rely on audiovisual incongruence that does not occur naturally. Furthermore, recent data show that susceptibility on McGurk tasks does not correlate with performance during natural audiovisual speech perception. Although the McGurk effect is a fascinating illusion, truly understanding the combined use of auditory and visual information during speech perception requires tasks that more closely resemble everyday communication.

Speech perception in face-to-face conversations is a prime example of multisensory integration: as listeners we have access not only to a speaker's voice, but to visual cues coming from their face, gestures, and body posture. More than forty years ago, McGurk and MacDonald (1976) published a remarkable (and now famous) example of visual influence on auditory speech perception: when an auditory stimulus (e.g., /ba/) was presented with the face of a talker articulating a different syllable (e.g., /ga/), listeners often experienced an illusory percept distinct from both sources (e.g., /da/). Since that time, McGurk stimuli have been used in countless studies of audiovisual integration in humans (not to mention the multitude of classroom demonstrations on multisensory processing) (Marques, Lapenta, Costa, & Boggio, 2016). At the same time, the stimuli typically used to elicit a McGurk effect differ substantially from what we typically encounter in conversation. In this paper we consider how best to understand the benefits listeners receive from being able to see a speaker's face while listening to their speech during natural communication. We start by reviewing behavioral findings regarding audiovisual speech perception, as well as some theoretical constraints on our understanding of multisensory integration. With this background, we examine the McGurk effect to assess its usefulness for furthering our understanding of audiovisual speech perception.

Benefits of audiovisual speech compared to auditory-only speech

A great deal of research in audiovisual speech processing has focused on the finding that speech signals are consistently more intelligible when both auditory and visual information are available compared to auditory-only (or visual-only) perception. Advantages in multisensory processing most obviously come from complementarity between auditory and visual speech signals. For example, /m/ and /n/ are confusable acoustically, but visually distinct; similarly, /p/ and /b/ are visually confusable, but acoustically distinct. Having two modalities of speech information makes it more likely that a listener will perceive the intended item. However, advantages may also come in cases when information is redundant across the two modalities, particularly in the context of background noise.

In experiments that present speech to listeners in noisy backgrounds, recognition of audiovisual (AV) speech is significantly better than recognition of auditory-only speech (Erber, 1975; Sommers, Tye-Murray, & Spehar, 2005; Tye-Murray, Sommers, & Spehar, 2007a; Van Engen, Phelps, Smiljanic, & Chandrasekaran, 2014; Van Engen, Xie, & Chandrasekaran, 2017), and listeners are able to reach pre-determined performance levels at more difficult signal-to-noise ratios (SNRs) (Grant & Seitz, 2000; Macleod & Summerfield, 1987; Sumby & Pollack, 1954). In addition, AV speech has been shown to speed performance in shadowing tasks (where listeners repeat a spoken passage in real time) (Reisberg, McLean, & Goldfield, 1987) and improve comprehension of short stories (Arnold & Hill, 2001) relative to auditory-only presentations. Given that acoustically challenging speech is also associated with increased cognitive challenge (Peelle, 2018), visual information may reduce the cognitive demand associated with speech-in-noise processing (Gosselin & Gagné, 2011). Below we review two key areas in which audiovisual speech is beneficial compared to unimodal speech.

Phonetic discrimination

Visual speech conveys information about the position of a speaker's articulators, which can help a listener identify speech sounds by visually distinguishing their place of articulation. For example, labial articulations (such as bilabial consonants or rounded vowels) are visually salient, so words like "pack" and "tack"—which may be difficult to distinguish in a noisy, auditory-only situation—can be readily distinguished if a listener can see the speaker's lips.

Temporal cues in the visual signal also carry information relevant to distinguishing some phonological categories. In English, for example, the difference between a voiceless /t/ and a voiced /d/ (phonemes that share a place of articulation) at the end of a word is largely instantiated as a difference in the duration of the preceding vowel (say the words "beat" and "bead" aloud and take note of the vowel duration in each—"bead" almost certainly contains a longer vowel). Visual speech can provide a critical durational cue for such phonemes, even if the auditory signal is obscured by the acoustic environment. For example, if a listener is able to visually observe a speaker's mouth close at the end of a word, this provides durational information even if the acoustic signal is completely masked.

One way to quantify phonetic constraints in speech perception is in the context of a lexical competition framework. Lexical competition frameworks start from the assumption that listeners must select the appropriate target word from among a set of similar-sounding words (phonological neighbors), which act as competitors. Words with a relatively high number of phonological neighbors (such as "cat") thus have higher levels of lexical competition and may rely more on cognitive processes of inhibition or selection compared to words with relatively few phonological neighbors (such as "orange"). Originally developed for auditory-only speech perception (Luce & Pisoni, 1998; Marslen-Wilson & Tyler, 1980), lexical competition frameworks have since been extended to consider AV speech perception (Feld & Sommers, 2011; Strand, 2014). Tye-Murray et al. (2007b) introduced the concept of intersection density, suggesting that it is only the words that overlap in both auditory and visual features that compete during word recognition. Specifically, they investigated whether auditory-only, visual-only, or AV neighborhoods predicted listeners' performance in AV speech perception tasks, and found that it was the combined AV neighborhood—reflecting the intersection of auditory and visual cues—that was most closely related to a listener's speech perception accuracy.

In summary, visual speech can provide informative cues to place of articulation and timing that are important for phonetic identification. Visual information can therefore help reduce the number of lexical competitors during comprehension and increase the accuracy of perception. One question that has been largely ignored in studies of the McGurk effect is the extent to which such lexical competition may contribute to the specific perceptual experiences reported during the illusion (i.e., percepts that do not correspond to either unimodal input).

Temporal prediction

At a basic level, temporal information in the visual signal can also help listeners by simply serving as a cue that a person has begun speaking. In a noisy restaurant, for example, seeing a talker's mouth can help listeners direct their attention to the talker's speech and better segregate it from other interfering sound sources.
However, in addition to alerting listeners to the start of speech, mouth movements also provide ongoing information about the temporal envelope of the acoustic signal during continuous speech, with louder amplitudes being associated with opening of the mouth (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009; Grant & Seitz, 2000; Summerfield, 1987). The visual signal thus provides clues about the rhythmic structure of connected speech, which helps listeners form expectations about incoming information and therefore supports linguistic processing. In fact, even modulating a circle based on the amplitude envelope of speech can reduce effort during recognition of spoken sentences (Strand, Brown, & Barbour, 2018). As discussed in more detail below, visual information regarding the acoustic envelope may aid in how ongoing brain oscillations track the incoming acoustic speech signal. Intuitively, one might presume that visual information would necessarily be available to listeners before acoustic information, and thus be predictive by virtue of its temporal order. Even if the relationship is not so straightforward (Schwartz & Savariaux, 2014), however, it is still possible to consider the complementary information that the auditory and visual speech signals provide with respect to the acoustic amplitude envelope.
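As a concrete illustration of the relationship described above between mouth opening and the acoustic amplitude envelope, the sketch below shows one common way to estimate the envelope of a speech waveform and correlate it with a mouth-aperture time series. This is an illustrative example rather than the method of any study cited here; the sampling rates, the `amplitude_envelope` helper, and the placeholder signals are all hypothetical.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import pearsonr

def amplitude_envelope(audio, fs, frame_rate):
    """Estimate the slow amplitude envelope of a speech waveform by taking the
    magnitude of the analytic signal (Hilbert transform) and averaging it into
    bins matching the video frame rate (one envelope value per video frame)."""
    env = np.abs(hilbert(audio))
    samples_per_frame = int(fs / frame_rate)
    n_frames = len(env) // samples_per_frame
    return env[: n_frames * samples_per_frame].reshape(n_frames, -1).mean(axis=1)

# Hypothetical inputs: a 5-s mono recording at 44.1 kHz and a frame-by-frame
# mouth-aperture trace from the accompanying 30-fps video. Random numbers
# stand in here for real audio and face-tracking data.
fs_audio, frame_rate = 44100, 30
rng = np.random.default_rng(0)
audio = rng.standard_normal(fs_audio * 5)
mouth_aperture = rng.standard_normal(frame_rate * 5)

env = amplitude_envelope(audio, fs_audio, frame_rate)
r, p = pearsonr(env, mouth_aperture)
print(f"envelope-aperture correlation: r = {r:.2f} (p = {p:.3f})")
```

With real recordings, the positive envelope-aperture relationship reported by Chandrasekaran et al. (2009) would appear as a reliably positive r; the random placeholders above, of course, will not show it.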

Multisensory integration

Although it is clear that visual information significantly aids auditory speech recognition, understanding speech perception requires determining how listeners combine the simultaneously available information from the visual and auditory modalities during communication. The issue is critical because if there is only a single mechanism for audiovisual integration, then any task (such as one using McGurk stimuli) that involves integrative processes necessarily relates to other audiovisual tasks (such as everyday speech perception). On the other hand, if there are multiple ways in which information can be combined across modalities, it is critical to determine whether two tasks rely on the same integrative mechanism.

One way to classify different integrative mechanisms is by whether integration occurs at an early or late stage of processing (Peelle & Sommers, 2015): early integration refers to modulations of activity in primary sensory cortex—that is, "early" in terms of both the processing hierarchy and time. Late integration occurs in regions of heteromodal cortex (for example, posterior superior temporal sulcus [STS]), incorporating auditory and visual information into a unified percept only after unimodal processing has occurred.

Late integration models of audiovisual integration

Many classical frameworks for audiovisual speech processing have focused on late integration (Grant, Walden, & Seitz, 1998; Oden & Massaro, 1978), which makes sense under the intuitive assumption that auditory and visual cortex receive largely unimodal inputs. In other words, perceptual information is necessarily segregated at the level of sensory receptors, setting the stage for independent processing of auditory and visual information. If integration occurs outside of unimodal processing, it is straightforward to postulate independent abilities for auditory-only speech processing, visual-only speech processing, and audiovisual integration. Separating integrative ability is of theoretical benefit in explaining the performance of listeners who are matched in unimodal processing but differ in their success with audiovisual tasks. For example, Tye-Murray et al. (2016) matched individuals on auditory-only and visual-only performance by using individually determined levels of babble noise and visual blur to obtain 30% accuracy in both unimodal conditions. Despite nearly identical unimodal performance, however, participants varied substantially in the auditory-visual condition when tested using the degraded auditory and visual stimuli. Under this framework, by separately measuring unimodal and audiovisual perception, an individual listener's ability to integrate audiovisual speech information can be estimated, independent of their ability to encode auditory-only and visual-only speech cues.

The posterior STS is a region of heteromodal cortex that shows responses to both auditory and visual inputs, making it an appealing candidate to underlie audiovisual integration (Beauchamp, 2005). We consider the posterior STS to be involved in late integration, as it receives inputs from primary sensory regions. Consistent with an integrative role, regions of middle and posterior STS show greater activity for AV speech than for unimodal speech (Venezia et al., 2017). Perhaps most relevant for the current discussion, during incongruent McGurk trials (i.e., trials on which the auditory and visual input are incongruent), increased activity is seen in the posterior STS (Calvert, Campbell, & Brammer, 2000; Nath & Beauchamp, 2012; Sekiyama, Kanno, Miura, & Sugita, 2003).

Nath and Beauchamp (2011) offered a particularly nice demonstration of audiovisual integration by investigating functional connectivity between auditory regions, visual regions, and posterior STS while participants were presented with syllables and words. Importantly, they varied the reliability of unimodal information by either blurring the visual speech signal ("auditory reliable") or putting the speech in noise ("visual reliable"). They found that functional connectivity between sensory cortices and posterior STS changed as a function of unimodal clarity, such that when auditory information was clear, connectivity between auditory regions and posterior STS was stronger than when visual information was clear. These functional imaging results are consistent with the performance of patients with brain damage, whose audiovisual speech perception is most strongly impaired when regions of STS are damaged (Hickok et al., 2018), and with evidence that transcranial magnetic stimulation to posterior STS interferes with the McGurk illusion (Beauchamp, Nath, & Pasalar, 2010). Together, these findings support the posterior STS playing an important role in audiovisual integration, and suggest that it may weight modalities as a function of their informativeness.

Early integration models of audiovisual integration

Despite their intuitive appeal, significant challenges for strict late integration models come from a variety of perspectives. For example, electrophysiological research in nonhuman primates shows that neural activity in primary auditory cortex is modulated by multisensory cues, including motor/haptic (Lakatos, Chen, O'Connell, Mills, & Schroeder, 2007) and visual (Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008) inputs. In humans, regions of posterior STS also represent visual amplitude envelope information (Micheli et al., 2018). Non-auditory information may arrive at primary auditory cortex through nonspecific thalamic pathways (Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008) and/or cortico-cortical connections from visual to auditory cortex (Arnal & Giraud, 2012). Multisensory effects in primary auditory regions argue against late integration models because they suggest that, at least by the time information reaches cortex, there are no longer pure unisensory representations.

A second, related line of research concerns the role of brain oscillations in speech perception (Giraud & Poeppel, 2012; Peelle & Davis, 2012). Cortical oscillations entrain to acoustic information in connected speech (Luo & Poeppel, 2007), an effect that is enhanced when speech is intelligible (Peelle, Gross, & Davis, 2013). This cortical entrainment may be thought of as encoding a sensory prediction because it relates to the times at which incoming information is likely to occur. Interestingly, visual information increases the entrainment of oscillatory activity in auditory cortex to the amplitude envelope of connected speech in both quiet and noisy listening environments (Crosse, Butler, & Lalor, 2015; Luo, Liu, & Poeppel, 2010; Zion Golumbic, Cogan, Schroeder, & Poeppel, 2013). These findings suggest a role for audiovisual integration in auditory cortex. Because oscillatory entrainment is a phenomenon associated with connected speech (that is, sentences or stories) rather than isolated words or syllables, these findings raise the possibility that non-sentence stimuli might fail to engage this type of early integrative process.
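Entrainment in the studies cited above is typically quantified with phase-based measures; as a rough, simplified illustration of the same idea, one can compute spectral coherence between the speech amplitude envelope and a neural recording. The sketch below is not the analysis pipeline of any cited study, and its signals are random placeholders.

```python
import numpy as np
from scipy.signal import coherence

# Hypothetical inputs: the slow amplitude envelope of a 60-s story and a
# single EEG/MEG channel, both resampled to 100 Hz. The "neural" signal is
# simulated here as partially tracking the envelope.
fs = 100
rng = np.random.default_rng(3)
envelope = rng.standard_normal(fs * 60)
neural = 0.3 * envelope + rng.standard_normal(fs * 60)

# Magnitude-squared coherence between the two signals; entrainment studies
# typically focus on the delta/theta range (~1-8 Hz), where syllable rates fall.
f, cxy = coherence(envelope, neural, fs=fs, nperseg=fs * 4)
band = (f >= 1) & (f <= 8)
print(f"mean 1-8 Hz envelope-neural coherence: {cxy[band].mean():.2f}")
```

In an experiment, one would compare such tracking values for audio-only versus audiovisual presentations of the same sentences, which is the kind of comparison underlying the visual enhancement of entrainment described above.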

The McGurk effect

On the surface, the McGurk effect appears to be an ideal measure of auditory-visual integration: discrepant inputs from two modalities (auditory and visual) are combined to produce a percept that can be distinct from either of the unimodal inputs (Figure 1A). Video demonstrations of the McGurk effect often ask viewers to open and close their eyes, with the percept changing instantaneously depending on the availability of visual input. And the effect is robust: auditory judgments are affected by visual information even when listeners are aware of the illusion, when male voices are dubbed onto female faces (and vice versa) (Green, Kuhl, Meltzoff, & Stevens, 1991), when auditory information lags behind the visual signal by as much as 300 ms (Munhall, Gribble, Sacco, & Ward, 1996; Soto-Faraco & Alsius, 2009; Venezia, Thurman, Matchin, George, & Hickok, 2016), and for point-light displays (Rosenblum & Saldana, 1996).

Figure 1. A. Schematic of the McGurk effect, in which incongruent auditory and visual speech are presented simultaneously, frequently resulting in a percept that differs from the input. B. Reliability of individual susceptibility to the McGurk effect on separate testing occasions (adapted from Basu Mallick, Magnotti, & Beauchamp, 2015). C. Neuroimaging studies frequently identify left posterior superior temporal sulcus as showing greater activity for incongruent McGurk stimuli than for control items (adapted from Sekiyama et al., 2003).

In addition to demonstrating a clear combination of auditory and visual information, the McGurk effect is a useful tool in that it is observable across a range of populations: pre-linguistic infants (Rosenblum, Schmuckler, & Johnson, 1997), young children (Massaro, Thompson, Barron, & Laren, 1986; McGurk & MacDonald, 1976), speakers of different languages (Magnotti et al., 2015; Sekiyama, 1997), and individuals with Alzheimer's disease (Delbeuck, Collette, & Van der Linden, 2007), among others, have all been shown to be susceptible to the illusion. Because susceptibility to the McGurk effect varies considerably across individuals even within these groups, it has been used as an index of integrative ability: individuals who show greater susceptibility to the McGurk effect are typically considered to be better integrators (Magnotti & Beauchamp, 2015; Stevenson, Zemtsov, & Wallace, 2012). Finally, the McGurk effect appears to be quite reliable: Strand et al. (2014) and Basu Mallick et al. (2015) both showed high test-retest reliability of the McGurk effect (even, in the case of Basu Mallick et al., 2015, with a year between measurements; Figure 1B), suggesting that these tasks measure a stable trait within individual listeners.

Neuroanatomically, as noted above, functional brain imaging studies frequently find activity in left posterior STS associated with McGurk stimuli, illustrated in Figure 1C. Given the heteromodal anatomical connectivity of these temporal regions and their response to a variety of multisensory stimuli, this brain region makes intuitive sense as playing a key role in audiovisual integration.
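The susceptibility and test-retest reliability measures described above are straightforward to compute. The sketch below illustrates one typical approach (a minimal example of our own; the data frame, its column names, and the random responses are hypothetical stand-ins for real trial data).

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical trial-level data: one row per McGurk trial, with columns
# "subject", "session" (1 or 2), and "fused" (1 if the reported percept
# matched neither unimodal input, e.g., hearing /da/ for auditory /ba/ plus
# visual /ga/; 0 otherwise). 30 subjects x 2 sessions x 20 trials.
rng = np.random.default_rng(1)
trials = pd.DataFrame({
    "subject": np.repeat(np.arange(30), 2 * 20),
    "session": np.tile(np.repeat([1, 2], 20), 30),
    "fused": rng.integers(0, 2, 30 * 2 * 20),
})

# Susceptibility = proportion of fused responses per subject and session.
susceptibility = (
    trials.groupby(["subject", "session"])["fused"].mean().unstack("session")
)

# Test-retest reliability as the correlation between the two sessions.
r, p = pearsonr(susceptibility[1], susceptibility[2])
print(f"test-retest r = {r:.2f} (p = {p:.3f})")
```

With real responses, the high reliability reported by Strand et al. (2014) and Basu Mallick et al. (2015) would appear as a strong positive correlation across sessions; random placeholder responses will hover near zero.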

Concerns about the McGurk effect

There is no question that the McGurk effect demonstrates a clear perceptual combination of auditory and visual information, is a reliable within-listener measure, and reveals considerable variability among listeners. However, the imperfect correspondence between McGurk stimuli and natural speech raises some initial concerns about the face validity of comparing the two.

Table 1 summarizes several characteristics of speech, and whether these are present in either natural speech or typical McGurk stimuli.1 Although there are certainly commonalities, there are also significant differences. For example, incongruent auditory and visual cues are central to the McGurk effect, but never occur in natural speech (a speaker cannot produce speech incongruent with their vocal apparatus). Given the artificial nature of the stimuli typically used to elicit the McGurk effect, we move on to consider experimental evidence that might speak to how the McGurk effect might relate to natural speech perception. We argue that there are good reasons to think the mechanisms of integration for McGurk stimuli and audiovisual speech processing may differ.

Table 1. Speech characteristics of McGurk stimuli and natural speech

                             McGurk stimuli    Natural speech
  Audiovisual                      x                 x
  Phonetic content                 x                 x
  Audio-visual congruency                            x
  Phonological constraints                           x
  Lexical constraints                                x
  Semantic context                                   x
  Syntax                                             x

McGurk susceptibility and audiovisual speech perception

Perhaps the most compelling evidence against the use of the McGurk effect for understanding audiovisual integration in natural speech comprehension comes from studies that have investigated whether an individual listener's susceptibility to McGurk stimuli correlates with their audiovisual speech benefit (for example, the intelligibility of sentences in noise with and without access to the visual signal). Grant and Seitz (1998) showed equivocal results in this regard. For older adults with acquired hearing loss, McGurk susceptibility was correlated with visual enhancement (i.e., the proportion of available improvement listeners made relative to their audio-only performance) for sentence recognition, but McGurk susceptibility did not contribute significantly to a regression model predicting visual enhancement, suggesting McGurk susceptibility did not provide information above and beyond other, non-McGurk factors.
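For concreteness, the "proportion of available improvement" referred to above is commonly computed as follows (a standard formulation of visual enhancement, not necessarily the exact expression used by Grant and Seitz), where $A$ and $AV$ are proportion-correct scores in the audio-only and audiovisual conditions:

$$ VE = \frac{AV - A}{1 - A} $$

Normalizing by $1 - A$ expresses the audiovisual benefit relative to the room left for improvement, so that listeners with different audio-only baselines can be compared on a common scale.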

1 Of course, different studies may use different variations of McGurk stimuli; we focus here on the most common type, which involves isolated syllables (e.g., /ba/) presented without linguistic or acoustic context, paired with congruent or incongruent visual information.

Figure 2. A. Visual enhancement (improved intelligibility for AV speech compared to auditory-only speech) during sentence comprehension does not depend on susceptibility to the McGurk effect across different types of background noise and noise levels (adapted from Van Engen et al., 2017). B. Greater fMRI activity in visual cortex (top) and auditory cortex (bottom) for AV sentences compared to auditory-only sentences (adapted from Okada, Venezia, Matchin, Saberi, & Hickok, 2013).

More recently, Van Engen et al. (2017) measured participants' susceptibility to the McGurk effect and their ability to identify sentences in noise (Figure 2A). Listeners were tested in both speech-shaped noise and two-talker babble across a range of signal-to-noise ratios. For each condition, listeners were tested in both audio-only and audiovisual modalities. McGurk susceptibility did not predict performance overall, nor did it interact with noise level or modality. In fact, in speech-shaped noise there was a trend reflecting a negative association between McGurk susceptibility and speech recognition, particularly in the AV conditions. If anything, then, McGurk susceptibility may predict poorer performance on some AV speech tasks. The same study also failed to find any significant correlations between McGurk susceptibility and visual enhancement in any of the listening conditions. (Notably, the two conditions that showed marginally significant relationships again indicated that, if anything, greater susceptibility to the McGurk illusion was associated with lower rates of visual enhancement.) Similar results were reported by Hickok and colleagues (2018), who found no relationship between McGurk susceptibility and AV enhancement. These empirical findings indicate that McGurk susceptibility is measuring something different from congruent audiovisual processing.

Brain imaging studies of AV speech processing using congruent non-McGurk stimuli also implicate primary sensory regions in multisensory processing, consistent with an early integration account. For example, Okada et al. (2013) presented participants with auditory-only or AV sentences. For the AV sentences, they found increased activity in both visual cortex and auditory cortex (Figure 2B), in addition to changes in posterior STS. The increased activity in auditory cortex is consistent with visual information affecting "low level" acoustic information, perhaps through altering temporal expectations in connected speech (Peelle & Sommers, 2015).

Differences between typical McGurk stimuli and natural audiovisual speech

Perhaps the most important difference between McGurk stimuli and everyday audiovisual speech signals is that the auditory and visual inputs in McGurk tasks are, by definition, incongruent. Resolving discrepant inputs is decidedly not what listeners are doing when they use congruent auditory and visual information to better understand spoken utterances. In an fMRI study, Erickson et al. (2014) showed distinct brain regions involved in the processing of congruent AV speech and incongruent AV speech when compared to unimodal speech (acoustic-only and visual-only). Left posterior STS was recruited during congruent bimodal AV speech, whereas left posterior superior temporal gyrus (posterior STG) was recruited when processing McGurk speech, suggesting that left posterior STG may be especially important when there is a discrepancy between auditory and visual cues. Moris Fernandez and colleagues (2017) further showed that regions of the brain associated with conflict detection and resolution (anterior cingulate cortex and left inferior frontal gyrus) were more active for incongruent than congruent speech stimuli, a finding that is also apparent in a meta-analysis of neuroimaging studies (Erickson, Heeg, Rauschecker, & Turkeltaub, 2014). Using EEG, Crosse and colleagues (2015) have shown that neural entrainment to the speech envelope is inhibited when auditory and visual information streams are incongruent compared to congruent. On balance, these studies point (perhaps not surprisingly) to the processes supporting the perception of incongruent speech (including McGurk stimuli) differing from those engaged in the perception of congruent speech.

Second, the near-exclusive use of single syllables for demonstrating the McGurk effect (with some exceptions; e.g., Dekle, Fowler, & Funnell, 1992) does not reflect the nature of AV integration in the processing of more natural linguistic stimuli such as words and sentences (Sams, Manninen, Surakka, Helin, & Kättö, 1998; Windmann, 2004). Sommers et al. (2005), for example, found no correlation between AV perception of syllables and AV perception of words or sentences. Grant and Seitz (1998) similarly showed no correlation between integration measures derived from consonant versus sentence recognition. Although additional research is required, these results suggest that correlations between McGurk susceptibility for syllables, words, and sentences are likely to be similarly low, making it questionable whether integration based on syllable perception is closely related to integration in the processing of words or sentences. Indeed, Windmann (2004) showed that the clarity and likelihood of the McGurk effect was dependent on listeners' semantic and lexical expectations.
General problems for the McGurk task as a measure of audiovisual integration

In addition to concerns about whether McGurk tasks tap into the same mechanisms of integration that are relevant for speech communication, there are also more general concerns about the McGurk effect as a measure of integration (see also Alsius, Paré, & Munhall, 2017).

One factor complicating the interpretation of group and individual differences in McGurk susceptibility is that such differences can also arise from variability in unimodal (auditory-only or visual-only) performance. Group and individual differences in unimodal encoding are particularly important given the extensive variability in visual-only (lipreading) abilities, even within groups of homogeneous participants. Therefore, even for normal-hearing young adults, individual differences in McGurk susceptibility could be attributed to differences in auditory-visual integration, visual-only encoding, or some combination of both. Strand et al. (2014) showed, for example, that McGurk susceptibility depends in part on differences in lipreading skill and detection of incongruity. Issues of individual differences in unimodal performance will be magnified in populations, such as older adults, in which both auditory-only and visual-only encoding can vary substantially across individuals, further complicating interpretation of group differences in McGurk susceptibility as arising exclusively from differences in auditory-visual integration. In fact, it has been suggested that published estimates of group differences in McGurk susceptibility are inflated because true group differences are swamped by individual variability (Magnotti & Beauchamp, 2018).

Given that individual differences in both auditory-visual integration and unimodal encoding can contribute to measures of McGurk susceptibility, it is perhaps not surprising that individuals vary tremendously in how likely they are to experience the McGurk illusion: susceptibility to the McGurk effect has been shown to vary from 0% to 100% across individuals viewing identical stimuli (Basu Mallick et al., 2015; Strand et al., 2014). Of perhaps even greater concern is that the distribution of responses for McGurk stimuli is often bimodal—most individuals will perceive the fused response from a given McGurk stimulus either less than 10% of the time or greater than 90% of the time. Basu Mallick et al. (2015), for example, found that across 20 different pairs of McGurk stimuli, 77% of participants showed the fused response either less than 10% of the time or greater than 90% of the time. One likely reason for this extensive individual variability is that McGurk susceptibility reflects a combination of differences in auditory-visual integration and unimodal encoding.

Finally, Brown and colleagues (2018) investigated whether individual variability in McGurk susceptibility might be explained by differences in sensory or cognitive ability. Although lipreading ability was related to McGurk susceptibility, the authors failed to find any influence of attentional control (using a flanker task), lexical processing speed (using a lexical decision task), or working memory capacity (using an operation span task), even though these abilities are frequently related to auditory speech perception. The lack of any clear perceptual or cognitive correlates of McGurk susceptibility (outside of lipreading) suggests that the integration operating during McGurk tasks is cognitively isolated from many factors involved in speech perception.

Studies of age differences in susceptibility to the McGurk effect also raise questions as to whether the illusion reflects the operation of mechanisms typically engaged in speech perception. Specifically, as noted, both age and individual differences in visual enhancement can be accounted for by differences in visual-only speech perception: not surprisingly, individuals who are better lipreaders typically exhibit greater visual enhancement (Tye-Murray et al., 2016). Although older adults typically have reduced visual-only performance (Sommers et al., 2005), studies have shown either equivalent (Cienkowski & Carney, 2002) or enhanced (Sekiyama, Soshi, & Sakamoto, 2014) susceptibility to the McGurk effect relative to young adults.

Moving beyond McGurk

McGurk stimuli differ in a number of important properties from natural speech, and—perhaps as a result—we are not aware of empirical evidence relating McGurk susceptibility to audiovisual perception of natural speech. For these reasons we argue that McGurk stimuli are not well-suited for improving our understanding of audiovisual speech perception. What, then, is a better alternative?

If our goal is to better understand natural speech, we suggest that currently the best option is to use natural speech stimuli: in other words, stimuli that contain all of the speech cues listed in Table 1. There are good precedents for using actual speech stimuli across multiple levels of language representation. For example, studies of word perception use real words that can vary in psycholinguistic attributes, including frequency, phonological neighborhood density, or imageability. Similarly, countless studies of sentence processing use sentences that vary in attributes including syntactic complexity, prosodic information, or semantic predictability. By using coherent stimuli that vary in their perceptual and linguistic properties, it has proven possible to learn a great deal about the dimensions that influence speech understanding. Studies of audiovisual speech perception would benefit from a similar approach: using natural speech samples in auditory-only, visual-only, and audiovisual modalities, all of which are encountered in real-life conversation. By manipulating specific properties of these signals—for example, the degree to which auditory and visual information are complementary vs. redundant—it is still possible to vary the audiovisual integration demands.
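One practical element of such experiments, used throughout the speech-in-noise studies cited earlier, is mixing natural sentence recordings with background noise at controlled SNRs while leaving the video track untouched. The sketch below is an illustrative example of how this is commonly done; the function name and the placeholder signals are ours, not taken from any cited study.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then return the mixture. Both inputs are 1-D arrays at the same rate."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Hypothetical usage: the same 3-s sentence mixed at several SNRs. In an AV
# condition the accompanying video is unchanged, so only acoustic clarity
# varies; random numbers stand in for a real recording and two-talker babble.
rng = np.random.default_rng(2)
sentence = rng.standard_normal(44100 * 3)
babble = rng.standard_normal(44100 * 3)
stimuli = {snr: mix_at_snr(sentence, babble, snr) for snr in (-8, -4, 0, 4)}
```

Holding the sentence materials constant and crossing modality (auditory-only, visual-only, audiovisual) with SNR in this way allows visual enhancement to be measured with natural, congruent stimuli.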

Conclusions

There is no question that the McGurk effect is a robust and compelling demonstration of multisensory integration: auditory and visual cues combine to form a fused percept. At the same time, there are many reasons to think it may not be an effective means to study the processes occurring during natural speech perception. Speech communication is centered around real words, usually involves connected speech rather than isolated words, and involves auditory and visual inputs that are congruent, meaning they can provide complementary and/or redundant (rather than conflicting) information. To understand what occurs during speech perception, we believe that a more productive line of research would be to focus on audiovisual speech comprehension in situations that are closer to real-life communication. If these results are consistent with those provided by McGurk stimuli, we can be more certain that both are indexing the same phenomena. However, if they disagree, we will need to consider the real possibility that the McGurk effect, although fascinating, does not inform our understanding of everyday speech perception.

Acknowledgements

This work was supported by grants R01 DC014281, R56 AG018029, and R01 DC016594 from the National Institutes of Health. We are grateful to Violet Brown for helpful comments on prior drafts of this manuscript.

References

Alsius, A., Paré, M., & Munhall, K. G. (2017). Forty years after Hearing lips and seeing voices: The McGurk effect revisited. Multisensory Research, 31. doi: 10.1163/22134808-00002565

Arnal, L. H., & Giraud, A.-L. (2012). Cortical oscillations and sensory predictions. Trends in Cognitive Sciences, 16, 390-398.

Arnold, P., & Hill, F. (2001). Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. British Journal of Psychology, 92, 339-355.

Basu Mallick, D., Magnotti, J. F., & Beauchamp, M. S. (2015). Variability and stability in the McGurk effect: Contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review, 22, 1299-1307. doi: 10.3758/s13423-015-0817-4

Beauchamp, M. S. (2005). See me, hear me, touch me: Multisensory integration in lateral occipital-temporal cortex. Current Opinion in Neurobiology, 15, 145-153.

Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30, 2414-2417. doi: 10.1523/JNEUROSCI.4865-09.2010

Brown, V. A., Hedayati, M., Zanger, A., Mayn, S., Ray, L., Dillman-Hasso, N., & Strand, J. F. (2018). What accounts for individual differences in susceptibility to the McGurk effect? PLOS ONE, 13, e0207160. doi: 10.1371/journal.pone.0207160

Calvert, G. A., Campbell, R., & Brammer, M. J. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology, 10, 649-657.

Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5, e1000436.

Cienkowski, K. M., & Carney, A. E. (2002). Auditory-visual speech perception and aging. Ear and Hearing, 23, 439-449.

Crosse, M. J., Butler, J. S., & Lalor, E. C. (2015). Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. Journal of Neuroscience, 35, 14195-14204. doi: 10.1523/JNEUROSCI.1829-15.2015

Dekle, D. J., Fowler, C. A., & Funnell, M. G. (1992). Audiovisual integration in perception of real words. Perception & Psychophysics, 51, 355-362.

Delbeuck, X., Collette, F., & Van der Linden, M. (2007). Is Alzheimer's disease a disconnection syndrome? Evidence from a crossmodal audio-visual illusory experiment. Neuropsychologia, 45, 3315-3323. doi: 10.1016/j.neuropsychologia.2007.05.001

Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40, 481-492.

Erickson, L. C., Heeg, E., Rauschecker, J. P., & Turkeltaub, P. E. (2014). An ALE meta-analysis on the audiovisual integration of speech signals. Human Brain Mapping, 35, 5587-5605.

Erickson, L. C., Zielinski, B. A., Zielinski, J. E., Liu, G., Turkeltaub, P. E., Leaver, A. M., & Rauschecker, J. P. (2014). Distinct cortical locations for integration of audiovisual speech and the McGurk effect. Frontiers in Psychology, 5, 534. doi: 10.3389/fpsyg.2014.00534

Feld, J., & Sommers, M. S. (2011). There goes the neighborhood: Lipreading and the structure of the mental lexicon. Speech Communication, 53, 220-228.

Giraud, A.-L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15, 511-517.

Gosselin, P. A., & Gagné, J.-P. (2011). Older adults expend more listening effort than younger adults recognizing audiovisual speech in noise. International Journal of Audiology, 50, 786-792.

Grant, K. W., & Seitz, P. F. (1998). Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America, 104, 2438-2450.

Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197-1208.

Grant, K. W., Walden, B. E., & Seitz, P. F. (1998). Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103, 2677-2690.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-536.

Hickok, G., Rogalsky, C., Matchin, W., Basilakos, A., Cai, J., Pillay, S., ... Fridriksson, J. (2018). Neural networks supporting audiovisual integration for speech: A large-scale lesion study. Cortex, 103, 360-371. doi: 10.1016/j.cortex.2018.03.030

Lakatos, P., Chen, C.-M., O'Connell, M. N., Mills, A., & Schroeder, C. E. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron, 53, 279-292.

Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment of neuronal oscillations as a mechanism of attentional selection. Science, 320, 110-113.

Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19, 1-36.

Luo, H., Liu, Z., & Poeppel, D. (2010). Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biology, 8, e1000445.

Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54, 1001-1010.

Macleod, A., & Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131-141.

Magnotti, J. F., Basu Mallick, D., Feng, G., Zhou, B., Zhou, W., & Beauchamp, M. S. (2015). Similar frequency of the McGurk effect in large samples of native Mandarin Chinese and American English speakers. Experimental Brain Research, 233, 2581-2586. doi: 10.1007/s00221-015-4324-7

Magnotti, J. F., & Beauchamp, M. S. (2015). The noisy encoding of disparity model of the McGurk effect. Psychonomic Bulletin & Review, 22, 701-709. doi: 10.3758/s13423-014-0722-2

Magnotti, J. F., & Beauchamp, M. S. (2018). Published estimates of group differences in multisensory integration are inflated. PLOS ONE, 13, e0202908. doi: 10.1371/journal.pone.0202908

Marques, L. M., Lapenta, O. M., Costa, T. L., & Boggio, P. S. (2016). Multisensory integration processes underlying speech perception as revealed by the McGurk illusion. Language, Cognition and Neuroscience, 31, 1115-1129. doi: 10.1080/23273798.2016.1190023

Marslen-Wilson, W. D., & Tyler, L. K. (1980). The temporal structure of spoken language understanding. Cognition, 8, 1-71.

Massaro, D. W., Thompson, L. A., Barron, B., & Laren, E. (1986). Developmental changes in visual and auditory contributions to speech perception. Journal of Experimental Child Psychology, 41, 93-113.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Micheli, C., Schepers, I. M., Ozker, M., Yoshor, D., Beauchamp, M. S., & Rieger, J. W. (2018). Electrocorticography reveals continuous auditory and visual speech tracking in temporal and occipital cortex. European Journal of Neuroscience. doi: 10.1111/ejn.13992

Moris Fernandez, L., Macaluso, E., & Soto-Faraco, S. (2017). Audiovisual integration as conflict resolution: The conflict of the McGurk illusion. Human Brain Mapping, 38, 5691-5705. doi: 10.1002/hbm.23758

Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58, 351-362.

Nath, A. R., & Beauchamp, M. S. (2011). Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. Journal of Neuroscience, 31, 1704-1714.

Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59, 781-787. doi: 10.1016/j.neuroimage.2011.07.024

Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

Okada, K., Venezia, J. H., Matchin, W., Saberi, K., & Hickok, G. (2013). An fMRI study of audiovisual speech perception reveals multisensory interactions in auditory cortex. PLOS ONE, 8, e68959.

Peelle, J. E. (2018). Listening effort: How the cognitive consequences of acoustic challenge are reflected in brain and behavior. Ear and Hearing, 39, 204-214. doi: 10.1097/AUD.0000000000000494

Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.

Peelle, J. E., Gross, J., & Davis, M. H. (2013). Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cerebral Cortex, 23, 1378-1387.

Peelle, J. E., & Sommers, M. S. (2015). Prediction and constraint in audiovisual speech perception. Cortex, 68, 169-181.

Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to understand: A speechreading advantage with intact stimuli. In R. Campbell & B. Dodd (Eds.), Hearing by eye: The psychology of lip-reading (pp. 97-113). London: Erlbaum.

Rosenblum, L. D., & Saldana, H. M. (1996). An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 22, 318-331.

Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59, 347-357.

Sams, M., Manninen, P., Surakka, V., Helin, P., & Kättö, R. (1998). McGurk effect in Finnish syllables, isolated words, and words in sentences: Effects of word meaning and sentence context. Speech Communication, 26, 75-87.

Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences, 12, 106-113.

Schwartz, J.-L., & Savariaux, C. (2014). No, there is no 150 ms lead of visual speech on auditory speech, but a range of audiovisual asynchronies varying from small audio lead to large audio lag. PLOS Computational Biology, 10, e1003743.

Sekiyama, K. (1997). Cultural and linguistic factors in audiovisual speech processing: The McGurk effect in Chinese subjects. Perception & Psychophysics, 59, 73-80.

Sekiyama, K., Kanno, I., Miura, S., & Sugita, Y. (2003). Auditory-visual speech perception examined by fMRI and PET. Neuroscience Research, 47, 277-287.

Sekiyama, K., Soshi, T., & Sakamoto, S. (2014). Enhanced audiovisual integration with aging in speech perception: A heightened McGurk effect in older adults. Frontiers in Psychology, 5.

Sommers, M. S., Tye-Murray, N., & Spehar, B. (2005). Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults. Ear and Hearing, 26, 263-275.

Soto-Faraco, S., & Alsius, A. (2009). Deconstructing the McGurk-MacDonald illusion. Journal of Experimental Psychology: Human Perception and Performance, 35, 580-587. doi: 10.1037/a0013483

Stevenson, R. A., Zemtsov, R. K., & Wallace, M. T. (2012). Individual differences in the multisensory temporal binding window predict susceptibility to audiovisual illusions. Journal of Experimental Psychology: Human Perception and Performance, 38, 1517-1529. doi: 10.1037/a0027339

Strand, J. F. (2014). Phi-square Lexical Competition Database (Phi-Lex): An online tool for quantifying auditory and visual lexical competition. Behavior Research Methods, 46, 148-158. doi: 10.3758/s13428-013-0356-8

Strand, J. F., Brown, V. A., & Barbour, D. L. (2018). Talking points: A modulating circle reduces listening effort without improving speech recognition. Psychonomic Bulletin & Review. doi: 10.3758/s13423-018-1489-7

Strand, J. F., Cooperman, A., Rowe, J., & Simenstad, A. (2014). Individual differences in susceptibility to the McGurk effect: Links with lipreading and detecting audiovisual incongruity. Journal of Speech, Language, and Hearing Research, 57, 2322-2331. doi: 10.1044/2014_JSLHR-H-14-0059

Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212-215.

Summerfield, A. Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip reading (pp. 3-87). Hillsdale, NJ: Erlbaum.

Tye-Murray, N., Sommers, M. S., & Spehar, B. (2007a). Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear and Hearing, 28, 656-668.

Tye-Murray, N., Sommers, M. S., & Spehar, B. (2007b). Auditory and visual lexical neighborhoods in audiovisual speech perception. Trends in Amplification, 11, 233-241.

Tye-Murray, N., Spehar, B., Myerson, J., Hale, S., & Sommers, M. (2016). Lipreading and audiovisual speech recognition across the adult lifespan: Implications for audiovisual integration. Psychology and Aging, 31, 380-389. doi: 10.1037/pag0000094

Van Engen, K. J., Phelps, J. E. B., Smiljanic, R., & Chandrasekaran, B. (2014). Enhancing speech intelligibility: Interactions among context, modality, speech style, and masker. Journal of Speech, Language, and Hearing Research, 57, 1908-1918.

Van Engen, K. J., Xie, Z., & Chandrasekaran, B. (2017). Audiovisual sentence recognition not predicted by susceptibility to the McGurk effect. Attention, Perception, & Psychophysics, 79, 396-403. doi: 10.3758/s13414-016-1238-9

Venezia, J. H., Thurman, S. M., Matchin, W., George, S. E., & Hickok, G. (2016). Timing in audiovisual speech perception: A mini review and new psychophysical data. Attention, Perception, & Psychophysics, 78, 583-601. doi: 10.3758/s13414-015-1026-y

Venezia, J. H., Vaden, K. I., Jr., Rong, F., Maddox, D., Saberi, K., & Hickok, G. (2017). Auditory, visual and audiovisual speech processing streams in superior temporal sulcus. Frontiers in Human Neuroscience, 11, 174. doi: 10.3389/fnhum.2017.00174

Windmann, S. (2004). Effects of sentence context and expectation on the McGurk illusion. Journal of Memory and Language, 50, 212-230.

Zion Golumbic, E., Cogan, G. B., Schroeder, C. E., & Poeppel, D. (2013). Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". Journal of Neuroscience, 33, 1417-1426.