Modelling multimodal integration: The case of the McGurk effect
Fabian Tomaschek¹, Daniel Duran²

¹ Department of General Linguistics, University of Tübingen, Germany
² Deutsches Seminar, Albert-Ludwigs-Universität Freiburg, Germany

Corresponding author:
Fabian Tomaschek
Seminar für Sprachwissenschaft
Eberhard Karls University Tübingen
Wilhelmstrasse 19
Tübingen
e-mail: [email protected]

Version: September 27, 2019

Abstract

The McGurk effect is a well-known perceptual phenomenon in which listeners perceive [da] when presented with visual instances of [ga] syllables combined with the audio of [ba] syllables. However, the underlying cognitive mechanisms are not yet fully understood. In this study, we investigated the McGurk effect from the perspective of two learning theories: Exemplar Theory and Discriminative Learning. We hypothesized that the McGurk effect arises from distributional differences of the [ba, da, ga] syllables in the lexicon of a given language. We tested this hypothesis using computational implementations of these theories, simulating learning on the basis of lexica in which we varied the distributions of these syllables systematically. These simulations support our hypothesis.

For both learning theories, we found that the probability of observing the McGurk effect in our simulations was greater when lexica contained a larger percentage of [da] and [ba] instances. Crucially, the probability of the McGurk effect was inversely proportional to the percentage of [ga], whatever the percentage of [ba] and [da].

To validate our results, we inspected the distributional properties of [ba, da, ga] in different languages for which the McGurk effect was or was not previously attested. Our results mirror the findings for these languages. Our study shows that the McGurk effect, an instance of multi-modal perceptual integration, arises from distributional properties of different languages and thus depends on language learning.

1 Introduction

The McGurk effect is a widely known perceptual phenomenon (McGurk & MacDonald, 1976): hearing [ba] and seeing ⟨ga⟩, test subjects perceive the percepteme 〚da〛 [1]. While it is by now very clear that the human brain integrates visual and acoustic information, because "[why] would the brain not use such available information", as MacDonald (2018) puts it, it is still not fully understood how the fused perceptual outcome comes into being. In the present study, we claim that the distributional properties of 〚ba, da, ga〛 in the language are the reason for the fused outcome. We therefore investigate how distributional properties of speech affect the McGurk effect.

[1] The classic example of the McGurk effect is described as follows: hearing [ba] and seeing ⟨ga⟩, test subjects perceive a fused perceptual outcome 〚da〛. We call the recognized linguistic items, i.e. stimulus responses, perceptemes, whether they are consistent with the perceptual cues or, as in the fused perceptual outcome, inconsistent with them. The term percepteme is established in order to distinguish the cognitive response in a categorization task from its phonetic auditory and visual input. Such phonetic input items, i.e. specific instances of realized speech items, are denoted in the usual phonetic transcription with square brackets, as in [da]. Phonemes, i.e. abstract linguistic categories, are denoted in the usual phonemic/phonological transcription as /da/. Visual representations of speech items are denoted with angle brackets, as in ⟨da⟩. Perceptemes are indicated by white square brackets, e.g. 〚da〛. Note that we do not use the term viseme for the visual representations of speech items. Fisher (1963) introduced this term as an abbreviation for "visual phonemes", denoting "mutually exclusive classes of sounds visually perceived", i.e. groups of responses in a confusion matrix (cf. Fisher, 1968). Today, however, the term is often used with a somewhat different meaning, referring to the smallest visual unit of speech in visual or multimodal speech synthesis or recognition. The effect, of course, also works with 〚bi, di, gi〛 and, to a certain degree, with voiceless plosives as well as glides. See furthermore Rosenblum (2008) for a short review of multimodal speech perception and Sumby & Pollack (1954) for an earlier finding of the effect.

Phonetic categories such as 〚ba, da, ga〛 are not signalled by discrete acoustic cues. Rather, acoustic cues show a large variability, for example due to speaking rate (Gay, 1978; Hirata, 2004), word length (Altmann, 1980), coarticulation (Öhman, 1966; Magen, 1997), practice effects (Tomaschek et al., 2013, 2014, 2018a,c), as well as idiosyncratic speaker variation (Tomaschek & Leeman, 2018; Weirich & Fuchs, 2006). A large number of studies have shown that listeners are highly sensitive to the distributional characteristics of the phonetic cues indicating membership in a phonemic category (Clayards et al., 2008; Yoshida et al., 2010; Nixon et al., 2014, 2015; Nixon & Best, 2018; Nixon, 2018). Irrespective of the number of phonemic categories that can be discriminated along an acoustic continuum, two sources of variance have been shown to affect perceptual outcomes in listeners: the amount of overlap between categories along the continuum, and the variance within each category. The stronger the overlap of cues between phonemic categories, and the larger the variance within a phonemic category, the less consistent are the listeners' decisions about which phonemic category stimuli belong to.

We approached the problem of distributional properties by means of computational simulations. We investigated the McGurk effect with two computationally formalized cognitive models that both account for distributional effects in speech production and speech perception. To our knowledge, only Massaro and colleagues have used computer simulations to explain the McGurk effect (Massaro & Cohen, 1995). They explained the fused perception of 〚da〛 by calculating the probability of perceiving the auditory and visual cues as 〚da〛, given the stimulus. To assess this probability, perceived cues are "matched against prototype descriptions [...] and an identification decision is made on the basis of the relative goodness of match of the stimulus information with the relevant prototype description" (Massaro & Cohen, 1995).
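Massaro's account is formalized in the Fuzzy Logical Model of Perception (FLMP), in which the degree of support that each modality lends to each response alternative is combined multiplicatively and normalized across alternatives. The following minimal sketch illustrates this fusion scheme; the support values are invented for illustration and are not the parameter estimates reported by Massaro & Cohen (1995).

```python
# Illustrative sketch of FLMP-style multiplicative cue fusion.
# All support values below are invented for illustration; they are
# NOT the parameter estimates reported by Massaro & Cohen (1995).

def flmp_fuse(auditory, visual):
    """Fuse per-category support values multiplicatively and normalize."""
    support = {cat: auditory[cat] * visual[cat] for cat in auditory}
    total = sum(support.values())
    return {cat: s / total for cat, s in support.items()}

# Auditory [ba] strongly supports 'ba'; visual <ga> strongly supports 'ga';
# both lend intermediate support to 'da'.
auditory = {"ba": 0.80, "da": 0.50, "ga": 0.10}
visual   = {"ba": 0.10, "da": 0.50, "ga": 0.80}

fused = flmp_fuse(auditory, visual)
print(fused)  # 'da' receives the highest fused probability (about 0.61),
              # mirroring the fused percept in the McGurk effect
```

Because the supports are combined multiplicatively, an alternative that receives moderate support from both modalities (here 'da') can outweigh alternatives that are strongly supported by only one modality.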
In the last decade, new models have been presented that explain linguistic and cognitive processes during speech production and perception on the basis of the speaker's experience of language. We use the computational implementations of two of these models in the present study to investigate the origins of the McGurk effect.

The first is Exemplar Theory (Johnson, 1997; Goldinger, 1998; Pierrehumbert, 2001), which assumes (on an abstract functional level) that individual instances of perceived events are stored in memory as fully specified exemplars. Categories, e.g. equivalent or similar speech sounds, form "clouds" of exemplars which are clustered close together in a perceptual feature space. Within these exemplar clouds, variability and frequency of occurrence are preserved implicitly. Categorization of a new input is based on computing the similarity between the new perceptual stimulus and all stored exemplars. Frequency of occurrence, reflected by the number of exemplars, is an important factor in these similarity computations: the more exemplars in memory carry the same associated label, the more strongly this category label is weighted in categorization (Duran, 2013, 2015) [2].

[2] See Appendix A for a discussion of "similarity" in exemplar-theoretic models.
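To make this functional description concrete, the following minimal sketch categorizes a stimulus by summing exponentially decaying similarities to all stored exemplars per label, so that both proximity in the feature space and the sheer number of exemplars per category influence the decision. The similarity kernel, the two-dimensional feature space, and all values are our illustrative assumptions, not the implementation used in this study.

```python
import math

# Minimal exemplar-model sketch: each stored exemplar is a point in a
# perceptual feature space paired with a category label. A new stimulus
# is categorized by summing its similarity to every stored exemplar,
# per label. Kernel and values are illustrative assumptions only.

def similarity(stimulus, exemplar, sensitivity=1.0):
    """Exponentially decaying similarity between stimulus and exemplar."""
    return math.exp(-sensitivity * math.dist(stimulus, exemplar))

def categorize(stimulus, memory):
    """Sum similarities per label; frequent categories contribute more
    summed similarity simply because they have more exemplars."""
    activation = {}
    for exemplar, label in memory:
        activation[label] = activation.get(label, 0.0) + similarity(stimulus, exemplar)
    total = sum(activation.values())
    return {label: a / total for label, a in activation.items()}

# Toy two-dimensional feature space (e.g., one acoustic, one visual dimension).
memory = [
    ((0.0, 0.0), "ba"), ((0.1, 0.0), "ba"),
    ((0.5, 0.5), "da"), ((0.5, 0.6), "da"), ((0.6, 0.5), "da"),  # more frequent
    ((1.0, 1.0), "ga"),
]

# A stimulus with conflicting cues, lying between the 'ba' and 'ga' regions.
print(categorize((0.5, 0.5), memory))  # 'da' dominates: close AND frequent
```

With more 'da' exemplars in memory, the summed similarity for 'da' grows, which is how frequency of occurrence enters the categorization decision.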
The second is Naive Discriminative Learning (Baayen et al., 2011), a computational algorithm based on classical conditioning that simulates error-driven learning (Rescorla & Wagner, 1972; Ramscar et al., 2010, 2013), according to which phonemic categories are discriminated by means of physical cues such as formant frequencies and, in our case, visual information about the lips. The strength with which cues discriminate outcomes depends on the frequency of co-occurrence of the outcomes and the cues, as well as on the distributional properties of cues within and across categories. Frequency of occurrence is implicitly represented by means of stronger association weights between cues and outcomes. However, this strength is further modulated by the distribution of cues across outcomes (a minimal sketch of the underlying learning rule is given at the end of this section).

We hypothesized that the McGurk effect is a result of the distributional properties of the acoustic and visual cues signalling 〚ba, da, ga〛 (see Section 2). The more frequent the 〚da〛 category, the more often we should be able to attest the McGurk effect in our models. More precise predictions are not possible at this point because of the high dimensionality of the data and the non-linear behavior of the models under investigation. A verbal description of the expected results would be circular, as the mathematical calculations of the present study can themselves be regarded as predictions for participants' behavior in actual experiments. Furthermore, the current study serves as a test bed to compare these two theories and their predictions for linguistic behavior.

In the remainder of the paper, we first present the results of a survey
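The following is the minimal sketch of the Rescorla-Wagner update rule on which Naive Discriminative Learning builds, referenced in the discriminative-learning paragraph above: discrete acoustic and visual cues compete to predict percepteme outcomes. The cue inventory, toy lexicon, and learning parameters are illustrative assumptions, not those of our simulations.

```python
# Minimal Rescorla-Wagner sketch underlying naive discriminative learning.
# Cues (acoustic and visual) and outcomes (perceptemes) are discrete labels;
# the training distribution and parameters below are illustrative only.

def rw_update(weights, cues, outcomes, all_outcomes, rate=0.01, lam=1.0):
    """One error-driven learning event: strengthen cue-outcome links for
    under-predicted present outcomes, weaken them for absent outcomes."""
    for o in all_outcomes:
        predicted = sum(weights.get((c, o), 0.0) for c in cues)
        target = lam if o in outcomes else 0.0
        delta = rate * (target - predicted)  # prediction error
        for c in cues:
            weights[(c, o)] = weights.get((c, o), 0.0) + delta
    return weights

weights = {}
all_outcomes = {"ba", "da", "ga"}
# Toy lexicon: audio cues co-occur with matching lip cues during learning,
# here with uniform syllable frequencies.
events = [({"audio:ba", "lips:ba"}, {"ba"}),
          ({"audio:da", "lips:da"}, {"da"}),
          ({"audio:ga", "lips:ga"}, {"ga"})] * 2000

for cues, outcomes in events:
    rw_update(weights, cues, outcomes, all_outcomes)

# McGurk stimulus: mismatched audio [ba] with visual <ga>.
mcgurk_cues = {"audio:ba", "lips:ga"}
support = {o: sum(weights.get((c, o), 0.0) for c in mcgurk_cues)
           for o in all_outcomes}
print(support)  # with this uniform toy lexicon, 'ba' and 'ga' tie;
                # 'da' receives no support at all
```

Under this uniform toy distribution the mismatched cues produce a tie between 'ba' and 'ga'; varying the syllable distributions in the training lexicon, as in the simulations reported below, shifts this competition between outcomes.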