Is there a musical McGurk effect? A proposal for an investigation into the embodiment of music

Joseph German, Peabody Institute of the Johns Hopkins University

Abstract: After reviewing research related to the McGurk effect, embodied cognition in speech and music processing, and the influence of vision on music perception at both a high and a low level, I describe an experiment to investigate the possible existence of a McGurk-like effect in the perception of music when both visual and auditory components of music making are presented. The interpretation and implications of the potential results are discussed.

Much has been made of the connections between language, speech, and music. It can often be difficult to determine where speech ends and music begins—speech has significant rhythmic and pitch components that can easily be heard as musical (Deutsch et al. 2011). On the other hand, brain imaging techniques reveal that while the perception and processing of musical and speech stimuli share a number of brain areas, there are also areas that are not shared (Brown et al. 2006). The extent to which music is treated like spoken language is thus an open question. Here, I propose an experiment inspired by the famous McGurk effect (sometimes called the McGurk-MacDonald effect). After describing the McGurk effect and the results of research on it, I discuss the implications of the effect for embodied views of speech processing as exemplified by the motor theory of speech perception. I then discuss the possibility of embodiment in music cognition and review research on the influence of vision on the perception of music. I next describe an experiment to investigate the possibility of a McGurk-like effect in a musical context. Audio and audiovisual stimuli consist of scales played on the violin, with the visual and auditory components either matching or not matching in the articulation used. Participants in three groups are asked to identify an “answer” recording that sounds like the stimulus. I then describe the potential results of conducting this experiment, focusing on how a differential in effect strength based on group (if an effect exists at all) could be interpreted. Theoretical, practical, and artistic implications of the existence of a musical McGurk effect are discussed. I conclude by noting the significance of the existence of a musical McGurk effect for both music and speech perception.

Background

The McGurk Effect

One of the most famous phenomena in speech perception is the McGurk effect. When a person hears a syllable (such as “bah”) and simultaneously sees a face producing another syllable (such as “gah”), the person will perceive a third syllable (such as “dah”) (McGurk and MacDonald 1976); the effect is especially strong when the auditory signal is degraded (Sekiyama and Tohkura 1991). The McGurk effect has a number of additional interesting features. First, it is highly robust: awareness of the phenomenon’s presence does not appear to diminish the effect to a significant degree (Manuel et al. 1983). In fact, even a mismatch between the genders of the face and voice does not eliminate the effect. However, such a mismatch does eliminate the difference in effect size between genders: females typically show a stronger McGurk effect than males, but this difference disappears when face and voice are mismatched in gender (Green et al. 1991). The effect is present from a young age: it has been found in infants as young as five months (Rosenblum et al. 1997) and four and a half months (Burnham and Dodd 2004). On the other hand, although it has been found in speakers of all languages that have been tested (Rosenblum 2010), the McGurk effect is nevertheless culture-variant: the degree of the effect varies between cultures. The Japanese, for example, exhibit a much weaker McGurk effect than Americans for high-quality auditory stimuli (Sekiyama and Tohkura 1991), and are better able to detect mismatches such as those found in the McGurk effect (Sekiyama 1997). This has been attributed to the importance of face avoidance in Japanese culture, as well as the absence of consonant clusters in Japanese (Sekiyama 1997). It has also been found that the importance of visual cues to speech perception does not increase past the age of six in Japanese speakers, unlike in English speakers (Sekiyama and Burnham 2008).

Exploration of the McGurk effect has largely focused on linguistic sounds, although there have been some exceptions. Fagel (2006) found evidence for an emotional McGurk effect: when there is a disagreement between the emotional content of a stimulus in the auditory and visual domains (a happy face using an angry tone of voice, for example), a third emotion can be perceived. The emotional McGurk effect has also been found in Swedish (Abelin 2008). These experiments in particular are well suited to adaptation to a musical domain, as will be discussed later. In summary, the McGurk effect reveals that what is typically considered an auditory domain (spoken language) is in fact influenced by vision as well. The McGurk effect is one of the most important pieces of evidence for embodiment in speech perception. Although no McGurk-like effect has yet been found for musical stimuli (a fact that the proposed experiment aims to address), there is also interest in the possibility of embodiment in music cognition.

Embodied Cognition in Language and Music

To reiterate, the McGurk effect demonstrates the importance of visual cues in language perception in general, and is more specifically cited as evidence for an embodied cognition view of language. According to this embodied cognition viewpoint, specifically the motor theory of speech perception, speech processing is performed in part by specialized modules that play a role in both the perception and production of speech (Liberman and Mattingly 1985). Speech is perceived as a series of vocal “gestures”, which are understood in terms of the perceiver’s experience in producing sound with those gestures. The fact that, in the McGurk effect, the perceived gesture can “interfere” with the perception of the sound is seen as support for this view. Because, according to the motor theory, speech is perceived in terms of gestures rather than in terms of either auditory or visual stimuli, the apparent conflict between the audible and the visible is not really a conflict; rather, each is a component of the true object of perception (Liberman and Mattingly 1985). This theory is supported by findings that motor-evoked potentials induced by transcranial magnetic stimulation are increased by the perception of speech, indicating a close relationship between speech production and perception (Watkins et al. 2003).

Embodied cognition approaches have also been gaining popularity in the field of music cognition. As in the motor theory, at least some aspects of music are claimed to be understood in terms of physical actions, even when these actions are not actually performed. Research of this sort most often focuses on rhythm and the actions of tapping and dancing, and while the experiments conducted thus far have not provided anything conclusive, they do give cause for continued interest in the possibility. Justin London, for instance, has investigated the role of embodiment in music cognition. Contrary to what might be expected under embodiment, London has found that tapping motions do not increase the accuracy of tempo judgments (London 2011). On the other hand, London has more recently found evidence for embodiment by studying the interaction between dance perception and tempo perception. Other researchers have also found reason to hypothesize embodiment in music cognition. Toiviainen et al. (2010) found that when moving to music (that is, dancing) people move different parts of their body for different meters and tempos—that is, they embody these tempos and meters in different parts of the body. Perhaps most intriguingly, Styns et al. (2007) found a connection between the rhythm and tempo of walking and music. Participants walked faster to music than to metronomes set at the same tempo; participants also tended to synchronize their walking to the tempo of the music, especially around 120 beats per minute. The researchers note that the range at which such synchronization occurs corresponds to the range of tempos within which most music falls. Leman et al. (2013), in a study specifically intended to explore embodiment in a musical context, similarly found that walking becomes entrained to music. In addition to these synchronization effects, they found that gait becomes entrained to the “vigor” of the music: using different pieces of identical tempo and meter but different amounts of perceived vigor, they found that more vigorous music increases stride length (and thus, given synchrony, walking speed) compared to more relaxing music.
Although these experiments provide some evidence for the existence of embodiment in music cognition, they are far from conclusive. The identification of a McGurk-like effect in the musical domain would be some of the strongest evidence for embodied music cognition yet found, at least so far as the linguistic McGurk effect supports linguistic embodiment—a point subject to significant debate (Massaro 1987; Massaro and Chen 2008). Less controversial is the notion that vision influences the perception of music, whether embodiment is involved (as a McGurk-like effect would imply) or not. There is evidence indicating that visual stimuli can affect the perception of both high-level and low-level aspects of music, as I shall discuss in the next section.

Audio-visual integration in a musical context

Relatively little work has been done on the McGurk effect, or any comparable effect, in a musical context. One study has shown that there is no significant difference between spoken and sung syllables with respect to the degree of perceptual fusion (Quinto et al. 2010), which is perhaps surprising given that singing and speaking can be dissociated via case studies of aphasics (Racette et al. 2006) and brain imaging (Riecker et al. 2000). However, more research has been done on the effect of visual cues on the perception of music and musical stimuli in general, as well as the reverse. Schutz and Lipscomb (2007), for example, found that gesture length can affect the perceived length of a marimba tone, even when the duration of the tone does not in fact vary. However, they did not test whether percussionists exhibit this effect any more than other musicians (none of their participants were primarily percussionists), and they did not include any non-musicians in their study—my proposed experiment would address these deficiencies. Regardless, their results clearly show that visual information can affect low-level aspects of an audio stimulus. Vision appears to exert an influence on the perception of music at a higher level as well. Broughton and Stevens (2008) found that judgments of the expressivity of marimba music were more accurate when the recording included video of the performance as well as audio. Davidson (1993) found that the visual aspect of performance may contribute more to the perception of expressive mode than the audio component. Vines et al. (2011) similarly report that whether or not a listener could see a performance was the most important factor in determining whether the performer’s intended expressive mode was accurately conveyed. It has also been found that the visual component is more important when judging the quality of musical performance, even when the judges are professional musicians (Tsay 2013). In the opposite direction, Van den Stock et al. (2009) found that musical cues can influence the perception of facial expressions. Jeong et al. (2011) also found that music can affect the perceived emotions in facial expressions, and that matches between musical and facial emotion produce greater activation in the superior temporal gyrus than mismatches, while the opposite is true in the face-sensitive fusiform gyrus.

Materials and Methods

Participants: There will be three groups: two experimental groups and one control group. The participants in the two experimental groups will be music students at the Peabody Institute of the Johns Hopkins University. The first experimental group will be made up of violinists and violists. These instruments were chosen for a number of reasons. Firstly, due to the large number of violinists at Peabody, there is a large pool of potential participants available. Secondly, bowing the violin and viola requires relatively large and visible movements of the right hand. Thirdly, changes in articulation on the violin and viola entail relatively large and visible changes in gesture and movement of the right hand (compared, for example, to the relatively subtle motions made by the player of a woodwind instrument). The second experimental group will be made up of players of non-string instruments. Players of stringed instruments besides the violin and viola will be excluded from the study due to uncertainty regarding whether they would best be placed in the experimental or control group, which would complicate the interpretation of the data. The existence of these two separate experimental groups is motivated by the possibility of a difference between them in the presence or extent of an effect based on embodied cognition. Violinists and violists are “speakers” of the “language” they will be tested on, and themselves perform the gestures they will be seeing. Other musicians, by contrast, do not perform these gestures. The control group will consist of students from other schools of the Johns Hopkins University without significant musical training who are otherwise comparable (so far as is feasible) to the other two groups. All participants will be surveyed on the history and duration of their musical training, their academic and personal music listening habits and preferences (both live and recorded), and their own practice habits and performing experience, as well as on typical variables such as sex, age, and native language. The effect of listening habits, especially for non-musicians, will be particularly noted.

Stimuli: The stimuli will consist of two parts, an audio component and a video component. The audio component will consist of recordings of ascending and descending one-octave C major scales, one quarter note per tone, played on the violin at a tempo of quarter note = 92. Four different articulations will be used in the creation of the recordings: slurred, legato, staccato, and spiccato. Multiple recordings of each articulation will be used. The video component will consist of footage of a violinist playing the same scale using the same four articulations. The video will be shot from an angle intermediate between parallel and perpendicular to the long axis of the violin, with a clear view of the action of the bow hand; this angle may be modified if initial results indicate another angle would be more successful. As with the audio component, multiple recordings of each articulation will be used. The two components will be combined in different ways for different conditions. For the control condition, the audio component will be played alone. For the matched condition, the articulation used in the audio will match the articulation used in the video component, although they will not be taken from the same recording (for example, video of the scale played legato will be accompanied by audio of a different instance of the scale being played legato). For the mismatched condition, the articulation used in the video will differ from that used in the audio; for example, footage of the scale played legato would be accompanied by audio of the scale played staccato. The audio will not be taken from the same recording as the video even in the matched condition, in order to prevent the exact match between audio and video that would only occur in that situation from affecting the results. The aim is to isolate the effects of gesture match or mismatch, not movement match/mismatch per se.

Procedure: The participants will first fill out the survey, then be instructed as to the nature of the task, including the importance of viewing the video component of the stimulus when it exists. Before the test begins, recordings of each of the articulation types, along with their labels, will be played for the participant. For each trial, the participant will observe a stimulus from one of the three conditions: control, matched, or mismatched. During the matched and mismatched conditions, the participant will be instructed to attend to the visual component in addition to the audio component. Afterwards, audio-only recordings of each of the four articulations will be played (the same recordings used as control stimuli). The participant will then be asked to identify which of the four audio recordings sounds most like the stimulus, with the additional option of answering “other” if they feel none of the recordings sounds like the stimulus (if participants consistently reply “other”, the experiment may be modified to explore the nature of this “other”). If required, the participant will have the option of reviewing the stimulus up to two additional times. The “correct” recording (the one which actually uses the same articulation as the stimulus) will not be the same audio as in the stimulus, but a different recording of the same articulation. Testing will be conducted in this manner (as opposed to simply asking the participant what articulation they perceived) due to the likely lack of knowledge of articulation terminology in some of the groups, especially the control group.
A control group member cannot be expected to know and understand the terms for these articulations, and this protocol allows every group to be tested in the same way. However, the protocol does have the disadvantage that it may not reveal a McGurk effect whose resulting percept falls outside these four articulations; the answer “other” may indicate that this is occurring, but it would provide no information about the nature of the change (which is why the experiment may be modified if this answer occurs consistently). However, it is not practical to test all possible articulations, due both to the increased difficulty of identification and to the increased time required.
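To make the stimulus-pairing logic concrete, the following minimal sketch (in Python) shows one way the control, matched, and mismatched trials described above could be assembled from a pool of pre-recorded clips. The file names, number of takes, and trial counts are illustrative placeholders, not part of the proposed design.

```python
# A minimal sketch of the stimulus-pairing logic; file names and counts are hypothetical.
import itertools
import random

ARTICULATIONS = ["slurred", "legato", "staccato", "spiccato"]
TAKES = [1, 2, 3]  # multiple recordings (takes) of each articulation


def audio_clip(articulation, take):
    return f"audio_{articulation}_take{take}.wav"   # hypothetical file name


def video_clip(articulation, take):
    return f"video_{articulation}_take{take}.mp4"   # hypothetical file name


def make_trials(n_per_condition=4, seed=0):
    """Build control, matched, and mismatched trials.

    Matched trials pair audio and video of the same articulation but from
    different takes, so only the gesture (articulation) matches, never the
    exact performance; mismatched trials pair different articulations.
    """
    rng = random.Random(seed)
    trials = []

    # Control condition: audio only.
    for art in ARTICULATIONS:
        for _ in range(n_per_condition):
            trials.append({"condition": "control",
                           "audio": audio_clip(art, rng.choice(TAKES)),
                           "video": None,
                           "audio_articulation": art})

    # Matched condition: same articulation, different takes for audio and video.
    for art in ARTICULATIONS:
        for _ in range(n_per_condition):
            a_take, v_take = rng.sample(TAKES, 2)
            trials.append({"condition": "matched",
                           "audio": audio_clip(art, a_take),
                           "video": video_clip(art, v_take),
                           "audio_articulation": art})

    # Mismatched condition: different articulations in audio and video.
    for a_art, v_art in itertools.permutations(ARTICULATIONS, 2):
        trials.append({"condition": "mismatched",
                       "audio": audio_clip(a_art, rng.choice(TAKES)),
                       "video": video_clip(v_art, rng.choice(TAKES)),
                       "audio_articulation": a_art,
                       "video_articulation": v_art})

    rng.shuffle(trials)
    return trials
```

The key design point reflected in the sketch is that even matched trials never reuse the same take for audio and video, so that only articulation (gesture) match or mismatch is manipulated.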

Potential Results and Implications

This study aims to answer two basic questions: Is there a McGurk-like categorical change in perceived articulation under audio-visual mismatch? And does this effect vary with musical experience? The answers to these questions can inform our understanding of the role of embodiment and experience in music cognition, and may also have practical implications for performers and directors, filmmakers, and composers. The influence of the visual stimulus on the perception of the audio stimulus may take a number of different forms. If the influence manifests as it does in the linguistic McGurk effect, then a mismatch would produce the perception of an unsensed third articulation. However, the implications of this would depend on exactly which groups exhibit the effect; unlike with language, it is not a given that all people will demonstrate the phenomenon.

One possibility is that only the first experimental group (consisting of violinists and violists) would demonstrate the effect, or would at least exhibit it to a much greater degree than the other groups. Because in this case the effect of seeing the gestures would be confined to those who perform the gestures, this result would support an embodied view of music perception and a motor theory of music perception, at least insofar as the traditional McGurk effect supports the motor theory of speech perception—a topic which is widely debated.

Alternatively, both experimental groups may exhibit the effect to a similar degree, while the effect is absent or significantly reduced in the non-musician control group. This would suggest that the effect is not a matter of perceiving articulation in terms of gestures one performs oneself; rather, mere exposure to the audiovisual correlation is sufficient—assuming that musicians at Peabody have had relatively significant exposure to violin playing, which does not seem like an implausible assumption. If this result is in fact found, it could shed light on the linguistic McGurk effect. Because many phonemes are shared across languages, it is difficult (but certainly not impossible) to test a person with phonemes and articulatory gestures they do not themselves use in some fashion (although the set of linguistically significant phonemes varies by language, phonemes that are not significant in a speaker’s language may still be produced by that speaker—the distinctions just would not be linguistically significant)—a supposition supported (or at least not refuted) by the finding that the McGurk effect is stronger when viewing speakers of foreign languages (Hayashi and Sekiyama 1998). In contrast, many musicians never use the gestures and movements other musicians use. If the effect is similar among all musicians, it could indicate that the linguistic McGurk effect is not due to embodied perception of speech as advocated by the motor theory, or at least not one intimately connected with production.

Another possibility is that all the groups will demonstrate a musical McGurk effect to a similar degree. Interpreting such a finding would be complicated by the fact that even the participants in the control group have almost certainly seen a violin being played, although it should be possible to identify frequent “violin viewers” based on the survey questions about personal listening habits. The possibility that a person could exhibit such an effect without at least moderate exposure to violin playing seems remote, however; it seems more likely that a small but significant amount of exposure would be necessary.
A final possibility (aside from extremely unlikely results such as non-musicians exhibiting the effect and musicians not) is that none of the groups will demonstrate a significant musical McGurk effect. However, even in this case it is likely that there will be some kind of non-categorical effect. This possibility might be revealed by a difference in identification accuracy between the different conditions; specifically examining the nature of such an effect, however, would be a topic for future studies. In addition, this would not eliminate the possibility of a musical McGurk effect using other instruments. Regardless of whether or not a McGurk effect is found, other instruments merit examination in search of such an effect. Using the same gesture-size argument by which violinists and violists were selected for this study, the trombone may be a profitable instrument to investigate. Another possibility is to test another kind of gesture used by violinists and violists, vibrato, to see the effect of a mismatch between seen and heard vibrato; fingering in general could also be investigated.

Should a musical McGurk effect be found, however, a number of avenues for future research present themselves. A musical equivalent of the previously mentioned “emotional McGurk effect” could be sought: one would examine whether a mismatch between the visual expressive mood and the audible expressive mood produces the perception of a third expressive mood. As mentioned before, the exact presentation of a musical McGurk effect would also suggest future directions in the research of the linguistic McGurk effect, possibly using phonemes in one language that have no meaning in another (such as click consonants).

The existence of a musical McGurk effect could pose practical issues and have practical applications. A potential problem could arise in the seats far from the stage in large concert halls, where there is a significant delay between the arrival of the light reflected by the performers and the arrival of the sound they produce. This could affect the perception of sudden shifts in articulation, as the heard articulation would take a moment to catch up with the seen articulation, producing a mismatch. How much of an issue this actually presents, and if necessary how it can be solved, would also be a potential avenue for future investigation. As for applications, the existence of a musical McGurk effect would have intriguing implications for films, music videos, and other multimedia art. Composers have in the past created works of “spatial music” in which the listener changes what they hear by moving around the room and changing their distance from various sound sources, and thus the balance (among other things) of the sounds they perceive; a musical McGurk effect would give composers another way to allow the listener to change what they are hearing, in this case by looking or not looking at a visual stimulus. The McGurk effect also has implications for the perception of onscreen performances and pseudo-performances of music. If there is a significant musical McGurk effect, an actor’s skill (or lack thereof) at miming the playing of an instrument could have a significant, potentially adverse impact on the viewer’s perception of the music. In addition to determining the existence of such an effect, this experiment would help determine the extent of such a problem by revealing who exhibits the effect.
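To complement this discussion of possible outcomes, the following is a minimal sketch of how mismatched-trial responses might be scored and compared across the three groups. The record fields, group labels, and the simple chi-square comparison are assumptions made purely for illustration; the actual analysis could instead use, for example, a mixed-effects logistic regression with participant as a random factor.

```python
# A minimal analysis sketch, assuming each response is stored as a dictionary
# with the fields used below (all field names are hypothetical).
from collections import Counter
from scipy.stats import chi2_contingency


def classify_response(trial):
    """Label a mismatched-trial response as audio-led, video-led, fused, or other."""
    r = trial["response"]                      # one of the four articulations or "other"
    if r == "other":
        return "other"
    if r == trial["audio_articulation"]:
        return "audio"                         # heard what was actually played
    if r == trial["video_articulation"]:
        return "video"                         # captured by the seen gesture
    return "fusion"                            # a third articulation: the McGurk-like outcome


def influence_table(trials, groups=("violin_viola", "non_string", "control")):
    """Count visually influenced vs. audio-faithful responses on mismatched trials per group."""
    table = []
    for g in groups:
        counts = Counter(classify_response(t) for t in trials
                         if t["group"] == g and t["condition"] == "mismatched")
        influenced = counts["fusion"] + counts["video"] + counts["other"]
        table.append([influenced, counts["audio"]])
    return table


# Example usage (all_trials is a list of response dictionaries):
# chi2, p, _, _ = chi2_contingency(influence_table(all_trials))
```

Separating "fusion" from "video" responses, as the classifier above does, would also allow a McGurk-like fusion percept to be distinguished from simple visual capture when interpreting which of the scenarios described above has occurred.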

Conclusion

This experiment is unlikely to conclusively settle the issue of whether embodiment plays a role in music cognition. Though the McGurk effect is commonly cited as evidence for embodied speech processing as described by the motor theory of speech perception, this interpretation is not universally accepted, and there is no reason to believe that such an interpretation would be universally accepted in the case of a musical McGurk effect, either. Nevertheless, the discovery of a McGurk-like effect in a musical domain would be some of the strongest evidence yet for a role for embodiment in music perception. In addition, depending on the prevalence of the effect among certain groups, a musical McGurk effect has the potential to inform the interpretation of the traditional phonetic McGurk effect. The existence of a musical McGurk effect would also have significant practical implications. It may be an issue during live performances for audience members far from the performers, and may also affect the perception of mimed music making in film and television. In addition, the musical McGurk effect could be used as a technique in the composition of interactive music. The combination of theoretical and practical importance to both music cognition and cognitive linguistics gives this experiment the potential to be extremely rewarding.

References

Abelin, Å. (2008). Seeing glee but hearing fear? Emotional McGurk effect in Swedish. Proceedings of Speech Prosody 2008, May 6-9.

Broughton, M., & Stevens, C. (2008). Music, movement and marimba: An investigation of the role of movement and gesture in communicating musical expression to an audience. Psychology of Music.

Brown, S., Martinez, M. J., & Parsons, L. M. (2006). Music and language side by side in the brain: a PET study of the generation of melodies and sentences. European journal of neuroscience, 23(10), 2791-2803.

Burnham, D., & Dodd, B. (2004). Auditory–visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental psychobiology, 45(4), 204-220.

Davidson, J. W. (1993). Visual perception of performance manner in the movements of solo musicians. Psychology of Music, 21(2), 103-113.

Deutsch, D., Henthorn, T., & Lapidis, R. (2011). Illusory transformation from speech to song. The Journal of the Acoustical Society of America, 129(4), 2245-2252.

Fagel, S. (2006, May). Emotional McGurk effect. In Proceedings of the International Conference on Speech Prosody (Vol. 1).

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & psychophysics, 50(6), 524-536.

Hayashi, Y., & Sekiyama, K. (1998). Native-foreign language effect in the McGurk effect: A test with Chinese and Japanese. In AVSP'98 International Conference on Auditory-Visual Speech Processing.

Jeong, J. W., Diwadkar, V. A., Chugani, C. D., Sinsoongsud, P., Muzik, O., Behen, M. E., & Chugani, D. C. (2011). Congruence of happy and sad emotion in music and faces modifies cortical audiovisual activation. NeuroImage, 54(4), 2973-2982.

Leman, M., Moelants, D., Varewyck, M., Styns, F., van Noorden, L., & Martens, J. P. (2013). Activating and relaxing music entrains the speed of synchronized walking. PloS one, 8(7), e67932.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1-36.

London, J. (2011). Tactus≠Tempo: Some Dissociations Between Attentional Focus, Motor Behavior, and Tempo Judgment. Empirical Musicology Review, 6(1).

Manuel, S. Y., Repp, B. H., Studdert-Kennedy, M., & Liberman, A. M. (1983). Exploring the “McGurk effect”. The Journal of the Acoustical Society of America, 74(S1), S66-S66.

Massaro DW (1987). Speech perception by ear and eye: a paradigm for psychological inquiry. Hillsdale, NJ: Lawrence Erlbaum Associates.

Massaro, D. W., & Chen, T. H. (2008). The motor theory of speech perception revisited. Psychonomic bulletin & review, 15(2), 453-457.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Patel, A. D. (2003). Language, music, syntax and the brain. Nature neuroscience, 6(7), 674-681.

Quinto, L., Thompson, W. F., Russo, F. A., & Trehub, S. E. (2010). A comparison of the McGurk effect for spoken and sung syllables. Attention, Perception, & Psychophysics, 72(6), 1450-1454.

Racette, A., Bard, C., & Peretz, I. (2006). Making non-fluent aphasics speak: sing along!. Brain, 129(10), 2571-2584.

Riecker, A., Ackermann, H., Wildgruber, D., Dogil, G., & Grodd, W. (2000). Opposite hemispheric lateralization effects during speaking and singing at motor cortex, insula and cerebellum. Neuroreport, 11(9), 1997-2000.

Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59(3), 347-357.

Rosenblum, L. D. (2010). See what I'm saying: The extraordinary powers of our five senses. New York, NY: W. W. Norton & Company Inc.

Schutz, M., & Lipscomb, S. (2007). Hearing gestures, seeing music: Vision influences perceived tone duration. Perception, 36(6), 888.

Sekiyama, K., & Tohkura, Y. I. (1991). McGurk effect in non-English listeners: Few visual effects for Japanese subjects hearing Japanese syllables of high auditory intelligibility. The Journal of the Acoustical Society of America, 90(4), 1797-1805.

Sekiyama, K. (1997). Cultural and linguistic factors in audiovisual speech processing: The McGurk effect in Chinese subjects. Perception & Psychophysics, 59(1), 73-80.

Sekiyama, K., & Burnham, D. (2008). Impact of language on development of auditory-visual speech perception. Developmental Science, 11(2), 306-320.

Styns, F., van Noorden, L., Moelants, D., & Leman, M. (2007). Walking on music. Human movement science, 26(5), 769-785.

Toiviainen, P., Luck, G., & Thompson, M. R. (2010). Embodied meter: hierarchical eigenmodes in music-induced movement. Music Perception, 28(1), 59-70.

Tsay, C. J. (2013). Sight over sound in the judgment of music performance. Proceedings of the National Academy of Sciences, 110(36), 14580-14585.

Van den Stock, J., Peretz, I., Grezes, J., & de Gelder, B. (2009). Instrumental music influences recognition of emotional body language. Brain Topography, 21(3-4), 216-220.

Vines, B. W., Krumhansl, C. L., Wanderley, M. M., Dalca, I. M., & Levitin, D. J. (2011). Music to my eyes: Cross-modal interactions in the perception of emotions in musical performance. Cognition, 118(2), 157-170.

Watkins, K. E., Strafella, A. P., & Paus, T. (2003). Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia, 41(8), 989-994.