
Beat gestures influence which speech sounds you hear

Hans Rutger Bosker ab 1 and David Peeters ac

a Max Planck Institute for Psycholinguistics, PO Box 310, 6500 AH, Nijmegen, The Netherlands

b Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, the Netherlands

c Tilburg University, Department of Communication and Cognition, TiCC, Tilburg, the Netherlands

Running title: gestures influence speech

Word count: 11080 (including all words)

Figure count: 2

1 Corresponding author. Tel.: +31 (0)24 3521 373. E-mail address: [email protected]

ABSTRACT

Beat gestures – spontaneously produced biphasic movements of the hand – are among the most

frequently encountered co-speech gestures in human communication. They are closely temporally

aligned to the prosodic characteristics of the speech signal, typically occurring on lexically stressed

syllables. Despite their prevalence across speakers of the world’s languages, how beat gestures

impact spoken word recognition is unclear. Can these simple ‘flicks of the hand’ influence speech

perception? Across six experiments, we demonstrate that beat gestures influence the explicit and

implicit perception of lexical stress (e.g., distinguishing OBject from obJECT), and in turn, can

influence what vowels listeners hear. Thus, we provide converging evidence for a manual McGurk

effect: even the simplest ‘flicks of the hands’ influence which speech sounds we hear.

Keywords: beat gestures; lexical stress; vowel length; manual McGurk effect.

SIGNIFICANCE STATEMENT

Beat gestures are very common in human face-to-face communication. Yet we know little about

their behavioral consequences for spoken language comprehension. We demonstrate that beat

gestures influence the explicit and implicit perception of lexical stress, and, in turn, can even shape

what vowels we think we hear. This demonstration of a manual McGurk effect provides some of

the first empirical support for a recent multimodal, situated psycholinguistic framework of human

communication, while challenging current models of spoken word recognition that do not yet

incorporate multimodal prosody. Moreover, it has the potential to enrich human-computer

interaction and improve multimodal speech recognition systems.

INTRODUCTION

The human capacity to communicate evolved primarily to allow for efficient face-to-face

interaction 1. As humans, we are indeed capable of translating our thoughts into communicative

messages, which in everyday situations often comprise concurrent auditory (speech) and visual (lip

movements, facial expressions, hand gestures, body posture) information 2–4. Speakers thus make

use of different channels (mouth, eyes, hands, torso) to get their messages across, while addressees

have to quickly and efficiently integrate or bind the different ephemeral signals they concomitantly

perceive 2,5,6. This process is skilfully guided by top-down predictions 7–9 that exploit a life-time of

communicative experience, acquired world knowledge, and personal common ground built up with

the speaker 10–12.

The classic McGurk effect 13 is an influential – though not entirely uncontroversial 14–16 –

illustration of how visual input influences the perception of speech sounds. In a seminal study 13,

participants repeatedly listened to the sound /ba/. At the same time, they watched a video of a

speaker’s face whose mouth movements corresponded to the sound /ga/. When asked

what they heard, a majority of participants reported actually perceiving the sound /da/. This robust

finding indicates that, during language comprehension, our brains take both verbal (here: auditory)

and non-verbal (here: visual) aspects of communication into account 17,18, providing the addressee

with a best guess of what exact message a speaker intends to convey. We thus listen not only with

our ears, but also with our eyes.

Visual aspects of everyday human communication are not restricted to subtle mouth or lip

movements. In close coordination with speech, the hand gestures we spontaneously and

idiosyncratically make during conversations help us express our thoughts, emotions, and intentions

19–21. We manually point at things to guide the attention of our addressee to relevant aspects of our

immediate environment 22–25, enrich our speech with iconic gestures that in handshape and

movement resemble what we are talking about 26–28, and produce simple flicks of the hand aligned

to the acoustic prominence of our spoken utterance to highlight relevant points in speech 29–34. Over

the past decades, it has become clear that such co-speech manual gestures are semantically aligned

and temporally synchronized with the speech we produce 35–40. It is an open question, however,

whether the temporal synchrony between these hand gestures and the speech signal can actually

influence which speech sounds we hear.

In this study, we test for the presence of such a potential manual McGurk effect. Do hand

gestures, like observed lip movements, influence what exact speech sounds we perceive? We here

focus on manual beat gestures – commonly defined as biphasic (e.g., up and down) movements of

the hand that speakers spontaneously produce to highlight prominent aspects of the concurrent

speech signal 21,31,41. Beat gestures are amongst the most frequently encountered gestures in

naturally occurring human communication 21, as an accompanying beat gesture enhances the

perceived prominence of a word 31,42,43. Not surprisingly, they

hence also feature prominently in politicians’ public speeches 44,45. By highlighting concurrent

parts of speech and enhancing its processing 46–49, beat gestures may even increase memory recall

of words in both adults 50 and children 51–53. As such, they form a fundamental part of our

communicative repertoire and also have direct practical relevance.

In contrast to iconic gestures, beat gestures are not related to the spoken message on a semantic

level, but only on a temporal level: their apex is aligned to vowel onset in lexically stressed

syllables carrying prosodic emphasis 20,21,34,54. Neurobiological studies suggest that listeners may

tune oscillatory activity in the alpha and theta bands upon observing the initiation of a beat gesture

to anticipate processing an assumedly important upcoming word 44. Bilateral temporal areas of the

brain may in parallel be activated for efficient integration of beat gestures and concurrent speech

48. However, evidence for behavioral effects of temporal gestural alignment on speech perception

remains scarce. Given the close temporal alignment between beat gestures and lexically stressed

syllables in speech production, ideal observers 55 could in principle use beat gestures as a cue to

lexical stress in perception.

Lexical stress is indeed a critical speech cue in spoken word recognition. It is well established

that listeners commonly use suprasegmental spoken cues to lexical stress, such as greater

amplitude, higher fundamental frequency (F0), and longer syllable duration, to constrain online

lexical access, speeding up word recognition, and increasing perceptual accuracy 56,57. Listeners

also use visual articulatory cues on the face to perceive lexical stress when the auditory signal is

muted 58–60. Nevertheless, whether non-facial visual cues, such as beat gestures, also aid in lexical

stress perception, and might even do so in the presence of auditory cues to lexical stress – as in

everyday conversations – is unknown. These questions are relevant for our understanding of face-

to-face interaction, where multimodal cues to lexical stress might considerably increase the

robustness of spoken communication, for instance in noisy listening conditions 61, and even when

the auditory signal is clear 62.

Indeed, a recent theoretical account proposed that multimodal low-level integration must be a

general cognitive principle fundamental to our human communicative capacities 2. However,

evidence showing that the integration of information from two separate communicative modalities

(speech, gesture) modulates the mere perception of information derived from one of these two

modalities (e.g., speech) is lacking. For instance, earlier studies have failed to find effects of

simulated pointing gestures on lexical stress perception, leading to the idea that manual gestures

provide information only about sentential emphasis, not about relative emphasis placed on

individual syllables within words 63. As a result, there is no formal account or model of how

audiovisual prosody might help spoken word recognition 64–66. Hence, demonstrating a manual

McGurk effect would imply that existing models of speech perception need to be fundamentally

expanded to account for how multimodal cues influence auditory word recognition.

Below, we report the results of four experiments (and two direct replications: Experiments S1-

S2) in which we tested for the existence and robustness of a manual McGurk effect. The

experiments each approach the theoretical issue at hand from a different methodological angle,

varying the degree of explicitness of the participants’ experimental task and their response

modality (perception vs. production). Experiment 1 uses an explicit task to test whether observing

a beat gesture influences the perception of lexical stress in disyllabic spoken stimuli. Using the

same stimuli, Experiment 2 implicitly investigates whether seeing a concurrent beat gesture

influences how participants vocally shadow (i.e., repeat back) an audiovisual speaker’s production

of disyllabic stimuli. Experiment 3 then uses those shadowed productions as stimuli in a perception

experiment to implicitly test whether naïve listeners are biased towards perceiving lexical stress

on the first syllable if the shadowed production was produced in response to an audiovisual

stimulus with a beat gesture on the first syllable. Finally, Experiment 4 tests whether beat gestures

may influence what vowels (long vs. short) listeners perceive. The perceived prominence of a

syllable influences the expected duration of that syllable’s vowel nucleus (i.e., longer if stressed) 67. Thus,

if listeners use beat gestures as a cue to lexical stress, they may perceive a vowel midway between

Dutch short /ɑ/ and long /a:/ as /ɑ/ when presented with a concurrent beat gesture, but as /a:/ without

it.

For all experiments, we hypothesized that, if the manual McGurk effect exists, we should see

effects of the presence and timing of beat gestures on the perception and production of lexical

stress (Experiments 1-3) and, in turn, on vowel perception (Experiment 4). In short, we found

evidence in favor of a manual McGurk effect in all four experiments and in both direct replications:

the perception of lexical stress and vowel identity is consistently influenced by the temporal

alignment of beat gestures to the speech signal.

RESULTS

Experiment 1 tested whether beat gestures influence how listeners explicitly categorize

disyllabic spoken stimuli as having initial stress or final stress (e.g., WAsol vs. waSOL; capitals

indicate lexical stress). Native speakers of Dutch were presented with an audiovisual speaker

(facial features masked in the actual experiment to avoid biasing effects of articulatory cues)

producing non-existing pseudowords in a sentence context: Nu zeg ik het woord... [target] “Now

say I the word... [target]”. Pseudowords were manipulated along a 7-step lexical stress continuum,

varying F0 in opposite directions for the two syllables, thus ranging from a ‘strong-weak’ (step 1)

to a ‘weak-strong’ prosodic pattern (step 7). Critically, the audiovisual speaker always produced a

beat gesture whose apex was manipulated to be either aligned to the onset of the first vowel or the

second vowel by varying the silent interval preceding the pseudoword (cf. Figure 1A). The task

involved two-alternative forced choice (2AFC) between the pseudoword having stress on the first

or second syllable. For comparison, in another block, the same participants also categorized audio-

only versions of the stimuli (block order counter-balanced). Outcomes of a Generalized Linear

Mixed Model with a logistic linking function (GLMM) 68 demonstrated that participants, in the

audiovisual block, were significantly more likely to report stress on the first syllable if the

beat gesture was aligned to the onset of the first vowel (main effect of Beat Condition: β = 1.142

in logit space, p < 0.001), as shown by the difference between the blue and orange solid lines in

Figure 1B (mean difference in proportions between the two Beat Conditions: 0.20). No such effect

was observed in the audio-only condition (interaction Beat Condition x Modality: β = -0.813, p <

0.001). These findings demonstrate that the temporal alignment of beat gestures to the speech

signal influences the explicit perception of lexical stress.
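
Because these GLMM estimates are reported in logit space, their size can be hard to read off directly. The short R sketch below is purely illustrative: only the β for Beat Condition is taken from the text, and the proportion conversion assumes deviation coding and a hypothetical intercept of 0 (i.e., a fully ambiguous stimulus), not the fitted intercept.

```r
# Illustrative back-transformation of the reported logit-space effect of Beat Condition.
beta_beat <- 1.142            # main effect in the audiovisual block (from the text)

# As an odds ratio: the odds of a "stress on the first syllable" response are roughly
# three times higher when the beat gesture is aligned to the first vowel.
exp(beta_beat)                # ~3.13

# As a difference in predicted proportions, assuming deviation coding (+/-0.5) and a
# hypothetical intercept of 0 (a perfectly ambiguous stimulus; not the fitted value).
plogis(0.5 * beta_beat) - plogis(-0.5 * beta_beat)   # ~0.28
```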

Figure 1. Example audiovisual stimulus and results from Experiments 1, 3, and 4. A. In Experiments

1, 2, and 4, participants were presented with an audiovisual speaker (facial features masked in the actual

experiment) producing the Dutch carrier sentence Nu zeg ik het woord... “Now say I the word...” followed

by a disyllabic target pseudoword (e.g., bagpif). The speaker produced three beat gestures, with the last

beat gesture’s apex falling in the target time window. Two auditory conditions were created by aligning the

onset of either the first (top waveform in blue) or the second vowel of the target pseudoword (bottom

waveform in orange) to the final gesture’s apex (gray vertical line). B. Participants in Experiment 1

categorized audiovisual pseudowords as having lexical stress on either the first or the second syllable. Each

target pseudoword was sampled from a 7-step continuum (on x-axis) varying F0 in opposite directions for

the two syllables. In the audiovisual block (solid lines), participants were significantly more likely to report

perceiving lexical stress on the first syllable if the speaker produced a beat gesture on the first syllable (in

blue) vs. second syllable (in orange). No such difference was observed in the audio-only control block

(transparent lines). C. Participants in Experiment 3 categorized the audio-only shadowed productions,

recorded in Experiment 2, as having lexical stress on either the first or the second syllable. If the audiovisual

stimulus in Experiment 2 involved a pseudoword with a beat gesture on the first syllable, the elicited

shadowed production was more likely to be perceived as having lexical stress on the first syllable (in blue).

Conversely, if the audiovisual stimulus in Experiment 2 involved a pseudoword with a beat gesture on the

second syllable, the elicited shadowed production was more likely to be perceived as having lexical stress

on the second syllable (in orange). D. Participants in Experiment 4 categorized audiovisual pseudowords

as containing either short /ɑ/ or long /a:/ (e.g., bagpif vs. baagpif). Each target pseudoword contained a first

vowel that was sampled from a 5-step F2 continuum from long /a:/ to short /ɑ/ (on x-axis). Moreover,

prosodic cues to lexical stress (F0, amplitude, syllable duration) were set to ambiguous values. If the

audiovisual speaker produced a beat gesture on the first syllable (in blue), participants were biased to

perceive the first syllable as stressed, making the ambiguous first vowel relatively short for a stressed

syllable, leading to a lower proportion of long /a:/ responses. Conversely, if the audiovisual speaker

produced a beat gesture on the second syllable (in orange), participants were biased to perceive the initial

syllable as unstressed, making the ambiguous first vowel relatively long for an unstressed syllable, leading

to a higher proportion of long /a:/ responses. For all panels, error bars enclose 1.96 x SE on either side; that

is, the 95% confidence intervals.

However, the task in Experiment 1, asking participants to categorize the audiovisual stimuli as

having stress on either the first or second syllable, is very explicit and presumably susceptible to

response strategies. Therefore, Experiment 2 tested lexical stress perception implicitly and in

another response modality by asking participants to overtly shadow the audiovisual pseudowords

from Experiment 1. To reduce the load on the participants, we selected only three relevant steps

from the lexical stress continua: two relatively clear steps (1 and 7) and one relatively ambiguous

step (step 4). Critically, lexical stress was never mentioned in any part of the instructions. We

hypothesized that, if participants saw the audiovisual speaker produce a beat gesture on the second

syllable, the suprasegmental characteristics of their shadowed productions would resemble a

‘weak-strong’ prosodic pattern (i.e., greater amplitude, higher F0, longer duration on the second

syllable). Acoustic analysis of the amplitude, F0, and duration measurements of the shadowed

productions by three separate Linear Mixed Models (one for each acoustic cue) showed small but

significant effects of Beat Condition, visualized in Figure 2. That is, if the beat gesture was aligned

to the second syllable, participants presumably implicitly perceived lexical stress on the second

syllable, and therefore their own shadowed second syllables were produced with greater amplitude

and a longer duration, while their first syllables were shorter.

Figure 2. Predicted estimates from the combined models of Experiment 2. In Experiment 2,

participants shadowed the audiovisual sentence-final pseudowords in either Beat Condition. The figure

shows stylized waveforms for the two syllables of the shadowed productions, on either side of the zero

point on the x-axis, showing their durations (ms), aligned around the onset of the second syllable (at 0).

The bold horizontal error bars at the extremes of the stylized waveforms indicate duration SE. The vertical

position of the (black rectangles surrounding the) stylized waveforms shows the produced F0 (in Hz on y-

axis; height of rectangles indicates F0 SE on either side of the midpoint). The color gradient and height of

the stylized waveforms indicates average amplitude (dB). The overall differences between first and second

syllables show pronounced word-final effects: the last syllable of a word typically has a longer duration,

and a lower F0 and amplitude. Critically, after audiovisual trials with a beat gesture on the second syllable

(lower panel; compared to a beat gesture on the first syllable in the upper panel), shadowers produced their

first syllables with a shorter duration and their second syllables with a longer duration and higher amplitude.

No effect of F0 was observed.

Note that the effects of beat gestures in Experiment 2 were robust but small. Furthermore, no

effect of Beat Condition was found on the F0 of the shadowed productions, the primary cue to

stress in Dutch 69. Therefore, Experiment 3 tested the perceptual relevance of the acoustic

adjustments participants made to their shadowed productions in response to the two beat gesture

conditions. Each participant listened to all shadowed productions from two shadowers from

Experiment 2 and indicated whether they perceived the shadowed productions to have stress on

the first or second syllable (2AFC). Critically, this involved audio-only stimuli and the participants

in Experiment 3 were unaware of the beat gesture manipulation in Experiment 2. Nevertheless,

outcomes of a GLMM on the binomial categorization data demonstrated that participants were

significantly biased to report hearing stress on the first syllable of the audio-only shadowed

productions if the audiovisual stimulus that elicited those shadowed productions had a beat gesture

on the first syllable (main effect of Beat Condition: β = 0.232 in logit space, p = 0.003). This is

indicated by the difference between the blue and orange lines in Figure 1C (mean difference in

proportions between the two Beat Conditions: 0.05). Moreover, an identical replication of this

experiment, Experiment S1 (reported in the Supplementary Material), yielded similar findings

using the same stimuli in a new participant sample. These findings suggest that the acoustic

adjustments in response to the different beat gesture alignments in Experiment 2 were perceptually

salient enough for new participants to notice them and use them as cues to lexical stress

identification. Thus, it provides supportive evidence that beat gestures influence implicit lexical

stress perception.

Finally, Experiment 4 tested the strongest form of a manual McGurk effect: can beat gestures

influence the perceived identity of individual speech sounds? The rationale behind Experiment 4

is grounded in the observation that listeners take a syllable’s perceived prominence into account

when estimating its vowel’s duration 67,70. Vowels in stressed syllables typically have a longer

duration than vowels in unstressed syllables. Hence, in perception, listeners expect a vowel in a

stressed syllable to have a longer duration. That is, one and the same vowel duration can be

perceived as relatively short in one (stressed) syllable, but as relatively long in another (unstressed

syllable). In languages where vowel duration is a critical cue to phonemic contrasts (e.g., to coda

stop voicing in English, e.g., “coat - code” 70; to vowel length contrasts in Dutch, e.g., tak /tɑk/

“branch” - taak /ta:k/ “task” 71), these expectations can influence the perceived identity of speech

sounds.

Participants in Experiment 4 were presented with audiovisual stimuli, much like those used in

Experiments 1-2. However, we used a new set of disyllabic pseudowords whose initial vowel was

manipulated to be ambiguous between Dutch /ɑ-a:/ in its duration while varying the second

formant frequency (F2) along 5-step continua; F2 is a secondary cue to the /ɑ-a:/ contrast in

Dutch 71 but not a cue to lexical stress. Moreover, each pseudoword token was manipulated to be

ambiguous with respect to its suprasegmental cues to lexical stress (set to average values of F0,

amplitude, and duration). Taken together, these manipulations resulted in pseudowords that were

ambiguous in lexical stress and vowel duration, but varied in F2 as cue to vowel identity.

Participants indicated whether they heard an /ɑ/ or /a:/ as initial vowel of the pseudoword (2AFC;

e.g., bagpif or baagpif?); lexical stress was never mentioned in any part of the instructions.

Outcomes of a GLMM on the binomial vowel categorization data demonstrated an effect of Beat

Condition on participants’ vowel perception (β = -0.240 in logit space, p = 0.023). That is, if the

beat gesture was aligned to the first syllable (blue line in Figure 1D), participants were more likely

to perceive the first syllable as stressed, rendering the ambiguous vowel duration in the first

syllable as relatively short for a stressed syllable, lowering the proportion of long /a:/ responses. bioRxiv preprint doi: https://doi.org/10.1101/2020.07.13.200543; this version posted July 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Bosker & Peeters 15

Conversely, if the beat gesture was aligned to the second syllable (orange line in Figure 1D),

participants were more likely to perceive the first syllable as unstressed, making the ambiguous

vowel duration in the first syllable relatively long for an unstressed syllable, leading to a small

increase in the proportion of long /a:/ responses (mean difference in proportions between the two

Beat Conditions: 0.04). This finding, corroborated by similar observations in an identical

replication of this experiment, Experiment S2 (see Supplementary Material), showcases the

strongest form of the manual McGurk effect: beat gestures influence which speech sounds we hear.

DISCUSSION

To what extent do we listen with our eyes? The classic McGurk effect illustrates that listeners

take both visual (lip movements) and auditory (speech) information into account to arrive at a best

possible approximation of the exact message a speaker intends to convey 13. Here we provide

evidence in favor of a manual McGurk effect: the beat gestures we see influence what exact speech

sounds we hear. Specifically, we observed in four experiments that beat gestures influence both

the explicit and implicit perception of lexical stress, and in turn may influence the perception of

individual speech sounds. To our knowledge, this is the first demonstration of behavioral

consequences of beat gestures on low-level speech perception. These findings support the idea that

everyday language comprehension is primordially a multimodal undertaking in which the multiple

streams of information that are concomitantly perceived through the different senses are quickly

and efficiently integrated by the listener to make a well-informed best guess of what specific

message a speaker is intending to communicate.

The four experiments we presented made use of a combination of implicit and explicit

paradigms. Not surprisingly, the explicit nature of the task used in Experiment 1 led to strong

effects of the timing of a perceived beat gesture on the lexical stress pattern participants reported

to hear. Although the implicit effects in subsequent Experiments 2-4 were smaller, conceivably

partially due to suboptimal control over the experimental settings in online testing, they were

nevertheless robust and replicable (cf. Experiments S1-S2 in Supplementary Material). Moreover,

these beat gesture effects even surfaced when the auditory cues were relatively clear (i.e., at the

extreme ends of the acoustic continua), suggesting that these effects also prevail in naturalistic

face-to-face communication outside the lab. In fact, the naturalness of the audiovisual

stimuli would seem key to the effect, since the temporal alignment of artificial hand movements

(i.e., an animated line drawing of a disembodied hand) does not influence lexical stress perception

(see 63). Perhaps the present study even underestimates the behavioral influence of beat gestures on

speech perception in everyday conversations, since all auditory stimuli in the experiments involved

‘clean’ (i.e., noise-free) speech. However, in real life situations, the speech of an attended talker

oftentimes occurs in the presence of other auditory signals, such as background music, competing

speech, or environmental noise; in such listening conditions, the contribution of visual beat

gestures to successful communication would conceivably be even greater.

The outcomes of the present study challenge current models of (multimodal) word recognition,

which would need to be extended to account for how prosody, specifically multimodal prosody,

supports and modulates word recognition. Audiovisual prosody is implemented in the fuzzy logical

model of perception 72, but how this information affects lexical processing is underspecified.

Moreover, Experiment 4 implies that the multimodal suprasegmental and segmental ‘analyzers’

would have to allow for interaction, since one can influence the other. Therefore, existing models

of audiovisual speech perception and word recognition need to be expanded to account for (1)

how suprasegmental prosodic information is processed 73; (2) how suprasegmental information

interacts with and modulates segmental information 74; and (3) how multimodal cues influence

auditory word recognition 17,58.

A recent, broader theoretical account of human communication (i.e., beyond speech alone)

proposed that two domain-general mechanisms form the core of our ability to understand language:

multimodal low-level integration and top-down prediction 2. The present results are indeed in line

with the idea that multimodal low-level integration is a general cognitive principle supporting

language understanding. More precisely, we here show that this mechanism operates across

communicative modalities. The cognitive and perceptual apparatus that we have at our disposal

apparently not just binds auditory and visual information that derives from the same communicative

channel (i.e., articulation), but also combines visual information derived from the hands with

concurrent auditory information – crucially to such an extent that what we hear is modulated by

the timing of the beat gestures we see. As such, what we perceive is the model of reality that our

brains provide us by binding visual and auditory communicative input, and not reality itself.

In addition to multimodal low-level integration, earlier research has indicated that top-down

predictions also play a pivotal role in the perception of co-speech beat gestures. It seems reasonable to

assume that listeners throughout life build up the implicit knowledge that beat gestures are

commonly tightly paired with acoustically prominent verbal information. As the initiation of a beat

gesture typically slightly precedes the prominent acoustic speech feature 40, these statistical

regularities are then indeed used in a top-down manner to allocate additional attentional resources

to an upcoming word 44.

The current first demonstration of the influence of simple ‘flicks of the hands’ on low-level

speech perception should be considered an initial step in our understanding of how bodily gestures

in general may affect speech perception. Future research may investigate whether the temporal

alignment of types of non-manual gestures, such as head nods and eyebrow movements, also

impact what speech sounds we hear. Moreover, other behavioral consequences beyond lexical

stress perception could be implicated, ranging from phase-resetting neural oscillations 44 with

consequences for speech intelligibility and identification 75,76, to facilitating selective

attention in ‘cocktail party’ listening 77–79, word segmentation 80, and word learning 81.

Finally, the current findings have several practical implications. Speech recognition systems

may be improved when including multimodal signals, even beyond articulatory cues on the face.

Beat gestures are presumably particularly valuable multimodal cues because, while cross-language

differences exist in the use and weighting of acoustic suprasegmental prosody (e.g., amplitude, F0, duration)

82–84, beat gestures could function as a language-universal prosodic cue. Also, the field

of human-computer interaction could benefit from the finding that beat gestures support spoken

communication, for instance by enriching virtual agents and avatars with human-like gesture-to-

speech alignment. The use of such agents in immersive virtual reality environments will in turn

allow for taking the study of language comprehension into naturalistic, visually rich and dynamic

settings while retaining the required degree of experimental control 85. By acknowledging that

spoken language comprehension involves the integration of auditory sounds, visual articulatory

cues, and even the simplest of manual gestures, it will be possible to significantly further our

understanding of the human capacity to communicate.

MATERIALS AND METHODS

Participants

Native speakers of Dutch were recruited from the Max Planck Institute’s participant pool

(Experiment 1: N = 20; 14 females (f), 6 males (m); mean age = 25, range = 20-34; Experiment 2: N =

26; 22f, 4m; mean age = 23, range = 19-30; Experiment 3: N = 26; 25f, 1m; mean age = 24, range = 20-28;

Experiment 4 was performed by the same individuals as those in Experiment 3). All gave informed

consent as approved by the Ethics Committee of the Social Sciences department of Radboud

University (project code: ECSW2014-1003-196). All research was performed in accordance with

relevant guidelines and regulations. Participants had normal hearing, had no speech or language

disorders, and took part in only one of our experiments.

Materials

Experiments 1 & 2

For our auditory target stimuli, we used 12 disyllabic pseudowords that are phonotactically legal

in Dutch (e.g., wasol /ʋa.sɔl/; see Table S1 in Supplementary Material). In Dutch, lexical stress is

cued by three suprasegmental cues: amplitude, fundamental frequency (F0), and duration 69. Since

F0 is a primary cue to lexical stress in perception, we used 7-step lexical stress continua

– adopted from an earlier student project 86 – varying in mean F0 while keeping amplitude and

duration cues constant, thus ranging from a ‘strong-weak’ prosodic pattern (SW; step 1) to a ‘weak-

strong’ prosodic pattern (WS; step 7).

These continua had been created by recording an SW (i.e., stress on initial syllable) and a WS

version (i.e., stress on last syllable) of each pseudoword from the first author, a male native speaker

of Dutch. We first measured the average duration and amplitude values, separately for the first and

second syllable, across the stressed and unstressed versions: mean duration syllable 1 = 202 ms;

syllable 2 = 395 ms; mean amplitude syllable 1 = 68.1 dB; syllable 2 = 64 dB. Then, using Praat

87, we set the duration and amplitude values of the two syllables of each pseudoword to these

ambiguous values. Subsequently, F0 was manipulated along a 7-step continuum, with the two

extremes and the step size informed by the talker’s originally produced F0. Manipulations were

always performed in an inverse manner for the two syllables of each pseudoword: while the mean

F0 of the first syllable decreased along the continuum (from 159.2 to 110.6 Hz in steps of 8.1 Hz),

the mean F0 of the second syllable increased (from 95.7 to 141.3 Hz in steps of 7.6 Hz). Moreover,

rather than setting the F0 within each syllable to a fixed value, more natural output was created by

including a fixed F0 declination within the first syllable (linear decrease of 23.5 Hz) and second

syllable (38.2 Hz). Pilot data from a categorization task on these manipulated stimuli showed that

these manipulations resulted in 7-step F0 continua that appropriately sampled the SW-to-WS

perceptual space (i.e., from 0.91 proportion SW responses for step 1 to 0.16 proportion SW

responses for step 7).
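
The continuum values given above follow a simple linear scheme, reproduced in the R sketch below (R being the language used for the analyses). The within-syllable declination helper is an assumption about implementation detail, namely that the linear fall is centred on each step's mean F0; it is included only to illustrate the idea.

```r
# Reconstructing the 7-step F0 continuum values reported in the text.
steps    <- 1:7
f0_syll1 <- 159.2 - (steps - 1) * 8.1   # first syllable falls: 159.2 ... 110.6 Hz
f0_syll2 <-  95.7 + (steps - 1) * 7.6   # second syllable rises: 95.7 ... 141.3 Hz

# Illustrative within-syllable declination (linear fall of 23.5 Hz in syllable 1,
# 38.2 Hz in syllable 2), here centred on each step's mean F0 -- an assumption.
f0_track <- function(mean_f0, fall_hz, n_points = 100) {
  seq(mean_f0 + fall_hz / 2, mean_f0 - fall_hz / 2, length.out = n_points)
}
track_step4_syll1 <- f0_track(f0_syll1[4], 23.5)
```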

To create audiovisual stimuli, the first author was video-recorded using a Canon XF205 camera

(50 frames per second; resolution: 1280 by 720 pixels) with an external Sennheiser ME64

directional microphone (audio sampling frequency: 48 kHz). Recordings were made from knees

up on a neutral background in the Gesture Lab at the Max Planck Institute for Psycholinguistics.

The speaker produced the Dutch carrier sentence Nu zeg ik het woord... kanon “Now say I the

word... cannon”, with lexical stress on the second syllable of kanon. Concurrently, he produced

three beat gestures in this sentence, with their apex aligned to the syllables Nu, woord, and –non.

The manipulated auditory pseudowords described above were combined with this audiovisual

recording using ffmpeg (version 4.0; available from http://ffmpeg.org/) by (1) removing the

original kanon target word, (2) inserting a silent interval, and (3) inserting the manipulated

pseudowords (cf. Figure 1A). By modulating the length of the intervening silent interval, the onset

of either the first or the second vowel of the target pseudoword was aligned to the final beat

gesture’s apex. Furthermore, the facial features of the talker were masked to hide any visual

articulatory cues to lexical stress. Finally, in order to mask the cross-spliced nature of the audio,

which combined recordings with variable room acoustics, the silent interval and pseudoword were mixed

with stationary noise from a silent recording of the Gesture Lab.
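
The alignment step amounts to simple timing arithmetic. The R sketch below is a hypothetical helper (all names and example values are ours, not taken from the stimulus files) showing how the length of the silent interval determines which vowel onset coincides with the final apex.

```r
# Hypothetical illustration of the alignment logic: choose the silence duration that
# places the desired vowel onset exactly at the apex of the final beat gesture.
silence_for_alignment <- function(apex_time, carrier_offset, vowel_onset_in_word) {
  # apex_time:           time of the final apex in the video (s)
  # carrier_offset:      end of the carrier sentence audio (s)
  # vowel_onset_in_word: onset of the target vowel within the pseudoword (s)
  apex_time - carrier_offset - vowel_onset_in_word
}

# Example values (invented): aligning the later (second) vowel onset simply
# requires a shorter preceding silence, so the pseudoword starts earlier.
silence_for_alignment(apex_time = 2.80, carrier_offset = 2.10, vowel_onset_in_word = 0.05)  # 0.65 s
silence_for_alignment(apex_time = 2.80, carrier_offset = 2.10, vowel_onset_in_word = 0.30)  # 0.40 s
```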

Experiment 4

We created 8 new disyllabic pseudowords, with either a short /ɑ/ or a long /a:/ as first vowel

(e.g., bagpif /bɑx.pɪf/ vs. baagpif /ba:x.pɪf/; cf. Table S1). This new set had a fixed syllable

structure (CVC.CVC, only stops and fricatives) to reduce item variance and to facilitate syllable-

level annotations. The same male talker as before was recorded producing these pseudowords in

four versions: with short /ɑ/ and stress on the first syllable (e.g., BAGpif); with short /ɑ/ and stress

on the second syllable (bagPIF); with long /a:/ and stress on the first syllable (BAAGpif); with long

/a:/ and stress on the second syllable (baagPIF).

After manual annotation of syllable onsets and offsets, we measured the values of the

suprasegmental cues to lexical stress (F0, amplitude, and duration) in all four conditions.

Pseudowords with long /a:/ and stress on the first syllable were selected for manipulation. First,

we manipulated the cues to lexical stress by setting these to ambiguous values using PSOLA in

Praat: each first syllable was given the same mean value calculated across all stressed and

unstressed first syllables (mean F0 = 154 Hz; original contour maintained; amplitude = 65.88 dB;

duration = 263 ms), and each second syllable was given the same mean value across all stressed

and unstressed second syllables (mean F0 = 159 Hz; original contour maintained; amplitude =

63.50 dB; duration = 414 ms). This resulted in manipulated pseudowords that were ambiguous in

their prosodic cues to lexical stress. Since this included manipulating duration across all recorded

conditions as well, it also meant that the duration cues to the identity of the first vowel were

ambiguous.

Second, the first /a:/ vowel was extracted and manipulated to form a spectral continuum from

long /a:/ to short /ɑ/. In Dutch, the /ɑ-a:/ vowel contrast is cued by both spectral (lower first (F1)

and second formant (F2) values for /ɑ/) and temporal cues (shorter duration for /ɑ/) 71. We decided to create

F2 continua since F2 is not a cue to lexical stress in Dutch (in fact, in our original recordings F2

values did not differ between stressed and unstressed conditions) while varying F2 does influence

vowel quality perception. These spectral manipulations were based on Burg’s LPC method in

Praat, with the source and filter models estimated automatically from the selected vowel. The

formant values in the filter models were adjusted to result in a constant F1 value (750 Hz,

ambiguous between /ɑ/ and /a:/; value based on the original recordings) and 5 F2 values (step 1 =

1325 Hz, step 5 = 1125 Hz, step size = 50 Hz; values based on the original recordings). Then, the

source and filter models were recombined. Finally, the manipulated vowel tokens were spliced

into the pseudowords. Taken together, these manipulations resulted in pseudoword tokens that

were ambiguous in lexical stress (average values of F0, amplitude, and duration), but varied in F2

as cue to vowel identity.
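
The target values of these manipulations can be summarised compactly. The sketch below merely collects the numbers reported above in R; the actual resynthesis was done with PSOLA and Burg's LPC method in Praat, not in R.

```r
# Target values for the Experiment 4 stimuli, as reported in the text.
# Suprasegmental cues set to ambiguous (average) values per syllable:
ambig_prosody <- data.frame(
  syllable = c(1, 2),
  f0_hz    = c(154, 159),        # original F0 contours maintained
  amp_db   = c(65.88, 63.50),
  dur_ms   = c(263, 414)
)

# Spectral continuum for the first vowel: F1 fixed, F2 in five equal steps.
f1_fixed <- 750                          # Hz, ambiguous between /ɑ/ and /a:/
f2_steps <- seq(1325, 1125, by = -50)    # Hz: step 1 = 1325 ... step 5 = 1125
```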

To create audiovisual stimuli, these manipulated pseudowords were spliced into the audiovisual

stimuli from Experiment 1. Once again, manipulating the silent interval between carrier sentence

offset and target onset resulted in two different alignments of the last beat gesture to either first

vowel onset or second vowel onset. Like in Experiment 1, facial features of the talker were masked,

and silent intervals as well as manipulated pseudowords were mixed with stationary noise to match

the room acoustics across the sentence stimuli.

Procedure

Experiment 1

Participants were tested individually in a sound-conditioning booth. They were seated at a

distance of approximately 60 cm in front of a 50.8 cm by 28.6 cm screen and listened to stimuli at

a comfortable volume through headphones. Stimulus presentation was controlled by Presentation

software (v16.5; Neurobehavioral Systems, Albany, CA, USA).

Participants were presented with two blocks of trials: an audiovisual block, in which the video

and the audio content of the manipulated stimuli were concomitantly presented, and an audio-only

control block, in which only the audio content but no video content was presented. Block order

was counter-balanced across participants. On each trial, the participants’ task was to indicate

whether the sentence-final pseudoword was stressed on the first or the last syllable (2-alternative

forced choice task; 2AFC).

Before the experiment, participants were instructed to always look at the screen during trial

presentations. In the audiovisual block, trials started with the presentation of a fixation cross. After

1000 ms, the audiovisual stimulus was presented (video dimensions: 960 x 864 pixels). At stimulus

offset, a response screen was presented with two response options on either side of the screen, for

instance: WAsol vs. waSOL; positioning of response options was counter-balanced across

participants. Participants were instructed that capitals indicated lexical stress, that they could enter

their response by pressing the “Z” key on a regular keyboard for the left option and the “M” key

for the right option, and that they had to give their response within 4 s after stimulus offset;

otherwise a missing response was recorded. After their response (or at timeout), the screen was

replaced by an empty screen for 500 ms, after which the next trial was initiated automatically. The

trial structure in the audio-only control block was identical to the audiovisual block, except that

instead of the video content a static fixation cross was presented during stimulus presentation.

Participants were presented with 12 pseudowords, sampled from 7-step F0 continua, in 2 beat

gesture conditions (onset of first or second vowel aligned to apex of beat gesture), resulting in 168

unique items in a block. Within a block, item order was randomized. Before the experiment, four

practice trials were presented to participants to familiarize them with the materials and the task.

Participants were given the opportunity to take a short break after the first block.

Experiment 2

Experiment 2 used the same procedure as Experiment 1, except that only audiovisual stimuli

were used. Also, instead of making explicit 2AFC decisions about lexical stress, participants were

instructed to simply repeat back the sentence-final pseudoword after stimulus offset, and to hit the

“Enter” key to move to the next trial. No mention of lexical stress was made in the instructions.

Participants were instructed to always look at the screen during trial presentations.

At the beginning of the experiment, participants were seated behind a table with a computer

screen inside a double-walled acoustically isolated booth. Participants wore a circum-aural

Sennheiser GAME ZERO headset with attached microphone throughout the experiment to ensure

that amplitude measurements were made at a constant level. Participants were presented with 12

pseudowords, sampled from step 1, 4, and 7 of the F0 continua, in 2 beat gesture conditions,

resulting in 72 unique items. Each unique item was presented twice over the course of the

experiment, leading to 144 trials per participant. Item order was random and item repetitions only

occurred after all unique items had been presented. Before the experiment, one practice trial was

presented to participants to familiarize them with the stimuli and the task.

Experiment 3

Experiment 3 was run online using PsyToolkit (version 2.6.1) 88 because of limitations due to the

coronavirus pandemic. Each participant was explicitly instructed to use headphones and to run the

experiment with undivided attention in quiet surroundings. Each participant was presented with all

shadowed productions from two talkers from Experiment 2, leading to 13 different lists, with two

participants assigned to each list. Stimuli from the same talker were grouped to facilitate talker and

speech recognition, and presented in a random order. The task was to indicate after stimulus offset

whether the pseudoword had stress on the first or the second syllable (2AFC). Two response

options were presented, with capitals indicating stress (e.g., WAsol vs. waSOL), and participants

used the mouse to select their response. No practice items were presented and the experiment was

self-timed.

Note that the same participants who performed Experiment 3 also participated in Experiment 4.

To avoid explicit awareness about lexical stress influencing their behavior in Experiment 4, these

participants always first ran Experiment 4, and only then took part in Experiment 3.

Experiment 4

Experiment 4 was run online using Testable (https://testable.org). Each participant was

explicitly instructed to use headphones and to run the experiment with undivided attention in quiet

surroundings. Before the experiment, participants were instructed to always look and listen

carefully to the audiovisual talker and categorize the first vowel of the sentence-final target

pseudoword (2AFC). Specifically, at stimulus offset, a response screen was presented with two

response options on either side of the screen, for instance: bagpif vs. baagpif. Participants entered

their response by pressing the “Z” key on a regular keyboard for the left option and the “M” key

for the right option. The next trial was only presented after a response had been recorded (ITI =

500 ms). No mention of lexical stress was made in the instructions.

Participants were presented with 8 pseudowords, sampled from 5-step F2 continua, in 2 beat

gesture conditions (onset of first or second vowel aligned to apex of beat gesture), resulting in 80

unique items in a block. Each participant received two identical blocks with a unique random order

within blocks.

Statistical analysis

Experiment 1

Trials with missing data (n = 7; 0.1%) were excluded from analysis. Data were statistically

analyzed using a Generalized Linear Mixed Model (GLMM) 68 with a logistic linking function as

implemented in the lme4 library (version 1.0.5) 89 in R 90. The binomial dependent variable was

participants’ categorization of the target as either having lexical stress on the first syllable (SW;

e.g., WAsol; coded as 1) or the second syllable (WS; e.g., waSOL; coded as 0). Fixed effects were

Continuum Step (continuous predictor; scaled and centered around the mean), Beat Condition

(categorical predictor; deviation coding, with beat gesture on the first syllable coded as +0.5 and

beat gesture on the second syllable coded as -0.5), and Block (categorical predictor with the

audiovisual block mapped onto the intercept), and all interactions. The use of deviation coding of

two-level categorical factors (i.e., coded with -0.5 and +0.5) allows us to test main effects of these

predictors, since with this coding the grand mean is mapped onto the intercept. The model also

included Participant and Pseudoword as random factors, with by-participant and by-item random

slopes for all fixed effects, as advocated by Barr et al. 91.
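
In lme4 syntax, the model described above corresponds to something like the sketch below. The data frame and column names (dat, resp_SW, step_c, beat_dev, block, participant, pseudoword) are assumptions for illustration; only the fixed- and random-effects structure follows the text.

```r
# Sketch of the Experiment 1 GLMM (lme4); variable names are illustrative.
library(lme4)

dat$step_c   <- as.numeric(scale(dat$continuum_step))             # scaled and centred
dat$beat_dev <- ifelse(dat$beat == "first_syllable", 0.5, -0.5)   # deviation coding
dat$block    <- relevel(factor(dat$block), ref = "audiovisual")   # audiovisual block on intercept

m_exp1 <- glmer(
  resp_SW ~ step_c * beat_dev * block +
    (1 + step_c * beat_dev * block | participant) +
    (1 + step_c * beat_dev * block | pseudoword),
  data = dat, family = binomial(link = "logit")
)
summary(m_exp1)
```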

The model showed a significant effect of Continuum Step (β = -1.093, SE = 0.284, z = -3.850,

p < 0.001), indicating that higher continuum steps led to lower proportions of SW responses. It

also showed an effect of Beat Condition (β = 1.142, SE = 0.186, z = 6.155, p < 0.001), indicating

that – in line with our hypothesis – listeners were biased towards perceiving lexical stress on the

first syllable if there was a beat gesture on that first syllable. Since the audiovisual block was

mapped onto the intercept, this simple effect of Beat Condition should only be interpreted with

respect to the audiovisual block. In fact, an interaction between Beat Condition and Block was

observed (β = -0.813, SE = 0.115, z = -7.081, p < 0.001), indicating that the effect of Beat

Condition was drastically reduced in the audio-only control block. No other effects or interactions

were statistically significant.

Experiment 2

All shadowed productions were manually evaluated and annotated for syllable onsets and

offsets. We excluded 11 trials (<0.4%) because no speech was produced, and another 232 trials

(6.2%) because participants produced more segments than the original target pseudoword contained (e.g., hearing wasol but repeating back kwasol), which would affect our duration measurements. Thus,

acoustic analyses only involved correctly shadowed productions (n = 2997; 80%) and incorrectly

shadowed productions that nonetheless contained the same number of phonemes (n = 504; 13.5%). Measurements

of mean F0, duration, and amplitude were calculated for each individual syllable using Praat, and

are summarized in Figures S3, S4, and S5 in the Supplementary Material.

Individual syllable measurements of F0 (in Hz), duration (in ms), and amplitude (in dB) were

entered into separate Linear Mixed Models as implemented in the lmerTest library (version 3.1-0)

in R. Statistical significance was assessed using the Satterthwaite approximation for degrees of

freedom 92. The structure of the three models was identical: they included fixed effects of Syllable

(categorical predictor; with the first syllable mapped onto the intercept), Continuum Step

(continuous predictor; scaled and centered around the mean), and Beat Condition (categorical

predictor; with a beat gesture on the first syllable mapped onto the intercept), and all interactions.

The model also included Participant and Pseudoword as random factors, with by-participant and

by-item random slopes for all fixed effects. Model output is given in Table 1 and the combined

predicted estimates of the three models are visualized in Figure 2.

Table 1. Statistical outcomes of the models testing amplitude (dB), F0 (Hz), and duration (ms)

measurements from Experiment 2. Significant effects are highlighted in bold.

                     Amplitude model                    F0 model                           Duration model
                     β      SE    t       p             β       SE    t      p             β       SE     t      p
intercept            71.28  0.89  79.88   <0.001        182.03  6.70  27.16  <0.001        268.29  18.55  14.46  <0.001
Syllable             -4.47  0.62  -7.19   <0.001        -18.95  3.14  -6.03  <0.001        122.00  25.70  4.75   <0.001
Continuum Step       -0.77  0.07  -11.51  <0.001        -7.66   0.84  -9.11  <0.001        0.19    1.07   0.18   0.861
Beat Condition       -0.09  0.10  -0.94   0.352         -0.02   1.16  -0.02  0.984         -7.67   1.54   -4.98  <0.001
Syll * Step          1.88   0.08  23.58   <0.001        18.54   1.01  18.31  <0.001        2.17    1.40   1.55   0.120
Syll * Beat          0.53   0.11  4.65    <0.001        0.31    1.43  0.22   0.830         8.87    1.98   4.49   <0.001
Step * Beat          0.09   0.08  1.10    0.270         0.64    1.00  0.64   0.525         -2.31   1.40   -1.66  0.098
Syll * Step * Beat   -0.13  0.11  -1.19   0.234         -0.91   1.43  -0.64  0.524         1.06    1.98   0.53   0.594
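For concreteness, the duration model in Table 1 could be specified as sketched below; the amplitude and F0 models are identical apart from the dependent variable. The data frame and column names are hypothetical, not the authors' script.

```r
# Sketch of the Experiment 2 duration model (hypothetical column names).
library(lmerTest)  # provides lmer() with Satterthwaite-approximated degrees of freedom

d2$step <- as.numeric(scale(d2$step))                               # continuum step, scaled and centered
d2$syll <- factor(d2$syllable, levels = c("first", "second"))       # first syllable mapped onto the intercept
d2$beat <- factor(d2$beat_syllable, levels = c("first", "second"))  # beat on first syllable mapped onto the intercept

m_dur <- lmer(
  duration_ms ~ syll * step * beat +
    (1 + syll * step * beat | participant) +   # by-participant random slopes for all fixed effects
    (1 + syll * step * beat | pseudoword),     # by-item random slopes for all fixed effects
  data = d2
)
summary(m_dur)  # p-values based on the Satterthwaite approximation
```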

The observed effects of Syllable in all three models demonstrate word-final effects: the last

syllables of words typically have lower amplitude, lower F0, and longer durations (cf. Figure 2).

The effects of Continuum Step confirm that participants adhered to our instructions to carefully

shadow the pseudowords. That is, for higher steps on the F0 continuum (i.e., sounding more WS-

like), the amplitude and F0 of the first syllable (mapped onto the intercept) decreased (see simple

effects of Continuum Step), while amplitude and F0 of the second syllable increased (interactions

Syllable and Continuum Step). Finally, and most critically, the duration of the first syllable

(mapped onto the intercept) decreased when the beat gesture was on the second syllable (simple

effect of Beat Condition; cf. Figure 2). That is, when the beat fell on the second syllable, the first

syllable was more likely to be perceived as unstressed, leading participants to reduce the duration

of the first syllable of their shadowed production. Moreover, when the beat fell on the second

syllable, the second syllable itself had a higher amplitude and a longer duration (interactions

Syllable and Beat Condition; cf. Figure 2). That is, with the beat gesture on the second syllable,

the second syllable was more likely to be perceived as stressed, resulting in louder and longer

second syllables in the shadowed productions.

Experiment 3

As in Experiment 1, the categorization responses (SW = 1; WS = 0) were statistically

analyzed using a GLMM with a logistic linking function. Fixed effects were Continuum Step and

Beat Condition (coding adopted from Experiment 1), and their interaction. The model included

random intercepts for Participant, Talker, and Pseudoword, with by-participant, by-talker, and by-

item random slopes for all fixed effects.
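A minimal sketch of this model, with hypothetical column names, differs from the Experiment 1 sketch only in the additional crossed random factor for Talker (the Experiment 2 participant whose shadowed production is being judged):

```r
# Sketch of the Experiment 3 GLMM (hypothetical column names; not the authors' script).
library(lme4)

d3$step <- as.numeric(scale(d3$step))                # continuum step of the original Experiment 2 stimulus
d3$beat <- ifelse(d3$beat_syllable == 1, 0.5, -0.5)  # deviation coding, as in Experiment 1

m3 <- glmer(
  resp_SW ~ step * beat +
    (1 + step * beat | participant) +   # listener in Experiment 3
    (1 + step * beat | talker) +        # shadower from Experiment 2
    (1 + step * beat | pseudoword),
  data = d3, family = binomial(link = "logit")
)
```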

The model showed a significant effect of Continuum Step (β = -0.985, SE = 0.162, z = -6.061,

p < 0.001), indicating that higher continuum steps led to lower proportions of SW responses.

Critically, it showed a small effect of Beat Condition (β = 0.232, SE = 0.078, z = 2.971, p = 0.003),

indicating that – in line with our hypothesis – listeners were biased towards perceiving lexical

stress on the first syllable if the shadowed production was produced in response to an audiovisual

pseudoword with a beat gesture on the first syllable. The interaction between Continuum Step and

Beat Condition was marginally significant (β = 0.110, SE = 0.059, z = 1.886, p = 0.059),

suggesting a tendency for the more ambiguous steps to show a larger effect of the beat gesture.

Experiment 4

We analyzed the binomial categorization data (/a:/ coded as 1; /ɑ/ as 0) using a GLMM with a

logistic linking function. Fixed effects were Continuum Step and Beat Condition (coding adopted

from Experiment 1), and their interaction. The model included random intercepts for Participant

and Pseudoword, with by-participant and by-item random slopes for all fixed effects.

The model showed a significant effect of Continuum Step (β = -0.922, SE = 0.091, z = -10.155,

p < 0.001), indicating that higher continuum steps led to lower proportions of long /a:/ responses.

Critically, it showed a small effect of Beat Condition (β = -0.240, SE = 0.105, z = -2.280, p =

0.023), indicating that – in line with our hypothesis – listeners were biased towards perceiving the

first vowel as /a:/ if there was a beat gesture on the second syllable (or perceiving /ɑ/ if the beat

gesture fell on the first syllable). No interaction between Continuum Step and Beat Condition was

observed (p = 0.817).
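To convey the size of this bias, the Beat Condition estimate can be converted into an odds ratio; because the ±0.5 deviation coding places the two gesture alignments one unit apart, the full condition difference corresponds to exp(β). The snippet below is merely an illustration based on the reported coefficient, not an additional analysis.

```r
# Converting the reported Beat Condition estimate into an odds ratio (illustrative only).
beta_beat <- -0.240
exp(beta_beat)  # ~0.79: the odds of a long /a:/ response are roughly 21% lower
                # with a beat gesture on the first syllable than on the second
```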

DATA AVAILABILITY STATEMENT

The experimental data of this study will be made publicly available for download upon

publication. At that time, they may be retrieved from https://osf.io/b7kue under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

ACKNOWLEDGEMENTS

This research was supported by the Max Planck Society for the Advancement of Science,

Munich, Germany (H.R.B.; D.P.) and the Netherlands Organisation for Scientific Research (D.P.;

Veni grant 275-89-037). We would like to thank Giulio Severijnen for help in creating the

pseudowords of Experiments 1-3, and Nora Kennis, Esther de Kerf, and Myriam Weiss for their

help in testing participants and annotating the shadowing recordings.

REFERENCES

1. Levinson, S. C. Pragmatics. (Cambridge University Press, 1983).

2. Holler, J. & Levinson, S. C. Multimodal Language Processing in Human Communication.

Trends Cogn. Sci. 23, 639–652 (2019).

3. Levelt, W. J. M. Speaking: From Intention to Articulation. (MIT Press, 1989).

4. Vigliocco, G., Perniss, P. & Vinson, D. Language as a multimodal phenomenon:

implications for language learning, processing and evolution. Philos. Trans. R. Soc. B Biol.

Sci. 369, 20130292 (2014).

5. Hagoort, P. The neurobiology of language beyond single-word processing. Science 366, 55–

58 (2019).

6. Perniss, P. Why We Should Study Multimodal Language. Front. Psychol. 9, (2018).

7. Kaufeld, G., Ravenschlag, A., Meyer, A. S., Martin, A. E. & Bosker, H. R. Knowledge-

based and signal-based cues are weighted flexibly during spoken language comprehension. J.

Exp. Psychol. Learn. Mem. Cogn. 46, 549–562 (2020).

8. Kuperberg, G. R. & Jaeger, T. F. What do we mean by prediction in language

comprehension? Lang. Cogn. Neurosci. 31, 32–59 (2016).

9. Van Petten, C. & Luka, B. J. Prediction during language comprehension: Benefits, costs, and

ERP components. Int. J. Psychophysiol. 83, 176–190 (2012).

10. Clark, H. H. Using Language. (Cambridge University Press, 1996).

11. Hagoort, P., Hald, L., Bastiaansen, M. & Petersson, K. M. Integration of Word Meaning and

World Knowledge in Language Comprehension. Science 304, 438–441 (2004).

12. Kutas, M. & Federmeier, K. D. Electrophysiology reveals semantic memory use in language

comprehension. Trends Cogn. Sci. 4, 463–470 (2000).

13. McGurk, H. & MacDonald, J. Hearing lips and seeing voices. Nature 264, 746–748 (1976).

14. Rosenblum, L. D. Audiovisual Speech Perception and the McGurk Effect. in Oxford

Research Encyclopedia of Linguistics (2019).

15. Massaro, D. W. The McGurk Effect: Auditory Visual Speech Perception’s Piltdown Man. in

The 14th International Conference on Auditory-Visual Speech Processing (AVSP2017) (eds.

Ouni, S., Davis, C., Jesse, A. & Beskow, J.) (KTH, 2017).

16. Alsius, A., Paré, M. & Munhall, K. G. Forty Years After Hearing Lips and Seeing Voices:

the McGurk Effect Revisited. Multisensory Res. 31, 111–144 (2018).

17. Bosker, H. R., Peeters, D. & Holler, J. How visual cues to speech rate influence speech

perception. Q. J. Exp. Psychol. 38 (in press) doi:10.1177/1747021820914564.

18. Rosenblum, L. D., Dias, J. W. & Dorsi, J. The supramodal brain: implications for auditory

perception. J. Cogn. Psychol. 29, 65–87 (2017).

19. Efron, D. Gesture and Environment. (King’s Crown, 1941).

20. Kendon, A. Gesture: Visible Action as Utterance. (Cambridge University Press, 2004).

21. McNeill, D. Hand and Mind: What Gestures Reveal about Thought. (University of Chicago

Press, 1992).

22. Cooperrider, K. Fifteen ways of looking at a pointing gesture. (2020).

23. Goldin‐Meadow, S. Pointing Sets the Stage for Learning Language—and Creating

Language. Child Dev. 78, 741–745 (2007).

24. Kita, S. Pointing: Where Language, Culture, and Cognition Meet. (Psychology Press, 2003).

25. Peeters, D., Chu, M., Holler, J., Hagoort, P. & Özyürek, A. Electrophysiological and

Kinematic Correlates of Communicative Intent in the Planning and Production of Pointing

Gestures and Speech. J. Cogn. Neurosci. 27, 2352–2368 (2015).

26. Kelly, S. D., Özyürek, A. & Maris, E. Two Sides of the Same Coin: Speech and Gesture

Mutually Interact to Enhance Comprehension. Psychol. Sci. 21, 260–267 (2010).

27. Özyürek, A., Willems, R. M., Kita, S. & Hagoort, P. On-line Integration of Semantic

Information from Speech and Gesture: Insights from Event-related Brain Potentials. J. Cogn.

Neurosci. 19, 605–616 (2007).

28. Wu, Y. C. & Coulson, S. How iconic gestures enhance communication: An ERP study.

Brain Lang. 101, 234–245 (2007).

29. Biau, E. & Soto-Faraco, S. Beat gestures modulate auditory integration in speech perception.

Brain Lang. 124, 143–152 (2013).

30. Esteve-Gibert, N. & Prieto, P. Prosodic Structure Shapes the Temporal Realization of

Intonation and Manual Gesture Movements. J. Speech Lang. Hear. Res. 56, 850–864 (2013).

31. Krahmer, E. & Swerts, M. The effects of visual beats on prosodic prominence: Acoustic

analyses, auditory perception and visual perception. J. Mem. Lang. 57, 396–414 (2007).

32. Leonard, T. & Cummins, F. The temporal relation between beat gestures and speech. Lang.

Cogn. Process. 26, 1457–1471 (2011).

33. Loehr, D. P. Temporal, structural, and pragmatic synchrony between intonation and gesture.

Lab. Phonol. 3, 71–89 (2012).

34. Pouw, W., Harrison, S. J. & Dixon, J. A. Gesture–speech physics: The biomechanical basis

for the emergence of gesture–speech synchrony. J. Exp. Psychol. Gen. 149, 391–404 (2020).

35. Dick, A. S., Goldin‐Meadow, S., Hasson, U., Skipper, J. I. & Small, S. L. Co-speech

gestures influence neural activity in brain regions associated with processing semantic

information. Hum. Brain Mapp. 30, 3509–3526 (2009).

36. Habets, B., Kita, S., Shao, Z., Özyurek, A. & Hagoort, P. The Role of Synchrony and

Ambiguity in Speech–Gesture Integration during Comprehension. J. Cogn. Neurosci. 23,

1845–1854 (2010).

37. Kita, S. & Özyürek, A. What does cross-linguistic variation in semantic coordination of

speech and gesture reveal?: Evidence for an interface representation of spatial thinking and

speaking. J. Mem. Lang. 48, 16–32 (2003).

38. Özyürek, A. Hearing and seeing meaning in speech and gesture: insights from brain and

behaviour. Philos. Trans. R. Soc. B Biol. Sci. 369, 20130296 (2014).

39. Pouw, W., Paxton, A., Harrison, S. J. & Dixon, J. A. Acoustic information about upper limb

movement in voicing. Proc. Natl. Acad. Sci. (2020) doi:10.1073/pnas.2004163117.

40. Wagner, P., Malisz, Z. & Kopp, S. Gesture and speech in interaction: An overview. Speech

Commun. 57, 209–232 (2014).

41. Ekman, P. & Friesen, W. V. The Repertoire of Nonverbal Behavior: Categories, Origins,

Usage, and Coding. Semiotica 1, 49–98 (1969).

42. Treffner, P., Peter, M. & Kleidon, M. Gestures and Phases: The Dynamics of Speech-Hand

Communication. Ecol. Psychol. 20, 32–64 (2008).

43. Ferré, G. Gesture/speech integration in the perception of prosodic emphasis. in Proceedings

of Speech Prosody 2018 35–39 (ISCA, 2018).

44. Biau, E., Torralba, M., Fuentemilla, L., de Diego Balaguer, R. & Soto-Faraco, S. Speaker’s

hand gestures modulate speech perception through phase resetting of ongoing neural

oscillations. Cortex 68, 76–85 (2015).

45. Casasanto, D. & Jasmin, K. Good and Bad in the Hands of Politicians: Spontaneous Gestures

during Positive and Negative Speech. PLoS ONE 5, e11805 (2010).

46. Dimitrova, D., Chu, M., Wang, L., Özyürek, A. & Hagoort, P. Beat that Word: How

Listeners Integrate Beat Gesture and Focus in Multimodal Speech Discourse. J. Cogn.

Neurosci. 28, 1255–1269 (2016).

47. Holle, H. et al. Gesture Facilitates the Syntactic Analysis of Speech. Front. Psychol. 3,

(2012).

48. Hubbard, A. L., Wilson, S. M., Callan, D. E. & Dapretto, M. Giving speech a hand: Gesture

modulates activity in auditory cortex during speech perception. Hum. Brain Mapp. 30, 1028–

1037 (2009).

49. Wang, L. & Chu, M. The role of beat gesture and pitch accent in semantic processing: An

ERP study. Neuropsychologia 51, 2847–2855 (2013).

50. So, W. C., Chen-Hui, C. S. & Wei-Shan, J. L. Mnemonic effect of iconic gesture and beat

gesture in adults and children: Is meaning in gesture important for memory recall? Lang.

Cogn. Process. 27, 665–681 (2012).

51. Igualada, A., Esteve-Gibert, N. & Prieto, P. Beat gestures improve word recall in 3- to 5-

year-old children. J. Exp. Child Psychol. 156, 99–112 (2017).

52. Llanes-Coromina, J., Vilà-Giménez, I., Kushch, O., Borràs-Comes, J. & Prieto, P. Beat

gestures help preschoolers recall and comprehend discourse information. J. Exp. Child

Psychol. 172, 168–188 (2018).

53. Vilà-Giménez, I., Igualada, A. & Prieto, P. Observing Storytellers Who Use Rhythmic Beat

Gestures Improves Children’s Narrative Discourse Performance. Dev. Psychol. 55, 250–262

(2019).

54. Rochet-Capellan, A., Laboissière, R., Galván, A. & Schwartz, J.-L. The

Speech Focus Position Effect on Jaw–Finger Coordination in a Pointing Task. J. Speech

Lang. Hear. Res. 51, 1507–1521 (2008).

55. Kleinschmidt, D. F. & Jaeger, T. F. Robust speech perception: Recognize the familiar,

generalize to the similar, and adapt to the novel. Psychol. Rev. 122, 148 (2015).

56. Jesse, A., Poellmann & Kong, Y. English Listeners Use Suprasegmental Cues to Lexical

Stress Early During Spoken-Word Recognition. J. Speech Lang. Hear. Res. 60, 190–198

(2017).

57. Reinisch, E., Jesse, A. & McQueen, J. M. Early use of phonetic information in spoken word

recognition: Lexical stress drives eye movements immediately. Q. J. Exp. Psychol. 63, 772–

783 (2010).

58. Jesse, A. & McQueen, J. M. Suprasegmental Lexical Stress Cues in Visual Speech can

Guide Spoken-Word Recognition. Q. J. Exp. Psychol. 67, 793–808 (2014).

59. Scarborough, R., Keating, P., Mattys, S. L., Cho, T. & Alwan, A. Optical Phonetics and

Visual Perception of Lexical and Phrasal Stress in English. Lang. Speech 52, 135–175

(2009).

60. Foxton, J. M., Riviere, L.-D. & Barone, P. Cross-modal facilitation in speech prosody.

Cognition 115, 71–78 (2010).

61. Sumby, W. H. & Pollack, I. Visual Contribution to Speech Intelligibility in Noise. J. Acoust.

Soc. Am. 26, 212–215 (1954).

62. Golumbic, E. M. Z., Cogan, G. B., Schroeder, C. E. & Poeppel, D. Visual input enhances

selective speech envelope tracking in auditory cortex at a “cocktail party”. J. Neurosci. 33,

1417–1426 (2013).

63. Jesse, A. & Mitterer, H. Pointing Gestures do not Influence the Perception of Lexical Stress.

in Proceedings of INTERSPEECH 2011 2445–2448 (2011).

64. Norris, D. & McQueen, J. M. Shortlist B: A Bayesian model of continuous speech

recognition. Psychol. Rev. 115, 357–395 (2008).

65. Peelle, J. E. & Sommers, M. S. Prediction and constraint in audiovisual speech perception.

Cortex 68, 169–181 (2015).

66. McClelland, J. L. & Elman, J. L. The TRACE model of speech perception. Cognit. Psychol.

18, 1–86 (1986).

67. Nooteboom, S. G. & Doodeman, G. J. N. Production and perception of vowel length in

spoken sentences. J. Acoust. Soc. Am. 67, 276–287 (1980).

68. Quené, H. & Van den Bergh, H. Examples of mixed-effects modeling with crossed random

effects and with binomial data. J. Mem. Lang. 59, 413–425 (2008).

69. Rietveld, A. C. M. & Van Heuven, V. J. J. P. Algemene Fonetiek. (Uitgeverij Coutinho,

2009).

70. Steffman, J. Intonational structure mediates speech rate normalization in the perception of

segmental categories. J. Phon. 74, 114–129 (2019).

71. Bosker, H. R. Accounting for rate-dependent category boundary shifts in speech perception.

Atten. Percept. Psychophys. 79, 333–343 (2017).

72. Massaro, D. W. Perceiving Talking Faces: From Speech Perception to a Behavioral

Principle. (MIT Press, 1998).

73. Cho, T., McQueen, J. M. & Cox, E. A. Prosodically driven phonetic detail in speech

processing: The case of domain-initial strengthening in English. J. Phon. 35, 210–243

(2007).

74. Maslowski, M., Meyer, A. S. & Bosker, H. R. Listeners normalize speech for contextual

speech rate even without an explicit recognition task. J. Acoust. Soc. Am. 146, 179–188

(2019).

75. Doelling, K. B., Arnal, L. H., Ghitza, O. & Poeppel, D. Acoustic landmarks drive delta–theta

oscillations to enable speech comprehension by facilitating perceptual parsing. NeuroImage

85, 761–768 (2014).

76. Kösem, A. et al. Neural entrainment determines the words we hear. Curr. Biol. 28, 2867–

2875 (2018).

77. Bosker, H. R., Sjerps, M. J. & Reinisch, E. Spectral contrast effects are modulated by

selective attention in ‘cocktail party’ settings. Atten. Percept. Psychophys. 82, 1318–1332

(2020).

78. Bosker, H. R., Sjerps, M. J. & Reinisch, E. Temporal contrast effects in human speech

perception are immune to selective attention. Sci. Rep. 10, (2020).

79. Golumbic, E. M. Z. et al. Mechanisms underlying selective neuronal tracking of attended

speech at a “cocktail party”. Neuron 77, 980–991 (2013).

80. de la Cruz-Pavía, I., Werker, J. F., Vatikiotis-Bateson, E. & Gervain, J. Finding Phrases: The

Interplay of Word Frequency, Phrasal Prosody and Co-speech Visual Information in

Chunking Speech by Monolingual and Bilingual Adults. Lang. Speech (2019) doi:10.1177/0023830919842353.

81. Jesse, A. & Johnson, E. K. Prosodic temporal alignment of co-speech gestures to speech

facilitates referent resolution. J. Exp. Psychol. Hum. Percept. Perform. 38, 1567–1581

(2012).

82. Chrabaszcz, A., Winn, M., Lin, C. Y. & Idsardi, W. J. Acoustic Cues to

Perception of Word Stress by English, Mandarin, and Russian Speakers. J. Speech Lang.

Hear. Res. 57, 1468–1479 (2014).

83. Connell, K. et al. English Learners’ Use of Segmental and Suprasegmental Cues to Stress in

Lexical Access: An Eye-Tracking Study. Lang. Learn. 68, 635–668 (2018).

84. Tyler, M. D. & Cutler, A. Cross-language differences in cue use for speech segmentation. J.

Acoust. Soc. Am. 126, 367–376 (2009).

85. Peeters, D. Virtual reality: A game-changing method for the language sciences. Psychon.

Bull. Rev. 26, 894–900 (2019).

86. Severijnen, G. G. A. The Role of Talker-Specific Prosody in Predictive Speech Perception.

(Radboud University, Nijmegen, 2020).

87. Boersma, P. & Weenink, D. Praat: doing phonetics by computer [computer program].

(2016).

88. Stoet, G. PsyToolkit: A Novel Web-Based Method for Running Online Questionnaires and

Reaction-Time Experiments. Teach. Psychol. 44, 24–31 (2017).

89. Bates, D., Maechler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using

lme4. J. Stat. Softw. 67, 1–48 (2015).

90. R Development Core Team. R: A Language and Environment for Statistical Computing

[computer program]. (2012).

91. Barr, D. J., Levy, R., Scheepers, C. & Tily, H. J. Random effects structure for confirmatory

hypothesis testing: Keep it maximal. J. Mem. Lang. 68, 255–278 (2013).

92. Luke, S. G. Evaluating significance in linear mixed-effects models in R. Behav. Res.

Methods 49, 1494–1502 (2017).


FIGURE CAPTIONS

Figure 1. Example audiovisual stimulus and results from Experiments 1, 3, and 4. A. In Experiments 1, 2, and

4, participants were presented with an audiovisual speaker (facial features masked in the actual experiment)

producing the Dutch carrier sentence Nu zeg ik het woord... “Now say I the word...” followed by a disyllabic target

pseudoword (e.g., bagpif). The speaker produced three beat gestures, with the last beat gesture’s apex falling in the

target time window. Two auditory conditions were created by aligning the onset of either the first (top waveform in

blue) or the second vowel of the target pseudoword (bottom waveform in orange) to the final gesture’s apex (gray

vertical line). B. Participants in Experiment 1 categorized audiovisual pseudowords as having lexical stress on either

the first or the second syllable. Each target pseudoword was sampled from a 7-step continuum (on x-axis) varying

F0 in opposite directions for the two syllables. In the audiovisual block (solid lines), participants were significantly

more likely to report perceiving lexical stress on the first syllable if the speaker produced a beat gesture on the first

syllable (in blue) vs. second syllable (in orange). No such difference was observed in the audio-only control block

(transparent lines). C. Participants in Experiment 3 categorized the audio-only shadowed productions, recorded in

Experiment 2, as having lexical stress on either the first or the second syllable. If the audiovisual stimulus in

Experiment 2 involved a pseudoword with a beat gesture on the first syllable, the elicited shadowed production was

more likely to be perceived as having lexical stress on the first syllable (in blue). Conversely, if the audiovisual

stimulus in Experiment 2 involved a pseudoword with a beat gesture on the second syllable, the elicited shadowed

production was more likely to be perceived as having lexical stress on the second syllable (in orange). D.

Participants in Experiment 4 categorized audiovisual pseudowords as containing either short /ɑ/ or long /a:/ (e.g.,

bagpif vs. baagpif). Each target pseudoword contained a first vowel that was sampled from a 5-step F2 continuum

from long /a:/ to short /ɑ/ (on x-axis). Moreover, prosodic cues to lexical stress (F0, amplitude, syllable duration)

were set to ambiguous values. If the audiovisual speaker produced a beat gesture on the first syllable (in blue),

participants were biased to perceive the first syllable as stressed, making the ambiguous first vowel relatively short

for a stressed syllable, leading to a lower proportion of long /a:/ responses. Conversely, if the audiovisual speaker

produced a beat gesture on the second syllable (in orange), participants were biased to perceive the initial syllable as

unstressed, making the ambiguous first vowel relatively long for an unstressed syllable, leading to a higher

proportion of long /a:/ responses. For all panels, error bars enclose 1.96 x SE on either side; that is, the 95%

confidence intervals.

Figure 2. Predicted estimates from the combined models of Experiment 2. In Experiment 2, participants

shadowed the audiovisual sentence-final pseudowords in either Beat Condition. The figure shows stylized wave

forms for the two syllables of the shadowed productions, on either side of the zero point on the x-axis, showing their

durations (ms), aligned around the onset of the second syllable (at 0). The bold horizontal error bars at the extremes

of the stylized waveforms indicate duration SE. The vertical position of the (black rectangles surrounding the)

stylized waveforms shows the produced F0 (in Hz on y-axis; height of rectangles indicates F0 SE on either side of

the midpoint). The color gradient and height of the stylized waveforms indicates average amplitude (dB). The

overall differences between first and second syllables show pronounced word-final effects: the last syllable of a

word typically has a longer duration, and a lower F0 and amplitude. Critically, after audiovisual trials with a beat

gesture on the second syllable (lower panel; compared to a beat gesture on the first syllable in the upper panel),

shadowers produced their first syllables with a shorter duration and their second syllables with a longer duration and

higher amplitude. No effect of F0 was observed.

[Figure 1 (panel A: audiovisual stimulus with the carrier sentence “Nu zeg ik het woord...”; panels B–D: categorization results of Experiments 1, 3, and 4) and Figure 2 (predicted amplitude, F0, and duration estimates from the Experiment 2 models) appear here; see the figure captions above.]