bioRxiv preprint doi: https://doi.org/10.1101/2020.07.13.200543; this version posted July 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Beat gestures influence which speech sounds you hear
Hans Rutger Bosker ab 1 and David Peeters ac
a Max Planck Institute for Psycholinguistics, PO Box 310, 6500 AH, Nijmegen, The Netherlands
b Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, the Netherlands
c Tilburg University, Department of Communication and Cognition, TiCC, Tilburg, the Netherlands
Running title: Beat gestures influence speech perception
Word count: 11080 (including all words)
Figure count: 2
1 Corresponding author. Tel.: +31 (0)24 3521 373. E-mail address: [email protected]
ABSTRACT
Beat gestures – spontaneously produced biphasic movements of the hand – are among the most
frequently encountered co-speech gestures in human communication. They are closely temporally
aligned to the prosodic characteristics of the speech signal, typically occurring on lexically stressed
syllables. Despite their prevalence across speakers of the world’s languages, how beat gestures
impact spoken word recognition is unclear. Can these simple ‘flicks of the hand’ influence speech
perception? Across six experiments, we demonstrate that beat gestures influence the explicit and
implicit perception of lexical stress (e.g., distinguishing OBject from obJECT), and in turn, can
influence what vowels listeners hear. Thus, we provide converging evidence for a manual McGurk
effect: even the simplest ‘flicks of the hands’ influence which speech sounds we hear.
Keywords: beat gestures; lexical stress; vowel length; speech perception; manual McGurk effect.
SIGNIFICANCE STATEMENT
Beat gestures are very common in human face-to-face communication. Yet we know little about
their behavioral consequences for spoken language comprehension. We demonstrate that beat
gestures influence the explicit and implicit perception of lexical stress, and, in turn, can even shape
what vowels we think we hear. This demonstration of a manual McGurk effect provides some of
the first empirical support for a recent multimodal, situated psycholinguistic framework of human
communication, while challenging current models of spoken word recognition that do not yet
incorporate multimodal prosody. Moreover, it has the potential to enrich human-computer
interaction and improve multimodal speech recognition systems.
INTRODUCTION
The human capacity to communicate evolved primarily to allow for efficient face-to-face
interaction 1. As humans, we are indeed capable of translating our thoughts into communicative
messages, which in everyday situations often comprise concurrent auditory (speech) and visual (lip
movements, facial expressions, hand gestures, body posture) information 2–4. Speakers thus make
use of different channels (mouth, eyes, hands, torso) to get their messages across, while addressees
have to quickly and efficiently integrate or bind the different ephemeral signals they concomitantly
perceive 2,5,6. This process is skilfully guided by top-down predictions 7–9 that exploit a life-time of
communicative experience, acquired world knowledge, and personal common ground built up with
the speaker 10–12.
The classic McGurk effect 13 is an influential – though not entirely uncontroversial 14–16 –
illustration of how visual input influences the perception of speech sounds. In a seminal study 13,
participants were made to repeatedly listen to the sound /ba/. At the same time, they watched the
face of a speaker in a video whose mouth movements corresponded to the sound /ga/. When asked
what they heard, a majority of participants reported actually perceiving the sound /da/. This robust
finding indicates that, during language comprehension, our brains take both verbal (here: auditory)
and non-verbal (here: visual) aspects of communication into account 17,18, providing the addressee
with a best guess of what exact message a speaker intends to convey. We thus listen not only with
our ears, but also with our eyes.
Visual aspects of everyday human communication are not restricted to subtle mouth or lip
movements. In close coordination with speech, the hand gestures we spontaneously and
idiosyncratically make during conversations help us express our thoughts, emotions, and intentions
19–21. We manually point at things to guide the attention of our addressee to relevant aspects of our
immediate environment 22–25, enrich our speech with iconic gestures that in handshape and
movement resemble what we are talking about 26–28, and produce simple flicks of the hand aligned
to the acoustic prominence of our spoken utterance to highlight relevant points in speech 29–34. Over
the past decades, it has become clear that such co-speech manual gestures are semantically aligned
and temporally synchronized with the speech we produce 35–40. It is an open question, however,
whether the temporal synchrony between these hand gestures and the speech signal can actually
influence which speech sounds we hear.
In this study, we test for the presence of such a potential manual McGurk effect. Do hand
gestures, like observed lip movements, influence what exact speech sounds we perceive? We here
focus on manual beat gestures – commonly defined as biphasic (e.g., up and down) movements of
the hand that speakers spontaneously produce to highlight prominent aspects of the concurrent
speech signal 21,31,41. Beat gestures are amongst the most frequently encountered gestures in
naturally occurring human communication 21 as the presence of an accompanying beat gesture
allows for the enhancement of the perceived prominence of a word 31,42,43. Not surprisingly, they
hence also feature prominently in politicians’ public speeches 44,45. By highlighting concurrent
parts of speech and enhancing its processing 46–49, beat gestures may even increase memory recall
of words in both adults 50 and children 51–53. As such, they form a fundamental part of our
communicative repertoire and also have direct practical relevance.
In contrast to iconic gestures, beat gestures are not related to the spoken message on a semantic
level, but only on a temporal level: their apex is aligned to vowel onset in lexically stressed
syllables carrying prosodic emphasis 20,21,34,54. Neurobiological studies suggest that listeners may
tune oscillatory activity in the alpha and theta bands upon observing the initiation of a beat gesture
to anticipate processing an assumedly important upcoming word 44. Bilateral temporal areas of the
brain may in parallel be activated for efficient integration of beat gestures and concurrent speech
48. However, evidence for behavioral effects of temporal gestural alignment on speech perception
remains scarce. Given the close temporal alignment between beat gestures and lexically stressed
syllables in speech production, ideal observers 55 could in principle use beat gestures as a cue to
lexical stress in perception.
Lexical stress is indeed a critical speech cue in spoken word recognition. It is well established
that listeners commonly use suprasegmental spoken cues to lexical stress, such as greater
amplitude, higher fundamental frequency (F0), and longer syllable duration, to constrain online
lexical access, speeding up word recognition, and increasing perceptual accuracy 56,57. Listeners
also use visual articulatory cues on the face to perceive lexical stress when the auditory signal is
muted 58–60. Nevertheless, whether non-facial visual cues, such as beat gestures, also aid in lexical
stress perception, and might even do so in the presence of auditory cues to lexical stress – as in
everyday conversations – is unknown. These questions are relevant for our understanding of face-
to-face interaction, where multimodal cues to lexical stress might considerably increase the
robustness of spoken communication, for instance in noisy listening conditions 61, and even when
the auditory signal is clear 62.
Indeed, a recent theoretical account proposed that multimodal low-level integration must be a
general cognitive principle fundamental to our human communicative capacities 2. However,
evidence showing that the integration of information from two separate communicative modalities
(speech, gesture) modulates the mere perception of information derived from one of these two
modalities (e.g., speech) is lacking. For instance, earlier studies have failed to find effects of
simulated pointing gestures on lexical stress perception, leading to the idea that manual gestures
provide information only about sentential emphasis; not about relative emphasis placed on
individual syllables within words 63. As a result, there is no formal account or model of how
audiovisual prosody might help spoken word recognition 64–66. Hence, demonstrating a manual
McGurk effect would imply that existing models of speech perception need to be fundamentally
expanded to account for how multimodal cues influence auditory word recognition.
Below, we report the results of four experiments (and two direct replications: Experiments S1-
S2) in which we tested for the existence and robustness of a manual McGurk effect. The
experiments each approach the theoretical issue at hand from a different methodological angle,
varying the degree of explicitness of the participants’ experimental task and their response
modality (perception vs. production). Experiment 1 uses an explicit task to test whether observing
a beat gesture influences the perception of lexical stress in disyllabic spoken stimuli. Using the
same stimuli, Experiment 2 implicitly investigates whether seeing a concurrent beat gesture
influences how participants vocally shadow (i.e., repeat back) an audiovisual speaker’s production
of disyllabic stimuli. Experiment 3 then uses those shadowed productions as stimuli in a perception
experiment to implicitly test whether naïve listeners are biased towards perceiving lexical stress
on the first syllable if the shadowed production was produced in response to an audiovisual
stimulus with a beat gesture on the first syllable. Finally, Experiment 4 tests whether beat gestures
may influence what vowels (long vs. short) listeners perceive. The perceived prominence of a
syllable influences the expected duration of that syllable’s vowel nucleus (i.e., longer if stressed) 67. Thus,
if listeners use beat gestures as a cue to lexical stress, they may perceive a vowel midway between
Dutch short /ɑ/ and long /a:/ as /ɑ/ when presented with a concurrent beat gesture, but as /a:/ without
it.
For all experiments, we hypothesized that, if the manual McGurk effect exists, we should see
effects of the presence and timing of beat gestures on the perception and production of lexical
stress (Experiments 1-3) and, in turn, on vowel perception (Experiment 4). In short, we found
evidence in favor of a manual McGurk effect in all four experiments and in both direct replications:
the perception of lexical stress and vowel identity is consistently influenced by the temporal
alignment of beat gestures to the speech signal.
RESULTS
Experiment 1 tested whether beat gestures influence how listeners explicitly categorize
disyllabic spoken stimuli as having initial stress or final stress (e.g., WAsol vs. waSOL; capitals
indicate lexical stress). Native speakers of Dutch were presented with an audiovisual speaker
(facial features masked in the actual experiment to avoid biasing effects of articulatory cues)
producing non-existing pseudowords in a sentence context: Nu zeg ik het woord... [target] “Now
say I the word... [target]”. Pseudowords were manipulated along a 7-step lexical stress continuum,
varying F0 in opposite directions for the two syllables, thus ranging from a ‘strong-weak’ (step 1)
to a ‘weak-strong’ prosodic pattern (step 7). Critically, the audiovisual speaker always produced a
beat gesture whose apex was manipulated to be either aligned to the onset of the first vowel or the
second vowel by varying the silent interval preceding the pseudoword (cf. Figure 1A). The task
involved two-alternative forced choice (2AFC) between the pseudoword having stress on the first
or second syllable. For comparison, in another block, the same participants also categorized audio-
only versions of the stimuli (block order counter-balanced). Outcomes of a Generalized Linear
Mixed Model with a logistic linking function (GLMM; 68) demonstrated that participants, in the
audiovisual block, were significantly more likely to report hearing stress on the first syllable if the
beat gesture was aligned to the onset of the first vowel (main effect of Beat Condition: β = 1.142
in logit space, p < 0.001), as shown by the difference between the blue and orange solid lines in
Figure 1B (mean difference in proportions between the two Beat Conditions: 0.20). No such effect
was observed in the audio-only condition (interaction Beat Condition x Modality: β = -0.813, p <
0.001). These findings demonstrate that the temporal alignment of beat gestures to the speech
signal influences the explicit perception of lexical stress.
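The coefficients above are reported in logit (log-odds) space. As a purely illustrative sketch — assuming treatment coding and a neutral baseline of 0 log-odds at the ambiguous midpoint of the continuum, neither of which is specified here — the logistic function maps such a coefficient onto a predicted response proportion:

```python
import math

def logit_to_prob(log_odds: float) -> float:
    """Convert log-odds (logit space) to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical baseline: 0 log-odds at the ambiguous continuum midpoint,
# i.e., a 0.5 probability of reporting initial stress.
baseline = logit_to_prob(0.0)

# Adding the reported Beat Condition coefficient (beta = 1.142) shifts the
# predicted probability of "stress on first syllable" responses upward.
with_beat_on_first = logit_to_prob(1.142)

print(baseline, round(with_beat_on_first, 3))  # 0.5 0.758
```

Note that the model's actual predicted proportions depend on its intercept, slopes, and contrast coding; this sketch only illustrates the logit-to-proportion mapping.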
Figure 1. Example audiovisual stimulus and results from Experiments 1, 3, and 4. A. In Experiments
1, 2, and 4, participants were presented with an audiovisual speaker (facial features masked in the actual
experiment) producing the Dutch carrier sentence Nu zeg ik het woord... “Now say I the word...” followed
by a disyllabic target pseudoword (e.g., bagpif). The speaker produced three beat gestures, with the last
beat gesture’s apex falling in the target time window. Two auditory conditions were created by aligning the
onset of either the first (top waveform in blue) or the second vowel of the target pseudoword (bottom
waveform in orange) to the final gesture’s apex (gray vertical line). B. Participants in Experiment 1
categorized audiovisual pseudowords as having lexical stress on either the first or the second syllable. Each
target pseudoword was sampled from a 7-step continuum (on x-axis) varying F0 in opposite directions for
the two syllables. In the audiovisual block (solid lines), participants were significantly more likely to report
perceiving lexical stress on the first syllable if the speaker produced a beat gesture on the first syllable (in
blue) vs. second syllable (in orange). No such difference was observed in the audio-only control block
(transparent lines). C. Participants in Experiment 3 categorized the audio-only shadowed productions,
recorded in Experiment 2, as having lexical stress on either the first or the second syllable. If the audiovisual
stimulus in Experiment 2 involved a pseudoword with a beat gesture on the first syllable, the elicited
shadowed production was more likely to be perceived as having lexical stress on the first syllable (in blue).
Conversely, if the audiovisual stimulus in Experiment 2 involved a pseudoword with a beat gesture on the
second syllable, the elicited shadowed production was more likely to be perceived as having lexical stress
on the second syllable (in orange). D. Participants in Experiment 4 categorized audiovisual pseudowords
as containing either short /ɑ/ or long /a:/ (e.g., bagpif vs. baagpif). Each target pseudoword contained a first
vowel that was sampled from a 5-step F2 continuum from long /a:/ to short /ɑ/ (on x-axis). Moreover,
prosodic cues to lexical stress (F0, amplitude, syllable duration) were set to ambiguous values. If the
audiovisual speaker produced a beat gesture on the first syllable (in blue), participants were biased to
perceive the first syllable as stressed, making the ambiguous first vowel relatively short for a stressed
syllable, leading to a lower proportion of long /a:/ responses. Conversely, if the audiovisual speaker
produced a beat gesture on the second syllable (in orange), participants were biased to perceive the initial
syllable as unstressed, making the ambiguous first vowel relatively long for an unstressed syllable, leading
to a higher proportion of long /a:/ responses. For all panels, error bars enclose 1.96 × SE on either side; that
is, the 95% confidence intervals.
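The per-syllable F0 manipulation described in the caption can be sketched as a simple linear interpolation between the two prosodic endpoints. The endpoint F0 values below are hypothetical placeholders, not the study's actual stimulus settings:

```python
def make_continuum(f0_strong: float, f0_weak: float, steps: int = 7):
    """Linearly interpolate per-syllable F0 targets from a 'strong-weak'
    pattern (step 1) to a 'weak-strong' pattern (final step), varying F0
    in opposite directions for the two syllables.

    Returns a list of (syllable1_f0, syllable2_f0) pairs in Hz.
    """
    pairs = []
    for i in range(steps):
        frac = i / (steps - 1)
        syl1 = f0_strong + frac * (f0_weak - f0_strong)  # falls across the continuum
        syl2 = f0_weak + frac * (f0_strong - f0_weak)    # rises across the continuum
        pairs.append((round(syl1, 1), round(syl2, 1)))
    return pairs

# Illustrative endpoints only (200 Hz vs. 150 Hz):
continuum = make_continuum(f0_strong=200.0, f0_weak=150.0)
# Step 1 is 'strong-weak' (200, 150); step 7 is 'weak-strong' (150, 200);
# step 4 is maximally ambiguous (175, 175).
```

In the actual stimuli such F0 targets would be imposed on the recorded pseudowords via resynthesis; this sketch covers only the interpolation logic.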
However, the task in Experiment 1, asking participants to categorize the audiovisual stimuli as
having stress on either the first or second syllable, is very explicit and presumably susceptible to
response strategies. Therefore, Experiment 2 tested lexical stress perception implicitly and in
another response modality by asking participants to overtly shadow the audiovisual pseudowords
from Experiment 1. To reduce the load on the participants, we selected only three relevant steps
from the lexical stress continua: two relatively clear steps (1 and 7) and one relatively ambiguous
step (step 4). Critically, lexical stress was never mentioned in any part of the instructions. We
hypothesized that, if participants saw the audiovisual speaker produce a beat gesture on the second
syllable, the suprasegmental characteristics of their shadowed productions would resemble a
‘weak-strong’ prosodic pattern (i.e., greater amplitude, higher F0, longer duration on the second
syllable). Acoustic analysis of the amplitude, F0, and duration measurements of the shadowed
productions by three separate Linear Mixed Models (one for each acoustic cue) showed small but
significant effects of Beat Condition, visualized in Figure 2. That is, if the beat gesture was aligned
to the second syllable, participants presumably implicitly perceived lexical stress on the second
syllable, and therefore their own shadowed second syllables were produced with greater amplitude
and a longer duration, while their first syllables were shorter.
Figure 2. Predicted estimates from the combined models of Experiment 2. In Experiment 2,
participants shadowed the audiovisual sentence-final pseudowords in either Beat Condition. The figure
shows stylized wave forms for the two syllables of the shadowed productions, on either side of the zero
point on the x-axis, showing their durations (ms), aligned around the onset of the second syllable (at 0).
The bold horizontal error bars at the extremes of the stylized waveforms indicate duration SE. The vertical
position of the (black rectangles surrounding the) stylized waveforms shows the produced F0 (in Hz on y-
axis; height of rectangles indicates F0 SE on either side of the midpoint). The color gradient and height of
the stylized waveforms indicates average amplitude (dB). The overall differences between first and second
syllables show pronounced word-final effects: the last syllable of a word typically has a longer duration,
and a lower F0 and amplitude. Critically, after audiovisual trials with a beat gesture on the second syllable
(lower panel; compared to a beat gesture on the first syllable in the upper panel), shadowers produced their
first syllables with a shorter duration and their second syllables with a longer duration and higher amplitude.
No effect of F0 was observed.
Note that the effects of beat gestures in Experiment 2 were robust but small. Furthermore, no
effect of Beat Condition was found on the F0 of the shadowed productions, the primary cue to
stress in Dutch 69. Therefore, Experiment 3 tested the perceptual relevance of the acoustic
adjustments participants made to their shadowed productions in response to the two beat gesture
conditions. Each participant listened to all shadowed productions from two shadowers from
Experiment 2 and indicated whether they perceived the shadowed productions to have stress on
the first or second syllable (2AFC). Critically, this involved audio-only stimuli and the participants
in Experiment 3 were unaware of the beat gesture manipulation in Experiment 2. Nevertheless,
outcomes of a GLMM on the binomial categorization data demonstrated that participants were
significantly biased to report hearing stress on the first syllable of the audio-only shadowed
productions if the audiovisual stimulus that elicited those shadowed productions had a beat gesture
on the first syllable (main effect of Beat Condition: β = 0.232 in logit space, p = 0.003). This is
indicated by the difference between the blue and orange lines in Figure 1C (mean difference in
proportions between the two Beat Conditions: 0.05). Moreover, an identical replication of this
experiment, Experiment S1 (reported in the Supplementary Material), reported similar findings
using the same stimuli in a new participant sample. These findings suggest that the acoustic
adjustments in response to the different beat gesture alignments in Experiment 2 were perceptually
salient enough for new participants to notice them and use them as cues to lexical stress
identification. Thus, it provides supportive evidence that beat gestures influence implicit lexical
stress perception.
Finally, Experiment 4 tested the strongest form of a manual McGurk effect: can beat gestures
influence the perceived identity of individual speech sounds? The rationale behind Experiment 4
is grounded in the observation that listeners take a syllable’s perceived prominence into account
when estimating its vowel’s duration 67,70. Vowels in stressed syllables typically have a longer
duration than vowels in unstressed syllables. Hence, in perception, listeners expect a vowel in a
stressed syllable to have a longer duration. That is, one and the same vowel duration can be
perceived as relatively short in one (stressed) syllable, but as relatively long in another (unstressed
syllable). In languages where vowel duration is a critical cue to phonemic contrasts (e.g., to coda
stop voicing in English, e.g., “coat - code” 70; to vowel length contrasts in Dutch, e.g., tak /tɑk/
“branch” - taak /ta:k/ “task” 71), these expectations can influence the perceived identity of speech
sounds.
Participants in Experiment 4 were presented with audiovisual stimuli, much like those used in
Experiments 1-2. However, we used a new set of disyllabic pseudowords whose initial vowel was
manipulated to be ambiguous between Dutch /ɑ-a:/ in its duration while varying the second
formant frequency (F2) in 5-step F2 continua, which is a secondary cue to the /ɑ-a:/ contrast in
Dutch 71 but not a cue to lexical stress. Moreover, each pseudoword token was manipulated to be
ambiguous with respect to its suprasegmental cues to lexical stress (set to average values of F0,
amplitude, and duration). Taken together, these manipulations resulted in pseudowords that were
ambiguous in lexical stress and vowel duration, but varied in F2 as cue to vowel identity.
Participants indicated whether they heard an /ɑ/ or /a:/ as initial vowel of the pseudoword (2AFC;
e.g., bagpif or baagpif?); lexical stress was never mentioned in any part of the instructions.
Outcomes of a GLMM on the binomial vowel categorization data demonstrated an effect of Beat
Condition on participants’ vowel perception (β = -0.240 in logit space, p = 0.023). That is, if the
beat gesture was aligned to the first syllable (blue line in Figure 1D), participants were more likely
to perceive the first syllable as stressed, rendering the ambiguous vowel duration in the first
syllable as relatively short for a stressed syllable, lowering the proportion of long /a:/ responses.
Conversely, if the beat gesture was aligned to the second syllable (orange line in Figure 1D),
participants were more likely to perceive the first syllable as unstressed, making the ambiguous
vowel duration in the first syllable relatively long for an unstressed syllable, leading to a small
increase in the proportion of long /a:/ responses (mean difference in proportions between the two
Beat Conditions: 0.04). This finding, corroborated by similar observations in an identical
replication of this experiment, Experiment S2 (see Supplementary Material), showcases the
strongest form of the manual McGurk effect: beat gestures influence which speech sounds we hear.
DISCUSSION
To what extent do we listen with our eyes? The classic McGurk effect illustrates that listeners
take both visual (lip movements) and auditory (speech) information into account to arrive at a best
possible approximation of the exact message a speaker intends to convey 13. Here we provide
evidence in favor of a manual McGurk effect: the beat gestures we see influence what exact speech
sounds we hear. Specifically, we observed in four experiments that beat gestures influence both
the explicit and implicit perception of lexical stress, and in turn may influence the perception of
individual speech sounds. To our understanding, this is the first demonstration of behavioral
consequences of beat gestures on low-level speech perception. These findings support the idea that
everyday language comprehension is primordially a multimodal undertaking in which the multiple
streams of information that are concomitantly perceived through the different senses are quickly
and efficiently integrated by the listener to make a well-informed best guess of what specific
message a speaker is intending to communicate.
The four experiments we presented made use of a combination of implicit and explicit
paradigms. Not surprisingly, the explicit nature of the task used in Experiment 1 led to strong
effects of the timing of a perceived beat gesture on the lexical stress pattern participants reported
to hear. Although the implicit effects in subsequent Experiments 2-4 were smaller, conceivably
partially due to suboptimal control over the experimental settings in online testing, they were
nevertheless robust and replicable (cf. Experiments S1-S2 in Supplementary Material). Moreover,
these beat gesture effects even surfaced when the auditory cues were relatively clear (i.e., at the
extreme ends of the acoustic continua), suggesting that these effects also prevail in naturalistic
face-to-face communication outside the lab. In fact, the naturalistic nature of the audiovisual
stimuli would seem key to the effect, since the temporal alignment of artificial hand movements
(i.e., an animated line drawing of a disembodied hand) does not influence lexical stress perception
(see 63). Perhaps the present study even underestimates the behavioral influence of beat gestures on
speech perception in everyday conversations, since all auditory stimuli in the experiments involved
‘clean’ (i.e., noise-free) speech. However, in real life situations, the speech of an attended talker
oftentimes occurs in the presence of other auditory signals, such as background music, competing
speech, or environmental noise; in such listening conditions, the contribution of visual beat
gestures to successful communication would conceivably be even greater.
The outcomes of the present study challenge current models of (multimodal) word recognition,
which would need to be extended to account for how prosody, specifically multimodal prosody,
supports and modulates word recognition. Audiovisual prosody is implemented in the fuzzy logical
model of perception 72, but how this information affects lexical processing is underspecified.
Moreover, Experiment 4 implies that the multimodal suprasegmental and segmental ‘analyzers’
would have to allow for interaction, since one can influence the other. Existing models
of audiovisual speech perception and word recognition therefore need to be expanded to account for (1)
how suprasegmental prosodic information is processed 73; (2) how suprasegmental information
interacts with and modulates segmental information 74; and (3) how multimodal cues influence
auditory word recognition 17,58.
A recent, broader theoretical account of human communication (i.e., beyond speech alone)
proposed that two domain-general mechanisms form the core of our ability to understand language:
multimodal low-level integration and top-down prediction 2. The present results are indeed in line
with the idea that multimodal low-level integration is a general cognitive principle supporting
language understanding. More precisely, we here show that this mechanism operates across
communicative modalities. The cognitive and perceptual apparatus that we have at our disposal
apparently does not just bind auditory and visual information deriving from the same communicative
channel (i.e., articulation), but also combines visual information derived from the hands with
concurrent auditory information – crucially to such an extent that what we hear is modulated by
the timing of the beat gestures we see. As such, what we perceive is the model of reality that our
brains provide us by binding visual and auditory communicative input, and not reality itself.
In addition to multimodal low-level integration, earlier research has indicated that top-down
predictions also play a pivotal role in the perception of co-speech beat gestures. It seems reasonable to
assume that listeners throughout life build up the implicit knowledge that beat gestures are
commonly tightly paired with acoustically prominent verbal information. As the initiation of a beat
gesture typically slightly precedes the prominent acoustic speech feature 40, these statistical
regularities are then indeed used in a top-down manner to allocate additional attentional resources
to an upcoming word 44.
The current first demonstration of the influence of simple ‘flicks of the hands’ on low-level
speech perception should be considered an initial step in our understanding of how bodily gestures
in general may affect speech perception. Future research may investigate whether the temporal
alignment of non-manual gestures, such as head nods and eyebrow movements, also impacts
what speech sounds we hear. Moreover, other behavioral consequences beyond lexical
stress perception could be implicated, ranging from phase-resetting neural oscillations 44 with
consequences for speech intelligibility and phoneme identification 75,76, to facilitating selective
attention in ‘cocktail party’ listening 77–79, word segmentation 80, and word learning 81.
Finally, the current findings have several practical implications. Speech recognition systems
may be improved by including multimodal signals, even beyond articulatory cues on the face.
Presumably beat gestures are particularly valuable multimodal cues because, while cross-language
differences exist in the use and weighting of acoustic suprasegmental prosody (e.g., amplitude, F0,
duration) 82–84, beat gestures could function as a language-universal prosodic cue. Also, the field
of human-computer interaction could benefit from the finding that beat gestures support spoken
communication, for instance by enriching virtual agents and avatars with human-like gesture-to-
speech alignment. The use of such agents in immersive virtual reality environments will in turn
allow for taking the study of language comprehension into naturalistic, visually rich and dynamic
settings while retaining the required degree of experimental control 85. By acknowledging that
spoken language comprehension involves the integration of auditory sounds, visual articulatory
cues, and even the simplest of manual gestures, it will be possible to significantly further our
understanding of the human capacity to communicate.
MATERIALS AND METHODS
Participants
Native speakers of Dutch were recruited from the Max Planck Institute’s participant pool
(Experiment 1: N = 20; 14 females (f), 6 males (m); Mage = 25, range = 20-34; Experiment 2: N =
26; 22f, 4m; Mage = 23, range = 19-30; Experiment 3: N = 26; 25f, 1m; Mage = 24, range = 20-28;
Experiment 4 was performed by the same individuals as those in Experiment 3). All gave informed
consent as approved by the Ethics Committee of the Social Sciences department of Radboud
University (project code: ECSW2014-1003-196). All research was performed in accordance with
relevant guidelines and regulations. Participants had normal hearing, had no speech or language
disorders, and took part in only one of our experiments.
Materials
Experiments 1 & 2
For our auditory target stimuli, we used 12 disyllabic pseudowords that are phonotactically legal
in Dutch (e.g., wasol /ʋa.sɔl/; see Table S1 in Supplementary Material). In Dutch, lexical stress is
signaled by three suprasegmental cues: amplitude, fundamental frequency (F0), and duration 69. Since,
in Dutch, F0 is a primary cue to lexical stress in perception, we used 7-step lexical stress continua
– adopted from an earlier student project 86 – varying in mean F0 while keeping amplitude and
duration cues constant, thus ranging from a ‘strong-weak’ prosodic pattern (SW; step 1) to a ‘weak-
strong’ prosodic pattern (WS; step 7).
These continua had been created by recording an SW (i.e., stress on initial syllable) and a WS
version (i.e., stress on last syllable) of each pseudoword from the first author, a male native speaker
of Dutch. We first measured the average duration and amplitude values, separately for the first and
second syllable, across the stressed and unstressed versions: mean duration syllable 1 = 202 ms;
syllable 2 = 395 ms; mean amplitude syllable 1 = 68.1 dB; syllable 2 = 64 dB. Then, using Praat
87, we set the duration and amplitude values of the two syllables of each pseudoword to these
ambiguous values. Subsequently, F0 was manipulated along a 7-step continuum, with the two
extremes and the step size informed by the talker’s originally produced F0. Manipulations were
always performed in an inverse manner for the two syllables of each pseudoword: while the mean
F0 of the first syllable decreased along the continuum (from 159.2 to 110.6 Hz in steps of 8.1 Hz),
the mean F0 of the second syllable increased (from 95.7 to 141.3 Hz in steps of 7.6 Hz). Moreover,
rather than setting the F0 within each syllable to a fixed value, more natural output was created by
including a fixed F0 declination within the first syllable (linear decrease of 23.5 Hz) and second
syllable (38.2 Hz). Pilot data from a categorization task on these manipulated stimuli showed that
these manipulations resulted in 7-step F0 continua that appropriately sampled the SW-to-WS
perceptual space (i.e., from 0.91 proportion SW responses for step 1 to 0.16 proportion SW
responses for step 7).
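As an arithmetic sketch (not the authors' Praat script), the per-step mean F0 targets implied by the endpoints and step sizes reported above can be reproduced as follows:

```python
import numpy as np

# Sketch only: per-step mean F0 targets (Hz) of the 7-step lexical stress
# continua, reconstructed from the endpoints and step sizes in the text.
steps = np.arange(7)                 # continuum steps 1-7 (0-indexed here)
f0_syll1 = 159.2 - 8.1 * steps       # first syllable falls toward WS
f0_syll2 = 95.7 + 7.6 * steps        # second syllable rises toward WS

print(np.round(f0_syll1, 1))         # step 1 = 159.2 Hz ... step 7 = 110.6 Hz
print(np.round(f0_syll2, 1))         # step 1 = 95.7 Hz ... step 7 = 141.3 Hz
```

The inverse manipulation of the two syllables shifts relative prominence across the continuum while keeping each syllable's trajectory linear in mean F0.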
To create audiovisual stimuli, the first author was video-recorded using a Canon XF205 camera
(50 frames per second; resolution: 1280 by 720 pixels) with an external Sennheiser ME64
directional microphone (audio sampling frequency: 48 kHz). Recordings were made from knees
up on a neutral background in the Gesture Lab at the Max Planck Institute for Psycholinguistics.
The speaker produced the Dutch carrier sentence Nu zeg ik het woord... kanon “Now say I the
word... cannon”, with lexical stress on the second syllable of kanon. Concurrently, he produced
three beat gestures in this sentence, with their apex aligned to the syllables Nu, woord, and –non.
The manipulated auditory pseudowords described above were combined with this audiovisual
recording using ffmpeg (version 4.0; available from http://ffmpeg.org/) by (1) removing the
original kanon target word, (2) inserting a silent interval, and (3) inserting the manipulated
pseudowords (cf. Figure 1A). By modulating the length of the intervening silent interval, the onset
of either the first or the second vowel of the target pseudoword was aligned to the final beat
gesture’s apex. Furthermore, the facial features of the talker were masked to hide any visual
articulatory cues to lexical stress. Finally, in order to mask the cross-spliced nature of the audio,
combining recordings with variable room acoustics, the silent interval and pseudoword were mixed
with stationary noise from a silent recording of the Gesture Lab.
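The alignment step can be sketched as simple timing arithmetic (hypothetical timings; the actual splicing was done in ffmpeg): the silent interval is chosen such that the onset of the targeted vowel coincides with the final gesture apex.

```python
# Sketch only, with hypothetical timings in seconds: choose the silent
# interval so that a given vowel onset inside the pseudoword lands on
# the beat gesture's apex.
def silent_interval(apex_time, carrier_offset, vowel_onset_in_word):
    """Silence to insert between carrier offset and pseudoword onset."""
    return apex_time - carrier_offset - vowel_onset_in_word

# e.g., apex at 2.80 s, carrier ends at 2.30 s, first vowel begins 0.10 s
# into the pseudoword -> insert 0.40 s of silence
print(round(silent_interval(2.80, 2.30, 0.10), 2))  # 0.4
```

Aligning the apex to a later vowel onset within the same pseudoword simply shortens the inserted silence by the corresponding amount.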
Experiment 4
We created 8 new disyllabic pseudowords, with either a short /ɑ/ or a long /a:/ as first vowel
(e.g., bagpif /bɑx.pɪf/ vs. baagpif /ba:x.pɪf/; cf. Table S1). This new set had a fixed syllable
structure (CVC.CVC, only stops and fricatives) to reduce item variance and to facilitate syllable-
level annotations. The same male talker as before was recorded producing these pseudowords in
four versions: with short /ɑ/ and stress on the first syllable (e.g., BAGpif); with short /ɑ/ and stress
on the second syllable (bagPIF); with long /a:/ and stress on the first syllable (BAAGpif); with long
/a:/ and stress on the second syllable (baagPIF).
After manual annotation of syllable onsets and offsets, we measured the values of the
suprasegmental cues to lexical stress (F0, amplitude, and duration) in all four conditions.
Pseudowords with long /a:/ and stress on the first syllable were selected for manipulation. First,
we manipulated the cues to lexical stress by setting these to ambiguous values using PSOLA in
Praat: each first syllable was given the same mean value calculated across all stressed and
unstressed first syllables (mean F0 = 154 Hz; original contour maintained; amplitude = 65.88 dB;
duration = 263 ms), and each second syllable was given the same mean value across all stressed
and unstressed second syllables (mean F0 = 159 Hz; original contour maintained; amplitude =
63.50 dB; duration = 414 ms). This resulted in manipulated pseudowords that were ambiguous in
their prosodic cues to lexical stress. Since this included manipulating duration across all recorded
conditions as well, it also meant that the duration cues to the identity of the first vowel were
ambiguous.
Second, the first /a:/ vowel was extracted and manipulated to form a spectral continuum from
long /a:/ to short /ɑ/. In Dutch, the /ɑ-a:/ vowel contrast is cued by both spectral (lower first (F1)
and second formant (F2) values for /ɑ/) and temporal cues (shorter duration for /ɑ/) 71. We decided to create
F2 continua since F2 is not a cue to lexical stress in Dutch (in fact, in our original recordings F2
values did not differ between stressed and unstressed conditions) while varying F2 does influence
vowel quality perception. These spectral manipulations were based on Burg’s LPC method in
Praat, with the source and filter models estimated automatically from the selected vowel. The
formant values in the filter models were adjusted to result in a constant F1 value (750 Hz,
ambiguous between /ɑ/ and /a:/; value based on the original recordings) and 5 F2 values (step 1 =
1325 Hz, step 5 = 1125 Hz, step size = 50 Hz; values based on the original recordings). Then, the
source and filter models were recombined. Finally, the manipulated vowel tokens were spliced
into the pseudowords. Taken together, these manipulations resulted in pseudoword tokens that
were ambiguous in lexical stress (average values of F0, amplitude, and duration), but varied in F2
as cue to vowel identity.
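The formant targets follow the same step logic; a minimal arithmetic sketch (not the authors' Praat LPC procedure):

```python
import numpy as np

# Sketch only: the constant F1 and the five F2 targets (Hz) of the
# /a:/-to-/ɑ/ spectral continuum described in the text.
f1 = 750.0                               # fixed; ambiguous between /ɑ/ and /a:/
f2_steps = 1325.0 - 50.0 * np.arange(5)  # step 1 = 1325 Hz ... step 5 = 1125 Hz

print(f2_steps)  # [1325. 1275. 1225. 1175. 1125.]
```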
To create audiovisual stimuli, these manipulated pseudowords were spliced into the audiovisual
stimuli from Experiment 1. Once again, manipulating the silent interval between carrier sentence
offset and target onset resulted in two different alignments of the last beat gesture to either first
vowel onset or second vowel onset. Like in Experiment 1, facial features of the talker were masked,
and silent intervals as well as manipulated pseudowords were mixed with stationary noise to match
the room acoustics across the sentence stimuli.
Procedure
Experiment 1
Participants were tested individually in a sound-conditioned booth. They were seated at a
distance of approximately 60 cm in front of a 50.8 cm by 28.6 cm screen and listened to stimuli at
a comfortable volume through headphones. Stimulus presentation was controlled by Presentation
software (v16.5; Neurobehavioral Systems, Albany, CA, USA).
Participants were presented with two blocks of trials: an audiovisual block, in which the video
and the audio content of the manipulated stimuli were concomitantly presented, and an audio-only
control block, in which only the audio content but no video content was presented. Block order
was counter-balanced across participants. On each trial, the participants’ task was to indicate
whether the sentence-final pseudoword was stressed on the first or the last syllable (2-alternative
forced choice task; 2AFC).
Before the experiment, participants were instructed to always look at the screen during trial
presentations. In the audiovisual block, trials started with the presentation of a fixation cross. After
1000 ms, the audiovisual stimulus was presented (video dimensions: 960 x 864 pixels). At stimulus
offset, a response screen was presented with two response options on either side of the screen, for
instance: WAsol vs. waSOL; positioning of response options was counter-balanced across
participants. Participants were instructed that capitals indicated lexical stress, that they could enter
their response by pressing the “Z” key on a regular keyboard for the left option and the “M” key
for the right option, and that they had to give their response within 4 s after stimulus offset;
otherwise a missing response was recorded. After their response (or at timeout), the screen was
replaced by an empty screen for 500 ms, after which the next trial was initiated automatically. The
trial structure in the audio-only control block was identical to the audiovisual block, except that
instead of the video content a static fixation cross was presented during stimulus presentation.
Participants were presented with 12 pseudowords, sampled from 7-step F0 continua, in 2 beat
gesture conditions (onset of first or second vowel aligned to apex of beat gesture), resulting in 168
unique items in a block. Within a block, item order was randomized. Before the experiment, four
practice trials were presented to participants to familiarize them with the materials and the task.
Participants were given the opportunity to take a short break after the first block.
Experiment 2
Experiment 2 used the same procedure as Experiment 1, except that only audiovisual stimuli
were used. Also, instead of making explicit 2AFC decisions about lexical stress, participants were
instructed to simply repeat back the sentence-final pseudoword after stimulus offset, and to hit the
“Enter” key to move to the next trial. No mention of lexical stress was made in the instructions.
Participants were instructed to always look at the screen during trial presentations.
At the beginning of the experiment, participants were seated behind a table with a computer
screen inside a double-walled acoustically isolated booth. Participants wore a circum-aural
Sennheiser GAME ZERO headset with attached microphone throughout the experiment to ensure
that amplitude measurements were made at a constant level. Participants were presented with 12
pseudowords, sampled from step 1, 4, and 7 of the F0 continua, in 2 beat gesture conditions,
resulting in 72 unique items. Each unique item was presented twice over the course of the
experiment, leading to 144 trials per participant. Item order was random and item repetitions only
occurred after all unique items had been presented. Before the experiment, one practice trial was
presented to participants to familiarize them with the stimuli and the task.
Experiment 3
Experiment 3 was run online using PsyToolkit (version 2.6.1) 88 because of limitations imposed by
the coronavirus pandemic. Each participant was explicitly instructed to use headphones and to run the
experiment with undivided attention in quiet surroundings. Each participant was presented with all
shadowed productions from two talkers from Experiment 2, leading to 13 different lists, with two
participants assigned to each list. Stimuli from the same talker were grouped to facilitate talker and
speech recognition, and presented in a random order. The task was to indicate after stimulus offset
whether the pseudoword had stress on the first or the second syllable (2AFC). Two response
options were presented, with capitals indicating stress (e.g., WAsol vs. waSOL), and participants
used the mouse to select their response. No practice items were presented and the experiment was
self-timed.
Note that the same participants who performed Experiment 3 also participated in Experiment 4.
To avoid explicit awareness about lexical stress influencing their behavior in Experiment 4, these
participants always first ran Experiment 4, and only then took part in Experiment 3.
Experiment 4
Experiment 4 was run online using Testable (https://testable.org). Each participant was
explicitly instructed to use headphones and to run the experiment with undivided attention in quiet
surroundings. Before the experiment, participants were instructed to always look and listen
carefully to the audiovisual talker and categorize the first vowel of the sentence-final target
pseudoword (2AFC). Specifically, at stimulus offset, a response screen was presented with two
response options on either side of the screen, for instance: bagpif vs. baagpif. Participants entered
their response by pressing the “Z” key on a regular keyboard for the left option and the “M” key
for the right option. The next trial was only presented after a response had been recorded (ITI =
500 ms). No mention of lexical stress was made in the instructions.
Participants were presented with 8 pseudowords, sampled from 5-step F2 continua, in 2 beat
gesture conditions (onset of first or second vowel aligned to apex of beat gesture), resulting in 80
unique items in a block. Each participant received two identical blocks with a unique random order
within blocks.
Statistical analysis
Experiment 1
Trials with missing data (n = 7; 0.1%) were excluded from analysis. Data were statistically
analyzed using a Generalized Linear Mixed Model (GLMM) 68 with a logistic linking function as
implemented in the lme4 library (version 1.0.5) 89 in R 90. The binomial dependent variable was
participants’ categorization of the target as either having lexical stress on the first syllable (SW;
e.g., WAsol; coded as 1) or the second syllable (WS; e.g., waSOL; coded as 0). Fixed effects were
Continuum Step (continuous predictor; scaled and centered around the mean), Beat Condition
(categorical predictor; deviation coding, with beat gesture on the first syllable coded as +0.5 and
beat gesture on the second syllable coded as -0.5), and Block (categorical predictor with the
audiovisual block mapped onto the intercept), and all interactions. The use of deviation coding of
two-level categorical factors (i.e., coded with -0.5 and +0.5) allows us to test main effects of these
predictors, since with this coding the grand mean is mapped onto the intercept. The model also
included Participant and Pseudoword as random factors, with by-participant and by-item random
slopes for all fixed effects, as advocated by Barr et al. 91.
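The predictor coding described here can be illustrated with a small sketch (hypothetical trial list; the actual models were fit with lme4 in R):

```python
import numpy as np

# Sketch only: coding of the fixed effects for a hypothetical trial list.
step = np.tile(np.arange(1, 8), 2).astype(float)  # continuum steps 1-7, twice
beat = np.repeat(["first", "second"], 7)          # beat gesture condition

# Continuum Step: scaled and centered around the mean
step_c = (step - step.mean()) / step.std()

# Beat Condition: deviation coding (+0.5 / -0.5), so the grand mean maps
# onto the intercept and the coefficient reads as a main effect
beat_dev = np.where(beat == "first", 0.5, -0.5)

print(round(float(step_c.mean()), 10))  # 0.0 (centered)
print(float(beat_dev.mean()))           # 0.0 (balanced deviation coding)
```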
The model showed a significant effect of Continuum Step (β = -1.093, SE = 0.284, z = -3.850,
p < 0.001), indicating that higher continuum steps led to lower proportions of SW responses. It
also showed an effect of Beat Condition (β = 1.142, SE = 0.186, z = 6.155, p < 0.001), indicating
that – in line with our hypothesis – listeners were biased towards perceiving lexical stress on the
first syllable if there was a beat gesture on that first syllable. Since the audiovisual block was
mapped onto the intercept, this simple effect of Beat Condition should only be interpreted with
respect to the audiovisual block. In fact, an interaction between Beat Condition and Block was
observed (β = -0.813, SE = 0.115, z = -7.081, p < 0.001), indicating that the effect of Beat
Condition was drastically reduced in the audio-only control block. No other effects or interactions
were statistically significant.
Experiment 2
All shadowed productions were manually evaluated and annotated for syllable onsets and
offsets. We excluded 11 trials (<0.4%) because no speech was produced, and another 232 trials
(6.2%) because participants produced more phonemes than the original target pseudoword (e.g.,
hearing wasol but repeating back kwasol), which would affect our duration measurements. Thus,
acoustic analyses only involved correctly shadowed productions (n = 2997; 80%) and incorrectly
shadowed productions but with the same number of phonemes (n = 504; 13.5%). Measurements
of mean F0, duration, and amplitude were calculated for each individual syllable using Praat, and
are summarized in Figures S3, S4, and S5 in the Supplementary Material.
Individual syllable measurements of F0 (in Hz), duration (in ms), and amplitude (in dB) were
entered into separate Linear Mixed Models as implemented in the lmerTest library (version 3.1-0)
in R. Statistical significance was assessed using the Satterthwaite approximation for degrees of
freedom 92. The structure of the three models was identical: they included fixed effects of Syllable
(categorical predictor; with the first syllable mapped onto the intercept), Continuum Step
(continuous predictor; scaled and centered around the mean), and Beat Condition (categorical
predictor; with a beat gesture on the first syllable mapped onto the intercept), and all interactions.
The model also included Participant and Pseudoword as random factors, with by-participant and
by-item random slopes for all fixed effects. Model output is given in Table 1 and the combined
predicted estimates of the three models are visualized in Figure 2.
Table 1. Statistical outcomes of the models testing amplitude (dB), F0 (Hz), and duration (ms)
measurements from Experiment 2. Significant effects are highlighted in bold.
                    Amplitude model                 F0 model                        Duration model
                    β      SE    t       p          β       SE    t      p          β       SE     t      p
intercept           71.28  0.89  79.88   <0.001     182.03  6.70  27.16  <0.001     268.29  18.55  14.46  <0.001
Syllable            -4.47  0.62  -7.19   <0.001     -18.95  3.14  -6.03  <0.001     122.00  25.70  4.75   <0.001
Continuum Step      -0.77  0.07  -11.51  <0.001     -7.66   0.84  -9.11  <0.001     0.19    1.07   0.18   0.861
Beat Condition      -0.09  0.10  -0.94   0.352      -0.02   1.16  -0.02  0.984      -7.67   1.54   -4.98  <0.001
Syll * Step         1.88   0.08  23.58   <0.001     18.54   1.01  18.31  <0.001     2.17    1.40   1.55   0.120
Syll * Beat         0.53   0.11  4.65    <0.001     0.31    1.43  0.22   0.830      8.87    1.98   4.49   <0.001
Step * Beat         0.09   0.08  1.10    0.270      0.64    1.00  0.64   0.525      -2.31   1.40   -1.66  0.098
Syll * Step * Beat  -0.13  0.11  -1.19   0.234      -0.91   1.43  -0.64  0.524      1.06    1.98   0.53   0.594
The observed effects of Syllable in all three models demonstrate word-final effects: the last
syllables of words typically have lower amplitude, lower F0, and longer durations (cf. Figure 2).
The effects of Continuum Step confirm that participants adhered to our instructions to carefully
shadow the pseudowords. That is, for higher steps on the F0 continuum (i.e., sounding more WS-
like), the amplitude and F0 of the first syllable (mapped onto the intercept) decreased (see simple
effects of Continuum Step), while amplitude and F0 of the second syllable increased (interactions
Syllable and Continuum Step). Finally, and most critically, the duration of the first syllable
(mapped onto the intercept) decreased when the beat gesture was on the second syllable (simple
effect of Beat Condition; cf. Figure 2). That is, when the beat fell on the second syllable, the first
syllable was more likely to be perceived as unstressed, leading participants to reduce the duration
of the first syllable of their shadowed production. Moreover, when the beat fell on the second
syllable, the second syllable itself had a higher amplitude and a longer duration (interactions
Syllable and Beat Condition; cf. Figure 2). That is, with the beat gesture on the second syllable,
the second syllable was more likely to be perceived as stressed, resulting in louder and longer
second syllables in the shadowed productions.
Experiment 3
Similar to Experiment 1, the categorization responses (SW = 1; WS = 0) were statistically
analyzed using a GLMM with a logistic linking function. Fixed effects were Continuum Step and
Beat Condition (coding adopted from Experiment 1), and their interaction. The model included
random intercepts for Participant, Talker, and Pseudoword, with by-participant, by-talker, and by-
item random slopes for all fixed effects.
The model showed a significant effect of Continuum Step (β = -0.985, SE = 0.162, z = -6.061,
p < 0.001), indicating that higher continuum steps led to lower proportions of SW responses.
Critically, it showed a small effect of Beat Condition (β = 0.232, SE = 0.078, z = 2.971, p = 0.003),
indicating that – in line with our hypothesis – listeners were biased towards perceiving lexical
stress on the first syllable if the shadowed production was produced in response to an audiovisual
pseudoword with a beat gesture on the first syllable. The interaction between Continuum Step and
Beat Condition was marginally significant (β = 0.110, SE = 0.059, z = 1.886, p = 0.059),
suggesting a tendency for the more ambiguous steps to show a larger effect of the beat gesture.
Experiment 4
We analyzed the binomial categorization data (/a:/ coded as 1; /ɑ/ as 0) using a GLMM with a
logistic linking function. Fixed effects were Continuum Step and Beat Condition (coding adopted
from Experiment 1), and their interaction. The model included random intercepts for Participant
and Pseudoword, with by-participant and by-item random slopes for all fixed effects.
The model showed a significant effect of Continuum Step (β = -0.922, SE = 0.091, z = -10.155,
p < 0.001), indicating that higher continuum steps led to lower proportions of long /a:/ responses.
Critically, it showed a small effect of Beat Condition (β = -0.240, SE = 0.105, z = -2.280, p =
0.023), indicating that – in line with our hypothesis – listeners were biased towards perceiving the
first vowel as /a:/ if there was a beat gesture on the second syllable (or perceiving /ɑ/ if the beat
gesture fell on the first syllable). No interaction between Continuum Step and Beat Condition was
observed (p = 0.817).
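The reported analyses were mixed-effects logistic regressions (GLMMs) with random intercepts and slopes; as a simplified, self-contained illustration of the fixed-effect structure only — no random effects, simulated rather than experimental data, and coefficient values borrowed from the estimates reported above — one could simulate binomial categorization data and refit the logistic model by Newton-Raphson:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a plain (fixed-effects-only) logistic regression by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted P(response = 1)
        grad = X.T @ (y - p)                          # score vector
        H = X.T @ (X * (p * (1.0 - p))[:, None])      # observed information matrix
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(1)
# 7-step continuum x 2 beat conditions, 1000 simulated trials per cell
step = np.tile(np.repeat(np.arange(1, 8), 1000), 2)
beat = np.repeat([0.5, -0.5], 7000)                   # sum-coded Beat Condition
X = np.column_stack([np.ones(step.size), step - 4.0, beat])  # intercept, centered step, beat
true_beta = np.array([0.0, -0.985, 0.232])            # fixed effects reported above
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
beta_hat = fit_logistic(X, y)  # recovers a negative step slope and a positive beat effect
```

This sketch omits the random-effects structure entirely (which is essential for valid inference over participants and items); it only makes the fixed-effect coding and the direction of the reported effects concrete.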
DATA AVAILABILITY STATEMENT
The experimental data of this study will be made publicly available for download upon
publication. At that time, they may be retrieved from https://osf.io/b7kue under a Creative
Commons Attribution 4.0 International (CC BY 4.0) license.
ACKNOWLEDGEMENTS
This research was supported by the Max Planck Society for the Advancement of Science,
Munich, Germany (H.R.B.; D.P.) and the Netherlands Organisation for Scientific Research (D.P.;
Veni grant 275-89-037). We would like to thank Giulio Severijnen for help in creating the
pseudowords of Experiments 1-3, and Nora Kennis, Esther de Kerf, and Myriam Weiss for their
help in testing participants and annotating the shadowing recordings.
FIGURE CAPTIONS
Figure 1. Example audiovisual stimulus and results from Experiments 1, 3, and 4. A. In Experiments 1, 2, and
4, participants were presented with an audiovisual speaker (facial features masked in the actual experiment)
producing the Dutch carrier sentence Nu zeg ik het woord... “Now say I the word...” followed by a disyllabic target
pseudoword (e.g., bagpif). The speaker produced three beat gestures, with the last beat gesture’s apex falling in the
target time window. Two auditory conditions were created by aligning the onset of either the first (top waveform in
blue) or the second vowel of the target pseudoword (bottom waveform in orange) to the final gesture’s apex (gray
vertical line). B. Participants in Experiment 1 categorized audiovisual pseudowords as having lexical stress on either
the first or the second syllable. Each target pseudoword was sampled from a 7-step continuum (on x-axis) varying
F0 in opposite directions for the two syllables. In the audiovisual block (solid lines), participants were significantly
more likely to report perceiving lexical stress on the first syllable if the speaker produced a beat gesture on the first
syllable (in blue) vs. second syllable (in orange). No such difference was observed in the audio-only control block
(transparent lines). C. Participants in Experiment 3 categorized the audio-only shadowed productions, recorded in
Experiment 2, as having lexical stress on either the first or the second syllable. If the audiovisual stimulus in
Experiment 2 involved a pseudoword with a beat gesture on the first syllable, the elicited shadowed production was
more likely to be perceived as having lexical stress on the first syllable (in blue). Conversely, if the audiovisual
stimulus in Experiment 2 involved a pseudoword with a beat gesture on the second syllable, the elicited shadowed
production was more likely to be perceived as having lexical stress on the second syllable (in orange). D.
Participants in Experiment 4 categorized audiovisual pseudowords as containing either short /ɑ/ or long /a:/ (e.g.,
bagpif vs. baagpif). Each target pseudoword contained a first vowel that was sampled from a 5-step F2 continuum
from long /a:/ to short /ɑ/ (on x-axis). Moreover, prosodic cues to lexical stress (F0, amplitude, syllable duration)
were set to ambiguous values. If the audiovisual speaker produced a beat gesture on the first syllable (in blue),
participants were biased to perceive the first syllable as stressed, making the ambiguous first vowel relatively short
for a stressed syllable, leading to a lower proportion of long /a:/ responses. Conversely, if the audiovisual speaker
produced a beat gesture on the second syllable (in orange), participants were biased to perceive the initial syllable as
unstressed, making the ambiguous first vowel relatively long for an unstressed syllable, leading to a higher
proportion of long /a:/ responses. For all panels, error bars enclose 1.96 x SE on either side; that is, the 95%
confidence intervals.
Figure 2. Predicted estimates from the combined models of Experiment 2. In Experiment 2, participants
shadowed the audiovisual sentence-final pseudowords in either Beat Condition. The figure shows stylized wave
forms for the two syllables of the shadowed productions, on either side of the zero point on the x-axis, showing their
durations (ms), aligned around the onset of the second syllable (at 0). The bold horizontal error bars at the extremes
of the stylized waveforms indicate duration SE. The vertical position of the (black rectangles surrounding the)
stylized waveforms shows the produced F0 (in Hz on y-axis; height of rectangles indicates F0 SE on either side of
the midpoint). The color gradient and height of the stylized waveforms indicate average amplitude (dB). The
overall differences between first and second syllables show pronounced word-final effects: the last syllable of a
word typically has a longer duration, and a lower F0 and amplitude. Critically, after audiovisual trials with a beat
gesture on the second syllable (lower panel; compared to a beat gesture on the first syllable in the upper panel),
shadowers produced their first syllables with a shorter duration and their second syllables with a longer duration and
higher amplitude. No effect of F0 was observed.
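The error bars described in the captions (1.96 x SE on either side of the mean) correspond to normal-approximation 95% confidence intervals. As a minimal sketch of that computation — assuming, purely for illustration, a binomial SE over pooled trials at one continuum step rather than the by-participant SE the figures may use:

```python
import numpy as np

def prop_ci_95(responses):
    """95% CI for a response proportion via the normal approximation (p +/- 1.96 * SE)."""
    p = responses.mean()
    se = np.sqrt(p * (1.0 - p) / responses.size)  # SE of a binomial proportion
    return p - 1.96 * se, p + 1.96 * se

rng = np.random.default_rng(0)
# e.g., 400 simulated binary categorization responses at one continuum step
responses = rng.binomial(1, 0.7, size=400)
lo, hi = prop_ci_95(responses)  # interval is centered on the observed proportion
```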
[Figure 1 appears here: panel A, the audiovisual stimulus ("Nu zeg ik het woord..."); panels B-D, categorization results for Experiments 1, 3, and 4 (x-axes: lexical stress continuum, steps 1-7, or F2 continuum, steps 1-5; y-axes: P(stress on 1st syllable) or P(long /a:/)).]
[Figure 2 appears here: stylized waveforms of shadowed productions by Beat Condition (axes: duration in ms, F0 in Hz, amplitude in dB; annotated condition differences: first syllable -8 ms; second syllable +0.5 dB, +9 ms).]