
A study on audio-visual interaction: How visual temporal cues influence grouping of auditory events

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

in the Graduate School of The Ohio State University

By

Yu Fan

Graduate Program in Music

The Ohio State University

2019

Dissertation Committee

Udo Will, Advisor

Arved Ashby

Marjorie K.M. Chan


Copyrighted by

Yu Fan

2019


Abstract

Music presents us with a complex, rapidly changing acoustic spectrum, and when presented with such stimuli, the listener spontaneously groups acoustic elements together. The real-world experience of rhythm is often multisensory, involving both the auditory and visual systems, such as when we observe drumming. However, the manner in which auditory grouping behavior is influenced by concurrent temporal cues from visual stimuli is not yet fully known. This study investigates audio-visual interaction in rhythm perception. The central questions addressed are whether visual rhythm influences humans' rhythmic grouping of auditory sequences, and whether musical training intensifies or weakens this influence.

Two experiments were conducted. The preliminary experiment was designed to identify the visual stimuli, natural versus artificial, that would be appropriate to the inquiry. Participants were recruited for a synchronized tapping test with natural visual stimuli and artificial visual stimuli. The stability of synchronization and the rate of re-synchronization were measured in a synchronization task. The results show that participants' synchronization performance was better with natural visual stimuli, which justified the use of a natural visual stimulus as the rhythmic stimulus in the main study on temporal perception.


The main study tested audio-visual interaction through a grouping task.

Trained musicians and non-musicians were recruited. In the experiment, participants were asked to perform a grouping task by tapping to a cyclical 12-tone drumbeat sequence with different stress patterns that indicated different grouping structure tendencies (either duplets or triplets). While listening, they were asked to simultaneously watch visual stimuli 1) with duple cueing gestures; 2) with triple cueing gestures; or 3) without any cue. The results show that subjects' duple/triple grouping choice is significantly related to the concurrent visual stimuli with duple/triple cues, and that visual rhythm had a much greater influence on musicians' auditory grouping behavior than on that of non-musicians.

The result of the first experiment, which incorporated novel and realistic visual stimuli in a rhythm perception task, bolsters the rationale for using closer-to-reality experimental designs. The result of the main study demonstrated and highlighted the role of visual information in multisensory rhythm perception. Moreover, the current study also found that the audio-visual interaction in rhythmic grouping is correlated with long-term music training, and this finding contributes to the body of knowledge on the process of rhythmic perception.


Dedication

This dissertation is dedicated to my family, whose support and love along my PhD journey have been an anchor and an inspiration.


Acknowledgments

In my PhD journey, I received a lot of care and help from teachers, classmates, and friends. Here I would like to express my most sincere gratitude to those who have supported me, helped me, and encouraged me throughout this journey.

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr. Udo Will, a respectable, responsible, and resourceful scholar, who has provided me with valuable guidance in every stage of my research. Without his enlightening instruction and humbling kindness, I could not have completed my thesis.

I also extend my thanks to the rest of my committee, Dr. Chan and Dr. Ashby, for their insightful comments and encouragement; to the musicology department at OSU and the School of Music, for providing an inclusive academic atmosphere that allowed me to learn a lot throughout my PhD journey; and to all the experiment participants who joined this study with a cooperative spirit.

My sincere appreciation also goes to the four-year scholarship from the Chinese government, which has supported me in completing my PhD program.

Last but not least, I'd like to express my thanks to my beloved family for their loving consideration all these years, especially my husband, Zhe Dong, who has faithfully supported and encouraged me during my study.


Vita

2007–2012...... B.A. Musicology, China Conservatory

2012–Present ...... Ph.D., The Ohio State University, Columbus

Publications

Fan, Y., & Will, U. (2016). Musicianship and tone-language experience differentially influence pitch-duration interaction. Proceedings of the 14th ICMPC, San Francisco.

Fields of Study

Major Field: Music

Area of Emphasis: Cognitive Ethnomusicology


Table of Contents

Abstract ...... ii
Dedication ...... iv
Acknowledgments ...... v
Vita ...... vi
List of Tables ...... x
List of Figures ...... xi
Chapter 1. Introduction ...... 1
1.1. Distinction Between Sensation and Perception ...... 2
1.2. Rhythm Perception is a Multisensory Experience ...... 7
1.3. Embodiment Paradigm in Rhythm Perception Study ...... 8
1.4. Summary ...... 14
Chapter 2. Clock-Based Timing and Interval Perception ...... 15
2.1. Interval as an Element of Auditory Rhythmic Pattern ...... 17
2.2. Mechanism and Variability of Interval Perception ...... 19
2.3. Interaction Between Perceived Non-Temporal Factors and Interval Perception ...... 23
2.4. Visual Interval Perception ...... 29
2.5. Summary ...... 31
Chapter 3. Entrainment-Based Timing and Pulse Perception ...... 33
3.1. Pulse as a Perceptual Phenomenon ...... 33
3.2. Entrainment and Pulse Perception ...... 35
3.2.1. Entrainment ...... 35
3.2.2. Entrainment-Based Theories and Pulse ...... 39
3.3. Sensorimotor Links in Pulse Perception ...... 43
3.3.1. Sensorimotor Synchronization to Auditory Stimuli ...... 45
3.3.2. Sensorimotor Synchronization to Visual Stimuli ...... 46

Chapter 4. Grouping in Rhythm Perception ...... 49
4.1. Discussions on the Structure of African Rhythm ...... 50
4.2. "Ambiguity" of Grouping in African Rhythm Perception ...... 56
4.3. Repetition-Based Approach for Grouping ...... 58
Chapter 5. Audio-Visual Interaction and Experiment Hypothesis ...... 64
5.1. Previous Studies on Audio-Visual Interaction in Rhythm Perception ...... 65
5.1.1. Alternative Visual Stimuli ...... 67
5.1.2. Alternative Auditory Rhythmic Pattern ...... 72
5.2. Present Research Questions ...... 73
5.2.1. Hypothesis of More Efficient Visual Stimulus ...... 73
5.2.2. Hypothesis of Visual Influence on Auditory Grouping Perception ...... 75
5.3. Summary ...... 76
Chapter 6. Experiment on Natural/Artificial Visual Rhythm ...... 78
6.1 Participants ...... 78
6.2 Material and Stimuli ...... 79
6.3 Procedure ...... 84
6.4 Data Collection ...... 85
6.5 Synchronization Success Rate Analysis and Data Filtering ...... 87
6.5.1. Synchronization Success Rate Analysis ...... 87
6.5.2. Results of Synchronization Success Rate Analysis ...... 88
6.6 Entrainment Performance Analysis ...... 90
6.6.1. Synchronization Performance Analysis and Results ...... 90
6.6.2. Error-Correction Performance Analysis and Results ...... 97
6.7 Discussion ...... 105
6.7.1. Effect of Different Visual Forms (Natural/Artificial) on Entrainment ...... 106
6.7.2. Interaction between Perturbation Type and Position ...... 108
6.8 Conclusion ...... 109
Chapter 7. Experiment of Visual Influence on Auditory Grouping ...... 110
7.1 Material and Stimuli ...... 110
7.1.1. Auditory Stimuli ...... 110
7.1.2. Visual Stimuli ...... 113
7.1.3. Trial Design ...... 114


7.1.4. Real-World Stimuli ...... 117
7.2 Participants ...... 118
7.3 Procedure ...... 120
7.4 Processing ...... 121
7.5 Analysis and Results ...... 124
7.5.1. Analysis I: Grouping Responses Analysis ...... 124
7.5.2. Analysis II: Response Times (RTs) and Grouping Difference ...... 132
7.6 Additional Analyses and Results ...... 134
7.6.1. Grouping Responses to Ambiguous Auditory Patterns ...... 134
7.6.2. Mean Visual Influence and Specific Music Experience ...... 135
7.7 Discussion ...... 137
7.7.1. Grouping preference and switch tendency ...... 137
7.7.2. Different visual influence across different auditory patterns ...... 140
7.7.3. RTs difference between musicians and non-musicians ...... 141
7.7.4. Visual influence on grouping choice and music training experience ...... 141
Chapter 8. Major Findings and General Discussion ...... 143
8.1 Implications for cognitive studies ...... 145
8.2 Implications for music ...... 147
8.3 Implications for musicology ...... 148
Bibliography ...... 150
Appendix A. ANOVA Table on Circular Variance ...... 171
Appendix B. ANOVA Table on PCR ...... 172
Appendix C. ANOVA Table on Recovery Rate ...... 174
Appendix D. ANOVA Table on Bscore ...... 176
Appendix E. Questionnaire ...... 177


List of Tables

Table 4.1. Cyclic calculation of binarity/ternarity tendencies of rhythmic pattern XOXXOXOXXOXO ...... 61

Table 6.1. Structure of one stimulus trial ...... 83

Table 6.2. Vector length in different conditions and tempi ...... 93

Table 7.1. Five rhythmic patterns derived from the African drum patterns ...... 111

Table 7.2. Descriptive statistics for non-musicians and musicians (each N = 12) ...... 119

Table 7.3. Trials in which grouping choice changed under different conditions ...... 123

Table 7.4. Results of the LME model ...... 132


List of Figures

Figure 2.1. Timescales of temporal processing...... 16

Figure 2.2. Illustration of rhythmic pattern in soundwave representation: isochronous temporal structure marked by loud-soft-loud sounds...... 19

Figure 2.3 A schema of a prototypical SET model for short interval perception...... 20

Figure 2.4 Pitch contour of four different tones in Mandarin...... 25

Figure 2.5. Statistical result of behavior experiment...... 28

Figure 3.1. Representation of two drum rhythmic patterns: Serra Acima (left) and Dobrado (right)...... 37

Figure 3.2. Illustration of the Mocambique (AM) and Congo (AC) groups becoming entrained...... 38

Figure 3.3. Phase relationships between internal oscillation and frequency of external stimuli...... 41

Figure 3.4. Schematic illustration of sensory entrainment...... 42

Figure 4.1. The Kadodo section in Adzogbo, a dance in Accra...... 53

Figure 4.2. Excerpt from Kagan drumming...... 54

Figure 4.3. A rhythmic sequence containing a period of 12 operational values...... 57

Figure 4.4. Illustration of two possible triple groupings, which start from different notes of the same cyclically presented rhythmic sequence...... 61

Figure 5.1 Illustration of flashing lights as visual stimuli in temporal perception studies...... 68

Figure 5.2 Illustration of moving bouncing ball as visual stimuli in temporal perception study...... 69

Figure 5.3 Illustration of PLF as visual stimuli in delivering temporal information...... 71


Figure 6.1. Part A. Natural human visual stimuli (VN); Part B. Artificial PLF visual stimuli (VP) ...... 81

Figure 6.2. Illustration of the procedure of the registration of the response data...... 86

Figure 6.3. Percentage of successful trials in the test...... 89

Figure 6.4. Phase of the mean vectors grouped by different conditions and tempi...... 92

Figure 6.5. Two-dimensional graph of the mean variance for AVN and AVP in three tempi (F, P, S)...... 94

Figure 6.6. Two-dimensional graph of the mean variance for VN and VP in three tempi (F, P, S)...... 95

Figure 6.7. Two-dimensional graph of the mean variance for AV and V in three tempi (F, P, S)...... 96

Figure 6.8. Mean relative asynchrony for four modalities by three tempi...... 97

Figure 6.9. Early perturbation...... 99

Figure 6.10. Delayed perturbation...... 99

Figure 6.11. Alpha estimation for two types of perturbation by (Modality × Tempi)...... 101

Figure 6.12. Alpha estimation for two types of perturbations by (Position × Tempi)...... 102

Figure 6.13. Recovery rate grouped by conditions and tempi...... 104

Figure 6.14. Recovery rate grouped by perturbation types and positions...... 105

Figure 7.1. One cycle of pattern 3 as a soundwave...... 112

Figure 7.2. Illustration of the duple pattern and triple pattern in visual stimuli...... 114

Figure 7.3. Timeline of combined auditory stimuli with duple visual cue...... 116

Figure 7.4. Timeline of combined auditory stimuli with triple visual cue...... 116

Figure 7.5. Summary of the Bscore by different visual cues...... 125

Figure 7.6. Summary of the Bscore in five auditory patterns...... 126

Figure 7.7. Bscore across different conditions of visual cues between musician group and non-musician group...... 127


Figure 7.8. Summary of the Bscore of the responses grouped by the five auditory patterns and three visual timing cue conditions...... 129

Figure 7.9. Ddegree and Tdegree by patterns...... 131

Figure 7.10. Mean Response Times (RTs) by groups and visual conditions...... 133

Figure 7.11. Summary of the mean Bscore of the response and the normalized binarity of five auditory patterns...... 135

Figure 7.12. Correlation between the approximate practice of musicians and mean visual influence...... 137


Chapter 1. Introduction

This study was inspired by a real-life experience: at a private party, the author shared a YouTube video clip of African percussion music. The crowd spontaneously started to swing and clap with the rhythm but, to the author's surprise, in different ways.

The individuals standing near the bar—for whom the screen was not in sight—tapped or moved to a different beat than those who were in the living room and could see the screen.

Although both groups were reacting to the same piece of music, the way they responded was different, and the only difference in their experience of the music was that one group had visual information of the music and the other did not. Another interesting thing was that, unlike at an orchestra concert, where the crowd finds consensus on how to move with the music, each group was confident in its movements and the two did not merge into one.

Although listening is mainly located in the auditory domain, such as when we listen to an MP3 or a CD, the experience of music can also be multimodal. For instance, visual information works as a timing cue in various musical activities, such as between ensemble players and the conductor, in a duet or trio performance, and sometimes even between performers and audiences in live performance. In this work, the author investigated how vision affects rhythm perception in our hearing experience.


Given that this is a multidisciplinary work, the first chapter will clarify some concepts central to the dissertation that may be understood differently across disciplines.

First, objective rhythmic stimuli are not equal to our perceived rhythm, because how we interpret the information we sense is a subjective experience influenced by various factors. Secondly, rhythm perception is a multisensory experience, and temporal information can be transferred across different sensory modalities. Thirdly, the perception and action systems are not isolated but are tightly linked in rhythm perception. These assumptions, forming the foundation upon which this research was conducted, are presented in this chapter by clarifying the distinction between sensation and perception and the visual influence on auditory perception in the context of the paradigm of embodied cognition.

1.1. Distinction Between Sensation and Perception

Human beings are equipped with various sensory organs—typically including the eyes, ears, nose, tongue, and skin—that enable them to receive information from the environment, and that information allows them to interact effectively with the world. The raw information received through the sensory organs is referred to as sensation, and the understanding that individuals derive from the sensory inputs is called perception. This distinction, though widely accepted today, was not always apparent in human history.

The nature of sensory perception has been a subject of human inquiry since ancient Greece and Rome. In the Republic, for instance, Plato uses the allegory of the cave to explain that it is through our sensory organs that we perceive the world of appearances. He goes on to caution that the perceived world is falsehood, because it is only the shadow of the real world, and that the reality of the world, or the truth, can only be perceived through the spirit/mind. Although this philosophical argument can never be tested, the gap between what we sense and what we perceive that Plato describes appears to be an experience common to many. Following Plato, Aristotle sought to explain the mechanism of perception from a physical perspective, in which "perception comes about with [an organ's] being changed and affected . . . for it seems to be a kind of alteration" (Aristotle, trans. 2002, De Anima ii 5, 416b33–34). In his theory, a certain external quality is received and reproduced in the sensory organs. By way of illustration, if a person puts his hand into hot water to sense its temperature, the hand itself becomes warm during the process; and if a person looks at an object to sense its red color, a little red image could be spotted in her eyeball. Based on Aristotle's theory, the sensory organ has the capacity, at least potentially, to take on the quality being sensed (Shields, 2016). He also developed the first system of categorizing our senses—sight, hearing, taste, smell, and touch—which remains widely acknowledged and applied in modern research. However, the process through which the sensory organ perceives external stimuli was yet to be clearly explained in terms of physiological mechanisms.

The early ventures into the connection between the perceived world and the real world involved little empirical demonstration. In modern times, with the developments in physiology and neuroscience, a growing body of studies has revealed the physiological basis and underlying neuronal mechanisms of sensory perception. Thus, our understanding of sensory perception, formerly considered miraculous, has gained increasing clarity, at least scientifically. Between the two distinct but complementary processes—sensation and perception—we have a greater understanding of the sensation process, which brings information from the outside world into our brain, while we understand the perceptual process to provide meaning to the sensory input. In any sensory system, the upward processing (taking in external information) and the downward processing (subjectively interpreting the received information) largely overlap, and there is no sharp neuroanatomical boundary between sensation and perception.

One illustration of the sensation process is aural sensation. Sound originates from the air pressure changes caused by the vibration of an object. It propagates in the physical form of a combination of sine waves through media such as air, liquids, or solids to reach our inner ear. One way of achieving such transmission is through the outer and middle ears.

The sound waves travel through the ear canal until they reach the eardrum, which vibrates with the sound. The vibration of the eardrum passes through the middle ear bones before reaching the inner ear. The vibration of the cochlear fluid causes displacement of the basilar membrane and is converted into electrochemical signals through the deflection of the hair cells on the basilar membrane. Another way in which sound waves reach the cochlea is through the bones of the skull: bone-conduction devices perform the role of the eardrum, converting sound waves into vibrations that can be received directly by the cochlea. Both of these methods of transmission may concurrently stimulate the hair cells in the inner ear. Through the central auditory pathway, these neural signals reach a specific area of the human brain, primarily the auditory cortex, which we believe is the endpoint of sensation. With the reception of auditory input, perception processes are activated to interpret it into abstract meanings. For instance, the frequency of a sound wave is perceived as pitch, and the intensity of the sound wave is perceived as volume. It must be noted that some perception processes already take place along the transmission pathway; for instance, at the hind- and midbrain level we "perceive," say, the direction the sound comes from.

Another example of the sensory process is visual sensation. External objects can be seen because they either reflect light from luminous bodies such as the sun or fire, or they are luminous themselves. The light travels in the form of photons and reaches the cornea, which covers the front of the eyes. The cornea focuses the light toward the retina at the back of our eyes. The photoreceptive cells of the retina, known as the rods and cones, detect the photons and respond by producing neural impulses. These signals are then sent to a specific area of our brain, the primary visual cortex, which is the endpoint of the sensation of visual stimuli.

Then, in the perceptual process, the received visual inputs are translated into meanings; for instance, the wavelength of light is perceived as color, and the amplitude and length of light waves are perceived as brightness.

Generally, sensation is a process whereby external physical stimuli are transduced into neural signals by sensory receptors. Perception refers to the process in which the brain reprocesses, organizes, and interprets the sensory input to "translate" it into something "meaningful." For example, the physical qualities of sound—such as frequency, amplitude, and spectral complexity—are perceived as pitch, volume or loudness, and timbre. To some degree, sensation is a passive process that is initiated when the sensory receptor takes in information emitted from an outside physical stimulus, while perception is an active process, in which our brain or mind selectively participates to process the information of sensory inputs. For instance, as long as a person is not hearing-impaired, hearing simply happens as the ear automatically picks up sound information. Listening, however, requires concentration so that the brain derives meaning from the sound, such as the meaning of the words and sentences, or even the emotion expressed in the sound. In short, a human can participate in the experience of hearing sounds without being consciously engaged in listening.

Perception and cognition are often regarded as interchangeable terms because they both denote the abstract mental processing of sensory inputs. The term cognition is occasionally distinguished from perception as denoting higher-level abstraction, such as abstracting meaning, emotion, structural information, etc., from language, literature, or music. Further, following the creation of the new discipline of "cognitive science" in the mid-20th century, the term cognition is more often used as an umbrella term for mental processes in the context of scientific inquiry. In this study, perception generally denotes the subjective mental processing of objective sensory inputs, and cognition is generally used to denote a high-level structural understanding.

How we translate sensory input into “meanings” remains an open question because of our limited methodologies for investigating mental processes. Although the underlying mechanism is not completely clear, studies have identified various factors that shape and sometimes distort the process of sensory perception (Fiske, 2013). Three different types of factors are highlighted in this thesis. The first is interactions within one sensory modality.

The question is, for example, how features of auditory stimuli influence the way we group auditory events. The second type concerns interactions across sensory modalities, where the perception of information from one sensory modality affects the way we perceive information from another modality. Here the question is, for example, whether visual temporal information influences auditory grouping perception. A third type of factor is an observer's previous experience; that is, different observers' perceptions of identical stimuli can vary depending on the observers' backgrounds. In this thesis, one of the questions is whether the degree of visual influence on auditory rhythm perception is affected by music training. Throughout this thesis, the author attempts to identify instances of these factors in the perception process under investigation.

1.2. Rhythm Perception is a Multisensory Experience

Unlike other elements in music that are modality-specific (especially with respect to the auditory modality)—such as melody, harmony, etc.—rhythm uniquely engages more than one of our sensory domains. Rhythm is created by a temporally arranged sequence of events—involving aural, visual, or tactile objects—that stimulate the sensory organs. It can be perceived through auditory, visual, or tactile stimulus streams. For example, when a performer strikes a drum, the audience can perceive the drumming rhythm by hearing the drum sounds, watching the drummer's gestures, feeling the vibrations, or physically touching the performer's arm.

Although temporal perception involves multisensory systems, the majority of previous studies on rhythm perception have focused on the auditory domain, resulting in a dearth of scholarship on the role of other sensory modalities in temporal perception.

According to the physiological studies mentioned in Section 1.1, auditory and visual pathways are separable; but given the amodal character of rhythm perception, one general question is: what is the relationship between the temporal information perceived from auditory stimuli and that perceived from visual stimuli? For the multisensory temporal perception experience, does temporal information in the two pathways become combined at all, and at what level of processing? The answers to these questions remain murky, and this study seeks to help bridge this gap in the literature by investigating two major modalities, auditory and visual, in the process of temporal perception.

While the body of literature offers little insight on this topic, a few studies in neuroscience have generated evidence supporting the argument that auditory and visual information is connected at the midbrain level (Stein, Stanford & Rowland, 2009) and at the cortical level (Vander Wyk et al., 2010). Assuming that the integration of perceived auditory temporal information and visual temporal information does exist, we could then explore the cross-modal influences that occur during the rhythm perception process. Previous studies on audio-visual interactions mainly explored auditory influence on visual temporal perception (Repp & Penel, 2002; Grahn, Henry, & McAuley, 2011), but not whether perceived visual temporal information influences auditory temporal perception or how the two modalities interact in processing sensory information. The current study aims to test audio-visual interaction by exploring whether perceived visual temporal information influences auditory temporal perception.

1.3. Embodiment Paradigm in Rhythm Perception Study

In an effort to understand how we process high-level perception, such as language comprehension and action understanding, the discipline of cognitive science was established in the mid-1950s, with the aim of explaining the way humans make sense of complex sensory data at an abstract and conceptual level. The discipline emerged from multidisciplinary scholarship involving philosophy, psychology, neuroscience, linguistics, anthropology, etc., and researchers in this field soon began to develop theoretical frameworks for understanding cognitive abilities (Thagard, 2014).

Classical theories of cognition, inspired by the developments of computer science and artificial intelligence, adopt an information-processing approach, which prevailed in the 20th century. According to these frameworks, the mind is an information processor, and mental processes are independent, abstract formal processes (Clark, 2011; Rowlands, 2013; Shapiro, 2012). In this view, after the sensory information input from the external world is perceived (low-level perception), it is translated into a syntactic code of meaningful symbols and processed according to a systematic set of rules (the high-level cognitive process), resulting in output as action. The traditional cognitive paradigm considers bodily actions and other behaviors as outcomes of high-level symbol processing, serving no cognitive function other than representing the result.

At the end of the 20th century, the theory of embodied cognition emerged. Within this new theory, researchers began to explore connections between the structure of mental processes and physical embodiment. Unlike traditional theories of cognition, embodied cognition asserts that the body and its movements shape our cognitive experience. Within the embodied cognition framework, the planning or execution of an action and its related sensory perception are coded in a shared representation in the brain; whenever one of these two components is activated, both motor and sensory areas in the brain are recruited (Prinz, 1997; Hommel, Müsseler, Aschersleben, & Prinz, 2001). In the past 20 years, an increasing body of evidence from neuroscience has supported the embodied cognition paradigm (hereafter the EC paradigm). Sensory processes (perception) and motor processes (action), assumed to have evolved together, are therefore seen as fundamentally inseparable, mutually informative, and structured so as to ground conceptual systems (Varela, Rosch, & Thompson, 2016; Gibbs, 2010; Gallagher, 2013).

Music provides an especially interesting subject for cognitive studies because much of musical behavior is non-linguistic in nature. How do we find the regular beat in a complex sound sequence, and how are our emotions inspired by certain music? Music-related cognitive processes are hard to explain through rational symbol-based analysis.

Within the EC paradigm, perception is not only a cerebral but a corporeally grounded process (Clayton, Sager, & Will, 2005). From the listener's perspective, music evokes body movements, and these movements are transformed into sensory input that helps the listener understand the music. The EC perspective on music perception implies the multimodal encoding of sensory information and the coupling of perception and action. Therefore, to study the way our mind makes sense of music is to study embodied musical activities. For example, studies have been done in which subjects in a laboratory did not simply listen passively to music while sitting motionless. Rather, they were instructed to give motor responses while listening, which allowed the researchers to collect behavioral feedback—not verbal feedback—for analysis. One of the most widely applied methods to test human rhythm perception is the task of finger tapping to a musical rhythm (a detailed review can be found in Repp, 2005). Following a similar design, this study adopted finger-tapping tasks in multisensory rhythm perception experiments. The temporal features of subjects' movements during listening and watching are analyzed as perceptual responses.

Furthermore, the EC paradigm recognizes cognition as something both empirically available and socially constructed, so that the human bodily experience also shapes the embodied cognition process. Our brain is adaptable and plastic (Rakic, 2002), and its cognitive processes are constantly shaped by our bodily experiences throughout our lives.

For instance, it has been shown that long-term immersion in a specific cultural environment changes the way our mind works through the formation of bodily experience. Therefore, the EC paradigm is often used as a quantitative approach to study cultural influence on human behavior or mind. One of the frequently studied cultural factors is "being a musician," that is, having received special training in and exposure to specific music (Will, 2017). Compared to non-musicians, musicians have experienced long-term training of mind and body for specific listening and observation tasks. Such experience may shape musicians' brains to process multisensory information differently from those of non-musicians. On the basis of the EC paradigm, musicians can therefore be predicted to behave differently from non-musicians in the rhythm perception tasks of the current study.

The embodiment approach is valuable because it may present a more universally applicable method for understanding the human cognitive process in music. Most previous models of rhythm perception are derived from notation-based symbolic analysis of European classical music; therefore, although valid within that framework, they cannot be applied as universally as the studies might claim. On one hand, music in many non-Western cultures has notation systems different from Western ones. For instance, some non-Western instrumental music is written with a notation of playing techniques, such as the reduced notation for the Guqin (a traditional Chinese zither) or kunkunshi notation for the Japanese Sanshin (an Okinawan plucked string instrument), which does not directly describe what notes are played. In these notation systems, it is the playing techniques that are described, such as finger positions, stroke techniques, and tablature. On the other hand, in many cultures notation plays a relatively minor role, and some cultures even have no written notation and rely instead on oral traditions. For instance, among the Chinese ethnic groups Tujia and Dong, or in some African tribes, music is mainly transmitted from one generation to another through oral tradition. Even today, Indian classical music, Middle Eastern music, and even some Western traditions (jazz, rock) are largely orally transmitted. While music notation is often culture-specific, human synchronized movement to music is a universal behavior across cultures. With the EC paradigm, rhythm perception can be understood through analyzing body movement, a more universally applicable method than verbal or notation analysis.

The EC paradigm also brings new insights into ethnomusicological studies.

Ethnomusicology, as the study of people making music (Titon, 2015), has a long history in a broader area of interest that goes beyond the European art tradition. And ethnographical methods, such as participant observation and interviews, have long been used among ethnomusicologists to understand what music means to its practitioners and audiences.

However, musical concepts rarely translate across cultures, and some feelings, for both performers and audiences, are hard to articulate through words. Therefore, the descriptive data we collect from fieldwork and interviews may limit our understanding of the "emic" meaning of a people's music. Meanwhile, the EC paradigm allows musical meaning to be analyzed through the corporeal articulations of performers and audiences during music making and appreciation. In a study of a group ensemble of Indian tanpuras (Clayton, 2007), the performers stated that the tanpura rhythms each proceeded independently. However, an analysis of the temporal behavior of the performers and audiences in the recorded video of the live performance revealed significant interpersonal entrainment between performers, as well as among the audience. Such phenomena in music performance would be neglected without studying temporal behavior.

It deserves to be noted that some ethnomusicologists have long recognized the embodied dimension of music experience. Thomas Clifton proposed that musical works should be experienced not merely as objects "but as fields of action for a subject" (Clifton, 1983, p. 70). Another early scholar who pointed out the relationship between the human body and music experience is John Blacking (1992), who argued that musicality is embedded in the human body: musicality is not only "socially constructed" but also "genetically conditioned" (p. 305). Charles Keil (1995), a pioneering ethnomusicologist, pointed out the significant role of the human body in music making and perception by coining the idea of "kinesthetic listening," which claims that audiences' listening experience is shaped by their motor movement. John Chernoff suggested that "movement is the key to 'hearing' the music" (1977, p. 22) in the case of West African percussion music. Aware of this embodied dimension of music experience, ethnomusicological studies have applied an interpretative approach to describing and analyzing music activities. More recently, a number of studies have tried to test embodied musicality from a scientific perspective. For instance, Phillips-Silver (2007, 2009) found that the movements audiences produce are intimately linked to the sounds they perceive, and that such interaction between action and hearing experience exists from infanthood and continues to develop into adulthood.

1.4. Summary

This chapter delineated some basic theoretical foundations of the whole thesis. A discussion of the distinction between sensation and perception clarified that receiving physical energy and interpreting it are two different processes. Then, the chapter turned to an explanation of rhythm perception as a multisensory experience, pointing out that temporal information can be conveyed and perceived through multiple sensory channels. Lastly, the chapter introduced the embodied cognition paradigm as the theoretical foundation of this study, namely, that music perception is an embodied activity that depends on our sensorimotor apparatus and our sociocultural environment. As Blacking (1973) explained, "Music is a synthesis of cognitive processes which are present in culture and in the human body: the form it takes, and the effects it has on people, are generated by the social experiences of human bodies in different cultural environments" (p. 73). All the concepts elaborated in this chapter serve as the foundation for this work.


Chapter 2. Clock-Based Timing and Interval Perception

Human temporal perception covers a large timescale, from milliseconds to days. Studies have roughly categorized this temporal range into four different time scales (see Figure 2.1). It is presumed that different neural mechanisms underlie the process of temporal perception at different timescales (Mauk & Buonomano, 2004; Buhusi & Meck, 2005): microseconds (Covey & Casseday, 1999), milliseconds (Buonomano & Karmarkar, 2002), seconds (Gibbon, Malapani, Dale, & Gallistel, 1997), and a circadian timescale (King & Takahashi, 2000).

One of the most basic temporal components of rhythmic sequences is the interval, the duration between two successive (sensory) events. The timescale of intervals at issue in the scholarship on this subject typically ranges from milliseconds to seconds because, for the perception of music, this is the typical range of the intervals that constitute a rhythmic pattern. Within such a temporal range, humans can discriminate differences in duration between events, reproduce presented intervals, and even accurately synchronize their movements to regular events. However, there is no consensus as to which temporal mechanisms account for these temporal processing abilities.


Figure 2.1. Timescales of temporal processing. Adapted from "The Neural Basis of Temporal Processing," by M. D. Mauk and D. V. Buonomano, 2004, Annual Review of Neuroscience, 27(1), p. 309.

Two proposed timing mechanisms can account for interval perception: a clock-based timing mechanism and an entrainment-based timing mechanism. Both of these mechanisms have successfully explained interval perception in the auditory and visual domains to some degree, but the entrainment model better captures higher-level features of temporal patterns, such as pulse perception. It is necessary to discuss how we perceive stand-alone intervals before we delve into the broader questions of rhythm perception. To that end, Chapter 2 and Chapter 3 provide a detailed treatment of these two mechanisms and an explanation of the application of entrainment-based timing as the fundamental assumption in the current study. Included in these discussions are the features of intervals and the underlying mechanisms of auditory and visual interval perception.

2.1. Interval as an Element of Auditory Rhythmic Pattern

The word rhythm is widely used in both our daily life and academia and is endowed with various definitions. In line with previous works, this work uses the word rhythm to refer to a sequence of discrete temporal intervals marked by sensory events (Cooper & Meyer, 2008; Fraisse, 1982; Clarke, 1999). Specific to the auditory channel, the sensory events are sound events, where sound is air pressure vibration over time in the audible frequency range. The onset of a sound is the point in time when the sound initiates, and the offset of a sound is the point in time when the amplitude of the sound reduces to zero.

In music theory, the onset of a note is also referred to as the attack point or start point.

Further, an interval is regarded as filled when there is one continuous signal whose onset and offset mark the beginning and the end of the time interval. Conversely, an interval is said to be empty if two successive sounds mark the beginning and the end, and there is silence between the two sounds.

The perception of filled and empty intervals is slightly different. Studies have found that the perception of intervals can be influenced by the way they are marked (e.g., Grondin, 2010). Independent, empty auditory intervals are perceived as shorter in temporal estimation, and discriminated less accurately, than filled intervals (Wearden, Norton, Martin, & Montford-Bebb, 2007); on the other hand, the accuracy for empty intervals is better than that for filled intervals when intervals are organized into sequences (Klyn, Will, Cheong, & Allen, 2015). One proposed explanation is that temporal order is easier to identify if events are separated by silent intervals (Warren & Obusek, 1972; E. C. Thomas & Brown, 1974). To simplify the discussion, throughout the thesis, all intervals are by default empty intervals.

The duration between the onsets of two consecutive sound events is called the inter-onset interval (IOI), a numerical measure for quantifying the temporal distance between sound events. The minimum IOI for humans to distinguish two sounds is about 3 ms; to determine the sequential order of sounds, the IOI needs to be about 30 ms; and to make a rhythmic judgment, such as estimating duration or tempo, the minimum required IOI is about 100 ms (Hirsh, 1974; London, 2012).
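As a concrete illustration of these definitions, the following minimal Python sketch (the function name and onset values are hypothetical, invented for this example rather than taken from the software used in this study) computes the IOIs of a short onset sequence:

    def inter_onset_intervals(onsets_ms):
        """Return the IOIs (in ms) between consecutive sound onsets."""
        return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]

    # Onsets of four drum strokes, in milliseconds (illustrative values).
    onsets = [0, 250, 500, 1000]
    print(inter_onset_intervals(onsets))  # -> [250, 250, 500]
    # Each IOI here is well above the ~100 ms minimum reported for
    # rhythmic judgments (Hirsh, 1974; London, 2012).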

Rhythms are not only distinguished by the intervals that constitute them, but also by the physical features of the sound events that mark the intervals. Additionally, a specific group of intervals and sound events that generates a rhythm is called a rhythmic pattern; therefore, changing the interval or sound events by changing any of the sound features of a rhythmic pattern—including pitch, duration, loudness, etc.—will form a new pattern.

Illustrated below (Figure 2.2) is a rhythmic pattern in which an isochronous temporal structure is marked by loud-soft-loud sounds.


Figure 2.2. Illustration of rhythmic pattern in soundwave representation: isochronous temporal structure marked by loud-soft-loud sounds.

2.2. Mechanism and Variability of Interval Perception

How temporal intervals are encoded in the perception process is not fully understood. Neuroscience studies have discovered that performing tasks related to time perception activates multiple areas in our brain, such as the frontal cortex (e.g., Lewis & Miall, 2003; Oshio, 2011), the basal ganglia (e.g., Gibbon, 1977; Grahn & Brett, 2007), the parietal cortex (e.g., Bueti & Walsh, 2009; Alexander, Cowey, & Walsh, 2005), and the cerebellum (e.g., Braitenberg, 1967; Harrington & Haaland, 1999; Fierro et al., 2007). However, the specific cognitive process is yet to be explained (a detailed review can be found in Ivry & Spencer, 2004). Among the diverse models of time perception that have been presented, many include neurobiological internal clocks.

The earliest description of an internal clock for timing can be traced back to the middle of the 20th century in Treisman (1963), who assumed there is a central timer for estimating intervals. The clock delivers fixed time signals, which allow for a comparison with the external interval to be estimated. The scalar expectancy theory (SET) is one of the leading theories of the internal clock mechanism (E. A. Thomas & Weaver, 1975; Jones, Moynihan, Mackenzie, & Puente, 2002). According to SET, the internal clock is refined into two components: a neural pacemaker and an accumulator (Gibbon, 1977; Rakitin et al., 1998). Figure 2.3 illustrates the schematics of SET.

Figure 2.3 A schema of a prototypical SET model for short interval perception. The three stages are presented here. Adapted from “Carving the Clock at its Component Joints: Neural Bases for Interval Timing,” Wencil, Coslett, Aguirre, and Chatterjee, 2010, Journal of Neurophysiology, 104(1), p. 161.


According to SET, the neural pacemaker continuously emits pulses at a specified rate, regardless of whether there are active sensory inputs. The accumulator receives input from both the pacemaker and the sensor: it counts the number of pulses it receives from the pacemaker, and it has a valve that can shut off input from the pacemaker.

The start of the to-be-timed interval triggers the valve to open, allowing pulses to enter the accumulator. At the end of the signal, the valve is closed, and the accumulator counts the number of pulses that entered since the last opening of the valve. The number of pulses collected in the accumulator is compared to durations stored in memory to determine whether the current duration is longer or shorter than remembered intervals (Lejeune, 1998; Gibbon, Church, & Meck, 1984). It is worth noting here that the clock-based model rests on the assumption that duration judgment is the estimation of the relative difference between the remembered standard duration and the to-be-judged interval; that is, interval perception is achieved through the comparison of relative time differences between intervals. Generally, through clock-based mechanisms, the duration of an interval is conveyed by an abstract internal code (the counted number of pulses).
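To make the pacemaker–accumulator mechanism concrete, the following minimal Python sketch simulates one SET-style duration comparison; the pacemaker rate, noise value, and function names are illustrative assumptions for this sketch, not parameters from any published SET implementation:

    import random

    def accumulate_pulses(duration_s, pacemaker_hz=50.0, rate_noise_sd=2.0):
        """Count the pulses emitted while the valve is open.

        The pacemaker emits pulses at a roughly fixed rate; the valve stays
        open only for the to-be-timed interval, so the accumulator's count
        is simply (noisy rate) x (interval duration).
        """
        rate = random.gauss(pacemaker_hz, rate_noise_sd)  # trial-to-trial rate noise
        return int(rate * duration_s)

    # Compare a 1.2 s probe against a remembered 1.0 s standard.
    remembered = accumulate_pulses(1.0)  # count stored in reference memory
    probe = accumulate_pulses(1.2)       # count for the current interval
    print("probe judged longer:", probe > remembered)

Because both counts carry rate noise, the simulated judgment is probabilistic rather than exact, in line with the variability discussed below.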

The clock-based models are widely accepted, partly because they are flexible in explaining the variability in human time estimation. Behavioral studies on human interval perception show that it is difficult for the internal clocks operating in living organisms to achieve the same level of accuracy as mechanical clocks (Lejeune, 1998).

Specifically, behavioral data on time judgments of intervals, among humans and other animals, have consistently shown variability in interval perception—that is, the relative inaccuracy of the perceived interval is approximately constant over a broad range of temporal intervals (Grondin, 2001). The variability of interval estimation increases in proportion to the magnitude of the to-be-judged interval. Assuming that the inaccuracy is ±1 s when perceiving a 10-second interval, the inaccuracy will be approximately ±10 s when perceiving a 100-second interval. This is one of the key contributions of SET to the inner-clock models, namely that the timing accuracy of the inner clock is relative to the size of the interval being timed—which is the meaning of "scalar" in this model (Allen, 1990; Gibbon, 1977; Gibbon, Church, & Meck, 1984).
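The scalar property can be stated compactly as a Weber-law relation (a standard formulation, given here for illustration):

    \[ \frac{\sigma_T}{T} \approx k, \]

where \( \sigma_T \) is the standard deviation of the estimates of an interval of duration \( T \), and \( k \) is a constant. With \( k = 0.1 \), a 10-second interval yields an inaccuracy of about ±1 s and a 100-second interval about ±10 s, matching the example above.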

SET locates a main source of this variability in the attentional process, whose role is overlooked by other theories. Attention is recognized as the allocation of processing resources to specific objects. Empirical data indicate that the attentional process influences human interval judgment and that there is a non-trivial perceptual variability based on the amount of attentional resources allocated to the interval estimation task. Zakay (1989) was the first to directly test the attentional allocation process and found that durations were estimated as longer if the subjects were instructed to treat duration estimation as the primary task, compared to the duration estimates of those instructed to treat it as a secondary task. Moreover, in the attentional-distraction procedure, the reproduced duration decreased if subjects' attention was directed away from the estimation task (Zakay, 1992). An attentional-gate model (Zakay & Block, 1996; Block & Zakay, 1997) has been put forward to explain such phenomena. It conceptualizes "attention" as a gate placed before the valve in the SET mechanism: the more attentional resources are allocated to time estimation, the more frequently or widely the gate opens, thereby allowing more pulses to be transferred from the pacemaker to the counter.
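The gate's effect can be expressed in a simple formula (an illustrative formalization, not an equation taken from Zakay and Block's papers):

    \[ n = g \cdot r \cdot t, \qquad 0 \le g \le 1, \]

where \( r \) is the pacemaker rate, \( t \) is the physical duration, and \( g \) is the proportion of pulses passed by the attentional gate. Directing attention away from the timing task lowers \( g \) and hence the accumulated count \( n \), consistent with the shortened reproductions reported by Zakay (1992).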

The main criticism against the scalar expectancy theory is that it is difficult to identify the functional area in the brain that acts as the internal clock or the accumulator described in the theory. Scholars have achieved clarity only on the positive relationship between temporal estimation and the amplitude and peak of the contingent negative variation (CNV) (McAdam, 1966; Macar & Vidal, 2004). CNV is a slow, negative deflection of brainwaves. It was first observed by Walter and his colleagues (1964), who used electroencephalographic (EEG) recordings to monitor the activity of human brainwaves during temporal estimation tasks. CNV is sometimes called "an online index of timing" because it is present whenever humans estimate time intervals (Macar & Vidal, 2004). However, CNV provides evidence neither in favor of nor against clock timer models. Therefore, the clock-based models should still be treated as a hypothetical construct awaiting further validation by brain research.

2.3. Interaction Between Perceived Non-Temporal Factors and Interval Perception

In addition to the variability in interval perception discussed above, studies have found that the temporal perception of a given interval does not solely depend on its temporal feature (the duration of the interval); instead, it interacts with various non-temporal features of the stimuli. In other words, the duration of the same interval may be perceived differently depending on how the stimuli are delivered or presented. Some of these factors—such as sensory modality (e.g., Penney, Gibbon, & Meck, 2000), intensity (e.g., Matthews, Stewart, & Wearden, 2011), pitch (e.g., Pfeuty & Peretz, 2010), familiarity (e.g., Witherspoon & Allan, 1985), and whether an interval is filled or empty (e.g., Wearden et al., 2007)—have been recognized for some time to systematically change the perceived duration of an interval. The explanation offered by the SET model is that these factors either change the rate of the pacemaker or change the open–close status of the valve of the accumulator, thereby changing the flow of pulses into the accumulator (Allman & Meck, 2011; Lejeune, 1998; Meck & Benson, 2002).

A recent study attempted to investigate these factors in a multivariate approach (Fan & Will, 2016) and made an interesting finding of integrated effects across factors. The research highlighted the pitch of the interval markers, participants' music training, and language experience, which had only been explored as univariate factors in previous studies. This section presents an overview of those studies to summarize the scholarship on the effect of cultural experience on pitch–temporal interaction.

Studies have highlighted that musicians—individuals with long-term musical training—have a greater ability to extract and estimate temporal information accurately compared to non-musicians (e.g., Rammsayer & Altenmuller, 2006; Cicchini, Arrighi, Cecchetti, Giusti, & Burr, 2012). This, researchers assert, is the result of the plasticity of musicians' brains, because many years of musical practice can significantly change both the structural and functional aspects of many regions in the brain, such as the auditory cortex, motor cortex, and brainstem (Pantev et al., 1998; Amunts et al., 1997; Musacchia, Sams, Skoe, & Kraus, 2007; Jäncke, 2009). Moreover, studies have shown that speaking a tonal language improves pitch processing at the brainstem level (Krishnan, Xu, Gandour, & Cariani, 2005). Behavioral studies have found that native Mandarin speakers are better at discriminating pitch intervals than native English speakers (Pfordresher & Brown, 2009; Hove, Sutherland, & Krumhansl, 2010). Different cultures have developed different phonological systems, in which pitch and duration play different roles in different languages. In Mandarin, a tone language, both the height and the dynamic change of pitch serve lexical or semantic functions (the specific tone contours are illustrated in Figure 2.4).

Figure 2.4. Pitch contours of the four different tones in Mandarin. Ma in the first tone means "mother"; in the second tone, "hemp"; in the third tone, "horse"; and in the fourth tone, "curse." The neutral tone is not shown here because its pitch depends on the other syllables it is combined with.

As Figure 2.4 suggests, although the durations of the four lexical tones of Mandarin vary significantly in isolated monosyllables, with tone 3 being the longest and tones 1 and 4 being the shortest (Brotzman, 1964; Nordenhake & Svantesson, 1983), in text reading and natural conversation the durations of the four lexical tones tend to converge (Yang, Zhang, Li, & Xu, 2017). In English, on the other hand, pitch mainly serves to convey prosodic information such as stress or sentence structure, while duration helps distinguish meaning (Cruttenden, 1997; Yip, 2010; Gussenhoven, 2010). Therefore, learning a tone language demands the establishment of a fine-grained association between pitch and word meaning, which leads to a more developed capability and mechanism for pitch processing (Pfordresher & Brown, 2009).

Evidence from neuroscience shows that native experience with a tone language (e.g., Mandarin Chinese or Thai) leads to improved pitch representation and smoother pitch-tracking accuracy, as measured by the neural encoding of pitch in the auditory brainstem (Krishnan, Bidelman, & Gandour, 2010; Swaminathan, Krishnan, & Gandour, 2008). Studies have also found an interaction between the perceived pitch of the auditory stimuli that mark an interval and the perceived duration of that interval: when listening to intervals bounded by two tones, the interval duration is perceived as shorter when bounded by high tones and longer when bounded by low tones (Pfeuty & Peretz, 2010; Lake, LaBar, & Meck, 2014).

Based on these studies, tone language speakers and musicians may process pitch differently. It was hypothesized in the Fan and Will (2016) study that cultural influences (e.g., tone-language-speaking experience, music training experience) may have an effect on the pitch–temporal interaction, and this hypothesis that the pitch–temporal interaction is experience-dependent was explored in the experiment (Fan & Will, 2016). Data on interval timing performance were collected from and compared among four groups of participants: Mandarin-speaking musicians, Mandarin-speaking non-musicians, English-speaking musicians, and English-speaking non-musicians. The results (illustrated in Figure 2.5) aligned with previous studies, indicating that intervals marked by higher-pitched tones are underestimated and that intervals marked by lower-pitched tones are overestimated. However, this interaction was significantly reduced among Mandarin-speaking musicians. What is immediately evident from this result is that under all conditions, musically trained subjects, represented by the blue bars in Figure 2.5, are significantly less influenced in their duration estimation by the pitch of the tones. Furthermore, the duration judgments of Chinese-speaking subjects are less influenced by pitch than those of English-speaking subjects. Also, the difference between subjects who speak different languages is reduced if they are both musically trained.


Figure 2.5. Statistical result of the behavioral experiment. The y-axis represents the "normalized difference of PSE," which is the difference between PSE and ISI divided by ISI, where PSE is the point of subjective equality and ISI is the inter-stimulus interval. The higher the value of this variable, the stronger the influence of pitch on a participant's duration perception. The blue bars represent the musician groups, and the red bars represent the non-musician groups. The left part of the graph represents the Chinese-speaking groups, and the right part represents the English-speaking groups, marked C and E, respectively. The labels "long" and "short" designate results for the two inter-stimulus intervals used in the experiments. The figure shows significant differences between the English-speaking and Chinese-speaking groups, as well as between the musician and non-musician groups. Adapted from "Musicianship and Tone-language Experience Differentially Influence Pitch-duration Interaction," Fan and Will, 2016, Proceedings of the 14th ICMPC 2016.
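Restating the caption's verbal definition as a formula:

    \[ \text{normalized difference of PSE} = \frac{\mathrm{PSE} - \mathrm{ISI}}{\mathrm{ISI}}, \]

so a value of zero indicates that pitch had no effect on the point of subjective equality, and larger values indicate a stronger influence of pitch on duration perception.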


These results reveal that the pitch–temporal interaction in the perceptual process is mediated by individuals’ cultural experience; for instance, having musical training or speaking a tone language reduces the influence of pitch on duration perception. Although the neural mechanisms that account for such cultural influences are not fully understood, the result of this study connects the mechanism of language processing with the phenomenon of pitch–temporal interaction.

2.4. Visual Interval Perception

Intervals are not only marked by auditory events. Any visual event—for example, changes in color, shape, or position—may mark intervals, and the duration of such intervals can be perceived by humans. The temporal mechanisms for auditory stimuli and visual stimuli share certain commonalities that enable cross-modal interaction (e.g.,

Grondin & Rousseau, 1991), such as the ability to compare the durations of a light and a sound

(Ivry & Schlerf, 2008); on the other hand, they have their own modality-specific components that separately deal with specific features of the stimuli.

Recall the discussion of the sensory transduction process in chapter 1, which explains that sensory information is recognized only if it is transduced from mechanical energy to neural signals. The modality difference exists already in this transduction mechanism in the auditory and visual systems. The retinal processing and chemical transduction of light at the retina take considerably longer than the corresponding transduction of sound waves (where the mechanical energy of sound vibration is transduced into electrical energy) at the ear (King & Palmer, 1985). It was found that an auditory stimulus takes 8 ms–10 ms to reach the brain cortex, while a visual stimulus takes 20 ms–40 ms (Kemp, 1973). Accordingly, such modality difference in mechanical transduction and transmission suggests that there may be a greater latency in human reaction to visual stimuli than to auditory stimuli, thus causing the difference in the duration perceived from the two modalities.

Behavioral studies show general agreement on the modality difference in interval perception. For instance, the two systems have different fusion thresholds, which is what allows film/video to work at 25 frames per second. Also, the estimated duration of sounds is longer than that of lights even though they last the same span of time (Wearden, Edwards, Fakhri, &

Percival, 1998). Further, temporal acuity is better for auditory than for visual intervals (van

Wassenhove, 2009; Merchant, Zarco, Bartolo, & Prado, 2008), and this holds for both filled and empty intervals (Goldstone & Lhamon, 1971; Grondin & Rousseau, 1991). Also, there is superior discrimination performance for auditory compared to visual intervals

(Rammsayer, Borter, & Troche, 2015), as human beings are more accurate in reproducing intervals marked by auditory stimuli than those marked by visual stimuli (Glenberg, Mann,

Altman, Forman, & Procise, 1989).

The modality difference in interval perception cannot be completely attributed to the biophysical differences in sensory processing of auditory and visual stimuli, as the observed discrepancy across audio–visual modalities usually does not behave as a constant error. It has been argued that visual interval perception can also be explained using clock models, and that the difference between auditory and visual interval perception is due to the underlying operation of the internal clock being different for auditory and visual stimuli (van Wassenhove, Buonomano, Shimojo, & Shams, 2008). The common understanding of the difference is that the pacemaker rate is faster for auditory stimuli than for visual stimuli, which leads to the modality difference between auditory and visual interval perception (van Wassenhove et al., 2008; Barne et al., 2017; Penney et al., 2000; Wearden et al., 1998).

The different pacemaker rates may be due to either changes in the rate of an internal clock or the existence of multiple internal clocks with different rates. There are animated discussions on whether our brain is endowed with a unique and centralized clock that processes the temporal information from various sensory inputs (Treisman, 1963; Gibbon,

Malapani, Dale, & Gallistel, 1997), or whether we have many clocks, each of which is modality-specific (or responsible for calculating specific sensory inputs) (Buonomano,

2000; Burr, Tozzi, & Morrone, 2007; Grondin, 2010). Emerging evidence supports the idea of a plurality of temporal mechanisms and that specific non-temporal parameters of visual stimuli—such as visibility (Terao, Watanabe, Yagi, & Nishida, 2008), size (Xuan, Zhang,

He, & Chen, 2007), and speed (Kaneko & Murakami, 2009)—can alter perceived time.

2.5. Summary

This chapter discusses the perception of the duration of intervals marked by two sensory events, focusing on the SET model as the leading theory that explains interval perception in both auditory and visual modalities. The fundamental hypothesis of the SET model is that there is an inner clock emitting pulses. An interval is estimated by comparing the memorized duration of the former interval with the accumulated number of pulses collected in the accumulator for the to-be-judged interval. One of the most important ideas clarified in this chapter is that what we perceive may not equal what is objectively present, because perception is a subjective process that is based not only on the sensory inputs but is also influenced by many other factors. The next chapter presents another timing mechanism.
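To make this pacemaker-accumulator logic concrete, here is a minimal Python sketch of an SET-style interval judgment. It is only an illustration of the model as summarized above, not the procedure used in this study; the pacemaker rate and the Gaussian count noise are arbitrary assumptions.

```python
import random

def accumulate(duration_ms: float, pacemaker_hz: float = 200.0) -> int:
    """Count the pulses an SET-style pacemaker emits during an interval.
    The count is noisy: mean = rate * duration, spread ~ sqrt(mean)."""
    mean = pacemaker_hz * duration_ms / 1000.0
    return max(0, round(random.gauss(mean, mean ** 0.5)))

def judge_interval(target_ms: float, reference_ms: float) -> str:
    """Compare the accumulator count for the to-be-judged interval with
    the count stored in reference memory for the former interval."""
    reference_count = accumulate(reference_ms)  # memorized former interval
    target_count = accumulate(target_ms)        # to-be-judged interval
    if target_count > reference_count:
        return "longer"
    if target_count < reference_count:
        return "shorter"
    return "same"

print(judge_interval(600, 500))  # most often "longer"
```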


Chapter 3. Entrainment-Based Timing and Pulse Perception

Having provided an overview of clock-based timing mechanisms in Chapter 2, I now turn to entrainment-based timing mechanisms. Although neither mechanism can fully explain the temporal perception process, I will demonstrate that entrainment-based models offer an advantage in explaining the process by which we perceive a pulse in response to rhythmic sequences. Studies have shown that, when presented with a sequence of intervals in a rhythmic pattern, rather than processing the intervallic relationships between discrete events, we detect the regularities of the external rhythmic pattern and entrain our internal periodic oscillator to them (Schulze, 1978). This process is pulse perception, which is one of the important aspects of musical rhythm perception and is often coupled with body movement, from the simple tapping of a finger or a foot to complex dancing.

This chapter discusses the entrainment-based timing mechanism and related models in detail, followed by a synopsis of sensorimotor synchronization to auditory and visual rhythms. I clarify these key concepts because they are central to this research, forming the theoretical framework of the study of rhythm perception.

3.1. Pulse as a Perceptual Phenomenon

Pulse is often referred to as a perceived feature in response to the periodic features of a surface rhythm (Fitch, 2013) or other quasi-periodic events. Although perceived pulse is significantly influenced by the regularity of the sound surface of a rhythm, it is not necessarily determined by physical events. Pulse may occur even in the absence of events that coincide with the surface rhythm. Pulse is often colloquially called "beat," although, strictly speaking, "beat" and "pulse" refer to different concepts and cannot be used interchangeably. Beat is a more sound-oriented concept and is used to describe sound phenomena, such as how many temporal units a rhythmic pattern is constructed on. Pulse, on the other hand, pertains to one's perception and describes the periodic temporal units subjectively formed in one's response to external rhythm. In the process of music perception, "pulse provides a stable, dynamic referent with respect to which a complex musical rhythm is experienced" (Large, 2008, p. 192). It also functions as a reference for guiding our movement and structuring incoming events. Once an individual has established a sense of regular pulses, the perception gains momentum such that the feeling of pulse does not fade immediately even if the stimuli are stopped (Cooper & Meyer, 2008).

A series of regular pulses defines the tempo we perceive. That is, perceived tempo is not directly determined by the surface rhythm. When the sound of a rhythm is "too fast" or "too slow," people naturally group the rapid events into one pulse or divide large intervals into smaller ones, feeling a pulse that is "just right." Such grouping or dividing behavior is often reflected in humans' musical movement, such as tapping once every two beats if the music is fast or tapping twice per beat if the music is slow.

Researchers have found that our sense of pulse is confined to a specific temporal span—from 200–250 ms to about 1.5 s (London, 2012). Within this range, it is widely accepted that there is a preferred tempo around which an individual feels and responds to the pulse most strongly and naturally. A recent EEG study measured human brainwave synchronization following periodic stimulation and found the maximum response at a tempo of 2 Hz (500 ms) (Will & Berg, 2007). Other studies found that the human spontaneous motor tempo (the preferred and natural pace to act) of whole-body movements is around 1.67–2 Hz (500–600 ms) both during daily activities (MacDougall & Moore, 2005) and when walking in synchrony to music (Styns, Noorden, Moelants, & Leman, 2007).

It should be noted that there is a wide range of individual variation in preferred tempo (Will & Makeig, 2011). From the perspective of human developmental studies, the preferred tempo slows down as one ages (Drake, Jones, & Baruch, 2000). For children between the ages of four and seven, the preferred tempo is typically between 3.33 Hz (300 ms) and 2.5 Hz (400 ms) (McAuley, Jones, Holub, Johnston, & Miller, 2006). For adults, the preferred tempo is around 1.66 Hz (600 ms), while for seniors, it slows to approximately 1.43 Hz (700 ms) (McAuley et al., 2006). Therefore, the average range of the preferred pulse is 400–700 ms.

3.2. Entrainment and Pulse Perception

This section first delineates the concept of entrainment and then proceeds with a discussion on how entrainment theories explain the underlying mechanism of pulse perception.

3.2.1. Entrainment

Entrainment was first identified in 1665 by Dutch physicist Christiaan Huygens, who used the term to describe a phenomenon among physical oscillators and who had earlier invented the pendulum clock. The pendulum clock featured a swinging weight that oscillates at precise time intervals, the period of which depends on the pendulum's length. It resists swinging at another rate and may serve as a precise timekeeper. Huygens noticed that when two pendulum clocks are placed near each other, regardless of whether they were facing each other or in opposite directions and no matter when the clocks were started, the swinging weights on the two pendulums would eventually match in their movement and synchronize with each other.

Once they are synchronized, they stay synchronized; if one pendulum is disturbed, the two pendulums will regain synchronization (Huygens, 2018).

Entrainment is not a phenomenon unique to mechanical systems. It is also a shared tendency within a wide range of biological systems, such as groups of synchronous fireflies that light up in the same cadence. Entrainment has also been observed among humans. For instance, between two speakers in conversation, their rates of speaking, latencies, and utterance durations are often aligned (Webb, 1969; Local, 2007; Matarazzo & Wien, 1966), as are their motor behaviors such as body posture and facial expression (Louwerse, Dale,

Bard, & Jeuniaux, 2012; Shockley, Santana, & Fowler, 2003). Additionally, entrainment occurs in the performances of collaborative musical ensembles (Yoshida, Takeda, &

Yamamoto, 2002).

One recent study found that entrainment can also be influenced by socio-cultural factors. Lucas, Clayton, and Leante's (2011) study applied entrainment-based analysis to the rhythmic production of performers in a cultural context—specifically,

Congado, a religious tradition in Minas Gerais, Brazil. In Congado rituals, various music groups participate at the same time, and they can be distinguished from one another by not only their costumes and musical instruments but also by the rhythmic patterns of their

music while they march. For instance, Figure 3.1 shows the differences between Serra

Acima and Dobrado, two different drum rhythmic patterns played by two different groups in Congado.

Figure 3.1. Representation of two drum rhythmic patterns: Serra Acima (left) and Dobrado (right). The black notes represent strong strokes, and the white notes represent weak strokes. Adapted from “Inter-group Entrainment in Afro-Brazilian Congado Ritual,” Lucas et al., 2011, Empirical Musicology Review, 6(2), p. 78.

Lucas et al. (2011) recorded and analyzed four groups' performances during the ritual; in particular, the timing data derived from the audio recordings of two groups' performances were analyzed, and specific activities and the distance between the two groups were annotated in alignment with the sound. Figure 3.2 illustrates the rhythms observed in the Mocambique (AM) and the Congo (AC) groups from the community of Arturos during this Congado procession. The blue bar represents the AM group, and the pink bar represents the AC group.


The red shading indicates the periods when the two groups became mutually entrained in their musical rhythms.

Figure 3.2. Illustration of the Mocambique (AM) and Congo (AC) groups becoming entrained. Adapted from “Inter-group Entrainment in Afro-Brazilian Congado Ritual,” Lucas et al., 2011, Empirical Musicology Review, 6(2), p. 78.

When the two groups' paths converged, they naturally adjusted their tempi to become entrained with each other. However, if the two groups were from different communities, they would resist synchronization. The resistance does not prevent entrainment: eventually, the two groups synchronized in an anti-phase relationship, which is described as "resisting entrainment." Lucas et al.'s (2011) entrainment analysis puts it this way: "participants maintain boundaries between the different groups depends on sustaining a rhythmic difference" (p. 75). Thus, the researchers concluded that there is no monolithic form of entrainment and that the particular form of entrainment is determined by certain socio-cultural factors.

The general entrainment process includes two fundamental elements: first, it involves two or more autonomous rhythmic oscillations; second, it describes the interaction between the oscillators (Clayton, Sager, & Will, 2005). The interaction between oscillators could be either symmetric or asymmetric (Clayton, 2012). Symmetric interactions often happen in collaborative situations, in which interpersonal entrainment often occurs, such as musicians in jazz ensembles who influence each other equally. Asymmetric interactions often happen when people listen to recorded music, in which the sounds that are played influence the listeners, but not vice versa. Whether the interaction between the oscillators is symmetric or asymmetric, the general outcome of entrainment is that the two oscillators are “phase locked” to each other, where the phase difference between the two remains nearly constant for a long period.

3.2.2. Entrainment-Based Theories and Pulse

Entrainment-based theories of pulse perception explain it as an entrainment between a self-sustaining oscillator in humans and an external stimulus. For instance, Large and Jones's (1999) dynamic attending theory considers the self-sustaining oscillator to be an endogenous attentional oscillation in which attention is distributed regularly and periodically in time. Further, the frequency of the attentional oscillations is flexible and adaptable and can be synchronized to external stimuli through entrainment. These inner oscillators have peaks in amplitude at a roughly regular frequency, which automatically predicts the next stimulus event through being entrained to the perceived regularity of the stimulus sequence. The regularities of physical stimuli, such as sounds or repeated locomotion, in the flow of external rhythmic events mark the frequency of the external oscillator. As the periodic physical characteristics of a sound sequence draw a perceiver's attention, the perceiver adjusts the attentional oscillator's parameters, such as period (the time span between peaks of a given cycle) or phase (the difference between the onset of the stimulus and the peak of an attentional oscillator), to the regularity of the external events until becoming locked into the periodicity of the stimulus.

Through this entrainment mechanism, interval estimation is achieved by comparing the phase relationship between the expected stimulus event and the actual stimulus event.

If the next stimulus event coincides with the expected event, the subject perceives the current interval as having the same duration as the last interval (such as S2 in Figure

3.3). If the next stimulus event happens earlier (S1 in Figure 3.3) or later (S3 in Figure 3.3) than expected, the current (target) interval is perceived as shorter or longer than the previous interval.


Figure 3.3. Phase relationships between internal oscillation and frequency of external stimuli.

The interval differences from such comparisons are often calculated in a circular way such that they are mapped to a range of 0° to 360°. In this sense, the earlier or later arrival of the next event is considered a phase shift. As illustrated in Figure 3.3, S2 has no phase shift, S3 has a positive phase shift, and S1 has a negative phase shift. Phase shifts temporarily break the phase-lock relationship between two oscillations, and the inner oscillator may try to adjust itself to recover the original phase relationship.
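As a concrete illustration of this circular calculation, the Python sketch below converts the offset between an expected and an actual onset into a phase angle. The mapping to signed degrees, negative for early events like S1 and positive for late events like S3, is one common presentational choice (the 0° to 360° range described above wraps the same values); the example onsets are invented.

```python
def phase_shift_deg(actual_onset: float, expected_onset: float,
                    period: float) -> float:
    """Phase shift of a stimulus event relative to the oscillator's
    expectation, wrapped circularly and re-centered to (-180, 180] so
    early events come out negative and late events positive.
    Onsets and period must share the same time unit (e.g., ms)."""
    shift = (actual_onset - expected_onset) / period * 360.0
    shift %= 360.0            # wrap into [0, 360)
    if shift > 180.0:         # re-center for a signed reading
        shift -= 360.0
    return shift

# Oscillator period 500 ms; the next event is expected at t = 1000 ms.
print(phase_shift_deg(1000, 1000, 500))  # S2-like: 0.0 (no phase shift)
print(phase_shift_deg(950, 1000, 500))   # S1-like: -36.0 (early, negative)
print(phase_shift_deg(1060, 1000, 500))  # S3-like: 43.2 (late, positive)
```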


Figure 3.4. Schematic illustration of neural oscillation sensory entrainment. Adapted from “Sensory Entrainment Mechanisms in Auditory Perception: Neural Synchronization Cortico-Striatal Activation,” Sameiro-Barbosa & Geiser, 2016, Frontiers in Neuroscience, 10, p. 3.

The entrainment between internal and external oscillations is observed in neural oscillations, and it serves as supportive evidence for the role of entrainment in temporal perception. Here we focus primarily on the thalamocortical system, which is the main route for the transmission of sensory information to the cortex and where many oscillatory activities can be observed. For instance, neural population rhythms, which are cyclical fluctuations of baseline neural activity, are recorded in the thalamic regions of the brain. Electrophysiological studies have used the "frequency-tagging" method to test the relationship between the activity of the human brain and external stimuli. In one example, as Figure 3.4 illustrates, the electrical activity on the human scalp is recorded by EEG while the participant listens to auditory stimuli presented repetitively at a fixed frequency. The neural oscillatory activity recorded by the EEG is then compared with the stimulus sequence to see whether the stimulus frequency is "tagged" in the neural response, thereby verifying entrainment. A series of previous studies have observed that neural oscillations in the delta range are synchronized to the frequency of the auditory stimuli (Gross et al., 2013; Lakatos et al., 2013; Sameiro-Barbosa & Geiser, 2016; Will & Berg, 2007).
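To sketch how such a frequency-tagging comparison might be computed, the snippet below checks whether an EEG trace carries excess spectral power at the stimulation frequency. This is a toy illustration under stated assumptions (a single channel, NumPy, a synthetic 2 Hz signal; the function and the neighboring-bin baseline are my own choices), not the pipeline of any study cited here.

```python
import numpy as np

def frequency_tagged_power(eeg: np.ndarray, fs: float, stim_hz: float) -> float:
    """Ratio of EEG spectral power at the stimulation frequency to the
    mean power of nearby frequency bins; values well above 1 suggest
    the neural response is 'tagged' at the stimulus rate."""
    spectrum = np.abs(np.fft.rfft(eeg)) ** 2
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    target = int(np.argmin(np.abs(freqs - stim_hz)))
    neighbors = np.r_[spectrum[max(target - 6, 0):target - 1],
                      spectrum[target + 2:target + 7]]
    return float(spectrum[target] / neighbors.mean())

# Synthetic demo: a 2 Hz "entrained" oscillation buried in noise.
fs, stim_hz = 250.0, 2.0
t = np.arange(0, 60, 1 / fs)
eeg = np.sin(2 * np.pi * stim_hz * t) + 2.0 * np.random.randn(t.size)
print(frequency_tagged_power(eeg, fs, stim_hz))  # >> 1 indicates tagging
```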

3.3. Sensorimotor Links in Pulse Perception

Applying entrainment to pulse perception, we observe that it is common for individuals to make rhythmic movements that are effectively entrained by the perceived pulse of music, such as tapping the feet to the beat of a song. The entrainment between periodic body movements and the stimuli is so compelling that individuals often struggle to maintain movements that are out of sync (Repp, 2002). This phenomenon is called sensorimotor synchronization (SMS) (Todd, 1994, 1995; Todd & Lee, 2015), which can be explained with sensorimotor theory (SMT). According to SMT, the motor system is an important mediator for pulse perception—the central nervous system processes sensory information from external stimuli while regulating motor actions, and the motor system plays a continuous, active role in temporal perception by executing motor actions and sending feedback to the nervous system. From this perspective, therefore, what one

perceives is generally based not only on what one hears but also on the way one moves

(Phillips-Silver & Trainor, 2007). Todd further elaborated that "the sensory system tune[s] into the temporal-motional properties of the physical environment, while the motor control system tunes into the dynamic properties of its musculoskeletal system" (Todd, O'Boyle, & Lee, 1999, p. 26). In this way, the pulse perception of musical rhythm is a continual dynamic process between the sensory and motor systems.

Between the stimuli and the motor response there is inevitably an asynchrony, most often a negative asynchrony when periodic stimuli are expected (i.e., the response slightly precedes the stimulus). This asynchrony is influenced by a number of factors. Scholars have argued that, for anatomical reasons, it takes more time to transmit sensory information from motor effectors (such as the fingertip or foot) to the brain than from the ear to the brain (Fraisse, 1980; Paillard, 1949).

Increasing the tempo of the stimuli reduces asynchrony. Also, individuals with musical training are found to experience less asynchrony than those without musical training

(Aschersleben, 2002; Repp, 2005). Generally, however, to adapt to and maintain the entrainment, subjects in studies automatically adjust the intervals between motor responses according to the asynchrony they experienced.

Neuroscience studies have also found that neural regions associated with the motor system are engaged in rhythm perception (Zatorre, Chen, & Penhune, 2007; Cameron,

Stewart, Pearce, Grube, & Muggleton, 2012). Additionally, the entrainment observed between brainwaves and the frequency of external stimuli is regarded as an essential part of the neurophysiological process underlying the temporal coupling between rhythmic sensory input and motor output. Will and Berg (2007) discovered that the delta band of brainwaves in response to periodic auditory stimuli has an absolute maximum at around 2 Hz, which corresponds to the optimal tempo identified in various sensorimotor behaviors, as mentioned in the discussion in Section 3.1 regarding spontaneous motor tempo and preferred pulse.

3.3.1. Sensorimotor Synchronization to Auditory Stimuli

Since Stevens's initial study in 1886, human sensorimotor synchronization has been studied through a simple tapping task in which participants are asked to use their index finger to tap on a solid surface to a metronome-like sound. Various experiments of this kind showed that sensorimotor synchronization (SMS) to auditory stimuli is limited to a certain range of event rates. Typically, finger tapping becomes difficult if the inter-tap intervals fall below 150–200 ms, that is, above five to seven taps per second (Todor & Kyprie, 1980;

Truman & Hammond, 1990). Musicians can continue to perform the task until the intervals are reduced to the 100–120 ms range (Repp, 2003). For intervals smaller than that, the participants begin to falter in their attempts to synchronize their tapping to the auditory stimuli.

While SMS to auditory stimuli is a rather universal experience, not all music evokes a pulse feeling in listeners; examples include some music found in East Asian, Arab-Turkish, and Indian traditions. The alap, an opening section in North Indian classical performance, is an irregularly timed section full of improvisation that familiarizes listeners with the melodic framework of the music. Listeners can respond with periodic movement even though little or no temporal regularity can be detected in the musical stimuli. In this case, the motor responses are not the result of pulse perception but are instead determined by the subject's internal resonance, "slightly modified by interactions with the external signal"

(Will, Clayton, Wertheim, Leante, & Berg, 2015, p. 22). Only when the sound events are produced with a certain degree of regularity is there an alignment between the sound sequence and the listener's motor responses.

3.3.2. Sensorimotor Synchronization to Visual Stimuli

Similar to our response to auditory stimuli, we also have the ability to synchronize our actions with visual stimuli. In the context of music production, this is often observed in ensemble performances in which the ensemble musicians synchronize to the conductor’s movement, or among musicians in a duet, trio, or a larger group. Studies have observed neuronal entrainment through phase-locking in the visual cortex when individuals synchronize sensorimotor movement to visual stimuli (Lakatos, Karmos, Mehta, Ulbert, &

Schroeder, 2008; Montemurro, Rasch, Murayama, Logothetis, & Panzeri, 2008; Spaak, de

Lange, & Jensen, 2014). Other studies have also observed the activation of the motor area in response to visual stimuli (Jäncke, Loose, Lutz, Specht, & Shah, 2000).

SMS to visual stimuli, as is the case with auditory stimuli, can be carried out efficiently only within a certain range of event rates, which is referred to as the effective rate.

Beyond this range, subjects may find it difficult to synchronize, which leads to a growing divergence between the response and the stimuli. However, SMS to visual stimuli can occur only in a narrower range of effective rates compared to auditory synchronization. The upper threshold (slowest usable rate) of SMS to auditory stimuli is around 0.56 Hz (IOI = 1.8 s) (Engström,

Kelso, & Holroyd, 1996; Miyake, Onishi, & Pöppel, 2004). The upper threshold of SMS to visual stimuli is comparable to that of auditory stimuli (Engström et al., 1996; Mates,

Müller, Radil, & Pöppel, 1994). However, the lower threshold (fastest usable rate) for visual synchronization is about 2.2 Hz (IOI = 460 ms) (Repp, 2003; Dunlap, 1910), roughly four times slower than that for auditory synchronization, which is around 8–10 Hz (IOI = 100–125 ms). Moreover, the quality of synchronization also differs between visual and auditory synchronization: the synchronization of movement to visual stimuli is much less stable than to auditory stimuli (Kolers & Brewster, 1985; Repp & Penel, 2003). The term stable here describes the phase relationship between subjects' movements and the frequency of the stimuli.
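Because rates in this discussion are quoted both as frequencies and as inter-onset intervals (IOIs), the following one-line helper (a trivial sketch; the 460 ms figure above is the source's own rounding) makes the conversion explicit:

```python
def ioi_ms(rate_hz: float) -> float:
    """Inter-onset interval in milliseconds for an event rate in hertz."""
    return 1000.0 / rate_hz

print(ioi_ms(0.56))           # ~1786 ms: slowest usable (upper-threshold) rate
print(ioi_ms(2.2))            # ~455 ms: fastest usable visual rate
print(ioi_ms(8), ioi_ms(10))  # 125.0 100.0 ms: fastest usable auditory rates
```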

Scholars have proposed hypotheses for why synchronization is poorer with visual rhythmic sequences than with auditory sequences. Fraisse (1948) attributed the difference to the distance of neural connections: the motor area is closer to the auditory area than to the visual area. Thaut, Kenyon, Schauer, and McIntosh (1999) made a similar hypothesis, arguing that the connection between the visual cortex and spinal motor neurons is less direct, while the connection to the auditory cortex is more direct and enables auditory rhythms to entrain movements. Jäncke et al. (2000) developed a functional magnetic resonance imaging (fMRI) study to observe the different brain area activations in SMS to auditory and visual stimuli, finding that during the finger-tapping task, motor-related areas were activated more strongly when responding to auditory stimuli than to visual stimuli, but timing-related areas were activated more strongly for visual stimuli than for auditory stimuli.

Therefore, they suggested that SMS with an auditory sequence may rely more on eliciting an internal movement rhythm (or internal motor control), while SMS with a visual

sequence may rely more on processing the stimuli's features themselves (Jäncke et al.,

2000).

3.4. Summary

Clock-based models and entrainment-based models are both psychological timing models that seek to elucidate the underlying mechanism of temporal processing.

Clock-based models have an advantage in explaining the time estimation of independent intervals, whereas entrainment-based models have the advantage for interval sequences and more complex temporal patterns. Entrainment-based models are also better at clarifying the process of pulse perception as well as coordinated body movement, which are fundamental elements in rhythm perception. Further, sensorimotor theory (SMT), as one entrainment-based model, postulates an interconnected network of sensory, motor, and cognitive systems, which is crucial in the embodied cognition paradigm.

Additionally, this chapter explains why the current study regards rhythm perception as a dynamic process in which subjects are not passively asked to sit and listen or watch; rather, participants' body movements also serve as an indicator of the temporal information they perceive. These clarifications will aid our discussion in the next chapter on how we detect regularities in rhythmic sequences.


Chapter 4. Grouping in Rhythm Perception

Grouping refers to the perceptual unification of a series of events into a larger unit.

A certain kind of grouping occurs in experiencing musical rhythms because the auditory system spontaneously groups acoustic elements together. Of the grouping that occurs in the context of music, two kinds in particular have been widely discussed. One is serial grouping; it depends on the serial proximity in time, pitch, and timbre of temporally adjacent events and is often referred to in terms of motifs, themes, theme-groups, etc., in music analysis (Lerdahl & Jackendoff, 1983, p. 12). The other is periodic grouping (Parncutt,

1994), which organizes events into perceptually equal temporal units. These two kinds are not mutually exclusive: serial grouping defines what is perceived as (periodic) repetition.

This periodic grouping is the main perspective of the current study and is referred to simply as “grouping” throughout the discussion.

Within a sequence of auditory events, certain objective differences, such as duration and intensity, suggest the presence of a perceptual group boundary. For instance, a lengthened sound tends to indicate the end of a group (Vos, 1977), whereas a louder sound tends to indicate the beginning of a group (Trainor & Adams, 2000; Povel & Essens, 1985).

Besides these factors, the repetitiveness of a rhythmic pattern may also function as an indicator for marking the grouping boundary; for instance, in African music, one of the prominent features of its rhythm is “the periodic repetition of a single rhythmic cell” (M.


S. Eno-Belinga, 1965, p. 18). The current study developed a new approach to measure the repetition as an indicator of grouping choice and to see how well such repetition predicts a listener’s grouping habit. Before introducing the repetition-based grouping approach, the chapter first presents an overview of previous ethnomusicologists’ discussion of African rhythm perception.

4.1. Discussions on the Structure of African Rhythm

Ethnomusicologists have long been interested in the music in various African cultural groups, such as the Ewe (Nketia, 1963; Anku, 2009), Dagbamba (Chernoff, 1979;

Locke, 1990, 2009), Mandé (Charry, 2000; Polak, 2010; Polak & London, 2014), and

Yoruba (Euba, 1990; King, 1960; Omojola, 2014). Studies have found that African rhythms, especially those of percussion ensembles, are full of richness and complexity.

On the continent of Africa, a popular form of music is fast-paced, rhythmic drumming. For the people of Africa, the drum is a symbol of life, and its beat is the heartbeat of the community (Bakare, 1997). The drum belongs to the membranophone class, which produces sound primarily through the vibration of a stretched membrane. The membrane, often called a drumhead, is stretched over the open end of a frame. African music makes use of several types of drums, including the talking drum, an hourglass-shaped drum with two drumheads, and the djembe, a goblet drum with one drumhead.

The talking drum varies its pitch by changing the tension of the drumhead, such as when playing to mimic tone patterns in African tonal languages. Many African natives have no difficulty understanding the message delivered by the talking drum. The djembe is an integral part of life in parts of West Africa. It is often played alongside other percussion

instruments at events such as weddings, births, and funerals. Additionally, bells are frequently played together with drumming to keep time.

Scholars have tried various approaches to describe the rhythmic features and perceptual processes of African rhythm. Through fieldwork, scholars have found that

“Africans transmitted most of their musics orally, without indigenous forms of written representation” (Shelemay, 2010, p. 24), and Kauffman claimed more broadly that “in

Africa there are no highly systematic means of verbalized description on the nature of rhythm” (Kauffman, 1980, p. 393). This is not an altogether surprising statement, because

“rhythm” itself is not an indigenous musical category in Africa. Rather, in African culture, musical rhythm is not separated from its social and cultural context, as it is closely correlated with the whole music activity and the fabric of life (Chernoff, 1979).

Specifically, names such as Jansa, Sika, and Soli not only indicate specific Mande

(an ethnic group in West Africa) rhythmic patterns but primarily specify particular dance forms and their respective contexts.

Many sub-Saharan languages have no words whose semantics coincide with the

Western concept of "rhythm" or even “music” (Keil, 1979, p. 47; Monts, 1990, p. 90;

Charry, 1995, p. 210). However, Agawu's fieldwork among the Northern Ewe did lead to his finding that "related concepts of stress, duration, and periodicity do in fact register in subtle ways in Ewe discourse" (Agawu, 1995b, p. 388). These cultural differences have made it difficult for ethnomusicologists in the Western tradition to describe African rhythm only through semantic analysis. One alternative approach to understanding how African rhythm is produced and perceived is to transcribe sounds into notation and to analyze their structural features. However, although the concepts of meter and metrical notation have been widely used by scholars in describing and analyzing African rhythms, such approaches are not without controversy.

One criticism of this approach is the difference in how rhythms are patterned between Western and African musical traditions. In the West, the concept of meter derives from the assumption that elements in a rhythmic sequence will be perceived in a recursive hierarchical structure. In other words, the Western concept of meter refers to “the regular, hierarchical pattern of beats to which the listener relates musical events” (Lerdahl &

Jackendoff, 1981, p. 485). Specifically, every two or three units at a lower time scale are perceptually grouped into a larger unit one level higher (Lerdahl & Jackendoff, 1983). If sound events are located at a higher level, they will be perceived as more salient than other events located at lower levels. This understanding of meter is reflected in metrical notations that register sound events into periodic stress patterns. For instance, the downbeat of each metrical unit is always the most prominent one, both physically and perceptually. In African rhythm, however, there is acoustically no regular pattern of stressed and unstressed sounds corresponding to the Western concept of meter, which presents a significant challenge for those attempting to use metrical transcription to describe African rhythm.

Among ethnomusicologists who undertook this challenge was Arthur Morris Jones, a Western pioneer of African music studies who applied meter to explain the rhythmic complexity of African music. In his transcription, each phrase starts in the downbeat position of a new meter, which leads to a mishmash of meters with varying lengths (Jones,


1959). Chernoff followed Jones’s idea of multiple meters and developed the idea of a

“polymeter”—two or more different meters played at the same time—for explaining the complex and irregular stress patterns of African rhythms (specific illustration can be seen in Fig. 4.1). In his research on Ewe and Dagomba drumming, Chernoff (1979) claimed that

“African music is often characterized as polymetric because, in contrast to most Western music, African music cannot be notated without assigning different meters to the different instruments of an ensemble” (p. 45). Locke, the principal heir to Jones, expanded on this idea, proposing “offbeat timing,” in which the drumming starts on the weak beat (Locke,

1982) (specific illustration can be seen in Fig. 4.2).

Figure 4.1. The Kadodo section in Adzogbo, a dance in Accra. This notation transcribes the rhythmic patterns of three percussion instruments (bell, kagan, and kidi) in the Kadodo section of Adzogbo dance music. The bell is notated in 4/4 meter, the kidi in 6/8, and the kagan in 3/4. Adapted from "African Rhythm and African Sensibility," Chernoff, 1979, University of Chicago Press, p. 45.


Figure 4.2. Excerpt from Kagan drumming. In this notation, the main rhythmic pattern is played by the bell, and the other instruments play rhythmic patterns either "supporting" or "contrasting" to the main rhythmic pattern based on the off-beat principle. Adapted from "Principles of Offbeat Timing and Cross-Rhythm in Southern Eve Dance Drumming," Locke, D. (1982), Ethnomusicology, 26(2), p. 222.

Describing African rhythm from the perspective of the meter concept was later criticized by many Africanist ethnomusicologists (e.g., Agawu, 2014; Kolinski, 1973;

Anku, 1992). First, critics question the existence of the concept of meter itself in African cultures because no discourse or pedagogical scheme pertinent to meter-related concepts has been found among African musicians (Agawu, 2014). Further, at least two pulse levels are needed for meter perception (Yeston, 1976; London, 2012), but only one level of pulse has been confirmed in African rhythm perception among scholars. Waterman, a leading ethnomusicologist, explained that African rhythm is centered on a metronome sense, a skill that has to be developed to properly analyze African music (R. Waterman, 1952). The metronome sense can be summarized as "a learned, subjective, equal-pulse framework assumed without question or consideration to be part of the perceptual equipment of both musicians and listeners" (C. Waterman, 1991, p. 173). Agawu (1995a) claimed that a complex rhythmic surface unfolds within a "secure metronomic framework" underlying the complex rhythms of the surface (p. 110). Arom (1991) further clarified that even in polyrhythmic situations, only one single primary pulse level is created by the various interlocking rhythmic patterns. Thus, it can be concluded that a pulse level in African rhythm has been confirmed by scholars, but there is no indication of pulse grouping at higher levels.

According to Arom's (1989) description of the structure of typical African drum patterns, pulse, as an isochronous standard of musical time, is subdivided into binary (two or four equal parts), ternary (three equal parts), or, very rarely, five parts. The subdivisions are the smallest operational values, filled with tones that have different sound features

(Arom, 1989). The sound features mainly result from dynamic accentuation, tone-color alternation, and contrasting durations, all of which interweave to form the whole rhythmic sequence. Therefore, in Arom's description of the structure of African rhythm, sound events unfold sequentially without any hierarchical temporal reference. And perceptually, there is no evidence for recursive grouping hierarchies in the perception of such temporal event sequences, of the kind found in language (Altenberg, 1990) or musical rhythm (Kung &

Will, 2016). Instead, auditory sequences are perceived and processed in a linear, sequential way, in which grouping occurs in a serial sequence simply on the basis of sound features

(Clayton, 2000).


4.2. “Ambiguity” of Grouping in African Rhythm Perception

As mentioned briefly above, African rhythms are interwoven with accents and tone colors, without a reference system of regular accentuation. This interweaving "creates a feeling of uncertainty and of ambiguity regarding how the subdivision of the period is perceived"

(Arom, 1989, p. 91). The word ambiguity in Arom’s description here is salient to our discussion.

Empirically, ambiguity often arises when one hears a given rhythmic figure in which the surface features of the rhythmic sequence give no sustained or strong support to any one particular type of listening, either duple or triple. One illustration of irregular accents in a rhythmic pattern is given in Fig. 4.3. Although the pulse serves as a point of temporal reference, the accents do not regularly coincide with the pulses, whether they are ternary pulses or binary pulses. For the rhythmic figure in Fig. 4.3, listeners may perceive either four pulses or six pulses, depending on whether they choose to perceive the subdivisions in binary groups or ternary groups.


Figure 4.3. A rhythmic sequence containing a period of 12 operational values. The vertical lines indicate the perceived pulse. The red lines represent one possible perceived pulse, and the blue lines represent another. These two possibilities arrange the sequence into either four ternary pulses or six binary pulses. The position of the marked accent elements in relation to the pulses is not systematically the same. Adapted from "Time Structure in the Music of Central Africa: Periodicity, Meter, Rhythm and Polyrhythmics," Arom, 1989, Art and the New Biology: Biological Forms and Patterns, 22(1), p. 95.

It should be noted that the term ambiguity here describes neither the objective rhythmic pattern nor the perceived pulse, since both are clearly present, the former in the physical world and the latter in the inner mind. The ambiguity refers to the type of listening—specifically, it refers to the phenomenon that, in listening to the same rhythmic sequence, the perceptual groupings of different listeners may differ from each other, or one listener may form different perceptual groupings at different times while listening to the same rhythmic pattern. But each time, only one percept takes place, rather than two percepts occurring at the same time. This is analogous to Rubin's vase/face phenomenon, in which either a vase or a face is perceived from the famous bi-stable figure, but not both at the same time.


4.3. Repetition-Based Approach for Grouping

Studies have found that one of the prominent features of African rhythm is "the periodic repetition of a single rhythmic cell" (M. S. Eno-Belinga, 1965, p. 18). Since African rhythms provide a case study for the current study of visual influence on auditory grouping, an offshoot of the current study is to learn whether the repetition of patterns is a good predictor of grouping selection in listening. This would allow us to explore how different sound features correlate with different ways of listening to a given sequence, toward the ultimate objective of offering an analytical view of the ambiguity. We created an approach that enumerates the different groupings of a sound sequence. The degree of repetition under either a duple or a triple grouping of the whole rhythmic sequence is then used as an indicator for predicting the likelihood of each way of listening.

To illustrate our approach, take, for example, a simplified rhythmic pattern composed of 12 equidistant tones, in which all tones are identical except for loudness level. To express the composition of a pattern in notation, each pattern is represented by 12 signs, X and O, where X indicates a louder tone and O indicates a softer tone. For instance, a rhythmic pattern with alternating loud and soft tones is notated as

XOXOXOXOXOXO. A duple grouping of this sequence is XO XO XO XO XO XO, indicating the sequence is heard in six duple groups. A triple grouping of the sequence is

XOX OXO XOX OXO, indicating it is heard as four triple groups. The degree of repetition of a certain grouping is defined as the maximum repetition degree over its sub-patterns, and the repetition degree of a sub-pattern is the number of its occurrences divided by the number of groups. In the duple grouping example, there is only one sub-pattern, XO, which appears six times out of all six groups. Thus, the repetition degree of the duple grouping is 6/6 = 1.0. In the case of ternary grouping, two sub-patterns exist, XOX and OXO, both of which appear twice, thus each having a repetition degree of 0.5. The repetition degree of the triple grouping is decided by the repetition degree of XOX, which is 0.5.

Note that the repetition degree of OXO does not count because we treat sub-patterns differently based on their starting tone. Only the sub-patterns that start with a louder tone, X, are included when computing the repetition degree, because our approach follows the basic listening preference that a louder sound tends to be perceived as the beginning of a group

(Trainor & Adams, 2000; Povel & Essens, 1985). Therefore, the repetition degree calculated from the sub-pattern OXO in the example above is excluded from the calculation. For duple groupings, sub-patterns like XO are considered, but those like OX are not. For triple groupings, sub-patterns like XXO, XOX, and XOO are considered, but OXX, OXO, etc., are not. This distinction is introduced to qualitatively describe the characteristics of the rhythmic pattern and to predict one kind of listening. It does not necessarily indicate that Africans or audiences from other cultures listen in the way described here.

The example only lists duple grouping and triple grouping. The sequence could have alternative groupings such as quadruple grouping, which would result in a sub-pattern of XOXO, and sextuple grouping, which would result in a sub-pattern of XOXOXO. The technique may be extended to include all these cases, but in the current study the set of different groupings is simplified to consider only groupings in one of two categories—two-beat and three-beat groupings. In fact, quadruple and larger groupings are rare; studies have

found that when responding to a series of identical tones at a moderate tempo, people prefer to group them into twos or threes, a phenomenon long known as subjective rhythmization

(Bolton, 1894; Meumann, 1894).

Thus far, our technique has considered the possible groupings that begin at the first event of the sequence. But we have overlooked the scenarios in which a note other than the first in a rhythmic sequence is perceived as the first note of the grouping. In reality, perceptual grouping may start from any event of the sequence if the auditory rhythmic sequence is presented in a repeated cycle. For instance, think of a 12-beat rhythm, XOXXOXOXXOXO (Fig. 4.4), that is presented repeatedly cycle by cycle.

Normally, if a listener chooses to perceive it as triple groups and starts the grouping from the first note (shown as triple grouping A in Fig. 4.4), the calculation follows the described approach above. However, if a listener starts the grouping from the third note, it will be perceived as the first note in the listener’s mind, which would reconstruct the 12-note pattern (shown as triple grouping B in Fig. 4.4).


Figure 4.4. Illustration of two possible triple groupings, which start from different notes of the same cyclically presented rhythmic sequence.

In this sense, since rhythmic patterns are generated by repeating a sub-pattern, we could compute their duple groupings as starting at the first note, at the second note, and so on. The same applies to triple groupings. Interestingly, duple groupings that start at the first note are identical to those that start at the third note, and likewise all duple groupings that start at odd-numbered notes are identical. Similarly, all duple groupings that start at even-numbered notes are identical. Technically, triple groupings are similar in that there are three equivalence classes, where the equivalence relation is the starting note number modulo three.

Table 4.1. Cyclic calculation of binarity/ternarity tendencies of rhythmic pattern XOXXOXOXXOXO

Perceptual Grouping Type   Pattern Starting Point   Grouping Choice      Repetitiveness Degree

Ternary Groupings          First Tone               XOX XOX OXX OXO      2/4 = 0.5
                           Second Tone              OXX OXO XXO XOX      1/4 = 0.25
                           Third Tone               XXO XOX XOX OXO      2/4 = 0.5

Binary Groupings           First Tone               XO XX OX OX XO XO    3/6 = 0.5
                           Second Tone              OX XO XO XX OX OX    2/6 = 0.33

Illustrated in Table 4.1 is a list of the possible duple/triple groupings of the repeated rhythmic pattern XOXXOXOXXOXO with different starting notes, along with the corresponding repetitiveness degrees. For each grouping, the degree is determined by the most repeated sub-pattern that begins with X.

Finally, we define two additional terms, binarity and ternarity. In repeated rhythmic patterns, binarity refers to the maximum degree of repetition of all duple groupings that start at different notes. Similarly, ternarity is the maximum degree of repetition of all triple groupings. We use binarity and ternarity to predict human grouping tendencies on a given rhythmic sequence. For instance, a high binarity of a sequence, calculated by the rules in our approach, suggests that humans will be more likely to choose binary grouping when listening, and a high ternarity indicates triple groupings are more likely. In a later chapter we show that the binarity/ternarity values are in fact correlated to the actual grouping choice made by participants.
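The procedure described above is simple enough to state as a short program. The following Python sketch reproduces the repetition degrees in Table 4.1 and the binarity/ternarity values; it is a minimal illustration of the approach as described in this section (the function names are my own), not the stimulus-selection code used in the experiments.

```python
from collections import Counter

def repetition_degree(pattern: str, group_size: int, offset: int) -> float:
    """Repetition degree of one grouping of a cyclic X/O pattern.

    The pattern is rotated so grouping starts at `offset` (0 = first tone),
    then cut into consecutive groups of `group_size`. Following the rule
    above, only sub-patterns beginning with a loud tone (X) are counted.
    """
    rotated = pattern[offset:] + pattern[:offset]
    groups = [rotated[i:i + group_size]
              for i in range(0, len(rotated), group_size)]
    counts = Counter(g for g in groups if g.startswith("X"))
    return max(counts.values(), default=0) / len(groups)

def tendency(pattern: str, group_size: int) -> float:
    """Binarity (group_size=2) or ternarity (group_size=3): the maximum
    repetition degree over all distinct starting points."""
    return max(repetition_degree(pattern, group_size, o)
               for o in range(group_size))

pattern = "XOXXOXOXXOXO"
print(repetition_degree(pattern, 3, 1))  # 0.25, as in Table 4.1 (second tone)
print(tendency(pattern, 2))              # binarity: 0.5
print(tendency(pattern, 3))              # ternarity: 0.5
print(tendency("XOXOXOXOXOXO", 2))       # 1.0 for the alternating example
```

Note that only the distinct starting offsets (two for duple, three for triple groupings) need to be checked, which is exactly the equivalence-class observation made above.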


This technique is a simple preliminary method for describing the characteristic features of a rhythmic pattern. It functions as a measurement for selecting auditory stimuli as well as a reference for predicting the effect of the repetition degree of a specific rhythmic pattern on grouping behavior. The approach has been explained with a rhythmic pattern constituted of otherwise identical tones with different volumes. More sound features may be added to the repetitive rhythmic pattern to measure grouping tendency, if this simple technique is validated by the current study.


Chapter 5. Audio-Visual Interaction and Experiment Hypothesis

Much of the information extracted from music is modality-specific—especially with respect to the auditory modality—such as pitch, melody, and harmony. However, temporal information in music, such as rhythm, is hetero-modal, as it can engage one or more of our sensory domains. Consider drumming, for which one can perceive the rhythm by hearing sounds (auditory modality), by watching the drummer's movement (visual modality), or by touching the drum (somatosensory modality). Although rhythm perception engages multiple sensory systems, only the auditory and visual modalities are the focus of the current study.

As elaborated in previous discussions, both clock-based timing mechanisms and entrainment-based timing mechanisms, to different degrees, explain the underlying process of auditory and visual rhythm perception. But what is the relationship between the temporal information perceived from auditory stimuli and from visual stimuli? This question has attracted a great deal of academic attention, and much of the extant research supports the dominant role of the auditory system in temporal perception but has reached no consensus on the role of perceived visual rhythm in audio-visual interaction.

In fact, many studies identified only an insignificant role, if not an altogether nonexistent role, for visual rhythm in audio-visual interaction. Sporadic recent studies, however, reported that visual temporal information influences auditory rhythm perception.


In this chapter, we argue that visual rhythm does contribute to rhythm perception, and the difficulty that previous studies wrestled with in finding the influence of visual rhythm on auditory rhythm perception is due to the use of ineffective visual stimuli in presenting rhythmic information, as well as the overly simplified auditory rhythmic patterns used in experiments. Therefore, we propose an experiment that improves on the form of the visual stimuli, and another experiment for testing the role of visual cues in auditory grouping perception.

5.1. Previous Studies on Audio-Visual Interaction in Rhythm Perception

According to Lewkowicz and Ghazanfar (2009), concurrent information received from different sensory modalities is often incorporated into a coherent perceptual experience, and this perceptual process is often reflected in two situations in our daily life.

In some situations, multisensory information comes from the same source, such as perceiving sensory information from different modalities while watching a person speaking. In other situations, the multisensory information comes from different sources of action, such as watching a conductor's gesture while hearing the ensemble's music. In the latter scenario, cross-modal interaction may exist, where the input in one modality affects the perception of the same information input in another modality. Numerous studies have explored audio-visual interaction in rhythm perception. The current study, too, places its main focus on audio-visual interaction. Specifically, while studies have observed how auditory temporal information from bimodal stimuli affects the perception of timing

(Fendrich & Corballis, 2001; Morein-Zamir, Soto-Faraco, & Kingstone, 2003), the current research turns to the converse influence: the effect of perceived temporal information from the visual modality in the audio-visual interaction in rhythm perception, a topic on which scholarly consensus has been elusive.

Many early studies, across a wide range of timing tasks and methodologies, found that the auditory modality dominates in audio-visual interaction in temporal perception. Recanzone

(2003) developed a series of experiments featuring audio-visual interaction on tempo perception. He found that even though the tempos of the auditory tones and visual flashing lights were clearly distinct, subjects’ perceived tempo of the visual sequence shifted toward the tempo of the simultaneously presented auditory sequence when executing temporal judgement tasks. Repp and Penel (2004) tested the synchronized motor response to metronome sequences under distraction, using a target sequence and a distractor sequence with either flashing lights or auditory tones, with different phase relationships. They found that subjects’ synchronized movement to visual stimuli was easily distracted by the simultaneously presented auditory stimuli, but they were less easily distracted by the visual sequence in their synchronization to the auditory sequence. Another experiment from the same research group studied human automatic phase correction (Repp & Penel, 2002).

They simultaneously presented two isochronous sequences, one auditory and the other visual, such that the onset times of both sequences coincided. Temporal perturbations were introduced by shifting corresponding auditory and visual events in temporally opposite directions. The researchers found that the subjects responded to the perturbation in the auditory sequence even though they were asked to tap in synchrony with the visual sequence. Likewise, Kato and Konishi (2006) presented isochronous auditory and visual sequences in anti-phase and found similar results. These results supported the view that the auditory modality is superior to the visual modality for temporal perception. A hypothesis called modality appropriateness was proposed to explain the auditory dominance in temporal perception (e.g., Glenberg & Jona, 1991; Welch & Warren, 1986). According to this hypothesis, the auditory system is highly developed for temporal processing, and the visual system is highly developed for spatial processing.

5.1.1. Alternative Visual Stimuli

Some recent studies, using alternative visual stimuli rather than the "flashing lights" commonly used in early studies, have observed the influence of temporal information from the visual modality on auditory rhythm perception. While earlier studies employed the repeated movement of a simple object to convey temporal information, more recent research uses stimuli in more complex forms involving anthropomorphic objects and movements.

A traditional flashing-light stimulus involves a white dot in the middle of a black background screen (or another pair of contrasting colors). A flash is made by illuminating and then dimming the center dot. Each flash marks a temporal event, and at least two flashes of the dot constitute one interval. The specific setting of the flashing light used in presenting temporal information is illustrated in Figure 5.1.


Figure 5.1 Illustration of flashing light as visual stimuli in temporal perception studies. The time lapse between the first and second light flashes is designed for interval perception tasks.

Recently, scholars have questioned whether the auditory modality's dominance in temporal perception truly reflects differences between the auditory and visual modalities, or whether the flashing visual stimulus was simply less effective at encoding temporal information in the laboratory setting. In the last two decades, a series of studies added supplemental movement information to the visual stimuli to see whether it would optimize and facilitate visual rhythm perception. Initially, a bouncing ball was introduced. Study subjects were shown a visualization of the changing velocity and trajectory of the ball's bouncing movement (Figure 5.2); every hit of the virtual edge marks a temporal event, and at least two hits mark one interval. The researchers found that subjects' synchronization performance in the bouncing-ball setup improved greatly compared to the static flashing-light experiment and was nearly as good as with an auditory metronome (Iversen, Patel, Nicodemus, & Emmorey, 2015; Gan, Huang, Zhou, Qian, & Wu, 2015), which contradicted previous claims that the visual senses were unable to aid rhythm perception (Patel, Iversen, Chen, & Repp, 2005).

Hove, Iversen, Zhang, and Repp (2013) even found that a bouncing ball as a visual distractor has a stronger effect than an auditory distractor (auditory beeps) for visually trained participants (e.g., video gamers and ball players).

Figure 5.2 Illustration of a bouncing ball as visual stimulus in temporal perception studies. The time lapse between the onsets of the first and second squashed balls is designed for interval perception tasks.


Further, Su and Jonikaitis (2011) found that auditory tempo judgment is influenced by the rate of simultaneously presented visual stimuli with spatial movement. Specifically, subjects listened to a standard auditory sequence while concurrently watching dots moving at a faster, the same, or a slower speed. Subjects judged the auditory tempo as faster when the auditory sequence was accompanied by accelerating visual moving stimuli, and slower when it was accompanied by decelerating visual moving stimuli; there was no change in tempo judgment when the auditory sequence was accompanied by constant-speed visual stimuli. These results, which Recanzone (2003) did not find using static flashing lights in tempo judgment tasks, were also found in the same study for temporal reproduction tasks.

Brain imaging studies offer an explanation for why visual movement stimuli are more effective in visual temporal perception and why they influence auditory temporal perception. For instance, one study found that the motor area of the human brain is activated during visual movement perception (Schubotz, Friederici, & von Cramon, 2000); the involvement of the motor area may support finer-grained tracking of the rhythmic movement in the stimuli (Naito, 2004) as well as modulate the ongoing neural activity in the primary auditory cortex (Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008).


Figure 5.3 Illustration of PLF as visual stimuli in delivering temporal information. The knee-bending movement of a humanlike figure is presented by the PLF, and the temporal onsets of two bending movements are used as interval markers.

More recently, studies have added biological movement to the visual stimuli used to present temporal information; for instance, the artificial animation of a point-light figure (PLF) has become a frequently used stimulus in multimodal perception studies. Examples of PLFs used in audio-visual temporal perception studies include orchestra conductors' moving gestures (Wöllner, Deconinck, Parkinson, Hove, & Keller, 2012) and humanlike knee bending (Su, 2014a, 2014b, 2014c; Vuoskoski, Thompson, Spence, & Clarke, 2016). The research shows that using a PLF significantly improved synchronization performance compared to using inanimate object movement. Figure 5.3 illustrates a specific setting of the knee-bending PLF. In this experimental setup, each downward bend marks a temporal point, and at least two downward bends mark an interval. Stadler, Springer, Parkinson, and Prinz (2012) found that biological movement (human body movement) has a different velocity profile than artificial movement (object movement) and supports more precise temporal prediction of visual action. This perceptual advantage in perceiving temporal information, based on the velocity profile of human movement, suggests that visual stimuli with biological kinematics are more appropriate than object movement for temporal perception tasks.

5.1.2. Alternative Auditory Rhythmic Pattern

Some recent studies have also found an influence of visual stimuli in audio-visual interaction when the audio stimuli have a non-trivial rhythmic pattern. The studies mentioned above, such as Recanzone (2003), Repp and Penel (2002, 2004), and Kato and Konishi (2006), used identical isochronous tones. However, temporal patterns more complex than such identical sound sequences occur in live performances, where the regular appearance of a given auditory feature is not easily found on the rhythmic surface. For instance, the alap section of Hindustani classical music performances features free rhythm, which has no explicit beat; yet Clayton (2007) analyzed the temporal changes of a singer's actions in performance videos and found that the performer's arm and hand gestures influenced the temporal behavior of the other performers, as well as that of the audience. Further, a behavioral study by Su (2014a) used complex auditory rhythms with irregular temporal relationships among the tones of the sequence. It found that participants synchronized their tapping to the complex auditory rhythm more consistently when they were paced by an audio-visual beat (auditory tones accompanied by the knee-bending movement of a point-light figure animation) than by an auditory beat alone.

Generally, these behavioral studies have observed that temporal information from the visual modality influences auditory temporal perception, and that auditory rhythmic patterns in real-life situations are more complex than the identical isochronous tone sequences used in earlier experiments. Rhythm perception often involves higher-order or complex processing, such as pulse perception, and these complex patterns may introduce difficulty into such tasks. In such cases, the perceptual information from the auditory modality may not be enough to complete the analysis of the auditory rhythmic structure, and the perceiver resorts to temporal references in a different modality.

5.2. Present Research Questions

Based on the previous discussion, two aspects were summarized that lead to negative findings for the visual senses in auditory rhythm perception. In the following two sections, the hypotheses concerning a more appropriate visual stimulus and the audio-visual interaction in rhythmic grouping perception are introduced, and the rationale for the corresponding testing methods is discussed.

5.2.1. Hypothesis of More Efficient Visual Stimulus

A review of previous studies revealed that the addition of movement (spatiotemporal) information to visual stimuli (e.g., bouncing ball, animated PLF) is one key to optimizing visual rhythm processing. However, all of these previous studies used artificial visual stimulation, which does not reflect the natural environment we experience. Natural rhythmic movement looks smoother and more complex than the rhythmic movement of artificial point-light visual stimuli. In a natural context, drumming, for instance, is achieved through a complex and highly coordinated mechanical interaction between bones, muscles, ligaments, and joints within the musculoskeletal system, which may convey subtler temporal information than artificial stimuli can. Moreover, the PLF indicates only artificial movements of bones and joints, which cannot reflect the complete dynamics of real human body movement. It remains unclear, however, whether presenting movement stimuli as an artificial PLF animation or as a real human actor makes any difference to their effectiveness in temporal perception.

The current study takes this question one step further and tests whether a new stimulus, a presentation of natural human drumming movement, is a more effective visual stimulus in temporal perception. It is possible that the human neural system more readily resonates and entrains to external visual stimuli in a natural, native form of presentation than to a PLF. This hypothesis was tested by comparing synchronization performance and error-correction performance in response to visual stimuli of an artificial PLF versus natural human movement. The expected result was that participants would synchronize more stably with the natural human visual stimuli than with the artificial PLF visual stimuli. Moreover, with the temporal perturbation, it was expected that participants could adjust their tapping to re-synchronize to both types of stimuli, but a faster recovery was expected with the natural visual stimuli than with the artificial ones. The experiments and results testing this hypothesis are presented and discussed in Chapter 6.


5.2.2. Hypothesis of Visual Influence on Auditory Grouping Perception

We propose another hypothesis: that listeners' duple/triple grouping choices for an auditory rhythmic pattern would be considerably influenced by visual temporal cues.

The reason for choosing grouping behavior as the object of observation is that grouping is influenced by various dimensions of auditory features, such as pitch and timbre (Ellis & Jones, 2009; Krumhansl & Iverson, 1992), tempo (Repp, 2007), and tonal structure (Prince, Thompson, & Schmuckler, 2009). Auditory grouping may also be influenced by temporal information presented in other modalities, such as the temporal information concurrently received from visual stimuli. Previous studies have not tested this cross-modality interaction.

The current study further hypothesizes that grouping based on auditory features can be influenced by temporal cues from the visual modality. This hypothesis is tested by comparing subjects' grouping choices with and without visual stimuli to see whether there is a difference. Multiple methodological approaches exist for observing grouping choice. One is the tapping task (e.g., Pfordresher, 2003; Povel & Essens, 1985; Snyder & Krumhansl, 2001), in which participants are asked to indicate the location of grouping boundaries in a rhythmic pattern by tapping. The time span between taps (the inter-tap interval; ITI) is the main indicator of a listener's grouping responses, as it corresponds to the time span of the grouped sound events. This approach is founded on the concept that music perception is an embodied process. An alternative methodological approach to auditory grouping is the rating task (Ellis & Jones, 2009; Prince, 2014; Hannon et al., 2004; Nittono et al., 2000), in which participants rate the perceived grouping of the stimuli on a five-point scale, where one is strongly duple and five is strongly triple. The rating-scale test is a subjective measure. The current study used the tapping approach instead of the ordinal-scale task, because the tapping task is more in accordance with the fundamental paradigm that music perception is an embodied process, as discussed in detail in Chapter 1. Lastly, behavioral feedback was preferred for analyzing subjects' perception processes. The testing of this hypothesis and the experimental procedure are explained in detail in Chapter 7.

5.3. Summary

This chapter presented an overview of previous studies on audio-visual interaction in temporal perception. The literature shows that different forms of visual stimuli influence the effectiveness of temporal perception. Specifically, some studies found that although rhythm perception often resides in the auditory domain, given appropriate visual stimuli, the role of temporal information in the visual modality cannot be neglected in concurrent auditory rhythm perception. Moreover, for more complex perceptual tasks, especially perceiving the rhythmic structure of an auditory sequence, the perceiver often resorts to temporal information from the visual modality when the temporal information received from the auditory stimuli is not sufficient to complete the analysis of the auditory rhythmic structure.

This chapter proposed that visual stimuli play a significant role in auditory grouping, and that those most closely resembling natural human body movement are the most appropriate in experimental settings. When both the visual and auditory stimuli present rhythmic information, the visual stimuli act as a grouping cue that shifts the distribution of the subject's grouping choices.

In the following chapters, two experiments are developed to test the function of natural visual stimuli as a more effective visual stimulus in temporal perception, as well as to determine if perceived visual temporal information changes subjects’ behavior in grouping auditory rhythms.


Chapter 6. Experiment on Natural/Artificial Visual Rhythm

In the present study, the first experiment was designed to explore whether humans entrain better to natural human visual stimulation than to artificial, humanlike visual stimulation (e.g., point-light figures, or PLF). In particular, the experiment measured and compared the synchronization of human pulse tapping to the two types of stimuli, in both a unimodal visual condition (no accompanying sound) and a bimodal audio-visual condition. The subjects were asked to synchronize their tapping to the pulse they felt in four conditions: artificial PLF visual stimulation only (VP), artificial PLF visual stimuli with sound (AVP), natural human visual stimulation only (VN), and natural human visual stimulation with sound (AVN). The degree of synchronization and the dynamic error-correction performance in the different conditions were then compared. The visual stimulus associated with the best entrainment performance was then used in the main experiment.

We predicted that the synchronization of finger tapping to the visual stimuli of natural human movement would be more stable and that participants would more quickly regain synchronization with the same stimuli.

6.1 Participants

The experiment involved 16 participants (eight women, mean age 26.75 years, SD = 1.75; eight men, mean age 26 years, SD = 3.02). They were undergraduate and graduate students at The Ohio State University, Columbus. All reported normal hearing, normal or corrected-to-normal vision, and no known history of neuromuscular deficits that would affect their movement. Three female participants had received professional musical training (mean training time of four years, SD = 1.5); the remaining 13 participants had no musical training experience. All participants gave verbal informed consent before the experiment and received $10 in compensation afterward. This study was approved by The Ohio State University Institutional Review Board for Human Research.

6.2 Material and Stimuli

The stimuli in this study were extracted from recordings of a live djembe performance made by the researchers. In the recording, a Senegalese professional djembe player performed metronome-like beats at a slow tempo (1.33 Hz, IOI = 750 ms), a medium tempo (2 Hz, IOI = 500 ms), and a fast tempo (3.33 Hz, IOI = 300 ms). The performance was recorded with a Nikon D3000 camera (1280 × 768 image resolution) at a visual sampling rate of 60 Hz (60 frames per second) and an audio sampling rate of 48 kHz.

The onset of each drum sound on the audio track of the recording was marked and analyzed in Praat (Boersma & Weenink, 2017). The mean value and standard deviation of the IOIs were analyzed to find the most stable periods. Three segments of 85 quasi-isochronous beats with the least variation were extracted from the original clips with Adobe Premiere Pro: one at the fast tempo (mean IOI = 300.20 ms, SD = 5.2 ms), one at the preferred tempo (mean IOI = 500.49 ms, SD = 10.2 ms), and one at the slow tempo (mean IOI = 746.67 ms, SD = 14.9 ms).


Auditory Stimuli. The auditory stimuli were extracted directly from the three video clips, each containing 85 quasi-isochronous djembe beats.

Visual Stimuli. Two types of visual stimuli were used in this test. The first was the natural human visual stimulus (VN), in which the original natural recording of the djembe player was used as the visual stimulus (Figure 6.1A). The second type was the artificial PLF visual stimulus (VP), in which the drumming movement was presented as a point-light figure display (Figure 6.1B). The humanlike PLF visual stimuli (VP) were generated from the natural human visual stimuli (VN) as follows: the movements of the hands, arms, shoulders, and head in the VN were extracted frame by frame as points, lines, and circles. The marked frames were then rebuilt into an animated video (VP) through a MATLAB animation script. In this way, the VP stimuli have the same motion projection and dynamics as the human actor.



Figure 6.1. Part A. Natural human visual stimuli (VN); Part B. Artificial PLF visual stimuli (VP)

Structure of One Trial. A sequence of 85 consecutive drum beats formed one trial. The first ten beats were designed as the adaptation phase, during which subjects synchronized their tap responses. The following 40 beats (from beat 11 to beat 50) were regarded as the synchronized phase. Subjects' responses to this section were used to measure the degree of stabilization of entrainment.

Because a simple demonstration of synchronization performance is not enough to identify entrainment (Rosenblum et al., 1998), IOI perturbations were added to the clips to test timing error-correction behavior and the time course of re-synchronization. The next 35 beats (from beat 51 to beat 85) were designed as the perturbation phase. In the phase perturbation paradigm (Repp, 2001; Elliott et al., 2008), a perturbed beat is one that occurs either earlier or later than expected, and all subsequent beats are shifted by the same amount, such that participants have to correct the timing of their taps to get back in phase with the beat. In this test, two single phase perturbations were introduced in each trial. The first perturbation (P1) was located at beat 51 and the second (P2) at beat 68. Each trial contained both types of perturbation (early and delayed), the order of which was counter-balanced over the whole experiment.

To create the perturbations, the sound track and image track of the original video clips were edited in MATLAB to either shorten or lengthen the IOI between the target beat and its preceding beat. In many previous psychophysical studies, for interval durations between 300 ms and 1000 ms, the detection thresholds for a change of a single interval in an isochronous sequence are rarely below 4% of the interval (studies summarized in Repp, 2000). In our study, the perturbation magnitude was set to 20% of the mean IOI, which is well above the detection threshold. Applied to the auditory stimuli, the perturbations were created by either shortening or lengthening one IOI by 150 ms (at 1.33 Hz), 100 ms (at 2 Hz), or 60 ms (at 3.33 Hz). The visual stimuli were manipulated correspondingly: at a frame rate of 60 fps, 13 frames (1.33 Hz), 9 frames (2 Hz), or 6 frames (3.33 Hz) were either eliminated or doubled for the perturbed beat. Eliminated frames were selected from the downward movement, and duplicated frames from the upward movement, that produced the respective sound. Because such video editing may disturb the flow of the observed movement, the cut frames were not adjacent, in order to minimize noticeable discontinuities in the video.
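As a rough sketch of this perturbation scheme (the function and variable names are hypothetical, and the details are only assumptions based on the description above), the beat-onset times of one trial could be generated as follows:

```python
import numpy as np

def build_perturbed_onsets(ioi_ms, n_beats=85, positions=(51, 68),
                           signs=(-1, +1), magnitude=0.20):
    """Phase perturbation paradigm: the beat at each perturbation
    position occurs early (-) or late (+) by magnitude * IOI, and all
    subsequent beats keep that shift (a phase change, not a tempo change)."""
    onsets = np.arange(n_beats) * float(ioi_ms)
    for pos, sign in zip(positions, signs):
        onsets[pos - 1:] += sign * magnitude * ioi_ms  # beats are 1-indexed
    return onsets

# Example: preferred tempo (IOI = 500 ms), early P1 at beat 51, delayed P2 at 68.
trial = build_perturbed_onsets(500, signs=(-1, +1))
print(np.diff(trial)[48:52])  # IOIs around beat 51: [500. 400. 500. 500.]
```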

Therefore, one stimulus trial consists of 85 drum beats, subdivided into three phases (Table 6.1):


Adaptation phase: the first ten beats, used for the subject to adapt to the test.

Stable phase: beats 11 to 50, regular metronome beats used to test the subject's degree of synchronization in entrainment.

Perturbation phase: two perturbations, added at beats 51 and 68 in each trial, to test error-correction ability in entrainment.

Table 6.1. Structure of one stimulus trial.

Phase                         Beat No.
Adaptation phase              1-10
Stable phase                  11-50
P1                            51
P1 error-correction period    52-67
P2                            68
P2 error-correction period    68-85

Stimuli in the Test. There were 24 trials in total: 4 conditions (VN, VP, AVN, and AVP), each presented at 3 tempi (S: 1.33 Hz, P: 2 Hz, F: 3.33 Hz) and with 2 perturbation arrangements (E1D2: early perturbation first, delayed perturbation second; E2D1: delayed perturbation first, early perturbation second). These 24 trials were presented in two blocks: 12 trials with visual-only stimuli in one block, and the other 12 trials with audio-visual stimuli in the other block.

6.3 Procedure

The experiment was performed and analyzed in a quiet room at the Cognitive Ethnomusicology Lab. Auditory stimuli were presented binaurally over Sony headphones. The visual stimuli were presented on a laptop at 1920 × 1080 resolution. The experiment was presented using Python software. Participants were asked to wear the headphones for the audio-visual block.

Before the experiment, the participants were asked for oral consent to participate in this experiment. Then they were instructed to sit in front of the laptop screen. A training section was presented before the formal test.

The training section contained four short trials. Each trial had 20 metronome-like drum beats (at the preferred tempo of 2 Hz), one in each of the four conditions: VN, VP, AVN, and AVP. The participants were asked to wear the headphones only in the combined audio-visual conditions. During this familiarization stage, the participants were asked to hold a 15 cm metal rod with their preferred hand and tap a wooden pad on the table. They were asked to synchronize their tapping movement with the pulse they perceived from the stimuli.

Additional presentations of the stimuli were provided if subjects required them, to ensure an understanding of the different experimental conditions. After the training session, the formal session started when the participant felt ready.

In the formal test, participants were asked to synchronize their tapping to the visual stimuli they watched on the screen. They were not informed about the perturbations in the stimuli. For balance, half of the participants started with the block that did not require the headphones; the other half started with the block with the headphones.

The sessions were self-paced; participants started a block by pressing the space bar on the keyboard, and each subsequent trial automatically started five seconds after the end of the previous trial. Additionally, a rest could be taken between the sessions.

The subjects' tapping responses and the sound stimuli were recorded simultaneously on separate, synchronized tracks in Adobe Audition for later analysis.

6.4 Data Collection

The subjects’ responses in the tests were transformed into a series of onset times.

The onset times for sound events in both the recorded stimuli track and the subjects’ tap response track were analyzed in MATLAB. The onset extraction script in MATLAB automatically marked the onset of each sound event of the participants’ taps.

The script operated on the wave file of the recording of the participants' taps. It scanned the file sequentially with a 3-ms window, which was activated when the captured sound level exceeded 10% of the global maximum and deactivated once the window contained no sound level above that threshold. The time span from the activation to the deactivation of the window was considered one tap, and the activation point was considered the onset of that tap. The automated tagging of taps was double-checked by the researcher with the aid of audio playback.
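A minimal sketch of such a threshold-based onset extractor (assuming a mono waveform array; the function name and exact windowing policy are our own approximation, not the original MATLAB script):

```python
import numpy as np

def extract_tap_onsets(wave, sr, win_ms=3.0, rel_thresh=0.10):
    """Scan a mono recording with a short window: the window activates
    when the level exceeds 10% of the global absolute maximum, stays
    active while any sample in it is above threshold, and deactivates
    once it is empty of suprathreshold sound; the activation point is
    taken as the tap onset."""
    level = np.abs(np.asarray(wave, dtype=float))
    thresh = rel_thresh * level.max()
    win = max(1, int(sr * win_ms / 1000.0))
    onsets, i, n = [], 0, len(level)
    while i < n:
        if level[i] > thresh:                 # window activates: onset
            onsets.append(i / sr)             # onset time in seconds
            j = i + win
            while j < n and np.any(level[j - win:j] > thresh):
                j += win                      # still inside the same tap
            i = j                             # window deactivated
        else:
            i += 1
    return np.array(onsets)
```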

Each response event was attributed to its closest stimulus event. If any tap responses remained unattributed after the one-to-one attribution, they were regarded as tapping errors and deleted (illustrated in Figure 6.2).

Figure 6.2. Illustration of the procedure for registering the response data. In the first run of registration, each stimulus is assigned its 180-degree region (geometrically, its Voronoi cell). The green line indicates the midpoint of the period between two stimulus events. If there is only one response in the region (the region between two green lines), that response is mapped to the stimulus. If there are multiple responses, the one nearest the stimulus is mapped, and the others are left unmapped for the next run. Following this rule, R1, R2, and R3 were registered as the responses to S1, S2, and S3. R5 and R4 both sit in S4's region; R4 was closer to S4, so it was mapped to S4, and R5 was left unregistered for the second run. The same situation occurred with R6 and R7, where R7 was mapped to S7 and R6 was left for the second run. In the second run, responses were scanned sequentially. For each unregistered response, the response before it was checked; if that previous response was registered to a stimulus, the stimulus after that one was checked, and the current response was mapped to it if it had not yet been mapped to by any response. Following this guideline, R5 was mapped to S5. R6 had no stimulus to map to, so it was regarded as an extra-tap error and excluded from further analysis.
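The two-run registration described in the caption can be sketched as follows (a simplified approximation; `register_taps` is a hypothetical name, and the tie-breaking details are assumptions):

```python
import numpy as np

def register_taps(stim, resp):
    """Two-run attribution of response onsets to stimulus onsets (both in
    seconds, sorted). Returns {stimulus_index: response_index}; responses
    still unmapped after both runs count as extra-tap errors."""
    stim, resp = np.asarray(stim), np.asarray(resp)
    # Run 1: each response falls in the region of its nearest stimulus
    # (bounded by midpoints); the response nearest the stimulus wins.
    nearest = np.abs(resp[:, None] - stim[None, :]).argmin(axis=1)
    mapping = {}
    for r, s in enumerate(nearest):
        if s not in mapping or abs(resp[r] - stim[s]) < abs(resp[mapping[s]] - stim[s]):
            mapping[s] = r
    # Run 2: an unregistered response may claim the stimulus after the one
    # its predecessor was registered to, if that stimulus is still free.
    for r in range(1, len(resp)):
        if r in mapping.values():
            continue
        inverse = {v: k for k, v in mapping.items()}
        prev = inverse.get(r - 1)
        if prev is not None and prev + 1 < len(stim) and prev + 1 not in mapping:
            mapping[prev + 1] = r
    return mapping
```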


6.5 Synchronization Success Rate Analysis and Data Filtering

6.5.1. Synchronization Success Rate Analysis

The data from the adaptation phase (responses to the first 10 stimulus events) were not included in any further analysis, because synchronization typically requires a few taps to stabilize. The response tap timing data in the stable phase (from beat 11 to beat 50) were analyzed using standard circular statistics packages in MATLAB. The instantaneous phase between the subjects' responses and the stimulus events was calculated as the location of the response events within the stimulus event cycles. Each response tap in a trial was mapped onto a unit circle in terms of its relative phase (0°–360°) from the quasi-periodic target (always at 0°). The relationship between the stimulus and response events was registered as the phase angle, or degree of synchronization. For example, the phase angle was 0° if the taps and target stimuli were simultaneous events, and 180° if they were precisely off beat. Taps slightly after the target had a phase of 0°–90°, and taps slightly before the target had a phase of 270°–360°.

The uniformity of the distribution of the phase angles is one of the important indicative measures of synchronization (Repp, 2003). If subjects were unable to synchronize, their tap-to-target phases would take the form of a continuous phase drift (i.e., steadily increasing or decreasing asynchronies). In this case, the tap-to-target phase angles would tend to spread uniformly over the range of 0°–360°. We used the Rayleigh test to test the null hypothesis of uniform distribution against the alternative hypothesis of non-uniform distribution of the tap-to-target relative phases from each trial (Hove, Spivey, & Krumhansl, 2010). A trial was considered successful if the null hypothesis of uniformity could be rejected with a p-value of less than 0.05. Trials with a near-uniform distribution of phase angles were not counted as successful and were excluded from further analyses.
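As an illustration, the phase mapping and the Rayleigh test might be implemented as below (a sketch with hypothetical helper names; the p-value uses a common first-order approximation rather than whatever exact routine the MATLAB circular statistics package provides):

```python
import numpy as np

def tap_phases(stim_onsets, tap_onsets):
    """Map each tap onto the unit circle: phase = 2*pi * asynchrony / IOI,
    where asynchrony = tap onset - nearest stimulus onset and the IOI is
    the median inter-stimulus interval of the quasi-isochronous sequence."""
    stim = np.asarray(stim_onsets)
    taps = np.asarray(tap_onsets)
    ioi = np.median(np.diff(stim))
    nearest = stim[np.abs(taps[:, None] - stim[None, :]).argmin(axis=1)]
    return (2 * np.pi * (taps - nearest) / ioi) % (2 * np.pi)

def rayleigh_test(phases):
    """Rayleigh test for non-uniformity of circular data: returns the
    mean resultant length R and an approximate p-value."""
    n = len(phases)
    R = np.abs(np.mean(np.exp(1j * phases)))           # mean resultant length
    z = n * R ** 2
    p = np.exp(-z) * (1 + (2 * z - z ** 2) / (4 * n))  # first-order approx.
    return R, float(np.clip(p, 0.0, 1.0))

# A trial counts as successfully synchronized when p < 0.05.
```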

The average percentage of successful trials in each condition (VN, VP, AVN, and AVP) and tempo (slow, preferred, and fast) was calculated and reported. The synchronization success rate in each condition and tempo was then compared and summarized.

6.5.2. Results of Synchronization Success Rate Analysis

There were 384 trials from all 16 subjects in this experiment, with an overall success rate of 95.58%. Seventeen trials could not reject the null hypothesis of uniform distribution and were considered unsuccessful synchronizations; they were excluded from later analysis. Specifically, the overall success rate of trials under the bimodal conditions was higher than that under the unimodal conditions (100% vs. 91%) (χ²(1, N = 384) = 15.756, p < 0.001). This result agrees with the conclusion from previous studies (Elliott et al., 2010; Wing et al., 2009) that a bimodal beat (e.g., an audio-visual beat) has a stronger effect on coupling subjects' sensorimotor synchronization in finger tapping than does a unimodal (visual) beat. Figure 6.3 presents the synchronization success rate in each condition.


Figure 6.3. Percentage of successful trials in the test, grouped by condition (AVN, AVP, VN, VP) and tempo (slow, preferred, fast). [Bar chart; all bimodal conditions reached 100%, while the visual-only conditions dropped to 78.13% (VN) and 75% (VP) at the fast tempo.]

From Figure 6.3, it can be seen that the advantage of bimodal stimuli in success rate increases gradually with tempo. The advantage is greater in fast-tempo trials (χ²(1, N = 128) = 14.801, p < 0.001), smaller in preferred-tempo trials (χ²(1, N = 384) = 0.50794, p = 0.476), and absent at the slow tempo. Compared to the bimodal conditions, the success rate remained relatively high in the slow- and preferred-tempo visual-only conditions; the rate dropped considerably only in the fast visual-only conditions.

The results showed that the type of visual display (either an artificial point-light figure or real, natural human movement) makes no difference in the success rate at any tempo, regardless of the modality of the stimuli (χ²(1, N = 384) = 1, p < 0.001). The synchronization success rates in the two bimodal conditions (AVN and AVP) were the same at every tempo. Under the visual-only conditions, the rates at the slow and preferred tempi for VN were the same as those for VP, and at the fast tempo the VN rate (78.13%) was slightly higher than that of VP (75%).

6.6 Entrainment Performance Analysis

Entrainment was analyzed based on the subjects’ synchronization performance and error-correction performance. Specifically, the mean vector and circular variance were calculated to measure the quality of the synchronization performance during the stable phase. The error-correction performance was measured based on the instant phase correction and the rate at which the subject regained synchronization after the perturbation.

Specific analysis methods and results are listed below.

6.6.1. Synchronization Performance Analysis and Results

Synchronization Degree Analysis. The two main variables used to assess synchronization performance were the mean vector and the circular variance. Both variables provide a detailed view of the synchronization between a participant's tapping responses and the stimuli, as well as the variability of this synchronization.

The mean vector angle θ, the mean value of the phase angles, represents the nature of the synchronization. It is an indicator of a lag or lead tendency in the tapping responses. For instance, a mean vector angle greater than 0° indicates that subjects' responses had a lagging tendency, i.e., the responses were slightly later than the stimuli. A mean vector angle below 0° (equivalently, just under 360°) indicates a leading tendency, i.e., responses slightly ahead of the stimuli.

Circular variance is another assessment of synchronization performance; it measures the variability of the taps' relative phases (the stability of the tap-to-target synchronization). Circular variance v varies from zero to one. It is an indicator of the stability or consistency of the tapping responses: the larger the variance, the less stable or consistent the synchronization.
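Both quantities follow directly from the mean resultant vector of the phases; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def circular_summary(phases_rad):
    """Summarize tap phases (radians) by the mean resultant vector
    r = mean(exp(i*phase)): its angle gives the lag/lead tendency, its
    length the concentration, and the circular variance is v = 1 - |r|
    (0 = perfectly stable synchronization, 1 = none)."""
    r = np.mean(np.exp(1j * np.asarray(phases_rad)))
    mean_angle_deg = np.degrees(np.angle(r)) % 360.0
    return mean_angle_deg, np.abs(r), 1.0 - np.abs(r)

# Example: taps slightly leading the beat cluster just below 360 degrees.
angle, length, variance = circular_summary(np.radians([352, 355, 358, 350]))
```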

We were interested in seeing whether the trials with a natural human display had smaller circular variance values and a different mean from those with artificial stimuli in any of the conditions. To test this, the analysis involved a 2 (bimodal or visual-only) × 2 (natural human or point-light figure visual display) × 3 (slow, preferred, fast) repeated-measures analysis of variance (ANOVA) on the mean and the circular variance.

Results of Synchronization Degree. The mean vector analysis showed that the responses in all conditions, except the fast VN, showed negative asynchrony (from −30° to 0°). This means that the responses were slightly ahead of the stimuli. The result is illustrated in Figure 6.4.


Figure 6.4. Phase of the mean vectors grouped by different conditions and tempi.

The result is consistent with the previous finding that people usually tap slightly earlier than the upcoming stimulus due to an anticipation mechanism. Only the fast VN showed positive asynchrony (27.6°), which indicates that the response taps lagged behind the stimuli by 23 ms on average.

It is worth noting that the participants tapped to the natural human visual display slightly later than to the point-light human display. This trend existed in all conditions and tempi. However, we were not able to confirm the finding with statistical significance using the Watson-Williams test. The difference in means was significant only for the slow VN and VP (Watson-Williams: p = 0.0334), and the test was not applicable to the fast VN and VP conditions because the mean vector length was too short (< 0.45). What seems more interesting is that in the slow tempo the responses to unimodal stimuli were earlier than those to bimodal stimuli, and the inverse was true for the fast tempo. The complete table of vector lengths in every condition and tempo is given below in Table 6.2.

Table 6.2. Vector length in different conditions and tempi

Condition   Slow        Preferred   Fast
AVN         0.9435861   0.9002744   0.81581224
AVP         0.9182857   0.8954824   0.79221228
VN          0.8873482   0.6966398   0.24303527
VP          0.8544431   0.7290651   0.44801053

The 2 (bimodal or visual-only) × 2 (natural human or point-light figure visual display) × 3 (slow, preferred, fast) repeated-measures analysis of variance (ANOVA) on circular variance revealed main effects of modality (with or without auditory stimuli) and tempo. A complete ANOVA table can be found in Appendix A. As illustrated in Figures 6.5 and 6.6, the synchronization performance was significantly more stable in bimodal conditions than in visual-only conditions: F(1, 15) = 143.08, p < 0.001 (ηp² = 0.185). Additionally, the variance grew larger as the IOI decreased (i.e., as the tempo increased): F(2, 30) = 97.74, p < 0.001 (ηp² = 0.253). As shown in Figure 6.7, the data indicate a significant interaction between the two main effects: F(5, 75) = 44.321, p < 0.001 (ηp² = 0.115). The partial eta-squared (ηp²) is the effect size. No other interactions were found among the main factors.

Figure 6.5. Two-dimensional graph of the mean variance for AVN and AVP at three tempi (F, P, S). The x-axis shows the IOI values; the y-axis shows the mean variance in each group. Yellow bars indicate the AVP condition and blue bars the AVN condition. Error bars represent SEM.


Figure 6.6. Two-dimensional graph of the mean variance for VN and VP at three tempi (F, P, S). The x-axis shows the IOI values; the y-axis shows the mean variance in each group. Yellow bars indicate the VP condition and blue bars the VN condition. Error bars represent SEM.


Figure 6.7. Two-dimensional graph of the mean variance for AV and V at three tempi (F, P, S). The x-axis shows the IOI values; the y-axis shows the mean variance in each group. Yellow bars indicate the V condition and blue bars the AV condition. Error bars represent SEM.

An interesting trend should be noted here: responses to natural human visual stimuli were consistently more stable in bimodal conditions than responses to artificial PLF visual stimuli. As seen in Figure 6.5, responses to AVN had a smaller circular variance across all tempi than responses to AVP, although the difference did not reach overall statistical significance (F(1, 15) = 0.327, p = 0.57). The same trend was found in the visual-only conditions at the slow and fast tempi (as seen in Figure 6.6), but these differences were likewise not statistically significant (p = 0.57 for slow and p = 0.34 for fast). Additionally, although at the unimodal preferred tempo the variance was lower in responses to the artificial PLF than in responses to the natural stimuli, this difference was not statistically significant either (p = 0.434).

6.6.2. Error-Correction Performance Analysis and Results

We observed that, across all conditions, participants responded to the phase shift and attempted to resynchronize with the beat (Figure 6.8).

Figure 6.8. Mean relative asynchrony for four modalities by three tempi. The solid lines depict early perturbation, and the dotted lines depict delayed perturbation. Shaded areas represent SEM. Top: slow; middle: preferred; bottom: fast.

The chart shows that the phase adjustment to the perturbations started as early as the T+1 tap (T denotes the tap responding to the perturbed stimulus) and lasted for a couple of taps. Following an early perturbation (a negative shift of the event), the next tap tended to be advanced. Following a delayed perturbation (a positive shift), the next tap tended to be delayed. Continuous correction occurred in the few taps after the perturbation, and the correction behavior displayed further differences across conditions. To investigate how the conditions affected the way participants corrected their taps, we analyzed the phase correction response and the speed of regaining synchrony with the quasi-isochronous stimuli.

Phase Correction Analysis. The phase correction response (PCR) describes the participants' immediate correction of the asynchrony. The perturbed event changed the relative phase of the isochronous sequence from that point onward, which elicited periodic correction in synchronized tapping. The asynchrony A(T) deviated from the stabilized series A(T−1), A(T−2), and so on, as illustrated in Figures 6.9 and 6.10.

The PCR can be calculated as follows: PCR = A(T+1) − A(T) = (R(T+1) − S(T+1)) − (R(T) − S(T)). Alternatively, the PCR can be calculated by subtracting the trial's baseline IOI from the inter-tap interval (ITI) immediately following a perturbation: PCR = (R(T+1) − R(T)) − (S(T+1) − S(T)). For instance, in a 500-ms IOI trial, if the ITI between taps T and T+1 following a delayed target was 600 ms, the PCR would equal 100 ms. In response to such a deviation, A(T+1) is expected to be corrected back toward A(T−1). If the response fully mitigates the deviation, A(T+1) = A(T−1), and the PCR exactly equals the perturbation, i.e., a normalized correction of 1 (see the phase correction parameter α below). In reality this is often not the case; the correction is often greater or smaller than that, representing overcorrection or under-correction.


Figure 6.9. Early perturbation. T marks the tap where the perturbed phase shift occurred, with T−1 representing the tap immediately before the phase shift and T+1 the tap immediately after the perturbation. The asynchrony at each tap is A(n) = R(n) − S(n), where R(n) is the onset of the tap at position n and S(n) is the onset of the stimulus at position n.

Figure 6.10. Delayed perturbation. T marks the tap where the perturbed phase shift occurred, with T−1 representing the tap immediately before the phase shift and T+1 the tap immediately after the perturbation. The asynchrony at each tap is A(n) = R(n) − S(n), where R(n) is the onset of the tap at position n and S(n) is the onset of the stimulus at position n.


To evaluate the PCR across different tempi and perturbations, we transformed it into the phase correction parameter α through the function α = PCR/perturbation. If α = 1, the subject exactly corrected for the perturbation at tap T+1. Conversely, α < 1 and α > 1 represent under-correction and overcorrection, respectively. A repeated-measures ANOVA was applied to analyze PCRs across the different tempi.
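In code, the PCR and α for a single perturbation reduce to a difference of asynchronies (a sketch under the definitions above; the function name is hypothetical):

```python
import numpy as np

def pcr_alpha(tap_onsets, stim_onsets, T, perturbation_ms):
    """PCR at perturbed beat T (0-indexed into the per-beat arrays):
    PCR = A(T+1) - A(T), with asynchrony A(n) = R(n) - S(n); alpha
    normalizes the PCR by the perturbation magnitude."""
    A = np.asarray(tap_onsets) - np.asarray(stim_onsets)  # asynchronies
    pcr = A[T + 1] - A[T]
    return pcr, pcr / perturbation_ms

# Example from the text: in a 500-ms IOI trial with a +100 ms delayed
# perturbation, a tap that fully returns to baseline at T+1 gives
# PCR = 100 ms and alpha = 1 (exact correction).
```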

Results of Phase Correction Analysis. With a full ANOVA of 3 (tempi) × 2 (visual-only or bimodal conditions) × 2 (natural human or artificial PLF visual display) × 2 (perturbation type) × 2 (perturbation position) on PCR-α, we had tempo, modality, visual display, perturbation type, and position as factors. The complete results are provided in Appendix B.

As illustrated in Figure 6.11, the results show that subjects demonstrated larger corrections in bimodal conditions than in visual-only conditions: F(1, 15) = 89.060, p < 0.001 (ηp² = 0.075). Additionally, the degree of correction increased with the IOI (i.e., decreased with the tempo): F(2, 30) = 10.877, p < 0.001 (ηp² = 0.210). The predicted main effect of visual display was not statistically significant: F(1, 15) = 0.232, p = 0.63 (ηp² = 0.0002). Further, there was an interesting interaction between position and perturbation type: F(1, 15) = 64.457, p < 0.001 (ηp² = 0.054). The mean PCR-α of the delayed perturbation was larger only in the first position, and that of the early perturbation was larger in the second position (Figure 6.12).


Figure 6.11. Alpha estimation for two types of perturbation by (Modality × Tempo). The x-axis refers to the tempi (F: IOI = 300 ms, P: IOI = 500 ms, S: IOI = 750 ms). The y-axis refers to the value of alpha. The solid lines indicate bimodal conditions, and the dotted lines indicate unimodal conditions. Red represents the visual stimuli in the natural human display, and blue represents the visual stimuli in the point-light figure display. Error bars represent ±1 standard error.


Figure 6.12. Alpha estimation for two types of perturbation by (Position × Tempo). The x-axis refers to the tempi (F: IOI = 300 ms, P: IOI = 500 ms, S: IOI = 750 ms). The y-axis refers to the value of alpha. The solid lines indicate bimodal conditions, and the dotted lines indicate unimodal conditions. Red means the early perturbation was located in position 1, and blue means the early perturbation was located in position 2. Error bars represent ±1 standard error.

Recovery Rate Analysis. The participants' taps did not immediately stabilize at T+1. Further corrections were made in the ensuing taps, in either direction depending on the cumulative over- or under-correction. We were interested in the number of taps participants needed to resynchronize to the stimuli.

We defined the baseline asynchrony before the perturbation in a trial as the mean of the asynchronies at T−10 through T−1, and the relative asynchrony as the asynchrony minus the baseline asynchrony.

To quantify the rate of recovery, or resynchronization, after the perturbation, we considered that a participant had resynchronized at tap τ if the absolute values of the relative asynchrony at τ, τ+1, and τ+2 were all below a certain threshold; we chose a threshold of 5% of the IOI. This threshold is slightly greater than the mean variance in the stable phase, so it functioned as a reference for entrainment status. Because τ might not be found within the limited number of taps in a trial, we used an alternative definition, Recovery Rate (RR), such that RR = 1/τ if τ was found within the limit. In our analysis, we capped τ at a maximum of 8; this was necessary only because of the trial construction. A repeated-measures ANOVA on Recovery Rate (RR) was applied to analyze the rates of recovery across the different conditions.
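A sketch of this measure follows (hypothetical function name; the original text does not state what RR is when no qualifying τ exists, so returning 0 in that case is an assumption):

```python
import numpy as np

def recovery_rate(rel_asynchrony, ioi_ms, max_tau=8, frac=0.05):
    """rel_asynchrony[k-1] is the relative asynchrony at tap T+k.
    The subject is resynchronized at the first tau whose relative
    asynchrony, and that of the two following taps, all fall below
    5% of the IOI; RR = 1/tau. Returning 0 when no tau <= max_tau
    qualifies is an assumption (the text leaves this case unstated)."""
    thresh = frac * ioi_ms
    a = np.abs(np.asarray(rel_asynchrony))
    for tau in range(1, max_tau + 1):
        if tau + 2 <= len(a) and np.all(a[tau - 1:tau + 2] < thresh):
            return 1.0 / tau
    return 0.0
```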

Recovery Rate Analysis Result. We performed a repeated-measures ANOVA of 3 (tempi) × 2 (visual-only or bimodal conditions) × 2 (natural human or artificial PLF visual display) × 2 (perturbation type) × 2 (perturbation position) on the recovery rate (RR). Complete results are provided in Appendix C. The recovery rate analysis (illustrated in Figure 6.13) shows that recovery was faster as the IOI increased: F(2, 30) = 23.34, p < .01 (ηp² = 0.061).


Figure 6.13. Recovery rate grouped by conditions and tempi. The y-axis indicates the recovery rate. The larger the value is, the faster the recovery.

As shown in Figure 6.13, resynchronization was faster at the slow tempo than at the preferred tempo, and slowest at the fast tempo. Additionally, recovery was faster in bimodal conditions than in unimodal conditions: F(1, 15) = 7.737, p < .01 (ηp² = 0.01). Although the recovery rate for AVN was faster than that for AVP across the tempi, and similarly for slow VN vs. slow VP, these differences were not statistically significant.

It should be noted that the perturbation position had a main effect on the recovery process: F(1, 15) = 4.9, p = 0.027 (ηp² = 0.006). Figure 6.14 shows that, generally, subjects' recovery to synchronization was faster if the perturbation appeared in the second position. There was also a significant interaction among perturbation type, perturbation position, and tempo: F(11, 165) = 3.362, p = 0.035 (ηp² = 0.004).

Figure 6.14. Recovery rate grouped by perturbation types and positions.

6.7 Discussion

In this experiment, we investigated the importance of the visual stimuli being in a natural or artificial form in the context of rhythm perception by measuring subjects' rhythmic entrainment performance. It was hypothesized that natural visual stimuli would generally better facilitate the stabilization of rhythmic synchronization and phase correction to perturbations compared to artificial visual stimuli, in both visual-only and audio-visual bimodal conditions. The results show that participants' synchronization and perturbation-reaction performance was better with natural human visual stimuli than with artificial point-light figure stimuli in many respects. Unfortunately, we were not able to establish statistical significance for these differences against the influence of other factors. However, the results are still supportive of and consistent with our hypothesis. The lack of statistical significance might be attributed to the inherent differences between the main factors being small.

6.7.1. Effect of Different Visual Forms (Natural/Artificial) on Entrainment

In the current study, subjects' behavioral responses to natural visual stimuli and artificial PLF stimuli showed several differences with respect to the stabilization of entrainment, phase-correction performance, and the re-synchronization rate.

The mean vector of the phase angle was one of the indicators of synchronization performance; it refers to the difference between the time of a tap and the time of the onset of the corresponding event in the external rhythm. Participants tended to tap before the corresponding beat happened. As a result, the mean vector of the phase angle was negative across all conditions and tempi except the fast VN. This anticipatory response has commonly been found in the previous literature, but it is still not well explained. The fast VN exception showed a positive mean vector; however, its mean vector length was relatively short, which suggests that the estimated mean is less reliable.

Further, a difference in the behavioral response to the different visual stimuli was found in the variance of the phase angle, one of the most important indicators for measuring the stabilization of entrainment. In bimodal conditions, synchronization to natural human visual stimuli was consistently more stable than synchronization to artificial PLF visual stimuli, since the mean variance in AVN was consistently lower than in AVP. The same advantage of natural human stimuli can be observed in the visual-only conditions at both the slow and fast tempi. The second difference resides in the resynchronization performance after perturbations. Participants had a higher recovery rate (i.e., regained synchronization faster) after perturbations when responding to natural visual stimuli than when responding to the PLF in bimodal conditions. The same advantage is observed in the fast visual-only condition. Although these differences did not reach statistical significance, they were consistently and clearly observed. Similar variations in behavioral responses to natural and artificial visual stimuli were previously found by Saygin and Stadler (2012). In their experiment, subjects' temporal estimation of observed action differed in response to a video of a real human agent versus a video of a robot.

Previous studies have demonstrated that rhythm perception entails internal sensorimotor coupling, which enables us to move to rhythms (see Chapter 3 for an elaborated discussion of this topic). In this case, the internal motor system may be more readily activated by observing real human actions than artificial ones. Mirror neurons have been found to encode received visual information and activate the observer's motor system, which is also involved in motor execution (Saygin, 2007; Saygin & Stadler, 2012). This reception of stimuli that can easily be translated into human action may enhance the sensitivity and speed of reaction to perturbations. In a realistic situation, the drumming action is achieved through a complex and highly coordinated mechanical interaction between bones, muscles, ligaments, and joints within the musculoskeletal system, but PLF stimuli present only the spatial-temporal changes of bones and joints. Any finer action changes or perturbations may be better conveyed to and understood by the observer when presented with natural visual stimuli instead of PLFs.

6.7.2. Interaction between Perturbation Type and Position

In addition to the main effect we investigated, we observed an interesting phenomenon while exploring the interaction between the two perturbation types and two positions. When presented with the perturbation, participants tended to advance the next tap following an early perturbation but delay the next tap following a delayed perturbation.

The PCR tended to be smaller for the early perturbation than for the delayed perturbation at the first position (early in a trial). However, at the second position (later in a trial), the PCR tended to be larger for the early perturbation than for the delayed perturbation. This was true for both audio-visual and visual-only sequences. While this phenomenon is difficult to explain, we propose one possible explanation: subjects' anticipation played a role in phase correction. Because our design had a fixed 18-beat distance between the perturbations, the participants could easily have anticipated the occurrence of the second perturbation. Repp and Moseley (2012) found that advance information about the position and direction of a phase shift enabled musicians to adjust their taps to reduce the asynchrony occurring at that point. In our study, the first perturbation appeared unexpectedly while the second was somewhat anticipated. The difference in mean PCR between perturbation types could have been mediated by this anticipation.

Moreover, the resynchronization rate was faster if the perturbation appeared in the second position (except for delayed perturbations in visual-only conditions). This may be because the first perturbation prepared the participant for the second. This preparatory anticipation enabled participants to advance or delay the tap intended to coincide with the shifted stimulus and thereby reduce the asynchrony occurring after the perturbation.

6.8 Conclusion

The results of this study represent an important step toward understanding how we entrain to different forms of visual rhythm. At both the unimodal level (visual-only conditions) and the bimodal level (audio-visual conditions), the results indicate that synchronization was slightly better with natural human visual stimuli than with artificial PLF visual stimuli. They also show that phase correction after a perturbation was influenced by an interaction between perturbation type and anticipation. Given these results, natural human visual stimuli are recommended for the design of the main study, which applies more complex rhythmic patterns.


Chapter 7. Experiment of Visual Influence on Auditory Grouping

The present research explores bodily responses to perceived groupings of auditory stimuli with visual temporal cues as a reference to ascertain whether the visual temporal cues influence auditory grouping choice and, if so, whether the magnitude of this influence is shaped by long-term music training. This chapter first discusses the design of the study.

Then the results of the experiment and a discussion of the results are presented.

7.1 Material and Stimuli

The auditory and visual stimuli were designed to explore the factors in audio-visual interaction during grouping perception. The factors include tempo, visual cue presence (with or without a temporal cue), the specific visual temporal cue (duple or triple), and the auditory pattern (clearly favoring duple or triple grouping, or ambiguous, with sound features supporting either grouping choice).

7.1.1. Auditory Stimuli

The stimuli were constructed by repeating and modifying a digital tone sample of a single drum beat. Stressed tones were realized by artificially increasing the volume by 8 dB (a construction sketch follows Table 7.1). Inspired by real djembe patterns, five drum patterns were constructed, each formed of 12 isochronous sound events. These five rhythmic patterns were intended to cover both clear and ambiguous patterns, as humans listening to ambiguous patterns may receive more influence from information in other sensory systems. We used the quantitative characterization described in Chapter 4 to classify a pattern as clear or ambiguous. The five rhythmic patterns include one that clearly favors binary grouping, one that clearly favors ternary grouping, and three ambiguous patterns. The specific degrees of binarity/ternarity of the five patterns are presented in Table 7.1, and one cycle of pattern 3 is shown as a soundwave in Figure 7.1.

Table 7.1. Five rhythmic patterns derived from African drum patterns

Pattern   Rhythmic pattern   Binarity   Ternarity
1         XOXOXOXOXOXO       1.0        0.5
2         XXOOXXOXOXXO       0.5        0.5
3         XOXXOXOXXOXO       0.5        0.5
4         XXOXXOOXOXXO       0.5        0.75
5         XXOXXOXXOXXO       0.33       1.0

Note. X indicates the "stressed tone," and O indicates the "unstressed tone." Both X and O are sounds of equal value with equal intervals in between.
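As a construction sketch (hypothetical names; the actual stimuli were built from a recorded drum-tone sample, and details such as the final normalization are assumptions), one trial's audio could be assembled like this:

```python
import numpy as np

def build_pattern_audio(tone, sr, pattern="XOXXOXOXXOXO",
                        isi_ms=312.5, cycles=16, stress_db=8.0):
    """Assemble one trial's audio from a single drum-tone sample: the
    12-event pattern repeats at a fixed inter-stimulus interval, and
    stressed events (X) are boosted by +8 dB relative to unstressed
    events (O)."""
    isi = int(round(sr * isi_ms / 1000.0))
    gain = 10 ** (stress_db / 20.0)            # +8 dB amplitude ratio
    out = np.zeros(isi * len(pattern) * cycles + len(tone))
    for c in range(cycles):
        for i, ev in enumerate(pattern):
            start = (c * len(pattern) + i) * isi
            out[start:start + len(tone)] += tone * (gain if ev == "X" else 1.0)
    return out / np.max(np.abs(out))           # normalize to avoid clipping

# Slow tempo: ISI = 312.5 ms, 16 cycles of 12 events = one minute of audio;
# fast tempo: ISI = 200 ms, 25 cycles = one minute.
```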


Figure 7.1. One cycle of pattern 3 as a soundwave.

The resulting stimuli were modified isochronous monotone sequences of the kind frequently employed as basic auditory material in studies on metrical grouping. An isochronous sequence of identical tones has a neutral metrical structure and lends itself to numerous possible metrical interpretations. Therefore, any small controlled change, such as adding different pitches to form a melodic contour or distributing stresses in different locations, can encourage certain metrical interpretations and discourage others.

The slow and fast temporal conditions were set as a variable to see whether the influence of visual information changes with tempo. The stimuli were presented at two different tempi: fast and slow. In the fast tempo, the inter-stimulus interval (ISI) was 200 ms; in the slow tempo, it was 312.5 ms. Within one trial, one pattern was presented repeatedly at one of the tempi, and the trial lasted one minute. At the fast tempo, the pattern repeated for 25 cycles within the minute; at the slow tempo, for 16 cycles.

The temporal setting intentionally avoided the human preferred pulse period (around 500 ms); as mentioned in Chapter 3, humans prefer to perceive the pulse at 2 Hz. In the current setting, for the duple grouping, the inter-tap interval (ITI) was around 400 ms in the fast temporal condition and around 625 ms in the slow temporal condition. For the triple grouping, the ITI was around 600 ms in the fast condition and around 937.5 ms in the slow condition. It was therefore expected that participants would have no inherent preference for either grouping choice.

7.1.2. Visual Stimuli

Instead of artificial visual stimuli, natural visual stimuli with real human body movement were used. Specifically, the visual stimuli in this experiment were sampled from a live recording of an African cowbell performance. The performer was asked to play a duple rhythmic pattern and a triple rhythmic pattern, as shown in Figure 7.2. The duple pattern was played at two tempi (1.6 Hz and 2.5 Hz), and the triple pattern was also played at two tempi (1.01 Hz and 1.67 Hz). The duple and triple rhythmic patterns selected for visual stimulation were relatively simple: the tapping movements were mostly limited to vertical up-and-down motion with small horizontal adjustments, and the intervals between hits were designed to be exactly 1, 2, or 3 times the IOI of the corresponding audio stimulus.


Figure 7.2. Illustration of the duple pattern and triple pattern in the visual stimuli. The auditory tone numbers here are temporal indicators of the isochronous temporal units, which were filled with stressed and unstressed drum strike tones.

Because humans cannot keep a perfectly steady pace, the performer was asked to play the cowbell several times with each pattern and tempo. Several video clips were thus collected during the live performance, and those most stable in pace were selected as the visual stimuli for the current experiment. The visual stimuli with duple and triple patterns were aligned and combined with the auditory stimuli into the bimodal stimuli presented to participants. For the control condition with no visual temporal cues, one still frame from the cowbell-playing video was shown to participants during the presentation of the auditory stimuli.

7.1.3. Trial Design

Each auditory sequence was paired separately with three types of visual stimuli to synthesize the AV stimuli: 1) a cowbell-striking movement that cued binary grouping information; 2) a cowbell-striking movement that cued ternary grouping information; and 3) a still picture of cowbell playing (no grouping cues).

The AV stimuli were synthesized by aligning the onset of the first auditory beat to the frame in which the player's stick first touches the cowbell. Because the video stimulus was played by a human and its tempo was not perfectly precise, the error was compensated for in the following way: the ISI of the audio stimuli was slightly adjusted to reduce any displacement between the audio period and the recorded visual period. Figure 7.3 illustrates one cycle of the duple visual pattern combined with the auditory pattern in the slow tempo; Figure 7.4 illustrates one cycle of the triple visual pattern combined with the auditory pattern in the slow tempo.
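The compensation step can be made concrete with a minimal sketch. It assumes the duration of one recorded visual cycle has already been measured from the video frames; the function and variable names are illustrative, not taken from the actual stimulus-preparation scripts.

```python
def align_audio_to_video(nominal_isi_ms, tones_per_cycle, visual_cycle_ms):
    """Adjust the audio ISI so that one audio cycle matches the measured
    duration of one recorded visual cycle (illustrative sketch only)."""
    adjusted_isi = visual_cycle_ms / tones_per_cycle
    correction_per_tone = adjusted_isi - nominal_isi_ms
    return adjusted_isi, correction_per_tone

# E.g., a slow video cycle measured at 3768 ms rather than the nominal
# 12 x 312.5 = 3750 ms would stretch each ISI by 1.5 ms:
print(align_audio_to_video(312.5, 12, 3768.0))   # (314.0, 1.5)
```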


Figure 7.3. Timeline of combined auditory stimuli with duple visual cue. The line for the hand position (top) approximates the trajectory of the performer’s hand in the video clip.

Figure 7.4. Timeline of combined auditory stimuli with triple visual cue. The line for the hand position (top) approximates the trajectory of the performer’s hand in the video clip.


Based on this approach, the five auditory sequences (five drum patterns) were each combined with the three types of visual stimuli at the two tempi, resulting in 30 experimental trials in total. An additional six practice trials followed the same design, except that their auditory sequences consisted of identical tones without any stresses. This unstressed auditory pattern was matched to the three kinds of visual pattern at the two tempi, yielding the six practice trials.
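The full trial list follows mechanically from this design; a sketch (the condition labels are ours):

```python
from itertools import product

patterns = ["P1", "P2", "P3", "P4", "P5"]        # stressed experimental patterns
visual_conditions = ["duple", "triple", "still"]
tempi = ["fast", "slow"]

experimental_trials = list(product(patterns, visual_conditions, tempi))
practice_trials = list(product(["unstressed"], visual_conditions, tempi))
print(len(experimental_trials), len(practice_trials))   # 30 6
```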

7.1.4. Real-World Stimuli

The present study is one of the few studies using realistic stimuli. Most previous psychological studies have mainly used artificial stimulation produced in the laboratory.

Artificial stimuli are easy to control along many dimensions, which is one of the main reasons they are favored. Although the results of such studies have contributed fundamentally to our understanding of how humans perceive musical rhythm, whether they reflect human behavior in, and in response to, real musical performances remains an open question.

All our original stimuli were obtained from live recordings, with only a limited degree of manipulation through processing in the lab. The design was kept as close to a real musical scene as possible, in order to obtain responses as close as possible to those arising from human perceptual processing in real music performance. The auditory stimuli were sampled from real African drum patterns. The visual stimuli were taken from a live recording of a cowbell performer; the person who plays the bell in African bands plays a role similar to that of the conductor in a Western concert. The participants in this test were thus like musicians playing in a band, who listen to their own instrumental sounds but also pay attention to the conductor's gestures for rhythmic alignment. Similar scenarios can be found in many other ensembles, such as jazz bands and choruses, in which performers often interact aurally as well as visually in order to perform together.

7.2 Participants

All participants were recruited from the student population at The Ohio State University and were divided into two groups: musicians and non-musicians. The criterion for the musician group was that the student was pursuing a music performance degree in the music department, with at least five years of professional training on at least one instrument (including vocal training) that ended no more than three years before taking the test. To ensure an ongoing level of training intensity, participants who practiced music for less than one hour per day were not considered musicians in this study. The non-musician group was selected from students not studying a music major or minor; the criterion was no sustained music training (defined as at least two consecutive years of professional music training) within the last ten years. Enrollment was based on participants' self-reports of their music-training experience.

Specific information about the two groups is presented in Table 7.2 below. The two groups included 12 musicians (mean age = 22.42 ± 3.7 years; 4 male, 8 female) and 12 non-musicians (mean age = 25.17 ± 1.9 years; 4 male, 8 female). None reported any history of auditory impairment, uncorrected visual impairment, or neurological disease. Most of the volunteers in the musician group were proficient in two or three instruments (including voice). All participants in the non-musician group were students in other majors. Five of them had previous music training: one studied the violin for three years starting at age eight, one studied the piano for two years starting at age seven, one studied the organ for two years starting at age seven, one studied the Chinese zither for two years starting at age seven, and one studied the piano for three months and the violin for six months at age 15. None of them, however, had any music training or specific music practice in the last ten years.

Table 7.2. Descriptive statistics for non-musicians and musicians (each N = 12)


7.3 Procedure

The experiment was conducted in a quiet room at the Cognitive Ethnomusicology Lab. The stimuli were presented to participants on a laptop: the auditory stimuli binaurally through a pair of Sony headphones, and the visual stimuli on a 34 × 17 cm screen with a resolution of 1920 × 1080. The workflow of the experiment was controlled by a script written in Python 2.7.

Each participant was presented with two blocks of trials. A session began with a practice block of three trials in which participants became familiar with the experimental environment and the tasks. The auditory pattern of the practice trials differed from that of the experimental trials, as described in the previous section. These three trials were presented repeatedly until the participant felt confident in his or her familiarity with the experimental setting. The practice block was followed by the experimental block of 30 trials. Within each block, the trials were presented in random order. Each trial lasted one minute, with a five-second pause between trials. If a participant needed more time to rest, he or she could temporarily pause the test by pressing the spacebar.

After finishing all the trials, participants were asked to complete a questionnaire.

Over the course of the experiment, the subjects sat in front of the laptop, watching the screen while listening to the sound stimuli through the headphones. Participants were told that one rhythmic pattern was repeated cycle by cycle within each trial. They were instructed to tap at the start of the subgroups they perceived: "whether you hear groups of two tones or groups of three tones, tap each time the groups start." Although subjects were informed that there would be a video on the screen, they were asked to attend to the auditory input only.

The participants performed the tapping with a 15 cm metal rod held in their preferred hand on a wooden pad laid in front of the laptop. They were also told that the rhythmic pattern in the auditory stimuli would repeat cycle by cycle and last for one minute, and that it was fine if they did not start response tapping immediately after the stimuli started.

7.4 Processing

The recorded stimuli and response tracks for each trial were analyzed as a series of onset time points. The data extraction was done in the same way as in the analysis of the first experiment. The onset time point of each response tap was taken as a perceived grouping boundary of the corresponding auditory stimulus sequence.

The response taps were categorized into two types:

1) Binary Tap: one response tap to every two sound events, in which the ITI falls within ±10% of 2 × IOI (inter-onset interval of the sound stimuli).

2) Ternary Tap: one response tap to every three sound events, in which the ITI falls within ±10% of 3 × IOI.

The ±10% range in these criteria results from the following considerations. First, if the range is too narrow, a large portion of the participants' taps cannot be categorized. Second, the range should not be too wide: at 20% the two categories would overlap. Ten percent is the midpoint between 0% and 20% and leaves a gap between the two categories. The gap acts as a buffer that prevents ambiguous categorization when a participant taps around 2.4 × IOI, which equals both 2 × IOI × (1 + 20%) and 3 × IOI × (1 − 20%).
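A minimal sketch of this categorization rule follows; the function name is ours, but the windows match the ±10% criteria above.

```python
def categorize_tap(iti_ms, ioi_ms, tolerance=0.10):
    """Label a single inter-tap interval as a binary tap, a ternary tap,
    or uncategorized, using +/-10% windows around 2x and 3x the IOI."""
    for label, multiple in (("binary", 2), ("ternary", 3)):
        target = multiple * ioi_ms
        if abs(iti_ms - target) <= tolerance * target:
            return label
    return "uncategorized"

# Fast tempo (IOI = 200 ms): 410 ms counts as binary, 590 ms as ternary,
# and 480 ms (2.4 x IOI) falls in the buffer gap between the two windows.
for iti in (410, 590, 480):
    print(iti, categorize_tap(iti, 200))
```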

Within a one-minute trial, a participant does not necessarily make only binary taps or only ternary taps; a trial usually contains a combination of the two, plus some taps that cannot be categorized into either type. To measure the percentages of binary and ternary tapping, we grouped response taps into sections, Bsections and Tsections. The percentage of binary tapping in one trial is the cumulative duration of Bsections divided by the effective length of the trial, where the effective length runs from the onset of the first response to the end of the stimuli. The percentage of ternary tapping is calculated in the same way.

Not all binary taps can be grouped into a Bsection: a valid Bsection must cover at least 12 sound events (the length of one cycle of the auditory stimulus pattern), and the same holds for Tsections. These criteria leave some durations covered by neither a Bsection nor a Tsection; such parts are often observed in the adaptation phase at the beginning of the response or when the grouping choice changes. A trial was counted as unsuccessful if more than 50% of its effective length could not be categorized. Eleven trials out of the total 720 were excluded as unsuccessful, leaving 709 valid trials. These 11 trials were all from the non-musician group.


Of the 709 valid trials, 418 started with a binary grouping choice, with a change occurring in 51 of these trials (12.2%). Two hundred ninety-one trials started with a ternary grouping, with a change occurring in 20 trials (6.87%). These results indicate that participants were more likely to change their grouping choice if they started with a binary grouping and less likely if they started with a ternary grouping. The maximum number of switches in one trial was two (in total, four trials had two changes). Of the 709 valid trials, 67 (9.4%) switched grouping choice within the trial; non-musicians switched more often (37 of the 67 trials) than musicians. Specific features of the two groups are given in Table 7.3.

Table 7.3. Trials in which grouping choice changed under different conditions

For each trial, the two registrations, Bsection and Tsection, were transformed into a Bscore, the Bsection duration divided by the sum of the Bsection and Tsection durations:

Bscore = Bsection / (Bsection + Tsection)

The Bscore ranges from 0 to 1; it equals 0 when there is no Bsection and 1 when there is no Tsection. After this data processing, the following analyses were carried out.
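The section and Bscore logic can be sketched as follows. For brevity this version measures section size in covered sound events rather than in milliseconds of trial time, which the actual analysis used, so it approximates the procedure rather than reimplementing it.

```python
from itertools import groupby

EVENTS_PER_TAP = {"binary": 2, "ternary": 3}

def bscore(tap_labels, min_events=12):
    """Sketch of the Bscore computation. Consecutive taps of the same type
    form a section; a section counts only if it covers at least one full
    12-event cycle. Bscore = Bsections / (Bsections + Tsections)."""
    covered = {"binary": 0, "ternary": 0}
    for label, run in groupby(tap_labels):
        events = len(list(run)) * EVENTS_PER_TAP.get(label, 0)
        if label in covered and events >= min_events:
            covered[label] += events
    total = covered["binary"] + covered["ternary"]
    return covered["binary"] / total if total else None

taps = ["binary"] * 8 + ["uncategorized"] + ["ternary"] * 4
print(bscore(taps))   # 16 binary events vs 12 ternary events -> 0.571...
```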

7.5 Analysis and Results

7.5.1. Analysis I: Grouping Responses Analysis

Our study aims to test the main hypothesis of a visual influence on auditory grouping. The first analysis fitted all grouping responses into a model with all factors: visual cue, tempo, auditory pattern, and group (musicians and non-musicians). A full 3 × 2 × 5 × 2 ANOVA was calculated with Bscore as the dependent variable and with three visual cue conditions (duple visual cue, triple visual cue, no visual cue), two tempi (slow and fast), five auditory patterns, and two groups (musicians and non-musicians) as factors.

The repeated-measures ANOVA revealed three main effects (visual cue, tempo, and auditory pattern) and a two-way interaction between visual cue and group. The complete ANOVA results can be found in Appendix D.

As illustrated in Figure 7.5 below, the Bscore was higher when participants were presented with the duple visual cue and lower when they were presented with the triple visual cue: F(2, 46) = 16.663, p < 0.001, ηp² = 0.036. The mean Bscore was 0.5415 (SEM (standard error of the mean) = 0.031) when no visual cue was presented, but rose to 0.67555 (SEM = 0.029) with the binary visual cue and decreased to 0.46403 (SEM = 0.031) in the ternary visual cue condition.

Figure 7.5. Summary of the Bscore by different visual cues. VD means visual duple stimulus, VT means visual triple stimulus, and VS means still visual stimulus. Error bars represent SEM.

Tempo also influenced subjects' grouping choices: F(1, 23) = 12.717, p < 0.001, ηp² = 0.013. The mean Bscore was 0.506 (SEM = 0.026) in the slow tempo but 0.61516 (SEM = 0.024996) in the fast tempo, indicating that subjects were more likely to choose binary grouping at the fast tempo than at the slow tempo.

The other main effect came from the auditory patterns: the mean Bscore differed significantly between the five patterns, F(4, 92) = 59.269, p < 0.001, ηp² = 0.261, as illustrated in Figure 7.6. As expected, P1 has the highest Bscore and P5 the lowest, with the other patterns in between.

Figure 7.6. Summary of the Bscore in five auditory patterns. Error bars represent SEM.

The effect of the visual cue was greater for the musician group than for the non-musician group: F(2, 46) = 5.503, p < 0.01, ηp² = 0.012, as illustrated in Figure 7.7. The slope across visual cue conditions is steeper for the musician group than for the non-musician group, meaning the difference between visual cues is greater. Specifically, hardly any significant differences between the visual conditions are seen for the non-musicians; the variance comes mainly from the musicians.

Figure 7.7. Bscore across different visual cue conditions for the musician group and the non-musician group. VD is the duple visual cue, VT is the triple visual cue, and VS is the still visual image. Error bars represent SEM.

An interaction between visual cues and auditory patterns was expected; however, no statistically significant interaction was found: F(19, 437) = 1.535, p = 0.142. Figure 7.8 nevertheless shows differences between the Bscores under different visual cues across the auditory patterns. The influences of the two visual cues appear to be asymmetric and related to the particular auditory pattern. This asymmetry becomes clearer when the Bscore is transformed into Ddegree and Tdegree. Ddegree (the influence degree of the duple visual cue) is defined as the difference between the Bscore with the duple visual cue and the Bscore with no visual cue, measured for the same auditory pattern, the same tempo, and the same participant. Tdegree (the influence degree of the triple visual cue) is, in the same way, the difference between the Bscore with the triple visual cue and the Bscore with no visual cue.
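Computing Ddegree and Tdegree amounts to a within-participant pivot; a sketch with pandas (the column names are hypothetical):

```python
import pandas as pd

def influence_degrees(trials: pd.DataFrame) -> pd.DataFrame:
    """Sketch: Ddegree and Tdegree are within-participant Bscore differences
    between a cued condition and the still condition, holding auditory
    pattern and tempo constant."""
    wide = trials.pivot_table(index=["participant", "pattern", "tempo"],
                              columns="visual", values="bscore")
    wide["Ddegree"] = wide["duple"] - wide["still"]
    wide["Tdegree"] = wide["triple"] - wide["still"]
    return wide[["Ddegree", "Tdegree"]].reset_index()
```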


Figure 7.8. Summary of the Bscore of the responses grouped by the five auditory patterns and three visual timing cue conditions.

As illustrated in Figure 7.9, for patterns with a clear binary or ternary grouping structure, namely P1 and P5, the influence of visual cues appears asymmetric. Take P1 as an example: P1 has a clearly duple grouping structure, and in the collected data its mean Bscore is close to 1.0, so no significant positive influence on the Bscore can be observed for this pattern, although a negative influence is still possible. The situation is similar for P5, which has a clearly triple grouping structure, so the influence of the triple visual cue on P5 is likewise asymmetric. Patterns 2, 3, and 4 are relatively ambiguous patterns in which no regular auditory feature clearly supports duple or triple grouping. Figure 7.9 also shows a "shift" from P2 to P3: the grouping choice for P2 is influenced only by the triple visual cue, whereas the grouping choice for P3 is influenced only by the duple visual cue. For P4, Ddegree is much higher than Tdegree, so the grouping of this pattern is influenced more by the duple visual cue. These results suggest that different temporal cues in the visual stimuli do not influence auditory grouping to the same degree; rather, the degree of influence may depend on the quantitative characterization (binarity/ternarity) of the stimuli.

A small detail is that the mean Ddegree/Tdegree (shown in Figure 7.9) may not exactly equal the corresponding difference between mean Bscores (in Figure 7.8). This can easily be seen by comparing the Tdegrees of P1 and P2: they differ, yet the mean Bscore differences between the still condition and the triple visual cue condition for P1 and P2 are roughly equal. The reason is that the populations used for calculating Ddegree/Tdegree and mean Bscore are different. If a participant's response to the triple visual cue is considered an outlier, that participant's Tdegree is undefined, so his or her response to the still condition is also excluded; when calculating mean Bscores, however, the response to the still condition is still included.


Figure 7.9. Ddegree and Tdegree by patterns.

We conducted a second analysis with a linear mixed-effects model (LME), with visual cue, tempo, pattern, and the visual × musician interaction as fixed factors and participant as a random effect. The rationale is that the ANOVA may not be adequate, since the Bscore does not necessarily come from a normal distribution. The LME, however, fits the data well and finds statistical significance for the same three main effects and the same interaction identified by the ANOVA. The detailed results are given in Table 7.4 below.


Table 7.4. Results of the LME analysis

Source              Sum Sq   Mean Sq   NumDF    DenDF   F value   Pr(>F)
Tempo                2.029    2.0293       1   676.49   13.5408   0.000252 ***
Pattern             39.193    9.7982       4   676.14   65.3804   < 2.2e-16 ***
Visual               6.838    3.4190       2   675.96   22.8140   2.583e-10 ***
Visual × Musician    1.948    0.6493       3    58.52    4.3327   0.007967 **

Signif. codes: *, p < 0.05; **, p < 0.01; ***, p < 0.001.
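The table above resembles output from an R-style mixed-model F-test. A rough Python analogue with statsmodels is sketched below; it reports coefficient-level tests rather than the F-table above, and the file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# trials.csv is a hypothetical export: one row per valid trial with columns
# bscore, tempo, pattern, visual, musician, participant.
df = pd.read_csv("trials.csv")

# Fixed effects for tempo, pattern, visual, and the visual x musician
# interaction; a random intercept per participant.
model = smf.mixedlm("bscore ~ C(tempo) + C(pattern) + C(visual) * C(musician)",
                    data=df, groups=df["participant"])
result = model.fit()
print(result.summary())
```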

7.5.2. Analysis II: Response Times (RTs) and Grouping Difference

This section analyzes the time a participant needed to make the first response tap after each trial started. A trial was considered invalid if the participant's response time did not fall within the range of 0 to 20 seconds (one third of the total trial length). By this criterion there were twelve invalid trials, eleven from the non-musician group and one from the musician group; seven of these were in the with-visual-cue conditions and five in the no-cue condition. We used a 2 (musician/non-musician) × 2 (visual/no visual) repeated-measures ANOVA with response time as the dependent variable and participant as a random factor to examine whether there is a significant relationship between response time to audio-visual stimuli and musical training.
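The validity criterion itself is simple; a sketch:

```python
def rt_is_valid(first_tap_s, upper_s=20.0):
    """A trial's response time is valid only if the first tap occurs within
    the first third of the one-minute trial (0-20 s)."""
    return 0.0 < first_tap_s <= upper_s

print([rt_is_valid(t) for t in (1.8, 19.9, 23.4)])   # [True, True, False]
```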

The analysis found the interaction between participant group and presence of a visual cue to be significant: F(1, 21) = 4.471, p = 0.0348, ηp² = 0.626. As illustrated in Figure 7.10 below, musicians have similar response times (RTs) in the two conditions (with and without visual cues), while non-musicians show comparable RTs in the no-visual-cue condition but significantly longer RTs in the conditions with visual cues. This indicates that when non-musicians are exposed to both auditory and visual temporal cues, they need more processing time before they start responding than when they are exposed to auditory cues only.

Figure 7.10. Mean Response Times (RTs) by groups and visual conditions. Error bars represent SEM.


7.6 Additional Analyses and Results

Prompted by the observed results, two additional analyses were developed to further characterize subjects' grouping behavior with multisensory stimuli.

7.6.1. Grouping Responses to Ambiguous Auditory Patterns

In Chapter 4 we discussed the idea of ambiguity in the grouping process and introduced the binarity and ternarity features of patterns. It is interesting to ask whether the binarity and ternarity of a sound pattern can predict its grouping tendency (more specifically, its average Bscore). For this purpose, we combined binarity and ternarity into a normalized binarity, equal to binarity / (binarity + ternarity); this parallels how the Bscore is computed from Bsection and Tsection. The relationship between the Bscore of the responses and the normalized binarity was tested with a Pearson correlation.

The correlation analysis indicated a significant positive association between the mean Bscore of the responses and the normalized binarity of the auditory pattern: r(3) = 0.9754, p = 0.0046. Specific values are shown in Figure 7.11. This means that the quantitative characterization of the stimuli (normalized binarity) and the participants' responses are correlated.
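The normalized binarity values follow directly from Table 7.1, and the test can be reproduced as below; note that the mean Bscores shown here are placeholders, not the observed values from Figure 7.11.

```python
from scipy.stats import pearsonr

# Binarity/ternarity from Table 7.1, giving the normalized binarity per pattern.
binarity = [1.0, 0.5, 0.5, 0.5, 0.33]
ternarity = [0.5, 0.5, 0.5, 0.75, 1.0]
norm_binarity = [b / (b + t) for b, t in zip(binarity, ternarity)]
# -> [0.667, 0.5, 0.5, 0.4, 0.248] for P1..P5

# Placeholder mean Bscores (NOT the observed values from Figure 7.11):
mean_bscore = [0.90, 0.55, 0.60, 0.50, 0.20]

r, p = pearsonr(norm_binarity, mean_bscore)
print(r, p)
```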


Figure 7.11. Summary of the mean Bscore of the response and the normalized binarity of five auditory patterns.

7.6.2. Mean Visual Influence and Specific Music Experience

In the 15-item questionnaire (shown in Appendix E), musicians' training experience was collected in detail, including the starting age of music training, number of practice years, daily practice hours, years of ensemble experience, and the training schedule for the past 12 months. From these data, each musician's total music training hours were calculated, mainly from the reported practice hours and frequency (for example, daily practice hours and practice days per week); the resulting number reflects the practice intensity of each musician. The starting age of music training, total practice hours, and years of ensemble experience were each tested against the observed mean visual influence among the musicians to see whether there was a correlation.
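As a sketch of how such an estimate can be assembled from questionnaire fields (the field names and the 52-week year are our assumptions, not the study's exact formula):

```python
def total_practice_hours(periods):
    """Rough cumulative-practice estimate from self-reported questionnaire
    data. `periods` is a list of dicts with hypothetical keys:
    years, days_per_week, hours_per_day."""
    return sum(p["years"] * 52 * p["days_per_week"] * p["hours_per_day"]
               for p in periods)

# E.g., 6 years at 5 days/week, 2 h/day, then 4 years at 6 days/week, 3 h/day:
periods = [{"years": 6, "days_per_week": 5, "hours_per_day": 2},
           {"years": 4, "days_per_week": 6, "hours_per_day": 3}]
print(total_practice_hours(periods))   # 3120 + 3744 = 6864 hours
```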

Neither the starting age of music training nor the number of years of ensemble experience correlated with the mean visual influence. A correlation did, however, appear between total practice hours and mean visual influence, as illustrated in Figure 7.12 below. Excluding M5, which appears to be an outlier, the remaining 11 data points showed a significant correlation with the influence degree: r(9) = 0.8854, p = 0.0003. Why M5, despite extremely long practice hours, showed so little visual influence cannot be determined from the questionnaire data.


Figure 7.12. Correlation between the approximate practice hours of musicians and mean visual influence.

7.7 Discussion

The experimental results substantiate the main hypothesis of this study: an effect of visual temporal cues on auditory grouping behavior was found. Moreover, the magnitude of this effect was larger for musicians than for non-musicians. Four points of interest are discussed in the sections below.

7.7.1. Grouping preference and switch tendency

Of the 709 valid trials, 418 (59%) started with duple groupings and 291 (41%) with triple groupings. Among the trials without visual timing cues, 138 (58.47%) started with duple groupings and 98 (41.53%) with triple groupings. The percentage of duple groupings was higher than that of triple groupings at both slow and fast tempi. These results may indicate a preference for duple over triple groupings, in accordance with previous experimental data showing that duple groupings are favored (Drake, 1993). The reason for this preference is not clear. A related finding is that Brochard et al. (2003) found ERP evidence that the brain responds significantly more vigorously to attenuated odd than to attenuated even events in an isochronous monotone sequence, suggesting that listeners may automatically impose a binary grouping structure on such sound sequences. The preference may also reflect cultural conditioning: binary groupings are more common in Western music, so Western listeners are more familiar with them than with ternary groupings (Apel, 1970). To explore this idea further, future studies should recruit participants from different cultures to assess the effect of cultural background.

Another interesting finding is that grouping can change spontaneously and unintentionally. A spontaneous switch was observed in 67 of the 709 valid trials (9.4%), with no striking difference between musicians (30 trials) and non-musicians (37 trials). A switch here means changing the original duple/triple grouping choice to the other; it is distinguished from forced switches because there was no instruction to switch grouping within a trial. A similar observation was made in Repp's (2007) study: even though participants were asked to tap on every third note of an identical sound sequence, they were often unable to maintain this and instead switched to tapping on every fourth note somewhere during their responses. A visual analogy of this phenomenon is the Rubin (1915) face-vase figure (Figure 7.13): the subject can perceive the figure as either a face or a vase, and the two percepts alternate spontaneously and frequently. The difference between switching in the auditory domain and in the visual domain is that switching in auditory grouping occurs far less often within a short time: the two visual percepts of the face-vase figure can flip several times in one second, whereas the switching frequency of auditory grouping is much lower. In our study, the interpretation was stable for most of each one-minute trial. The switching rate (changes per minute) was highest for P5 at 0.16 and lowest for P1 at 0.049; P4 had a switching rate of 0.07, and P2 and P3 had switching rates of 0.13 and 0.1. These rates are much lower than for the face-vase figure. Moreover, the different switching rates across patterns show that the probability of grouping one way or the other is not the same for all five patterns, whereas the vase-face image has a fixed ratio. At the same time, once a participant switched grouping responses, he or she seemed unlikely to switch back spontaneously to the original interpretation: the maximum number of switches in one trial was two, and only four trials had two switches. Why percept reversal differs between auditory and visual stimuli is currently unknown and deserves further research.


Figure 7.13. Rubin’s Vase-Face (1915)

7.7.2. Different visual influence across different auditory patterns

This study also found that visual cues influence different auditory patterns to different degrees. When the visual cue is congruent with the quantitative characterization (binarity/ternarity) of the auditory stimulus, the visual influence is smaller; the influence is more pronounced when the visual cue conflicts with the auditory cues. The grouping responses to auditory patterns 1 and 5 illustrate this point especially clearly. In general, the duple visual cue had more influence on auditory patterns 3, 4, and 5, and the triple visual cue had more influence on auditory patterns 1 and 2. The result is intuitive: the visual cue had more influence on conflicting (or mostly conflicting) auditory stimuli. Future studies could introduce more complex auditory patterns.


7.7.3. RTs difference between musicians and non-musicians

The RT analysis showed that musicians have similar response times regardless of whether the visual stimulus carries a temporal cue, whereas non-musicians show comparable RTs when no temporal cue is present in the visual stimulus but significantly longer RTs when visual temporal cues are present. This difference indicates that when non-musicians are exposed to both auditory and visual temporal cues, they need more processing time before they start responding than when exposed to auditory cues only.

Numerous studies have demonstrated structural and functional differences between musicians and non-musicians in brain areas related to multisensory integration (Gaser & Schlaug, 2003; Hodges, 2005; Hairston, Hodges, Burdette, & Wallace, 2006; Musacchia et al., 2007, 2008). Musicians have also been reported to show improved cognitive processing in both the auditory and the visual domain (Rodrigues, Loureiro, & Caramelli, 2013; Faßhauer, Frese, & Evers, 2015). Such differences may account for the difference in reaction times between musicians and non-musicians.

7.7.4. Visual influence on grouping choice and music training experience

A new finding of this study is that the degree of visual influence on rhythm perception is much higher for musicians than for non-musicians. In the results, hardly any significant differences are seen between the grouping responses under the different visual cue conditions among non-musicians, whereas significant differences exist in the musician group.


The visual influence on musicians' grouping behavior may be further explained by the significant correlation found between musicians' training hours and mean visual influence: the longer the music training, the greater the visual influence. The participants were recruited so that the separation in music-training intensity between the two groups was large, and the significant correlation between practice hours and average visual-influence degree supports this interpretation, since practice hours are an important indicator of training intensity. A previous study found that the degree of observed structural and functional adaptation in the brain correlates with the intensity and duration of music practice (Miendlarzewska & Trost, 2014).


Chapter 8. Major Findings and General Discussion

This study was inspired by an anecdotal observation from the author's life: people clap or move along with African drum music differently depending on whether they watch the accompanying video. Starting from the hypothesis that watching visual movement influences our hearing habits, this dissertation examined audio-visual interaction in rhythm perception. The results confirmed the hypothesis that visual temporal cues influence the grouping of auditory events, as embodied in bodily movements.

One interesting and important finding of the current study is that it verified the possibility of introducing better visual stimuli into lab-based studies of rhythm perception. Much of what we know about visual rhythm processing is derived from responses to simple artificial stimuli that rarely occur in the natural environment. Such artificial stimuli are popular in many psychologically based investigations of audio-visual interaction because they can be easily and precisely controlled, and the results of such experiments do shed light on the way we process multisensory information and contribute to fundamental knowledge about how humans process multisensory stimuli.

However, it is questionable how well these stimuli represent actual music, because such visual stimuli are far removed from those appearing in real-world musical situations. The first experiment in the current study compared the effects of natural and abstract visual stimuli on human rhythm perception. There was an observable advantage for natural visual stimuli over abstract visual stimuli in entrainment performance, in terms of more stable entrainment and a faster rate of re-synchronization after perturbation, though these differences did not reach statistical significance. The lack of significance might be attributable to the fact that the inherent differences between the natural and abstract stimuli selected in this experiment were small.

Further, the main study examined the perceptual process of rhythmic grouping. When presented with a complex acoustic spectrum, the auditory system spontaneously groups acoustic elements together. Grouping can be influenced by the periodic features that constitute the rhythmic pattern, such as timbre, pitch, stress, tone length, and silence. Whether and how perceived temporal information from the visual modality interacts with auditory grouping was the main focus of this study, which found three features of the perceptual process of rhythmic grouping:

1) Within the auditory modality, the way of grouping is related to the physical features of the rhythmic surface, specifically the repetition of duple or triple stress patterns.

2) Across modalities, the perceived temporal cue from the visual modality influences auditory grouping behavior.

3) The audio-visual interaction in rhythmic grouping is experience-dependent: the observer's previous experience, such as musical training, increases the effect of the visual influence on auditory grouping behavior.

Generally, the results of this study demonstrate and highlight the role of the visual sense in multisensory rhythm perception. This challenges the "Modality-Appropriateness hypothesis" (Welch, 1980), according to which audition is dominant in temporal processing and vision in spatial processing. It should be noted, though, that the current study involved only cowbell playing (mainly upper-limb movements) as visual stimuli. In African music, dancing is also key to rhythm perception. For future research, it would be interesting to introduce more complex visual stimuli, such as dancing movements, into a rhythm perception experiment.

8.1 Implications for cognitive studies

Psychological and neuroscience studies have long sought to understand the psychological mechanisms underlying audio-visual interaction in temporal perception. Evidence from various neuroscience studies suggests that temporal processing in the auditory and visual modalities shares a common representation, leading to the claim that there is a common brain area for projecting and processing temporal information from both modalities. fMRI studies have found common neural substrates engaged in auditory and visual duration judgment (Boltz, 2005): brain areas such as the premotor and supplementary motor cortices, basal ganglia, and cerebellum are commonly activated in visual and auditory timing tasks (Schubotz et al., 2000; Shih et al., 2009). EEG studies have found that the temporal processing of visual and auditory intervals shares a common neural representation (Barne et al., 2017). In this sense, with bimodal stimulation, a certain kind of interaction or fusion happens in such common areas.

However, whether such fusion takes place in early sensory-specific areas or later in higher hetero-modal brain areas (the cortical areas responsible for collecting multisensory information for more complex cognitive activities) remains an open question.

The verbal reports obtained from five participants suggest that they experienced this fusion differently. Most of the musicians felt it was easy to recognize the temporal structure and sense the pulse of the visual rhythmic pattern, hinting that the temporal structures of the auditory and visual stimuli were processed in parallel and that the final grouping behavior, embodied in tapping, was directed by a conscious decision. This would suggest that fusion takes place in higher hetero-modal brain areas. Conversely, most non-musician participants reported that they could hardly recognize the rhythmic pattern in the visual stimuli, but that they could easily detect, and rely on, the larger arm or hand movements in striking the bell, which gave them a temporal reference for grouping the auditory events. This description accords with previous findings that a sound is perceived as louder when paired with a larger gesture (Rosenblum & Fowler, 1991). In this sense, the visual information influenced a low-level feature of the sound, loudness, which then increased the tendency for such sound events to be perceived as grouping boundaries. Overall, these results do not reject either of the two fusion-location hypotheses; both possibilities may hold.


They suggest that the two kinds of fusion may coexist or be used alternately in temporal perception. This uncertainty should be clarified in future studies.

8.2 Implications for music

Music is about sound, but not only about sound; it is a particular experience within the mind of the listener. The findings of the current study demonstrate that vision plays an indisputable role in the musical experience. This finding matters for both musicians and listeners. Musicians should be aware of the role of watching bodily movement in auditory perception so that they can use it as a performance strategy. On the one hand, it can improve cooperation in ensemble performance: in an ensemble such as a concerto, a jazz band, or a choir, paying attention to the movements of fellow performers significantly improves rhythmic synchronization. On the other hand, awareness of the role of vision in auditory rhythm perception benefits specific performing techniques. One example comes from marimba players, who adjust gesture length as a performing skill: varying the gesture length does not change the length of the auditory sounds, yet listeners' perception of the sounds' length is greatly affected by the size of the player's gesture (Broughton & Stevens, 2009). That is, musicians can generate and exploit visual information in playing to shape listeners' experience of the musical sounds.

Moreover, keeping the important role of vision in mind also matters for the listening experience. Although purely auditory media, such as mp3 files and radio broadcasts, are very popular, such experiences can never replace being seated at a concert or watching a music video.

Generally, the findings of the current study imply that embracing vision's role in realizing and experiencing musical sound is important for musicians, for audiences, and for the interaction between the two.

8.3 Implications for musicology

Meter is often invoked as a concept for understanding and analyzing music. As proposed by London (2004), meter is an entrainment process, and metrical entrainment is synchronization to the periodic stressed and unstressed pulses within the hierarchical structure of the music. Later, Kung and Will (2016) proposed that musical entrainment is a process in which we entrain to a variety of components, mainly the surface features of the auditory sounds and the grouping of the sound sequence, but not to meter. The results of the current study are in line with Kung and Will (2016) and further suggest that regular visual movement is another component of musical entrainment, one whose processing is intertwined with the grouping of auditory sounds.

If observed movement information greatly influences our processing of auditory information, as our experiments indicate, then we need to rethink how we understand and analyze music. Most Western music theorists prefer to analyze music from notation, but Western notation records only the expected sound; no visual or body-movement information is notated. When playing from notation alone, performers make body movements according to their own understanding of the music and their playing skills, and the resulting experience can differ significantly even when the auditory features are identical. Musical experience is multisensory, and the analysis of music should give visual information an important place.


Bibliography

Agawu, V. K. (2014). Representing African music: Postcolonial notes, queries, positions. Florence: Taylor and Francis.
Agawu, V. K. (1995a). African rhythm: A northern Ewe perspective. Cambridge: Cambridge University Press.
Agawu, V. K. (1995b). The invention of "African rhythm." Journal of the American Musicological Society, 48(3), 380-395. doi:10.2307/3519832
Alexander, I., Cowey, A., & Walsh, V. (2005). The right parietal cortex and time perception: Back to Critchley and the Zeitraffer phenomenon. Cognitive Neuropsychology, 22(3-4), 306-315. doi:10.1080/02643290442000356
Allen, J. F. (1990). Maintaining knowledge about temporal intervals. Readings in Qualitative Reasoning About Physical Systems, 361-372. doi:10.1016/b978-1-4832-1447-4.50033-x
Allman, M. J., & Meck, W. H. (2011). Pathophysiological distortions in time perception and timed performance. Brain, 135(3), 656-677. doi:10.1093/brain/awr210
Altenberg, B. (1990). Speech as linear composition. In G. Kaie, A. Haastrup, L. Jakobson, J. E. Nielson, J. Sevaldsen, H. Specht, & A. Zettersten (Eds.), Proceedings from the fourth Nordic conference for English studies (pp. 133-134). Copenhagen: Copenhagen University Press.
Amunts, K., Schlaug, G., Jäncke, L., Steinmetz, H., Schleicher, A., Dabringhaus, A., & Zilles, K. (1997). Motor cortex and hand motor skills: Structural compliance in the human brain. Human Brain Mapping, 5(3), 206-215.
Anku, W. (1992). Structural set analysis of African music. Legon, Ghana: Soundstage Production.
Anku, W. (2009). Drumming among the Akan and Anlo Ewe of Ghana: An introduction. African Music: Journal of the International Library of African Music, 8(3), 38-64. doi:10.21504/amj.v8i3.1827


Apel, W. (1970). Harvard dictionary of music (2nd ed.). Cambridge, MA: Harvard University Press.
Aristotle. De Anima II 5 (M. Burnyeat, Trans., 2002). Phronesis, 47, 28-90.
Arom, S. (1989). Time structure in the music of Central Africa: Periodicity, meter, rhythm and polyrhythmics. Leonardo, 22(1), 91-99. doi:10.2307/1575146
Arom, S. (1991). African polyphony and polyrhythm: Musical structure and methodology. Cambridge: Cambridge University Press.
Aschersleben, G. (2002). Temporal control of movements in sensorimotor synchronization. Brain and Cognition, 48(1), 66-79. doi:10.1006/brcg.2001.1304
Bakare, S. (1997). The drumbeat of life: Jubilee in an African context. Geneva: WCC Publications.
Barne, L. C., Sato, J. R., Camargo, R. Y., Claessens, P. M., Caetano, M. S., & Cravo, A. M. (2017). A common representation of time across visual and auditory modalities. Neuropsychologia, 119, 223-232. doi:10.1101/183426
Blacking, J. (1973). How musical is man? Seattle: University of Washington Press.
Blacking, J. (1992). The biology of music-making. In H. Meyers (Ed.), Ethnomusicology: An introduction (pp. 301-314). London: Macmillan Press.
Block, R. A., & Zakay, D. (1997). Prospective and retrospective duration judgments: A meta-analytic review. Psychonomic Bulletin & Review, 4(2), 184-197. doi:10.3758/bf03209393
Boersma, P., & Weenink, D. (2017). Praat: Doing phonetics by computer (Version 6.0.36) [Computer program]. Retrieved 11 November 2017, from http://www.praat.org/
Bolton, T. (1894). Rhythm. The American Journal of Psychology, 6(2), 145-238.
Boltz, M. (2005). Duration judgments of naturalistic events in the auditory and visual modalities. Perception & Psychophysics, 67(8), 1362-1375. doi:10.3758/bf03193641
Braitenberg, V. (1967). Is the cerebellar cortex a biological clock in the millisecond range? Progress in Brain Research, 334-346. doi:10.1016/s0079-6123(08)60971-1
Brochard, R., Abecasis, D., Potter, D., Ragot, R., & Drake, C. (2003). The "ticktock" of our internal clock: Direct brain evidence of subjective accents in isochronous sequences. Psychological Science, 14(4), 362-366.
Broughton, M., & Stevens, C. (2009). Music, movement and marimba: An investigation of the role of movement and gesture in communicating musical expression to an audience. Psychology of Music, 37(2), 137-153. doi:10.1177/0305735608094511

Brotzman, R. L. (1964). Progress report on Mandarin tone study. Project on Linguistic Analysis, (8), 1-35.
Bueti, D., & Walsh, V. (2009). The parietal cortex and the representation of time, space, number and other magnitudes. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1831-1840. doi:10.1098/rstb.2009.0028
Buhusi, C. V., & Meck, W. H. (2005). What makes us tick? Functional and neural mechanisms of interval timing. Nature Reviews Neuroscience, 6(10), 755-765. doi:10.1038/nrn1764
Buonomano, D. V. (2000). Decoding temporal information: A model based on short-term synaptic plasticity. The Journal of Neuroscience, 20(3), 1129-1141. doi:10.1523/jneurosci.20-03-01129.2000
Buonomano, D. V., & Karmarkar, U. R. (2002). How do we tell time? The Neuroscientist, 8(1), 42-51. doi:10.1177/107385840200800109
Burr, D., Tozzi, A., & Morrone, M. C. (2007). Neural mechanisms for timing visual events are spatially selective in real-world coordinates. Nature Neuroscience, 10(4), 423-425. doi:10.1038/nn1874
Cameron, D. J., Stewart, L., Pearce, M. T., Grube, M., & Muggleton, N. G. (2012). Modulation of motor excitability by metricality of tone sequences. Psychomusicology: Music, Mind, and Brain, 22(2), 122-128. doi:10.1037/a0031229
Charry, E. S. (1995). Musical thought, history, and practice among the Mande of West Africa. Ann Arbor, MI: UMI.
Charry, E. S. (2000). Mande music: Traditional and modern music of the Maninka and Mandinka of Western Africa. Chicago: University of Chicago Press.
Chernoff, J. M. (1997). "Hearing" in West African idioms. The World of Music, 39(2), 19-25.
Chernoff, J. M. (1979). African rhythm and African sensibility. Chicago: University of Chicago Press.
Cicchini, G. M., Arrighi, R., Cecchetti, L., Giusti, M., & Burr, D. C. (2012). Optimal encoding of interval timing in expert percussionists. Journal of Neuroscience, 32(3), 1056-1060. doi:10.1523/jneurosci.3411-11.2012
Clark, A. (2011). Supersizing the mind: Embodiment, action, and cognitive extension. Oxford: Oxford University Press.


Clarke, E. F. (1999). Rhythm and timing in music. In D. Deutsch (Ed.), The psychology of music (pp. 473-500). San Diego, CA: Academic Press.
Clayton, M., Sager, R., & Will, U. (2005). In time with the music: The concept of entrainment and its significance for ethnomusicology. European Meetings in Ethnomusicology, 11, 3-142.
Clayton, M. R. (2000). Time in Indian music: Rhythm, metre, and form in North Indian rag performance. New York: Oxford University Press.
Clayton, M. R. (2007). Observing entrainment in music performance: Video-based observational analysis of Indian musicians' tanpura playing and beat marking. Musicae Scientiae, 11(1), 27-59. doi:10.1177/102986490701100102
Clayton, M. (2012). What is entrainment? Definition and applications in musical research. Empirical Musicology Review, 7(1-2), 49-56. doi:10.18061/1811/52979
Clifton, T. (1983). Music as heard: A study in applied phenomenology. New Haven, CT: Yale University Press.
Covey, E., & Casseday, J. H. (1999). Timing in the auditory system of the bat. Annual Review of Physiology, 61(1), 457-476. doi:10.1146/annurev.physiol.61.1.457
Cooper, G. W., & Meyer, L. B. (2008). The rhythmic structure of music. Chicago: University of Chicago Press.
Cruttenden, A. (1997). Intonation (Cambridge Textbooks in Linguistics). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139166973
Drake, C. (1993). Reproduction of musical rhythms by children, adult musicians, and adult nonmusicians. Perception & Psychophysics, 53(1), 25-33. doi:10.3758/bf03211712
Drake, C., Jones, M. R., & Baruch, C. (2000). The development of rhythmic attending in auditory sequences: Attunement, referent period, focal attending. Cognition, 77(3), 251-288. doi:10.1016/s0010-0277(00)00106-2
Dunlap, K. (1910). Reaction to rhythmic stimuli with attempt to synchronize. Psychological Review, 17(6), 399-416. doi:10.1037/h0074736
Elliott, M. T., Welchman, A. E., & Wing, A. M. (2008). Being discrete helps keep to the beat. Experimental Brain Research, 192(4), 731-737. doi:10.1007/s00221-008-1646-8
Elliott, M. T., Wing, A. M., & Welchman, A. E. (2010). Multisensory cues improve sensorimotor synchronisation. European Journal of Neuroscience, 31(10), 1828-1835. doi:10.1111/j.1460-9568.2010.07205.x


Ellis, R. J., & Jones, M. R. (2009). The role of accent salience and joint accent structure in meter perception. Journal of Experimental Psychology: Human Perception and Performance, 35(1), 264-280. doi:10.1037/a0013482
Engström, D. A., Kelso, J., & Holroyd, T. (1996). Reaction-anticipation transitions in human perception-action patterns. Human Movement Science, 15(6), 809-832. doi:10.1016/s0167-9457(96)00031-0
Euba, A. (1990). Yoruba drumming: The dùndún tradition. Bayreuth: Bayreuth University.
Fan, Y., & Will, U. (2016). Musicianship and tone-language experience differentially influence pitch-duration interaction. Proceedings of the 14th ICMPC, San Francisco, USA.
Faßhauer, C., Frese, A., & Evers, S. (2015). Musical ability is associated with enhanced auditory and visual cognitive processing. BMC Neuroscience, 16(1). doi:10.1186/s12868-015-0200-4
Fendrich, R., & Corballis, P. M. (2001). The temporal cross-capture of audition and vision. Perception & Psychophysics, 63(4), 719-725. doi:10.3758/bf03194432
Fierro, B., Palermo, A., Puma, A., Francolini, M., Panetta, M., Daniele, O., & Brighina, F. (2007). Role of the cerebellum in time perception: A TMS study in normal subjects. Journal of the Neurological Sciences, 263(1-2), 107-112. doi:10.1016/j.jns.2007.06.033
Fiske, S. T. (2013). Social cognition. London: SAGE.
Fitch, W. T. (2013). Rhythmic cognition in humans and animals: Distinguishing meter and pulse perception. Frontiers in Systems Neuroscience, 7, 68. doi:10.3389/fnsys.2013.00068
Fraisse, P. (1948). Rythmes auditifs et rythmes visuels [Auditory rhythms and visual rhythms]. L'Année Psychologique, 49, 21-41.
Fraisse, P. (1980). Les synchronisations sensori-motrices aux rythmes [Sensorimotor synchronizations to rhythms]. In J. Requin (Ed.), Anticipation et comportement (pp. 233-257). Paris: Centre National.
Fraisse, P. (1982). Rhythm and tempo. In D. Deutsch (Ed.), The psychology of music (pp. 149-180). New York: Academic Press.
Gallagher, S. (2013). How the body shapes the mind. Oxford: Clarendon Press.


Gan, L., Huang, Y., Zhou, L., Qian, C., & Wu, X. (2015). Synchronization to a bouncing ball with a realistic motion trajectory. Scientific Reports, 5(1), 11974. doi:10.1038/srep11974
Gaser, C., & Schlaug, G. (2003). Brain structures differ between musicians and non-musicians. The Journal of Neuroscience, 23(27), 9240-9245. doi:10.1523/jneurosci.23-27-09240.2003
Gibbon, J. (1977). Scalar expectancy theory and Weber's law in timing. Psychological Review, 84(3), 279-325. doi:10.1037//0033-295x.84.3.279
Gibbon, J., Malapani, C., Dale, C. L., & Gallistel, C. (1997). Toward a neurobiology of temporal cognition: Advances and challenges. Current Opinion in Neurobiology, 7(2), 170-184. doi:10.1016/s0959-4388(97)80005-0
Gibbon, J., Church, R., & Meck, W. (1984). Scalar timing in memory. In J. Gibbon & J. Allan (Eds.), Annals of the New York Academy of Sciences: Timing and time perception (pp. 52-77). New York: New York Academy of Sciences.
Gibbs, R. W. (2010). Embodiment and cognitive science. Cambridge: Cambridge University Press.
Glenberg, A. M., Mann, S., Altman, L., Forman, T., & Procise, S. (1989). Modality effects in the coding and reproduction of rhythms. Memory & Cognition, 17(4), 373-383. doi:10.3758/bf03202611
Glenberg, A. M., & Jona, M. (1991). Temporal coding in rhythm tasks revealed by modality effects. Memory & Cognition, 19(5), 514-522. doi:10.3758/bf03199576
Goldstone, S., & Lhamon, W. T. (1971). Levels of cognitive functioning and the auditory-visual differences in human timing behavior. In M. H. Appley (Ed.), Adaptation-level theory (pp. 263-280). New York: Academic Press.
Grahn, J. A., & Brett, M. (2007). Rhythm and beat perception in motor areas of the brain. Journal of Cognitive Neuroscience, 19(5), 893-906. doi:10.1162/jocn.2007.19.5.893
Grahn, J. A., Henry, M. J., & McAuley, J. D. (2011). fMRI investigation of cross-modal interactions in beat perception: Audition primes vision, but not vice versa. NeuroImage, 54(2), 1231-1243. doi:10.1016/j.neuroimage.2010.09.033
Grondin, S., & Rousseau, R. (1991). Judging the relative duration of multimodal short empty time intervals. Perception & Psychophysics, 49(3), 245-256. doi:10.3758/bf03214309


Grondin, S. (2001). From physical time to the first and second moments of psychological time. Psychological Bulletin, 127(1), 22-44. doi:10.1037/0033-2909.127.1.22
Grondin, S. (2010). Timing and time perception: A review of recent behavioral and neuroscience findings and theoretical directions. Attention, Perception, & Psychophysics, 72(3), 561-582. doi:10.3758/app.72.3.561
Gross, J., Hoogenboom, N., Thut, G., Schyns, P., Panzeri, S., Belin, P., & Garrod, S. (2013). Speech rhythms and multiplexed oscillatory sensory coding in the human brain. PLoS Biology, 11(12). doi:10.1371/journal.pbio.1001752
Gussenhoven, C. (2010). The phonology of tone and intonation. Cambridge: Cambridge University Press.
Hairston, W. D., Hodges, D. A., Burdette, J. H., & Wallace, M. T. (2006). Auditory enhancement of visual temporal order judgment. NeuroReport, 17(8), 791-795. doi:10.1097/01.wnr.0000220141.29413.b4
Hannon, E. E., Snyder, J. S., Eerola, T., & Krumhansl, C. L. (2004). The role of melodic and temporal cues in perceiving musical meter. Journal of Experimental Psychology: Human Perception and Performance, 30(5), 956-974. doi:10.1037/0096-1523.30.5.956
Harrington, D. L., & Haaland, K. Y. (1999). Neural underpinnings of temporal processing: A review of focal lesion, pharmacological, and functional imaging research. Reviews in the Neurosciences, 10(2), 91-116. doi:10.1515/revneuro.1999.10.2.91
Hirsh, I. J. (1974). Temporal order and auditory perception. Sensation and Measurement, 251-258. doi:10.1007/978-94-010-2245-3_24
Hodges, D. A. (2005). Aspects of multisensory perception: The integration of visual and auditory information in musical experiences. Annals of the New York Academy of Sciences, 1060(1), 175-185. doi:10.1196/annals.1360.012
Hommel, B., Müsseler, J., Aschersleben, G., & Prinz, W. (2001). The theory of event coding (TEC): A framework for perception and action planning. Behavioral and Brain Sciences, 24(5), 849-878. doi:10.1017/s0140525x01000103
Hopkins, R. G., & Yeston, M. (1976). The stratification of musical rhythm. Notes, 34(1), 77. doi:10.2307/897288
Hove, M. J., Spivey, M. J., & Krumhansl, C. L. (2010). Compatibility of motion facilitates visuomotor synchronization. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1525-1534. doi:10.1037/a0019059


Hove, M. J., Sutherland, M. E., & Krumhansl, C. L. (2010). Ethnicity effects in relative pitch. Psychonomic Bulletin & Review, 17(3), 310-316. doi:10.3758/pbr.17.3.310
Hove, M. J., Iversen, J. R., Zhang, A., & Repp, B. H. (2013). Synchronization with competing visual and auditory rhythms: Bouncing ball meets metronome. Psychological Research, 77(4), 388-398. doi:10.1007/s00426-012-0441-0
Huygens, C. (2018). Huygens, Christiaan (also Huyghens, Christian) [Website]. Retrieved from https://www.encyclopedia.com/people/science-and-technology/physics-biographies/christiaan-huygens#3404703173
Iversen, J. R., Patel, A. D., Nicodemus, B., & Emmorey, K. (2015). Synchronization to auditory and visual rhythms in hearing and deaf individuals. Cognition, 134, 232-244. doi:10.1016/j.cognition.2014.10.018
Ivry, R. B., & Spencer, R. M. (2004). The neural representation of time. Current Opinion in Neurobiology, 14(2), 225-232. doi:10.1016/j.conb.2004.03.013
Ivry, R. B., & Schlerf, J. E. (2008). Dedicated and intrinsic models of time perception. Trends in Cognitive Sciences, 12(7), 273-280. doi:10.1016/j.tics.2008.04.002
Jones, A. M. (1959). Studies in African music. London: Oxford University Press.
Jones, M. R., Moynihan, H., Mackenzie, N., & Puente, J. (2002). Temporal aspects of stimulus-driven attending in dynamic arrays. Psychological Science, 13(4), 1313-1319. doi:10.1111/j.0956-7976.2002.00458.x
Jäncke, L., Loose, R., Lutz, K., Specht, K., & Shah, N. (2000). Cortical activations during paced finger-tapping applying visual and auditory pacing stimuli. Cognitive Brain Research, 10(1-2), 51-66. doi:10.1016/s0926-6410(00)00022-7
Jäncke, L. (2009). Music drives brain plasticity. F1000 Biology Reports, 1, 78. doi:10.3410/b1-78
Kaneko, S., & Murakami, I. (2009). Perceived duration of visual motion increases with speed. Journal of Vision, 9(7), 14. doi:10.1167/9.7.14
Kato, M., & Konishi, Y. (2006). Auditory dominance in the error correction process: A synchronized tapping study. Brain Research, 1084(1), 115-122. doi:10.1016/j.brainres.2006.02.019
Kauffman, R. (1980). African rhythm: A reassessment. Ethnomusicology, 24(3), 393. doi:10.2307/851150
Keil, C. (1979). Tiv song: The sociology of art in a classless society. Chicago: University of Chicago Press.

157

Keil, C. (1995). The theory of participatory discrepancies: A progress report. Ethnomusicology,39(1), 1. doi:10.2307/852198 Kemp, B. J. (1973). Reaction time of young and elderly subjects in relation to perceptual deprivation and signal-on versus signal-off conditions. Developmental Psychology,8(2), 268-272. doi:10.1037/h0034147 King, A., & Palmer, A. (1985). Integration of visual and auditory information in bimodal neurones in the guinea-pig superior colliculus. Experimental Brain Research,60(3), 492- 500. doi:10.1007/bf00236934 King, A. (1960). Employments of the "standard pattern" in Yoruba music. African Music: Journal of the African Music Society,2(3), 51-54. doi:10.21504/amj.v2i3.610 King, D. P., & Takahashi, J. S. (2000). Molecular genetics of circadian rhythms in mammals. Annual Review of Neuroscience,23(1), 713-742. doi:10.1146/annurev.neuro.23.1.713 Klyn, N. A., Will, U., Cheong, Y., & Allen, E. T. (2015). Differential short-term memorisation for vocal and instrumental rhythms. Memory,24(6), 766-791. doi:10.1080/09658211.2015.1050400 Kolers, P. A., & Brewster, J. M. (1985). Rhythms and responses. Journal of Experimental Psychology: Human Perception & Performance, 11, 150-167. Kolinski, M. (1973). A cross-cultural approach to metro-rhythmic patterns. Ethnomusicology,17(3), 494-506. doi:10.2307/849962 Krishnan, A., Xu, Y., Gandour, J., & Cariani, P. (2005). Encoding of pitch in the human brainstem is sensitive to language experience. Cognitive Brain Research,25(1), 161-168. doi:10.1016/j.cogbrainres.2005.05.004 Krishnan, A., Bidelman, G. M., & Gandour, J. T. (2010). Neural representation of pitch salience in the human brainstem revealed by psychophysical and electrophysiological indices. Hearing Research,268(1-2), 60-66. doi:10.1016/j.heares.2010.04.016 Krumhansl, C. L., & Iverson, P. (1992). Perceptual interactions between musical pitch and timbre. Journal of Experimental Psychology: Human Perception and Performance,18(3), 739-751. doi:10.1037//0096-1523.18.3.739 Kung, H. N. & Will, U. (2016). Hierarchical or not? A cross cultural study on pulse and complex meter perception. Proceedings of the 14th ICMPC 2016, San Francisco. Large, E. W. (2008). Resonating to musical rhythm: Theory and experiment. In S. Grondin (Ed.), Psychology of time (pp. 189–231). Bingley, UK: Emerald.

158

Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment of neuronal oscillations as a mechanism of attentional selection. Science,320(5872), 110- 113. doi:10.1126/science.1154735 Lakatos, P., Musacchia, G., O’Connel, M., Falchier, A., Javitt, D., & Schroeder, C. (2013). The spectrotemporal filter mechanism of auditory selective attention. Neuron,77(4), 750-761. doi:10.1016/j.neuron.2012.11.034 Lake, J. I., Labar, K. S., & Meck, W. H. (2014). Hear it playing low and slow: How pitch level differentially influences time perception. Acta Psychologica,149, 169-177. doi:10.1016/j.actpsy.2014.03.010 Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time- varying events. Psychological Review,106(1), 119-159. doi:10.1037//0033- 295x.106.1.119 Lerdahl, F., & Jackendoff, R. (1983). A generative theory of tonal music. MA: MIT Press. Lejeune, H. (1998). Switching or gating? The attentional challenge in cognitive models of psychological time. Behavioural Processes,44(2), 127-145. doi:10.1016/s0376- 6357(98)00045-x Lerdahl, F., & Jackendoff, R. (1981). On the theory of grouping and meter. The Musical Quarterly,67(4), 479-506. doi:10.1093/mq/lxvii.4.479 Lewis, P., & Miall, R. (2003). Brain activation patterns during measurement of sub- and supra-second intervals. Neuropsychologia,41(12), 1583-1592. doi:10.1016/s0028- 3932(03)00118-0 Lewkowicz, D. J., & Ghazanfar, A. A. (2009). The emergence of multisensory systems through perceptual narrowing. Trends in Cognitive Sciences,13(11), 470-478. doi:10.1016/j.tics.2009.08.004 Local, J. (2007). Phonetic detail and the organisation of talk-in-interaction. In Proceedings of the 16th International Congress of Phonetic Sciences: ICPhS XVI, (6- 10),1-10. Saarbrücken: Universität des Saarlandes. Locke, D. (1982). Principles of offbeat timing and cross-rhythm in southern Eve dance drumming. Ethnomusicology,26(2), 217. doi:10.2307/851524 Locke, D., & Lunna, A. (1990). Drum damba: Talking drum lessons. Crown Point, IN: White Cliffs Media Company.

159

Locke, D. (2009). Simultaneous multidimensionality in African music: Musical cubism. African Music: Journal of the International Library of African Music,8(3), 8-37. doi:10.21504/amj.v8i3.1826 London, J. (2012). Hearing in time psychological aspects of musical meter. New York: Oxford University Press. Louwerse, M. M., Dale, R., Bard, E. G., & Jeuniaux, P. (2012). Behavior matching in multimodal communication is synchronized. Cognitive Science, 36(8), 1404-1426. doi:10.1111/j.1551-6709.2012.01269.x Lucas, G., Clayton, M., & Leante, L. (2011). Inter-group entrainment in Afro-Brazilian Congado ritual. Empirical Musicology Review,6(2), 75-102. doi:10.18061/1811/51203 Macar, F., & Vidal, F. (2004). Event-related potentials as indices of time processing: A review. Journal of Psychophysiology,18(2/3), 89-104. doi:10.1027/0269-8803.18.23.89 Macdougall, H. G., & Moore, S. T. (2005). Marching to the beat of the same drummer: The spontaneous tempo of human locomotion. Journal of Applied Physiology,99(3), 1164-1173. doi:10.1152/japplphysiol.00138.2005 Matarazzo, J. D., & Wiens, A. N. (1966). Interviewer effects of interviewee speech and silence durations. PsycEXTRA Dataset. doi:10.1037/e469452008-023 Mates, J., Müller, U., Radil, T., & Pöppel, E. (1994). Temporal integration in sensorimotor synchronization. Journal of Cognitive Neuroscience,6(4), 332-340. doi:10.1162/jocn.1994.6.4.332 Matthews, W. J., Stewart, N., & Wearden, J. H. (2011). Stimulus intensity and the perception of duration. Journal of Experimental Psychology: Human Perception and Performance,37(1), 303-313. doi:10.1037/a0019961 Mauk, M. D., & Buonomano, D. V. (2004). The neural basis of temporal processing. Annual Review of Neuroscience,27(1), 307-340. doi:10.1146/annurev.neuro.27.070203.144247 McAdam, D. W. (1966). Slow potential changes recorded from human brain during learning of a temporal interval. Psychonomic Science,6(9), 435-436. doi:10.3758/bf03330974 McAuley, J. D., Jones, M., Holub, S., Johnston, H., & Miller, N. (2006). The time of our lives: life span development of timing and event tracking. Journal of Experimental Psychology: General, 135(3), 348-367. DOI: 10.1037/0096-3445.135.3.348

160

Meck, W. H., & Benson, A. M. (2002). Dissecting the brains internal clock: How frontal– striatal circuitry Keeps time and shifts attention. Brain and Cognition,48(1), 195-211. doi:10.1006/brcg.2001.1313 Merchant, H., Zarco, W., Bartolo, R., & Prado, L. (2008). The context of temporal processing is represented in the Multidimensional relationships between timing tasks. PLoS ONE,3(9), e3169. doi:10.1371/journal.pone.0003169 Meumann, E. (1894). Untersuchungen zur Psychologie und Aesthetik des Rhythmus. Philosophische Studien 10: 249-322 and 393-430. Miendlarzewska, E. A., & Trost, W. J. (2014). How musical training affects cognitive development: Rhythm, reward and other modulating variables. Frontiers in Neuroscience,7. doi:10.3389/fnins.2013.00279 Miyake, Y., Onishi, Y., & Pöppel, E. (2004). Two types of anticipatory-timing mechanisms in synchronization tapping. Acta Neurobiol Exp,46(3), 415-426. Montemurro, M. A., Rasch, M. J., Murayama, Y., Logothetis, N. K., & Panzeri, S. (2008). Phase-of-firing coding of natural visual stimuli in primary visual cortex. Current Biology,18(5), 375-380. doi:10.1016/j.cub.2008.02.023 Monts, L. P. (1990). An annotated glossary of Vai musical language and its social contexts. Paris: Peeters. Morein-Zamir, S., Soto-Faraco, S., & Kingstone, A. (2003). Auditory capture of vision: Examining temporal ventriloquism. Cognitive Brain Research,17(1), 154-163. doi:10.1016/s0926-6410(03)00089-2 M. S. Eno-Belinga, M. S. (1965). Litterature et musique populaires en Afrique noire. Paris: Cujas. Musacchia, G., Sams, M., Skoe, E., & Kraus, N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences,104(40), 15894-15898. doi:10.1073/pnas.0701498104 Musacchia, G., Strait, D., & Kraus, N. (2008). Relationships between behavior, brainstem and cortical encoding of seen and heard speech in musicians and non-musicians. Hearing Research,241(1-2), 34-42. doi:10.1016/j.heares.2008.04.013 Naito, E. (2004). Sensing limb movements in the motor cortex: How humans sense limb movement. The Neuroscientist,10(1), 73-82. doi:10.1177/1073858403259628 Nittono, H., Bito, T., Hayashi, M., Sakata, S., & Hori, T. (2000). Event-related potentials elicited by wrong terminal notes: Effects of temporal disruption. Biological Psychology,52(1), 1-16. doi:10.1016/s0301-0511(99)00042-3

161

Nketia, K. J. (1963). African Music in Ghana. Chicao: Northwestern University Press. Nordenhake, M., & Svantesson, J.-O. (1983). Duration of standard Chinese word tones in different sentence environments. Working Papers 25 (Lund, Sweden), 105–111. Omojola, B. (2014). Yorùbá music in the twentieth century: Identity, agency, and performance practice. Rochester: University of Rochester Press. Oshio, K. (2011). Possible functions of prefrontal cortical neurons in duration discrimination. Frontiers in Integrative Neuroscience,5, 5-25. doi:10.3389/fnint.2011.00025 Paillard, J. (1949) Quelques donnees psychophysiologiques relatives au declenchement de la commande motrice [Some psychophysiological data relating to the triggering of motor commands]. L’Ann. Psych 48, 28–47. Pantev, C., Oostenveld, R., Engelien, A., Ross, B., Roberts, L. E., & Hoke, M. (1998). Increased auditory cortical representation in musicians. Nature,392(6678), 811-814. doi:10.1038/33918 Parncutt, R. (1994). A perceptual model of pulse salience and metrical accent in musical rhythms. Music Perception: An Interdisciplinary Journal,11(4), 409-464. doi:10.2307/40285633 Patel, A. D., Iversen, J. R., Chen, Y., & Repp, B. H. (2005). The influence of metricality and modality on synchronization with a beat. Experimental Brain Research,163(2), 226- 238. doi:10.1007/s00221-004-2159-8 Penney, T. B., Gibbon, J., & Meck, W. H. (2000). Differential effects of auditory and visual signals on clock speed and temporal memory. Journal of Experimental Psychology: Human Perception and Performance,26(6), 1770-1787. doi:10.1037/0096- 1523.26.6.1770 Pfeuty, M., & Peretz, I. (2010). Abnormal pitch—time interference in congenital amusia: Evidence from an implicit test. Attention, Perception, & Psychophysics,72(3), 763-774. doi:10.3758/app.72.3.763 Pfordresher, P. Q. (2003). The role of melodic and rhythmic accents in musical structure. Music Perception,20(4), 431-464. doi:10.1525/mp.2003.20.4.431 Pfordresher, P. Q., & Brown, S. (2009). Enhanced production and perception of musical pitch in tone language speakers. Attention, Perception, & Psychophysics,71(6), 1385- 1398. doi:10.3758/app.71.6.1385

162

Phillips-Silver, J., & Trainor, L. J. (2007). Hearing what the body feels: Auditory encoding of rhythmic movement. Cognition,105(3), 533-546. doi:10.1016/j.cognition.2006.11.006 Polak, R. (2010). Rhythmic feel as meter: Non-isochronouse beat subdivision in Jembe music from Mali. Rhythm: Africa and Beyond Music Theory Online,16(4). doi:10.30535/mto.16.4.4 Polak, R., & London, J. (2014). Timing and meter in Mande drumming from Mali. Music Theory Online,20(1). doi:10.30535/mto.20.1.1 Povel, D-J., & Essens, P. (1985). Perception of temporal patterns. Music Perception, 2(4), 411-440. doi.org/10.2307/40285311 Prince, J. B., Thompson, W. F., & Schmuckler, M. A. (2009). Pitch and time, tonality and meter: How do musical dimensions combine? Journal of Experimental Psychology: Human Perception and Performance,35(5), 1598-1617. doi:10.1037/a0016456 Prince, J. B. (2014). Pitch structure, but not selective attention, affects accent weightings in metrical grouping. Journal of Experimental Psychology: Human Perception and Performance,40(5), 2073-2090. doi:10.1037/a0037730 Prinz, W. (1997). Perception and action planning. European Journal of ,9(2), 129-154. doi:10.1080/713752551 Rakic, P. (2002). Neurogenesis in adult primate neocortex: An evaluation of the evidence. Nature Reviews Neuroscience,3(1), 65-71. doi:10.1038/nrn700 Rakitin, B. C., Gibbon, J., Penney, T. B., Malapani, C., Hinton, S. C., & Meck, W. H. (1998). Scalar expectancy theory and peak-interval timing in humans. Journal of Experimental Psychology: Animal Behavior Processes,24(1), 15-33. doi:10.1037/0097- 7403.24.1.15 Rammsayer, T., & Altenmüller, E. (2006). Temporal information processing in musicians and nonmusicians. Music Perception, 24(1), 37-48. doi:10.1525/mp.2006.24.1.37 Rammsayer, T. H., Borter, N., & Troche, S. J. (2015). Visual-auditory differences in duration discrimination of intervals in the subsecond and second range. Frontiers in Psychology,6. doi:10.3389/fpsyg.2015.01626 Recanzone, G. H. (2003). Auditory influences on visual temporal rate perception. Journal of Neurophysiology,89(2), 1078-1093. doi:10.1152/jn.00706.2002 Repp, B. H. (2000). Compensation for subliminal timing perturbations in perceptual- motor synchronization. Psychological Research Psychologische Forschung,63(2), 106- 128. doi:10.1007/pl00008170

163

Repp, B. H. (2001). Phase correction, phase resetting, and phase shifts after subliminal timing perturbations in sensorimotor synchronization. Journal of Experimental Psychology: Human Perception and Performance,27(3), 600-621. doi:10.1037/0096- 1523.27.3.600 Repp, B. H., & Penel, A. (2002). Auditory dominance in temporal processing: New evidence from synchronization with simultaneous visual and auditory sequences. Journal of Experimental Psychology: Human Perception and Performance,28(5), 1085-1099. doi:10.1037//0096-1523.28.5.1085 Repp, B., & Penel, A. (2003). Rhythmic movement is attracted more strongly to auditory than to visual rhythms. Psychological Research Psychologische Forschung, 68(4). doi:10.1007/s00426-003-0143-8 Repp, B. H. (2002). Automaticity and voluntary control of phase correction following event onset shifts in sensorimotor synchronization. Journal of Experimental Psychology: Human Perception and Performance,28(2), 410-430. doi:10.1037/0096-1523.28.2.410 Repp, B. H., & Penel, A. (2002). Auditory dominance in temporal processing: New evidence from synchronization with simultaneous visual and auditory sequences. Journal of Experimental Psychology: Human Perception and Performance,28(5), 1085-1099. doi:10.1037//0096-1523.28.5.1085 Repp, B. H., & Penel, A. (2004). Rhythmic movement is attracted more strongly to auditory than to visual rhythms. Psychological Research,68(4), 252-270. doi:10.1007/s00426-003-0143-8 Repp, B. H. (2003). Rate limits in sensorimotor synchronization with auditory and visual sequences: The synchronization threshold and the benefits and costs of interval subdivision. Journal of Motor Behavior,35(4), 355-370. doi:10.1080/00222890309603156 Repp, B. H. (2005). Sensorimotor synchronization: A review of the tapping literature. Psychonomic Bulletin & Review,12(6), 969-992. doi:10.3758/bf03206433 Repp, B. H. (2007). Hearing a melody in different ways: Multistability of metrical interpretation, reflected in rate limits of sensorimotor synchronization. Cognition,102(3), 434-454. doi:10.1016/j.cognition.2006.02.003 Repp, B. H., & Moseley, G. P. (2012). Anticipatory phase correction in sensorimotor synchronization. Human Movement Science,31(5), 1118-1136. doi:10.1016/j.humov.2011.11.001

164

Rodrigues, A. C., Loureiro, M. A., & Caramelli, P. (2013). Long-term musical training may improve different forms of visual attention ability. Brain and Cognition,82(3), 229- 235. doi:10.1016/j.bandc.2013.04.009 Rosenblum, L. D., & Fowler, C. A. (1991). Audiovisual investigation of the loudness- effort effect for speech and nonspeech events. Journal of Experimental Psychology: Human Perception and Performance,17(4), 976-985. doi:10.1037/0096-1523.17.4.976 Rosenblum, M., Tass, P., Kurths, J., Volkmann, J., Schnitzler, A., & Freund, H. (1998). Detection of n:m phase locking from noisy data: Application to magnetoencephalography. Physical Review Letter,81(15), 3291-3294. doi:doi:10.1103/PhysRevLett.81.3291 Rowlands, M. (2013). New science of the mind - from extended mind to embodied phenomenology. MIT Press. Rubin, E. (1915). Visuell wahrgenommene Figuren. In D. C. Beardslee & M. Wertheimer (Eds.), Readings in perception (pp. 194-203). Princeton, NJ: D. Van Nostrand Company, Inc. Sameiro-Barbosa, C. M., & Geiser, E. (2016). Sensory entrainment mechanisms in auditory perception: Neural synchronization cortico-striatal activation. Frontiers in Neuroscience,10. doi:10.3389/fnins.2016.00361 Saygin, A. P. (2007). Superior temporal and premotor brain areas necessary for biological motion perception. Brain,130(9), 2452-2461. doi:10.1093/brain/awm162 Saygin, A. P., & Stadler, W. (2012). The role of appearance and motion in action prediction. Psychological Research,76(4), 388-394. doi:10.1007/s00426-012-0426-z Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences,12(3), 106- 113. doi:10.1016/j.tics.2008.01.002 Schubotz, R. I., Friederici, A. D., & Cramon, D. Y. (2000). Time perception and motor timing: A common cortical and subcortical basis revealed by fMRI. NeuroImage,11(1), 1-12. doi:10.1006/nimg.1999.0514 Schulze, H. (1978). The detectability of local and global displacements in regular rhythmic patterns. Psychological Research,40(2), 173-181. doi:10.1007/bf00308412 Shapiro, L. A. (2012). Embodied cognition. In E. Margolis, R. Samuels, & S. P. Stich (Eds.), The Oxford Handbook of Philosophy of Cognitive Science. Oxford: Oxford University Press.

165

Shelemay, K. K. (2010). Notation and oral tradition. In R. M. Stone (Ed.), The Garland handbook of African music (pp. 24-43). New York: Routledge. Shields, C. (2016). Aristotle’s psychology. The Stanford Encyclopedia of Philosophy, Edward N. Zalta (ed.). Retrieved from https://plato.stanford.edu/archives/win2016/entries/aristotle-psychology/ Shih, L. Y., Kuo, W., Yeh, T., Tzeng, O. J., & Hsieh, J. (2009). Common neural mechanisms for explicit timing in the sub-second range. NeuroReport,20(10), 897-901. doi:10.1097/wnr.0b013e3283270b6e Shockley, K., Santana, M., & Fowler, C. A. (2003). Mutual interpersonal postural constraints are involved in cooperative conversation. Journal of Experimental Psychology: Human Perception and Performance,29(2), 326-332. doi:10.1037/0096- 1523.29.2.326 Snyder, J., & Krumhansl, C. L. (2001). Tapping to Ragtime: Cues to pulse finding. Music Perception,18(4), 455-489. doi:10.1525/mp.2001.18.4.455 Spaak, E., Lange, F. P., & Jensen, O. (2014). Local entrainment of Alpha oscillations by visual stimuli causes cyclic modulation of perception. Journal of Neuroscience,34(10), 3536-3544. doi:10.1523/jneurosci.4385-13.2014 Stadler, W., Springer, A., Parkinson, J., & Prinz, W. (2012). Movement kinematics affect action prediction: Comparing human to non-human point-light actions. Psychological Research,76(4), 395-406. doi:10.1007/s00426-012-0431-2 Stein, B. E., Stanford, T. R., & Rowland, B. A. (2009). The neural basis of multisensory integration in the midbrain: Its organization and maturation. Hearing Research, 258(1-2), 4-15. doi:10.1016/j.heares.2009.03.012 Steven, L. T. (1886). On the Time Sense. Mind, 11, 393-404. Styns, F., Noorden, L. V., Moelants, D., & Leman, M. (2007). Walking on music. Human Movement Science,26(5), 769-785. doi:10.1016/j.humov.2007.07.007 Su, Y., & Jonikaitis, D. (2011). Hearing the speed: Visual motion biases the perception of auditory tempo. Experimental Brain Research,214(3), 357-371. doi:10.1007/s00221-011- 2835-4 Su, Y. (2014a). Visual enhancement of auditory beat perception across auditory interference levels. Brain and Cognition,90, 19-31. doi:10.1016/j.bandc.2014.05.003 Su, Y. (2014b). Audiovisual beat induction in complex auditory rhythms: Point-light figure movement as an effective visual beat. Acta Psychologica,151, 40-50. doi:10.1016/j.actpsy.2014.05.016

166

Su, Y. (2014c). Content congruency and its interplay with temporal synchrony modulate integration between rhythmic audiovisual streams. Frontiers in Integrative Neuroscience,8. doi:10.3389/fnint.2014.00092 Swaminathan, J., Krishnan, A., & Gandour, J. T. (2008). Pitch encoding in speech and nonspeech contexts in the human auditory brainstem. NeuroReport,19(11), 1163-1167. doi:10.1097/wnr.0b013e3283088d31 Terao, M., Watanabe, J., Yagi, A., & Nishida, S. (2008). Reduction of stimulus visibility compresses apparent time intervals. Nature Neuroscience,11(5), 541-542. doi:10.1038/nn.2111 Thagard, P. (2014). Cognitive science. The Stanford Encyclopedia of Philosophy. (Edward N. Zalta ed.). Retrieved from https://plato.stanford.edu/archives/fall2014/entries/cognitive-science/ Thaut, M., Kenyon, G., Schauer, M., & Mcintosh, G. (1999). The connection between rhythmicity and brain function. IEEE Engineering in Medicine and Biology Magazine,18(2), 101-108. doi:10.1109/51.752991 Thomas, E. A., & Weaver, W. B. (1975). Cognitive processing and time perception. Perception & Psychophysics,17(4), 363-367. doi:10.3758/bf03199347 Thomas, E. C., & Brown, I. (1974). Time perception and the filled-duration . Perception & Psychophysics,16(3), 449-458. doi:10.3758/bf03198571 Titon, J. T. (2015). Ethnomusicology as the Study of People Making Music. Musicological Annual,51(2), 175-185. doi:10.4312/mz.51.2.175-185 Todd, N. P. (1994). The auditory “primal sketch”: A multiscale model of rhythmic grouping. Journal of New Music Research, 23(1), 25-70. doi:10.1080/09298219408570647 Todd, N. P. (1995). The kinematics of musical expression. The Journal of the Acoustical Society of America,97(3), 1940-1949. doi:10.1121/1.412067 Todd, N. P., Oboyle, D. J., & Lee, C. S. (1999). A sensory-motor theory of rhythm, time perception and beat induction. Journal of New Music Research,28(1), 5-28. doi:10.1076/jnmr.28.1.5.3124 Todd, N. P., & Lee, C. S. (2015). The sensory-motor theory of rhythm and beat induction 20 years on: A new synthesis and future perspectives. Frontiers in Human Neuroscience,9. doi:10.3389/fnhum.2015.00444 Todor, J. I., & Kyprie, P. M. (1980). Hand differences in the rate and variability of rapid tapping. Journal of Motor Behavior,12(1), 57-62. doi:10.1080/00222895.1980.10735205

167

Trainor, L. J., & Adams, B. (2000). Infants’ and adults’ use of duration and intensity cues in the segmentation of tone patterns. Perception & Psychophysics,62(2), 333-340. doi:10.3758/bf03205553 Treisman, M. (1963). Temporal discrimination and the indifference interval: Implications for a model of the "internal clock". Psychological Monographs: General and Applied,77(13), 1-31. doi:10.1037/h0093864 Truman, G., & Hammond, G. R. (1990). Temporal regularity of tapping by the Left and Right hands in timed and untimed finger tapping. Journal of Motor Behavior,22(4), 521- 535. doi:10.1080/00222895.1990.10735526 Van Wassenhove, V. (2009). Minding time in an amodal representational space. Philosophical Transactions of the Royal Society B: Biological Sciences,364(1525), 1815-1830. doi:10.1098/rstb.2009.0023 Van Wassenhove, V., Buonomano, D. V., Shimojo, S., & Shams, L. (2008). Distortions of Subjective Time Perception Within and Across Senses. PLoS ONE,3(1), e1437. doi:10.1371/journal.pone.0001437 Vander Wyk, B. C., Ramsay, G. J., Hudac, C. M., Jones, W., Lin, D., Klin, A. Lee, S. M., Pelphrey, K. A. (2010). Cortical integration of audio-visual information. Brain and Cognition, 74(2), 97-106. doi:10.1016/j.bandc.2010.07.002 Varela, F. J., Rosch, E., & Thompson, E. (2016). The embodied mind: Cognitive science and human experience. Cambridge, MA: The MIT Press. Vos, P. (1977). Temporal duration factors in the perception of auditory rhythmic patterns, Scientific Aesthetics/Sciences de I’Art 1, 183-199. Vuoskoski, J. K., Thompson, M. R., Spence, C., & Clarke, E. F. (2016). Interaction of sight and sound in the perception and experience of musical performance. Music Perception: An Interdisciplinary Journal,33(4), 457-471. doi:10.1525/mp.2016.33.4.457 Walter, W. G., Cooper, R., Aldridge, V. J., Mccallum, W. C., & Winter, A. L. (1964). Contingent negative variation: An electric sign of sensori-motor association and expectancy in the human brain. Nature,203(4943), 380-384. doi:10.1038/203380a0 Warren, R. M., & Obusek, C. J. (1972). Identification of temporal order within auditory sequences. Perception & Psychophysics,12(1), 86-90. doi:10.3758/bf03212848 Waterman, C. A. (1991). The uneven development of Africanist Ethnomusicology: Three issues and a critique. In Comparative Musicology and Anthropology of Music: Essays on the History of Ethnomusicology (Vol. I, pp. 169-186). Chicago: University of Chicago Press.

168

Waterman, R. A. (1952). African influence on the music of the Americas. In S. Tax (Ed.), Acculturation in the Americas (pp. 207-218). Chicago: University of Chicago Press. Wearden, J. H., Edwards, H., Fakhri, M., Percival, A. (1998). Why “sounds are judged longer than lights”: application of a model of the internal clock in humans. Quarterly Journal of Experimental Psychology B, 51(2), 97-120. DOI: 10.1080/713932672 Wearden, J. H., Norton, R., Martin, S., & Montford-Bebb, O. (2007). Internal clock processes and the filled-duration illusion. Journal of Experimental Psychology: Human Perception and Performance,33(3), 716-729. doi:10.1037/0096-1523.33.3.716 Webb, J. T. (1969). Subject speech rates as a function of interviewer behaviour. Language and Speech,12(1), 54-67. doi:10.1177/002383096901200105 Welch, R. B., & Warren, D. H. (1986). Intersensory interactions. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance: Vol. 1. Sensory processes and perception (pp. 1–36). New York: Wiley. Wencil, E. B., Coslett, H. B., Aguirre, G. K., & Chatterjee, A. (2010). Carving the clock at its component joints: Neural bases for interval timing. Journal of Neurophysiology,104(1), 160-168. doi:10.1152/jn.00029.2009 Will, U. (2017). Cultural factors in responses to rhythmic stimuli. In J. R. Evans, & R. Turner (Eds.), Rhythmic stimulation procedures in neuromodulation. Elsevier Neuroscience. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience Letters,424(1), 55-60. doi:10.1016/j.neulet.2007.07.036 Will, U., Clayton, M., Wertheim, I., Leante, L., & Berg, E. (2015). Pulse and entrainment to non-isochronous auditory stimuli: The case of north Indian Alap. Plos One,10(4). doi:10.1371/journal.pone.0123247 Will, U., & Makeig, S. (2011). EEG research methodology and brainwave entrainment. In G. Turow & J. Berger J. (Eds.), Music, science and the rhythmic brain: Cultural and clinical implications (pp. 86–107). Routledge. Witherspoon, D., & Allan, L. G. (1985). The effect of a prior presentation on temporal judgments in a perceptual identification task. Memory & Cognition,13(2), 101-111. doi:10.3758/bf03197003 Wöllner, C., Deconinck, F. J., Parkinson, J., Hove, M. J., & Keller, P. E. (2012). The perception of prototypical motion: Synchronization is enhanced with quantitatively morphed gestures of musical conductors. Journal of Experimental Psychology: Human Perception and Performance,38(6), 1390-1403. doi:10.1037/a0028130

169

Xuan, B., Zhang, D., He, S., & Chen, X. (2007). Larger stimuli are judged to last longer. Journal of Vision,7(10), 1-5. doi:10.1167/7.10.2 Yang, J., Zhang, Y., Li, A., & Xu, L. (2017). On the duration of mandarin tones. Interspeech 2017. doi:10.21437/interspeech.2017-29 Yip, M. (2010). Tone. Cambridge: Cambridge Univ. Press. Yoshida, T., Takeda, S., & Yamamoto, S. (2002). The Application of Entrainment to Musical Ensembles. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=488245F75E49DC002389878B 48889443?doi=10.1.1.471.3067&rep=rep1&type=pdf Zakay, D. (1989). Subjective time and attentional resource allocation: An integrated model of time estimation. In I. Lvin & D. Zakay (Eds.), Time and human cognition: A life-span perspective (pp. 365–397). Amsterdam: North Holland. Zakay, D. (1992). The role of attention in children’s time perception. Journal of Experiment Children Psychology, 54(3):355-371. doi:10.1016/0022-0965(92)90025-2 Zakay, D. Block, R. A. (1996). The role of attention in perspective duration estimation. Advances in Psychology, 115. 143-164. doi.org/10.1016/S0166-4115(96)80057-4 Zatorre, R. J., Chen, J. L., & Penhune, V. B. (2007). When the brain plays music: Auditory–motor interactions in music perception and production. Nature Reviews Neuroscience,8(7), 547-558. doi:10.1038/nrn2152

170

Appendix A. ANOVA Table on Circular Variance

Tests of Hypothesis for Between Subjects Effects

Source             Df   Sum Sq    Mean Sq   F Value   P Value
Modality            1   0.00070   0.00070     0.034   0.85644
Display             1   0.06043   0.06043     2.976   0.11248
Tempi               1   0.24704   0.24704    12.165   0.00508 **
Modality × Tempi    1   0.11088   0.11088     5.460   0.03941 *
Residuals          11   0.22339   0.02031

Tests of Hypothesis for Within Subjects Effects

Source                        Df   Sum Sq   Mean Sq   F Value   P Value
Modality                       1   1.2028   1.2028    143.084   <2e-16 ***
Display                        1   0.0074   0.0074      0.880   0.349
Tempi                          2   1.6433   0.8217     97.739   <2e-16 ***
Modality × Display             1   0.0006   0.0006      0.076   0.783
Modality × Tempi               2   0.7452   0.3726     44.321   <2e-16 ***
Display × Tempi                2   0.0202   0.0101      1.201   0.302
Modality × Display × Tempi     2   0.0184   0.0092      1.095   0.336
Residuals                    340   2.8582   0.0084

Note: Modality: bimodal/unimodal (visual-only) conditions. Display: natural/artificial visual display. Tempi: fast/preferred/slow tempi.
Significance markers (R convention): * p < .05; ** p < .01; *** p < .001.
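As a reading aid, a table of this form can be reproduced in outline with base R's aov() function. The sketch below is a minimal, hypothetical illustration, not the analysis script actually used for this dissertation: the data frame cv, its column names, and the random placeholder responses are all assumptions for demonstration only.

set.seed(1)  # reproducible placeholder data

# Hypothetical layout: one row per subject x modality x display x tempo cell,
# with the measured circular variance as the response variable.
cv <- expand.grid(
  subject  = factor(1:15),
  modality = c("bimodal", "unimodal"),
  display  = c("natural", "artificial"),
  tempi    = c("fast", "preferred", "slow")
)
cv$circvar <- runif(nrow(cv))  # placeholder values, not the real data

# Repeated-measures ANOVA: the Error(subject) term stratifies the variance
# by subject before the Modality x Display x Tempi effects are tested.
fit <- aov(circvar ~ modality * display * tempi + Error(subject), data = cv)
summary(fit)  # prints Df, Sum Sq, Mean Sq, F value, and Pr(>F), with the
              # *, **, *** significance codes used in these appendices

The summary() output has the same column layout (Df, Sum Sq, Mean Sq, F Value, P Value) as the tables printed above.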


Appendix B. ANOVA Table on PCR

Tests of Hypothesis for Between Subjects Effects

Source               Df   Sum Sq   Mean Sq   F Value   P Value
Modality              1   1.6861   1.6861      6.358   0.0452 *
Display               1   0.0110   0.0110      0.041   0.8454
Tempi                 2   1.1558   0.5779      2.179   0.1944
PT                    1   0.0414   0.0414      0.156   0.7065
PP                    1   0.0386   0.0386      0.146   0.7159
Modality × Display    1   0.0047   0.0047      0.018   0.8979
Display × PT          1   0.0263   0.0263      0.099   0.7634
PT × PP               1   0.4946   0.4946      1.865   0.2210
Residuals             6   1.5912   0.2652

Tests of Hypothesis for Within Subjects Effects

Source                                  Df   Sum Sq   Mean Sq   F Value   P Value
Modality                                 1    24.42     24.42    89.060   <2e-16 ***
Display                                  1     0.06      0.06     0.232   0.630272
Tempi                                    2    67.95     33.97   123.888   <2e-16 ***
PT                                       1     2.01      2.01     7.343   0.006905 **
PP                                       1     0.17      0.17     0.620   0.431344
Modality × Display                       1     0.04      0.04     0.163   0.686451
Modality × Tempi                         2     5.97      2.98    10.877   2.25e-05 ***
Modality × PT                            1     0.53      0.53     1.937   0.164456
Modality × PP                            1     0.28      0.28     1.027   0.311197
Display × Tempi                          2     1.01      0.50     1.841   0.159411
Display × PT                             1     0.39      0.39     1.429   0.232403
Display × PP                             1     0.03      0.03     0.113   0.736823
PT × Tempi                               2     1.58      0.79     2.890   0.056281
PT × PP                                  1    17.65     17.65    64.346   4.73e-15 ***
Tempi × PP                               2     0.67      0.34     1.227   0.293927
Modality × Display × PT                  1     0.03      0.03     0.109   0.741803
Modality × Display × Tempi               2     0.11      0.05     0.198   0.820224
Modality × Display × PP                  1     0.16      0.16     0.584   0.445130
Modality × PT × Tempi                    2     5.11      2.56     9.321   0.000102 ***
Modality × PT × PP                       1     0.01      0.01     0.030   0.861730
Modality × Tempi × PP                    2     3.60      1.80     6.568   0.001498 **
Display × PT × Tempi                     2     0.75      0.37     1.361   0.257115
Display × PT × PP                        1     0.15      0.15     0.543   0.461525
Display × Tempi × PP                     2     0.51      0.25     0.926   0.396815
PT × Tempi × PP                          2     2.21      1.11     4.033   0.018166 *
Modality × Display × PT × Tempi          2     1.23      0.61     2.237   0.107573
Modality × Display × PT × PP             1     0.49      0.49     1.798   0.180410
Modality × Display × Tempi × PP          2     0.37      0.19     0.679   0.507567
Modality × PT × Tempi × PP               2     2.76      1.38     5.041   0.006714 **
Display × PT × Tempi × PP                2     0.33      0.16     0.596   0.551207
Modality × Display × PT × PP × Tempi     2     1.58      0.79     2.882   0.056721
Residuals                              664   182.09      0.27

Note: Modality: bimodal/unimodal (visual-only) conditions. Display: natural/artificial visual display. Tempi: fast/preferred/slow tempi. PT: early/delay perturbation type. PP: first/second perturbation position.
Significance markers (R convention): * p < .05; ** p < .01; *** p < .001.


Appendix C. ANOVA Table on Recovery Rate

Tests of Hypothesis for Between Subjects Effects

Source            Df   Sum Sq   Mean Sq   F Value   P Value
Modality           1   0.1483   0.1483      0.916   0.3612
Display            1   0.6637   0.6637      4.098   0.0705
Tempi              1   0.4006   0.4006      2.474   0.1468
Tempi × Display    1   0.2262   0.2262      1.397   0.2646
PT × PP            1   0.0047   0.0047      0.029   0.8678
Residuals         10   1.6195   0.1619

Tests of Hypothesis for Within Subjects Effects

Source                                  Df   Sum Sq   Mean Sq   F Value   P Value
Modality                                 1     0.52    0.5236     7.747   0.00553 **
Display                                  1     0.13    0.1339     1.982   0.15966
Tempi                                    2     3.15    1.5772    23.336   1.59e-10 ***
PT                                       1     0.16    0.1625     2.405   0.12143
PP                                       1     0.33    0.3313     4.902   0.02716 *
Modality × Display                       1     0.00    0.0015     0.022   0.88274
Modality × Tempi                         2     0.01    0.0072     0.107   0.89869
Modality × PT                            1     0.12    0.1151     1.703   0.19237
Modality × PP                            1     0.11    0.1089     1.611   0.20478
Display × Tempi                          2     0.05    0.0257     0.381   0.68361
Display × PT                             1     0.03    0.0313     0.463   0.49623
Display × PP                             1     0.00    0.0000     0.000   0.99555
PT × Tempi                               2     0.06    0.0303     0.449   0.63846
PT × PP                                  1     0.03    0.0261     0.386   0.53475
Tempi × PP                               2     0.06    0.0292     0.432   0.64970
Modality × Display × PT                  1     0.03    0.0285     0.421   0.51651
Modality × Display × Tempi               2     0.11    0.0531     0.785   0.45649
Modality × Display × PP                  1     0.04    0.0407     0.602   0.43811
Modality × PT × Tempi                    2     0.00    0.0008     0.012   0.98824
Modality × PT × PP                       1     0.10    0.1011     1.496   0.22166
Modality × Tempi × PP                    2     0.09    0.0454     0.672   0.51106
Display × PT × Tempi                     2     0.29    0.1437     2.126   0.12009
Display × PT × PP                        1     0.01    0.0143     0.211   0.64606
Display × Tempi × PP                     2     0.00    0.0001     0.001   0.99915
PT × Tempi × PP                          2     0.45    0.2272     3.362   0.03524 *
Modality × Display × PT × Tempi          2     0.11    0.0561     0.830   0.43637
Modality × Display × PT × PP             1     0.08    0.0771     1.141   0.28577
Modality × Display × Tempi × PP          2     0.06    0.0296     0.438   0.64562
Modality × PT × Tempi × PP               2     0.14    0.0707     1.046   0.35183
Display × PT × Tempi × PP                2     0.13    0.0673     0.996   0.36979
Modality × Display × PT × PP × Tempi     2     0.05    0.0257     0.380   0.68395
Residuals                              671    45.35    0.0676

Note: Modality: bimodal/unimodal (visual-only) conditions. Display: natural/artificial visual display. Tempi: fast/preferred/slow tempi. PT: early/delay perturbation type. PP: first/second perturbation position.
Significance markers (R convention): * p < .05; ** p < .01; *** p < .001.


Appendix D. ANOVA Table on Bscore

Tests of Hypothesis for Between Subjects Effects

Source      Df   Sum Sq   Mean Sq   F Value   P Value
Tempo        1   0.579    0.5788      1.219   0.2849
Pattern      4   2.171    0.5428      1.143   0.3699
Group        1   2.454    2.4541      5.169   0.0363 *
Residuals   17   8.072    0.4748

Tests of Hypothesis for Within Subjects Effects

Source                              Df   Sum Sq   Mean Sq   F Value   P Value
Tempo                                1     2.02    2.020     13.407   0.000272 ***
Pattern                              4    39.16    9.791     64.985   <2e-16 ***
Visual                               2     5.43    2.714     18.014   2.47e-08 ***
Tempo × Pattern                      4     1.14    0.285      1.892   0.110204
Tempo × Visual                       2     0.08    0.040      0.268   0.764770
Pattern × Visual                     8     1.85    0.231      1.535   0.141571
Tempo × Group                        1     0.01    0.005      0.035   0.852135
Pattern × Group                      4     0.49    0.122      0.808   0.520431
Visual × Group                       2     1.87    0.934      6.202   0.002152 **
Tempo × Pattern × Visual             8     1.12    0.140      0.929   0.491384
Tempo × Pattern × Group              4     0.60    0.149      0.990   0.412485
Tempo × Visual × Group               2     0.57    0.284      1.887   0.152423
Pattern × Visual × Group             8     0.21    0.027      0.177   0.993953
Tempo × Pattern × Visual × Group     8     0.77    0.096      0.639   0.745391
Residuals                          627    94.47    0.151

Note: Tempo: slow/fast tempo. Pattern: five auditory patterns. Visual: visual duple timing cue, visual triple timing cue, or visual still image without any timing cue. Group: musician/non-musician participant group.
Significance markers (R convention): * p < .05; ** p < .01; *** p < .001.
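As a further reading aid (an illustration added here, not output from the original analysis), every F value in these appendix tables is the effect's mean square divided by the residual mean square of the same stratum, where each mean square is a sum of squares divided by its degrees of freedom:

\[ MS = \frac{SS}{df}, \qquad F_{\text{effect}} = \frac{MS_{\text{effect}}}{MS_{\text{residual}}} \]

For example, in the within-subjects table above, \( MS_{\text{residual}} = 94.47 / 627 \approx 0.1507 \), so for the Visual effect \( F = 2.714 / 0.1507 \approx 18.01 \), which matches the tabled value of 18.014 up to rounding.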


Appendix E. Questionnaire

Basic Information

Age: ______   Gender: ______   Major: ______

Questions and possible responses are displayed below across a number of pages:

1/15. I would not consider myself a musician. (Answer with an X or ✓ in one box.)
□ Completely Agree
□ Strongly Agree
□ Agree
□ Neither Agree nor Disagree
□ Disagree
□ Strongly Disagree
□ Completely Disagree

2/15. I spend a lot of my free time doing music-related activities.
□ Completely Disagree
□ Strongly Disagree
□ Disagree
□ Neither Agree nor Disagree
□ Agree
□ Strongly Agree
□ Completely Agree

3/15. I listen attentively to music for __ per day.
□ 0-15 min
□ 15-30 min
□ 30-60 min
□ 60-90 min
□ 2 hr
□ 2-3 hr
□ 4 hr or more

4/15. What kind of music do you like to listen to every day?

5/15. I have attended __ live music events as an audience member in the past twelve months.
□ 0
□ 1
□ 2
□ 3
□ 4-6
□ 7-10
□ 11 or more

6/15. Please specify the type of music at the concerts you mainly attend (classical music, popular music, world music, etc.).

7/15. I can play __ musical instruments (including professional voice).
□ 0
□ 1
□ 2
□ 3
□ 4
□ 5
□ 6 or more


8/15. Have you learned a musical instrument, including voice?
□ Learned through private music lessons
□ Learned by yourself
□ Never learned any → Skip to 13/15

9/15. Please give the starting age and number of training years for each instrument, including voice.

______

10/15. On what instrument (including voice) have you attained the highest level of skill?

11/15. How often do you practice each instrument, or voice? Specify hours of regular daily practice. (Daily can be defined as 5 to 7 days per week. A year can be defined as 10 to 12 months.)

12/15. Specify the amount of time you currently (in the last 12 months) spend practicing your instrument (or voice) individually. (Do not count time spent in any group rehearsal. Count by day, week, or year.)

13/15. Did you participate in a band/orchestra/choir? (Y/N) _____. If N, skip to the end; if Y, please specify the type of ensemble and the number of years you participated.

179

14/15. How many times did you join ensemble practice and performance in the last 12 months?
□ None
□ 1-4
□ 5-8
□ 9-12
□ 13-16
□ 17-20
□ 21 or more

15/15. How many hours do you practice every week in group rehearsal?
□ 0
□ 0.5
□ 1
□ 1.5
□ 2
□ 3-4
□ 5 or more: ______ (specify the number)

You have finished the questionnaire! Please check through the booklet to make sure you have not accidentally missed any pages. Thank you for your help.
