
Functional transfer of musical training to speech perception in adverse acoustical situations

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Arts in the Graduate School of The Ohio State University

By

Jianming Shen

Graduate Program in Speech and Hearing Science

The Ohio State University

2014

Master's Examination Committee:

Dr. Lawrence L. Feth, Advisor

Dr. Antoine J. Shahin

Copyrighted by

Jianming Shen

2014

Abstract

Listeners can perceive interrupted speech as continuous, provided that the gap is masked by another extraneous sound such as white noise or a cough. This phenomenon, known as the continuity illusion or phonemic restoration, is an adaptive function of our auditory system that facilitates speech comprehension in adverse acoustic situations. In this study, we examined the hypothesis that the effect of music training, as manifested in one’s enhanced ability to anticipate envelope variation and thus perceive continuity in degraded music, can transfer to phonemic restoration. We posited that this cross-domain extension is largely due to the overlapping neural networks associated with rhythm processing in the lower-level central auditory system.

Musicians and non-musicians listened to physically interrupted short music tunes and English words in which a segment had been replaced by white noise, and judged whether they heard each stimulus as interrupted or continuous through the noise. Their perceptual threshold of continuity, defined here as the interruption duration at which they perceived the sound as continuous with 50% probability, was measured for each session using an adaptive procedure. Results revealed that musicians tolerated longer interruptions than non-musicians during the speech session, but not during the music session. These results partially support the existence of a functional transfer of musical training to speech perception. Meanwhile, the interruption thresholds in the two sessions were highly correlated, which is consistent with the hypothesized overlap between neural networks related to music and speech processing. These findings may have implications for developing learning tools and strategies to support perception of spoken language in adverse listening situations.


Acknowledgments

First of all, I would like to express my sincere gratitude to my research advisor Dr. Tony Shahin for his guidance, patience, dedication and empathy during such a special period of time in my life. What he has taught me is not only how to reach a scientific goal step by step, but also how to stay loyal, responsible, brave and adaptive in adversity.

My special thanks also go to my academic advisor Dr. Larry Feth, who helped me transform my way of thinking from an engineering to a scientific perspective, raised thought-provoking questions for this project, and provided guidance in writing and revising my thesis. I hope to develop a heart as broad as his and to detect more signals in his characteristic humor.

Moreover, I would like to thank Dr. in the School of Music, who formally led me into the world of science and imparted to me his creeds in empirical research. His lab has been a home-like place for me over the past two years and will always be a home. I cannot overstate the impact his publications have had on my perspective and career pursuit.

I owe gratitude to Dr. Robert Fox, who has trusted, encouraged and supported me all through; to Dr. Eric Healy, who accepted me into this program and gave me flexibility in personal development; and to Dr. Rebecca McCauley, who prompted me to rethink my motivation and my goals in life when I was in a period of stagnation.

I want to thank my colleague David Bendoly in the Auditory Neuroscience Lab, who devoted a lot of time to this project and offered true friendship when I was in trouble during the winter blizzard. I’m also grateful to Dr. Mark Pitt and Dr. Laurel Trainor, as well as our former lab manager Jyoti Bhat for their support on this project.

I cannot forget the generous help from these people in Speech & Hearing Science: Jing Yang, Sarah Yoho, Carla Youngdahl, Niall Klyn, Christin Ray and Jenny Lundine (with her medical dog Ansley); and from the School of Music: Dan Shanahan, Claire Arthur, Nat Condit-Schultz, Kirsten Nisula, Erin Allen, Aaron Carter-Cohn and Gary Yim.

I can never write too much to thank the true friends who taught me how to survive in this country, inside and out, and helped me overcome troughs in life: my brother Manny Rizzi and sister Charlette Lin from the Department of Psychology, emergency contact Michael Rudy, old neighbors Katherine Bracken and Frank Brownfield, and songwriter Eric Clemens…to name but a few. I feel lucky to have these positive people in my life.

Finally, I want to thank my extended family, who have supported my education for so many years. In particular, I dedicate this small accomplishment to my maternal grandpa, who brought me up, and to my paternal grandpa, who always supported my career goal in science but passed away early this year. Our dream will come true.


Vita

2012...... B.E. Bioinformatics, Tongji University

2012 to present ...... Graduate Teaching Associate, Department of Speech & Hearing Science, The Ohio State University

Fields of Study

Major Field: Speech and Hearing Science


Table of Contents

Abstract ...... ii

Acknowledgments...... iv

Vita ...... vi

Fields of Study ...... vi

Table of Contents ...... vii

List of Figures ...... x

Chapter 1: Introduction ...... 1

1.1 The relationship between speech and music ...... 1

1.2 Neural substrates, brain imaging and electrophysiology ...... 2

1.3 Phonemic restoration ...... 5

1.4 Restoration of a musical tone ...... 9

1.5 More about neural perspective on shared auditory continuity ...... 11

1.7 Current study ...... 17

Chapter 2: Materials and Methods ...... 19

2.1 Participants ...... 19


2.2 Stimuli ...... 20

2.3 Procedure ...... 26

2.4 Data Analysis ...... 30

2.4.1 Preprocessing ...... 30

2.4.2 Calculation ...... 31

Chapter 3: Results ...... 32

3.1 ANOVA ...... 33

3.2 Correlation analysis ...... 34

3.3 Supplementary analysis ...... 39

Chapter 4: Discussion ...... 41

4.1 Music training influence on illusory perception in the music domain ...... 42

4.1.1 Controlling for familiarity during stimulus selection ...... 42

4.1.2 Possible influence of number of stimulus presentations ...... 46

4.1.3 Possible competition between sound segregation and illusory continuity ...... 46

4.1.4 Selection of target tone and replacer noise ...... 48

4.2 Music training influence on illusory perception in the speech domain ...... 49

4.2.1 Consonants and vowels ...... 49

4.2.2 Top-down information ...... 51

4.3 Musicianship and OMSI ...... 52


4.3.2 Ollen Music Sophistication Index and other instruments ...... 53

4.3.3 How to treat vocalists and songwriters ...... 55

4.4 Cross-domain correlation as a means of elucidating functional transfer ...... 56

4.4.1 Neural markers ...... 56

4.4.2 Adequacy of rhythmic information ...... 57

4.4.3 Nature vs. Nurture ...... 59

References ...... 61

Appendix A: Ollen Music Sophistication Index ...... 68

Appendix B: Music works used for generating music stimuli ...... 71

Appendix C: Experimental data used in statistical analysis ...... 72


List of Figures

Figure 1. Illustration of a tone in Classical music replaced by white noise...... 22

Figure 2. Illustration of a tone in Jazz music replaced by white noise ...... 23

Figure 3. Illustration of phoneme /ʃ/ in word “efficient” replaced by white noise ...... 25

Figure 4. Ollen Music Sophistication Index scores of 10 non-musicians and 11 musicians ...... 27

Figure 5. Bar plot of mean perceptual threshold (group × session) with error bar corresponding to 95% Confidence Interval ...... 33

Figure 6. Scatterplot of threshold for tunes against Ollen Music Sophistication Index ...... 36

Figure 7. Scatterplot of threshold for words against Ollen Music Sophistication Index .. 37

Figure 8. Scatterplot of threshold for words against threshold for tunes ...... 38


Chapter 1: Introduction

1.1 The relationship between speech and music

Speech and music represent different forms of complex sounds that are ecologically relevant to our daily lives. Both allow us to communicate our thoughts and feelings. Naturally, people are interested in the similarities and distinctions between music and speech. From an anthropological view, language and music evolved from the same communicative system (Darwin, 1981). Acoustically, in the temporal domain, fluctuation of amplitude allows us to perceive distinctive syllables in speech and separate tones in music. In the frequency domain, the fundamental frequency and the harmonic profile largely determine pitch and timbre perception for both speech and music, although these perceptual variables are not equally important across languages and genres. Whereas comparison of spoken language and music shows similar hierarchical structures on acoustic and syntactic levels, spoken language is probably processed by a different encapsulated cognitive system than music (Jackendoff, 2009; Peretz, 2006). Meanwhile, paralinguistic information like prosody is more analogous to musical information in terms of its function in nonverbal communication, especially for the conveyance of emotional information.


Inevitably, our everyday auditory experience is sometimes challenged by the adverse acoustical environments we live in. This problem is especially daunting for listeners with hearing loss. Music training has been shown to help alleviate such adversities. For instance, musicians, who usually have an advantage in pitch discrimination (Tervaniemi, Just, Koelsch, Widmann, & Schroger, 2004), working memory function (Chan, Ho, & Cheung, 1998) and selective attention (Strait, Kraus, Parbery-Clark, & Ashley, 2010), have been reported to perform more robustly than non-musicians on the Hearing in Noise Test (HINT) and the multi-talker Quick Speech-in-Noise (QuickSIN) test (Parbery-Clark, Skoe, Lam, & Kraus, 2009). Therefore, behavioral tests have provided us with evidence supporting the proposition that music training may have a positive impact on speech perception. The remaining challenge is to elucidate the neural functioning that allows for such transfer across modalities, and in turn apply what we learn to understanding how music training can benefit speech perception in real-life situations.

1.2 Neural substrates, brain imaging and electrophysiology

In order for a functional transfer to occur between speech and music perception, there must be sufficient overlap between the neural networks associated with processing information from the two modalities, so that the effect of training in one domain (music/speech) can be partially shared by the other domain (Shahin, 2011). This shared effect can be recorded and visualized using brain-imaging techniques like electroencephalography (EEG), magnetoencephalography (MEG), magnetic resonance imaging (MRI) and functional MRI. EEG and MEG, for example, have superb temporal resolution for elucidating the amplitude and latency of event-related potentials/fields (ERP/Fs), also known as auditory evoked potentials/fields (AEP/Fs). An increase in the amplitude of AEPs, which has been demonstrated in musicians versus non-musicians (Pantev et al., 1998; A. Shahin, Bosnyak, Trainor, & Roberts, 2003), is suggestive of greater recruitment of neurons or greater temporal alignment of the neural activity to a sound feature (e.g., acoustic onset). On the other hand, shorter AEP latency (Shahin, Roberts, Pantev, Trainor, & Ross, 2005; Shahin, Roberts, & Trainor, 2004) may tell us that the process is streamlined by virtue of neuroplasticity, i.e., due to music training.

While EEG and MEG are superb at assessing the temporal dynamics of neural activity, they lack spatial resolution (i.e., information about the exact location of activity). In contrast, fMRI, which has poor temporal resolution, has much better spatial resolution. fMRI is typically based on assessing blood oxygenation level dependent (BOLD) signals, which can help trace activation/deactivation in different anatomical regions of the brain and ensure that the observed changes in electric signals in speech and music tasks come from an overlapping part of the neural networks.

If we focus on the electrophysiology related to audition, the effect of music training manifests itself in several obligatory auditory components (byproducts of necessary signal transmission and processing in the auditory system), from the frequency following response (FFR) in the auditory brainstem (Musacchia, Sams, Skoe, & Kraus, 2007; Wong, Skoe, Russo, Dees, & Kraus, 2007), to the auditory evoked potential (AEP) complex P1-N1-P2-N2 (in this coding system P1/N1 denotes the first positive/negative peak in the voltage waveform, and so forth), which originates from primary auditory cortex (A1) and non-primary auditory cortex (NPAC) (Musacchia, Strait, & Kraus, 2008). In particular, a larger P2 amplitude, as observed in musicians, may indicate a training effect of binding temporal (related to the rhythm, or the envelope variation) and spectral (related to the pitch and the harmonic profile) features of sound into a coherent representation of melody, in addition to coding them separately (Marie, Magne, & Besson, 2011; Shahin et al., 2003; Shahin, Bishop, & Miller, 2009; Tremblay, Kraus, McGee, Ponton, & Otis, 2001), and this enhanced ability is potentially beneficial to speech perception as well. However, cognitive potentials like the mismatch negativity (MMN) are more likely to be specialized in either the music or speech domain and thus are less susceptible to inter-domain transfer with regard to auditory functionality (Tervaniemi et al., 2000; Tervaniemi & Huotilainen, 2003).

The majority of the aforementioned findings are drawn from idealized conditions that are isolated from real-life acoustical environments. To get a more complete picture of the benefit of music training to speech perception, we need to examine listeners’ performance under naturalistic circumstances, in other words, in tasks wherein noise exists as a masker or distractor. Compared to the abundance of evidence from behavioral studies, neurophysiological evidence regarding this topic remains insufficient, especially at the cortical level. At least for auditory brainstem responses (ABRs), however, musicians show less delayed peaks and better preserved amplitudes than non-musicians when listening to speech in noise (Parbery-Clark, Skoe, & Kraus, 2009).


1.3 Phonemic restoration

One way to evaluate the transfer of music training to speech perception in adverse listening situations is by measuring the ability to perceive or apprehend degraded speech in musicians and non-musicians. One such scenario is embodied in the “auditory continuity illusion” phenomenon (Miller, 1950), also known as “illusory filling-in” (Shahin et al., 2009), or “phonemic restoration” (Samuel, 1981; Warren, 1970) when the phenomenon is examined with speech stimuli. During the continuity illusion, listeners can perceive interrupted speech as continuous, provided that the gap is masked by another extraneous sound, such as white noise, a cough, or a pure tone (Warren, 1970).

In the auditory system, the continuity illusion can also be compared to other phenomena. One of them is the restoration of the missing fundamental during the perception of pitch, which owes to the redundancy of the tonotopic organization in the auditory cortex: one frequency dimension in the auditory periphery is mapped to a two-dimensional space, with the extra dimension organized in the order of harmonics, which enables the neuron encoding the fundamental to be activated as a byproduct of the activation of neurons corresponding to the harmonics (Bendor & Wang, 2005; Ehret, 1997). Another example is the restoration of a missing tone in a metric structure, which is mediated by gamma band neural oscillations (Snyder & Large, 2005). More details about this study will be mentioned in section 1.5.

In addition, auditory perceptual restoration with a noise masker has also been studied in animals, with the implication that the illusory percept may bear some adaptive functions that were conserved through evolution. Behavioral experiments typically compare animals’ responses to their conspecific vocal calls with or without a segment replaced by either silence or noise, in order to investigate whether noise can induce an illusory percept in these species. This paradigm is reflected in studies on treefrogs (Seeba, Schwartz, & Bee, 2010), European starlings (Braaten & Leary, 1999), and cotton-top tamarins (Miller, Dibble, & Hauser, 2001), although details may vary. Moreover, electrophysiological studies on cats (Sugita, 1997) and awake macaque monkeys (Petkov, O’Connor, & Sutter, 2007) have revealed that the response strength of neurons in primary auditory cortex (A1) can be restored when the missing segment in the acoustic stimulus is replaced by noise; in other words, A1 neuron responses follow the illusory percept.

Researchers have attempted to understand the neural mechanisms underlying the auditory continuity illusion with different techniques and found that the neural networks underlying this process vary with the available bottom-up and top-down cues. Petkov et al. (2007), using simple tones, found that the continuity illusion in the macaque is reflected by activity in primary auditory cortex (A1). Similarly, Riecke, Esposito, Bonte, & Formisano (2009), using tones, reported that illusory filling-in is specific to a region in the middle part of Heschl’s gyrus, analogous to A1 in the macaque. However, with speech stimuli the processing shifts to higher-level networks. Heinrich, Carlyon, Davis, & Johnsrude (2008) conducted a whole-brain fMRI study to investigate illusory vowels based on perceptual continuity, and reported the involvement of the middle temporal gyrus as a region fulfilling phonetic access and mediating illusory filling-in. Given the higher spectrotemporal complexity of speech signals, it is understandable that the continuity illusion for speech, as in phonemic restoration, should be accomplished by a more complex mechanism than that for tones or beats. Other studies, which used words, demonstrated that the neural mechanism of illusory filling-in involves areas beyond the auditory cortex, such as the angular gyrus, motor cortex and inferior frontal gyrus (Shahin et al., 2009).

To elucidate this process in which the missing spectrotemporal structure is restored, Shahin et al. (2009) proposed that the continuity illusion relies on at least two dissociable neural pathways (Carlyon, Micheyl, Deeks, & Moore, 2004; Lyzenga, Carlyon, & Moore, 2005; Repp, 1992; Samuel, 1981; Sivonen, Maess, Lattner, & Friederici, 2006). The first pathway provides the domain-general function of “sensory repair”, a largely unconscious process in which missing or degraded bottom-up information is mended or ameliorated based on spectrotemporal features. This happens in brain regions including the left inferior frontal gyrus (Burton, Small, & Blumstein, 2000; Zaehle, Geiser, Alter, Jancke, & Meyer, 2008; Zatorre, 2001) as part of lower-level spectrotemporal processing itself.

In the second pathway, the repaired bottom-up information is compared with prior top-down information so that the listener achieves a subjective feeling of continuity or interruption when the comparison turns out to be a match or mismatch, respectively. The prior expectancy here can be fulfilled simultaneously by two types of pre-existing knowledge: (1) by template matching to an internal representation registered in memory, such as a learned word; (2) as a result of Gestalt processing (Wertheimer & King, 2004), which is based on the rule that acoustic signals of vocalization tend to be spectrotemporally smooth (continuous everywhere on the time-frequency plane, without abrupt change in rhythm or formant trajectory). Whereas these two mechanisms can function collaboratively, the latter alone would suffice to ensure subjective continuity in the case of pseudo-words, artificial words devoid of an existing template in lexical memory. For instance, a word such as “cigbet” has no semantic value in our lexicon, although it conforms to general phonotactic rules in English.

From a neurophysiological view, for the illusory segment to be seamlessly interpolated into the gap, the sensitivity of A1 to the onset and offset of the missing part (Riecke et al., 2009; Shahin et al., 2009) needs to be suppressed. Riecke et al. (2009) found that the dynamics of theta band power (4-8 Hz) may index the existence of the continuity illusion, to the extent that it is attenuated following interruption boundaries when the illusion succeeds compared to when it fails. This reduction of auditory cortex theta band activity is consistent with its putative role of preserving sound representations, such as the speech envelope, near the boundaries of abrupt changes. As a result, a percept similar to that corresponding to an intact stimulus can be generated (Petkov et al., 2007; Shahin et al., 2009). The audiovisual integration study by Shahin, Kerlin, Bhat, & Miller (2012) confirmed the reduction in theta phase-consistency as well as N1 and P2 amplitudes following the interruption boundaries when the continuity illusion is perceived.


1.4 Restoration of a musical tone

As mentioned before, when it comes to functional transfer from music to speech, not only do musicians outperform non-musicians in a series of behavioral tests, but some neural markers indexing the possible effect of music training have been identified as well (Musacchia et al., 2007; Parbery-Clark, Skoe, Lam, et al., 2009). Therefore, it is quite natural to think that music training may have a positive impact on the continuity illusion. However, a conjecture of the opposite outcome is also arguable: musicians may perform less well in tasks requiring the continuity illusion due to enhanced abilities in concurrent sound segregation (Zendel & Alain, 2009) plus stronger sensitivity to interruption boundaries (Musacchia et al., 2008). The former can be understood as the musical counterpart of perceiving multi-talker speech, that is, being able to juggle multiple auditory objects in an acoustical scene, and the latter is part of the aural skills expected to follow music training, namely being able to detect missing segments, most commonly a tone of short duration.

Supportive evidence is certainly available, and some will be mentioned in section 1.5. But even before expecting the functional transfer to take effect in speech perception, one needs to first establish whether musicians indeed exhibit a greater ability to hear continuity through degraded music. To make the situation more analogous to phonemic restoration, it is preferable to replace part of a short tune, such as a single note, with white noise, and to expect a similar phenomenon called “tonal restoration”. Still, it is important to note that spectral and temporal information are intertwined as music travels through the auditory system, and therefore the research focus should be on overall continuity rather than on other perceptual features of the replaced tone, such as pitch and timbre.

To be perceived as continuous, the signal of degraded music may undergo a process similar to that of phonemic restoration, i.e., a combination of more than one pathway. First, the music input is repaired during lower-level spectrotemporal processing, which entails no subjective awareness. Then, top-down information plays the important role of matching the bottom-up information with prior knowledge. Here the top-down information includes, but is not limited to, two parts: (1) general rules of a music culture, acquired mostly through statistical learning, which provide the basis of Gestalt processing; (2) familiarity with a particular piece of music, such as a motive in a famous symphony, which generates veridical expectation (Huron, 2006).

An important caveat for this analogy is that music, in the general sense, does not guarantee spectrotemporal smoothness in the way natural speech does (Bregman, 1990; Shahin et al., 2009). During speech production, it takes time for the nerves and muscles to gradually change the aerodynamic characteristics of one’s unique vocal system. In addition, co-articulation, which refers to the phenomenon that the place of articulation of one phoneme is assimilated to that of its adjacent phoneme, has developed so as to facilitate a seamless transition between conceptually separate phonemes. For example, the /n/ in the word “ten” is an alveolar consonant, whereas in the word “tenth” the /n/ is pronounced dentally. In contrast, when one plays certain musical instruments like the piano, whose keys correspond to a set of discrete frequencies on the spectrum, there is no “continuous glissando” or “portamento”, i.e., continuous pitch sliding between two temporally adjacent tones as can be realized on a violin or a trombone (Sadie & Tyrrell, 2001), and therefore discontinuity on the spectrogram may occur. This difference in the availability of spectrotemporal smoothness indicates that the Gestalt rules for speech may not be directly transplantable. Moreover, the coding of music and speech in memory, and thereby the forms of top-down feedback in speech and music, are reckoned to be barely comparable (McMurray, Dennhardt, & Struck-Marcell, 2008). As a result, in order to pinpoint the mechanism underlying this potential functional transfer, it seems reasonable to zoom in on the lower-level neurophysiological processing, where the domains of speech and music perception plausibly interact.

1.5 More about neural perspective on shared auditory continuity

The continuity illusion through a missing part of the signal can be understood as the expectation of the event happening at that particular moment, which makes use of information that comes before, and possibly after, the missing segment. Expectation of music segments involves information related to the rhythm (embodied in the variation of the signal envelope), the melody, the chordal structure, etc., while expectation of speech requires information about the rhythmic pattern and the formant trajectory. Therefore, the overlapping part most probably lies in rhythm processing.

It was mentioned in section 1.2 that the P1-N1-P2-N2 AEP complex, especially the P2 component, is representative of the effect of music training (Shahin et al., 2003). Marie et al. (2011) demonstrated in their study that an enhanced P2 is associated with musicians’ advantage in encoding metric structure in speech, which is indicative of a possible functional transfer of rhythm processing. The researchers varied the syllabic length of the final word of a sentence and asked musicians as well as non-musicians to determine whether this word was well pronounced or not. It turned out that musicians had larger P2 amplitudes for metrically incongruent than congruent words, which may be explained by more neural resources being recruited to tune in to the temporal pattern of the ongoing signal and integrate this pattern into one coherent percept (Neuhaus & Knösche, 2008).

ABRs and AEPs, as defined in the time domain, provide a decent amount of information about the functionality of the auditory system, but they do not capture all relevant aspects of auditory perception. ABRs and AEPs reflect phase-locked neural activity, such that activity that is not phase-locked to sound onsets from trial to trial (also known as induced activity) is filtered out during averaging of trials. Oscillatory activity, on the other hand, captures activity that is temporally jittered from trial to trial (Shahin, Trainor, Roberts, Backer, & Miller, 2010). Thus, other neural markers extracted from objective measures, like oscillatory activity, should provide extra information, part of which may be related to the auditory continuity shared by speech and music. In EEG signals, theta band oscillation (4-8 Hz) is often associated with processing temporal features of auditory input. More specifically, theta band phase patterns can track the temporal envelope of a spoken sentence (Ahissar et al., 2001; Luo & Poeppel, 2007; Shahin, Roberts, Chau, Trainor, & Miller, 2008). As mentioned earlier, it is also evident that reduced theta band activity is associated with reduced sensitivity to the onset and offset of noise in speech (Riecke et al., 2009). Theta band activity has appeared in music research as well, although not often on the auditory perceptual level, e.g., an increase of frontal midline theta power when listening to pleasant as opposed to unpleasant music (Sammler, Grigutsch, Fritz, & Koelsch, 2007). It is highly important to remain cautious about the localization of ERP components or oscillatory activity when interpreting EEG results.

In contrast, the auditory processing of spectral attributes is to some extent reflected in induced gamma band oscillations (30-100 Hz; the exact range varies across studies), in that enhanced phase-locking and spectral power in the gamma band is indicative of higher spectral complexity of the acoustic signal (Shahin et al., 2008). On the other hand, induced gamma band oscillations are also closely related to rhythmic expectation in music, which intuitively should fall under the function of temporal processing. Snyder & Large (2005) studied the relationship between short-latency gamma-band (20-60 Hz) activity (GBA) and the metric structure of tone sequences. They found that while evoked (phase-locked) GBA tends to diminish in the absence of an expected tone at a certain beat, induced (non-phase-locked) GBA persists even when the tone is omitted from the external stimulus. The latter seemed to mirror the preserved mental representation of the missing tone in the rhythmic context. This form of temporal filling-in possibly relates to enhanced Gestalt integration, template-matching with long-term memory, and expectation, mechanisms that would also be shared by improved phonemic restoration following music training (Besson & Faïta, 1995; Eulitz & Hannemann, 2010; Fujioka, Trainor, Ross, Kakigi, & Pantev, 2005; Lenz et al., 2008; Shahin et al., 2008). However, as mentioned before, the process of top-down feedback is less likely to transfer from one domain to the other than the lower-level induction of rhythmic expectancy, which could be largely pre-attentive or unconscious. Given the incomplete knowledge about gamma oscillations per se, there is not yet sufficient evidence to link rhythmic expectancy directly to phonemic restoration via induced GBA, but it is at least reasonable to posit that preservation of rhythmic structure contributes to the restoration of a musical tone in terms of perceptual continuity, as discussed in the previous section.

Another band of oscillatory activity that has not yet been discussed is the alpha band (8-14 Hz). A recent view holds that alpha activity serves to inhibit areas of the cortex not currently in use, or alternatively contributes to network coordination and communication (Palva & Palva, 2007). In the domains of music and speech perception, the role of alpha band oscillation can be exhibited in different forms. Müller et al. (2013) conducted a study on illusory perception of music interrupted by noise, using magnetoencephalography (MEG) and electrocorticography (ECoG) data from epileptic patients. Their results first confirmed that listeners have a stronger experience of the music continuing through noise when the music context is more familiar to them. Moreover, time-frequency analysis showed that illusory perception is associated with decreased auditory alpha power, which indicates increased auditory cortex excitability. Also observable was phase synchrony between the right auditory cortex and the medial temporal lobe, which is involved in retrieval of music memory (Watanabe, Yagishita, & Kikyo, 2008). One noteworthy feature of their stimuli is that the embedded pink noise lasted as long as two seconds, which was compensated for by the abundance of top-down information induced by their music excerpts.

In the domain of speech, alpha activity plays an important role in indexing speech segmentation, which relies more on top-down information. In a study conducted by Shahin & Pitt (2012), bursts of fronto-central alpha activity following the onset of a physically continuous but perceptually ambiguous phoneme /s/ were found when the listener heard the two monosyllabic words (e.g., gas source) or non-words (e.g., nas sorf) as segmented. This can be explained by the viewpoint that enhanced alpha power reflects the disengagement of neural clusters that become irrelevant as top-down information accumulates to a point where a word boundary becomes plausible. With speech and music combined, alpha activity can be seen as at least one channel through which long-term memory can have an impact on perceptual continuity, and therefore it would be unwise to exclude this mechanism from a complete explanation of subjective continuity in the case of a missing phoneme or a missing tone. However, it is worthwhile to reemphasize that the higher-level mechanism of illusory filling-in is not necessarily transferable across domains.

Despite this list of relevant neural markers, which is broad but incomplete, there is no denying that the majority of the available knowledge regarding these neurophysiological markers is still correlational. Little is known about the systematic mapping or interaction between stereotypical AEPs and oscillatory activity. The same limitation applies to the mechanism underlying this concerto between different bands of activity, which, instead of playing independent roles, has been hypothesized to be coordinated under a hierarchical organization: lower-frequency phase activity may assume the function of modulating higher-frequency amplitudes so that neural excitability during stimulus processing can be finely controlled (Lakatos et al., 2005).

1.6 A glimpse of audiovisual integration influenced by music training

It is evident that visual cues such as lip movement can enhance speech perception for some people. For example, Kaiser, Kirk, Lachs, & Pisoni (2003) conducted a behavioral study comparing spoken word identification between a normal hearing (NH) group and a group of cochlear implant (CI) users. Results showed that (1) all subjects performed better in the AV task than in the auditory-only or visual-only task; (2) the NH group outperformed the CI group in the auditory-only condition; (3) in the AV condition, there was no significant difference between the NH and CI groups. Therefore, visual cues are particularly beneficial to people with hearing loss.

The possibility that music training may have some transferrable positive effect on speech perception leads us to the secondary question of how this potential influence extends to audiovisual tasks. Research has shown that musicians have better pitch encoding in the brainstem and in auditory cortex than non-musicians for AV speech (Musacchia et al., 2007, 2008). Not only were the frequency following response (FFR) and wave d of the ABR larger in musicians, but the FFR amplitude also increased with music training. In addition, the cortical P1-N1-P2-N2 complex appeared to be larger in musicians than in non-musicians for the consonant-vowel /da/ presented in an AV context. Visual cues that are congruent with the audio input may increase auditory sensitivity to the boundaries of speech sounds, more effectively in musicians than in non-musicians, because music training may involve multisensory integration (e.g., reading music scores, or attending to the conductor).

It would be interesting to investigate whether music training might benefit degraded speech perception in an AV context, and in particular, whether this effect is additive to, compensatory for, or even offset by the benefit of the visual cues themselves. It is possible that music activity forges neuroplastic changes at both the auditory level and the multisensory level, on the condition that the music training in question involves more than the enhancement of aural skills. This condition is important because musicians have different practice strategies: some only play by ear and never read music scores (Seppänen, Brattico, & Tervaniemi, 2007); some play by themselves and never coordinate with other musicians in a group or an ensemble. These musicians may not enjoy much advantage in audiovisual tasks.

1.7 Current study

The goal of the current study is to investigate the possible functional transfer from music perception to speech perception with regard to the perception of illusory continuity through interruption. This project begins with assessing this transfer behaviorally, a necessary first step to examining the transfer neurophysiologically, i.e., using functional neuroimaging techniques.


Since it has been shown that musicians have a better command of the unfolding rhythmic patterns of music (Besson & Faïta, 1995; Snyder & Large, 2005; Zanto, Snyder, & Large, 2006), we hypothesize that:

The general effect of long-term music training, as manifested in one’s enhanced ability to anticipate envelope variation and thus perceive continuity in degraded music, can transfer to one’s ability to perceive continuity in degraded speech as well. This extension from restoration of a musical tone to restoration of a phoneme is largely due to the overlapping neural networks associated with rhythm processing in the lower-level central auditory system (exhibited in the behavior of the P1-N1-P2-N2 AEP complex).

In this experiment, musicians and non-musicians were asked to listen to one session of short music tunes and one session of spoken English words with short noise bursts as interruptions, in order to measure their perceptual thresholds of continuity for the music and speech materials used here. Words, instead of sentences, were used to minimize the syntactic and semantic information that can be provided by context. Comparison between these two groups will reveal whether musicians have an advantage in phonemic restoration over non-musicians. At the same time, further analysis with the Ollen Music Sophistication Index (Ollen, 2006, 2009), a score quantifying people’s overall music ability and experience, will help answer to what extent the mechanism underlying auditory illusory continuity is shared between speech and music, as well as how strong the influence of music training is upon the behavioral tolerance of noise.


Chapter 2: Materials and Methods

2.1 Participants

Twenty-eight adult musicians and non-musicians volunteered to participate in this study. The musicians (n = 11) were practicing students from the School of Music at The Ohio State University and performers who play on local stages regularly but do not necessarily hold a degree in music. Non-musicians (n = 17) had little or no music training experience. One self-reported non-musician was found to be an experienced vocalist and was therefore re-categorized into the musician group. All participants gave written informed consent in accordance with procedures approved by the Institutional Review Board of The Ohio State University. All subjects passed the hearing screening (25 dB HL, 125-8000 Hz) and received remuneration at the end.

Five participants did not complete the task due to a technical issue. Another two failed to follow the instructions, according to debriefing and preprocessing results (see section 2.4.1 for details). Therefore, the final analysis covered data from 10 non-musicians (mean age 22.6, age range 19-35, 6 females and 4 males) and 11 musicians (mean age 26.8, age range 20-57, 2 females and 9 males).


2.2 Stimuli

The experiment included two main sessions, one for music tunes and the other for spoken words.

The “tunes” session contained 128 short music clips (called “tunes” for short), which were cut from several published instrumental works and converted to “.wav” format. They covered three distinct genres available in North America (Classical, Jazz and Latino), though not necessarily ones pursued by every listener on a daily basis. These three genres made up 50%, 35% and 15% of the whole set of tunes, respectively. Supplementary information regarding these selected works can be found in Appendix B.

Each tune lasted 1-2 seconds and was resampled to 48000 Hz (16 bits per sample) so as to match the word files. A tone (in most cases corresponding to one musical note) in the melodic stream, located between the first and last tones, was chosen as the target for white noise replacement. The beginning and ending latencies of each target tone relative to the start of the sound file were determined using Adobe Audition 2.0, and then the central time point of the noise was aligned with the central time point of this tone. Because the original audio signals contain no energy above 17000 Hz, the white noise used for this session was low-pass filtered to match the same cutoff frequency.
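For illustration, the sketch below shows one way such a stimulus could be constructed digitally: a low-pass-filtered white noise burst is centered on the midpoint of the target tone and written over the signal. This is only a minimal sketch of the manipulation described above, not the Adobe Audition/Matlab pipeline actually used; the function name, filter order, and index clipping are our own assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def replace_with_noise(signal, fs, t_start, t_end, proportion=1.0, cutoff=17000.0):
    """Replace part of `signal` with low-pass-filtered white noise.

    The noise burst is centered on the midpoint of the target segment
    (t_start to t_end, in seconds), and its duration equals `proportion`
    times the segment duration. Level equalization (+3 dB relative to the
    replaced segment) is applied separately; see section 2.2 below.
    """
    out = signal.copy()
    center = 0.5 * (t_start + t_end)
    half = 0.5 * proportion * (t_end - t_start)
    i0 = max(0, int(round((center - half) * fs)))
    i1 = min(len(signal), int(round((center + half) * fs)))

    noise = np.random.randn(i1 - i0)
    # Low-pass the noise so its bandwidth matches the recordings,
    # which contain no energy above ~17 kHz.
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    noise = lfilter(b, a, noise)

    out[i0:i1] = noise
    return out
```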

The criteria for selecting a limited number of tunes were as follows: (1) Tunes were predominantly homophonic, with a salient melodic voice (referring to a leading instrument rather than a human voice) embedded in certain chordal structures. (2) Each tune contained at least 5 distinct tones in a row. (3) The tempi were slow enough for the listener to catch the melody, and fast enough for the listener to achieve a sense of rhythm within a short period of time. (4) The target tone (without noise) was noticeable enough when the tune was heard for the first time. (5) Excessively loud background noise or percussion was avoided. Whereas these strict criteria might constrain the diversity of the stimuli, they helped ensure a context highly analogous to that of the “words” session: high intelligibility, moderate speed, low difficulty and no discomfort.

Figure 1 and Figure 2, which were generated in Matlab (MathWorks, Natick, MA), illustrate the modification of one Classical orchestral tune and one Jazz tune with saxophone as the leading voice. The upper panel shows the time-domain waveform as well as the spectrotemporal representation (i.e., spectrogram) of the music signal. The time-domain waveform was normalized to the range [-1, 1], since the presentation volume might differ across participants. The spectrogram was plotted using the Short-Time Fourier Transform (STFT) with a 512-point sliding window and 50% overlap between segments. The colormap of the spectrogram is in dB scale, and warm colors indicate high energy. The middle panel shows the full segment of the tone replaced by white noise, which is characterized by its uniform distribution across the power spectrum. This “physically interrupted” condition in the middle panel applied to all the stimuli used in the experiment. In contrast, the “physically continuous” or “superimposed” condition, as displayed in the lower panel, was only used to explain what a “perceptually continuous” tune should sound like during the instructions.
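As a rough sketch of how such a panel could be reproduced (the originals were made in Matlab; the Python function below is only an illustration using the same STFT parameters), one might compute and plot the dB-scaled spectrogram as follows.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

def plot_db_spectrogram(x, fs, nperseg=512):
    # 512-point window with 50% overlap, matching the parameters
    # reported for Figures 1-3; power is converted to dB.
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.colorbar(label="Power (dB)")
    plt.show()
```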


Figure 1. Illustration of a tone in Classical music replaced by white noise. The upper panel contains the normalized waveform of the original tune (left), as well as the corresponding spectrogram in which the warm color indicates high energy (right). The middle panel depicts the only experimental condition where the full segment of the target tone is replaced by white noise. The lower panel depicts the condition where noise is superimposed upon the target tone, but this condition was not used in the stimuli.


Figure 2. Illustration of a tone in Jazz music replaced by white noise. The upper panel contains the normalized waveform of the original tune (left), as well as the corresponding spectrogram (right). The middle panel depicts the only experimental condition where the full segment of the target tone is replaced by white noise. The lower panel depicts the condition where noise is superimposed upon the target tone.

While barely visible in the original waveform, the spectral features of different instruments (sound quality and its change over time) and the temporal features (rhythmic patterns) of different genres can easily be seen in the spectrograms. In these two particular examples (Figure 1 and Figure 2), the Classical tune exhibits more isochronous beats than the Jazz tune, which is consistent with the common impression of these genres.

The “words” session contained 231 English words compiled from the University of Western Australia MRC Psycholinguistic Database, with familiarity ratings of 300-700. All words were tri-syllabic nouns and adjectives with at least one fricative/affricate between the initial and final phonemes. Fricatives and affricates were used frequently here because their acoustic features are closer to noise than those of other phonemes and thus can induce more robust phonemic restoration (Samuel, 1981). The words were spoken by a female vocalist with a fundamental frequency of 203 Hz, and recorded through a Shure KSM studio microphone at a sampling rate of 48000 Hz.

One fricative/affricate (or occasionally another consonant) located between the initial and final phonemes of each word was chosen and replaced by white noise in the same way as for the music stimuli. Based on our tally, of all 231 replaced phonemes, 65% belonged to the group [ʃ], [ʧ], [ʤ] or [ʒ]; 32% were [s] or [z]; and the rest were [t], [d] or [f]. All words were identifiable and unambiguous even after the noise was introduced, thus minimizing differential semantic effects.

Figure 3 shows the modification of the word “efficient” by juxtaposing the time-domain waveform with the spectrogram. Among the three conditions displayed here (original, physically interrupted, physically continuous), only the middle panel represents the actual experimental stimuli, where the full segment of [ʃ] is replaced by noise. As can be seen, compared to the tunes in Figure 1 and Figure 2, the spectrotemporal representation of a word shows much more pronounced changes of timbre over time.

Figure 3. Illustration of phoneme /ʃ/ in word “efficient” replaced by white noise. The upper panel contains the normalized waveform of the original word (left), as well as the corresponding spectrogram (right). The middle panel depicts the only experimental condition where the full segment of /ʃ/ is replaced by white noise. The lower panel depicts the condition where noise is superimposed upon /ʃ/.


In order to minimize the potential impact of sound intensity level on the results, we conducted A-weighted RMS normalization on the sets of original words and tunes, respectively, using Adobe Audition 2.0. When the targeted phoneme/tone segment of a stimulus was replaced by noise, the sound intensity level of the white noise was set equal to that of the replaced segment plus 3 dB.
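A minimal sketch of the level-matching step is given below. It uses plain (unweighted) RMS as a simplification; the actual normalization was A-weighted and performed in Adobe Audition, so the function and its name are illustrative only.

```python
import numpy as np

def match_noise_level(noise, replaced_segment, boost_db=3.0):
    """Scale `noise` so its RMS equals that of the replaced segment
    plus `boost_db` dB (a +3 dB level boost corresponds to multiplying
    the amplitude by 10**(3/20), i.e. about 1.41)."""
    seg_rms = np.sqrt(np.mean(replaced_segment ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-12
    gain = (seg_rms / noise_rms) * 10 ** (boost_db / 20.0)
    return noise * gain
```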

2.3 Procedure

Each participant was required to fill out the 10-item “Ollen Music Sophistication Index” questionnaire (Ollen, 2006, 2009). The result is a score between 0 and 1000, with a larger value indicating higher musicality. Figure 4 shows the OMSI scores of our participants. As can be seen, non-musicians in our study tend to have scores below 200, while musicians have scores that spread over a wider range (students in a music major do not necessarily score higher). The validity of this questionnaire for our study will be discussed later, and the whole set of questions can be found in Appendix A.


Figure 4. Ollen Music Sophistication Index scores of 10 non-musicians and 11 musicians

After entering the sound-attenuated booth, the participant received and passed the audiometric screening, and then listened to the experimental instructions while sitting in front of a monitor. At this time, Figure 3 was shown to the participant in order to facilitate his/her understanding of the task. The audio input was presented to the participant through Etymotic ER-4B insert earphones (Etymotic Research, Elk Grove Village, IL), and the volume was adjusted to the individual’s comfortable level, which stayed constant throughout the experiment. Presentation of stimuli and logging of behavioral responses were done in Presentation (Neurobehavioral Systems, Berkeley, CA) interfaced with Matlab.

The main task was a two-alternative forced choice: the listener was required to judge whether each noise-interrupted stimulus sounded “interrupted” or “continuous” by pressing “1” or “2” on the keyboard with the left hand. Each participant completed a practice session (Session 0) before the formal sessions so that he/she was familiarized with the task. After practice, the participant reported a self-evaluation of performance to the experimenter and was asked to repeat the practice if proficiency in the judgment had not yet been reached. To control for order effects, in both the musician and non-musician groups, half of the participants began with the “tunes” session (Session 1) and the other half began with the “words” session (Session 2).

The practice session contained 10 tunes and 10 words randomly selected from the whole sets of stimuli. Here “proportion” is defined as the duration of the white noise replacement divided by the duration of the original target, which ranges from 25% to 325% in the system (a proportion greater than 100% means the noise replacement intrudes into adjacent tones or phonemes). Using the proportion instead of the absolute duration allowed us to evaluate the illusory continuity based on the discrete elements constituting the tunes and words. Measurement based on the absolute duration could be subject to potential biases due to the speeds of the recorded music and speech, as well as the various lengths of the selected tones and phonemes that were replaced by noise. In the practice session, the proportion stayed at 100%.

In the “tunes” and “words” sessions, the stimuli were presented in a pseudorandom order generated by a genetic algorithm (Wager & Nichols, 2003). This order remained the same for all participants. The proportion of noise relative to the target (as defined above) started from 100% and evolved over time: depending on whether the subject heard the previous stimulus as “interrupted” or “continuous”, the noise for the next trial would shrink or expand by 15% of the initial length. As the adaptive procedure went on, the duration of the white noise would rove around each individual’s threshold of continuity. Adequate counterbalancing would eventually be reached between the two perceptual categories: “continuous”, which meant that the participant experienced the illusion, and “interrupted”, i.e., failed to experience the illusion. Although it is possible that some measurement bias would occur toward the starting level of the noise proportion, this bias is considered fair to each participant and would not interfere with the interpretation of the results.
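The adaptive rule can be summarized by the sketch below, a one-up/one-down update of the noise proportion. The handling of the 25%-325% range limits is our assumption (the 25% floor is mentioned in section 2.4.1); the function is illustrative rather than the actual Presentation/Matlab implementation.

```python
def next_proportion(current, response, step=0.15, floor=0.25, ceiling=3.25):
    """One-up/one-down update of the noise proportion (noise duration
    divided by target duration). Hearing the stimulus as continuous
    lengthens the noise on the next trial; hearing it as interrupted
    shortens it, so the track hovers around the 50% continuity point."""
    if response == "continuous":
        current += step
    else:
        current -= step
    return min(max(current, floor), ceiling)
```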

The experimenter read the following instructions to the participant before the “tunes” or “words” session:

Keep looking at the cross in the middle of the black screen. Just pay attention to whether the music (word) itself sounds continuous or interrupted into two separate parts due to the noise. However, even if you notice the white noise, you may still hear the piece as a continuous melody (utterance). In other words, attend to the music (word) at ease instead of the noise, just like when you are listening to some radio music (news) channel with some “buzzing” interference. For the tunes, we will not ask you to recall the melody (For the words, you are not required to think about their meaning). Treat each stimulus based on your true feeling of continuity.

In addition, at the end of the “tunes” session, the participant was asked to rate his/her overall familiarity with each of the genres used in this study (Classical, Jazz and Latino) on a 7-point Likert scale (integers 1-7). This was further divided into two questions: (1) How familiar are you with this genre in general in your daily life? (2) How familiar are you with the particular music works in this genre that appeared in our stimuli? The participant was told to give an average estimate regarding all the tunes in a genre and to report any specific works they recognized. The music samples had been expected to be unfamiliar; any recognized samples should therefore be replaced with other unfamiliar works in future studies.

2.4 Data Analysis

2.4.1 Preprocessing

The noise-interruption duration was reported as percentage of the replaced phoneme/tone in a log file along with the binary response with regard to continuity (continuous or interrupted). Trials which had double responses, indicating that the participant probably regretted the decision, or trials on which the response occurred before the onset of the word were not included in the final analysis. In the experiment program there was no feedback or warning of error provided during an ongoing session, so that the participant’s 30 natural response through the adaptive procedure would not be disrupted. Very rarely were trials actually discarded (less than two per session on average), partially owing to the instructions and practice session. On average the percentage was about 153% for the phonemes and 137% for the tones, which indicates that the noise replaced adjacent phonemes/tones as well.

Also, data from two participants were not included in the final analysis. The first person’s noise proportion consistently reached the floor level (25%) after some trials, which might indicate that he/she did not understand the instructions or was not able to focus on the task due to fatigue. The other person reported having used an idiosyncratic strategy when making the decision: whether the noise sounded strong or soft, instead of whether the stimulus sounded interrupted or continuous.

2.4.2 Calculation

To estimate the perceptual threshold for one “tunes” session or one “words” session, the noise proportions were averaged across all trials except those rejected during the preprocessing (see section 2.4.1). Given that the number of “continuous” trials is generally bigger than that of “interrupted” trials due to the aforementioned measurement bias (the difference is usually less than 10% of the total number of trials in that session), this average proportion may be a little lower than the expected value of the perceptual threshold of continuity. Still, this bias is shared by all the participants and thus will not interfere with interpretation of the results.
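As a simple illustration of this calculation (the variable names are ours, not from the original analysis scripts), the per-session threshold estimate is just the mean noise proportion over the accepted trials:

```python
import numpy as np

def estimate_threshold(proportions, rejected):
    """Average the per-trial noise proportions over accepted trials.

    `proportions` holds the noise proportion logged on each trial of a
    session; `rejected` is a boolean array marking trials excluded during
    preprocessing (double or premature responses)."""
    proportions = np.asarray(proportions, dtype=float)
    keep = ~np.asarray(rejected, dtype=bool)
    return proportions[keep].mean()
```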


Chapter 3: Results

If the hypothesis of functional transfer of music training to speech perception in noise is true, then the following prediction can be deduced:

Musicians would have higher perceptual thresholds of continuity than non-musicians in both the “tunes” and “words” sessions. In other words, musicians would show more tolerance for noise interruption in both music and speech.

A summary of the experimental results is presented in Figure 5. As can be seen, in the “words” session the musicians seemed to outperform the non-musicians, which is in line with the prediction, but in the “tunes” session there was little difference between the two groups, which seems counterintuitive.

Subsequently, we used an ANOVA to examine whether there is strong evidence supporting the observed pattern. We also conducted a correlation analysis and a supplementary analysis (ANCOVA) in an attempt to further corroborate the claims made here. The raw data used in the statistical analyses are available in Appendix C.


Figure 5. Bar plot of mean perceptual threshold (group × session) with error bars corresponding to 95% confidence intervals. The perceptual threshold for the “tunes” or “words” session is reported as the average proportion of the duration of the noise relative to the duration of the target tone or phoneme. Each bar stands for the mean threshold across participants within one group for one session.

3.1 ANOVA

A mixed-design factorial Analysis of Variance (ANOVA) was conducted in Statistica, with (a) group (non-musician/musician) as the between-subject factor, (b) session (tunes/words) as the within-subject factor, and (c) threshold as the dependent variable.
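The original analysis was run in Statistica. As a rough illustration of the same design, an equivalent mixed-design ANOVA could be set up in Python with the pingouin package; the data frame layout, column names, and file name below are assumptions for the sketch, not part of the original analysis.

```python
import pandas as pd
import pingouin as pg

# One row per participant per session, with columns 'subject',
# 'group' (musician / non-musician), 'session' (tunes / words),
# and 'threshold' (hypothetical file name and layout).
df = pd.read_csv("thresholds.csv")

# Mixed-design ANOVA: 'group' between subjects, 'session' within subjects.
aov = pg.mixed_anova(data=df, dv="threshold", within="session",
                     subject="subject", between="group")
print(aov)
```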

The ANOVA failed to detect a significant main effect of group (F(1,19) = 1.225, p = 0.282) or a significant main effect of session (F(1,19) = 3.786, p = 0.067), but the latter is less informative because the stimuli used in the two sessions were not calibrated in the same way.


However, a significant interaction between group and session exists (F(1,19) = 6.549, p = 0.019), which manifests itself in the lack of a group difference for the music session in contrast with the salient group difference for the speech session.

More importantly, post-hoc analysis using the Least Significant Difference (LSD) test revealed a significant difference between the music and speech tasks among musicians (p = 0.004, df = 26.026), as well as a significant difference between non-musicians and musicians for the speech session (p = 0.05, df = 26.026), which supports the observations from Figure 5. This result agrees with our prediction that musicians have more tolerance for noise interruptions during speech perception, but it does not support the analogous prediction for restoration of a musical tone through noise. The significance and implications of these results for understanding the transfer of music training to speech perception are discussed in the next chapter.
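For readers who wish to reproduce this analysis outside Statistica, the same mixed-design ANOVA and pairwise comparisons can be sketched in Python with the pingouin package (assumed to be installed). The file name and variable names are placeholders, and the uncorrected pairwise t-tests below merely stand in for the LSD post-hoc tests reported above.

```python
import pandas as pd
import pingouin as pg

# Long-format data: one row per participant x session, with columns
# subject, group (musician/non-musician), session (tunes/words) and
# threshold (noise proportion in %).
df = pd.read_csv("thresholds_long.csv")

# Mixed-design ANOVA: group as the between-subject factor,
# session as the within-subject factor.
aov = pg.mixed_anova(data=df, dv="threshold", within="session",
                     subject="subject", between="group")
print(aov)

# Pairwise comparisons as a stand-in for the LSD post-hoc tests.
posthoc = pg.pairwise_tests(data=df, dv="threshold", within="session",
                            subject="subject", between="group")
print(posthoc)
```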

3.2 Correlation analysis

Subsequent correlational analysis includes the Ollen Music Sophistication Index (OMSI) score obtained for each participant.

In the ANOVA, we treated the abstract variable of musicality dichotomously, labeling participants as musicians or non-musicians based on their self-reported music experience as well as current music activity. Here, Pearson correlation coefficients were calculated in order to gain more information about the relations among these variables (OMSI, threshold_tunes and threshold_words) across individuals. The results are as follows:

r (OMSI × threshold_tunes) = -0.1625, df = 19, p = 0.2155
r (OMSI × threshold_words) = 0.2820, df = 19, p = 0.4815
r (threshold_tunes × threshold_words) = 0.6156, df = 19, p = 0.0030

Figure 6 shows the correlational trend between OMSI and threshold_tunes; Figure 7 between OMSI and threshold_words; and Figure 8 between the two thresholds. Graphs were plotted in R with the mosaic package (Pruim, Kaplan, & Horton, 2014). As can be seen, there is a strong correlation between noise tolerance for tunes and that for words, which supports our hypothesis of functional transfer and is consistent with overlapping neural mechanisms associated with speech and music perception. What also draws our attention is the weak negative correlation between OMSI and the threshold for tunes. However, the reliability of this trend is limited by the uneven distribution of OMSI scores in our dataset. Finally, the modest positive correlation between OMSI and the threshold for words is in line with the post-hoc finding that the musicians outperformed the non-musicians in the “words” session.
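The correlation coefficients reported above can be recomputed directly from the data in Appendix C. A minimal sketch with SciPy follows; the file name and column labels are placeholders.

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per participant: OMSI, threshold_tunes, threshold_words.
df = pd.read_csv("appendix_c.csv")

for x, y in [("OMSI", "threshold_tunes"),
             ("OMSI", "threshold_words"),
             ("threshold_tunes", "threshold_words")]:
    r, p = pearsonr(df[x], df[y])
    print(f"r({x} x {y}) = {r:.4f}, p = {p:.4f}")
```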


Figure 6. Scatterplot of threshold for tunes against Ollen Music Sophistication Index. Horizontal axis: OMSI score, an integer between 0 and 1000. Vertical axis: perceptual threshold for the “tunes” session, reported as the average proportion of the duration of the noise compared to the duration of the target tone. A percentage larger than 100% indicates that the continuity illusion can intrude into adjacent tones as well. Black line: linear regression curve.


Figure 7. Scatterplot of threshold for words against Ollen Music Sophistication Index. Horizontal axis: OMSI score, an integer between 0 and 1000. Vertical axis: perceptual threshold for the “words” session, reported as the average proportion of the duration of the noise compared to the duration of the target phoneme. A percentage larger than 100% indicates that the continuity illusion can intrude into adjacent phonemes as well. Black line: linear regression line.


Figure 8. Scatterplot of threshold for words against threshold for tunes. Perceptual threshold for the “tunes” or “words” session is reported as the average proportion of the duration of the noise compared to the duration of the target tone or phoneme. A percentage larger than 100% indicates that the continuity illusion can intrude into adjacent tones or phonemes as well. Solid line: linear regression line; dashed curves: confidence interval; dotted curves: prediction interval.


3.3 Supplementary analysis

From the results of the ANOVA, we did not see any advantage of musicians over non-musicians in their ability to tolerate noise interruption in music. Moreover, we saw a slightly negative correlation between OMSI and the perceptual threshold for music. These counterintuitive results may be related to a confounding variable that we did not address, namely familiarity with our music stimuli, which provides top-down information for illusory filling-in. We used the self-reported ratings in response to the second question (veridical familiarity with particular music works), as mentioned in section 2.3, to estimate the overall familiarity (F) as follows:

F = 0.5 × F_Classical + 0.35 × F_Jazz + 0.15 × F_Latino,

where the three coefficients are the proportions of trials contributed by each genre and add up to 100%. We omitted the first question (everyday familiarity with a genre) from this analysis because the schematic familiarity based on it turned out to be uncorrelated with the threshold for tunes (r = -0.0306).
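As a worked example with hypothetical ratings: a participant who rated veridical familiarity as 2 for Classical, 3 for Jazz and 1 for Latino would obtain F = 0.5 × 2 + 0.35 × 3 + 0.15 × 1 = 2.2. In code form:

```python
def overall_familiarity(f_classical, f_jazz, f_latino):
    """Weighted familiarity; weights equal each genre's share of trials."""
    return 0.5 * f_classical + 0.35 * f_jazz + 0.15 * f_latino

print(overall_familiarity(2, 3, 1))  # -> 2.2 (hypothetical ratings)
```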

The average familiarity was 1.6136 (SD = 0.8860) for the non-musician group and 2.1100 (SD = 1.6700) for the musician group, which shows that the selected stimuli were generally unfamiliar (on a rating scale from 1 to 7) but still somewhat more familiar to the musicians.

In order to control for the effect of familiarity and see whether musicality has an impact on the continuity illusion in the “tunes” session, an Analysis of Covariance (ANCOVA) was conducted in Statistica with (a) threshold for tunes as the dependent variable, (b) group (musician/non-musician) as the fixed factor, and (c) familiarity as the covariate. The result shows no significant effect of familiarity (F(1,18) = 0.479, p = 0.498), which may indicate that familiarity with our music stimuli was minimal. Nor was there an effect of group (F(1,18) = 0.011, p = 0.919), which confirms that the musicians did not outperform the non-musicians in hearing continuity through noise-interrupted music tunes.
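The same ANCOVA can be expressed outside Statistica, for example with statsmodels in Python. The sketch below assumes a simple per-participant data file; the file and column names are illustrative.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per participant: group (musician/non-musician),
# threshold_tunes, familiarity.
df = pd.read_csv("tunes_ancova.csv")

# ANCOVA: threshold for tunes ~ group, with familiarity as covariate.
model = smf.ols("threshold_tunes ~ C(group) + familiarity", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```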


Chapter 4: Discussion

This study is based on the hypothesis that the benefit of music training can transfer to speech perception in adverse acoustical situations. To gain evidence in favor of this hypothesis, we asked both musicians and non-musicians to listen to noise-interrupted speech and music, and measured their perceptual thresholds of continuity based on their binary choices of “interrupted” versus “continuous” over an adaptive procedure. The prediction deduced from the hypothesis was that musicians should show more tolerance for noise interruptions in music and that this quality should transfer to noisy speech as well.

The results show that the musicians in this study did not hear the continuity illusion through degraded music tunes more readily than the non-musicians, which disagrees with the first part of the prediction. However, the musicians did show enhanced continuity perception compared to the non-musicians for interrupted speech, which supports the second part of the prediction. Meanwhile, the perceptual threshold for music was found to be highly correlated with that for speech, although both were only weakly correlated with the Ollen Music Sophistication Index (OMSI).


In this chapter, the implications of our results, alternative explanations, and the strengths and weaknesses of our methodology will be discussed.

4.1 Music training influence on illusory perception in the music domain

The absence of a significant difference between musicians and non-musicians in the “tunes” session was counterintuitive, because based on the hypothesis of functional transfer it seems natural to expect musicians to show better noise tolerance in the domain of music before expecting any similar advantage in the domain of speech. The result of the ANCOVA further confirmed this absence of difference, given that veridical familiarity with our music stimuli was strictly limited even for the majority of our musician participants (see section 3.3 for details).

4.1.1 Controlling for familiarity during stimulus selection

Our assumption that musicians would outperform non-musicians in the “tunes” session was based on the knowledge that musicians have a better command of unfolding rhythmic patterns in music (Besson & Faïta, 1995; Snyder & Large, 2005; Zanto et al., 2006). As mentioned in section 1.5, filling in a missing music segment requires information related to the rhythm, melody, chordal structure, etc., while expectation of a missing speech segment requires information about the rhythmic pattern and the formant trajectory. That is why the overlapping component, rhythm processing, might play a major role in the possible functional transfer. The “rhythm” here is exhibited in the variation of the envelope and does not require the background percussive beats common in Pop music. The rhythm can sometimes be implicit, that is, perceivable through its correlation with meter, melody and chordal structure. Even if the stimuli used in this study had not been real-life music but MIDI music without any dynamic change over time, the listener might still have been able to anticipate the rhythm. Therefore, it would be interesting to test whether the results of this study can be replicated with MIDI music.

Because our hypothesis highlights lower-level sensory repair, and because the top-down information (e.g., emotional meaningfulness related to episodic memory) in the speech session was assumed to be equally available to both groups (see section 4.2.2), it was necessary to control the top-down information available in the music session. Based on the finding that when the music context is familiar, listeners tend to have a stronger experience of the music through noise (Müller et al., 2013), we selected music tunes that were difficult to recognize so as to minimize veridical expectation, which relates to episodic memory and thus may provide a substantial amount of top-down information (Huron, 2006). As for the residual veridical familiarity (the type of familiarity leading to veridical expectation), we used self-reported ratings to regress out this effect.

The three genres used in this study (Classical, Jazz and Latino) represent, to some extent, a diversity of musical cultures in North America, but they do not cover the entire tapestry of musical life: easily accessible styles like Rock and Pop were not included, so as to avoid excessive schematic familiarity (familiarity with general knowledge of how events typically unfold; Huron, 2006), especially for some musician participants. However, our reasoning was that Jazz and Latino clips, whose rhythmic patterns might sound less predictable, do have their own regularities, which can be acquired at least through statistical learning. Presumably, musicians have gained more schematic knowledge, such as musical scales and rhythms, through statistical learning or formal instruction, and should therefore be better able to process the regularities of Jazz and Latino clips. Hence, we also included schematic familiarity in the self-reported evaluation.

We should note that, as the task unfolded, a third type of expectation called dynamic expectation (the expectation shaped by immediate experience and linked to short-term memory, or, put differently, due to the familiarity gained through short-term exposure) might also have been generated (Huron, 2006), especially because each selected music work contributed multiple clips to the stimuli and these clips might have intrinsic similarities. However, this effect was likely trivial because of the short duration and randomized order of our stimuli, and therefore we did not take any specific step to control it.

In retrospect, the use of tunes from less familiar genres may have been a shortcoming of the experimental design, especially since in the speech task we used English words that were equally familiar to both groups. Acoustic regularities may differ between genres, and musicians may thus be unable to apply previously acquired skills to processing signals in an unfamiliar genre (or language). In contrast, musicians are familiar with the regularities of speech, and one may argue that the rules they learned in the music domain may be transferable to processing rules in their native language. In hindsight, ideal tunes should probably have come from Pop music, which can be considered familiar to both groups. Alternatively, the selection of musicians could have been matched more carefully to the genres used here.

Our reasoning for excluding schematic familiarity to control for top-down mechanisms may have been unwise. Familiarity with music genres or languages should not necessarily be interpreted as a top-down mechanism. The amplitude envelope of the speech or music we hear and the formant trajectory are encoded in the N1-P2 obligatory auditory evoked potentials (AEPs) (Carpenter & Shahin, 2013). These AEPs are known to be generated in the primary auditory cortex (PAC) and surrounding regions (Bosnyak, Eaton, & Roberts, 2004; Shahin et al., 2003), and are evoked whether someone is conscious or not. Thus we can argue that familiarity with the speech envelope, pitch, harmonic structure or formant trajectory may involve only a bottom-up process. In turn, expectations of the unfolding tunes or words in our experimental design can be instigated by a bottom-up mechanism if familiarity with the genre or language is guaranteed for most listeners (e.g., by using Pop or Classical music only). A strong expectation of a beat will make it easier for the continuity illusion to occur through a noise interruption when the music pieces provide a reasonable context. In rhythm-focused research such as that of Snyder and Large (2005), beat induction can be achieved by using a sequence of tones at the same pitch. Another type of continuity illusion can be induced by upward or downward frequency modulation of a tone with the insertion of noise (Husain, Lozito, Ulloa, & Horwitz, 2005). This provides a spectrotemporal smoothness resembling that of speech, which was not available in our music session. Instead, in our stimuli, tones are distributed at discrete frequencies on a musical scale and do not necessarily ascend or descend monotonically.

In light of this difference, we reminded the participants to attend not to the melody but to the general “continuity”, even though we call this phenomenon “tonal restoration”.

4.1.2 Possible influence of number of stimulus presentations

One may argue that the absence of a difference in the music session might be related to the smaller number of trials compared with the speech session (128 vs. 231). These numbers of trials were used because the study was optimized for the speech task, while the music session was intended only to pave the way for it. The large number of word stimuli ensured that the noise proportion would rove around the listener’s perceptual threshold before the session ended, which would render our averaged proportion close to the actual value. This might not be true for the tunes, however, given the smaller number of trials. In addition, unlike the speech stimuli, which were spoken by one person throughout, the multi-genre music stimuli were less homogeneous, which might slow the convergence of the noise proportion. Therefore, it is possible that more trials were needed for some musicians to show their advantage over non-musicians.

Based on our observations, only 6 of the 11 musicians and 6 of the 10 non-musicians reached convergence of the noise proportion before the session ended. However, this alone cannot account for the non-significant result.
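We judged convergence informally, by inspecting the staircase track of each session. Purely as an illustration of how such a criterion could be made explicit (the window size and tolerance below are arbitrary assumptions, not values used in this study), one could require the noise proportion to stay within a narrow range over the final trials:

```python
def has_converged(noise_pcts, last_n=20, tolerance=10.0):
    """Return True if the adaptive track has settled, defined here
    (arbitrarily) as the last `last_n` noise proportions spanning no
    more than `tolerance` percentage points.
    """
    if len(noise_pcts) < last_n:
        return False
    tail = noise_pcts[-last_n:]
    return max(tail) - min(tail) <= tolerance

# Example with a hypothetical track of per-trial noise proportions:
# has_converged([120.0, 150.0, 140.0, 145.0, ...])
```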

4.1.3 Possible competition between sound segregation and illusory continuity

On the other hand, we should not overlook other benefits music training may bring to auditory functions. First, research has shown that long-term musical training can enhance listeners’ ability to segregate concurrent sounds based on harmonicity: after a harmonic of a complex tone has been mistuned, the listener may hear more than one sound stream (Zendel & Alain, 2009). As one of the vital aspects of auditory scene analysis, concurrent sound segregation is very useful in the appreciation and production of music. Musicians, especially those who take part in group musical activities, have been trained to identify different auditory objects based on a variety of features such as pitch and timbre. Here, the noise may be treated as an extra auditory object rather than as an inducer of the continuity illusion or an interpolator of the temporal gap.

Second, it is evident that musicians have greater sensitivity to the acoustic onsets and offsets of missing segments (Musacchia et al., 2008). In other words, musicians have been specifically trained to detect the irregular rhythmic elements embellishing a sequence of metric pulses. The most common irregular element is the rest, a short duration of silence. Paying attention to the length of a rest is equivalent to paying attention to the lengths of the adjacent tones. For traditional music performers, the duration of each tone often needs to be followed accurately, so freely filling in the gap is not considered beneficial. Another form of irregularity lies in syncopation, briefly speaking, an abnormal placement of the stress or accent, also called playing “off-beat”. The effect of music training on rhythm may thus include the ability to update rhythmic expectations based on any subtle deviation from the norm, instead of leaving the old expectation stagnant.


These two mechanisms may operate at the same time as illusory filling-in, and therefore the final perception of continuity or interruption is a combined result of neural processing in these different pathways. They can proceed unconsciously, but directing attentional resources to any of them may tip the balance toward continuity or interruption. Very experienced musicians may have an enhanced ability to voluntarily boost illusory filling-in and suppress stream segregation and/or gap detection.

4.1.4 Selection of target tone and replacer noise

Comparing the two sessions, we reiterate that white noise may not be an ideal inducer of the continuity illusion for real-life music such as that in our stimuli. In the speech session, the fricatives/affricates we replaced with noise have power spectra with substantial energy in the middle and high frequency ranges. In contrast, the harmonics of a musical tone usually decrease in amplitude as frequency increases, unlike white noise. In this respect, the “tunes” session is more difficult than the “words” session, which is consistent with some participants’ comments during the debriefing. This imbalance may have rendered the data susceptible to a floor effect and contributed to the lack of advantage of musicians over non-musicians in the “tunes” session.

An ideal substitute masker might be amplitude-modulated noise or even pure tones, but this remains to be tested. Alternatively, we could have targeted music segments that sound like “noise”, i.e. with a broad power spectrum, in the same way that we chose fricatives/affricates. However, this is difficult to implement, because we would first need to define “noisy music” without a noisy background. Some percussion instruments such as timpani may be considered both noisy and melodic, although the music they produce may represent only a small and atypical fraction of everyday music.

4.2 Music training influence on illusory perception in the speech domain

The post-hoc analysis shows that musicians have more tolerance for noise interruptions than non-musicians in the phonemic restoration task; that is, musicians perceive continuity for longer interruptions in words, which is consistent with the hypothesis of functional transfer to the domain of speech perception.

4.2.1 Consonants and vowels

As mentioned in previous sections, we chose fricatives/affricates as the targets to be replaced by noise because their acoustic features are similar to noise. This makes it easier for these phonemes to induce the continuity illusion, and therefore they are frequently used in research. Inevitably, this raises a question: can our result be generalized to other phonemes as well?

Here we need to address consonants and vowels separately. For consonants, including the fricatives/affricates we used, there is a certain chance that when we replace other phonemes with white noise, top-down information will tend to suggest a word template containing a phoneme acoustically similar to the noise. At the same time, template mapping favors words that are frequently used. For instance, if we replace the /v/ in the word “reliever” with white noise, the listener may hear either “releaser” or “reliever”, or realize that both are possible. Sensory-level repair would probably support the illusory continuity of “releaser”, while top-down information would primarily recommend “reliever”. If an alternative word such as “releaser” did not exist, there would be no conflict in word selection. Because of the potential ambiguity described above, fricatives/affricates are prevalently used in phonemic restoration research.

Since our hypothesized mechanism is not limited to a specific set of consonants, we can argue that for other degraded consonants in everyday speech, music training can also have a positive impact on the continuity illusion, as the functional transfer takes effect at the level of “sensory repair”. However, the effect may be obscured by the indeterminacy of template matching. That is why our experimental design was reductionist: using tri-syllabic words and replacing fricatives/affricates.

For the restoration of vowels, the modification cannot be as simple as replacing the whole segment with noise, because the indeterminacy based on lexical memory is even more pronounced. One typical approach was demonstrated in the study by Heinrich et al. (2008), who used sequences of two-formant vowels. When the two formants were presented simultaneously, the stimuli were perceived as “speech-like”; when the two formants alternated in time, the “speech-likeness” was reduced, and it could be partially restored by filling noise into the gaps caused by the alternation. Their result confirmed that vowels are also subject to illusory filling-in, although the focus was on whether the stimulus sounded like human speech or not. It is important to note, however, that the adaptive procedure we used in this study allowed the noise duration to span beyond the fricative/affricate duration. In fact, most individuals tolerated noise interruptions that were much longer than the fricative/affricate durations (in terms of both mean and standard deviation). So the noise covered adjacent phonemes, including vowels, as well, supporting the conclusion that music training transfers to vowel restoration too.

4.2.2 Top-down information

Since we hypothesized that the functional transfer occurs at the lower level of sensory repair, we should ideally use stimuli entailing minimal top-down involvement. For the “tunes” session, we discussed our attempt to tease out the contribution of expectation, though in fact it is impossible to eradicate it for either group. However, as noted above, excluding familiar tunes was a possible shortcoming of the experimental design, as familiarity does not necessarily imply top-down mechanisms.

For the “words” session, however, we did not consider this problem, assuming that (1) using words instead of sentences would largely reduce context-based top-down information, and (2) a nearly equal amount of top-down lexical information was available to each individual. The second assumption was not guaranteed, as we did not specifically control for age and educational background among our participants. In order to eliminate the influence of familiarity with the words, it would be worthwhile to use pseudo-words as stimuli in future studies. A significant difference between the two groups in that case would corroborate our hypothesis even further.


4.3 Musicianship and OMSI

Unlike the post-hoc analysis following the ANOVA, which revealed a significant difference in the “words” session, the correlations between OMSI and performance in either session turned out to be very weak. This raises the question of how to better describe the musicality of participants.

4.3.1 Musician & musicality

The word “musician” is used in different ways in our society. Even the Merriam-Webster dictionary gives two versions of its definition: (1) a person who writes, sings, or plays music; (2) a composer, conductor, or performer of music, especially an instrumentalist. Apparently the latter version is more stringent, as it implicitly requires higher musical ability and relates to an occupation. Due to this ambiguity, empirical researchers need to specify their requirements when they recruit musician participants. The simplest approach is to recruit college students in a music program, but this method is flawed in that it ignores people’s diverse musical backgrounds outside school.

Accordingly, the term “musicality” was first adopted by Révész (2001) to denote “the need and the capacity to understand and to experience the autonomous effects of music and to appraise musical utterances on the score of their objective quality (aesthetic content)”.

Hallam and Prince (2003) conducted a survey to collect different opinions on the conceptualization of “music ability” from different social groups. The organized results included six super-ordinate categories with sub-categories: (1) aural skills, e.g. having a musical ear; (2) receptive responses, e.g. being able to evaluate music and performance; (3) generative skills, e.g. being able to play, sing or read music; (4) the integration of skills; (5) personal qualities, e.g. metacognition; (6) the origins of musical ability, e.g. progressive development. As can be seen from this broad result, it is impossible to use a single criterion to measure musicality, nor is it feasible to find a clear cutoff for dichotomizing “musicianship”.

4.3.2 Ollen Music Sophistication Index and other instruments

OMSI is by far one of the best self-report instruments for characterizing music ability, or, as Ollen (2006) called it, music sophistication. It compensates for the drawbacks of using a single indicator such as years of private lessons or status as a music major in college. The model, containing nine indicators, was generated from a sample of 633 individuals and was able to classify 79.5% of the sample correctly. In 2009, cross-validation was conducted on another 284 adults and the classification accuracy reached 69%, which surpassed other contemporary questionnaires (Ollen, 2009). According to the original annotation, a score of 500 is the cutoff point, which excluded half of our musician participants from being “musically sophisticated”.

As Ollen (2006) pointed out, OMSI has its limitations. First, it is most useful for adults living in America, Canada, Australia and perhaps European countries. Second, the majority of the sample came from traditional musical contexts which emphasize “reproduction of notated scores” rather than “personal stylization and improvisation”.

Apparently, for musicians specializing in genres other than classical music, OMSI is less capable of characterizing their musical skills. Since many of our musician participants perform non-classical music, it is not surprising that OMSI did not capture their musicality very accurately.

In addition, we noticed that the last multiple-choice item is a self-assessment of musicianship level, which contains some ambiguous options such as “serious amateur musician” and “semiprofessional musician”. Participants, especially music majors, reported difficulty in making this decision. This reflects a shortcoming of subjective assessment in general.

In parallel to self-report instruments, objective perceptual tests have also been widely used, including the Montreal Battery of Evaluation of Amusia (Peretz, Champod, & Hyde, 2003) and the Musical Ear Test (Wallentin, Nielsen, Friis-Olivarius, Vuust, & Vuust, 2010). Neurophysiological tests can also be used to reliably distinguish musicians from non-musicians, e.g. using a mismatch negativity paradigm (Vuust et al., 2011). The benefit of objective tests is self-evident. However, they are time-consuming insofar as the result is only used to categorize participants. Measuring multiple dimensions of musicality may require an omnibus of simple tests geared toward acuity in pitch, timbre, rhythm, etc. Even so, existing tests still cannot cover many other aspects of music ability as proposed by Hallam and Prince (2003). In a word, a person with solely high aural skills cannot be called a musician.


4.3.3 How to treat vocalists and songwriters

“Vocalist” is a respectful way of referring to a singer. By the definition of “musician” in the broad sense, a vocalist should doubtlessly fall under the category of “musician”. However, due to the stereotypical association between musicians and instrumentalists, a vocalist/singer is rarely called a musician explicitly. That is why the vocalist participant in our study originally identified himself with the non-musician group and later attributed his decision to his inability to play any instrument. In fact, an experienced vocalist often needs to undergo rigorous training in a variety of musical skills. What distinguishes vocalists from instrumentalists is their tool for producing music: the biological vocal system. Our vocalist participant, in particular, had been in a church choir for decades, which indicates a higher command of harmony and synchrony than that of a soloist.

One may wonder whether the involvement of lyrics in vocal training interferes with the comparison between speech and music perception. According to the modular model of Peretz and Coltheart (2003), acoustic analysis of sung signals incorporates a language processing pathway that goes through acoustic-to-phonological conversion, which subsequently sends output to the phonological lexicon. However, given that the linguistic component of singing has lost the original speech-like temporal pattern, it is reasonable to posit that the acoustic-to-phonological conversion mentioned here is separate from that for normal speech. Simply put, singing is music, not speech.

Likewise, songwriters are a special group who are often considered musicians by laypeople, but non-musicians by academically trained musicians. Most songwriters are able to play instruments and they often sing as well. In terms of virtuosity in specific musical skills, they can hardly compare with professional performers. However, when it comes to musical creativity and personal qualities, they sometimes outscore those who perform without composing. For example, some of the musicians in our study do engage in songwriting.

4.4 Cross-domain correlation as a means of elucidating functional transfer

The strong positive correlation between the perceptual thresholds for the two sessions may support our hypothesis of functional transfer, despite the fact that, based on the ANOVA results, musicians did not outperform non-musicians in the “tunes” session. Here we examine this strong correlation from different perspectives.

4.4.1 Neural markers

The fact that noise tolerance in music is highly correlated with that in speech is suggestive of overlap between the neural networks associated with music and speech processing, including the function of sensory repair. Apart from that, we did observe that musicians in general have higher noise tolerance in speech, though not in music, probably due to some concurrent neural/mental process (see section 4.1.3).

Behavioral results may sometimes reveal less than electrophysiological results, because the eventual decision often reflects a combination of processing results across more than one neural pathway, whereas electrophysiology allows the decomposition of neural components corresponding to separable mental processes. To elucidate the mechanism of music-to-speech transfer, it is necessary to zoom in on the neural substrates that mediate the continuity illusion. In future work we would like to focus on the AEP complex P1-N1-P2-N2, which is associated with rhythm processing. We would expect to see smaller P1-N1-P2-N2 amplitudes in musicians than in non-musicians toward the offset of the interruption in the speech session. Also, we expect that the salience of the P1-N1-P2-N2 complex when perceiving continuity relative to interruption in degraded speech would be highly correlated with that for degraded music.

For the speech session alone, the effect of music training may also manifest itself in a salient change in theta-band (4-8 Hz) oscillatory activity, as effective illusory filling-in is associated with a decreased phase-locking index (PLI) at the interruption offset, as well as with suppressed theta-band power in our case (Shahin et al., 2012). These continuous variables retrieved from neurophysiological data can provide better resolution for capturing perceptual changes than the forced binary choice in our behavioral experiment.
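For completeness, a phase-locking index of this kind is commonly computed as the length of the mean resultant vector of single-trial phases in the band of interest. The sketch below is a generic illustration using SciPy, not the pipeline of Shahin et al. (2012); the sampling rate and epoch layout are placeholders.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def theta_pli(epochs, fs, band=(4.0, 8.0)):
    """Inter-trial phase-locking index over time.

    epochs : array of shape (n_trials, n_samples), EEG epochs
             time-locked to the interruption offset.
    fs     : sampling rate in Hz.
    Returns an array of shape (n_samples,) with values in [0, 1].
    """
    # Band-pass filter each trial to the theta band.
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=1)

    # Instantaneous phase via the analytic signal.
    phase = np.angle(hilbert(filtered, axis=1))

    # PLI: magnitude of the mean unit phase vector across trials.
    return np.abs(np.mean(np.exp(1j * phase), axis=0))

# Example: pli = theta_pli(epochs, fs=250); lower values around the
# interruption offset would accompany effective illusory filling-in.
```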

4.4.2 Adequacy of rhythmic information

In our explanation, the functional transfer relies on the overlapping neural networks associated with rhythm processing, more specifically, the real-time encoding of rhythmic information followed by expectation of the ensuing rhythmic elements. This requires us to assume that the duration of our stimuli is long enough to ensure the perception of rhythm. Notably, both the pre-interruption part and the post-interruption part of the stimulus contribute to this encoding, but this type of retrospection is different from the excessive rumination discouraged in the methodology.

Plazak and Huron (2011) conducted a seminal study investigating how rapidly the human perceptual and cognitive system can process characteristic musical information, as well as how long it takes the listener to gain confidence about the acquired information. Their results suggested a rapid unfolding of such information within three seconds. According to the “tentative chronology” they proposed, listeners generally need 400 ms to confidently identify the genre, 2000 ms for the meter, 3000 ms for the tempo and 3000 ms for possible syncopation. Therefore, the lengths of our tunes (1~2) might fail to provide each subject with all rhythm-related information, which could partly account for the absence of a significant result in the “tunes” session.

Nevertheless, we can argue that the “tentative chronology” is conservative, because processing speed in the brain is higher than what can be measured in a behavioral study. Also, music training plus broad exposure can potentially facilitate rhythm expectation and thus reduce the required time.

For speech, the same question exists. Since the smallest element related to speech rhythm should be a syllable rather than a phoneme, can a tri-syllabic word by itself induce a percept of rhythm when it is separated from the larger context of an utterance or even a discourse? Considering the hierarchical structure of speech and music rhythms, the available information is incomplete. However, it might be sufficient if the short stimulus serves as a cue that triggers neural oscillations within a stereotypical frequency band related to speech processing, rather than working through gradual entrainment. Therefore, the required duration of external input may be substantially shorter than expected.


4.4.3 Nature vs. Nurture

One essential limitation of comparing existing musicians and non-musicians is that we cannot draw with certainty the causal conclusion that the significant difference in performance is due to music training, without precluding the possibility of genetic predisposition. The results of this study only partially support, and do not provide strong evidence for, the hypothesis of functional transfer.

One can argue that the strong correlation can be explained by individual differences, i.e. whether the participant carries “musical genes” or not. According to this point of view, whether one has the potential to become a successful musician is largely predetermined, regardless of the amount of music training received. In particular, some talented musicians who do not use music scores can still achieve success by practicing extensively by ear with their so-called “musical ear” (Seppänen et al., 2007). It would be presumptuous to attribute this proclivity for an aural versus non-aural practice strategy entirely to the person’s choice of career, because the preference for auditory versus visual learning might be considered an individual trait.

However, the putative “musical genes” are insufficient to explain the neuroplastic changes that follow music training. First, no musical genes have been empirically verified in human beings, and only by knocking in or knocking out certain genes could one draw a causal conclusion in support of the genetic theory. Second, the original view of genetic determination needs to be updated with an epigenetic perspective, as music training is what causes music-related genes to be switched on and expressed, and people born with some musical talent may actively seek music training and benefit more from it (Hambrick & Tucker-Drob, 2014). Third, as a comprehensive human activity, musical ability covers far more than auditory functions. The so-called “musical ear” is actually a misnomer, because the perception and cognition of melody, rhythm and other musical features is mostly accomplished in the central nervous system rather than in the ear. Also, musical prodigies like Mozart were outstanding more for their performing skills and creativity than for their aural skills, although the latter might be crucial in making a good musician (Beethoven being no exception, as he made use of bone conduction when suffering from hearing loss). Finally, intervention studies with randomly assigned treatment groups do exist and have shown an impact of music training on neuroplastic adaptation. In particular, the strongest evidence in favor of the dominant role of music training comes from longitudinal studies of children before and after music training (Fujioka, Ross, Kakigi, Pantev, & Trainor, 2006; Hyde et al., 2009; Moreno et al., 2009; Shahin et al., 2008).

In fact, musicians have already been widely used as a model for studying neuroplasticity, because of two major advantages, namely the complexity of the music stimuli and the extent of exposure to those stimuli (Münte, Altenmüller, & Jäncke, 2002), and also because of the difficulty of implementing interventional studies. Anatomical and functional differences have been detected in musicians’ brains with the help of modern brain imaging techniques, but one should not over-interpret these results, because such studies are correlational in nature, leaving genetic predisposition as a potential confounding factor.


References

Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., & Merzenich, M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences, 98(23), 13367– 13372.

Bendor, D., & Wang, X. (2005). The neuronal representation of pitch in primate auditory cortex. Nature, 436(7054), 1161–1165.

Besson, M., & Faïta, F. (1995). An event-related potential (ERP) study of musical expectancy: Comparison of musicians with nonmusicians. Journal of Experimental Psychology: Human Perception and Performance, 21(6), 1278–1296.

Bosnyak, D. J., Eaton, R. A., & Roberts, L. E. (2004). Distributed auditory cortical representations are modified when non-musicians are trained at pitch discrimination with 40 Hz amplitude modulated tones. Cerebral Cortex (New York, N.Y.: 1991), 14(10), 1088–1099.

Braaten, R. F., & Leary, J. C. (1999). Temporal Induction of Missing Birdsong Segments in European Starlings. Psychological Science, 10(2), 162–166.

Bregman, A. S. (1990). Auditory scene analysis: the perceptual organization of sound. Cambridge, Mass.: MIT Press.

Burton, M. W., Small, S. L., & Blumstein, S. E. (2000). The role of segmentation in phonological processing: an fMRI investigation. Journal of Cognitive Neuroscience, 12(4), 679–690.

Carlyon, R. P., Micheyl, C., Deeks, J. M., & Moore, B. C. J. (2004). Auditory processing of real and illusory changes in frequency modulation (FM) phase. The Journal of the Acoustical Society of America, 116(6), 3629–3639.

Carpenter, A. L., & Shahin, A. J. (2013). Development of the N1-P2 auditory evoked response to amplitude rise time and rate of formant transition of speech sounds. Neuroscience Letters, 544, 56–61.


Chan, A. S., Ho, Y. C., & Cheung, M. C. (1998). Music training improves verbal memory. Nature, 396(6707), 128.

Darwin, C. (1981). The descent of man, and selection in relation to sex. Princeton, N.J.: Princeton University Press. Retrieved from http://public.eblib.com/EBLPublic/PublicView.do?ptiID=581579

Ehret, G. (1997). The auditory cortex. Journal of Comparative Physiology A: Sensory, Neural, and Behavioral Physiology, 181(6), 547–557.

Eulitz, C., & Hannemann, R. (2010). On the matching of top-down knowledge with sensory input in the perception of ambiguous speech. BMC Neuroscience, 11(1), 67.

Fujioka, T., Ross, B., Kakigi, R., Pantev, C., & Trainor, L. J. (2006). One year of musical training affects development of auditory cortical-evoked fields in young children. Brain: A Journal of Neurology, 129(Pt 10), 2593–2608.

Fujioka, T., Trainor, L. J., Ross, B., Kakigi, R., & Pantev, C. (2005). Automatic Encoding of Polyphonic Melodies in Musicians and Nonmusicians. Journal of Cognitive Neuroscience, 17(10), 1578–1592.

Hallam, S., & Prince, V. (2003). Conceptions of musical ability. Research Studies in Music Education, 20(1), 2–22.

Hambrick, D. Z., & Tucker-Drob, E. M. (2014). The genetics of music accomplishment: Evidence for gene-environment correlation and interaction. Psychonomic Bulletin & Review.

Heinrich, A., Carlyon, R. P., Davis, M. H., & Johnsrude, I. S. (2008). Illusory vowels resulting from perceptual continuity: a functional magnetic resonance imaging study. Journal of Cognitive Neuroscience, 20(10), 1737–1752.

Huron, D. (2006). Sweet Anticipation-Music and the Psychology of Expectation. The MIT Press.

Husain, F. T., Lozito, T. P., Ulloa, A., & Horwitz, B. (2005). Investigating the Neural Basis of the Auditory Continuity Illusion. Journal of Cognitive Neuroscience, 17(8), 1275–1292.

Hyde, K. L., Lerch, J., Norton, A., Forgeard, M., Winner, E., Evans, A. C., & Schlaug, G. (2009). Musical training shapes structural brain development. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 29(10), 3019–3025.

Jackendoff, R. (2009). Parallels and Nonparallels between Language and Music. Music Perception: An Interdisciplinary Journal, 26(3), 195–204.


Kaiser, A. R., Kirk, K. I., Lachs, L., & Pisoni, D. B. (2003). Talker and lexical effects on audiovisual word recognition by adults with cochlear implants. Journal of Speech, Language, and Hearing Research: JSLHR, 46(2), 390–404.

Lakatos, P., Shah, A. S., Knuth, K. H., Ulbert, I., Karmos, G., & Schroeder, C. E. (2005). An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of Neurophysiology, 94(3), 1904–1911.

Lenz, D., Jeschke, M., Schadow, J., Naue, N., Ohl, F. W., & Herrmann, C. S. (2008). Human EEG very high frequency oscillations reflect the number of matches with a template in auditory short-term memory. Brain Research, 1220, 81–92.

Luo, H., & Poeppel, D. (2007). Phase Patterns of Neuronal Responses Reliably Discriminate Speech in Human Auditory Cortex. Neuron, 54(6), 1001–1010.

Lyzenga, J., Carlyon, R. P., & Moore, B. C. J. (2005). Dynamic aspects of the continuity illusion: perception of level and of the depth, rate, and phase of modulation. Hearing Research, 210(1-2), 30–41.

Marie, C., Magne, C., & Besson, M. (2011). Musicians and the metric structure of words. Journal of Cognitive Neuroscience, 23(2), 294–305.

McMurray, B., Dennhardt, J. L., & Struck-Marcell, A. (2008). Context effects on musical chord categorization: Different forms of top-down feedback in speech and music? Cognitive Science, 32(5), 893–920.

Wilson, M. (1987). MRC Machine Usable Dictionary, Version 2.00. Informatics Division, Science and Engineering Research Council, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon, OX11 0QX. Retrieved from http://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm

Miller, C. T., Dibble, E., & Hauser, M. D. (2001). Amodal completion of acoustic signals by a nonhuman primate. Nature Neuroscience, 4(8), 783–784.

Miller, G. A. (1950). The Intelligibility of Interrupted Speech. The Journal of the Acoustical Society of America, 22(2), 167.

Moreno, S., Marques, C., Santos, A., Santos, M., Castro, S. L., & Besson, M. (2009). Musical training influences linguistic abilities in 8-year-old children: more evidence for brain plasticity. Cerebral Cortex (New York, N.Y.: 1991), 19(3), 712–723.

Müller, N., Keil, J., Obleser, J., Schulz, H., Grunwald, T., Bernays, R.-L., … Weisz, N. (2013). You can’t stop the music: reduced auditory alpha power and coupling between auditory and memory regions facilitate the illusory perception of music during noise. NeuroImage, 79, 383–393.


Münte, T. F., Altenmüller, E., & Jäncke, L. (2002). The musician’s brain as a model of neuroplasticity. Nature Reviews. Neuroscience, 3(6), 473–478.

Musacchia, G., Sams, M., Skoe, E., & Kraus, N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences of the United States of America, 104(40), 15894–15898.

Musacchia, G., Strait, D., & Kraus, N. (2008). Relationships between behavior, brainstem and cortical encoding of seen and heard speech in musicians and non-musicians. Hearing Research, 241(1-2), 34–42.

Neuhaus, C., & Knösche, T. R. (2008). Processing of pitch and time sequences in music. Neuroscience Letters, 441(1), 11–15.

Ollen, J. E. (2006). A criterion-related validity test of selected indicators of musical sophistication using expert ratings. Doctoral dissertation, The Ohio State University, Columbus, OH, USA. Retrieved from http://www.ohiolink.edu/etd/view.cgi?osu1161705351

Ollen, J. E. (2009). Cross-validation of a Model for Classifying Musical Sophistication. Presented at the 9th Conference of Society of Music Perception and Cognition, University of Indianapolis-Purdue University, Indianapolis, IN.

Palva, S., & Palva, J. M. (2007). New vistas for α-frequency band oscillations. Trends in Neurosciences, 30(4), 150–158.

Pantev, C., Oostenveld, R., Engelien, A., Ross, B., Roberts, L. E., & Hoke, M. (1998). Increased auditory cortical representation in musicians. Nature, 392(6678), 811–814.

Parbery-Clark, A., Skoe, E., & Kraus, N. (2009). Musical experience limits the degradative effects of background noise on the neural processing of sound. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 29(45), 14100– 14107.

Parbery-Clark, A., Skoe, E., Lam, C., & Kraus, N. (2009). Musician enhancement for speech-in-noise. Ear and Hearing, 30(6), 653–661.

Peretz, I. (2006). The nature of music from a biological perspective. Cognition, 100(1), 1–32.

Peretz, I., Champod, A. S., & Hyde, K. (2003). Varieties of musical disorders. The Montreal Battery of Evaluation of Amusia. Annals of the New York Academy of Sciences, 999, 58–75.

Peretz, I., & Coltheart, M. (2003). Modularity of music processing. Nature Neuroscience, 6(7), 688–691.

Petkov, C. I., O’Connor, K. N., & Sutter, M. L. (2007). Encoding of illusory continuity in primary auditory cortex. Neuron, 54(1), 153–165.

Plazak, J., & Huron, D. (2011). The first three seconds: Listener knowledge gained from brief musical excerpts. Musicae Scientiae, 15(1), 29–44.

Pruim, R., Kaplan, D., & Horton, N. (2014). Project MOSAIC (mosaic-web.org) statistics and mathematics teaching utilities. Retrieved from http://mosaic-web.org/r-packages/

Repp, B. H. (1992). Perceptual restoration of a “missing” speech sound: auditory induction or illusion? Perception & Psychophysics, 51(1), 14–32.

Révész, G. (2001). Introduction to the psychology of music. Mineola, NY: Dover Publications.

Riecke, L., Esposito, F., Bonte, M., & Formisano, E. (2009). Hearing illusory sounds in noise: the timing of sensory-perceptual transformations in auditory cortex. Neuron, 64(4), 550–561.

Sadie, S., & Tyrrell, J. (Eds.). (2001). The new Grove dictionary of music and musicians (2nd ed.). New York: Grove.

Sammler, D., Grigutsch, M., Fritz, T., & Koelsch, S. (2007). Music and emotion: Electrophysiological correlates of the processing of pleasant and unpleasant music. Psychophysiology, 44(2), 293–304.

Samuel, A. G. (1981). Phonemic restoration: insights from a new methodology. Journal of Experimental Psychology. General, 110(4), 474–494.

Seeba, F., Schwartz, J. J., & Bee, M. A. (2010). Testing an auditory illusion in frogs: Perceptual restoration or sensory bias? Animal Behaviour, 79(6), 1317–1328.

Seppänen, M., Brattico, E., & Tervaniemi, M. (2007). Practice strategies of musicians modulate neural processing and the learning of sound-patterns. Neurobiology of Learning and Memory, 87(2), 236–247.

Shahin, A., Bosnyak, D. J., Trainor, L. J., & Roberts, L. E. (2003). Enhancement of neuroplastic P2 and N1c auditory evoked potentials in musicians. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 23(13), 5545–5552.

Shahin, A. J. (2011). Neurophysiological influence of musical training on speech perception. Frontiers in Psychology, 2, 126.

Shahin, A. J., Bishop, C. W., & Miller, L. M. (2009). Neural mechanisms for illusory filling-in of degraded speech. NeuroImage, 44(3), 1133–1143.

Shahin, A. J., Kerlin, J. R., Bhat, J., & Miller, L. M. (2012). Neural restoration of degraded audiovisual speech. NeuroImage, 60(1), 530–538.

Shahin, A. J., & Pitt, M. A. (2012). Alpha activity marking word boundaries mediates speech segmentation. European Journal of Neuroscience, 36(12), 3740–3748.

Shahin, A. J., Roberts, L. E., Chau, W., Trainor, L. J., & Miller, L. M. (2008). Music training leads to the development of timbre-specific gamma band activity. NeuroImage, 41(1), 113–122.

Shahin, A. J., Trainor, L. J., Roberts, L. E., Backer, K. C., & Miller, L. M. (2010). Development of auditory phase-locked activity for music sounds. Journal of Neurophysiology, 103(1), 218–229.

Shahin, A., Roberts, L. E., Pantev, C., Trainor, L. J., & Ross, B. (2005). Modulation of P2 auditory-evoked responses by the spectral complexity of musical sounds. Neuroreport, 16(16), 1781–1785.

Shahin, A., Roberts, L. E., & Trainor, L. J. (2004). Enhancement of auditory cortical development by musical experience in children. Neuroreport, 15(12), 1917–1921.

Sivonen, P., Maess, B., Lattner, S., & Friederici, A. D. (2006). Phonemic restoration in a sentence context: evidence from early and late ERP effects. Brain Research, 1121(1), 177–189.

Snyder, J. S., & Large, E. W. (2005). Gamma-band activity reflects the metric structure of rhythmic tone sequences. Brain Research. Cognitive Brain Research, 24(1), 117–126.

Strait, D. L., Kraus, N., Parbery-Clark, A., & Ashley, R. (2010). Musical experience shapes top-down auditory mechanisms: evidence from masking and auditory attention performance. Hearing Research, 261(1-2), 22–29.

Sugita, Y. (1997). Neuronal correlates of auditory induction in the cat cortex. Neuroreport, 8(5), 1155–1159.

Tervaniemi, M., & Huotilainen, M. (2003). The promises of change-related brain potentials in cognitive neuroscience. Annals of the New York Academy of Sciences, 999, 29–39.

Tervaniemi, M., Just, V., Koelsch, S., Widmann, A., & Schroger, E. (2004). Pitch discrimination accuracy in musicians vs nonmusicians: an event-related potential and behavioral study. Experimental Brain Research, 161(1), 1–10.

Tervaniemi, M., Medvedev, S. V., Alho, K., Pakhomov, S. V., Roudas, M. S., Van Zuijen, T. L., & Näätänen, R. (2000). Lateralized automatic auditory processing of phonetic versus musical information: a PET study. Human Brain Mapping, 10(2), 74–79.

Tremblay, K., Kraus, N., McGee, T., Ponton, C., & Otis, B. (2001). Central auditory plasticity: changes in the N1-P2 complex after speech-sound training. Ear and Hearing, 22(2), 79–90.

Vuust, P., Brattico, E., Glerean, E., Seppänen, M., Pakarinen, S., Tervaniemi, M., & Näätänen, R. (2011). New fast mismatch negativity paradigm for determining the neural prerequisites for musical ability. Cortex, 47(9), 1091–1098.

Wager, T. D., & Nichols, T. E. (2003). Optimization of experimental design in fMRI: a general framework using a genetic algorithm. NeuroImage, 18(2), 293–309.

Wallentin, M., Nielsen, A. H., Friis-Olivarius, M., Vuust, C., & Vuust, P. (2010). The Musical Ear Test, a new reliable test for measuring musical competence. Learning and Individual Differences, 20(3), 188–196.

Warren, R. M. (1970). Perceptual Restoration of Missing Speech Sounds. Science, 167(3917), 392–393.

Watanabe, T., Yagishita, S., & Kikyo, H. (2008). Memory of music: roles of right hippocampus and left inferior frontal gyrus. NeuroImage, 39(1), 483–491.

Wertheimer, M., & King, B. D. (2004). Max Wertheimer and Gestalt Theory. Piscataway: Transaction Publishers.

Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420–422.

Zaehle, T., Geiser, E., Alter, K., Jancke, L., & Meyer, M. (2008). Segmental processing in the human auditory dorsal stream. Brain Research, 1220, 179–190.

Zanto, T. P., Snyder, J. S., & Large, E. W. (2006). Neural correlates of rhythmic expectancy. Advances in Cognitive Psychology, 2(2), 221–231.

Zatorre, R. J. (2001). Spectral and Temporal Processing in Human Auditory Cortex. Cerebral Cortex, 11(10), 946–953.

Zendel, B. R., & Alain, C. (2009). Concurrent sound segregation is enhanced in musicians. Journal of Cognitive Neuroscience, 21(8), 1488–1498.


Appendix A: Ollen Music Sophistication Index

In order to obtain your score, please enter an answer for every question unless you are specifically directed to skip one.

1. How old are you today? _____ age in years

2. At what age did you begin sustained musical activity? "Sustained musical activity" might include regular music lessons or daily musical practice that lasted for at least three consecutive years. If you have never been musically active for a sustained time period, answer with zero. _____ age at start of sustained musical activity

3. How many years of private music lessons have you received? If you have received lessons on more than one instrument, including voice, give the number of years for the one instrument/voice you've studied longest. If you have never received private lessons, answer with zero. _____ years of private lessons

4. For how many years have you engaged in regular, daily practice of a musical instrument or singing? "Daily" can be defined as 5 to 7 days per week. A "year" can be defined as 10 to 12 months. If you have never practiced regularly, or have practiced regularly for fewer than 10 months, answer with zero. _____ years of regular practice

5. Which category comes nearest to the amount of time you currently spend practicing an instrument (or voice)? Count individual practice time only; no group rehearsals.
_____ I rarely or never practice singing or playing an instrument
_____ About 1 hour per month
_____ About 1 hour per week
_____ About 15 minutes per day
_____ About 1 hour per day
_____ More than 2 hours per day

6. Have you ever enrolled in any music courses offered at college (or university)?

_____ No (Skip to #8)
_____ Yes

7. (If Yes) How much college-level coursework in music have you completed? If more than one category applies, select your most recently completed level.
_____ None
_____ 1 or 2 NON-major courses (e.g., music appreciation, playing or singing in an ensemble)
_____ 3 or more courses for NON-majors
_____ An introductory or preparatory music program for Bachelor's level work
_____ 1 year of full-time coursework in a Bachelor of Music degree program (or equivalent)
_____ 2 years of full-time coursework in a Bachelor of Music degree program (or equivalent)
_____ 3 or more years of full-time coursework in a Bachelor of Music degree program (or equivalent)
_____ Completion of a Bachelor of Music degree program (or equivalent)
_____ One or more graduate-level music courses or degrees

8. Which option best describes your experience at composing music?
_____ Have never composed any music
_____ Have composed bits and pieces, but have never completed a piece of music
_____ Have composed one or more complete pieces, but none have been performed
_____ Have composed pieces as assignments or projects for one or more music classes; one or more of my pieces have been performed and/or recorded within the context of my educational environment
_____ Have composed pieces that have been performed for a local audience
_____ Have composed pieces that have been performed for a regional or national audience (e.g., nationally known performer or ensemble, major concert venue, broadly distributed recording)

9. To the best of your memory, how many live concerts (of any style, with free or paid admission) have you attended as an audience member in the past 12 months? Please do not include regular religious services in your count, but you may include special musical productions or events.
_____ None
_____ 1 - 4
_____ 5 - 8
_____ 9 - 12
_____ 13 or more

10. Which title best describes you?
_____ Nonmusician
_____ Music-loving nonmusician
_____ Amateur musician
_____ Serious amateur musician
_____ Semiprofessional musician
_____ Professional musician


Appendix B: Music works used for generating music stimuli

Bach -Violin concerto in Gm

Baden Powell -Braziliense

Haydn -Symphony 101 "The Clock" -Symphony 103 "Drum-roll" -Symphony 104 "London"

John Coltrane -Greensleeves -Naima -Summertime

Pat Martino -Road Song -The Phineas Trane

Ralph Vaughan Williams -The Lark Ascending

Tchaikovsky -Allegro Moderato (violin) -Symphony #4 -Piano Concerto #1


Appendix C: Experimental data used in statistical analysis

(In ascending order of Ollen Music Sophistication Index within non-musician and musician groups respectively)

non-musicians
OMSI   threshold_tunes   threshold_words   familiarity(Q1)   familiarity(Q2)
21     162.2619          155.7895          1.5               1.7
36     203.4524          124.4737          2.2               7
81     141.82895         174.4156          1                 1.9
99     98.125            75.8043           1                 2.2
99     209.2188          187.5325          1.5               3.05
109    108.6719          98.7013           1.5               2.35
160    169.6429          176.9651          1.5               2.35
161    79.0234           58.8961           1                 5.05
161    136.0938          146.9481          1.5               2.7
162    58.8115           120.6             4.05              5.2

musicians
OMSI   threshold_tunes   threshold_words   familiarity(Q1)   familiarity(Q2)
129    145.3571          187.8478          1.5               2.85
226    169.1667          177.6856          1                 4.9
336    196.7969          233.75            6.05              4.7
345    58.2143           91.2338           1.85              4.7
358    154.2578          216.6234          4                 3.85
439    135.1563          135.9091          1                 2.85
633    130.7031          108.146           1                 4.5
683    129.7656          193.1169          1.35              3.05
839    127.1875          180.8442          2.35              5.4
856    138.2031          149.0435          1                 4.2
978    116.4063          206.7609          1                 4.05
