
AUDITORY SALIENCE IN NATURAL SCENES

by Nicholas Huang

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland October 2019

© 2019 Nicholas Huang. All rights reserved.

Abstract

Salience describes the phenomenon by which an object stands out from a scene. While its underlying processes are extensively studied in vision, mechanisms of auditory salience remain largely unknown. In particular, few studies have explored salience within the more realistic context of extended natural scenes. Thus, the focus of this dissertation is to first create a foundation for studying auditory salience in natural scenes, complete with a database of acoustic scenes and a behavioral measurement of salience within those scenes to serve as ground truth. Then, this framework is used to determine the underlying acoustic attributes that give rise to salience, as well as the neurophysiological response elicited by salient auditory stimuli. The commonalities and differences between salience as bottom-up attention and task-driven (top-down) attention are also explored to shed light on the neurological basis for both mechanisms. Further, the contribution of sound category to salience is quantified using a convolutional neural network. Together, these findings contribute to a comprehensive understanding of auditory salience.

Primary Reader and Advisor: Mounya Elhilali
Secondary Reader: Ernst Niebur

Thesis Committee

Mounya Elhilali (Primary Advisor)
Associate Professor
Department of Electrical and Computer Engineering
Johns Hopkins University

Ernst Niebur
Professor
Department of Neuroscience
Johns Hopkins School of Medicine

Kechen Zhang
Associate Professor
Department of Biomedical Engineering
Johns Hopkins School of Medicine

Acknowledgments

Thank you to my thesis advisor and mentor, Professor Mounya Elhilali, for your guidance, wisdom, knowledge, and patience. I have learned so much from you as your student - none of this would be possible without you. Thank you to the other members of my thesis committee, Professor Ernst Niebur and Professor Kechen Zhang. Your insights, advice, and support have been invaluable to this work. Thank you to the other members of my lab, both current and previous, for your great camaraderie, collaboration, and discussion. Thank you to the professors who served on my Oral Examination Committee: Professor Stewart Hendry, Professor Reza Shadmehr, Professor Danielle Tarraf, and Professor Eric Young. Thank you to my other friends in the Johns Hopkins community and beyond, especially my friends from ballroom, TSA / 桌遊社, choir, and other activities throughout the years. Thank you to my family (Hsin-Hsiung Huang, Shu-Ming Chiou, and Oliver Huang) for your love and support throughout my life. Last but certainly not least, thank you to the love of my life, Luoluo Liu, for bringing such joy into my world. I love that we can discuss anything under the sun, including our work and research. Looking forward to many new adventures together!

Table of Contents

Abstract

Thesis Committee

Acknowledgments

List of Tables

List of Figures

1 Introduction
   1.1 Existing Auditory Salience Literature
   1.2 Contributions of this Thesis
      1.2.1 Natural Scene Salience Database, Measurement, Acoustic Analysis, and Modeling
      1.2.2 Neural Markers of Auditory Salience
      1.2.3 Effects of Audio Class on Salience: Neural Network Analysis
   1.3 Summary

2 Salience Database, Measurement, Acoustic Analysis, and Modeling
   2.1 Introduction
   2.2 Methods
      2.2.1 Stimuli
      2.2.2 Behavioral Data Collection
      2.2.3 Analysis of Behavioral Data
      2.2.4 Acoustic Analysis of Stimuli
      2.2.5 Event Analysis
      2.2.6 Pupil Response Analysis
   2.3 Results
      2.3.1 Behavioral Results
      2.3.2 Acoustic Features
      2.3.3 Acoustic Context
      2.3.4 Event Types
      2.3.5 Long-term context
      2.3.6 Event Prediction
      2.3.7 Pupillary Response
   2.4 Discussion

3 Neural Markers of Auditory Salience: Musical Sequences with Visual Task
   3.1 Introduction
   3.2 Materials and Methods
      3.2.1 Participants
      3.2.2 Stimuli and experimental paradigm
      3.2.3 Neural data acquisition
      3.2.4 Neural data analysis
   3.3 Results
   3.4 Discussion

4 Push-pull competition between bottom-up and top-down auditory attention to natural soundscapes
   4.1 Introduction
   4.2 Methods
      4.2.1 Stimuli
      4.2.2 Participants and Procedure
      4.2.3 Electroencephalography
   4.3 Results
   4.4 Discussion

5 Hierarchical Representation in Deep Neural Networks and Contribution of Audio Class to Salience
   5.1 Introduction
   5.2 Materials and Methods
      5.2.1 Stimuli
      5.2.2 Acoustic Features
      5.2.3 Behavioral salience
      5.2.4 Electroencephalography
      5.2.5 Deep Neural Network
      5.2.6 Network Surprisal
      5.2.7 Correlation Analyses
      5.2.8 Event Prediction
   5.3 Results
      5.3.1 Comparison to Basic Acoustic Features
      5.3.2 Comparison to Behavioral Measure of Salience
      5.3.3 Comparison to Neural Activity
   5.4 Discussion

6 Conclusions and Future Directions

References

Biography

List of Tables

1.1 Breakdown of the existing literature in Auditory Salience. References in red are published works presented in this dissertation, while those in blue are works presented in this dissertation which have been submitted for peer review.

2.1 List of the twenty audio recordings of natural scenes used in this study and suggested for use in future studies. Scenes were selected to cover a wide range of acoustic environments. Scenes were labeled subjectively as sparse or dense in regards to the density of acoustic objects within.

3.1 Feature effects on EEG measures of salience.

5.1 Dimensions of the input and each layer of the neural network. Bold text indicates which layers are used in the analysis.

List of Figures

2.1 a Presentation screen for the behavioral task. Subjects were asked to listen to one natural scene in each ear, and to move the cursor to the side corresponding to the scene that they were focusing on at any given time. (Text on top was not part of the presentation screen). b Example waveforms of two stimuli presented in a given trial along with behavioral response of one subject shown in the center. c The average response across subjects and competing scenes is shown for waveform 2. Salience is denoted as the percentage of subjects attending to that scene at any given time. Peaks in the height of the slope of the average salience are marked as salient events.

2.2 Distribution of events across the different natural scenes. Each bar represents one of the twenty scenes in the database. The height of the bar indicates the relative salience strength of the events within that scene. The shading of the bar describes the number of events present within that scene. Scenes are sorted by category, then by number of events within each scene. Sparse scenes contain significantly stronger salient events than the other scene categories, indicated by a ∗.

2.3 a The strength of an event relative to the time since the last event. b Histogram of individual onset to offset times. These times each correspond to a single mouse movement to and from a scene. A probability density function was fitted using kernel density estimation (a non-parametric method) and peaks at 3.11 seconds.

2.4 Reaction times. The reaction time is defined here as the time between a change in loudness and the change in the behavioral response (inset). Events are ranked by absolute consensus to avoid timing issues. Boxes represent 95% confidence intervals, adjusted for multiple comparisons using a post-hoc Tukey HSD test.

2.5 a Acoustic feature change averaged across events. All features are z-score normalized. Statistically significant increases or decreases are labeled by a ∗, where ∗ indicates p < .05, and ∗∗ p < .0001. Inset: Average loudness relative to salient events. Vertical bars indicate the time windows used to measure the change in acoustic feature across events. b Average loudness around salient events, separated by strength of salience. c Average loudness around salient events, separated by scene density. All error bars and shaded areas represent ±1 standard error.

2.6 a Number of events in each category, and the loudness change associated with each (bottom, ∗ indicates p < .05, ∗∗ indicates p < .0001). b Feature change associated with specific categories of events (∗ indicates p < .05, ∗∗ indicates p < .001). All error bars represent ±1 standard error.

2.7 a An example of a stimulus that produces a time-varying response. A repeated knocking sound elicits a lower response to the second knock. b Effects of strong loudness increases on neighboring loudness increases. Statistically significant differences in the response to these acoustic changes are labeled by a ∗.

2.8 a Comparison between different methods of predicting salient events. The inset shows the maximum number of events predicted by each individual feature, and by the combination of all features. Features in order are: 1 - Flatness, 2 - Irregularity, 3 - Bandwidth, 4 - Pitch, 5 - Brightness, 6 - Rate, 7 - Scale, 8 - Harmonicity, 9 - Loudness. b ROC for dense and sparse scenes separately, using the prediction with the highest performance.

2.9 Effects of adding opposing scene information. a Examples of prediction performance for two specific scenes, with and without features from the opposing scene. b ROC over all control scenes.

2.10 a Average pupil size following events, z-score normalized and detrended. The black bar represents the region where the pupil size is significantly greater than zero. b Percentage of pupil size increases that are near salient events or other peaks in the slope of salience. c Change in loudness associated with increases in pupil size. Error bars represent ±1 standard error.

3.1 A. Schematic of experimental paradigm. Participants are asked to attend to a screen and perform a rapid serial visual presentation (RSVP) task. Concurrently, a melody plays in the background and subjects are asked to ignore it. In “salient” trials (shown), the melody has an occasional deviant note that did not fit the statistical structure of the melody. In “control” trials (not shown), no deviant note is present. B. Power spectral density of a sample melody. Notes forming the rich melodic scene overlap but form a regular rhythm at 3.33 Hz. Only one note in the melody deviates from the statistical structure of the surround. C. Grand average EEG power shows a significant enhancement upon the presentation of the salient note. The enhancement is particularly pronounced around 3-3.5 Hz, a range including the stimulus rate 3.33 Hz (1/0.3 s).

3.2 A. The analysis of spectral power compares phase-locking at various frequency bins for salient notes vs. the same notes when not salient (top), and compares various degrees of salience in pitch, intensity and different instruments (timbre). B. The same analysis is performed on phase-coherence across trials at different frequency bands.

3.3 A. Summary of interaction weights based on phase-locking response and coherence results as outlined in Table 3.1. Solid lines indicate 2-way, dashed lines 3-way and dotted lines 4-way interactions. Effects that emerge for both measures are shown in black, and those that are found for at least one measure are shown in gray. B. Summary of interaction weights based on behavioral responses. Solid lines indicate 2-way, dashed lines 3-way and dotted lines 4-way interactions. Black lines indicate effects that emerge from the behavioral results for the stimulus in this work. Gray lines indicate effects that emerge from behavioral results for speech and nature stimuli tested with the same experimental design as music stimuli.

3.4 Distribution of neural effects as a function of overall salience. Both phase-locking magnitude and cross-trial coherence increase with greater stimulus salience. The x-axis denotes the number of features in which the deviant has a high level of difference from the standards, with no deviation in controls. 1 corresponds to deviants with the lowest level of salience: no difference in timbre, 2 dB difference in intensity, and 2 semitone difference in pitch. Any change in timbre is considered a high level of salience.

3.5 A. Estimate of ERP peak over different windows of interest (P50, N1, MMN and P3a) comparing notes when salient vs. not. The same analysis compares salient notes with different acoustic attribute levels. B. Estimated STRF averaged across subjects, computed for overall control vs. salient notes. C-E STRFs estimated from salient notes that belong to specific levels of tested features, or specific foreground timbres.

4.1 Stimulus paradigm during EEG recording. Two stimuli are presented concurrently in each trial: a recording of a natural audio scene (top), which subjects are asked to ignore; and a periodic tone sequence (bottom), which subjects pay attention to and detect presence of occasional modulated tones (shown in orange). A segment of one trial is shown along with EEG recordings, where analyses focus on changes in neural responses due to presence of salient events in the ambient scene or target tones in the attended scene.

4.2 Phase-locking results. (A) Spectral density across all stimuli. The peak in energy at the tone presentation frequency is marked by a red arrow. Inset shows average normalized tone-locking energy for individual electrodes. (B) Spectral density around target tones (top) and salient events (bottom). Black lines show energy preceding the target or event, while colored lines depict energy following. Note that target tones are fewer throughout the experiment leading to lower resolution of the spectral profile. (C) Change in phase-locking energy across target tones, non-events, and salient events. (D) Change in tone-locking energy across high, mid, and low salience events. Error bars depict ±1 SEM.

4.3 Reconstruction of ignored scene envelopes from neural responses before and after salient events for high, mid and low salience instances. Correlations between neural reconstructions and scene envelopes are estimated using ridge regression (see Methods). Error bars depict ±1 SEM.

4.4 High gamma band energy results. (A) Time frequency spectrogram of neural responses aligned to onsets nearest modulated targets, averaged across central and frontal electrodes. Contours depict the highest 80% and 95% of the gamma response. (B) Time frequency spectrogram of tones nearest salient events in the background scene. Contours depict the lowest 80% and 95% of the gamma response. (C) Change in energy in the high gamma frequency band (70-110 Hz) across target tones, non-events, and salient events relative to a preceding time window. (D) Change in high gamma band energy across high, mid, and low salience events. Error bars depict ±1 SEM.

4.5 Neural sources. (A) Canonical correlation method for comparing network activation patterns between different overt and covert attention conditions. (B) Canonical correlation values comparing neural activity patterns after salient events (x-axis) and target tones (y-axis). The contour depicts all correlation values with statistical significance less than p < 0.01. (C) Example overlapping network activity between the response after salient events and the response after target tones (at the point shown with an asterisk in panel B). The overlap is right-lateralized and primarily located within the superior parietal lobule (SPL), the inferior frontal gyrus (IFG), and the medial frontal gyrus (MFG).

4.6 Event Prediction Accuracy. Average event prediction accuracy (area under the ROC curve) using only high gamma band energy, only tone-locking energy, and both features. Error bars depict ±1 SEM.

5.1 Structure of the convolutional neural network and signals analyzed. A The convolutional neural network receives the time-frequency spectrogram of an audio signal as input. It is composed of convolutional and pooling layers in an alternating fashion, followed by fully connected layers. B An example section of an acoustic stimulus (labeled Audio), along with corresponding neural network activity from five example units within one layer of the CNN. A network surprisal measure is then computed as the Euclidean distance between the current activity of the network nodes at that layer (shown in red) and the activity in a previous window (shown in gray with label 'History'). Measures of behavioral salience by human listeners (in green) and cortical activity recorded by EEG (in brown) are also analyzed.

5.2 Correlation between neural network activity and acoustic features. A Correlation coefficients between individual acoustic features and layers of the neural network. Loudness, harmonicity, irregularity, scale, and pitch are the most strongly correlated features overall. B Average correlation across acoustic features and layers of the neural network. Shaded area depicts ±1 standard error of the mean (SEM). Inset shows the slope of the trend line fitted with a linear regression. The shaded area depicts 99% confidence intervals of the slope.

5.3 CNN surprisal and behavioral salience. A Correlation between CNN activity and behavioral salience. B Cumulative variance explained after including successive layers of the CNN. The gray line shows a baseline level of improvement estimated by using values drawn randomly from a normal distribution for all layers beyond Pool2. For both panels, shaded areas depict ±1 SEM. Insets show the slope of the trend line fitted with a linear regression, with shaded areas depicting 99% confidence intervals of the slope.

5.4 Cumulative variance explained after including successive layers of the CNN for specific categories of events. A Speech events, B Music events, C Vehicle events, D Tapping/striking events. The gray line shows a baseline level of improvement estimated by using values drawn randomly from a normal distribution for all layers beyond Pool2. For all panels, shaded areas depict ±1 SEM. Insets show the slope of the trend line fitted with linear regression, with shaded areas depicting 99% confidence intervals of the slope.

5.5 Event prediction performance. Predictions are made using LDA on overlapping time bins across scenes. The area under the ROC curve is .775 with a combination of acoustic features and surprisal, while it reaches only .734 with acoustic features alone.

5.6 Analysis of dense vs. sparse scenes. A Difference in correlation between salience and CNN activity for dense and sparse scenes. Negative values indicate that salience in sparse scenes was more highly correlated with CNN activity. B Difference in correlation between acoustic features and CNN activity for dense and sparse scenes.

5.7 Correlation between neural network activity and energy in EEG frequency bands. A Correlation coefficients between individual EEG frequency bands and layers of the neural network. Gamma, beta, and high-gamma frequency bands are the most strongly correlated bands overall. B Average correlation across EEG activity and layers of the neural network. Shaded area depicts ±1 SEM. Inset shows the slope of the trend line fitted with linear regression, with a shaded area depicting the 99% confidence interval of the slope.

5.8 Correlation between neural network activity and energy in EEG frequency bands for specific electrodes. A Average correlation between electrode activity across frequency bands for electrodes in Central (near Cz) and Frontal (near Fz) regions. B Correlation between beta/gamma band activity for individual electrodes and convolutional/deep layers of the neural network.

Chapter 1

Introduction

Imagine pulling up to a stop sign in your car. Your friend is telling you a story in a loud voice to overcome the music playing from your car speakers and the sounds of kids playing nearby. Your car engine hums quietly as you come to a stop and starts revving up again as you begin to enter the intersection. Suddenly, a car horn blares to your left. Without thinking, you quickly slam on the brakes just in time to see another driver rush by mere inches from the front end of your vehicle.

This automatic and critical shift in attention is the result of auditory salience, also known as bottom-up attention. In contrast to top-down attention, salience guides attention towards particularly important stimuli without requiring conscious effort. By being automatic, salience not only saves cognitive resources, but also reduces reaction times and overcomes distractors.

A knowledge of what sounds are salient to the human auditory system is instrumental for advancements in a multitude of fields. Salience is used in acoustic scene classification to identify objects of interest. In other words, bottom-up attention can be used to guide computers to process sounds the way that humans do. Salience can be used to improve assisted hearing devices by amplifying or otherwise highlighting the most relevant sounds over less important acoustic signals. It can be used to create sounds that draw attention, such as alarms that are salient without being uncomfortably loud.

The existing body of literature in the field of auditory salience is still in its infancy compared to that of visual salience, and in fact the literature that does exist borrows heavily from ideas developed in the visual domain. However, auditory salience also presents some significant challenges that are distinct from the challenges faced in studying vision. Most prominently, while eye gaze has proven to be an effective marker of visual attention (Wolfe and Horowitz, 2004b; Itti and Koch, 2001c; Tatler et al., 2011; Zhao and Koch, 2013a; Borji, Sihite, and Itti, 2013a; Bruce et al., 2015), no analogous response exists for determining the focus of auditory attention. In addition, auditory objects in an acoustic signal are not as readily separated spatially as visual objects can be in an image. Finally, auditory signals necessarily unfold over time, which can make data collection very time consuming, and which also means that previous context can play an important role in determining salience. Presented in this dissertation are solutions for these challenges and more, resulting in a more complete understanding of auditory salience in natural scenes.

1.1 Existing Auditory Salience Literature

The first major study published with a focus on auditory salience was performed by Kayser et al. (2005a). The study included two salience measurement methods with brief natural sound recordings. First, subjects rated salience directly by listening to two stimuli sequentially and subsequently reporting which one they found more salient. Second, salience of these audio recordings was measured by how easily they were detected in a naturalistic noise background, with the idea that more salient sounds are more easily detected.

The two salience measures used by Kayser et al. (2005a) highlight the dichotomy between asking for a salience rating after the fact and obtaining an immediate response using a behavioral task. While the former method has the benefit of simplicity and being more straightforward to set up, it also has the drawback of more potential influence by top-down attention (Kim et al., 2014a).

The Kayser study also prominently displays how models of auditory salience have borrowed ideas from studies of visual salience. In essence, the Kayser model uses the time-frequency spectrogram of a sound as an image, and applies a modification of a seminal visual salience model developed by Itti, Koch, and Niebur (1998) to that spectrogram to determine salience. In brief, the model performs feature extraction on the spectrogram and then applies center-surround filters to find salient portions of the audio. Other auditory salience studies have been influenced by this model as well, either directly by using a variation of the original model (Duangudom and Anderson, 2007a), or indirectly by employing the same strategy of looking for contrasts in basic features (Kaya and Elhilali, 2014b).

One aspect of salience not emphasized in the Kayser study is the influence of long-term acoustic history and context, as the stimuli used were at most 1.2 seconds long. In order to explore these effects, later studies have incorporated scenes of longer duration. Some of these scenes have been artificial in nature and were constructed by piecing together short audio snippets into a longer scene (Duangudom and Anderson, 2007a; Duangudom and Anderson, 2013a; Kaya and Elhilali, 2014b; Shuai and Elhilali, 2014; Southwell et al., 2017).

Using this construction method allows for prior knowledge of what portions of the scene are expected to be salient, while also enabling acoustic parameters to be systematically varied. However, there is still an element of artificiality to these stimuli.

The least commonly used stimuli overall have been extended natural scenes. Natural recordings are closest to the acoustic scenes that we encounter in our everyday lives. While difficult to study, these stimuli contain semantic and contextual information that is unavailable in the previous stimuli. The literature using natural auditory scenes that does exist relies primarily on manual labeling to determine the salient portions (Kalinli and Narayanan, 2007a; Kim et al., 2014a), again risking the influence of voluntary attention.

Ideally, salience would be measured by some sort of physiological effect that is not under direct voluntary control. One example of such a measurement is pupil dilation. Besides an inverse correlation with the luminosity of visual stimuli, pupil size is also directly affected by arousal. Perhaps due to this relationship, pupil size has been shown to correlate with the salience of auditory stimuli (Liao et al., 2016b; Huang and Elhilali, 2017). However, these studies have also shown evidence that loudness may be the predominant factor in pupil dilation.

Alternatively, neural responses can be directly recorded as a measure of salience. External recordings of neural signals reflect a myriad of neurological processes, including attentive states. However, one of the major difficulties lies in isolating the signal of interest from this amalgamation.

While there are many methods of recording neural activity, two common modalities for collecting such data non-invasively are Functional Magnetic Resonance Imaging (fMRI) and Electroencephalography (EEG). Recordings made by fMRI have better spatial resolution; on the other hand, EEG has better temporal resolution, and so may be more suitable for a dynamic auditory scene. Seifritz et al. (2002) discovered a neurological bias towards approaching sounds as opposed to receding sounds using fMRI, while Lewis et al. (2012) found distinct areas of activation for salient acoustic objects compared to background-like audio. With EEG, Shuai and Elhilali (2014) and Southwell et al. (2017) found markers of attentional shifts in response to salient notes in a constructed musical sequence.

Table 1.1 presents a summary of the existing work in auditory salience, divided by stimuli used and data collection methods. References in red are published papers that are presented as parts of this dissertation. Of note, these works are the only existing studies of auditory salience in long natural scenes using behavioral and neural measurements of salience.

1.2 Contributions of this Thesis

This dissertation addresses key deficiencies in our knowledge of auditory salience, particularly in the difficult context of natural scenes.

Manual Labelling:
   Isolated Sounds: Kayser et al., 2005a; Liao et al., 2016b
   Constructed Scenes: Duangudom and Anderson, 2007a
   Natural Scenes: Kalinli and Narayanan, 2007a; Kim et al., 2014a
Behavioral Measurement:
   Isolated Sounds: Kayser et al., 2005a; Tordini et al., 2015a
   Constructed Scenes: Duangudom and Anderson, 2013a; Kaya and Elhilali, 2014b
   Natural Scenes: Huang and Elhilali, 2017
Neural Data:
   Isolated Sounds: Ghazanfar et al., 2002; Seifritz et al., 2002; Lewis et al., 2012
   Constructed Scenes: Shuai and Elhilali, 2014; Southwell et al., 2017; Kaya, Huang, Elhilali
   Natural Scenes: Huang, Slaney, and Elhilali, 2018; Huang and Elhilali, 2018

Table 1.1: Breakdown of the existing literature in Auditory Salience. References in red are published works presented in this dissertation, while those in blue are works presented in this dissertation which have been submitted for peer review.

A natural scene database for auditory salience study is presented in Chapter 2. A behavioral measurement of salience is developed to serve as the ground truth. In addition to basic acoustic feature analysis, a model of salience based on changes in acoustic features is also presented, which performs as well as or better than all existing models.

Neural markers of salience as recorded by EEG are identified and quantified in Chapters 3 and 4. Stimulus entrainment, phase-coherence, and gamma band activity are all found to be modulated by the presentation of salient stimuli. Effects of top-down and bottom-up attention are also compared and contrasted in Chapter 4. Although certain effects from these two mechanisms may have opposing polarities, both processes engage some of the same brain regions.

The effects of changes in sound category on auditory salience are explored in Chapter 5. Categorical information is extracted from a deep neural network trained for audio classification, and is shown to account for some portion of salience not covered by acoustic features alone. In addition, the neural network is shown to have a hierarchical representation, with greater similarity to acoustic features at earlier layers and stronger correlation to human cortical activity at later layers.

1.2.1 Natural Scene Salience Database, Measurement, Acoustic Analysis, and Modeling

In an effort to overcome the difficulty of studying salience in natural scenes, Chapter 2 presents a method of measuring salience using a behavioral task with immediate response. In this dichotic listening task, subjects listen to two scenes at a time, one in each ear, and report to which scene they are attending using a mouse in a continuous fashion.

Using this salience measurement method, a database of acoustic scenes has been constructed with associated salience measurements. This dataset can be used to further knowledge of auditory salience in the same way that image datasets have advanced the field of visual salience. In addition, the database has since been shown to be easily extended through online data collection methods (Amazon’s Mechanical Turk).

One point of contention within the existing literature is whether salience is simply a product of loudness or if other acoustic features and factors play a role in determining salience. A study by Kim et al. (2014a) found that acoustic features other than loudness did not improve their prediction of salience. In contrast, a study by Tordini et al. (2015a) argues that while loudness is a dominating feature, brightness and tempo also play a role.

Chapter 2 also sheds some light on this controversy by examining variation in salience among a wide range of contexts. Specifically, salience near speech and music events exhibits a greater contribution from acoustic features other than loudness compared to events in other categories; for instance, speech events are marked by a strong increase in pitch strength. In addition, comparison between sparse and dense scenes indicates that salience within scenes with a low density of acoustic objects is more completely explained by simple acoustic features such as loudness. In contrast, scenes with a high density of overlapping acoustic objects may be more impacted by context and semantic information. These analyses show that the chosen stimuli can have a dramatic effect on results.

1.2.2 Neural Markers of Auditory Salience

Chapters 3 and 4 present potential markers of auditory salience within neural recordings. The former examines salience systematically using constructed musical sequences, while the latter explores salience within the context of natural scenes. Both studies direct top-down attention away from their respective auditory stimuli of interest in order to isolate the effects of bottom-up attention. Several markers of attention are identified within these two chapters that can be used to determine when attention to the task-related stimulus is diverted due to a salient event in the ignored auditory scene.

One such marker of attention is the stimulus-locked response, which is a neural response that is phase-locked to the presentation rate of a stimulus (Krishnan, 2002; Nozaradan et al., 2011; Power et al., 2012). This phase-locked response is modulated by attention, in that it is stronger to an attended stimulus than to an unattended one (Galbraith et al., 1998; Galbraith and Doan, 1995; Shuai and Elhilali, 2014). As such, modulation in this response is shown here to indicate shifts in attention.

In a similar vein, envelope decoding has been used to determine attentional states (O’Sullivan et al., 2015; Treder et al., 2014; Ekin et al., 2016; Fuglsang, Dau, and Hjortkjær, 2017). Essentially, a decoder is trained that translates neural signals into the envelope of the corresponding audio input. The quality of the resulting decoded envelope is higher for attended stimuli compared to unattended stimuli. In this way, which sound a person is attending to can be inferred by how well the envelope is decoded.

Another potential marker of attentional shifts is phase coherence, which is the consistency in phase information across trials within specific EEG frequency bands. Phase coherence in the theta band increases after a salient event occurs, as shown in Chapter 3.

Finally, attention also modulates activity in the gamma frequency bands (Gurtubay et al., 2001; Howard et al., 2003; Jensen, 2001). Gamma activity is shown to decrease when a salient distractor occurs in Chapter 4. Further source localization shows a similar area of neural activation for top-down and bottom-up attention, with a small time delay for the latter, possibly indicating a redirection of attention to the target stimulus after distraction.

1.2.3 Effects of Audio Class on Salience: Neural Network Analysis

As shown in Chapter 2, information about audio object category can contribute to a full understanding of salience. However, such information is not easily extracted automatically on a point-by-point basis. Recent advances in machine learning with Convolutional Neural Networks (CNNs) have shown that these systems can be trained to classify images at a high level of accuracy (Krizhevsky, Sutskever, and Hinton, 2012; Simonyan and Zisserman, 2015). Such networks have now been trained for audio classification (Hershey et al., 2017; Jansen et al., 2017).

Chapter 5 characterizes the activity of one of these artificial neural networks trained for audio object classification. In addition, their potential use in modeling salience is explored. As in the visual literature (Lee et al., 2009), CNN activity at earlier layers shows stronger correlation with basic acoustic features. However, at higher layers the CNN activity corresponds more closely to the EEG data from Chapter 4, particularly at higher frequency bands (e.g., beta, gamma).

Finally, although salience is overall more similar to the response of the neural network at earlier layers, higher layers are shown to contain information about the scene that is vital to our understanding of salience. This information is not fully captured within the lower layers, and is particularly important for speech and music events. These more complex events are not as well characterized solely by basic acoustic features.

1.3 Summary

In summary, this dissertation provides novel measurement methods for auditory salience and a more complete understanding of salience as it relates to natural scenes, as well as a solid framework for continued study. In addition to a database of natural scenes with corresponding behavioral measurements of salience, neural markers of salience are also identified. Further, top-down and bottom-up attentional networks in the brain are compared and contrasted.

Finally, information about sound category as extracted by a deep neural network is found to contribute to a more complete explanation of salience than acoustic features alone.

Chapter 2

Salience Database, Measurement, Acoustic Analysis, and Modeling

This chapter presents a database of natural scenes to serve as a foundation for studying auditory salience. A continuous behavioral measurement is developed and used to establish ground truth salience given the lack of a common measurement standard in the literature. Salience is then studied in a data-driven fashion, with an analysis of what acoustic features give rise to salience as well as a model to predict salient events based on changes in acoustic features. The contents of this chapter have been published in the Journal of the Acoustical Society of America (Huang and Elhilali, 2017).

2.1 Introduction

Attention is at the center of any study of sensory information processing in the brain (Itti, Rees, and Tsotsos, 2005). It describes mechanisms by which the brain focuses both sensory and cognitive resources on important elements in the stimulus (Driver, 2001). Intuitively, the brain has to sort through the flood of sensory information impinging on its senses at every instance and put the spotlight on a fraction of this information that is relevant to a behavioral goal. For instance, carrying a conversation in a crowded restaurant requires focusing on a specific voice to isolate it from a sea of competing voices, noises and other background. It is common in the literature to employ the term attention to refer to “voluntary attention” or “top-down attention” that is directed to a target of interest (Baluch and Itti, 2011a).

A complementary aspect to these processes is bottom-up attention, also called salience or saliency, which describes qualities of the stimulus that attract our attention and make certain elements in the scene stand out relative to others (Treue, 2003a). This form of attention is referred to as bottom-up since it is stimulus-driven and fully determined by attributes of the stimulus and its context. In many respects, salience is involuntary and not dictated by behavioral goals. A fire alarm will attract our attention regardless of whether we wish to ignore it or not. While salience is compulsory, it is modulated by top-down attention. As such, the study of salience alone away from influences of top-down attention is a challenging endeavor.

Much of what we know today about salience in terms of neural correlates, perceptual qualities and theoretical principles comes from vision studies (Carrasco, 2011; Itti and Koch, 2001a). When an object has a different color, size or orientation relative to neighboring objects, it stands out more, making it easier to detect. Behavioral paradigms such as visual search or eye tracking tasks are commonly used to investigate the exact perceptual underpinnings of visual salience (Wolfe and Horowitz, 2004a; Zhao and Koch, 2013b; Borji et al., 2015; Parkhurst, Law, and Niebur, 2002). This rich body of work has resulted in numerous resources such as standard databases which enable researchers to probe progress in the field by working collectively to replicate human behavior in response to different datasets such as still natural images, faces, animate vs. inanimate objects, videos, etc. (Borji, Sihite, and Itti, 2013b). Using common references not only contributes to our understanding of salience in the brain, but has tremendous impact on computer vision applications ranging from robotics to medical imaging and surveillance systems (Borji and Itti, 2013a; Frintrop, Rome, and Christensen, 2010).

In contrast, the study of salience in audition remains in its infancy (see review (Kaya and Elhilali, 2016)). Among the many issues hampering the study of auditory salience is a lack of agreed upon behavioral paradigms that truly probe the involuntary, stimulus-driven nature of salience. Most published studies used a detection paradigm where a salient object is embedded in a background scene, and subjects are asked to detect whether a salient event is present or not; or to compare two scenes and detect or match salience levels across the two scenes (Kayser et al., 2005b; Tsuchida and Cottrell, 2012a; Duangudom and Anderson, 2007b; Kaya and Elhilali, 2014a). In these setups, salient events are determined by an experimenter based on a clear distinction from the background scene. As such, their timing provides a ground truth against which accompanying models are tested, and relevance of various features is assessed. A complementary approach favored in other studies used annotation, where listeners are asked to mark the timing of salient events in a continuous recording (Kim et al., 2014b). This paradigm puts less emphasis on a choice of events determined by the experimenter, and adopts a more stimulus-driven approach that polls responses across subjects to determine salience. Still, this approach presents its own limitation in terms of controlling top-down and cognitive factors since listeners are presented with a single scene whose context might bias their judgment of stimulus-driven salience.

A related challenge to the study of auditory salience is a lack of standard datasets and stimulus baselines that guide the development of theoretical frameworks and explain why certain sound events stand out in a soundscape.

Of the few studies conducted on auditory salience, most have used well-controlled stimuli, like simple tones (Duangudom and Anderson, 2013b) or short natural recordings in controlled uniform backgrounds (Duangudom and Anderson, 2007b; Kayser et al., 2005b; Kaya and Elhilali, 2014a). A few studies incorporated richer natural scenes using recordings from the BU Radio News Corpus (Kalinli and Narayanan, 2007b) and the AMI Corpus (Kim et al., 2014b). Still, both of the latter databases consist of scenes with relatively homogeneous compositions, both in terms of audio class (mostly speech interspersed with other sounds) and density (mostly sparse with rare instances of overlapping sound objects). The lack of a diverse and realistic selection of natural soundscapes in the literature results in a narrow view of what drives salience in audition. In turn, theoretical frameworks developed to model empirical data remain limited in scope and fail to generalize beyond the contexts for which they were developed (Kaya and Elhilali, 2016).

The current study aims to break this barrier and offer a baseline database for auditory salience from an unconstrained choice of stimuli. This collection includes a set of complex natural scenes, which were selected with a goal of representing as many types of real world scenarios as possible. The database was not developed as an underlying paradigm to accompany a specific model. Rather, it was designed to challenge existing models of auditory salience and stimulate investigations into improved theories of auditory salience. The sound dataset and its accompanying behavioral measures are intended for public use by the research community and can be obtained by contacting the corresponding author.

The experimental paradigm collected behavioral responses from human volunteers while listening to two competing natural scenes, one in each ear. Subjects indicated continuously which scene attracted their attention. This paradigm was chosen to address some of the limitations of existing approaches in studies of auditory salience. Salience judgments are determined by the scenes themselves with no prearranged placement of specific events or salience ground truth. Moreover, employing two competing scenes in a dichotic setting dissipates effects of top-down attention and allows salient events to attract listeners’ attention in a stimulus-driven fashion; hence expanding on the annotation paradigm previously used by Kim et al. (Kim et al., 2014b). Still, studying salience using these sustained natural scenes comes with many limitations and challenges. Firstly, it is impossible to fully nullify the effects of voluntary attention in this paradigm. Subjects are actively and continuously indicating scenes that attract their attention. To mitigate this issue, the use of a large number of listeners and assessment of behavioral consistency across responses averages out voluntary attentional effects, and yields consistent response patterns reflecting signal-driven cues. Secondly, certain sounds may induce preferential processing relative to others. The use of real world recordings inherently provides context that may or may not be familiar to some listeners. That is certainly the case for speech sounds, certain melodies (for subjects who are musicians or music-lovers) as well as familiar sounds.

While this is an aspect that again complicates the study of auditory salience for subsets of listeners, the use of a large pool of volunteers will highlight the behavior of average listeners, leaving the focus on specific sounds of interest as a follow-up analysis. In addition, it is the complex interactions in a realistic setting that limit the generalization of salience models developed in the laboratory to real applications. Thirdly, two competing auditory scenes are presented simultaneously, to further divide the subjects’ voluntary attention. As such, the study presents an analysis of salience ‘relative’ to other contexts.

By counter-balancing relative contexts across listeners, we are again probing average salience of a given sound event or scene regardless of context. Still, it is worth noting that the use of dichotic listening may not be very natural but does simulate the presence of competing demands on our attentional system in challenging situations. Finally, subjects are continuously reporting which scene attracts their attention, rather than providing behavioral responses after the stimulus. By engaging listeners throughout the scene presentation, the paradigm aims to minimize effects of different cognitive factors, including voluntary attention, that can cloud their report. In the same vein, this study focuses on the onset of shifts in attention, before top-down attention has time to ‘catch up.’ Throughout the paradigm, pupil dilation is also acquired, hence providing a complementary account of salience that is less influenced by top-down attentional effects.

In this report, we provide an overview of the dataset chosen as well as analyses of the bottom-up attention markers associated with it. Section 2.2 outlines the experimental setup, behavioral data collection as well as acoustic features used to analyze the stimulus set. Section 2.3 presents the results of the behavioral experiment along with analyses of different types of salient events. The results also delve into an acoustic analysis of the dataset in hopes of establishing a link between the acoustic structure of the scene stimuli and the perceived salience of events in those scenes. We conclude with a discussion of the relevance of these findings to studies of bottom-up attention and auditory perception of natural soundscapes.

2.2 Methods

2.2.1 Stimuli

A set of twenty recordings of natural scenes was gathered from various sources including the BBC Sound Effects Library (The BBC Sound Effects Library - Original Series), Youtube (Youtube 2016), and the Freesound database (The Freesound Project 2016). Scenes were selected with a goal of covering a wide acoustic range. This range included having dense and sparse scenes, smooth scenes and ones with sharp transitions, homogeneous scenes versus ones that change over time, and so forth. A variety of sound categories was included, such as speech, music, machine and environmental sounds. All recordings were carefully selected to maintain a subjectively good audio quality. An overview of all scenes included in this dataset is given in Table 2.1.

 #   Dur   Sparse  Control  Description      Source
 1   2:00  No      No       Nature, Birds    Youtube.com/watch?v=U2QQUoYDzAo
 2   2:15  Yes     No       Sporting Event   BBC:/Crowds/CrowdsExterior.BC.ECD48c
 3   2:00  No      No       Cafeteria        BBC:Ambience/Cafes/CafesCoffees.BBC.ECD40b
 4   2:02  Yes     No       Battle, Guns     Youtube.com/watch?v=TVkqVIQ-8ZM
 5   1:21  Yes     No       Airplane, Baby   Soundsnap.com/node/58458
 6   1:13  No      No       Fair             Freesoundeffects.com/track/carnival-street-party—42950/
 7   1:59  No      No       Classical Music  Youtube.com/watch?v=pKOpdt9PYXU
 8   2:01  Yes     No       Bowling Alley    BBC:Ambience/America/America24t
 9   2:00  No      No       Egypt Protests   Youtube.com/watch?v=H0ADRvi0FW4
10   2:08  No      No       Store Counter    BBC:Ambience/America/America24k
11   1:57  No      No       Drum Line        Youtube.com/watch?v=c4S4MMvDrHg
12   1:59  No      No       Blacksmith       Youtube.com/watch?v=Ka7b-cicOpw
13   2:13  No      No       Orchestra        Youtube.com/watch?v=6lwY2Dz15VY
14   2:07  No      No       Japanese Show    Youtube.com/watch?v=I5PS8zHBbxg
15   2:22  No      No       Dog Park         Freesound.org/people/conleec/sounds/175917/
16   2:14  No      No       Casino           Freesound.org/people/al_barbosa/sounds/149341/
17   2:00  No      Yes      Piano Concerto   Youtube.com/watch?v=ePFBZjVwfQI
18   1:22  Yes     Yes      Car Chase        BBC:Trans/Land/Vehicles/Motor/MotorCycleRacing17c
19   1:59  Yes     Yes      Maternity Ward   BBC:Emergency/Hospitals/MaternityWard
20   1:19  No      Yes      Sailing Barge    BBC:Trans/Water/Ships/ShipsSailing.BBC.ECD22b

Table 2.1: List of the twenty audio recordings of natural scenes used in this study and suggested for use in future studies. Scenes were selected to cover a wide range of acoustic environments. Scenes were labeled subjectively as sparse or dense in regards to the density of acoustic objects within.

Each scene had a duration of approximately two minutes (average of 116 ± 20 seconds). All wave files were downsampled to 22050 Hz with a bitrate of 352 kbps. One of the scenes (scene number 13) was originally recorded in stereo, from which only the left channel was used. Scenes were normalized based on the RMS (root mean square energy) of the loudest 1% of each wave file. Alternative normalizations (such as RMS across the entire waveform or peak normalization) were deemed inappropriate for regulating the scene amplitudes because they made the sparse scenes too loud and dense scenes too quiet or the reverse.
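The normalization step can be illustrated with a short sketch. The study's own processing was done in MATLAB; the minimal NumPy version below is only an illustration, where the target RMS level and the sample-wise reading of "loudest 1% of each wave file" are assumptions rather than values taken from the study.

```python
import numpy as np

def normalize_by_loudest_fraction(signal, fraction=0.01, target_rms=0.1):
    """Scale a waveform so that the RMS of its loudest `fraction` of samples
    equals `target_rms` (both parameter values are illustrative)."""
    x = np.asarray(signal, dtype=float)
    power = x ** 2
    n_loud = max(1, int(np.ceil(fraction * power.size)))
    loudest = np.sort(power)[-n_loud:]        # largest 1% of squared samples
    rms_loud = np.sqrt(loudest.mean())
    return x * (target_rms / rms_loud)
```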

19 Figure 2.1: a Presentation screen for the behavioral task. Subjects were asked to listen to one natural scene in each ear, and to move the cursor to the side corresponding to the scene that they were focusing on at any given time. (Text on top was not part of the presentation screen). b Example waveforms of two stimuli presented in a given trial along with behavioral response of one subject shown in the center. c The average response across subjects and competing scenes is shown for waveform 2. Salience is denoted as the percentage of subjects attending to that scene at any given time. Peaks in the height of the slope of the average salience are marked as salient events.

2.2.2 Behavioral Data Collection

Fifty healthy volunteers (ages 18-34, 16 males) with no reported history of hearing problems participated in this study. Subjects were asked if they had any musical training, for example if they had played an instrument, and if so how many years of training. Subjects were compensated for their participation and all experimental procedures were approved by the Johns Hopkins University Homewood Institutional Review Board (IRB).

Behavioral data were recorded using Experiment Builder (SR Research Ltd., Oakville, ON, Canada), a software package designed to interface with the EyeLink 1000 eye tracking camera. Subjects were seated with their chins and foreheads resting on a headrest. Subjects were instructed to maintain fixation on a cross in the center of the screen while they performed the listening task. The eye tracker primarily recorded each participant’s pupil size over the course of the experiment. It was calibrated at the beginning of the experiment, with a sequence of five fixation points. Dichotic audio stimuli were presented over Sennheiser HD595 headphones inside a sound proof booth. Sounds were presented at comfortable listening levels, though subjects were able to notify the experimenter if any adjustments were needed. Pupil size and mouse movements were both initially recorded at a sampling frequency of 1 kHz. Mouse movements were later downsampled by a factor of 64. Data were analyzed using MATLAB software (Mathworks, Natick, MA, USA).

Behavioral responses were collected using a dichotic listening paradigm. Subjects were presented with two different scenes simultaneously, one in each ear. Subjects were instructed to indicate continuously which of the two scenes they were focusing on by moving a mouse to the right or the left of the screen. Subjects were not instructed to listen for any events in particular, but simply to indicate which scene (if any or both) they were focusing on at any given time. A central portion of the screen was reserved for when a subject was either not focusing on either scene or attending to both scenes simultaneously. Subjects were not instructed to differentiate between locations within each region. As such, participants were limited to three discrete responses at any given point in time: left, right, or center. Subjects were also instructed to fixate their gaze on the cross in the center of the screen, in order to avoid effects of eye movements on pupil size. A diagram of the screen is shown in Figure 2.1a.

Before data collection began, subjects were presented with a brief, fifteen second example stimulus, consisting of a pair of natural scenes distinct from the twenty used in the main experiment. The same example was played a second time with a corresponding cursor recording, to serve as one possibility of how they could respond.

Two pairs from the set of twenty scenes used in this study were chosen as ‘control’ scenes, and were always presented with the same opposing wave file. Thus, each of the four control scenes was heard exactly once by each subject (50 trials across subjects). The remaining sixteen scenes were presented in such a way that, across subjects, each scene was paired with each other scene roughly an equal number of times. Each individual subject heard each of these scenes at most twice (total of 62 or 63 trials per scene across 50 subjects), with an average of six scenes presented a second time to each subject. A trial ended when either of the two competing scenes concluded, with the longer scene cut short to that time. There was a fixed rest period of five seconds between trials, after which subjects could initiate the next trial at their leisure.

In parallel, we collected data from 14 listeners in a passive listening mode. In these passive sessions, no active behavioral responses were collected and subjects simply listened to the exact stimuli as described earlier. Only pupil size was collected using the eye tracker.

2.2.3 Analysis of Behavioral Data

Individual behavioral responses were averaged across subjects for each scene by assigning a value of +1 when a participant attended to the scene in question, -1 when a participant attended to the competing scene, and 0 otherwise (Figure 2.1b). Choosing a different range to distinguish the three behavioral states (instead of {+1,0,-1}) resulted in qualitatively similar results.

The mean behavioral response across all subjects will henceforth be referred to as the average salience for each scene (since it averages across all competing scenes). The derivative of this average salience curve was computed and smoothed with three repetitions of an equally-weighted moving average with a window length of 1.5 seconds (Smith, 1997). This derivative curve captured the slope of the average salience and was used to define notable moments in each scene. Peaks in the derivative curve were marked as onsets of potential events (Figure 2.1c). A peak was defined as any point larger than neighboring values (within 0.064 seconds). The smoothing operation performed prior to peak picking prevented small fluctuations from being selected. The smoothing and selection parameters were chosen heuristically and agreed with visual inspection of the behavioral responses and peak locations.

A total of 468 peaks were identified across the twenty scenes. These peaks were ranked based on a linear combination of two metrics: the height of the derivative curve (slope height), which reflects the degree of temporal agreement between subjects; and the maximum height of the average salience curve within four seconds following the event onset (absolute consensus), which reflects the percentage of subjects whose attention was drawn by an event irrespective of timing. The four-second range was chosen in order to avoid small false peaks close to the event, as well as peaks far away that may be associated with later events. The linear combination of these two metrics was used in order to avoid the dominance of either particularly salient or particularly sparse scenes. The absolute consensus was scaled by the 75th percentile of the salience slope. The most salient 50% of these peaks (234 events) were taken as the set of salient events used in the analysis reported in this chapter, except where indicated.
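For illustration, the event-extraction procedure described above (smoothed derivative of the average salience curve, peak picking, and ranking by slope height plus scaled absolute consensus) could be sketched in Python roughly as follows. The sampling rate, array names, and thresholds here are assumptions made for the example rather than the exact implementation used in this work.

```python
import numpy as np
from scipy.signal import find_peaks

def extract_salient_events(avg_salience, fs=60.0, win_s=1.5, consensus_win_s=4.0):
    """Sketch: derive salient-event onsets from an across-subject average
    salience trace (values in [-1, 1]); fs (Hz) is an assumed sampling rate."""
    # Derivative of the average salience curve (slope per second)
    slope = np.gradient(avg_salience) * fs

    # Three passes of an equally weighted moving average (1.5 s window)
    win = max(1, int(round(win_s * fs)))
    kernel = np.ones(win) / win
    for _ in range(3):
        slope = np.convolve(slope, kernel, mode="same")

    # Peaks: points larger than their neighbors, with ~0.064 s minimum spacing
    peaks, _ = find_peaks(slope, distance=max(1, int(round(0.064 * fs))))

    # Absolute consensus: maximum of the salience curve within 4 s after onset
    cwin = int(round(consensus_win_s * fs))
    consensus = np.array([avg_salience[p:p + cwin].max() for p in peaks])

    # Rank by slope height plus consensus scaled by the 75th percentile of slope
    scale = np.percentile(slope[peaks], 75)
    scores = slope[peaks] + scale * consensus

    # Keep the most salient 50% of peaks
    keep = scores >= np.median(scores)
    return peaks[keep], scores[keep]
```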

Events were manually classified into specific categories after their extraction. The experimenter listened to several seconds of the underlying scene before and after each event in order to make the classification. The categories of events were speech, music, other human vocalizations (e.g. laughter), animal sounds, sounds created by a device or machinery, and sounds created by an object tapping or striking another object. Each event was individually categorized into one of these six types.

2.2.4 Acoustic Analysis of Stimuli

The acoustic waveform of each scene was analyzed to extract a total of nine features along time and frequency, many of which were derived from the spectrogram of each scene. The time-domain signal s(t) for each scene was processed using the NSL Matlab Toolbox (Chi, Ru, and Shamma, 2005) to obtain a time-frequency spectrogram y(t, x). The model maps the time waveform through a bank of log-spaced asymmetrical cochlear filters (spanning 255 Hz to 10.3 kHz) followed by lateral inhibition across frequency channels and short-term temporal integration of 8 ms.

1. The centroid of the spectral profile in each time frame was extracted as a brightness measure (Schubert and Wolfe, 2006), defined as:

\[ BR(t) = \frac{\sum_x x \, y(t,x)^2}{\sum_x y(t,x)^2} \]

2. The weighted distance from the spectral centroid was extracted as a measure of bandwidth (Esmaili, Krishnan, and Raahemifar, 2004), defined as:

\[ BW(t) = \frac{\sum_x |x - BR(t)| \, y(t,x)}{\sum_x y(t,x)} \]

3. Spectral flatness was calculated as the geometric mean of the spectrum divided by the arithmetic mean (Johnston, 1988), also at each time frame:

\[ FL(t) = \frac{\left( \prod_{x=0}^{N-1} y(t,x) \right)^{1/N}}{\frac{1}{N} \sum_{x=0}^{N-1} y(t,x)} \]

4. Spectral irregularity was extracted as a measure of the difference in strength between adjacent frequency channels (McAdams, Beauchamp, and Meneguzzi, 1999), defined as:

\[ IR(t) = \frac{\sum_x \left( y(t,x+1) - y(t,x) \right)^2}{\sum_x y(t,x)^2} \]

5. Pitch (fundamental frequency) was derived according to the optimum processor method by Goldstein (Goldstein, 1973). Briefly, the procedure operates on the spectral profile at each time slice y(t_0, x) and compares it to a set of pitch templates, from which the best matching template is chosen. The pitch frequency (F_0) is then estimated using a maximum likelihood method to fit the selected template (Shamma and Klein, 2000).

6. Harmonicity is a measure of the degree of matching between the spectral slice y(t_0, x) and the best-matched pitch template.

7. Temporal modulations were extracted using the NSL Toolbox. These features capture the temporal variations along each frequency channel over a range of dynamics varying from 2 to 32 Hz. See (Chi, Ru, and Shamma, 2005) for further details.

8. Spectral modulations were extracted using the NSL Toolbox. These modulations capture the spread of energy in the spectrogram along the logarithmic frequency axis, as analyzed over a bank of log-spaced spectral filters ranging between 0.25 and 8 cycles/octave. See (Chi, Ru, and Shamma, 2005) for further details. For both temporal and spectral modulations, the centroid of each of these measures was taken as a summary value at each time frame.

9. Loudness in Bark frequency bands was extracted using the algorithm from the Acoustics Research Centre at the University of Salford (Zwicker et al., 1991). The process starts by filtering the sound using a bank of 28 log-spaced filters ranging from 250 Hz to 12.5 kHz, each filter having a one-third-octave bandwidth to mirror the critical bands of the auditory periphery. The outputs of each filter are scaled to match human perception at each frequency range. Loudness was averaged across frequency bands, providing a specific representation of the temporal envelope of each waveform.

The collection of all nine features resulted in a 9 × T representation (nine features by T time frames). Finally, features were z-score normalized to facilitate combining information across features.
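As a rough illustration, the first four spectral features above (brightness, bandwidth, flatness, and irregularity) could be computed from a generic non-negative spectrogram along the lines of the sketch below. This is not the NSL cochlear representation used in the actual analysis, so the exact values would differ.

```python
import numpy as np

def spectral_features(y, eps=1e-12):
    """Sketch of features 1-4 for a non-negative spectrogram y of shape (T, N),
    with rows as time frames and columns as frequency channels."""
    T, N = y.shape
    x = np.arange(N)                                            # channel index

    power = y ** 2
    brightness = (power @ x) / (power.sum(axis=1) + eps)        # BR(t)

    dist = np.abs(x[None, :] - brightness[:, None])
    bandwidth = (dist * y).sum(axis=1) / (y.sum(axis=1) + eps)  # BW(t)

    geo_mean = np.exp(np.log(y + eps).mean(axis=1))
    flatness = geo_mean / (y.mean(axis=1) + eps)                # FL(t)

    diff = np.diff(y, axis=1)
    irregularity = (diff ** 2).sum(axis=1) / (power.sum(axis=1) + eps)  # IR(t)

    feats = np.stack([brightness, bandwidth, flatness, irregularity])
    # z-score each feature across time, as done for the full 9 x T representation
    return (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + eps)
```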

2.2.5 Event Analysis

Event prediction was conducted on the acoustic features using a similar method to that for extracting behavioral events. This procedure serves as the basis for the simple model to be evaluated in Section 2.3. The derivative of each feature was calculated, and then smoothed using the same moving window average procedure used in the behavioral analysis. Peaks in this smoothed derivative were taken as predicted events for each specific feature. The height of this peak was retained as a measure of the strength of this predicted event, allowing for thresholding of events based on this ranking

later in the analysis. Thus, a series of events was predicted for each of the nine acoustic features included in this study.

In order to assess the correspondence between events predicted by the acoustic features and behavioral events, the scenes were analyzed over overlapping bins of length T_B. Results reported here are for two-second time windows with a time step of half a second, but qualitatively similar results were observed for other reasonable values of T_B. Each bin containing both a behavioral event and a predicted event was determined to be a ‘hit’; each bin with only a predicted event was determined to be a ‘false alarm’. A receiver operating characteristic (ROC) curve could then be generated by varying a threshold applied to the predicted events.
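A minimal sketch of this bin-based scoring might look as follows; the bin length, step size, and event representation (sample indices with a strength per predicted event, stored in NumPy arrays) are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def roc_points(behav_events, pred_events, pred_strengths, n_samples,
               fs=100.0, bin_s=2.0, step_s=0.5):
    """Hit and false-alarm rates over overlapping bins for a sweep of thresholds.
    Events are given as sample indices; each predicted event carries a strength."""
    starts = np.arange(0, n_samples - int(bin_s * fs) + 1, int(step_s * fs))
    ends = starts + int(bin_s * fs)

    # For each bin: does it contain a behavioral event, and what is the
    # strongest predicted event inside it (or -inf if none)?
    has_behav = np.array([np.any((behav_events >= s) & (behav_events < e))
                          for s, e in zip(starts, ends)])
    best_pred = []
    for s, e in zip(starts, ends):
        inside = pred_strengths[(pred_events >= s) & (pred_events < e)]
        best_pred.append(inside.max() if inside.size else -np.inf)
    best_pred = np.array(best_pred)

    # Sweep a threshold over predicted-event strength to trace the ROC
    points = []
    for thr in np.unique(best_pred[np.isfinite(best_pred)]):
        flagged = best_pred >= thr
        hit_rate = np.sum(flagged & has_behav) / max(has_behav.sum(), 1)
        fa_rate = np.sum(flagged & ~has_behav) / max((~has_behav).sum(), 1)
        points.append((fa_rate, hit_rate))
    return np.array(points)
```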

This analysis was performed using events predicted from individual features (e.g. loudness only) or across combinations of features. The cross-feature analysis was achieved in two ways: (a) A simple method was used to flag a predicted event if any feature indicated an event within a given bin. (b)

Cross-feature integration was performed using Linear Discriminant Analysis (LDA), using ten-fold cross validation (Izenman, 2013a). Each time bin was assigned the value of the strongest event predicted within that bin for each feature, or zero if no such event was present. The acoustic features, and thus the strength of the corresponding predicted events, were z-score normalized. Events from both increases and decreases in each acoustic feature were in- cluded in the LDA analysis. Each bin was assigned a label 1 if a behavioral event was present, and 0 if not. Training was done across scenes, with 10% of the data reserved for testing, using contiguous data as much as possible. The

process was repeated nine more times, each time using a distinct 10% of the data for testing, so that each subset served once as the test set. An ROC curve was again generated by varying a threshold on the output of the LDA classifier.
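The LDA-based integration across features could be sketched, for instance, with scikit-learn as below; the per-bin feature matrix (strongest predicted-event strength per feature, zero where no event was predicted) and the contiguous ten-fold split follow the description above, while the function and variable names are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold

def lda_event_scores(X, y, n_folds=10):
    """X: (n_bins, n_features) matrix of z-scored predicted-event strengths
    (zero where no event was predicted); y: 1 if the bin contains a behavioral
    event, else 0. Returns out-of-fold LDA scores for building an ROC."""
    scores = np.zeros(len(y), dtype=float)
    # shuffle=False keeps contiguous bins together within each test fold
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=False).split(X):
        lda = LinearDiscriminantAnalysis()
        lda.fit(X[train_idx], y[train_idx])
        scores[test_idx] = lda.decision_function(X[test_idx])
    return scores
```

An ROC can then be traced by thresholding these out-of-fold scores (for example with sklearn.metrics.roc_curve).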

An ROC curve representing the inter-observer agreement was also generated as a rough measure of the theoretical limit of predictability. This curve was generated by again dividing the scenes into overlapping bins, with the value of each bin corresponding to the percentage of subjects who began attending to the scene in the corresponding time frame. This time series was then thresholded at varying levels and compared to the set of behavioral events used in the previous section in order to plot the ROC.

2.2.6 Pupil Response Analysis

Pupil data, acquired using the EyeLink 1000 eye-tracking camera (SR Research Ltd., Oakville, ON, Canada), were sampled at 1000 Hz and normalized as follows: all data were z-score normalized, then averaged across all subjects and all scenes to reach one mean trend line. A power law function was fit to this trend curve starting from 3 seconds until 100 seconds. This region avoids the initial rising time of the pupil response and the higher amount of noise near the end due to a lower number of scenes with that duration. The trend line was then subtracted from the normalized pupil size recordings from each trial. In order to examine the relationship between pupil responses and the acoustic stimuli or subjects' behavioral responses, increases in pupil size were identified using a similar procedure as the extraction of salient events. The derivative of the pupil size measure was first computed. This derivative was

smoothed using the same moving window average procedure used in the behavioral analysis. Peaks in this smoothed derivative were marked as pupil dilations.
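A rough sketch of the detrending step (a power-law fit to the grand-average trend between 3 and 100 s, then subtracted from each trial) is given below; the specific power-law form (a + b·t^c), the sampling rate, and the array layout are assumptions for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def detrend_pupil(pupil_trials, fs=1000.0, t_start=3.0, t_end=100.0):
    """pupil_trials: (n_trials, n_samples) z-scored pupil traces.
    Fits a power law to the grand-average trend and subtracts it from each trial."""
    mean_trace = np.nanmean(pupil_trials, axis=0)
    t = np.arange(pupil_trials.shape[1]) / fs

    # Fit a + b * t**c on the 3-100 s region, avoiding the onset transient and
    # the noisier tail where few scenes are that long (assumed functional form)
    mask = (t >= t_start) & (t <= t_end)
    powerlaw = lambda tt, a, b, c: a + b * np.power(tt, c)
    params, _ = curve_fit(powerlaw, t[mask], mean_trace[mask],
                          p0=(0.0, 1.0, -0.5), maxfev=10000)

    # Evaluate the trend everywhere (guard t=0 against negative exponents)
    trend = powerlaw(np.maximum(t, 1.0 / fs), *params)
    return pupil_trials - trend[None, :]
```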

2.3 Results

2.3.1 Behavioral Results

A total of 234 behavioral events are recorded over the 20 scenes used in the database. The behavioral measures used in this study give an estimate of both absolute height (reflecting overall consensus across subjects) and slope height (indicating temporal agreement across subjects). Both measures can be taken as indicators of the salience strength of a sound event; and indeed both measures are found to be significantly correlated with each other (ρ(232) = 0.26, p = 6.1 × 10⁻⁵), indicating that an event judged as strong with one metric will often be determined to be strong based on the other.

The minimum number of events found in a scene is four (a quiet cafeteria scene, with an average of 1.8 events/minute), while the densest scene has sixteen events (an energetic piano solo excerpt, with an average of 12 events/minute). Some scenes have very clear and distinct events, as evidenced by the high slopes in the average salience curve, indicating that subjects agreed more closely in time on when those events occurred. Other scenes have comparatively less clear events. The mean absolute height of consensus for all events is 56%, indicating that on average, a little over half of the subjects attended to the scene within four seconds following one of these salient events (as described in methods).

Figure 2.2: Distribution of events across the different natural scenes. Each bar represents one of the twenty scenes in the database. The height of the bar indicates the relative salience strength of the events within that scene. The shading of the bar describes the number of events present within that scene. Scenes are sorted by category, then by number of events within each scene. Sparse scenes contain significantly stronger salient events than the other scene categories, indicated by a ∗.

Figure 2.2 depicts the number of events and their salience strength for different scenes. Scenes are divided by category: sparse scenes, music scenes, predominantly speech scenes, and other. Events in sparse scenes are significantly stronger (as measured by slope height) than events in any of the other three categories (Music, p = 2.3 × 10⁻³; Speech, p = 3.5 × 10⁻⁵; Other, p = 3.8 × 10⁻³), based on post-hoc Tukey HSD tests at the .05 significance level.
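Such a post-hoc comparison of slope heights across scene categories could be run, for instance, with the Tukey HSD implementation in statsmodels; the data files and variable names below are placeholders, not the actual analysis files.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# slope_heights: one slope-height value per salient event; scene_category:
# matching labels such as "sparse", "music", "speech", or "other".
slope_heights = np.load("event_slope_heights.npy")    # hypothetical file
scene_category = np.load("event_scene_category.npy")  # hypothetical file

tukey = pairwise_tukeyhsd(endog=slope_heights, groups=scene_category, alpha=0.05)
print(tukey.summary())
```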

Moreover, the time since the last event is inversely correlated with the absolute height of consensus (ρ(212) = −0.33, p = 9.04 × 10⁻⁷) (Figure 2.3a), suggesting that subjects may be sensitized towards a scene after each event. To further assess the timing of salient events, we compute the duration of all shifts in attention towards a scene as the time between when a subject moved the cursor to that scene and when that subject moved the cursor away from

it. The distribution of these ‘onset to offset times’ was fit using kernel density estimation and peaks at around 3.11 seconds (Figure 2.3b). This histogram corroborates the observation in Figure 2.3a that subjects continue attending to a scene for 3 seconds or more after a salient event, masking any potential behavioral responses within that time window.

Figure 2.3: a The strength of an event relative to the time since the last event. b Histogram of individual onset to offset times. These times each correspond to a single mouse movement to and from a scene. A probability density function was fitted using kernel density estimation (a non-parametric method) and peaks at 3.11 seconds.
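The kernel density fit in Figure 2.3b could be reproduced along these lines with a Gaussian KDE; the array of onset-to-offset durations and the file it is loaded from are placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

# onset_to_offset: per-movement durations (s) between moving the cursor to a
# scene and moving it away again.
onset_to_offset = np.load("onset_to_offset_times.npy")  # hypothetical file

kde = gaussian_kde(onset_to_offset)      # non-parametric density estimate
grid = np.linspace(0, 30, 3000)
density = kde(grid)
print("density peaks at %.2f s" % grid[np.argmax(density)])  # ~3.11 s reported here
```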

2.3.2 Acoustic Features

An acoustic analysis of the scenes sheds light on the attributes of sound events that are deemed salient. On average, subjects tend to respond a little under a second after a change in the acoustic properties of the scenes. Using loudness as an example, Figure 2.4 (inset) shows the average behavioral response overlaid with the loudness of the scene, highlighting the roughly one-second shift between an acoustic change in the scene and the behavioral response. Looking closely at these reaction times, we divide salient events by their strength as quantified by absolute consensus without the slope of salience, to avoid any influence of timing on the rankings. The reaction time is calculated for each behavioral event by finding the time of the maximum slope of loudness in a two-second window preceding the behavioral event. Figure 2.4 shows the confidence intervals (95%) of reaction times for different salience levels, adjusted for multiple comparisons. A post-hoc Tukey HSD test shows that subjects respond significantly more quickly for high-salience events than for low-salience events (p = .047).

Previous studies have indicated that loudness is indeed a strong predictor of salience (Kim et al., 2014b). The current dataset supports this intuitive notion. Figure 2.5a (inset) shows a significant increase in loudness preceding an event, calculated as the average loudness within 0.5 seconds before an event minus the average loudness between 2.0 and 1.5 seconds before the event (t(232) = 11.6, p << .001). These time ranges relative to the behavioral event are chosen in accordance with the delay of a little under a second between when an acoustic feature changes and when subjects respond (Figure 2.4).

Figure 2.4: Reaction times. The reaction time is defined here as the time between a change in loudness and the change in the behavioral response (inset). Events are ranked by absolute consensus to avoid timing issues. Boxes represent 95% confidence intervals, adjusted for multiple comparisons using a post-hoc Tukey HSD test.

In addition to loudness, changes in other acoustic features also correlate with presence of salient events. Figure 2.5a depicts changes in each feature following salient events, normalized for better comparison across features.

There are significant increases in brightness (t(232) = 4.0, p = 9.1 × 10⁻⁵), pitch (t(232) = 3.0, p = .0033), and harmonicity (t(232) = 7.3, p = 5.2 × 10⁻¹²), as well as a significant decrease in scale (t(232) = −3.8, p = 1.9 × 10⁻⁴).

In order to extend our analysis beyond just the salient events used in this chapter, we examine all peaks in the behavioral response (beyond the top 50% retained in most analyses, as outlined in Methods). The strength of the derivative peaks of these responses correlates with the degree of change in acoustic features. Specifically, loudness (ρ = .44, p = 1.0 × 10⁻²³), harmonicity (ρ = .33, p = 2.7 × 10⁻¹³), brightness (ρ = .17, p = 1.95 × 10⁻⁴), and scale (ρ = −.14, p = .002) are all statistically significantly correlated with the degree of event salience. In

other words, the weaker the change in acoustic features, the weaker the behavioral response; hence our choice to focus on the stronger (top 50%) changes, since they correlate with better-defined acoustic deviations.

Figure 2.5: a Acoustic feature change averaged across events. All features are z-score normalized. Statistically significant increases or decreases are labeled by a ∗, where ∗ indicates p < .05 and ∗∗ indicates p < .0001. Inset: Average loudness relative to salient events. Vertical bars indicate the time windows used to measure the change in acoustic feature across events. b Average loudness around salient events, separated by strength of salience. c Average loudness around salient events, separated by scene density. All error bars and shaded areas represent ±1 standard error.

2.3.3 Acoustic Context

In addition to the absolute value of acoustic features immediately related to a salient event, the acoustic context preceding the event plays an important role in shaping the percept of salience. For instance, an event with a given loudness level does not always induce a fixed percept of salience. Instead, the context in which this event occurs plays a crucial role, whereby a loud sound is perceived as more salient in a quieter context than in a raucous one. Figure 2.5b shows the average loudness over time around salient events, ranked into three tiers by salience strength. The figure highlights the fact that the events with the highest salience are preceded by a baseline loudness that is lower than average. Specifically, while both high- and mid-salience events tend to be equally loud acoustically, their perceived salience is different given the loudness of their preceding contexts. Figure 2.5c also shows the average loudness over time for events split by scene density. There is a much larger loudness change for events in sparse scenes.

2.3.4 Event Types

Although loudness shows a strong average increase across all events, that increase is not uniform across all types of events, revealing a nonlinear behavior across contexts. In order to better understand the effect of the scene context around the event on its perceived salience, events are manually categorized into different types based on the scene content immediately prior to the event. Note that many scenes are heterogeneous and include events of different types. The most straightforward categories are speech, music,

and animal sounds. ‘Other vocalizations’ include laughter, wordless cheering, and crying. Of the remaining events, most are either created by some sort of device or machinery (‘device’), or associated with objects impacting one another (‘tapping/striking’). The number of events in each category is shown in Figure 2.6a. As revealed in this plot, music and speech events do not necessarily exhibit strong changes in loudness and are significantly softer when compared to other event categories, t(223) = −5.71, p = 3.7 × 10⁻⁸. In particular, post-hoc Tukey HSD tests indicate that speech and music events both show significantly lower loudness increases than any of the top three categories (other vocalization, device, and tap/strike events) at the 0.05 significance level.

In order to further examine the acoustic features associated with events of different types, the histogram of feature changes is broken down into different subcategories (Figure 2.6b). This figure is an expansion of Figure 2.5a that hones in on context-specific effects. The figure highlights different changes in acoustic features that vary depending on context, such as a prominent increase in harmonicity associated with speech events versus a notable change in spectral metrics such as brightness and flatness for more percussive events such as devices (Figure 2.6b, rightmost panels).

2.3.5 Long-term context

Although salient events are associated, on average, with changes in acoustic features, there are many instances in which a change in an acoustic feature does not result in a behaviorally-defined event. Figure 2.7a shows an example of a knocking sound consisting of two consecutive knocks. While both knocks elicit equally loud events (Figure 2.7a, top), the behavioral response (bottom) to the second knock is markedly reduced, presumably because its pop-out factor is not as strong given that it repeats a recent event.

Figure 2.6: a Number of events in each category, and the loudness change associated with each (bottom, ∗ indicates p < .05, ∗∗ indicates p < .0001). b Feature change associated with specific categories of events (∗ indicates p < .05, ∗∗ indicates p < .001). All error bars represent ±1 standard error.

In order to evaluate the effects of long-term context more comprehensively, we tally the behavioral responses before and after the strongest loudness changes in our stimuli (top 25% of all loudness changes). On the one hand, we

note that loudness changes before ([-8, 0] sec) and after ([0, 8] sec and [8, 16] sec) these strong loudness deviations are not significantly different (Figure 2.7b, top). On the other hand, loudness increases in a [0, 8] sec window are found to elicit a significantly lower response than both increases in an eight-second window prior to these events (t(381) = 2.68, p = .0077) and increases in the following eight seconds (t(413) = 3.2, p = .0015) (Figure 2.7b, bottom). This analysis indicates that acoustic changes over a longer range (up to 8 sec) can have a masking effect that can weaken the auditory salience of sound events.

Figure 2.7: a An example of a stimulus that produces a time-varying response. A repeated knocking sound elicits a lower response to the second knock. b Effects of strong loudness increases on neighboring loudness increases. Statistically significant differences in the response to these acoustic changes are labeled by a ∗.

2.3.6 Event Prediction

Based on the analysis of acoustic features in the scene, we can evaluate how well a purely feature-driven analysis predicts behavioral responses of human listeners. Here, we compare predictions from the features presented in this

study using a direct combination method as well as cross-feature combination (see Methods). We also contrast predictions from three other models from the literature. The first comparison uses the Kayser et al. model (Kayser et al., 2005b) which calculates salience using a center-surround mechanism based on models of visual salience. The second is a model by Kim et al. (Kim et al., 2014b), which calculates salient periods using a linear salience filter and linear discriminant analysis on loudness. The third is a model by Kaya et al.

(Kaya and Elhilali, 2014a), which detects deviants in acoustic features using a predictive tracking model based on a Kalman filter. While the first two models are implemented as published in their original papers, the Kaya model is used with features expanded to include all the features included in the current study. All three models are trained using cross-validation across scenes. Hit and false alarm rates are calculated using two-second bins (see Methods).

Figure 2.8 shows a receiver operating characteristic (ROC) curve contrasting hit rates and false alarms as we sweep through various thresholds for each model in order to predict behavioral responses to salient events. As is immediately visible from this plot, none of the models achieves a performance that comes very close to the theoretical limit of predictability. The latter is computed here using a measure of inter-observer agreement (see Methods for details). Comparing the different methods against each other, it is clear that a model like Kayser's is limited in its predictive capacity given its architecture as mainly a vision-based model that treats each time-frequency spectrogram as an image upon which center-surround mechanisms are implemented (accuracy of about 0.58, defined as the area under the ROC curve).

Figure 2.8: a Comparison between different methods of predicting salient events. The inset shows the maximum number of events predicted by each individual feature, and by the combination of all features. Features in order are: 1 - Flatness, 2 - Irregularity, 3 - Bandwidth, 4 - Pitch, 5 - Brightness, 6 - Rate, 7 - Scale, 8 - Harmonicity, 9 - Loudness. b ROC for dense and sparse scenes separately, using the prediction with the highest performance.

Given the highly dynamic and heterogeneous nature of the dataset used here, the corresponding spectrograms tend to be rather busy ‘images’ where salient events are often not clearly discernible.

In contrast, the model by Kim et al. was designed to map the signal onto a discriminable space maximizing separation between salient and non-salient events using a linear classifier learned from the data. Despite its training on the current dataset, the model is still limited in its ability to predict the behavioral responses of listeners, with an accuracy of about 0.65. It is worth noting that the Kim et al. model was primarily designed to identify periods of

high salience, rather than event onsets. Moreover, it was developed for the AMI corpus, a dataset that is very homogeneous and sparse in nature. That is certainly not the case in the current dataset.

The Kaya model performs reasonably well, yielding an accuracy of about 0.71. Unlike the other models, the Kaya model attempts to track changes in the variability of features over time in a predictive fashion. It is worth noting that extending the features of the original Kaya model to the rich set employed in the current study was necessary to improve its performance. Interestingly, a relatively simple model which simply detects changes in acoustic features and integrates across these features in a weighted fashion using linear discriminant analysis (‘Feat Change - LDA’ in Figure 2.8a) performs as well as the Kaya model. On average, this model is able to predict human judgments of salience with an accuracy of about 0.74 using a linear integration across features. Of note, a simpler version of the feature change model that treats all features as equally important remains limited in its predictive accuracy (about 0.63). This result further corroborates observations from earlier studies about the important role of interaction across acoustic features in determining the salience of auditory events (Tordini, Bregman, and Cooperstock, 2015b; Kaya and Elhilali, 2014a). Also worth noting is that smoothing the acoustic features helps stabilize the detection of change points; employing the LDA-based model with raw acoustic features yields an accuracy of 0.59.

In order to get better insight into the contribution of different acoustic features in accurately predicting salient events, Figure 2.8a (inset) examines the proportion of events explained by certain features alone. The analysis

quantifies the maximum hit rate achievable. The figure shows that loudness is the single best feature, as it can explain about 66% of all events alone. The next best feature, harmonicity, can explain 59% of events alone. Combining all features together (rightmost bar) is able to explain all events (100%), even though it also comes with false alarms as shown in the ROC curve (Figure

2.8a). Qualitatively similar results about relative contribution of features are obtained when generating an ROC curve using each of these features individually. Moreover, the relative contribution of acoustic features to the prediction of events is consistent with an analysis of LDA classifier weights which also show that loudness is the strongest indicator of event salience, followed by harmonicity and spectral scale.

In order to examine the effect of scene structure on the predictability of its salient events based on acoustics, we look closely at predictions for sparse vs. less sparse scenes. As noted earlier in this chapter, sparse scenes tend to give rise to more salient events. Also, by their very nature, sparse scenes tend to be quieter, which makes changes in features such as loudness more prominent, resulting in stronger salience events (as discussed in Figure 2.5b). The sparse scenes with well-defined events may be more easily described by loudness alone. This hypothesis is supported by manually separating the twenty scenes used here into sparse and dense scenes, and evaluating the model's performance on each subset. The prediction for dense scenes with more overlapping objects shows more of an improvement with the inclusion of additional features in comparison to the prediction for sparse scenes (Figure 2.8b). In addition, the event prediction for the dense scenes is closer to the inter-observer ROC than

the prediction for the sparse scenes.

For the four ‘control’ scenes, which were always paired with the same competing scene, it is interesting to examine the impact of the opposite scene on the salience of a scene of interest. As noted throughout this work, the auditory salience of a sound event is as much about that sound event as it is about the surrounding context, including a competing scene playing in the opposite ear. We include features of this opposite scene in the LDA classifier. The prediction for one scene shows great benefit (Figure 2.9a), since it is paired with a very sparse scene containing highly distinct events. However, this effect was not consistent. The prediction of events in a very sparse scene is hindered slightly by including information from the opposing scene (Figure

2.9a), further reinforcing the idea that event prediction is much more effective on sparse scenes. The other control pair, which involved two more comparable scenes in terms of event strength, is mostly unaffected by including features from the competing scene.

2.3.7 Pupillary Response

In addition to recording behavioral responses from listeners, our paradigm monitored their pupil diameter throughout the experiment. The subjects' pupil size shows a significant increase immediately following events defined using these methods, with a typical pupil dilation associated with a salient event lasting about three seconds (Figure 2.10a). While all salient events correlate with pupil dilation, not all dilations correlate with salient events. Figure 2.10b shows that about 29% of pupil dilations coincide with

Figure 2.9: Effects of adding opposing scene information. a Examples of prediction performance for two specific scenes, with and without features from the opposing scene. b ROC over all control scenes.

salient events (defined as occurring within one second of such an event).

Another 26% coincide with other peaks in the behavioral responses that we did not label as significant salient events (see Methods). Interestingly, 49% of pupil dilations did not coincide with any peak in the behavioral response. In contrast, there is a strong correspondence between pupil dilations and changes in acoustic features in the stimuli. An analysis of all pupil dilations shows a significant correlation with increases in acoustic loudness during the behavioral task (t(548) = 7.42, p = 4.4 × 10⁻¹³) (Figure 2.10c, top bar). At the same time, randomly selected points in the pupil responses that do not coincide with any significant pupil variations also do not coincide with any significant changes of acoustic loudness (Figure 2.10c, bottom bar). Finally, in order to rule out any possibility of motor factors during the behavioral task contaminating the

pupil response analysis, we replicate the same analysis of pupil changes vs. loudness changes during a passive listening session of the same task with no active responses from subjects (N = 14). The results confirm that pupil dilations significantly correlate with increases in acoustic loudness even during passive listening (t(355) = 4.53, p = 8.2 × 10⁻⁶) (Figure 2.10c, middle bar).

Figure 2.10: a Average pupil size following events, z-score normalized and detrended. The black bar represents the region where the pupil size is significantly greater than zero. b Percentage of pupil size increases that are near salient events or other peaks in the slope of salience. c Change in loudness associated with increases in pupil size. Error bars represent ±1 standard error.

2.4 Discussion

In this study, we present a database of diverse natural acoustic scenes with the goal of facilitating the study and modeling of auditory salience. In addition to the sound files themselves, a set of salient events within each scene has been identified, and the validity of these events has been checked using several methods. Specifically, our analysis confirms the intuition that a change in some or all acoustic features correlates with a percept of salience that attracts a listener's attention. This is particularly true for loud events, but extends to changes in pitch and spectral profile (particularly for speech and music sound events). These effects fit well with contrast theories explored in visual salience (Itti and Koch, 2001a). In the context of auditory events, the notion of contrast is specifically applicable in the temporal dimension, whereby as sounds evolve over time, a change from a low to a high value (or vice versa) of an acoustic attribute may induce a pop-out effect of the sound event at that moment. While earlier studies have emphasized the validity of this theory for dimensions such as loudness (Kim et al., 2014b; Duangudom and

Anderson, 2013b; Tsuchida and Cottrell, 2012a), our findings emphasize that the salience space is rather multidimensional, spanning both spectral and temporal acoustic attributes. Increases in brightness and pitch both suggest that higher-frequency sounds tend to be more salient than lower-frequency sounds. Naturally, this could be correlated with changes in loudness that are associated with frequency (Moore, 2003) or more germane to how high frequencies are perceived. The rise in harmonicity suggests that a more strongly pitched sound (e.g. speech, music) tends to be more salient than a

sound with less pitch strength. Interestingly, the scale feature (representing a measure of bandwidth) tended in the opposite direction. Events that are more broadband and spectrally spread tended to be more salient than more spectrally-narrow events.

The relative contribution of various acoustic features has been previously explored in the literature. Kim et al. (Kim et al., 2014b) reported that they observed no improvement to their model predictions with the incorporation of other features besides sound loudness in Bark frequency bands. However, in the present study, features other than loudness do show a contribution. Loudness is only able to account for around two-thirds of the salient events, with the remaining events explained by one or more of the other eight acoustic features (Figure 2.8a, inset). In particular, the other features seem helpful in establishing salient events in the context of acoustically dense scenes, in which the difference in loudness between the background and the salient sound is much lower. This finding is supported by Kaya et al. (Kaya and Elhilali, 2015) which showed that the contribution to salience from different features varied across sound classes. The Kaya et al. study used comparatively more dense acoustic stimuli.

In addition to its multidimensional nature, the space of auditory salience appears to also be context- or category-dependent. By using a diverse set of scenes of different genres (e.g. speech, music, etc.), our results indicate that the observed contrast across acoustic attributes does not uniformly bias the salience of an auditory event, but rather depends on the general context in which this event is present. Notably, speech events appear to be strongly associated with

increases in harmonicity. This result aligns with expectations, as speech is a very strongly pitched sound. Similarly, because it is pitched, speech is characterized by a low spectral flatness. As such, it is unsurprising that speech events are associated with a decrease in flatness. Also worth noting is that both speech and music events were not associated with any significant or sizable changes in a number of spectral metrics (including brightness and bandwidth), in contrast with more percussive-type events such as devices and tapping.

These latter two contexts induced an expected dominance of loudness (Figure 2.6b, rightmost panels). However, it is interesting that musical events show the least consistent changes across all features, perhaps indicating that what makes a musical section salient is much more intricate than basic acoustics. Overall, these results suggest that the analysis of salience may need to carefully consider the contextual information of the scene; even though the exact effect of this categorization remains unclear.

Moreover, an interesting observation that emerges from the behavioral judgments of listeners is that auditory salience is driven by acoustic contrast over both short (within a couple of seconds) and long (as much as 8 seconds) time spans. The multi-scale nature of the integration of acoustic information is not surprising, given that such a processing scheme is common in the auditory system, especially at more central stages including auditory cortex (Elhilali et al., 2004). However, it does complicate the development of computational models capable of multiplexing information along various time constants, as well as complementing a basic acoustic analysis with more cognitive processes that enable the recognition of contextual information about the scene. This

extended analysis over the longer context of a scene will need to delve into memory effects and their role in shaping the salience of sound events by reflecting interesting structures in the scene, such as the presence of repetitive patterns (that may become less conspicuous with repetition) or familiar transitions (as in the case of melodic pieces).

The choice of a diverse set of natural scenes was crucial to paint a fuller picture of this auditory salience space. This follows in the tradition of studies of visual salience, which greatly benefited from an abundance of databases with thousands of images or videos with clearly labeled salient regions based on eye-tracking, visual search, identification, and detection measures (Borji and Itti, 2013a). The widespread availability of such databases has greatly facilitated work ranging from visual perception to computer vision. Crucially, the abundance of datasets brought forth a rich diversity of images and videos that challenged existing models and pushed forth advances in our understanding of visual attention and salience. Our goal in the present study is to provide a rich database that takes a step towards achieving the same benefits for the study of auditory salience, providing a level of facilitation and consistency which has previously been unavailable. While the dataset used here represents only a tiny fraction of the diversity of sound in real life, it provides a small benchmark for future models and theories of auditory salience, as well as stirring the conversation about appropriate selections for the development of future datasets.

Scene variety is one of the key qualities of this set of scenes, encouraging the development of models that apply to a wider range of settings. This

versatility would be vital to many applications of salience research, such as use in hearing aids and audio surveillance, or as a preliminary filter for scene classification (Kalinli, Sundaram, and Narayanan, 2009). Aside from subjective evaluation and comparison of acoustic features, the scene variety in this database is also reflected in the differing behavioral response to each scene.

Some scenes had strong, clearly defined events while others had relatively weaker events; some had many events while others had only a few.

As with all studies of salience, the choice of behavioral metric plays a crucial role in defining the perceptual context of a salient event or object. In vision, eye-tracking is a commonly chosen paradigm as eye fixations have been strongly linked to attention (Peters et al., 2005), although there is still some influence of top-down attention as seen through task-dependent effects (Yarbus, 1967). Other methods that have been employed include mouse tracking, which is a reasonable approximation to eye tracking while requiring less equipment, even though it is noisier (Scheier and Egner, 1997). Manual object annotation is also common, such as in the LabelMe (Russell et al., 2008) and Imagenet (Deng et al., 2009) databases. Although manual labeling may be influenced more by top-down information and can consume more time per scene, large amounts of such annotations have been collected using web-based tools (Russell et al., 2008).

In the present study, pupil size provides the closest analogy to eye tracking while mouse movements are analogous to mouse tracking. Consistent with previous results in audio and vision studies (Hess and Polt, 1960; Partala and Surakka, 2003; Liao et al., 2016c), our analysis shows that when a salient

stimulus is received, pupils naturally dilate to try to absorb as much information as possible. However, using the complex scenes employed in the current study sheds light on a more nuanced relationship between pupil dilations and auditory salience. As the results indicate, pupil dilations correlate strongly with acoustic variations, but not all acoustic variations imply behaviorally-salient sound events. In other words, while almost all salient events correlate with an increase in pupil size locally around the event, increases in pupil size do not always coincide with the presence of a behaviorally-salient event. As such, pupillometry cannot be used as a sole marker of bottom-up salience. It is therefore necessary to engage listeners in a more active scanning of the scenes, since they have to indicate their attentional state in a continuous fashion. Obviously, by engaging listeners in this active behavioral state, this paradigm remains a flawed indicator of salience. To counter that effect, a large number of subjects is sought to balance results against consistency across subjects. Alternative measures based on non-invasive techniques such as EEG and MEG are possible and are being investigated in ongoing follow-up work.

Despite being an imperfect metric for salience, the behavioral paradigm used in the present study provides a clear account of the onsets of salient events, hence allowing a direct evaluation of the concept of temporal contrast, which also champions an analysis of temporal transitions of sound features. Using such a model proves relatively powerful in predicting human judgments of salient events, though limitations in this approach remain due to the local nature of its acoustic analysis, as well as its failure to account for the diversity of scenes present. Qualitatively, the results of this study match some of the other

trends in the visual salience literature as well. In both domains, some categories of objects are more difficult to predict than others, in addition to the discrepancy between different scenes. Difficulty in predicting salience in dense acoustic scenes lines up with the performance of visual salience models. Borji (Borji, 2015a) recommends focusing on scenes with multiple objects, in part because models perform more poorly for those complex scenes. Similarly, the dense auditory scenes in this study contain events that were much more difficult for any of the existing models to predict. The larger discrepancy between the model predictions and the inter-observer agreement confirms that there is more room for improvement for these scenes. Models of auditory salience need to be designed with these types of stimuli in mind.

In line with the visual literature, our analysis suggests that smoothing of the acoustic features is required to even come close to a good prediction of the acoustic salient events. This is analogous to suggested observations that models that generate blurrier visual maps perform better in the visual domain (Borji, Sihite, and Itti, 2013b). Moreover, models of visual salience incorporating motion in videos have not performed better than static models

(Borji, Sihite, and Itti, 2013b). Similarly, some basic attempts at incorporating longer-term history in our model of auditory salience were not able to improve predictions. That is not to say that history plays no role in salience in either field, but that its role may be complex and difficult to model.

Finally, it is worth noting that the role of top-down attention in the overall paradigm remains a lingering, unresolved issue, despite efforts to minimize its effects. It has been suggested that events with high inter-subject agreement

may be associated with true salience, while those with low inter-subject agreement may be associated with top-down attention, because of the variability in what people consciously think is important (Kim et al., 2014b). However, since it would be difficult to dissociate events due to top-down attention from simply weaker events resulting from salience, that distinction is not made here.

In fact, with the attempts made to minimize effects of top-down attention in this study, it is likely that weaker events identified here are still a result of bottom-up processing.

Chapter 3

Neural Markers of Auditory Salience: Musical Sequences with Visual Task

This chapter presents the neurophysiological effects elicited by salient stimuli as recorded by electroencephalography (EEG). In this chapter, constructed musical scenes are used to provide more controlled and consistent stimuli for an initial exploration, to be built upon in later chapters with natural scenes. A visual task is used to occupy subjects' top-down attention in order to isolate the effects of salience. Several neural effects of attention are identified, including modulation in the phase-locked response and phase coherence. The contents of this chapter have been submitted for peer review.

3.1 Introduction

Attention facilitates focus on an object of interest even in the presence of distractors. It serves to allocate neural resources for detailed processing of a target (via top-down, selective attention), while maintaining perceptual

flexibility to allow important, salient objects to enter the spotlight (via bottom-up, involuntary attention). While there is a large body of work exploring the underpinnings of top-down attention across sensory modalities, the study of bottom-up attention has mostly flourished in vision, owing to the ease of obtaining eye-tracking data for complex natural scenes and the high contribution of visual salience to fixations, resulting in a proliferation of experimental and modeling studies in the visual salience literature (Borji and Itti, 2013b).

Similar studies in the auditory domain remain challenged by the lack of a clear correlate to eye saccades. A variety of experimental techniques have been proposed to measure auditory salience, such as comparing sound clips

(Kayser et al., 2005a; Tsuchida and Cottrell, 2012b), making salience judgments (Kaya and Elhilali, 2014b; Petsas et al., 2016), actively denoting salient events in free listening (Kim et al., 2014a), or dichotic listening across competing sounds (Huang and Elhilali, 2017). Still, behavioral measures of bottom-up auditory attention ultimately require careful experimental design, with cautious interpretation of results that will likely remain confounded by top-down factors (Kaya and Elhilali, 2017). Other biometric markers such as pupillometry are viable alternatives, but a number of confounds remain to be assessed, including the dominant role of loudness rather than salience in general in driving pupil dilation variations (Liao et al., 2016a; Huang and Elhilali, 2017).

While attentional orientation cannot be inferred from the ear, electrical or magnetic fields generated by the synchronized firing of large neural populations can be recorded non-invasively using electroencephalography (EEG) or magnetoencephalography (MEG), respectively. Oddball experiments have

been used extensively to reveal deflections in event-related potential (ERP) components in response to rare or deviant stimuli (Picton et al., 2000; Näätänen et al., 2007; Winkler, 2007), hence laying the groundwork for interpretation of sounds that might attract the bottom-up attention of unattending subjects. While this design is attractive for simplicity of analysis and minimal top-down interference, it requires artificial scenes made of short, often identical, sounds that create a standard or baseline stimulus. Not only does this requirement make it challenging to immediately extend the paradigm to natural sound scenes to probe free listening, it focuses the analysis on changes of acoustic attributes of the deviant relative to a standard stimulus. In addition, it limits the spectral information that can be extracted from EEG oscillations due to its sparse setup in both time and space.

Recent work using natural speech, animal vocalizations, and ambient sounds demonstrates that neural activity fluctuates in a pattern matching that of the attended stimulus, driving the power of oscillations at the stimulus rate or low-frequency oscillations and modulating this power by attention (Lakatos et al., 2008; Zion Golumbic et al., 2013; Besle et al., 2011; Jaeger et al., 2018).

This enhanced representation of the attended stimulus has been used to track auditory attention using envelope decoding paradigms (Ding and Simon, 2012; O’Sullivan et al., 2015; Fuglsang, Dau, and Hjortkjær, 2017). While these studies have successfully extracted stimulus-specific information from neural recordings to natural continuous sound environments, they have all employed experimental paradigms under the influence of top-down attention.

In the current study, we explore whether it is possible to record a bottom-up

attentional response with a rich, dynamic acoustic scene using EEG in an experimental paradigm that directs subjects' top-down attention away from the auditory input to a visual task. Musical melodies ignored by participants are constructed using notes from real musical instruments without any familiar melody, but synthesized to sound natural and pleasing. Occasional salient notes are introduced that differ from the statistical regularity of the melody along some acoustic dimension (e.g. a violin solo with an unexpected piano note, Fig. 3.1A).

The paradigm differs from traditional oddball experimental designs in a number of ways: (i) subjects’ attention is directed to a visual task, (ii) the notes, while regularly repeating, occur at a much faster rate (3.33 Hz) than typical oddball paradigms, (iii) while the melody exhibits an inherent rhythmicity, it consists of overlapping notes resulting in a complex sequence (Figure 3.1B), (iv) the analysis contrasts the same physical notes when they occur as salient vs. control, (v) each melody consists of non-identical notes that vary in a stochastic but controlled structure, and (vi) rare salient notes are also non-identical, and deviate from the scene in multiple features (pitch, intensity, and timbre).

These latter two points are of particular significance as real environments reflect complex statistical structures with salient sounds varying along various acoustic attributes. Exploring the interaction across different attributes sheds light on processes of feature integration and their role in guiding salience encoding in the brain. Moreover, by contrasting neural responses to the same physical sounds under different attentional states (salient attracting bottom-up attention vs. being ignored in the background), the analysis hones in on effects

of bottom-up attention without confounds of the local acoustic structure of the note itself.

In earlier work using the same dynamic stimuli (Kaya and Elhilali, 2014b), we explored behavioral responses to the same salient events (with subjects attending to these melodies) and quantified the effects of the acoustic dimensions of pitch, timbre, and intensity, along with their interactions, on salience judgments. Here, we examine neural markers of auditory salience using the same events while subjects ignore the melodies, and report on the effects of salience on neural entrainment to the auditory scene and phase alignment of intrinsic neural oscillations, as well as spectro-temporal response characteristics reflecting changes in cortical processing of salient sounds across different acoustic features despite attentional focus being directed away from these notes to a concurrent visual task.

3.2 Materials and Methods

3.2.1 Participants

Thirteen subjects (7 female) with normal vision and hearing and no history of neurological disorders participated in the experiment in a soundproof booth after giving informed consent and were compensated for their participation. All procedures were approved by the Johns Hopkins Institutional Review

Board.

3.2.2 Stimuli and experimental paradigm

Subjects performed an active visual task while auditory stimuli were concurrently presented. Subjects were instructed to ignore the sound being delivered to their ears.

The auditory stimuli closely followed the music experiment in (Kaya and

Elhilali, 2014b), where psychoacoustic detection results were collected. Each sound stimulus had a duration of 5 seconds and consisted of regularly spaced overlapping musical notes 1.2 s long, with a new note starting every 300 ms.

Instrument notes were extracted from the RWC Musical Instrument Sound Database (Goto et al., 2003) for Pianoforte (Normal, Mezzo), Acoustic Guitar (Al Aire, Mezzo) and Clavichord (Normal, Forte) at 44.1 kHz; and were amplitude normalized relative to maximum with 0.1 s onset and offset sinusoidal ramps. The repetition rate and instruments were specifically selected to sound pleasing and flow naturally in order to resemble natural sounds.

All notes in a given 5-second sequence had a controlled average intensity, were played by the same instrument (timbre), and varied within ±2 semitones of a nominal pitch value around A3 (220 Hz), with one exception. Within each sequence, one note (positioned at random between 2.4 s and 3.8 s) was designated as “salient”, and differed in any combination of the following acoustical properties: timbre (new instrument), pitch, and intensity. We selected three instruments (piano, guitar, clavichord), two levels of pitch difference (2 or 6 semitones higher), and two intensity levels (2 or 6 dB higher) to test in a factorial design. Due to the difficulty of defining timbre on a scale, we characterized timbre difference categorically by testing all 9 combinations of

the 3 instruments for standard notes (Background-Timbre) and deviant notes (Salient-Timbre). This resulted in 3 × 3 × 2 × 2 = 36 trials to test every possible feature deviation. Each feature deviation was repeated 10 times with different dynamic backgrounds and salient onset times, for a total of 360 salient trials.
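The 36 feature-deviation conditions and the resulting 360 salient trials can be enumerated directly, as in the short sketch below (labels are placeholders):

```python
from itertools import product

instruments = ["piano", "guitar", "clavichord"]
pitch_shifts = [2, 6]        # semitones above the background notes
intensity_boosts = [2, 6]    # dB above the background notes

# Background timbre x salient timbre x pitch shift x intensity boost
conditions = list(product(instruments, instruments, pitch_shifts, intensity_boosts))
assert len(conditions) == 36

# Each condition repeated 10 times with different backgrounds and onset times
n_salient_trials = len(conditions) * 10
print(n_salient_trials)  # 360
```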

In addition, “control” trials were constructed in a similar fashion, but without any salient notes. Their attributes (pitch, timbre, intensity) were carefully chosen to embed each of the salient notes without making it salient given its context. In each control trial, a given salient note appeared at two randomly selected positions no earlier than 2.4 s from the start of the trial, with a minimum of 900 ms between the two occurrences. The order of trials was randomized for each subject. Five control trials were constructed for each of the 12 salient notes, resulting in 60 control trials and 420 experiment trials in total.

Visual stimuli consisted of digits and capital letters presented on a black screen. Each target was uniformly chosen at random from the numbers 0-9, while each non-target was uniformly chosen at random from the letters A-Z. Within each trial, 56 characters were presented in sequence, with one presented every 90 ms. The visual task was divided into two difficulty conditions. The low-load condition consisted of white numbers in contrast to gray letters, with all characters remaining on-screen for 90 ms. In the high-load condition, both targets and non-targets were the same shade of gray, and all characters were presented for only 20 ms. Trials were presented in 12 blocks, with blocks alternating between the two load conditions. Half of the subjects started with the low-load condition, and half began with the high-load condition.

For a majority of the trials, two of the stimuli were targets. To avoid confounds with the salient auditory stimuli, one target (“early”) occurred within the first 2.4 seconds of the trial, while the other (“late”) occurred after the first 4.2 seconds of the trial. In addition, the first and last two characters were always non-targets. The positions of these targets were uniformly chosen at random within their respective ranges. In 20% of trials, only one target was presented, with an equal chance of it being either early or late. Finally, for 30 trials, one of the visual targets was positioned between 2.4 and 4.2 seconds from the start of the trial, while still being at least 1.3 seconds away from any salient auditory stimuli. This adjustment ensured that subjects paid attention to the visual stimulus throughout each trial.

3.2.3 Neural data acquisition

Electroencephalography (EEG) recordings were performed with a Biosemi Active Two system with 128 electrodes, plus left and right mastoids, acquired at 2048 Hz. Four additional electrodes recorded eye and facial artifacts from EOG (electrooculography) locations, and a final electrode was placed on the nose to serve as reference. The nose electrode was used only to examine ERP components, particularly mismatch negativity at the mastoids, and the average mastoid reference was used for all further analyses.

The initial processing of signals was performed with the FieldTrip software package for MATLAB (Oostenveld et al., 2011). Trials were epoched with 1 s of buffer time before and after each trial, referenced to the left and right mastoid average, and downsampled to 256 Hz. To remove muscle and eye artifacts from the signals, we used independent component analysis (ICA) as implemented by FieldTrip. ICA components were removed if their amplitude was greater than the mean plus 4 standard deviations for more than 5 trials. The resulting filtered signals were visually confirmed to be free of prominent eye blinks and large amplitude deviants.
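
As an illustration of the rejection rule described above, the following sketch flags components whose per-trial peak amplitude exceeds the mean plus 4 standard deviations on more than 5 trials. It assumes a precomputed component-by-trial amplitude matrix and is not the FieldTrip implementation.

```python
import numpy as np

def flag_artifact_components(peak_amp, n_sd=4.0, min_trials=5):
    """peak_amp: array (n_components, n_trials) of peak absolute amplitudes of
    each ICA component on each trial. A component is flagged if its amplitude
    exceeds mean + n_sd * std on more than `min_trials` trials (rule as
    paraphrased from the text; the mean/std are taken over that component's trials)."""
    mu = peak_amp.mean(axis=1, keepdims=True)
    sd = peak_amp.std(axis=1, keepdims=True)
    n_exceed = (peak_amp > mu + n_sd * sd).sum(axis=1)
    return np.where(n_exceed > min_trials)[0]

# Example with random data: 20 components, 100 trials.
rng = np.random.default_rng(0)
amps = rng.normal(size=(20, 100))
print(flag_artifact_components(amps))
```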

3.2.4 Neural data analysis

All neural data analyses were performed with all salient notes grouped together and with all control notes similarly grouped. Each analysis was then repeated with trials divided by salience level for each feature tested (pitch, timbre, intensity).

Power analysis: Time-frequency power analysis of each experimental trial was computed with a matching pursuit algorithm using a discrete cosine transform dictionary (Mallat and Zhang, 1993). For precise spectral resolution, neural responses from salient notes and matching notes in control trials were extracted in segments spanning two notes (i.e. 0.6 s post note onset). Extracted segments were concatenated across trials, and the power of the Discrete Fourier Transform (DFT) was obtained at each frequency bin of interest.

Concatenating the signals was necessary to increase the spectral resolution of the frequency analysis. While this process could create an edge effect at exactly 1.67 Hz (1/0.6 s) and possibly its integer multiples, post-hoc analyses and visual inspection confirmed that no significant artifacts resulted from the concatenation procedure. Furthermore, the same potential effects would affect salient and control trials equally.

The power of the Discrete Fourier Transform (DFT) of the signal at exactly 3.33 Hz (1/0.3 s) was divided by the average power at 2.33-4.33 Hz, with the power at 3.33 Hz excluded. The phase-locking power of the salient notes was defined as the normalized power at 3.33 Hz averaged over the top 15 channels with the strongest response. The power at other adjacent and further frequencies (3.2, 3.4, 6, 12, 20, 30, 38, and 40 Hz) was also obtained from the same spectrum. Channels were allowed to vary between subjects to allow for inter-participant variability. The same analysis was performed for salient notes as well as the matched notes appearing in control trials. Effects of individual and combined features were tested with within-subject ANOVAs with Holm-Bonferroni correction for significance (Snedecor and Cochran, 1989).
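
A minimal sketch of the normalized phase-locking measure for a single channel is shown below: 0.6 s segments are concatenated, the DFT power at 3.33 Hz is extracted, and it is normalized by the average power in the 2.33-4.33 Hz neighborhood (excluding the 3.33 Hz bin). The matching pursuit time-frequency step is omitted, and all names are illustrative.

```python
import numpy as np

def normalized_phase_locking(segments, fs=256.0, f0=1 / 0.3, band=(2.33, 4.33)):
    """segments: list of 1-D arrays (one 0.6-s epoch per note) from one channel.
    Returns DFT power at the note rate f0 (3.33 Hz) divided by the mean power
    in the surrounding 2.33-4.33 Hz band, excluding the f0 bin itself."""
    x = np.concatenate(segments)                    # concatenation sharpens frequency resolution
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    i0 = np.argmin(np.abs(freqs - f0))              # bin closest to 3.33 Hz
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    in_band[i0] = False                             # exclude the target bin from the baseline
    return power[i0] / power[in_band].mean()

# Example with synthetic data: 100 epochs of 0.6 s containing a weak 3.33 Hz component.
fs = 256.0
t = np.arange(int(0.6 * fs)) / fs
rng = np.random.default_rng(1)
segs = [0.2 * np.sin(2 * np.pi * (1 / 0.3) * t) + rng.normal(size=t.size) for _ in range(100)]
print(normalized_phase_locking(segs, fs))
```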

Phase-coherence: Narrowband decompositions of the neural signals were first computed by spectrally filtering the responses for each channel individually along the following ranges: Delta 1-3 Hz, Theta 4-8 Hz, Alpha 9-15 Hz, Beta 16-30 Hz, Gamma 31-100 Hz. Then the phase of the Hilbert transform was obtained for each narrowband signal. Segments around the salient notes (0-300 ms) and the corresponding control notes were obtained. All of these segments were positioned away from the onset and offset of the sequence to avoid any filter boundary effects. Phase coherence (c_θ) was computed at each time point t of the segment as the magnitude of the average unit phasor of the instantaneous phase θ_i(t), averaged over all N segments i:

c_θ(t) = (1/N) · | ∑_{i=1}^{N} exp(j · θ_i(t)) |
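
The sketch below computes this inter-trial phase coherence for one channel and one band using SciPy. For brevity it filters the short segments directly, whereas the analysis above filtered the continuous recording before segmenting; names and the demo data are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def inter_trial_phase_coherence(segments, fs, band=(4.0, 8.0)):
    """segments: array (n_segments, n_samples) of one channel's responses
    around note onsets. Band-pass filter (e.g. theta), take the Hilbert
    phase, and return c_theta(t) = |mean_i exp(j*theta_i(t))|."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    narrow = sosfiltfilt(sos, segments, axis=1)      # zero-phase narrowband filtering
    phase = np.angle(hilbert(narrow, axis=1))        # instantaneous phase per segment
    return np.abs(np.exp(1j * phase).mean(axis=0))   # coherence at each time point

# Example: 50 phase-consistent segments yield coherence close to 1.
fs = 256.0
t = np.arange(int(0.3 * fs)) / fs
rng = np.random.default_rng(2)
segs = np.array([np.sin(2 * np.pi * 6 * t) + 0.5 * rng.normal(size=t.size)
                 for _ in range(50)])
print(inter_trial_phase_coherence(segs, fs).mean())
```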

ERP analysis: EEG trials were bandpass filtered between 0.7 and 25 Hz. Responses from frontal electrodes (Fz and 21 surrounding electrodes) and central electrodes (Cz and 23 surrounding electrodes) were analyzed (Shuai and Elhilali, 2014). Segments corresponding to salient notes and corresponding control notes were extracted separately for each of these channels. First, difference waveforms at the left mastoid, right mastoid, and Fz channels were analyzed. These channels were selected based on the MMN literature, which has established that the maximum amplitude of MMN is observed at Fz and that MMN is characterized by polarity reversal at the mastoids. Visual inspection confirmed this to be the case. Significant negative peaks were confirmed for all subjects at Fz by paired t-tests comparing the MMN time window point-by-point to 0, and likewise polarity reversals were observed at the mastoids. Following confirmation of an MMN, the trials were re-referenced to the average mastoids. Difference waveforms were re-computed for all subjects across central and frontal electrodes (though qualitatively similar results were obtained for individual subjects, albeit at a lower SNR). A number of baseline corrections were explored but did not quantitatively change the results. For each average waveform, peaks were extracted over various windows of interest: P1 (positive peak) at 25-75 ms, N1 (negative peak) at 75-120 ms, MMN (negative peak) at 120-180 ms, P3a (positive peak) at 225-275 ms.

Spectrotemporal receptive fields: The cortical activity giving rise to EEG signals was modeled by spectro-temporal receptive fields (STRFs). The processing filter acting on the auditory stimulus spectrogram s(f, t) is denoted as STRF, and all background cortical activity and noise is denoted as ϵ(t). The EEG recording r(t) is the sum of the processed stimulus representation and this background activity. The STRF model is then described as:

r(t) = ∑_f ∫ STRF(f, τ) · s(f, t − τ) dτ + ϵ(t)

Estimation of the STRF was performed by boosting (David, Mesgarani, and Shamma, 2007), implemented by a simple iterative algorithm that converges to an unbiased estimate. A brief description of the algorithm is as follows. The STRF (size F × T) is initialized to zero, and a small step size δ is defined. For each time-frequency point in the STRF (every element in the matrix), the STRF is incremented by δ and −δ, giving a pool of F × T × 2 possible STRF increments. Among these STRFs, the one that provides the smallest mean-squared error is selected as the increment for the current iteration. This process is repeated until none of the STRFs in the possible increment pool improve the mean-squared error. Here, we further iterate on the STRF by setting the step size to δ/2 and continuing the same process, with 4 step size reductions in total.
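
A compact sketch of this coordinate-wise boosting procedure is given below, operating on a lagged design matrix built from the stimulus spectrogram. It follows the steps described above (an increment pool of ±δ per coefficient, greedy selection by mean-squared error, four step-size halvings) but is an illustration rather than the implementation of David, Mesgarani, and Shamma (2007).

```python
import numpy as np

def boosted_strf(S, r, n_lags, delta=0.05, n_halvings=4):
    """Coordinate-wise boosting sketch for an STRF.
    S: stimulus spectrogram, shape (F, T); r: neural response, shape (T,).
    Returns an STRF of shape (F, n_lags) such that
    r(t) ~ sum_f sum_tau STRF[f, tau] * S[f, t - tau]."""
    F, T = S.shape
    X = np.zeros((T, F * n_lags))            # lagged design matrix
    for f in range(F):
        for tau in range(n_lags):
            X[tau:, f * n_lags + tau] = S[f, :T - tau]
    w = np.zeros(F * n_lags)
    resid = r.astype(float).copy()           # residual r - X @ w (w starts at zero)
    for _ in range(n_halvings + 1):          # initial step size plus 4 halvings
        improved = True
        while improved:
            # Squared error after adding +delta or -delta to each coefficient.
            errs = np.array([np.sum((resid[:, None] - s * delta * X) ** 2, axis=0)
                             for s in (+1.0, -1.0)])
            k = np.unravel_index(np.argmin(errs), errs.shape)
            improved = errs[k] < np.sum(resid ** 2)
            if improved:
                sign = +1.0 if k[0] == 0 else -1.0
                w[k[1]] += sign * delta
                resid -= sign * delta * X[:, k[1]]
        delta /= 2.0
    return w.reshape(F, n_lags)

# Tiny example with a known ground-truth filter.
rng = np.random.default_rng(3)
S = rng.normal(size=(4, 500))
true = 0.5 * rng.normal(size=(4, 3))
r = sum(np.convolve(S[f], true[f], mode="full")[:500] for f in range(4))
r += 0.1 * rng.normal(size=500)
print(np.round(boosted_strf(S, r, n_lags=3), 2))
```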

STRFs were estimated for salient and control segments separately and were defined for a 300 ms window, reflecting the frequency of new notes. Two-fold cross-validation was used to validate STRFs during estimation: trials were divided into two groups with an equal number of factorial repetitions in each group. An STRF was estimated for one group and used to obtain an estimated neural response for the other group, which was then correlated with the actual response. STRFs with a correlation of less than 0.05 were eliminated to remove estimates with low predictive power, and the remaining STRFs were averaged as the final STRF estimate for that condition. Using higher-fold estimates did not give significantly different results for the overall case across all salient notes. The number of folds was limited to two given the limited number of trials that allowed an analysis of STRFs for each salient feature category (pitch, intensity, timbre). Feature STRFs were analyzed by using the salient segments that corresponded to each level of the feature at hand in separate estimations. All STRFs were averaged over data in channel Fz (C21 on the Biosemi map) and 4 surrounding channels (C22, C20, C12, C25).

Figure 3.1: A. Schematic of the experimental paradigm. Participants are asked to attend to a screen and perform a rapid serial visual presentation (RSVP) task. Concurrently, a melody plays in the background and subjects are asked to ignore it. In “salient” trials (shown), the melody has an occasional deviant note that did not fit the statistical structure of the melody. In “control” trials (not shown), no deviant note is present. B. Power spectral density of a sample melody. Notes forming the rich melodic scene overlap but form a regular rhythm at 3.33 Hz. Only one note in the melody deviates from the statistical structure of the surround. C. Grand average EEG power shows a significant enhancement upon the presentation of the salient note. The enhancement is particularly pronounced around 3-3.5 Hz, a range including the stimulus rate 3.33 Hz (1/0.3 s).

3.3 Results

Subjects performed a rapid serial visual presentation (RSVP) task by identifying numbers within a sequence of characters (Haroush, Hochstein, and Deouell, 2010) (Fig. 3.1A, top). Participants were closely engaged in this task (overall target detection accuracy 70.3%), and their performance was modulated by task difficulty (accuracy 75.2% for the low-load and 65.4% for the high-load task; see Methods for details). Concurrently, sound melodies were played diotically in the background and subjects were asked to completely ignore them. Acoustic stimuli consisted of sequences of musical notes with inherent regularity. Fig. 3.1A shows a schematic of such a melody composed of violin notes with varying intensities and pitches and an unexpected salient piano note. Salient notes varied along pitch, timbre and intensity in a crossed-factorial design (Fisher, 1935). While notes were overlapping, the entire melody had an inherent regularity with a period of 300 ms (Fig. 3.1B). “Salient” melodies included an occasional deviant note that did not fit the statistical structure of the melody (e.g. the piano among violins), while “control” melodies did not contain any deviants. Control melodies were constructed to include the same deviant notes used in salient trials, but these notes were consistent with the inherent statistical structure of the control melody and as such were not salient.

While visual targets and auditory salient notes were not aligned in the experimental design, we probed any distraction effects due to the presence of occasional conspicuous notes in the ignored melodies. Visual trials coinciding with salient melodies contained a subset of targets that occurred prior to salient notes, while others occurred after the salient note. When comparing the effect of salient and control melodies on visual target detection, there was no notable difference in detection accuracy for targets occurring prior to salient notes (unpaired t-test, t(13) = 0.86, p = 0.41). In contrast, the presence of a deviant note in salient trials significantly affected visual task performance when compared with control trials (unpaired t-test, t(13) = 3.27, p = 0.006).

Though ignored, the auditory melody induced a strong neural response with a clear activation around 3.33 Hz. Fig. 3.1C shows the time-frequency profile of neural responses centered around deviant notes (at 0) with a clear strong activation near 3.33 Hz. To confirm the observed change in neural power, the spectral energy averaged over the region [3, 3.5] Hz was compared before and after the onset of salient notes and confirmed to be statistically significant (paired t-test, t(13) = −11.3, p < 10−7). Part of this change in neural response power is likely due to acoustic changes in the physical nature of the salient note when compared to before the change. We therefore focused subsequent analyses on comparing the same physical note when salient in deviant trials and when non-salient in control trials, hence eliminating any effects due to the local acoustic attributes of the note itself.

Using the discrete Fourier transform (DFT), we looked closely at entrainment of cortical responses exactly at 3.33 Hz, as well as at other frequencies. Comparing the same note when salient versus not, the phase-locked response significantly increased only at the stimulus rate (paired t-test, p < 10−4; Fig. 3.2A, top row). No such increase was found for neural responses at close frequencies in the delta and theta bands, nor in higher EEG bands.

Figure 3.2: A. The analysis of spectral power compares phase-locking at various frequency bins for salient notes vs. the same notes when not salient (top), and compares various degrees of salience in pitch, intensity and different instruments (timbre). B. The same analysis is performed on phase-coherence across trials at different frequency bands.

Looking closely at effects of acoustic attributes of the salient note, stronger salience in a particular feature resulted in stronger phase-locking relative to weaker salience (Fig. 3.2A, second and third rows). Specifically, both pitch (F(1, 13) = 37.0, p < 10−4) and intensity (F(1, 13) = 35.6, p < 10−4) resulted in greater modulation of neural power only at the melodic rhythm. Deviance along timbre also revealed significant differential phase-locked enhancement only at the stimulus rate (Fig. 3.2A, fourth row), with guitar deviants driving a larger increase in phase-locked responses compared to piano and clavichord notes (Timbre-bg: F(2, 26) = 6.13, p < 10−2; Timbre-fg: F(2, 26) = 15.5, p < 10−4). These effects are in line with reported variations of the acoustic profiles of clavichord, piano, and guitar, indicating stronger differences in the guitar spectral flux, spectral irregularity as well as temporal attack time relative to the other two instruments (McAdams et al., 1995).

Table 3.1: Feature effects on EEG measures of salience. Entries are F (p) values.

Effects          Neural power          Theta Coherence
Pitch            37.00 (
P, I             28.08 (
P, Tb            8.04 (
P, Tf            6.38 (
I, Tb            2.72 (0.08)           0.72 (0.49)
I, Tf            8.97 (
Tb, Tf           3.96 (
P, I, Tb         0.57 (0.57)           10.42 (
P, I, Tf         3.94 (0.03)           8.57 (
P, Tb, Tf        4.91 (
I, Tb, Tf        3.30 (0.02)           0.66 (0.63)
P, I, Tb, Tf     4.14 (

Figure 3.3: A. Summary of interaction weights based on phase-locking response and coherence results as outlined in Table 3.1. Solid lines indicate 2-way, dashed lines 3-way, and dotted lines 4-way interactions. Effects that emerge for both measures are shown in black, and those that are found for at least one measure are shown in gray. B. Summary of interaction weights based on human behavioral responses. Solid lines indicate 2-way, dashed lines 3-way, and dotted lines 4-way interactions. Black lines indicate effects that emerge from the behavioral results for the stimulus in this work. Gray lines indicate effects that emerge from behavioral results for speech and nature stimuli tested with the same experimental design as the music stimuli.

Given the factorial design of the paradigm concurrently probing combinations of features, phase-locking power can be examined across acoustic dimensions. Results of a within-subjects ANOVA are given in Table 3.1, middle column. The analysis shows a sweeping range of strong effects and significant interactions across features. Worth noting are main effects of Pitch, Intensity and Salient-Timbre, as well as numerous nonlinear interactions across many features including 3-way and 4-way interactions. Many effects reported in Table 3.1 are in line with similar interactions previously reported in behavioral experiments using the same acoustic stimuli (Kaya and Elhilali, 2014b), while other interactions are only observed here in the phase-locked responses (e.g. the 4-way interaction between Pitch × Intensity × Salient-Timbre × Background-Timbre).

To complement the magnitude phase-locking analysis, we investigated effects of salient notes on phase profiles of neural responses. A measure of inter-trial phase-coherence was used to quantify similarity of neural phase patterns across trial repetitions. Again, when comparing salient notes with the same notes in the control context, phase-coherence was overall strongest in the theta band, where salient notes evoked significantly higher phase-coherence (Fig. 3.2B, top row). Phase-coherence across salient notes also increased based on salience strength. No significant phase effects were observed in the delta or beta ranges. A higher pitch or intensity resulted in stronger coherence compared to a low pitch or intensity difference (Pitch: F(1, 13) = 81.28, p < 10−6; Intensity: F(1, 13) = 23.23, p < 10−3; Fig. 3.2B, second and third rows). Different salient note timbres also elicited significantly different amounts of phase-coherence (Timbre-bg: F(2, 26) = 9.85, p < 10−3; Timbre-fg: F(2, 26) = 12.98, p < 10−3; Fig. 3.2B, bottom row). An assessment of interaction effects of phase-coherence in the theta band also revealed sweeping effects, with many interactions consistent with those observed in neural power (Table 3.1, right column).

Figure 3.3(A) summarizes the nonlinear interactions across acoustic features observed in both response power and phase-coherence detailed in Table 3.1. Effects that are common across both neural measures are shown in black, revealing consistent nonlinear synergy across acoustic features. Interestingly, these interactions are closely aligned with nonlinear interactions obtained from the previous behavioral experiments using the same stimuli (Kaya and Elhilali, 2014b) (Fig. 3.3(B) replicates the published effects for comparison).

Figure 3.4: Distribution of neural effects as a function of overall salience. Both phase-locking magnitude and cross-trial coherence increase with greater stimulus salience. The x-axis denotes the number of features in which the deviant has a high level of difference from the standards, with no deviation in controls. 1 corresponds to deviants with the lowest level of salience: no difference in timbre, 2 dB difference in intensity, and 2 semitone difference in pitch. Any change in timbre is considered a high level of salience.

Because the experimental design manipulates multiple acoustic features simultaneously, we probed the neural correlates of salience as a function of overall acoustic salience. The greater the number of deviant features, the greater the effect on the neural response in phase-locking magnitude and theta phase-coherence (Fig. 3.4). The figure plots a systematic increase in the number of salient features (x-axis) against the change in the neural response (y-axis). A slope quantifying the linear fit of this increase confirms significantly positive increases for phase-locking magnitude (95% bootstrap interval [0, 4.34]) and theta-band phase-coherence (95% bootstrap interval [3.42, 7.18]). No significant increases are noted for delta-band and beta-band phase-coherence (95% bootstrap intervals [-1.92, 5.57] and [-3.53, 5.79], respectively).
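
A bootstrap confidence interval on such a slope can be obtained along the lines of the sketch below, which resamples (x, y) pairs with replacement; the exact resampling unit used in the study (e.g. subjects or trials) is not specified here, so this is only an illustration of the approach.

```python
import numpy as np

def bootstrap_slope_ci(x, y, n_boot=10000, ci=95, seed=0):
    """Bootstrap confidence interval for the slope of a linear fit of y on x.
    x: number of strongly deviant features; y: neural effect size."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, x.size, x.size)       # resample pairs with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    lo, hi = np.percentile(slopes, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return lo, hi

# Example: the interval excludes zero when the effect grows with salience.
rng = np.random.default_rng(4)
x = np.repeat([1, 2, 3], 30)
y = 2.0 * x + rng.normal(scale=3.0, size=x.size)
print(bootstrap_slope_ci(x, y))
```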

To complement the phase-locking analysis, we analyzed event-related potentials (ERPs) in response to control vs. salient notes. Overall difference ERPs obtained by subtracting control from overall salient notes revealed significant MMN and P3a amplitude effects (paired t-test: t(13) = −1.4, p < 10−6 for MMN and t(13) = 2.2, p < 10−5 for P3a at fronto-central sites), but no significant latency effects and no notable differences in the P50 or N1 time windows at any channel (Fig. 3.5A, top). Both components were further modulated by the increase of salience along pitch or intensity (paired t-test: p < 10−3 for both components and both features; Fig. 3.5A). The timbre of salient notes showed an effect on P3a (one-way repeated-measures ANOVA: p < 10−2) but not on MMN. The presence of a typical P3a component complementing the mismatch peak, and its modulation by note salience, further supports the engagement of involuntary attention due to conspicuous deviances in melody statistics.

Furthermore, the dynamic and continuous nature of the experimental stimulus allowed the estimation of neural filters that process sensory input. These filters are modeled after spectro-temporal receptive fields (STRFs) that have been used to characterize the tuning properties of neurons in auditory cortex (Elhilali et al., 2013). The STRF represents an optimal mapping between spectro-temporal features of a stimulus and the corresponding neural response. Unlike time-locked ERP analysis, the STRF is derived as an ongoing convolution filter (the STRF slides over the spectrogram to calculate the response). As a result, negative and positive deflections in the STRF at time t do not represent the averaged trend of the neural epochs at time t after onset. Rather, they reflect how energy patterns at any point in the stimulus affect the neural response with a delay of t, with no concept of an “onset”. Figure 3.5B shows notable changes in this neural filter derived from responses to salient notes (right) relative to background notes (left). Overall, the patterns (positive and negative regions) were consistent across both filters but were much more pronounced in the salient filter, likely in line with the stronger response profiles reported earlier. Of particular interest were filter characteristics in areas corresponding to time windows of neural responses that showed significant changes in the evoked ERP: the 120-170 ms MMN time window, and the 220-270 ms P3a time window. At the corresponding time shifts, both the control and salient STRFs showed a response: negative for the MMN time window and positive for the P3a time window, but much more pronounced for salient notes and with greater spectral spread. Looking closely at specific acoustic features, an increase in pitch and intensity salience levels also resulted in a similar stronger response, though pitch salience also induced a broader spectral spread than intensity. Different instruments also gave rise to varying spectro-temporal patterns, possibly indicating different neural processing for each instrument. These variations are consistent with greater conspicuity of guitar spectrotemporal structure relative to piano and clavichord notes, particularly in terms of irregularity of spectral spread and sharp temporal dynamics caused by the plucking of guitar strings (Giordano and McAdams, 2010; Peeters et al., 2011; Patil et al., 2012).

Figure 3.5: A. Estimate of ERP peaks over different windows of interest (P50, N1, MMN and P3a) comparing notes when salient vs. not. The same analysis compares salient notes with different acoustic attribute levels. B. Estimated STRF averaged across subjects, computed for overall control vs. salient notes. C-E. STRFs estimated from salient notes that belong to specific levels of tested features, or specific foreground timbres.

3.4 Discussion

This study examines neural markers of auditory salience using complex natural melodies. Specifically, the results show that the long-term statistical structure of sounds shapes the neural and perceptual salience of each note in the melody, much like spatial context dictates salience of visual items beyond their local visual properties (Wolfe and Horowitz, 2004b; Nothdurft, 2005; Itti and Koch, 2001b). In this work, brain responses are shown to be sensitive to the acoustic context of sounds by tracking the dynamic changes in pitch, timbre and intensity of musical sequences. The presence of salient notes that stand out from their context significantly enhances phase-locking power and cross-trial theta phase alignment of salient events, and causes them to distract subjects from the task at hand, even in another modality (visual task). The degree of modulation of neural responses is closely linked to the acoustic structure of salient notes given their context (Fig. 3.4), and reflects a nonlinear integration of variability across a high-dimensional acoustic feature space. For instance, a deviance in the melodic pitch line induces neural changes that are closely influenced by the musical timbre and overall intensity of the melody. While such synergistic interactions have been previously reported in behavioral studies (Melara and Marks, 1990; Kaya and Elhilali, 2014b), the close alignment between these perceptual effects and neural responses in the context of salience suggests the presence of interdependent encoding of these attributes in the auditory system and provides a neural constraint on such nonlinear interactions that explain perception of salient sound objects. Neurophysiological studies have provided support for such nonlinear integration of acoustic features in overlapping neural circuits (Bizley et al., 2009; Allen et al., 2017), a concept which lays the groundwork for an integrated encoding of auditory objects in terms of their high-order attributes (Nelken and Bar-Yosef, 2008). Here, we show that such integrated encoding is itself shaped by the long-term statistical structure of the context of the acoustic scene, in line with a wide range of contextual feedback effects that shape nonlinear neural responses in auditory cortex (Bartlett and Wang, 2005; Asari and Zador, 2009; Mesgarani, David, and Shamma, 2009; Angeloni and Geffen, 2018).

The neural changes in response to salient notes are specifically observed in neural entrainment and phase-alignment to the auditory rhythm, even with subjects' attention directed away from the auditory stimulus. The enhancement of phase-locking power complements previously reported “gain” effects that have mostly been attributed to top-down attention (Hillyard, Vogel, and Luck, 1998) and interpreted as facilitating the readout of attended sensory information, effectively modulating the signal-to-noise ratio of sensory encoding in favor of the attended target. A large body of work has shown that directing attention towards a target of interest does indeed induce clear neural entrainment to the rate or envelope of the attended auditory streams, hence enhancing its representation (Elhilali et al., 2009; Kerlin, Shahin, and Miller, 2010). In fact, studies simulating the “cocktail party effect” with multiple competing speakers reveal that neural oscillations entrain to the envelope of the attended speaker (Ding and Simon, 2012; Mesgarani and Chang, 2012; O'Sullivan et al., 2015; Fuglsang, Dau, and Hjortkjær, 2017). Of particular interest to the present work is the observation that representations of unattended acoustic objects are nonetheless maintained in early sensory areas, even if in an unsegregated fashion (Ding and Simon, 2012; Puvvada and Simon, 2017). Here, we observe that even ignored sounds can induce similar gain changes when these events are conspicuous enough relative to their context, effectively engaging attentional processes in a bottom-up fashion. The melodic rhythm used in the current study falls within the slow modulation range typical for natural sounds (e.g. speech) and is commensurate with rates that single neurons and local field potentials in early auditory cortex are known to phase-lock to (Wang et al., 2008; Kayser et al., 2009; Chandrasekaran et al., 2010). While it is unclear whether the observed enhancement in phase-locking gain is a direct result of contextual modulations of these local neural computations or whether it reflects cognitive networks typically engaged in top-down attentional tasks, the nature of the stimulus and the observed behavioral effects suggest an engagement of both: the complex nature of salient stimuli likely evokes large neural circuits or multiple neural centers spanning multiple acoustic feature maps, and the observed distraction effects on a visual task also posit an engagement of association or cognitive areas likely spanning parietal and frontal networks, in agreement with broad circuits reported to be engaged during involuntary attention (Watkins et al., 2007; Salmi et al., 2009; Ahveninen et al., 2013). The reported presence of a P3a evoked component that is itself modulated by note salience further supports the engagement of involuntary attentional mechanisms that likely extend to neural circuits beyond sensory cortex (Escera and Corral, 2007; Soltani and Knight, 2012).

Complementing the steady-state “gain” effects, the study also reports an enhancement of inter-trial phase-coherence in the theta range whose effect size is strongly regulated by the degree of salience. Enhanced entrainment to low-frequency cortical oscillations has been posited as a mechanism that boosts or stabilizes the neural representation of attended objects relative to distractors in the environment (Henry and Obleser, 2012; Ng, Schroeder, and Kayser, 2012). As a correlate of temporal consistency of brain responses across trials, inter-trial phase coherence measures temporal fidelity in specific oscillation ranges. Modulations in the theta band specifically have been tied to shared attentional paradigms whereby a theta rhythmic sampling operation allows less target-relevant stimuli to be sampled, resulting in a more ecologically essential examination of the environment; thus, these modulations can be a marker of divided attention (Landau et al., 2015; Keller, Payne, and Sekuler, 2017; Spyropoulos, Bosman, and Fries, 2018; Teng et al., 2018). In the present study, not only does the strength of phase-coherence follow closely the salience of the conspicuous note, but the strong parallels between neural and behavioral nonlinear interactions (Fig. 3.3) proffer a link between perceptual detection and temporal fidelity of the underlying neural representation of salient events in a dynamic ambient scene.

The use of complex melodic structures in this work is crucial in shedding light on strong nonlinear interactions in neural processing of salient sound events. While effects reported here are heavily tied to the acoustic changes in the stimulus, the presence of a mismatch component followed by an early P3a component provides further support that entrainment effects are indeed associated with engagement of attentional networks. The emergence of a deviance MMN component despite the dynamic nature of the background strongly suggests that the auditory system collects statistics about the ongoing environment, thereby forming internal representations about the regularity in the melody. Violation of these regularities is clearly marked by a mismatch component and further engages attentional processes, as reflected by the P3a component (Muller-Gass et al., 2007; Escera and Corral, 2007). The presence of both components in this paradigm is in line with existing hypotheses positing a distributed architecture spanning the pre-attentive and attentional cerebral generators, and reflects that the complex nature of deviant notes in the melody indeed engages listeners' attention in a stimulus-driven fashion (Garrido et al., 2009; Escera et al., 2000). An interesting question remains regarding the link between these ERP components and neural oscillations. Generally, ERP components, including MMN and P3a, are hypothesized to result either from transient bursts of activity across neurons or neural groups time-locked to the stimulus and superimposed on “irrelevant” background neural oscillations, or from realignment of the phase of ongoing oscillations (phase-resetting) (David, Harrison, and Friston, 2005; Sauseng et al., 2007). Previous work has observed MMN responses with increased phase-coherence in the theta band with no increase in power coherence, and suggested that MMN is at least partially brought forth by phase-resetting (Klimesch et al., 2004; Fuentemilla et al., 2008; Hsiao et al., 2009). Our study presents similar coherence and ERP results; however, a few factors assist in distinguishing these two markers, which suggests that they reflect different processes. First, theta phase coherence shows its strongest effect in the time window prior to the MMN window. Second, the two markers reflect different interactions across acoustic features (Table 3.1). If ERP components were generated by additive evoked responses, a related question becomes whether these trial-by-trial amplitude modulations could be the driving factor for the observed phase-locking power enhancement, rather than a true phase alignment of neural oscillations, as the rate of the evoked responses matches the rhythm of the stimulus. We note that the ERP amplitude increase is limited to the time ranges of the negative and positive components around 150 ms and 250 ms, and that time-frequency analysis by matching pursuit reveals an increased effect of the target rhythm on a trial-by-trial basis, making evoked responses an unlikely mechanism for the observed entrainment effects.

In a similar vein, it is interesting to consider the distinction between ERP and STRF results. ERPs are obtained by time-locked averages of neural signals, thus extracting the positive or negative signal deflections that occur at the same time across epochs. The STRF, on the other hand, finds a sparse set of filter coefficients that best explain every instance in the epoch as a function of the past 300 ms of input sound. Crucially, it reflects a two-dimensional filter for processing the input signal at various frequencies and time bins (Ding and Simon, 2012; Elhilali et al., 2013). While the time windows of significant ERP components (MMN and P3a) reflect corresponding STRF inhibition and excitation respectively, both their amplitude and spectral span appear to be heavily modulated by degree of salience. Interestingly, the STRFs derived here have significant predictive power at only 1-8 Hz, indicating that the neural signal encodes slow temporal dynamics of the acoustic input most prominently.

Overall, the findings of this study open new avenues to investigate bottom-up auditory attention without relying on active subject responses in the auditory domain, thus eliminating top-down confounds. Results suggest a unified framework where both bottom-up and top-down auditory attention modulate the phase of the ongoing neural activity to organize scene perception. The entrainment measures employed in this study can further be used for natural scenes to decode salience responses from EEG or MEG recordings, allowing the construction of a ground-truth salience dataset for the auditory domain as an analog to eye-tracking data in vision.

Chapter 4

Push-pull competition between bottom-up and top-down auditory attention to natural soundscapes

This chapter presents neurophysiological effects elicited by natural auditory scenes, as recorded by EEG. In this section, an auditory amplitude modulation detection task is used to occupy subjects' voluntary attention, while the natural scene serves as a distractor. In particular, effects from changes in attention caused by salient events in the natural scene (bottom-up) are compared and contrasted with enhancement of attention caused by target stimuli (top-down). The contents of this chapter have been uploaded to bioRxiv (Huang and Elhilali, 2018) and have been submitted for peer review.

4.1 Introduction

Attention is a selection mechanism that deploys our limited neural resources to the most relevant stimuli in the environment. Without such a process, the sights and sounds of everyday life would overwhelm our senses. Filtering important information from our surroundings puts a constant demand on the cognitive system given the dynamic nature of everyday scenes.

On the one hand, we attend to sounds, sights and smells that we choose based on what matches our behavioral goals and contextual expectations. At the same time, we have to balance perception of salient events and objects that we need to be alerted to, both for survival and for awareness of our ever-changing surroundings. These various factors guide our attentional resources to dynamically shift in order to shape the representation of sensory information based on its behavioral relevance, and ultimately influence how we perceive the world around us.

How does the brain manage its executive attentional resources when faced with these dynamic demands? Studies of voluntary (“top-down” or endogenous) attention have shown that cognitive feedback modulates the encoding of sensory cues in order to improve the signal-to-noise ratio of attended targets relative to irrelevant maskers or other objects in the environment (Baluch and Itti, 2011b; Desimone and Duncan, 1995; Shamma and Fritz, 2014). Whether attending to someone's voice amidst a crowd, or spotting that person in a busy street, or even identifying a particular smell among others, changes to neural encoding due to attention have been reported across all sensory modalities, including auditory, visual, olfactory and somatosensory systems, and appear to operate across multiple neural scales (Corbetta and Shulman, 2002; Zelano et al., 2005; Bauer, 2006; Knudsen, 2007). Response profiles of individual neurons, entire sensory circuits, and regions across the brain are modulated by attentional feedback to produce neural responses that reflect not only the physical properties of the stimulus but also the behavioral state and reward expectations of the system. Some of the hallmarks of selective attention are enhanced neural encoding in sensory cortex of attended features and dynamics (e.g. the envelope of an attended speaker (Mesgarani and Chang, 2012)) as well as recruitment of dorsal fronto-parietal circuits mediated by boosted gamma oscillatory activity (Fries, 2009; Baluch and Itti, 2011b).

In contrast, our understanding of the effects of involuntary (“bottom-up” or exogenous) attention has been mostly led by work in vision. There is a well-established link between perceptual attributes of salience and their influence on attention deployment in visual scenes (Borji, Sihite, and Itti, 2013a; Treue, 2003b; Wolfe and Horowitz, 2004b). Studies of visual salience greatly benefited from natural behaviors such as eye gaze, which facilitate tracking subjects' attentional focus and allow the use of rich and naturalistic stimuli including still pictures and videos (Borji, 2015b; Carmi and Itti, 2006; Hart et al., 2009). In parallel, exploration of brain networks implicated in visual salience revealed engagement of both subcortical and cortical circuits that balance the sensory conspicuity of a visual scene with more task-related information to shape attentional orienting to salient visual objects (Veale, Hafed, and Yoshida, 2017). Bottom-up attention also engages ventral fronto-parietal circuits that orient subjects' focus to salient stimuli outside the spotlight of voluntary attention (Corbetta and Shulman, 2002; Fox et al., 2006; Asplund et al., 2010).

By comparison, the study of auditory bottom-up attention has proven more challenging owing to the difficulty of properly defining behavioral metrics that determine when attention is captured by a salient sound. Pupillometry has been explored in auditory salience studies, with recent evidence suggesting a correlation between stimulus salience and pupil size (Liao et al., 2016b; Wang et al., 2014). Still, there are various aspects of this correlation that remain ill understood, with some evidence suggesting that a large component of the pupillary response is driven by the sound's loudness and its local context rather than a full account of its perceptual salience (Liao et al., 2017; Huang and Elhilali, 2017). In parallel, a large body of work has focused on deviance detection in auditory sequences, and has established neural markers associated with deviant or rare events. An example of such a response is the mismatch negativity (MMN), which can be elicited pre-attentively, though this response is modulated by attention and associated with behavioral measures of distraction (Roeber, Widmann, and Schröger, 2003; Näätänen et al., 2007). By using sequences with deviant tokens or snippets, studies of novelty processing are able to better control the acoustic parameters of the stimulus and the precise occurrence and nature of salient events. Other studies have extended oddball designs using richer sound structures including musical sequences or noise patterns, but still piecing together sound tokens to control the presence of transient salient events in the acoustic signal (Duangudom and Anderson, 2013a; Kayser et al., 2005a; Kaya and Elhilali, 2014b; Tordini, Bregman, and Cooperstock, 2015a). Nonetheless, this structure falls short of the natural intricacies of realistic sounds in everyday environments where salience can take on more nuanced manifestations. Similar to established results in vision, use of natural soundscapes could not only extend results observed with simpler oddball sequences, but also shed light on the privileged processing status of social signals that reflect attentional capture in everyday life (Doherty et al., 2017).

In the current work, we explore dynamic deployment of attention using an unconstrained dataset of natural sounds. The collection includes a variety of contents and compositions and spans samples from everyday situations taken from public databases such as YouTube and Freesound. It covers various settings such as a busy cafeteria, a ballgame at the stadium, a concert in a symphony hall, a dog park, and a protest in the streets. Concurrent with these everyday sounds, subjects' attention is directed towards a controlled tone sequence where we examine effects of top-down attention to this rhythmic sequence as well as intermittent switches of attention elicited by salient events in background scenes. This paradigm tackles three key limitations in our understanding of the dynamic deployment of attentional resources in complex auditory scenes. First, the study investigates the relationship between auditory salience and bottom-up attentional capture beyond deviance detection paradigms. The stimulus probes how distracting events in a background scene organically compete for attentional resources already deployed toward a dynamic sound sequence. Unlike clearly defined “deviants” typically used in oddball paradigms, the attention-grabbing nature of salient events in natural soundscapes is more behaviorally and cognitively nuanced and exhibits a wide range of dynamics both in strength and buildup. A salient event in a complex scene can vary from a momentary transient event (e.g. a phone ringing in an auditorium) to a gradually dynamic sound (e.g. a distinct voice steadily emerging from the cacophony in a busy cafeteria). Here, we are interested in probing whether this natural and nuanced capture of attention induces equally profound effects on brain responses as previously reported from top-down, task-related attention. The study leverages the organic nature of competition between bottom-up and top-down attention in natural soundscapes to not only test the hypothesis of limited resources shared between the two modes of attention, but also the engagement of distinct but overlapping brain networks (Corbetta and Shulman, 2002; Salmi et al., 2009).

Second, employing dynamic scenes allows us to focus our analysis beyond event-related potentials (ERPs), which require precisely aligned event onsets, hence often limiting paradigms to oddball designs or intermittent distractors. In the current study, the use of continuous scenes is anchored against a rhythmic sequence which provides a reference for temporal alignment of competing attentional states throughout the experiment. Third, while the study sheds light on neural markers of auditory salience, it does so relative to a competing sequence in a controlled attentional task. As such, it balances the dichotomy often found in studies of auditory salience using either distraction or detection paradigms. A number of studies probe bottom-up attention using irrelevant stimuli presented to the subjects without necessarily competing for their attentional focus, or where subjects are informed of or learn their value (see (Lavie, 2005)). Here, we are interested in characterizing the dynamic effect of salient distractors on the encoding of attended targets. Ultimately, the current study aims to determine how well we can predict the existence of attention-grabbing events while subjects are engaged in a competing task.

4.2 Methods

4.2.1 Stimuli:

Auditory stimuli consisted of scenes from a previous study of auditory salience in natural scenes (Huang and Elhilali, 2017). This JHU DNSS (Dichotic Natural Salience Soundscapes) database includes twenty natural scenes, each roughly two minutes in length. All scenes were sampled at 22 kHz with a bit rate of 352 kbps, and converted to mono signals whenever applicable. Scenes were drawn from many sources (including YouTube, Freesound, and the BBC Sound Effects Library), and encompassed a wide range of sounds and scenarios. Some scenes were acoustically sparse, such as bowling sounds in an otherwise quiet bowling alley. Others were acoustically dense, such as an orchestra playing in a concert hall or a busy cafeteria.

Salience of events in each scene was measured using a dichotic listening behavioral paradigm (see (Huang and Elhilali, 2017) for full details). Human listeners were presented with two simultaneous scenes, one in each ear, and asked to indicate which scene they were attending to in a continuous fashion. The proportion of subjects attending to a scene compared to all other scenes was defined as salience. Peaks in the derivative of this salience measure were defined as salient events. The strength of salient events was further defined as a linear combination of the salience slope and the salience maximum peak in a four-second window after the event. This salience strength was used to rank the events as high-salience, mid-salience, or low-salience, with one-third of the events falling within each category. Each category contained 117 events, with a total of 351 across all scenes.

In the current study, each scene was presented one at a time, concurrently and binaurally with a sequence of repeating tones, together forming a single trial (Figure 4.1). The tone sequence consisted of repeated 440 Hz tones, each tone 300 ms long, with 10 ms cosine on and off ramps. Tones were presented at a rate of 2.5 Hz. A behavioral study was performed using Amazon's Mechanical Turk to determine reasonable detection parameters for the tone sequence. Roughly 12.5% of the tones were amplitude modulated at 64 Hz to serve as targets for this behavioral modulation detection task; target tones were randomly positioned throughout the sequence. This experiment was developed using the jsPsych library (Leeuw, 2015) and the psiTurk framework (Gureckis et al., 2016). Two modulation depths were tested: 0 dB (easy condition) and -5 dB (hard condition). During EEG recording sessions, tones were presented at a presentation rate of 2.6 Hz with a duration of 300 ms and 10 ms cosine on and off ramps. To avoid any confounds of neural effects between targets and salient events in background scenes, only three to five tones in each trial were amplitude modulated at 64 Hz with a modulation depth of 0 dB. Further, amplitude-modulated targets were constrained to be at least 1.5 seconds away from any salient event within the concurrent natural scene. A total of 79 modulated targets were present throughout the entire experiment. Stimuli (concurrent scene and tone sequence) were presented binaurally to both ears via a pair of ER-3A insert earphones.
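
For illustration, the sketch below synthesizes one 300 ms, 440 Hz tone with 10 ms cosine ramps and optional 64 Hz amplitude modulation. The mapping from modulation depth in dB to a modulation index (m = 10^(depth/20)) is an assumption, as the exact AM convention is not spelled out in the text.

```python
import numpy as np

def make_tone(fs=22050, dur=0.3, f0=440.0, ramp=0.01, am_depth_db=None, fm=64.0):
    """440 Hz tone with 10 ms cosine on/off ramps; if am_depth_db is given,
    apply 64 Hz sinusoidal amplitude modulation with index m = 10**(dB/20)
    (so 0 dB corresponds to full modulation)."""
    t = np.arange(int(fs * dur)) / fs
    x = np.sin(2 * np.pi * f0 * t)
    if am_depth_db is not None:
        m = 10 ** (am_depth_db / 20.0)
        x *= (1.0 + m * np.sin(2 * np.pi * fm * t)) / (1.0 + m)
    n_ramp = int(fs * ramp)
    ramp_env = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))  # cosine ramp 0 -> 1
    x[:n_ramp] *= ramp_env
    x[-n_ramp:] *= ramp_env[::-1]
    return x

standard = make_tone()                 # non-target tone
target = make_tone(am_depth_db=0.0)    # easy-condition target (full 64 Hz modulation)
print(standard.shape, target.shape)
```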

92 4.2.2 Participants and Procedure:

Eighty-one subjects (ages 22-60, twenty-seven female) participated in the behavioral study over Mechanical Turk. After a short training period, each subject performed ten trials, alternating between easy and hard conditions. The order of easy and hard conditions was counter-balanced across subjects. For each trial, a natural scene was presented concurrently with the tone sequence; subjects were instructed to devote their attention to the tone sequence and to press the space bar in response to each modulated target tone. Targets with a response between 200 and 800 ms after stimulus onset were considered hits; accuracy was calculated as the percentage of targets that were hits. In order to evaluate any distraction effects from salient events, we contrasted the detection accuracy of two groups of tones: (i) targets occurring between 0.25 and 1.25 seconds after an event (the period of the strongest event effect), and (ii) targets occurring more than 4 seconds after any event (and thus unlikely to be affected).

Twelve subjects (ages 18-28, nine female) with no reported history of hearing problems participated in the main EEG experiment. Subjects were tasked with counting the number of amplitude-modulated tones within each sequence, and they reported this value at the end of the trial using a keyboard. Each subject heard each of the twenty scenes one time only. They were instructed to focus on the tone sequence and to treat the auditory scene as background noise. All experimental procedures (for both behavioral and EEG experiments) were approved by the Johns Hopkins University Homewood Institutional Review Board (IRB), and subjects were compensated for their participation.

4.2.3 Electroencephalography:

EEG measurements were obtained using a 128-electrode Biosemi Active Two system (Biosemi Inc., The Netherlands). Electrodes were placed at both left and right mastoids (for referencing) as well as below and lateral to each eye, in order to monitor eye blinks and movements. Data were initially sampled at 2048 Hz. Twenty-four electrodes surrounding and including the Cz electrode were considered to be in the ‘Central’ area, while 22 electrodes surrounding and including the Fz electrode were considered to be in the ‘Frontal’ area (Shuai and Elhilali, 2014). Eight electrodes were located in the intersection of these two sets; all statistical tests were performed using the union of the Central and Frontal electrodes.

Preprocessing: EEG data were analyzed using MATLAB (Mathworks Inc., MA), with both FieldTrip (Oostenveld et al., 2011) and EEGLab (Delorme and Makeig, 2004) analysis tools. Neural signals were first demeaned and detrended, then highpass filtered (3rd-order Butterworth filter with 0.5 Hz cutoff frequency). Signals were then downsampled to 256 Hz. Power line noise at 60 Hz was removed using the Cleanline plugin for EEGLab (Mullen, 2012). Outlier electrodes (around 2%) were removed by excluding channels exceeding 2.5 standard deviations of the average energy in 20-40 Hz across all channels. Data were then re-referenced using a common average reference.
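
The continuous preprocessing chain can be sketched with SciPy as a stand-in for the FieldTrip/EEGLab routines named above (the line-noise removal and outlier-channel rejection steps are omitted):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, detrend, resample_poly

def preprocess_eeg(raw, fs_in=2048, fs_out=256, hp_cutoff=0.5):
    """raw: array (n_channels, n_samples) of continuous EEG.
    Demean/detrend, 3rd-order Butterworth high-pass at 0.5 Hz (zero-phase),
    downsample to 256 Hz, and re-reference to the common average."""
    x = detrend(raw, axis=1, type="linear")                   # removes mean and linear drift
    sos = butter(3, hp_cutoff, btype="highpass", fs=fs_in, output="sos")
    x = sosfiltfilt(sos, x, axis=1)
    x = resample_poly(x, up=1, down=fs_in // fs_out, axis=1)  # 2048 -> 256 Hz
    return x - x.mean(axis=0, keepdims=True)                  # common average reference

rng = np.random.default_rng(5)
raw = rng.normal(size=(8, 2048 * 10))                         # 8 channels, 10 s of fake data
print(preprocess_eeg(raw).shape)                              # (8, 2560)
```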

Data Analysis: For event-based analyses (steady-state tone locking and gamma band energy), EEG signals were divided into epochs centered at the onset of a tone of interest (onset of a modulated target, the tone nearest a salient event, or onset of a control tone away from either target or salient event). Control tones were selected at random with the constraint that no target tone or salient event was contained within the control epoch; in total, 245 such control tones were selected. Each ‘epoch’ was ten seconds in duration (±5 seconds relative to onset). Noisy epochs were excluded using a joint probability criterion on the amplitude of the EEG data, which rejects trials with improbably high electrode amplitude, defined using both a local (single electrode) threshold of 6 standard deviations away from the mean, as well as a global (all electrodes) threshold of 2 standard deviations. Data were then decomposed using Independent Component Analysis (ICA). Components stereotypical of eyeblink artifacts were manually identified and removed by an experimenter.

For the tone-locking analysis, epochs were further segmented into 2.3-second (6-tone) windows before and after onsets of interest (target, salient event, and control tone). All 2.3-second segments for a given group were concatenated before taking the Fourier Transform. This concatenation was performed to achieve a higher frequency resolution (given the short signal duration and low frequency of interest). The concatenated signal was also zero-padded to a length corresponding to 260 event windows in order to maintain the same frequency resolution for all conditions. As the analysis focused on spectral energy around 2.6 Hz, edge effects were minimal at that frequency region. Moreover, any concatenation effects affected all 3 conditions of interest equally. Finally, a normalized peak energy was obtained by dividing the energy at 2.6 Hz by the average energy at surrounding frequencies (between 2.55 and 2.65 Hz, excluding the point at 2.6 Hz). The change in tone-locking power was defined as the normalized power for post-onset segments minus the power for pre-onset segments. For illustrative purposes, a tone-locking analysis was also performed over the full scene without dividing the data into epochs (Fig. 4.2A).
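
A sketch of this pre- vs. post-onset tone-locking measure for one channel is given below; segment concatenation, zero-padding to the length of 260 event windows, and normalization by the 2.55-2.65 Hz neighborhood follow the description above, while the names and demo data are illustrative.

```python
import numpy as np

def tone_locking_change(pre_segs, post_segs, fs=256.0, f0=2.6,
                        pad_windows=260, seg_dur=2.3):
    """Each input is a list of 2.3-s segments from one channel. Returns the
    normalized 2.6 Hz peak for post-onset minus pre-onset segments."""
    n_pad = int(pad_windows * seg_dur * fs)

    def norm_peak(segs):
        x = np.zeros(n_pad)                          # zero-pad to a fixed length
        cat = np.concatenate(segs)
        x[:cat.size] = cat
        freqs = np.fft.rfftfreq(n_pad, d=1.0 / fs)
        power = np.abs(np.fft.rfft(x)) ** 2
        i0 = np.argmin(np.abs(freqs - f0))           # bin closest to 2.6 Hz
        base = (freqs >= 2.55) & (freqs <= 2.65)
        base[i0] = False                             # exclude the 2.6 Hz bin itself
        return power[i0] / power[base].mean()

    return norm_peak(post_segs) - norm_peak(pre_segs)

# Example with synthetic segments (a 2.6 Hz component only after "onset").
fs = 256.0
t = np.arange(int(2.3 * fs)) / fs
rng = np.random.default_rng(6)
pre = [rng.normal(size=t.size) for _ in range(100)]
post = [np.sin(2 * np.pi * 2.6 * t) + rng.normal(size=t.size) for _ in range(100)]
print(tone_locking_change(pre, post, fs))
```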

High gamma band analysis was also performed by taking the Fourier Transform of the data in two-second windows around attended targets, salient events, or control tones. The energy at each frequency and each electrode was normalized by dividing by the mean power across the entire event window after averaging across trials, and then converted to decibels (dB). The average power between 70 and 110 Hz was taken as the energy in the high gamma band. The change in gamma-band power was defined as the high gamma energy in the window between 1.5 and 3.5 seconds post-onset minus the high gamma energy between 2.5 and 0.5 seconds pre-onset.

All statistical tests that compared pre- and post-onset activity were performed using a paired t-test analysis. Effects were considered statistically significant if p ≤ 0.05 and are shown with ∗ in figures. Effects with p ≤ 0.01 are shown with ∗∗.

Envelope Decoding:

Decoding of the background scene envelope was performed by training a linear decoder, as described in O'Sullivan et al. (O'Sullivan, Shamma, and Lalor, 2015). Briefly, a set of weights (W) was calculated to obtain an estimate of the envelope of the natural scenes (Ŷ) from the EEG data (X) according to the equation Ŷ = XW. Stimulus envelopes were extracted using an 8 Hz low-pass filter followed by a Hilbert transform. The EEG data itself was band-pass filtered between 2-8 Hz (both 4th-order Butterworth filters). Each linear decoder was trained to reconstruct the corresponding stimulus envelope from the EEG data, using time lags of up to 250 ms on all good electrodes. In contrast to the paper by O'Sullivan et al., which used least squares estimation, these weights W were estimated using ridge regression (Fuglsang, Dau, and Hjortkjær, 2017; Wong et al., 2018). Here, an additional regularization parameter (λ) is included to mitigate the effects of collinearity between independent variables (here the various EEG channels). The weights are estimated by the equation W = (XᵀX + λI)⁻¹ XᵀY (Wong et al., 2018). To avoid overfitting, a separate decoder was trained for each of the twenty scenes, using data from the remaining nineteen. The quality of these reconstructions was then evaluated over a range of regularization parameters λ by taking the correlation between the original stimulus envelope and the decoded envelope. A fixed high value of 220 was chosen to maximize this overall correlation value for a majority of subjects. Finally, the correlation between estimated and original envelopes within a one-second sliding window was calculated as a local measure of attention to the natural scene.
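
A minimal backward-model sketch of this ridge decoder is shown below. It builds a lagged EEG design matrix (lags 0-250 ms) and solves the regularized normal equations from the text; the per-scene cross-validation and envelope extraction steps are omitted, and all names are illustrative.

```python
import numpy as np

def lagged_design(eeg, n_lags):
    """Stack EEG channels at lags 0..n_lags-1: column block k holds eeg[t + k, :]."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[:T - lag, lag * C:(lag + 1) * C] = eeg[lag:, :]
    return X

def train_ridge_decoder(eeg, env, fs=256, max_lag_s=0.25, lam=220.0):
    """Backward (stimulus-reconstruction) decoder: solve
    W = (X^T X + lam*I)^(-1) X^T Y for the lagged EEG matrix X and envelope Y."""
    X = lagged_design(eeg, int(max_lag_s * fs) + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ env)

def decode_envelope(eeg, W, fs=256, max_lag_s=0.25):
    return lagged_design(eeg, int(max_lag_s * fs) + 1) @ W

# Toy example: the envelope leaks into four EEG channels with a 10-sample delay.
rng = np.random.default_rng(7)
env = rng.normal(size=5000)
eeg = np.stack([np.roll(env, 10) + rng.normal(size=5000) for _ in range(4)], axis=1)
W = train_ridge_decoder(eeg, env)
print(np.corrcoef(decode_envelope(eeg, W), env)[0, 1])
```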

Gamma Band Source Localization:

Source localization was performed using the Brainstorm analysis software package for MATLAB (Tadel et al., 2011) and its implementation of sLORETA (standardized low resolution brain electromagnetic tomography). In contrast to other methods for converting electrode recordings into brain sources, sLORETA is a linear imaging method that is unbiased and has zero localization error (Pascual-Marqui, 2002). It is similar to minimum norm estimation, which minimizes the squared error of the estimate, but its estimates of source activity are standardized using the covariance of the data.

A surface head model was computed using the ICBM 152 brain template (Mazziotta et al., 2001) and standard positions for the BioSemi 128-electrode cap, in lieu of individual MRI scans. Gamma band activity was computed by first taking the Fourier Transform of the data in 500 ms windows with a 200 ms step size centered around salient events, target tones, and control areas, then averaging over frequency bands between 70 and 110 Hz. Energy at each voxel on the cortical surface was computed from the EEG gamma band activity using sLORETA, then z-score normalized across time for each trial.

Voxel activation correlation:

In order to compare brain networks engaged by bottom-up and top-down attention, we selected voxels of interest from the surface models during target tones XT(t), salient events XE(t), and control tones XC(t) for each subject. For each time instant t, target voxel activations (salient event activations, respectively) across trials were compared voxel-by-voxel to the control activations at the same relative point in time t using a paired t-test across all trials (significance at p = 0.01). A Bonferroni correction was applied to select only voxels that were uniquely activated during target tones X̂T(t) and salient events X̂E(t) (Snedecor and Cochran, 1989). Using false discovery rates or random field theory to correct for multiple comparisons did not qualitatively change the final outcome of the correlation analysis (Hsu, 1996; Efron, 2010; Lindquist and Mejia, 2015). Using this comparison, all voxels that were not significantly different from control activations were set to zero, therefore maintaining only voxels that were uniquely activated during target and near-event tones. Next, matrices X̂T(t)[n, v] and X̂E(t)[n, v] representing activations across subjects n and voxels v for target and event responses at each time instant t were constructed, and the columns of each matrix were standardized to 0-mean and 1-variance. A sparse canonical correlation analysis (CCA) was performed for each time τT and τE to yield canonical vectors (or weights) wT(τT) and wE(τE). CCA effectively determines linear transformations of X̂T and X̂E that are maximally correlated with each other following the objective function

L : max_{wE, wT} q = wEᵀ X̂Eᵀ X̂T wT   (Uurtio et al., 2017).

Here, we specifically implemented a sparse version of CCA which optimizes the same objective function L but imposes sparse, non-negative constraints on wT and wE using a least absolute shrinkage and selection operator (L1) penalty function, following the approach proposed in (Rosa et al., 2015). This analysis resulted in a single similarity measure across the entire dataset, rather than a per-voxel analysis. A permutation-based approach was used to choose the regularization parameters; it performs a permutation test across choices of regularization parameters by independently permuting data samples and maximizing correlation values across all data shufflings (Rosa et al., 2015). This same permutation also resulted in an overall significance metric for the final correlation values, by accepting correlation values q that are unlikely to be obtained through a random shuffling of the two datasets (details are described in (Rosa et al., 2015)). We performed the sparse CCA across all time samples relative to targets and events in a cross-correlation fashion and scored as significant only the time lags that yielded a correlation value q with statistical significance less than p < 0.01. For all time lags for which a statistically significant correlation was obtained, we visually compared the significant canonical vectors wE and wT, which represent the sparse weights that yield a maximal correlation between the two sets, and noted common activation areas.
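
To convey the overall shape of this analysis, the simplified Python sketch below substitutes an ordinary (non-sparse) CCA from scikit-learn and synthetic activation matrices for the sparse, permutation-tuned procedure of Rosa et al. (2015); all sizes, effect magnitudes, and names are illustrative assumptions rather than the actual pipeline.

    import numpy as np
    from scipy.stats import ttest_rel
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    n_subjects, n_trials, n_voxels = 12, 40, 200            # synthetic sizes

    def unique_activation(cond, control, alpha=0.01):
        # Zero out voxels whose activation does not differ significantly from
        # control (paired t-test across trials, Bonferroni-corrected over voxels).
        t, p = ttest_rel(cond, control, axis=0)
        keep = p < alpha / cond.shape[1]
        return np.where(keep, cond.mean(axis=0), 0.0)        # one map per subject

    def simulate_subject(effect=1.5):
        control = rng.normal(size=(n_trials, n_voxels))
        cond = rng.normal(size=(n_trials, n_voxels))
        cond[:, :40] += effect                               # a block of truly active voxels
        return unique_activation(cond, control)

    # Subject-by-voxel matrices of gamma activity at one pair of time lags.
    X_T = np.vstack([simulate_subject() for _ in range(n_subjects)])
    X_E = np.vstack([simulate_subject() for _ in range(n_subjects)])

    # Standardize columns, then find maximally correlated linear projections.
    X_T = (X_T - X_T.mean(0)) / (X_T.std(0) + 1e-12)
    X_E = (X_E - X_E.mean(0)) / (X_E.std(0) + 1e-12)
    cca = CCA(n_components=1, max_iter=2000).fit(X_E, X_T)
    u, v = cca.transform(X_E, X_T)
    q = np.corrcoef(u[:, 0], v[:, 0])[0, 1]                  # canonical correlation value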

Predictions:

Event prediction was performed using an artificial neural network. The analysis was performed on a per-tone basis where tones starting between 1 and 2.5 seconds before an event were labeled as “before”, while tones starting between 1 and 2.5 seconds following an event were labeled as “after”. Any tones that qualified for both groups due to proximity of two salient events were removed from the analysis. Two features were used in the classification: energy at the tone presentation frequency and gamma band energy. Both measures were averaged across central and frontal electrodes for each of the twelve subjects, resulting in twenty-four features used for classification.

Gamma band energy was calculated by performing a Fourier transform on the data within a 2-second window around the onset of the tone, and then averaging energy between 70-110 Hz. A longer window was required to capture energy at the lower tone-presentation frequency. Thus, tone-locked energy was calculated by performing a Fourier transform on data within a 5-second window centered around the tone. Then, energy at 2.6 Hz was normalized by dividing by the energy at neighboring frequencies (between 2.5 and 2.7 Hz).
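
A minimal sketch of these two per-tone features is given below; the Hann taper, the zero-padded FFT used to sample the 2.5-2.7 Hz neighborhood, and the variable names are assumptions for illustration rather than the exact implementation.

    import numpy as np

    def band_energy(seg, fs, lo, hi):
        # Mean FFT power within [lo, hi] Hz for one Hann-windowed segment.
        freqs = np.fft.rfftfreq(len(seg), 1.0 / fs)
        power = np.abs(np.fft.rfft(seg * np.hanning(len(seg)))) ** 2
        return power[(freqs >= lo) & (freqs <= hi)].mean()

    def tone_features(eeg_avg, onset, fs):
        # Feature 1: high-gamma (70-110 Hz) energy in a 2 s window around the tone onset.
        gamma = band_energy(eeg_avg[onset - fs:onset + fs], fs, 70.0, 110.0)

        # Feature 2: 2.6 Hz energy in a 5 s window, normalized by the 2.5-2.7 Hz
        # neighborhood; the FFT is zero-padded to obtain bins finer than 0.2 Hz.
        seg = eeg_avg[onset - int(2.5 * fs):onset + int(2.5 * fs)]
        n_fft = 8 * len(seg)
        freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
        power = np.abs(np.fft.rfft(seg * np.hanning(len(seg)), n=n_fft)) ** 2
        at_rate = power[np.argmin(np.abs(freqs - 2.6))]
        neighborhood = power[(freqs >= 2.5) & (freqs <= 2.7)].mean()
        return gamma, at_rate / neighborhood

    # Hypothetical usage: one subject's EEG averaged over central/frontal
    # electrodes, sampled at 256 Hz, with a tone onset at t = 10 s.
    fs = 256
    eeg_avg = np.random.randn(60 * fs)
    gamma_energy, tone_locked = tone_features(eeg_avg, 10 * fs, fs)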

Classification was performed using a 3-layer feed-forward back-propagation neural network with sigmoid and ReLU activations for the first and second layers, respectively, and a softmax for the final layer (Deng, 2014; Goodfellow, Bengio, and Courville, 2016). The network was trained to classify each tone as either occurring before an event or after an event. Ten-fold cross-validation was performed by randomly dividing all of the events across scenes into ten equal portions. During each of the ten iterations, one group of events was used as test data and the remainder used as training data. A receiver operating characteristic (ROC) curve was constructed by applying varying thresholds on the network's outputs, and prediction accuracy was calculated as the area under the ROC curve.
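
A rough sketch of such a classifier and its evaluation, written with tf.keras and scikit-learn, is shown below; the hidden-layer sizes, optimizer, number of epochs, and random feature matrix are placeholders, and for simplicity the folds here split tones directly rather than grouping them by event as described above.

    import numpy as np
    import tensorflow as tf
    from sklearn.model_selection import KFold
    from sklearn.metrics import roc_auc_score

    # Hypothetical feature matrix: 24 features per tone (tone-locked and gamma
    # energy for each of 12 subjects); labels 0 = "before event", 1 = "after".
    X = np.random.randn(400, 24).astype("float32")
    y = np.random.randint(0, 2, size=400)

    def build_model(n_in):
        # Three-layer feed-forward network: sigmoid, ReLU, then softmax output.
        return tf.keras.Sequential([
            tf.keras.Input(shape=(n_in,)),
            tf.keras.layers.Dense(16, activation="sigmoid"),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])

    aucs = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = build_model(X.shape[1])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
        scores = model.predict(X[test_idx], verbose=0)[:, 1]    # P("after event")
        aucs.append(roc_auc_score(y[test_idx], scores))          # area under the ROC curve

    print("mean AUC:", np.mean(aucs))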

4.3 Results

Listeners perform an amplitude-modulation (AM) detection task by attending to a tone sequence and indicating the presence of intermittent modulated target tones (orange note in Fig. 4.1). Concurrently, a busy acoustic scene is presented in the background and subjects are asked to completely ignore it. Background scenes are taken from the JHU DNSS (Dichotic Natural Salience Soundscapes) database, for which behavioral estimates of salience timing and strength have been previously collected (Huang and Elhilali, 2017) (see Methods for details). In a first experiment, easy and hard AM detection tasks are interleaved in experimental blocks by changing the modulation depth of the target note (easy: 0 dB, hard: -5 dB). As expected, subjects report a higher overall detection accuracy for the easy condition (75.4%) compared to the hard condition (48.2%). Moreover, target detection (in both easy and hard conditions) drops significantly over a period up to a second after onset of a salient event in the ignored background scenes (drop in detection accuracy; hard task, t(62) = -5.25, p = 1.96 × 10⁻⁶; easy task, t(62) = -5.62, p = 4.92 × 10⁻⁷). Salient events attract listeners' attention away from the task at hand and cause a drop in detection accuracy that is proportional to the salience level of background distractors, especially for high- and mid-salience events (hard task - high-salience event t(62) = -4.97, p = 5.57 × 10⁻⁶; mid-salience event t(62) = -3.70, p = 4.54 × 10⁻⁴; low-salience event t(62) = -0.75, p = 0.46; easy task - high-salience event t(62) = -4.20, p = 8.54 × 10⁻⁵; mid-salience event t(62) = -2.29, p = 0.025; low-salience event t(62) = -1.51, p = 0.14). In order to further explore the neural underpinnings of changes in the attentional state of listeners, this paradigm is repeated with the easy task while neural activity is measured using electroencephalography (EEG).

The attended tone sequence is presented at a regular tempo of 2.6 Hz and induces a strong overall phase-locked response around this frequency despite the concurrent presentation of a natural scene in the background. Figure 4.2A shows the grand average spectral profile of the neural response observed throughout the experiment. The plot clearly displays strong energy at 2.6 Hz, with a left-lateralized fronto-central response stereotypical of an auditory cortex source (Fig. 4.2A, inset).

Figure 4.1: Stimulus paradigm during EEG recording. Two stimuli are presented concurrently in each trial: a recording of a natural audio scene (top), which subjects are asked to ignore; and a periodic tone sequence (bottom), to which subjects attend in order to detect occasional modulated tones (shown in orange). A segment of one trial is shown along with EEG recordings, where analyses focus on changes in neural responses due to presence of salient events in the ambient scene or target tones in the attended scene.

Taking a closer look at this phase-locked activity aligned to the tone sequence, the response appears to change during the course of each trial, particularly when coinciding with task-specific AM tone targets, as well as when concurring with salient events in the background scene. Phase-locking near the modulated-tone targets shows an increase in 2.6 Hz power relative to the average level, reflecting an expected increase in neural power induced by top-down attention (Fig. 4.2B-top). The same phase-locked response is notably reduced when tones coincide with salient events in the background (Fig. 4.2B-bottom - red curve), indicating diversion of resources away from the attended sequence and potential markers of distraction caused by salient events in the ignored background.

Figure 4.2: Phase-locking results. (A) Spectral density across all stimuli. The peak in energy at the tone presentation frequency is marked by a red arrow. Inset shows average normalized tone-locking energy for individual electrodes. (B) Spectral density around target tones (top) and salient events (bottom). Black lines show energy preceding the target or event, while colored lines depict energy following. Note that target tones are fewer throughout the experiment, leading to lower resolution of the spectral profile. (C) Change in phase-locking energy across target tones, non-events, and salient events. (D) Change in tone-locking energy across high, mid, and low salience events. Error bars depict ±1 SEM.

We contrast variability of 2.6 Hz phase-locked energy over three windows of interest in each trial: (i) near AM tone targets, (ii) near salient events, and (iii) near tones randomly chosen "away" from either targets or salient events, used as control baseline responses. We compare activity in each of these windows relative to a preceding window (e.g. Fig. 4.1, post vs. pre-event interval). Figure 4.2C shows that phase-locking to 2.6 Hz after target tones increases significantly [t(443) = 4.65, p = 4.43 × 10⁻⁶], whereas it decreases significantly following salient events [t(443) = -5.89, p < 10⁻⁷], relative to preceding non-target tones. A random sampling of tones away from target tones or salient events does not show any significant variability [t(443) = -0.78, p = 0.43], indicating a relatively stable phase-locked power in "control" segments of the experiment away from task-relevant targets or bottom-up background events (Fig. 4.2C, middle bar). Compared to each other, the top-down attentional effect due to target tones is significantly different from the inherent variability in phase-locked responses in "control" segments [t(886) = 3.81, p = 1.48 × 10⁻⁴], while distraction due to salient events induces a decrease in phase-locking that is significantly different from inherent variability in "control" segments [t(886) = -3.58, p = 3.66 × 10⁻³].

Interestingly, this salience-induced decrease is modulated in strength by the level of salience of background events. The decrease in phase-locked energy is strongest for events with a higher level of salience [t(443) = -3.78, p = 1.8 × 10⁻⁴]. It is also significant for events with mid-level salience [t(443) = -2.57, p = 0.01], but reduced and not significant for events with the lowest salience [t(359) = 1.33, p = 0.20] (Fig. 4.2D).

A potential confound is that the reduced phase-locking could reflect local acoustic variability associated with salient events, rather than an actual deployment of bottom-up attention that disrupts phase-locking to the attended sequence. In order to rule out this possibility, we reassess the loss of phase-locking to the attended rhythm near events by excluding the salient events with the highest loudness, which could cause energetic masking effects (Moore, 2013). This analysis confirms that phase-locking to 2.6 Hz is still significantly reduced relative to non-event control moments (t(443) = -3.88, p < 10⁻³). In addition, we analyze acoustic attributes of all salient events in background scenes and compare their acoustic attributes to those of randomly selected intervals in non-salient segments (used as control non-event segments). This comparison assesses whether salient events have unique acoustic attributes that are not observed at other moments in the scene. A Bhattacharyya coefficient (BC; Kailath, 1967) comparing acoustic attributes of the scene reveals that salient events share the same global acoustic attributes as non-salient moments in the ambient background across a wide range of features (BC for loudness 0.9655, brightness 0.9851, pitch 0.9867, harmonicity 0.9775, and scale 0.9868).

The reduction of phase-locking to the attended sequence's rhythm in the presence of salient events raises the question of whether these "attention-grabbing" instances result in momentarily increased neural entrainment to the background scene. While the ambient scene does not contain a steady rate to examine exact phase-locking, its dynamic nature as a natural soundscape allows us to explore the fidelity of encoding of the overall stimulus envelope before and after salient events. Generally, synchronization to ignored stimuli tends to be greatly suppressed (Ding and Simon, 2012; Fuglsang, Dau, and Hjortkjær, 2017). Nonetheless, we note a momentary enhancement in decoding accuracy after high salience events compared to a preceding period (paired t-test, t(102) = 2.18, p = 0.03), though no such effects are observed for mid (t(113) = -1.09, p = 0.28) or low salience (t(107) = 0.24, p = 0.81) events (Figure 4.3).

Figure 4.3: Reconstruction of ignored scene envelopes from neural responses before and after salient events for high, mid and low salience instances. Correlations between neural reconstructions and scene envelopes are estimated using ridge regression (see Methods). Error bars depict ±1 SEM.

Next, we probe other markers of attentional shift and focus particularly on the Gamma band energy in the spectral response (Ray et al., 2008). We contrast spectral profiles of neural responses after target tones, salient events and during “control” tones. Figure 4.4A depicts a time-frequency profile of neural energy around modulated target tones (0 on the x-axis denotes the start of the target tone). A strong increase in Gamma activity occurs after the onset of the target tone and spans a broad spectral bandwidth from 40 to 120 Hz.

Figure 4.4B shows the same time-frequency profile of neural energy relative to attended tones closest to a salient event. The figure clearly shows a decrease in spectral power post-onset of attended tones nearest salient events, which is also spectrally broad, though strongest in a high-Gamma range (∼60-120 Hz).

Figure 4.4: High gamma band energy results. (A) Time-frequency spectrogram of neural responses aligned to onsets nearest modulated targets, averaged across central and frontal electrodes. Contours depict the highest 80% and 95% of the gamma response. (B) Time-frequency spectrogram of tones nearest salient events in the background scene. Contours depict the lowest 80% and 95% of the gamma response. (C) Change in energy in the high gamma frequency band (70-110 Hz) across target tones, non-events, and salient events relative to a preceding time window. (D) Change in high gamma band energy across high, mid, and low salience events. Error bars depict ±1 SEM.

Figure 4.4C quantifies the variations of Gamma energy relative to targets, salient events, and control tones as compared to a preceding time window. High-Gamma band energy increases significantly following target tones [t(443) = 11.5, p < 10⁻⁷], while it drops significantly for attended tones near salient events [t(443) = -6.83, p < 10⁻⁷]. Control non-event segments show no significant variations in Gamma energy [t(443) = 1.5, p = 0.13], confirming a relatively stable Gamma energy throughout the experimental trials overall. The increase in spectral energy around the Gamma band is significantly different in a direct comparison between target and control tones [t(886) = 10.3, p < 10⁻⁷]. Similarly, the decrease in spectral energy around the Gamma band is significantly different when comparing salient events against control tones [t(886) = 6.68, p < 10⁻⁷]. As with the decrease in tone locking, the Gamma band energy drop is more prominent for higher salience events [t(443) = -7.72, p < 10⁻⁷], is lower but still significant for mid-level salience events [t(443) = -3.64, p = 3.02 × 10⁻⁴], but not significant for low-salience events [t(443) = 0.84, p = 0.40] (Fig. 4.4D).

Given this push-pull competition between bottom-up and top-down attentional responses to tones in the attended rhythmic sequence, we examine similarities between neural loci engaged during these different phases of the neural response. Using the Brainstorm software package (Tadel et al., 2011), electrode activations are mapped onto brain surface sources using standardized low resolution brain electromagnetic tomography (sLORETA, see Methods for details). An analysis of localized Gamma activity across cortical voxels examines brain regions uniquely engaged while attending to target tones or distracted by a salient event (relative to background activity of control tones).

Figure 4.5: Neural sources. (A) Canonical correlation method for comparing network activation patterns between different overt and covert attention conditions. (B) Canonical correlation values comparing neural activity patterns after salient events (x-axis) and target tones (y-axis). The contour depicts all correlation values with statistical significance less than p < 0.01. (C) Example overlapping network activity between the response after salient events and the response after target tones (at the point shown with an asterisk in panel B). The overlap is right-lateralized and primarily located within the superior parietal lobule (SPL), the inferior frontal gyrus (IFG), and the medial frontal gyrus (MFG).

We correlate the topography of these top-down and bottom-up brain voxels and employ sparse canonical correlation analysis (sCCA) (Rosa et al., 2015; Witten and Tibshirani, 2009; Lin et al., 2013) to estimate multivariate similarity between these brain networks at different time lags (Figure 4.5A, see Methods for details).

Existence of a statistical correlation implies engagement of similar brain networks at corresponding time lags. Figure 4.5B shows that a significant correlation between Gamma activity in brain voxels is observed about 1 second after tone onset, with bottom-up attention to salient events engaging these circuits about 0.5 sec earlier relative to activation by top-down attention. The contoured area denotes statistically significant canonical correlations with p < 0.005, and highlights that the overlap in bottom-up and top-down brain networks is slightly offset in time (mostly off the diagonal axis) with an earlier activation by salient events. A closer look at canonical vectors resulting from this correlation analysis reveals the topography of the brain networks most contributing to this correlation. Canonical vectors reflect the set of weights applied to each voxel set that results in maximal correlation, and can therefore be represented in voxel space. These canonical vectors show a stable pattern over time lags of significant correlation and reveal a topography with strong contributions of frontal and parietal brain regions. Figure 4.5C shows a representative profile of overlapped canonical vectors obtained from the sCCA analysis, corresponding to the time lag shown with an asterisk in Figure 4.5B, and reveals the engagement of the inferior/middle frontal gyrus (IFG/MFG) as well as the superior parietal lobule (SPL).

Given the profound effects of bottom-up attention on neural responses, we examine the predictive power of changes in tone-locking and Gamma-energy modulations as biomarkers of auditory salience. We train a neural network classifier to infer whether a tone in the attended sequence is aligned with a salient distractor in the background scene or not. Figure 4.6 shows the classification accuracy for each neural marker, measured by the area under the ROC curve. Both Gamma and tone-locking yield significant predictions above chance (Gamma energy: 68.5% accuracy, t(9) = 4.12, p < 10⁻³; tone-locking: 73% accuracy, t(9) = 6.03, p < 10⁻⁵). Interestingly, the best accuracy is achieved when including both features (79% accuracy, t(9) = 7.20, p < 10⁻⁷), suggesting that Gamma-band energy and phase-locking contribute complementary information regarding the presence of attention-grabbing salient events in the background.

Figure 4.6: Event Prediction Accuracy. Average event prediction accuracy (area under the ROC curve) using only high gamma band energy, only tone-locking energy, and both features. Error bars depict ±1 SEM.

4.4 Discussion

Selective attention heavily modulates brain activity (Baluch and Itti, 2011b; Knudsen, 2007), and bottom-up auditory attention is no exception (Ahveninen et al., 2013; Alho et al., 2015; Salmi et al., 2009). The current study reinforces the view that profound and dynamic changes to neural responses in both sensory and cognitive networks are induced by bottom-up auditory attention. It further demonstrates that these effects compete with top-down attention for neural resources, interrupting behavioral and neural encoding of the attended stimulus by engaging neural circuits that reflect the cognitive salience of ambient distractors. Modulation of both steady-state phase-locked activity in response to the attended stream as well as energy in the high gamma band is so profound that it can accurately identify moments in the neural response coinciding with these salient events with an accuracy of up to 79% relative to control (non-salient) moments. The observed changes in both phase-locked activity as well as gamma oscillations dynamically change in opposing directions based on engagement of voluntary or stimulus-driven attention. This dichotomy strongly suggests shared limited resources devoted to tracking a sequence of interest, resulting in either enhancement or suppression of neural encoding of this overt target as a result of occasional competing objects (Lavie, 2005; Scalf et al., 2013). This push-pull action is strongly modulated by the salience of events in the ambient scene, which not only reflect the dynamic acoustic profile of natural sounds, but also their higher-level perceptual and semantic representations, hence shedding light on dynamic reorienting mechanisms underlying the brain's awareness of its surroundings in everyday social environments (Corbetta, Patel, and Shulman, 2008; Doherty et al., 2017).

The fidelity of the neural representation of an auditory stimulus can be easily quantified by the power of phase-locked responses to the driving rhythm. Enhancement of this phase-locking is accepted as one of the hallmarks of top-down attention and has been observed using a wide range of stimuli, from simple tone sequences (similar to the targets employed here) to complex signals (e.g. speech) (Elhilali et al., 2009; Mesgarani and Chang, 2012; Fuglsang, Dau, and Hjortkjær, 2017). By enhancing neural encoding of voluntarily-attended sensory inputs relative to other objects in the scene, attentional feedback effectively facilitates selection and tracking of objects of interest and suppression of irrelevant sensory information (Knudsen, 2007; Shamma and Fritz, 2014). In the current study, we observe that diverting attention away from the attended stream does in fact suppress the power of phase-locking relative to a baseline level, tapered by the degree of salience of the distracting event (Fig. 4.2C-D). The locus of this modulated phase-locked activity is generally consistent with core auditory cortex, though no precise topography can be localized from scalp activity only. Still, engagement of core sensory cortex in both enhancement and suppression of phase-locked responses to the attended sequence concurs with a role of auditory cortex in auditory scene analysis and auditory object formation (Ding and Simon, 2012; Leaver and Rauschecker, 2010). Moreover, the drop of steady-state following responses due to distractors coincides with a significant increase in the encoding of the background scene near highly-salient events, as reflected in the accuracy of the ambient scene reconstruction (Fig. 4.3). Such enhancement further suggests that neural resources are indeed being diverted during those specific moments due to competition for neural representation between the attended target and the salient background object.

Further analysis of this push-pull modulation of phase-locked activity rules out an interpretation based on acoustic masking, which would reduce the target-to-noise ratio during distracting events, resulting in weaker auditory stimulation. A comparison of acoustic features throughout the background scene shows that there are no global differences between moments deemed attention-grabbing vs. background segments in the JHU DNSS dataset. Specifically, salient events were not confined to simply louder moments in the scene. Rather, some dense scenes contained ongoing raucous activity that would be perceived as continuously loud but not necessarily salient, except for specific conspicuous moments (e.g. emergence of a human voice or occasional discernible background music in an otherwise busy cafeteria scene). As such, what makes certain events salient is not their instantaneous acoustic profile independent of context. Instead, it is often a relative change in the scene statistics reflecting not only acoustic changes but also perceptual and semantic manifestations that normally emerge during everyday social settings. Moreover, excluding the loudest salient events still reveals a significant drop of phase-locked activity relative to baseline tones, further confirming that this effect cannot be simply attributed to energetic masking of the attended sequence. Finally, the significant drop in behavioral accuracy in the attended task further confirms the distraction effect, likely due to disengagement of attentional focus away from the attended sequence. It should however be noted that the behavioral design of the EEG experiment did not place any target tones in the post-event region, in order to avoid any contamination of the neural signal. As such, it was not possible to perform a direct comparison of target tones and salient events, and all analyses were compared against preceding time windows to assess the relative modulation of neural phase-locked activity as the stimulus sequence unfolded.

The engagement of executive attention resources in the current paradigm is further observed in enhancements in gamma-band activity, which is also consistent with prior effects linking modulation of gamma-band activity and engagement of cognitive control, particularly with attentional networks. Specifically, enhancement of high-frequency gamma has been reported in conjunction with overt attention in auditory, visual, and somatosensory tasks (Debener et al., 2003; Tallon-Baudry et al., 2005; Bauer, 2006; Senkowski et al., 2005). While enhanced gamma oscillations index effective sensory processing due to voluntary attention, they also interface with mnemonic processes, particularly encoding of sensory information in short-term memory (Sederberg et al., 2003; Jensen, Kaiser, and Lachaux, 2007). Given the demands of the current paradigm to remember the number of modulated targets in the attended sequence, it is not surprising to observe that the enhancement of gamma energy extends over a period of a few seconds post target onset, in line with previously reported effects of memory consolidation (Tallon-Baudry and Bertrand, 1999). In some instances, gamma band activity has also been reported to increase during attentional capture (Buschman and Miller, 2007), though two distinctions with the current study design are worth noting: First, our analysis is explicitly aligned to the attended target sequence, hence probing distraction effects on activity relative to the foreground sound. Second, the modulation of gamma-band activity is very much tied to task demands, the notion of salience, and how it relates to both the perceptual and cognitive load imposed by the task at hand (Lavie, 2005). Importantly, the current paradigm emulates natural listening situations, which reaffirms the privileged status of social distractors previously reported in visual tasks (Doherty et al., 2017).

While effects of enhancement of broadband gamma frequency synchronization indexing an interface of attentional and memory processing have been widely reported, reduction in gamma-band energy due to distraction effects is not commonly observed. Few studies, encompassing data from animal or human subjects, have shown a potential link between modulation of gamma energy and impaired attention, in line with the results observed here. Ririe and colleagues reported reduced local field potential activity (including gamma band energy) during audiovisual distraction in medial prefrontal cortex in freely behaving rats (Ririe et al., 2017). Bonnefond et al. used magnetoencephalography (MEG) in human listeners and reported reduced gamma for distractors during a memory task, very much in line with the effects observed in the current task (Bonnefond and Jensen, 2013). Moreover, modulation of gamma activity has been ruled out as being associated with novel unexpected stimuli (Debener et al., 2003) and is likely linked to ongoing shifts in the attentional focus of listeners throughout the task, reflecting the interaction between top-down executive control and sound-driven activity.

Importantly, the analysis of neural networks engaged during this push-pull effect points to overlapping networks spanning frontal and parietal areas involved during both overt and covert attention, with a temporal offset for engagement of these networks (Fig. 4.5). On the one hand, there is strong evidence in the literature dissociating dorsal and ventral networks of overt and covert attentional engagement, respectively (Corbetta and Shulman, 2002). Such networks have been reported across sensory tasks, suggesting a supramodal attentional circuitry that interfaces with sensory networks in guiding selection and tracking of targets of interest while maintaining sensitivity to salient objects in a dynamic scene (Spagna, Mackie, and Fan, 2015). On the other hand, there are overlapping brain regions that underlie top-down as well as bottom-up attentional control, particularly centered in the lateral prefrontal cortex, and interface between the two systems to guide behavioral responses that not only account for task-guided goals, but also environmentally-relevant stimuli (Asplund et al., 2010; Corbetta, Patel, and Shulman, 2008). Our analysis shows that there is a coordinated interaction between these two networks that reveals a consistent temporal offset, with signals from the bottom-up attentional network interrupting activity in the top-down network, potentially reorienting the locus of attention (Ahveninen et al., 2013; Corbetta, Patel, and Shulman, 2008; Salmi et al., 2009). The earlier engagement of this common orienting network by bottom-up attention can have two possible interpretations. One, that engagement of covert attention sends an inhibitory reset signal in order to reorient attention to socially engaging salient events. Two, that earlier engagement of overlapped networks of attention by salient events reflects reduced memory consolidation due to a distraction effect, which is itself tied to diminished sensory encoding of the attended rhythm.

The ability to analyze the circuitry underlying the interaction between overt and covert networks is facilitated by a powerful multivariate regression technique named canonical correlation analysis (CCA). This approach attempts to circumvent the limitations of comparisons across source topographies in electroencephalography (EEG) and leverages multivariate techniques to explore relationships between high-dimensional datasets, which have been championed with great success in numerous applications including neuroimaging, pharmacological and genomic studies (Parkhomenko, Tritchler, and Beyene, 2009; Wang et al., 2015; Bilenko and Gallant, 2016). While applying CCA directly to very high-dimensional data can be challenging or uninterpretable, the use of kernel-based mappings with constraints such as sparsity, as adopted in the current work, allows mappings between high-dimensional brain images (Witten, Tibshirani, and Hastie, 2009; Gao et al., 2015; Uurtio et al., 2017). Here, we adapted the approach proposed by Rosa and colleagues (Rosa et al., 2015) as it not only regularizes the correlation analysis over sparse constraints, but also optimally defines these constraints in a data-driven fashion. This permutation-based approach yields an optimal way to define statistical significance of observed correlation effects as well as constraints on optimization parameters, much in line with cross-validation tests performed in statistical analyses, hence reducing bias by the experimenter in defining parameters (Witten and Tibshirani, 2009; Lin et al., 2013). The resulting sCCA coefficients delimit brain regions with statistically significant common effects. Consistent with previous findings, the analysis does reveal that areas with statistically significant activations lie in inferior parietal and frontal cortices, particularly the inferior and medial frontal gyri as well as the superior parietal lobule (Alho et al., 2015).

In conclusion, the use of an almost naturalistic experimental paradigm provides a range of nuanced ambient distractions similar to what one experiences in everyday life and demonstrates the dynamic competition for attentional resources between the task at hand and socially-relevant stimulus-driven cues. It not only sheds light on the profound effects of both top-down and bottom-up attention in continuously shaping brain responses, but also reveals a steady push-pull competition between these two systems for limited neural resources and active engagement by listeners.

Chapter 5

Hierarchical Representation in Deep Neural Networks and Contribution of Audio Class to Salience

This chapter examines the role of sound class in auditory salience. Results in Chapter 2 indicate that basic acoustic feature information by itself does not provide a complete explanation of auditory salience for the difficult, dense scenes. Thus, change in sound category is evaluated for its potential to complete our understanding of salience. A convolutional neural network (CNN) trained for audio classification is used to extract information about sound category, and that information is shown to contribute to the prediction of salience. In addition, further analysis of the CNN shows that activity at earlier stages is more closely related to acoustic features, while activation patterns at later stages are more strongly correlated with human cortical activity. The contents of this chapter have been published in Frontiers in Neuroscience (Huang, Slaney, and Elhilali, 2018).

5.1 Introduction

Over the past few years, convolutional neural networks (CNNs) have revolutionized machine perception, particularly in the domains of image understanding, speech and audio recognition, and multimedia analytics (Cai and Xia, 2015; He et al., 2016; Hershey et al., 2017; Karpathy et al., 2014; Krizhevsky, Sutskever, and Hinton, 2012; Poria et al., 2017; Simonyan and Zisserman, 2015). A CNN is a form of deep neural network (DNN) in which most of the computation is done with trainable kernels that are slid over the entire input. These networks are hierarchical architectures that mimic the biological structure of the human sensory system. They are organized in a series of processing layers that perform different transformations of the incoming signal, hence 'learning' information in a distributed topology. CNNs specifically include convolutional layers which contain units that are connected only to a small region of the previous layer. By constraining the selectivity of units in these layers, nodes in the network have emergent 'receptive fields,' allowing them to learn from local information in the input and structure processing in a distributed way; much like neurons in the brain have receptive fields with localized connectivity organized in topographic maps that afford powerful scalability and flexibility in computing. This localized processing is often complemented with fully-connected layers which integrate transformations learned across earlier layers, hence incorporating information about content and context and completing the mapping from the signal domain (e.g. pixels, acoustic waveforms) to a more semantic representation.

As with all deep neural networks, CNNs rely on vast amounts of data to train the large number of parameters and complex architecture of these networks. CNNs have been more widely used in a variety of computer vision tasks for which large datasets have been compiled (Goodfellow, Bengio, and Courville, 2016). In contrast, due to limited data, audio classification has only recently been able to take advantage of the remarkable learning capability of CNNs. Recent interest in audio data curation has made available a large collection of millions of YouTube videos which were used to train CNNs for audio classification with remarkable performance (Hershey et al., 2017; Jansen et al., 2017). These networks offer a powerful platform to gain better insights on the characteristics of natural soundscapes. The current study aims to use this CNN platform to elucidate the characteristics of everyday sound events that influence their acoustic properties, their salience (i.e. how well they 'stand out' for a listener), and the neural oscillation signatures that they elicit. All three measures are very closely tied together and play a crucial role in guiding our perception of sounds.

Given the parallels between the architecture of a convolutional neural network and brain structures from lower to higher cortical areas, the current work uses the CNN as a springboard to examine the granularity of representations of acoustic scenes as reflected in their acoustic profiles, evoked neural oscillations, and crucially their underlying salience; this latter being a more abstract attribute that is largely ill-defined in terms of its neural underpinnings and perceptual correlates. Salience is a characteristic of a sensory stimulus that makes it attract our attention regardless of where our intentions are. It is what allows a phone ringing to distract us while we are intently in the midst of a conversation. As such, it is a critical component of the attentional system that draws our attention towards potentially relevant stimuli. Studies of salience have mostly flourished in the visual literature, which benefited from a wealth of image and video datasets as well as powerful behavioral, neural and computational tools to explore characteristics of visual salience.

The study of salience in audition has been limited both by a lack of data and by limitations in existing tools that afford exploring auditory salience in a more natural and unconstrained way. A large body of work has explored aspects of auditory salience by employing artificially-constructed stimuli, such as tone and noise tokens (Duangudom and Anderson, 2013a; Elhilali et al., 2009).

When natural sounds are used, they are often only short snippets that are either played alone or pieced together (Duangudom and Anderson, 2007a; Kaya and Elhilali, 2014b; Kayser et al., 2005a; Petsas et al., 2016; Tordini, Bregman, and Cooperstock, 2015a). Such manipulations limit the understanding of effects of salience in a more natural setting, which must take into account contextual cues as well as complexities of listening in everyday environments.

Despite the use of constrained or artificial settings, studies of auditory salience have shed light on the role of the acoustic profile of a sound event in determining its salience. Loudness is a natural predominant feature, but it is complemented by other acoustic attributes, most notably sound roughness and changes in pitch (Arnal et al., 2015; Nostl, Marsh, and Sorqvist, 2012). Still, the relative contribution of these various cues and their linear or non-linear interactions have been reported to be very important (Kaya and Elhilali, 2014b; Tordini, Bregman, and Cooperstock, 2015a) or to sometimes provide little benefit (Kim et al., 2014a) to determining the salience of a sound event, depending on the stimulus structure, its context and the task at hand. Unfortunately, a complete model of auditory salience that can account for these various facets of auditory salience has not yet been developed. Importantly, studies of auditory salience using very busy and unconstrained soundscapes highlight the limitations of explaining behavioral reports of salience using only basic acoustic features (Huang and Elhilali, 2017). By all accounts, auditory salience is likely a multifaceted process that not only encompasses the acoustic characteristics of the event itself, but is shaped by the preceding acoustic context, the semantic profile of the scene, as well as built-in expectation both from short-term and long-term memory, much in line with processes that guide visual salience especially in natural scenes (Treue, 2003b; Veale, Hafed, and Yoshida, 2017; Wolfe and Horowitz, 2004b).

Convolutional neural networks offer a powerful platform to shed light on these various aspects of a natural soundscape and hence can provide insight into the various factors at play in auditory salience in everyday soundscapes. In the present work, we leverage access to a recently published database of natural sounds for which behavioral and neural salience measures are available (Huang and Elhilali, 2017; Huang and Elhilali, 2018), to ask the question: how well does activity at various points in a large-scale deep neural network correlate with these measures? Owing to the complexity of these convolutional models, we do not expect an explicit account of the exact factors or processes that determine salience. Rather, we examine the contribution of peripheral vs. deeper layers in the network to explore contributions of different factors along the continuum from simple acoustic features to more complex representations and ultimately to semantic-level embeddings that reflect sound classes. A number of studies have argued for a direct correspondence between the hierarchy in the primate visual system and layers of deep convolutional neural networks (Kriegeskorte, 2015; Kuzovkin et al., 2017; Yamins and DiCarlo, 2016). A recent fMRI study has also shown evidence that a hierarchical structure arises in a sound classification CNN, revealing an organization analogous to that of human auditory cortex (Kell et al., 2018). In the same vein, we explore how well activations at different layers in an audio CNN explain acoustic features, behaviorally measured salience, and neural responses corresponding to a set of complex natural scenes. These signals are all related (but not limited) to salience, and as such this comparison reveals the likely contribution of early vs. higher cortical areas in guiding judgments of auditory salience.

This chapter is organized as follows: First, the materials and methods employed are presented. This next section describes the database used, the acoustic analysis of audio features in the dataset, and the behavioral and neural responses for this same set obtained from human subjects. The architecture of the neural network is also described as the platform that guides the analysis of the other metrics. The results present the information gleaned from the CNN about its representation of acoustic, behavioral and neural correlates of salience. Finally, the discussion section summarizes the insights gained from these results and their impact on future work to better understand auditory salience and its role in our perception of sounds.

5.2 Materials and Methods

This next section describes the acoustic data, three types of auditory descriptors (acoustic features, a behavioral measure, and EEG), as well as three types of analyses employed in this study (CNN, surprisal, and correlation).

5.2.1 Stimuli

The stimuli used in the present study consist of twenty natural scenes taken from the JHU DNSS (Dichotic Natural Salience Soundscapes) Database (Huang and Elhilali, 2017). Scenes are approximately two minutes in length each and sampled at 22050 Hz. These scenes originate from several sources, including YouTube, FreeSound, and the BBC Sound Effects Library. The scenes encompass a wide variety of settings and sound objects, as well as a range of sound densities. Stimuli are manually divided into two groups for further analysis: a 'sparse' group, which includes scenes with relatively few but clearly isolated acoustic events. An example of a sparse scene includes a recording of a bowling alley in which a relatively silent background is punctuated by the sound of a bowling ball first striking the floor and then the pins. The remaining scenes are categorized as 'dense' scenes. Examples of these scenes include a maternity ward, a protest on the streets, and a dog park with continuously ongoing sounds and raucous backgrounds. This comparison between sparse and dense scenes is important because salience in dense scenes is particularly difficult to explain using only acoustic features, thus more complex information such as sound category may provide a benefit.

5.2.2 Acoustic Features

Each of the scenes in the JHU DNSS database is analyzed to extract an array of acoustic features, including loudness, brightness, bandwidth, spectral flatness, spectral irregularity, pitch, harmonicity, modulations in the temporal domain (rate), and modulations in the frequency domain (scale). Details of these feature calculations can be found elsewhere (Huang and Elhilali, 2017). In addition, the current study also includes an explicit measure of roughness as one of the acoustic features of interest. It is defined as the average magnitude of temporal modulations between 30 and 150 Hz, normalized by the root mean squared energy of the acoustic signal, following the method proposed by Arnal et al. (2015).
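
One plausible realization of this roughness measure is sketched below in Python; the band-pass filter design, frame length, and variable names are assumptions for illustration, and the exact modulation analysis follows Arnal et al. (2015) rather than this simplified version.

    import numpy as np
    from scipy.signal import hilbert, butter, sosfiltfilt

    def roughness(x, fs, frame_s=1.0):
        # Mean magnitude of 30-150 Hz envelope modulations per frame,
        # normalized by that frame's RMS energy.
        sos = butter(4, [30.0, 150.0], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(x))                 # amplitude envelope
        mod = sosfiltfilt(sos, env)              # keep 30-150 Hz modulations
        frame = int(frame_s * fs)
        vals = []
        for start in range(0, len(x) - frame + 1, frame):
            seg_rms = np.sqrt(np.mean(x[start:start + frame] ** 2)) + 1e-12
            vals.append(np.mean(np.abs(mod[start:start + frame])) / seg_rms)
        return np.array(vals)

    # Hypothetical usage on one scene sampled at 22050 Hz
    fs = 22050
    scene = np.random.randn(10 * fs)
    r = roughness(scene, fs)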

5.2.3 Behavioral salience

The Huang and Elhilali study collected a behavioral estimate of salience in each of the scenes in the JHU DNSS dataset (Huang and Elhilali, 2017). Briefly, subjects listen to two scenes presented simultaneously in a dichotic fashion (one presented to each ear). Subjects are instructed to use a computer mouse to indicate which scene they are focusing on at any given time. Salience is defined as the percentage of subjects that attend to a scene when compared to all other scenes, as a function of time.

Peaks in the derivative of the salience curve for each scene define onsets of salient events. These are moments in which a percentage of subjects concurrently begin listening to the associated scene, regardless of the content of the opposing scene playing in their other ear. The strength of an event is defined as a linear combination of the height of the slope at that point in time and the maximum percentage of subjects simultaneously attending to the scene within a four-second window following the event. The strongest 50% of these events are used in the event-related analysis in the current study. These events are further manually categorized into one of 7 sound classes (speech, music, other vocalization, animal, device/vehicle, tapping/striking, and other). The classes including speech, music, vehicle/device and tapping/striking contained the largest numbers of events and are included in the current study for further analysis. By this definition of salience, the scenes contained 47 events in the speech class, 57 events in music, 39 events in other vocalization, 44 events in vehicle/device, and 28 events in tapping. The two remaining classes consisted of too few instances, with only 11 events in the animal category and 8 in a miscellaneous category.
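
The sketch below illustrates how event onsets and strengths of this kind could be extracted from a salience curve; the peak-height threshold and the mixing weight between slope and peak salience are assumed values, not the parameters used in Huang and Elhilali (2017).

    import numpy as np
    from scipy.signal import find_peaks

    def salient_events(salience, fs, win_s=4.0, alpha=0.5):
        # Onsets: peaks in the salience derivative. Strength: weighted sum of the
        # slope at the onset and the maximum salience within the next 4 seconds.
        slope = np.gradient(salience) * fs
        onsets, _ = find_peaks(slope, height=np.percentile(slope, 90))
        win = int(win_s * fs)
        strengths = np.array([alpha * slope[i] + (1 - alpha) * salience[i:i + win].max()
                              for i in onsets])
        keep = strengths >= np.median(strengths)          # strongest 50% of events
        return onsets[keep], strengths[keep]

    # Hypothetical usage: salience curve sampled at 10 Hz for a 2-minute scene
    fs = 10
    salience = np.clip(np.cumsum(np.random.randn(120 * fs)) / 100 + 0.5, 0, 1)
    events, strengths = salient_events(salience, fs)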

5.2.4 Electroencephalography

Cortical activity while listening to the JHU DNSS stimuli is also measured using electroencephalography (EEG), following procedures described in the study by Huang and Elhilali (Huang and Elhilali, 2018). Briefly, EEG recordings are obtained using a Biosemi Active Two 128-electrode array, initially sampled at 2048 Hz. Each of the twenty scenes is presented to each subject one time in a random order, and listeners are asked to ignore these scenes playing in the background. Concurrently, subjects are presented with a sequence of tones and perform an amplitude modulation detection task. The neural data related to the modulation task itself is not relevant to the current study and is not presented here; it is discussed in the study by Huang and Elhilali (Huang and Elhilali, 2018).

EEG signals are analyzed using the FieldTrip (Oostenveld et al., 2011) and EEGLab (Delorme et al., 2004) analysis tools. Data are demeaned and detrended, then resampled at 256 Hz. Power line energy is removed using the Cleanline MATLAB plugin (Mullen, 2012). EEG data is then re-referenced using a common average reference, and eyeblink artifacts are removed using independent component analysis (ICA).

Following these preprocessing steps, energy at various frequency bands is isolated using a Fourier transform over sliding windows (length 1 second, step size 100 ms), then averaged across the frequencies in a specific band. Six such frequency bands are used in the analysis to follow: Delta (1-4 Hz), Theta (4-7 Hz), Alpha (8-15 Hz), Beta (15-30 Hz), Gamma (30-50 Hz), and High Gamma (70-110 Hz). Next, band energy is z-score normalized within each channel. Band activity is analyzed both on a per-electrode basis and also by averaging activity across groups of electrodes. In addition to a grand average across all 128 electrodes, analysis is also performed by averaging activity in Frontal electrodes (21 electrodes near Fz) and Central electrodes (23 electrodes near Cz) as defined in Shuai and Elhilali (2014).
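
A compact Python sketch of this band-energy extraction is given below; the Hann taper and variable names are assumptions for illustration.

    import numpy as np

    BANDS = {"delta": (1, 4), "theta": (4, 7), "alpha": (8, 15),
             "beta": (15, 30), "gamma": (30, 50), "high_gamma": (70, 110)}

    def band_energies(eeg, fs, win_s=1.0, step_s=0.1):
        # eeg: (samples, channels). Returns band -> (windows, channels) array of
        # z-scored energies, using a sliding Fourier transform.
        win, step = int(win_s * fs), int(step_s * fs)
        freqs = np.fft.rfftfreq(win, 1.0 / fs)
        taper = np.hanning(win)[:, None]
        starts = range(0, eeg.shape[0] - win + 1, step)
        spectra = np.stack([np.abs(np.fft.rfft(eeg[s:s + win] * taper, axis=0)) ** 2
                            for s in starts])              # (windows, freqs, channels)
        out = {}
        for name, (lo, hi) in BANDS.items():
            e = spectra[:, (freqs >= lo) & (freqs <= hi), :].mean(axis=1)
            out[name] = (e - e.mean(axis=0)) / (e.std(axis=0) + 1e-12)  # z-score per channel
        return out

    # Hypothetical usage: 10 s of 128-channel EEG resampled to 256 Hz
    fs = 256
    eeg = np.random.randn(10 * fs, 128)
    bands = band_energies(eeg, fs)
    frontal_gamma = bands["high_gamma"].mean(axis=1)       # e.g. average over a channel group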

5.2.5 Deep Neural Network

A neural network is used in the current study to explore its relationship with salience judgments based on acoustic analysis, behavioral measures and neural EEG responses (Figure 5.1). The network structure follows the VGG network E presented by Simonyan and Zisserman (2015), with modifications made by Hershey et al. (2017) and Jansen et al. (2017). Briefly, the network staggers convolutional and pooling layers. It contains four convolutional layers, each with relatively small 3x3 receptive fields. After each convolutional layer, a spatial pooling layer reduces the number of units by taking maximums over non-overlapping 2x2 windows. Next, two fully-connected layers reduce the dimensionality further before the final prediction layer. Table 5.1 lists the layers of the network along with their respective dimensionalities.

Figure 5.1: Structure of the convolutional neural network and signals analyzed. (A) The convolutional neural network receives the time-frequency spectrogram of an audio signal as input. It is composed of convolutional and pooling layers in an alternating fashion, followed by fully connected layers. (B) An example section of an acoustic stimulus (labeled Audio), along with corresponding neural network activity from five example units within one layer of the CNN. A network surprisal measure is then computed as the Euclidean distance between the current activity of the network nodes at that layer (shown in red) and the activity in a previous window (shown in gray with label 'History'). Measures of behavioral salience by human listeners (in green) and cortical activity recorded by EEG (in brown) are also analyzed.

Layer Type                   Abbreviation   Dimensions      Total Number of Outputs
Input                        Spectrogram    96 x 64         16384
Convolutional Layer          Conv1          96 x 64 x 64    393216
Pooling Layer                Pool1          48 x 32 x 64    98304
Convolutional Layer          Conv2          48 x 32 x 128   196608
Pooling Layer                Pool2          24 x 16 x 128   49152
Convolutional Layer          Conv3          24 x 16 x 256   98304
Pooling Layer                Pool3          12 x 8 x 256    24576
Convolutional Layer          Conv4          12 x 8 x 512    49152
Pooling Layer                Pool4          6 x 4 x 512     12288
Fully Connected Layer        FC1            4096            4096
Fully Connected Layer        Embed          128             128
Output Layer / Predictions   Predic         4923            4923

Table 5.1: Dimensions of the input and each layer of the neural network. Bold text indicates which layers are used in the analysis.

Due to dimensionality constraints, only the layers shown in bold are used in this analysis and reported here, without any expected loss of generality about the results.
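
For reference, a sketch of a model with the layer sizes of Table 5.1 is given below in tf.keras; the ReLU activations, 'same' padding, and the sigmoid multi-label output are assumptions consistent with, but not verified against, the trained network described here.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_audio_cnn(n_classes=4923):
        # Layer sizes follow Table 5.1; padding, activations, and the output
        # nonlinearity are illustrative assumptions.
        inp = layers.Input(shape=(96, 64, 1), name="spectrogram")    # 96 frames x 64 mel bins
        x = layers.Conv2D(64, 3, padding="same", activation="relu", name="conv1")(inp)
        x = layers.MaxPooling2D(2, name="pool1")(x)                  # 48 x 32 x 64
        x = layers.Conv2D(128, 3, padding="same", activation="relu", name="conv2")(x)
        x = layers.MaxPooling2D(2, name="pool2")(x)                  # 24 x 16 x 128
        x = layers.Conv2D(256, 3, padding="same", activation="relu", name="conv3")(x)
        x = layers.MaxPooling2D(2, name="pool3")(x)                  # 12 x 8 x 256
        x = layers.Conv2D(512, 3, padding="same", activation="relu", name="conv4")(x)
        x = layers.MaxPooling2D(2, name="pool4")(x)                  # 6 x 4 x 512
        x = layers.Flatten()(x)
        x = layers.Dense(4096, activation="relu", name="fc1")(x)
        emb = layers.Dense(128, activation="relu", name="embed")(x)
        out = layers.Dense(n_classes, activation="sigmoid", name="predic")(emb)
        return tf.keras.Model(inp, out)

    model = build_audio_cnn()
    model.summary()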

Our CNN was trained on the audio from a 4923-class video-classification problem that eventually became the YouTube-8M challenge (Abu-El-Haija et al., 2016). This dataset includes 8 million videos totaling around 500,000 hours of audio, and is available online (Abu-El-Haija, 2017). As in the study by Hershey et al., the audio from each video was divided into 960 ms frames, each mapped onto a time-frequency spectrogram (25 ms window, 10 ms step size, 64 mel-spaced frequency bins). This spectrogram served as the input to the neural network. For training purposes, ground truth labels for each video were automatically generated, and every frame within that video was assigned the same set of labels. Each video could have any number of labels, with an average of around five per video, and 4923 distinct labels in total. The labels ranged from very general to very specific. The most general category labels (such as arts and entertainment, games, autos/vehicles, and sports) were applied to roughly 10-20% of the training videos. The most specific labels (such as classical ballet, rain gutter, injury, FIFA Street) applied only to 0.0001-0.001% of the videos. The network was trained to optimize classification performance over the ground truth labels. The network's classification performance nearly matches that of the Inception DNN model, which was found to show the best results in Hershey et al. (2017), in terms of equal error rate and average precision. Details about the evaluation process can be found in Jansen et al. (2017).
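
A sketch of this spectrogram front end is shown below using librosa; the 16 kHz sampling rate, the log offset, and the non-overlapping patching are assumptions for illustration.

    import numpy as np
    import librosa

    def log_mel_patches(audio, sr=16000):
        # 25 ms windows, 10 ms hop, 64 mel bands, log compression, then
        # non-overlapping 960 ms (96-frame) patches as described in the text.
        mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                             n_fft=int(0.025 * sr),
                                             hop_length=int(0.010 * sr),
                                             n_mels=64)
        log_mel = np.log(mel + 1e-6).T                   # (frames, 64)
        n_patches = log_mel.shape[0] // 96
        return log_mel[:n_patches * 96].reshape(n_patches, 96, 64)

    # Hypothetical usage on 10 s of audio; add a channel axis before feeding the CNN.
    audio = np.random.randn(16000 * 10).astype(np.float32)
    patches = log_mel_patches(audio)[..., np.newaxis]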

5.2.6 Network Surprisal

We define the change in activation patterns within a layer of the CNN as 'network surprisal' (this definition is unrelated to other surprisal analyses that employ information theory or principles of thermodynamics to characterize system dynamics, often used in physics, chemistry, and other disciplines). It represents an estimate of variability in the response pattern across all nodes of a given layer in the network and as such quantifies how congruent or surprising activity at a given moment is relative to preceding activity (Figure 5.1B). In this study, it is computed by taking the Euclidean distance between the activity in a layer at a given time bin (labeled 'Current' in red in Figure 5.1B) and the average activation in that layer across the previous four seconds (labeled 'History' in gray in Figure 5.1B). Thus, a constant pattern of activity would result in a low level of surprisal, while a fluctuation in that pattern over multiple seconds would result in a higher level of surprisal. This measure corresponds structurally to the definition of semantic dissimilarity by Broderick et al. (2018), although it utilizes Euclidean distance as a common metric for evaluating dissimilarity in neural network activity (Krizhevsky, Sutskever, and Hinton, 2012; Parkhi, Vedaldi, and Zisserman, 2015). This surprisal feature tracks changes in the scene as it evolves over time by incorporating elements of the acoustic history into its calculation.
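
A minimal sketch of the surprisal computation is given below; the activation frame rate and array names are assumed for illustration.

    import numpy as np

    def network_surprisal(acts, frame_rate, history_s=4.0):
        # acts: (time_bins, units) activations from one CNN layer for one scene.
        # Surprisal = Euclidean distance between the current activation vector
        # and the mean activation over the preceding 4 seconds.
        hist = int(history_s * frame_rate)
        out = np.zeros(acts.shape[0])
        for t in range(hist, acts.shape[0]):
            out[t] = np.linalg.norm(acts[t] - acts[t - hist:t].mean(axis=0))
        return out

    # Hypothetical usage: embedding-layer activations at an assumed 10 frames/s
    frame_rate = 10
    acts = np.random.randn(120 * frame_rate, 128)        # 2-minute scene, 128-unit layer
    surprisal = network_surprisal(acts, frame_rate)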

5.2.7 Correlation Analyses

The audio, EEG, and CNN data have all been reduced to low-dimensional features. The audio is represented by 10 different acoustic measures, while the 128-channel EEG measurements are summarized by the energy in six different frequency bands, and the multi-channel output from the six different layers of the CNN is summarized by the surprisal measure. We next examine correlations between these metrics and the neural network activations.

Each layer of the neural network is compared to behavioral salience, basic acoustic features, and energy in EEG frequency bands using normalized cross-correlation. All signals are resampled to the same sampling rate of 10 Hz, and the first two seconds of each scene are removed to avoid the effects of the trial onset. Scenes that are longer than 120 seconds are shortened to that length. All signals are high-pass filtered with a cutoff frequency of 1/30 Hz to remove overall trends, and then low-pass filtered at 1/6 Hz to remove noise at higher frequencies. Both filters are 4th-order Butterworth filters. The low-pass cutoff frequency is chosen empirically to match the slow movements in the salience signal. Despite the low cutoff frequency, no observable ringing artifacts are noted. Adjusting the signal duration to examine any filtering artifacts at the onset of the signal yields quantitatively similar results to those reported in this paper.

After these pre-processing steps, we compute the normalized cross-correlation between network surprisal and the other continuous (acoustic and neural) signals with a maximum delay time of -3 to +3 seconds. The normalized correlation is defined as a sliding dot-product of the two signals normalized by the product of their standard deviations (Rao Yarlagadda, 2010). The highest correlation coefficient within the ±3 sec window is selected as the correlation between network surprisal and each of the corresponding signals.
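
One straightforward way to realize this lagged, normalized correlation is sketched below, looping over integer lags within the ±3 second window.

```python
import numpy as np

def max_normalized_xcorr(x, y, fs=10, max_lag_s=3):
    """Peak of the normalized cross-correlation between two preprocessed
    signals of equal length, searched over lags of -3 to +3 seconds."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    max_lag = int(max_lag_s * fs)
    best = -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:lag], y[-lag:]
        n = min(len(a), len(b))
        best = max(best, float(np.dot(a[:n], b[:n]) / n))
    return best
```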

The behavioral responses reflect onsets of salient events (peaks in the slope of the salience curve) and are discrete in time. CNN surprisal activity is compared to behavioral salience in windows surrounding salient events, extending from 3 seconds before to 3 seconds after each event. These windows are used to compare correlations for subsets of events, such as for a single category of events. Quantitatively similar results are obtained when using the whole salience curve instead of windows surrounding all salient events. The correlation coefficient between behavioral salience and network surprisal is taken in these windows. For this analysis, the behavioral salience signal is shifted by a fixed time of 1.4 seconds. A shift is necessary to reflect the delay in motor response required by the behavioral task to report salience. Here, a shift of 1.4 seconds is empirically determined to correspond to the maximum cross correlation for a majority of the network layers. A fixed delay is used for this case for greater consistency when comparing across different conditions.
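
One plausible realization of this event-windowed comparison is sketched below. Concatenating the window segments before computing a single correlation coefficient, and the direction of the 1.4 second alignment (comparing surprisal at time t with the behavioral report 1.4 seconds later), are interpretations made for this illustration.

```python
import numpy as np

def event_window_correlation(salience, surprisal, event_samples,
                             fs=10, window_s=3.0, delay_s=1.4):
    """Correlate network surprisal with behavioral salience in +/-3 s
    windows around salient events, reading the behavioral trace 1.4 s
    later than the surprisal to absorb the motor-response delay.
    event_samples gives event onsets as sample indices at rate fs."""
    shift = int(round(delay_s * fs))
    half = int(round(window_s * fs))
    sal_seg, sur_seg = [], []
    for t in event_samples:
        if t - half >= 0 and t + half + shift <= len(salience):
            sal_seg.append(salience[t - half + shift:t + half + shift])
            sur_seg.append(surprisal[t - half:t + half])
    sal_cat, sur_cat = np.concatenate(sal_seg), np.concatenate(sur_seg)
    ok = np.isfinite(sal_cat) & np.isfinite(sur_cat)   # skip undefined surprisal samples
    return np.corrcoef(sal_cat[ok], sur_cat[ok])[0, 1]
```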

To complement the correlation analysis described above, we also examine the cumulative contribution of different CNN layers by assessing the cumulative variance explained by combining activations of consecutive layers. This variance is quantified using a linear regression with behavioral salience as the dependent variable and network surprisal from individual layers as the independent variables (Weisberg, 2005). Consecutive linear regressions with each layer individually are performed, starting with lower layers and continuing to higher layers of the network. After each linear regression, the cumulative variance explained is defined as 1 minus the variance of the residual divided by the variance of the original salience curve (i.e., one minus the fraction of residual variance). Then, the residual is used as the dependent variable for regression with the next layer. To generate a baseline level of improvement from increasing the number of layers, this linear regression procedure is repeated after replacing all values in layers after the first with numbers generated randomly from a normal distribution (mean 0, variance 1).
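
The stage-wise regression can be summarized as in the sketch below, which treats each layer's surprisal as a single regressor (plus an intercept) and carries the residual forward as the target for the next layer.

```python
import numpy as np

def cumulative_variance_explained(salience, layer_surprisals):
    """Stage-wise regression of behavioral salience on layer surprisal,
    from early to deep layers; after each stage the residual becomes the
    target for the next layer.

    layer_surprisals : list of 1-D arrays ordered from early to deep layers.
    Returns the cumulative fraction of the original salience variance
    explained after each layer is added."""
    residual = salience - salience.mean()
    total_var = np.var(salience)
    explained = []
    for x in layer_surprisals:
        X = np.column_stack([np.ones_like(x), x])     # intercept + one layer
        beta, *_ = np.linalg.lstsq(X, residual, rcond=None)
        residual = residual - X @ beta
        explained.append(1.0 - np.var(residual) / total_var)
    return explained
```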

5.2.8 Event Prediction

Prediction of salient events is performed by dividing the scene into overlapping time bins (two second bin size, 0.5 second step size) and then using linear discriminant analysis (LDA) (Duda, Hart, and Stork, 2000). Each time bin is assigned a label of +1 if a salient event occurred within its respective time frame and a label of 0 otherwise. Network surprisal and the slopes of acoustic features are used to predict salient events with an LDA classifier. The slope of an acoustic feature is calculated by first taking the derivative of the signal and then smoothing it with three iterations of an equally weighted moving average (Huang and Elhilali, 2017). This smoothing process is selected empirically to balance removal of higher-frequency noise against discarding potential events. As with the previous event-based analysis, these signals are time-aligned by maximizing their correlation with behavioral salience. Each feature is averaged within each time bin, and LDA classification is performed using five-fold cross validation to avoid overfitting (Izenman, 2013b). Finally, a threshold is applied to the LDA scores at varying levels to obtain a receiver operating characteristic (ROC) curve (Fawcett, 2006).
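
The sketch below illustrates the feature-slope computation and the cross-validated LDA/ROC evaluation, with scikit-learn standing in for whatever toolbox was actually used; the length of the moving-average window is an assumption, as only the number of smoothing passes is specified above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc

def feature_slope(x, win=5, n_passes=3):
    """Slope of an acoustic feature: first difference smoothed by three
    passes of an equally weighted moving average (window length assumed)."""
    s = np.diff(x, prepend=x[0])
    kernel = np.ones(win) / win
    for _ in range(n_passes):
        s = np.convolve(s, kernel, mode='same')
    return s

def event_prediction_roc(bin_features, bin_labels):
    """Five-fold cross-validated LDA on bin-averaged features (network
    surprisal and acoustic-feature slopes); bin_labels mark bins that
    contain a salient event. Returns the ROC curve and its area."""
    scores = cross_val_predict(LinearDiscriminantAnalysis(), bin_features,
                               bin_labels, cv=5, method='decision_function')
    fpr, tpr, _ = roc_curve(bin_labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```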

5.3 Results

This section describes the correlations between the six layers of the CNN and, in turn, the 10 acoustic features, salience as measured by the behavioral task, and the energy in six different frequency bands of the EEG data.

5.3.1 Comparison to Basic Acoustic Features

Figure 5.2: Correlation between neural network activity and acoustic features. A Correlation coefficients between individual acoustic features and layers of the neural network. Loudness, harmonicity, irregularity, scale, and pitch are the most strongly correlated features overall. B Average correlation across acoustic features and layers of the neural network. Shaded area depicts ±1 standard error of the mean (SEM). Inset shows the slope of the trend line fitted with a linear regression. The shaded area depicts 99% confidence intervals of the slope.

First, we examine the correspondence between activity in different neural network layers and the acoustic features extracted from each of the scenes. Figure 5.2A shows the correlation coefficient between each acoustic feature and the activity of individual CNN layers. Overall, the correlation pattern reveals stronger values in the four earliest layers (convolutional and pooling) compared to the deeper layers of the network (fully connected and embedding). This difference is more pronounced for features of a more spectral nature such as spectral irregularity, frequency modulation, harmonicity, and loudness, suggesting that such features may play an important role in informing the network about sound classification during training. Clearly, not all acoustic features show this strong correlation, or any notable correlation. In fact, roughness and rate are basic acoustic measures that show slightly higher correlation in deeper layers relative to earlier layers. Figure 5.2B summarizes the average correlation across all basic acoustic features used in this study as a function of network layer. The trend reveals a clear drop in correlation, indicating that the activity in deeper layers is more removed from the acoustic profile of the scenes. The Figure 5.2B inset depicts a statistical analysis of this drop, with slope = -.026, t(1198) = -5.8, p = 7.6 × 10⁻⁹.

Figure 5.3: CNN surprisal and behavioral salience. A Correlation between CNN activity and behavioral salience. B Cumulative variance explained after including successive layers of the CNN. The gray line shows a baseline level of improvement estimated by using values drawn randomly from a normal distribution for all layers beyond Pool2. For both panels, shaded areas depict ±1 SEM. Insets show the slope of the trend line fitted with a linear regression, with shaded areas depicting 99% confidence intervals of the slope.

5.3.2 Comparison to Behavioral Measure of Salience

Next, we examine the correspondence between activations in the CNN layers and the behavioral judgments of salience as reported by human listeners. Figure 5.3A shows the correlation between behavioral salience and network surprisal across individual layers of the network, taken in windows around salient events (events being local maxima in the derivative of salience, see Section 5.2). As noted with the basic acoustic features (Figure 5.2), correlation is higher for the earlier layers of the CNN and lower for the later layers. A statistical analysis of the change in correlation across layers reveals a significant slope of -.041, t(1360) = -6.8, p = 2.1 × 10⁻¹¹ (Figure 5.3A, inset). However, although the correlation for individual deeper network layers is relatively poor, an analysis of their complementary information suggests additional independent contributions of each layer. In fact, the cumulative variance explained increases significantly as successively deeper layers are added, from superficial to deep layers (Figure 5.3B), with a slope of .029, t(1360) = 6.6, p = 5 × 10⁻¹¹.

While Figure 5.3 looks at complementary information of different network layers in explaining behavioral judgments of salience on average, one can look explicitly at specific categories of events and examine changes in information across CNN layers. Figure 5.4 contrasts the cumulative variance explained for four classes of events that were identified manually in the database (see Section 5.2). The figure compares cumulative variance of behavioral salience explained by the network for speech, music, vehicle and tapping events. The figure shows that speech and music-related events are better explained with

Figure 5.4: Cumulative variance explained after including successive layers of the CNN for specific categories of events. A Speech events, B Music events, C Vehicle events, D Tapping/striking events. The gray line shows a baseline level of improvement estimated by using values drawn randomly from a normal distribution for all layers beyond Pool2. For all panels, shaded areas depict ±1 SEM. Insets show the slope of the trend line fitted with linear regression, with shaded areas depicting 99% confidence intervals of the slope.

the inclusion of deeper layers (speech: t(280) = 5.2, p = 3.2 × 10⁻⁷; music: t(340) = 5.7, p = 3.3 × 10⁻⁸). In contrast, events from the devices/vehicles and tapping categories are well explained by only the first few peripheral layers of the network, with little benefit provided by deeper layers (device: t(262) = 1.8, p = .069; tapping: t(166) = 2.2, p = .028). Results for other vocalizations closely match those of the vehicle category (data not shown), t(196) = 2.2, p = .033. Overall, the figure highlights that the contribution of different CNN layers to the perceived salience of different scenes varies drastically depending on semantic category, and shows varying degrees of complementarity between the acoustic front-end representation and the deeper semantic representations.

Figure 5.5: Event prediction performance. Predictions are made using LDA on overlapping time bins across scenes. The area under the ROC curve is .775 with a combination of acoustic features and surprisal, while it reaches only .734 with acoustic features alone.

The ability to predict where salient events occur is shown in Figure 5.5. Each scene is separated into overlapping time bins, which are labelled based on whether or not an event occurred during that time frame. LDA is then performed using either a combination of acoustics and network surprisal, or the acoustic features alone. The prediction is improved through the inclusion of information from the neural network, with an area under the ROC curve of .734 when using only the acoustic features compared to an area of .775 after incorporating network surprisal. This increase in performance indicates that changes in network activity make a contribution to the salience prediction that is not fully captured by the acoustic representation.

Figure 5.6: Analysis of dense vs. sparse scenes. A Difference in correlation between salience and CNN activity for dense and sparse scenes. Negative values indicate that salience in sparse scenes was more highly correlated with CNN activity. B Difference in correlation between acoustic features and CNN activity for dense and sparse scenes.

One of the key distinctions between the different event categories analyzed in Figure 5.4 is not only the characteristics of the events themselves but also the context in which these events typically occur. Speech scenes, for example, tend to have ongoing activity and dynamic backgrounds against which salient events stand out, while vehicle scenes tend to be rather sparse, with few notable events standing out as salient. An analysis contrasting sparse vs. dense scenes in our entire dataset (see Section 5.2) shows a compelling difference between the correlation of salience with CNN activity for dense scenes and for sparse scenes, especially in the convolutional layers (Figure 5.6A). This difference is statistically significant when comparing the mean correlation for early vs. deep layers, t(4) = -5.4, p = .0057. In contrast, the network's activation in response to the acoustic profiles of the scenes does not show any distinction between sparse and dense scenes across early and deep layers (Figure 5.6B), t(4) = -.24, p = .82.

Figure 5.7: Correlation between neural network activity and energy in EEG frequency bands. A Correlation coefficients between individual EEG frequency bands and layers of the neural network. Gamma, beta, and high-gamma frequency bands are the most strongly correlated bands overall. B Average correlation across EEG activity and layers of the neural network. Shaded area depicts ±1 SEM. Inset shows the slope of the trend line fitted with linear regression, with a shaded area depicting the 99% confidence interval of the slope.

5.3.3 Comparison to Neural Activity

Finally, we examine the comparison between neural responses recorded using EEG and CNN activations. As shown in Figure 5.7, energy in many frequency bands of the neural signal shows stronger correlation with activity in higher layers of the CNN than with lower layers, following a trend opposite to that of the acoustic features. Figure 5.7A shows the correlation between network activity and individual EEG frequency bands, with a notable increase in correlation for higher frequency bands (Delta, Beta, Gamma, High Gamma).

The Theta and Alpha bands appear to follow a somewhat opposite trend, though their overall correlation values are rather small. Figure 5.7B summarizes the average correlation trend across all frequency bands (slope = .015, t(718) = 3.6, p = 3.2 × 10⁻⁴). It is worth noting that the average correlation between CNN activity and EEG responses is rather small overall (between 0 and 0.1) but still significantly higher than 0 (t(719) = 7.4, p = 4.5 × 10⁻¹³). This increasing trend provides further support to the notion that higher-frequency neural oscillations are mostly aligned with the increasingly complex feature and semantic representations crucial for object recognition in higher cortical areas, and correspondingly in deeper layers of the convolutional neural network (Kuzovkin et al., 2017).

Figure 5.8: Correlation between neural network activity and energy in EEG frequency bands for specific electrodes. A Average correlation between electrode activity and network surprisal, across frequency bands, for electrodes in Central (near Cz) and Frontal (near Fz) regions. B Correlation between beta/gamma band activity for individual electrodes and convolutional/deep layers of the neural network.

To explore the brain regions that are most closely related to the CNN activity, individual electrode activities are also correlated with surprisal. Figure 5.8A shows a small difference between neural activity in Central and Frontal areas, with the former having relatively higher correlation with early layers and the latter having higher correlation with deep layers. This trend is not statistically significant, however. Figure 5.8B shows the pattern of these correlation values across electrodes for the beta and gamma bands. Activity in the Beta band is most correlated with the convolutional layers of the CNN for central electrodes near C3 and C4, while it is most correlated with the deep layers for frontal electrodes near Fz. In contrast, Gamma band activity shows little correlation with the early layers of the CNN, but more closely matches activation in deep layers for electrodes near Cz.

5.4 Discussion

Recent work on deep learning models provides evidence of strong parallels between the increasing complexity of signal representation in these artificial networks and the intricate transformations in biological sensory systems that map incoming stimuli onto object-level representations (Cichy et al., 2016; Guclu and Gerven, 2015; Yamins et al., 2014). The current study leverages the complex hierarchy afforded by convolutional neural networks trained on audio classification to explore parallels between network activation and auditory salience in natural sounds measured through a variety of modalities. The analysis examines the complementary contribution of various layers in a CNN architecture and draws a number of key observations from three types of signals: acoustic, behavioral, and neural profiles.

Firstly, as expected, the earlier layers of the CNN mostly reflect the acoustic characteristics of a complex soundscape. The correlation between acoustic features and CNN activation decreases as the signal propagates deeper into the network. The acoustic features reflected with the highest fidelity are mostly spectral, and include harmonicity, frequency modulation, and spectral irregularity, along with loudness, which directly modulates overall signal levels. It is important to remember that the CNN used in the current work is trained for audio classification and employs a rather fine-resolution spectrogram at its input, computed with 25 ms bins over frames of about 1 s. As such, it is not surprising to observe a strong correlation between spectral features in the input and early representations in the peripheral layers of the CNN (Dai et al., 2017; Lee et al., 2017; Wang et al., 2017). Interestingly, two features that are temporal in nature, namely rate and most prominently roughness, show a somewhat opposite trend, with a mildly increased correlation with deeper CNN layers. Both of these acoustic measures quantify the degree of amplitude modulation in the signal over longer time scales of tens to hundreds of milliseconds; we can speculate that such measures involve longer integration windows that are more emblematic of deeper layers of the network, which pool across various localized receptive fields. The distributed activation of CNN layers reflecting various acoustic features supports previous accounts of hierarchical neural structures in auditory cortex that combine low-level and object-level representations extending beyond the direct physical attributes of the scenes (Formisano et al., 2008; Staeren et al., 2009). This distributed network suggests an intricate, multi-region circuitry underlying the computation of sound salience in the auditory system, much in line with reported underpinnings of visual salience circuits in the brain (Veale, Hafed, and Yoshida, 2017).

Secondly, the results show a strong correlation between peripheral layers of the CNN and behavioral reports of salience. This trend is not surprising given the important role the acoustic characteristics of the signal play in determining the salience of its events (Huang and Elhilali, 2017; Kaya and Elhilali, 2014b; Kim et al., 2014a). This view is then complemented by the analysis of cumulative variance explained by gradually incorporating activation of deeper layers of the neural network. Figure 5.3 clearly shows that information extracted in later layers of the network supplements activation in earlier layers and offers an improved account of auditory salience. This increase is maintained even at the level of the fully-connected layers, suggesting a complementary contribution of low-level and category-level cues in guiding auditory salience. This observation is further reinforced by focusing on the salience of specific sound categories. In certain cases that are more typical of sparse settings with prominent events, such as tapping or vehicle sounds, it appears that low-level acoustic features are the main determinants of auditory salience, with little contribution from semantic-level information. In contrast, events in the midst of a speech utterance or a musical performance show a significant increase in variance explained by incorporating all CNN layers (Figure 5.4). The complementary nature of peripheral and object-level cues is clearly more prominent when taking into account the scene context, by contrasting denser, busy scenes with quieter environments with occasional, prominent events. Dense settings typically do not have as many conspicuous changes in acoustic information across time; as a result, they seem to require more semantic-level information to complement information from acoustic features for a complete account of auditory salience.

Thirdly, the CNN layer activations show an opposite correlation trend with neural oscillations measured by EEG. In particular, the deeper layers of the neural network have higher correlation with activity in the higher frequency bands (beta, gamma, and high gamma). Synchronous activity in the Gamma band has been shown to be associated with object representation (Bertrand and Tallon-Baudry, 2000; Rodriguez et al., 1999), which would be directly related to the audio classification task. Activity in both the Gamma and Beta bands has also been linked to hearing novel stimuli (Haenschel et al., 2000). Moreover, Gamma band activity is known to be strongly modulated by attention (Doesburg et al., 2008; Müller, Gruber, and Keil, 2000; Tiitinen et al., 1993), which further reinforces the relationship between object category and salience.

In particular, the CNN activation patterns of the deep layers correlate most strongly with neural oscillations in frontal areas of the brain. This finding expands on the recent work by Kell et al. (2018), which found that activation patterns within intermediate layers of their CNN were the best at predicting activity in the auditory cortex. It stands to reason that later layers of the network would correspond more to higher level brain regions, which may play a role in attention and object recognition.

Overall, all three metrics used in the current study offer different accounts of the conspicuity of sound events in natural soundscapes. By contrasting these signals against activations in a deep convolutional neural network trained for audio recognition, we are able to assess the intricate granularity of information that drives auditory salience in everyday soundscapes. The complexity stems from the complementary role of cues along the continuum from low-level acoustic representations to coherent object-level embeddings. Interestingly, the contribution of these different transformations does not uniformly impact auditory salience for all scenes. The results reveal that the context of the scene plays a crucial role in determining the influence of acoustics, semantics, or possibly transformations in between. It is worth noting that the measure of surprisal used here is but one way to characterize surprise. Looking at changes in a representation relative to the average over the preceding few seconds is simple and proves to be effective. However, different ways of capturing the context, perhaps including fitting the data to a multimodal Gaussian mixture model, as well as different time scales, should be investigated.

Further complicating the interaction with context effects is the fact that certain acoustic features should not be construed as simple transformations of the acoustic waveform or the auditory spectrogram. For instance, a measure such as roughness appears to be less correlated with the lower layers of the convolutional neural network. This difference suggests that acoustic roughness may not be as readily extracted from the signal by the neural network as the other acoustic measures, but it is nonetheless important for audio classification and correlates strongly with the perception of auditory salience (Arnal et al., 2015).

One limitation of the convolutional neural network structure is that it only transmits information between layers in the forward direction, while biological neural systems incorporate both feed-forward and feedback connections. Feedback connections are particularly important in studies of attention because salience (bottom-up attention) can be modified by top-down attention. This study uses behavioral and physiological data that were collected in such a way that the influence of top-down activity was limited; however, a complete description of auditory attention would need to incorporate such factors. An example of a feedback convolutional neural network that seeks to account for top-down attention can be found in Cao et al. (2015).

It is not surprising that our limited understanding of the complex interplay between acoustic profiles and semantic representations has impeded the development of efficient models of auditory salience that can explain behavioral judgments, especially in natural, unconstrained soundscapes. So far, most accounts have focused on incorporating relevant acoustic cues that range in complexity from simple spectrographic representations to explicit representations of pitch, timbre, or spectro-temporal modulation (Duangudom and Anderson, 2007a; Kalinli and Narayanan, 2007a; Kaya and Elhilali, 2014b; Tsuchida and Cottrell, 2012b). However, as highlighted by the present study, a complementary combination of intricate acoustic analysis (akin to that achieved by the complex architecture of convolutional layers in the current CNN) and auditory object representations appears necessary: object-level information not only accounts for contextual information about the scene but may also determine the salience of a sound event depending on its category, sometimes regardless of its acoustic attributes.

Chapter 6

Conclusions and Future Directions

As a whole, this dissertation provides a great step forward in our understanding of auditory salience and presents a strong foundation for further study.

Chapter 2 presented a database of natural scenes with a novel behavioral measure of salience. Change in acoustic features was identified as an important factor in determining salience, although it does not fully explain salience, particularly in the difficult context of natural scenes. A model of auditory salience based on changes in acoustic features was developed, which performs as well as or better than other state-of-the-art models.

More recently, this database has been extended to include forty additional scenes and hundreds of additional subjects. Data collection for these scenes was facilitated through the use of Amazon's Mechanical Turk. Data comparisons have shown that the data collected online is highly consistent with the data collected in the laboratory setting for this salience measurement task.

Mechanical Turk allows for rapid expansion of the available dataset. Having a wider range of scenes with reliable salience measurements would enable models to better generalize to any real-world scenario. With a vast amount of data, deep learning becomes an option, which could achieve higher performance than existing models.

While long-term acoustic history is shown to affect salience, it is not a factor that is incorporated into any of the current models. This history information can have a complex effect on salience, and requires further investigation.

Another variable currently unaccounted for is individual listener variability. The salience measure defined in this dissertation averages over all subjects, but listeners were not all in agreement, particularly for the dense scenes. The past experiences of the listener may also need to be considered in a complete description of bottom-up attention.

In Chapters 3 and 4, neural markers associated with shifts in attentional state were identified. These markers could be used as a physiological measure of salience that does not rely on manual labelling or behavioral studies. These markers include modulations of the neural representation of the stimulus, phase coherence, and activity in the gamma band. In addition, the neurophysiological effects of shifts in attention brought about by bottom-up and top-down processes were compared.

While these results around salient events are promising, the final goal for a neurophysiological measure of salience would be to achieve a continuous rating throughout the acoustic scene. Envelope decoding methods such as those presented in Chapter 4 could be used, but they require a large window for correlation and thus have a lower temporal resolution. It remains to be seen what a more continuous measure could be.

Finally, Chapter 5 determined that changes in sound category contribute to auditory salience, using a convolutional neural network trained for audio classification. Furthermore, activity in the CNN was characterized as being more correlated with basic acoustic features and salience at lower layers, while having more similarity to human cortical activity at higher layers. The next step would be to incorporate the categorical and acoustic information together into a more complete model.

References

Wolfe, Jeremy M and Todd S Horowitz (2004b). “What attributes guide the deployment of visual attention and how do they do it?” In: Nature Reviews Neuroscience 5.6, pp. 495–501. ISSN: 1471-003X. Itti, Laurent and Christof Koch (2001c). “Feature combination strategies for saliency-based visual attention systems”. In: J.Electronic Imaging, pp. 161– 169. Tatler, Benjamin W, Mary M Hayhoe, Michael F Land, and Dana H Ballard (2011). “Eye guidance in natural vision: Reinterpreting salience”. In: Journal of Vision 11.5. Zhao, Qi and Christof Koch (2013a). “Learning saliency-based visual attention: A review”. In: Signal Processing 93.6, pp. 1401–1407. Borji, A, D Sihite, and L Itti (2013a). “Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study”. In: IEEE Transactions on Image Processing 22.1, pp. 55–69. Bruce, Neil D.B., Calden Wloka, Nick Frosst, Shafin Rahman, and John K. Tsotsos (2015). “On computational modeling of visual saliency: Examining what’s right, and what’s left”. In: Vision Research 116, pp. 95–112. ISSN: 18785646. Kayser, C, C I Petkov, M Lippert, and N K Logothetis (2005a). “Mechanisms for allocating auditory attention: an auditory saliency map”. In: Current Biology 15.21, pp. 1943–1947. Kim, Kyungtae, Kai-Hsiang Lin, Dirk B Walther, Mark A Hasegawa-Johnson, and Tomas S Huang (2014a). “Automatic detection of auditory salience with optimized linear filters derived from human annotation”. In: Pattern Recognition Letters 38.0, pp. 78–85. Itti, L, C Koch, and E Niebur (1998). “A model of saliency-based visual atten- tion for rapid scene analysis”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.11, pp. 1254–1259.

155 Duangudom, V and D V Anderson (2007a). “Using Auditory Saliency To Understand Complex Auditory Scenes”. In: 15th European Signal Processing Conference (EUSIPCO 2007). Kaya, Emine Merve and Mounya Elhilali (2014b). “Investigating bottom-up auditory attention”. In: Frontiers in Human Neuroscience 8, p. 327. ISSN: 1662-5161. Duangudom, Varinthira and David Anderson (2013a). “Identifying salient sounds using dual-task experiments”. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–4. Shuai, Lan and Mounya Elhilali (2014). “Task-dependent neural representa- tions of salient events in dynamic auditory scenes”. In: Frontiers in neuro- science 8, p. 203. ISSN: 1662-4548. Southwell, R, A Baumann, C Gal, N Barascud, K Friston, and M Chait (2017). “Is predictability salient? A study of attentional capture by auditory pat- terns”. In: Philosophical transactions of the Royal Society of London.Series B, Biological sciences 372.1714, 10.1098/rstb.2016.0105. Epub 2017 Jan 2. Kalinli, O and S Narayanan (2007a). “A Saliency-Based Auditory Attention Model with Applications to Unsupervised Prominent Syllable Detection in Speech”. In: INTERSPEECH-2007, pp. 1941–1944. Liao, Hsin I., Makoto Yoneya, Shunsuke Kidani, Makio Kashino, and Shigeto Furukawa (2016b). “Human pupillary dilation response to deviant audi- tory stimuli: Effects of stimulus properties and voluntary attention”. In: Frontiers in Neuroscience 10.43, pp. 1–14. ISSN: 1662453X. Huang, Nicholas and Mounya Elhilali (2017). “Auditory salience using natu- ral soundscapes”. In: The Journal of the Acoustical Society of America 141.3, p. 2163. ISSN: 1520-8524. Seifritz, Erich, John G. Neuhoff, Deniz Bilecen, Klaus Scheffler, Henrietta Mustovic, Hartmut Schächinger, Raffaele Elefante, and Francesco Di Salle (2002). “Neural Processing of Auditory Looming in the ”. In: Current Biology 12.24, pp. 2147–2151. ISSN: 0960-9822. Lewis, James W., William J. Talkington, Katherine C. Tallaksen, and Chris A. Frum (2012). “Auditory object salience: human cortical processing of non- biological action sounds and their acoustic signal attributes”. In: Frontiers in Systems Neuroscience 6, p. 27. ISSN: 1662-5137. Tordini, Francesco, Albert S Bregman, and Jeremy R Cooperstock (2015a). “The loud bird doesn’t (always) get the worm: Why computational salience also needs brightness and tempo”. In: Proceedings of the 21st International Conference on Auditory Display (ICAD 2015).

156 Ghazanfar, Asif A, John G Neuhoff, and Nikos K Logothetis (2002). “Auditory looming perception in rhesus monkeys.” In: Proceedings of the National Academy of Sciences of the United States of America 99.24, pp. 15755–7. ISSN: 0027-8424. Huang, Nicholas, Malcolm Slaney, and Mounya Elhilali (2018). “Connecting deep neural networks to physical, perceptual, and electrophysiological auditory signals”. In: Frontiers in Neuroscience 12.532. ISSN: 1662-453X. Huang, Nicholas and Mounya Elhilali (2018). “Neural underpinnnings of auditory salience in natural soundscapes”. In: bioRxiv. Krishnan, Ananthanarayan (2002). “Human frequency-following responses: representation of steady-state synthetic vowels”. In: Hearing research 166.1, pp. 192–201. Nozaradan, S, I Peretz, M Missal, and A Mouraux (2011). “Tagging the neu- ronal entrainment to beat and meter”. In: The Journal of neuroscience : the official journal of the Society for Neuroscience 31.28, pp. 10234–10240. Power, Alan James, Natasha Mead, Lisa Barnes, and Usha Goswami (2012). “Neural Entrainment to Rhythmically Presented Auditory, Visual, and Audio-Visual Speech in Children”. In: Frontiers in Psychology 3, p. 216. ISSN: 1664-1078. Galbraith, Gary C, Sunil M Bhuta, Abigail K Choate, Jennifer M Kitahara, and Thomas A Mullen Jr (1998). “Brain stem frequency-following response to dichotic vowels during attention”. In: Neuroreport 9.8, pp. 1889–1893. Galbraith, Gary C. and Bao Q. Doan (1995). “Brainstem frequency-following and behavioral responses during selective attention to pure tone and miss- ing fundamental stimuli”. In: International Journal of Psychophysiology 19.3, pp. 203–214. ISSN: 01678760. O’Sullivan, James A, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn-Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor (2015). “Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG”. In: Cerebral Cortex 25.7, pp. 1697–1706. ISSN: 1460-2199. Treder, M S, H Purwins, D Miklody, I Sturm, and B Blankertz (2014). “Decoding auditory attention to instruments in polyphonic music using single-trial EEG classification”. In: Journal of Neural Engineering 11.2, p. 026009. ISSN: 17412552. Ekin, Bradley, Les Atlas, Majid Mirbagheri, and Adrian K.C. Lee (2016). “An al- ternative approach for auditory attention tracking using single-trial EEG”.

157 In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Vol. 2016-May. IEEE, pp. 729–733. Fuglsang, Søren, Torsten Dau, and Jens Hjortkjær (2017). “Noise-robust cortical tracking of attended speech in real-world acoustic scenes”. In: NeuroImage 156, pp. 435–444. ISSN: 10959572. Gurtubay, I.G, M Alegre, A Labarga, A Malanda, J Iriarte, and J Artieda (2001). “Gamma band activity in an auditory oddball paradigm studied with the wavelet transform”. In: Clinical 112.7, pp. 1219–1228. ISSN: 1388-2457. Howard, Marc W., Daniel S. Rizzuto, Jeremy B. Caplan, Joseph R. Madsen, John Lisman, Richard Aschenbrenner-Scheibe, Andreas Schulze-Bonhage, and Michael J. Kahana (2003). “Gamma Oscillations Correlate with Work- ing Memory Load in Humans”. In: Cerebral Cortex 13.12, pp. 1369–1374. ISSN: 10473211. Jensen, F V (2001). Bayesian networks and decision diagrams. Springer. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton (2012). “ImageNet Classification with Deep Convolutional Neural Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097–1105. Simonyan, Karen and Andrew Zisserman (2015). “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: International Conference on Learning Representations (ICRL), pp. 1–14. ISSN: 09505849. Hershey, Shawn, Sourish Chaudhuri, Daniel P.W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson (2017). “CNN architectures for large-scale audio classification”. In: ICASSP,IEEE In- ternational Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, pp. 131–135. Jansen, Aren, Jort F. Gemmeke, Daniel P.W. Ellis, Xiaofeng Liu, Wade Lawrence, and Dylan Freedman (2017). “Large-scale audio event discovery in one mil- lion YouTube videos”. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, pp. 786–790. Lee, Honglak, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng (2009). “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations”. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, pp. 1–8. Itti, L., G. Rees, and J. K. Tsotsos, eds. (2005). Neurobiology of attention. Elsevier.

158 Driver, J. (2001). “A selective review of selective attention research from the past century”. In: British Journal of Psychology 92.1, pp. 53–78. Baluch, Farhan and Laurent Itti (2011a). “Mechanisms of top-down attention”. In: Trends in neurosciences 34.4, pp. 210–224. Treue, S. (2003a). “Visual attention: the where, what, how and why of saliency”. In: Current opinion in neurobiology 13.4, pp. 428–432. Carrasco, Marisa (2011). “Visual attention: The past 25 years”. In: Vision research 51.13, pp. 1484–1525. Itti, L. and C. Koch (2001a). “Computational modelling of visual attention”. In: Nature Reviews Neuroscience 2.3, pp. 194–203. Wolfe, J. M. and T. S. Horowitz (2004a). “What attributes guide the deployment of visual attention and how do they do it?” In: Nature Reviews Neuroscience 5.6, pp. 495–501. Zhao, Qi and Christof Koch (2013b). “Learning saliency-based visual attention: A review”. In: Signal Processing 93.6, pp. 1401–1407. Borji, Ali, Ming-Ming Cheng, Huaizu Jiang, and Jia Li (2015). “Salient object detection: A benchmark”. In: IEEE Transactions on Image Processing 24.12, pp. 5706–5722. Parkhurst, Derrick, Klinton Law, and Ernst Niebur (2002). “Modeling the role of salience in the allocation of overt visual attention”. In: Vision research 42.1, pp. 107–123. Borji, A., D. N. Sihite, and L. Itti (2013b). “Quantitative analysis of human- model agreement in visual saliency modeling: A comparative study”. In: IEEE Transactions on Image Processing 22.1, pp. 55–69. Borji, A. and L. Itti (2013a). “State-of-the-art in visual attention modeling”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1, pp. 185– 207. Frintrop, S., E. Rome, and H. I. Christensen (2010). “Computational Visual Attention Systems and their Cognitive Foundation: A Survey”. In: ACM Transactions on Applied Perception 7.1. Kaya, E. M. and M. Elhilali (2016). “Modeling auditory attention: A review”. In: Philosophical Transactions of the Royal Society B: Biological Sciences. Kayser, C., C. I. Petkov, M. Lippert, and N. K. Logothetis (2005b). “Mechanisms for allocating auditory attention: an auditory saliency map”. In: Current Biology 15.21, pp. 1943–1947. Tsuchida, T. and G. Cottrell (2012a). Auditory saliency using natural statistics.

159 Duangudom, V. and D. V. Anderson (2007b). “Using Auditory Saliency To Understand Complex Auditory Scenes”. In: 15th European Signal Processing Conference (EUSIPCO 2007). Kaya, E. M. and M. Elhilali (2014a). “Investigating bottom-up auditory atten- tion”. In: Frontiers in Human neuroscience 8.327, doi: 10.3389/fnhum.2014.00327. Kim, Kyungtae, Kai-Hsiang Lin, Dirk B. Walther, Mark A. Hasegawa-Johnson, and Tomas S. Huang (2014b). “Automatic detection of auditory salience with optimized linear filters derived from human annotation”. In: Pattern Recognition Letters 38.0, pp. 78–85. Duangudom, Varinthira and David V. Anderson (2013b). “Identifying salient sounds using dual-task experiments”. In: 2013 IEEE Workshop on Appli- cations of Signal Processing to Audio and Acoustics (WASPAA). IEEE, pp. 1– 4. Kalinli, O. and S. Narayanan (2007b). “A Saliency-Based Auditory Attention Model with Applications to Unsupervised Prominent Syllable Detection in Speech”. In: INTERSPEECH-2007, pp. 1941–1944. The BBC Sound Effects Library - Original Series. Youtube (2016). The Freesound Project (2016). Smith, Steven W. (1997). “The scientist and engineer’s guide to digital signal processing”. In: Chi, T., P. Ru, and S. A. Shamma (2005). “Multiresolution spectrotemporal analysis of complex sounds”. In: The Journal of the Acoustical Society of America 118.2, pp. 887–906. Schubert, Emery and Joe Wolfe (2006). “Does timbral brightness scale with frequency and spectral centroid?” In: Acta acustica united with acustica 92.5, pp. 820–825. Esmaili, Shahrzad, Sridhar Krishnan, and Kaamran Raahemifar (2004). “Con- tent based audio classification and retrieval using joint time-frequency anal- ysis”. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on. Vol. 5. IEEE, V–665–8 vol. 5. Johnston, James D. (1988). “Transform coding of audio signals using percep- tual noise criteria”. In: Selected Areas in , IEEE Journal on 6.2, pp. 314–323. McAdams, S., J. W. Beauchamp, and S. Meneguzzi (1999). “Discrimination of musical instrument sounds resynthesized with simplified spectrotemporal parameters”. In: 105.2, pp. 882–897.

160 Goldstein, J. L. (1973). “An optimum processor theory for the central formation of the pitch of complex tones”. In: Journal of the Acoustical Society of America 54, pp. 1496–1516. Shamma, S. A. and D. J. Klein (2000). “The case of the missing pitch templates: How harmonic templates emerge in the early auditory system”. In: Journal of the Acoustical Society of America 107.5, pp. 2631–2644. Zwicker, Eberhard, Hugo Fastl, Ulrich Widmann, Kenji Kurakata, Sonoko Kuwano, and Seiichiro Namba (1991). “Program for calculating loudness according to DIN 45631 (ISO 532B).” In: Journal of the Acoustical Society of Japan (E) 12.1, pp. 39–42. Izenman, Alan Julian (2013a). “Linear discriminant analysis”. In: Modern Multivariate Statistical Techniques. Springer, pp. 237–280. Tordini, Francesco, Albert S. Bregman, and Jeremy R. Cooperstock (2015b). “The loud bird doesn’t (always) get the worm: Why computational salience also needs brightness and tempo”. In: Proceedings of the 21st International Conference on Auditory Display (ICAD 2015). Graz, Styria, Austria. Moore, Brian C. J. (2003). An Introduction to the Psychology of Hearing. 5th ed. Emerald Group Publishing Ltd. Kaya, E. and M. Elhilali (2015). Investigating bottom-up auditory attention in the cortex. Elhilali, M., J. B. Fritz, D. J. Klein, J. Z. Simon, and S. A. Shamma (2004). “Dynamics of precise spike timing in primary auditory cortex”. In: Journal of Neuroscience 24.5, pp. 1159–1172. Kalinli, Ozlem, Shiva Sundaram, and Shrikanth Narayanan (2009). “Saliency- Driven Unstructured Acoustic Scene Classification Using Latent Perceptual Indexing”. In: IEEE International Workshop on Multimedia Signal Processing (MMSP 2009). IEEE, pp. 478–483. Peters, Robert J., Asha Iyer, Laurent Itti, and Christof Koch (2005). “Compo- nents of bottom-up gaze allocation in natural images”. In: Vision research 45.18, pp. 2397–2416. Yarbus, AL (1967). “Eye movements and vision. 1967”. In: New York. Scheier, Christian and Steffen Egner (1997). “Visual attention in a mobile robot”. In: Industrial Electronics, 1997. ISIE’97., Proceedings of the IEEE Inter- national Symposium on. Vol. 1. IEEE, SS48–SS52 vol. 1. Russell, Bryan C., Antonio Torralba, Kevin P. Murphy, and William T. Freeman (2008). “LabelMe: a database and web-based tool for image annotation”. In: International journal of computer vision 77.1-3, pp. 157–173.

161 Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009). “Imagenet: A large-scale hierarchical image database”. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 248– 255. HESS, E. H. and J. M. POLT (1960). “Pupil size as related to interest value of visual stimuli”. In: Science (New York, N.Y.) 132.3423, pp. 349–350. Partala, Timo and Veikko Surakka (2003). “Pupil size variation as an indication of affective processing”. In: International journal of human-computer studies 59.1, pp. 185–198. Liao, Hsin-I, Makoto Yoneya, Shunsuke Kidani, Makio Kashino, and Shigeto Furukawa (2016c). “Human pupillary dilation response to deviant audi- tory stimuli: Effects of stimulus properties and voluntary attention”. In: Frontiers in neuroscience 10.43. Borji, Ali (2015a). “What is a salient object? a dataset and a baseline model for salient object detection”. In: Image Processing, IEEE Transactions on 24.2, pp. 742–756. Borji, A and L Itti (2013b). “State-of-the-art in visual attention modeling”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1, pp. 185– 207. Tsuchida, T and G Cottrell (2012b). Auditory saliency using natural statistics. Petsas, Theofilos, Jemma Harrison, Makio Kashino, Shigeto Furukawa, and Maria Chait (2016). “The effect of distraction on change detection in crowded acoustic scenes”. In: Hearing Research 341, pp. 179–189. ISSN: 18785891. Kaya, E.M. M Emine Merve and Mounya Elhilali (2017). “Modelling auditory attention”. In: Philosophical transactions of the Royal Society of London. Series B, Biological sciences 372.1714, p. 20160101. ISSN: 14712970. Liao, Hsin I., Shunsuke Kidani, Makoto Yoneya, Makio Kashino, and Shigeto Furukawa (2016a). “Correspondences among pupillary dilation response, subjective salience of sounds, and loudness”. In: Psychonomic Bulletin and Review 23.2, pp. 412–425. ISSN: 15315320. Picton, T W, C Alain, L Otten, W Ritter, and A Achim (2000). “Mismatch Neg- ativity: Different Water in the Same River.” In: Audiology and Neurotology 5, pp. 111–139. Näätänen, R., P. Paavilainen, T. Rinne, and K. Alho (2007). “The mismatch neg- ativity (MMN) in basic research of central auditory processing: A review”. In: Clinical Neurophysiology 118.12, pp. 2544–2590. ISSN: 13882457.

162 Winkler, István (2007). “Interpreting the mismatch negativity.” In: Journal of Psychophysiology 21.3-4, p. 147. Lakatos, P, G Karmos, A D Mehta, I Ulbert, and C E Schroeder (2008). “En- trainment of neuronal oscillations as a mechanism of attentional selection”. In: Science 320.5872, pp. 110–113. Zion Golumbic, E M, N Ding, S Bickel, P Lakatos, C A Schevon, G M McKhann, R R Goodman, R Emerson, A D Mehta, J Z Simon, D Poeppel, and C E Schroeder (2013). “Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party"”. In: Neuron 77.5, pp. 980–991. ISSN: 1097-4199. Besle, J., C. A. Schevon, A. D. Mehta, P. Lakatos, R. R. Goodman, G. M. McK- hann, R. G. Emerson, and C. E. Schroeder (2011). “Tuning of the Human Neocortex to the Temporal Dynamics of Attended Events”. In: Journal of Neuroscience 31.9, pp. 3176–3185. ISSN: 0270-6474. Jaeger, Manuela, Martin G. Bleichner, Anna Katharina R. Bauer, Bojana Mirkovic, and Stefan Debener (2018). “Did You Listen to the Beat? Auditory Steady- State Responses in the Human Electroencephalogram at 4 and 7 Hz Modu- lation Rates Reflect Selective Attention”. In: Brain Topography 31.5, pp. 811– 826. ISSN: 15736792. Ding, N and J Z Simon (2012). “Emergence of neural encoding of auditory ob- jects while listening to competing speakers”. In: Proceedings of the National Academy of Sciences of the United States of America 109.29, pp. 11854–11859. Goto, M, H Hashiguchi, T Nishimura, and R Oka (2003). “RWC music database: Music genre database and musical instrument sound database”. In: Pro- ceedings of International Symposium on Music Information Retrieval, pp. 229– 230. Oostenveld, Robert, Pascal Fries, Eric Maris, and Jan Mathijs Schoffelen (2011). “FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data”. In: Computational Intelligence and Neuroscience 2011, p. 156869. ISSN: 16875265. Mallat, S G and Z Zhang (1993). “Matching pursuits with time-frequency dictionaries”. In: IEEE Transactions on Signal Processing 41.12, pp. 3397– 3415. Snedecor, G and W Cochran (1989). Statistical Methods. Iowa State University Press, Ames, p. 503. David, Stephen V, Nima Mesgarani, and Shihab A Shamma (2007). “Esti- mating sparse spectro-temporal receptive fields with natural stimuli”. In: Network: Computation in Neural Systems 18.3, pp. 191–212.

163 Haroush, Keren, Shaul Hochstein, and Leon Y. Deouell (2010). “Momentary Fluctuations in Allocation of Attention: Cross-modal Effects of Visual Task Load on Auditory Discrimination”. In: Journal of 22.7, pp. 1440–1451. ISSN: 0898-929X. Fisher, Ronald Aylmer Sir 1890-1962 (1935). The design of experiments. [8th ed.] New York: Hafner Pub. Co. McAdams, S, S Winsberg, S Donnadieu, G De Soete, and J Krimphoff (1995). “Perceptual scaling of synthesized musical timbres: common dimensions, specificities, and latent subject classes”. In: Psychological Research 58.3, pp. 177–192. Elhilali, M, S A Shamma, J Z Simon, and J B Fritz (2013). “A Linear Systems View to the Concept of STRF”. In: Handbook of Modern Techniques in Auditory Cortex. Ed. by D Depireux and M Elhilali. Nova Science Pub Inc, pp. 33–60. Giordano, Bruno L. and Stephen McAdams (2010). “Sound Source Mechanics and Musical Timbre Perception: Evidence From Previous Studies”. In: Music Perception 28.2, pp. 155–168. ISSN: 0730-7829. Peeters, Geoffroy, Bruno L Giordano, Patrick Susini, Nicolas Misdariis, and Stephen McAdams (2011). “The Timbre Toolbox: Extracting audio descrip- tors from musical signals”. In: The Journal of the Acoustical Society of America 130.5, pp. 2902–2916. ISSN: 0001-4966. Patil, Kailash, Daniel Pressnitzer, Shihab Shamma, and Mounya Elhilali (2012). “Music in Our Ears: The Biological Bases of Musical Timbre Perception”. In: PLoS Computational Biology 8.11. Ed. by Frederic E. Theunissen, e1002759. ISSN: 1553-7358. Nothdurft, Hans Christoph (2005). “Salience of feature contrast”. In: Neurobiol- ogy of Attention, pp. 233–239. Itti, L and C Koch (2001b). “Computational modelling of visual attention.” In: Nature reviews. Neuroscience 2.3, pp. 194–203. ISSN: 1471-003X. Melara, R D and L E Marks (1990). “Interaction among auditory dimensions: timbre, pitch, and loudness”. In: Perception & Psychophysics 48.2, pp. 169– 178. Bizley, J K, K M Walker, B W Silverman, A J King, and J W Schnupp (2009). “Interdependent encoding of pitch, timbre, and spatial location in auditory cortex”. In: Journal of Neuroscience 29.7, pp. 2064–2075. Allen, Emily J., Philip C. Burton, Cheryl A. Olman, and Andrew J. Oxenham (2017). “Representations of Pitch and Timbre Variation in Human Auditory Cortex”. In: The Journal of Neuroscience 37.5, pp. 1284–1293. ISSN: 0270-6474.

164 Nelken, I and O Bar-Yosef (2008). “Neurons and objects: the case of auditory cortex”. In: Frontiers in neuroscience 2.1, pp. 107–113. Bartlett, Edward L and Xiaoqin Wang (2005). “Long-lasting modulation by stimulus context in primate auditory cortex.” In: Journal of neurophysiology 94.1, pp. 83–104. Asari, Hiroki and Anthony M Zador (2009). “Long-Lasting Context Depen- dence Constrains Neural Encoding Models in Rodent Auditory Cortex”. In: Journal of Neurophysiology 102.5, pp. 2638–2656. ISSN: 0022-3077. Mesgarani, N, S David, and S Shamma (2009). “Influence of Context and Be- havior on the Population Code in Primary Auditory Cortex”. In: J.Neurophysiology 102, pp. 3329–3333. Angeloni, C. and MN Geffen (2018). “Contextual modulation of sound process- ing in the auditory cortex”. In: Current Opinion in Neurobiology 49, pp. 8–15. ISSN: 09594388. Hillyard, Steven A., Edward K. Vogel, and Steven J. Luck (1998). “Sensory gain control (amplification) as a mechanism of selective attention: Electro- physiological and neuroimaging evidence”. In: Philosophical Transactions of the Royal Society B: Biological Sciences. ISSN: 09628436. Elhilali, Mounya, Juanjuan Xiang, Shihab A. Shamma, and Jonathan Z. Simon (2009). “Interaction between Attention and Bottom-Up Saliency Mediates the Representation of Foreground and Background in an Auditory Scene”. In: PLoS Biology 7.6. Ed. by Timothy D. Griffiths, e1000129. ISSN: 1545-7885. Kerlin, Jess R, Antoine J Shahin, and Lee M Miller (2010). “Attentional Gain Control of Ongoing Cortical Speech Representations in a "Cocktail Party"”. In: Journal of Neuroscience 30.2, pp. 620–628. ISSN: 0270-6474. Mesgarani, N and E F Chang (2012). “Selective cortical representation of attended speaker in multi-talker speech perception”. In: Nature 485.7397, pp. 233–236. Puvvada, Krishna C and Jonathan Z Simon (2017). “Cortical Representations of Speech in a Multitalker Auditory Scene”. In: The Journal of Neuroscience 37.38, pp. 9189–9196. ISSN: 0270-6474. Wang, X, T Lu, D Bendor, and E Bartlett (2008). “Neural coding of temporal information in auditory thalamus and cortex”. In: Neuroscience 157, pp. 484– 493. Kayser, Christoph, Marcelo A Montemurro, Nikos K Logothetis, and Stefano Panzeri (2009). “Spike-Phase Coding Boosts and Stabilizes Information Carried by Spatial and Temporal Spike Patterns”. In: Neuron 61.4, pp. 597– 608. ISSN: 08966273.

Biographical Statement

Nicholas Huang was born on November 2, 1987, in Butte, Montana. He received a Bachelor of Science degree in Biomedical Engineering and a Bachelor of Arts degree in Music from the University of Rochester in 2011. He is pursuing a PhD degree in Biomedical Engineering at the Johns Hopkins University.

Nicholas’s doctoral work at JHU primarily focuses on the study of auditory salience, particularly in the challenging context of natural acoustic scenes. His work has helped to quantify the degree to which sounds capture our bottom-up attention from the standpoints of psychophysics, neuroscience, and computational modeling. Nicholas is a Goldwater Scholarship recipient and a Ford scholarship honorable mention.
