Cortical markers of auditory stream segregation revealed for streaming based on tonotopy but not pitch

Dorea R. Ruggles and Alexis N. Tausend Department of Psychology, University of Minnesota, 75 East River Parkway, Minneapolis, Minnesota 55455, USA Shihab A. Shamma Electrical and Computer Engineering Department & Institute for Systems, University of Maryland, College Park, Maryland 20740, USA Andrew J. Oxenhama) Department of Psychology, University of Minnesota, 75 East River Parkway, Minneapolis, Minnesota 55455, USA (Received 5 July 2018; revised 5 October 2018; accepted 8 October 2018; published online 29 October 2018) The brain decomposes mixtures of sounds, such as competing talkers, into perceptual streams that can be attended to individually. Attention can enhance the cortical representation of streams, but it is unknown what acoustic features the enhancement reflects, or where in the auditory pathways attentional enhancement is first observed. Here, behavioral measures of streaming were combined with simultaneous low- and high-frequency envelope-following responses (EFR) that are thought to originate primarily from cortical and subcortical regions, respectively. Repeating triplets of har- monic complex tones were presented with alternating fundamental frequencies. The tones were fil- tered to contain either low-numbered spectrally resolved harmonics, or only high-numbered unresolved harmonics. The behavioral results confirmed that segregation can be based on either tonotopic or pitch cues. The EFR results revealed no effects of streaming or attention on subcortical responses. Cortical responses revealed attentional enhancement under conditions of streaming, but only when tonotopic cues were available, not when streaming was based only on pitch cues. The results suggest that the attentional modulation of phase-locked responses is dominated by tonotopi- cally tuned cortical neurons that are insensitive to pitch or periodicity cues. VC 2018 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). https://doi.org/10.1121/1.5065392 [GCS] Pages: 2424–2433

I. INTRODUCTION (Lukas, 1981) and even the (Maison et al.,2001), but most studies have concluded that attentional modulation is not Biologically important sounds, including speech, are evident prior to . Indeed, the earliest cortical rarely presented in isolation, but instead form mixtures with responses, with latencies of 20–30 ms, responding to modula- other sounds in the acoustic environment. The tion rates around 40 Hz, also show little evidence of attentional must therefore segregate the elements of a target sound from modulation (e.g., Gutschalk et al., 2008). the complex background, and fuse those elements together into One important acoustic feature that allows for percep- a perceptual stream that can be followed over time. Following tual segregation is the spectral content of sounds. from early studies on cortical correlates of attention (Hillyard Segregation based on differences in spectral content occurs et al.,1973), recent human studies have revealed strong atten- via the frequency-to-place mapping, or tonotopic organiza- tional modulation effects on auditory streams, with responses tion, that is established in the cochlea and is maintained to an attended target stream enhanced and/or the background throughout the auditory pathways up to and including the suppressed, both for non-speech stimuli (e.g., Gutschalk et al., auditory cortex. Perceptual studies have shown that stream 2008; Elhilali et al.,2009; Xiang et al.,2010) and speech (e.g., segregation can occur on the basis of tonotopic separation Mesgarani and Chang, 2012; Zion Golumbic et al., 2013; between alternating sounds (Miller and Heise, 1950; van O’Sullivan et al.,2015). Although the effects are clear, uncer- Noorden, 1977; Hartmann and Johnson, 1991), and some tainty remains regarding the underlying neural mechanisms physiological studies have reported correlates of perceptual that facilitate the streaming and attentional modulation of streaming of pure-tone sequences, even in the absence of speech (Mesgarani et al.,2014). Some studies have suggested physical stimulus changes, in both subcortical (Yamagishi early influences of attention, extending down to the brainstem et al., 2016) and cortical (Gutschalk et al., 2005) responses. Perceptual studies have also shown that segregation can a)Electronic mail: [email protected] be based on higher-level features, such as pitch, timbre, or

2424 J. Acoust. Soc. Am. 144 (4), October 2018 0001-4966/2018/144(4)/2424/10 VC Author(s) 2018. perceived location (Vliegen and Oxenham, 1999; Roberts et al., 2002; David et al., 2015; Javier et al., 2016) and com- binations thereof (Tougas and Bregman, 1985; Woods and McDermott, 2015). Attention may enhance the responses of spectrally tuned neurons that form the basis of the tonotopic representations within human auditory cortex (e.g., Formisano et al.,2003; Moerel et al., 2012), but may also enhance the responses of neurons that respond to higher-level features, such as pitch (Penagos et al., 2004; Bendor and Wang, 2005; Norman-Haignere et al., 2013; Allen et al., 2017). Because different talkers differ on multiple dimensions, it is not possi- ble to determine whether the reported cortical correlates of attention to speech (Mesgarani and Chang, 2012) are based on tonotopic cues, other higher-level features, or a combination of both. A recent study presented one talker to each ear, provid- ing complete peripheral separation between the male talkers (O’Sullivan et al.,2015). However, in more natural situations, the signal from each talker will be present at both ears, render- ing it less clear how neural segregation might occur. The present study had two main aims. The first aim was to determine whether selective attention enhances the neural corre- lates of streams that are defined in terms of differences in either spectral differences and pitch, or just pitch. The second aim was to test for the existence of subcortical neural correlates of streaming and/or attention. Sequences of harmonic complex tones that alternated in fundamental frequency (F0) were either filtered to contain low-numbered, spectrally resolved harmonics or filtered to contain only high-numbered, spectrally unresolved harmonics. Resolved harmonics can be heard out as individual FIG. 1. (Color online) Schematic diagram of the stimuli used in this experi- tones (Plomp, 1964; Bernstein and Oxenham, 2003), and have ment. (A) Excitation pattern representation of the tones using a model of been shown to produce separate peaks of neural activity corre- effective auditory responses by Glasberg and Moore (1990), showing that sponding to each harmonic in other species (e.g., Cedolin and the tones in the low spectral condition produced individual peaks in the exci- Delgutte, 2005; Fishman et al., 2013); unresolved harmonics tation pattern and so can be considered spectrally resolved, whereas the tones in the high spectral condition did not produce clear individual spectral cannot be heard out, and do not produce individual peaks of peaks and so can be considered spectrally unresolved. (B) Schematic dia- activity, meaning that changes in F0 do not produce changes in gram of the tone sequence, as presented. The higher-F0 A tones (red) are the overall pattern of activity along the tonotopic axis (Houtsma interleaved with the B tones (black) to form a pattern of repeating triplets. and Smurzynski, 1990; Shackleton and Carlyon, 1994; Cedolin Certain tones, selected at random, are increased in level by 6 dB to produce oddballs. (C) Schematic diagram of the tone sequence, as perceived when and Delgutte, 2005; Fishman et al., 2013); see Fig. 1. A behav- the A and B tones form separate perceptual streams. The task of the partici- ioral task was employed to encourage the perceptual segregation pants was to attend to just one of the streams and report oddballs in only that of the alternating tone sequences and to focus attention on one stream. of the two streams. Recordings using (EEG) were made that measured both high- and low-frequency All participants had pure-tone thresholds of 20 dB envelope-following responses (EFR), reflecting primarily sub- hearing level (HL) or better in both ears at octave frequen- cortical and cortical neural populations, respectively, to identify cies between 250 Hz and 8 kHz. All participants provided potential correlates of streaming and attention. written informed consent, and the protocol was approved by The results revealed cortical modulation that reflects the Institutional Review Board of the University of streaming and attention, but only in conditions that provided Minnesota. All participants also completed a short screening tonotopic differences between the tones; in conditions with protocol, described below, to ensure that they were able to only unresolved harmonics, and therefore no tonotopic dif- perform the behavioral task. Nineteen participants (nine ferences, no significant modulation of cortical responses was female, ten male) passed the screening task and completed observed. In addition, no brainstem or midbrain correlates of the entire experiment. The participants who completed the streaming or attention were observed in any condition. experiment were aged between 18 and 32 years of age (mean ¼ 21.9 years) and reported between 0 and 14 years of II. METHODS musical training (mean ¼ 5.3 years). A. Participants B. Stimuli Twenty-eight (15 female, 13 male) young adult listeners [mean age 21.7 years; standard deviation (SD) ¼ 3.6 years] The stimuli were 1-min blocks of alternating harmonic were recruited from the University of Minnesota community. complex tones, presented in a repeating ABA-ABA- triplet

J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. 2425 sequence, where the A tone was higher in F0 than the B They were seated in clear view of a computer monitor, tone. Each tone was 50 ms long, including 10-ms raised- which indicated the target stream for the current block (high cosine onset and offset ramps. Tones within a triplet were or low) and provided an immediate acknowledgment of each separated by 25-ms gaps, and each triplet was separated by a button press. However, no correct-answer feedback was pro- gap of 100 ms (as if every second B tone was silenced within vided. The participants were required to respond to the odd- an alternating ABABABA sequence). The repetition rate of ball tokens within the attended stream by pressing a button the higher A tones was therefore 6.67 Hz, and the repetition immediately after each detected oddball, and to suppress rate of the lower B tones (and the entire triplet pattern) was responses to oddballs in the unattended streams. 3.33 Hz. Each harmonic complex tone was generated in sine The task proved difficult for some individuals, even phase and bandpass filtered into a fixed spectral region. The with a 15-semitone separation between high- and low-F0 F0 of the B tone was always 100 Hz and the F0 of the A tokens, so a training/screening protocol was introduced to tone was either 1 semitone higher (106 Hz) or 15 semitones ensure that each listener could perform the task adequately higher (238 Hz). With a 1-semitone F0 difference, termed prior to data collection. The first two blocks consisted of the small-separation condition, the two tones were likely to dichotic stimuli: the high-F0 (A) tones were presented to form a single perceptual stream, eliciting the percept of a the right ear (attended in the first block), and the low-F0 galloping rhythm; with a 15-semitone F0 difference, termed (B) tones were presented to the left ear (attended in the sec- the large-separation condition, the two tones were most ond block). The results and strategies in the two dichotic likely to form two separate perceptual streams of isochro- blocks were discussed with each participant before they nous tones (Vliegen and Oxenham, 1999). Control condi- moved on to four blocks of diotically presented stimuli tions were also tested, in which only the A tones (at 238 Hz), with the large F0 difference. Participants completed two or only the B tones (at 100 Hz) were presented; these were blocks with resolved harmonics (attend high and attend termed the single-stream conditions. low) and two blocks with unresolved harmonics (attend The stimuli were filtered into either a low spectral high and attend low), and were provided with feedback. In region (300–1800 Hz) so that the complexes contained some order to pass a set of blocks, the participants were required spectrally resolved harmonics, or a high spectral region to accurately respond to at least four of the six oddballs in (3000–4500 Hz) so that no harmonics were spectrally the attended stream and to respond to two or fewer of the resolved (the lowest harmonic number within the pass band oddballs in the unattended stream in all four blocks. The was always greater than 12); see Fig. 1. participants were required to pass two consecutive sets of The tones were presented at an overall root-mean-square these four blocks in order to be included in the experiment. (rms) level of 70 dB sound pressure level (SPL) in either Unlimited attempts at the four-block sets were allowed. positive or negative polarity, and were embedded in spec- The participants who failed to meet this criterion often trally notched threshold-equalizing noise (TEN) at 50 dB attempted it over ten times before discontinuing the experi- SPL per ERB, to reduce potential off-frequency listening ment and most commonly failed the attend-low condition and audible distortion products (Moore et al., 2000; with unresolved harmonics. Those who met the criteria for Oxenham et al., 2009). The TEN was generated from 50 Hz the study generally did so within about four sets. They then to 6 kHz with a spectral notch between 250 Hz and 2 kHz for completed a set of eight blocks that included both large- the resolved-harmonics conditions and a spectral notch and small-F0 separation conditions with both resolved and between 2.7 and 5.25 kHz for the unresolved-harmonics con- unresolved harmonics before undertaking the three EEG ditions. The noise was generated as a 1-s circular token, test sessions. repeated to create a 70-s source sound. For each block, 61 s During the sessions with EEG, simultaneous low- and of noise was randomly sampled and added to the triplet stim- high-frequency EFRs were recorded to the 1-min-long pre- sentations of the stimuli. The low-frequency EFRs corre- uli with a 1-s onset lead. Level oddballs in both the A- and sponded to the presentation rates of the A and B tones (6.67 B-tone sequences were used to monitor stream segregation and 3.33 Hz, respectively) and the individual tones (13.3 Hz, and attention. Each sequence (A and B tones) contained six including the “missing” B tone after each triplet), reflecting oddball tokens per block that were presented at a level 6 dB cortical activity. The high-frequency EFRs corresponded to higher than the regular tokens. Oddballs were prevented the F0s of the tones (between 100 and 238 Hz) and are from occurring in the first or last triplets of the sequence, thought to reflect primarily subcortical activity (Bidelman, and were restricted from occurring within the same triplet in 2018). The stimuli were generated using MATLAB (The both the A- and B-tone sequences and from being separated Mathworks, Natick, MA) and were played to participants via by fewer than two regular tokens within a given sequence. a Tucker Davis Technologies (Alachua, FL) real-time pro- cessor with headphone buffer and ER1 insert earphones C. Procedure (Etymotic Research, Elk Grove Village, IL). The EEG mea- The experiment was carried out in a sound-attenuating surements were acquired using a BioSemi (Amsterdam, and electrically shielded booth. The participants were seated Netherlands) active electrode system with a sampling rate of comfortably with a number pad on their laps and were 4096 Hz and 32 channels, referenced to averaged mastoid instructed to attend selectively to either the high or low tones electrodes. Two blocks of each of the 24 conditions within each 60-s block and to press the button on their key (resolved or unresolved harmonics; small or large F0 differ- pad every time an oddball occurred in the attended tones. ence or single sequence; attend high or low; positive or

2426 J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. FIG. 2. (Color online) Schematic dia- gram of the neural envelope-following responses to the low- and high- frequency components within the EEG signal (red and black lines, respec- tively). The heat map shows the distri- bution across the scalp of the amplitude of the peak EFR component at the low EFR frequency of 3.3 Hz.

negative stimulus polarity) were tested in each of the three (DSS) routine was applied to reduce the role of any artifact in sessions, resulting in 48 blocks per session, presented in ran- the analysis. Using the NoiseTools MATLAB toolbox, the dom order. The participants were allowed to self-initiate epoched data (time channel trial) were orthogonalized each block, allowing for rest as needed between blocks. All through principal component analysis (PCA), the components participants completed each EEG session in 90 min or less. were normalized, and the data were rotated and projected on the orthogonalized reference axes to remove the first two D. EEG analysis components (time-shift PCA) (S€arel€a and Valpola, 2005; de Cheveigne and Simon, 2007, 2008). The resultant data no lon- Finite impulse response (FIR) filters with linear phase ger exhibited visible artifacts at multiples of 50 Hz. response using 800 taps were applied to the EEG recordings The envelope phase-locking value (PLV; Tallon- to separate low-frequency (0.5–30 Hz) from higher-frequency Baudry et al., 1996; Ruggles et al., 2012) was computed for (70–1000 Hz) responses. Data from the low-frequency band all electrodes, frequencies, and conditions. The PLV is the were first epoched into 1-min events corresponding to presen- magnitude of the unit vector in each frequency bin after tation blocks. The 1-min EEG blocks were re-referenced to averaging across presentations. If the EEG waveform were averaged mastoid electrodes and baselined to the 25-ms completely coherent with the stimulus, and hence period prior to stimulus onset. The re-referenced and base- completely repeatable, the PLV would be 1 for each fre- lined 1-min blocks were further segmented into eight-triplet quency bin. For random EEG responses, the angle or direc- events with 50% (four triplets) overlap, resulting in 96 epochs tion of the unit vector at each frequency would be random of 2.4 s, each baselined to the preceding triplet (300 ms). on each presentation, and so the magnitude of the vector Combining data from three sessions resulted in 288 events in after averaging tends to 0. For a given condition, epoch positive polarity and 288 in negative polarity for each experi- lists from the three sessions were first concatenated. A mental condition. bootstrapping procedure was used to estimate PLVs based Data from the high-frequency band were also epoched on 400 trials of 75 random draws from the positive and into 1-min blocks, re-referenced, and baselined before negative event lists. For each draw, a first-order Slepian responses to the single 50-ms tokens were isolated. Individual window was applied before computing the Fourier trans- tokens were baselined to the mean of the 75 ms preceding the form. An average of the unit phase vectors was calculated token. Each test session produced 796 high-F0tokensand across all draws. The PLV was averaged across all 32 398 low-F0 tokens for each block type, and combining data channels, and peak values were extracted at key frequen- from the three sessions resulted in 2388 high-F0 tokens in cies. The slow (3.3 Hz), fast (6.7 Hz), and combined each polarity and 1194 low-F0tokensineachpolarityfor (13.3 Hz) presentation rates of the tone sequences were each experimental condition. analyzed in the low-frequency band, and the tone F0s The EEG signal in the 70–1000-Hz band sometimes (100, 106, and 238 Hz) were analyzed in the high- contained an unexpected artifact at harmonics of 50 Hz, with frequency band. A schematic showing the procedure by increasing magnitude with harmonic number, even in the which the PLVs were extracted, along with a heatmap of absence of 100 Hz stimuli. Given the significance of 100 Hz the distribution of the PLV peak values at 3.3 Hz, is shown as an experimental frequency, a denoising source separation in Fig. 2.

J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. 2427 III. RESULTS A. Behavioral results Participants’ behavioral responses during the EEG recordings were categorized into “hit,” “false alarm,” and “random” responses. Hits were defined as button presses that occurred between 0.8 and 1.8 s after an oddball event in the target stream. False alarms were defined as button pushes that occurred between 0.8 and 1.8 s after an oddball event in the non-target stream, and random responses were all other button presses. These criteria were developed based on response timing during pilot blocks. Random responses were relatively rare, averaging less than one every other block, and so were not analyzed further. Each participant’s d0 values were calculated for the 12 FIG. 3. Mean behavioral performance (N ¼ 19) during the EEG recordings conditions by separately averaging hits and false alarms in terms of the sensitivity index, d0. The results show high performance in across the three sessions, computing the z-scores, and then all conditions where there is a large (15-semitone) F0 difference between subtracting the z-scored false alarms from the z-scored hits the alternating tones, but near-chance performance when the F0 difference is small (one semitone). Performance was not affected by whether the har- (Green and Swets, 1966). Participants who obtained perfect monics were spectrally resolved. Error bars represent 61 standard error of scores for a condition across all sessions were adjusted by the mean. substituting 0.995 for a perfect hit rate and 0.005 for a perfect false alarm rate, resulting in a maximum possible d0 value of conditions, the analyzed token was present in one condition 5.15. The results shown in Fig. 3 confirm high performance (black traces) but absent in the other (red traces). for the two conditions in which only one of the two tone Token-specific phase locking to the F0 was also sequences was presented (single-stream conditions), and also observed in conditions where the stimuli were identical show high performance in conditions with a large F0separa- (both high and low tokens present) and differentiated only tion between the A and B tones. However, when the F0sepa- by which stream the participants were instructed to attend to. ration was small, performance was near chance (d0 ¼ 0). This In these cases, peak magnitudes were similar between condi- pattern of results was observed for both the resolved- tions. Peaks were extracted from these spectra at points cor- harmonics and unresolved-harmonics conditions. responding to the high F0 (238 Hz in large-separation and A three-way repeated-measures analysis of variance single-stream conditions, 106 Hz in small-separation condi- (ANOVA) was performed on the data from the four condi- tions) and low F0 (100 Hz) in each condition; the values of tions where both A and B tones were present (sequences these peaks are shown in Fig. 5, with the individual values as with single tones were excluded), with d0 as the dependent lines and the mean values as symbols. variable and factors of harmonic resolvability (resolved or Repeated-measures ANOVAs were conducted on the unresolved), F0 separation (1 or 15 semitones), and attended PLVs in all conditions that included both the A and B tones, sequence (high, A, or low, B). No main effect of resolvabil- separately for high and low tokens to study the main effects 2 of separation size (1 or 15 semitones), harmonic resolvability ity was found (F1,18 ¼ 1.5, p ¼ 0.24, gp ¼ 0.077), confirming that the participants performed similarly with both resolved (resolved or unresolved), and attention (attended or unat- and unresolved harmonics. There was a main effect of F0 tended). For the high-F0 tokens (A tones), there was a main 2 2 effect of resolvability (F1,18 ¼ 7.5, p ¼ 0.013, gp ¼ 0.30), separation (F1,18 ¼ 268.6, p < 0.0001, gp ¼ 0.94), confirm- ing the observation that performance was much poorer with with the unresolved harmonics producing a larger PLV on average. There was also a main effect of F0 separation the 1-semitone separation than with the 15-semitone separa- 2 tion, in line with expectations. There was also a main effect (F1,18 ¼ 12.4, p ¼ 0.002, gp ¼ 0.41), although it is not clear 2 if this was due to the change in F0 separation between the A of attended sequence (F1,18 ¼ 9.1, p ¼ 0.007, gp ¼ 0.34), reflecting the fact that performance was slightly poorer when and B tones, or simply the different F0 of the A tone. There listeners attended to the low tones than to the high tones. No was no main effect of attention (F1,18 ¼ 0.001, p ¼ 0.982, g 2 < interactions between the main effects were significant p 0.001), and no significant interactions. For the low-F0 tokens (B tones), the ANOVA revealed no significant main (p > 0.6 in all cases). 2 effects (Resolvability: F1,18 ¼ 3.56, p ¼ 0.075, gp ¼ 0.17; 2 Separation: F1,18 ¼ 0.59, p ¼ 0.452, gp ¼ 0.032; Attention: B. EEG responses 2 F1,18 ¼ 1.34, p ¼ 0.261, gp ¼ 0.069), and no significant inter- 1. High-frequency responses actions (all p > 0.09). Thus, for the high-frequency EFR, The PLV spectra for the higher-frequency EEG there was no evidence for attentional modulation. responses demonstrate neural phase locking to the funda- 2. Low-frequency responses mental and harmonics of the tokens presented. As an exam- ple, Fig. 4 shows phase locking in the unresolved single- The PLV spectra for the lower-frequency (cortical) band stream conditions averaged over all 19 participants. In these demonstrate robust neural phase locking to the presentation

2428 J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. FIG. 4. (Color online) Averaged PLV spectra of the high-frequency EFR in response to low-F0 (100 Hz; left panel) and high-F0 (238 Hz; right panel) tokens from the single-stream condi- tions. Thin black lines indicate condi- tions where the tokens were present; thicker red lines indicate where the tokens were absent (baseline). In both panels, a clear peak in the EFR is observed at the F0 when the stimulus was present.

rates and multiples of the high (fast rate: 6.7 Hz) and/or low (slow rate: 3.3 Hz) streams in all conditions (Fig. 6). In con- ditions with both A and B tones with only unresolved har- monics (Fig. 6, lower left and center), there was no clear effect of attention: the red and black lines generally overlap. In contrast, in conditions with resolved harmonics and large F0 separation (Fig. 6, upper left panel), the amplitude of the 3.3-Hz peak seems to be modulated by attention, with its amplitude enhanced when listeners were attending to the slower-rate (3.3-Hz) low-F0 tones. Because the fast presentation rate of the A tones (6.7 Hz) is also the second harmonic of the B-tone presenta- tion rate (3.3 Hz), the interpretation of the 6.7-Hz peak alone is problematic. We considered several methods of analyzing the data, including considering just the 3.3-Hz component, as well as calculating the ratio or difference between the PLV at 3.3 Hz and at 6.7 Hz. All approaches produced results that were highly correlated and resulted in the same statisti- cal conclusions. The mean and individual PLVs at 3.3 Hz are shown in Fig. 7 for all the conditions tested. A three-way repeated-measures ANOVA was performed with this PLV amplitude at 3.3 Hz as the dependent variable, and factors of resolvability (resolved or unresolved), F0 separation (1 or 15 semitones), and attention (high or low tones attended). Conditions containing only one of the two tones were again excluded. A significant main effect of resolvability was 2 found (F1,18 ¼ 51.8, p < 0.0001, gp ¼ 0.80), but there were no main effects of either F0 separation (F1,18 ¼ 0.16, 2 p ¼ 0.70, gp ¼ 0.012) or attention (F1,18 ¼ 4.5, p ¼ 0.053, 2 gp ¼ 0.26). Two-way interactions with resolvability were FIG. 5. (Color online) Individual and mean PLV peak values in the high- not significant (resolvability separation: F1,18 ¼ 1.29, frequency EFR for high tokens (238 Hz in Large Separation and Single 2 p ¼ 0.28, gp ¼ 0.090; resolvability attention: F1,18 ¼ 0.29, Stream conditions and 106 Hz in the Small Separation condition) and for 2 low tokens (100 Hz in all conditions). Individual data (N ¼ 19) are shown as p ¼ 0.60, gp ¼ 0.022); however, there was a significant two- lines, and mean data are shown as symbols with error bars representing 61 way interaction between F0 separation and attention standard error of the mean. In conditions where both A and B tones are 2 (F1,18 ¼ 8.03, p ¼ 0.014, gp ¼ 0.382) and a three-way inter- present (upper and middle panels for large and small F0 separations, respec- 2 tively), the results show no significant effects of attention on the high- action (F1,18 ¼ 9.65, p ¼ 0.011, gp ¼ 0.40), supporting the frequency EFR responses. observation that attention did affect responses, but only

J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. 2429 FIG. 6. (Color online) Averaged PLVs in the range of the tone repetition rates (3.3–6.7 Hz) and their second harmon- ics. Upper panels show data from con- ditions with resolved harmonics; lower panels show data from conditions with only unresolved harmonics. Thin black lines correspond to conditions where the high, fast-rate (6.7 Hz) tones were attended; thicker red traces correspond to conditions where the low-F0, slow- rate (3.3 Hz) tones were attended. The results show a significant effect of attention (***p 0.001), but only in the case of the large F0 separation, and then only for the condition with the resolved harmonics. The far right pan- els show the PLV spectrum for condi- tions where only a single tone sequence (A-tone or B-tone) was pre- sent. For both the resolved-harmonic and unresolved-harmonic complex tones, the 3.3-Hz component was sig- nificantly larger for the B-tones alone than for the A-tones alone (**p 0.01, ***p 0.001).

when the F0 separation was large, and then only when the the tone streams are presented in isolation, and when there is harmonics were resolved. This interpretation is further sup- a large F0 difference between the attended and unattended ported by the fact that paired comparisons revealed a signifi- stream. High levels of performance were observed regardless cant difference due to attention for the resolved harmonics of whether the harmonics in the complex tones were spec- with the large F0 separation (t18 ¼ 3.9, p ¼ 0.001), but no trally resolved or not. This outcome confirms that listeners difference due to attention for the resolved harmonics with are able to segregate sequences into streams based on F0 dif- the small F0 separation (t18 ¼ 0.55, p ¼ 0.59), or for the ferences, even when the harmonics are all unresolved, result- unresolved harmonics with either the large F0 separation ing in no tonotopic cues (Vliegen and Oxenham, 1999; (t18 ¼ 0.13, p ¼ 0.9) or small F0 separation (t18 ¼ 1.1, Grimault et al., 2000). p ¼ 0.27). Overall, therefore, attention affected only the low- frequency (cortical) responses, and then only in the condition B. High-frequency EFRs show no attention-based that induced stream segregation with spectrally resolved streaming effects harmonics. The EFR to high frequencies ( >80 Hz) has until IV. DISCUSSION recently been considered to primarily reflect subcortical activity from brainstem and midbrain nuclei, in particular A. Behavioral outcomes confirm perceptual streaming the (Kiren et al., 1994; Kuwada et al., in the absence of tonotopic cues 2002; Shinn-Cunningham et al., 2017). Some recent work The behavioral results show that listeners are able to using (MEG) and EEG has sug- detect oddballs accurately in one of the tone streams when gested that responses up to around 100 Hz may also include

2430 J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. PLVs in response to tones containing resolved harmonics. Thus, the change in PLV amplitude from resolved to unre- solved harmonics is opposite to the change in perceptual pitch strength or accuracy found in many previous studies (Houtsma and Smurzynski, 1990; Shackleton and Carlyon, 1994; Bernstein and Oxenham, 2003). The dissociation between PLV and pitch strength is consistent with the idea that the EFR reflects stimulus periodicity but not the percep- tion of pitch (Gockel et al., 2011). Despite the robust high- frequency EFRs in our study, no effects of attention were observed in any condition. Thus, our results provide no evi- dence for the emergence of auditory stream-based attentional effects in subcortical structures.

C. Low-frequency EFRs reflect tonotopic-based attentional streaming A different pattern of results was observed in the lower- frequency EEG band, phase-locked to the stimulus repetition rates, which likely reflects cortical activity (Kuwada et al., 2002). In contrast to the high-frequency EFR, a robust effect of attention was observed here, but only in the case of the large F0 separation with the spectrally resolved harmonics. FIG. 7. (Color online) Individual and mean PLVs at 3.3 Hz (N ¼ 19). The lack of an attentional effect with the small F0 separa- Individual data (N ¼ 19) are shown as lines, and mean data are shown as tions is consistent with the behavioral results: the F0 differ- symbols with error bars representing 61 standard error of the mean. The upper panels show results from conditions with resolved harmonics; the ence was too small to induce sequential streaming between lower panels show results from conditions with only unresolved harmonics. the two alternating tones, resulting in near-chance perfor- Far right panels show the results with only one stream present. As shown in mance in the behavioral task, which in turn indicated an Fig. 6, the effect of attention was significant in the two-stream conditions, but only for the condition with the large F0 separation, and then only in the inability to attend selectively to one or other sequence. condition with resolved harmonics (***p 0.001, **p 0.01). However, the lack of an attentional effect on the cortical PLV with the unresolved harmonics and large F0 separation cortical generators (Coffey et al., 2016; Coffey et al., 2017); represents a dissociation between the cortical and the behav- however, the most recent work on the topic suggests that ioral results. It therefore appears that the EEG correlates of with EEG (as opposed to MEG), responses around 100 Hz streaming and attention in our paradigm are limited to condi- remain strongly dominated by subcortical structures tions where there are tonotopic differences between the A- (Bidelman, 2018). and B-tone sequences. Some earlier studies have suggested that neural corre- As mentioned in the Introduction, there have been lates of streaming may emerge in subcortical structures numerous reports of attentional modulation of auditory corti- (Pressnitzer et al., 2008; Schadwinkel and Gutschalk, 2011; cal responses in the past, using EEG (e.g., Hillyard et al., Yao et al., 2015), and one study has reported modulation of 1973), MEG (Gutschalk et al., 2008; Xiang et al., 2010), the subcortical frequency following response (FFR) by per- functional magnetic resonance imaging (fMRI) (Petkov ceived streaming (whether one or two streams were per- et al., 2004), and electrocorticography (ECoG) (Mesgarani ceived in a repeating ABA triplet sequence) (Yamagishi and Chang, 2012). Neural correlates of streaming for pure et al., 2016). However, in that study, the effect was only tones have been reported previously in humans (Cusack, observed for the second A tone of each triplet, whereas corti- 2005; Gutschalk et al., 2005; Snyder et al., 2006; Wilson cal correlates have tended to be observed for the B tone in et al., 2007) and other species (Fishman et al., 2001; each sequence (Gutschalk et al., 2005; Yamagishi et al., Micheyl et al., 2005; Bee et al., 2010; Itatani and Klump, 2016). In general, reports of attentional effects on subcortical 2014). Neural correlates of streaming have also been responses have been rare (e.g., Galbraith et al., 1998; reported based on higher-level features, such as pitch in Hairston et al., 2013), and a review of the earlier literature humans (Gutschalk et al., 2007) and spatial separation in concluded that there was little or no evidence supporting the both humans (Schadwinkel and Gutschalk, 2010) and other attentional modulation of the EFR (Varghese et al., 2015). species (Yao et al., 2015). However, none of these studies Most recently, a study found attentional modulation of the on higher-level features explicitly examined the role or cor- EFR to frequencies less than 100 Hz, but not to frequencies relates of attention in their tasks. Thus, to our knowledge, around 200 Hz (Holmes et al., 2017). this is the first study to examine the correlations of attention Our results using F0s between 100 and 231 Hz showed a to specific sound dimensions in a streaming paradigm. Our strong response to the F0 of tones in the sequence and its results suggest that the tonotopic, but not the periodicity, harmonics. In general, the PLVs in response to tones con- dimension produces measurable attention-based modulation taining only unresolved harmonics were larger than the of the cortical EFRs.

J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. 2431 V. CONCLUSIONS de Cheveigne, A., and Simon, J. Z. (2008). “Denoising based on spatial fil- tering,” J. Neurosci. Methods 171, 331–339. Despite strong responses to the stimuli, no attentional Elhilali, M., Xiang, J., Shamma, S. A., and Simon, J. Z. (2009). “Interaction modulation of high-frequency EFRs was observed, consis- between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene,” PLoS Biol. 7, tent with no modulation of phase-locked brainstem or mid- e1000129. brain responses. Low-frequency EFR provided correlates of Fishman, Y. I., Reser, D. H., Arezzo, J. C., and Steinschneider, M. (2001). attention to streams that are likely to be cortical in nature, “Neural correlates of auditory stream segregation in primary auditory cor- but only in conditions where some tonotopic differences tex of the awake monkey,” Hear Res. 151, 167–187. Fishman, Y. I., Micheyl, C., and Steinschneider, M. (2013). “Neural repre- existed between the alternating tones in the sequences. Our sentation of harmonic complex tones in primary auditory cortex of the findings should not be interpreted as evidence against a neu- awake monkey,” J. Neurosci. 33, 10312–10323. ral representation of non-tonotopic auditory streaming; Formisano, E., Kim, D. S., Di Salle, F., van de Moortele, P. F., Ugurbil, K., instead, it may be that the population-based EEG recordings and Goebel, R. (2003). “Mirror-symmetric tonotopic maps in human pri- mary auditory cortex,” Neuron. 40, 859–869. are not sensitive to the potentially smaller neural populations Galbraith, G. C., Bhuta, S. M., Choate, A. K., Kitahara, J. M., and Mullen, that respond selectively to higher-level features, such as T. A., Jr. (1998). “Brain stem frequency-following response to dichotic pitch. As an example, a search for pitch-sensitive neurons in vowels during attention,” Neuroreport 9, 1889–1893. the auditory cortex of marmosets found only a relatively Glasberg, B. R., and Moore, B. C. J. (1990). “Derivation of auditory filter shapes from notched-noise data,” Hear. Res. 47, 103–138. small number of such units in a relatively constrained region Gockel, H. E., Carlyon, R. P., Mehta, A., and Plack, C. J. (2011). “The fre- of auditory cortex (Bendor and Wang, 2005), whereas large quency following response (FFR) may reflect pitch-bearing information areas of primary auditory cortex are known to reflect tono- but is not a direct representation of pitch,” J. Assoc. Res. Otolaryngol. 12, 767–782. topic organization in response to both simple (Formisano Green, D. M., and Swets, J. A. (1966). Signal Detection Theory and et al., 2003) and complex (Moerel et al., 2012) sounds. Psychophysics (Krieger, New York). Taken together with earlier findings, our results suggest that Grimault, N., Micheyl, C., Carlyon, R. P., Arthaud, P., and Collet, L. EEG measures can provide correlates of auditory streaming (2000). “Influence of peripheral resolvability on the perceptual segregation of harmonic complex tones differing in fundamental frequency,” and attention, but that such population-based correlates may J. Acoust. Soc. Am. 108, 263–271. be sensitive primarily to low-level tonotopic differences, and Gutschalk, A., Micheyl, C., Melcher, J. R., Rupp, A., Scherg, M., and not to higher-level features, such as pitch, that extend Oxenham, A. J. (2005). “Neuromagnetic correlates of streaming in human beyond simple tonotopy. auditory cortex,” J. Neurosci. 25, 5382–5388. Gutschalk, A., Micheyl, C., and Oxenham, A. J. (2008). “Neural correlates of auditory perceptual awareness under informational masking,” PLoS ACKNOWLEDGMENTS Biol. 6, 1156–1165. Gutschalk, A., Oxenham, A. J., Micheyl, C., Wilson, E. C., and Melcher, J. This work was supported by NIH Grant No. R01 R. (2007). “Human cortical activity during streaming without spectral DC016119. We thank Alain de Cheveigne, Leonard cues suggests a general neural substrate for auditory stream segregation,” Varghese, and Hari Bharadwaj for their generous assistance J. Neurosci. 27, 13074–13081. Hairston, W. D., Letowski, T. R., and McDowell, K. (2013). “Task-related in analyzing the EEG data. suppression of the brainstem frequency following response,” PLoS One 8, e55215. Allen, E. J., Burton, P. C., Olman, C. A., and Oxenham, A. J. (2017). Hartmann, W. M., and Johnson, D. (1991). “Stream segregation and periph- “Representations of pitch and timbre variation in human auditory cortex,” eral channeling,” Music Percept. 9, 155–184. J. Neurosci. 37, 1284–1293. Hillyard, S. A., Hink, R. F., Schwent, V. L., and Picton, T. W. (1973). Bee, M. A., Micheyl, C., Oxenham, A. J., and Klump, G. M. (2010). “Electrical signs of selective attention in the human brain,” Science 182, “Neural adaptation to tone sequences in the songbird forebrain: Patterns, 177–180. determinants, and relation to the build-up of auditory streaming,” Holmes, E., Purcell, D. W., Carlyon, R. P., Gockel, H. E., and Johnsrude, I. J. Comp. Physiol. A Neuroethol. Sens Neural Behav. Physiol. 196, S. (2017). “Attentional modulation of envelope-following responses at 543–557. lower (93-109 Hz) but not higher (217-233 Hz) modulation rates,” Bendor, D., and Wang, X. (2005). “The neuronal representation of pitch in J. Assoc. Res. Otolaryngol. 19, 83–97. primate auditory cortex,” Nature 436, 1161–1165. Houtsma, A. J. M., and Smurzynski, J. (1990). “Pitch identification and dis- Bernstein, J. G., and Oxenham, A. J. (2003). “Pitch discrimination of diotic crimination for complex tones with many harmonics,” J. Acoust. Soc. Am. and dichotic tone complexes: Harmonic resolvability or harmonic 87, 304–310. number?,” J. Acoust. Soc. Am. 113, 3323–3334. Itatani, N., and Klump, G. M. (2014). “Neural correlates of auditory stream- Bidelman, G. M. (2018). “Subcortical sources dominate the neuroelectric ing in an objective behavioral task,” Proc. Natl. Acad. Sci. U.S.A. 111, auditory frequency-following response to speech,” Neuroimage 175, 10738–10743. 56–69. Javier, L. K., McGuire, E. A., and Middlebrooks, J. C. (2016). “Spatial Cedolin, L., and Delgutte, B. (2005). “Pitch of complex tones: Rate-place stream segregation by cats,” J. Assoc. Res. Otolaryngol. 17, 195–207. and interspike interval representations in the auditory nerve,” Kiren, T., Aoyagi, M., Furuse, H., and Koike, Y. (1994). “An experimental J. Neurophysiol. 94, 347–362. study on the generator of amplitude-modulation following response,” Acta Coffey, E. B., Herholz, S. C., Chepesiuk, A. M., Baillet, S., and Zatorre, R. Otolaryngol. Suppl. 511, 28–33. J. (2016). “Cortical contributions to the auditory frequency-following Kuwada, S., Anderson, J. S., Batra, R., Fitzpatrick, D. C., Teissier, N., and response revealed by MEG,” Nat. Commun. 7, 11070. D’Angelo, W. R. (2002). “Sources of the scalp-recorded amplitude-modu- Coffey, E. B. J., Musacchia, G., and Zatorre, R. J. (2017). “Cortical corre- lation following response,” J. Am. Acad. Audiol. 13, 188–204. lates of the auditory frequency-following and onset responses: EEG and Lukas, J. H. (1981). “The role of efferent inhibition in human auditory atten- fMRI evidence,” J. Neurosci. 37, 830–838. tion: An examination of the auditory brainstem potentials,” Int. J. Cusack, R. (2005). “The intraparietal sulcus and perceptual organization,” Neurosci. 12, 137–145. J. Cogn. Neurosci. 17, 641–651. Maison, S., Micheyl, C., and Collet, L. (2001). “Influence of focused audi- David, M., Lavandier, M., and Grimault, N. (2015). “Sequential streaming, tory attention on cochlear activity in humans,” Psychophysiology 38, binaural cues and lateralization,” J. Acoust. Soc. Am. 138, 3500–3512. 35–40. de Cheveigne, A., and Simon, J. Z. (2007). “Denoising based on time-shift Mesgarani, N., and Chang, E. F. (2012). “Selective cortical representation of PCA,” J. Neurosci. Methods 165, 297–305. attended speaker in multi-talker speech perception,” Nature 485, 233–236.

2432 J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. Mesgarani, N., David, S. V., Fritz, J. B., and Shamma, S. A. (2014). Schadwinkel, S., and Gutschalk, A. (2011). “Transient bold activity locked “Mechanisms of noise robust representation of speech in primary auditory to perceptual reversals of auditory streaming in human auditory cortex and cortex,” Proc. Natl. Acad. Sci. U.S.A. 111, 6792–6797. inferior colliculus,” J. Neurophysiol. 105, 1977–1983. Micheyl, C., Tian, B., Carlyon, R. P., and Rauschecker, J. P. (2005). Shackleton, T. M., and Carlyon, R. P. (1994). “The role of resolved and “Perceptual organization of tone sequences in the auditory cortex of awake unresolved harmonics in pitch perception and frequency modulation dis- macaques,” Neuron 48, 139–148. crimination,” J. Acoust. Soc. Am. 95, 3529–3540. Miller, G. A., and Heise, G. A. (1950). “The trill threshold,” J. Acoust. Soc. Shinn-Cunningham, B., Varghese, L., Wang, L., and Bharadwaj, H. (2017). Am. 22, 637–638. “Individual differences in temporal perception and their implications for every- Moerel, M., De Martino, F., and Formisano, E. (2012). “Processing of natu- day listening,” in The Frequency-Following Response: A Window into Human ral sounds in human auditory cortex: tonotopy, spectral tuning, and rela- Communication, edited by N. Kraus, S. Anderson, T. White-Schwoch, R. R. tion to voice sensitivity,” J. Neurosci. 32, 14205–14216. Fay, and A. N. Popper (Springer Verlag, Cham, Switzerland). Moore, B. C. J., Huss, M., Vickers, D. A., Glasberg, B. R., and Alcantara, J. Snyder, J. S., Alain, C., and Picton, T. W. (2006). “Effects of attention on I. (2000). “A test for the diagnosis of dead regions in the cochlea,” Br. J. neuroelectric correlates of auditory stream segregation,” J. Cogn. Audiol. 34, 205–224. Neurosci. 18, 1–13. Norman-Haignere, S., Kanwisher, N., and McDermott, J. H. (2013). Tallon-Baudry, C., Bertrand, O., Delpuech, C., and Pernier, J. (1996). “Cortical pitch regions in humans respond primarily to resolved harmonics “Stimulus specificity of phase-locked and non-phase-locked 40 Hz visual and are located in specific tonotopic regions of anterior auditory cortex,” responses in human,” J. Neurosci. 16, 4240–4249. J. Neurosci. 33, 19451–19469. Tougas, Y., and Bregman, A. S. (1985). “Crossing of auditory streams,” O’Sullivan, J. A., Power, A. J., Mesgarani, N., Rajaram, S., Foxe, J. J., J. Exp. Psychol.-Human Percept. Perform. 11, 788–798. Shinn-Cunningham, B. G., Slaney, M., Shamma, S. A., and Lalor, E. C. van Noorden, L. P. A. S. (1977). “Minimum differences of level and fre- (2015). “Attentional selection in a cocktail party environment can be quency for perceptual fission of tone sequences ABAB,” J. Acoust. Soc. decoded from single-trial EEG,” 25, 1697–1706. Am. 61, 1041–1045. Oxenham, A. J., Micheyl, C., and Keebler, M. V. (2009). “Can temporal Varghese, L., Bharadwaj, H. M., and Shinn-Cunningham, B. G. (2015). fine structure represent the fundamental frequency of unresolved harmoni- “Evidence against attentional state modulating scalp-recorded auditory cs?,” J. Acoust. Soc. Am. 125, 2189–2199. brainstem steady-state responses,” Brain Res. 1626, 146–164. Penagos, H., Melcher, J. R., and Oxenham, A. J. (2004). “A neural represen- Vliegen, J., and Oxenham, A. J. (1999). “Sequential stream segregation in tation of pitch salience in non-primary human auditory cortex revealed the absence of spectral cues,” J. Acoust. Soc. Am. 105, 339–346. with fMRI,” J. Neurosci. 24, 6810–6815. Wilson, E. C., Melcher, J. R., Micheyl, C., Gutschalk, A., and Oxenham, A. Petkov, C. I., Kang, X., Alho, K., Bertrand, O., Yund, E. W., and Woods, D. J. (2007). “Cortical fMRI activation to sequences of tones alternating in L. (2004). “Attentional modulation of human auditory cortex,” Nat. frequency: Relationship to perceived rate and streaming,” J. Neurophysiol. Neurosci. 7, 658–663. 97, 2230–2238. Plomp, R. (1964). “The ear as a frequency analyzer,” J. Acoust. Soc. Am. Woods, K. J., and McDermott, J. H. (2015). “Attentive tracking of sound 36, 1628–1636. sources,” Curr. Biol. 25, 2238–2246. Pressnitzer, D., Sayles, M., Micheyl, C., and Winter, I. M. (2008). Xiang, J., Simon, J., and Elhilali, M. (2010). “Competing streams at the “Perceptual organization of sound begins in the auditory periphery,” Curr. cocktail party: Exploring the mechanisms of attention and temporal inte- Biol. 18, 1124–1128. gration,” J. Neurosci. 30, 12084–12093. Roberts, B., Glasberg, B. R., and Moore, B. C. J. (2002). “Primitive stream Yamagishi, S., Otsuka, S., Furukawa, S., and Kashino, M. (2016). segregation of tone sequences without differences in fundamental fre- “Subcortical correlates of auditory perceptual organization in humans,” quency or passband,” J. Acoust. Soc. Am. 112, 2074–2085. Hear Res. 339, 104–111. Ruggles, D., Bharadwaj, H., and Shinn-Cunningham, B. G. (2012). “Why Yao, J. D., Bremen, P., and Middlebrooks, J. C. (2015). “Emergence of spa- middle-aged listeners have trouble hearing in everyday settings,” Curr. tial stream segregation in the ascending auditory pathway,” J. Neurosci. Biol. 22, 1417–1422. 35, 16199–16212. S€arel€a, J., and Valpola, H. (2005). “Denoising source separation,” J. Mach. Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., Learn. Res. 6, 233–272. McKhann, G. M., Goodman, R. R., Emerson, R., Mehta, A. D., Simon, J. Schadwinkel, S., and Gutschalk, A. (2010). “Activity associated with stream Z., Poeppel, D., and Schroeder, C. E. (2013). “Mechanisms underlying segregation in human auditory cortex is similar for spatial and pitch cues,” selective neuronal tracking of attended speech at a ‘cocktail party,’ ” Cereb. Cortex 20, 2863–2873. Neuron 77, 980–991.

J. Acoust. Soc. Am. 144 (4), October 2018 Ruggles et al. 2433