bioRxiv preprint doi: https://doi.org/10.1101/106617; this version posted February 7, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Neural Encoding of Auditory Features during Music Perception and Imagery: Insight into the Brain of a Piano Player

Running title: Auditory Features and Music Imagery

Stephanie Martin1,2,8, Christian Mikutta2,3,4,8, Matthew K. Leonard5, Dylan Hungate5, Stefan Koelsch6, Edward F. Chang5, José del R. Millán1, Robert T. Knight2,7, Brian N. Pasley2,9

1 Defitech Chair in Brain-Machine Interface, Center for Neuroprosthetics, Ecole Polytechnique Fédérale de Lausanne, Switzerland
2 Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, USA
3 Translational Research Center and Division of Clinical Research Support, Psychiatric Services University of Bern (UPD), University Hospital of Psychiatry, Bern, Switzerland
4 Department of Neurology, Inselspital, Bern University Hospital, University of Bern, Switzerland
5 Department of Neurological Surgery, Department of Physiology, and Center for Integrative Neuroscience, University of California, San Francisco, CA 94143, USA
6 Languages of Emotion, Freie Universität Berlin, Germany
7 Department of Psychology, University of California, Berkeley, CA 94720, USA
8 Co-first author
9 Lead Contact

Corresponding Author: Brian Pasley, University of California, Berkeley, Helen Wills Neuroscience Institute, 210 Barker Hall, Berkeley, CA 94720, USA, [email protected]

Abstract

It remains unclear how the human cortex represents spectrotemporal sound features during auditory imagery, and how this representation compares to auditory perception. To assess this, we recorded electrocorticographic signals from an epileptic patient with proficient music ability in two conditions. First, the participant played two piano pieces on an electronic piano with the sound volume of the digital keyboard turned on. Second, the participant replayed the same piano pieces without auditory feedback and was asked to imagine hearing the music in his mind. In both conditions, the sound output of the keyboard was recorded, thus allowing precise time-locking between the neural activity and the spectrotemporal content of the music imagery. For both conditions, we built encoding models to predict high gamma neural activity (70–150 Hz) from the spectrogram representation of the recorded sound. We found robust similarities between perception and imagery in the frequency and temporal tuning properties of auditory areas.

Keywords: electrocorticography, music perception and imagery, spectrotemporal receptive fields, frequency tuning, encoding models.

Abbreviations

ECoG: electrocorticography
HG: high gamma
IFG: inferior frontal gyrus
MTG: middle temporal gyrus
Post-CG: post-central gyrus
Pre-CG: pre-central gyrus
SMG: supramarginal gyrus
STG: superior temporal gyrus
STRF: spectrotemporal receptive field

Introduction

Auditory imagery is defined here as the mental representation of sound perception in the absence of external auditory stimulation. The experience of auditory imagery is common, such as when a song runs continually through someone’s mind. At an advanced level, professional musicians are able to imagine a piece of music by looking at the sheet music [1]. Behavioral studies have shown that structural and temporal properties of auditory features (see [2] for a complete review), such as pitch [3], timbre [4,5], loudness [6] and rhythm [7], are preserved during auditory imagery. However, it is unclear how these auditory features are represented in the brain during imagery. Experimental investigation is difficult due to the lack of an observable stimulus or behavioral markers during auditory imagery. Using a novel experimental paradigm to synchronize auditory imagery events to neural activity, we investigated the neural representation of spectrotemporal auditory features during auditory imagery in an epileptic patient with proficient music abilities.

Previous studies have identified anatomical regions active during auditory imagery [8], and how they compare to actual auditory perception. For instance, lesion [9] and brain imaging studies [4,10–14] have confirmed the involvement of bilateral temporal lobe regions during auditory imagery (see [15] for a review). Brain areas consistently activated with fMRI during auditory imagery include the secondary auditory cortex [10,12,16], the frontal cortex [16,17], the sylvian parietal temporal area [18], ventrolateral and dorsolateral cortices [19] and the supplementary motor area [17,20–24]. Anatomical regions active during auditory imagery have been compared to those active during actual auditory perception to understand the interactions between externally and internally driven cortical processes. Several studies showed that auditory imagery has substantial, but not complete, overlap in brain areas with music perception [8] – e.g., the secondary auditory cortex is consistently activated during both music imagery and perception, while the primary auditory areas appear to be activated solely during auditory perception [4,10,25,26].

These studies have helped to unravel the anatomical brain areas involved in auditory perception and imagery; however, there is a lack of evidence for the representation of specific acoustic features in the human cortex during auditory imagery. It remains a challenge to investigate neural processing during an internal subjective experience like music imagery, due to the difficulty of time-locking brain activity to a measurable stimulus. To address this issue, we recorded electrocorticographic (ECoG) neural signals from a proficient piano player in a novel task design that permitted robust alignment of the spectrotemporal content of the intended music imagery to neural activity – thus allowing us to investigate specific auditory features during auditory imagery. In the first condition, the participant played an electronic piano with the sound output turned on: the sound was played out loud through speakers at a comfortable volume that allowed auditory feedback (perception condition). In the second condition, the participant played the electronic piano with the speakers turned off, and instead imagined the corresponding music in his mind (imagery condition). In both conditions, the digitized sound output of the MIDI-compatible sound module was recorded. This provided a measurable record of the content and timing of the participant’s music imagery when the speakers of the keyboard were turned off and he did not hear the music. This task design allowed precise temporal alignment between the recorded neural activity and spectrogram representations of music perception and imagery – providing a unique opportunity to apply receptive field modeling techniques to quantitatively study neural encoding during auditory imagery.

A well-established role of the early auditory system is to decompose complex sounds into their component frequencies [27–29], giving rise to tonotopic maps in the auditory cortex (see [30] for a review). Auditory perception has been extensively studied in animal models and humans using spectrotemporal receptive field (STRF) analysis [27,31–34], which identifies the time-frequency stimulus features encoded by a neuron or population of neurons. STRFs are consistently observed during auditory perception tasks, but the existence of STRFs during auditory imagery is unclear due to the experimental challenges associated with synchronizing neural activity and the imagined stimulus. To characterize and compare spectrotemporal tuning properties during auditory imagery and perception, we fitted separate encoding models on data collected in the perception and imagery conditions. Here, an encoding model describes the linear mapping between a given auditory stimulus representation and its corresponding brain response. For instance, encoding models have revealed the neural tuning properties of various speech features, such as acoustic, phonetic and semantic representations [33,35–38].

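As an illustration of such a linear encoding model, the sketch below fits an STRF by ridge regression from a time-lagged spectrogram to a single electrode's high-gamma envelope, using synthetic data. The lag window, regularization strength, and all variable names are assumptions for illustration, not the authors' exact fitting procedure:

```python
import numpy as np

def lag_features(spec, n_lags):
    """Build a lagged design matrix: each row holds the spectrogram
    over the previous n_lags time bins (zero-padded at the start)."""
    n_t, n_f = spec.shape
    X = np.zeros((n_t, n_lags * n_f))
    for lag in range(n_lags):
        X[lag:, lag * n_f:(lag + 1) * n_f] = spec[:n_t - lag]
    return X

def fit_strf(spec, hg, n_lags=10, alpha=1.0):
    """Ridge-regression STRF mapping time-lagged spectrogram features to
    one electrode's high-gamma envelope; returns (n_lags, n_freq) weights."""
    X = lag_features(spec, n_lags)
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ hg)
    return w.reshape(n_lags, spec.shape[1])

# Toy demo: an electrode driven by frequency bin 3 at a 2-bin delay.
rng = np.random.default_rng(0)
spec = rng.random((500, 8))
hg = np.roll(spec[:, 3], 2) + 0.01 * rng.standard_normal(500)
strf = fit_strf(spec, hg, n_lags=5, alpha=0.1)
peak_lag, peak_freq = np.unravel_index(np.abs(strf).argmax(), strf.shape)
print(peak_lag, peak_freq)  # 2 3: largest weight at lag 2, frequency bin 3
```

The recovered weight matrix plays the role of the STRF: its excitatory and inhibitory subregions indicate which time-frequency stimulus features drive the electrode.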
In this study, the neural representation of music perception and imagery was quantified by STRFs that predict high gamma (HG; 70–150 Hz) neural activity. High gamma activity correlates with the spiking activity of the underlying neuronal ensemble [39–41] and reliably tracks speech and music features in auditory and motor cortex [33,42–45]. Results demonstrated the presence of robust spectrotemporal receptive fields during auditory imagery, with extensive overlap in frequency tuning and cortical location compared to receptive fields measured during auditory perception. These results provide a quantitative characterization of the shared neural representation underlying auditory perception and the subjective experience of auditory imagery.

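High-gamma power of the kind modeled here is commonly estimated by band-pass filtering the ECoG signal and taking the amplitude of the analytic (Hilbert) signal. The sketch below is a generic illustration of that idea; the filter type and order are assumptions, not the authors' preprocessing pipeline:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def high_gamma_envelope(ecog, fs, band=(70.0, 150.0)):
    """Band-pass each channel in the high-gamma range and take the
    analytic-signal amplitude as a proxy for population activity."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecog, axis=-1)
    return np.abs(hilbert(filtered, axis=-1))

# Toy demo: a 100 Hz burst in the second half of the trace should yield
# a larger high-gamma envelope there than during the silent first half.
fs = 1000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 100 * t) * (t > 0.5)
env = high_gamma_envelope(sig[np.newaxis, :], fs)[0]
print(env[750] > env[250])  # True: the envelope tracks the burst
```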
Results

High gamma neural encoding during auditory perception and imagery
In this study, ECoG recordings were obtained using subdural electrode arrays implanted in a patient undergoing neurosurgical procedures for epilepsy. The participant was a proficient piano player (started playing at age 7, with 10 years of music education and 5 hours of training per week). Prior studies have shown that music training is associated with improved auditory imagery ability, such as pitch and temporal acuity [46–48]. We took advantage of this rare clinical case to investigate the neural correlates of music perception and imagery. In the first task, the participant played two music pieces on an electronic piano with the speakers of the digital keyboard turned on (perception condition; Fig 1A). In the second task, the participant played the same two piano pieces, but the volume of the speaker system was turned off to establish a silent room. The participant was asked to imagine hearing the corresponding music in his mind as he played the piano (imagery condition; Fig 1B). In both conditions, the sound produced by the MIDI-compatible sound module was digitally recorded in synchrony with the ECoG signal. The recorded sound allowed us to synchronize the auditory spectrotemporal patterns of the imagined music with the neural activity, even though the speaker system was turned off and no audible sounds were present in the room.

Fig 1. Experimental task design.
(A) The participant played an electronic piano with the sound of the digital keyboard turned on (perception condition). (B) In the second condition, the participant played the piano with the sound turned off and instead imagined the corresponding music in his mind (imagery condition). In both conditions, the digitized sound output of the MIDI-compatible sound module was recorded in synchrony with the neural signals (even when the participant did not hear any sound in the imagery condition). The models take as input a spectrogram consisting of time-varying spectral power across a range of acoustic frequencies (200–7,000 Hz, bottom left) and output time-varying neural signals. To assess the encoding accuracy, the predicted neural signal (light lines) is compared to the original neural signal (dark lines).

The participant performed two famous music pieces: Chopin’s Prelude in C minor, Op. 28, No. 20 and Bach’s Prelude in C major, BWV 846. Example auditory spectrograms from Chopin’s Prelude, determined from the participant's key presses on the electronic piano, are shown in Fig 2B for both perception and imagery conditions. To evaluate how consistently the participant performed across the perception and imagery tasks, we computed the realigned Euclidean distance [49] between the spectrograms of the same music piece played across conditions (within-stimulus distance). We compared the within-stimulus distance with the realigned Euclidean distance between the spectrograms of the different music pieces (between-stimulus distance). The between-stimulus distance was 251.6% larger than the within-stimulus distance (p<10⁻³; randomization test), indicating that the spectrograms of the same music piece played across conditions were more similar than the spectrograms of different music pieces, and thus that the participant performed the task with relative consistency and specificity across the two conditions. To compare spectrotemporal auditory representations during the music perception and imagery tasks, we fit separate encoding models in each condition. We used these models to quantify specific anatomical and neural tuning differences between auditory perception and imagery.

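A realigned distance of this kind can be sketched by first warping one spectrogram onto the other with dynamic time warping and then averaging the frame-wise Euclidean cost along the optimal path. This is a minimal illustration of the idea on synthetic data; the exact realignment procedure of [49] may differ:

```python
import numpy as np

def realigned_distance(A, B):
    """Euclidean distance between two spectrograms (time x freq) after
    dynamic-time-warping realignment, normalized by path length."""
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Toy demo: the same piece played twice as slowly (within-stimulus) should
# be closer than an unrelated piece (between-stimulus).
rng = np.random.default_rng(1)
piece1 = rng.random((40, 16))
within = realigned_distance(piece1, np.repeat(piece1, 2, axis=0))
between = realigned_distance(piece1, rng.random((40, 16)))
print(within < between)  # True: warping absorbs the tempo change
```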
Fig 2. Encoding accuracy.
(A) Electrode locations overlaid on a cortical surface reconstruction of the participant’s cerebrum. (B) Overlay of the spectrogram contours for the perception (blue) and imagery (orange) conditions (10% of maximum energy from the spectrograms). (C) Actual and predicted high gamma band power (70–150 Hz) induced by the music perception and imagery segments in (B). The top electrode shows very similar predictive power across conditions, whereas the bottom electrode differs markedly between perception and imagery. Recordings are from two different STG sites, highlighted in pink in (A). (D) Encoding accuracy plotted on the cortical surface reconstruction of the participant’s cerebrum (map thresholded at p<0.05; FDR correction). (E) Encoding accuracy of significant electrodes of the perception model as a function of the imagery model. Electrode-specific encoding accuracy is correlated between the perception and imagery models (r=0.65; p<10⁻⁴; randomization test). (F) Encoding accuracy as a function of anatomical location (pre-central gyrus (pre-CG), post-central gyrus (post-CG), supramarginal gyrus (SMG), middle temporal gyrus (MTG) and superior temporal gyrus (STG)).

For both the perception and imagery conditions, the observed and predicted high gamma neural responses are illustrated for two individual electrodes in the temporal lobe (Fig 2C), together with the corresponding music spectrograms (Fig 2B). The predicted neural response for the electrode shown in the upper panel of Fig 2C was significantly correlated with its corresponding measured neural response in both the perception (r=0.41; p<10⁻⁷; one-sample z-test; FDR correction) and imagery (r=0.42; p<10⁻⁴; one-sample z-test; FDR correction) conditions. The predicted neural response for the lower panel electrode was correlated with the actual neural response only in the perception condition (r=0.23; p<0.005; one-sample z-test; FDR correction) but not in the imagery condition (r=-0.02; p>0.5; one-sample z-test; FDR correction). The difference between conditions was significant for the electrode in the lower panel (p<0.05; two-sample t-test), but not for the electrode in the upper panel (p>0.5; two-sample t-test). This suggests that there is a strong continuous relationship between time-varying imagined sound features and STG neural activity, but that this relationship depends on cortical location.

To further investigate anatomical similarities and differences between the perception and imagery conditions, we plotted the anatomical layout of the prediction accuracy of individual electrodes. In both conditions, the sites with the highest prediction accuracy were located in the superior and middle temporal gyri, pre- and post-central gyri, and supramarginal gyrus (Fig 2D; heat map thresholded at p<0.05; one-sample z-test; FDR correction), consistent with previous ECoG results [33]. Of the 256 electrodes recorded, 210 were fitted in the encoding model, while the remaining 46 electrodes were removed due to excessive noise. Among the fitted electrodes, 35 and 15 were significant in the perception and imagery conditions, respectively (p<0.05; one-sample z-test; FDR correction), of which 9 electrodes were significant in both conditions. The anatomical locations of the electrodes with significant encoding accuracy are depicted in Fig 3 and Fig S1. To compare encoding accuracy across conditions, we performed additional analyses on the electrodes that had significant encoding accuracy in at least one condition (41 electrodes; unless otherwise stated). The prediction accuracy of individual electrodes was correlated between perception and imagery (Fig 2E; 41 electrodes; r=0.65; p<10⁻⁴; randomization test). Because both the perception and imagery models are based on the same auditory stimulus representation, this correlated prediction accuracy provides strong evidence for a shared neural representation of sound based on spectrotemporal features.

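Per-electrode encoding accuracy and an FDR threshold of the kind used above can be sketched as follows. This is a generic illustration using Pearson correlation and Benjamini-Hochberg correction; the paper's significance testing also involved a one-sample z-test, which is omitted here:

```python
import numpy as np
from scipy import stats

def encoding_accuracy(pred, actual):
    """Pearson r between predicted and measured high-gamma, per electrode."""
    return np.array([stats.pearsonr(p, a)[0] for p, a in zip(pred, actual)])

def fdr_significant(pvals, q=0.05):
    """Benjamini-Hochberg procedure: boolean mask of discoveries at level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = q * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(len(p), bool)
    mask[order[:k]] = True
    return mask

# Toy demo: a well-predicted and an anti-correlated "electrode" ...
pred = np.array([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]])
actual = np.array([[1.1, 2.2, 2.9, 4.3], [4.0, 3.0, 2.0, 1.0]])
r = encoding_accuracy(pred, actual)  # ~[0.99, -1.0]
# ... and FDR: strong p-values survive, weak ones do not.
mask = fdr_significant([0.001, 0.002, 0.8, 0.9], q=0.05)
print(mask)  # [ True  True False False]
```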
To assess how the brain areas encoding auditory features varied across experimental conditions, we analyzed the significant electrodes in the gyri highlighted in Fig 2A (pre-central gyrus (pre-CG), post-central gyrus (post-CG), supramarginal gyrus (SMG), middle temporal gyrus (MTG) and superior temporal gyrus (STG)) using the Wilcoxon signed-rank test (p>0.05; one-sample Kolmogorov-Smirnov test; Fig 2F). Encoding accuracy in the MTG and STG was higher for the perception (MTG: M=0.16, STG: M=0.13) than for the imagery (MTG: M=0.11, STG: M=0.08) condition (p<0.05; Wilcoxon signed-rank test; Bonferroni correction). Encoding accuracy in the pre-CG, post-CG and SMG did not differ between the perception (pre-CG: M=0.17; post-CG: M=0.15; SMG: M=0.12) and imagery (pre-CG: M=0.14; post-CG: M=0.13; SMG: M=0.12) conditions (p>0.5; Wilcoxon signed-rank test; Bonferroni correction). The advantage of the perception over the imagery model was thus specific to the temporal lobe, which may reflect underlying differences in spectrotemporal encoding mechanisms or, alternatively, a greater sensitivity to discrepancies between the actual content of imagery and the recorded sound stimulus used in the model.

Spectrotemporal tuning during auditory perception and imagery

Auditory imagery and perception are distinct subjective experiences, yet both are characterized by a sense of sound. How does the auditory system encode sound during music perception and imagery? Examples of standard STRFs are shown in Fig 3 for temporal electrodes (see Fig S2 for all STRFs). These STRFs highlight neural stimulus preferences, as shown by their excitatory (warm colors) and inhibitory (cold colors) subregions. Fig 4A shows the latency (s) for the perception and imagery conditions, defined as the temporal coordinate of the maximum deviation in the STRF. The peak latency was significantly correlated between the two conditions (r=0.43; p<0.005; randomization test), suggesting similar response timing during perception and imagery. We then analyzed frequency tuning curves estimated from the STRFs (see Materials and Methods for details). Examples of tuning curves for both the perception and imagery encoding models are shown for the electrodes indicated by the black outlines on the anatomical brain (Fig 4B). Across conditions, the majority of individual electrodes exhibited a complex frequency tuning profile. For each electrode, the similarity between the tuning curves of the perception and imagery models was quantified using Pearson’s correlation coefficient. The anatomical distribution of tuning curve similarity is plotted in Fig 4B, with the correlation at individual sites ranging from r=-0.3 to 0.6. The effect of anatomical location (pre-CG, post-CG, SMG, MTG and STG) on tuning curve similarity was not significant (Chi-square=3.59; p>0.1; Kruskal-Wallis test). The similarity in tuning curve shape between auditory imagery and perception suggests a shared auditory representation, but there is no evidence that this similarity depended on gyral area.
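A tuning curve of the kind compared above can be sketched by averaging the STRF gain across time lags and correlating the resulting curves between conditions. This is an illustrative sketch on synthetic STRFs; the shapes and noise levels are assumptions:

```python
import numpy as np

def tuning_curve(strf):
    """Frequency tuning curve: average STRF gain across time lags,
    where strf has shape (n_lags, n_freq)."""
    return strf.mean(axis=0)

def tuning_similarity(strf_perception, strf_imagery):
    """Pearson correlation between perception and imagery tuning curves."""
    a = tuning_curve(strf_perception)
    b = tuning_curve(strf_imagery)
    return np.corrcoef(a, b)[0, 1]

# Toy demo: two noisy STRFs tuned to the same frequency band (a Gaussian
# bump peaking at frequency bin 10) should yield highly similar curves.
rng = np.random.default_rng(2)
base = np.exp(-0.5 * ((np.arange(32) - 10) / 3.0) ** 2)
strf_p = base[None, :] + 0.05 * rng.standard_normal((8, 32))
strf_i = base[None, :] + 0.05 * rng.standard_normal((8, 32))
r = tuning_similarity(strf_p, strf_i)
print(round(r, 2))  # close to 1
```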

Fig 3. Spectrotemporal receptive fields (STRFs).
Examples of standard STRFs for the perception (left panel) and imagery (right panel) models (warm colors indicate where the neuronal ensemble is excited; cold colors indicate where it is inhibited). In the lower left corner, electrode locations are overlaid on a cortical surface reconstruction of the participant’s cerebrum. Electrodes whose STRFs are shown are outlined in black. Grey electrodes were removed from the analysis due to excessive noise (see Materials and Methods).


Fig 4. Auditory tuning.
(A) Latency peaks – estimated from the STRFs – were significantly correlated between the perception and imagery conditions (r=0.43; p<0.005; randomization test). (B) Examples of tuning curves for both the perception and imagery encoding models, defined as the average gain of the STRF as a function of acoustic frequency, for the electrodes indicated by the black outlines on the anatomical brain (left panel). Correlation coefficients between the perception and imagery conditions are plotted for significant electrodes on the cortical surface reconstruction of the participant’s cerebrum. Grey electrodes were removed from the analysis due to excessive noise. The bottom panel is a histogram of electrode correlation coefficients between the perception and imagery tuning curves. (C) Proportion of predictive electrode sites (N=41) with peak tuning at each frequency. Tuning peaks were identified as significant parameters in the acoustic frequency tuning curves (z>3.1; p<0.001), separated by more than one third of an octave.

Different electrodes are sensitive to different acoustic frequencies important for music processing. We next quantitatively assessed how the frequency tuning of predictive electrodes (N=41) varied across the two conditions. First, to evaluate how the acoustic spectrum was covered at the population level, we quantified the proportion of significant electrodes with a tuning peak at each acoustic frequency (Fig 4C). Tuning peaks were identified as significant parameters in the acoustic frequency tuning curves (z>3.1; p<0.001; separated by more than one third of an octave). The proportion of electrodes with tuning peaks was significantly larger for the perception (mean = 0.19) than for the imagery (mean = 0.14) condition (Fig 4C; p<0.05; Wilcoxon signed-rank test). Despite this, both conditions exhibited reliable frequency selectivity, as nearly the full range of the acoustic frequency spectrum was encoded: the fraction of acoustic frequencies covered by peaks of predictive electrodes was 0.91 for the perception and 0.89 for the imagery condition.

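Peak picking with an octave-separation constraint, as described above, can be sketched as follows: keep z-scored tuning values above threshold, take candidates from largest to smallest, and discard any candidate within a third of an octave of an already-accepted peak. The thresholds match those reported; the greedy selection order is an assumption:

```python
import numpy as np

def tuning_peaks(zscores, freqs, z_thresh=3.1, min_sep_octaves=1 / 3):
    """Indices of significant tuning peaks (z > z_thresh), keeping only
    peaks separated by more than min_sep_octaves, largest first."""
    candidates = sorted(np.nonzero(zscores > z_thresh)[0],
                        key=lambda i: -zscores[i])
    peaks = []
    for i in candidates:
        if all(abs(np.log2(freqs[i] / freqs[j])) > min_sep_octaves
               for j in peaks):
            peaks.append(i)
    return sorted(peaks)

# Toy demo: two significant bins 0.07 octaves apart collapse to one peak,
# while a significant bin two octaves away survives as a second peak.
freqs = np.array([200.0, 210.0, 400.0, 800.0])
z = np.array([4.0, 3.5, 1.0, 5.0])
print(tuning_peaks(z, freqs))  # [0, 3]
```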
Reconstruction of auditory features during music perception and imagery

To evaluate the ability to identify piano notes from brain activity, we reconstructed the same auditory spectrogram representation used in the encoding models. The overall reconstruction accuracy was higher than zero in both conditions (Fig 5A; p<0.001; randomization test), but did not differ between conditions (p>0.05; two-sample t-test). As a function of acoustic frequency, mean accuracy ranged from r=0 to 0.45 (Fig 5A).

Fig 5. Reconstruction accuracy.
(A) Right panel: overall reconstruction accuracy of the spectrogram representation for both the perception (blue) and imagery (orange) conditions. Error bars denote resampling SEM. Left panel: reconstruction accuracy as a function of acoustic frequency. The shaded region denotes SEM over the resamples. (B) Examples of original and reconstructed segments for the perception (left) and imagery (right) models. (C) Distribution of identification ranks for all reconstructed spectrogram notes. The median identification rank is 0.65 and 0.63 for the perception and imagery decoding models, respectively, significantly higher than the 0.50 chance level (p<0.001; randomization test). Left panel: receiver operating characteristic (ROC) plot of identification performance for the perception (blue curve) and imagery (orange curve) models. The diagonal black line indicates no predictive power.

We further assessed reconstruction accuracy by evaluating the ability to identify isolated piano notes from the test-set auditory spectrogram reconstructions. Examples of original and reconstructed segments are depicted in Fig 5B for the perception (left) and imagery (right) models. For the identification, we extracted 0.5-second segments at piano note onsets from the original and reconstructed auditory spectrograms. Note onsets were defined with the MIRtoolbox [50]. Then, we computed the correlation coefficient between a target reconstructed spectrogram and each original spectrogram in the candidate set. Finally, we sorted the coefficients and computed the identification rank as the percentile rank of the correct spectrogram. This metric reflects how well the target reconstruction matched the correct original spectrogram out of all candidates. The median identification rank of individual piano notes was significantly higher than chance for both conditions (Fig 5C; median identification rank: perception = 0.72 and imagery = 0.69; p<0.001; randomization test). Similarly, the area under the curve (AUC) of identification performance for the perception (blue curve) and imagery (orange curve) models was well above chance level (the diagonal black dashed line indicates no predictive power; p<0.001; randomization test).
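The identification-rank metric described above can be sketched directly: correlate a reconstructed note segment with every candidate original segment and report the fraction of incorrect candidates it outranks (0.5 = chance). This is an illustrative sketch on synthetic note spectrograms, not the authors' exact implementation:

```python
import numpy as np

def identification_rank(recon, candidates, target_idx):
    """Fraction of incorrect candidates outranked by the correct one,
    scored by correlation between reconstruction and each candidate."""
    flat = recon.ravel()
    corrs = np.array([np.corrcoef(flat, c.ravel())[0, 1]
                      for c in candidates])
    others = np.delete(corrs, target_idx)
    return float(np.mean(corrs[target_idx] > others))

# Toy demo: a mildly noisy copy of note 0 should rank note 0 at the top
# of a candidate set of 20 random note spectrograms.
rng = np.random.default_rng(3)
notes = [rng.random((10, 16)) for _ in range(20)]
recon = notes[0] + 0.1 * rng.standard_normal((10, 16))
rank = identification_rank(recon, notes, 0)
print(rank)  # 1.0: the correct note beats all 19 other candidates
```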
Cross-condition analysis
Another way to evaluate the degree of overlap between the perception and imagery conditions
is to apply the decoding model built on the perception condition to imagery neural
data, and vice versa. This approach is based on the hypothesis that both tasks share neural
mechanisms, and is useful when one of the models cannot be built directly because of the lack
of observable measures. The technique has been successfully applied in various fields, such
as vision [51–53] and speech [54]. Decoding performance was 50% higher when the model
was trained and tested within the imagery condition (r = 0.28) than when the perception
model was applied to imagined data (r = 0.19) (Fig 6). This
highlights the importance of having a model that is specific to each condition.

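Cross-condition decoding amounts to fitting a linear decoder on one condition's data and evaluating its predictions on the other's. A minimal sketch (the ridge decoder, function name and α are assumptions, not the authors' exact model):

```python
import numpy as np
from sklearn.linear_model import Ridge

def cross_condition_r(X_train, y_train, X_other, y_other):
    """Train a linear decoder on one condition (e.g. perception) and
    report the Pearson correlation of its predictions on the other
    condition's held-out data (e.g. imagery)."""
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    pred = model.predict(X_other)
    return np.corrcoef(pred, y_other)[0, 1]
```

If the two conditions share the underlying mapping, the cross-condition correlation stays high; a drop relative to within-condition decoding quantifies how condition-specific the model is.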

Fig 6. Cross-condition analysis.
Reconstruction accuracy when the decoding model was built on the perception condition and
applied to the imagery neural data, and vice versa. Decoding performance was 50% higher
when the model was trained and tested within the imagery condition (r = 0.28) than when the
perception model was applied to imagined data (r = 0.19).

Control analysis for sensorimotor confounds
In this study, the participant played the piano in two different conditions (music perception and
imagery). In addition to the auditory percept, arm, hand and finger movements related to the
active piano task could have introduced confounds into the decoding process. We
controlled for possible motor confounds in three ways. First, the electrode grid did
not cover hand or finger sensorimotor brain areas (Fig 7A). This reduces the likelihood that the
decoding model relied on hand-related motor activity. Second, example STRFs for two
neighboring electrodes show different weight patterns across conditions (Fig 7B). For
instance, the weights of the electrode depicted in the left example of Fig 7B are correlated
between conditions (r = 0.48), whereas the weights in the right example are not
(r = 0.04). Such differences across conditions cannot be explained by motor confounds, because
finger movement was similar in both tasks. Third, brain areas that significantly encoded
music perception and imagery overlapped with auditory sensory areas (Fig S1 and Fig S3), as
revealed by the encoding accuracy and STRFs during passive listening to TIMIT sentences
(no movements). These findings provide evidence against motor confounds, and suggest that
the brain responses were induced by auditory percepts rather than by the motor movements
associated with pressing piano keys. Finally, we built two additional decoding models, using
1) only temporal lobe electrodes and 2) only auditory-responsive electrodes (Fig 7C; see
Materials and Methods for details). Both models showed significant reconstruction accuracy
(p < 0.001; randomization test) and median identification rank (p < 0.001; randomization test).
This suggests that even after removing all electrodes not clearly related to auditory
processing, the decoding model still performs well.


Fig 7. Control analysis for motor confounds.
(A) Electrode locations overlaid on a cortical surface reconstruction of the participant's
cerebrum. The view from the top of the brain shows that hand motor areas were not covered by the
grid. Brain areas associated with the face, upper limbs and lower limbs are depicted in green, blue and
pink, respectively. (B) Different STRFs for the two neighboring electrodes highlighted in (A), for
both the perception and imagery encoding models. (C) Overall reconstruction accuracy (left
panel) and median identification rank (right panel) when using all electrodes, only temporal
lobe electrodes, and only auditory-responsive electrodes (see Materials and Methods for details).

Discussion
Music imagery studies are daunting because of their subjective nature and the absence of
verifiable, observable measures. Our task design allowed precise time-locking between the recorded
neural activity and the spectrotemporal features of the imagined music, and provided a unique
opportunity to quantitatively study neural encoding during auditory imagery and to compare
tuning properties with those of auditory perception. We provide the first evidence of spectrotemporal
receptive fields and auditory feature encoding in the brain during music imagery, together
with comparative measures of actual music perception encoding. We observed that
neuronal ensembles were tuned to acoustic frequencies during imagined music, suggesting
that spectral organization occurs in the absence of an actual perceived sound. Supporting
evidence has shown that restored speech – when a speech instance is replaced by noise, but
the listener perceives a specific speech sound – is grounded in acoustic representations in the
superior temporal gyrus [55,56]. In addition, the results showed substantial, but not complete,
overlap in neural properties – i.e., spectral and temporal tuning properties, and brain areas –
during music perception and imagery. These findings are in agreement with visual studies
showing that visual imagery and perception involve many, but not all, overlapping brain areas
[57,58]. We also showed that auditory features could be reconstructed from the neural activity of
imagined music. Because both the perception and imagery models are based on the same
auditory stimulus representation, the correlated prediction accuracy provides strong evidence
for a shared neural representation of sound based on spectrotemporal features. This confirms
that the brain encodes spectrotemporal properties of sounds, as previously shown by
behavioral and brain lesion studies (see [2] for a review).

Methodological issues in investigating imagery are numerous, including the lack of evidence
that the desired mental task was actually performed. The current task design did not allow verifying
how the mental task was performed, yet the behavioral index of key presses on the piano
indicated the precise timing and frequency content of the intended imagined sound. In addition,
we recorded a skilled piano player, and it has been suggested that participants with musical
training exhibit better pitch and temporal acuity during auditory imagery than
participants with little or no musical training [47,59]. Furthermore, tonotopic maps located in
the STG are enlarged in trained musicians [60]. Thus, recording a trained piano player
suggests improved auditory imagery ability (see also [7,9,14]) and reduced issues related to
spectral and temporal errors.

Finally, the electrode grid was located over the left hemisphere of the participant. This raises
the question of lateralization in the brain response to music perception and imagery. Studies
have shown the importance of both hemispheres for auditory perception and imagery
[4,9,10,12–14], although brain patterns tend to shift toward the right hemisphere (see [15] for
a review). In our task, the grid was located over the left hemisphere and allowed significant
encoding and decoding accuracy in the high gamma frequency range. This is consistent with
the notion that auditory processing of music is also evident in the left hemisphere.

Materials and Methods
Participant and data acquisition
Electrocorticographic (ECoG) recordings were obtained using subdural electrode arrays
implanted in one patient undergoing neurosurgical treatment for refractory epilepsy. The
participant gave his written informed consent prior to surgery and experimental testing. The
experimental protocol was approved by the University of California, San Francisco and
Berkeley Institutional Review Boards and Committees on Human Research. Inter-electrode
spacing (center-to-center) was 4 mm. Grid placement (Fig 2A) was defined solely on the basis
of clinical requirements. Localization and co-registration of the electrodes was performed using the
structural MRI. Multi-electrode ECoG data were amplified and digitally recorded at a
sampling rate of 3,052 Hz. ECoG signals were re-referenced to a common average after
removal of electrodes with epileptic artifacts or excessive noise (including broadband
electromagnetic noise from hospital equipment or poor contact with the cortical surface). In
addition to the ECoG signals, the audio output of the piano was recorded along with the
multi-electrode ECoG data.

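Common average re-referencing, as described above, subtracts the mean of the clean channels from each channel. A minimal sketch (function name and array layout are assumptions):

```python
import numpy as np

def common_average_reference(ecog, bad_channels):
    """Re-reference ECoG data (channels x time) to the common average,
    excluding rejected channels from both the average and the output."""
    good = np.setdiff1d(np.arange(ecog.shape[0]), bad_channels)
    car = ecog[good].mean(axis=0)   # common average over clean channels
    return ecog[good] - car         # subtract it from every clean channel
```

After this step the mean across retained channels is zero at every time point, which suppresses signal components shared by all electrodes (e.g. broadband line noise).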
Experimental paradigm
The recording session included two conditions. In the first condition, the participant played
an electronic piano with the sound turned on. That is, the music was played out loud through
the speakers of the digital keyboard in the hospital room (volume at a comfortable and natural
sound level; Fig 1A; perception condition). In the second condition, the participant played
the piano with the speakers turned off and instead imagined hearing the corresponding music
in his mind (Fig 1B; imagery condition). Fig 1B illustrates that in both conditions, the
digitized sound output of the MIDI sound module was recorded in synchrony with the ECoG
data (even when the speakers were turned off and the participant did not hear the music). The
two music pieces were Chopin's Prelude in C minor (Op. 28, No. 20) and Bach's Prelude in C
major (BWV 846), respectively.

Feature extraction
We extracted the ECoG signal in the high gamma frequency band using eight bandpass filters
(Hamming-window non-causal filters of order 20, with logarithmically increasing center frequencies
(70–150 Hz) and semi-logarithmically increasing bandwidths), and extracted the envelope
using the Hilbert transform. Prior to model fitting, the power was averaged across these eight
bands, down-sampled to 100 Hz and z-scored.

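The per-band filtering and envelope extraction can be sketched with SciPy. This is an illustrative approximation only: the exact band edges and bandwidth rule are assumptions, not the authors' pipeline (firwin defaults to a Hamming window, and filtfilt gives the non-causal, zero-phase filtering described in the text).

```python
import numpy as np
from scipy.signal import firwin, filtfilt, hilbert, resample

def high_gamma_envelope(ecog, fs, fs_out=100, n_bands=8):
    """Average high-gamma (70-150 Hz) analytic amplitude of one channel,
    down-sampled and z-scored. Band definitions are illustrative."""
    # Logarithmically spaced center frequencies across 70-150 Hz
    centers = np.logspace(np.log10(70), np.log10(150), n_bands)
    bandwidths = centers / 4.0             # assumed semi-log bandwidths
    envs = []
    for fc, bw in zip(centers, bandwidths):
        # order-20 FIR band-pass (Hamming window), applied forward-backward
        b = firwin(21, [fc - bw / 2, fc + bw / 2], pass_zero=False, fs=fs)
        band = filtfilt(b, [1.0], ecog)
        envs.append(np.abs(hilbert(band)))  # Hilbert envelope
    env = np.mean(envs, axis=0)             # average across the 8 bands
    env = resample(env, int(len(env) * fs_out / fs))  # down-sample
    return (env - env.mean()) / env.std()             # z-score
```

In the actual recordings `fs` would be 3,052 Hz; the same steps would be applied per electrode.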
Auditory spectrogram representation
The auditory spectrogram representation was a time-varying representation of the amplitude
envelope at acoustic frequencies logarithmically spaced between 200 and 7,000 Hz. This
representation was calculated by affine wavelet transforms of the sound waveform, using
auditory filter banks that mimic neural processing in the human auditory periphery [31]. To
compute these acoustic representations, we used the NSL MATLAB toolbox
(http://www.isr.umd.edu/Labs/NSL/Software.htm).

Encoding model
The neural encoding model, based on the spectrotemporal receptive field (STRF) [34],
describes the linear mapping between the music stimulus and the high gamma neural response
at individual electrodes. The encoding model was estimated as follows:

r̂(t, n) = Σ_τ Σ_f h(τ, f, n) · S(t − τ, f)

where r̂(t, n) is the predicted high gamma neural activity at time t and electrode n, and S(t − τ, f) is
the spectrogram representation at time (t − τ) and acoustic frequency f. Finally, h(τ, f, n) is the
linear transformation matrix that depends on the time lag τ, the frequency f and the electrode n;
h represents the spectrotemporal receptive field of each electrode. Neural tuning properties
for a variety of stimulus parameters in different sensory systems have been assessed using
STRFs [61]. We used ridge regression to fit the encoding model [62], with a 10-fold cross-
validation resampling procedure and no overlap between training and test partitions within
each resample. We performed a grid search on the training set to define the penalty coefficient
λ and the learning rate η, using a nested cross-validation loop. We standardized the
parameter estimates to yield the final model.
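In practice, the STRF fit amounts to ridge regression on a time-lagged copy of the spectrogram. A minimal sketch with scikit-learn (the lag count, lag-matrix construction and α below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lag_matrix(S, n_lags):
    """Stack time-lagged copies of the spectrogram S (time x freq)
    into a design matrix X of shape (time, n_lags * freq)."""
    T, F = S.shape
    X = np.zeros((T, n_lags * F))
    for lag in range(n_lags):
        X[lag:, lag * F:(lag + 1) * F] = S[:T - lag]
    return X

def fit_strf(S, r, n_lags=20, alpha=1.0):
    """Ridge STRF: predict the high-gamma response r (time,) of one
    electrode from lagged spectrogram features. Returns the weights
    reshaped to (n_lags, n_freq), i.e. the STRF h(tau, f)."""
    X = lag_matrix(S, n_lags)
    model = Ridge(alpha=alpha).fit(X, r)
    return model.coef_.reshape(n_lags, S.shape[1])
```

The penalty α would be chosen by nested cross-validated grid search on the training folds, as described in the text.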

Decoding model
The decoding model linearly mapped the neural activity to the music representation as a
weighted sum of activity at each electrode, as follows:

ŝ(t, f) = Σ_τ Σ_n g(τ, f, n) · R(t − τ, n)

where ŝ(t, f) is the predicted music representation at time t and frequency f, and R(t − τ, n) is the
high gamma (HG) neural response of electrode n at time (t − τ), with the time lag τ ranging between
-100 ms and 400 ms. Finally, g(τ, f, n) is the linear transformation matrix that depends on the time lag τ,
the frequency f, and the electrode n. Both the neural response and the music representation were
synchronized, down-sampled to 100 Hz, and standardized to zero mean and unit standard
deviation prior to model fitting. To fit the model parameters, we used gradient descent with early
stopping regularization. We used a 10-fold cross-validation resampling scheme, and 20% of
the training data were used as a validation set to determine the early stopping criterion. Finally,
model prediction accuracy was evaluated on the independent test set, and the parameter
estimates were standardized to yield the final model.
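The gradient-descent-with-early-stopping scheme can be sketched with scikit-learn's SGDRegressor, whose `early_stopping` option holds out a validation fraction exactly as described in the text. This is a hypothetical sketch of the fitting strategy, not the authors' implementation; a single acoustic-frequency band is shown, and `R_lagged` stands for the matrix of time-lagged neural responses across electrodes.

```python
from sklearn.linear_model import SGDRegressor

def fit_decoder(R_lagged, s_band):
    """Fit one acoustic-frequency band of the decoder with stochastic
    gradient descent, stopping when performance on a 20% validation
    split stops improving (early stopping regularization)."""
    model = SGDRegressor(
        early_stopping=True,       # stop on validation-score plateau
        validation_fraction=0.2,   # 20% of training data held out
        n_iter_no_change=5,
        max_iter=1000,
        random_state=0,
    )
    model.fit(R_lagged, s_band)
    return model
```

The full decoder would repeat this fit for every acoustic frequency bin of the spectrogram, within each cross-validation fold.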

Evaluation
Prediction accuracy was quantified using the correlation coefficient (Pearson's r) between the
predicted and actual HG signal on the independent test data. Overall encoding
accuracy was reported as the mean correlation over folds. The z-test was applied to all
reported mean r values. Electrodes were defined as significant if the p-value was smaller than
the significance threshold of α = 0.05 (95th percentile; FDR correction). To define auditory
sensory areas, we built an encoding model on data recorded while the participant listened
passively to speech sentences from the TIMIT corpus [63] for 10 min. Electrodes with
significant encoding accuracy are highlighted in Fig S1.
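The per-electrode significance test combines a Pearson correlation with a false discovery rate correction. A minimal sketch using a hand-rolled Benjamini-Hochberg step (function names are assumptions; the authors' exact statistical pipeline may differ):

```python
import numpy as np
from scipy.stats import pearsonr

def electrode_correlations(pred, actual):
    """Pearson r and p-value per electrode; pred/actual are (time, electrodes)."""
    results = [pearsonr(pred[:, e], actual[:, e])
               for e in range(pred.shape[1])]
    r = np.array([res[0] for res in results])
    p = np.array([res[1] for res in results])
    return r, p

def fdr_mask(p, alpha=0.05):
    """Benjamini-Hochberg FDR: boolean mask of significant tests."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m    # BH step-up thresholds
    below = p[order] <= thresh
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])          # largest rank passing BH
        mask[order[:k + 1]] = True
    return mask
```

Electrodes where `fdr_mask(p)` is True at α = 0.05 would be the ones highlighted on the cortical surface maps.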

To further investigate the neural encoding of spectrotemporal acoustic features during music
perception and music imagery, we analyzed all electrodes that were significant in at least
one condition (unless otherwise stated). Tuning curves were estimated from the spectrotemporal
receptive fields by first setting all inhibitory weights to zero, then averaging across the time
dimension and converting to standardized z-scores. Tuning peaks were identified as
significant peaks in the acoustic frequency tuning curves (z > 3.1; p < 0.001)
separated by more than one third of an octave.
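The tuning-curve and peak-selection steps above can be sketched as follows. The greedy peak selection is one plausible reading of the one-third-octave separation rule; function names and the selection strategy are assumptions.

```python
import numpy as np

def tuning_curve(strf):
    """Frequency tuning curve from an STRF (lags x freq): zero the
    inhibitory (negative) weights, average over the time dimension,
    and convert to z-scores."""
    exc = np.clip(strf, 0, None)       # keep excitatory weights only
    curve = exc.mean(axis=0)           # average across time lags
    return (curve - curve.mean()) / curve.std()

def tuning_peaks(curve, freqs, z_thresh=3.1, min_sep_octaves=1/3):
    """Indices of significant tuning peaks (z > 3.1), greedily keeping
    peaks separated by more than a third of an octave."""
    order = np.argsort(curve)[::-1]    # strongest bins first
    peaks = []
    for i in order:
        if curve[i] <= z_thresh:
            break                      # remaining bins are sub-threshold
        if all(abs(np.log2(freqs[i] / freqs[j])) > min_sep_octaves
               for j in peaks):
            peaks.append(int(i))
    return sorted(peaks)
```

With `freqs` log-spaced between 200 and 7,000 Hz (as in the auditory spectrogram), the octave distance between bins i and j is simply |log2(freqs[i]/freqs[j])|.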

Decoding accuracy was assessed by calculating the correlation coefficient (Pearson's r)
between the reconstructed and the original music spectrogram representation using test set
data. Overall reconstruction accuracy was computed by averaging over acoustic frequencies
and resamples, and the standard error of the mean (SEM) was computed by taking the standard
deviation across resamples. Finally, to further assess reconstruction accuracy, we
evaluated the ability to identify isolated piano notes from the test set auditory spectrogram
reconstructions, using a similar approach to that in [33,54].

Acknowledgments
This work was supported by the Zeno-Karl Schindler Foundation, NINDS Grant R3721135,
NIH 5K99DC012804 and the Nielsen Corporation.

References
1. Meister IG, Krings T, Foltys H, Boroojerdi B, Müller M, Töpper R, et al. Playing piano in the mind--an fMRI study on music imagery and performance in pianists. Brain Res Cogn Brain Res. 2004;19: 219–228. doi:10.1016/j.cogbrainres.2003.12.005
2. Hubbard TL. Auditory imagery: Empirical findings. Psychol Bull. 2010;136: 302–329. doi:10.1037/a0018436
3. Halpern AR. Memory for the absolute pitch of familiar songs. Mem Cognit. 1989;17: 572–581.
4. Halpern AR, Zatorre RJ, Bouffard M, Johnson JA. Behavioral and neural correlates of perceived and imagined musical timbre. Neuropsychologia. 2004;42: 1281–1292. doi:10.1016/j.neuropsychologia.2003.12.017
5. Pitt MA, Crowder RG. The role of spectral and dynamic cues in imagery for musical timbre. J Exp Psychol Hum Percept Perform. 1992;18: 728–738.
6. Intons-Peterson M. Components of auditory imagery. In: Reisberg D, editor. Auditory imagery. Hillsdale, NJ: L. Erlbaum Associates; 1992. pp. 45–72.
7. Halpern AR. Mental scanning in auditory imagery for songs. J Exp Psychol Learn Mem Cogn. 1988;14: 434–443.
8. Kosslyn SM, Ganis G, Thompson WL. Neural foundations of imagery. Nat Rev Neurosci. 2001;2: 635–642. doi:10.1038/35090055
9. Zatorre RJ, Halpern AR. Effect of unilateral temporal-lobe excision on perception and imagery of songs. Neuropsychologia. 1993;31: 221–232.
10. Griffiths TD. Human complex sound analysis. Clin Sci (Lond). 1999;96: 231–234.
11. Halpern AR. When That Tune Runs Through Your Head: A PET Investigation of Auditory Imagery for Familiar Melodies. Cereb Cortex. 1999;9: 697–704. doi:10.1093/cercor/9.7.697
12. Kraemer DJM, Macrae CN, Green AE, Kelley WM. Musical imagery: Sound of silence activates auditory cortex. Nature. 2005;434: 158. doi:10.1038/434158a
13. Rauschecker JP. Cortical plasticity and music. Ann N Y Acad Sci. 2001;930: 330–336.
14. Zatorre RJ, Halpern AR, Perry DW, Meyer E, Evans AC. Hearing in the Mind's Ear: A PET Investigation of Musical Imagery and Perception. J Cogn Neurosci. 1996;8: 29–46. doi:10.1162/jocn.1996.8.1.29
15. Zatorre RJ, Halpern AR. Mental Concerts: Musical Imagery and Auditory Cortex. Neuron. 2005;47: 9–12. doi:10.1016/j.neuron.2005.06.013
16. Zatorre RJ, Halpern AR, Bouffard M. Mental Reversal of Imagined Melodies: A Role for the Posterior Parietal Cortex. J Cogn Neurosci. 2009;22: 775–789. doi:10.1162/jocn.2009.21239
17. Halpern AR, Zatorre RJ. When That Tune Runs Through Your Head: A PET Investigation of Auditory Imagery for Familiar Melodies. Cereb Cortex. 1999;9: 697–704. doi:10.1093/cercor/9.7.697
18. Hickok G, Buchsbaum B, Humphries C, Muftuler T. Auditory-motor interaction revealed by fMRI: speech, music, and working memory in area Spt. J Cogn Neurosci. 2003;15: 673–682. doi:10.1162/089892903322307393
19. Meyer M, Elmer S, Baumann S, Jancke L. Short-term plasticity in the auditory system: differential neural responses to perception and imagery of speech and music. Restor Neurol Neurosci. 2007;25: 411–431.
20. Halpern AR. Cerebral substrates of musical imagery. Ann N Y Acad Sci. 2001;930: 179–192.
21. Mikumo M. Motor Encoding Strategy for Pitches of Melodies. Music Percept. 1994;12: 175–197. doi:10.2307/40285650
22. Petsche H, von Stein A, Filz O. EEG aspects of mentally playing an instrument. Brain Res Cogn Brain Res. 1996;3: 115–123.
23. Brodsky W, Henik A, Rubinstein B-S, Zorman M. Auditory imagery from musical notation in expert musicians. Percept Psychophys. 2003;65: 602–612.
24. Schürmann M, Raij T, Fujiki N, Hari R. Mind's ear in a musician: where and when in the brain. NeuroImage. 2002;16: 434–440. doi:10.1006/nimg.2002.1098
25. Bunzeck N, Wuestenberg T, Lutz K, Heinze H-J, Jancke L. Scanning silence: mental imagery of complex sounds. NeuroImage. 2005;26: 1119–1127. doi:10.1016/j.neuroimage.2005.03.013
26. Yoo SS, Lee CU, Choi BG. Human brain mapping of auditory imagery: event-related functional MRI study. Neuroreport. 2001;12: 3045–3049.
27. Aertsen AMHJ, Olders JHJ, Johannesma PIM. Spectro-temporal receptive fields of auditory neurons in the grassfrog: III. Analysis of the stimulus-event relation for natural stimuli. Biol Cybern. 1981;39: 195–209. doi:10.1007/BF00342772
28. Eggermont JJ, Aertsen AMHJ, Johannesma PIM. Quantitative characterisation procedure for auditory neurons based on the spectro-temporal receptive field. Hear Res. 1983;10: 167–190. doi:10.1016/0378-5955(83)90052-7
29. Tian B. Processing of Frequency-Modulated Sounds in the Lateral Auditory Belt Cortex of the Rhesus Monkey. J Neurophysiol. 2004;92: 2993–3013. doi:10.1152/jn.00472.2003
30. Saenz M, Langers DRM. Tonotopic mapping of human auditory cortex. Hear Res. 2014;307: 42–52. doi:10.1016/j.heares.2013.07.016
31. Chi T, Ru P, Shamma SA. Multiresolution spectrotemporal analysis of complex sounds. J Acoust Soc Am. 2005;118: 887. doi:10.1121/1.1945807
32. Clopton BM, Backoff PM. Spectrotemporal receptive fields of neurons in cochlear nucleus of guinea pig. Hear Res. 1991;52: 329–344.
33. Pasley BN, David SV, Mesgarani N, Flinker A, Shamma SA, Crone NE, et al. Reconstructing Speech from Human Auditory Cortex. PLoS Biol. 2012;10: e1001251. doi:10.1371/journal.pbio.1001251
34. Theunissen FE, Sen K, Doupe AJ. Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J Neurosci. 2000;20: 2315–2331.
35. Lotte F, Brumberg JS, Brunner P, Gunduz A, Ritaccio AL, Guan C, et al. Electrocorticographic representations of segmental features in continuous speech. Front Hum Neurosci. 2015;9. doi:10.3389/fnhum.2015.00097
36. Mesgarani N, Cheung C, Johnson K, Chang EF. Phonetic Feature Encoding in Human Superior Temporal Gyrus. Science. 2014;343: 1006–1010. doi:10.1126/science.1245994
37. Tankus A, Fried I, Shoham S. Structured neuronal encoding and decoding of human speech features. Nat Commun. 2012;3: 1015. doi:10.1038/ncomms1995
38. Huth AG, de Heer WA, Griffiths TL, Theunissen FE, Gallant JL. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature. 2016;532: 453–458. doi:10.1038/nature17637
39. Lachaux J-P, Axmacher N, Mormann F, Halgren E, Crone NE. High-frequency neural activity and human cognition: past, present and possible future of intracranial EEG research. Prog Neurobiol. 2012;98: 279–301. doi:10.1016/j.pneurobio.2012.06.008
40. Miller KJ, Leuthardt EC, Schalk G, Rao RPN, Anderson NR, Moran DW, et al. Spectral changes in cortical surface potentials during motor movement. J Neurosci. 2007;27: 2424–2432. doi:10.1523/JNEUROSCI.3886-06.2007
41. Boonstra TW, Houweling S, Muskulus M. Does Asynchronous Neuronal Activity Average out on a Macroscopic Scale? J Neurosci. 2009;29: 8871–8874. doi:10.1523/JNEUROSCI.2020-09.2009
42. Towle VL, Yoon H-A, Castelle M, Edgar JC, Biassou NM, Frim DM, et al. ECoG gamma activity during a language task: differentiating expressive and receptive speech areas. Brain. 2008;131: 2013–2027. doi:10.1093/brain/awn147
43. Llorens A, Trébuchon A, Liégeois-Chauvel C, Alario F-X. Intra-Cranial Recordings of Brain Activity During Language Production. Front Psychol. 2011; doi:10.3389/fpsyg.2011.00375
44. Crone NE, Boatman D, Gordon B, Hao L. Induced electrocorticographic gamma activity during auditory perception. Brazier Award-winning article, 2001. Clin Neurophysiol. 2001;112: 565–582.
45. Sturm I, Blankertz B, Potes C, Schalk G, Curio G. ECoG high gamma activity reveals distinct cortical representations of lyrics passages, harmonic and timbre-related changes in a rock song. Front Hum Neurosci. 2014;8. doi:10.3389/fnhum.2014.00798
46. Aleman A, Nieuwenstein MR, Böcker KB, de Haan EH. Music training and mental imagery ability. Neuropsychologia. 2000;38: 1664–1668. doi:10.1016/S0028-3932(00)00079-8
47. Janata P, Paroo K. Acuity of auditory images in pitch and time. Percept Psychophys. 2006;68: 829–844.
48. Lotze M, Scheler G, Tan H-RM, Braun C, Birbaumer N. The musician's brain: functional imaging of amateurs and professionals during performance and imagery. NeuroImage. 2003;20: 1817–1829.
49. Ellis D. Dynamic time warping (DTW) in Matlab [Internet]. 2003. Available: http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/
50. Lartillot O, Toiviainen P, Eerola T. A Matlab Toolbox for Music Information Retrieval. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, editors. Data Analysis, Machine Learning and Applications. Berlin, Heidelberg: Springer; 2008. pp. 261–268. Available: http://link.springer.com/10.1007/978-3-540-78246-9_31
51. Haynes J-D, Rees G. Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nat Neurosci. 2005;8: 686–691. doi:10.1038/nn1445
52. Horikawa T, Tamaki M, Miyawaki Y, Kamitani Y. Neural Decoding of Visual Imagery During Sleep. Science. 2013;340: 639–642. doi:10.1126/science.1234330
53. Reddy L, Tsuchiya N, Serre T. Reading the mind's eye: Decoding category information during mental imagery. NeuroImage. 2010;50: 818–825. doi:10.1016/j.neuroimage.2009.11.084
54. Martin S, Brunner P, Holdgraf C, Heinze H-J, Crone NE, Rieger J, et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front Neuroeng. 2014; doi:10.3389/fneng.2014.00014
55. Leonard MK, Baud MO, Sjerps MJ, Chang EF. Perceptual restoration of masked speech in human cortex. Nat Commun. 2016; doi:10.1038/ncomms13619
56. Holdgraf CR, de Heer W, Pasley B, Rieger J, Crone N, Lin JJ, et al. Rapid tuning shifts in human auditory cortex enhance speech intelligibility. Nat Commun. 2016;7: 13654. doi:10.1038/ncomms13654
57. Kosslyn SM, Thompson WL. Shared mechanisms in visual imagery and visual perception: Insights from cognitive neuroscience. In: Gazzaniga MS, editor. The New Cognitive Neurosciences. 2nd ed. Cambridge, MA: MIT Press; 2000.
58. Kosslyn SM, Thompson WL. When is early visual cortex activated during visual mental imagery? Psychol Bull. 2003;129: 723–746. doi:10.1037/0033-2909.129.5.723
59. Herholz SC, Lappe C, Knief A, Pantev C. Neural basis of music imagery and the effect of musical expertise. Eur J Neurosci. 2008;28: 2352–2360. doi:10.1111/j.1460-9568.2008.06515.x
60. Pantev C, Oostenveld R, Engelien A, Ross B, Roberts LE, Hoke M. Increased auditory cortical representation in musicians. Nature. 1998;392: 811–814. doi:10.1038/33918
61. Wu MC-K, David SV, Gallant JL. Complete functional characterization of sensory neurons by system identification. Annu Rev Neurosci. 2006;29: 477–505. doi:10.1146/annurev.neuro.29.051605.113024
62. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
63. Garofolo JS. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium; 1993.

Supporting Information

Figure S1. Anatomical distribution of significant electrodes. Electrodes with significant encoding accuracy overlaid on a cortical surface reconstruction of the participant's cerebrum. To define auditory sensory areas (pink), we built an encoding model on data recorded while the participant listened passively to speech sentences from the TIMIT corpus [63]. Electrodes with significant encoding accuracy (p < 0.05; FDR correction) are highlighted.
Figure S2. Spectrotemporal receptive fields. STRFs for the perception (top panel) and imagery (bottom panel) models. Grey electrodes were removed from the analysis due to excessive noise.
Figure S3. Neural encoding during passive listening. Standard STRFs for the passive listening model, built on data recorded while the participant listened passively to speech sentences from the TIMIT corpus [63]. Grey electrodes were removed from the analysis due to excessive noise. Encoding accuracy is plotted on the cortical surface reconstruction of the participant's cerebrum (map thresholded at p < 0.05; FDR correction).