ARTICLES https://doi.org/10.1038/s41562-017-0261-8

Diversity in pitch perception revealed by task dependence

Malinda J. McPherson1,2* and Josh H. McDermott1,2

Pitch conveys critical information in speech, music and other natural sounds, and is conventionally defined as the perceptual correlate of a sound’s fundamental frequency (F0). Although pitch is widely assumed to be subserved by a single F0 estimation process, real-world pitch tasks vary enormously, raising the possibility of underlying mechanistic diversity. To probe pitch mechanisms, we conducted a battery of pitch-related music and speech tasks using conventional harmonic sounds and inharmonic sounds whose frequencies lack a common F0. Some pitch-related abilities—those relying on musical interval or voice recognition—were strongly impaired by inharmonicity, suggesting a reliance on F0. However, other tasks, including those dependent on pitch contours in speech and music, were unaffected by inharmonicity, suggesting a mechanism that tracks the frequency spectrum rather than the F0. The results suggest that pitch perception is mediated by several different mechanisms, only some of which conform to traditional notions of pitch.

Pitch is one of the most common terms used to describe sound. Although in lay terms pitch denotes any respect in which sounds vary from high to low, in scientific parlance pitch is the perceptual correlate of the rate of repetition of a periodic sound (Fig. 1a). This repetition rate is known as the sound’s fundamental frequency, or F0, and conveys information about the meaning and identity of sound sources. In music, F0 is varied to produce melodies and harmonies. In speech, F0 variation conveys emphasis and intent, as well as lexical content in tonal languages. Other everyday sounds (birdsong, sirens, ringtones and so on) are also identified in part by their F0. Pitch is thus believed to be a key intermediate perceptual feature and has been a topic of intense interest throughout history1–3.

The goal of pitch research has historically been to characterize the mechanism for estimating F0 from sound4,5. Periodic sounds contain frequencies that are harmonically related, being multiples of the F0 (Fig. 1a). The role of peripheral frequency cues, such as the place and timing of excitation in the cochlea, has thus been a focal point of pitch research6–10. The mechanisms for deriving F0 from these peripheral representations are also the subject of a rich research tradition6–14. Neurophysiological studies in non-human animals have revealed F0-tuned neurons in the auditory cortex of one species (the marmoset)15,16, although as of yet there are no comparable findings in other species17,18. Functional imaging studies in humans suggest pitch-responsive regions in non-primary auditory cortex19–21. The role of these regions in pitch perception is an active area of research22,23. Despite considerable efforts to characterize the mechanisms for F0 estimation, there has been relatively little consideration of whether behaviours involving pitch might necessitate other sorts of computations24–27.

One reason to question the underlying basis of pitch perception is that our percepts of pitch support a wide variety of tasks. In some cases it seems likely that the F0 of a sound must be encoded, as when recognizing sounds with a characteristic F0, such as a person’s voice28, but in many situations we instead judge the way that F0 changes over time—often referred to as relative pitch—as when recognizing a melody or speech intonation pattern29. Relative pitch could involve first estimating the F0 of different parts of a sound and then registering how the F0 changes over time. However, pitch changes could also be registered by measuring a shift in the constituent frequencies of a sound, without first extracting F024–27. It thus seemed plausible that pitch perception in different stimulus and task contexts might involve different computations.

We probed pitch computations using inharmonic stimuli, randomly jittering each frequency component of a harmonic sound to make the stimulus aperiodic and inconsistent with any single F0 (Fig. 1b)30. Rendering sounds inharmonic should disrupt F0-specific mechanisms and impair performance on pitch-related tasks that depend on such mechanisms. A handful of previous studies have manipulated harmonicity for this purpose and found modest effects on pitch discrimination that varied somewhat across listeners and studies25–27. As we revisited this line of inquiry, it became clear that the effects of inharmonicity differed substantially across pitch tasks, suggesting that pitch perception might partition into multiple mechanisms. The potential diversity of pitch mechanisms seemed important both for the basic understanding of the architecture of the auditory system and for understanding the origins of pitch deficits in listeners with hearing impairment or cochlear implants.

We thus examined the effect of inharmonicity on essentially every pitch-related task we could conceive and implement. These ranged from classic psychoacoustic assessments with pairs of notes to ethologically relevant melody and voice recognition tasks. Our results show that some pitch-related abilities—those relying on musical interval or voice perception—are strongly impaired by inharmonicity, suggesting a reliance on F0 estimation. However, tasks relying on the direction of pitch change, including those using pitch contours in speech and music, were unaffected by inharmonicity. Such inharmonic sounds individually lack a well-defined pitch in the normal sense, but when played sequentially nonetheless elicit the sensation of pitch change. The results suggest that what has traditionally been couched as ‘pitch perception’ is subserved by several distinct mechanisms, only some of which conform to the traditional F0-related notion of pitch.

1Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA. 2Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA, USA. *e-mail: [email protected]

NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

Fig. 1 | Example harmonic and inharmonic tones. a, Waveform, power spectrum and autocorrelation for a harmonic complex tone with an F0 of 250 Hz. The waveform is periodic (repeating in time), with a period corresponding to one cycle of the F0. The power spectrum contains discrete frequency components (harmonics) that are integer multiples of the F0. The harmonic tone yields an autocorrelation of 1 at a time lag corresponding to its period (1/F0). b, Waveform, power spectrum and autocorrelation for an inharmonic tone generated by randomly ‘jittering’ the harmonics of the tone in a. The waveform is aperiodic and the constituent frequency components are not integer multiples of a common F0 (evident in the irregular spacing in the frequency domain). Such inharmonic tones are thus inconsistent with any single F0. The inharmonic tone exhibits no clear peak in its autocorrelation, indicative of its aperiodicity.
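The tone construction and autocorrelation diagnostic described in the caption can be illustrated with a short sketch. The sample rate, tone duration, number of partials and equal partial amplitudes below are illustrative assumptions, not the paper's synthesis parameters:

```python
import numpy as np

SR = 16000   # sample rate in Hz (an assumption, not from the paper)
F0 = 250.0   # fundamental of the harmonic tone, as in Fig. 1a
N_HARM = 10  # number of partials (an assumption)
DUR = 0.2    # tone duration in seconds (an assumption)

def make_tone(freqs, sr=SR, dur=DUR):
    """Sum of equal-amplitude sinusoids at the given frequencies."""
    t = np.arange(int(sr * dur)) / sr
    return sum(np.sin(2 * np.pi * f * t) for f in freqs)

def autocorr_peak(x, sr=SR, min_lag_s=0.5e-3, max_lag_s=8e-3):
    """Height of the largest normalized autocorrelation value at lags
    between min_lag_s and max_lag_s (cf. the 0-8 ms axis of Fig. 1)."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    r /= r[0]
    lo, hi = int(min_lag_s * sr), int(max_lag_s * sr)
    return r[lo:hi].max()

rng = np.random.default_rng(0)
harmonics = F0 * np.arange(1, N_HARM + 1)
# jitter each partial by up to +/-50% of the F0, as described in the text
jitter = rng.uniform(-0.5, 0.5, N_HARM) * F0
inharmonic = harmonics + jitter

peak_h = autocorr_peak(make_tone(harmonics))
peak_i = autocorr_peak(make_tone(inharmonic))
print(peak_h, peak_i)  # harmonic peak near 1; inharmonic peak much lower
```

For a periodic tone the largest peak falls at the period (4 ms for 250 Hz); jittering removes any lag at which all partials realign, so no comparable peak survives.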

Results
Experiment 1: Pitch discrimination with pairs of synthetic tones. We began by measuring pitch discrimination using a two-tone discrimination task standardly used to assess pitch perception31–33. Participants heard two tones and were asked whether the second tone was higher or lower than the first (Fig. 2a). We compared performance for three conditions: a condition where the tones were harmonic, and two inharmonic conditions (Fig. 2b). Here and elsewhere, stimuli were made inharmonic by adding a random amount of ‘jitter’ to the frequency of each partial of a harmonic tone (up to 50% of the original F0 in either direction) (Fig. 1b). This manipulation was designed to severely disrupt the ability to recover the F0 of the stimuli. One measure of the integrity of the F0 is available in the autocorrelation peak height, which was greatly reduced in the inharmonic stimuli (Fig. 1 and Supplementary Fig. 1).

For the ‘Inharmonic’ condition (here and throughout all experiments), the same pattern of jitters was used within a given trial. In experiment 1, this meant that the same pattern of jitters was applied to the harmonics in both tones of the trial. This condition was intended to preserve the ability to detect F0 changes via shifts in the spectrum. For the ‘Inharmonic-changing’ condition, a different random jitter pattern was applied to the harmonics of each tone in the experiment. For example, for the first tone, the second harmonic could be shifted up by 30%, and in the second tone, the second harmonic could be shifted down by 10%. This lack of correspondence in the pattern of harmonics between the tones should impair the detection of shifts in the spectrum (Fig. 2b) if the jitter is sufficiently large.

We hypothesized that if task performance was mediated by F0-based pitch, performance should be substantially worse for both Inharmonic conditions. If performance was instead mediated by detecting shifts in the spectrum without estimating F0, performance should be impaired for the Inharmonic-changing condition but similar for the Harmonic and Inharmonic conditions. Finally, if the jitter manipulation was insufficient to disrupt F0 estimation, performance should be similar for all three conditions.

To isolate the effects of harmonic structure, a fixed bandpass filter was applied to each tone (Fig. 2c). This filter was intended to approximately equate the spectral centroids (centres of mass) of the tones, which might otherwise be used to perform the task, and to prevent listeners from tracking the frequency component at the F0 (by filtering it out). This type of tone also mimics the acoustics of many musical instruments, in which a source that varies in F0 is passed through a fixed filter (for example, the resonant body of the instrument). Here, and in most other experiments, low-pass noise was added to the stimuli to mask distortion products34,35, which might otherwise confer an advantage to harmonic stimuli. Demonstrations of these and all other experimental stimuli from this Article are available as supplementary materials and at http://mcdermottlab.mit.edu/Diversity_In_Pitch_Perception.html.

Contrary to the idea that pitch discrimination depends on comparisons of F0, performance for Harmonic and Inharmonic tones was indistinguishable provided F0 differences were small (a semitone or less: Fig. 2d; F(1,29) = 1.44, P = 0.272). Thresholds were ~1% (0.1–0.25 of a semitone) in both conditions, similar to thresholds measured in previous studies using complex harmonic tones33. Performance for Harmonic and Inharmonic conditions differed slightly at two semitones (t(29) = 5.22, P < 0.001), and this difference is explored further in experiment 9. By contrast, the Inharmonic-changing condition produced much worse performance (F(1,29) = 92.198, P < 0.001).
This result suggests that the



Fig. 2 | Task, example stimuli and results for experiments 1 and 2: pitch discrimination with pairs of synthetic tones and pairs of instrument notes. a, Schematic of the trial structure for experiment 1. During each trial, participants heard two tones and judged whether the second tone was higher or lower than the first. b, Schematic of the three conditions for experiment 1. Harmonic trials consisted of two harmonic tones. Inharmonic trials contained two inharmonic tones, where each tone was made inharmonic by the same jitter pattern, such that the frequency ratios between components were preserved. This maintains a correspondence in the spectral pattern between the two tones, as for harmonic notes (indicated by red arrows). For Inharmonic-changing trials, a different jitter pattern was applied to the harmonics of each tone, eliminating the correspondence in the spectral pattern. c, Power spectra of two example tones from experiment 1 (with F0 values of 200 and 400 Hz, to convey the range of F0s used in the experiment). The fixed band-pass filter applied to each tone is evident in the envelope of the spectrum, as is the low-pass noise added to mask distortion products. The filter was intended to eliminate the spectral centroid or edge as a cue for pitch changes. d, Results from experiment 1. Error bars denote standard error of the mean. e, Example power spectra of harmonic and inharmonic violin notes from experiment 2. f, Results from experiment 2. Error bars denote standard error of the mean.
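The construction of the two tones of a trial under the three conditions can be summarized in a short sketch. The function names and the number of partials are illustrative, and the fixed bandpass filter and masking noise applied to the actual stimuli are omitted; the convention of scaling each jitter by the note's own F0 (which preserves frequency ratios across a shared pattern) is an assumption consistent with the caption:

```python
import numpy as np

rng = np.random.default_rng(1)

def jitter_pattern(n_harm, rng):
    """One random jitter per harmonic, up to +/-50% of the F0
    (expressed in units of F0)."""
    return rng.uniform(-0.5, 0.5, n_harm)

def note_freqs(f0, n_harm, jitter):
    """Partial frequencies of one note: harmonics of f0 shifted by
    jitter * f0 (an all-zero jitter vector gives a harmonic note)."""
    return f0 * np.arange(1, n_harm + 1) + jitter * f0

def trial_freqs(f0_a, f0_b, n_harm, condition, rng):
    """Partial frequencies for the two tones of one trial."""
    if condition == 'harmonic':
        ja = jb = np.zeros(n_harm)
    elif condition == 'inharmonic':           # same pattern for both tones
        ja = jb = jitter_pattern(n_harm, rng)
    elif condition == 'inharmonic-changing':  # independent patterns
        ja, jb = jitter_pattern(n_harm, rng), jitter_pattern(n_harm, rng)
    else:
        raise ValueError(condition)
    return note_freqs(f0_a, n_harm, ja), note_freqs(f0_b, n_harm, jb)

# a one-semitone step corresponds to a frequency ratio of 2**(1/12)
a, b = trial_freqs(200.0, 200.0 * 2 ** (1 / 12), 10, 'inharmonic', rng)
# with a shared jitter pattern, the ratio between corresponding partials
# equals the F0 ratio, so the whole spectral pattern shifts coherently
print(np.allclose(b / a, 2 ** (1 / 12)))  # True
```

In the Inharmonic-changing condition the two jitter patterns differ, so no such coherent shift relates the two spectra.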

similar performance for Harmonic and Inharmonic conditions was not due to residual evidence of the F0. To assess whether listeners might have determined the shift direction by tracking the lowest audible harmonic, we ran a control experiment in which the masking noise level was varied between the two tones within a trial, such that the lowest audible harmonic was never the same for both tones. Performance was unaffected by this manipulation (Supplementary Fig. 2), suggesting that listeners are relying on the spectral pattern rather than any single frequency component. The results collectively suggest that task performance does not rely on estimating F0 and that participants instead track shifts in the spectrum, irrespective of whether the spectrum is harmonic or inharmonic.

Experiment 2: Pitch discrimination with pairs of instrument notes. To assess the extent to which the effects in experiment 1 would replicate for real-world pitch differences, we repeated the experiment with actual instrument notes. We resynthesized recorded notes played on piano, clarinet, oboe, violin and trumpet, preserving the spectrotemporal envelope of each note but altering the underlying frequencies as in experiment 1 (Fig. 2e; see Methods). As shown in Fig. 2f, the results of these manipulations with actual instrument notes were similar to those for the synthetic tones of experiment 1. Performance was indistinguishable for the Harmonic and Inharmonic conditions (F(1,29) = 2.36, P = 0.136), but substantially worse in the Inharmonic-changing condition, where different jitter patterns were again used for the two notes (F(1,29) = 41.88, P < 0.001). The results substantiate the notion that pitch changes are in many cases detected by tracking spectral shifts without estimating the F0s of the constituent sounds.

Experiment 3: Melodic contour discrimination. To examine whether the effects observed in standard two-tone pitch discrimination tasks would extend to multinote melodies, we used a pitch contour discrimination task36. Participants heard two five-note melodies composed of semitone steps, with Harmonic, Inharmonic or Inharmonic-changing notes. The second melody was transposed up in pitch by half an octave and had either an identical pitch contour to the first melody or one that differed in the sign of one step (for example, a +1 semitone step was changed to a −1 semitone step). Participants judged whether the melodies were the same or different (Fig. 3a).

We again observed indistinguishable performance for Harmonic and Inharmonic trials (Fig. 3b; t(28) = 0.28, P = 0.78); performance was well above chance in both conditions. By contrast, performance for the Inharmonic-changing condition was at chance (t(28) = −0.21, P = 0.84, single sample t-test versus 0.5), suggesting that accurate contour estimation depends on the correspondence in the spectral pattern between notes. These results suggest that even for melodies of moderate length, pitch contour perception is not dependent on extracting F0 and instead can be accomplished by detecting shifts in the spectrum from note to note.

Experiment 4: Prosodic contour discrimination. To test whether the results would extend to pitch contours in speech, we measured the effect of inharmonicity on prosodic contour discrimination. We used speech analysis/synthesis tools (a variant of STRAIGHT37–39) to manipulate the pitch contour and harmonicity of recorded speech excerpts. Speech excitation was sinusoidally modelled and then recombined with an estimated spectrotemporal filter following perturbations of individual frequency components.

During each trial, participants heard three variants of the same one-second speech token (Fig. 4a,b). Either the first or last excerpt had a random frequency modulation (FM) added to its F0 contour, and participants were asked to identify the excerpt whose prosodic contour was different from that of the middle excerpt. The middle excerpt was ‘transposed’ by shifting the F0 contour up by two semitones to force listeners to rely on the prosodic contour rather than some absolute feature of pitch. Stimuli were high-pass filtered to ensure that listeners could not simply track the F0 component (which would otherwise be present in both Harmonic and Inharmonic conditions) and noise was added to mask potential distortion products. Because voiced speech excitation is continuous, it was impractical to change the jitter pattern over time, and we thus included only Harmonic and Inharmonic conditions, the latter of which used the same jitter pattern throughout each trial.

As the amplitude of the added FM increased, performance for Harmonic and Inharmonic conditions improved, as expected (Fig. 4c). However, performance was not different for harmonic and inharmonic stimuli (F(1,29) = 1.572, P = 0.22), suggesting that the perception of speech prosody also does not rely on extracting F0. Similar results were obtained with FM tones synthesized from speech contours (Supplementary Fig. 3).

Experiment 5: Mandarin tone perception. In languages such as Mandarin Chinese, pitch contours can carry lexical meaning in addition to signalling emphasis, emotion and other indexical properties. To probe the pitch mechanisms underlying lexical tone perception, we performed an open-set word recognition task using


Fig. 3 | Task and results for experiment 3: melodic contour discrimination. a, Schematic of the trial structure for experiment 3. Participants heard two melodies with note-to-note steps of +1 or −1 semitones and judged whether the two melodies were the same or different. The second melody was always transposed up in pitch relative to the first melody. b, Results from experiment 3. Performance was measured as the area under ROC curves. Error bars denote standard error of the mean.
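Several experiments summarize performance as the area under the ROC curve. For binary same/different responses, one simple estimator is the area under the single-point ROC defined by the hit and false-alarm rates; this is a generic sketch of that estimator, not the authors' analysis code:

```python
def roc_area(hits, misses, false_alarms, correct_rejections):
    """Area under the single-point ROC for a yes/no (same/different)
    task: 0.5 at chance, 1.0 for perfect discrimination. One common
    simple estimator; the paper's exact computation may differ."""
    h = hits / (hits + misses)                              # hit rate
    f = false_alarms / (false_alarms + correct_rejections)  # false-alarm rate
    # area under the polyline (0,0) -> (f,h) -> (1,1) in ROC space
    return (h + (1 - f)) / 2

print(roc_area(45, 5, 10, 40))  # 0.85
```

Unlike raw percent correct, this measure is insensitive to a bias towards responding 'same' or 'different'.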


Fig. 4 | Tasks and results for experiments 4 and 5: speech contour perception and Mandarin tone perception. a, Schematic of the trial structure for experiment 4. Participants heard three 1 s resynthesized speech utterances, the first or last of which had a random frequency modulation (1–2 Hz bandpass noise, with modulation depth varied across conditions) added to the F0 contour. Participants were asked whether the first or last speech excerpt differed from the second speech excerpt. The second excerpt was always shifted up in pitch to force listeners to make judgements about the prosodic contour rather than the absolute pitch of the stimuli. b, Example spectrograms of stimuli from harmonic and inharmonic trials in experiment 4. Note the even and jittered spacing of frequency components in the two trial types. In these examples, the final excerpt in the trial contains the added frequency modulation. c, Results from experiment 4. Error bars denote standard error of the mean. d, Schematic of the trial structure for experiment 5. Participants (fluent Mandarin speakers) heard a single resynthesized Mandarin word and were asked to type what they heard (in Pinyin, which assigns numbers to the five possible tones). Participants could, for example, hear the word Wùlĭ, containing tones 4 and 3, and the correct response would be Wu4li3. e, Results for experiment 5. Error bars denote standard error of the mean.
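The random frequency modulation used in experiment 4 (noise bandpassed to 1–2 Hz, with depth expressed relative to F0) can be sketched as follows. The FFT-based filtering and the peak-based depth scaling are assumptions; the paper does not specify these implementation details:

```python
import numpy as np

def random_fm_contour(f0_contour, frame_rate, depth_pct, rng):
    """Add a random modulation (white noise bandpassed to 1-2 Hz, scaled
    so its peak is depth_pct percent of the mean F0) to an F0 contour.
    A sketch of the experiment 4 manipulation; the exact filter and
    scaling conventions are assumptions."""
    n = len(f0_contour)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / frame_rate)
    spectrum[(freqs < 1.0) | (freqs > 2.0)] = 0.0  # keep the 1-2 Hz band
    fm = np.fft.irfft(spectrum, n)
    fm *= (depth_pct / 100.0) * f0_contour.mean() / np.abs(fm).max()
    return f0_contour + fm

rng = np.random.default_rng(0)
f0 = np.full(1000, 200.0)      # flat 200 Hz contour, 1 s at 1,000 frames/s
mod = random_fm_contour(f0, 1000, 15, rng)
print(np.abs(mod - f0).max())  # peak deviation: 30 Hz (15% of 200 Hz)
```

Because the modulation is confined to 1–2 Hz, it produces one to two slow excursions over a 1 s excerpt, resembling a prosodic contour change rather than a vibrato-like wobble.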

Mandarin words that were resynthesized with harmonic, inharmonic or noise carrier signals. The noise carrier simulated the acoustics of breath noise in whispered speech and was intended as a control condition to determine whether lexical tone perception would depend on the frequency modulation introduced by the pitch contour. As in experiment 4, the resynthesized words were filtered to ensure that listeners could not simply track the lower spectral edge provided by the F0 component, and noise was added to mask potential distortion products. Participants (fluent Mandarin speakers) were asked to identify single words by typing what they heard (Fig. 4d).

As shown in Fig. 4e, tone identification was comparable for harmonic and inharmonic speech (t(31) = 1.99, P = 0.06), but decreased substantially for whispered speech (t(31) = 22.14, P < 0.001). These two results suggest that tone comprehension depends on the tone’s pitch contour, as expected, but that its perception, like that of the prosodic contour, seems not to require F0 estimation. Listeners evidently track the frequency contours of the stimuli, irrespective of whether the frequencies are harmonic or inharmonic.

Experiment 6: Familiar melody recognition. Despite the lack of an effect of inharmonicity on tasks involving pitch contour discrimination, it seemed possible that F0-based pitch would be more important in complex and naturalistic musical settings. We thus measured listeners’ ability to recognize familiar melodies (Fig. 5a) that were rendered with harmonic or inharmonic notes. In addition to the Harmonic, Inharmonic and Inharmonic-changing conditions from previous experiments, we included Harmonic and Inharmonic conditions in which each interval of each melody (the size of note-to-note changes in pitch) was altered by one semitone while preserving the contour (directions of note-to-note changes) and rhythm (Fig. 5a).
These conditions were intended to test the extent to which prehension depends on the tone’s pitch contour, as expected, but any effect of inharmonicity would be mediated via an effect on pitch that its perception, like that of the prosodic contour, seems not interval encoding, by reducing the extent to which intervals would to require F0 estimation. Listeners evidently track the frequency be useful for the task.


Additionally, to evaluate the extent to which participants were using rhythmic cues to identify the melody, we included a condition where the rhythm was replicated with a flat pitch contour. Participants heard each of 24 melodies once (in one of the conditions, chosen at random) and typed the name of the song. Results were coded by the first author, blind to the condition. To obtain a large sample of participants, which was necessary given the small number of trials per listener, the experiment was crowd-sourced on Amazon Mechanical Turk.

As shown in Fig. 5b, melody recognition was modestly impaired for Inharmonic compared to Harmonic melodies (P < 0.001, via bootstrap). By contrast, performance was indistinguishable for Harmonic and Inharmonic conditions when melodic intervals were changed to incorrect values (P = 0.50). The deficit in melody recognition with inharmonic notes thus seems plausibly related to impairments in encoding pitch intervals (the magnitude of pitch shifts), which are known to be important for familiar melody recognition36. Performance in the Inharmonic conditions was nonetheless far better than in the Inharmonic-changing or Rhythm conditions (P < 0.001 for both), consistent with the notion that the pitch contour contributes to familiar melody recognition and is largely unaffected by inharmonicity.

[Fig. 5a depicts ‘Twinkle Twinkle Little Star’ with correct intervals (Contour: 0 + 0 + 0 −; Interval: 0 7 0 2 0 −2) and incorrect intervals (Contour: 0 + 0 + 0 −; Interval: 0 6 0 4 0 −4).]

Fig. 5 | Task, results and schematic of incorrect interval trials from experiment 6: familiar melody recognition. a, Stimuli and task for experiment 6. Participants on Amazon Mechanical Turk heard 24 melodies, modified in various ways, and were asked to identify each melody by typing identifying information into a computer interface. Three conditions (Harmonic, Inharmonic and Inharmonic-changing) preserved the pitch intervals between notes. Two additional conditions (incorrect intervals with harmonic or inharmonic notes) altered each interval between notes but preserved the contour (direction of pitch change between notes). The Rhythm condition preserved the rhythm of the melody, but used a flat pitch contour. b, Results from experiment 6. Error bars denote standard deviations calculated via bootstrap.

Experiment 7: Sour note detection. To further examine whether pitch interval perception relies on F0, we assessed the effect of inharmonicity on the detection of an out-of-key (‘sour’) note within a 16-note melody40,41. Sour notes fall outside the set of notes used in the tonal context of a melody and can be identified only by their interval relations with other notes of a melody. Melodies were randomly generated using a model of western tonal melodies42. In half of the trials, one of the notes in the melody was modified by one or two semitones to be out of key. Participants judged whether the melody contained a sour note (explained to participants as a ‘mistake’ in the melody; Fig. 6a). Notes were band-pass filtered and superimposed on masking noise as in the contour and two-tone discrimination tasks (to ensure that the task could not be performed by extracting pitch intervals from the F0 component alone; see Supplementary Fig. 4c,d for comparable results with unfiltered notes). We again measured performance for Harmonic, Inharmonic and Inharmonic-changing conditions.

Consistent with the deficit observed for familiar melody recognition and in contrast to the results for pitch contour discrimination (experiment 3), sour note detection was substantially impaired for Inharmonic compared to Harmonic trials (Fig. 6b; t(29) = 4.67, P < 0.001). This result is further consistent with the idea that disrupting F0 specifically impairs the estimation of pitch intervals in music.

Experiment 8: Interval pattern discrimination. It was not obvious a priori why inharmonicity would specifically prevent or impair the perception of pitch intervals. Listeners sometimes describe inharmonic tones as sounding like chords, appearing to contain more than one F0, which might introduce ambiguity in F0 comparisons between tones. However, if the contour (direction of note-to-note changes) can be derived from inharmonic tones by detecting shifts of the spectrum, one might imagine that it should also be possible to detect the magnitude of that shift (the interval) between notes. A dissociation between effects of inharmonicity on pitch contour and interval representations thus seemed potentially diagnostic of distinct mechanisms subserving pitch-related functions. To more explicitly isolate the effects of inharmonicity on pitch interval perception, we conducted an experiment in which participants detected interval differences between two three-note melodies with harmonic or inharmonic notes (Fig. 6c). In half of the trials, the second note of the second melody was changed by one semitone so as to preserve the contour (sign of pitch changes), but alter both intervals in the melody. Tones were again bandpass filtered and superimposed on masking noise.

As shown in Fig. 6d, this task was difficult (as expected, one semitone is close to previously reported pitch interval discrimination thresholds43), but performance was again better for harmonic than inharmonic notes (t(17) = 4.59, P < 0.001, t-test). Because this task, unlike those of experiments 6 and 7, did not require comparisons to familiar pitch structures (known melodies or key signatures), it mitigates the potential concern that the deficits in experiments 6 and 7 reflect a difficulty comparing intervals obtained from inharmonic notes to those learned from harmonic notes through experience with western music. Instead, the results suggest that intervals are less accurately encoded (or retained) when notes are inharmonic, suggesting a role for F0-based pitch in encoding or representing the magnitude of pitch changes.

Experiment 9: Pitch discrimination with large pitch intervals. To better understand the relationship between deficits in interval perception (where pitch steps are often relatively large) and the lack

NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

[Figure 6, panels a–d (graphics not reproduced): experiment 7, sour note detection (N = 30), and experiment 8, interval pattern discrimination (N = 18); performance plotted as ROC area for Harmonic, Inharmonic and (experiment 7) Inharmonic-changing conditions; ***P < 0.001.]

Fig. 6 | Task and results for experiments 7 and 8: sour note detection and interval pattern discrimination. a, Sample trial from experiment 7. Participants judged whether a melody contained a ‘sour’ (out of key) note. b, Results for experiment 7. Performance was measured as the area under ROC curves. Error bars denote standard error of the mean. c, Schematic of a sample trial from experiment 8. Participants judged whether two melodies were the same or different. On ‘different’ trials (pictured) the two melodies had different intervals between notes, but retained the same contour. The second melody was always transposed up in pitch relative to the first. d, Results for experiment 8. Performance was measured as the area under ROC curves. Error bars denote standard error of the mean.
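Performance in these tasks was scored as the area under the ROC curve. For readers unfamiliar with the statistic, it can be computed directly from trial-by-trial responses as the probability that a randomly chosen 'different' (or 'sour') trial receives a higher rating than a randomly chosen 'same' (or 'in-key') trial, with ties counting half. A minimal sketch (the ratings below are hypothetical; this is not the authors' analysis code):

```python
import numpy as np

def roc_area(signal_ratings, noise_ratings):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    P(signal rating > noise rating), with ties counted as 0.5."""
    s = np.asarray(signal_ratings, dtype=float)[:, None]
    n = np.asarray(noise_ratings, dtype=float)[None, :]
    return ((s > n).sum() + 0.5 * (s == n).sum()) / s.size / n.size

# Hypothetical confidence ratings (1 = sure 'same' ... 4 = sure 'different')
print(roc_area([4, 3, 4, 2, 3], [1, 2, 1, 3, 2]))  # → 0.88
```

A value of 0.5 corresponds to chance and 1.0 to perfect discrimination, matching the 40–100 axis range of Fig. 6.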

of impairment for two-tone pitch discrimination (experiment 1, where steps were small), we conducted a second pitch discrimination experiment with pitch steps covering a larger range (Fig. 7a). As shown in Fig. 7b, the results replicate those of experiment 1, but reveal that performance for Harmonic and Inharmonic tones differs somewhat (by ~10%) once pitch shifts exceed a semitone (producing an interaction between tone type and step size; F(1,27) = 71.29, P < 0.001). One explanation is that, for larger steps, the match between the spectral pattern of successive tones is occasionally ambiguous, leading to a decrease in performance for Inharmonic tones (although participants still achieved above 85% on average). The lack of a similar decline for Harmonic conditions suggests that F0-based pitch may be used to boost performance under these conditions.
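The ambiguity for large steps can be illustrated with a toy computation (our own, not from the paper): treat each note as a set of jittered partials and ask whether each partial of the shifted note is still nearest, in log frequency, to its own counterpart in the first note. For small shifts this nearest-neighbour correspondence tends to survive; once the shift approaches the spacing between adjacent partials, it can land on the wrong partial, making the spectral match ambiguous.

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = 200.0
# One fixed jitter pattern, as in the Inharmonic condition
# (uniform in ±0.5 × F0, fundamental unperturbed)
jitter = rng.uniform(-0.5, 0.5, 10)
jitter[0] = 0.0
note1 = (np.arange(1, 11) + jitter) * f0

def match_is_unambiguous(shift_semitones):
    """True if every partial of the shifted note is nearest (in log frequency)
    to its own counterpart in the first note."""
    note2 = note1 * 2 ** (shift_semitones / 12)
    nearest = np.abs(np.log2(note2[:, None] / note1[None, :])).argmin(axis=1)
    return bool(np.array_equal(nearest, np.arange(note1.size)))

for step in (0.25, 1.0, 6.0):
    print(step, match_is_unambiguous(step))
```

Whether a given step breaks the correspondence depends on the particular jitter pattern, which is consistent with the ambiguity arising only occasionally in the data.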

[Figure 7, panels a and b (graphics not reproduced): experiment 9, pitch discrimination with large pitch intervals (N = 29); percent correct plotted against step size (0.1–6 semitones) for Harmonic, Inharmonic and Inharmonic-changing conditions.]

Fig. 7 | Task and results for experiment 9: pitch discrimination with large pitch intervals. a, Schematic of trial structure for experiment 9. During each trial, participants heard two tones and judged whether the second tone was higher or lower than the first. The stimuli and task were identical to those of experiment 1, except larger step sizes were included. b, Results from experiment 9. Error bars denote standard error of the mean.
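The step sizes in experiment 9 are expressed in semitones; on the equal-tempered scale, a step of k semitones corresponds to a frequency ratio of 2^(k/12), so even the largest step tested (six semitones) is a ratio of only about 1.41. A quick illustration (our own helper, not from the paper; the 200 Hz F0 is an arbitrary example):

```python
def shifted_f0(f0_hz, semitones):
    """F0 after an equal-tempered pitch shift: ratio of 2**(k/12) for k semitones."""
    return f0_hz * 2 ** (semitones / 12)

# Step sizes from experiment 9 applied to a 200 Hz F0
for step in (0.1, 0.25, 1, 2, 3, 4, 6):
    print(f"{step:4} semitones: {shifted_f0(200.0, step):6.1f} Hz")
```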


By contrast, performance on the Inharmonic-changing condition progressively improved with pitch difference (F(6,168) = 80.30, P < 0.001). This result suggests that participants were also able to detect pitch differences to some extent through the average density of harmonics (higher tones have greater average spacing than lower tones, irrespective of the jitter added). By six semitones, where Inharmonic and Inharmonic-changing conditions produced equivalent performance (t(28) = 0.45, P = 0.66), it seems likely that participants were relying primarily on harmonic density rather than spectral shifts, as there was no added benefit of a consistent spectral pattern. Overall, the results indicate that pitch changes between tones are conveyed by a variety of cues and that listeners make use of all of them to some extent. However, pitch conveyed by the F0 appears to play a relatively weak role and only in particular conditions.

The difference between Harmonic and Inharmonic performance for larger pitch steps nonetheless left us concerned that what appeared to be deficits in interval size estimation in experiments 7 and 8 might somehow reflect a difficulty in recovering the direction of pitch change, because the intervals used in those two experiments were often greater than a semitone. To address this issue, we ran additional versions of both experiments in which the direction of pitch change between notes was rendered unambiguous; notes were not band-pass filtered, so the F0 component moved up and down, as did the spectral centroid of the note (Supplementary Fig. 4a). This stimulus produced up–down discrimination of tone pairs that was equally good irrespective of spectral composition (F(2,58) = 0.38, P = 0.689; Supplementary Fig. 4b), demonstrating that the manipulation had the desired effect of rendering direction unambiguous. Yet even with these alternative stimuli, performance differences for Inharmonic notes were evident in both the sour note detection and interval pattern discrimination tasks (t(18) = 3.87, P < 0.001, t(13) = 4.54, P < 0.001; Supplementary Fig. 4c–f). The results provide additional evidence that the deficits on these tasks with inharmonic stimuli do, in fact, reflect a difficulty encoding pitch intervals between sounds that lack a coherent F0.

Experiment 10: Voice recognition. In addition to its role in conveying the meaning of spoken utterances, pitch is thought to be a cue to voice identity28. Voices can differ in both mean F0 and in the extent and manner of F0 variation, and we sought to explore the importance of F0 in this additional setting. We first established the role of pitch in voice recognition by measuring recognition of voices whose pitch was altered (experiment 10a).

Participants were asked to identify celebrities from their speech, resynthesized in various ways (Fig. 8a). The speakers included politicians, actors, comedians and singers. Participants typed their responses, which were scored after the fact by the first author, blind to the condition. Due to the small number of trials per listener, the experiments were crowd-sourced on Amazon Mechanical Turk in order to recruit sufficient sample sizes. The speech excerpts were pitch-shifted up and down, remaining harmonic in all cases. Voice recognition was best at the speaker's original F0 and decreased for each subsequent pitch shift away from the original F0 (Fig. 8b). This result suggests that the average absolute pitch of a speaker's voice is an important cue to their identity and is used by human listeners for voice recognition.

To probe the pitch mechanisms underlying this effect, we measured recognition for inharmonic celebrity voices (experiment 10b). Participants heard speech excerpts that were harmonic or inharmonic at the original pitch, or resynthesized with simulated whispered excitation, and again identified the speaker. Recognition was substantially worse for Inharmonic speech (Fig. 8c; P < 0.001), suggesting that at least part of the pitch representations used for familiar voice recognition depends on estimating F0. Recognition was even worse for whispered speech (P < 0.001), suggesting that aspects of the prosodic contour may also matter, independent of the integrity of the F0.

Experiment 11: Novel voice discrimination. As a further test of the pitch mechanisms involved in voice perception, we measured the effect of inharmonicity on the discrimination of unfamiliar voices. Participants were presented with three speech excerpts and had to identify which one was spoken by a different speaker from the other two (Fig. 8d). Speech excerpts were taken from a large anonymized corpus44 and thus were unknown to participants.

As with celebrity voice recognition, we observed a significant deficit in performance for Inharmonic compared to Harmonic speech (t(29) = 3.88, P < 0.001, Fig. 8e) and a larger impairment for whispered speech (t(29) = 16.24, P < 0.001). These results are also consistent with a role for F0 in the representation of voice identity and show that voice-related deficits from inharmonicity do not only occur when matching an inharmonic stimulus to a stored representation of a normally harmonic voice (as in experiment 10). Deficits occur even when comparing multiple stimuli that are all inharmonic, suggesting that voice representations depend in part on F0-based pitch. We note also that the inharmonicity manipulation that produced an effect here and in experiment 10 is identical to the one that produced no effect on prosodic contour discrimination or Mandarin tone identification (experiments 4 and 5). It thus serves as a positive control for those null results—the manipulation is sufficient to produce a large effect for tasks that depend on the F0.

To further test whether the performance decrements in voice recognition and discrimination reflect impairments in estimating F0, we conducted a control experiment. Participants performed an alternative version of the discrimination task of experiment 11 in which the mean and variance of the F0 contours of each speech excerpt were equated, such that F0-based pitch was much less informative for the task. If the effect of inharmonicity were due to its effect on some other aspect of voice representations, such as vocal tract signatures extracted from the spectral envelope of speech, one would expect the deficit to persist even when F0 was rendered uninformative. Instead, this manipulation eliminated the advantage for harmonic over inharmonic speech (t(13) = 0.43, P = 0.67), suggesting that the deficits in experiments 10 and 11 are in fact due to the effect of inharmonicity on pitch perception (Supplementary Fig. 5a,b). This conclusion is also supported by findings that inharmonicity has minimal effects on speech intelligibility, which also depends on features of the spectral envelope resulting from vocal tract filtering. For example, Mandarin phoneme intelligibility (assessed from the responses for experiment 5) was unaffected by inharmonicity (Supplementary Fig. 5c,d).

Effects of musicianship. It is natural to wonder how the effects described here would vary with musicianship, which is known to produce improved performance on pitch-related tasks33,45,46. A comparison of musician and non-musician participants across all of the experiments (with the exception of experiment 5, in which most participants identified as musicians) indeed revealed that musicians were better than non-musicians at most tasks; the only experiments in which this was not the case were those involving voice identification or discrimination (Supplementary Figs. 6–8). However, the effects of inharmonicity were qualitatively similar for musicians and non-musicians. Tasks involving the direction of pitch changes (two-tone discrimination, melodic contour discrimination and prosodic contour discrimination; experiments 1–4) all showed similar performance for harmonic and inharmonic stimuli in both musicians and non-musicians (Supplementary Fig. 6). Tasks involving pitch intervals or voice identity (experiments 6–11) produced better performance for harmonic than inharmonic stimuli in both groups


[Figure 8, panels a–e (graphics not reproduced): experiments 10a (N = 248), 10b (N = 412) and 11 (N = 30); percent correct for pitch-shifted (−12 to +12 semitones), Harmonic, Inharmonic and Whispered conditions; ***P < 0.001, NS P = 0.08.]

Fig. 8 | Task and results for experiments 10a, 10b and 11: famous speaker recognition and novel voice discrimination. a, Description of experiments 10a and 10b. Participants on Mechanical Turk heard resynthesized excerpts of speech from recordings of celebrities and were asked to identify each speaker by typing their guesses into a computer interface. b, Results from experiment 10a, with harmonic speech pitch-shifted between −12 and +12 semitones. Here and in c, error bars plot standard deviations calculated via bootstrap. c, Results from experiment 10b. Stimuli in the Whispered condition were resynthesized with simulated breath noise, removing the carrier frequency contours. d, Schematic of trial structure for experiment 11. Participants heard three 1 s resynthesized speech utterances from unknown speakers, the first or last of which was spoken by a different speaker than the other two. Participants judged which speaker (first or last) only spoke once. e, Results from experiment 11. Error bars denote standard error of the mean.

(Supplementary Figs. 7 and 8). The lone exception was experiment 8 (interval pattern discrimination), where most non-musicians performed close to chance in both conditions. The similarity in results across groups suggests that the differences we find in the effect of inharmonicity across tasks are a basic feature of hearing and are present in listeners independent of extensive musical expertise.

Discussion
To probe the basis of pitch perception, we measured performance on a series of pitch-related music and speech tasks for both harmonic and inharmonic stimuli. Inharmonic stimuli should disrupt mechanisms for estimating F0 of the sort conventionally assumed to underlie pitch. We found different effects of this manipulation depending on the task. Tasks that involved detecting the direction of pitch changes, whether for melodic contour, spoken prosody or single pitch steps, generally produced equivalent performance for harmonic and inharmonic stimuli. By contrast, tasks that required judgements of pitch intervals or voice identity showed substantially impaired performance for inharmonic stimuli. These results suggest that what has conventionally been considered 'pitch perception' is mediated by several different mechanisms, not all of which involve estimating F0.

Tracking spectral patterns. Our results suggest a mechanism that registers the direction of pitch shifts (the contour) by tracking shifts in spectral patterns, irrespective of whether the pattern is harmonic or inharmonic. This mechanism appears to operate both for musical tones and for speech. When the correspondence in spectral pattern was eliminated in the Inharmonic-changing conditions of experiments 1–3, performance was severely impaired. These results provide evidence that the match in the spectral pattern between notes underlies the detection of the pitch change and that under these conditions pitch changes need not be detected by first estimating the F0 of each note.

Previous results have shown that listeners hear changes in the overall spectrum of a sound47 (for example, the centroid, believed to underlie the brightness dimension of timbre, or the edge), that these shifts can produce contour-like representations48, and that these shifts can interfere with the ability to discern changes in F0 (refs 47,49,50). Our findings differ from these previous results in suggesting that the substrate believed to underlie F0 estimation (the fine-grained pattern of harmonics) is often instead used to detect spectral shifts. Other prior results have provided evidence for 'frequency shift detectors', typically for shifts in individual frequency components51, although it has been noted that shifts can be heard between successive inharmonic tones52. Our results are distinct in showing that these shifts appear to dictate performance in conditions that have typically been assumed to rely on F0 estimation. Although we have not formally modelled the detection of such shifts, the cross-correlation of excitation patterns (perhaps filtered to accentuate fluctuations due to harmonics) between sounds is a candidate mechanism. By contrast, it is not obvious how one could account for the detection of shifts in inharmonic spectra with an F0-estimation mechanism, particularly given that the same inharmonicity manipulation produces large deficits in some tasks, but not in tasks that rely on detecting the direction of pitch shifts, even when shifts are near threshold.

F0-based pitch. The consistently large effects of inharmonicity in some pitch-related tasks implicate an important role for F0-based pitch (historically referred to as 'virtual' pitch, 'residue' pitch or 'periodicity' pitch). F0-based pitch seems necessary for accurately estimating pitch intervals (the magnitude of pitch shifts; experiments 6–8) and for identifying and discriminating voices (experiments 10 and 11).


These results provide a demonstration of the importance of F0-based pitch and a delineation of its role in pitch-related behaviours such as interval perception and voice recognition.

Implications for relationship between F0 and pitch. Taken together, our data suggest that the classical view of pitch as the perceptual correlate of F0 is incomplete; F0 appears to be just one component of real-world pitch perception. The standard psychoacoustic assessment of pitch (two-tone up–down discrimination) does not seem to require the classical notion of pitch. At least for modest pitch differences and for the stimulus parameters we employed, it can be performed by tracking correspondence in the spectral pattern of sounds even when they are inharmonic.

Are the changes that are heard between inharmonic sounds really 'pitch' changes? Listeners describe what they hear in the Inharmonic conditions of our experiments as a pitch change, but in typical real-world conditions the underlying mechanism presumably operates on sounds that are harmonic. The changes heard in sequences of inharmonic sounds thus appear to be a signature of a mechanism that normally serves to register changes in F0, but that does so without computing F0.

Alternatively, could listeners have learned to employ a strategy to detect shifts in inharmonic spectra that they would not otherwise use for a pitch task? We consider this unlikely, both because listeners were not given practice on our tasks prior to the experiments, and because omitting feedback in pilot experiments did not alter the results. Moreover, the ability to hear 'pitch' shifts in inharmonic tones is typically immediate for most listeners, as is apparent in the stimulus demonstrations that accompany this paper.

Several previous studies found effects of inharmonicity in two-tone pitch discrimination tasks. Although at face value these previous results might appear inconsistent with those reported here, the previously observed effects were typically modest and were either variable across subjects25 or were most evident when the stimuli being compared had different spectral compositions26,27. In pilot experiments, we also found spectral variation between tones to cause performance decrements for inharmonic tones, as might be expected if the ability to compare successive spectra were impaired, forcing listeners to partially rely on F0-based pitch. Our results here are further consistent with this idea—effects of inharmonicity became apparent for large pitch shifts and a fixed spectral envelope, when spectral shifts were ostensibly somewhat ambiguous. It thus remains possible that F0-based pitch is important for extracting the pitch contour in conditions in which the spectrum varies, such as when intermittent background noise is present. However, in many real-world contexts, where spectra are somewhat consistent across the sounds to be compared (as for the recorded instrument notes used in experiment 2), F0 seems unlikely to be the means by which pitch changes are heard. The classic psychoacoustic notion of pitch is thus supported by our data, but primarily for particular tasks (interval perception and voice recognition/discrimination), and not in many contexts in which it has been assumed to be important (melodic contour, prosody and so on).

In addition to F0 and spectral pattern, there are other cues that could be used to track changes in pitch in real-world contexts, several of which were evident in our data. When tones were not passed through a fixed bandpass filter (Supplementary Fig. 4), listeners could detect a pitch shift even when there was no F0 or consistent spectral pattern. This suggests that some aspect of the spectral envelope (the lower edge generated by the F0, or the centroid47,48,53) can be used to perform the task. We also found good up–down discrimination performance even when the spectral envelope was fixed and the spectral pattern was varied from note to note, provided the steps were sufficiently large (Inharmonic-changing condition in experiment 9). This result suggests that listeners can hear the changes in the density of harmonics that normally accompany pitch shifts. It thus appears that pitch perception is mediated by a relatively rich set of cues that vary in their importance depending on the circumstances.

Comparisons with previous models of pitch. Previous work on pitch has also implicated multiple mechanisms, but these mechanisms typically comprise different ways to estimate F0. Classical debates between temporal and place models of pitch have evolved into the modern view that different cues, plausibly via different mechanisms, underlie pitch derived from low- and high-numbered harmonics14,21,31,32,54. The pitch heard from low-numbered 'resolved' harmonics may depend on the individual harmonic frequencies, whereas that from high-numbered 'unresolved' harmonics is believed to depend on the combined pattern of periodic beating that they produce. In both cases, however, the mechanisms are thought to estimate F0 from a particular cue to periodicity. By contrast, we identify a mechanism for detecting changes in F0 that does not involve estimating F0 first and that is thus unaffected by inharmonicity (a manipulation that eliminates periodicity). The pitch-direction mechanism implicated by our results is presumably dependent on resolved harmonics, although we did not explicitly test this. Resolved and unresolved harmonics may thus best be viewed as providing different peripheral cues that can then be used for different computations, including but not limited to F0-based pitch.

Previous work has also often noted differences in the representation of pitch contour and intervals29,36,55. However, the difference between contour and interval representations has conventionally been conceived as a difference in what is done with pitch once it is extracted (retaining the sign versus the magnitude of the pitch change). By contrast, our results suggest that the difference between contour and interval representations may lie in what is extracted from the sound signal—pitch contours can be derived from spectral shifts without estimating F0, whereas intervals appear to require the initial estimation of the F0s of the constituent notes, from which the change in F0 between notes is measured.

Future directions. Our results suggest a diversity of mechanisms underlying pitch perception, but leave open the question of why multiple mechanisms exist. Real-world pitch processing occurs over a heterogeneous set of stimuli and tasks, and the underlying architecture may result from the demands of this diversity. Some tasks require knowledge of a sound's absolute F0 (voice identification being the clearest example), and the involvement of F0 estimation is perhaps unsurprising. Many other tasks only require knowledge of the direction of pitch changes. In such cases, detecting shifts in the underlying spectral pattern is evidently often sufficient, but it is not obvious why extracting the F0 and then measuring its change over time is not the solution of choice. It may be that measuring shifts in the spectrum is more accurate or reliable when shifts are small. It is conversely not obvious why pitch interval tasks are more difficult with inharmonic spectra, given that pitch direction tasks are not. Inharmonic tones often resemble multiple concurrent notes, which could in principle interfere with the extraction of note relationships, but no such interference occurs when determining the direction (provided the spectra shift coherently). The dependence on a coherent F0 could lie in the need to compress sound signals for the purposes of remembering them; pitch intervals must often be computed between notes separated by intervening notes, and without the F0 there may be no way to 'summarize' a pitch for comparison with a future sound. As discussed above, F0 may also be important for comparing sounds whose spectra vary, obscuring the correspondence in the spectral pattern of each sound. These ideas could be explored by optimizing auditory models for different pitch tasks and then probing their behaviour.

One open question is whether a single pitch mechanism underlies performance in the two types of task in which we found strong effects of F0-based pitch. Voices and musical intervals are


acoustically distinct and presumably require different functions of the F0—for example, the mean and variation of a continuously changing F0 for voice, versus the difference in the static F0 of musical notes—such that it is not obvious that they would be optimally served by the same F0-related computation. A related question is whether the importance of harmonicity in musical consonance (the pleasantness of combinations of notes)56 and sound segregation57–59 reflects the same mechanism that uses harmonicity to estimate F0 for voice or musical intervals.

It remains to be seen how the distinct mechanisms suggested by our results are implemented in the brain, but the stimuli and tasks used here could be used toward this end. Our findings suggest that F0-tuned neurons15,16 are unlikely to exclusively form the brain basis of pitch and that it could be fruitful to search for neural sensitivity to pitch-shift direction. Our results also indicate a functional segregation for the pitch mechanisms subserving contour (important for speech prosody as well as music) and interval (important primarily for music), suggesting that there could be some specialization for the role of pitch in music. We also found evidence that voice pitch (dependent on harmonicity) may be derived from mechanisms distinct from those for prosody (robust to inharmonicity). A related question is whether pitch in speech and music taps into distinct systems60–62. The diversity of pitch phenomena revealed by our results suggests that investigating their basis could reveal rich structure within the auditory system.

Methods
Participants. All experiments were approved by the Committee on the use of Humans as Experimental Subjects at the Massachusetts Institute of Technology, and were conducted with the informed consent of the participants. All participants were native English speakers.

A total of 30 participants (16 female, mean age 32.97 years, standard deviation (s.d.) = 16.61 years) participated in experiments 1, 3, 4, 7, 8, 9 and 11, as well as the experiments detailed in Supplementary Figs. 4b and 9d. These participants were run in the laboratory for three 2 h sessions each (except one participant, who was unable to return for her final 2 h session). Of these participants, 15 had over 5 years of musical training, with an average of 13.75 years (s.d. = 14.04 years). The sample size was over double the necessary size based on power analyses of pilot data; this ensured sufficient statistical power to separately analyse musicians and non-musicians (Supplementary Figs. 6–8).

Another set of 30 participants participated in experiment 2, as well as the control experiment detailed in Supplementary Fig. 2 (11 female, mean age of 37.15 years, s.d. = 13.36). Of these participants, 15 had over 5 years of musical training, with an average of 17.94 years (s.d. = 14.04 years).

A total of 20 participants (8 female, mean age of 38.14 years, s.d. = 15.75 years) participated in the two follow-up experiments, as detailed in Supplementary Fig. 4c–f. Of these, 9 participants had over 5 years of musical training, with an average of 12.18 years (s.d. = 13.64 years).

Fourteen participants completed the control experiment described in Supplementary Fig. 5a,b (5 female, mean age of 40 years, s.d. = 13.41 years). Two identified as musicians (average of 9.5 years, s.d. = 0.71 years).

Between these four sets of experiments (combined N of 94), we tested 70 different individuals: 2 individuals completed all four sets of experiments, 3 completed three sets, and 12 completed two sets.

Participants in online experiments (experiments 5, 6 and 10) were different for each experiment (their details are given in the respective experiment descriptions). For Mechanical Turk studies, sample sizes were chosen to obtain split-half correlations of at least 0.9.

Stimuli. Logic of stimulus filtering. Similar performance for harmonic and inharmonic sounds could in principle result from an ability to 'hear out' the F0. To prevent this from occurring, we filtered stimuli to remove the F0 component and added noise to mask distortion products, which could otherwise reinstate a component at the F0 for harmonic stimuli (for details of this filtering and masking noise see 'Tasks with synthetic tones' and 'High-pass filtering and masking noise'). The exceptions to this approach were the experiments on real instrument notes (experiment 2) and voice identification or discrimination (experiments 10 and 11). Filtering was not applied to the instrument tones because it seemed important to leave the spectral envelope of the notes intact (because the experiment was intended to test pitch discrimination for realistic sounds). We omitted filtering for the voice experiments because piloting indicated that both experiments would show worse performance for inharmonic stimuli, suggesting that participants were not exclusively relying on the F0 component. Given this, we opted to feature the version of the experiments with unfiltered voices given their greater ethological validity. Moreover, filtered speech showed qualitatively similar results, so in practice the filtering had little effect on the results (Supplementary Fig. 9). All experiments for which filtered stimuli were used were likewise replicated with unfiltered stimuli and in practice it had little effect on the results (Supplementary Figs. 4 and 9).

Tasks with synthetic tones. Stimuli were composed of notes. Each note was a synthetic complex tone with an exponentially decaying temporal envelope (decay constant of 4 s−1) to which onset and offset ramps were applied (20 ms half-Hanning window). The sampling rate was 48,000 Hz. For experiments 1, 3, 7, 8 and 9, notes were 400 ms in duration. For experiment 6, note durations were varied to recreate familiar melodies; the mean note duration was 425 ms (s.d. = 306 ms), and the range was 100 ms to 2 s. Harmonic notes included all harmonics up to the Nyquist limit, in sine phase.

To make notes inharmonic, the frequency of each harmonic, excluding the fundamental, was perturbed (jittered) by an amount chosen randomly from a uniform distribution, U(−0.5, 0.5). This jitter value was chosen to maximally perturb F0 (lesser jitter values did not fully remove peaks in the autocorrelation for single notes; see Supplementary Fig. 1). Jitter values were multiplied by the F0 of the tone and added to the frequency of the respective harmonic. For example, if the F0 was 200 Hz and a jitter value of −0.39 was selected for the second harmonic, its frequency would be set to 322 Hz. To minimize salient differences in beating, jitter values were constrained (via rejection sampling) such that adjacent harmonics were always separated by at least 30 Hz. The same jitter pattern was applied to every note of the stimulus for a given trial, except for 'Inharmonic-changing' trials, for which a different random jitter pattern was generated for each note. Unlike the temporal jittering manipulations commonly applied to click train stimuli in neurophysiology experiments15, the frequency jittering manipulation used here preserves the presence of discrete frequency components in the spectrum, allowing the possibility that spectral shifts could be detected even in the absence of an F0.

For all experiments with synthetic tones, each note was band-pass filtered in the frequency domain, with a Gaussian transfer function (in log frequency), centred at 2,500 Hz with a standard deviation of half an octave. This filter was applied to ensure that participants could not perform the tasks using changes in the spectral envelope. The filter parameters were chosen to ensure that the F0 was attenuated (to eliminate variation in a spectral edge at the F0) while preserving the audibility of resolved harmonics (the 10th or lower, approximately). For supplementary experiments in which this filter was not applied, each note was unfiltered, but harmonic amplitudes were set to decrease by 16 dB per octave.

To ensure that differences in performance for harmonic and inharmonic conditions could not be mediated by distortion products, we added masking noise to all band-pass filtered notes (all experiments described in the main text). We low-pass filtered pink noise using a sigmoidal transfer function in the frequency domain with an inflection point at the third harmonic of the highest note in the given sequence and a slope yielding 40 dB of gain or attenuation per octave on the low and high sides of the inflection point, respectively. We scaled the noise so that it was 10 dB lower than the mean power of the three harmonics of the highest note of the trial that were closest to the 2,500 Hz peak of the Gaussian spectral envelope35. This filtered and scaled pink noise was added to each note, creating a consistent noise floor for each note sequence.

Speech tasks. Speech was manipulated using the STRAIGHT analysis and synthesis method37–39. STRAIGHT decomposes a recording of speech into voiced and unvoiced vocal excitation and vocal tract filtering. If the voiced excitation is modelled sinusoidally, one can alter the frequencies of individual harmonics and then recombine them with the unaltered unvoiced excitation and vocal tract filtering to generate inharmonic speech. This manipulation leaves the spectral shape of the speech largely intact, and supplementary experiments (Supplementary Fig. 5) suggest that intelligibility of inharmonic speech is comparable to that of harmonic speech. The jitters for inharmonic speech were chosen in the same way as the jitters for inharmonic musical notes (described above in 'Tasks with synthetic tones'). The same pattern of jitter was used throughout the entire speech utterance and the entire trial for experiments 4 and 11. STRAIGHT was also used to perform pitch shifts, modify the F0 contour of speech utterances, and create 'whispered speech' (the voiced vocal excitation is replaced with noise). Noise-excited stimuli were generated by substituting simulated breath noise for the tonal/noise excitation combination otherwise used in STRAIGHT. The breath noise was high-pass filtered white noise. The filter was a second-order high-pass Butterworth filter with a (3 dB) cutoff at 1,200 Hz whose zeros were moved towards the origin (in the z plane) by 5%. The resulting filter produced noise that was 3 dB down at 1,600 Hz, 10 dB down at 1,000 Hz and 40 dB down at 100 Hz, which to the authors sounded like a good approximation to whispering. Without the zero adjustment the filter removed too much energy at the very bottom of the spectrum. The stimuli were thus generated from the same spectrotemporal envelope used for harmonic and inharmonic speech, just with a different excitation signal. Speech was sampled at 12,000 Hz.

High-pass filtering and masking noise. Unless otherwise noted, all speech (except 'whispered' speech) was high-pass filtered to prevent participants from using the lowest harmonic as a proxy for the pitch contour, which a priori seemed like

NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav
© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

a plausible strategy for the inharmonic conditions. Filtering was accomplished by multiplying by a logistic (sigmoid-shaped) transfer function in the frequency domain. The transfer function was given an inflection point at twice the mean F0 of the utterance (that is, the average frequency of the second harmonic). The slope of the sigmoid function was set to achieve 40 dB of gain/attenuation per octave on either side of the inflection point (that is, with the F0 40 dB below the second harmonic and the fourth harmonic 40 dB above the second harmonic). This meant that the F0 would be attenuated by 80 dB relative to the fourth harmonic. In addition, masking noise was added to prevent potential distortion products from reinstating (for harmonic conditions) the F0 contour that had been filtered out. We low-pass filtered pink noise using a sigmoid function with an inflection point at the third harmonic (3× the mean F0 of the utterance) and the same slope as the high-pass filter described above, but with opposite sign (such that the noise was attenuated by 40 dB at 3× F0 relative to 1.5× F0 and by 80 dB at 6× F0). The noise was then scaled so that its power in a gammatone filter centred at the F0 was 10 dB below the mean power of harmonics 3 to 8 in a pitch-flattened version of the utterance. Assuming any distortion products are at most 20 dB below the peak harmonics in the speech signal35, this added noise should render them inaudible. The filtered and scaled pink noise was added to the filtered speech signal to create the final stimuli.

Audio presentation: in lab. In all experiments, the Psychtoolbox for Matlab63 was used to play sound waveforms. Sounds were presented to participants at 70 dB over Sennheiser HD280 headphones (circumaural) in a soundproof booth (Industrial Acoustics).

Audio presentation: Mechanical Turk. We used the crowdsourcing platform provided by Amazon Mechanical Turk to run experiments that necessitated small numbers of trials per participant. Each participant in these studies used a calibration sound to set a comfortable level, and then had to pass a 'headphone check' experiment that helped ensure they were wearing headphones or earphones as instructed (described in ref. 64) before they could complete the full experiment.

Feedback. For all in-lab experiments, conditions were randomly intermixed and participants received feedback ('correct'/'incorrect') after each trial. Feedback was used to assure compliance with task instructions. Pilot results indicated that results without feedback were qualitatively similar across all tasks. Feedback was not given during Mechanical Turk experiments because they were open-set recognition tasks. Participants did not complete practice runs of the experiments.

Statistics. For experiments 1, 2, 4, 5, 9 and 11, percent correct was calculated for each harmonicity condition and difficulty (if relevant). Paired t-tests were used to compare conditions, and for experiments 1, 2, 4 and 9, the effects of harmonicity and difficulty (step size and modulation depth, respectively) were further examined using repeated-measures analyses of variance (ANOVAs). For experiments 3, 7 and 8, hits and false alarms were converted into a receiver-operating characteristic (ROC) curve for each condition. The area under the ROC curve was the metric of performance; this area always lies between 0 and 1, and 0.5 corresponds to chance performance. Comparisons between conditions were made using paired t-tests.

For open-set recognition tasks on Mechanical Turk (experiments 5, 6 and 10), results were coded by the first author, blind to the condition. For example, in experiment 10, a response such as 'He plays Professor Snape in Harry Potter' would be coded as 'Alan Rickman'. Percent correct was calculated for each condition from the resulting scores. For experiments 6 and 10, confidence intervals were estimated by bootstrap (10,000 repetitions, with participants sampled randomly with replacement; data were non-Gaussian due to the small number of trials per condition per participant), and P values were calculated using the cumulative distribution function of the means of the bootstrapped samples.

Experiment 1: Basic discrimination with pairs of synthetic tones. Procedure. Participants heard two notes and were asked whether the second note was higher or lower than the first note. There were three conditions: Harmonic (both notes were harmonic), Inharmonic (both notes had the same random jitter) and Inharmonic-changing (the two notes had different random jitter patterns). After each trial, participants clicked a button to indicate their choice ('Down' or 'Up'). Participants completed 40 trials for each step size in each condition in a single session. Here and in other in-lab experiments, participants were given short breaks throughout the session.

Stimuli. Each trial consisted of two notes, described above in the section 'Tasks with synthetic tones'. We used the method of constant stimuli; the second note differed from the first by 0.1, 0.25, 1 or 2 semitones. The first note of each trial was randomly selected from a log-uniform distribution spanning 200–400 Hz.

Experiment 2: Basic discrimination with pairs of instrument notes. Procedure. The procedure was identical to that of experiment 1.

Stimuli. Each trial presented two instrument notes resynthesized from recordings. Notes were selected from the RWC Music Database of Musical Instrument Sounds. Only notes coded in the database as 'Mezzo Forte', and 'Normal' or 'Non Vibrato', were selected for the experiment. Five instruments were used: piano, violin, trumpet, oboe and clarinet. The first note of each trial was randomly selected from a uniform distribution over the notes of a western classical chromatic scale between 196 and 392 Hz (G3 to G4). A recording of this note, from a randomly selected instrument, was chosen as the source for the first note in the trial. If the second note in the trial was higher, the note two semitones above (for the 2 semitone trials) or one semitone above (for the 0.1, 0.25 and 1 semitone trials) was selected to generate the second note (reversed if the second note of the trial was lower). The two notes were analysed and modified using the STRAIGHT analysis and synthesis method37–39; the notes were pitch-flattened to remove any vibrato, shifted to ensure that the pitch differences would be exactly 0.1, 0.25, 1 or 2 semitones, and resynthesized with harmonic or inharmonic excitation. Excitation frequencies were modified for the Inharmonic or Inharmonic-changing conditions in the same way that the synthetic tones were modified in experiment 1 (section 'Tasks with synthetic tones'). The resynthesized notes were truncated at 400 ms and windowed with a 20 ms half-Hanning window. Note onsets were always preserved, and notes were sampled at 12,000 Hz.

Experiment 3: Contour perception. Procedure. The experimental design was inspired by the classic contour perception task of Dowling and Fujitani36. Participants heard two 5-note melodies on each trial and were asked to determine whether the melodies were the same or different. There were three conditions: Harmonic, Inharmonic and Inharmonic-changing, as in experiments 1 and 2. Following each trial, participants clicked a button to select one of four responses: 'Sure different', 'Different', 'Same' or 'Sure same'. Participants were instructed to attempt to use all four responses equally throughout the experiment. Participants completed 40 trials for each condition in a single session. Participants with performance above 95% correct averaged across the Harmonic and Inharmonic conditions (13 of 29 participants) were removed from analysis to avoid ceiling effects; no participants were close to floor on this experiment.

Stimuli. Five-note melodies were generated with randomly chosen steps of +1 or −1 semitone. The tonic and starting note of each melody was set to 200, 211.89 or 224.49 Hz. On 'same' trials, the second melody was identical to the first except for being transposed upwards by half an octave. On 'different' trials, the second melody was additionally altered to change the melodic contour (the sequence of signs of pitch changes): the alteration procedure randomly reversed the sign of the second or third interval in the melody (for example, an interval of +1 semitone became −1 semitone).

Experiment 4: Speech contour perception. Procedure. Participants heard three 1 s speech utterances and were asked whether the first or the last was different from the other two. Following each trial, participants clicked a button to select one of two responses: 'First different' or 'Last different'. Percent correct for each condition was used as the metric of performance. Participants completed the two tasks in counterbalanced order. Participants completed 40 trials for each condition in a single session.

Stimuli. For each trial, a single 1 s speech excerpt was randomly chosen from the TIMIT training database44. For each participant, selections from TIMIT were balanced for gender and dialect region. The three speech signals used in a trial were resynthesized from this original speech excerpt. The 'same' utterance was resynthesized without altering the excitation parameters. To make the 'different' utterance, the speech was resynthesized with a random frequency modulation added to the original F0 contour. The frequency modulation was generated by bandpass-filtering pink noise between 1 and 2 Hz using a rectangular filter in the frequency domain. These band limits were chosen so that there would be at least one up–down modulation within the 1 s speech segment. The added modulation was normalized to have a root-mean-square (r.m.s.) amplitude of 1, multiplied by the modulation depth (0.05, 0.15 or 0.25, depending on the condition), then multiplied by the mean F0 of the speech segment, and then added to the original F0 contour of the speech segment. The second speech signal in each trial was shifted up in pitch by two semitones relative to the first and third, but was otherwise unaltered. Stimuli were synthesized with either harmonic or jittered inharmonic excitation, using an extension of STRAIGHT37,38. The frequency jitter applied to harmonic components was constant within a trial. Each speech excerpt was high-pass filtered and masked as described above (section 'High-pass filtering and masking noise').

Experiment 5: Mandarin tone perception. Participants. A total of 32 self-reported native Mandarin speakers were tested using Amazon Mechanical Turk (17 female, mean age of 36.7 years, s.d. = 10.99 years). Of these, 27 answered 'Yes' to the question 'Have you ever known how to play a musical instrument?'

Procedure. Participants were instructed that they would hear 120 recordings of single words in Mandarin that had been manipulated in various ways. They could hear each recording only once. Their task was to identify as many words as possible. Responses were typed into a provided entry box using Hanyu Pinyin (the international standard for romanization of standard Chinese/modern standard


Mandarin), which allows for the independent coding of tones (labelled 1–5) and phonemes. For example, if a participant heard the word 'pō', they could respond correctly with 'po1', give an incorrect tone response (for example, 'po2'), or give a response with incorrect phonemes/spelling but the correct tone (for example, 'wo1'). Participants were given several example responses in Pinyin before they began the experiment. Participants heard each word once over the course of the experiment, and conditions were randomly intermixed.

Stimuli. In Mandarin Chinese, the same syllable can be pronounced with one of five different 'tones' (1: flat; 2: rising; 3: falling then rising; 4: falling; 5: neutral). The use of different tones can change the meaning of a syllable. For this experiment, 60 pairs of Mandarin word recordings spoken by a single female talker were chosen from the 'Projet SHTOOKA' database (http://shtooka.net/). Each pair of recordings consisted of either two single syllables (characters) with the same phonemes but different tones, or two 2-syllable words with the same phonemes but different tones for only one of the syllables. For example, one pair was Wùlĭ/Wúlĭ: Wùlĭ, meaning 'physics', contains the fourth and third tones and is written in Pinyin as 'Wu4li3', whereas Wúlĭ, meaning 'unreasonable', contains the second and third tones and is written in Pinyin as 'Wu2li3'. All combinations of tone differences were represented, with the exception of the third tone versus the neutral tone. Only a few combinations involving the neutral tone could be found in the SHTOOKA database, which also reflects the relative scarcity of such pairings in the Mandarin language. To span the range of Mandarin vowels, the different tones within each word pair occurred on cardinal vowels (vowels at the edges of the vowel space) or diphthongs containing cardinal vowels. Additionally, to represent a range of consonant/vowel transitions, all voiced/unvoiced consonant pairs (for example, p/b and d/t) found in Mandarin and present in the source corpus were represented. Harmonic, Inharmonic and Whispered versions of each word were generated using STRAIGHT. A full list of words is available in the Supplementary Information (Supplementary Table 1). Stimuli were high-pass filtered and masked as described in the section 'High-pass filtering and masking noise'.

Experiment 6: Familiar melody recognition. Participants. A total of 322 participants completed the experiment (143 women, mean age of 36.49 years, s.d. = 11.77 years). All reported normal hearing, and 204 answered 'Yes' to the question 'Have you ever known how to play a musical instrument?'

Procedure. Participants were asked to identify 24 familiar melodies (see Supplementary Table 2 for a full list). There were five main conditions: Harmonic, Inharmonic and Inharmonic-changing, plus Harmonic and Inharmonic conditions in which the intervals were altered randomly by one semitone, with the constraint that the contours stayed intact. The experiments contained additional conditions not analysed here. Participants were given the written instructions: 'You will hear 24 melodies. These melodies are well known. They come from movies, nursery rhymes, holidays, popular songs, etc. Some of the melodies have been manipulated in various ways. Your task is to identify as many of these melodies as you can. You will only be able to hear each melody once! Even if you don't know the name of the tune, but you recognize it, just describe how it is familiar to you. e.g., is it a nursery rhyme? What movie is it from? Who sings it?' Participants were allowed to type their responses freely into a provided space.

Stimuli. Harmonic (with either correct or incorrect intervals), Inharmonic (with either correct or incorrect intervals) and Inharmonic-changing variants of 24 well-known melodies were generated by concatenating synthetic complex tones. The tonic was always set to an F0 of 200 Hz. A complete phrase of each melody was used (approximately 4 s of music per melody). For trials in which the intervals were modified, ±1 semitone was randomly added to every note in the melody. Interval perturbations were iteratively sampled until they produced a new melody whose contour was the same as that of the original melody. For Rhythm conditions, the rhythm of the melody was played using a 200 Hz tone. Participants heard each melody once. Three melodies were assigned (randomly) to each condition (the experiment contained additional conditions not analysed here). This produced 966 trials per condition across all participants.

Experiment 7: Sour note detection. Procedure. Participants heard one 16-note melody per trial, and were asked whether the melody contained a 'sour' note (a 'mistake')40,41. There were three conditions: Harmonic, Inharmonic and Inharmonic-changing. Following each trial, participants clicked a button to select one of four responses: 'Sure mistake', 'Maybe mistake', 'Maybe no mistake' or 'Sure no mistake'. Participants were instructed to attempt to use all four responses equally throughout the experiment. Participants completed 42 trials for each condition in a single session.

Stimuli. Sixteen-note melodies were created using a modified version of the generative model outlined in ref. 42. This model uses a range profile (to restrict the absolute range of a melody), a proximity profile (to restrict the size of note-to-note leaps) and a key profile (to maintain a consistent key within a melody) to generate melodies on a note-by-note basis. The Temperley (2008) range profile was modified to restrict the range of the five-note contours to 1.5 octaves, and the proximity profile was changed to allow a maximum leap of five semitones (a perfect fourth). Melodies were rejected if they contained a sequential note repetition (yielding a contour step of 0 semitones). Only major key profiles were used, and these were altered from the Temperley (2008) model so that there was no chance of notes outside the designated key. Melodies, initiated randomly on 200, 211.89 or 224.49 Hz, were generated and rejected until one was obtained with a specified scale degree (1, 3 or 5: the tonic, mediant or dominant, randomly distributed over the course of the experiment, with 14 of each scale degree per harmonicity condition) in the 12th, 13th, 14th or 15th note position. For 'sour' trials, the note with the desired scale degree (1, 3 or 5) was changed: 1 was moved upwards by a semitone, 3 was moved upwards by two semitones and 5 was moved upwards by one semitone. Melodies were rejected and a new melody was generated if the sour note altered the original contour. Only one note was altered in each 'sour' trial. A fixed bandpass filter (described above) was applied to each note.

Experiment 8: Interval pattern discrimination. Procedure. Participants heard two 3-note melodies that had identical contours, but in half of the trials the interval relationships between the notes changed by one semitone41. Half the trials had Harmonic notes and the other half had Inharmonic notes. A fixed bandpass filter (described above) was applied to each note. Participants were asked whether the melodies were identical or different. There were four possible responses: 'Sure different', 'Maybe different', 'Maybe same' and 'Sure same'. Participants were instructed to attempt to use all four responses equally throughout the experiment. Participants completed 40 trials per condition in a single session. This task was difficult, and participants with performance less than 0.55 across both conditions (12 of 30 participants) were excluded from analysis.

Stimuli. The two 3-note contours were generated randomly using a uniform distribution and step sizes of ±1, 2, 3, 4 and 5 semitones (5 semitones was the largest step size used in experiment 7). The two melodies were separated by a 0.6 s silent gap. 'Different' conditions were generated by randomly adding +1 or −1 semitone to the middle note of the comparison melody, with the restriction that the original contour could not change (for example, by creating a unison). The first melody started on 200 Hz and the second melody was always transposed up by half an octave.

Experiment 9: Basic discrimination with larger step sizes. The stimuli and procedure were identical to those of experiment 1, except for the addition of 3, 4 and 6 semitone step sizes.

Experiment 10a and 10b: Famous speaker recognition. Participants (10a). A total of 248 participants, 123 female (mean age of 35.95 years, s.d. = 9.89 years), completed experiment 10a; 133 answered 'Yes' to the question 'Have you ever known how to play a musical instrument?'

Participants (10b). Experiment 10b was completed by 412 participants, of whom 212 were female (mean age of 36.7 years, s.d. = 10.99 years); 220 answered 'Yes' to the question 'Have you ever known how to play a musical instrument?'

Procedure (both). Participants on Mechanical Turk were instructed that they would hear short recordings of people speaking: 'These speakers are well-known. They are actors, politicians, singers, TV personalities, etc. Some of the voices have been manipulated in various ways. Your task is to identify as many of the speakers as you can. You will only be able to hear each audio sample once! If you don't know the name of the speaker, but you recognize their voice, just describe how it is familiar to you. e.g., What character does this actor play? What is this person's profession? etc.' Participants typed their responses into a box provided on the screen.

Stimuli. For both experiments, overlapping subsets of 40 recognizable celebrity voices were chosen; 24 of the 40 voices were used for experiment 10a and 39 of the 40 voices for experiment 10b. The exact number of voices used in each experiment was determined by the number of conditions. The full list of celebrity voices is provided in Supplementary Table 3. Clean recordings of these celebrities' voices were found using publicly available videos, radio interviews, podcasts and so on. Four seconds of speech were selected for each celebrity. For experiment 10a, harmonic voices were resynthesized (using STRAIGHT) to have a pitch shift of −12, −6, −3, 0, 3, 6 or 12 semitones. Participants heard each voice only once (in one of the conditions) and heard four examples of each pitch shift condition. For experiment 10b, there were three conditions: Harmonic, Inharmonic and Whispered. The speech reported in the main results was not high-pass filtered (section 'Logic of stimulus filtering'), although an identical pattern of results was found for filtered speech (filtered using the same procedure described above; Supplementary Fig. 9). This experiment contained additional conditions not analysed here. Participants heard three trials for each condition. All voices recognized correctly on fewer than 10% of trials were excluded from the analysis to avoid floor effects; this removed 9 of 28 (32%) voices from experiment 10a and 17 of 39 (44%) voices from experiment 10b. The excluded voices did not contribute to the scores for a participant in a condition. In some cases this eliminated all the voices in a condition for a participant, in which case that participant's score was excluded from the mean score for that condition. No participants were fully

excluded from analysis as a consequence of this threshold. The inclusion of these poorly recognized voices in the analysis did not alter the qualitative pattern of results across conditions, although it lowered overall performance.

Experiment 11: Novel voice discrimination. Procedure. Participants heard three 1 s samples of speech and were asked to identify the sample spoken by a different speaker than the other two (first or last). The two samples spoken by the same speaker were distinct (taken from different sentences). Following each trial, participants clicked a button to select one of two responses: 'First different' or 'Last different'. Participants completed 48 trials for each condition in a single session.

Stimuli. For each trial, 1 s speech excerpts were randomly chosen from the TIMIT training database. The excerpts were presented sequentially, with a half-second pause between each excerpt. Two excerpts were produced by the same speaker and one by another speaker of the same gender and from the same dialect region. There were three conditions (Harmonic, Inharmonic and Whispered), the stimuli for which were all synthesized with STRAIGHT as described above and were not high-pass filtered (section 'Logic of stimulus filtering'), although equivalent results were found for filtered speech (Supplementary Fig. 9).

Life Sciences Reporting Summary. Further information on experimental design is available in the Life Sciences Reporting Summary.

Code availability. Custom code for synthesizing inharmonic versions of speech and other natural sounds is available at http://mcdermottlab.mit.edu/downloads.html.

Data availability. All data are provided in Supplementary Table 4.

Received: 21 June 2017; Accepted: 8 November 2017; Published: xx xx xxxx

References
1. Helmholtz, H. L. F. On the Sensations of Tone (Longmans, Green, & Co., London, 1875).
2. Rayleigh, W. S. Theory of Sound (Macmillan, London, 1896).
3. von Békésy, G. Experiments in Hearing (McGraw-Hill, New York, NY, 1960).
4. Plack, C., Oxenham, A., Fay, R. & Popper, A. Pitch: Neural Coding and Perception Vol. 24 (Springer, New York, NY, 2005).
5. de Cheveigné, A. in Pitch: Neural Coding and Perception (eds Plack, C. J., Oxenham, A. J., Fay, R. & Popper, A.) 169–233 (Springer, New York, NY, 2005).
6. Licklider, J. C. R. 'Periodicity' pitch and 'place' pitch. J. Acoust. Soc. Am. 26, 945 (1954).
7. Schouten, J. F., Ritsma, R. J. & Cardozo, B. L. Pitch of the residue. J. Acoust. Soc. Am. 34, 1418–1424 (1962).
8. Meddis, R. & Hewitt, M. J. Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. J. Acoust. Soc. Am. 89, 2866–2882 (1991).
9. Cariani, P. & Delgutte, B. Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. J. Neurophysiol. 76, 1698–1716 (1996).
10. Shamma, S. & Klein, D. The case of the missing pitch templates: how harmonic templates emerge in the early auditory system. J. Acoust. Soc. Am. 107, 2631–2644 (2000).
11. Goldstein, J. L. An optimum processor theory for the central formation of the pitch of complex tones. J. Acoust. Soc. Am. 54, 1496–1516 (1973).
12. Terhardt, E. Calculating virtual pitch. Hear. Res. 1, 155–182 (1979).
13. Kaernbach, C. & Demany, L. Psychophysical evidence against the autocorrelation theory of auditory temporal processing. J. Acoust. Soc. Am. 104, 2298–2306 (1998).
14. Bernstein, J. G. W. & Oxenham, A. J. The relationship between frequency selectivity and pitch discrimination: sensorineural hearing loss. J. Acoust. Soc. Am. 120, 3929–3945 (2006).
15. Bendor, D. & Wang, X. The neuronal representation of pitch in primate auditory cortex. Nature 436, 1161–1165 (2005).
16. Feng, L. & Wang, X. Harmonic template neurons in primate auditory cortex underlying complex sound processing. Proc. Natl Acad. Sci. USA 114, E840–E848 (2017).
17. Fishman, Y. I., Micheyl, C. & Steinschneider, M. Neural representation of harmonic complex tones in primary auditory cortex of the awake monkey. J. Neurosci. 33, 10312–10323 (2013).
18. Bizley, J. K., Walker, K. M. M., King, A. J. & Schnupp, J. W. H. Neural ensemble codes for stimulus periodicity in auditory cortex. J. Neurosci. 30, 5078–5091 (2010).
19. Patterson, R. D., Uppenkamp, S., Johnsrude, I. S. & Griffiths, T. D. The processing of temporal pitch and melody information in auditory cortex. Neuron 36, 767–776 (2002).
20. Penagos, H., Melcher, J. R. & Oxenham, A. J. A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. J. Neurosci. 24, 6810–6815 (2004).
21. Norman-Haignere, S., Kanwisher, N. & McDermott, J. H. Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. J. Neurosci. 33, 19451–19469 (2013).
22. Allen, E. J., Burton, P. C., Olman, C. A. & Oxenham, A. J. Representations of pitch and timbre variation in human auditory cortex. J. Neurosci. 37, 1284–1293 (2017).
23. Tang, C., Hamilton, L. S. & Chang, E. F. Intonational speech prosody encoding in the human auditory cortex. Science 357, 797–801 (2017).
24. Faulkner, A. Pitch discrimination of harmonic complex signals: residue pitch or multiple component discriminations? J. Acoust. Soc. Am. 78, 1993–2004 (1985).
25. Moore, B. C. J. & Glasberg, B. R. Frequency discrimination of complex tones with overlapping and non-overlapping harmonics. J. Acoust. Soc. Am. 87, 2163–2177 (1990).
26. Micheyl, C., Divis, K., Wrobleski, D. M. & Oxenham, A. J. Does fundamental-frequency discrimination measure virtual pitch discrimination? J. Acoust. Soc. Am. 128, 1930–1942 (2010).
27. Micheyl, C., Ryan, C. M. & Oxenham, A. J. Further evidence that fundamental-frequency difference limens measure pitch discrimination. J. Acoust. Soc. Am. 131, 3989–4001 (2012).
28. Latinus, M. & Belin, P. Human voice perception. Curr. Biol. 21, R143–R145 (2011).
29. McDermott, J. H. & Oxenham, A. J. Music perception, pitch, and the auditory system. Curr. Opin. Neurobiol. 18, 452–463 (2008).
30. Roberts, B. & Holmes, S. D. Grouping and the pitch of a mistuned fundamental component: effects of applying simultaneous multiple mistunings to the other harmonics. Hear. Res. 222, 79–88 (2006).
31. Houtsma, A. J. M. & Smurzynski, J. Pitch identification and discrimination for complex tones with many harmonics. J. Acoust. Soc. Am. 87, 304 (1990).
32. Shackleton, T. M. & Carlyon, R. P. The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. J. Acoust. Soc. Am. 95, 3529–3540 (1994).
33. Micheyl, C., Delhommeau, K., Perrot, X. & Oxenham, A. J. Influence of musical and psychoacoustical training on pitch discrimination. Hear. Res. 219, 36–47 (2006).
34. Pressnitzer, D. & Patterson, R. D. in Physiological and Psychophysical Bases of Auditory Function (eds Breebart, D. J., Houtsma, A. J. M., Kohlrausch, A., Prijs, V. F. & Schoonoven, R.) 97–104 (Shaker Publishing, Maastricht, 2001).
35. Norman-Haignere, S. & McDermott, J. H. Distortion products in auditory fMRI research: measurements and solutions. NeuroImage 129, 401–413 (2016).
36. Dowling, W. J. & Fujitani, D. S. Contour, interval, and pitch recognition in memory for melodies. J. Acoust. Soc. Am. 49, 524–531 (1971).
37. Kawahara, H. STRAIGHT, exploitation of the other aspect of VOCODER: perceptually isomorphic decomposition of speech sounds. Acoust. Sci. Technol. 27, 349–353 (2006).
38. Kawahara, H. et al. TANDEM-STRAIGHT: a temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. Sadhana 36, 713–722 (2011).
39. McDermott, J. H., Ellis, D. P. & Kawahara, H. Inharmonic speech: a tool for the study of speech perception and separation. SAPA@Interspeech, 114–117 (2012).
40. Sloboda, J. A. The Musical Mind: The Cognitive Psychology of Music (Oxford Univ. Press, Oxford, 1985).
41. Peretz, I., Champod, A. S. & Hyde, K. Varieties of musical disorders: the Montreal Battery of Evaluation of Amusia. Ann. NY Acad. Sci. 999, 58–75 (2003).
42. Temperley, D. A probabilistic model of melody perception. Cogn. Sci. 32, 418–444 (2008).
43. McDermott, J. H., Keebler, M. V., Micheyl, C. & Oxenham, A. J. Musical intervals and relative pitch: frequency resolution, not interval resolution, is special. J. Acoust. Soc. Am. 128, 1943–1951 (2010).
44. Garofolo, J. S. et al. TIMIT Acoustic–Phonetic Continuous Speech Corpus LDC93S1 (Linguistic Data Consortium, PA, 1993).
45. Marques, C., Moreno, S., Castro, S. L. & Besson, M. Musicians detect pitch violation in a foreign language better than nonmusicians: behavioral and electrophysiological evidence. J. Cogn. Neurosci. 19, 1453–1463 (2007).
46. Tervaniemi, M., Just, V., Koelsch, S., Widmann, A. & Schröger, E. Pitch discrimination accuracy in musicians vs nonmusicians: an event-related potential and behavioral study. Exp. Brain Res. 161, 1–10 (2005).
47. Schneider, P. et al. Structural and functional asymmetry of lateral Heschl's gyrus reflects pitch perception preference. Nat. Neurosci. 8, 1241–1247 (2005).
48. McDermott, J. H., Lehr, A. J. & Oxenham, A. J. Is relative pitch specific to pitch? Psychol. Sci. 19, 1263–1271 (2008).
49. Borchert, E. M. O., Micheyl, C. & Oxenham, A. J. Perceptual grouping affects pitch judgments across time and frequency. J. Exp. Psychol. Hum. Percept. Perform. 37, 257–269 (2011).
50. Warrier, C. M. & Zatorre, R. J. Influence of tonal context and timbral variation on perception of pitch. Percept. Psychophys.
64, 198–207 (2002).

NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

51. Demany, L., Pressnitzer, D. & Semal, C. Tuning properties of the auditory frequency-shift detectors. J. Acoust. Soc. Am. 126, 1342–1348 (2009).
52. Chambers, C. et al. Prior context in audition informs binding and shapes simple features. Nat. Commun. 8, 15027 (2017).
53. Bregman, M. R., Patel, A. D. & Gentner, T. Q. Songbirds use spectral shape, not pitch, for sound pattern recognition. Proc. Natl Acad. Sci. USA 113, 1666–1671 (2016).
54. Gockel, H. E., Carlyon, R. & Plack, C. Across-frequency interference effects in fundamental frequency discrimination: questioning evidence for two pitch mechanisms. J. Acoust. Soc. Am. 116, 1092–1104 (2004).
55. Trainor, L. J., Desjardins, R. N. & Rockel, C. A comparison of contour and interval processing in musicians and nonmusicians using event-related potentials. Aust. J. Psychol. 51, 147–153 (1999).
56. McDermott, J. H., Lehr, A. J. & Oxenham, A. J. Individual differences reveal the basis of consonance. Curr. Biol. 20, 1035–1041 (2010).
57. Moore, B. C., Glasberg, B. R. & Peters, R. W. Thresholds for hearing mistuned partials as separate tones in harmonic complexes. J. Acoust. Soc. Am. 80, 479–483 (1986).
58. Hartmann, W. M., McAdams, S. & Smith, B. K. Hearing a mistuned harmonic in an otherwise periodic complex tone. J. Acoust. Soc. Am. 88, 1712–1724 (1990).
59. Roberts, B. & Bailey, P. J. Spectral regularity as a factor distinct from harmonic relations in auditory grouping. J. Exp. Psychol. Hum. Percept. Perform. 22, 604–614 (1996).
60. Schön, D., Magne, C. & Besson, M. The music of speech: music training facilitates pitch processing in both music and language. Psychophysiology 41, 341–349 (2004).
61. Norman-Haignere, S., Kanwisher, N. G. & McDermott, J. H. Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron 88, 1281–1296 (2015).
62. Patel, A. D. Can nonlinguistic musical training change the way the brain processes speech? The expanded OPERA hypothesis. Hear. Res. 308, 98–108 (2014).
63. Brainard, D. H. The psychophysics toolbox. Spat. Vis. 10, 433–436 (1997).
64. Woods, K. J. P., Siegel, M. H., Traer, J. & McDermott, J. Headphone screening to facilitate web-based auditory experiments. Atten. Percept. Psychophys. 79, 2064–2072 (2017).

Acknowledgements
The authors thank C. Micheyl, K. Walker, B. Delgutte and the McDermott laboratory for comments on an earlier draft of this paper, D. Temperley for sharing code to generate melodies, C. Wang for assistance collecting data, V. Zhao for assistance selecting the Mandarin word pairs for experiment 5, and K. Woods for help implementing Mechanical Turk paradigms. This work was supported by a McDonnell Foundation Scholar Award to J.H.M., a National Institutes of Health (NIH) grant (1R01DC014739-01A1) to J.H.M., an NIH National Institute on Deafness and Other Communication Disorders training grant (T32DC000038) in support of M.J.M. and a National Science Foundation (NSF) Graduate Research Fellowship to M.J.M. The funding agencies were not otherwise involved in the research, and any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the McDonnell Foundation, NIH or NSF.

Author contributions
M.J.M. designed the experiments, collected and analysed data and wrote the paper. J.H.M. designed the experiments and wrote the paper.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41562-017-0261-8.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to M.J.M.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION
https://doi.org/10.1038/s41562-017-0261-8

In the format provided by the authors and unedited.

Diversity in pitch perception revealed by task dependence

Malinda J. McPherson1,2* and Josh H. McDermott1,2

1Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA. 2Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA, USA. *e-mail: [email protected]

[Supplementary Figure 1: two columns of panels ("Single Note Autocorrelations" and "Average Autocorrelations") for 10%, 20%, 30%, 40% and 50% jitter; each panel plots correlation (r), from -1 to 1, against lag (0–8 ms). A schematic spectrogram (frequency vs. time) accompanies the 30% jitter row.]
Supplementary Figure 1: Effect of jitter magnitude on the autocorrelation function. (A) Waveform autocorrelations for complex tones with an F0 of 250 Hz whose component frequencies were randomly jittered by different amounts. Jitter values were sampled from uniform distributions: U(-.1, .1), U(-.2, .2), U(-.3, .3), U(-.4, .4) and U(-.5, .5), the last of which was used in the main experiments. Jittered frequency values were obtained by multiplying the sampled jitter value by the F0 and adding the result to the frequency of the original harmonic. (B) Averaged autocorrelations of 1,000 250 Hz complex tones with frequencies randomly jittered by different amounts29. Averaging over a large number of stimuli with different jitter patterns reveals any residual periodicity in the tones. Jittering harmonic frequencies by 10% has little effect on the presence of a defined peak in the autocorrelation, but the size of this peak decreases as the jitter increases, disappearing by 50% jitter.
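The jittering procedure described in the caption above can be sketched as follows. This is a minimal illustrative reimplementation, not the authors' code; the function and variable names are our own.

```python
import numpy as np

def jitter_harmonics(f0, n_harmonics=10, max_jitter=0.5, rng=None):
    """Return jittered component frequencies for a complex tone.

    As described in the legend of Supplementary Figure 1, each jitter
    value is sampled from U(-max_jitter, max_jitter), multiplied by the
    F0, and added to the frequency of the original harmonic.
    """
    rng = np.random.default_rng() if rng is None else rng
    harmonics = f0 * np.arange(1, n_harmonics + 1)   # f0, 2*f0, 3*f0, ...
    jitters = rng.uniform(-max_jitter, max_jitter, size=n_harmonics) * f0
    return harmonics + jitters

# With max_jitter = 0.5 (the value used in the main experiments), every
# component lies within half an F0 (here 125 Hz) of its harmonic position.
freqs = jitter_harmonics(250.0, n_harmonics=8, max_jitter=0.5)
```

Note that this sketch jitters each component independently; constraints on minimum component spacing used in the actual stimuli are omitted.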

[Supplementary Figure 2: (a) power spectra (dB vs. frequency, 0–5,000 Hz) of example inharmonic tones with regular and roved masking noise, with the third and fourth harmonics labeled; (b) percent correct vs. pitch difference (.1, .25, 1 and 2 semitones) for the Regular Inharmonic Noise and Roved Inharmonic conditions (n = 30, N.S.).]

Supplementary Figure 2: Example stimuli and results from a control experiment with roved noise (control version of Experiment 1, pitch discrimination of inharmonic tones). (A) Power spectra of two example tones from a control experiment in which the masking noise level was roved. Procedures and parameters were identical to those of Experiment 1 (same note range, pitch differences and number of trials), except that only Inharmonic tones were tested, with and without roved masking noise levels. The condition without roving was identical to that used in Experiment 1. The condition with roving was identical to the regular noise condition except that the masking noise for one of the tones (randomly selected) was increased in level by 12 dB. This ensured that the lowest audible harmonic differed between the two tones. (B) Pitch discrimination results for inharmonic tones with and without roved noise. Error bars denote standard error of the mean. A repeated-measures ANOVA was used to assess statistical significance. There was no main effect of noise roving (F(1,29) = 1.41, p = .245), suggesting that listeners were not performing the task by tracking the lowest audible harmonic.

[Supplementary Figure 3: (a) schematic of the task ("Which excerpt is different from the second one, the first or last?"), plotting F0 vs. time; (b) percent correct vs. FM amplitude (5, 15 and 25% of F0) for the Harmonic and Inharmonic conditions (n = 30, N.S.).]

Supplementary Figure 3: Task and results from the frequency-modulated contour discrimination experiment (control version of Experiment 4). (A) Schematic of the FM Speech Contour Discrimination task. The procedure was identical to that of Experiment 4 except that the F0 contour of the speech segment selected for a trial was extracted using STRAIGHT and used to synthesize harmonic and inharmonic complex tones with the F0 contour of the speech utterance. Participants heard three one-second tone contours, the first or last of which had a random frequency modulation added to its F0 contour (the added FM was white noise bandpass-filtered between 1 and 2 Hz, with a modulation depth that varied across conditions). Participants were asked whether the first or last tone differed from the second tone, which was transposed up in pitch on all trials. We applied a low-pass filter to the frequency-modulated tones to approximate the natural falloff in amplitude of higher harmonics in speech; this filter was a logistic function in the frequency domain with an inflection point at 4,000 Hz and a slope of 40 dB per octave at the inflection point. Because there are breaks in the F0 contour of speech (during unvoiced segments), we applied 10 ms half-Hanning windows at the onsets and offsets of each voiced segment. Voiced segments shorter than 20 ms were replaced with silence. Stimuli were high-pass filtered and masked as described in the section on high-pass filtering and masking noise in the Methods. (B) Results from the FM contour discrimination experiment. Error bars denote standard error of the mean. A repeated-measures ANOVA was used to assess statistical significance. There was no main effect of harmonicity (F(1,29) = .462, p = 0.502). (Compare to Figure 4c, which obtained similar results with speech stimuli.)
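The FM manipulation in this control experiment (bandpass-filtered noise added to an F0 contour) can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code: the contour sampling rate, filter order and all names are our own choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def add_fm_to_contour(f0_contour, fs, fm_depth_pct, rng=None):
    """Add slow random frequency modulation to an F0 contour.

    The modulator is white noise bandpass-filtered between 1 and 2 Hz,
    normalized, and scaled so that its peak depth is a percentage of
    the mean F0 (an assumed interpretation of 'modulation depth').
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(f0_contour))
    # Second-order Butterworth bandpass, 1-2 Hz, at contour rate fs.
    sos = butter(2, [1.0, 2.0], btype='bandpass', fs=fs, output='sos')
    fm = sosfiltfilt(sos, noise)
    fm = fm / np.max(np.abs(fm))                 # normalize to +/-1
    depth = (fm_depth_pct / 100.0) * np.mean(f0_contour)
    return f0_contour + depth * fm

fs = 1000.0                                      # contour sample rate (Hz)
contour = np.full(1000, 200.0)                   # flat 200 Hz contour, 1 s
modulated = add_fm_to_contour(contour, fs, fm_depth_pct=25)
```

The resulting contour would then drive tone synthesis; the logistic low-pass spectral envelope and voiced-segment windowing described above are omitted here.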

[Supplementary Figure 4: (a) power spectra of 200 Hz and 400 Hz tones with a moving spectral envelope (dB vs. frequency, 0–3,000 Hz); (b) percent correct vs. pitch difference (.1, .25, 1 and 2 semitones) for Harmonic, Inharmonic and Inharmonic-Changing conditions without a fixed filter (n = 29); (c) schematic of the Sour Note Detection task ("Does the melody contain a mistake?"); (d) ROC area for sour note detection without a fixed filter (n = 19); (e) schematic of the Interval Pattern Discrimination task ("Are the patterns identical or different?"); (f) ROC area for interval discrimination without a fixed filter (n = 13).]

Supplementary Figure 4: Stimuli and results for control versions of pitch discrimination, sour note detection and interval discrimination with unfiltered notes. (A) Example notes from the pitch discrimination experiment without note filtering. Power spectra are shown for 200 Hz and 400 Hz tones. The harmonics of each note decreased in amplitude by 16 dB/octave, but were not passed through the fixed bandpass filter used in Experiments 1 and 9. The center of mass of the spectrum, and its lower edge, thus shifted with the pitch of each note, rendering the shift direction unambiguous. (B) Results for pitch discrimination with unfiltered notes. In contrast to Experiment 1, performance was similar across conditions, indicating that listeners could perceive the shift direction irrespective of the correspondence between the harmonics of successive notes, presumably because the direction is signaled by the spectral envelope. Error bars denote standard error of the mean. (Compare to Figures 2d and 2f.) (C) Schematic of the Sour Note Detection task. Participants judged whether the melody contained a 'sour' (out-of-key) note. (D) Results for sour note detection with unfiltered notes. The pattern of performance was similar to that of Experiment 7, where a fixed spectral envelope was used, suggesting that any ambiguity in the pitch shift direction in the stimuli of Experiment 7 was not critical to the main result. A participant with performance less than .55 across all three conditions was removed from the analysis. Error bars denote standard error of the mean. Paired t-tests were used to assess statistical significance. (Compare to Figure 6b.) (E) Schematic of the Interval Pattern Discrimination task. Participants judged whether two three-note tone sequences were the same or different. The second sequence was always shifted up in pitch relative to the first. On 'different' trials (pictured), the two stimuli had the same contour, but the second stimulus was altered so that its intervals differed by 1 semitone from the corresponding intervals in the first stimulus. (F) Results of interval pattern discrimination for unfiltered notes. The experiment replicated the main effect of Experiment 8, where a fixed spectral envelope was used, suggesting that any ambiguity in the pitch shift direction in the stimuli of Experiment 8 was not critical to the main result. Participants with performance less than .55 across both conditions were removed from the analysis (7 of 20 participants). Error bars denote standard error of the mean. Paired t-tests were used to assess statistical significance. (Compare to Figure 6d.)

80 N.S. *** Speaker 1 70 Speaker 2 Percent Correct Percent 60 F0 50

40 Harmonic Inharmonic Whispered Time Mandarin Phoneme Perception (Additional Analysis, Experiment 5) 100 d N=32 c N.S. N.S. 90 What word did you hear? Correct Response: 80 po1 “pō” 70 Incorrect response: Percent Correct wo1 60

50 . Harmonic Inharmonic Whispered

Supplementary Figure 5: Task description and results for voice discrimination control experiment with pitch-matched voices, and mandarin phoneme perception analysis (A) Schematic of experimental paradigm. The experiment was identical to Experiment 11, except that speech excerpts were matched for mean and variance of the F0 contour. Three speech files were selected, and the mean and standard deviation of their pitch contours were equated before resynthesis. There were 64 trials per condition (192 trials per subject). (B) Discrimination performance for pitch-matched voices. Paired t-tests were used to assess statistical significance. (C) Schematic of experimental paradigm – the same experiment as Experiment 5 (Mandarin Tone Perception) in the main text. (D) Results from Mandarin Phoneme Perception. Results from the Mandarin Tone Perception experiment were scored for phoneme accuracy in addition to tone accuracy. For example, if a participant heard the word ‘cuo4’, but typed in ‘cuo1’, their response would be scored as correct for phoneme but incorrect for tone. If they typed in ‘duo4’, they would be marked incorrect for phoneme but correct for tone. Paired t-tests were used to assess statistical significance. *** p<0.001 ** p<0.01 Musicians Non-Musicians N.S. p>0.05 Exp. 1: Pitch Discrimination, Synthetic Tones Exp. 1: Pitch Discrimination, Synthetic Tones a 100 N=15 100 Harmonic N=15 Inharmonic 90 90 Inharmonic - Changing

80 Harmonic 80 Inharmonic 70 Inharmonic - Changing 70 Percent Correct Percent 60 Correct Percent 60

50 50

40 40 .1 .25 1 2 .1 .25 1 2 Difference (Semitones) Difference (Semitones) b Exp. 2: Pitch Discrimination - Instruments Exp 2: Pitch Discrimination - Instruments 100 Harmonic 100 Harmonic Inharmonic Inharmonic Inharmonic-Changing 90 90 Inharmonic-Changing

80 80

70 70 Percent Correct Percent Correct 60 60 N=15 N=15 50 50

40 40 .1 .25 1 2 .1 .25 1 2 c Difference (Semitones) Difference (Semitones) Exp. 3: Melodic Contour Discrimination Exp 3: Melodic Contour Discrimination 100 100 N=15 N.S. N=14 90 90 ***

80 80 N.S. 70 *** 70 ROC Area ROC Area 60 60

50 50

40 40 Harmonic Inharmonic Inharm.-Changing Harmonic Inharmonic Inharm.-Changing Exp. 4: Speech Contour Perception Exp 4: Speech Contour Perception d 100 100 Harmonic N=15 Harmonic N=15 Inharmonic 90 Inharmonic 90 N.S.

80 80 N.S. 70 70

60 60 Percent Correct Percent Correct Percent

50 50

40 40 51525 5 15 25 FM Amplitude (% of F0) FM Amplitude (% of F0) Supplementary Figure 6: Comparison of results for musicians and non-musicians for Experiments 1-4. (A) Musician vs. Non-Musician results for Experiment 1 – Pitch Discrimination with Pairs of Synthetic Tones. Here and in other panels, error bars denote standard error of the mean, and conventions are as in figures in main text. (Compare to Figure 2d). (B) Musician vs. Non-Musician results for Experiment 2 – Pitch Discrimination with Pairs of Instrument Notes. (Compare to Figure 2f). (C) Musician vs. Non-Musician results for Experiment 3 – Melodic Contour Discrimination. (Compare to Figure 3b). (D) Musician vs. Non-Musician results for Experiment 4 – Speech Contour Perception. (Compare to Figure 4c). *** p<0.001 ** p<0.01 Musicians Non-Musicians N.S. p>0.05

[Supplementary Figure 7: musician (left column) and non-musician (right column) results for Experiments 6–9 (famous melody recognition, sour note detection, interval pattern discrimination, and basic discrimination with large pitch intervals), plotting percent correct or ROC area for the Harmonic, Inharmonic and Inharmonic-Changing conditions.]

Supplementary Figure 7: Comparison of results for musicians and non-musicians for Experiments 6–9. (A) Musician vs. non-musician results for Experiment 6 (Famous Melody Recognition). Error bars denote standard deviations, calculated via bootstrap. Here and in other panels, conventions are as in the figures in the main text. (Compare to Figure 5b.) (B) Musician vs. non-musician results for Experiment 7 (Sour Note Detection). Error bars denote standard error of the mean. (Compare to Figure 6b.) (C) Musician vs. non-musician results for Experiment 8 (Interval Pattern Discrimination). Error bars denote standard error of the mean. (Compare to Figure 6d.) (D) Musician vs. non-musician results for Experiment 9 (Pitch Discrimination with Pairs of Notes, Larger Step Sizes). Error bars denote standard error of the mean. (Compare to Figure 7b.)

[Supplementary Figure 8: musician (left column) and non-musician (right column) results for Experiments 10a, 10b and 11, plotting percent correct against pitch shift (-12 to +12 semitones) for Experiment 10a and against condition (Harmonic, Inharmonic, Whispered) for Experiments 10b and 11.]

Supplementary Figure 8: Comparison of results for musicians and non-musicians for Experiments 10a, 10b and 11. (A) Musician vs. non-musician results for Experiment 10a (Celebrity Voice Identification with pitch shifts). Error bars denote standard deviations, calculated via bootstrap. Here and in other panels, conventions are as in the figures in the main text. (Compare to Figure 8b.) (B) Musician vs. non-musician results for Experiment 10b (Celebrity Voice Identification with Harmonic, Inharmonic and Whispered speech). Error bars denote standard deviations, calculated via bootstrap. (Compare to Figure 8c.) (C) Musician vs. non-musician results for Experiment 11 (Novel Voice Discrimination). Error bars denote standard error of the mean. (Compare to Figure 8e.)

[Supplementary Figure 9: (a) schematic of the famous speaker recognition task ("Whose voice is this?", with an example response, 'Sounds like Snape from Harry Potter'); (b) percent correct for Harmonic and Inharmonic conditions, unfiltered vs. filtered (n = 412); (c) schematic of the voice discrimination task with filtered speech ("Which speaker only spoke once (first or last)?"); (d) percent correct for Harmonic, Inharmonic and Whispered conditions (n = 30).]

Supplementary Figure 9: Tasks and results for celebrity voice recognition and novel voice discrimination with the F0 filtered out. (A) Famous Speaker Recognition task. Participants on Mechanical Turk heard excerpts of resynthesized speech from celebrities and were asked to identify each speaker by typing their guesses into a text field (same paradigm as Experiments 10a and 10b). (B) Results from Experiment 10b, with additional control conditions (in which a filter was applied to the stimuli) not shown in the main figure. The graph replots results from two of the conditions shown in Figure 8c alongside two additional conditions omitted from the main figure for brevity. In these additional conditions the speech was filtered to eliminate the F0 component, with masking noise added to mask distortion products at the F0. Filtering the speech reduced overall performance, but the effect of inharmonicity was similar. Error bars denote standard deviations, calculated via bootstrap. (C) Novel Voice Discrimination task (same task as Experiment 11). Participants heard three one-second speech utterances, the first or last of which was spoken by a different speaker than the other two. Participants judged which speaker (first or last) spoke only once. (D) Results from a replication of Experiment 11 with a filter applied to the speech excerpts to remove the F0 component. Masking noise was also added to mask distortion products at the F0. Error bars denote standard error of the mean. (Compare to Figure 8e.)
Supplementary Table 1: Mandarin Word List (tone minimal pairs)
bùfá/bùfǎ, pō/pò, cáo/cǎo, qiē/qiě, chǎn/chàn, qīngchú/qīngchu, chāyì/chàyì, qīngtīng/qīngtíng, chuánbō/chuánbó, shànzì/shànzi, cuō/cuò, shēngchǎn/shèngchǎn, dádào/dàdào, tāng/tǎng, dǎsǎo/dàsǎo, tiāndì/tiándì, dàyī/dàyì, tiáojié/tiáojiě, diān/diǎn, tiáolǐ/tiáolì, díshì/dìshì, tóuzī/tóuzi, dǐzhì/dìzhì, tuō/tuó, duó/duò, wō/wò, fúqì/fúqi, wùlǐ/wùlì, fùshǔ/fùshù, wúshù/wǔshù, gǎibiān/gǎibiàn, wúyí/wúyì, gān/gàn, xiézuò/xiězuò, gāochāo/gāocháo, yīwù/yìwù, gūlì/gǔlì, yōuzhì/yòuzhì, guójí/guójì, yǔqí/yǔqì, huólì/huǒlì, zhā/zhà, jiānchá/jiǎnchá, zhǎnxiàn/zhànxiàn, jiāotán/jiāotàn, zháojí/zhàojí, jídù/jìdù, zhèngdāng/zhèngdǎng, jiējiàn/jièjiàn, zhīzhū/zhīzhù, jiēshōu/jiēshòu, zhōngdiǎn/zhòngdiǎn, jīlì/jílì, zhōngduān/zhōngduàn, jízǎo/jízào, zuǒ/zuò, kū/kǔ, zuòwéi/zuòwèi, nánkān/nánkàn, pāizi/páizi, piān/piàn

Supplementary Table 2: Famous Melodies

1. Twinkle, Twinkle, Little Star
2. Ode to Joy (4th movement from Beethoven's 9th Symphony, also called 'Joyful Joyful We Adore Thee')
3. Happy Birthday to You
4. Mary Had a Little Lamb
5. Hey Jude
6. Theme from the 1st movement of Beethoven's 5th Symphony
7. When the Saints Go Marching In
8. Somewhere Over the Rainbow
9. God Save the Queen (My Country 'Tis of Thee)
10. The Star-Spangled Banner
11. Amazing Grace
12. Frère Jacques (Brother John)
13. Hedwig's Theme from Harry Potter
14. Star Wars main theme
15. Indiana Jones theme
16. Here Comes the Bride
17. James Bond theme
18. Old MacDonald Had a Farm
19. The Itsy Bitsy Spider (also accepted 'Little Bunny Foo Foo' and 'Alouette')
20. Yankee Doodle Went to Town
21. Bingo (Bingo Was His Name-O)
22. Jingle Bells
23. The Wheels on the Bus
24. The Muffin Man

Supplementary Table 3: Celebrity Voices

1) Alan Rickman
2) Alex Trebek
3) Arnold Schwarzenegger
4) Barack Obama
5) Betty White
6) Beyonce Knowles Carter
7) Bill Clinton
8) Bill Cosby
9) Bill O'Reilly
10) Brian Williams
11) Cameron Diaz
12) Conan O'Brien
13) Dolly Parton
14) Donald Trump
15) Eddie Murphy
16) Ellen Degeneres
17) Emma Watson
18) Fran Drescher
19) George H.W. Bush
20) George W. Bush
21) Gilbert Gottfried
22) Gordon Ramsay
23) Hillary Clinton
24) James Earl Jones
25)
26) Jennifer Aniston
27) John Stewart
28) Julia Louis-Dreyfus
29) Kayley Cuoco
30) Kelly Ripa
31) Kim Kardashian
32) Matt Damon
33) Morgan Freeman
34) Newt Gingrich
35) Oprah Winfrey
36) Robin Williams
37) Jerry
38) Shia LaBeouf
39) Sofia Vergara
40) Stephen Colbert

nature research | life sciences reporting summary

Corresponding author(s): Malinda McPherson

Life Sciences Reporting Summary

Nature Research wishes to improve the reproducibility of the work that we publish. This form is intended for publication with all accepted life science papers and provides structure for consistency and transparency in reporting. Every life science submission will use this form; some list items might not apply to an individual manuscript, but all fields must be completed for clarity. For further information on the points included in this form, see Reporting Life Sciences Research. For further information on Nature Research policies, including our data availability policy, see Authors & Referees and the Editorial Policy Checklist.

Experimental design

1. Sample size
Describe how sample size was determined.
To establish sample size for main in-lab experiments, we performed power analyses on pilot data. Our final sample size was approximately double the size needed to detect an effect with power of .9 given an alpha of .05 (because we wanted to be able to separately analyze musicians and non-musicians, i.e., split the sample in half). For Mechanical Turk studies, sample sizes were chosen to obtain split-half correlations of at least .9 in the pattern of mean performance across conditions.

2. Data exclusions
Describe any data exclusions.
For experiment 3, participants with performance >95% correct averaged across Harmonic and Inharmonic conditions (13 of 29 subjects) were removed from analysis to avoid ceiling effects; no participants were close to floor on this experiment. For experiment 8, due to its difficulty, participants with performance less than .55 across both conditions (12 of 30 participants) were excluded from analysis. For Experiments 10a–b, all voices recognized correctly on less than 10% of trials were excluded from the analysis to avoid floor effects; this removed 9 of 28 (32%) voices from Experiment 10a and 17 of 39 (44%) voices from Experiment 10b. Exclusion criteria were not pre-established.

3. Replication
Describe whether the experimental findings were reliably reproduced.
Every experiment was piloted to establish sample sizes, and then re-run to obtain the data presented in the paper. All experimental findings replicated.

4. Randomization
Describe how samples/organisms/participants were allocated into experimental groups.
N/A

5. Blinding
Describe whether the investigators were blinded to group allocation during data collection and/or analysis.
For Experiments 5, 6, and 10, answers were coded for analysis by the first author, blind to the experimental condition.
Note: all studies involving animals and/or human research participants must disclose whether blinding and randomization were used.

June 2017

6. Statistical parameters
For all figures and tables that use statistical methods, confirm that the following items are present in relevant figure legends (or in the Methods section if additional space is needed). For each item, indicate n/a or Confirmed:

- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.)
- A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- A statement indicating how many times each experiment was replicated
- The statistical test(s) used and whether they are one- or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section)
- A description of any assumptions or corrections, such as an adjustment for multiple comparisons
- The test results (e.g., P values) given as exact values whenever possible and with confidence intervals noted
- A clear description of statistics including central tendency (e.g., median, mean) and variation (e.g., standard deviation, interquartile range)
- Clearly defined error bars

See the web collection on statistics for biologists for further resources and guidance.

Software
Policy information about availability of computer code

7. Software
Describe the software used to analyze the data in this study.
All ANOVAs were performed using SPSS (Version 24). MATLAB (R2016b) was used for all other analyses.

For manuscripts utilizing custom algorithms or software that are central to the paper but not yet described in the published literature, software must be made available to editors and reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

Materials and reagents
Policy information about availability of materials

8. Materials availability
Indicate whether there are restrictions on availability of unique materials or if these materials are only available for distribution by a for-profit company.
N/A

9. Antibodies
Describe the antibodies used and how they were validated for use in the system under study (i.e. assay and species).
N/A

10. Eukaryotic cell lines
a. State the source of each eukaryotic cell line used.
N/A
b. Describe the method of cell line authentication used.
N/A
c. Report whether the cell lines were tested for mycoplasma contamination.
N/A
d. If any of the cell lines used are listed in the database of commonly misidentified cell lines maintained by ICLAC, provide a scientific rationale for their use.
N/A

Animals and human research participants
Policy information about studies involving animals; when reporting animal research, follow the ARRIVE guidelines

11. Description of research animals
Provide details on animals and/or animal-derived materials used in the study.
N/A

Policy information about studies involving human research participants

12. Description of human research participants
Describe the covariate-relevant population characteristics of the human research participants.

In-lab experiments:
Thirty participants (16 female, mean age of 32.97 years, SD = 16.61) participated in Experiments 1, 3, 4, 7, 8, 9, and 11. These participants were run in the lab for three two-hour sessions each (except one participant, who was unable to return for her final two-hour session).

Thirty participants (11 female, mean age of 37.15 years, SD = 13.36) participated in Experiment 2, as well as in the control experiment detailed in Supplementary Figure 2.

Twenty participants (8 female, mean age of 38.14 years, SD = 15.75) participated in the two follow-up experiments detailed in Supplementary Figure 4c-f.

Fourteen participants (5 female, mean age of 40 years, SD = 13.41) completed the control experiment described in Supplementary Figure 5a-b.

Mechanical Turk experiments:
32 participants (17 female, mean age of 36.7 years, SD = 10.99) were tested on Amazon Mechanical Turk for Experiment 5.
322 participants (143 female, mean age of 36.49 years, SD = 11.77) completed Experiment 6 on Amazon Mechanical Turk.
248 participants (123 female, mean age of 35.95 years, SD = 9.89) completed Experiment 10a on Amazon Mechanical Turk.
412 participants (212 female, mean age of 36.7 years, SD = 10.99) completed Experiment 10b on Amazon Mechanical Turk.
