Inharmonic Speech: A Tool for the Study of Speech Perception and Separation∗

Josh H. McDermott Daniel P. W. Ellis Hideki Kawahara

Center for Neural Science, New York University, USA ([email protected])
Dept. Elec. Eng., Columbia University, USA ([email protected])
Faculty of Systems Engineering, Wakayama University, Japan ([email protected])

Abstract

Sounds created by a periodic process have a Fourier representation with harmonic structure – i.e., components at multiples of a fundamental frequency. Such harmonic frequency relations are a prominent feature of speech and many other natural sounds. Harmonicity is closely related to the perception of pitch and is believed to provide an important acoustic grouping cue underlying sound segregation. Here we introduce a method to manipulate the harmonicity of otherwise natural-sounding speech tokens, providing the stimuli needed to study the role of harmonicity in speech perception. Our algorithm utilizes elements of the STRAIGHT framework for speech manipulation and synthesis, in which a recorded speech utterance is decomposed into voiced and unvoiced vocal excitation and vocal tract filtering. Unlike the conventional STRAIGHT method, we model voiced excitation as a combination of time-varying sinusoids, the frequencies of which can then be individually modified. By shifting, stretching, or jittering the harmonic frequencies, we introduce inharmonic excitation without changing other aspects of the speech signal. The resulting signal remains highly intelligible, and can be used to assess the role of harmonicity in the perception of prosody or in the segregation of speech from mixtures of talkers.

Index Terms: speech synthesis, harmonicity, sound segregation

∗ This work was supported in part by the National Science Foundation (NSF) via grant IIS-1117015, by Grants-in-Aid for Scientific Research 22650042 from JSPS, and by the Howard Hughes Medical Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

1. Introduction

Human speech recognition is remarkable for its robustness to background noise. Our ability to recognize target utterances from mixtures with other sound sources sets humans apart from state-of-the-art speech recognition systems [1], which typically perform well in quiet but are adversely affected by the presence of additional sound sources. The robustness of human recognition to competing sounds reflects our ability to segregate individual sources – to separate the energy produced by a target source from that produced by other sources [2].

Human sound segregation relies in part on acoustic grouping cues – sound properties that are characteristic of individual natural sound sources such as speech [3, 4], and that can be used to infer groupings of sound energy from a mixture of sources. Harmonic frequency relations are believed to be among the most powerful of such cues. Harmonicity is the frequency-domain analogue of the periodicity that characterizes many natural sounds, including voiced speech. Periodicity produces frequency components that are multiples of the fundamental frequency (f0), a relationship known as harmonicity. Frequency components that are harmonically related are generally heard as a single sound with a common pitch, and mistuning a single component of a harmonic series by as little as 1% causes it to be heard as a distinct sound [5]. Moreover, two concurrent tones with different f0s are typically heard as two distinct sources [6].

Machine systems that attempt to replicate human segregation abilities also make use of harmonicity. Computational auditory scene analysis (CASA) systems typically compute a measure of periodicity and f0 within local time-frequency cells and then group cells in part based on the consistency of the f0 estimates. CASA systems in fact rely more strongly on harmonicity than on common onset, the other main bottom-up grouping cue believed to underlie human segregation [7, 8].

Despite the widespread assumption that harmonicity is critical to sound segregation, its role in the segregation of real-world sounds such as speech remains largely untested.
Given the potential importance of spectrotemporal sparsity in the segregation of complex natural sounds [9, 10], it is conceivable that the most important role of harmonicity could simply be to produce discrete frequency components, the sparsity of which reduces masking and could facilitate common onset and other grouping cues. Moreover, experimental results with artificial stimuli have raised questions about whether harmonicity is in fact critical for segregation. Mistuned frequency components of complex tones can be detected even when the frequencies of all components are increased by a fixed amount, or when the complex is "stretched" such that adjacent components are no longer separated by a fixed number of Hz [11]. Although such tones are inharmonic (lacking a fundamental frequency common to all the components they contain), component mistuning detection thresholds are comparable to those for harmonic tones. This result suggests that various forms of spectral regularity, rather than harmonicity per se, could be most critical to segregation.

The most relevant and convincing test of the importance of harmonic structure in segregation would arguably be to assess the segregation of real-world sounds compared to that of inharmonic equivalents that are matched in other respects. To usefully alter the harmonicity of natural sounds, however, one must avoid disrupting other aspects of the sound structure. Speech in particular is difficult to manipulate in this manner, as it consists of a highly structured interaction of periodic and noise excitation with vocal tract filtering, to which humans are exquisitely sensitive. To explore the role of harmonicity in speech perception and segregation, we devised a method to generate inharmonic versions of recorded speech utterances. We used the framework of STRAIGHT, a powerful tool for representing and manipulating speech [12, 13].

2. Methods

Spectral envelopes extracted by STRAIGHT were used to set the amplitudes of time-varying sinusoids. This section outlines the original STRAIGHT procedure and its extension to enable sinusoidal modeling of excitation, which we used to generate inharmonic voicing.

2.1. STRAIGHT Framework

Spectral envelope estimation in STRAIGHT consists of a two-stage procedure to eliminate interference from the periodic speech excitation on the representation of spectral power [13]. In the first stage, temporal interference is eliminated by averaging power spectra calculated at two time points separated by half a pitch period. In the second stage, spectral interference is eliminated by spectral smoothing using an f0-adaptive rectangular smoother, followed by post-processing to preserve harmonic component levels based on consistent sampling theory. These frequency-domain procedures are implemented with cepstral liftering. More details are provided in [14].

Speech excitation estimation also relies on a temporally stable representation of the power spectrum, and combines this with a temporally stable representation of instantaneous frequency [15]. Excitation is represented using a time-varying fundamental frequency f0(t) (for the voiced, deterministic component of excitation) and time-varying parameters describing colored noise (for the unvoiced, random component of excitation). The two components are combined using a sigmoid function (defined by the boundary frequency between the voiced and unvoiced components, and a transition slope) [16].

The synthesis procedure of the basic STRAIGHT framework generates excitation with a pulse-plus-noise model. A linear phase shifter is used to implement fractional pitch control, and time-varying filter characteristics are implemented with a minimum phase response.
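To make the first stage of the envelope estimation above concrete, the following sketch averages two short-time power spectra computed half a pitch period apart, which cancels the temporal ripple caused by periodic excitation. It is an illustrative simplification, not the STRAIGHT implementation: the window design, the f0-adaptive smoothing of the second stage, and the consistent-sampling post-processing are all omitted (see [13, 14]), and the function and parameter names are ours.

import numpy as np

def tandem_power_spectrum(x, fs, t_center, f0, win_len=1024):
    """Stage 1 of envelope estimation: a temporally stable power spectrum.

    x        : speech waveform (1-D array); t_center is assumed far enough
               from the signal edges that both analysis frames fit inside x.
    fs       : sampling rate in Hz.
    t_center : analysis time in seconds.
    f0       : local fundamental frequency estimate in Hz.
    """
    half_period = 0.5 / f0                 # half a pitch period, in seconds
    win = np.hanning(win_len)
    spectra = []
    for dt in (-half_period / 2, half_period / 2):   # two instants T0/2 apart
        start = int(round((t_center + dt) * fs)) - win_len // 2
        frame = x[start:start + win_len] * win
        spectra.append(np.abs(np.fft.rfft(frame)) ** 2)
    # Averaging the two power spectra suppresses the temporal interference
    # from periodic excitation; the remaining spectral (harmonic-frequency)
    # interference is removed by STRAIGHT's second, smoothing stage.
    return 0.5 * (spectra[0] + spectra[1])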
2.2. Sinusoidal Modeling of Voiced Excitation

To permit the manipulation of harmonicity, the pulse-based voicing synthesis of the original STRAIGHT procedure was replaced by a sum of multiple sinusoids.

Let A(t, f) represent the amplitude at a time-frequency location (t, f) of the spectral envelope estimated using the STRAIGHT procedure. The deterministic (voiced) component s(t) of the sinusoidal synthesis procedure is defined by the following equation:

s(t) = \sum_{n=1}^{N(t)} A(t, f_n(t)) \cos\left( 2\pi \int_0^t f_n(\tau)\, d\tau + \varphi_n \right)    (1)

where f_n(t) represents the time-varying frequency of the n-th constituent sinusoid and \varphi_n represents its initial phase (set to zero for the experiments described here). The total number of harmonic components N(t) at time t is adaptively adjusted to keep the highest component frequency below the Nyquist frequency.

Equation (1) is approximated using a fixed frame-rate overlap-add procedure (6 ms Hanning-windowed frames, 3 ms frame rate, and 50% overlap between adjacent frames). A linear-phase FIR time-varying filter, derived from A(t, f), is applied to each frame using a 64 ms FFT buffer. To synthesize speech, s(t) is added to the unvoiced speech component estimated as in conventional STRAIGHT.
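The overlap-add implementation involves machinery (frame-rate filtering, FFT buffers) that can obscure the underlying model, so the sketch below instead evaluates Eq. (1) directly, sample by sample: the phase of each component is the running integral of its frequency track. It is illustrative only, not the authors' implementation; the envelope lookup A(t, f) is assumed to be available as a vectorized callable from a separate analysis stage, and the names and signatures are ours.

import numpy as np

def synthesize_voiced(f0_track, envelope, fs):
    """Direct evaluation of Eq. (1): voiced excitation as summed sinusoids.

    f0_track : (T,) array, instantaneous f0 in Hz, one value per sample.
    envelope : callable A(t, f) -> amplitude, vectorized over arrays.
    fs       : sampling rate in Hz.
    """
    t = np.arange(len(f0_track)) / fs
    n_max = int((fs / 2) / max(f0_track.max(), 1e-6))  # most harmonics possible
    s = np.zeros(len(f0_track))
    for n in range(1, n_max + 1):
        fn = n * f0_track                        # frequency track of component n
        phase = 2 * np.pi * np.cumsum(fn) / fs   # integral of fn(tau); phi_n = 0
        active = fn < fs / 2                     # N(t): mute samples above Nyquist
        s += np.where(active, envelope(t, fn) * np.cos(phase), 0.0)
    return s

For inharmonic synthesis, the harmonic tracks n * f0_track are simply replaced by the manipulated frequency tracks of Section 2.3.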
2.3. Inharmonicity Manipulations

Following manipulations from prior studies [11, 17], we altered the frequencies of the speech harmonics in three ways (all three are sketched in code at the end of this section).

• Shifting: the frequencies of all harmonics were increased by a fixed proportion of the f0, preserving the regular spacing (in Hz) between components. Hence, the frequency of harmonic n became:

f_n(t) = n f_0(t) + a f_0(t)    (2)

• Stretching: the frequency spacing between adjacent components was increased with increasing component number:

f_n(t) = n f_0(t) + b\, n(n-1) f_0(t)    (3)

• Jittering: a distinct random offset (uniformly distributed between -30% and +30% of the f0) was added to the frequency of each component:

f_n(t) = n f_0(t) + c_n f_0(t)    (4)

For comparison we also synthesized a substitute for whispered speech, in which a low-amplitude noise component (26 dB below the level of the voiced component of the regular synthetic speech, an amount that sounded fairly natural to the authors) was added to the usual unvoiced component in lieu of harmonic voicing.
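All three manipulations are pointwise transformations of the component frequency tracks prior to synthesis, so they can be generated by one helper, as in the sketch below. The constants a = 0.3 and b = 0.075 are the values used in Figure 1; the jitter offsets c_n are drawn once per component and held fixed over time, as the text describes. The function is a hypothetical helper of ours, not part of STRAIGHT.

import numpy as np

def component_tracks(f0_track, n_components, mode, a=0.3, b=0.075, seed=0):
    """Frequency tracks fn(t) for components n = 1..n_components, shape (N, T)."""
    n = np.arange(1, n_components + 1)[:, None]      # (N, 1) component numbers
    f0 = np.asarray(f0_track)[None, :]               # (1, T) f0 track in Hz
    harm = n * f0                                    # harmonic tracks n*f0(t)
    if mode == 'harmonic':
        return harm
    if mode == 'shifted':                            # Eq. (2): all shifted by a*f0
        return harm + a * f0
    if mode == 'stretched':                          # Eq. (3): spacing grows with n
        return harm + b * n * (n - 1) * f0
    if mode == 'jittered':                           # Eq. (4): fixed offset per component
        c = np.random.default_rng(seed).uniform(-0.3, 0.3, size=(n_components, 1))
        return harm + c * f0
    raise ValueError("unknown mode: %s" % mode)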

3. Results

Figure 1 displays spectrograms of one original speech utterance and the synthetic variants that resulted from our synthesis algorithm.

[Figure 1: Spectrograms of original and modified tokens of the utterance "Two cars came over a crest" (panels: Original; Shifted by 0.3 f0; Stretched by 0.075 n(n-1) f0; Jittered by 0.3 [-1..1] f0; Simulated whisper; axes: time / sec vs. freq / kHz; color scale: harmonic level / dB). The frequency axis extends only to 2 kHz to facilitate inspection of individual frequency components.]

It is visually apparent that the synthetic harmonic rendition is acoustically similar to the original, as intended. The inharmonic variants, in contrast, deviate in their fine spectral detail. Close inspection reveals that the frequencies of the shifted inharmonic version are translated upwards in frequency by a small amount (such that the component frequencies are no longer integer multiples of the component spacing). The stretched and jittered versions lack the regular spacing found in harmonic spectra, while the simulated whisper lacks discrete frequency components. In all cases, however, the coarse spectrotemporal envelope of the original signal is preserved by virtue of STRAIGHT's speech decomposition, and the unvoiced components of the original speech, which are processed separately, are reconstructed without modification. All synthetic renditions remain highly intelligible, as can be confirmed by listening to the demos available online: http://labrosa.ee.columbia.edu/projects/inharmonic/

Although much of the acoustic structure needed for speech recognition remains intact, the inharmonicity that results from the synthesis is readily audible. Unlike the synthetic renditions that preserve harmonicity, the inharmonic versions do not sound fully natural, perhaps due to weaker fusion of frequency components and/or the absence of a clear pitch during voiced speech segments [18]. To quantify the physical basis of this effect, we used Praat to measure the instantaneous periodicity of each type of synthetic signal for a large number of speech utterances from the TIMIT database (a simplified version of this measure is sketched in code at the end of this section). As shown in Figure 2, the periodicity histograms for both the original recordings and their synthetic harmonic counterparts have a steep peak near 1, corresponding to moments of highly periodic voicing. In contrast, all three types of inharmonic signals lack strong periodicity despite the presence of discrete frequency components. The simulated whisper also lacks periodicity, as expected from its noise excitation.

[Figure 2: Histograms of instantaneous periodicity for recordings of speech utterances and different synthetic renditions thereof (Original, Harmonic, Jittered, Stretched, Shifted, and Whispered synthesis; axes: periodicity, measured as normalized autocorrelation peak height, vs. probability of occurrence). Data obtained from 76 randomly selected sentences from the TIMIT database.]

Although the individual frequency components of the inharmonic speech utterances trace out the same contour shape as the components of the harmonic speech, the absence of periodicity impairs the extraction of an f0 contour (in this case by Praat), as shown in Figure 3. The f0 track for the harmonic synthetic version closely mirrors that of the original (note the overlap between blue and black), but the inharmonic variants do not. Our subjective observations suggest that many aspects of prosody are nonetheless preserved in inharmonic speech, a topic that will be interesting to explore experimentally.

[Figure 3: f0 tracks extracted from an example original speech utterance and its synthetic variants (Original, Harmonic, Jittered, and Shifted synthesis; axes: time (s) vs. f0 (Hz)). The stretched inharmonic version is omitted for visual clarity.]

The most exciting application of inharmonic speech stimuli may be to the study of sound segregation. We informally compared the ease of hearing a target speaker mixed with competing talkers for harmonic, inharmonic, and whispered synthetic speech. Although definitive conclusions will require formal measurements of speech intelligibility over a large corpus, our subjective impression was that harmonic speech was somewhat easier to perceive in a mixture than was inharmonic speech, with whispered speech noticeably more difficult than inharmonic. In some cases it seemed that harmonic speech derived an advantage from its pitch contour, which helps to sequentially group parts of speech. Such issues merit further study.
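The periodicity measure of Figure 2 was computed with Praat; for readers without it, the sketch below computes the same basic quantity, the height of the largest peak of the normalized autocorrelation within a plausible pitch-lag range, for one analysis frame. The 75-400 Hz search range is our assumption, and Praat's refinements (window compensation, octave costs, candidate tracking) are omitted.

import numpy as np

def periodicity(frame, fs, f0_min=75.0, f0_max=400.0):
    """Normalized autocorrelation peak height of one windowed speech frame.

    Returns a value near 1 for strongly periodic frames and a small value
    for inharmonic or noise excitation; the frame should span several pitch
    periods (e.g., 40 ms).
    """
    x = frame - frame.mean()
    r = np.correlate(x, x, mode='full')[len(x) - 1:]  # autocorrelation, lag >= 0
    if r[0] <= 0:
        return 0.0                                    # silent frame
    r = r / r[0]                                      # normalize so r[0] = 1
    lo = max(1, int(fs / f0_max))                     # shortest plausible period
    hi = min(int(fs / f0_min), len(r) - 1)            # longest plausible period
    return float(r[lo:hi + 1].max())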

4. Conclusions

Inharmonic speech utterances can be synthesized using a modification of the STRAIGHT framework. They are intelligible but lack a clear pitch and sound less fused than veridical harmonic speech. Inharmonic speech signals hold promise for the study of prosody and speech segregation.

5. Acknowledgements

The authors thank Sam Norman-Haignere for assistance with Praat.

6. References

[1] R.P. Lippmann, "Speech recognition by machines and humans," Speech Comm., vol. 22, no. 1, pp. 1–15, 1997.
[2] A.S. Bregman, Auditory Scene Analysis, Bradford Books, MIT Press, 1990.
[3] M. Cooke and D.P.W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Comm., vol. 35, no. 3, pp. 141–177, 2001.
[4] J.H. McDermott, "The cocktail party problem," Current Biology, vol. 19, no. 22, pp. R1024–R1027, 2009.
[5] B.C.J. Moore, B.R. Glasberg, and R.W. Peters, "Thresholds for hearing mistuned partials as separate tones in harmonic complexes," J. Acoust. Soc. Am., vol. 80, pp. 479–483, 1986.
[6] R.P. Carlyon, "Encoding the fundamental frequency of a complex tone in the presence of a spectrally overlapping masker," J. Acoust. Soc. Am., vol. 99, no. 1, pp. 517–524, 1996.
[7] G.J. Brown and M. Cooke, "Computational auditory scene analysis," Comp. Speech and Lang., vol. 8, no. 4, pp. 297–336, 1994.
[8] D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Tr. Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.
[9] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, 2006.
[10] D.P.W. Ellis, "Model-based scene analysis," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, D. Wang and G. Brown, Eds., chapter 4, pp. 115–146, Wiley/IEEE Press, 2006.
[11] B. Roberts and J.M. Brunstrom, "Perceptual segregation and pitch shifts of mistuned components in harmonic complexes and in regular inharmonic complexes," J. Acoust. Soc. Am., vol. 104, pp. 2326–2338, 1998.
[12] H. Kawahara, "STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds," Acoustical Sci. and Tech., vol. 27, no. 6, pp. 349–353, 2006.
[13] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. IEEE ICASSP, 2008, pp. 3933–3936.
[14] H. Kawahara and M. Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework," SADHANA, vol. 36, no. 5, pp. 713–722, 2011.
[15] H. Kawahara, T. Irino, and M. Morise, "An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction," in Proc. IEEE ICASSP, 2011, pp. 5420–5423.
[16] H. Kawahara, M. Morise, T. Takahashi, H. Banno, R. Nisimura, and T. Irino, "Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems," in Proc. Interspeech 2010, 2010, pp. 38–41.
[17] J.H. McDermott, A.J. Lehr, and A.J. Oxenham, "Individual differences reveal the basis of consonance," Current Biology, vol. 20, no. 11, pp. 1035–1041, 2010.
[18] C. Micheyl, K. David, D.M. Wrobleski, and A.J. Oxenham, "Does fundamental frequency discrimination measure virtual pitch discrimination?," J. Acoust. Soc. Am., vol. 128, pp. 1930–1942, 2010.