
Inharmonic Speech: A Tool for the Study of Speech Perception and Separation∗

Josh H. McDermott Daniel P. W. Ellis Hideki Kawahara

Center for Neural Science, New York University, USA ([email protected])
Dept. Elec. Eng., Columbia University, USA ([email protected])
Faculty of Systems Engineering, Wakayama University, Japan ([email protected])

Abstract

Sounds created by a periodic process have a Fourier representation with harmonic structure – i.e., components at multiples of a fundamental frequency. Harmonic frequency relations are a prominent feature of speech and many other natural sounds. Harmonicity is closely related to the perception of pitch and is believed to provide an important acoustic grouping cue underlying sound segregation. Here we introduce a method to manipulate the harmonicity of otherwise natural-sounding speech tokens, providing stimuli with which to study the role of harmonicity in hearing. Our algorithm utilizes elements of the STRAIGHT framework for speech manipulation and synthesis, in which a recorded speech utterance is decomposed into voiced and unvoiced vocal excitation and vocal tract filtering. Unlike the conventional STRAIGHT method, we model voiced excitation as a combination of time-varying sinusoids. By individually modifying the frequency of each sinusoid, we introduce inharmonic excitation without changing other aspects of the speech signal. The resulting signal remains highly intelligible, and can be used to assess the role of harmonicity in the perception of prosody or in the segregation of speech from mixtures of talkers.

Index Terms: speech synthesis, harmonicity, sound segregation

∗ This work was supported in part by the National Science Foundation (NSF) via grant IIS-1117015, by Grants-in-Aid for Scientific Research 22650042 from JSPS, and by the Howard Hughes Medical Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

1. Introduction

Human speech recognition is remarkable for its robustness to background noise. Our ability to recognize speech from mixtures with other sound sources sets humans apart from state-of-the-art speech recognition systems [1], which typically perform well in quiet but are adversely affected by the presence of additional sound sources. The robustness of human recognition to competing sounds reflects our ability to segregate individual sources – to separate the energy produced by a target source from that produced by other sources [2].

Human sound segregation relies in part on acoustic grouping cues – sound properties that are characteristic of individual natural sound sources such as speech [3, 4], and that can be used to infer groupings of sound energy from a mixture of sources. Harmonic frequency relations are believed to be among the most powerful of such cues. Harmonicity is the frequency-domain analogue of the periodicity that characterizes many natural sounds, including voiced speech. Periodicity produces frequency components that are multiples of the fundamental frequency (f0), a relationship known as harmonicity. Frequency components that are harmonically related are generally heard as a single sound with a common pitch, and mistuning a single component of a harmonic series by as little as 1% causes it to be heard as a distinct sound [5]. Moreover, two concurrent tones with different f0s are typically heard as two distinct sources [6].

Machine systems that attempt to replicate human segregation abilities also make use of harmonicity. Computational auditory scene analysis (CASA) systems typically compute a measure of periodicity and f0 within local time-frequency cells and then group cells in part based on the consistency of the f0 estimates. CASA systems in fact rely more strongly on harmonicity than common onset, the other main bottom-up grouping cue believed to underlie human segregation [7, 8].

Despite the widespread assumption that harmonicity is critical to sound segregation, its role in the segregation of real-world sounds such as speech remains largely untested.
Given the potential importance of spectrotemporal sparsity in the segregation of natural sounds [9, 10], it is conceivable that the most important role of harmonicity could simply be to produce discrete frequency components, the sparsity of which reduces masking and could facilitate common onset and other grouping cues. Moreover, psychoacoustic experiments with artificial stimuli have raised questions about whether harmonicity is in fact critical for segregation. Mistuned frequency components of complex tones can be detected even when the frequencies of all components are increased by a fixed amount, or when the complex is "stretched" such that adjacent components are no longer separated by a fixed number of Hz [11]. Although such tones are inharmonic (lacking a fundamental frequency common to all the components they contain), component mistuning detection thresholds are comparable to those for harmonic tones. This result suggests that various forms of spectral regularity, rather than harmonicity per se, could be most critical to segregation.

The strongest test of harmonicity's importance in segregation would arguably be to compare the segregation of real-world sounds to that of inharmonic equivalents that are matched in other respects. However, speech in particular is nontrivial to manipulate in this manner, as it consists of an interaction of periodic and noise excitation with vocal tract filtering to which humans are exquisitely sensitive. We devised a method to generate inharmonic versions of recorded speech utterances that selectively alters the periodic component of excitation. We used the framework of STRAIGHT, a powerful tool for representing and manipulating speech [12, 13].

2. Methods

Spectral envelopes (used to model vocal tract filtering) were extracted by STRAIGHT from recorded speech and were used to set the amplitudes of constituent time-varying sinusoids (used to model speech excitation). Conventionally, these sinusoidal components would mirror the harmonics of the pitch contour. However, modeling the excitation in this way allows the frequency relations between sinusoids to be manipulated independently of the spectral envelope or the prosodic contour, introducing inharmonicity into an otherwise normal speech signal. This section outlines the original STRAIGHT procedure and its extension to enable inharmonic excitation.

2.1. Original STRAIGHT Framework

Spectral envelope estimation in STRAIGHT consists of a two-stage procedure to eliminate interference from periodic speech excitation [13]. In the first stage, temporal interference is eliminated by averaging power spectra calculated at two time points separated by half a pitch period. In the second stage, spectral interference is eliminated by spectral smoothing using an f0-adaptive rectangular smoother, followed by post-processing to preserve harmonic component levels based on consistent sampling theory. These procedures are implemented with cepstral liftering. More details are provided in [14].

Excitation estimation in STRAIGHT also relies on a temporally stable representation of the power spectrum and combines this with a temporally stable representation of instantaneous frequency [15]. Excitation is represented using a time-varying fundamental frequency f0(t) (for the voiced, deterministic component of excitation) and time-varying parameters to describe colored noise (for the unvoiced, random component of excitation). The original STRAIGHT framework synthesizes voiced excitation with a sequence of pulses, with each pulse being the minimum-phase impulse response of the estimated vocal tract filter at that time point. Fractional pitch control is implemented with a linear phase shifter. The voiced and unvoiced components are combined using a sigmoid function (defined by the boundary frequency between the voiced and unvoiced components, and a transition slope) [16].
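To make the first-stage temporal smoothing concrete, the following Python sketch averages power spectra computed half a pitch period apart. It is a simplification of the TANDEM procedure of [13, 14]: the second-stage f0-adaptive spectral smoothing and cepstral liftering are omitted, and the function name and windowing choices are our own assumptions, not the reference implementation.

    import numpy as np

    def tandem_power_spectrum(x, fs, t_center, f0, nfft=1024):
        # Average power spectra at two instants separated by half a pitch
        # period; the averaging cancels the periodic temporal ripple that
        # voiced excitation imposes on a spectrum measured at one instant.
        # Assumes x contains at least nfft samples after each start point.
        half_period = 0.5 / f0  # seconds
        avg = np.zeros(nfft // 2 + 1)
        for t in (t_center, t_center + half_period):
            start = int(round(t * fs))
            frame = x[start:start + nfft] * np.hanning(nfft)
            avg += np.abs(np.fft.rfft(frame, nfft)) ** 2
        return avg / 2.0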
2.2. Sinusoidal Modeling of Voiced Excitation

To permit the manipulation of harmonicity, the pulse-based voicing synthesis of the original STRAIGHT procedure was replaced by a sum of multiple sinusoids. Our implementation extends a previous instantiation of sinusoidal excitation modeling in STRAIGHT [17].

Let A(t, f) represent the amplitude at a time-frequency location (t, f) of the spectral envelope estimated using the STRAIGHT procedure. The deterministic (voiced) component s(t) of the sinusoidal synthesis procedure can be defined by the following equation:

    s(t) = \sum_{n=1}^{N(t)} A(t, f_n(t)) \cos\left( 2\pi \int_0^t f_n(\tau) \, d\tau + \varphi_n \right)    (1)

where fn(t) represents the time-varying frequency of the n-th constituent sinusoid and ϕn represents its initial phase (set to zero for the experiments described here). The total number of harmonic components N(t) at time t is adaptively adjusted to keep the highest component frequency below the Nyquist frequency.

Instead of directly implementing Equation 1, as in [17], we approximate it here using a time-varying filter and a fixed frame-rate overlap-add procedure (3 ms frame rate and 50% overlap between adjacent Hanning-windowed frames). A linear-phase FIR filter, derived from A(t, f), is applied to each frame using a 1024-sample (64 ms) FFT buffer. This is essentially a cross-synthesis framework, whose minimal restrictions on the input signal make it straightforward to vary the excitation. To synthesize speech, s(t) is added to the unvoiced speech component estimated as in conventional STRAIGHT.
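For readers who prefer the direct form, the sketch below renders Equation 1 by discretizing the phase integral as a cumulative sum. This is not the overlap-add procedure actually used; the function name and the vectorized envelope callable are our own assumptions.

    import numpy as np

    def synthesize_voiced(envelope, f_tracks, fs, phi=0.0):
        # envelope: vectorized callable A(t, f) returning amplitudes, a
        # stand-in for the STRAIGHT spectral envelope.
        # f_tracks: (N, T) array giving each component's frequency f_n(t)
        # in Hz, sampled at fs.
        n_comp, n_samp = f_tracks.shape
        t = np.arange(n_samp) / fs
        s = np.zeros(n_samp)
        for fn in f_tracks:
            # Discretized phase integral: 2*pi * cumsum(f_n) / fs.
            phase = 2.0 * np.pi * np.cumsum(fn) / fs
            # Enforce N(t) by silencing components above the Nyquist frequency.
            active = fn < fs / 2.0
            s += np.where(active, envelope(t, fn), 0.0) * np.cos(phase + phi)
        return s

Harmonic excitation corresponds to f_tracks whose n-th row is n·f0(t); the manipulations of Section 2.3 simply remap these rows.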

2.3. Inharmonicity Manipulations

Following manipulations from prior studies [11, 18], we altered the frequencies of speech harmonics in three ways (a code sketch of the resulting frequency mappings follows the list):

• Shifting: the frequencies of all harmonics were increased by a fixed proportion of the f0, preserving the regular spacing (in Hz) between components. Hence, the frequency of harmonic n became:

    f_n(t) = n f_0(t) + a f_0(t)    (2)

• Stretching: the frequency spacing between adjacent components was increased with increasing component number:

    f_n(t) = n f_0(t) + b n(n-1) f_0(t)    (3)

• Jittering: a distinct random offset (uniformly distributed between -30% and +30% of the f0) was added to the frequency of each component:

    f_n(t) = n f_0(t) + c_n f_0(t)    (4)
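All three manipulations reduce to remapping each component's frequency track before synthesis. A minimal sketch follows, with our own function name; the values of a, b, and the jitter bound follow Equations 2-4 and the parameter settings of Figure 1.

    import numpy as np

    def component_tracks(f0_track, n_comp, mode="harmonic",
                         a=0.3, b=0.075, jitter=0.3, rng=None):
        # f0_track: (T,) array of f0(t) in Hz; returns an (n_comp, T) array
        # of component frequency tracks f_n(t), for n = 1 .. n_comp.
        rng = np.random.default_rng() if rng is None else rng
        n = np.arange(1, n_comp + 1)[:, None]  # column vector; broadcasts over time
        if mode == "harmonic":
            offset = 0.0
        elif mode == "shifted":      # Eq. 2: add a * f0 to every component
            offset = a
        elif mode == "stretched":    # Eq. 3: add b * n * (n - 1) * f0
            offset = b * n * (n - 1)
        elif mode == "jittered":     # Eq. 4: add c_n * f0, c_n ~ U(-0.3, 0.3)
            offset = rng.uniform(-jitter, jitter, size=(n_comp, 1))  # fixed per component
        else:
            raise ValueError("unknown mode: %s" % mode)
        return (n + offset) * f0_track[None, :]

Feeding these tracks to the Equation 1 synthesis (or its overlap-add equivalent) yields the harmonic and inharmonic renditions compared below.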

For comparison we also synthesized a substitute for whispered speech, in which a low-amplitude noise component (26 dB below the level of the voiced component of the regular synthetic speech, an amount that sounded fairly natural to the authors) was added to the usual unvoiced component in lieu of sinusoidally modeled voicing.

3. Results

Figure 1 displays spectrograms of one original speech utterance and the synthetic variants that resulted from our synthesis algorithm. It is visually apparent that the synthetic harmonic rendition is acoustically similar to the original, as intended. The inharmonic variants, in contrast, deviate in their spectral detail. Close inspection reveals that the frequencies of the shifted inharmonic version are translated upwards by a small amount in frequency (such that the component frequencies are no longer integer multiples of the component spacing). The stretched and jittered versions lack the regular spacing found in harmonic spectra, while the simulated whisper lacks discrete frequency components. In all cases, however, the coarse spectro-temporal envelope of the original signal is preserved by virtue of STRAIGHT's speech decomposition, and the unvoiced components in the original speech, which are processed separately, are reconstructed without modification. All synthetic renditions remain highly intelligible, as can be confirmed by listening to the demos available online: http://labrosa.ee.columbia.edu/projects/inharmonic/

[Figure 1: Spectrograms of original and modified tokens of the utterance "Two cars came over a crest". Panels: Original; Harmonic; Shifted by 0.3 f0; Stretched by 0.075 n(n-1) f0; Jittered by 0.3 [-1..1] f0; Simulated whisper. The frequency axis extends only to 2 kHz to facilitate inspection of individual frequency components.]

Although much of the acoustic structure needed for speech recognition remains intact, the inharmonicity that results from the synthesis is readily audible. Unlike the synthetic renditions that preserve harmonicity, the inharmonic versions do not sound fully natural, perhaps due to weaker fusion of frequency components and/or the absence of a clear pitch during voiced speech segments. To quantify the physical basis of this effect, we used Praat to measure the instantaneous periodicity of each type of synthetic signal for a large number of speech utterances from the TIMIT database. As shown in Figure 2, the periodicity histograms for both the original recordings and their synthetic harmonic counterparts have a steep peak near 1, corresponding to moments of periodic voicing. In contrast, all three types of inharmonic signals lack strong periodicity despite the presence of discrete frequency components. The simulated whisper synthetic speech also lacks periodicity, as expected from noise excitation.

[Figure 2: Histograms of instantaneous periodicity for recordings of speech utterances and different synthetic renditions thereof. Data obtained from 76 randomly selected sentences from the TIMIT database. Traces: Original, Harmonic Synth, Jittered Synth, Stretched Synth, Shifted Synth, Whispered Synth; axes: periodicity (normalized peak height) vs. probability of occurrence.]
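As a rough stand-in for Praat's measure (the sketch below is a simplification, not a reimplementation of Praat's algorithm, and the parameter defaults are our own), instantaneous periodicity can be approximated framewise as the height of the largest normalized autocorrelation peak within a plausible pitch range:

    import numpy as np

    def periodicity(frame, fs, fmin=75.0, fmax=400.0):
        # Returns a value near 1 for strongly periodic frames and near 0
        # for noise-like frames (cf. the histograms of Figure 2).
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0.0:
            return 0.0  # silent frame
        ac = ac / ac[0]  # normalize so that lag 0 has height 1
        lo = int(fs / fmax)                     # shortest lag considered
        hi = min(int(fs / fmin), len(ac) - 1)   # longest lag considered
        return float(ac[lo:hi + 1].max())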
Although the individual frequency components of the inharmonic speech utterances trace out the same contour shape as the components of the harmonic speech, the absence of periodicity impairs the extraction of an f0 contour (in this case by Praat), as shown in Figure 3. The f0 track for the harmonic synthetic version closely mirrors that of the original (note the overlap between blue and black), but the inharmonic variants do not. Our subjective observations suggest that many aspects of prosody are nonetheless preserved in inharmonic speech, a topic that will be interesting to explore experimentally.

[Figure 3: f0 tracks extracted from an example original speech utterance and its synthetic variants (Original, Harmonic Synth, Jittered Synth, Shifted Synth; f0 in Hz vs. time in s). The stretched inharmonic version is omitted for visual clarity.]

The most exciting application of inharmonic speech stimuli may be to the study of sound segregation.
We informally compared the ease of hearing a target speaker mixed with competing talkers, for harmonic, inharmonic, and whispered synthetic speech. Although definitive conclusions will require formal measurements over a large corpus, our subjective impression was that harmonic speech was somewhat easier to perceive in a mixture than was inharmonic speech, with whispered speech noticeably more difficult than inharmonic. In some cases it seemed that harmonic speech derived an advantage from its pitch contour, which helps to sequentially group parts of speech.

4. Conclusions

Inharmonic speech utterances can be synthesized using a modification of the STRAIGHT framework. They are intelligible but lack a clear pitch and sound less fused than veridical harmonic speech. Inharmonic speech signals may be useful for the study of prosody and speech segregation.

5. References

[1] R.P. Lippmann, "Speech recognition by machines and humans," Speech Comm., vol. 22, no. 1, pp. 1–15, 1997.
[2] A.S. Bregman, Auditory Scene Analysis, Bradford Books, MIT Press, 1990.
[3] M. Cooke and D.P.W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Comm., vol. 35, no. 3, pp. 141–177, 2001.
[4] J.H. McDermott, "The cocktail party problem," Current Biology, vol. 19, no. 22, pp. R1024–R1027, 2009.
[5] B.C.J. Moore, B.R. Glasberg, and R.W. Peters, "Thresholds for hearing mistuned partials as separate tones in harmonic complexes," J. Acoust. Soc. Am., vol. 80, pp. 479–483, 1986.
[6] C. Micheyl and A.J. Oxenham, "Pitch, harmonicity, and concurrent sound segregation: Psychoacoustical and neurophysiological findings," Hearing Research, vol. 266, no. 1-2, pp. 36–51, 2010.
[7] G.J. Brown and M. Cooke, "Computational auditory scene analysis," Comp. Speech and Lang., vol. 8, no. 4, pp. 297–336, 1994.
[8] D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Tr. Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.
[9] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, March 2006.
[10] D.P.W. Ellis, "Model-based scene analysis," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, D. Wang and G. Brown, Eds., chapter 4, pp. 115–146, Wiley/IEEE Press, 2006.
[11] B. Roberts and J.M. Brunstrom, "Perceptual segregation and pitch shifts of mistuned components in harmonic complexes and in regular inharmonic complexes," J. Acoust. Soc. Am., vol. 104, pp. 2326–2338, 1998.
[12] H. Kawahara, "STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds," Acoustical Sci. and Tech., vol. 27, no. 6, pp. 349–353, 2006.
[13] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. IEEE ICASSP, 2008, pp. 3933–3936.
[14] H. Kawahara and M. Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework," SADHANA, vol. 36, no. 5, pp. 713–722, 2011.
[15] H. Kawahara, T. Irino, and M. Morise, "An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction," in Proc. IEEE ICASSP, 2011, pp. 5420–5423.
[16] H. Kawahara, M. Morise, T. Takahashi, H. Banno, R. Nisimura, and T. Irino, "Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems," in Proc. Interspeech 2010, 2010, pp. 38–41.
[17] H. Kawahara, H. Banno, T. Irino, and P. Zolfaghari, "Algorithm AMALGAM: Morphing waveform based methods, sinusoidal models and STRAIGHT," in Proc. IEEE ICASSP, 2004, pp. 13–16.
[18] J.H. McDermott, A.J. Lehr, and A.J. Oxenham, "Individual differences reveal the basis of consonance," Current Biology, vol. 20, no. 11, pp. 1035–1041, 2010.