
Inharmonic Speech: A Tool for the Study of Speech Perception and Separation∗

Josh H. McDermott Daniel P. W. Ellis Hideki Kawahara

Center for Neural Science, New York University, USA ([email protected])
Dept. Elec. Eng., Columbia University, USA ([email protected])
Faculty of Systems Engineering, Wakayama University, Japan ([email protected])

Abstract

Sounds created by a periodic process have a Fourier representation with harmonic structure – i.e., components at multiples of a fundamental frequency. Harmonic frequency relations are a prominent feature of speech and many other natural sounds. Harmonicity is closely related to the perception of pitch and is believed to provide an important acoustic grouping cue underlying sound segregation. Here we introduce a method to manipulate the harmonicity of otherwise natural-sounding speech tokens, providing stimuli with which to study the role of harmonicity in hearing. Our algorithm utilizes elements of the STRAIGHT framework for speech manipulation and synthesis, in which a recorded speech utterance is decomposed into voiced and unvoiced vocal excitation and vocal tract filtering. Unlike the conventional STRAIGHT method, we model voiced excitation as a combination of time-varying sinusoids. By individually modifying the frequency of each sinusoid, we introduce inharmonic excitation without changing other aspects of the speech signal. The resulting signal remains highly intelligible, and can be used to assess the role of harmonicity in the perception of prosody or in the segregation of speech from mixtures of talkers.

Index Terms: speech synthesis, harmonicity, sound segregation

∗ This work was supported in part by the National Science Foundation (NSF) via grant IIS-1117015, by Grants-in-Aid for Scientific Research 22650042 from JSPS, and by the Howard Hughes Medical Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

1. Introduction

Human speech recognition is remarkable for its robustness to background noise. Our ability to recognize speech from mixtures with other sound sources sets humans apart from state-of-the-art speech recognition systems [1], which typically perform well in quiet but are adversely affected by the presence of additional sound sources. The robustness of human recognition to competing sounds reflects our ability to segregate individual sources – to separate the energy produced by a target source from that produced by other sources [2].

Human sound segregation relies in part on acoustic grouping cues – sound properties that are characteristic of individual natural sound sources such as speech [3, 4], and that can be used to infer groupings of sound energy from a mixture of sources. Harmonic frequency relations are believed to be among the most powerful of such cues. Harmonicity is the frequency-domain analogue of the periodicity that characterizes many natural sounds, including voiced speech. Periodicity produces frequency components that are multiples of the fundamental frequency (f0), a relationship known as harmonicity. Frequency components that are harmonically related are generally heard as a single sound with a common pitch, and mistuning a single component of a harmonic series by as little as 1% causes it to be heard as a distinct sound [5]. Moreover, two concurrent tones with different f0s are typically heard as two distinct sources [6].

Machine systems that attempt to replicate human segregation abilities also make use of harmonicity. Computational auditory scene analysis (CASA) systems typically compute a measure of periodicity and f0 within local time-frequency cells and then group cells in part based on the consistency of the f0 estimates. CASA systems in fact rely more strongly on harmonicity than common onset, the other main bottom-up grouping cue believed to underlie human segregation [7, 8].

Despite the widespread assumption that harmonicity is critical to sound segregation, its role in the segregation of real-world sounds such as speech remains largely untested.
Given the potential importance of spectrotemporal sparsity in the segregation of natural sounds [9, 10], it is conceivable that the most important role of harmonicity could simply be to produce discrete frequency components, the sparsity of which reduces masking and could facilitate common onset and other grouping cues. Moreover, psychoacoustic experiments with artificial stimuli have raised questions about whether harmonicity is in fact critical for segregation. Mistuned frequency components of complex tones can be detected even when the frequencies of all components are increased by a fixed amount, or when the complex is "stretched" such that adjacent components are no longer separated by a fixed number of Hz [11]. Although such tones are inharmonic (lacking a fundamental frequency common to all the components they contain), component mistuning detection thresholds are comparable to those for harmonic tones. This result suggests that various forms of spectral regularity, rather than harmonicity per se, could be most critical to segregation.

The strongest test of harmonicity's importance in segregation would arguably be to compare the segregation of real-world sounds to that of inharmonic equivalents that are matched in other respects. However, speech in particular is nontrivial to manipulate in this manner, as it consists of an interaction of periodic and noise excitation with vocal tract filtering to which humans are exquisitely sensitive. We devised a method to generate inharmonic versions of recorded speech utterances that selectively alters the periodic component of excitation. We used the framework of STRAIGHT, a powerful tool for representing and manipulating speech [12, 13].

2. Methods

Spectral envelopes (used to model vocal tract filtering) were extracted by STRAIGHT from recorded speech and were used to set the amplitudes of constituent time-varying sinusoids (used to model speech excitation). Conventionally, these sinusoidal components would mirror the harmonics of the pitch contour. However, modeling the excitation in this way allows the frequency relations between sinusoids to be manipulated independently of the spectral envelope or the prosodic contour, introducing inharmonicity into an otherwise normal speech signal. This section outlines the original STRAIGHT procedure and its extension to enable inharmonic excitation.

2.1. Original STRAIGHT Framework

Spectral envelope estimation in STRAIGHT consists of a two-stage procedure to eliminate interference from periodic speech excitation [13]. In the first stage, temporal interference is eliminated by averaging power spectra calculated at two time points separated by half a pitch period. In the second stage, spectral interference is eliminated by spectral smoothing using an f0-adaptive rectangular smoother, followed by post-processing to preserve harmonic component levels based on consistent sampling theory. These procedures are implemented with cepstral liftering. More details are provided in [14].

Excitation estimation in STRAIGHT also relies on a temporally stable representation of the power spectrum and combines this with a temporally stable representation of instantaneous frequency [15]. Excitation is represented using a time-varying fundamental frequency f0(t) (for the voiced, deterministic component of excitation) and time-varying parameters to describe colored noise (for the unvoiced, random component of excitation). The original STRAIGHT framework synthesizes voiced excitation with a sequence of pulses, with each pulse being the minimum-phase impulse response of the estimated vocal tract filter at that time point. Fractional pitch control is implemented with a linear phase shifter. The voiced and unvoiced components are combined using a sigmoid function (defined by the boundary frequency between the voiced and unvoiced components, and a transition slope) [16].
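To make the first-stage temporal smoothing concrete, the following Python sketch averages power spectra computed half a pitch period apart. It is a simplification of the TANDEM procedure of [13, 14]: the second-stage f0-adaptive spectral smoothing and cepstral liftering are omitted, and the function name and windowing choices are our own assumptions, not the reference implementation.

    import numpy as np

    def tandem_power_spectrum(x, fs, t_center, f0, nfft=1024):
        # Average power spectra at two instants separated by half a pitch
        # period; the averaging cancels the periodic temporal ripple that
        # voiced excitation imposes on a spectrum measured at one instant.
        # Assumes x contains at least nfft samples after each start point.
        half_period = 0.5 / f0  # seconds
        avg = np.zeros(nfft // 2 + 1)
        for t in (t_center, t_center + half_period):
            start = int(round(t * fs))
            frame = x[start:start + nfft] * np.hanning(nfft)
            avg += np.abs(np.fft.rfft(frame, nfft)) ** 2
        return avg / 2.0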
2.2. Sinusoidal Modeling of Voiced Excitation

To permit the manipulation of harmonicity, the pulse-based voicing synthesis of the original STRAIGHT procedure was replaced by a sum of multiple sinusoids. Our implementation extends a previous instantiation of sinusoidal excitation modeling in STRAIGHT [17].

Let A(t, f) represent the amplitude at a time-frequency location (t, f) of the spectral envelope estimated using the STRAIGHT procedure. The deterministic (voiced) component s(t) of the sinusoidal synthesis procedure can be defined by the following equation:

    s(t) = \sum_{n=1}^{N(t)} A(t, f_n(t)) \cos\left( 2\pi \int_0^t f_n(\tau) \, d\tau + \varphi_n \right)    (1)

where fn(t) represents the time-varying frequency of the n-th constituent sinusoid and ϕn represents its initial phase (set to zero for the experiments described here). The total number of harmonic components N(t) at time t is adaptively adjusted to keep the highest component frequency below the Nyquist frequency.

Instead of directly implementing Equation 1, as in [17], we approximate it here using a time-varying filter and a fixed frame-rate overlap-add procedure (3 ms frame rate and 50% overlap between adjacent Hanning-windowed frames). A linear-phase FIR filter, derived from A(t, f), is applied to each frame using a 1024-sample (64 ms) FFT buffer. This is essentially a cross-synthesis framework, whose minimal restrictions on the input signal make it straightforward to vary the excitation. To synthesize speech, s(t) is added to the unvoiced speech component estimated as in conventional STRAIGHT.
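For readers who prefer the direct form, the sketch below renders Equation 1 by discretizing the phase integral as a cumulative sum. This is not the overlap-add procedure actually used; the function name and the vectorized envelope callable are our own assumptions.

    import numpy as np

    def synthesize_voiced(envelope, f_tracks, fs, phi=0.0):
        # envelope: vectorized callable A(t, f) returning amplitudes, a
        # stand-in for the STRAIGHT spectral envelope.
        # f_tracks: (N, T) array giving each component's frequency f_n(t)
        # in Hz, sampled at fs.
        n_comp, n_samp = f_tracks.shape
        t = np.arange(n_samp) / fs
        s = np.zeros(n_samp)
        for fn in f_tracks:
            # Discretized phase integral: 2*pi * cumsum(f_n) / fs.
            phase = 2.0 * np.pi * np.cumsum(fn) / fs
            # Enforce N(t) by silencing components above the Nyquist frequency.
            active = fn < fs / 2.0
            s += np.where(active, envelope(t, fn), 0.0) * np.cos(phase + phi)
        return s

Harmonic excitation corresponds to f_tracks whose n-th row is n·f0(t); the manipulations of Section 2.3 simply remap these rows.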

2.3. Inharmonicity Manipulations

Following manipulations from prior studies [11, 18], we altered the frequencies of speech harmonics in three ways (a code sketch of the resulting frequency mappings follows the list):

• Shifting: the frequencies of all harmonics were increased by a fixed proportion of the f0, preserving the regular spacing (in Hz) between components. Hence, the frequency of harmonic n became:

    f_n(t) = n f_0(t) + a f_0(t)    (2)

• Stretching: the frequency spacing between adjacent components was increased with increasing component number:

    f_n(t) = n f_0(t) + b n(n-1) f_0(t)    (3)

• Jittering: a distinct random offset (uniformly distributed between -30% and +30% of the f0) was added to the frequency of each component:

    f_n(t) = n f_0(t) + c_n f_0(t)    (4)
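All three manipulations reduce to remapping each component's frequency track before synthesis. A minimal sketch follows, with our own function name; the values of a, b, and the jitter bound follow Equations 2-4 and the parameter settings of Figure 1.

    import numpy as np

    def component_tracks(f0_track, n_comp, mode="harmonic",
                         a=0.3, b=0.075, jitter=0.3, rng=None):
        # f0_track: (T,) array of f0(t) in Hz; returns an (n_comp, T) array
        # of component frequency tracks f_n(t), for n = 1 .. n_comp.
        rng = np.random.default_rng() if rng is None else rng
        n = np.arange(1, n_comp + 1)[:, None]  # column vector; broadcasts over time
        if mode == "harmonic":
            offset = 0.0
        elif mode == "shifted":      # Eq. 2: add a * f0 to every component
            offset = a
        elif mode == "stretched":    # Eq. 3: add b * n * (n - 1) * f0
            offset = b * n * (n - 1)
        elif mode == "jittered":     # Eq. 4: add c_n * f0, c_n ~ U(-0.3, 0.3)
            offset = rng.uniform(-jitter, jitter, size=(n_comp, 1))  # fixed per component
        else:
            raise ValueError("unknown mode: %s" % mode)
        return (n + offset) * f0_track[None, :]

Feeding these tracks to the Equation 1 synthesis (or its overlap-add equivalent) yields the harmonic and inharmonic renditions compared below.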

For comparison we also synthesized a substitute for whispered speech, in which a low-amplitude noise component (26 dB below the level of the voiced component of the regular synthetic speech, an amount that sounded fairly natural to the authors) was added to the usual unvoiced component in lieu of sinusoidally modeled voicing.

3. Results

Figure 1 displays spectrograms of one original speech utterance and the synthetic variants that resulted from our synthesis algorithm. It is visually apparent that the synthetic harmonic rendition is acoustically similar to the original, as intended. The inharmonic variants, in contrast, deviate in their spectral detail. Close inspection reveals that the frequencies of the shifted inharmonic version are translated upwards by a small amount in frequency (such that the component frequencies are no longer integer multiples of the component spacing). The stretched and jittered versions lack the regular spacing found in harmonic spectra, while the simulated whisper lacks discrete frequency components. In all cases, however, the coarse spectro-temporal envelope of the original signal is preserved by virtue of STRAIGHT's speech decomposition, and the unvoiced components in the original speech, which are processed separately, are reconstructed without modification. All synthetic renditions remain highly intelligible, as can be confirmed by listening to the demos available online: http://labrosa.ee.columbia.edu/projects/inharmonic/

[Figure 1: Spectrograms of original and modified tokens of the utterance "Two cars came over a crest". Panels: Original; Harmonic; Shifted by 0.3 f0; Stretched by 0.075 n(n-1) f0; Jittered by 0.3 [-1..1] f0; Simulated whisper. The frequency axis extends only to 2 kHz to facilitate inspection of individual frequency components.]

Although much of the acoustic structure needed for speech recognition remains intact, the inharmonicity that results from the synthesis is readily audible. Unlike the synthetic renditions that preserve harmonicity, the inharmonic versions do not sound fully natural, perhaps due to weaker fusion of frequency components and/or the absence of a clear pitch during voiced speech segments. To quantify the physical basis of this effect, we used Praat to measure the instantaneous periodicity of each type of synthetic signal for a large number of speech utterances from the TIMIT database. As shown in Figure 2, the periodicity histograms for both the original recordings and their synthetic harmonic counterparts have a steep peak near 1, corresponding to moments of periodic voicing. In contrast, all three types of inharmonic signals lack strong periodicity despite the presence of discrete frequency components. The simulated whisper synthetic speech also lacks periodicity, as expected from noise excitation.

[Figure 2: Histograms of instantaneous periodicity for recordings of speech utterances and different synthetic renditions thereof. Data obtained from 76 randomly selected sentences from the TIMIT database. Traces: Original, Harmonic Synth, Jittered Synth, Stretched Synth, Shifted Synth, Whispered Synth; axes: periodicity (normalized peak height) vs. probability of occurrence.]
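As a rough stand-in for Praat's measure (the sketch below is a simplification, not a reimplementation of Praat's algorithm, and the parameter defaults are our own), instantaneous periodicity can be approximated framewise as the height of the largest normalized autocorrelation peak within a plausible pitch range:

    import numpy as np

    def periodicity(frame, fs, fmin=75.0, fmax=400.0):
        # Returns a value near 1 for strongly periodic frames and near 0
        # for noise-like frames (cf. the histograms of Figure 2).
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0.0:
            return 0.0  # silent frame
        ac = ac / ac[0]  # normalize so that lag 0 has height 1
        lo = int(fs / fmax)                     # shortest lag considered
        hi = min(int(fs / fmin), len(ac) - 1)   # longest lag considered
        return float(ac[lo:hi + 1].max())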
Although the individual frequency components of the inharmonic speech utterances trace out the same contour shape as the components of the harmonic speech, the absence of periodicity impairs the extraction of an f0 contour (in this case by Praat), as shown in Figure 3. The f0 track for the harmonic synthetic version closely mirrors that of the original (note the overlap between blue and black), but the inharmonic variants do not. Our subjective observations suggest that many aspects of prosody are nonetheless preserved in inharmonic speech, a topic that will be interesting to explore experimentally.

[Figure 3: f0 tracks extracted from an example original speech utterance and its synthetic variants (Original, Harmonic Synth, Jittered Synth, Shifted Synth; f0 in Hz vs. time in s). The stretched inharmonic version is omitted for visual clarity.]

The most exciting application of inharmonic speech stimuli may be to the study of sound segregation.
We informally compared the ease of hearing a target speaker mixed with competing talkers, for harmonic, inharmonic, and whispered synthetic speech. Although definitive conclusions will require formal measurements over a large corpus, our subjective impression was that harmonic speech was somewhat easier to perceive in a mixture than was inharmonic speech, with whispered speech noticeably more difficult than inharmonic. In some cases it seemed that harmonic speech derived an advantage from its pitch contour, which helps to sequentially group parts of speech.

4. Conclusions

Inharmonic speech utterances can be synthesized using a modification of the STRAIGHT framework. They are intelligible but lack a clear pitch and sound less fused than veridical harmonic speech. Inharmonic speech signals may be useful for the study of prosody and speech segregation.

5. References

[1] R.P. Lippmann, "Speech recognition by machines and humans," Speech Comm., vol. 22, no. 1, pp. 1–15, 1997.
[2] A.S. Bregman, Auditory Scene Analysis, Bradford Books, MIT Press, 1990.
[3] M. Cooke and D.P.W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Comm., vol. 35, no. 3, pp. 141–177, 2001.
[4] J.H. McDermott, "The cocktail party problem," Current Biology, vol. 19, no. 22, pp. R1024–R1027, 2009.
[5] B.C.J. Moore, B.R. Glasberg, and R.W. Peters, "Thresholds for hearing mistuned partials as separate tones in harmonic complexes," J. Acoust. Soc. Am., vol. 80, pp. 479–483, 1986.
[6] C. Micheyl and A.J. Oxenham, "Pitch, harmonicity, and concurrent sound segregation: Psychoacoustical and neurophysiological findings," Hearing Research, vol. 266, no. 1-2, pp. 36–51, 2010.
[7] G.J. Brown and M. Cooke, "Computational auditory scene analysis," Comp. Speech and Lang., vol. 8, no. 4, pp. 297–336, 1994.
[8] D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Tr. Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.
[9] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562–1573, March 2006.
[10] D.P.W. Ellis, "Model-based scene analysis," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, D. Wang and G. Brown, Eds., chapter 4, pp. 115–146, Wiley/IEEE Press, 2006.
[11] B. Roberts and J.M. Brunstrom, "Perceptual segregation and pitch shifts of mistuned components in harmonic complexes and in regular inharmonic complexes," J. Acoust. Soc. Am., vol. 104, pp. 2326–2338, 1998.
[12] H. Kawahara, "STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds," Acoustical Sci. and Tech., vol. 27, no. 6, pp. 349–353, 2006.
[13] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. IEEE ICASSP, 2008, pp. 3933–3936.
[14] H. Kawahara and M. Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework," SADHANA, vol. 36, no. 5, pp. 713–722, 2011.
[15] H. Kawahara, T. Irino, and M. Morise, "An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction," in Proc. IEEE ICASSP, 2011, pp. 5420–5423.
[16] H. Kawahara, M. Morise, T. Takahashi, H. Banno, R. Nisimura, and T. Irino, "Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems," in Proc. Interspeech 2010, 2010, pp. 38–41.
[17] H. Kawahara, H. Banno, T. Irino, and P. Zolfaghari, "Algorithm AMALGAM: Morphing waveform based methods, sinusoidal models and STRAIGHT," in Proc. IEEE ICASSP, 2004, pp. 13–16.
[18] J.H. McDermott, A.J. Lehr, and A.J. Oxenham, "Individual differences reveal the basis of consonance," Current Biology, vol. 20, no. 11, pp. 1035–1041, 2010.