Inharmonic Speech: A Tool for the Study of Speech Perception and Separation∗

Josh H. McDermott, Center for Neural Science, New York University, USA
Daniel P. W. Ellis, Dept. Elec. Eng., Columbia University, USA
Hideki Kawahara, Faculty of Systems Engineering, Wakayama University, Japan

∗This work was supported in part by the National Science Foundation (NSF) via grant IIS-1117015, by Grants-in-Aid for Scientific Research 22650042 from JSPS, and by the Howard Hughes Medical Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

Abstract

Sounds created by a periodic process have a Fourier representation with harmonic structure – i.e., components at multiples of a fundamental frequency. Such harmonic frequency relations are a prominent feature of speech and many other natural sounds. Harmonicity is closely related to the perception of pitch and is believed to provide an important acoustic grouping cue underlying sound segregation. Here we introduce a method to manipulate the harmonicity of otherwise natural-sounding speech tokens, providing the stimuli needed to study the role of harmonicity in speech perception. Our algorithm utilizes elements of the STRAIGHT framework for speech manipulation and synthesis, in which a recorded speech utterance is decomposed into voiced and unvoiced vocal excitation and vocal tract filtering. Unlike the conventional STRAIGHT method, we model voiced excitation as a combination of time-varying sinusoids, the frequencies of which can then be individually modified. By shifting, stretching, or jittering the harmonic frequencies, we introduce inharmonic excitation without changing other aspects of the speech signal. The resulting signal remains highly intelligible, and can be used to assess the role of harmonicity in the perception of prosody or in the segregation of speech from mixtures of talkers.

Index Terms: speech synthesis, harmonicity, sound segregation

1. Introduction

Human speech recognition is remarkable for its robustness to background noise. Our ability to recognize target utterances from mixtures with other sound sources sets humans apart from state-of-the-art speech recognition systems [1], which typically perform well in quiet but are adversely affected by the presence of additional sound sources. The robustness of human recognition to competing sounds reflects our ability to segregate individual sources – to separate the sound energy produced by a target source from that produced by other sources [2].

Human sound segregation relies in part on acoustic grouping cues – sound properties that are characteristic of individual natural sound sources such as speech [3, 4], and that can be used to infer groupings of sound energy from a mixture of sources. Harmonic frequency relations are believed to be among the most powerful of such cues. Harmonicity is the frequency-domain analogue of the periodicity that characterizes many natural sounds, including voiced speech: periodicity produces frequency components that are multiples of the fundamental frequency (f0). Frequency components that are harmonically related are generally heard as a single sound with a common pitch, and mistuning a single component of a harmonic series by as little as 1% causes it to be heard as a distinct sound [5]. Moreover, two concurrent tones with different f0s are typically heard as two distinct sources [6].

Machine systems that attempt to replicate human segregation abilities also make use of harmonicity. Computational auditory scene analysis (CASA) systems typically compute a measure of periodicity and f0 within local time-frequency cells and then group cells in part based on the consistency of the f0 estimates. CASA systems in fact rely more strongly on harmonicity than common onset, the other main bottom-up grouping cue believed to underlie human segregation [7, 8].
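This f0-consistency grouping can be sketched in a few lines. The sketch below is illustrative only and is not taken from any particular CASA system: the layout of the time-frequency map, the median-based reference f0, and the 5% tolerance are all assumptions.

```python
import numpy as np

def group_cells_by_f0(f0_map, tol=0.05):
    """Group time-frequency cells by the consistency of local f0 estimates.

    f0_map: 2-D array (frames x channels) of per-cell f0 estimates in Hz,
    with np.nan where a cell shows no clear periodicity. Within each frame,
    a cell is assigned to the dominant source if its f0 estimate lies
    within a proportional tolerance `tol` of the frame's median f0.
    Returns a boolean mask over cells.
    """
    mask = np.zeros(f0_map.shape, dtype=bool)
    for t in range(f0_map.shape[0]):
        row = f0_map[t]
        valid = ~np.isnan(row)
        if not valid.any():
            continue
        f0_ref = np.median(row[valid])   # dominant f0 for this frame
        mask[t] = valid & (np.abs(row - f0_ref) <= tol * f0_ref)
    return mask
```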
Despite the widespread assumption that harmonicity is critical to sound segregation, its role in the segregation of real-world sounds such as speech remains largely untested. Given the potential importance of spectrotemporal sparsity in the segregation of complex natural sounds [9, 10], it is conceivable that the most important role of harmonicity could simply be to produce discrete frequency components, the sparsity of which reduces masking and could facilitate common onset and other grouping cues. Moreover, experimental results with artificial stimuli have raised questions about whether harmonicity is in fact critical for segregation. Mistuned frequency components of complex tones can be detected even when the frequencies of all components are increased by a fixed amount, or when the complex is "stretched" such that adjacent components are no longer separated by a fixed number of Hz [11]. Although such tones are inharmonic (lacking a fundamental frequency common to all the components they contain), component mistuning detection thresholds are comparable to those for harmonic tones. This result suggests that various forms of spectral regularity, rather than harmonicity per se, could be most critical to segregation.

The most relevant and convincing test of the importance of harmonic structure in segregation would arguably be to assess the segregation of real-world sounds compared to that of inharmonic equivalents that are matched in other respects. To usefully alter the harmonicity of natural sounds, however, one must avoid disrupting other aspects of the sound structure. Speech in particular is difficult to manipulate in this manner, as it consists of a highly structured interaction of periodic and noise excitation with vocal tract filtering to which humans are exquisitely sensitive. To explore the role of harmonicity in speech perception and segregation, we devised a method to generate inharmonic versions of recorded speech utterances. We used the framework of STRAIGHT, a powerful tool for representing and manipulating speech [12, 13].

2. Methods

Spectral envelopes extracted by STRAIGHT were used to set the amplitudes of time-varying sinusoids. This section outlines the original STRAIGHT procedure and its extension to enable sinusoidal modeling of excitation, which we used to generate inharmonic voicing.

2.1. STRAIGHT Framework

Spectral envelope estimation in STRAIGHT consists of a two-stage procedure to eliminate interference from the periodic speech excitation on the representation of spectral power [13]. In the first stage, temporal interference is eliminated by averaging power spectra calculated at two time points separated by half a pitch period. In the second stage, spectral interference is eliminated by spectral smoothing using an f0-adaptive rectangular smoother, followed by post-processing to preserve harmonic component levels based on consistent sampling theory. These frequency-domain procedures are implemented with cepstral liftering. More details are provided in [14].

Speech excitation estimation also relies on a temporally stable representation of the power spectrum and combines this with a temporally stable representation of instantaneous frequency [15]. Excitation is represented using a time-varying fundamental frequency f0(t) (for the voiced component); the voiced and unvoiced components are combined using a sigmoid function (defined by the boundary frequency between the voiced and unvoiced components, and a transition slope) [16].

The synthesis procedure of the basic STRAIGHT framework generates excitation with a pulse-plus-noise model. A linear phase shifter is used to implement fractional pitch control. Time-varying filter characteristics are implemented with a minimum phase response.
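As a rough sketch of this excitation model, the code below mixes a pulse train (following f0(t)) with white noise through a single static sigmoid weighting. This is a simplification of STRAIGHT's time-varying, minimum-phase implementation, and the boundary frequency and slope values are assumed for illustration.

```python
import numpy as np

def pulse_plus_noise_excitation(f0_track, sr, boundary_hz=3000.0, slope=0.002):
    """Sketch of a pulse-plus-noise excitation signal.

    A pulse train follows the time-varying f0(t) (one value per sample,
    in Hz) and white noise supplies the unvoiced energy; the two are
    mixed in the frequency domain with a sigmoid weighting defined by a
    boundary frequency and a transition slope (illustrative values, not
    STRAIGHT's parameters).
    """
    n = len(f0_track)
    # Pulse train: place a unit impulse each time the accumulated phase
    # (in cycles) crosses an integer, i.e. once per pitch period.
    phase = np.cumsum(f0_track) / sr
    idx = np.searchsorted(phase, np.arange(1.0, phase[-1]))
    pulses = np.zeros(n)
    pulses[idx[idx < n]] = 1.0

    noise = np.random.randn(n)

    # Sigmoid voiced/unvoiced weighting as a function of frequency:
    # mostly pulses below the boundary, mostly noise above it.
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    w_unvoiced = 1.0 / (1.0 + np.exp(-slope * (freqs - boundary_hz)))
    spectrum = (np.fft.rfft(pulses) * (1.0 - w_unvoiced)
                + np.fft.rfft(noise) * w_unvoiced)
    return np.fft.irfft(spectrum, n)
```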
2.2. Sinusoidal Modeling of Voiced Excitation

To permit the manipulation of harmonicity, the pulse-based voicing synthesis of the original STRAIGHT procedure was replaced by a sum of multiple sinusoids.

Let A(t, f) represent the amplitude at a time-frequency location (t, f) of the spectral envelope estimated using the STRAIGHT procedure. The deterministic (voiced) component s(t) of the sinusoidal synthesis procedure is defined by the following equation:

s(t) = \sum_{n=1}^{N(t)} A(t, f_n(t)) \cos\!\left( 2\pi \int_0^t f_n(\tau)\, d\tau + \varphi_n \right)    (1)

where f_n(t) represents the time-varying frequency of the n-th constituent sinusoid and ϕ_n represents its initial phase (set to zero for the experiments described here). The total number of harmonic components N(t) at time t is adaptively adjusted to keep the highest component frequency lower than the Nyquist frequency.

Equation 1 is approximated using a fixed frame-rate overlap-add procedure (6 ms Hanning-windowed frames, 3 ms frame rate, and 50% overlap between adjacent frames). A linear-phase FIR time-varying filter, derived from A(t, f), is applied to each frame using a 64 ms FFT buffer. To synthesize speech, s(t) is added to the unvoiced speech component estimated as in conventional STRAIGHT.

2.3. Inharmonicity Manipulations

Following manipulations from prior psychoacoustics studies [11, 17], we altered the frequencies of speech harmonics in three ways.

• Shifting: the frequencies of all harmonics were increased by a fixed proportion of the f0, preserving the regular spacing (in Hz) between components. Hence, the frequency of harmonic n became (see the code sketch following this list):

  f_n(t) = n f_0(t) + a f_0(t)    (2)

• Stretching: the frequency spacing between adjacent components was increased with increasing
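The synthesis and the shifting manipulation can be illustrated together with a minimal sketch: it evaluates Eq. (1) by direct sample-by-sample summation, rather than the overlap-add approximation described in Section 2.2, with harmonic frequencies set per Eq. (2). The `envelope` callable standing in for A(t, f) and the constant-f0 usage example are assumptions for illustration.

```python
import numpy as np

def synthesize_voiced(f0_track, envelope, sr, shift=0.0):
    """Evaluate Eq. (1) directly, one output sample at a time.

    f0_track: f0(t) in Hz, one strictly positive value per output sample
              (voiced frames only).
    envelope: callable A(i, f) -> linear amplitude at sample index i and
              frequency f in Hz; stands in for the STRAIGHT spectral
              envelope (an assumption of this sketch).
    shift:    inharmonic shift as a proportion of f0, per Eq. (2);
              0 gives the harmonic case.
    """
    n_samples = len(f0_track)
    nyquist = sr / 2.0
    s = np.zeros(n_samples)
    n_max = int(nyquist / np.min(f0_track))   # most components ever needed
    for n in range(1, n_max + 1):
        fn = (n + shift) * f0_track               # Eq. (2): n*f0 + a*f0
        valid = fn < nyquist                      # enforces N(t) per sample
        phase = 2.0 * np.pi * np.cumsum(fn) / sr  # integral of f_n(tau)
        amps = np.array([envelope(i, fn[i]) if valid[i] else 0.0
                         for i in range(n_samples)])
        s += amps * np.cos(phase)                 # initial phase set to zero
    return s

# Usage: 0.5 s at f0 = 120 Hz with a gently low-pass "envelope",
# all harmonics shifted upward by 30% of f0 (both values illustrative).
sr = 16000
f0_track = np.full(int(0.5 * sr), 120.0)
s = synthesize_voiced(f0_track, lambda i, f: 1.0 / (1.0 + f / 1000.0),
                      sr, shift=0.3)
```

With shift = 0 the excitation is harmonic; nonzero values reproduce the shifted-harmonic manipulation of Eq. (2).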