PAPERS
On the Use of TimeÐFrequency Reassignment in Additive Sound Modeling*
KELLY FITZ, AES Member AND LIPPOLD HAKEN, AES Member
Department of Electrical Engineering and Computer Science, Washington University, Pulman, WA 99164
A method of reassignment in sound modeling to produce a sharper, more robust additive representation is introduced. The reassigned bandwidth-enhanced additive model follows ridges in a timeÐfrequency analysis to construct partials having both sinusoidal and noise characteristics. This model yields greater resolution in time and frequency than is possible using conventional additive techniques, and better preserves the temporal envelope of transient signals, even in modified reconstruction, without introducing new component types or cumbersome phase interpolation algorithms.
0INTRODUCTION manipulations [11], [12]. Peeters and Rodet [3] have developed a hybrid analysis/synthesis system that eschews The method of reassignment has been used to sharpen high-level transient models and retains unabridged OLA spectrograms in order to make them more readable [1], [2], (overlapÐadd) frame data at transient positions. This to measure sinusoidality, and to ensure optimal window hybrid representation represents unmodified transients per- alignment in the analysis of musical signals [3]. We use fectly, but also sacrifices homogeneity. Quatieri et al. [13] timeÐfrequency reassignment to improve our bandwidth- propose a method for preserving the temporal envelope of enhanced additive sound model. The bandwidth-enhanced short-duration complex acoustic signals using a homoge- additive representation is in some way similar to tradi- neous sinusoidal model, but it is inapplicable to sounds of tional sinusoidal models [4]Ð[6] in that a waveform is longer duration, or sounds having multiple transient events. modeled as a collection of components, called partials, We use the method of reassignment to improve the time having time-varying amplitude and frequency envelopes. and frequency estimates used to define our partial param- Our partials are not strictly sinusoidal, however. We eter envelopes, thereby enhancing the time Ðfrequency employ a technique of bandwidth enhancement to com- resolution of our representation, and improving its phase bine sinusoidal energy and noise energy into a single par- accuracy. The combination of timeÐfrequency reassign- tial having time-varying amplitude, frequency, and band- ment and bandwidth enhancement yields a homogeneous width parameters [7], [8]. model (that is, a model having a single component type) Additive sound models applicable to polyphonic and that is capable of representing at high fidelity a wide vari- nonharmonic sounds employ long analysis windows, which ety of sounds, including nonharmonic, polyphonic, impul- can compromise the time resolution and phase accuracy sive, and noisy sounds. The reassigned bandwidth- needed to preserve the temporal shape of transients. enhanced sound model is robust under transformation, and Various methods have been proposed for representing the fidelity of the representation is preserved even under transient waveforms in additive sound models. Verma and time dilation and other model-domain modifications. The Meng [9] introduce new component types specifically for homogeneity and robustness of the reassigned bandwidth- modeling transients, but this method sacrifices the homo- enhanced model make it particularly well suited for such geneity of the model. A homogeneous model, that is, a manipulations as cross synthesis and sound morphing. model having a single component type, such as the break- Reassigned bandwidth-enhanced modeling and render- point parameter envelopes in our reassigned bandwidth- ing and many kinds of manipulations, including morph- enhanced additive model [10], is critical for many kinds of ing, have been implemented in the open-source C class library Loris [14], and a stream-based, real-time * Manuscript received 2001 December 20; revised 2002 July implementation of bandwidth-enhanced synthesis is avail- 23 and 2002 September 11. able in the Symbolic Sound Kyma environment [15].
J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November 879 FITZ AND HAKEN PAPERS
1TIMEÐFREQUENCY REASSIGNMENT cantly to the reconstruction, and the maximum contribu- tion (center of gravity) occurs at the point where the phase The discrete short-time Fourier transform is often used is changing most slowly with respect to time and frequency. as the basis for a timeÐfrequency representation of time- In the vicinity of t τ (that is, for an analysis window varying signals, and is defined as a function of time index centered at time t τ), the point of maximum spectral n and frequency index k as energy contribution has timeÐfrequency coordinates that satisfy the stationarity conditions 3 R V S j2π ^lnkh W Xkn ^^^hhh hlnxlexp (1) ! S N W 2 l 3 8φτω^^, hh ωt τB 0 (3) T X 2ω N 1 2 2 j2πlk 8φτω^^, hh ωt τB 0 (4) !hlxn^^hh l exp e o (2) 2τ N 1 N l where ( , ) is the continuous short-time phase spec- 2 φ τ ω trum and ω(t τ) is the phase travel due to periodic where h(n) is a sliding window function equal to 0 for n < oscillation [18]. The stationarity conditions are satisfied (N 1)/2 and n > (N 1)/2 (for N odd), so that Xn(k) at the coordinates is the N-point discrete Fourier transform of a short-time 2φτω^ , h waveform centered at time n. t τ (5) Short-time Fourier transform data are sampled at a rate 2ω equal to the analysis hop size, so data in derivative 2φτω^ , h timeÐfrequency representations are reported on a regular ωt (6) 2τ temporal grid, corresponding to the centers of the short- time analysis windows. The sampling of these so-called representing group delay and instantaneous frequency, frame-based representations can be made as dense as respectively. desired by an appropriate choice of hop size. However, Discretizing Eqs. (5) and (6) to compute the time and temporal smearing due to long analysis windows needed frequency coordinates numerically is difficult and unreli- to achieve high-frequency resolution cannot be relieved by able, because the partial derivatives must be approxi- denser sampling. mated. These formulas can be rewritten in the form of Though the short-time phase spectrum is known to con- ratios of discrete Fourier transforms [1]. Time and fre- tain important temporal information, typically only the quency coordinates can be computed using two additional short-time magnitude spectrum is considered in the short-time Fourier transforms, one employing a time- timeÐfrequency representation. The short-time phase weighted window function and one a frequency-weighted spectrum is sometimes used to improve the frequency esti- window function. mates in the timeÐfrequency representation of quasi- Since time estimates correspond to the temporal center harmonic sounds [16], but it is often omitted entirely, or of the short-time analysis window, the time-weighted win- used only in unmodified reconstruction, as in the basic dow is computed by scaling the analysis window function sinusoidal model, described by McAulay and Quatieri [4]. by a time ramp from (N 1)/2 to (N 1)/2 for a win- The so-called method of reassignment computes sharp- dow of length N. The frequency-weighted window is ened time and frequency estimates for each spectral com- computed by wrapping the Fourier transform of the analy- ponent from partial derivatives of the short-time phase sis window to the frequency range [ π, π], scaling the spectrum. Instead of locating timeÐfrequency compo- transform by a frequency ramp from (N 1)/2 to (N 1)/2, and inverting the scaled transform to obtain a (real) nents at the geometrical center of the analysis window (tn, frequency-scaled window. Using these weighted win- ωk), as in traditional short-time spectral analysis, the com- ponents are reassigned to the center of gravity of their dows, the method of reassignment computes corrections complex spectral energy distribution, computed from the to the time and frequency estimates in fractional sample short-time phase spectrum according to the principle of units between (N 1)/2 to (N 1)/2. The three analy- stationary phase [17, ch. 7.3]. This method was first devel- sis windows employed in reassigned short-time Fourier oped in the context of the spectrogram and called the mod- analysis are shown in Fig. 1. ˆ ified moving window method [18], but it has since been The reassigned time tk,n for the kth spectral component applied to a variety of timeÐfrequency and time-scale from the short-time analysis window centered at time n (in transforms [1]. samples, assuming odd-length analysis windows) is [1] The principle of stationary phase states that the variation R V of the Fourier phase spectrum not attributable to periodic S * W S XkXktn; ^^hhn W tntkn, 0 (7) oscillation is slow with respect to frequency in certain S 2 W spectral regions, and in surrounding regions the variation is S Xkn ^ h W relatively rapid. In Fourier reconstruction, positive and T X negative contributions to the waveform cancel in frequency where Xt;n(k) denotes the short-time transform computed regions of rapid phase variation. Only regions of slow using the time-weighted window function and 0 [á] phase variation (stationary phase) will contribute signifi- denotes the real part of the bracketed ratio.
880 J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November PAPERS TIME-FREQUENCY REASSIGNMENT IN SOUND MODELING
The corrected frequency ωˆ k,n(k) corresponding to the spectrum, which is very difficult to interpret. The method same component is [1] reassignment is a technique for extracting information R V from the phase spectrum. S * W S XkXkfn; ^^hhn W ωt kn, k 1 (8) S 2 W 2 REASSIGNED BANDWIDTH-ENHANCED S Xkn ^ h W ANALYSIS T X The reassigned bandwidth-enhanced additive model where Xf ;n(k) denotes the short-time transform computed using the frequency-weighted window function and 1 [á] [10] employs timeÐfrequency reassignment to improve the time and frequency estimates used to define partial denotes the imaginary part of the bracketed ratio. Both tk,n parameter envelopes, thereby improving the timeÐfre- and ωˆ k,n have units of fractional samples. Time and frequency shifts are preserved in the reas- quency resolution and the phase accuracy of the represen- signment operation, and energy is conserved in the reas- tation. Reassignment transforms our analysis from a signed timeÐfrequency data. Moreover, chirps and frame-based analysis into a “true” timeÐfrequency analy- impulses are perfectly localized in time and frequency in sis. Whereas the discrete short-time Fourier transform any reassigned timeÐfrequency or time-scale representa- defined by Eq. (2) orients data according to the analysis tion [1]. Reassignment sacrifices the bilinearity of frame rate and the length of the transform, the time and timeÐfrequency transformations such as the squared mag- frequency orientation of reassigned spectral data is solely nitude of the short-time Fourier transform, since very data a function of the data themselves. point in the representation is relocated by a process that is The method of analysis we use in our research models highly signal dependent. This is not an issue in our repre- a sampled audio waveform as a collection of bandwidth- sentation, since the bandwidth-enhanced additive model, enhanced partials having sinusoidal and noiselike charac- like the basic sinusoidal model [4], retains data only at teristics. Other methods for capturing noise in additive timeÐfrequency ridges (peaks in the short-time magnitude sound models [5], [19] have represented noise energy in spectra), and thus is not bilinear. fixed frequency bands using more than one component Note that since the short-time Fourier transform is type. By contrast, bandwidth-enhanced partials are invertible, and the original waveform can be exactly defined by a trio of synchronized breakpoint envelopes reconstructed from an adequately sampled short-time specifying the time-varying amplitude, center frequency, Fourier representation, all the information needed to pre- and noise content for each component. Each partial is ren- cisely locate a spectral component within an analysis win- dered by a bandwidth-enhanced oscillator, described by dow is present in the short-time coefficients X (k). n yn AnΒζ n ncos θ n (9) Temporal information is encoded in the short-time phase ^^^^^hhhhh88BB where A(n) and β(n) are the time-varying sinusoidal and noise amplitudes, respectively, and ζ(n) is a energy- 1 normalized low-pass noise sequence, generated by exciting a low-pass filter with white noise and scaling the filter gain h(n) such that the noise sequence has the same total spectral 0 energy as a full-amplitude sinusoid. The oscillator phase -250 0 250 Time (samples) θ(n) is initialized to some starting value, obtained from the (a) reassigned short-time phase spectrum, and updated accord- ing to the time-varying radian frequency ω(n) by 40 θθ^^nnhhh 10 ω ^ nn,> (10) (n)
t 0 h The bandwidth-enhanced oscillator is depicted in Fig. 2. -40 -250 0 250 We define the time-varying bandwidth coefficient κ(n) Time (samples) as the fraction of total instantaneous partial energy that is (b) attributable to noise. This bandwidth (or noisiness) coeffi- cient assumes values between 0 for a pure sinusoid and 1 100 for a partial that is entirely narrow-band noise, and varies (n)
f 0 over time according to the noisiness of the partial. If we h -100 represent the total (sinusoidal and noise) instantaneous 2 -250 0 250 partial energy as à (n), then the output of the bandwidth- Time (samples) enhanced oscillator is described by (c) u Fig. 1. Analysis windows employed in three short-time trans- yn^^hh An9 12κκζθ ^ n h ^^^ n hhh nC cos8 nB . forms used to compute reassigned times and frequencies. (a) Ori- (11) ginal window function h(n) (a 501-point Kaiser window with shaping parameter 12.0 in this case). (b) Time-weighted window The envelopes for the time-varying partial amplitudes and function ht(n) nh(n). (c) Frequency-weighted window function hf(n). frequencies are constructed by identifying and following
J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November 881 FITZ AND HAKEN PAPERS the ridges on the timeÐfrequency surface. The time-varying ponents extracted from early or late short-time analysis partial bandwidth coefficients are computed and assigned windows are relocated nearer to the times of transient by a process of bandwidth association [7]. events, yielding clusters of timeÐfrequency data points, as We use the method of reassignment to improve the depicted in Fig. 4. In this way, time reassignment greatly time and frequency estimates for our partial parameter reduces the temporal smearing introduced through the use envelope breakpoints by computing reassigned times and of long analysis windows. Moreover, since reassignment frequencies that are not constrained to lie on the sharpens our frequency estimates, it is possible to achieve timeÐfrequency grid defined by the short-time Fourier good frequency resolution with shorter (in time) analysis analysis parameters. Our algorithm shares with tradi- windows than would be possible with traditional methods. tional sinusoidal methods the notion of temporally con- The use of shorter analysis windows further improves our nected partial parameter estimates, but by contrast, our time resolution and reduces temporal smearing. estimates are nonuniformly distributed in both time and The effect of timeÐfrequency reassignment on the tran- frequency. sient response can be demonstrated using a square wave Short-time analysis windows normally overlap in both that turns on abruptly, such as the waveform shown in Fig. time and frequency, so timeÐfrequency reassignment 5. This waveform, while aurally uninteresting and unin- often yields time corrections greater than the length of the formative, is useful for visualizing the performance of var- short-time hop size and frequency corrections greater than ious analysis methods. Its abrupt onset makes temporal the width of a frequency bin. Large time corrections are smearing obvious, its simple harmonic partial amplitude common in analysis windows containing strong transients relationship makes it easy to predict the necessary data for that are far from the temporal center of the window. Since a good timeÐfrequency representation, and its simple we retain data only at timeÐfrequency ridges, that is, at waveshape makes phase errors and temporal distortion frequencies of spectral energy concentration, we generally easy to identify. Note, however, that this waveform is observe large frequency corrections only in the presence pathological for Fourier-based additive models, and exag- of strong noise components, where phase stationarity is a gerates all of these problems with such methods. We use it weaker effect. only for the comparison of various methods. Fig. 6 shows two reconstructions of the onset of a square 3 SHARPENING TRANSIENTS wave from timeÐfrequency data obtained using overlap- ping 54-ms analysis windows, with temporal centers sepa- TimeÐfrequency representations based on traditional rated by 10 ms. This analysis window is long compared to magnitude-only short-time Fourier analysis techniques the period of the square wave, but realistic for the case of a (such as the spectrogram and the basic sinusoidal model polyphonic sound (a sound having multiple simultaneous [4]) fail to distinguish transient components from sustain- voices), in which the square wave is one voice. For clarity, ing components. A strong transient waveform, as shown in only the square wave is presented in this example, and Fig. 3(a), is represented by a collection of low-amplitude other simultaneous voices are omitted. The square wave spectral components in early short-time analysis frames, that is, frames corresponding to analysis windows centered 1 earlier than the time of the transient. A low-amplitude peri- odic waveform, as shown in Fig. 3(b), is also represented e by a collection of low-amplitude spectral components. The tud mpli information needed to distinguish these two critically dif- A ferent waveforms is encoded in the short-time phase spec- 0 trum, and is extracted by the method of reassignment. 0 100 200 300 400 500 600 700 TimeÐfrequency reassignment allows us to preserve the Time (samples) temporal envelope shape without sacrificing the homo- (a) geneity of the bandwidth-enhanced additive model. Com-
1 n A n β( ) ( ) e noise lowpass filter tud mpli
ζ(n) A N + 0
0 100 200 300 400 500 600 700 ω(n) Time (samples) (b) (starting phase) Fig. 3. Windowed short-time waveforms (dashed lines), not read- θ(0) y(n) ily distinguished in basic sinsoidal model [4]. Both waveforms are represented by low-amplitude spectral components. (a) Strong Fig. 2. Block diagram of bandwidth-enhanced oscillator. Time- transient yields off-center components, having large time correc- varying sinusoidal and noise amplitudes are controlled by A(n) tions (positive in this case because transient is near right tail of and β(n), respectively; time-varying center (sinusoidal) fre- window). (b) Sustained quasi-periodic waveform yields time cor- quency is ω(n). rections near zero.
882 J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November PAPERS TIME-FREQUENCY REASSIGNMENT IN SOUND MODELING has an abrupt onset. The silence before the onset is not 4CROPPING shown. Only the first (lowest frequency) five harmonic par- tials were used in the reconstruction, and consequently the Off-center components are short-time spectral compo- ringing due to Gibb’s phenomenon is evident. nents having large time reassignments. Since they repre- Fig. 6(a) is a reconstruction from traditional, nonreas- sent transient events that are far from the center of the signed timeÐfrequency data. The reconstructed square analysis window, and are therefore poorly represented in wave amplitude rises very gradually and reaches full the windowed short-time waveform, these off-center com- amplitude approximately 40 ms after the first nonzero ponents introduce unreliable spectral parameter estimates sample. Clearly, the instantaneous turn-on has been that corrupt our representation, making the model data dif- smeared out by the long analysis window. Fig. 6(b) shows ficult to interpret and manipulated. a reconstruction from reassigned timeÐfrequency data. Fortunately large time corrections make off-center The transient response has been greatly improved by relo- components easy to identify and remove from our cating components extracted from early analysis windows model. By removing the unreliable data embodied by (like the one on the left in Fig. 5) to their spectral centers off-center components, we make our model cleaner and of gravity, closer to the observed turn-on time. The syn- more robust. Moreover, thanks to the redundancy inher- thesized onset time has been reduced to approximately 10 ent in short-time analysis with overlapping analysis ms. The corresponding timeÐfrequency analysis data are windows, we do not sacrifice information by removing shown in Fig. 7. The nonreassigned data are evenly dis- the unreliable data points. The information represented tributed in time, so data from early windows (that is, win- poorly in off-center components is more reliably repre- dows centered before the onset time) smear the onset, sented in well-centered components, extracted from whereas the reassigned data from early analysis windows analysis windows centered nearer the time of the tran- are clumped near the correct onset time. sient event. Typically, data having time corrections
ω8 ω8 ω8
ω7 ω7 ω7
ω6 ω6 ω6
ω5 ω5 ω5
ω4 ω4 ω4 ω ω ω Frequency Frequency Frequency 3 3 3
ω2 ω2 ω2
ω1 ω1 ω1
t1 t2 t3 t4 t5 t6 t7 t1 t2 t3 t4 t5 t6 t7 t1 t2 t3 t4 t5 t6 t7 (a) (b) (c) Fig. 4. Comparison of timeÐfrequency data included in common representations. Only timeÐfrequency orientation of data points is shown. (a) Short-time Fourier transform retains data at every time tn and frequency ωk. (b) Basic sinusoidal model [4] retains data at selected time and frequency samples. (c) Reassigned bandwidth-enhanced analysis data are distributed continuously in time and fre- quency, and retained only at timeÐfrequency ridges. Arrows indicate mapping of short-time spectral samples onto timeÐfrequency ridges due to method of reassignment.
1.0 e tud i pl
m 0 A
-1.0 0 250 Time (ms) Fig. 5. Two long analysis windows superimposed at different times on square wave signal with abrupt turn-on. Short-time transform corresponding to earlier window generates unreliable parameter estimates and smears sharp onset of square wave.
J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November 883 FITZ AND HAKEN PAPERS greater than the time between consecutive analysis win- tion data unreliable in components that are very far off dow centers are considered to be unreliable and are center. removed, or cropped. Fig. 8 shows reassigned bandwidth-enhanced model Cropping partials to remove off-center components data from the onset of a bowed cello tone before and after allows us to localize transient events reliably. Fig. 7(c) the removal of off-center components. In this case, com- shows reassigned timeÐfrequency data from the abrupt ponents with time corrections greater than 10 ms (the time square wave onset with off-center components removed. between consecutive analysis windows) were deemed to The abrupt square wave onset synthesized from the be too far off center to deliver reliable parameter esti- cropped reassigned data, seen in Fig. 6(c), is much sharper mates. As in Fig. 7(c), the unreliable data clustered at the than the uncropped reassigned reconstruction, because the time of the onset are removed, leaving a cleaner, more taper of the analysis window makes even the time correc- robust representation. e tud i pl m A
0 Time (ms) 22 e tud i pl m A
22 Time (ms) 44 (a) e tud i pl m A
19 Time (ms) 41 (b) e tud i pl m A
19 Time (ms) 41 (c) Fig. 6. Abrupt square wave onset reconstructed from five sinusoidal partials corresponding to first five harmonics. (a) Reconstruction from nonreassigned analysis data. (b) Reconstruction from reassigned analysis data. (c) Reconstruction from reassigned analysis data with unreliable partial parameter estimates removed, or cropped.
2000 2000 2000 Frequency (Hz) Frequency (Hz) Frequency (Hz) 0 0 0 060Time (ms) 060Time (ms) 060Time (ms) (a) (b) (c) Fig. 7. TimeÐfrequency analysis data points for abrupt square wave onset. (a) Traditional nonreassigned data are evenly distributed in time. (b) Reassigned data are clumped at onset time. (c) Reassigned analysis data after far off-center components have been removed, or cropped. Only time and frequency information is plotted; amplitude information is not displayed.
884 J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November PAPERS TIME-FREQUENCY REASSIGNMENT IN SOUND MODELING
5 PHASE MAINTENANCE the parameter envelope breakpoints. In general, preserving phase using the cubic phase method in the presence of mod- Preserving phase is important for reproducing some ifications (or estimation errors) introduces wild frequency classes of sounds, in particular transients and short-duration excursions [20]. Phase can be preserved at one time, how- complex audio events having significant information in the ever, and that time is typically chosen to be the onset of temporal envelope [13]. The basic sinusoidal models pro- each partial, although any single time could be chosen. The posed by McAulay and Quatieri [4] is phase correct, that is, partial phase at all other times is modified to reflect the new it preserves phase at all times in unmodified reconstruction. timeÐfrequency characteristic of the modified partial. In order to match short-time spectral frequency and phase Off-center components with unreliable parameter esti- estimates at frame boundaries, McAulay and Quatieri em- mates introduce phase errors in modified reconstruction. If ploy cubic interpolation of the instantaneous partial phase. the phase is maintained at the partial onset, even the cubic Cubic phase envelopes have many undesirable proper- interpolation scheme cannot prevent phase errors from prop- ties. They are difficult to manipulate and maintain under agating in modified syntheses. This effect is illustrated in Fig. time- and frequency-scale transformation compared to lin- 9(c), in which the square wave timeÐfrequency data have ear frequency envelopes. However, in unmodified recon- been shifted in frequency by 10% and reconstructed using struction, cubic interpolation prevents the propagation of cubic phase curves modified to reflect the frequency shift. phase errors introduced by unreliable parameter estimates, By removing the off-center components at the onset of a maintaining phase accuracy in transients, where the tem- partial, we not only remove the primary source of phase poral envelope is important, and throughout the recon- errors, we also improve the shape of the temporal envelope structed waveform. The effect of phase errors in the in the modified reconstruction of transients by preserving a unmodified reconstruction of a square wave is illustrated more reliable phase estimate at a time closer to the time of in Fig. 9. If not corrected using a technique such as cubic the transient event. We can therefore maintain phase accu- phase interpolation, partial parameter errors introduced by racy at critical parts of the audio waveform even under off-center components render the waveshape visually transformation, and even using linear frequency envelopes, unrecognizable. Fig. 9(b) shows that cubic phase can be which are much simpler to compute, interpret, edit, and used to correct these errors in unmodified reconstruction. maintain than cubic phase curves. Fig. 9(d) shows a square It should be noted that, in this particular case, the phase wave reconstruction from cropped reassigned timeÐfre- errors appear dramatic, but do not affect the sound of the quency data, and Fig. 9(e) shows a frequency-shifted reconstructed steady-state waveforms appreciably. In reconstruction, both using linear frequency interpolation. many sounds, particularly transient sounds, preservation Removing components with large time corrections pre- of the temporal envelope is critical [13], [9], but since they serves phase in modified and unmodified reconstruction, lack audible onset transients, the square waves in Fig. and thus obviates cubic phase interpolation. 9(a)Ð(c) sound identical. It should also be noted that cubic Moreover, since we do not rely on frequent cubic phase phase interpolation can be used to preserve phase accu- corrections to our frequency estimates to preserve the racy, but does not reduce temporal smearing due to off- shape of the temporal envelope (which would otherwise center components in long analysis windows. be corrupted by errors introduced by unreliable data), we It is not desirable to preserve phase at all times in modi- have found that we can obtain very good-quality recon- fied reconstruction. Because frequency is the time deriva- struction, even under modification, with regularly sampled tive of phase, any change in the time or frequency scale of partial parameter envelopes. That is, we can sample the a partial must correspond to a change in the phase values at frequency, amplitude, and bandwidth envelopes of our
1200 1200 Frequency (Hz) Frequency (Hz)
370 370 0 140 0 140 Time (ms) Time (ms) (a) (b) Fig. 8. TimeÐfrequency coordinates of data from reassigned bandwidth-enhanced analysis. (a) Before cropping. (b) After cropping of off-center components clumped together at partial onsets. Source waveform is a bowed cello tone.
J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November 885 FITZ AND HAKEN PAPERS reassigned bandwidth-enhanced partials at regular inter- respond to the onset of associated partials, then that method vals (of, for example, 10 ms) without sacrificing the will preserve the temporal envelope of multiple transient fidelity of the model. We thereby achieve the data regular- events. In fact, however, partials often span transients. Fig. ity of frame-based additive model data and the fidelity of 10 shows a partial that extends over transient boundaries in reassigned spectral data. Resampling of the partial param- a representation of a bongo roll, a sequence of very short eter envelopes is especially useful in real-time synthesis transient events. The approximate attack times are indi- applications [11], [12]. cated by dashed vertical lines. In such cases it is not pos- sible to preserve the phase at the locations of multiple 6 BREAKING PARTIALS AT TRANSIENT EVENTS transients, since under modification the phase can only be preserved at one time in the life of a partial. Transients corresponding to the onset of all associated Strong transients are identified by the large time correc- partials are preserved in our model by removing off-center tions they introduce. By breaking partials at components components at the ends of partials. If transients always cor- having large time corrections, we cause all associated par-
1.0 Amplitude -1.0 60 Time (ms) 84 (a)
1.0 Amplitude -1.0 60 Time (ms) 84 (b)
1.0 Amplitude -1.0 60 Time (ms) 84 (c)
1.0 Amplitude -1.0 60 Time (ms) 84 (d)
1.0 Amplitude -1.0 60 Time (ms) 84 (e) Fig. 9. Reconstruction of square wave having abrupt onset from five sinusoidal partials corresponding to first five harmonics. 24-ms plot spans slightly less than five periods of 200-Hz waveform. (a) Waveform reconstructed from nonreassigned analysis data using lin- ear interpolation of partial frequencies. (b) Waveform reconstructed from nonreassigned analysis data using cubic phase interpolation, as proposed by McAulay and Quatieri [4]. (c) Waveform reconstructed from nonreassigned analysis data using cubic phase interpola- tion, with partial frequencies shifted by 10%. Notice that more periods of (distorted) waveform are spanned by 24-ms plot than by plots of unmodified reconstructions, due to frequency shift. (d) Waveform reconstructed from timeÐfrequency reassigned analysis data using linear interpolation of partial frequencies, and having off-center components removed, or cropped. (e) Waveform reconstructed from reassigned analysis data using linear interpolation of partial frequencies and cropping of off-center components, with partial frequen- cies shifted by 10%. Notice that more periods of waveform are spanned by 24-ms plot than by plots of unmodified reconstructions, and that no distortion of waveform is evident.
886 J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November PAPERS TIME-FREQUENCY REASSIGNMENT IN SOUND MODELING tials to be born at the time of the transient, and thereby The same two bongo strikes reconstructed from nonreas- enhance our ability to maintain phase accuracy. In Fig. 11 signed data are shown in Fig. 12(a). A comparison with the the partial that spanned several transients in Fig. 10 has been source waveform shown in Fig. 12(a) reveals that the recon- broken at components having time corrections greater than struction from reassigned data is better able to preserve the the time between successive analysis window centers (about temporal envelope than the reconstruction from nonreas- 1.3 ms in this case), allowing us to maintain the partial signed data and suffers less from temporal smearing. phases at each bongo strike. By breaking partials at the loca- tions of transients, we can preserve the temporal envelope of 7 REAL-TIME SYNTHESIS multiple transient events, even under transformation. Fig. 12(b) shows the waveform for two strikes in a bongo Together with Kurt Hebel of Symbolic Sound Corporation roll reconstructed from reassigned bandwidth-enhanced data. we have implemented a real-time reassigned bandwidth-
2114 2114
1976 1976
1839 1839
1701 1701 Frequency (Hz) Frequency (Hz) 1563 1563
1425 1425 123 146 170 193 216 239 123 146 170 193 216 239 Time (ms) Time (ms) Fig. 10. TimeÐfrequency plot of reassigned bandwidth- Fig. 11. TimeÐfrequency plot of reassigned bandwidth-enhanced analy- enhanced analysis data for one strike in a bongo roll. Dashed ver- sis data for one strike in a bongo roll with partials broken at components tical lines show approximate locations of attack transients. having large time corrections, and far off-center components removed. Partial extends across transient boundaries. Only timeÐfre- Dashed vertical lines show approximate locations of attack transients. quency coordinates of partial data are shown; partial amplitudes Partials break at transient boundaries. Only timeÐfrequency coordinates are not indicated. of partial data are shown; partial amplitudes are not indicated.
1.0 Amplitude -1.0 14 25 Time (ms) (a)
1.0 Amplitude -1.0 14 25 Time (ms) (b)
1.0 Amplitude -1.0 14 25 Time (ms) (c) Fig. 12. Waveform plot for two strikes in a bongo roll. (a) Reconstructed from reassigned bandwidth-enhanced data. (b) Reconstructed from nonreassigned bandwidth-enhanced data. (c) Synthesized using cubic phase interpolation to maintain phase accuracy.
J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November 887 FITZ AND HAKEN PAPERS enhanced synthesizer using the Kyma Sound Design Values for the envelopes Ak, Nk, and Fk are updated Workstation [15]. from the parameter stream every 128 samples. The syn- Many real-time synthesis systems allow the sound thesis element performs sample-level linear interpolation designer to manipulate streams of samples. In our real- between updates, so that Ak, Nk, and Fk are piecewise lin- time reassigned bandwidth-enhanced implementation, we ear envelopes with segments 128 samples in length [21]. work with streams of data that are not time-domain sam- The θk values are initialized at partial onsets (when Ak and ples. Rather, our envelope parameter streams encode fre- Nk are zero) from the phase envelope in the partial’s quency, amplitude, and bandwidth envelope parameters parameter stream. for each bandwidth-enhanced partial [11], [12]. Rather than using a separate model to represent noise in Much of the strength of systems that operate on sample our sounds, we use the envelope Nk (in addition to the tra- streams is derived from the uniformity of the data. This ditional Ak and Fk envelopes) and retain a homogeneous homogeneity gives the sound designer great flexibility data stream. Quasi-harmonic sounds, even those with with a few general-purpose processing elements. In our noisy attacks, have one partial per harmonic in our repre- encoding of envelope parameter streams, data homogene- sentation. The noise envelopes allow a sound designer to ity is also of prime importance. The envelope parameters manipulate noiselike components of sound in an intuitive for all the partials in a sound are encoded sequentially. way, using a familiar set of controls. We have imple- Typically, the stream has a block size of 128 samples, mented a wide variety of real-time manipulations on enve- which means the parameters for each partial are updated lope parameter streams, including frequency shifting, for- every 128 samples, or 2.9 ms at a 44.1-kHz sampling rate. mant shifting, time dilation, cross synthesis, and sound Sample streams generally do not have block sizes associ- morphing. ated with them, but this structure is necessary in our enve- Our new MIDI controller, the Continuum Fingerboard, lope parameter stream implementation. The envelope allows continuous control over each note in a perform- parameter stream encodes envelope information for a sin- ance. It resembles a traditional keyboard in that it is gle partial at each sample time, and a block of samples approximately the same size and is played with ten fingers provides updated envelope information for all the partials. [12]. Like keyboards supporting MIDI’s polyphonic after- Envelope parameter streams are usually created by tra- touch, it continually measures each finger’s pressure. The versing a file containing frame-based data from an analy- Continuum Fingerboard also resembles a fretless string sis of a source recording. Such a file can be derived from instrument in that it has no discrete pitches; any pitch may a reassigned bandwidth-enhanced analysis by resampling be played, and smooth glissandi are possible. It tracks, in the envelopes at intervals of 128 samples at 44.1 kHz. The three dimensions (left to right, front to back, and down- parameter streams may also be generated by real-time ward pressure), the position for each finger pressing on the analysis, or by real-time algorithms, but that process is playing surface. These continuous three-dimensional out- beyond the scope of this discussion. A parameter stream puts are a convenient source of control parameters for typically passes through several processing elements. real-time manipulations on envelope parameter streams. These processing elements can combine multiple streams in a variety of ways, and can modify values within a 8 CONCLUSIONS stream. Finally a synthesis element computes an audio sample stream from the envelope parameter stream. The reassigned bandwidth-enhanced additive sound Our real-time synthesis element implements bandwidth- model [10] combines bandwidth-enhanced analysis and enhanced oscillators [8] with the sum synthesis techniques [7], [8] with the timeÐfrequency reassignment technique described in this paper. K 1 We found that the method of reassignment strengthens yn^^^^^hhhhh!8B Akk n N nbnsin θk n (12) our bandwidth-enhanced additive sound model dramati- k 0 cally. Temporal smearing is greatly reduced because the