<<

The Phase Vocoder: A Tutorial Author(s): Mark Dolson Source: Computer Journal, Vol. 10, No. 4 (Winter, 1986), pp. 14-27 Published by: The MIT Press Stable URL: http://www.jstor.org/stable/3680093 Accessed: 19/11/2008 14:26

Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use.

Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/action/showPublisher?publisherCode=mitpress.

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission.

JSTOR is a not-for-profit organization founded in 1995 to build trusted digital archives for scholarship. We work with the scholarly community to preserve their work and the materials they rely upon, and to build a common research platform that promotes the discovery and use of these resources. For more information about JSTOR, please contact [email protected].

The MIT Press is collaborating with JSTOR to digitize, preserve and extend access to Computer Music Journal.

http://www.jstor.org MarkDolson The Phase Vocoder: Computer Audio Research Laboratory Center for Music Experiment, Q-037 A Tutorial University of California, San Diego La Jolla, California 92093 USA

Introduction technique has become popular and well understood. The phase vocoder is one of a number of digital For composers interested in the modification of signal processing algorithms that can be catego- natural sounds, the phase vocoder is a digital signal rized as analysis-synthesis techniques. Mathemati- processing technique of potentially great signifi- cally, these techniques are sophisticated algorithms cance. By itself, the phase vocoder can perform very that take an input signal and produce an output sig- high fidelity time-scale modification or pitch trans- nal that is either identical to the input or a modi- position of a wide variety of sounds. In conjunction fied version of it. The underlying assumption is that with a standardsoftware synthesis program,the the input signal can be well representedby a model phase vocoder can provide the composer with arbi- (i.e., a mathematical formula)whose parameters trary control of individual . But use vary with time. The analysis is devoted to deter- of the phase vocoder to date has been limited mining the values of these model parametersfor the primarily to experts in digital signal processing. signal in question, and the synthesis is simply the Consequently, its musical potential has remained output of the model itself. largely untapped. The benefits of analysis-synthesis formulations Since the is based on In this article, I attempt to explain the operation are considerable. synthesis of the phase vocoder in terms accessible to musi- analysis of a specific signal, the synthesized output this cians. I rely heavily on the familiar concepts of sine can be virtually identical to the original input; in bears waves, filters, and additive synthesis, and I employ a can occur even when the signal question minimum of mathematics. My hope is that this little relation to the assumed model. Furthermore, can tutorial will lay the groundworkfor widespread use the parametervalues derived from the analysis useful modifications of of the phase vocoder, both as a tool for sound analy- be altered to synthesize sis and modification, and as a catalyst for continued the original signal. In the case of alteration, how- musical musical exploration. ever, the perceptual significance and utility of the result depends critically on the degree to which the assumed model matches the signal to be Overview modified. In the phase vocoder, the signal is modeled as a to be deter- The phase vocoder has its origins in a long line of sum of sine waves, and the parameters voice coding techniques aimed at minimizing the mined by analysis are the time-varying amplitude amount of data that must be transmitted for intel- and frequency for each sine wave. These sine waves so this ligible electronic communication. Indeed, are not requiredto be harmonically related, the word "vocoder"is simply a contraction of the model is appropriatefor a wide variety of musical term "voice coder."The phase vocoder is so named signals. For example, wind, brass, string, speech, themselves to distinguish it from the earlier channel vocoder, and some percussive sounds all lend which is described in this paper as well. Commer- well to phase-vocoderrepresentation and modifi- and cial analog "vocoders"for are of cation. Other percussive sounds (e.g., clicks) the channel-vocoder type. The phase vocoder was certain signal-plus- sound combinations are first described in an article by Flanaganand Golden not well representedas a sum of sine waves. These can still be but (1966), but it is only in the past ten years that this sounds perfectly resynthesized, attempts to modify them can lead to unanticipated Computer Music Journal,Vol. 10, No. 4, Winter 1986, sonic results. Since the musical utility of the phase ? 1986 Massachusetts Institute of Technology. vocoder lies in its ability to modify sounds predic-

14 Computer Music Journal Fig. 1. Filterbank inter- Fig. 2. Idealized frequency pretation of the phase response of bandpass vocoder. filters.

Amplitude No overlap

>0 Frequency(Hz) R/2

Lairge overlap

I 10'XI Frequency(Hz) Input j ) - Output 1 R/2

(Fig. 1). For resynthesis, the amplitude and fre- quency outputs serve as the control inputs to a bank of sine-wave oscillators. The synthesis is then literally a sum of sine waves with the time-varying amplitude and frequency of each sine wave being Filters Oscillators obtained directly from the correspondingbandpass filter. If the center frequencies of the individual bandpassfilters happen to align with the harmonics tably, signals that fail to fit the model are a poten- of a musical signal, then the outputs of the phase- tial source of difficulty. vocoder analysis are essentially the time-varying In the sections that follow, I show in detail how amplitudes and frequencies of each . But the phase-vocoder analysis-synthesis is actually this alignment is by no means essential to the performed.In particular,I show that there are two analysis. complementary (but mathematically equivalent) The filterbank itself has only three constraints. viewpoints that may be adopted. I refer to these as First, the frequency response characteristics of the the filterbank interpretation and the Fourier-trans- individual bandpassfilters are identical except that form interpretation, and I discuss each in turn. each filter has its passband centered at a different Lastly, I show how the results of the phase-vocoder frequency. Second, these center frequencies are analysis can be used musically to effect useful mod- equally spaced across the entire spectrum from ifications of recordedsounds. In all cases, the input 0 Hz to R/2 Hz as shown in Fig. 2. Third, the indi- to the phase vocoder is assumed to be a discrete vidual bandpassfrequency response is such that the (i.e., sampled) signal with a sampling rate R of at combined frequency response of all the filters in least twice the highest frequency present in the parallel is essentially flat across the entire spectrum. signal. Thus, for a typical high fidelity application, (This is akin to the situation in a good graphicor R = 50 KHz. The frequency range of interest is parametricequalizer when the gain in each band is then from 0 Hz to R/2 Hz. set to 0 db; the resulting sound has no added colora- tion. In the phase vocoder, this condition ensures that no frequency component is given dispropor- The FilterbankInterpretation tionate weight in the analysis.) As a consequence of these constraints, the only issues in the design The simplest view of the phase-vocoder analysis is of the filterbank are the number of filters and the that it consists of a fixed bank of bandpass filters individual bandpassfrequency response. with the output of each filter expressed as a time- The number of filters should always be suffi- varying amplitude and a time-varying frequency ciently large so that there is never more than one

Dolson 15 Fig. 3. An individual bandpass filter.

sin(27rft)

Magnitude Input Frequency

cos(27rft) f

partial within the passband of any single filter. (In changes in the input signal. This is a fundamental general, more filters mean narrowerpassbands for tradeoffin the design of any filter-a sharp fre- each filter.) For harmonic sounds, this means that quency response comes only at the expense of a there must be at least one filter for each harmonic slow time response. in the frequency range from 0 Hz to R/2 Hz. Thus, In the phase vocoder, the consequence of this for example, if the sampling rate R is 50 KHz, and tradeoffis that to get sharp filter cutoffs with mini- the fundamental frequency of the sound is, say, 500 mal overlap, one must use filters whose time re- Hz, then there are partials every 500 Hz. Conse- sponse is very sluggish. For slowly-varying sounds, quently, there are (25000/500) = 50 partials in all, a sluggish time response in the individual filters and we need 50 filters in our bank. may be perfectly acceptable. For more rapidly vary- For inharmonic and polyphonic sounds, the ing sounds, however, it may be more desirable for number of filters is usually much greater because the individual filters to have a rapid time response. the partials are no longer equally spaced. If the In this case, there can be large frequency overlap be- number of filters is too small, then the phase vo- tween adjacent bandpassfilters. In practice, the best coder does not function as intended because the solution is generally discovered experimentally by partials within a single filter constructively and de- simply trying different filter settings for the sound structively interfere with each other (i.e., they will in question. "beat"with each other), and the information about their individual frequencies is coded as an unin- tended temporal variation in a single composite A Closer Look at the Filterbank signal. The design of the representative bandpassfilter The previous paragraphsprovide an adequate de- is dominated by a single consideration: the sharper scription of the phase vocoder from the standpoint the filter frequency response cuts off at the band of the user, but they omit some important details. edges (i.e., the less overlap between adjacent band- In particular, what needs to be clarified is that the pass filters), the longer the filter will ring in re- actual implementation of the filterbank in the sponse to a single nonzero input sample (i.e., the phase vocoder is rather different from that which longer the filter impulse response). Another way of might be employed in a commercial analog filter- saying that is: the sharperthe filter frequency re- bank. In this section, I show in detail how the out- sponse, the longer it takes the filter to respond to put of a single phase vocoder bandpassfilter is

16 Computer Music Journal Fig. 4. Spectralplot of the result of multiplying two sinusoids: (a) sinusoid 1, (b) sinusoid 2, (c) result.

expressed as a time-varying amplitude and a time- cos(2rr 110 t)cos(27T100 t) varying frequency. = cos(27r (110 - 100)t) + cos(27r (110 + 100)t) A diagram of the operation of a single phase- vocoder bandpass filter is shown in Fig. 3. This pic- (a) Frequency(Hz) I I .I ture may appeara bit complicated, but it can easily 0 be broken down into a series of fairly simple steps. 100 200 In the first step, the incoming signal is routed (b) Frequency(Hz) [ I i > into two parallel paths. In one path, the signal is 0 multiplied by a sine wave with an amplitude of 1.0 100 200 (c) Frequency(Hz) and a to the center of the I I frequency equal frequency I I I ), in the other the is mul- bandpass filter; path, signal 0 100 200 tiplied by a cosine wave of the same amplitude and frequency. (The readershould recall that sine and cosine waves are both sinusoidal waveforms that differ only in their starting point of phase. The cosine wave is always 90 degrees ahead of the sine soids-an amplitude of one sinusoid by wave.) Thus, the two parallel paths are identical the other.) except for the phase of the multiplying waveform. Now, if the result of the multiplication in Fig. 4 The next step in Fig. 3 is that, in each of the two is passed through an appropriatelowpass filter, only paths, the result of the multiplication is fed into a the 10 Hz sinusoid will remain. This sequence of lowpass filter. To understandthe significance of this operations (i.e., multiplying by a sinusoid of fre- operation, first one must better understand the quency f and then lowpass filtering) is useful in a multiplication operation itself. variety of signal processing applications and is Suppose we have a complex signal consisting of known as heterodyning. many frequency components. Multiplying that sig- In the phase vocoder, heterodyning accomplishes nal by a sinusoid of constant frequency has the two things. First, any input frequency components effect of shifting all the frequency components by in the vicinity of frequency f are shifted down to both plus and minus the frequency of the sinusoid. the vicinity of 0 Hz and allowed to pass; input fre- An example of this is shown in Fig. 4 in which a quency components not in the vicinity of frequency single component at 110 Hz (representedas a co- f are similarly shifted but not by enough to get sine wave of constant amplitude cos(27r1lOt)), is through the lowpass filter. Thus, heterodyning im- multiplied by a cosine wave at 100 Hz (represented plements a type of bandpassfiltering in which the as cos(27rl100t)).The effect of this multiplication is passband is frequency-shifted down to very low fre- to "split" the 110 Hz component into a low fre- quencies. A bandpassfilter of this type can easily quency component at 10 Hz (i.e., 110 Hz - 100 Hz) be designed to have any desired center frequency and a high frequency component at 210 Hz (i.e., 110 simply by choosing the frequency of the hetero- Hz + 100 Hz). (A musical example of multiplying dyning sine and cosine waves to equal the desired two signals can be found in the common effect of center frequency.Indeed, this is precisely how each . But in typical applications, nei- of the filters in Fig. 2 is actually implemented. The ther the input signal nor the multiplying signal is result is not a "true"bandpass filtering because of a simple sinusoid; the result of this is that every the frequency-shifting of the passband,but it does sinusoidal component of the input signal is split by effectively separatefrequency components in the every sinusoidal component of the multiplying sig- vicinity of frequency f from all other frequency nal. Another musical example, the familiar phe- components. Furthermore,as will be shown shortly, nomena of beats, is essentially the other side of the the frequency-shifting is straightforwardlytaken same coin: when two sinusoids are added together, into account in subsequent steps within the phase the result can be expressed as a product of two sinu- vocoder.

Dolson 17 Fig. 5. Rectangular and polar coordinates.

The second important consequence of hetero- dyning in the phase vocoder is that it provides a simple mechanism for computing the time-varying amplitude and frequency of the resulting signal. In Fig. 3, heterodyning is performedin each of the two parallel paths. But since one path heterodynes with a sine wave while the other path uses a cosine wave (and since the relative phase is preservedby the process), the resulting heterodyned signals in the two paths are out of phase by 90 degrees. Thus, in the above example, both paths produce a 10-Hz sinusoidal wave at the outputs of their respective lowpass filters, but one of the two sinusoids is 90 degrees ahead of the other. What we really want, however, are not two sinusoids, but rather the am- plitude and frequency of a single sinusoid. This leads us to the next operation in the sequence of Fig. 3: the transformationfrom rectangular (i.e., Cartesian)coordinates to polar coordinates.

Rectangular Coordinates versus Polar r = Vxo2 + y02 Coordinates 0= arctan( Yo Sinsuoidal motion is often introduced to students Xo as a projection of uniform circular motion. I now show how this same point of view can be adopted to understandthe transformationfrom "two sinu- soids" to "amplitude and frequency of a single The situation in the phase vocoder is directly sinusoid" within the phase vocoder. analogous. Since the result of the heterodyning and Suppose that we wish to plot the position of a lowpass filtering operations is a pair of sinusoids point on the rotating wheel in Fig. 5 as a function with a 90-degree phase difference between them, of time. We have a choice of using rectangular coor- the two parallel paths in the phase vocoder can be dinates (e.g., horizontal position and vertical posi- viewed equivalently as the rectangular(i.e., hori- tion) or polar coordinates (e.g., radial position and zontal and vertical) coordinates of a single point on angular position). With rectangularcoordinates we a rotating wheel. The position of this point can be find that both the horizontal position and the ver- represented equally well via polar coordinates. tical position of our point vary sinusoidally, but the Actually, this choice between rectangularand maximum vertical displacement occurs one quarter polar coordinates occurs frequently in signal proc- cycle later than the maximum horizontal displace- essing but in a variety of different guises. In the ment (e.g., the point is at its highest one-quarter terminology of communications systems (where cycle after being at its furthest to the right). Thus, heterodyning is most frequently encountered), the the horizontal and vertical signals are sinusoids horizontal and vertical signals are the two parallel with a 90-degree phase difference between them. heterodyning paths, and the two resulting lowpass- On the other hand, if we represent a circularly filtered signals are known as the in-phase and quad- rotating point in terms of polar coordinates, we rature signals. In the literature of Fouriertrans- simply have a linearly increasing angular position forms, the horizontal and vertical signals are known and a constant radial position. as the real and imaginary components, and the

18 Computer Music Journal Fig.6. Phaseunwrapping.

are radial position and angular position known 0 (degrees) respectively as magnitude and phase. (Note that this usage of the term "phase"-to indicate angular 720 - position as a function of time-is more general than the usage in which phase is simply the initial 360 - angular position, as when two signals are said to be / / ,~~~~~~~~~~~~~~~~~~~~1, Time "out of phase.") The magnitude-and-phaserepresen- 0 tation is also the standardfor describing filter fre- 0 (Degrees) quency response, but in this case it is customary to plot only the magnitude, as in Fig. 2. 720 - To actually perform the conversion from rectan- gular to polar coordinates, we can use the formulas 360 -

in Fig. 5. The radial position is the hypotenuse of 0 ) , Time the right triangle with the horizontal and vertical positions as the two sides. In phase vocoder terms, the time-varying radial position is the desired time- varying amplitude (i.e., magnitude). Thus, the single cycle. Therefore, if we want our frequency amplitude at each point in time is simply the square calculation to work properly,we should really write root of the sum of the squares of the two hetero- the sequence as 180, 225, 270, 315, 360, 405, 450. dyned signals. Similarly, the time-varying angular Then the difference between successive values is in- position is the time-varying phase. Thus, the phase deed 45 in all cases. This process of adding in 360 at each point in time is the angle whose tangent is degrees whenever a full cycle has been completed is (i.e., the arctan of) the ratio of the vertical position known as phase unwrapping. A comparison of the to the horizontal position. angular-position signal obtained directly from the But relating the time-varying angular position to rectangular-to-polarconversion and the unwrapped the desired time-varying frequency requires an addi- angular position-signal is illustrated in Fig. 6. tional operation (see Fig. 3). Frequency is the num- The operations of phase unwrappingand calculat- ber of cycles which occur during some unit time ing rate-of-rotation(i.e., frequency)from successive interval. In terms of the rotating wheel, this can be unwrappedangular-position values are the two stated as the "numberof revolutions per unit time." next-to-last operations in the sequence of steps in Thus, to determine the time-varying frequency we Fig. 3. But note that the frequency actually refers need to determine the rate of rotation of the wheel. only to the differencefrequency between the hetero- To do this, we can simply measure the change in dyning sinusoid (i.e., the filter center frequency) and angular position between two successive samples, the input signal. (Rememberthat the rectangular- and divide by the time interval between them. to-polar conversion and the rate-of-rotationcalcula- Hence, the frequency at each point in time is the tion are all performedonly on the lowpass-filtered difference between successive angular position results of the heterodyning operation-e.g., on the values divided by the sample period. 10-Hz component that resulted from "splitting" the It turns out, though, that there is one additional 110-Hz component into components at 10 Hz and problem here: the arctan function produces answers 210 Hz.) Therefore the very final step in the phase- only in the range of 0 to 360 degrees. Thus, if we vocoder-analysis sequence is to add the filter center examine successive values of angular position, we frequency back in to the instantaneous frequency may find a sequence such as 180, 225, 270, 315, 0, signal calculated above. 45, 90. This may appearto suggest that the instan- In summary, the internal operation of a single taneous frequency (i.e., rate of angular rotation) is phase vocoder bandpassfilter, as diagrammedin Fig. not constant (because the difference between suc- 3, consists of (1) heterodyning the input with both a cessive values is not always 45). What has actually sine wave and a cosine wave in parallel, (2) lowpass happened is that we have gone through more than a filtering each result, (3) converting the two parallel

Dolson 19 Fig. 7. Filterbank inter- pretation versus Fourier- transform interpretation.

lowpass-filtered signals from rectangularto polar Frequency coordinates, (4) unwrapping the angular-position A values, (5) subtracting successive unwrappedan- gular-position values and dividing by the time inter- val to obtain a frequency signal, and (6) adding the o filter center frequency back in to the frequency sig- nal. This sequence of operations produces the de- sired time-varying parametervalues. (I 0 o 0 0 ) Filter view

The Fourier-TransformInterpretation o

A complementary view of the phase-vocoder analy- sis is that it consists of a succession of overlapping o Fouriertransforms taken over finite-duration win- dows in time. It is interesting to compare this per- Fourierview spective to that of the filterbank interpretation. In ) Time the latter, the emphasis is on the temporal succes- sion of magnitude and phase (or frequency) values in a single filter band. In contrast, the Fourier-trans- form interpretation focuses attention on the magni- periodic waveform. If the waveform is perfectly tude and phase values for all of the different filter periodic (i.e., it repeats itself indefinitely with no bands or frequency bins at a single point in time. A changes in period or waveshape), it is always pos- graphic representation of this distinction is shown sible to synthesize it as the sum of appropriately in Fig. 7. weighted sinusoids with frequencies that are in- It is important to understand that each of these teger multiples of the fundamental frequency.In viewpoints is really concerned only with the imple- musical terms, the individual sinusoids are the har- mentation of the bank of bandpassfilters. In either monics; in mathematical terms, they are the com- case, it is still necessary to convert the filter out- ponents of the Fourierseries. But since the initial puts from rectangularto polar coordinates as de- phase of the periodic signal is arbitrary,it generally scribed previously. So the real question here is how requires a sum of both sine and cosine components a sequence of overlappingFourier transforms can to match this initial phase correctly. Musical dis- behave like a bank of filters. cussions of harmonics tend to ignore this subtlety Certain features of the Fouriertransform have because the initial phase is itself of little perceptual fairly simple interpretations in terms of a filter- significance. However, when the signal is not per- bank. For example, the number of filter bands is fectly periodic, then the phase can no longer be simply the number of bins in the Fouriertransform. ignored. Similarly, the equal spacing in frequency of the The discrete short-time Fourier transform is es- individual filters is a fundamental feature of the sentially a means of computing the Fourierseries Fouriertransform. But how does the Fourierview for signals that are not perfectly periodic. The basic incorporate the shape of the bandpassfilters (e.g., idea is that the signal in question can be multiplied the steepness of the cutoff at the band edges) or the by a window function so that only a certain section in-phase and quadraturesignals? of the signal is nonzero, and the Fourierseries can The Fouriertransform can be seen as a gener- be computed assuming that the windowed section alization of the Fourierseries. The Fourierseries repeats indefinitely. The exact shape of the window specifies the amplitudes of the various harmonics turns out to be important because the windowing that must be added together to create a complex operation smears the signal's spectrum so that each

20 Computer Music Journal bin of the Fouriertransform (i.e., each component in it is essentially identical to the phase vocoder. the Fourierseries) includes some energy from other These two differingviews of the phase-vocoder bins nearby.The general rule is that the amount of analysis suggest two equally divergent interpreta- smearing increases as the window duration gets tions of the resynthesis. In the filterbank interpreta- shorter. Thus, the window function turns out to tion, the resynthesis can be viewed as an additive- have exactly the same role as the filter impulse re- synthesis procedurewith time-varying amplitude sponse in the filterbank interpretation of the phase and frequency controls for each of a bank of oscil- vocoder (e.g., making the impulse response shorter lators. In the Fourierview, the synthesis is accom- increases the filter overlap or smearing). plished by addingtogether successive inverse Similarly, the fact that each bin has both a sine Fouriertransforms, overlappingthem in time to cor- component and a cosine component is equivalent respond with the overlappingof the analysis Fourier to each filter in the filterbank interpretation having transforms. (Note, however, that this requires con- both an in-phase and quadraturesignal. The polar verting back from polar coordinates to rectangular representation of these two signals as a rotating vec- coordinates prior to calculating the inverse Fourier tor (i.e., a point on a rotating wheel) allows the pre- transform.) cise determination of the time-varying frequency Although the filterbank and Fourierviews are within a single bin; this is done by comparing the mathematically equivalent, a particularadvantage orientation of the vector in successive Fouriertrans- of the Fourierinterpretation is that it leads to the forms. For example, when the frequency of a par- implementation of the filterbank via the fast Fourier ticular signal is exactly equal to that of a particu- transform (FFT)technique. The FFTis an algorithm lar Fouriertransform bin, then successive Fourier for computing the Fouriertransform with fewer transforms will always find the rotating vector for multiplications than would otherwise be required that bin in the same orientation. Successive phase provided that the number of bins is a power of 2. values are thus equal, and their difference is zero. The FFT produces an output value for each of N When the input signal frequency does not exactly bins with (on the order of) N log2N multiplications, match the bin frequency,the differencefrequency while the direct implementation of the filterbank can be calculated from the successive phase values requires N2 multiplications. Thus, the Fourierin- exactly as described in the preceding section. terpretation can lead to a substantial increase in In summary, the two complementary views of the computational efficiency when the number of fil- phase-vocoder analysis can be described as follows. ters is large (e.g., if N = 1024, the savings is ap- In the filterbank interpretation, the filtering is ac- proximately a factor of 100). complished by heterodyning and lowpass filtering It is also the Fourierinterpretation that has been to produce in-phase and quadraturesignals. In the the source of much of the recent progressin phase- Fourierview, the bank of filters is a set of frequency vocoder-like techniques. Mathematically, these bins, and the in-phase and quadraturesignals are techniques are described as short-time Fourier- the real and imaginary components of the Fourier transform techniques (Crochiere 1980; Portnoff transform. In either case, it is still necessary to con- 1980; 1981a; 1981b; Griffin and Lim 1984). Such vert from rectangularto polar coordinates, and then algorithms can also be referredto as multirate digi- to calculate the difference between successive un- tal signal processing techniques for reasons that will wrappedphase values to determine the time-varying be made clear later (Crochiereand Rabiner 1983). frequency within each filter (or bin). We may note that it is precisely this calculation of time-varying frequency that distinguishes the phase vocoder from Sample-Rate Considerations the earlier channel vocoder. The channel vocoder computes only the time-varying amplitude within The input and output signals to and from the phase each filter (or bin) and makes no attempt to deter- vocoder are always assumed to be digital signals mine the time-varying frequency.In other respects, with a sampling rate of at least twice the highest

Dolson 21 frequency in the associated analog signal (e.g., a computed only 10000/50 = 200 times per second. speech signal with a highest frequency of 5 KHz It turns out, though, that the process of adding to- might be digitized-at least in principle-at 10 gether the successive overlappinginverse FFTs for KHz and fed into the phase vocoder). However, the resynthesis automatically accomplishes the nec- sample rates within the individual filter bands of essary interpolation. the phase vocoder do not need to be nearly so high. Last, it should be noted that we have so far con- This is most easily understood via the filterbank sidered the bandwidth of the output of the lowpass interpretation. filter without any mention of the conversion from Within any given filter band, the result of the rectangularto polar coordinates. This conversion heterodyning and lowpass-filtering operation is a involves highly nonlinear operations which (at least signal whose highest frequency is equal to the cut- in principle) can significantly increase the band- off frequency of the lowpass filter. For instance in width of the signals to which they are applied. For- the above example, the lowpass filter may only pass tunately, this effect is usually small enough in frequencies up to 50 Hz. Thus, although the input practice that it can generally be ignored. to the filter was a speech signal sampled at 10 KHz, the output of the filter can be sampled (at least in the ideal case) at as little as 100 Hz without any Applications error.This is true for each of the bandpass filters, because each filter operates by heterodyning a cer- The basic goal of the phase vocoder is to separate tain frequency region down to the 0-50 Hz region. (as much as possible) temporal information from In practice, the lowpass filter can never have an spectral information. The operative strategy is to infinitely steep cutoff. Therefore, it is generally divide the signal into a number of spectral bands, advisable to sample the output of the filter at four and to characterize the time-varying signal in each times the cutoff frequency (e.g., 200 Hz) rather than band. This strategy succeeds to the extent that this two. Still, this represents an enormous savings in bandpasssignal is itself slowly varying. It fails when computation (e.g., the filter output is calculated there is more than a single partial in a given band, or 200 times per second instead of 10,000 times per when the time-varyingamplitude or frequencyof the second). bandpasssignal changes too rapidly."Too rapidly" If we now seek to resynthesize the original input means that the amplitude and frequency are not from the phase-vocoder-analysissignals, we face a relatively constant over the duration of a single FFT. minor problem. The analysis signals, which in the Equivalently, a signal is varying too rapidly if the filterbank interpretation are thought of as providing amplitude or frequency changes considerably over the instantaneous amplitude and frequency values durations that are small comparedto the inverse of for a bank of sine-wave oscillators, are no longer at the lowpass filter bandwidth. (Recall that the dura- the same sample rate as the desired output signal. tion over which the transformis taken is inversely Thus, an interpolation operation is requiredto con- proportionalto the lowpass filter bandwidth.) vert the analysis signals back up to the original To the extent that the phase vocoder does succeed sample rate. Even with this interpolation, this is in separatingtemporal and spectral information, it much more computationally efficient than omit- provides the basis for an impressive arrayof musi- ting the sample-rate reduction in the first place. cal applications. Historically, the first of these to be In the Fourier-transforminterpretation, an inter- explored was that of analyzing instrumental tones polation is also requiredfor resynthesis, but the to determine the time-varying amplitudes and fre- details are less apparent.In the previous example, quencies of individual partials. This application was where the internal sample rate is only 2% (200 Hz/ pioneered by Moorer and Grey at StanfordUniver- 10000 Hz) of the external sample rate, we simply sity in the mid 1970s in a landmarkseries of inves- skip 10000/200 = 50 samples between successive tigations of the perception of timbre (Grey 1977; FFTs. As a result, the analysis FFT values are Grey and Moorer 1977; Grey and Gordon 1978;

22 Computer Music Journal Moorer 1978). (The heterodyne filter technique de- tween successive FFTs).This means that the suc- veloped by Moorer and used in those investigations cessive phase values within the bin are incremented is essentially a special case of the phase vocoder.) by 45 degrees. Spacing the inverse FFTs further More recently, interest in the phase vocoder has apart means that the 45-degree increase now occurs focused more on its ability to modify and transform over a longer time interval. But this means that the recordedsound materials in musically useful ways. frequency of a signal has been inadvertently altered. The possibilities in this realm are myriad. However, The solution is to rescale the phase by precisely two basic operations stand out as particularlysignifi- the same factor by which the sound is being time- cant. These are time scaling and pitch transposition. expanded.Thus for time expansion by a factor of two, the 45-degree increase should be rescaled to a 90-degree increase, because it occurs over twice the Time Scaling time interval of the original 45-degree increase. This ensures that the signal in any given filter band It is always possible to slow down a recordedsound has the same frequency variation in the resynthesis simply by playing it back at a lower sample rate; this as in the original (though it occurs more slowly). is analogous to playing a tape recording at a lower The reason that the problem of rescaling the playback speed. But this kind of simplistic time ex- phase does not appearin the filterbank interpre- pansion simultaneously lowers the pitch by the tation is that the interpolation there is assumed to same factor as the time expansion. Slowing down be performedon the frequency control signal as the temporal evolution of a sound without altering opposed to the phase. This is perfectly correct con- its pitch requires an explicit separation of temporal ceptually, but the actual implementation generally and spectral information. As noted above, this is conforms more closely to the Fourierinterpreta- precisely what the phase vocoder attempts to do. tion. Also, by emphasizing that the time expansion To understand the use of the phase vocoder for amounts to spacing out successive "snapshots"of time scaling, it is helpful once again to consider the the evolving spectrum, the Fourierview makes it two basic interpretations described above. In the fil- easier to understandhow the phase vocoder can per- ter bank interpretation, the operation is simplicity form equally well with nonharmonic material. itself. The time-varying amplitude and frequency To be sure, the phase vocoder is not the only signals for each oscillator are control signals that technique that can be employed for this kind of carry only temporal information. Expandingthe time scaling. A number of time-domain procedures duration of these control signals (via interpolation) (i.e., not involving filtering or Fouriertransforms) does not change the frequency of the individual can be employed at substantially less computa- oscillators at all, but it does slow down the tem- tional expense. But from the standpoint of fidelity poral evolution of the composite sound. The result (i.e., the relative absence of objectionable artifacts), is a time-expanded sound with the original pitch. the phase vocoder is by far the most desirable. The Fouriertransform view of time scaling is very similar, but with one additional catch. The basic idea is that in order to time-expand a sound, Pitch Transposition the inverse FFTs can simply be spaced further apart than the analysis FFTs.As a result, spectral changes Since the phase vocoder can be used to change the occur more slowly in the synthesized sound than in temporal evolution of a sound without changing its the original. But this overlooks a critical detail in- pitch, it should also be possible to do the reverse volving the magnitude and phase signals. (i.e., change the pitch without changing the dura- Consider a single bin within the FFT for which tion). Indeed, this operation is easily accomplished. the signal within that bin is increasing in phase at a The procedureis simply to time-scale by the de- rate of 1/8 cycle (i.e., 45 degrees)per time interval sired pitch-change factor, and then to play the re- (where the time interval in question is the time be- sulting sound back at the "wrong"sample rate. For

Dolson 23 Fig. 8. Spectral envelope correction. (a) original spectrum. (b) transposed spectrum. (c) corrected spectrum.

example, to raise the pitch by an octave, the sound Original spectrum is first time-expanded by a factor of two, and the A time-expansion is then played at twice the original sample rate. This shrinks the sound back to its duration while all original simultaneously doubling Ill Ii, -lb Frequency frequencies. In practice, however, there are also (a) some additional concerns. First, instead of changing the clock rate on the playback digital-to-analog converters, it is more Transposedspectrum convenient to perform sample-rate conversion on A the time-scaled sound via software. Thus, in the above example, we would simply designate a higher sample rate for the time-expanded sound, and then ' convert it down a factor of two so lIl > Frequency sample-rate by (b) that it could be played at the normal sample rate. (It is possible to embed this sample-rate conversion within the phase vocoder itself, but this proves to be of only marginal utility and will not be dis- Transposedspectrum with cussed further here.) original spectral envelope Second, upon closer examination it can be seen that only time-scale factors that are ratios of in- tegers are actually allowed. This is clearest in the Frequency Fourierview because the expansion factor is simply (c) the ratio of the number of samples between suc- cessive analysis FFTs to the number of samples be- these sounds intelligibility is not an issue. Conse- tween successive synthesis FFTs. However, it is quently, the change in sound quality is not nearly equally true of the filterbank interpretation because so objectionable.) To correct for this, an additional it turns out that the control signals can only be in- operation can be inserted into the phase-vocoder terpolated by factors that are ratios of two integers. algorithm as shown in Fig. 8. For each FFT,this Of course, this has little significance for time scal- additional operation determines the spectral enve- ing because, while it may be impossible to find two lope (i.e., the shape traced out by the peaks of the suitable integers with precisely the desired ratio, harmonics as a function of frequency), and then the erroris perceptually negligible. However, when distorts this envelope in such a way that the subse- time scaling is performedas a prelude to pitch trans- quent sample-rate conversion brings it back pre- position, the perceptual consequences of such errors cisely to its original shape. are greatly magnified (by virtue of the ear's sen- sitivity to small pitch differences),and considerable care may be requiredin the selection of two appro- Conclusion priate integers. An additional complication arises when modify- The descriptions previously discussed in this ar- ing the pitch of speech signals because the trans- ticle address only the most elementary possibili- position process changes not only the pitch, but ties of the phase-vocodertechnique. In addition to also the frequency of the vocal tract simple time scaling and pitch transposition, it is (i.e., the ). For shifts of an octave or more, also possible to perform time-varying time scaling this considerably reduces the intelligibility of the and pitch transposition, time-varying filtering (e.g., speech. (This same phenomenon occurs in the pitch cross synthesis), and nonlinear filtering (e.g., noise transposition of nonspeech sounds as well, but for reduction), all with very high fidelity. The phase vo-

24 Computer Music Journal coder analysis capabilities alone can be extremely Appendix useful in applications ranging from psychoacoustics to composition (e.g., individual harmonics of a While there is much to be said for a nonmathemati- sound can be separatedin space). cal description of the phase vocoder, there is also a limit to what can be understood in this fashion. In Acknowledgments this Appendix I provide a brief mathematical re- statement of the foregoing description. I am grateful to both JerryBalzano and Richard Mathematically, the short-time Fourier transform Boulangerfor numerous helpful suggestions that X(n,k) of the signal x(n) is a function of both time have considerably enhanced the clarity of this n and frequency k. It can be written as presentation.

X(n,k) = E - N km m = -- x(m)h(n m)e (1) References where h (n) is a window or filter impulse response Crochiere, R. E. 1980. "A WeightedOverlap-Add Method as described below. of Fourier IEEE Analysis-Synthesis." Transactions on The short-time Fouriertransform is the analysis Acoustics, Speech, and Signal Processing ASSP-28(1): of the A 55-69. portion analysis-synthesis procedure. gen- eral is then Crochiere, R. E., and L. R. Rabiner. 1983. Multirate Digi- resynthesis equation given by tal Cliffs: Prentice-Hall. Signal Processing. Englewood oo 2rNN- 1 . Flanagan,J. L., and R. M. Golden. 1966. "PhaseVocoder." - x(n)= m f(n-m) N (2) Bell System Technical Journal45:1493-1509. m = -oo N koX(m,k)e=0 Grey,J. M. 1977. "Multidimensional PerceptualScaling of Musical Timbres." Journalof the Acoustical Society of where f(n) is another window or filter impulse re- America 61:1270-1277. sponse. In the absence of modifications (and with Grey,J. M., and J.A. Moorer. 1977. "PerceptualEvalua- certain constraints on h(n) and f(n)), this equation tions of Synthesized Musical Instrument Tones." Jour- reconstructs the input perfectly. nal of the Acoustical Society of America 62:454-462. The analysis equation can be viewed from either Grey,J. M., and J.W. Gordon. 1978. "PerceptualEffects of of two complementary perspectives. On the one Modifications on Musical Timbres." Journal Spectral hand, Eq. [1] can be written as of the Acoustical Society of America 63:1493-1500. X(n, 2ri. Griffin, D. W., and J.S. Lim. 1984. "SignalEstimation - = E N m). (3) from Modified Short-TimeFourier Transform." IEEE X(n,k) m = -0 (x(m)e' -)h(n Transactions on Acoustics, Speech, and Signal Proc- essing ASSP-28(2):236-242. This describes a heterodyne filterbank. The kth fil- Moorer, J.A. 1978. "The Use of the Phase Vocoderin ter channel is obtained by multiplying the input Computer Music Applications." Journalof the Audio x(m) by a complex sinusoid at frequency k/N times Engineering Society 24(9): 717-727. the sample rate. This shifts input frequency compo- Portnoff,M. R. 1980. "Time-FrequencyRepresentation nents in the vicinity of k/N times the sample rate of Digital Signals and Systems Based on Short-Time down near 0 Hz also near twice k/N times Fourier Xnalysis."IEEE Transactions on Acoustics, (and up the sample The is then con- Speech, and Signal Processing ASSP-28(1):55-69. rate). resulting signal volved with the lowpass filter h (m). This removes Portnoff,M. R. 198 la. "Short-TimeFourier Analysis of the and leaves Sampled Speech." IEEETransactions on Acoustics, high-frequency components only those in the Speech, and Signal Processing ASSP-29(3):364-373. input frequency components originally Portnoff,M. R. 198lb. "Time-ScaleModification of Speech vicinity of k/N times the sample rate (albeit shifted Based on Short-Time FourierAnalysis." IEEETransac- down to low frequency). Thus the output X(n,k), tions on Acoustics, Speech, and Signal Processing for any particular value of k, is a frequency-shifted, ASSP-29(3):374-390. bandpass-filtered version of the input.

Dolson 25 On the other hand, Eq. (1) can be regroupedas where M > N. But the exponential terms involving i > N (e.g., i = N + 3) are identical to the corre- - sponding terms involving i < N (e.g., i = 3) because X(n,k)= E (x(m)h(n m))e~- N (4) m = -00 (letting i = N + 1)

This is the for the discrete Fouriertrans- 2 expression + kN - k - kc ---k(NN k ) -i N N form of an input signal x(m) which is multiplied by e = e-Ne N = e-2rk e (8) a finite-duration, time-shifted window h(n - m). Since e-ji2rk = 1 for integer values of k, the summa- Thus the output for any particularvalue of X(n,k), tion reduces to n, is the Fouriertransform of the windowed input at time n. Instead of representing the output of a N- I rNsM + _. 27rT bank of parallel heterodyne filters, X(n,k) now rep- r0 + e 'N = E= y(rN 1) (9) resents a succession of partially-overlappingFourier I 0 r 0 transforms. Thus, even when the window is longer than the de- this latter view is particularly Computationally, sired FFT size, the FFT formulation can still be because the transform can be imple- significant used provided that the windowed signal is appropri- mented as a fast Fouriertransform To see (FFT). ately stacked and added (i.e., successive temporal this, we can note that if the nonzero extent of the segments are overlayedand added together). window is N samples or fewer = 0 for (i.e., h(n) in| Using a window duration greaterthan N samples, > then Eq. is simply N/2), (4) though, does entail some additional complication. N n+ 2 1 2 First, it turns out that the condition for perfect re- = = X(n,k) = > x(m)h(n - m)e N (5) synthesis is that h(n) 0 for n IN (where 1 is any m = n - - nonzero integer). This says that the window must be zero at multiples of N. If the window is With a change of variable (i = m - n + N/2), this integer fewer than N to then this con- can be rewritten as samples begin with, dition is trivially satisfied. If the window is greater (6) than N samples in duration, then an easy way to X(n,k) = satisfy the condition is to make the actual window NI2 - ' N I the product of the desired window and a sin(x)/x - N e N x(i + n N/2)h(-i + N/2)e function with period N. Second, the duration of the analysis window h (n) form and the summation is seen to have the proper is an important factor in selecting an appropriate from i = 0 for an FFT implementation (i.e., a sum synthesis window f(n). If the analysis window is = to i N - 1 of terms x(i)exp(- j2rki/N)). The ex- shorter than N samples in duration, then it can be ponential phase term preceding the summation is shown that an optimal synthesis window is simply important to include, but it is equivalent to shifting an amplitude-scaled copy of the analysis window. If the time origin for the terms within the summa- the analysis window duration is greaterthan N tion. Hence, a cyclic rotation of the windowed sig- samples, then it is easier to think of the synthesis nal prior to Fouriertransforming eliminates the window as the impulse response of a filter that is need for an explicit multiplication by the performinginterpolation. Then a good choice is to exponential. make the synthesis window the product of some As it turns out, this procedureis equally appli- simple window and a sin(x)/x function with period is cable in the case where the window duration I (where I is the number of samples between the as- greater than N. The summation in Eq. (6) then beginnings of successive Fouriertransforms). sumes the general form Additional insight into the role of the analysis M .271r window can be gained by considering a narrow- E y([i}e- (7) band input signal to the phase vocoder given by

26 Computer Music Journal the fact that when two sinusoids of different fre- = --knT + by x(n) A(n)cos( O(n)). (10) quencies lie within the same filter bandpass,the composite signal has 180-degreejumps in phase If both the amplitude A(n) and the instantaneous whenever the composite amplitude goes through phase are slowly varying, then it can be shown 0(n) zero. This is a consequence of the fact that that the phase-vocoderanalysis for the filter cen- tered at frequency 27rk/N results in cos(A) + cos(B) = cos((A - B)/2)cos((A + B)/2). (13) 00 The slowly varying cos((A - B)/2) can be thought of - |X(n,k)| = I| A(m)h(n m)e?IOm\ (11) as the amplitude envelope modulating the more rap- -00 m= idly varying cos((A + B)/2). But the phase-vocoder amplitude estimate is always positive. Hence, when arg[X(n,k) = arg[ E A(m)h(n - m)eilm)] (12) cos((A - B)/2) changes sign, the amplitude esti- m = -oo mate remains unchanged, and the phase estimate 180 where IX(n,k)l and arg[X(n,k)] are, respectively, the jumps degrees. In magnitude and phase of X(n,k). the past two years, considerable progresshas been made on short-time Fourier Ideally Eq. (11)would simply state |X(n,k )| = A (n). analysis-synthesis This would mean that the phase-vocoderanalysis formulations that do not require an explicit phase had perfectly extracted the amplitude of the input calculation (Griffinand Lim 1984). These magni- signal. In reality, though, we see that the phase- tude-only techniques are potentially superior to the vocoder estimate of the amplitude is a smeared ver- phase vocoder for applications such as time scaling sion of the true amplitude. If the input signal has a because these new techniques are based on a mathe- constant frequency of exactly 27rk/N, then 0(n) = 0 matical formulation that guarantees an optimal re- sult some measure of error for all n, and the amplitude estimate IX(n,k)1 is pre- (i.e., between actual and cisely a lowpass-filtered version of the true ampli- desired resyntheses is minimized). In tude A(n). If the input signal has a constant contrast, the phase-vocoderapproach to time- frequency not quite equal to the filter center fre- scaling is basically a heuristic one. The phase- quency, then the amplitude estimate is attenuated vocoder modifications to the short-time Fourier by the gain of the lowpass filter at the difference transform do not actually result in a valid short- frequency. time Fouriertransform. The modified transform can The situation for the instantaneous phase esti- still be inverted to accomplish resynthesis, but the mate is similar but less conducive to a simple in- short-time Fouriertransform of the resynthesized terpretation. It turns out, though, that the phase signal is not the same as the transformfrom which estimate is also smeared by a lowpass-filtering op- it was synthesized. Nevertheless, the phase vocoder eration so that, for example, a sudden change in the remains a very powerful tool for sound manipula- input signal frequency results in a more gradual tion, and an understandingof the phase vocoder re- change in the instantaneous frequency estimate of mains as the fundamental step from which future the phase vocoder. innovations can proceed. The situation with phase is further complicated

Dolson 27