J Sign Process Syst (2011) 65:339–350 DOI 10.1007/s11265-010-0510-9

Single-Channel Source Separation of Audio Signals Using Bark Scale Wavelet Packet Decomposition

Yevgeni Litvin · Israel Cohen

Received: 29 December 2009 / Revised: 2 May 2010 / Accepted: 20 July 2010 / Published online: 6 August 2010 © Springer Science+Business Media, LLC 2010

Abstract  We address the problem of blind source separation from a single-channel audio source using a statistical model of the sources. We modify the Bark-scale aligned Wavelet Packet Decomposition to acquire an approximate-shiftability property. We allow oversampling in some decomposition nodes to equalize the sampling rate in all terminal nodes. Statistical models are trained from samples of each source separately, and the separation is performed using these models. The proposed psycho-acoustically motivated non-uniform filterbank structure reduces the signal space dimension and simplifies the training procedure of the statistical model. In our experiments we show that the proposed algorithm performs better when compared to a competing algorithm. We also study the effect that different wavelet families have on the performance of the proposed signal analysis in the single-channel source separation task.

Keywords  Audio source separation · BS-WPD · CSR-BS-WPD · GMM · CWT · Monaural source separation

This work was supported by the Israel Science Foundation under Grant 1085/05 and by the European Commission under project Memories FP6-IST-035300.

Y. Litvin (B) · I. Cohen
Department of Electrical Engineering, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel
e-mail: [email protected]
I. Cohen
e-mail: [email protected]

1 Introduction

Blind source separation (BSS) is the task of recovering a set of signals from a set of observed signal mixtures. The BSS problem is common to many signal processing tasks and is at the heart of numerous applications in audio signal processing. BSS algorithms that operate on audio signals are sometimes called Blind Audio Source Separation (BASS) algorithms [1]. Cherry [2] coined the term "cocktail party effect" for the ability of the human hearing system to concentrate on a single speaker in the presence of interfering signals. Although human audio segregation abilities are fascinating, a full audio separation is not necessarily performed in the inner ear or somewhere in the auditory cortex. It is possible that the human hearing system is only capable of recognizing semantic objects in one of the several audio streams the listener is exposed to.

Different settings for the BSS task arise in different applications. The prior and posterior information available to a source separation algorithm may differ in the number of sources and the number of observed channels; the mixing model (instantaneous, echoic, convolutive, linear, non-linear); prior information on the statistical properties of the signals; and the presence of noise.

One of the crucial factors in the definition of the BSS problem is the ratio of the number of observed channels to the number of audio sources in the mixture. If the number of observed channels is equal to the number of sources, the problem is called an even-determined or determined case. In an over-determined case the number of channels is greater than the number of sources, and in an under-determined case the number of channels is smaller than the number of sources. The under-determined case is the most difficult to handle and requires stronger assumptions on the properties of the mixture components.

Another important factor that differentiates between BSS problem setups is the mixing model.
The instantaneous mixing model implies that several instantaneous mixtures are observed, each having the source components mixed in a different proportion. The echoic mixing model allows a different delay for each component in each channel. The convolutive mixing model allows different linear filtering of the sources at each channel. Naturally, the instantaneous mixing model is a degenerate case of the echoic mixing model, and the echoic model is a degenerate case of the convolutive mixing model. The convolutive mixing model is the most appropriate for describing most real-world scenarios, but it is also the hardest to handle.

Most source separation algorithms assume that the mixture components are statistically independent. Although this is a reasonable assumption in many cases, it is not necessarily true for all applications. For example, one source separation application is the separation of an individual musical instrument from a polyphonic musical excerpt. In this case, the assumption of statistical independence is inaccurate for most musical styles, where several musical instruments perform parts in a certain musical key and according to a common tempo.

The blind source separation problem was first formulated in a statistical framework by Herault et al. in 1984. Comon [3] introduced Independent Component Analysis (ICA) in 1994, and numerous theoretical and practical works followed. A basic ICA algorithm assumes the even-determined BSS case and an instantaneous mixing model. Under these assumptions, a demixing matrix has to be found. In order to find such a matrix, the ICA algorithm minimizes the statistical dependency between the unmixed channels. Various methods may be used to reduce statistical dependency, such as maximization of non-Gaussianity between channels or minimization of mutual information [4]. The search is usually done using gradient descent or fixed point algorithms. Unfortunately, most of the algorithms in the ICA family require several mixtures to be observed in order to perform the separation.

In some cases a database of audio samples is available and statistical signal models can be trained in a supervised manner before the separation process. In these cases, various techniques from statistical learning can be used. Algorithms that rely on this kind of statistical model are sometimes called Semi-Blind Source Separation (SBSS) algorithms [5, 6].

In [7], Benaroya et al. introduced a source separation algorithm based on Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) statistical modeling of the source signal classes. First, GMM or HMM models are trained for each signal class using spectral shapes acquired from the Short-Time Fourier Transform (STFT) analysis. During the separation stage, these models are used to estimate the mixture components using Maximum A-Posteriori (MAP) or Posterior Mean (PM) estimates. The authors also showed that using the more complicated HMM models does not improve the separation performance significantly when compared to the GMM model. Some extensions to that work were presented in [6], for example the Gaussian Scaled Mixture Model (GSMM), which takes into account variations in amplitude of signals with similar spectral shapes.

Another signal modeling technique that was found useful in single-channel source separation is Auto-Regressive (AR) modeling. Srinivasan et al. [8] proposed a codebook of Linear Predictive Coefficients (LPC) trained on a speech and an interfering signal. A maximum likelihood estimator is used to find the most probable pair of codebook members, and a Wiener filter is then used to suppress the interfering signal. In [9] the LPC coefficients are treated as random variables. In these works both algorithms are described and tested in a speech enhancement setup with non-stationary noise. Nevertheless, they are also applicable to a source separation scenario by modeling one of the sources as the speech and the other as the noise.

Traditionally, the short-time Fourier transform (STFT) is used in many audio and speech processing applications. The Bark-Scaled Wavelet Packet Decomposition (BS-WPD) [10] is a time-frequency signal transformation with non-uniform frequency resolution. This transformation is psychoacoustically motivated and reflects the critical band structure of the human auditory system.
The mapping based Complex Wavelet Transform (CWT) [11] is based on a bijective mapping of a real signal into a complex signal domain, followed by standard wavelet analysis performed on the complex signal. Among other benefits, the CWT partially mitigates the lack of shift invariance of wavelet analysis.

The algorithm presented in this paper addresses single-channel separation of an instantaneous mixture of two audio sources. It follows the STFT based algorithm of Benaroya et al. [7], but operates with a non-uniform WPD filterbank. We modify the BS-WPD analysis to equalize the sampling rates of the different scale-bands, which enables the construction of instantaneous spectral shapes that are used in the training and separation stages of the separation algorithm. We also use the CWT in order to achieve some level of shift invariance. The non-uniform frequency resolution of the BS-WPD filterbank reduces the dimension of the feature vectors by allocating fewer vector elements to the higher frequencies. This behavior mimics the critical band structure of the human auditory system. In a series of experiments we validate our approach using various types of wavelet families and show that the proposed approach performs better when compared to a competing algorithm in some scenarios. Partial results of this work were presented in [12].

The remainder of this paper is structured as follows. In Section 2 we shortly describe the disadvantages of the classical wavelet transform applied to audio processing tasks and present the CWT. In Section 3 we introduce the Bark-scaled WPD and a modification designed to equalize the sampling frequencies in all sub-bands. In Section 4 we formulate our mixing model and derive MAP estimators for its components. Section 5 specifies the training and separation stages of the algorithm.
Section 6 outlines the performance measures, and Section 7 shows some experimental results.

2 Mapping Based Complex Wavelet Transform

In this section we describe the disadvantages of the standard Discrete Wavelet Transform (DWT) and present the mapping based Complex Wavelet Transform (CWT), introduced by Fernandes et al. in [11, 13], that mitigates these disadvantages to some degree.

A major disadvantage of the DWT that reduces its usefulness in audio signal processing applications is the lack of shift invariance. Let x(n) be a time domain signal and X_{l,n}(m) = DWT{x(n)} its DWT transform. Let x′(n) = x(n − τ) be a shifted version of the time signal. The DWT coefficients of x′(n) change significantly compared to X_{l,n}(m). The reason for this behavior lies in the downsampling performed on the dilated signals by the DWT. Different researchers have proposed various methods to address the problem; a survey of techniques used to mitigate the lack of shift invariance may be found in [13].

Let L²(ℝ → ℂ) denote the function space of square integrable complex-valued functions on the real line and L²(ℝ → ℝ) its subspace comprised of real-valued functions. The Hardy-space H²(ℝ → ℂ) is defined by

H²(ℝ → ℂ) ≜ { f ∈ L²(ℝ → ℂ) : F f(ω) = 0 for a.e. ω < 0 },

where F f(ω) denotes the Fourier transform of f(t). The mapping of a signal to the Hardy-space is equivalent to finding its analytic signal.

Fernandes et al. showed that the function space L²(ℝ → ℝ) is isomorphic to the Hardy-space H²(ℝ → ℂ) under a certain inner product. They also showed that the mapping of a function in L²(ℝ → ℝ) into the Hardy-space cannot be implemented using a digital filter. As a remedy, they defined the Softy-space and proved that it has properties similar to the Hardy-space. The mapping of a function in L²(ℝ → ℝ) into the Softy-space is done using a projection digital filter h⁺ that has a passband over [0, π) and a stopband over [−π, 0). Softy-space signals are denoted by a superscript "+" in [13] because of the attenuated negative frequencies. We adopt this notation. Let x(n) be a time sequence. Its Softy-space image is given by

x⁺(n) = h⁺(n) ∗ x(n).  (1)

The forward CWT is defined by mapping the time domain signal x(n) into its Softy-space image x⁺(n), followed by a standard DWT. The inverse CWT consists of an Inverse Discrete Wavelet Transform (IDWT) followed by the inverse mapping from the complex valued Softy-space back to the real valued time signal. The Softy-space to real space mapping is defined by

x(n) = Re[ g⁺(n) ∗ x⁺(n) ],  (2)

where g⁺(n) is also a digital filter. The relation between h⁺(n) and g⁺(n) is described in [13].

Simoncelli et al. [14] defined "shiftability" as a transform property that guarantees that the transform sub-band energy is invariant under signal shifts. The shiftability property is weaker than shift invariance but easier to achieve. Simoncelli et al. argued that a transform is shiftable if and only if there is no aliasing in any subband. It follows that shiftability is not achievable for any non-redundant wavelet transform, except for the practically unrealizable Shannon wavelet.

A Hardy-space signal has half the bandwidth of the corresponding real-valued function due to the lack of negative frequency components. The Nyquist condition holds for the Hardy-space signal in each analysis subband, and the shiftability property follows. The practical CWT uses an approximation of the Hardy-space; hence the CWT is approximately shiftable, provided sufficiently long filters are used in the implementation, as explained in [11]. We note that because of the complex transform coefficients, the mapping based CWT is a redundant transform with a redundancy factor of two.
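For illustration only, the forward and inverse transforms of Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: it assumes the PyWavelets (pywt) and SciPy packages, the function names are ours, and the ideal analytic-signal mapping (scipy.signal.hilbert) stands in for the FIR Softy-space projection filter h⁺ of [13].

import numpy as np
import pywt
from scipy.signal import hilbert

def cwt_forward(x, wavelet="dmey", level=3):
    # Softy-space mapping (Eq. 1): the ideal analytic signal is used
    # here as a stand-in for the FIR projection filter h+.
    x_plus = hilbert(x)
    # Standard DWT applied to the complex Softy-space signal.
    return pywt.wavedec(x_plus, wavelet, level=level)

def cwt_inverse(coeffs, wavelet="dmey"):
    x_plus = pywt.waverec(coeffs, wavelet)
    # Inverse mapping (Eq. 2): for the analytic-signal approximation
    # the filter g+ reduces to taking the real part.
    return np.real(x_plus)

With this ideal mapping, the round trip cwt_inverse(cwt_forward(x)) recovers x up to numerical precision; with a finite h⁺ the transform is only approximately shiftable, as discussed above.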

Figure 1 CSR-BS-WPD decomposition tree. Nodes having l > 6 are not decimated. In this way, the sampling frequencies of the signals in all terminal nodes are the same. Only a few of the node labels are shown in the figure due to space limitations.

Our algorithm benefits from the approximate shift invariance property. We train a GMM model using the wavelet transform coefficients. Lack of shift invariance makes the signal space unnecessarily large. This can make the statistical modeling of the signal more complicated in terms of the selected statistical model, the computational burden, or the amount of data required for training.

3 Bark-Scaled Wavelet Packet Decomposition

In this section we present the BS-WPD as defined in [10] and introduce a modification that has some favorable properties for the frame-by-frame classification and filtering used in our algorithm.

Let E ⊂ {(l, n) : 0 ≤ l < L, 0 ≤ n < 2^l} be a set of terminal nodes of a WPD tree. The center frequency of a terminal node (l, n) ∈ E is roughly given by

f_{l,n} = 2^{−l} (GC^{−1}(n) + 0.5) f_s,

where GC^{−1}(n) is the inverse Gray code of n and f_s is the sampling frequency. The Critical Band WPD (CB-WPD) [10] filterbank structure is obtained by selecting the terminal node set E in a way that positions the center frequencies f_{l,n} approximately one Bark apart. Another constraint that must be taken into consideration is that the dyadic interval set I_{l,n} = [2^{−l} n, 2^{−l} (n + 1)) must form a disjoint cover of [0, 1). Only in this case will the set of wavelet packet family functions span the signal space.
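For concreteness, the center-frequency formula can be evaluated with a short Python helper (the function names are ours; the binary-reflected Gray code is inverted bitwise):

def inverse_gray_code(n):
    # Invert the binary reflected Gray code: find k with k ^ (k >> 1) == n.
    k = 0
    while n:
        k ^= n
        n >>= 1
    return k

def center_frequency(l, n, fs):
    # f_{l,n} = 2^{-l} * (GC^{-1}(n) + 0.5) * fs, as in the text; the
    # Gray code accounts for the frequency ordering of the WPD nodes.
    return 2.0 ** (-l) * (inverse_gray_code(n) + 0.5) * fs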

The BS-WPD is defined as a CB-WPD with two additional levels of terminal node expansion. Due to the higher frequency resolution, the BS-WPD performed better in the task of speech enhancement. Our experiments showed that adding three additional levels of decomposition results in better performance of our source separation algorithm.

The proposed source separation algorithm needs access to the instantaneous spectral information from all analysis subbands. We therefore define a new version of the BS-WPD transform that has an equal sampling frequency in each sub-band. Unfortunately, the terminal nodes of the BS-WPD are positioned at different depths, and each depth is associated with a different sampling frequency. In order to align the signals from all sub-bands and equalize the sampling frequency, we do not allow decimation in nodes deeper than level 6 (i.e. l > 6). We call the resulting transform the Constant Sampling Rate BS-WPD (CSR-BS-WPD). We note that by canceling decimation in the lower levels of the WPD tree we introduce a certain amount of redundancy into the CSR-BS-WPD representation. Figure 1 shows the CSR-BS-WPD decomposition tree; a sketch of the modified decomposition step is given at the end of this section.

The CSR-BS-WPD analysis has only 168 sub-bands, compared to 513 sub-bands of an STFT analysis with approximately the same frequency resolution at low frequencies. Like the human auditory system, we sacrifice frequency resolution in the higher frequency range. Reducing the number of sub-bands results in a smaller dimension of the data used in the training and separation stages. A smaller data dimension has a potential to increase the accuracy of the GMM estimation because of less redundancy in the high frequency features, in addition to a lower computational burden.

We denote by X_{l,n}(m) the CSR-BS-WPD transform of x(n) in Eq. 2, where (l, n) are the indices of the terminal nodes and m is the time index. Since all terminal nodes have the same sampling rate, we can rearrange the elements of X_{l,n}(m) into a complex, single column vector X̄(m) ∈ ℂ^M. The dimension of X̄(m) is equal to the number of sub-bands, M = 168. The CSR-BS-WPD is a reversible transform: given the CSR-BS-WPD signal X̄(m), we can acquire the time domain signal x(n) first by inverting the WPD and then filtering x⁺(n) with the digital filter g⁺ as given by Eq. 2.

Figures 2 and 3 show time-frequency plots of a speech signal. The intensity of color shows the coefficient magnitude in dB. Band indices are shown on the vertical axis. In Fig. 2, the band indices on the vertical axis are the indices of the terminal nodes in growing mid-band frequency order. It can be seen that fewer time-frequency coefficients are dedicated to the higher frequency bands in the CSR-BS-WPD analysis than in the conventional STFT analysis.

Figure 2 CSR-BS-WPD time-frequency representation of a speech signal (band index vs. time [sec]).
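The following Python fragment sketches the modified splitting step under our own naming, with SciPy assumed; it illustrates the principle rather than reproducing the paper's implementation. Nodes at depths up to 6 are split and decimated as in an ordinary WPD; deeper nodes are split with upsampled (à trous) filters and no decimation, so every terminal node keeps the level-6 sampling rate.

import numpy as np
from scipy.signal import lfilter

def wpd_split(x, h, g, child_level, max_decim_level=6):
    # Split one WPD node into its two children at depth child_level.
    # h, g: low-pass / high-pass decomposition filters of the wavelet.
    if child_level <= max_decim_level:
        # Ordinary WPD step: filter, then decimate by two.
        return lfilter(h, 1.0, x)[::2], lfilter(g, 1.0, x)[::2]
    # Deeper nodes: no decimation; use a-trous filters, i.e. the base
    # filters upsampled by 2^(child_level - max_decim_level - 1).
    r = 2 ** (child_level - max_decim_level - 1)
    h_up = np.zeros(r * (len(h) - 1) + 1)
    g_up = np.zeros(r * (len(g) - 1) + 1)
    h_up[::r] = h
    g_up[::r] = g
    return lfilter(h_up, 1.0, x), lfilter(g_up, 1.0, x)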

Figure 3 STFT representation of a speech signal.

4 Mixture Components Estimation

Let s₁(n) and s₂(n) be the mixture components. We assume a mixing model without noise presence:

x(n) = s₁(n) + s₂(n).  (3)

Mapping the mixture signal into the Softy-space we get

x⁺(n) = s₁⁺(n) + s₂⁺(n),  (4)

and in the CSR-BS-WPD domain

X̄(m) = S̄₁(m) + S̄₂(m),  (5)

where S̄₁(m), S̄₂(m), X̄(m) ∈ ℂ^M.

We use the posterior mean (PM) to estimate the mixture components in the CSR-BS-WPD domain. Let x, s₁, s₂ be vectors of N observations of x(n), s₁(n) and s₂(n), respectively. It is shown by Benaroya et al. in [6] that for the mixing model (3) and Gaussian processes s₁(n), s₂(n), the PM estimators for s₁ and s₂ are given by

ŝ_c = Σ_c (Σ₁ + Σ₂)⁻¹ x,  c ∈ {1, 2},  (6)

where Σ_c ∈ ℝ^{N×N} is the covariance matrix of s_c.

For stationary and approximately circular processes, the Fourier transform F diagonalizes the covariance matrices, resulting in the following estimators:

F ŝ_c(f) = [σ²_c(f) / (σ²₁(f) + σ²₂(f))] F x(f),  c ∈ {1, 2},  (7)

where σ²_c is a vector of eigenvalues of Σ_c and f is the frequency index. We notice that this solution coincides with the well-known Wiener filter.

A diagonal covariance matrix of the STFT expansion coefficients is often assumed in speech processing applications [15]. Benaroya used this assumption in the application to musical signals. The covariance matrix diagonality means a lack of correlation between signals in different frequency bands.
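Under the diagonality assumption, Eq. 7 amounts to one Wiener gain per band. A minimal NumPy sketch (our own naming):

import numpy as np

def wiener_separate(X, var1, var2, eps=1e-12):
    # X: (M, T) complex mixture coefficients (M bands, T frames);
    # var1, var2: (M,) per-band variances of the two sources.
    # Eq. 7: each source is recovered by its per-band Wiener gain.
    g1 = var1 / (var1 + var2 + eps)
    S1_hat = g1[:, None] * X
    S2_hat = X - S1_hat  # the gains of the two sources sum to one
    return S1_hat, S2_hat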


Figure 4 Correlation coefficients matrix of the STFT expansion for: (a) a speech signal; (b) a musical signal.


Figure 5 Correlation coefficients matrix of the CSR-BS-WPD expansion using the dmey wavelet for: (a) a speech signal; (b) a musical signal.

We can interpret the CSR-BS-WPD transform as a filterbank. We can expect that as long as the WPD filters have good band-pass characteristics, the correlation between different frequency bands will also be low, and the diagonal covariance matrix assumption can be extended to the CSR-BS-WPD signals. Next, we verify this assumption on empirical data.

Figures 4, 5 and 6 show the absolute values of the correlation coefficient matrices for the STFT and CSR-BS-WPD transforms. The CSR-BS-WPD transform is based on the discrete Meyer wavelet (dmey) and the Daubechies wavelet of order 2 (db2). It is clear that the diagonal covariance assumption is inaccurate for all transforms, including the STFT. The highest amount of correlation is present in the db2 based CSR-BS-WPD transform, especially for the musical signal. High inter-band correlation may be explained either by correlated events occurring at different frequencies or by a leakage of energy between frequency bands due to non-ideal bandpass characteristics of the analysis filters. Since the same audio signals were used in Figs. 4–6, the differences in the correlation coefficients are explained by the properties of the analysis filters. The analysis filter of the db2 wavelet is only 4 samples long, compared to more than 100 coefficients of the dmey analysis filter. The db2 wavelet has inferior frequency localization and worse attenuation of side lobes compared to the dmey analysis filter; hence the correlation coefficients are higher for the db2 based CSR-BS-WPD expansion coefficients.

Despite its weakness, especially for some kinds of wavelet families, we assume the lack of correlation between expansion coefficients, since the estimation of the large number of off-diagonal elements of a covariance matrix is impractical.
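The empirical check behind Figs. 4–6 is simple to reproduce; a minimal sketch (our naming, NumPy assumed):

import numpy as np

def band_correlation(C):
    # C: (M, T) matrix of transform coefficients, one row per band.
    # Returns the M x M matrix of absolute correlation coefficients
    # between bands, the quantity visualized in Figs. 4-6.
    return np.abs(np.corrcoef(C))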


Figure 6 Correlation coefficients matrix of the CSR-BS-WPD expansion using the db2 wavelet for: (a) a speech signal; (b) a musical signal.

A simple Gaussian prior does not hold for most natural signals such as speech or music. The remedy is to assume Gaussian Mixture prior densities (GMM prior) [6]. The GMM describes the signal distribution as the outcome of a two stage process: first an active component k is selected out of the K Gaussian distributions in the mixture; then an observation sample is obtained using the selected model parameters μ^(k), Σ^(k), where μ^(k) and Σ^(k) are the expectation value and the covariance of the k-th component. The probability of selecting the k-th component is given by w_k (the k-th element of the probability vector w). The GMM is defined by ({μ^(k)}_{k=1}^K, {Σ^(k)}_{k=1}^K, w).

We estimate the mixture components using GMM priors. We introduce hidden variables q_c(m) ∈ {1, ..., K}, c ∈ {1, 2}, associated with the index of the active GMM component at time m. Let γ_{j,k}(m) = p(q₁(m) = j, q₂(m) = k | x) be the posterior probability of the active components; γ_{j,k}(m) is estimated from the mixture observations. Conditioning Eq. 3 on the active GMM components and applying it to the CSR-BS-WPD signal, we get the PM estimator

S̄̂₁(m) = Σ_{j,k=1}^K γ_{j,k}(m) Σ₁^(j) (Σ₁^(j) + Σ₂^(k))⁻¹ X̄(m).  (8)

The estimator for S̄̂₂(m) is derived in the same manner.

5 Training and Separation

Let L denote the number of training signal time samples in the CSR-BS-WPD domain. During the training stage we use the signal samples of both classes, {S̄₁(m)}_{m=1}^L and {S̄₂(m)}_{m=1}^L, to train two different GMM models. Both the Softy-space mapping and the WPD are linear transformations, so the expectation values of s, s⁺ and S̄ are zero. Hence we can define a simplified GMM model that assumes a zero mean for every state in the GMM:

Θ_c = ({Σ_c^(k)}_{k=1}^K, w_c),  w_c ∈ ℝ^K,  Σ_c^(k) ∈ ℝ^{M×M},  (9)

where K is the GMM model order. Following the reasoning in the previous section, we assume Σ_c^(k) to be a diagonal covariance matrix.

The training of the GMM models is performed using the Expectation Maximization (EM) algorithm [16], bootstrapped by the K-Means algorithm. The expectation value of the training data is assumed to be zero and is not updated during the expectation step of the EM algorithm, i.e. it is set to zero.

We note that the estimation of S̄̂_c(m) is performed for every time index m. In the rest of this section we omit the time index m for clearness of notation. In order to estimate the signal sources S̄̂_c using Eq. 8 for every time instance, we first estimate the posterior probability γ_{j,k}:

γ_{j,k} ∝ p(X̄ | q₁ = j, q₂ = k) p(q₁ = j) p(q₂ = k) = g(X̄; Σ₁^(j) + Σ₂^(k)) w₁^(j) w₂^(k),  (10)

where g(X̄; Σ) is a zero-mean multivariate Gaussian probability density function. Substituting Eq. 10 into Eq. 8 and using the Σ_c estimated in the training process, we obtain the estimators for S̄̂₁ and S̄̂₂.
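For illustration, Eqs. 8–10 for a single frame can be sketched as follows, assuming diagonal covariances stored as variance vectors. The names are ours, and the circular complex Gaussian likelihood is used up to a constant factor that cancels in the normalization.

import numpy as np

def gmm_separate_frame(X, var1, w1, var2, w2, eps=1e-12):
    # X: (M,) complex mixture coefficients for one frame.
    # var1: (K1, M) diagonal variances, w1: (K1,) weights of source 1;
    # var2, w2: likewise for source 2.
    v = var1[:, None, :] + var2[None, :, :] + eps        # (K1, K2, M)
    # Log-likelihood of X under each component pair (j, k), up to a
    # constant: zero-mean circular complex Gaussian, diagonal covariance.
    ll = -np.sum(np.log(v) + (np.abs(X) ** 2) / v, axis=2)
    ll += np.log(w1)[:, None] + np.log(w2)[None, :]
    gamma = np.exp(ll - ll.max())
    gamma /= gamma.sum()                                 # Eq. 10
    # Posterior-mean combination of per-pair Wiener gains (Eq. 8).
    G1 = np.sum(gamma[:, :, None] * (var1[:, None, :] / v), axis=(0, 1))
    return G1 * X, (1.0 - G1) * X

The model parameters (Σ_c^(k), w_c) themselves are fitted offline with the zero-mean EM procedure described above.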

6 Evaluation Criteria

In this section we define the evaluation criteria used in the experiments to evaluate the performance of the proposed algorithm. We use the common distortion measures described in [17] and the BSS_EVAL toolbox [18]. The mixture components s₁, s₂ are assumed to be uncorrelated. Let ŝ_c be an estimate of s_c. The estimate admits the following decomposition:

ŝ_c = y_c + e_{c,interf} + e_{c,artif},

y_c ≜ ⟨ŝ_c, s_c⟩ s_c,
e_{c,interf} ≜ ⟨ŝ_c, s_{c′}⟩ s_{c′},
e_{c,artif} ≜ ŝ_c − (y_c + ⟨ŝ_c, s_{c′}⟩ s_{c′}),

where c is the target class and c′ is the interfering class. The following performance measures are then defined:

SDR ≜ 10 log₁₀ ( ‖y_c‖² / ‖e_{c,interf} + e_{c,artif}‖² ),
SIR ≜ 10 log₁₀ ( ‖y_c‖² / ‖e_{c,interf}‖² ),
SAR ≜ 10 log₁₀ ( ‖y_c + e_{c,interf}‖² / ‖e_{c,artif}‖² ).

The Signal to Distortion Ratio (SDR) measures the total amount of distortion introduced to the original signal, both due to the interfering signal and due to artifacts introduced by the algorithm. The Signal to Interference Ratio (SIR) measures the amount of distortion introduced into the original signal by the interfering signal. The Signal to Artifact Ratio (SAR) measures the amount of artifacts introduced into the original signal by the separation algorithm that do not originate in the interfering signal.
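A compact sketch of these measures (our own helper, assuming the reference signals s_c and s_{c′} are unit-energy and mutually orthogonal; the BSS_EVAL toolbox [18] performs the exact projections):

import numpy as np

def bss_measures(s_hat, s_target, s_interf):
    def proj(u, v):
        # Orthogonal projection of u onto the direction of v.
        return (np.dot(u, v) / np.dot(v, v)) * v
    y = proj(s_hat, s_target)
    e_interf = proj(s_hat, s_interf)
    e_artif = s_hat - y - e_interf
    sdr = 10 * np.log10(np.sum(y**2) / np.sum((e_interf + e_artif)**2))
    sir = 10 * np.log10(np.sum(y**2) / np.sum(e_interf**2))
    sar = 10 * np.log10(np.sum((y + e_interf)**2) / np.sum(e_artif**2))
    return sdr, sir, sar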

Usually, some parameters can be chosen to tune the trade-off between interfering signal leakage (SIR) and the distortion to the desired signal (SAR). For example, it is possible to reduce the SIR to −∞ simply by zeroing the source estimate; however, the SAR measure will become very high in this case. The SDR is a combined measure for both SIR and SAR, hence it is convenient to compare algorithm performance based on the SDR. We use a sub-index to indicate which extracted signal a measure refers to; for example, SDR₁ refers to the signal-to-distortion ratio of the first extracted component.

Figure 7 The signal-to-distortion ratio (SDR₁) of the extracted speech signal, based on four test excerpts of 90 s each.

7 Experimental Results

In this section we present some experimental results. We study the algorithm performance on speech utterances mixed with different musical excerpts. We study the effect of the training excerpt length, the wavelet family choice, the GMM order, and the speaker gender on the performance of the algorithm.

We compare the performance of the proposed algorithm to a similar algorithm based on the STFT signal analysis [6, 7]. In order for the comparison to the STFT-based algorithm to be fair, we select the STFT analysis parameters to match the time and frequency resolution of the CSR-BS-WPD at the lowest frequency band. The length of the STFT analysis window is chosen to be 1,024 samples and the overlap between subsequent frames is 960 samples. We use a Hamming synthesis window and its bi-orthogonal vector as the analysis window. We note that the number of expansion coefficients for the STFT transform is roughly three times greater than for the CSR-BS-WPD transform.

We selected four different musical excerpts: two solo piano pieces and two wind quartet pieces. The titles of the musical excerpts and the abbreviations used in the text are shown in Table 1. A speech signal consists of a single sentence pronounced by different male speakers. All speech utterances were taken from the TIMIT database. While the TIMIT database provided speech utterances recorded in the same controlled environment, we did not have access to a similar musical signal repository. To overcome this problem, we used non-overlapping parts of the same recording for model training and for the mixture.

Table 1 List of musical excerpts used in the experimental results.

Abbreviation   Name                                                  Instrument
W1             L.v. Beethoven, Quartet No. 10 in Eb, Op. 74 -        Wind quartet
               Poco adagio
W2             L.v. Beethoven, Quartet No. 10 in Eb, Op. 74 -        Wind quartet
               Adagio ma non troppo
P1             J.S. Bach, Air                                        Piano solo
P2             J.S. Bach, French Suite No. 4                         Piano solo

First, we examined the effect of the training signal length on the separation performance. We trained GMM models using 10 to 90 s of the speech and music excerpts. All signals were sampled at 16 kHz. As a preprocessing step, we normalized the energy of all the signals (the normalization is important since the statistical model used is not invariant to the signal amplitude; a Gaussian Scaled Mixture Model [6] can be used as a remedy). We evaluated the separation performance using another 90 s of the speech and music mixture.

We used the following wavelet families in our experiments: the discrete approximation of the Meyer wavelet (dmey), Daubechies wavelets of order 2 and 5 (db2, db5), and Coiflets of order 3 and 5 (coif3, coif5). Figures 7 and 8 present the signal-to-distortion ratios (SDR₁, SDR₂) of the extracted speech and musical signals, respectively, for the different algorithms. These figures present the average SDR of the recovered speech and music signals; a high value of SDR₁ indicates a small degree of speech distortion in the extracted component. The other measures exhibit a similar performance trend and are not shown here. The SDR values shown in these figures are averaged over four test sequences, 90 s each. The Discrete Meyer (dmey) based CSR-BS-WPD algorithm demonstrates superior performance compared to the signal analysis based on the other wavelet families for all training signal lengths. The performance of the STFT based algorithm is comparable to the dmey based algorithm. When the length of the training signal is more than 70 s, the SDR is only slightly improved for the dmey and STFT based algorithms and remains the same for the others. In the rest of the experiments we use 90 s long training sequences.

Figure 8 The signal-to-distortion ratio (SDR₂) of the extracted musical signal, based on four test excerpts of 90 s each.

Figure 9 presents the separation performance as a function of the GMM model order (K = 2, ..., 30). Higher order GMM models are useful for describing more complex statistical behavior, at the price of higher computational complexity and larger data sets required for training. Error bars on the SDR plots show 68% confidence intervals for the STFT and dmey based CSR-BS-WPD algorithms. The dmey based CSR-BS-WPD and STFT exhibit superior performance to the other wavelet families, under all the performance criteria.


Figure 9 SDR, SIR and SAR averages for the speech (sub-index 1) and music (sub-index 2) recovered signals. All measures are averaged over 6 min of test mixture signals. Panels: (a) SDR₁, (b) SDR₂, (c) SIR₁, (d) SIR₂, (e) SAR₁, (f) SAR₂.

A trade-off between SIR and SAR can be seen occasionally, but the SDR measure supports our claim consistently. For very low GMM orders such as 2 or 4, the STFT based algorithm performs significantly worse than all CSR-BS-WPD based algorithms.

Figures 10 and 11 show spectrograms of the original and extracted speech and piano components. When the trained GMM model does not have a component that describes an observed mixture component well enough, a sequence of frames may be entirely filtered out, as can be seen in Figs. 10b and 11b. This kind of artifact can be found in both the STFT and CSR-BS-WPD based algorithms, but it is more common in the STFT based algorithm. This can be explained by a better fit of the GMM model due to the lower signal dimension. Another artifact, observed in Fig. 11b, is a high-frequency residual of fricative phonemes that leaks into the extracted music. This artifact appears mostly in musical signals extracted by the STFT based algorithm.

Figure 10 Spectrograms (in dB) of (a) the speech signal used in the mixture; the speech component extracted from the mixture (b) using the STFT based algorithm; (c) using the CSR-BS-WPD based algorithm with the dmey wavelet family.

Figure 11 Spectrograms (in dB) of (a) the piano signal (P1) used in the mixture; the piano component extracted from the mixture (b) using the STFT based algorithm; (c) using the CSR-BS-WPD based algorithm with the dmey wavelet family.

We verified the algorithm performance on speech excerpts pronounced by male and female speakers. We trained the GMM model on 90 s of speech pronounced only by female speakers. We tested the performance on an additional 90 s of a speech and music mixture. We used a GMM order of K = 30.

Table 2 presents the experiment results averaged over all four musical excerpts (P1, P2, W1, W2). The proposed algorithm and the STFT based algorithm perform better when separating female speech from music than when separating male speech from music. The SDR measure is approximately 1 dB higher for female speech than for male speech, for both algorithms and for all wavelet families. The SIR and SAR performance measures exhibited similar behavior.

Table 2 Signal-to-distortion ratio of the extracted speech component (SDR₁) and music component (SDR₂). Either male-only or female-only speech is used in the training and the separation. The order of the GMM model is K = 30.

Wavelet family     SDR₁                  SDR₂
                   Female     Male       Female     Male
coif3              5.27       4.12       5.20       3.97
coif5              5.05       3.96       4.93       3.78
db2                5.22       4.08       5.14       3.96
db5                5.12       3.93       5.05       3.80
dmey               6.19       5.25       6.01       5.08
STFT               5.58       4.91       5.50       4.79

Informal listening tests reveal that the subjective quality of the extracted musical signals is comparable for the STFT and the CSR-BS-WPD based algorithms. The extracted speech signal, on the other hand, has a slightly more pleasant sound when recovered by the proposed algorithm. A relatively large number of misclassified frames, similar to those shown in Fig. 11b, results in an unnatural speech sound and reduces the subjective signal quality.

Audio files with the mixtures used in the experiments and the extracted signals can be downloaded from http://sipl.technion.ac.il/~elitvin/CSR-BS-WPD.

8 Conclusions and Future Work

We have described how a Bark-scaled WPD can be adapted to the source separation task. The introduction of a different subsampling scheme and of approximate shiftability to the psychoacoustically motivated BS-WPD decomposition tree enabled us to adapt a source separation algorithm that was originally designed to work in the uniformly sampled STFT domain.

We found several advantages in using the proposed analysis together with a GMM based source separation algorithm. The most significant advantage is the reduction of the dimension of the signal space at the expense of high frequency resolution. As a result, the computational burden and the number of trained parameters of the GMM model are reduced.

We found that the choice of the wavelet family used with the proposed separation algorithm is crucial. The discrete Meyer wavelet based CSR-BS-WPD analysis yielded improved separation performance compared to the STFT based analysis for very low orders of the GMM model. Other wavelet families failed to produce performance comparable to the STFT based algorithm.

Cohen [10] used the dmey wavelet family for speech enhancement. He justified the advantage of the dmey wavelet family by the regularity of the wavelet filter and its good frequency localization properties. In Section 4 we showed that superior frequency localization properties reduce the correlation between expansion coefficients, hence improving the fit to our separation model. Due to the similarity of our separation algorithm to the STFT based algorithm, the performance and pitfalls of both algorithms are comparable.

Future work with the CSR-BS-WPD analysis may address various audio processing tasks traditionally performed in the STFT domain which require instantaneous spectral shapes and where the psychoacoustically motivated band structure of the time-frequency analysis might be desirable.

References

1. Vincent, E., Févotte, C., Benaroya, L., & Gribonval, R. (2003). A tentative typology of audio source separation tasks. In Proc. 4th international symposium on independent component analysis and blind signal separation (ICA2003) (pp. 715–720). Nara, Japan.
2. Cherry, C. E. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25(5), 975–979.
3. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
4. Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Wiley-Interscience.
5. Ozerov, A., Philippe, P., Bimbot, F., & Gribonval, R. (2007). Adaptation of bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech & Language Processing, 15(5), 1564–1578.
6. Benaroya, L., Bimbot, F., & Gribonval, R. (2006). Audio source separation with a single sensor. IEEE Transactions on Audio, Speech & Language Processing, 14(1), 191–199.
7. Benaroya, L., & Bimbot, F. (2003). Wiener based source separation with HMM/GMM using a single sensor. In Proc. 4th international symposium on independent component analysis and blind signal separation (ICA2003) (pp. 957–961). Nara, Japan.
8. Srinivasan, S., Samuelsson, J., & Kleijn, W. B. (2006). Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Transactions on Audio, Speech & Language Processing, 14(1), 163–176.
9. Srinivasan, S., Samuelsson, J., & Kleijn, W. B. (2007). Codebook-based bayesian speech enhancement for nonstationary environments. IEEE Transactions on Audio, Speech & Language Processing, 15(2), 441–452.

10. Cohen, I. (2001). Enhancement of speech using bark-scaled wavelet packet decomposition. In Proc. 7th European conf. speech, communication and technology, EUROSPEECH-2001 (pp. 1933–1936). Aalborg, Denmark.
11. Fernandes, F. C. A., van Spaendonck, R. L. C., & Burrus, C. S. (2003). A new framework for complex wavelet transforms. IEEE Transactions on Signal Processing, 51(7), 1825–1837.
12. Litvin, Y., & Cohen, I. (2009). Single-channel source separation of audio signals using bark scale wavelet packet decomposition. In 2009 IEEE international workshop on machine learning for signal processing (MLSP09).
13. Fernandes, F. C. A. (2002). Directional, shift-insensitive, complex wavelet transforms with controllable redundancy. Ph.D. thesis, Rice Univ., Houston, TX, USA.
14. Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2), 587–607.
15. Ephraim, Y., & Malah, D. (1984). Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121.
16. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
17. Gribonval, R., Benaroya, L., Vincent, E., & Févotte, C. (2003). Proposals for performance measurement in source separation. In Proc. 4th international symposium on ICA and BSS (ICA2003) (pp. 763–768). Nara, Japan.
18. Févotte, C., Gribonval, R., & Vincent, E. (2005). BSS_EVAL toolbox user guide revision 2.0. Tech. Rep. 1706, IRISA, Rennes, France.

Yevgeni Litvin received the B.Sc. (Summa Cum Laude) and M.Sc. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel, in 2007 and 2010, respectively. From 1996 to 2003, he was a Software Engineer with the Israeli Defense Forces. From 2008 to 2009, he was a Teaching Assistant and a Project Supervisor with the Signal and Image Processing Lab (SIPL), Electrical Engineering Department, the Technion. Since 2009, he has been an Algorithm Engineer with Zoran. His research interests are statistical signal processing, speech enhancement, statistical machine learning and video processing.

Israel Cohen (M'01-SM'03) received the B.Sc. (Summa Cum Laude), M.Sc. and Ph.D. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel, in 1990, 1993 and 1998, respectively. From 1990 to 1998, he was a Research Scientist with RAFAEL Research Laboratories, Haifa, Israel Ministry of Defense. From 1998 to 2001, he was a Postdoctoral Research Associate with the Computer Science Department, Yale University, New Haven, CT. In 2001 he joined the Electrical Engineering Department of the Technion, where he is currently an Associate Professor. His research interests are statistical signal processing, analysis and modeling of acoustic signals, speech enhancement, noise estimation, microphone arrays, source localization, blind source separation, system identification and adaptive filtering. He served as Guest Editor of a special issue of the EURASIP Journal on Advances in Signal Processing on Advances in Multimicrophone Speech Processing and a special issue of the EURASIP Speech Communication Journal on Speech Enhancement. He is a coeditor of the Multichannel Speech Processing section of the Springer Handbook of Speech Processing (Springer, 2007), a coauthor of Noise Reduction in Speech Processing (Springer, 2009), and a cochair of the 2010 International Workshop on Acoustic and Noise Control. Dr. Cohen received in 2005 and 2006 the Technion Excellent Lecturer awards, and in 2009 the Muriel and David Jacknow award for Excellence in Teaching. He served as Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing and the IEEE Signal Processing Letters.