J Sign Process Syst (2011) 65:339–350 DOI 10.1007/s11265-010-0510-9

Single-Channel Source Separation of Audio Signals Using Bark Scale Wavelet Packet Decomposition

Yevgeni Litvin · Israel Cohen

Received: 29 December 2009 / Revised: 2 May 2010 / Accepted: 20 July 2010 / Published online: 6 August 2010 © Springer Science+Business Media, LLC 2010

Abstract  We address the problem of blind source separation from a single-channel audio source using a statistical model of the sources. We modify the Bark-scale aligned Wavelet Packet Decomposition to acquire an approximate-shiftability property. We allow oversampling in some decomposition nodes to equalize the sampling rate in all terminal nodes. Statistical models are trained from samples of each source separately, and the separation is performed using these models. The proposed psycho-acoustically motivated non-uniform filterbank structure reduces the signal space dimension and simplifies the training procedure of the statistical model. In our experiments we show that the proposed algorithm performs better when compared to a competing algorithm. We also study the effect that different wavelet families have on the performance of the proposed signal analysis in the single-channel source separation task.

Keywords  Audio source separation · BS-WPD · CSR-BS-WPD · GMM · CWT · Monaural source separation

This work was supported by the Israel Science Foundation under Grant 1085/05 and by the European Commission under project Memories FP6-IST-035300.

Y. Litvin (B) · I. Cohen
Department of Electrical Engineering, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel
e-mail: [email protected]
I. Cohen
e-mail: [email protected]

1 Introduction

Blind source separation (BSS) is the task of recovering a set of signals from a set of observed signal mixtures. The BSS problem is common to many signal processing tasks and is at the heart of numerous applications in audio signal processing. BSS algorithms that operate on audio signals are sometimes called Blind Audio Source Separation (BASS) algorithms [1]. Cherry [2] coined the term "cocktail party effect" for the ability of the human hearing system to concentrate on a single speaker in the presence of interfering signals. Although human audio segregation abilities are fascinating, a full audio separation is not necessarily performed in the inner ear or somewhere in the auditory cortex. It is possible that the human hearing system is only capable of recognizing semantic objects in one of the several audio streams the listener is exposed to.

Different settings for the BSS task arise in different applications. The prior and posterior information available to a source separation algorithm may differ in the number of sources and the number of observed channels; the mixing model (instantaneous, echoic, convolutive, linear, non-linear); prior information on the statistical properties of the signals; and the presence of noise.

One of the crucial factors in the definition of the BSS problem is the ratio of the number of observed channels to the number of audio sources in the mixture. If the number of observed channels is equal to the number of sources, the problem is called an even-determined or determined case. In an over-determined case the number of channels is greater than the number of sources, and in an under-determined case the number of channels is smaller than the number of sources. The under-determined case is the most difficult to handle and requires stronger assumptions on the properties of the mixture components.

Another important factor that differentiates between BSS problem setups is the mixing model.
The instantaneous mixing model implies that several instantaneous mixtures are observed, each having the source components mixed in a different proportion. The echoic mixing model allows a different delay for each component in each channel. The convolutive mixing model allows different linear filtering of the sources at each channel. Naturally, the instantaneous mixing model is a degenerate case of the echoic mixing model, and the echoic model is a degenerate case of the convolutive mixing model. The convolutive mixing model is the most appropriate for describing most real-world scenarios, but it is also the hardest to handle.

Most source separation algorithms assume that the mixture components are statistically independent. Although this is a reasonable assumption in many cases, it is not necessarily true for all applications. For example, one source separation application is the separation of an individual musical instrument from a polyphonic musical excerpt. In this case, the assumption of statistical independence is inaccurate for most musical styles, where several musical instruments perform parts in a certain musical key and according to a common tempo.

The blind source separation problem was first formulated in a statistical framework by Herault et al. in 1984. Comon [3] introduced Independent Component Analysis (ICA) in 1994, and numerous theoretical and practical works followed. A basic ICA algorithm assumes the even-determined BSS case and an instantaneous mixing model. Under these assumptions, a demixing matrix has to be found. In order to find such a matrix, the ICA algorithm minimizes the statistical dependency between the unmixed channels. Various methods may be used to reduce statistical dependency, such as maximization of non-Gaussianity between channels or minimization of mutual information [4]. The search is usually done using gradient descent or fixed point algorithms. Unfortunately, most of the algorithms in the ICA family require several mixtures to be observed in order to perform the separation.

In some cases a database of audio samples is available and statistical signal models can be trained in a supervised manner before the separation process. In these cases, various techniques from statistical learning can be used. Algorithms that rely on this kind of statistical model are sometimes called Semi-Blind Source Separation (SBSS) algorithms [5, 6].

In [7], Benaroya et al. introduced a source separation algorithm based on Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) statistical modeling of the source signal classes. First, GMM or HMM models are trained for each signal class using spectral shapes acquired from the Short-Time Fourier Transform (STFT) analysis. During the separation stage, these models are used to estimate the mixture components using Maximum A-Posteriori (MAP) or Posterior Mean (PM) estimates. The authors also showed that using the more complicated HMM models does not improve the separation performance significantly when compared to the GMM model. Some extensions to that work were presented in [6], for example the Gaussian Scaled Mixture Model (GSMM), which takes into account variations in amplitude of signals with similar spectral shapes.

Another signal modeling technique that was found useful in single-channel source separation is Auto-Regressive (AR) modeling. Srinivasan et al. [8] proposed a codebook of Linear Predictive Coefficients (LPC) trained on a speech and an interfering signal. A maximum likelihood estimator is used to find the most probable pair of codebook members, and a Wiener filter is then used to suppress the interfering signal. In [9] the LPC coefficients are treated as random variables. In these works both algorithms are described and tested in a speech enhancement setup with non-stationary noise. Nevertheless, they are also applicable to a source separation scenario by modeling one of the sources as the speech and the other as the noise.

Traditionally, the short-time Fourier transform (STFT) is used in many audio and speech processing applications. The Bark-Scaled Wavelet Packet Decomposition (BS-WPD) [10] is a time-frequency signal transformation with non-uniform frequency resolution. This transformation is psychoacoustically motivated and reflects the critical band structure of the human auditory system.
The mapping based Complex Wavelet Transform (CWT) [11] is based on a bijective mapping of a real signal into a complex signal domain, followed by standard wavelet analysis performed on the complex signal. Among other benefits, the CWT partially mitigates the lack of shift invariance of wavelet analysis.

The algorithm presented in this paper addresses single-channel separation of an instantaneous mixture of two audio sources. It follows the STFT based algorithm of Benaroya et al. [7], but operates with a non-uniform WPD filterbank. We modify the BS-WPD analysis to equalize the sampling rates of the different scale-bands, which enables the construction of instantaneous spectral shapes that are used in the training and separation stages of the separation algorithm. We also use the CWT in order to achieve some level of shift invariance. The non-uniform frequency resolution of the BS-WPD filterbank reduces the dimension of the feature vectors by allocating fewer vector elements to the higher frequencies. This behavior mimics the critical band structure of the human auditory system. In a series of experiments we validate our approach using various types of wavelet families and show that the proposed approach performs better when compared to a competing algorithm in some scenarios. Partial results of this work were presented in [12].

The remainder of this paper is structured as follows. In Section 2 we shortly describe the disadvantages of the classical wavelet transform applied to audio processing tasks and present the CWT. In Section 3 we introduce the Bark-scaled WPD and a modification designed to equalize the sampling frequencies in all sub-bands. In Section 4 we formulate our mixing model and derive MAP estimators for its components. Section 5 specifies the training and separation stages of the algorithm.
Section 6 outlines the performance measures, and Section 7 shows some experimental results.

2 Mapping Based Complex Wavelet Transform

In this section we describe the disadvantages of the standard Discrete Wavelet Transform (DWT) and present the mapping based Complex Wavelet Transform (CWT), introduced by Fernandes et al. in [11, 13], that mitigates these disadvantages to some degree.

A major disadvantage of the DWT that reduces its usefulness in audio signal processing applications is the lack of shift invariance. Let x(n) be a time domain signal and X_{l,n}(m) = DWT{x(n)} its DWT transform. Let x′(n) = x(n − τ) be a shifted version of the time signal. The DWT coefficients of x′(n) change significantly compared to X_{l,n}(m). The reason for this behavior lies in the downsampling performed on the dilated signals by the DWT. Different researchers have proposed various methods to address the problem; a survey of techniques used to mitigate the lack of shift invariance may be found in [13].

Let L²(ℝ → ℂ) denote the function space of square integrable complex-valued functions on the real line and L²(ℝ → ℝ) its subspace comprised of real-valued functions. The Hardy-space H²(ℝ → ℂ) is defined by

H²(ℝ → ℂ) ≜ { f ∈ L²(ℝ → ℂ) : F f(ω) = 0 for a.e. ω < 0 },

where F f(ω) denotes the Fourier transform of f(t). The mapping of a signal to the Hardy-space is equivalent to finding its analytic signal.

Fernandes et al. showed that the function space L²(ℝ → ℝ) is isomorphic to the Hardy-space H²(ℝ → ℂ) under a certain inner product. They also showed that the mapping of a function in L²(ℝ → ℝ) into the Hardy-space cannot be implemented using a digital filter. As a remedy, they defined the Softy-space and proved that it has properties similar to the Hardy-space. The mapping of a function in L²(ℝ → ℝ) into the Softy-space is done using a projection digital filter h⁺ that has a passband over [0, π) and a stopband over [−π, 0). Softy-space signals are denoted by a superscript "+" in [13] because of the attenuated negative frequencies. We adopt this notation. Let x(n) be a time sequence. Its Softy-space image is given by

x⁺(n) = h⁺(n) ∗ x(n).  (1)

The forward CWT is defined by mapping the time domain signal x(n) into its Softy-space image x⁺(n), followed by a standard DWT. The inverse CWT consists of an Inverse Discrete Wavelet Transform (IDWT) followed by the inverse mapping from the complex valued Softy-space back to the real valued time signal. The Softy-space to real space mapping is defined by

x(n) = Re[ g⁺(n) ∗ x⁺(n) ],  (2)

where g⁺(n) is also a digital filter. The relation between h⁺(n) and g⁺(n) is described in [13].

Simoncelli et al. [14] defined "shiftability" as a transform property that guarantees that the transform sub-band energy is invariant under signal shifts. The shiftability property is weaker than shift invariance but easier to achieve. Simoncelli et al. argued that a transform is shiftable if and only if there is no aliasing in any subband. It follows that shiftability is not achievable for any non-redundant wavelet transform, except for the practically unrealizable Shannon wavelet.

A Hardy-space signal has half the bandwidth of the corresponding real-valued function due to the lack of negative frequency components. The Nyquist condition holds for the Hardy-space signal in each analysis subband, and the shiftability property follows. The practical CWT uses an approximation of the Hardy-space; hence the CWT is approximately shiftable, provided sufficiently long filters are used in the implementation, as explained in [11]. We note that because of the complex transform coefficients, the mapping based CWT is a redundant transform with a redundancy factor of two.
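For illustration only, the forward and inverse transforms of Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: it assumes the PyWavelets (pywt) and SciPy packages, the function names are ours, and the ideal analytic-signal mapping (scipy.signal.hilbert) stands in for the FIR Softy-space projection filter h⁺ of [13].

import numpy as np
import pywt
from scipy.signal import hilbert

def cwt_forward(x, wavelet="dmey", level=3):
    # Softy-space mapping (Eq. 1): the ideal analytic signal is used
    # here as a stand-in for the FIR projection filter h+.
    x_plus = hilbert(x)
    # Standard DWT applied to the complex Softy-space signal.
    return pywt.wavedec(x_plus, wavelet, level=level)

def cwt_inverse(coeffs, wavelet="dmey"):
    x_plus = pywt.waverec(coeffs, wavelet)
    # Inverse mapping (Eq. 2): for the analytic-signal approximation
    # the filter g+ reduces to taking the real part.
    return np.real(x_plus)

With this ideal mapping, the round trip cwt_inverse(cwt_forward(x)) recovers x up to numerical precision; with a finite h⁺ the transform is only approximately shiftable, as discussed above.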

Figure 1 CSR-BS-WPD decomposition tree. Nodes having l > 6 are not decimated. In this way, the sampling frequencies of the signals in all terminal nodes are the same. Only a few of the node labels are shown in the figure due to space limitations.

Our algorithm benefits from the approximate shift invariance property. We train a GMM model using the wavelet transform coefficients. Lack of shift invariance makes the signal space unnecessarily large. This can make the statistical modeling of the signal more complicated in terms of the selected statistical model, the computational burden, or the amount of data required for training.

3 Bark-Scaled Wavelet Packet Decomposition

In this section we present the BS-WPD as defined in [10] and introduce a modification that has some favorable properties for the frame-by-frame classification and filtering used in our algorithm.

Let E ⊂ {(l, n) : 0 ≤ l < L, 0 ≤ n < 2^l} be a set of terminal nodes of a WPD tree. The center frequency of a terminal node (l, n) ∈ E is roughly given by

f_{l,n} = 2^{−l} (GC^{−1}(n) + 0.5) f_s,

where GC^{−1}(n) is the inverse Gray code of n and f_s is the sampling frequency. The Critical Band WPD (CB-WPD) [10] filterbank structure is obtained by selecting the terminal node set E in a way that positions the center frequencies f_{l,n} approximately one Bark apart. Another constraint that must be taken into consideration is that the dyadic interval set I_{l,n} = [2^{−l} n, 2^{−l} (n + 1)) must form a disjoint cover of [0, 1). Only in this case will the set of wavelet packet family functions span the signal space.
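For concreteness, the center-frequency formula can be evaluated with a short Python helper (the function names are ours; the binary-reflected Gray code is inverted bitwise):

def inverse_gray_code(n):
    # Invert the binary reflected Gray code: find k with k ^ (k >> 1) == n.
    k = 0
    while n:
        k ^= n
        n >>= 1
    return k

def center_frequency(l, n, fs):
    # f_{l,n} = 2^{-l} * (GC^{-1}(n) + 0.5) * fs, as in the text; the
    # Gray code accounts for the frequency ordering of the WPD nodes.
    return 2.0 ** (-l) * (inverse_gray_code(n) + 0.5) * fs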

The BS-WPD is defined as a CB-WPD with two additional levels of terminal node expansion. Due to the higher frequency resolution, the BS-WPD performed better in the task of speech enhancement. Our experiments showed that adding three additional levels of decomposition results in better performance of our source separation algorithm.

The proposed source separation algorithm needs access to the instantaneous spectral information from all analysis subbands. We therefore define a new version of the BS-WPD transform that has an equal sampling frequency in each sub-band. Unfortunately, the terminal nodes of the BS-WPD are positioned at different depths, and each depth is associated with a different sampling frequency. In order to align the signals from all sub-bands and equalize the sampling frequency, we do not allow decimation in nodes deeper than level 6 (i.e. l > 6). We call the resulting transform the Constant Sampling Rate BS-WPD (CSR-BS-WPD). We note that by canceling decimation in the lower levels of the WPD tree we introduce a certain amount of redundancy into the CSR-BS-WPD representation. Figure 1 shows the CSR-BS-WPD decomposition tree; a sketch of the modified decomposition step is given at the end of this section.

The CSR-BS-WPD analysis has only 168 sub-bands, compared to 513 sub-bands of an STFT analysis with approximately the same frequency resolution at low frequencies. Like the human auditory system, we sacrifice frequency resolution in the higher frequency range. Reducing the number of sub-bands results in a smaller dimension of the data used in the training and separation stages. A smaller data dimension has a potential to increase the accuracy of the GMM estimation because of less redundancy in the high frequency features, in addition to a lower computational burden.

We denote by X_{l,n}(m) the CSR-BS-WPD transform of x(n) in Eq. 2, where (l, n) are the indices of the terminal nodes and m is the time index. Since all terminal nodes have the same sampling rate, we can rearrange the elements of X_{l,n}(m) into a complex, single column vector X̄(m) ∈ ℂ^M. The dimension of X̄(m) is equal to the number of sub-bands, M = 168. The CSR-BS-WPD is a reversible transform: given the CSR-BS-WPD signal X̄(m), we can acquire the time domain signal x(n) first by inverting the WPD and then filtering x⁺(n) with the digital filter g⁺ as given by Eq. 2.

Figures 2 and 3 show time-frequency plots of a speech signal. The intensity of color shows the coefficient magnitude in dB. Band indices are shown on the vertical axis. In Fig. 2, the band indices on the vertical axis are the indices of the terminal nodes in growing mid-band frequency order. It can be seen that fewer time-frequency coefficients are dedicated to the higher frequency bands in the CSR-BS-WPD analysis than in the conventional STFT analysis.

Figure 2 CSR-BS-WPD time-frequency representation of a speech signal (band index vs. time [sec]).
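The following Python fragment sketches the modified splitting step under our own naming, with SciPy assumed; it illustrates the principle rather than reproducing the paper's implementation. Nodes at depths up to 6 are split and decimated as in an ordinary WPD; deeper nodes are split with upsampled (à trous) filters and no decimation, so every terminal node keeps the level-6 sampling rate.

import numpy as np
from scipy.signal import lfilter

def wpd_split(x, h, g, child_level, max_decim_level=6):
    # Split one WPD node into its two children at depth child_level.
    # h, g: low-pass / high-pass decomposition filters of the wavelet.
    if child_level <= max_decim_level:
        # Ordinary WPD step: filter, then decimate by two.
        return lfilter(h, 1.0, x)[::2], lfilter(g, 1.0, x)[::2]
    # Deeper nodes: no decimation; use a-trous filters, i.e. the base
    # filters upsampled by 2^(child_level - max_decim_level - 1).
    r = 2 ** (child_level - max_decim_level - 1)
    h_up = np.zeros(r * (len(h) - 1) + 1)
    g_up = np.zeros(r * (len(g) - 1) + 1)
    h_up[::r] = h
    g_up[::r] = g
    return lfilter(h_up, 1.0, x), lfilter(g_up, 1.0, x)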

Figure 3 STFT representation of a speech signal.

4 Mixture Components Estimation

Let s₁(n) and s₂(n) be the mixture components. We assume a mixing model without noise presence:

x(n) = s₁(n) + s₂(n).  (3)

Mapping the mixture signal into the Softy-space we get

x⁺(n) = s₁⁺(n) + s₂⁺(n),  (4)

and in the CSR-BS-WPD domain

X̄(m) = S̄₁(m) + S̄₂(m),  (5)

where S̄₁(m), S̄₂(m), X̄(m) ∈ ℂ^M.

We use the posterior mean (PM) to estimate the mixture components in the CSR-BS-WPD domain. Let x, s₁, s₂ be vectors of N observations of x(n), s₁(n) and s₂(n), respectively. It is shown by Benaroya et al. in [6] that for the mixing model (3) and Gaussian processes s₁(n), s₂(n), the PM estimators for s₁ and s₂ are given by

ŝ_c = Σ_c (Σ₁ + Σ₂)⁻¹ x,  c ∈ {1, 2},  (6)

where Σ_c ∈ ℝ^{N×N} is the covariance matrix of s_c.

For stationary and approximately circular processes, the Fourier transform F diagonalizes the covariance matrices, resulting in the following estimators:

F ŝ_c(f) = [σ²_c(f) / (σ²₁(f) + σ²₂(f))] F x(f),  c ∈ {1, 2},  (7)

where σ²_c is a vector of eigenvalues of Σ_c and f is the frequency index. We notice that this solution coincides with the well-known Wiener filter.

A diagonal covariance matrix of the STFT expansion coefficients is often assumed in speech processing applications [15]. Benaroya used this assumption in the application to musical signals. The covariance matrix diagonality means a lack of correlation between signals in different frequency bands.
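Under the diagonality assumption, Eq. 7 amounts to one Wiener gain per band. A minimal NumPy sketch (our own naming):

import numpy as np

def wiener_separate(X, var1, var2, eps=1e-12):
    # X: (M, T) complex mixture coefficients (M bands, T frames);
    # var1, var2: (M,) per-band variances of the two sources.
    # Eq. 7: each source is recovered by its per-band Wiener gain.
    g1 = var1 / (var1 + var2 + eps)
    S1_hat = g1[:, None] * X
    S2_hat = X - S1_hat  # the gains of the two sources sum to one
    return S1_hat, S2_hat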


Figure 4 Correlation coefficients matrix of the STFT expansion for: (a) a speech signal; (b) a musical signal.


Figure 5 Correlation coefficients matrix of the CSR-BS-WPD expansion using the dmey wavelet for: (a) a speech signal; (b) a musical signal.

We can interpret the CSR-BS-WPD transform as a filterbank. We can expect that as long as the WPD filters have good band-pass characteristics, the correlation between different frequency bands will also be low, and the diagonal covariance matrix assumption can be extended to the CSR-BS-WPD signals. Next, we verify this assumption on empirical data.

Figures 4, 5 and 6 show the absolute values of the correlation coefficient matrices for the STFT and CSR-BS-WPD transforms. The CSR-BS-WPD transform is based on the discrete Meyer wavelet (dmey) and the Daubechies wavelet of order 2 (db2). It is clear that the diagonal covariance assumption is inaccurate for all transforms, including the STFT. The highest amount of correlation is present in the db2 based CSR-BS-WPD transform, especially for the musical signal. High inter-band correlation may be explained either by correlated events occurring at different frequencies or by a leakage of energy between frequency bands due to non-ideal bandpass characteristics of the analysis filters. Since the same audio signals were used in Figs. 4–6, the differences in the correlation coefficients are explained by the properties of the analysis filters. The analysis filter of the db2 wavelet is only 4 samples long, compared to more than 100 coefficients of the dmey analysis filter. The db2 wavelet has inferior frequency localization and worse attenuation of side lobes compared to the dmey analysis filter; hence the correlation coefficients are higher for the db2 based CSR-BS-WPD expansion coefficients.

Despite its weakness, especially for some kinds of wavelet families, we assume the lack of correlation between expansion coefficients, since the estimation of the large number of off-diagonal elements of a covariance matrix is impractical.
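The empirical check behind Figs. 4–6 is simple to reproduce; a minimal sketch (our naming, NumPy assumed):

import numpy as np

def band_correlation(C):
    # C: (M, T) matrix of transform coefficients, one row per band.
    # Returns the M x M matrix of absolute correlation coefficients
    # between bands, the quantity visualized in Figs. 4-6.
    return np.abs(np.corrcoef(C))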


Figure 6 Correlation coefficients matrix of the CSR-BS-WPD expansion using the db2 wavelet for: (a) a speech signal; (b) a musical signal.

A simple Gaussian prior does not hold for most natural signals such as speech or music. The remedy is to assume Gaussian Mixture prior densities (GMM prior) [6]. The GMM describes the signal distribution as the outcome of a two stage process: first an active component k is selected out of the K Gaussian distributions in the mixture; then an observation sample is obtained using the selected model parameters μ^(k), Σ^(k), where μ^(k) and Σ^(k) are the expectation value and the covariance of the k-th component. The probability of selecting the k-th component is given by w_k (the k-th element of the probability vector w). The GMM is defined by ({μ^(k)}_{k=1}^K, {Σ^(k)}_{k=1}^K, w).

We estimate the mixture components using GMM priors. We introduce hidden variables q_c(m) ∈ {1, ..., K}, c ∈ {1, 2}, associated with the index of the active GMM component at time m. Let γ_{j,k}(m) = p(q₁(m) = j, q₂(m) = k | x) be the posterior probability of the active components; γ_{j,k}(m) is estimated from the mixture observations. Conditioning Eq. 3 on the active GMM components and applying it to the CSR-BS-WPD signal, we get the PM estimator

S̄̂₁(m) = Σ_{j,k=1}^K γ_{j,k}(m) Σ₁^(j) (Σ₁^(j) + Σ₂^(k))⁻¹ X̄(m).  (8)

The estimator for S̄̂₂(m) is derived in the same manner.

5 Training and Separation

Let L denote the number of training signal time samples in the CSR-BS-WPD domain. During the training stage we use the signal samples of both classes, {S̄₁(m)}_{m=1}^L and {S̄₂(m)}_{m=1}^L, to train two different GMM models. Both the Softy-space mapping and the WPD are linear transformations, so the expectation values of s, s⁺ and S̄ are zero. Hence we can define a simplified GMM model that assumes a zero mean for every state in the GMM:

Θ_c = ({Σ_c^(k)}_{k=1}^K, w_c),  w_c ∈ ℝ^K,  Σ_c^(k) ∈ ℝ^{M×M},  (9)

where K is the GMM model order. Following the reasoning in the previous section, we assume Σ_c^(k) to be a diagonal covariance matrix.

The training of the GMM models is performed using the Expectation Maximization (EM) algorithm [16], bootstrapped by the K-Means algorithm. The expectation value of the training data is assumed to be zero and is not updated during the expectation step of the EM algorithm, i.e. it is set to zero.

We note that the estimation of S̄̂_c(m) is performed for every time index m. In the rest of this section we omit the time index m for clearness of notation. In order to estimate the signal sources S̄̂_c using Eq. 8 for every time instance, we first estimate the posterior probability γ_{j,k}:

γ_{j,k} ∝ p(X̄ | q₁ = j, q₂ = k) p(q₁ = j) p(q₂ = k) = g(X̄; Σ₁^(j) + Σ₂^(k)) w₁^(j) w₂^(k),  (10)

where g(X̄; Σ) is a zero-mean multivariate Gaussian probability density function. Substituting Eq. 10 into Eq. 8 and using the Σ_c estimated in the training process, we obtain the estimators for S̄̂₁ and S̄̂₂.
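For illustration, Eqs. 8–10 for a single frame can be sketched as follows, assuming diagonal covariances stored as variance vectors. The names are ours, and the circular complex Gaussian likelihood is used up to a constant factor that cancels in the normalization.

import numpy as np

def gmm_separate_frame(X, var1, w1, var2, w2, eps=1e-12):
    # X: (M,) complex mixture coefficients for one frame.
    # var1: (K1, M) diagonal variances, w1: (K1,) weights of source 1;
    # var2, w2: likewise for source 2.
    v = var1[:, None, :] + var2[None, :, :] + eps        # (K1, K2, M)
    # Log-likelihood of X under each component pair (j, k), up to a
    # constant: zero-mean circular complex Gaussian, diagonal covariance.
    ll = -np.sum(np.log(v) + (np.abs(X) ** 2) / v, axis=2)
    ll += np.log(w1)[:, None] + np.log(w2)[None, :]
    gamma = np.exp(ll - ll.max())
    gamma /= gamma.sum()                                 # Eq. 10
    # Posterior-mean combination of per-pair Wiener gains (Eq. 8).
    G1 = np.sum(gamma[:, :, None] * (var1[:, None, :] / v), axis=(0, 1))
    return G1 * X, (1.0 - G1) * X

The model parameters (Σ_c^(k), w_c) themselves are fitted offline with the zero-mean EM procedure described above.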

6 Evaluation Criteria

In this section we define the evaluation criteria used in the experiments to evaluate the performance of the proposed algorithm. We use the common distortion measures described in [17] and the BSS_EVAL toolbox [18]. The mixture components s₁, s₂ are assumed to be uncorrelated. Let ŝ_c be an estimate of s_c. The estimate admits the following decomposition:

ŝ_c = y_c + e_{c,interf} + e_{c,artif},

y_c ≜ ⟨ŝ_c, s_c⟩ s_c,
e_{c,interf} ≜ ⟨ŝ_c, s_{c′}⟩ s_{c′},
e_{c,artif} ≜ ŝ_c − (y_c + ⟨ŝ_c, s_{c′}⟩ s_{c′}),

where c is the target class and c′ is the interfering class. The following performance measures are then defined:

SDR ≜ 10 log₁₀ ( ‖y_c‖² / ‖e_{c,interf} + e_{c,artif}‖² ),
SIR ≜ 10 log₁₀ ( ‖y_c‖² / ‖e_{c,interf}‖² ),
SAR ≜ 10 log₁₀ ( ‖y_c + e_{c,interf}‖² / ‖e_{c,artif}‖² ).

The Signal to Distortion Ratio (SDR) measures the total amount of distortion introduced to the original signal, both due to the interfering signal and due to artifacts introduced by the algorithm. The Signal to Interference Ratio (SIR) measures the amount of distortion introduced into the original signal by the interfering signal. The Signal to Artifact Ratio (SAR) measures the amount of artifacts introduced into the original signal by the separation algorithm that do not originate in the interfering signal.
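A compact sketch of these measures (our own helper, assuming the reference signals s_c and s_{c′} are unit-energy and mutually orthogonal; the BSS_EVAL toolbox [18] performs the exact projections):

import numpy as np

def bss_measures(s_hat, s_target, s_interf):
    def proj(u, v):
        # Orthogonal projection of u onto the direction of v.
        return (np.dot(u, v) / np.dot(v, v)) * v
    y = proj(s_hat, s_target)
    e_interf = proj(s_hat, s_interf)
    e_artif = s_hat - y - e_interf
    sdr = 10 * np.log10(np.sum(y**2) / np.sum((e_interf + e_artif)**2))
    sir = 10 * np.log10(np.sum(y**2) / np.sum(e_interf**2))
    sar = 10 * np.log10(np.sum((y + e_interf)**2) / np.sum(e_artif**2))
    return sdr, sir, sar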

Usually, some parameters can be chosen to tune the trade-off between interfering signal leakage (SIR) and the distortion to the desired signal (SAR). For example, it is possible to reduce the SIR to −∞ simply by zeroing the source estimate; however, the SAR measure will become very high in this case. The SDR is a combined measure for both SIR and SAR, hence it is convenient to compare algorithm performance based on the SDR. We use a sub-index to indicate which extracted signal a measure refers to; for example, SDR₁ refers to the signal-to-distortion ratio of the first extracted component.

Figure 7 The signal-to-distortion ratio (SDR₁) of the extracted speech signal, based on four test excerpts of 90 s each.

7 Experimental Results

In this section we present some experimental results. We study the algorithm performance on speech utterances mixed with different musical excerpts. We study the effect of the training excerpt length, the wavelet family choice, the GMM order, and the speaker gender on the performance of the algorithm.

We compare the performance of the proposed algorithm to a similar algorithm based on the STFT signal analysis [6, 7]. In order for the comparison to the STFT-based algorithm to be fair, we select the STFT analysis parameters to match the time and frequency resolution of the CSR-BS-WPD at the lowest frequency band. The length of the STFT analysis window is chosen to be 1,024 samples and the overlap between subsequent frames is 960 samples. We use a Hamming synthesis window and its bi-orthogonal vector as the analysis window. We note that the number of expansion coefficients for the STFT transform is roughly three times greater than for the CSR-BS-WPD transform.

We selected four different musical excerpts: two solo piano pieces and two wind quartet pieces. The titles of the musical excerpts and the abbreviations used in the text are shown in Table 1. A speech signal consists of a single sentence pronounced by different male speakers. All speech utterances were taken from the TIMIT database. While the TIMIT database provided speech utterances recorded in the same controlled environment, we did not have access to a similar musical signal repository. To overcome this problem, we used non-overlapping parts of the same recording for model training and for the mixture.

Table 1 List of musical excerpts used in the experimental results.

Abbreviation   Name                                                  Instrument
W1             L.v. Beethoven, Quartet No. 10 in Eb, Op. 74 -        Wind quartet
               Poco adagio
W2             L.v. Beethoven, Quartet No. 10 in Eb, Op. 74 -        Wind quartet
               Adagio ma non troppo
P1             J.S. Bach, Air                                        Piano solo
P2             J.S. Bach, French Suite No. 4                         Piano solo

First, we examined the effect of the training signal length on the separation performance. We trained GMM models using 10 to 90 s of the speech and music excerpts. All signals were sampled at 16 kHz. As a preprocessing step, we normalized the energy of all the signals (the normalization is important since the statistical model used is not invariant to the signal amplitude; a Gaussian Scaled Mixture Model [6] can be used as a remedy). We evaluated the separation performance using another 90 s of the speech and music mixture.

We used the following wavelet families in our experiments: the discrete approximation of the Meyer wavelet (dmey), Daubechies wavelets of order 2 and 5 (db2, db5), and Coiflets of order 3 and 5 (coif3, coif5). Figures 7 and 8 present the signal-to-distortion ratios (SDR₁, SDR₂) of the extracted speech and musical signals, respectively, for the different algorithms. These figures present the average SDR of the recovered speech and music signals; a high value of SDR₁ indicates a small degree of speech distortion in the extracted component. The other measures exhibit a similar performance trend and are not shown here. The SDR values shown in these figures are averaged over four test sequences, 90 s each. The Discrete Meyer (dmey) based CSR-BS-WPD algorithm demonstrates superior performance compared to the signal analysis based on the other wavelet families for all training signal lengths. The performance of the STFT based algorithm is comparable to the dmey based algorithm. When the length of the training signal is more than 70 s, the SDR is only slightly improved for the dmey and STFT based algorithms and remains the same for the others. In the rest of the experiments we use 90 s long training sequences.

Figure 8 The signal-to-distortion ratio (SDR₂) of the extracted musical signal, based on four test excerpts of 90 s each.

Figure 9 presents the separation performance as a function of the GMM model order (K = 2, ..., 30). Higher order GMM models are useful for describing more complex statistical behavior, at the price of higher computational complexity and larger data sets required for training. Error bars on the SDR plots show 68% confidence intervals for the STFT and dmey based CSR-BS-WPD algorithms. The dmey based CSR-BS-WPD and STFT exhibit superior performance to the other wavelet families, under all the performance criteria.


Figure 9 SDR, SIR and SAR averages for the speech (sub-index 1) and music (sub-index 2) recovered signals. All measures are averaged over 6 min of test mixture signals. Panels: (a) SDR₁, (b) SDR₂, (c) SIR₁, (d) SIR₂, (e) SAR₁, (f) SAR₂.

A trade-off between SIR and SAR can be seen occasionally, but the SDR measure supports our claim consistently. For very low GMM orders such as 2 or 4, the STFT based algorithm performs significantly worse than all CSR-BS-WPD based algorithms.

Figures 10 and 11 show spectrograms of the original and extracted speech and piano components. When the trained GMM model does not have a component that describes an observed mixture component well enough, a sequence of frames may be entirely filtered out, as can be seen in Figs. 10b and 11b. This kind of artifact can be found in both the STFT and CSR-BS-WPD based algorithms, but it is more common in the STFT based algorithm. This can be explained by a better fit of the GMM model due to the lower signal dimension. Another artifact, observed in Fig. 11b, is a high-frequency residual of fricative phonemes that leaks into the extracted music. This artifact appears mostly in musical signals extracted by the STFT based algorithm.

Figure 10 Spectrograms (in dB) of (a) the speech signal used in the mixture; the speech component extracted from the mixture (b) using the STFT based algorithm; (c) using the CSR-BS-WPD based algorithm with the dmey wavelet family.

Figure 11 Spectrograms (in dB) of (a) the piano signal (P1) used in the mixture; the piano component extracted from the mixture (b) using the STFT based algorithm; (c) using the CSR-BS-WPD based algorithm with the dmey wavelet family.

We verified the algorithm performance on speech excerpts pronounced by male and female speakers. We trained the GMM model on 90 s of speech pronounced only by female speakers. We tested the performance on an additional 90 s of a speech and music mixture. We used a GMM order of K = 30.

Table 2 presents the experiment results averaged over all four musical excerpts (P1, P2, W1, W2). The proposed algorithm and the STFT based algorithm perform better when separating female speech from music than when separating male speech from music. The SDR measure is approximately 1 dB higher for female speech than for male speech, for both algorithms and for all wavelet families. The SIR and SAR performance measures exhibited similar behavior.

Table 2 Signal-to-distortion ratio of the extracted speech component (SDR₁) and music component (SDR₂). Either male-only or female-only speech is used in the training and the separation. The order of the GMM model is K = 30.

Wavelet family     SDR₁                  SDR₂
                   Female     Male       Female     Male
coif3              5.27       4.12       5.20       3.97
coif5              5.05       3.96       4.93       3.78
db2                5.22       4.08       5.14       3.96
db5                5.12       3.93       5.05       3.80
dmey               6.19       5.25       6.01       5.08
STFT               5.58       4.91       5.50       4.79

Informal listening tests reveal that the subjective quality of the extracted musical signals is comparable for the STFT and the CSR-BS-WPD based algorithms. The extracted speech signal, on the other hand, has a slightly more pleasant sound when recovered by the proposed algorithm. A relatively large number of misclassified frames, similar to those shown in Fig. 11b, results in an unnatural speech sound and reduces the subjective signal quality.

Audio files with the mixtures used in the experiments and the extracted signals can be downloaded from http://sipl.technion.ac.il/~elitvin/CSR-BS-WPD.

8 Conclusions and Future Work

We have described how a Bark-scaled WPD can be adapted to the source separation task. The introduction of a different subsampling scheme and of approximate shiftability to the psychoacoustically motivated BS-WPD decomposition tree enabled us to adapt a source separation algorithm that was originally designed to work in the uniformly sampled STFT domain.

We found several advantages in using the proposed analysis together with a GMM based source separation algorithm. The most significant advantage is the reduction of the dimension of the signal space at the expense of high frequency resolution. As a result, the computational burden and the number of trained parameters of the GMM model are reduced.

We found that the choice of the wavelet family used with the proposed separation algorithm is crucial. The discrete Meyer wavelet based CSR-BS-WPD analysis yielded improved separation performance compared to the STFT based analysis for very low orders of the GMM model. Other wavelet families failed to produce performance comparable to the STFT based algorithm.

Cohen [10] used the dmey wavelet family for speech enhancement. He justified the advantage of the dmey wavelet family by the regularity of the wavelet filter and its good frequency localization properties. In Section 4 we showed that superior frequency localization properties reduce the correlation between expansion coefficients, hence improving the fit to our separation model. Due to the similarity of our separation algorithm to the STFT based algorithm, the performance and pitfalls of both algorithms are comparable.

Future work with the CSR-BS-WPD analysis may address various audio processing tasks traditionally performed in the STFT domain which require instantaneous spectral shapes and where the psychoacoustically motivated band structure of the time-frequency analysis might be desirable.

References

1. Vincent, E., Févotte, C., Benaroya, L., & Gribonval, R. (2003). A tentative typology of audio source separation tasks. In Proc. 4th international symposium on independent component analysis and blind signal separation (ICA2003) (pp. 715–720). Nara, Japan.
2. Cherry, C. E. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25(5), 975–979.
3. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
4. Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Wiley-Interscience.
5. Ozerov, A., Philippe, P., Bimbot, F., & Gribonval, R. (2007). Adaptation of bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech & Language Processing, 15(5), 1564–1578.
6. Benaroya, L., Bimbot, F., & Gribonval, R. (2006). Audio source separation with a single sensor. IEEE Transactions on Audio, Speech & Language Processing, 14(1), 191–199.
7. Benaroya, L., & Bimbot, F. (2003). Wiener based source separation with HMM/GMM using a single sensor. In Proc. 4th international symposium on independent component analysis and blind signal separation (ICA2003) (pp. 957–961). Nara, Japan.
8. Srinivasan, S., Samuelsson, J., & Kleijn, W. B. (2006). Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Transactions on Audio, Speech & Language Processing, 14(1), 163–176.
9. Srinivasan, S., Samuelsson, J., & Kleijn, W. B. (2007). Codebook-based bayesian speech enhancement for nonstationary environments. IEEE Transactions on Audio, Speech & Language Processing, 15(2), 441–452.

10. Cohen, I. (2001). Enhancement of speech using bark-scaled wavelet packet decomposition. In Proc. 7th European conf. speech, communication and technology, EUROSPEECH-2001 (pp. 1933–1936). Aalborg, Denmark.
11. Fernandes, F. C. A., van Spaendonck, R. L. C., & Burrus, C. S. (2003). A new framework for complex wavelet transforms. IEEE Transactions on Signal Processing, 51(7), 1825–1837.
12. Litvin, Y., & Cohen, I. (2009). Single-channel source separation of audio signals using bark scale wavelet packet decomposition. In 2009 IEEE international workshop on machine learning for signal processing (MLSP09).
13. Fernandes, F. C. A. (2002). Directional, shift-insensitive, complex wavelet transforms with controllable redundancy. Ph.D. thesis, Rice Univ., Houston, TX, USA.
14. Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2), 587–607.
15. Ephraim, Y., & Malah, D. (1984). Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121.
16. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
17. Gribonval, R., Benaroya, L., Vincent, E., & Févotte, C. (2003). Proposals for performance measurement in source separation. In Proc. 4th international symposium on ICA and BSS (ICA2003) (pp. 763–768). Nara, Japan.
18. Févotte, C., Gribonval, R., & Vincent, E. (2005). BSS_EVAL toolbox user guide revision 2.0. Tech. Rep. 1706, IRISA, Rennes, France.

Yevgeni Litvin received the B.Sc. (Summa Cum Laude) and M.Sc. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel, in 2007 and 2010, respectively. From 1996 to 2003, he was a Software Engineer with the Israeli Defense Forces. From 2008 to 2009, he was a Teaching Assistant and a Project Supervisor with the Signal and Image Processing Lab (SIPL), Electrical Engineering Department, the Technion. Since 2009, he has been an Algorithm Engineer with Zoran. His research interests are statistical signal processing, speech enhancement, statistical machine learning and video processing.

Israel Cohen (M'01-SM'03) received the B.Sc. (Summa Cum Laude), M.Sc. and Ph.D. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel, in 1990, 1993 and 1998, respectively. From 1990 to 1998, he was a Research Scientist with RAFAEL Research Laboratories, Haifa, Israel Ministry of Defense. From 1998 to 2001, he was a Postdoctoral Research Associate with the Computer Science Department, Yale University, New Haven, CT. In 2001 he joined the Electrical Engineering Department of the Technion, where he is currently an Associate Professor. His research interests are statistical signal processing, analysis and modeling of acoustic signals, speech enhancement, noise estimation, microphone arrays, source localization, blind source separation, system identification and adaptive filtering. He served as Guest Editor of a special issue of the EURASIP Journal on Advances in Signal Processing on Advances in Multimicrophone Speech Processing and a special issue of the EURASIP Speech Communication Journal on Speech Enhancement. He is a coeditor of the Multichannel Speech Processing section of the Springer Handbook of Speech Processing (Springer, 2007), a coauthor of Noise Reduction in Speech Processing (Springer, 2009), and a cochair of the 2010 International Workshop on Acoustic and Noise Control. Dr. Cohen received in 2005 and 2006 the Technion Excellent Lecturer awards, and in 2009 the Muriel and David Jacknow award for Excellence in Teaching. He served as Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing and the IEEE Signal Processing Letters.