EURASIP Journal on Advances in Signal Processing

Microphone Array Speech Processing

Guest Editors: Sven Nordholm, Thushara Abhayapala, Simon Doclo, Sharon Gannot, Patrick Naylor, and Ivan Tashev

Copyright © 2010 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2010 of "EURASIP Journal on Advances in Signal Processing." All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief: Phillip Regalia, Institut National des Télécommunications, France

Associate Editors

Adel M. Alimi, Tunisia; Kenneth Barner, USA; Yasar Becerikli, Turkey; Kostas Berberidis, Greece; Enrico Capobianco, Italy; A. Enis Cetin, Turkey; Jonathon Chambers, UK; Mei-Juan Chen, Taiwan; Liang-Gee Chen, Taiwan; Satya Dharanipragada, USA; Kutluyil Dogancay, Australia; Florent Dupont, France; Frank Ehlers, Italy; Sharon Gannot, Israel; Samanwoy Ghosh-Dastidar, USA; Norbert Goertz, Austria; M. Greco, Italy; Irene Y. H. Gu, Sweden; Fredrik Gustafsson, Sweden; Ulrich Heute, Germany; Sangjin Hong, USA; Jiri Jan, Czech Republic; Magnus Jansson, Sweden; Sudharman K. Jayaweera, USA; Soren Holdt Jensen, Denmark; Mark Kahrs, USA; Moon Gi Kang, South Korea; Walter Kellermann, Germany; Lisimachos P. Kondi, Greece; Alex Chichung Kot, Singapore; Ercan E. Kuruoglu, Italy; Tan Lee, China; Geert Leus, The Netherlands; T.-H. Li, USA; Husheng Li, USA; Mark Liao, Taiwan; Y.-P. Lin, Taiwan; Shoji Makino, Japan; Stephen Marshall, UK; C. Mecklenbräuker, Austria; Gloria Menegaz, Italy; Ricardo Merched, Brazil; Marc Moonen, Belgium; Christophoros Nikou, Greece; Sven Nordholm, Australia; Patrick Oonincx, The Netherlands; Douglas O'Shaughnessy, Canada; Björn Ottersten, Sweden; Jacques Palicot, France; Ana Perez-Neira, Spain; Wilfried R. Philips, Belgium; Aggelos Pikrakis, Greece; Ioannis Psaromiligkos, Canada; Athanasios Rontogiannis, Greece; Gregor Rozinaj, Slovakia; Markus Rupp, Austria; William Sandham, UK; B. Sankur, Turkey; Erchin Serpedin, USA; Ling Shao, UK; Dirk Slock, France; Yap-Peng Tan, Singapore; João Manuel R. S. Tavares, Portugal; George S. Tombras, Greece; Dimitrios Tzovaras, Greece; Bernhard Wess, Austria; Jar-Ferr Yang, Taiwan; Azzedine Zerguine, Saudi Arabia; Abdelhak M. Zoubir, Germany

Contents

Microphone Array Speech Processing, Sven Nordholm, Thushara Abhayapala, Simon Doclo, Sharon Gannot (EURASIP Member), Patrick Naylor, and Ivan Tashev, Volume 2010, Article ID 694216, 3 pages

Selective Frequency Invariant Uniform Circular Broadband Beamformer, Xin Zhang, Wee Ser, Zhang Zhang, and Anoop Kumar Krishna, Volume 2010, Article ID 678306, 11 pages

First-Order Adaptive Azimuthal Null-Steering for the Suppression of Two Directional Interferers, René M. M. Derkx, Volume 2010, Article ID 230864, 16 pages

Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on Higher-Order Statistics, Yu Takahashi, Hiroshi Saruwatari, Kiyohiro Shikano, and Kazunobu Kondo Volume 2010, Article ID 431347, 25 pages

Microphone Diversity Combining for In-Car Applications, Jürgen Freudenberger, Sebastian Stenzel, and Benjamin Venditti, Volume 2010, Article ID 509541, 13 pages

DOA Estimation with Local-Peak-Weighted CSP, Osamu Ichikawa, Takashi Fukuda, and Masafumi Nishimura Volume 2010, Article ID 358729, 9 pages

Shooter Localization in Wireless Microphone Networks, David Lindgren, Olof Wilsson, Fredrik Gustafsson, and Hans Habberstad, Volume 2010, Article ID 690732, 11 pages

Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2010, Article ID 694216, 3 pages, doi:10.1155/2010/694216

Editorial

Microphone Array Speech Processing

Sven Nordholm (EURASIP Member),1 Thushara Abhayapala (EURASIP Member),2 Simon Doclo (EURASIP Member),3 Sharon Gannot (EURASIP Member),4 Patrick Naylor (EURASIP Member),5 and Ivan Tashev6

1 Department of Electrical and Computer Engineering, Curtin University of Technology, Perth, WA 6845, Australia
2 College of Engineering & Computer Science, The Australian National University, Canberra, ACT 0200, Australia
3 Institute of Physics, Signal Processing Group, University of Oldenburg, 26111 Oldenburg, Germany
4 School of Engineering, Bar-Ilan University, 52900, Israel
5 Department of Electrical and Electronic Engineering, Imperial College, London SW7 2AZ, UK
6 Microsoft Research, USA

Correspondence should be addressed to Sven Nordholm, [email protected]

Received 21 July 2010; Accepted 21 July 2010

Copyright © 2010 Sven Nordholm et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Significant knowledge about microphone arrays has been gained from years of intense research and product development. There have been numerous applications suggested, for example, from large arrays (in the order of >100 elements) for use in auditoriums to small arrays with only 2 or 3 elements for hearing aids and mobile telephones. Apart from that, microphone array technology has been widely applied in speech recognition, surveillance, and warfare. Traditional techniques that have been used for microphone arrays include fixed spatial filters, such as frequency invariant beamformers, and optimal and adaptive beamformers. These array techniques assume either model knowledge or calibration-signal knowledge, as well as localization information, for their design. Thus they usually combine some form of localisation and tracking with the beamforming. Today, contemporary techniques using blind signal separation (BSS) and time-frequency masking have attracted significant attention. Those techniques are less reliant on an array model and localization, and more on the statistical properties of speech signals such as sparseness, non-Gaussianity, and non-stationarity. The main advantage that multiple microphones add, from a theoretical perspective, is spatial diversity, which is an effective tool to combat interference, reverberation, and noise. The underpinning physical feature used is a difference in coherence between the target field (the speech signal) and the noise field. Viewing the processing in this way, one can also understand the difficulty in enhancing highly reverberant speech, given that we can only observe the received microphone signals.

This special issue contains contributions to traditional areas of research such as frequency invariant beamforming [1], hands-free operation of microphone arrays in cars [2], and source localisation [3]. The contributions show new ways to study these traditional problems and give new insights into them. Small-size arrays have always attracted many applications and much interest for mobile terminals, hearing aids, and close-up microphones [4]. The novel way to represent small-size arrays leads to a capability to suppress multiple interferers. Abnormalities in noise and speech stemming from processing are largely unavoidable, and using nonlinear processing often results in significant character change, particularly in the noise character. It is thus important to provide new insights into those phenomena, particularly the so-called musical noise [5]. Finally, new and unusual uses of microphone arrays are always interesting to see. Distributed microphone arrays in a sensor network [6] provide a novel approach to finding snipers. This type of processing has good opportunities to grow in interest for new and improved applications.

The contributions found in this special issue can be categorized into three main aspects of microphone array processing: (i) microphone array design based on eigenmode decomposition [1, 4]; (ii) multichannel processing methods [2, 5]; and (iii) source localisation [3, 6].

The paper by Zhang et al., "Selective frequency invariant uniform circular broadband beamformer" [1], describes a design method for Frequency-Invariant (FI) beamforming. This is a well-known array signal processing technique used in many applications such as speech acquisition, acoustic imaging, and communications. However, many existing FI beamformers are designed to have a frequency invariant gain over all angles. This might not be necessary, and if a gain constraint is confined to a specific angle, then the FI performance over that selected region (in frequency and angle) can be expected to improve. Inspired by this idea, the proposed algorithm attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice in performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. This problem is formulated as a convex optimization problem and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented as well as its performance. The performance is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.

The paper by Derkx, "First-order adaptive azimuthal null-steering for the suppression of two directional interferers" [4], shows that an azimuth-steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although the response of a (rotation symmetric) first-order response can only exhibit a single null, the paper studies a slice through this beampattern lying in the azimuthal plane. In this way, a maximum of two nulls in the azimuthal plane can be defined. These nulls are symmetric with respect to the main-lobe axis. By placing these two nulls on maximally two directional sources to be rejected and compensating for the drop in level in the desired direction, these directional sources can be effectively rejected without attenuating the desired source. An adaptive null-steering scheme for adjusting the beampattern, which enables automatic source suppression, is presented. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.

In the paper by Takahashi et al., "Musical noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics" [5], an objective analysis of musical noise is conducted. The musical noise is generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable that musical noise is mitigated. Moreover, it has been recently reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated. Thus, the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, are analysed, and the features of the musical noise generated by each method are clarified. As a result, it is clarified that a specific structure of integration is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is shown via a computer simulation and a subjective evaluation.

The paper by Freudenberger et al., "Microphone diversity combining for in-car applications" [2], proposes a frequency-domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach: in the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum-ratio combining. The combined signal is then used as a reference for a frequency-domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.

The paper by Ichikawa et al., "DOA estimation with local-peak-weighted CSP" [3], proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction-of-arrival (DOA) estimation for beamforming in a noisy environment. A human speaker is used as the sound source, and broadband automobile noise is used as the noise source. The harmonic structures in the human speech spectrum can be used for weighting the CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. The suggested approach employs the observed power spectrum, which is directly converted into weights for the CSP analysis by retaining only the local peaks considered to be coming from a harmonic structure. The presented results show that the proposed approach significantly reduces the errors in localization, and it also shows further improvement when used with other weighting algorithms.

The paper by Lindgren et al., "Shooter localization in wireless microphone networks" [6], is an interesting combination of microphone array technology with distributed communications. By detecting the muzzle blast as well as the ballistic shock wave, the microphone array algorithm is able to locate the shooter in the case when the sensors are synchronized. However, in the distributed sensor case, synchronization is either not achievable or very expensive to achieve, and therefore the accuracy of localization comes into question. Field trials are described to support the algorithmic development.

Sven Nordholm
Thushara Abhayapala
Simon Doclo
Sharon Gannot
Patrick Naylor
Ivan Tashev

References

[1] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, "Selective frequency invariant uniform circular broadband beamformer," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 678306, 11 pages, 2010.
[2] J. Freudenberger, S. Stenzel, and B. Venditti, "Microphone diversity combining for in-car applications," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 509541, 13 pages, 2010.
[3] O. Ichikawa, T. Fukuda, and M. Nishimura, "DOA estimation with local-peak-weighted CSP," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 358729, 9 pages, 2010.
[4] R. M. M. Derkx, "First-order adaptive azimuthal null-steering for the suppression of two directional interferers," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 230864, 16 pages, 2010.
[5] Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, "Musical-noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 431347, 25 pages, 2010.
[6] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad, "Shooter localization in wireless sensor networks," in Proceedings of the 12th International Conference on Information Fusion (FUSION '09), pp. 404–411, July 2009.

Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2010, Article ID 678306, 11 pages, doi:10.1155/2010/678306

Research Article

Selective Frequency Invariant Uniform Circular Broadband Beamformer

Xin Zhang,1 Wee Ser,1 Zhang Zhang,1 and Anoop Kumar Krishna2

1 Center for Signal Processing, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2 EADS Innovation Works, EADS Singapore Pte Ltd., No. 41, Science Park Road, 01-30, Singapore 117610

Correspondence should be addressed to Xin Zhang, zhang [email protected]

Received 16 April 2009; Revised 24 August 2009; Accepted 3 December 2009

Academic Editor: Thushara Abhayapala

Copyright © 2010 Xin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Frequency-Invariant (FI) beamforming is a well-known array signal processing technique used in many applications. In this paper, an algorithm is proposed that attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice in performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. This problem is formulated as a convex optimization problem and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented as well as its performance. The performance is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.

1. Introduction

Broadband beamforming techniques using an array of microphones have been applied widely in hearing aids, teleconferencing, and voice-activated human-computer interface applications. Several broadband beamformer designs have been reported in the literature [1–3]. One design approach is to decompose the broadband signal into several narrowband signals and apply narrowband beamforming techniques to each narrowband signal [4]. This approach requires several narrowband processing chains to run simultaneously and is computationally expensive. Another design approach is to use adaptive broadband beamformers. Such techniques use a bank of linear transversal filters to generate the desired beampattern. The filter coefficients can be derived adaptively from the received signals. One classic design example is the Frost beamformer [5]. However, in order to have a similar beampattern over the entire frequency range, a large number of sensors and filter taps are needed. This again leads to high computational complexity. The third approach to designing broadband beamformers is to use the Frequency-Invariant (FI) beampattern synthesis technique. As the name implies, such beamformers are designed to have a constant spatial gain response over the desired frequency bands.

Over recent years, FI beamforming techniques have developed at a fast pace. It is difficult to make a distinct classification. However, in order to grasp the literature on FI beamforming at a glance, we classify the techniques loosely into the following three types.

One type of FI beamformer focuses on designs based on array geometry. These include, for example, the 3D sensor array design reported in [6], the rectangular sensor array design reported in [7], and the design using subarrays in [8]. In [9], the FI beampattern is achieved by exploiting the relationship among the frequency responses of the various filters implemented at the output of each sensor.

The second type of FI beamformer is designed on the basis of a least-squares approach. For this type of FI beamformer, the weights are optimized such that the error between the actual beampattern and the desired beampattern is minimized over a range of frequencies. Some such beamformers are designed in the time-frequency domain [10–12], while others are designed in the eigen-space domain [13].

The third type of FI beamformer is designed based on "signal transformation." For this type of beamformer, the signal received at the sensor array is transformed into a domain such that the frequency response and the spatial response of the signal can be decoupled and hence adjusted independently. This is the principle adopted in [14], where a uniform concentric circular array (UCCA) is designed to achieve the FI beampattern. Excellent results have been produced by this algorithm. One limitation of the UCCA beamformer is that a relatively large number of sensors has to be used to form the concentric circular array.

Inspired by the UCCA beamformer design, a new algorithm has been proposed by the authors of this paper and presented in [15]. The proposed algorithm attempts to optimize the FI beampattern solely for the main lobe, where the signal of interest is from, and relaxes the FI requirement on the side lobes. As a result, the sacrifice in performance in the undesired region is traded off for better performance in the desired region, and fewer microphones are employed. To achieve this goal, an objective function with a quadratic constraint is designed. This constraint function allows the FI characteristic to be accurately controlled over the specified bandwidth at the expense of other parts of the spectrum which are not of concern to the designer. This objective function is formulated as a convex optimization problem and solved readily by SOCP. Our algorithm has a frequency band of interest from 0.3π to 0.95π. If the sampling frequency is 16000 Hz, the frequency band of interest ranges from 2400 Hz to 7600 Hz. This algorithm can be applied in speech processing, as the labial and fricative sounds of speech mostly lie in the 8th to 9th octave. If the sampling frequency is 8000 Hz, the frequency band of interest is from 1200 Hz to 3800 Hz. This frequency range is useful for respiratory sounds [16].

The aim of this paper is to provide the full details of the design proposed in [15]. In addition, a computational complexity analysis of the proposed algorithm and sensitivity performance evaluations at different numbers of sensors and different constraint parameter values are also included.

The remainder of the paper is organized in the following way: in Section 2, the problem formulation is discussed; in Section 3, the proposed beamforming design is described; in Section 4, the design of the beamforming weights using SOCP is shown; numerical results are given in Section 5; and finally, conclusions are drawn in Section 6.

2. Problem Formulation

A uniformly distributed circular sensor array with K microphones is arranged as shown in Figure 1. Each omnidirectional sensor is located at (r cos φ_k, r sin φ_k), where r is the radius of the circle, φ_k = 2kπ/K, and k = 0, ..., K − 1. In this configuration, the intersensor spacing is fixed at λ/2, where λ is the wavelength of the signals of interest and its minimum value is denoted by λ_min. The radius corresponding to λ_min is given by [14]

\[ r = \frac{\lambda_{\min}}{4\sin(\pi/K)}. \tag{1} \]

Figure 1: Uniform Circular Array Configuration (the kth element sits at azimuth φ_k on a circle of radius r).

Assuming that the circular array is on a horizontal plane, the steering vector is

\[ \mathbf{a}(f,\phi) = \left[ e^{j2\pi f r\cos(\phi-\phi_0)/c}, \ldots, e^{j2\pi f r\cos(\phi-\phi_{K-1})/c} \right]^T, \tag{2} \]

where T denotes transpose. For convenience, let ω be the normalized angular frequency, that is, ω = 2πf/f_s; let ϱ be the ratio of the sampling frequency to the maximum frequency, that is, ϱ = f_s/f_max; and let r̄ be the normalized radius, that is, r̄ = r/λ_min. The steering vector can then be rewritten as

\[ \mathbf{a}(\omega,\phi) = \left[ e^{j\omega\varrho\bar{r}\cos(\phi-\phi_0)}, \ldots, e^{j\omega\varrho\bar{r}\cos(\phi-\phi_{K-1})} \right]^T. \tag{3} \]

Figure 2 shows the system structure of the proposed uniform circular array beamformer. The sampled signals after the sensors are represented by the vector X[n] = [x_0(n), x_1(n), ..., x_{K−1}(n)]^T, where n is the sampling instance. These sampled signals are transformed into a set of coefficients via the Inverse Discrete Fourier Transform (IDFT), where each of the coefficients is called a phase mode [17]. The mth phase mode at time instance n can be expressed as

\[ p_m[n] = \sum_{k=0}^{K-1} x_k[n]\, e^{j2\pi k m/K}. \tag{4} \]

These phase modes are passed through an FIR (Finite Impulse Response) filter whose coefficients are denoted b_m[n]. The purpose of this filter is to remove the frequency dependency of the received signal X[n]. The beamformer output y[n] is then determined as the weighted sum of the filtered signals:

\[ y[n] = \sum_{m=-L}^{L} \left( p_m[n] * b_m[n] \right) h_m, \tag{5} \]

where h_m are the phase spatial weighting coefficients, or beamforming weights, and * is the discrete-time convolution operator.

Let M be the total number of phase modes; it is assumed to be an odd number. It can be seen from Figure 2 that the K received signals are transformed into M phase modes, where L = (M − 1)/2.

The corresponding spectrum of the phase modes can be obtained by taking the Discrete Time Fourier Transform (DTFT) of the phase modes defined in (4):

\[ P_m(\omega) = \sum_{k=0}^{K-1} X_k(\omega)\, e^{j2\pi k m/K} = S(\omega) \sum_{k=0}^{K-1} e^{j\omega\varrho\bar{r}\cos(\phi-\phi_k)}\, e^{j2\pi k m/K}, \tag{6} \]

where S(ω) is the spectrum of the source signal. Taking the DTFT on both sides of (5) and using (6), we have

\[ Y(\omega) = \sum_{m=-L}^{L} h_m P_m(\omega) B_m(\omega) = S(\omega) \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j\omega\varrho\bar{r}\cos(\phi-\phi_k)}\, e^{j2\pi k m/K} \right) B_m(\omega). \tag{7} \]

Consequently, the response of the beamformer can be expressed as

\[ G(\omega,\phi) = \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j\omega\varrho\bar{r}\cos(\phi-\phi_k)}\, e^{j2\pi k m/K} \right) B_m(\omega). \tag{8} \]

In order to obtain an FI response, terms which are functions of ω are grouped together using the Jacobi-Anger expansion, given as follows [18]:

\[ e^{j\beta\cos\gamma} = \sum_{n=-\infty}^{+\infty} j^n J_n(\beta)\, e^{jn\gamma}, \tag{9} \]

where J_n(β) is the Bessel function of the first kind of order n. Substituting (9) into (8) and applying properties of the Bessel function, the spatial response of the beamformer can now be approximated by

\[ G(\omega,\phi) = \sum_{m=-L}^{L} h_m\, e^{jm\phi}\, K\, j^m\, J_m(\omega\varrho\bar{r})\, B_m(\omega). \tag{10} \]

This process has been described in [13], and its detailed derivation can be found in [14].

3. Proposed Novel Beamformer

With the above formulation, we propose the following beampattern synthesis method. The basic idea is to enhance the broadband signals for a specific frequency region and at a certain direction. In order to achieve this goal, the following objective function is proposed:

\[ \min \int_{\omega}\int_{\phi} \left| G(\omega,\phi) \right|^2 d\omega\, d\phi, \quad \text{s.t. } \left| G(\omega,\phi_0) - 1 \right| \le \delta, \ \omega \in [\omega_l, \omega_u], \tag{11} \]

where G(ω,φ) is the spatial response of the beamformer given in (10), and ω_l and ω_u are the lower and upper limits of the specified frequency region, respectively. φ_0 is the specified direction, and δ is a predefined threshold value that controls the magnitude of the ripples of the main beam.

In principle, the objective function defined above aims to minimize the square of the spatial gain response across all frequencies and all angles, while constraining the gain to the value of one at the specified angle. This relaxes the gain constraint to one angle instead of all angles, so that the FI beampattern in the specified region can be improved. With this constraint setting, the resulting beamformer can enhance broadband desired signals arriving from one direction while attenuating broadband noise received from other directions. The concept for formulating the objective function is similar to the Capon beamformer [19]. One difference is that the Capon beamformer aims to minimize the data-dependent array output power at a single frequency, while the proposed algorithm aims to minimize the data-independent array output power across a wide range of frequencies. Another difference is that the constraint used in the Capon beamformer is a hard constraint, whereas the array gain used in the proposed algorithm is a soft constraint, which can result in a higher degree of flexibility.

The proposed algorithm is expected to have lower computational complexity compared to the UCCA beamformer. The latter is designed to achieve an FI beampattern for all angles, whereas the proposed algorithm focuses only on a specified angle. For the same reason, the proposed algorithm is expected to have a larger degree of freedom too. This explains the result of having a better FI beampattern for a given number of sensors. These performance improvements have been supported by computer simulations and will be discussed in the later part of this paper.

Figure 2: The system structure of a uniform circular array beamformer. The K sampled signals x_0[n], ..., x_{K−1}[n] pass through an IDFT to form the phase modes p_{−L}[n], ..., p_L[n]; each mode is filtered by b_m[n], weighted by h_m, and summed to produce y[n].
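To make the processing chain of (4), (5), and (10) concrete, the following Python sketch evaluates the phase-mode beamformer response numerically. It is illustrative only: the weights and FIR taps are random placeholders (not an optimized design), and the value used for ϱ is an assumed example.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind, J_m

# Hypothetical parameters (placeholders, not the paper's optimized design).
K = 20                                  # microphones on the circle
M = 17                                  # number of phase modes (odd)
L = (M - 1) // 2
N_TAPS = 17                             # FIR length per mode (order 16)
VARRHO = 2.0                            # assumed fs / fmax ratio
RBAR = 1.0 / (4.0 * np.sin(np.pi / K))  # normalized radius, from (1)

rng = np.random.default_rng(0)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # spatial weights h_m
b = rng.standard_normal((M, N_TAPS))                      # FIR taps b_m[n]

def response(omega, phi):
    """Spatial response G(omega, phi) of (10): sum over m = -L..L of
    h_m * exp(j*m*phi) * K * j^m * J_m(omega*varrho*rbar) * B_m(omega)."""
    G = 0.0 + 0.0j
    for idx, m in enumerate(range(-L, L + 1)):
        # B_m(omega): DTFT of the m-th mode filter, as used later in (13).
        Bm = np.sum(b[idx] * np.exp(-1j * omega * np.arange(N_TAPS)))
        G += (h[idx] * np.exp(1j * m * phi) * K * (1j ** m)
              * jv(m, omega * VARRHO * RBAR) * Bm)
    return G

# Superimpose the pattern at 10 uniformly spaced frequencies, as in Figure 3.
omegas = np.linspace(0.3 * np.pi, 0.95 * np.pi, 10)
phis = np.linspace(-np.pi, np.pi, 361)
pattern = np.array([[abs(response(w, p)) for p in phis] for w in omegas])
print(pattern.shape)  # (10, 361)
```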

The optimization problem defined by (10) and (11) requires the optimum values of both the compensation filters and the spatial weightings to be determined simultaneously. As such, Cholesky factorization is used to transform the objective function into a Second-Order Cone Programming (SOCP) problem. The details of the implementation will be discussed in the following section. It should be noted that when the threshold value δ equals zero, the optimization process becomes a linearly constrained problem.

4. Convex Optimization-Based Implementation

Second-Order Cone Programming (SOCP) is a popular tool for solving convex optimization problems, and it has been used for array pattern synthesis problems [20–22] since the early papers by Lobo et al. [23]. One advantage of SOCP is that the global optimal solution is guaranteed if it exists, whereas a constrained least-squares optimization procedure looks for a local minimum. Another important advantage is that it is very convenient to include additional linear or convex quadratic constraints, such as a norm constraint on the variable vector, in the problem formulation. The standard form of SOCP can be written as follows:

\[ \min\ \mathbf{b}^T \mathbf{x}, \quad \text{s.t. } \mathbf{d}_i^T \mathbf{x} + q_i \ge \left\| \mathbf{A}_i \mathbf{x} + \mathbf{c}_i \right\|_2, \quad i = 1, \ldots, N, \tag{12} \]

where x ∈ R^m is the variable vector; the parameters are b ∈ R^m, A_i ∈ R^{(n_i−1)×m}, c_i ∈ R^{n_i−1}, d_i ∈ R^m, and q_i ∈ R. The norm appearing in the constraints is the standard Euclidean norm, that is, ‖u‖_2 = (u^T u)^{1/2}.

4.1. Convex Optimization of the Beampattern Synthesis Problem. The following transformations are carried out to convert (11) into the standard form defined by (12). First, B_m(ω) = Σ_{n=0}^{N_m} b_m[n] e^{−jnω} is substituted into (10), where N_m is the filter order for each phase mode. The spatial response of the beamformer can now be expressed as

\[ G(\omega,\phi) = \sum_{m=-L}^{L} h_m\, e^{jm\phi}\, K\, j^m\, J_m(\omega\varrho\bar{r}) \sum_{n=0}^{N_m} b_m[n]\, e^{-jn\omega}. \tag{13} \]

Using the identity e^{−jnω} = cos(nω) − j sin(nω), (13) becomes

\[ G(\omega,\phi) = K \sum_{m=-L}^{L} h_m\, e^{jm\phi}\, j^m\, J_m(\omega\varrho\bar{r}) \left( \mathbf{c}_m \mathbf{b}_m - j\, \mathbf{s}_m \mathbf{b}_m \right), \tag{14} \]

where b_m = [b_m[0], b_m[1], ..., b_m[N_m]]^T, c_m = [cos(0), cos(ω), ..., cos(N_m ω)], and s_m = [sin(0), sin(ω), ..., sin(N_m ω)].

Here h_m is the spatial weighting in the system structure, and b_m is the FIR filter coefficient vector for each phase mode. Letting u_m = h_m · j^m · b_m, we have

\[ G(\omega,\phi) = K \sum_{m=-L}^{L} e^{jm\phi} J_m(\omega\varrho\bar{r})\, \mathbf{c}_m \mathbf{u}_m - j K \sum_{m=-L}^{L} e^{jm\phi} J_m(\omega\varrho\bar{r})\, \mathbf{s}_m \mathbf{u}_m = \mathbf{c}(\omega,\phi)\, \mathbf{u} - j\, \mathbf{s}(\omega,\phi)\, \mathbf{u}, \tag{15} \]

where c(ω,φ) = [K e^{j(−L)φ} J_{−L}(ωϱr̄) c_{−L}, ..., K e^{j(L)φ} J_L(ωϱr̄) c_L]; u = [u_{−L}^T, u_{−L+1}^T, ..., u_L^T]^T; and s(ω,φ) = [K e^{j(−L)φ} J_{−L}(ωϱr̄) s_{−L}, ..., K e^{j(L)φ} J_L(ωϱr̄) s_L].
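The stacked form in (15) makes the response linear in the coefficient vector u, so the design data can be precomputed. A minimal sketch under the same hypothetical parameters as above: it builds a row vector a with G(ω,φ) = a·u following (15), and approximates the integral defining the matrix R (used next, in (17)) by a Riemann sum on a grid; the grid densities are arbitrary choices of ours.

```python
import numpy as np
from scipy.special import jv

K, M = 20, 17
L = (M - 1) // 2
N_TAPS = 17
VARRHO = 2.0
RBAR = 1.0 / (4.0 * np.sin(np.pi / K))
DIM = M * N_TAPS                        # length of u = [u_-L; ...; u_L]

def steering_row(omega, phi):
    """Complex row vector a with G(omega, phi) = a @ u, following (15):
    the (m, n) entry is K * exp(j*m*phi) * J_m(omega*varrho*rbar) * exp(-j*n*omega)."""
    row = np.zeros(DIM, dtype=complex)
    n = np.arange(N_TAPS)
    for idx, m in enumerate(range(-L, L + 1)):
        row[idx * N_TAPS:(idx + 1) * N_TAPS] = (
            K * np.exp(1j * m * phi) * jv(m, omega * VARRHO * RBAR)
            * np.exp(-1j * n * omega))
    return row

# R = integral over omega and phi of a^H a, approximated on a coarse grid.
omegas = np.linspace(0.05 * np.pi, np.pi, 40)
phis = np.linspace(-np.pi, np.pi, 72, endpoint=False)
R = np.zeros((DIM, DIM), dtype=complex)
for w in omegas:
    for p in phis:
        a = steering_row(w, p)
        R += np.outer(a.conj(), a)
R *= (omegas[1] - omegas[0]) * (2.0 * np.pi / len(phis))  # grid cell area
# Cholesky factor with R = D^H D (a small ridge guards against rank deficiency).
D = np.linalg.cholesky(R + 1e-9 * np.eye(DIM)).conj().T
```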

Representing the complex spatial response G(ω,φ) by a two-dimensional vector g(ω,φ), which displays the real and imaginary parts in separate rows, (15) is rewritten in the following form:

\[ \mathbf{g}(\omega,\phi) = \begin{pmatrix} \mathbf{c}(\omega,\phi) \\ -\mathbf{s}(\omega,\phi) \end{pmatrix} \mathbf{u} = \mathbf{A}(\omega,\phi)^H \mathbf{u}. \tag{16} \]

Hence, ‖G(ω,φ)‖² = g^H g = (A(ω,φ)^H u)^H (A(ω,φ)^H u) = u^H A(ω,φ) A(ω,φ)^H u.

The objective function and the constraint inequality defined in (11) can now be written as

\[ \min_{\mathbf{u}}\ \mathbf{u}^H \mathbf{R}\, \mathbf{u}, \quad \text{s.t. } \left| G(\omega,\phi_0) - 1 \right| \le \delta, \ \text{for } \omega \in [\omega_l, \omega_u], \tag{17} \]

where R = ∫_ω ∫_φ A(ω,φ) A(ω,φ)^H dω dφ.

In order to transform (17) into the SOCP form defined by (12), the cost function must be a linear equation. Since the matrix R is Hermitian and positive definite, it can be decomposed into an upper triangular matrix and its transpose using Cholesky factorization, that is, R = D^H D, where D is the Cholesky factor of R. Substituting this into (17), we have

\[ \mathbf{u}^H \mathbf{R}\, \mathbf{u} = \mathbf{u}^H \left( \mathbf{D}^H \mathbf{D} \right) \mathbf{u} = (\mathbf{D}\mathbf{u})^H (\mathbf{D}\mathbf{u}). \tag{18} \]

This further simplifies (17) into the following form:

\[ \min_{\mathbf{u}}\ \|\mathbf{d}\|^2, \quad \text{s.t. } \|\mathbf{d}\|^2 = \|\mathbf{D}\mathbf{u}\|^2, \quad \left| G(\omega,\phi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u]. \tag{19} \]

Denoting by t the maximum norm of the vector Du subject to various choices of u, (19) reduces to

\[ \min_{\mathbf{u}}\ t, \quad \text{s.t. } \|\mathbf{D}\mathbf{u}\| \le t, \quad \left| G(\omega,\phi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u]. \tag{20} \]

It should be noted that (20) contains I different constraints, where I uniformly divides the frequency range spanned by ω.

Lastly, in order to solve (20) with an SOCP toolbox, we stack t and the coefficients of u together and define y = [t; u]. Let a = [1, 0]^T, so that t = a^T y. As a result, the objective function and the constraint defined in (11) can be expressed as

\[ \min_{\mathbf{y}}\ \mathbf{a}^T \mathbf{y}, \quad \text{s.t. } \left\| \begin{bmatrix} \mathbf{0} & \mathbf{D} \end{bmatrix} \mathbf{y} \right\| \le \mathbf{a}^T \mathbf{y}, \quad \left\| \begin{bmatrix} \mathbf{0} & \mathbf{A}(\omega,\phi_0)^H \end{bmatrix} \mathbf{y} - \begin{pmatrix} 1 \\ \mathbf{0} \end{pmatrix} \right\| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u], \tag{21} \]

where 0 is the zero matrix with its dimension determined from the context. Equation (21) can now be solved with great efficiency using a convex optimization toolbox such as SeDuMi [24].

4.2. Computational Complexity. When the Interior-Point Method (IPM) is used to solve the SOCP problem defined in (21), the number of iterations needed is bounded by O(√N), where N is the number of constraints. The amount of computation per iteration is O(n² Σ_i n_i) [23].

The bulk of the computational requirement of the broadband array pattern synthesis comes from the optimization process. The computational complexity of the optimization process of the proposed algorithm and that of the UCCA algorithm have been calculated and are listed in Table 1. It can be seen from Table 1 that the proposed algorithm requires a similar amount of computation per iteration but a much smaller number of iterations compared to the UCCA algorithm. The overall computational load of the proposed method is therefore much smaller than that needed by the UCCA algorithm. It should be noted that, as the coefficients are optimized in the phase modes, the comparative computational load presented above is calculated based on the number of phase modes and not the number of sensors. Nevertheless, the larger the number of sensors, the larger the number of phase modes too.
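Problem (20)-(21), derived above, maps directly onto a generic conic modeling tool. The cvxpy formulation below is our reading of the design problem, not the authors' code; it reuses steering_row, D, and DIM from the previous sketch, and the band edges and δ follow the paper's example values. Any SOCP-capable solver can stand in for SeDuMi here.

```python
import numpy as np
import cvxpy as cp

DELTA, PHI0 = 0.1, 0.0
omega_band = np.linspace(0.3 * np.pi, 0.95 * np.pi, 16)  # I = 16 constraint points

u = cp.Variable(DIM, complex=True)           # stacked u_m = h_m * j^m * b_m
objective = cp.Minimize(cp.norm(D @ u, 2))   # minimizes u^H R u, via (18)
constraints = [cp.abs(steering_row(w, PHI0) @ u - 1) <= DELTA for w in omega_band]

prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.status, prob.value)
```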

Table 1: Computational complexity of different broadband beampattern synthesis methods.

Method | Number of iterations | Amount of computation per iteration
UCCA | O{√(I × M)} | O{[(1 + P(1 + N_m))²][2M(I + 1)]}
Equation (11) | O{√(1 + I)} | O{[M(N_m + 1)²][2I + M(N_m + 1) + 1]}

5. Numerical Results

In this numerical study, the performance of the proposed beamformer is compared with that of the UCCA beamformer [14] and Yan's beamformer [25] for the specified frequency region. The evaluation metric used to quantify the frequency-invariance (FI) characteristic is the mean squared error of the array gain variation at the specified direction. The sensitivity of the proposed algorithm is also evaluated for different numbers of sensors and different threshold values set for magnitude control of the ripples of the main beam.

A uniform circular array consisting of 20 sensors is considered. All the sensors are assumed perfectly calibrated. The number of phase modes M is set to 17, and thus there are 17 spatial weighting coefficients. The order of the compensation filter is set to 16 for all the phase modes. The frequency region of interest is specified to be from 0.3π to 0.95π. The threshold value δ, which controls the magnitude of the ripples of the main beam, is set to 0.1. The specified direction is set to 0°, where the reference microphone is located.

There are several optimization criteria presented in [25]. The one chosen for comparison is the peak-sidelobe-constrained minimax mainlobe spatial response variation (MSRV) design. Its objective is to minimize the maximum MSRV subject to a peak sidelobe constraint. The mathematical expression is as follows:

\[ \min_{\mathbf{h}}\ \sigma, \quad \text{s.t. } \begin{cases} \mathbf{u}^T(f_0,\phi_0)\,\mathbf{h} = 1, \\ \left| \left[ \mathbf{u}(f_k,\theta_q) - \mathbf{u}(f_0,\theta_q) \right]^T \mathbf{h} \right| \le \sigma, \\ \left| \mathbf{u}^T(f_k,\theta_s)\,\mathbf{h} \right| \le \varepsilon, \\ f_k \in [f_l, f_u],\ \theta_q \in \Theta_{ML},\ \theta_s \in \Theta_{SL}, \end{cases} \tag{22} \]

where f_0 is the reference frequency, chosen to have the value of f_l, and h is the beamformer weighting vector to be optimized. ε is the peak sidelobe constraint and is set to 0.036. Θ_ML and Θ_SL represent the mainlobe and sidelobe regions, respectively.

The beampattern obtained for the proposed beamformer over the frequency region of interest is shown in Figure 3. The spatial response of the proposed beamformer at 10 uniformly spaced discrete frequencies is superimposed. It can be seen that the proposed beamformer has an approximately constant gain within the frequency region of interest in the specified direction (0°). As the direction deviates from 0°, the FI property becomes poorer. The peak sidelobe level has a value of −8 dB.

Figure 3: The normalized spatial response of the proposed beamformer for ω = [0.3π, 0.95π].

The beampattern of the UCCA beamformer is shown in Figure 4. As the proposed algorithm is based on a circular array, only one layer of the UCCA concentric array is used for the numerical study. All other parameter settings remain the same as those used for the proposed algorithm. As shown in the figure, the beampattern of the UCCA beamformer is not as constant as that of the proposed beamformer in the specified direction (0°). The peak sidelobe level, which has a value of −6 dB, is also higher compared to the proposed beamformer.

Figure 4: The normalized spatial response of the UCCA beamformer for ω = [0.3π, 0.95π].

The beampattern of Yan's beamformer is shown in Figure 5. Its frequency invariant characteristic is poorer at the desired direction. However, it has the lowest sidelobe level of all. From this comparison, we find that, having processed the signal in phase modes, the frequency range over which the beamformer achieves Frequency Invariant (FI) characteristics is wider.

Figure 5: The normalized spatial response of Yan's beamformer for ω = [0.3π, 0.95π].

The mean squared errors of the spatial response gain in the specified direction and across different frequencies for the different methods are shown in Figure 6. It is seen that the proposed beamformer outperforms both the UCCA beamformer and Yan's beamformer in achieving the FI characteristic at the desired direction. Table 2 tabulates the array gain at each frequency along the desired direction for these three methods.

Figure 6: Comparison of the FI characteristic between the proposed beamformer, the UCCA beamformer, and Yan's beamformer at 0 degrees for ω = [0.3π, 0.95π].

Table 2: Comparison of array gain at each frequency along the desired direction for the three methods.

Normalized frequency (radians/sample) | Proposed beamformer (dB) | Yan's beamformer (dB) | UCCA beamformer (dB)
0.3 | −0.0007 | 0 | 0.6761
0.4 | −0.0248 | −0.8230 | 0.1760
0.5 | 0.0044 | −1.3292 | −0.022
0.6 | −0.0097 | −1.6253 | −0.2186
0.7 | −0.0046 | −1.8789 | −0.6301
0.8 | 0.0085 | −2.9498 | −0.1291
0.9 | −0.0033 | −6.2886 | 0.1477

Furthermore, the performance of the frequency invariant beampattern obtained by the proposed method is assessed by evaluating the directivity and the white-noise gain over the entire frequency band considered, as shown in Figures 7 and 8, respectively. Directivity describes the ability of the array to suppress a diffuse noise field, while white noise gain shows the ability of the array to suppress spatially uncorrelated noise, which can be caused by self-noise of the sensors. Because our array is a circular array, the directivity D(ω) is calculated using the following equation:

\[ D(\omega) = \frac{\left| \sum_{m=-L}^{L} B_m(\omega) \right|^2}{\sum_{m=-L}^{L}\sum_{n=-L}^{L} B_m(\omega)\,\overline{B_n(\omega)}\,\operatorname{sinc}\!\left[ (m-n)\,2\pi\omega r/c \right]}, \tag{23} \]

where B_m(ω) is the frequency response of the FIR filter at the mth phase mode and r is the radius of the circle.

As shown in the figures, the directivity has a constant profile, with an average value equal to 13.1755 dB. The white noise gain ranges from 5.5 dB to 11.3 dB. These positive values represent an attenuation of the self-noise of the microphones. As expected, the lower the frequency, the smaller the white noise gain, and the higher the sensitivity to array imperfections. Hence, the proposed beamformer is more sensitive to array imperfections at low frequencies and is most robust to array imperfections at the normalized frequency 0.75π.

Figure 7: Directivity versus frequency for the broadband beam pattern shown in Figure 3.

Figure 8: White noise gain versus frequency for the broadband beam pattern shown in Figure 3.

Figure 9: The normalized spatial response of the proposed FI beamformer for 10 microphones.

Figure 10: The normalized spatial response of the UCCA beamformer for 10 microphones.

Figure 11: The normalized spatial response of Yan's beamformer for 10 microphones.

Figure 12: The normalized spatial response of the proposed FI beamformer for 8 microphones.

5.1. Sensitivity Study—Number of Sensors. Most FI beamformers reported in the literature employ a large number of sensors. In this study, the number of sensors used is reduced from 20 to 10 and 8, and the performances of the proposed FI beamformer, the UCCA beamformer, and Yan's beamformer are compared. The results are shown in Figures 9, 10, 11, 12, 13, and 14. As seen from the simulations, when 10 microphones are employed, the proposed algorithm achieves the best FI performance in the mainlobe region, with a sidelobe level of −8 dB. For the UCCA method and Yan's method, the frequency invariant characteristics are not promising at the desired direction, and higher sidelobes are obtained. When the number of microphones is further reduced to 8, our proposed method is still able to produce a reasonable FI beampattern, whereas the FI property of the beampattern of the UCCA algorithm becomes much poorer in the specified direction.

5.2. Sensitivity Study—Different Threshold Value δ. In the proposed algorithm, δ is a parameter created to define the allowed ripples in the magnitude of the main-beam spatial gain response. In this section, different values of δ are used to study the sensitivity of the performance of the proposed algorithm to this parameter value. Three values, namely, δ = [0.001, 0.01, 0.1], are selected, and the results obtained are shown in Figures 15, 16, and 17, respectively. The specified frequency region of interest remains the same. Figure 18 shows the mean squared error of the array gain at the specified direction (0°) for the three different δ values studied.

As shown in the figures, as the value of δ decreases, the FI performance at the specified direction improves. The results also show that the improvement in the FI performance in the specified direction is achieved with an increase in the peak sidelobe level and a poorer FI beampattern in the other directions in the main beam. For example, when the value of δ is 0.001, the peak sidelobe of the spatial response is as high as −5 dB and the beampatterns do not overlap well in the main beam. As δ increases to 0.1, the peak sidelobe of the spatial response is approximately −10 dB (lower), and the beampatterns in the main beam are observed to have relatively good FI characteristics.

Figure 13: The normalized spatial response of the UCCA beamformer for 8 microphones.

Figure 14: The normalized spatial response of Yan's beamformer for 8 microphones.

Figure 16: The normalized spatial response of the proposed beamformer for δ = 0.01.

Figure 17: The normalized spatial response of the proposed beamformer for δ = 0.1.

Figure 15: The normalized spatial response of the proposed beamformer for δ = 0.001.

6. Conclusion

A selective frequency invariant uniform circular broadband beamformer is presented in this paper. Besides providing the details of a recent conference paper presented by the authors, a complexity analysis and two sensitivity studies on the proposed algorithm are also presented. The proposed algorithm is designed to minimize an objective function of the spatial response gain with a constraint on the gain being smaller than a predefined threshold value across a specified frequency range and in a specified direction. The problem is formulated as a convex

optimization problem, and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique. The complexity analysis shows that the proposed algorithm has a lower computational requirement compared to that of the UCCA algorithm for the problem defined. Numerical results show that the proposed algorithm is able to achieve a more FI beampattern and a smaller mean square error of the spatial response gain in the specified direction across the specified FI region compared to the UCCA algorithm.

Figure 18: Comparison of the FI characteristic of the proposed beamformer for δ = 0.001, 0.01, and 0.1 at 0 degrees for ω = [0.3π, 0.95π].

Acknowledgments

The authors would like to acknowledge the helpful discussions given by H. H. Chen of the University of Hong Kong on the UCCA algorithm. The authors would also like to thank STMicroelectronics (Singapore) for the sponsorship of this project. Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this manuscript.

References

[1] H. Krim and M. Viberg, "Two decades of array signal processing research: the parametric approach," IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67–94, 1996.
[2] D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[3] R. A. Monzingo and T. W. Miller, Introduction to Adaptive Arrays, John Wiley & Sons, SciTech, New York, NY, USA, 2004.
[4] B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial filtering," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '88), April 1988.
[5] O. L. Frost III, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.
[6] W. Liu, D. McLernon, and M. Ghogho, "Frequency invariant beamforming without tapped delay-lines," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), vol. 2, pp. 997–1000, Honolulu, Hawaii, USA, April 2007.
[7] M. Ghavami, "Wideband smart antenna theory using rectangular array structures," IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2143–2151, 2002.
[8] T. Chou, "Frequency-independent beamformer with low response error," in Proceedings of the 20th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), vol. 5, pp. 2995–2998, Detroit, Mich, USA, May 1995.
[9] D. B. Ward, R. A. Kennedy, and R. C. Williamson, "FIR filter design for frequency invariant beamformers," IEEE Signal Processing Letters, vol. 3, no. 3, pp. 69–71, 1996.
[10] A. Trucco and S. Repetto, "Frequency invariant beamforming in very short arrays," in Proceedings of the MTS/IEEE Techno-Ocean (Oceans '04), vol. 2, pp. 635–640, November 2004.
[11] A. Trucco, M. Crocco, and S. Repetto, "A stochastic approach to the synthesis of a robust frequency-invariant filter-and-sum beamformer," IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 4, pp. 1407–1415, 2006.
[12] S. Doclo and M. Moonen, "Design of far-field and near-field broadband beamformers using eigenfilters," Signal Processing, vol. 83, no. 12, pp. 2641–2673, 2003.
[13] L. C. Parra, "Least squares frequency-invariant beamforming," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 102–105, New Paltz, NY, USA, October 2005.
[14] S. C. Chan and H. H. Chen, "Uniform concentric circular arrays with frequency-invariant characteristics—theory, design, adaptive beamforming and DOA estimation," IEEE Transactions on Signal Processing, vol. 55, no. 1, pp. 165–177, 2007.
[15] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, "Uniform circular broadband beamformer with selective frequency and spatial invariant region," in Proceedings of the 1st International Conference on Signal Processing and Communication Systems (ICSPCS '07), Gold Coast, Australia, December 2007.
[16] W. Ser, T. T. Zhang, J. Yu, and J. Zhang, "Detection of wheezes using a wearable distributed array of microphones," in Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks (BSN '09), pp. 296–300, Berkeley, Calif, USA, June 2009.
[17] D. E. N. Davies, "Circular arrays," in Handbook of Antenna Design, Peregrinus, London, UK, 1983.
[18] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover, New York, NY, USA, 1965.
[19] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[20] F. Wang, V. Balakrishnan, P. Y. Zhou, J. J. Chen, R. Yang, and C. Frank, "Optimal array pattern synthesis using semidefinite programming," IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1172–1183, 2003.
[21] J. Liu, A. B. Gershman, Z.-Q. Luo, and K. M. Wong, "Adaptive beamforming with sidelobe control: a second-order cone programming approach," IEEE Signal Processing Letters, vol. 10, no. 11, pp. 331–334, 2003.
[22] S. Autrey, "Design of arrays to achieve specified spatial characteristics over broadbands," in Signal Processing, J. W. R.

Griffiths, Ed., pp. 507–524, Academic Press, New York, NY, USA, 1973.
[23] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, "Applications of second-order cone programming," Linear Algebra and Its Applications, vol. 284, no. 1–3, pp. 193–228, 1998.
[24] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1–4, pp. 625–653, 1999.
[25] S. Yan, Y. Ma, and C. Hou, "Optimal array pattern synthesis for broadband arrays," Journal of the Acoustical Society of America, vol. 122, no. 5, pp. 2686–2696, 2007.

Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2010, Article ID 230864, 16 pages, doi:10.1155/2010/230864

Research Article

First-Order Adaptive Azimuthal Null-Steering for the Suppression of Two Directional Interferers

René M. M. Derkx

Digital Signal Processing Group, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands

Correspondence should be addressed to René M. M. Derkx, [email protected]

Received 21 July 2009; Revised 10 November 2009; Accepted 15 December 2009

Academic Editor: Simon Doclo

Copyright © 2010 Rene´ M. M. Derkx. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An azimuth steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although the response of a (rotation symmetric) first-order response can only exhibit a single null, we will look at a slice through this beampattern lying in the azimuthal plane. In this way, we can define maximally two nulls in the azimuthal plane which are symmetric with respect to the main-lobe axis. By placing these two nulls on maximally two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectively reject these directional sources without attenuating the desired source. We present an adaptive null-steering scheme for adjusting the beampattern so as to obtain this suppression of the two directional interferers automatically. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.

1. Introduction

In applications such as hands-free communication and voice control systems, the microphone signal does not only contain the desired sound source (e.g., a speech signal) but can also contain undesired directional interferers and background noise (e.g., diffuse noise). To reduce the amount of noise and minimize the influence of interferers, we can use a microphone array and apply beamforming techniques to steer the main-lobe of a beam towards the desired source signal, for example, a speech signal. In this paper, we focus on arrays where the wavelength of the sound is much larger than the size of the array. These arrays are therefore called "Small Microphone Arrays." When using omnidirectional (monopole) microphones in a small microphone array configuration, additive beamformers like delay-and-sum are not able to obtain a sufficient directivity, as the beamwidth deteriorates for larger wavelengths [1, 2]. A common method to obtain improved directivity is to apply superdirective beamforming techniques. In this paper, we will focus on first-order superdirective beamforming. (The term "first-order" is used to indicate that the directivity pattern of the superdirectional response is constructed by means of a linear combination of a pressure and a velocity (first-order spatial derivative of the pressure field) response.)

A first method to obtain this first-order superdirectivity is to use microphone arrays with omnidirectional microphone elements and to apply beamforming techniques with asymmetrical filter coefficients [3]. Basically, this asymmetrical filtering corresponds to subtraction of signals, as in delay-and-subtract techniques [4, 5], or to taking spatial derivatives of the sound pressure field [6, 7]. As subtraction leads to smaller signals for low frequencies, a first-order integrator needs to be applied to equalize the frequency response, resulting in an increased sensitivity (20 dB/decade) to sensor noise and an increased sensitivity to mismatches in microphone characteristics [8, 9] in the lower frequency range.

A second method to obtain first-order superdirectivity is to use microphone arrays with first-order unidirectional microphone elements. As the separate uni-directional microphone elements already have a first-order superdirective response, consisting of a sum of a pressure and a velocity response, the beamformer can simply be constructed

y

M1

z

φ y (0, 0) x M1

M0 θ M 0 x M2

M2 Figure 1: Circular array geometry with three cardioid microphones.

by a linear combination of the uni-directional microphone experiments in Section 6. Finally, in Section 7, conclusions signals. In such an approach, there is no need to apply a first- are given. order integrator (as was the case for omni-directional micro- phone elements), and we avoid a 20 dB/decade increased 2. Construction of Eigenbeams sensitivity for sensor-noise [7]. Nevertheless, uni-directional microphones may have a low-frequency roll-off,which We know from [7, 9] that by using a circular array of at least can be compensated for by means of proper equalization three (omni- or uni-directional microphone) sensors in a techniques. Throughout this paper, we will assume that the planar geometry and applying signal processing techniques, uni-directional microphones have a flat frequency response. it is possible to construct a first-order superdirectional We focus on the construction of first-order superdirec- response. This superdirectional response can be steered tional beampatterns where the nulls of the beampattern are with its main-lobe to any desired azimuthal angle and steered to the directional interferers, while having a unity can be adjusted to have any first-order directivity pattern. response in the direction of the desired sound-source. In As mentioned in the introduction, we will use three uni- Section 2, we construct a monopole and two orthogonal directional cardioid microphones (with a heart-shaped dipole responses (known as “eigen-beams” [10, 11]) out directional pattern) in a circular configuration, where the of a circular array of three first-order cardioid microphone main-lobes of the three cardioid responses are pointed elements M0, M1,andM2 (with a heart-shaped directional outwards, as shown in Figure 1. pattern), as shown in Figure 1.Hereθ and φ are the standard The responses of the three cardioid microphones M0, M1, 0 1 spherical coordinate angles: elevation and azimuth. and M2 are given by, respectively, Ec (r, θ, φ), Ec (r, θ, φ), and 2 = Based on these eigenbeams, we are able to construct Ec (r, θ, φ), having their main-lobes at, respectively, φ 0, arbitrary first-order responses that can be steered with 2π/3, and 4π/3 radians. Assuming that we have no sensor- the main-lobe in any azimuthal direction (see Section 2). noise, the nth cardioid microphone response, with n = Although the (rotation symmetric) first-order response can 0, 1, 2, for a harmonic plane-wave with frequency f is ideally only exhibit a single null, we will look at a slice through given by [11] the beampattern lying in the azimuthal plane. In this En r, θ, φ = A e jψn . (1) way, we can define maximally two nulls in the azimuthal c n plane which are symmetric with respect to the main-lobe The magnitude-response An and phase-response ψn of the axis. By placing these two nulls on the two directional nth cardioid microphone are given by, respectively: sources to be rejected and compensating for the drop in 1 1 2nπ ff A = + cos φ − sin θ,(2) level for the desired direction, we can e ectively reject the n 2 2 3 directional sources without attenuating the desired source. In Section 3 expressions are derived for this beampattern 2πf ψ = sin θ x cos φ + y sin φ . (3) synthesis. n c n n To develop an adaptive null-steering algorithm, we first Here c is the speed of sound and xn and yn are the x and y show in Section 4 how the superdirective beampattern can coordinates of the nth microphone (as shown in Figure 1), be synthesized via the Generalized Sidelobe Canceller (GSC) given by [12]. 
This GSC enables us to optimize a cost-function in 2nπ an unconstrained manner with a gradient descent search- x = r cos φ − , n 3 method that is described in Section 5. Furthermore, the GSC (4) enables tracking of the angles of the separate directional 2nπ y = r sin φ − , interferers, which is validated by means of simulations and n 3 EURASIP Journal on Advances in Signal Processing 3

1 1 1

0.5 0.5 0.5

0 0 0

−0.5 −0.5 −0.5

−1 −1 −1 1 1 1 0.5 1 0.5 1 0.5 1 0 0 0 0 0 0 −0.5 −0.5 −0.5 −1 −1 −1 −1 −1 −1 0 π/2 (a) Em(θ, φ) (b) Ed(θ, φ) (c) Ed (θ, φ) Figure 2: Eigenbeams (monopole and two orthogonal dipoles). with r being the radius of the circle on which the micro- spatial aliasing effects will occur) , that is, r  c/ f , the phones are located. phase-component ψn,givenby(5) can be neglected and the We can simplify (3)as responses of the eigenbeams for these frequencies are equal to 2πf 2nπ ψn = r sin θ cos . (5) Em =,1 c 3 0 = From the three cardioid microphone responses, we Ed θ, φ cos φ sin θ, (10) can construct the circular harmonics [7], also known as π Eπ/2 θ, φ = cos φ − sin θ. “eigenbeams” [10, 11]), by using the 3-point Discrete Fourier d 2 Transform (DFT) with the three microphones as inputs. This The directivity patterns of these eigenbeams are shown in DFT produces three phase-modes Pi(r, θ, φ)[7]withi = 1, 2, 3: Figure 2. The zeroth-order eigenbeam Em represents the monopole 2 response, while the first-order eigenbeams E0(θ, φ)and 1 n d P0 r, θ, φ = E r, θ, φ , π/2 c Ed (θ, φ) represent the orthogonal dipole responses. 3 n=0 The dipole can be steered to any angle ϕs by means of a ∗ P r, θ, φ = P r, θ, φ (6) weighted combination of the orthogonal dipole pair: 1 2 ϕs 0 π/2 E θ, φ = cos ϕsE θ, φ +sinϕsE θ, φ , (11) 1 2 d d d = En r, θ, φ e− j 2πn/3, c with 0 ≤ ϕs ≤ 2π being the steering angle. 3 = n 0 Finally, the steered and scaled superdirectional micro- √ with j = −1and∗ being the complex-conjugate operator. phone response can be constructed via   Via the phase-modes, we can construct the monopole as ϕs E θ, φ = S αE + (1 − α)E θ, φ m d (12) Em r, θ, φ = 2P0 r, θ, φ , (7) = S α + (1 − α) cos φ − ϕs sin θ , and the orthogonal dipoles as with α ≤ 1 being the parameter for controlling the directional pattern of the first-order response and S being an 0 = Ed r, θ, φ 2 P1 r, θ, φ + P2 r, θ, φ , arbitrary scaling factor. Both parameters α and S may also (8) have negative values. π/2 = − Ed r, θ, φ 2j P1 r, θ, φ P2 r, θ, φ . Alternatively, we can write the construction of the response in matrix-vector notation: In matrix notation ⎡ ⎤ ⎡ ⎤⎡ ⎤ = T 0 E θ, φ SFα Rϕs X, (13) ⎢ Em ⎥ ⎢11 1⎥⎢Ec ⎥ ⎢ ⎥ 2 ⎢ ⎥⎢ ⎥ with the pattern-synthesis vector: ⎢ E0 ⎥ = ⎢2 −1 −1 ⎥⎢E1⎥. (9) ⎣ d ⎦ 3 ⎣ √ √ ⎦⎣ c ⎦ ⎡ ⎤ Eπ/2 0 3 − 3 E2 ⎢ α ⎥ d c ⎢ ⎥ − Fα = ⎢(1 α)⎥, (14) For frequencies with wavelengths larger than the size of ⎣ ⎦ the array (for wavelengths smaller than the size of the array, 0 4 EURASIP Journal on Advances in Signal Processing

  the rotation-matrix Rϕs : Solving the two unknowns α and ϕs gives ⎡ ⎤ ϕ = 2 arctan X, (21) ⎢10 0⎥ s ⎢ ⎥ Rϕ = ⎢0cosϕs sin ϕs ⎥, (15) s ⎣ ⎦  = sin Δϕn X − α − − , 0 sin ϕs cos ϕs cos ϕn1 cos ϕn2 + X sin ϕn1 sin ϕn2 +sin Δϕn (22) and the input-vector: ⎡ ⎤ ⎡ ⎤ with Em 1  ⎢ ⎥ ⎢ ⎥ ⎢ 0 ⎥ ⎢ ⎥ − ± − = ⎢ ⎥ = ⎢ ⎥ sin ϕn1 sin ϕn2 2 2cos Δϕn X ⎣ Ed θ, φ ⎦ ⎣cos φ sin θ⎦. (16) = (23) X − , π/2 cos ϕn1 cos ϕn2 Ed θ, φ sin φ sin θ Δϕ = ϕ − ϕ . (24) In the remainder of this paper, we will assume that we n n1 n2 have unity response of the superdirectional microphone for It is noted that (23) can have two solutions, leading to a desired source coming from an arbitrary azimuthal angle ff    = = di erent solutions for ϕs, α,andS. However, the resulting φ ϕs and for an elevation angle θ π/2 and we want to beampatterns are identical. suppress two interferers by steering two nulls towards two As can be seen we get a vanishing denominator in (22) azimuthal angles φ = ϕ and φ = ϕ , also for an elevation n1 n2 for ϕ = ϕ and/or ϕ = ϕ . Similarly, this is the case when angle = 2. Hence, we assume = 2 in the remainder n1 s n2 s θ π/ θ π/ Δϕ = ϕ − ϕ goes to zero. For this latter case, we can of this paper. n n1 n2 compute the limit of ϕs and α:  

3. Optimal Null-Steering for Two Directional sin ϕni lim ϕs = 2 arctan = ϕn + π, (25) → 0 − i Interferers via Direct Pattern Synthesis Δϕn cos ϕni 1 3.1. Pattern Synthesis. The first-order response of (12), with with i = 1, 2 and the main-lobe of the response steered to ϕ , has two nulls for s 1 α ≤ 1/2, given by (see [13]) lim α = , → (26) Δϕn 0 2 − = ± α ϕn1 , ϕn2 ϕs arccos . (17) where Δϕ = ϕ − ϕ . 1 − α n n1 n2 For the case Δϕn = 0, we actually steer a single null

If we want to steer two nulls to arbitrary angles ϕn1 and towards the two directional interferers ϕn1 and ϕn2 . Equations

ϕn2 , not lying symmetrical with respect to ϕs,itcanbeseen (25)and(26) describe the limit-case solution for which there that we cannot steer the main-lobe of the first-order response are an infinite number of solutions that satisfy the system of to ϕs. Therefore, we steer the main-lobe to ϕs and use a scale- equations, given by (21). factor Sunder the constraint that a unity response is obtained at angle ϕs. In matrix notation, 3.2. Analysis of Directivity Index. Although the optimization in this paper is focused on the suppression of two directional  T =  (18) E θ, φ SFα Rϕs X, interferers, it is also important to analyze the noise-reduction performance for isotropic noise circumstances. We will only with the rotation-matrix and the pattern-synthesis matrix analyze the spherical isotropic noise case, for which we being as in (15)and(14), respectively, with α, ϕ instead of s compute the spherical directivity factor Q given by [4, 5] α, ϕ . S s From (12), we see that a unity desired response at angle 2 4πE π/2, ϕs  QS =   . (27) ϕs is obtained when we choose the scale-factor S as 2π π 2 φ=0 θ=0E θ, φ sin θdθ dφ  1 S = , (19) If we combine (27)with(18), we get α + (1 − α) cos ϕs − ϕs − − with α being the parameter for controlling the directional 6 1 cos ϕ1 1 cosϕ2 QS ϕ1, ϕ2 = , (28) pattern of the first-order response (similar to the parameter 5+3cos ϕ1 − ϕ2 α), ϕ the angle for the desired sound, and ϕ the angle for s s with the steering (which, in general, is different from ϕs). Next, we want to place the nulls at ϕ and ϕ .Hence,we n1 n2 ϕ = ϕ − ϕ , (29) solve the following system of two equations: 1 n1 s ϕ = ϕ − ϕ . (30)   −  −  = 2 n2 s S α + (1 α) cos ϕn1 ϕs 0, (20) In Figure 3, the contour-plot of the directivity factor QS S α + (1 − α) cos ϕ − ϕ = 0. n2 s is shown with ϕ1 and ϕ2 on the x-andy-axes, respectively. EURASIP Journal on Advances in Signal Processing 5

2.5 2 1 1 3 Em 1.5 0.5 3 1.5 0 2.5 E Ep E π 2 d 2 3.5 R F + π/2 ϕs α −− Ed 3 3.5

1 3 2 3.5 2.5 1.5 Er1 w1

3 B Er2 w2 3 2 2.5

(rad) π 2 3 2.5 2 ϕ Figure 4: Generalized Sidelobe Canceller scheme. 3 1.5 2.5 3.5

2 3

1 3

3.5 GSC scheme, first a prefiltering with a fixed value of ϕs and

3.5 1 2.5 α is performed, to construct a primary signal with a unity π 1.5 response to angle ϕ and two noise references. As the two 2 2 s 0.5 1.5

1 2 2.5 3 1 noise references do not include the source coming from angle 1 3 ππ π ϕs, two noise-canceller weights w1 and w2 can be optimized 2 2 in an unconstrained manner. The GSC scheme is shown in ϕ (rad) 1 Figure 4. Figure 3: Contour-plot of the directivity factor QS(ϕ1, ϕ2). We start by constructing the primary-response as = T Ep θ, φ Fα Rϕs X, (32) As can be seen in (28), the directivity factor goes to with FT , R ,andX being as defined in the introduction and zero if one of the angles ϕn1 or ϕn2 gets close to ϕs. Clearly, α ϕs a directivity factor which is smaller than unity is not very using a scale-factor S = 1. useful in practice. Hence, the pattern synthesis technique is Furthermore, we can create two noise-references via ⎡ ⎤ only useful when the angles ϕn1 and ϕn2 are located in one half-plane and the desired source is located around the center Er1 θ, φ ⎣ ⎦ = BT R X (33) of the opposite half-plane. ϕs Er2 θ, φ It can be found in the appendix that for 1 with a blocking-matrix B [14]givenby ϕ = arccos − , 1 3 ⎡ ⎤ 1 (31) ⎢ 0⎥ 1 ⎢ 2 ⎥ ϕ = 2π − arccos − , ⎢ ⎥ 2 ⎢ ⎥ 3 B = ⎢ 1 ⎥. (34) ⎢− 0⎥ a maximum directivity factor QS = 4 is obtained. This cor- ⎣ 2 ⎦ responds with 6 dB directivity index, defined as 10 log10QS, 01 where the directivity pattern resembles a hypercardioid. = Furthermore for (ϕ1, ϕ2) (π, π) rad. a directivity factor It is noted that the noise-references Er and Er are, = 1 2 QS 3 is obtained, corresponding with 4.8 dB directivity respectively, a cardioid and a dipole response, with a null index, where the directivity pattern yields a cardioid. As can steered towards the angle of the desired source at azimuth be seen from Figure 3,wecandefineausableregion,where φ = ϕs and elevation θ = π/2. ≤ ≤ the directivity-factor is QS > 3/4forπ/2 ϕ1, ϕ2 3π/2. The primary- and the noise-responses can be used in the generalized sidelobe canceller structure, to obtain an output 4. Optimal Null-Steering for Two Directional as Interferers via GSC = − − E θ, φ Ep θ, φ w1Er1 θ, φ w2Er2 θ, φ . (35) 4.1. Generalized Sidelobe Canceller (GSC) Structure. To develop an adaptive algorithm for steering two nulls towards It is important to note that for any value of ϕs, α, w1,and the two directional interferers based on the pattern-synthesis w2, a unity-response at the output of the GSC is maintained = = technique in Section 3,itwouldberequiredtousea for angle φ ϕs and θ π/2. constrained optimization technique where we want to main- In the next sections we give some details in computing w1 and w for the suppression of two directional interferers, as tain a unity response towards the angle ϕs.Foradaptive 2 algorithms, it is generally easier to adapt in an unconstrained discussed in the previous section. manner. Therefore, we first present an alternative scheme for the null-steering, similar to the direct pattern-synthesis 4.2. Optimal GSC Null-Steering for Two Directional Inter- technique as discussed in Section 3, but based on the well- ferers. Using the GSC structure of Figure 4 having a unity known Generalized Sidelobe Canceller (GSC) [12]. In the response at angle φ = ϕs, we can compute the weights w1 6 EURASIP Journal on Advances in Signal Processing

2 and w to steer two nulls towards azimuthal angles ϕ and 0.5 2 n1 0.5

ϕn2 , by solving π π π 1 E , ϕ − w E , ϕ − w E , ϕ = 0 (36) 1 p i 1 r1 i 2 r2 i 1.5 2 2 2 1 0.5 for i = 1, 2. 2 1.5 2 2.5 This results in the following relations: 1

3.5 − 3 2.5 1.5 2sin ϕ1 ϕ2 2 3 1 = 2 + , (37) w 0 w1 α − − − 3 sin ϕ1 sin ϕ1 ϕ2 sin ϕ2 3.5 − 1 cos ϕ1 cos ϕ2 2 2.5 = 1.5 w2 − − − , (38) sin ϕ1 sin ϕ1 ϕ2 sin ϕ2 0.5 −1 1.5 where ϕ1 and ϕ2 are defined as given by (29)and(30), 1 respectively. 1 To eliminate the dependency of α in (37), we will use 0.5  = − −2 0.5 w1 w1 2α. (39) −2 −10 1 2  = w1 The denominators in (37)and(38) vanish when ϕn1 ϕs = = − and/or ϕn2 ϕs. Also when Δϕn ϕn1 ϕn2 goes to zero, the Figure 5: Contour-plot of the directivity factor QS(w1, w2). denominator vanishes. In this case, we can compute the limit of w1 and w2: lim w =−2, Note that with this computation, it is not necessarily true → 1 (40) Δϕn 0 that ϑ1 = ϕ1 and ϑ2 = ϕ2, that is, we can have a permutation = ambiguity. Furthermore, we compute the resolved angles of lim w2 sin ϕi (41) Δϕn → 0 the directional interferers as = = − with i 1, 2. ϑni ϑi ϕs, (45) For the case Δϕ = 0, we actually steer a single null n = = where (ϑn1 , ϑn2 ) (ϕn1 , ϕn2 )or(ϑn1 , ϑn2 ) (ϕn2 , ϕn1 ). towards the two directional interferers ϕn1 and ϕn2 . Equations (40)and(41) describe the limit-case solution for which there are an infinite number of solutions (w1, w2) that satisfy (36). 4.3. Analysis of Directivity Index. Just as for the direct From the values of w1 and w2, we can derive the pattern synthesis in the previous section, we can analyze the two angles of the directional interferers ϑ1 and ϑ2,where directivity factor for spherical isotropic noise. We can insert (ϑ1, ϑ2) = (ϕ1, ϕ2)or(ϑ1, ϑ2) = (ϕ2, ϕ1). The two angles the values of w1 and w2 into (27)and(35)andget are obtained via a computation involving the arctan-function 3 with additional sign checking to resolve all four quadrants in  = QS(w1, w2)  2 2 . (46) the azimuthal plane and can be computed as w1 + w1 + w2 +1 ⎧ ⎪ N In Figure 5, we show the contour-plot of the directivity ⎪ ≥  ⎪arctan for : D 0, factor with w1 and w2 on the x-andy-axes, respectively. ⎪ D ⎨⎪ From Figure 5 and (46), it can be seen that the contours N  = , = arctan + for : 0, ≥ 0, (42) are concentric circles with the center at coordinate (w1, w2) ϑ1 ϑ2 ⎪ π D< N ⎪ D (−1/2, 0) where the maximum directivity factor of 4 is ⎪ ⎪ N obtained. ⎩arctan − π for : D<0, N<0, D with 5. Adaptive Algorithm −2(w w ∓ X ) 5.1. Cost-Function for Directional Interferers. Next, we N = 1 2 1 , X2 develop an adaptation scheme to adapt two weights in the (43) GSC structure as discussed in the previous Section 4.Weaim w3 +4w2 +4w ± 4w X D = 1 1 1 2 1 , at obtaining the solution, where a unity response is obtained (  +2) X2 w1 at angle ϕs and two nulls are placed at angles ϕn1 and ϕn2 . We start with with  y[k] = p[k] − (w [k] +2α)r [k] − w [k]r [k], (47) =  2  2 1 1 2 2 X1 (w1 +2) 1+w1 + w2 , (44) with k being the discrete-time index, y[k] the output signal, X = 4+4w + w2 +4w2. 2 1 1 2 w1[k]andw2[k] the adaptive weights, r1[k]andr2[k] the EURASIP Journal on Advances in Signal Processing 7 noise reference signals, and p[k] the primary signal. The with inclusion of the term 2α in (47) is a consequence of the fact ⎡ ⎤ σn1 − that w [k]isanestimateofw (see (39) in which 2α is not 1 cos ϕ1 σn1 sin ϕ1 1 1 ⎢ 2 ⎥ included). 
A = ⎢ ⎥, p ⎣ σ ⎦ In the ideal case that we want to obtain a unity response n2 1 − cos ϕ σ sin ϕ 2 2 n2 2 for a source-signal s[k] originating from angle ϕs and have ⎡ ⎤ an undesired source-signal n1[k] originating from angle ϕn1 w1 = ⎣ ⎦ (53) together with an undesired source-signal n2[k] originating w , w2 from angle ϕn2 ,wehave ⎡ ⎤ p[k] = s[k] + α + (1 − α) cos ϕ n [k], σn1 cos ϕ1 i i v = ⎣ ⎦. i=1,2 p σ cos ϕ n2 2 1 1 r1[k] = − cos ϕi ni[k], T 2 2 (48) The singularity of Ap Ap can be analyzed by computing i=1,2 the determinant of A and setting this determinant to zero: p [ ] = sin [ ] r2 k ϕini k . σn1 σn2 = sin ϕ 1 − cos ϕ − sin ϕ 1 − cos ϕ = 0. (54) i 1,2 2 2 1 1 2

The cost-function J(w1, w2)isdefinedasafunctionofw1 Equation (54) is satisfied when σn1 and/or σn2 are equal to and w2 and is given by zero, ϕ1 and/or ϕ2 are equal to zero, or when J(w , w ) = E y2[k] , (49) 1 2 sin ϕ sin ϕ ϕ ϕ 1 = 2 ≡ cot 1 = cot 2 . (55) with E{·} being the expectation operator. 1 − cos ϕ1 1 − cos ϕ2 2 2 Using that E{n1[k]n2[k]}=0andE{ni[k]s[k]}=0for = i = 1, 2, we can write Equation (55) is satisfied only when ϕ1 ϕ2. This agrees with the result that was obtained in Section 3.1,whereΔϕ = 2 J(w1, w2) = E p[k] − (w1[k] +2α)r1[k] − w2[k]r2[k] 0. = In all other cases (so when ϕ1 / ϕ2, σn1 > 0andσn2 > 0), = 2 2 T σs [k] + σni [k] the matrix Ap is nonsingular and the matrix Ap Ap is positive i=1,2 definite. Hence, the cost-function is a convex function with 1 a global minimum that can be found by solving the least- × w [k]2 + w [k]2 squares problem: 4 1 2 −1 1 2 2 = T T +cos2ϕ w [k] + w [k] − w [k] +1 wopt Ap Ap Ap vp i 4 1 1 2 = −1 Ap vp − 1 2 − (56) +cosϕi w1[k] w1[k] +sinϕiw1[k]w2[k] ⎡ ⎤ 2 − 1 2sin ϕ1 ϕ2 = ⎣ ⎦, − +cosϕi sin ϕi(−2w2[k] − w1[k]w2[k]) A cos ϕ1 cos ϕ2 = 2 − with σs [k] + w1[k] (2+w1[k]) cos ϕi i=1,2 A = sin ϕ1 − sin ϕ1 − ϕ2 − sin ϕ2, (57) 2 2 σni [k] +2w2[k] sin ϕi , similar to the solutions as given in (37)and(38). 4 (50) As an example, we show the contour-plot of the cost- function 10 log10 J(w1, w2)inFigure 6, for the case where with = = = 2 = = ϕs π/2, ϕn1 0, ϕn2 π rad., σni 1fori 1, 2, and 2 = σ2[k] = E s2[k] , σs 0. s = (51) As can be seen, the global minimum is obtained for w1 2 = E 2 = σn [k] ni [k] . 0andw2 0, resulting in a dipole beampattern. When we i 2 = 2 change σn1 / σn2 , the shape of the cost-function will be more We can see that the cost-function is a quadratic-function and more stretched, but the global optimum will be obtained [15] that can be written in matrix-notation (for convenience, for the same values of w1 and w2. In the extreme case when we leave out the index k): σ2 = 0andσ2 > 0, we obtain the cost-function as shown n2 n1 2 in Figure 7. (It is interesting to note that this cost-function is = 2 − J(w1, w2) σs + Apw vp = = = exactly the same as for the case where ϕs π/2, ϕn1 ϕn2 0 (52) radians with 2 = 1for = 1, 2 and 2 = 0.) Although 2 T T T T T σni i σs = σ + w A Apw − 2w A v + v v , s p p p p p still w1 = 0andw2 = 0isanoptimalsolution,itcanbe 8 EURASIP Journal on Advances in Signal Processing

2 2 10

5 0 1.5 1.5 10 −5 5 5 −10 15 1 1 5 0 − 5 5 −15 −5 −10 0 0 −10 −5 0.5 0.5 15 −5 5 0 − 0 −10 −15 −15 −5 −10 2 2 w 0 −10 w 0 − −5 10 −5 −15 − 0 0 5 0 15 0 5 5 − −0.5 −0.5 − −10 −10 0 −5 −15 5 5 0 5 − − −15 1 1 −10 5 5 5 − 10 − 1.5 −1.5 0 5

−2 −2 10 −2 −1.5 −1 −0.500.511.52 −2 −1.5 −1 −0.500.511.52 w1 w1 Figure 6: Contour-plot of the cost-function 10 log10 J(w1, w2)for Figure 7: Contour-plot of the cost-function 10 log10 J(w1, w2)for ======the case where ϕs π/2, ϕn1 0, and ϕn2 π radians. the case where ϕs π/2andϕn1 ϕn2 0 radians.

seen that there is no strict global minimum. For example, Assuming that there are no directional interferers, =− = also w1 2andw2 1 is an optimal solution (yielding a we obtain the following primary signal p[k] and noise- cardioid beampattern). references r1[k]andr2[k] in the generalized sidelobe can- For the situation where there is only a single interferer celler scheme: or the situation where there are two interferers coming from (nearly) the same angle, the resulting beampattern will have p[k] = s[k] + αd1[k] + (1 − α)d2[k] γ, a null to this angle, while the other (second) null will be 1 1 placed randomly (i.e., the second null is not uniquely defined r [k] = d [k] − d [k] γ, (59) and the adaptation of this second null is poor). However in 1 2 1 2 2 situations where we have additive diffuse-noise present, we r [k] = d [k] γ. obtain an extra degree of freedom, for example, optimization 2 3 of the directivity index. This is however outside the scope of = this paper. As di[k]withi 1, 2, 3 and s[k]aremutuallyuncorre- lated, we can write the cost-function as   5.2. Cost-Function for Isotropic Noise. It is also useful to 2 2 = 2 2 1 1 2 analyze the cost-function in the presence of isotropic (i.e., J(w1, w2) σs [k] + σd w1 + γ 1+ w1 + γw2 . diffuse) noise. We know from [16] that spherical and 2 2 cylindrical isotropic noise can be modelled by adding (60) uncorrelated additive white-noise signals d1, d2,andd3 to the 0 π/2 2 2 three eigenbeams Em, Ed,andEd with variances σd , σd γ,and Just as for the cost-function with two directional interfer- 2 σd γ, respectively, or alternatively with a covariance matrix Kd ers, we can write the cost-function for isotropic noise also as given by a quadratic function in matrix notation: ⎡ ⎤   100 ⎢ ⎥ = 2 − 2 γ ⎢ ⎥ Jd(w1, w2) σs + Ad w vd + , (61) = 2⎢ ⎥ 1+γ Kd σd ⎣0 γ 0⎦. (58) 00γ with ⎡ ⎤ (for diffuse noise situations, the individual elements are σ d 1+γ 0 correlated. However, due the construction of eigenbeams, A = ⎣ 2 ⎦, ff d √ the di use noise will be decorrelated. Hence, it is allowed 0 σd γ to add uncorrelated additive white-noise signals to these ⎡ − ⎤ (62) ff σdγ eigenbeams to simulate di use-noise situations,) We choose ⎢ ⎥ = = = ⎣ 1+γ ⎦ γ 1/3 for spherically isotropic noise and γ 1/2for vd . cylindrically isotropic noise. 0 EURASIP Journal on Advances in Signal Processing 9

It can be easily seen that Ad is positive definite and hence and where μ is the update step-size. As in practice, the we have a convex cost-function with a global minimum. ensemble average E{y2[k]} is not available, we have to use an Via (56) we can easily compute this minimum of the cost- ∇ instantaneous estimate of the gradient wi J(w1, w2), which is function, which is obtained by solving the least-squares computed as problem: 2 dy [k] −1 ∇ J(w , w ) = = T T wi 1 2 wopt Ad Ad Ad vd dwi = −1 =−2 p[k] − (w1 +2α)r1[k] − w2r2[k] ri[k] Ap vp ⎡ ⎤ (63) =− 2γ 2y[k]ri[k]. ⎢− ⎥ (69) = ⎣ 1+γ ⎦. 0 Hence, we can write the update equation as

5.3. Cost-Function for Directional Interferers and Isotropic wi[k +1] = wi[k] +2μy[k]ri[k]. (70) Noise. In case we have directional interferers as well as isotropic noise and assume that all these noise-components Just as proposed in [5], we can apply a power- are mutually uncorrelated, we can construct the cost- normalization such that the convergence speed is indepen- function based on addition of the two cost-functions: dent of the power: ( , ) = ( , ) + ( , ) Jp,d w1 w2 Jp w1 w2 Jd w1 w2 2μy[k]ri[k] wi[k +1] = wi[k] + , (71) 2  2 Pri [k] + = 2 − − 2 σd γ σs + Apw vp + Adw vd + 1+γ with  being a small value to prevent zero division and where

2 2 the power-estimate Pri [k] of the i th reference signal ri[k]can = 2 − σd γ σ + Ap,dw v , + , be computed by a recursive averaging: s p d 1+γ (64) = − 2 Pri [k +1] βPri [k] + 1 β ri [k], (72) with: ⎡ ⎤ with β being a smoothing parameter (lower, but close to 1). Ap The gradient search only needs to be performed in case = ⎣ ⎦ Ap,d , one or both of the directional interferers are present. In A d case the desired speech is present during the adaptation, ⎡ ⎤ (65) v the gradient search will not behave robustly in practice. = ⎣ p⎦ This nonrobust behaviour is caused by leakage of speech vp,d . vd in the noise references r1 and r2 duetoeithervariations of the desired speaker location, microphone mismatches Since Jp(w1, w2)andJd(w1, w2)werefoundtobeconvex, or reverberation (multipath) effects. To avoid adaptation the sum Jp,d(w1, w2) is also convex. The optimal weights wopt during desired speech, we will apply a step-size control factor can be obtained by computing in the adaptation-rule, given by −1 T T w = A A , A v , (66) P [k] + P [k] opt p,d p d p,d p,d Ψ[k] = r1 r2 , (73) P [k] + P [k] + P [k] +  which can be solved numerically via standard SVD tech- r1 r2 p niques [15]. where Pr1 [k]+Pr2 [k] is an estimate of the noise power and Pp[k] is an estimate of the primary signal p[k] that contains 5.4. Gradient Search Algorithm. As we know that the cost- mainly desired speech. The power estimate Pp[k] is, just function is a convex function with a global minimum, we can find this optimal solution by means of a steepest descent as for the reference-signal powers Pr1 and Pr2 , obtained via recursive averaging: update equation for wi with i = 1, 2 by stepping in the direction opposite to the surface J(w , w )withrespecttow , 1 2 i = − 2 similar to [5] Pp[k +1] βPp[k] + 1 β p [k]. (74) = − ∇ We can see that the value of Ψ[k] will be small when the wi[k +1] wi[k] μ wi J(w1, w2), (67) desired speech is dominating, while Ψ[k]willbemuchlarger with a gradient given by (but lower than 1) when either the directional interferers or spherically isotropic noise is dominating. As it is beneficial E 2 ∂J(w1[k], w2[k]) ∂ y [k] to have a low amount of noise components in the power ∇ ( , ) = = , (68) wi J w1 w2 ∂wi[k] ∂wi[k] estimate Pp[k], we found that α = 0.25 is a good choice. 10 EURASIP Journal on Advances in Signal Processing

= = = 2 = 2 = 2 Initialize w1[0] 0, w2[0] 0, Pr1 [0] r1 [0], Pr2 [0] r2 [0] and Pp[0] p [0] for k = 0, ∞: do P [k]+P [k] Ψ[k] = r1 r2  Pr1 [k]+Pr2 [k]+Pp[k]+

y[k] = p[k] − (w1[k]+2α)r1[k] − w2[k]r2[k] for i = 1, 2: do 2μy[k]r [k] w [k +1]= w [k]+ i Ψ[k] i i  Pri [k]+  i 2 2 2 X1 = (−1) (w1[k] +2) (1 + w1[k]+w2[k] )

2 2 X2 = 4+4w1[k]+w1[k] +4w2[k]

−2(w [k]w [k]+X ) N = 1 2 1 X2

w [k]3 +4w [k]2 +4w [k] − 4w [k]X D = 1 1 1 2 1 X2(w1[k]+2) N ϑ = arctan − ϕ ni D s if D<0 then = − ϑni ϑni π sgn(N) end if = − 2 Pri [k +1] βPri [k]+(1 β)ri [k] 2 Pp[k +1]= βPp[k]+(1− β)p [k] end for end for

Algorithm 1: Optimal null-steering for two directional interferers.

 The algorithm now looks as shown in Algorithm 1. Table 1: Computed values of ϕs, α,andS for placing two nulls at As can be seen in the algorithm, the two weights w1[k] ϕn1 and ϕn2 and having a unity response at ϕs. and w2[k] are adapted based on a gradient-search method.  ϕn1 ϕn2 ϕs ϕs  Based on these two weights, a computation with arctan- α S QS function is performed to obtain the angles of the directional (deg) (deg) (deg) (deg) = 45 180 90 292.5 0.277 1.141 0.61 interferers ϑni with i 1, 2. 0 180 90 90 0 1.0 3.0 6. Validation 0 225 90 112.5 0.277 1.058 3.56 00 90 00.520.75 6.1. Directivity Pattern for Directional Interferers. First, we show the beampatterns for a number of situations where two nulls are placed. In Table 1, we show the computed values for Table 2: Computed values of w and w for placing two nulls at ϕ the direct pattern synthesis for 4 different situations, where 1 2 n1 and ϕ and having a unity response at ϕ . nulls are placed at different angles. Furthermore, we assume n2 s that there is no isotropic noise present. ϕn1 ϕn2 ϕs w1 w2 QS As was explained in Section 3.1, we can obtain two (deg) (deg) (deg) ff    di erent sets of solutions for ϕs, α,andS.InTable 1, we show √ 1 √  45 180 90 2 − 2 0.61 the set of solutions where α is positive. 2 Similarly, in Table 2, we show the computed values for w1 and w2 in the GSC structure as explained in Section 4 for the 0 180 90 003.0 same situations as for the direct pattern synthesis. −2 −1 The polar-plots resulting from the computed values in 0 225 90 √ √ 3.56 2 + 2 2 + 2 Tables 1 and 2 are shown in Figure 8. It is noted that the two examples of Section 5.1 where we analyzed the cost-function 00 90 −2 −1 0.75 are depicted in Figures 8(b) and 8(d). EURASIP Journal on Advances in Signal Processing 11

90 3 90 1 120 60 120 60 0.8 2 0.6 150 30 150 30 1 0.4 0.2

180 0 180 0

210 330 210 330

240 300 240 300 270 270 (a) (b) 90 1.5 90 2 120 60 120 60 1.5 1 150 30 150 1 30 0.5 0.5

180 0 180 0

210 330 210 330

240 300 240 300 270 270 (c) (d)

Figure 8: Azimuthal polar-plots for the placement of two nulls with nulls placed at (a) 45 and 180 degrees, (b) 0 and 180 degrees, (c) 0 and 225 degrees and (d) 0, and 0 degrees (two identical nulls).

1 From the plots in Figure 8, it can be seen that if one of 0.5 the two null-angles is close to the desired source angle (e.g., in Figure 8(a)), the directivity index becomes worse. Because 0 of this poor directivity index, the null-steering method as −0.5 is proposed in this paper will only be useful when either azimuthal angle of the two directional interferers is not very 2 −1 n ϑ close to the azimuthal angle of the desired source. When we −1.5 limit the main-beam to be steered maximally 90 degrees away and 1 n − from the desired direction, that is, |ϕ − ϕ | <π/2, we avoid

s s ϑ 2 a poor directivity index. For example, in Figure 8(d) such a −2.5 situation is shown where the main-beam is steered 90 degrees −3 away from the desired direction. In case the two directional interferers will change quickly from 0 to 180 degrees, the −3.5 adaptive algorithm will automatically adapt and removes −4 these two directional interferers at 180 degrees. As only two 012345678910 3 weights are used in the adaptive algorithm, the convergence k ×10 to the optimal weights will be very fast. ϑn1 ϑn2 = ϕni with i 1, 2 6.2. Gradient Search Algorithm. Next, we validate the track- ing behaviour of the gradient update algorithm, as proposed Figure 9: Simulation of the null-steering algorithm with two in Section 5.4. We perform a simulation, where we have a 2 = 2 = desired source at 90 degrees and where we linearly increase directional interferers only where σn σn 1. 1 2 the angle of a first undesired directional interferer (ranging 12 EURASIP Journal on Advances in Signal Processing

1 1

0.5 0.5

0 0 −0.5 −0.5

2 −1 2 −1 n n ϑ ϑ −1.5 −1.5 and and 1 1 n − n − ϑ 2 ϑ 2

−2.5 −2.5 −3 −3

−3.5 −3.5 −4 −4 012345678910 012345678910 3 3 k ×10 k ×10 ϑn1 ϑn1 ϑn2 ϑn2 = = ϕni with i 1, 2 ϕni with i 1, 2 (a) (b)

2 = 2 = Figure 10: Simulation of the null-steering algorithm with two directional interferers where σn1 σn2 1 and with a desired source where 2 = = = σs 1/16 with ϕs 90 degrees (a) and ϕs 60 degrees (b).

1 1

0.5 0.5

0 0 −0.5 −0.5

2 −1 2 −1 n n ϑ ϑ −1.5 −1.5 and and 1 1 n − n − ϑ 2 ϑ 2

−2.5 −2.5 −3 −3

−3.5 −3.5 −4 −4 012345678910 012345678910 3 3 k ×10 k ×10 ϑn1 ϑn1 ϑn2 ϑn2 = = ϕni with i 1, 2 ϕni with i 1, 2 (a) (b)

2 = 2 = Figure 11: Simulation of the null-steering algorithm with two directional interferers where σn1 σn2 1 and with (spherically isotropic) = 2 = 2 = spherical isotropic noise (γ 1/3), where σd 1/16 (a) and σd 1/4(b).

2 = from 135 to 45 degrees) and we linearly decrease the angle σni 1. The results are shown in Figure 9.Itcanbeseen of a second undesired directional interferer (ranging from 30 that ϑ and ϑ do not cross (in contrast to the angles of the − n1 n2 degrees to 90 degrees) in a time-span of 10000 samples. For directional interferers and ).Thefirstnullplacedat = = = ϕn1 ϕn2 ϑn1 the simulation, we used α 0.25, μ 0.02, and β 0.95. adapts very well, while the second null, placed at ϑn2 ,ispoorly First, we simulate the situation, where only two direc- adapted. The reason for this was explained in Section 5.1. tional interferers are present. The two directional interferers Similarly, we simulate the situation with the same two are uncorrelated white random-noise signals with variance directional interferers but now together with a desired EURASIP Journal on Advances in Signal Processing 13

0.1

0.05

0 −0.05

−0.1 0 2 4 6 8 10121416 t (s)

(a) Cardioid to 0 degrees, that is, M0 0.1

0.05

0 −0.05

−0.1 0 2 4 6 8 10121416 Figure 12: Microphone array with 3 outward facing cardioid t (s) microphones. (b) Proposed adaptive null-steering algorithm

Figure 14: Results of the real-life experiment (waveform). N2

6 π rad 5

4 M2 M1

N3 3π/2rad π/2rad N1 1, 2 =

i 3 M 1m 0 with i

φ n 2 ϑ

0rad 1 S 0 0246810121416 t (s)

= Figure 13: Practical setup of the microphone array. ϑni with i 1, 2 = ϕni with i 1, 2 Figure 15: Results of the real-life experiment (angle estimates). source-signal s[k]. The desired source is modelled as a white- 2 = noise signal, with a variance σs 1/16. The result is shown in Figure 10(a). We see that due to the adaptation-noise (caused by s[k]), there is more variance in the estimates of the angles end of the simulation where k = 10000, this can be clearly ϑ and ϑ . In contrast to the situation with two directional n1 n2 seen for ϑn1 . = interferers only, we see that there is a region where ϑn1 ϑn2 . Finally, we simulate the situation of the same directional To show how the adaptation behaviour looks in presence interferers, but now in a spherical isotropic noise situation. of variation in the desired source location, we do a similar As was explained in Section 5.2, isotropic noise can be simulation as above, but now with ϕs set to 60 degrees, while modelled by adding uncorrelated additive white-noise to the 0 π/2 2 2 the desired source is coming from 90 degrees. This means three eigenbeams Em, Ed,andEd with variances σd , σd γ, 2 = that there will be leakage of the desired source signal into the and σd γ,respectively.Hereγ 1/3 for spherically isotropic noise reference signals r1[k]andr2[k]. The results are shown noise and γ = 1/2 for cylindrically isotropic noise. In our in Figure 10(b). Here, it can be seen that the adaptation simulation, we use γ = 1/3. The results are shown in Figures ff 2 = 2 = shows a small o set if one of the directional source angles 11(a) and 11(b) with variances σd 1/16 and σd 1/4, comes close to the desired source angle. For example, at the respectively. When the variance of the diffuse noise gets 14 EURASIP Journal on Advances in Signal Processing

90 1.5 90 1.5 90 1.5 120 60 120 60 120 60 1 1 1 150 30 150 30 150 30 0.5 0.5 0.5

180 0 180 0 180 0

210 330 210 330 210 330

240 300 240 300 240 300 270 270 270 (a) t = 2.5 seconds (b) t = 6 seconds (c) t = 9.5 seconds 90 1.5 90 1.5 120 60 120 60 1 1 150 30 150 30 0.5 0.5

180 0 180 0

210 330 210 330

240 300 240 300 270 270 (d) t = 13 seconds (e) t = 16.5 seconds

Figure 16: Polar-plot results of the real-life experiment. larger compared to the directional interferers, the adaptation was generated via four loudspeakers, placed close to the walls will be influenced by the diffuse noise that is present. The and each facing diffusers hanging on the walls. The level of larger the diffuse noise, the more the final beampattern the diffuse noise is 12 dB lower compared to the directional will resemble the hypercardioid. If diffuse noise would be (interfering) sources. The experiment is done in a time-span dominant over the directional interferers, the estimates ϕn1 of 17.5 seconds, where we switch the directional sources as − and ϕn2 will be equal to 90 109 degrees, and 90+109 degrees, shown in Table 3. respectively, (or −0.33 and −2.81 radians, resp.). We use mutually uncorrelated white random-noise sequences for the directional sources N1, N2, and N3 played 6.3. Real-Life Experiments. To validate the null-steering by loudspeakers and use speech for the desired sound-source algorithm in real-life, we used a microphone array with S. 3 outward facing cardioid electret microphones, as shown For the algorithm, we use discrete-time signals with a in Figure 12. As directional cardioid microphones have sample-rate of 8 KHz. Furthermore, we used α = 0.25, μ = openings on both sides, the microphones are placed in 0.001, and β = 0.95. rubber holders, enabling sound to enter both sides of the Figure 14(a) shows the waveform obtained from micro- directional microphones. phone #0 (M0), which is a cardioid pointed with its main- Thetypeofmicrophoneelementsusedforthisarray lobe to 0 radians. This waveform is compared with the result- is the Primo EM164 cardioid microphones [17]. These ing waveform of the null-steering algorithm, and is shown elements are placed uniformly on a circle with a radius in Figure 14(b). As the proposed null-steering algorithm is of 1 cm. This radius is sufficient for the construction of able to steer nulls toward the directional interferers, the direct eigenbeams up to a frequency of 4 KHz. part of the interferers is removed effectively (this can be seen For the experiment, we placed the array on a table in by the lower noise-level in Figure 14(b) in the time-frame a moderately reverberant room (conferencing-room) with from 0–10.5 seconds). In the segment from 10.5–14 seconds = a T60 of approximately 200 milliseconds. As shown in the (where there is only a single directional interferer at φ π setup in Figure 13, all directional sources are placed at a radians), it can be seen that the null-steering algorithm is distance of 1 meter from the array (at discrete azimuthal able to reject this interferer just as good as the single cardioid angles: φ = 0, π/2, π,and3π/2 radians), while diffuse noise microphone. EURASIP Journal on Advances in Signal Processing 15

Table 3: Switching of sound-sources during the real-life experiment.

Source angle φ (rad) 0–3.5 (s) 3.5–7 (s) 7–10.5 (s) 10.5–14 (s) 14 s–17.5 (s) N1 π/2active—active— — N2 π active active — active — N33π/2 — active active — — S 0 active active active active active

In Figure 15, the resulting angle-estimates from the null- Proof. First, we compute the numerator of the partial steering algorithm are shown. Here, it can be seen that the derivative ∂QS/∂ϕ1 and set this derivative to zero: angle-estimation for the first three segments of 3.5 seconds 6 1 − cos ϕ1 sin ϕ1 5+3cos ϕ1 − ϕ2 is done accurately. For the fourth segment, there is only (A.2) a single point interferer. In this segment, only a single +6 1 − cos ϕ1 1 − cos ϕ2 3sin ϕ1 − ϕ2 = 0. angle-estimation is stable, while the other angle-estimation − is highly influenced by the diffuse noise. Finally, in the The common factor 6(1 cos ϕ1) can be removed, resulting ff in fifth segment, only di use noise is present and the final beampattern will optimize the directivity-index, leading to a sin ϕ1 5+3cos ϕ1 − ϕ2 +3 1 − cos ϕ1 sin ϕ1 − ϕ2 = 0. more hypercardioid beampattern steered with its main-lobe (A.3) to 0 degrees (as explained in Section 6.2). Finally, in Figure 16, the resulting polar-patterns from Similarly, setting the partial derivative ∂QS/∂ϕ2 equal to zero, we get the null-steering algorithm are shown for some discrete time-stamps. Again, it becomes clear that the null-steering sin ϕ2 5+3cos ϕ2 − ϕ1 +3 1 − cos ϕ2 sin ϕ2 − ϕ1 = 0. algorithm is able to steer the nulls toward the angles where (A.4) the interferers are coming from. Combining (A.3)and(A.4)gives − − sin ϕ1 = 3sin ϕ 1 ϕ2 7. Conclusions 1 − cos ϕ1 5+3cos ϕ1 − ϕ2 (A.5) We analyzed the construction of a first-order superdirec- 3sin ϕ − ϕ − sin ϕ = 2 1 = 2 , tional response in order to obtain a unity response for a 5+3cos ϕ2 − ϕ1 1 − cos ϕ2 desired azimuthal angle and to obtain a placement of two or alternatively nulls to undesired azimuthal angles to suppress two direc- tional interferers. We derived a gradient search algorithm to 2sin ϕ1/2 cos ϕ1/2 = ϕ1 =− ϕ2 adapt two weights in a generalized sidelobe canceller scheme. 2 cot cot , (A.6) 2sin ϕ1/2 2 2 Furthermore, we analyzed the cost-function of this gradient ∈ search algorithm, which was found to be convex. Hence with ϕ1, ϕ2 [0, π]. = = a global minimum is obtained in all cases. From the two From (A.6), we can see that ϕ1/2+ϕ2/2 π (or ϕ1 +ϕ2 2π)andcanderive weights in the algorithm and using a four-quadrant inverse- tangent operation, it is possible to obtain estimates of the cos ϕ2 = cos 2π − ϕ1 = cos ϕ1,(A.7) azimuthal angles where the two directional interferers are coming from. Simulations and real-life experiments show a sin ϕ2 = sin 2π − ϕ1 =−sin ϕ1. (A.8) good performance in moderate reverberant situations. Using (A.7)and(A.8)in(A.1)gives − 2 − 2 − 2 = 6 1 cos ϕ1 = 6 1 cos ϕ1 = 6(1 x) QS 2 2 , Appendix 5+3 2cosϕ1 − 1 2+6cos ϕ1 2+6x (A.9) Proofs with x = cos ϕ1 ∈ [−1, 1]. Maximum Directivity Factor QS. We prove that for We can compute the optimal value for x by differentia- tion of (A.9) and setting the result to zero: 2 2 − − − 12(1 − x) 2+6x − 6(1 − x) 12x = 0 = 6 1 cos ϕ1 1 cosϕ2 QS ϕ1, ϕ2 − , (A.1) (A.10) 5+3cos ϕ1 ϕ2 ≡−2 − 6x2 − 6x +6x2 = 0.

Solving (A.10)givesx = cos ϕ1 =−1/3 and conse- with ϕ1, ϕ2 ∈ [0, 2π],amaximumQS = 4 is obtained for quently, ϕ1 = arccos (−1/3) and ϕ2 = 2π −arccos (−1/3). Via ϕ1 = arccos (−1/3) and ϕ2 = 2π − arccos (−1/3). (A.9), we can see that for these values, we have QS = 4. 16 EURASIP Journal on Advances in Signal Processing

Acknowledgment (OCEANS ’03), vol. 4, pp. 2127–2132, San Diego, Calif, USA, September 2003. The author likes to thank Dr. A. J. E. M. Janssen for his [17] R. M. M. Derkx, “Spatial harmonic analysis of unidirectional valuable suggestions. microphones for use in superdirective beamformers,” in Proceedings of the 36th International Conference: Automotive Audio, Dearborn, Mich, USA, June 2009. References

[1]G.W.Elko,F.Pardo,D.Lopez,D.Bishop,andP.Gammel, “Surface-micromachined mems microphone,” in Proceedings of the 115th AES Convention, p. 1–8, October 2003. [2] P. L. Chu, “Superdirective microphone array for a set-top video conferencing system,” in Proceedings of the IEEE Inter- national Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), vol. 1, pp. 235–238, Munich, Germany, April 1997. [3] R. L. Pritchard, “Maximum directivity index of a linear point array,” Journal of the Acoustical Society of America, vol. 26, no. 6, pp. 1034–1039, 1954. [4] H. Cox, “Super-directivity revisited,” in Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC ’04), vol. 2, pp. 877–880, May 2004. [5] G. W. Elko and A. T. Nguyen Pong, “A simple first-order differential microphone,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’95), pp. 169–172, New Paltz, NY, USA, October 1995. [6]G.W.ElkoandA.T.NguyenPong,“Asteerableandvariable first-order differential microphone array,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), vol. 1, pp. 223–226, Munich, Germany, April 1997. [7] M. A. Poletti, “Unified theory of horizontal holographic sound systems,” Journal of the Audio Engineering Society, vol. 48, no. 12, pp. 1155–1182, 2000. [8] H. Cox, R. M. Zeskind, and M. M. Owen, “Robust adaptive beamforming,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987. [9]R.M.M.DerkxandK.Janse,“Theoreticalanalysisofafirst- order azimuth-steerable superdirective microphone array,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 1, pp. 150–162, 2009. [10] Y. Huang and J. Benesty, Audio Signal Processing for Next Gen- eration Multimedia Communication Systems,KluwerAcademic Publishers, Dordrecht, The Netherlands, 1st edition, 2004. [11] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition, Springer, Berlin, Germany, 1st edition, 2007. [12] L. J. Griffiths and C. W. Jim, “An alternative approach to lin- early constrained adaptive beamforming,” IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982. [13] R. M. M. Derkx, “Optimal azimuthal steering of a first- order superdirectional microphone response,” in Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, September 2008. [14] J.-H. Lee and Y.-H. Lee, “Two-dimensional adaptive array beamforming with multiple beam constraints using a general- ized sidelobe canceller,” IEEE Transactions on Signal Processing, vol. 53, no. 9, pp. 3517–3529, 2005. [15] W. Kaplan, Maxima and Minima with Applications: Practical Optimization and Duality, John Wiley & Sons, New York, NY, USA, 1999. [16] B. H. Maranda, “The statistical accuracy of an arctangent bearing estimator,” in Proceedings of the Oceans Conference Hindawi Publishing Corporation EURASIP Journal on Advances in Signal Processing Volume 2010, Article ID 431347, 25 pages doi:10.1155/2010/431347

Research Article Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on Higher-Order Statistics

Yu Takahashi, 1 Hiroshi Saruwatari (EURASIP Member),1 Kiyohiro Shikano (EURASIP Member),1 and Kazunobu Kondo2

1 Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan 2 SP Group, Center for Advanced Sound Technologies, Yamaha Corporation, Shizuoka 438-0192, Japan

Correspondence should be addressed to Yu Takahashi, [email protected]

Received 5 August 2009; Revised 3 November 2009; Accepted 16 March 2010

Academic Editor: Simon Doclo

Copyright © 2010 Yu Takahashi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We conduct an objective analysis on musical noise generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable that musical noise is mitigated. Moreover, it has been recently reported that higher- order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated. Thus, we analyze the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, and fully clarify the features of musical noise generated by each method. As a result, it is clarified that a specific structure of integration is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is shown via a computer simulation and a subjective evaluation.

1. Introduction methods, the strength parameter to mitigate musical noise in nonlinear signal processing is determined heuristically. There have recently been various studies on microphone Although there have been some studies on reducing musical array signal processing [1]; in particular, the delay-and- noise [16] and on nonlinear signal processing with less sum (DS) [2–4] array and the adaptive beamformer [5– musical noise [17], evaluations have mainly depended on 7] are the most conventionally used microphone arrays for subjective tests by humans, and no objective evaluations have speech enhancement. Moreover, many methods of integrat- been performed to the best of our knowledge. ing microphone array signal processing and nonlinear signal In our recent study, it was reported that the amount of processing such as spectral subtraction (SS) [8]havebeen generated musical noise is strongly related to the difference studied with the aim of achieving better noise reduction between higher-order statistics (HOS) before and after [9–15]. It has been well demonstrated that such integration nonlinear signal processing [18]. This fact makes it possible methods can achieve higher noise reduction performance to analyze the amount of musical noise arising through than that obtained using conventional adaptive microphone nonlinear signal processing. Therefore, on the basis of HOS, arrays [13] such as the Griffith-Jim array [6].However,aseri- we can establish a mathematical metric for the amount of ous problem exists in such methods: artificial distortion (so- musical noise generated in an objective manner. One of called musical noise [16]) due to nonlinear signal processing. the authors has analyzed single-channel nonlinear signal Since the artificial distortion causes discomfort to users, it processing based on the objective metric and clarified the is desirable that musical noise is controlled through signal features of the amount of musical noise generated [18, 19]. processing. However, in almost all nonlinear noise reduction In addition, this objective metric suggests the possibility that 2 EURASIP Journal on Advances in Signal Processing

Multichannel observed signal

. Beamforming to + . Spectral . enhance target speech Output subtraction (delay-and-sum)

. Beamforming to . . estimate noise signal

Figure 1: Block diagram of architecture for spectral subtraction after beamforming (BF+SS).

Multichannel observed signal + Spectral Beamforming to . subtraction . . . . enhance target speech Output − + Spectral . (delay-and-sum) subtraction − Beamforming to . . . . . estimate noise signal . in each channel

Figure 2: Block diagram of architecture for channelwise spectral subtraction before beamforming (chSS+BF). methods of integrating microphone array signal processing (i) The amount of musical noise generated strongly and nonlinear signal processing can be optimized from the depends on not only the oversubtraction parameter viewpoint of not only noise reduction performance but of SS but also the statistical characteristics of the input also the sound quality according to human hearing. As signal. a first step toward achieving this goal, in this study we (ii) Except for the specific condition that the input signal analyze the simplest case of the integration of microphone is Gaussian, the noise reduction performances of the array signal processing and nonlinear signal processing by two methods are not equivalent even if we set the considering the integration of DS and SS. As a result of the same SS parameters. analysis, we clarify the musical-noise generation features of two types of methods on integration of microphone array (iii) Under equivalent noise reduction performance con- signal processing and SS. ditions, chSS+BF generates less musical noise than Figure 1 shows a typical architecture used for the inte- BF+SS for almost all practical cases. gration of microphone array signal processing and SS, where The most important contribution of this paper is that SS is performed after beamforming. Thus, we call this type these findings are mathematically proved. In particular, the of architecture BF+SS. Such a structure has been adopted amount of musical noise generated and the noise reduction in many integration methods [11, 15]. On the other hand, performance resulting from the integration of microphone the integration architecture illustrated in Figure 2 is an array signal processing and SS are analytically formulated on alternative architecture used when SS is performed before the basis of HOS. Although there have been many studies on beamforming. Such a structure is less commonly used, optimization methods based on HOS [21], this is the first but some integration methods use this structure [12, 14]. time they have been used for musical-noise assessment. The In this architecture, channelwise SS is performed before validity of the analysis based on HOS is demonstrated via a beamforming, and we call this type of architecture chSS+BF. computer simulation and a subjective evaluation by humans. We have already tried to analyze such methods of The rest of the paper is organized as follows. In Section 2, integrating DS and SS from the viewpoint of musical-noise the two methods of integrating microphone array signal generation on the basis of HOS [20]. However, in the processing and SS are described in detail. In Section 3, the analysis, we did not consider the effect of flooring in SS metric based on HOS used for the amount of musical noise and the noise reduction performance. On the other hand, generated is described. Next, the musical-noise analysis of in this study we perform an exact analysis considering the SS, microphone array signal processing, and their integration effect of flooring in SS and the noise reduction performance. methods are discussed in Section 4.InSection 5, the noise We analyze these two architectures on the basis of HOS and reduction performances of the two integration methods are obtain the following results. discussed, and both methods are compared under equivalent EURASIP Journal on Advances in Signal Processing 3

+ To enhance the target speech, DS is applied to the Noise observed signal. This can be represented by Target speech

= T θU yDS f , τ gDS f , θU x f , τ , T = (DS) (DS) gDS f , θU g1 f , θU , ..., gJ f , θU , 0 d (2) Mic. 1 Mic. 2 ···Mic. j ··· Mic. J = = = = (d d1) (d d2) (d dj ) (d dJ ) i2π f/M f d sin θ (DS) = −1 · − s j U gj f , θU J exp , Figure 3: Configuration of microphone array and signals. c

where gDS( f , θU) is the coefficient vector of the DS array and θU is the specific fixed look direction known in advance. noise reduction performance conditions in Section 6.More- Also, fs is the sampling frequency, M is the DFT size, and c over, the result of a computer simulation and experimental is the sound velocity. Finally, we obtain the target-speech- results are given in Section 7. Following a discussion of enhanced spectral amplitude based on SS. This procedure the results of the experiments, we give our conclusions in can be expressed as Section 8. ySS f , τ 2. Methods of Integrating Microphone Array ⎧ Signal Processing and SS ⎪ 2 2 ⎪ yDS f , τ − β · Eτ n f , τ ⎪ ⎪ In this section, the formulations of the two methods of ⎪ ⎪ 2 integrating microphone array signal processing and SS are ⎨ where yDS f , τ (3) = described. First, BF+SS, which is a typical method of ⎪ ⎪ 2 integration, is formulated. Next, an alternative method of ⎪ −β · E n f , τ ≥ 0 , ⎪ τ integration, chSS+BF, is introduced. ⎪ ⎪ ⎩ η · yDS f , τ (otherwise), 2.1. Sound-Mixing Model. In this study, a uniform linear microphone array is assumed, where the coordinates of the wherethisprocedureisatypeofextendedSS[23]. Here, elements are denoted by d (j = 1, ..., J) (see Figure 3)and j y ( f , τ) is the target-speech-enhanced signal, β is the J is the number of microphones. We consider one target SS oversubtraction parameter, η is the flooring parameter, and speech signal and an additive interference signal. Multiple n( f , τ) is the estimated noise signal, which can generally mixed signals are observed at each microphone element, and be obtained by a beamforming techniques such as fixed the short-time analysis of the observed signals is conducted or adaptive beamforming. E [·] denotes the expectation by a frame-by-frame discrete Fourier transform (DFT). The τ operator with respect to the time-frame index. For example, observed signals are given by n( f , τ) can be expressed as [13] = x f , τ h f s f , τ + n f , τ ,(1) = T n f , τ λ f gNBF f x f , τ ,(4) where x( , ) = [ ( , ), , ( , )]T is the observed f τ x1 f τ ... xJ f τ where gNBF( f ) is the filter coefficient vector of the null = T signal vector, h( f ) [h1( f ), ..., hJ ( f )] is the transfer beamformer [22] that steers the null directivity to the speech function vector, s( f , τ) is the target speech signal, and direction θU,andλ( f ) is the gain adjustment term, which T n( f , τ) = [n1( f , τ), ..., nJ ( f , τ)] is the noise signal vector. is determined in a speech break period. Since the null beamformer can remove the speech signal by steering the null directivity to the speech direction, we can estimate 2.2. SS after Beamforming. In BF+SS, the single-channel the noise signal. Moreover, a method exists in which target-speech-enhanced signal is first obtained by beam- independent component analysis (ICA) is utilized as a noise forming, for example, by DS. Next, single-channel noise estimator instead of the null beamformer [15]. estimation is performed by a beamforming technique, for example, null beamformer [22] or adaptive beamforming [1]. Finally, we extract the resultant target-speech-enhanced 2.3. Channelwise SS before Beamforming. In chSS+BF, we signal via SS. The full details of signal processing are given first perform SS independently in each input channel and below. then we derive a multichannel target-speech-enhanced signal 4 EURASIP Journal on Advances in Signal Processing by channelwise SS. This can be expressed as 3.2. 
Relation between Musical-Noise Generation and Kurtosis. In our previous works [18–20], we defined musical noise as (chSS) yj f , τ the audible isolated spectral components generated through ⎧ signal processing. Figure 4(b) shows an example of a spectro- ⎪ ⎪ 2 2 ⎪ − · gram of musical noise in which many isolated components ⎪ xj f , τ β Eτ nj f , τ ⎪ can be observed. We speculate that the amount of musical ⎪ ⎪ 2 noise is strongly related to the number of such isolated ⎨⎪ where xj f , τ (5) = components and their level of isolation. ⎪ Hence, we introduce kurtosis to quantify the isolated ⎪ 2 ⎪ − · ≥ ⎪ β Eτ nj f , τ 0 , spectral components, and we focus on the changes in kur- ⎪ ⎪ tosis. Since isolated spectral components are dominant, they ⎪ ⎩ are heard as tonal sounds, which results in our perception η · xj f , τ (otherwise), of musical noise. Therefore, it is expected that obtaining (chSS) the number of tonal components will enable us to quantify where yj ( f , τ) is the target-speech-enhanced signal the amount of musical noise. However, such a measurement obtained by SS at a specific channel j and n j ( f , τ) is the estimated noise signal in the jth channel. For instance, is extremely complicated; so instead we introduce a simple the multichannel noise can be estimated by single-input statistical estimate, that is, kurtosis. multiple-output ICA (SIMO-ICA) [24] or a combination of This strategy allows us to obtain the characteristics of ICA and the projection back method [25]. These techniques tonal components. The adopted kurtosis can be used to can provide the multichannel estimated noise signal, unlike evaluate the width of the probability density function (p.d.f.) traditional ICA. SIMO-ICA can separate mixed signals not and the weight of its tails; that is, kurtosis can be used to into monaural source signals but into SIMO-model signals evaluate the percentage of tonal components among the total at the microphone. Here SIMO denotes the specific trans- components. A larger value indicates a signal with a heavy mission system in which the input signal is a single source tail in its p.d.f., meaning that it has a large number of tonal signal and the outputs are its transmitted signals observed components. Also, kurtosis has the advantageous property at multiple microphones. Thus, the output signals of SIMO- that it can be easily calculated in a concise algebraic form. ICA maintain the rich spatial qualities of the sound sources [24] Also the projection back method provides SIMO- 3.3. Kurtosis. Kurtosis is one of the most commonly used model-separated signals using the inverse of an optimized HOS for the assessment of non-Gaussianity. Kurtosis is ICA filter [25]. defined as Finally, we extract the target-speech-enhanced signal by μ4 kurt = , = (chSS) (chSS) T x 2 (7) applying DS to ychSS( f , τ) [y1 ( f , τ), ..., yJ ( f , τ)] . μ2 This procedure can be expressed by where x is a random variable, kurtx is the kurtosis of x,and T y f , τ = g f , θU ychSS f , τ , (6) μn is the nth-order moment of x.Hereμn is defined as DS +∞ where y( f , τ) is the final output of chSS+BF. n μn = x P(x)dx, (8) Such a chSS+BF structure performs DS after (multichan- −∞ nel) SS. Since DS is basically signal processing in which the where P(x) denotes the p.d.f. of x. Note that this μn is summation of the multichannel signal is taken, it can be not a central moment but a raw moment. 
Thus, (7)isnot considered that interchannel smoothing is applied to the kurtosis according to the mathematically strict definition, multichannel spectral-subtracted signal. On the other hand, but a modified version; however, we refer to (7) as kurtosis the resultant output signal of BF+SS remains as it is after SS. in this paper. That is to say, it is expected that the output signal of chSS+BF is more natural (contains less musical noise) than that of 3.4. Kurtosis Ratio. Although we can measure the number of BF+SS. In the following sections, we reveal that chSS+BF can tonal components by kurtosis, it is worth mentioning that output a signal with less musical noise than BF+SS in almost kurtosis itself is not sufficient to measure musical noise. This all cases on the basis of HOS. is because that the kurtosis of some unprocessed signals such as speech signals is also high, but we do not perceive speech 3. Kurtosis-Based Musical-Noise as musical noise. Since we aim to count only the musical- Generation Metric noise components, we should not consider genuine tonal components. To achieve this aim, we focus on the fact that 3.1. Introduction. It has been reported by the authors that the musical noise is generated only in artificial signal processing. amount of musical noise generated is strongly related to the Hence, we should consider the change in kurtosis during difference between the kurtosis of a signal before and after signal processing. Consequently, we introduce the following signal processing. Thus, in this paper, we analyze the amount kurtosis ratio [18] to measure the kurtosis change: of musical noise generated through BF+SS and chSS+BF on kurtproc the basis of the change in the measured kurtosis. Hereinafter, kurtosis ratio = , (9) we give details of the kurtosis-based musical-noise metric. kurtinput EURASIP Journal on Advances in Signal Processing 5 Frequency (Hz) Frequency (Hz)

Time (s) Time (s) (a) (b)

Figure 4: (a) Observed spectrogram and (b) processed spectrogram.

where kurtproc is the kurtosis of the processed signal and 4.2. Signal Model Used for Analysis. Musical-noise compo- kurtinput is the kurtosis of the input signal. A larger kurtosis nents generated from the noise-only period are dominant ratio (1) indicates a marked increase in kurtosis as a result in spectrograms (see Figure 4); hence, we mainly focus our of processing, implying that a larger amount of musical noise attention on musical-noise components originating from is generated. On the other hand, a smaller kurtosis ratio input noise signals. (1) implies that less musical noise is generated. It has been Moreover, to evaluate the resultant kurtosis of SS, we confirmed that this kurtosis ratio closely matches the amount introduce a gamma distribution to model the noise in the of musical noise in a subjective evaluation based on human power domain [26–28]. The p.d.f. of the gamma distribution hearing [18]. for random variable x is defined as 1 α−1 x PGM(x) = · x exp − , (10) 4. Kurtosis-Based Musical-Noise Analysis for Γ(α)θα θ Microphone Array Signal Processing and SS where x ≥ 0, α>0, and θ>0. Here, α denotes the shape parameter, θ is the scale parameter, and Γ(·) is the gamma 4.1. Analysis Flow. In the following sections, we carry out an function. The gamma distribution with α = 1 corresponds analysis on musical-noise generation in BF+SS and chSS+BF to the chi-square distribution with two degrees of freedom. based on kurtosis. The analysis is composed of the following Moreover, it is well known that the mean of x for a gamma three parts. distribution is E[x] = αθ,whereE[·] is the expectation operator. Furthermore, the kurtosis of a gamma distribution, (i) First, an analysis on musical-noise generation in kurtGM, can be expressed as [18] BF+SS and chSS+BF based on kurtosis that does (α +2)(α +3) not take noise reduction performance into account kurt = . (11) GM ( +1) is performed in this section. α α Moreover, let us consider the power-domain noise signal, (ii) The noise reduction performance is analyzed in xp, in the frequency domain, which is defined as Section 5, and we reveal that the noise reduction =| · |2 performances of BF+SS and chSS+BF are not equiv- xp xre + i xim ∗ alent. Moreover, a flooring parameter designed to = (xre + i · xim)(xre + i · xim) (12) align the noise reduction performances of BF+SS and = x2 + x2 , chSS+BF is also derived to ensure the fair comparison re im of BF+SS and chSS+BF. where xre is the real part of the complex-valued signal and xim is its imaginary part, which are independent and identically (iii) The kurtosis-based comparison between BF+SS and distributed (i.i.d.) with each other, and the superscript ∗ chSS+BF under the same noise reduction perfor- expresses complex conjugation. Thus, the power-domain mance conditions is carried out in Section 6. signal is the sum of two squares of random variables with the same distribution. In the analysis in this section, we first clarify how kurtosis Hereinafter, let xre and xim be the signals after DFT is affected by SS. Next, the same analysis is applied to analysis of signal in a specific microphone j, xj ,andwe DS. Finally, we analyze how kurtosis is increased by BF+SS suppose that the statistical properties of xj equal to xre and and chSS+BF. Note that our analysis contains no limiting xim. Moreover, we assume the following; xj is i.i.d. in each assumptions on the statistical characteristics of noise; thus, channel, the p.d.f. of xj is symmetrical, and its mean is zero. 
all noises including Gaussian and super-Gaussian noise can These assumptions mean that the odd-order cumulants and be considered. moments are zero except for the first order. 6 EURASIP Journal on Advances in Signal Processing

Before subtraction After subtraction P.d.f. after SS As a result of subtraction, (1) p.d.f. is laterally shifted to the P.d.f. after SS zero-power direction, and Original p.d.f. (2) negative components with nonzero probability arise.

0 βαθ 0 βαθ

Flooring 0 βαη2θ (3) The region corresponding to (4) Positive components the negative components is remain as they are. compressed by a small positive (5) Remaining positive components flooring parameter η. andflooredcomponentsaremerged.

Figure 5: Deformation of original p.d.f. of power-domain signal via SS.

Although kurtx = 3ifx is a Gaussian signal, note where z is the random variable of the p.d.f. after SS. The that the kurtosis of a Gaussian signal in the power spectral derivation of PSS(z) is described in Appendix A. domain is 6. This is because a Gaussian signal in the time From (13), the kurtosis after SS can be expressed as domain obeys the chi-square distribution with two degrees of F freedom in the power spectral domain; for such a chi-square = α, β, η 2 = kurtSS Γ(α) , (14) distribution, μ4/μ2 6. G2 α, β, η

4.3. Resultant Kurtosis after SS. In this section, we analyze the where kurtosis after SS. In traditional SS, the long-term-averaged G power spectrum of a noise signal is utilized as the estimated α, β, η = Γ(α)Γ βα, α +2 − 2βαΓ βα, α +1 noise power spectrum. Then, the estimated noise power + 2 2 , + 4 , +2 , spectrum multiplied by the oversubtraction parameter β β α Γ βα α η γ βα α is subtracted from the observed power spectrum. When a F α, β, η = Γ βα, α +4 − 4βαΓ βα, α +3 gamma distribution is used to model the noise signal, its mean is αθ. Thus, the amount of subtraction is βαθ.The +6β2α2Γ βα, α +2 − 4β3α3Γ βα, α +1 subtraction of the estimated noise power spectrum in each frequency band can be considered as a shift of the p.d.f. to + β4α4Γ βα, α + η8γ βα, α +4 . the zero-power direction (see Figure 5). As a result, negative- (15) power components with nonzero probability arise. To avoid this, such negative components are replaced by observations Here, Γ(b, a) is the upper incomplete gamma function that are multiplied by a small positive value η (the so-called defined as

flooring technique). This means that the region correspond- ∞ ing to the probability of the negative components, which Γ(b, a) = ta−1 exp{−t}dt, (16) forms a section cut from the original gamma distribution, is b compressed by the effect of the flooring. Finally, the floored components are superimposed on the laterally shifted p.d.f. and γ(b, a) is the lower incomplete gamma function defined as (see Figure 5). Thus, the resultant p.d.f. after SS, PSS(z), can be written as b ⎧ γ(b, a) = ta−1 exp{−t}dt. (17) ⎪ ⎪ 1 α−1 z + βαθ 0 ⎪ z + βαθ exp − ⎪ θαΓ(α) θ ⎪ ⎪ The detailed derivation of (14)isgiveninAppendix B. ⎪ ≥ 2 ⎪ z βαη θ , Although Uemura et al. have given an approximated form ⎪ ⎨⎪ (lower bound) of the kurtosis after SS in [18], (14)involves = 1 α−1 z + βαθ PSS(z) ⎪ z + βαθ exp − no approximation throughout its derivation. Furthermore, ⎪ α ⎪ θ Γ(α) θ (14) takes into account the effect of the flooring technique ⎪ ⎪ unlike [18]. ⎪ 1 − z ⎪ + zα 1 exp − ⎪ 2 α 2 Figure 6(a) depicts the theoretical kurtosis ratio after ⎪ η θ Γ(α) η θ ⎩⎪ SS, kurtSS/kurtGM, for various values of oversubtraction 0

60 100

50 80

40 60 30 40 Kurtosis ratio20 Kurtosis ratio

10 20

0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 10 100 Oversubtraction parameter Input kurtosis

η = 0 η = 0.2 β = 1 β = 4 η = 0.1 η = 0.4 β = 2 β = 8 (a) (b)

Figure 6: (a) Theoretical kurtosis ratio after SS for various values of oversubtraction parameter β and flooring parameter η. In this figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical kurtosis ratio after SS for various values of input kurtosis. In this figure, flooring parameter η is fixed to 0.0. to a Gaussian signal. From this figure, it is confirmed that For cumulants, when X and Y are independent random thekurtosis ratio is basically proportional to the oversub- variables it is well known that the following relation holds: traction parameter β.However,kurtosisdoesnotmono- cum ( + ) = ncum ( ) + ncum ( ), tonically increase when the flooring parameter is nonzero. n aX bY a n X b n Y (18)

For instance, the kurtosis ratio is smaller than the peak where cumn(·) denotes the nth-order cumulant. The cumu- = = value when β 4andη 0.4. This phenomenon can be lants of the random variable X,cumn(X), are defined by explained as follows. For a large oversubtraction parameter, a cumulant-generating function, which is the logarithm of almost all the spectral components become negative due to the moment-generating function. The cumulant-generating the larger lateral shift of the p.d.f. by SS. Since flooring is function C(ζ)isdefinedas applied to avoid such negative components, almost all the ∞ components are reconstructed by flooring. Therefore, the ζn C(ζ) = log E exp ζX = cumn(X) , (19) statistical characteristics of the signal never change except for n=1 n! its amplitude if η =/ 0. Generally, kurtosis does not depend on the change in amplitude; consequently, it can be considered where ζ is an auxiliary variable and E[exp{ζX}] is the that kurtosis does not markedly increase when a larger moment-generating function. Thus, the nth-order cumulant oversubtraction parameter and a larger flooring parameter cumn(X)isrepresentedby are set. (n) cumn(X) = C (0), (20) The relation between the theoretical kurtosis ratio and the kurtosis of the original input signal is shown in where C(n)(ζ) is the nth-order derivative of C(ζ). Figure 6(b). In the figure, η is fixed to 0.0. It is revealed Now we consider the DS beamformer, which is steered that the kurtosis ratio after SS rapidly decreases as the to θU = 0 and whose array weights are 1/J. Using (18), the input kurtosis increases, even with the same oversubtraction resultant nth-order cumulant after DS, Kn = cumn(yDS), parameter β. Therefore, the kurtosis ratio after SS, which is can be expressed by related to the amount of musical noise, strongly depends on 1 the statistical characteristics of the input signal. That is to say, K = n n−1 Kn, (21) SS generates a larger amount of musical noise for a Gaussian J input signal than for a super-Gaussian input signal. This fact where Kn = cumn(xj ) is the nth-order cumulant of xj . hasbeenreportedin[18]. Therefore, using (21) and the well-known mathematical rela- tion between cumulants and moments, the power-spectral- 4.4. Resultant Kurtosis after DS. In this section, we analyze domain kurtosis after DS, kurtDS can be expressed by the kurtosis after DS, and we reveal that DS can reduce the K K2 K K K2K K4 = 8 +38 4 +32 2 6 + 288 2 4 + 192 2 kurtosis of input signals. Since we assume that the statistical kurtDS K2 K2K K4 . ff 2 4 +16 2 4 +32 2 properties of xre or xim are the same as that of xj , the e ect (22) of DS on the change in kurtosis can be derived from the cumulants and moments of xj . The detailed derivation of (22) is described in Appendix C. 8 EURASIP Journal on Advances in Signal Processing

100 100

80 80

60 60

40 40 Output kurtosis Output kurtosis

20 20

6 6 20 40 60 80 100 20 40 60 80 100 Input kurtosis Input kurtosis (a) 1-microphone case (b) 2-microphone case

100 100

80 80

60 60

40 40 Output kurtosis Output kurtosis

20 20

6 6 20 40 60 80 100 20 40 60 80 100 Input kurtosis Input kurtosis

Simulation Simulation Theoretical Theoretical Approximated Approximated (c) 4-microphone case (d) 8-microphone case

Figure 7: Relation between input kurtosis and output kurtosis after DS. Solid lines indicate simulation results, broken lines express theoreticalplotsobtainedby(22), and dotted lines show approximate results obtained by (23).

Regarding the power-spectral components obtaining explicit function form: from a gamma distribution, we illustrate the relation between input kurtosis and output kurtosis after DS in − kurt  J 0.7 · (kurt − 6) +6, (23) Figure 7. In the figure, solid lines indicate simulation DS in results and broken lines show theoretical relations given by (22). The simulation results are derived as follows. First, where kurtin is the input kurtosis. The approximated plots multichannel signals with various values of kurtosis are also match the simulation results in Figure 7. generated artificially from a gamma distribution. Next, DS When input signals involve interchannel correlation, the is applied to the generated signals. Finally, kurtosis after DS relation between input kurtosis and output kurtosis after is estimated from the signal resulting from DS. From this DS approaches that for only one microphone. If all input figure, it is confirmed that the theoretical plots closely fit signals are identical signals, that is, the signals are completely the simulation results. The relation between input/output correlated, the output after DS also becomes the same as the kurtosis behaves as follows: (i) The output kurtosis is very input signal. In such a case, the effect of DS on the change close to a linear function of the input kurtosis, and (ii) in kurtosis corresponds to that for only one microphone. the output kurtosis is almost inversely proportional to the However, the interchannel correlation is not equal to one number of microphones. These behaviors result in the within all frequency subbands for a diffuse noise field that following simplified (but useful) approximation with an is a typically considered noise field. It is well known that the EURASIP Journal on Advances in Signal Processing 9

18 18

15 15

12 12 Kurtosis Kurtosis 9 9

6 6 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Number of microphones Number of microphones

Experimental Experimental Theoretical Theoretical (a) 1000 Hz (b) 8000 Hz

Figure 8: Simulation result for noise with interchannel correlation (solid line) and theoretical effect of DS assuming no interchannel correlation (broken line) in each frequency subband.

27 we experimentally investigate the effect of interchannel correlation in the following. 24 Figures 8 and 9 show preliminary simulation results of 21 DS. In this simulation, SS is first applied to a multichannel Gaussian signal with interchannel correlation in the diffuse 18 noise field. Next, DS is applied to the signal after SS. In the 15 preliminary simulation, the interelement distance between Kurtosis microphones is 2.15 cm. From the results shown in Figures 12 8(a) and 9, we can confirm that the effect of DS on kurtosis is weak in lower-frequency subbands, although it should be 9 noted that the effect does not completely disappear. Also, 6 the theoretical kurtosis curve is in good agreement with the 0 2000 4000 6000 8000 actual results in higher-frequency subbands (see Figures 8(b) Frequency and 9). This is because the interchannel correlation is weak in higher-frequency subbands. Consequently, for a diffuse noise Observed Experimental field, DS can reduce the kurtosis of the input signal even if Theoretical interchannel correlation exists. If input noise signals contain no interchannel correlation, Figure 9: Simulation result for noise with interchannel correlation ff ff the distance between microphones does not a ect the results. (solid line), theoretical e ect of DS assuming no interchannel That is to say, the kurtosis change via DS can be well fit to correlation (broken line), and kurtosis of the observed signal (23). Otherwise, in lower-frequency subbands, it is expected without any signal processing (dotted line) in eight-microphone ff case. that the mitigation e ect of kurtosis by DS degrades with decreasing distance between microphones. This is because the interchannel correlation in lower-frequency subbands increases with decreasing distance between microphones. ff intensity of the interchannel correlation is strong in lower- In higher-frequency subbands, the e ect of the distance frequency subbands and weak in higher-frequency subbands between microphones is thought to be small. for the diffuse noise field [1]. Therefore, in lower-frequency subbands, it can be expected that DS does not significantly 4.5. Resultant Kurtosis: BF+SS versus chSS+BF. In the pre- reduce the kurtosis of the signal. vious subsections, we discussed the resultant kurtosis after As it is well known that the interchannel correlation for SS and DS. In this subsection, we analyze the resultant adiffuse noise field between two measurement locations kurtosis for two types of composite systems, that is, BF+SS can be expressed by the sinc function [1], we can state and chSS+BF, and compare their effect on musical-noise how array signal processing is affected by the interchannel generation. As described in Section 3, it is expected that a correlation. However, we cannot know exactly how cumu- smaller increase in kurtosis leads to a smaller amount of lants are changed by the interchannel correlation because musical noise generated. (18) only holds when signals are mutually independent. In BF+SS, DS is first applied to a multichannel input Therefore, we cannot formulate how kurtosis is changed via signal. At this point, the resultant kurtosis in the power DS for signals with interchannel correlation. For this reason, spectral domain, kurtDS,canberepresentedby(23). Using 10 EURASIP Journal on Advances in Signal Processing

(11), we can derive a shape parameter for the gamma First, we derive the average power of the input signal. We distribution corresponding to kurtDS, α,as assume that the input signal in the power domain can be modeled by a gamma distribution. Then, the average power 2 − kurtDS +14kurtDS +1 kurtDS +5 of the input signal is given as α = . (24) 2kurtDS − 2 ∞ E[nin] = E[x] = xPGM(x)dx The derivation of (24) is shown in Appendix D. Conse- 0 quently, using (14)and(24), the resultant kurtosis after ∞ 1 α−1 x BF+SS, kurtBF+SS,canbewrittenas = x · x exp − dx (29) 0 θαΓ(α) θ F α, β, η ∞ kurt = Γ(α) . (25) 1 x BF+SS G2 α, β, η = xα exp − dx. θαΓ(α) 0 θ In chSS+BF, SS is first applied to each input channel. Here, let t = x/θ, then θdt = dx.Thus, Thus, the output kurtosis after channelwise SS, kurtchSS,is given by ∞ 1 α E[nin] = (θt) exp{−t}θdt F θαΓ(α) 0 α, β, η kurtchSS = Γ(α) . (26) G2 α, β, η α+1 ∞ = θ α {− } α t exp t dt (30) Finally, DS is performed and the resultant kurtosis after θ Γ(α) 0 chSS+BF, kurtchSS+BF,canbewrittenas θΓ(α +1) = = θα. Γ(α) F α, β, η kurt = −0.7 ( ) − 6 + 6, (27) chSS+BF J Γ α G2 α, β, η This corresponds to the mean of a random variable with a gamma distribution. where we use (23). Next, the average power of the signal after SS is calcu- We should compare kurt and kurt here. BF+SS chSS+BF lated. Here, let z obey the p.d.f. of the signal after SS, P (z), However, one problem still remains: comparison under SS defined by (13); then the average power of the signal after SS equivalent noise reduction performance; the noise reduction can be expressed as performances of BF+SS and chSS+BF are not equivalent as described in the next section. Moreover, the design of a [ ] = [ ] flooring parameter so that the noise reduction performances E nout E z of both methods become equivalent will be discussed in ∞ = the next section. Therefore, kurtBF+SS and kurtchSS+BF will zPSS(z)dz 0 be compared in Section 6 under equivalent noise reduction ∞ performance conditions. z − z + βαθ = z + βαθ α 1 exp − dz 0 θαΓ(α) θ 5. Noise Reduction Performance Analysis βαη2θ z α−1 − z + α z exp 2 dz. In the previous section, we did not discuss the noise reduc- 0 η2θ Γ(α) η θ tion performances of BF+SS and chSS+BF. In this section, a (31) mathematical analysis of the noise reduction performances of BF+SS and chSS+BF is given. As a result of this analysis, it We now consider the first term of the right-hand side in (31). is revealed that the noise reduction performances of BF+SS We let t = z + βαθ, then dt = dz. As a result, and chSS+BF are not equivalent even if the same parameters ∞ are set in the SS part. We then derive a flooring-parameter z α−1 z + βαθ design strategy for aligning the noise reduction performances z + βαθ exp − dz 0 θαΓ(α) θ of BF+SS and chSS+BF. ∞ = − · 1 · α−1 − t t βαθ α t exp dt 5.1. Noise Reduction Performance of SS. We utilize the βαθ θ Γ(α) θ following index to measure the noise reduction performance ∞ (NRP): = 1 · α − t α t exp dt (32) βαθ θ Γ(α) θ = E[nout] ∞ NRP 10 log10 , (28) βαθ α−1 t E[nin] − · − α t exp dt βαθ θ Γ(α) θ where nin is the power-domain (noise) signal of the input and θ · Γ βα, α +1 Γ βα, α nout is the power-domain (noise) signal of the output after = − βαθ · . processing. Γ(α) Γ(α) EURASIP Journal on Advances in Signal Processing 11

35 35

30 30

25 25

20 20

15 15

10 10

5 5 Noise reduction performance (dB) Noise reduction performance (dB) 0 0 012345678 6 10 100 Oversubtraction parameter Input kurtosis

η = 0 η = 0.2 β = 1 β = 4 η = 0.1 η = 0.4 β = 2 β = 8 (a) (b)

Figure 10: (a) Theoretical noise reduction performance of SS with various oversubtraction parameters β and flooring parameters η. In this figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical noise reduction performance of SS with various values of input kurtosis. In this figure, flooring parameter η is fixed to 0.0.

Also, we deal with the second term of the right-hand side in In this figure, η is fixed to 0.0. It is revealed that NRPSS (31). We let t = z/(η2θ) then η2θdt = dz, resulting in decreases as the input kurtosis increases. This is because the mean of a high-kurtosis signal tends to be small. Since the βαη2θ shape parameter α of a high-kurtosis signal becomes small, z α−1 − z the mean αθ corresponding to the amount of subtraction α z exp 2 dz 0 η2θ Γ(α) η θ also becomes small. As a result, NRPSS is decreased as the input kurtosis increases. That is to say, the NRP strongly 1 βα SS = 2 α · exp{− } 2 d (33) depends on the statistical characteristics of the input signal 2 α η θt t η θ t η θ Γ(α) 0 as well as the values of the oversubtraction and flooring η2θ parameters. = γ βα, α +1 . Γ(α) 5.2. Noise Reduction Performance of DS. It is well known that the noise reduction performance of DS (NRP )is Using (30), (32), and (33), the noise reduction performance DS proportional to the number of microphones. In particular, of SS, NRPSS, can be expressed by for spatially uncorrelated multichannel signals, NRPDS is given as [1] = 10 E[z] NRPSS 10 log = E[x] NRPDS 10 log10J. (35) Γ βα, α +1 =−10 log 5.3. Resultant Noise Reduction Performance: BF+SS versus 10 ( +1) Γ α chSS+BF. In the previous subsections, the noise reduction Γ βα, α γ βα, α +1 performances of SS and DS were discussed. In this subsec- −β · + η2 . tion, we derive the resultant noise reduction performances Γ(α) Γ(α +1) of the composite systems of SS and DS, that is, BF+SS and (34) chSS+BF. ThenoisereductionperformanceofBF+SSisanalyzed Figure 10(a) shows the theoretical value of NRPSS for as follows. In BF+SS, DS is first applied to a multichannel variousvaluesofoversubtractionparameterβ and flooring input signal. If this input signal is spatially uncorrelated, its parameter η, where the kurtosis of the input signal is fixed noise reduction performance can be represented by 10 log10J. to 6.0, corresponding to a Gaussian signal. From this figure, After DS, SS is applied to the signal. Note that DS affects it is confirmed that NRPSS is proportional to β.However, the kurtosis of the input signal. As described in Section 4.4, NRPSS hits a peak when η is nonzero even for a large value of the resultant kurtosis after DS can be approximated as −0.7 β. The relation between the theoretical value of NRRSS and J · (kurtin − 6) + 6. Thus, SS is applied to the kurtosis- the kurtosis of the input signal is illustrated in Figure 10(b). modified signal. Consequently, using (24), (34), and (35), 12 EURASIP Journal on Advances in Signal Processing

24 24

20 20

16 16

12 12 Noise reduction performance (dB) Noise reduction performance (dB)

8 8 0246802468 Oversubtraction parameter Oversubtraction parameter (a) Input kurtosis = 6 (b) Input kurtosis = 20

24

20

16

12 Noise reduction performance (dB)

8 02468 Oversubtraction parameter

BF+SS chSS+BF (c) Input kurtosis = 80

Figure 11: Comparison of noise reduction performances of chSS+BF with BF+SS. In this figure, flooring parameter is fixed to 0.2 and number of microphones is 8.

the noise reduction performance of BF+SS, NRPBF+SS,is (34)and(35), the noise reduction performance of chSS+BF, given as NRPchSS+BF,canberepresentedby

NRPBF+SS NRPchSS+BF = − 10 log10J 10 log10 1 =−10 log 10 · ( ) Γ βα, α +1 Γ βα, α γ βα, α +1 J Γ α (37) × − β · + η2 Γ(α +1) Γ(α) Γ(α +1) Γ βα, α +1 γ βα, α +1 (36) × − β · Γ βα, α + η2 . 1 α α =−10 log 10 J · Γ(α) Γ βα, α +1 γ βα, α +1 Figure 11 depicts the values of NRPBF+SS and NRPchSS+BF. × − β · Γ βα, α + η2 , From this result, we can see that the noise reduction α α performances of both methods are equivalent when the input signal is Gaussian. However, if the input signal is super- where α is defined by (24). Gaussian, NRPBF+SS exceeds NRPchSS+BF. This is due to the In chSS+BF, SS is first applied to a multichannel input fact that DS is first applied to the input signal in BF+SS; signal; then DS is applied to the resulting signal. Thus, using thus, DS reduces the kurtosis of the signal. Since NRPSS for EURASIP Journal on Advances in Signal Processing 13

1.5 1.5

1 1

R 0.5 R 0.5

0 0

−0.5 −0.5 10 100 10 100 Input kurtosis Input kurtosis (a) Flooring parameter η = 0.0 (b) Flooring parameter η = 0.1

1.5 1.5

1 1

R 0.5 R 0.5

0 0

−0.5 −0.5 10 100 10 100 Input kurtosis Input kurtosis

1mic. 4mics. 1mic. 4mics. 2mics. 8mics. 2mics. 8mics. (c) Flooring parameter η = 0.2 (d) Flooring parameter η = 0.4

Figure 12: Theoretical kurtosis ratio between BF+SS and chSS+BF for various values of input kurtosis. In this figure, oversubtraction parameter is β = 2.0 and flooring parameter in chSS+BF is (a) η = 0.0, (b) η = 0.1, (c) η = 0.2, and (d) η = 0.4. a low-kurtosis signal is greater than that for a high-kurtosis where signal (see Figure 10(b)), the noise reduction performance of Γ βα, α+1 γ βα, α+1 BF+SS is superior to that of chSS+BF. H α, β, η = − β · Γ βα, α +η2 , α α This discussion implies that NRPBF+SS and NRPchSS+BF (39) Γ βα, α +1 are not equivalent under some conditions. Thus the kurtosis- I α, β = − β · Γ βα, α . (40) based analysis described in Section 4 is biased and requires α some adjustment. In the following subsection, we will discuss The detailed derivation of (38)isgiveninAppendix E.By how to align the noise reduction performances of BF+SS and replacing η in (3) with this new flooring parameter η,wecan chSS+BF. align NRPBF+SS and NRPchSS+BF to ensure a fair comparison.

5.4. Flooring-Parameter Design in BF+SS for Equivalent Noise 6. Output Kurtosis Comparison under Reduction Performance. In this section, we describe the Equivalent NRP Condition flooring-parameter design in BF+SS so that NRPBF+SS and NRPchSS+BF become equivalent. In this section, using the new flooring parameter for BF+SS, Using (36)and(37), the flooring parameter η that makes η, we compare the output kurtosis of BF+SS and chSS+BF. NRPBF+SS equal to NRPchSS+BF,is Setting η to (25), the output kurtosis of BF+SS is modified to α Γ(α) η = · H α, β, η − I α, β , (38) F α, β, η γ βα, α +1 Γ(α) kurt = Γ(α) . (41) BF+SS G2 α, β, η 14 EURASIP Journal on Advances in Signal Processing

4 4

3 3

2 2

1 1 R R 0 0

−1 −1

−2 −2

−3 −3 0 5 10 15 20 0 5 10 15 20 Oversubtraction parameter Oversubtraction parameter

η = 0 η = 0.2 η = 0 η = 0.2 η = 0.1 η = 0.4 η = 0.1 η = 0.4 (a) Input kurtosis = 6.0 (b) Input kurtosis = 20.0

Figure 13: Theoretical kurtosis ratio between BF+SS and chSS+BF for various oversubtraction parameters. In this figure, number of microphones is fixed to 8, and input kurtosis is (a) 6.0 (Gaussian) and (b) 20.0 (super-Gaussian).

Loudspeakers (for interferences) In this figure, β is fixed to 2.0 and the flooring parameter in chSS+BF is set to η = 0.0, 0.1, 0.2, and 0.4. The flooring parameter for BF+SS is automatically determined by (38). From this figure, we can confirm that chSS+BF Loudspeaker(fortargetsource) reduces the kurtosis more than BF+SS for almost all input signals with various values of input kurtosis. Theoretical values of R for various oversubtraction parameters are depicted in Figure 13. Figure 13(a) shows that the output kurtosis after chSS+BF is always less than that after BF+SS for a Gaussian signal, even if η is nonzero. On the other 1m hand, Figure 13(b) implies that the output kurtosis after

3.9 m BF+SS becomes less than that after chSS+BF for some parameter settings. However, this only occurs for a large Microphone array oversubtraction parameter, for example, β ≥ 7, which is not (with interelement spacing of 2.15 cm) often applied in practical use. Therefore, it can be considered that chSS+BF reduces the kurtosis and musical noise more than BF+SS in almost all cases.

(Reverberation time: 200 ms) 7. Experiments and Results 7.1. Computer Simulations. First, we compare BF+SS and 3.9 m chSS+BF in terms of kurtosis ratio and noise reduction Figure 14: Reverberant room used in our simulations. performance. We use 16-kHz-sampled signals as test data, in which the target speech is the original speech convoluted with impulse responses recorded in a room with 200 Here, we adopt the following index to compare the resultant millisecond reverberation (see Figure 14), and to which an kurtosis after BF+SS and chSS+BF: artificially generated spatially uncorrelated white Gaussian or super-Gaussian signal is added. We use six speakers kurtBF+SS R = ln , (42) (six sentences) as sources of the original clean speech. The kurt chSS+BF number of microphone elements in the simulation is varied where R expresses the resultant kurtosis ratio between BF+SS from 2 to 16, and their interelement distance is 2.15 cm each. and chSS+BF. Note that a positive R indicates that chSS+BF The oversubtraction parameter β is set to 2.0 and the flooring reduces the kurtosis more than BF+SS, implying that less parameter for BF+SS, η, is set to 0.0, 0.2, 0.4, or 0.8. Note musical noise is generated in chSS+BF. The behavior of that the flooring parameter in chSS+BF is set to 0.0. In the R is depicted in Figures 12 and 13. Figure 12 illustrates simulation, we assume that the long-term-averaged power theoretical values of R for various values of input kurtosis. spectrum of noise is estimated perfectly in advance. EURASIP Journal on Advances in Signal Processing 15

10 20 9 8 7 15 6 5

Kurtosis ratio 4 10 3 2 Noise reduction performance (dB) 1 5 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Number of microphones Number of microphones

chSS+BF BF+SS (η = 0.4) chSS+BF BF+SS (η = 0.4) BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0.2) BF+SS (η = 0.2) (a) (b)

Figure 15: Results for Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring parameters.

Here, we utilize the kurtosis ratio defined in Section 3.4 reduction performance closely fit the experimental results. to measure the difference in kurtosis, which is related to These findings also support the validity of the analysis in the amount of musical noise generated. The kurtosis ratio Sections 4, 5,and6. is given by Figures 18–20 illustrate the simulation results for a super- Gaussian input signal. It is confirmed from Figure 18(a) that kurt nproc f , τ the kurtosis ratio of chSS+BF also decreases monotonically Kurtosis ratio = , (43) with increasing number of microphones. Unlike the case kurt norg f , τ of the Gaussian input signal, the kurtosis ratio of BF+SS where nproc( f , τ) is the power spectra of the residual noise with η = 0.8 also decreases with increasing number of signal after processing, and norg( f , τ) is the power spectra microphones. However, for a lower value of the flooring of the original noise signal before processing. This kurtosis parameter, the kurtosis ratio of BF+SS is not degraded. ratio indicates the extent to which kurtosis is increased Moreover, the kurtosis ratio of chSS+BF is lower than that with processing. Thus, a smaller kurtosis ratio is desirable. of BF+SS for almost all cases. For the super-Gaussian input Moreover, the noise reduction performance is measured signal, in contrast to the case of the Gaussian input signal, using (28). the noise reduction performance of BF+SS with η = 0.0 Figures 15–17 show the simulation results for a Gaussian is greater than that of chSS+BF (see Figure 18(b)). That input signal. From Figure 15(a), we can see that the kurtosis is to say, the noise reduction performance of BF+SS is ratio of chSS+BF decreases almost monotonically with superior to that of chSS+BF for the same flooring parameter. increasing number of microphones. On the other hand, the This result is consistent with the analysis in Section 5.The kurtosis ratio of BF+SS does not exhibit such a tendency noise reduction performance of BF+SS with η = 0.4is regardless of the flooring parameter. Also, the kurtosis ratio comparable to that of chSS+BF. However, the kurtosis ratio of chSS+BF is lower than that of BF+SS for all cases except of chSS+BF is still lower than that of BF+SS with η = 0.4. for η = 0.8. Moreover, we can confirm from Figure 15(b) This result also coincides with the analysis in Section 6. that the values of noise reduction performance for BF+SS On the other hand, the kurtosis ratio of BF+SS with η = with flooring parameter η = 0.0 and chSS+BF are almost the 0.8 is almost the same as that of chSS+BF. However, the same. When the flooring parameter for BF+SS is nonzero, noise reduction performance of BF+SS with η = 0.8is the kurtosis ratio of BF+SS becomes smaller but the noise lower than that of chSS+BF. Thus, it can be confirmed that reduction performance degrades. On the other hand, for chSS+BF reduces the kurtosis ratio more than BF+SS for Gaussian signals, chSS+BF can reduce the kurtosis ratio, a super-Gaussian signal under the same noise reduction that is, reduce the amount of musical noise generated, performance. Furthermore, the theoretical kurtosis ratio and without degrading the noise reduction performance. Indeed noise reduction performance closely fit the experimental BF+SS with η = 0.8 reduces the kurtosis ratio more than results in Figures 19 and 20. chSS+BF, but the noise reduction performance of BF+SS We also compare speech distortion originating from is extremely degraded. 
Furthermore, we can confirm from chSS+BF and BF+SS on the basis of cepstral distortion Figures 16 and 17 that the theoretical kurtosis ratio and noise (CD) [29] for the four-microphone case. The comparison 16 EURASIP Journal on Advances in Signal Processing

10 10

8 8

6 6

Kurtosis ratio 4 Kurtosis ratio 4

2 2

2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (a) chSS+BF (b) BF+SS (η = 0.0)

10 10

8 8

6 6

Kurtosis ratio 4 Kurtosis ratio 4

2 2

2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

10

8

6

Kurtosis ratio 4

2

2 4 6 8 10 12 14 16 Number of microphones

Experimental Theoretical (e) BF+SS (η = 0.8)

Figure 16: Comparison between experimental and theoretical kurtosis ratios for Gaussian input signal. EURASIP Journal on Advances in Signal Processing 17

20 20

15 15

10 10 Noise reduction performance Noise reduction performance

5 5 2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (a) chSS+BF (b) BF+SS (η = 0.0)

20 20

15 15

10 10 Noise reduction performance Noise reduction performance

5 5 2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

20

15

10 Noise reduction performance

5 2 4 6 8 10 12 14 16 Number of microphones

Experimental Theoretical (e) BF+SS (η = 0.8)

Figure 17: Comparison between experimental and theoretical noise reduction performances for Gaussian input signal. 18 EURASIP Journal on Advances in Signal Processing

6 20

5

15 4

3 Kurtosis ratio 10 2 Noise reduction performance 1 5 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Number of microphones Number of microphones

chSS+BF BF+SS (η = 0.4) chSS+BF BF+SS (η = 0.4) BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0.2) BF+SS (η = 0.2) (a) (b)

Figure 18: Results for super-Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring parameters.

Table 1: Speech distortion comparison of chSS+BF and BF+SS on (iii) Under the same level of noise reduction performance, the basis of CD for four-microphone case. the amount of musical noise generated via chSS+BF is less than that generated via BF+SS. Input noise type chSS+BF BF+SS Gaussian 6.15 dB 6.45 dB (iv) Thus, the chSS+BF structure is preferable from the viewpoint of musical-noise generation. Super-Gaussian 6.17 dB 5.12 dB (v) However, the noise reduction performance of BF+SS is superior to that of chSS+BF for a super-Gaussian is made under the condition that the noise reduction signal when the same parameters are set in the SS part performances of both methods are almost the same. For for both methods. the Gaussian input signal, the same parameters β = 2.0 and η = 0.0 are utilized for BF+SS and chSS+BF. On (vi) These results imply a trade-off between the amount the other hand, β = 2.0andη = 0.4 are utilized of musical noise generated and the noise reduction for BF+SS and β = 2.0andη = 0.0 are utilized for performance. Thus, we should use an appropriate chSS+BF for the super-Gaussian input signal. Table 1 shows structure depending on the application. the result of the comparison, from which we can see that the amount of speech distortion originating from BF+SS and These results should be applicable under different SNR con- chSS+BF is almost the same for the Gaussian input signal. ditions because our analysis is independent of the noise level. For the super-Gaussian input signal, the speech distortion In the case of more reverberation, the observed signal tends originating from BF+SS is less than that from chSS+BF. This to become Gaussian because many reverberant components is owing to the difference in the flooring parameter for each are mixed. Therefore, the behavior of both methods under method. more reverberant conditions should be similar to that in the In conclusion, all of these results are strong evidence for case of a Gaussian signal. the validity of the analysis in Sections 4, 5,and6. These results suggest the following. 7.2. Subjective Evaluation. Next, we conduct a subjective evaluation to confirm that chSS+BF can mitigate musical (i) Although BF+SS can reduce the amount of musical noise. In the evaluation, we presented two signals processed noise by employing a larger flooring parameter, by BF+SS and by chSS+BF to seven male examinees in it leads to a deterioration of the noise reduction random order, who were asked to select which signal they performance. considered to contain less musical noise (the so-called AB (ii) In contrast, chSS+BF can reduce the kurtosis ratio, method). Moreover, we instructed examinees to evaluate which corresponds to the amount of musical noise only the musical noise and not to consider the amplitude of generated, without degradation of the noise reduc- the remaining noise. Here, the flooring parameter in BF+SS tion performance. was automatically determined so that the output SNR of EURASIP Journal on Advances in Signal Processing 19

6 6

5 5

4 4

3 3 Kurtosis ratio Kurtosis ratio

2 2

1 1 2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (a) chSS+BF (b) BF+SS (η = 0.0)

6 6

5 5

4 4

3 3 Kurtosis ratio Kurtosis ratio

2 2

1 1 2 4 6 8 10121416 2 4 6 8 10121416 Number of microphones Number of microphones (c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

6

5

4

Kurtosis ratio 3

2

1 2 4 6 8 10 12 14 16 Number of microphones

Experimental Theoretical (e) BF+SS (η = 0.8)

Figure 19: Comparison between experimental and theoretical kurtosis ratios for super-Gaussian input signal.

BF+SS and chSS+BF was equivalent. We used the preference used. Note that noises (b) and (c) were recorded in the actual score as the index of the evaluation, which is the frequency of room shown in Figure 14 and therefore include interchannel the selected signal. correlation because they were recordings of actual noise In the experiment, three types of noise, (a) artificial signals. spatially uncorrelated white Gaussian noise, (b) recorded Each test sample is a 16-kHz-sampled signal, and railway-station noise emitted from 36 loudspeakers, and (c) the target speech is the original speech convoluted with recorded human speech emitted from 36 loudspeakers, were impulse responses recorded in a room with 200 millisecond 20 EURASIP Journal on Advances in Signal Processing

20 20

15 15

10 10 Noise reduction performance Noise reduction performance

5 5 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Number of microphones Number of microphones (a) chSS+BF (b) BF+SS (η = 0.0)

20 20

15 15

10 10 Noise reduction performance Noise reduction performance

5 5 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Number of microphones Number of microphones (c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

20

15

10 Noise reduction performance

5 2 4 6 8 10 12 14 16 Number of microphones

Experimental Theoretical (e) BF+SS (η = 0.8)

Figure 20: Comparison between experimental and theoretical noise reduction performances for super-Gaussian input signal.

reverberation (see Figure 14) and to which the above- Figure 21 shows the subjective evaluation results, which mentioned recorded noise signal is added. Ten pairs of signals confirm that the output of chSS+BF is preferred to that per type of noise, that is, a total of 30 pairs of processed of BF+SS, even for actual acoustic noises including non- signals, were presented to each examinee. Gaussianity and interchannel correlation properties. EURASIP Journal on Advances in Signal Processing 21

100 random variable x is replaced with x + βαθ and the gamma 80 distribution becomes 60 = 1 · α−1 − x + βαθ PGM(x) α x + βαθ exp 40 Γ(α)θ θ (A.1) 20 ≥− Preference score (%) x βαθ . 0 Speech from 36 White Gaussian Station noise from ≥ 36 loudspeakers loudspeakers Since the domain of the original gamma distribution is x 0, the domain of the resultant p.d.f. is x ≥−βαθ.Thus, chSS+BF negative-power components with nonzero probability arise, which can be represented by BF+SS 95% confidence interval 1 α−1 x + βαθ Pnegative(x) = · x + βαθ exp − Figure 21: Subjective evaluation results. Γ(α)θα θ −βαθ ≤ x ≤ 0 , (A.2) 8. Conclusion where Pnegative(x)ispartofPGM(x). To remove the negative- In this paper, we analyze two methods of integrating power components, the signals corresponding to Pnegative(x) microphone array signal processing and SS, that is, BF+SS are replaced by observations multiplied by a small positive and chSS+BF, on the basis of HOS. As a result of the analysis, value η. The observations corresponding to (A.2), P (x), it is revealed that the amount of musical noise generated obs are given by via SS strongly depends on the statistical characteristics of the input signal. Moreover, it is also clarified that the noise 1 − x ff P (x) = · (x)α 1 exp − 0 ≤ x ≤ βαθ . reduction performances of BF+SS and chSS+BF are di erent obs Γ(α)θα θ except in the case of a Gaussian input signal. As a result of (A.3) our analysis under equivalent noise reduction performance conditions, it is shown that chSS+BF reduces musical noise Since a small positive flooring parameter η is applied to more than BF+SS in almost all practical cases. The results (A.3), the scale parameter θ becomes η2θ and the range is of a computer simulation also support the validity of our changed from 0 ≤ x ≤ βαθ to 0 ≤ x ≤ βαη2θ. Then, (A.3)is analysis. Moreover, by carrying out a subjective evaluation, modified to it is confirmed that the output of chSS+BF is considered to contain less musical noise than that of BF+SS. These analytic 1 − x P (x) = · (x)α 1 exp − and experimental results imply the considerable potential of floor 2 α 2 Γ(α) η θ η θ (A.4) optimization based on HOS to reduce musical noise. As a future work, it remains necessary to carry out 0 ≤ x ≤ βαη2θ , signal analysis based on more general distributions. For instance, analysis using a generalized gamma distribution where Pfloor(x) is the probability of the floored components. [26, 27] can lead to more general results. Moreover, an exact This Pfloor(x) is superimposed on the p.d.f. given by (A.1) formulation of how kurtosis is changed through DS under within the range 0 ≤ x ≤ βαη2θ. By considering the positive a coherent condition is still an open problem. Furthermore, range of (A.1)andPfloor(x), the resultant p.d.f. of SS can be the robustness of BF+SS and chSS+BF against low-SNR or formulated as more reverberant conditions is not discussed in this paper. In the future, the discussion should involve not only noise PSS(z) reduction performance and musical-noise generation but ⎧ ⎪ also such robustness. ⎪ 1 α−1 z + βαθ ⎪ z + βαθ exp − ⎪ α ( ) ⎪ θ Γ α θ ⎪ ⎪ 2 Appendices ⎪ z ≥ βαη θ , ⎪ ⎨⎪ (A.5) = 1 α−1 z + βαθ A. Derivation of (13) ⎪ z + βαθ exp − ⎪ α ⎪ θ Γ(α) θ ⎪ When we assume that the input signal of the power domain ⎪ ⎪ 1 − z can be modeled by a gamma distribution, the amount ⎪ + zα 1 exp − ⎪ 2 α 2 ⎪ η θ Γ(α) η θ of subtraction is βαθ. The subtraction of the estimated ⎩⎪ noise power spectrum in each frequency subband can be 0

B. Derivation of (14) Consequently, using (B.4)and(B.5), the kurtosis after SS is given as To derive the kurtosis after SS, the 2nd- and 4th-order moments of z are required. For PSS(z), the 2nd-order F moment is given by = α, β, η kurtSS Γ(α) 2 , (B.6) ∞ G α, β, η 2 μ2 = z · PSS(z)dz 0 where ∞ 2 1 α−1 z + βαθ = z z + βαθ exp − dz (B.1) 0 θαΓ(α) θ G α, β, η = Γ(α)Γ βα, α +2 − 2βαΓ βα, α +1 βαη2θ 2 2 4 2 1 α−1 − z + β α Γ βα, α + η γ βα, α +2 , + z α z exp 2 dz. 0 η2θ Γ(α) η θ F α, β, η = Γ βα, α +4 − 4βαΓ βα, α +3 We now expand the first term of the right-hand side of (B.1). Here, let t = (z + βαθ)/θ; then θdt = dz and z = θ(t − βα). +6β2α2Γ βα, α +2 − 4β3α3Γ βα, α +1 Consequently, + β4α4Γ βα, α + η8γ βα, α +4 . ∞ (B.7) 1 − z + βαθ z2 z + βαθ α 1 exp − dz 0 θαΓ(α) θ ∞ C. Derivation of (22) = 2 − 2 1 α−1 {− } θ t βα α (θt) exp t θdt βα θ Γ(α) Asdescribedin(12), the power-domain signal is the sum of two squares of random variables with the same distribution. 2 ∞ θ 2 2 2 −1 (p) = t − 2βαt + β α tα exp{−t}dt Using (18), the power-domain cumulants Kn can be written Γ(α) βα as 2 θ 2 2 ⎧ = , +2 − 2 , +1 + , (p) Γ βα α βαΓ βα α β α Γ βα α . ⎪ = (2) Γ(α) ⎪K1 2K1 , ⎨⎪ (p) (2) (B.2) K = 2K , power-domain cumulants 2 2 (C.1) ⎪ (p) = (2) Next we consider the second term of the right-hand side of ⎪K3 2K3 , 2 2 ⎩⎪ (B.1). Here, let t = z/(η θ); then η θdt = dz.Thus, (p) = (2) K4 2K4 , βαη2θ 2 1 α−1 − z (2) z α z exp 2 dz where Kn is the nth square-domain moment. Here, the 0 η2θ Γ(α) η θ p.d.f. of such a square-domain signal is not symmetrical and βα its mean is not zero. Thus, we utilize the following relations = 2 2 1 2 α−1 {− } 2 η θt α η θt exp t η θdt between the moments and cumulants around the origin: 0 η2θ Γ(α) 4 2 βα ⎧ η θ γ βα, α +2 ⎪ = = tα+1 exp{−t}dt = η4θ2 . ⎨⎪μ1 κ1, ( ) ( ) Γ α 0 Γ α moments μ = κ + κ2, (C.2) (B.3) ⎪ 2 2 1 ⎩ = 2 2 4 μ4 κ4 +4κ3κ1 +3κ2 +6κ2κ1 + κ1, (SS) As a result, the 2nd-order moment after SS, μ2 ,isa composite of (B.2)and(B.3) and is given as where μn is the nth-order raw moment and κn is the nth- 2 (2) (SS) = θ − order cumulant. Moreover, the square-domain moments μn μ2 Γ βα, α +2 2βαΓ βα, α +1 Γ(α) (B.4) can be expressed by +β2α2Γ βα, α + η4γ βα, α +2 . ⎧ ⎪μ(2) = μ , In the same manner, the 4th-order moment after SS, ⎨ 1 2 (SS) (2) = squared-domain moments ⎪μ2 μ4, (C.3) μ4 ,canberepresentedby ⎩⎪ μ(2) = μ . θ4 4 8 μ(SS) = Γ βα, α +4 − 4βαΓ βα, α +3 4 Γ(α) Using (C.1)–(C.3), the power-domain moments can be 2 2 3 3 +6β α Γ βα, α +2 − 4β α Γ βα, α +1 expressed in terms of the 4th- and 8th-order moments in the 4 4 8 time domain. Therefore, to obtain the kurtosis after DS in +β α Γ βα, α + η γ βα, α +4 . the power domain, the moments and cumulants after DS up (B.5) to the 8th order are needed. EURASIP Journal on Advances in Signal Processing 23

The 3rd-, 5th-, and 7th-order cumulants are zero because D. Derivation of (24) we assume that the p.d.f. of xj is symmetrical and that its mean is zero. If these conditions are satisfied, the following According to (11), the shape parameter α corresponding to relations between moments and cumulants hold: the kurtosis after DS, kurtDS, is given by the solution of the quadratic equation: ⎧ ⎪ ⎪μ1 = 0, ⎪ (α +2)(α +3) ⎪ = ⎪ = kurtDS . (D.1) ⎨⎪μ2 κ2, α(α +1) = 2 moments ⎪μ4 κ4 +3κ2, ⎪ This can be expanded as ⎪ 3 ⎪μ = κ +15κ κ +15κ , ⎪ 6 6 4 2 2 ⎩⎪ 2 = 2 2 4 α (kurtDS − 1) + α(kurtDS − 5) − 6 = 0. (D.2) μ8 κ8 +35κ4 +28κ2κ6 + 210κ2κ4 + 105κ2. (C.4) Using the quadratic formula, Using (21)and(C.4), the time-domain moments after − ± 2 kurtDS +5 kurtDS +14kurtDS +1 DS are expressed as α = , (D.3) 2kurtDS − 2 ⎧ ⎪μ(DS) = K , ⎪ 2 2 whose denominator is larger than zero because kurtDS > 1. ⎪ ⎪ (DS) = K K2 Here, since α> 0, we must select the appropriate numerator ⎪μ4 4 +3 2 , ⎨ of (D.3). First, suppose that (DS) = K K K K3 moments after DS ⎪μ6 6 +15 2 4 +15 2 , ⎪ ⎪ (DS) ⎪μ = K +35K2 +28K K − 2 ⎪ 8 8 4 2 6 kurtDS +5+ kurtDS +14kurtDS +1> 0. (D.4) ⎩⎪ K2K K4 +210 2 4 + 105 2 , (C.5) This inequality clearly holds when 1 < kurtDS < 5because − 2 kurtDS +5> 0and kurtDS +14kurtDS +1> 0. Thus, (DS) where μn is the nth-order raw moment after DS in the time − − 2 domain. kurtDS +5> kurtDS +14kurtDS +1. (D.5) Using (C.2), (C.3), and (C.5), the square-domain cumu- lants can be written as When kurtDS ≥ 5, the following relation also holds:

⎧ 2 2 ⎪K(2) =K (−kurtDS +5) < kurt +14kurtDS +1, ⎪ 1 2, DS ⎪ (D.6) ⎪ ⎪K(2) =K K2 ⇐⇒ 24 kurt > 24. ⎨⎪ 2 4 +2 2 , DS K(2) =K K K K3 square-domain cumulants ⎪ 3 6 +12 4 2 +8 2 , ⎪ Since (D.6) is true when kurtDS ≥ 5, (D.4)holds.In ⎪ (2) ⎪K =K +32K2 +24K K summary, (D.4)alwaysholdsfor1< kurt < 5and5≤ ⎪ 4 8 4 2 6 DS ⎩⎪ K2K K4 kurtDS.Thus, +144 2 4 +48 2 , (C.6) − 2 kurtDS +5+ kurtDS +14kurtDS +1> 0forkurtDS > 1. (D.7) (2) where Kn is the nth-order cumulant in the square domain. Moreover, using (C.1), (C.2), and (C.6), the 2nd- and Overall, 4th-order power-domain moments can be written as −kurt +5+ kurt2 +14kurt +1 DS DS DS (D.8) (p) = K K2 > 0. μ2 2 4 +4 2 , 2kurtDS − 2 (p) = K K2 K K K K2 K4 μ4 2 8 +38 4 +32 6 2 + 288 4 2 + 192 2 . On the other hand, let (C.7) − − 2 kurtDS +5 kurtDS +14kurtDS +1> 0. (D.9) As a result, the power-domain kurtosis after DS, kurtDS,is given as This inequality is not satisfied when kurtDS > 5because − 2 kurtDS +5< 0and kurtDS +14kurtDS +1> 0. Now (D.9) K K2 K K K2K K4 can be modified as = 8 +38 4 +32 2 6 + 288 2 4 + 192 2 kurtDS K2 K2K K4 . 2 4 +16 2 4 +32 2 − 2 (C.8) kurtDS +5> kurtDS +14kurtDS +1, (D.10) 24 EURASIP Journal on Advances in Signal Processing

then the following relation also holds for 1 < kurtDS ≤ 5: This can be rewritten as − 2 2 2 Γ(α) γ βα, α +1 ( kurtDS +5) > kurtDS +14kurtDS +1, η (D.11) Γ(α) α ⇐⇒ 24 kurtDS < 24. Γ βα, α +1 γ βα, α +1 = − β · Γ βα, α + η2 (E.5) α α This is not true for 1 < kurtDS ≤ 5. Thus, (D.9)isnot appropriate for kurt > 1. Therefore, α corresponding to DS Γ(α) Γ βα, α +1 kurtDS is given by − − β · Γ βα, α , Γ(α) α − 2 kurtDS +5+ kurtDS +14kurtDS +1 and consequently α = . (D.12) 2kurtDS − 2 α Γ(α) η2 = H α, β, η − I α, β ,(E.6) E. Derivation of (38) γ βα, α +1 Γ(α) H I For 0 <α≤ 1, which corresponds to a Gaussian or super- where (α, β, η)isdefinedby(39)and (α, β)isgivenby Gaussian input signal, it is revealed that the noise reduction (40). Using (E.3)and(E.4), the right-hand side of (E.5)is performance of BF+SS is superior to that of chSS+BF from clearly greater than or equal to zero. Moreover, since Γ(α) > the numerical simulation in Section 5.3. Thus, the following 0, Γ(α) > 0, α> 0, and γ(βα, α +1)> 0, the right-hand side relation holds: of (E.6) is also greater than or equal to zero. Therefore, − 1 α Γ(α) 10 log10 · η = · H α, β, η − I α, β . (E.7) J Γ(α) γ βα, α +1 Γ(α) Γ βα, α +1 γ βα, α +1 × − β · Γ βα, α + η2 α α Acknowledgment 1 ≥−10 log This work was partly supported by MIC Strategic Informa- 10 J · Γ(α) tion and Communications R&D Promotion Programme in Japan. Γ βα, α +1 γ βα, α +1 × − β · Γ βα, α + η2 . α α (E.1) References [1] M. Brandstein and D. Ward, Eds., Microphone Arrays: Signal This inequality corresponds to Processing Techniques and Applications, Springer, Berlin, Ger- many, 2001. 1 Γ βα, α +1 γ βα, α +1 [2] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, − β · Γ βα, α + η2 Γ(α) α α “Computer-steered microphone arrays for sound transduc- tion in large rooms,” Journal of the Acoustical Society of 1 Γ βα, α +1 γ βα, α +1 America, vol. 78, no. 5, pp. 1508–1518, 1985. ≤ − β · Γ βα, α + η2 . [3] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, Γ(α) α α “Microphone array based speech recognition with different (E.2) talker-array positions,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP Then, the new flooring parameter η in BF+SS, which makes ’97), pp. 227–230, Munich, Germany, September 1997. the noise reduction performance of BF+SS equal to that of [4] H. F. Silverman and W. R. Patterson, “Visualizing the perfor- chSS+BF, satisfies η ≥ η (≥ 0) because mance of large-aperture microphone arrays,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), pp. 962–972, 1999. γ βα, α +1 ≥ [5] O. Frost, “An algorithm for linearly constrained adaptive array 0. (E.3) α processing,” Proceedings of the IEEE, vol. 60, pp. 926–935, 1972. Moreover, the following relation for η also holds: [6] L. J. Griffiths and C. W. Jim, “An alternative approach to lin- early constrained adaptive beamforming,” IEEE Transactions 1 Γ βα, α +1 γ βα, α +1 on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982. − β · Γ βα, α + η2 Γ(α) α α [7] Y. Kaneda and J. Ohga, “Adaptive microphone-array system for noise reduction,” IEEE Transactions on Acoustics, Speech, 1 Γ βα, α +1 γ βα, α +1 and Signal Processing, vol. 34, no. 6, pp. 1391–1400, 1986. = − β · Γ βα, α + η2 . [8] S. Boll, “Suppression of acoustic noise in speech using spectral Γ(α) α α subtraction,” IEEE Transactions on Acoustics, Speech and Signal (E.4) Processing, vol. 27, no. 2, pp. 113–120, 1979. 
EURASIP Journal on Advances in Signal Processing 25

[9] J. Meyer and K. Simmer, “Multi-channel speech enhancement Journal on Applied Signal Processing, vol. 2003, no. 11, pp. in a car environment using Wiener filtering and spectral 1135–1146, 2003. subtraction,” in Proceedings of the International Conference [23] M. Mizumachi and M. Akagi, “Noise reduction by paired- on Acoustics, Speech, and Signal Processing (ICASSP ’97),pp. microphone using spectral subtraction,” in Proceedings of 1167–1170, 1997. the International Conference on Acoustics, Speech, and Signal [10] S. Fischer and K. D. Kammeyer, “Broadband beamforming Processing (ICASSP ’98), vol. 2, pp. 1001–1004, 1998. with adaptive post filtering for speech acquisition in noisy [24] T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, environment,” in Proceedings of the International Conference “High-fidelity blind separation of acoustic signals using on Acoustics, Speech, and Signal Processing (ICASSP ’97),pp. SIMO-model-based independent component analysis,” IEICE 359–362, 1997. Transactions on Fundamentals of Electronics, Communications [11] R. Mukai, S. Araki, H. Sawada, and S. Makino, “Removal and Computer Sciences, vol. E87-A, no. 8, pp. 2063–2072, 2004. of residual cross-talk components in blind source separation [25] S. Ikeda and N. Murata, “A method of ICA in the frequency using time-delayed spectral subtraction,” in Proceedings of domain,” in Proceedings of the International Workshop on the International Conference on Acoustics, Speech, and Signal Independent Component Analysis and Blind Signal Separation, Processing (ICASSP ’02), pp. 1789–1792, Orlando, Fla, USA, pp. 365–371, 1999. May 2002. [12] J. Cho and A. Krishnamurthy, “Speech enhancement using [26] E. W. Stacy, “A generalization of the gamma distribution,” The microphone array in moving vehicle environment,” in Pro- Annals of Mathematical Statistics, pp. 1187–1192, 1962. ceedings of the IEEE Intelligent Vehicles Symposium, pp. 366– [27] K. Kokkinakis and A. K. Nandi, “Generalized gamma density- 371, Graz, Austria, April 2003. based score functions for fast and flexible ICA,” Signal [13] Y. Ohashi, T. Nishikawa, H. Saruwatari, A. Lee, and K. Processing, vol. 87, no. 5, pp. 1156–1162, 2007. Shikano, “Noise robust speech recognition based on spatial [28] J. W. Shin, J.-H. Chang, and N. S. Kim, “Statistical modeling subtraction array,” in Proceedings of the International Workshop of speech signals based on generalized gamma distribution,” on Nonlinear Signal and Image Processing, pp. 324–327, 2005. IEEE Signal Processing Letters, vol. 12, no. 3, pp. 258–261, 2005. [14] J. Even, H. Saruwatari, and K. Shikano, “New architecture [29] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, combining blind signal extraction and modified spectral sub- Prentice-Hall PTR, 1993. traction for suppression of background noise,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, 2008. [15] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K. Shikano, “Blind spatial subtraction array for speech enhance- ment in noisy environment,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, pp. 650–664, 2009. [16] S. B. Jebara, “A perceptual approach to reduce musical noise phenomenon with Wiener denoising technique,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’06), vol. 3, pp. 49–52, 2006. [17] Y. Ephraim and D. 
Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984. [18] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, “Automatic optimization scheme of spectral sub- traction based on musical noise assessment via higher-order statistics,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, 2008. [19] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, “Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimatio,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09),pp. 4433–4436, 2009. [20] Y. Takahashi, Y. Uemura, H. Saruwatari, K. Shikano, and K. Kondo, “Musical noise analysis based on higher order statistics for microphone array and nonlinear signal processing,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), pp. 229–232, 2009. [21] P.Comon, “Independent component analysis, a new concept?” Signal Processing, vol. 36, pp. 287–314, 1994. [22] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and K. Shikano, “Blind source separation combining inde- pendent component analysis and beamforming,” EURASIP Hindawi Publishing Corporation EURASIP Journal on Advances in Signal Processing Volume 2010, Article ID 509541, 13 pages doi:10.1155/2010/509541

Research Article Microphone Diversity Combining for In-Car Applications

Jurgen¨ Freudenberger, Sebastian Stenzel (EURASIP Member), and Benjamin Venditti (EURASIP Member)

Department of Computer Science, University of Applied Sciences Konstanz, Hochschule Konstanz, Brauneggerstr. 55, 78462 Konstanz, Germany

Correspondence should be addressed to Jurgen¨ Freudenberger, [email protected]

Received 1 August 2009; Revised 23 January 2010; Accepted 17 March 2010

Academic Editor: Ivan Tashev

Copyright © 2010 Jurgen¨ Freudenberger et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to insure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed similar to maximum ratio combining. The combined signal is then used as a reference for a frequency domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence- based noise reduction systems, even if one microphone is heavily corrupted by noise.

1. Introduction channel noise reduction or beamformer arrays [1–3]. Good noise robustness of single microphone systems requires the With in-car speech applications like hands-free car kits and use of single channel noise suppression techniques, most speech recognition systems, speech is corrupted by engine of them derived from spectral subtraction [4]. Such noise noise and other noise sources like airflow from electric fans reduction algorithms improve the signal-to-noise ratio, but or car windows. For safety and comfort reasons, hands-free they usually introduce undesired speech distortion. Micro- telephone systems should provide the same quality of speech phone arrays can improve the performance compared to as conventional fixed telephones. In practice however, the single microphone systems. Nevertheless, the signal quality speech quality of a hands-free car kit heavily depends on does still depend on the speaker position. Moreover, the the particular position of the microphone. Speech has to be microphones are located in close proximity. Therefore, picked up as directly as possible to reduce reverberation and microphone arrays are often vulnerable to airflow that might to provide a sufficient signal-to-noise ratio. The important disturb all microphone signals. question, where to place the microphone inside the car, is, Alternatively, multimicrophone setups have been pro- however, difficult to answer. The position is apparently a posed that combine the processed signals of two or more compromise for different speaker sizes, because the distance separate microphones. The microphones are positioned between microphone and speaker depends significantly on separately (e.g., 40 to 80 cm apart) in order to ensure the position of the driver and therefore on the size of the incoherent recording of noise [5–11]. Similar multichannel driver. Furthermore, noise sources like airflow from electric signal processing systems have been suggested to reduce fans or car windows have to be considered. Placing two or signal distortion due to reverberation [12, 13]. Basically, more microphones in different positions enables a better all these approaches exploit the fact that speech compo- compromise with respect to different speaker sizes and yields nents in the microphone signals are strongly correlated more noise robustness. while the noise components are only weakly correlated Today, noise reduction in hands-free car kits and in- if the distance between the microphones is sufficiently car speech recognition systems is usually based on single large. 2 EURASIP Journal on Advances in Signal Processing

The question at hand with distributed arrays is how 50 to combine these microphone signals with possibly rather 40 ff 30 di erent signal conditions? In this paper, we consider a diver- 20 sity technique that combines the processed signals of several 10 separate microphones. The basic idea of our approach is to SNR (dB) 0 apply maximum-ratio-combining (MRC) to speech signals, −10 −20 where we propose a frequency domain diversity approach for 0 1000 2000 3000 4000 5000 two or more microphone signals. MRC maximizes the signal- Frequency (Hz) to-noise ratio in the combined signal. A major issue for the application of maximum-ratio- mic. 1 combining for multimicrophone setups is the estimation mic. 2 of the acoustic transfer functions. In telecommunications, Figure 1: Input SNR values for a driving situation at a car speed of the signal attenuation as well as the phase shift for each 100 km/h. transmission path are usually measured to apply MRC. With speech applications we have no means to directly measure the acoustic transfer functions. There exists several blind approaches to estimate the acoustic transfer functions (see environment. For these measurements, we used two cardioid e.g., [14–16]) which were successfully applied to derever- microphones with positions suited for car integration. One beration. However, the proposed estimation methods are microphone (denoted by mic. 1) was installed close to the computationally demanding. inside mirror. The second microphone (mic. 2)wasmounted In this paper, we show that maximum-ratio-combining at the A-pillar. can be achieved without explicit knowledge of the acoustic Figure 1 depicts the SNR versus frequency for a driving transfer functions. Proper signal weighting can be achieved situation at a car speed of 100 km/h. From this figure, we based on an estimate of the input signal-to-noise ratio. We observe that the SNR values are quite distinct for these propose a two stage processing of the microphone signals. two microphone positions with differences of up to 10 dB In the first stage, the microphone signals are weighted depending on the particular frequency. We also note that with respect to their input signal-to-noise ratio. These the better microphone position is not obvious in this case, weights guarantee maximum-ratio-combining of the signals because the SNR curves cross several times. with respect to the signal magnitudes. To ensure cophasal Theoretically, a MRC combining of the two input signals addition of the weighted signals, we use the combined would result in an output SNR equal to the sum of the input signal as reference signal for frequency domain LMS filters SNR values. With two inputs, MRC achieves a maximum in the second stage. These filters adjust the phases of the gain of 3 dB for equal input SNR values. In case of the input microphone signals to guarantee coherent signal combining. SNR values being rather different, the sum is dominated by The proposed concept is similar to the single channel the maximum value. Hence, for the curves in Figure 1 the noise reduction system presented by Mukherjee and Gwee output SNR would essentially be the envelope of the two [17]. This system uses spectral subtraction to obtain a crude curves. estimate of the speech signal. This estimate is then used as Next we consider the coherence for the noise and speech the reference signal of a single LMS filter. In this paper, we signals. The corresponding results are depicted in Figure 2. 
generalize this concept to multimicrophone systems, where The figure presents measurements for two microphones our aim is not only noise reduction, but also dereverberation installed close to the inside mirror in an end-fire beamformer of the microphone signals. constellation with a microphone distance of 7 cm. The lower The paper is organized as follows: In Section 2,we figure contains the results for the microphone positions present some measurement results obtained in a car environ- mic. 1 and mic. 2 (distance of 65 cm). From these results, ment. This results motivate the proposed diversity approach. we observe that the noise coherence closely follows the In Section 3, we present a signal combiner that achieves theoretical coherence function (dotted line in Figure 2)inan MRC weighting based on the knowledge of the input ideal diffuse sound field [18]. Separating the microphones signal-to-noise ratios. Coherence based signal combining significantly reduces the noise coherence for low frequencies. is discussed in Section 4. In the subsequent section, we On the other hand, both microphone constellations have consider implementation issues. In particular, we present similar speech coherence. We note that the speech coherence an estimator for the required input signal-to-noise ratios. is not ideal, as it has steep dips. The corresponding frequen- Finally, in Section 6, we present some simulation results for cies will probably be attenuated by a signal combiner that is different real world noise situations. solely based on coherence.

2. Measurement Results 3. Spectral Combining The basic idea of our spectral combining approach is to In this section, we present the basic system concept. To sim- apply MRC to speech signals. To motivate this approach, plify the discussion, we assume that all signals are stationary we first discuss some measurement results obtained in a car and that the acoustic system is linear and time-invariant. EURASIP Journal on Advances in Signal Processing 3

2 1 | X( f ) is maximized. In the frequency domain, the signal ) f (

2 0.8 combining can be expressed as x 1 x

γ 0.6 | M 0.4 X f = Gi f Yi f , (3) 0.2 i=1

Coherence 0 where G ( f ) is the weight of the ith microphone signal. With 0 1000 2000 3000 4000 5000 i (2)wehave Frequency (Hz) M M (a) X f = X f Gi f Hi f + Gi f Ni f , (4) i=1 i=1

2 1 | )

f where the first sum represents the speech component and (

2 0.8 x

1 the second sum represents the noise component of the x

γ 0.6 | combined signal. Hence, the overall signal-to-noise ratio of 0.4 the combined signal is 0.2 2 E M Coherence 0 X f i=1 Gi f Hi f 0 1000 2000 3000 4000 5000 = γ f 2 . (5) Frequency (Hz) E M i=1 Gi f Ni f Noise Speech 3.1. Maximum-Ratio-Combining. The optimal combining Theoretical strategy that maximizes the signal-to-noise ratio in the com- (b) bined signal X( f ) is usually called maximal-ratio-combining Figure 2: Coherence for noise and speech signals for tow different (MRC) [19]. In this section, we briefly outline the derivation microphone positions. of the MRC weights for completeness. Furthermore, some of the properties of maximal ratio combining are discussed. 2 Let λX ( f ) = E{|X( f )| } be the speech power spectral density. Assuming that the noise power λN ( f ) is the same In the subsequent section we consider the modifications for for all microphones and that the noise at the different nonstationary signals and time variant systems. microphones is uncorrelated, we have We consider a scenario with M microphones. The 2 M microphone signals yi(k) can be modeled by the convolution λX f =1 Gi f Hi f = i (6) of the speech signal x(k) with the impulse response hi(k)of γ f M 2 . the acoustic system plus additive noise n (k). Hence the M λN f i=1 Gi f i microphone signals yi(k) can be expressed as | M |2 We consider now the term i=1 Gi( f )Hi( f ) in the denominator of (6). Using the Cauchy-Schwarz inequality yi(k) = hi(k) ∗ x(k) + ni(k), (1) we have 2 M M M 2 2 where ∗ denotes the convolution. ≤ (7) Gi f Hi f Gi f Hi f To apply the diversity technique, it is convenient to i=1 i=1 i=1 consider the signals in the frequency domain. Let X( f )be ∗ ∗ with equality if G ( f ) = cH ( f ), where H is the complex the spectrum of the speech signal x(k)andY ( f ) be the i i i i conjugate of the channel coefficient H .Herec is a real-valued spectrum of the ith microphone signal y (k). The speech i i constant common to all weights G ( f ). Thus, for the signal- signal is linearly distorted by the acoustic transfer function i to-noise ratio we obtain Hi( f ) and corrupted by the noise term Ni( f ). Hence, the M 2 M 2 signal observed at the ith microphone has the spectrum λ f = G f = H f ≤ X i 1 i i 1 i γ f M 2 λN f i=1 Gi f = Yi f X f Hi f + Ni f . (2) (8) M λX f 2 = Hi f . In the following, we assume that the speech signal and λN f i=1 the channel coefficients are uncorrelated. We assume a = ∗ With the weights Gi( f ) cHi ( f ), we obtain the maximum complex Gaussian distribution of the noise terms Ni( f ). Moreover, we presume that the noise power spectral density signal-to-noise ratio of the combined signal as the sum of the 2 signal-to-noise ratios of the M received signals λN ( f ) = E{|Ni( f )| } is the same for all microphones. This assumption is reasonable for a diffuse sound field. M Our aim is to linearly combine the M microphone signals γ f = γi f , (9) = Yi( f ) so that the signal-to-noise ratio in the combined signal i 1 4 EURASIP Journal on Advances in Signal Processing where Hence, we have 2 (i) λX f Hi f G f = cSC f Hi f (16) γi f = (10) SC λN f with is the input signal-to-noise ratio of the ith microphone. It is 1 appropriate to chose c as cSC f = . 2 M (17) 1 j=1 Hj f cMRC f = . M 2 (11) = H f j 1 j We observe that the weight G(i) ( f ) is proportional to the SC ∗ This leads to the MRC weights magnitude of the MRC weights Hi( f ) , because the factor ∗ c is the same for all M microphone signals. 
Consequently, H f SC (i) = ∗ = i coherent addition of the sensor signals weighted with the GMRC f cMRC f Hi f 2 , (12) M (i) j=1 Hj f gain factors GSC( f ) still leads to a combining, where the signal-to-noise ratio at the combiner output is the sum of and the estimated (equalized) speech spectrum the input SNR values. However, coherent addition requires = (1) (2) (3) ··· X GMRCY1 + GMRCY2 + GMRCY3 an additional phase estimate. Let φi( f ) denote the phase of Hi( f )atfrequency f . Assuming cophasal addition the H∗ H∗ X = 1 Y + 2 Y + ··· estimated speech spectrum is M | |2 1 M | |2 2 i=1 Hi i=1 Hi − − − = (1) jφ1 (2) jφ2 (3) jφ3 ··· X GSC e Y1 + GSC e Y2 + GSC e Y3 ∗( ) ∗( ) = H1H1X + N1 H2H2X + N2 ··· (18) M 2 + M 2 + (13) 1 − − | | | | = (1) jφ1 (2) jφ2 ··· i=1 Hi i=1 Hi X + GSC e N1 + GSC e N2 + . cSC H∗ H∗ = 1 2 ··· Hence, in the case of stationary signals the term X + M 2 N1 + M 2 N2 + = |H | = |H | i 1 i i 1 i M 2 = (1) (2) ··· 1 X + GMRCN1 + GMRCN2 + , = Hj f (19) cSC f = where we have omitted the dependency on f . The estimated j 1 speech spectrum X( f ) is therefore equal to the actual speech can be interpreted as the resulting transfer characteristic spectrum X( f )plussomeweightednoiseterm. of the system. An example is depicted in Figure 3.The The filter defined in (12) was previously applied to speech upper figure presents the measured transfer characteristics dereverberation by Gannot and Moonen in [14], because for two microphones in a car environment. Note that the it ideally equalizes the microphone signals if a sufficiently microphones have a high-pass characteristic and attenuate accurate estimate of the acoustic transfer functions is avail- signal components for frequencies below 1 kHz. The lower able. The problem at hand with maximum-ratio-combining figure is the curve 1/cSC( f ). The spectral combiner equalizes is that it is rather difficult and computationally complex to most of the deep dips in the transfer functions from the explicitly estimate the acoustic transfer characteristic Hi( f ) mouth of the speaker to the microphones while the envelope for our microphone system. of the transfer functions is not equalized. In the next section, we show that MRC combining can be achieved without explicit knowledge of the acoustic ff 3.3. Magnitude Combining. One challenge in multimicro- channels. The weights for the di erent microphones can phone systems with spatially separated microphones is a be calculated based on an estimate of the signal-to-noise reliable phase estimation of the different input signals. For ratio for each microphone. The proposed filter achieves a a coherent combining of the speech signals, we have to signal-to-noise ratio according to (9), but does not guarantee compensate the phase difference between the speech signals perfect equalization. at each microphone. Therefore, it is sufficient to estimate the phase differences to a reference microphone, for example, 3.2. Diversity Combining for Speech Signals. We consider the to the first microphone Δi( f ) = φ1( f ) − φi( f ), for all i = weights 2, ..., M. Cophasal addition is then achieved by (1) (2) (3) ( ) γi f = jΔ2 jΔ3 ··· i = X GSC Y1 + GSC e Y2 + GSC e Y3 . (20) GSC M . (14) = γj f j 1 But a reliable estimation of the phase differences is only Assuming the noise power is the same for all microphones possible in speech active periods and furthermore only for and substituting γi( f )by(10)leadsto that frequencies where speech is present. Estimating the phase differences 2 (i) Hi f Hi f G f = = . 
∗ SC 2 (15) Y1 f Yi f M M 2 jΔi( f ) = E j=1 Hj f e (21) j=1 Hj f Y1 f Yi f EURASIP Journal on Advances in Signal Processing 5 leads to unreliable phase values for time-frequency points Transfer characteristics to the microphones without speech. In particular, if Hi( f ) = 0forsome 0 frequency , the estimated phase ( ) is undefined. A f Δi f )(dB) −

f 10 combining using this estimate leads to additional signal ( 2 − distortions. Additionally, noise correlation would distort the H 20 ), f − phase estimation. A coarse estimate of the phase difference ( 30 1 can also be obtained from the time-shift τi between the H −40 speech components in the microphone signals, for example, 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 using the generalized correlation method [20]. The estimate Frequency (Hz) ≈ is then Δi( f ) 2πfτi. Note that a combiner using these (a) phase values would in a certain manner be equivalent Overall transfer characteristic to a delay-and-sum beamformer. However, for distributed microphone arrays in reverberant environments this phase 0 compensation leads to a poor estimate of the actual phase −10

ff (dB) di erences. −20 SC

Because of the drawbacks, which come along with the /c 1 −30 phase estimation methods described above, we propose −40 another scheme. Therefore, we use a two stage combining 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 approach. In the first stage, we use the spectral combining Frequency (Hz) approach as described in Section 3.2 with a simple magni- tude combining of the microphone signals. For the mag- (b) nitude combining the noisy phase of the first microphone Figure 3: Transfer characteristics to the microphones and of the signal is adopted to the other microphone signals. This is also combined signal. obvious in Figure 5, where the phase of the noisy spectrum e jφ1( f ) is taken for the spectrum at the output of the filter (2) GSC ( f ), before the signals were combined. This leads to the applied the dereverberation principle of Allen et al. [13] following incoherent combining of the input signals to noise reduction. In particular, they proposed an LMS- ff based time domain algorithm to combine the di erent = (1) (2) jφ1( f ) ··· X f GSC f Y1 f + GSC f Y2 f e + microphone signals. This approach provides effective noise

suppression for frequencies where the noise components of (M) jφ1( f ) + GSC f YM f e the microphone signals are uncorrelated.

However, as we have seen in Section 2,forpractical = (1) (2) jφ1( f ) ··· GSC f Y1 f + GSC f Y2 f e + . microphone distances in the range of 0.4 to 0.8 m the noise (22) signals are correlated for low frequencies. These correlations reduce the noise suppression capabilities of the algorithm The estimated speech spectrum X( f )isequalto and lead to musical noise. We will show in this section that a combination of the jφ1( f ) spectral combining with the coherence based approach by X f e (23) Martin and Vary reduces this issues. cSC f plus some weighted noise terms. It follows from the triangle 4.1. Analysis of the LMS Approach. We present now an inequality that analysis of the scheme by Martin and Vary as depicted in Figure 4. The filter gi(k) is adapted using the LMS algorithm. M 2 For stationary signals x(k), n1(k), and n2(k), the adaptation 1 1 ≤ = Hj f . (24) converts to filter coefficients g (k) and a corresponding filter c f c f i SC SC j=1 transfer function E ∗ Magnitude combining does not therefore guarantee maxi- Yi f Yj f (i) = = mum-ratio-combining. Yetthe signal X( f ) is taken as a refer- GLMS f 2 , i / j (25) E Y f ence signal in the second stage where the phase compensation i is done. This coherence based signal combining scheme is described in the following section. that minimizes the expected value 2 E (i) − 4. Coherence-Based Combining Yi f GLMS f Yj f , (26) E{ ∗ } As an example of a coherence based diversity system we where Yi ( f )Yj ( f ) is the cross-power spectrum of the 2 first consider the two microphone approach by Martin two microphone signals and E{|Yi( f )| } is the power and Vary [5, 6]asdepictedinFigure4.MartinandVary spectrum of the ith microphone signal. 6 EURASIP Journal on Advances in Signal Processing

y1(k) = x(k) ∗ h1(k)+n1(k) y2(k) n1(k) g1(k)

− 0.5 x(k) h1(k) x(k) −

h2(k)

n2(k) g2(k) y2(k) = x(k) ∗ h2(k)+n2(k) y1(k)

Figure 4:BasicsystemstructureoftheLMSapproach.

Assuming that the speech signal and the noise signals are 4.2. Combining MRC and LMS. To ensure suitable weighting uncorrelated, (25)canbewrittenas and coherent signal addition we combine the diversity technique with the LMS approach to process the signals E 2 ∗ E ∗ X f Hi f Hj f + Ni f Nj f of the different microphones. It is informative to examine G(i) f = . LMS 2 2 2 the combined approach under ideal conditions, that is, we E X f Hi f + E Ni f assume ideal MRC weighting. (27) Analog to (13), weighting with the MRC gains factors For frequencies where the noise components are uncorre- according to (12) results in the estimate ∗ lated, that is, E{N ( f )Nj ( f )}=0, this formula is reduced i = (1) (2) ··· to X f X f + GMRC f N1 f + GMRC f N2 f + . (30) E 2 ∗ X f Hi f Hj f G(i) f = . (28) LMS 2 2 2 We now use the estimate X( f ) as the reference signal for the E X f Hi f + E Ni f LMS algorithm. That is, we adapted a filter for each input (i) signal such that the expected value The filter GLMS( f ) according to (28) results in fact in a minimum mean squared error (MMSE) estimate of the 2 E (i) − signal X( f )Hj ( f ) based on the signal Yi( f ). Hence, the Yi f GLMS f X f (31) weighted output is a combination of the MMSE estimates of the speech components of the two input signals. This is minimized. The adaptation results in the filter transfer explains the good noise reduction properties of the approach functions by Martin and Vary. E ∗ Yi f X f On the other hand, the coherence of the noise depends (i) = GLMS f 2 . (32) strongly on the distance between the microphones. For in- E Yi f car applications, practical distances are in the range of 0.4 to 0.8 m. Therefore, only the noise components for frequencies Assuming that the speech signal and the noise signals are above 1 kHz can be considered to be uncorrelated [6]. uncorrelated and substituting X( f ) according to (30)leads According to formula (27), the noise correlation leads to to abias ∗ ( ) E Y f X f ∗ i = i E GLMS f (33) Ni f Nj f E 2 (29) Yi f E 2 Yi f 2 E Ni f of the filter transfer function. An approach to correct the + G(i) f , (34) MRC E 2 filter bias by estimating the noise cross-power density was Yi f presented in [21]. Another issue with speech enhancement E ∗ solely based on the LMS approach is that the speech signals ( ) Ni f Nj f + G j f + ··· . (35) at the microphone inputs may only be weakly correlated MRC 2 E Yi f for some frequencies as shown in Section 2. Consequently, these frequency components will be attenuated in the output The first term signals. ∗ ∗ E 2 In the following, we discuss a modified LMS approach, E Y f X f Hi f X f i = where we first combine the microphone signals to obtain 2 2 2 2 E Y f H f E X f + E N f an improved reference signal for the adaptation of the LMS i i i filters. (36) EURASIP Journal on Advances in Signal Processing 7 in this sum is the Wiener filter that results in a minimum This formula shows that noise suppression can be introduced mean squared error estimate of the signal X( f )basedon by simply adding a constant to the numerator term in (14). the signal Yi( f ). The Wiener filter equalizes the microphone Most, if not all, implementations of spectral subtraction signal and minimizes the mean squared error between the are based on an over-subtraction approach, where an filter output and the actual speech signal X( f ). 
Note that the overestimate of the noise power is subtracted from the phase of the term in (36)is−φi, that is, the filter compensates power spectrum of the input signal (see e.g., [22–25]). Over- the phase of the acoustic transfer function Hi( f ). subtraction can be included in (40) by using a constant ρ The other terms in the sum can be considered as filter larger than one. This leads to the final gain factor biases where the term in (34) depends on the noise power density of the ith input. The remaining terms depend on (i) γi f the noise cross power and vanish for uncorrelated noise G f = . (41) SC ρ + γ f signals. However, noise correlation might distort the phase estimation. The parameter does hardly affect the gain factors for Similarly, when we consider the actual reference signal ρ (i) high signal-to-noise ratios retaining optimum weighting. For X( f ) according to (22), the filter equation for GLMS( f ) low signal-to-noise ratios this term leads to an additional contains the term attenuation. The over-subtraction factor is usually a function ∗ 2 jφ ( f ) ff Hi f E X f e 1 of the SNR, sometimes it is also chosen di erently for ff 2 2 2 (37) di erent frequency bands [25]. cSC f Hi f E X f + E Ni f = − with the sought phase Δi( f ) φ1( f ) φi( f ). If the 5. Implementation Issues correlation of the noise terms is sufficiently small we obtain the estimated phase Real world speech and noise signals are non-stationary processes. For an implementation of the spectral weighting, = (i) Δi f arg GLMS f . (38) we have to consider short-time spectra of the microphone signals and estimate the short-time power spectral densities The LMS algorithm estimates implicitly the phase differences (PSD) of the speech signal and the noise components. between the reference signal X( f ) and the input signals Therefore, the noisy signal y (k) is transformed into the (i) i Yi( f ). Hence, the spectra at the outputs of the filters GLMS( f ) frequency domain using a short-time Fourier transform of are in phase. This enables a cophasal addition of the signals length L.EachblockofL consecutive samples is multiplied according to (20). with a Hamming window. Subsequent blocks are overlapping By estimating the noise power and noise cross-power by K samples. Let Yi(κ, ν), Xi(κ, ν), and Ni(κ, ν) denote the densities we could correct the biases of the LMS filter transfer corresponding short-time spectra, where κ is the subsampled functions. Similarly, reducing the noisy signal components time index and ν is the frequency bin index. in (30) diminishes the filter biases. In the following, we will pursue the latter approach. 5.1. System Structure. The processing system for two inputs is depicted in Figure 5. The spectrum X(κ, ν) results from 4.3. Noise Suppression. Maximum-ratio-combining provides incoherent magnitude combining of the input signals an optimum weighting of the M sensor signals. However, it does not necessarily suppress the noisy signal compo- X(κ, ν) = G(1)(κ, ν)Y (κ, ν) nents. We therefore combine the spectral combining with SC 1 (42) an additional noise suppression filter. Of the numerous (2) ν + G (κ, ν)|Y (κ, ν)|e jφ1(κ, ) + ··· , proposed noise reduction techniques in literature, we con- SC 2 sider only spectral subtraction [4] which supplements the where spectral combining quite naturally. The basic idea of spectral subtraction is to subtract an estimate of the noise floor from ( ) γi(κ, ν) an estimate of the spectrum of the noisy signal. G i (κ, ν) = . 
(43) SC ρ + γ(κ, ν) Estimating the overall SNR according to (9) the spectral subtraction filter (see i.e., [1, page 239]) for the combined The power spectral density of speech signals is relatively signal X( f )canbewrittenas fast time varying. Therefore, the FLMS algorithm requires a quick update, that is, a large step size. If the step size γ f is sufficiently large the magnitudes of the FLMS filters GNS f = . (39) 1+γ f (i) ν (i) ν GLMS(κ, ) follow the filters GSC(κ, ). Because the spectra at (i) Multiplying this filter transfer function with (14) leads to the the outputs of the filters GLMS( f ) are in phase, we obtain the term estimated speech spectrum as (1) (2) γi f γ f γi f ν = ν ν ν ν ··· = (40) X(κ, ) GLMS(κ, )Y1(κ, ) + GLMS(κ, )Y2(κ, ) + . γ f 1+γ f 1+γ f (44) 8 EURASIP Journal on Advances in Signal Processing

ν n1(k) Y1(κ, )

y1(k) Windowing (1) ν (1) GSC (κ, ) G (κ, ν) ∗ + FFT LMS x(k) h1(k) − SNR and gain Phase X(κ, ν) IFFT x(k) computing computing +OLA − x(k) ∗ h2(k) Windowing ν (2) G(2)(κ, ν) |·| e jφ1(κ, ) G (κ, ν) + FFT SC LMS y2(k)

n2(k) Y2(κ, ν) Figure 5: Basic system structure of the diversity system with two inputs.

To perform spectral combining we have to estimate the [23, 27]. With this approach, the noise PSD estimate is current signal-to-noise ratio based on the noisy microphone determined by the minimum value input signals. In the next sections, we propose a simple and efficient method to estimate the noise power spectral λmin,i(κ, ν) = min λY,i(l, ν) l∈[κ−W+1,κ] (47) densities of the microphone inputs.

within a sliding window of W consecutive values of λY,i(κ, ν). 5.2. PSD Estimation. Commonly the noise PSD is estimated The noise PSD is then estimated by in speech pauses where the pauses are detected using voice 2 activity detection (VAD, see e.g., [24, 26]). VAD-based E |Ni(κ, ν)| ≈ omin · λmin,i(κ, ν), (48) methods provide good estimates for stationary noise. How- ever, they may suffer from error propagation if subsequent where omin is a parameter of the algorithm and should be decisions are not independent. Other methods, like the min- approximated as imum statistics approach introduced by Martin [23, 27], use a continuous estimation that does not explicitly differentiate = 1 omin E{ } . (49) between speech pauses and speech active segments. λmin Our estimation method combines the VAD approach The MS approach provides a rough estimate of the noise with the minimum statistics (MS) method. Minimum power that strongly depends on the smoothing parameter α statistics is a robust technique to estimate the power spectral and the window size of the sliding window (for details cf. density of non-stationary noise by tracing the minimum of [27]). However, this estimate can be obtained regardless of the recursively smoothed power spectral density within a speech being present or not. time window of 1 to 2 seconds. We use these MS estimates The idea of our approach is to approximate the PSD and a simple threshold test to determine voice activity for by the MS estimate during speech active periods while the each time-frequency point. smoothed input power is used for time-frequency points The proposed method prevents error propagation, wherespeechisabsent. because the MS approach is independent of the VAD. During speech pauses the noise PSD estimation can be enhanced E |N (κ, ν)|2 ≈ β(κ, ν)o · λ (κ, ν) compared with an estimate solely based on minimum i min min,i (50) statistics. A similar time-frequency dependent VAD was + 1 − β(κ, ν) λ (κ, ν), presented by Cohen to enhance the noise power spectral Y,i density estimation of minimum statistics [28]. where β(κ, ν) ∈{0, 1} is an indicator function for speech ν For time-frequency points (κ, ) where the speech signal activity which will be discussed in more detail in the next E{| ν |2} is inactive, the noise PSD Ni(κ, ) can be approximated section. by recursive smoothing The current signal-to-noise ratio is then obtained by 2 2 2 E |N (κ, ν)| ≈ λ (κ, ν) (45) E |Yi(κ, ν)| − E |Ni(κ, ν)| i Y,i ν = γi(κ, ) 2 , (51) E |Ni(κ, ν)| with assuming that the noise and speech signals are uncorrelated. 2 λY,i(κ, ν) = (1 − α)λY,i(κ − 1, ν) + α|Yi(κ, ν)| , (46) 5.3. Voice Activity Detection. Human speech contains gaps where α ∈ (0, 1) is the smoothing parameter. not only in time but also in frequency domain. It is During speech active periods the PSD can be estimated therefore reasonable to estimate the voice activity in the time- using the minimum statistics method introduced by Martin frequency domain in order to obtain a more accurate VAD. EURASIP Journal on Advances in Signal Processing 9

The VAD function β(κ, ν) can then be calculated upon the The decision rule for the ith channel is based on the current input noise PSD obtained by minimum statistics. conditional speech presence probability Our aim is to determine for each time-frequency point ⎧ ν ⎨⎪ P H1 | Y (κ, ) whether the speech signal is active or inactive. We 1, i ≥ T, therefore consider the two hypotheses H (κ, ν)andH (κ, ν) ( , ν) = P H | Y (61) 1 0 βi κ ⎩⎪ 0 i which indicate speech presence or absence at the time- 0, otherwise. frequency point (κ, ν), respectively. We assume that the ff coefficients X(κ, ν)andNi(κ, ν) of the short-time spectra of The parameter T>0enablesatradeo between the two both the speech and the noise signal are complex Gaussian possible error probabilities of voice activity detection. A random variables. In this case, the current input power, that value T>1 decreases the probability of a false alarm, that 2 ν = is, squared magnitude |Yi(κ, ν)| , is exponentially distributed is, β(κ, ) 1 when speech is absent. T<1 reduces the with mean (power spectral density) probability of a miss, that is, β(κ, ν) = 0 in the presence of speech. Note that the generalized likelihood-ratio test ν = E | ν |2 λYi (κ, ) Y(κ, ) . (52) P H | Y p (κ, ν) 1 i = i ≥ T (62) Similarly we define P H0 | Yi 1 − pi(κ, ν) ν =| ν |2E | ν |2 is according to the Neyman-Pearson-Lemma (see e.g., [30]) λXi (κ, ) Hi(κ, ) X(κ, ) , (53) an optimal decision rule. That is, for a fixed probability of a 2 false alarm it minimizes the probability of a miss and vice λ (κ, ν) = E |N (κ, ν)| . Ni i versa. The generalized likelihood-ratio test was previously We assume that speech and noise are uncorrelated. used by Sohn and Sung to detect speech activity in subbands Hence, we have [29, 31]. The test in inequality (62)isequivalentto ν = ν ν λYi (κ, ) λXi (κ, ) + λNi (κ, ) (54) −1 λX,i+ λN,i q 1+T pi(κ, ν) = 1+ exp(−ui) ≤ , during speech active periods and λN,i 1 − q T (63) = ν λYi(κ,ν) λNi (κ, ) (55) 2 wherewehaveused(59). Solving for |Yi(κ, ν)| using (60), in speech pauses. we obtain a simple threshold test for the ith microphone In the following, we occasionally omit the dependency on κ and ν in order to keep the notation lucid. The conditional 1, | ( , ν)|2 ≥ ( , ν) ( , ν), ν = Yi κ λN,i κ Θi κ probability density functions of the random variable Yi = βi(κ, ) (64) 2 0, otherwise. |Yi(κ, ν)| are [29] ⎧ ⎪ with the threshold ⎨⎪ 1 −Y exp i , Y ≥ 0, f Y | H = λ λ i (56) i 0 ⎪ Ni Ni ν = λN,i Tq 1+ λX,i/λN,i ⎩ Y Θi(κ, ) 1+ log . (65) 0, i < 0, λX,i 1 − q ⎧ ⎪ −Y This threshold test is equivalent to the decision rule in (61). ⎨ 1 i Y ≥ Y | = exp , i 0, With this threshold test, speech is detected if the current f i H1 ⎪ λXi + λNi λXi + λNi (57) ⎩⎪ input power |Y (κ, ν)|2 is greater or equal to the average noise 0, Yi < 0. i power λN,i(κ, ν) times the threshold Θi(κ, ν). This factor Applying Bayes rule for the conditional speech presence depends on the input signal-to-noise ratio λX,i/λN,i and the ν probability a priori probability of speech absence q(κ, ). In order to combine the activity estimates for the ff pi(κ, ν) = P H1 | Yi (58) di erent input signals, we use the following rule 2 we have [29] 1, if |Y (κ, ν)| ≥ λ , Θ for any i, β(κ, ν) = i N i i (66) −1 0, otherwise. λ + λ q ν = Xi Ni − (59) pi(κ, ) 1+ − exp( ui) , λNi 1 q 6. Simulation Results where q(κ, ν) = P(H0(κ, ν)) is the a priori probability of speech absence and In this section, we present some simulation results for dif- ferent noise conditions typical in a car. 
For our simulations Y | ν |2 iλXi Yi(κ, ) λXi ui(κ, ν) = = . (60) we consider the same microphone setup as described in λNi λXi + λNi λNi λXi + λNi Section 2, that is, we use a two-channel diversity system, 10 EURASIP Journal on Advances in Signal Processing

×103 mic. 1 Table 1: Average input SNR values [dB] from mic. 1/mic. 2 for 5 typical background noise conditions in a car. 4 SNR IN 100 km/h 140 km/h defrost 3 short speaker 1.2/3.1 −0.7/−0.5 1.7/1.3 2 − 1 tall speaker 1.9/10.8 0.1/7.2 2.4/9.0 Frequency (Hz) 0 1234567 Table 2: Log spectral distances with minimum statistics noise PSD Time (s) estimation and with the proposed noise PSD estimator. (a) DLS [dB] 100 km/h 140 km/h defrost ×103 Activity mic. 1 3.93/3.33 2.47/2.07 3.07/1.27 5 mic. 2 4.6/4.5 3.03/2.33 3.4/1.5 4 3 2 while the second ones are according to a tall person. For all 1 =

Frequency (Hz) algorithms, we used an FFT length of L 512 and an overlap 0 of 256 samples. For time windowing we apply a Hamming 1234567 window. Time (s)

(b) 6.1. Estimating the Noise PSD. The spectrogram of one input Figure 6: Spectrogram of the microphone input (mic. 1 at car speed signal and the result of the voice activity detection are shown of 140 km/h, short speaker). The lower figure depicts the results in Figure 6 for the worst case scenario (short speaker at car of the voice activity detection (black representing estimated speech speed of 140 km/h). It can be observed that time-frequency activity) with T = 1.2andq = 0.5. points with speech activity are reliably detected. Because the noise PSD is estimated with minimum statistics also during speech activity, the false alarms in speech pauses do hardly 10 ff 0 a ect the noise PSD estimation. −10 In Figure 7, we compare the estimated noise PSD with −20 actual PSD for the same scenario. The PSD is well approx- −30 imated with only minor deviations for high frequencies.

PSD (dB) − 40 To evaluate the noise PSD estimation for several driving −50 −60 situations we calculated as an objective performance measure 0 1000 2000 3000 4000 5000 6000 the log spectral distance (LSD) Frequency (Hz) 2 Noise 1 λ (ν) D = 10 log N (67) Estimate LS 10 L ν λN (ν) Figure 7: Estimated and actual noise PSD for mic. 2 at car speed of ν 140 km/h. between the actual noise power spectrum λN ( ) and the estimate λN (ν). From the definition, it is obvious that the LSD can be interpreted as the mean distance between two ff because this is probably the most interesting case for in-car PSDs in dB. An extended analysis of di erent distance applications. measures is presented in [33]. With respect to three different background noise situa- The log spectral distances of the proposed noise PSD tions, we recorded driving noise at 100 km/h and 140 km/h. estimator are shown in Table 2. The first number in each field As third noise situation, we considered the noise which arises is the LSD achieved with the minimum statistics approach from an electric fan (defroster). With an artificial head we while the second number is the value for the proposed recorded speech samples for two different seat positions. scheme. Note that every noise situation was evaluated with ff From both positions, we recorded two male and two female four di erent voices (two male and two female). From these speech samples, each of a length of 8 seconds. Therefore, results, we observe that the voice activity detection improves we took the German-speaking speech samples from the rec- the PSD estimation for all considered driving situations. ommendation P.501 of the International Telecommunication Union (ITU) [32]. Hence the evaluation was done using 6.2. Spectral Combining. Next we consider the spectral four different voices with two different speaker sizes, which combining as discussed in Section 3. Figure 8 presents the leads to 8 different speaker configurations. For all recordings, output SNR values for a driving situation with a car speed of we used a sampling rate of 11025 Hz. Table 1 contains the 100 km/h. For this simulation we used ρ = 0, that is, spectral average SNR values for the considered noise conditions. The combining without noise suppression. In addition to the first values in each field are with respect to a short speaker output SNR, the curve for ideal maximum-ratio-combining EURASIP Journal on Advances in Signal Processing 11

30 30 20 20 10 10 SNR (dB) 0 SNR (dB) 0 −10 −10 0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000 Frequency (Hz) Frequency (Hz)

Out MRC Out MRC-FLMS Ideal MRC Ideal MRC Figure 8: Output SNR values for spectral combining without Figure 10: Output SNR values for the combined approach with additional noise suppression (car speed of 100 km/h, ρ = 0). additional noise suppression (car speed of 100 km/h, ρ = 10).

30 Table 3: Output SNR values [dB] for different combining tech- 20 niques—short/tall speaker. 10 SNR OUT 100 km/h 140 km/h defrost SNR (dB) 0 FLMS 8.8/13.3 4.4/9.0 7.8/12.3 −10 SC 16.3/20.9 13.3/18.0 14.9/19.9 0 500 1000 1500 2000 2500 3000 3500 4000 SC + FLMS 13.5/17.8 10.5/15.0 12.5/16.9 Frequency (Hz) ideal FLMS 12.6/15.2 10.5/13.3 14.5/17.3 Out MRC-FLMS Ideal MRC Table 4: Cosh spectral distances for different combining tech- Figure 9: Output SNR values for the combined approach without niques—short/tall speaker. additional noise suppression (car speed of 100 km/h, ρ = 0). DCH 100 km/h 140 km/h defrost FLMS 0.9/0.9 0.9/1.0 1.2/1.2 is depicted. This curve is simply the sum of the input SNR SC 1.3/1.4 1.4/1.5 1.5/1.7 values for the two microphones which we calculated based SC + FLMS 1.2/1.1 1.2/1.2 1.4/1.5 on the actual noise and speech signals (cf. Figure 1). ideal FLMS 0.9/0.8 1.1/1.0 1.5/1.4 We observe that the output SNR curve closely follows the ideal curve but with a loss of 1–3 dB. This loss is essentially ff caused by the phase di erences of the input signals. With the as presented in [21] (see also Section 4.1). The label SC spectral combining approach only a magnitude combining marks results solely based on spectral combining with is possible. Furthermore, the power spectral densities are additional noise suppression as discussed in Sections 3 and estimates based on the noisy microphone signals, this leads 4.3. The results with the combined approach are labeled by to an additional loss in the SNR. SC + FLMS. Finally, the values marked with the label ideal FLMS are a benchmark obtained by using the clean and 6.3. Combining SC and FLMS. The output SNR of the unreverberantspeechsignalx(k) as a reference for the FLMS combined approach without additional noise suppression is algorithm. depicted in Figure 9. It is obvious that the theoretical SNR From the results in Table 3, we observe that the spectral curve for ideal MRC is closely approximated by the output combining leads to a significant improvement of the output SNR of the combined system. This is the result of the implicit SNR compared to the coherence based noise reduction. It phase estimation of the FLMS approach which leads to a even outperforms the “ideal” FLMS scheme. However, the coherent combining of the speech signals. spectral combining introduces undesired speech distortions Now we consider the combined approach with additional similar to single channel noise reduction. This is also noise suppression (ρ = 10). Figure 10 presents the corre- indicated by the results in Table 4. This table presents sponding results for a driving situation with a car speed of distance values for the different combining systems. As an 100 km/h. The output SNR curve still follows the ideal MRC objective measure of speech distortion, we calculated the curve but now with a gain of up to 5 dB. cosh spectral distance (a symmetrical version of the Itakura- In Table 3, we compare the output SNR values of the Saito distance) between the power spectra of the clean input three considered noise conditions for different combining signal (without reverberation and noise) and the output techniques. The first value is the output SNR for a short speech signal (filter coefficients were obtained from noisy speaker while the second number represents the result for data). the tall speaker. 
The values marked with FLMS correspond to The benefit of the combined system is also indicated by the coherence based FLMS approach with bias compensation the results in Table 5 which presents Mean Opinion Score 12 EURASIP Journal on Advances in Signal Processing

Table 5: Evaluation of the MOS-Test. 7. Conclusions MOS 100 km/h 140 km/h defrost average In this paper, we have presented a diversity technique that FLMS 2.58 2.77 2.10 2.49 combines the processed signals of several separate micro- SC 3.19 3.15 2.96 3.10 phones. The aim of our approach was noise robustness SC + FLMS 3.75 3.73 3.88 3.78 for in-car hands-free applications, because single channel ideal FLMS 3.81 3.67 3.94 3.81 noise suppression methods are sensitive to the microphone location and in particular to the distance between speaker and microphone. We have shown theoretically that the proposed signal ×103 Input 1 weighting is equivalent to maximum-ratio-combining. Here 4 we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might 2 be unrealistic. However, the simulation results for a two- 0 Frequency (Hz) 01234567 microphone system demonstrate that a performance close to that of MRC can be achieved with real world noise situations. Time (s) Moreover, diversity combining is an effective means to (a) reduce signal distortions due to reverberation and therefore ×103 Input 2 improves the speech intelligibility compared to single chan- nel noise reduction. This improvement can be explained by 4 the fact that spectral combining equalizes frequency dips that 2 occur only in one microphone input (cf. Figure 3). 0

Frequency (Hz) The spectral combining requires an SNR estimate for 01234567 each input signal. We have presented a simple noise PSD Time (s) estimator that reliably approximates the noise power for (b) stationary as well as instationary noise. ×103 Output 4 Acknowledgments 2 Research for this paper was supported by the German Federal 0 Frequency (Hz) 01234567 Ministry of Education and Research (Grant no. 17 N11 08). Time (s) Last but not the least, the authors would like to thank the (c) reviewers for their constructive comments and suggestions which greatly improve the quality of this paper. Figure 11: Spectrograms of the input and output signals with the SC + FLMS approach (car speed of 100 km/h, ρ = 10). References

[1] E. Hansler¨ and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, John Wiley & Sons, New York, NY, USA, (MOS) values for the different algorithms. The MOS test 2004. [2] P. Vary and R. Martin, Digital Speech Transmission: Enhance- was performed by 24 persons. The test set was taken in a ment, Coding and Error Concealment, John Wiley & Sons, New randomized order to avoid statistical dependences on the York, NY, USA, 2006. test order. Obviously, the FLMS approach using spectral [3] E. Hansler¨ and G. Schmidt, Speech and Audio Processing in combining as reference signal and the “ideal” FLMS filter Adverse Environments: Signals and Communication Technolo- reference approach are rated as the best noise reduction gie, Springer, Berlin, Germany, 2008. algorithm, where the values of the combined approach are [4] S. Boll, “Suppression of acoustic noise in speech using spectral similar to the results with the reference implementation subtraction,” IEEE Transactions on Acoustics, Speech and Signal of the “ideal” FLMS filter solution. From this evalua- Processing, vol. 27, no. 2, pp. 113–120, 1979. tion, it can also be seen that the FLMS approach with [5] R. Martin and P. Vary, “A symmetric two microphone speech spectral combining outperforms the pure FLMS and the enhancement system theoretical limits and application in a car pure spectral combining algorithms in all tested acoustic environment,” in Proceedings of the Digital Signal Processing situations. Workshop, pp. 451–452, Helsingoer, Denmark, August 1992. [6] R. Martin and P. Vary, “Combined acoustic echo cancellation, The combined approach sounds more natural compared dereverberation and noise reduction: a two microphone to the pure spectral combining. The SNR and distance values approach,” Annales des T´el´ecommunications,vol.49,no.7-8, are close to the “ideal” FLMS scheme. The speech is free of pp. 429–438, 1994. musical tones. The lack of musical noise can also be seen in [7] A. A. Azirani, R. L. Bouquin-Jeannes,` and G. Faucon, Figure 11, which shows the spectrograms of the enhanced “Enhancement of speech degraded by coherent and incoher- speech and the input signals. ent noise using a cross-spectral estimator,” IEEE Transactions EURASIP Journal on Advances in Signal Processing 13


Research Article DOA Estimation with Local-Peak-Weighted CSP

Osamu Ichikawa, Takashi Fukuda, and Masafumi Nishimura

IBM Research-Tokyo, 1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan

Correspondence should be addressed to Osamu Ichikawa, [email protected]

Received 31 July 2009; Revised 18 December 2009; Accepted 4 January 2010

Academic Editor: Sharon Gannot

Copyright © 2010 Osamu Ichikawa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction of arrival (DOA) estimation for beamforming in a noisy environment. Our sound source is a human speaker and the noise is broadband noise in an automobile. The harmonic structures in the human speech spectrum can be used for weighting the CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. In our new approach, the observed power spectrum is directly converted into weights for the CSP analysis by retaining only the local peaks considered to be harmonic structures. Our experiment showed the proposed approach significantly reduced the errors in localization, and it showed further improvements when used with other weighting algorithms.

1. Introduction

The performance of automatic speech recognition (ASR) is severely affected in noisy environments. For example, in automobiles the ASR error rates during high-speed cruising with an open window are generally high. In such situations, the noise reduction of beamforming technology can improve the ASR accuracy. However, all beamformers except for Blind Signal Separation (BSS) require accurate localization to focus on the target sound source. If a beamformer has high performance with acute directivity, then the performance declines greatly if the localization is inaccurate. This means ASR may actually lose accuracy with a beamformer if the localization is poor in a noisy environment. Accurate localization is critically important for ASR with a beamformer.

For sound source localization, conventional methods include MUSIC [1, 2], Minimum Variance (MV), Delay and Sum (DS), and Cross-power Spectrum Phase (CSP) [3] analysis. For two-microphone systems installed on physical objects such as dummy heads or external ears, approaches with head-related transfer functions (HRTF) have been investigated to model the effect of diffraction and reflection [4]. Profile Fitting [5] can also address the diffraction and reflection, with the advantage of reducing the effects of noise sources through localization.

Among these methods, CSP analysis is popular because it is accurate, reliable, and simple. CSP analysis measures the time differences in the signals from two microphones using normalized correlation. The differences correspond to the direction of arrival (DOA) of the sound sources. Using multiple pairs of microphones, CSP analysis can be enhanced for 2D or 3D space localization [6].

This paper seeks to improve CSP analysis in noisy environments with a special weighting algorithm. We assume the target sound source is a human speaker and the noise is broadband noise such as a fan, wind, or road noise in an automobile. Denda et al. proposed weighted CSP analysis using average speech spectrums as weights [7]. The assumption is that a subband with more speech power conveys more reliable information for localization. However, it did not use the harmonic structures of human speech. Because the harmonic bins must contain more speech power than the other bins, they should give us more reliable information in noisy environments. The use of harmonic structures for localization has been investigated in prior art [8, 9], but not for CSP analysis. This work estimated the pitches (F0) of the target sound and extracted localization cues from the harmonic structures based on those pitches. However, the pitch estimation and the associated voiced-unvoiced classification may be insufficiently accurate in noisy environments. Also, it should be noted that not all harmonic bins have distinct harmonic structures. Some bins may not be in the speech formants and may be dominated by noise. Therefore, we want a special weighting algorithm that puts larger weights on the bins where the harmonic structures are distinct, without requiring explicit pitch detection and voiced-unvoiced classification.

2. Sound Source Localization Using CSP Analysis

2.1. CSP Analysis. CSP analysis measures the normalized correlations between two-microphone inputs with an Inverse Discrete Fourier Transform (IDFT) as

\varphi_T(i) = \mathrm{IDFT}\left[\frac{S_{1,T}(j)\, S_{2,T}^{*}(j)}{\left|S_{1,T}(j)\right|\left|S_{2,T}(j)\right|}\right],   (1)

where S_{m,T} is a complex spectrum at the T-th frame observed with microphone m, and * denotes the complex conjugate. The bin number j corresponds to the frequency. The CSP coefficient \varphi_T(i) is a time-domain representation of the normalized correlation for the i-sample delay. For a stable representation, the CSP coefficients should be processed as a moving average over several frames around T, as long as the sound source is not moving, using

\bar{\varphi}_T(i) = \frac{1}{2H+1}\sum_{l=-H}^{H}\varphi_{T+l}(i),   (2)

where 2H + 1 is the number of averaged frames. Figure 1 shows an example of \bar{\varphi}_T. In clean conditions, there is a sharp peak for a sound source. The estimated DOA i_T for the sound source is

i_T = \arg\max_i \bar{\varphi}_T(i).   (3)

[Figure 1: An example of CSP, plotted for delays i = -7 to +7.]

2.2. Tracking a Moving Sound Source. If a sound source is moving, the past location or DOA can be used as a cue to the new location. Tracking techniques may use Dynamic Programming (DP), the Viterbi search [10], Kalman Filters, or Particle Filters [11]. For example, to find the series of DOAs that maximizes the evaluation function over the input speech frames, DP can use the evaluation function \Psi as

\Psi_T(i) = \bar{\varphi}_T(i) + \max_{i-1\le k\le i+1}\left(L(k,i)\cdot \Psi_{T-1}(k)\right),   (4)

where L(k, i) is a cost function from k to i.

2.3. Weighted CSP Analysis. Equation (1) can be viewed as a summation of each contribution at bin j. Therefore we can introduce a weight W(j) on each bin so as to focus on the more reliable bins, as

\varphi_T(i) = \mathrm{IDFT}\left[W(j)\cdot\frac{S_{1,T}(j)\, S_{2,T}^{*}(j)}{\left|S_{1,T}(j)\right|\left|S_{2,T}(j)\right|}\right].   (5)

Denda et al. introduced an average speech spectrum for the weights [7] to focus on human speech. Figure 2 shows their weights. We use the symbol W_{Denda} for later reference to these weights. It does not have any suffix T, since it is time invariant.

[Figure 2: Average speech spectrum weight (weight versus frequency, 0 to 8000 Hz).]

Another weighting approach would be to use the local SNR [12], as long as the ambient noise is stationary and measurable. For our evaluation in Section 4, we simply used larger weights where the local SNR is high, as

W_{SNR_T}(j) = \frac{\max\left(\log\left|S_T(j)\right|^2 - \log\left|N_T(j)\right|^2,\ \varepsilon\right)}{K_T},   (6)

where N_T is the spectral magnitude of the average noise, \varepsilon is a very small constant, and K_T is a normalizing factor

K_T = \sum_k \max\left(\log\left|S_T(k)\right|^2 - \log\left|N_T(k)\right|^2,\ \varepsilon\right).   (7)

Figure 3(c) shows an example of the local SNR weights.
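To make (1), (3), and (5) concrete, here is a minimal NumPy sketch of the weighted CSP for one frame pair; the FFT length, the small regularizing constant, and the weight vector are illustrative placeholders, not values prescribed by the paper.

    import numpy as np

    def weighted_csp(frame1, frame2, weights=None, n_fft=512):
        """CSP for one frame pair: IDFT of the normalized cross-power
        spectrum, eq. (1); an optional per-bin weight W(j) (length
        n_fft//2 + 1, matching the rfft bins) implements eq. (5)."""
        S1 = np.fft.rfft(frame1, n_fft)
        S2 = np.fft.rfft(frame2, n_fft)
        gcc = S1 * np.conj(S2)
        gcc /= np.abs(S1) * np.abs(S2) + 1e-12  # guard against empty bins
        if weights is not None:
            gcc *= weights                      # W(j) of eq. (5)
        return np.fft.irfft(gcc, n_fft)

    def estimate_doa(phi, max_delay=7):
        """Eq. (3): the delay index with the largest CSP coefficient.
        Negative delays wrap around to the end of the IDFT output."""
        lags = np.arange(-max_delay, max_delay + 1)
        return lags[np.argmax(phi[lags])]

Averaging the CSP coefficients of neighboring frames, as in (2), stabilizes the peak before the arg max is taken.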

[Figure 3: Sample spectra and the associated weights. The spectra were of the recording with air conditioner noise at an SNR of 0 dB. The noisy speech spectrum (b) was sampled in a vowel segment. Panels: (a) a sample of the average noise spectrum (log power versus frequency); (b) a sample of the observed noisy speech spectrum (log power versus frequency); (c) a sample of the local SNR weights; (d) a sample of the local peak weights.]

3. Harmonic Structure-Based Weighting

3.1. Comb Weights. If there is accurate information about the pitch and voiced-unvoiced labeling of the input speech, then we can design comb filters [13] for the frames in the voiced segments. The optimal CSP weights will be equivalent to the gain of the comb filters to selectively use those harmonic bins. Figure 4 shows an example of the weights when the pitch is 300 Hz.

[Figure 4: A sample of comb weight (pitch = 300 Hz); weight versus frequency, 0 to 8000 Hz.]

Unfortunately, the estimates of the pitch and the voiced-unvoiced classification become inaccurate in noisy environments. Figure 5 shows our tests using the "pitch" command in SPTK-3.0 [14] to obtain the pitch and voiced-unvoiced information. There are many outliers in the low SNR conditions.

[Figure 5: A sample waveform (clean) and its pitches detected by SPTK in various SNR situations (25 dB (clean), 10 dB, 5 dB, 0 dB). The threshold of voiced-unvoiced classification was set to 6.0 (SPTK default). For the frames detected as unvoiced, SPTK outputs zero. The test data was prepared by blending noise at different SNRs. The noise was recorded in a car moving on an expressway with a fan at a medium level.]
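As an illustration of such comb weights, the sketch below builds a comb-shaped gain over the DFT bins from a given pitch; the raised-cosine shape and the 0.6 to 1.5 gain range are our assumptions chosen to resemble Figure 4, not the comb filter design of [13].

    import numpy as np

    def comb_weights(pitch_hz, n_bins=257, fs=22050, lo=0.6, hi=1.5):
        """Comb-shaped CSP weights that peak at integer multiples of
        pitch_hz and dip in between (raised cosine in harmonic phase)."""
        freqs = np.arange(n_bins) * fs / (2.0 * (n_bins - 1))
        phase = 2.0 * np.pi * freqs / pitch_hz  # multiples of 2*pi at harmonics
        return lo + (hi - lo) * 0.5 * (1.0 + np.cos(phase))

    w = comb_weights(300.0)  # weights for a 300 Hz pitch, as in Figure 4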

[Figure 6: Process to obtain Local Peak Weight. A spectrogram of the observed input, with noise/unvoiced and voiced frames marked, feeds the following chain, shown with sample outputs for a voiced frame and an unvoiced frame: log power spectrum; DCT to get cepstrum; cut off upper and lower cepstrum; inverse DCT; take the exponential and normalize to get the weights W(ω); weighted CSP.]

Many researchers have tried to improve the accuracy of the detection in noisy environments [15], but their solutions require some threshold for voiced-unvoiced classification [16]. When noise-corrupted speech is falsely detected as unvoiced, there is little benefit from the CSP weighting.

There is another problem with the uniform adoption of comb weights for all of the bins. Those bins not in the speech formants and degraded by noise may not contain reliable cues even though they are harmonic bins. Such bins should receive smaller weights.

Therefore, in Section 3.2, we explore a new weighting algorithm that does not depend on explicit pitch detection or voiced-unvoiced classification. Our approach is like a continuous converter from an input spectrum to a weight vector, which can be locally large for the bins whose harmonic structures are distinct.

3.2. Proposed Local Peak Weights. We previously proposed a method for speech enhancement called Local Peak Enhancement (LPE) to provide robust ASR even in very low SNR conditions due to driving noises from an open window or loud air conditioner noises [17]. LPE does not leverage pitch information explicitly, but estimates the filters from the observed speech to enhance the speech spectrum.

LPE assumes that pitch information containing the harmonic structure is included in the middle range of the cepstral coefficients obtained with the discrete cosine transform (DCT) from the power spectral coefficients. The LPE filter retrieves information only from that range, so it is designed to enhance the local peaks of the harmonic structures for voiced speech frames. Here, we propose that the LPE filter be used for the weights in the CSP approach. This use of the LPE filter is named Local Peak Weight (LPW), and we refer to the CSP with LPW as the Local-Peak-Weighted CSP (LPW-CSP).

Figure 6 shows all of the steps for obtaining the LPW and sample outputs of each step for both a voiced frame and an unvoiced frame. The process is the same for all of the frames, but the generated filters differ depending on whether or not the frame is voiced speech, as shown in the figure. Here are the details for each step; a compact implementation sketch follows the step list.

(1) Convert the observed spectrum from one of the microphones to a log power spectrum Y_T(j) for each frame, where T and j are the frame number and the bin index of the DFT. Optionally, we may take a moving average using several frames around T, to smooth the power spectrum for Y_T(j).

(2) Convert the log power spectrum Y_T(j) into the cepstrum C_T(i) by using D(i, j), a DCT matrix:

C_T(i) = \sum_j D(i,j)\cdot Y_T(j),   (8)

where i is the bin number of the cepstral coefficients. In our experiments, the size of the DCT matrix is 256 by 256.

(3) The cepstra represent the curvatures of the log power spectra. The lower and higher cepstra include long and short oscillations, while the medium cepstra capture the harmonic structure information. Thus the range of cepstra is chosen by filtering out the lower and upper cepstra in order to cover the possible harmonic structures in the human voice:

\hat{C}_T(i) = \begin{cases}\lambda\cdot C_T(i) & \text{if } (i < I_L)\ \text{or}\ (i > I_H),\\ C_T(i) & \text{otherwise},\end{cases}   (9)

where \lambda is a small constant. I_L and I_H correspond to the bin indexes of the possible pitch range, which for human speech is from 100 Hz to 400 Hz. This assumption gives I_L = 55 and I_H = 220, when the sampling frequency is 22 kHz.

(4) Convert \hat{C}_T(i) back to the log power spectrum domain V_T(j) by using the inverse DCT:

V_T(j) = \sum_i D^{-1}(j,i)\cdot \hat{C}_T(i).   (10)

(5) Then convert back to a linear power spectrum:

w_T(j) = \exp\left(V_T(j)\right).   (11)

(6) Finally, we obtain the LPW, after normalizing, as

W_{LPW_T}(j) = \frac{w_T(j)}{\sum_k w_T(k)}.   (12)

For voiced speech frames, the LPW is designed to retain only the local peaks of the harmonic structure, as shown in the bottom-right graph in Figure 6 (see also Figure 3(d)). For unvoiced speech frames, the result will be almost flat due to the lack of local peaks with the target harmonic structure. Unlike the comb weights, the LPW is not uniform over the target frequencies and is more focused on the frequencies where harmonic structures are observed in the input spectrum.
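The following sketch condenses steps (1) to (6) into NumPy, using an orthonormal DCT in place of the matrix D; the value of λ is an arbitrary small placeholder (the text only says "a small constant"), and the smoothing option of step (1) is omitted for brevity.

    import numpy as np
    from scipy.fftpack import dct, idct

    def local_peak_weight(frame, n_fft=512, i_lo=55, i_hi=220, lam=0.1):
        """Steps (1)-(6): lifter the cepstrum so that only the quefrency
        range of plausible pitches survives, then map back to weights."""
        spec = np.abs(np.fft.rfft(frame, n_fft))[:n_fft // 2]  # 256 bins
        Y = np.log(spec ** 2 + 1e-12)             # (1) log power spectrum
        C = dct(Y, type=2, norm='ortho')          # (8) cepstrum
        i = np.arange(C.size)
        C_hat = np.where((i < i_lo) | (i > i_hi), lam * C, C)  # (9)
        V = idct(C_hat, type=2, norm='ortho')     # (10) inverse DCT
        w = np.exp(V)                             # (11) linear spectrum
        return w / w.sum()                        # (12) normalized LPW

For a voiced frame the output peaks at the harmonic bins; for an unvoiced frame it is nearly flat, which is exactly the behavior that removes the need for an explicit voiced-unvoiced decision.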

3.3. Combination with Existing Weights. The proposed LPW and existing weights can be used in various combinations. For the combinations, the two obvious choices are the sum and the product. In this paper, the combinations are defined as the products of each component for each bin j, because the scales of the components are too different for a simple summation, and we hope to minimize fake peaks in the weights by using the products of different metrics. Equations (13) to (16) show the combinations we evaluate in Section 4:

W_{LPW\&Denda_T}(j) = W_{LPW_T}(j)\cdot W_{Denda}(j),   (13)

W_{LPW\&SNR_T}(j) = W_{LPW_T}(j)\cdot W_{SNR_T}(j),   (14)

W_{SNR\&Denda_T}(j) = W_{SNR_T}(j)\cdot W_{Denda}(j),   (15)

W_{LPW\&SNR\&Denda_T}(j) = W_{LPW_T}(j)\cdot W_{SNR_T}(j)\cdot W_{Denda}(j).   (16)

4. Experiment

In the experimental car, two microphones were installed near the map-reading lights on the ceiling with 12.5 cm between them. We used omnidirectional microphones. The sampling frequency for the recordings was 22 kHz. In this configuration, CSP gives 15 steps from -7 to +7 for the DOA resolution (see Figure 7).

[Figure 7: Microphone installation and the resolution of DOA in the experimental car (delay steps -7 to +7).]

A higher sampling rate might yield higher directional resolution. However, many beamformers do not support higher sampling frequencies because of processing costs and aliasing problems. We also know that most ASR systems work at sampling rates below 22 kHz. These considerations led us to use 22 kHz.

Again, we could have gained directional resolution by increasing the distance between the microphones. In general, a larger baseline distance improves the performance of a beamformer, especially for lower frequency sounds. However, this increases the aliasing problems for higher frequency sounds. Our separation of 12.5 cm was another tradeoff. (The delay-to-bearing mapping implied by this geometry is sketched below.)
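For reference, the bearing implied by an integer delay i follows from the array geometry; a minimal sketch, assuming a far-field source and a nominal speed of sound of 340 m/s (the paper itself only quantizes the delay to the 15 steps i = -7, ..., +7):

    import numpy as np

    def delay_to_angle(i, mic_distance=0.125, fs=22050, c=340.0):
        """Map an integer sample delay i to a DOA angle in degrees;
        the arcsine argument is clipped for delays near end-fire."""
        s = np.clip(i * c / (fs * mic_distance), -1.0, 1.0)
        return np.degrees(np.arcsin(s))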
Our analysis used a Hamming window and 23-ms-long frames with 10-ms frame shifts. The FFT length was 512. For (2), the length of the moving average was 0.2 seconds.

The test subject speakers were 4 females and 4 males. Each speaker read 50 Japanese commands. These are short phrases for automobiles known as Free Form Command [18]. The total number of utterances was 400. They were recorded in a stationary car, a full-size sedan. The subject speakers sat in the driver's seat. The seat was adjusted to each speaker's preference, so the distance to the microphones varied from approximately 40 cm to 60 cm. Two types of noise were recorded separately in a moving car, and they were combined with the speech data at various SNRs (clean, 10 dB, and 0 dB). The SNRs were measured as ratios of speech power and noise power, ignoring the frequency components below 300 Hz. One of the recorded noises was an air-conditioner at maximum fan speed while driving on a highway with the windows closed. This will be referred to as "Fan Max". The other was of driving noise on a highway with the windows fully opened. This will be referred to as "Window Full Open". Figure 8 compares the average spectra of the two noises. "Window Full Open" contains more power around 1 kHz, and "Fan Max" contains relatively large power around 4 kHz. Although it is not shown in the graph, "Window Full Open" contains lots of transient noise from the wind and other automobiles.

[Figure 8: Averaged noise spectrum used in the experiment ("Window full open" and "Fan max"; log power versus frequency).]

Figure 9 shows the system used for this evaluation. We used various types of weights for the weighted CSP analysis. The input from one microphone was used to generate the weights. Using both microphones could provide better weights, but in this experiment we used only one microphone for simplicity. Since the baseline (normal CSP) does not use weighting, all of its weights were set to 1.0. The weighted CSP was calculated using (5), with smoothing over the frames using (2). In addition to the weightings, we introduced a lower cut-off frequency of 100 Hz and an upper cut-off frequency of 5 kHz to stabilize the CSP analysis. Finally, the DOA was estimated using (3) for each frame. We did not use the tracking algorithms discussed in Section 2.2, because we wanted to accurately measure the contributions of the various types of weights in a simplified form. Actually, the subject speakers rarely moved when speaking.

[Figure 9: System for the evaluation. The two microphone signals are transformed by DFT into S_{1,T}(j) and S_{2,T}(j); one channel is used to get the weight W(j); the weighted CSP \varphi_T(i) is calculated, smoothed over frames, and the DOA is determined.]

The performance was measured as frame-based accuracy. The frames reporting the correct DOA were counted, and that count was divided by the total number of speech frames. The correct DOA values were determined manually. The speech segments were determined using clean speech data with a rather strict threshold, so extra segments were not included before or after the phrases.

4.1. Experiment Using Single Weights. We evaluated five types of CSP analysis.

Case 1. Normal CSP (uniform weights, baseline).

Case 2. Comb-Weighted CSP.

Case 3. Local-Peak-Weighted CSP (our proposal).

Case 4. Local-SNR-Weighted CSP.

Case 5. Average-Speech-Spectrum-Weighted CSP (Denda).

Case 2 requires the pitch and voiced-unvoiced information. We used SPTK-3.0 [14] with default parameters to obtain this data. Case 4 requires estimating the noise spectrum. In this experiment, the noise spectrum was continuously updated within the noise segments based on oracle VAD information, as

N_T(j) = (1-\alpha)\cdot N_{T-1}(j) + \alpha\cdot S_T(j),
\alpha = \begin{cases}0.0 & \text{if VAD = active},\\ 0.1 & \text{otherwise}.\end{cases}   (17)

The initial value of the noise spectrum for each utterance file was given by the average of all of the noise segments in that file.
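A direct transcription of (17), assuming a per-frame spectral magnitude array and a boolean VAD flag:

    import numpy as np

    def update_noise(n_prev, s_frame, vad_active, alpha=0.1):
        """Eq. (17): recursively average the observed spectrum into the
        noise estimate only while the oracle VAD reports no speech."""
        a = 0.0 if vad_active else alpha
        return (1.0 - a) * n_prev + a * s_frame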
Figures 10 and 11 show the experimental results for "Fan Max" and "Window Full Open", respectively. Case 2 failed to show significant error reduction in either situation. This failure is probably due to bad pitch estimation or poor voiced-unvoiced classification in the noisy environments.

[Figure 10: Error rate of frame-based DOA detection (Fan Max: single-weight cases; error (%) versus SNR for Cases 1 to 5).]

[Figure 11: Error rate of frame-based DOA detection (Window Full Open: single-weight cases; error (%) versus SNR for Cases 1 to 5).]

This suggests that the result could be improved by introducing robust pitch trackers and voiced-unvoiced classifiers. However, there is an intrinsic problem: noisier speech segments are more likely to be classified as unvoiced and thus lose the benefit of weighting.

Case 5 failed to show significant error reduction for "Fan Max", but it showed good improvement for "Window Full Open". As shown in Figure 8, "Fan Max" contains more noise power around 4 kHz than around 1 kHz. In contrast, the speech power is usually lower around 4 kHz than around 1 kHz. Therefore, the 4-kHz region tends to be more degraded. However, Denda's approach does not sufficiently lower the weights in the 4-kHz region, because the weights are time-invariant and independent of the noise.

Case 3 and Case 4 outperformed the baseline in both situations. For "Fan Max", since the noise was almost stationary, the local-SNR approach can accurately estimate the noise. This is also a favorable situation for LPW, because the noise does not include harmonic components. However, LPW does little for consonants. Therefore, Case 4 had the best results for "Fan Max". In contrast, since the noise is nonstationary for "Window Full Open", Case 3 had slightly fewer errors than Case 4. We believe this is because the noise estimation for the local SNR calculations is inaccurate for nonstationary noises. Considering that the local SNR approach in this experiment used given and accurate VAD information, the actual performance in the real world would probably be worse than our results. LPW has the advantage that it requires neither noise estimation nor VAD information.

4.2. Experiment Using Combined Weights. We also evaluated some combinations of the weights in Cases 3 to 5. The combined weights were calculated using (13) to (16).

Case 6. CSP weighted with LPW and Denda (Cases 3 and 5).

Case 7. CSP weighted with LPW and Local SNR (Cases 3 and 4).

Case 8. CSP weighted with Local SNR and Denda (Cases 4 and 5).

Case 9. CSP weighted with LPW, Local SNR, and Denda (Cases 3, 4, and 5).

Figures 12 and 13 show the experimental results for "Fan Max" and "Window Full Open", respectively, for the combined-weight cases.

[Figure 12: Error rate of frame-based DOA detection (Fan Max: combined-weight cases; error (%) versus SNR for Cases 1, 6, 7, 8, and 9).]

[Figure 13: Error rate of frame-based DOA detection (Window Full Open: combined-weight cases; error (%) versus SNR for Cases 1, 6, 7, 8, and 9).]

For the combination of two weights, the best combination was dependent on the situation. For "Fan Max", Case 7, the combination of LPW and the local SNR approach, was best, reducing the error by 51% for 0 dB. For "Window Full Open", Case 6, the combination of LPW and Denda's approach, was best, reducing the error by 37% for 0 dB. These results correspond to the discussion in Section 4.1 about how the local SNR approach is suitable for stationary noises, while LPW is suitable for nonstationary noises, and Denda's approach works well with noise concentrated in the lower frequency region.

In Case 9, the combination of the three weights worked well in both situations. Because each weighting method has different characteristics, we expected that their combination would help against variations in the noise. Indeed, the results were almost equivalent to the best combinations of the paired weights in each situation.

5. Conclusion

We proposed a new weighting algorithm for CSP analysis to improve the accuracy of DOA estimation for beamforming in a noisy environment, assuming the source is human speech and the noise is broadband noise such as a fan, wind, or road noise in an automobile.

The proposed weights are extracted directly from the input speech using the midrange of the cepstrum. They represent the local peaks of the harmonic structures. As the process does not involve voiced-unvoiced classification, it does not have to switch its behavior over the voiced-unvoiced transitions.

Experiments showed the proposed local peak weighting algorithm significantly reduced the errors in localization using CSP analysis. A weighting algorithm using local SNR also reduced the errors, but it did not produce the best results in the nonstationary noise situation in our evaluations, and it requires VAD information to estimate the noise spectrum. Our proposed algorithm requires neither VAD information, voiced-unvoiced information, nor pitch information, and it does not assume the noise is stationary. Therefore, it showed advantages in the nonstationary noise situation. It can also be combined with existing weighting algorithms for further improvements.

References

[1] D. Johnson and D. Dudgeon, Array Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
[2] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and separation in near field," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, no. 11, pp. 2286-2294, 2000.
[3] M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), pp. 273-276, 1994.
[4] K. D. Martin, "Estimating azimuth and elevation from interaural differences," in Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), p. 4, 1995.
[5] O. Ichikawa, T. Takiguchi, and M. Nishimura, "Sound source localization using a profile fitting method with sound reflectors," IEICE Transactions on Information and Systems, vol. E87-D, no. 5, pp. 1138-1145, 2004.
[6] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 1053-1056, 2000.
[7] Y. Denda, T. Nishiura, and Y. Yamashita, "Robust talker direction estimation based on weighted CSP analysis and maximum likelihood estimation," IEICE Transactions on Information and Systems, vol. E89-D, no. 3, pp. 1050-1057, 2006.

[8] T. Yamada, S. Nakamura, and K. Shikano, "Robust speech recognition with speaker localization by a microphone array," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), vol. 3, pp. 1317-1320, 1996.
[9] T. Nagai, K. Kondo, M. Kaneko, and A. Kurematsu, "Estimation of source location based on 2-D MUSIC and its application to speech recognition in cars," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), vol. 5, pp. 3041-3044, 2001.
[10] T. Yamada, S. Nakamura, and K. Shikano, "Distant-talking speech recognition based on a 3-D Viterbi search using a microphone array," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 48-56, 2002.
[11] H. Asoh, I. Hara, F. Asano, and K. Yamamoto, "Tracking human speech events using a particle filter," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), vol. 2, pp. 1153-1156, 2005.
[12] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, "Robust sound source localization using a microphone array on a mobile robot," in Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS '03), vol. 2, pp. 1228-1233, 2003.
[13] H. Tolba and D. O'Shaughnessy, "Robust automatic continuous-speech recognition based on a voiced-unvoiced decision," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '98), p. 342, 1998.
[14] SPTK: http://sp-tk.sourceforge.net/.
[15] M. Wu, D. L. Wang, and G. J. Brown, "A multi-pitch tracking algorithm for noisy speech," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 1, pp. 369-372, 2002.
[16] T. Nakatani, T. Irino, and P. Zolfaghari, "Dominance spectrum based V/UV classification and F0 estimation," in Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech '03), pp. 2313-2316, 2003.
[17] O. Ichikawa, T. Fukuda, and M. Nishimura, "Local peak enhancement combined with noise reduction algorithms for robust automatic speech recognition in automobiles," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 4869-4872, 2008.
[18] IBM Embedded ViaVoice, http://www-01.ibm.com/software/pervasive/embedded_viavoice/.

Research Article Shooter Localization in Wireless Microphone Networks

David Lindgren,1 Olof Wilsson,2 Fredrik Gustafsson (EURASIP Member),2 and Hans Habberstad1

1 Swedish Defence Research Agency, FOI, Department of Information Systems, Division of Informatics, 581 11 Linköping, Sweden
2 Linköping University, Department of Electrical Engineering, Division of Automatic Control, 581 83 Linköping, Sweden

Correspondence should be addressed to David Lindgren, [email protected]

Received 31 July 2009; Accepted 14 June 2010

Academic Editor: Patrick Naylor

Copyright © 2010 David Lindgren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Shooter localization in a wireless network of microphones is studied. Both the acoustic muzzle blast (MB) from the gunfire and the ballistic shock wave (SW) from the bullet can be detected by the microphones and considered as measurements. The MB measurements give rise to a standard sensor network problem, similar to time difference of arrivals in cellular phone networks, and the localization accuracy is good, provided that the sensors are well synchronized compared to the MB detection accuracy. The detection times of the SW depend on both shooter position and aiming angle and may provide additional information beside the shooter location, but again this requires good synchronization. We analyze the approach to base the estimation on the time difference of MB and SW at each sensor, which becomes insensitive to synchronization inaccuracies. Cramér-Rao lower bound analysis indicates how a lower bound of the root mean square error depends on the synchronization error for the MB and the MB-SW difference, respectively. The estimation problem is formulated in a separable nonlinear least squares framework. Results from field trials with different types of ammunition show excellent accuracy using the MB-SW difference for both the position and the aiming angle of the shooter.

1. Introduction

Several acoustic shooter localization systems are today commercially available; see, for instance, [1-4]. Typically, one or more microphone arrays are used, each synchronously sampling acoustic phenomena associated with gunfire. An overview is found in [5]. Some of these systems are mobile, and in [6] it is even described how soldiers can carry the microphone arrays on their helmets. One interesting attempt to find the direction of sound from one microphone only is described in [7]. It is based on direction dependent spatial filters (mimicking the human outer ear) and prior knowledge of the sound waveform, but this approach has not yet been applied to gun shots.

Indeed, less common are shooter localization systems based on singleton microphones geographically distributed in a wireless sensor network. An obvious issue in wireless networks is the sensor synchronization. For localization algorithms that rely on accurate timing, like the ones based on time difference of arrival (TDOA), it is of major importance that synchronization errors are carefully controlled. Regardless of whether the synchronization is solved by using GPS or other techniques, see, for instance, [8-10], the synchronization procedures are associated with costs in battery life or communication resources that usually must be kept at a minimum.

In [11] the synchronization error impact on the sniper localization ability of an urban network is studied by using Monte Carlo simulations. One of the results is that the inaccuracy increased significantly (>2 m) for synchronization errors exceeding approximately 4 ms; 56 small wireless sensor nodes were modeled. Another closely related work that deals with mobile asynchronous sensors is [12], where the estimation bounds with respect to both sensor synchronization and position errors are developed and validated by Monte Carlo simulations. Also [13] should be mentioned, where combinations of directional and omnidirectional acoustic sensors for sniper localization are evaluated by perturbation analysis. In [14], estimation bounds for multiple acoustic arrays are developed and validated by Monte Carlo simulations.

In this paper we derive fundamental estimation bounds for shooter localization systems based on wireless sensor networks, with the synchronization errors in focus. An accurate method independent of the synchronization errors will be analyzed (the MB-SW model), as well as a useful bullet deceleration model. The algorithms are tested on data from a field trial with 10 microphones spread over an area of 100 m and with gunfire at distances up to 400 m. Partial results of this investigation appeared in [15] and almost simultaneously in [12].

The outline is as follows. Section 2 sketches the localization principle and describes the acoustical phenomena that are used. Section 3 gives the estimation framework. Section 4 derives the signal models for the muzzle blast (MB), shock wave (SW), combined MB;SW, and difference MB-SW, respectively. Section 5 derives expressions for the root mean square error (RMSE) Cramér-Rao lower bound (CRLB) for the described models and provides numerical results from a realistic scenario. Section 6 presents the results from field trials, and Section 7 gives the conclusions.

2. Localization Principle

Two acoustical phenomena associated with gunfire will be exploited to determine the shooter's position: the muzzle blast and the shock wave. The principle is to detect and time stamp the phenomena as they reach microphones distributed over an area, and let the shooter's position be estimated by, in a sense, the most likely point, considering the microphone locations and detection times.

The muzzle blast (MB) is the sound that probably most of us associate with a gun shot, the "bang". The MB is generated by the pressure depletion in effect of the bullet leaving the gun barrel. The sound of the MB travels at the speed of sound in all directions from the shooter. Provided that a sufficient number of microphones detect the MB, the shooter's position can be more or less accurately determined.

The shock wave (SW) is formed by supersonic bullets. The SW has (approximately) the shape of an expanding cone, with the bullet trajectory as axis, and reaches only microphones that happen to be located inside the cone. The SW propagates at the speed of sound in the direction away from the bullet trajectory, but since it is generated by a supersonic bullet, it always reaches the microphone before the MB, if it reaches the microphone at all. A number of SW detections may primarily reveal the direction to the shooter. Extra observations or assumptions on the ammunition are generally needed to deduce the distance to the shooter. The SW detection is also more difficult to utilize than the MB detection, since it depends on the bullet's speed and ballistic behavior.

Figure 1 shows an acoustic recording of gunfire. The first pulse is the SW, which for distant shooters significantly dominates the MB, not the least if the bullet passes close to the microphone. The figure shows real data, but a rather ideal case. Usually, and particularly in urban environments, there are reflections and other acoustic effects that make it difficult to accurately determine the MB and SW times. This issue will however not be treated in this work. We will instead assume that the detection error is stochastic with a certain distribution. A more thorough analysis of the SW propagation is given in [16].

[Figure 1: Signal from a microphone placed 180 m from a firing gun (amplitude versus time, 0 to 200 ms). Initial bullet speed is 767 m/s. The bullet passes the microphone at a distance of 30 m. The shock wave from the supersonic bullet reaches the microphone before the muzzle blast.]

Of course, the MB and SW (when present) can be used in conjunction with each other. One of the ideas exploited later is to utilize the time difference between the MB and SW detections. This way, the localization is independent of the clock synchronization errors that are always present in wireless sensor networks.

3. Estimation Framework

It is assumed throughout this work that

(1) the coordinates of the microphones are known with negligible error,

(2) the arrival times of the MB and SW at each microphone are measured with significant synchronization error,

(3) the shooter position and aim direction are the sought parameters.

Thus, assume that there are M microphones with known positions \{p_k\}_{k=1}^{M} in the network detecting the muzzle blast. Without loss of generality, the first S \le M ones also detect the shock wave. The detected times are denoted by \{y_k^{MB}\}_1^M and \{y_k^{SW}\}_1^S, respectively. Each detected time is subject to a detection error \{e_k^{MB}\}_1^M and \{e_k^{SW}\}_1^S, different for all times, and a clock synchronization error \{b_k\}_1^M specific for each microphone. The firing time t_0, shooter position x \in \mathbb{R}^3, and shooting direction \alpha \in \mathbb{R}^2 are unknown parameters.

Also the bullet speed v and the speed of sound c are unknown. Basic signal models for the detected times as a function of the parameters will be derived in the next section. The notation is summarized in Table 1.

Table 1: Notation. MB, SW, and MB-SW are different models, and L/N indicates if model parameters or signals enter the model linearly (L) or nonlinearly (N).

Variable | MB | SW | MB-SW | Description
M        |    |    |       | Number of microphones
S        |    |    |       | Number of microphones receiving the shock wave, S ≤ M
x        | N  | N  | N     | Position of shooter, R^n (n = 2, 3)
p_k      | N  | N  | N     | Position of microphone k, R^n (n = 2, 3)
y_k      | L  | L  | L     | Measured detection time for microphone at position p_k
t_0      | L  | L  | -     | Rifle or gun firing time
c        | L  | N  | N     | Speed of sound
v        | -  | N  | N     | Speed of bullet
α        | -  | N  | N     | Shooting direction, R^{n-1} (n = 2, 3)
b_k      | L  | L  | -     | Synchronization error for microphone k
e_k      | L  | L  | L     | Detection error at microphone k
r        | -  | N  | N     | Bullet speed decay rate
d_k      |    |    |       | Point of origin for shock wave received by microphone k
β        |    |    |       | Mach angle, sin β = c/v
γ        |    |    |       | Angle between line of sight to shooter and shooting angle

The derived signal models will be of the form

y = h(x, \theta; p) + e,   (1)

where y is a vector with the measured detection times, h is a nonlinear function with values in \mathbb{R}^{M+S}, and where \theta represents the unknown parameters apart from x. The error e is assumed to be stochastic; see Section 4.5. Given the sensor locations in p \in \mathbb{R}^{M\times 3}, nonlinear optimization can be performed to estimate x, using the nonlinear least squares (NLS) criterion:

\hat{x} = \arg\min_x \min_{\theta} V(x, \theta; p),
V(x, \theta; p) = \left\|y - h(x, \theta; p)\right\|_R^2.   (2)

Here, \arg\min denotes the minimizing argument, \min the minimum of the function, and \|v\|_Q^2 denotes the Q-norm, that is, \|v\|_Q^2 \triangleq v^T Q^{-1} v. Whenever Q is omitted, Q = I is assumed. The loss function norm R is chosen by consideration of the expected error characteristics. Numerical optimization, for instance the Gauss-Newton method, can here be applied to get the NLS estimate.

In the next section it will become clear that the assumed unknown firing time and the inverse speed of sound enter the model equations linearly. To exploit this fact, we identify a sublinear structure in the signal model and apply the weighted least squares method to the parameters appearing linearly; this is the separable least squares method, see, for instance, [17]. By doing so, the NLS search space is reduced, which in turn significantly reduces the computational burden. For that reason, the signal model (1) is rewritten as

y = h_N(x, \theta_N; p) + h_L(x, \theta_N; p)\,\theta_L + e.   (3)

Note that \theta_L enters linearly here. The NLS problem can then be formulated as

\hat{x} = \arg\min_x \min_{\theta_L, \theta_N} V(x, \theta_N, \theta_L; p),
V(x, \theta_N, \theta_L; p) = \left\|y - h_N(x, \theta_N; p) - h_L(x, \theta_N; p)\,\theta_L\right\|_R^2.   (4)

Since \theta_L enters linearly, it can be solved for by linear least squares (the arguments of h_L(x, \theta_N; p) and h_N(x, \theta_N; p) are suppressed for clarity):

\hat{\theta}_L = \arg\min_{\theta_L} V(x, \theta_N, \theta_L; p) = \left(h_L^T R^{-1} h_L\right)^{-1} h_L^T R^{-1}\left(y - h_N\right),   (5a)

P_L = \left(h_L^T R^{-1} h_L\right)^{-1}.   (5b)

Here, \hat{\theta}_L is the weighted least squares estimate and P_L is the covariance matrix of the estimation error. This simplifies the nonlinear minimization to

\hat{x} = \arg\min_x \min_{\theta_N} V(x, \theta_N, \hat{\theta}_L; p)
        = \arg\min_x \min_{\theta_N} \left\|\left(y - h_N\right) - h_L\left(h_L^T R^{-1} h_L\right)^{-1} h_L^T R^{-1}\left(y - h_N\right)\right\|_{\bar{R}}^2,   (6)

\bar{R} = R + h_L P_L h_L^T.

This general separable least squares (SLS) approach will now be applied to four different combinations of signal models for the MB and SW detection times.

4. Signal Models

4.1. Muzzle Blast Model (MB). According to the clock at microphone k, the muzzle blast (MB) sound is assumed to reach p_k at the time

y_k = t_0 + b_k + \frac{1}{c}\left\|p_k - x\right\| + e_k.   (7)

The shooter position x and microphone location p_k are in \mathbb{R}^n, where generally n = 3. However, both computational and numerical issues occasionally motivate a simplified plane model with n = 2. For all M microphones, the model is represented in vector form as

y = b + h_L(x; p)\,\theta_L + e,   (8)

where

\theta_L = \begin{bmatrix} t_0 & 1/c \end{bmatrix}^T,   (9a)

h_{L,k}(x; p) = \begin{bmatrix} 1 & \left\|p_k - x\right\| \end{bmatrix},   (9b)

and where y, b, and e are vectors with elements y_k, b_k, and e_k, respectively. 1_M is the vector with M ones, where M might be omitted if there is no ambiguity regarding the dimension. Furthermore, p is M-by-n, where each row is a microphone position. Note that the inverse of the speed of sound enters linearly. The L notation indicates that the quantity is part of a linear relation, as described in the previous section. With h_N = 0 and h_L = h_L(x; p), (6) gives

\hat{x} = \arg\min_x \left\|y - h_L\left(h_L^T R^{-1} h_L\right)^{-1} h_L^T R^{-1} y\right\|_{\bar{R}}^2,   (10a)

\bar{R} = R + h_L\left(h_L^T R^{-1} h_L\right)^{-1} h_L^T.   (10b)

Here, h_L depends on x as given in (9b). This criterion has computationally efficient implementations that, in many applications, make the time it takes to do an exhaustive minimization over a, say, 10-meter grid acceptable. The grid-based minimization of course reduces the risk of settling on suboptimal local minimizers, which otherwise could be a risk using greedy search methods. The objective function does, however, behave rather well. Figure 2 visualizes (10a) in logarithmic scale for data from a field trial (the norm is R = I). Apparently, there are only two local minima.

[Figure 2: Level curves of the muzzle blast localization criterion based on data from a field trial. Markers show the shooter and the microphones; the scale bar is 1000 m.]
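A minimal sketch of the grid-based MB criterion (10a), assuming R = I as in Figure 2 (the R-bar weighting is then dropped for brevity, leaving the plain projection residual); the microphone layout, detection times, and grid extent are placeholders.

    import numpy as np

    def mb_criterion(x, mic_pos, y):
        """Cost of candidate shooter position x: project the detection
        times onto the span of h_L(x; p) = [1, ||p_k - x||], eq. (9b),
        and return the squared norm of the residual, cf. eq. (10a)."""
        hL = np.column_stack([np.ones(len(mic_pos)),
                              np.linalg.norm(mic_pos - x, axis=1)])
        theta, *_ = np.linalg.lstsq(hL, y, rcond=None)  # [t0, 1/c]
        r = y - hL @ theta
        return r @ r

    def mb_grid_search(mic_pos, y, extent=1000.0, step=10.0):
        """Exhaustive 2-D minimization over a grid with 10 m spacing."""
        g = np.arange(-extent, extent + step, step)
        cand = [np.array([gx, gy]) for gx in g for gy in g]
        return min(cand, key=lambda x: mb_criterion(x, mic_pos, y))

In practice the grid minimizer is then refined by a local numerical search, as described above.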

4.2. Shock Wave Model (SW). In general, the bullet follows a ballistic three-dimensional trajectory. In practice, a simpler model with a two-dimensional trajectory with constant deceleration might suffice. Thus, it will be assumed that the bullet follows a straight line with initial speed v_0; see Figure 3. Due to air friction, the bullet decelerates, so when the bullet has traveled the distance \|d_k - x\|, for some point d_k on the trajectory, the speed is reduced to

v = v_0 - r\left\|d_k - x\right\|,   (11)

where r is an assumed known ballistic parameter. This is a rather coarse bullet trajectory model, compared with, for instance, the curvilinear trajectories proposed by [18], but we use it here for simplicity. This model is also a special case of the ballistic model used in [19].

The shock wave from the bullet trajectory propagates at the speed of sound c with angle \beta_k to the bullet heading. \beta_k is the Mach angle defined as

\sin\beta_k = \frac{c}{v} = \frac{c}{v_0 - r\left\|d_k - x\right\|}.   (12)

Here d_k is the point where the shock wave that reaches microphone k is generated. The time it takes the bullet to reach d_k is

\int_0^{\left\|x - d_k\right\|} \frac{d\xi}{v_0 - r\xi} = \frac{1}{r}\log\frac{v_0}{v_0 - r\left\|d_k - x\right\|}.   (13)

This time and the wave propagation time from d_k to p_k sum up to the total time from firing to detection:

y_k = t_0 + b_k + \frac{1}{r}\log\frac{v_0}{v_0 - r\left\|d_k - x\right\|} + \frac{1}{c}\left\|d_k - p_k\right\| + e_k,   (14)

according to the clock at microphone k. Note that the variable names y and e have, for notational simplicity, been reused from the MB model. Below, also h, \theta_N, and \theta_L will be reused. When there is ambiguity, a superscript will indicate exactly which entity is referred to, for instance, y^{MB}, h^{SW}.

It is a little bit tedious to calculate d_k. The law of sines gives

\frac{\sin\left(90^{\circ} - \beta_k - \gamma_k\right)}{\left\|d_k - x\right\|} = \frac{\sin\left(90^{\circ} + \beta_k\right)}{\left\|p_k - x\right\|},   (15)

which together with (12) implicitly defines d_k. We have not found any simple closed form for d_k, so we solve for d_k numerically, and in case of multiple solutions we keep the admissible one (which turns out to be unique). \gamma_k is trivially induced by the shooting direction \alpha (and x, p_k). Both these angles thus depend on x implicitly.

[Figure 3: Geometry of the supersonic bullet trajectory and shock wave. The bullet leaves the gun at x with aim α and speed v; the shock wave that reaches microphone p_k is generated at the trajectory point d_k and propagates at the speed of sound c at the Mach angle β_k to the bullet heading; γ_k is the angle between the shooting direction and the line of sight from x to p_k. Given the shooter location x, the shooting direction (aim) α, the bullet speed v, and the speed of sound c, the time it takes from firing the gun to detecting the shock wave can be calculated.]

The vector form of the model is

y = b + h_N(x, \theta_N; p) + h_L(x, \theta_N; p)\,\theta_L + e,   (16)

where

h_L(x, \theta_N; p) = 1,
\theta_L = t_0,   (17)
\theta_N = \begin{bmatrix} 1/c & \alpha^T & v_0 \end{bmatrix}^T,

and where row k of h_N(x, \theta_N; p) \in \mathbb{R}^{S\times 1} is

h_{N,k}(x, \theta_N; p_k) = \frac{1}{r}\log\frac{v_0}{v_0 - r\left\|d_k - x\right\|} + \frac{1}{c}\left\|d_k - p_k\right\|,   (18)

and d_k is the admissible solution to (12) and (15).

4.3. Combined Model (MB;SW). In the MB and SW models, the synchronization error has to be regarded as a noise component. In a combined model, each pair of MB and SW detections depends on the same synchronization error, and consequently the synchronization error can be regarded as a parameter (at least for all sensor nodes inside the SW cone). The total signal model can be fused from the MB and SW models via the total observation vector:

y^{MB;SW} = h_N^{MB;SW}(x, \theta_N; p) + h_L^{MB;SW}(x, \theta_N; p)\,\theta_L + e,   (19)

where

y^{MB;SW} = \begin{bmatrix} y^{MB} \\ y^{SW} \end{bmatrix},   (20)

\theta_L = \begin{bmatrix} t_0 & b^T \end{bmatrix}^T,   (21)

h_L^{MB;SW}(x, \theta_N; p) = \begin{bmatrix} 1_{M,1} & I_M \\ 1_{S,1} & \begin{bmatrix} I_S & 0_{S,M-S} \end{bmatrix} \end{bmatrix},   (22)

\theta_N = \begin{bmatrix} 1/c & \alpha^T & v_0 \end{bmatrix}^T,   (23)

h_N^{MB;SW}(x, \theta_N; p) = \begin{bmatrix} h_L^{MB}(x; p)\begin{bmatrix} 0 \\ 1/c \end{bmatrix} \\ h_N^{SW}(x, \theta_N; p) \end{bmatrix}.   (24)

4.4. Difference Model (MB-SW). Motivated by accurate localization despite synchronization errors, we study the MB-SW model:

y_k^{MB\text{-}SW} = y_k^{MB} - y_k^{SW} = h_{L,k}^{MB}(x; p)\,\theta_L^{MB} - h_{N,k}^{SW}(x, \theta_N; p) - h_{L,k}^{SW}(x, \theta_N; p)\,\theta_L^{SW} + e_k^{MB} - e_k^{SW}   (25)

for k = 1, 2, \ldots, S. This rather special model has also been analyzed in [12, 15]. The key idea is that y is, by cancellation, independent of both the firing time t_0 and the synchronization error b. The drawback, of course, is that there are only S equations (instead of a total of M + S) and that the detection error increases, e_k^{MB} - e_k^{SW}. However, when the synchronization errors are expected to be significantly larger than the detection errors, and when also S is sufficiently large (at least as large as the number of parameters), this model is believed to give better localization accuracy. This will be investigated later.

There are no parameters in (25) that appear linearly everywhere. Thus, the vector form for the MB-SW model can be written as

y^{MB\text{-}SW} = h_N^{MB\text{-}SW}(x, \theta_N; p) + e,   (26)

where

h_{N,k}^{MB\text{-}SW}(x, \theta_N; p_k) = \frac{1}{c}\left\|p_k - x\right\| - \frac{1}{r}\log\frac{v_0}{v_0 - r\left\|d_k - x\right\|} - \frac{1}{c}\left\|d_k - p_k\right\|,   (27)

and y = y^{MB} - y^{SW} and e = e^{MB} - e^{SW}. As before, d_k is the admissible solution to (12) and (15). The MB-SW least squares criterion is

\hat{x} = \arg\min_{x, \theta_N} \left\|y^{MB\text{-}SW} - h_N^{MB\text{-}SW}(x, \theta_N; p)\right\|_R^2,   (28)

which requires numerical optimization. Numerical experiments indicate that this optimization problem is more prone to local minima, compared to (10a) for the MB model; therefore good starting points for the numerical search are essential. One such starting point could, for instance, be the MB estimate \hat{x}^{MB}. An initial shooting direction could be given by assuming, in a sense, the worst possible case: that the shooter aims at some point close to the center of the microphone network.
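One way to obtain the admissible d_k numerically is to note that the shock path obeys Fermat's principle: the generation point minimizes bullet flight time plus acoustic propagation time, and the stationarity condition reproduces the Mach-angle relations (12) and (15). The paper only states that d_k is solved for numerically, so the dense 1-D search below is our assumed method; the default parameter values are those used in Section 5.3.

    import numpy as np

    def sw_arrival_time(x, aim, p_k, v0=700.0, c=330.0, r=0.63, n=4000):
        """Time from firing to SW detection at p_k (without t0, b, e):
        minimize flight time (13) plus propagation time over candidate
        generation points d = x + s*aim (aim is a unit vector)."""
        s_max = 0.99 * (v0 - c) / r          # bullet must stay supersonic
        s = np.linspace(1e-3, s_max, n)
        d = x[None, :] + s[:, None] * aim[None, :]
        t_flight = np.log(v0 / (v0 - r * s)) / r        # eq. (13)
        t_sound = np.linalg.norm(d - p_k[None, :], axis=1) / c
        return np.min(t_flight + t_sound)

    def mb_sw_difference(x, aim, p_k, v0=700.0, c=330.0, r=0.63):
        """Predicted MB-SW detection-time difference, eq. (27)."""
        return np.linalg.norm(p_k - x) / c - sw_arrival_time(x, aim, p_k, v0, c, r)

A root finder applied to the law-of-sines form (15) would serve equally well.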

4.5. Error Model. At an arbitrary moment, the detection errors and synchronization errors are assumed to be independent stochastic variables with normal distributions:

e^{MB} \sim \mathcal{N}\left(0, R^{MB}\right),   (29a)

e^{SW} \sim \mathcal{N}\left(0, R^{SW}\right),   (29b)

b \sim \mathcal{N}\left(0, R^{b}\right).   (29c)

For the MB-SW model the error is consequently

e^{MB\text{-}SW} \sim \mathcal{N}\left(0, R^{MB} + R^{SW}\right).   (29d)

Assuming that S = M in the MB;SW model, the covariance of the summed detection and synchronization errors can be expressed in a simple manner as

R^{MB;SW} = \begin{bmatrix} R^{MB} + R^{b} & R^{b} \\ R^{b} & R^{SW} + R^{b} \end{bmatrix}.   (29e)

Note that the correlation structure of the clock synchronization error b enables estimation of these errors. Note also that the (assumed known) total error covariance, generally denoted by R, dictates the norm used in the weighted least squares criterion. R also impacts the estimation bounds. This will be discussed in the next section.

4.6. Summary of Models. Four models with different purposes have been described in this section.

(i) MB. Given that the acoustic environment enables reliable detection of the muzzle blast, the MB model promises the most robust estimation algorithms. It also allows global minimization with low-dimensional exhaustive search algorithms. This model is thus suitable for initialization of algorithms based on the subsequent models.

(ii) SW. The SW model extends the MB model with shooting angle, bullet speed, and deceleration parameters, which provide useful information for sniper detection applications. The SW is easier to detect in disturbed environments, particularly when the shooter is far away and the bullet passes closely. However, a sufficient number of microphones are required to be located within the SW cone, and the SW measurements alone cannot be used to determine the distance to the shooter.

(iii) MB;SW. The total MB;SW model keeps all information from the observations and should thus provide the most accurate and general estimation performance. However, the complexity of the estimation problem is large.

(iv) MB-SW. All algorithms based on the models above require that the synchronization error in each microphone either is negligible or can be described with a statistical distribution. The MB-SW model relaxes such assumptions by eliminating the synchronization error by taking differences of the two pulses at each microphone. This also eliminates the shooting time. The final model contains all interesting parameters for the problem, but only one nuisance parameter (the actual speed of sound, which further may be eliminated if known sufficiently well).

The different parameter vectors in the relation y = h_L(\theta_N)\theta_L + h_N(\theta_N) + e are summarized in Table 2.

5. Cramér-Rao Lower Bound

The accuracy of any unbiased estimator \hat{\eta} in the rather general model

y = h(\eta) + e   (30)

is, under not too restrictive assumptions [20], bounded by the Cramér-Rao bound:

\mathrm{Cov}(\hat{\eta}) \ge \mathcal{I}^{-1}\left(\eta^{o}\right),   (31)

where \mathcal{I}(\eta^{o}) is Fisher's information matrix evaluated at the correct parameter values \eta^{o}. Here, the location x is for notational purposes part of the parameter vector \eta. Also the sensor positions p_k can be part of \eta, if these are known only with a certain uncertainty. The Cramér-Rao lower bound provides a fundamental estimation limit for unbiased estimators; see [20]. This bound has been analyzed thoroughly in the literature, primarily for AOA, TOA, and TDOA [21-23].

The Fisher information matrix for e \sim \mathcal{N}(0, R) takes the form

\mathcal{I}(\eta) = \nabla_{\eta} h(\eta)\, R^{-1}\, \nabla_{\eta}^{T} h(\eta).   (32)

The bound is evaluated for a specific location, parameter setting, and microphone positioning, collectively \eta = \eta^{o}. The bound for the localization error is

\mathrm{Cov}(\hat{x}) \ge \begin{bmatrix} I_n & 0 \end{bmatrix} \mathcal{I}^{-1}\left(\eta^{o}\right) \begin{bmatrix} I_n \\ 0 \end{bmatrix}.   (33)

This covariance can be converted to a more convenient scalar value giving a bound on the root mean square error (RMSE) using the trace operator:

\mathrm{RMSE} \ge \sqrt{\frac{1}{n}\,\mathrm{tr}\left(\begin{bmatrix} I_n & 0 \end{bmatrix} \mathcal{I}^{-1}\left(\eta^{o}\right) \begin{bmatrix} I_n \\ 0 \end{bmatrix}\right)}.   (34)

The RMSE bound can be used to compare the information in different models in a simple and unambiguous way, which does not depend on which optimization criterion is used or which numerical algorithm is applied to minimize the criterion.

5.1. MB Case. For the MB case, the entities in (32) are identified by

\eta = \begin{bmatrix} x^T & \theta_L^T \end{bmatrix}^T,
h(\eta) = h_L^{MB}(x; p)\,\theta_L,   (35)
R = R^{MB} + R^{b}.

Note that b is accounted for by the error model. The Jacobian \nabla_{\eta} h is an M-by-(n+2) matrix, n being the dimension of x. The LS solution in (5a) however gives a shortcut to an M-by-n Jacobian:

\nabla_x\left(h_L \hat{\theta}_L\right) = \nabla_x\, h_L \left(h_L^T R^{-1} h_L\right)^{-1} h_L^T R^{-1} y^{o},   (36)

for y^{o} = h_L(x^{o}; p^{o})\,\theta_L^{o}, where x^{o}, p^{o}, and \theta_L^{o} denote the true (unperturbed) values. For the case n = 2 and known p = p^{o}, this Jacobian can, with some effort, be expressed explicitly. The equivalent bound is

\mathrm{Cov}(\hat{x}) \ge \left(\nabla_x^T\left(h_L\hat{\theta}_L\right) R^{-1}\, \nabla_x\left(h_L\hat{\theta}_L\right)\right)^{-1}.   (37)
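The numerical bounds in Section 5.3 below are obtained by finite-difference methods; a sketch of such a computation, assuming a model function h(eta) that stacks the predicted detection times of the chosen model and a Gaussian error covariance R (the central-difference step size is an arbitrary choice):

    import numpy as np

    def crlb_rmse(h, eta0, R, n_pos=2, step=1e-6):
        """RMSE bound (34): Jacobian of h at eta0 by central differences,
        Fisher information (32), inverse, trace over the n_pos position
        coordinates (33), square root."""
        m = h(eta0).size
        J = np.zeros((m, eta0.size))
        for i in range(eta0.size):
            d = np.zeros_like(eta0)
            d[i] = step
            J[:, i] = (h(eta0 + d) - h(eta0 - d)) / (2.0 * step)
        info = J.T @ np.linalg.solve(R, J)         # eq. (32)
        cov = np.linalg.inv(info)[:n_pos, :n_pos]  # eq. (33)
        return np.sqrt(np.trace(cov) / n_pos)      # eq. (34)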

Table 2: Summary of parameter vectors for the different models y = h_L(θ_N)θ_L + h_N(θ_N) + e, where the noise models are summarized in (29a), (29b), (29c), (29d), and (29e). The values of the dimensions assume that the set of microphones giving SW observations is a subset of the MB observations.

Model  | Linear parameters                | Nonlinear parameters                        | dim(θ)            | dim(y)
MB     | θ_L^{MB} = [t_0, 1/c]^T          | θ_N^{MB} = [ ]                              | 2 + 0             | M
SW     | θ_L^{SW} = t_0                   | θ_N^{SW} = [1/c, α^T, v_0]^T                | 1 + (n + 1)       | S
MB;SW  | θ_L^{MB;SW} = [t_0, b^T]^T       | θ_N^{MB;SW} = [1/c, α^T, v_0]^T             | (M + 1) + (n + 1) | M + S
MB-SW  | θ_L^{MB-SW} = [ ]                | θ_N^{MB-SW} = [1/c, α^T, v_0]^T             | 0 + (n + 1)       | S

5.2. SW, MB;SW, and MB-SW Cases. The estimation bounds for the SW, MB;SW, and MB-SW cases are analogous to (33), but there are hardly any analytical expressions available. The Jacobian is probably best evaluated by finite difference methods.

5.3. Numerical Example. The really interesting question is how the information in the different models relates to each other. We will study a scenario where 14 microphones are deployed in a sensor network to support camp protection; see Figure 4. The microphones are positioned along a road to track vehicles and around the camp site to detect intruders. Of course, the microphones also detect muzzle blasts and shock waves from gunfire, so shooters can be localized and the shooter's target identified.

[Figure 4: Example scenario. A network with 14 sensors deployed for camp protection. The sensors detect intruders, keep track on vehicle movements, and, of course, locate shooters. The sketch shows the shooter, the camp, the road, the microphones, and the surrounding trees; the scale bar is 1000 m.]

A plane model (flat camp site) is assumed, x \in \mathbb{R}^2, \alpha \in \mathbb{R}. Furthermore, it is assumed that

R^{b} = \sigma_b^2 I \quad (\text{synchronization error covariance}),   (38)
R^{MB} = R^{SW} = \sigma_e^2 I \quad (\text{detection error covariance}),

and that \alpha = 0, c = 330 m/s, v_0 = 700 m/s, and r = 0.63. The scenario setup implies that all microphones detect the shock wave, so S = M = 14. All bounds presented below are calculated by numerical finite difference methods.

MB Model. The localization accuracy using the MB model is bounded below according to

\mathrm{Cov}\left(\hat{x}^{MB}\right) \ge \left(\sigma_e^2 + \sigma_b^2\right)\begin{bmatrix} 64 & -17 \\ -17 & 9 \end{bmatrix}\cdot 10^{4}.   (39)

The root mean square error (RMSE) is consequently bounded according to

\mathrm{RMSE}\left(\hat{x}^{MB}\right) \ge \sqrt{\frac{1}{n}\,\mathrm{tr}\,\mathrm{Cov}\left(\hat{x}^{MB}\right)} \approx 606\sqrt{\sigma_e^2 + \sigma_b^2}\ [\mathrm{m}].   (40)

Monte Carlo simulations (not described here) indicate that the NLS estimator attains this lower bound for \sqrt{\sigma_e^2 + \sigma_b^2} < 0.1 s. The dash-dotted curve in Figure 5 shows the bound versus \sigma_b for fixed \sigma_e = 500 \mu s. An uncontrolled increase as soon as \sigma_b > \sigma_e can be noted.

SW Model. The SW model is disregarded here, since the SW detections alone contain no shooter distance information.

MB-SW Model. The localization accuracy using the MB-SW model is bounded according to

\mathrm{Cov}\left(\hat{x}^{MB\text{-}SW}\right) \ge \sigma_e^2 \begin{bmatrix} 28 & 5 \\ 5 & 12 \end{bmatrix}\cdot 10^{5},   (41)

\mathrm{RMSE}\left(\hat{x}^{MB\text{-}SW}\right) \ge 1430\,\sigma_e\ [\mathrm{m}].   (42)

The dashed lines in Figure 5 correspond to the RMSE bound for four different values of \sigma_e. Here, the MB-SW model gives at least twice the error of the MB model, provided that there are no synchronization errors. However, in a wireless network we expect the synchronization error to be 10-100 times larger than the detection error, and then the MB-SW error will be substantially smaller than the MB error.

MB;SW Model. The expression for the MB;SW bound is somewhat involved, so the dependence on \sigma_b is only presented graphically; see Figure 5. The solid curves correspond to the MB;SW RMSE bound for the same four values of \sigma_e as for the MB-SW bound. Apparently, when the synchronization error \sigma_b is large compared to the detection error \sigma_e, the MB-SW and MB;SW models contain roughly the same amount of information, and the model having the simplest estimator, that is, the MB-SW model, should be preferred. However, when the synchronization error is smaller than 100 times the detection error, the complete MB;SW model becomes more informative.

Figure 5: Cramér–Rao RMSE bound (34) for the MB (40), the MB-SW (42), and the MB;SW models, respectively, as a function of the synchronization error (STD) σ_b, and for different levels of detection error σ_e.

5.4. Summary of the CRLB Analysis. The synchronization error level in a wireless sensor network is usually a matter of design tradeoff between performance and the battery costs required by synchronization mechanisms. Based on the scenario example, the CRLB analysis is summarized with the following recommendations.

(i) If σ_b ≫ σ_e, then the MB-SW model should be used.
(ii) If σ_b is moderate, then the MB;SW model should be assumed.
(iii) Only if σ_b is very small (σ_b ≤ σ_e), the shooting direction is of minor interest, and performance may be traded for simplicity, should the MB model be used.

6. Experimental Data

A field trial to collect acoustic data on nonmilitary small arms fire is conducted. Ten microphones are placed around a fictitious camp; see Figure 6. The microphones are placed close to the ground and wired to a common recorder with 16-bit sampling at 48 kHz. A total of 42 rounds are fired from three positions and aimed at a common cardboard target. Three rifles and one pistol are used; see Table 3. Four rounds are fired of each armament at each shooter position, with two exceptions. The pistol is only used at position three. At position three, six instead of four rounds of 308 Winchester are fired. All ammunition types are supersonic. However, when firing from position three, not all microphones are subjected to the shock wave.

Figure 6: Scene of the shooter localization field trial. There are ten microphones, three shooter positions, and a common target.

Light wind, no clouds, and around 24°C are the weather conditions. Little or no acoustic disturbance is present. The terrain is rough. Dense woods surround the test site. There is light bush vegetation within the site. Shooter position 1 is elevated some 20 m; otherwise the spots are within ±5 m of a horizontal plane. Ground truth values of the positions are determined with a relative error of less than 1 m, except for shooter position 1, which is determined with 10 m accuracy.

6.1. Detection. The MB and SW are detected by visual inspection of the microphone signals in conjunction with filtering techniques. For shooter positions 1 and 2, the shock wave detection accuracy is approximately σ_e^SW ≈ 80 μs, and the muzzle blast error σ_e^MB is slightly worse. For shooting position 3 the accuracies are generally much worse, since the muzzle blast and shock wave components become intermixed in time.

6.2. Numerical Setup. For simplicity, a plane model is used. All elevation measurements are ignored, and x ∈ R² and α ∈ R. Localization using the MB model (7) is done by minimizing (10a) over a 10 m grid well covering the area of interest, followed by numerical minimization.

Localization using the MB-SW model (25) is done by numerically minimizing (28). The objective function is subject to local optima; therefore the more robust muzzle blast localization x̂ is used as an initial guess. Furthermore, the direction from x̂ toward the mean point of the microphones (the camp) is used as the initial shooting direction α. The initial bullet speed is v = 800 m/s and the initial speed of sound is c = 330 m/s. The value r = 0.63 is used, which is derived from 308 Winchester ammunition ballistics. A sketch of this two-stage procedure is given below.
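As a concrete illustration of this setup, the following sketch implements the same two-stage idea: a coarse 10 m grid search on an MB cost, whose minimizer then seeds a local minimization of an MB-SW cost. The callables cost_mb and cost_mbsw stand in for objectives (10a) and (28), and the parameter packing, initial values, and optimizer choice are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def localize(cost_mb, cost_mbsw, mic_pos, bbox, step=10.0):
    """Two-stage localization: grid search (MB), then local refinement (MB-SW).
    bbox = (xmin, xmax, ymin, ymax); mic_pos has shape (M, 2)."""
    xs = np.arange(bbox[0], bbox[1], step)
    ys = np.arange(bbox[2], bbox[3], step)
    # Stage 1: exhaustive 10 m grid over the area of interest
    grid = np.array([(x, y) for x in xs for y in ys])
    x_mb = grid[np.argmin([cost_mb(p) for p in grid])]
    # Stage 2: refine the MB-SW cost, initialized at the MB estimate with the
    # aim pointing from x_mb toward the mean microphone position (the camp)
    d = mic_pos.mean(axis=0) - x_mb
    p0 = np.concatenate([x_mb, [np.arctan2(d[1], d[0]), 800.0, 1 / 330.0]])
    return x_mb, minimize(cost_mbsw, p0, method="Nelder-Mead").x
```

The exhaustive first stage is what makes the MB estimate robust to the local optima that would otherwise trap a purely local minimizer.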
6.3. Results. Figure 7 shows, at three enlarged parts of the scene, the resulting position estimates based on the MB model (blue crosses) and based on the MB-SW model (squares).

Table 3: Armament and ammunition used at the trial, and the number of rounds fired at each shooter position, together with the resulting localization RMSE for the MB-SW model for each shooter position. For the Luger Pistol the MB model RMSE is given, since only one microphone is located in the Luger Pistol SW cone.

Type              Caliber    Weight    Velocity    Sh. pos.    # Rounds    RMSE
308 Winchester    7.62 mm    9.55 g    847 m/s     1, 2, 3     4, 4, 6     19, 6, 6 m
Hunting Rifle     9.3 mm     15 g      767 m/s     1, 2, 3     4, 4, 4     6, 5, 6 m
Swedish Mauser    6.5 mm     8.42 g    852 m/s     1, 2, 3     4, 4, 4     40, 6, 6 m
Luger Pistol      9 mm       6.8 g     400 m/s     3           —, —, 4     —, —, 2 m

Apparently, the use of the shock wave significantly improves localization at positions 1 and 2, while rather the opposite holds at position 3. Figure 8 visualizes the shooting direction estimates α̂. Estimate root mean square errors (RMSEs) for the three shooter positions, together with the theoretical bounds (34), are given in Table 4. The practical results indicate that the use of the shock wave from distant shooters cuts the error by at least 75%.

Table 4: Localization RMSE and theoretical bound (34) for the three different shooter positions using the MB and the MB-SW models, respectively, beside the aim RMSE for the MB-SW model. The aim RMSE is with respect to the aim at x̂ against the target, α′, not with respect to the true direction α. This way the ability to identify the target is assessed.

Shooter position     1         2        3
RMSE(x̂^MB)          105 m     28 m     2.4 m
MB bound             1 m       0.4 m    0.02 m
RMSE(x̂^MB-SW)       26 m      5.7 m    5.2 m
MB-SW bound          9 m       0.1 m    0.08 m
RMSE(α̂)             0.041°    0.14°    17°

6.3.1. Synchronization and Detection Errors. Since all microphones are recorded by a common recorder, there are actually no timing errors due to inaccurate clocks. This is of course the best way to conduct a controlled experiment, where any uncertainty renders the dataset less useful. From an experimental point of view, it is then simple to add synchronization errors of any desired magnitude off-line (see the sketch at the end of this subsection). On the dataset at hand, this is however work in progress. At the moment, there are apparently other sources of error worth identifying. It should however be clarified that in the final wireless sensor product, there will always be an unpredictable clock error.

As mentioned, detection errors are present, and the expected level of these (80 μs) is used for the bound calculations in Table 4. It is noted that the bounds are level with, or below, the positioning errors.

There are at least two explanations for the bad performance of the MB-SW model at shooter position 3. One is that the number of microphones reached by the shock wave is insufficient for accurate estimates. There are four unknown model parameters, but for the relatively low speed of pistol ammunition, for instance, only one microphone has a valid shock wave detection. Another explanation is that the increased detection uncertainty (due to SW/MB intermixing) impacts the MB-SW model harder, since it relies on accurate detection of both the MB and the SW.
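Since the recordings share one clock, controlled synchronization errors can be emulated afterwards by drawing one offset per microphone and adding it to all of that microphone's detection times. A minimal sketch of such an injection, under an assumed (M × rounds) array layout:

```python
import numpy as np

def add_sync_errors(t_det, sigma_b, rng=np.random.default_rng(0)):
    """Emulate clock offsets: one N(0, sigma_b^2) offset per microphone,
    shared by all detections of that microphone (t_det has shape (M, K))."""
    b = rng.normal(0.0, sigma_b, size=(t_det.shape[0], 1))
    return t_det + b  # broadcast the per-microphone offset over all rounds
```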
6.3.2. Model Errors. No doubt, there are model inaccuracies both in the ballistic and in the acoustic domain. To that end, there are meteorological uncertainties out of our control. For instance, looking at the MB-SW localizations around shooter position 1 in Figure 7 (squares), three clusters are identified that correspond to the three ammunition types with different ballistic properties; see the RMSE for each ammunition and position in Table 3. This clustering, or bias, more likely stems from model errors than from detection errors and could at least partially explain the large gap between the theoretical bound and the RMSE in Table 4. Working with three-dimensional data in the plane is of course another model discrepancy, one that could have greater impact than we first anticipated. This will be investigated in experiments to come.

6.3.3. Numerical Uncertainties. Finally, we face numerical uncertainties. There is no guarantee that the numerical minimization programs we have used here for the MB-SW model really deliver the global minimum. In a realistic implementation, every piece of a priori knowledge, qualitative analysis of the SW and MB signals (amplitude, duration, caliber classification, etc.), and basic consistency checks are used to reduce the search space. The reduced search space may then be exhaustively sampled over a grid prior to the final numerical minimization. Simple experiments on an ordinary desktop PC indicate that, with an efficient implementation, it is feasible to minimize any of the described model objective functions over a discrete grid with 10^7 points within a time frame of one second (see the sketch below). Thus, by allowing, say, one second of extra computation time, the risk of hitting a local optimum can be significantly reduced.
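To give a feel for the claimed 10^7-point feasibility: with the linear parameter t_0 concentrated out (cf. Table 2), an MB grid evaluation vectorizes into a handful of array operations. The sketch below is a rough feasibility check, not the paper's implementation; note that a full 10^7-point grid implies temporaries on the order of a gigabyte, so tiling the grid may be needed in practice.

```python
import numpy as np

def mb_cost_grid(mic_pos, t_det, xs, ys, c=330.0):
    """MB NLS cost on a full grid via broadcasting. mic_pos: (M, 2),
    t_det: (M,), xs: (Nx,), ys: (Ny,); returns an (Nx, Ny) cost map."""
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    dx = X[None, :, :] - mic_pos[:, 0, None, None]   # (M, Nx, Ny)
    dy = Y[None, :, :] - mic_pos[:, 1, None, None]
    resid = t_det[:, None, None] - np.hypot(dx, dy) / c  # times of flight
    resid -= resid.mean(axis=0)   # concentrates out the common emission time t0
    return (resid ** 2).sum(axis=0)
```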

Figure 7: Estimated positions x̂ based on the MB model and on the MB-SW model. The diagrams are enlargements of the interesting areas around the shooter positions. The dashed lines identify the shooting directions.

Figure 8: Estimated shooting directions. The relatively slow pistol ammunition is excluded.

7. Conclusions

We have presented a framework for the estimation of shooter location and aiming angle from wireless networks where each node has a single microphone. Both the acoustic muzzle blast (MB) and the ballistic shock wave (SW) contain useful information about the position, but only the SW contains information about the aiming angle. A separable nonlinear least squares (SNLS) framework was proposed to limit the parametric search space and to enable the use of global grid-based optimization algorithms (for the MB model), eliminating potential problems with local minima.

For a perfectly synchronized network, both MB and SW measurements should be stacked into one large signal model to which SNLS is applied. However, when the synchronization error in the network becomes comparable to the detection error for MB and SW, the performance quickly deteriorates. For that reason, the time difference of MB and SW at each microphone is used, which automatically eliminates any clock offset. The effective number of measurements decreases in this approach, but as the CRLB analysis showed, the root mean square position error is comparable to that of the ideal stacked model, at the same time as the synchronization error distribution may be completely disregarded.

The bullet speed occurs as a nuisance parameter in the proposed signal model. Further, the bullet retardation constant was optimized manually. Future work will investigate whether the retardation constant should also be estimated, and whether these two parameters can be used, together with the MB and SW signal forms, to identify the weapon and ammunition.

Acknowledgment

This work is funded by the VINNOVA supported Centre for Advanced Sensors, Multisensors and Sensor Networks, FOCUS, at the Swedish Defence Research Agency, FOI.

References

[1] J. Bédard and S. Paré, "Ferret, a small arms' fire detection system: localization concepts," in Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, vol. 5071 of Proceedings of SPIE, pp. 497–509, 2003.
[2] J. A. Mazurek, J. E. Barger, M. Brinn et al., "Boomerang mobile counter shooter detection system," in Sensors, and C3I Technologies for Homeland Security and Homeland Defense IV, vol. 5778 of Proceedings of SPIE, pp. 264–282, Bellingham, Wash, USA, 2005.
[3] D. Crane, "Ears-MM soldier-wearable gun-shot/sniper detection and location system," Defence Review, 2008.
[4] "PILAR Sniper Countermeasures System," November 2008, http://www.canberra.com.
[5] J. Millet and B. Balingand, "Latest achievements in gunfire detection systems," in Proceedings of the RTO-MP-SET-107 Battlefield Acoustic Sensing for ISR Applications, Neuilly-sur-Seine, France, 2006.
[6] P. Volgyesi, G. Balogh, A. Nadas, et al., "Shooter localization and weapon classification with soldier-wearable networked sensors," in Proceedings of the 5th International Conference on Mobile Systems, Applications, and Services (MobiSys '07), San Juan, Puerto Rico, 2007.
[7] A. Saxena and A. Y. Ng, "Learning sound location from a single microphone," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1737–1742, Kobe, Japan, May 2009.

[8] W. S. Conner, J. Chhabra, M. Yarvis, and L. Krishnamurthy, "Experimental evaluation of synchronization and topology control for in-building sensor network applications," in Proceedings of the 2nd ACM International Workshop on Wireless Sensor Networks and Applications (WSNA '03), pp. 38–49, San Diego, Calif, USA, September 2003.
[9] O. Younis and S. Fahmy, "A scalable framework for distributed time synchronization in multi-hop sensor networks," in Proceedings of the 2nd Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON '05), pp. 13–23, Santa Clara, Calif, USA, September 2005.
[10] J. Elson and D. Estrin, "Time synchronization for wireless sensor networks," in Proceedings of the International Parallel and Distributed Processing Symposium, 2001.
[11] G. Simon, M. Maróti, Á. Lédeczi, et al., "Sensor network-based countersniper system," in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys '04), pp. 1–12, Baltimore, Md, USA, November 2004.
[12] G. T. Whipps, L. M. Kaplan, and R. Damarla, "Analysis of sniper localization for mobile, asynchronous sensors," in Signal Processing, Sensor Fusion, and Target Recognition XVIII, vol. 7336 of Proceedings of SPIE, 2009.
[13] E. Danicki, "Acoustic sniper localization," Archives of Acoustics, vol. 30, no. 2, pp. 233–245, 2005.
[14] L. M. Kaplan, T. Damarla, and T. Pham, "QoI for passive acoustic gunfire localization," in Proceedings of the 5th IEEE International Conference on Mobile Ad-Hoc and Sensor Systems (MASS '08), pp. 754–759, Atlanta, Ga, USA, 2008.
[15] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad, "Shooter localization in wireless sensor networks," in Proceedings of the 12th International Conference on Information Fusion (FUSION '09), pp. 404–411, Seattle, Wash, USA, 2009.
[16] R. Stoughton, "Measurements of small-caliber ballistic shock waves in air," Journal of the Acoustical Society of America, vol. 102, no. 2, pp. 781–787, 1997.
[17] F. Gustafsson, Statistical Sensor Fusion, Studentlitteratur, Lund, Sweden, 2010.
[18] E. Danicki, "The shock wave-based acoustic sniper localization," Nonlinear Analysis: Theory, Methods & Applications, vol. 65, no. 5, pp. 956–962, 2006.
[19] K. W. Lo and B. G. Ferguson, "A ballistic model-based method for ranging direct fire weapons using the acoustic muzzle blast and shock wave," in Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP '08), pp. 453–458, December 2008.
[20] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, Upper Saddle River, NJ, USA, 1993.
[21] N. Patwari, A. O. Hero III, M. Perkins, N. S. Correal, and R. J. O'Dea, "Relative location estimation in wireless sensor networks," IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2137–2148, 2003.
[22] S. Gezici, Z. Tian, G. B. Giannakis et al., "Localization via ultra-wideband radios: a look at positioning aspects of future sensor networks," IEEE Signal Processing Magazine, vol. 22, no. 4, pp. 70–84, 2005.
[23] F. Gustafsson and F. Gunnarsson, "Possibilities and fundamental limitations of positioning using wireless communication networks measurements," IEEE Signal Processing Magazine, vol. 22, pp. 41–53, 2005.