A Sub-Band Time Delay Estimation Approach

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 1 Frequency-Sliding Generalized Cross-Correlation: A Sub-band Time Delay Estimation Approach Maximo Cobos, Senior Member, IEEE, Fabio Antonacci, Senior Member, IEEE, Luca Comanducci, Student Member, IEEE, and Augusto Sarti, Senior Member, IEEE Abstract—The generalized cross-correlation (GCC) is regarded [6], is still today the most popular technique for TDE. By as the most popular approach for estimating the time difference using GCCs, the TDOA between two signals is estimated of arrival (TDOA) between the signals received at two sensors. as the time lag that maximizes the cross-correlation between Time delay estimates are obtained by maximizing the GCC output, where the direct-path delay is usually observed as a filtered versions of such signals. In this context, the GCC may prominent peak. Moreover, GCCs play also an important role consider different weighting functions (filters) characterized by in steered response power (SRP) localization algorithms, where a particular behavior [7], [8], [9]. Roth, smoothed coherence the SRP functional can be written as an accumulation of the transform, Eckart or Hannan-Thomson (maximum likelihood) GCCs computed from multiple sensor pairs. Unfortunately, the are examples of such weighting schemes [6], [10]. The GCC accuracy of TDOA estimates is affected by multiple factors, including noise, reverberation and signal bandwidth. In this with phase transform (GCC-PHAT) has been repeatedly shown paper, a sub-band approach for time delay estimation aimed to be a suitable alternative for TDE in real reverberant sce- at improving the performance of the conventional GCC is narios [11], [12]. However, it has also been demonstrated presented. The proposed method is based on the extraction of that its performance can only be considered optimal under multiple GCCs corresponding to different frequency bands of uncorrelated noise conditions and a high signal-to-noise ratio the cross-power spectrum phase in a sliding-window fashion. The major contributions of this paper include: 1) a sub-band GCC (SNR) [13]. The present work is particularly focused on PHAT representation of the cross-power spectrum phase that, despite weighting, thus, the terms GCC and GCC-PHAT will be used having a reduced temporal resolution, provides a more suitable indistinctly throughout this paper. Since the GCC was first representation for estimating the true TDOA; 2) such matrix proposed, many approaches have appeared to ameliorate the representation is shown to be rank one in the ideal noiseless robustness of TDE techniques, with improvements that are case, a property that is exploited in more adverse scenarios to obtain a more robust and accurate GCC; 3) we propose a mostly achieved by exploiting the spatial diversity provided set of low-rank approximation alternatives for processing the by more than two microphones [14], [15], [16], blind channel sub-band GCC matrix, leading to better TDOA estimates and estimation [17], [18], modified GCC weightings [19], or the source localization performance. An extensive set of experiments incorporation of some a priori information [20], [21]. Further is presented to demonstrate the validity of the proposed approach. improvements to deal with reverberation and correlated noise fields have also been proposed [22], addressing some of the Index Terms—Time delay estimation, GCC, SVD, weighted limitations of the GCC. In addition, other methods exploiting SVD, sub-band processing, SRP-PHAT. temporal diversity by means of Bayesian filtering have shown great potential [23], [24]. I. INTRODUCTION In contrast to the above methods, this paper proposes a IME delay estimation (TDE) refers to finding the time novel sub-band approach to TDE with two sensors which is T differences-of-arrival (TDOAs) between signals received not directly categorized within the above processing improve- at an array of sensors. Traditionally, TDE has played an im- ments. Concretely, the method is based on the exploration of portant role in many location-aware systems, including radar, the cross-power spectrum phase by following a sliding window arXiv:1910.08838v2 [eess.AS] 24 Mar 2020 sonar, wireless systems, sensor calibration or seismology. In approach, obtaining a set of sub-band GCCs that encode the acoustic signal processing, TDE is essential for localizing and contribution of different frequency bands to the estimated tracking acoustic sources [1], [2]. Automatic camera steering TDOA. The resulting sub-band GCC matrix is shown to be [3], speaker localization [4] or speech enhancement systems rank-one for a full-band signal in the noiseless single-path are application examples strongly relying on TDE [5]. case. This fact is exploited to obtain a robust GCC in adverse The generalized cross-correlation (GCC), originally pro- scenarios, proposing low-rank approximations of the sub-band posed by Knapp and Carter in their 1976 seminal paper GCC matrix that ultimately lead to better estimation accuracy, reduced level of spurious peaks and lower probability of M. Cobos is with the Departament d’Informatica,` Universitat de Valencia,` Burjassot, 46100 Spain e-mail: ([email protected]). anomalous estimates. F. Antonacci, A. Sarti and L. Comanducci are with the Dipartimento di The rest of the paper is structured as follows. Section II Elettronica, Informazione e Bioingegneria at the Politecnico di Milano, Italy. summarizes the background concerning GCC-based TDOA E-mail: (fi[email protected]) This work has been partially supported by FEDER and the Spanish Ministry estimation. Section III presents the frequency-sliding GCC of Science, Innovation and Universities under Grants RTI2018-097045-B-C21 (FS-GCC) representation. Section IV discusses our proposed and PRX19/00075. methods for TDE based on the FS-GCC. The experimental evaluation is in Section V. Finally, the conclusions of this c 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2 work are summarized in Section VI. the following estimator for time-delay estimation over signals acquired by a pair of sensors: II. TIME DELAY ESTIMATION τ^0 = arg max R[τ]: (8) τ This section summarizes the conventional GCC approach The GCC-PHAT inherently discards the magnitude infor- for TDE. To this end, the ideal anechoic model is first mation of the signals and provides a time-delay estimate presented, discussing the main problems arising in realistic based only on the phase of the cross-power spectrum, i.e. the acoustic conditions. phase of the frequency domain analysis of the cross-correlation between the two signals. It is important to remark that the A. Anechoic Signal Model measured time delay is an integer multiple of the sampling Let us consider a pair of sensors with spatial coordinates period. Note, however, that a finer resolution can be achieved 3 by interpolating between consecutive samples of the GCC given by column vectors m1, m2 2 R and an emitting acoustic source located at s 2 R3. The time difference-of- function if necessary. arrival (TDOA) measured in samples is defined as C. Problems in Realistic Scenarios ks − m k − ks − m k τ 1 2 f = η − η ; (1) 0 , c s 1 2 It is well-known that several problems arise when using GCC-PHAT for estimating the TDOA in realistic scenarios. where c is the wave propagation speed, b·e denotes the Indeed, TDOA measurements are very sensitive to reverbera- rounding operator and fs is the sampling frequency. The terms tion, noise, and the presence of potential interferers: η1 and η2 represent the time of flight (TOF) of the sound to • In reverberant environments, for certain locations and the sensors in samples. orientations of the source signal, the peak of the GCC Assuming an anechoic scenario, the signals received by the related to a reflective path could overcome that of the two sensors can be modeled as direct path. • In noisy scenarios, for some time instants, the noise level xm[n] = βms[n − ηm] + wm[n]; m = 1; 2; (2) could exceed that of the signal, making the estimated where βm 2 R+ is a positive amplitude decay factor, s[n] TDOA unreliable. is the source signal and wm[n] is an additive noise term. In • Peaks corresponding to the direct path, reflections or the discrete-time Fourier transform (DTFT) domain, the sensor combinations of interfering signals may also lead to errors signals can be written as when estimating the TDOA of a target source. The above issues can be even more problematic when the X (!) = β S(!)e−j!ηm + W (!); m = 1; 2; (3) m m m spectral characteristics of the target source or the additive noise lead to a reduced signal-to-noise ratio (SNR) at some fre- where S(!), Wm(!) 2 C, are the DTFTs ofp the source signal and the noise signal, respectively, and j = −1. quency bands [25]. The phase information can be completely lost at those frequency bands where either there is no signal information or where its content is especially affected by the B. Generalized Cross-Correlation noise. Consider, for example, the (normalized) GCCs shown in The GCC of a pair of sensor signals is defined as the inverse Figure 1. The top row corresponds to the ideal case of a noise- Fourier transform of the weighted cross-power spectrum, i.e. less full-band signal. The other two rows correspond to the 1 Z π same signal but affected by different spectral noise profiles. As R[τ] Ψ(!)ej!τ d! = F −1 fΨ(!)g ; (4) , 2π it can be observed, the noisy phase introduces many spurious −π peaks that can cause anomalous TDOA estimates.

A Sub-Band Time Delay Estimation Approach

Moving Average Filters

Random Signals

STATISTICAL FOURIER ANALYSIS: CLARIFICATIONS and INTERPRETATIONS by DSG Pollock

Use of the Kurtosis Statistic in the Frequency Domain As an Aid In

2D Fourier, Scale, and Cross-Correlation

20. the Fourier Transform in Optics, II Parseval’S Theorem

Introduction to Frequency Domain Processing

Time Domain and Frequency Domain Signal Representation

The Wavelet Tutorial Second Edition Part I by Robi Polikar

Image Analysis

Signals and Sampling

Cross-Correlation