arXiv:1804.03372v1 [cs.SD] 10 Apr 2018

Noname manuscript No. (will be inserted by the editor)

Realtime Active Sound Source Localization for Unmanned Ground Robots Using a Self-Rotational Bi-Microphone Array

Deepak Gala, Nathan Lindsay, Liang Sun

the date of receipt and acceptance should be inserted later

Abstract This work presents a novel technique that performs both orientation and distance localization of a sound source in a three-dimensional (3D) space using only the interaural time difference (ITD) cue, generated by a newly-developed self-rotational bi-microphone robotic platform. The system dynamics is established in a spherical coordinate frame using a state-space model. The observability analysis of the state-space model shows that the system is unobservable when the sound source is placed with an elevation angle of 90 or 0 degrees. The proposed method utilizes the difference between the azimuth estimates resulting respectively from the two-dimensional and the 3D models to check the zero-degree-elevation condition and further estimates the elevation angle using a polynomial curve-fitting approach. Also, the proposed method is capable of detecting a zero-ITD signal 'buried' in noise, thereby extracting the 90-degree elevation. Additionally, distance localization is performed by first rotating the microphone array to face toward the sound source and then shifting the microphone array perpendicular to the source-robot vector by a fixed number of predefined steps. The integrated rotational and translational motions of the microphone array provide a complete orientation and distance localization using only the ITD cue. A novel self-rotational bi-microphone robotic platform was also developed for unmanned ground robots performing sound source localization. The proposed technique was first tested in simulation and was then verified on the newly-developed robotic platform. Experimental data collected by the microphones installed on a KEMAR dummy head were also used to test the effectiveness of the proposed technique. All results show the effectiveness of the proposed technique.

Deepak Gala, [email protected], New Mexico State University, NM, USA; Nathan Lindsay, [email protected], New Mexico State University, NM, USA; Liang Sun, [email protected], New Mexico State University, NM, USA

1 INTRODUCTION

The localization problem in the robotic field has been recognized as the most fundamental problem to make robots truly autonomous [10]. Localization techniques are of great importance for autonomous unmanned systems to identify their own locations (i.e., self-localization) and situational awareness (e.g., locations of surrounding objects), especially in an unknown environment. Mainstream technology for localization is based on computer vision, supported by visual sensors (e.g., cameras), which, however, are subject to lighting and line-of-sight conditions and rely on computationally demanding image-processing algorithms. An acoustic sensor (e.g., a microphone), as a complementary component in a robotic sensing system, does not require a line of sight and is able to work under varying light conditions (or in complete darkness) in an omnidirectional manner. Thanks to the advancement of microelectromechanical technology, microphones have become inexpensive and do not require significant power to operate.

Sound-source localization (SSL) techniques have been developed that identify the location of sound sources (e.g., speech and music) in terms of directions and distances. SSL techniques have been widely used in civilian applications, such as intelligent video conferencing [26,52], environmental monitoring [49], human-robot interaction (HRI) [25], and humanoid robotics and robot motion planning [37], as well as military applications, such as passive detection and surveillance systems that locate hostile tanks, artillery, incoming missiles [27], aircraft [7], and UAVs [11]. SSL techniques have great potential to enhance the sensing capability of autonomous unmanned systems, both by themselves and working together with vision-based localization techniques.

SSL has been achieved by using microphone arrays with more than two microphones [38,45,47,48,50]. The accuracy of the localization techniques based on microphone arrays is dictated by their physical sizes [5,12,56]. Microphone arrays are usually designed using particular (e.g., linear or circular) structures, which result in relatively large sizes and sophisticated control components for operation. Therefore, it becomes difficult to use them on small robots or large systems due to the complexity of mounting and maneuvering.

In the past decade, research has been carried out for robots to have auditory behaviors (e.g., attending to an event, locating a sound source in potentially dangerous situations, and locating and paying attention to a speaker) by mimicking human auditory systems. Humans perform sound source localization with their two ears using three integrated types of cues, i.e., the interaural level difference (ILD), the interaural time difference (ITD), and the spectral information [22,35]. ILD and ITD cues are usually used to identify the horizontal location (i.e., azimuth angle) of a sound source at higher and lower frequencies, respectively. Spectral cues are usually used to identify the vertical location (i.e., elevation angle) of a sound source at higher frequencies. Additionally, acoustic landmarks aid towards bettering the SSL by humans [55].

To mimic human acoustic systems, researchers have developed sound source localization techniques using only two microphones. All three types of cues have been used by Rodemann et al. [42] in a binaural approach of estimating the azimuth angle of a sound source, while the authors also stated that reliable elevation estimation would need a third microphone. Spectral cues were used by the head-related transfer function (HRTF) that was applied to identify both the azimuth and elevation angles of a sound source for binaural sensor platforms [20,25,28,29]. The ITD cues have also been used in binaural sound source localization [15], where the problem of the cone of confusion [51] has been overcome by incorporating head movements, which also enable both azimuth and elevation estimation [40,51]. Lu et al. [33] used a particle filter for binaural tracking of a mobile sound source on the basis of ITD and motion parallax, but the localization was limited to a two-dimensional (2D) plane and was not impressive under static conditions. Pang et al. [39] presented an approach for binaural azimuth estimation based on reverberation weighting and generalized parametric mapping. Lu et al. [34] presented a binaural distance localization approach using the motion-induced rate of intensity change, which requires the use of parallax motion, and errors up to 3.4 m were observed. Kneip and Baumann [32] established formulae for binaural identification of the azimuth and elevation angles as well as the distance information of a sound source combining the rotational and translational motion of the interaural axis. However, large localization errors were observed and no solution was given to handle sensor noise or model uncertainty. Rodemann [41] proposed a binaural azimuth and distance localization technique using signal amplitude along with ITD and ILD cues in an indoor environment with a sound source ranging from 0.5 m to 6 m. However, the azimuth estimation degrades with the distance, and the error, even reduced by the required calibration, was still large. Kumon and Uozumi [31] proposed a binaural system on a robot to localize a mobile sound source, but it requires the robot to move with a constant velocity to achieve 2D localization. Also, further study was proposed for a parameter α0 introduced in the EKF. Zhong et al. [46,54] and Gala et al. [17] utilized the extended Kalman filtering (EKF) technique to perform orientation localization using the ITD data acquired by a set of binaural self-rotating microphones. Moreover, large errors were observed in [54] when the elevation angle of a sound source was close to zero.

To the best of our knowledge, the works presented in the literature for SSL using two microphones based on ITD cues mainly provided formulae that calculate the azimuth and elevation angles of a sound source without incorporating sensor noise [32]. The works that use probabilistic recursive filtering techniques (e.g., EKF) for orientation estimation [54] did not conduct any observability analysis on the system dynamics. In other words, no discussion on the limitation of the techniques for orientation estimation was found. In addition, no probabilistic recursive filtering technique was used to acquire distance information of a sound source. This paper aims to address these research gaps.

The contributions of this paper include (1) an observability analysis of the system dynamics for three-dimensional (3D) SSL using two microphones and the ITD cue only; (2) a novel algorithm that provides the estimation of the elevation angle of a sound source when the states are unobservable; and (3) a new EKF-based technique that estimates the robot-sound distance. Both simulations and experiments were conducted to validate the proposed techniques.

The rest of this paper is organized as follows. Section 2 describes the preliminaries. In Section 3, the 2D and 3D orientation localization models are presented along with their observability analysis.

In Section 4, a novel method is proposed to detect non-observability conditions and a solution to the non-observability problem is presented. Section 5 presents a distance localization model with its observability analysis. The EKF algorithm is presented in Section 6. In Sections 7 and 8, the simulation and experimental results are presented, respectively, followed by Section 9, which concludes the paper.

2 PRELIMINARIES

2.1 Calculation of ITD

The only cue used for localization in this paper is the ITD, which is the time difference of a sound signal traveling to the two microphones and can be calculated using the cross-correlation technique [3,30].

Consider a single stationary sound source placed in an environment. Let y1(t) and y2(t) be the sound signals captured by two spatially separated microphones in the presence of noise, which are given by [30]

y1(t) = s(t) + n1(t), (1)
y2(t) = δ · s(t + td) + n2(t), (2)

where s(t) is the sound signal, n1(t) and n2(t) are real and jointly stationary random processes, td denotes the time difference of s(t) arriving at the two microphones, and δ is the signal attenuation factor due to the different traveling distances of the sound signal to the two microphones. It is commonly assumed that δ changes slowly and s(t) is uncorrelated with the noises n1(t) and n2(t) [30]. The cross-correlation function of y1(t) and y2(t) is given by

R_{y1,y2}(τ) = E[y1(t) · y2(t − τ)],

where E[·] represents the expectation operator. Figure 1 shows the process of delay estimation between y1(t) and y2(t), where H1(f) and H2(f) represent scaling functions or pre-filters [30].

Fig. 1 Interaural Time Delay (ITD) estimation between signals y1(t) and y2(t) using the cross-correlation technique.

The time difference of y1(t) and y2(t), i.e., the ITD, is given by T̂ ≜ argmax_τ R_{y1,y2}(τ). The distance difference of the sound signal traveling to the two microphones is given by d ≜ T̂ · c0, where c0 is the sound speed and is usually selected to be 345 m/s.

Various techniques can be used to eliminate or reduce the effect of background noise and reverberations [8,9,18,19,36,44]. An improved version of the cross-correlation method incorporating H1(f) and H2(f) is called Generalized Cross-Correlation (GCC) [30], which further improves the estimation of the time delay.
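To make the estimator concrete, the following is a minimal sketch of T̂ = argmax_τ R_{y1,y2}(τ) — our own illustration rather than the authors' implementation; numpy is assumed, and the function name and test signal are ours:

```python
# Sketch: ITD estimation via cross-correlation (Section 2.1 notation).
import numpy as np

def estimate_itd(y1, y2, fs, c0=345.0):
    """Return (T_hat, d): the ITD (s) and the distance difference (m)."""
    n = len(y1)
    r = np.correlate(y1, y2, mode="full")   # R_{y1,y2}(tau) over all lags
    lag = int(np.argmax(r)) - (n - 1)       # lag (in samples) maximizing R
    T_hat = lag / fs                        # T_hat = argmax_tau R(tau)
    return T_hat, T_hat * c0                # d = T_hat * c0

# Synthetic check: y2 leads y1 by 25 samples (td = 25/fs in Eq. (2)).
rng = np.random.default_rng(0)
fs = 44100
s = rng.standard_normal(4096)
y1 = s + 0.05 * rng.standard_normal(s.size)
y2 = 0.9 * np.roll(s, -25) + 0.05 * rng.standard_normal(s.size)
T_hat, d = estimate_itd(y1, y2, fs)         # T_hat ~ 25/fs, d ~ 0.196 m
```

A GCC variant would apply the pre-filters H1(f) and H2(f) before the correlation, but the argmax structure is the same.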
2.2 Far-Field Assumption

The area around a sound source can be divided into five different fields: free field, near field, far field, direct field and reverberant field [1,21]. The region close to a source where the sound pressure and the acoustic particle velocity are not in phase is regarded as the near field. The range of the near field is limited to a distance from the source equal to approximately a wavelength of sound or to three times the largest dimension of the sound source, whichever is the larger. The far field of a source begins where the near field ends and extends to infinity. Under the far-field assumption, the acoustic wavefront reaching the microphones is planar and not spherical, in the sense that the waves travel in parallel, i.e., the angle of incidence is the same for the two microphones [14].

2.3 Observability Analysis

Consider a nonlinear system described by a state-space model

ẋ = f(x), (3)
y = h(x), (4)

where x ∈ R^n and y ∈ R^m are the state and output vectors, respectively, and f(·) and h(·) are the process and output functions, respectively. The observability matrix of the system described by (3) and (4) is then given by [23]

Ω = [ (∂L_f^0 h/∂x)^T ··· (∂L_f^{n−1} h/∂x)^T ]^T,

where the Lie derivatives are given by L_f^0 h = h(x) and L_f^n h = (∂L_f^{n−1} h/∂x) f. The system is observable if the observability matrix Ω has rank n.
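This rank test is easy to mechanize symbolically. The helper below is our own sketch (sympy assumed), demonstrated on the 2D model that Section 3.2 derives; it reproduces the first rows of the observability matrix (9):

```python
# Sketch (our helper, not from the paper): observability matrix of
# Section 2.3 built from Lie derivatives with sympy.
import sympy as sp

def observability_matrix(f, h, x, order=None):
    """Stack the gradients of L_f^0 h, ..., L_f^{order-1} h (scalar h)."""
    xs, fv = sp.Matrix(x), sp.Matrix(f)
    rows, Lh = [], sp.sympify(h)
    for _ in range(order or len(x)):
        grad = sp.Matrix([Lh]).jacobian(xs)   # row vector dLh/dx
        rows.append(grad)
        Lh = (grad * fv)[0]                   # next Lie derivative L_f(Lh)
    return sp.Matrix.vstack(*rows)

# Applied to the 2D model of Section 3.2 (Eqs. (7)-(8)):
b, omega, psi = sp.symbols('b omega psi', real=True)
O_2D = observability_matrix([-omega], b * sp.sin(psi), [psi], order=3)
# O_2D -> [b*cos(psi); b*omega*sin(psi); -b*omega**2*cos(psi)], cf. Eq. (9)
```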

3 Mathematical Models and Observability Analysis for Orientation Localization

The complete localization of a sound source is usually achieved in two stages: the orientation (i.e., azimuth and elevation angles) localization and the distance localization. In this section, the methodology of the orientation localization is presented.

3.1 Definitions

Fig. 2 Top view of the robot illustrating different angle definitions due to the rotation of the microphone array.

Fig. 3 3D view of the system for orientation localization.

Fig. 4 The shaded triangle in Figure 3.

As shown in Figures 2 and 3, the acoustic signal generated by the sound source S is collected by the left and right microphones, L and R, respectively. Let O be the center of the robot as well as of the two microphones. The location of S is represented by (D, θ, ϕ), where D is the distance between the source and the center of the robot, i.e., the length of segment OS, θ ∈ [0, π/2] is the elevation angle defined as the angle between OS and the horizontal plane, and ϕ ∈ (−π, π] is the azimuth angle defined as the angle measured clockwise from the robot heading vector, p, to OS. Letting the unit vector q be the orientation (heading) of the microphone array, β be the angle between p and q, and ψ be the angle between q and OS, both following a right-hand rotation rule, we have

ϕ = ψ + β. (5)

For a clockwise rotation, we have β(t) = ωt, where ω is the rotational speed of the two microphones, and ψ(t) = ϕ − ωt. In the shaded triangle, △SOF, shown in Figures 3 and 4, define α ≜ ∠SOF and we have α + ψ = π/2 and cos α = cos θ sin ψ. Based on the far-field assumption in Section 2.2, we have

d = T̂ · c0 = b cos α = b cos θ sin ψ, (6)

where b is the distance between the two microphones, i.e., the length of the segment LR.

To avoid the cone of confusion [51] in SSL, the two-microphone array is rotated with a nonzero angular velocity [54]. Without loss of generality, in this paper we assume a clockwise rotation of the microphone array on the horizontal plane while the robot itself does not rotate nor translate throughout the entire estimation process, which implies that ϕ is constant.

3.2 2D Localization

If the sound source and the robot are on the same horizontal plane, i.e., θ = 0, we have d = b sin ψ. Assume that the microphone array rotates clockwise with a constant angular velocity, ω. Considering the state-space model for 2D localization with the state x_2D ≜ ψ and the output y_2D ≜ d, we have

ẋ_2D = ψ̇ = −ω, (7)
y_2D = b sin ψ. (8)

Theorem 1 The system described by Equations (7) and (8) is observable if 1) b ≠ 0 and 2) ω ≠ 0 or ψ ≠ 2kπ ± π/2, where k ∈ Z.

Proof The observability matrix [23,24] for the system described by Equations (7) and (8) is given by

O_2D = [ b cos ψ  bω sin ψ  −bω² cos ψ  ··· ]^T. (9)

The system is observable if O_2D has rank one, which implies b ≠ 0. If ω = 0, observability requires that cos ψ ≠ 0, which implies ψ ≠ 2kπ ± π/2. If ω ≠ 0, O_2D is full rank for all ψ.

Remark 1 Since the two microphones are separated by a non-zero distance (i.e., b ≠ 0) and the microphone array rotates with a non-zero constant angular velocity (i.e., ω ≠ 0), the system is observable in the domain of definition.
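For intuition, the noiseless output of the rotating pair can be generated directly from (6) with ψ(t) = ϕ − ωt. The sketch below uses assumed values (b and ω follow the simulations of Section 7; the sample rate and noise level are illustrative); the stored d(t) is what Section 4 later inspects in the frequency domain:

```python
# Sketch: ideal ITD-distance signal of the clockwise-rotating pair,
# d(t) = b cos(theta) sin(phi - omega t), per Eq. (6).
import numpy as np

b = 0.18                       # microphone separation (m), cf. Section 7.1
omega = 2 * np.pi / 5          # rotation rate (rad/s), cf. Section 7.3
theta = np.deg2rad(20.0)       # true elevation (illustrative)
phi = np.deg2rad(50.0)         # true azimuth (illustrative)

t = np.arange(0.0, 15.0, 1.0 / 120.0)   # three 5-s revolutions @ 120 Hz
d = b * np.cos(theta) * np.sin(phi - omega * t)
d_noisy = d + 0.005 * np.random.default_rng(1).standard_normal(t.size)
```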

3.3 3D Localization

Considering the state-space model for 3D localization with the state x_3D ≜ [θ, ψ]^T and the output y_3D ≜ d, we have

ẋ_3D = [θ̇, ψ̇]^T = [0, −ω]^T, (10)
y_3D = b cos θ sin ψ. (11)

Theorem 2 The system described by Equations (10) and (11) is observable if 1) b ≠ 0, 2) ω ≠ 0, 3) θ ≠ 0°, and 4) θ ≠ 90°.

Proof The observability matrix for (10) and (11) is given by

O_3D = [ −b sin θ sin ψ     b cos θ cos ψ
          bω sin θ cos ψ    bω cos θ sin ψ
          bω² sin θ sin ψ  −bω² cos θ cos ψ
         −bω³ sin θ cos ψ  −bω³ cos θ sin ψ
               ···                ···      ]. (12)

It should be noted that higher-order Lie derivatives do not add rank to O_3D. Consider the square matrix consisting of the first two rows of O_3D,

Ω_3D = [ −b sin θ sin ψ   b cos θ cos ψ
          bω sin θ cos ψ  bω cos θ sin ψ ],

and the determinant of Ω_3D is

det{Ω_3D} = −b²ω sin θ cos θ.

The system is observable if

b ≠ 0, ω ≠ 0, θ ≠ 0°, and θ ≠ 90°. (13)

Further investigation can be done by selecting two even (or odd) rows from O_3D to form a square matrix, whose determinant is always zero.

Remark 2 As it is always true that b ≠ 0 and ω ≠ 0 due to Remark 1, the system is observable only when θ ≠ 0° and θ ≠ 90°. Experimental results presented by Zhong et al. [54] using a similar model illustrate large estimation errors when θ is close to zero.

To further investigate the system observability, consider the following two special cases: (1) θ is known and (2) ψ is known.

Assume that θ is known and consider the following system

ẋ_ψ = ψ̇ = −ω, (14)
y_ψ = b cos θ sin ψ. (15)

Corollary 1 The azimuth angle in the system described by Equations (14) and (15) is observable if 1) b ≠ 0, 2) ω ≠ 0, and 3) θ ≠ 90°.

Proof The observability matrix associated with (14) and (15) is given by

O_ψ = [ b cos θ cos ψ  bω cos θ sin ψ  ··· ]^T. (16)

So, the system is observable if

b ≠ 0, θ ≠ 90°, and ω ≠ 0 or ψ ≠ 2kπ ± π/2. (17)

This shows that ψ is unobservable when θ = 90°.

Assume that ψ is known and consider the following system

ẋ_θ = θ̇ = 0, (18)
y_θ = b cos θ sin ψ. (19)

Corollary 2 The elevation angle in the system described by Equations (18) and (19) is observable if the following conditions are satisfied: 1) b ≠ 0, 2) ω ≠ 0, and 3) θ ≠ 0°.

Proof The observability matrix associated with (18) and (19) is given by

O_θ = [ −b sin θ sin ψ  0  ··· ]^T. (20)

So the system is observable if

b ≠ 0, θ ≠ 0°, and ψ ≠ kπ. (21)

As ψ is time-varying, it won't stay at kπ. It can be seen that θ is unobservable when θ = 0°.
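The determinant in the proof of Theorem 2 can be checked mechanically; below is a small symbolic sketch of our own (sympy assumed) that reproduces det{Ω_3D} = −b²ω sin θ cos θ:

```python
# Sketch: symbolic check of det(Omega_3D) from the proof of Theorem 2.
import sympy as sp

b, w, th, ps = sp.symbols('b omega theta psi', real=True)
x = sp.Matrix([th, ps])                        # state of Eq. (10)
f = sp.Matrix([0, -w])                         # theta_dot = 0, psi_dot = -omega
h = b * sp.cos(th) * sp.sin(ps)                # output, Eq. (11)

row1 = sp.Matrix([h]).jacobian(x)              # gradient of L_f^0 h
row2 = sp.Matrix([(row1 * f)[0]]).jacobian(x)  # gradient of L_f^1 h
print(sp.simplify(sp.Matrix.vstack(row1, row2).det()))
# -> -b**2*omega*sin(theta)*cos(theta): singular exactly at theta = 0 or 90 deg
```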

4 Complete Orientation Localization

To handle the unobservable situations, i.e., θ = 0° and θ = 90°, we present a novel algorithm in this section that utilizes both the 2D and 3D localization models to enable the orientation localization of a sound source residing anywhere in the domain of definition, i.e., θ ∈ [0, π/2] and ϕ ∈ (−π, π].

4.1 Identification of θ = 90°

ITD could be zero due to either 90° elevation or the absence of sound, the latter of which can be detected by evaluating the power reception of the microphones. In this paper, we focus on the former case.

Assume that the sensor noise is Gaussian, which dominates the ITD signal when θ gets close to 90°. To check the presence of the signal d(t) buried in the noise, we can first apply the Discrete Fourier Transform (DFT) to the stored d(t). The N-point DFT of the signal d(t) results in a sequence of complex numbers of the form X_real + jX_imag, where X_real and X_imag represent the real and imaginary parts of the complex number. The magnitude of the complex number is then obtained by |X(ω)| = √(X_real² + X_imag²). Figure 5 shows the resulting magnitude (|X(ω)|) signals of d(t) after taking the DFT when the sound source is placed at θ = 50° and 85°, respectively, in simulation. Two big peaks in the top subfigure (i.e., when θ = 50°) are observed when the frequency is at ±2π/5 rad/sec (i.e., the angular velocity of the rotation of the microphone array). However, the peaks observed in the bottom subfigure (i.e., when θ = 85°) are comparatively very small.

Fig. 5 The signals after taking the Discrete Fourier Transform (DFT) of the noised signal d(t) with the source located at θ = 50° and θ = 85°, respectively. The two big peaks in the upper figure occur at ±2π/5 rad/sec (i.e., the angular velocity of the rotation of the microphone array) when θ = 50°, whereas small peaks are present in the bottom figure when θ = 85°.

To eliminate the noise in Figure 5, define the estimated amplitude of the ITD signal as Âd(ω) = (2/N) · |X(ω)|. Figure 6 shows the estimated amplitude (Âd) of the signal d(t) resulting from Figure 5. The bottom subfigure (i.e., when θ = 85°) shows that the maximum value of Âd is very small compared to the top subfigure (i.e., when θ = 50°). The ITD is considered to be zero if the maximum value of the estimated amplitude Âd (when the frequency equals the angular velocity of the rotation of the microphone array) is less than a predefined threshold, d_threshold. The selection of d_threshold determines the accuracy of the estimation when the sound source is around 90° elevation. The value of d_threshold, for example, can be selected as 0.017 m, which corresponds to θ = 85° as in Figure 6, thereby giving an accuracy of 5°.

Fig. 6 Estimated amplitudes Âd of the signal d(t) using the DFT for the source located at θ = 50° and θ = 85°, respectively. The maximum of Âd occurs when the frequency is at ±2π/5 rad/sec. When θ = 85°, the maximum of Âd is less than 0.017 m.
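A compact sketch of this zero-ITD test (our own code, numpy assumed; the 0.017 m default follows the example above):

```python
# Sketch of the Section 4.1 test: amplitude of d(t) at the rotation
# frequency versus the predefined threshold d_threshold.
import numpy as np

def itd_is_zero(d, fs, omega, d_threshold=0.017):
    """True if the estimated amplitude A_hat_d at omega is below threshold."""
    N = len(d)
    X = np.fft.fft(d)
    w = 2 * np.pi * np.fft.fftfreq(N, d=1.0 / fs)   # bin frequencies (rad/s)
    k = int(np.argmin(np.abs(w - omega)))           # bin nearest the rotation rate
    A_d = (2.0 / N) * np.abs(X[k])                  # A_hat_d = (2/N)|X(omega)|
    return A_d < d_threshold
```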

4.2 Identification of θ = 0°

Theorem 1 guarantees accurate azimuth angle estimation using the 2D model when the sound source is located at zero elevation. We observed that when the elevation of the sound source is not close to zero, the estimation of the azimuth angle provided by the 2D model is far off the real value.

On the other hand, Theorem 2 guarantees that the azimuth angle estimation using the 3D model is accurate for all elevation angles except for θ = 90°, which is detected by the approach in Section 4.1. Therefore, the estimations resulting from the 2D and 3D models will be identical if the sound source is located at θ = 0°, as shown in Figure 7. The root-mean-square error (RMSE) is used as a measure of the difference between the two azimuth estimations as it includes both the mean absolute error (MAE) as well as additional information related to the variance [13]. This error is dependent on the value of the elevation angle and it increases as the elevation angle increases, as shown in Figure 8.

Fig. 7 Comparison of azimuth angle estimations using the 2D and 3D localization models when a sound source is located at θ = 0° and θ = 20°, respectively.

Fig. 8 RMSE between the 2D and 3D localization models' estimated azimuth angles.

In order to get an accurate estimate of an elevation angle close to zero, a polynomial curve fitting approach is used to map (in a least-squares sense) the RMSE values to the elevation angles. Different RMSE values are collected beforehand in the environment where the localization would be done. The RMSE values associated with the same elevation angle but different azimuth angles express small variations, as seen in Figure 8. Therefore, for a particular elevation angle, the mean of all RMSE values with different azimuth angles will be selected as the RMSE value corresponding to that elevation angle. An example curve is shown in Figure 9.

Fig. 9 Approximation of the elevation angle from the RMSE data using the least-squares fitted polynomial.
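The least-squares mapping can be done with a standard polynomial fit. The sketch below uses hypothetical calibration pairs (the real ones are collected beforehand in the target environment, as described above); the cubic degree is our choice:

```python
# Sketch of the RMSE -> elevation mapping of Section 4.2.
import numpy as np

theta_cal = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 15.0])   # deg (hypothetical)
rmse_cal = np.array([0.05, 0.4, 0.8, 1.2, 1.6, 1.9])     # deg (hypothetical)

coeffs = np.polyfit(rmse_cal, theta_cal, deg=3)   # fit theta as a cubic in RMSE
theta_from_rmse = np.poly1d(coeffs)

theta_est = theta_from_rmse(1.0)   # elevation estimate for a measured 1.0-deg RMSE
```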

4.3 Complete Orientation Localization Algorithm

Figure 10 illustrates the flowchart of the proposed algorithm for the complete orientation localization. The pseudo code of the proposed complete orientation localization is given in Algorithm 0.1. The RMSE_threshold is the value used to check when the elevation angle is close to 0°. This threshold value decides the point until which the curve fitting is required, and after which the 3D model can be trusted for elevation estimation.

Algorithm 0.1 Complete 3D orientation localization
1: Calculate the ITD, T̂, from the recorded signals of the two microphones.
2: IF Âd < d_threshold THEN
3:   The elevation angle of the sound source is θ = 90° and the azimuth angle, ϕ, is undefined.
4: ELSE
5:   Estimate the azimuths ϕ_2D and ϕ_3D using the 2D and 3D localization models, respectively
6:   Calculate the RMSE between ϕ_2D and ϕ_3D
7:   IF RMSE < RMSE_threshold THEN
8:     Use polynomial curve fitting to determine θ from the calculated RMSE value and estimate ϕ using either the 2D or the 3D localization model
9:   ELSE estimate both θ and ϕ using the 3D localization model
10:  END IF
11: END IF

Fig. 10 Flowchart addressing the non-observability problems reflected in the 3D localization model.
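For concreteness, the decision logic of Algorithm 0.1 can be written compactly as below (a sketch of ours; phi_2d/phi_3d are the azimuth-estimate sequences from the two models, theta_3d the 3D-model elevation estimate, and theta_from_rmse the fitted curve of Section 4.2; the threshold defaults follow Sections 4.1 and 7.3):

```python
# Sketch (ours) of the branching in Algorithm 0.1.
import numpy as np

def orientation_estimate(A_d, phi_2d, phi_3d, theta_3d, theta_from_rmse,
                         d_threshold=0.017, rmse_threshold=1.9):
    if A_d < d_threshold:                     # zero ITD: source overhead
        return 90.0, None                     # theta = 90 deg, azimuth undefined
    err = np.asarray(phi_2d) - np.asarray(phi_3d)
    rmse = float(np.sqrt(np.mean(err ** 2))) # RMSE between the two azimuths
    if rmse < rmse_threshold:                 # models agree: theta near 0
        return float(theta_from_rmse(rmse)), float(np.mean(phi_2d))
    return theta_3d, float(np.mean(phi_3d))   # general case: trust the 3D model
```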

5 Distance Localization

The novel distance localization approach presented in this section depends on an accurate orientation localization. Assume that the angular location of the sound source has been obtained by using Algorithm 0.1 and the microphone array has been regulated to face toward the sound source, as shown in Figure 11. The proposed distance localization approach requires the microphone array, LR, to translate with a distance △d along the line perpendicular to the center-source vector (on the horizontal plane). This translation shifts the center of the microphone array, O, to a new point, O′, and γ is defined as the angle between vectors O′S and OS, as shown in Figure 12. Note that the center of the robot, O, is unchanged. The objective is to estimate the distance D between the center of the robot O and the source S.

Fig. 11 3D view of the system for distance localization.

Fig. 12 Gray triangle in Figure 11.

5.1 Mathematical Model for Distance Localization

Consider the gray triangle shown in Figure 12. Based on the far-field assumption in Section 2.2, the length R′P′ is given by

d′ = b sin γ. (22)

In the triangle △SOO′, we have

sin γ = △d / √((△d)² + D²). (23)

Defining the state as x_dist ≜ D and the output as y_dist, the state-space model is given by

ẋ_dist = 0, (24)
y_dist = b△d / √((△d)² + D²). (25)

Theorem 3 The system described by Equations (24) and (25) is observable if the following conditions are satisfied: 1) b ≠ 0, 2) △d ≠ 0, and 3) D ≠ 0.

Proof The observability matrix associated with (24) and (25) is given by

O_dist = [ −b△dD((△d)² + D²)^(−3/2)  ··· ]. (26)

So the system is observable if

b ≠ 0, △d ≠ 0, and D ≠ 0. (27)

Remark 3 As the microphones are separated by a non-zero distance, i.e., b ≠ 0, and the microphone array is translated by a non-zero distance, i.e., △d ≠ 0, the system is always observable unless the sound source and the robot are at the same location, making D = 0, which is not in the scope of discussion of this paper.
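For intuition (our own algebra, not part of the paper's development), a single noiseless measurement of (25) could be inverted in closed form:

```latex
y_{dist} = \frac{b\,\triangle d}{\sqrt{(\triangle d)^2 + D^2}}
\quad\Longrightarrow\quad
D = \triangle d\,\sqrt{\left(\frac{b}{y_{dist}}\right)^2 - 1},
\qquad 0 < y_{dist} \le b.
```

In practice the measurements are noisy, which is why the EKF of the next section fuses the measurements from many small shifts instead of inverting a single one.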

6 Extended Kalman Filter

The estimation of the angles and distance of the sound source is conducted by extended Kalman filters (EKF). Detailed mathematical derivation of the EKF can be found in [4]. Algorithm 0.2 summarizes the EKF procedure used in this paper for SSL. The sensor covariance matrix (R) is defined as σ_w², and the process covariance matrix (Q) is defined as σ_v² for the distance localization, σ_v1² for the 2D orientation localization, and diag{σ_v1², σ_v2²} for the 3D orientation localization, respectively, where σ_vi is the process noise variance corresponding to the i-th state and σ_w is the sensor noise variance. Key parameters are listed in Table 1. The complete EKF-based SSL procedure is illustrated in Figure 13.

Table 1 EKF parameters.

Parameter | Angular localization | Distance localization
Process noise variance (σ_vi, i = 1, 2) | 0.01 | 0.1
Sensor noise variance (σ_w) | 0.01 | 0.001
Initial azimuth angle estimate (ϕ_initial) | 5° | –
Initial elevation angle estimate (θ_initial) | 5° | –
Initial distance estimate (D_initial) | – | 1 m

Algorithm 0.2 Pseudo code for EKF [4]
1: Initialize: x̂
2: At each value of sample rate T_out,
3: FOR i = 1 to N DO
   {Prediction}
4:   x̂ = x̂ + (T_out/N) f(x̂, u)
5:   A_J = ∂f/∂x (x̂, u)
6:   P = P + (T_out/N)(A_J P + P A_J^T + Q)
   {Update}
7:   C_J = ∂h/∂x (x̂, u)
8:   K = P C_J^T (R + C_J P C_J^T)^(−1)
9:   P = (I − K C_J) P
10:  x̂ = x̂ + K(y[n] − h(x̂, u[n]))
11: END FOR

Fig. 13 Block diagram showing the process for the proposed complete angular and distance localization of a sound source using successive rotational and translational motions of a set of two microphones.
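As a concrete instance, the sketch below specializes Algorithm 0.2 to the scalar distance model of Section 5 (our own code; the discrete covariance update is a simplification of line 6, and reading Table 1's entries as Q and R = σ_w² is our interpretation):

```python
# Sketch (ours): EKF for the distance model (24)-(25),
# x = D, f = 0, h(D) = b*dd/sqrt(dd^2 + D^2).
import numpy as np

def ekf_distance(y_seq, b, dd, D0=1.0, P0=1.0, Q=0.1, R=0.001**2):
    """y_seq: ITD-distance measurements taken after each array shift dd."""
    D, P = D0, P0
    for y in y_seq:
        P = P + Q                                  # prediction: f = 0, so A_J = 0
        h = b * dd / np.sqrt(dd**2 + D**2)         # predicted measurement
        C = -b * dd * D / (dd**2 + D**2)**1.5      # Jacobian C_J = dh/dD
        K = P * C / (R + C * P * C)                # scalar Kalman gain
        D = D + K * (y - h)                        # state update
        P = (1.0 - K * C) * P                      # covariance update
    return D
```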

7 Simulation Results

In this section, we present the simulation results of the proposed localization technique for both the angle and distance localization of a sound source.

7.1 Simulation Environment

The Audio Array Toolbox [16] is used to simulate a rectangular space using the image method described in [2]. The robot was placed in the center (origin) of the room. The two microphones were separated by a distance of 0.18 m from each other, which is equal to the approximate distance between human ears. The sound source and the microphones are assumed omnidirectional and the attenuation of the sound is calculated as per the specifications in Table 2.

Table 2 Simulated room specifications

Parameter | Value
Dimension | 20 m x 20 m x 20 m
Reflection coefficient of each wall | 0.5
Reflection coefficient of the floor | 0.5
Reflection coefficient of the ceiling | 0.5
Velocity of the sound | 345 m/s
Temperature | 22°C
Static pressure | 29.92 mmHg
Relative humidity | 38 %

7.2 Validation of Observability

As discussed earlier, Theorem 1 shows that the 2D model is always observable; however, it does not provide any elevation information of the sound source. On the other hand, Theorem 2 shows that the 3D model is unobservable when the elevation angle of the sound source is 0° or 90°. In order to validate the observability analysis, localization was performed in the simulated environment.

For a sound source located on a 2D plane, Figure 14 shows the average of absolute estimation errors versus different azimuth angles with the sound source at distances of 5 m and 10 m to the robot, respectively. It can be seen that all errors are smaller than 1.8° and the mean of the average of absolute errors is approximately 1° for the two cases.

Fig. 14 Average of absolute errors in azimuth angle estimation using the 2D model with a sound source placed at different azimuth locations at a constant distance of 5 m and 10 m from the center of the robot.

To verify the observability conditions for the 3D model as described by Equations (10) and (11), the sound source is placed at different locations with a distance of 5 m from the robot in the simulated room, which evenly cover the hemisphere above the ground, as shown in Figure 15. Figure 16 shows the averaged absolute errors in the elevation estimation versus the actual azimuth and elevation angles of the sound source. Larger errors were observed when the elevation was close to 0°, which coincides with Theorem 2. Figure 17 shows the averaged absolute errors in the azimuth angle estimation for a single sound source at different positions. Larger errors were observed when the elevation was close to 90°, which again echoes Theorem 2.

Fig. 15 Sound source locations with a fixed distance of 5 m to the center of the robot in the simulated room.

Fig. 16 Average of absolute errors in elevation estimation using the 3D localization model. Relatively large errors illustrate the non-observability condition in elevation angle estimation with the sound source placed around 0° elevation, as described by Theorem 2.

Fig. 17 Average of absolute errors in azimuth estimation using the 3D localization model. Relatively large errors illustrate the non-observability condition in azimuth angle estimation with the sound source placed around 90° elevation, as described by Theorem 2.

Sound source tions. Larger errors were observed when the elevation Robot 5 was close to 90o, which again echoes Theorem 2. Sz (m) 0 7.3 Simulation Results for Orientation Localization 5 5 0 0 A number of experiments were performed to validate Sy (m) −5 −5 Sx (m) the performance of the proposed SSL technique for ori- Fig. 15 Sound source locations with a fixed distance of 5 m to entation localization, as described in Algorithm 0.1. the center of the robot in the simulated room. White noise and speech signals were used as a sound

Average absolute errors in estimated elevation source which was placed individually at different loca- tions in the simulated room with specifications sum- marized in Table 2. The microphone array was rotated

8 with an angular velocity of ω = 2π/5 rad/sec in the 6 clockwise direction for three complete revolutions. The (deg) θ 4 ITD was calculated after every 1o rotation followed by 2

Error in the estimation performed using the EKF with parame- 0 0 200 ters given in Table 1. Four different sets of experiments 20 100 40 0 were performed keeping the source at different loca- 60 80 −100 tions. In first two sets of experiments, the source was 100 −200 Actual ϕ (deg) Actual θ (deg) placed in all four quadrants including the axes at dif- o Fig. 16 Average of absolute errors in elevation estimation us- ferent distances, keeping the elevation constant at 20 ing the 3D localization model. Relatively large errors illustrate and 60o. To validate the performance of the proposed the non-observability condition in elevation angle estimation with solution to the non-observability conditions, other two sound source placed around 0o elevation, as described by Theo- rem 2 . sets experiments were performed by keeping the sound source at elevation close to 0o and 90o. The results of the localization are presented in Tables 3 and 4. It can sound source is placed at different locations with a dis- be seen that orientation localization is achieved with tance of 5 m from the robot in the simulated room, errors less than 4o using speech as well as white noise which evenly cover the hemisphere above the ground, sound source. Large errors are observed when the eleva- as shown in Figure 15. Figure 16 shows the averaged tion of the sound source is around 0o and 90o. Further, absolute errors in the elevation estimation versus ac- the errors with source elevation around 0o is less as com- tual azimuth and elevation angles of the sound source. pared to source elevation around 90o. This was achieved Larger errors were observed when the elevation was by using polynomial curve fitting approach mentioned o o close to 0 , which coincides with Theorem 2. Figure 17 in Section 4.2, with RMSEthreshold =1.9 , which cor- shows the averaged absolute errors in the azimuth angle responds to θ = 15o on the fitted curve shown in Fig- estimation for a single sound source at different posi- ure 9 . The value dthreshold was calculated as 0.017 m 11

Table 3 Simulation results of orientation localization for speech

Expt. No. | Act. D(m) | Act. ϕ(°) | Est. ϕ(°) | Avg of abs error (°) | Act. θ(°) | Est. θ(°) | Avg of abs error (°)
1a | 5 | 0 | 0.60 | 0.60 | 20 | 20.39 | 0.39
1b | 5 | 50 | 51.03 | 1.03 | 20 | 21.44 | 1.44
1c | 7 | 90 | 91.21 | 0.21 | 20 | 20.83 | 0.83
1d | 7 | 120 | 121.57 | 1.57 | 20 | 20.96 | 0.96
1e | 3 | 180 | 181.03 | 1.03 | 20 | 20.16 | 0.16
1f | 3 | -40 | -39.33 | 0.67 | 20 | 19.10 | 0.90
1g | 10 | -90 | -88.85 | 1.15 | 20 | 21.66 | 1.66
1h | 10 | -140 | -139.52 | 0.48 | 20 | 21.18 | 1.18
2a | 5 | 0 | 2.31 | 2.31 | 60 | 60.68 | 0.68
2b | 5 | 50 | 50.65 | 0.65 | 60 | 60.53 | 0.53
2c | 7 | 90 | 91.79 | 1.79 | 60 | 60.70 | 0.70
2d | 7 | 120 | 121.85 | 1.85 | 60 | 60.84 | 0.84
2e | 3 | 180 | 181.66 | 1.66 | 60 | 60.05 | 0.05
2f | 3 | -40 | -38.66 | 1.34 | 60 | 60.38 | 0.38
2g | 10 | -90 | -89.38 | 0.62 | 60 | 59.62 | 0.38
2h | 10 | -140 | -138.20 | 1.80 | 60 | 59.78 | 0.22
3a | 5 | 50 | 50.69 | 0.31 | 0 | 3.39 | 3.39
3b | 7 | -120 | -119.00 | 1.00 | 4 | 2.40 | 1.60
4a | 5 | -40 | not def. | not def. | 86 | 90.00 | 4.00
4b | 7 | 150 | not def. | not def. | 89 | 90.00 | 1.00

Table 4 Simulation results of orientation localization for white noise

Expt. No. | Act. D(m) | Act. ϕ(°) | Est. ϕ(°) | Avg of abs error (°) | Act. θ(°) | Est. θ(°) | Avg of abs error (°)
1a | 5 | 0 | 1.18 | 1.18 | 20 | 19.66 | 0.34
1b | 5 | 50 | 51.03 | 1.03 | 20 | 20.44 | 0.44
1c | 7 | 90 | 90.25 | 0.25 | 20 | 20.11 | 0.11
1d | 7 | 120 | 121.35 | 1.35 | 20 | 19.70 | 0.30
1e | 3 | 180 | 180.41 | 0.41 | 20 | 20.48 | 0.48
1f | 3 | -40 | -39.44 | 0.56 | 20 | 19.75 | 0.25
1g | 10 | -90 | -89.11 | 0.89 | 20 | 19.71 | 0.29
1h | 10 | -140 | -139.67 | 0.33 | 20 | 21.18 | 1.18
2a | 5 | 0 | 1.31 | 1.31 | 60 | 60.38 | 0.38
2b | 5 | 50 | 51.59 | 1.59 | 60 | 60.39 | 0.39
2c | 7 | 90 | 90.74 | 0.74 | 60 | 60.87 | 0.87
2d | 7 | 120 | 121.21 | 1.21 | 60 | 60.39 | 0.39
2e | 3 | 180 | 181.16 | 1.16 | 60 | 60.51 | 0.51
2f | 3 | -40 | -38.66 | 1.34 | 60 | 60.41 | 0.41
2g | 10 | -90 | -88.90 | 1.10 | 60 | 60.70 | 0.70
2h | 10 | -140 | -138.64 | 1.36 | 60 | 60.57 | 0.57
3a | 5 | 50 | 51.45 | 1.45 | 0 | 1.57 | 1.57
3b | 7 | -120 | -118.36 | 1.64 | 4 | 1.57 | 2.43
4a | 5 | -40 | not def. | not def. | 86 | 90.00 | 4.00
4b | 7 | 150 | not def. | not def. | 89 | 90.00 | 1.00

7.4 Simulation Results for Distance Localization

Speech and white-noise sounds were also used to test the performance of the distance localization. A single sound source was placed at different locations and the ITD signal was recorded while the microphone array was continuously shifted for 200 steps, each with a distance of △d = 0.0007 m. The results are summarized in Tables 5 and 6. The key parameters of the EKF are given in Table 1. The results for the distance localization with a sound source placed at different locations are shown in Figure 18. It is observed that the error in the estimation converges quickly and a total microphone-array shift of approximately 3 cm is sufficient for the estimates to completely converge to and remain within the three-standard-deviation bounds. The average of absolute error in the estimation is found to be less than 0.6 m for both the speech and the white noise sound sources.

Table 5 Simulation results of distance localization using a speech sound source

Expt. No. | Act. ϕ(°) | Act. θ(°) | Act. D(m) | Est. D(m) | Avg of abs error (m)
1a | 0 | 20 | 5 | 5.01 | 0.01
1b | 50 | 20 | 5 | 5.01 | 0.01
1c | 90 | 20 | 7 | 6.94 | 0.06
1d | 120 | 20 | 7 | 6.93 | 0.07
1e | 180 | 20 | 3 | 3.01 | 0.01
1f | -40 | 20 | 3 | 3.01 | 0.01
1g | -90 | 20 | 10 | 9.54 | 0.46
1h | -140 | 20 | 10 | 9.81 | 0.19
2a | 0 | 60 | 5 | 5.02 | 0.02
2b | 50 | 60 | 5 | 5.02 | 0.02
2c | 90 | 60 | 7 | 6.94 | 0.06
2d | 120 | 60 | 7 | 6.94 | 0.06
2e | 180 | 60 | 3 | 3.00 | 0.00
2f | -40 | 60 | 3 | 3.01 | 0.01
2g | -90 | 60 | 10 | 9.52 | 0.48
2h | -140 | 60 | 10 | 9.41 | 0.59
3a | 50 | 0 | 5 | 5.02 | 0.02
3b | -120 | 4 | 7 | 6.87 | 0.13
4a | -40 | 86 | 5 | 5.02 | 0.02
4b | 150 | 89 | 7 | 6.83 | 0.17

Table 6 Simulation results of distance localization using a white noise sound source

Expt. No. | Act. ϕ(°) | Act. θ(°) | Act. D(m) | Est. D(m) | Avg of abs error (m)
1a | 0 | 20 | 5 | 5.01 | 0.01
1b | 50 | 20 | 5 | 5.01 | 0.01
1c | 90 | 20 | 7 | 6.92 | 0.08
1d | 120 | 20 | 7 | 6.92 | 0.08
1e | 180 | 20 | 3 | 3.01 | 0.01
1f | -40 | 20 | 3 | 3.01 | 0.01
1g | -90 | 20 | 10 | 9.52 | 0.48
1h | -140 | 20 | 10 | 9.44 | 0.56
2a | 0 | 60 | 5 | 5.01 | 0.01
2b | 50 | 60 | 5 | 5.01 | 0.01
2c | 90 | 60 | 7 | 6.92 | 0.08
2d | 120 | 60 | 7 | 6.92 | 0.08
2e | 180 | 60 | 3 | 3.01 | 0.01
2f | -40 | 60 | 3 | 3.01 | 0.01
2g | -90 | 60 | 10 | 9.48 | 0.52
2h | -140 | 60 | 10 | 9.43 | 0.57
3a | 50 | 0 | 5 | 5.01 | 0.01
3b | -120 | 4 | 7 | 6.89 | 0.11
4a | -40 | 86 | 5 | 5.01 | 0.01
4b | 150 | 89 | 7 | 6.90 | 0.10

Fig. 18 Simulation results for distance estimation using the EKF. A single sound source was placed at two different locations with distances of 5 m and 10 m, respectively. The bounds represent the three standard deviations of the estimation error.

8 Experimental Results

Experiments were conducted using two different hardware platforms: a KEMAR dummy head in a well-equipped hearing laboratory and a robotic platform equipped with a set of two rotational microphones. The following subsections discuss the hardware platforms and the results.

8.1 Results using KEMAR Dummy Head

Experiments using the KEMAR dummy head were conducted in a high-frequency-focused sound treated room [53] with dimensions 4.6 m x 3.7 m x 2.7 m, as shown in Figure 19. The ITD, however, is mostly effective as a spatial hearing cue for low frequency sounds below 1.5 kHz [35]. The walls, floor, and ceiling of the room were covered by polyurethane acoustic foam with a thickness of only 5 cm, which is relatively small compared to the sound wavelength and therefore provides relatively low reduction at low and middle frequencies [6], making it a challenging acoustic environment. For broadband noise, T60 (i.e., the time required for the sound level to decay 60 dB [43]) was 97 ms. In an octave band centered at 1000 Hz, T60 for the noise was on average 324 ms.

The audio signals, digitally generated using a MATLAB program and three 12-channel Digital-to-Analog converters running at 44,100 cycles per second per channel, were amplified using AudioSource AMP 1200 amplifiers before they were played from an array of 36 loudspeakers. The two microphones were installed on the KEMAR dummy head temporarily mounted on a rotating chair, which was rotated at an approximate rate of 32°/s for about two circles in the middle of the room. The data collected in the second rotation was used for the EKF. Motion data was collected by the gyroscope mounted on the top of the dummy head. The audio signals were amplified and collected by a sound card and were then stored on a desktop computer for further processing. The ITD was processed with a generalized cross-correlation model [30] in each time frame corresponding to the 120 Hz sampling rate of the gyroscope. The computation was completed by a MATLAB program on a desktop computer. Raw data with a single sound source located at four different locations were collected.

Fig. 19 Setup of the KEMAR dummy head on a rotating chair in the middle of the sound treated room [46].

The left two subfigures in Figure 20 are generated when the actual elevation angle is 0°. It can be seen that the azimuth estimations using the 2D and 3D models are very close, which implies that the actual elevation angle is close to 0° and the elevation estimation using the 3D model is not reliable. The right two subfigures in Figure 20 are generated when the actual elevation angle is 30°. It can be seen that the azimuth estimations using the 2D and 3D localization models are obviously different while the elevation estimation using the 3D model is fairly accurate, which verifies the proposed algorithm shown in Figure 10. Table 7 shows the estimation results obtained using the 3D localization model. It can be seen that the RMSE of the difference between the estimated azimuth values using respectively the 2D and 3D models works well in checking the zero elevation condition.

Fig. 20 Experimental results for orientation localization using the KEMAR dummy head. When θ = 0°, the azimuth estimates using the 2D and 3D models are very close (in the top-left figure), which implies the elevation estimates are not reliable (in the bottom-left figure). When θ = 30°, the azimuth estimates are obviously different (in the top-right figure), which implies reliable elevation estimates using the 3D model (in the bottom-right figure).

Table 7 Experimental results using the KEMAR dummy head: Orientation localization using the 3D model. (RMSE: difference between azimuth estimations using the 2D and 3D models, respectively)

Expt. No. | Act. ϕ(°) | Est. ϕ(°) | Avg of abs error (°) | RMSE (°) | Act. θ(°) | Est. θ(°) | Avg of abs error (°)
1 | 90 | 91.21 | 1.21 | 1.39 | 0 | 13.64 | 13.64
2 | -20 | -21.53 | 1.53 | 1.16 | 0 | 48.14 | 48.14
3 | 90 | 90.40 | 0.40 | 79.94 | 60 | 59.05 | 0.95

8.2 Results using Robotic Platform

Experiments were also performed using the robotic platform shown in Figure 21. In these experiments, two microelectromechanical systems (MEMS) analog/digital microphones were used for recording the sound signal coming from the sound source. Flex adapters were used to hold the microphones. The angular speed of the rotation of the microphone array was controlled by a bipolar stepper motor with the gear ratio adjusted to 0.9° per step. The stepper motor was controlled by an Arduino microprocessor. The distance between the two microphones was kept constant at 0.3 m. An audio (music) signal played on a loudspeaker was used as a sound source kept at different locations. The estimation results are shown in Figure 22 and Table 8.

Fig. 21 A two-microphone system is equipped on a ground robot.

It can be seen that the azimuth estimations using the 2D and 3D models shown in the top-left subfigure in Figure 22, generated when the actual elevation angle is 0°, are very close, which implies that the elevation is close to 0° and the elevation estimation shown in the bottom-left subfigure in Figure 22 using the 3D localization model is not reliable. However, the two subfigures on the right in Figure 22 are generated by keeping the sound source at an elevation angle of 55°. As proposed in the algorithm shown in Figure 10, the azimuth estimations using the 2D and 3D localization models are different while the elevation estimation using the 3D model is fairly accurate. Table 8 shows the estimation results obtained using the 3D localization model. It can be seen that the zero elevation condition can be checked using the RMSE of the difference between the estimated azimuth values using respectively the 2D and 3D models.

Table 8 Experimental results using the robotic platform: Orientation localization using the 3D model. (RMSE: difference between azimuth estimations using the 2D and 3D models, respectively)

Expt. No. | Act. ϕ(°) | Est. ϕ(°) | Avg of abs error (°) | RMSE (°) | Act. θ(°) | Est. θ(°) | Avg of abs error (°)
1 | -140 | -140.65 | 0.65 | 0.72 | 0 | 14.96 | 14.96
2 | 180 | 178.71 | 1.29 | 0.69 | 5 | 11.59 | 6.59
3 | 40 | 39.67 | 0.33 | 8.80 | 55 | 55.24 | 0.24
4 | 40 | 38.20 | 1.80 | 10.96 | 65 | 64.67 | 0.33

Fig. 22 Experimental results for orientation localization using the robotic platform. When θ = 0°, the azimuth estimates using the 2D and 3D models are very close (in the top-left figure), which implies the elevation estimates are not reliable (in the bottom-left figure). When θ = 55°, the azimuth estimates are obviously different (in the top-right figure), which implies reliable elevation estimates using the 3D model (in the bottom-right figure).

A fitted curve similar to the one shown in Figure 9 can be generated for the environment by keeping the sound source at different elevation angles and recording the RMSE values between the ϕ_2D and ϕ_3D estimations. The value of the parameter RMSE_threshold can then be decided, which can be used to check the θ = 0° scenario. Further, the generated fitted curve can be used to give a closer estimation of the elevation angle.

9 Conclusion

This paper presents a novel technique that performs a complete localization (i.e., both orientation and distance) of a stationary sound source in a three-dimensional (3D) space. Two singular conditions under which unreliable orientation localization occurs (the elevation angle equals 0° or 90°) were found by using observability theory. The root-mean-squared error (RMSE) value of the difference between the azimuth estimates using respectively the 2D and 3D models was used to check the 0° elevation condition, and the elevation was further estimated using a polynomial curve fitting technique. The 90° elevation was detected by checking for a zero-ITD signal. Based on an accurate orientation localization, the distance localization was done by first rotating the microphone array to face toward the sound source and then shifting the microphones perpendicular to the source-robot vector by a fixed number of steps. Under challenging acoustic environments with relatively low-energy targets and high-energy noise, high localization accuracy was achieved in both simulations and experiments. The mean of the average of absolute estimation error was less than 4° for angular localization and less than 0.6 m for distance localization in simulation results, and the techniques to detect θ = 0° and 90° were verified in both simulation and experimental results.

Acknowledgements The authors would like to thank Dr. Xuan Zhong for providing the experimental raw data collected using the KEMAR dummy head.

References

1. International Organization for Standardization (ISO), British, European and International Standards (BSEN): Noise emitted by machinery and equipment – Rules for the drafting and presentation of a noise test code. 12001: 1997

2. Allen, J.B., Berkley, D.A.: Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65(4), 943–950 (1979). doi:10.1121/1.382599
3. Azaria, M., Hertz, D.: Time delay estimation by generalized cross correlation methods. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), 280–285 (1984). doi:10.1109/TASSP.1984.1164314
4. Beard, R., McLain, T.: Small Unmanned Aircraft: Theory and Practice. Princeton University Press (2012)
5. Benesty, J., Chen, J., Huang, Y.: Microphone array signal processing, vol. 1. Springer Science & Business Media (2008). doi:10.1007/978-3-540-78612-2
6. Beranek, L.L., Mellow, T.J.: Acoustics: sound fields and transducers. Academic Press (2012)
7. Blumrich, R., Altmann, J.: Medium-range localisation of aircraft via triangulation. Applied Acoustics 61(1), 65–82 (2000). doi:10.1016/S0003-682X(99)00066-3
8. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 27(2), 113–120 (1979). doi:10.1109/TASSP.1979.1163209
9. Boll, S., Pulsipher, D.: Suppression of acoustic noise in speech using two microphone adaptive noise cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(6), 752–753 (1980). doi:10.1109/TASSP.1980.1163472
10. Borenstein, J., Everett, H., Feng, L.: Navigating mobile robots: systems and techniques. A K Peters Ltd. (1996)
11. Brandes, T.S., Benson, R.H.: Sound source imaging of low-flying airborne targets with an acoustic camera array. Applied Acoustics 68(7), 752–765 (2007). doi:10.1016/j.apacoust.2006.04.009
12. Brandstein, M., Ward, D.: Microphone arrays: signal processing techniques and applications. Springer Science & Business Media (2013). doi:10.1007/978-3-662-04619-7
13. Brassington, G.: Mean absolute error and root mean square error: which is the better metric for assessing model performance? In: EGU General Assembly Conference Abstracts, vol. 19, p. 3574 (2017)
14. Calmes, L.: Biologically inspired binaural sound source localization and tracking for mobile robots. Ph.D. thesis, RWTH Aachen University (2009)
15. Chen, J., Benesty, J., Huang, Y.: Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing pp. 170–170 (2006). doi:10.1155/ASP/2006/26
16. Donohue, K.D.: Audio array toolbox. [Online] Available: http://vis.uky.edu/distributed-audio-lab/about/, 2017, Dec 22
17. Gala, D., Lindsay, N., Sun, L.: Three-dimensional sound source localization for unmanned ground vehicles with a self-rotational two-microphone array. In: Proceedings of the 5th International Conference of Control, Dynamic Systems, and Robotics (CDSR'18). Accepted (2018)
18. Gala, D.R., Misra, V.M.: SNR improvement with speech enhancement techniques. In: Proceedings of the International Conference and Workshop on Emerging Trends in Technology, ICWET '11, pp. 163–166. ACM (2011). doi:10.1145/1980022.1980058
19. Gala, D.R., Vasoya, A., Misra, V.M.: Speech enhancement combining spectral subtraction and techniques for microphone array. In: Proceedings of the International Conference and Workshop on Emerging Trends in Technology, ICWET '10, pp. 163–166 (2010). doi:10.1145/1741906.1741938
20. Gill, D., Troyansky, L., Nelken, I.: Auditory localization using direction-dependent spectral information. Neurocomputing 32, 767–773 (2000). doi:10.1016/S0925-2312(00)00242-3
21. Goelzer, B., Hansen, C.H., Sehrndt, G.: Occupational exposure to noise: evaluation, prevention and control. World Health Organisation (2001)
22. Goldstein, E.B., Brockmole, J.: Sensation and perception. Cengage Learning (2016)
23. Hedrick, J.K., Girard, A.: Control of nonlinear dynamic systems: Theory and applications. Controllability and observability of Nonlinear Systems p. 48 (2005)
24. Hermann, R., Krener, A.: Nonlinear controllability and observability. IEEE Transactions on Automatic Control 22(5), 728–740 (1977). doi:10.1109/TAC.1977.1101601
25. Hornstein, J., Lopes, M., Santos-Victor, J., Lacerda, F.: Sound localization for humanoid robots—building audio-motor maps based on the HRTF. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1170–1176 (2006). doi:10.1109/IROS.2006.281849
26. Huang, Y., Benesty, J., Elko, G.W.: Passive acoustic source localization for video camera steering. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II909–II912 (2000). doi:10.1109/ICASSP.2000.859108
27. Kaushik, B., Nance, D., Ahuja, K.: A review of the role of acoustic sensors in the modern battlefield. In: 11th AIAA/CEAS Aeroacoustics Conference (2005). doi:10.2514/6.2005-2997
28. Keyrouz, F.: Advanced binaural sound localization in 3-D for humanoid robots. IEEE Transactions on Instrumentation and Measurement 63(9), 2098–2107 (2014). doi:10.1109/TIM.2014.2308051
29. Keyrouz, F., Diepold, K.: An enhanced binaural 3D sound localization algorithm. In: IEEE International Symposium on Signal Processing and Information Technology, pp. 662–665 (2006). doi:10.1109/ISSPIT.2006.270883
30. Knapp, C., Carter, G.: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 24(4), 320–327 (1976). doi:10.1109/TASSP.1976.1162830
31. Kumon, M., Uozumi, S.: Binaural localization for a mobile sound source. Journal of Biomechanical Science and Engineering 6(1), 26–39 (2011). doi:10.1299/jbse.6.26
32. Kneip, L., Baumann, C.: Binaural model for artificial spatial sound localization based on interaural time delays and movements of the interaural axis. The Journal of the Acoustical Society of America pp. 3108–3119 (2008). doi:10.1121/1.2977746
33. Lu, Y.C., Cooke, M.: Motion strategies for binaural localisation of speech sources in azimuth and distance by artificial listeners. Speech Communication 53(5), 622–642 (2011). doi:10.1016/j.specom.2010.06.001
34. Lu, Y.C., Cooke, M., Christensen, H.: Active binaural distance estimation for dynamic sources. In: INTERSPEECH, pp. 574–577 (2007)
35. Middlebrooks, J.C., Green, D.M.: Sound localization by human listeners. Annual Review of Psychology 42(1), 135–159 (1991). doi:10.1146/annurev.ps.42.020191.001031
36. Naylor, P., Gaubitch, N.D.: Speech dereverberation. Springer Science & Business Media (2010). doi:10.1007/978-1-84996-056-4
37. Nguyen, Q.V., Colas, F., Vincent, E., Charpillet, F.: Long-term robot motion planning for active sound source localization with Monte Carlo tree search. In: Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 61–65 (2017). doi:10.1109/HSCMA.2017.7895562
38. Omologo, M., Svaizer, P.: Acoustic source location in noisy and reverberant environment using CSP analysis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 921–924 (1996). doi:10.1109/ICASSP.1996.543272
39. Pang, C., Liu, H., Zhang, J., Li, X.: Binaural sound localization based on reverberation weighting and generalized parametric mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(8), 1618–1632 (2017). doi:10.1109/TASLP.2017.2703650
40. Perrett, S., Noble, W.: The effect of head rotations on vertical plane sound localization. The Journal of the Acoustical Society of America 102(4), 2325–2332 (1997). doi:10.1121/1.419642
41. Rodemann, T.: A study on distance estimation in binaural sound localization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 425–430 (2010). doi:10.1109/IROS.2010.5651455
42. Rodemann, T., Ince, G., Joublin, F., Goerick, C.: Using binaural and spectral cues for azimuth and elevation localization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2185–2190 (2008). doi:10.1109/IROS.2008.4650667
43. Sabine, W.: Collected Papers on Acoustics. Harvard University Press (1922)
44. Spriet, A., Van Deun, L., Eftaxiadis, K., Laneau, J., Moonen, M., Van Dijk, B., Van Wieringen, A., Wouters, J.: Speech understanding in background noise with the two-microphone adaptive beamformer BEAM in the Nucleus Freedom cochlear implant system. Ear and Hearing 28(1), 62–72 (2007). doi:10.1097/01.aud.0000252470.54246.54
45. Sturim, D.E., Brandstein, M.S., Silverman, H.F.: Tracking multiple talkers using microphone-array measurements. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 371–374 (1997). doi:10.1109/ICASSP.1997.599650
46. Sun, L., Zhong, X., Yost, W.: Dynamic binaural sound source localization with interaural time difference cues: Artificial listeners. The Journal of the Acoustical Society of America 137(4), 2226–2226 (2015). doi:10.1121/1.4920636
47. Tamai, Y., Kagami, S., Amemiya, Y., Sasaki, Y., Mizoguchi, H., Takano, T.: Circular microphone array for robot's audition. In: Proceedings of IEEE Sensors, 2004, vol. 2, pp. 565–570 (2004). doi:10.1109/ICSENS.2004.1426228
48. Tamai, Y., Sasaki, Y., Kagami, S., Mizoguchi, H.: Three ring microphone array for 3D sound localization and separation for mobile robot audition. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4172–4177 (2005). doi:10.1109/IROS.2005.1545095
49. Tiete, J., Domínguez, F., Silva, B.d., Segers, L., Steenhaut, K., Touhafi, A.: SoundCompass: a distributed MEMS microphone array-based sensor for sound source localization. Sensors 14(2), 1918–1949 (2014). doi:10.3390/s140201918
50. Valin, J.M., Michaud, F., Rouat, J., Letourneau, D.: Robust sound source localization using a microphone array on a mobile robot. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 2, pp. 1228–1233 (2003). doi:10.1109/IROS.2003.1248813
51. Wallach, H.: On sound localization. The Journal of the Acoustical Society of America 10(4), 270–274 (1939). doi:10.1121/1.1915985
52. Wang, H., Chu, P.: Voice source localization for automatic camera pointing system in videoconferencing. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 187–190 (1997). doi:10.1109/ICASSP.1997.599595
53. Yost, W.A., Zhong, X.: Sound source localization identification accuracy: Bandwidth dependencies. The Journal of the Acoustical Society of America 136(5), 2737–2746 (2014). doi:10.1121/1.4898045
54. Zhong, X., Sun, L., Yost, W.: Active binaural localization of multiple sound sources. Robotics and Autonomous Systems 85, 83–92 (2016). doi:10.1016/j.robot.2016.07.008
55. Zhong, X., Yost, W., Sun, L.: Dynamic binaural sound source localization with ITD cues: Human listeners. The Journal of the Acoustical Society of America 137(4), 2376–2376 (2015). doi:10.1121/1.4920636
56. Zietlow, T., Hussein, H., Kowerko, D.: Acoustic source localization in home environments—the effect of microphone array geometry. In: 28th Conference on Electronic Speech Signal Processing, pp. 219–226 (2017)