
Acoustical Classification of the Urban Road Traffic with Large Arrays of Microphones
Raphaël Leiba, François Ollivier, Régis Marchiano, Nicolas Misdariis, Jacques Marchal, Pascal Challande

To cite this version:

Raphaël Leiba, François Ollivier, Régis Marchiano, Nicolas Misdariis, Jacques Marchal, et al.. Acoustical Classification of the Urban Road Traffic with Large Arrays of Microphones. Acta Acustica united with Acustica, Hirzel Verlag, 2019, 105 (6), pp.1067-1077. DOI 10.3813/AAA.919385. hal-02457922

HAL Id: hal-02457922 https://hal.sorbonne-universite.fr/hal-02457922 Submitted on 28 Jan 2020


Acoustical Classification of the Urban Road Traffic with Large Arrays of Microphones

Raphaël Leiba 1,2), François Ollivier 1), Régis Marchiano 1), Nicolas Misdariis 2), Jacques Marchal 1), Pascal Challande 1)
1) Sorbonne Université, CNRS, Institut Jean Le Rond ∂'Alembert, 75005 Paris, France. [email protected]
2) STMS Ircam-CNRS-SU

Colour Figures: Figures in colour are given in the online version.

Summary
This work is part of a study dealing with city-dwellers' quality of life. Noise is known to be an important factor influencing the quality of life. In order to diagnose it properly, we propose a noise monitoring system for urban areas. It is based on the use of large arrays in order to extract the radiated sound field of each passing-by vehicle in typical urban scenes. A machine learning algorithm is trained so as to classify these extracted signals into clusters combining both the vehicle type and the driving conditions. This system makes it possible to monitor the evolution of the noise levels for each cluster. The proposed system was first tested on passing-by isolated vehicle measurements and then implemented in a real street in Paris (France).
© 2019 The Author(s). Published by S. Hirzel Verlag · EAA. This is an open access article under the terms of the Creative Commons Attribution (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/).
PACS no. 43.50.Lj, 43.50.Rq, 43.60.Yw, 43.60.Bf, 43.60.Cg, 43.60.Fg, 43.60.Jn, 43.60.Np, 43.60.Vx

1. Introduction

Quality of life is a major challenge in urban places because of the large number of environmental factors influencing it. In projects such as Mouvie [1], systemic approaches are needed to work on both air pollution and sound pollution and their impact on the comfort of city-dwellers.

Notably, noise has stress-related impacts on human health such as sleep disturbances or cardiovascular diseases [2, 3, 4]. The World Health Organisation (WHO) indicates [5] that between 1 and 1.6 million years in good health (disability-adjusted life-years, or DALYs) are lost every year in western Europe because of transportation noise. Sleep disturbance is the major effect, with 903,000 years lost every year, and long-term noise annoyance (due to passive effects of noise) is the second effect (654,000 years lost).

In order to unify the national initiatives and provide an efficient tool to diagnose the urban sonic environment, the European Parliament voted the 2002/49/CE directive in 2002 [6]. It requires the large cities of the member countries of the European Union to produce noise maps based on the Lden index (which stands for Day-Evening-Night level). This index aims to take into account the influence of the time-periods of the day on noise annoyance by increasing the evening levels by 5 dB(A) and the night levels by 10 dB(A). After this diagnosis step, the large cities (over 100,000 inhabitants) have to provide action plans showing their ambition to reduce, with practical actions, the proportion of persons highly annoyed by noise.

These regulations, with regard to WHO's threshold values [7, 8], provide an efficient way to inform city dwellers of their average exposure to noise. But these maps have some limits. First, they are provided for each type of source separately (road, aircraft, railway and industrial), whereas in urban areas city dwellers are usually exposed to several sound sources at the same time, with possible interactions and not just a summation, so that the overall exposure is not estimated. Second, whereas the variation of noise exposure over the day may be of interest for city-dwellers and urban planners, it is averaged out by the use of the Lden. Third, the noise maps are obtained by simulations and are rarely confronted with measurements, and consequently rarely validated experimentally. It is therefore interesting to see the emergence of monophonic acoustic sensor networks (such as CENSE or DYNAMAP [9]) that provide additional real-time information about the sonic environment.

In this work, we focus on the noise induced by road traffic. In this specific context, noise maps have additional weaknesses. Thus, though they are often cited as one of the most annoying sources [10, 11], powered two-wheelers were considered as light vehicles [11]. Indeed, in many cities the traffic flow estimation is based on counting the number of axles of the passing-by vehicles; therefore it cannot distinguish a powered two-wheeler from a car.

Received 17 June 2019, accepted 25 October 2019.
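The Lden weighting described above reduces to a short formula. The sketch below is a minimal illustration of the directive's evening and night penalties; the function name and the default 12 h day / 4 h evening / 8 h night period split are assumptions of this sketch, not taken from the paper:

```python
import math

def lden(l_day, l_evening, l_night):
    """Day-Evening-Night level: energetic 24-hour average of the three
    period levels, with +5 dB(A) added to the evening and +10 dB(A) to
    the night, as described for directive 2002/49/CE."""
    return 10.0 * math.log10(
        (12 * 10 ** (l_day / 10.0)
         + 4 * 10 ** ((l_evening + 5.0) / 10.0)
         + 8 * 10 ** ((l_night + 10.0) / 10.0)) / 24.0
    )
```

When the evening and night levels sit exactly 5 and 10 dB(A) below the day level, the penalties cancel and Lden equals the day level; a level that is constant over the 24 hours yields a higher Lden, which is how the index weights evening and night annoyance.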


Note that the European Union voted in 2015 a directive [12] redefining the categories to be used for noise mapping, including the powered two-wheelers. This directive had to be applied before 31/12/2018. In addition, Marquis-Favre et al. [13] reported that the perceived noise annoyance induced by road traffic (here the short-term annoyance, estimated with active listening) is related to the mode of transportation (road, rail or air traffic) but also to the vehicle type. The driving conditions are also pointed out as a decisive factor in this annoyance. Thus it appears that the road traffic estimation should be improved to detect all types of vehicles and their driving conditions as well.

To this aim, we propose to capture the sound signal of each road vehicle and to extract from it the vehicle type and its driving condition, in order to provide a more detailed description of the road traffic noise and pave the way for short-term noise annoyance estimation in urban areas.

If the audio signal is known, it can be used to identify the vehicle. Indeed, sound source classification has been investigated with machine learning based on monophonic signals in the past decade (see e.g. [14, 15, 16, 17]). But in major streets the spatial and temporal masking effects of the different sound sources prevent classifying each vehicle properly.

Multiple studies have been conducted to separate sound sources in monophonic signals (see for example Gloaguen et al. [18] for recent developments in Non-negative Matrix Factorisation applied to urban sound scenes), but we assume that a spatial filtering technique will provide better results and extract each vehicle's audio signal with more accuracy.

Weinstein et al. [19] showed that microphone arrays can be used for sound source separation using inverse techniques. The application to low-speed moving source signal extraction has been done by Hafizovic et al. [20] on a basketball court with a 300-microphone array.

In this article, we propose a road traffic monitoring system. It aims at detecting each vehicle type, identifying its driving conditions and extracting its specific sound signal. This permits the computation of indices, such as loudness, that could be used to better assess noise annoyance. By identifying the vehicles and isolating their audio features, the proposed system provides more detailed information than that provided by standard urban noise observatories.

Part 2 presents the tools implemented in the study. First, a video tracking method provides the trajectory of each vehicle (see Sec. 2.1). From this trajectory, the system uses large microphone arrays (Sec. 2.2) together with a dedicated beamforming technique to extract the signal of each vehicle embedded in the traffic (Sec. 2.3). The last step of the process consists in classifying these signals into clusters mixing both vehicle type and driving condition (Sec. 2.4). Note that this study only focuses on internal combustion vehicles.

Part 3 presents the applications of the method. In a first step, it is used in a controlled set-up to characterise isolated vehicles on a test track (Section 3.1) and constitute a learning database. In a second step, the system is implemented in a real urban context to evaluate its performance in terms of classification according to objective features (Section 3.2). Finally, the system is modified to perform classification according to perceptual indices in Section 3.3.

Finally, Part 4 exposes the global outcome of this work by presenting the evolution of the sound level all over a day with respect to the estimated perceptual clusters.

2. Materials and Methods

The method for an acoustical classification of the urban road traffic starts from a tracking step of each vehicle in the traffic flow (Section 2.1). This provides the trajectory of the sound sources to be measured and classified. The individual signal extraction relies on the implementation of large microphone arrays, presented in Section 2.2. The method uses a beamforming technique dedicated to moving sources, presented in Section 2.3. The last step aims at classifying the extracted signals; the method, based on a supervised machine learning process, is presented in Section 2.4.

2.1. Moving vehicle tracking method

In order to obtain the trajectory of each vehicle embedded in the traffic, we developed an in-house tracking method. We first perform a contour detection on the video file recorded by a camera located at the centre of the microphone array. It is based on background subtraction for each video frame using the OpenCV library¹. The background is created by averaging the 500 preceding frames. Each moving object is reduced to a rectangle including it. The connection of rectangles between two consecutive frames is simply computed by finding the minimum distance between two rectangle centroids. Finally, the vehicle trajectory is obtained by gathering the connected centroids.

Figure 1a shows an example of vehicle tracking for a pass-by measurement. But the trajectory detection also has to be robust to the presence of obstacles (such as trees) between the camera and the vehicle, like in the configuration presented in Figure 1b: a camera over a multi-lane street. To do so, the current frame, from which the background is subtracted, is blurred in the direction of the vehicles.

2.2. Microphone arrays

2.2.1. The Megamicros acquisition system

The Megamicros project, introduced by Vanwynsberghe et al. [21] with a 128-microphone array, aims at providing digital acquisition systems able to capture up to 1024 synchronised acoustic signals. These systems are dedicated to applications such as acoustic imaging, room acoustics or source directivity measurements. Based on digital MEMS microphones, these systems are very versatile and easy to set up.

¹ Available at opencv.org
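The tracking pipeline of Section 2.1 (background averaging, foreground thresholding, rectangle reduction, nearest-centroid linking) can be sketched with a few NumPy functions. This is a minimal illustration under our own names and thresholds, not the authors' OpenCV-based implementation, which additionally uses contour detection and directional blurring:

```python
import numpy as np

def background(frames):
    """Background estimate: average of the preceding frames
    (500 of them in the paper)."""
    return np.mean(frames, axis=0)

def foreground_mask(frame, bg, thresh=25.0):
    """Moving pixels: absolute difference to the background above a
    threshold (threshold value is an assumption of this sketch)."""
    return np.abs(frame.astype(float) - bg) > thresh

def box_centroid(mask):
    """Reduce a detection mask to the centroid of its enclosing rectangle."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return ((xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0)

def link(prev_centroids, curr_centroids):
    """Connect the detections of two consecutive frames by minimum
    centroid-to-centroid distance; chaining these links over the whole
    sequence yields a vehicle trajectory."""
    pairs = []
    for i, (px, py) in enumerate(prev_centroids):
        dists = [np.hypot(px - cx, py - cy) for (cx, cy) in curr_centroids]
        pairs.append((i, int(np.argmin(dists))))
    return pairs
```

A real implementation would also handle detections appearing or disappearing between frames; the nearest-centroid rule above is the core of the association step described in the text.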


The MEMS microphones (ADMP441, Analog Devices) are omnidirectional and have a rather flat frequency response between 60 and 15,000 Hz. These systems allow building arrays of arbitrary geometries, possibly with extensions of a few tens of meters. Two such arrays were implemented in this study; they are presented in Sections 2.2.2 and 2.2.3. In the two following experiments, the signals are sampled at 50 kHz.

Table I. Characteristics of the vehicles for the track tests. hv stands for heavy vehicle, lv for light vehicle and twv for two-wheeler vehicle.

Label   Energy     Engine              Range
lv1     diesel     4 cyl.              sedan
lv2     gasoline   3 cyl.              urban
lv3     diesel     4 cyl.              minivan
lv4     diesel     4 cyl.              sedan
hv      diesel     4 cyl.              25 m³ utility
twv1    gasoline   50 cm³, 2-stroke    2-wheeler
twv2    gasoline   400 cm³, 4-stroke   2-wheeler

2.2.2. Isolated vehicle pass-by measurements

The microphone array used for the test-track experiment was designed to provide the best possible acoustic image of the noise sources of passing-by vehicles [22]. Therefore, the microphone array is large enough to offer a sufficient resolution at low frequencies and dense enough to avoid grating lobes at high frequencies. The array was built according to the geometry presented in Figure 2: 256 microphones were distributed over a 20 m long and 2.25 m high area, thanks to 32 vertical uprights supporting 8 microphones each. The microphone array was located 7.5 m away from the vehicle path, following the ISO 362 recommendations for pass-by noise measurements. In detail, the inter-microphone distances range from 10 cm to 1.53 m horizontally and from 17 cm to 39.9 cm vertically.

We aimed at being representative of the urban road traffic in terms of vehicle types and driving conditions. Table I lists the characteristics of the vehicles involved in the track tests. Note that the vehicle named hv is the one considered as a heavy vehicle although it is a large utility vehicle, not a proper bus or truck. The study only focuses on internal combustion vehicles.

To simulate all possible urban driving configurations, each vehicle under test performed the following scenarios in both back and forth directions:
• 25 km/h constant speed in second gear;
• 50 km/h constant speed in third gear;
• traffic light vicinity:
  – deceleration from 30 to 0 km/h;
  – 2 s stop, engine idling;
  – acceleration from 0 to 30 km/h with gear change;
• full acceleration over 20 m.

2.2.3. Urban experiment

Our goal is to monitor the road traffic in urban environments. Therefore, in situ measurements have been carried out on a multi-lane urban street (St Bernard Quay in Paris, France) with a specific microphone array. It is 21.6 m long and is located 9.5 meters above the street, overhanging a 3×1 lane street at a 13.5 m distance from its centre (see Figure 3). It is composed of 128 MEMS microphones regularly spaced with a pitch of 17 cm. This configuration allows segregating the vehicles in the same lane over a wide frequency range, even at low frequencies thanks to the array length. The overhanging position of the array introduces a phase difference between the lanes, which makes it possible to separate the sources located in different lanes. Due to its lightness and simplicity, the installation of the antenna only requires 30 minutes.

As illustrated in Figure 3, this experiment takes place in a street with an "L" shape, meaning that there is no building on the opposite side of the street. It is an important street and the microphone array is set up close to a traffic light, so that all the driving conditions described in the previous section can be expected. This experiment took place during a day in winter, which prevented the video tracking process from being disrupted by the leaves on the trees. Six sequences of ten minutes each have been recorded during day time.

Figure 1. (Colour online) Detected moving vehicles (green rectangles) and trajectory extraction (red lines) for isolated vehicles or a real urban situation. (a) Isolated vehicle on a test track (test-track experiment), (b) Vehicles in a Parisian street (in situ experiment).

Figure 2. (Colour online) Microphone array positions (red dots).


2.3. Moving source signal extraction

Knowing the source position at all times, the microphone array recordings can undergo a beamforming (BF) process to extract the audio signal of each passing-by vehicle from a multi-source sound scene.

In acoustics, BF is used as a reference method, since it is robust for source localization over a discretised plane or within a volume including static sound sources. The method can also be considered as a way of spatial filtering (see e.g. [19, 20]).

In this study, we propose to use this property on moving vehicles. Therefore, the standard free-field propagation model used in classical delay and sum (DAS) applications has been modified to take into account the kinematics of the vehicles. Figure 4 presents a classical scenario with a linear microphone array recording the sound field propagated by the i-th monopolar sound source moving on a straight line. Morse and Ingard [23], considering a homogeneous medium and a free field, write the pressure received at time $t_r$ at microphone $m$, emitted by source $i$ at time $t_e$, as

$$p_m(t_r) = \frac{s_i(t_e)}{r_{mi,e}\left(1 - M_a(t_e)\sin\theta_e\right)^2} \qquad (1)$$

with $t_r = t_e + r_{mi,e}/c_0$, $r_{mi,e}$ the distance between the microphone $m$ and the source $i$ at emission time $t_e$, $s_i(t) = \dot{q}(t)/4\pi$, where $\dot{q}(t)$ is the derivative of the source mass flow, and $M_a(t_e) = V(t_e)/c_0$ the Mach number of the source at emission time.

The reconstructed source signal $s_i(t_e)$ is estimated by Cousson et al. [24], for instance, with

$$s_i(t_e) = \frac{1}{N_m}\sum_{m=1}^{N_m} r_{mi,e}\left(1 - M_a(t_e)\sin\theta_e\right)^2 p_m(t_r) \qquad (2)$$

with $N_m$ the number of microphones. The central term bears the only modification of the classical DAS expression.

However, in the rest of this paper the energy compensation is discarded (by removing the $r_{mi,e}$ factor) so that the output signal level is the one recorded by the microphones. Thus, the extracted signal writes

$$\hat{s}_i(t_e) = \frac{1}{N_m}\sum_{m=1}^{N_m} \left(1 - M_a(t_e)\sin\theta_e\right)^2 p_m(t_r). \qquad (3)$$

The signal $\hat{s}_i(t)$ is an estimator of what would have been recorded by one microphone if only the source $s_i$ were in the sound scene. By doing so, only a spatial filtering is operated and the results are more comparable to the noise maps (which provide Lden at the facade) and to the city dwellers' perception.

The simulations and the experimental tests presented in Appendix A1 show that the BF method of equation (3) is accurate enough to extract each vehicle signal. Note that in both experiments presented in this paper, the free-field model can be considered valid as there are no major reflectors except for the road, which is too close to the sources to have a real influence in these configurations.

Figure 3. (Colour online) Scheme of the experiment configuration. The linear microphone array (red dot) is set up over a 3×1 lane street in Paris.

Figure 4. Configuration example with an $N_m$-microphone linear array and the i-th sound source moving rectilinearly at speed $V$ at $t = t_e$.

2.4. Moving vehicle classification

In classification tasks, the challenge is not only in selecting the best algorithm but also in finding the best data to use: here the audio descriptors. Valero et al. [15] provide a comparison of the classification accuracy of environmental noise obtained with 13 signal features and 4 different methods. They point out that the MFCCs (Mel Frequency Cepstral Coefficients) or the MPEG7 descriptors give good accuracy in noise source classification, especially for Gaussian Mixture Models (GMM), K-Nearest Neighbours (KNN) or Neural Networks (NN).

MFCCs are widely used in noise source classification. As pointed out by Giannoulis et al. [25], a major part of the research teams uses MFCCs as audio descriptors for classifying different sound scene signals. MFCCs, as presented by Davis and Mermelstein [26], result from different calculations over each signal frame, typically 25 ms long. First, the energy summation of the filtered spectrum over a triangular filter bank aligned along the mel scale (mimicking the cochlea) is taken. Then, the discrete cosine transform of the log of each energy is computed. For our purposes, MFCCs are calculated using the python_speech_features library².

During the DCASE 2013 challenge [25] the best results have been obtained by combining MFCCs with a Support Vector Machine (SVM). Introduced by Cortes and Vapnik [27], this method is widely used for supervised classification tasks. It aims at finding so-called hyper-planes that separate the samples of different classes with the maximum margin.

² Available at github.com/jameslyons/python_speech_features
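Coming back to the extraction step, the moving-source delay-and-sum of equation (3) can be sketched numerically as follows. This is a minimal single-source illustration, not the authors' implementation; all names are ours, and the convention sin θe = (x_m − x_s)/r (projection of the microphone direction onto the motion axis) is our assumption about the geometry of Figure 4:

```python
import numpy as np

C0 = 343.0  # speed of sound in air (m/s)

def extract_moving_source(p, fs, mic_pos, src_pos, src_speed, t_e):
    """Moving-source delay-and-sum of Eq. (3): for each emission time t_e,
    each channel is read at its own reception time t_r = t_e + r/c0 and
    weighted by the convective factor (1 - Ma sin(theta_e))^2; the 1/r
    energy compensation is discarded, so the output level stays the one
    recorded at the microphones.

    p: (Nm, Nt) microphone signals; fs: sampling rate (Hz);
    mic_pos: (Nm, 3) microphone positions; src_pos: (len(t_e), 3) source
    positions at the emission times; src_speed: (len(t_e),) source speed
    along x (m/s); t_e: emission times (s)."""
    n_m, n_t = p.shape
    t = np.arange(n_t) / fs
    s_hat = np.empty(len(t_e))
    for k, te in enumerate(t_e):
        r = np.linalg.norm(mic_pos - src_pos[k], axis=1)   # r_mi,e
        sin_th = (mic_pos[:, 0] - src_pos[k, 0]) / r       # assumed geometry
        ma = src_speed[k] / C0                             # Mach number
        t_r = te + r / C0                                  # reception times
        # read each channel at its own reception time (linear interpolation)
        p_r = np.array([np.interp(t_r[m], t, p[m]) for m in range(n_m)])
        s_hat[k] = np.mean((1.0 - ma * sin_th) ** 2 * p_r)
    return s_hat
```

For a static source the convective factor reduces to 1 and the routine degenerates into a plain delay-and-sum without amplitude compensation, which is a convenient sanity check.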


The hyper-planes are defined by a normal vector $w$, and the margin width equals $2/\|w\|$, so that minimising $\|w\|$ maximises the margin. The vector $w$ has as many components as the number of features used for classification (here the MFCCs). In order to allow errors and approximations in case the clusters are not linearly separable, slack variables $\zeta_i$, counting the number of errors, are added to relax the constraints on the learning vectors. Finding the hyper-planes reduces to solving

$$\hat{w} = \operatorname*{argmin}_{w,\zeta}\left(\frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i\right), \qquad (4)$$

with $C$ the parameter that determines the tradeoff between increasing the margin size and ensuring that the samples lie on the correct side of the margin. $C$ can be a vector or a scalar, providing respectively one value for each class or the same for all.

Note that convolutional neural networks have also been used in environmental sound classification over the past years. This type of algorithm seems interesting but, so far, it gives the same type of performance as SVM with an intensive computational cost (see e.g. [28, 29]).

In our case, the task consists in classifying road vehicles in terms of type (2-wheeler, light vehicle and heavy vehicle) and driving conditions (constant speed, acceleration and deceleration). Nine clusters are used; they are detailed in Table II.

Table II. Classification clusters.

               2-Wheeler   Light V.   Heavy V.
Const. Speed       1           4          7
Acceleration       2           5          8
Deceleration       3           6          9

3. Implementation

The monitoring method proposed in the previous part is first tested and validated with isolated vehicles on a test track. Then, its application in a real urban scenario is presented.

3.1. Controlled set of vehicles

This extraction method has been applied to the pass-by measurements listed in Table I for the various driving conditions. Note that the "traffic light simulation" recordings are split into 3 driving conditions: deceleration, idle (not classified) and acceleration. Finally, 70 pass-by signals constitute the classification database (called the test-track database in the rest of the article). They are distributed over the different clusters as presented in Figure 5.

Signal feature selection is important for obtaining the best classification results. MFCCs are computed every 100 ms with 52 filters (ranging from 0 to 25 kHz, half the sample rate) and 26 MFCCs are obtained. A simplified bag-of-frames approach [30] is used for representing the MFCCs' evolution over time: the time evolution is reduced to its mean and standard deviation, so that for each pass-by the extracted signal is represented by 52 audio features. The classification is trained by SVM over 88% of the signals and is tested on the remaining 12% (8 signals). The results are averaged over 35 different learning and testing signal combinations.

Figure 6 presents the confusion matrix for this classifier. It shows some small confusions, mostly within the classes of a vehicle type. Some confusions outside a vehicle type are also present. For example, the light vehicles in acceleration (cat. 5) are classified 10 percent of the time as heavy vehicles at constant speed (cat. 7) and 3.5 percent of the time as 2-wheelers at constant speed (cat. 1).

Using the MFCCs as descriptors and a relaxation parameter C of 5 and above, the score is 88% of correct classification, which is good with regard to the literature [15, 29, 28].

This result can be improved by adding more information to the dataset used by the SVM: the driving conditions. Indeed, in this controlled experiment they are well known. The learning and testing datasets are then composed of 55 elements: 52 audio features and 3 binary values, one for each driving condition. Note that each element of the dataset is normalised by its maximum among the 70 pass-by data.

Figure 5. Number of pass-by measurements in each cluster.

Figure 6. Confusion matrix in percentage of vehicles per category - Global classification accuracy: 88%.
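The feature pipeline of Section 3.1 and the SVM objective of equation (4) can be sketched end to end. This is a didactic NumPy illustration under our own names: a binary sub-gradient trainer stands in for the multi-class SVM actually used, and a per-sample C shows how a vector-valued relaxation parameter (one value per cluster, each sample taking its cluster's value) enters the objective:

```python
import numpy as np

def bag_of_frames(mfcc):
    """Bag-of-frames summary: the time evolution of the per-frame MFCCs
    (n_frames x 26 in the paper) is reduced to its mean and standard
    deviation, giving a fixed 52-element feature vector per pass-by."""
    return np.concatenate([mfcc.mean(axis=0), mfcc.std(axis=0)])

def train_linear_svm(X, y, C, epochs=300, lr=0.01):
    """Batch sub-gradient descent on the primal objective of Eq. (4):
    1/2 w.w + sum_i C_i * max(0, 1 - y_i (w.x_i + b)).
    X: (n, d) features; y: (n,) labels in {-1, +1};
    C: scalar or one value per sample."""
    n, d = X.shape
    C = np.broadcast_to(np.asarray(C, dtype=float), (n,))
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0              # margin violators
        grad_w = w - (C[viol] * y[viol]) @ X[viol]
        grad_b = -np.sum(C[viol] * y[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

The nine-cluster problem of Table II would wrap such a binary machine in a one-vs-one or one-vs-rest scheme; an off-the-shelf SVM solver with per-class weights achieves the same effect as the per-sample C shown here.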


Table III. Number of pass-by measurements by category for the training and testing datasets. The training dataset is based on test-track and in situ measurements.

Dataset              1    2    3    4    5    6    7    8    9   Sum
Train, test-track    8    9    5   18   12    7    5    4    2    70
Train, in situ      32    6    0   26    0   60   39    7    3   173
Test, in situ       10    5    1   68   14   10    2    4    1   115

Figure 7 presents the confusion matrix for this classifier; the global classification accuracy rises to 99%. It represents the percentage of predicted vehicle categories with reference to the real (expected) ones.

Only the heavy vehicle in deceleration (cluster 9) is misclassified, being considered as a light vehicle in acceleration 67% of the time or in deceleration 33% of the time. This could be explained by the nature of the vehicle: a large utility vehicle powered by a car-like engine, not a proper truck one. Note that the tuning of the C parameter allows some samples to be misclassified. The counterpart of having a good fit of the SVM on the overall data is that the heavy vehicle in deceleration is misclassified as acceleration 67% of the time.

Figure 7. Confusion matrix in percentage of vehicles per category - Driving conditions added to the dataset - Global classification accuracy: 99%.

3.2. Urban sound scene

From the in situ experiment presented in Section 2.2.3, six recordings are available, distributed from 11:50 to 15:15. Three of them (11:50, 15:00 and 15:15) have been manually tagged in order to quantify the classification accuracy. This provides respectively 210, 246 and 83 vehicle trajectories (only the first 200 seconds have been tagged for the 15:15 recording) and forms what will be called the in situ database in the following.

The first classification tests have shown that the training dataset has to be enhanced by adding data from the in situ database to the test-track database. The distribution between the training and testing datasets is presented in Table III. Note that for some clusters the number of samples is very low. This is mainly due to the tracking method, which is not robust enough, but also to the low number of heavy vehicles passing during the measurements. For the third category, the two-wheeler trajectories were usually lost when they were overtaking another vehicle at idle. Note that the 52 audio features are normalised by the same values (the maxima of the test-track database) and that the driving condition is deduced from the vehicle trajectory.

As discussed in Section 2.4, the C parameter can be a scalar (same value for all clusters) or a vector (one value per cluster). After a parametric study, the best classification results give 82% of accurate classification. They are obtained for

C = [6, 0.6, 0.6, 1.5, 1.5, 0.3, 6.6, 2.4, 3.3].

For more details, the confusion matrix is given in Figure 8. The good results are mainly due to the good classification of the light vehicles: from 70 to 93% of accuracy. The other clusters are often confused with the equivalent driving-condition clusters of the light vehicles. We can also see that the vehicles of clusters 2, 3 and 9 are always misclassified. This can be explained by the lack of data for both learning and testing in those clusters. We can finally state that heavy vehicles in acceleration (cluster 8) are confused half of the time with light vehicles in acceleration (cluster 5) and 25% of the time with light vehicles at constant speed.

This in-situ classification test is promising and the performances seem coherent with the literature [15, 16, 28, 29]. This work confirms that the audio signal includes the clues to identify the vehicle types and driving conditions, as humans can do.

Figure 8. Confusion matrix in percentage of vehicles per cluster - Global classification accuracy: 82%.

3.3. Urban sound scene - Perceptual categories

A study carried out by Morel et al. [31] proposes a multi-criterion typology of road traffic pass-by noises. During a free clustering task, they found that the subjects were


Table IV.Morel et al. perceptual clusters. 100 2-Wheeler Light V. Heavy V. 1 90 0 10 0 0 0 0 Const. Speed 133 2 0 40 0 0 0 60 0 80 Acceleration 267 Deceleration 455 3 4.3 0 90 0 0 4.3 1.4 60

category 4 0 0 100 0 0 0 0 grouping pass-by noises into clusters that can mainly be 40 5 0 0 36.4 0 63.6 0 0 explained using twocriteria: the vehicle type and the ve-

Expected hicle driving condition. Theywere also interested in the 6 0 7.1 0 0 0 92.9 0 20 influence of athird criterion, the road morphology,but it revealed not to be significant in the clustering process. 7 0 0 25 0 0 25 50 0 During the free clustering task, the subjects were gather- 1 2 3 4 5 6 7 ing the light and heavy vehicles in both constant speed and Predicted category deceleration. As aresult, their proposed perceptual clus- tering of the road trafficisexplained by sevenclusters de- Figure 9. Confusion matrix in percentage of vehicle per percep- tailed in Table IV. tive cluster -Global classification accuracy: 84 %. We propose here to modify the previous classification method to use Morel et al. perceptual clusters on the same training and testing sets than those used in Section 3.2. Cat. 1 Cat. 3 Cat. 5 Cat. 7 After aparametric study,the best success rate is found to Cat. 2 Cat. 4 Cat. 6 be 84 %with the relaxation parameters 11:50 C = [2.38, 1.68, 0.42, 0.9, 0.14, 0.14, 0.53]. 16 5 96 19 61 7 12:27 114 74 20 49 7 The confusion matrix is detailed in Figure 9. It shows some 13:25 good results, especially for clusters 1, 3and 6: the classi- 20 7 82 17 48 6 fication rate for these clusters are equivalent to those ob- 14:45 tained in Section 3.2 (previously labelled category 1, 4and 8 2 66 15 43 6 15:00 5respectively). We can see that the two-wheelers in decel- 8 2 33 4 25 1 eration (category 4) are still classified as alight (orheavy) 15:15 vehicles in constant speed (category 3).Wecan also notice 8 3 50 15 30 7 improvements: two-wheelers in acceleration are nowwell- 0 25 50 75 100 125 150 175 200 classified 40 %ofthe time versus 0% previously and the classification accuracyofheavy vehicles in acceleration Figure 10. Number of vehicles over the different ten minute ac- (cluster 7) rises from 25 to 50 %. Nevertheless, classifica- quisitions by perceptual clusters. Total number of detected vehi- tion rate decreased from 70 to 64 %for cluster 5because cles: 539. 
it includes the heavy vehicles decelerating that are still classified as passing by at constant speed.

It is interesting to notice that the classification accuracy rises when we use those perceptual categories. This is not only because of the category merging (which can decrease the error), but also because the classification of some categories, such as the powered two-wheelers, improves.

4. Monitoring over daytime

The classification method is applied to all the ten minute recordings acquired in one day at the Saint-Bernard Quay. Figure 10 shows the number of detected vehicles during the six measurements available, with the cluster estimated by the method. The data presented in this Section are labelled by their starting acquisition time. The tracking method allowed the detection of 539 vehicles during the acquisition time.

First, we can see that the number of detected vehicles decreases from midday to 15:00, then rises at 15:15. We can suppose that it continues rising later in the afternoon. Also, as in Section 3.3, we can see that the 2-wheelers in deceleration (cluster 4) are never detected because of the camera position. We can also see that there is no major evolution of the distribution over the clusters, except at 15:00 when clusters 5 and 7 (light and heavy vehicles in deceleration) are less detected. We can finally notice that the road traffic is highly dominated by clusters 3 (light and heavy vehicles at constant speed) and 6 (light vehicles in acceleration), but not by many vehicles in deceleration (categories 4 and 5). This can be explained by the street configuration (see Figure 3): three lanes located after a traffic light and only one before.

Figure 11 shows the evolution of the distribution of the equivalent sound level by perceptual cluster with boxplots. The sound level is calculated over the pass-by duration: one second (for the fastest vehicles) and five seconds for the accelerations and decelerations. Note that cluster 4 is never detected and therefore not represented. We can notice a tendency: the lowest noise level values are emitted by the light and heavy vehicles decelerating (category 5). This is not the case for the 15:00 ten minute recording. When analysing the distribution precisely, it seems to be bi-modal with two sets of values centred on 61 dB and 78 dB. This bi-modality seems to be only due to the integration time and thus the type of deceleration: short and stopping quickly (high equivalent level) or long and idling in front of the array (low equivalent level).

It seems also that the higher noise levels can be attributed to the 2-wheelers passing by at constant speed (cluster 1) and heavy vehicles accelerating (cluster 7). For this last cluster, at 15:15, we have a large variation range due to two sets of values centred on 64 dB and 87 dB. It is interesting to note that this cluster has quite a constant number of vehicles during the day (between 6 and 7, except for the 15:00 experiment) but with an important variation of noise levels.

Moreover, we can notice that the variation range for each cluster is small until 13:25, reflecting a homogeneous traffic; but afterwards, we have noticed large variations in the noise levels and an important number of outliers. This can reflect the increasing diversity of the road traffic for those periods (different types of heavy vehicles or different numbers of cylinders for the 2-wheelers).

With these results, we can see that the analysis by perceptual category doesn't directly imply a reduced variability of noise level. Thus, we can see that both the traffic segmentation thanks to the classification task and the associated noise level are complementary in an urban noise monitoring system.
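The per-pass-by level metric used above (an equivalent level integrated over the one-to-five-second extract) can be sketched as follows. This is a minimal illustration of the standard Leq formula, not the authors' code; the function name and the 1 kHz test tone are ours:

```python
import numpy as np

def equivalent_level_db(pressure, p_ref=20e-6):
    """Equivalent sound pressure level (Leq) of a pass-by extract,
    in dB re 20 uPa, integrated over the whole extract."""
    mean_square = np.mean(np.asarray(pressure) ** 2)
    return 10.0 * np.log10(mean_square / p_ref ** 2)

# Example: one second of a 1 kHz tone with RMS pressure 1 Pa,
# which corresponds to about 94 dB SPL.
fs = 50_000
t = np.arange(fs) / fs
tone = np.sqrt(2) * np.sin(2 * np.pi * 1000 * t)  # RMS = 1 Pa
level = equivalent_level_db(tone)
```

Because the integration time differs between driving conditions (one second at constant speed, five seconds for accelerations and decelerations), the same deceleration event can yield very different equivalent levels, which is the bi-modality discussed above.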

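The classification stage applied throughout this monitoring (an SVM over MFCC features with one relaxation parameter per perceptual cluster) can be sketched along the following lines. This is a hypothetical reconstruction with scikit-learn and synthetic stand-in features, not the authors' implementation; mapping the per-cluster relaxation parameters onto `class_weight` (scikit-learn multiplies C by the class weight) is our assumption:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the MFCC feature vectors (13 coefficients
# summarised per pass-by extract), for the 7 perceptual clusters.
n_per_class, n_mfcc, n_classes = 40, 13, 7
X = np.concatenate([
    rng.normal(loc=3.0 * k, scale=1.0, size=(n_per_class, n_mfcc))
    for k in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)

# Per-cluster relaxation parameters as reported in the text; the
# effective C for class k becomes C * class_weight[k].
relax = [2.38, 1.68, 0.42, 0.9, 0.14, 0.14, 0.53]
clf = SVC(kernel="rbf", C=1.0,
          class_weight={k: w for k, w in enumerate(relax)})
clf.fit(X, y)
acc = clf.score(X, y)
```

The per-class weighting lets the under-represented clusters (e.g. decelerating heavy vehicles) trade margin violations differently from the dominant constant-speed clusters.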
5. Conclusion

Figure 11. Boxplots of noise levels over the different ten minute acquisitions (labelled by their starting acquisition time) by perceptual clusters. Note that cluster 4 is never detected and not represented. The limits of the rectangles represent the first and third quartiles so that 50% of the data is included in this range. The red triangle symbolises the mean value, the black vertical bar the median value and the black circles the outliers.

In this study, we have been interested in proving the feasibility of monitoring the urban road traffic from the vehicle radiated sound field (thermal vehicles only). In order to do so, large arrays of microphones have been implemented to spatially filter a sound scene. This was achieved thanks to a dedicated beamforming algorithm coupled with a video tracking method, able to extract the audio signal of each passing-by vehicle. Appendix A1 presents the method validation with simulations and an isolated vehicle experiment. The in situ spatial filtering gain is also investigated.

Once the signal is extracted, a classification step has been designed with Support Vector Machines using MFCCs as audio features. As learning samples, MFCCs were completed by the driving condition information based on the video tracking algorithm. It led to 99% of accurate classification on isolated vehicles.

Then, the application to a real urban sound scene has been presented with 539 detected vehicles. A maximum of 82% of accurate classification has been pointed out. This was mainly due to the lack of data for the 2-wheelers in acceleration and deceleration and for the heavy vehicles in deceleration. To reach this performance, the learning database is based on the isolated vehicle database but also on manually tagged pass-bys in the urban in situ measurements. An adaptation is finally proposed to classify over perceptual clusters. It allowed the accurate classification rate to increase to 84%.

Finally, an application to six available ten minute recordings has been presented. It allows the analysis of the noise level for each perceptual category over the day. It has mainly pointed out that the noisiest vehicles – for this measurement place (St-Bernard Quay in Paris city centre) – were the 2-wheelers at constant speed and the heavy vehicles in acceleration. The least noisy category was found to be almost always the light and heavy vehicles in deceleration.

This method appears to give a good knowledge of the road traffic composition. Nevertheless, it could be improved by adding samples to the training and testing datasets, especially decelerating two-wheelers and heavy vehicles signals, but also two-wheelers in acceleration. Note that the microphone array is easy to use and can be adapted to all urban situations (e.g. by attaching it to balconies). Even though the current one – in St-Bernard Quay – is very challenging, the results are already satisfying.

Subsequently, some improvements could be investigated. The video tracking step could be improved to raise the number of detected vehicles. It could be done using a remote camera with a better field of view or by coupling the video tracking with a tracking system on an acoustic image. In addition, in the presence of leaves, this method could be modified by doing the tracking step over an acoustic image rather than on the video. The performance should not be really degraded as the effect of the leaves should be at very high frequency.

Figure A1. Spectrogram of simulated and beamformed signal for a source emitting a 2 kHz pure tone - V = 50 km/h - Dynamics = 10 dB. (a) Simulated signal, (b) Beamformed signal.

The array geometry of the in situ experiment could also be improved in order to allow a better source separation

between the traffic lanes. Furthermore, as we have extracted the audio signal of each vehicle, different kinds of metrics could be computed to better assess short-term noise annoyance in urban environments, such as loudness or annoyance itself thanks to different models [32, 33]. There is no clear consensus in the literature on the link between audio signal and long-term annoyance (as used by WHO in the estimation of DALYs), but we assume that the short-term annoyance would probably explain a larger part of the variance of the long-term annoyance than any metric derived from the instantaneous or equivalent sound level.

Appendix

A1. Beamforming validations

We propose here some validation cases for the beamforming formalism on moving sources. The propagation and beamforming model is first tested on simulated data. An adaptation is then proposed to extract the audio signal without the energy compensation so that the output signal sound level is the one recorded by the microphones. This beamforming model is then validated on a similar experiment. Finally, the extraction performance is investigated.

A1.1. Model accuracy

Based on the propagation and beamforming models presented in Section 2.3, a simulation with a source moving at 50 km/h is proposed. It is done with a monopolar source emitting a 2 kHz pure tone with an amplitude of 1 Pa (90.9 dB SPL). The simulation lasts 3 s, allowing the source to travel 41.6 m. The signals recorded by a linear microphone array in the configuration presented in Figure 4 (with h = 16.5 m) are simulated. The array is the same as the one used for the Saint-Bernard Quay experiment (presented in Section 2.2.3): it is 21.6 meters long with 128 microphones regularly spaced 17 cm apart. The simulation is done at a sample rate of Fs = 50 kHz and computed in the time domain, rounding the reception time tr to the closest time sample of each microphone signal, so that no interpolation is introduced.

Figure A1a shows the spectrogram of the central microphone simulated signal, where the Doppler effect is visible. Moreover, the amplitude varies from 62 to 66 dB while the source approaches the microphone.

Figure A1b shows the spectrogram of the beamformed signal. We can see that the beamforming makes it possible to recover the initial emitted sound field, meaning that the amplitude remains constant at 90.9 dB with the frequency shift compensated. This simulation shows a good accuracy of the beamforming in estimating the radiated sound field of a single moving source.

However, we will be interested in removing the energy compensation so that the output signal sound level is the one recorded by the microphones. Indeed, by doing so, our results are more comparable to the noise maps (that provide Lden in facade) and to city dwellers' feeling. This can be done by removing the distance factor r_mi,e in Equation (2), so that the extracted signal is obtained in Equation (3).

Figure A2. Spectrogram of beamformed signal realised with Equation (3), for a source emitting a 2 kHz pure tone - V = 50 km/h - Dynamics = 10 dB.

Figure A2 shows the spectrogram of the resulting beamformed signal for the simulation presented before. By comparing with the spectrogram of the initial signal (Figure A1a), we can see that the evolution of the energy is conserved (from 63 to 66 dB while the vehicle approaches).
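The simulation procedure described in A1.1 — a moving monopole synthesised in the time domain with reception times rounded to the closest sample, then focused by delay-and-sum with energy compensation — can be sketched as follows. This is our own minimal illustration, scaled down (32 microphones, Fs = 20 kHz, 1 s trajectory) to keep it light, not the authors' implementation:

```python
import numpy as np

C0 = 343.0      # speed of sound (m/s)
FS = 20_000     # sample rate (Hz); the paper uses 50 kHz
V = 50 / 3.6    # source speed (m/s)
H = 16.5        # distance between array and trajectory (m)
DUR = 1.0       # simulated duration (s); the paper uses 3 s

# Scaled-down linear array (the paper uses 128 mics at 17 cm pitch).
xm = (np.arange(32) - 15.5) * 0.17
mics = np.stack([xm, np.zeros_like(xm)], axis=1)

n = np.arange(int(DUR * FS))
t_e = n / FS                                    # emission times
src = np.stack([V * (t_e - DUR / 2), np.full_like(t_e, H)], axis=1)
s = np.sin(2 * np.pi * 2000 * t_e)              # 2 kHz pure tone, 1 Pa

# Forward model: monopole propagation with the reception time
# t_r = t_e + r/c rounded to the closest sample (no interpolation).
p = np.zeros((len(mics), int((DUR + 0.1) * FS)))
for m, pos in enumerate(mics):
    r = np.linalg.norm(src - pos, axis=1)
    idx = np.rint((t_e + r / C0) * FS).astype(int)
    np.add.at(p[m], idx, s / r)                 # 1/r spherical spreading

# Delay-and-sum beamforming focused on the (known) source trajectory,
# re-applying the r factor (energy compensation).
y = np.zeros(len(t_e))
for m, pos in enumerate(mics):
    r = np.linalg.norm(src - pos, axis=1)
    idx = np.rint((t_e + r / C0) * FS).astype(int)
    y += r * p[m, idx]
y /= len(mics)
# y now approximates the de-Dopplerised source signal s.
```

Dropping the `r *` factor in the second loop corresponds to the variant without energy compensation, whose output level is the one recorded at the microphones rather than the level at the source.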


Figure A4. Power spectral density of the central microphone (in blue) and beamformed signal for the 1 kHz pure tone. Distance between source and central microphone: 26.7 m, V = 0 km/h.

A1.2. Validation on experimental data

The set-up and the beamforming method are now validated on a pass-by measurement realised during the test-track experiment presented in Section 2.2.2. To do so, a loudspeaker emitting a 2 kHz pure tone was set up on a car passing by at 20 km/h. Figure 1a has shown the tracking step for this experiment. As we can see in this figure, the size of the moving object is over-estimated at the edge of the frame because it takes the shadows into account.

Figure A3. Spectrograms of recorded and beamformed signals for a source emitting a 2 kHz pure tone - V = 20 km/h - Dynamics = 50 dB. (a) Recorded signal, (b) Beamformed signal.

Figure A3 shows the spectrograms of the recorded and beamformed signals referenced to their maximum. In Figure A3a, we can notice both the frequency shift for the loudspeaker signal and the broadband noise produced by the tire/road contact.

The beamformed signal spectrogram is presented in Figure A3b. The de-Dopplerization seems not to be perfect in the first and last seconds of the signal. This is due to the mis-positioning of the source when it enters and leaves the camera frame. Indeed, the beamforming is done on the centroid of the rectangle including the moving object, so that we focus on the shadows and the engine when the vehicle enters and on the exhaust pipe and the shadows when it leaves the frame. This involves a wrong speed estimation if the vehicle is not entirely in the camera frame. But when it is, we can see that the signal is well de-Dopplerised and the background noise is reduced, proving the good estimation of the vehicle position with respect to the microphone array.

A1.3. Extraction performances

The performance of this technique in filtering a sound scene is now investigated. For this purpose, during the in situ experiment (with the 128 microphone array), a loudspeaker was set up at different places emitting a 1 kHz pure tone.

Figure A4 shows the power spectral density of the central microphone and the one of the beamformed signal while the source is placed at 26.7 m from the array centre. We can see that initially the loudspeaker is barely audible because of the energy of the other sound sources (road traffic). But thanks to the array dimension (21.6 m long) and beamforming, we can see (orange curve) that the background noise is reduced by 20 dB around 1 kHz. This gain reduces when the frequency decreases, reflecting the fact that the array resolution is frequency dependent, such that at very low frequency (under 50 Hz) the technique seems not to provide any filtering gain.

Acknowledgements

The authors wish to thank: Dominique BUSQUET (Sorbonne University), Jean-Christophe CHAMARD (PSA), Hélène MOINGEON (Sorbonne University), Christian OLLIVON (Sorbonne University) and Vincent ROUSSARIE (PSA).

This research benefits from the support of the chair "Mobilité et qualité de vie en milieu urbain" (Mobility and life quality in urban areas), carried by the Sorbonne University Foundation from 2014 to 2017 and supported by sponsors (PSA Peugeot-Citroën and Renault).

References

[1] N. Misdariis, R. Marchiano, P. Susini, F. Ollivier, R. Leiba, J. Marchal: Mobility and life quality relationships – measurement and perception of noise in urban context. INTER-NOISE, Melbourne, 2014.

[2] W. Babisch: Traffic noise and cardiovascular disease: Epidemiological review and synthesis. Noise and Health 2 (2000) 9–32.

[3] D. Vienneau, C. Schindler, L. Perez, N. Probst-Hensch, M. Röösli: The relationship between transportation noise exposure and ischemic heart disease: A meta-analysis. Environmental Research 138 (Apr 2015) 372–380.

[4] W. Babisch: Updated exposure-response relationship between road traffic noise and coronary heart diseases: a meta-analysis. Noise and Health 16 (2014) 1–9.

[5] L. Fritschi, A. L. Brown, R. Kim, D. Schwela, S. Kephalopoulos (eds.): Burden of disease from environmental noise. Quantification of healthy life years lost in Europe. WHO Regional Office for Europe, 2011.


[6] Directive 2002/49/EC relating to the assessment and management of environmental noise. European Parliament & Council of the European Union, 2002.

[7] B. Berglund, T. Lindvall, D. H. Schwela: Guidelines for community noise. World Health Organization, 1999.

[8] Night noise guidelines for Europe. World Health Organization, 2009.

[9] X. Sevillano, J. C. Socoró, F. Alías, P. Bellucci, L. Peruzzi, S. Radaelli, P. Coppi, L. Nencini, A. Cerniglia, A. Bisceglie, et al.: DYNAMAP – Development of low cost sensors networks for real time noise mapping. Noise Mapping 3 (2016).

[10] J. Vos: Noise annoyance caused by mopeds and other traffic sources. INTER-NOISE, 2006.

[11] L.-A. Gille, C. Marquis-Favre, A. Klein: Noise annoyance due to urban road traffic with powered-two-wheelers: Quiet periods, order and number of vehicles. Acta Acustica united with Acustica 102 (May 2016) 474–487.

[12] Commission Directive (EU) 2015/996 of 19 May 2015 establishing common noise assessment methods according to Directive 2002/49/EC of the European Parliament and of the Council.

[13] C. Marquis-Favre, E. Premat, D. Aubrée: Noise and its effects – A review on qualitative aspects of sound. Part II: Noise and annoyance. Acta Acustica united with Acustica 91 (2005).

[14] A. Rabaoui, M. Davy, S. Rossignol, N. Ellouze: Using one-class SVMs and wavelets for audio surveillance. IEEE Transactions on Information Forensics and Security 3 (Dec 2008).

[15] X. Valero, F. Alías: Hierarchical classification of environmental noise sources considering the acoustic signature of vehicle pass-bys. Archives of Acoustics 37 (Jan 2012).

[16] J. C. Socoró, F. Alías, R. M. Alsina-Pagès: An anomalous noise events detector for dynamic road traffic noise mapping in real-life urban and suburban environments. Sensors 17(10), 2323 (2017).

[17] P. Maijala, Z. Shuyang, T. Heittola, T. Virtanen: Environmental noise monitoring using source classification in sensors. Applied Acoustics 129 (2018) 258–267.

[18] J.-R. Gloaguen, A. Can, M. Lagrange, J.-F. Petiot: Road traffic sound level estimation from realistic urban sound mixtures by non-negative matrix factorization. Applied Acoustics (2019) 229–238.

[19] E. Weinstein, K. Steele, A. Agarwal, J. Glass: LOUD: A 1020-node microphone array and acoustic beamformer. ICSV14, Cairns, Australia, 9–12 July 2007.

[20] I. Hafizovic, C.-I. C. Nilsen, M. Kjølerbakken, V. Jahr: Design and implementation of a MEMS microphone array system for real-time speech acquisition. Applied Acoustics 73 (Feb 2012) 132–143.

[21] C. Vanwynsberghe, R. Marchiano, F. Ollivier, P. Challande, H. Moingeon, J. Marchal: Design and implementation of a multi-octave-band audio camera for realtime diagnosis. Applied Acoustics 89 (Mar 2015) 281–287.

[22] R. Leiba, F. Ollivier, R. Marchiano, N. Misdariis, J. Marchal: Urban acoustic imaging: from measurement to the soundscape perception evaluation. INTER-NOISE, 2016.

[23] P. M. Morse, K. U. Ingard: Theoretical acoustics. McGraw-Hill, 1968.

[24] R. Cousson, Q. Leclère, M.-A. Pallas, M. Bérengier: Identification of acoustic moving sources using a time-domain method. Berlin Beamforming Conference, March 2018.

[25] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, M. D. Plumbley: Detection and classification of acoustic scenes and events: An IEEE AASP challenge. 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2013, 1–4.

[26] S. Davis, P. Mermelstein: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (Aug 1980) 357–366.

[27] C. Cortes, V. Vapnik: Support-vector networks. Machine Learning 20 (Sep 1995) 273–297.

[28] J. Salamon, J. P. Bello: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 2016.

[29] K. J. Piczak: Environmental sound classification with convolutional neural networks. 2015 IEEE International Workshop on Machine Learning for Signal Processing, Boston, USA, Sep 2015.

[30] J.-J. Aucouturier, B. Defreville, F. Pachet: The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. J. Acoust. Soc. Am. 122 (2007) 881.

[31] J. Morel, C. Marquis-Favre, D. Dubois, M. Pierrette: Road traffic in urban areas: A perceptual and cognitive typology of pass-by noises. Acta Acustica united with Acustica 98 (Jan 2012) 166–178.

[32] J. Morel, C. Marquis-Favre, L.-A. Gille: Noise annoyance assessment of various urban road vehicle pass-by noises in isolation and combined with industrial noise: A laboratory study. Applied Acoustics 101 (Jan 2016) 47–57.

[33] A. Klein, C. Marquis-Favre, R. Weber, A. Trollé: Spectral and modulation indices for annoyance-relevant features of urban road single-vehicle pass-by noises. J. Acoust. Soc. Am. 137 (Mar 2015) 1238–1250.
