ARCHIVES OF ACOUSTICS Copyright c 2017 by PAN – IPPT Vol. 42, No. 2, pp. 255–261 (2017) DOI: 10.1515/aoa-2017-0028

Impulsive Sound Detection Directly in Sigma-Delta Domain

Igor D. dos S. MIRANDA, Antonio C. de C. LIMA Department of Industrial Engineering Postgraduate Federal University of Bahia 2 Aristides Novis Street, 6th Floor, Federa¸c˜ao, Salvador, Bahia 40.210630, Brazil email: [email protected], [email protected]

(received September 26, 2016; accepted February 1, 2017 )

Recent implementations of SigmaDelta (Σ∆) converters have achieved low cost, low power consump tion, and high integration while maintaining resolution as high as in Nyquistrate converters. However, its usage implies demodulating the source signal delivered from Σ∆ to PulseCode Modulation (PCM) on a preprocessing stage. This work proposes an algorithm based on Discrete Cosine Transform for impulsive signal detection to be applied directly on a modulated Σ∆ bitstream, targeting to reduce computational cost in acoustic event detection applications such as gunshot recognition systems. From prerecorded impulsive sounds in Σ∆ format, it has been shown that the new method presents a simi lar error rate in comparison with traditional energybased approaches in PCM, meanwhile, it reduces significantly the number of operations per unit time. Keywords: impulsive signal detection; sigmadelta modulation; discrete cosine transform.

1. Introduction reaches high power values, it may indicate abnormal events in several environments and human tasks, such Impulsive sound signals is a category of audio sig as building collapses, gunfires, and explosions in fac nals characterised by an abrupt rise in amplitude, im tories. Systems to automatically detect these signals mediately followed by an underdamped motion or ex may be useful for monitoring emergency situations. ponential decay, as shown in Fig. 1. Claps, glass break Impulsive signal detection has been addressed by ing, hamming, and door slams are examples of every several works based on a variety of approaches, in day impulsive sound sources. When an impulsive signal cluding power (or amplitude) analysis, Wavelet Trans form, correlation, and statistical methods (Chacón Rodr´ıguez et al., 2011; Kauppinen, 2002). Tech niques that rely on power assessment have presented lower computational cost at an acceptable performance as compared to statistical approaches. These powerbased methods are typically imple mented by means of a power estimator followed by an onset detector, setting thresholds to activate the detector when limits for power and power slope are reached simultaneously (ChacónRodr´ıguez et al., 2011; Dufaux, 2001). Better results were obtained by implementing adaptive thresholds (Dufaux, 2001; Sharkey et al., 1996; Showen, Dunham, 1999). Onset may be found through derivatives, median fil ters, and other methods applied to the power signal (Dufaux, 2001; Kauppinen, 2002). In general, impulse detection systems that ad Fig. 1. Impulsive sound signal waveform of 0.40 semi dress explosion and impact monitoring in large ar automatic pistol gunshot sampled at a 8 kHz rate. eas are composed of a central processing unit (CPU) 256 Archives of Acoustics – Volume 42, Number 2, 2017

– a powerful serverlike machine – and an array of been shown that the proposed technique, which is acoustic sensors nodes, whose typical architecture con based on DCT, reduces computational cost without in tains a microphone, a microcontroller unit (MCU), and creasing the error rate when compared to the reference a transceiver (Wessels, Basten, 2016). When an im architecture described in Sec. 2. The final goal is to pulsive signal is detected in an acoustic sensor, the enable sensor nodes to be implemented using MEMS audio is transmitted to the CPU where a pattern clas microphones with builtin Σ∆ converters, taking all sifier is run. If an emergency is detected, signals from of their advantages as well as relaxing MCU require different sensors are used to triangulate the source. ments. Despite the fact that CPU performs the most com putationally demanding tasks and runs important de cision algorithms, sensor nodes have bigger impact 2. Reference architecture for impulse detection on system performance and cost. Sensors are spread in Σ∆ signals throughout the monitored area such that there are in tersection regions to allow triangulation. The overall Throughout this work, the system depicted in system cost is then directly linked to the trade off be Fig. 3 has been chosen as the reference architecture for tween microphone quality, impulse detection efficiency, impulse detection in Σ∆ modulated signals. The mod nodes separation, and MCU cost. ules presented in Fig. 3 are described in this section Categories of converters and microphones that and comprise three processing stages: a Σ∆ demod present features compatible with those mentioned for ulator, power estimator, and abrupt change detector. sensor nodes are Σ∆ ADCs and microelectrome chanical systems (MEMS) microphones, respectively. Both have been used in a broad range of applications due to their low cost, high quality, and high integra tion. Currently, semiconductor companies have man ufactured MEMS microphone, preamplifier, and Σ∆ ADC in a single integrated circuit. Σ∆ modulation, also known as PulseDensity Mod Fig. 3. Reference architecture for this work. ulation (PDM), represents analog amplitude levels as a density of bits in a bitstream. Modulation process can be modelled as an oversampled ADC (at MHz The Σ∆ demodulator has been implemented using range), singlebit quantiser and error feedback loop. the CIC filter structure shown in Fig. 2. The number of This feedback implements the noise shaping technique integrators and comb pairs must be chosen according that pushes the quantisation noise, generated by the to demodulator requirements and signal characteris low resolution, to higher frequencies and leaves the tics, such as ratio, noise shape, output baseband almost intact. Therefore, conversion from bits, and phase linearity. Fourth order CIC filters are Σ∆ to PCM can be performed by means of a low pass suitable for most applications of Σ∆ demodulator. filter followed by a downsampler. Since impulsive sounds are generally aperiodic, sig An efficient method to filter and decimate with low nal power has been estimated with successive compu computational cost, often used in Σ∆ demodulation, is tation of time average of the energy within LP sample the CascadedIntegrator Comb (CIC) filter. CIC deci windows, as given by: mators are implemented only with adders, decreasing L −1 the total number of operations substantially (Park, 1 P E [k]= x2[kL + n]. (1) 1991). However, their passbands are not flat, requir avg L P P n=0 ing an additional compensation FIR stage. Figure 2 illustrates a CIC decimator structure. The abrupt change detector has been implemented based on the Conditional Median Filter (CMF), in troduced by Kasparis et al. (1992). Firstly, in CMF, a median filter is applied to the power estimate se quence delivered by the previous stage, as in the ex pression below:

MF[k] = median Eavg[i] i = k LM 1, ..., k . (2) Fig. 2. CIC decimator structure for converting from Σ∆ to { | − − } PCM. Then, a threshold is applied to the median filter output, targeting to recover the power sequence but In this work, we propose an algorithm to detect removing impulsive noise by detecting abrupt transi impulsive sound signals directly in Σ∆ format. It has tions, as follows: I.D. dos S. Miranda, A.C. de C. Lima – Impulsive Sound Detection Directly in SigmaDelta Domain 257

MF[k], if MF[k] Eavg[k d] to Parseval theorem, this abrupt energy transition in | − − | time domain must be reflected by an abrupt energy CMF[k]= < Rth, (3) transition in frequency domain, if Fourier transforms Eavg[k d], otherwise, − are taken immediately before and after the impulse  where d is d= (Lm 1)/2 and Rth is a threshold for onset. abrupt transitions. In− case of adaptive threshold, this In order to detect impulses in a Σ∆ signal, only result might be used to estimate long term background the energy in a narrow frequency range should be ob noise power. For impulse detection, a binary version of served due to its high frequency quantisation noise. CMF has been implemented as follows: For example, most gunshots have their energy con centrated below 2 kHz, with peaks ranging from 500

0, if MF[k] Eavg[k d] < Rth, to 600 Hz (Graves, 2012; Millet, Baligand, 2006). BCMF[k]= | − − | (4) This represents less than 1% of the spectrum in a Σ∆ 1, otherwise. signal sampled at 512 kHz, making frequency selection Therefore the reference architecture has been im a hard task to be performed in discrete time domain, plemented using (1), (2), and (4). In these equations, as discussed in Sec. 1. In frequency domain, a straight LP and LM determine the maximum impulse width forward strategy would be to use a Discrete Fourier detectable by the system, since the median filter only Transform (DFT). Calculating a complete DFT for removes impulses with duration shorter than LM /2. a wideband signal, with interest in only a few terms, Rth must be chosen according to the average power of is a very complex and wasteful operation. both impulses and background noise. Simpler and more efficient mathematical tools to A simulation example for the reference architecture represent the energy content of a signal are the Discrete presented in this section is shown in Fig. 4. The gener Cosine Transform (DCT) and Discrete Sine Transform ated input signal has two impulsive events and a sinu (DST) (Ahmed et al., 1974; Britanak et al., 2010; soid fragment between them (Fig. 4a). Note in Fig. 4b Rao, Yip, 2014). Both transformations create a modi that although the impulses and the sinusoid have their fied magnitude/frequency signal representation by cor power represented by the power estimator, impulses relating it, over time, with cosine and sine functions, are removed from the median filter’s output. Due to respectively. DCT/DST has 16 different implementa this, only two impulses are detected on the threshold tions, each one considering special boundary condi module described by (4). tions on the finite input signal to form a periodic and symmetric sequence (Britanak et al., 2010). a) The DCT of type II (DCTII) is the most used among DCTs and DSTs, mainly in audio and image ap plications due to its energy compaction property that allows signals to be represented using fewer compo nents, if compared to DFT (Khayam, 2003). For an N point sequence x[n], the direct form of DCTII is defined as (Oppenheim et al., 1989):

− b) N 1 πk(2n + 1) Xc2[k]=2 x[n]cos , 2N n=0 (5) 0 k N 1. ≤ ≤ −

The DCTII transform produces coefficients that are related to 2Npoint DFT of a x2[m] sequence, Fig. 4. Application example of the reference architecture: formed from zero padding x[n]. Being X2[k] the DFT of c2 a) simulated input signal with two impulses and a sinusoid x2[m], X [k] can be expressed as (Oppenheim et al., in between; b) outputs of power estimator, median filter, 1989): and CMF. − c2 jπk X [k]=2Re X [k]e 2N , k =0, ..., N 1. (6) 2 − 3. Proposed method 3.1. Feature extraction This expression suggests that there is a close rela tionship between the DCTII components and the sig As mentioned before, the fast rise in energy is a very nal spectrum. Squaring both sides of (6) and applying peculiar characteristic of impulsive signals. According trigonometric identities, we receive: 258 Archives of Acoustics – Volume 42, Number 2, 2017

c2 2 2 πk where kmax is the maximum selected component and X [k] = 2 (Re X2[k] ) 1+cos { } N kmin is the minimum selected component. πk DEE can be used as an power estimate for a se + 2 (Im X [k] )2 1 cos lected range within a Σ∆ signal. This estimator works { 2 } − N over the following assumptions: πk +4Re X2[k] Im X2[k] sin . (7) πk N; { } { } N • ≪ the baseband signal phase has an approximately Similarly to DFT, a low pass filter may be imple • random uniform distribution throughout fre mented by computing only a few DCT components. quency and time domains. Due to the narrow baseband of Σ∆ signal and DCT The first statement is generally true, once Σ∆ energy compaction property, very few components may signal is modulated using a high oversampling ratio. be computed for such a filter. This implies that πk The second statement will be further analysed in Sub N, then cos πk 1 and sin πk 0. Thus, (7) may≪ N N sec. 4.2. be reduced to: ≃ ≈ c2 2 3.2. Impulse detector architecture 2 X [k] (Re X2[k] ) . (8) { } ≈ 4 The proposal herein is to replace both the Σ∆ de Urban and countryside atmospheres are random modulator and discrete power estimator of the refer medium for sound waves, due to the nonuniform pres ence architecture (Fig. 3) by DEE presented in (11). ence of buildings, trees, mountains, and some mov Figure 5 illustrates the proposed architecture for im ing objects that change wave characteristics through pulsive signal detection in Σ∆ signals. physical phenomena such as reflections, scattering, and dispersion. According to Ishimaru (1978), these kinds of media vary in time and space, making sound amplitude and phase to fluctuate randomly. For a fixed source/receptor configuration, atmospheric tur bulence may also cause signal variability between dif ferent emissions, as recently observed by Cheinet and Broglin (2015). Fig. 5. Proposed impulsive signal detector architecture. In this scenario, if the phase is considered to have approximately random uniform distribution through This replacement relies on the fact that DEE pro frequency domain and for different time frames, we can duces energy signals whose shapes are similar to those also consider that energy, for a given time window, is of the conventional method. If waveforms are pre evenly distributed between real and imaginary compo served, impulses’ characteristics in the energy signals nents of DFT. Based on this assumption, energy may are also preserved, allowing CMF (described in Sec. 2) be estimated as: to detect impulsive activities with a performance close

2N−1 to that obtained by the reference architecture. The 1 2 main characteristics that must be preserved in the sig Ex 2 (Re X2[k] ) 2 ≈ 2N { } nal delivered by DEE are rise time, duration, and SNR. k=0 − The time frame length used in the demodulated 1 2N 1 2 (Im X [k] )2 . (9) signal must be also used when estimating energy with ≈ 2N { 2 } k=0 DEE in Σ∆ domain. That means the number of sam ples analysed should increase by the oversampling ratio Since x2[m] and x[m] have the same energy and to keep the same frame duration. their spectra are symmetric, the estimator may be Note that the proposed architecture does not con rewritten, arbitrary choosing the real term, as: vert Σ∆ into PCM in any step of processing. Once detected an impulse, the audio signal may be trans N−1 2 mitted to the CPU for further processing. E (Re X [k] )2 . (10) x ≈ N { 2 } k=0 3.3. Computational budget decrease Replacing the result of (8) in (10), the DCT Energy To evaluate the advantage of the proposed detec Estimate (DEE) is introduced: tor against the reference architecture, the DEE mod k ule in Fig. 5 is compared to a CIC decimator, as de 1 max DEE = (Xc2[k])2, (11) picted in Fig. 2, in conjunction with the discrete power 2N estimator of (1). k=kmin I.D. dos S. Miranda, A.C. de C. Lima – Impulsive Sound Detection Directly in SigmaDelta Domain 259

A CIC decimator of L stages, as in Fig. 2, (k k )=3 for M = 32. These parameters were CIC max − min has LCIC additions before decimation and LCIC addi increased at the same ratio as M in order to preserve tions after it. For a Lcomp thorder compensation filter, the spectral characteristics for the two systems. The there are also (Lcomp + 1) multiplications and Lcomp baseband processing frame, Lp, was fixed to 32. Note additions after decimation. Therefore, for a oversam that the proposed method needs power operations in LCIC+Lcomp pling ratio of M, it has LCIC + M additions the considered range of M and the difference in the Lcomp+1 number of summations increases with M. and M multiplications per unit time (sampling period at ). The power estimator over For instance, if M = 64, LCIC = 16, and Lcomp = 1 1 16 the power estimator using the CIC decimator re Lp PCM samples requires + multiplications M LpM quires 0.26 multiplications and 16.50 additions per unit and 1 1 additions per unit time. time. For the same case, an energy estimator imple M − LpM Each DCT component in DEE expression requires mented using DEE, as in (11), with kmin = 1 and (k k ) multiplication and (k k ) ad kmax = 6, requires 0.01 multiplications and 6 addi max − min max − min ditions per unit time. Thus, DEE requires (kmax tions per unit time. We can observe for this case that k )(1 + 1 ) + 1 multiplications and (k − DEE requires 63.6% lower additions when compared to min LpM LpM max 1 1 − the reference architecture. Multiplications have a mi kmin)(1 + ) additions per unit time. LpM − LpM nor impact in both implementations. Multiplication on 1bit streams can be replaced by conditional statements when computing DCT compo nents. This way, multiplications per unit time reduce 4. Experimental results to kmax−kmin + 1 in the proposed one. LpM LpM Figure 6 shows the number of operations as a func 4.1. Data collection tion of the oversampling ratio for both reference archi tecture and proposed method. CIC and DEE param The dataset used in this work has 42 audio files containing a total of 407 gunshots, recorded in an eters were assumed to be LCIC = 8, Lcomp = 8 and outdoor shooting club. The recording was performed a) using a MEMS omnidirectional digital microphone, MP45DT02, mounted on STM32F4 Discovery board that embeds an ARM based microcontroller, all three from ST Microelectronics. The microphone was set to deliver 512 kHz PDM bit streams which were stored in a raw format for fur ther processing. Besides the raw bit streams, a PCM version of these recordings was also generated by fil tering it to 4 kHz and decimating it with a 64 down sampling rate. The firearms used for building this dataset were 0.38 revolvers and 0.40 pistols. Although the shooters were fixed in their positions, the sounds were collected from several distances and angles to provide, from b) one sample to another, diversity in propagation envi ronment and SignaltoNoise Ratio (SNR). The latter ranges from 36 to 54 dB.

4.2. Phase randomness and DEE analysis

An evenly distributed energy between real and imaginary through DFT components is a mandatory characteristic for the validity of (11). If that happens, the shape of the energy distribution is preserved even if only a real or imaginary component is used to compute its value. This hypothesis has been evaluated assessing the similarity between the energy of X [k] with and with Fig. 6. Number of operations as a function of oversam 2 out the imaginary component, using crosscorrelation. pling ratio M, assuming LCIC = 8, Lcomp = 8 and (kmax − kmin) = 3 for M = 32. As M increases, theses Since the sound recordings are modulated in Σ∆, only parameters have been increased in the same proportion. the first 8 terms of a 2048point DFT have been used, A constant Lp of 32 was used. corresponding to 4 kHz in X2[k] spectrum. 260 Archives of Acoustics – Volume 42, Number 2, 2017

For the 42 audio files analysed, the crosscorrelation as much as possible for the audio sample with the low average is 0.998 out of 1 and standard deviation est SNR. of 0.0021. Thus, it is proper to say that the real portion As expected, the number of DCT components of the income audio signal may be used to estimate its utilised in DEE has significant influence on the error energy evolution. rate. Figure 8 shows false positive and false negative In accordance with the result above, DEE has errors as function of kmax, reaching the minimum false shown energy estimations similar to those obtained by negative error for kmax =6. If too few components are the conventional process of (1), as observed in the plots used, the spectrum is misrepresented, increasing the of Fig. 7. This similarity indicates that an algorithm error. On the other hand, too many components may to detect impulses, like CMF, may have suitable per add Σ∆ quantisation noise, affecting the detection. formance when using DEE, since rise time, duration, and SNR of the energy pulses have remained nearly the same. a)

b)

Fig. 8. False negative and false positive errors as a function of kmax (kmin = 1).

Table 1 lists the results obtained by the two ap proaches. It can be seen that the proposed architec Fig. 7. Results for an audio fragment extracted from the ture has shown an acceptable performance (even bet dataset: a) power evolution calculated as in (1) from PCM ter) as compared to the powerbased implementation signal; b) energy estimated with DEE from Σ∆ bitstream, using fewer operation per cycle. for kmin = 1 and kmax = 6. Table 1. Detection results for reference and proposed im pulse detectors (k = 1 and k = 6). Note that although there are differences in energy min max peaks between the two wave shapes, the impulsive sig Reference Proposed nature is preserved, allowing this signal to be used for architecture method detecting explosions. Additionally, a difference in scale Additions/Cycle 16.5 6 was expected, since (1) is a measure of power, while Detected 394 (96.8%) 398 (97.8%) (11) is an estimation of energy. False Negative 13 (3.2%) 9 (2.2%) False Positive 17 3 4.3. Detection results

The proposed detector and the reference architec 5. Conclusions ture were run over the dataset to detect impulsive activities. The powerbased detector implemented as In this work, a method to detect impulsive sounds sesses the signal using (1), while the proposed one uses directly in Σ∆ signals is presented. Experimental re the DEE. CMFs have been used in both detectors to sults performed on 42 recordings containing 407 gun highlight abrupt changes in energy. shots have demonstrated that, in comparison with Energy (or power) has been estimated on time win the traditional approach, a DCTbased energy estima dows of 4 ms, which corresponds to 32 samples for the tor has presented an acceptable accuracy and lower power based detector and 2048 for the proposed one. computational cost. The proposed estimator has the The median filter length has been set to 11 in the two only drawback of requiring to store tens of kilobytes implementations. For both detectors, thresholds for en for coefficients, which is reasonable for memory re ergy/power and impulse measurer have been adjusted sources available in the today’s microcontrollers tech to decrease the False Positive and False Negative Rates nology. Evaluated with the same dataset, the proposed I.D. dos S. Miranda, A.C. de C. Lima – Impulsive Sound Detection Directly in SigmaDelta Domain 261 impulse detector has shown a slightly better perfor 8. Kasparis T., Tzannes N.S., Chen Q. (1992), Detail mance than the reference architecture, performing sig preserving adaptive conditional median filters, Journal nificantly fewer operations per unit time. Although, in of Electronic Imaging, 1, 4, 358–364. this work, experimental results have been obtained for 9. Kauppinen I. (2002), Methods for detecting impulsive gunshot sounds, the proposed algorithm may be appli noise in speech and audio signals, [in:] Digital Signal cable to other impulsive events as long as they have Processing, DSP 2002, 14th International Conference similar waveforms and the parameters are adjusted to on, Vol. 2, pp. 967–970, IEEE. their characteristics. 10. Khayam S.A. (2003), The Discrete Cosine Transform (DCT): theory and application, Michigan State Univer References sity. 11. Millet J., Baligand B. (2006), Latest achievements 1. Ahmed N., Natarajan T., Rao K.R. (1974), Dis in gunfire detection systems, Technical report, DTIC crete cosine transform, Computers, IEEE Transactions Document. on, 100, 1, 90–93. 12. Oppenheim A.V., Schafer R.W., Buck J.R. (1989), 2. Britanak V., Yip P.C., Rao K.R. (2010), Discrete Discretetime signal processing, Vol. 2, Prentice Hall, cosine and sine transforms: general properties, fast al Englewood Cliffs. gorithms and integer approximations, Academic Press. Park S. 3. ChacónRodr´ıguez A., Julian´ P., Castro L., Al 13. (1991), Principles of sigmadelta modulation varado P., Hernandez N. (2011), Evaluation of gun for analogtodigital converters, Motorola Application shot detection algorithms, Circuits and Systems I: Reg Notes. ular Papers, IEEE Transactions on, 58, 2, 363–373. 14. Rao K.R., Yip P. (2014), Discrete cosine transform: 4. Cheinet S., Broglin T. (2015), Sensitivity of shot de algorithms, advantages, applications, Academic press. tection and localization to environmental propagation, 15. Sharkey J.B., Doblar R.A., Bothwell F.E., Applied Acoustics, 93, 97–105. Belt R.A., Page E.A. (1996), System for effective 5. Dufaux A. (2001), Detection and recognition of im control of urban environment security, US Patent pulsive sounds signals, Institute de Microtechnique 5,504,717. Neuchatel, Switzerland. 16. Showen R.L., Dunham J.W. (1999), Automatic real 6. Graves J.R. (2012), Audio gunshot detection and lo time gunshot locator and display system, US Patent calization systems: history, basic design, and future 5,973,998. possibilities, PhD thesis, University of Colorado. 17. Wessels P.W., Basten T.G. (2016), Design aspects 7. Ishimaru A. (1978), Wave propagation and scattering of acoustic sensor networks for environmental noise in random media, Vol. 2, Academic Press, New York. monitoring, Applied Acoustics, 110, 227–234.