Digital Forensics of Gun Audio Samples Meets Artificial Intelligence
Total Page:16
File Type:pdf, Size:1020Kb
Sound of Guns: Digital Forensics of Gun Audio Samples meets Artificial Intelligence Simone Raponi, Gabriele Oligeri, Isra M. Ali Division of Information and Computing Technology College of Science and Engineering, Hamad Bin Khalifa University - Doha, Qatar fsraponi, goligeri, [email protected] Abstract—Classifying a weapon based on its muzzle blast is a of multiple replicas of the same gunshot enables shooter local- challenging task that has significant applications in various secu- ization and weapon identification. The physical characteristics rity and military fields. Most of the existing works rely on ad-hoc of acoustic propagation can be exploited to infer the position deployment of spatially diverse microphone sensors to capture multiple replicas of the same gunshot, which enables accurate of the shooter and the category of the gun. Multiple spatially detection and identification of the acoustic source. However, diverse acoustic sources enable the estimation of Angle of carefully controlled setups are difficult to obtain in scenarios such Arrival (AoA), Time of Arrival (ToA), and Time Difference as crime scene forensics, making the aforementioned techniques of Arrival (TDoA). The obtained recordings can be modeled inapplicable and impractical. by geometrical acoustics that enable the localization of the We introduce a novel technique that requires zero knowledge about the recording setup and is completely agnostic to the shooter. Furthermore, multiple replicas of the same acoustic relative positions of both the microphone and shooter. Our source allow to filter out echoes and background noise affect- solution can identify the category, caliber, and model of the ing a subset of the deployed microphones, thus enabling a deep gun, reaching over 90% accuracy on a dataset composed of characterization for both the time and frequency domains. 3655 samples that are extracted from YouTube videos. Our Acoustic acquisition via Wireless Sensor Network (WSN) results demonstrate the effectiveness and efficiency of applying Convolutional Neural Network (CNN) in gunshot classification requires a specialized infrastructure overlay to enable sensor eliminating the need for an ad-hoc setup while significantly communication, data processing, and computation distribution. improving the classification performance. Solutions that rely on the spatial diversity provided by the WSN introduce several types of burdens. Firstly, each soldier I. INTRODUCTION has to carry a wearable device equipped with a microphone Gunshot analysis have received significant attention from and other sensors, such as a compass, to collect meaningful both the military and scientific communities. Acoustic analysis information about AoA, ToA, and TDoA. Secondly, in a of gunshots can provide useful information, such as the military scenario, the WSN should feature a jamming-resistant position of the shooter, the projectile trajectory, the caliber communication protocol and non-interfering radio channels. of the gun, and the gun model. Although acoustical evidence Both assumptions are difficult to achieve given the resource may significantly contribute to audio forensic reconstruction constraints of WSNs in terms of CPU, battery, and memory. and analysis, the forensic analysis of gunshots is characterized In most cases, WSNs cannot afford the computational burden by many challenges due to the broadcast and noisy nature of of multimedia processing. Therefore, the captured data should the acoustic channel. be first off-loaded to a remote server, then downloaded and arXiv:2004.07948v2 [eess.AS] 1 Mar 2021 Consider a scenario where a microphone is deployed in distributed again. This represents a challenge from the con- a close neighborhood to the shooter. The recorded audio nectivity perspective since, in many cases, military WSNs are sample can be significantly affected by the environmental unattended or provided with a discontinued link to the control surroundings, such as trees, foliage, and buildings, which center. attenuate and reflect the main component of the shock wave. In this work, we do not rely on ad-hoc acquisition setups, The resulting audio sample may feature different echoes of but we exploit publicly available audio recordings of gun- the gunshot that are characterized by different attenuation shots, considering their temporal and spectral representations. factors as a function of their paths. This naive approach Spectral analysis of sound has been adopted in many con- is impractical, which motivated the development of more texts to detect and identify recurrent patterns. In particular, complex ad-hoc acoustic data acquisition strategies over the the combination of time-frequency decomposition of audio last decade. To mitigate echoes and overcome the intrinsic samples with Convolutional Neural Network (CNN) provides lack of information, that the aforementioned scenario suffers promising performance in detecting recurrent patterns [1]. The from, additional microphones are deployed. The comparison CNN is trained over several “images” constituted by a three- dimensional representation of time, frequency, and amplitude. distributed nature of the recording setup provides spatial The result is a robust solution that can “recognize” the same diversity, where multiple acoustic observations from different sound by cross-matching similar images. locations of the same gunshot are obtained, which can be lever- Contribution. We propose an inexpensive solution that is aged to increase the classification accuracy. Sanchez-Hevia´ et able to detect and identify gunshots without resorting to any al. [4] exploited this feature and proposed a multi-observation ad-hoc infrastructure. Contrary to other studies, our solution weapon classification system that leverages various classifier requires only an audio sample of a gunshot that can be ensembles to enhance classic decision fusion techniques. Each easily obtained by any commercially available microphone. node in the sensor network produces a classification decision Our approach is agnostic to the microphone position with using Least Squares Linear Discriminant Analysis (LS-LDA). respect to the shooter, and it does not require multiple spatially The decisions are later fused using a Maximum Likelihood- different replicas of the gunshot; we consider recordings based fusion rule that weights the decision of each node based from mono-channel setups with different sample rates. We on its location. proved the effectiveness of our solution by considering 3655 The main constraint induced by this type of analysis is samples of gunshots constituted by 30 pistols, 18 rifles, and the requirement of spatial information, which can only be 11 shotguns for a total of 7 different calibers. The proposed obtained by deploying a distributed sensor network. Therefore, approach guarantees an accuracy higher than 90% for all of limiting the applicability of gunshot detection and firearm the considered cases, namely, the category, model and caliber classification to a carefully controlled recording setup only. of the gun. Consequently, various pattern recognition approaches were Paper organization. The remainder of this paper is orga- proposed that identify the firearm category in the absence nized as follows. Section II summarizes recent contributions of spatial information. The most used classifiers for firearm in the field of weapon classification. Section III introduces the identification are Gaussian Mixtures Model (GMM) [5], [6], background concepts related to frequency domain analysis, [7] and Hidden Markov Model (HMM) [8], [9]. CNNs, and acoustic characteristics of gunshots. Section IV Most of these approaches can be described as frame-based describes our dataset and Section V discusses the the dataset feature classification approaches [5], [6], [7], [9], where the generation process. The neural network architecture is pre- time-domain acoustic signal is subdivided into a sequence sented in Section VI. Section VII shows the performance of of short-time windowed frames. From each frame, a set of our solution. Finally, Section VIII draws some concluding predetermined features is extracted and used for gunshot clas- remarks. sification. The most common extracted features are statistical measures of the spectrum and intensity of the signal, in addi- II. RELATED WORK tion to perceptual features such as Mel-Frequency Cepstrum Firearm classification based on the acoustic evidence gen- Coefficient (MFCC) or Perceptual Linear Prediction Coeffi- erated by its discharge has long been investigated, but not cients (PLP). Temporal features, such as energy and Zero extensively studied in the literature. Proposed solutions vary Crossing Rate (ZCR), are also used, but only in conjunction in many aspects, including the source of acoustic data, the with spectral or perceptual features. type of analysis applied, the type of features extracted, and the Morton et al. [8] proposed an alternative classification application area. Table I summarizes prior studies, that provide approach that does not rely on frame-based features aiming to gunshot classification and firearm identification, according to eliminate the dependency on performance-driven parameters, these aspects. which are often optimized over a finite training set. They The source of the data is characterized by the type, the proposed modeling each firearm category as an HMM with quality, and the environmental conditions of the deployed AutoRegressive (AR) source densities using non-parametric audio recording setup, which defines the amount