Arxiv:2108.09203V1 [Cs.LG] 20 Aug 2021 Et Al
Total Page:16
File Type:pdf, Size:1020Kb
Parsing Birdsong with Deep Audio Embeddings Irina Tolkova1∗ , Brian Chu1∗ , Marcel Hedman1∗ , Stefan Kahl2 and Holger Klinck2 1School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 2K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology, Cornell University, Ithaca, NY fitolkova, brianchu, [email protected], [email protected], [email protected] Abstract In particular, BirdNET is an acoustic monitoring frame- work drawing on citizen science for large-scale data col- Monitoring of bird populations has played a vital lection on species-level occurrence [Kahl et al., 2021]. role in conservation efforts and in understanding The foundation of BirdNET is a CNN classifier, trained biodiversity loss. The automation of this process on high-quality recordings from labeled birdsong reposi- has been facilitated by both sensing technologies, tories such as Xeno-Canto (www.xeno-canto.org) and the such as passive acoustic monitoring, and accompa- Macaulay Library at the Cornell Lab of Ornithology nying analytical tools, such as deep learning. How- (https://www.macaulaylibrary.org). A user can record bird- ever, machine learning models frequently have dif- song through the BirdNET app and see the species identi- ficulty generalizing to examples not encountered in fied by the classifier, across three thousand possible species. the training data. In our work, we present a semi- Moreover, these audio recordings, classification outputs, and supervised approach to identify characteristic calls associated time/location metadata are anonymized and cat- and environmental noise. We utilize several meth- alogued. Therefore, the project intends to both engage the ods to learn a latent representation of audio sam- public in bird-watching while also collecting large-scale data ples, including a convolutional autoencoder and for spatiotemporal analysis of avian populations. two pre-trained networks, and group the resulting While the use of machine learning mitigates the need for embeddings for a domain expert to identify clus- expert labeling and opens opportunities in data collection, it ter labels. We show that our approach can improve is imperfect, and will have difficulties generalizing to sound classification precision and provide insight into the structure not encountered in the training dataset. This is latent structure of environmental acoustic datasets. of particular significance for BirdNET, which has a domain shift between the training and deployment datasets. Xeno- 1 Introduction Canto and the Macaulay Library contain recordings with high signal-to-noise ratio obtained with high-quality (often direc- Over the past half-century, there has a been an estimated de- tional) microphones, from which it is easy to obtain labels cline of 29% across North American bird populations, cor- localized in time (“strong” labels). However, user submis- [ responding to a net loss of about 3 billion birds Rosenberg sions are often taken with a cell phone in noisy environments, et al. ] , 2019 . To design and implement effective conserva- potentially within a long recording window. To improve ro- tion policies for counteracting such biodiversity loss, there is bustness, the training pipeline incorporates data augmentation a need for automated monitoring technologies to understand with various sources of “background noise” data; but submis- changes in species-level populations. In recent years, tech- sions will likely contain noise structure which is not encom- [ nologies such as camera traps Rowcliffe and Carbone, 2008; passed by these augmentations. Overall, evaluations of Bird- arXiv:2108.09203v1 [cs.LG] 20 Aug 2021 et al. et al. ] Steenweg , 2017; O’Connell , 2010 , drone footage NET have shown false positive rates around 15-20% [Arif et [ et Jimenez´ Lopez´ and Mulero-Pazm´ any,´ 2019; van Gemert al., 2020]. While this accuracy outperforms other classifiers, al. ] , 2014; Koh and Wich, 2012 , or even satellite imagery and may be sufficient for citizen-science engagement, such a [ et al. ] Pettorelli , 2014 have been used for species-level iden- margin is significant in informing conservation policy. tification and population estimation. However, these visual methods are difficult to apply to bird recognition, due to their In this work, we aim to introduce a semi-supervised ap- small size and frequent occlusion in the environment. But proach for post-processing BirdNET classifications and iden- a majority of birds are vocal, with complex species-specific tifying false positives. Specifically, we utilize several meth- calls and song repertoires. Consequently, there has been a ods for representation learning to construct embeddings of rising interest in using automated passive acoustic technolo- acoustic data and identify clusters characteristic of calls or en- gies for monitoring bird populations [Browning et al., 2017; vironmental noise. By incorporating human input on the con- Sugai et al., 2019; Blumstein et al., 2011; Wood et al., 2019]. tent of each cluster, we provide alternative confidence scores to the BirdNET classifications. We evaluate this approach ∗equal contribution on a collection of recordings across three species which were all identified as “positives” by the BirdNET system and were As most bird calls occur below 10kHz, we truncate the fre- subsequently manually evaluated by an expert, with the pri- quency range to this value. To remove low-amplitude win- mary goal of improving system precision. We hope that this dows below a baseline level of acoustic activity, we apply a project can improve the performance of automated passive simple detector score similar to that used in [Kahl et al., 2021; acoustic monitoring, inform the targeted system augmenta- Sprengel et al., 2016]. For a given spectrogram, the detector tions needed to address false positives, and provide insights score is defined by the proportion of pixels that are greater into lower-dimensional structure within intraspecific acoustic than the column median and 1.5 times the row median. For data. the BirdNET dataset, we ignore all windows with a score of less than 0.1; on the other hand, for obtaining “call” samples 2 Related Work from the Xeno-Canto dataset, we use only windows scoring The introduction of AudioSet has enabled many recent ad- above 0.3. These thresholds were determined by visual ex- vances in audio event recognition. AudioSet is a collection amination. Lastly, we downsize the windows to a size of of short human-labeled sound clips from YouTube videos, 100x100 pixels for computational feasibility. These spectro- containing over 2 million samples separated into 632 audio grams are used as input for the autoencoder, and for visu- classes by human annotation [Gemmeke et al., 2017]. Most alization across all methods; the corresponding time-domain deep learning approaches for audio analysis will first gener- signal is given as input to both pre-trained architectures (VG- ate spectrograms from the data, turning the task into an im- Gish and PANNs). age classification problem. VGGish is a pre-trained convolu- 3.2 Embeddings tional neural network inspired by the VGG network used for image classification [Hershey et al., 2017]. Other similar ap- Pre-trained VGGish proaches include SoundNet [Aytar et al., 2016] and L3-net The first embedding method explored uses the VGGish ar- [Arandjelovic and Zisserman, 2017; Arandjelovic and Zis- chitecture pretrained on AudioSet, following the methodol- [ ] serman, 2018; Cramer et al., 2019] which learn representa- ogy described by Sethi et al., 2020 . This approach has been tions on unlabelled data. More recently, studies have lever- shown to successfully identify anomalous sounds in a natural aged triplet learning or contrastive learning to generate audio environment, but has not been applied to the more specific embeddings. These approaches are able to use the raw au- problem of identifying intraspecific calls. After following the dio signal [Yu et al., 2020] or its spectrogram [Spijkervet and pre-processing steps outlined above, 1-second time-domain Burgoyne, 2021]. samples are given as input into the pre-trained architecture, While there have been major advances in audio embed- which converts the signal into a spectrogram and returns a dings within machine hearing, they have had limited use in 128-dimensional embedding. passive acoustic monitoring. Unsupervised methods such Pre-trained Wavegram-Logmel-CNN as clustering techniques, t-distributed Stochastic Neighbor While most audio analysis is performed either on spec- Embedding (t-SNE) and Uniform Manifold Approximation trogram images or (more rarely) on raw audio, [Kong (UMAP) have previously been applied for distinguishing et al., 2020] proposes integrating these approaches for calls [Clink and Klinck, 2021; Parra-Hernandez´ et al., 2020]. the Wavegram-Logmel-CNN16 architecture: using both In particular, [Sainburg et al., 2020] performed a thorough a learned 1-dimensional time-frequency representation to- analysis of UMAP applied to clustering across intraspecific gether with the calculated log-amplitude mel-scale spectro- vocalizations in a range of species. Of studies that have gram as input. This network, pretrained on AudioSet, gives leveraged deep learning, [Sethi et al., 2020] used VGGish, superior performance to prior methods on a number of down- pre-trained on AudioSet, to analyze soundscape data. The stream classification tasks with a 2048-dimensional embed- authors found that the embedding captured clear structure ding, and is made