
Large-scale song identification using convolutional neural networks

MASTER’S THESIS

submitted in partial fulfillment of the requirements for the degree of

Master of Science

in

Computational Intelligence

by

Botond Fazekas, BSc Registration Number 0925351

to the Faculty of Informatics
at the TU Wien

Advisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Andreas Rauber
Assistance: Dipl.-Ing. Thomas Lidy
            Dipl.-Ing. Alexander Schindler

Vienna, 3rd May, 2018 Botond Fazekas Andreas Rauber

Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.ac.at

Erklärung zur Verfassung der Arbeit (Declaration of Authorship)

Botond Fazekas, BSc Linzer Straße 261/2/25, 1140 Wien, Österreich

I hereby declare that I have written this thesis independently, that I have fully listed all sources and aids used, and that I have in every case marked as borrowings, with reference to the source, all passages of this work – including tables, maps and figures – that are taken from other works or from the Internet, either verbatim or in substance.

Vienna, 3rd May 2018 Botond Fazekas


Kurzfassung

In order to comprehend ecosystems it is important to understand wildlife; observing it, however, can be complicated and time-consuming. Automatically recording the sounds of the environment is itself a simple undertaking, but the subsequent identification of the wildlife requires a more complex system. Manual identification would be too cumbersome, which makes automated methods a promising alternative in this field of research. Birds are particularly well suited for this task, since their communication happens largely through song and, due to their quick reaction to changes in their environment, they are also good ecological indicators. The currently available datasets contain a very large number of bird songs of the most diverse species in various environments, so besides the classification accuracy, scalability is also an important factor. The aim of this study is to improve upon the state-of-the-art acoustic bird identification methods that were evaluated in the BirdCLEF2016 competition1, both in terms of accuracy and of the required training time, and thus in terms of scalability. This work describes the pre-processing steps used to separate bird song from background noise, and it evaluates the performance of the proposed Convolutional Neural Network (CNN) models with Rectified Linear Units and Exponential Linear Units (ELU) on the BirdCLEF2017 dataset, as well as the effect of Mel-scaling and the Constant-Q transformation of the sounds. Furthermore, this work presents a new multi-modal architecture that uses the various metadata available for the field recordings for the classification. The results show that simpler CNN models with ELUs outperform those state-of-the-art solutions with respect to training time and classification performance, while the use of metadata has a markedly positive effect on the identification accuracy.

1 http://www.imageclef.org/lifeclef/2016/bird


Abstract

Understanding wildlife populations is important for understanding ecosystems; monitoring them, however, is difficult and time-consuming. Automatically capturing environmental sounds is easy, but it requires subsequent identification of the wildlife. Doing this manually is cumbersome, thus automated methods are a promising field of research. Birds are especially well suited for this task as their main way of communication is singing, and they are an important ecological indicator since they respond quickly to changes in their environment. The currently available datasets contain a large number of bird songs of different species in various environments, hence besides the classification accuracy the scalability of the methods is an important factor, too. The aim of this study is to improve upon the state-of-the-art acoustic bird identification methods evaluated in the BirdCLEF2016 competition2 in terms of both the identification accuracy and the required training time, and therefore the scalability. The work describes the pre-processing steps used to separate the bird songs from the background noise, and it evaluates the performance of the proposed simpler convolutional neural network (CNN) models with Rectified Linear Units and Exponential Linear Units on the BirdCLEF2017 dataset, along with the effect of Mel-scaling and Constant-Q transforming of the sounds. Furthermore, a novel multi-modal architecture is proposed which incorporates the various metadata available for the field recordings. The results show that the simpler CNN model with exponential linear units largely improves on the training time and classification performance compared to the state-of-the-art solutions, while the use of metadata has a major positive effect on the identification accuracy.

2 http://www.imageclef.org/lifeclef/2016/bird


Contents

Kurzfassung

Abstract

Contents

1 Introduction
  1.1 Motivation
  1.2 Aim of the work
  1.3 Structure of the work

2 Background
  2.1 Machine learning in general
  2.2 Neural networks
  2.3 Convolutional neural networks
  2.4 Signal processing
  2.5 Summary

3 Related works
  3.1 Acoustic signal processing with CNNs
  3.2 Bird classification challenges
  3.3 Summary

4 Methodology
  4.1 Dataset
  4.2 Pre-processing
  4.3 Network architectures
  4.4 Combining the single predictions
  4.5 Evaluation methods
  4.6 Summary

5 Results
  5.1 Sound representation
  5.2 Metadata
  5.3 Data augmentation
  5.4 Network architecture
  5.5 Activation functions and training times
  5.6 Discussion

6 BirdCLEF 2017
  6.1 Submitted runs
  6.2 Other participating teams
  6.3 Results and analysis

7 Conclusion and future work
  7.1 Future work

List of Figures

List of Tables

Acronyms

Bibliography

CHAPTER 1 Introduction

As the world faces climate change, it is becoming increasingly important to get acquainted with the species living in the wild and to discover their habitats, in order to preserve their biodiversity and to understand the impact of humans on the world's ecology [HP05]. Higher biodiversity has been shown to have positive effects on human health and on the food chain, thus also affecting the economy [SMP12, p. 3–5]. There is still a long way to go for humanity, as it is estimated that there are about 10-14 million different species in the world, of which only 1.2 million have been described and categorized. Most of these species are concentrated in the tropical forests, which cover only 10% of the earth's surface while containing 90% of all living species [You03]. These biomes are hard to study because of their distance from more developed areas, which makes them hard to access. Automatic identification tools may help support the monitoring of animals present in these areas and close this taxonomic gap. There have been several studies and results in this field [CEP+07, GO04, TPN+12].

1.1 Motivation

Birds are very sensitive to changes in their environment, which renders them a good ecological indicator [GNF+03]. However, due to the great diversity of bird species, traditional approaches require professional knowledge and are difficult for the general public. Although automated visual observation is possible, it is a complicated task, especially in rain forests with very dense flora. Therefore most research concentrates on the use of acoustic signals to monitor and classify animals [BLN+12, BHR+13]. Several public communities have emerged in the past decades that focus on the collection of acoustic observations of birds (e.g. eBird1, Xeno-canto2), which may help domain experts around the world to use them in their research without the need of traveling to the native

1 http://ebird.org/
2 http://www.xeno-canto.org


territory of these animals. However, it is hard for professionals to keep pace with the growth of these data sets [GGV+14], so several competitions have been held in the previous years in order to stimulate research in the automatic classification of bird vocalizations (see Section 3.2). The major challenges include the vast number of bird species, simultaneously vocalizing birds of the same or of different species, sounds of other animals in the area (e.g. insects) and background noise like rain, wind or vehicles.

1.2 Aim of the work

The aim of this thesis is to improve the performance of the state-of-the-art acoustic bird classification solution [SJKH16], in the sense of both the classification performance and the scalability, i.e. the time required for the training. Many cutting-edge Convolutional Neural Network (CNN) models are highly complex and thus require professional equipment and a large amount of time to train [SVI+16, SIVA17]. In some cases the training must be stopped because of time constraints before full convergence is reached [SG17]. An objective of this thesis is to evaluate simpler CNN architectures for bird song classification which can be efficiently trained with a large amount of data, while improving the classification performance. Furthermore, two methods for the perceptual scaling of pitches, the Mel-scale and the Constant-Q transform, are examined. While perceptual scaling yields smaller inputs and thus smaller models, it may lead to information loss, especially as these methods were designed for the human sound perception system. The state-of-the-art solutions do not incorporate all of the information available, e.g. the date/time and the location of the recordings, therefore an additional objective of this thesis is to evaluate the effects of the metadata on the performance of the classification. In summary, the following main questions are addressed:

• Are simplified CNN architectures capable of reaching the performance of the state-of-the-art solutions?

• How does the perceptual scaling of pitches with Mel-scale and Constant-Q transform impact the classification performance and the scalability of bird song classification methods?

• What are the effects of the inclusion of the various metadata available for the bird song recordings on the accuracy of the classifications?

1.3 Structure of the work

The rest of this thesis is organized as follows:


• Chapter 2 gives a general overview of the theoretical background of this work. Section 2.1 focuses on the basic concepts of machine learning such as the prevailing types, hyper-parameters, capacity, overfitting and underfitting. Section 2.2 aims to give the historical and theoretical background of Artificial Neural Networks (ANN), starting with the Perceptron in Section 2.2.1 and introducing the most important algorithm of ANNs, the backpropagation algorithm in Section 2.2.3. Section 2.3 introduces the Convolutional Neural Networks, giving an overview of the most important layers, such as the convolutional layer in Section 2.3.1, the pooling layer in Section 2.3.2, dropout in Section 2.3.3, batch normalization in Section 2.3.4 and activation functions in Section 2.3.5. Finally, in Section 2.4 the signal processing methods used in this thesis are presented.

• Chapter 3 focuses on the bird classification competitions of the previous years and introduces the state-of-the-art solutions of Piczak [Pic16], Tóth et al. [TC16] and Sprengel et al. [SJKH16].

• Chapter 4 introduces in detail the dataset used for the evaluation in Section 4.1. Section 4.2 discusses the pre-processing methods that were applied to the sound files and metadata before training, as well as the data augmentation techniques involved during the training process. Section 4.3 presents the two novel network architectures that were evaluated for the task, including the methods for incorporating the metadata and the different sound representations. In Section 4.5 the evaluation metrics are discussed.

• Chapter 5 presents in detail the performance of the perceptual pitch scaling methods compared to the Short-Time Fourier Transform (STFT) spectrograms, the effects of the various metadata and their combinations, the impact of data augmentation, the training time required for the different architectures and finally a discussion of the results.

• Chapter 6 gives an overview of the BirdCLEF2017 challenge, including the solutions of the participants and discusses the competition-performance of the methods presented in this work.

• Chapter 7 summarizes the major findings of this thesis and discusses potential future work.

Remark Parts of the methods and results presented in this thesis have been published in the writeup of the BirdCLEF2017 competition [FSLR17].


CHAPTER 2 Background

This chapter aims to give a general overview of the concepts required for the understanding and interpretation of the methods and results. Firstly, in Section 2.1 machine learning in general is discussed, including its various directions and tasks, and the basic vocabulary is introduced. Section 2.2 aims to give an introduction to and historical overview of the key concepts of Artificial Neural Networks, including an informal introduction to the backpropagation algorithm in Section 2.2.3, a method that still forms the basis of most Neural Network learning methods today [GBC16]. Section 2.4 provides a short introduction to the signal processing concepts used in this thesis. Since the discipline of machine learning is so extensive, only a narrow portion of its topics is discussed in this thesis, mainly the parts that were required for the implementation of the solution discussed in the later chapters.

2.1 Machine learning in general

A typical goal of computer programs is to map input data to an output value. While conventional programs follow a strict sequence of instructions and rules defined by the designer to solve a problem, machine learning algorithms are a set of meta-algorithms and data representations that enable the computer itself to synthesize an adaptive model for the solution [Bis13]. While the former approach performs better in many well-defined situations where the inputs can be described with a mathematical model, it handles problems with a large variety of inputs and a large input space much less well. A classical example is the recognition of handwritten digits. The MNIST [LC98] dataset contains 70 000 handwritten digits in 28 × 28 = 784 pixel images, collected from American Census Bureau employees and high school students. Due to the wide variability of handwriting, as illustrated in Figure 2.1, creating hand-crafted rules to recognize the digits based on the shape of the strokes


would require a lot of manpower and would lead to a high number of rules still containing exceptions.

Figure 2.1: Sample images from the MNIST dataset [LC98]. The upper row contains some variations of the digit 2, the lower row of the digit 4.

On the contrary, machine learning algorithms tune the parameters of adaptive models while being trained on the so-called training set. The training set contains n input vectors \{\vec{x}_1, \vec{x}_2, ..., \vec{x}_n\}, and for each input a target vector corresponding to the category of the input [Bis13]. After the training process, the quality of the model can be evaluated with the separate test set. It also contains input instances with known labels, so the predicted categories can be compared to the real categories. The ability to correctly classify these previously unknown instances is called generalization. This is the most important feature of machine learning algorithms and models, since normally the training set only contains a small fraction of all possible input instances [Bis13].

2.1.1 Types

There are different machine learning types, which can be used depending on the task and the characteristics of the dataset. The main types are depicted in Figure 2.2 and further described below.

Supervised learning

Supervised learning covers applications in which both the input vectors and the corresponding target vectors are known in the training set, as in the example of the classification of handwritten digits. If there is a finite number of categories, the task is called a classification problem, while in cases in which one or more continuous variables must be predicted, it is called regression [Bis13]. A typical example of the latter is the prediction of company sales depending on different factors such as store, promotion and competitor data. This thesis concentrates on supervised learning and classification.


Figure 2.2: Overview of machine learning with the most common settings: supervised learning (classification, regression), unsupervised learning (clustering, density estimation, dimensionality reduction, ...) and reinforcement learning.

Unsupervised learning

If the corresponding target vectors are not available or not used for the input vectors, the algorithm must find regularities in the data in an unsupervised way. These techniques may be used to discover similarities between different training instances, called clustering, or to project the data onto fewer dimensions in order to decrease the size of the input vectors and select the more relevant features, or to further reduce them to two or three dimensions for visualization. If the goal is to determine the distribution of the instances within the training set, density estimation can be used [Bis13].

Reinforcement learning

The motivation of reinforcement learning is to learn behavior and take actions depending on the environment in order to maximize a reward. Instead of training the network with concrete input and target vectors as in the case of supervised learning, the learning algorithm must discover the optimal steps by trial and error with constant feedback from the environment. A key parameter of a reinforcement learning process is the trade-off between the exploration of new actions and the exploitation of already known high-rewarding actions [Bis13]. An example of reinforcement learning is AlphaGo by Google [SHM+16], a reinforcement learning based system that learned to play Go by playing against itself.

2.1.2 Model selection and hyper-parameters

There are different kinds of machine learning models and algorithms, and selecting the best fitting one for the task is part of the development process of machine learning based solutions. Models often have parameters which are set before the beginning of the training and are not or only rarely adjusted during the training. These are called


hyper-parameters, in contrast to the normal parameters, usually called weights. Consider as a simple example the task of fitting a curve with a polynomial function. The polynomial function takes the form:

y(x, \vec{w}) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j    (2.1)

where the vector \vec{w} represents the weights that will be adjusted during the training, and M, the order of the polynomial, is a hyper-parameter that needs to be chosen in advance. While some algorithms, like k-Nearest Neighbors, have only one hyper-parameter, in other cases, e.g. in neural networks, this number can reach several tens [Bis13]. If there is enough training data available, the dataset can be divided into three parts, containing an extra validation set besides the training and the test set. The validation set is then used to evaluate the predictive performance of the different models and selected hyper-parameters, and in the end the best performing one is selected as the final model for the prediction. Although not strictly necessary, this is recommended, since if many hyper-parameter adjusting iterations are performed during the model selection, the model is no longer independent of the set used for this evaluation. Hence, if the validation set were the same as the test set, we could not estimate the predictive performance, and thus the generalization error, of the model on independent new instances [Bis13].

2.1.3 Capacity, overfitting and underfitting

The main causes of poor predictive performance in machine learning are overfitting and underfitting. Underfitting may occur if the model is not complex enough and thus does not have enough capacity to fit the underlying function, or if it was not trained long enough or with a sufficient number of samples. In the case of overfitting, the model may fit the training data perfectly but perform poorly on new data, since the good training performance is reached by memorizing the training data [GBC16]. Generally, machine learning algorithms perform best if their capacity matches the complexity required for the problem and the amount of training data available [GBC16]. To illustrate this issue, consider the example of Bishop [Bis13] of fitting the function sin(2πx) with the polynomial defined in Equation 2.1. The sample points were generated with the original function with random Gaussian noise of σ = 0.25 added. As can be seen in Figure 2.3, in the cases M = 0 and M = 1 the model is not complex enough to be able to model the trends of the data and is thus under-fitted. On the other hand, with M = 9 the polynomial fits the training data perfectly, however, the curve oscillates wildly and is therefore also unable to capture a good representation of sin(2πx).
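As a concrete illustration of this capacity trade-off, the following short sketch (not part of the original experiments, just an assumed NumPy reconstruction of Bishop's setup) fits polynomials of order M = 0, 1, 5 and 9 to noisy samples of sin(2πx) and reports how far each fit is from the true function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate 10 noisy samples of sin(2*pi*x), as in Bishop's example.
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.25, size=x_train.shape)

# Dense grid for measuring how well each fit matches the true function.
x_test = np.linspace(0.0, 1.0, 200)
t_true = np.sin(2 * np.pi * x_test)

for M in (0, 1, 5, 9):
    # Least-squares fit of a degree-M polynomial (Equation 2.1).
    coeffs = np.polyfit(x_train, t_train, deg=M)
    y_test = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_test - t_true) ** 2))
    print(f"M = {M}: RMSE against sin(2*pi*x) = {rmse:.3f}")
```

The under-fitted low-order models and the over-fitted order-9 model both tend to show a larger error on the dense grid than the intermediate orders.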

2.1.4 Pre-processing

Many machine learning algorithms require some pre-processing because they cannot handle or represent the data in its raw format, because it contains too much irrelevant information or because there are missing values [GBC16].



Figure 2.3: Polynomial fit (red) of the sin(2πx) function (green) with added noise; the blue circles represent the instances of the dataset. The four panels show fits of order M = 0, 1, 5 and 9: the polynomials of order 0 and 1 show severe underfitting, while the one of order 9 shows overfitting.

Computer audition applications usually require that the raw audio data is resampled to a common sample rate or converted to a spectrogram format. Categorical data must be encoded as numerical data, and the same holds for textual inputs. Machine learning algorithms often require the input variables to be normalized to an interval, e.g. [0, 1]. Normally, it is crucial to apply the same pre-processing steps to unknown instances during prediction as during the training [GBC16].

Data augmentation

As Goodfellow et al. [GBC16] state, the best way to reduce the generalization error is to train with more data. However, acquiring more data is often expensive or even impossible, so it might be feasible to add 'forged' training data. In the case of computer vision, data augmentation might mean rotating or flipping the image, or adding noise or other irrelevant objects [GBC16], while when working with audio samples, changing the pitch and volume of the recording or adding additional environmental sounds is a common practice [GGV+16]. However, one must be careful during this process: for example, in the case of the digit recognition task (see Figure 2.1) vertical flips should be avoided, as the network would then see no difference between '6' and '9', leading to worse results.


Generating additional training data turned out to be useful in bird song classification tasks as well, as shown by Sprengel et al. [SJKH16].
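The following sketch illustrates this kind of audio augmentation with librosa and NumPy; the parameter ranges (±2 semitones, ±3 dB, Gaussian noise) are arbitrary assumptions chosen for illustration, not the settings used in the cited works:

```python
import numpy as np
import librosa

def augment(y, sr, rng):
    """Return a randomly augmented copy of the waveform y (a sketch, not the
    exact augmentation pipeline of the referenced papers)."""
    # Random pitch shift of up to +/- 2 semitones.
    n_steps = rng.uniform(-2.0, 2.0)
    y_aug = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)
    # Random volume change of up to +/- 3 dB.
    gain_db = rng.uniform(-3.0, 3.0)
    y_aug = y_aug * (10.0 ** (gain_db / 20.0))
    # Add a small amount of Gaussian background noise.
    y_aug = y_aug + rng.normal(0.0, 0.005, size=y_aug.shape)
    return y_aug

# Example usage with a synthetic 1-second tone instead of a real recording.
sr = 22050
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
y_aug = augment(y, sr, np.random.default_rng(0))
```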

2.2 Neural networks

This section gives a brief historical overview of the emergence of artificial neural networks. In Section 2.2.1 the single perceptron is covered, followed by Multilayer Perceptrons in Section 2.2.2.

2.2.1 Perceptron

The idea of artificial neural networks can be traced back to the research of Warren McCulloch and Walter Pitts, who invented a linear threshold gate model for neurons in 1943 [MP43]. Rosenblatt built the model of the perceptron around this idea (see Figure 2.4). A perceptron k has three basic elements:

1. A set of input synapses and for each input j a corresponding weight wj.

2. An adder that combines the inputs.

3. An activation function which limits the output range of the neuron. The activation function is normally a non-linear function, as otherwise the perceptron would be a plain linear transformation, which forms only a small subset of all possible functions, greatly reducing its expressive power.

Figure 2.4: The Rosenblatt model of a perceptron: the inputs x_1, ..., x_n are multiplied by the weights w_k1, ..., w_kn, summed by the adder together with the bias b_k, and passed through the activation function ϕ(·) to produce the output y_k.

The model often includes an independent bias, denoted by bk, which can increase or lower the net input of the activation function. This can be depicted as an extra input x0 with a fixed value of 1 and with a weight w0 = bk. In this case the output yk of the perceptron k can be defined as follows:

y_k = ϕ\left( \sum_{i=0}^{n} w_{ki} x_i \right)    (2.2)


where x_1, x_2, ..., x_n are the input values, w_{k1}, w_{k2}, ..., w_{kn} the corresponding weights, x_0 = 1 and w_{k0} = b_k the bias, and ϕ(·) is the activation function. Rosenblatt used in his model the Heaviside function H(n), which is 0 for n < 0 and 1 for n ≥ 0; however, many other functions are in common use as well, such as the sigmoid function:

ϕ(x) = \frac{1}{1 + e^{-x}}    (2.3)

or the hyperbolic tangent function:

ϕ(x) = tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}    (2.4)

which are depicted in Figure 2.5 [GBC16].

Figure 2.5: Three typical activation functions: the Heaviside step function (a), the sigmoid function (b) and the hyperbolic tangent (c).
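A perceptron forward pass is only a weighted sum followed by an activation function, which the following NumPy sketch (with made-up weights and inputs) makes explicit for the three activation functions above:

```python
import numpy as np

def heaviside(z):
    # Heaviside step function: 0 for z < 0, 1 for z >= 0.
    return np.where(z >= 0.0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(x, w, b, activation):
    """Output of a single perceptron (Equation 2.2): activation(w.x + b)."""
    return activation(np.dot(w, x) + b)

# Hypothetical weights, bias and inputs for a perceptron with three inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.6, -0.2])
b = 0.1

for name, phi in [("Heaviside", heaviside), ("sigmoid", sigmoid), ("tanh", np.tanh)]:
    print(name, perceptron_output(x, w, b, phi))
```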

In 1969 Marvin Minsky and Seymour Papert pointed out a fundamental issue with perceptrons [MP69]: a single-layer perceptron cannot express a function for problems that are not linearly separable, which greatly limits its representational power. An example can be seen in Figure 2.6 with the exclusive OR (XOR) function.


Figure 2.6: The XOR function creates a set consisting of two classes that cannot be separated with a single straight line.

2.2.2 Feedforward Neural Networks

It was already known by 1969 that while perceptrons themselves have limited computing power, a layered system with a sufficient number of nodes is able to approximate any


computable function [MP69, p. 232]. However, it was not until 1974 that Paul Werbos invented the backpropagation method (see Section 2.2.3), which could be effectively used to train Multilayer Perceptrons (MLP) and which enabled the advancement of Artificial Neural Networks (ANN) [Wer74]. Feedforward NNs consist of an input layer, one or more hidden layers and an output layer. The nodes of neighboring layers are typically fully-connected and the output of a node is fed into the corresponding input of each perceptron in the following layer [Bis13, p. 227-229]. An example of this structure can be seen in Figure 2.7. There are several other Neural Network types, such as radial basis function or recurrent networks; however, this thesis focuses on fully-connected and convolutional feedforward networks.

Figure 2.7: A Multilayer Perceptron consisting of an input layer (features x_1, ..., x_n), hidden layers and an output layer (outputs y_1, y_2, y_3).

2.2.3 Backpropagation A key concept of Neural Networks is the ability to learn how to solve given tasks. For this purpose the optimal parameters of the network (e.g. weights) need to be found. During the training process training instances are presented to the network and the output is evaluated. Depending on the deviation of the actual output from the desired value the parameters need to be slightly adjusted in order to converge to the optimal solution. Although there are other algorithms as well, nowadays most of the feedforward neural network learning algorithms are based on the backpropagation algorithm. The aim of this algorithm is to iteratively find a network configuration in which the error when solving the problem is minimized. The pseudo-code of the algorithm can be found in Algorithm 2.1. Before the training, the weights of the network are initialized in some way, e.g. randomly. The algorithm consists of two phases. In the first phase, the forward phase, the input is fed forward in the network and the output vector ~y is calculated. This is the same

process that is used in a trained network for classification. In the second phase, also called the backward phase, firstly the difference between the desired and the actual output is calculated with a special function, called the loss function. A common choice for this is the Mean Squared Error (MSE) function:

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \vec{y}_i - \vec{d}_i \right)^2    (2.5)

where \vec{d} is the desired target vector. After that, the error value, which is the gradient of this loss function, is determined and propagated back to the hidden neurons. Their error values are also calculated and iteratively propagated back until the input neurons are reached. In the last step, the weights of each neuron are updated by subtracting a ratio of the weight's gradient from the weight. This ratio is called the learning rate. The learning rate can be either fixed or dynamic and has a great impact on the speed and the quality of the learning. The right choice of the learning rate is a crucial step in the training process: if it is too small, the neural network learns too slowly; if it is too big, the neural network fails to converge to a local optimum. A variant of this method is computing the update gradients for some portion of the training set and then updating the weights in one step. This mini-batch learning can lead to a more stable and faster convergence. The size of the mini-batch is a hyper-parameter which can be tuned manually for the learning process.


Step 1: Perform a forward propagation step, i.e. calculate the activations of the hidden and output neurons using the input vector \vec{x}_i.

Step 2: Calculate the error term δ_i^o = y_i − t_i for each output neuron, where y_i is the ith value of the output \vec{y} and \vec{t} is the target vector.

Step 3: Backpropagate the error term to the layers before the output layer using the backpropagation formula:

δ_i^l = f'(z_i^l) \sum_k w_{ki}^{(l+1)} δ_k^{(l+1)}    (2.6)

where δ_i^l is the error term of the ith neuron of the lth layer, f'(·) is the first derivative of the activation function f(·), z_i^l denotes the weighted sum of the inputs of the ith neuron in layer l, and w_{ki}^{(l+1)} is the weight of the connection between the ith neuron of layer l and the kth neuron of layer l + 1. δ_k^{(l+1)} is the error term of neuron k of the (l + 1)th layer.

Step 4: For each weight w_{ji}^l, calculate the gradient of the error function with respect to w_{ji}^l using the gradient descent formula:

\frac{∂E(W)}{∂w_{ji}^l} = δ_j^{(l+1)} z_i^l    (2.7)

Calculate the change of the weight w_{ji}^l using the learning rate η:

Δw_{ji}^l = η \frac{∂E(W)}{∂w_{ji}^l}    (2.8)

and subtract Δw_{ji}^l from w_{ji}^l in order to get the new weight (w_{ji}^l)' used in the next iteration.

Step 5: If there are further training samples available, continue with Step 1 using the input vector \vec{x}_{i+1}, otherwise stop the algorithm.

Algorithm 2.1: Pseudo-code of the backpropagation algorithm using gradient descent.
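The following minimal NumPy sketch implements these steps for a tiny network with one hidden layer of sigmoid units trained on the XOR problem of Figure 2.6. It is an illustration of the algorithm above, not the training code used in this thesis; the hidden layer size, learning rate and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: the XOR problem from Figure 2.6.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 8 sigmoid units; weights initialized randomly.
W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)
eta = 0.5  # learning rate

for epoch in range(20000):
    # Forward phase (Step 1).
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Backward phase: output error term (Step 2); constant factors of the
    # MSE loss (Equation 2.5) are absorbed into the learning rate.
    delta_out = (y - T) * y * (1 - y)
    # Backpropagate to the hidden layer (Step 3, Equation 2.6).
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # Gradients (Equation 2.7) and weight updates (Step 4, Equation 2.8).
    W2 -= eta * h.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid
    b1 -= eta * delta_hid.sum(axis=0)

print(y.ravel().round(3))  # should move towards [0, 1, 1, 0]
```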


2.3 Convolutional neural networks

Convolutional Neural Networks (CNN) were first proposed by Yann LeCun et al. in 1989 [LBD+89] in order to provide a method for computer vision that is invariant to shift and distortion. Although theoretically normal feedforward Neural Networks are also capable of this, a huge number of inputs and neurons is needed, which makes them very compute-intensive, and the so-called curse of dimensionality [GBC16, p. 155] becomes a significant problem: as the number of features (i.e. input variables) increases, the number of training examples should be increased accordingly, so that the network is trained with enough samples of each combination of values. For many problems, collecting training data is very time-consuming and expensive, or even impossible. CNNs are intended to mimic the mammalian visual system [GBC16, p. 365] by processing the images in a pipeline that starts with simple feature extraction and proceeds layer by layer towards more complicated structures and concepts. In the following, the most important components of CNNs are presented.

2.3.1 Convolutional Layer

Figure 2.8: A convolutional operator that can be used to detect horizontal edges in an image, also called the Sobel operator. The input image f is convolved with the 3 × 3 filter matrix g = ((1, 2, 1), (0, 0, 0), (−1, −2, −1)) to produce the output.

Usually the input of a convolutional layer is one or more two-dimensional matrices (images or feature maps). The layer consists of multiple convolutional filters (also called kernels), which are applied to the input matrix. An example of a convolutional filter can be seen in Figure 2.8. The filters are moved across the image, and at each pixel the matrix formed by the central pixel and its surrounding pixels is multiplied element-wise with the filter matrix; the elements of the resulting matrix are summed, and this sum becomes the value of the corresponding pixel in the output matrix, also called the feature map. The kernels are characterized by their size and their overlap, and their weights are learned during the training process. The set of the resulting feature maps forms the input of the next layer. This technique can also be applied to color images. In that case the filters are applied to all of the (usually 3) color channels. A simple overview of the convolutional layer is depicted in Figure 2.9.


The convolutional layer is usually followed by an activation layer, which just applies an activation function (see Section 2.2.1 and Section 2.3.5) to each element of the feature map matrices.

In the experiments of this thesis the first convolutional layer receives a 2D-spectrogram as an input.

Figure 2.9: A single convolutional layer featuring 3 convolutional filters and thus resulting in 3 feature maps. These maps are the inputs for the upcoming layer.
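The sliding-window computation described above can be written down directly; the following naive NumPy sketch (an illustration, not an efficient or library implementation) applies the Sobel filter of Figure 2.8 to a small synthetic image:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D sliding-window filtering ('valid' mode), as a convolutional
    layer applies its filters; no flipping of the kernel is performed."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The horizontal-edge (Sobel) filter from Figure 2.8.
sobel = np.array([[1, 2, 1],
                  [0, 0, 0],
                  [-1, -2, -1]], dtype=float)

# A small synthetic image: dark upper half, bright lower half.
image = np.zeros((6, 6))
image[3:, :] = 1.0

feature_map = conv2d_valid(image, sobel)
print(feature_map)  # large-magnitude responses along the horizontal edge
```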

2.3.2 Pooling Layer

The main purpose of the pooling layer is to reduce the resolution of the underlying feature maps and thereby the number of weights in the network. By discarding the exact feature location, some translation invariance and robustness against noise is introduced. However, while reducing the resolution, the pooling layer has to keep the relevant information from its input.

A commonly used pooling layer is max-pooling, which works similarly to the convolutional layer in the sense that it also operates with a sliding window, but from the sliding window only the maximum value is selected. The max-pooling operator is characterized by the kernel size and the stride, where the kernel size is the size of the sliding window and the stride is the number of items skipped while sliding the window. The working of max-pooling is illustrated in Figure 2.10.

Figure 2.10: A sample of max-pooling with kernel size 2 × 2 and stride 2: the input (26 55 84 21; 21 32 13 99; 56 43 49 22; 23 61 73 13) is reduced to the output (55 99; 61 73).
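The same example can be reproduced with a few lines of NumPy; the following sketch implements non-overlapping max-pooling and applies it to the input of Figure 2.10:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pooling with a sliding window of the given kernel size and stride."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# The example input from Figure 2.10.
x = np.array([[26, 55, 84, 21],
              [21, 32, 13, 99],
              [56, 43, 49, 22],
              [23, 61, 73, 13]])

print(max_pool(x))  # [[55. 99.] [61. 73.]]
```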


2.3.3 Dropout

As discussed before, complex models with many parameters may learn relationships between the inputs and outputs based on the sampling noise instead of the relevant training data, resulting in overfitting. The dropout method proposed by N. Srivastava et al. [SHK+14] addresses this issue with the basic idea that if an input parameter is not available, the model cannot mislearn it. For each sample in each training iteration, each node is disabled with a probability p (called the dropout rate), so that the node emits zero (see Figure 2.11). This makes the network more robust against noise that is normally unique to each training sample, in contrast to the shared relevant features. During prediction, no units are dropped, but the outputs of the nodes are scaled down according to their respective dropout probabilities in order to preserve the same expected values. Dropout heuristically simulates the case of having 2^N models (with N being the number of nodes) and averaging their results, without the need of actually computing each of them, resulting in less overfitting. Additionally, having fewer active nodes means fewer parameters, making the training iterations faster. The original paper suggests using 50% dropout for the hidden layers and less than 25% for the input layer [SHK+14].

Figure 2.11: Visualization of the effect of dropout on the network from Figure 2.7: a random subset of the input and hidden nodes is disabled during a training iteration.
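The following NumPy sketch illustrates the training-time behaviour of dropout. Note that it uses the common 'inverted dropout' variant, which rescales the surviving activations during training instead of scaling the outputs at prediction time; this is equivalent in expectation but not literally the procedure described in the original paper.

```python
import numpy as np

def dropout_train(activations, p, rng):
    """Inverted dropout during training: each unit is dropped with
    probability p and the survivors are scaled by 1/(1-p), so that no
    rescaling is needed at prediction time."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))           # a mini-batch of hidden activations
h_dropped = dropout_train(h, p=0.5, rng=rng)
print((h_dropped == 0).mean())        # roughly half of the units are zeroed
```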

2.3.4 Batch normalization

A known problem with the Stochastic Gradient Descent (SGD) algorithm is that during the training the distributions of the layer inputs/outputs change after each training iteration, so the successive layers have to keep adapting to them. This implies a lower learning rate, which results in slower training. Batch normalization has proved to be an efficient solution for this issue [IS15]. A batch normalization layer, as the name suggests, performs a normalization by subtracting the batch mean from the output of the previous activation layer and dividing it by the batch standard deviation. Therefore the subsequent layer always receives values normalized to zero mean and unit variance, avoiding the need for adaptation. The layer keeps track of the batch means and variances, which are then


averaged and used when the network is applied to the test set. Although calculating the mean and variance of the mini-batch is resource-intensive and may make the calculation of a single mini-batch iteration considerably slower, due to the higher possible learning rate the network learns faster and may, according to the original paper of S. Ioffe et al. [IS15], converge up to 14 times faster with a better accuracy. Additionally, it makes the network less sensitive to the initial values chosen.
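A training-time batch normalization step can be sketched in a few lines of NumPy; the learnable scale (gamma) and shift (beta) parameters of the original paper are included here even though the summary above omits them:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature to zero mean and unit variance,
    then apply the learned scale (gamma) and shift (beta) parameters."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mean, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(16, 4))   # a mini-batch of 16 samples
gamma, beta = np.ones(4), np.zeros(4)
y, batch_mean, batch_var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```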

2.3.5 Activation functions

Several different activation functions are used with convolutional neural networks besides the sigmoid, the hyperbolic tangent and the Heaviside function (see Section 2.2.1). This section covers only the Rectified Linear Unit (ReLU), a commonly used activation function, and the Exponential Linear Unit (ELU), a variation of the former.

Rectified Linear Unit According to LeCun et al. [LBH15], the ReLU was the most commonly used activation function in 2015 and has also been used in bird classification solutions since then [SJKH16][Pic16][TC16]. The ReLU is constant for x ≤ 0 and linear for x > 0 (see Figure 2.12). Formally:

f(x) = max(0, x) (2.9)

Glorot et al. have shown in 2011 [XCL16] that rectifiers may reach lower classification error rates than the hyperbolic tangent. Further advantages include the efficient propagation of the gradient because of the linearity above 0, the scale-invariance (max(0, αx) = α max(0, x)), and the efficient computation, since only a comparison is required [XCL16]. However, the function is unbounded, it cannot be differentiated at zero, and in some cases it may block the gradient backpropagation [XCL16]. Also, because the ReLU is always non-negative, using it might lead to an effect called bias shift, which means that the mean of the output of the activation function is always larger than or equal to the mean of the input. This may lead to extreme output values blocking the training, or to a higher sensitivity to the learning rate. Besides batch normalization, there are some other proposed solutions for these issues, such as the Leaky Rectified Linear Unit (LReLU) [MHN13] and the Parametric Rectified Linear Unit (PReLU) [HZRS15], but they are out of the scope of this thesis.

Exponential linear units ELUs were proposed as an alternative to ReLUs by Shah et al. [SKSS16]. The exponential linear unit is defined as follows:

f(x) = x             if x ≥ 0
f(x) = α(e^x − 1)    otherwise    (2.10)

where α is a hyper-parameter, usually fixed at α = 1.0. While the ReLU is always non-negative, the ELU can also take negative values, which may bring the mean activations nearer to zero and thus lessens the bias shift effect. It was shown in [SKSS16] that ELUs may reach the same or better performance as ReLU layers combined with


batch normalization, but without the extra computational complexity of batch normalization and thus with a faster training speed.

Figure 2.12: Plot of the ReLU function (a) and of the ELU function (b).
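Both functions are one-liners in NumPy; the following sketch evaluates them on a few points and hints at the bias shift argument:

```python
import numpy as np

def relu(x):
    # Equation 2.9: max(0, x)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Equation 2.10: x for x >= 0, alpha * (exp(x) - 1) otherwise
    return np.where(x >= 0.0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print("x:   ", x)
print("ReLU:", relu(x))
print("ELU: ", elu(x).round(3))
# The ELU outputs are negative for negative inputs, pushing the mean
# activation closer to zero and reducing the bias shift described above.
```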

2.4 Signal processing

This section presents a brief introduction to the signal processing techniques used in this thesis.

2.4.1 Spectrogram

A spectrogram is a visual representation of the frequencies occurring in a signal as they vary with time. For discrete signals, such as the digital recordings of bird songs, it is usually computed with the Short-Time Fourier Transform (STFT). The input signal \vec{x} = (x_1, x_2, ..., x_n) is broken up into overlapping frames and then for each frame the Discrete Fourier Transform (DFT) is computed. This results in the complex matrix:

STFT\{\vec{x}\}(m, ω) = X_m(ω) = \sum_{n=-∞}^{∞} x_n w(n − mH) e^{−iωn}    (2.11)

where x_n is the input signal at time n, H is the hop size and w(n) is the Hanning window:

w(n) = \frac{1}{2} \left( 1 − \cos\left( \frac{2πn}{M − 1} \right) \right),    0 ≤ n ≤ M − 1    (2.12)

in which M is the frame/window size [OSB99] [GL83] [MMN+17].
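In practice the STFT magnitude spectrogram can be computed with the librosa library mentioned in the next subsection; the following sketch uses a synthetic chirp as a stand-in signal, and the frame and hop sizes (1024 and 256 samples) are illustrative assumptions rather than the parameters used in this thesis:

```python
import numpy as np
import librosa

# A synthetic chirp stands in for a bird recording; any mono waveform works.
sr = 22050
y = librosa.chirp(fmin=500, fmax=8000, sr=sr, duration=2.0)

# STFT with a Hann window: frame size M = 1024 samples, hop size H = 256.
D = librosa.stft(y, n_fft=1024, hop_length=256, window="hann")
spectrogram = np.abs(D)          # magnitude spectrogram
print(spectrogram.shape)         # (1 + n_fft/2, number of frames)
```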

2.4.2 Mel Spectrogram

According to Stevens et al. [SVN37], the human ear perceives frequencies below 1 kHz on a linear scale and frequencies above this limit on a logarithmic scale. The Mel-frequency scaling, and consequently the Mel spectrogram, employs a nearly linear scaling up to 1000 Hz and increasingly larger intervals above it. There is no single


Mel-scale formula [War70], but the librosa library [MMN+17], which was used in the implementation part of this thesis, uses the formula of Slaney [Sla98]:

mel(f) = \frac{3}{200} f                                    if f ≤ 1000
mel(f) = 15 + \frac{27 (\log f − \log 1000)}{\log 6.4}      otherwise    (2.13)

2.4.3 Constant-Q transform

The Constant-Q transform (CQT) is a concept related to the STFT, as it also transforms data series to the frequency domain. Compared to the STFT it has a lower resolution in the higher frequency bins, which is analogous to the human hearing system, which can better distinguish lower frequencies from each other [SK10]. The CQT can be characterized by the quality factor Q, which is the number of complete cycles each window contains for the center frequency of its bin. While the window size is different for each bin, the frequency to bandwidth ratio (Q) remains constant, hence the name of the method. The frequency bins are thus spaced logarithmically instead of linearly, and the length of the window function decreases as the center frequency of the bin increases.

The window length for bin k is calculated from the sampling frequency f_s and the center frequency f_k, where f_s / f_k is the number of samples required to describe a full cycle of f_k and Q is the number of full cycles:

N(k) = Q \frac{f_s}{f_k}    (2.14)

The Hanning window function thus also becomes dependent on k:

w(n, k) = \frac{1}{2} \left( 1 − \cos\left( \frac{2πn}{N(k)} \right) \right),    0 ≤ n ≤ N(k)    (2.15)

Formula 2.11 is accordingly modified with these functions:

CQT\{\vec{x}\}(k, ω) = X_k(ω) = \frac{1}{N(k)} \sum_{n=−∞}^{∞} x_n w(n − mH, k) e^{−iωnQ / N(k)}    (2.16)
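Both perceptual representations are available in librosa; the following sketch computes a Mel spectrogram (with librosa's default Slaney-style filter bank) and a CQT for the same stand-in signal as above, with the band counts, minimum frequency and hop size chosen arbitrarily for illustration:

```python
import numpy as np
import librosa

sr = 22050
y = librosa.chirp(fmin=500, fmax=8000, sr=sr, duration=2.0)  # stand-in signal

# Mel spectrogram: an STFT followed by a Mel filter bank (Slaney formula,
# librosa's default), reducing 513 linear bins to e.g. 80 Mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Constant-Q transform: logarithmically spaced bins, here 7 octaves with
# 12 bins per octave starting at 60 Hz.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, fmin=60.0,
                         n_bins=84, bins_per_octave=12))

print(mel.shape, cqt.shape)  # both are (frequency bins, time frames)
```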

2.5 Summary

In this chapter an overview was given of the basic concepts of machine learning, an introduction to neural networks, an overview of CNNs and an introduction to signal processing, in order to form the theoretical basis for the following chapters.

CHAPTER 3 Related works

This chapter first discusses selected topics of acoustic signal processing with CNNs and then gives an overview of the bird classification challenges of the previous years and of the state-of-the-art at the time of writing of this thesis.

3.1 Acoustic signal processing with CNNs

The emergence of CNNs has led to a huge performance increase of the state-of-the-art solutions in many areas of the acoustic signal processing field [ZPB+17]. In current Automatic Speech Recognition solutions either CNNs are combined with Hidden Markov Models (HMM) [MDH12, HDY12, AMJP12, SPKL16] or Recurrent Neural Networks (RNN) are used instead. The currently known best-performing CNN solution by Zhang et al. [ZPB+17] does not reach the performance of the state-of-the-art RNN solutions, reaching an 18.2% error rate on the TIMIT dataset [GLF+93], compared to 17.7% reached by Graves et al. [GMH13]. Sound Event Detection and Classification (SED/C) has also been actively researched in recent years [MHD+17]. While the best performing solutions used to be HMM, Gaussian Mixture Model (GMM) or Support Vector Machine (SVM) based, starting in 2016 the field came to be dominated by deep learning solutions, and 3 of the 4 tasks of the DCASE20161 competition were won by CNN or RNN based solutions.

3.2 Bird classification challenges

As machine learning approaches gain importance, many new possible applications are being considered. However, often there is not enough data available to researchers for

1 Detection and Classification of Acoustic Scenes and Events


developing new techniques in these areas, which makes it hard to advance their state of the art, or researchers do not even think about these areas of use. The various machine learning competitions aim to solve this problem by providing datasets to train and test ideas on, to benchmark the state-of-the-art approaches, and to promote various topics and encourage their research. This section summarizes the competitions involving bird-vocalization classification and gives an overview of the state-of-the-art at the time of the writing of this thesis.

3.2.1 The ICML 2013 Bird Challenge

The first notable bird classification competition was held in June 2013 as part of the International Conference on Machine Learning (ICML) 2013 [ISS13a]. The competition was organized by the SABIOD MASTODONS CNRS group2 and the University of Toulon, and the data was provided by the Muséum national d'Histoire naturelle of Paris. The challenge was hosted on Kaggle3, a platform designed for data mining competitions. The dataset consisted of thirty-five 30-second long recordings for 35 different species as the training set and ninety 150-second long audio files as the test set. The instances of the two sets were recorded in disparate settings. The training set was recorded with directional microphones on a single day without rain and with no or minimal background noise present. Each instance contained the vocalization of a single species, usually with brief quiet periods. In contrast, the test set instances were recorded with omnidirectional microphones during a 60-day period in three different forest settings (mature, young, open) in the same area. The 150-second long recordings may thus contain significant noise, including wind, raindrops and echoes, and there are parts where no bird song is present and others where several birds sing at the same time. The ground truth was given for the first day of the test set, for validation purposes. There was no metadata available for the training besides the name of the bird. For the test set the weather, the date and time of the recording and the forest setting were given [ISS13a, p. 20-24]. The participants were able to submit up to two runs per day, and 400 runs of the 76 contestants were evaluated by the Kaggle system. The winning team used a then state-of-the-art syllable segmentation algorithm to select the relevant parts of the recordings. For these portions they extracted the Mel-Frequency Cepstral Coefficients (MFCCs) and Delta-MFCCs. After reducing the dimensionality with a Linear Discriminant Analysis projection, a bag of shallow ANNs was trained for the classification [ISS13a, p. 96-97]. All of the teams submitting a write-up used hand-crafted features for the classification.

3.2.2 MLSP 2013 Bird Classification Challenge

The second challenge was in fact the 9th Annual MLSP Competition of the 2013 IEEE Machine Learning for Signal Processing (MLSP) Workshop in September 2013 [BHR+13].

2 http://sabiod.univ-tln.fr
3 https://www.kaggle.com/c/the-icml-2013-bird-challenge


Similarly to the ICML 2013 challenge, MLSP 2013 was also hosted on Kaggle4. The dataset contained 645 ten-second long recordings of 19 different species from 13 different sites in the H. J. Andrews Long-Term Experimental Research Forest, collected over a two-year period with a special device with two omnidirectional microphones. For half of the dataset the set of birds audible in the instances was provided, forming the training set, the other half being the test set. Although not only the recording site but also the date/time of the recordings was available, the latter was prohibited from being used in any form. In addition to the dataset, the organizers provided intermediate data representations from a prior work [BLN+12], a hand-crafted feature set based on a method using a Random Forest classifier [Bre01]. This proved to be useful, since many of the top-10 teams used at least components of it. 79 teams entered the competition and made a total of 368 submissions. The winning team [Fod13] modified the baseline method by using a new segmentation algorithm and a set of template patterns, and incorporated the location information. A team using a convolutional network reached the 4th place (AUROC: 0.941 vs. 0.956 of the first team and 0.855 of the baseline method) without using the location information, which had meant a significant boost for other methods, and without any significant feature engineering. Little was made public about the exact network architecture used, which was trained using the stochastic diagonal Levenberg-Marquardt algorithm [OM98]. In order to reach better time-shift invariance, the instances were randomly shifted during the training, and the output was the average of five classification runs on various random shifts of the test inputs.

3.2.3 NIPS4B 2013

The third competition [ISS13b] was held in late 2013 as part of the 1st Big Bioacoustics Data Workshop of the Neural Information Processing Systems (NIPS) 2013 conference. It was organized by the same team as the ICML 2013 challenge and was also hosted on Kaggle5. The dataset, provided by the BIOTOPE6 society, contained 1 687 short recordings of various lengths (< 5.5 seconds) of 78 different bird species as well as 9 insects and 1 amphibian living in the vicinity of the birds. The nearly 2 hours of recordings were collected in France, using a device with two omnidirectional microphones for stereo recordings. They contain significant background noise, there are clips without any bird singing, and the loudness of the bird vocalizations varies. The environmental conditions were the same for the 687 samples of the training set and the 1 000 instances of the test set. Neither for the training set nor for the test set was any metadata provided, but in return the organizers provided a baseline method for the classification.

4 https://www.kaggle.com/c/mlsp-2013-birds
5 https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013
6 http://www.biotope.fr


The 401 runs submitted by the 32 entrants were evaluated with the Area Under the Curve (AUC) metric. The winning solution [ISS13b, p. 176-181] was based on the best performing solution of MLSP 2013 [Fod13], however with a more advanced sound event extraction algorithm and some additional features. All of the participants submitting a write-up used hand-crafted features for the classification.

3.2.4 BirdCLEF

The first BirdCLEF in 2014 [GGV+14] was a significant step compared to the previous challenges, as the number of species was increased by almost an order of magnitude and the data was collected by hundreds of partially amateur recordists without special equipment. The later BirdCLEFs have further raised the bar by doubling the number of species in 2015, introducing the soundscape recordings in 2016, and further increasing the dataset to 1 500 species in 2017. However, this makes the results of the competitions harder to compare.

BirdCLEF 2014

The dataset was provided by Xeno-canto7, a web-based collaboration project aiming to collect bird-vocalization recordings from all over the world. The 501 species with the highest number of recordings in the database were selected for the dataset. The dataset, consisting of 14 027 recordings, contained on average 28 samples per species, the extreme values being 15 and 91. The recordings were created by hundreds of recordists, with different equipment and in varying quality; however, they were normalized to 44.1 kHz sample rate and 16 bit mono format. Besides the most actively singing bird in the recording, various metadata was contained in the dataset, including other birds singing in the background, coordinates, elevation, date, time, type of the sound, taxonomy information, and the quality of the recording. The dataset was randomly split into a 2/3 training set and a 1/3 test set; however, recordings from the same person on the same day were considered as the same recording and were placed either only in the test or only in the training set [GGV+14]. The metadata of the test set was partially stripped, keeping only the spatial and date-time information. Each participating team could submit up to 4 runs, which were only evaluated after the end of the competition. Mean Average Precision (MAP) (see Section 4.5.1) was used as the metric, calculated for two categories: MAP of the most active species and MAP of all species occurring in the recordings. The winning team [Las14], also the winner of the NIPS4B competition, used metadata and hand-crafted features for the classification. The incorporated metadata were the year and month of the recording, the time converted to minutes, latitude, longitude and

7 http://www.xeno-canto.org

elevation, and the locality and author indices. The sound related features included 35 spectral, 13 cepstral, 6 energy and 3 voicing-related features, and the statistics of these, resulting in 6 669 features. These were calculated for the segments extracted with the same algorithm as in the NIPS4B submission. After a feature-selection step the number of features was reduced to 1 277 and an SVM [CV95] was trained using them. The method reached a 0.509 MAP score on the test set considering only the foreground species, and 0.451 when also including the birds singing in the background.

BirdCLEF 2015

In 2015, the BirdCLEF2014 dataset was extended with about 20 000 extra recordings, bringing the total number of species to 999. The additional data was also provided by Xeno-canto and was normalized to the same format as the dataset of the previous year. The rules and metrics also remained unchanged [GGV+15]. In this year 6 teams completed a submission and the same team won the competition as in 2014 [Las15]. They used an improved version of their 2014 algorithm, using among others a better feature-selection method and bagging. The method reached 0.454 MAP on the foreground species, and 0.414 on the foreground and background species combined. Although this is a worse result than in 2014, the total number of species had doubled in this year.

BirdCLEF 2016

In 2016 the training set remained the same as in 2015; however, the test set was extended with soundscape recordings [GGV+16]. These recordings are several minutes long and were made with omnidirectional microphones installed in various locations. The recordings had to be split into 5-second long segments and for each segment the birds present had to be predicted. There were segments containing no birds and segments with multiple birds singing. It could also happen that a series of songs was cut in half at a segment border. This was the first BirdCLEF with entrants using convolutional neural networks:

BME TMIT The BME TMIT team downsampled the audio files to 16 kHz and applied a low-pass filter on the Fourier-transformed recordings to extract the essential parts of the spectrograms, yielding 200 frequency bins. The spectrograms were then cut into 30 × 10 sized cells, and the irrelevant parts of the spectrogram were removed based on the mean and the variance of these cells. After the pre-processing, the remaining parts were cut into 5-second long pieces and the resulting 200 × 310 arrays were fed into a convolutional neural network [TC16]. The team evaluated two different network architectures, the first network being based on the AlexNet [KSH12] architecture, but with batch normalization added. The second architecture was a much simpler one, consisting of four


convolutional layers with ReLU activation and a smaller fully connected layer. The two architectures performed very similarly, reaching 0.426 MAP for the foreground species, 0.338 for fore- and background, and 0.053 on the soundscape recordings. WUT The WUT team [Pic16] started with perceptual weighting using peak power as reference, followed by scaling and thresholding to reduce noise. Lastly, Chambolle’s algorithm [Cha04] was applied on the spectrograms in order to further reduce noise. During the training, they randomly selected 5-second-long parts from the recordings, while shorter recordings were padded to 5 seconds. Three different network architectures were evaluated: one with a single convolutional / max-pooling layer with LReLU activation, followed by a fully-connected layer with PReLU as the activation function. The second architecture consisted of 4 convolutional layers with LReLU and no fully-connected layer at the end. The third architecture was an extension of the second, adding an additional layer and more convolutional filters. Additionally, they built an ensemble method by averaging the results of the three architectures, which then yielded the best performance. They reached a MAP of 0.529 for the foreground, 0.412 for the fore- and background, and 0.036 for the soundscape recordings. Cube The Cube team was the best performing team in 2016, reaching a MAP of 0.555 when considering the background species as well, 0.686 for the foreground species only, and 0.072 for the soundscape recordings [SJKH16]. They were the only team able to outperform the winning team of the previous contests. They used a CNN with five convolutional layers (see Table 3.1) with the input being the raw spectrogram of the preprocessed signal. As the method described in this thesis is based on the solution of the Cube team, it is covered later in greater detail.

3.3 Summary

The aim of this chapter was to give an overview of the recent emergence of CNN-based solutions in acoustic signal processing, especially in bird song identification, and to introduce the previous bird song classification challenges. The methods developed for these contests form the state of the art of acoustic bird identification at the time of writing of this thesis and are therefore directly related to the methodology described in the next chapter.


Layer                 Configuration
InputLayer
Dropout               probability = 0.2
BatchNormalization
Convolution2D         64 5 × 5 kernels, 2 × 1 stride, ReLU
MaxPooling2D          2 × 2 kernels, 2 × 2 stride
BatchNormalization
Convolution2D         64 5 × 5 kernels, 1 × 1 stride, ReLU
MaxPooling2D          2 × 2 kernels, 2 × 2 stride
BatchNormalization
Convolution2D         128 5 × 5 kernels, 1 × 1 stride, ReLU
MaxPooling2D          2 × 2 kernels, 2 × 2 stride
BatchNormalization
Convolution2D         256 5 × 5 kernels, 1 × 1 stride, ReLU
MaxPooling2D          2 × 2 kernels, 2 × 2 stride
BatchNormalization
Convolution2D         256 3 × 3 kernels, 1 × 1 stride, ReLU
MaxPooling2D          2 × 2 kernels, 2 × 2 stride
BatchNormalization
Flatten               16384 neurons
Dropout               probability = 0.4
Dense                 1024 neurons
Dropout               probability = 0.4
Dense                 999 neurons

Table 3.1: The network architecture of the winning team of BirdCLEF 2016. It works directly on the spectrogram and contains five convolutional layers with ReLU activation function.


CHAPTER 4 Methodology

In this chapter, first the dataset used for the evaluation of the methods is presented in detail in Section 4.1, followed by a discussion of the sound and metadata pre-processing methods in Section 4.2. The data augmentation methods used in this thesis are presented in Section 4.2.3. Section 4.3 introduces the two novel network architectures that were closely evaluated in this thesis, Section 4.4 describes the two different methods used for the calculation of the final predictions and, finally, Section 4.5 presents the evaluation metrics used.

4.1 Dataset

As the ground truths for the test sets have so far not been released in the BirdCLEF competitions, the organizers are able to reuse the datasets in subsequent years. The BirdCLEF 2017 dataset is thus a superset of the BirdCLEF 2014-2016 datasets. The training set comes from a single source, the Xeno-canto collection. It contains the 1 500 bird species with the most recordings from different locations in Brazil, Surinam, Colombia and neighboring countries, a total of 36 496 samples. Most of the recordings concentrate on a single bird recorded with a directional microphone [JGG+17]. The vast majority of the test set, 12 347 samples, is also from the Xeno-canto collection; however, it was extended in 2016 with an additional set of 925 soundscape recordings. These were taped with omnidirectional microphones at specific locations for an extended time. Most of them are about 10 minutes long and consist of 10-12 successive recordings. On average about 10 birds are audible in a single sample; however, there are some with only one single bird, and others with 25 different species singing in the 10-minute-long segment. The test set was further extended for BirdCLEF 2017 with 6.5 hours of soundscape recordings. About 2 hours of them were recorded next to the Amazon river,


in the Peruvian basin, from the highest point of the area. Although the recordings were carried out with multiple systems, they are all sampled at 96 kHz and stored as 24-bit PCM wav files. Another 4.5 hours of soundscapes were recorded in various parts of Colombia; these recordings are sampled at 44.1 kHz. The total length of the recordings is over 237.5 hours, with the shortest being under 0.1 seconds while the longest is over 45 minutes. The average length is 34 seconds and the median is 25 seconds. The number of recordings per species also varies significantly in the training set, e.g. there are only 4 samples for Laniocera rufescens, while the dataset contains 160 items for Henicorhina leucophrys [JGG+17]. As a result of the high biodiversity of the Amazon Basin and the size of the collection, almost all taxonomical orders of birds are represented in the dataset. Figure 4.1 gives an overview of the size of each order, in terms of the number of recording instances in each order and family. Passeriformes (perching birds) is by a large margin the most significant order of the collection (25 118 instances of the 36 496 total), which is not surprising, as it includes more than half of the currently known bird species [GD17].

4.1.1 Metadata

Similarly to the previous years, the organizers provided for each recording an XML file that contained various metadata, such as the type of the vocalization (call, song, alarm, flight, etc.), date/time, location, scientific name, quality of the recording, author, notes, and taxonomical categorization. However, for the test set only the location and the date/time information was available. Figure 4.2 gives an overview of the distribution of the metadata values in the training set. The elevation varies from 0 m to 4600 m and 2/3 of the instances lie below 1100 meters. There are recordings from the whole Amazon basin; however, there are some hotspots near larger cities such as Rio de Janeiro and Sao Paulo, and in the western parts of Colombia. The day parts displayed in the third plot of Figure 4.2 were calculated using the method described in Section 4.2.2. As the plot indicates, 75% of the recordings were carried out before noon, the time at which bird activity is the highest [BB02, BVHB99]. A sample metadata XML can be seen in Listing 4.1.


Figure 4.1: The bird orders in the BirdCLEF 2017 dataset. The single-colored large rectangles represent the orders, while their sub-rectangles represent the families. For some orders only a single family was present in the dataset. The size of the rectangles corresponds to the number of recordings in the families and orders.

Latitude: -19.2204; Longitude: -42.4832; Elevation: 260; Author: Roney Assis Souza; File identifier: AJYSYOLBOF; Vocalization type: song; Comments: bird-seen:no, playback-used:no; Quality: 0; Subset: BirdCLEF2015; Order: Passeriformes; Family: Tyrannidae; Species: Myiopagis caniceps; English name: Grey Elaenia. (The XML element tags were lost during text extraction; the field labels above are inferred from the metadata description in Section 4.1.1.)


Figure 4.2: The distribution of the metadata in the BirdCLEF2017 dataset. (i) presents the distribution of the different elevation values (0 m to 4500 m), the map in (ii) plots the recording positions and (iii) displays the parts of the day according to the time of the recordings (75% Forenoon, 15% Afternoon, with the remaining roughly 10% split between dawn, dusk and night). The background of (ii) is based on the NASA 16-day vegetation index photo.

Listing 4.1: The metadata file for LIFECLEF2015_BIRDAMAZON_XC_WAV_RN32208.wav

4.2 Pre-processing

This part first describes, in Section 4.2.1, the separation of the relevant bird-singing parts of the recordings from the background noise on a sample instance from the dataset, along with the way they are prepared for the fixed-size input of the neural network. In Section 4.2.2 the metadata pre-processing steps are discussed, while Section 4.2.3 details the methods that were used to augment the training set in order to counter over-fitting.

4.2.1 Sound pre-processing

The sound pre-processing is illustrated on the LIFECLEF2014_BIRDAMAZON_XC_WAV_RN24.wav sample, of which the spectrogram can be found in Figure 4.3. For this purpose, the method formulated in [SJKH16] is applied with some extensions. At the end of this process, the audio recordings are split into three parts: sound, noise and irrelevant segments. First, the spectrogram of the sound file is computed using STFT with a Hanning window function (Equation 2.12) with window size 512 and 75% overlap. The resulting matrix is normalized to the interval [0,1] and is then treated as a grayscale image. In the vast majority of the recordings the foreground bird singing has a higher amplitude than the background noise, resulting in brighter pixels in the grayscale spectrogram


Figure 4.3: Original spectrogram

Figure 4.4: Selected pixels of the spectrogram

Figure 4.5: Selected pixels after erosion and dilation

Figure 4.6: The indication vector of selected pixels after two dilations. The black pixels in the top line correspond to the selected columns for the signal set, while the bottom line corresponds to the selected noise samples.

image. With a statistical method the relevant pixels are selected: each pixel is set to 1 if its value is larger than three times the median of its row and larger than three times the median of its corresponding column; otherwise it is set to 0. The effect of this step is illustrated in Figure 4.4. This results in a noisy image, thus two image-morphological operations are applied, binary erosion and dilation. As suggested in [SJKH16], a filter of size 4 × 4 is used. This step results in an image similar to Figure 4.5. This image is vertically compressed into an indication row vector, containing a 1 if the corresponding column contains at least one positive pixel, and a 0 otherwise. To avoid short segments, the vector is dilated twice. The indicator vector is then scaled up to the original length of the recording and is used as a mask to extract the relevant sound parts. The same method with different parameters is applied to extract the noise from the sound part: the median clipping threshold is set to 2.5 instead of 3, and the resulting image is inverted. The further steps with the inverted image are the same as in the sound extraction method. This results in two distinct subsets of the original sound file, with columns containing pixels that have a value larger than 2.5× and smaller than 3× the median of the row/column being ignored, as they cannot be clearly assigned to either the sound or the noise part. The indication vectors for the sample sound file are depicted in Figure 4.6.
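A minimal Python sketch of this column-selection step is given below, assuming a magnitude spectrogram normalized to [0, 1]; the function and parameter names are illustrative and not taken from the thesis.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def select_columns(spec, clip_factor=3.0, invert=False):
    # Median clipping: keep pixels larger than clip_factor times both the
    # row median and the column median (clip_factor = 2.5 for the noise mask).
    row_med = np.median(spec, axis=1, keepdims=True)
    col_med = np.median(spec, axis=0, keepdims=True)
    mask = (spec > clip_factor * row_med) & (spec > clip_factor * col_med)
    if invert:  # the noise extraction works on the inverted image
        mask = ~mask
    # Binary erosion followed by dilation with a 4 x 4 filter removes speckles.
    structure = np.ones((4, 4), dtype=bool)
    mask = binary_dilation(binary_erosion(mask, structure), structure)
    # Compress vertically: a column is kept if it contains any positive pixel,
    # then dilate the indicator twice to avoid very short segments.
    indicator = binary_dilation(mask.any(axis=0), iterations=2)
    return indicator  # boolean mask over the spectrogram columns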

Figure 4.7: After erosion and dilation the sound file is split into a sound and a noise part (panels: the original sound file, the extracted bird-song part, the noise part). The parts that are not significantly different from the noise or sound part are left out.

A limitation of this process is that it is purely statistics-based and ignores the semantic content of the recording. Therefore, for recordings containing significant noise, or containing only bird singing without any break, it may consider the whole recording as noise and yield no or only brief sound segments.


To avoid this issue a minimum threshold length of 32 768 samples is selected, as this is the minimum length of sound chunks supported by most of the network architectures evaluated in this thesis. Both the sound and the noise separation thresholds are iteratively lowered from 3 and 2.5 in steps of 0.1 until either the whole file is considered as sound or the sound part is at least 32 768 samples long. The spectrogram of the resulting sound and noise parts is shown in Figure 4.7. The evaluated network architectures need a fixed-size input during the training. The batches are composed by randomly selecting 16 files (equaling the batch size) and selecting a random segment from each of the 16 files. As some recordings in the dataset contain fewer than 32 768 samples, in such cases the recording is concatenated with itself multiple times until it contains more than 32 768 samples. These segments are then converted, depending on the sound representation used, into STFT, Mel- or CQT-spectrograms.
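A small sketch of the chunk selection, including the self-concatenation of short recordings, could look as follows (illustrative names; the thesis does not give an implementation):

import numpy as np

CHUNK_LEN = 32768  # minimum chunk length supported by the evaluated architectures

def random_chunk(samples, chunk_len=CHUNK_LEN, rng=np.random):
    # Tile recordings that are shorter than one chunk with themselves.
    if len(samples) <= chunk_len:
        reps = int(np.ceil((chunk_len + 1) / len(samples)))
        samples = np.tile(samples, reps)
    # Pick a random fixed-length segment from the (possibly tiled) recording.
    start = rng.randint(0, len(samples) - chunk_len)
    return samples[start:start + chunk_len]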

4.2.2 Metadata pre-processing

As the provided metadata is represented in various formats, including missing values, metadata pre-processing cannot be avoided. In the training set, the missing values are randomly recreated using the metadata of birds of the same species where these data were available: the mean and the variance are calculated from the known data, and Gaussian-distributed random values are generated for the missing attributes using this µ and σ. Only the date, the time of the day, the elevation and the coordinates are available for both the training set and the test set (apart from the missing values). Earlier ornithological works suggest that there is a strong correlation between the melatonin levels of the birds (directly influenced by the daylight) and the bird song intensity [BB02, BVHB99]. However, as the time of the sunrise and sunset depends not just on the coordinates but also on the date, the time of the day is not directly related to the position of the sun and the amount of daylight. Thus, instead of directly using the time values, the day has been split up into six parts depending on the position of the sun in the sky. The angle of the sun above the horizon is approximated with the algorithm presented in [FP79]. Although the time of the sunrise and the sunset partially depends on the elevation as well, only the coordinates and the day of the year are used in the calculations. The following parts of a day have been defined (illustrated in Figure 4.8; a small sketch of this mapping follows the list below):

• Night1 - From midnight until the sun is 9° below the horizon (BTH) • Dawn - From 9° BTH until 4° above the horizon (ATH)

• Forenoon - From 4° ATH until noon

• Afternoon - From noon until 4° ATH

• Dusk - From 4° ATH to 9° BTH

• Night2 - 9° BTH until midnight
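The mapping from the sun's elevation angle to these six categories can be sketched as follows; the helper computing the elevation angle (an implementation of the approximation in [FP79]) is assumed to exist and is not shown here.

def part_of_day(sun_elevation_deg, before_noon):
    # Map the sun elevation (degrees above the horizon) and the half of the
    # day to one of the six categories defined above.
    if sun_elevation_deg < -9.0:
        return "Night1" if before_noon else "Night2"
    if sun_elevation_deg < 4.0:
        return "Dawn" if before_noon else "Dusk"
    return "Forenoon" if before_noon else "Afternoon"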


Figure 4.8: Parts of the day depending on the position of the sun.

9° BTH is selected because it lies between the nautical twilight (i.e. the horizon is visible) and the civil twilight (i.e. terrestrial objects are visible to the human eye); 4° ATH is selected arbitrarily as a point where the sun is already clearly above the horizon. A subset of the recordings contained the time information as a textual value for the part of the day, such as forenoon or morning. These were directly converted to the above categories.

4.2.3 Data Augmentation

Although the training set provided for the BirdCLEF competition is the largest collection of bird songs currently available for machine learning research, with 36 496 recordings for 1 500 species, this still means only about 24 recordings per species on average. As the sounds can be vastly different even for the same species in different situations, e.g. calling, singing, alarming or flying, this can easily lead to overfitting. In order to avoid that, various data augmentation techniques were applied to the training set, so that the network can be trained with vastly different recordings and conditions.

Most of the augmentation steps are similar to the ones used in [SJKH16]. However, it was found that in addition to the data augmentation steps proposed in [SJKH16], small variations in the amplitude, introducing more variance while adding noise and songs from the same species, and combining dampened recordings of other birds singing in the surrounding area can further improve the accuracy, resulting in the data augmentation set detailed below.

The data-augmented chunks were generated on the fly, in parallel to the training process. For a training epoch, batch_size × number_of_training_instances chunks were generated, resulting on average in 16 chunks per file. This method also aimed to avoid the over-representation of longer recordings, as on average, regardless of the length of the recording, the network is trained with the same number of instances per recording.


Noise overlay In order to make the network focus on the bird vocalization, random noise is added to the recordings. The noise samples were taken from the noise set created during the signal/noise separation step of the pre-processing (see Section 4.2.1). The selected noise samples always belong to files of the training set in order to avoid training the network on the validation data and thus degrading the independence between the training and the test set.

Up to four random noise samples $\vec{n}_{r_n}$ are taken, and each is added with 75% probability to the sound sample $\vec{s}$ of the same length:

$\vec{s}_{aug} = \vec{s} + \delta_1\alpha_1\vec{n}_{r_1} + \delta_2\alpha_2\vec{n}_{r_2} + \delta_3\alpha_3\vec{n}_{r_3} + \delta_4\alpha_4\vec{n}_{r_4}$   (4.1)

where $\delta_n \in \{0, 1\}$ is a random indicator variable with $P(\delta_n = 1) = 0.75$, and $\alpha_n \in [0.36, 0.44]$ is a random dampening factor. This method gives additional variance compared to the method used in [SJKH16], where always three noise samples were added. It results in some segments containing no extra noise overlay at all, while others have four different ones (a Python sketch of this and the other amplitude-related augmentations follows after this list). The evaluations have shown that adding more noise to the samples does not provide any benefits, but often lowers the prediction accuracy. It was confirmed that greatly lowering the volume of the noise samples (as described in [SJKH16]) has adverse effects, so the volume of the overlay samples was only randomly changed by ±10%, thus further increasing the variability. Combining same-class audio files It was shown in [SJKH16] that combining samples of birds from the same class makes the convergence faster and provides additional data augmentation, as it is thoroughly possible that there is more than one bird of the same species in an area. Therefore, with a probability of 70%, recordings of birds from the same species are overlaid with a random damping factor between 20% and 60%:

$\vec{s}_{aug} = 0.6\vec{s} + 0.4\alpha\vec{x}_2$   (4.2)

where $\alpha \in [0.96, 1.04]$ is a random dampening factor. Combining birds from the neighboring area It is assumed that if different species are present in the same area, they might be audible in the background. Thus, samples from random birds from the neighborhood (defined as the area within 1° in the East/West/North/South directions from the position of the bird being trained on) were overlaid on the chunks with 30% ± 5% damping. Note that this method leads to an unavoidable decrease of the prediction accuracy when the background species present in the recordings are also considered. Random cut As described in [SJKH16], the spectrograms are randomly cut into two parts after the augmentation steps above. The two parts are then concatenated


again in reversed order. Although this creates a sharp edge at the cut, the network learns to deal with irregularities in the spectrogram and the method introduces time invariance. Volume shift The volume of the input audio is randomly changed by ±5%:

$\vec{s}_{aug} = (1 + \alpha)\vec{s}$   (4.3)

where $\alpha \in [-0.05, 0.05]$ is chosen randomly from a uniform distribution. Pitch shift The pitch of the input audio is randomly changed by ±5%. The pitch shift was accomplished by shifting the spectrogram matrix vertically and filling the new rows with the mean of the preceding rows, instead of performing a true pitch shift in the frequency domain.
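The amplitude-related augmentations above (Equations 4.1 and 4.3) and the spectrogram-based pitch shift can be sketched in Python as follows; this is only an illustration under the stated probabilities and dampening ranges, not the exact implementation, and it assumes that noise chunks are at least as long as the signal.

import numpy as np

def overlay_noise(signal, noise_pool, rng=np.random):
    # Equation 4.1: up to four noise chunks, each added with probability 0.75
    # and a random dampening factor alpha_n in [0.36, 0.44].
    augmented = signal.copy()
    for _ in range(4):
        if rng.rand() < 0.75:
            noise = noise_pool[rng.randint(len(noise_pool))]
            augmented += rng.uniform(0.36, 0.44) * noise[:len(signal)]
    return augmented

def volume_shift(signal, rng=np.random):
    # Equation 4.3: random volume change of +-5%.
    return (1.0 + rng.uniform(-0.05, 0.05)) * signal

def pitch_shift(spec, max_shift=0.05, rng=np.random):
    # Shift the spectrogram vertically by up to +-5% of the frequency bins and
    # fill the emptied rows with the mean of the shifted-out rows (an
    # approximation of the fill strategy described in the text).
    n_bins = spec.shape[0]
    shift = int(round(rng.uniform(-max_shift, max_shift) * n_bins))
    if shift == 0:
        return spec
    shifted = np.roll(spec, shift, axis=0)
    if shift > 0:
        shifted[:shift, :] = spec[:shift, :].mean(axis=0, keepdims=True)
    else:
        shifted[shift:, :] = spec[shift:, :].mean(axis=0, keepdims=True)
    return shifted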

4.3 Network architectures

Two different CNN architectures were evaluated closely. The first one was inspired by the winning architecture of BirdCLEF 2016 [SJKH16] and by AlexNet [KSH12], and takes different sound representations as input. The second is a parallel architecture that takes two Mel-spectrograms with different temporal and frequency resolutions. All network inputs were scaled with a normalization scaler that was trained on 2 000 × 16 = 32 000 samples generated by the data augmentation method.

4.3.1 Pyramidal convolutional network

A problem with the architecture in [SJKH16] is that, as it takes the raw DFT spectrogram as input, the number of input parameters is high, which makes the training considerably slower. As the dataset is large and the data augmentation yields many instances, faster training not only makes faster experimenting possible, but the final trained network can also see more instances in the same amount of time. In order to reduce the number of parameters, the Mel scale (see Section 2.4.2) was applied on the spectrogram and the Mel-spectrogram was fed into the network. The Mel scale was originally intended to model the human perception of sounds; part of this thesis was to evaluate the effectiveness of the Mel scale in convolutional neural network based bird song classification. The number of Mel bands was 80, thus reducing the number of input parameters in the case of an FFT of window size 512 from 256 × 512 = 131 072 to 80 × 512 = 40 960. Besides Mel, experiments have been carried out using CQT [SK10], as suggested by [LS16], and with STFT spectrograms without Mel scaling. 80 bins are used for the CQT transformation with 12 bins per octave, a minimum frequency of 200 Hz and a hop length of 128. The STFT variant is used with 256 and 512 window sizes with 75% overlap.
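Under the assumption that a library such as librosa is used (the thesis does not prescribe one), the three representations can be computed roughly as follows; parameter names follow librosa's API and the sample rate is only an example.

import librosa
import numpy as np

def mel_spectrogram(y, sr=44100, n_fft=512, n_mels=80):
    # Mel-scaled magnitude spectrogram with 80 bands and 75% window overlap.
    return librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                          hop_length=n_fft // 4, n_mels=n_mels)

def stft_spectrogram(y, n_fft=512):
    # Plain STFT magnitude spectrogram (window sizes 256 and 512 were evaluated).
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 4))

def cqt_spectrogram(y, sr=44100):
    # Constant-Q transform: 80 bins, 12 bins per octave, fmin = 200 Hz, hop 128.
    return np.abs(librosa.cqt(y, sr=sr, hop_length=128, fmin=200,
                              n_bins=80, bins_per_octave=12))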


Batch normalization can speed up the convergence of networks with the ReLU activation function; however, it adds considerable extra computational effort. In some cases ELUs can speed up the convergence of the networks without the need for batch normalization, and can thus speed up the training. This architecture consists of 5 convolutional layers, each followed by a max-pooling layer; after flattening, the result is merged with the metadata in a fully-connected layer followed by the output layer. The metadata was connected to a dense layer of 100 neurons before merging. Dropout was applied on the input layer (p = 0.2) and before each of the dense layers (p = 0.4).

4.3.2 Metadata fusion

The metadata was integrated into the network as an input vector of 7 elements:

1. Are coordinates available (1 if available, 0 if not)

2. Latitude (normalized to 0..1)

3. Longitude (normalized to 0..1)

4. Is elevation available (1 if available, 0 if not)

5. Elevation (normalized to 0..1)

6. Is part of day available (1 if available, 0 if not)

7. Part of day (normalized to 0..1)

The input layer for the metadata is connected to a dense layer of 100 neurons. This fully connected layer is merged with the output of the last max-pooling layer of the spectrogram branch and fed into a dense layer. The architecture is visualized in Table 4.1; a small sketch of the metadata vector construction is given below.
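A sketch of how such a 7-element vector could be assembled is given below; the normalization ranges are assumptions for illustration (the thesis only states that the values are normalized to 0..1).

def metadata_vector(lat=None, lon=None, elevation=None, part_of_day=None):
    # part_of_day is expected as an index 0..5 over the six categories of Section 4.2.2.
    def norm(value, lo, hi):
        return (value - lo) / (hi - lo)

    has_coords = lat is not None and lon is not None
    has_elev = elevation is not None
    has_pod = part_of_day is not None
    return [
        1.0 if has_coords else 0.0,
        norm(lat, -90.0, 90.0) if has_coords else 0.0,
        norm(lon, -180.0, 180.0) if has_coords else 0.0,
        1.0 if has_elev else 0.0,
        norm(elevation, 0.0, 4600.0) if has_elev else 0.0,
        1.0 if has_pod else 0.0,
        part_of_day / 5.0 if has_pod else 0.0,
    ]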

4.3.3 Parallel pyramidal convolutional network

Increasing the STFT window size yields a better frequency resolution; however, the temporal resolution (rhythm) gets worse, so there is always a trade-off between the two, which can lead to the loss of important information. A possible solution to this problem is to train two networks as described in Section 4.3.1 with different STFT window sizes and then combine their predictions in an ensemble predictor. However, the combination of the two predictions includes some manually tuned parameters, which could possibly be learned by the network itself instead. This network has two different input layers for the two different STFT window sizes. The configuration of the convolutional/max-pooling part is identical to the network


Layer (spectrogram branch)    Configuration
InputLayer (256)
Dropout                       p = 0.2
Convolution2D                 64 5 × 5 kernels, 2 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 64 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 128 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 256 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 256 3 × 3 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Flatten

Layer (metadata branch)       Configuration
InputLayer (Metadata)
Dense                         n = 100

Layer (merged)                Configuration
Merge                         spectrogram branch + metadata branch
Dropout                       p = 0.4
Dense                         n = 1024, ELU
Dropout                       p = 0.4
Dense                         n = 1500, Softmax

Table 4.1: The pyramidal convolutional network architecture with metadata fusion.
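A minimal sketch of the architecture of Table 4.1, written with the Keras functional API, is shown below; the framework choice, input shapes and padding are assumptions made for illustration, and the exact layer shapes in the thesis differ with the chosen sound representation.

from tensorflow.keras import layers, models

def build_pyramidal_network(spec_shape=(80, 512, 1), n_meta=7, n_classes=1500):
    # Spectrogram branch: five convolution / max-pooling blocks with ELU.
    spec_in = layers.Input(shape=spec_shape)
    x = layers.Dropout(0.2)(spec_in)
    for filters, kernel, stride in [(64, 5, (2, 1)), (64, 5, (1, 1)),
                                    (128, 5, (1, 1)), (256, 5, (1, 1)),
                                    (256, 3, (1, 1))]:
        x = layers.Conv2D(filters, kernel, strides=stride,
                          activation='elu', padding='same')(x)
        x = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)
    x = layers.Flatten()(x)

    # Metadata branch: the 7-element vector connected to 100 neurons.
    meta_in = layers.Input(shape=(n_meta,))
    m = layers.Dense(100)(meta_in)

    # Merge the two branches and classify.
    merged = layers.concatenate([x, m])
    merged = layers.Dropout(0.4)(merged)
    merged = layers.Dense(1024, activation='elu')(merged)
    merged = layers.Dropout(0.4)(merged)
    out = layers.Dense(n_classes, activation='softmax')(merged)
    return models.Model(inputs=[spec_in, meta_in], outputs=out)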

described in Section 4.3.1, but since the input sizes of the two input layers are different, the output shapes of the two branches differ as well. A schematic overview of the network can be seen in Table 4.2.

Layer (two parallel spectrogram branches, InputLayer 256 and InputLayer 512, with identical configuration)
Dropout                       p = 0.2
Convolution2D                 64 5 × 5 kernels, 2 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 64 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 128 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 256 5 × 5 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Convolution2D                 256 3 × 3 kernels, 1 × 1 stride, ELU
MaxPooling2D                  2 × 2 kernels, 2 × 2 stride
Flatten (both branches)

Layer (metadata branch)       Configuration
InputLayer (Metadata)
Dense                         n = 100

Layer (merged)                Configuration
Merge                         both spectrogram branches + metadata branch
Dropout                       p = 0.4
Dense                         n = 1024, ELU
Dropout                       p = 0.4
Dense                         n = 1500, Softmax

Table 4.2: A parallel architecture for different DFT window sizes.


4.4 Combining the single predictions

Since the method used splits a sound file into several small chunks that are classified individually, the predictions for these chunks must be combined in order to classify the whole sound file. Two different combination methods were evaluated:

Average The class vectors of the chunks $c_1, c_2, \dots, c_n$ are averaged in order to get the class vector $c$ for the file. Majority vote Each element of the class vector $c$ of the file is defined as the number of occurrences of the corresponding class among the Top-1 classes of the chunks of the file.
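Both combination schemes can be sketched in a few lines; chunk_probs is assumed to be a list of per-chunk class-probability vectors.

import numpy as np

def combine_average(chunk_probs):
    # Average the class probability vectors of all chunks of the file.
    return np.mean(chunk_probs, axis=0)

def combine_majority_vote(chunk_probs, n_classes):
    # Count how often each class is the Top-1 prediction among the chunks.
    votes = np.zeros(n_classes)
    for probs in chunk_probs:
        votes[int(np.argmax(probs))] += 1
    return votes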

4.5 Evaluation methods

In this section an overview is given of the evaluation metrics used to compare the different prediction results.

4.5.1 Mean Average Precision

The MAP is one of the most widely used metrics [TS06], and it was also the measure used by BirdCLEF 2017 for the official evaluation. As shown in [Zhu04], there is always a trade-off between the also commonly used recall and precision metrics, which means that both of them should be considered for evaluation purposes. Average Precision (AP), however, takes both of them into account. Unlike precision, AP does not penalize retrieving more classes, but the rank of the correct prediction(s) is important. Lower-ranked correct predictions yield lower scores, and so, unlike with recall, it is not possible to achieve a perfect score just by returning all of the classes. The AP(q) for a given test file q can be defined in the following way:

$$AP(q) = \frac{\sum_{k=1}^{n} P(k) \times rel_q(k)}{|C_r|} \qquad (4.4)$$

where $n$ is the number of predictions, $P(k)$ is the precision at cut-off $k$ in the list, $rel_q(k)$ is an indicator function that is 1 if the class at rank $k$ is a relevant class and zero otherwise, and $|C_r|$ is the number of relevant classes. The MAP is defined as the average of the APs over all test instances $q$ from the test set $Q$:

$$MAP = \frac{\sum_{q \in Q} AP(q)}{|Q|} \qquad (4.5)$$
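A direct Python translation of Equations 4.4 and 4.5 is sketched below; predictions are assumed to be given as class lists ordered by decreasing confidence and the ground truth as sets of relevant classes.

import numpy as np

def average_precision(ranked_classes, relevant):
    # Equation 4.4: precision is accumulated only at the ranks of relevant classes.
    hits, precision_sum = 0, 0.0
    for k, cls in enumerate(ranked_classes, start=1):
        if cls in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def mean_average_precision(predictions, ground_truth):
    # Equation 4.5: the mean of the AP values over all test files.
    return float(np.mean([average_precision(predictions[q], ground_truth[q])
                          for q in ground_truth]))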


4.5.2 Top-n Accuracy

The top-n accuracy considers only the first n highest ranked classes of the prediction, and scores a hit if the ground truth is among these n predictions:

$$\text{top-}n\text{ accuracy} = \frac{\sum_{q \in Q} f(q, p_n(q))}{|Q|} \qquad (4.6)$$

where $p_n(q)$ is the set of the $n$ highest ranked classes predicted for file $q$ and $f(q, p)$ is an indicator function which yields 1 if the ground truth is among the classes returned by $p_n(q)$, and 0 otherwise.
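Correspondingly, a short sketch of the top-n accuracy (Equation 4.6), assuming a single ground-truth class per file:

def top_n_accuracy(predictions, ground_truth, n=5):
    # Fraction of files whose ground truth is among the n highest ranked predictions.
    hits = sum(1 for q, truth in ground_truth.items() if truth in predictions[q][:n])
    return hits / len(ground_truth)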

4.5.3 Area under the ROC curve

The Receiver Operating Characteristic (ROC) curve is used to visualize the performance of binary classifiers and also aids in selecting the right classifier depending on the characteristics of the task [Faw06]. It can be used if the output of a binary classifier is a relative probability of the class instead of the labels true or false, as a discrete classifier would output. It is a two-dimensional curve which is created by varying the threshold above which a class is considered a positive instance, plotting the true positive rate (TPR) against the false positive rate (FPR). Although this curve provides a good method to help select the right threshold values or the right classifiers if we want to reach a specific TPR or FPR, it is hard to use for comparing the general predictive capabilities of different classifiers. A widely used solution for this problem is the usage of the area under the receiver operating characteristic curve (AUROC) as a metric [Faw06]. The AUROC is always a portion of the unit square, so its value is between 0.0 and 1.0. An interesting property of the AUROC is that its value gives the probability that the classifier will rank a randomly chosen positive instance higher than a negative one [Faw06]. Consequently, the value of the AUROC is usually above 0.5. If it is below 0.5, it is said that the classifier is not properly calibrated, and by flipping the positive and negative outputs we get a true classifier with AUROC > 0.5. Another important characteristic of the AUROC is that, in contrast to accuracy, it is not affected by unequal class distributions. However, a major drawback of the ROC curve is that it is defined only for binary classifiers. Fortunately, there are a number of proposed extensions to multiclass problems, although there is no universally accepted definition [Faw06]. In this thesis Hand and Till's formulation is used, which is defined as follows:

$$AUROC_{total} = \frac{2}{|C|(|C|-1)} \sum_{\{c_i, c_j\} \in C} AUROC(c_i, c_j) \qquad (4.7)$$

where $C$ is the set of all classes and $AUROC(c_i, c_j)$ is the area under the ROC curve involving only the classes $c_i$ and $c_j$. An advantage of Hand and Till's formulation is that it is also invariant to skewed class distributions, just like the AUROC for binary classification

cases; however, it is computationally very expensive, as it requires the calculation of the AUROC for $|C|(|C|-1)/2$ pairs, which is about 1.12 million pairs in the case of the BirdCLEF2017 dataset.
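A sketch of the Hand and Till formulation (Equation 4.7) is given below; it is an illustration and not necessarily the implementation used in the thesis. scikit-learn's roc_auc_score with multi_class='ovo' and average='macro' computes the same pairwise-averaged measure directly.

import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def hand_till_auroc(y_true, y_prob, n_classes):
    # Average the pairwise AUROCs over all unordered class pairs (Equation 4.7).
    y_true = np.asarray(y_true)
    pair_scores = []
    for i, j in combinations(range(n_classes), 2):
        mask = (y_true == i) | (y_true == j)
        if len(np.unique(y_true[mask])) < 2:
            continue  # skip pairs that are not both present in the test data
        a_ij = roc_auc_score(y_true[mask] == i, y_prob[mask, i])
        a_ji = roc_auc_score(y_true[mask] == j, y_prob[mask, j])
        pair_scores.append(0.5 * (a_ij + a_ji))
    return float(np.mean(pair_scores))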

4.6 Summary

The main goal of this chapter was to describe the contributions of this thesis. Besides the summarization of the dataset, the whole training pipeline was presented: from the pre-processing of the audio and metadata, through the network architectures and the way the classifications of the sound chunks are combined into a final prediction for the instance, to the evaluation metrics that were used for the benchmarking and comparison of the different methods. The performance of the various components is described in the next chapter.


CHAPTER 5 Results

This chapter summarizes the results and findings of the methods proposed in the previous chapters. All the results presented in this chapter are based on the local validation set, while the results on the official BirdCLEF2017 test set are summarized in Chapter 6. First, the effect of the different sound representations is evaluated in Section 5.1, including different STFT window sizes, using the Mel spectrogram (see Section 2.4.2), using CQT (Section 2.4.3) and training on the pure FFT spectrogram without rescaling. After that, the effect of adding metadata is evaluated in Section 5.2, followed by the performance of the various data augmentation techniques in Section 5.3. Then the performance of the two network architectures is compared. In Section 5.5 the performance of different activation functions and their effect on the training times are evaluated. Finally, Section 5.6 shortly discusses the results and findings. A training batch size of 16 and a learning rate of 0.001 with Nesterov momentum of 0.9 proved to be useful for all of the sound representations and architectures; however, no extensive hyper-parameter tuning was made regarding these parameters. The experiments were carried out on a server provided by the Technical University of Vienna, consisting of two Intel Xeon X5680 CPUs, 96 GB RAM and a Titan X GPU (Maxwell architecture). The sound files were stored on an SSD, while the noise samples were cached on a memory disk.

5.1 Sound representation

Five different sound representation methods were evaluated for the problem: CQT with 80 bands and 256 frames as described in [LS16], Mel-scaled spectrograms with STFT window sizes of 256 and 512, and pure STFT representations with window sizes 256 and 512. For the evaluation, 5 different subsets of 100 classes were selected from the BirdCLEF2017 dataset and were used for the training of the methods with a 90% training and 10%


test set split. No data augmentation techniques or metadata were used for these tests besides randomly selecting a part of the sounds. The results of the sound representation evaluations are summarized in Table 5.1. Mel-512 slightly outperformed the CQT-based solution and both of them clearly outperformed the Mel-256 based representation. The pure STFT performed significantly worse than the others. While CQT's performance almost matched the performance of Mel-512, because of its computational intensity the training times were up to 3 times higher than with the Mel-based representations. A goal of this thesis is to find methods that perform at least as well as the baseline method of [SJKH16] but with less computing resources, so the Mel-256/512 based solutions were used for the further evaluations.

            MAP      AUROC    Top-1 Accuracy
Mel-512     0.708    0.933    62.65%
CQT         0.706    0.950    61.09%
Mel-256     0.690    0.953    61.48%
STFT-256    0.667    0.947    56.81%
STFT-512    0.665    0.951    56.81%

Table 5.1: The performance of the different sound representation methods on 5 randomly selected subsets of 100 classes from the BirdCLEF2017 dataset. The results are ordered by the MAP score.

5.2 Metadata

Because of the long training times associated with the whole dataset, the effects of the different metadata combinations were tested on 5 randomly selected subsets of 100 birds of the BirdCLEF2017 dataset. The subsets were split into a 90% training and a 10% validation set. The scores were averaged over the 5 runs and Table 5.2 gives an overview of them. The tests were carried out with the Mel-256 representation. Adding any type of metadata results in a better performance than the baseline containing no metadata. The part of day had the least impact on the results, improving the MAP score by only 0.016. A possible cause for this is that most of the recordings were concentrated in the morning hours. The elevation had an impact of an extra 0.030 compared to the baseline, and +0.014 compared to the part of day. Surprisingly, combining elevation with the part of day resulted in slightly worse scores than elevation on its own. Using coordinates resulted in a considerably better performance than the elevation, adding 0.061 to the baseline and 0.021 to the elevation on its own. Combining the coordinates with the part of the day resulted in slightly better results. The combined spatial information resulted in a 0.072 improvement in the MAP score, while using all three types of metadata resulted in a total improvement of 0.078 compared to the baseline.


                      MAP      AUROC    Top-1 Accuracy
All metadata          0.907    0.991    85.99%
Coord + Elevation     0.901    0.994    85.60%
Coord + PoD           0.892    0.992    84.88%
Coordinates           0.890    0.995    84.19%
Elevation             0.859    0.990    79.77%
Elevation + PoD       0.853    0.990    79.38%
Part of day (PoD)     0.845    0.984    78.04%
Baseline              0.829    0.987    75.88%

Table 5.2: The effects of adding metadata on the training performance in different combinations. Baseline represents the result when no metadata was added. The results are ordered by the MAP score.

5.3 Data augmentation

In order to evaluate the effect of the different data augmentation methods presented in Section 4.2.3, the same 5 randomly selected subsets of the BirdCLEF2017 dataset were used as in Section 5.2. The networks were trained for 100 epochs and the best performing models regarding the MAP of the foreground species were selected for the evaluation. The scores of the 5 selected models were averaged and are presented in Table 5.3. The baseline method used no data augmentation techniques besides randomly selecting chunks of the recordings. The metadata was available in all runs. Adding random noise to the recordings is by far the most important data augmentation step, as it added 0.151 to the MAP score. Adding birds from the same class also had a significant positive effect on the classification performance, increasing the MAP by 0.056. Changing the pitch, adding birds from the neighborhood and changing the volume also increased the performance, resulting in increases of 0.031, 0.024 and 0.007 in the score, respectively. The effect of adding neighboring birds increases on the whole dataset, as in that case the data augmentation process has more birds to choose from. The positive effect of randomly cutting the recordings into two parts and concatenating them in reversed order as suggested in [SJKH16] could not be reproduced, as it yielded slightly worse results than the baseline.


                      MAP      AUROC    Top-1 Accuracy
Noise overlay         0.884    0.945    83.52%
Same class            0.789    0.939    71.26%
Pitch shift           0.769    0.931    67.57%
Neighboring birds     0.757    0.928    66.40%
Volume shift          0.740    0.921    65.23%
Baseline              0.733    0.918    64.45%
Random cut            0.726    0.917    63.24%

Table 5.3: The effects of the data augmentation processes presented in Section 4.2.3. The results are ordered by the MAP score.

5.4 Network architecture

For the final evaluation of the network architectures, three different stratified splits of the whole BirdCLEF2016 and BirdCLEF2017 datasets were created, each with a 90% training set and a 10% validation set. The networks were trained for 100 epochs and the models with the best MAP score on the validation set were selected for the evaluation. Five different architectures were evaluated:

Mel-256/512: The network using only one Mel-spectrogram as input (see Section 4.3.1). It was evaluated with STFT window sizes 256 and 512. Parallel: This is the network presented in Section 4.3.3, using both a spectrogram created with an STFT window size of 256 and one with a window size of 512. NoMetadata: The same network as Mel-256 but without using any metadata. Baseline: This is the state-of-the-art network architecture of [SJKH16].

The results presented in Table 5.4 are the averaged scores of the three runs. Surprisingly, the simpler network Mel-256/512 outperformed the more complicated parallel architecture. Mel-512 performed slightly better than Mel-256 on BirdCLEF2016, while on the newer dataset the contrary was true. Both the simpler network architecture and the parallel architecture performed better than the baseline method of [SJKH16]. The NoMetadata variant performed only slightly better than the baseline method, indicating that most of the improvement comes from the inclusion of metadata.


                MAP      AUROC    Top-1 Accuracy
Mel-512-999     0.693    0.963    61.70%
Mel-256-999     0.691    0.960    60.69%
Parallel-999    0.655    0.949    56.30%
Baseline-999    0.629    0.936    54.77%

(a)

                 MAP      AUROC    Top-1 Accuracy
Mel-256-1500     0.684    0.958    59.56%
Mel-512-1500     0.681    0.961    59.32%
Parallel-1500    0.643    0.941    55.73%
Baseline-1500    0.626    0.932    54.31%

(b)

Table 5.4: Comparison of the network architectures presented in Section 4.3 with the baseline architecture of [SJKH16] with different STFT window sizes, on the BirdCLEF2016 (a) and BirdCLEF2017 (b) datasets. The results are ordered by the MAP score.

5.5 Activation functions and training times

Two activation functions were evaluated in the tests, ReLUs and ELUs. As discussed in Section 2.3.5, ReLU is usually used in combination with batch normalization (BN), as this leads to a faster convergence and better classification performance. However, using ELUs may lead to similar performance without the need for the costly normalization steps. The performance tests were carried out with the same randomly selected training/test sets as described in Section 5.4. Table 5.5 summarizes the classification performance and training times of the different activation functions using a Mel-scaled spectrogram with a 256 STFT window, a Mel-scaled spectrogram with a 512 STFT window, the Parallel-1500 architecture, the CQT-based solution and, for comparison, the baseline method of [SJKH16] on the full BirdCLEF2017 dataset. The total training time was calculated from the number of epochs required to reach the point with no further decrease in the training loss for 5 consecutive epochs. All of the variants outperformed the baseline method in terms of classification performance and training time. Among the methods discussed in this thesis, just as suggested in


Table 5.4, the Parallel-1500 architecture was the slowest. The ReLU-only method stopped converging after 16 epochs. For the sake of completeness, the CQT-based representation was also included in this comparison, yielding the second-worst training time requirement. Mel-256 with ReLU and BN required 28 h more to converge than the ELU-based method and resulted in slightly worse performance. While Mel-512 resulted in a similar classification performance as Mel-256, it took 23 h longer to reach convergence. The results show that using ELUs may lead to a much faster convergence with increased classification performance compared to the ReLU-only and ReLU + BN methods.

                              MAP      Time (s) / epoch    Total training time
Mel-256-1500 ELU              0.684    2 850               1d 09h 15m
Mel-512-1500 ELU              0.681    4 100               2d 10h 05m
Mel-256-1500 ReLU w BN        0.676    4 300               2d 13h 21m
CQT-1500 ELU                  0.653    6 300               4d 05h 30m
Mel-256-1500 ReLU w/o BN      0.647    2 800               1d 12h 26m
Parallel-1500 ELU             0.643    13 000              10d 09h 10m
Baseline-1500 ReLU w BN       0.626    15 500              10d 22h 38m

Table 5.5: Classification performance and training times of the different activation functions and network variants on the full BirdCLEF2017 dataset. The average training times per epoch were rounded to the nearest 50 seconds.

5.6 Discussion

The relatively high AUROC scores suggest that the classification system is in most cases able to make a clear distinction between pairs of classes, and that the false classifications occur between a class and only a few other classes. This follows from the definition of the AUROC for multiclass problems by Hand and Till. As (I) in Figure 5.1 shows, about 1/3 of the classes are always correctly classified by the Mel-256-1500 system, while about 1/5 are always classified wrongly. However, more than half of the recordings were among the Top-5 birds returned by the classification and only 1/10 of the classes were never in the list of the 5 most probable birds. Several possible reasons have been identified for the bad performance of the classes that were never classified correctly:

• Noisy / bad quality recordings

• Too silent recordings

• Persistent bird singing in the background that lasted longer than the foreground bird's song


• Bird singing is similar to another species with more samples

• Generally very few samples

As suggested by plot (II) in Figure 5.1, there is no strong correlation between the number of samples per class and the accuracy; having more samples for a class only slightly increases the classification performance.

Figure 5.1: An overview of the accuracies of the classes. (I) displays the Top-1, Top-2 and Top-5 accuracy of each class, ordered by decreasing accuracy. About 1/3 of the classes are always correctly classified, while about 1/5 are always classified falsely. (II) depicts the effect of the class size (number of instances per class) on the accuracy.

Figure 5.2 gives an overview of the classification accuracy at the bird taxonomy level with the Mel-256-1500 configuration. The taxonomy orders are ordered by their size (as visualized in Figure 4.1). The most noticeable feature is that many classes belonging to different orders are classified as belonging to Passeriformes. This is probably caused by the overwhelming presence of birds of this category in the dataset. Note that Cathartiformes and Opisthocomiformes are the smallest taxonomy orders in the dataset and the validation set contained only 1 sample for each. Using majority vote instead of averaging the results of the chunks in the sound file consistently resulted in significantly worse scores, therefore these results are not included separately.


Figure 5.2: Confusion matrix of the bird orders. The orders are sorted by decreasing number of classes in each order.

CHAPTER 6 BirdCLEF 2017

As a final evaluation of the above methods, 4 runs were submitted to the BirdCLEF2017 competition in a team with Alexander Schindler1,2 and Thomas Lidy1 under the name ’Cynapse’ [FSLR17]. This chapter gives an overview of the competition. The submissions based on the work of this thesis are presented in Section 6.1. After that, the methods used by the other participants are covered in Section 6.2, and finally the results and conclusions of the competition are presented in Section 6.3. A detailed description of the competition can be found in [JGG+17].

6.1 Submitted runs

This section summarizes the runs submitted by the Cynapse team to the BirdCLEF 2017:

Cynapse Run 1: The first run was a proof-of-concept submission; a 256 STFT window was used and the network was trained with 90% of the training set, the rest being kept as a validation set. The training ran for 70 epochs and reached 64.64% MAP on the 10% validation set. Cynapse Run 2: In the second run the network architecture was kept the same as in the first run; however, the STFT window size was raised to 512. The network was trained on 90% of the training set (randomly selected, with a different seed than in the first run) for 70 epochs, and 72.06% MAP was reached on the validation set. Finally, it was trained for a further 20 epochs on the whole training set.

1Vienna University of Technology 2Austrian Institute of Technology


Cynapse Run 3: In this run the model of the first run was trained for an additional 20 epochs with the whole training set. Cynapse Run 4: This was an ensemble of the second and the third run: for each instance of the test set the network outputs were averaged class-wise, i.e. $P_4(c) = \frac{P_2(c) + P_3(c)}{2}$, where $P_n(c)$ is the predicted relative probability for class c in run n.

6.2 Other participating teams

78 teams registered for the BirdCLEF 2017 competition and 5 of them, including Cynapse, submitted run files for the final evaluation [JGG+17]. All four teams that submitted working notes used convolutional neural networks. The fifth team, WUT, did not publish their solution, thus it is not covered here. In the following, the most important details of the other three methods are summarized.

DYNI UTLN [SG17] The team of the University of Toulon adapted the state-of-the-art image recognition model Inception-v4 [SIVA17] for audio classification. The network was first pre-trained on ImageNet [DDS+09]; then they generated three log-spectrograms with STFT window sizes of 128, 512 and 2048, which were fed in as the RGB channels. The team performed no signal/noise separation during the pre-processing, but they used time-frequency attention mechanisms [XBK+15] in order to let the network focus on bird activities. In the data augmentation step they made random image crops in the time and frequency domains, random variations in hue, contrast, brightness and saturation, as well as adding random noise. For the transfer learning they first trained only the last layer and then fine-tuned all layers of the deep network. The team made two submissions which used the same network architecture but differed in some undisclosed parameters. FHDO BCSG [FKF17] The FHDO BCSG team of the University of Applied Sciences and Arts Dortmund adapted an image recognition solution for audio classification, similarly to DYNI UTLN. However, instead of using the state-of-the-art Inception-v4, they decided to use its previous version, Inception-v3 [SVI+16], for performance reasons, as this network can be trained considerably faster than its newer version. They used a modified version of the signal-noise separation method of [SJKH16], but in addition they also applied bandpass filtering by ignoring the frequencies below 1 kHz and above 12 kHz. For the training set they used an STFT window of size 512 to generate the spectrograms, which were then handled as images. During the data augmentation process they used random time shifting, time


stretching and pitch shifting/stretching. If the resulting images were smaller than the input size of Inception-v3, 299 × 299, the images were scaled with bilinear interpolation. The same image was fed into the network in the R/G/B channels. The team submitted 4 runs; three with different pre-processing methods and the fourth as an ensemble model of these three. TUCMI [KWSH+17] The team of the University of Chemnitz also based their method on the work of Sprengel et al. [SJKH16], the winning team of BirdCLEF 2016. They used a modified variant of the method described in [SJKH16] for the signal-noise separation: they first calculated the spectrograms for five-second chunks of the audio signals with an STFT window of 512 and then removed those with an improper signal-to-noise ratio using the statistical methods described in [SJKH16]. They selected 869 spectrograms with heavy background noise, which formed the basis of the noise overlaying in the data augmentation process. Besides pitch shifts and the noise samples, they also applied Gaussian noise on the spectrograms. They used a network similar to that in [SJKH16], however with larger receptive fields in the first convolutional layers. They used ELUs as activation function in addition to batch normalization before the convolutional and dense layers. They made considerable effort to reduce the training time. They started with a learning rate of 0.01 which was iteratively reduced over 55 epochs to 0.0001. Instead of starting the training with the background species involved, they first pre-trained the network with single-label outputs and later used these models as a starting point for the multi-label classification. Thus, they were able to skip 20-30 epochs of training time. The team submitted 4 runs; TUCMI Run 1 was trained on the whole dataset, TUCMI Run 2 was based on the model from the first run, but trained with the background species as well for multi-label classification. TUCMI Run 3 was an ensemble of the first two runs, a model with STFT window size 256 and four models trained on 300, 500, 1000 and 2000 training samples. Their last run, TUCMI Run 4, used a reduced training set of 100 species, selected based on the probability of occurrence.

6.3 Results and analysis

The final results of the different categories are summarized in Figure 6.1 and in Table 6.2. The DYNI UTLN team performed best in the competition, with DYNI UTLN Run 1 being the best performing solution in all of the categories except the Soundscapes without time-codes (same queries as 2016) category, in which DYNI UTLN Run 2 performed best. The Cynapse team was the second-best performing team in the soundscape category and the third-best in the traditional records category. Although


the orange horizontal dashed line shows the performance of the best performing team of the previous year, it is not an appropriate comparison, since there were 50% more classes in 2017 than in 2016.

Figure 6.1: The final results of the BirdCLEF2017 competition, showing for each run the MAP for soundscapes with time-codes, soundscapes without time-codes (same queries as 2016), traditional records (only main species) and traditional records (with background species). The horizontal dashed orange line represents the MAP of the best performing team of BirdCLEF 2016. The diagram is based on the result diagram in [JGG+17].

The TUCMI team almost reached the performance of DYNI UTLN Run 1 when considering the traditional records (MAP 0.605 vs. 0.616), but with a much simpler architecture and considerably less training time. This shows that simpler architectures with ensemble models still might have a chance against large pre-trained architectures. The method used for the learning transfer also plays a crucial role in the performance, as the FHDO BCSG models were outperformed by the Cynapse and TUCMI submissions, despite the much more advanced network architecture of the former and similar preprocessing and data augmentation. Most of the submissions reached a substantially better result on the new 2017 soundscape recordings than on the 2016 ones, probably because of the better quality of the recordings. However, the precision is still considerably lower than with the traditional recordings, although the soundscape recordings stay the closest to the most important real-life applications. There is still a lot of room for improvement here, which is partially hindered by the lack of training data. As can be seen in Table 6.1, among the Cynapse submissions Cynapse Run 4 performed

the best when considering the traditional recordings, while Run 3 performed better on the soundscapes with time-codes, and Run 2 on the soundscapes without time-codes. With the traditional recordings, the second run with an STFT window of 512 performed better than Run 1 and Run 3 with a 256 window, at the cost of a considerably longer training time. The results show that, as in the case of the FHDO BCSG team, using a more complicated network on its own is not enough to get a better performance on the task compared to a much simpler network using more data augmentation techniques and metadata. However, with the sophisticated learning transfer from image classification used by DYNI UTLN, even without any signal/noise separation a considerably better performing system can be obtained, resulting in a new state-of-the-art performance.

Run    MAP w TC    MAP wo TC    MAP o MS    MAP w BS
1      0.165       0.008        0.486       0.432
2      0.069       0.012        0.562       0.493
3      0.168       0.008        0.514       0.456
4      0.142       0.010        0.579       0.511

Table 6.1: Results of the Cynapse runs on the four BirdCLEF evaluation tasks: MAP Soundscapes with time-codes (MAP w TC), MAP Soundscapes without time-codes (MAP wo TC), MAP Traditional records, only main species (MAP o MS), and MAP Traditional records with background species (MAP w BS) [JGG+17]


Team                  MAP w TC    MAP wo TC    MAP o MS    MAP w BS
DYNI UTLN Run 1       0.288       0.072        0.714       0.616
DYNI UTLN Run 2       0.223       0.099        0.594       0.516
Cynapse Run 3         0.168       0.008        0.514       0.456
Cynapse Run 1         0.165       0.008        0.486       0.432
TUCMI Run 3           0.162       0.079        0.687       0.605
Cynapse Run 4         0.142       0.010        0.579       0.511
TUCMI Run 2           0.131       0.064        0.625       0.547
TUCMI Run 1           0.111       0.063        0.644       0.564
FHDO BCSG Run 3       0.097       0.039        0.567       0.496
TUCMI Run 4           0.091       0.062        0.068       0.064
FHDO BCSG Run 4       0.083       0.023        0.504       0.438
Cynapse Run 2         0.069       0.012        0.562       0.493
FHDO BCSG Run 2       0.069       0.048        0.491       0.431
FHDO BCSG Run 1       0.056       0.041        0.492       0.427
WUT Run 1             0.002       0.000        0.176       0.151
WUT Run 4             0.001       0.010        0.162       0.139
WUT Run 2             0.000       0.014        0.207       0.177
WUT Run 3             0.000       0.010        0.156       0.132

Table 6.2: The overall results of the BirdCLEF2017 competition (column abbreviations as in Table 6.1).
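To make the reported scores concrete, the following is a minimal sketch of how a Mean Average Precision value such as those in Tables 6.1 and 6.2 can be computed from ranked per-recording species predictions. The function and variable names (average_precision, mean_average_precision, runs, ground_truth) are illustrative and are not taken from the official BirdCLEF evaluation code.

from typing import Dict, List, Set

def average_precision(ranked_species: List[str], relevant_species: Set[str]) -> float:
    # AP for one query: mean of precision@k over the ranks k of the relevant hits.
    hits = 0
    precisions = []
    for k, species in enumerate(ranked_species, start=1):
        if species in relevant_species:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_species) if relevant_species else 0.0

def mean_average_precision(runs: Dict[str, List[str]],
                           ground_truth: Dict[str, Set[str]]) -> float:
    # MAP: mean of the per-query average precisions.
    aps = [average_precision(runs[q], ground_truth[q]) for q in ground_truth]
    return sum(aps) / len(aps)

# Toy example with two queries and three ranked predictions each:
runs = {"q1": ["sp_a", "sp_b", "sp_c"], "q2": ["sp_b", "sp_a", "sp_c"]}
truth = {"q1": {"sp_a"}, "q2": {"sp_c"}}
print(mean_average_precision(runs, truth))  # (1.0 + 1/3) / 2 = 0.666...

A relevant species ranked early contributes more to the score than one ranked late; the reported MAP is simply the mean of these per-query values over all test recordings or segments.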

CHAPTER 7

Conclusion and future work

The goal set in this thesis was to improve the performance of the state-of-the-art automated bird classification method presented by Sprengel et al. [SJKH16]. The second goal was to increase the scalability of the training of bird classifiers, as there are between 9 800 [Cle07] and 10 050 [GW06] known bird species, and it has been suggested that the real number may be as high as 18 000 [BCKZ16]. The presented approach uses a simple convolutional neural network architecture based on AlexNet [KSH12], employs ELUs in order to avoid the need for costly batch normalization, and includes metadata about the recordings in the learning and classification phases. The best-performing configuration achieved an improvement of 0.058 in the MAP score and 5.25% in accuracy compared to the baseline method, while requiring only one eighth of the training time. The method was officially evaluated in the BirdCLEF2017 contest [JGG+17], where it reached the second-best team result on the time-coded soundscapes, which are long recordings containing significant sequences without any relevant sound. It proved to be the third best when only traditional recordings were considered, in which one bird is in focus with some other species present in the background noise. The method was clearly outperformed on both the soundscape and the traditional recordings by a solution adapted from image recognition, which presents new ways of transferring knowledge between seemingly disparate classification tasks.
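As a side note, the ELU non-linearity mentioned above can be sketched in a few lines of NumPy; this is illustrative code for the activation function itself, not the actual training implementation used in the experiments.

import numpy as np

def elu(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Exponential Linear Unit: identity for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * np.expm1(x))

def relu(x: np.ndarray) -> np.ndarray:
    # Rectified Linear Unit, shown for comparison: max(0, x).
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(elu(x))   # negative inputs saturate smoothly towards -alpha
print(relu(x))  # negative inputs are clipped to exactly 0

The smooth negative saturation pushes mean activations towards zero, which is the property that allows the proposed models to skip the costly batch normalization step.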

7.1 Future work

The sound/noise separation is currently built on hand-crafted features; it could instead be based on recent results in the detection of rare sound events [LPH17][CV17][PKBGM17]. This could especially improve the performance of bird identification in the soundscape recordings.
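For illustration, the following is a minimal sketch of the kind of hand-crafted signal/noise mask referred to here, in the spirit of the median-clipping and erosion/dilation steps shown in Figures 4.4 to 4.6. The threshold factor of 3 and the 4x4 structuring element are illustrative assumptions, not necessarily the exact values used in this work.

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def median_clipping_mask(spectrogram: np.ndarray, factor: float = 3.0) -> np.ndarray:
    # Keep a time-frequency bin if its magnitude exceeds `factor` times both the
    # median of its frequency row and the median of its time column.
    row_median = np.median(spectrogram, axis=1, keepdims=True)
    col_median = np.median(spectrogram, axis=0, keepdims=True)
    mask = (spectrogram > factor * row_median) & (spectrogram > factor * col_median)
    # Morphological cleaning: erosion removes isolated pixels, dilation re-grows
    # the surviving regions.
    mask = binary_erosion(mask, structure=np.ones((4, 4)))
    mask = binary_dilation(mask, structure=np.ones((4, 4)))
    return mask

def signal_frames(mask: np.ndarray) -> np.ndarray:
    # A time frame is treated as signal if any of its frequency bins survived the mask.
    return mask.any(axis=0)

A learned rare-sound-event detector would replace this fixed thresholding with a model trained to mark the frames that actually contain bird vocalizations.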


Instead of treating each species as an isolated class, future research will focus on including ornithological and taxonomic relationships between the species, with the aim of incorporating as much information about the birds as possible. Another possible direction of improvement is the evaluation of more complex network architectures, the incorporation of other neural network types such as recurrent neural networks, as well as the use of generative adversarial networks [GPM+14] for data augmentation.

List of Figures

2.1 MNIST Dataset samples ...... 6
2.2 Overview of the Machine Learning taxonomy with the most common settings ...... 7
2.3 Polynomial fitting with different orders ...... 9
2.4 The Rosenblatt model of a perceptron ...... 10
2.5 Heaviside step function, sigmoid function, hyperbolic tangent ...... 11
2.6 The XOR Problem with perceptrons ...... 11
2.7 Multilayer Perceptron ...... 12
2.8 Convolutional operator ...... 15
2.9 Convolutional layer ...... 16
2.10 A sample for Max-Pooling ...... 16
2.11 Dropout ...... 17
2.12 Plots of Rectified Linear Unit and Exponential Linear Unit ...... 19

4.1 Bird taxonomy overview ...... 31
4.2 Metadata distributions ...... 32
4.3 Original spectrogram ...... 33
4.4 Selected pixels of the spectrogram ...... 33
4.5 Selected pixels after erosion and dilation ...... 33
4.6 Signal/noise separation steps ...... 33
4.7 The resulting wavefiles after sound-noise splitting ...... 34
4.8 Parts of the day depending on the position of the sun ...... 36

5.1 The accuracies of classes ...... 51
5.2 Confusion matrix of the bird orders ...... 52

6.1 The final results of the BirdCLEF2017 competition ...... 56


List of Tables

3.1 The best-performing network architecture of BirdCLEF 2016 ...... 27

4.1 A parallel architecture for different DFT window sizes ...... 40
4.2 A parallel architecture for different DFT window sizes ...... 40

5.1 Sound representation results ...... 46
5.2 Effects of the metadata ...... 47
5.3 Effects of the data augmentation ...... 48
5.4 Effects of the network architecture ...... 49
5.5 Effects of the data augmentation ...... 50

6.1 Results of the Cynapse Runs in BirdCLEF 2017 ...... 57
6.2 The overall results of the BirdCLEF2017 competition ...... 58


Acronyms

ANN Artificial Neural Networks. 3, 12, 22

AP Average Precision. 41

ATH above the horizon. 35

AUC Area Under the Curve. 24

AUROC area under the receiver operating characteristic curve. 42, 43, 46–50

BN batch normalization. 49, 50

BTH below the horizon. 35

CLEF Cross Language Evaluation Forum. 24

CNN Convolutional Neural Network. 2, 15, 20, 21, 26

CQT Constant-Q transform. 20, 38, 45

DFT Discrete Fourier Transformation. 19, 38, 40, 63

ELU Exponential Linear Unit. 18, 19, 39, 49, 50, 55, 59

FPR false positive rate. 42

GMM Gaussian Mixture Model. 21

HMM Hidden Markov Model. 21

ICML International Conference on Machine Learning. 22, 23

LReLU Leaky Rectified Linear Unit. 18, 26

MAP Mean Average Precision. 24–26, 41, 46–50, 57, 59

MLP Multilayer Perceptrons. 12

MLSP Machine Learning for Signal Processing. 22, 23

MSE Mean Squared Error. 13

NIPS Neural Information Processing Systems. 23

PReLU Parametric Rectified Linear Unit. 18, 26

ReLU Rectified Linear Unit. 18, 19, 26, 27, 39, 49, 50

RNN Recurrent Neural Networks. 21

ROC Receiver Operating Characteristic. 42

SED/C Sound Event Detection and Classification. 21

SGD Stochastic Gradient Descent. 17

STFT Short-Time Fourier Transform. 3, 19, 20, 32, 35, 38, 39, 45, 46, 48, 49, 53–55, 57

SVM Support Vector Machine. 21, 25

TPR true positive rate. 42

XOR exclusive OR. 11

Bibliography

[AMJP12] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012, pages 4277–4280, 2012.

[BB02] Gregory Ball and Jacques Balthazart. Neuroendocrine Mechanisms Regulating Reproductive Cycles and Reproductive Behavior in Birds. In Hormones, Brain and Behavior, volume 2, pages 649–798. Academic Press, December 2002.

[BCKZ16] George F. Barrowclough, Joel Cracraft, John Klicka, and Robert M. Zink. How many kinds of birds are there and why does it matter? PloS one, 11(11), 2016.

[BHR+13] F. Briggs, Y. Huang, R. Raich, K. Eftaxias, Z. Lei, W. Cukierski, S. F. Hadley, A. Hadley, M. Betts, X. Z. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, H. W. Ng, T. N. T. Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, and M. Milakov. The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. In 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–8, September 2013.

[Bis13] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 8th ed. 2009 edition, October 2013.

[BLN+12] Forrest Briggs, Balaji Lakshminarayanan, Lawrence Neal, Xiaoli Z. Fern, Raviv Raich, Sarah J. K. Hadley, Adam S. Hadley, and Matthew G. Betts. Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach. The Journal of the Acoustical Society of America, 131(6):4640–4650, 2012.

[Bre01] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[BVHB99] George E. Bentley, Thomas J. Van’t Hof, and Gregory F. Ball. Seasonal neuroplasticity in the songbird telencephalon: A role for melatonin. In Proceedings of the National Academy of Sciences, volume 96 (8), pages 4674–4679. National Academy of Sciences, April 1999.

[CEP+07] Jinhai Cai, Dominic Ee, Binh Pham, Paul Roe, and Jinglan Zhang. Sensor network for the monitoring of ecosystem: Bird species recognition. In Intelligent Sensors, Sensor Networks and Information, 2007. ISSNIP 2007. 3rd International Conference on, pages 293–298. IEEE, 2007.

[Cha04] Antonin Chambolle. An Algorithm for Total Variation Minimization and Applications. Journal of Mathematical Imaging and Vision, 20(1-2):89–97, 2004.

[Cle07] James F. Clements. Clements checklist of birds of the world. Comstock Pub. Associates/Cornell University Press, 2007.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995.

[CV17] Emre Cakir and Tuomas Virtanen. Convolutional Recurrent Neural Networks for Rare Sound Event Detection. Technical report, DCASE2017 Challenge, September 2017.

[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255, 2009.

[Faw06] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[FKF17] Andreas Fritzler, Sven Koitka, and Christoph M. Friedrich. Recognizing Bird Species in Audio Files Using Transfer Learning. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017.

[Fod13] G. Fodor. The Ninth Annual MLSP Competition: First place. In 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–2, September 2013.

[FP79] T. C. Van Flandern and K. F. Pulkkinen. Low-precision formulae for planetary positions. Astrophys. J. Supp., 41:391–411, November 1979.

[FSLR17] Botond Fazekas, Alexander Schindler, Thomas Lidy, and Andreas Rauber. A Multi-modal Deep Neural Network approach to Bird-song Identification. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017.

[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[GD17] Frank B. Gill and David B. Donsker. IOC World Bird List 7.3. http://www.worldbirdnames.org, July 2017. [Online; accessed: 2017-08-20].

[GGV+14] Hervé Goëau, Hervé Glotin, Willem-Pier Vellinga, Robert Planqué, Andreas Rauber, and Alexis Joly. LifeCLEF Bird Identification Task 2014. In CLEF: Conference and Labs of the Evaluation Forum, Information Access Evaluation meets Multilinguality, Multimodality, and Interaction, Sheffield, United Kingdom, September 2014.

[GGV+15] Hervé Goëau, Hervé Glotin, Willem-Pier Vellinga, Robert Planqué, Andreas Rauber, and Alexis Joly. LifeCLEF Bird Identification Task 2015. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015., 2015.

[GGV+16] Hervé Goëau, Hervé Glotin, Willem-Pier Vellinga, Robert Planqué, and Alexis Joly. LifeCLEF Bird Identification Task 2016: The arrival of Deep learning. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016., pages 440–449, 2016.

[GL83] Daniel W. Griffin and Jae S. Lim. Signal estimation from modified short-time Fourier transform. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’83, Boston, Massachusetts, USA, April 14-16, 1983, pages 804–807, 1983.

[GLF+93] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645–6649, 2013.

[GNF+03] Richard D. Gregory, D. Noble, R. Field, J. Marchant, M. Raven, and D. W. Gibbons. Using birds as indicators of biodiversity. Ornis hungarica, 12(13):11–24, 2003.

[GO04] Kevin J. Gaston and Mark A. O’Neill. Automated species identification: why not? Philosophical transactions of the Royal society B: Biological sciences, 359(1444):655–667, 2004.

[GPM+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative Adversarial Networks. CoRR, abs/1406.2661, 2014.

[GW06] Frank B. Gill and Minturn T. Wright. Birds of the world: recommended English names. Princeton University Press, Princeton, New Jersey, 2006.

[HDY12] Geoffrey Hinton, Li Deng, and Dong Yu. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag., 29(6):82–97, 2012.

[HP05] Arndt Hampe and Rémy J. Petit. Conserving biodiversity under climate change: the rear edge matters. Ecology letters, 8(5):461–467, 2005.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1026–1034, 2015.

[IS15] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.

[ISS13a] ICML int. Conf. Proc. 1st workshop on Machine Learning for Bioacoustics - ICML4B, USA, 2013. http://sabiod.univ-tln.fr.

[ISS13b] NIPS Int. Conf. Proc. Neural Information Processing Scaled for Bioacoustics, from Neurons to Big Data, USA, 2013. http://sabiod.org/nips4b.

[JGG+17] Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planqué, Simone Palazzo, and Henning Müller. LifeCLEF 2017 Lab Overview: multimedia species identification challenges. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 255–274. Springer, 2017.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[KWSH+17] Stefan Kahl, Thomas Wilhelm-Stein, Hussein Hussein, Holger Klinck, Danny Kowerko, Marc Ritter, and Maximilian Eibl. Large-Scale Bird Sound Classification using Convolutional Neural Networks. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017.

[Las14] Mario Lasseck. Large-scale Identification of Birds in Audio Recordings. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014., pages 643–653, 2014.

[Las15] Mario Lasseck. Improved Automatic Bird Identification through Decision Tree based Feature Selection and Bagging. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015., 2015.

[LBD+89] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 1989.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[LC98] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.

[LPH17] Hyungui Lim, Jeongsoo Park, and Yoonchang Han. Rare sound event detection using 1D convolutional recurrent neural networks. Technical report, DCASE2017 Challenge, September 2017.

[LS16] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), volume 90, pages 1032–1048. DCASE2016 Challenge, 2016.

[MDH12] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Acoustic Modeling Using Deep Belief Networks. IEEE Trans. Audio, Speech & Language Processing, 20(1):14–22, 2012.

[MHD+17] Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. DCASE 2017 Challenge setup: Tasks, datasets and baseline system. In DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, November 2017.

[MHN13] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.

[MMN+17] Brian McFee, Matt McVicar, Oriol Nieto, Stefan Balke, Carl Thome, Dawen Liang, Eric Battenberg, Josh Moore, Rachel Bittner, Ryuichi Yamamoto, Dan Ellis, Fabian-Robert Stoter, Douglas Repetto, Simon Waloschek, C. J. Carr, Seth Kranzler, Keunwoo Choi, Petr Viktorin, Joao F. Santos, Adrian Holovaty, Waldir Pimenta, and Hojin Lee. librosa 0.5.0, February 2017.

[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.

[MP69] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.

[OM98] Genevieve B. Orr and Klaus-Robert Mueller, editors. Neural Networks : Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science. Springer, 1998.

[OSB99] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-time Signal Processing (2Nd Ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1999.

[Pic16] Karol J. Piczak. Recognizing Bird Species in Audio Recordings using Deep Convolutional Neural Networks. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016., pages 534–543, 2016.

[PKBGM17] Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins. DNN and CNN with weighted and multi-task loss functions for audio event detection. Technical report, DCASE2017 Challenge, September 2017.

[SG17] Antoine Sevilla and Hervé Glotin. Audio Bird Classification with Inception-v4 extended with Time and Time-Frequency Attention Mechanisms. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017.

[SHK+14] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[SHM+16] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.

[SIVA17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 4278–4284, 2017.

[SJKH16] Elias Sprengel, Martin Jaggi, Yannic Kilcher, and Thomas Hofmann. Audio Based Bird Species Identification using Deep Learning Techniques. In CLEF (Working Notes), pages 547–559, 2016.

[SK10] Christian Schörkhuber and Anssi Klapuri. Constant-Q transform toolbox for music processing. In 7th Sound and Music Computing Conference, Barcelona, Spain, pages 3–64, 2010.

[SKSS16] Anish Shah, Eashan Kadam, Hena Shah, and Sameer Shinde. Deep Residual Networks with Exponential Linear Unit. CoRR, abs/1604.04112, 2016.

[Sla98] Malcolm Slaney. Auditory toolbox, 1998.

[SMP12] Osvaldo E. Sala, Laura A. Meyerson, and Camille Parmesan. Biodiversity change and human health: from ecosystem services to spread of disease, volume 69. Island Press, 2012.

[SPKL16] Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun. Very deep multilingual convolutional neural networks for LVCSR. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 4955–4959, 2016.

[SVI+16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826, 2016.

[SVN37] S. S. Stevens, J. Volkmann, and E. B. Newman. A Scale for the Measurement of the Psychological Magnitude Pitch. The Journal of the Acoustical Society of America, 8(3):185–190, 1937.

[TC16] Bálint P. Tóth and Bálint Czeba. Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016., pages 560–568, 2016.

[TPN+12] Michael Towsey, Birgit Planitz, Alfredo Nantes, Jason Wimmer, and Paul Roe. A toolbox for call recognition. Bioacoustics, 21(2):107–125, 2012.

[TS06] Andrew Turpin and Falk Scholer. User Performance Versus Precision Measures for Simple Search Tasks. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pages 11–18, New York, NY, USA, 2006. ACM.

[War70] W. Dixon Ward. Foundations of Modern Auditory Theory, volume 1. Academic Press, 1970.

[Wer74] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[XBK+15] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2048–2057, 2015.

[XCL16] Lie Xu, Chiu-sing Choy, and Yi-Wen Li. Deep sparse rectifier neural networks for speech denoising. In IEEE International Workshop on Acoustic Signal Enhancement, IWAENC 2016, Xi’an, China, September 13-16, 2016, pages 1–5, 2016.

[You03] Anthony Young. Global Environmental Outlook 3 (GEO-3): Past, Present and Future Perspectives, 2003.

[Zhu04] Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2:30, 2004.

[ZPB+17] Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron C. Courville. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. CoRR, abs/1701.02720, 2017.
