
Aalto University publication series DOCTORAL DISSERTATIONS 166/2014

Studies on Bird Vocalization Detection and Classification of Species

Seppo Fagerlund

A doctoral dissertation completed for the degree of Doctor of Science (Technology) to be defended, with the permission of the Aalto University School of Electrical Engineering, at a public examination held at the lecture hall S1 of the school on 17 November 2014 at 12.

Aalto University School of Electrical Engineering
Department of Signal Processing and Acoustics

Supervising professor
Prof. Unto K. Laine

Thesis advisor Prof. Unto K. Laine

Preliminary examiners
Dr. Rolf Bardeli, Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Germany
Dr. Peter Jančovič, University of Birmingham, U.K.

Opponent Prof. Gernot Kubin, Graz University of Technology, Austria

Aalto University publication series DOCTORAL DISSERTATIONS 166/2014

© Seppo Fagerlund

ISBN 978-952-60-5921-1 (printed) ISBN 978-952-60-5922-8 (pdf) ISSN-L 1799-4934 ISSN 1799-4934 (printed) ISSN 1799-4942 (pdf) http://urn.fi/URN:ISBN:978-952-60-5922-8

Unigrafia Oy Helsinki 2014

Finland

Publication orders (printed book): [email protected]

Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Author: Seppo Fagerlund
Name of the doctoral dissertation: Studies on Bird Vocalization Detection and Classification of Species
Publisher: School of Electrical Engineering
Unit: Department of Signal Processing and Acoustics
Series: Aalto University publication series DOCTORAL DISSERTATIONS 166/2014
Field of research: Acoustics and Audio Signal Processing
Manuscript submitted: 12 June 2014
Date of the defence: 17 November 2014
Permission to publish granted (date): 21 October 2014
Language: English
Article dissertation (summary + original articles)

Abstract
The topic of this thesis is automatic identification of bird vocalization and bird species based on the sounds they produce. Two main approaches for recording bird sounds are presented: active recording and passive continuous recording. The aim of the active recording method is to capture sounds of a particular bird species or individual. On the other hand, passive continuous recordings – which can be captured without human presence – are used in acoustical monitoring and are intended to include all sounds in the local environment.

The automatic identification system begins by segmenting distinct events from the recordings. The purpose of segmentation is to detect syllables in bird sounds. Active recordings with one bird individual typically have a high signal-to-noise ratio that helps in the task. Segmentation of passive continuous recordings is more demanding due to the possibility of many simultaneous sound sources and a varying signal level of sound events. Audio events in recordings, comprising sounds from many sources, are also often overlapping which adds complexity to the segmentation phase.

After the audio events have been segmented, feature extraction and classification are performed. In feature extraction the audio signals are represented with a small number of attributes (compared to the original data) that characterize particular sound events. Feature extraction performs dimension reduction by removing redundant information from the original data. Suitable features depend on the data and should be selected so that they discriminate sounds from different sources. The classification phase decides which class each sound event belongs to, based on its feature representation.

The main focus of this thesis is to develop and examine feature representations for different types of bird sounds suitable for automatic classification. Special attention has been given to birds that produce inharmonic and noisy sounds due to the diverse structure of their vocalizations. A method based on short time-domain structures was found to be efficient for many different types of sounds. It also exhibited efficiency for sound event detection in continuous recordings.

Keywords: bird song, bioacoustics, sound event detection, feature extraction, pattern recognition, automated recognition
ISBN (printed): 978-952-60-5921-1
ISBN (pdf): 978-952-60-5922-8
ISSN-L: 1799-4934
ISSN (printed): 1799-4934
ISSN (pdf): 1799-4942
Location of publisher: Helsinki
Location of printing: Helsinki
Year: 2014
Pages: 123
urn: http://urn.fi/URN:ISBN:978-952-60-5922-8

Tiivistelmä (Abstract in Finnish)
Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Author: Seppo Fagerlund
Title of the doctoral dissertation: Detection of bird sounds and identification of bird species
Publisher: School of Electrical Engineering
Unit: Department of Signal Processing and Acoustics
Series: Aalto University publication series DOCTORAL DISSERTATIONS 166/2014
Field of research: Acoustics and audio signal processing
Manuscript submitted: 12 June 2014
Date of the defence: 17 November 2014
Permission to publish granted (date): 21 October 2014
Language: English
Article dissertation (summary + original articles)

The topic of this thesis is the automatic detection of bird sounds and the identification of bird species based on the sounds they produce. The work presents the two main approaches to recording bird sounds: active and passive recording. In active recording the aim is to record the sounds of a particular bird species or individual. Passive recording, which can also be carried out without human presence, is used for acoustic monitoring, and its aim is to capture all sounds occurring in the area.

The first stage of an automatic recognition system is the segmentation of individual sound events. In this work, the goal of segmentation is to separate the individual syllables of bird sounds. In active recordings the signal-to-noise ratio is typically high, which eases the task. Segmentation of continuous passive recordings is more challenging, because they may contain sounds from several sources and the signal-to-noise ratio varies between sound events. In recordings containing several sound sources the individual sound events often overlap, which complicates segmentation.

After the segmentation of sound events, the next stages are feature extraction and classification. In feature extraction, the sound events are represented by a small number (compared to the original amount of data) of descriptive attributes. Feature extraction reduces the original amount of data by removing redundant information. The features suitable for each situation depend on the type of the sounds and should be chosen so that they discriminate sounds coming from different sources. In the classification stage, the sound events described by the feature representation are assigned to different classes.

The main purpose of this dissertation is to develop and study feature representations suitable for the automatic classification of different types of bird sounds. Particular attention has been paid to birds that produce inharmonic and noise-like sounds, because of the diverse structure of their vocalizations. A method describing the short-time temporal structure of sounds has proven to be an efficient representation for many types of sounds. The method has also proven efficient for detecting individual sound events in continuous recordings.

Keywords: bird song, bioacoustics, sound event detection, feature extraction, pattern recognition, automatic recognition
ISBN (printed): 978-952-60-5921-1
ISBN (pdf): 978-952-60-5922-8
ISSN-L: 1799-4934
ISSN (printed): 1799-4934
ISSN (pdf): 1799-4942
Place of publication: Helsinki
Place of printing: Helsinki
Year: 2014
Pages: 123
urn: http://urn.fi/URN:ISBN:978-952-60-5922-8

Preface

This work has been carried out at the Department of Signal Processing and Acoustics at the Aalto University School of Electrical Engineering in Espoo, Finland. The work was funded by the Academy of Finland, the EU IST Sixth Framework Programme, and Aalto University. The work was also supported by the Finnish Cultural Foundation, the Nokia Foundation, and the Research Foundation of the Helsinki University of Technology. I wish to express my gratitude to my supervisor and instructor, professor Unto K. Laine, for making this thesis possible. Countless discussions with Unto have been inspiring and his enthusiasm has encouraged me in my work. Additionally, Unto has provided ideas for the permutation transformation and the smoothing of the permutation statistic. I also want to thank Unto for the patience and confidence he has shown in me during this project, especially during the times when research work did not progress as expected. My gratitude belongs also to Dr. Aki Härmä, Prof. Juha Tanttu and Dr. Jari Turunen, who introduced me to this topic in the first place, and also to Dr. Panu Somervuo for his contribution during the early stages of this work. I wish to thank the pre-examiners of this thesis, Dr. Rolf Bardeli and Dr. Peter Jančovič, for their valuable comments regarding the manuscript. I am indebted to Dr. Toomas Altosaar for proofreading my manuscripts and his considerable support in many different aspects during the past years. I am also sincerely glad for our numerous discussion sessions covering many different topics, some of them very far from research and science. I also wish to thank Dr. Okko Räsänen for reading and commenting on my manuscripts. Other former and current members of our research team - Petri, Jouni, Heikki, Sofoklis, Juho and Shreyas - deserve my gratitude for the positive working environment.


I am proud of being able to work in a fantastic environment within the community of the acoustics lab. I want to thank Lea, Mara, Hynde, Mirja, Heidi, Markku, Ulla and Paula for taking care of many administrative and practical issues. Special thanks to Tomppa, Mairas, Carlo (In Memoriam 1980-2008), Hannu, Jouni, Marko T. and Sami; in addition to being great colleagues, I have had nice discussions and experiences with you in sports and outdoor activities. Also, my thanks go to Cumhur, Henkka, Heidi-Maria, Jussi, Antti, Mikko-Ville, Olli, Tapani, Ilkka, Tuomo, Marko H., Julia, and many other former and current labsters for the great atmosphere and unforgettable moments in various events. I also wish to thank my parents Aino and Reino, and my brother Pekka with his family, for their support and understanding. I am also grateful to all my friends, especially for all the outdoor activities we have experienced together. Finally, and most importantly, I want to express my gratitude to my lovely wife Anu for all the love and support she has given me over the years. I am also grateful to our daughter Meri and son Justus for always being there. You have ensured that life has never been monotonous or empty. There are no words to express how much I love you all.

Espoo, October 28, 2014,

Seppo Fagerlund

Contents

Preface 1

Contents 3

List of Publications 5

Author’s Contribution 7

List of Symbols 9

List of Abbreviations 11

1. Introduction 13
1.1 Related work 14
1.2 Scope and outline of the thesis 17

2. Acoustical Monitoring of Bird Sounds 19
2.1 Natural audio scenes 19
2.2 Organization and structure of bird sounds 21
2.3 Outdoor sound recording 23

3. Signal Processing Tools 27
3.1 Segmentation 28
3.1.1 Frame based segmentation 28
3.2 Parametric representations 29
3.2.1 Spectral features 30
3.2.2 Sinusoidal features 30
3.2.3 Low-level signal features 31
3.2.4 Time domain structures 32
3.3 Classification 33
3.3.1 k-nearest neighbour classification 34


3.3.2 Support vector machines 35

4. Summary and Main Results of the Publications 37
4.1 Publication I: Parametrization of Inharmonic Bird Sounds for Automatic Recognition 37
4.2 Publication II: Parametric Representations of Bird Sounds for Automatic Species Recognition 38
4.3 Publication III: Bird Species Recognition Using Support Vector Machines 39
4.4 Publication IV: Monitoring of Capercaillie Courting Display 39
4.5 Publication V: Classification of Audio Events Using Permutation Transformation 40
4.6 Publication VI: New Parametric Representations of Bird Sounds for Automatic Classification 42

5. Conclusions and Future Directions 43

Appendix 47

Errata 51

References 53

Publications 61

List of Publications

This thesis consists of an overview and the following publications, which are referred to in the text by their Roman numerals.

I Seppo Fagerlund, Aki Härmä. Parametrization of Inharmonic Bird Sounds for Automatic Recognition. In 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, Sep 2005.

II Panu Somervuo, Aki Härmä and Seppo Fagerlund. Parametric representations of bird sounds for automatic species recognition. IEEE Trans. Audio, Speech, and Language Processing, Vol. 14, no. 6, pp. 2252-2263, 2006.

III Seppo Fagerlund. Bird Species Recognition Using Support Vector Machines. EURASIP Journal on Advances in Signal Processing, Article ID 38637, 2007.

IV Seppo Fagerlund. Monitoring of Capercaillie Courting Display. In 19th International Congress on Sound and Vibration (ICSV19), Vilnius, Lithuania, Jul 2012.

V Seppo Fagerlund, Unto K. Laine. Classification of audio events using permutation transformation. Applied Acoustics, Vol. 83, pp. 57-63, 2014.

VI Seppo Fagerlund, Unto K. Laine. New parametric representations of bird sounds for automatic classification. In 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014.

Author's Contribution

Publication I: “Parametrization of Inharmonic Bird Sounds for Automatic Recognition”

The present author is responsible for the experiments in this work. The author wrote the main parts of the article and edited it based on comments of the second author.

Publication II: “Parametric representations of bird sounds for automatic species recognition”

The present author implemented the segmentation algorithm in collaboration with the second author and compiled the database used in this work. The author is also responsible for the Mel-cepstrum model and the descriptive parameters. The author wrote the corresponding parts of the article.

Publication III: “Bird Species Recognition Using Support Vector Machines”

The present author is the sole author of this publication.

Publication IV: “Monitoring of Capercaillie Courting Display”

The present author is the sole author of this publication.


Publication V: “Classification of audio events using permutation transformation”

The present author planned this work in collaboration with the second author. The original idea of the algorithmic distance dK came from the second author. The author was responsible for implementing the algorithms and performing the experiments. The author wrote the article in collaboration with the second author.

Publication VI: “New parametric representations of bird sounds for automatic classification”

The present author developed the method described in chapter 3 in collaboration with the second author. The present author conducted the research in this work. The author wrote the article and edited it based on comments by the second author.

List of Symbols

A_m            Arithmetic mean of X(n)
b              Bias
d_E            Euclidean distance measure
d_KL           Kullback-Leibler divergence distance measure
d_M            Mahalanobis distance measure
f(x)           Decision boundary of the classifier
G_m            Geometric mean of X(n)
I(π_i, π_j)    Mutual information
K(x_i, x)      Kernel function
K              Number of filter bank channels
M              Number of frequency bins
MFCC_i         Mel-frequency cepstral coefficients
n_sv           Number of support vectors
p(π_i, π_j)    Joint probability distribution of π_i and π_j
p(π_i), p(π_j) Probability distributions of π_i and π_j
Th             Threshold for spectral rolloff frequency measure
X(n)           Power spectrum of the signal x(n)
X_k            Logarithmic energy of the kth channel on the mel scale
x, y           Feature vectors
y_i            Class labels
α_i            Weights of support vectors
π_i, π_j       Temporal patterns
Σ              Covariance matrix


List of Abbreviations

ANN    Artificial Neural Network
ARU    Autonomous Recording Unit
ASR    Automatic Speech Recognition
BW     Signal Bandwidth
CASA   Computational Auditory Scene Analysis
CASR   Computational Auditory Scene Recognition
EN     Signal Energy
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
kNN    k-Nearest Neighbour Classifier
LDA    Linear Discriminant Analysis
MFCC   Mel-frequency Cepstral Coefficients
MI     Mutual Information Measure
PPF    Permutation Pair Frequency matrix
SC     Spectral Centroid
SF     Spectral Flux
SFM    Spectral Flatness Measure
SOM    Self Organizing Maps
SRF    Spectral Rolloff Frequency
SVM    Support Vector Machine
ZCR    Zero Crossing Rate


1. Introduction

Observing our surrounding environment has always interested humans. Obtaining correct observations from the environment is no longer essential for our everyday living; however, this has not decreased our interest in monitoring wildlife. Especially during spring time many people are enthusiastic to follow the development of the season, which includes the arrival of migratory birds. Nature has also been an important source of inspiration, e.g., for many composers and poets, and it provides a place for relaxing and recreation. Recently, environmental observations have improved the awareness of our surroundings and have increased our motivation to acquire even more information regarding our ecosystem's conditions. Acoustical monitoring in particular has gained much attention during the past few years.

The technical development of portable devices, such as modern mobile phones, provides an easy way to make observations about our surroundings. Mobile phones are equipped with sufficiently good quality microphones and adequate storage capacity to make short recordings outdoors. Phones and other portable devices even have ample computational power to perform some on-site analyses of the recordings. Moreover, advancements in other portable recording equipment, especially in storage capacity, have also made it possible to perform long-term recordings without human presence at the recording site. In fact, the most significant challenges for long-term monitoring scenarios are how to supply electrical power to the recording equipment and how to protect the equipment from changing weather conditions.

Acoustical monitoring of nature offers many possibilities for scientific research. Birds are a good indicator of the state of our surrounding environment; since they are widely distributed, they react quickly to changes in environmental conditions such as climate change [1].

For nature conservation purposes it can be sufficient to monitor only one or a few target species. Acoustical monitoring is also widely used for tracking bird migration and for estimating populations of bird species. This work is traditionally performed by a large number of ornithologists and biologists as well as many enthusiastic and skilled volunteer birdwatchers. This is an area where automated monitoring could offer huge advantages.

Generally speaking, monitoring is a very laborious, time-consuming and expensive task when performed by human listeners, and even more so when monitoring large areas, since more experienced observers are needed. Even more observers are needed if monitoring needs to be performed round the clock. Also, observations made by different people may not be entirely comparable due to differences in their expertise, causing a bias in the results [2]. Human activity can also influence the behavior of monitored animals, especially in areas that are difficult to access; these are also areas of particular interest for making observations. Automated acoustical monitoring, by contrast, is possible in multiple locations simultaneously. Automatic or semi-automatic monitoring can be a valuable tool for making observations, especially in large and hard-to-reach areas. However, it cannot always replace manual scanning of the recordings: an experienced listener may still make more accurate observations than an automated analysis system [3]. This becomes important especially for birds with a low vocalization rate.

Motivations for bird species identification are numerous, and it has several possible applications in addition to the surveillance of birds. First, many people are interested in bird watching, and it would be helpful to have technology that can identify bird species by their sounds. Often birds can be heard but are difficult to see, especially during the night [4]; in these cases recognition by sound would be a powerful tool for identifying birds. Secondly, automatic identification of bird species near airports could prevent collisions with aircraft if birds are detected early enough [5, 6, 7]. Finally, identification could also prevent unwanted human-animal encounters, for example animals causing harm to agriculture [8].

1.1 Related work

Detection and classification of bird sounds is a typical audio classification problem with many similarities to, e.g., automatic speech recognition (ASR) and computational auditory scene analysis (CASA) and recognition (CASR).

In general, ASR operates on speech samples such as words or sentences. CASR typically omits the sound event segmentation phase, and scene recognition is performed continuously or at regular time intervals [9]. The extracted events are then represented with a certain parametric representation and compared to predefined event models. Finally, the sound event is associated with the class that most closely resembles the models of the target sound. Hence many parametric representations and classification algorithms applied in bird species recognition are common to ASR and CASR.

Automatic monitoring of environmental audio events and identification of animal species is a relatively small area of research compared to ASR and CASR. However, it has gathered increasingly more interest during recent years and is currently a very active area of research. The largest concentrated interest in recognizing animals by their sounds has focused on birds. Traditionally, bird vocalization identification has been performed by visual inspection of sound spectrograms; the first attempts to perform automated analysis of bird sounds were based on template matching [10, 11]. Some recent studies have exploited spectrogram images for animal sound identification [12] and for the retrieval of similar sound events from sound databases [13].

Segmentation of distinct sound events is a crucial part of an automatic identification system. Segmentation of bird sounds is a relatively sparsely covered area in the literature, and in most studies segmentation is performed by hand based on the energy content of the recordings. Energy based segmentation degrades when many birds sing simultaneously and their sounds become overlapped in time; this is a very common situation found in continuous unsupervised field recordings. However, sounds that overlap in time but occupy different frequency bands can be separated from spectrogram images of the recordings [14, 15]. Segmentation methods based on image processing of spectrograms also start by dividing the sounds into overlapping frames. Segments of bird vocalization are detected from the spectrogram images by searching for areas similar to predefined time-frequency masks of bird syllables. While masks are needed for each syllable type, the algorithm also performs bird versus non-bird classification. Another study focusing on tonal bird sounds aims to detect sinusoidal components from the frames and combine them into frequency tracks [16, 17].


Several different parametric representations of bird sounds have been proposed for the classification of bird species or even individuals. The majority of the features operate in the frequency or time-frequency domain and include Mel-frequency cepstral coefficients (MFCC) [2, 18, 19, 20, 21, 22, 23], linear predictive coefficients (LPC) [24, 22, 25] and wavelets [26]. Even though MFCC is adapted to the central properties of the human auditory system, it has proven to be a sufficient representation for many different types of bird sounds, especially those with wideband characteristics. Recently, cepstral features using a filter bank whose frequency resolution was designed for bird hearing rather than human hearing have also been proposed [27, 8].

Modeling of sinusoidal components has been used efficiently to represent tonal and harmonic bird sounds [28, 29, 6, 30, 16]. Frequency tracks of the sinusoidal components of sounds can be used for segmentation and also as features of the sounds for classification purposes [30, 16]. Especially in noisy conditions, classification of tonal bird sounds with sinusoidal features performs better than with MFCC features [30]. This is probably due to the fact that MFCC covers a wide spectrum while sinusoidal models operate in a narrow frequency band.

Low-level descriptive acoustical signal parameters, such as the spectral centroid (SC) and signal bandwidth (BW), provide suitable representations for different types of bird sounds [15]. Also, combinations of different feature sets have been used for the identification of bird species, e.g., Lopes et al. [7] use MFCC together with low-level acoustical features. Lee et al. [31] use different representations of bird sounds that exploit the spectral and temporal characteristics of the signals using fixed-duration birdsong segments; each segment is transformed into an image based on predefined image basis functions. A study by E. D. Chesmore [32] uses predefined time-dependent waveshape patterns to transform bird sounds into a series of codes; histograms of these code pairs were used to classify different species.

Several different classification algorithms have been applied in the context of bird vocalization recognition, including Gaussian Mixture Models (GMM) [19, 31, 17], Hidden Markov Models (HMM) [2, 18, 33, 22, 34], different types of Artificial Neural Networks (ANN) [35, 26, 36, 37, 24] and Self Organizing Maps (SOM) [38, 26].

In addition to birds, species recognition by sounds has also been a subject of study for other animals, including at least frogs [39, 12], alarm calls of prairie dogs [40], grasshoppers [41] and elephants [42]. A generalized LPC model that extracts perceptually relevant features for a particular species has also been developed and applied to elephants and whales [42]. One large area of study is the sound analysis and identification of marine sounds [43].

1.2 Scope and outline of the thesis

This thesis presents methods for the automated identification of specific audio events from recordings. The main focus is the identification of bird vocalizations for species classification. First, the characteristics of bird sounds and the typical circumstances encountered while performing recordings in nature are discussed. Sound recording equipment for outdoor use depends much on what purpose the recording needs to fulfill; for example, long-term acoustical monitoring in a specific location differs significantly from the spontaneous recording of a nearby bird. Secondly, the signal processing tools and algorithms required to generate interpretations from the recordings are presented. The main focus of this thesis is to investigate different parametric representations of bird sounds.

This thesis consists of an introductory part followed by six articles that have been published in scientific journals and conference proceedings. The introductory part is organized as follows. Chapter 2 focuses on the characteristics of bird vocalization and considerations related to natural audio scenes and recording outdoors. Chapter 3 presents the necessary signal processing procedures for automatic classification. Chapter 4 summarizes the main results of the publications, while Chapter 5 concludes the introductory part.


2. Acoustical Monitoring of Bird Sounds

This chapter describes the considerations required to perform recordings of bird sounds in their natural environment. Besides bird vocalizations, audio scenes can contain sounds from multiple other sources. In order to identify bird sounds within the recordings it is necessary to understand the specific characteristics of bird vocalizations as well as natural audio scenes. Considerations related to recording equipment are also covered in this chapter.

2.1 Natural audio scenes

Natural audio scenes can be extremely multiform depending on the geographical location, the time of day and the weather conditions. For the automatic identification of sound events, the simplest desirable natural audio scene would include only one sound source without any background noise. However, this is rarely the case when making recordings in nature. Typically multiple sound sources are present simultaneously and their sounds can overlap in time and frequency [44]. Figure 2.1 shows an example of the sounds of the Capercaillie (Tetrao urogallus, TETURO) during the courting season in two different audio environments. The first recording is clean and only the sounds of the Capercaillie are present, while the second recording includes sounds from other bird species as well. In the second recording the sounds from different birds are overlapping, but for a human listener it is easy to concentrate on a particular sound event even with single channel recordings. Furthermore, human listeners can also identify sound events from different sources even when they overlap (in time or frequency) or emanate from the same direction.



Figure 2.1. Example of Capercaillie sounds in two different audio scenes.

In addition to multiple sound sources, recordings are influenced by the environmental conditions of the recording site. The topography of the land, trees and other vegetation influence sound propagation [45]. Vegetation and topography cause sound echoes and reverberation, but they also attenuate acoustical wave propagation. This may become relevant especially when birds are at different distances from the recording device. When sound travels through air, high frequency components are attenuated more than low frequency components. Therefore, sounds originating from longer distances have less high frequency information than those arriving from shorter distances.

Furthermore, weather conditions and human actions also influence the recordings. The two main natural noise sources are wind and rain, and they affect recordings in two different ways: first, by adding noise to the recordings, and second, by changing animal behavior. In addition to natural noise sources, human-induced sounds, e.g., sounds from airplanes and other vehicles, factories, etc., may decrease the signal-to-noise ratio.

Some birds also change their repertoire and vocalization according to the surrounding audio scene [46, 47]. Birds closer to urban areas tend to sing at higher frequencies and have a higher minimum frequency limit to their songs. Urban birds also sing simpler songs than those in the forest [48].

Birds can also change their vocalization according to the situation they are in, e.g., during conflict situations when other birds intrude into a bird's territory. The vocalization complexity during aggressive encounters may also vary.

2.2 Organization and structure of bird sounds

Bird sounds can be categorized into songs and call-sounds [49]. Singing is limited to songbirds, but they cover only about half of the total number of bird species. Non-songbirds also utilize sounds to communicate and for them sounds are just as important as for songbirds [50]. Generally speaking, the sounds of songbirds are more complex. They also have a larger repertoire than non-songbirds since they have a more sophisticated control of their sound production mechanism [51]. Some species even have two independently vibrating membranes in their vocal organ, which enables the production of two independent sounds simultaneously [52].

Usually songs are longer and include more complex vocalizations than call sounds. Songs are spontaneous vocalizations while calls are associated with some meaning. In most species singing is limited to males, but in a few species females also sing and some species even participate in duets [53]. Female songs tend to be simpler than the songs produced by males. Most species sing at a certain time of the year, but birds also have a particular daily schedule for singing. The best time for observing bird song is during the spring breeding season. Especially during this time the male bird song has two main functions: i) to attract females and ii) to repel rival males entering the territory.

In section 2.1 it was noted that birds vocalize at higher frequencies in urban areas. In addition to this, birds have regional variations in song repertoire and syllable structure [49, 54]. A single individual of a species can have a repertoire of more than one type of song, and in some species the repertoire increases with the age of the bird. Larger repertoires play an important role in mate selection, as female birds prefer males with larger repertoires. Larger repertoires also form a territorial signal to other individuals.

The hierarchical levels in bird sounds are phrases, syllables and elements [53]. Figure 2.2 illustrates the hierarchical levels of a Thrush Nightingale (Luscinia luscinia, LUSLUS) song. A phrase is a series of syllables that occur in a particular order in the song. Typically, syllables in the phrase are similar but can also be different.


Figure 2.2. A song from the Thrush Nightingale. The song is composed of four phrases each containing a different number of syllables. The song also illustrates the wide spectrum of different types of sounds birds can produce.

Syllables are constructed from elements; an element is defined as the smallest separable unit in the spectrogram. Elements can overlap in time and frequency, and thus the separation of elements can be difficult and ambiguous. Therefore, the syllable is generally considered the basic building block for automatic identification, and bird sound recognition is typically performed based on the classification of individual syllables rather than elements. A series of syllables that occur together in a particular pattern is called a phrase. Syllables in a phrase are typically similar to each other, but they can also be different, as can be seen in the first and the last phrase of the song in figure 2.2. The number of syllables in a phrase may also vary.

In addition to the sounds produced by the vocalization mechanism, birds also produce sounds from physical activity, e.g., sounds from wings during flight. Sounds from physical activity can have special roles in bird communication [55], but they do not always carry a particular meaning for birds. Nevertheless, nonvocal sounds can be used for bird identification in some cases.

Birds can produce a large and diverse set of different sounds. The characteristics of a simple voiced sound are its fundamental frequency and possibly, but not always, its harmonics. A typical song may contain components which are purely sinusoidal, harmonic, non-harmonic, broadband and noisy in structure [52]. Sound is often modulated in amplitude or frequency, or even both together (coupled modulation) [56]. The frequency range of bird sounds typically lies between 100 Hz and 8 kHz, but the variation between different species is large.

Figure 2.2 of the thrush nightingale song also illustrates the large spectrum of different sounds that even a single bird species can produce. The song includes syllables that can be regarded as being at least harmonic and broadband in nature.

2.3 Outdoor sound recording

The recording equipment depends on the purpose of the recording, the environmental conditions at the recording site and the intended length of the recording. In order to make recordings in nature, the equipment must include at least one microphone or a set of microphones and a recorder. The microphones may be internal or external, the latter being the case when they must be of high quality or when they need a separate pre-amplifier. Also, in the case of external microphones, a separate energy source for the pre-amplifier might be required. Currently, the recorder is in practice always a digital recorder and it should capture the waveform in an uncompressed format.

Typically, most microphones designed for audio recording have a sufficient frequency response for recording birds in nature. More relevant for outdoor recordings are characteristics such as a microphone's self-noise level and its directivity pattern. Directional microphones are preferred in active or supervised nature recordings, since they can be pointed toward the sound source. Directional microphones are useful for increasing the signal-to-noise ratio of interesting sounds; they help to obtain good quality recordings when the sound source is in the direction of the microphone. Parabolic reflectors can be used for increasing the directional sensitivity even further, but they can alter the original sound wave. Since parabolas amplify the sound pressure level, the size of the reflector should be selected according to the wavelength of the sound to be recorded; longer wavelengths (lower frequencies) require larger reflectors.

Omnidirectional microphones capture sounds coming from all directions and thus there is no need to point them in any particular direction. This is a desirable property for passive recordings where the direction of the interesting sound is unknown, or where the sought-after sounds arrive from multiple directions. Also, omnidirectional microphones are useful if the sound source is likely to move, which is often the case with birds.

Portable recording devices with internal or external microphones can be used for long-term (one or two days maximum) recordings without human presence at the recording site [22].


Figure 2.3. Recording site and recording device for Capercaillie courting display.

An advantage of such devices is that they are lightweight and easy to move from one location to another; they can still provide much more useful audio data than human observation can. Additionally, passive recordings are not affected by human presence at the recording site. Figure 2.3 shows an example of the setup for recording Capercaillie courting sounds. For long-term monitoring, the energy requirements of the recording setup as well as its storage capacity become critical issues, especially in remote areas. Autonomous recording units (ARU) have been used to monitor in the Aleutian Archipelago [57] and in Florida [3]. ARUs enable even longer-term recordings than conventional portable recorders, which also makes studies on changes in seasonal behavior possible; such recordings are used, e.g., in marine mammal studies [58].

One challenge when recording in nature occurs when multiple audio source signals exist simultaneously. One option for separating individual source signals is to use a set of microphones [59]. Microphones can be placed in an array, which enables the separation of signals coming from different directions based on the time of arrival at each individual microphone. The configuration of the separate microphones may vary depending on the purpose of the recording. They can be placed, e.g., in a line with constant spacing [60] or in a cross configuration [61], which provides better support for all directions. In all array configurations, the individual microphones are relatively close to each other.

Another possibility for source separation in nature recordings is a spatially dispersed group of microphones or microphone arrays, which enables improved spatial localization of the sound sources at the recording site. The spatial location of sound sources can be evaluated from the time delays and intensity differences of the signals arriving at each microphone.


Spatially dispersed microphone setups can be especially useful for evaluating populations and interaction between individual birds.

To add to the challenge of outdoor recordings, environmental conditions must also be taken into account. Sources of noise such as wind, rain and sounds from human activity directly influence the recordings. Wind effects can be avoided to some extent by using wind shields; the sound of rain, however, has a wide spectrum, making it harder to remove. Especially for ARUs, changes in environmental conditions may cause problems with the recording devices. These changes include temperature and humidity fluctuations, and therefore the equipment needs to be protected against these elements. Undesired influences, such as animals and humans becoming interested in the recording equipment, must also be taken into account; disturbances from these sources need to be planned for by protecting the recording equipment when necessary.


3. Signal Processing Tools

This chapter describes the different phases of the sound event classification system. Figure 3.1 presents a block diagram of the main components of the classification system. Preprocessing can include, e.g., noise removal, filtering certain frequency bands from the signal, and emphasizing or attenuating certain frequency bands. The segmentation process extracts individual audio events or a series of events from continuous recordings. Hence the segmentation also performs data reduction by omitting periods of time in continuous recordings when audio events are not present. Depending on the application, the ordering of preprocessing and segmentation may also be interchanged. Feature extraction then transforms the audio signal segments into parametric representations. The objective in feature extraction is to reduce the information rate of the input signal into a compact set of features that can be used to separate audio events belonging to different classes. Finally, classification assigns each parametric representation to one of the target classes, or else gives a probability for the audio event belonging to one of the target classes.

Figure 3.1. A block diagram of the species classification system.
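As a rough illustration of the pipeline in Figure 3.1, the following Python sketch chains simple placeholder implementations of the four stages; the filter band, frame length, toy features and nearest-neighbour rule are illustrative assumptions, not the methods used in the publications.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(x, fs, band=(500.0, 10000.0)):
    # Band-pass filter to the frequency range where most bird sounds lie.
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def segment(x, fs, frame=0.01, margin_db=20.0):
    # Keep frames whose energy lies within margin_db of the loudest frame.
    n = int(frame * fs)
    frames = x[: len(x) // n * n].reshape(-1, n)
    e = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return [frames[i] for i in np.where(e > e.max() - margin_db)[0]]

def features(seg):
    # Toy two-dimensional feature vector: log-energy and zero-crossing rate.
    zcr = np.mean(np.abs(np.diff(np.sign(seg)))) / 2
    return np.array([10 * np.log10(np.sum(seg ** 2) + 1e-12), zcr])

def classify(f, train_feats, train_labels):
    # Nearest-neighbour decision against a labelled training set.
    d = np.linalg.norm(np.asarray(train_feats) - f, axis=1)
    return train_labels[int(np.argmin(d))]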


3.1 Segmentation

The aim of audio segmentation is to identify relevant boundaries for the distinct audio events existing within the input signal; it is often an essential processing phase and can be found in several audio processing applications. Accurate segmentation is also a crucial step in the entire classification system, since inaccurate segmentation can increase the noise in the features and lead to misclassifications. Segmentation of bird sounds in natural audio recordings can encompass two phases depending on the input. The first phase is used for long-term continuous recordings that may include long periods void of bird sounds; the goal in this phase is to find regions where birds are present and vocalizing. The goal in the second phase is to extract individual syllables in songs or a series of calls (see section 2.2 for the structure of bird sounds); syllables are generally considered to be the basic unit used in classification.

Segmentation is always based on some change in the characteristics of the audio signal and it can be carried out in the temporal or time-frequency domain. Segmentation is typically performed after the preprocessing stage and, depending on the application, it can be carried out by focusing only on a certain frequency band [13].

3.1.1 Frame based segmentation

Syllables in the songs of single bird individuals are often temporally distinct; therefore, segmentation by the temporal characteristics of the signal is justified. Segmentation methods based on the frame energy are popular in audio event detection: the approach is straightforward, computationally light and easy to implement. Segmentation typically starts by dividing the signal into temporally overlapping frames. The energy level is calculated for each frame, and when the energy surpasses a threshold level the frame is considered to be part of a sound event. In the next phase adjacent frames are connected together, while too short candidates are removed. Since the background noise level changes in audio recordings, it is common to use an adaptive threshold for syllable detection.
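A minimal sketch of such an energy-based detector with an adaptive, percentile-based noise-floor estimate is given below; the frame length, hop, threshold margin and minimum duration are illustrative values, not the exact settings used in the publications.

import numpy as np

def energy_segments(x, fs, frame=0.01, hop=0.005, rel_db=15.0, min_dur=0.02):
    # Frame energies in dB over overlapping frames.
    n, h = int(frame * fs), int(hop * fs)
    starts = np.arange(0, len(x) - n, h)
    e_db = np.array([10 * np.log10(np.sum(x[s:s + n] ** 2) + 1e-12) for s in starts])
    noise_floor = np.percentile(e_db, 20)       # adaptive estimate of the background level
    active = e_db > noise_floor + rel_db        # frames considered part of a sound event
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = starts[i]                   # event onset
        elif not a and start is not None:
            end = starts[i - 1] + n             # event offset
            if (end - start) / fs >= min_dur:   # drop too-short candidates
                segments.append((start, end))
            start = None
    if start is not None:
        segments.append((start, len(x)))
    return segments                             # list of (start_sample, end_sample) pairs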

However, energy-based segmentation methods are often sensitive to the noise level even when adaptive noise level estimation is used. Another problem with energy-based methods is that it is difficult to distinguish elements with very different energy levels and signal-to-noise ratios. To avoid the deficiencies of energy-based segmentation methods, it is possible to calculate some other characteristic measure for the frames, e.g., their spectral characteristics [44]. In Publication IV, a mutual information (MI) measure of time domain structures was used for segmentation. The mutual information measure of a frame is calculated as

I(\pi_i, \pi_j) = p(\pi_i, \pi_j) \log \frac{p(\pi_i, \pi_j)}{p(\pi_i)\, p(\pi_j)}    (3.1)

where the p(\cdot) are distributions of the different rankings of five signal values obtained by a moving analysis window inside the frame under study (this is further discussed in section 3.2.4). Mutual information measures the structural richness of the signal and has a higher value for a signal with some structure than for background noise alone.
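The following sketch computes this measure for one frame from the rankings of five-sample windows; the pairing of consecutive patterns and the absence of smoothing are simplifying assumptions made for illustration, not the exact procedure of Publication IV.

import numpy as np
from math import factorial
from itertools import permutations

def ordinal_patterns(frame, w=5):
    # Ranking (permutation) of each w-sample window, encoded as an integer index.
    lut = {p: k for k, p in enumerate(permutations(range(w)))}
    return np.array([lut[tuple(np.argsort(frame[i:i + w]))]
                     for i in range(len(frame) - w + 1)])

def frame_mutual_information(frame, w=5, lag=1):
    pi = ordinal_patterns(frame, w)
    joint = np.zeros((factorial(w), factorial(w)))
    for a, b in zip(pi[:-lag], pi[lag:]):       # co-occurring pattern pairs (pi_i, pi_j)
        joint[a, b] += 1
    joint /= joint.sum()
    marg_i, marg_j = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    # Sum of the terms p(pi_i, pi_j) * log( p(pi_i, pi_j) / (p(pi_i) p(pi_j)) ) of Eq. (3.1).
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(marg_i, marg_j)[nz]))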

3.2 Parametric representations

This part introduces the different parametric representations of the audio signals. The features are divided into spectral features, sinusoidal features, low-level signal parameters and time-domain structural features. Different representations have advantages and disadvantages depending on the sound type they are trying to represent. Even though the production mechanism of specific sounds across species is quite well known, most of the features do not directly rely on the production mechanism of the bird sounds but rather try to characterize the signals so that discrimination between classes is maximized. The number of possible features that could be extracted from a signal at a specific time is enormous, and hence it is practical to select only the most efficient features when possible. However, if the underlying sound type and suitable features are not known beforehand, it is common to calculate many features and reduce their number later on by discarding some of them according to their discriminative power or by transforming the features to a lower dimension. Feature dimension reduction by linear discriminant analysis [62, p. 155] was studied in Publication I.

Many features are calculated on a frame basis, where a feature vector is calculated for each frame separately; the feature extraction phase thus produces a sequence of feature vectors. The processing of feature vectors in classifiers varies.

The sequence of feature vectors can be used as the input of the classifier, as in Publication V, or the classifier input may be the average of the feature vectors, as in Publication I, Publication III and Publication VI. When features are obtained for the entire extracted segment, a single feature vector is used as the input of the classifier.

3.2.1 Spectral features

Frequency or time-frequency domain features begin with the Fourier transformation of the signal; they measure the frequency content of the signal. The Fourier transform can be calculated from the entire signal sample or on a frame-by-frame basis, providing time-varying characteristics of the input signal.

Mel-frequency cepstral coefficients (MFCC) [63] are a widely used feature representation in different audio recognition systems and are popular in the context of bird identification as well. The construction of the MFCC parameters begins with the calculation of the power spectrum, which is then transformed into the mel-frequency spectrum using a filter bank of triangular filters. The dynamic range of the spectrum is then reduced by taking the logarithm of each channel. The ith MFC coefficient is calculated as

\mathrm{MFCC}_i = \sum_{k=1}^{K} X_k \cos\left[ \frac{\pi i}{K} \left( k - \frac{1}{2} \right) \right]    (3.2)

where X_k is the logarithmic energy of the kth mel-spectrum channel and K is the total number of bands. The discrete cosine transform (DCT) decorrelates the features and reduces the dimensionality of the feature vector. In addition to the static MFCC features, it is common to incorporate the time derivatives and second-order derivatives into the feature representation.
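A compact sketch of this computation for a single frame is shown below; the mel formula, the number of filters and the FFT length are common defaults assumed here for illustration rather than the exact settings of the publications.

import numpy as np

def mel_filterbank(K, n_fft, fs, f_lo=0.0, f_hi=None):
    # Triangular filters spaced linearly on the mel scale.
    f_hi = f_hi or fs / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(f_lo), mel(f_hi), K + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((K, n_fft // 2 + 1))
    for k in range(1, K + 1):
        l, c, r = bins[k - 1], bins[k], bins[k + 1]
        if c > l:
            fb[k - 1, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        if r > c:
            fb[k - 1, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    return fb

def mfcc(frame, fs, n_coef=12, K=26, n_fft=512):
    X = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2   # power spectrum
    Xk = np.log(mel_filterbank(K, n_fft, fs) @ X + 1e-12)                 # log mel energies
    i = np.arange(1, n_coef + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.cos(np.pi * i * (k - 0.5) / K) @ Xk                         # Eq. (3.2)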

3.2.2 Sinusoidal features

A large number of bird sounds are tonal or composed of a small number of sinusoidal components [64]. Syllables of tonal and harmonic bird sounds can be efficiently modeled by a single or a small number of time-varying amplitude and frequency trajectories of sinusoidal components [28]. Sinusoidal modeling starts by finding the trajectories of the sinusoidal components in the short-time spectrum of the signal; these components are found by searching for peaks in the power spectrum of the frames. Furthermore, the dynamic structure of tonal bird sounds can be incorporated into the model by evaluating the delta features of the trajectories [30].


Originally, sinusoidal modeling [65] was developed for the analysis and synthesis of speech. In sinusoidal modeling the underlying signal is represented with a small number of sinusoidal components. For the classification of sound events it is adequate to represent sounds with one sinusoidal component, or for harmonic sounds with one sinusoidal component per harmonic component, as in [28] and Publication II.
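As an illustration, the sketch below tracks the single strongest spectral peak from frame to frame and returns its frequency trajectory together with a delta trajectory; the frame and FFT sizes are illustrative, and harmonic components are ignored for brevity.

import numpy as np

def frequency_track(x, fs, frame=0.01, hop=0.005, n_fft=1024):
    # Dominant sinusoidal component per frame (peak of the short-time power spectrum).
    n, h = int(frame * fs), int(hop * fs)
    win = np.hanning(n)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    track = []
    for s in range(0, len(x) - n, h):
        spec = np.abs(np.fft.rfft(x[s:s + n] * win, n_fft))
        track.append(freqs[int(np.argmax(spec))])
    track = np.array(track)
    delta = np.gradient(track)      # local dynamics of the frequency trajectory
    return track, delta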

3.2.3 Low-level signal features

In many applications in the field of audio signal processing, the specific signal model is unknown and the spectral characteristics may vary. This is typical especially in the field of animal and natural sounds. In these applications it is common to use descriptive measures of the signals to parameterize the sounds [15]. Low-level signal parameters include temporal and spectral domain features, calculated on a frame basis but also over the entire signal segment. Equations for the low-level feature calculations are presented in Table 3.1.

Spectral features include the spectral centroid, signal bandwidth, spectral rolloff frequency, spectral flux, spectral flatness and the frequency range of the signal. The spectral centroid (SC) and signal bandwidth (BW) describe the center point of the spectrum and the width of the frequency band around the SC. The spectral rolloff frequency (SRF) describes the skewness of the spectral shape, whereas spectral flux (SF) measures spectral differences between adjacent frames. Finally, spectral flatness (SFM) characterizes how the energy is spread over the frequency band of the signal.

Temporal domain features include the zero-crossing rate (ZCR), signal energy and the duration of the underlying signal. ZCR measures the rate of sign changes of the signal; it correlates with the SC although it is calculated in the temporal domain. Signal energy (EN) measures the log-energy content of the signal. The temporal duration of the signal, together with the frequency range, defines the temporal and spectral boundaries of the underlying signal.


Feature name               Equation
Spectral centroid          SC = \sum_{n=0}^{M} n |X(n)|^2 / \sum_{n=0}^{M} |X(n)|^2
Signal bandwidth           BW = \sqrt{ \sum_{n=0}^{M} (n - SC)^2 |X(n)|^2 / \sum_{n=0}^{M} |X(n)|^2 }
Spectral rolloff frequency SRF = \max\{ h \mid \sum_{n=0}^{h} |X(n)|^2 \le Th \sum_{n=0}^{M} |X(n)|^2 \}
Spectral flux              SF = \sum_{n=0}^{M} ( |X_i(n)| - |X_{i+1}(n)| )
Spectral flatness          SFM = 10 \log_{10} \left[ ( \prod_{i=0}^{M} |X_i| )^{1/M} / ( \frac{1}{M} \sum_{i=0}^{M} |X_i| ) \right]
Zero-crossing rate         ZCR = \sum_{n=0}^{M-1} | \mathrm{sgn}(x(n)) - \mathrm{sgn}(x(n+1)) |
Short-time energy          EN = 20 \log_{10} \sum_{n} |x(n)|

Table 3.1. Formulas for the calculation of the low-level features. In the equations, x(n) and X(n) denote the time signal and the power spectrum of the signal, respectively, while M denotes the maximum frequency bin. Th is a threshold value between 0 and 1; a commonly used value is Th = 0.95.

Features derived from the modulation spectrum were also used as low-level signal descriptors. The modulation spectrum describes the modulation frequency and modulation index of the signal. First, the envelope of the signal is calculated as

E_H(n) = | x(n) + \imath\, \mathcal{H}(x(n)) |    (3.3)

where x(n) + \imath \mathcal{H}(x(n)) is the analytic signal and \mathcal{H} denotes the Hilbert transform of the signal [66]. The modulation spectrum is obtained by calculating the Fourier transform of the envelope signal. The frequency and magnitude of the maximum peak of the modulation spectrum are used as the final features; these correspond to the modulation frequency and modulation index of the underlying signal. Low-level descriptive features have been utilized in Publication I, Publication II and Publication III.
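The sketch below computes the descriptors of Table 3.1 together with the modulation-spectrum features of Eq. (3.3) for one signal segment; it follows the formulas as written above, and the exact normalizations used in the publications may differ.

import numpy as np
from scipy.signal import hilbert

def low_level_features(x, fs, th=0.95):
    X = np.abs(np.fft.rfft(x)) ** 2                           # power spectrum |X(n)|^2
    n = np.arange(len(X))
    sc = np.sum(n * X) / np.sum(X)                            # spectral centroid (bins)
    bw = np.sqrt(np.sum((n - sc) ** 2 * X) / np.sum(X))       # signal bandwidth
    srf = int(np.searchsorted(np.cumsum(X), th * np.sum(X)))  # spectral rolloff bin
    sfm = 10 * np.log10(np.exp(np.mean(np.log(X + 1e-12))) / np.mean(X))  # flatness
    zcr = np.sum(np.abs(np.diff(np.sign(x)))) / 2             # zero-crossing rate
    en = 20 * np.log10(np.sum(np.abs(x)) + 1e-12)             # short-time energy
    # Modulation spectrum: Fourier transform of the Hilbert envelope, Eq. (3.3).
    env = np.abs(hilbert(x))
    E = np.abs(np.fft.rfft(env - env.mean()))
    mod_bin = int(np.argmax(E[1:])) + 1                       # skip the DC bin
    return dict(SC=sc, BW=bw, SRF=srf, SFM=sfm, ZCR=zcr, EN=en,
                mod_freq=mod_bin * fs / len(env), mod_index=float(E[mod_bin]))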

3.2.4 Time domain structures

Time-domain feature representations have not gained much interest, since most feature representations operate in the frequency or time-frequency domain. Recently, however, more attention has focused on signal analysis and pattern recognition in time series based on temporal structures. Most of the applications utilizing temporal patterns concern medical signals that are generally characterized by a non-stationary structure [67, 68, 69, 70]. These types of signals are often difficult to identify, e.g., when using spectral methods.

The main advantage of using temporal features is a potentially shorter analysis window when compared to spectral methods. The frequency resolution of spectral methods depends on the length of the analysis window: the underlying signal is assumed to be sufficiently stationary within the analysis window; however, this assumption does not hold for many environmental signals [71]. Also, some bird sounds have a highly dynamic structure, and local temporal dynamics have been found important for the identification of these sounds [34, 16].

Time-domain structural features describe the waveshapes of the underlying signal [32]. The number of possible waveforms is enormous, and therefore model waveshapes are stored in a codebook. The signal waveform is mapped to the wave shapes in the codebook, where each waveshape corresponds to a unique codeword. The drawback of this method is that the codebook must be developed separately for each application.

A different approach that avoids codebook creation is to use the rankings of the signal amplitude values as a description of the waveform. In an analysis window consisting of N samples, the amplitude values can be ordered in N-factorial (N!) different ways; the analysis window can therefore be coded with one of the N! codes. In this approach the number of possible codes grows enormously as the size of the analysis window N increases. However, in real-world signals only a small number of all possible rankings exist or are required for accurate analysis, and very short time windows have proven to be sufficient for representing the statistical structure of signals. Another desirable property of this approach is its robustness to some types of noise. The method also offers invariance to the scaling level of the underlying signal, because it is based on the mutual relations between the samples in the analysis window [72]: as long as noise does not alter the ranking of the signal values, the corresponding code does not change. Furthermore, since the feature representations are based on statistics over the entire signal, the effect of a single analysis window on the statistic is low. Time-domain structural features have been used for the feature representation of sound events in Publication V and Publication VI.
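A minimal sketch of such a ranking-based representation is shown below: it collects the distribution of ordinal patterns of N consecutive samples over a whole sound event. The window length and the plain histogram normalization are illustrative choices, not the exact permutation features of Publication V and Publication VI.

import numpy as np
from itertools import permutations

def permutation_histogram(x, N=5):
    # Distribution of the N! possible rankings of N consecutive samples.
    lut = {p: k for k, p in enumerate(permutations(range(N)))}
    counts = np.zeros(len(lut))
    for i in range(len(x) - N + 1):
        counts[lut[tuple(np.argsort(x[i:i + N]))]] += 1
    return counts / counts.sum()    # depends only on sample rankings, hence scale invariant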

3.3 Classification

The role of a classifier is to decide which class a test pattern belongs to, based on a comparison with a training set. This is accomplished by finding the best match between a test sample and a class model or target pattern. In classification tasks, the feature vector space is divided into regions that correspond to different classes. In classification of bird syllables, each class (representing a bird species) may be composed of different types of feature representations, each corresponding to a different type of syllable of that species. In this case some models may have the same class label.

3.3.1 k-nearest neighbour classification

The k-nearest neighbor (kNN) algorithm is a nonlinear classifier that decides on a class label by using the k nearest neighbors of a test vector among the vectors in the training data set. In nearest neighbor classification (k = 1), the class that exhibits the minimum distance to the test vector is chosen. In the kNN method, the test vector is assigned to the class which is most often represented in the k-nearest neighborhood [62, p. 45]. The kNN method does not need any tuning of classifier parameters, since the feature vectors in the training data set themselves comprise the classifier. However, the method may exhibit a bias when different classes have different numbers of samples or are unequally dispersed in the feature space.

Different metrics can be used to measure the distance between a test sample and the feature vectors in the training data. The Euclidean distance is a commonly used metric; for two feature vectors x and y it is calculated as

d_E(x, y) = \| x - y \| = \sqrt{(x - y)^T (x - y)}    (3.4)
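A minimal kNN sketch using the Euclidean distance of Eq. (3.4) is given below; the variable names are illustrative and the snippet is not taken from the publications.

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=1):
    """Assign test vector x to the majority class among its k nearest training vectors."""
    d = np.sqrt(np.sum((train_X - x) ** 2, axis=1))    # d_E(x, y) to every training vector
    nearest = np.argsort(d)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]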

The major drawback of the Euclidean distance appears with correlated features and features with different dynamic ranges; both typically reduce recognition accuracy. This problem may be resolved by using the Mahalanobis distance when calculating the distance between feature vectors. The Mahalanobis distance between vectors is calculated as

d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}    (3.5)

where \Sigma is the covariance matrix of the training data. When the feature vector represents a distribution of some aspect of the signal, the Kullback-Leibler (K-L) divergence is a suitable measure for evaluating the distance between two feature vectors. The K-L divergence is calculated as

d_{KL}(x, y) = \sum p(x) \log \frac{p(x)}{p(y)}    (3.6)

The K-L divergence was used in Publication V.
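The two alternative distances can be sketched in the same style; Sigma is estimated from the training data and p, q are normalized feature distributions. These are generic textbook formulations of Eqs. (3.5) and (3.6), shown only for illustration.

import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance of Eq. (3.5); Sigma is the training-data covariance matrix."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence of Eq. (3.6) between two normalized distributions p and q."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))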


3.3.2 Support vector machines

The support vector machine (SVM) is a classification technique that first transforms nonlinearly separable feature representations into a higher-dimensional feature space where the classification problem can be solved by a linear hyperplane. The SVM utilizes the feature vectors near the decision boundary in model construction, making the classifier locally and globally optimal [73, p. 325]. Mapping to the higher-dimensional feature space is conducted implicitly by kernel functions, and thus calculations are not performed directly in the higher-dimensional space. The SVM classifier is optimal in the sense that it maximizes the margin between classes. The SVM solution for a classification problem is denoted by

f(x) = \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x) + b    (3.7)

where \alpha_i and y_i are the weights and class labels of the support vectors (actual observations in the training set), x_i is the corresponding feature vector, b is a constant bias term, and x is the test feature vector. The term n_{sv} is the number of support vectors in the classifier and K(x_i, x) is the kernel function which enables the calculations in the higher-dimensional space.

Since SVM classifiers are constructed from feature vectors, prior knowledge of the structure of the classifier is not required; the structure is formed during the training of the model. SVMs have good generalization properties both locally and globally [74, p. 52]. SVMs maximize the margin between classes and are suitable for variously shaped classes as well as for classes constructed from many clusters in the feature space. The SVM classifier also alleviates the problem of differently sized datasets per class, since only training samples close to the decision boundary are used to form the classifier and the weight \alpha_i of the other samples is set to zero.

Fundamentally, SVMs are binary classifiers, where y ∈ {−1, +1} in equation (3.7); this is often inadequate, since real classification problems are mostly multiclass in nature. The two main approaches to extend SVMs to multiclass cases are methods that use one decision function and methods that combine multiple binary classifiers [75]. The latter approach can use a one-against-rest strategy, where each class is trained against the rest of the classes; the test vector is then assigned to the class that has the largest margin to the decision boundary. Another strategy constructs one-against-one classifiers for all combinations of classes, leading to a decision tree structure for the entire classifier. A one-against-one classification strategy was used in Publication III.
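A one-against-one multiclass SVM can be sketched as follows; scikit-learn is assumed here purely for illustration (its SVC builds pairwise one-against-one classifiers internally, though not the exact decision-tree topology of Publication III), and the random data stands in for syllable feature vectors.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(90, 12))                  # placeholder 12-dimensional syllable features
y_train = np.repeat(np.arange(3), 30)                # three "species" classes
X_train[y_train == 1] += 2.0                         # separate the classes artificially
X_train[y_train == 2] -= 2.0

clf = SVC(kernel="rbf", decision_function_shape="ovo")   # binary SVMs for every pair of classes
clf.fit(X_train, y_train)
predicted_species = clf.predict(X_train[:5])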

4. Summary and Main Results of the Publications

This section summarizes the main results of the thesis. Identification of bird species from syllables was studied in Publications I, II, III and VI. Table 4.1 summarizes the number of species and the methods used in these papers, as well as the average classification score of bird species classification by syllables. Only the results obtained in Publication I and Publication III (species set 1) are fully comparable, due to identical data sets.

Publication I:   6 species; features: low-level, MFCC; classifiers: kNN, kNN + d_M(x, y); definitive score: 74% (MFCC)
Publication II:  14 species; features: low-level, MFCC, sinusoidal; classifiers: GMM, HMM, DTW; definitive score: 52% (MFCC, Δ, ΔΔ)
Publication III: set 1: 6 species, set 2: 8 species; features: low-level, MFCC; classifier: SVM; definitive score: set 1: 91%, set 2: 98% (MFCC, low-level)
Publication VI:  set 1: 6 species, set 2: 10 species; features: PPF, MFCC; classifier: kNN; definitive score: set 1: 65%, set 2: 63% (PPF)

Table 4.1. Summary of the methods and results. The definitive classification score indicates the best overall classification result for the species set in question, together with the feature set that was used to obtain the result.

4.1 Publication I: Parametrization of Inharmonic Bird Sounds for Automatic Recognition

Publication I focuses on the classification of bird species that produce mostly inharmonic sounds. Such bird species are, for example, the Common Raven (Corvus corax, CORRAX) and the Hooded Crow (Corvus corone cornix, CORNIC), whose vocalizations are considered inharmonic for more than 95% of the time. Two different parametric representations were studied in this work. A total of 19 low-level descriptive parameters provide the descriptive characteristics of the bird vocalization, while a set of MFCC features provides a robust representation of the spectrum of the sounds. Linear discriminant analysis (LDA) was applied for evaluating the importance of single descriptive features and optionally used to reduce the length of the feature vector.

Recognition results using the kNN classifier exhibit better species identification accuracy with MFCC features, especially when the Euclidean distance measure was used. The difference in accuracy was much smaller with the Mahalanobis distance measure, presumably because the Mahalanobis distance decorrelates the features, which increases the accuracy with the descriptive features. The LDA analysis shows that single features have different discriminative powers and that LDA can be used for dimension reduction of the feature vector. However, features with low discriminative power did not impair classification accuracy when they were included in the feature vectors.

4.2 Publication II: Parametric Representations of Bird Sounds for Automatic Species Recognition

In Publication II bird species classification was studied for 14 passerine birds that produce mostly voiced (tonal and harmonic) sounds. In addition to the features in Publication I, classification was also tested with sinusoidal models of the syllables. Four different classes of syllables were used, characterized by: i) pure tonal sounds, ii) harmonic sounds whose strongest sinusoidal component was the fundamental frequency, iii) harmonic sounds whose strongest component was the first harmonic of the fundamental frequency, and iv) harmonic sounds whose strongest component was the second harmonic of the fundamental frequency. In addition to the MFCC features of Publication I, the first- and second-order delta parameters were also used. GMM and HMM were used for the classification of syllables and songs. Additionally, dynamic time warping (DTW) was tested in syllable-based recognition in order to achieve a better fit between syllables of different lengths.

Recognition results show that the best average accuracy was obtained when MFCC features were used together with low-level descriptive features. DTW improves recognition accuracy when compared to the case where features were obtained on a frame basis. However, there are species-specific variations in the results, since different features capture different aspects of the syllables. For example, the sinusoidal model produced the best classification accuracy for species that generate tonal sounds. Song-level results exhibit considerably better performance than classification based on single syllables. This suggests that the dynamics of the syllables in the song are also important for identification of bird species.
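Dynamic time warping, as tested in Publication II, aligns frame-wise feature sequences of different lengths before comparison. The following is a generic textbook DTW sketch, not the original implementation.

import numpy as np

def dtw_distance(A, B):
    """DTW cost between feature sequences A (n x d) and B (m x d)."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])       # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]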

4.3 Publication III: Bird Species Recognition Using Support Vector Machines.

Publication III applies a support vector machine (SVM) classifier to the identification of bird syllables for two different sets of bird species. The first set (six species) included the same material that was used in Publication I, and the second set (eight species) contained the same data as in [26]. Bird syllables were represented using MFCC features, including delta and delta-delta parameters, and low-level descriptive features. In this work a decision tree topology of binary SVM classifiers was used to achieve a multiclass classifier.

Recognition results show improved recognition accuracy for both sets of bird species when compared to the reference methods. Especially the results for the first set exhibit considerably better accuracy for syllable classification. In addition, for both sets of species, the best accuracy was achieved when all available features were used for classification. This result suggests that the training of an SVM classifier scales the features so that even features with low discriminative power can improve the recognition accuracy.

4.4 Publication IV: Monitoring of Capercaillie Courting Display

Publication IV presents methods for the detection and classification of Capercaillie breeding sounds from continuous wildlife recordings. Three different types of vocalized sounds that Capercaillies produce during the courting season are presented in Figure 4.1. The recording devices were set up at the courting ground before courting started and collected after courting had ended; the recordings were therefore made without human disturbance at the recording site. Due to the size of the courting site, birds were located at different distances from the recorder and the signal-to-noise level varied considerably. Segmentation of the sound events was performed based on the mutual information measure of the distribution of structural patterns in short time windows. Figure 4.2 depicts an example of the segmentation in a very low signal-to-noise ratio situation.

Figure 4.1. Three different types of Capercaillie sounds during courting season

Figure 4.2. Segmentation of a Capercaillie sound event in the case of a low signal-to-noise ratio. The duration of the sound event is approximately 20 ms. The detected sounds are similar to those in the middle panel of Figure 4.1.

The mutual information measure also captures sound events other than Capercaillie breeding sounds. The Capercaillie sounds were finally detected among the segmented sound events by utilizing MFCC features, achieving a 93% accuracy for correctly detected sound events and a 9% false acceptance rate.

4.5 Publication V: Classification of Audio Events Using Permutation Transformation

Publication V introduces novel time-domain features for representing audio events. The method is based on the statistics of short-term ordinal structures existing in the waveform. The method first transforms the sample values in a short-time window into rank numbers according to the amplitude values. Each permutation of rank numbers is then represented with a unique code. Finally, the frequencies of code pair occurrences (codes of adjacent windows) are accumulated to construct a permutation pair frequency (PPF) matrix that acts as a feature representation of the sound segment. Figure 4.3 shows an example of a temporal pattern of an arbitrary signal.

Figure 4.3. Example of a temporal pattern (continuous black line) of five sample values. The corresponding rank numbers for these amplitude values are (4 2 3 5 1).
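The construction of the PPF matrix can be sketched as follows: permutation codes are computed for consecutive non-overlapping windows, and the occurrence counts of adjacent code pairs are accumulated into a matrix. The window length is an illustrative choice, and the sketch does not reproduce the handling of equal sample values discussed in the appendix.

import numpy as np
from math import factorial

def perm_code(w):
    """Permutation code of a window (rank-order coding of the samples)."""
    ranks = np.argsort(np.argsort(w))
    code, n = 0, len(w)
    for i, r in enumerate(ranks):
        code += int(np.sum(ranks[i + 1:] < r)) * factorial(n - 1 - i)
    return code

def ppf_matrix(x, win=5):
    """Accumulate a (win! x win!) matrix of adjacent-window permutation-code pairs."""
    n_codes = factorial(win)
    codes = [perm_code(x[i:i + win]) for i in range(0, len(x) - win + 1, win)]
    P = np.zeros((n_codes, n_codes))
    for c1, c2 in zip(codes[:-1], codes[1:]):        # code pairs of adjacent windows
        P[c1, c2] += 1.0
    return P / max(P.sum(), 1.0)                     # normalize to a joint distribution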

PPF-matrices were tested with two different classification tasks. The first task was to separate four different randomly generated noise sequences that have equal spectra but exhibit differences in the temporal domain due to the manner in which they were created. The second task was to identify stop consonants existing in human speech on the basis of the burst part of the consonant. The first task has been proven to be difficult when using conventional frequency domain methods due to almost equivalent spectra. Likewise, stop consonant bursts are difficult to classify with conventional methods due to their short duration and diverse structure. The recognition results for both assignments showed that the proposed method is an efficient feature representation for non-stationary and noisy sounds.


4.6 Publication VI: New Parametric Representations of Bird Sounds for Automatic Classification

In Publication VI the same feature extraction method that was used in Publication V was applied to the classification of bird species based on their sounds. Classification was tested with two groups of bird species; the first group consisted of bird species that produce mostly noisy and inharmonic sounds. The second group included, in addition to the species in the first group, four passerine birds whose sounds are characterized by tonal components. Classification was tested for single syllables and for entire songs, and the results were compared to those obtained using MFCC features as a baseline method. The kNN classifier with the Euclidean distance measure was applied to the classification of the bird syllables. Recognition results showed significantly better classification rates with the proposed method when compared to those obtained with MFCC features.

5. Conclusions and Future Directions

This thesis presents systems and methods for acoustical monitoring in natural environments and identification of specific sound events. The main focus is on bird vocalization classification, which encompasses several challenges. First, bird vocalizations are retrieved from recordings that may contain vocalizations of one individual bird or potentially several birds. Especially continuous recordings can contain several different noise sources and unwanted sounds, as well as many bird species and individuals vocalizing simultaneously. Furthermore, the extracted multiform bird sounds are classified based on parametric representations of the sounds.

This introductory part of the thesis covers the background information required for a bird vocalization identification system. The main attention in this thesis is on different parametric representations of bird sounds and their performance in the classification of different types of bird sounds. Traditionally, sounds with a wide spectrum and noise-like structure have received less attention than voiced sounds, and hence these sounds have had a significant role in this thesis.

While the technical development of recording equipment and computers has provided improved resources for longer recordings and more computational power for the analysis of the sounds and the training of the models, the main challenge in developing an autonomous classification system is still to find enough annotated sound examples from birds. This is a problem especially with rare species as well as with birds that vocalize infrequently or have a very rich spectrum of different vocalizations. It becomes significant especially when the system should identify species from a very large number of possible candidates, which would be the case, e.g., in commercial applications. Recently, more bird sound databases have become available for public use and now provide a sufficient amount of data to form useful corpora. Also, efforts on the digitization of large analogue animal sound archives have increased the amount of data available [13].

Even though the automatic identification of birds and other animals has been an active area of research during at least the last decade, the bioacoustic community is still missing a common, well-established framework for bioacoustical signal classification [59] covering all the stages of the classification system. A wireless sensor network with the potential to support a varying number of sensor nodes [76, 77] would be ideal for many recording scenarios of bioacoustic signals. Most available research has focused on some predefined problem setting, and the data has been collected specifically for that purpose and is generally not available for others to use. Even now there is no well-established public database that would be available for general purposes. As long as such a database does not exist, comparing different methods will remain difficult. Alternatively, future bioacoustic challenges could provide a suitable platform for comparing different systems [78].

A remarkable potential application for automated acoustical monitoring is the surveillance of certain target species simultaneously in several different locations. Automated monitoring may not substitute monitoring by human listeners entirely, but it enables monitoring in more locations. Automated monitoring combined with other methods would probably produce a more accurate surveillance of migrating birds. Also, surveillance of bird species populations and monitoring of acoustical scenes in certain locations would be suitable for monitoring environmental conditions in those locations. Automated monitoring would immediately reveal changes in acoustical scenes or bird species populations and potentially enable a fast response to the cause of the change.

The publications in this thesis and other studies in the field have utilized several different parametric representations for different types of sound events. A future goal would be to integrate different feature representations together. This would potentially enable a more universal system for the classification of different types of sounds. Also, robust segmentation of bird vocalizations is still an open question and requires more attention in the future. The main challenge in the segmentation of syllables is still the large spectrum of different vocalizations, especially in terms of temporal and spectral coverage.


Features based on time domain structures utilized in Publication IV, Publication V and Publication VI have already produced interesting results and confirmed the potential of temporal domain feature representations. A potential area for future research is the continued development of temporal structural features. More investigation is required to determine how this method behaves, especially when multiple sound sources occur simultaneously; a more systematic evaluation of the different types of signals and noise conditions is also required.


Appendix

Additional notes and clarifications for the included publications

The MFCC features were utilized in the classification of bird species in many of the included publications throughout the thesis. In all classification experiments the average MFCC feature vector was calculated over the syllable duration.

In Publication IV, Publication V and Publication VI the notation of a single permutation was changed after the introduction of how a single permutation is formed. The correspondence between the notations is \pi_{i,j} \equiv \pi^{\tau}_{n}(t), satisfying i, j = t. The notation was changed in order to make the sequential equations more explicit.

Publication I

The presented paper handles bird sounds that cannot be regarded as tonal or harmonic. In a preceding study [28] bird sounds were divided into four classes based on their harmonic structure. However, a large number of bird sounds do not fit into these classes. In this paper F0, F1 and F2 refer to the harmonic components, e.g., F0 is the fundamental frequency and F1 and F2 are the first and the second harmonic components, respectively.

Publication IV

This publication concerns the detection of Capercaillie courting sounds in long-term outdoor recordings. Capercaillie courting sounds have a wide spectrum, and in outdoor recordings the signal-to-noise ratio may vary considerably. Thus, the mutual information of short-time structures was used for segmentation of the sound events. Mutual information measures the structural information of a signal rather than just its energy content. This segmentation method identifies as part of sound events those frames whose mutual information measure exceeds twice the manually assigned threshold for the mutual information of the background noise.

After the segmentation phase the MFCC features, as explained in Publication IV, were calculated for those frames that were detected as part of a sound event. The Euclidean distance between the MFCC features of the frames to be classified and the models of the Capercaillie sounds was divided by the mutual information (MI) measure before the final classification. Since in the studied case the largest MI values occurred for the instances of Capercaillie, this method naturally improved the classification of the Capercaillie sounds and reduced the misclassifications of other sound events. However, one has to remember that this combination of MI values with distance measures works well only for cases where the sound to be classified has a stronger structure (in MI terms) than all the competing sound events.

For the experimental evaluation a sound sample of 10 minutes of a typical Capercaillie courting display was used. This section included, in addition to Capercaillie sounds, sounds from other sources, most notably from other birds. However, the amount of other sounds varied during different parts of the recording. When measured manually, 40% of the sound events on average were not produced by Capercaillies. The background noise level remained relatively stable during the whole signal selected for this evaluation. The classification accuracy indicates the rate of correctly detected sounds (hit rate), and the misclassification rate indicates the rate of the sounds that were detected but found not to be the sounds of a Capercaillie (false acceptance).
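The decision logic described above can be sketched with placeholder inputs: frames whose MI measure exceeds twice the manually set background threshold are kept, and the Euclidean MFCC distance to the Capercaillie model is divided by the MI value before thresholding. All names and the final threshold are illustrative, not taken from Publication IV.

import numpy as np

def detect_capercaillie(mi, mfcc, model, mi_background, dist_threshold):
    """mi: per-frame MI values; mfcc: per-frame MFCC vectors; model: mean MFCC of Capercaillie."""
    active = mi > 2.0 * mi_background                          # segmentation step
    d = np.linalg.norm(mfcc - model, axis=1) / np.maximum(mi, 1e-12)
    return active & (d < dist_threshold)                       # True for frames detected as Capercaillie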

Publication V

In the synthetic noise sequence classification experiments, four different noise sequences were created from a random pattern of length 20 samples and its three variants. These four patterns were different in the temporal domain, but their power spectra were equal. This explains why Fourier analysis cannot separate these signals. A set of signals (200 of each of the four types) of length 4000 samples was created from the four patterns by adding them at random positions. Half of the signals were assigned to the training set and the other half to the test set.

The classification results showed low recognition accuracy with PPF matrices when the number of patterns was low. The reason is that when the number of added patterns is low, the signals include a large number of zeros, and due to the handling of equal values in the permutation window the permutation code sequence is mostly composed of identical codes. In natural signals this phenomenon is extremely unlikely. When the number of patterns grows sufficiently large, the randomness in the signals increases and the temporal structure decreases, because the patterns are added at random positions.

In the stop consonant classification tests the TIMIT labels were used to indicate the starting and ending points of the burst. Recognition tests were performed iteratively for each consonant in the test set. The classification began with a window length of 19 ms, which was increased by 5 ms on each iteration round until the end of the burst.
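The synthetic test-signal construction of the first paragraph above can be sketched as follows. The exact pattern variants of Publication V are not specified here; time reversal and sign inversion are used only as placeholders, since they leave the magnitude spectrum of the base pattern unchanged.

import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=20)                            # random base pattern of 20 samples
patterns = [base, base[::-1], -base, -base[::-1]]     # placeholder variants with equal power spectra

def make_signal(pattern, n_patterns, length=4000):
    """Add n_patterns copies of the pattern at random positions of a zero signal."""
    x = np.zeros(length)
    for _ in range(n_patterns):
        pos = rng.integers(0, length - len(pattern))
        x[pos:pos + len(pattern)] += pattern
    return x

dataset = [make_signal(p, n_patterns=50) for p in patterns for _ in range(200)]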

Publication VI

The average of the MFCC feature vectors was used as the feature representation for syllables. This may not be entirely comparable to the permutation-based features, because the permutation features are not calculated on a frame basis. MFCC features were chosen because they are widely used and were utilized in preceding studies as well. However, in the identification of the songs the results obtained from syllable recognition were utilized. The average MFCC feature vector was not calculated for the entire song segment; instead, the results from syllable recognition were integrated by assigning the song to the class that was most often represented among the syllables of that particular song.
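The song-level decision described above amounts to a majority vote over the classified syllables, as in the following sketch.

from collections import Counter

def classify_song(syllable_labels):
    """Assign the song to the class most often predicted for its syllables."""
    return Counter(syllable_labels).most_common(1)[0][0]

# e.g. classify_song(["CORRAX", "CORNIC", "CORRAX"]) returns "CORRAX"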


Errata

Publication V

Equation (9) should be written

d_{KL}(P, Q) = \sum_{i,j=1}^{120} p(P) \log \frac{p(P)}{p(Q)}

The heading of chapter 4 should be "Stop consonant classification".


References

[1] R. Virkkala and A. Lehikoinen, “Patterns of climate-induced density shifts of species: poleward shifts faster in northern boreal birds than in southern birds,” Global Change Biology, vol. 20, no. 10, pp. 2995–3003, 2014.

[2] J. A. Kogan and D. Margoliash, “Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study,” The Journal of the Acoustical Society of America, vol. 103, no. 4, pp. 2185–2196, 1998.

[3] K. A. Swiston and D. J. Mennill, “Comparison of manual and automated methods for identifying target sounds in audio recordings of pileated, pale-billed, and putative ivory-billed woodpeckers,” Journal of Field Ornithology, vol. 80, no. 1, pp. 42–50, 2009.

[4] A. Farnsworth and R. W. Russell, “Monitoring flight calls of migrating birds from an oil platform in the northern Gulf of Mexico,” Journal of Field Ornithology, vol. 78, no. 3, pp. 279–289, 2007.

[5] C. Kwan, G. Mei, X. Zhao, Z. Ren, R. Xu, V. Stanford, C. Rochet, J. Aube, and K. C. Ho, “Bird classification algorithms: theory and experimental results,” in IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP 2004), Montreal, Canada, May 2004.

[6] J. R. Heller and J. D. Pinezich, “Automatic recognition of harmonic bird sounds using a frequency tract extraction algorithm,” The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1830–1837, 2008.

[7] M. T. Lopes, L. L. Gioppo, T. T. Higushi, C. A. A. Kaestner, C. N. Silla, and A. L. Koerich, “Automatic bird species identification for large number of species,” in IEEE International Symposium on Multimedia (ISM), Dec 2011, pp. 117–122.

[8] K. A. Steen, O. R. Therkildsen, H. Karstoft, and O. Green, “A vocal-based analytical method for goose behaviour recognition,” Sensors, vol. 12, no. 3, pp. 3773–3788, 2012.

[9] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, “Audio-based context recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2006.


[10] S. E. Anderson, A. S. Dave, and D. Margoliash, “Template-based automatic recognition of birdsong syllables from continuous recordings,” The Journal of the Acoustical Society of America, vol. 100, no. 2, pp. 1209–1219, 1996.

[11] K. Ito, K. Mori, and S. Iwasaki, “Application of dynamic programming matching to classification of budgerigar contact calls,” The Journal of the Acoustical Society of America, vol. 100, no. 6, pp. 3947–3956, 1996.

[12] T. S. Brandes, P. Naskrecki, and H. K. Figueroa, “Using image processing to detect and classify narrow-band cricket and frog calls,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2950–2957, 2006.

[13] R. Bardeli, “Similarity search in animal sound databases,” IEEE Transactions on Multimedia, vol. 11, no. 1, pp. 68–76, 2009.

[14] L. Neal, F. Briggs, R. Raich, and X. Z. Fern, “Time-frequency segmentation of bird song in noisy acoustic environments,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 2012–2015.

[15] F. Briggs, B. Lakshminarayanan, L. Neal, X. Z. Fern, R. Raich, S. J. K. Hadley, A. S. Hadley, and M. G. Betts, “Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach,” The Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640–4650, 2012.

[16] P. Jančovič, M. Köküer, and M. Russell, “Bird species recognition from field recordings using hmm-based modelling of frequency tracks,” in IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP 2014), Sep 2014.

[17] P. Jančovič, M. Köküer, M. Zakeri, and M. Russell, “Unsupervised discovery of acoustic patterns in bird vocalisations employing dtw and clustering,” in 21st European Signal Processing Conference (EUSIPCO), Sep 2013, pp. 1–5.

[18] M. B. Trawicki, M. T. Johnson, and T. S. Osiejuk, “Automatic song-type classification and speaker identification of norwegian ortolan bunting (Emberiza hortulana) vocalizations,” in IEEE Workshop on Machine Learning for Signal Processing, Sep 2005, pp. 277–282.

[19] C. Kwan, K. C. Ho, G. Mei, Y. Li, Z. Ren, R. Xu, Y. Zhang, D. Lao, M. Stevenson, and V. Stanford, “An automated acoustic system to monitor and classify birds,” EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 96706, 2006.

[20] C. Lee, C. Han, and C. Chuang, “Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 8, pp. 1541–1550, 2008.

[21] M. Marcarini, G.A. Williamson, and L. de Sisternes Garcia, “Comparison of methods for automated recognition of avian nocturnal flight calls,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP), Mar 2008, pp. 2029–2032.

[22] V. M. Trifa, A. N. G. Kirschel, C. E. Taylor, and E. E. Vallejo, “Automated species recognition of antbirds in a mexican rainforest using hidden markov


models,” The Journal of the Acoustical Society of America, vol. 123, no. 4, pp. 2424–2431, 2008.

[23] N. Harte, S. Murphy, D. J. Kelly, and N. M. Marples, “Identifying new bird species from differences in birdsong,” in 14th Annual Conference of the International Speech Communication Association (INTERSPEECH), Lyon, France, Aug 2013.

[24] C.-F. Juang and T.-M. Chen, “Birdsong recognition using prediction-based recurrent neural fuzzy networks,” Neurocomputing, vol. 71, no. 1-3, pp. 121 – 130, 2007.

[25] W. Chu and D. T. Blumstein, “Noise robust bird song detection using syllable pattern-based hidden markov models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 345–348.

[26] A. Selin, J. Turunen, and J. T. Tanttu, “Wavelets in automatic recognition of bird sounds,” EURASIP Journal on Signal Processing Special Issue on Multirate Systems and Applications, vol. 2007, no. 1, pp. Article ID 51806, 2007.

[27] W. Chu and A. Alwan, “Fbem: A filter bank em algorithm for the joint optimization of features and acoustic model parameters in bird call classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2012, pp. 1993–1996.

[28] A. Härmä and P. Somervuo, “Classification of the harmonic structure in bird vocalization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2004.

[29] Z. Chen and R. C. Maher, “Semi-automatic classification of bird vocalizations using spectral peak tracks,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2974–2984, 2006.

[30] P. Jančovič and M. Köküer, “Automatic detection and recognition of tonal bird sounds in noisy environments,” EURASIP Journal on Advances in Signal Processing, vol. 2011, no. 1, Article ID 982936, 2011.

[31] C. Lee, S. Hsu, J. Shih, and C. F. Chou, “Continuous birdsong recognition using gaussian mixture modeling of image shape features,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 454–464, 2013.

[32] E. D. Chesmore, “Application of time domain signal coding and artificial neural networks to passive acoustical identification of animals,” Applied Acoustics, vol. 62, no. 12, pp. 1359–1374, 2001.

[33] H. Tyagi, R. M. Hegde, H. A. Murthy, and A. Prabhakar, “Automatic identification of bird calls using spectral ensemble average voiceprints,” in 13th European Signal Processing Conference (EUSIPCO), Sep 2006.

[34] T. S. Brandes, “Feature vector selection and use with hidden markov models to identify frequency-modulated bioacoustic signals amidst noise,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, pp. 1173–1180, 2008.


[35] A. L. McIlraith and H. C. Card, “Birdsong recognition using backpropagation and multivariate statistics,” IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2740–2748, 1997.

[36] S. A. Selouani, M. Kardouchi, E. Hervet, and D. Roy, “Automatic birdsong recognition based on autoregressive time-delay neural networks,” in Congress on Computational Intelligence Methods and Applications (ICSC), Dec 2005.

[37] M. R. W. Dawson, I. Charrier, and C. B. Sturdy, “Using an artificial neural network to classify black-capped chickadee (poecile atricapillus) call note types,” The Journal of the Acoustical Society of America, vol. 119, no. 5, pp. 3161–3172, 2006.

[38] P. Somervuo and A. Härmä, “Analyzing bird song syllables on the self-organizing map,” in Workshop on Self-Organizing Maps (WSOM03), 2003.

[39] G. Grigg, A. Taylor, H. McCallum, and G. Watson, “Monitoring frog communities: an application of machine learning,” in 8th Innovative Applications of Artificial Intelligence Conference, Sep 1996, pp. 1564–1569.

[40] J. Placer, C. N. Slobodchikoff, J. Burns, J. Placer, and R. Middleton, “Using self-organizing maps to recognize acoustic units associated with information content in animal vocalizations,” The Journal of the Acoustical Society of America, vol. 119, no. 5, pp. 3140–3146, 2006.

[41] E. D. Chesmore and E. Ohya, “Automated identification of field-recorded songs of four british grasshoppers using bioacoustic signal recognition,” Bulletin of entomological research, vol. 94, no. 4, pp. 319–330, 2004.

[42] P. J. Clemins and M. T. Johnson, “Generalized perceptual linear prediction features for animal vocalization analysis,” The Journal of the Acoustical Society of America, vol. 120, no. 1, pp. 527–534, 2006.

[43] D. K. Mellinger and J. W. Bradbury, “Acoustic measurement of marine mammal sounds in noisy environments,” in International Conference on Underwater Acoustical Measurements: Technologies and Results, Jun 2007, pp. 273–280.

[44] R. Bardeli, D. Wolff, F. Kurth, M. Koch, K.-H. Tauchert, and K.-H. Frommolt, “Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring,” Pattern Recognition Letters, vol. 31, pp. 1524–1534, 2010.

[45] T. F. W. Embleton, “Tutorial on sound propagation outdoors,” The Journal of the Acoustical Society of America, vol. 100, no. 1, pp. 31–48, 1996.

[46] D. Luther and L. Baptista, “Urban noise and the cultural evolution of bird songs,” Proceedings of the Royal Society B, vol. 277, no. 1680, pp. 469–473, 2010.

[47] E. Bermudez-Cuamatzin, A. A. Rios-Chelen, D. Gil, and C. M. Garcia, “Experimental evidence for real-time song frequency shift in response to urban noise in a passerine bird,” Biology Letters, vol. 7, no. 1, pp. 36–38, 2011.

[48] H. Slabbekoorn and A. den Boer-Visser, “Cities change the songs of birds,” Current Biology, vol. 16, no. 23, pp. 2326–2331, 2006.


[49] J. R. Krebs and D. E. Kroodsma, “Repertoires and geographical variation in bird song,” Advances in the Study of Behaviour, vol. 11, pp. 143–177, 1980.

[50] G. J. L. Beckers, R. A. Suthers, and C. ten Cate, “Mechanisms of frequency and amplitude modulation in ring dove song,” The Journal of Experimental Biology, vol. 206, no. 11, pp. 1833–1843, 2003.

[51] A. S. Gaunt, “A hypothesis concerning the relationship of syringeal structure to vocal abilities,” Auk, vol. 100, no. 4, pp. 853–862, 1983.

[52] S. Nowicki, “Bird acoustics,” in Encyclopedia of Acoustics, M. J. Crocker, Ed., chapter 150, pp. 1813–1817. John Wiley & Sons, 1997.

[53] C. K. Catchpole and P. J. B. Slater, Bird Song: Biological Themes and Variations, Cambridge University Press, Cambridge, UK, 1995.

[54] T. S. Osiejuk, K. Ratynska, J. P. Cygan, and D. Svein, “Song structure and repertoire variation in ortolan bunting (Emberiza hortulana l.) from isolated norwegian population,” Annales Zoologici Fennici, vol. 40, pp. 3–16, 2003.

[55] M. Hingee and R. D. Magrath, “Flights of fear: a mechanical wing whistle sounds the alarm in a flocking bird,” Proceedings of the Royal Society B: Biological Sciences, vol. 276, no. 1676, pp. 4173–4179, 2009.

[56] J. H. Brackenbury, “Functions of the syrinx and the control of sound production,” in Form and Function in Birds, A. S. King and J. McLelland, Eds., chapter 4, pp. 193–220. Academic Press, 1989.

[57] R. T. Buxton and I. L. Jones, “Measuring nocturnal seabird activity and status using acoustic recording devices: applications for island restoration,” Journal of Field Ornithology, vol. 83, no. 1, pp. 47–60, 2012.

[58] R. S. Sousa-Lima, T. F. Norris, J. N. Oswald, and D. P. Fernandes, “A review and inventory of fixed autonomous recorders for passive acoustic monitoring of marine mammals,” Aquatic Mammals, vol. 39, no. 1, 2013.

[59] D. T. Blumstein, D. J. Mennill, P. Clemins, L. Girod, K. Yao, G. Patricelli, J. L. Deppe, A. H. Krakauer, C. Clark, K. A. Cortopassi, S. F. Hanser, B. McCowan, A. M. Ali, and A. N. G. Kirschel, “Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus,” Journal of Applied Ecology, vol. 48, no. 3, pp. 758–767, 2011.

[60] D. L. Jones and R. Ratnam, “Blind location and separation of callers in a natural chorus using a microphone array,” The Journal of the Acoustical Society of America, vol. 126, no. 2, pp. 895–910, 2009.

[61] K.-H. Frommolt, K.-H. Tauchert, and M. Koch, “Advantages and disadvantages of acoustic monitoring of birds – realistic scenarios for automated bioacoustic monitoring in a densely populated region,” in Computational Bioacoustics for Assessing Biodiversity, Proc. of the Internat. Expert Meeting on IT-based Detection of Bioacoustical Patterns, BfN-Skripten, vol. 234, pp. 83–92, 2008.

[62] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.


[63] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

[64] N. H. Fletcher, “A class of chaotic bird calls,” The Journal of the Acoustical Society of America, vol. 108, no. 2, pp. 821–826, 2000.

[65] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[66] W. M. Hartmann, Signals, Sound, and Sensation, Springer, Woodbury, New York, 1 edition, 1997.

[67] K. Keller, H. Lauffer, and M. Sinn, “Ordinal analysis of eeg time series,” in Advanced Methods of Electrophysiological Signal Analysis and Symbol Grounding, C. Allefeld, P. beim Graben, and J. Kurths, Eds., chapter 7, pp. 109–119. Nova Science Publishers, 2008.

[68] G. Ouyang, Z. Ju, and H. Liu, “Mutual information analysis with ordinal pattern for emg based hand motion recognition,” Intelligent Robotics and Applications, vol. 7506, pp. 499–506, 2012.

[69] Y. Cao, W.-W. Tung, J. B. Gao, V. A. Protopopescu, and L. M. Hively, “Detecting dynamical changes in time series using the permutation entropy,” Physical Review E, vol. 70, pp. 046217, 2004.

[70] D. Arroyo, P. Chamorro, J. M. Amigó, F. B. Rodríguez, and P. Varona, “Event detection, multimodality and non-stationarity: Ordinal patterns, a tool to rule them all?,” The European Physical Journal Special Topics, vol. 222, no. 2, pp. 457–472, 2013.

[71] B. Ghoraani and S. Krishnan, “Time-frequency matrix feature extraction and classification of environmental audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 19, no. 7, pp. 2197–2209, 2011.

[72] K. Keller, M. Sinn, and J. Emonds, “Time series from the ordinal viewpoint,” Stochastics and Dynamics, vol. 7, no. 2, pp. 247–272, 2007.

[73] C. M. Bishop, Pattern recognition and machine learning, vol. 1, Springer New York, 2006.

[74] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, 2000.

[75] F. Schwenker, “Hierarchical support vector machines for multi-class pattern recognition,” in 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000, pp. 561–565.

[76] G. Vaca-Castano, D. Rodriguez, J. Castillo, K. Lu, A. Rios, and F. Bird, “A framework for bioacoustical species classification in a versatile service-oriented wireless mesh network,” in 18th European Signal Processing Conference (EUSIPCO), Aug 2010.


[77] T. C. Collier, A. N. G. Kirschel, and C. E. Taylor, “Acoustic localization of antbirds in a mexican rainforest using a wireless sensor network,” The Journal of the Acoustical Society of America, vol. 128, no. 1, pp. 182–189, 2010.

[78] I. Potamitis, “Automatic classification of a taxon-rich community recorded in the wild,” PLoS ONE, vol. 9, no. 5, pp. e96936, 2014.
