THE UNIVERSITY OF MANCHESTER

BSC (HONS) COMPUTER SCIENCE AND MATHEMATICS WITH INDUSTRIAL EXPERIENCE

FINAL YEAR PROJECT

Bird Song Recognition

Author: Jake PALMER
Supervisor: Dr. Andrea SCHALK

April 2016

Abstract

This project investigates the viability of, and potential approaches to, automatic bird song recognition by testing various methods on a handful of British birds. How to gather and process the data is detailed, and we discuss whether or not doing so is important. A novel approach to classification using archetypal bird songs is developed, with an augmented similarity measure using the dynamic time warping algorithm that takes into account the number of components in a song. The results, benefits, and drawbacks of the archetypes approach are compared to a more typical machine learning approach using a decision tree.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Project Goals
  1.3 Representing Bird Song
    1.3.1 The Raw Waveform
    1.3.2 The Spectrum
    1.3.3 Human Intuition and the Spectrogram
      Our Human Ability To Recognise Patterns
      The Spectrogram
  1.4 Existing literature

2 Gathering and Processing Data
  2.1 xeno-canto
  2.2 A Variety of Songs
  2.3 Preprocessing Gathered Data
    2.3.1 The Procedure
  2.4 Data to Be Classified

3 Approaches
  3.1 Desirable Properties
  3.2 Archetypes
    3.2.1 Finding the Archetype
      The Method
      Set Up Time
    3.2.2 Measuring Similarity
      Best Representation
      Throwing Away Data
      Unexpected Similarity Results (and What To Do About It)
    3.2.3 Results
  3.3 Machine Learning
    3.3.1 Results

4 Conclusion

5 References

A Appendix
  A.1 Dynamic Time Warping
  A.2 The (Fast) Fourier Transform

List of Figures

1.1 A recording of me saying "the cat sat"
1.2 Processed recording of the song of a Common Chiffchaff
1.3 Section of a raw Blackbird recording
1.4 A plot of the spectrum of a Woodpigeon's song
1.5 A plot of the spectrum of a Chiffchaff's song
1.6 A plot of the spectrum of a recording of me saying "the cat sat"
1.7 The spectrogram of a Woodpigeon's song
1.8 The spectrogram of a Chiffchaff's song
1.9 The spectrogram of a section of a raw recording of a Blackbird's song

2.1 Spectrogram before starting procedure
2.2 Waveform before starting procedure
2.3 Spectrogram after applying a low-pass filter
2.4 Waveform after applying a low-pass filter
2.5 Spectrogram after removing noise
2.6 Waveform after removing noise

3.1 Waveform with all values kept
3.2 Waveform with every 100th value kept
3.3 A plot comparing the time per DTW measure for various amounts of data being kept
3.4 Only the top half (values above 0) of a Woodpigeon song's waveform
3.5 Smoothed top half of a Woodpigeon song's waveform
3.6 Marked components of the smoothed top half of a Woodpigeon's waveform

A.1 A fabricated example of the result of attempting to find the best match between two signals using DTW
A.2 A plot of sin(x) for x from 0 to 2π, along with its spectrum
A.3 A plot of sin(x) for x from 0 to 2π + 1, along with its spectrum

List of Tables

2.1 A comparison of the pitch and structure similarity of the birds being used
2.2 The number of both processed and unprocessed songs for each bird

3.1 The classification accuracies for the archetypes method

1 Introduction

Bird vocalisation comes in two flavours: songs and calls. Bird song is what birds use for courtship and mating and tends to be relatively complex (depending on the bird), whereas calls (also known as signals[2]) are for signalling alarm or for members of a flock to communicate information, which may simply be their locations relative to each other. Due to the greater structural complexity of bird song, it is much easier to tell the song of one species apart from the song of another; calls tend to be quite similar both in structure and pitch. I am fairly sure the same techniques I have used for songs could be applied to calls too, but I have chosen to focus my attention on songs in particular for the reasons mentioned. For all birds within a given species it is usually the case that their songs (of a particular kind, say for courting) sound reasonably similar to a human listener. This isn't always the case, but it holds to a first approximation. Of course there are some birds of different species that have quite similar songs, and so can be difficult to tell apart, and there are some birds of the same species that have a number of songs to choose from, each of which may serve a different purpose, such as the different stages of a courting ritual.

1.1 Motivation

I think this sort of software would see a lot of use recreationally, by people wanting to identify the bird they are hearing when they don't have the knowledge necessary to recognise it themselves, or when they want a second opinion on their guess. Apart from recreational use, I can imagine it being very useful for mapping out and monitoring bird populations in a way that requires less manual intervention. For example, some recorders could be placed in an area relatively free from noise such as cars and people, left to record and recognise sounds day and night, and made to store the results on a local disk. A person could then come back some time later and retrieve the data for analysis. Once data has been gathered over several years, some precise statements could be made about the local bird population over time. This could be important for spotting a decline in population due to disease or an increase in predators. Additionally, you might be able to say some interesting things about their daily habits, and which of their particular vocalisations are more prevalent at different times of the year, perhaps to identify which songs serve which purpose. Alternatively it could be used to spot that migration times have changed over the years, and use that as secondary data to provide further evidence for some primary data such as a change in climate. For monitoring bird populations this would require much less manual labour and is thus much more scalable than some more traditional methods[13].

1.2 Project Goals

The primary goal of this project was to assess the viability of recognising bird song to some reasonable degree of accuracy. More specifically (limiting the scope of the project slightly to one that was feasible in the time frame), to recognise some modestly sized subset of British birds, again with a reasonable degree of accuracy. What a "reasonable degree of accuracy" means will of course depend on the desired application, but I am taking it to mean several times better than random guessing, at the very least. Upon achieving that goal, I wanted to be able to compare and contrast different approaches to the problem if possible in the available time period, or, if I only managed to look deeply into one, to discuss some of the pros and cons of that method in isolation. Fortunately I have been able to do some research into two relatively different methods (beyond just machine learning algorithm A versus machine learning algorithm B, which would not be especially insightful). I took quite a different first approach, which I will discuss later. I have found that it is absolutely viable, and with some more investigation it seems to me that you should be able to get good accuracy as long as you have the data to back up the algorithms. This project doesn't concern itself with how to present this information to a user or similar things in that vein, as it has not been about developing an application. It was instead about investigating and testing potential methods. Note also that I will be ignoring things such as the differences between the song of an adult and a juvenile bird, and will only gather data and test against adult bird song. I have been very pleased with the results of the project, and I think there are numerous interesting avenues for investigation in the future.

1.3 Representing Bird Song

Of course if we are to have any hope of automating the process of recognition, we first need to find suitable forms and data structures for storing our bird song. So, what exactly is a bird song, when it is in digital format?

1.3.1 The Raw Waveform

The simplest imaginable representation of a sound, at least to my intuition, is a digitised form of how it appears in nature. A collective summing of simple sine waves at particular frequencies gives us a waveform. Most uncompressed audio file formats will store the information using the pulse-code modulation method, which I won't go into here, but the WAV file format is an instance of that, and that's the file format my program reads in. Reading one of these files in will give us our waveform (as an array of amplitude values, the range of which can vary depending on the encoding and file format). Figure 1.1 shows the waveform displayed in Audacity[1] for me saying the words "the cat sat".

Figure 1.1: A recording of me saying "the cat sat"

This is an 8000Hz recording, which means that there are 8000 samples per second. This figure of 8000Hz is called the sample rate; measured in hertz (samples per second), it determines the time resolution of an audio recording. If I have some audio data of N samples, then to figure out the length of the recording I also need to know the sample rate, call it R. I can then get the length of the recording (in seconds) by: (N samples) / (R samples per second) = N/R seconds. The recording above is 1.036 seconds long, so the array we read in will have N = R × t = 8000 × 1.036 = 8288 samples.
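As a concrete illustration of this calculation, here is a minimal sketch of reading a WAV file and recovering its length, assuming scipy is available; the report does not say which library the project itself uses, and the file name is hypothetical.

    # A minimal sketch of reading a WAV file and computing its length.
    from scipy.io import wavfile

    rate, samples = wavfile.read("chiffchaff.wav")  # hypothetical file name
    n = len(samples)

    # Length in seconds = (N samples) / (R samples per second).
    length_seconds = n / rate
    print(f"{n} samples at {rate} Hz -> {length_seconds:.3f} s")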

Displayed in figure 1.2 is a recording of a Common Chiffchaff (Phylloscopus collybita)[12, 8].

Figure 1.2: Processed recording of the song of a Common Chiffchaff

This is what I consider a single instance of a Chiffchaff song, which I have extracted from a larger recording and done some processing on. I will discuss the processing stage in more detail later. Clearly this is a very regular song. What tends to change about Chiffchaff song is the number of the different "parts", which you can see as the different spikes in amplitude in this Chiffchaff recording. Here there are 11, but there can be anywhere from 5 to just under 20, in my experience. This can cause problems with certain approaches, and we need to be careful to make an algorithm robust enough to deal with that. I will discuss that in more detail later as well. However, a lot of songs are more complicated than each instance of song just differing by how often it repeats the same part; take the Blackbird (Turdus merula)[4, 3] song shown in figure 1.3.

Figure 1.3: Section of a raw Blackbird recording

This is a more raw recording where I haven't extracted a single song or applied any processing. You can see that this is true in the large baseline amplitude throughout the recording; this is the background noise. I would say there are between 4 and 6 different instances of Blackbird song here, depending on how you draw the boundaries (and I don't think it really matters too much where you do, as long as you're consistent). But none of them seem to resemble each other very much. Yet to a human ear, each of them is easy to identify as Blackbird song, at least to someone familiar with it. So what are we even hearing here to be able to identify it as such? As I mentioned in the introduction, in this case it is more about pitch and "character" than it is about the structure itself, though some regularity can be gleaned from the structure too (it is just too complex to state in simple terms). I do make extensive use of the waveform in this project, even though initially I had dismissed it and thought it would be all about the spectrum. However, my experiments have revealed that, at least for my methods, it is worth using.

1.3.2 The Spectrum

See the appendix for details on how to obtain the spectrum from the waveform using the Fourier transform, in particular an implementation known as the Fast Fourier Transform (commonly abbreviated as FFT). The spectrum is usually more useful than the comparatively messy waveform, at least in the world of human speech processing, but they both have their benefits. Two birds that have a similar song structure would be difficult to tell apart by the waveform alone (and as discussed the waveform is in general a more difficult form to work with), but if they had consistently quite dissimilar frequencies in their songs then we could easily tell them apart by the spectrum. Knowing that Woodpigeons have low-frequency songs and Chiffchaffs have relatively high-frequency songs, it is easy to see which is which in figures 1.4 and 1.5. The Woodpigeon's dominant frequency lies somewhere in the range of 300 to 800 hertz, and the Chiffchaff's in the range 4000 to 6000 hertz. So clearly there is something of use in the spectrums, if we as humans can tell so easily which is which in this case. What is difficult, of course, are the cases where it is not so easy to see, even for a human (there is a third way of displaying the data which plays very nicely into human intuition, and an expert may be able to identify a large number of birds just by looking at it; I will come onto that shortly). Additionally, simply to illustrate that the concept of a dominant frequency in a given recording of sound is much more relevant for bird song than it is for other fields such as speech recognition, see figure 1.6 for the spectrum of the recording of me saying "the cat sat". Nothing stands out particularly as a dominant frequency, at least not to the same degree as you will often see in bird song.

9 Figure 1.4: A plot of the spectrum of a Woodpigeon’s song

Figure 1.5: A plot of the spectrum of a Chiffchaff’s song

Sadly, I have not managed to make as much use of the spectrum as I wanted to. It is clearly useful, but in every experiment I have run, using only the spectrum as opposed to only the waveform results in worse accuracy and slower classification. However, I do believe an optimal approach will use both.

Figure 1.6: A plot of the spectrum of a recording of me saying "the cat sat"

1.3.3 Human Intuition and the Spectrogram

Our Human Ability To Recognise Patterns

A person need only pay attention to the sounds that tend to come from particular species of birds to be able to tell, without seeing the bird, what species the bird belongs to. We tend to recognise birds more naturally than just seeing that their song follows some very rigid pattern and comparing it against some sort of mental database. In fact, there are many birds whose song structure varies a lot, such as the Common Blackbird (Turdus merula), but the "character" of its song remains, in both its pitch and its particular intonations, among other qualities. Clearly, as is evident in the previous sentence's impreciseness, it can be difficult to express just what exactly we as humans are using to be able to recognise a given bird. As well as being able to tell one bird from another by listening, with some practice and a little studying we are also able to tell one bird from another, in many cases, simply by looking at a 3D (or 2D with colour) representation of their songs - the spectrogram. A person with more experience might even be able to tell you what bird it is, just by looking at this representation.

The Spectrogram

The spectrogram, or sonogram/sonograph as it tends to be known in the bird vocalisation community, is a way of displaying and/or storing the data that contains all the information from both the waveform and spectrum. Along its x-axis is time, along its y-axis is frequency, and the intensity of the colour (or the blackness, if it is a monochrome spectrogram) is the amplitude or presence of that particular frequency at that time. Using shades of grey or colour intensity to represent amplitude avoids the need for plotting in a third dimension, and is easier to interpret. Any information we might like to query about a signal will be contained in the spectrogram. Well, almost; the spectrogram carries only partial phase information. Apart from this limited phase information (which is not important, as we don't care much about phase), and ignoring the fact that we will never have a perfect picture of the real signal (because we are digitally sampling an analogue signal in the first place), the spectrogram contains all the information we could possibly need. Indeed, viewing bird song through spectrograms is an incredibly enlightening thing to do, and when William Homan Thorpe considered applying spectrograms to bird song in the 1950s (recounted in his book on the subject[17]), it revolutionised the field of bird song analysis. See figures 1.7, 1.8, and 1.9 for the spectrograms of the songs of a Woodpigeon, Chiffchaff, and Blackbird respectively. The brighter parts are the song. These are the same songs and recordings for which I have displayed waveforms and spectrums previously. It is clear from these figures which ones I have done some processing on to remove noise and interfering sounds: the Woodpigeon (low frequency) and the Chiffchaff. The processed Chiffchaff song is a particularly nice example, revealing more structure to the song than the waveform and spectrum did on their own. Every Woodpigeon song looks almost exactly like the one displayed: five individual parts, one short, two longer, then another two short.

Figure 1.7: The spectrogram of a Woodpigeon’s song
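For readers who want to reproduce this kind of view, the following is a rough sketch of computing and plotting a spectrogram with scipy and matplotlib. It is only an illustration under those assumptions; the report's own spectrograms were produced with other tools (such as Audacity), and the file name is hypothetical.

    # A rough sketch of computing and plotting a spectrogram.
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("woodpigeon.wav")  # hypothetical file name
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # mix a stereo recording down to mono

    # Frequencies (Hz), times (s), and the power of each frequency at each time.
    freqs, times, power = spectrogram(samples, fs=rate)

    plt.pcolormesh(times, freqs, power, shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()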

What I am interested in is whether we can use this intuition with spectrograms to identify an automatic way of recognising bird song.

Figure 1.8: The spectrogram of a Chiffchaff's song

Figure 1.9: The spectrogram of a section of a raw recording of a Blackbird's song

A first thought that I had was to literally represent the bird song as an image, the spectrogram, and use image similarity algorithms to compare it to various different bird songs. The method would have been backed by a spectrogram database to compare new recordings to. I abandoned this idea early on in favour of following two simpler methods that I thought would be more likely to give good results in the time available. The worry was that the idea of comparing spectrograms may be similar to having an image database of objects in simple (uncluttered) scenes, and then recognising objects in a new simple scene using the image database. This is well known to be a very hard problem. In the end, much of the information about structure can be derived from the waveform, and frequency separately from the spectrum. Nevertheless, one of my regrets in this project is not having enough time to look at combining the two, and perhaps directly using the spectrogram, to do some analysis on the "clusters" you often see in spectrograms to potentially feed into some machine learning algorithm. It seems to me that the situation is not as dire as I made it sound with my comparison to recognising objects in simple scenes, and I think using the spectrograms to recognise bird song is actually much more analogous to handwriting recognition (though I do not have experimental data to back this up). Bird song tends to be a bird's audio-based signature, after all, and the spectrogram merely represents that as an image.

1.4 Existing literature

As may be evident by now, there is surprisingly little existing literature on the subject of automatic bird song recognition. There is plenty of research on the biological aspects, such as comparing the development of human speech and bird song from birth to adulthood, on how birds themselves may recognise each other's calls and songs, and on bird song from the bird's perspective. Much of the research into this area (including some on, or tangentially related to, automatic bird song recognition) seemed to peak in the late 90s and trailed off significantly after that. A few papers on automatic bird song recognition do exist, such as Automated recognition of bird song elements from continuous recordings[10], though this paper notes that the recordings were done under laboratory conditions, using only two kinds of bird, achieving satisfactory results, and requiring expert knowledge under non-ideal conditions. So there has been some research into it, albeit with only modest success in ideal scenarios. Unfortunately it is behind a prohibitively expensive paywall so I have not been able to have an in-depth look at it. There is also some research in related areas, such as the recognition of individual dolphin whistles[9]. I have used some of this to direct my experiments (I would never have looked into Parsons encoding, mentioned later in the section on approaches, without seeing the success of it in the world of dolphin whistle recognition), though as I expected it was not directly applicable, and some of it, assuming my approach was sound, did not work at all. I think the fact that I am doing my experiments with a reasonably small number of songs for each bird (recorded in the wild by contributors to xeno-canto), and then testing my methods on both preprocessed and raw examples of a handful of both diverse and similar birds, is much more reflective of how well this may work when used in an average real-world situation.

2 Gathering and Processing Data

Clearly we are not going to get anywhere without first having enough data to work with, and that data has to be good data (for a particular sense of the word "good", which I will discuss). I will explore how much of it we need and how much the quality of the recording matters, and what we can do to improve it, in this chapter.

2.1 xeno-canto

The only resource I have used for gathering data is xeno-canto[19], a website where any member of the public may upload bird vocalisation recordings for others to use how they please. As of the time of writing, there are 302,751 recordings covering about 9,548 different species[18]. The available recordings for some birds are limited, but for the common birds of Britain (which is what I have chosen to focus on) there are plenty of recordings of reasonable to good quality. The recordings can simply be downloaded as an MP3 and then converted to WAV for use with my program (which could easily be done on the fly with a system call/script using lame --decode song.mp3). Anyone hoping to get this to a point where it truly covers a very large number of birds would probably have to do some field work, so to speak, as many birds still have very few to no recordings on xeno-canto.
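A sketch of that on-the-fly conversion, simply shelling out to lame exactly as described above; the file names are hypothetical.

    # Decode an MP3 from xeno-canto to WAV by shelling out to lame.
    import subprocess

    def mp3_to_wav(mp3_path: str, wav_path: str) -> None:
        # lame --decode <input.mp3> <output.wav> decodes an MP3 to a WAV file.
        subprocess.run(["lame", "--decode", mp3_path, wav_path], check=True)

    mp3_to_wav("XC183441.mp3", "XC183441.wav")  # hypothetical file names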

2.2 A Variety of Songs

As previously stated, this project is about investigating and testing potential methods, and because of that the data not only needs to be good in the sense of quality and quantity, but also good in the sense that it should challenge as many aspects of the algorithms being applied as possible. That is, we should have at least two birds that sound quite similar to each other in pitch and song structure so that we can see how each method deals with that, two birds that have very different pitches, and two birds that have similar pitches but different song structure. The only combination of pitch and song structure I haven't really got is very different pitches but similar structure, but I think both of those things are individually tested by two of the other cases.

Bird         Similar Pitch           Similar Structure   Dissimilar Pitch   Dissimilar Structure
Blackbird    Blackcap, Chiffchaff    Blackcap            Woodpigeon         Chiffchaff, Woodpigeon
Blackcap     Blackbird, Chiffchaff   Blackbird           Woodpigeon         Chiffchaff, Woodpigeon
Chiffchaff   Blackbird, Blackcap     N/A                 Woodpigeon         All
Woodpigeon   N/A                     N/A                 All                All

Table 2.1: A comparison of the pitch and structure similarity of the birds being used

This allows us to see which aspects of each algorithm could do with improvement, and where each excels. The birds I settled on are: the Common Blackbird, Woodpigeon, Common Chiffchaff, and the Eurasian Blackcap. See table 2.1. Additionally, though not mentioned in the table, Blackbird and Blackcap song is quite self-dissimilar from one instance to the next, whereas Woodpigeon and Chiffchaff song is very self-similar. The reason for not just throwing a huge amount of data all at once at the various different approaches is that gathering the data itself takes time, and I had to limit myself in order to focus on more important aspects of the project. Downloading and converting the files and naming them appropriately (with their xeno-canto IDs, so anyone in the future wanting to see where the data comes from can simply type the ID into xeno-canto) doesn't take a huge amount of time. What really does add up is processing each of the files afterwards and extracting each of the individual songs. It does not seem possible to me to automate this without an already incredibly sophisticated piece of software (the creation of which would probably be even harder than this project), and even if that existed, using it would likely introduce further error into the classification. For this reason I have focused on only the several birds required to test the various aspects of the algorithms, and the rest of my time has been spent on investigating and developing these solutions. The amount of data I could collect, as mentioned, depended on the amount of time I had spare on the project. As well as time, I was also limited by how long the actual algorithms take to set up and classify (to facilitate fairly quick testing), which I estimated before beginning to gather the data with some crude but reasonably accurate calculations, based on some times I found by running the various parts of it on only a few songs.

Bird         Processed Songs   Unprocessed (Raw) Songs   Total
Blackbird    26                11                        37
Blackcap     21                6                         27
Chiffchaff   24                10                        34
Woodpigeon   19                6                         25

Table 2.2: The number of both processed and unprocessed songs for each bird

I initially only collected around 15 songs per bird, but later increased that to what I have now. The number of songs I have for each bird can be seen in table 2.2. The reason processed songs outnumber raw ones is that I was initially testing on only processed songs, to test my methods in more ideal conditions; only later did I add a number of raw songs (and along with them some more processed ones). By this point I was limited not only by what was on xeno-canto, the time required to run the algorithms, and my own time, but also by the file system quota on the school computers.

2.3 Preprocessing Gathered Data

One thing that clearly needs to be done is to go through a large recording of multiple instances of song and extract each one. But apart from that, is there anything else we can do to improve the quality of the extracted songs? There are several things in fact, and I will go through the general procedure for cleaning up recordings.

2.3.1 The Procedure

The general procedure is very simple. I will go through it step by step (I used Audacity for all of this). The spectrograms and waveforms at each step are shown in figures 2.1 to 2.6.

Step 1: Find a reasonable quality recording of the bird. I will use xeno-canto 183441, a Woodpigeon recording, for this example. See figures 2.1 and 2.2.

Step 2: If the recording is stereo, make it mono.

Step 3: (Situational) This particular recording also has a Robin and a House Martin in the background. If we wanted this recording for the Robin we would be out of luck, but as we want it for the Woodpigeon we can use the fact that it has such a low-frequency song to apply a low-pass filter and eliminate the unwanted birds from the recording. Here I applied a filter of 1200Hz with a roll-off of 36dB twice. The roll-off for a low-pass filter can be seen as the filter's eagerness to remove sound above the specified frequency; it can be anywhere in the range 6 to 48 decibels in Audacity. (A rough code equivalent of this filtering step is sketched after these steps.) See figures 2.3 and 2.4.

Step 4: We still have some remaining background noise. The previous step can rarely be applied in general, but removing noise is something I do for every processed recording, though it is only possible if there is some part in the recording that is pure noise (to get the noise profile). See figures 2.5 and 2.6.

Step 5: Finally, extract each of the individual songs from the recording. It is not obvious at first but in this recording only four out of five of what may be songs are full songs. Its full song has five parts, so out of this recording I would get four processed Woodpigeon songs. There is some freedom in this step in that it is not always clear where a song starts and ends.
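As promised in step 3, here is a rough programmatic stand-in for the Audacity low-pass filter; the project itself applied the filter by hand rather than in code, so this is only an illustration. The Butterworth filter, the 6th-order choice (approximating a 36 dB/octave roll-off at roughly 6 dB/octave per order), and the file name are all assumptions.

    # A rough stand-in for Audacity's low-pass filter step.
    from scipy.io import wavfile
    from scipy.signal import butter, filtfilt

    def low_pass(samples, rate, cutoff_hz=1200, order=6):
        # Normalise the cutoff to the Nyquist frequency, as scipy expects.
        b, a = butter(order, cutoff_hz / (rate / 2), btype="low")
        # filtfilt applies the filter forwards and backwards (no phase shift).
        return filtfilt(b, a, samples)

    rate, samples = wavfile.read("XC183441.wav")  # hypothetical file name
    if samples.ndim > 1:
        samples = samples.mean(axis=1)            # step 2: make it mono
    filtered = low_pass(samples.astype(float), rate)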

Whether or not preprocessing the songs first is worthwhile will be investigated in chapter 3.

2.4 Data to Be Classified

Apart from the data we are using to train our algorithms, we also have to deal with data coming in to be classified. The recording quality should be reasonably high in the first place, and I would recommend the usual standard of 44,100Hz or higher. This is particularly important because in one of my approaches to the problem I throw away quite a lot of data in order to speed up the process of training and recognition, at the expense of a small amount of accuracy. All of the data I use for training (and testing) is either 44,100Hz or 48,000Hz. There are two interesting problems with incoming data that are outside of the scope of this project. Firstly, a recording could be given in which the bird song is not in isolation, by which I mean we are being given a recording of a bird in which its song occurs several times, or among the songs of other birds, with some significant amount of pure background noise. Second, several birds could be singing at the same time. I can imagine a solution to this in which you could place several recording devices in an environment for recognising bird song and use some sort of source separation algorithm (see the cocktail party problem[7]) to analyse them independently, which may be necessary if bird song recognition were to be used in the wild to track bird populations or for some other purpose. I don't think a pure software solution would be very satisfactory even if it were to some degree possible. I will be assuming none of these complications are present in the recordings I am using; indeed, as I am choosing the recordings to test the algorithms myself, I do not have to worry. Constraints like this do not really matter because this project is not concerned with being user-facing. These problems are nevertheless interesting to consider.

Figure 2.1: Spectrogram before starting procedure

Figure 2.2: Waveform before starting procedure

Figure 2.3: Spectrogram after applying a low-pass filter

Figure 2.4: Waveform after applying a low-pass filter

Figure 2.5: Spectrogram after removing noise

Figure 2.6: Waveform after removing noise

3 Approaches

In this chapter I will present a novel approach to bird song recognition by essentially using a database of automatically chosen archetypal bird songs to compare a new recording to. Additionally, I compare the results of this approach to a more typical one using machine learning. I will compare not only the accuracy of the approaches but also the time it takes to set up the algorithms, and the time it takes to classify new songs. There is a certain amount of compromise to be had with respect to these attributes. If I had an extra year I would also have liked to investigate Hidden Markov Models, as this is something that came up during the Natural Language Systems module, and I have seen it mentioned in relation to bird song a few times. The directory structure and how the data is stored is the same for both approaches, and is as follows:

/path/to/data/
    BarnSwallow/
        Songs/
            1.wav
            2.wav
            ...
        Calls/
    CommonBlackbird/
        Songs/
            1.wav
            ...
        Calls/
            1.wav
            ...

Both algorithms work if there is no data in any of these folders, and will simply skip the bird if that is the case. Indeed, none of my Calls folders have data because I chose to focus on songs. The names of the bird folders are used in the program to label the data for the machine learning part, so the folders need to be named using title case, else the birds will not be given their correct name. The program is also already built to handle several different kinds of songs and calls from one bird, I just haven't had the time to gather enough data for that. For example, a Pink Robin may have in its Calls folder both an Alarm and a Communication folder, and perhaps some sub-folders within them too, such as Alarm/Type1 and Alarm/Type2; the program would handle this. It expects recordings to be in the deepest directories in each branch of the directory tree. The PinkRobin/Calls/Alarm/Type1 recordings would then all be labelled Pink Robin (Call, Alarm, Type1) by the machine learning algorithm. It is possible that there are birds who have an unmanageably or prohibitively large number of songs to split into a sub-folder arrangement like this, though I know of no such bird as of yet. That is, apart from birds like the Superb Lyrebird, because of their ability to mimic most sounds, and of course there is absolutely no way of dealing with something as incredible as that. There are some additional files produced by both methods that I will describe shortly.
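The report does not show the labelling code itself, but a sketch of how the layout above could be turned into labels might look as follows; the function names and the exact label formatting are assumptions.

    # A sketch of deriving labels from the directory layout described above:
    # recordings live in the deepest folders, and the path components after
    # the bird folder become the label detail (the report renders this as,
    # e.g., "Pink Robin (Call, Alarm, Type1)").
    import os
    import re

    def split_camel(name: str) -> str:
        # "PinkRobin" -> "Pink Robin"
        return re.sub(r"(?<!^)(?=[A-Z])", " ", name)

    def labelled_recordings(root: str):
        for dirpath, dirnames, filenames in os.walk(root):
            wavs = [f for f in filenames if f.lower().endswith(".wav")]
            if not wavs:
                continue  # skip empty folders, as both algorithms do
            parts = os.path.relpath(dirpath, root).split(os.sep)
            bird = split_camel(parts[0])
            detail = ", ".join(parts[1:])
            label = f"{bird} ({detail})" if detail else bird
            for f in wavs:
                yield os.path.join(dirpath, f), label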

3.1 Desirable Properties

It is not complicated to measure whether one algorithm is better than another in terms of accuracy, as we simply take the number of correctly classified recordings and divide it by the total number of recordings (slightly more complicated in some cases to be robust but it is essentially this). However, there are a number of other considerations, such as the time it takes to set up the algorithms, the time it takes to classify, and lastly how much storage space on the file system a given approach takes. For example, we may have a K-nearest neighbour algorithm that has good accuracy, but KNN tends to be quite slow on large datasets and requires the storage of all of the feature vectors for every training example. As I am only working with a small set of birds any claims I make about the scalability of the algorithms will be extrapolations.

3.2 Archetypes

Any kind of method that is going to work for recognising bird song clearly needs to be backed by data. In its simplest form, that would be comparing new recordings to archetypal examples of each bird's different songs and calls, and seeing which one it is most similar to. If it is quite similar to several, then we simply report it as such, and say that it may be one of a potential few. That is exactly what my first attempt at solving the problem has been. There are two major questions that need to be answered to implement this method. Firstly, how to determine what the archetypal song is in a given batch of songs. Second, what exactly it means for one song to be similar to another. I recognise that what this method immediately does is ignore most of the data available in favour of focusing on one instance of song, but I find the simplicity of it very appealing and as such I wanted to see if it could be a viable approach.

3.2.1 Finding the Archetype

The answer to the first question is simple. The archetypal song is the song that is most similar to most of the other songs. Clearly, using this method, a particularly strange or malformed instance of a song will never be chosen as an archetype. I should say here that in my similarity measures the closer the returned value is to 0, the more similar the songs are; the larger the value, the more dissimilar they are.

The Method

So we must calculate the similarity of each song to every other song, and the song with the lowest mean similarity is our archetypal song. This splits the second question into two questions. The first is the same as before, and the second is whether or not we should use the same similarity measure for both finding the archetype and comparing new recordings to the archetypes. A first response may be "well, why not?"; I will discuss this in detail later on. For now we will focus on what it means to find the archetype in more concrete terms. Given some similarity measure d that is lower when two songs are more similar, and around N songs for each bird, consider the following matrix:

$$\begin{pmatrix}
d(\text{1.wav},\text{1.wav}) & d(\text{2.wav},\text{1.wav}) & \cdots & d(\text{N.wav},\text{1.wav}) \\
d(\text{1.wav},\text{2.wav}) & d(\text{2.wav},\text{2.wav}) & \cdots & d(\text{N.wav},\text{2.wav}) \\
\vdots & \vdots & \ddots & \vdots \\
d(\text{1.wav},\text{N.wav}) & d(\text{2.wav},\text{N.wav}) & \cdots & d(\text{N.wav},\text{N.wav})
\end{pmatrix}$$

Label each column by i.wav, the song appearing as the first argument in that column's entries d(i.wav, j.wav) for j ∈ [1,N]. So the first column is labelled with 1.wav, the second with 2.wav, and so on. The archetypal song is the (label) i.wav such that the mean of the measures in its column is the lowest of all the columns in the matrix. So in the following matrix:

$$\begin{pmatrix}
0 & 3 & 12 \\
3 & 0 & 9 \\
12 & 9 & 0
\end{pmatrix}$$

the column means (taken over the entries other than the diagonal self-comparison) are [7.5, 6, 10.5], so the lowest mean belongs to the second column, which will be labelled 2.wav, and this is our archetypal song.

Set Up Time

Assume the similarity measure takes T seconds to complete on average, and we have B birds in total. That means finding the archetypes for all birds in one run will take around BTN² seconds. If we had 50 birds in our database, each backed by around 30 songs, then this would take 45000T seconds. If I want this to take less than an hour I need T to be less than 0.08s. Thankfully, as is made clear in the example, the diagonal of this matrix will always be all 0s if our measure is sensible, and additionally d(i.wav, j.wav) = d(j.wav, i.wav) should be true. So actually we only need to compute the bottom left (equally the top right) of the matrix, which is 0.5N(N − 1) measures. This changes our setting-up time to 0.5BTN(N − 1), and for the values we used before, 21750T. Now our measure can take approximately twice as long. We can do even better in the long term because of how I store the information about which archetypes were chosen. Take the example file structure from the start of this chapter. An archetype file will be created at each deepest directory containing a batch of WAV files. A (real) example of the contents of one of these archetype files:

CommonBlackbird\Songs\XC128838-7.wav
CommonBlackbird\Songs\XC128838-1.wav
CommonBlackbird\Songs\XC128838-10.wav
...
CommonBlackbird\Songs\XC187043_7.wav
CommonBlackbird\Songs\XC187043_8.wav
CommonBlackbird\Songs\XC187043_9.wav

I have cut out most of the lines as they are unnecessary to explain the format. The first line of the file is the archetypal song. Every line after that is a song that took part in finding the archetype. So the next time I run the archetype-finding algorithm I first check if there is an existing file. If there is, I check whether any new songs have been added to the folder, and if there have been, I check whether one of them should be the new archetype. If no new recordings have since been added I do not need to do anything and can move on to the next batch. This drastically reduces the time required to find archetypes for all but the very first run of the algorithm, so this problem of set up time is not really an issue, as it can be done incrementally as described.
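A sketch of that incremental check; the cache file name used here is a guess, since the report does not give one.

    # The archetype file's first line names the archetypal song and the
    # remaining lines name the songs that took part; recomputation is only
    # needed when new WAVs have appeared since the last run.
    import os

    def archetype_up_to_date(folder: str) -> bool:
        cache = os.path.join(folder, "archetype.txt")   # hypothetical file name
        if not os.path.exists(cache):
            return False                     # first run: compute from scratch
        with open(cache) as f:
            lines = [line.strip() for line in f if line.strip()]
        seen = {os.path.basename(p) for p in lines}     # archetype + participants
        current = {f for f in os.listdir(folder) if f.lower().endswith(".wav")}
        return current <= seen               # no new songs since the last run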

3.2.2 Measuring Similarity

We still have not tackled the problem of measuring the similarity between two songs. It is by far the trickiest part of this method, and any improvements to this method will likely have to be improvements to this aspect of it. Easily the most common algorithm I have seen for measuring the similarity between two audio samples, for both human speech and other vocalisation, is the dynamic time warping algorithm[6, 9, 10]. See the appendix for an overview of DTW.

Best Representation

So we have our similarity measure. Whether or not DTW is the best measure I do not know; I have focused most of my time on what to apply it to, and on augmenting it to achieve better results. As is common when developing a solution iteratively like this, each question you answer throws up several new questions. We are going to have to stop somewhere, but for now we continue and next answer what the best thing is to apply DTW to. The question is simply whether we should be comparing the waveforms or the spectrums. In any case it is necessary to normalise both the waveforms and spectrums before doing any comparisons, changing the minimum and maximum y-values from [a,b] to [0,1]. Otherwise, for example, two waveforms that are exactly the same but with different amplitudes would be reported as unreasonably dissimilar. To begin with I simply applied it to the raw waveforms, but there is a problem with this, namely that the measure takes a long time to compute.

Throwing Away Data

Before I did anything else I did some experiments to check exactly how long it might take in general. To do this I took 15 Blackbird and 18 Woodpigeon songs and ran the DTW algorithm on the first of each set against the rest of that set, so 31 DTW measures. As well as checking it against the full waveform, I also tested it when keeping only every 10th value in the waveform, and also every 100th. You can see the overall structure is just about maintained when keeping only every 100th in figures 3.1 and 3.2. The results are presented in figure 3.3.

Figure 3.1: Waveform with all values kept

Figure 3.2: Waveform with every 100th value kept

These values are 21.45s, 1.87s, and 0.13s per DTW. Remember that before we calculated 0.16s to be about what we needed when we have around 50 birds with about 30 songs each to take less than an hour, so we achieve slightly better than that by throwing away 99 out of 100 values from the waveform. Just to reiterate what I went over at the start of this chapter, assuming the same number of birds and songs on average as before, 21.45s per DTW would take a few days.

26 Figure 3.3: A plot comparing the time per DTW measure for various amounts of data being kept

I did the same tests, throwing away the same amounts, for the spectrums, and the results were 9.65s, 0.97s, and 0.16s per DTW. Interestingly, throwing data away does actually preserve the ordering of similarities between songs. All that it does, as long as the overall structure is maintained (again as shown in figure 3.2), is reduce the similarity measures by approximately a factor of 100. This is significant, because if this were not the case and we could not throw away data without destroying the similarities, then everything about this method would fall apart, because it would simply take far too long to measure. There is a worry that certain bird songs will be malformed by this throwing away of data, seeing as by doing so I am essentially taking a 44,100Hz recording and reducing it to an approximately 441Hz recording. This seems to be fine for the birds I have tested it on, but if there is any bird whose song has important features that are only visible over less time than 1/441 ≈ 0.002 seconds then this will be a problem. I do not see this being a problem in general, though, considering this is such a short time-scale. For spectrums we cannot be so ruthless, and can only go as far as keeping every 25th value before the structure is destroyed. I tried averaging the values instead of just throwing them away, but this resulted in the waveform being far too flattened out and smeared, and it lost much of the structure present in the original waveform.
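The two preparation steps used here, rescaling each waveform to [0, 1] and keeping only every 100th value, might look like the following sketch (an illustration, not the project's code).

    # Rescale a waveform so its values lie in [0, 1], then keep only every
    # 100th sample before handing it to DTW.
    import numpy as np

    def normalise(signal):
        signal = np.asarray(signal, dtype=float)
        lo, hi = signal.min(), signal.max()
        return (signal - lo) / (hi - lo)       # map [a, b] onto [0, 1]

    def decimate(signal, keep_every=100):
        return signal[::keep_every]            # e.g. 44,100 Hz -> roughly 441 Hz

    # Usage: prepared = decimate(normalise(samples)), with `samples` read from
    # a WAV file as earlier in the report.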

27 Another thing I tried doing was using Parsons encoding, that is reducing the song to a sequence of 0s, 1s, and -1s depending on whether the frequency stayed the same (approximately), increased, or decreased from the last sample. This completely ruined the similarity measures and thus the accuracy as well.

Unexpected Similarity Results (and What To Do About It)

After deciding on using the waveform, I decided to check if the similarity measure agreed with my human intuition about the similarity of each pair of songs. I used the set of Woodpigeon songs I had, and also added in two new Woodpigeon songs that sounded very different to the rest, just as a sanity check: if they were reported similar to the rest then something was wrong. Unfortunately, it did report those two songs as similar to the rest. What it did report as being very dissimilar was a song that I seemed to have malformed in the processing stage, which had a somewhat ghostly sound to it. After a lot of experimentation, and trying to work out how I as a human was hearing such different songs when the algorithm could not, I solved this discrepancy by augmenting the similarity measure with a multiplicative factor. This multiplicative factor takes into account the number of "components" the songs being compared have. So if one song had 2 components and the other had 3, the measure would be multiplied by a factor of three halves (increasing the measure and thus increasing the dissimilarity). The way I calculate the number of components in a waveform is somewhat crude but effective. I simply take the top half, then apply a moving average with a window of 20 to smooth out the values, renormalise the values to be in the range [0,1], then go through the values checking when they go above and below 0.1 to mark out the components. See figures 3.4 to 3.6 for an example on a Woodpigeon song that correctly counts five components; the green vertical lines denote the start of a component and the red vertical lines denote the end. This is where the measure for finding the archetype differs from the one used to classify. If we have a Chiffchaff song and it only has, say, 6 parts, and our Chiffchaff archetype has 10, then we are going to multiply whatever similarity measure we get by 5/3, and as a result may miss out and pick the incorrect archetype. But this makes sense when finding the most typical song and does not cause problems; in fact, as I have shown, it is necessary. With this change all the similarities agree with my intuition (I have nothing else to base the correctness on besides that), and all that remains is to evaluate the classification accuracy of this method.
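A sketch of the component-counting heuristic and the resulting multiplicative factor, following the description above; this is an illustration of the idea rather than the project's exact implementation.

    # Keep the top half of the waveform, smooth it with a moving average of
    # window 20, renormalise to [0, 1], and count the stretches that rise
    # above the 0.1 threshold.
    import numpy as np

    def count_components(signal, window=20, threshold=0.1):
        top = np.clip(np.asarray(signal, dtype=float), 0, None)  # values above 0
        smoothed = np.convolve(top, np.ones(window) / window, mode="same")
        lo, hi = smoothed.min(), smoothed.max()
        smoothed = (smoothed - lo) / (hi - lo)                    # back to [0, 1]
        above = smoothed > threshold
        # A component starts wherever the signal crosses up through the threshold.
        starts = np.flatnonzero(above[1:] & ~above[:-1])
        return len(starts) + (1 if above[0] else 0)

    def component_factor(c1, c2):
        # e.g. 2 components vs 3 -> multiply the DTW measure by 3/2.
        return max(c1, c2) / min(c1, c2)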

3.2.3 Results

The final results are as follows:

Bird         Noisy Classified Correct (%)   Clean Classified Correct (%)
Blackbird    55                             77
Blackcap     95                             100
Chiffchaff   30                             58
Woodpigeon   17                             42
Overall      48                             69

Table 3.1: The classification accuracies for the archetypes method

This makes for an overall classification accuracy of 63%, including both noisy and clean recordings (the latter processed to remove noise and interfering sounds). Interestingly, the birds whose songs are most self-similar, the Woodpigeon and Chiffchaff, fared worst with this method. The two birds I expected it might confuse, the Blackbird and Blackcap, as they have both similar pitch and inconsistent structure, it did very well with. The whole process of evaluating this, including all the overhead, took around 74 seconds, so we can safely assume each classification takes around half a second, which is more than satisfactory.

3.3 Machine Learning

I considered three different algorithms for testing a machine learning approach: a Support Vector Machine (using some kernel to apply to a higher dimensional feature space), a K-Nearest Neighbour Classifier, and a Decision Tree Classifier. I chose not to use K-Nearest Neighbour because of storage concerns. Considering that you need to store the feature vectors for every training example you use, this would not allow the program to scale very well when adding a large number of birds (and songs for each bird). The time required to classify would grow with the size of the data we have as well. For decision trees the model size remains small even with a large increase in data, and thus so does the classification time.

29 I didn’t use Support Vector Machines because I wanted to easily obtain the probabilities for each potential bird as well as a simple classification without much overhead. Not that I made much use of that here, but I think that’s something that would be quite useful in the future. For decision trees this is easy, and is simply a matter of comparing the number of examples down one branch compared to the total number. Of course if we have limited data this will result in a lot of 0 probabilities; we must simply be careful in the parameters we choose for our decision tree such as its maximum depth. To store the model from an implementation point of view, for the decision tree (for which I am using scikit-learn[16]), we simply take the trained decision tree object and serialise it with pickle[14], and de-serialise that when the program is loaded anew. As decision trees are fairly lightweight this takes almost no time at all. The feature vector I used was a normalised lower-resolution spectrum (rela- tively large frequency bins), along with the song length. It is more sensible to use the spectrum rather than the waveform for the feature vector as the spectrum is more consistent, especially when you make the frequency bins large enough. If I had more time I would have liked to also test it with the cepstrum[5], some- thing which is often used in the world of human speech recognition. The cepstrum puts more weight on the differences between lower frequencies than high, which reflects how we as humans perceive sound. I wonder if some knowledge of the ways birds perceive sound could be used to develop a similar thing.

3.3.1 Results

On the small dataset I have now, the decision tree takes a few minutes to train. I am not sure how this would scale with the number of songs I feed it, as I do not know the precise implementation details as I did with my own archetypes method, so I cannot really say. Though, as I have mentioned previously, it does not matter too much how long the training phase takes. The time it takes for the decision tree to classify a new recording is negligible, on the order of a hundredth or a thousandth of a second (depending on the computer it is being run on). The overall classification accuracy for noisy and clean recordings using a decision tree classifier, tested using 5-fold cross validation, is 77% (plus or minus 0.13% of uncertainty).

Figure 3.4: Only the top half (values above 0) of a Woodpigeon song's waveform

Figure 3.5: Smoothed top half of a Woodpigeon song’s waveform

Figure 3.6: Marked components of the smoothed top half of a Woodpigeon’s waveform

4 Conclusion

I am very happy with the results of this project, particularly the reasonably good accuracy the archetypes method was able to achieve, as I put a lot of effort into getting that to work despite its relatively simplistic appearance in the end. In retrospect, though, I think I would have liked to have spent less time on the archetypes method and put some more work into pursuing the machine learning approach, and other completely different avenues like a Hidden Markov Model. Additionally, more research over the summer might have allowed me to do what I have done as well as further pursue some of the above-mentioned things. Nevertheless I have enjoyed my time with it and learned a lot.

5 References

[1] Audacity. Audacity website. http://www.audacityteam.org/.

[2] Clive K. Catchpole and Peter J. B. Slater. Bird Song: Biological themes and variations, page 6. Cambridge University Press, 1995.

[3] Juan Emilio. Common blackbird image (overlaid on the recording). https://commons.wikimedia.org/wiki/File:Turdus_merula_-Gran_Canaria,_Canary_Islands,_Spain-8_(1).jpg, 2011.

[4] Stuart Fisher. Common blackbird recording. http://www.xeno-canto.org/72861, 2011.

[5] Ethnicity group. Speech recognition. https://www.clear.rice.edu/elec532/PROJECTS98/speech/cepstrum/cepstrum.html.

[6] Dr. John G. Harris. Isolated word, speech recognition using dynamic time warping towards smart appliances. http://www.cnel.ufl.edu/~kkale/dtw.html.

[7] Simon Haykin and Zhe Chen. The cocktail party problem. Neural Compu- tation, 17(9):1875–1902, 2005.

[8] Munish Jauhar. Common chiffchaff image (overlaid on the recording). https://commons.wikimedia.org/wiki/File:Common_Chiffchaff.jpg, 2013.

[9] Arik Kershenbaum, Laela S. Sayigh, and Vincent M. Janik. The encoding of individual identity in dolphin signature whistles: How much information is needed? National Institute for Mathematical and Biological Synthesis, 2013.

[10] Joseph A. Kogan and Daniel Margoliash. Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden markov models: A comparative study. The Journal of the Acoustical Society of America, 1998.

33 [11] Richard Lyons. Windowing functions improve fft results, part 1. Electrical Design News, September, 1998.

[12] David M. Common chiffchaff recording. http://www.xeno-canto.org/121965, 2012.

[13] The Institute For Bird Populations. Jobs. http://www.birdpop.org/pages/jobs.php.

[14] Python. pickle. https://docs.python.org/2/library/pickle.html.

[15] Kazuaki Tanida. fastdtw. https://pypi.python.org/pypi/fastdtw.

[16] SKL team. scikit-learn. http://scikit-learn.org/stable/.

[17] William H. Thorpe. Bird-Song. The biology of vocal communication and expression in birds, chapter 3. Cambridge University Press, 1961.

[18] xeno-canto team. xeno-canto collection graphs. http://www.xeno-canto.org/collection/stats/graphs.

[19] xeno-canto team. xeno-canto website. http://www.xeno-canto.org/.

A Appendix

A.1 Dynamic Time Warping

Dynamic time warping, commonly abbreviated as DTW, is a method for measuring the similarity between two sequences. The elements of the sequences need not be numbers; all that is needed is some concept of distance between any given two elements. Of course in practice it does tend to be used on numbers, more specifically for measuring the similarity between two sequences that represent some kind of signal. I will provide an intuitive way of looking at what the algorithm does in this case after explaining the basic concept. Given two sequences $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, we create a simple 2-dimensional distance matrix, like so:

$$\begin{pmatrix}
d(x_1, y_1) & \cdots & d(x_n, y_1) \\
\vdots & \ddots & \vdots \\
d(x_1, y_n) & \cdots & d(x_n, y_n)
\end{pmatrix}$$

The start of each of the sequences is in the top and left, and the ends in the bottom and right. If our distance function simply measures the distance between two elements (and does not take into account previously calculated distances), then we next need to find the shortest path through the matrix (using some path-finding algorithm), where each step moves one right, one down, or one diagonally down and to the right. The sum of the distances on this path is our similarity measure. Of course we may take the square root of this measure, or similar, and retain the same similarity ordering. So if $p = ((x_{i_1}, y_{j_1}), (x_{i_2}, y_{j_2}), \ldots, (x_{i_m}, y_{j_m}))$, for some $i_k, j_k \in [1, n]$, is the sequence of indexes this path takes, then our similarity measure might be:

$$\sum_{k=1}^{m} (x_{i_k} - y_{j_k})^2$$

and m will be at least as large as n, as the shortest possible path through the matrix would be to go straight diagonally, which would be of length n. Our distance function may take into account previously calculated distances. Indeed, this is what tends to be done when using the algorithm for finding string similarity, in which case you do not need to find a shortest path and you may just read off the bottom-right element as the similarity. This same sort of method may be taken for matching sequences, and other similar applications. I do not make use of that kind of distance function in this project. In our case we are measuring the similarity between two signals over time (the raw waveform), or the presence of certain frequencies over others (the spectrum). In the former case, the intuition is that we are "warping" the signals over the time domain until they match as much as we can get them to. You can see this in figure A.1, where you can imagine stretching the signals horizontally until those lines (which correspond to the elements of the sequences that share a point on the shortest path through the distance matrix) are vertical. If our distance function was simply $d(x_i, y_j) = |x_i - y_j|$, where $|\cdot|$ is the absolute value, and the connecting lines above were drawn vertical all the way along, this is equivalent to taking the diagonal path in the matrix, and from this we recover the simple Manhattan distance for sequences. This accounts for signals that may be out of phase by some amount, which in our case could simply be due to a bird song being extracted with, say, a longer period of silence on the left (entirely possible as I am doing this manually), or a bird "skipping" part of its song, in which case the signal will be warped or shifted along until the parts that match align, and we can measure the similarity of that. For these reasons, I believe this to be a much better choice of distance metric than a more naive one such as the Manhattan, Euclidean, or any Minkowski distance metric or similar. In fact, you can see DTW as a more general case of these metrics, as you recover them by simply taking the path along the diagonal of the distance matrix (although DTW lacks some of the mathematical properties of a true metric, such as the triangle inequality). There are additional aspects to DTW, such as weighting the measure returned by the algorithm by how much the path deviates from the main diagonal, but I will refrain from going into too much depth here. I have used the library fastdtw[15] in my implementation.
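For concreteness, a minimal sketch of a DTW measure using the fastdtw library cited above; the report does not show the exact call the project uses, so the pointwise distance and inputs here are assumptions.

    # Compute a DTW similarity between two prepared (normalised, decimated)
    # waveforms with the fastdtw library.
    import numpy as np
    from fastdtw import fastdtw

    def dtw_similarity(a, b):
        # fastdtw returns the accumulated distance along the warping path and
        # the path itself; lower distance means more similar, matching the
        # convention used throughout this report.
        distance, path = fastdtw(np.asarray(a), np.asarray(b),
                                 dist=lambda p, q: abs(p - q))
        return distance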

Figure A.1: A fabricated example of the result of attempting to find the best match between two signals using DTW

A.2 The (Fast) Fourier Transform

The spectrum is not something we will immediately have access to upon reading in an audio file. We need to do some work to find it, which we can do by using a clever technique called the Fourier transform¹, discovered by Joseph Fourier in 1822 (though the concept had been considered much earlier). The idea is simple: it is the decomposition of a waveform into a sum of sine functions of particular frequencies. We are taking our signal from the time-amplitude domain to the frequency-amplitude domain. Then we can look at the dominant frequencies, which is something that is in general more consistent from song to song. However, in the mathematical formulation, and if we want to be perfectly precise, we have to use an infinite sum. It is possible to break this infinite sum down into a finite sum in some cases when doing a Fourier transform analytically, but in practice we tend not to bother, and as such we do not end up with a result that perfectly captures our original waveform, but it will be more than good enough. This can be done for any waveform; there are no special, absolute requirements for this to work. However, the Fourier transform does assume that what you are feeding it is a repeating signal, and works best when the audio signal contains an integer number of cycles of the repeating signal.

¹ A nice, deeper explanation of the Fourier transform than I will give in this report can be found at: http://www.med.harvard.edu/jpnm/physics/didactics/improc/intro/fourier2.html

Of course in practice we would be very lucky to meet this soft requirement, and so we have to deal with the issue of spectral leakage. Spectral leakage also occurs as a result of the fact that we only have samples of the signal at discrete intervals. Spectral leakage is the name given to frequencies "bleeding" over into neighbouring frequencies in the result of the Fourier transform. It can be reduced by applying a windowing function to the data before applying the Fourier transform[11], though there are trade-offs to be made; applying a windowing function will reduce the frequency resolution of the data. Whether you care about spectral leakage depends on the application. I have come to the conclusion that it does not matter for this application, given that we only care about the rough ranges of frequencies that one species of bird uses compared to another and not so much about the precise profiles (the leakage in one Fourier transform compared to another on a similar signal should be similar anyway), so I have not looked into it in great depth. That is, I have not run extensive tests to determine whether it would be advantageous to apply a windowing function, and chose to focus my efforts instead on making more general progress on the problem. I could be wrong about this, and I think further research into this is necessary to be able to make the claims I have just made with more confidence. As a simple illustration of the phenomenon, refer to figures A.2 and A.3.

Figure A.2: A plot of sin(x) for x from 0 to 2π, along with its spectrum

In the first plot you can see the single peak at 1, which is what we would expect. Doing the same thing for sin(x + 1), the sine function shifted along by 1, gives us the same result. But notice in figure A.3 what happens if we view this function in the range 0 to 2π + 1, which is not an integer number of periods. You can very clearly see the spectral leakage in the spectrum for this function. Again, whether that is a problem or not depends on the application and problem domain.

Figure A.3: A plot of sin(x) for x from 0 to 2π + 1, along with its spectrum

There is a particular algorithm called the Fast Fourier Transform (often abbreviated as FFT) that I use in my program for doing this conversion. Also note that the original waveform can always be recovered from the resulting frequencies using the inverse fast Fourier transform, abbreviated IFFT (though phase information may be lost in the process).
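The illustration in figures A.2 and A.3 can be reproduced with numpy's FFT along the following lines (a sketch, not the code used to produce the report's figures).

    # The spectrum of sin(x) sampled over exactly one period (0 to 2*pi)
    # versus a non-integer number of periods (0 to 2*pi + 1), where spectral
    # leakage appears.
    import numpy as np

    def spectrum(end, n=1024):
        x = np.linspace(0.0, end, n, endpoint=False)
        return np.abs(np.fft.rfft(np.sin(x)))   # magnitude of each frequency bin

    clean = spectrum(2 * np.pi)        # energy concentrated in a single bin
    leaky = spectrum(2 * np.pi + 1)    # energy smeared over neighbouring bins

    print(np.argmax(clean), clean.max())
    print(np.argmax(leaky), leaky.max())

    # The waveform can be recovered from the full (complex) FFT via the
    # inverse transform: np.fft.irfft(np.fft.rfft(samples), len(samples)).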
