DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Automatic audio sample finder for music creation: Melodic audio segmentation using DSP and machine learning

DAVID PITUK

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Automatic audio sample finder for music creation

DAVID PITUK

Master in ICT Innovation
Date: October 25, 2019
Supervisor: Saikat Chatterjee
Examiner: Saikat Chatterjee
School of Electrical Engineering and Computer Science
Host company: Teenage Engineering
Swedish title: Automatisk ljudprovfinnare för musikskapande


Abstract

In the field of audio signal processing, there have always been attempts to create tools which help musicians by automating processes for music creation or analysis, and the electronic music industry still plays an important role in combining engineering and music. In the age of sample based synthesizers and sequencers, creating and using high quality and unique audio sample packages is a crucial part of composing songs.

Nowadays, there are hundreds of audio applications and editors that provide sufficient tools for songwriters and DJs to find and edit audio samples and create their own signature packages for their performances. However, these applications do not offer automated solutions for extracting melodic loop or drum samples, so the whole procedure of extracting euphonious and unique samples can be quite time consuming. Deciding which part of a song is good enough to be used as a separate loop or drum sound is highly subjective, so fully automating this mechanism is really challenging. However, a good balance between fully automated processes and freedom for additional editing can result in a useful tool which still saves a lot of time for its users.

In this paper, I present the research and implementation of a cross-platform (Windows, macOS, Linux) desktop application which automatically extracts melodic motifs and percussion sections from songs for loop and drum samples. Furthermore, the app also classifies the extracted drum samples into five categories (kick, snare, clap, open hi-hat and closed hi-hat) and allows the user to do additional editing on the samples. The software was developed as part of an internship at a Swedish audio company called Teenage Engineering, therefore the application converts the final sample kit into a single file which is supported by the company's OP-Z sequencer and synthesizer device.

Sammanfattning

Inom ämnet av ljudprocessering har det alltid gjorts försök att framställa verktyg som hjälper musiker genom att automatisera processer för att skapa eller analysera musik, och den elektroniska musikindustrin spelar en viktig roll då de kombinerar programvarutekniken och musik. I dagens musikindustri är samplingsbaserade syntar, sequencers, användning av högkvalitativa och unika ljudpaket en avgörande del för att komponera låtar.

Numera finns det hundratals ljudapplikationer och redigeringsprogram som ger tillräckligt med verktyg till låtskrivare och DJ:s för att hitta och redigera ljud och skapa sina egna signature packages för sina konserter. Dessa applikationer erbjuder emellertid inte automatiserade lösningar för framställningen av melodiska slingor eller trum samples. Därför kan hela förfarandet i processen att skapa unika samples vara ganska tidskrävande. Att avgöra vilken del av låten som är tillräckligt bra för att använda som en separat slinga eller återkommande trumljud är mycket subjektivt, så att helt automatisera musikproduktionen är väldigt utmanande. Att ha en bra balans mellan de helautomatiska processerna och möjlighet för ytterligare redigering kan emellertid resultera i ett användbart verktyg som kan spara mycket tid för användarna.

I denna avhandling presenterar jag forskning och implementering av en korsplattforms (Windows, macOS, Linux) applikation som automatiskt framställer melodier och slagverkssektioner för låtar. Dessutom tillhandahåller appen också klassificering av de skapade trum samples i fem kategorier (kick, snare, klapp, öppen hi-hat och sluten hi-hat). Detta tillåter användaren att göra ytterligare redigering i sampelsen. Mjukvaran utvecklas som en del av ett projekt hos det svenska ljudföretaget Teenage Engineering. Applikationen konverterar det slutliga sample package till en enda fil som stöds av företagets OP-Z sequencer och synthesizer-enhet.

Contents

1 Introduction
  1.1 Background
    1.1.1 Teenage Engineering and the OP-Z
    1.1.2 Sample kits for the OP-Z
  1.2 Motivation
  1.3 Goal
  1.4 Method
    1.4.1 Melodic samples
    1.4.2 Drum samples and classification
  1.5 Sustainability goals
    1.5.1 Good Health and Well-being
    1.5.2 Industry, Innovation and Infrastructure
  1.6 Outline

2 Theoretical Background
  2.1 The sound
    2.1.1 Time domain
    2.1.2 Frequency domain
    2.1.3 Spectrogram
  2.2 Melodic sample extraction
    2.2.1 Similarity measure
    2.2.2 Self similarity matrix
  2.3 Audio classification
    2.3.1 Convolutional neural network
  2.4 The OP-Z kit file
    2.4.1 The AIFF file format
    2.4.2 The OP-Z JSON Object


3 Implementation
  3.1 Loop sample extraction
    3.1.1 Preprocessing the spectrogram
    3.1.2 Self similarity matrix and thresholding
    3.1.3 Sample extraction from the matrix
    3.1.4 Algorithm overview
  3.2 Drum sample extraction and classification
    3.2.1 Beat slicing
    3.2.2 Generating the CNN model
    3.2.3 Algorithm overview
  3.3 The application structure and user interface
    3.3.1 The Electron framework
    3.3.2 The ZeroRPC module
    3.3.3 The graphical user interface

4 Conclusion
  4.1 Results
  4.2 Future work
  4.3 Final thought

Bibliography

Chapter 1

Introduction

1.1 Background

Music sequencers (also audio sequencers or simply sequencers) are a very important part of electronic music creation. They are devices or application software that can record, edit, or play back music by handling note and performance information in several forms. Sequencers can be categorized by the data types they handle, such as MIDI, CV/Gate or audio data sequencers. Another way of categorization is the storage and playback mechanism of the device, for example real time, analog or step sequencers [1].

1.1.1 Teenage Engineering and the OP-Z

This thesis project is part of an internship program at Teenage Engineering (TE). TE is a Swedish consumer electronics company and manufacturer founded in 2005 and based in Stockholm. Their products include consumer electronics and synthesizers. TE's OP-Z is an audio sample based step sequencer with some additional features.1

A step sequencer breaks down beats into 'steps'. For example, if the user breaks up a 4-bar loop in standard 4/4 time, it will have 16 steps (also known as beats). With a sequencer, the user can edit each step to customize the beats or the song: tweak, add, remove or edit drum hits such as kicks, snares or hats, and add sample hits or effects. Then the user can set the desired number of steps in each beat, change velocity, reverb and other effects.

1 OP-Z has many more features than a regular step sequencer, such as a sample based synthesizer mode or the visual and audio effect unit.



The OP-Z has eight audio tracks which are divided into two groups, the drum group and the synth group. The drum group consists of four drum tracks: kick, snare, perc and sample. Each track in this group has two-note polyphony per step. They are all sample based and consist of 24 different sounds across the musical keyboard [2]. Such a collection is called a kit, and this thesis is about how to generate whole kits for the OP-Z from different songs automatically.

Figure 1.1: Teenage Engineering’s OP-Z sequencer

1.1.2 Sample kits for the OP-Z

The OP-Z has several built-in sample kits, however users can also load their own kits into the device storage. In order for the device to be able to use the uploaded sounds, a drum sample kit has to meet some requirements:

• The kit has to be a single file containing the sounds for each key.

• The file’s length must be 12 seconds or shorter.

• The file has to be in AIFF (Audio Interchange File Format) with a sample rate of 44.1 kHz.

• Header meta data must be part of the AIFF file’s chunk section. This header contains the information about the assignment between the kit samples and the OP-Z keyboard and other settings, such as pitch, LFO etc. I provide more details about this header section in Chapter 2.

1.2 Motivation

There are many sound editors and beat slicers on the software market which help users to create their custom single kit files, however each of these tools lacks certain features that would be needed to make OP-Z kit construction easy.

Figure 1.2: Audacity, a versatile and open-source audio editor

For example, editors like the open-source Audacity allow users to cut and edit parts of songs, but usually do not have the option to create AIFF files with the specific OP-Z compatible header data. Furthermore, the market lacks software that is able to find melodic motifs and unique drum samples in an audio file automatically.

1.3 Goal

The main goal of this thesis project is to implement a cross-platform desktop application which has the following features:

• Finding melodic motifs in different songs with one click

• Extracting unique drum samples from songs or drum recordings automatically

• Classifying the drum samples into five categories (kick, snare, clap, open hi-hat, closed hi-hat) with good (at least 80%) accuracy

• Offering additional editing for the extracted samples (cutting, key reassigning, etc.)

• Being able to accept WAV, MP3 and AIFF files as input

• Saving the final kit into OP-Z compatible format

1.4 Method

The main focus of this paper is on the topic of melodic motif and unique drum sample extraction, and drum sample classification using deep learning techniques.

1.4.1 Melodic samples

The thesis project started with a literature review and background study about melodic sample detection in song files.

During this research, I found techniques based on the spectrograms of whole music tracks combined with a self-similarity matrix calculation that uses the spectrogram as its input. After implementing test scripts, this technique proved effective while still being lightweight and fast enough to be integrated into a JavaScript-Python based desktop application. The next step was running experiments on how to prepare the matrix so that pattern recognition in this two-dimensional matrix becomes easier. The outcome of these experiments was to apply thresholding on both the spectrogram and the self-similarity matrix. After the pre- and post-thresholding, I implemented an algorithm for finding patterns which represent signature melodic motifs in the corresponding song. Lastly, I created the script which is responsible for filtering out similar samples and handling time overlaps between them.

1.4.2 Drum samples and classification

This part of the project has two sections. The first one is to find an efficient deep learning technique for audio classification and to use it for creating the AI model which performs the prediction. The second section is to implement a beat slicing algorithm which outputs unique drum samples that can be targets for the classification.

Classification

After the necessary literature research and background study, I found articles which successfully used convolutional neural networks (CNN) to perform classification on different audio sources. Fortunately, CNNs have very good support in Python and are fairly easy to use. After collecting the training sample data for the five drum classes (600 different kick, snare, clap, closed and open hi-hat samples), the next step was to decide what kind of input can be used for the CNN training. Since convolutional neural network based tools are mostly used for image recognition projects, I had to choose the right image representation of the sound samples. After I created the setup for the training layers, I started to run several training cycles for three different representations of the audio training data: waveform data, spectral data and spectrogram.

Beat slicing

The implemented beat slicing algorithm uses a classic signal energy based solution. Therefore, the first step was developing the script which generates the signal energy array of each audio sample with a given time frame. After that, the script has to filter the peaks out of the energy array. Then an adaptive baseline is calculated which represents the minimum energy level that a peak has to reach. The baseline is adaptive because its value is not constant over the duration of the sample; instead, the whole signal energy array is divided into time frames where each time frame has its own baseline. The next step was to run experiments to find the correct time frame constants for the energy array calculation and for the dynamic baseline calculation as well. Finally, an algorithm checks the similarity between the detected beat slices using a distance function on each slice's spectrum, and selects the slices which are unique enough.

1.5 Sustainability goals

Sustainability was an important factor during the development process. Therefore, this thesis project targets a few of the Sustainable Development Goals (SDGs) which have been defined by the United Nations.

1.5.1 Good Health and Well-being

The project mainly focuses on audio signal processing (e.g. signal similarity measures, audio classification etc.). Since the researched and implemented algorithms are based on general signal processing and neural network solutions, they can target different types of signals as well. In the future, the implemented components can be easily adjusted to handle medical signals, such as electroencephalogram (EEG), electrocardiogram (ECG) or electromyogram (EMG) signals.

1.5.2 Industry, Innovation and Infrastructure

One of the biggest trending topics in today's IT world is artificial intelligence (AI). Nowadays, neural network based AI solutions are widely used in different areas of our lives. This thesis work will be published as an open-source project, so all the machine learning solutions of the software will be available for developer communities. Hopefully, these solutions will be beneficial for new projects in the future.

1.6 Outline

In this section, I give an overall summary of the content of the chapters.
Chapter 1 introduces the topic of the thesis, declares the goal and briefly describes the methods that will be employed.
Chapter 2 discusses the basic theory about sound signals, data analysis and convolutional neural networks, and shares technical information about the OP-Z kit file's meta data.
Chapter 3 describes the whole process of the implementation and the test results.
Chapter 4 summarizes the results, provides conclusions and presents potential future work.

Chapter 2

Theoretical Background

2.1 The sound

In physics, sound is a vibration that typically propagates as an audible wave of pressure through a transmission medium such as a gas, liquid or solid. Since the change of air pressure in musical sounds has a quasi-periodic behavior, musical sounds have the two fundamental properties of periodic signals: amplitude and frequency. A typical musical sound can consist of a wide spectrum of frequencies; however, the generally accepted standard range of audible frequencies for humans is 20 to 20,000 Hz. If the sound is a single sine wave, the pitch of the sound equals the frequency of the sine wave:

y = A cos(2πft + ϕ)  (2.1)

In reality however, single sine wave sounds are relatively rare. It is much more common that the sound wave is a combination of different frequencies with different amplitudes. For example, one note played on a guitar contains several other harmonics, and these harmonics determine the overall tone of the note [3].

Therefore, if our goal is to analyze or classify music songs, we have to use different (visual) representations of the audio data from a music track. In the following subsections, I will give a short summary of each audio representation that has been used in the project.


2.1.1 Time domain

One of the most obvious ways to analyze signals is time domain analysis. In the time domain, the signal's or function's value is known for all real numbers in the case of continuous time, or at various separate instants in the case of discrete time.

In this project, the sources of audio data are audio files, which are digital representations of the sound. Therefore, signals stored in audio files are discretized versions of the analog signal with an important property called sampling frequency or sampling rate, which is the average number of samples obtained in one second (samples per second) [4]. This project focuses on audio signals with a sampling rate of 44.1 kHz.

The simplest way to visualize a signal in the time domain is the waveform, that is, the shape of its graph as a function of time, independent of its time and magnitude scales and of any displacement in time. Figure 2.1 shows the waveform of the song "Changes" by French electronic duo Faul & Wad Ad.

Figure 2.1: The waveform of a tropical house track

In this project, I used another time domain representation of signals, the signal energy graph. In the case of a finite discrete signal of length N, the total energy of the signal can be calculated as follows [5]:

E = \sum_{n=0}^{N-1} |x_n|^2 \qquad (2.2)

The signal energy graph shows the change of the energy over time. However, to have a more suitable format for beat detection, the signal is split up into time frames, where each time frame's energy is calculated according to formula 2.3, where L represents the length of the time frame in samples and N is the length of the signal.

E_n = \sum_{i=n-(L/2)}^{n+(L/2)} |x_i|^2, \quad n \in \mathbb{Z} : (L/2) < n < (N - 1 - (L/2)) \qquad (2.3)
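To make the frame-energy computation concrete, the following NumPy sketch computes the energy of consecutive time frames; it uses start-aligned frames rather than the centred window of equation 2.3, and the frame length of Figure 2.2 (0.05 s at 44.1 kHz) appears only as an illustrative call.

```python
import numpy as np

def frame_energy(x, frame_len, hop=None):
    """Energy per time frame (cf. Eq. 2.3); hop == frame_len gives non-overlapping frames."""
    hop = hop or frame_len
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.sum(x[s:s + frame_len] ** 2) for s in starts])

# Figure 2.2 uses 0.05 s frames at 44.1 kHz without overlap:
# energy = frame_energy(signal, frame_len=int(0.05 * 44100))
```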

Now let us take a look at the energy graph of the song shown in figure 2.1, with a time frame length of 0.05 seconds and no frame overlaps.

Figure 2.2: The energy graph of a tropical house track

Even in this image it is quite visible that this approach can provide a good basis for further processing to find beats in drum oriented music tracks.

2.1.2 Frequency domain

In electronics, control systems engineering, and statistics, the frequency domain refers to the analysis of mathematical functions or signals with respect to frequency, rather than time [6]. Therefore, in the field of signal processing, we often have to use some kind of frequency domain based solution, because compared to techniques in the time domain, it is more suitable for analyzing more complex signals. For digital devices, to decompose a finite (N samples long) signal into its constituent frequencies, we use the discrete Fourier transform (DFT):

X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \quad k = 0, \ldots, N-1 \qquad (2.4)

The Fourier transform of a function of time is a complex-valued function of frequency, whose magnitude represents the amount of that frequency present in the original function, and whose argument is the phase offset of the basic sinusoid at that frequency. A more usual interpretation of these complex numbers is that they give both the amplitude of the wave present in the function and the phase (or the initial angle) of the wave.

In audio signal processing, it is common to only focus on the amplitude of each frequency of the audio signal. The amplitude as a function of frequency is known as the spectrum of the signal.
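As a minimal illustration of how the spectrum is obtained in practice, the sketch below computes the magnitude spectrum of a real-valued signal with NumPy; the 44.1 kHz sample rate is the one assumed throughout this project.

```python
import numpy as np

def magnitude_spectrum(x, sample_rate=44100):
    """Amplitude spectrum of a real-valued signal via the DFT (Eq. 2.4), phase discarded."""
    spectrum = np.abs(np.fft.rfft(x))                     # magnitudes of the DFT bins
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)  # 0 .. sample_rate / 2
    return freqs, spectrum
```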

An important property of the DFT is the resolution of the frequency band (∆f), which depends on the length of the time signal (T):

∆f = 1/T  (2.5)

According to the Nyquist–Shannon sampling theorem, for a given sample rate f_s, perfect reconstruction is guaranteed to be possible for a bandlimit B < f_s/2 [7]. This means that in the case of audio data, the sample rate of 44.1 kHz determines the maximum frequency range of the spectrum, specifically 0 to 22050 Hz (figure 2.3).

Figure 2.3: Spectrum of a short sample of a kick drum sound

The picture above shows the frequency on a linear scale. However, even this picture foreshadows the fact that most of the frequency components of musical notes and songs are located in the range of 50 to 5000 Hz. Therefore, spectrum representations in this project use a logarithmic frequency scale, which provides a more suitable form for further processing. Figure 2.4 shows the same drum sample spectrum using logarithmic frequency scaling.

Figure 2.4: Same spectrum with logarithmic frequency scaling

2.1.3 Spectrogram

Another illustration of complex periodic signals is the spectrogram, which is a visual representation of the spectrum of frequencies of a signal as it varies with time. We can also look at the spectrogram as a limited combination of the time and frequency domain approaches. A spectrogram is usually depicted as a heat map, as in figure 2.5, where the intensity is shown by varying the colour in the picture.

Figure 2.5: Spectrogram of the first 37 seconds of the song Bohemian Rhapsody

One common way to generate a spectrogram is using the fast Fourier transform (FFT). The digitally sampled audio signal is split up into chunks, which are Fourier transformed to calculate the magnitude of the frequency spectrum for each chunk. A chunk corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time (the midpoint of the chunk).

The spectrogram has two important parameters which have to be considered to achieve the proper format for further analysis (see the sketch after this list):

• One is the length of the chunk, which determines the frequency resolution on the y-axis of the spectrogram.

• The other parameter is the level of overlap between chunks. These two parameters together determine the time resolution on the x-axis.
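The sketch below shows one way to compute such a chunk-based magnitude spectrogram with SciPy; the chunk length and overlap values are those used for the SSM example in the next section and are otherwise illustrative assumptions.

```python
from scipy import signal

def compute_spectrogram(x, sample_rate=44100, chunk_sec=0.25, overlap_sec=0.03125):
    """Magnitude spectrogram: one FFT per (overlapping) chunk, one column per chunk."""
    nperseg = int(chunk_sec * sample_rate)      # chunk length -> frequency resolution (y-axis)
    noverlap = int(overlap_sec * sample_rate)   # chunk overlap -> time resolution (x-axis)
    freqs, times, spec = signal.spectrogram(x, fs=sample_rate, nperseg=nperseg,
                                            noverlap=noverlap, mode="magnitude")
    return freqs, times, spec
```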

2.2 Melodic sample extraction

During the literature study about different ways of extracting returning melodic motifs from music tracks, I found studies about using the self similarity matrix (SSM) representation of the song [8][9][10]. This method turns the audio signal processing related problem into a hybrid problem using both sound and image processing techniques, where the main task is to find patterns in the SSM.

The main idea of this approach is to calculate the spectrogram of the song, then evaluate a similarity measure between each pair of column vectors (the spectra of the chunks) of the spectrogram. In the following subsections, I introduce the potential similarity measures and give a brief summary about how to calculate and interpret a self similarity matrix.

2.2.1 Similarity measure

A similarity measure quantifies how much alike two data objects are. In the context of this paper, the similarity measure is a distance whose dimensions represent the amplitude values of different frequencies. If the distance is small, the degree of similarity is high, while a large distance means a low degree of similarity.

For self similarity matrix calculation, there are several options to measure the similarity between the DFT vectors [11].

One option is the Euclidean distance function, the most common way to calculate distance (often referred to simply as distance). In this case, the distance between two points is the length of the path connecting them. The Pythagorean theorem gives this distance between two points p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n):

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (2.6)

Another common type of similarity function is the cosine similarity. The cosine similarity metric finds the normalized dot product of the two attributes. The cosine of 0° is 1, and it is less than 1 for any other angle:

sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (2.7)

Therefore the result depends only on the orientation and not on the magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

The last type of distance function which I present here is the Manhattan distance, a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates:

d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \qquad (2.8)
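For reference, the three measures above can be written in a few lines of NumPy; this is only a direct transcription of equations 2.6–2.8.

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(q, float) - np.asarray(p, float)) ** 2))  # Eq. 2.6

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))                # Eq. 2.7

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))           # Eq. 2.8
```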

2.2.2 Self similarity matrix

In data analysis, the self-similarity matrix is a graphical representation of similar sequences in a data series. The process to construct an SSM can be summarized in two main steps (a sketch of the second step follows the list):

1. The audio signal x is transformed into a series of feature vectors C = (c_1, ..., c_N) that divide x into N frames and capture specific frame-level characteristics of the given signal. This project - as I stated at the beginning of the section - only focuses on harmonic features, more specifically on the spectrogram.

2. C is used in order to obtain a self-similarity matrix (SSM) S, a symmetric matrix such that S(n, m) = d(c_n, c_m), ∀n, m ∈ [1 : N], where d is a distance function (e.g. Euclidean, cosine, Manhattan).
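A minimal sketch of step 2 is given below, building the SSM from the spectrogram columns with the Euclidean distance and normalizing the values into the 0 to 1 range used in figure 2.6; the nested loop is written for clarity rather than speed.

```python
import numpy as np

def self_similarity_matrix(spec):
    """S(n, m) = d(c_n, c_m) for the spectrogram columns c_1..c_N, scaled to [0, 1]."""
    cols = spec.T                                    # one feature vector (spectrum) per chunk
    n = len(cols)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = np.linalg.norm(cols[i] - cols[j])    # Euclidean distance (Eq. 2.6)
            ssm[i, j] = ssm[j, i] = d                # the matrix is symmetric
    return ssm / ssm.max()                           # normalize into the range 0..1
```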

An example of a self similarity matrix is shown in figure 2.6. The chunk size of the spectrogram is 0.25 seconds with an overlap value of 0.03125 seconds, and the applied similarity measure is the Euclidean distance function. The values in the matrix are normalized into the range of 0 to 1.

Figure 2.6: Self similarity matrix of a piano sample

The source of this matrix is a 1 minute and 17 seconds long random piano sequence into which I copied the same melody at positions 00:12, 00:27, 00:50 and 01:03. Since the matrix is illustrated as a heat map where the darkest points are the values close to 0 and the lightest points are the values close to 1, these repeating melodic parts are easily visible as darker diagonal line patterns, so-called paths.

However, besides these four identical parts, I copied an additional one which contains the same melody, but with different background chords. Its position is 00:37 and this part is barely visible in the matrix.

The red colored paths in figure 2.7 show where we should see the additional part if the matrix were well prepared for the pattern recognition.

Figure 2.7: Correction for the missing parts

The conclusion from this example is that some kind of preparation is needed if we want to recognize motifs which are not 100% identical. In the next chapter, I will present the techniques which I used in order to improve the format of the SSM and also give a detailed description of the algorithms which are responsible for the path recognition in the matrix.

2.3 Audio classification

In this section, I will give a short introduction to the technique which has been used for the audio classification, specifically drum category prediction. Nowadays, when data scientists have to implement some kind of classification on a set of data, they often use a machine learning (ML) solution or a very popular subset of ML, so-called deep learning. There are many ML based solutions for audio classification, so the question was which would fit this project best. During the literature study, I took three aspects into account when choosing the right tool:

• The most important aspect is the accuracy. The goal is to train a model which is able to predict the category with at least 75% accuracy.

• Another factor is the technical support for the chosen ML technique. In this case, support for Python or JavaScript was essential.

• The last aspect is the learning curve of the tool. Since I had no prior knowledge of data science, I had to choose a technique that is relatively easy to use and has good documentation.

After the research, I found many articles which described successful experiments in the field of audio classification with an algorithm called the convolutional neural network (CNN) [12][13][14]. CNNs have all the features stated above: according to the papers, their accuracy is sufficient, they have very good support in Python via the Tensorflow library (Keras API), and the process of input data preparation is much easier than for classic ML approaches. In the upcoming subsections, I provide some basic information about CNNs and how they can be used for audio related tasks.

2.3.1 Convolutional neural network

A convolutional neural network is a deep learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects or objects in the image, and differentiate one from the other. The pre-processing required by a CNN is much lower compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training a CNN has the ability to learn these filters/characteristics.

Figure 2.8: The workflow of the CNN algorithm

A CNN can capture the spatial and temporal dependencies in an image by applying multiple filters (layers) to it. The architecture performs a better fit to the image dataset due to the reduction in the number of parameters involved and the reusability of weights. Therefore, a convolutional network can be trained to understand the features of the image better. The role of the convolutional neural network is to reduce the images into a form which is easier to process, without losing information which is critical for getting a good prediction.

To achieve this reduced form, a CNN architecture is formed by a stack of distinct layers that transform the input volume. A few distinct and commonly used types of layers are listed below:

• Convolutional layer
The convolutional layer is the core element of a CNN. The layer's parameters consist of a set of kernels, which have a small receptive field but extend through the full depth of the input volume. During the process, each filter is convolved across the width and height of the input image, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.

Figure 2.9: Convoluting a 7x7x1 image with a 3x3x1 kernel to get a 5x5x1 convolved feature

• Pooling layer
The pooling layer, similar to the convolutional layer, is responsible for reducing the size of the convolved feature. This decreases the computational power required to process the data and is useful for extracting dominant features which are rotationally and positionally invariant, thus maintaining the effective training of the model.

There are two types of pooling: max pooling and average pooling. Max pooling returns the highest value from the chunk of the image covered by the kernel, while average pooling returns the average of all the values from the covered portion of the image.

Figure 2.10: 2x2 Average and max pooling

• ReLU layer
ReLU stands for rectified linear unit, which is the most commonly used activation function in CNNs. Mathematically, it is defined as y = max(0, x). It effectively removes negative values from an activation map by setting them to zero. It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent and the sigmoid function.

Figure 2.11: Commonly used activation functions

• Fully connected layer
After the algorithm has applied several convolutional and pooling layers to the input image, the high-level reasoning in the neural network is done via fully connected layers (or FC layers). To use the FC layer, we have to flatten the image into a column vector. The flattened output is fed to a feed-forward neural network and backpropagation is applied in every iteration of training. Over a series of epochs1, the model is able to distinguish between dominating and certain low-level features in images and classify them using the softmax2 classification technique.

The last important question is how to use a mainly image oriented neural network algorithm like the CNN for audio classification. The answer is very intuitive: use some kind of image representation of the sound as the input image for the CNN. As I described in the first section of this chapter, there are several potential visual representations (e.g. spectrum or simple waveform) of periodic signals which can be used for the training. The results of the training using different types of input images will be discussed in the next chapter, where I will give a detailed description of the implementation process.

2.4 The OP-Z kit file

In the first chapter, I gave a short introduction to the kit file format of the OP-Z device. The kit has to be an at most 12 seconds long AIFF file containing all the samples for the keys (maximum 24 samples for the 24 keys) of the sequencer.

Figure 2.12 shows the waveform of one of the OP-Z's built-in drum kits with the key layout. In this case, the order of the samples in the audio file is identical to the order of the keys. The starting and ending times of each sample are illustrated as dotted vertical lines. All the information about the positions of these time lines and other settings is described in a JSON object string stored in the meta data section of the AIFF file. The following subsections provide some information about this JSON object and the AIFF format.

1 One epoch is one forward pass and one backward pass of all the training examples.
2 Softmax is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities.

Figure 2.12: Kit waveform and the OP-Z key layout

2.4.1 The AIFF file format

AIFF is short for Audio Interchange File Format, an audio format initially created by Apple Computer for storing and transmitting high-quality sampled audio data. It supports a variety of bit resolutions, sample rates, and channels of audio. The format is quite popular on Apple platforms and is commonly adopted in professional programs that handle digital audio waveforms. AIFF files are uncompressed, making them quite large compared to the ubiquitous MP3 format. AIFF files are comparable to Microsoft's WAV files; because they are high quality, they are excellent for burning to CD. The file structure of a regular AIFF file with the OP-Z JSON is shown in figure 2.13. The important thing is that the JSON string has to be placed right before the Marker chunk, otherwise the OP-Z sequencer will not recognize the file as a valid kit file. The file manipulation in the application is done in vanilla JavaScript.

2.4.2 The OP-Z JSON Object

To have an AIFF kit file which can be interpreted by the OP-Z, the JSON object (placed before the Marker chunk) has to contain a specific list of attributes. These attributes describe different features and settings of the kit, such as parameters for each sample (start time, end time, pitch, volume, play mode and reverse) and kit wide settings (applied effect, LFO settings). In our case, the two important attributes are the start and end attributes.

They determine the position of each sample in the kit audio file. Both are arrays of integers with a maximum length of 24. The OP-Z uses a 31-bit format for the time values, so each start and end time value has to be converted from seconds to the compatible 31-bit integer format by the following formula, where t_{sec} is the time in seconds and t_{OP} is the 31-bit value:

t_{OP} = \frac{t_{sec}}{12} (2^{31} - 1) \qquad (2.9)
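Equation 2.9 translates directly into a small helper function; the 12-second kit length is the OP-Z limit stated earlier, everything else is a plain transcription.

```python
def seconds_to_opz(t_sec, kit_length_sec=12.0):
    """Convert a time in seconds to the OP-Z 31-bit position format (Eq. 2.9)."""
    return int(t_sec / kit_length_sec * (2 ** 31 - 1))

# Example: a sample starting at 3.0 s maps to seconds_to_opz(3.0) == 536870911
```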

Figure 2.13: Structure of the kit (.aiff) file

Chapter 3

Implementation

In this chapter, I provide a detailed description of the development process and also present some results of different test experiments. All the important theoretical and technical aspects of the implementation phase have been discussed in the previous chapter.

This chapter has three main sections:

• Loop sample extraction
In this section, I explain how spectrograms can be prepared to achieve a better result in the self similarity matrix and to cut down the computation time. Then I introduce a method to find patterns (paths) in the SSM and an algorithm to reduce the number of paths by handling time overlaps between them.

• Drum sample extraction and classification This part is about the algorithm to find potential drum samples by beat slicing using the energy graph of the audio data. After that I introduce the layer structure of the CNN module for the training and show the results of using different types of input data (waveform, spectrum and spectrogram).

• The application structure and user interface
In this part, I show the overall structure of the application and give some details about the Python back-end layer of the app and the communication between the GUI and the back-end. Furthermore, I present the main specification of the user interface and provide some basic information about the tools and libraries which have been used to create the final GUI.


3.1 Loop sample extraction

In Chapter 2, I introduced a possible method for detecting returning motifs in a music track by converting the raw audio data into a spectrogram format, then using the spectrogram as input for the self similarity calculation. In the following subsections, I will show what steps have been taken in the implementation process to apply this method for getting loop samples from music tracks.

3.1.1 Preprocessing the spectrogram

The example of the test piano audio file from Chapter 2 showed that the recognition of identical parts is easy to achieve; however, finding the path which represented the part that was not 100% identical to the rest seemed challenging. Therefore, I had to find a solution to improve the visibility of these kinds of paths in the SSM. From now on, I will use the same piano sequence to illustrate the improvements.

If we take a look at figure 3.1, we can see that the identical parts (blue arrow) and the similar part (red arrow) look quite different and in the higher frequency range, the signal is very weak.

Figure 3.1: The spectrogram of the piano sequence

The first step to improve the visibility is to get rid of the frequency range which is barely represented in musical audio signals. Figure 3.2 shows the result of a frequency reduced spectrogram. I chose to keep the range 0 Hz to 4000 Hz, which still includes 89% of the range of audible musical notes (C0 - B7)1. This step not only improves the visibility but also speeds up the SSM calculation.

1 The frequencies of C0 and B7 are 16.35 Hz and 3951.07 Hz in standard 440 Hz tuning.


Figure 3.2: The spectrogram of the piano sequence (0 - 4000 Hz)

Now we can recognize some kind of similarity between the blue and red parts; however, there is one more step to make the input better for the SSM calculation. Using some level of thresholding can eliminate noise and audio components with lower amplitude, so the less significant harmonics will not be present. In figure 3.3, we can see the result of an 80% threshold filter.

Figure 3.3: The spectrogram after thresholding

The red area still looks different to the human eye, but it has lost many noise components and weak harmonics, so in the next subsection I will show how this modified input works for the SSM calculation.
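The two preprocessing steps of this subsection can be sketched as follows; interpreting the 80% threshold as 80% of the maximum magnitude is an assumption of this sketch, not necessarily the exact rule used in the project code.

```python
import numpy as np

def preprocess_spectrogram(freqs, spec, f_max=4000.0, threshold=0.8):
    """Keep only 0..f_max Hz, then zero out bins below `threshold` of the maximum."""
    cropped = spec[freqs <= f_max, :]                # drop the weak high-frequency range
    limit = threshold * cropped.max()                # 80% threshold, as in figure 3.3
    return np.where(cropped >= limit, cropped, 0.0)  # suppress noise and weak harmonics
```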

3.1.2 Self similarity matrix and thresholding

In the previous chapter I showed the SSM of the piano sequence without modifying the spectrogram. The identical parts were clearly visible but the other similar part was invisible:

Figure 3.4: SSM of the piano sequence using raw spectrogram (red paths indicate the missing path)

The question is what kind of SSM output we get if we modify the spectrogram by dropping the higher frequencies (above 4000 Hz) and using thresholding (80%). Figure 3.5 shows such a result. The image shows that the paths of the identical parts are still recognizable and the SSM starts to show darker areas where the similar part's paths should be.

Figure 3.5: SSM of the piano sequence using preprocessed spectrogram

The next step is to execute another thresholding layer, but now applying it to the self similarity matrix. The result can be seen in figure 3.6. The left image is the raw outcome, while on the right image I have drawn red lines indicating the paths for the similar parts. Now it is clear that using this double threshold method improves the recognition of patterns which are not perfectly identical.

Figure 3.6: Self similarity matrices after thresholding

3.1.3 Sample extraction from the matrix

After the two thresholds are applied, the matrix is ready to be processed to find the paths which represent the melodic patterns that will be used as loop samples. This procedure has two main steps:

• Collecting the paths
This algorithm looks over the matrix2 searching for paths (chains of pixels that are diagonal to the main diagonal). If a path is longer than N elements3, the algorithm puts it onto the stack.

• Filtering the collected paths
After the algorithm has collected all the paths, the next move is to filter out every pattern which is duplicated (because of the nature of the SSM) or has some level of overlap with another pattern.
Filtering the duplicated patterns is quite straightforward. Every path has four coordinates: [x1, y1, x2, y2]. These coordinates define two melodic patterns, one is [x1, x2] and the other one is [y1, y2]. Therefore, the algorithm just picks one of these two patterns.
Handling time overlaps between patterns can be more flexible. In the case of this project, if a pattern is completely included in another path, it will be dropped. If a pattern overlaps with another one and the percentage of the overlap is larger than 50% of the longer pattern, the shorter pattern will be dropped (figure 3.7; a sketch of this filtering is given after the figure).

The patterns are saved as two-element arrays, where the first element is the starting time and the second element is the ending time (in seconds). After all the patterns are collected in an array, the final collection looks something like this: [[12.05, 17.1], [27.2, 32.2], [50, 55.2], [63.4, 68.5], ...]. This array is passed to the front-end layer, which can easily process this information and show the user the extracted samples in an audio waveform view.

2 Only one half of the matrix has to be processed, since similarity matrices are perfectly symmetrical about the main diagonal.
3 The value of N determines the minimum length of a pattern. E.g. if the time window of the spectrogram is 0.2 seconds without overlaps, then N=3 means that the length of the path has to be at least 0.6 seconds.

Figure 3.7: Filtering pattern from the SSM
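A sketch of the overlap filtering described above is shown below; patterns are [start, end] pairs in seconds, and processing the longest patterns first is an implementation choice of this sketch.

```python
def filter_overlaps(patterns):
    """Drop contained patterns and shorter patterns overlapping > 50% of the longer one."""
    kept = []
    for p in sorted(patterns, key=lambda s: s[1] - s[0], reverse=True):  # longest first
        p_len = p[1] - p[0]
        keep = True
        for q in kept:
            overlap = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
            longer = max(p_len, q[1] - q[0])
            if overlap >= p_len or overlap > 0.5 * longer:   # contained, or > 50% overlap
                keep = False
                break
        if keep:
            kept.append(p)
    return kept

# filter_overlaps([[12.05, 17.1], [13.0, 16.0], [27.2, 32.2]]) drops the contained pattern
```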

3.1.4 Algorithm overview

In this subsection, I show an overview of the algorithm that receives an audio file as input and returns an array of time values representing the starting and ending times of each sample:

1. Opening the audio file (mono/stereo wav, aiff or mp3).

2. Converting the audio file (using Node.js packages) to a mono WAV (44100 Hz) file to be compatible with the Python algorithm.

3. Generating the spectrogram.

4. Applying the first thresholding on the spectrogram.

5. Calculating the self similarity matrix.

6. Applying the second thresholding on the matrix.

7. Collecting the paths from the SSM.

8. Filtering the patterns.

9. Passing the collected patterns to the front-end.

3.2 Drum sample extraction and classification

In Chapter 2, I introduced the CNN, a neural network based technique which can be used for audio classification, and also gave an introduction to signal energy graph calculation. In this section, I explain how the algorithm performs beat slicing in order to extract drum samples from an audio file, and how these samples can be classified using a CNN model.

3.2.1 Beat slicing

When it comes to extracting drum samples from an audio signal, the approach is totally different from finding melodic samples. Drum detection or beat slicing does not need melody related information such as the frequency of notes. Beat slicing is usually based on signal energy calculation.

In this project, I used the energy graph of each audio signal to find peaks which represent the beats in the song. However, instead of using only the energy graph, I calculated the tangent (gradient or slope) using two neighbouring energy values and the time delta (the length of the time frame, L) between them:

y_{gradient}[i] = \frac{y_{energy}[i] - y_{energy}[i-1]}{L} \qquad (3.1)

After generating the tangent or gradient graph, the peaks of the energy graph appear as two neighbouring values in the gradient graph where the first value is positive and the second value is negative4. The gradient values also represent the steepness of the energy line.

The next step is to calculate a baseline based on the values of the gradient graph. In this project, instead of using a constant baseline, I implemented an adaptive baseline, which changes over time (with a predefined time constant), based on the gradient values of the peaks in a certain time frame. Figure 3.8 illustrates how this method works.

4 If the energy line is increasing the slope is positive, while a decreasing line's slope is negative.

Figure 3.8: Waveform, energy graph and gradient graph

In the figure above, there are three graphs: the waveform, the energy graph and the gradient graph. The gradient graph shows the following lines:

• The blue line represents the gradient values calculated from the energy graph. It is clearly visible that when there is a peak in the energy graph, the gradient graph has a positive to negative transition. The algorithm recognizes this transition as a peak; however, not all peaks qualify.

• To filter out insignificant peaks, the algorithm calculates a local baseline within a certain time frame5. The orange line represents this adaptive baseline. The value of the baseline is calculated from the local maximum of the positive-negative transitions' positive values. In this case, the baseline is 30% of the local maximum, therefore every peak that is higher than the baseline qualifies (a sketch of this peak detection follows the list).

The reason for using this adaptive baseline technique is that some songs have significant volume changes during their playtime, so a static baseline would cancel out important beats in the low volume regions.

5 In this case, the time frame for the adaptive baseline is set to one second.

• The dotted black line shows the position of the significant beats. In this example, the recognized beats match the song's actual tempo, which is 90 BPM.
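The peak detection with the adaptive baseline can be summarized in the following sketch; the 0.05 s frame length, the one-second baseline window and the 30% ratio follow the description above, while the exact windowing around each peak is an assumption.

```python
import numpy as np

def detect_beats(energy, frame_sec=0.05, baseline_sec=1.0, ratio=0.3):
    """Beats = positive-to-negative gradient transitions that exceed a local adaptive baseline."""
    gradient = np.diff(energy) / frame_sec              # Eq. 3.1 with L = frame_sec
    window = max(1, int(baseline_sec / frame_sec))      # frames per baseline window
    beats = []
    for i in range(1, len(gradient)):
        if gradient[i - 1] > 0 and gradient[i] <= 0:     # peak in the energy graph
            start = max(0, i - window // 2)
            local_max = gradient[start:start + window].max()
            if gradient[i - 1] > ratio * local_max:      # adaptive baseline: 30% of local max
                beats.append(i * frame_sec)              # beat position in seconds
    return beats
```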

The next issue to face is the number of beats. For instance, in the current example, more than 300 beats are detected. Therefore, the next step is to find only the unique beats, since in electronic or pop music tracks the same type of drum beat is repeated many times over the song. To apply a filter, I decided to use the same technique that I used for calculating self similarity matrices, which is distance calculation.

For the Euclidean distance calculation, the detected beat samples have to be converted to the frequency domain, then their spectra (arrays of amplitude values) can be used as the input for the distance function. The only thing which has to be decided is the distance limit, which determines the unique beats. After running different test scenarios, I chose the value of 0.4, meaning that if the Euclidean distance between two beat samples is higher than 0.4, they are considered different samples.

Figure 3.9: Unique drum samples using Euclidean distance function

Using this distance calculation technique for the drum samples delivers a 92-96% reduction in detected beats. The red dotted lines in figure 3.9 show the detected unique beats based on the distance calculations.
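A sketch of this uniqueness filter is given below; normalizing each spectrum before the comparison is an assumption made so that the 0.4 limit is scale-independent.

```python
import numpy as np

def unique_slices(spectra, limit=0.4):
    """Keep a beat slice only if its spectrum is farther than `limit` from every kept slice."""
    kept = []
    for spec in spectra:
        spec = np.asarray(spec, dtype=float)
        norm = np.linalg.norm(spec)
        if norm > 0:
            spec = spec / norm                                   # scale to unit length (assumption)
        if all(np.linalg.norm(spec - k) > limit for k in kept):  # Euclidean distance test
            kept.append(spec)
    return kept
```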

Now the application has a collection of unique drum samples, so the next move is to integrate the convolutional neural network module (Section 2.3) to perform drum class prediction for each sample. In the next subsection, I describe the procedure to create the CNN model for this purpose.

3.2.2 Generating the CNN model

The Keras API

In Chapter 2, I mentioned that convolutional neural networks have several implementations in different languages, but Keras is one of the most popular among them. It is a high-level API to build and train deep learning models. It is used for fast prototyping, advanced research, and production, with three key advantages:

• User friendly Keras has a simple, consistent interface optimized for common use cases. It provides clear and actionable feedback for user errors.

• Modular and composable
Keras models are made by connecting configurable building blocks together, with few restrictions.

• Easy to extend
Write custom building blocks to express new ideas for research. Create new layers, loss functions, and develop state-of-the-art models [15].

Using the Keras API seemed the best option for CNN based classification, since it is part of Tensorflow's6 core modules, which have great documentation and a large community.

6 Currently, Tensorflow is the number one Python library for deep learning research and application.

CNN layers

One of the most important parts of training convolutional neural networks is the structure of the layers. Choosing the right layers with the right parameters can affect the overall accuracy, training time or model size. After experimenting with several configurations, I chose the following layer structure for the project (a Keras sketch of this stack is given after figure 3.10):

1. Convolutional + ReLU layer Kernel size: 3x3

2. Pooling layer

• Type: max • Kernel size: 3x3

3. Convolutional + ReLU layer Kernel size: 3x3

4. Pooling layer

• Type: max • Kernel size: 3x3

5. Dropout layer7 Dropout rate: 0.5

6. Flatten layer8

7. Dense layer Dense output space dimensionality: 64

8. Dense + FC layer

• Dense output space dimensionality: 5 • FC type: softmax

For the batch size, I used 64, and all of the training runs lasted for at least 30 epochs.

7 Dropout consists of randomly setting a fraction of input units to 0 at each update during training time, which helps prevent overfitting.
8 Flattens the input. Does not affect the batch size.

Figure 3.10: Layer description by the Keras API
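The layer stack listed above corresponds to a Keras model along the following lines; the filter counts (32 and 64), the hidden Dense activation, the optimizer and the loss are not specified in the text and are assumptions of this sketch, and the 150x240 grayscale input shape follows the training images described in the next subsection.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(150, 240, 1), n_classes=5):
    """CNN with the layer stack from Section 3.2.2 (filter counts are assumptions)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),   # five drum classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training as described in the text: batch size 64, at least 30 epochs, 10% validation split
# model.fit(x_train, y_train, batch_size=64, epochs=30, validation_split=0.1)
```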

Training dataset

As I stated in the previous chapter, the CNN uses images as its training and validation datasets. Therefore, converting audio signals to images is the only way to use this method.

To collect drum samples in five different categories (kick, snare, clap, open hi-hat and closed hi-hat), I used different online sources which provide free sound packages or standalone samples. During the collecting phase, I managed to get 600 different samples across the five categories (120 samples each). For training, I set up a 10% validation split, which means that 540 samples were randomly selected for training and the rest were used for validation.

Before starting the training, these samples had to be converted into one of the visual representations and saved as images with relatively low resolution. I ran trainings with three different input types:

• Waveform

– Image resolution: 240x150
– Processed as: Grayscale images
– Best validation accuracy: 72%

• Spectrum

– Image resolution: 240x150
– Processed as: Grayscale images
– Best validation accuracy: 82%

• Spectrogram

– Image resolution: 240x150
– Processed as: Colored images
– Best validation accuracy: 79%

Figure 3.11: Waveform, spectrum and spectrogram as training images (kick, snare, clap and open hi-hat)

Surprisingly, over several training sessions, using the spectrum provided the best validation accuracy of 82%. Therefore, during classification in the app, the algorithm has to transform each detected beat sample into the frequency domain, then save it as an image with the correct resolution (240x150). This part can be done with a few lines of Python code, sketched below.
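The sketch below shows one way to do that conversion with Matplotlib; the plotting style and file handling are assumptions, only the spectrum input and the 240x150 pixel output follow the text.

```python
import numpy as np
import matplotlib.pyplot as plt

def save_spectrum_image(beat_samples, out_path, dpi=100):
    """Save the magnitude spectrum of one beat slice as a 240x150 pixel image."""
    spectrum = np.abs(np.fft.rfft(beat_samples))               # beat slice -> frequency domain
    fig = plt.figure(figsize=(240 / dpi, 150 / dpi), dpi=dpi)  # 240x150 pixels
    plt.plot(spectrum, color="black")
    plt.axis("off")                                            # image content only, no axes
    fig.savefig(out_path, dpi=dpi)
    plt.close(fig)
```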

3.2.3 Algorithm overview

Here I present an overview of the algorithm that receives an audio file as input and returns an array of time values and drum categories representing each drum sample:

1. Opening the audio file (mono/stereo wav, aiff or mp3).

2. Converting the audio file (using Node.js packages) to a mono WAV (44100 Hz) file to be compatible with the Python algorithm.

3. Generating the gradient graph and calculating the adaptive baseline.

4. Slicing the gradient graph based on the baseline values.

5. Generating the images for each beat (low resolution spectrum).

6. Executing the classification algorithm on each image.

7. Passing the collected time values and drum categories to the front-end.

3.3 The application structure and user interface

The application consists of two layers. On the one hand, the back-end layer contains all the Python scripts for loop and drum sample extraction, drum classification and the server side code of the RPC9 communication using a library called ZeroRPC. On the other hand, the front-end layer contains the client side RPC script and the web technology based user interface implementation using the ElectronJS framework.

In the previous sections, I covered most of the Python algorithms which are responsible for the audio processing, therefore this section will mainly focus on the front-end and the communication between the two layers. Figure 3.12 shows the block diagram of the application, indicating the two layers and their components.

9 Remote Procedure Call (RPC) is a protocol that one program can use to request a service from a program located in another computer on a network without having to understand the network's details. A procedure call is also sometimes known as a function call or a subroutine call [16].

Figure 3.12: Software diagram of the TE Sample Finder app

At the beginning of the project, I had to find out what kind of languages can be used for a cross-platform desktop application that deals with signal processing and deep learning prediction, has a good looking user interface and is easy to maintain. The problem was that the number of single tools and languages which are good for all these purposes is very limited, therefore I decided to split the project into a back-end and a front-end level. In this two-layered structure it is very common to use different technologies for each layer. Choosing Python for the back-end was a practical and reasonable decision, since Python has a large collection of DSP and machine learning libraries (such as NumPy, SciPy, OpenCV or Tensorflow), and during my master studies I had already gained some skills in Python programming.

3.3.1 The Electron framework

For the front-end, I chose Electron. It is an open-source framework that allows for the development of cross-platform desktop GUI applications using web technologies. It combines the Chromium10 rendering engine and the Node.js runtime. Electron is developed and maintained by GitHub, has very good support and a large community, and is used by companies like Microsoft, Atlassian, Discord or Slack [18].

There are three main advantages to using Electron:

• Developers can use most of the JavaScript and HTML/CSS libraries and packages which were originally made for building web applications. With these libraries, it is fast and easy to create good looking and well structured graphical interfaces.

• Thanks to the Node.js API integration, Electron applications are not as limited as web applications. For example, developers can access the local file-system of the device.

• Electron also includes the Node.js Child Process module, which makes it possible to spawn child processes. Therefore, an Electron app is able to execute Python scripts as child processes. This functionality is essential for connecting to the back-end, which is fully written in Python.

10Chromium is a free and open-source web browser developed by Google. Its rendering engine, Blink, is used in the Google Chrome browser and many other projects. It is developed as part of the Chromium project with contributions from Google, Facebook, Microsoft, IBM, and others. [17]

3.3.2 The ZeroRPC module

To have a constant connection between the UI and the Python scripts, I implemented an RPC-based API using the ZeroRPC library. ZeroRPC has two main branches: one is a Python library, which can be used to create a server on the local machine; the other is a Node.js package, which is used for implementing the client side of the RPC.

After the application starts, it spawns a child process that executes the api.py script. This API imports all three Python scripts (drum, melodic, classification) for the signal processing tasks. When the child process is created, the app also starts the ZeroRPC client. During the lifetime of the app, methods from the UI use this client to send requests to the API, which executes the corresponding Python function from the imported scripts. For example, when the user presses the Drum samples button, the UI calls a method which uses the client to send out the getDrumSamples request; the API then identifies the request type and calls the find_drum_samples() function from the drum_processing.py Python file. A sketch of the Python side of this API is shown below.
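The server side can be set up with the zerorpc Python library roughly as follows; the class name, the port number and the single exposed method are illustrative assumptions, while drum_processing stands for the project's own module.

# Sketch of the back-end side of the RPC API (api.py), using zerorpc.
# The class name, port and exposed method are illustrative; the real api.py
# imports all three processing scripts.
import zerorpc

import drum_processing  # project module providing find_drum_samples()


class SampleFinderAPI:
    """Methods exposed to the Electron front-end over ZeroRPC."""

    def getDrumSamples(self, wav_path):
        # Run the drum pipeline and return the time values and drum classes.
        return drum_processing.find_drum_samples(wav_path)


if __name__ == "__main__":
    server = zerorpc.Server(SampleFinderAPI())
    server.bind("tcp://127.0.0.1:4242")  # the Node.js ZeroRPC client connects here
    server.run()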

3.3.3 The graphical user interface

In this subsection, I will introduce the main elements and behavior of the graphical user interface (GUI). The main requirements for the GUI were:

• File browser or drag & drop file opening

• Displaying the position and number of extracted samples

• Showing the predicted class for each drum sample

• Edit window for additional sample editing

• Displaying the final kit’s waveform

• Ability to modify the key assignment for each sample

• Showing errors to the user

Figure 3.13: The user interface

Figure 3.13 shows the GUI of the application, processing a popular electronic music track. On the top left side is the file opener element. The user can click on it to use the file browser or just drag and drop an audio file onto the app window to open it. If the application successfully opens the file, the file name is displayed in the element.

Under the file opener, there are the three waveform views. The first is the primary view, which shows the full song and also displays the extracted samples. The user can use the right and left arrow keys to navigate between the samples and press the space key to start listening to them. Using the up and down arrow keys, the user can switch between the views. The middle one is the edit view. If the user presses the E key, the selected sample is displayed in the edit view at a high zoom level. In this view, the user can use the right and left arrow keys together with the Shift and Ctrl keys to adjust the duration and position of the sample. The last one is the final view, which shows the current content of the final kit file. Using the arrow keys, the user can switch between the samples, play them with the space key or delete the selected sample by pressing the delete key.

For these three view elements, I used a JavaScript library called wavesurfer.js, which provides customizable audio waveform visualization built on top of the Web Audio API and HTML5 Canvas. [19]

Under the final view, there is the OP-Z keyboard layout. When the user selects a sample in the final view, the assigned key is shown in green in the keyboard layout, while the other already taken keys are shown in red. Furthermore, the user is able to reassign the key for each sample by clicking on a different key in the layout.

On the right side of the GUI, there are the buttons for finding loop or drum samples. By pressing them, the app loads the detected samples into the primary view, marking them as red sections on the green waveform. There is also a selector for the loop finder, where the user can set how strict the sample finder should be when it runs the distance function on the self similarity matrix11. The other buttons are the Add sample button, for adding the edited sample from the edit view to the final view, and the Save button, which generates the final kit file after the user has finished collecting the samples.

Besides the buttons, there are three windows, one for each view, to provide basic information. The top window shows the number of extracted samples, the state of the app (loading, error etc.) and the type of the selected sample (loop, or drum class with accuracy percentage). The middle window shows the length of the edited sample, while the bottom window displays the length of the final kit audio file, which has to be shorter than 12 seconds because of the OP-Z file specification.

11The percentage value in the selector correlates with the threshold level applied on the SSM (Chapter 3.1.2).

Figure 3.14: The loading screen while searching for loop samples

The application also has a full-screen loading screen with the current loading message (Figure 3.14) to inform the user about the currently running process and to prevent the user from interacting with the app while it is busy.

Chapter 4

Conclusion

Finally, as a conclusion, I will present the current results and the potential future improvements.

4.1 Results

The application fulfills most of the requirements which I presented in Chapter 1. It can open several different audio file formats (WAV, MP3 and AIFF) and process them for sample detection. Extracting melodic motifs and drum samples also works well, alongside the manual editing option for the user. The application is also able to generate OP-Z compatible kit files with the sufficient meta information in them.

However, the drum classification does not work as accurately as expected. When the input is a clean drum track, the application offers high accuracy, but using complex songs with multiple instruments playing at the same time confuses the prediction. Implementing source separation to clean the drum samples from other instrument and vocal layers would greatly increase the accuracy of the prediction. The other minor issue is performance. Overall, the application works at a relatively comfortable speed, but generating self similarity matrices for songs can take 20-40 seconds depending on the user's hardware and the length of the songs.

The quality of the detected drum samples is not always the best. While the quality of the loop samples is usually similar across genres, drum samples from electronic music tracks and pop songs are generally better than drum samples from classical music or vocal-oriented songs.

4.2 Future work

As I stated in the previous section, the application still has to be improved in some areas, and there are also ideas for additional features. I list some of these below:

• Improving the drum class prediction by using music source separation. Source separation is a relatively complex task but it would benefit the classification and the final sample quality as well.

• Getting rid of the Python part and rewriting the whole back-end in Node.js. This could make the app more responsive and easier to maintain. However, it would require re-implementing major parts of already existing Python libraries.

• Converting the application to a web app. This could be done after the whole app is written in Node.js / JavaScript, or after the Python back-end is moved to an online server using one of the cloud services.

• Integrating Bluetooth remote control and file transfer with the OP-Z de- vice. With a Bluetooth communication API in the app, the user could use the OP-Z device as a remote control for the app or save the created kit files immediately on the device.

• Extending the drum categories for prediction. After the accuracy is improved, the app could have a larger collection of drum classes for prediction. However, this would require many more samples for the training and maybe a more sophisticated training configuration.

• Updating the application to be able to open and handle more than one input file at the same time. This feature would help the user to collect samples from different tracks more easily.

4.3 Final thought

This project showed that automating melodic motif and beat sample detection is possible for music creation purposes. Moreover, using a convolutional neural network to combine image classification and audio signal processing provided surprisingly good results for pure drum audio samples, meaning that this solution has a lot of potential.

Personally, I really liked working on this project because it helped me to deepen my knowledge about audio signals and gave me the opportunity to learn the basics of neural networks and user interface design. During the development, I could experience what it is like to work with professionals in the audio hardware industry. In this project, I could also use the skills which I acquired during my master studies in signal processing, Python programming and research methodology.

Bibliography

[1] Andrea Pejrolo. Creative Sequencing Techniques for Music Production. Taylor & Francis, 2012.
[2] OP-Z guide. https://teenage.engineering/guides/op-z/. Accessed: 2019-07.
[3] David Pituk. Tuner implementation on Android OS. BSc thesis. Budapest University of Technology and Economics. July 2016.
[4] Martin Weik. Communications Standard Dictionary. Springer US, 1995.
[5] Winser Alexander and Cranos Williams. Digital Signal Processing. 2017.
[6] S. Allen Broughton and Kurt Bryan. Discrete Fourier Analysis and Wavelets: Applications to Signal and Image Processing. John Wiley & Sons, 2018.
[7] Robert J. Marks II. Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, 1991.
[8] Oriol Nieto and Morwaread M. Farbood. Identifying Polyphonic Patterns From Audio Recordings Using Music Segmentation Techniques. Music and Audio Research Lab, New York University.
[9] Diego F. Silva. Fast Similarity Matrix Profile for Music Analysis and Exploration. IEEE Transactions on Multimedia, vol. 14, no. 8. Aug. 2015.
[10] Jonathan Foote. Visualizing music and audio using self-similarity. MULTIMEDIA '99: Proceedings of the seventh ACM international conference on Multimedia. Nov. 1999.
[11] Distance and Similarity Measures. Wolfram Language & System Documentation Center. Accessed: 2019-08.
[12] Shawn Hershey et al. CNN Architectures for Large-scale Audio Classification. Google, Inc. Jan. 2017.
[13] Sajjad Abdoli, Patrick Cardinal, and Alessandro Lameiras Koerich. End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network. Apr. 2019.
[14] Jongpil Lee et al. Raw Waveform-based Audio Classification Using Sample-level CNN Architectures. 31st Conference on Neural Information Processing Systems. Dec. 2017.
[15] Antonio Gulli and Sujit Pal. Deep Learning with Keras. Packt Publishing Ltd, 2017.
[16] Ronald A. Olsson and Aaron W. Keen. "Remote Procedure Call". In: The JR Programming Language 365 (2004), pp. 99–105.
[17] The Chromium Projects. The official website of the Chromium open-source project: https://dev.chromium.org/chromium-projects. Accessed: 2019-08.
[18] Felix Rieseberg. Introducing Electron. O'Reilly Media, Inc., 2017.
[19] Wavesurfer.js documentation. https://wavesurfer-js.org/docs/. Accessed: 2019-08.

TRITA-EECS-EX-2019:773 www.kth.se