DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Content-based music recommendation system: A comparison of supervised Machine Learning models and music features

MARINE CHEMEQUE-RABEL

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Marine Chemeque-Rabel

[email protected]

Master in Computer Science

School of Electrical Engineering and Computer Science

Supervisor: Bob Sturm

Examiner: Joakim Gustafsson

Tutor: Didier Giot, Aubay

Swedish title: Innehållsbaserat musikrekommendationssystem

Date: August 18, 2020

Abstract

As streaming platforms have become increasingly popular in recent years and music consumption has grown, music recommendation has become an increasingly relevant problem. Music applications are trying to improve their recommendation systems in order to offer their users the best possible listening experience and keep them on their platform. For this purpose, two main models have emerged: collaborative filtering and the content-based model. In the former, recommendations are based on similarity computations between users and their musical tastes. The main issue with this method is called cold start: the system will not perform well on new items, whether music or users. In the latter, information is extracted from the music itself in order to recommend similar tracks. It is the second method that has been implemented in this thesis.

The state of the art of content-based methods reveals that the features that can be extracted are numerous. There are low-level features, which can be temporal (zero crossing rate), spectral (spectral decrease), or perceptual (loudness), and which require knowledge of physics and signal processing. There are middle-level features that can be understood by musical experts (rhythm, pitch, ...). Finally, there are high-level features, understandable by all (mood, danceability, ...). It should be underlined that the models identified during the literature review are also abundant.

Using the two datasets GTZAN and FMA, we first aim to find the best model, focusing only on supervised models and their hyperparameters, in order to achieve relevant recommendations. We then need to determine the best subset of features to characterise the music while avoiding redundant and parasitic information. One of the main challenges is to find a way to assess the performance of our system.

Sammanfattning

Med anledning till att streamingplattformar har blivit mer och mer populära under de senaste åren, och musikförbrukningen har ökat, har musikrekommendationen blivit en allt viktigare fråga. Musikapplikationer försöker förbättra sina rekommendationssystem genom att erbjuda sina användare den bästa möjliga lyssningsupplevelsen och hålla dem på sin plattform. För detta ändamål har två huvudmodeller framkommit, samarbetsfiltrering och innehållsbaserad modell. I den första är rekommendationer baserade på likhetsberäkningar mellan användare och deras smak. Huvudfrågan med denna metod kallas kallstart, den beskriver det faktum att systemet inte kommer att fungera bra på nya objekt, vare sig för musik eller användare. I den senare modellen handlar det om att extrahera information från själva musiken för att rekommendera en annan. Det är den andra modellen som har implementerats i denna avhandling.

Det senaste inom innehållsbaserade metoder avslöjar att de funktioner som kan extraheras är många. Det finns faktiskt lågnivåfunktioner som kan vara temporära (nollövergångshastighet), spektrala (spektral minskning) eller till och med perceptuella (perceptuell höghet) som kräver kunskap om fysik och signalbehandling. Det finns funktioner på medelnivå som kan förstås av musikaliska experter (rytm, tonhöjd ...). Slutligen finns det funktioner på högre nivå, förståliga för alla (humör, dansbarhet ...). Det bör betonas att de modeller som identifierats under pappersavläsningssteget också är rikliga.

Med hjälp av de två datamängderna GTZAN och FMA är målet för det första att hitta den bästa modellen genom att endast fokusera på övervakade modeller, liksom dess hyperparametrar, för att uppnå en relevant rekommendation. Å andra sidan är det också nödvändigt att bestämma den bästa delmängden av funktioner för att karakterisera musiken samtidigt som man undviker redundant och parasitisk information.
En av utmaningarna är att hitta ett sätt att bedöma prestandan i vårt system.

Contents

1 Introduction ...... 1
  1.1 Context ...... 1
  1.2 Purpose and specifications ...... 1
  1.3 Research question ...... 2
  1.4 Overview ...... 2

2 Background ...... 3
  2.1 Recommendation overview ...... 3
    2.1.1 Recommendation definition ...... 3
    2.1.2 Music is different ...... 3
    2.1.3 What is a good recommendation? ...... 3
    2.1.4 Available data ...... 4
  2.2 Types of recommendation systems ...... 5
    2.2.1 Collaborative approach ...... 5
    2.2.2 Content-based approach ...... 6
    2.2.3 Context-based approach ...... 7
    2.2.4 Hybrid approach ...... 7
  2.3 Models for content-based recommendation ...... 7
    2.3.1 Logistic Regression ...... 8
    2.3.2 Decision Trees ...... 10
    2.3.3 Bagging: Random Forest ...... 10
    2.3.4 Boosting: Adaboost ...... 11
    2.3.5 k-Nearest Neighbours ...... 12
    2.3.6 Support Vector Machine ...... 13
    2.3.7 Naive Bayes ...... 14
    2.3.8 Linear Discriminant Analysis ...... 14
    2.3.9 Neural Networks ...... 14
  2.4 Features for content-based recommendation ...... 16
    2.4.1 Low-level features ...... 17
    2.4.2 Middle-level features ...... 22
    2.4.3 High-level features ...... 23
  2.5 Feature selection algorithms ...... 24
    2.5.1 Filter model ...... 24
    2.5.2 Wrapper model ...... 25
    2.5.3 Embedded model ...... 26

3 Methods ...... 27
  3.1 Chosen approach ...... 27
  3.2 Datasets ...... 27
    3.2.1 GTZAN ...... 28
    3.2.2 Free Music Archive ...... 28
    3.2.3 Data augmentation ...... 30
  3.3 Feature extraction ...... 31
    3.3.1 Preprocessing ...... 31
    3.3.2 Chosen features ...... 31
    3.3.3 Wrapper model for feature selection ...... 31
  3.4 Models ...... 32
    3.4.1 Hyperparameter tuning ...... 32
  3.5 Evaluation ...... 33
    3.5.1 Evaluation of the classification using labels ...... 33
    3.5.2 Evaluation of the prediction using confusion matrices ...... 34
    3.5.3 Evaluation of the prediction based on human opinion ...... 34

4 Results ...... 35
  4.1 Preliminary results ...... 35
    4.1.1 Tests on FMA ...... 35
    4.1.2 Tests on GTZAN ...... 39
  4.2 Dataset creation ...... 41
  4.3 Hyperparameter tuning ...... 42
    4.3.1 Logistic regression optimization ...... 43
    4.3.2 Decision tree and random forest optimization ...... 43
    4.3.3 Adaboost optimization ...... 44
    4.3.4 K-nearest-neighbours optimization ...... 45
    4.3.5 Support vector machine optimization ...... 46
    4.3.6 Linear Discriminant Analysis ...... 47
    4.3.7 Feed-Forward Neural Network ...... 47
    4.3.8 Global results after tuning ...... 47
  4.4 Feature selection ...... 47
    4.4.1 Most important features ...... 52
  4.5 Data augmentation ...... 54
  4.6 Final examples of recommendations ...... 54

5 Conclusions and discussions ...... 57
  5.1 Discussion of the results ...... 57
    5.1.1 Quantitative results ...... 57
    5.1.2 Qualitative results ...... 57
  5.2 Conclusion ...... 58
    5.2.1 Research question ...... 58
    5.2.2 Known limitations ...... 58
  5.3 Future work ...... 59
    5.3.1 Improvement suggestions ...... 59
    5.3.2 Application development ...... 59

1 Introduction

1.1 Context

In 1979, one of the first recommendation systems was born. Elaine Rich described her Grundy library system [1]: it recommends books to users following a short interview. The user is first asked to fill in their first and last name; then, in order to identify the user's preferences and classify them into a "stereotype", Grundy asks them to describe themselves in a few key words. Once the information has been recorded, Grundy makes an initial suggestion by displaying a summary of the book. If the suggestion does not please the user, Grundy asks questions to understand on which aspect of the book it made a mistake and suggests a new one. However, its use remained limited and Rich faced problems of generalisation.

The recommendation systems that really emerged in the 1990s have developed strongly in recent years, especially with the introduction of Machine Learning and networks. On the one hand, the growing use of the current digital environment, characterised by an overabundance of information, has allowed us to obtain large user databases. On the other hand, the increase in computing power made it possible to process these data, especially thanks to Machine Learning, when human capacities were no longer able to carry out an exhaustive analysis of so much information.

Unlike search engines, which receive requests containing precise information from the user about what they want, a recommendation system does not receive a direct request from the user, but must offer them new possibilities by learning their preferences from their past behaviour. E-commerce sites that aim to sell a maximum of items or services (travel, books, ...) to customers must therefore recommend suitable goods quickly. As for sites that offer streaming music and movies, their goal is to keep their users on their platform as long as possible. The common point is that it is necessary to make adequate recommendations.
Recent progress in this field is considerable, and these recommendations are as beneficial for companies, which maximise their profits, as they are for customers, who are no longer overwhelmed by the number of possibilities. Decision-making is made easier, and a good recommendation is therefore a significant time saver. In 2006, Netflix, which was then an online DVD rental service, launched the Netflix Challenge with $1 million to be won. The goal of the contest was to build a recommendation algorithm that could surpass the current one by 10% in tests. The contest generated a lot of interest, both in the research community and among movie lovers. The prize was won 3 years later and highlighted several methods and research directions to solve this kind of problem. A recommendation system will be defined according to Burke's definition [2]: it is a system capable of providing personalised recommendations or guiding the user to interesting or useful resources (called items) within a large data space.

1.2 Purpose and specifications

The project, entitled Aubay Musical Playlist, was carried out in Aubay's "Innov" division. It is a brand new Research and Development project; its goal is to achieve a complete state of the art of the available methods in order to offer

a functional and efficient music recommendation system. This project does not have a direct client, so the training dataset is not provided and thus needs to be determined. In the long term, the goal is not only to recommend existing songs but also to generate songs adapted to the musical taste of the user. During this master thesis I focused on the recommendation part while exchanging with a colleague in charge of the generation part. The future of the project will consist in gathering these two parts in order to have a fully functional recommendation system.

The aim of this thesis is to explore the different recommendation approaches, the available datasets, the ways to take the user's preferences into account, and the machine learning methods, in order to build a suitable recommendation system. One important part was dedicated solely to determining how to evaluate this recommendation system. This project will be introduced to the members of the company and will take the form of an application. The user will be asked to upload a piece of music (mp3 or wav format) and the application will recommend tracks to be listened to afterwards.

1.3 Research question

This master thesis focuses on two aspects: determining the listener's preferences and evaluating our recommendations. The main research question is the following:

How can a music listener’s tastes be taken into consideration in order to automatically recommend music? How can one measure the tastes of a music listener?

Several points will, therefore, have to be addressed:
- How to classify the music's style?
- How to take tastes into account?
- How can the performance of such a system be measured?

1.4 Overview

This report is structured as follows. The technical background required for this project will first be described in detail: the different approaches that can be used to implement recommendation systems will be presented, the machine learning methods that will be experimented with in this thesis will be described, and the ways to evaluate our results will also be presented. The Methods section will describe the work performed: I will first introduce the chosen dataset and the reasons why it was chosen, then detail the experiments that were carried out. Finally, the Results section will highlight and give a visualisation of the main results obtained. Quantitative and qualitative interpretations of these results will allow us to reach a final unique model and to answer the research sub-questions. In the "Conclusions and Discussions" section I will discuss the future of this work.


2 Background

2.1 Recommendation overview

2.1.1 Recommendation definition

In this thesis the focus will be on recommendation systems. A recommendation system is a set of techniques and services whose purpose is to propose to users articles that are likely to interest them. They are presently implemented on multimedia content distribution platforms (Netflix, Deezer, Spotify, ...), online sales platforms (Amazon, eBay, ...), social networks (Facebook, Twitter, ...), and so on. Recommendation systems are particularly useful when the number of users and articles becomes very large: users are unlikely to know all the richness of the catalogue offered by the service, and it can be argued that it is almost impossible to make a personalised human prescription for all the users of a service. The purpose of the recommendation system is to lead users through the vast amount of data available, particularly on e-commerce platforms, filtering this data to automatically propose to each consumer the items that are likely to be of interest to them.

2.1.2 Music is different

Recommendation systems are used in more and more fields: hotels, travel, products. But the musical field has some particularities to take into account [3]. The first factor to consider is the duration of a music track. As a track is short, it is less critical to make a bad recommendation than it is for a movie or a book, for example. The user can also quickly browse through the music to see whether it suits their taste. A second specificity is the number of tracks available: the choice is very wide, and it is estimated that at least tens of millions of songs are accessible on the Internet. It is also common for repeated recommendations of the same music to be appreciated: while for trips or movies the user is looking for diversity, a listener may like to hear the same music over and over again. Moreover, it is possible that at the first listening the user was not attentive, since listening to music is often done in parallel with another activity (sport, work, ...); attentive listening requires quality hardware, the proper mood, and exclusive attention time. It is also quite easy to extract a set of features from one piece of music: information can be extracted through signal processing, musical knowledge, lyrics, or simply user feedback. Old music is as relevant as new music: recent music, music from a few decades ago, or classical music can be equally enjoyable; it is a matter of correctly understanding the user's tastes. It must also be taken into account that music listening is often passive: the listener does not necessarily listen attentively (in shops, in bars, while working, ...). The last point that distinguishes music from other recommendable items is that music is often played in sequence: as tracks are short, they are often chained together in the form of a playlist.

2.1.3 What is a good recommendation?

Taking these peculiarities into consideration, it is now necessary to make an adequate recommendation. Naturally, the main objective is to achieve a good

level of accuracy, which means predicting music that the user will like and listen to. The more the user trusts the recommendation system and knows how it works, the more effective it will be. A successful recommendation involves a trade-off between exploitation and exploration [4]. On the one hand, exploitation consists in playing safe music, music that the recommender knows the user likes: this is called the lean-back experience and it brings short-term rewards. On the other hand, exploration is about playing new music and making new discoveries: this is called the lean-in experience and it brings long-term rewards. If it is properly gauged, a little serendipity may please [5]. This implies that we need to find the appropriate balance between novelty and familiarity, diversity and similarity, as well as popularity and personalisation.

A relevant recommendation must also reflect the listener's context. It cannot be based only on music and listener properties; the mood, the activity, and so on also need to be taken into account. For example, someone who is working does not want to listen to the same music as when they are running. Finally, transparency with users is a crucial point. It has been shown that explaining how the algorithm works to the user improves their confidence and therefore the time they will spend on the platform, firstly to perfect their profile and secondly because they will get better recommendations [4].

2.1.4 Available data

The main goal of Music Information Retrieval is to extract the most relevant information from various representations of a piece of music (audio, lyrics, web, metadata, ...).

The features extracted can be split into four categories. The first one is the music content [6], which groups together three types of features. Signal processing techniques give us access to low-level features, that is, machine-interpretable features. They can be temporal (zero crossing rate) or spectral (spectral flux, spectral decrease, ...). Musical knowledge is required to extract middle-level features such as the beat, tonality, or pitch. Finally, while the previous ones are only understandable by the machine or by experts, high-level features are accessible to everyone, such as danceability or liveness.

The second category is the music context. The goal is to retrieve as much information as possible (country, related artists, genres) based on metadata, thanks to web pages, blogs, lyrics, tags, ... [6]

The usable data does not only come from the music itself, but can also be focused on the user. First of all, there are the listener properties. Everyone has tastes and preferences; these can be retrieved implicitly (plays, playlists) or explicitly (thumbs, stars) [6]. Finally, the last category is the listener context. Data can be retrieved directly from the sensors of the device the listener is using. They are useful because the desired music varies strongly according to the mood of the listener and their activity (sport, work, ...) [6].

There are different methods to get information about the user's context [4]. It can be retrieved explicitly, that is, directly by asking the user (through forms, rating polls, ...). Some information can also be deduced implicitly from

the sensors of our devices (heart rate, light intensity, accelerometer, position, weather, and so on). Another way is to infer them: Machine Learning and statistical techniques can be used to draw conclusions, for example inferring the activity from the position and movement speed.
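As an illustration of the low-level temporal features mentioned above, the zero crossing rate can be computed directly from the waveform; the sketch below uses a synthetic sine tone and plain NumPy rather than a real audio file, so the sampling rate and signal are toy values (libraries such as librosa provide an equivalent feature extractor).

```python
import numpy as np

def zero_crossing_rate(signal: np.ndarray) -> float:
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(signal)
    # A zero crossing occurs where the product of neighbouring signs is negative.
    crossings = np.sum(signs[:-1] * signs[1:] < 0)
    return crossings / (len(signal) - 1)

# Toy example: a 440 Hz sine sampled at 22050 Hz crosses zero twice per period,
# so the expected rate is about 2 * 440 / 22050 ~= 0.0399.
sr = 22050
t = np.arange(sr) / sr              # one second of "audio"
tone = np.sin(2 * np.pi * 440 * t)
zcr = zero_crossing_rate(tone)
```

A noisy or percussive track yields a much higher rate than a smooth tonal one, which is why this feature is a cheap proxy for noisiness.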

Figure 1: Factors in Music Information Retrieval [6]

2.2 Types of recommendation systems

There are three main types of recommendation systems which provide the ability to create music playlists adapted to a user: collaborative filtering, content-based information retrieval techniques, and context-based recommendation. A combination of the previous techniques is possible and is called hybrid [7].

2.2.1 Collaborative approach

This recommendation method is based on the analysis of both the behaviour of the listener and the behaviour of all the other users of the platform. The fundamental assumption here is that the opinions of other users can be used to provide a reasonable prediction of another user's preferences for an item that they have not yet rated: a user is given recommendations based on users with whom they share the same tastes. Indeed, for years, in order to choose music, restaurants, movies, and so on, we have been asking our friends, family, and colleagues to recommend something they liked, and it is this mechanism that is reproduced here. Netflix was a pioneer of this method (based on stars given by other users) but it is now widely used, including for Spotify's Discover Weekly [8].

The first family of collaborative filtering methods is called the memory-based approach. The principle is to store all the data in a Users/Songs matrix. This can be done thanks to implicit or explicit feedback. In the former, the value is 1 if the item has been listened to at least once, 0 otherwise. In the latter, the value is the number of stars if available, 0 otherwise. We end up with a large matrix. To reduce it, Spotify approximates this matrix by an inner product of two smaller matrices [9].
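The implicit-feedback Users/Songs matrix described above can be sketched as follows; the users, songs, and play log are invented toy data, using the convention from the text (1 if the song has been played at least once, 0 otherwise).

```python
import numpy as np

users = ["alice", "bob", "carol"]
songs = ["song_a", "song_b", "song_c", "song_d"]
# Hypothetical play log: one (user, song) pair per listen.
plays = [("alice", "song_a"), ("alice", "song_a"), ("alice", "song_c"),
         ("bob", "song_b"), ("carol", "song_c"), ("carol", "song_d")]

R = np.zeros((len(users), len(songs)), dtype=int)
for user, song in plays:
    # Implicit feedback: mark 1 once a song has been played at least once,
    # so repeated listens do not change the entry.
    R[users.index(user), songs.index(song)] = 1
```

Here `R[0]` is `[1, 0, 1, 0]`: alice has played song_a and song_c. It is this (much larger, much sparser) matrix that the factorization step approximates.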


Thanks to matrix factorization, we now have two types of vectors: one user vector X for each listener and one song vector Y for each track.

0 0 1 0 1 0 1 1 0 0 0 0     . 1 0 0 0 1 0 .      0 0 1 1 0 1 = X · ··· Y ··· (1)     0 1 0 0 0 0 .   . 1 0 1 1 0 0 0 0 0 1 1 1

The last step is to find the similarity between vectors in order to recommend music to listeners. To do so there are two methods [10]:
- User–user similarity: comparing the listener's vector with other users' vectors to find those who have similar tastes.
- Item–item similarity: comparing track vectors to find which one is closest to the music currently being listened to.

There is a second approach, called model-based: the goal is to predict the user's rating for missing items using machine learning models.
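The item–item similarity step can be sketched with cosine similarity between song vectors; the vectors below are made-up toy latent factors, not the output of a real factorization.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy song vectors (e.g. columns of Y after matrix factorization).
current = np.array([1.0, 0.5, 0.0])        # the track being played
catalogue = {
    "song_a": np.array([0.9, 0.6, 0.1]),
    "song_b": np.array([0.0, 0.1, 1.0]),
}
# Recommend the catalogue song whose vector is closest to the current one.
best = max(catalogue, key=lambda s: cosine_similarity(current, catalogue[s]))
```

With these values `song_a` wins by a wide margin; user–user similarity works the same way, with rows of X instead of columns of Y.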

The key advantage of the collaborative approach is that we do not need to analyse and extract features from the raw files, so there is no need to have the audio files, nor to have in-depth knowledge of music or physics. Moreover, it brings serendipity: the surprise effect the user experiences when given a relevant recommendation that they would not have found alone.

There are three major drawbacks. The first one is called cold start and designates two issues: the new user problem and the new item problem [11]. The former reflects the lack of user data needed to make a relevant recommendation, while the latter reflects the fact that we do not know who to recommend new items to. The next issue is scalability: a large number of users and items requires high computing resources. The last one is sparsity: because the number of items is large, one user can only rate a small subset of them [11].

2.2.2 Content-based approach

Content-based recommendation consists in the analysis of the content of the items that are candidates for recommendation. This approach aims to infer the user's preferences in order to recommend items that are similar in content to items they have previously liked. This method does not need any feedback from the listener; it is based only on sound similarity, which is deduced from the features extracted from the previously listened songs [8]. This method relies on the similarities between the different items. To estimate similarities, features are extracted to best describe the music. The Machine Learning algorithm then recommends the items closest to those that the user already likes. It is, therefore, necessary to create item profiles based on features extracted from the items. Moreover, this method requires user profiles based on both the users' preferences and their history on the platform. These profiles will be in the

following form: a list of weights (which reveal their importance) corresponding to each feature we have selected.

The main advantage of this approach is that an unknown piece of music is just as likely to be recommended as a currently popular one, or even a timeless one. This also allows new artists with few "views" to be brought up. Moreover, the problem of the cold start, and in particular of new items, is thus avoided: when new items are introduced into the system, they can be recommended directly, without requiring the integration time needed by recommendation systems based on a collaborative filtering approach.

The negative point is that this method limits the diversity of the recommendations; it tends to over-specialise. Moreover, the integration of a new user cannot be instantaneous: they have to listen to and evaluate a certain number of songs before being able to receive recommendations. This is the user cold start.
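The user profile described above, a list of weights over the selected features, can be sketched as the average of the feature vectors of previously liked songs, with candidate songs ranked by their distance to that profile; the feature names and all values are invented for illustration.

```python
import numpy as np

# Hypothetical per-song features: [tempo (normalised), energy, danceability].
liked = np.array([[0.8, 0.9, 0.7],
                  [0.7, 0.8, 0.9]])
# User profile: mean feature vector of the songs the user liked.
profile = liked.mean(axis=0)              # [0.75, 0.85, 0.8]

candidates = {"calm_song":   np.array([0.2, 0.1, 0.3]),
              "upbeat_song": np.array([0.7, 0.9, 0.8])}
# Recommend the candidate whose features are closest to the profile.
best = min(candidates, key=lambda s: np.linalg.norm(candidates[s] - profile))
```

This also shows the over-specialisation problem mentioned above: the "calm_song" can never win against songs resembling the existing profile.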

2.2.3 Context-based approach

Studies [12] have shown that the mood, activity, or even the location of a person influences the music they want to listen to. We listen to music at a given moment, in a predefined emotional state, and in established circumstances (party, work, ...), and these predispositions play a decisive role in the way we feel about the music. Although there are many possible applications [12] of this type of recommendation, such as tourist guide applications with adaptive ambient songs, there are not many concrete applications on this subject. Many barriers still block research in this field. The nature of the data to be taken into account is highly varied and depends on the environment (time, place, weather, culture, ...) or on the users themselves (motion speed, emotions, heart rate, device luminosity, ...). An even more significant issue is the lack of data available for research purposes. In the real world it is not easy to retrieve them either, as users do not always want to transmit that much information from their mobile phone sensors.

2.2.4 Hybrid approach

It is also possible to combine the previous, complementary methods to create a so-called hybrid recommendation system. It can also draw on other, lesser-known methods such as location-based recommendation. This approach can alleviate the problems of cold start and sparsity. Several implementations can be set up: the recommendation systems can be mixed into one; several systems can be kept separate and assigned weights, or switched between at will; finally, the results of one system can be used as input for the next one.

2.3 Models for content-based recommendation

During the state-of-the-art phase, the reading of numerous research papers showed that a variety of models can be used for recommendation. These models will be tested in this thesis and are therefore presented in this section. The models chosen are supervised machine learning algorithms. Machine learning is a type of artificial intelligence where an algorithm automatically modifies its behaviour in order to improve its performance on a task based on a

set of data. This process is called learning, since the algorithm is optimised from a set of observations and tries to extract statistical regularities from them. In supervised learning, the objective of the algorithm is to predict an explicit and known target t from the training data. The two most common types of targets are either continuous values, where t ∈ ℝ (regression problem), or discrete classes, where t ∈ {1, ..., N_C} for a problem with N_C classes (classification problem).

2.3.1 Logistic Regression

Logistic regression has proven its effectiveness in the field of music classification; although not the most efficient method [13], it has the advantage of being fast. Logistic regression is often used for multi-class classification. The goal is to find the optimal decision boundary in order to separate the different classes [14]. The easiest case is when there are only two classes (0 and 1); in that case, as logistic regression is a linear model, the score function can be written as follows:

S(X^(i)) = θ_0·x_0 + θ_1·x_1 + ... + θ_n·x_n  (2)

with:
- X^(i): an observation (from the training or test set), represented as a vector (x_1, x_2, ..., x_n)
- x_i: one of the valuable features of the predictive model
- θ_0: a constant called the bias
- θ_i: the weights (associated with the features) that have to be computed

It can be written more compactly by noting Θ the vector containing the components θ_0, θ_1, ..., θ_n and X the vector containing x_1, x_2, ..., x_n:

S(X) = ΘX (3)

Then the goal is to find coefficients θ_0, θ_1, ..., θ_n such that:
- S(X^(i)) > 0 if the sample is in the positive class (label 1)
- S(X^(i)) < 0 if the sample is in the negative class (label 0)

The sigmoid function (figure 2), sigmoid(x) = 1 / (1 + e^(−x)), is then applied to the score function, which allows us to obtain values between 0 and 1. The overall hypothesis function for logistic regression is therefore:

H(x) = Sigmoid(S(X)) = 1 / (1 + e^(−ΘX))  (4)
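Equations (3) and (4) can be checked numerically; the weights and the sample below are arbitrary toy values (with x_0 = 1 so that θ_0 acts as the bias).

```python
import numpy as np

def sigmoid(s: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# Toy weight vector Theta (theta_0 is the bias) and a sample X with x_0 = 1.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, 0.4])          # x_0 = 1 multiplies the bias

score = float(np.dot(theta, x))        # S(X) = Theta . X = -1 + 1.6 + 0.2 = 0.8
prob = sigmoid(score)                  # H(x) ~= 0.69, i.e. positive class
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of S(X) is the decision boundary.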


Figure 2: Sigmoid function [15]

Classifying music according to genre is a case of multi-class classification, and the algorithm commonly used with this model is One-Versus-All: it consists in splitting the problem into several binary classification sub-problems. First class 1 is separated from all the others, then class 2 from all the others, and so on.

Figure 3: Logistic Regression for multi-class classification [16]

To prevent overfitting, the l1 and l2 regularisation methods can be used to adjust the value of the weights ω_i:

- Lasso (Least Absolute Shrinkage and Selection Operator) Regression (l1) consists of adding a regularisation term to the loss function:

  L(x, y) = Σ_{i=1}^{n} (y_i − f(x_i))² + λ Σ_{i=1}^{n} |ω_i|  (5)

  Lasso has multiple solutions and tends to shrink the less important features to zero, so it is particularly effective when there is a large number of features and the most important ones need to be selected.

- Ridge Regression (l2):

  L(x, y) = Σ_{i=1}^{n} (y_i − f(x_i))² + λ Σ_{i=1}^{n} ω_i²  (6)


Ridge regression has only one solution; it does not reduce the number of features but rather the impact of each feature.
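The difference between the penalty terms in equations (5) and (6) can be computed directly; the weight vector and λ below are arbitrary toy values.

```python
import numpy as np

weights = np.array([0.5, -2.0, 0.0, 1.5])   # toy model weights omega_i
lam = 0.1                                   # regularisation strength lambda

# Lasso (l1) penalty: lambda times the sum of absolute weights.
l1_penalty = lam * np.sum(np.abs(weights))  # 0.1 * 4.0 = 0.4
# Ridge (l2) penalty: lambda times the sum of squared weights.
l2_penalty = lam * np.sum(weights ** 2)     # 0.1 * 6.5 = 0.65
```

Note how the l2 term punishes the large weight (−2.0) disproportionately (4.0 of the 6.5), whereas the l1 term treats all magnitudes linearly, which is what drives small weights to exactly zero under Lasso.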

2.3.2 Decision Trees

The paper [17] shows that decision trees can be efficient in the classification of Latin music. They will therefore be studied in this thesis, along with ensemble methods which can potentially improve our models. With a decision tree [18], one can classify items by repeatedly separating them parameter by parameter. A decision tree is composed of three main types of elements: the nodes are the tests performed on the attributes; the edges are the results of the tests and connect one level of nodes to the next; and the leaf nodes are the last nodes of the tree and represent the final classes. There are two types of decision trees, regression trees and classification trees: the former can take continuous values as targets, for example to predict the price of a house, while the latter is composed of Yes/No questions and its targets are discrete/categorical. Training is an iterative process that consists of dividing the data into partitions and then distributing them to each of the branches.

The algorithm used to train decision trees for classification is Divide-and-Conquer; it splits the dataset into subsets at each node. The principle is to select a test for the first node, called the root node, which splits the set into two sub-parts by maximising the Information Gain or, equivalently, minimising the Gini Impurity. This action is then repeated recursively until every branch contains only instances of the same class. To avoid overfitting, a depth limit can be set. In order to define Information Gain and Gini Impurity, the concept of entropy must be specified. Entropy is a measure of the impurity, disorder, or uncertainty in a set of examples. For a dataset with C classes, with p_i being the proportion of elements of class i in the dataset:

E = - \sum_{i=1}^{C} p_i \log_2 p_i \qquad (7)

Information Gain (IG) measures how much "information" a feature gives us about the class; it can be computed as follows:

IG = Entropy(parent) − [weighted average] ∗ Entropy[children] (8)

Gini Impurity is the probability of incorrectly classifying a randomly cho- sen element in the dataset if it were randomly labelled according to the class distribution in the dataset:

G = \sum_{i=1}^{C} p_i (1 - p_i) \qquad (9)
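The two impurity measures above, Eqs. (7) and (9), can be computed directly from the class proportions (a minimal numpy sketch, not the thesis code):

```python
import numpy as np

def entropy(p):
    """Entropy E = -sum(p_i * log2(p_i)) over class proportions, Eq. (7)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log2(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

def gini(p):
    """Gini impurity G = sum(p_i * (1 - p_i)), Eq. (9)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

# A pure node has zero impurity; a uniform 2-class split is maximally impure.
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
print(entropy([1.0]))       # 0.0
```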

2.3.3 Bagging: Random Forest

Random Forest is an ensemble learning method. According to James Surowiecki, the crowd is "wiser than any individual", which means that thanks to

their diversity, independence, decentralisation and aggregation, a combination of multiple classifiers gives a better one. For optimal results, classifiers with high variance and low bias should be grouped together in order to reduce the overall variance while maintaining a low bias. In decision trees, high variance means boundaries that are highly dependent on the training set; low bias means boundaries that are close on average to the true boundary. Random Forest is a special case of Bagging (bootstrap aggregating); the aim of such methods is to reduce the variance introduced by a single tree and thus reduce the forecasting error. To predict the result, the Random Forest algorithm averages the forecasts of several independent models (in the context of a classification it predicts the most frequent category). To build these models, several bootstrap replicates of the training set are created by sampling with replacement, so all the models can be trained in parallel. The algorithm not only uses Bagging but also randomises the features considered at each node.
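The two sources of randomness described above, bootstrap replicates and per-node feature subsets, can be sketched as follows (an illustrative numpy sketch; the sqrt(n_features) subset size is a common heuristic assumed here, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicates(n_samples, n_trees):
    """One bootstrap replicate (sampling with replacement) per tree;
    the trees can then be trained in parallel, each on its own replicate."""
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_trees)]

def random_feature_subset(n_features):
    """Random Forest also randomises the features considered at each node;
    sqrt(n_features) is a common (assumed) choice for the subset size."""
    return rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

replicates = bootstrap_replicates(n_samples=100, n_trees=5)
subset = random_feature_subset(n_features=36)
# Each replicate keeps the training-set size but repeats some items and
# omits others, which is what decorrelates the individual trees.
```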

2.3.4 Boosting: Adaboost

Adaboost is one of the most used boosting algorithms and also builds on decision trees. Multiple binary classifiers are taken, each based on one feature, which gives a set of weak classifiers. The Adaboost principle is based on the assumption that a set of weak classifiers can give a strong one (figure 4). The idea is to loop over the classifiers with weighted samples, and when a sample is incorrectly classified its weight is increased. Specifically, the steps are:
1. Initialise weights: at the beginning they are uniform
2. Train one decision tree

3. Compute the weighted error rate e: count how many items (taking their weights into account) are misclassified
4. Compute the decision tree's weight depending on its error rate:

W_tree = learning\_rate \cdot \log\left(\frac{1 - e}{e}\right) \qquad (10)

5. Update the weights of misclassified items:

W_{item,new} = W_{item,old} \cdot \exp(W_{tree}) \qquad (11)

6. Repeat steps 2 to 5 for each tree
7. Final decision:

Pred_{final} = \sum_{t \in trees} W_t \cdot Pred(t) \qquad (12)

This means that each model is trained in a sequential way and learns from mistakes made by the previous models. While Random Forest aims to decrease variance and not bias, Adaboost aims to decrease bias but not variance.
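The weight-update steps, Eqs. (10) and (11), can be sketched on a toy example (a minimal numpy illustration, not the thesis code; the renormalisation of the item weights is an assumption, as it is standard in AdaBoost):

```python
import numpy as np

def tree_weight(error_rate, learning_rate=1.0):
    """Eq. (10): W_tree = learning_rate * log((1 - e) / e)."""
    return learning_rate * np.log((1 - error_rate) / error_rate)

def update_item_weights(weights, misclassified, w_tree):
    """Eq. (11): only misclassified items get their weight increased;
    the weights are then renormalised to sum to 1 (assumed convention)."""
    weights = weights.copy()
    weights[misclassified] *= np.exp(w_tree)
    return weights / weights.sum()

w = np.full(4, 0.25)                    # step 1: uniform weights
w_tree = tree_weight(0.2)               # step 4: e = 0.2 gives log(4)
w = update_item_weights(w, np.array([False, True, False, False]), w_tree)
# The misclassified item now dominates the training of the next tree.
```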


Figure 4: Adaboost classifier [19]

2.3.5 k-Nearest Neighbours

This method makes predictions based on the entire dataset. When a new value must be predicted, the algorithm looks for the K instances of the set closest to it, then uses the output values of these K nearest neighbours to compute the value of the variable to be predicted. [20] The parameter K has to be determined: a sufficiently high value is needed to avoid overfitting on noisy samples, but if the value is too high the model underfits and generalises poorly on unseen data; a compromise has to be found. For each new item i, the first step is to calculate its distance d to all the other values of the dataset and retain the K items for which the distance is minimal. [21] Then the optimal K is used to make the prediction: in the case of a regression, the next step is to calculate the mean (or median) of the output values of the selected K neighbours; in the case of a classification, it is to retrieve the most represented class among the K neighbours. For distance determination various formulas are available, as long as they satisfy the criteria of non-negativity, identity, symmetry and triangle inequality. Those commonly used are:
• Hamming distance: for two equal-length strings, this is the number of positions at which the characters differ.

• Manhattan distance:

d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (13)

• Euclidean distance:

d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (14)


• Minkowski distance:

d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \quad p \geq 1 \qquad (15)

• Tchebychev distance:

d_t(x, y) = \max_{i=1..n} |x_i - y_i| \qquad (16)

It is a method that has the advantage of being simple, transparent and intuitive while giving reliable results, but it is sensitive to redundant or useless features. [20] k-NN seems to give good results (an accuracy of 91%) when it comes to music classification, even when using MFCC features (defined in the features subsection, 2.4). [22]
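The distances of Eqs. (13)-(16) can be implemented directly (a minimal numpy sketch, not the thesis code):

```python
import numpy as np

def manhattan(x, y):               # Eq. (13)
    return float(np.sum(np.abs(x - y)))

def euclidean(x, y):               # Eq. (14)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def minkowski(x, y, p):            # Eq. (15), p >= 1
    return float(np.sum(np.abs(x - y) ** p) ** (1 / p))

def tchebychev(x, y):              # Eq. (16)
    return float(np.max(np.abs(x - y)))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(manhattan(x, y))    # 7.0
print(euclidean(x, y))    # 5.0
print(tchebychev(x, y))   # 4.0
# Minkowski generalises both: p = 1 gives Manhattan, p = 2 gives Euclidean.
```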

2.3.6 Support Vector Machine

Support Vector Machines are widely used for music classification and give good results, but the features used are often restricted to MFCCs and some rhythmic features. One approach uses several successive SVMs to improve the results. [23] The basic principle is to separate two groups of data while maximising the margin around the border (the distance between the two classes). It is based on the idea that almost everything becomes linearly separable when represented in a high-dimensional space. The two steps are thus: transforming the input into a suitable high-dimensional space, and then finding the hyperplane that separates the data while maximising the margins. In practice, kernel functions are used to reap the benefits of a high-dimensional space without actually representing anything in it; indeed, the only operation done in the high-dimensional space is the computation of scalar products between pairs of items. The commonly used kernels are:
• Linear:

K(x, y) = x \cdot y \qquad (17)

• Polynomial:

K(x, y) = (x^T y + 1)^p \qquad (18)

• Radial basis:

K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\rho^2} \right) \qquad (19)

In some cases, it is possible to accept some outliers inside the margin in order to be able to separate the data. Although SVMs were created to deal with binary problems, there are two ways to adapt them to multi-class problems:
• One-versus-all: it consists in transforming a C-class classification problem into C binary classification problems, each using a single separator. The ranking is given by the classifier that fits best.

• One-versus-one: k(k−1)/2 binary classifiers are trained this time; the idea is that each class Ci is compared to every other class Cj ≠ Ci. The final ranking is given by majority vote.
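The three kernels of Eqs. (17)-(19) can be sketched directly (a minimal numpy illustration, not the thesis code):

```python
import numpy as np

def linear_kernel(x, y):                 # Eq. (17)
    return float(np.dot(x, y))

def polynomial_kernel(x, y, p=2):        # Eq. (18)
    return float((np.dot(x, y) + 1) ** p)

def rbf_kernel(x, y, rho=1.0):           # Eq. (19)
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * rho ** 2)))

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(linear_kernel(x, y))       # 1.0
print(polynomial_kernel(x, y))   # 4.0
print(rbf_kernel(x, x))          # 1.0: identical points have maximal similarity
```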


2.3.7 Naive Bayes

Bayes classifiers are based on a probabilistic approach employing Bayes' theorem; it gives the probability for an item to be in the class Ci knowing that the item has a set of features x = (x_1, ..., x_F).

P(C_i|x) = \frac{P(x|C_i) P(C_i)}{P(x)} = \frac{P(x|C_i) P(C_i)}{\sum_j P(x|C_j) P(C_j)} \qquad (20)

P(Ci) is called the prior, P(Ci|x) the posterior, P(x|Ci) the likelihood, and P(x) the evidence. The result to be computed is often based on several variables. Since the computation is complex, one type of classifier often used is the Naive Bayes Classifier [24]: it is assumed that these variables are independent. This is a strong assumption, which is why the word "naive" is used.
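Eq. (20) can be evaluated on a toy two-genre example (a minimal numpy sketch with hypothetical priors and likelihoods, not the thesis code):

```python
import numpy as np

def posteriors(priors, likelihoods):
    """Eq. (20): P(Ci|x) = P(x|Ci) P(Ci) / sum_j P(x|Cj) P(Cj)."""
    joint = np.asarray(priors) * np.asarray(likelihoods)
    return joint / joint.sum()  # the denominator is the evidence P(x)

# Hypothetical priors P(Ci) and likelihoods P(x|Ci) for two classes
p = posteriors(priors=[0.7, 0.3], likelihoods=[0.1, 0.5])
print(p)  # the second class wins despite its lower prior
```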

2.3.8 Linear Discriminant Analysis

This method is used to predict which predefined class an item belongs to, based on its characteristics measured through predictive variables. It was first introduced by Fisher in [25]. It achieved 71% accuracy on GTZAN [26]. Linear Discriminant Analysis is a dimensionality reduction technique, which means that it aims to reduce the number of dimensions (i.e. features) in the dataset while keeping as much relevant information as possible. It uses information from every feature in order to create a new axis and projects the data on this axis while maximising the distance between classes. For that purpose, the initial step is to compute the between class variance, which is the level of separability between classes (i.e. the distance between the means of the different classes). Then, the distance between the mean and the samples of each class must be computed; this metric is called the within class variance. Finally, the last stage is to construct the lower dimensional space which maximises the between class variance while minimising the within class variance.
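The two quantities above can be made concrete for a single feature: the ratio of between-class to within-class variance is exactly what LDA maximises when choosing its projection axis (an illustrative numpy sketch, not the thesis code):

```python
import numpy as np

def scatter_ratio(X, y):
    """Between-class over within-class variance for 1-D data; the direction
    found by LDA maximises this kind of ratio in the projected space."""
    X, y = np.asarray(X, float), np.asarray(y)
    mu = X.mean()
    between = within = 0.0
    for c in np.unique(y):
        xc = X[y == c]
        between += len(xc) * (xc.mean() - mu) ** 2   # class mean vs overall mean
        within += ((xc - xc.mean()) ** 2).sum()      # samples vs their class mean
    return between / within

# Two well-separated classes give a large ratio
X = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
y = [0, 0, 0, 1, 1, 1]
print(scatter_ratio(X, y))  # ≈ 937.5
```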

2.3.9 Neural Networks

Neural networks of all types are also widely used, including feed-forward neural networks. Results based only on high-level features give an overall accuracy of 85%. [27] A neural network (see figure 5) is a system whose architecture is inspired by the functioning of biological neurons, although nowadays it is getting closer and closer to mathematical and statistical methods.


Figure 5: Illustration of an artificial neural network [28]

A formal neuron is the elementary unit of an artificial neural network. When receiving signals from other neurons in the network, a formal neuron responds by producing an output signal which is transmitted to other neurons in the network. The signal received is a weighted sum of signals from different neurons. The final output signal is a function of this weighted sum:

y_j = f\left( \sum_{i=1}^{N} w_{i,j} x_i \right) \qquad (21)

y_j is the output of the formal neuron j. x_i for i ∈ {1, ..., N} are the signals received by neuron j from neurons i. w_{i,j} are the weights of the interconnections between neurons i and j. f, called the activation function, gives the output value; usually the identity, sigmoid, or hyperbolic tangent functions are used. For multi-class classification purposes, the simplest neural network is the Multi-Layer Perceptron (MLP) [29]. It is a network that contains several fully connected layers (each with several units). The training method, called gradient backpropagation, is used to find the weight values for each neuron that are most relevant for the subsequent classification. There are three types of neurons: the input cells are associated with the data (one for each input feature), the output neurons are each associated with a class, and the hidden neurons are in the intermediate layers. For very deep neural networks several problems appear. The first one concerns time and computing power, which very quickly become overwhelming. The training algorithm also has trouble working correctly; indeed, it often faces exploding or vanishing gradient issues. There are different ways, called regularisation methods, to deal with overfitting:

• Dropout: it consists in deactivating a percentage of units of a particular layer during training; more precisely, at each training step, neurons are either kept with probability p or dropped out with probability 1−p. This improves generalisation since it forces the layer to learn the same concept with different neurons. This method is commonly applied to fully connected layers.

15 2 BACKGROUND – 2.4 Features for content-based recommendation

Figure 6: Without/with dropout [30]

• Early stopping: the idea is to stop the training when the system starts to overfit, i.e. when the validation accuracy starts to decrease (figure 7). In order to achieve this, a validation set must be created; it allows the model to be tested at each epoch and the training to be stopped as soon as the validation accuracy decreases and overfitting appears.

Figure 7: Overfitting [31]
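The formal neuron of Eq. (21) and the dropout mechanism described above can be sketched in a few lines (a simplified numpy illustration assuming a sigmoid activation and inverted-dropout scaling; not the network used in the thesis):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron_output(x, w):
    """Eq. (21): y_j = f(sum_i w_ij * x_i), here with a sigmoid activation."""
    return sigmoid(np.dot(w, x))

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.8):
    """Training-time dropout: keep each unit with probability p;
    inverted scaling (an assumed convention) keeps the expected activation."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

x = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
print(neuron_output(x, w))  # sigmoid(0) = 0.5
h = dropout(np.ones(1000))
# Roughly 20% of the units are zeroed; the survivors are scaled by 1/p.
```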

2.4 Features for content-based recommendation

The features that can be extracted from music, as mentioned in the papers read to carry out this thesis, are numerous and can be classified according to their level (low, middle, high). There are several representations of music. From a physical point of view, sound is a wave, i.e. an oscillation of pressure, which is generally transmitted through the ambient air. Sound is therefore a superposition of sound waves of different frequencies with different characteristics such as amplitude and phase. It is mainly from this representation that the features can be extracted. While some features use the signal in the time domain, others focus on its frequency shape. In fact, the discrete Fourier transform (DFT) can be used to decompose a digital time signal into its sinusoidal components, and thus pass it into the frequency domain.


2.4.1 Low-level features

Low-level features are those that can be computed immediately from the raw audio file using statistical, signal processing and mathematical methods. They can be grouped according to their nature: temporal, spectral, energetic, or perceptual. An audio signal is constantly changing, which is why the first step is to split the signal into short frames; this allows the hypothesis that the signal is statistically stationary within each frame. Usually the signal is framed into 20-40 ms spans: if shorter, the segment would not be long enough to give a reliable result, and if longer, the signal changes too much. Initially, the focus is on temporal features:
• Zero-crossing rate: it corresponds to the number of crossings between the signal and the zero axis within a given time frame. A high value is characteristic of a noisy sound while a low value indicates a periodic signal. [32] It is mainly used in music information retrieval to catch noise and percussive sounds. [7] As this value tends to be higher for percussive sounds, it helps, for example, to differentiate rock from metal. It can be computed for the frame t, K being the frame size, as follows: [7]

CR_t = \frac{1}{2} \sum_{k=t \cdot K}^{(t+1) \cdot K - 1} | \mathrm{sign}(s(k)) - \mathrm{sign}(s(k+1)) | \qquad (22)

where sign(s(k)) = 1 if s(k) ≥ 0, and −1 if s(k) < 0.

• Amplitude envelope: It computes the maximum amplitude among all samples for a frame t: [7]

AE_t = \max_{k=t \cdot K}^{(t+1) \cdot K - 1} s(k) \qquad (23)

• Root-mean-square energy: This feature is correlated to the perception of sound intensity, so it can be used to evaluate loudness. A low energy is particularly representative of classical music. [7] On one frame t:

RMS_t = \sqrt{ \frac{1}{K} \sum_{k=t \cdot K}^{(t+1) \cdot K - 1} s(k)^2 } \qquad (24)
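The three temporal features, Eqs. (22)-(24), translate directly into code (a minimal numpy sketch on a toy frame, not the thesis extraction pipeline):

```python
import numpy as np

def frames(s, K):
    """Split a signal into non-overlapping frames of K samples each."""
    n = len(s) // K
    return s[: n * K].reshape(n, K)

def zero_crossing_rate(frame):   # Eq. (22)
    sgn = np.where(frame >= 0, 1, -1)
    return 0.5 * float(np.sum(np.abs(np.diff(sgn))))

def amplitude_envelope(frame):   # Eq. (23)
    return float(np.max(frame))

def rms_energy(frame):           # Eq. (24)
    return float(np.sqrt(np.mean(frame ** 2)))

frame = np.array([0.5, -0.5, 0.5, -0.5, 0.5])  # toy frame alternating in sign
print(zero_crossing_rate(frame))   # 4.0: the sign changes four times
print(amplitude_envelope(frame))   # 0.5
print(rms_energy(frame))           # 0.5
fr = frames(np.arange(12.0), 4)    # 12 samples -> 3 frames of K = 4
```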

Spectral features are defined as follows: • Spectral centroid: It indicates the location of the centre of mass / barycentre of the spectrum, it represents the band where most of the energy is. [33] It is calculated as the weighted mean of the frequencies in the sound. Low Spectral Centroid usually corresponds to classical music, especially those with only piano. Other music tends to have Spectral Centroids that vary much more. [7]


In order to compute it, the spectrum is considered as a distribution: the values are the frequencies and the probabilities to observe them are nor- malised in amplitude. [32]

\mu = \int x \cdot p(x) \, dx \qquad (25)

where x are the observed data (x = freq_s(x)) and p(x) is the probability to observe x:

p(x) = \frac{ampl_s(x)}{\sum_x ampl_s(x)}

It is also possible to compute it in the following way:

SC_t = \frac{\sum_{n=1}^{N} m_t(n) \cdot n}{\sum_{n=1}^{N} m_t(n)}

• Spectral spread - Bandwidth: The spread (= bandwidth) can be defined as the variance of the distribution; it indicates how spread the spectrum is around its mean value. [32]

\sigma^2 = \int (x - \mu)^2 \cdot p(x) \, dx \qquad (26)

It is also possible to compute it in the following way:

SS_t = \frac{\sum_{n=1}^{N} m_t(n) \cdot |n - SC_t|}{\sum_{n=1}^{N} m_t(n)}

• Spectral skewness: It shows how asymmetric a distribution is around its mean value. [32] A value of 0 characterises a symmetric distribution; a higher value denotes a concentration of energy on the left, while a lower value denotes more energy on the right.

\gamma_1 = \frac{m_3}{\sigma^3} \qquad (27)

where m_3 = \int (x - \mu)^3 \cdot p(x) \, dx

• Spectral Kurtosis: It reveals how flat the distribution is around its mean value. [32] The value for a normal distribution is 3; higher means more peaked, lower means flatter.

\gamma_2 = \frac{m_4}{\sigma^4} \qquad (28)

where m_4 = \int (x - \mu)^4 \cdot p(x) \, dx


• Spectral roll-off frequency: It corresponds to the frequency value fc such that a percentage (e.g. 95%) of the signal energy is contained below this value. [32] By noting sr/2 the Nyquist frequency:

\sum_{f=0}^{f_c} a^2(f) = 0.95 \sum_{f=0}^{sr/2} a^2(f) \qquad (29)

• Band energy ratio: This indicator measures the extent to which low frequencies dominate high frequencies. It is calculated by selecting a limit value F called the split frequency. [7]

BER_t = \frac{\sum_{n=1}^{F-1} m_t(n)^2}{\sum_{n=F}^{N} m_t(n)^2} \qquad (30)

• Spectral Flux: It depicts the power change between two consecutive frames.

F_t = \sum_{n=1}^{N} (D_t(n) - D_{t-1}(n))^2 \qquad (31)

• Spectral Slope: This indicator quantifies the amplitude decay of the spectrum, it is calcu- lated by linear regression, it is thus of the following form: [32]

\hat{a} = slope \cdot f + const \qquad (32)

• Spectral Decrease: It also quantifies the amplitude decay of the spectrum but the method of calculation is more based on the perceptual part: [32]

decrease = \frac{1}{\sum_{k=2}^{K} s(k)} \sum_{k=2}^{K} \frac{s(k) - s(1)}{k - 1} \qquad (33)

• Spectral Flatness: The flatness reveals how close a sound is to white noise; a flat power spectrum (high flatness value) corresponds to white noise. It is expressed as the ratio of the geometric mean to the arithmetic mean of the power spectrum: [32]

flatness = \frac{\left( \prod_{n \in band} m_t(n) \right)^{1/K}}{\frac{1}{K} \sum_{n \in band} m_t(n)} \qquad (34)
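Two of the spectral features above can be sketched on a toy magnitude spectrum (a minimal numpy illustration using the discrete forms of the centroid and of Eq. (29); not the thesis code):

```python
import numpy as np

def spectral_centroid(magnitudes):
    """Discrete spectral centroid: SC = sum(n * m(n)) / sum(m(n)) over bins."""
    n = np.arange(1, len(magnitudes) + 1)
    return float(np.sum(n * magnitudes) / np.sum(magnitudes))

def spectral_rolloff(magnitudes, fraction=0.95):
    """Eq. (29): smallest bin below which `fraction` of the energy lies."""
    energy = np.cumsum(np.asarray(magnitudes) ** 2)
    return int(np.searchsorted(energy, fraction * energy[-1]) + 1)

m = np.array([0.0, 1.0, 4.0, 1.0, 0.0])  # energy concentrated around bin 3
print(spectral_centroid(m))  # 3.0: the centre of mass of the spectrum
print(spectral_rolloff(m))   # 4: 95% of the energy lies below bin 4
```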

The signal energy can also be taken into account:
• Global energy: It estimates the signal power at a given time. [32]


• Harmonic energy: It estimates the power of the harmonic part of the signal at a given time. [32]
• Noise energy: It estimates the power of the noise part of the signal at a given time. [32]

This last part defines the psycho-acoustic features. Their purpose is to characterise and model the hearing system. They make it possible both to evaluate values as perceived by humans and to predict the discomfort or annoyance that may be caused by certain sounds.

• MFCC: The Mel Frequency Cepstral Coefficients were introduced for speech and speaker recognition and turned out to be powerful for describing the power spectrum of an audio signal. Human vocal sounds are filtered by the shape of the vocal apparatus (mouth, tongue, teeth, ...). Determining the shape of the sound with precision should therefore give an exact representation of the phenomenon produced and of the way it is perceived. Different works proved that MFCCs can also be useful in the field of music similarity. [34]

Figure 8: Process to compute MFCC [35]

We then apply Hamming windowing to each frame in order to reduce the edge effects. [36]

The next step is simply to convert the signal into the frequency domain. For this we use the Fast Fourier Transform method to obtain the desired periodogram. This is motivated by the functioning of the human cochlea, which has the particularity of vibrating according to the frequency of the sound heard. More precisely, depending on the exact location of the vi- brating cochlea (detected by small hairs), the nerves transmit to the brain which frequencies are present.

As the cochlea cannot discern differences between close frequencies, seg- ments are summed to determine the amount of energy present in each frequency region. For this we use Mel filterbank, the first filter is very thin and indicates the concentrated energy close to 0 Hz. The more fre- quencies increase, the wider the filters become since the variations matter less.

The final step consists in computing the discrete cosine transform (DCT)


of the filter energies. The filters all overlap, so the goal of this step is to decorrelate the energies from each other.

Finally, as humans can discern small variations in pitch more easily in low frequencies than in high frequencies, the Mel scale is more suitable. Usually 13 coefficients are finally kept for each frame.

Frequencies to Mel: [7]

M(f) = 1125 ln(1 + f/700) (35)
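Eq. (35) and its inverse are easy to implement; the inverse is what places the Mel filterbank edges (a minimal numpy sketch; the 0-8000 Hz range below is an assumed example, not from the thesis):

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (35): M(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place the Mel filterbank edges in Hz."""
    return 700.0 * (np.exp(np.asarray(m) / 1125.0) - 1.0)

print(hz_to_mel(700.0))  # 1125 * ln(2) ≈ 779.79
# Filter edges spaced linearly in Mel become increasingly wide in Hz,
# matching the cochlea's coarser resolution at high frequencies:
edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 6))
```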

• Loudness: This is the first of the three sound quality descriptors: perceptual and subjective metrics that are often used to assess the noise nuisance caused by products or worksites. [37] Its value is non-linear and represents the sound volume as perceived by the human ear; it is an intensity sensation. [33] Generally, the human ear focuses on frequencies between 2000 and 5000 Hz, but this varies with age, population, culture, ... This is why songs that have the same sound pressure, physically measured in decibels (dB), but whose frequencies fall outside this range are perceived as softer by the human ear. [6] The loudness N is computed with the Zwicker and Stevens model: [37]

N = \int_{0\,Bark}^{24\,Bark} N'(z) \, dz \qquad (36)

N' is the specific loudness, i.e. the loudness density as a function of the critical band rate, measured in sone/Bark; so N'(z) is the loudness in the z-th Bark band. N, the loudness, is a value in sone and corresponds to the sound volume. One sone represents the perception of a sound volume equivalent to that of a pure 1 kHz tone at a pressure level of 40 dB; thus two sones correspond to a sound twice as intense as one sone for the average listener.
• Perceptual Sharpness: It is an indicator of the perception of a noise as high-pitched. [33] It is the perceptual equivalent of the spectral centroid and is therefore calculated from the loudness. [32] Low sharpness corresponds to "dull sounds" while high sharpness corresponds to "screeching sounds". Generally, listeners prefer dull sounds, but an extremely low value can also be annoying. One possible model, called Aures, is computed as follows: [37]

S = c \cdot \int_{0\,Bark}^{24\,Bark} \frac{N'(z) \, g_s(z)}{\ln\left(\frac{N + 20}{20}\right)} \, dz \qquad (37)

c is a correction factor, g_s(z) is the weighting function for sharpness, and S is measured in acum.


• Perceptual Spread: It is a measure of the distance between the largest specific loudness and the total loudness: [32]

S_d = \left( \frac{N - \max_z N'(z)}{N} \right)^2 \qquad (38)

• Perceptual Roughness: It evaluates the perception of time envelope modulations for frequencies between 20 and 150 Hz, maximum at 70 Hz (low and middle frequency variations). It allows to quantify the rapid variations that can be perceived as dissonant for the listener. [33] As for the loudness: [38]

R = \int_{0\,Bark}^{24\,Bark} R'(z) \, dz \qquad (39)

where R' is the specific roughness.

2.4.2 Middle-level features

Middle-level features focus on aspects that are meaningful musically and are understandable by a music expert. The first ones are focused on the harmony and melody of the music. Harmony is defined as the combined use of different pitch values and chords in music; it is called the vertical part of music. Melody is the horizontal part; it describes a sequence of pitched events that are perceived as a whole. [39] Different features allow information on harmony and melody to be extracted:
• Pitch: The pitch is related to the fundamental frequency, i.e. the frequency whose integer multiples best fit the spectral content of a signal. [40] It is used to qualify sounds as "high" or "low" in the sense associated with musical melodies. To estimate the pitch, the so-called tuning system is often estimated; it defines the tones (the choice of number and spacing of frequency values) used in the music.

• Tonality / Modality: It outlines the relationship between simultaneous and consecutive tones. [40] It indicates whether the mode of the track is major or minor. The following focus on the temporal and rhythmic properties of music:

• Duration of the track: The duration of a given music is a simple element to extract that can help us to classify music. • Onset events: Onset detection is about finding the temporal position of all sonic events in a piece of music. • Metrical levels: Metrical levels correspond to the different levels of embedded impulses


present in a piece of music; generally, higher metrical levels are multiples of lower ones. [40] The lowest level, called tatum, corresponds to the shortest durational values. The one that the listener would describe as "most important" is called tactus; it corresponds to foot tapping, or what is commonly called the beat. The tactus enables the definition of the tempo, which is the rate of the tactus pulse. [41]
• Beat: The beat is the fundamental unit of time. Usually it is between 40 and 200 beats per minute. [40]
• Rhythm: The rhythm also describes a pattern repeated in time, over longer periods than that of the beats. [40]

2.4.3 High-level features

High-level features are the ones that can be understood by any listener; they describe music as it is perceived by humans. As they require interpretation, they sometimes seem intuitive, but they are complex to extract reliably, and care must be taken since these features are not always relevant. Moreover, most of them are classified as "trade secrets" and are held by The Echo Nest (owned by Spotify), among others.

• Danceability: This parameter estimates the ability of music to make people dance. It usually takes values between 0 and 3; the higher the value, the more danceable the music is. [42] One way to calculate it could be based on the velocity v at each sample time t and the tempo of the music: [43]

D = tempo \cdot \sum_t v(t) \qquad (40)

• Liveness: It consists in determining whether or not an audience is present while recording. • Speechiness: The predominance of voices in a music makes it possible to differentiate for example slam / rap which will have very high values from jazz / classical where the values will be very low. [43] • Instrumentalness: This feature contrasts with the previous one, a strong instrumentalness value corresponds to a strong domination of the instruments. [43] • Instruments and Singer: Knowing the instruments present, and knowing if there is a singer as well as if it is a man or a woman can help to recommend the best music.

23 2 BACKGROUND – 2.5 Features selection algorithms

• Valence: The valence characterises the mood of a piece of music: a high value corresponds to joyful, lively music while a low value indicates sad, low-energy, or even depressing music.
• Lyrics: The mood of the music can also be determined through the lyrics. Natural Language Processing (NLP) is used to extract information from them; this machine learning method is used to analyse texts and extract relevant information. The first step, after retrieving the lyrics (from websites like http://www.lyrics.com) in text form, is a preprocessing step in which punctuation and stopwords ('now', 'how', 'I', 'they', ...) are removed. The words are then vectorised to extract redundant topics for each genre.

2.5 Features selection algorithms

Many features can be extracted from audio files. The task is to eliminate those that are irrelevant or less significant and would increase the complexity of the model as well as the computation time while making predictions less reliable. Feature selection is usually defined as a process of searching for a "relevant" subset of features. The selection algorithms used to evaluate a subset of features can be classified into three main categories: filter, wrapper and embedded.

2.5.1 Filter model

The aim is to assess the relevance of a feature based on measures that rely on the properties of the learning data. It is a preprocessing step that filters the features before performing the actual classification.

Let X = {xk|xk = (xk,1, xk,2, ..., xk,n), k = 1, 2, ..., m} be a set of m training values. Let Y = {yk, k = 1, 2, ..., m} be the labels of training values. To determine the relevance of a feature, there are several evaluation criteria.

• Correlation criteria: it is used in the case of a binary classification, µi and µy represent respectively the mean values of the feature i and its labels: [44]

C(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \mu_i)(y_k - \mu_y)}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \mu_i)^2 \sum_{k=1}^{m} (y_k - \mu_y)^2}} \qquad (41)

• Fisher criteria: measures the degree of separability of the classes using a given feature. n_c, µ_c^i and σ_c^i represent respectively the number of samples, the average and the standard deviation of the i-th feature within class c. µ^i is the overall average of the i-th feature. [45]

F(i) = \frac{\sum_{c=1}^{C} n_c (\mu_c^i - \mu^i)^2}{\sum_{c=1}^{C} n_c (\sigma_c^i)^2} \qquad (42)


• Mutual Information: measures the dependence between the distributions of two populations:

I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log\left( \frac{P(X = x_i, Y = y)}{P(X = x_i) P(Y = y)} \right) \qquad (43)

• Signal-to-Noise Ratio coefficient: similar to the Fisher criterion, it is a score that measures the discriminatory power of a feature between two classes:

SNR(i) = \frac{2 \cdot |\mu_{C_1}^i - \mu_{C_2}^i|}{\sigma_{C_1}^i + \sigma_{C_2}^i} \qquad (44)

This filtering method is efficient and robust against overfitting. However, since it does not take interactions between features into account, it tends to select features carrying redundant rather than complementary information. Moreover, this method does not take into account the performance of the classification method that will be applied once the selection is made.
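As an illustration of a filter criterion, the Fisher score of Eq. (42) can be computed for a single feature (a minimal numpy sketch on toy data, not the thesis code):

```python
import numpy as np

def fisher_score(x, y):
    """Eq. (42): between-class over within-class variance for one feature."""
    x, y = np.asarray(x, float), np.asarray(y)
    mu = x.mean()
    num = den = 0.0
    for c in np.unique(y):
        xc = x[y == c]
        num += len(xc) * (xc.mean() - mu) ** 2
        den += len(xc) * xc.var()            # n_c * (sigma_c^i)^2
    return num / den

# A feature that separates the classes well gets a much higher score
x_good = [0.0, 0.2, 5.0, 5.2]
x_bad = [0.0, 5.0, 0.2, 5.2]
y = [0, 0, 1, 1]
print(fisher_score(x_good, y) > fisher_score(x_bad, y))  # True
```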

2.5.2 Wrapper model

The wrapper method was introduced by Kohavi and John [46]. In this case, the evaluation is done using a classifier that estimates the relevance of a given subset of features. This is why the subset of features selected by this method matches the classification algorithm used, but the subset is not necessarily valid if the classifier is changed. The most common implementations of the wrapper approach are:
• Forward selection: start with no features and add the most relevant one at each step:
1. Choose the significance level α (e.g. 0.05)
2. Select the feature that fits the model with the lowest p-value
3. If p-value < α, add the feature to the feature set and go back to step 2; else stop the process.
• Backward elimination: start with every feature, and remove the most insignificant one at each step:
1. Choose the significance level α (e.g. 0.05)
2. Fit the model with all features in the feature set
3. Consider the feature with the highest p-value
4. If p-value > α, remove the feature from the feature set and go back to step 2; else stop the process.
• Stepwise Selection / Bidirectional elimination: Similar to forward selection, a feature is added at each iteration, but the significance of features already added is also verified, and a feature can be removed through backward elimination if needed.
1. Choose the significance level α (e.g. 0.05)


2. Perform steps 2 and 3 of forward selection
3. Perform steps 2, 3 and 4 of backward elimination
4. Repeat 2 and 3 until the optimal feature set is found

This method is considered the best: it selects a small subset of features that work well with the classifier used. However, two main drawbacks limit it: firstly, the computation time is much longer than for the filter method, and the cross-validation often used to reduce the risk of overfitting worsens this problem; secondly, the selection mechanism must be applied anew for each classifier to test.
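The greedy structure of forward selection can be sketched independently of the statistical test used (an illustrative sketch: `score_subset` stands in for the wrapped classifier's validation score, and the toy feature names and gain threshold are assumptions, where the thesis steps use p-values):

```python
def forward_selection(features, score_subset, min_gain=0.0):
    """Greedy forward selection: start empty, repeatedly add the feature that
    improves the subset score the most, and stop when no addition helps."""
    selected, best = [], score_subset([])
    remaining = list(features)
    while remaining:
        top_score, top_f = max((score_subset(selected + [f]), f) for f in remaining)
        if top_score - best <= min_gain:
            break  # no remaining feature improves the score: stop
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected

# Toy score: "zcr" and "mfcc" are each sufficient, hence mutually redundant;
# the greedy search therefore keeps only one of them.
def toy_score(subset):
    return 1.0 if ("zcr" in subset or "mfcc" in subset) else 0.0

print(forward_selection(["tempo", "zcr", "mfcc"], toy_score))  # ['zcr']
```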

2.5.3 Embedded model

In contrast to the two previous methods, this one incorporates feature selection into the learning process itself; it can be built into an algorithm such as SVM, AdaBoost, or a decision tree. Examples include LASSO, Ridge Regression, and Elastic Net. As the selection is done during training, the overfitting risk is high.
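A minimal sketch of embedded selection via LASSO, one of the examples named above: the L1 penalty drives some coefficients to exactly zero during training, so selection happens inside the learning step. The data and the `alpha` value are illustrative assumptions, not values from this thesis.

```python
# Embedded feature selection sketch using LASSO (L1 regularisation).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 30 features, only 5 actually informative.
X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# Features whose weight survived the L1 penalty.
kept = np.flatnonzero(lasso.coef_)
print(len(kept))
```

Raising `alpha` strengthens the penalty and shrinks more coefficients to zero, i.e. selects fewer features.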


3 Methods

3.1 Chosen approach

Since the project was carried out for Research and Development purposes, the company could not provide data. As sufficient open-source data on users is difficult to find, it was decided, in agreement with the company, to develop a content-based recommendation system. The literature describes various works using different models and features, without consensus on which one is "the best". Indeed, several papers report experiments on a single model, sometimes with a very limited subset of features, which results in a lack of standardisation and comparability. My goal is therefore to determine both the model (and its optimal parameters) and the subset of features that yield the best possible recommendations.

3.2 Datasets

Various datasets are used in Music Information Retrieval, but some characteristics are necessary for a dataset to suit this thesis. The first requirement is a large amount of data: many tracks, but also a variety of genres and artists, which helps minimise the risk of overfitting. The second major requirement is that the audio files are available, so that as many features as possible can be extracted. This way the company is not limited to the music in the dataset: new tracks can be added and their features computed in the same way as for the dataset.

During the state-of-the-art phase, several datasets caught my attention. The Million Song Dataset (MSD) is one of the first large datasets to be released. It was developed by researchers from LabROSA and The Echo Nest and presented at ISMIR 2011 1. It combines data from playme.com 2, The Echo Nest 3, 7digital 4 and musicbrainz.org 5, which is why it gives access to numerous metadata and pre-computed features for more than a million tracks. [47]

The MSD can be used together with the MMTD (Million Musical Tweets Dataset), proposed at ISMIR 2013 6, which provides user data extracted from microblogs and social media. [48] However, while a large number of features are available, the raw audio is not provided, so it is not possible to compute the desired features. Also, the communities around this dataset are no longer very active and many links to access its information have expired. [49]

The LFM-1b dataset was accepted at ICMR 2016 7. It is composed of one billion music listening events generated by around 120 thousand Last.fm 8 users on 32 million tracks. It gives access to both user metadata (country, age, number of times a piece of music is played, ...) and item metadata, so its main advantage is that it can be used to experiment with collaborative filtering approaches. [50]

1http://www.ismir.net/conferences/ismir2011.html 2http://www.playme.com/ww/web/radio/ 3http://the.echonest.com/ 4https://fr.7digital.com/ 5https://musicbrainz.org/ 6http://www.ismir.net/conferences/ismir2013.html 7http://www.icmr2016.org/ 8https://www.last.fm/


However, once again, this dataset does not give access to the raw files, so feature extraction is not possible.

The MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset was created for the International Piano-e-Competition. [51] It is composed of 1200 virtuosic piano tracks whose audio can be downloaded; some metadata as well as the MIDI files are available. These are useful in particular for generating music with artificial intelligence. Although the raw audio is available, the dataset remains very small and all the music is of the same genre, so it would be complicated to offer varied content that can please all tastes.

3.2.1 GTZAN

GTZAN is a minimalist dataset containing one thousand 30-second music samples from 10 different genres. [52] Raw audio with a single label per track is provided. The music chosen for each genre is particularly typical and the tracks strongly resemble each other. This dataset is widely used in the field of music recommendation: it contains music that is typically representative of its genre and tends to give good results. The high performance of previous work also comes from the fact that this dataset contains many redundancies. Some of these repetitions are exact, i.e. the fingerprints of the tracks are identical, but there are also different versions (studio, live) of the same piece. Finally, artists are often represented by several tracks. [52] Mislabelings and distortions can also be observed in this dataset.

3.2.2 Free Music Archive

The FMA (Free Music Archive) is a large-scale dataset created for music analysis, composed of more than 100,000 tracks from 161 genres. [53] It provides various track-level (including low-level features), artist-level and album-level metadata in CSV files. For 13,000 tracks, Echo Nest features are provided. Its main advantage is that the dataset contains the raw audio in high quality and full length. Each track is labelled with a main genre (among 16 genres) but may also have several sub-genres. Moreover, it is possible to download smaller subsets. [54]

The FMA dataset seems the most suitable since it is quite large and provides the audio files, allowing us to analyse them and extract the desired features. Its downside is that it contains data exclusively about the music (as well as its artist and album). As we have no user data, we will have to rely on content-based recommendation only. The available subsets are:

• Small: 8,000 tracks - 8 genres - 30 seconds. 1,000 tracks per genre, all balanced (table 1).


Genres          Number of tracks
Electronic      1000
Experimental    1000
Folk            1000
Hip-Hop         1000
Instrumental    1000
International   1000
Pop             1000
Rock            1000

Table 1: Number of samples per class in the small dataset

• Medium: 25,000 tracks - 16 genres - 30 seconds. In the medium and larger subsets the number of samples per class is not the same, so class imbalance issues will have to be dealt with (table 2).

Genres                Number of tracks
Blues                 74
Classical             619
Country               178
Easy Listening        21
Electronic            6311
Experimental          2249
Folk                  1516
Hip-Hop               2197
Instrumental          1349
International         1018
Jazz                  384
Old-Time / Historic   510
Pop                   1186
Rock                  7097
Soul-RnB              154
Spoken                118

Table 2: Number of samples per class in the medium dataset

• Large: 103,000 tracks - 161 genres - 30 seconds (table 3)

• Full: 106,000 tracks - 161 genres - full-length


Genres                Number of tracks
Blues                 110
Classical             1230
Country               194
Easy Listening        24
Electronic            9372
Experimental          10608
Folk                  2803
Hip-Hop               3552
Instrumental          2079
International         1389
Jazz                  571
Old-Time / Historic   554
Pop                   2332
Rock                  14182
Soul-RnB              175
Spoken                423
NaN                   56976

Table 3: Number of samples per class in the large dataset

Initially, the tests were carried out on the small and medium subsets. When more disk space became available, the large subset was used.

3.2.3 Data augmentation

One of the drawbacks of the FMA dataset, visible in the previous tables, is class imbalance: some genres are over-represented compared to others. To tackle this issue one can use under- and over-sampling methods. The former consists in reducing the number of samples from over-represented classes, while the latter consists in duplicating data from under-represented classes to obtain more data. To prevent overfitting on our new dataset and to increase the robustness of the system, these methods can be combined with data augmentation; this is also useful for datasets such as GTZAN that contain very little music.

Data augmentation is widely used for images, but some techniques also exist for audio files. It makes it possible to add slightly different data by applying minor variations. I have chosen to re-implement the procedures introduced in [55]. One possible method is to add randomly generated noise to the signal. The intensity of the noise can be varied, but care must be taken to ensure that the melody remains. It is also possible to "shift the music in time", i.e. the augmented track begins at second s and the original beginning of the track is attached at the end. Another method is to change the speed of the music and thus stretch or shorten it. Finally, the pitch can also be changed randomly to put more focus on the bass or on the high-pitched sounds. It is often wise and efficient to mix several of these variations.

The research paper [55] specifies that even if the results are encouraging, they are preliminary, and the methods used must be chosen with the purpose of the project in mind. The set of features will be slightly adapted according to the methods used. Indeed, when using the method that changes the pitch, features that depend on the

pitch of a sound will not be taken into account, and the same goes for the tempo when changing the speed.
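The first two augmentations described above (noise addition and time shifting) can be sketched with plain NumPy; the noise intensity and shift amount are illustrative assumptions. Pitch shifting and time stretching are more involved and are typically delegated to a library such as librosa (`librosa.effects.pitch_shift`, `librosa.effects.time_stretch`), so they are not re-implemented here.

```python
import numpy as np

def add_noise(signal, intensity=0.005, seed=None):
    """Add low-amplitude Gaussian noise; `intensity` scales the noise
    relative to the signal's standard deviation (illustrative default)."""
    rng = np.random.default_rng(seed)
    return signal + intensity * np.std(signal) * rng.standard_normal(len(signal))

def time_shift(signal, seconds, sr):
    """Start the clip at second `seconds`; the cut-off beginning is
    re-attached at the end, as described in the text."""
    return np.roll(signal, -int(seconds * sr))

# Illustrative 1-second 440 Hz tone instead of a real track.
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noisy = add_noise(tone, seed=0)
shifted = time_shift(tone, 0.25, sr)
```

Both transforms preserve the signal length, so the downstream feature extraction pipeline needs no changes.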

3.3 Features extraction

3.3.1 Preprocessing

The tracks are in MP3 format for the FMA dataset and in WAV format for GTZAN. The first step is to extract the spectral envelopes. The datasets were then split into a training set containing 90% of the data and a test set containing the remaining 10%. For the later hyperparameter optimisation, a third set, called the validation set, will be introduced. From the spectral envelope of each track, an attempt is made to extract as many useful features as possible in order to best describe all the characteristics of the music. Depending on the feature, there are three different extraction modes: at a given time t, over the whole track, or (in most cases) frame by frame.
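The 90/10 split mentioned above can be sketched as a shuffled index split; the seed and helper name are illustrative, not taken from the thesis code.

```python
import numpy as np

def train_test_split_90_10(n_samples, seed=0):
    """Shuffle sample indices and split them 90% train / 10% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(0.9 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split_90_10(1000)
```

A validation set for hyperparameter tuning can be carved out of the training indices the same way.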

3.3.2 Chosen features

Once the preprocessing was done and the spectral envelopes extracted, the features were computed. For this purpose, three libraries were used. The main one is librosa (library for Recognition and Organization of Speech and Audio) [56], an open-source Python package for audio and music signal processing. Essentia [57] was used to extract danceability; this open-source C++ library is intended for audio-based music information retrieval projects. Finally, to extract perceptual features, the Yaafe (Yet Another Audio Feature Extractor) library [58] was employed.

The features extracted to perform the tests are: loudness, perceptual sharpness and roughness, danceability, spectral decrease, spectral roll-off, spectral flux, spectral centroid, spectral bandwidth, spectral flatness, spectral slope, tuning, root-mean-square energy, zero crossing rate, onset events, and the 20 MFCC coefficients.

The extracted features are of two types: global and instantaneous. A global feature has a unique value per track, computed on the whole signal. Instantaneous features are computed on a short segment of time called a frame (around 20 ms). In this second case, two methods have been studied to feed the values to our models: the first is to keep only statistical descriptors (mean, variance, skewness, kurtosis, ...) of the values obtained over the frames; the second is to keep the temporal dimension and to use an adapted neural network.
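Collapsing a frame-level feature trajectory into the statistical descriptors mentioned above can be sketched as follows. The helper name is hypothetical, and the kurtosis is computed as excess kurtosis (the normal distribution maps to 0), which is an assumed convention rather than one stated in the thesis.

```python
import numpy as np

def frame_stats(frames):
    """Collapse a 1-D frame-level feature trajectory into the
    statistical descriptors used as model inputs."""
    x = np.asarray(frames, dtype=float)
    mu, var = x.mean(), x.var()
    z = (x - mu) / np.sqrt(var)
    return {
        "mean": mu,
        "var": var,
        "skewness": np.mean(z ** 3),
        "kurtosis": np.mean(z ** 4) - 3.0,  # excess kurtosis (assumption)
    }

# Example: descriptors of a short spectral-centroid trajectory.
stats = frame_stats([0.1, 0.2, 0.15, 0.4, 0.3])
```

Applying this to each instantaneous feature yields a fixed-length vector per track, regardless of the track's duration.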

3.3.3 Wrapper model for feature selection

To avoid information redundancy and limit parasitic features, the wrapper model, defined in the Background section, will be applied to keep only a subset of relevant features.


3.4 Models

All models defined in the state of the art have been tested in the experiments: logistic regression, decision tree, random forest, AdaBoost, k-nearest neighbours, support vector machine, naive Bayes and feed-forward neural networks. Their parameters have been varied in order to obtain the optimal settings for each model. These settings are called hyperparameters: parameters that are set in advance and are not learned from the data. To evaluate each model in an unbiased manner, the best hyperparameters are selected by computing the error on the validation set, and the generalisation performance of the model is determined by computing the error on the test set.

3.4.1 Hyperparameter tuning

For a specific model, a distinction is made between parameters and hyperparameters. The former are internal to the model: they are estimated during the training phase and used later to make predictions. Hyperparameters are external parameters determined prior to training. To determine the optimal hyperparameters, grid search was used: for each hyperparameter to be optimised, a range of values to test is specified. To avoid overfitting as much as possible, a cross-validation strategy is adopted; to obtain relevant results while maintaining reasonable computation times, 5-fold cross-validation was used.

For logistic regression, the main parameter to vary is the solver. The first one suited to multiclass classification is the nonlinear conjugate gradient method (newton-cg): based on Hessian matrices, it tends to be slow for large datasets since the second partial derivatives have to be computed. The second is limited-memory Broyden-Fletcher-Goldfarb-Shanno (lbfgs): it also requires second derivatives, resulting in slowness, but memory usage is better optimised as it only stores a few updates. Stochastic average gradient descent (sag and saga) is faster for large datasets but expensive in memory; it estimates the gradient by computing it on a randomly chosen subset. The penalty parameter can also be varied between l1, l2, and elasticnet, depending on the solver.

To find the optimal parameters of a decision tree, it is necessary to determine both the decision criterion (gini, entropy) and the maximum depth of the tree, in order to obtain good performance while avoiding overfitting. In the same way, for the random forest the criterion and the maximum depth are varied, but also the number of trees (called the number of estimators). For boosting, the number of estimators can also be tweaked.
Point weights for the k-nearest-neighbours method can be computed either uniformly (all points have the same weight) or inversely proportional to distance (the closest neighbours have a stronger influence than the farthest ones). The support vector machine method supports varying the kernel used (linear, polynomial, sigmoid, or rbf); the regularisation parameter named C is used to vary the penalty.

Two major parameters can affect the results when using linear discriminant analysis: the solver and the shrinkage intensity. The main solvers are least squares (lsqr), eigenvalue decomposition (eigen), and singular value

decomposition (svd). The shrinkage can vary between 0 and 1; to automate this process, one can use the Ledoit-Wolf lemma.

Neural network hyperparameters are numerous, which is why finding the optimal parameters is slow. First of all, the number of epochs (the number of full passes of the training data through the model during the forward and backward phases) and the batch size must be determined. Then, the optimizer that offers the best results is selected. It is also possible to use the well-known stochastic gradient descent algorithm while optimising the learning rate (which controls how much the weights are updated) and the momentum (which controls the influence of past steps); the purpose is to identify which of these two approaches is best. The way the weights are initialised before the first forward pass also influences the results. The main approaches are either to initialise all weights to the same value (usually 0 or 1) or to draw them randomly (often from a uniform distribution); to avoid exploding or vanishing gradient issues, heuristics (whose formula depends on the number of layers) are used to set the weights. Various activation functions exist to ensure the convergence of the network and control the training speed. Finally, to prevent overfitting, dropout is introduced: a percentage of neurons is randomly dropped (temporarily disabled) at each epoch.
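The grid search with 5-fold cross-validation described above can be sketched with scikit-learn's `GridSearchCV`; the SVM parameter grid and the synthetic data below are illustrative, not the grids actually used in the thesis.

```python
# Grid search over SVM hyperparameters with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic data standing in for the feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative grid: kernel type and penalty parameter C.
grid = {"kernel": ["linear", "rbf"], "C": [0.5, 1, 2]}
search = GridSearchCV(SVC(), grid, cv=5)  # 5-fold CV per combination
search.fit(X, y)

best = search.best_params_  # combination with the highest mean CV accuracy
print(best)
```

After fitting, `search.best_estimator_` is refit on the full training data with the winning combination, and only the held-out test set is used to report generalisation performance.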

3.5 Evaluation

The evaluation of the recommendation system is carried out in two stages: first, a quantitative evaluation based on mathematical and statistical techniques; second, a few people are asked to listen to the recommendations made.

3.5.1 Evaluation of the classification using labels

Each track is labelled and therefore has a genre considered as the ground truth; the model then aims to predict a genre for each track of the test set. Multiple methods [7] are used to assess the classification of songs.

The first one is precision, which indicates how many of the retrieved items are indeed relevant. [7] Let C be the set of classes, Rel_c the set of relevant items for class c, and Ret_c the set of retrieved items for class c. The precision of class c and the average precision are:

\[ P_c = \frac{|Rel_c \cap Ret_c|}{|Ret_c|}, \qquad P = \frac{1}{|C|} \sum_{c \in C} P_c \]

For recommendation systems, the goal is to ensure that the majority of the predicted items are relevant. Indeed, platforms want to keep their customers, which is why high precision is desired.

The second metric is recall, which reveals how many of the relevant items are retrieved. [7] The recall of class c and the average recall are:

\[ R_c = \frac{|Rel_c \cap Ret_c|}{|Rel_c|}, \qquad R = \frac{1}{|C|} \sum_{c \in C} R_c \]


This metric is mainly used in information retrieval (for instance in medicine, where the objective is to detect all sick patients: some healthy patients may be misclassified as sick, but the opposite must be avoided). Finally, the F-measure can also be used as a measure of the test's accuracy. The average F-measure is defined as follows:

\[ F_\beta = \frac{(1 + \beta) \cdot P \cdot R}{\beta \cdot P + R} \]

It can also be computed for one given class by considering the precision and recall of that class. The F1-score, the harmonic mean of precision and recall, is often used, but a higher β value can be chosen to give more importance to recall over precision.
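The three metrics can be computed from scratch as a sketch, following the per-class definitions given above. The F-measure here uses a linear β weighting, matching the formula in the text (the more common variant weights with β²); the function name is illustrative.

```python
def macro_scores(y_true, y_pred, beta=1.0):
    """Macro-averaged precision, recall, and F_beta over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        ret = sum(1 for p in y_pred if p == c)                     # |Ret_c|
        rel = sum(1 for t in y_true if t == c)                     # |Rel_c|
        hit = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        precisions.append(hit / ret if ret else 0.0)
        recalls.append(hit / rel if rel else 0.0)
    P = sum(precisions) / len(classes)
    R = sum(recalls) / len(classes)
    denom = beta * P + R
    F = (1 + beta) * P * R / denom if denom else 0.0
    return P, R, F

# Toy example with two genres.
P, R, F = macro_scores(["rock", "rock", "jazz", "jazz"],
                       ["rock", "jazz", "jazz", "jazz"])
```

With β = 1 the formula reduces to the harmonic mean 2PR/(P + R), i.e. the F1-score.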

3.5.2 Evaluation of the prediction using confusion matrices

Another very useful tool is the confusion matrix. For multi-genre classification its use is essential: it shows which music genres are best recognised and, above all, how the errors are distributed. Indeed, as some genres are close to each other, the impact of some errors is low. Ideally, the matrix shows a diagonal with high accuracy percentages and all other values as low as possible. However, the visualisation of errors is also very important, since the boundaries between genres are not always well defined.
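A confusion matrix is simple to build by hand, as a sketch; rows index the true genre and columns the predicted genre, and the genre labels below are illustrative.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = true genre, columns = predicted genre."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

labels = ["classical", "jazz", "rock"]
cm = confusion_matrix(["rock", "rock", "jazz"],
                      ["rock", "jazz", "jazz"], labels)
```

Normalising each row by its sum gives the per-genre percentages shown in the figures, with the diagonal holding the correctly classified share.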

3.5.3 Evaluation of the prediction based on human opinion

The purpose of this project is to build a music recommendation system, so even if predicting genres is useful, it cannot be the only criterion: above all, recommendations must be relevant according to human judgment. This method has two main drawbacks: it is time-consuming and subjective. Everyone will have a different opinion according to their personal tastes and the attention they pay to rhythm, melody, lyrics, etc. Judges may be asked to give a score between 0 (very different) and 100 (very similar), or to categorise recommendations as "very similar", "somewhat similar" or "not similar at all". For this type of evaluation, care must be taken to obtain a representative sample of the population (especially of the target population of the application). As explained in [59], it is preferable to have great diversity, whether demographic (age, gender, ...), geographic (country, city, ...), or personality-based (opinions, interests, lifestyle, ...). It is also interesting to rank the judges according to their listening experience; there could be four classes: those who do not listen to music, those who listen occasionally, those for whom music is an important part of life, and those who have strong musical knowledge.


4 Results

4.1 Preliminary results

Some preliminary tests were carried out on each dataset (FMA and GTZAN) to determine which one would be used.

4.1.1 Tests on FMA

For each track there is a "genre top" label, a unique value among the 16 main genres (Blues, Classical, Country, Easy Listening, Electronic, Experimental, Folk, Hip-Hop, Instrumental, International, Jazz, Old-Time/Historic, Pop, Rock, Soul-RnB and Spoken). This value is missing for half of the large dataset but always indicated for the others. Another useful label is "genres all", which contains a set of genres (among the 161 sub-genres) for each track.

In order to achieve a supervised classification, the first step of the preprocessing is to discard the tracks that have no label (neither genre top nor genres all). The Spoken genre was discarded since we are only interested in music, and unreadable audio files were discarded as well. Once the reduced dataset was obtained, the spectral envelope and sampling rate of each track were extracted in order to compute the features.

The first results, presented in table 4 and obtained using the FMA small dataset, are not as good as expected: the maximum accuracy on the test set is 55%, using a neural network. Looking at the confusion matrices (figure 9), the expected diagonal is far from reached. After listening to some samples and studying this dataset, several weaknesses emerge. First of all, tracks classified as "experimental" correspond more to everyday noises (objects being destroyed) than to real music, which is why they were removed from our final dataset. Moreover, the "international" genre actually includes African, Latin, Indian, French music and so on. It is therefore understandable that our models find it difficult to recognise this genre and to find similarities among such different tracks.

                               Training set   Test set
Logistic Regression            58 %           42 %
Decision Tree                  100 %          30 %
Random Forest                  100 %          32 %
Adaboost                       32 %           28 %
K-Nearest-Neighbours           68 %           53 %
Support Vector Machines        70 %           51 %
Naive Bayes                    37 %           27 %
Linear Discriminant Analysis   69 %           50 %
Feed Forward Neural Network    100 %          55 %

Table 4: Training and test accuracy on FMA small


Figure 9: Confusion matrices on FMA small (in percentage)


                               Training set   Test set
Logistic Regression            55 %           28 %
Decision Tree                  100 %          28 %
Random Forest                  100 %          29 %
Adaboost                       24 %           23 %
K-Nearest-Neighbours           42 %           29 %
Support Vector Machines        43 %           29 %
Naive Bayes                    31 %           19 %
Linear Discriminant Analysis   41 %           28 %
Feed Forward Neural Network    100 %          21 %

Table 5: Training and test accuracy on FMA large

While genres are balanced in FMA's smallest subset, this is not the case in the larger ones, so the initial results are poor (table 5). When using the dataset as is, classifiers tend to classify all music into the most represented genres, here rock, experimental, and electronic. This effect is particularly noticeable on the confusion matrices (figure 10).


Figure 10: Confusion matrices on FMA large (in percentage)


4.1.2 Tests on GTZAN Regarding this dataset, the first results, both numerical (table 6) and on listening, are much more encouraging.

                               Training set   Test set
Logistic Regression            87 %           65 %
Decision Tree                  100 %          45 %
Random Forest                  100 %          58 %
Adaboost                       32 %           29 %
K-Nearest-Neighbours           80 %           64 %
Support Vector Machines        86 %           65 %
Naive Bayes                    55 %           8 %
Linear Discriminant Analysis   69 %           64 %
Feed Forward Neural Network    100 %          63 %

Table 6: Training and test accuracy on GTZAN

Considering table 6, one thing that emerges is that the results obtained with the naive Bayes model are not exploitable. As said before, since the features are highly correlated, this algorithm gives bad results and classifies every track as Classical. More interestingly, the desired darker diagonal is much more present (see figure 11) than with the FMA dataset, and this holds for all other models, in particular support vector machines. It can also be seen that most methods tend to overfit, reaching 100% training accuracy while the test accuracy is much lower; changing the parameters and adding a penalty to these models may help reduce this issue. Finally, random forest proves to be a great improvement over the classic decision tree, whereas AdaBoost does not give good enough results here.


Figure 11: Confusion matrices on GTZAN (in percentage)

However, as explained above, these preliminary results are more promising because the selected tracks (and thus their features) are very similar to each other and specifically chosen to be representative of their genres. The artists are not very diverse either. This is a minor yet real issue, as we aim to develop a recommendation system that can be generalised to

tracks not present in our dataset.

4.2 Dataset creation

Because of the previously stated issues, it was decided to create a new dataset for this project. This new dataset, used in this master's thesis, is a combination of the GTZAN and FMA datasets: the FMA dataset has too many drawbacks to be used alone, while the numerous repetitions in the GTZAN dataset and its narrowness make it difficult to generalise our model to new tracks. The new dataset is composed of the 100 tracks of each of the 10 genres present in the GTZAN dataset, to which 50 more tracks per genre coming from the FMA dataset were added. In the end, each of the 10 genres is represented by 150 tracks, varied enough to limit the overfitting of the model on the dataset. The results obtained are more exhaustive and more representative of our dataset, and cover every genre while avoiding class imbalance issues.

                               Training set   Test set
Logistic Regression            87 %           71 %
Decision Tree                  100 %          48 %
Random Forest                  100 %          60 %
Adaboost                       33 %           31 %
K-Nearest-Neighbours           81 %           65 %
Support Vector Machines        90 %           67 %
Naive Bayes                    58 %           48 %
Linear Discriminant Analysis   82 %           66 %
Feed Forward Neural Network    100 %          69 %

Table 7: Training and test accuracy on the GTZAN / FMA dataset


Figure 12: Confusion matrices on the new dataset

4.3 Hyperparameter tuning

The purpose of this part is to tweak the hyperparameters of the models in order to obtain the best achievable results. To limit the risk of overfitting and to obtain better generalisation on external data, k-fold cross-validation

was used (with k between 5 and 10). The hyperparameter selection is based on the results obtained, both the mean accuracy and the variance, as well as on the time needed to perform the computations.

4.3.1 Logistic regression optimization

On our dataset, all solvers give similar results. The maximum number of iterations is set to 100. To avoid overfitting, C is set to a relatively low value: 1. The liblinear solver is then used with an l1 penalty, as it is well adapted to small datasets such as GTZAN.

Figure 13: Logistic regression hyperparameters tuning

4.3.2 Decision tree and random forest optimization

For simple decision trees (see figure 14), the decision criterion (which measures the split quality) based on entropy and information gain gives better results: 57% accuracy compared to 54% using Gini impurity. For random forests, on the other hand, both functions give very similar results (see figure 15). A clear improvement, to 73% accuracy, is observed when using random forest instead of a decision tree, confirming the power of ensemble methods. In the following part of the project, the entropy criterion and a maximum depth of 24 are chosen in order to maximise the mean accuracy while minimising the variance.


Figure 14: Decision tree hyperparameters tuning

Figure 15: Random forest hyperparameters tuning

4.3.3 Adaboost optimization

The optimal parameters for AdaBoost are a learning rate of 0.01 and 60 estimators (figure 16). Nevertheless, the accuracy obtained remains limited, with a maximum value of 41%.


Figure 16: Adaboost hyperparameters tuning

4.3.4 K-nearest-neighbours optimization

Overall, according to figure 17, the results are better when using the Euclidean and Manhattan distances. Since the variances obtained with the Euclidean distance are lower, this distance will be preferred in future experiments. Moreover, weighting the neighbours inversely proportionally to their distance from the studied point provides better accuracy than uniform weights. The number of neighbours K is fixed at 5, achieving 72% accuracy.


Figure 17: k-nearest-neighbours hyperparameters tuning

4.3.5 Support vector machine optimization

The most appropriate kernel type appears to be rbf, which achieves 73% accuracy. To minimise overfitting, a rather low value of the penalty parameter C is taken, here 2.

Figure 18: Support vector machine hyperparameters tuning


4.3.6 Linear Discriminant Analysis

The results obtained with the lsqr and eigen solvers are similar; with the svd solver the results are not convincing. The best results are obtained with automatic shrinkage using the Ledoit-Wolf lemma, reaching 70% accuracy.

4.3.7 Feed-Forward Neural Network

The different hyperparameters of the neural network were tested in parallel. The optimised model is composed of three hidden layers (of sizes 256, 128, and 64 respectively). Training is done with a batch size of 64 over 1000 epochs. To avoid overfitting, dropout and early stopping are used. For the other main parameters, tanh is used as the activation function, the constant learning rate is set to 0.001, the momentum to 0.9, and the solver used is Adam.
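The dropout regularisation used above can be sketched in NumPy as a standalone layer. This uses the common "inverted dropout" convention, where surviving activations are rescaled during training so that inference needs no change; that convention is an assumption, not something stated in the thesis.

```python
import numpy as np

def dropout(activations, rate, seed=None, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the
    survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= rate  # keep with prob. 1-rate
    return activations * mask / (1.0 - rate)

# Example: a hidden layer of 1000 unit activations, 50% dropout.
h = np.ones(1000)
out = dropout(h, rate=0.5, seed=0)
```

At inference time (`training=False`) the layer is an identity, matching how frameworks such as Keras apply dropout only during training.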

4.3.8 Global results after tuning At the end of the hyperparameter tuning step, the results (Table 8) are as follows:

                               Before tuning   After tuning
Logistic Regression            68 %            70 %
Decision Tree                  48 %            54 %
Random Forest                  60 %            73 %
Adaboost                       31 %            41 %
K-Nearest-Neighbors            66 %            72 %
Support Vector Machines        68 %            73 %
Naive Bayes                    8 %             8 %
Linear Discriminant Analysis   67 %            70 %
Feed Forward Neural Network    70 %            74 %

Table 8: Test accuracy on the GTZAN / FMA dataset before and after hyperparameter tuning. Significant improvements are in blue.

4.4 Feature selection

In order to identify the subset of features most suitable for each model, all three wrapper-type methods were used. Performance evaluation is based on accuracy, and the tests were carried out with cross-validation (k = 5). As can be seen in table 9, for the majority of the models it is the bidirectional method, more expensive in time and computing resources, that offers the subset of features maximising the accuracy.


                               Forward selection     Backward elimination   Bidirectional elimination
                               accuracy (features)   accuracy (features)    accuracy (features)
Logistic Regression            71 % (45)             72 % (72)              73 % (36)
Decision Tree                  58 % (50)             60 % (27)              61 % (28)
Random Forest                  75 % (24)             74 % (25)              74 % (28)
AdaBoost                       40 % (9)              39 % (9)               41 % (17)
K-Nearest Neighbours           74 % (36)             75 % (36)              76 % (27)
Support Vector Machines        78 % (40)             79 % (40)              80 % (35)
Naive Bayes                    62 % (25)             62 % (24)              63 % (27)
Linear Discriminant Analysis   71 % (47)             71 % (39)              72 % (44)
Feed Forward Neural Network    75 % (41)             75 % (42)              75 % (38)

Table 9: Training and test accuracy on the GTZAN / FMA dataset. Significant improvements are in blue.

The first interesting observation is that the accuracy of Naive Bayes can be greatly increased by reducing the number of features: 63% can be achieved using less than half of them. Including all the features increases the probability of dependency between them and therefore decreases the performance of this algorithm. Looking at Figure 19, it can be seen that the accuracy decreases as the number of features increases. Since cross-validation was used, the curve on the graphs corresponds to the mean accuracy and the blue envelope around it to the variance. However, the accuracy of this method is still too low compared to the other models.


Figure 19: Feature selection - Naive Bayes

The most promising models are now Support Vector Machine (Figure 20) and k-Nearest Neighbours (Figure 21). However, Random Forest (Figure 22) and Logistic Regression (Figure 23) also provide relevant results. The number of features is carefully chosen to maximise accuracy, in order to best predict genres, while limiting the variance of the results. In order to keep only the most relevant models, Decision Tree, Adaboost and Naive Bayes will no longer be part of the comparison carried out in this thesis.


Figure 20: Feature selection - Support Vector Machine

Figure 21: Feature selection - k-Nearest Neighbours


Figure 22: Feature selection - Random Forest

Figure 23: Feature selection - Logistic Regression


Figure 24: Feature selection - Linear Discriminant Analysis

4.4.1 Most important features

Knowing the optimal subset of features for each model, it is possible to see which features are the most useful on average. The results are given in Figure 25. First of all, as indicated in the papers presenting the perceptual features (loudness, perceptual sharpness, perceptual spread [37], and MFCC [34]), these features are particularly powerful and efficient for music recommendation. Concerning MFCCs, it is noticeable that the first coefficients are the most important, while the highest coefficients are less used in the final subsets. Moreover, onset events (characterising the rhythm) and danceability are also very strong features that are always used regardless of the model. It should be noted that although the use of high-level features is very controversial, the use of danceability seems to be a relevant choice here. Spectral features, as well as the zero crossing rate and pitch, are less used.


Figure 25: Feature utilisation frequency

It is relevant to compare the frequency of appearance of features in the final subsets with the average accuracy obtained using each feature alone, presented in Table 10. Taking only the MFCCs (their mean and variance), the accuracy is 52%. Tests were carried out adding the moments of order 3 and 4, but this did not bring any improvement. Crossing the two results shows that although danceability is present in every subset, when used alone it provides only 20% accuracy. The conclusion is that this high-level feature provides information that differs from the other, lower-level features and is not redundant with them. The same is true, to a lesser extent, for root mean square energy. Although spectral features provide higher accuracy on their own, they are not always retained, since they do not provide details useful for classifying the songs.


Feature                        Average accuracy
MFCC mean & variance                52 %
MFCC mean                           48 %
MFCC variance                       35 %
Spectral slope                      33 %
Perceptual sharpness                31 %
Spectral decrease                   32 %
Spectral bandwidth                  30 %
Spectral rolloff                    29 %
Spectral flux                       28 %
Onset event                         28 %
Perceptual spread                   27 %
Spectral centroid                   26 %
Pitch tuning                        25 %
Loudness                            24 %
Spectral flatness                   23 %
Root mean square energy             21 %
Danceability                        20 %
Zero crossing rate                  18 %

Table 10: Accuracy using features alone
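The per-feature scores in Table 10 can be reproduced in outline by scoring each feature group on its own with 5-fold cross-validation; the feature names, column indices, and data below are hypothetical placeholders for the thesis feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the extracted feature matrix and genre labels.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           random_state=0)
# Hypothetical mapping from feature name to its column indices.
feature_groups = {"mfcc_mean": [0, 1, 2, 3], "spectral_centroid": [4],
                  "zero_crossing_rate": [5], "danceability": [6]}

# Accuracy obtained when the classifier sees one feature group alone.
for name, cols in feature_groups.items():
    scores = cross_val_score(SVC(), X[:, cols], y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.2f}")
```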

4.5 Data augmentation

A series of tests were carried out using the different data augmentation methods (adding noise, shifting the music in time, changing the speed or the pitch) separately and in combination. Table 11 only shows the results for Logistic Regression and K-Nearest-Neighbors, the findings being similar for the other models. The first conclusion that can be drawn is that shifting the music in time does not change the results on the test set. This can be explained by the fact that the extracted features are mostly averaged over time. The other three methods were therefore examined, separately and then together. Unfortunately the performance is not improved; on the contrary, overfitting is enhanced: the results on the training set are very good while they decrease on the test set. The final method will therefore be chosen without data augmentation.
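The time-shift, noise, and speed transformations can be sketched in plain NumPy (pitch shifting in the actual pipeline would rely on an audio library such as librosa; the signal below is a synthetic tone, not thesis data):

```python
import numpy as np

def add_noise(y, noise_level=0.005):
    """Mix white noise into the signal."""
    return y + noise_level * np.random.randn(len(y))

def time_shift(y, n_samples=1000):
    """Rotate the waveform in time; time-averaged features are unaffected."""
    return np.roll(y, n_samples)

def change_speed(y, rate=1.2):
    """Naive resampling: rate > 1 shortens (speeds up) the signal."""
    idx = np.arange(0, len(y), rate)
    return np.interp(idx, np.arange(len(y)), y)

y = np.sin(np.linspace(0, 100, 22050))  # one second of a synthetic tone
augmented = [add_noise(y), time_shift(y), change_speed(y)]
```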

             shift           noise           pitch           speed           last three combined
             train   test    train   test    train   test    train   test    train   test
Log Reg      92 %    73 %    98 %    72 %    99 %    68 %    98 %    65 %    100 %   65 %
k-NN         80 %    76 %    91 %    72 %    93 %    69 %    92 %    67 %     93 %   68 %

Table 11: Accuracies using data augmentation

4.6 Final examples of recommendations

Once the features that best represent the music have been extracted, recommendations can be listened to. After all these steps, it appears that the most efficient and robust method, both in terms of scores in the confusion matrices and in human-ear tests, is the Support Vector Machine. It is therefore the method used in this final part. The confusion matrix is presented in Figure 26. The percentages are better, and the errors seem quite forgivable. Some validation listening will now take place to ensure that the predictions are accurate. This model went through various tests, and the results are presented below for the Blues, Rock and Classical styles.
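The reading applied to Figure 26 (a dark diagonal, near-white off-diagonal cells) can be reproduced with scikit-learn's confusion_matrix, normalised per true genre; the labels below are illustrative, not the thesis data:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true vs. predicted genre labels.
y_true = ["blues", "blues", "classical", "classical",
          "metal", "metal", "rock", "rock"]
y_pred = ["blues", "blues", "classical", "classical",
          "metal", "rock", "rock", "metal"]
labels = ["blues", "classical", "metal", "rock"]

# normalize="true" turns each row into per-genre recall,
# so the diagonal reads directly as the recovery rate of each genre.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm.round(2))
```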

Figure 26: Final confusion matrix for SVM

For the first example, a classical piece was used as input. Unsurprisingly, and as expected from the confusion matrix, the genre is recognised and the recommended songs (Table 12) are all rather similar classical pieces, mostly with string instruments.

                     Title                      Artist                Genre
Initial song         Spring Allegro III         Vivaldi               Classical
Recommended songs    Spring Allegro II          Vivaldi               Classical
                     Violin Concerto            Karol Szymanowski     Classical
                     Allegro assai con spirito  F. J. Haydn           Classical
                     Sonata XIII                Giovanni Gabrieli     Classical

Table 12: Recommended songs example 1 (classical)


For the second example, a Blues song was taken. Table 13 shows that the first recommended song is from the same artist, two others are from the same genre, and the last one is Reggae. Although they belong to different genres, these recommendations seem to be quite correct.

                     Title                      Artist                Genre
Initial song         One More Night             Hot Toddy             Blues
Recommended songs    Rescue Me                  Hot Toddy             Blues
                     Hobo's Son                 Kelly Joe Phelps      Blues
                     Could You Be Loved         Bob Marley            Reggae
                     I'm Bad Like Jesse James   John Lee Hooker       Blues

Table 13: Recommended songs example 2 (blues)

Finally, taking a Rock song, the results are more heterogeneous. Indeed, one of the recommendations is disco and half of them are metal (shown in Table 14). The disco song is quite close to the initial one, and metal is a sub-genre of rock. After listening, these results seem relevant too.

                     Title                      Artist                      Genre
Initial song         Like Swimming              Morphine                    Rock
Recommended songs    Caught in the Middle       DIO                         Metal
                     I Know You Pt. III         Morphine                    Rock
                     High Energy                Evelyn Thomas               Disco
                     Freedom                    Rage Against The Machine    Metal

Table 14: Recommended songs example 3 (rock)

Most of the listening tests performed seem relevant, or at least not totally inexplicable. However, external expertise would be required for every genre in order to obtain more accurate human judgements, and thus a better assessment of the model's predictions.


5 Conclusions and discussions

5.1 Discussion of the results

5.1.1 Quantitative results

To sum up, in terms of quantitative results, the final accuracy obtained for the classification into genres with the Support Vector Machine model is 78% on the test set. However, other models, in particular the neural network and random forest, also show promising results. This value could be reached after an optimisation of the model's hyperparameters as well as a feature selection stage; these two methods make it possible to greatly increase accuracy while restricting overfitting. It has to be taken into account that this value is only an indicator for the classification part of the project. The confusion matrix (Figure 26 in the previous section) clearly gives more information about the results. What we are looking for in the confusion matrix is a dark diagonal, characterising a high recovery rate of the initial genre, while the other cells should be as white as possible. The final matrix is in line with what is desired. The matrix also shows that some genres are easier for the system to distinguish than others. Indeed, the classical genre is always retrieved. On the pop, metal, country and blues genres, the system is also rather efficient and finds the genre four times out of five. By contrast, rock and disco are more challenging to identify. Furthermore, a good classification is not the only indicator on which a recommendation should be based: these categories are not mutually exclusive, and some are sub-genres of others; for example, metal is a sub-genre of rock. The most useful features to quantify the similarity between songs are the perceptual features (loudness, sharpness, spread and MFCC), rhythm and danceability. Indeed, the perceptual features alone provide more than 60% accuracy with the SVM method.
It should also be noted that danceability, a high-level feature, provides information that is totally different from the more classical features, whether physical or musical, which allows a clear improvement in performance (about 10%).

5.1.2 Qualitative results

The results in terms of listening are encouraging. It should be specified that these listening tests were judged by only one person (the author) and that averaging over a group of individuals would allow a better assessment. However, the progression between the listening sessions at the beginning and after the optimisations is evident. While the system used to erroneously recommend blues songs for a totally different pop track, the recommendations are now more coherent and consistent. The system mainly recommends music within the same genre (sometimes from the same artist) and less frequently more distant music. In all cases, there is a similarity in the general mood or tone of the chosen music, its rhythm, melody, or even instruments.


5.2 Conclusion

5.2.1 Research question

The purpose of this section is to answer the research question. In order to take into account the tastes of a user, several approaches are possible, most of which were explained in the Background section. The content-based method was chosen and implemented for this thesis. The results obtained show that it is rather efficient to learn a user's tastes from the music they previously listened to in order to recommend new songs. The principle is, from a song in mp3 or wav format, to extract as much information as possible; the music track is seen as evidence of the user's musical preferences. The retrieved information can be musical (pitch, rhythm), physical, extracted via signal processing (spectral decrease), or higher-level (danceability). It is essential to ensure that the pieces of information are all relevant and non-redundant, using feature selection algorithms. From the best subset of features, the Machine Learning algorithm (here Support Vector Machine) classifies songs into genres. After this classification stage comes the recommendation stage. It consists in finding four other songs to recommend to the user: those that are the closest in terms of feature similarity. One of the most complicated tasks in this project is the evaluation; it is quite complex to measure the performance of the system. The initial results were based on the various classifiers studied during this project. At the same time, listening sessions took place in order to judge the recommendations and their models. Comparing the results with other research is not a simple task, for two main reasons. The dataset used is a combination of two other open-source datasets, which means that no other paper is (currently) based on the same one. Furthermore, it is quite complex to rank the performance of a recommendation system in order to compare it.
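The recommendation stage described above, returning the four tracks closest in the selected feature space, can be sketched with scikit-learn's NearestNeighbors; the feature matrix here is a random stand-in for the real library:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
library = rng.normal(size=(100, 35))  # 100 tracks x 35 selected features

# Standardise so no single feature dominates the distance metric.
scaler = StandardScaler().fit(library)
index = NearestNeighbors(n_neighbors=5).fit(scaler.transform(library))

def recommend(track_idx, k=4):
    """Return indices of the k tracks closest to the query in feature space."""
    q = scaler.transform(library[track_idx].reshape(1, -1))
    _, neighbours = index.kneighbors(q, n_neighbors=k + 1)
    # Drop the query track itself, which is its own nearest neighbour.
    return [i for i in neighbours[0] if i != track_idx][:k]

print(recommend(0))
```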
Regarding the genre classification, the results are below those obtained by researchers who only used the GTZAN dataset. Indeed, where their scores could reach 91% [22] on that dataset, our classifier peaks at 78%. This is explained by the fact that our dataset also contains songs from the FMA dataset. It is therefore more suited to generalising to yet unencountered music, since it better encompasses the possible variations within a same genre.

5.2.2 Known limitations

First of all, it should be pointed out that the way the system is evaluated is debatable. Indeed, relying on metrics such as accuracy provides little information about the performance of the recommender [60]. The article quoted above specifies that recall, accuracy and confusion matrices do not really provide a true assessment of the ability of a system to recognise a genre. Moreover, the interpretations made from the results are based on human perception: we base our interpretations on the instruments we recognise, or on the way the music is sung or rapped, whereas the current features are not able to focus on such information [60]. The GTZAN dataset also contains some errors in its labels. As the concept of genre may be ambiguous, a more in-depth analysis of the genres in FMA and GTZAN should also be done in order to determine whether music classified as "blues" in FMA, for example, would also be classified as "blues" in GTZAN.


This is why the interpretations made in this thesis are based on two assumptions. The first is that the dataset created is coherent and consistent at the label level. The second is that the recommendation system uses cues (such as the number and type of instruments, or whether the music is for dancing or relaxing) similar to those that would be used by a human trying to classify music [60].

5.3 Future work

5.3.1 Improvement suggestions

First, the way the results are evaluated could be improved. Indeed, the qualitative evaluations were conducted only by the author. Opinions should be collected from a wider and better-mixed sample of end users, varying in demographic (age, gender, ...) and geographic (country) parameters, but also in opinions and personalities. Secondly, the dataset could be enhanced. The quality of the recommendations could be improved by taking larger datasets, provided the required computational resources are available. Moreover, taking each piece of music in its entirety, and not just the first 30 seconds, should improve the results. Furthermore, basing the recommendation on a set of previously played songs instead of just one would allow a better understanding of the user's tastes, and not just of a specific style of music they enjoy. Finally, combining recommendation methods can greatly increase the accuracy of the results. A first task would be to retrieve the lyrics of the music using data mining. They could then be analysed using Natural Language Processing techniques in order to obtain a new feature that would help classify the music even more precisely. This thesis relies on the content-based method, whilst it is feasible and often wise to combine different methods. Once the application is launched, it would be advisable to collect user preference data in order to create a hybrid recommendation system combining content-based and collaborative filtering approaches.

5.3.2 Application development

The application itself is not yet released; the templates have been made by a UX/UI designer and will be used later in the project life-cycle. The final application will be split into two main features. In the recommendation part (layout shown in Figure 27), the user will be invited to enter the URL of a piece of music from a streaming platform. In a later version, the goal is for the application to also be able to record an extract of the music being played. From the music or extract, the application will offer the possibility to listen to songs recommended by the algorithm developed in this master thesis.


Figure 27: Main menu and recommendation part of the application

The generation part was developed by another trainee (see Figure 28). In this part it will be possible to generate music for a selected genre. Once the user chooses a genre, the application uses a GAN to generate brand new music.

Figure 28: Main menu and generation part of the application


References

[1] Elaine Rich. User modeling via stereotypes. Cognitive Science, 3(4):329–354, 1979.
[2] Robin Burke and Maryam Ramezani. Matching Recommendation Technologies and Domains, pages 367–386. January 2011.
[3] Markus Schedl. Deep learning in music recommendation systems. Frontiers in Applied Mathematics and Statistics, 5:44, August 2019.
[4] Markus Schedl, Peter Knees, and Fabien Gouyon. New paths in music recommender systems research. 2017.
[5] Sean M. McNee, John Riedl, and Joseph A. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems, pages 1097–1101. CHI EA '06. Association for Computing Machinery, New York, NY, USA, 2006.
[6] Peter Knees and Markus Schedl. Music retrieval and recommendation – a tutorial overview. In Proceedings of the 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Santiago, Chile, August 2015.
[7] Peter Knees and Markus Schedl. Music Similarity and Retrieval: An Introduction to Audio- and Web-based Strategies. 2016.
[8] Javier Pérez-Marcos and Vivian Batista. Recommender system based on collaborative filtering for Spotify's users. pages 214–220, June 2018.
[9] Chris Johnson. From idea to execution: Spotify's Discover Weekly. November 2015.
[10] J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. Collaborative filtering recommender systems. January 2007.
[11] Mehdi Elahi, Francesco Ricci, and Neil Rubens. A survey of active learning in collaborative filtering recommender systems. Computer Science Review, June 2016.
[12] Marius Kaminskas and Francesco Ricci. Contextual music information retrieval and recommendation: State of the art and challenges. Computer Science Review, 6(2):89–119, 2012.
[13] Zhiwei Gu, Li Guo, and Tianchi Liu. Music genre classification via machine learning.
[14] Hyeoun-Ae Park. An introduction to logistic regression: From basic concepts to interpretation with particular attention to nursing domain. Journal of Korean Academy of Nursing, 43:154–164, April 2013.
[15] Sigmoid-function-2. https://commons.wikimedia.org/wiki/File:Sigmoid-function-2.svg.
[16] Mohammed Terry-Jack. Tips and tricks for multi-class classification. https://medium.com/@b.terryjack/tips-and-tricks-for-multi-class-classification-c184ae1c8ffc.
[17] Beatriz C. F. de Azevedo, Glaucia M. Bressan, and Elisangela Ap. S. Lizzi. A decision tree approach for the musical genres classification. 2017.
[18] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986.
[19] Zhuo Wang, Jintao Zhang, and Naveen Verma. Realizing low-energy classification systems by implementing matrix multiplication directly within an ADC. IEEE Transactions on Biomedical Circuits and Systems, 9:1–1, December 2015.
[20] Padraig Cunningham and Sarah Delany. k-nearest neighbour classifiers. Multiple Classifier Systems, April 2007.
[21] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the Annual ACM Symposium on Theory of Computing, pages 604–613, October 2000.
[22] R. Thiruvengatanadhan. Speech/music classification using MFCC and KNN. International Journal of Computational Intelligence Research, 13(10):2449–2452, 2017.
[23] Changsheng Xu, Namunu Maddage, Xi Shao, Fang Cao, and Qi Tian. Musical genre classification using support vector machines, volume 5, pages V-429. May 2003.
[24] Harry Zisopoulos, Savvas Karagiannidis, Georgios Demirtsoglou, and Stefanos Antaris. Content-based recommendation systems. November 2008.
[25] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[26] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification, pages 282–289. January 2003.
[27] Sarfaraz Masood. Genre classification of songs using neural network. September 2014.
[28] Artificial neural network. https://en.wikipedia.org/wiki/Artificial_neural_network.
[29] B. Mehlig. Artificial neural networks, 2019.
[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
[31] Early stopping with PyTorch to restrain your model from overfitting. https://mc.ai/early-stopping-with-pytorch-to-restrain-your-model-from-overfitting/.
[32] Geoffroy Peeters. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. January 2004.
[33] Martin McKinney and Jeroen Breebaart. Features for audio and music classification. November 2003.
[34] Jesper Jensen, Mads Christensen, Manohar Murthi, and Søren Jensen. Evaluation of MFCC estimation techniques for music similarity. September 2006.
[35] Lianzhang Zhu, Leiming Chen, Dehai Zhao, Jiehan Zhou, and Weishan Zhang. Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors, 17(7), 2017.
[36] Xin Luo, Xuezheng Liu, Ran Tao, and Youqun Shi. Content-based retrieval of music using mel frequency cepstral coefficient (MFCC). 2015.
[37] Sung-Hwan Shin. Comparative study of the commercial software for sound quality analysis. Acoustical Science and Technology, 29:221–228, January 2008.
[38] Jan Stepanek and Ondrej Moravec. Possibility of application of objective psychoacoustic metrics on musical signals. 2006.
[39] N. Scaringella, G. Zoia, and D. Mlynek. Automatic genre classification of music content: a survey. IEEE Signal Processing Magazine, 23(2):133–141, March 2006.
[40] Igor Vatolkin and Wolfgang Theimer. Introduction to methods for music classification based on audio data. January 2020.
[41] Anssi P. Klapuri, Antti J. Eronen, and Jaakko T. Astola. Analysis of the meter of acoustic musical signals, pages 342–355. 2004.
[42] Essentia: an audio analysis library for music information retrieval. In International Society for Music Information Retrieval Conference (ISMIR'13), pages 493–498, Curitiba, Brazil, November 2013.
[43] Sunil Karamchandani, Prathmesh Matodkar, Suraj Iyer, and Nirav Gori. Score formulation and parametric synthesis of musical track as a platform for big data in hit prediction, pages 363–374. January 2018.
[44] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.
[45] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2001.
[46] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.
[47] Thierry Bertin-Mahieux, Daniel Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), pages 591–596, January 2011.
[48] David Hauger, Andrej Kosir, Marko Tkalčič, and Markus Schedl. The Million Musical Tweets dataset: What can we learn from microblogs. November 2013.
[49] Anupama Aggarwal. MSD: Getting the dataset. http://millionsongdataset.com/pages/getting-dataset/, 2012.
[50] Markus Schedl. The LFM-1b dataset for music retrieval and recommendation. pages 103–110, June 2016.
[51] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. 2019.
[52] Bob Sturm. The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. June 2013.
[53] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: A dataset for music analysis. December 2016.
[54] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: A Dataset For Music Analysis. https://github.com/mdeff/fma, 2017.
[55] Brian McFee, Eric J. Humphrey, and Juan Pablo Bello. A software framework for musical data augmentation. 2015.
[56] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. 2015.
[57] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, Jose Zapata, and Xavier Serra. Essentia: An open-source library for sound and music analysis. Proceedings of the 21st ACM International Conference on Multimedia, October 2013.
[58] Benoît Mathieu, Slim Essid, Thomas Fillon, Jacques Prado, and Gaël Richard. YAAFE, an easy to use and efficient audio feature extraction software. January 2010.
[59] Yading Song, Simon Dixon, and Marcus Pearce. A survey of music recommendation systems and future perspectives. June 2012.
[60] Bob L. Sturm. Classification accuracy is not enough. Journal of Intelligent Information Systems, 41(3):371–406, December 2013.

TRITA-EECS-EX-2020:847

www.kth.se