Recommend Songs with Data from Spotify Using Spectral Clustering
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021

Recommend Songs With Data From Spotify Using Spectral Clustering

DANIEL BARREIRA
NAZAR MAKSYMCHUK NETTERSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Abstract

Spotify, one of the world's biggest music services, posted a data set and an open-ended challenge for music recommendation research. The goal of this study is to recommend songs to playlists from the given Spotify data set using spectral clustering. While the given data set had 1 000 000 playlists, spectral clustering was performed on a subset of 16 000 playlists due to a lack of computational resources. With four different weighting methods describing the connection between playlists, the study shows reasonable clusters in which playlists of similar categories were clustered together, although most of the results also had one very large cluster in which many different sorts of playlists were clustered together. The conclusion was that the data was overly connected as an effect of our weighting methods. While the results show that it is possible to recommend songs to a limited number of playlists, hierarchical clustering could possibly help to recommend songs to a larger number of playlists, but that is left for future research to conclude.

Summary

Spotify, which is one of the world's biggest music services, published data and an open challenge for research in music recommendation. The goal of this study is to recommend songs to a playlist from the given Spotify data using cluster analysis. Although the published data set had 1 000 000 playlists, cluster analysis was performed on 16 000 playlists due to a lack of computational capacity. With four different weightings on the graph of playlists, the study shows results with reasonable clusters in which playlists of similar categories were clustered together.
However, in most cases the result contained one very large cluster in which many different types of playlists were clustered together. The conclusion was that the data used was overly connected as an effect of the weightings used. Although the results show that it is possible to recommend songs to a limited number of playlists, hierarchical clustering could possibly help in recommending songs to a larger number of playlists.

Acknowledgement

We would like to express our sincere gratitude to our supervisors Emil Ringh and Parikshit Upadhyaya. Parik, without your help we would probably still have been stuck on the difference between the unnormalized and the normalized Laplacian. Emil, without you detecting some of our computational flaws and teaching us effective ways to find them, we would probably still be doing simulations. Without the encouragement and continuous feedback from you two this project would have been a lot harder. Thank you.

Authors
Daniel Barreira, [email protected]
Nazar Maksymchuk Netterström, [email protected]
Degree Programme in Technology
KTH Royal Institute of Technology

Place for Project
Stockholm, Sweden

Examiner
Gunnar Tibert
Vehicle Technology and Solid Mechanics, KTH Royal Institute of Technology

Supervisor
Emil Ringh
Parikshit Upadhyaya
Department of Mathematics, Numerical Analysis, KTH Royal Institute of Technology

Contents
1 Introduction
1.1 The data
1.2 Clustering
1.3 Limitations
2 Method
2.1 Graph Theory
2.2 Eigenvalue and Eigenvectors
2.3 k-means algorithm
2.4 Spectral Clustering
2.5 Different approaches to weighting
2.6 Recommending songs from clustered graph
3 Results
3.1 Results from weighting 1: Percentage of similar song
3.2 Results from weighting 2: Percentage of similar artists
3.3 Results from weighting 3: A constructed function
3.4 Results from weighting 4: A constructed function
3.5 Results from random samples
3.6 Song recommendation to a playlist
4 Discussion
4.1 Conclusion
4.2 Future work

1 Introduction

We live in an age where there is an overflow of data. All this data presents us with a spectrum of problems as well as opportunities. Music is an important cultural part of our society, and it is one of the fields that can take advantage of the opportunities that arise with the data overflow. From the perspective of both musicians and listeners there are initiatives to be taken. As a musician, you probably want your music to reach as many listeners as possible, and as a listener you want a diverse pool of music that matches your interests.

All this leads up to the purpose of our project, a challenge presented by AIcrowd called the "Spotify Million Playlist Dataset Challenge" [3]. The challenge is to create an automatic playlist continuation by recommending 500 songs, ordered by decreasing relevance. In this paper the challenge is approached using spectral clustering. Spectral clustering is a method for clustering in the eigenspace of a graph Laplacian. Clustering is a method of analyzing data that is widely used in many fields, such as statistics and computer science. The aim of this project is to see whether one can find clusters of playlists and use these clusters not only to suggest music that directly correlates with the user's playlist, but also to find songs through connections within the cluster.

Problem formulation
From a data set given by Spotify, our study focuses on analysing how effective spectral clustering is when recommending songs.
1.1 The data

The data set is sampled from over 4 billion public playlists on Spotify and consists of one million playlists. There are over 2 million unique tracks in the data, by nearly 300 000 artists. The data set was collected from US Spotify users between 2010 and 2017. Furthermore, the data contains roughly 66 million tracks in total, of which 2.26 million are unique.

The given data has a few different attributes. An example is shown in Figure 1 for a playlist with 58 tracks. It shows what is given for every playlist and for every track. The attributes mainly used in this paper are the ones marked with a red ring.

Figure 1: An illustrative extraction of a part of playlist 661 from the data set.

1.2 Clustering

Clustering is a way of understanding information by dividing data into different groups. The point is to define connections between data points based on similarity, and by proxy to remove non-essential data, also known as noise. By doing this one is able to detect different patterns and thus to analyze the given data. The applications of clustering are numerous, and it is a widely used method for starting to analyze big sets of data with machine learning [8], [7].

There is a variety of clustering algorithms, for example ε-neighborhood, k-means and density-based clustering. This paper revolves around spectral clustering and how effective it is when clustering Spotify playlists.

The reason why there are many different methods of clustering is the variety of data sets. For example, consider the different sets of data points in Figures 2 and 3.

Figure 2: data set one
Figure 3: data set two

As seen in Figure 2 and Figure 3, the structure of the data points is different. This means that some clustering methods will perform better than others.
By using density-based clustering or spectral clustering one can get "correct" results on both data sets shown in Figures 2 and 3, but k-means clustering is only straightforward to apply to the data set shown in Figure 2. To understand why, a detailed understanding of the different algorithms is needed. Before explaining the algorithms, some theory is needed.

1.3 Limitations

The given data set consists of a lot of data. Due to a lack of time and a lack of computational resources, the entire data set could not be analyzed. Furthermore, the challenge itself is not attempted; the study only shows how spectral clustering could work for the challenge and aims to understand the general structure of the data.

2 Method

As mentioned before, the main method in this paper is spectral clustering. Before diving into the algorithm itself, some preliminary material is presented, starting with graph theory.

2.1 Graph Theory

A graph $G$ is defined as a collection of $i$ nodes $N = \{n_1, \dots, n_i\}$ and $k$ edges $E = \{e_1, \dots, e_k\}$. A node represents a data point and an edge represents the connection or relationship between two nodes [5]. We write

\[ G = (N, E). \tag{1} \]

Furthermore, a graph can be either directed or undirected.

Undirected and directed graph
A graph is undirected when an edge between two arbitrary nodes $n_i$ and $n_j$ is the same regardless of direction:

\[ e_{ij} \in E \Rightarrow e_{ji} \in E. \tag{2} \]

An undirected graph is also called a symmetric graph. For a directed graph the direction matters. The next step is to explain how nodes are connected to each other via edges.

Connectivity of a graph
A graph is called connected when it is possible to walk from any node to every other node. Otherwise the graph is not connected and consists of several connected subgraphs. The number of these subgraphs is defined as the multiplicity $M$, where $M \geq 1$ [4]:

\[ M(G) := \text{multiplicity} = \text{number of subgraphs}. \tag{3} \]

The connectivity of a graph can be represented by different sorts of matrices.
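The multiplicity in (3) can be illustrated with a small sketch. The code below is not taken from the study; it assumes an undirected graph given as a list of node pairs, with nodes numbered from 0 rather than from 1, and the function name `count_components` is hypothetical. It counts the connected subgraphs with a breadth-first search.

```python
from collections import deque


def count_components(num_nodes, edges):
    """Return M(G): the number of connected subgraphs of an undirected graph.

    Assumption for this sketch: nodes are labelled 0 .. num_nodes - 1 and
    `edges` is a list of (a, b) pairs.
    """
    # Build neighbour sets; each edge is stored in both directions,
    # which is exactly the symmetry condition of an undirected graph.
    neighbours = {n: set() for n in range(num_nodes)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        # An unvisited node starts a new connected subgraph.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components


# Nodes 0-1-2 form one subgraph and nodes 3-4 another, so M(G) = 2.
print(count_components(5, [(0, 1), (1, 2), (3, 4)]))
```

A graph is connected in the sense above exactly when this count equals 1.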
Adjacency matrix and weights
The graph can be represented by a matrix called the unweighted adjacency matrix $A_{uw}$, which is defined as follows:

\[ A_{uw}(k, j) = \begin{cases} 1, & \text{if there is an edge between nodes } k \text{ and } j \\ 0, & \text{if there is no edge between nodes } k \text{ and } j \end{cases} \tag{4} \]

Figure 4: Example of an unweighted graph G and the associated adjacency matrix

A graph can be either unweighted (as seen in Figure 4) or weighted, meaning that the values on the edges vary.
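As a minimal sketch of definition (4), the following code builds the adjacency matrix of an undirected graph from an edge list. It is an illustration only, not the implementation used in the study: the function name `adjacency_matrix` and the optional `weights` argument are hypothetical, and nodes are numbered from 0. Without weights every edge gets the value 1, which corresponds to $A_{uw}$; with weights the entries vary per edge.

```python
def adjacency_matrix(num_nodes, edges, weights=None):
    """Build the adjacency matrix of an undirected graph as a list of lists.

    Assumptions for this sketch: nodes are labelled 0 .. num_nodes - 1,
    `edges` is a list of (k, j) pairs, and `weights`, if given, holds one
    value per edge in the same order.
    """
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for idx, (k, j) in enumerate(edges):
        w = 1 if weights is None else weights[idx]
        # The graph is undirected, so the matrix is symmetric.
        A[k][j] = w
        A[j][k] = w
    return A


edges = [(0, 1), (1, 2)]

# Unweighted case: entries are 1 where an edge exists and 0 otherwise.
for row in adjacency_matrix(3, edges):
    print(row)

# Weighted case: the same edges with illustrative (made-up) edge values.
for row in adjacency_matrix(3, edges, weights=[0.5, 0.2]):
    print(row)
```

Storing the matrix as nested lists keeps the sketch free of dependencies; for graphs with thousands of nodes a sparse matrix representation would likely be more practical.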