
Exploring last.fm community music listening habits for automatic artist recommendation

Diogo Costa

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias, s/n 4200-465 Porto, [email protected]

Abstract. Using the listening habits of a sample of users from the last.fm community and a set of tags describing each artist (folksonomy), we intend to group users with similar listening habits and, inside each group, discover artist associations of the kind "someone who listens to A and B also listens to C". The final goal is to produce a recommendation system for music artists based solely on the tastes or behavior of other users in the community.

Keywords: Music Brainz Recommendation System last.fm Clustering Association Rules Community Dataset

1 Introduction

A recommendation system makes predictions for a given user by comparing that user's profile with existing user profiles. Its objective is to recommend information that the user may find valuable but may not yet know. These profiles can be built implicitly, by keeping track of a user's actions, or explicitly, by asking the user to rate information. With the increasing amount of information about user habits on the web, recommendation systems are ever more present in our everyday online life. Examples of such systems are Last.fm^1, Amazon^2 and Netflix^3.

Building a recommendation system poses several challenges: an extensive dataset is needed to make accurate predictions, and we need to choose the attributes that best describe our users for our objectives. Moreover, the system can only make predictions using statistics based on the dataset, which means that a recommendation may not be perfect, and seldom is.

Using data on the listening habits of 360 thousand last.fm users and the folksonomy made available by MusicBrainz, we intend to group users with similar tastes in music and, for each group, find strong association rules between artists.

1 http://www.last.fm/
2 http://www.amazon.com/
3 http://www.netflix.com/

Music recommendation based on music listening habits

This report starts by mentioning related work and identifying the data mining algorithms used. It then follows a CRISP^4-like structure to describe the several data mining stages, and ends with the conclusions.

2 Related Work

Much has been done in recommendation systems, and music recommendation is no exception. Music recommendation also relies heavily on user-generated data from sites like Last.fm, and folksonomy is becoming increasingly important in this process, being used to automatically assign artists to genres [1] and to create taxonomies [1].

3 Algorithms used

3.1 z-score normalization

The z-score, also called the standard score, measures how many standard deviations a value lies from the mean. Because the normalization is relative to the mean of the values, the z-scores of values close to the mean are not much influenced when there are only a few outliers. We used it to normalize our attributes before running the clustering algorithms.
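As an illustration of the normalization described above, the following minimal sketch computes z-scores for one attribute. It assumes the population standard deviation; which variant the normalization step in RapidMiner uses is not stated here, and the function name and sample values are purely illustrative.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = statistics.pstdev(values)
    if std == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Toy attribute values, not taken from the dataset.
scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)
```

After normalization every attribute has mean 0 and standard deviation 1, so attributes measured on very different scales contribute comparably to the distances used by the clustering step.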

3.2 K-Means clustering

The k-means algorithm partitions individuals into K groups, minimizing the differences between elements of the same group while maximizing the differences between individuals of different groups. The K parameter is critical to the result of the algorithm, because it is the number of groups representing our individuals. If we have too few groups we may not be able to distinguish individuals who are very dissimilar, and if we have too many groups we will not be able to group individuals who are similar. Each group is represented by a point in an N-dimensional space, N being the number of attributes of each individual; this point is called the centroid of the group.

Roughly, the k-means algorithm starts by choosing K random centroids. For each individual we compute the distance to every centroid and assign the individual to the closest one. We then move each centroid to the average position of the individuals in its group, and repeat the process until the centroids converge to a given position. This process is illustrated in Figure 1. There are several proximity measures between individuals and centroids, such as Euclidean distance, Manhattan distance or cosine similarity.

4 http://www.crisp-dm.org/

Fig. 1. K-Means process visualization

A few notes about the k-means algorithm. The initial position of the centroids influences the end result, so we should, when possible, run the algorithm several times and choose the run that shows the best results. Also, to find the best value of K we run the algorithm several times for different values of K and measure the quality of each result, using for instance the average centroid distance. In the limit, when K equals the number of individuals, the average centroid distance is zero, so we try to choose the largest K that still significantly reduces the average centroid distance.
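The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the RapidMiner implementation used in this work: it assumes Euclidean distance, picks the initial centroids among the data points, and the function names, the seed and the toy points are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=100, seed=0):
    """Plain k-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every individual to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centroids have converged
        assignment = new_assignment
        # Move each centroid to the average position of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


def average_centroid_distance(points, centroids, assignment):
    """Mean distance between each individual and the centroid of its group."""
    total = sum(euclidean(p, centroids[a]) for p, a in zip(points, assignment))
    return total / len(points)


# Toy two-dimensional individuals, not taken from the dataset.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = k_means(points, k=2)
acd = average_centroid_distance(points, centroids, assignment)
print(assignment, acd)
```

Running `k_means` with different seeds and different values of `k`, and comparing the resulting `average_centroid_distance`, mirrors the two notes above about initialization and the choice of K.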

3.3 Rule association with FP-Growth

The Frequent Pattern Growth (FP-Growth) algorithm tries to improve on the Apriori algorithm by reducing candidate generation.

4 Business Understanding

The goal of this work is to produce a music recommendation system based on the listening habits of the last.fm community. It is based on objective user behavior and on music artist folksonomy, not on audio signal processing or on rules based on potentially biased opinions.

User behavior records which artists a given user listened to, and how many times. This has been collected implicitly via last.fm's scrobbling applications. Figure 2 shows part of the listening habits of a given last.fm user.

Folksonomy is a system of classification based solely on collaborative tagging by users in an uncontrolled environment. Given enough contributions a vocabulary surfaces and, in the case of tagging artists, keywords relevant to a given artist tend to emerge. Figure 3 shows the collection of tags associated with the artist Coldplay. Note, however, that these tags are not music genres, although they appear to be. They are just words used to describe the artist Coldplay, some of which are indeed names of genres.

Fig. 2. User listening history

5 Data Understanding

We started by using a public dataset^5, collected via last.fm's API, containing some of the listening habits of approximately 360 thousand users.

The plan was to group users based on the artists they listened to, but computational limitations arose due to the size of the dataset. Because we have approximately 360000 users (vectors) and 390000 artists (dimensions), naively assuming that each component of a vector would occupy no more than 32 bits, we would need 523GB of memory just to store all the initial data. Even assuming binary attributes we would still need 16GB.

We could either reduce the number of users dramatically or discretize the listened artists somehow. Instead of using artists directly, we represent them by their tags. In effect we group users based not on the artists themselves, but on the tags associated with those artists. Since the original dataset did not contain folksonomy about artists, we collected a second dataset from the MusicBrainz^6 database. Additionally, the MusicBrainz database has extensive descriptions of artists, labels, etc.
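The memory figures quoted above can be checked with a short calculation. The sketch below assumes a dense users-by-artists matrix and that the quoted sizes are expressed in binary units (1 GB taken as 1024^3 bytes); the variable names are illustrative.

```python
USERS = 360_000    # approximate number of users (vectors)
ARTISTS = 390_000  # approximate number of artists (dimensions)
GIB = 1024 ** 3    # assumption: sizes are quoted in binary gigabytes

cells = USERS * ARTISTS

# Dense matrix with one 32-bit (4-byte) value per cell.
dense_32bit = cells * 4 / GIB

# Dense matrix with one bit per cell (binary attributes).
dense_binary = cells / 8 / GIB

print(f"32-bit cells: {dense_32bit:.0f} GB, binary cells: {dense_binary:.1f} GB")
```

Both estimates ignore the fact that the matrix is very sparse; a sparse representation would be much smaller, but a dense in-memory matrix is assumed here.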

5.1 Last.fm data

The dataset containing habits from last.fm is stored in 2 TSV^7 files. It consists of a set of users, and for each user there is a set of lines representing how many times that user has listened to a given artist. One file contains information about users, and the other about each user's listening habits. Table 1 shows the number of users and the total number of lines.

5 http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-360K.html
6 http://www.musicbrainz.org/
7 Tab Separated Values

Fig. 3. Folksonomy - Tags associated to artist Coldplay

Table 1. Number of users and habits

Unique users    359,347
Total habits    17,559,530

Tables 2 and 3 describe in more detail the attributes of users and their habits.

Table 2. Users attributes and scale

Attribute        Scale      Missing Values
user-mboxsha1    Nominal    No
gender           Nominal    Yes
age              Ratio      Yes
country          Nominal    No
signup           Interval   No

Regarding the gender attribute, Figure 4 shows that there are more male users. As for the age attribute, the mean and median are very close, and the second and third quartiles (the middle 50% of users) lie between 20 and 28, which tells us that the majority of users are young adults. The difference between the median and the mean in the plays attribute can mean two things: either most people listen few times to some artists and a lot to their favorite ones, or most users listen few times to many artists and some users listen a lot to some artists.

Table 3. Habits attributes and scale

Attribute               Scale      Missing Values
user-mboxsha1           Nominal    No
musicbrainz-artist-id   Nominal    Yes
artist-name             Nominal    No
plays                   Ratio      Yes

Fig. 4. Characterization of several dataset attributes

5.2 MusicBrainz Data

Table 4 characterizes the data we use from this dataset, namely the artist folksonomy information.

Table 4. MusicBrainz data

Unique Artists    572867
Unique Tags       15725

6 Data Preparation

With the two datasets mentioned above imported into MySQL, and since the original dataset had MusicBrainz artist identifiers, we could connect the two datasets and build a third dataset in MySQL with the structure represented in Figure 5. During this transformation we removed invalid records, such as users with artists with no tags in the listening habits table. After cleaning the data we built a CSV file for clustering in RapidMiner. Since we want to group users based on their tastes in music, we model this with a matrix in which rows are users and columns are tags; the value relating them represents the importance of a given tag for a given user. This matrix is described more formally in Equation 1.

\[
M_{u,t} = \sum_{i=0}^{\mathrm{artists}} \mathrm{plays}_i \ast \mathrm{weight}_{t,i} \tag{1}
\]

Fig. 5. Final dataset imported in MySQL

If an artist i does not have a tag t, the weight is zero. The weight of a tag t for an artist i comes from the MusicBrainz dataset and is already normalized. We can also view the metric in Equation 1 as a component of a vector of N dimensions representing a user u, where N is the total number of tags. These vectors are the individuals to which we apply the K-Means algorithm.

Finally, we used a Python^8 script to create a CSV file representing this matrix and loaded it into RapidMiner^9 for normalization and clustering (Figure 8). Even using approximately half of the users in the dataset we could not cluster because of insufficient memory on a 6GB PC, so we used only 10 thousand random users from the original 350 thousand.

After clustering we updated the original dataset in Figure 5 with the corresponding cluster for each user. We then isolated the habits of two different clusters with different dimensions, in the hope of seeing association rules about completely different artists. The corresponding dataset was stored locally in MySQL, as shown in Figure 6.
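The user-by-tag values of Equation 1 can be sketched as follows. This is not the original script: the function name, the input layout (tuples of user, artist and plays, plus a mapping from artist to tag weights) and the toy values are assumptions made for illustration.

```python
from collections import defaultdict


def user_tag_vectors(habits, tag_weights):
    """Accumulate plays * weight per (user, tag), as in Equation 1.

    habits:      iterable of (user, artist, plays) tuples
    tag_weights: dict mapping artist -> {tag: normalized weight}
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for user, artist, plays in habits:
        # An artist without a given tag contributes a weight of zero,
        # so only the tags the artist actually has need to be visited.
        for tag, weight in tag_weights.get(artist, {}).items():
            matrix[user][tag] += plays * weight
    return {user: dict(tags) for user, tags in matrix.items()}


# Hypothetical toy data, not taken from the datasets.
habits = [("u1", "a1", 10), ("u1", "a2", 4), ("u2", "a2", 2)]
tag_weights = {
    "a1": {"rock": 0.5, "indie": 0.5},
    "a2": {"rock": 0.25, "pop": 0.75},
}
vectors = user_tag_vectors(habits, tag_weights)
print(vectors)
```

Each entry of `vectors` is one row of the matrix, stored sparsely; tags missing from a user's dictionary correspond to a value of zero.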

Fig. 6. Dataset updated after clustering

8 http://www.python.org/
9 http://rapid-i.com/

Finally we wrote another Python script that created two CSV files, each representing a matrix of users and artists for one of the two selected clusters. This time we can use the artists as attributes, because inside each cluster the total number of artists and the total number of users make a matrix that fits into the memory of today's average PC. This matrix is entirely binary, representing only whether a user u listens to an artist a.
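A minimal sketch of how such a binary user-by-artist matrix could be built for one cluster is shown below. It is not the original script; the function name and the toy habits are hypothetical, and writing the rows to a CSV file is left out.

```python
def binary_user_artist_matrix(habits):
    """Build a 0/1 matrix from (user, artist) pairs.

    Returns (users, artists, rows), where rows[i][j] is 1 if
    users[i] listened to artists[j] and 0 otherwise.
    """
    users = sorted({user for user, _ in habits})
    artists = sorted({artist for _, artist in habits})
    listened = set(habits)
    rows = [
        [1 if (user, artist) in listened else 0 for artist in artists]
        for user in users
    ]
    return users, artists, rows


# Hypothetical toy habits for a single cluster.
habits = [("u1", "coldplay"), ("u1", "oasis"), ("u2", "coldplay")]
users, artists, rows = binary_user_artist_matrix(habits)
print(artists)
for user, row in zip(users, rows):
    print(user, row)
```

Each row can then be read as one transaction (the set of artists a user listens to), which is the input format expected by association rule learners.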

7 Modeling

7.1 Clustering

With the dataset of 10 thousand users ready for clustering, we first tried to find an appropriate K parameter for K-Means. We ran the algorithm several times for various K values and recorded the Average Centroid Distance (ACD) (Figure 7). Seeing that for values of K bigger than 150 the ACD starts to decrease less, we decided to group the 10000 users into 150 groups. Having chosen the parameter K, we clustered the users using RapidMiner (Figure 8).

Fig. 7. Average Centroid Distance

Fig. 8. K-Means clustering in RapidMiner software

7.2 Association rules

After getting the result of the clustering we went back to the Data Preparation phase to prepare the data for association rules in each cluster. We then used Weka^10 to create the association rules for each cluster. We tried to keep the minimum support low, to generate enough candidates, and ordered the rules by their conviction.
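To make the measures mentioned above concrete, the sketch below computes support, confidence and conviction for a single rule over a list of transactions. It uses the usual textbook definitions and assumes that Weka's values follow them; the function names and the toy transactions are hypothetical, and no rule mining (FP-Growth or otherwise) is performed here.

```python
import math


def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """supp(antecedent and consequent) / supp(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def conviction(transactions, antecedent, consequent):
    """(1 - supp(consequent)) / (1 - conf(antecedent => consequent))."""
    conf = confidence(transactions, antecedent, consequent)
    if conf == 1:
        return math.inf  # the rule never fails in these transactions
    return (1 - support(transactions, consequent)) / (1 - conf)


# Toy transactions: each set holds the artists one user listens to.
transactions = (
    [{"a", "b", "c"}] * 3   # users who listen to A, B and C
    + [{"a", "b"}]          # a user who listens to A and B only
    + [{"c"}]               # a user who listens to C only
    + [{"d"}] * 3           # users who listen to none of them
)

rule_support = support(transactions, {"a", "b", "c"})
rule_confidence = confidence(transactions, {"a", "b"}, {"c"})
rule_conviction = conviction(transactions, {"a", "b"}, {"c"})
print(rule_support, rule_confidence, rule_conviction)
```

A conviction above 1 indicates that the rule fails less often than it would if antecedent and consequent were independent, which is why ordering rules by conviction puts the more dependable ones first.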

8 Evaluation and Deployment

Cluster112 was the cluster with the most users (6068), and by inspecting the artists in the established association rules we find some very plausible associations.

beyoncé ==> rihanna
snow patrol ==> coldplay
iron maiden ==> metallica
britney spears ==> rihanna
rihanna ==> britney spears
led zeppelin ==> the beatles
==> the beatles
guns n' roses ==> metallica
rihanna ==> beyoncé
led zeppelin ==> pink floyd
ac/dc ==> metallica
oasis ==> coldplay
u2 ==> coldplay
==> the beatles
the killers ==> coldplay

We found the most interesting cluster, however, to be Cluster6. Cluster6 had fewer users and was chosen in the hope of detecting users with a very specific taste in music. Table 5 shows some of the tags retrieved from last.fm about the artists in the association rules.

lars winnerbäck ==> håkan hellström
håkan hellström, billie the vision & the dancers ==> lars winnerbäck
the hives ==> mando diao, kings of leon ==> the kooks
jens lekman ==> håkan hellström
==> death cab for cutie
bright eyes ==> death cab for cutie
kaiser chiefs ==> mando diao, the killers
coldplay, lars winnerbäck ==> håkan hellström
coldplay, maxïmo park ==> mia.
mia. ==> coldplay, maxïmo park
billie the vision & the dancers, lars winnerbäck ==> håkan hellström
coldplay, maxïmo park ==> mando diao, mia.
mando diao, coldplay, maxïmo park ==> mia.
mia. ==> mando diao, coldplay, maxïmo park

10 http://www.cs.waikato.ac.nz/ml/weka/

The most relevant tags for each artist according to last.fm are in Table 5. We can see that tags like swedish, indie,

Table 5. Relevant tags for artists in cluster 6

Artist                            Tags
Billie the Vision & The Dancers   swedish, indie pop, indie, pop, singer-songwriter
lars winnerbäck                   swedish, pop, indie, indie pop, singer-songwriter
lars winnerbäck                   swedish, singer-songwriter, pop, rock, svenskt
Mando Diao                        rock, indie, , swedish, alternative
The Hives                         rock, garage rock, indie, swedish, punk
Kings of Leon                     indie, rock, indie rock, alternative, southern rock
The Kooks                         indie, indie rock, british, britpop, alternative
Kaiser Chiefs                     indie, indie rock, britpop, rock, alternative
Maxïmo Park                       indie, indie rock, british, alternative, rock
Mia.                              german, electropop, deutsch, female vocalists, alternative

Also note that last.fm confirms the similarity found in the first association rule. This is illustrated in Figure 9.

Fig. 9. Similarity between lars winnerbäck and håkan hellström

As a curiosity we profiled the users in this cluster and, as expected, we have many users from Sweden, Germany and the UK. This is consistent with the tags swedish, german and british. The United States also has a share, but this may be because a large portion of last.fm users are American.

+----------+----------------+
| count(*) | country        |
+----------+----------------+
|       85 | Austria        |
|       41 |                |
|       26 | Czech Republic |
|       33 | Denmark        |
|       29 | France         |
|      361 | Germany        |
|       23 | Japan          |
|       32 | Philippines    |
|       12 | Serbia         |
|      342 | Sweden         |
|       43 |                |
|      162 | United States  |
+----------+----------------+
12 rows in set (0.01 sec)

9 Conclusions

The time spent transforming and preparing the dataset for the modeling phase was much greater than anticipated. One reason for this was that we did not have enough computational resources to carry out our original plan, and had to use information from a different dataset to circumvent this problem.

Also due to time and computational constraints, association rules were generated for only two clusters, and only 10000 users were clustered. This makes the model very unstable and prone to oscillate with new users.

We found that for a user in one of those two clusters we can make some relatively accurate predictions about what they might like based on their past music habits. Because we generated the association rules inside each cluster, we might not be finding associations between artists that are so dissimilar from each other that they never appear in the same cluster; we are therefore promoting the discovery of new artists very similar to our tastes. This should be complemented with some artists from other genres, using globally computed association rules.

Finding computational resources was also very difficult, due to the memory restrictions most computers present with these datasets.

References

1. The Quest for Musical Genres: Do the Experts and the Wisdom of Crowds Agree? Mohamed Sordo, Òscar Celma, Martin Blech, Enric Guaus