Truncated SVD-Based Feature Engineering for Music Recommendation: KKBOX's Music Recommendation Challenge at ACM WSDM Cup 2018
Nima Shahbazi, York University, [email protected]
Mohamed Chahhou, Sidi Mohamed Ben Abdellah University, [email protected]
Jarek Gryz, York University, [email protected]

ABSTRACT

This year's ACM WSDM Cup asked the ML community to build an improved music recommendation system using a dataset provided by KKBOX. The task was to predict the chance that a user would listen to a song repetitively after the first observable listening event within a given time frame. We cast this problem as a binary classification problem and addressed it using gradient boosted decision trees. To overcome the cold start problem, a notorious problem in recommender systems, we created truncated SVD-based embedding features for users, songs and artists. Using the embedding features together with four kinds of statistical features (user-, song-, artist- and time-based), our model ranked second in the ACM challenge. No music domain knowledge was needed to create the features; we relied only on information gain and prediction accuracy for feature selection.

CCS CONCEPTS

• Information systems → Recommender systems

KEYWORDS

Truncated SVD, Cold start, Gradient boosting

1 INTRODUCTION

Until recently, people listened to their own local music libraries and rarely curated collections online. But the era of local music libraries is over. Personalization algorithms and unlimited streaming services like YouTube, Spotify, etc. are emerging. People now listen to all kinds of music, yet algorithms still struggle in some key areas. Without enough historical data, how can an algorithm predict whether a listener will like a new song or a new artist? And how can it recommend songs to brand-new users?

The popularity of online content services and social media has demonstrated the value of providing relevant information to users. Recommender systems have proven to be an effective tool for this purpose and are receiving increasingly more attention. Moreover, the amount of available training data has recently increased enormously, and advances in hardware (like GPUs) have made it possible to tackle these problems in a reasonable amount of time.

One common approach to building an accurate recommender model is collaborative filtering (CF). CF, used widely across various domains [9], exploits the overall behavior or taste of other users to suggest relevant items to a specific user. Many web services such as Netflix, Spotify, YouTube and KKBOX¹ use CF to deliver personalized recommendations to their customers.

The ACM WSDM Cup challenged the ML community to improve the KKBOX model and build a better music recommendation system using their dataset². KKBOX currently uses a CF-based algorithm with matrix factorization but expects that other techniques could lead to better music recommendations.

In this task, we want to predict songs which a user will truly like. Intuitively, if a user much enjoys a song, s/he will listen to it repetitively. We are asked to predict the chance of a user listening to a song repetitively after the first observable listening event within a given time window. If there are recurring listening events triggered within a month after the user's very first observable listening event, the target is marked 1, and 0 otherwise. More formally, assume an event E(U, S, T1) in which a user U listened to a song S at time T1. If we observe a subsequent event E'(U, S, T2) where T2 - T1 < 1 month (that is, repetitive listening within one month), event E is marked 1; otherwise it is marked 0. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

For this competition, KKBOX provided a training data set that consists of information on the first observable listening event for each unique user-song pair (E) within a specific time frame. The training and test data are selected from users' listening history in a given time period and contain around 7 and 2.5 million unique user-song pairs, respectively. Although the training and test sets are split based on time and ordered chronologically, the timestamps for train and test are not provided. It is worth mentioning that this structure also suffers from the cold start problem: 14.5% of the users and 26.6% of the songs in the test set do not appear in the training data. Table 1 contains statistics on users and songs in the training and test data provided by KKBOX which clearly show the existence of the cold start problem. Hence the main task is to predict whether or not a new listener will like a new song or a new artist. We used embedding features to overcome this problem (described in detail in Section 2.1).

¹ KKBOX is one of Asia's leading music streaming services, holding the world's most comprehensive Asia-Pop music library with over 30 million tracks.
² https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data

Table 1: User and song distribution in the train and test sets

Train                                      Test
7,377,418 events                           2,556,790 events
30,755 unique users                        25,131 unique users
359,966 unique songs                       224,753 unique songs
9,272 users in train but not in test       3,648 users in test but not in train (14.51% new users)
195,086 songs in train but not in test     59,873 songs in test but not in train (26.64% new songs)

2 OUR APPROACH

The performance of any supervised learning model relies on two principal factors: predictive features and an effective learning algorithm.
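As a rough illustration of the truncated SVD embedding idea mentioned in the introduction, the sketch below factorizes a toy user-song play matrix with scikit-learn's TruncatedSVD. This is not the authors' pipeline (their exact matrices, preprocessing and component counts are their own); the column names msno and song_id follow the KKBOX schema, and all data values are made up.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy listening events (hypothetical stand-in for the KKBOX train set).
events = pd.DataFrame({
    "msno":    ["u1", "u1", "u2", "u2", "u3"],   # user ids
    "song_id": ["s1", "s2", "s1", "s3", "s2"],
})

# Encode users and songs as integer indices.
users = events["msno"].astype("category").cat
songs = events["song_id"].astype("category").cat

# Sparse user x song matrix of listening-event counts.
mat = csr_matrix(
    (np.ones(len(events)), (users.codes, songs.codes)),
    shape=(len(users.categories), len(songs.categories)),
)

# Truncated SVD factorizes mat ~ U * diag(S) * Vt; the transformed rows
# serve as dense user embeddings, and Vt.T gives song embeddings that can
# be fed to a downstream classifier as features.
svd = TruncatedSVD(n_components=2, random_state=0)
user_emb = svd.fit_transform(mat)          # shape: (n_users, n_components)
song_emb = svd.components_.T               # shape: (n_songs, n_components)
print(user_emb.shape, song_emb.shape)      # -> (3, 2) (3, 2)
```

In the paper the same idea is applied to users, songs and artists; the dense factors stand in for raw identifiers, which is what lets the model say something about sparsely observed items.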
The feature engineering approach exploits domain knowledge to extract features from the training data set that should generalize well to the unseen data in the test set. The features are, in general, implicit information contained in the training data set which the algorithms are not able to extract by themselves. The quality and quantity of the features have a direct impact on the overall quality of the model.

Finding the relevant features requires understanding and subsequent analysis of the problem structure. The quality of a new feature is judged by first examining its information gain and then comparing the performance of the model before and after the feature is added. Unfortunately, finding such good features is not an easy task and is also computationally expensive. Most of the time, engineered features lead to only small improvements in the performance of the model; achieving a noticeable improvement requires engineering hundreds of features.

Table 2 summarizes all features provided by KKBOX. In this work, we developed a set of hand-crafted features which can be grouped into the following classes: session features, user-based statistical features, song-based statistical features, artist-based statistical features, and embedding features based on user, song, artist and metadata from the dataset. These features, combined with our learning algorithm, proved to be quite powerful and gave us the second-place ranking, with a score difference of 0.00094 from the first place.

2.1 Feature Engineering

Before turning to the feature engineering problem, we would like to highlight some important aspects of our analysis. We noticed that statistical features based on the target feature did not work well enough during the training stage. This is probably due to the dynamics involved in the data structure and the sampling strategy used by KKBOX, as shown in Figure 1.

Figure 1: Evolution of repeated listening (target) in time.

Figure 1 shows the evolution of the target mean over time. The plot is set up by aggregating the observations' target mean in bins of size 100,000. It can be seen from the figure that the target distribution is significantly decreasing over time, and consequently we should expect a target mean decline in the test dataset too.

Further analysis of the given features shows a strong correlation between the source_type feature and the target value. In Figure 2, we plot the distribution of song counts per source_type category. In Figure 3, we plot the same distribution but for the last 1 million observations in the train dataset, where we noticed a huge drop of re-listening in local_library and local_playlist, while the online_playlist counts have increased significantly. It seems that user behavior has changed over time: online users are more interested in what other users listened to via the online_playlist. As a result, they are more interested in discovering new songs than in re-listening to what they already liked. Still, we can see from the figure that listening to what other people like does not guarantee re-listening to the song, which explains why the target mean drops over time.

2.1.1 Session Features. The provided dataset does not include any timestamps or user session information. However, since the data is ordered chronologically, we were able to approximate user sessions using the following approach:

(1) merge the train and test datasets
(2) group by users
(3) for each user, shift the data by one to the right
(4) take the difference between the original and shifted indices and call it diff_ind
(5) small diff_ind values³ correspond to observations within the same session, and high values correspond to the start of a new session
(6) consequently, the diff_ind value for the start of a new session is assigned to the entire session

For example, suppose that user X has {1,2,3,4,20,21,23,26,100,105,111} indices in the data set.
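The six steps above can be sketched in pandas on the example indices for user X. The session-break cutoff is not given in the text, so the THRESHOLD value below is hypothetical, as is the reading of step (6) as a per-session broadcast of the opening gap:

```python
import pandas as pd

# User X's chronological row indices from the example in the text.
idx = [1, 2, 3, 4, 20, 21, 23, 26, 100, 105, 111]
df = pd.DataFrame({"msno": ["X"] * len(idx)}, index=idx)

# Steps (1)-(4): after merging train and test (elided here), group by user
# and take the difference between each row index and the previous one.
df["diff_ind"] = df.index.to_series().groupby(df["msno"]).diff()

# Step (5): a large gap marks the start of a new session.
THRESHOLD = 10  # hypothetical cutoff, not specified in the text
df["new_session"] = df["diff_ind"].isna() | (df["diff_ind"] > THRESHOLD)
df["session_id"] = df["new_session"].cumsum()

# Step (6): assign the session-starting diff_ind to the entire session
# (NaN for the very first session, which has no preceding event).
df["session_diff"] = df.groupby("session_id")["diff_ind"].transform(
    lambda s: s.iloc[0]
)

print(df["session_id"].tolist())  # -> [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```

Under this threshold, user X's sessions are {1,2,3,4}, {20,21,23,26} and {100,105,111}; every row in the second session receives session_diff = 16.0, the gap that opened it.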