The LFM-1B Dataset for Music Retrieval and Recommendation
Total Page:16
File Type:pdf, Size:1020Kb
The LFM-1b Dataset for Music Retrieval and Recommendation Markus Schedl Department of Computational Perception Johannes Kepler University Linz, Austria [email protected] ABSTRACT However, researchers interested in conducting experiments We present the LFM-1b dataset of more than one billion in music retrieval and recommendation on a large scale — in music listening events created by more than 120,000 users particular those working in academia — are facing the chal- of Last.fm. Each listening event is characterized by artist, lenge that they frequently have to acquire the data for their album, and track name, and further includes a timestamp. experiments themselves, which results in non-standardized collections, hindering reproducibility . While online music On the (anonymous) user level, basic demographics and a 1 2 3 selection of more elaborate user descriptors are included. platforms such as Spotify, Last.fm, or Soundcloud of- The dataset is foremost intended for benchmarking in mu- fer convenient API endpoints that provide access to their sic information retrieval and recommendation. To facili- databases, it takes a considerable amount of time to build tate experimentation in a straightforward manner, it also collections of substantial size necessary for large-scale eval- includes a precomputed user-item-playcount matrix. In ad- uation. The few publicly available datasets for these pur- dition, sample Python scripts showing how to load the data poses, the most well-known of which is probably the Million and perform efficient computations are provided. An imple- Song Dataset (MSD) [2], might represent an alternative, but mentation of a simple collaborative filtering recommender come with certain restrictions. For instance, while the MSD rounds off the code package. offers a great variety of pieces of information (among oth- We discuss in detail the LFM-1b dataset’s acquisition, ers, genre labels, tags, term weights of lyrics, song similar- availability, statistics, and content, and place it in the con- ity information, and aggregated playcount data), user- and text of existing datasets. We also showcase its usage in a listening-specific information is provided rather scarcely, on simple artist recommendation task, whose results are in- a high level, or in a summarized form only. tended to serve as baseline against which more elaborate Since we believe that listener-specific information is key techniques can be assessed. The two unique features of the to build personalized music retrieval systems, the focus and dataset in comparison to existing ones are (i) its substantial unique feature of the LFM-1b dataset presented here is de- size and (ii) a wide range of additional user descriptors that tailed information on the level of listeners and of listen- reflect their music taste and consumption behavior. ing events. For instance, the dataset provides user-specific scores about their music taste (among others, measures of mainstreaminess and inclination to listen to novel music). Keywords Next to this, the size of the dataset, which includes more Dataset, Analysis, Music Information Retrieval, Music Rec- than a billion listening events of approximately 120,000 users ommendation, Collaborative Filtering, Experimentation should be sufficient to perform experimentation on a large scale on real-world data. The paper is organized as follows. We first review re- 1. MOTIVATION lated datasets for music retrieval and recommendation (Sec- Research and development in music information retrieval tion 2). Subsequently, we outline the data acquisition pro- (MIR) and music recommender systems has seen a sharp cedure, present the dataset’s structure and content, pro- increase during the past few years, not least due to the pro- vide basic statistics and analyze them, and point to sample liferation of music streaming services [8, 5]. Having tens Python scripts that show how to access the components of of millions of music pieces available at the listeners’ finger- the dataset (Section 3). We further illustrate how to exploit tips requires novel retrieval, recommendation, and interac- the dataset for the use case of building a music recommender tion techniques for music. system that implements various recommendation algorithms (Section 4). We eventually round off the paper with a sum- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed mary and discussion of possible extensions (Section 5). for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the 2. RELATED WORK author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission The need for user-aware and multimodal approaches to and/or a fee. Request permissions from [email protected]. music retrieval and recommendation has been acknowledged ICMR'16, June 06 - 09, 2016, New York, NY, USA 1https://developer.spotify.com/web-api c 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. 2 ISBN 978-1-4503-4359-6/16/06. $15.00 http://www.last.fm/api 3 DOI: http://dx.doi.org/10.1145/2911996.2912004 https://developers.soundcloud.com/docs/api/guide many times and is meanwhile widely accepted [9, 11, 14, 15]. users, only including artists they most frequently listened to. However, respective scientific work is still in its fledgling The other subset offers full listening data of nearly a thou- stage. One of the reasons for this is that involving users, sand users, where each listening event is annotated with a which is an obvious necessity to build user-aware approaches, timestamp, artist, and track name. Both subsets include is time-consuming and hardly feasible on a large scale — at gender, age, country, and date of registering at Last.fm, as least not in academia. As a consequence, datasets offering provided by their API. user-specific information are scarce. Other datasets that are related to LFM-1b to a smaller On the other hand, thanks to evaluation campaigns in the extent include the AotM-2011 dataset of playlists extracted fields of music information retrieval and music recommenda- from Art of the Mix9 and the MagnaTagATune10 dataset [7] tion, including the Music Information Retrieval Evaluation of user-generated tags and relative similarity judgments be- eXchange4 (MIREX) and the KDD Cup 2011 5 [4], the re- tween triples of tracks. search community has been given several datasets that can In comparison to the datasets most similar to the one be used for a wide range of MIR tasks, from tempo estima- proposed here — the MSD and Celma’s [3] — the LFM-1b tion to melody extraction to emotion classification. Most of dataset offers the following unique features: (i) substantially these datasets, however, are specific to a particular task, e.g., more listening events, i.e., over one billion, in comparison onset detection or genre classification. What is more, for to roughly 48 and 19 million, respectively, for MSD and content-based or audio-based approaches, the actual audio Celma’s [3]; (ii) exact timestamps of each listening event, can typically not be shared, because of restrictions imposed unlike MSD; (iii) demographic information about listeners by intellectual property rights. in an anonymous way, unlike MSD; and (iv) additional infor- Datasets that can be used to some extent for evaluat- mation describing the listeners’ music preferences and con- ing personalized approaches to music retrieval and recom- sumption behavior, unlike both MSD and Celma’s [3]. These mendation include the Yahoo! Music dataset [4], which cur- additional descriptors include temporal aspects of listening rently represents the largest available music recommenda- behavior as well as novelty and mainstreaminess scores as tion dataset, including more than 262 million ratings of more proposed in [12], among others. than 620 thousand music items created by more than one million users. The ratings cover a time range from 1999 to 3. THE LFM-1B DATASET 2010. However, the dataset is completely anonymized, i.e., not only users, but also items are unknown. The absence In the following, we outline the data acquisition procedure of any descriptive metadata and ignorance of music domain from Last.fm, describe in detail the dataset’s components, knowledge therefore restricts the usage of the dataset to rat- analyze basic statistical properties of the dataset, provide ing prediction and collaborative filtering [13]. download links, and refer to some sample code in Python, The Million Song Dataset 6 (MSD) [2] is perhaps one of which is also available for download. Please note that the LFM-1b dataset is considered derivative work according to the most widely used datasets in MIR research. It offers a 11 wealth of information, among others, audio content descrip- paragraph 4.1 of Last.fm’s API Terms of Service. tors such as tempo, key, or loudness estimates, editorial item metadata, user-generated tags, term vector representations 3.1 Data Acquisition We first use the overall 250 top tags12 to gather their top of lyrics, and playcount information. While the MSD pro- 13 vides a great amount of information about one million songs, artists using the Last.fm API. For these artists, we fetch it has also been criticized, foremost for its lack of audio mate- the top fans, which results in 465,000 active users. For a ran- domly chosen subset of 120,322 users, we then obtain their rial, the obscurity of the approaches used to extract content 14 descriptors, and the improvable integration of the different listening histories. For approximately 5,000 users, we cap parts of the dataset. The MSD Challenge7 [10] further in- the fetched listening histories at 20,000 listening events in creased the popularity of the dataset.