Billboard 200 - Dataset Preparation
João Miguel, José Azevedo, Ricardo Ferreira (Group E)
October 29, 2020

Abstract

Billboard 200 is a record chart which ranks the top 200 music albums on a weekly basis. The chart is published by Billboard magazine in the United States and is frequently used to convey the popularity of artists. Since the chart has been active for more than 30 years, crossing it with information on artists, musical genres, lyrics and other data makes it possible to build a large data structure about music. This is the purpose of this work. The following document describes which datasets are used, how the data is prepared and enriched, the sources of the data and their characterization.

1 Introduction

Using the Billboard 200 [2] chart as a base, this work aims to create an information retrieval system about music. The goal is to have, in one easily accessible place, information on albums, songs and artists from 1963 to 2019. The platform will provide the Billboard 200 ranks, album names, release dates, artists, bands, biographies, song lyrics and characterization by musical genre, and will support queries to retrieve and order this information.

The project is split into 3 milestones:

• In the first milestone, the datasets to be used are chosen, prepared and characterized. The choice criterion was to have two different dataset types: one unstructured, rich in textual data, and another more semantic, rich in structured and annotated data;

• In the second milestone, an information retrieval tool will be applied to the datasets, exploiting them with free-text queries;

• In the third milestone, an ontology for the domain of the project datasets will be designed, represented and exploited.

The next sections of this document present:

• Which datasets are used for this project;

• Which data sources are used in this project;

• How the data was collected;

• How the data was cleaned and enriched;

• The conceptual model for the datasets;

• The returned documents and possible search tasks;

• The characterization of the collected data.

2 Data Sources

Three sources of information are used for this project. One structured dataset with the Billboard 200 charts comes in an SQL database. The other sources are two websites, MetroLyrics and Last.fm, which provide unstructured data. With this information new datasets will be built. The datasets used for this project are presented next.

2.1 Billboard 200 Datasets

The main dataset used is a database provided by the components.one group [1]. This database is free for use. Two tables of this database are used: the Albums table and the Acoustics table.

The Albums table provides information about the albums that entered the Billboard 200 ranking, more specifically the date each album was introduced on the rank, its name, its artist and its length.

The Acoustics table provides information about the songs of the albums selected for the Billboard 200, which includes the name of the song, the album and the artist, as well as a number of technical features about the songs, such as acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature and valence. This extra information will not be used, since we consider that it does not add valuable information.

In total, these 2 tables cover 33011 albums, 9675 artists and 339854 songs.
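Since the main dataset ships as an SQL database, it can be loaded straight into dataframes for the later pipeline stages. The snippet below is a minimal sketch of that step, assuming an SQLite file; the file name and table names (billboard-200.db, albums, acoustics) are placeholders standing in for the actual schema described above.

```python
import sqlite3
import pandas as pd

# Minimal sketch: load the two tables into Pandas dataframes.
# File name and table names are assumptions, not the dataset's real schema.
conn = sqlite3.connect("billboard-200.db")

albums = pd.read_sql_query("SELECT * FROM albums", conn)        # Billboard 200 entries
acoustics = pd.read_sql_query("SELECT * FROM acoustics", conn)  # per-song information

conn.close()

print(len(albums), "album entries,", len(acoustics), "songs")
```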
2.2 Last.fm

Last.fm [3] is a music website founded in the United Kingdom in 2002 [4]. The website provides a wide range of information about music, which is used to complement the albums, songs and artists, and can be used for personal and non-commercial purposes. The process we used to extract and treat this information is described in section 3.2. The data obtained is stored in JavaScript Object Notation (.json) files.

2.3 MetroLyrics

MetroLyrics [5] is a lyrics-dedicated website, founded in December 2002 [6]. It is used to obtain the lyrics of the songs from the albums on the Billboard 200. The data obtained is stored in JavaScript Object Notation (.json) files and can be used for personal and non-commercial purposes. More about the extraction process can be found in section 3.2.

3 Data Preparation

The main dataset used is well organized and needs minimal work to extract the relevant data; besides that, it comes in an SQL database, which also makes it easy to query. In the same way, the information retrieved via web scraping is also well organized and easy to obtain and process. The pipeline used to extract and enrich the data is outlined in Figure 1. This pipeline has two main purposes, to clean the data and to enrich it. These two operations are described in more detail in the next sections.

Figure 1: Data Pipeline.

3.1 Data Cleaning

The first stage of the pipeline from Figure 1 is the data cleaning, accomplished with the tool OpenRefine [7]. This tool is used to remove empty entries from the database and to remove entries with extra characters. This operation results in two files:

• Albums - A file with all the albums and their ranks in the Billboard 200 from 1963 to 2019;

• Tracks - A file with the songs and artists that match the albums file.

In this dataset one album can have hundreds of entries, because it can be featured in the Billboard 200 for several weeks, months or even years. This poses a problem in the next phase of the pipeline, the data scraping: the same album would be processed several times and the pipeline would become extremely inefficient and slow. To overcome this problem, and since we cannot discard any entry from the previous files (we would lose the rank information), the pipeline generates auxiliary files with duplicates removed. This is done in steps 1 and 2 marked in the pipeline. In these steps all the characters are also escaped to HTML format, to be used later to generate the URL addresses for MetroLyrics and Last.fm.
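A rough equivalent of steps 1 and 2 (deduplication plus escaping) is sketched below. The file names and column names are assumptions, and the escaping is approximated here with URL percent-encoding via urllib.parse.quote; the project's actual files and escaping scheme may differ.

```python
import pandas as pd
from urllib.parse import quote

# Assumed file and column names; the real pipeline uses its own.
albums = pd.read_csv("albums.csv")   # album, artist, rank, date, ...
tracks = pd.read_csv("tracks.csv")   # song, album, artist, ...

# Keep a single row per album/track so each page is scraped only once;
# the full rank history stays untouched in the original files.
unique_albums = albums.drop_duplicates(subset=["album", "artist"]).copy()
unique_tracks = tracks.drop_duplicates(subset=["song", "artist"]).copy()

# Escape the text fields so they can be embedded in the Last.fm and
# MetroLyrics URL templates used in the next stage.
for df, cols in ((unique_albums, ["album", "artist"]),
                 (unique_tracks, ["song", "album", "artist"])):
    for col in cols:
        df[col + "_url"] = df[col].astype(str).map(lambda s: quote(s, safe=""))

unique_albums.to_csv("albums_unique.csv", index=False)
unique_tracks.to_csv("tracks_unique.csv", index=False)
```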
3.2 Data Enrichment

The second stage of the pipeline is the data enrichment. To enrich the datasets already obtained from the Billboard 200, the websites Last.fm and MetroLyrics were crawled using the Scrapy framework [9].

The crawler uses the cleaned files described in the previous section and loads them into 4 dataframes using Pandas [8], an open source data analysis and manipulation library for Python:

• ranks: Table loaded from albums.csv with information on the albums' positions on the Billboard 200 charts on different dates.

• albums: Table with album information obtained from albums.csv by separating the album columns and removing duplicates.

• tracks: Table loaded from tracks.csv.

• artists: Table with artist information obtained from tracks.csv by separating the artist columns and removing duplicates.

Both Last.fm and MetroLyrics have user-friendly links that can be built following a common structure, using the information from the columns of the dataframes above:

LASTFM_URL = https://www.last.fm/music
ML_URL = https://www.metrolyrics.com

• LASTFM_URL/{artist}/{subset} : Links to pages with information about an artist;

• LASTFM_URL/{artist}/_/{song}/{subset} : Links to pages with information about a song;

• LASTFM_URL/{artist}/{album}/{subset} : Links to pages with information about an album;

• ML_URL/{song}-lyrics-{artist}.html : Links to pages with the lyrics of a song.

The {artist}, {song} and {album} fields can be obtained from the dataframes to search for specific albums, artists and tracks. The {subset} field indicates what kind of information we want to obtain for that album, artist or song, and can be either "tags", "wiki", or empty if we want an overview.

Each spider searches for a subset of all artists, albums or tracks and exports that information to a json file. As a result, the crawler obtains 8 json files with scraped data:

• albums_overview: Number of listeners and release date of the albums.

• albums_tags: List of tags for each album.

• tracks_lyrics: Lyrics of the tracks.

• tracks_overview: Number of listeners and duration of the tracks.

• tracks_tags: List of tags for each track.

• artists_overview: Number of listeners of the artists.

• artists_tags: List of tags for each artist.

• artists_wiki: The biography and number of listeners of all artists. If the artist is an individual (Solo) it also contains the birth date and birth location. If the artist is a group of individuals (Band) it also contains the location of the foundation, the years of activity and a list of its members.

The next step is to complement the dataframes above with the scraped data stored in the json files. Each json file has the keys of the albums, artists or tracks to which the scraped data corresponds, so the information can be merged into the corresponding dataframes.

4 Conceptual Model

An album can be featured on the chart for several weeks, so it can be associated to several ranks and in different positions between them. The model includes the following classes:

• Album: Saves all the information of an album, such as the name of the album, the name of the artist, the release date, the number of tracks, the total duration and the number of listeners.

• Artist: Saves general information about an artist, such as the name of the artist, the number of listeners and its biography (textual information). An Artist can also be one of two subclasses that hold more specific information, depending on whether it is a Solo (referring to an individual) or a Band (a group of individuals).

• Solo: Saves specific information about an individual artist, such as the birth date and birth location.
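To make the conceptual model above more concrete, the following Python dataclasses are a rough sketch of the classes just described. The attribute names are inferred from the textual descriptions (and, for Band, from the artists_wiki description), so they should be read as assumptions rather than the project's actual model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Album:
    # Information of an album as described above.
    name: str
    artist: str
    release_date: Optional[str] = None
    num_tracks: Optional[int] = None
    total_duration: Optional[int] = None   # e.g. in seconds
    listeners: Optional[int] = None

@dataclass
class Artist:
    # General information of an artist.
    name: str
    listeners: Optional[int] = None
    biography: Optional[str] = None        # textual information

@dataclass
class Solo(Artist):
    # Specific information of an individual artist.
    birth_date: Optional[str] = None
    birth_location: Optional[str] = None

@dataclass
class Band(Artist):
    # Fields suggested by the artists_wiki description (foundation location,
    # years of activity, members); treat these as an assumption.
    foundation_location: Optional[str] = None
    years_active: Optional[str] = None
    members: List[str] = field(default_factory=list)
```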