
Billboard 200 - Dataset Preparation

João Miguel José Azevedo Ricardo Ferreira [email protected] [email protected] [email protected]

October 29, 2020

Abstract

Billboard 200 is a chart which ranks the top 200 music albums on a weekly basis. This chart is published by the Billboard magazine in the United States. The charts are frequently used to convey the popularity of artists. Since this chart has been active for more than 30 years, crossing this information with information on artists, musical genres, lyrics and other data will allow us to build a large data structure about music. This is the purpose of this work. The following document describes which datasets are being used, how the data is being prepared and enriched, the sources of the data and their characterization.

1 Introduction

Using the Billboard 200 [2] chart as a base, this work aims to create an information retrieval system about music. The goal is to have, in one place and easily accessible, information on albums, songs and artists from 1963 to 2019. The platform will provide ranks on the Billboard 200, album names, release dates, artists, bands, biographies, song lyrics and characterization by musical genre, and will support queries to retrieve and order this information.

This project is split into 3 milestones:

• In the first milestone, the datasets to be used are chosen, prepared and characterized. The choice criterion was to have 2 different dataset types: one unstructured, rich in textual data; and another more semantic, rich in structured and annotated data;

• In the second milestone, an information retrieval tool will be used on the datasets, which will be explored with free-text queries;

• In the third milestone, an ontology for the domain of the project datasets will be designed, represented and exploited.

The next sections of this document present:

• Which datasets are used for this project;

• Which data sources are used in this project;

• How the data was collected;

• How the data was cleaned and enriched;

• The conceptual model for the datasets;

• The returned documents and possible search tasks;

• The characterization of the data collected.

2 Data Sources

Three sources of information are used for this project: one structured dataset with the Billboard 200 charts, which comes in a SQL database, and two websites, MetroLyrics and Last.fm, which provide unstructured data. With this information new datasets will be built. The datasets used for this project are presented next.


2.1 Billboard 200 Datasets

The main dataset used is a database provided by the components.one group [1]. This database is free to use. Two tables of this database are used, the Albums table and the Acoustics table.

The Albums table provides information about the albums that appeared in the Billboard 200 ranking, more specifically, the date each album was introduced on the rank, the name, the artist, as well as the length of the album.

The Acoustics table provides information about the songs of the albums selected for the Billboard 200, which includes the name of the song, the album, the artist, as well as a number of technical features about the songs, like the acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature and valence. These technical features will not be used, since we consider that they do not add valuable information.

In total, these 2 tables cover 33011 albums, 9675 artists and 339854 songs.

2.2 Last.fm

Last.fm [3] is a music website founded in the United Kingdom in 2002 [4]. This website provides a wide range of information about music, which is used to complement the albums, songs and artists, and which can be used for personal and non-commercial purposes. The process we used to extract and treat this information is described in Section 3.2. The data obtained is stored in JavaScript Object Notation (.json) files.

2.3 MetroLyrics

MetroLyrics [5] is a lyrics-dedicated website, founded in December 2002 [6]. It is used to obtain the lyrics of the songs from the albums on the Billboard 200. The data obtained is stored in JavaScript Object Notation (.json) files and can be used for personal and non-commercial purposes. More about the process of extraction can be found in Section 3.2.

3 Data Preparation

The main dataset used is very organized and needs minimal action to extract the relevant data. In addition, it comes in a SQL database, which also makes it easy to query. In the same way, the information retrieved via web scraping is also very organized and easy to get and process. The pipeline used to extract and enrich the data is outlined in Figure 1. This pipeline has two main purposes: to clean the data and to enrich it. These two operations are described in more detail in the next sections.

3.1 Data Cleaning

The first stage of the pipeline from Figure 1 is the data cleaning, accomplished through the use of the tool OpenRefine [7]. This tool is used to remove empty entries from the database and to remove entries with extra characters. This operation results in two files:

• Albums - A file with all the albums and their ranks in the Billboard 200 from 1963 to 2019;

• Tracks - A file with songs and artists that match the albums file.

In this dataset one album can have hundreds of entries, because it can be featured in the Billboard 200 for several weeks or even months or years. This poses a problem in the next phase of the pipeline, the data scraping: an album would be processed several times, making the pipeline extremely inefficient and slow. To overcome this problem, and since we cannot discard any entry from the previous files (we would end up losing the rank information), the pipeline generates auxiliary files with duplicates removed. This is done in steps 1 and 2 marked in the pipeline. In these steps all the characters are also escaped to HTML format to be used to generate the URL addresses for MetroLyrics and Last.fm.
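The duplicate removal and character escaping of steps 1 and 2 can be illustrated with a minimal sketch. It is not the project's actual script: the row fields (`date`, `rank`, `album`, `artist`), the sample rows and the helper names `unique_albums` and `lastfm_album_url` are hypothetical, and it assumes that "escaped to HTML format" refers to URL encoding of the names, here done with `quote_plus` from the Python standard library. The URL follows the Last.fm album pattern given in Section 3.2.

```python
from urllib.parse import quote_plus

LASTFM_URL = "https://www.last.fm/music"  # base URL listed in Section 3.2

# Made-up sample rows: one row per album per chart date (not real chart data).
chart_rows = [
    {"date": "2000-01-01", "rank": 1, "album": "Greatest Hits", "artist": "Artist A"},
    {"date": "2000-01-08", "rank": 3, "album": "Greatest Hits", "artist": "Artist A"},
    {"date": "2000-01-08", "rank": 7, "album": "Greatest Hits", "artist": "Artist B"},
    {"date": "2000-01-15", "rank": 2, "album": "Rock & Roll", "artist": "Artist A"},
]


def unique_albums(rows):
    """Return each (album, artist) pair once, in first-seen order.

    The original rows are left untouched, so no rank information is lost.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["album"], row["artist"])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def lastfm_album_url(artist, album, subset=""):
    """Build LASTFM_URL/{artist}/{album}/{subset} with URL-encoded names."""
    parts = [LASTFM_URL, quote_plus(artist), quote_plus(album)]
    if subset:  # an empty subset means the overview page
        parts.append(subset)
    return "/".join(parts)


albums = unique_albums(chart_rows)
urls = [lastfm_album_url(artist, album, "tags") for album, artist in albums]
for url in urls:
    print(url)
```

Keying on the pair (album, artist) rather than on the album name alone keeps same-named albums from different artists apart. With Pandas, `DataFrame.drop_duplicates` on the two columns would give the same auxiliary table.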

Group E, October 29, 2020

Figure 1: Data Pipeline.

3.2 Data Enrichment

The second stage of the pipeline is the data enrichment. To enrich the datasets already obtained from Billboard 200, the websites Last.fm and MetroLyrics were crawled using the Scrapy framework [9].

The crawler uses the cleaned files described in the previous section and loads them into 4 dataframes using Pandas [8], an open source data analysis and manipulation library for Python:

• ranks: Table loaded from albums.csv with information on the albums' positions on the Billboard 200 charts on different dates.

• albums: Table with album information obtained from albums.csv by separating the album columns and removing duplicates.

• tracks: Table loaded from tracks.csv.

• artists: Table with artist information obtained from tracks.csv by separating the artist columns and removing duplicates.

Both Last.fm and MetroLyrics have user-friendly links, that can be built following a common structure and using the information from the columns of the dataframes above:

LASTFM_URL = https://www.last.fm/music
ML_URL = https://www.metrolyrics.com

• LASTFM_URL/{artist}/{subset} : Links to pages with information about an artist;

• LASTFM_URL/{artist}/_/{song}/{subset} : Links to pages with information about a song;

• LASTFM_URL/{artist}/{album}/{subset} : Links to pages with information about an album;

• ML_URL/{song}-lyrics-{artist}.html : Links to pages with the lyrics of a song.

The {artist}, {song} and {album} fields can be obtained from the dataframes to search for specific albums, artists and tracks. The {subset} section indicates what kind of information we want to obtain for that album, artist or song, which can be either "tags", "wiki", or even empty if we want an overview.

Each spider searches for a subset of all artists, albums or tracks and exports that information to a json file. As a result, the crawler obtains 8 json files with scraped data:

• albums_overview: Number of listeners and release date of the albums.

• albums_tags: List of tags for each album.

• tracks_lyrics: Lyrics of the tracks.

• tracks_overview: Number of listeners and duration of the tracks.


• tracks_tags: List of tags for each track.

• artists_overview: Number of listeners of the artists.

• artists_tags: List of tags for each artist.

• artists_wiki: The biography and number of listeners of all artists. If the artist is an individual (Solo) it also contains the birth date and birth location. If the artist is a group of individuals (Band) it also contains the location of the foundation, years of activity and a list of its members.

The next step is to complement the dataframes above with the scraped data that was stored in the json files. Each json file has the keys to the albums, artists or tracks which the scraped data corresponds to, so the information can be aggregated. This aggregation results in 4 main tables: ranks, albums, tracks and artists (solo and band).

4 Data Model

Figure 2 shows the final data model.

Figure 2: Conceptual Data Model.

The model is comprised of the following elements:

• Rank: List of top albums on the Billboard 200 chart on a specific date. An album can appear on the chart on different dates, so it can be associated with several ranks, in a different position in each of them.

• Album: Saves all the information of an album, such as name of the album, name of the artist, release date, number of tracks, total duration and number of listeners.

• Artist: Saves general information of an Artist, such as name of the artist, number of listeners and biography (textual information). An Artist can also be one of two subclasses that hold more specific information, depending on whether it is a Solo (referring to an individual) or a Band (group of individuals).

• Solo: Saves specific information of an individual artist, such as birth date and birth location.

• Band: Saves information of a group of artists that perform together, such as foundation location, years of activity, and the list of members (can be Solo artists, if their information is provided by the datasets).

• Tag: A tag contains a string that can label many albums, artists or songs, giving important information about them, such as genre.

5 Information Retrieval Tasks

The information described in the previous sections will feed a platform that will allow searches and will return relevant documents based on the search parameters.

5.1 Returned Documents

Each search will return one or more documents. There are 4 types of documents for this project:

• Albums: Provides album information as well as its songs and artist;

• Songs: Provides song information as well as its lyrics;


• Artists: Provides artist information as well as their albums;

• Ranks: Provides an album ranking chart corresponding to a date.

5.2 Possible Search Tasks

The possible search tasks define which search inputs are expected and the corresponding returned documents. The possible search tasks for this project are:

• Rank by date (year, month, day): returns albums, artists and ranks;

• Artists (band or solo): returns artists and albums;

• Location: returns artists;

• Album: returns album, artist, songs and best rank;

• Release date (year, month, day): returns albums and artists;

• Musical genre: returns albums, artists and songs;

• Songs (by name or words/sentences from lyrics): returns songs.

6 Data Characterization

The dataset under analysis is very large, with 574000 entries. Given the nature of the data, the same album can have hundreds of entries in the database, because really popular albums are featured in the Billboard 200 for several months or years. Since this dataset is in a SQL database we can characterize the data with relative ease. In this dataset the album with the most entries is The Dark Side of the Moon, with 942 entries. Another particularity which increases the complexity of the analysis is the number of albums named Greatest Hits from different artists, which together represent 5905 entries. To overcome this, the scripts always consider the pair album and artist. The number of albums per year is constant from one year to the next, as can be seen in Section A, Figure 3. The year 2019 only has data for the first month. Looking at the number of songs per year (Section A, Figure 4), we can see that the number of songs included in the albums has increased over the years, with 2014 being the year with the most songs in the Billboard 200. Similar conclusions can be drawn regarding the number of artists with albums in the rank (Section A, Figure 5); this may be a consequence of the increasing number of diffusion mediums.

7 Conclusions

The first milestone proposed is completed. The data was characterized and its usage defined. The data chosen is very complete and will provide plenty of information about music since 1963. The database with the Billboard 200 proved to be an excellent starting point to extract information, since it gives an exhaustive list of albums, artists and songs since 1963. Even though the database has a huge list of albums, it only has information about albums in the Billboard 200, lacking information about other less popular albums. This will render the final datasets incomplete but, on the other hand, since we are considering information from 1963 to 2019, covering all albums produced in this period would yield datasets of giant proportions and challenges that are not in the scope of this analysis. The sources used to enrich this data also proved to be very satisfactory, with high quality information, leading to very complete groups of datasets. Overall, the process of characterization, data cleaning and enrichment resulted in good quality documents that will allow the usage of information retrieval tools and the return of satisfactory results.

8 References

[1] Acoustic and meta features of albums and songs on the Billboard 200. url: https://components.one/datasets/billboard-200.


[2] Billboard 200. url: https://www.billboard.com/charts/billboard-200.
[3] Last.fm. url: https://www.last.fm/.
[4] Last.fm on Wikipedia. url: https://en.wikipedia.org/wiki/Last.fm.
[5] MetroLyrics. url: https://www.metrolyrics.com/.
[6] MetroLyrics on Wikipedia. url: https://en.wikipedia.org/wiki/MetroLyrics.
[7] Open Refine Documentation. url: https://openrefine.org.
[8] Pandas, Python Data Analysis Library. url: https://pandas.pydata.org/.
[9] Scrapy, A Fast and Powerful Scraping and Web Crawling Framework. url: https://scrapy.org/.


A Data Characterization

Figure 3: Albums per year in the Billboard 200.

Figure 4: Songs per year in the Billboard 200.

Figure 5: Artists per year in the Billboard 200.
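Since the chart data comes in a SQL database, the per-year counts plotted in the figures above and the per-album entry counts can be obtained with simple aggregate queries. The sketch below is only an illustration under stated assumptions: the table name `albums`, the columns `chart_date`, `position`, `album` and `artist`, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3

# In-memory stand-in for the chart database (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE albums (chart_date TEXT, position INTEGER, album TEXT, artist TEXT)"
)
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?, ?)",
    [
        ("1963-01-05", 1, "Album X", "Artist A"),
        ("1963-01-12", 2, "Album X", "Artist A"),
        ("1963-01-12", 1, "Greatest Hits", "Artist B"),
        ("1964-02-01", 1, "Greatest Hits", "Artist C"),
    ],
)

# Number of chart entries per album, keyed on the pair (album, artist)
# so that same-named albums from different artists are not merged.
entries_per_album = conn.execute(
    """
    SELECT album, artist, COUNT(*) AS entries
    FROM albums
    GROUP BY album, artist
    ORDER BY entries DESC, album, artist
    """
).fetchall()

# Number of distinct albums that appear on the chart in each year.
albums_per_year = conn.execute(
    """
    SELECT year, COUNT(*) AS n_albums
    FROM (
        SELECT DISTINCT substr(chart_date, 1, 4) AS year, album, artist
        FROM albums
    )
    GROUP BY year
    ORDER BY year
    """
).fetchall()

print(entries_per_album)
print(albums_per_year)
```

Grouping by both album and artist mirrors the choice described in Section 6 of always treating the pair as the key.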
