
Bachelor Degree Project

Price prediction of vinyl records using machine learning algorithms

Author: David Johansson
Supervisor: Tobias Ohlsson
Semester: Spring 2020
Subject: Computer Science

Abstract

Machine learning algorithms have been used for price prediction within several application areas. Examples include real estate, the stock market, tourist accommodation, electricity, art, cryptocurrencies, and fine wine. Common approaches in studies are to evaluate the accuracy of predictions and to compare different algorithms, such as Linear Regression or Neural Networks. There is a thriving global second-hand market for vinyl records, but research on price prediction within the area is very limited. The purpose of this project was to build on existing knowledge of price prediction in general in order to evaluate some aspects of price prediction of vinyl records. That included investigating the achievable level of accuracy and comparing the efficiency of algorithms. A dataset of 37,000 samples of vinyl records was created with data from the Discogs website, and multiple machine learning algorithms were utilized in a controlled experiment. Among the conclusions drawn from the results were that the Random Forest algorithm generally produced the strongest results, that results can vary substantially between different artists or genres, and that a large share of the predictions had a good accuracy level, but that a relatively small number of large errors had a considerable effect on the overall results.

Keywords: price prediction, price estimation, vinyl records, vinyl prices, regression, machine learning, machine learning algorithms, algorithm comparison, dataset, vinyl dataset, k-nearest neighbors, linear regression, neural network, random forest, discogs

Contents

1 Introduction
  1.1 Background
    1.1.1 Vinyl record industry
    1.1.2 Price estimation with machine learning
    1.1.3 Algorithms
    1.1.4 Hedonic modelling
  1.2 Related work
    1.2.1 Real estate
    1.2.2 Collectibles
    1.2.3 Vinyl records
  1.3 Problem formulation
  1.4 Motivation
  1.5 Objectives
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline
2 Method
  2.1 Dataset
    2.1.1 Genres and artists
    2.1.2 Attributes
    2.1.3 Data trimming
    2.1.4 Variable encoding
    2.1.5 Feature scaling
  2.2 Controlled experiment
    2.2.1 Algorithm selection
    2.2.2 Measuring accuracy
    2.2.3 Experiment configuration
  2.3 Reliability and Validity
    2.3.1 Reliability
    2.3.2 Internal validity
    2.3.3 External validity
  2.4 Ethical Considerations

3 Implementation
  3.1 Dataset
    3.1.1 Attribute selection
    3.1.2 Data collection
    3.1.3 Data trimming calibration
  3.2 Controlled experiment
    3.2.1 Hyperparameter calibration
    3.2.2 Dataset class
    3.2.3 Regression class
    3.2.4 Experiment execution
4 Results
  4.1 Dataset
  4.2 Controlled experiment
    4.2.1 Evaluating full dataset
    4.2.2 Evaluating genre data
    4.2.3 Evaluating artist data
5 Analysis
  5.1 Comparison of algorithms
  5.2 Additional contextual data
  5.3 Efficiency of features
  5.4 Accuracy of price estimation
6 Discussion
  6.1 Accuracy of price estimation (Q1)
  6.2 Additional contextual data (Q2)
  6.3 Comparison of algorithms (Q3)
  6.4 Efficiency of features (Q4)
7 Conclusion
  7.1 Future work
References

Appendix A.1
Appendix A.2
Appendix B.1
Appendix B.2
Appendix B.3

1 Introduction

Price estimation using machine learning can be approached in multiple ways. Whether the outcome is considered satisfactory usually depends on several factors, but it is generally of high importance that the data collected and the algorithms implemented are well suited to the problem formulations one intends to answer. In this project, state-of-the-art techniques within the field of price estimation were applied in a context where they have not previously been implemented to any large extent. By merging familiar concepts with some new ideas in a new environment, the aim was to broaden the available knowledge and shed some light on an unexplored application area for machine learning. A dataset representing the vinyl discographies of hundreds of artists, as well as software utilizing several machine learning models, was created for this project. Using those assets, an experiment was conducted with the intent to answer a number of problem formulations related to the subject.

1.1 Background

1.1.1 Vinyl record industry

Within the almost hundred-year-old industry, the usage of physical audio media is fading away in favor of heavy consumption of digital audio files and streaming platforms. The popularity of compact discs has been steadily declining since the turn of the century, with major retail chains like Best Buy shutting down their sales of CDs completely [1]. The vinyl record, on the other hand, looks like it has stood the test of time. It still has a thriving community, presenting retail sales numbers that have been increasing in recent years, with a doubling of sales during 2018 [2]. Vinyl records are often pressed as limited editions, sell out quickly, and can be difficult to get hold of. In other cases, attention can be brought to an item months or even years after it has disappeared from the retail market, making second-hand prices high, or even extremely high compared to the initial price. Other items might not be as sought after, and perhaps would not even recoup their original value if one were to sell a copy. Either way, the second-hand market for vinyl records, recent as well as older releases, is highly active, and prices can vary widely depending on several factors. Many people collect and deal in records, either as a hobby, as a business, or somewhere in between.

1.1.2 Price estimation with machine learning

Machine learning is used to generate information from data [3, pp. 1]. There are different types of machine learning which can be used for different types of tasks. Price prediction is a type of problem that belongs in the category of supervised machine learning, which is used to predict an output value based on some kind of input [3, pp. 25]. In the case of this project, the output is the price of a vinyl record and the input is an array of attribute values describing the vinyl record. A prediction is made using a model (an implementation of an algorithm) that has been trained (or fit) with input/output pairs of data, referred to as training data [3, pp. 25]. The algorithm is the part of the model that does calculations based on the training data and the values of hyperparameters used during implementation. Hyperparameters are settings used to improve the generalization performance of a model [3, pp. 260]. Supervised machine learning problems can be further divided into two major categories - classification and regression. Classification is used to predict a categorical output value from a set list of options 1, while regression is used to predict a continuous number [3, pp. 25]. Thus, regression is the type of machine learning used to predict prices. Regression of prices has been studied in several different areas. Using a generic term like "price regression" in a search engine for academic publications like Google Scholar, one is likely to find results in diverse areas. Examples include real estate, the stock market, tourist accommodation, electricity, art/paintings, cryptocurrencies, and fine wine. While the fundamental concepts of these studies are often similar (using machine learning algorithms to investigate economic aspects of the subject), the course of action, as well as the aims, often differ slightly between the respective areas of application (which will be further elaborated on in Chapter 1.2). Also, some areas are more prevalent, like real estate or the stock market, and are therefore more comprehensively studied. In more obscure areas, there may only be one or a few studies available, meaning there are possible approaches of study which are currently unexplored.

1 E.g. yes, no, or unknown, or any other set of categorical values.

1.1.3 Algorithms

While the algorithms commonly used in price prediction studies will be examined more closely in Chapter 1.2, this chapter gives an overview of generally recognized algorithms. Models highlighted in the educational book Introduction to Machine Learning with Python - A Guide for Data Scientists [3] are k-Nearest Neighbors, Linear Regression, Decision Tree, Ensembles of Decision Trees (such as Random Forest), Support Vector Machine, and Neural Network. A Google search using the phrase which machine learning model to use for regression was performed, and a handful of blogs and online resources [4], [5], [6], [7], [8], [9], [10], [11] were examined. Although slightly varied in their suggestions, the sources were fairly consistent. The previously mentioned algorithms constituted the majority of the recommended models. Among the most common ones were Linear Regression, Neural Network, Random Forest, and Decision Tree, while the less frequent ones included k-Nearest Neighbors and Support Vector Machine.

1.1.4 Hedonic modelling

Hedonic modelling, or hedonic regression, is a method of distinguishing the characteristics of an item that are components of its value. The purpose is to determine the values of the separate features, in order to more accurately calculate an item's value. The characteristics are not in themselves market goods and cannot be sold separately; rather, it is the composition of characteristics that determines something's value [12]. For example, an object's age, size, or condition cannot be purchased separately, but may be important individual factors for the value of the object. Hedonic modelling is about focusing on distinct independent characteristics and how they affect the value.
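To make the idea concrete, the following minimal sketch fits a linear model to a handful of made-up items and reads an estimated price contribution for each characteristic from the fitted coefficients. The feature names and numbers are purely illustrative and are not taken from the dataset used in this project.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical items described by [age in years, limited edition (0/1), colored vinyl (0/1)].
X = np.array([
    [30, 1, 0],
    [5,  0, 0],
    [12, 1, 1],
    [40, 0, 0],
    [8,  1, 1],
])
y = np.array([120.0, 15.0, 45.0, 60.0, 35.0])  # observed second-hand prices (made up)

model = LinearRegression().fit(X, y)
for name, coef in zip(["age", "limited_edition", "colored_vinyl"], model.coef_):
    # Each coefficient approximates the price contribution of one characteristic.
    print(f"{name}: {coef:.2f}")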

1.2 Related work

As described in Chapter 1.1.2, multiple application areas for price prediction with machine learning have been studied. While studying related work for this project, it was decided to investigate one of the most commonly researched areas with the purpose of observing common and modern approaches, even if the type of goods examined is different from those in this project. It was also decided to study research more closely related to the subject of this project, even if it may not be as cutting-edge for price prediction in general as research available in a more actively studied area. Combining knowledge from both these areas was deemed to be an applicable starting point for this project. Based on this reasoning, the areas of real estate, collectibles, and vinyl records were investigated.

1.2.1 Real estate

With real estate arguably being the most researched field in price estimation using machine learning, it can be assumed that the area is representative of the current state of the art of regression techniques. A search was conducted to get an overview of previously published research in the area, to observe the techniques and problem formulations commonly used. Upon performing a literature search 2, eighteen relevant top results were selected and studied. Different kinds of Neural Networks were the subject of evaluation in six of the studies [13], [14], [15], [16], [17], [18], while eleven studies [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] include various types of Linear Regression. The remaining one [30] focuses on k-Nearest Neighbors and Random Forest, making it the only one out of the eighteen not concerned with either Neural Networks or some form of Linear Regression. There are some recurring approaches in the studies, for example comparing the performance of two models 3, evaluating a model with an approach for which no or very little previous research exists 4, or estimating the general performance of real estate price prediction with a particular model 5. Observations include that Neural Network and Linear Regression models are the most common, that Random Forest models outperformed their opponents on all occasions [25], [29], [30], and that regression models can be effective for price estimation of real estate.

2 Search engines: Google Scholar and OneSearch. Terms: Real estate price prediction, real estate regression, house price estimation, house price regression, and house price prediction.
3 E.g. comparing a Support Vector Regression model and a Neural Network model [14], comparing two Neural Network models [15], comparing a Geographically Weighted Regression model and a Spatial Lag model [22], comparing four Linear Regression models with two Ensemble models [24], comparing a Linear Regression model and a Random Forest model [25], [29], comparing a k-Nearest Neighbors model and a Random Forest model [30].
4 E.g. using a Neuro-Fuzzy Neural Network model [17], using visual data features in a real estate dataset [18], complementing Least Squares Linear Regression with quantile regression [20], utilizing a bargaining power feature while using a Linear Regression model [28].
5 E.g. using Neural Network models [13], [16], using a Geographically Weighted Regression model [21], using a Linear Regression model [23], using a kernel-based Geographically and Temporally Weighted Autoregressive (KBGTWAR) model [26].

1.2.2 Collectibles

There is not much academic research available in the area of vinyl record price estimation (see Chapter 1.2.3). To examine studies of relatively high similarity, a search was made for studies on price estimation of other collectibles. After doing a search 6 using terms related to art, paintings, and price estimation in general, eight works were studied. Five of them [31], [32], [33], [34], [35] are about art and paintings, while the remaining three are about French autographs [36], rare U.S. coins [37], and manuscripts [38]. The focus of these studies is largely on investigating the price estimation potential of the different datasets. The algorithms and models used are typically described in mathematical terms, and the approach generally seems grounded in mathematics rather than in computer science. Different variations of Linear Regression are implemented. No comparison between different algorithms is done. Therefore, it is difficult to say whether any specific machine learning models are more effective when it comes to the prices of collectibles. Several findings are made in the different studies. Teti et al. maintain that "even though some variables can affect, to a certain extent, the final price at which a work of art is sold, other subjective, difficult-to-measure variables that are intrinsically linked to the specificity of cultural products are the key drivers of prices" [31, pp. 77]. Multiple studies conclude that the identity of the artist is of high importance to the price [32], [35], [36]. Pradier et al. also conclude in their study of French autographs that their model "provided an estimation of the hedonic price, but this estimation is conditional to the author's specific effect. Therefore, when [evaluating] an autograph of an author, which is not in our database, it should be calibrated first by analogy with similar authors" [36, pp. 472]. That suggests that an item should be put into a specific context before performing hedonic regression. Nahm points out that "[paintings by] a deceased artist are associated with a price that is twenty times higher" [32, pp. 297]. Other observations include the definition of features and determining their hedonic significance for performing regression of the data 7.

6 Search engines: Google Scholar and OneSearch. Terms: Price estimation art, price prediction art, price prediction paintings, art price regression, and price regression.
7 E.g. Nahm concludes that "paintings executed in acrylic, oil, and mixed media command higher prices, with an increased value over all other works of about 170–190%" [32, pp. 293]. Fedderke & Li point out that "consistent with other hedonic pricing models for art markets, the identity of the artist, medium of the artwork, size, a set of dating characteristics, and the genre of the work is of importance to the realized market price." [35, pp. 100]

1.2.3 Vinyl records

After conducting a search 8 using a number of terms related to price estimation of vinyl records, two studies somewhat related to the subject were found. The most relevant one was Pricing the Groove: Hedonic equation estimates for rare vinyl records [39], done at the University of Hagen in 2019. Its primary focus is to study the data of valuable records using a Linear Regression algorithm, i.e. it does not include any comparison of models. Nothing is mentioned about the programming environment used, and the calculations are explained in purely mathematical terms. The dataset was made up of the 30 most expensive items sold monthly at the Discogs marketplace during a nine-year period, totaling about 3100 observations. In other words, it exclusively includes data about the most valuable items sold on the website. Using this data and the regression model, some conclusions were drawn. The popularity of an artist correlates positively with the price [39, pp. 11]. There is generally a high demand for items that are sold at high prices (demand is calculated based on user data of the website). An item's age is not necessarily important for the price. Prices are concluded to be lower for audio recordings released in rarely traded countries, such as African countries. The study also mentions that "future development of the present work requires more exploration of the role of specific artist or genre on price movements through additional variable construction" [39, pp. 14]. The Demand for Vinyl L.P.s 1975-1988: Time Series Estimation of a Product Group in the Presence of Product Differentiation Innovation [40] is a study from 1994. Although it does include examples of regression, it focuses on the demand for vinyl LPs in general rather than on estimating prices for individual items. Considering that the study was close to three decades old at the time this project was done, both the techniques and the data available were arguably outdated. Using a set of independent variables 9 based on data obtained mostly from the British Phonographic Industry [40, pp. 42], some observations were made. Examples include that the 10- to 14-year-old population is a significant determinant of the demand for vinyl singles [40, pp. 61] and that "tune innovation, 25- to 29-year-old population, availability of tunes on compatible pre-recorded software and Christmas expenditure [are] positive and significant determinants of the demand for vinylite albums" [40, pp. 62].

8 Search engines: Google Scholar and OneSearch. Terms: Vinyl price, vinyl price prediction, vinyl price estimation, vinyl price regression, and vinyl records price.
9 Income, price of pre-recorded and blank audio software, tune innovation/imitation, selection of tunes on compatible pre-recorded audio software, price of complements - audio hardware, demography - youth population, durable good factors, and seasonal factors - Christmas presents [40, pp. 44].

1.3 Problem formulation

The foundation of this project was built on the following problem formulation: Can machine learning be used to make efficient price predictions of vinyl records? Four research questions, listed in Table 1.1, were studied with the intent of answering this problem.

# Description
Q1 What level of price prediction accuracy 10 is it possible to achieve using common algorithms and a dataset of vinyl records?
Q2 What are the effects of using additional contextual data while making predictions (e.g. training a model with more artists within the same genre when evaluating an artist's data)?
Q3 Utilizing multiple algorithms, how do they perform in comparison?
Q4 Are the features used in the dataset effective in the hedonic regression of vinyl records?
Table 1.1: Research questions to be answered in this project.

10 The approach used to measure this is described further in Chapter 2.2.

1.4 Motivation

A large amount of research has been done on price estimation in general, verifying that it is an area in which a large scientific interest exists. The vinyl record business is a well-established worldwide phenomenon. However, no general study about price prediction has previously been done in the area. Therefore, there is currently a gap in knowledge which, once properly evaluated, could be of interest primarily for science. Were the concepts of price prediction of vinyl records to be developed into a format accessible to the average person - e.g. some kind of app or website - it could theoretically and eventually (provided it were stable enough to produce fairly accurate results) be of interest for the industry (e.g. manufacturers or retailers) and for society (e.g. consumers or collectors). However, the scientific perspective is the sole focus of this project, and any eventual commercial approach has not been further elaborated on.

1.5 Objectives

# Description
O1 Generate a dataset of vinyl records from Discogs.
O2 Implement machine learning models to be used in the project.
O3 Experiment with the dataset and models to get usable results.
O4 Write the report and present new knowledge acquired on the price estimation of vinyl records.
Table 1.2: Objectives of this project.

1.6 Scope/Limitation

Had there been more time and resources available, several things would have been interesting additions to the project. A specific data feature which would have been of high value for this project is the quantity 11 of each item. However, this information is currently unobtainable, as there are no sources where this information is distinctly available. Gathering that data would either have had to be done manually or by using some kind of software that would have needed to be developed - two options too time-consuming for this project. Since the dataset was generated as a part of this project, a limit for its size had to be set at some point, as deciding which artists and genres to include in the dataset was a time-consuming task. The final set of artists was concluded to be adequate for the purpose, but given more time, it could have been both larger and more elaborate. Making use of more machine learning algorithms could have been beneficial, as it would shed more light on which algorithms are efficient in this particular situation. But since they are time-consuming both to implement and evaluate, a limit had to be set. The four algorithms considered the most suitable for the project were therefore chosen.

11 I.e. total units manufactured. Many vinyl records are pressed in small quantities, increasing the probability of higher resale prices for sought-after items.

1.7 Target group

This report is likely to cater foremost to readers in a scientific context. The intention was to expand on the available knowledge in machine learning, thus making the work target those with an interest in regression models and price estimation. It may be of particular interest to readers inclined to explore hedonic regression of vinyl records or other collectibles, considering their somewhat distinguishing qualities mentioned in Chapter 1.2.2.

1.8 Outline

The rest of the report includes six more chapters and some appendices. Method describes the approach used to produce the results that answer the problem formulation. The method has two main focuses: the dataset created specifically for this project, and a controlled experiment using machine learning algorithms. Implementation is divided similarly to the previous chapter, but gives a more technically detailed account of how the dataset was created and how the experiment was realized. Results shows the details of the finalized dataset and the result data generated in the machine learning experiment. In the Analysis chapter, the experiment results are examined to see which relevant observations can be made from them. In Discussion, it is evaluated whether the observations of the results can be used to answer the problem formulations. The results are also put in relation to those of earlier, related work. Conclusion summarizes the findings of this project, evaluates their relevance, and suggests some aspects of the subject which could be researched further. Appendix A.1-A.2 shows relevant scripts from the implemented software, and Appendix B.1-B.3 shows some result tables generated in the machine learning experiment.

2 Method

Using a dataset of vinyl records that will be created for this project, and the implementation of multiple machine learning algorithms, a controlled experiment will be conducted. The purpose is to answer the problem formulations in Table 1.1 using results produced in the experiment. This chapter explains the method in further detail.

2.1 Dataset

A comprehensive dataset is needed to be able to perform the experiment. As a detailed dataset of vinyl records does not currently exist, one must be created specifically for this project. Two questions need to be answered to build the dataset. What should be sampled in the dataset in terms of genres, artists, and their total production? Which attributes should the dataset include?

2.1.1 Genres and artists

When working with the price estimation of collectibles, it has been stated that factors such as the identity of the artist are of high importance (see Chapters 1.2.2 and 1.2.3). For an artist to be legitimately represented in the dataset, it has been decided that all of the artist's releases will be included, with minor limitations. The rules for the scope are the same for every artist: all releases in the Albums, Singles & EP's, and Compilations sections of each artist's Discogs profile will be included in the dataset. Those sections represent the essence of an artist's discography. However, there are additional entries associated with an artist, such as appearances on compilations of multiple artists or unofficial releases, none of which are relevant for this dataset 12. It can be assumed that economic patterns vary between different cultural contexts within the spectrum of vinyl releases. Therefore, the intention is to create a rather uniform dataset, i.e. restricted within a certain cultural scope, but also characterized by diversity so that several subdivisions can eventually be observed in the context of the full span. Based on that foundation, it has been decided to use the relatively broad definition of rock/metal as the total scope and to use some well-defined sub-genres as segments within it. Based on common categorization conventions within rock and metal music, a group of subdivisions (hereafter referred to strictly as genres) was decided on: alternative metal 13, alternative rock 14, black metal, classic rock, death metal, doom metal 15, electronic 16, heavy metal, punk rock 17, stoner rock 18, and thrash metal. A selection of artists will be chosen to represent the data for each genre. This will be done by looking up some of the most quintessential artists for each genre 19, and then expanding the roster by adding artists found by researching similar artists on online resources 20.

12 Compilations of multiple artists have no purpose when evaluating discographies of individual artists. Unofficial releases, i.e. bootlegs, pirate releases, and counterfeits, were not the focus of this study. Also, there is no financial data available for them at Discogs.
13 Also including examples of nu-metal, industrial metal, and similar styles.
14 Also including examples of grunge, indie rock, and similar styles.
15 Also including examples of sludge metal, post-metal, and similar styles.
16 Although not fundamentally a part of the rock genre, this selection focuses on styles with a strong association to it, such as 70's electronic music.
17 Also including examples of garage rock and similar styles.
18 Also including examples of psychedelic rock, rock, and similar styles.
19 Wikipedia pages for the genres list artists which are of high importance for the genre.
20 Both Spotify and Last.fm provide a register of similar artists for each artist based on users' listening activity. It was verified that the styles of similar artists did in fact match.

2.1.2 Attributes

Discogs is the most comprehensive online resource for music releases. It serves both as a source of information about items and as a marketplace where users can buy and sell any of the items in the database. It is structured such that for every specific issue of every title by every artist, there is a dedicated page displaying the item's known details as well as the median price of copies previously sold at the site. Given its broad information scope and convenient structure, it has been chosen as the source of data for the project. What contributes to the value of a record is generally a combination of many different factors. Many of those factors are described on a website specializing in the value of second-hand records [41]. This resource will be used to create a list of relevant variables. It will then be determined which of those variables can be retrieved from Discogs, whether any additional variables can be obtained from Discogs, and whether any hedonic features can be independently constructed using data available at Discogs. The median price for each item, which will be the dependent variable of the dataset, will also be retrieved from Discogs. It will be obtained in USD currency format. During implementation, each variable's effect on the performance will be tested, and in case a variable shows a generally negative effect on the experiment results, it will be excluded from the final dataset.

2.1.3 Data trimming

After all the data has been collected, some tests will be done to determine whether the dataset should be trimmed to improve performance. It will be checked whether artists with smaller amounts of samples have a significant impact on the results. Using machine learning models trained with the dataset and an iteration through different values of n, ranging from the smallest number of samples for an artist up to 200, the performance will be tested while excluding artists with fewer than n samples. The results will be examined to decide whether a required minimum sample count for artists should be used. Similarly, a test will be done to examine the effect of having an upper limit on the number of samples from one artist, considering that there are likely to be big differences in sample counts between artists. There will be an iteration over values of n (a range between 50 and 3000) where each artist's samples will be trimmed above n samples per artist, to find out where (or whether) to set the limit to get the best results. A test will also be conducted to inspect the effect of outliers in the form of samples with significantly higher dependent variables than the majority of the samples. The purpose is to find out whether relatively few samples have a large negative effect on the overall performance. This will be done similarly to the tests described above. Values of n will be tested iteratively, where n is the maximum limit for samples to be included in the dataset, and the performance of models will be measured to find the ideal value of n.

2.1.4 Variable encoding

All variables of the vinyl dataset will need to be numerical. That is a technical requirement for being processed by the machine learning algorithms which will be used in this project. Therefore, any categorical variables will need to be encoded as numerical data. This will be done differently depending on the nature of the variable. If a variable can only be either false or true, it is solved by encoding those values as 0 and 1. Features with more than two options cannot simply be encoded as other numbers unless there exists a natural ordering between the options. In those cases, dummy variables will be created, which means that for every possible value of the variable, a new variable is created, so that a 0 or 1 value can be applied to the specific values.
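As an illustration of the encoding described above, the following sketch (with hypothetical column names) converts a true/false variable to 0/1 and expands a categorical variable without natural ordering into dummy variables using Pandas.

import pandas as pd

df = pd.DataFrame({
    "limited_edition": [True, False, True],                  # binary feature
    "area": ["North America", "Northern Europe", "Asia"],    # categorical, no natural ordering
    "usd": [25.0, 8.5, 40.0],                                # dependent variable
})

# Binary variables become 0/1.
df["limited_edition"] = df["limited_edition"].astype(int)

# Categorical variables with more than two values become dummy variables, one column per value.
df = pd.get_dummies(df, columns=["area"])
print(df.head())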

2.1.5 Feature scaling

Min-max normalization will be applied to the data, rescaling all features to a [0, 1] range. This will be done because machine learning algorithms generally perform better when all features are normalized to the same unit range.
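A minimal sketch of this step is shown below; the scaler is fit on the training data only and then applied to both portions, mirroring the procedure later described in Chapter 3.2.3. The feature values are made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1960, 0], [2015, 1], [1999, 1]], dtype=float)  # stand-in features
X_test = np.array([[1985, 0]], dtype=float)

scaler = MinMaxScaler()                       # rescales each feature to the [0, 1] range
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test data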

2.2 Controlled experiment

2.2.1 Algorithm selection

As noted in Chapter 1.2.3, no studies have been done where the performance of multiple algorithms is compared with regard to predicting prices of vinyl records. Therefore, an important aspect of this project will be to study some of the most common machine learning models used in similar situations. It was established in Chapter 1.2 about related work that Linear Regression and Neural Network were the two most commonly used algorithms in the works studied. It was also observed that Random Forest is quite prevalent and often outperforms other algorithms when compared. Therefore, these three models have been chosen for this project. In addition to those, it has also been decided to use k-Nearest Neighbors, which is argued to be one of the simplest machine learning algorithms [3, pp. 35]. The four algorithms will hereafter be referred to as LR, NN, RF, and KN respectively. The functionality of all the algorithms can be adjusted with hyperparameters, which is commonly done to improve the accuracy of predictions. However, since the same implementation will be used for the evaluation of many different portions of data, the models should not be fine-tuned too closely to a specific sample. Instead, the plan is to find some rough settings which can be considered to generally improve the accuracy, even if only to a relatively small extent. They will be evaluated using the grid search technique, i.e. accuracy will be measured while evaluating the models iteratively, using slightly different settings each time.
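The following sketch illustrates the kind of rough grid search referred to above, using k-Nearest Neighbors and randomly generated stand-in data; the candidate values shown are examples only, not the grids actually evaluated in the project.

import numpy as np
from itertools import product
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))                          # stand-in feature matrix
y = X @ np.array([3.0, 1.0, 0.0, 2.0, 5.0])       # stand-in prices
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

best_score, best_params = float("-inf"), None
for n_neighbors, weights in product([3, 5, 10], ["uniform", "distance"]):
    model = KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights).fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    if score > best_score:
        best_score, best_params = score, {"n_neighbors": n_neighbors, "weights": weights}

print(best_params, best_score)  # the roughly best-performing setting over the tested grid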

2.2.2 Measuring accuracy

The aim of research question Q1 in Table 1.1 is to investigate the accuracy of price predictions that can be achieved in this project. This will be measured with the R² (coefficient of determination) score, the MAPE (mean absolute percentage error) score, and frequency distribution tables of APE (absolute percentage error) scores. These metrics were chosen because they represent prediction error in a way that is not relative to the dependent variable. Multiple portions of data will be evaluated in the project, and a metric that is relative to the dependent variable (e.g. mean squared error or mean absolute error) would not be very useful when, for example, comparing the predictions for two different artists. The R² score is a float value that can have 1.0 as its maximum value. It represents the proportion of the dependent variable that can be calculated using the independent variables. That means that if the R² score is 1, the dependent variable of all predictions in the evaluation can be accurately calculated without error. A score as close to 1 as possible is therefore desirable. An APE score will be generated for every testing sample in the evaluation and expresses the percentage difference between the predicted price and the actual price 21. In order to show an overview of the outcome for an evaluated portion of samples, tables of APE score frequency distribution will be created. They will show the percentage of APE scores in the ranges 0-10%, 10-20%, etc. up to 140-150%, with one column for APE scores higher than 150%. A MAPE score is the mean of the APE scores for a set of predictions and is therefore useful for estimating their accuracy.

21 I.e. if the actual price is 10 and the predicted price is 9 or 11, the APE score is 10%.
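The sketch below shows how the three measures can be computed for a small, made-up set of actual and predicted prices; it illustrates the definitions above rather than the project's actual evaluation code.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 50.0, 8.0])   # actual prices (made up)
y_pred = np.array([9.0, 25.0, 40.0, 8.8])    # predicted prices (made up)

r2 = r2_score(y_true, y_pred)                     # coefficient of determination
ape = np.abs(y_true - y_pred) / y_true * 100      # absolute percentage error per prediction
mape = ape.mean()                                 # mean absolute percentage error

# Frequency distribution of APE scores: 0-10%, 10-20%, ..., 140-150%, and one bin for >150%.
bins = list(range(0, 160, 10)) + [float("inf")]
counts, _ = np.histogram(ape, bins=bins)
distribution = counts / len(ape) * 100            # share of predictions in each range
print(r2, mape, distribution)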

2.2.3 Experiment configuration

After the dataset has been finalized, the data for each artist will be divided into training and testing portions, 80% and 20% respectively. The data will be used to train and evaluate many different models, but the training and testing portions of each artist will remain the same throughout all stages of the experiment. The training portion of the full dataset will be used to train models using all four algorithms. These models will be evaluated using 1) all testing data, 2) the testing data of every separate genre, and 3) the testing data of every separate artist. Also, for every attribute in the dataset, another set of the four models will be trained and evaluated using the full dataset, but with one attribute excluded in each iteration. For each genre, models will be trained using the training data of all the artists from the respective genre. The models will be evaluated using 1) all the testing data of the genre they were trained with and 2) the data of every artist belonging to that genre as separate evaluations. Training data of each artist will also be used to train individual models using all four algorithms, and they will be evaluated with the artist's own testing data. The predictions made using the separate artist and genre models will also be used to produce R² and MAPE scores for the larger scopes using those models. To achieve this, the predictions generated using the models of all the separate artists and genres will be collected and used to produce scores at the genre and full-dataset level. That way, it will be possible to evaluate the full dataset using only artist models, using only genre models, or using the models trained with the full dataset. Similarly, it will also be possible to evaluate the data of the separate genres using only artist models. The outcome is that portions at all three levels (artist, genre, full) of the dataset can be evaluated using models trained solely with data from any of those levels. Ultimately, the experiment will be based on evaluations using four models for each artist and genre, as well as for the full dataset times the number of attributes plus one. The purpose of using this selection of models is to be able to evaluate all portions of data both independently and in bigger contexts, using multiple algorithms. The results will be used to answer the problem formulations listed in Table 1.1.
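As an illustration of how predictions from separately trained models can be pooled into a score for a larger scope (e.g. a genre or the full dataset), the sketch below collects the actual and predicted prices from several per-artist evaluations and computes a single R² score; the data is made up and the helper is not taken from the actual implementation.

import numpy as np
from sklearn.metrics import r2_score

def pooled_r2(per_artist_results):
    """per_artist_results: list of (actual_prices, predicted_prices) pairs, one per artist model."""
    y_true = np.concatenate([np.asarray(actual, dtype=float) for actual, _ in per_artist_results])
    y_pred = np.concatenate([np.asarray(pred, dtype=float) for _, pred in per_artist_results])
    return r2_score(y_true, y_pred)

# Two artists evaluated with their own models, then scored together as one larger portion.
print(pooled_r2([([10, 20, 15], [11, 18, 15]), ([100, 80], [90, 85])]))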

2.3 Reliability and Validity

In this section, the reliability, internal validity, and external validity of this project are discussed. Reliability refers to the consistency of the results produced in the project, and it is discussed whether they would be the same if the experiment were repeated. Internal validity is about the legitimacy of the data collected, and it is argued whether the data accurately represents what it is meant to. In the external validity section, it is discussed whether the results of the project can be generalized to other configurations outside of this study.

2.3.1 Reliability

State-of-the-art software tools for collecting and evaluating data will be used (see Chapter 3 for details). The dataset will be generated from a website where the information is consistently updated in terms of new releases, information about the releases, and sale prices. Therefore, a dataset generated today and a dataset generated tomorrow using the same settings would likely include slightly different data. However, if the experiment were executed on the same dataset multiple times, it would still show a very high level of consistency. The KN and LR algorithms yield identical results at every iteration, while RF and NN do not, as they are random by nature and will produce slightly varying results.

2.3.2 Internal validity

The number that will be used as the dependent variable in the dataset is the median price of the ten last sold copies of an item at the Discogs marketplace. Although this is considered the best available approximation of the second-hand price of a vinyl record, it is debatable how accurately it represents it. These numbers are based on sales where the prices were determined by the sellers, and they may differ from the selling prices normally occurring in forums other than Discogs, e.g. online auction sites like eBay or physical record shops.

2.3.3 External validity

While the experiment will evaluate the data of specific artists and genres, it will use data that can be retrieved in the same format for any artist. If another dataset were built using the same number of artists from different genres, it is reasonable to believe that some of the findings made in this project would be valid for that dataset too. For example, if an algorithm proved superior for most of the artists in the rock/metal dataset, it would probably also be superior for another set of artists. The same goes for the efficiency of using extra contextual data. However, results in terms of achieved accuracy may differ between divisions of data, such as artists and genres, so results such as MAPE scores are not guaranteed to be similar when performing the experiment with other data. Whether the results would be generalizable to data groups outside of the vinyl record area, such as other collectibles, is difficult to say. Results depend on factors such as the amount and quality of the data available. As datasets of other types of collectibles would likely include different attributes, they would need to be evaluated in an experiment to determine how general the results are.

2.4 Ethical Considerations

None of the artists included in the dataset will be mentioned by name, as it cannot be certain that all artists would want to be explicitly featured in an experiment like this. Also, the purpose of the project is not to investigate the results of specific artists, but rather of artists and genres in general. Ethics concerning artificial intelligence is a relevant topic that is widely addressed. Different forms of AI are being integrated into society, which causes many topics of debate. Some examples include increased unemployment as a result of automated jobs, the issue of how revenue generated by AI should be distributed, and how to handle eventual mistakes caused by machines [42]. Many of these common, wide questions are more relevant for other types of applications and systems, but there are a couple of questions that can be considered also in the context of this project. Is part of the charm of dealing with collectibles that prices of items should be somewhat unpredictable and spontaneous? Could an eventual AI approach to the prices of collectibles make things too static and somehow alter the mechanics of collecting valuables? These questions arose during the project and could be further elaborated on, should the concepts of price prediction of collectibles become more advanced.

3 Implementation

Several pieces of software were developed for this project: a class for communicating with databases, a module that collects, formats, and saves data into a database (dataset.py, see Appendix A.1), a module that performs some manipulations on the existing dataset, a module with two classes used for executing the experiment (experiment.py, see Appendix A.2), and a script for executing the machine learning experiment using said classes. The outline of the software's structure and functionality, as well as some configuration details which were settled during implementation, are described in this chapter. The implementation was done in Python (version 3.6.9) and Table 3.1 shows the libraries used. The language was chosen because it is a well-established language for data science as well as an efficient general-purpose language [3, pp. 5], suitable for the various tasks of this project.

Name | Version | Usage
Beautiful Soup | 4.9.0 | A tool that facilitates the process of extracting information from HTML code. It was used to filter pieces of data from each downloaded page.
Pandas | 1.0.3 | Used to store and organize the data used in the machine learning implementation. Reads data into a table-like DataFrame object which offers useful methods for data manipulation.
Requests | 2.32.0 | Used to download web pages and retrieve the HTML code.
Scikit-learn | 0.22.2 | Offers a large selection of tools for machine learning. In this project, it is used for its machine learning algorithms, metrics, and normalization libraries.
Sqlite3 | 3.22.0 | Creates a connection to an SQLite database file, allowing data to be saved and loaded.
Unittest | 3.6.9 | Unit testing framework for Python.
Table 3.1: Python libraries prominently used in the implementation.

3.1 Dataset

3.1.1 Attribute selection

A list of relevant variables 22 was made after studying a website specializing in the value of second-hand records [41]. Out of those variables, it was concluded that the following could be extracted from Discogs: age, area of release, artist, colored vinyl, limited edition, picture disc, promotional item, record label, and test pressing. The website was further studied to determine whether any other useful hedonic features could be obtained. The results were the following: box set, numbered, and release type. The median price of each item, which is the dependent variable of the dataset, would also be retrieved from Discogs in USD currency format. It was apparent that additional hedonic features could be independently constructed from data available at Discogs. Area is a refined version of the area of release information obtained from Discogs. The original data was more geographically precise, specifying the countries where the item was released. The unique values were reduced to a set number of categorical values 23 to evaluate whether this was beneficial. Testing proved that it was, so the narrowing of values was retained. Chronology is an integer on a 1-10 scale representing the moment of manufacturing between the start of the artist's career and now, i.e. the early releases of an artist have a low number. Issue type describes whether the item is an original which has not been reissued, an original which has been reissued, or a reissue. Version is a number that represents the total amount of versions of that title. Vinyl exclusive tells whether or not there are any non-vinyl releases available of the title. At Discogs, compilation titles are listed on a page of their own and are not included with other albums or singles. This is an understandable approach for maintaining the structure of their website. However, as samples in the dataset for this project, a full-length compilation is essentially the same release type as an ordinary album. Therefore, for the sake of consistency, short compilations were treated as singles/EP's and full-length compilations were treated as albums in the context of this dataset. Their status as compilations was accounted for by adding a compilation attribute. While the features were being tested after the data had been collected, it was found that the record label data retrieved from Discogs was complicated to find a suitable use for. The feature had more than 3500 unique values and, being categorical, needed as many dummy variables. Not only did it drastically increase the number of features, but it also seemed to worsen the general performance of the models. A possible solution could have been to create a new feature based on the information, e.g. an attribute expressing whether it is a small or major label. That would, however, have required a thorough investigation of all unique record labels, which was not possible. Therefore, the record label data was excluded.

22 Artist, age, promotional item, small or major label, colored vinyl, picture disc, test pressing, area of release, limited edition, and reissue status.
23 Africa, Asia, Australia/New Zealand, Eastern Europe, Northern Europe, Central Europe, Southern Europe, North America, South America, and World (used for items with unknown or multiple areas).

3.1.2 Data collection

A lot of data needed to be retrieved from the Discogs website. For every artist to be included in the dataset, information about the complete album, singles, and compilation discographies was needed. And for every item, several individual attributes were needed. While Discogs does have a free-to-use REST API that can provide most of the information needed, it lacks the target variable - the median price of previously sold copies. That number is only available on the individual page of each item, meaning every page needed to be downloaded. Since all data available from the API was also accessible from the pages that needed to be downloaded, the website was used as the only source of data. To gather the complete discography data for an artist, the generate_artist_discography function in the dataset.py module is called with parameters representing the artist's name and the URL to the artist page on the Discogs website 24. The first thing that happens is that the main pages for the artist's album releases 25, EP/single releases 26, and compilation releases 27 are downloaded. The releases are split up over several pages using pagination, and all of them need to be retrieved, but with the option of showing 500 releases per page, the number of pages is normally quite small. However, these pages are merely a list of releases by the artist, and the next step is to retrieve links to the item pages. Using Beautiful Soup, every entry in the list of releases is scanned, and the entries can be of two types. If there is only one version available of a release, the link goes directly to the item page of that unique release. But if there are multiple versions of the release (as is the case for most releases - many of them, especially those of major artists, have hundreds), the link leads to a master page for that release. The master page holds the links to the item pages. During this process, items that are not official (i.e. pirate or counterfeit releases) or not in the vinyl format are filtered out. After links have been collected for all individual items for the artist, they are downloaded. With the HTML code for all individual item pages downloaded, it is processed through some functions to retrieve the needed information about each item. The attributes which are explicitly available in text form on the page 28 can be retrieved using Beautiful Soup. Some attributes, however, cannot be determined with the information from the individual HTML page only 29. They need to be evaluated in relation to the other downloaded pages, which is also done in this process. An example of this is the categorical feature issue type, which can have three different values depending on whether it is an original issue which has not been reissued later, an original issue which has been reissued, or simply a reissue. As the median prices on the pages are displayed in a currency depending on the IP address of the user, and the experiment was not executed in the USA, a VPN service was utilized to acquire the dependent variable in the USD currency format. After data has been collected for all items retrieved, it is saved to an SQLite database. This process was repeated for every artist decided to be included, creating the full dataset. Another small piece of software was later used to perform some minor adjustments on the dataset. This script was written to add a few features 30 to the dataset which had not yet been fully established at the time of the implementation of the primary dataset creation module. However, it does not require any additional online data that has not already been retrieved, so a short script was enough to make the changes.

24 For example https://www.discogs.com/artist/125246-Nirvana
25 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Albums&type=Releases
26 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Singles-EPs&type=Releases
27 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Compilations&type=Releases
28 Record label, release year (recalculated as age later), country, limited pressing, colored disc, picture disc, box set, numbered, promotional item, test pressing, and usd.
29 Issue type, versions, and vinyl exclusive.
30 Chronology, area, age, and genre.
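The sketch below outlines the link-collection step in a simplified form; the URL handling and the filtering rule are placeholders and do not reflect the actual Discogs markup or the logic in dataset.py.

import requests
from bs4 import BeautifulSoup

def get_item_links(artist_releases_url):
    """Collect links to individual item pages from an artist's release listing (simplified sketch)."""
    html = requests.get(artist_releases_url).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Placeholder filter: keep links that look like release pages or master pages.
        if "/release/" in href or "/master/" in href:
            links.append("https://www.discogs.com" + href)
    return links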

3.1.3 Data trimming calibration

To evaluate which values to use for data trimming, tests were done according to the principles described in Chapter 2.1.3. The classes of the experiment.py module were used iteratively to produce R² scores from evaluations of the full dataset using different values. The results were then analyzed to see which values generated the most preferable results.

Figure 3.1 shows the results of using different values for the required minimum number of samples for an artist to be included. A slight increase in R² scores started to show when artists with fewer than 50 releases were excluded, becoming more stable again after passing 100. However, since about 40% of all artists in the dataset would have been excluded if a minimum of 50 items were required, and because the difference in performance was rather small, no limit was set.

Figure 3.1: A slightly increased R² score when excluding smaller artists.

Figure 3.2 shows the results of using different values for the upper limit of samples used for one artist. The R² scores peak at 300, so the decision was made to use this limit. Figure 3.3 shows the results of using different values for the maximum target variable allowed for a sample to be included. The graph shows that outliers like these can have a large impact on the overall performance despite being relatively few samples, and that a lower maximum limit on the target variable leads to higher R² scores in this scenario. However, the purpose was to cut as little as possible while still making a significant improvement in results. After studying the results of the test, the limit was set at 300, meaning all items with a median selling price of over $300 are excluded. Compared to not excluding any samples based on high prices, this produced approximately twice as high R² scores while incurring a data loss of less than 1% of the dataset.
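The following sketch illustrates how a calibration loop of this kind can be realized for the price cap, assuming a DataFrame df with already encoded features and a usd target column; the candidate limits, model settings, and column names are examples and assumptions, not the exact values or code used in the project.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score_price_caps(df, caps=(100, 200, 300, 500, 1000)):
    """Evaluate the dataset once per candidate price cap and return the R² score for each cap."""
    scores = {}
    for max_price in caps:
        trimmed = df[df["usd"] <= max_price]                    # drop samples above the cap
        X, y = trimmed.drop(columns=["usd"]), trimmed["usd"]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
        scores[max_price] = r2_score(y_te, model.predict(X_te))
    return scores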


Figure 3.2: R² scores when data is trimmed for artists with large discographies.

Figure 3.3: R² scores when data is trimmed based on target values.

3.2 Controlled experiment

To realize the experiment of this project, there was a need for an environment that enabled machine learning model evaluation in several different ways. Apart from using the full set of collected data, there was a need to perform calculations also on parts of it, such as genres and artists, and to be able to easily access this data. There was also a need to train and evaluate a number of models with different data and retrieve the results. For these purposes, the two classes Dataset and Regression were developed.

3.2.1 Hyperparameter calibration

The hyperparameter settings for the machine learning algorithms were calibrated similarly to how the preferable values for data trimming were determined. The classes of the experiment.py module were used iteratively to produce evaluation results using different values. As mentioned in Chapter 2.2.1, the focus was not to fine-tune the settings too closely to a specific sample, but rather to find some rough settings which could be considered to generally improve the accuracy. After testing different values for a selection of hyperparameters for each algorithm, using both the full dataset as well as genres and artists, some settings which proved generally effective were decided on. The hyperparameters used (deviating from the default settings of Scikit-learn version 0.22.2) can be seen in Table 3.2.

Algorithm | Hyperparameter settings
k-Nearest Neighbors | n_neighbors: 5, weights: distance
Linear Regression | fit_intercept: False
Neural Network | solver: lbfgs, activation: relu
Random Forest | n_estimators: 100
Table 3.2: Hyperparameter settings used for the algorithms.
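For reference, the sketch below instantiates the four Scikit-learn regressors with the settings from Table 3.2; all other hyperparameters are left at the library defaults.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# The four models with the hyperparameter settings listed in Table 3.2.
models = {
    "KN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "LR": LinearRegression(fit_intercept=False),
    "NN": MLPRegressor(solver="lbfgs", activation="relu"),
    "RF": RandomForestRegressor(n_estimators=100),
}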

3.2.2 Dataset class

This class utilizes the Sqlite3 and Pandas libraries during initialization to load all data from the database into a DataFrame that is stored in the object. It is in this process that the dataset is trimmed using the values established in Chapter 3.1.3. The SQL statement can be adjusted using two variables, artist_min and max_price. The first sets the lower limit for how many items an artist needs to have in order to be included 31, while the latter sets an upper limit for how high the price is allowed to be for an item to be included 32. After the data has been imported, it is processed by two methods. One of them, trim_artists, iterates through every artist in the dataset and removes the number of samples exceeding the value of the artist_max variable. The next method, train_test, also iterates through each artist to split their data into training and testing portions. The ratio is determined by the train_size variable, which must be a float between 0.1 and 0.9. The data for each artist is divided into a training portion and a testing portion, which are then appended to training and testing sets containing all artists. The reason that this is done per artist, and not on the whole dataset at once, is to ensure that the rows of each artist are evenly split between the training and testing sets. The split data for all artists is contained in two sets and can be accessed easily using the get_data method, which takes the two parameters artist and genre. If it is called with both of them set to None, the method returns a tuple with the training and testing sets including all data. Should either of the two parameters be set to a string representing an artist or genre in the dataset, the returned tuple will include training and testing sets with only that artist or genre.

31 For example, if artist_min is set to 20, all artists with fewer than 20 items will be excluded.
32 For example, if max_price is set to 200, all items with a price of over 200 will be excluded.
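A minimal sketch of the per-artist splitting described above (assuming a pandas DataFrame with an "artist" column; the helper name mirrors the thesis code but the body is illustrative only):

import pandas as pd

def train_test(df, train_size=0.8, seed=0):
    """Split each artist's rows into training and testing portions and
    concatenate them into two sets covering all artists (sketch)."""
    train_parts, test_parts = [], []
    for _, rows in df.groupby("artist"):
        rows = rows.sample(frac=1, random_state=seed)  # shuffle within the artist
        cut = int(len(rows) * train_size)
        train_parts.append(rows.iloc[:cut])
        test_parts.append(rows.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)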

3.2.3 Regression class

In the initialization method of this class, the parameters artist and genre are used to define what training and testing data will be loaded into the object. The aforementioned get_data method of the Dataset object is called in the initialization method (which means a Dataset object needs to be instantiated beforehand). The values of the artist and genre parameters are passed on to get_data, and the returned tuple is stored in the Regression object to be used as training and testing data later. After that, several methods are called within the initialization process. Firstly, remove_unique_cols drops columns in the dataset where all rows share the same value (i.e. only one unique value), which will be the case for the artist and genre columns if not all data was retrieved. Secondly, the encoding of categorical values is handled in the one_hot_encoding method. Within these methods, the training and testing data is temporarily merged and unmerged using the merge and unmerge methods, ensuring that the operations are performed on both parts of the dataset. Thirdly, the standardize method is called, which uses the MinMaxScaler of the Scikit-learn library to transform all values of the dataset (except for the dependent variable) into the same 0-1 scale. The scaler is fit using the training data, and the transformation is then applied to both the training and testing data. After the data has been preprocessed, the models are created. Using the KNeighborsRegressor, LinearRegression, MLPRegressor, and RandomForestRegressor classes from the Scikit-learn library, and the hyperparameter values in Table 3.2, the models are fit with the training data and stored in the object. Calling the evaluate_models method will return prediction results for the testing data. Two of its parameters, artist and genre, enable the possibility to test the model for a specific artist or genre only (see usage example in Code 3.1). For every model in the object, a list of predicted prices is generated. This list, along with a list of the actual prices, is sent to a function, get_scores, which returns an R² score, a MAPE score, an array of APE scores as well as a frequency distribution table of APE scores. The method then returns the results for all models as a Python dictionary. The method has the optional parameters results and perc_list. If results is set to True, the lists of actual and predicted prices will also be returned for each model. This enables the merging and evaluating of results from multiple models mentioned in Chapter 2.2.3. If perc_list is set to True, a list of all APE scores gets added to the results as well.
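The scaling step can be sketched as follows (a minimal illustration, assuming pandas DataFrames and a dependent variable column named "price"; this is not the project's exact implementation):

from sklearn.preprocessing import MinMaxScaler

def standardize(train, test, target="price"):
    """Scale all independent variables to the 0-1 range, fitting the scaler
    on the training data only and applying it to both sets (sketch)."""
    cols = [c for c in train.columns if c != target]
    scaler = MinMaxScaler().fit(train[cols])
    train = train.copy()
    test = test.copy()
    train[cols] = scaler.transform(train[cols])
    test[cols] = scaler.transform(test[cols])
    return train, test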

# Instantiate Dataset object.
data = Dataset()

# Instantiate Regression object trained with the full dataset.
regr_all = Regression()

# Get prediction scores for the full dataset.
all_res = regr_all.evaluate_models()

# Get prediction scores for one artist, trained with the full dataset.
artist_scores_trained_all = regr_all.evaluate_models(artist='artist name')

# Instantiate Regression object containing data of a specific artist.
regr_artist = Regression(artist='artist name')

# Get prediction results for one artist, trained with the artist's data only.
artist_res = regr_artist.evaluate_models()

Code 3.1: Basic usage of the Dataset and Regression classes.
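A rough sketch of what a scoring helper like get_scores could compute, following the description above (an illustration, not the project's exact code):

import numpy as np
from sklearn.metrics import r2_score

def get_scores(actual, predicted):
    """Return R2, MAPE, the individual APE scores and an APE frequency
    distribution in 10-point bins up to 150% plus one '>150%' bin (sketch)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.abs(predicted - actual) / actual * 100
    bins = list(range(0, 160, 10)) + [np.inf]
    counts, _ = np.histogram(ape, bins=bins)
    freq = counts / len(ape) * 100  # percentage of predictions per bin
    return r2_score(actual, predicted), ape.mean(), ape, freq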

3.2.4 Experiment execution

To produce all results according to the experiment configuration in Chapter 2.2.3, a script was written that uses the finished dataset and the classes described in this chapter. Regression objects were created for every artist and genre as well as for the full dataset, along with variants of the full dataset where each attribute was excluded. The objects were then used to generate results that were saved to a database. Figure 3.4 shows a visual representation of the experiment.
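The driver script is not reproduced here, but its overall flow could be sketched as follows (the lists of genres and artists and the exact result bookkeeping are assumptions based on the description above):

# Sketch of the experiment flow (illustrative names only).
genres = []   # fill with the 11 genre names in the dataset
artists = []  # fill with the 412 artist names in the dataset

data = Dataset()
regr_full = Regression()                      # model trained on the full dataset
results = {"full": regr_full.evaluate_models()}

for genre in genres:
    results[("genre", genre)] = Regression(genre=genre).evaluate_models()
    results[("genre, full model", genre)] = regr_full.evaluate_models(genre=genre)

for artist in artists:
    results[("artist", artist)] = Regression(artist=artist).evaluate_models()
    results[("artist, full model", artist)] = regr_full.evaluate_models(artist=artist)

# Feature-exclusion variants would be handled similarly, and all results
# would then be written to the results database.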

Figure 3.4: Visual representation of the experiment.

4 Results

This chapter shows the structure of the finalized dataset, as well as results produced using that dataset in the machine learning experiment.

4.1 Dataset

The finalized dataset after trimming includes the sampled discographies of 412 artists in 11 different genres. See Table 4.1 for more details regarding artists and samples in each genre as well as for the full dataset. Table 4.2 shows details of the attributes of the dataset.

GENRE               ARTISTS   SAMPLES (TOTAL)   TRAINING   TESTING
(All)               412       37046             29490      7556
Alternative metal   41        2070              1640       430
Alternative rock    46        4788              3812       976
Black metal         43        2526              2004       522
Classic rock        13        3900              3120       780
Death metal         34        2130              1692       438
Doom metal          64        4037              3204       833
Electronic          40        3982              3172       810
Heavy metal         35        3910              3118       792
Punk rock           43        5537              4413       1124
Stoner rock         26        1294              1027       267
Thrash metal        27        2872              2288       584

Table 4.1: Details of the genre subdivisions of the dataset.

NAME               TYPE         NOTES
Age                numerical    Years since manufacturing.
Area               categorical  The geographical area of the release.
Artist             categorical  Name of artist.
Box set            categorical  True or false.
Chronology         numerical    A number on a 1-10 scale.
Colored vinyl      categorical  True or false.
Compilation        categorical  True or false.
Genre              categorical  Name of genre.
Issue type         categorical  Original (not reissued), original (reissued) or reissue.
Limited edition    categorical  True or false.
Numbered           categorical  True or false.
Picture disc       categorical  True or false.
Price              numerical    The median price in USD. Dependent variable.
Promotional item   categorical  True or false.
Release type       categorical  Album or single.
Test pressing      categorical  True or false.
Versions           numerical    The number of other versions available of the same title.
Vinyl exclusive    categorical  True or false.

Table 4.2: Attributes used in the finalized dataset.

4.2 Controlled experiment

This chapter presents tables giving an overview of the results from evaluating the multiple aspects of the dataset mentioned in Chapter 2.2.3. The largest tables, representing results for the evaluated genre and artist portions, can be found as appendices. Some selected parts of the results are also displayed using graphs. There are two kinds of tables: one that shows R² and MAPE scores, and one that shows the frequency distribution of APE scores. Some abbreviations are used throughout the chapter. ED stands for evaluated data and refers to the group of data that has been evaluated. TS describes the number of testing samples for that data group. SM stands for sample mean (i.e. the mean value of all the target variables). TR refers to the training data used, with F, G, and A standing for full dataset, genre, and artist respectively (meaning the evaluations were done using models trained with those data divisions). The remaining abbreviations refer to the algorithms and error metrics used (described in Chapter 2.2.1 and 2.2.2 respectively). The columns with numerical values as headers describe the percentage of predictions belonging to the APE span of that column. For example, the value 15.2 in the first row of Table 4.4 indicates that when the full dataset was evaluated with k-Nearest Neighbors using the whole dataset as training data, 15.2% of the predicted prices differed from the actual price by 10-20%. The F column refers to the figure illustrating the data of those rows.

4.2.1 Evaluating full dataset

All rows in Table 4.3 show the R² and MAPE scores of the evaluation of the full dataset of 37046 samples (of which 7556 were used for testing), while Figure 4.1 shows a visual representation of the MAPE scores of that table.

ED    TS     SM      TR    KN (R²/MAPE)     LR (R²/MAPE)     NN (R²/MAPE)      RF (R²/MAPE)
all   7556   31.57   F     0.247 / 60.32    0.272 / 86.45    0.375 / 71.95     0.404 / 55.84
all   7556   31.57   G     0.247 / 60.1     0.302 / 83.95    0.248 / 78.0      0.409 / 57.89
all   7556   31.57   A     0.236 / 60.73    0.102 / 87.89    -0.697 / 87.05    0.34 / 61.75

Table 4.3: R² and MAPE for evaluation of the full dataset.

Figure 4.1: MAPE scores for evaluation of the full dataset.

Table 4.4 shows the frequency distribution of APE scores of the full dataset, while Figure 4.2 shows a visual representation of the first four rows of that table, i.e. the evaluation of the testing samples of the full dataset using models trained with all training data.

AL TR 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 > KN F 18.6 15.2 12.6 10.8 9.0 6.8 5.5 4.0 3.5 1.6 1.5 1.2 1.0 0.9 0.7 7.3 LR F 10.8 10.4 10.2 9.4 8.5 7.9 6.8 5.0 4.3 3.0 2.4 2.0 1.8 1.5 1.3 14.8 NN F 13.7 12.1 11.8 10.4 9.0 7.4 5.9 4.6 3.7 2.6 1.9 2.0 1.6 1.3 1.4 10.6 RF F 19.3 17.0 13.5 11.1 8.5 6.4 5.2 3.6 2.6 1.4 1.2 1.0 0.9 0.6 0.7 6.8 KN G 18.5 15.3 12.4 10.9 9.0 6.8 5.4 4.2 3.3 1.6 1.7 1.1 1.0 0.9 0.7 7.1 LR G 11.9 10.9 10.4 9.4 8.1 7.4 7.1 5.0 4.1 3.1 2.5 1.8 1.9 1.5 1.2 13.7 NN G 13.4 12.9 11.8 10.2 8.2 7.3 6.0 4.8 3.5 2.8 2.2 1.7 1.6 1.3 0.9 11.6 RF G 18.5 17.5 12.9 10.9 8.9 6.6 5.0 3.7 2.6 1.6 1.3 1.0 1.0 0.7 0.7 7.3 KN A 18.1 15.2 12.3 11.1 8.5 7.0 5.5 4.7 3.4 1.6 1.4 1.0 1.2 0.8 0.7 7.4 LR A 12.7 12.5 11.1 9.3 8.2 7.8 6.1 4.9 3.8 2.9 2.4 1.7 1.6 1.1 1.4 12.5 NN A 15.0 13.2 11.7 9.2 8.4 6.5 5.7 4.5 3.5 2.9 1.8 1.5 1.4 1.3 1.1 12.3 RF A 18.2 16.9 13.1 10.7 8.8 6.2 5.1 4.0 2.9 1.9 1.2 1.2 0.9 0.8 0.7 7.5 Table 4.4: APE frequency distribution for evaluation of the full dataset.

Figure 4.2: APE frequency distribution for evaluation of the full dataset using the full dataset trained model.

Table 4.5 shows the results of the full dataset being evaluated while excluding one feature at a time, and Figure 4.3 is a visual representation showing MAPE scores for each algorithm based on that data.

ED TS SM TR KN LR NN RF R2 MAPE R2 MAPE R2 MAPE R2 MAPE - age 7556 31.57 F 0.24 61.16 0.272 86.63 0.376 74.25 0.388 55.38 - album 7556 31.57 F 0.197 67.18 0.236 97.02 0.316 85.84 0.369 64.01 - area 7556 31.57 F 0.248 62.08 0.26 89.73 0.38 77.71 0.378 60.12 - artist 7556 31.57 F 0.19 68.88 0.192 102.82 0.289 89.61 0.335 72.69 - box_set 7556 31.57 F 0.218 61.21 0.227 87.66 0.338 75.72 0.346 56.74 - chronology 7556 31.57 F 0.246 60.55 0.27 86.53 0.372 75.45 0.406 56.0 - colored 7556 31.57 F 0.248 60.42 0.27 86.14 0.392 73.32 0.402 56.6 - compilation 7556 31.57 F 0.233 60.21 0.274 86.68 0.373 70.21 0.403 56.28 - genre 7556 31.57 F 0.256 62.65 0.272 86.45 0.373 71.04 0.389 55.63 - issue_type 7556 31.57 F 0.249 59.86 0.27 86.09 0.395 71.57 0.403 55.58 - limited 7556 31.57 F 0.23 60.46 0.266 86.89 0.364 73.82 0.391 55.78 - numbered 7556 31.57 F 0.247 60.85 0.267 86.53 0.381 71.5 0.398 56.38 - picture_disc 7556 31.57 F 0.245 60.48 0.272 86.46 0.399 71.63 0.406 55.87 - promo 7556 31.57 F 0.242 60.21 0.27 85.43 0.356 71.1 0.402 55.77 - test_pressing 7556 31.57 F 0.203 63.65 0.231 88.22 0.316 76.57 0.355 58.02 - versions 7556 31.57 F 0.238 61.18 0.272 86.18 0.378 73.76 0.393 55.88 - vinyl_exclusive 7556 31.57 F 0.244 61.01 0.265 86.51 0.36 71.1 0.403 55.94 Table 4.5: R2 and MAPE for evaluation of the full dataset with features excluded

Figure 4.3: MAPE scores of the full dataset being evaluated while excluding features.

4.2.2 Evaluating genre data

The table in Appendix B.1 shows the results of individual genres being evaluated. Figure 4.4 is based on data from that table and shows MAPE scores for all genres evaluated using RF models trained with artist data, genre data, and the full dataset.

Figure 4.4: MAPE scores of all genres evaluated using RF algorithm and multiple models.

The table in Appendix B.2 shows the frequency distribution of APE scores for the different genres. Figure 4.5 and Figure 4.6 show the APE frequency distribution for the death metal and classic rock genres respectively, which were the two genres with the lowest and highest MAPE scores. The graphs show the genre evaluation results generated with full dataset trained models using all four algorithms.

Figure 4.5: APE frequency distribution for the death metal genre evaluated with full model.

Figure 4.6: APE frequency distribution for the classic rock genre evaluated with full model.

4.2.3 Evaluating artist data

The table in Appendix B.3 shows the MAPE scores of individual artists being evaluated using models trained with the artist's own data, its genre, as well as the full dataset. It is sorted firstly by genre, and secondly by the mean of the MAPE scores for each artist. Figure 4.7, Figure 4.8, and Figure 4.9 show the APE frequency distribution for the artists A283, A19, and A359 respectively. They were selected based on their mean MAPE scores of full dataset model evaluations and are examples of artists showing the best, medium, and worst results (with the additional requirement that the artist had at least 20 testing samples). The graphs show the artist evaluation results generated with full dataset trained models using all four algorithms.

Figure 4.7: APE frequency distribution for the A283 artist (an example of an artist showing good results in relation to the total) evaluated with the full dataset model.

Figure 4.8: APE frequency distribution for the A19 artist (an example of an artist showing medium results in relation to the total) evaluated with the full dataset model.

Figure 4.9: APE frequency distribution for the A359 artist (an example of an artist showing poor results in relation to the total) evaluated with the full dataset model.

5 Analysis

In this chapter, the results presented in the previous chapter are analyzed based on the problem formulations in Table 1.1. Results were displayed as R², MAPE (mean absolute percentage error), and APE frequency distribution tables 33, and parts of the results were also displayed as graphs. They represent different portions of the dataset (full dataset, separate genres, and separate artists) being evaluated with models trained with different portions of data as well. Analyzing these numbers, it is possible to see patterns regarding which algorithms and training data are the most effective, and what level of accuracy can be achieved.

Figure 5.1: Scatter plot of R2 and MAPE values of ~3000 sets of predictions.

Figure 5.1 illustrates the relationship between R² scores and MAPE scores using predictions of roughly 3000 portions of data from the dataset used in this project.

33 The R² score is a float value (maximum 1.0) that represents the proportion of the variance in the dependent variable that can be explained by the independent variables. APE scores were calculated for every testing sample and express the percentage difference between the predicted price and the actual price. MAPE scores are the result of calculating the mean of the APE scores for a set of predictions. The metrics are further explained in Chapter 2.2.2.
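In standard notation, with y_i the actual price, ŷ_i the predicted price and ȳ the mean of the actual prices, these metrics can be written as:

\mathrm{APE}_i = \frac{|\hat{y}_i - y_i|}{y_i} \cdot 100, \qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{APE}_i, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}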

It shows that, generally, lower MAPE scores tend to correspond to higher R² scores. It also shows plenty of examples where a low MAPE score coincides with a low R² score, or a high MAPE score with a high R² score, meaning that R² cannot always be considered a trustworthy indicator of the quality of the predictions, even if a (somewhat unpredictable) relationship can be observed. This can also be seen in the results. For example, the first row in Table 4.3 shows a lower R² score for the KN model (0.247) than for the LR and NN models (0.272 and 0.375 respectively), despite KN having the lowest MAPE score of the three. It was decided that MAPE scores and APE frequency distribution tables would be the primary focus in this analysis, as they represent the quality of the predictions more explicitly than R².

Figure 5.2: Box plot of APE scores generated with the four algorithms. The three lines in the coloured boxes represent the quartiles Q1 (median of the lower half), Q2 (median) and Q3 (median of the upper half) for all sets of APE scores, while the lines above indicate the maximum value (excluding outliers). The black squares above are outliers. The diagram shows that some APE scores are very high (up to around 2000) compared to the majority.

While studying the MAPE scores in this report, it is important to consider that some of the scores are greatly affected by a relatively small number of very high error scores. For example, looking at the first row of Table 4.3 again, the MAPE score for KN is 60.32, meaning the mean absolute percentage error is 60.3%. Looking at the APE frequency distribution of the same evaluation (first row of Table 4.4), it can be observed that 18.6% of the predictions had a less than 10% error, and almost half of all the predictions had less than a 30% error. The columns for the remaining spans up to 150% show quite low numbers, which by themselves do not explain the high MAPE. However, the last column, representing scores with a larger difference than 150%, contains 7.3% of the predictions, suggesting that there were some quite large errors having a considerable effect on the MAPE. This is visualized as a box plot in Figure 5.2, displaying the distribution of APE scores generated by evaluating the full dataset using all four algorithms. To analyze this further, a test was done where the MAPE scores for that row were re-calculated with the largest APE scores excluded, one percent at a time from 1% to 9%, and then in steps of ten from 10% up to 50%. The results (visible in Figure 5.3) show that the MAPE decreases by 9 percentage points when excluding 1% of the largest errors, by over 19 percentage points when excluding 5%, and by over 25 percentage points when excluding 10%. This further established that the results included some outliers which had a large impact on the scores, and this was kept in consideration during the analysis.
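A sketch of the re-calculation behind Figure 5.3 (assuming ape is the array of APE scores for one set of predictions; the exact script is not shown in the thesis):

import numpy as np

def mape_excluding_largest(ape, exclude_percent):
    """MAPE after dropping the given percentage of the largest APE scores."""
    ape = np.sort(np.asarray(ape, dtype=float))
    keep = int(round(len(ape) * (1 - exclude_percent / 100)))
    return ape[:keep].mean()

# Producing the curve in Figure 5.3:
# points = {p: mape_excluding_largest(ape, p)
#           for p in list(range(1, 10)) + list(range(10, 51, 10))}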

Figure 5.3: The effect on MAPE scores of excluding percentages of the APE scores.

5.1 Comparison of algorithms

A total of 1272 sets of data were evaluated using the four different algorithms, and an overview of their performance can be seen in Table 5.1. The table displays an overview of the results of different groups of data evaluated (the full dataset as well as separate genres and artists), the training data used, and counts of the best performing algorithm in each situation. The performance was measured in terms of the MAPE score, as well as the number of APE scores with a less than 30% difference in each evaluation. The latter was used because it captures the share of the most accurate predictions better than the MAPE, although it disregards the impact of very high percentage errors in the evaluation. The data of the table is also displayed visually in Figure 5.4.

                          LOWEST MAPE               MOST APE SCORES < 30%
ED        TR   TOTAL   KN    LR   NN   RF        KN    LR   NN   RF
all       F    1       0     0    0    1         0     0    0    1
all       G    1       0     0    0    1         0     0    0    1
all       A    1       1     0    0    0         0     0    0    1
genres    F    11      3     0    0    8         1     0    0    10
genres    G    11      4     0    0    7         2     0    0    9
genres    A    11      6     0    0    5         1     0    0    10
artists   F    412     120   38   48   206       162   46   44   160
artists   G    412     122   54   38   198       156   51   51   154
artists   A    412     135   53   59   165       159   65   68   120

Table 5.1: Lowest MAPE scores and most APE scores under 30% for evaluated data.
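A sketch of how such counts could be produced from the stored results (the result structure is an assumption; each entry maps an algorithm to its MAPE for one evaluated set):

from collections import Counter

def count_wins(evaluations):
    """Count, per algorithm, how often it has the lowest MAPE (sketch).

    evaluations: iterable of dicts such as
    {"KN": 60.3, "LR": 86.5, "NN": 72.0, "RF": 55.8}
    """
    wins = Counter()
    for scores in evaluations:
        wins[min(scores, key=scores.get)] += 1
    return wins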

Figure 5.4: Visual representation of the lowest MAPE scores and most APE scores under 30% for evaluated data.

The table and graph show that the algorithms which produced the best results in most cases were RF and KN. When the full dataset was evaluated, RF gave the best results except for the MAPE generated using artist models, where KN gave the lowest score. When genre data was evaluated, MAPE scores between KN and RF were slightly more even, especially using artist trained models, but using the full dataset and genre models, the RF results were the best on most occasions. Looking at the APE scores of less than 30% in the same segments, RF also gave the strongest results. Analyzing the evaluations of artist data, the numbers appear more even between KN and RF, and the remaining two algorithms, LR and NN, also show the best results for a considerable portion of the predicted artists. Worth mentioning here is that the number of testing samples for each artist varies between 3 and 60 (see Appendix B.3). This can help explain the differences that can be observed when comparing the APE scores below 30% of the artist data and the genre data. Good results for evaluations done on data by individual artists do not necessarily mean the same algorithm is the best choice when evaluating larger sets including more artists or genres. To summarize the comparison, KN seems to be most effective when using models trained on an artist's own data only, and when evaluating smaller portions of data, such as that of the individual artists. The last row shows that when it comes to generating the most APE scores under 30% for individual artists, KN showed its clearest advantage. It also produced slightly lower MAPE scores compared to RF where evaluations were done with artist trained models. In all other situations, the table shows that RF generated superior results. Both LR and NN gave the best results in some of the evaluations of individual artists, but considerably less often than KN and RF.

5.2 Additional contextual data

Table 4.3 shows the full dataset being evaluated using different algorithms and models trained with different portions of data. Both KN and LR show a fairly small difference in the effect of using the different training data, while both NN and RF show lower MAPE scores when a broader scope of data has been used for training. The data in Appendices B.1 to B.3 represents evaluations of separate genres and artists. That data was analyzed to show the number of genres and artists generating better results when evaluated using models trained with data from a bigger scope. Table 5.2 shows the number of artists of each genre (as well as all artists in the bottom row), using the different algorithms, that generated lower MAPE scores after being evaluated with models trained with genre data or the full dataset. The last row of this table is also represented as a graph in Figure 5.5, showing the number of artists that got lower MAPE scores when using the genre model or the full dataset model. Table 5.3 shows data from the same evaluations, but with counts based on whether the number of APE scores below 30% was higher when trained with the genre or full dataset models. Table 5.4 shows similar statistics but for evaluations of full genres. The results varied depending on which algorithm was used, and the best effect was achieved while using RF, which was also concluded to be the generally best performing algorithm in Chapter 5.1. For that algorithm, using the genre model generated lower MAPE scores for 60.7% of the artists, and the full dataset model was slightly better at 62.3%. The number of artists not achieving a lower MAPE score from using either the genre or the full dataset model is 118 (28.6%). That was calculated from the table in Appendix B.3. Table 5.4 shows that evaluating the separate genres using the full dataset was mostly effective, especially using the RF algorithm. In conclusion, it can be said that for the most part, the effect of using additional contextual training data was positive, especially for RF, although there are exceptions.

Figure 5.5: Number of artists that achieved lower MAPE scores when evaluated with models trained with genre data or the full dataset.

46 ED TOTAL KN LR NN RF G F G F G F G F A:AM 41 14 16 27 28 20 26 27 21 A:AR 46 26 26 17 19 29 35 26 24 A:BM 43 23 21 24 22 29 29 28 33 A:CR 13 5 7 6 4 6 9 7 6 A:DE 34 21 22 23 21 23 24 23 29 A:DO 64 33 37 28 29 32 41 35 35 A:EL 40 23 25 20 18 19 20 23 25 A:HM 35 19 19 18 20 14 18 25 22 A:PR 43 20 19 23 25 22 21 21 23 A:SR 26 13 13 15 13 10 11 18 18 A:TR 27 12 13 16 17 16 16 17 21 A:all 412 209 218 217 216 220 250 250 257 Table 5.2: Counts of artist data evaluation where models trained with genre data or full datasets generated lower MAPE scores than when using artist models.

ED TOTAL KN LR NN RF G F G F G F G F A:AM 41 15 14 15 14 14 17 22 21 A:AR 46 23 22 19 16 20 12 19 19 A:BM 43 20 19 16 14 21 18 21 21 A:CR 13 7 7 3 1 5 4 9 9 A:DE 34 14 16 18 16 19 20 19 21 A:DO 64 22 25 27 23 19 24 24 27 A:EL 40 17 16 16 10 16 17 13 20 A:HM 35 15 17 13 8 7 8 13 14 A:PR 43 13 14 11 16 14 18 20 20 A:SR 26 6 7 11 8 11 11 12 12 A:TR 27 9 12 9 10 11 11 12 15 A:all 412 161 169 158 136 157 160 184 199 Table 5.3: Counts of artist data evaluation where models trained with genre data or full dataset generated a higher amount of APE scores below 30% than when using artist models.

ED TOTAL KN LR NN RF MAPE APE<30% MAPE APE<30% MAPE APE<30% MAPE APE<30% G:AM 1 0 1 0 0 1 1 1 0 G:AR 1 1 1 0 1 1 0 1 0 G:BM 1 0 0 1 0 1 0 1 1 G:CR 1 1 1 0 0 1 1 0 1 G:DE 1 1 1 0 0 1 1 1 1 G:DO 1 0 1 0 1 1 1 1 1 G:EL 1 1 0 0 0 1 0 1 1 G:HM 1 0 0 0 0 0 0 1 1 G:PR 1 1 0 1 0 1 1 1 1 G:SR 1 0 1 0 0 1 1 1 0 G:TR 1 1 1 0 1 1 0 1 1 Table 5.4: Counts of genre data evaluation where models trained with full dataset generated lower MAPE scores and higher amount of APE scores below 30% than when using genre models.

5.3 Efficiency of features

Table 4.5 shows the full dataset being evaluated in comparison to the same data with each of the separate variables excluded. When a variable is excluded, R² scores are generally slightly lower while MAPE scores are a bit higher, signifying that they are all relevant features to a varying extent. However, on some occasions the exclusion of a feature can produce slightly better numbers for one algorithm but at the same time appear to worsen the scores for others. The independent variable showing the largest effect when excluded is clearly artist.

5.4 Accuracy of price estimation

Previously in this chapter, it has been established that using the RF algorithm with training data from a broader scope than that of the testing data (i.e. more artists within the same genre, or artists from additional genres) has generally been the most effective approach, although for some parts of the data another way might be more effective. The lowest MAPE score achieved when evaluating the full dataset was 55.8 (Table 4.3), and was the result of the RF algorithm using a single model trained with the full dataset. The APE frequency distribution for the evaluation (fourth row of Table 4.4) shows that 19.3% of the predictions have a difference of less than 10% from the actual prices, while 17% are in the 10-20% span and 13.5% are within the 20-30% span. That means that roughly half (49.8%) of all the predictions done with the 7556 testing samples had a smaller difference than 30% from the actual prices. Narrowing down the scope of data being tested into genres and artists, both better and worse results can be observed. The table in Appendix B.1 shows R² and MAPE scores, while the table in Appendix B.2 shows the APE frequency distribution of genre data being evaluated. Here it is observable that the price prediction developed in this project is more effective for some genres than others. The death metal genre displays the lowest MAPE score (34.78 using RF and the full dataset model). The black metal genre has a slightly higher MAPE (35.23 using the same approach) but shows the highest amount of APE scores below 30% for any genre in the dataset. It has 23.9% of its predictions in the 0-10% range, 20.3% in the 10-20% range, and 16.1% in the 20-30% range, meaning that around 60% of the predictions in the set have less than a 30% difference from the actual price. For the rest of the genres evaluated, the results vary, with a mean MAPE of 53.3 using RF and the full dataset model. MAPE scores were calculated for all artists using each algorithm, and models trained with the artist data, genre data, and full dataset, as can be seen in the table in Appendix B.3. The mean MAPE scores for every approach in each genre, as well as for all artists combined, can be viewed in Table 5.5. It shows that the best mean score is 50.53 using RF and the model trained with the full dataset. Looking at the genres in the same group, the means largely vary between around 35 and 60, with some exceptionally high examples like the alternative metal (64.79) and classic rock (82.37) genres. The mean MAPE scores for artists, generated with the models trained with the full dataset, are displayed visually in Figure 5.6.

ED TR: FULL DATASET TR: GENRE TR: ARTIST KN LR NN RF KN LR NN RF KN LR NN RF A:AM 90.61 109.22 89.29 64.79 80.47 93.72 109.63 68.77 83.44 112.78 132.25 70.09 A:AR 67.59 120.34 74.51 59.41 67.94 114.75 84.15 64.99 76.5 110.21 108.24 69.27 A:BM 47.45 56.96 44.42 35.78 46.79 58.75 47.65 39.53 48.68 68.27 55.4 46.08 A:CR 74.26 126.15 90.69 82.37 74.83 115.68 102.07 79.62 75.22 110.95 110.12 81.85 A:DE 40.24 46.49 46.87 33.75 39.68 44.69 54.81 39.61 42.89 64.06 64.67 49.39 A:DO 56.09 58.26 57.23 43.49 54.32 58.13 62.89 44.57 52.84 82.6 75.81 50.93 A:EL 54.11 87.86 76.02 52.18 54.6 88.04 84.31 56.57 57.46 129.85 97.73 64.85 A:HM 62.65 73.97 68.85 53.69 62.19 70.89 69.55 59.09 57.33 96.37 74.58 67.56 A:PR 70.29 84.49 79.17 58.26 70.11 90.64 86.88 59.77 65.74 91.51 92.37 59.92 A:SR 56.66 75.65 65.23 47.45 57.74 70.68 69.07 52.86 66.31 77.76 55.12 67.0 A:TR 57.3 59.5 56.33 43.78 57.5 57.5 65.4 47.5 54.58 77.02 78.49 56.55 A:all 61.14 79.42 66.76 50.53 59.86 76.9 74.6 54.13 61.12 92.57 85.7 60.31 Table 5.5: Mean MAPE scores for artists divided in genres.

Figure 5.6: Mean MAPE scores for artists divided in genres, using full dataset models.

6 Discussion

In Table 1.1, four questions (Q1-Q4) forming the foundation of this project were formulated. In the previous chapter, the results were analyzed with the intent to put them in relation to the problem formulations. In this chapter, the answers to the questions are discussed based on the new findings and in comparison with the observations on related work made in Chapter 1.2.

6.1 Accuracy of price estimation (Q1)

Q1 reads: What level of price prediction accuracy is it possible to achieve using common algorithms and a dataset of vinyl records? Considering the many ways this problem can be approached in terms of data collection, feature construction, algorithm selection, model tuning, and more, this is a complex question to give a definitive answer to. One thing that is certain when looking at the results and analysis is that the accuracy differs greatly for different portions of data evaluated. In this project, the data in focus has been a rather diverse dataset, including hundreds of artists, a dozen different genres, and items released over several decades. The idea behind that was to introduce this concept using a relatively broad approach, with the intent to produce a richer groundwork than if the focus had been more narrow. That being said, the findings should not be considered final in the broader sense and are subject to further development. The best MAPE score achieved with the full dataset of 7556 testing samples was 55.84 and was the result of using an RF model trained with the training samples of the full dataset. It was established in Chapter 5 that the MAPE score is greatly affected by a relatively small number of unusually high APE scores, meaning that the MAPE score of 55.84 does not fairly represent the majority of the estimations. Rather, the APE frequency distribution table tells that over 70% of the estimations have a lower APE score than the MAPE score, with one-fifth of the predictions having a less than 10% APE and half of the predictions having less than 30% APE. That can be considered a good starting point, as a majority of the predictions have a fairly low error. The price prediction results observed in related work are presented using different metrics and approaches and are not always directly comparable with the results of this study, but there are some interesting examples, as R² is often used in addition to some other metric. Examples of results in the real estate area include a study where six algorithms are compared and the R² scores are between 0.665 and 0.918 (Gradient Boosting Regression is the prominent algorithm) [24], a study where a NN model generates a MAPE score of 2.5 [16], a study where a Support Vector Machine model is used to generate an R² of 0.968 [26], a study where LR and RF are compared and generate scores of 0.696 and 0.878 respectively [30], and a study where two NN models are used to generate MAE (mean absolute error) scores between 0.67% and 2.34% [15]. Although there are varying results in these studies, result scores such as R² and MAPE generally show better values than in this study. This can be interpreted as the hedonic features in real estate being stronger and more reliable determinants of the objects' value. When it comes to collectibles, examples include the study of price prediction of U.S. coins, where R² values range between 0.02 and 0.23 [37], and the study of classical music manuscripts, which shows an R² of 0.797 [38]. There is no strong coherence found in the collectibles studies, possibly because of the differences in approach, data, and nature of the objects. Looking at the subgenres, both worse and better results compared to the evaluation of the full dataset can be found. Death metal (best MAPE: 34.78) and black metal (best MAPE: 35.23) show the best results, while punk rock (best MAPE: 62.94) and classic rock (best MAPE: 79.62) show the worst. Such large differences in accuracy suggest that different approaches may be suitable for different genres. Combining data from multiple genres which all generate results at a varying level of accuracy might not be the most viable way to proceed. It seems that the characteristics of the different genres may not be homogenous enough that they would all profit from being evaluated as a group. That being said, the black metal genre data does indeed show the best results when being evaluated with the models trained with the full dataset. However, it raises the question of what particular data in the training set has a positive or negative impact on a specific genre. Using another selection of genres, partly or completely, would likely show other examples of either more or less successful predictions. This essentially leads to more reflection around what the factors are that make the evaluation of some of the data reach a higher accuracy than other data. It was all gathered from the same source, has the same format, and is evaluated the same way. Still, the outcome can be quite divergent. What does it say about the genres and their data? Classic rock, for example, was one of the genres that I thought would generate the best results, considering the extensive discographies of those artists. On the contrary, its results were the worst of all genres. What is it that characterizes the black metal genre that gives it a MAPE score of less than half of that of classic rock? Presumably, different approaches for different data are needed when the results are this diverse.

6.2 Additional contextual data (Q2)

Q2 reads: What are the effects of using additional contextual data while making predictions (e.g. training a model with more artists within the same genre when evaluating an artist's data)? While the concepts of evaluating data of multiple genres can also be discussed within the boundaries of Q1 (as shown above), Q2 specifically focuses on the effect that a larger data scope as training data has on the evaluated scope. Table 5.5 shows the mean MAPE scores for all artists' data evaluated with their own models, their genres' model, and the full dataset model. Looking at the RF algorithm, in all cases but the classic rock genre, the MAPE means are improved firstly by using the genre models, and further when using the full dataset model. While this suggests that it is generally positive to use extra data in this manner, it is not shown explicitly that the data necessarily needs to be within the same genre, as the results using the full dataset model are generally even better. However, it should be remembered that the genre variable is included in the full dataset, maintaining the genre association for each sample also while evaluating using the full dataset model. Nonetheless, it can still be asked what has the biggest influence: the level of similarity of other data, or the amount of other data. Either way, one can assume that it would be beneficial to further investigate the effects of combining certain genres, as they may possess a level of compatibility with each other.

6.3 Comparison of algorithms (Q3)

Q3 reads: Utilizing multiple algorithms, how do they perform in comparison? LR and NN were the two most commonly used algorithms in the related work. RF occurred in several studies as well, while KN was only used in one. There were plenty of examples where the different algorithms were concluded to be effective, but not very many where the ones used in this project were compared to each other. The studies where algorithms were compared were strictly focused on real estate and not collectibles. Therefore, there were no very clear expectations of the outcome of the comparison. As described in Chapter 5.1, RF showed a distinct advantage in the experiment, generating the best results for a majority of artists and genres as well as for the full dataset. This was not unreasonable to expect, as RF outperformed its counterpart in every instance where it was compared to another algorithm in the related work [25], [29], [30]. Considering the high popularity of LR and NN and the seemingly lower popularity of KN according to the research in Chapter 1.2, it was assumed that KN would produce weaker results than the former two. However, throughout the results, KN generated predictions with higher accuracy than both of them. In Chapter 2.2.1, it was explained that the model hyperparameters were roughly adjusted in order to find general improvements that would be effective for all data in this project. Some of the algorithms implemented, like LR and KN, did not offer a very large selection of hyperparameters to fine-tune. However, a more advanced algorithm like NN has more implementation options. The class used, MLPRegressor from the Scikit-learn library, can be adjusted with 23 hyperparameters. If more time had been available for the implementation, it would have been interesting to see whether this model could have produced better results.

6.4 Efficiency of features (Q4)

Q4 reads: Are the features used in the dataset effective in the hedonic regression of vinyl records? The results show that all features have an impact on the predicted prices, as the results become slightly worse when any of them is excluded from the dataset. The feature that shows the largest effect on the prices is artist, which can be related to conclusions found in several studies about collectibles [32], [35], [36] and the single recent study about the prices of rare vinyl records [39, p. 11].

7 Conclusion

This is the first academic study where a dataset representing the complete vinyl discographies of a large number of artists was used to investigate the possibilities of price prediction of vinyl records. The intent was to explore the important principles within the area, e.g. construction of datasets, comparison of algorithms, and different ways to measure and analyze results, to establish a foundation that can be further developed. Four objectives to achieve this, all of which have been met during the course of the project, were listed in Table 1.2. It has been shown that the approach used in the project holds the potential to execute price predictions with high accuracy for a large part of the data examined. But it is also evident that a relatively limited, yet substantial, share of the predictions is often very inaccurate. The general results differ markedly between evaluations of different divisions of data, such as different genres or artists. That suggests that the approach developed in the project is varyingly effective depending on some unidentified characteristics of the data being used. The Random Forest algorithm showed the strongest results, followed by k-Nearest Neighbors, while Linear Regression and Neural Network generated the least positive results. Results and conclusions of this project can be relevant for science, more specifically within the area of price estimation for collectibles, particularly that of vinyl records or perhaps other types of phonographic media. The project demonstrates a set of techniques and approaches along with their results, which can serve as a foundation for expanding scientific research. If future development were to lead to a more stable approach (in terms of minimizing the share of erroneous predictions), the concept of price prediction of vinyl records could be relevant outside of science as well, for instance for industry (e.g. record dealers) or society (e.g. record collectors). However, the configuration of such concepts was not part of this project.

7.1 Future work

To improve the approach used in this project, further work can be put into several areas. The construction of more variables could be beneficial. One specific variable that was not possible to use in this project is the quantity of a limited item. At the moment, there is just a limited variable, stating whether or not the item is a limited edition. To construct more variables, there is a need to further investigate what kind of information is relevant to the price prediction of vinyl records. What information can be retrieved that is not included in the dataset of this project, and how complicated is the process of retrieving it? This could be quite complex depending on the size of the dataset and the nature of the new variables. As Random Forest produced the overall best results in this project, it could be a good idea to continue the research with a larger focus on this algorithm. Can the hyperparameters be further adjusted to generate higher accuracy? Can they be tweaked individually for different portions of data, and can this somehow be done automatically? There could also be value in looking further at Neural Networks, and investigating whether a more meticulous implementation could generate better results than in this project. Regarding the varying accuracy of the results, it would be relevant to study the factors which create this diversity in prediction quality. Is it possible to define these factors and learn something useful from the poor predictions? Since individual genres and artists generated varying results, it could be of importance to evaluate the compatibility between different divisions of data. To have a large dataset, the data of several artists and genres needs to be combined. However, since many of them have proven to generate different results, it is reasonable to believe that the way in which they are combined and related to each other will be crucial for the outcome. The software developed for the project is tailor-made for the vinyl record dataset. That means that several parts of the code in the experiment are customized to be used with a specifically designed dataset. To make the experiment more generalizable, so it can be used to evaluate other data groups, a general or easily customizable version of the software could be created.

References

[1] RetroSound (2018, Mar 14). The decline of the compact disc [Online]. Available: https://www.retromanufacturing.com/blogs/news/the-decline-of-the-compact-disc

[2] M. Leimkuehler (2019, Jan 7) Vinyl Sales Continued To Grow In 2018, Report Says [Online]. Available: https://www.forbes.com/sites/matthewleimkuehler/2019/01/07/vinyl-sales-grow-2018-buzzan gle-beatles-kendrick-lamar-queen-album-sales

[3] A. Müller and S. Guido, Introduction to Machine Learning with Python - A Guide for Data Scientists, O’Reilly Media, 2016.

[4] G. Seif (2018, Mar 5). Selecting the best Machine Learning algorithm for your regression problem [Blog]. Available: https://towardsdatascience.com/selecting-the-best-machine-learning-algorithm-for-your-regre ssion-problem-20c330bad4ef

[5] P. Priyadarshini (2019, May 15). How to Choose ML Algorithms for Regression Problems? [Blog]. Available: https://geekflare.com/choosing-ml-algorithms/

[6] S. Shukla. Regression and Classification, Supervised Machine Learning [Online]. Available: https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/

[7] N. Shitut (2020, Jan 3). Most Popular Regression Algorithms In Machine Learning [Blog]. Available: https://analyticstraining.com/popular-regression-algorithms-ml/

[8] Maher (2019, Mar 11). Which machine learning model to use? [Blog]. Available: https://towardsdatascience.com/which-machine-learning-model-to-use-db5fdf37f3dd

[9] R. Harlalka (2018, Jun 16). Choosing the Right Machine Learning Algorithm [Blog]. Available: https://hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f

[10] H. Li (2017, Apr 12). Which machine learning algorithm should I use? [Blog]. Available: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm- use/

[11] R. Shaw (2019, Jun 26). The 10 Best Machine Learning Algorithms for Data Science Beginners [Blog]. Available: https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/

[12] J. de Haan and E. Diewert, “Hedonic Regression Methods” in Handbook on Residential Property Price Indices, Luxembourg, 2011.

[13] H. Xiaolong, Z. Ming, “Applied research on real estate price prediction by the neural network” in The 2nd Conference on Environmental Science and Information Application Technology, July 2010, Vol. 2, pp. 384-386.

[14] D. Li et al., “A SVR based forecasting approach for real estate price prediction” in 2009 International Conference on Machine Learning and Cybernetics , July 2009, Vol.2, pp. 970-974.

[15] L. Li and K. Chu, “Prediction of real estate price variation based on economic parameters” in 2017 International Conference on Applied System Innovation (ICASI), May 2017, pp. 87-90.

[16] H. Xue, “The Prediction on Residential Real Estate Price Based on BPNN” in 2015 8th International Conference on Intelligent Computation Technology and Automation (ICICTA), June 2015, pp. 1008-1013.

[17] J. Guan, “Analyzing Massive Data Sets: An Adaptive Fuzzy Neural Approach for Prediction, with a Real Estate Illustration” in Journal of Organizational Computing and Electronic Commerce, January 2014, Vol. 24, pp. 94-112.

[18] E. Ahmed and M. Moustafa, “House price estimation from visual and textual features” in arXiv.org , Sep 27, 2016.

[19] S. Zaddach and H. Alkhatib, “Least squares collocation as an enhancement to multiple regression analysis in mass appraisal applications” in Journal of Property Tax Assessment & Administration, January 2014, Vol.11.

[20] L. Choy et al., “Housing attributes and Hong Kong real estate prices: a quantile regression analysis” in Construction Management and Economics, May 2012, Vol. 30 (5), pp. 359-366.

[21] R. Cellmer, “The use of the geographically weighted regression for the real estate market analysis” in Folia oeconomica stetinensia, Jan 2012, Vol.11 (1), pp. 19-32.

[22] P. Bidanset and J. Lombard, “Evaluating Spatial Model Accuracy in Mass Real Estate Appraisal A Comparison of Geographically Weighted Regression and the Spatial Lag Model” in Cityscape , 2014, Vol.16 (3), pp. 169-182.

[23] N. Ghosalkar and S. Dhage, “Real Estate Value Prediction Using Linear Regression” in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), August 2018, pp. 1-5.

[24] R. Madhuri et al., “House Price Prediction Using Regression Techniques: A Comparative Study” in 2019 International Conference on Smart Structures and Systems (ICSSS), March 2019, pp. 1-5.

[25] C. Wang and H. Wu, “A new machine learning approach to house price estimation” in New Trends in Mathematical Sciences, 2018, No. 4, pp. 165-171.

[26] J. Shim and C. Hwang, “Kernel-based geographically and temporally weighted autoregressive model for house price estimation” in PloS one, 2018, Vol. 13 (10).

[27] J. Liu et al. “A Geographically Temporal Weighted Regression Approach with Travel Distance for House Price Estimation” in Entropy , August 2016, Vol. 18 (8), pp. 303.

[28] M. Iacobini and G. Lisi, “ Estimation of a Hedonic House Price Model with Bargaining: Evidence from the Italian Housing Market” in Aestimum , August 2013.

[29] M. Čeh et al., “Estimating the Performance of Random Forest versus Multiple Regression for Predicting Prices of the Apartments” in Cognitive Aspects of Human-Computer Interaction for GIS, pp. 125-140, 2019.

[30] I. Engström and A. Ihre, “Predicting house prices with machine learning methods”, Bachelor thesis, KTH, School of Electrical Engineering and Computer Science (EECS), Stockholm, Sweden, 2019.

[31] E. Teti et al., “Ephemeral Estimation of the Value of Art” in Empirical Studies of the Arts, January 2014, Vol. 32 (1), pp. 75-92.

[32] J. Nahm, “Price determinants and genre effects in the Korean art market: a partial linear analysis of size effect” in Journal of Cultural Economics, 2010, Vol. 34 (4), pp. 281-297.

[33] D. Hodgson, “Age–price profiles for Canadian painters at auction” in Journal of Cultural Economics, 2011, Vol. 35 (4), pp. 287-308.

[34] D. Witkowska, “An Application of Hedonic Regression to Evaluate Prices of Polish Paintings” in International Advances in Economic Research, 2014, Vol. 20 (3), pp. 281-293.

[35] J. Fedderke and K. Li, “Art in Africa: Hedonic price analysis of the South African fine art auction market, 2009–2014” in Economic Modelling, January 2020, Vol. 84, pp. 88-101.

[36] P. Pradier et al., “Autographs and the global art market: the case of hedonic prices for French autographs (1960–2005)” in Journal of Cultural Economics, 2016, Vol. 40 (4), pp. 453-485.

[37] M. Dickie et al. “Price determination for a collectible good: the case of rare U.S. coins” in Southern Economic Journal, July 1994, Vol. 61(1), pp. 40.

[38] P. Georges and A. Seçkin, “Black notes and white noise: a hedonic approach to auction prices of classical music manuscripts” in Journal of Cultural Economics, 2013, Vol. 37 (1), pp. 33-60.

[39] H. Sonnabend and S. Cameron, “Pricing the Groove: Hedonic equation estimates for rare vinyl records”, University of Hagen, Department of Economics, Hagen, Germany, 2019.

[40] A. Burke, “The Demand for Vinyl L.P.s 1975-1988: Time Series Estimation of a Product Group in the Presence of Product Differentiation Innovation” in Journal of Cultural Economics, 1994, Vol. 18, pp. 41-64.

[41] Rare Records (2020). Vinyl Records Value [Online]. Available: https://www.rarerecords.net/vinyl-records-value/

[42] J. Bossmann (2016, Oct 21) Top 9 ethical issues in artificial intelligence [Online]. Available: https://www.weforum.org/agenda/2016/10/top-10-ethical-issues-in-artificial-intelligence/

A Appendix 1

"""
Dataset generator module.
Includes functions for web scraping artists discographies from Discogs.com
"""
import math
import time
import re
from contextlib import closing
from requests import get
from bs4 import BeautifulSoup as bs


class Release:
    """ Release class. Scrapes discogs and stores data about release. """

    def __init__(self, url):
        """ Gets data and stores in variables.

        Args:
            url (str): url to release
        """
        if url:
            # scrape release data from discogs
            data = scrape_release(url)

            self.url = url
            self.discogs_release_id = get_id(url)
            self.title = data["title"]
            self.label = data["label"]
            self.year = data["year"]
            self.country = data["country"]
            self.format_all = data["format_all"]
            self.issue_type = data["issue_type"]
            self.limited = data["limited"]
            self.picture_disc = data["picture_disc"]
            self.box_set = data["box_set"]
            self.numbered = data["numbered"]
            self.test_pressing = data["test_pressing"]
            self.promo = data["promo"]
            self.colored = data["colored"]
            self.price = data["price"]
        else:
            # leave attributes blank until later
            self.url = None
            self.discogs_release_id = None
            self.title = None
            self.label = None
            self.year = None
            self.country = None
            self.format_all = None
            self.issue_type = None
            self.limited = None
            self.picture_disc = None
            self.box_set = None
            self.numbered = None
            self.test_pressing = None
            self.promo = None
            self.colored = None
            self.price = None

        # gets set after init
        self.artist = None
        self.discogs_master_id = 0
        self.format_type = None
        self.versions = None
        self.vinyl_exclusive = None
        self.chronology = None


def get_links(artist_url): """ Gets links for all releases by an artist from Discogs.

Args: artist_url (str): url for artists page, the part after 'arists/' in url

Returns: list: list of links """ # create base URLs for discogs pages url = "https://www.discogs.com/artist/" + artist_url + \ "?sort=year%2Casc&limit=500&subtype=_format&" + \ "filter_anv=0&type=Releases&page=" url_base = url[:23]

links = dict() # to store links to all items

# collect links for Albums and Singles for frmt in ["albums", "singles"]: # create correct word for use in url frmt_text = "Albums" if frmt == "albums" else "Singles-EPs"

# get first page of artists discography for this category pages = [] pages.append(get_soup(url.replace("_format", frmt_text) + "1"))

# continue if no results was found for category if not results_found(pages[0]): links[frmt] = None continue

# find out how many results were found soup = pages[0] items = soup.find("strong", {"class": "pagination_total"}) items = int(items.get_text().strip().split()[-1].replace(",", ""))

# check total amount of pages page_count = items/500 page_count = items/500 if not page_count%500 else math.ceil(items/500)

# download rest of pages if more than one count = 1 while count < page_count: pages.append(get_soup(url.replace("_format", frmt_text) + str(count+1))) count += 1 # create variables to store links masters = list() uniques = list()

61 # loop through each page and save in correct list for soup in pages: # find all cards cards = soup.findAll("tr", {"class": "card"}) # loop through cards and collect links for card in cards: # get link href and put in right list link = card.find("a").get("href") slash = link.rfind("/") link_type = link[slash-7:slash] if link_type == "/master": # for masters, make a dict entry storing a link and list # and bool (for storing vinyl exclusivity variable) masters.append([link, list(), 1]) elif link_type == "release": # do not add if it is not a vinyl if is_valid_release(card): uniques.append(link)

# save lists in dicts for right formats links[frmt] = dict() links[frmt]["uniques"] = uniques links[frmt]["masters"] = masters

# loop through masters and get links to releases for frmt in ["albums", "singles"]: # continue only if results were found if not links[frmt]: continue for master in links[frmt]["masters"]: # download master page and create bs object soup = get_soup(url_base + master[0]) # get cards cards = soup.findAll("tr", {"class": "card"}) # loop through cards for card in cards: # set master[2] to False if non-vinyl item found among items if master[2] == 1: if not is_vinyl(card) and is_official(card): master[2] = 0 # save links for valid items if is_valid_release(card): # get title td title_td = card.find("td", {"class": "title"}) # get link that includes '/release/' title_link = title_td.select("a[href*=release]")[-1] master[1].append(title_link.get("href"))

return links def get_releases(artist_name, artist_url, to_csv=False): """ Get Release objects for all releases by artist from Discogs.

    Args:
        artist_name (str): name of artist
        artist_url (str): url for artist's page, the part after 'artist/' in the url
        to_csv (bool): write results to file if True

    Returns:
        list: Release objects
    """
    links = get_links(artist_url)
    url_base = "https://www.discogs.com"

    # create list for all Release objects
    releases = list()

    # scrape release pages for data and create objects
    for frmt in ["albums", "singles"]:
        # continue only if results were found
        if links[frmt]:
            # unique releases
            for link in links[frmt]["uniques"]:
                # init object with basic data from discogs release page
                release = Release(url_base + link)
                # add extra data
                release.artist = artist_name
                release.format_type = frmt[:-1]
                release.vinyl_exclusive = 1
                release.issue_type = 1
                release.versions = 0
                releases.append(release)
            # master releases
            for mstr in links[frmt]["masters"]:
                for link in mstr[1]:
                    # init object with basic data from discogs release page
                    release = Release(url_base + link)
                    # add extra data
                    release.artist = artist_name
                    release.format_type = frmt[:-1]
                    release.vinyl_exclusive = mstr[2]
                    release.discogs_master_id = get_id(mstr[0])
                    releases.append(release)

    # save to csv if to_csv is True
    if to_csv:
        with open(artist_name + ".txt", "w", encoding="utf-8") as file:
            for release in releases:
                file.write(f"{release.get_csv()}\n")

    return releases


def get_releases_from_csv(artist):
    """ Reads from file and creates objects of releases in the same stage
    as after the get_releases function, then returns them. Primarily used
    in the development stage.

    Args:
        artist (string): name of artist

    Returns:
        list: Release objects
    """
    releases = list()

    with open("data " + artist + ".txt") as data_file:
        for line in data_file:
            release = Release(None)
            release.add_csv(line)
            releases.append(release)

    return releases


def scrape_release(url):
    """ Gets soup of release page, gets necessary data and returns it.

    Args:
        url (str): url to release

    Raises:
        TypeError: if price retrieved is not in USD format

    Returns:
        dict: data scraped from discogs release page
    """
    # create dict and soup object
    data = dict()
    soup = get_soup(url)

    # find title
    title = soup.find("h1", {"id": "profile_title"})
    data["title"] = title.findAll("span")[-1].get_text().strip()

    # find label
    label = soup.find(text=re.compile("Label:")).find_parent()
    label = label.find_next_sibling().find("a")
    data["label"] = label.get_text() if label else None

    # find year
    year = soup.find(text=re.compile("Released:")).find_parent()
    year = year.find_next_sibling().get_text().strip()[-4:]
    data["year"] = int(year) if is_valid_int(year) else None

    # find country
    country = soup.find(text=re.compile("Country:")).find_parent()
    data["country"] = country.find_next_sibling().get_text().strip()

    # find format div
    frmt = soup.find(text=re.compile("Format:")).find_parent()
    frmt_soup = frmt.find_next_sibling()  # save soup for later
    frmt = frmt_soup.get_text().lower().replace("\n", " ")
    frmt = re.sub(' +', ' ', frmt.strip())
    data["format_all"] = frmt

    # check if reissue
    data["issue_type"] = None
    words = ["reissue", "reprint", "repress"]
    if any(x in frmt for x in words):
        data["issue_type"] = 3

    # check if limited
    data["limited"] = 1 if "limited" in frmt else 0

    # check if picture disc
    data["picture_disc"] = 1 if "picture" in frmt else 0

    # check if box set
    data["box_set"] = 1 if "box" in frmt else 0

    # check if numbered
    data["numbered"] = 1 if "numbered" in frmt else 0

    # check if test pressing
    data["test_pressing"] = 1 if "test pressing" in frmt else 0

    # check if promotional pressing
    data["promo"] = 1 if "promo" in frmt else 0

    # check if colored vinyl
    data["colored"] = 0
    italics = frmt_soup.findAll("i")
    # phrases indicating something else than colored vinyl
    not_color = ["gatefold", "lenticular", "180g", "autographed", "signed",
                 "gatefold, 180g", "180g, gatefold", "180 gram", "numbered",
                 "hand numbered", "black", "black vinyl", "single"]
    # loop through italic phrases
    for i in italics:
        i = i.get_text().lower()
        # continue to next phrase on exact match with a non-color indicator
        if i in not_color:
            continue
        # set colored to 1
        data["colored"] = 1

    # check for median selling price
    try:
        price = soup.find(text=re.compile("Median:")).find_parent()
        price = price.find_parent().get_text().replace("\n", " ")
        price = re.sub(' +', ' ', price.strip())
        data["price"] = price[8:]
    except AttributeError:
        data["price"] = None

    # raise error if price is not in USD format
    if data["price"] and data["price"][:1] != "$" and data["price"] != "--":
        raise TypeError(f"Price is not in USD format. ({data['price']})")

    return data


def post_scrape(releases):
    """ For use after get_releases. Calculate chronology (and add year if
    missing), add info about issue type and amount of other versions
    available. Remove items with no prices available.

    Args:
        releases (list): list of Release objects

    Returns:
        list: Release objects
    """
    # get earliest and latest release years of artist
    earliest_year, latest_year = get_earliest_and_latest_year(releases)

    # get average year of the artist's career
    career_mean = get_mean(earliest_year, latest_year)

    # make separate lists of uniques and masters
    uniques = [x for x in releases if x.discogs_master_id == 0]
    masters = [x for x in releases if x.discogs_master_id != 0]

    # loop through uniques and calculate chronology and set missing years
    for rel in uniques:
        calculate_chronology(rel, earliest_year, career_mean)

    # make set for done masters
    done = set()

    # loop through masters and add data
    for rel in masters:
        # continue to next if master id is already done
        if rel.discogs_master_id in done:
            continue

        # add id to done
        done.add(rel.discogs_master_id)

        # get releases of master
        m_releases = get_releases_by_master_id(rel.discogs_master_id, masters)

        try:
            # try to get earliest and latest release year of master
            m_earliest, m_latest = get_earliest_and_latest_year(m_releases)

            # get average year of the master release
            master_mean = get_mean(m_earliest, m_latest)
        except AttributeError:
            # no years found, use general artist year mean instead
            master_mean = career_mean
            m_earliest, m_latest = None, None

        reissue_found = False

        # loop through every release in master group
        for mstr in m_releases:
            # add issue type for reissues
            if not mstr.year or mstr.year > m_earliest:
                # set to reissue if release doesn't have year or if released
                # later than earliest release for master
                mstr.issue_type = 3

            # calculate chronology and set missing years for all releases
            # with current master id
            calculate_chronology(mstr, earliest_year, master_mean)

            # add other versions number
            mstr.versions = len(m_releases) - 1

            # set reissue_found to True if reissue is found
            if mstr.issue_type == 3:
                reissue_found = True

        # loop again and set issue type for originals
        for mstr in m_releases:
            # if release is from earliest year and not set yet
            if mstr.issue_type is None:
                # set to 1 (original without reissue) if no reissue found
                # and to 2 (original with reissue) if reissue found
                mstr.issue_type = 1 if not reissue_found else 2

    # delete releases with price missing
    releases = [x for x in releases if x.price and x.price != "--"]

    return releases

def get_soup(url):
    """ Downloads webpage, makes soup and returns it.

    Args:
        url (str): url for webpage

    Returns:
        obj: soup object representing page
    """
    # print("Downloading URL: " + url)

    while True:
        # download page
        with closing(get(url, stream=True)) as resp:
            # decode bytes object to string
            html = resp.content.decode("utf-8")
            soup = bs(html, "html.parser")

            # get page title
            title = soup.title.string

            # check if too many requests were made
            if "Error 429" in title:
                # if yes, wait for 10 seconds and try again
                print("Too many requests made. Waiting 10 sec and trying again.")
                time.sleep(10)
            else:
                # return soup of page if no error occurred
                return soup.find("div", {"id": "page"})


def results_found(soup):
    """ Check for 404 error; if it is found, return False, otherwise return True.

    Args:
        soup (obj): soup for full page

    Returns:
        boolean
    """
    if not soup.find(text=re.compile("404! Oh no!")):
        return True
    return False


def get_releases_by_master_id(m_id, releases):
    """ Return all releases with a certain master id.

    Args:
        m_id (int): discogs master id
        releases (list): list of Release objects

    Returns:
        list: Release objects
    """
    return [x for x in releases if x.discogs_master_id == m_id]


def calculate_chronology(release, earliest, mean):
    """ Calculate chronology score and add year to object if missing.

    Args:
        release (Release): object to calculate
        earliest: earliest release year of artist
        mean: mean value to use if year is missing
    """
    # set year for release if not set
    release.year = mean if not release.year else release.year
    now = this_year()
    max_score = 20
    span = now - earliest

    # set chronology to 1 if released first year
    release.chronology = 1 if release.year == earliest else release.chronology

    # set chronology to max if released this year
    release.chronology = max_score if release.year == now else release.chronology

    # find release year relative to career start
    year = release.year - earliest
    year = year if year >= 1 else 1
    year = year if year <= span else span

    # calculate score
    score = int(round((year / span) * max_score))
    score = score if score >= 1 else 1
    score = score if score <= max_score else max_score
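    # Example with made-up years: for a career spanning 1990-2020 (span 30),
    # a release from 2005 gives year = 15 and a score of
    # round((15 / 30) * 20) = 10 on the 1-20 chronology scale.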

    # set score in object if not set yet
    release.chronology = score if not release.chronology else release.chronology


def get_earliest_and_latest_year(releases):
    """ Find and return the earliest and latest occurring years among a list
    of Release objects.

    Args:
        releases: list of Release objects

    Raises:
        AttributeError: no year found among releases

    Returns:
        tuple: earliest and latest years
    """
    earliest = this_year()
    latest = 0
    found = False

    # loop through releases and find years
    for release in releases:
        if is_valid_int(release.year):
            found = True
            if release.year < earliest:
                earliest = release.year
            if release.year > latest:
                latest = release.year

    # raise exception if no year was found
    if not found:
        raise AttributeError("No year found among releases.")

    return (earliest, latest)


def get_format_from_card(card):
    """ Get string from html card that represents the item's format.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        string
    """
    try:
        res = card.find("span", {"class": "format"}).get_text()
    except:
        res = card.find("td", {"class": "title"}).get_text()
        res = res[res.index("("):res.index(")") + 1]
    return res


def is_valid_release(card):
    """ Check if discogs item is valid, i.e. is a vinyl record and is
    official (not a pirate/bootleg release).

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    if not is_official(card):
        return False

    return is_vinyl(card)


def is_official(card):
    """ Check if discogs item is official or not.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    frmt = get_format_from_card(card)

    # return false if unofficial item
    if "Unofficial" in frmt:
        return False
    return True


def is_vinyl(card):
    """ Check if discogs item is a vinyl item.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    frmt = get_format_from_card(card)

    # strings to look for in format string
    vinyl = ["LP", "7\"", "10\"", "12\""]

    # return true if any of the accepted strings appear in format string
    if any(x in frmt for x in vinyl):
        return True
    return False


def is_valid_int(int_string):
    """ Check if string represents a valid integer and return true/false.

    Args:
        int_string (str)

    Returns:
        boolean
    """
    try:
        int(int_string)
        return True
    except (ValueError, TypeError):
        return False


def get_id(url):
    """ Returns the number at the end of a string following a '/',
    i.e. "../12345" returns 12345.

    Args:
        url (str): url string

    Returns:
        int: id
    """
    return int(url[url.rfind("/") + 1:])


def get_mean(earliest, latest):
    """ Calculates mean of two integers and returns it.

    Args:
        earliest (int)
        latest (int)

    Returns:
        int: mean of two ints
    """
    return int(round(latest - ((latest - earliest) / 2)))


def this_year():
    """ Returns the current year as int.

    Returns:
        int: current year
    """
    return time.localtime().tm_year


def generate_artist_discography(db, artist_name, artist_url):
    """ Generate dataset for artist, delete any existing rows from database
    of the artist, and add the new dataset.

    Args:
        db (Database): database object
        artist_name (str): name of artist
        artist_url (str): url to artist's discogs profile
    """
    # scrape and generate data for artist
    dataset = post_scrape(get_releases(artist_name, artist_url))

    # delete already existing entries of artist in database
    db.delete_dataset(artist_name)

    # add dataset to database
    db.add_dataset(dataset)

    return len(dataset)
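The functions above are intended to be used together. The following is a minimal usage sketch, not part of the project code: the database path and the artist name/URL slug are placeholders, and it is assumed that the Database class imported in Appendix 2 also provides the delete_dataset and add_dataset methods called by generate_artist_discography.

    from database import Database

    db = Database("db/vinylprice_db.sqlite")
    # "Some-Artist" stands in for the part of the Discogs artist URL that
    # follows "artist/"; it is not a value from the actual dataset.
    count = generate_artist_discography(db, "Some Artist", "Some-Artist")
    print(f"Stored {count} priced vinyl releases for Some Artist")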

A Appendix 2

""" Experiment module. Includes classes for loading a dataset and creating
regression models which produce results that can be saved to a database. """
import sqlite3
from time import time
from math import ceil, floor, sqrt

from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

import fix_functions as fix
from database import Database

db = Database("db/results.sqlite")
seed = 9
artist_min = 10
artist_max = 300
max_price = 300
train_size = 0.8


class Dataset():
    """ Loads data from dataset and splits it into training and testing. """

    def __init__(self):
        """ Init method. """
        self.load_data()
        self.trim_artists()
        self.train_test()

    def load_data(self):
        """ Loads data from sqlite3 database. """
        conn = sqlite3.connect("db/vinylprice_db.sqlite")
        columns = (
            "V.usd, V.artist, V.genre, V.issue_type, V.limited, V.chronology, "
            "V.test_pressing, V.promo, V.colored, V.picture_disc, V.versions, "
            "V.vinyl_exclusive, V.age, V.area, V.album, V.numbered, "
            "V.box_set, V.compilation"
        )

        sql = (
            f"SELECT {columns} FROM Vinyl V INNER JOIN (SELECT artist, "
            f"count(artist) as cnt FROM Vinyl GROUP BY artist) C ON V.artist = "
            f"C.artist WHERE C.cnt > {artist_min} AND usd < {max_price} ORDER "
            f"BY V.artist"
        )

        self.data = pd.read_sql_query(sql, conn)

        # store lists of artists and genres
        self.artists = self.data.artist.unique().tolist()
        self.genres = self.data.genre.unique().tolist()

    def get_data(self, artist=None, genre=None, sans=None):
        """ Gets all data or specific data based on parameters and returns it.

        Args:
            artist (str): name of artist to get data for
            genre (str): name of genre to get data for
            sans (str): name of column to exclude

        Returns:
            tuple: two DataFrame objects of training and testing data
        """
        train, test = self.train, self.test

        if sans is not None:
            train = train.drop(columns=[sans])
            test = test.drop(columns=[sans])

        if not (artist or genre):
            return (train, test)

        set_type = "artist" if artist else "genre"
        name = artist if artist else genre

        train = train[train[set_type] == name]
        test = test[test[set_type] == name]
        return (train, test)

    def trim_artists(self):
        """ Keeps at most artist_max rows for each artist. """
        temp = None
        for artist in self.artists:
            a = self.data[self.data['artist'] == artist]
            a = shuffle(a, random_state=seed)[:artist_max]
            temp = a if temp is None else temp.append(a)
        self.data = temp

    def train_test(self):
        """ Splits data into training and testing for each artist. """
        train = None
        test = None
        for artist in self.artists:
            rows = self.data[self.data['artist'] == artist]
            train_cnt = int(len(rows) * train_size)
            a_train = rows[:train_cnt]
            a_test = rows[train_cnt:]
            train = a_train if train is None else train.append(a_train)
            test = a_test if test is None else test.append(a_test)
        self.train = train
        self.test = test
        del self.data


class Regression():
    """ Trains models using data and performs predictions. """

    def __init__(self, artist=None, genre=None, sans=None):
        """ Init method. Gets data from the Dataset object and trains the models.

        Args:
            artist (str): name of artist to get data for
            genre (str): name of genre to get data for
            sans (str): name of column to exclude
        """
        # get data from the module-level Dataset instance ('data'),
        # which is created before any Regression object is constructed
        self.tt = data.get_data(artist, genre, sans)

        # save size of training set
        self.tr_size = len(self.tt[0])

        self.remove_unique_cols()
        self.one_hot_encoding()
        self.standardize()
        self.init_models_standard()
        self.train_models()

    def standardize(self):
        """ Standardizes data using MinMaxScaler. """
        scaler = MinMaxScaler()
        X_train_temp, y_train = self.get_xy(self.tt[0], False)
        X_test_temp, y_test = self.get_xy(self.tt[1], False)

        # fit scaler on training data
        scaler.fit(X_train_temp)

        # apply transformation on training data
        X_train = scaler.transform(X_train_temp)
        X_train = pd.DataFrame(X_train, index=X_train_temp.index,
                               columns=X_train_temp.columns)

        # apply same transformation to test data
        X_test = scaler.transform(X_test_temp)
        X_test = pd.DataFrame(X_test, index=X_test_temp.index,
                              columns=X_test_temp.columns)

        # join and save
        train = self.xy_join(X_train, y_train)
        test = self.xy_join(X_test, y_test)
        self.tt = (train, test)

    def init_models_standard(self):
        """ Creates a dictionary of the models to be used, mapping each model
        class to its hyperparameters. """
        model_classes = dict()

        # linear regression
        params_lr = {"fit_intercept": False}
        model_classes['lr'] = (LinearRegression, params_lr)

        # k-nearest neighbors
        params_kn = {"n_neighbors": 5, "weights": "distance"}
        model_classes['kn'] = (KNeighborsRegressor, params_kn)

        # random forest
        params_rf = {"n_estimators": 100}
        model_classes['rf'] = (RandomForestRegressor, params_rf)

        # neural network
        params_nn = {"solver": "lbfgs", "activation": "relu"}
        model_classes['nn'] = (MLPRegressor, params_nn)

        self.model_classes = model_classes

    def merge(self):
        """ Merge training and testing data and return it.

        Returns:
            DataFrame
        """
        return self.tt[0].append(self.tt[1])

    def unmerge(self, tt):
        """ Split merged dataset into training and testing. """
        self.tt = (tt[:self.tr_size], tt[self.tr_size:])

    def remove_unique_cols(self):
        """ Remove categorical columns that contain only a single value. """
        # get merged rows
        tt = self.merge()
        # remove columns where all rows share the same value
        if 'artist' in tt.columns:
            if len(tt.artist.unique().tolist()) == 1:
                tt = tt.drop(columns=['artist'])
        if 'genre' in tt.columns:
            if len(tt.genre.unique().tolist()) == 1:
                tt = tt.drop(columns=['genre'])
        if 'area' in tt.columns:
            if len(tt.area.unique().tolist()) == 1:
                tt = tt.drop(columns=['area'])
        # unmerge and save
        self.unmerge(tt)

    def one_hot_encoding(self):
        """ Perform one hot encoding for the categorical features. """
        # get merged rows
        tt = self.merge()
        # one-hot encoding for categorical features
        tt = pd.get_dummies(tt)
        # one-hot encoding for issue type
        if 'issue_type' in tt.columns:
            issue_dummies = pd.get_dummies(tt['issue_type'], prefix="issue")
            tt = pd.concat([tt, issue_dummies], axis=1)
            tt = tt.drop(columns=['issue_type'])
        # unmerge and save
        self.unmerge(tt)
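    # Illustration with made-up values (not from the dataset): pd.get_dummies
    # turns a categorical column such as genre = ["punkrock", "electronic"]
    # into binary columns genre_punkrock and genre_electronic. This column
    # naming is what evaluate_models below relies on when it filters the test
    # set on "artist_<name>" or "genre_<name>" columns.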

    def train_models(self):
        """ Training of the models. """
        start = time()
        # get X and y of training data
        X_train, y_train = self.get_xy(self.tt[0])

        self.models = dict()

        for name, model in self.model_classes.items():
            self.models[name] = model[0](**model[1])
            self.models[name].fit(X_train, y_train)

        print("Training took: " + str(round(time() - start, 2)))

    def get_xy(self, df, usd_list=True):
        """ Splits a DataFrame into features X and target y.

        Returns:
            tuple: X as DataFrame, y as list (or Series if usd_list is False)
        """
        X = df.drop(columns=['usd'])
        y = df['usd'].tolist() if usd_list is True else df['usd']

        return X, y

    def xy_join(self, X, y):
        """ Joins X and y into a DataFrame and returns it.

        Returns:
            DataFrame
        """
        return pd.concat([X, y], axis=1)

    def evaluate_models(self, artist=None, genre=None, prefix="",
                        results=False, perc_list=False):
        """ Make predictions with models and return result.

        Args:
            artist (str): name of artist to get predictions for
            genre (str): name of genre to get predictions for
            prefix (str): optional prefix of the dictionary index names
            results (boolean): whether to return predicted values or not
            perc_list (boolean): whether to return list of APE scores or not

        Returns:
            dict: prediction results
        """
        # get data to evaluate
        if not (artist or genre):
            test = self.tt[1]
        else:
            set_type = "artist" if artist else "genre"
            name = artist if artist else genre
            test = self.tt[1][self.tt[1][set_type + '_' + name] == 1]

        # split to X and y
        X_test, y_test = self.get_xy(test)

        res = dict()

        for name, model in self.models.items():
            y_pred = model.predict(X_test).tolist()
            r2, rmse, ape_scores, mape, ape_fd = get_scores(y_test, y_pred)
            if results:
                res["actual"] = y_test
                res[f"{name}_pred"] = y_pred
            if perc_list:
                res[f"{prefix}_{name}_perclist"] = ape_scores
            res[f"{prefix}_{name}_r2"] = r2
            res[f"{prefix}_{name}_rmse"] = rmse
            res[f"{prefix}_{name}_mape"] = mape
            res[f"{prefix}_{name}_apefd"] = ape_fd

        return res


def get_scores(y_test, y_pred):
    """ Calculates different scores and returns as tuple.

    Args:
        y_test (list): actual y values
        y_pred (list): predicted y values

    Returns:
        tuple: results
    """
    ape_scores = ape_list(y_test, y_pred)
    return (
        # R2
        round(r2_score(y_test, y_pred), 3),
        # RMSE
        round(sqrt(mean_squared_error(y_test, y_pred)), 3),
        # list of APEs
        ape_scores,
        # MAPE
        round(mean(ape_scores), 3),
        # APE frequency distribution
        ape_fd(ape_scores)
    )


def ape_list(y_test, y_pred):
    """ Calculates list of APE scores.

    Args:
        y_test (list): actual y values
        y_pred (list): predicted y values

    Returns:
        list: APE scores
    """
    ape_scores = list()

    i = 0
    while i < len(y_test):
        diff = abs(y_test[i] - y_pred[i])
        perc = (diff / y_test[i]) * 100
        ape_scores.append(perc)
        i += 1
    return ape_scores


def ape_fd(ape_scores):
    """ Generates frequency distribution list of APE scores.

    Args:
        ape_scores (list): list of APE scores

    Returns:
        list: APE score frequency distribution list
    """
    max_perc = 150
    steps = 10
    res_dict = {key: 0 for key in range(0, max_perc, steps)}
    res_dict['over'] = 0
    for n in ape_scores:
        perc = floor(n)
        # round down to the nearest multiple of steps
        while perc % steps != 0:
            perc -= 1
        if perc >= max_perc:
            perc = 'over'
        res_dict[perc] += 1
    fd = list()
    for k, v in res_dict.items():
        fd.append(round((v / len(ape_scores)) * 100, 1))
    return fd


def mean(num_list):
    """ Calculates mean of numbers in list.

    Args:
        num_list (list): list of numbers

    Returns:
        float: mean of numbers
    """
    return sum(num_list) / len(num_list)
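To make the metric functions above concrete, the following toy example (the numbers are made up and are not taken from the experiment) shows how APE, MAPE and the frequency distribution behave:

    # Hypothetical values for illustration only.
    actual = [20.0, 50.0, 10.0]
    predicted = [25.0, 40.0, 30.0]

    scores = ape_list(actual, predicted)   # [25.0, 20.0, 200.0]
    print(round(mean(scores), 3))          # MAPE: 81.667
    print(ape_fd(scores))                  # 66.7 in the 20-30 % bucket,
                                           # 33.3 in the 'over 150 %' bucket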

B Appendix 1

ED            TS    SM     TR   KN R2    KN MAPE   LR R2    LR MAPE   NN R2    NN MAPE   RF R2    RF MAPE
altmetal      430   31.16  F    0.215    82.65     0.273    106.84    0.343    89.7      0.532    59.86
altmetal      430   31.16  G    0.253    76.07     0.289    93.78     0.139    107.15    0.424    62.52
altmetal      430   31.16  A    0.224    80.98     -0.007   111.19    -5.164   138.59    0.405    67.69
altrock       976   34.09  F    0.291    60.69     0.236    107.37    0.365    73.61     0.443    56.7
altrock       976   34.09  G    0.289    60.81     0.26     105.19    0.332    79.49     0.411    59.83
altrock       976   34.09  A    0.296    63.14     0.14     94.42     -0.963   100.93    0.439    58.28
blackmetal    522   40.1   F    0.217    45.94     0.281    57.95     0.577    45.62     0.503    35.23
blackmetal    522   40.1   G    0.234    45.36     0.349    59.92     0.444    45.73     0.57     38.71
blackmetal    522   40.1   A    0.23     46.71     0.301    63.19     0.279    57.22     0.486    42.12
classicrock   780   21.81  F    0.127    74.26     0.105    126.15    0.299    90.69     0.293    82.37
classicrock   780   21.81  G    0.126    74.83     0.181    115.68    0.086    102.07    0.256    79.62
classicrock   780   21.81  A    0.134    75.22     0.139    110.95    -0.172   110.12    0.206    81.85
deathmetal    438   39.12  F    0.321    39.84     0.342    50.05     0.409    47.17     0.466    34.78
deathmetal    438   39.12  G    0.32     39.9      0.412    46.77     0.261    59.2      0.488    43.19
deathmetal    438   39.12  A    0.308    43.36     0.193    63.34     -0.319   66.53     0.348    51.46
doommetal     833   39.06  F    0.349    48.96     0.342    62.03     0.4      56.74     0.436    42.27
doommetal     833   39.06  G    0.334    48.06     0.352    61.07     0.331    59.11     0.458    42.72
doommetal     833   39.06  A    0.31     48.77     0.172    69.75     -0.124   65.8      0.377    49.0
electronic    810   24.31  F    0.286    59.77     0.247    91.55     0.366    80.58     0.34     60.3
electronic    810   24.31  G    0.275    60.64     0.275    90.57     0.281    87.04     0.342    63.84
electronic    810   24.31  A    0.219    60.72     -0.646   100.53    -1.749   80.18     0.179    67.93
heavymetal    792   32.25  F    0.206    58.26     0.285    85.86     0.296    78.36     0.307    61.97
heavymetal    792   32.25  G    0.202    57.9      0.293    83.69     0.181    73.53     0.381    62.17
heavymetal    792   32.25  A    0.172    56.43     0.172    90.68     -0.77    78.09     0.248    66.92
punkrock      1124  28.43  F    0.113    69.57     0.145    91.13     0.291    81.47     0.267    62.94
punkrock      1124  28.43  G    0.11     70.04     0.172    94.65     0.185    90.72     0.249    65.81
punkrock      1124  28.43  A    0.123    68.38     0.087    94.63     -0.28    98.19     0.234    65.88
stonerrock    267   31.4   F    0.181    51.38     0.177    66.78     0.2      57.93     0.374    42.29
stonerrock    267   31.4   G    0.159    51.1      0.201    64.06     0.03     61.82     0.347    44.69
stonerrock    267   31.4   A    0.101    54.96     -0.263   64.57     -0.321   53.79     0.189    52.66
thrashmetal   584   31.99  F    0.086    58.96     0.352    64.76     0.25     62.38     0.391    47.59
thrashmetal   584   31.99  G    0.109    60.25     0.363    60.52     -0.217   68.22     0.417    48.78
thrashmetal   584   31.99  A    0.113    58.86     0.043    71.76     -0.424   82.92     0.262    60.38

79 B Appendix 2 AL TR 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 > GENRE: ALTERNATIVE METAL KN F 14.9 18.8 11.2 8.4 8.4 7.7 7.2 2.8 2.1 1.4 1.4 0.7 0.9 0.5 1.6 12.1 LR F 10.2 7.9 6.3 11.6 10.0 6.3 7.9 5.6 4.9 2.3 1.9 2.3 1.6 2.6 1.2 17.4 NN F 10.9 12.1 11.9 11.4 7.9 7.2 4.9 4.4 2.8 2.3 2.1 2.1 1.2 0.9 2.1 15.8 RF F 18.8 14.4 15.6 11.6 6.7 4.9 4.9 3.3 3.0 1.9 1.2 1.6 0.7 0.0 0.7 10.7 KN G 15.3 17.7 10.0 9.3 8.8 7.9 6.0 3.7 1.9 1.4 2.6 1.4 0.9 0.2 1.4 11.4 LR G 7.9 10.2 8.6 7.7 10.0 9.1 8.1 3.7 4.9 4.4 3.3 2.1 1.9 0.9 1.2 16.0 NN G 10.2 9.3 10.0 7.2 9.1 8.4 6.5 4.9 4.0 3.5 3.5 3.0 1.4 0.9 1.2 17.0 RF G 17.4 19.8 12.6 10.0 6.0 6.7 5.1 3.0 1.6 1.6 3.0 1.2 1.4 1.6 0.9 7.9 KN A 14.0 17.0 10.5 10.5 6.7 8.4 5.1 3.7 3.3 2.1 1.9 1.2 2.3 1.2 1.4 10.9 LR A 10.9 9.5 8.6 10.5 7.9 7.9 6.0 3.0 3.7 3.0 1.4 2.3 1.9 1.4 2.6 19.3 NN A 10.9 13.3 10.9 10.0 6.3 7.7 5.6 4.0 2.8 4.0 1.4 1.2 1.4 1.4 1.4 17.9 RF A 15.8 16.7 12.1 11.2 7.2 6.7 4.0 4.2 2.8 2.3 0.9 2.1 1.4 1.9 1.4 9.3 GENRE: ALTERNATIVE ROCK KN F 15.4 13.9 14.7 12.0 9.4 6.7 5.9 3.7 3.6 1.6 0.9 1.4 0.9 1.0 0.8 8.0 LR F 7.5 10.3 9.1 9.0 6.9 8.5 8.2 4.8 3.9 2.6 3.0 2.3 1.4 1.6 1.1 19.8 NN F 10.7 10.7 10.7 10.9 9.5 7.3 5.3 4.9 4.6 3.3 2.5 3.1 2.2 1.6 1.7 11.2 RF F 16.2 14.3 14.0 12.2 8.7 6.4 5.4 4.6 3.8 2.4 1.4 0.8 0.9 1.1 0.8 6.9 KN G 15.3 15.0 13.6 11.6 9.3 7.5 5.5 4.2 3.5 1.4 1.0 1.1 1.2 1.3 0.8 7.6 LR G 8.8 8.4 8.3 6.5 8.3 6.9 8.1 5.0 5.5 3.7 2.8 2.7 1.6 2.5 1.4 19.6 NN G 11.5 11.2 12.1 11.1 8.8 6.5 6.6 4.7 3.9 2.9 1.7 2.3 1.4 1.3 1.1 13.0 RF G 15.9 16.1 12.7 10.0 8.8 7.8 5.7 4.3 3.5 1.2 1.9 1.0 0.7 1.2 1.0 8.0 KN A 15.5 13.2 13.3 12.6 8.4 7.6 5.9 4.9 3.5 1.1 1.1 1.4 1.1 0.9 0.9 8.4 LR A 10.2 9.2 10.5 7.8 8.1 7.6 6.2 5.7 5.7 3.1 3.7 1.8 1.7 1.6 1.6 15.3 NN A 11.1 12.0 12.2 11.0 8.1 5.3 4.8 4.4 4.5 2.5 1.8 1.9 1.3 2.0 1.3 15.7 RF A 16.5 15.9 13.4 9.9 9.4 5.4 6.0 4.4 4.0 2.2 1.5 0.9 0.9 1.2 0.8 7.4 GENRE: BLACK METAL KN F 26.8 15.3 11.5 10.0 7.7 4.2 5.2 3.1 3.4 1.7 2.1 0.8 1.7 1.1 0.8 4.6 LR F 11.3 11.3 13.2 8.2 10.9 10.7 6.9 5.7 5.6 2.5 1.5 2.1 2.1 1.1 1.1 5.6 NN F 17.8 15.5 13.2 13.8 10.3 7.9 5.6 2.9 2.5 2.3 1.3 1.1 1.1 0.2 0.4 4.0 RF F 23.9 20.3 16.1 9.8 8.4 6.9 4.8 3.3 1.0 1.1 0.2 0.6 0.4 0.2 0.2 2.9 KN G 26.1 15.5 12.8 9.8 7.5 3.8 5.4 3.4 3.4 1.1 1.9 1.1 1.5 0.8 1.1 4.6 LR G 13.2 12.8 11.5 10.3 6.5 8.0 10.0 5.6 3.6 2.9 3.4 1.3 2.1 1.5 0.4 6.7 NN G 18.8 17.2 14.4 13.8 5.9 6.7 4.8 3.8 1.9 1.7 2.5 2.1 0.6 1.7 0.4 3.6 RF G 23.6 21.6 14.2 9.6 9.6 5.7 2.9 4.2 1.1 1.1 0.8 0.6 0.6 0.6 0.4 3.4 KN A 23.2 14.6 13.2 10.5 8.2 5.0 6.3 4.4 3.6 0.8 1.1 1.0 1.0 1.0 0.2 5.9 LR A 15.1 13.4 10.3 9.8 7.1 9.4 6.7 5.9 2.5 3.1 1.9 3.1 0.6 1.1 1.5 8.4 NN A 20.7 14.4 12.5 9.8 9.0 6.1 5.9 3.3 2.3 2.5 2.1 1.1 1.0 1.0 0.4 8.0 RF A 20.5 20.1 15.9 12.1 5.7 5.7 4.8 2.5 2.7 1.7 0.8 1.1 0.8 0.6 0.8 4.2 GENRE: CLASSIC ROCK KN F 14.5 14.0 9.0 10.0 11.2 6.5 5.8 6.4 4.9 2.2 2.4 0.5 1.2 0.6 0.8 10.1 LR F 9.0 5.6 6.8 7.8 6.4 6.7 5.9 5.9 5.1 3.2 2.9 2.8 1.5 2.6 1.8 25.9 NN F 10.5 9.7 11.9 10.8 8.7 5.1 6.4 6.4 4.0 2.4 1.3 2.2 1.5 1.3 1.4 16.3 RF F 16.8 14.2 12.4 10.4 8.6 6.3 5.9 3.8 2.3 1.8 1.3 1.5 1.3 0.9 0.5 11.9 KN G 14.5 13.2 9.7 9.9 11.8 5.9 6.2 6.2 4.5 2.3 2.3 0.8 1.2 0.8 0.8 10.1 LR G 9.5 8.7 9.1 8.7 7.3 7.2 7.7 6.0 5.8 2.1 1.8 1.8 1.8 1.5 1.5 19.5 NN G 9.9 10.1 10.3 9.9 7.6 6.3 7.2 6.0 3.2 3.6 1.8 1.5 2.1 1.4 1.3 17.9 RF G 16.7 15.0 10.6 10.1 7.3 6.8 6.0 4.6 2.8 2.4 0.8 1.9 1.3 0.6 1.0 11.9 KN A 14.4 15.0 8.5 10.9 9.6 6.9 5.5 6.8 5.3 1.7 1.5 1.0 1.3 0.8 0.6 10.3 LR A 11.9 10.9 9.1 7.8 9.2 7.1 5.4 4.5 4.4 4.0 1.8 2.9 1.7 1.2 1.5 16.7 NN A 11.0 9.7 12.3 9.0 9.1 7.8 5.4 5.6 3.8 
2.9 2.2 1.3 1.4 0.9 0.5 16.9 RF A 14.7 15.0 11.0 10.0 9.0 6.9 5.9 4.7 3.6 2.1 1.4 1.3 1.5 0.8 1.2 10.9 GENRE: DEATH METAL KN F 20.3 16.7 16.4 11.9 8.9 7.8 4.3 2.3 2.5 1.6 1.1 1.1 0.7 0.9 0.7 2.7 LR F 13.0 14.8 12.8 13.2 10.3 8.0 6.8 1.4 3.7 4.1 1.6 1.8 1.1 1.1 1.1 5.0 NN F 18.9 14.8 13.7 9.8 9.8 8.0 5.0 4.1 3.0 3.7 0.9 1.8 1.4 1.4 0.7 3.0 RF F 23.1 18.0 16.0 12.3 8.4 5.9 5.3 3.9 1.1 0.7 1.4 0.7 0.7 0.5 0.2 1.8 KN G 21.7 16.4 13.7 12.6 7.5 7.8 5.0 3.4 2.7 1.8 1.4 0.5 0.9 1.6 0.2 2.7 LR G 18.9 13.7 16.2 13.2 7.3 6.4 3.9 4.6 3.2 2.7 2.7 0.7 1.6 1.6 0.2 3.0 NN G 17.4 16.0 13.9 11.0 8.7 7.8 4.6 4.8 3.7 2.3 1.4 1.4 1.4 1.1 0.2 4.6 RF G 19.9 19.4 15.8 12.6 8.7 6.8 5.0 3.9 2.3 0.2 0.7 1.1 0.5 0.7 0.7 1.8 KN A 20.8 14.2 16.0 12.3 7.8 6.6 5.9 2.1 2.5 2.5 2.1 0.5 1.4 0.9 0.7 3.9 LR A 16.4 14.2 12.6 11.2 9.6 7.8 3.7 5.3 4.3 3.2 1.8 1.4 1.4 0.5 0.7 6.2 NN A 15.8 13.7 11.6 11.2 9.1 7.8 6.4 5.3 2.3 1.8 2.1 1.6 1.4 0.9 1.1 8.0 RF A 19.6 16.2 16.0 11.0 8.0 4.8 5.5 5.5 3.2 1.6 1.6 0.9 0.5 0.9 0.7 4.1 GENRE: DOOM METAL KN F 20.8 16.0 12.0 11.2 9.4 6.7 5.2 4.1 3.0 2.0 1.8 1.2 1.0 0.8 0.1 4.8 LR F 13.8 13.8 13.8 11.4 10.6 8.2 5.4 3.5 3.7 2.3 1.4 1.2 1.1 0.8 1.0 8.0 NN F 16.9 13.1 13.1 11.4 10.1 7.0 5.8 4.9 2.8 1.7 1.4 2.8 1.7 0.7 1.0 5.8 RF F 21.5 18.6 15.1 11.2 8.5 7.0 4.8 2.8 2.4 1.6 0.6 0.5 0.5 0.6 0.5 4.0

80 KN G 20.6 15.6 12.2 10.9 10.1 7.2 4.6 3.7 3.5 2.0 2.0 1.0 1.0 1.1 0.1 4.3 LR G 13.7 12.6 14.3 12.4 9.5 7.8 7.4 4.3 2.4 2.2 0.8 1.6 1.3 1.2 1.0 7.6 NN G 15.6 13.3 13.9 10.9 9.4 7.8 5.2 4.1 3.1 3.0 1.9 1.2 1.4 0.8 0.8 7.4 RF G 19.8 17.6 15.6 13.0 8.0 7.0 4.8 2.2 2.6 2.3 1.1 0.2 1.3 0.1 0.4 4.0 KN A 20.4 16.9 11.0 10.8 10.4 7.2 4.6 4.3 3.1 2.2 1.4 0.6 0.6 0.6 0.6 5.2 LR A 17.3 15.5 11.8 10.6 7.6 8.3 6.0 4.0 3.2 2.5 1.8 1.0 1.9 0.5 1.2 7.0 NN A 17.5 15.8 12.4 9.7 9.4 5.6 5.0 4.0 3.6 3.0 1.2 1.2 1.1 1.1 1.1 8.3 RF A 19.6 18.0 14.5 10.8 9.7 6.8 4.4 3.4 3.2 1.6 1.1 1.3 0.5 0.2 0.7 4.1 GENRE: ELECTRONIC KN F 17.8 13.6 10.7 13.3 9.6 7.5 6.0 4.7 2.5 1.4 2.2 1.5 0.9 1.0 0.7 6.5 LR F 10.2 8.9 8.0 7.4 7.8 8.3 6.2 5.7 5.1 4.2 3.6 1.7 2.1 1.6 1.4 17.9 NN F 13.7 9.5 10.5 9.5 8.0 9.3 6.7 4.3 2.6 2.8 3.0 1.6 1.6 2.1 2.0 12.8 RF F 16.8 17.9 12.1 11.0 8.1 6.0 6.8 4.3 3.0 0.7 1.6 0.9 1.7 0.6 0.4 8.0 KN G 17.0 14.7 10.7 13.6 8.6 7.8 5.9 4.2 2.5 1.5 2.3 1.4 0.9 1.0 1.0 6.9 LR G 12.0 11.2 9.1 7.5 5.8 6.3 6.9 5.1 3.8 3.7 2.5 2.2 3.3 2.0 1.5 17.0 NN G 11.2 14.3 9.3 9.3 7.7 6.7 6.4 4.4 5.1 3.0 3.0 2.2 2.1 1.1 1.5 12.8 RF G 15.4 18.4 10.7 10.9 9.0 6.9 5.7 3.7 2.6 1.9 1.1 1.1 1.2 0.7 1.0 9.6 KN A 18.1 12.3 10.7 12.6 8.5 8.5 5.2 5.3 2.7 1.9 3.0 1.0 1.1 1.1 0.9 7.0 LR A 11.9 12.2 11.5 8.0 6.7 7.9 6.8 5.7 3.0 2.3 3.0 1.4 1.9 1.2 1.6 15.1 NN A 15.8 12.7 10.7 7.5 8.4 5.7 6.0 5.2 3.1 4.1 1.7 2.1 1.9 1.6 1.6 11.9 RF A 17.4 17.0 11.0 11.2 8.0 7.0 4.2 5.1 2.8 2.3 1.4 0.7 1.1 1.1 0.5 9.0 GENRE: HEAVY METAL KN F 20.5 17.2 13.5 10.0 8.0 6.7 4.5 3.3 3.9 1.3 1.0 0.9 1.1 0.8 0.6 6.8 LR F 10.7 11.0 9.2 8.5 7.8 7.3 6.6 4.4 4.7 3.8 2.7 2.7 2.8 0.9 1.4 15.7 NN F 12.8 12.2 10.7 10.2 8.8 7.4 6.3 4.0 4.5 2.4 2.3 2.0 1.4 1.5 1.4 11.9 RF F 20.3 17.4 12.8 10.5 9.2 6.3 4.7 4.2 2.1 1.3 0.9 1.1 0.4 0.6 0.5 7.7 KN G 19.9 17.3 14.1 10.2 7.8 6.2 4.5 3.4 3.8 1.3 1.4 1.0 1.1 0.6 0.5 6.7 LR G 11.2 12.4 9.8 8.6 7.3 8.7 6.3 5.3 4.8 3.0 2.7 2.0 1.1 1.1 1.8 13.8 NN G 15.4 12.4 12.1 10.1 7.4 7.3 5.1 4.7 2.8 2.7 2.1 1.0 1.6 1.8 0.6 12.9 RF G 21.2 16.0 11.4 12.2 9.7 6.4 4.4 3.5 2.3 1.6 1.0 0.9 0.4 0.5 0.6 7.7 KN A 21.0 16.5 13.8 9.7 8.1 7.1 4.2 4.7 2.9 2.0 0.6 1.3 1.1 0.4 0.5 6.2 LR A 11.6 14.3 12.1 9.7 10.2 8.2 5.6 4.0 3.2 2.0 1.6 1.6 1.6 1.4 1.1 11.6 NN A 16.2 15.9 12.6 8.6 7.2 6.7 5.7 3.7 2.4 3.2 2.7 1.5 1.6 0.8 1.5 9.8 RF A 20.8 16.0 13.9 10.0 8.5 7.2 4.7 3.7 1.5 1.5 1.0 0.9 0.9 0.6 0.5 8.3 GENRE: PUNK ROCK KN F 17.7 14.1 12.6 10.1 8.5 7.6 5.1 3.6 4.0 1.3 1.5 1.6 1.2 0.8 0.8 9.5 LR F 10.5 9.4 9.3 8.6 8.1 5.3 8.2 6.3 3.7 3.3 2.5 2.2 2.8 1.6 1.6 16.5 NN F 12.5 12.0 11.9 7.8 7.7 8.1 5.7 5.2 4.3 3.1 2.5 1.2 2.1 1.3 1.2 13.3 RF F 17.2 17.1 10.6 11.3 9.2 6.9 4.4 3.4 3.7 1.8 2.0 1.9 1.0 0.3 1.7 7.6 KN G 18.1 14.6 12.3 10.3 8.2 7.2 5.3 3.6 3.6 1.4 1.8 1.3 1.1 0.8 1.0 9.3 LR G 11.7 10.1 8.2 8.1 7.5 7.2 6.7 5.2 3.9 3.8 2.9 1.9 2.0 1.7 1.6 17.5 NN G 11.7 11.7 9.9 8.5 8.7 8.0 6.9 5.6 3.8 3.1 2.8 1.8 2.0 1.3 0.4 13.7 RF G 16.7 14.9 13.2 10.5 11.0 5.4 4.9 3.8 3.3 1.8 2.1 1.1 1.3 0.4 0.9 8.7 KN A 16.2 16.1 12.4 10.5 8.5 6.6 6.0 4.1 4.0 1.5 1.2 1.1 1.5 0.9 0.6 8.9 LR A 10.9 11.9 10.4 8.5 7.2 7.3 7.0 5.6 4.6 2.7 2.6 1.4 1.6 0.8 1.8 15.7 NN A 14.1 12.5 9.9 8.0 8.1 7.6 5.2 5.0 4.2 2.8 1.7 1.3 1.2 1.3 1.4 15.7 RF A 18.0 15.6 11.7 11.7 9.1 6.1 5.5 3.6 2.7 2.7 1.4 1.2 0.7 0.6 0.4 8.8 GENRE: STONER ROCK KN F 19.9 13.1 15.0 9.7 10.1 4.9 6.7 6.0 3.7 2.2 0.7 0.0 1.1 0.4 0.0 6.4 LR F 15.4 9.7 14.6 10.5 9.7 7.9 5.2 4.9 3.7 1.5 3.4 1.5 0.7 1.5 1.1 8.6 NN F 16.5 14.6 13.9 7.5 9.0 7.1 6.4 4.5 3.0 2.6 1.1 1.1 2.6 1.1 1.5 7.5 RF F 22.1 18.7 14.6 11.6 4.9 9.7 4.5 4.1 1.5 0.4 0.7 0.4 
0.7 0.7 0.7 4.5 KN G 18.7 13.9 13.9 10.5 11.6 4.9 6.0 7.1 2.2 2.2 1.1 0.4 0.7 0.4 0.0 6.4 LR G 16.5 12.4 11.2 11.2 11.2 6.0 3.7 4.1 2.2 3.0 5.6 2.6 0.7 0.7 0.7 7.9 NN G 15.4 15.4 10.5 9.4 9.0 7.5 6.0 7.5 4.9 1.5 0.4 0.7 1.1 0.7 2.2 7.9 RF G 20.2 21.3 15.7 8.2 8.2 6.0 6.0 3.0 1.5 1.5 0.7 0.4 0.7 0.7 0.0 5.6 KN A 19.1 16.1 14.2 9.4 8.6 5.6 7.1 5.2 3.0 1.1 1.1 0.0 1.1 0.7 0.4 7.1 LR A 13.1 13.5 12.4 12.4 7.1 7.1 4.9 6.0 1.9 3.0 3.0 1.5 1.9 2.2 1.9 8.2 NN A 15.4 10.5 14.6 8.2 11.2 7.5 10.9 5.2 3.7 2.2 1.5 1.1 1.5 0.4 0.7 5.2 RF A 18.7 21.3 12.4 9.7 11.2 5.2 5.2 4.1 2.2 0.0 0.0 1.5 0.0 1.1 0.4 6.7 GENRE: THRASH METAL KN F 20.0 16.4 14.4 10.4 8.0 7.5 5.1 3.6 3.6 1.0 0.2 1.7 0.5 1.5 0.3 5.5 LR F 12.5 12.8 13.9 10.4 8.6 11.6 5.5 5.1 3.6 2.1 1.4 1.0 1.0 1.0 0.5 8.9 NN F 15.2 13.4 11.0 12.7 10.1 6.7 6.7 3.3 4.5 1.9 1.4 1.9 0.9 1.0 1.9 7.7 RF F 23.1 18.7 14.4 11.0 8.9 5.5 5.7 1.7 1.7 0.3 1.0 0.7 1.2 1.2 0.5 4.5 KN G 20.4 15.1 14.4 11.1 8.2 7.5 5.8 4.3 2.4 1.0 0.7 1.5 0.5 0.9 0.5 5.7 LR G 13.7 10.8 13.0 14.0 11.1 7.7 6.3 4.6 2.7 2.1 1.0 0.9 3.1 1.0 0.3 7.5 NN G 14.9 14.7 15.1 11.5 7.2 7.7 5.0 3.4 2.4 1.5 2.4 1.5 1.0 1.0 0.7 9.9 RF G 21.9 19.7 12.3 11.6 8.7 6.0 3.9 4.1 2.1 0.9 0.5 1.0 0.7 0.5 0.3 5.7 KN A 20.0 16.6 14.0 11.0 7.7 6.5 6.2 4.6 2.4 1.0 0.5 1.7 0.7 0.9 0.3 5.8 LR A 14.2 14.6 14.0 9.9 9.4 7.4 6.7 3.3 3.3 3.1 2.9 1.0 1.4 1.2 0.3 7.4 NN A 19.7 14.2 11.0 9.6 8.6 4.6 5.8 3.3 4.5 1.7 1.7 1.0 1.9 1.9 0.7 9.9 RF A 19.5 19.3 14.6 9.2 10.4 4.6 4.8 2.6 2.4 1.0 1.4 1.2 1.2 0.5 0.5 6.7

81 B Appendix 3 ED TS SM TR: FULL DATASET TR: GENRE TR: ARTIST KN LR NN RF KN LR NN RF KN LR NN RF GENRE: ALTERNATIVE METAL A1 9 57.12 38.48 25.97 27.85 35.01 37.93 30.5 48.25 34.2 40.43 39.68 45.14 39.14 A2 4 38.62 32.03 46.23 42.91 33.53 36.9 53.64 48.27 30.74 31.94 59.5 44.85 23.25 A3 7 44.44 57.65 34.88 37.3 52.72 57.65 35.41 35.33 55.99 57.65 40.82 57.65 52.19 A4 5 22.67 39.3 100.44 28.98 28.92 39.97 63.99 50.57 24.46 75.61 50.58 31.53 43.34 A5 14 27.0 38.0 58.31 50.67 35.69 44.5 79.39 44.45 33.78 42.64 70.85 51.62 39.65 A6 11 27.79 32.55 63.01 84.07 45.35 26.26 64.33 110.72 48.14 27.76 28.39 40.99 24.97 A7 5 35.42 69.38 29.48 38.0 28.88 69.32 59.56 26.1 32.02 62.87 46.0 139.76 35.6 A8 10 42.72 61.86 49.03 64.97 28.91 62.43 47.91 33.28 30.17 62.04 56.76 106.36 40.13 A9 13 27.28 50.6 61.18 63.67 41.7 41.79 87.09 64.95 39.51 39.37 73.34 49.91 40.58 A10 18 39.39 45.53 48.53 47.25 34.82 44.48 50.54 58.06 41.13 68.43 67.46 99.92 47.82 A11 3 22.83 30.68 49.69 34.58 75.83 32.25 62.09 53.74 171.48 30.1 75.7 67.53 31.58 A12 7 15.77 82.88 27.99 39.05 70.33 83.84 46.97 101.68 82.57 36.48 60.67 77.53 30.59 A13 20 45.1 53.66 94.81 59.97 40.17 51.75 65.01 64.23 33.88 51.83 102.79 80.74 53.85 A14 12 34.71 64.78 70.02 62.67 59.17 66.15 64.47 57.21 66.31 63.31 51.48 79.3 66.54 A15 11 46.87 67.74 69.32 81.98 44.95 48.91 66.15 92.02 41.11 77.71 70.42 49.57 67.25 A16 3 17.13 65.88 53.55 48.43 55.48 65.87 84.31 120.11 51.4 54.88 83.23 46.84 53.25 A17 25 38.77 64.5 68.62 65.4 38.93 62.4 52.95 94.21 32.41 62.17 56.25 154.68 38.29 A18 15 40.3 62.38 92.54 98.33 36.84 53.0 77.47 67.03 41.37 52.08 140.88 65.17 56.05 A19 38 23.91 40.9 61.85 82.02 58.66 43.62 57.19 133.42 60.71 41.95 97.03 101.79 89.46 A20 7 20.12 34.89 72.55 44.61 24.59 47.26 45.57 60.1 18.9 35.83 160.44 268.72 102.83 A21 6 26.13 92.38 53.51 81.09 60.94 55.46 55.02 86.69 76.13 45.98 138.98 123.58 70.84 A22 18 23.72 30.9 132.88 125.74 64.35 31.85 134.72 115.82 51.83 38.21 85.34 93.07 58.07 A23 3 20.75 102.27 91.81 123.04 42.89 99.6 42.55 73.4 6.91 142.22 172.65 28.78 57.42 A24 10 42.72 112.85 76.6 72.7 42.12 113.43 57.26 141.55 69.13 116.04 59.43 61.86 74.95 A25 10 18.12 71.41 45.13 155.72 56.29 67.05 87.03 144.3 53.43 60.9 99.19 140.65 56.19 A26 3 29.07 161.64 83.8 55.65 39.28 170.3 152.43 51.02 30.24 94.16 74.11 69.92 57.67 A27 7 43.44 35.46 177.14 46.4 57.05 35.75 103.43 102.59 26.34 34.98 355.94 46.09 54.51 A28 25 26.22 89.62 98.06 84.99 65.13 92.77 98.62 110.17 60.35 83.13 121.63 115.87 59.37 A29 8 30.64 102.08 154.1 98.25 83.59 103.53 135.16 88.55 64.6 88.56 103.76 85.68 74.13 A30 4 46.49 64.03 85.25 113.53 89.59 79.89 96.71 183.06 123.76 64.93 125.62 126.51 57.52 A31 15 24.96 67.91 80.54 78.24 46.25 66.17 62.44 74.24 70.06 76.11 93.79 407.39 94.89 A32 9 24.59 78.59 73.49 34.22 67.42 76.45 48.96 130.86 73.26 70.34 65.61 430.72 69.28 A33 17 20.99 84.44 214.41 120.14 55.49 87.78 166.55 82.55 56.74 77.03 141.03 151.88 42.72 A34 4 6.52 189.3 184.31 142.5 80.32 52.91 204.67 154.82 89.58 48.38 50.04 43.91 49.57 A35 7 12.02 72.54 504.55 99.92 94.67 54.72 258.65 123.24 90.75 65.82 52.64 81.6 69.31 A36 5 26.83 144.97 139.95 140.02 111.82 156.46 68.9 175.37 115.96 163.28 210.56 190.22 147.58 A37 9 46.11 68.26 150.02 145.99 124.22 66.24 130.13 162.6 176.69 204.04 250.61 305.32 123.35 A38 3 13.74 156.2 133.18 160.18 144.07 156.13 51.99 246.31 163.82 153.66 148.18 145.8 252.99 A39 7 8.41 147.15 232.53 119.19 149.04 119.99 148.42 267.18 116.69 108.19 311.02 75.8 144.07 A40 8 17.94 484.47 203.29 302.77 215.05 262.73 153.24 349.16 199.5 252.54 100.16 603.5 
122.01 A41 15 45.11 324.93 385.64 257.99 96.44 333.6 391.25 267.75 133.67 417.52 431.58 434.27 160.98 GENRE: ALTERNATIVE ROCK A42 25 32.39 29.93 33.08 32.88 23.44 27.55 47.04 32.29 22.19 29.88 56.33 26.59 24.75 A43 11 55.83 31.87 43.03 22.76 24.6 35.49 41.08 35.42 25.46 34.05 43.6 29.74 26.26 A44 7 45.59 34.1 51.78 21.83 33.9 34.34 71.43 53.44 23.83 29.08 28.66 35.0 27.28 A45 4 94.22 28.9 37.5 40.19 47.96 30.64 39.49 30.64 44.49 24.59 63.58 30.31 32.62 A46 22 33.13 34.23 46.15 33.84 31.94 33.82 50.75 38.68 32.45 31.46 42.57 81.71 36.08 A47 44 29.6 43.58 41.77 37.79 33.9 43.11 66.97 43.12 36.5 41.74 36.96 62.42 33.83 A48 10 35.26 43.92 32.03 74.67 30.94 46.73 38.51 47.17 55.35 64.79 55.88 41.97 21.13 A49 33 46.19 40.42 63.54 62.06 34.92 39.93 59.51 52.25 33.58 36.81 57.74 62.24 30.89 A50 38 33.65 41.51 67.92 64.09 42.3 41.0 72.36 47.94 40.93 42.02 58.82 52.9 41.49 A51 12 31.54 51.58 65.25 54.79 34.04 51.92 31.88 81.65 31.29 52.58 106.88 45.37 41.68 A52 12 31.45 49.41 57.86 57.62 39.22 48.92 74.33 64.45 42.32 51.66 81.66 58.12 38.9 A53 58 29.96 41.93 75.44 73.08 57.26 40.31 72.63 67.06 61.69 42.48 67.88 79.11 45.4 A54 19 63.45 57.98 64.65 56.07 44.08 60.17 60.67 56.37 41.41 77.5 74.62 75.19 59.51 A55 13 33.82 69.03 34.8 57.3 36.07 67.41 46.69 44.6 43.53 66.98 57.88 156.75 54.02 A56 20 38.45 77.66 39.07 39.92 63.98 77.65 64.15 48.23 62.91 77.4 44.52 96.71 60.17

82 A57 54 17.51 40.62 63.31 72.05 57.81 40.09 88.6 95.09 52.21 43.47 64.97 88.54 54.44 A58 15 21.68 54.02 138.77 68.25 50.92 55.17 106.37 47.6 56.31 50.55 63.53 82.4 59.15 A59 43 21.56 46.5 120.36 73.38 59.02 46.71 111.71 67.87 64.5 45.87 88.36 70.24 52.85 A60 27 20.43 68.68 92.85 69.07 48.04 69.94 109.54 71.03 48.66 75.06 58.99 100.65 41.42 A61 10 38.68 58.55 82.92 50.56 76.69 60.97 88.11 95.57 66.74 61.76 66.09 116.07 56.05 A62 25 27.18 58.49 90.44 56.66 53.01 59.01 84.1 91.07 66.22 58.16 83.31 174.66 66.71 A63 60 51.85 60.25 135.64 79.93 59.28 61.78 116.99 64.97 58.4 66.12 97.23 82.83 66.52 A64 37 38.2 43.8 144.9 63.19 43.85 44.63 128.83 61.68 43.47 45.47 172.57 108.54 55.4 A65 10 57.61 47.57 100.23 58.2 55.43 47.99 61.49 61.39 55.75 53.08 254.62 109.29 57.88 A66 3 27.39 23.46 67.07 52.38 33.54 23.02 67.28 30.3 29.42 49.84 454.47 71.69 66.83 A67 13 51.72 48.11 129.64 91.99 38.09 47.94 132.02 83.12 39.25 76.82 160.75 105.27 57.5 A68 36 22.49 49.91 135.26 99.21 67.7 49.68 133.69 81.38 63.33 60.64 118.56 82.54 69.9 A69 11 33.99 42.15 234.3 56.83 37.84 46.35 215.64 70.68 46.46 64.01 136.87 39.17 43.11 A70 45 23.47 64.62 122.69 99.19 75.41 61.85 114.0 127.65 69.0 56.6 100.91 112.72 43.98 A71 19 15.29 59.59 92.52 63.85 94.3 57.87 142.18 127.57 101.36 49.19 107.34 118.32 53.33 A72 9 40.11 67.81 125.54 93.42 72.43 61.59 145.49 60.48 91.73 93.08 65.75 139.77 53.19 A73 15 15.66 71.52 163.2 95.88 62.41 73.8 113.61 88.67 75.5 68.79 73.11 115.01 76.1 A74 49 42.3 58.37 186.39 84.5 54.33 61.96 170.94 84.83 57.98 61.91 131.55 73.28 67.53 A75 8 14.73 79.88 93.91 54.14 105.38 68.91 154.74 151.61 130.01 57.85 70.72 106.73 72.43 A76 21 65.57 120.52 69.28 77.48 86.16 119.33 67.2 94.98 102.26 120.21 57.41 130.57 115.23 A77 35 40.99 81.75 162.6 109.67 58.81 79.65 135.7 108.42 70.56 79.29 141.63 111.94 57.12 A78 15 20.47 125.83 114.15 89.64 66.59 125.55 121.95 122.47 60.95 108.99 105.93 117.04 90.75 A79 4 32.91 50.98 186.92 95.69 55.46 47.39 156.34 78.38 72.27 108.61 189.95 101.84 109.92 A80 12 16.25 102.49 115.12 110.67 85.93 105.89 113.14 140.21 112.43 105.94 73.22 172.26 83.69 A81 13 40.85 90.31 117.76 130.4 69.08 88.59 71.48 75.28 82.21 132.85 164.06 242.38 112.5 A82 23 13.3 108.55 135.9 91.61 59.48 107.64 150.73 130.66 51.61 94.24 75.81 385.79 28.97 A83 7 24.28 140.26 400.23 52.62 37.81 140.63 285.43 77.48 41.35 143.21 85.35 126.87 34.42 A84 9 26.88 190.06 133.64 103.11 145.53 209.48 194.78 200.97 175.0 63.29 323.05 93.95 75.83 A85 3 13.52 88.72 417.52 101.21 34.42 87.66 312.35 134.77 35.84 324.67 102.66 89.93 293.36 A86 9 86.26 104.91 194.12 218.01 192.54 106.4 168.45 290.45 236.48 114.13 311.83 279.62 288.65 A87 8 69.6 184.72 414.37 135.69 117.12 188.83 378.04 110.95 134.32 282.48 191.5 294.89 207.65 GENRE: BLACK METAL A88 3 60.66 16.44 26.24 15.02 16.59 16.44 28.72 16.04 16.98 16.44 16.44 16.44 16.59 A89 4 20.14 9.5 30.86 36.43 10.73 9.49 29.73 47.77 9.93 9.47 52.96 11.04 10.8 A90 30 37.48 20.09 47.12 32.54 21.85 20.52 32.85 26.02 18.62 22.42 46.9 29.23 26.16 A91 12 28.53 24.92 25.88 33.14 24.01 21.55 30.28 34.65 19.21 27.67 21.69 57.39 26.66 A92 26 48.59 27.27 51.38 23.71 17.13 26.84 30.84 22.48 21.43 25.92 35.27 42.27 22.77 A93 27 52.66 23.01 36.9 26.52 30.15 20.48 37.35 31.27 33.74 22.93 39.01 48.33 24.34 A94 20 52.0 30.73 43.47 32.04 25.55 30.23 45.29 28.95 32.35 28.91 32.77 36.22 31.5 A95 18 22.41 29.18 34.57 38.59 29.32 27.3 44.78 40.02 26.81 27.21 40.9 53.61 26.73 A96 3 76.67 29.96 20.72 31.2 42.09 30.1 24.29 43.84 23.52 17.6 113.05 17.26 27.18 A97 6 47.1 35.05 32.0 38.07 28.99 35.8 32.79 57.17 
33.54 41.34 33.64 29.08 23.75 A98 7 36.22 27.89 21.55 21.61 26.52 25.78 30.28 18.95 25.3 52.41 39.38 108.48 38.01 A99 8 29.9 40.25 35.84 12.2 25.5 36.18 30.43 37.67 42.6 32.51 67.67 50.67 38.56 A100 14 49.52 40.25 39.31 27.09 32.36 37.56 35.97 45.83 26.56 53.74 47.66 38.58 32.49 A101 15 61.8 38.67 57.64 30.91 23.81 39.22 44.05 27.71 27.48 49.91 65.64 33.98 30.09 A102 7 73.78 30.69 43.61 36.03 28.87 29.98 42.03 25.45 22.08 30.17 119.02 52.4 27.55 A103 14 41.46 38.5 63.95 35.42 22.48 40.06 41.41 44.49 25.7 40.93 58.36 29.78 50.9 A104 6 24.9 92.89 50.87 12.01 23.48 57.48 51.69 23.43 24.36 23.58 104.57 12.66 21.31 A105 7 18.55 35.82 47.1 41.73 20.12 36.93 86.98 81.14 20.7 18.97 42.87 53.37 24.99 A106 4 26.46 51.77 57.12 48.84 22.39 50.64 46.86 47.45 28.42 49.45 34.88 38.29 37.92 A107 7 18.64 29.33 30.27 66.89 37.43 33.28 51.57 79.79 47.19 42.72 48.59 17.23 37.64 A108 18 56.09 47.93 56.12 38.68 31.68 47.21 47.37 46.29 28.42 44.37 61.17 52.66 32.95 A109 27 49.04 48.91 56.08 42.42 34.05 48.83 46.04 49.16 34.08 53.15 39.12 55.68 43.34 A110 10 27.54 59.29 35.92 58.64 39.03 51.73 50.35 39.68 36.18 28.31 103.9 62.42 23.89 A111 5 89.6 46.09 73.22 52.87 52.83 48.63 51.81 53.06 49.82 40.32 36.12 41.33 44.33 A112 3 123.11 43.63 60.69 34.22 39.9 43.7 56.55 39.91 69.0 55.6 52.07 48.28 51.13 A113 11 21.8 31.48 100.42 46.01 29.67 31.34 95.89 30.77 34.43 32.38 73.93 66.24 33.05 A114 6 30.99 61.65 78.38 30.02 15.14 61.12 31.04 28.26 9.29 113.62 43.92 86.2 50.5 A115 7 37.11 46.55 21.49 57.58 26.91 50.0 43.9 47.88 84.71 44.29 56.72 67.75 85.98 A116 3 43.81 57.47 64.37 65.99 49.15 75.72 40.01 61.98 26.19 54.95 55.93 49.21 51.94 A117 10 85.73 73.54 52.86 47.59 46.66 73.09 49.05 40.84 54.17 71.44 36.59 50.03 61.92

83 A118 4 20.74 28.69 45.34 66.62 27.98 31.63 103.08 44.13 29.96 34.49 138.57 77.33 34.56 A119 30 34.21 50.27 54.83 61.84 43.67 53.1 49.8 52.81 33.75 53.39 67.88 110.36 35.2 A120 44 28.44 43.39 69.84 69.62 40.38 44.83 112.95 43.49 52.54 39.7 57.85 52.85 40.88 A121 12 18.83 42.15 59.83 39.24 48.46 43.68 74.58 57.92 54.64 53.32 53.29 74.03 68.3 A122 13 23.57 68.7 81.67 61.09 55.66 68.82 100.03 52.99 56.07 60.51 55.8 61.03 40.41 A123 5 52.39 67.36 48.16 48.86 49.53 67.98 38.75 52.86 37.1 82.58 210.58 46.55 62.56 A124 6 22.87 91.99 62.37 47.33 54.82 89.79 71.23 87.55 58.16 96.1 45.14 58.38 97.54 A125 12 38.15 85.74 80.25 63.22 56.58 85.46 79.69 50.95 60.2 94.21 88.69 61.71 72.96 A126 4 47.7 56.69 125.62 55.61 72.76 56.83 113.19 67.14 55.1 74.42 68.27 67.98 104.62 A127 20 43.21 80.93 76.88 85.06 46.91 78.25 77.33 79.0 55.64 70.6 143.27 73.7 55.98 A128 15 26.82 78.97 104.49 60.49 62.99 78.21 102.17 78.39 59.2 99.93 73.7 99.74 69.86 A129 13 26.45 96.84 129.85 63.69 52.03 96.11 120.43 91.39 76.17 89.06 154.57 103.75 74.36 A130 6 30.53 59.92 114.06 73.3 52.4 60.02 172.84 72.26 118.47 72.17 157.35 138.84 169.06 GENRE: CLASSIC ROCK A131 60 14.91 46.65 107.13 50.44 47.23 46.79 78.34 66.51 44.29 46.61 48.71 59.98 45.85 A132 60 36.81 55.16 95.4 71.7 66.76 55.16 91.7 80.95 62.88 73.75 100.26 90.21 55.72 A133 60 18.82 54.81 115.66 86.72 69.81 55.83 100.25 98.59 64.59 46.19 79.51 78.6 57.87 A134 60 15.13 77.58 106.03 77.07 58.52 78.1 90.39 87.42 58.66 73.87 67.57 74.62 64.01 A135 60 15.34 56.03 116.37 80.31 57.97 56.0 155.22 88.3 67.03 58.23 80.38 67.01 58.88 A136 60 23.53 102.24 89.62 91.82 84.1 102.0 70.06 96.71 85.81 99.66 77.29 108.39 65.28 A137 60 32.09 67.09 133.77 102.89 74.04 66.26 106.43 112.28 71.56 68.48 103.58 107.4 67.36 A138 60 31.5 84.14 121.85 79.67 87.11 84.23 101.21 96.94 91.49 84.39 101.93 102.83 95.41 A139 60 23.2 77.44 114.62 80.84 97.8 79.43 127.51 128.69 101.38 79.18 107.01 128.64 105.88 A140 60 19.49 83.58 136.08 93.89 80.07 83.42 117.84 87.76 87.79 77.16 146.2 125.93 120.18 A141 60 21.69 50.36 121.68 85.12 84.1 50.47 104.22 75.61 84.76 48.58 234.59 184.31 140.97 A142 60 10.48 81.27 199.86 157.52 151.11 85.68 179.34 117.7 109.38 85.28 105.64 101.02 76.0 A143 60 20.49 129.08 181.86 120.96 112.19 129.47 181.37 189.41 105.46 136.48 189.62 202.64 110.69 GENRE: DEATH METAL A144 9 39.88 24.63 19.91 25.03 17.92 19.5 21.89 42.06 25.12 27.41 25.07 31.84 23.87 A145 11 40.75 22.37 31.77 46.97 25.21 22.31 23.19 16.53 29.12 21.81 40.96 24.63 26.43 A146 4 30.63 44.85 9.76 21.16 10.48 39.95 12.81 20.14 12.1 24.56 77.89 26.36 38.88 A147 23 36.25 26.17 44.24 30.19 25.72 28.16 36.94 33.03 27.16 27.73 32.08 40.2 25.22 A148 14 40.33 30.13 39.62 37.99 22.15 29.99 46.74 39.37 23.34 35.06 31.07 39.45 33.55 A149 13 35.7 31.66 43.05 37.02 23.12 30.33 34.45 32.38 23.73 29.37 44.1 46.78 38.44 A150 12 40.43 30.44 31.24 44.62 32.69 34.52 36.12 50.66 28.1 41.34 49.94 42.97 29.38 A151 13 31.3 31.99 62.07 42.16 26.21 33.62 62.48 29.68 31.89 34.95 34.49 41.13 26.9 A152 14 45.34 38.5 47.08 40.33 39.37 36.97 46.86 46.61 31.68 30.61 44.23 37.25 25.8 A153 16 33.38 43.5 45.77 44.06 27.1 41.25 29.68 29.01 26.53 45.81 41.61 69.2 37.13 A154 5 22.59 32.89 22.1 71.13 16.75 29.25 51.03 106.1 22.98 35.16 35.17 31.48 31.29 A155 6 38.61 33.05 28.94 51.95 33.48 39.16 35.3 46.88 43.08 38.55 28.67 66.72 40.84 A156 5 23.54 47.41 36.74 20.0 39.21 46.27 40.46 46.13 42.17 46.12 27.8 60.76 34.82 A157 28 32.17 37.51 50.94 48.03 38.99 36.99 50.67 28.41 32.68 36.62 59.55 40.72 46.01 A158 13 28.85 29.97 24.26 52.51 27.75 30.73 34.26 52.81 
31.04 35.19 117.04 37.23 35.43 A159 22 55.05 35.75 46.24 38.75 40.56 35.38 37.02 42.16 35.36 45.51 61.48 61.47 51.15 A160 17 35.5 38.9 52.63 48.88 32.88 39.3 36.6 46.33 34.95 40.59 43.34 82.58 40.8 A161 6 33.2 37.44 30.27 29.85 31.48 37.42 48.59 65.39 47.69 38.29 118.23 31.63 44.85 A162 6 39.04 24.43 24.71 36.66 22.54 26.33 38.97 61.7 30.62 54.41 148.43 43.42 51.63 A163 12 32.38 44.51 63.94 67.66 41.32 44.26 49.19 47.54 33.24 42.1 42.93 47.58 46.39 A164 8 31.68 47.69 37.83 48.4 34.43 45.98 37.44 52.3 38.11 46.41 60.78 84.86 68.63 A165 13 44.8 49.65 69.83 44.8 46.03 50.96 38.74 48.13 46.55 54.58 40.1 54.56 61.54 A166 8 67.19 34.63 104.73 83.1 22.56 31.77 92.54 60.24 36.99 41.53 32.01 34.34 35.65 A167 15 25.56 54.2 58.76 36.51 29.32 56.56 26.82 40.48 39.98 61.83 75.32 98.55 44.89 A168 25 67.29 34.68 84.24 39.47 37.66 42.88 64.7 51.35 52.51 43.53 54.67 73.33 58.46 A169 10 26.31 39.78 45.09 63.22 59.84 34.65 40.68 55.57 46.59 31.96 103.69 66.06 72.65 A170 9 24.88 47.53 39.56 89.94 32.33 46.71 33.59 57.91 22.09 38.92 87.28 105.17 59.95 A171 18 55.97 31.92 71.8 59.21 51.15 34.68 72.14 74.54 41.2 35.9 72.99 86.62 31.52 A172 13 24.53 83.64 50.52 29.89 32.11 63.24 37.02 44.88 34.53 58.51 51.31 131.74 48.65 A173 19 56.32 60.21 53.47 44.41 48.1 59.86 53.75 77.52 47.84 65.45 52.07 68.24 52.54 A174 9 32.41 44.41 38.83 64.83 29.48 45.36 28.02 49.12 36.94 64.31 60.84 165.48 62.37 A175 5 44.64 58.2 51.07 12.05 60.57 51.66 49.14 35.87 51.9 47.3 126.62 124.37 113.64 A176 12 19.98 56.93 74.55 79.77 56.87 62.08 99.42 73.92 88.08 82.61 103.51 66.74 74.28 A177 25 33.19 38.74 44.99 62.93 32.05 41.19 72.15 258.85 150.7 54.24 152.78 135.43 165.54

84 GENRE: DOOM METAL A178 6 27.39 14.32 37.38 29.92 27.74 13.24 26.38 33.79 16.49 15.05 10.11 23.83 14.42 A179 10 22.82 23.38 25.18 42.01 27.46 22.0 39.11 36.57 28.51 19.81 26.16 22.26 19.04 A180 8 19.66 34.05 49.86 16.42 24.92 32.14 43.67 29.0 24.32 24.24 24.85 22.36 23.71 A181 15 35.01 26.45 30.16 35.02 32.35 27.37 29.04 38.37 26.5 26.45 31.94 31.25 20.76 A182 5 27.25 31.55 34.47 34.61 16.58 31.39 33.19 33.94 17.58 34.07 27.37 46.29 19.47 A183 5 41.35 24.05 41.12 34.69 26.79 23.04 37.63 22.25 26.89 21.87 36.49 41.72 24.04 A184 14 57.03 28.26 35.88 22.74 23.91 29.04 27.35 38.24 23.11 29.91 40.86 30.67 34.44 A185 10 39.63 30.56 30.06 27.05 32.6 32.57 29.8 31.25 38.02 31.3 29.05 29.41 30.86 A186 10 49.45 23.14 25.49 24.32 26.07 22.64 26.46 47.92 35.45 31.4 31.12 51.19 27.85 A187 6 27.18 16.11 27.83 31.06 30.91 16.79 22.22 36.5 17.1 17.55 49.04 79.19 30.7 A188 6 22.44 38.78 53.04 29.56 14.07 38.72 66.06 31.3 14.35 22.64 26.12 37.53 20.72 A189 9 41.84 24.78 25.4 48.25 21.96 18.46 31.51 34.2 20.45 26.49 26.8 92.76 24.31 A190 4 21.71 39.82 30.72 18.14 17.7 38.55 22.55 27.48 34.97 59.88 35.35 29.09 47.24 A191 20 24.6 33.19 33.31 39.53 34.27 33.84 37.44 50.24 36.96 30.47 28.35 37.36 27.14 A192 7 23.65 40.7 40.12 31.73 23.04 40.69 35.92 45.27 21.44 39.19 33.05 38.85 36.31 A193 5 35.74 34.31 33.74 36.57 35.82 34.75 36.56 37.7 32.46 34.68 38.23 36.98 34.77 A194 6 28.6 26.92 15.01 21.97 43.25 28.71 12.81 50.2 31.82 27.22 110.53 53.68 20.28 A195 15 30.91 40.83 42.33 40.81 36.63 39.17 38.34 48.07 33.3 35.16 30.68 36.45 32.6 A196 22 36.17 39.5 32.32 39.22 32.63 38.6 38.83 45.11 32.63 36.3 45.12 49.92 32.48 A197 9 37.16 54.56 45.75 44.26 27.78 52.32 33.69 55.57 25.95 30.52 56.47 28.48 24.53 A198 15 37.23 47.53 21.53 40.99 37.85 47.62 27.21 23.57 35.7 46.22 45.8 66.78 39.96 A199 13 34.25 42.99 47.35 51.13 24.17 43.14 50.83 43.21 26.4 38.34 46.75 38.0 30.74 A200 17 48.66 28.95 42.19 42.65 43.06 28.84 44.2 57.19 33.76 29.23 55.93 47.34 34.49 A201 9 34.01 36.56 49.87 43.51 34.46 36.41 42.48 38.97 32.07 48.62 44.37 43.69 37.49 A202 17 30.22 52.83 39.98 35.61 36.03 48.47 56.15 42.33 46.88 38.24 36.01 21.8 34.65 A203 9 39.86 40.11 42.38 38.02 24.35 36.77 40.81 48.85 27.03 36.48 60.17 57.21 44.74 A204 17 37.17 38.9 36.3 41.49 27.07 44.01 32.35 57.79 32.32 42.66 67.39 43.92 42.01 A205 18 30.87 39.78 42.57 50.37 31.08 38.97 66.16 45.72 31.44 40.21 41.48 43.57 40.39 A206 10 22.66 34.09 48.77 58.34 34.91 40.0 77.86 49.3 26.59 36.72 39.4 34.23 36.37 A207 4 26.68 48.1 77.16 21.74 69.43 38.35 34.89 84.1 40.43 50.39 13.73 24.94 30.11 A208 14 23.43 36.35 54.3 46.57 36.69 36.5 64.01 36.59 39.02 41.13 34.22 66.03 47.95 A209 31 47.25 41.07 64.21 38.96 29.41 43.27 63.23 42.48 39.73 40.47 57.47 47.17 39.97 A210 10 26.05 47.46 51.55 41.31 40.79 44.95 44.03 57.2 45.32 52.27 43.53 43.55 41.18 A211 25 21.47 39.98 45.49 53.38 42.5 39.87 55.78 85.27 36.35 41.41 44.53 40.53 31.83 A212 4 34.32 40.35 48.23 63.34 49.47 51.2 44.74 34.56 51.69 47.59 44.83 40.55 44.52 A213 10 28.18 28.53 67.59 93.92 33.1 20.2 59.67 52.24 44.62 22.52 56.2 49.53 35.7 A214 11 24.63 30.01 58.66 38.27 35.66 29.39 76.41 66.96 49.3 37.71 46.8 60.73 42.81 A215 10 32.34 79.67 57.37 24.36 21.59 76.24 35.39 34.71 17.45 78.32 32.68 56.43 59.95 A216 16 49.81 43.88 44.35 31.2 35.74 44.89 43.95 82.46 43.2 58.07 38.31 80.91 39.16 A217 43 43.31 32.38 82.06 71.31 40.75 31.99 70.57 47.22 40.84 31.42 63.04 31.13 44.25 A218 11 34.64 71.1 47.74 48.25 42.06 55.93 56.13 51.84 44.18 47.61 57.33 48.79 47.29 A219 6 94.84 66.72 38.71 54.3 32.4 65.43 37.26 75.79 33.8 58.78 90.47 
53.52 39.16 A220 26 34.81 67.31 36.13 41.59 50.37 67.07 38.44 77.83 49.34 70.74 49.49 73.33 50.17 A221 3 13.98 41.95 47.09 143.92 49.86 56.16 45.56 104.01 58.88 44.06 27.5 25.3 29.25 A222 22 49.56 63.11 64.57 48.32 38.28 54.39 61.75 61.5 41.14 59.69 65.9 77.0 48.67 A223 27 57.48 48.54 58.26 56.92 73.01 48.65 53.05 49.15 67.35 60.57 32.89 87.17 70.5 A224 18 17.78 54.69 65.27 63.65 57.0 57.02 60.06 73.14 50.2 44.15 46.52 95.4 40.31 A225 17 33.75 68.13 33.28 63.2 57.47 65.47 30.93 71.33 55.46 69.24 50.19 74.82 70.86 A226 10 80.82 49.1 72.23 74.64 56.24 50.63 84.75 43.01 56.55 52.36 70.99 62.56 42.81 A227 6 16.0 128.1 57.67 52.92 44.15 115.67 44.55 80.06 43.24 38.7 25.77 70.01 31.98 A228 27 32.18 66.97 82.62 64.88 38.95 66.76 57.47 79.62 37.69 67.83 57.33 65.86 56.97 A229 11 42.81 79.84 53.2 52.57 32.69 77.99 47.49 69.01 33.64 53.69 116.89 103.24 35.93 A230 4 17.98 77.77 86.55 19.77 37.78 37.79 129.38 67.86 42.48 63.99 100.96 70.29 64.48 A231 8 14.96 54.21 72.59 78.01 63.88 56.01 75.86 88.49 70.6 44.83 55.02 99.48 47.81 A232 7 34.83 32.45 98.62 52.36 29.48 28.11 92.49 39.07 51.59 80.49 138.25 113.93 76.98 A233 60 82.99 40.29 140.97 76.52 46.86 42.74 134.11 57.44 47.83 45.27 98.12 65.49 46.58 A234 10 24.74 54.16 55.87 88.78 54.29 51.11 48.14 107.39 63.47 75.02 99.46 63.4 88.54 A235 10 33.52 65.73 168.13 94.87 78.19 56.23 133.58 70.18 66.82 51.99 57.73 107.68 44.29 A236 3 12.34 71.53 62.82 135.46 84.87 71.52 44.35 58.85 99.17 161.55 70.96 74.9 111.79 A237 29 37.58 60.79 94.85 133.25 56.91 61.9 93.0 80.57 46.47 73.82 138.53 94.19 143.93 A238 9 22.93 56.28 117.34 125.54 64.97 60.43 107.05 108.91 86.72 68.21 135.41 118.26 91.67

A239 3 15.29 410.63 82.23 104.99 66.9 399.65 88.12 212.13 69.15 62.29 63.81 55.12 60.62
A240 5 35.67 152.61 267.16 174.57 70.45 153.06 243.02 208.89 140.96 270.46 794.0 777.09 450.2
A241 6 6.58 253.21 113.39 239.34 271.78 241.82 214.7 285.18 217.23 264.37 1192.8 617.8 124.88
GENRE: ELECTRONIC
A242 5 19.23 21.7 61.63 46.16 21.48 21.71 19.34 46.64 20.41 19.68 16.94 6.11 17.95
A243 26 34.29 28.47 35.77 40.38 31.98 28.26 32.92 43.57 32.33 37.97 41.88 34.45 30.45
A244 4 12.98 21.6 41.98 70.29 36.64 21.6 52.67 33.05 27.66 29.67 40.62 37.24 29.74
A245 60 21.36 34.32 55.23 37.43 31.27 34.75 49.29 38.35 32.08 36.77 37.81 33.46 32.44
A246 19 17.78 34.51 58.17 49.24 35.27 35.78 47.61 41.92 35.0 35.6 46.64 47.87 31.95
A247 60 26.69 40.44 51.08 48.12 29.62 40.63 50.28 60.44 32.78 42.41 48.18 47.82 39.08
A248 15 25.98 34.46 43.79 52.53 27.71 34.66 53.22 52.52 32.81 43.04 54.31 66.83 37.35
A249 13 17.89 32.19 74.1 36.61 35.87 41.26 51.75 65.27 43.67 39.27 42.87 58.47 29.64
A250 7 31.27 26.74 55.49 74.64 25.4 31.32 48.61 65.79 22.78 46.38 60.7 64.89 29.09
A251 8 22.29 29.59 94.31 43.0 18.79 28.24 55.1 53.11 24.98 28.87 42.25 109.04 27.09
A252 6 15.86 29.17 64.18 24.1 43.09 29.21 144.82 45.19 30.83 29.99 33.79 27.61 54.83
A253 10 21.63 40.21 33.46 58.94 28.74 40.4 37.77 64.32 31.98 58.48 69.66 111.0 33.75
A254 5 39.68 58.94 93.57 36.99 25.75 58.95 68.62 49.25 32.36 57.64 48.67 45.79 35.24
A255 6 26.01 32.04 69.43 87.45 33.85 32.03 54.34 112.39 37.11 32.49 44.23 46.34 34.61
A256 11 43.98 61.68 53.16 67.25 29.66 63.03 54.8 70.28 23.95 51.57 83.7 48.73 23.63
A257 8 24.31 25.93 55.56 33.66 35.1 25.86 65.66 55.82 41.26 33.37 105.86 104.85 52.99
A258 3 33.27 44.35 54.64 37.92 34.78 44.35 46.67 39.92 38.98 52.2 93.54 96.34 79.51
A259 14 44.15 37.06 45.86 47.34 37.37 35.7 51.81 53.34 37.81 27.13 161.24 93.44 50.23
A260 37 30.01 47.47 84.5 64.91 55.38 46.43 70.56 58.38 66.2 44.79 45.33 79.64 37.85
A261 60 14.02 49.51 87.77 86.26 46.06 49.27 75.74 76.8 47.26 50.34 84.34 60.95 45.71
A262 60 26.9 47.65 88.08 68.28 49.74 50.35 101.73 58.58 59.57 52.34 76.34 93.11 55.87
A263 29 52.32 47.7 97.54 101.09 56.58 52.0 100.61 76.94 55.77 47.2 85.58 43.88 53.26
A264 29 23.8 52.26 62.39 61.62 53.79 52.19 62.88 90.03 60.94 50.59 107.88 103.09 64.71
A265 20 20.8 60.65 140.87 67.95 51.32 60.46 143.1 72.87 56.64 57.14 53.51 56.37 59.25
A266 19 35.08 37.48 117.95 62.01 123.19 37.4 103.88 62.86 100.93 60.29 116.28 86.0 55.98
A267 19 14.87 70.71 115.58 100.25 62.8 71.09 103.36 87.38 59.23 71.74 74.55 100.53 54.22
A268 4 13.97 69.66 75.29 172.39 86.32 69.69 99.25 128.44 64.88 40.98 68.1 35.66 67.88
A269 37 9.9 77.56 119.63 105.54 71.04 78.38 109.18 97.08 75.49 72.1 71.23 56.87 55.28
A270 7 21.45 79.79 59.21 79.67 56.31 79.82 85.08 94.27 86.31 91.18 192.09 25.36 66.41
A271 8 23.83 39.76 87.25 43.36 29.93 39.7 69.02 49.86 35.39 44.04 75.68 503.04 88.0
A272 3 37.65 30.87 28.1 48.09 45.65 30.86 26.14 56.31 38.72 54.0 97.39 612.92 54.38
A273 7 8.17 46.38 168.41 62.69 40.01 46.25 144.7 105.89 79.86 80.3 42.26 164.17 195.23
A274 60 36.8 116.69 96.78 76.3 99.1 121.72 100.17 95.2 105.86 122.03 109.25 103.62 127.47
A275 32 14.18 52.14 112.38 142.32 72.46 54.44 110.64 143.87 62.23 52.58 246.94 72.35 205.56
A276 6 58.18 68.07 133.49 50.29 39.88 64.3 158.62 39.25 24.51 36.7 586.49 111.89 28.93
A277 7 14.19 64.42 152.26 56.49 74.58 64.47 158.51 151.61 75.89 79.58 324.97 96.6 105.12
A278 8 22.7 94.11 114.27 209.75 102.64 93.89 138.5 212.26 106.02 91.21 117.14 122.09 148.13
A279 60 13.89 109.67 184.51 167.65 143.13 108.82 190.32 190.44 142.17 103.71 155.41 112.95 116.23
A280 4 24.34 44.4 171.98 70.74 21.31 44.4 183.65 47.27 80.74 147.1 1084.8 58.04 76.94
A281 14 3.77 224.1 174.68 251.09 143.49 220.41 200.72 385.58 199.38 146.04 305.33 129.64 162.05
GENRE: HEAVY METAL
A282 3 24.12 30.17 15.46 64.82 9.7 29.81 24.81 92.12 32.67 17.52 10.47 10.12 9.01
A283 34 51.81 24.39 33.45 31.58 26.43 25.18 32.99 22.54 17.56 25.45 45.47 31.16 31.73
A284 17 41.68 26.66 39.51 32.05 16.3 24.84 37.66 41.44 18.49 33.73 55.07 26.27 24.95
A285 17 36.44 20.59 48.14 55.0 31.7 20.68 40.51 39.26 31.93 22.65 35.26 28.39 26.98
A286 3 34.08 24.39 63.33 69.09 9.9 24.06 57.94 44.81 24.11 30.89 17.29 16.43 32.45
A287 10 54.08 38.93 32.18 40.27 44.85 40.24 37.48 41.03 37.75 39.28 33.07 36.62 35.66
A288 7 61.99 40.99 38.05 33.49 44.99 32.91 37.36 35.34 40.22 42.53 70.88 39.31 30.34
A289 28 74.13 32.95 45.9 44.02 37.46 34.17 44.51 52.83 34.76 36.59 61.93 50.66 34.9
A290 27 34.27 30.27 43.31 58.96 41.5 29.93 53.84 46.83 34.91 30.59 56.4 64.5 33.12
A291 7 36.84 35.55 29.68 42.6 41.04 32.68 55.32 64.9 38.84 40.74 53.0 54.49 43.57
A292 22 60.64 40.0 53.82 44.82 40.76 41.08 62.66 56.15 26.93 40.9 55.11 43.76 39.27
A293 10 26.82 46.16 46.81 39.63 55.96 43.25 35.62 50.27 39.82 31.49 43.94 68.81 46.91
A294 11 37.5 41.29 49.37 38.74 52.17 41.11 37.21 60.85 47.65 44.79 59.81 51.54 41.86
A295 16 37.5 44.0 49.29 51.84 38.68 45.33 51.65 47.79 42.13 41.12 60.25 50.54 47.07
A296 12 46.64 40.16 66.96 42.48 21.08 41.0 75.86 53.42 24.29 43.8 49.29 76.87 37.08
A297 60 29.2 38.71 59.29 57.35 39.49 39.37 56.68 58.28 40.14 38.45 50.84 54.65 42.07
A298 12 30.99 33.48 37.9 59.52 30.85 31.85 43.76 64.92 31.63 33.2 68.82 132.97 41.23

A299 12 47.08 55.49 34.2 41.84 27.79 54.87 42.24 87.81 47.21 48.36 65.09 54.36 63.46
A300 29 33.39 45.54 78.51 50.82 40.82 45.32 69.86 52.91 37.11 41.06 63.5 45.73 53.83
A301 9 26.78 79.29 53.38 24.33 33.21 78.28 38.04 93.96 47.38 68.08 66.18 31.41 41.04
A302 6 28.85 79.49 53.75 79.41 47.96 79.61 51.23 77.63 46.8 38.76 35.92 41.5 31.58
A303 16 18.9 56.28 51.5 83.47 47.29 56.25 36.64 45.75 46.97 45.88 58.57 94.31 48.93
A304 9 42.28 29.3 63.98 63.83 56.23 31.23 58.91 64.97 51.63 39.54 49.86 183.12 50.16
A305 16 47.11 64.08 108.5 65.95 34.81 63.44 110.37 83.85 40.89 66.36 44.54 39.98 49.36
A306 5 20.71 37.83 87.89 40.97 24.13 39.13 56.59 58.77 45.09 45.08 196.59 94.5 63.25
A307 24 56.74 60.77 63.06 79.46 70.61 61.56 65.88 61.66 59.71 71.69 98.65 47.07 62.62
A308 60 21.0 59.09 104.56 108.58 58.05 57.86 119.88 81.18 56.27 52.13 102.41 62.64 58.88
A309 59 20.44 52.86 135.14 104.96 49.55 52.56 127.28 83.92 52.28 51.88 97.34 68.27 58.81
A310 60 21.86 46.12 107.29 117.13 58.62 47.78 107.35 77.97 71.26 50.3 134.74 124.96 71.79
A311 60 13.03 73.98 131.91 88.16 82.61 72.85 106.06 106.99 81.5 73.22 91.29 90.33 84.94
A312 14 19.94 72.56 78.97 50.9 61.05 72.56 105.6 105.38 69.4 102.34 61.25 338.92 77.47
A313 60 38.9 124.34 108.64 101.13 122.24 125.71 95.8 79.98 111.87 127.52 88.94 142.14 122.38
A314 47 11.46 76.53 142.89 134.59 165.57 69.8 158.57 149.7 154.61 50.46 184.33 78.58 130.39
A315 3 23.27 279.81 115.02 103.13 89.79 279.18 118.61 139.67 86.61 137.56 119.41 114.32 103.55
A316 7 16.53 310.69 317.46 264.72 226.12 311.03 226.29 109.42 397.67 302.7 987.4 121.18 594.05
GENRE: PUNK ROCK
A317 18 36.09 43.26 38.83 35.22 27.46 42.49 41.93 45.0 34.9 50.39 45.15 45.6 29.0
A318 11 44.08 30.11 63.26 59.53 27.76 29.56 53.04 44.07 29.38 32.74 64.92 41.26 28.23
A319 38 23.97 39.86 37.7 45.44 25.7 39.46 48.22 43.99 19.15 41.07 82.72 64.7 37.94
A320 5 36.23 24.52 41.39 42.31 26.55 24.06 63.9 94.56 25.64 27.22 115.66 31.76 24.28
A321 17 36.63 53.06 50.31 40.83 35.15 55.96 40.14 37.71 35.58 41.7 55.76 60.03 43.36
A322 9 17.43 42.52 47.23 60.82 33.37 41.28 53.09 49.89 31.55 29.81 54.0 98.37 33.27
A323 60 23.92 40.06 44.24 49.17 44.4 41.58 45.51 50.32 46.3 40.59 46.34 86.39 42.95
A324 38 27.25 47.45 61.24 57.43 44.0 47.98 63.06 57.11 42.5 44.46 56.08 55.81 39.57
A325 35 26.09 36.57 85.67 61.85 38.3 36.92 129.17 50.12 41.08 38.38 45.97 40.68 41.69
A326 17 18.55 74.38 52.93 75.74 31.84 72.98 41.76 58.26 32.46 54.89 66.4 45.79 40.61
A327 8 15.4 66.36 60.49 58.69 70.89 64.06 31.11 67.27 57.08 53.72 52.44 26.63 40.1
A328 16 33.77 40.55 39.11 72.28 57.9 37.66 76.52 64.38 60.35 44.5 64.07 44.03 54.21
A329 40 27.86 50.42 95.73 66.92 46.48 48.95 59.99 63.67 41.38 48.91 70.35 48.93 43.1
A330 18 41.79 38.75 92.18 42.54 45.72 39.86 97.82 47.3 46.01 58.09 70.88 67.69 51.42
A331 26 27.16 60.24 77.93 62.91 41.62 63.1 80.89 90.15 44.78 43.03 52.81 42.96 38.2
A332 25 22.11 46.75 70.89 59.41 52.28 46.92 83.58 58.02 50.09 48.76 73.25 71.78 46.48
A333 60 37.04 54.13 73.53 64.81 43.94 56.67 75.66 59.39 47.8 57.67 61.55 61.21 56.33
A334 38 25.04 51.52 70.86 72.0 45.67 52.71 70.35 67.95 46.15 56.74 63.03 66.54 52.68
A335 5 9.09 45.66 51.72 31.42 40.95 45.64 119.14 77.26 27.47 37.62 95.65 146.88 32.18
A336 7 11.86 98.03 61.21 59.47 66.37 96.39 93.7 54.4 56.18 65.45 44.11 43.1 31.25
A337 28 15.54 45.26 85.81 89.25 58.31 43.87 86.45 87.91 61.28 29.92 89.16 65.33 42.65
A338 15 42.31 72.07 57.2 94.7 38.69 71.87 65.1 91.8 45.88 64.78 57.63 68.48 57.86
A339 14 49.58 40.42 99.58 83.17 51.96 36.73 102.97 106.24 51.76 38.56 84.2 55.68 43.11
A340 8 24.04 63.17 45.35 93.35 30.17 63.88 58.54 107.32 27.09 57.58 70.59 153.75 38.75
A341 13 19.77 77.03 90.25 61.33 68.9 71.17 104.2 77.04 53.07 59.66 46.89 54.87 53.38
A342 60 40.7 70.9 93.24 73.62 65.94 71.24 85.01 63.35 60.46 65.42 86.11 59.2 48.69
A343 13 40.98 49.79 126.56 47.77 22.93 51.47 109.61 56.4 42.12 84.16 62.41 176.36 56.43
A344 13 24.35 86.16 72.51 67.55 59.11 83.27 74.42 42.09 59.74 85.06 73.53 117.86 65.47
A345 32 54.6 71.03 117.25 80.8 57.76 68.62 115.48 65.65 70.21 64.33 86.84 78.08 65.74
A346 12 30.87 75.18 82.54 63.12 54.31 75.31 100.48 45.08 52.92 77.11 164.64 92.64 72.16
A347 43 20.12 74.39 67.59 77.97 61.22 73.02 95.11 98.65 66.87 81.83 67.81 191.1 49.48
A348 60 32.5 68.71 91.57 72.11 62.66 71.66 94.32 87.19 80.9 71.49 102.37 133.53 99.41
A349 6 18.88 184.21 85.88 130.27 43.12 184.06 65.31 145.46 33.65 65.27 100.77 17.83 52.82
A350 25 18.34 80.78 98.08 103.46 77.38 78.31 107.17 144.76 83.51 75.97 68.37 125.76 77.17
A351 18 14.87 111.42 95.97 110.79 102.94 110.61 87.43 106.97 98.29 111.0 89.42 76.92 95.64
A352 32 39.75 67.71 131.14 113.72 87.69 70.69 140.69 108.93 117.42 80.46 113.2 125.94 116.36
A353 32 12.61 102.95 95.06 103.8 88.58 100.72 117.11 138.16 92.53 101.57 144.97 143.58 99.66
A354 41 16.84 67.09 170.7 130.15 69.77 63.07 167.95 145.67 60.98 62.5 240.2 90.23 62.55
A355 34 19.63 80.79 183.73 109.32 90.25 90.58 166.3 146.59 97.5 83.55 132.8 101.4 86.83
A356 4 10.1 88.17 112.78 178.78 95.17 85.45 170.59 122.54 103.33 93.92 242.71 212.82 100.42
A357 60 26.94 93.12 161.84 137.42 104.66 99.43 163.85 192.9 114.42 100.08 175.24 228.81 106.01
A358 50 23.94 144.41 104.86 132.33 149.8 142.88 110.84 153.71 144.55 141.21 161.64 159.92 155.19
A359 20 49.69 223.74 147.25 130.59 117.55 222.56 139.86 220.73 135.97 215.64 192.24 251.89 123.94

GENRE: STONER ROCK
A360 3 22.04 7.91 62.84 19.57 26.12 7.91 55.13 36.35 16.63 10.16 10.76 34.13 11.04
A361 15 29.44 29.39 17.06 26.18 24.19 30.05 19.15 37.29 23.86 29.14 25.69 37.76 23.76
A362 3 73.94 22.85 36.01 41.42 21.41 32.06 34.56 36.27 19.96 33.46 36.24 22.07 29.09
A363 6 39.47 27.21 28.02 42.9 31.5 31.11 39.03 58.29 20.63 38.48 24.61 33.1 31.48
A364 5 29.01 42.89 26.73 57.01 18.85 42.88 23.19 22.09 22.06 42.88 54.38 41.32 25.37
A365 18 29.19 26.59 50.7 46.62 24.08 28.02 40.31 50.54 27.28 25.75 42.1 34.76 30.3
A366 8 66.34 34.55 47.28 42.35 37.85 39.51 36.78 39.3 28.62 26.99 47.27 25.22 34.08
A367 14 45.29 30.54 38.34 38.43 33.26 30.08 36.55 51.72 37.35 36.25 44.59 46.64 44.56
A368 7 21.64 44.31 47.54 37.02 24.21 41.67 35.82 61.4 22.55 41.59 31.13 40.16 42.88
A369 26 24.47 45.2 48.24 40.92 40.23 41.16 54.61 40.75 25.88 41.13 32.02 43.71 24.49
A370 14 32.62 41.16 36.59 44.04 38.3 42.34 47.83 44.23 36.3 43.59 50.33 47.68 35.48
A371 12 40.97 45.03 60.8 49.2 27.95 44.86 60.31 39.39 33.44 39.58 38.83 40.66 31.58
A372 25 26.88 40.72 64.27 67.93 31.41 36.5 62.79 52.67 32.99 36.83 64.99 39.34 33.38
A373 24 26.78 58.71 40.83 39.91 39.48 54.37 48.27 59.46 44.5 54.04 55.52 50.66 36.23
A374 8 25.27 46.58 61.22 52.6 42.6 52.43 54.61 57.61 42.02 48.07 85.64 55.8 43.34
A375 3 20.38 60.28 35.02 53.05 15.7 60.46 23.9 142.8 25.13 29.86 90.73 46.54 70.04
A376 21 35.52 52.5 56.82 53.04 44.24 51.39 53.34 60.37 36.85 47.97 51.22 104.01 45.02
A377 6 18.5 48.77 46.16 26.02 33.55 51.5 32.19 61.2 41.23 79.56 69.94 123.87 63.76
A378 5 35.97 78.48 64.54 70.24 57.99 79.94 51.8 149.74 78.44 128.53 147.22 43.8 95.2
A379 6 23.69 59.66 186.77 74.86 32.64 57.98 169.56 80.4 46.41 41.27 95.29 84.81 150.0
A380 4 38.57 74.68 79.14 108.47 152.16 74.74 104.78 43.31 150.67 84.64 120.64 55.31 80.79
A381 9 24.1 127.18 144.0 86.11 55.49 126.94 91.64 179.15 61.71 114.16 61.41 39.71 88.88
A382 10 39.93 59.78 177.24 117.96 73.22 60.4 183.41 79.0 92.37 62.64 162.44 74.23 89.58
A383 5 31.14 71.86 148.58 124.86 48.66 77.78 100.12 80.13 72.73 216.1 84.87 65.14 211.49
A384 4 13.76 140.93 38.85 113.89 79.39 140.88 35.98 65.84 101.81 170.06 226.26 68.22 124.49
A385 6 17.94 155.49 323.37 221.4 179.13 164.28 342.01 166.66 232.97 201.21 267.74 134.58 245.54
GENRE: THRASH METAL
A386 17 20.65 24.28 56.24 35.01 25.35 23.77 44.67 34.91 24.56 23.6 27.36 29.43 23.82
A387 26 24.0 35.5 37.2 27.67 27.62 36.73 33.89 22.64 31.84 37.15 26.71 35.01 28.85
A388 3 30.25 26.72 31.77 62.96 25.75 25.98 42.99 27.17 31.2 15.37 47.0 23.9 20.24
A389 21 32.5 32.47 52.9 34.62 23.47 32.87 50.81 35.38 24.59 31.85 38.81 32.38 29.87
A390 16 30.74 36.69 40.51 44.95 23.76 32.1 34.05 36.93 21.97 28.98 65.42 37.26 26.02
A391 11 29.0 36.06 39.28 42.22 30.2 35.45 47.52 50.64 31.22 31.72 26.61 33.12 33.67
A392 14 41.17 37.35 36.9 41.95 22.7 36.43 41.33 62.51 30.47 41.95 27.43 35.42 34.81
A393 6 27.85 42.06 33.29 32.23 19.32 42.05 29.22 35.01 25.38 39.67 53.7 50.85 46.72
A394 60 34.46 37.32 44.09 37.92 32.63 36.56 43.4 44.21 32.71 37.8 44.51 35.79 35.62
A395 23 26.72 37.16 45.66 36.03 33.49 42.03 47.67 46.23 39.55 42.97 58.01 62.44 36.14
A396 6 35.38 42.17 50.57 36.55 37.49 40.03 47.86 76.26 24.05 39.41 87.99 43.04 37.19
A397 60 20.52 44.42 60.57 66.34 46.18 43.68 53.84 51.86 40.84 42.72 45.3 51.3 39.85
A398 23 35.36 45.05 37.96 45.48 43.17 44.72 43.06 57.28 42.61 46.91 51.05 85.78 47.37
A399 26 42.2 31.46 40.52 67.37 33.37 32.62 49.53 91.33 39.57 31.32 43.17 101.07 44.33
A400 13 33.03 45.2 30.27 40.98 24.2 43.95 27.69 23.86 28.93 55.92 60.41 192.13 51.14
A401 4 16.15 75.49 26.12 23.86 49.95 73.46 50.92 97.03 65.65 30.6 66.37 45.51 46.68
A402 25 24.33 36.71 75.04 74.69 34.12 74.31 63.58 44.81 41.36 47.68 54.12 74.59 42.46
A403 51 37.82 53.78 67.91 64.04 36.16 52.08 70.36 55.57 39.52 60.27 46.18 71.28 63.14
A404 4 16.44 45.4 61.31 67.5 48.03 46.2 71.94 71.06 85.44 69.51 39.45 37.74 60.58
A405 15 24.18 50.49 60.09 70.53 41.95 51.17 43.51 57.29 40.44 54.57 99.69 87.26 98.99
A406 34 35.02 60.91 43.6 79.48 68.58 58.9 41.58 81.31 43.73 65.67 74.19 86.54 105.99
A407 10 45.6 38.34 112.68 54.82 67.37 38.71 121.85 86.23 48.54 14.89 134.66 194.88 31.74
A408 11 22.33 176.89 102.82 57.65 73.87 152.09 101.44 73.88 92.86 101.6 62.17 130.77 75.88
A409 16 17.91 103.22 124.55 90.07 83.94 90.67 96.1 121.37 96.63 79.15 210.15 78.94 93.23
A410 10 31.81 96.69 40.15 69.92 42.49 93.95 39.52 115.29 57.55 150.74 317.45 133.65 121.94
A411 28 32.75 122.36 120.18 86.96 78.5 137.69 91.23 98.3 92.42 113.98 137.31 103.35 133.35
A412 51 48.38 132.88 134.37 129.1 108.32 134.41 122.93 167.47 108.74 137.78 134.38 225.8 117.26

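As a minimal illustration, rows of the form "A245 60 21.36 34.32 ..." above can be parsed into a tabular structure for further analysis. This sketch assumes Python with pandas; because the per-column metric names are not repeated in this appendix excerpt, hypothetical placeholder names (v1 to v13) are used, and the function name parse_appendix_rows is an assumption for the example only.

import pandas as pd

def parse_appendix_rows(lines):
    # Parse appendix rows grouped under "GENRE: ..." headings.
    # Each data row is: artist ID, sample count, then 13 numeric values
    # whose meanings are given by the corresponding results tables.
    records, genre = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("GENRE:"):
            genre = line[len("GENRE:"):].strip()
            continue
        parts = line.split()
        values = [float(v) for v in parts[2:]]
        record = {"artist": parts[0], "genre": genre, "samples": int(parts[1])}
        record.update({f"v{i + 1}": v for i, v in enumerate(values)})
        records.append(record)
    return pd.DataFrame(records)

# Example usage with two rows copied from the table above:
rows = [
    "GENRE: ELECTRONIC",
    "A242 5 19.23 21.7 61.63 46.16 21.48 21.71 19.34 46.64 20.41 19.68 16.94 6.11 17.95",
]
print(parse_appendix_rows(rows))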