
Bachelor Degree Project

Price prediction of vinyl records using machine learning algorithms

Author: David Johansson
Supervisor: Tobias Ohlsson
Semester: Spring 2020
Subject: Computer Science

Abstract

Machine learning algorithms have been used for price prediction within several application areas. Examples include real estate, the stock market, tourist accommodation, electricity, art, cryptocurrencies, and fine wine. Common approaches in studies are to evaluate the accuracy of predictions and to compare different algorithms, such as Linear Regression or Neural Networks. There is a thriving global second-hand market for vinyl records, but research on price prediction within the area is very limited. The purpose of this project was to build on existing knowledge of price prediction in general in order to evaluate some aspects of price prediction of vinyl records. That included investigating the achievable level of accuracy and comparing the efficiency of algorithms. A dataset of 37,000 samples of vinyl records was created with data from the Discogs website, and multiple machine learning algorithms were utilized in a controlled experiment. Among the conclusions drawn from the results were that the Random Forest algorithm generally produced the strongest results, that results can vary substantially between different artists or genres, and that a large share of the predictions had a good accuracy level, but that a relatively small number of large errors had a considerable effect on the overall results.

Keywords: price prediction, price estimation, vinyl records, vinyl prices, regression, machine learning, machine learning algorithms, algorithm comparison, dataset, vinyl dataset, k-nearest neighbors, linear regression, neural network, random forest, discogs

Contents

1 Introduction
  1.1 Background
    1.1.1 Vinyl record industry
    1.1.2 Price estimation with machine learning
    1.1.3 Algorithms
    1.1.4 Hedonic modelling
  1.2 Related work
    1.2.1 Real estate
    1.2.2 Collectibles
    1.2.3 Vinyl records
  1.3 Problem formulation
  1.4 Motivation
  1.5 Objectives
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline
2 Method
  2.1 Dataset
    2.1.1 Genres and artists
    2.1.2 Attributes
    2.1.3 Data trimming
    2.1.4 Variable encoding
    2.1.5 Feature scaling
  2.2 Controlled experiment
    2.2.1 Algorithm selection
    2.2.2 Measuring accuracy
    2.2.3 Experiment configuration
  2.3 Reliability and Validity
    2.3.1 Reliability
    2.3.2 Internal validity
    2.3.3 External validity
  2.4 Ethical Considerations

3 Implementation
  3.1 Dataset
    3.1.1 Attribute selection
    3.1.2 Data collection
    3.1.3 Data trimming calibration
  3.2 Controlled experiment
    3.2.1 Hyperparameter calibration
    3.2.2 Dataset class
    3.2.3 Regression class
    3.2.4 Experiment execution
4 Results
  4.1 Dataset
  4.2 Controlled experiment
    4.2.1 Evaluating full dataset
    4.2.2 Evaluating genre data
    4.2.3 Evaluating artist data
5 Analysis
  5.1 Comparison of algorithms
  5.2 Additional contextual data
  5.3 Efficiency of features
  5.4 Accuracy of price estimation
6 Discussion
  6.1 Accuracy of price estimation (Q1)
  6.2 Additional contextual data (Q2)
  6.3 Comparison of algorithms (Q3)
  6.4 Efficiency of features (Q4)
7 Conclusion
  7.1 Future work
References

Appendix A.1
Appendix A.2
Appendix B.1
Appendix B.2
Appendix B.3

1 Introduction

Price estimation using machine learning can be approached in multiple ways. Whether the outcome is considered satisfactory usually depends on several factors, but it is generally of high importance that the data collected and the algorithms implemented are well suited to the problem formulations one intends to answer. In this project, state-of-the-art techniques within the field of price estimation were applied in a context where they have not previously been implemented to any large extent. By merging familiar concepts with some new ideas in a new environment, the aim was to broaden the available knowledge and shed some light on an unexplored application area for machine learning. A dataset representing the vinyl discographies of hundreds of artists, as well as software utilizing several machine learning models, was created for this project. Using those assets, an experiment was conducted with the intent to answer a number of problem formulations related to the subject.

1.1 Background

1.1.1 Vinyl record industry

Within the almost hundred-year-old industry, the usage of physical audio media is fading away in favor of heavy consumption of digital audio files and streaming platforms. The popularity of compact discs has been steadily declining since the turn of the century, with major retail chains like Best Buy shutting down their sales of CDs completely [1]. The vinyl record, on the other hand, looks like it has stood the test of time. It still has a thriving community, presenting retail sales numbers that have been increasing in recent years, with a doubling of sales during 2018 [2]. Vinyl records are often pressed as limited editions, sell out quickly, and can be difficult to get hold of. In other cases, attention can be brought to an item months or even years after it has disappeared from the retail market, making second-hand prices high, or even extremely high compared to the initial price. Other items might not be as sought after, and perhaps would not even recoup their original value if one were to sell a copy. Either way, the second-hand market for vinyl records, recent as well as older releases, is highly active, and prices can vary widely depending on several factors. Many people collect and deal in records, either as a hobby, as a business, or somewhere in between.

1.1.2 Price estimation with machine learning

Machine learning is used to generate information from data [3, pp. 1]. There are different types of machine learning which can be used for different types of tasks. Price prediction is a type of problem that belongs in the category of supervised machine learning, which is used to predict an output value based on some kind of input [3, pp. 25]. In the case of this project, the output is the price of a vinyl record and the input is an array of attribute values describing the vinyl record. A prediction is made using a model (an implementation of an algorithm) that has been trained (or fit) with input/output pairs of data, referred to as training data [3, pp. 25]. The algorithm is the part of the model that does calculations based on the training data and the values of hyperparameters used during implementation. Hyperparameters are settings used to improve the generalization performance of a model [3, pp. 260]. Supervised machine learning problems can be further divided into two major categories - classification and regression. Classification is used to predict a categorical output value from a set list of options 1, while regression is used to predict a continuous number [3, pp. 25]. Thus, regression is the type of machine learning used to predict prices. Regression of prices has been studied in several different areas. Using a generic term like "price regression" in a search engine for academic publications like Google Scholar, one is likely to find results in diverse areas. Examples include real estate, the stock market, tourist accommodation, electricity, art/paintings, cryptocurrencies, and fine wine. While the fundamental concepts of these studies are often similar (using machine learning algorithms to investigate economic aspects of the subject), the course of action, as well as the aims, often differ slightly between the respective areas of application (which will be further elaborated on in Chapter 1.2). Also, some areas are more prevalent, like real estate or the stock market, and are therefore more comprehensively studied. In more obscure areas, there may only be one or a few studies available, meaning there are possible approaches of study which are currently unexplored.

1 E.g. yes, no, or unknown, or any other set of categorical values.

1.1.3 Algorithms

While the algorithms commonly used in price prediction studies will be examined more closely in Chapter 1.2, this chapter gives an overview of generally recognized algorithms. Models highlighted in the educational book Introduction to Machine Learning with Python - A Guide for Data Scientists [3] are k-Nearest Neighbors, Linear Regression, Decision Tree, Ensembles of Decision Trees (such as Random Forest), Support Vector Machine, and Neural Network. A Google search using the phrase which machine learning model to use for regression was performed, and a handful of blogs and online resources [4], [5], [6], [7], [8], [9], [10], [11] were examined. Although slightly varied in their suggestions, the sources were fairly consistent. The previously mentioned algorithms constituted the majority of the recommended models. Among the most common ones were Linear Regression, Neural Network, Random Forest, and Decision Tree, while the less frequent ones included k-Nearest Neighbors and Support Vector Machine.

1.1.4 Hedonic modelling

Hedonic modelling, or hedonic regression, is a method of distinguishing the characteristics of an item that are components of its value. The purpose is to determine the values of the separate features, in order to more accurately calculate an item's value. The characteristics are not in themselves market goods and cannot be sold separately; rather, it is the composition of characteristics that determines something's value [12]. For example, an object's age, size, or condition cannot be purchased separately, but may be important individual factors for the value of the object. Hedonic modelling is about focusing on distinct independent characteristics and how they affect the value.
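To make the idea concrete, the following minimal sketch fits a linear model to a handful of made-up items and reads an estimated price contribution for each characteristic from the fitted coefficients. The feature names and numbers are purely illustrative and are not taken from the dataset used in this project.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical items described by [age in years, limited edition (0/1), colored vinyl (0/1)].
X = np.array([
    [30, 1, 0],
    [5,  0, 0],
    [12, 1, 1],
    [40, 0, 0],
    [8,  1, 1],
])
y = np.array([120.0, 15.0, 45.0, 60.0, 35.0])  # observed second-hand prices (made up)

model = LinearRegression().fit(X, y)
for name, coef in zip(["age", "limited_edition", "colored_vinyl"], model.coef_):
    # Each coefficient approximates the price contribution of one characteristic.
    print(f"{name}: {coef:.2f}")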

1.2 Related work

As described in Chapter 1.1.2, multiple application areas for price prediction with machine learning have been studied. While studying related work for this project, it was decided to investigate one of the most commonly researched areas with the purpose of observing common and modern approaches, even if the type of goods examined is different from those in this project. It was also decided to study research more closely related to the subject of this project, even if it may not be as cutting-edge for price prediction in general as research available in a more actively studied area. Combining knowledge from both these areas was deemed to be an applicable starting point for this project. Based on this reasoning, the areas of real estate, collectibles, and vinyl records were investigated.

1.2.1 Real estate

With real estate arguably being the most researched field in price estimation using machine learning, it can be assumed that the area is representative of the current state of the art of regression techniques. A search was conducted to get an overview of previously published research in the area, to observe the techniques and problem formulations commonly used. Upon performing a literature search 2, eighteen relevant top results were selected and studied. Different kinds of Neural Networks were the subject of evaluation in six of the studies [13], [14], [15], [16], [17], [18], while eleven studies [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] include various types of Linear Regression. The remaining one [30] focuses on k-Nearest Neighbors and Random Forest, making it the only one out of the eighteen not concerned with either Neural Networks or some form of Linear Regression. There are some recurring approaches in the studies, for example comparing the performance of two models 3, evaluating a model with an approach for which no or very little previous research exists 4, or estimating the general performance of real estate price prediction with a particular model 5. Observations include that Neural Network and Linear Regression models are the most common, that Random Forest models outperformed their opponents on all occasions [25], [29], [30], and that regression models can be effective for price estimation of real estate.

2 Search engines: Google Scholar and OneSearch. Terms: Real estate price prediction, real estate regression, house price estimation, house price regression, and house price prediction.
3 E.g. comparing a Support Vector Regression model and a Neural Network model [14], comparing two Neural Network models [15], comparing a Geographically Weighted Regression model and a Spatial Lag model [22], comparing four Linear Regression models with two Ensemble models [24], comparing a Linear Regression model and a Random Forest model [25], [29], comparing a k-Nearest Neighbors model and a Random Forest model [30].
4 E.g. using a Neuro-Fuzzy Neural Network model [17], using visual data features in a real estate dataset [18], complementing Least Squares Linear Regression with quantile regression [20], utilizing a bargaining power feature while using a Linear Regression model [28].
5 E.g. using Neural Network models [13], [16], using a Geographically Weighted Regression model [21], using a Linear Regression model [23], using a kernel-based Geographically and Temporally Weighted Autoregressive (KBGTWAR) model [26].

1.2.2 Collectibles

There is not much academic research available in the area of vinyl record price estimation (see Chapter 1.2.3). To examine studies of relatively high similarity, a search was made for studies on price estimation of other collectibles. After doing a search 6 using terms related to art, paintings, and price estimation in general, eight works were studied. Five of them [31], [32], [33], [34], [35] are about art and paintings, while the remaining three are about French autographs [36], rare U.S. coins [37], and manuscripts [38]. The focus of these studies is largely on investigating the price estimation potential of the different datasets. The algorithms and models used are typically described in mathematical terms, and the approach generally seems grounded in mathematics rather than in computer science. Different variations of Linear Regression are implemented. No comparison between different algorithms is done. Therefore, it is difficult to say whether any specific machine learning models are more effective when it comes to the prices of collectibles. Several findings are made in the different studies. Teti et al. maintain that "even though some variables can affect, to a certain extent, the final price at which a work of art is sold, other subjective, difficult-to-measure variables that are intrinsically linked to the specificity of cultural products are the key drivers of prices" [31, pp. 77]. Multiple studies conclude that the identity of the artist is of high importance to the price [32], [35], [36]. Pradier et al. also conclude in their study of French autographs that their model "provided an estimation of the hedonic price, but this estimation is conditional to the author's specific effect. Therefore, when [evaluating] an autograph of an author, which is not in our database, it should be calibrated first by analogy with similar authors" [36, pp. 472]. That suggests that an item should be put into a specific context before performing hedonic regression. Nahm points out that "[paintings by] a deceased artist are associated with a price that is twenty times higher" [32, pp. 297]. Other observations include the definition of features and determining their hedonic significance for performing regression of the data 7.

6 Search engines: Google Scholar and OneSearch. Terms: Price estimation art, price prediction art, price prediction paintings, art price regression, and price regression.
7 E.g. Nahm concludes that "paintings executed in acrylic, oil, and mixed media command higher prices, with an increased value over all other works of about 170–190%" [32, pp. 293]. Fedderke & Li point out that "consistent with other hedonic pricing models for art markets, the identity of the artist, medium of the artwork, size, a set of dating characteristics, and the genre of the work is of importance to the realized market price." [35, pp. 100]

1.2.3 Vinyl records

After conducting a search 8 using a number of terms related to price estimation of vinyl records, two studies somewhat related to the subject were found. The most relevant one was Pricing the Groove: Hedonic equation estimates for rare vinyl records [39], done at the University of Hagen in 2019. Its primary focus is to study the data of valuable records using a Linear Regression algorithm, i.e. it does not include any comparison of models. Nothing is mentioned about the programming environment used, and the calculations are explained in purely mathematical terms. The dataset was made up of the 30 most expensive items sold monthly at the Discogs marketplace during a nine-year period, totaling about 3100 observations. In other words, it exclusively includes data about the most valuable items sold on the website. Using this data and the regression model, some conclusions were drawn. The popularity of an artist correlates positively with the price [39, pp. 11]. There is generally a high demand for items that are sold at high prices (demand is calculated based on user data of the website). An item's age is not necessarily important for the price. Prices are concluded to be lower for audio recordings released in rarely traded countries, such as African countries. The study also mentions that "future development of the present work requires more exploration of the role of specific artist or genre on price movements through additional variable construction" [39, pp. 14]. The Demand for Vinyl L.P.s 1975-1988: Time Series Estimation of a Product Group in the Presence of Product Differentiation Innovation [40] is a study from 1994. Although it does include examples of regression, it focuses on the demand for vinyl LPs in general rather than on estimating prices for individual items. Considering that the study was close to three decades old at the time this project was done, both the techniques and the data available were arguably outdated. Using a set of independent variables 9 based on data obtained mostly from the British Phonographic Industry [40, pp. 42], some observations were made. Examples include that the 10- to 14-year-old population is a significant determinant of the demand for vinyl singles [40, pp. 61] and that "tune innovation, 25- to 29-year-old population, availability of tunes on compatible pre-recorded software and Christmas expenditure [are] positive and significant determinants of the demand for vinylite albums" [40, pp. 62].

8 Search engines: Google Scholar and OneSearch. Terms: Vinyl price, vinyl price prediction, vinyl price estimation, vinyl price regression, and vinyl records price.
9 Income, price of pre-recorded and blank audio software, tune innovation/imitation, selection of tunes on compatible pre-recorded audio software, price of complements - audio hardware, demography - youth population, durable good factors, and seasonal factors - Christmas presents [40, pp. 44].

1.3 Problem formulation

The foundation of this project was built on the following problem formulation: Can machine learning be used to make efficient price predictions of vinyl records? Four research questions, listed in Table 1.1, were studied with the intent of answering this problem.

# Description
Q1 What level of price prediction accuracy 10 is it possible to achieve using common algorithms and a dataset of vinyl records?
Q2 What are the effects of using additional contextual data while making predictions (e.g. training a model with more artists within the same genre when evaluating an artist's data)?
Q3 Utilizing multiple algorithms, how do they perform in comparison?
Q4 Are the features used in the dataset effective in the hedonic regression of vinyl records?
Table 1.1: Research questions to be answered in this project.

10 The approach used to measure this is described further in Chapter 2.2.

1.4 Motivation

A large amount of research has been done on price estimation in general, verifying that it is an area in which a large scientific interest exists. The vinyl record business is a well-established worldwide phenomenon. However, no general study about price prediction has previously been done in the area. Therefore, there is currently a gap in knowledge which, once properly evaluated, could be of interest primarily for science. Were the concepts of price prediction of vinyl records to be developed into a format accessible to the average person - e.g. some kind of app or website - it could theoretically and eventually (provided it were stable enough to produce fairly accurate results) be of interest for the industry (e.g. manufacturers or retailers) and for society (e.g. consumers or collectors). However, the scientific perspective is the sole focus of this project, and any eventual commercial approach has not been further elaborated on.

1.5 Objectives

# Description
O1 Generate a dataset of vinyl records from Discogs.
O2 Implement machine learning models to be used in the project.
O3 Experiment with the dataset and models to get usable results.
O4 Write the report and present new knowledge acquired on the price estimation of vinyl records.
Table 1.2: Objectives of this project.

1.6 Scope/Limitation

Had there been more time and resources available, several things would have been interesting additions to the project. A specific data feature which would have been of high value for this project is the quantity 11 of each item. However, this information is currently unobtainable, as there are no sources where this information is distinctly available. Gathering that data would either have had to be done manually or by using some kind of software that would have needed to be developed - two options too time-consuming for this project. Since the dataset was generated as a part of this project, a limit for its size had to be set at some point, as deciding which artists and genres to include in the dataset was a time-consuming task. The final set of artists was concluded to be adequate for the purpose, but given more time, it could have been both larger and more elaborate. Making use of more machine learning algorithms could have been beneficial, as it would shed more light on which algorithms are efficient in this particular situation. But since they are time-consuming both to implement and evaluate, a limit had to be set. The four algorithms considered the most suitable for the project were therefore chosen.

11 I.e. total units manufactured. Many vinyl records are pressed in small quantities, increasing the probability of higher resale prices for sought-after items.

1.7 Target group

This report is likely to cater foremost to readers in a scientific context. The intention was to expand on the available knowledge in machine learning, thus making the work target those with an interest in regression models and price estimation. It may be of particular interest to readers inclined to explore hedonic regression of vinyl records or other collectibles, considering their somewhat distinguishing qualities mentioned in Chapter 1.2.2.

1.8 Outline

The rest of the report includes six more chapters and some appendices. Method describes the approach used to produce the results that answer the problem formulation. The method has two main focuses: the dataset created specifically for this project, and a controlled experiment using machine learning algorithms. Implementation is divided similarly to the previous chapter, but gives a more technically detailed account of how the dataset was created and how the experiment was realized. Results shows the details of the finalized dataset and the result data generated in the machine learning experiment. In the Analysis chapter, the experiment results are examined to see which relevant observations can be made from them. In Discussion, it is evaluated whether the observations of the results can be used to answer the problem formulations. The results are also put in relation to those of earlier, related work. Conclusion summarizes the findings of this project, evaluates their relevance, and suggests some aspects of the subject which could be researched further. Appendix A.1-A.2 shows relevant scripts from the implemented software, and Appendix B.1-B.3 shows some result tables generated in the machine learning experiment.

2 Method

Using a dataset of vinyl records that will be created for this project, and the implementation of multiple machine learning algorithms, a controlled experiment will be conducted. The purpose is to answer the problem formulations in Table 1.1 using results produced in the experiment. This chapter explains the method in further detail.

2.1 Dataset

A comprehensive dataset is needed to be able to perform the experiment. As a detailed dataset of vinyl records does not currently exist, one must be created specifically for this project. Two questions need to be answered to build the dataset. What should be sampled in the dataset in terms of genres, artists, and their total production? Which attributes should the dataset include?

2.1.1 Genres and artists

When working with the price estimation of collectibles, it has been stated that factors such as the identity of the artist are of high importance (see Chapters 1.2.2 and 1.2.3). For an artist to be legitimately represented in the dataset, it has been decided that all of the artist's releases will be included, with minor limitations. The rules for the scope are the same for every artist: all releases in the Albums, Singles & EP's, and Compilations sections of each artist's Discogs profile will be included in the dataset. Those sections represent the essence of an artist's discography. However, there are additional entries associated with an artist, such as appearances on compilations of multiple artists or unofficial releases, none of which are relevant for this dataset 12. It can be assumed that economic patterns vary between different cultural contexts within the spectrum of vinyl releases. Therefore, the intention is to create a rather uniform dataset, i.e. restricted within a certain cultural scope, but also characterized by diversity so that several subdivisions can eventually be observed in the context of the full span. Based on that foundation, it has been decided to use the relatively broad definition of rock/metal as the total scope and to use some well-defined sub-genres as segments within it. Based on common categorization conventions within rock and metal music, a group of subdivisions (hereafter referred to strictly as genres) was decided on: alternative metal 13, alternative rock 14, black metal, classic rock, death metal, doom metal 15, electronic 16, heavy metal, punk rock 17, stoner rock 18, and thrash metal. A selection of artists will be chosen to represent the data for each genre. This will be done by looking up some of the most quintessential artists for each genre 19, and then expanding the roster by adding artists found by researching similar artists on online resources 20.

12 Compilations of multiple artists have no purpose when evaluating discographies of individual artists. Unofficial releases, i.e. bootlegs, pirate releases, and counterfeits, were not the focus of this study. Also, there is no financial data available for them at Discogs.
13 Also including examples of nu-metal, industrial metal, and similar styles.
14 Also including examples of grunge, indie rock, and similar styles.
15 Also including examples of sludge metal, post-metal, and similar styles.
16 Although not fundamentally a part of the rock genre, this selection focuses on styles with a strong association to it, such as 70's electronic music.
17 Also including examples of garage rock and similar styles.
18 Also including examples of psychedelic rock, rock, and similar styles.
19 Wikipedia pages for the genres list artists which are of high importance for the genre.
20 Both Spotify and Last.fm provide a register of similar artists for each artist based on users' listening activity. It was verified that the styles of similar artists did in fact match.

2.1.2 Attributes

Discogs is the most comprehensive online resource for music releases. It serves both as a source of information about items and as a marketplace where users can buy and sell any of the items in the database. It is structured such that for every specific issue of every title by every artist, there is a dedicated page displaying the item's known details as well as the median price of copies previously sold at the site. Given its broad information scope and convenient structure, it has been chosen as the source of data for the project. What contributes to the value of a record is generally a combination of many different factors. Many of those factors are described on a website specializing in the value of second-hand records [41]. This resource will be used to create a list of relevant variables. It will then be determined which of those variables can be retrieved from Discogs, whether any additional variables can be obtained from Discogs, and whether any hedonic features can be independently constructed using data available at Discogs. The median price for each item, which will be the dependent variable of the dataset, will also be retrieved from Discogs. It will be obtained in USD currency format. During implementation, each variable's effect on the performance will be tested, and in case a variable shows a generally negative effect on the experiment results, it will be excluded from the final dataset.

2.1.3 Data trimming

After all the data has been collected, some tests will be done to determine whether the dataset should be trimmed to improve performance. It will be checked whether artists with smaller amounts of samples have a significant impact on the results. Using machine learning models trained with the dataset and an iteration through different values of n, ranging from the smallest number of samples for an artist up to 200, the performance will be tested while excluding artists with fewer than n samples. The results will be examined to decide whether a required minimum sample count for artists should be used. Similarly, a test will be done to examine the effect of having an upper limit on the number of samples from one artist, considering that there are likely to be big differences in sample counts between artists. There will be an iteration over values of n (a range between 50 and 3000) where each artist's samples will be trimmed above n samples per artist, to find out where (or whether) to set the limit to get the best results. A test will also be conducted to inspect the effect of outliers in the form of samples with significantly higher dependent variables than the majority of the samples. The purpose is to find out whether relatively few samples have a large negative effect on the overall performance. This will be done similarly to the tests described above. Values of n will be tested iteratively, where n is the maximum limit for samples to be included in the dataset, and the performance of models will be measured to find the ideal value of n.

2.1.4 Variable encoding

All variables of the vinyl dataset will need to be numerical. That is a technical requirement for being processed by the machine learning algorithms which will be used in this project. Therefore, any categorical variables will need to be encoded as numerical data. This will be done differently depending on the nature of the variable. If a variable can only be either false or true, it is solved by encoding those values as 0 and 1. Features with more than two options cannot simply be encoded as other numbers unless there exists a natural ordering between the options. In those cases, dummy variables will be created, which means that for every possible value of the variable, a new variable is created, so that a 0 or 1 value can be applied to the specific values.
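As an illustration of the encoding described above, the following sketch (with hypothetical column names) converts a true/false variable to 0/1 and expands a categorical variable without natural ordering into dummy variables using Pandas.

import pandas as pd

df = pd.DataFrame({
    "limited_edition": [True, False, True],                  # binary feature
    "area": ["North America", "Northern Europe", "Asia"],    # categorical, no natural ordering
    "usd": [25.0, 8.5, 40.0],                                # dependent variable
})

# Binary variables become 0/1.
df["limited_edition"] = df["limited_edition"].astype(int)

# Categorical variables with more than two values become dummy variables, one column per value.
df = pd.get_dummies(df, columns=["area"])
print(df.head())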

2.1.5 Feature scaling

Min-max normalization will be applied to the data, rescaling all features to a [0, 1] range. This will be done because machine learning algorithms generally perform better when all features are normalized to the same unit range.
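A minimal sketch of this step is shown below; the scaler is fit on the training data only and then applied to both portions, mirroring the procedure later described in Chapter 3.2.3. The feature values are made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1960, 0], [2015, 1], [1999, 1]], dtype=float)  # stand-in features
X_test = np.array([[1985, 0]], dtype=float)

scaler = MinMaxScaler()                       # rescales each feature to the [0, 1] range
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test data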

2.2 Controlled experiment

2.2.1 Algorithm selection

As noted in Chapter 1.2.3, no studies have been done where the performance of multiple algorithms is compared with regard to predicting prices of vinyl records. Therefore, an important aspect of this project will be to study some of the most common machine learning models used in similar situations. It was established in Chapter 1.2 about related work that Linear Regression and Neural Network were the two most commonly used algorithms in the works studied. It was also observed that Random Forest is quite prevalent and often outperforms other algorithms when compared. Therefore, these three models have been chosen for this project. In addition to those, it has also been decided to use k-Nearest Neighbors, which is argued to be one of the simplest machine learning algorithms [3, pp. 35]. The four algorithms will hereafter be referred to as LR, NN, RF, and KN respectively. The functionality of all the algorithms can be adjusted with hyperparameters, which is commonly done to improve the accuracy of predictions. However, since the same implementation will be used for the evaluation of many different portions of data, the models should not be fine-tuned too closely to a specific sample. Instead, the plan is to find some rough settings which can be considered to generally improve the accuracy, even if only to a relatively small extent. They will be evaluated using the grid search technique, i.e. accuracy will be measured while evaluating the models iteratively, using slightly different settings each time.
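The following sketch illustrates the kind of rough grid search referred to above, using k-Nearest Neighbors and randomly generated stand-in data; the candidate values shown are examples only, not the grids actually evaluated in the project.

import numpy as np
from itertools import product
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))                          # stand-in feature matrix
y = X @ np.array([3.0, 1.0, 0.0, 2.0, 5.0])       # stand-in prices
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

best_score, best_params = float("-inf"), None
for n_neighbors, weights in product([3, 5, 10], ["uniform", "distance"]):
    model = KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights).fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    if score > best_score:
        best_score, best_params = score, {"n_neighbors": n_neighbors, "weights": weights}

print(best_params, best_score)  # the roughly best-performing setting over the tested grid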

2.2.2 Measuring accuracy

The aim of research question Q1 in Table 1.1 is to investigate the accuracy of price predictions that can be achieved in this project. This will be measured with the R² (coefficient of determination) score, the MAPE (mean absolute percentage error) score, and frequency distribution tables of APE (absolute percentage error) scores. These metrics were chosen because they represent prediction error in a way that is not relative to the dependent variable. Multiple portions of data will be evaluated in the project, and a metric that is relative to the dependent variable (e.g. mean squared error or mean absolute error) would not be very useful when, for example, comparing the predictions for two different artists. The R² score is a float value that can have 1.0 as its maximum value. It represents the proportion of the dependent variable that can be calculated using the independent variables. That means that if the R² score is 1, the dependent variable of all predictions in the evaluation can be accurately calculated without error. A score as close to 1 as possible is therefore desirable. An APE score will be generated for every testing sample in the evaluation and expresses the percentage difference between the predicted price and the actual price 21. In order to show an overview of the outcome for an evaluated portion of samples, tables of APE score frequency distribution will be created. They will show the percentage of APE scores in the ranges 0-10%, 10-20%, etc. up to 140-150%, with one column for APE scores higher than 150%. A MAPE score is the mean of the APE scores for a set of predictions and is therefore useful for estimating their accuracy.

21 I.e. if the actual price is 10 and the predicted price is 9 or 11, the APE score is 10%.
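The sketch below shows how the three measures can be computed for a small, made-up set of actual and predicted prices; it illustrates the definitions above rather than the project's actual evaluation code.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 50.0, 8.0])   # actual prices (made up)
y_pred = np.array([9.0, 25.0, 40.0, 8.8])    # predicted prices (made up)

r2 = r2_score(y_true, y_pred)                     # coefficient of determination
ape = np.abs(y_true - y_pred) / y_true * 100      # absolute percentage error per prediction
mape = ape.mean()                                 # mean absolute percentage error

# Frequency distribution of APE scores: 0-10%, 10-20%, ..., 140-150%, and one bin for >150%.
bins = list(range(0, 160, 10)) + [float("inf")]
counts, _ = np.histogram(ape, bins=bins)
distribution = counts / len(ape) * 100            # share of predictions in each range
print(r2, mape, distribution)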

2.2.3 Experiment configuration

After the dataset has been finalized, the data for each artist will be divided into training and testing portions, 80% and 20% respectively. The data will be used to train and evaluate many different models, but the training and testing portions of each artist will remain the same throughout all stages of the experiment. The training portion of the full dataset will be used to train models using all four algorithms. These models will be evaluated using 1) all testing data, 2) the testing data of every separate genre, and 3) the testing data of every separate artist. Also, for every attribute in the dataset, another set of the four models will be trained and evaluated using the full dataset, but with one attribute excluded in each iteration. For each genre, models will be trained using the training data of all the artists from the respective genre. The models will be evaluated using 1) all the testing data of the genre they were trained with and 2) the data of every artist belonging to that genre as separate evaluations. Training data of each artist will also be used to train individual models using all four algorithms, and they will be evaluated with the artist's own testing data. The predictions made using the separate artist and genre models will also be used to produce R² and MAPE scores for the larger scopes using those models. To achieve this, the predictions generated using the models of all the separate artists and genres will be collected and used to produce scores at the genre and full-dataset level. That way, it will be possible to evaluate the full dataset using only artist models, using only genre models, or using the models trained with the full dataset. Similarly, it will also be possible to evaluate the data of the separate genres using only artist models. The outcome is that portions at all three levels (artist, genre, full) of the dataset can be evaluated using models trained solely with data from any of those levels. Ultimately, the experiment will be based on evaluations using four models for each artist and genre, as well as for the full dataset times the number of attributes plus one. The purpose of using this selection of models is to be able to evaluate all portions of data both independently and in bigger contexts, using multiple algorithms. The results will be used to answer the problem formulations listed in Table 1.1.
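As an illustration of how predictions from separately trained models can be pooled into a score for a larger scope (e.g. a genre or the full dataset), the sketch below collects the actual and predicted prices from several per-artist evaluations and computes a single R² score; the data is made up and the helper is not taken from the actual implementation.

import numpy as np
from sklearn.metrics import r2_score

def pooled_r2(per_artist_results):
    """per_artist_results: list of (actual_prices, predicted_prices) pairs, one per artist model."""
    y_true = np.concatenate([np.asarray(actual, dtype=float) for actual, _ in per_artist_results])
    y_pred = np.concatenate([np.asarray(pred, dtype=float) for _, pred in per_artist_results])
    return r2_score(y_true, y_pred)

# Two artists evaluated with their own models, then scored together as one larger portion.
print(pooled_r2([([10, 20, 15], [11, 18, 15]), ([100, 80], [90, 85])]))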

2.3 Reliability and Validity

In this section, the reliability, internal validity, and external validity of this project are discussed. Reliability refers to the consistency of the results produced in the project, and it is discussed whether they would be the same if the experiment were repeated. Internal validity is about the legitimacy of the data collected, and it is argued whether the data accurately represents what it is meant to. In the external validity section, it is discussed whether the results of the project can be generalized to other configurations outside of this study.

2.3.1 Reliability

State-of-the-art software tools for collecting and evaluating data will be used (see Chapter 3 for details). The dataset will be generated from a website where the information is consistently updated in terms of new releases, information about the releases, and sale prices. Therefore, a dataset generated today and a dataset generated tomorrow using the same settings would likely include slightly different data. However, if the experiment were executed on the same dataset multiple times, it would still show a very high level of consistency. The KN and LR algorithms yield identical results at every iteration, while RF and NN do not, as they are random by nature and will produce slightly varying results.

2.3.2 Internal validity

The number that will be used as the dependent variable in the dataset is the median price of the ten last sold copies of an item at the Discogs marketplace. Although this is considered the best available approximation of the second-hand price of a vinyl record, it is debatable how accurately it represents it. These numbers are based on sales where the prices were determined by the sellers, and they may differ from the selling prices normally occurring in forums other than Discogs, e.g. online auction sites like eBay or physical record shops.

2.3.3 External validity

While the experiment will evaluate the data of specific artists and genres, it will use data that can be retrieved in the same format for any artist. If another dataset were built using the same number of artists from different genres, it is reasonable to believe that some of the findings made in this project would be valid for that dataset too. For example, if an algorithm proved superior for most of the artists in the rock/metal dataset, it would probably also be superior for another set of artists. The same goes for the efficiency of using extra contextual data. However, results in terms of achieved accuracy may differ between divisions of data, such as artists and genres, so results such as MAPE scores are not guaranteed to be similar when performing the experiment with other data. Whether the results would be generalizable to data groups outside of the vinyl record area, such as other collectibles, is difficult to say. Results depend on factors such as the amount and quality of the data available. As datasets of other types of collectibles would likely include different attributes, they would need to be evaluated in an experiment to determine how general the results are.

2.4 Ethical Considerations

None of the artists included in the dataset will be mentioned by name, as it cannot be certain that all artists would want to be explicitly featured in an experiment like this. Also, the purpose of the project is not to investigate the results of specific artists, but rather of artists and genres in general. Ethics concerning artificial intelligence is a relevant topic that is widely addressed. Different forms of AI are being integrated into society, which causes many topics of debate. Some examples include increased unemployment as a result of automated jobs, the issue of how revenue generated by AI should be distributed, and how to handle eventual mistakes caused by machines [42]. Many of these common, wide questions are more relevant for other types of applications and systems, but there are a couple of questions that can be considered also in the context of this project. Is part of the charm of dealing with collectibles that prices of items should be somewhat unpredictable and spontaneous? Could an eventual AI approach to the prices of collectibles make things too static and somehow alter the mechanics of collecting valuables? These questions arose during the project and could be further elaborated on, should the concepts of price prediction of collectibles become more advanced.

3 Implementation

Several pieces of software were developed for this project: a class for communicating with databases, a module that collects, formats, and saves data into a database (dataset.py, see Appendix A.1), a module that performs some manipulations on the existing dataset, a module with two classes used for executing the experiment (experiment.py, see Appendix A.2), and a script for executing the machine learning experiment using said classes. The outline of the software's structure and functionality, as well as some configuration details which were settled during implementation, are described in this chapter. The implementation was done in Python (version 3.6.9) and Table 3.1 shows the libraries used. The language was chosen because it is a well-established language for data science as well as an efficient general-purpose language [3, pp. 5], suitable for the various tasks of this project.

Name | Version | Usage
Beautiful Soup | 4.9.0 | A tool that facilitates the process of extracting information from HTML code. It was used to filter pieces of data from each downloaded page.
Pandas | 1.0.3 | Used to store and organize the data used in the machine learning implementation. Reads data into a table-like DataFrame object which offers useful methods for data manipulation.
Requests | 2.32.0 | Used to download web pages and retrieve the HTML code.
Scikit-learn | 0.22.2 | Offers a large selection of tools for machine learning. In this project, it is used for its machine learning algorithms, metrics, and normalization libraries.
Sqlite3 | 3.22.0 | Creates a connection to an SQLite database file, allowing data to be saved and loaded.
Unittest | 3.6.9 | Unit testing framework for Python.
Table 3.1: Python libraries prominently used in the implementation.

3.1 Dataset

3.1.1 Attribute selection

A list of relevant variables 22 was made after studying a website specializing in the value of second-hand records [41]. Out of those variables, it was concluded that the following could be extracted from Discogs: age, area of release, artist, colored vinyl, limited edition, picture disc, promotional item, record label, and test pressing. The website was further studied to determine whether any other useful hedonic features could be obtained. The results were the following: box set, numbered, and release type. The median price of each item, which is the dependent variable of the dataset, would also be retrieved from Discogs in USD currency format. It was apparent that additional hedonic features could be independently constructed from data available at Discogs. Area is a refined version of the area of release information obtained from Discogs. The original data was more geographically precise, specifying the countries where the item was released. The unique values were reduced to a set number of categorical values 23 to evaluate whether this was beneficial. Testing proved that it was, so the narrowing of values was retained. Chronology is an integer on a 1-10 scale representing the moment of manufacturing between the start of the artist's career and now, i.e. the early releases of an artist have a low number. Issue type describes whether the item is an original which has not been reissued, an original which has been reissued, or a reissue. Version is a number that represents the total amount of versions of that title. Vinyl exclusive tells whether or not there are any non-vinyl releases available of the title. At Discogs, compilation titles are listed on a page of their own and are not included with other albums or singles. This is an understandable approach for maintaining the structure of their website. However, as samples in the dataset for this project, a full-length compilation is essentially the same release type as an ordinary album. Therefore, for the sake of consistency, short compilations were treated as singles/EP's and full-length compilations were treated as albums in the context of this dataset. Their status as compilations was accounted for by adding a compilation attribute. While the features were being tested after the data had been collected, it was found that the record label data retrieved from Discogs was complicated to find a suitable use for. The feature had more than 3500 unique values and, being categorical, needed as many dummy variables. Not only did it drastically increase the number of features, but it also seemed to worsen the general performance of the models. A possible solution could have been to create a new feature based on the information, e.g. an attribute expressing whether it is a small or major label. That would, however, have required a thorough investigation of all unique record labels, which was not possible. Therefore, the record label data was excluded.

22 Artist, age, promotional item, small or major label, colored vinyl, picture disc, test pressing, area of release, limited edition, and reissue status.
23 Africa, Asia, Australia/New Zealand, Eastern Europe, Northern Europe, Central Europe, Southern Europe, North America, South America, and World (used for items with unknown or multiple areas).

3.1.2 Data collection

A lot of data needed to be retrieved from the Discogs website. For every artist to be included in the dataset, information about the complete album, singles, and compilation discographies was needed. And for every item, several individual attributes were needed. While Discogs does have a free-to-use REST API that can provide most of the information needed, it lacks the target variable - the median price of previously sold copies. That number is only available on the individual page of each item, meaning every page needed to be downloaded. Since all data available from the API was also accessible from the pages that needed to be downloaded, the website was used as the only source of data. To gather the complete discography data for an artist, the generate_artist_discography function in the dataset.py module is called with parameters representing the artist's name and the URL to the artist page on the Discogs website 24. The first thing that happens is that the main pages for the artist's album releases 25, EP/single releases 26, and compilation releases 27 are downloaded. The releases are split up over several pages using pagination, and all of them need to be retrieved, but with the option of showing 500 releases per page, the number of pages is normally quite small. However, these pages are merely a list of releases by the artist, and the next step is to retrieve links to the item pages. Using Beautiful Soup, every entry in the list of releases is scanned, and the entries can be of two types. If there is only one version available of a release, the link goes directly to the item page of that unique release. But if there are multiple versions of the release (as is the case for most releases - many of them, especially those of major artists, have hundreds), the link leads to a master page for that release. The master page holds the links to the item pages. During this process, items that are not official (i.e. pirate or counterfeit releases) or not in the vinyl format are filtered out. After links have been collected for all individual items for the artist, they are downloaded. With the HTML code for all individual item pages downloaded, it is processed through some functions to retrieve the needed information about each item. The attributes which are explicitly available in text form on the page 28 can be retrieved using Beautiful Soup. Some attributes, however, cannot be determined with the information from the individual HTML page only 29. They need to be evaluated in relation to the other downloaded pages, which is also done in this process. An example of this is the categorical feature issue type, which can have three different values depending on whether it is an original issue which has not been reissued later, an original issue which has been reissued, or simply a reissue. As the median prices on the pages are displayed in a currency depending on the IP address of the user, and the experiment was not executed in the USA, a VPN service was utilized to acquire the dependent variable in the USD currency format. After data has been collected for all items retrieved, it is saved to an SQLite database. This process was repeated for every artist decided to be included, creating the full dataset. Another small piece of software was later used to perform some minor adjustments on the dataset. This script was written to add a few features 30 to the dataset which had not yet been fully established at the time of the implementation of the primary dataset creation module. However, it does not require any additional online data that has not already been retrieved, so a short script was enough to make the changes.

24 For example https://www.discogs.com/artist/125246-Nirvana
25 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Albums&type=Releases
26 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Singles-EPs&type=Releases
27 https://www.discogs.com/artist/125246-Nirvana?filter_anv=0&subtype=Compilations&type=Releases
28 Record label, release year (recalculated as age later), country, limited pressing, colored disc, picture disc, box set, numbered, promotional item, test pressing, and usd.
29 Issue type, versions, and vinyl exclusive.
30 Chronology, area, age, and genre.
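The sketch below outlines the link-collection step in a simplified form; the URL handling and the filtering rule are placeholders and do not reflect the actual Discogs markup or the logic in dataset.py.

import requests
from bs4 import BeautifulSoup

def get_item_links(artist_releases_url):
    """Collect links to individual item pages from an artist's release listing (simplified sketch)."""
    html = requests.get(artist_releases_url).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Placeholder filter: keep links that look like release pages or master pages.
        if "/release/" in href or "/master/" in href:
            links.append("https://www.discogs.com" + href)
    return links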

3.1.3 Data trimming calibration

To evaluate which values to use for data trimming, tests were done according to the principles described in Chapter 2.1.3. The classes of the experiment.py module were used iteratively to produce R² scores from evaluations of the full dataset using different values. The results were then analyzed to see which values generated the most preferable results.

Figure 3.1 shows the results of using different values for the required minimum number of samples for an artist to be included. A slight increase in R² scores started to show when artists with fewer than 50 releases were excluded, becoming more stable again after passing 100. However, since about 40% of all artists in the dataset would have been excluded if a minimum of 50 items were required, and because the difference in performance was rather small, no limit was set.

Figure 3.1: A slightly increased R² score when excluding smaller artists.

Figure 3.2 shows the results of using different values for the upper limit of samples used for one artist. The R² scores peak at 300, so the decision was made to use this limit. Figure 3.3 shows the results of using different values for the maximum target variable allowed for a sample to be included. The graph shows that outliers like these can have a large impact on the overall performance despite being relatively few samples, and that a lower maximum limit on the target variable leads to higher R² scores in this scenario. However, the purpose was to cut as little as possible while still making a significant improvement in results. After studying the results of the test, the limit was set at 300, meaning all items with a median selling price of over $300 are excluded. Compared to not excluding any samples based on high prices, this produced approximately twice as high R² scores while incurring a data loss of less than 1% of the dataset.
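The following sketch illustrates how a calibration loop of this kind can be realized for the price cap, assuming a DataFrame df with already encoded features and a usd target column; the candidate limits, model settings, and column names are examples and assumptions, not the exact values or code used in the project.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score_price_caps(df, caps=(100, 200, 300, 500, 1000)):
    """Evaluate the dataset once per candidate price cap and return the R² score for each cap."""
    scores = {}
    for max_price in caps:
        trimmed = df[df["usd"] <= max_price]                    # drop samples above the cap
        X, y = trimmed.drop(columns=["usd"]), trimmed["usd"]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
        scores[max_price] = r2_score(y_te, model.predict(X_te))
    return scores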


Figure 3.2: R² scores when data is trimmed for artists with large discographies.

Figure 3.3: R² scores when data is trimmed based on target values.

3.2 Controlled experiment

To realize the experiment of this project, there was a need for an environment that enabled machine learning model evaluation in several different ways. Apart from using the full set of collected data, there was a need to perform calculations also on parts of it, such as genres and artists, and to be able to easily access this data. There was also a need to train and evaluate a number of models with different data and retrieve the results. For these purposes, the two classes Dataset and Regression were developed.

3.2.1 Hyperparameter calibration

The hyperparameter settings for the machine learning algorithms were calibrated similarly to how the preferable values for data trimming were determined. The classes of the experiment.py module were used iteratively to produce evaluation results using different values. As mentioned in Chapter 2.2.1, the focus was not to fine-tune the settings too closely to a specific sample, but rather to find some rough settings which could be considered to generally improve the accuracy. After testing different values for a selection of hyperparameters for each algorithm, using both the full dataset as well as genres and artists, some settings which proved generally effective were decided on. The hyperparameters used (deviating from the default settings of Scikit-learn version 0.22.2) can be seen in Table 3.2.

Algorithm | Hyperparameter settings
k-Nearest Neighbors | n_neighbors: 5, weights: distance
Linear Regression | fit_intercept: False
Neural Network | solver: lbfgs, activation: relu
Random Forest | n_estimators: 100
Table 3.2: Hyperparameter settings used for the algorithms.
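For reference, the sketch below instantiates the four Scikit-learn regressors with the settings from Table 3.2; all other hyperparameters are left at the library defaults.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# The four models with the hyperparameter settings listed in Table 3.2.
models = {
    "KN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "LR": LinearRegression(fit_intercept=False),
    "NN": MLPRegressor(solver="lbfgs", activation="relu"),
    "RF": RandomForestRegressor(n_estimators=100),
}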

3.2.2 Dataset class

This class utilizes the Sqlite3 and Pandas libraries during initialization to load all data from the database into a DataFrame that is stored in the object. It is in this process that the dataset is trimmed using the values established in Chapter 3.1.3. The SQL statement can be adjusted using two variables, artist_min and max_price. The first sets the lower limit for how many items an artist needs to have in order to be included 31, while the latter sets an upper limit for how high the price is allowed to be for an item to be included 32. After the data has been imported, it is processed by two methods. One of them, trim_artists, iterates through every artist in the dataset and removes the number of samples exceeding the value of the artist_max variable. The next method, train_test, also iterates through each artist to split their data into training and testing portions. The ratio is determined by the train_size variable, which must be a float between 0.1 and 0.9. The data for each artist is divided into a training portion and a testing portion, which are then appended to training and testing sets containing all artists. The reason that this is done per artist, and not on the whole dataset at once, is to ensure that the rows of each artist are evenly split between the training and testing sets. The split data for all artists is contained in two sets and can be accessed easily using the get_data method, which takes the two parameters artist and genre. If it is called with both of them set to None, the method returns a tuple with the training and testing sets including all data. Should either of the two parameters be set to a string representing an artist or genre in the dataset, the returned tuple will include training and testing sets with only that artist or genre.

31 For example, if artist_min is set to 20, all artists with fewer than 20 items will be excluded.
32 For example, if max_price is set to 200, all items with a price of over 200 will be excluded.
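A minimal sketch of the per-artist splitting described above (assuming a pandas DataFrame with an "artist" column; the helper name mirrors the thesis code but the body is illustrative only):

import pandas as pd

def train_test(df, train_size=0.8, seed=0):
    """Split each artist's rows into training and testing portions and
    concatenate them into two sets covering all artists (sketch)."""
    train_parts, test_parts = [], []
    for _, rows in df.groupby("artist"):
        rows = rows.sample(frac=1, random_state=seed)  # shuffle within the artist
        cut = int(len(rows) * train_size)
        train_parts.append(rows.iloc[:cut])
        test_parts.append(rows.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)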

3.2.3 Regression class

In the initialization method of this class, the parameters artist and genre are used to define what training and testing data will be loaded into the object. The aforementioned get_data method of the Dataset object is called in the initialization method (which means a Dataset object needs to be instantiated beforehand). The values of the artist and genre parameters are passed on to get_data, and the returned tuple is stored in the Regression object to be used as training and testing data later. After that, several methods are called within the initialization process. Firstly, remove_unique_cols drops columns in the dataset where all rows share the same value (i.e. only one unique value), which will be the case for the artist and genre columns if not all data was retrieved. Secondly, the encoding of categorical values is handled in the one_hot_encoding method. Within these methods, the training and testing data is temporarily merged and unmerged using the merge and unmerge methods, ensuring that the operations are performed on both parts of the dataset. Thirdly, the standardize method is called, which uses the MinMaxScaler of the Scikit-learn library to transform all values of the dataset (except for the dependent variable) into the same 0-1 scale. The scaler is fit using the training data, and the transformation is then applied to both the training and testing data. After the data has been preprocessed, the models are created. Using the KNeighborsRegressor, LinearRegression, MLPRegressor, and RandomForestRegressor classes from the Scikit-learn library, and the hyperparameter values in Table 3.2, the models are fit with the training data and stored in the object. Calling the evaluate_models method will return prediction results for the testing data. Two of its parameters, artist and genre, enable the possibility to test the model for a specific artist or genre only (see usage example in Code 3.1). For every model in the object, a list of predicted prices is generated. This list, along with a list of the actual prices, is sent to a function, get_scores, which returns an R² score, a MAPE score, an array of APE scores as well as a frequency distribution table of APE scores. The method then returns the results for all models as a Python dictionary. The method has the optional parameters results and perc_list. If results is set to True, the lists of actual and predicted prices will also be returned for each model. This enables the merging and evaluating of results from multiple models mentioned in Chapter 2.2.3. If perc_list is set to True, a list of all APE scores gets added to the results as well.
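The scaling step can be sketched as follows (a minimal illustration, assuming pandas DataFrames and a dependent variable column named "price"; this is not the project's exact implementation):

from sklearn.preprocessing import MinMaxScaler

def standardize(train, test, target="price"):
    """Scale all independent variables to the 0-1 range, fitting the scaler
    on the training data only and applying it to both sets (sketch)."""
    cols = [c for c in train.columns if c != target]
    scaler = MinMaxScaler().fit(train[cols])
    train = train.copy()
    test = test.copy()
    train[cols] = scaler.transform(train[cols])
    test[cols] = scaler.transform(test[cols])
    return train, test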

# Instantiate Dataset object.
data = Dataset()

# Instantiate Regression object trained with the full dataset.
regr_all = Regression()

# Get prediction scores for the full dataset.
all_res = regr_all.evaluate_models()

# Get prediction scores for one artist, trained with the full dataset.
artist_scores_trained_all = regr_all.evaluate_models(artist='artist name')

# Instantiate Regression object containing data of a specific artist.
regr_artist = Regression(artist='artist name')

# Get prediction results for one artist, trained with the artist's data only.
artist_res = regr_artist.evaluate_models()

Code 3.1: Basic usage of the Dataset and Regression classes.
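A rough sketch of what a scoring helper like get_scores could compute, following the description above (an illustration, not the project's exact code):

import numpy as np
from sklearn.metrics import r2_score

def get_scores(actual, predicted):
    """Return R2, MAPE, the individual APE scores and an APE frequency
    distribution in 10-point bins up to 150% plus one '>150%' bin (sketch)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.abs(predicted - actual) / actual * 100
    bins = list(range(0, 160, 10)) + [np.inf]
    counts, _ = np.histogram(ape, bins=bins)
    freq = counts / len(ape) * 100  # percentage of predictions per bin
    return r2_score(actual, predicted), ape.mean(), ape, freq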

3.2.4 Experiment execution

To produce all results according to the experiment configuration in Chapter 2.2.3, a script was written that uses the finished dataset and the classes described in this chapter. Regression objects were created for every artist and genre as well as for the full dataset, along with variants of the full dataset where each attribute was excluded. The objects were then used to generate results that were saved to a database. Figure 3.4 shows a visual representation of the experiment.
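The driver script is not reproduced here, but its overall flow could be sketched as follows (the lists of genres and artists and the exact result bookkeeping are assumptions based on the description above):

# Sketch of the experiment flow (illustrative names only).
genres = []   # fill with the 11 genre names in the dataset
artists = []  # fill with the 412 artist names in the dataset

data = Dataset()
regr_full = Regression()                      # model trained on the full dataset
results = {"full": regr_full.evaluate_models()}

for genre in genres:
    results[("genre", genre)] = Regression(genre=genre).evaluate_models()
    results[("genre, full model", genre)] = regr_full.evaluate_models(genre=genre)

for artist in artists:
    results[("artist", artist)] = Regression(artist=artist).evaluate_models()
    results[("artist, full model", artist)] = regr_full.evaluate_models(artist=artist)

# Feature-exclusion variants would be handled similarly, and all results
# would then be written to the results database.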

Figure 3.4: Visual representation of the experiment.

4 Results

This chapter shows the structure of the finalized dataset, as well as results produced using that dataset in the machine learning experiment.

4.1 Dataset

The finalized dataset after trimming includes the sampled discographies of 412 artists in 11 different genres. See Table 4.1 for more details regarding artists and samples in each genre as well as for the full dataset. Table 4.2 shows details of the attributes of the dataset.

GENRE               ARTISTS   SAMPLES (TOTAL)   TRAINING   TESTING
(All)               412       37046             29490      7556
Alternative metal   41        2070              1640       430
Alternative rock    46        4788              3812       976
Black metal         43        2526              2004       522
Classic rock        13        3900              3120       780
Death metal         34        2130              1692       438
Doom metal          64        4037              3204       833
Electronic          40        3982              3172       810
Heavy metal         35        3910              3118       792
Punk rock           43        5537              4413       1124
Stoner rock         26        1294              1027       267
Thrash metal        27        2872              2288       584

Table 4.1: Details of the genre subdivisions of the dataset.

NAME               TYPE         NOTES
Age                numerical    Years since manufacturing.
Area               categorical  The geographical area of the release.
Artist             categorical  Name of artist.
Box set            categorical  True or false.
Chronology         numerical    A number on a 1-10 scale.
Colored vinyl      categorical  True or false.
Compilation        categorical  True or false.
Genre              categorical  Name of genre.
Issue type         categorical  Original (not reissued), original (reissued) or reissue.
Limited edition    categorical  True or false.
Numbered           categorical  True or false.
Picture disc       categorical  True or false.
Price              numerical    The median price in USD. Dependent variable.
Promotional item   categorical  True or false.
Release type       categorical  Album or single.
Test pressing      categorical  True or false.
Versions           numerical    The number of other versions available of the same title.
Vinyl exclusive    categorical  True or false.

Table 4.2: Attributes used in the finalized dataset.

4.2 Controlled experiment

This chapter presents tables giving an overview of the results from evaluating the multiple aspects of the dataset mentioned in Chapter 2.2.3. The largest tables, representing results for the evaluated genre and artist portions, can be found as appendices. Some selected parts of the results are also displayed using graphs. There are two kinds of tables: one that shows R² and MAPE scores, and one that shows the frequency distribution of APE scores. Some abbreviations are used throughout the chapter. ED stands for evaluated data and refers to the group of data that has been evaluated. TS describes the number of testing samples for that data group. SM stands for sample mean (i.e. the mean value of all the target variables). TR refers to the training data used, with F, G, and A standing for full dataset, genre, and artist respectively (meaning the evaluations were done using models trained with those data divisions). The remaining abbreviations refer to the algorithms and error metrics used (described in Chapter 2.2.1 and 2.2.2 respectively). The columns with numerical values as headers describe the percentage of predictions belonging to the APE span of that column. For example, the value 15.2 in the first row of Table 4.4 indicates that when the full dataset was evaluated with k-Nearest Neighbors using the whole dataset as training data, 15.2% of the predicted prices differed from the actual price by 10-20%. The F column refers to the figure illustrating the data of those rows.

4.2.1 Evaluating full dataset

All rows in Table 4.3 show the R² and MAPE scores of the evaluation of the full dataset of 37046 samples (of which 7556 were used for testing), while Figure 4.1 shows a visual representation of the MAPE scores of that table.

ED    TS     SM      TR    KN (R²/MAPE)     LR (R²/MAPE)     NN (R²/MAPE)      RF (R²/MAPE)
all   7556   31.57   F     0.247 / 60.32    0.272 / 86.45    0.375 / 71.95     0.404 / 55.84
all   7556   31.57   G     0.247 / 60.1     0.302 / 83.95    0.248 / 78.0      0.409 / 57.89
all   7556   31.57   A     0.236 / 60.73    0.102 / 87.89    -0.697 / 87.05    0.34 / 61.75

Table 4.3: R² and MAPE for evaluation of the full dataset.

Figure 4.1: MAPE scores for evaluation of the full dataset.

Table 4.4 shows the frequency distribution of APE scores of the full dataset, while Figure 4.2 shows a visual representation of the first four rows of that table, i.e. the evaluation of the testing samples of the full dataset using models trained with all training data.

AL TR 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 > KN F 18.6 15.2 12.6 10.8 9.0 6.8 5.5 4.0 3.5 1.6 1.5 1.2 1.0 0.9 0.7 7.3 LR F 10.8 10.4 10.2 9.4 8.5 7.9 6.8 5.0 4.3 3.0 2.4 2.0 1.8 1.5 1.3 14.8 NN F 13.7 12.1 11.8 10.4 9.0 7.4 5.9 4.6 3.7 2.6 1.9 2.0 1.6 1.3 1.4 10.6 RF F 19.3 17.0 13.5 11.1 8.5 6.4 5.2 3.6 2.6 1.4 1.2 1.0 0.9 0.6 0.7 6.8 KN G 18.5 15.3 12.4 10.9 9.0 6.8 5.4 4.2 3.3 1.6 1.7 1.1 1.0 0.9 0.7 7.1 LR G 11.9 10.9 10.4 9.4 8.1 7.4 7.1 5.0 4.1 3.1 2.5 1.8 1.9 1.5 1.2 13.7 NN G 13.4 12.9 11.8 10.2 8.2 7.3 6.0 4.8 3.5 2.8 2.2 1.7 1.6 1.3 0.9 11.6 RF G 18.5 17.5 12.9 10.9 8.9 6.6 5.0 3.7 2.6 1.6 1.3 1.0 1.0 0.7 0.7 7.3 KN A 18.1 15.2 12.3 11.1 8.5 7.0 5.5 4.7 3.4 1.6 1.4 1.0 1.2 0.8 0.7 7.4 LR A 12.7 12.5 11.1 9.3 8.2 7.8 6.1 4.9 3.8 2.9 2.4 1.7 1.6 1.1 1.4 12.5 NN A 15.0 13.2 11.7 9.2 8.4 6.5 5.7 4.5 3.5 2.9 1.8 1.5 1.4 1.3 1.1 12.3 RF A 18.2 16.9 13.1 10.7 8.8 6.2 5.1 4.0 2.9 1.9 1.2 1.2 0.9 0.8 0.7 7.5 Table 4.4: APE frequency distribution for evaluation of the full dataset.

Figure 4.2: APE frequency distribution for evaluation of the full dataset using the full dataset trained model.

Table 4.5 shows the results of the full dataset being evaluated while excluding one feature at a time, and Figure 4.3 is a visual representation showing MAPE scores for each algorithm based on that data.

ED TS SM TR KN LR NN RF R2 MAPE R2 MAPE R2 MAPE R2 MAPE - age 7556 31.57 F 0.24 61.16 0.272 86.63 0.376 74.25 0.388 55.38 - album 7556 31.57 F 0.197 67.18 0.236 97.02 0.316 85.84 0.369 64.01 - area 7556 31.57 F 0.248 62.08 0.26 89.73 0.38 77.71 0.378 60.12 - artist 7556 31.57 F 0.19 68.88 0.192 102.82 0.289 89.61 0.335 72.69 - box_set 7556 31.57 F 0.218 61.21 0.227 87.66 0.338 75.72 0.346 56.74 - chronology 7556 31.57 F 0.246 60.55 0.27 86.53 0.372 75.45 0.406 56.0 - colored 7556 31.57 F 0.248 60.42 0.27 86.14 0.392 73.32 0.402 56.6 - compilation 7556 31.57 F 0.233 60.21 0.274 86.68 0.373 70.21 0.403 56.28 - genre 7556 31.57 F 0.256 62.65 0.272 86.45 0.373 71.04 0.389 55.63 - issue_type 7556 31.57 F 0.249 59.86 0.27 86.09 0.395 71.57 0.403 55.58 - limited 7556 31.57 F 0.23 60.46 0.266 86.89 0.364 73.82 0.391 55.78 - numbered 7556 31.57 F 0.247 60.85 0.267 86.53 0.381 71.5 0.398 56.38 - picture_disc 7556 31.57 F 0.245 60.48 0.272 86.46 0.399 71.63 0.406 55.87 - promo 7556 31.57 F 0.242 60.21 0.27 85.43 0.356 71.1 0.402 55.77 - test_pressing 7556 31.57 F 0.203 63.65 0.231 88.22 0.316 76.57 0.355 58.02 - versions 7556 31.57 F 0.238 61.18 0.272 86.18 0.378 73.76 0.393 55.88 - vinyl_exclusive 7556 31.57 F 0.244 61.01 0.265 86.51 0.36 71.1 0.403 55.94 Table 4.5: R2 and MAPE for evaluation of the full dataset with features excluded

Figure 4.3: MAPE scores of the full dataset being evaluated while excluding features.

4.2.2 Evaluating genre data

The table in Appendix B.1 shows the results of individual genres being evaluated. Figure 4.4 is based on data from that table and shows MAPE scores for all genres evaluated using RF models trained with artist data, genre data, and the full dataset.

Figure 4.4: MAPE scores of all genres evaluated using RF algorithm and multiple models.

The table in Appendix B.2 shows the frequency distribution of APE scores for the different genres. Figure 4.5 and Figure 4.6 show the APE frequency distribution for the death metal and classic rock genres respectively, which were the two genres with the lowest and highest MAPE scores. The graphs show the genre evaluation results generated with full dataset trained models using all four algorithms.

Figure 4.5: APE frequency distribution for the death metal genre evaluated with full model.

Figure 4.6: APE frequency distribution for the classic rock genre evaluated with full model.

4.2.3 Evaluating artist data

The table in Appendix B.3 shows the MAPE scores of individual artists being evaluated using models trained with the artist's own data, its genre, as well as the full dataset. It is sorted firstly by genre, and secondly by the mean of the MAPE scores for each artist. Figure 4.7, Figure 4.8, and Figure 4.9 show the APE frequency distribution for the artists A283, A19, and A359 respectively. They were selected based on their mean MAPE scores of full dataset model evaluations and are examples of artists showing the best, medium, and worst results (with the additional requirement that the artist had at least 20 testing samples). The graphs show the artist evaluation results generated with full dataset trained models using all four algorithms.

Figure 4.7: APE frequency distribution for the A283 artist (an example of an artist showing good results in relation to the total) evaluated with the full dataset model.

Figure 4.8: APE frequency distribution for the A19 artist (an example of an artist showing medium results in relation to the total) evaluated with the full dataset model.

Figure 4.9: APE frequency distribution for the A359 artist (an example of an artist showing poor results in relation to the total) evaluated with the full dataset model.

5 Analysis

In this chapter, the results presented in the previous chapter are analyzed based on the problem formulations in Table 1.1. Results were displayed as R², MAPE (mean absolute percentage error), and APE frequency distribution tables 33, and parts of the results were also displayed as graphs. They represent different portions of the dataset (full dataset, separate genres, and separate artists) being evaluated with models trained with different portions of data as well. Analyzing these numbers, it is possible to see patterns regarding which algorithms and training data are the most effective, and what level of accuracy can be achieved.

Figure 5.1: Scatter plot of R2 and MAPE values of ~3000 sets of predictions.

Figure 5.1 illustrates the relationship between R² scores and MAPE scores using predictions of roughly 3000 portions of data from the dataset used in this project.

33 The R² score is a float value (maximum 1.0) that represents the proportion of the variance in the dependent variable that can be explained by the independent variables. APE scores were calculated for every testing sample and express the percentage difference between the predicted price and the actual price. MAPE scores are the result of calculating the mean of the APE scores for a set of predictions. The metrics are further explained in Chapter 2.2.2.
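In standard notation, with y_i the actual price, ŷ_i the predicted price and ȳ the mean of the actual prices, these metrics can be written as:

\mathrm{APE}_i = \frac{|\hat{y}_i - y_i|}{y_i} \cdot 100, \qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{APE}_i, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}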

It shows that, generally, lower MAPE scores tend to correspond to higher R² scores. It also shows plenty of examples where a low MAPE score coincides with a low R² score, or a high MAPE score with a high R² score, meaning that R² cannot always be considered a trustworthy indicator of the quality of the predictions, even if a (somewhat unpredictable) relationship can be observed. This can also be seen in the results. For example, the first row in Table 4.3 shows a lower R² score for the KN model (0.247) than for the LR and NN models (0.272 and 0.375 respectively), despite KN having the lowest MAPE score of the three. It was decided that MAPE scores and APE frequency distribution tables would be the primary focus in this analysis, as they represent the quality of the predictions more explicitly than R².

Figure 5.2: Box plot of APE scores generated with the four algorithms. The three lines in the coloured boxes represent the quartiles Q1 (median of the lower half), Q2 (median) and Q3 (median of the upper half) for all sets of APE scores, while the lines above indicate the maximum value (excluding outliers). The black squares above are outliers. The diagram shows that some APE scores are very high (up to around 2000) compared to the majority.

While studying the MAPE scores in this report, it is important to consider that some of the scores are greatly affected by a relatively small number of very high error scores. For example, looking at the first row of Table 4.3 again, the MAPE score for KN is 60.32, meaning the mean absolute percentage error is 60.3%. Looking at the APE frequency distribution of the same evaluation (first row of Table 4.4), it can be observed that 18.6% of the predictions had a less than 10% error, and almost half of all the predictions had less than a 30% error. The columns for the remaining spans up to 150% show quite low numbers, which by themselves do not explain the high MAPE. However, the last column, representing scores with a larger difference than 150%, contains 7.3% of the predictions, suggesting that there were some quite large errors having a considerable effect on the MAPE. This is visualized as a box plot in Figure 5.2, displaying the distribution of APE scores generated by evaluating the full dataset using all four algorithms. To analyze this further, a test was done where the MAPE scores for that row were re-calculated with the largest APE scores excluded, one percent at a time from 1% to 9%, and then in steps of ten from 10% up to 50%. The results (visible in Figure 5.3) show that the MAPE decreases by 9 percentage points when excluding 1% of the largest errors, by over 19 percentage points when excluding 5%, and by over 25 percentage points when excluding 10%. This further established that the results included some outliers which had a large impact on the scores, and this was kept in consideration during the analysis.
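A sketch of the re-calculation behind Figure 5.3 (assuming ape is the array of APE scores for one set of predictions; the exact script is not shown in the thesis):

import numpy as np

def mape_excluding_largest(ape, exclude_percent):
    """MAPE after dropping the given percentage of the largest APE scores."""
    ape = np.sort(np.asarray(ape, dtype=float))
    keep = int(round(len(ape) * (1 - exclude_percent / 100)))
    return ape[:keep].mean()

# Producing the curve in Figure 5.3:
# points = {p: mape_excluding_largest(ape, p)
#           for p in list(range(1, 10)) + list(range(10, 51, 10))}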

Figure 5.3: The effect on MAPE scores of excluding percentages of the APE scores.

5.1 Comparison of algorithms

A total of 1272 sets of data were evaluated using the four different algorithms, and an overview of their performance can be seen in Table 5.1. The table displays an overview of the results of different groups of data evaluated (the full dataset as well as separate genres and artists), the training data used, and counts of the best performing algorithm in each situation. The performance was measured in terms of the MAPE score, as well as the number of APE scores with a less than 30% difference in each evaluation. The latter was used because it captures the share of the most accurate predictions better than the MAPE, although it disregards the impact of very high percentage errors in the evaluation. The data of the table is also displayed visually in Figure 5.4.

                          LOWEST MAPE               MOST APE SCORES < 30%
ED        TR   TOTAL   KN    LR   NN   RF        KN    LR   NN   RF
all       F    1       0     0    0    1         0     0    0    1
all       G    1       0     0    0    1         0     0    0    1
all       A    1       1     0    0    0         0     0    0    1
genres    F    11      3     0    0    8         1     0    0    10
genres    G    11      4     0    0    7         2     0    0    9
genres    A    11      6     0    0    5         1     0    0    10
artists   F    412     120   38   48   206       162   46   44   160
artists   G    412     122   54   38   198       156   51   51   154
artists   A    412     135   53   59   165       159   65   68   120

Table 5.1: Lowest MAPE scores and most APE scores under 30% for evaluated data.
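A sketch of how such counts could be produced from the stored results (the result structure is an assumption; each entry maps an algorithm to its MAPE for one evaluated set):

from collections import Counter

def count_wins(evaluations):
    """Count, per algorithm, how often it has the lowest MAPE (sketch).

    evaluations: iterable of dicts such as
    {"KN": 60.3, "LR": 86.5, "NN": 72.0, "RF": 55.8}
    """
    wins = Counter()
    for scores in evaluations:
        wins[min(scores, key=scores.get)] += 1
    return wins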

Figure 5.4: Visual representation of the lowest MAPE scores and most APE scores under 30% for evaluated data.

The table and graph show that the algorithms which produced the best results in most cases were RF and KN. When the full dataset was evaluated, RF gave the best results except for the MAPE generated using artist models, where KN gave the lowest score. When genre data was evaluated, MAPE scores between KN and RF were slightly more even, especially using artist trained models, but using the full dataset and genre models, the RF results were the best on most occasions. Looking at the APE scores of less than 30% in the same segments, RF also gave the strongest results. Analyzing the evaluations of artist data, the numbers appear more even between KN and RF, and the remaining two algorithms, LR and NN, also show the best results for a considerable portion of the predicted artists. Worth mentioning here is that the number of testing samples for each artist varies between 3 and 60 (see Appendix B.3). This can help explain the differences that can be observed when comparing the APE scores below 30% of the artist data and the genre data. Good results for evaluations done on data by individual artists do not necessarily mean the same algorithm is the best choice when evaluating larger sets including more artists or genres. To summarize the comparison, KN seems to be most effective when using models trained on an artist's own data only, and when evaluating smaller portions of data, such as that of the individual artists. The last row shows that when it comes to generating the most APE scores under 30% for individual artists, KN showed its clearest advantage. It also produced slightly lower MAPE scores compared to RF where evaluations were done with artist trained models. In all other situations, the table shows that RF generated superior results. Both LR and NN gave the best results in some of the evaluations of individual artists, but considerably less often than KN and RF.

5.2 Additional contextual data

Table 4.3 shows the full dataset being evaluated using different algorithms and models trained with different portions of data. Both KN and LR show a fairly small difference in the effect of using the different training data, while both NN and RF show lower MAPE scores when a broader scope of data has been used for training. The data in Appendices B.1 to B.3 represents evaluations of separate genres and artists. That data was analyzed to show the number of genres and artists generating better results when evaluated using models trained with data from a bigger scope. Table 5.2 shows the number of artists of each genre (as well as all artists in the bottom row), using the different algorithms, that generated lower MAPE scores after being evaluated with models trained with genre data or the full dataset. The last row of this table is also represented as a graph in Figure 5.5, showing the number of artists that got lower MAPE scores when using the genre model or the full dataset model. Table 5.3 shows data from the same evaluations, but with counts based on whether the number of APE scores below 30% was higher when trained with the genre or full dataset models. Table 5.4 shows similar statistics but for evaluations of full genres. The results varied depending on which algorithm was used, and the best effect was achieved while using RF, which was also concluded to be the generally best performing algorithm in Chapter 5.1. For that algorithm, using the genre model generated lower MAPE scores for 60.7% of the artists, and the full dataset model was slightly better at 62.3%. The number of artists not achieving a lower MAPE score from using either the genre or the full dataset model is 118 (28.6%). That was calculated from the table in Appendix B.3. Table 5.4 shows that evaluating the separate genres using the full dataset was mostly effective, especially using the RF algorithm. In conclusion, it can be said that for the most part, the effect of using additional contextual training data was positive, especially for RF, although there are exceptions.

Figure 5.5: Number of artists that achieved lower MAPE scores when evaluated with models trained with genre data or the full dataset.

46 ED TOTAL KN LR NN RF G F G F G F G F A:AM 41 14 16 27 28 20 26 27 21 A:AR 46 26 26 17 19 29 35 26 24 A:BM 43 23 21 24 22 29 29 28 33 A:CR 13 5 7 6 4 6 9 7 6 A:DE 34 21 22 23 21 23 24 23 29 A:DO 64 33 37 28 29 32 41 35 35 A:EL 40 23 25 20 18 19 20 23 25 A:HM 35 19 19 18 20 14 18 25 22 A:PR 43 20 19 23 25 22 21 21 23 A:SR 26 13 13 15 13 10 11 18 18 A:TR 27 12 13 16 17 16 16 17 21 A:all 412 209 218 217 216 220 250 250 257 Table 5.2: Counts of artist data evaluation where models trained with genre data or full datasets generated lower MAPE scores than when using artist models.

ED TOTAL KN LR NN RF G F G F G F G F A:AM 41 15 14 15 14 14 17 22 21 A:AR 46 23 22 19 16 20 12 19 19 A:BM 43 20 19 16 14 21 18 21 21 A:CR 13 7 7 3 1 5 4 9 9 A:DE 34 14 16 18 16 19 20 19 21 A:DO 64 22 25 27 23 19 24 24 27 A:EL 40 17 16 16 10 16 17 13 20 A:HM 35 15 17 13 8 7 8 13 14 A:PR 43 13 14 11 16 14 18 20 20 A:SR 26 6 7 11 8 11 11 12 12 A:TR 27 9 12 9 10 11 11 12 15 A:all 412 161 169 158 136 157 160 184 199 Table 5.3: Counts of artist data evaluation where models trained with genre data or full dataset generated a higher amount of APE scores below 30% than when using artist models.

ED TOTAL KN LR NN RF MAPE APE<30% MAPE APE<30% MAPE APE<30% MAPE APE<30% G:AM 1 0 1 0 0 1 1 1 0 G:AR 1 1 1 0 1 1 0 1 0 G:BM 1 0 0 1 0 1 0 1 1 G:CR 1 1 1 0 0 1 1 0 1 G:DE 1 1 1 0 0 1 1 1 1 G:DO 1 0 1 0 1 1 1 1 1 G:EL 1 1 0 0 0 1 0 1 1 G:HM 1 0 0 0 0 0 0 1 1 G:PR 1 1 0 1 0 1 1 1 1 G:SR 1 0 1 0 0 1 1 1 0 G:TR 1 1 1 0 1 1 0 1 1 Table 5.4: Counts of genre data evaluation where models trained with full dataset generated lower MAPE scores and higher amount of APE scores below 30% than when using genre models.

5.3 Efficiency of features

Table 4.5 shows the full dataset being evaluated in comparison to the same data with each of the separate variables excluded. When a variable is excluded, R² scores are generally slightly lower while MAPE scores are a bit higher, signifying that they are all relevant features to a varying extent. However, on some occasions the exclusion of a feature can produce slightly better numbers for one algorithm but at the same time appear to worsen the scores for others. The independent variable showing the largest effect when excluded is clearly artist.

5.4 Accuracy of price estimation

Previously in this chapter, it has been established that using the RF algorithm with training data from a broader scope than that of the testing data (i.e. more artists within the same genre, or artists from additional genres) has generally been the most effective approach, although for some parts of the data another way might be more effective. The lowest MAPE score achieved when evaluating the full dataset was 55.8 (Table 4.3), and was the result of the RF algorithm using a single model trained with the full dataset. The APE frequency distribution for the evaluation (fourth row of Table 4.4) shows that 19.3% of the predictions have a difference of less than 10% from the actual prices, while 17% are in the 10-20% span and 13.5% are within the 20-30% span. That means that roughly half (49.8%) of all the predictions done with the 7556 testing samples had a smaller difference than 30% from the actual prices. Narrowing down the scope of data being tested into genres and artists, both better and worse results can be observed. The table in Appendix B.1 shows R² and MAPE scores, while the table in Appendix B.2 shows the APE frequency distribution of genre data being evaluated. Here it is observable that the price prediction developed in this project is more effective for some genres than others. The death metal genre displays the lowest MAPE score (34.78 using RF and the full dataset model). The black metal genre has a slightly higher MAPE (35.23 using the same approach) but shows the highest amount of APE scores below 30% for any genre in the dataset. It has 23.9% of its predictions in the 0-10% range, 20.3% in the 10-20% range, and 16.1% in the 20-30% range, meaning that around 60% of the predictions in the set have less than a 30% difference from the actual price. For the rest of the genres evaluated, the results vary, with a mean MAPE of 53.3 using RF and the full dataset model. MAPE scores were calculated for all artists using each algorithm, and models trained with the artist data, genre data, and full dataset, as can be seen in the table in Appendix B.3. The mean MAPE scores for every approach in each genre, as well as for all artists combined, can be viewed in Table 5.5. It shows that the best mean score is 50.53 using RF and the model trained with the full dataset. Looking at the genres in the same group, the means largely vary between around 35 and 60, with some exceptionally high examples like the alternative metal (64.79) and classic rock (82.37) genres. The mean MAPE scores for artists, generated with the models trained with the full dataset, are displayed visually in Figure 5.6.

ED TR: FULL DATASET TR: GENRE TR: ARTIST KN LR NN RF KN LR NN RF KN LR NN RF A:AM 90.61 109.22 89.29 64.79 80.47 93.72 109.63 68.77 83.44 112.78 132.25 70.09 A:AR 67.59 120.34 74.51 59.41 67.94 114.75 84.15 64.99 76.5 110.21 108.24 69.27 A:BM 47.45 56.96 44.42 35.78 46.79 58.75 47.65 39.53 48.68 68.27 55.4 46.08 A:CR 74.26 126.15 90.69 82.37 74.83 115.68 102.07 79.62 75.22 110.95 110.12 81.85 A:DE 40.24 46.49 46.87 33.75 39.68 44.69 54.81 39.61 42.89 64.06 64.67 49.39 A:DO 56.09 58.26 57.23 43.49 54.32 58.13 62.89 44.57 52.84 82.6 75.81 50.93 A:EL 54.11 87.86 76.02 52.18 54.6 88.04 84.31 56.57 57.46 129.85 97.73 64.85 A:HM 62.65 73.97 68.85 53.69 62.19 70.89 69.55 59.09 57.33 96.37 74.58 67.56 A:PR 70.29 84.49 79.17 58.26 70.11 90.64 86.88 59.77 65.74 91.51 92.37 59.92 A:SR 56.66 75.65 65.23 47.45 57.74 70.68 69.07 52.86 66.31 77.76 55.12 67.0 A:TR 57.3 59.5 56.33 43.78 57.5 57.5 65.4 47.5 54.58 77.02 78.49 56.55 A:all 61.14 79.42 66.76 50.53 59.86 76.9 74.6 54.13 61.12 92.57 85.7 60.31 Table 5.5: Mean MAPE scores for artists divided in genres.

Figure 5.6: Mean MAPE scores for artists divided in genres, using full dataset models.

6 Discussion

In Table 1.1, four questions (Q1-Q4) forming the foundation of this project were formulated. In the previous chapter, the results were analyzed with the intent to put them in relation to the problem formulations. In this chapter, the answers to the questions are discussed based on the new findings and in comparison with the observations on related work made in Chapter 1.2.

6.1 Accuracy of price estimation (Q1)

Q1 reads: What level of price prediction accuracy is it possible to achieve using common algorithms and a dataset of vinyl records? Considering the many ways this problem can be approached in terms of data collection, feature construction, algorithm selection, model tuning, and more, this is a complex question to give a definitive answer to. One thing that is certain when looking at the results and analysis is that the accuracy differs greatly for different portions of data evaluated. In this project, the data in focus has been a rather diverse dataset, including hundreds of artists, a dozen different genres, and items released over several decades. The idea behind that was to introduce this concept using a relatively broad approach, with the intent to produce a richer groundwork than if the focus had been more narrow. That being said, the findings should not be considered final in the broader sense and are subject to further development. The best MAPE score achieved with the full dataset of 7556 testing samples was 55.84 and was the result of using an RF model trained with the training samples of the full dataset. It was established in Chapter 5 that the MAPE score is greatly affected by a relatively small number of unusually high APE scores, meaning that the MAPE score of 55.84 does not fairly represent the majority of the estimations. Rather, the APE frequency distribution table tells that over 70% of the estimations have a lower APE score than the MAPE score, with one-fifth of the predictions having a less than 10% APE and half of the predictions having less than 30% APE. That can be considered a good starting point, as a majority of the predictions have a fairly low error. The price prediction results observed in related work are presented using different metrics and approaches and are not always directly comparable with the results of this study, but there are some interesting examples, as R² is often used in addition to some other metric. Examples of results in the real estate area include a study where six algorithms are compared and the R² scores are between 0.665 and 0.918 (Gradient Boosting Regression is the prominent algorithm) [24], a study where a NN model generates a MAPE score of 2.5 [16], a study where a Support Vector Machine model is used to generate an R² of 0.968 [26], a study where LR and RF are compared and generate scores of 0.696 and 0.878 respectively [30], and a study where two NN models are used to generate MAE (mean absolute error) scores between 0.67% and 2.34% [15]. Although there are varying results in these studies, result scores such as R² and MAPE generally show better values than in this study. This can be interpreted as the hedonic features in real estate being stronger and more reliable determinants of the objects' value. When it comes to collectibles, examples include the study of price prediction of U.S. coins, where R² values range between 0.02 and 0.23 [37], and the study of classical music manuscripts, which shows an R² of 0.797 [38]. There is no strong coherence found in the collectibles studies, possibly because of the differences in approach, data, and nature of the objects. Looking at the subgenres, both worse and better results compared to the evaluation of the full dataset can be found. Death metal (best MAPE: 34.78) and black metal (best MAPE: 35.23) show the best results, while punk rock (best MAPE: 62.94) and classic rock (best MAPE: 79.62) show the worst. Such large differences in accuracy suggest that different approaches may be suitable for different genres. Combining data from multiple genres which all generate results at a varying level of accuracy might not be the most viable way to proceed. It seems that the characteristics of the different genres may not be homogenous enough that they would all profit from being evaluated as a group. That being said, the black metal genre data does indeed show the best results when being evaluated with the models trained with the full dataset. However, it raises the question of what particular data in the training set has a positive or negative impact on a specific genre. Using another selection of genres, partly or completely, would likely show other examples of either more or less successful predictions. This essentially leads to more reflection around what the factors are that make the evaluation of some of the data reach a higher accuracy than other data. It was all gathered from the same source, has the same format, and is evaluated the same way. Still, the outcome can be quite divergent. What does it say about the genres and their data? Classic rock, for example, was one of the genres that I thought would generate the best results, considering the extensive discographies of those artists. On the contrary, its results were the worst of all genres. What is it that characterizes the black metal genre that gives it a MAPE score of less than half of that of classic rock? Presumably, different approaches for different data are needed when the results are this diverse.

6.2 Additional contextual data (Q2)

Q2 reads: What are the effects of using additional contextual data while making predictions (e.g. training a model with more artists within the same genre when evaluating an artist's data)? While the concepts of evaluating data of multiple genres can also be discussed within the boundaries of Q1 (as shown above), Q2 specifically focuses on the effect that a larger data scope as training data has on the evaluated scope. Table 5.5 shows the mean MAPE scores for all artists' data evaluated with their own models, their genres' model, and the full dataset model. Looking at the RF algorithm, in all cases but the classic rock genre, the MAPE means are improved firstly by using the genre models, and further when using the full dataset model. While this suggests that it is generally positive to use extra data in this manner, it is not shown explicitly that the data necessarily needs to be within the same genre, as the results using the full dataset model are generally even better. However, it should be remembered that the genre variable is included in the full dataset, maintaining the genre association for each sample also while evaluating using the full dataset model. Nonetheless, it can still be asked what has the biggest influence: the level of similarity of other data, or the amount of other data. Either way, one can assume that it would be beneficial to further investigate the effects of combining certain genres, as they may possess a level of compatibility with each other.

6.3 Comparison of algorithms (Q3)

Q3 reads: Utilizing multiple algorithms, how do they perform in comparison? LR and NN were the two most commonly used algorithms in the related work. RF occurred in several studies as well, while KN was only used in one. There were plenty of examples where the different algorithms were concluded to be effective, but not very many where the ones used in this project were compared to each other. The studies where algorithms were compared were strictly focused on real estate and not collectibles. Therefore, there were no very clear expectations of the outcome of the comparison. As described in Chapter 5.1, RF showed a distinct advantage in the experiment, generating the best results for a majority of artists and genres as well as for the full dataset. This was not unreasonable to expect, as RF outperformed its counterpart in every instance where it was compared to another algorithm in the related work [25], [29], [30]. Considering the high popularity of LR and NN and the seemingly lower popularity of KN according to the research in Chapter 1.2, it was assumed that KN would produce weaker results than the former two. However, throughout the results, KN generated predictions with higher accuracy than both of them. In Chapter 2.2.1, it was explained that the model hyperparameters were roughly adjusted in order to find general improvements that would be effective for all data in this project. Some of the algorithms implemented, like LR and KN, did not offer a very large selection of hyperparameters to fine-tune. However, a more advanced algorithm like NN has more implementation options. The class used, MLPRegressor from the Scikit-learn library, can be adjusted with 23 hyperparameters. If more time had been available for the implementation, it would have been interesting to see whether this model could have produced better results.

6.4 Efficiency of features (Q4)

Q4 reads: Are the features used in the dataset effective in the hedonic regression of vinyl records? The results show that all features have an impact on the predicted prices, as the results become slightly worse when any of them is excluded from the dataset. The feature that shows the largest effect on the prices is artist, which can be related to conclusions found in several studies about collectibles [32], [35], [36] and the single recent study about the prices of rare vinyl records [39, p. 11].

7 Conclusion

This is the first academic study where a dataset representing the complete vinyl discographies of a large number of artists was used to investigate the possibilities of price prediction of vinyl records. The intent was to explore the important principles within the area, e.g. construction of datasets, comparison of algorithms, and different ways to measure and analyze results, to establish a foundation that can be further developed. Four objectives to achieve this, all of which have been met during the course of the project, were listed in Table 1.2. It has been shown that the approach used in the project holds the potential to execute price predictions with high accuracy for a large part of the data examined. But it is also evident that a relatively limited, yet substantial, share of the predictions is often very inaccurate. The general results differ markedly between evaluations of different divisions of data, such as different genres or artists. That suggests that the approach developed in the project is varyingly effective depending on some unidentified characteristics of the data being used. The Random Forest algorithm showed the strongest results, followed by k-Nearest Neighbors, while Linear Regression and Neural Network generated the least positive results. Results and conclusions of this project can be relevant for science, more specifically within the area of price estimation for collectibles, particularly that of vinyl records or perhaps other types of phonographic media. The project demonstrates a set of techniques and approaches along with their results, which can serve as a foundation for expanding scientific research. If future development were to lead to a more stable approach (in terms of minimizing the share of erroneous predictions), the concept of price prediction of vinyl records could be relevant outside of science as well, for instance for industry (e.g. record dealers) or society (e.g. record collectors). However, the configuration of such concepts was not part of this project.

7.1 Future work

To improve the approach used in this project, further work can be put into several areas. The construction of more variables could be beneficial. One specific variable that was not possible to use in this project is the quantity of a limited item. At the moment, there is just a limited variable, stating whether or not the item is a limited edition. To construct more variables, there is a need to further investigate what kind of information is relevant to the price prediction of vinyl records. What information can be retrieved that is not included in the dataset of this project, and how complicated is the process of retrieving it? This could be quite complex depending on the size of the dataset and the nature of the new variables. As Random Forest produced the overall best results in this project, it could be a good idea to continue the research with a larger focus on this algorithm. Can the hyperparameters be further adjusted to generate higher accuracy? Can they be tweaked individually for different portions of data, and can this somehow be done automatically? There could also be value in looking further at Neural Networks, and investigating whether a more meticulous implementation could generate better results than in this project. Regarding the varying accuracy of the results, it would be relevant to study the factors which create this diversity in prediction quality. Is it possible to define these factors and learn something useful from the poor predictions? Since individual genres and artists generated varying results, it could be of importance to evaluate the compatibility between different divisions of data. To have a large dataset, the data of several artists and genres needs to be combined. However, since many of them have proven to generate different results, it is reasonable to believe that the way in which they are combined and related to each other will be crucial for the outcome. The software developed for the project is tailor-made for the vinyl record dataset. That means that several parts of the code in the experiment are customized to be used with a specifically designed dataset. To make the experiment more generalizable, so it can be used to evaluate other data groups, a general or easily customizable version of the software could be created.

References

[1] RetroSound (2018, Mar 14). The decline of the compact disc [Online]. Available: https://www.retromanufacturing.com/blogs/news/the-decline-of-the-compact-disc

[2] M. Leimkuehler (2019, Jan 7) Vinyl Sales Continued To Grow In 2018, Report Says [Online]. Available: https://www.forbes.com/sites/matthewleimkuehler/2019/01/07/vinyl-sales-grow-2018-buzzan gle-beatles-kendrick-lamar-queen-album-sales

[3] A. Müller and S. Guido, Introduction to Machine Learning with Python - A Guide for Data Scientists, O’Reilly Media, 2016.

[4] G. Seif (2018, Mar 5). Selecting the best Machine Learning algorithm for your regression problem [Blog]. Available: https://towardsdatascience.com/selecting-the-best-machine-learning-algorithm-for-your-regre ssion-problem-20c330bad4ef

[5] P. Priyadarshini (2019, May 15). How to Choose ML Algorithms for Regression Problems? [Blog]. Available: https://geekflare.com/choosing-ml-algorithms/

[6] S. Shukla. Regression and Classification, Supervised Machine Learning [Online]. Available: https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/

[7] N. Shitut (2020, Jan 3). Most Popular Regression Algorithms In Machine Learning [Blog]. Available: https://analyticstraining.com/popular-regression-algorithms-ml/

[8] Maher (2019, Mar 11). Which machine learning model to use? [Blog]. Available: https://towardsdatascience.com/which-machine-learning-model-to-use-db5fdf37f3dd

[9] R. Harlalka (2018, Jun 16). Choosing the Right Machine Learning Algorithm [Blog]. Available: https://hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f

[10] H. Li (2017, Apr 12). Which machine learning algorithm should I use? [Blog]. Available: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm- use/

[11] R. Shaw (2019, Jun 26). The 10 Best Machine Learning Algorithms for Data Science Beginners [Blog]. Available: https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/

[12] J. de Haan and E. Diewert, “Hedonic Regression Methods” in Handbook on Residential Property Price Indices, Luxembourg, 2011.

[13] H. Xiaolong, Z. Ming, “Applied research on real estate price prediction by the neural network” in The 2nd Conference on Environmental Science and Information Application Technology, July 2010, Vol. 2, pp. 384-386.

[14] D. Li et al., “A SVR based forecasting approach for real estate price prediction” in 2009 International Conference on Machine Learning and Cybernetics , July 2009, Vol.2, pp. 970-974.

[15] L. Li and K. Chu, “Prediction of real estate price variation based on economic parameters” in 2017 International Conference on Applied System Innovation (ICASI), May 2017, pp. 87-90.

[16] H. Xue, “The Prediction on Residential Real Estate Price Based on BPNN” in 2015 8th International Conference on Intelligent Computation Technology and Automation (ICICTA), June 2015, pp. 1008-1013.

[17] J. Guan, “Analyzing Massive Data Sets: An Adaptive Fuzzy Neural Approach for Prediction, with a Real Estate Illustration” in Journal of Organizational Computing and Electronic Commerce, January 2014, Vol. 24, pp. 94-112.

[18] E. Ahmed and M. Moustafa, “House price estimation from visual and textual features” in arXiv.org , Sep 27, 2016.

[19] S. Zaddach and H. Alkhatib, “Least squares collocation as an enhancement to multiple regression analysis in mass appraisal applications” in Journal of Property Tax Assessment & Administration, January 2014, Vol.11.

[20] L. Choy et al., “Housing attributes and Hong Kong real estate prices: a quantile regression analysis” in Construction Management and Economics, May 2012, Vol. 30 (5), pp. 359-366.

[21] R. Cellmer, “The use of the geographically weighted regression for the real estate market analysis” in Folia oeconomica stetinensia, Jan 2012, Vol.11 (1), pp. 19-32.

[22] P. Bidanset and J. Lombard, “Evaluating Spatial Model Accuracy in Mass Real Estate Appraisal A Comparison of Geographically Weighted Regression and the Spatial Lag Model” in Cityscape , 2014, Vol.16 (3), pp. 169-182.

[23] N. Ghosalkar and S. Dhage, “Real Estate Value Prediction Using Linear Regression” in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), August 2018, pp. 1-5.

[24] R. Madhuri et al., “House Price Prediction Using Regression Techniques: A Comparative Study” in 2019 International Conference on Smart Structures and Systems (ICSSS), March 2019, pp. 1-5.

[25] C. Wang and H. Wu, “A new machine learning approach to house price estimation” in New Trends in Mathematical Sciences, 2018, No. 4, pp. 165-171.

[26] J. Shim and C. Hwang, “Kernel-based geographically and temporally weighted autoregressive model for house price estimation” in PloS one, 2018, Vol. 13 (10).

[27] J. Liu et al. “A Geographically Temporal Weighted Regression Approach with Travel Distance for House Price Estimation” in Entropy , August 2016, Vol. 18 (8), pp. 303.

[28] M. Iacobini and G. Lisi, “ Estimation of a Hedonic House Price Model with Bargaining: Evidence from the Italian Housing Market” in Aestimum , August 2013.

[29] M. Čeh et al., “Estimating the Performance of Random Forest versus Multiple Regression for Predicting Prices of the Apartments” in Cognitive Aspects of Human-Computer Interaction for GIS, pp. 125-140, 2019.

[30] I. Engström and A. Ihre, “Predicting house prices with machine learning methods”, Bachelor thesis, KTH, School of Electrical Engineering and Computer Science (EECS), Stockholm, Sweden, 2019.

[31] E. Teti et al., “Ephemeral Estimation of the Value of Art” in Empirical Studies of the Arts, January 2014, Vol. 32 (1), pp. 75-92.

[32] J. Nahm, “Price determinants and genre effects in the Korean art market: a partial linear analysis of size effect” in Journal of Cultural Economics, 2010, Vol. 34 (4), pp. 281-297.

[33] D. Hodgson, “Age–price profiles for Canadian painters at auction” in Journal of Cultural Economics, 2011, Vol. 35 (4), pp. 287-308.

[34] D. Witkowska, “An Application of Hedonic Regression to Evaluate Prices of Polish Paintings” in International Advances in Economic Research, 2014, Vol. 20 (3), pp. 281-293.

[35] J. Fedderke and K. Li, “Art in Africa: Hedonic price analysis of the South African fine art auction market, 2009–2014” in Economic Modelling, January 2020, Vol. 84, pp. 88-101.

[36] P. Pradier et al., “Autographs and the global art market: the case of hedonic prices for French autographs (1960–2005)” in Journal of Cultural Economics, 2016, Vol. 40 (4), pp. 453-485.

[37] M. Dickie et al. “Price determination for a collectible good: the case of rare U.S. coins” in Southern Economic Journal, July 1994, Vol. 61(1), pp. 40.

[38] P. Georges and A. Seçkin, “Black notes and white noise: a hedonic approach to auction prices of classical music manuscripts” in Journal of Cultural Economics, 2013, Vol. 37 (1), pp. 33-60.

[39] H. Sonnabend and S. Cameron, “Pricing the Groove: Hedonic equation estimates for rare vinyl records”, University of Hagen, Department of Economics, Hagen, Germany, 2019.

[40] A. Burke, “The Demand for Vinyl L.P.s 1975-1988: Time Series Estimation of a Product Group in the Presence of Product Differentiation Innovation” in Journal of Cultural Economics, 1994, Vol. 18, pp. 41-64.

[41] Rare Records (2020). Vinyl Records Value [Online]. Available: https://www.rarerecords.net/vinyl-records-value/

[42] J. Bossmann (2016, Oct 21) Top 9 ethical issues in artificial intelligence [Online]. Available: https://www.weforum.org/agenda/2016/10/top-10-ethical-issues-in-artificial-intelligence/

A Appendix 1

"""
Dataset generator module.
Includes functions for web scraping artists discographies from Discogs.com
"""
import math
import time
import re
from contextlib import closing
from requests import get
from bs4 import BeautifulSoup as bs


class Release:
    """ Release class. Scrapes discogs and stores data about release. """

    def __init__(self, url):
        """ Gets data and stores in variables.

        Args:
            url (str): url to release
        """
        if url:
            # scrape release data from discogs
            data = scrape_release(url)

            self.url = url
            self.discogs_release_id = get_id(url)
            self.title = data["title"]
            self.label = data["label"]
            self.year = data["year"]
            self.country = data["country"]
            self.format_all = data["format_all"]
            self.issue_type = data["issue_type"]
            self.limited = data["limited"]
            self.picture_disc = data["picture_disc"]
            self.box_set = data["box_set"]
            self.numbered = data["numbered"]
            self.test_pressing = data["test_pressing"]
            self.promo = data["promo"]
            self.colored = data["colored"]
            self.price = data["price"]
        else:
            # leave attributes blank until later
            self.url = None
            self.discogs_release_id = None
            self.title = None
            self.label = None
            self.year = None
            self.country = None
            self.format_all = None
            self.issue_type = None
            self.limited = None
            self.picture_disc = None
            self.box_set = None
            self.numbered = None
            self.test_pressing = None
            self.promo = None
            self.colored = None
            self.price = None

        # gets set after init
        self.artist = None
        self.discogs_master_id = 0
        self.format_type = None
        self.versions = None
        self.vinyl_exclusive = None
        self.chronology = None


def get_links(artist_url): """ Gets links for all releases by an artist from Discogs.

Args: artist_url (str): url for artists page, the part after 'arists/' in url

Returns: list: list of links """ # create base URLs for discogs pages url = "https://www.discogs.com/artist/" + artist_url + \ "?sort=year%2Casc&limit=500&subtype=_format&" + \ "filter_anv=0&type=Releases&page=" url_base = url[:23]

links = dict() # to store links to all items

# collect links for Albums and Singles for frmt in ["albums", "singles"]: # create correct word for use in url frmt_text = "Albums" if frmt == "albums" else "Singles-EPs"

# get first page of artists discography for this category pages = [] pages.append(get_soup(url.replace("_format", frmt_text) + "1"))

# continue if no results was found for category if not results_found(pages[0]): links[frmt] = None continue

# find out how many results were found soup = pages[0] items = soup.find("strong", {"class": "pagination_total"}) items = int(items.get_text().strip().split()[-1].replace(",", ""))

# check total amount of pages page_count = items/500 page_count = items/500 if not page_count%500 else math.ceil(items/500)

# download rest of pages if more than one count = 1 while count < page_count: pages.append(get_soup(url.replace("_format", frmt_text) + str(count+1))) count += 1 # create variables to store links masters = list() uniques = list()

61 # loop through each page and save in correct list for soup in pages: # find all cards cards = soup.findAll("tr", {"class": "card"}) # loop through cards and collect links for card in cards: # get link href and put in right list link = card.find("a").get("href") slash = link.rfind("/") link_type = link[slash-7:slash] if link_type == "/master": # for masters, make a dict entry storing a link and list # and bool (for storing vinyl exclusivity variable) masters.append([link, list(), 1]) elif link_type == "release": # do not add if it is not a vinyl if is_valid_release(card): uniques.append(link)

# save lists in dicts for right formats links[frmt] = dict() links[frmt]["uniques"] = uniques links[frmt]["masters"] = masters

# loop through masters and get links to releases for frmt in ["albums", "singles"]: # continue only if results were found if not links[frmt]: continue for master in links[frmt]["masters"]: # download master page and create bs object soup = get_soup(url_base + master[0]) # get cards cards = soup.findAll("tr", {"class": "card"}) # loop through cards for card in cards: # set master[2] to False if non-vinyl item found among items if master[2] == 1: if not is_vinyl(card) and is_official(card): master[2] = 0 # save links for valid items if is_valid_release(card): # get title td title_td = card.find("td", {"class": "title"}) # get link that includes '/release/' title_link = title_td.select("a[href*=release]")[-1] master[1].append(title_link.get("href"))

return links def get_releases(artist_name, artist_url, to_csv=False): """ Get Release objects for all releases by artist from Discogs.

    Args:
        artist_name (str): name of artist
        artist_url (str): url for artist's page, the part after 'artist/' in the url
        to_csv (bool): write results to file if True

    Returns:
        list: Release objects
    """
    links = get_links(artist_url)
    url_base = "https://www.discogs.com"

    # create list for all Release objects
    releases = list()

    # scrape release pages for data and create objects
    for frmt in ["albums", "singles"]:
        # continue only if results were found
        if links[frmt]:
            # unique releases
            for link in links[frmt]["uniques"]:
                # init object with basic data from discogs release page
                release = Release(url_base + link)
                # add extra data
                release.artist = artist_name
                release.format_type = frmt[:-1]
                release.vinyl_exclusive = 1
                release.issue_type = 1
                release.versions = 0
                releases.append(release)
            # master releases
            for mstr in links[frmt]["masters"]:
                for link in mstr[1]:
                    # init object with basic data from discogs release page
                    release = Release(url_base + link)
                    # add extra data
                    release.artist = artist_name
                    release.format_type = frmt[:-1]
                    release.vinyl_exclusive = mstr[2]
                    release.discogs_master_id = get_id(mstr[0])
                    releases.append(release)

    # save to csv if to_csv is True
    if to_csv:
        with open(artist_name + ".txt", "w", encoding="utf-8") as file:
            for release in releases:
                file.write(f"{release.get_csv()}\n")

    return releases


def get_releases_from_csv(artist):
    """ Reads from file and creates objects of releases in the same stage
    as after the get_releases function, then returns them. Primarily used
    in the development stage.

    Args:
        artist (string): name of artist

    Returns:
        list: Release objects
    """
    releases = list()

    with open("data " + artist + ".txt") as data_file:
        for line in data_file:
            release = Release(None)
            release.add_csv(line)
            releases.append(release)

    return releases


def scrape_release(url):
    """ Gets soup of release page, gets necessary data and returns it.

    Args:
        url (str): url to release

    Raises:
        TypeError: if price retrieved is not in USD format

    Returns:
        dict: data scraped from discogs release page
    """
    # create dict and soup object
    data = dict()
    soup = get_soup(url)

    # find title
    title = soup.find("h1", {"id": "profile_title"})
    data["title"] = title.findAll("span")[-1].get_text().strip()

    # find label
    label = soup.find(text=re.compile("Label:")).find_parent()
    label = label.find_next_sibling().find("a")
    data["label"] = label.get_text() if label else None

    # find year
    year = soup.find(text=re.compile("Released:")).find_parent()
    year = year.find_next_sibling().get_text().strip()[-4:]
    data["year"] = int(year) if is_valid_int(year) else None

    # find country
    country = soup.find(text=re.compile("Country:")).find_parent()
    data["country"] = country.find_next_sibling().get_text().strip()

    # find format div
    frmt = soup.find(text=re.compile("Format:")).find_parent()
    frmt_soup = frmt.find_next_sibling()  # save soup for later
    frmt = frmt_soup.get_text().lower().replace("\n", " ")
    frmt = re.sub(' +', ' ', frmt.strip())
    data["format_all"] = frmt

    # check if reissue
    data["issue_type"] = None
    words = ["reissue", "reprint", "repress"]
    if any(x in frmt for x in words):
        data["issue_type"] = 3

    # check if limited
    data["limited"] = 1 if "limited" in frmt else 0

    # check if picture disc
    data["picture_disc"] = 1 if "picture" in frmt else 0

    # check if box set
    data["box_set"] = 1 if "box" in frmt else 0

    # check if numbered
    data["numbered"] = 1 if "numbered" in frmt else 0

    # check if test pressing
    data["test_pressing"] = 1 if "test pressing" in frmt else 0

    # check if promotional pressing
    data["promo"] = 1 if "promo" in frmt else 0

    # check if colored vinyl
    data["colored"] = 0
    italics = frmt_soup.findAll("i")
    # phrases indicating something else than colored vinyl
    not_color = ["gatefold", "lenticular", "180g", "autographed", "signed",
                 "gatefold, 180g", "180g, gatefold", "180 gram", "numbered",
                 "hand numbered", "black", "black vinyl", "single"]
    # loop through italic phrases
    for i in italics:
        i = i.get_text().lower()
        # continue to next phrase on exact match with a non-color indicator
        if i in not_color:
            continue
        # set colored to 1
        data["colored"] = 1

    # check for median selling price
    try:
        price = soup.find(text=re.compile("Median:")).find_parent()
        price = price.find_parent().get_text().replace("\n", " ")
        price = re.sub(' +', ' ', price.strip())
        data["price"] = price[8:]
    except AttributeError:
        data["price"] = None

    # raise error if price is not in USD format
    if data["price"] and data["price"][:1] != "$" and data["price"] != "--":
        raise TypeError(f"Price is not in USD format. ({data['price']})")

    return data


def post_scrape(releases):
    """ For use after get_releases. Calculate chronology (and add year if
    missing), add info about issue type and amount of other versions
    available. Remove items with no prices available.

    Args:
        releases (list): list of Release objects

    Returns:
        list: Release objects
    """
    # get earliest and latest release years of artist
    earliest_year, latest_year = get_earliest_and_latest_year(releases)

    # get average year of the artist's career
    career_mean = get_mean(earliest_year, latest_year)

    # make separate lists of uniques and masters
    uniques = [x for x in releases if x.discogs_master_id == 0]
    masters = [x for x in releases if x.discogs_master_id != 0]

    # loop through uniques and calculate chronology and set missing years
    for rel in uniques:
        calculate_chronology(rel, earliest_year, career_mean)

    # make set for done masters
    done = set()

    # loop through masters and add data
    for rel in masters:
        # continue to next if master id is already done
        if rel.discogs_master_id in done:
            continue

        # add id to done
        done.add(rel.discogs_master_id)

        # get releases of master
        m_releases = get_releases_by_master_id(rel.discogs_master_id, masters)

        try:
            # try to get earliest and latest release year of master
            m_earliest, m_latest = get_earliest_and_latest_year(m_releases)

            # get average year of the master release
            master_mean = get_mean(m_earliest, m_latest)
        except AttributeError:
            # no years found, use general artist year mean instead
            master_mean = career_mean
            m_earliest, m_latest = None, None

        reissue_found = False

        # loop through every release in master group
        for mstr in m_releases:
            # add issue type for reissues
            if not mstr.year or mstr.year > m_earliest:
                # set to reissue if release doesn't have year or if released
                # later than earliest release for master
                mstr.issue_type = 3

            # calculate chronology and set missing years for all releases
            # with current master id
            calculate_chronology(mstr, earliest_year, master_mean)

            # add other versions number
            mstr.versions = len(m_releases) - 1

            # set reissue_found to True if reissue is found
            if mstr.issue_type == 3:
                reissue_found = True

        # loop again and set issue type for originals
        for mstr in m_releases:
            # if release is from earliest year and not set yet
            if mstr.issue_type is None:
                # set to 1 (original without reissue) if no reissue found
                # and to 2 (original with reissue) if reissue found
                mstr.issue_type = 1 if not reissue_found else 2

    # delete releases with price missing
    releases = [x for x in releases if x.price and x.price != "--"]

    return releases

def get_soup(url):
    """ Downloads webpage, makes soup and returns it.

    Args:
        url (str): url for webpage

    Returns:
        obj: soup object representing page
    """
    # print("Downloading URL: " + url)

    while True:
        # download page
        with closing(get(url, stream=True)) as resp:
            # decode bytes object to string
            html = resp.content.decode("utf-8")
            soup = bs(html, "html.parser")

            # get page title
            title = soup.title.string

            # check if too many requests were made
            if "Error 429" in title:
                # if yes, wait for 10 seconds and try again
                print("Too many requests made. Waiting 10 sec and trying again.")
                time.sleep(10)
            else:
                # return soup of page if no error occurred
                return soup.find("div", {"id": "page"})


def results_found(soup):
    """ Check for 404 error; if it is found, return False, otherwise return True.

    Args:
        soup (obj): soup for full page

    Returns:
        boolean
    """
    if not soup.find(text=re.compile("404! Oh no!")):
        return True
    return False


def get_releases_by_master_id(m_id, releases):
    """ Return all releases with a certain master id.

    Args:
        m_id (int): discogs master id
        releases (list): list of Release objects

    Returns:
        list: Release objects
    """
    return [x for x in releases if x.discogs_master_id == m_id]


def calculate_chronology(release, earliest, mean):
    """ Calculate chronology score and add year to object if missing.

    Args:
        release (Release): object to calculate
        earliest: earliest release year of artist
        mean: mean value to use if year is missing
    """
    # set year for release if not set
    release.year = mean if not release.year else release.year
    now = this_year()
    max_score = 20
    span = now - earliest

    # set chronology to 1 if released first year
    release.chronology = 1 if release.year == earliest else release.chronology

    # set chronology to max if released this year
    release.chronology = max_score if release.year == now else release.chronology

    # find release year relative to career start
    year = release.year - earliest
    year = year if year >= 1 else 1
    year = year if year <= span else span

    # calculate score
    score = int(round((year / span) * max_score))
    score = score if score >= 1 else 1
    score = score if score <= max_score else max_score
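    # Example with made-up years: for a career spanning 1990-2020 (span 30),
    # a release from 2005 gives year = 15 and a score of
    # round((15 / 30) * 20) = 10 on the 1-20 chronology scale.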

    # set score in object if not set yet
    release.chronology = score if not release.chronology else release.chronology


def get_earliest_and_latest_year(releases):
    """ Find and return the earliest and latest occurring years among a list
    of Release objects.

    Args:
        releases: list of Release objects

    Raises:
        AttributeError: no year found among releases

    Returns:
        tuple: earliest and latest years
    """
    earliest = this_year()
    latest = 0
    found = False

    # loop through releases and find years
    for release in releases:
        if is_valid_int(release.year):
            found = True
            if release.year < earliest:
                earliest = release.year
            if release.year > latest:
                latest = release.year

    # raise exception if no year was found
    if not found:
        raise AttributeError("No year found among releases.")

    return (earliest, latest)


def get_format_from_card(card):
    """ Get string from html card that represents the item's format.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        string
    """
    try:
        res = card.find("span", {"class": "format"}).get_text()
    except:
        res = card.find("td", {"class": "title"}).get_text()
        res = res[res.index("("):res.index(")") + 1]
    return res


def is_valid_release(card):
    """ Check if discogs item is valid, i.e. is a vinyl record and is
    official (not a pirate/bootleg release).

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    if not is_official(card):
        return False

    return is_vinyl(card)


def is_official(card):
    """ Check if discogs item is official or not.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    frmt = get_format_from_card(card)

    # return false if unofficial item
    if "Unofficial" in frmt:
        return False
    return True


def is_vinyl(card):
    """ Check if discogs item is a vinyl item.

    Args:
        card (obj): bs object representing discogs 'card'

    Returns:
        boolean
    """
    frmt = get_format_from_card(card)

    # strings to look for in format string
    vinyl = ["LP", "7\"", "10\"", "12\""]

    # return true if any of the accepted strings appear in format string
    if any(x in frmt for x in vinyl):
        return True
    return False


def is_valid_int(int_string):
    """ Check if string represents a valid integer and return true/false.

    Args:
        int_string (str)

    Returns:
        boolean
    """
    try:
        int(int_string)
        return True
    except (ValueError, TypeError):
        return False


def get_id(url):
    """ Returns the number at the end of a string following a '/',
    i.e. "../12345" returns 12345.

    Args:
        url (str): url string

    Returns:
        int: id
    """
    return int(url[url.rfind("/") + 1:])


def get_mean(earliest, latest):
    """ Calculates mean of two integers and returns it.

    Args:
        earliest (int)
        latest (int)

    Returns:
        int: mean of two ints
    """
    return int(round(latest - ((latest - earliest) / 2)))


def this_year():
    """ Returns the current year as int.

    Returns:
        int: current year
    """
    return time.localtime().tm_year


def generate_artist_discography(db, artist_name, artist_url):
    """ Generate dataset for artist, delete any existing rows from database
    of the artist, and add the new dataset.

    Args:
        db (Database): database object
        artist_name (str): name of artist
        artist_url (str): url to artist's discogs profile
    """
    # scrape and generate data for artist
    dataset = post_scrape(get_releases(artist_name, artist_url))

    # delete already existing entries of artist in database
    db.delete_dataset(artist_name)

    # add dataset to database
    db.add_dataset(dataset)

    return len(dataset)
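The functions above are intended to be used together. The following is a minimal usage sketch, not part of the project code: the database path and the artist name/URL slug are placeholders, and it is assumed that the Database class imported in Appendix 2 also provides the delete_dataset and add_dataset methods called by generate_artist_discography.

    from database import Database

    db = Database("db/vinylprice_db.sqlite")
    # "Some-Artist" stands in for the part of the Discogs artist URL that
    # follows "artist/"; it is not a value from the actual dataset.
    count = generate_artist_discography(db, "Some Artist", "Some-Artist")
    print(f"Stored {count} priced vinyl releases for Some Artist")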

A Appendix 2

""" Experiment module. Includes classes for loading a dataset and creating
regression models which produce results that can be saved to a database. """
import sqlite3
from time import time
from math import ceil, floor, sqrt

from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

import fix_functions as fix
from database import Database

db = Database("db/results.sqlite")
seed = 9
artist_min = 10
artist_max = 300
max_price = 300
train_size = 0.8


class Dataset():
    """ Loads data from dataset and splits it into training and testing. """

    def __init__(self):
        """ Init method. """
        self.load_data()
        self.trim_artists()
        self.train_test()

    def load_data(self):
        """ Loads data from sqlite3 database. """
        conn = sqlite3.connect("db/vinylprice_db.sqlite")
        columns = (
            "V.usd, V.artist, V.genre, V.issue_type, V.limited, V.chronology, "
            "V.test_pressing, V.promo, V.colored, V.picture_disc, V.versions, "
            "V.vinyl_exclusive, V.age, V.area, V.album, V.numbered, "
            "V.box_set, V.compilation"
        )

        sql = (
            f"SELECT {columns} FROM Vinyl V INNER JOIN (SELECT artist, "
            f"count(artist) as cnt FROM Vinyl GROUP BY artist) C ON V.artist = "
            f"C.artist WHERE C.cnt > {artist_min} AND usd < {max_price} ORDER "
            f"BY V.artist"
        )

        self.data = pd.read_sql_query(sql, conn)

        # store lists of artists and genres
        self.artists = self.data.artist.unique().tolist()
        self.genres = self.data.genre.unique().tolist()

    def get_data(self, artist=None, genre=None, sans=None):
        """ Gets all data or specific data based on parameters and returns it.

        Args:
            artist (str): name of artist to get data for
            genre (str): name of genre to get data for
            sans (str): name of column to exclude

        Returns:
            tuple: two DataFrame objects of training and testing data
        """
        train, test = self.train, self.test

        if sans is not None:
            train = train.drop(columns=[sans])
            test = test.drop(columns=[sans])

        if not (artist or genre):
            return (train, test)

        set_type = "artist" if artist else "genre"
        name = artist if artist else genre

        train = train[train[set_type] == name]
        test = test[test[set_type] == name]
        return (train, test)

    def trim_artists(self):
        """ Keeps at most artist_max rows for each artist. """
        temp = None
        for artist in self.artists:
            a = self.data[self.data['artist'] == artist]
            a = shuffle(a, random_state=seed)[:artist_max]
            temp = a if temp is None else temp.append(a)
        self.data = temp

    def train_test(self):
        """ Splits data into training and testing for each artist. """
        train = None
        test = None
        for artist in self.artists:
            rows = self.data[self.data['artist'] == artist]
            train_cnt = int(len(rows) * train_size)
            a_train = rows[:train_cnt]
            a_test = rows[train_cnt:]
            train = a_train if train is None else train.append(a_train)
            test = a_test if test is None else test.append(a_test)
        self.train = train
        self.test = test
        del self.data


class Regression():
    """ Trains models using data and performs predictions. """

    def __init__(self, artist=None, genre=None, sans=None):
        """ Init method. Gets data from the Dataset object and trains the models.

        Args:
            artist (str): name of artist to get data for
            genre (str): name of genre to get data for
            sans (str): name of column to exclude
        """
        # get data from the module-level Dataset instance ('data'),
        # which is created before any Regression object is constructed
        self.tt = data.get_data(artist, genre, sans)

        # save size of training set
        self.tr_size = len(self.tt[0])

        self.remove_unique_cols()
        self.one_hot_encoding()
        self.standardize()
        self.init_models_standard()
        self.train_models()

    def standardize(self):
        """ Standardizes data using MinMaxScaler. """
        scaler = MinMaxScaler()
        X_train_temp, y_train = self.get_xy(self.tt[0], False)
        X_test_temp, y_test = self.get_xy(self.tt[1], False)

        # fit scaler on training data
        scaler.fit(X_train_temp)

        # apply transformation on training data
        X_train = scaler.transform(X_train_temp)
        X_train = pd.DataFrame(X_train, index=X_train_temp.index,
                               columns=X_train_temp.columns)

        # apply same transformation to test data
        X_test = scaler.transform(X_test_temp)
        X_test = pd.DataFrame(X_test, index=X_test_temp.index,
                              columns=X_test_temp.columns)

        # join and save
        train = self.xy_join(X_train, y_train)
        test = self.xy_join(X_test, y_test)
        self.tt = (train, test)

    def init_models_standard(self):
        """ Creates a dictionary of the models to be used, mapping each model
        class to its hyperparameters. """
        model_classes = dict()

        # linear regression
        params_lr = {"fit_intercept": False}
        model_classes['lr'] = (LinearRegression, params_lr)

        # k-nearest neighbors
        params_kn = {"n_neighbors": 5, "weights": "distance"}
        model_classes['kn'] = (KNeighborsRegressor, params_kn)

        # random forest
        params_rf = {"n_estimators": 100}
        model_classes['rf'] = (RandomForestRegressor, params_rf)

        # neural network
        params_nn = {"solver": "lbfgs", "activation": "relu"}
        model_classes['nn'] = (MLPRegressor, params_nn)

        self.model_classes = model_classes

    def merge(self):
        """ Merge training and testing data and return it.

        Returns:
            DataFrame
        """
        return self.tt[0].append(self.tt[1])

    def unmerge(self, tt):
        """ Split merged dataset into training and testing. """
        self.tt = (tt[:self.tr_size], tt[self.tr_size:])

    def remove_unique_cols(self):
        """ Remove categorical columns that contain only a single value. """
        # get merged rows
        tt = self.merge()
        # remove columns where all rows share the same value
        if 'artist' in tt.columns:
            if len(tt.artist.unique().tolist()) == 1:
                tt = tt.drop(columns=['artist'])
        if 'genre' in tt.columns:
            if len(tt.genre.unique().tolist()) == 1:
                tt = tt.drop(columns=['genre'])
        if 'area' in tt.columns:
            if len(tt.area.unique().tolist()) == 1:
                tt = tt.drop(columns=['area'])
        # unmerge and save
        self.unmerge(tt)

    def one_hot_encoding(self):
        """ Perform one hot encoding for the categorical features. """
        # get merged rows
        tt = self.merge()
        # one-hot encoding for categorical features
        tt = pd.get_dummies(tt)
        # one-hot encoding for issue type
        if 'issue_type' in tt.columns:
            issue_dummies = pd.get_dummies(tt['issue_type'], prefix="issue")
            tt = pd.concat([tt, issue_dummies], axis=1)
            tt = tt.drop(columns=['issue_type'])
        # unmerge and save
        self.unmerge(tt)
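    # Illustration with made-up values (not from the dataset): pd.get_dummies
    # turns a categorical column such as genre = ["punkrock", "electronic"]
    # into binary columns genre_punkrock and genre_electronic. This column
    # naming is what evaluate_models below relies on when it filters the test
    # set on "artist_<name>" or "genre_<name>" columns.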

    def train_models(self):
        """ Training of the models. """
        start = time()
        # get X and y of training data
        X_train, y_train = self.get_xy(self.tt[0])

        self.models = dict()

        for name, model in self.model_classes.items():
            self.models[name] = model[0](**model[1])
            self.models[name].fit(X_train, y_train)

        print("Training took: " + str(round(time() - start, 2)))

    def get_xy(self, df, usd_list=True):
        """ Splits a DataFrame into features X and target y.

        Returns:
            tuple: X as DataFrame, y as list (or Series if usd_list is False)
        """
        X = df.drop(columns=['usd'])
        y = df['usd'].tolist() if usd_list is True else df['usd']

        return X, y

    def xy_join(self, X, y):
        """ Joins X and y into a DataFrame and returns it.

        Returns:
            DataFrame
        """
        return pd.concat([X, y], axis=1)

    def evaluate_models(self, artist=None, genre=None, prefix="",
                        results=False, perc_list=False):
        """ Make predictions with models and return result.

        Args:
            artist (str): name of artist to get predictions for
            genre (str): name of genre to get predictions for
            prefix (str): optional prefix of the dictionary index names
            results (boolean): whether to return predicted values or not
            perc_list (boolean): whether to return list of APE scores or not

        Returns:
            dict: prediction results
        """
        # get data to evaluate
        if not (artist or genre):
            test = self.tt[1]
        else:
            set_type = "artist" if artist else "genre"
            name = artist if artist else genre
            test = self.tt[1][self.tt[1][set_type + '_' + name] == 1]

        # split to X and y
        X_test, y_test = self.get_xy(test)

        res = dict()

        for name, model in self.models.items():
            y_pred = model.predict(X_test).tolist()
            r2, rmse, ape_scores, mape, ape_fd = get_scores(y_test, y_pred)
            if results:
                res["actual"] = y_test
                res[f"{name}_pred"] = y_pred
            if perc_list:
                res[f"{prefix}_{name}_perclist"] = ape_scores
            res[f"{prefix}_{name}_r2"] = r2
            res[f"{prefix}_{name}_rmse"] = rmse
            res[f"{prefix}_{name}_mape"] = mape
            res[f"{prefix}_{name}_apefd"] = ape_fd

        return res


def get_scores(y_test, y_pred):
    """ Calculates different scores and returns as tuple.

    Args:
        y_test (list): actual y values
        y_pred (list): predicted y values

    Returns:
        tuple: results
    """
    ape_scores = ape_list(y_test, y_pred)
    return (
        # R2
        round(r2_score(y_test, y_pred), 3),
        # RMSE
        round(sqrt(mean_squared_error(y_test, y_pred)), 3),
        # list of APEs
        ape_scores,
        # MAPE
        round(mean(ape_scores), 3),
        # APE frequency distribution
        ape_fd(ape_scores)
    )


def ape_list(y_test, y_pred):
    """ Calculates list of APE scores.

    Args:
        y_test (list): actual y values
        y_pred (list): predicted y values

    Returns:
        list: APE scores
    """
    ape_scores = list()

    i = 0
    while i < len(y_test):
        diff = abs(y_test[i] - y_pred[i])
        perc = (diff / y_test[i]) * 100
        ape_scores.append(perc)
        i += 1
    return ape_scores


def ape_fd(ape_scores):
    """ Generates frequency distribution list of APE scores.

    Args:
        ape_scores (list): list of APE scores

    Returns:
        list: APE score frequency distribution list
    """
    max_perc = 150
    steps = 10
    res_dict = {key: 0 for key in range(0, max_perc, steps)}
    res_dict['over'] = 0
    for n in ape_scores:
        perc = floor(n)
        # round down to the nearest multiple of steps
        while perc % steps != 0:
            perc -= 1
        if perc >= max_perc:
            perc = 'over'
        res_dict[perc] += 1
    fd = list()
    for k, v in res_dict.items():
        fd.append(round((v / len(ape_scores)) * 100, 1))
    return fd


def mean(num_list):
    """ Calculates mean of numbers in list.

    Args:
        num_list (list): list of numbers

    Returns:
        float: mean of numbers
    """
    return sum(num_list) / len(num_list)
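To make the metric functions above concrete, the following toy example (the numbers are made up and are not taken from the experiment) shows how APE, MAPE and the frequency distribution behave:

    # Hypothetical values for illustration only.
    actual = [20.0, 50.0, 10.0]
    predicted = [25.0, 40.0, 30.0]

    scores = ape_list(actual, predicted)   # [25.0, 20.0, 200.0]
    print(round(mean(scores), 3))          # MAPE: 81.667
    print(ape_fd(scores))                  # 66.7 in the 20-30 % bucket,
                                           # 33.3 in the 'over 150 %' bucket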

B Appendix 1

ED            TS    SM     TR   KN R2    KN MAPE   LR R2    LR MAPE   NN R2    NN MAPE   RF R2    RF MAPE
altmetal      430   31.16  F    0.215    82.65     0.273    106.84    0.343    89.7      0.532    59.86
altmetal      430   31.16  G    0.253    76.07     0.289    93.78     0.139    107.15    0.424    62.52
altmetal      430   31.16  A    0.224    80.98     -0.007   111.19    -5.164   138.59    0.405    67.69
altrock       976   34.09  F    0.291    60.69     0.236    107.37    0.365    73.61     0.443    56.7
altrock       976   34.09  G    0.289    60.81     0.26     105.19    0.332    79.49     0.411    59.83
altrock       976   34.09  A    0.296    63.14     0.14     94.42     -0.963   100.93    0.439    58.28
blackmetal    522   40.1   F    0.217    45.94     0.281    57.95     0.577    45.62     0.503    35.23
blackmetal    522   40.1   G    0.234    45.36     0.349    59.92     0.444    45.73     0.57     38.71
blackmetal    522   40.1   A    0.23     46.71     0.301    63.19     0.279    57.22     0.486    42.12
classicrock   780   21.81  F    0.127    74.26     0.105    126.15    0.299    90.69     0.293    82.37
classicrock   780   21.81  G    0.126    74.83     0.181    115.68    0.086    102.07    0.256    79.62
classicrock   780   21.81  A    0.134    75.22     0.139    110.95    -0.172   110.12    0.206    81.85
deathmetal    438   39.12  F    0.321    39.84     0.342    50.05     0.409    47.17     0.466    34.78
deathmetal    438   39.12  G    0.32     39.9      0.412    46.77     0.261    59.2      0.488    43.19
deathmetal    438   39.12  A    0.308    43.36     0.193    63.34     -0.319   66.53     0.348    51.46
doommetal     833   39.06  F    0.349    48.96     0.342    62.03     0.4      56.74     0.436    42.27
doommetal     833   39.06  G    0.334    48.06     0.352    61.07     0.331    59.11     0.458    42.72
doommetal     833   39.06  A    0.31     48.77     0.172    69.75     -0.124   65.8      0.377    49.0
electronic    810   24.31  F    0.286    59.77     0.247    91.55     0.366    80.58     0.34     60.3
electronic    810   24.31  G    0.275    60.64     0.275    90.57     0.281    87.04     0.342    63.84
electronic    810   24.31  A    0.219    60.72     -0.646   100.53    -1.749   80.18     0.179    67.93
heavymetal    792   32.25  F    0.206    58.26     0.285    85.86     0.296    78.36     0.307    61.97
heavymetal    792   32.25  G    0.202    57.9      0.293    83.69     0.181    73.53     0.381    62.17
heavymetal    792   32.25  A    0.172    56.43     0.172    90.68     -0.77    78.09     0.248    66.92
punkrock      1124  28.43  F    0.113    69.57     0.145    91.13     0.291    81.47     0.267    62.94
punkrock      1124  28.43  G    0.11     70.04     0.172    94.65     0.185    90.72     0.249    65.81
punkrock      1124  28.43  A    0.123    68.38     0.087    94.63     -0.28    98.19     0.234    65.88
stonerrock    267   31.4   F    0.181    51.38     0.177    66.78     0.2      57.93     0.374    42.29
stonerrock    267   31.4   G    0.159    51.1      0.201    64.06     0.03     61.82     0.347    44.69
stonerrock    267   31.4   A    0.101    54.96     -0.263   64.57     -0.321   53.79     0.189    52.66
thrashmetal   584   31.99  F    0.086    58.96     0.352    64.76     0.25     62.38     0.391    47.59
thrashmetal   584   31.99  G    0.109    60.25     0.363    60.52     -0.217   68.22     0.417    48.78
thrashmetal   584   31.99  A    0.113    58.86     0.043    71.76     -0.424   82.92     0.262    60.38

79 B Appendix 2 AL TR 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 > GENRE: ALTERNATIVE METAL KN F 14.9 18.8 11.2 8.4 8.4 7.7 7.2 2.8 2.1 1.4 1.4 0.7 0.9 0.5 1.6 12.1 LR F 10.2 7.9 6.3 11.6 10.0 6.3 7.9 5.6 4.9 2.3 1.9 2.3 1.6 2.6 1.2 17.4 NN F 10.9 12.1 11.9 11.4 7.9 7.2 4.9 4.4 2.8 2.3 2.1 2.1 1.2 0.9 2.1 15.8 RF F 18.8 14.4 15.6 11.6 6.7 4.9 4.9 3.3 3.0 1.9 1.2 1.6 0.7 0.0 0.7 10.7 KN G 15.3 17.7 10.0 9.3 8.8 7.9 6.0 3.7 1.9 1.4 2.6 1.4 0.9 0.2 1.4 11.4 LR G 7.9 10.2 8.6 7.7 10.0 9.1 8.1 3.7 4.9 4.4 3.3 2.1 1.9 0.9 1.2 16.0 NN G 10.2 9.3 10.0 7.2 9.1 8.4 6.5 4.9 4.0 3.5 3.5 3.0 1.4 0.9 1.2 17.0 RF G 17.4 19.8 12.6 10.0 6.0 6.7 5.1 3.0 1.6 1.6 3.0 1.2 1.4 1.6 0.9 7.9 KN A 14.0 17.0 10.5 10.5 6.7 8.4 5.1 3.7 3.3 2.1 1.9 1.2 2.3 1.2 1.4 10.9 LR A 10.9 9.5 8.6 10.5 7.9 7.9 6.0 3.0 3.7 3.0 1.4 2.3 1.9 1.4 2.6 19.3 NN A 10.9 13.3 10.9 10.0 6.3 7.7 5.6 4.0 2.8 4.0 1.4 1.2 1.4 1.4 1.4 17.9 RF A 15.8 16.7 12.1 11.2 7.2 6.7 4.0 4.2 2.8 2.3 0.9 2.1 1.4 1.9 1.4 9.3 GENRE: ALTERNATIVE ROCK KN F 15.4 13.9 14.7 12.0 9.4 6.7 5.9 3.7 3.6 1.6 0.9 1.4 0.9 1.0 0.8 8.0 LR F 7.5 10.3 9.1 9.0 6.9 8.5 8.2 4.8 3.9 2.6 3.0 2.3 1.4 1.6 1.1 19.8 NN F 10.7 10.7 10.7 10.9 9.5 7.3 5.3 4.9 4.6 3.3 2.5 3.1 2.2 1.6 1.7 11.2 RF F 16.2 14.3 14.0 12.2 8.7 6.4 5.4 4.6 3.8 2.4 1.4 0.8 0.9 1.1 0.8 6.9 KN G 15.3 15.0 13.6 11.6 9.3 7.5 5.5 4.2 3.5 1.4 1.0 1.1 1.2 1.3 0.8 7.6 LR G 8.8 8.4 8.3 6.5 8.3 6.9 8.1 5.0 5.5 3.7 2.8 2.7 1.6 2.5 1.4 19.6 NN G 11.5 11.2 12.1 11.1 8.8 6.5 6.6 4.7 3.9 2.9 1.7 2.3 1.4 1.3 1.1 13.0 RF G 15.9 16.1 12.7 10.0 8.8 7.8 5.7 4.3 3.5 1.2 1.9 1.0 0.7 1.2 1.0 8.0 KN A 15.5 13.2 13.3 12.6 8.4 7.6 5.9 4.9 3.5 1.1 1.1 1.4 1.1 0.9 0.9 8.4 LR A 10.2 9.2 10.5 7.8 8.1 7.6 6.2 5.7 5.7 3.1 3.7 1.8 1.7 1.6 1.6 15.3 NN A 11.1 12.0 12.2 11.0 8.1 5.3 4.8 4.4 4.5 2.5 1.8 1.9 1.3 2.0 1.3 15.7 RF A 16.5 15.9 13.4 9.9 9.4 5.4 6.0 4.4 4.0 2.2 1.5 0.9 0.9 1.2 0.8 7.4 GENRE: BLACK METAL KN F 26.8 15.3 11.5 10.0 7.7 4.2 5.2 3.1 3.4 1.7 2.1 0.8 1.7 1.1 0.8 4.6 LR F 11.3 11.3 13.2 8.2 10.9 10.7 6.9 5.7 5.6 2.5 1.5 2.1 2.1 1.1 1.1 5.6 NN F 17.8 15.5 13.2 13.8 10.3 7.9 5.6 2.9 2.5 2.3 1.3 1.1 1.1 0.2 0.4 4.0 RF F 23.9 20.3 16.1 9.8 8.4 6.9 4.8 3.3 1.0 1.1 0.2 0.6 0.4 0.2 0.2 2.9 KN G 26.1 15.5 12.8 9.8 7.5 3.8 5.4 3.4 3.4 1.1 1.9 1.1 1.5 0.8 1.1 4.6 LR G 13.2 12.8 11.5 10.3 6.5 8.0 10.0 5.6 3.6 2.9 3.4 1.3 2.1 1.5 0.4 6.7 NN G 18.8 17.2 14.4 13.8 5.9 6.7 4.8 3.8 1.9 1.7 2.5 2.1 0.6 1.7 0.4 3.6 RF G 23.6 21.6 14.2 9.6 9.6 5.7 2.9 4.2 1.1 1.1 0.8 0.6 0.6 0.6 0.4 3.4 KN A 23.2 14.6 13.2 10.5 8.2 5.0 6.3 4.4 3.6 0.8 1.1 1.0 1.0 1.0 0.2 5.9 LR A 15.1 13.4 10.3 9.8 7.1 9.4 6.7 5.9 2.5 3.1 1.9 3.1 0.6 1.1 1.5 8.4 NN A 20.7 14.4 12.5 9.8 9.0 6.1 5.9 3.3 2.3 2.5 2.1 1.1 1.0 1.0 0.4 8.0 RF A 20.5 20.1 15.9 12.1 5.7 5.7 4.8 2.5 2.7 1.7 0.8 1.1 0.8 0.6 0.8 4.2 GENRE: CLASSIC ROCK KN F 14.5 14.0 9.0 10.0 11.2 6.5 5.8 6.4 4.9 2.2 2.4 0.5 1.2 0.6 0.8 10.1 LR F 9.0 5.6 6.8 7.8 6.4 6.7 5.9 5.9 5.1 3.2 2.9 2.8 1.5 2.6 1.8 25.9 NN F 10.5 9.7 11.9 10.8 8.7 5.1 6.4 6.4 4.0 2.4 1.3 2.2 1.5 1.3 1.4 16.3 RF F 16.8 14.2 12.4 10.4 8.6 6.3 5.9 3.8 2.3 1.8 1.3 1.5 1.3 0.9 0.5 11.9 KN G 14.5 13.2 9.7 9.9 11.8 5.9 6.2 6.2 4.5 2.3 2.3 0.8 1.2 0.8 0.8 10.1 LR G 9.5 8.7 9.1 8.7 7.3 7.2 7.7 6.0 5.8 2.1 1.8 1.8 1.8 1.5 1.5 19.5 NN G 9.9 10.1 10.3 9.9 7.6 6.3 7.2 6.0 3.2 3.6 1.8 1.5 2.1 1.4 1.3 17.9 RF G 16.7 15.0 10.6 10.1 7.3 6.8 6.0 4.6 2.8 2.4 0.8 1.9 1.3 0.6 1.0 11.9 KN A 14.4 15.0 8.5 10.9 9.6 6.9 5.5 6.8 5.3 1.7 1.5 1.0 1.3 0.8 0.6 10.3 LR A 11.9 10.9 9.1 7.8 9.2 7.1 5.4 4.5 4.4 4.0 1.8 2.9 1.7 1.2 1.5 16.7 NN A 11.0 9.7 12.3 9.0 9.1 7.8 5.4 5.6 3.8 
2.9 2.2 1.3 1.4 0.9 0.5 16.9 RF A 14.7 15.0 11.0 10.0 9.0 6.9 5.9 4.7 3.6 2.1 1.4 1.3 1.5 0.8 1.2 10.9 GENRE: DEATH METAL KN F 20.3 16.7 16.4 11.9 8.9 7.8 4.3 2.3 2.5 1.6 1.1 1.1 0.7 0.9 0.7 2.7 LR F 13.0 14.8 12.8 13.2 10.3 8.0 6.8 1.4 3.7 4.1 1.6 1.8 1.1 1.1 1.1 5.0 NN F 18.9 14.8 13.7 9.8 9.8 8.0 5.0 4.1 3.0 3.7 0.9 1.8 1.4 1.4 0.7 3.0 RF F 23.1 18.0 16.0 12.3 8.4 5.9 5.3 3.9 1.1 0.7 1.4 0.7 0.7 0.5 0.2 1.8 KN G 21.7 16.4 13.7 12.6 7.5 7.8 5.0 3.4 2.7 1.8 1.4 0.5 0.9 1.6 0.2 2.7 LR G 18.9 13.7 16.2 13.2 7.3 6.4 3.9 4.6 3.2 2.7 2.7 0.7 1.6 1.6 0.2 3.0 NN G 17.4 16.0 13.9 11.0 8.7 7.8 4.6 4.8 3.7 2.3 1.4 1.4 1.4 1.1 0.2 4.6 RF G 19.9 19.4 15.8 12.6 8.7 6.8 5.0 3.9 2.3 0.2 0.7 1.1 0.5 0.7 0.7 1.8 KN A 20.8 14.2 16.0 12.3 7.8 6.6 5.9 2.1 2.5 2.5 2.1 0.5 1.4 0.9 0.7 3.9 LR A 16.4 14.2 12.6 11.2 9.6 7.8 3.7 5.3 4.3 3.2 1.8 1.4 1.4 0.5 0.7 6.2 NN A 15.8 13.7 11.6 11.2 9.1 7.8 6.4 5.3 2.3 1.8 2.1 1.6 1.4 0.9 1.1 8.0 RF A 19.6 16.2 16.0 11.0 8.0 4.8 5.5 5.5 3.2 1.6 1.6 0.9 0.5 0.9 0.7 4.1 GENRE: DOOM METAL KN F 20.8 16.0 12.0 11.2 9.4 6.7 5.2 4.1 3.0 2.0 1.8 1.2 1.0 0.8 0.1 4.8 LR F 13.8 13.8 13.8 11.4 10.6 8.2 5.4 3.5 3.7 2.3 1.4 1.2 1.1 0.8 1.0 8.0 NN F 16.9 13.1 13.1 11.4 10.1 7.0 5.8 4.9 2.8 1.7 1.4 2.8 1.7 0.7 1.0 5.8 RF F 21.5 18.6 15.1 11.2 8.5 7.0 4.8 2.8 2.4 1.6 0.6 0.5 0.5 0.6 0.5 4.0

80 KN G 20.6 15.6 12.2 10.9 10.1 7.2 4.6 3.7 3.5 2.0 2.0 1.0 1.0 1.1 0.1 4.3 LR G 13.7 12.6 14.3 12.4 9.5 7.8 7.4 4.3 2.4 2.2 0.8 1.6 1.3 1.2 1.0 7.6 NN G 15.6 13.3 13.9 10.9 9.4 7.8 5.2 4.1 3.1 3.0 1.9 1.2 1.4 0.8 0.8 7.4 RF G 19.8 17.6 15.6 13.0 8.0 7.0 4.8 2.2 2.6 2.3 1.1 0.2 1.3 0.1 0.4 4.0 KN A 20.4 16.9 11.0 10.8 10.4 7.2 4.6 4.3 3.1 2.2 1.4 0.6 0.6 0.6 0.6 5.2 LR A 17.3 15.5 11.8 10.6 7.6 8.3 6.0 4.0 3.2 2.5 1.8 1.0 1.9 0.5 1.2 7.0 NN A 17.5 15.8 12.4 9.7 9.4 5.6 5.0 4.0 3.6 3.0 1.2 1.2 1.1 1.1 1.1 8.3 RF A 19.6 18.0 14.5 10.8 9.7 6.8 4.4 3.4 3.2 1.6 1.1 1.3 0.5 0.2 0.7 4.1 GENRE: ELECTRONIC KN F 17.8 13.6 10.7 13.3 9.6 7.5 6.0 4.7 2.5 1.4 2.2 1.5 0.9 1.0 0.7 6.5 LR F 10.2 8.9 8.0 7.4 7.8 8.3 6.2 5.7 5.1 4.2 3.6 1.7 2.1 1.6 1.4 17.9 NN F 13.7 9.5 10.5 9.5 8.0 9.3 6.7 4.3 2.6 2.8 3.0 1.6 1.6 2.1 2.0 12.8 RF F 16.8 17.9 12.1 11.0 8.1 6.0 6.8 4.3 3.0 0.7 1.6 0.9 1.7 0.6 0.4 8.0 KN G 17.0 14.7 10.7 13.6 8.6 7.8 5.9 4.2 2.5 1.5 2.3 1.4 0.9 1.0 1.0 6.9 LR G 12.0 11.2 9.1 7.5 5.8 6.3 6.9 5.1 3.8 3.7 2.5 2.2 3.3 2.0 1.5 17.0 NN G 11.2 14.3 9.3 9.3 7.7 6.7 6.4 4.4 5.1 3.0 3.0 2.2 2.1 1.1 1.5 12.8 RF G 15.4 18.4 10.7 10.9 9.0 6.9 5.7 3.7 2.6 1.9 1.1 1.1 1.2 0.7 1.0 9.6 KN A 18.1 12.3 10.7 12.6 8.5 8.5 5.2 5.3 2.7 1.9 3.0 1.0 1.1 1.1 0.9 7.0 LR A 11.9 12.2 11.5 8.0 6.7 7.9 6.8 5.7 3.0 2.3 3.0 1.4 1.9 1.2 1.6 15.1 NN A 15.8 12.7 10.7 7.5 8.4 5.7 6.0 5.2 3.1 4.1 1.7 2.1 1.9 1.6 1.6 11.9 RF A 17.4 17.0 11.0 11.2 8.0 7.0 4.2 5.1 2.8 2.3 1.4 0.7 1.1 1.1 0.5 9.0 GENRE: HEAVY METAL KN F 20.5 17.2 13.5 10.0 8.0 6.7 4.5 3.3 3.9 1.3 1.0 0.9 1.1 0.8 0.6 6.8 LR F 10.7 11.0 9.2 8.5 7.8 7.3 6.6 4.4 4.7 3.8 2.7 2.7 2.8 0.9 1.4 15.7 NN F 12.8 12.2 10.7 10.2 8.8 7.4 6.3 4.0 4.5 2.4 2.3 2.0 1.4 1.5 1.4 11.9 RF F 20.3 17.4 12.8 10.5 9.2 6.3 4.7 4.2 2.1 1.3 0.9 1.1 0.4 0.6 0.5 7.7 KN G 19.9 17.3 14.1 10.2 7.8 6.2 4.5 3.4 3.8 1.3 1.4 1.0 1.1 0.6 0.5 6.7 LR G 11.2 12.4 9.8 8.6 7.3 8.7 6.3 5.3 4.8 3.0 2.7 2.0 1.1 1.1 1.8 13.8 NN G 15.4 12.4 12.1 10.1 7.4 7.3 5.1 4.7 2.8 2.7 2.1 1.0 1.6 1.8 0.6 12.9 RF G 21.2 16.0 11.4 12.2 9.7 6.4 4.4 3.5 2.3 1.6 1.0 0.9 0.4 0.5 0.6 7.7 KN A 21.0 16.5 13.8 9.7 8.1 7.1 4.2 4.7 2.9 2.0 0.6 1.3 1.1 0.4 0.5 6.2 LR A 11.6 14.3 12.1 9.7 10.2 8.2 5.6 4.0 3.2 2.0 1.6 1.6 1.6 1.4 1.1 11.6 NN A 16.2 15.9 12.6 8.6 7.2 6.7 5.7 3.7 2.4 3.2 2.7 1.5 1.6 0.8 1.5 9.8 RF A 20.8 16.0 13.9 10.0 8.5 7.2 4.7 3.7 1.5 1.5 1.0 0.9 0.9 0.6 0.5 8.3 GENRE: PUNK ROCK KN F 17.7 14.1 12.6 10.1 8.5 7.6 5.1 3.6 4.0 1.3 1.5 1.6 1.2 0.8 0.8 9.5 LR F 10.5 9.4 9.3 8.6 8.1 5.3 8.2 6.3 3.7 3.3 2.5 2.2 2.8 1.6 1.6 16.5 NN F 12.5 12.0 11.9 7.8 7.7 8.1 5.7 5.2 4.3 3.1 2.5 1.2 2.1 1.3 1.2 13.3 RF F 17.2 17.1 10.6 11.3 9.2 6.9 4.4 3.4 3.7 1.8 2.0 1.9 1.0 0.3 1.7 7.6 KN G 18.1 14.6 12.3 10.3 8.2 7.2 5.3 3.6 3.6 1.4 1.8 1.3 1.1 0.8 1.0 9.3 LR G 11.7 10.1 8.2 8.1 7.5 7.2 6.7 5.2 3.9 3.8 2.9 1.9 2.0 1.7 1.6 17.5 NN G 11.7 11.7 9.9 8.5 8.7 8.0 6.9 5.6 3.8 3.1 2.8 1.8 2.0 1.3 0.4 13.7 RF G 16.7 14.9 13.2 10.5 11.0 5.4 4.9 3.8 3.3 1.8 2.1 1.1 1.3 0.4 0.9 8.7 KN A 16.2 16.1 12.4 10.5 8.5 6.6 6.0 4.1 4.0 1.5 1.2 1.1 1.5 0.9 0.6 8.9 LR A 10.9 11.9 10.4 8.5 7.2 7.3 7.0 5.6 4.6 2.7 2.6 1.4 1.6 0.8 1.8 15.7 NN A 14.1 12.5 9.9 8.0 8.1 7.6 5.2 5.0 4.2 2.8 1.7 1.3 1.2 1.3 1.4 15.7 RF A 18.0 15.6 11.7 11.7 9.1 6.1 5.5 3.6 2.7 2.7 1.4 1.2 0.7 0.6 0.4 8.8 GENRE: STONER ROCK KN F 19.9 13.1 15.0 9.7 10.1 4.9 6.7 6.0 3.7 2.2 0.7 0.0 1.1 0.4 0.0 6.4 LR F 15.4 9.7 14.6 10.5 9.7 7.9 5.2 4.9 3.7 1.5 3.4 1.5 0.7 1.5 1.1 8.6 NN F 16.5 14.6 13.9 7.5 9.0 7.1 6.4 4.5 3.0 2.6 1.1 1.1 2.6 1.1 1.5 7.5 RF F 22.1 18.7 14.6 11.6 4.9 9.7 4.5 4.1 1.5 0.4 0.7 0.4 
0.7 0.7 0.7 4.5 KN G 18.7 13.9 13.9 10.5 11.6 4.9 6.0 7.1 2.2 2.2 1.1 0.4 0.7 0.4 0.0 6.4 LR G 16.5 12.4 11.2 11.2 11.2 6.0 3.7 4.1 2.2 3.0 5.6 2.6 0.7 0.7 0.7 7.9 NN G 15.4 15.4 10.5 9.4 9.0 7.5 6.0 7.5 4.9 1.5 0.4 0.7 1.1 0.7 2.2 7.9 RF G 20.2 21.3 15.7 8.2 8.2 6.0 6.0 3.0 1.5 1.5 0.7 0.4 0.7 0.7 0.0 5.6 KN A 19.1 16.1 14.2 9.4 8.6 5.6 7.1 5.2 3.0 1.1 1.1 0.0 1.1 0.7 0.4 7.1 LR A 13.1 13.5 12.4 12.4 7.1 7.1 4.9 6.0 1.9 3.0 3.0 1.5 1.9 2.2 1.9 8.2 NN A 15.4 10.5 14.6 8.2 11.2 7.5 10.9 5.2 3.7 2.2 1.5 1.1 1.5 0.4 0.7 5.2 RF A 18.7 21.3 12.4 9.7 11.2 5.2 5.2 4.1 2.2 0.0 0.0 1.5 0.0 1.1 0.4 6.7 GENRE: THRASH METAL KN F 20.0 16.4 14.4 10.4 8.0 7.5 5.1 3.6 3.6 1.0 0.2 1.7 0.5 1.5 0.3 5.5 LR F 12.5 12.8 13.9 10.4 8.6 11.6 5.5 5.1 3.6 2.1 1.4 1.0 1.0 1.0 0.5 8.9 NN F 15.2 13.4 11.0 12.7 10.1 6.7 6.7 3.3 4.5 1.9 1.4 1.9 0.9 1.0 1.9 7.7 RF F 23.1 18.7 14.4 11.0 8.9 5.5 5.7 1.7 1.7 0.3 1.0 0.7 1.2 1.2 0.5 4.5 KN G 20.4 15.1 14.4 11.1 8.2 7.5 5.8 4.3 2.4 1.0 0.7 1.5 0.5 0.9 0.5 5.7 LR G 13.7 10.8 13.0 14.0 11.1 7.7 6.3 4.6 2.7 2.1 1.0 0.9 3.1 1.0 0.3 7.5 NN G 14.9 14.7 15.1 11.5 7.2 7.7 5.0 3.4 2.4 1.5 2.4 1.5 1.0 1.0 0.7 9.9 RF G 21.9 19.7 12.3 11.6 8.7 6.0 3.9 4.1 2.1 0.9 0.5 1.0 0.7 0.5 0.3 5.7 KN A 20.0 16.6 14.0 11.0 7.7 6.5 6.2 4.6 2.4 1.0 0.5 1.7 0.7 0.9 0.3 5.8 LR A 14.2 14.6 14.0 9.9 9.4 7.4 6.7 3.3 3.3 3.1 2.9 1.0 1.4 1.2 0.3 7.4 NN A 19.7 14.2 11.0 9.6 8.6 4.6 5.8 3.3 4.5 1.7 1.7 1.0 1.9 1.9 0.7 9.9 RF A 19.5 19.3 14.6 9.2 10.4 4.6 4.8 2.6 2.4 1.0 1.4 1.2 1.2 0.5 0.5 6.7

81 B Appendix 3 ED TS SM TR: FULL DATASET TR: GENRE TR: ARTIST KN LR NN RF KN LR NN RF KN LR NN RF GENRE: ALTERNATIVE METAL A1 9 57.12 38.48 25.97 27.85 35.01 37.93 30.5 48.25 34.2 40.43 39.68 45.14 39.14 A2 4 38.62 32.03 46.23 42.91 33.53 36.9 53.64 48.27 30.74 31.94 59.5 44.85 23.25 A3 7 44.44 57.65 34.88 37.3 52.72 57.65 35.41 35.33 55.99 57.65 40.82 57.65 52.19 A4 5 22.67 39.3 100.44 28.98 28.92 39.97 63.99 50.57 24.46 75.61 50.58 31.53 43.34 A5 14 27.0 38.0 58.31 50.67 35.69 44.5 79.39 44.45 33.78 42.64 70.85 51.62 39.65 A6 11 27.79 32.55 63.01 84.07 45.35 26.26 64.33 110.72 48.14 27.76 28.39 40.99 24.97 A7 5 35.42 69.38 29.48 38.0 28.88 69.32 59.56 26.1 32.02 62.87 46.0 139.76 35.6 A8 10 42.72 61.86 49.03 64.97 28.91 62.43 47.91 33.28 30.17 62.04 56.76 106.36 40.13 A9 13 27.28 50.6 61.18 63.67 41.7 41.79 87.09 64.95 39.51 39.37 73.34 49.91 40.58 A10 18 39.39 45.53 48.53 47.25 34.82 44.48 50.54 58.06 41.13 68.43 67.46 99.92 47.82 A11 3 22.83 30.68 49.69 34.58 75.83 32.25 62.09 53.74 171.48 30.1 75.7 67.53 31.58 A12 7 15.77 82.88 27.99 39.05 70.33 83.84 46.97 101.68 82.57 36.48 60.67 77.53 30.59 A13 20 45.1 53.66 94.81 59.97 40.17 51.75 65.01 64.23 33.88 51.83 102.79 80.74 53.85 A14 12 34.71 64.78 70.02 62.67 59.17 66.15 64.47 57.21 66.31 63.31 51.48 79.3 66.54 A15 11 46.87 67.74 69.32 81.98 44.95 48.91 66.15 92.02 41.11 77.71 70.42 49.57 67.25 A16 3 17.13 65.88 53.55 48.43 55.48 65.87 84.31 120.11 51.4 54.88 83.23 46.84 53.25 A17 25 38.77 64.5 68.62 65.4 38.93 62.4 52.95 94.21 32.41 62.17 56.25 154.68 38.29 A18 15 40.3 62.38 92.54 98.33 36.84 53.0 77.47 67.03 41.37 52.08 140.88 65.17 56.05 A19 38 23.91 40.9 61.85 82.02 58.66 43.62 57.19 133.42 60.71 41.95 97.03 101.79 89.46 A20 7 20.12 34.89 72.55 44.61 24.59 47.26 45.57 60.1 18.9 35.83 160.44 268.72 102.83 A21 6 26.13 92.38 53.51 81.09 60.94 55.46 55.02 86.69 76.13 45.98 138.98 123.58 70.84 A22 18 23.72 30.9 132.88 125.74 64.35 31.85 134.72 115.82 51.83 38.21 85.34 93.07 58.07 A23 3 20.75 102.27 91.81 123.04 42.89 99.6 42.55 73.4 6.91 142.22 172.65 28.78 57.42 A24 10 42.72 112.85 76.6 72.7 42.12 113.43 57.26 141.55 69.13 116.04 59.43 61.86 74.95 A25 10 18.12 71.41 45.13 155.72 56.29 67.05 87.03 144.3 53.43 60.9 99.19 140.65 56.19 A26 3 29.07 161.64 83.8 55.65 39.28 170.3 152.43 51.02 30.24 94.16 74.11 69.92 57.67 A27 7 43.44 35.46 177.14 46.4 57.05 35.75 103.43 102.59 26.34 34.98 355.94 46.09 54.51 A28 25 26.22 89.62 98.06 84.99 65.13 92.77 98.62 110.17 60.35 83.13 121.63 115.87 59.37 A29 8 30.64 102.08 154.1 98.25 83.59 103.53 135.16 88.55 64.6 88.56 103.76 85.68 74.13 A30 4 46.49 64.03 85.25 113.53 89.59 79.89 96.71 183.06 123.76 64.93 125.62 126.51 57.52 A31 15 24.96 67.91 80.54 78.24 46.25 66.17 62.44 74.24 70.06 76.11 93.79 407.39 94.89 A32 9 24.59 78.59 73.49 34.22 67.42 76.45 48.96 130.86 73.26 70.34 65.61 430.72 69.28 A33 17 20.99 84.44 214.41 120.14 55.49 87.78 166.55 82.55 56.74 77.03 141.03 151.88 42.72 A34 4 6.52 189.3 184.31 142.5 80.32 52.91 204.67 154.82 89.58 48.38 50.04 43.91 49.57 A35 7 12.02 72.54 504.55 99.92 94.67 54.72 258.65 123.24 90.75 65.82 52.64 81.6 69.31 A36 5 26.83 144.97 139.95 140.02 111.82 156.46 68.9 175.37 115.96 163.28 210.56 190.22 147.58 A37 9 46.11 68.26 150.02 145.99 124.22 66.24 130.13 162.6 176.69 204.04 250.61 305.32 123.35 A38 3 13.74 156.2 133.18 160.18 144.07 156.13 51.99 246.31 163.82 153.66 148.18 145.8 252.99 A39 7 8.41 147.15 232.53 119.19 149.04 119.99 148.42 267.18 116.69 108.19 311.02 75.8 144.07 A40 8 17.94 484.47 203.29 302.77 215.05 262.73 153.24 349.16 199.5 252.54 100.16 603.5 
122.01 A41 15 45.11 324.93 385.64 257.99 96.44 333.6 391.25 267.75 133.67 417.52 431.58 434.27 160.98 GENRE: ALTERNATIVE ROCK A42 25 32.39 29.93 33.08 32.88 23.44 27.55 47.04 32.29 22.19 29.88 56.33 26.59 24.75 A43 11 55.83 31.87 43.03 22.76 24.6 35.49 41.08 35.42 25.46 34.05 43.6 29.74 26.26 A44 7 45.59 34.1 51.78 21.83 33.9 34.34 71.43 53.44 23.83 29.08 28.66 35.0 27.28 A45 4 94.22 28.9 37.5 40.19 47.96 30.64 39.49 30.64 44.49 24.59 63.58 30.31 32.62 A46 22 33.13 34.23 46.15 33.84 31.94 33.82 50.75 38.68 32.45 31.46 42.57 81.71 36.08 A47 44 29.6 43.58 41.77 37.79 33.9 43.11 66.97 43.12 36.5 41.74 36.96 62.42 33.83 A48 10 35.26 43.92 32.03 74.67 30.94 46.73 38.51 47.17 55.35 64.79 55.88 41.97 21.13 A49 33 46.19 40.42 63.54 62.06 34.92 39.93 59.51 52.25 33.58 36.81 57.74 62.24 30.89 A50 38 33.65 41.51 67.92 64.09 42.3 41.0 72.36 47.94 40.93 42.02 58.82 52.9 41.49 A51 12 31.54 51.58 65.25 54.79 34.04 51.92 31.88 81.65 31.29 52.58 106.88 45.37 41.68 A52 12 31.45 49.41 57.86 57.62 39.22 48.92 74.33 64.45 42.32 51.66 81.66 58.12 38.9 A53 58 29.96 41.93 75.44 73.08 57.26 40.31 72.63 67.06 61.69 42.48 67.88 79.11 45.4 A54 19 63.45 57.98 64.65 56.07 44.08 60.17 60.67 56.37 41.41 77.5 74.62 75.19 59.51 A55 13 33.82 69.03 34.8 57.3 36.07 67.41 46.69 44.6 43.53 66.98 57.88 156.75 54.02 A56 20 38.45 77.66 39.07 39.92 63.98 77.65 64.15 48.23 62.91 77.4 44.52 96.71 60.17

82 A57 54 17.51 40.62 63.31 72.05 57.81 40.09 88.6 95.09 52.21 43.47 64.97 88.54 54.44 A58 15 21.68 54.02 138.77 68.25 50.92 55.17 106.37 47.6 56.31 50.55 63.53 82.4 59.15 A59 43 21.56 46.5 120.36 73.38 59.02 46.71 111.71 67.87 64.5 45.87 88.36 70.24 52.85 A60 27 20.43 68.68 92.85 69.07 48.04 69.94 109.54 71.03 48.66 75.06 58.99 100.65 41.42 A61 10 38.68 58.55 82.92 50.56 76.69 60.97 88.11 95.57 66.74 61.76 66.09 116.07 56.05 A62 25 27.18 58.49 90.44 56.66 53.01 59.01 84.1 91.07 66.22 58.16 83.31 174.66 66.71 A63 60 51.85 60.25 135.64 79.93 59.28 61.78 116.99 64.97 58.4 66.12 97.23 82.83 66.52 A64 37 38.2 43.8 144.9 63.19 43.85 44.63 128.83 61.68 43.47 45.47 172.57 108.54 55.4 A65 10 57.61 47.57 100.23 58.2 55.43 47.99 61.49 61.39 55.75 53.08 254.62 109.29 57.88 A66 3 27.39 23.46 67.07 52.38 33.54 23.02 67.28 30.3 29.42 49.84 454.47 71.69 66.83 A67 13 51.72 48.11 129.64 91.99 38.09 47.94 132.02 83.12 39.25 76.82 160.75 105.27 57.5 A68 36 22.49 49.91 135.26 99.21 67.7 49.68 133.69 81.38 63.33 60.64 118.56 82.54 69.9 A69 11 33.99 42.15 234.3 56.83 37.84 46.35 215.64 70.68 46.46 64.01 136.87 39.17 43.11 A70 45 23.47 64.62 122.69 99.19 75.41 61.85 114.0 127.65 69.0 56.6 100.91 112.72 43.98 A71 19 15.29 59.59 92.52 63.85 94.3 57.87 142.18 127.57 101.36 49.19 107.34 118.32 53.33 A72 9 40.11 67.81 125.54 93.42 72.43 61.59 145.49 60.48 91.73 93.08 65.75 139.77 53.19 A73 15 15.66 71.52 163.2 95.88 62.41 73.8 113.61 88.67 75.5 68.79 73.11 115.01 76.1 A74 49 42.3 58.37 186.39 84.5 54.33 61.96 170.94 84.83 57.98 61.91 131.55 73.28 67.53 A75 8 14.73 79.88 93.91 54.14 105.38 68.91 154.74 151.61 130.01 57.85 70.72 106.73 72.43 A76 21 65.57 120.52 69.28 77.48 86.16 119.33 67.2 94.98 102.26 120.21 57.41 130.57 115.23 A77 35 40.99 81.75 162.6 109.67 58.81 79.65 135.7 108.42 70.56 79.29 141.63 111.94 57.12 A78 15 20.47 125.83 114.15 89.64 66.59 125.55 121.95 122.47 60.95 108.99 105.93 117.04 90.75 A79 4 32.91 50.98 186.92 95.69 55.46 47.39 156.34 78.38 72.27 108.61 189.95 101.84 109.92 A80 12 16.25 102.49 115.12 110.67 85.93 105.89 113.14 140.21 112.43 105.94 73.22 172.26 83.69 A81 13 40.85 90.31 117.76 130.4 69.08 88.59 71.48 75.28 82.21 132.85 164.06 242.38 112.5 A82 23 13.3 108.55 135.9 91.61 59.48 107.64 150.73 130.66 51.61 94.24 75.81 385.79 28.97 A83 7 24.28 140.26 400.23 52.62 37.81 140.63 285.43 77.48 41.35 143.21 85.35 126.87 34.42 A84 9 26.88 190.06 133.64 103.11 145.53 209.48 194.78 200.97 175.0 63.29 323.05 93.95 75.83 A85 3 13.52 88.72 417.52 101.21 34.42 87.66 312.35 134.77 35.84 324.67 102.66 89.93 293.36 A86 9 86.26 104.91 194.12 218.01 192.54 106.4 168.45 290.45 236.48 114.13 311.83 279.62 288.65 A87 8 69.6 184.72 414.37 135.69 117.12 188.83 378.04 110.95 134.32 282.48 191.5 294.89 207.65 GENRE: BLACK METAL A88 3 60.66 16.44 26.24 15.02 16.59 16.44 28.72 16.04 16.98 16.44 16.44 16.44 16.59 A89 4 20.14 9.5 30.86 36.43 10.73 9.49 29.73 47.77 9.93 9.47 52.96 11.04 10.8 A90 30 37.48 20.09 47.12 32.54 21.85 20.52 32.85 26.02 18.62 22.42 46.9 29.23 26.16 A91 12 28.53 24.92 25.88 33.14 24.01 21.55 30.28 34.65 19.21 27.67 21.69 57.39 26.66 A92 26 48.59 27.27 51.38 23.71 17.13 26.84 30.84 22.48 21.43 25.92 35.27 42.27 22.77 A93 27 52.66 23.01 36.9 26.52 30.15 20.48 37.35 31.27 33.74 22.93 39.01 48.33 24.34 A94 20 52.0 30.73 43.47 32.04 25.55 30.23 45.29 28.95 32.35 28.91 32.77 36.22 31.5 A95 18 22.41 29.18 34.57 38.59 29.32 27.3 44.78 40.02 26.81 27.21 40.9 53.61 26.73 A96 3 76.67 29.96 20.72 31.2 42.09 30.1 24.29 43.84 23.52 17.6 113.05 17.26 27.18 A97 6 47.1 35.05 32.0 38.07 28.99 35.8 32.79 57.17 
33.54 41.34 33.64 29.08 23.75 A98 7 36.22 27.89 21.55 21.61 26.52 25.78 30.28 18.95 25.3 52.41 39.38 108.48 38.01 A99 8 29.9 40.25 35.84 12.2 25.5 36.18 30.43 37.67 42.6 32.51 67.67 50.67 38.56 A100 14 49.52 40.25 39.31 27.09 32.36 37.56 35.97 45.83 26.56 53.74 47.66 38.58 32.49 A101 15 61.8 38.67 57.64 30.91 23.81 39.22 44.05 27.71 27.48 49.91 65.64 33.98 30.09 A102 7 73.78 30.69 43.61 36.03 28.87 29.98 42.03 25.45 22.08 30.17 119.02 52.4 27.55 A103 14 41.46 38.5 63.95 35.42 22.48 40.06 41.41 44.49 25.7 40.93 58.36 29.78 50.9 A104 6 24.9 92.89 50.87 12.01 23.48 57.48 51.69 23.43 24.36 23.58 104.57 12.66 21.31 A105 7 18.55 35.82 47.1 41.73 20.12 36.93 86.98 81.14 20.7 18.97 42.87 53.37 24.99 A106 4 26.46 51.77 57.12 48.84 22.39 50.64 46.86 47.45 28.42 49.45 34.88 38.29 37.92 A107 7 18.64 29.33 30.27 66.89 37.43 33.28 51.57 79.79 47.19 42.72 48.59 17.23 37.64 A108 18 56.09 47.93 56.12 38.68 31.68 47.21 47.37 46.29 28.42 44.37 61.17 52.66 32.95 A109 27 49.04 48.91 56.08 42.42 34.05 48.83 46.04 49.16 34.08 53.15 39.12 55.68 43.34 A110 10 27.54 59.29 35.92 58.64 39.03 51.73 50.35 39.68 36.18 28.31 103.9 62.42 23.89 A111 5 89.6 46.09 73.22 52.87 52.83 48.63 51.81 53.06 49.82 40.32 36.12 41.33 44.33 A112 3 123.11 43.63 60.69 34.22 39.9 43.7 56.55 39.91 69.0 55.6 52.07 48.28 51.13 A113 11 21.8 31.48 100.42 46.01 29.67 31.34 95.89 30.77 34.43 32.38 73.93 66.24 33.05 A114 6 30.99 61.65 78.38 30.02 15.14 61.12 31.04 28.26 9.29 113.62 43.92 86.2 50.5 A115 7 37.11 46.55 21.49 57.58 26.91 50.0 43.9 47.88 84.71 44.29 56.72 67.75 85.98 A116 3 43.81 57.47 64.37 65.99 49.15 75.72 40.01 61.98 26.19 54.95 55.93 49.21 51.94 A117 10 85.73 73.54 52.86 47.59 46.66 73.09 49.05 40.84 54.17 71.44 36.59 50.03 61.92

83 A118 4 20.74 28.69 45.34 66.62 27.98 31.63 103.08 44.13 29.96 34.49 138.57 77.33 34.56 A119 30 34.21 50.27 54.83 61.84 43.67 53.1 49.8 52.81 33.75 53.39 67.88 110.36 35.2 A120 44 28.44 43.39 69.84 69.62 40.38 44.83 112.95 43.49 52.54 39.7 57.85 52.85 40.88 A121 12 18.83 42.15 59.83 39.24 48.46 43.68 74.58 57.92 54.64 53.32 53.29 74.03 68.3 A122 13 23.57 68.7 81.67 61.09 55.66 68.82 100.03 52.99 56.07 60.51 55.8 61.03 40.41 A123 5 52.39 67.36 48.16 48.86 49.53 67.98 38.75 52.86 37.1 82.58 210.58 46.55 62.56 A124 6 22.87 91.99 62.37 47.33 54.82 89.79 71.23 87.55 58.16 96.1 45.14 58.38 97.54 A125 12 38.15 85.74 80.25 63.22 56.58 85.46 79.69 50.95 60.2 94.21 88.69 61.71 72.96 A126 4 47.7 56.69 125.62 55.61 72.76 56.83 113.19 67.14 55.1 74.42 68.27 67.98 104.62 A127 20 43.21 80.93 76.88 85.06 46.91 78.25 77.33 79.0 55.64 70.6 143.27 73.7 55.98 A128 15 26.82 78.97 104.49 60.49 62.99 78.21 102.17 78.39 59.2 99.93 73.7 99.74 69.86 A129 13 26.45 96.84 129.85 63.69 52.03 96.11 120.43 91.39 76.17 89.06 154.57 103.75 74.36 A130 6 30.53 59.92 114.06 73.3 52.4 60.02 172.84 72.26 118.47 72.17 157.35 138.84 169.06 GENRE: CLASSIC ROCK A131 60 14.91 46.65 107.13 50.44 47.23 46.79 78.34 66.51 44.29 46.61 48.71 59.98 45.85 A132 60 36.81 55.16 95.4 71.7 66.76 55.16 91.7 80.95 62.88 73.75 100.26 90.21 55.72 A133 60 18.82 54.81 115.66 86.72 69.81 55.83 100.25 98.59 64.59 46.19 79.51 78.6 57.87 A134 60 15.13 77.58 106.03 77.07 58.52 78.1 90.39 87.42 58.66 73.87 67.57 74.62 64.01 A135 60 15.34 56.03 116.37 80.31 57.97 56.0 155.22 88.3 67.03 58.23 80.38 67.01 58.88 A136 60 23.53 102.24 89.62 91.82 84.1 102.0 70.06 96.71 85.81 99.66 77.29 108.39 65.28 A137 60 32.09 67.09 133.77 102.89 74.04 66.26 106.43 112.28 71.56 68.48 103.58 107.4 67.36 A138 60 31.5 84.14 121.85 79.67 87.11 84.23 101.21 96.94 91.49 84.39 101.93 102.83 95.41 A139 60 23.2 77.44 114.62 80.84 97.8 79.43 127.51 128.69 101.38 79.18 107.01 128.64 105.88 A140 60 19.49 83.58 136.08 93.89 80.07 83.42 117.84 87.76 87.79 77.16 146.2 125.93 120.18 A141 60 21.69 50.36 121.68 85.12 84.1 50.47 104.22 75.61 84.76 48.58 234.59 184.31 140.97 A142 60 10.48 81.27 199.86 157.52 151.11 85.68 179.34 117.7 109.38 85.28 105.64 101.02 76.0 A143 60 20.49 129.08 181.86 120.96 112.19 129.47 181.37 189.41 105.46 136.48 189.62 202.64 110.69 GENRE: DEATH METAL A144 9 39.88 24.63 19.91 25.03 17.92 19.5 21.89 42.06 25.12 27.41 25.07 31.84 23.87 A145 11 40.75 22.37 31.77 46.97 25.21 22.31 23.19 16.53 29.12 21.81 40.96 24.63 26.43 A146 4 30.63 44.85 9.76 21.16 10.48 39.95 12.81 20.14 12.1 24.56 77.89 26.36 38.88 A147 23 36.25 26.17 44.24 30.19 25.72 28.16 36.94 33.03 27.16 27.73 32.08 40.2 25.22 A148 14 40.33 30.13 39.62 37.99 22.15 29.99 46.74 39.37 23.34 35.06 31.07 39.45 33.55 A149 13 35.7 31.66 43.05 37.02 23.12 30.33 34.45 32.38 23.73 29.37 44.1 46.78 38.44 A150 12 40.43 30.44 31.24 44.62 32.69 34.52 36.12 50.66 28.1 41.34 49.94 42.97 29.38 A151 13 31.3 31.99 62.07 42.16 26.21 33.62 62.48 29.68 31.89 34.95 34.49 41.13 26.9 A152 14 45.34 38.5 47.08 40.33 39.37 36.97 46.86 46.61 31.68 30.61 44.23 37.25 25.8 A153 16 33.38 43.5 45.77 44.06 27.1 41.25 29.68 29.01 26.53 45.81 41.61 69.2 37.13 A154 5 22.59 32.89 22.1 71.13 16.75 29.25 51.03 106.1 22.98 35.16 35.17 31.48 31.29 A155 6 38.61 33.05 28.94 51.95 33.48 39.16 35.3 46.88 43.08 38.55 28.67 66.72 40.84 A156 5 23.54 47.41 36.74 20.0 39.21 46.27 40.46 46.13 42.17 46.12 27.8 60.76 34.82 A157 28 32.17 37.51 50.94 48.03 38.99 36.99 50.67 28.41 32.68 36.62 59.55 40.72 46.01 A158 13 28.85 29.97 24.26 52.51 27.75 30.73 34.26 52.81 
31.04 35.19 117.04 37.23 35.43 A159 22 55.05 35.75 46.24 38.75 40.56 35.38 37.02 42.16 35.36 45.51 61.48 61.47 51.15 A160 17 35.5 38.9 52.63 48.88 32.88 39.3 36.6 46.33 34.95 40.59 43.34 82.58 40.8 A161 6 33.2 37.44 30.27 29.85 31.48 37.42 48.59 65.39 47.69 38.29 118.23 31.63 44.85 A162 6 39.04 24.43 24.71 36.66 22.54 26.33 38.97 61.7 30.62 54.41 148.43 43.42 51.63 A163 12 32.38 44.51 63.94 67.66 41.32 44.26 49.19 47.54 33.24 42.1 42.93 47.58 46.39 A164 8 31.68 47.69 37.83 48.4 34.43 45.98 37.44 52.3 38.11 46.41 60.78 84.86 68.63 A165 13 44.8 49.65 69.83 44.8 46.03 50.96 38.74 48.13 46.55 54.58 40.1 54.56 61.54 A166 8 67.19 34.63 104.73 83.1 22.56 31.77 92.54 60.24 36.99 41.53 32.01 34.34 35.65 A167 15 25.56 54.2 58.76 36.51 29.32 56.56 26.82 40.48 39.98 61.83 75.32 98.55 44.89 A168 25 67.29 34.68 84.24 39.47 37.66 42.88 64.7 51.35 52.51 43.53 54.67 73.33 58.46 A169 10 26.31 39.78 45.09 63.22 59.84 34.65 40.68 55.57 46.59 31.96 103.69 66.06 72.65 A170 9 24.88 47.53 39.56 89.94 32.33 46.71 33.59 57.91 22.09 38.92 87.28 105.17 59.95 A171 18 55.97 31.92 71.8 59.21 51.15 34.68 72.14 74.54 41.2 35.9 72.99 86.62 31.52 A172 13 24.53 83.64 50.52 29.89 32.11 63.24 37.02 44.88 34.53 58.51 51.31 131.74 48.65 A173 19 56.32 60.21 53.47 44.41 48.1 59.86 53.75 77.52 47.84 65.45 52.07 68.24 52.54 A174 9 32.41 44.41 38.83 64.83 29.48 45.36 28.02 49.12 36.94 64.31 60.84 165.48 62.37 A175 5 44.64 58.2 51.07 12.05 60.57 51.66 49.14 35.87 51.9 47.3 126.62 124.37 113.64 A176 12 19.98 56.93 74.55 79.77 56.87 62.08 99.42 73.92 88.08 82.61 103.51 66.74 74.28 A177 25 33.19 38.74 44.99 62.93 32.05 41.19 72.15 258.85 150.7 54.24 152.78 135.43 165.54

84 GENRE: DOOM METAL A178 6 27.39 14.32 37.38 29.92 27.74 13.24 26.38 33.79 16.49 15.05 10.11 23.83 14.42 A179 10 22.82 23.38 25.18 42.01 27.46 22.0 39.11 36.57 28.51 19.81 26.16 22.26 19.04 A180 8 19.66 34.05 49.86 16.42 24.92 32.14 43.67 29.0 24.32 24.24 24.85 22.36 23.71 A181 15 35.01 26.45 30.16 35.02 32.35 27.37 29.04 38.37 26.5 26.45 31.94 31.25 20.76 A182 5 27.25 31.55 34.47 34.61 16.58 31.39 33.19 33.94 17.58 34.07 27.37 46.29 19.47 A183 5 41.35 24.05 41.12 34.69 26.79 23.04 37.63 22.25 26.89 21.87 36.49 41.72 24.04 A184 14 57.03 28.26 35.88 22.74 23.91 29.04 27.35 38.24 23.11 29.91 40.86 30.67 34.44 A185 10 39.63 30.56 30.06 27.05 32.6 32.57 29.8 31.25 38.02 31.3 29.05 29.41 30.86 A186 10 49.45 23.14 25.49 24.32 26.07 22.64 26.46 47.92 35.45 31.4 31.12 51.19 27.85 A187 6 27.18 16.11 27.83 31.06 30.91 16.79 22.22 36.5 17.1 17.55 49.04 79.19 30.7 A188 6 22.44 38.78 53.04 29.56 14.07 38.72 66.06 31.3 14.35 22.64 26.12 37.53 20.72 A189 9 41.84 24.78 25.4 48.25 21.96 18.46 31.51 34.2 20.45 26.49 26.8 92.76 24.31 A190 4 21.71 39.82 30.72 18.14 17.7 38.55 22.55 27.48 34.97 59.88 35.35 29.09 47.24 A191 20 24.6 33.19 33.31 39.53 34.27 33.84 37.44 50.24 36.96 30.47 28.35 37.36 27.14 A192 7 23.65 40.7 40.12 31.73 23.04 40.69 35.92 45.27 21.44 39.19 33.05 38.85 36.31 A193 5 35.74 34.31 33.74 36.57 35.82 34.75 36.56 37.7 32.46 34.68 38.23 36.98 34.77 A194 6 28.6 26.92 15.01 21.97 43.25 28.71 12.81 50.2 31.82 27.22 110.53 53.68 20.28 A195 15 30.91 40.83 42.33 40.81 36.63 39.17 38.34 48.07 33.3 35.16 30.68 36.45 32.6 A196 22 36.17 39.5 32.32 39.22 32.63 38.6 38.83 45.11 32.63 36.3 45.12 49.92 32.48 A197 9 37.16 54.56 45.75 44.26 27.78 52.32 33.69 55.57 25.95 30.52 56.47 28.48 24.53 A198 15 37.23 47.53 21.53 40.99 37.85 47.62 27.21 23.57 35.7 46.22 45.8 66.78 39.96 A199 13 34.25 42.99 47.35 51.13 24.17 43.14 50.83 43.21 26.4 38.34 46.75 38.0 30.74 A200 17 48.66 28.95 42.19 42.65 43.06 28.84 44.2 57.19 33.76 29.23 55.93 47.34 34.49 A201 9 34.01 36.56 49.87 43.51 34.46 36.41 42.48 38.97 32.07 48.62 44.37 43.69 37.49 A202 17 30.22 52.83 39.98 35.61 36.03 48.47 56.15 42.33 46.88 38.24 36.01 21.8 34.65 A203 9 39.86 40.11 42.38 38.02 24.35 36.77 40.81 48.85 27.03 36.48 60.17 57.21 44.74 A204 17 37.17 38.9 36.3 41.49 27.07 44.01 32.35 57.79 32.32 42.66 67.39 43.92 42.01 A205 18 30.87 39.78 42.57 50.37 31.08 38.97 66.16 45.72 31.44 40.21 41.48 43.57 40.39 A206 10 22.66 34.09 48.77 58.34 34.91 40.0 77.86 49.3 26.59 36.72 39.4 34.23 36.37 A207 4 26.68 48.1 77.16 21.74 69.43 38.35 34.89 84.1 40.43 50.39 13.73 24.94 30.11 A208 14 23.43 36.35 54.3 46.57 36.69 36.5 64.01 36.59 39.02 41.13 34.22 66.03 47.95 A209 31 47.25 41.07 64.21 38.96 29.41 43.27 63.23 42.48 39.73 40.47 57.47 47.17 39.97 A210 10 26.05 47.46 51.55 41.31 40.79 44.95 44.03 57.2 45.32 52.27 43.53 43.55 41.18 A211 25 21.47 39.98 45.49 53.38 42.5 39.87 55.78 85.27 36.35 41.41 44.53 40.53 31.83 A212 4 34.32 40.35 48.23 63.34 49.47 51.2 44.74 34.56 51.69 47.59 44.83 40.55 44.52 A213 10 28.18 28.53 67.59 93.92 33.1 20.2 59.67 52.24 44.62 22.52 56.2 49.53 35.7 A214 11 24.63 30.01 58.66 38.27 35.66 29.39 76.41 66.96 49.3 37.71 46.8 60.73 42.81 A215 10 32.34 79.67 57.37 24.36 21.59 76.24 35.39 34.71 17.45 78.32 32.68 56.43 59.95 A216 16 49.81 43.88 44.35 31.2 35.74 44.89 43.95 82.46 43.2 58.07 38.31 80.91 39.16 A217 43 43.31 32.38 82.06 71.31 40.75 31.99 70.57 47.22 40.84 31.42 63.04 31.13 44.25 A218 11 34.64 71.1 47.74 48.25 42.06 55.93 56.13 51.84 44.18 47.61 57.33 48.79 47.29 A219 6 94.84 66.72 38.71 54.3 32.4 65.43 37.26 75.79 33.8 58.78 90.47 
53.52 39.16 A220 26 34.81 67.31 36.13 41.59 50.37 67.07 38.44 77.83 49.34 70.74 49.49 73.33 50.17 A221 3 13.98 41.95 47.09 143.92 49.86 56.16 45.56 104.01 58.88 44.06 27.5 25.3 29.25 A222 22 49.56 63.11 64.57 48.32 38.28 54.39 61.75 61.5 41.14 59.69 65.9 77.0 48.67 A223 27 57.48 48.54 58.26 56.92 73.01 48.65 53.05 49.15 67.35 60.57 32.89 87.17 70.5 A224 18 17.78 54.69 65.27 63.65 57.0 57.02 60.06 73.14 50.2 44.15 46.52 95.4 40.31 A225 17 33.75 68.13 33.28 63.2 57.47 65.47 30.93 71.33 55.46 69.24 50.19 74.82 70.86 A226 10 80.82 49.1 72.23 74.64 56.24 50.63 84.75 43.01 56.55 52.36 70.99 62.56 42.81 A227 6 16.0 128.1 57.67 52.92 44.15 115.67 44.55 80.06 43.24 38.7 25.77 70.01 31.98 A228 27 32.18 66.97 82.62 64.88 38.95 66.76 57.47 79.62 37.69 67.83 57.33 65.86 56.97 A229 11 42.81 79.84 53.2 52.57 32.69 77.99 47.49 69.01 33.64 53.69 116.89 103.24 35.93 A230 4 17.98 77.77 86.55 19.77 37.78 37.79 129.38 67.86 42.48 63.99 100.96 70.29 64.48 A231 8 14.96 54.21 72.59 78.01 63.88 56.01 75.86 88.49 70.6 44.83 55.02 99.48 47.81 A232 7 34.83 32.45 98.62 52.36 29.48 28.11 92.49 39.07 51.59 80.49 138.25 113.93 76.98 A233 60 82.99 40.29 140.97 76.52 46.86 42.74 134.11 57.44 47.83 45.27 98.12 65.49 46.58 A234 10 24.74 54.16 55.87 88.78 54.29 51.11 48.14 107.39 63.47 75.02 99.46 63.4 88.54 A235 10 33.52 65.73 168.13 94.87 78.19 56.23 133.58 70.18 66.82 51.99 57.73 107.68 44.29 A236 3 12.34 71.53 62.82 135.46 84.87 71.52 44.35 58.85 99.17 161.55 70.96 74.9 111.79 A237 29 37.58 60.79 94.85 133.25 56.91 61.9 93.0 80.57 46.47 73.82 138.53 94.19 143.93 A238 9 22.93 56.28 117.34 125.54 64.97 60.43 107.05 108.91 86.72 68.21 135.41 118.26 91.67

A239 3 15.29 410.63 82.23 104.99 66.9 399.65 88.12 212.13 69.15 62.29 63.81 55.12 60.62
A240 5 35.67 152.61 267.16 174.57 70.45 153.06 243.02 208.89 140.96 270.46 794.0 777.09 450.2
A241 6 6.58 253.21 113.39 239.34 271.78 241.82 214.7 285.18 217.23 264.37 1192.8 617.8 124.88
GENRE: ELECTRONIC
A242 5 19.23 21.7 61.63 46.16 21.48 21.71 19.34 46.64 20.41 19.68 16.94 6.11 17.95
A243 26 34.29 28.47 35.77 40.38 31.98 28.26 32.92 43.57 32.33 37.97 41.88 34.45 30.45
A244 4 12.98 21.6 41.98 70.29 36.64 21.6 52.67 33.05 27.66 29.67 40.62 37.24 29.74
A245 60 21.36 34.32 55.23 37.43 31.27 34.75 49.29 38.35 32.08 36.77 37.81 33.46 32.44
A246 19 17.78 34.51 58.17 49.24 35.27 35.78 47.61 41.92 35.0 35.6 46.64 47.87 31.95
A247 60 26.69 40.44 51.08 48.12 29.62 40.63 50.28 60.44 32.78 42.41 48.18 47.82 39.08
A248 15 25.98 34.46 43.79 52.53 27.71 34.66 53.22 52.52 32.81 43.04 54.31 66.83 37.35
A249 13 17.89 32.19 74.1 36.61 35.87 41.26 51.75 65.27 43.67 39.27 42.87 58.47 29.64
A250 7 31.27 26.74 55.49 74.64 25.4 31.32 48.61 65.79 22.78 46.38 60.7 64.89 29.09
A251 8 22.29 29.59 94.31 43.0 18.79 28.24 55.1 53.11 24.98 28.87 42.25 109.04 27.09
A252 6 15.86 29.17 64.18 24.1 43.09 29.21 144.82 45.19 30.83 29.99 33.79 27.61 54.83
A253 10 21.63 40.21 33.46 58.94 28.74 40.4 37.77 64.32 31.98 58.48 69.66 111.0 33.75
A254 5 39.68 58.94 93.57 36.99 25.75 58.95 68.62 49.25 32.36 57.64 48.67 45.79 35.24
A255 6 26.01 32.04 69.43 87.45 33.85 32.03 54.34 112.39 37.11 32.49 44.23 46.34 34.61
A256 11 43.98 61.68 53.16 67.25 29.66 63.03 54.8 70.28 23.95 51.57 83.7 48.73 23.63
A257 8 24.31 25.93 55.56 33.66 35.1 25.86 65.66 55.82 41.26 33.37 105.86 104.85 52.99
A258 3 33.27 44.35 54.64 37.92 34.78 44.35 46.67 39.92 38.98 52.2 93.54 96.34 79.51
A259 14 44.15 37.06 45.86 47.34 37.37 35.7 51.81 53.34 37.81 27.13 161.24 93.44 50.23
A260 37 30.01 47.47 84.5 64.91 55.38 46.43 70.56 58.38 66.2 44.79 45.33 79.64 37.85
A261 60 14.02 49.51 87.77 86.26 46.06 49.27 75.74 76.8 47.26 50.34 84.34 60.95 45.71
A262 60 26.9 47.65 88.08 68.28 49.74 50.35 101.73 58.58 59.57 52.34 76.34 93.11 55.87
A263 29 52.32 47.7 97.54 101.09 56.58 52.0 100.61 76.94 55.77 47.2 85.58 43.88 53.26
A264 29 23.8 52.26 62.39 61.62 53.79 52.19 62.88 90.03 60.94 50.59 107.88 103.09 64.71
A265 20 20.8 60.65 140.87 67.95 51.32 60.46 143.1 72.87 56.64 57.14 53.51 56.37 59.25
A266 19 35.08 37.48 117.95 62.01 123.19 37.4 103.88 62.86 100.93 60.29 116.28 86.0 55.98
A267 19 14.87 70.71 115.58 100.25 62.8 71.09 103.36 87.38 59.23 71.74 74.55 100.53 54.22
A268 4 13.97 69.66 75.29 172.39 86.32 69.69 99.25 128.44 64.88 40.98 68.1 35.66 67.88
A269 37 9.9 77.56 119.63 105.54 71.04 78.38 109.18 97.08 75.49 72.1 71.23 56.87 55.28
A270 7 21.45 79.79 59.21 79.67 56.31 79.82 85.08 94.27 86.31 91.18 192.09 25.36 66.41
A271 8 23.83 39.76 87.25 43.36 29.93 39.7 69.02 49.86 35.39 44.04 75.68 503.04 88.0
A272 3 37.65 30.87 28.1 48.09 45.65 30.86 26.14 56.31 38.72 54.0 97.39 612.92 54.38
A273 7 8.17 46.38 168.41 62.69 40.01 46.25 144.7 105.89 79.86 80.3 42.26 164.17 195.23
A274 60 36.8 116.69 96.78 76.3 99.1 121.72 100.17 95.2 105.86 122.03 109.25 103.62 127.47
A275 32 14.18 52.14 112.38 142.32 72.46 54.44 110.64 143.87 62.23 52.58 246.94 72.35 205.56
A276 6 58.18 68.07 133.49 50.29 39.88 64.3 158.62 39.25 24.51 36.7 586.49 111.89 28.93
A277 7 14.19 64.42 152.26 56.49 74.58 64.47 158.51 151.61 75.89 79.58 324.97 96.6 105.12
A278 8 22.7 94.11 114.27 209.75 102.64 93.89 138.5 212.26 106.02 91.21 117.14 122.09 148.13
A279 60 13.89 109.67 184.51 167.65 143.13 108.82 190.32 190.44 142.17 103.71 155.41 112.95 116.23
A280 4 24.34 44.4 171.98 70.74 21.31 44.4 183.65 47.27 80.74 147.1 1084.8 58.04 76.94
A281 14 3.77 224.1 174.68 251.09 143.49 220.41 200.72 385.58 199.38 146.04 305.33 129.64 162.05
GENRE: HEAVY METAL
A282 3 24.12 30.17 15.46 64.82 9.7 29.81 24.81 92.12 32.67 17.52 10.47 10.12 9.01
A283 34 51.81 24.39 33.45 31.58 26.43 25.18 32.99 22.54 17.56 25.45 45.47 31.16 31.73
A284 17 41.68 26.66 39.51 32.05 16.3 24.84 37.66 41.44 18.49 33.73 55.07 26.27 24.95
A285 17 36.44 20.59 48.14 55.0 31.7 20.68 40.51 39.26 31.93 22.65 35.26 28.39 26.98
A286 3 34.08 24.39 63.33 69.09 9.9 24.06 57.94 44.81 24.11 30.89 17.29 16.43 32.45
A287 10 54.08 38.93 32.18 40.27 44.85 40.24 37.48 41.03 37.75 39.28 33.07 36.62 35.66
A288 7 61.99 40.99 38.05 33.49 44.99 32.91 37.36 35.34 40.22 42.53 70.88 39.31 30.34
A289 28 74.13 32.95 45.9 44.02 37.46 34.17 44.51 52.83 34.76 36.59 61.93 50.66 34.9
A290 27 34.27 30.27 43.31 58.96 41.5 29.93 53.84 46.83 34.91 30.59 56.4 64.5 33.12
A291 7 36.84 35.55 29.68 42.6 41.04 32.68 55.32 64.9 38.84 40.74 53.0 54.49 43.57
A292 22 60.64 40.0 53.82 44.82 40.76 41.08 62.66 56.15 26.93 40.9 55.11 43.76 39.27
A293 10 26.82 46.16 46.81 39.63 55.96 43.25 35.62 50.27 39.82 31.49 43.94 68.81 46.91
A294 11 37.5 41.29 49.37 38.74 52.17 41.11 37.21 60.85 47.65 44.79 59.81 51.54 41.86
A295 16 37.5 44.0 49.29 51.84 38.68 45.33 51.65 47.79 42.13 41.12 60.25 50.54 47.07
A296 12 46.64 40.16 66.96 42.48 21.08 41.0 75.86 53.42 24.29 43.8 49.29 76.87 37.08
A297 60 29.2 38.71 59.29 57.35 39.49 39.37 56.68 58.28 40.14 38.45 50.84 54.65 42.07
A298 12 30.99 33.48 37.9 59.52 30.85 31.85 43.76 64.92 31.63 33.2 68.82 132.97 41.23

A299 12 47.08 55.49 34.2 41.84 27.79 54.87 42.24 87.81 47.21 48.36 65.09 54.36 63.46
A300 29 33.39 45.54 78.51 50.82 40.82 45.32 69.86 52.91 37.11 41.06 63.5 45.73 53.83
A301 9 26.78 79.29 53.38 24.33 33.21 78.28 38.04 93.96 47.38 68.08 66.18 31.41 41.04
A302 6 28.85 79.49 53.75 79.41 47.96 79.61 51.23 77.63 46.8 38.76 35.92 41.5 31.58
A303 16 18.9 56.28 51.5 83.47 47.29 56.25 36.64 45.75 46.97 45.88 58.57 94.31 48.93
A304 9 42.28 29.3 63.98 63.83 56.23 31.23 58.91 64.97 51.63 39.54 49.86 183.12 50.16
A305 16 47.11 64.08 108.5 65.95 34.81 63.44 110.37 83.85 40.89 66.36 44.54 39.98 49.36
A306 5 20.71 37.83 87.89 40.97 24.13 39.13 56.59 58.77 45.09 45.08 196.59 94.5 63.25
A307 24 56.74 60.77 63.06 79.46 70.61 61.56 65.88 61.66 59.71 71.69 98.65 47.07 62.62
A308 60 21.0 59.09 104.56 108.58 58.05 57.86 119.88 81.18 56.27 52.13 102.41 62.64 58.88
A309 59 20.44 52.86 135.14 104.96 49.55 52.56 127.28 83.92 52.28 51.88 97.34 68.27 58.81
A310 60 21.86 46.12 107.29 117.13 58.62 47.78 107.35 77.97 71.26 50.3 134.74 124.96 71.79
A311 60 13.03 73.98 131.91 88.16 82.61 72.85 106.06 106.99 81.5 73.22 91.29 90.33 84.94
A312 14 19.94 72.56 78.97 50.9 61.05 72.56 105.6 105.38 69.4 102.34 61.25 338.92 77.47
A313 60 38.9 124.34 108.64 101.13 122.24 125.71 95.8 79.98 111.87 127.52 88.94 142.14 122.38
A314 47 11.46 76.53 142.89 134.59 165.57 69.8 158.57 149.7 154.61 50.46 184.33 78.58 130.39
A315 3 23.27 279.81 115.02 103.13 89.79 279.18 118.61 139.67 86.61 137.56 119.41 114.32 103.55
A316 7 16.53 310.69 317.46 264.72 226.12 311.03 226.29 109.42 397.67 302.7 987.4 121.18 594.05
GENRE: PUNK ROCK
A317 18 36.09 43.26 38.83 35.22 27.46 42.49 41.93 45.0 34.9 50.39 45.15 45.6 29.0
A318 11 44.08 30.11 63.26 59.53 27.76 29.56 53.04 44.07 29.38 32.74 64.92 41.26 28.23
A319 38 23.97 39.86 37.7 45.44 25.7 39.46 48.22 43.99 19.15 41.07 82.72 64.7 37.94
A320 5 36.23 24.52 41.39 42.31 26.55 24.06 63.9 94.56 25.64 27.22 115.66 31.76 24.28
A321 17 36.63 53.06 50.31 40.83 35.15 55.96 40.14 37.71 35.58 41.7 55.76 60.03 43.36
A322 9 17.43 42.52 47.23 60.82 33.37 41.28 53.09 49.89 31.55 29.81 54.0 98.37 33.27
A323 60 23.92 40.06 44.24 49.17 44.4 41.58 45.51 50.32 46.3 40.59 46.34 86.39 42.95
A324 38 27.25 47.45 61.24 57.43 44.0 47.98 63.06 57.11 42.5 44.46 56.08 55.81 39.57
A325 35 26.09 36.57 85.67 61.85 38.3 36.92 129.17 50.12 41.08 38.38 45.97 40.68 41.69
A326 17 18.55 74.38 52.93 75.74 31.84 72.98 41.76 58.26 32.46 54.89 66.4 45.79 40.61
A327 8 15.4 66.36 60.49 58.69 70.89 64.06 31.11 67.27 57.08 53.72 52.44 26.63 40.1
A328 16 33.77 40.55 39.11 72.28 57.9 37.66 76.52 64.38 60.35 44.5 64.07 44.03 54.21
A329 40 27.86 50.42 95.73 66.92 46.48 48.95 59.99 63.67 41.38 48.91 70.35 48.93 43.1
A330 18 41.79 38.75 92.18 42.54 45.72 39.86 97.82 47.3 46.01 58.09 70.88 67.69 51.42
A331 26 27.16 60.24 77.93 62.91 41.62 63.1 80.89 90.15 44.78 43.03 52.81 42.96 38.2
A332 25 22.11 46.75 70.89 59.41 52.28 46.92 83.58 58.02 50.09 48.76 73.25 71.78 46.48
A333 60 37.04 54.13 73.53 64.81 43.94 56.67 75.66 59.39 47.8 57.67 61.55 61.21 56.33
A334 38 25.04 51.52 70.86 72.0 45.67 52.71 70.35 67.95 46.15 56.74 63.03 66.54 52.68
A335 5 9.09 45.66 51.72 31.42 40.95 45.64 119.14 77.26 27.47 37.62 95.65 146.88 32.18
A336 7 11.86 98.03 61.21 59.47 66.37 96.39 93.7 54.4 56.18 65.45 44.11 43.1 31.25
A337 28 15.54 45.26 85.81 89.25 58.31 43.87 86.45 87.91 61.28 29.92 89.16 65.33 42.65
A338 15 42.31 72.07 57.2 94.7 38.69 71.87 65.1 91.8 45.88 64.78 57.63 68.48 57.86
A339 14 49.58 40.42 99.58 83.17 51.96 36.73 102.97 106.24 51.76 38.56 84.2 55.68 43.11
A340 8 24.04 63.17 45.35 93.35 30.17 63.88 58.54 107.32 27.09 57.58 70.59 153.75 38.75
A341 13 19.77 77.03 90.25 61.33 68.9 71.17 104.2 77.04 53.07 59.66 46.89 54.87 53.38
A342 60 40.7 70.9 93.24 73.62 65.94 71.24 85.01 63.35 60.46 65.42 86.11 59.2 48.69
A343 13 40.98 49.79 126.56 47.77 22.93 51.47 109.61 56.4 42.12 84.16 62.41 176.36 56.43
A344 13 24.35 86.16 72.51 67.55 59.11 83.27 74.42 42.09 59.74 85.06 73.53 117.86 65.47
A345 32 54.6 71.03 117.25 80.8 57.76 68.62 115.48 65.65 70.21 64.33 86.84 78.08 65.74
A346 12 30.87 75.18 82.54 63.12 54.31 75.31 100.48 45.08 52.92 77.11 164.64 92.64 72.16
A347 43 20.12 74.39 67.59 77.97 61.22 73.02 95.11 98.65 66.87 81.83 67.81 191.1 49.48
A348 60 32.5 68.71 91.57 72.11 62.66 71.66 94.32 87.19 80.9 71.49 102.37 133.53 99.41
A349 6 18.88 184.21 85.88 130.27 43.12 184.06 65.31 145.46 33.65 65.27 100.77 17.83 52.82
A350 25 18.34 80.78 98.08 103.46 77.38 78.31 107.17 144.76 83.51 75.97 68.37 125.76 77.17
A351 18 14.87 111.42 95.97 110.79 102.94 110.61 87.43 106.97 98.29 111.0 89.42 76.92 95.64
A352 32 39.75 67.71 131.14 113.72 87.69 70.69 140.69 108.93 117.42 80.46 113.2 125.94 116.36
A353 32 12.61 102.95 95.06 103.8 88.58 100.72 117.11 138.16 92.53 101.57 144.97 143.58 99.66
A354 41 16.84 67.09 170.7 130.15 69.77 63.07 167.95 145.67 60.98 62.5 240.2 90.23 62.55
A355 34 19.63 80.79 183.73 109.32 90.25 90.58 166.3 146.59 97.5 83.55 132.8 101.4 86.83
A356 4 10.1 88.17 112.78 178.78 95.17 85.45 170.59 122.54 103.33 93.92 242.71 212.82 100.42
A357 60 26.94 93.12 161.84 137.42 104.66 99.43 163.85 192.9 114.42 100.08 175.24 228.81 106.01
A358 50 23.94 144.41 104.86 132.33 149.8 142.88 110.84 153.71 144.55 141.21 161.64 159.92 155.19
A359 20 49.69 223.74 147.25 130.59 117.55 222.56 139.86 220.73 135.97 215.64 192.24 251.89 123.94

GENRE: STONER ROCK
A360 3 22.04 7.91 62.84 19.57 26.12 7.91 55.13 36.35 16.63 10.16 10.76 34.13 11.04
A361 15 29.44 29.39 17.06 26.18 24.19 30.05 19.15 37.29 23.86 29.14 25.69 37.76 23.76
A362 3 73.94 22.85 36.01 41.42 21.41 32.06 34.56 36.27 19.96 33.46 36.24 22.07 29.09
A363 6 39.47 27.21 28.02 42.9 31.5 31.11 39.03 58.29 20.63 38.48 24.61 33.1 31.48
A364 5 29.01 42.89 26.73 57.01 18.85 42.88 23.19 22.09 22.06 42.88 54.38 41.32 25.37
A365 18 29.19 26.59 50.7 46.62 24.08 28.02 40.31 50.54 27.28 25.75 42.1 34.76 30.3
A366 8 66.34 34.55 47.28 42.35 37.85 39.51 36.78 39.3 28.62 26.99 47.27 25.22 34.08
A367 14 45.29 30.54 38.34 38.43 33.26 30.08 36.55 51.72 37.35 36.25 44.59 46.64 44.56
A368 7 21.64 44.31 47.54 37.02 24.21 41.67 35.82 61.4 22.55 41.59 31.13 40.16 42.88
A369 26 24.47 45.2 48.24 40.92 40.23 41.16 54.61 40.75 25.88 41.13 32.02 43.71 24.49
A370 14 32.62 41.16 36.59 44.04 38.3 42.34 47.83 44.23 36.3 43.59 50.33 47.68 35.48
A371 12 40.97 45.03 60.8 49.2 27.95 44.86 60.31 39.39 33.44 39.58 38.83 40.66 31.58
A372 25 26.88 40.72 64.27 67.93 31.41 36.5 62.79 52.67 32.99 36.83 64.99 39.34 33.38
A373 24 26.78 58.71 40.83 39.91 39.48 54.37 48.27 59.46 44.5 54.04 55.52 50.66 36.23
A374 8 25.27 46.58 61.22 52.6 42.6 52.43 54.61 57.61 42.02 48.07 85.64 55.8 43.34
A375 3 20.38 60.28 35.02 53.05 15.7 60.46 23.9 142.8 25.13 29.86 90.73 46.54 70.04
A376 21 35.52 52.5 56.82 53.04 44.24 51.39 53.34 60.37 36.85 47.97 51.22 104.01 45.02
A377 6 18.5 48.77 46.16 26.02 33.55 51.5 32.19 61.2 41.23 79.56 69.94 123.87 63.76
A378 5 35.97 78.48 64.54 70.24 57.99 79.94 51.8 149.74 78.44 128.53 147.22 43.8 95.2
A379 6 23.69 59.66 186.77 74.86 32.64 57.98 169.56 80.4 46.41 41.27 95.29 84.81 150.0
A380 4 38.57 74.68 79.14 108.47 152.16 74.74 104.78 43.31 150.67 84.64 120.64 55.31 80.79
A381 9 24.1 127.18 144.0 86.11 55.49 126.94 91.64 179.15 61.71 114.16 61.41 39.71 88.88
A382 10 39.93 59.78 177.24 117.96 73.22 60.4 183.41 79.0 92.37 62.64 162.44 74.23 89.58
A383 5 31.14 71.86 148.58 124.86 48.66 77.78 100.12 80.13 72.73 216.1 84.87 65.14 211.49
A384 4 13.76 140.93 38.85 113.89 79.39 140.88 35.98 65.84 101.81 170.06 226.26 68.22 124.49
A385 6 17.94 155.49 323.37 221.4 179.13 164.28 342.01 166.66 232.97 201.21 267.74 134.58 245.54
GENRE: THRASH METAL
A386 17 20.65 24.28 56.24 35.01 25.35 23.77 44.67 34.91 24.56 23.6 27.36 29.43 23.82
A387 26 24.0 35.5 37.2 27.67 27.62 36.73 33.89 22.64 31.84 37.15 26.71 35.01 28.85
A388 3 30.25 26.72 31.77 62.96 25.75 25.98 42.99 27.17 31.2 15.37 47.0 23.9 20.24
A389 21 32.5 32.47 52.9 34.62 23.47 32.87 50.81 35.38 24.59 31.85 38.81 32.38 29.87
A390 16 30.74 36.69 40.51 44.95 23.76 32.1 34.05 36.93 21.97 28.98 65.42 37.26 26.02
A391 11 29.0 36.06 39.28 42.22 30.2 35.45 47.52 50.64 31.22 31.72 26.61 33.12 33.67
A392 14 41.17 37.35 36.9 41.95 22.7 36.43 41.33 62.51 30.47 41.95 27.43 35.42 34.81
A393 6 27.85 42.06 33.29 32.23 19.32 42.05 29.22 35.01 25.38 39.67 53.7 50.85 46.72
A394 60 34.46 37.32 44.09 37.92 32.63 36.56 43.4 44.21 32.71 37.8 44.51 35.79 35.62
A395 23 26.72 37.16 45.66 36.03 33.49 42.03 47.67 46.23 39.55 42.97 58.01 62.44 36.14
A396 6 35.38 42.17 50.57 36.55 37.49 40.03 47.86 76.26 24.05 39.41 87.99 43.04 37.19
A397 60 20.52 44.42 60.57 66.34 46.18 43.68 53.84 51.86 40.84 42.72 45.3 51.3 39.85
A398 23 35.36 45.05 37.96 45.48 43.17 44.72 43.06 57.28 42.61 46.91 51.05 85.78 47.37
A399 26 42.2 31.46 40.52 67.37 33.37 32.62 49.53 91.33 39.57 31.32 43.17 101.07 44.33
A400 13 33.03 45.2 30.27 40.98 24.2 43.95 27.69 23.86 28.93 55.92 60.41 192.13 51.14
A401 4 16.15 75.49 26.12 23.86 49.95 73.46 50.92 97.03 65.65 30.6 66.37 45.51 46.68
A402 25 24.33 36.71 75.04 74.69 34.12 74.31 63.58 44.81 41.36 47.68 54.12 74.59 42.46
A403 51 37.82 53.78 67.91 64.04 36.16 52.08 70.36 55.57 39.52 60.27 46.18 71.28 63.14
A404 4 16.44 45.4 61.31 67.5 48.03 46.2 71.94 71.06 85.44 69.51 39.45 37.74 60.58
A405 15 24.18 50.49 60.09 70.53 41.95 51.17 43.51 57.29 40.44 54.57 99.69 87.26 98.99
A406 34 35.02 60.91 43.6 79.48 68.58 58.9 41.58 81.31 43.73 65.67 74.19 86.54 105.99
A407 10 45.6 38.34 112.68 54.82 67.37 38.71 121.85 86.23 48.54 14.89 134.66 194.88 31.74
A408 11 22.33 176.89 102.82 57.65 73.87 152.09 101.44 73.88 92.86 101.6 62.17 130.77 75.88
A409 16 17.91 103.22 124.55 90.07 83.94 90.67 96.1 121.37 96.63 79.15 210.15 78.94 93.23
A410 10 31.81 96.69 40.15 69.92 42.49 93.95 39.52 115.29 57.55 150.74 317.45 133.65 121.94
A411 28 32.75 122.36 120.18 86.96 78.5 137.69 91.23 98.3 92.42 113.98 137.31 103.35 133.35
A412 51 48.38 132.88 134.37 129.1 108.32 134.41 122.93 167.47 108.74 137.78 134.38 225.8 117.26

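As a minimal illustration, rows of the form "A245 60 21.36 34.32 ..." above can be parsed into a tabular structure for further analysis. This sketch assumes Python with pandas; because the per-column metric names are not repeated in this appendix excerpt, hypothetical placeholder names (v1 to v13) are used, and the function name parse_appendix_rows is an assumption for the example only.

import pandas as pd

def parse_appendix_rows(lines):
    # Parse appendix rows grouped under "GENRE: ..." headings.
    # Each data row is: artist ID, sample count, then 13 numeric values
    # whose meanings are given by the corresponding results tables.
    records, genre = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("GENRE:"):
            genre = line[len("GENRE:"):].strip()
            continue
        parts = line.split()
        values = [float(v) for v in parts[2:]]
        record = {"artist": parts[0], "genre": genre, "samples": int(parts[1])}
        record.update({f"v{i + 1}": v for i, v in enumerate(values)})
        records.append(record)
    return pd.DataFrame(records)

# Example usage with two rows copied from the table above:
rows = [
    "GENRE: ELECTRONIC",
    "A242 5 19.23 21.7 61.63 46.16 21.48 21.71 19.34 46.64 20.41 19.68 16.94 6.11 17.95",
]
print(parse_appendix_rows(rows))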