Hellinger Distance-Based Similarity Measures for Recommender Systems
Roma Goussakov
One-year master thesis
Umeå University

Abstract
Recommender systems are used in online sales and e-commerce to recommend items/products that customers might buy, based on their previous buying preferences and related behaviours. Collaborative filtering is a popular computational technique that has been used worldwide for such personalized recommendations. Of the two forms of collaborative filtering, neighbourhood-based and model-based, the neighbourhood-based form is the more popular and is relatively simple. It relies on the concept that a certain item might be of interest to a given customer (the active user) if either that user appreciated similar items in the buying space, or the item is appreciated by similar users (neighbours). To implement this concept, different kinds of similarity measures are used. This thesis compares different user-based similarity measures and defines meaningful measures based on the Hellinger distance, which is a metric in the space of probability distributions. Data from the popular MovieLens database are used to show the effectiveness of different Hellinger distance-based measures compared to other popular measures such as Pearson correlation (PC), cosine similarity, constrained PC and JMSD. The performance of the different similarity measures is then evaluated with the help of mean absolute error, root mean squared error and F-score. From the results, no evidence was found that Hellinger distance-based measures performed better than the more popular similarity measures for the given dataset.

Swedish abstract
Title: Hellinger distance-based similarity measures for recommender systems
Recommender systems are most often used in e-commerce to recommend potential goods/products that a customer will be interested in buying, based on the customer's previous buying preferences and related behaviour. Collaborative filtering (CF) is a popular computational technique that has been used all over the world for these personalized recommendations. Of the two types of CF, neighbourhood-based and model-based, neighbourhood-based CF is the more popular and is relatively simple. It relies on the concept that a specific product may be of interest to the given user if either the user appreciated similar products or the product was appreciated by similar users (neighbours). To implement this concept, different types of similarity measures are used. This thesis is intended to compare different user-based similarity measures and to define meaningful similarity measures based on the Hellinger distance, which is a measure based on probability distributions. Data from a popular website, MovieLens, will be used to show the effectiveness of the different Hellinger distance-based measures compared to other popular measures such as Pearson correlation (PC), cosine similarity, constrained PC and JMSD. The performance of the different similarity measures will then be studied with the help of mean absolute error, root mean squared error and F-score. From the results, no evidence was found that Hellinger distance-based measures performed better than the more popular similarity measures for the given data.

Popular scientific summary
There are a lot of different product choices when browsing the internet for online purchasing. It can become overwhelming for consumers to find and choose the products that are most interesting to them. To make it easier to go through this sea of products, companies have started to use so-called recommender systems. These systems provide the consumer with personalized recommendations based on data from, for example, the consumer's previous purchases. The use of recommendations has also been shown to increase revenues for companies, as customers tend to purchase more items.
Different methods have been created to optimize recommender systems for different types of data. One of these methods is called collaborative filtering (CF) and is considered to be the most successful approach for personalized product or service recommendations. This method is separated into two main classes, neighborhood-based and model-based CF. The model-based approach looks at the data from customers and estimates an equation that predicts what kind of items a customer will find worthwhile. The neighborhood-based approach works on the idea that a product might be interesting to a customer if similar customers liked it as well, or if the customer liked a similar product before. Neighborhood-based CF is the focus of this paper. Its main step is to calculate, with the use of statistical software, a value that quantifies how similar different customers are. This value is known as a similarity measure. One way to quantify similarity is through a statistical distance: the smaller the statistical distance between the items/customers, the more similar they are. There are several ways to calculate that distance. The purpose of this paper is to see how well popular calculation methods perform against each other, and to introduce a new one that is included in the comparison. The data used in this paper come from one of the most popular datasets in the CF research field. It is called MovieLens and is available for public use online. The dataset contains information on 600 different users, each of whom has rated a number of movies (items) on a scale from 1 to 5. Some of the users have rated many movies and some just a few. This is one of the problems in using CF: not all of the items are rated by all of the users. In the comparison of the methods on this data, the newly introduced method did not perform better than the more popular methods.
Contents
1 Introduction
1.1 Data
2 Theory and methods
2.1 Popular CF measures
2.2 Bhattacharyya coefficient-based CF measure
2.3 Hellinger distance-based CF measure
3 Evaluation methods
4 Results
4.1 MAE
4.2 RMSE
4.3 F-score
5 Discussion

1 Introduction
Whether it is buying clothes or watching a new show on Netflix, e-commerce is something that most of the population has some experience with. There are often several options to choose from when looking for a product. Amazon offers its customers over 410 000 different e-books to choose from [1]. In this sea of items, books in this example, it can be hard for the user to find the one item that would be of interest. With the use of recommender system techniques it is possible to provide customers with personalized recommendations that suit their specific interests. So every time a user is recommended a new item when browsing Amazon's website, or is shown supplementary products to one they are about to purchase, chances are that it was done with the help of a recommender system.
There are two main classes of recommender systems, content-based filtering and collaborative filtering. Content-based filtering uses the information in a user's profile together with the characteristics of the purchased items to find and recommend similar items [6]. This paper focuses solely on collaborative filtering (CF), more specifically on neighborhood-based CF. There is a saying: "Show me who your friends are, and I will tell you who you are". The basic concept of the neighborhood-based CF method is very similar to this saying. It uses the data to find the given user's "friends", that is, other users who have the same preferences. The k closest users are found with the use of a similarity measure. Information from these k closest users is then used to estimate the rating the user would give to a certain item.
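The neighbourhood idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy ratings, the function names, the use of cosine similarity over co-rated items as a stand-in similarity measure, and the similarity-weighted average used for the prediction. It is not claimed to be the prediction formula or the measures evaluated in this thesis.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users have rated (0.0 if none)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, active_user, item, k=2):
    """Predict the active user's rating of `item` from the k most similar
    users who have rated it, using a similarity-weighted average
    (an illustrative choice, not necessarily the thesis's formula)."""
    candidates = [
        (cosine_similarity(ratings[active_user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != active_user and item in other_ratings
    ]
    # Keep the k candidates with the highest similarity to the active user.
    neighbours = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in neighbours)
    if weight_sum == 0:
        return None  # no usable neighbours for this item
    return sum(sim * rating for sim, rating in neighbours) / weight_sum

# Hypothetical toy data: user -> {item: rating on the 1 to 5 scale}.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 5, "B": 3, "C": 4, "D": 5},
    "u3": {"A": 1, "B": 5, "D": 2},
    "u4": {"B": 4, "C": 2, "D": 1},
}
prediction = predict_rating(ratings, "u1", "D", k=2)
```

In this toy example the two nearest neighbours of `u1` are `u2` and `u4`, so the predicted rating for item `D` lies between their ratings of 5 and 1, pulled towards the more similar user `u2`.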
The purpose of this paper is to introduce a Hellinger distance-based similarity measure [7] and to evaluate its performance by comparing it with other, currently more popular similarity measures on a given dataset.

1.1 Data
This thesis is based on the dataset referred to as MovieLens, which was collected by GroupLens Research from the MovieLens web site. It was first released in 1998 and is one of the most popular datasets within the recommender systems field [4]. Several datasets are provided on the GroupLens web page. The one used in this thesis is a small sample of the data collected over the years. This sample changes over time on the website, which can make the results of this thesis hard to reproduce. The dataset used here is dated September 2018. MovieLens is built on four variables: ratings (100 836), users (600), movies (9 724) and timestamps that show when each movie was rated. Since the aim is a statistical experiment comparing different methods, and the results are not intended for commercial use, the time of data collection (the timestamps) was considered immaterial and is not taken into account when calculating the results. The rating scale is not uniform across the dataset: ratings made before February 2003 use a one-star scale from 1 to 5, whereas ratings made after February 2003 use a half-star scale from 0.5 to 5 [4]. For simplicity, all half-star ratings were rounded up to the one-star scale, so that 0.5 becomes rating 1, 1.5 becomes rating 2, and so on. The number of rated movies varies heavily between users. A user had to have rated at least 20 items to be included in the data, and several users have exactly that many ratings, while the user with the highest number of ratings has rated 2698 items.

2 Theory and methods
This thesis concentrates on finding neighbours for a given active user, whose information is then used for rating predictions. This approach is known as user-based CF.
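Before neighbours can be searched for, the ratings have to be brought onto the one-star scale described in Section 1.1. The following Python sketch shows one way this preprocessing could look. The function names, the row layout `(user_id, movie_id, rating, timestamp)` and the example rows are assumptions made for illustration, not taken from the thesis. Half-star values are rounded up, matching the examples 0.5 to 1 and 1.5 to 2, and the timestamp is dropped.

```python
import math

def to_one_star_scale(rating):
    """Round a half-star rating up to the next whole star
    (0.5 -> 1, 1.5 -> 2, ...); whole-star ratings are unchanged."""
    return int(math.ceil(rating))

def build_user_item_ratings(rows):
    """Group (user_id, movie_id, rating, timestamp) rows by user.
    The timestamp is ignored, as it is not used in the comparison."""
    ratings = {}
    for user_id, movie_id, rating, _timestamp in rows:
        ratings.setdefault(user_id, {})[movie_id] = to_one_star_scale(rating)
    return ratings

# Hypothetical rows in the assumed layout; the values are made up.
rows = [
    (1, 10, 0.5, 1000000001),
    (1, 20, 4.0, 1000000002),
    (2, 10, 1.5, 1000000003),
    (2, 30, 4.5, 1000000004),
]
user_item = build_user_item_ratings(rows)
```

`math.ceil` is used instead of Python's built-in `round` because `round` applies round-half-to-even, which would map 0.5 to 0 and 2.5 to 2 and so would not reproduce the rounding described above.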