Temporal Preference Analysis In Recommender Systems

Master’s Thesis

[Cover figure: temporal effects over days 1-31; series: category_effect, movie_effect, maca_effect; vertical axis from -0.3 to 0.5]

Marek Karczewski

Temporal Preference Analysis In Recommender Systems

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Marek Karczewski born in Warsaw, Poland

Web Information Systems Group, Department of Software Technology, Faculty EEMCS, Delft University of Technology, Delft, the Netherlands (www.wis.ewi.tudelft.nl)

© 2010 Marek Karczewski. Cover picture: Temporal Effects

Author: Marek Karczewski
Student id: 9422407
Email: [email protected]

Abstract

This thesis presents the results of research into temporal preference analysis in recommender systems. Temporal preference analysis consists of methods for detecting recurrent changes in user preferences over time and for using this information to improve recommendation precision.

Thesis Committee:

Chair: Prof. Dr. Ir. G.J.P.M. Houben, Faculty EEMCS, TU Delft
University supervisor: Dr. Ir. A.J.H. Hidders, Faculty EEMCS, TU Delft
Committee member: Ir. H.J.A.M. Geers, Faculty EEMCS, TU Delft

Preface

This thesis could not have been completed without the love and support of a number of people. First and foremost I would like to express my gratitude to the wonderful women in my life. Dear Mother, thank You for supporting and motivating me throughout the years. Dear Alexandra, words fall short of describing Your patience and dedication. I love You both very much.

I am much obliged to my supervisor, Dr Jan Hidders. Thank You for providing the guidance and advice to keep the research on track. Your help with formalizing the theoretical basis for performing temporal preference analysis is greatly appreciated as well. My thanks also go to Professor Geert-Jan Houben, head of the Web Information Systems group at Delft University of Technology. Thank You for admitting me to Your group and allowing me to perform my master's research as part of it.

Special thanks go to Dr Jaap Aalders. The many talks we had on the cyclic nature of natural phenomena turned out to be more valuable than I could foresee. Dear Yehuda Koren, thank You for explaining some details pertaining to Your method for generating recommendations with temporal dynamics. Dear Dr. Mick Flanagan, thank You for Your support with using the Java Scientific Library package for performing linear regression analysis.

To all whom it may concern: I wish You a pleasant and interesting read.

Marek Karczewski
Delft, the Netherlands
November 17, 2010


Contents

Preface

Contents

List of Figures

1 Introduction
   1.1 Context
   1.2 Temporal preference analysis
   1.3 Research objectives
   1.4 Outline

2 Preparations
   2.1 Theory of recommender systems and temporal dynamics
   2.2 Theory of temporal preference analysis
   2.3 Implementation

3 Results
   3.1 Measuring temporal effects
   3.2 Applying temporal effects
   3.3 Recommendation precision

4 Conclusions and Future Work
   4.1 Contributions
   4.2 Reflection
   4.3 Future work
   4.4 Final conclusions

Bibliography

A Interest period parameters


List of Figures

1.1 Information Retrieval
1.2 Clustering

2.1 Notation
2.2 Main Classes
2.3 User Interface

3.1 Rating Series Regions
3.2 Gap Length
3.3 Last Period Length
3.4 Total Length of Rating Series
3.5 Maca Effect Count

4.1 Distribution of Maca Effects
4.2 Positivity and Last Length
4.3 Gap Length

A.1 Gap Neutrals
A.2 Gap Negatives
A.3 Gap Average
A.4 Last Average
A.5 Last Frequency
A.6 Last Negatives
A.7 Last Neutrals
A.8 Interest Periods
A.9 Disinterest Periods


Chapter 1

Introduction

In the past thirty years we have seen major technological advances that have led to an explosive growth of information processing systems. Some of these systems, for instance those used in hospitals, may have limited connectivity with the outside world and are intended for specialized use only. But the majority of present-day computer systems are interconnected by a common network called the Internet. In developed countries such as the Netherlands, access to the Internet is quite common. According to figures from 2008, more than eighty percent of all Dutch citizens over sixteen years of age make use of the Internet.

The Internet started out as a specialized network for providing robust communication between military computer systems. Nowadays it is a network connecting mostly open systems that benefit a great number of people. The importance of this network might best be exemplified by the influence it has on natural language. Not only do we write "Internet" with a capital letter in English, but we also owe a number of new expressions to it. Surfing, blog, cyberspace, cookie, browser and downloading are just a few examples of words related to different aspects of the Internet.

Many areas of human activity such as commerce, entertainment and education are represented by various services on the Internet. A user of the Internet can visit websites such as Amazon or DealExtreme to do his shopping from the comfort of his home. He can read news on Onet or CNN and he can study unknown facts on Wikipedia. Video enthusiasts can visit YouTube, a popular online service for viewing and publicizing video content. The Internet is not restricted to websites. There are many dedicated applications that make use of the Internet for communication. Skype, a popular communicator, allows users to talk with each other free of charge over the Internet. Fans of chess or go can play their favourite game on an Internet server by using dedicated client applications. The Internet also allows for easy file exchange, either by using the standard file transfer protocol, or by using a dedicated file sharing network like BT2Net or Gnutella.

The progress made in recent years introduces specific engineering challenges. The abundance of available information and usage options in modern computer systems often leads to various usability difficulties. In the context of this research we can formulate the following problem statement:

Problem Statement. The amount of stored and retrievable information in present-day computer systems causes difficulties in finding relevant information in a timely manner.

Specialized computer systems called information retrieval systems can assist the user in finding useful information. This thesis focuses on a special kind of information retrieval system called the recommender system. A recommender system presents items to the user in the form of personalized recommendations. In most cases, the user can provide feedback on the recommended items by giving them a rating. A characteristic feature of recommender systems is the ability to store the user's preferences and filter the presented information accordingly. Recommender systems can thus alleviate the problem of finding relevant information by presenting interesting items to users, while minimizing the effort required to inform the system about individual needs for information.

A recommender system is only as useful as it is capable of producing accurate predictions of future ratings. This ability is called recommendation precision. The more precise a recommender system is, the better it can identify the items that should appeal to the user, i.e. receive high ratings. In the quest to produce better recommender systems, improving recommendation precision is one of the main goals. Changes in user preferences can hinder the precise generation of recommendations. The research presented in this thesis is motivated by the desire to increase recommendation precision while dealing with changes in user preferences. We can state:

Main Objective. The main objective of our research is to investigate the possibility of improving recommendation precision by analysing and acting upon cyclic changes in user preferences that take place over time.

Before formulating more specific research objectives, let's first present some necessary background information. We'll start by examining the role that information and information processing technology play in our daily lives.

1.1 Context

1.1.1 The need for information

Our understanding of a studied subject, its place in the world and its relation to other subjects, is formed by processing information. The English word "information" is derived from the Latin verb "informare", meaning "to give form to the mind", "to discipline", "instruct" or "teach" [1]. The role of information is to shape our understanding of the surrounding world, our place in it, our relationships with other people and our choices regarding future conduct. Without this knowledge it would be difficult to attend to our needs in a proper way. It is important, therefore, that the information we are receiving is both correct and relevant to the problems that we are trying to solve.

While informational correctness depends on the quality of the source, informational relevance depends on individual needs. Different people will show different needs for information, depending on their age, occupation, interests, living conditions and other factors. A person who is interested in the history of Mexico will not be helped by a course in linear algebra. Furthermore, the level of detail or abstraction of information concerning a particular area of interest should reflect the level of understanding of the subject.

People need information both for governing their actions and for personal development. Especially in the case of young children, supplying the right kind of information is important for correct mental growth. According to psychological studies, environmental factors may have an effect upon childhood IQ, accounting for up to a quarter of the variance [32]. The research presented by Catherine Weathersbee [37] studies the effects of using modern IT technology in the high school classroom. It has been shown that the integration of information technology in teaching activities significantly improves academic performance in key areas such as reading, math and sciences for 11th graders. On the other hand, exposure to inappropriate information may have a very negative effect on child development. According to Stephen Kavanagh, "exposure to explicit sexual content, may lead to deviant behaviour to be imprinted on the child's "hard drive" and become a permanent part of his or her sexual orientation" [20].

If we want to attend to individual needs in an effective way, we need methods for extracting correct and relevant information as well as methods for filtering out information that could be inappropriate or even harmful. This is especially true given the enormous amount of information that is stored in present-day computer systems. As we will see later, recommender systems implement both functions and thus help us cope with a condition known as information overload.

1.1.2 Information overload

The invention of the steam engine heralded the start of the industrial revolution in the course of the 18th century. In the course of the 20th century, the invention of the computer heralded another revolution. The information revolution has led to the so-called information society: a society in which the creation, distribution, diffusion, use, integration and manipulation of information is a significant economic, political, and cultural activity [3]. A related term is the post-industrial society or the services society. According to the sociologist Daniel Bell, a post-industrial society is characterised by the fact that the majority of those employed are not involved in the production of tangible goods. In a modern-day industrial society, the primary and secondary sectors of the economy, which encompass the extraction of natural resources and the production of food and industrial goods, provide only about 30% of all jobs [6]. The majority of the labour force is employed in the tertiary sector, encompassing trade and services. According to numbers publicized by the Central Bureau of Statistics, in the year 2008 almost 3 million people were working in trade and services in the Netherlands. At the same time the number of people employed in industry constituted about 850 thousand, while slightly under 8 thousand persons were employed in natural resource extraction.

Computer systems play an important role within the tertiary sector of the economy. They provide a platform for supporting business operations such as communication, e-commerce, financial services and management. The importance and size of the provided informational services have led some authors to introduce a quaternary sector of the economy: the sector encompassing information generation, information sharing, consultation, education, research and development [5]. Computer systems also provide an invaluable tool for work organization within the public sector, where information processing constitutes a major part of operations.

Private use of computers is also quite common in any modern society. People write blogs, upload home-made movies to YouTube and communicate over the Internet on a daily basis. According to the Nielsen company, in April 2008 the average Internet user spent about 68 hours online [10]. All this computer related activity has led to an explosive growth of the amount of stored and retrievable information. Thanks to the Internet, many information sources on various subjects have become publicly accessible. In some cases the abundance of available information can make it difficult to understand a particular issue and in effect leads to decision paralysis. Cognitive studies have demonstrated that there is only so much information humans can process before the quality of decisions begins to deteriorate [11]. Such a condition is known as information overload. The term "information overload" was popularized by the futurist Alvin Toffler and first appears in a 1964 book by Bertram Gross, The Managing of Organizations [18].

1.1.3 Quantity and quality of information

So how much information is out there? Although no exact figure can be given - as new information is created and stored by the minute - some estimates have been made over the years. According to [25], the amount of ASCII text information on the world wide web around 1997 was approaching 2 terabytes, i.e. roughly 2 × 10^12 bytes. According to the same source, in the year 1989 there were in total 4615 movies made worldwide. At an average bit rate of 5 megabytes per second, it would take 166 terabytes to store all of them. In 2005, Eric Schmidt, the CEO of Google, estimated the total amount of information on the Internet at 5 million terabytes. YouTube, the popular video content provider, was estimated to stream from 25 to 200 terabytes on a daily basis in 2006, roughly equivalent to 40 million movies per day. A thorough overview of the sorts and quantities of information produced each year is presented in a study carried out at Berkeley University [25].

Apart from the sheer amount of information, the quality of information also constitutes a source of concern. As we know, the industrial revolution brought not only a vast increase in production capacity and domestic product per capita, but also environmental pollution. By analogy, the information revolution has brought about a negative effect called infollution. According to [30], infollution is the contamination of the information supply with irrelevant, redundant, unsolicited and low-value information. The spread of useless and undesirable information can have a detrimental effect on human activities [13]. In "The Megabit Bomb", a book by the science fiction writer Stanisław Lem, the author discusses many downsides associated with the introduction of information processing technologies [24]. Because the Internet as such possesses no natural intelligence, the author advocates the creation of intelligent censorship that would allow filtering out information that is harmful to the psychophysical health of certain types of users. Lem further claims that we are finding ourselves at a transition from the Gutenberg Era - that is, the era of the printed book - to the Information Era, in which the total amount of stored and retrievable information in computer systems far surpasses the capacity of the human brain to absorb it.


1.1.4 Information retrieval

In order to address the problems of infollution and information overload, methods have been devised to make information retrieval more effective. Information retrieval "is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web" [2]. We can discern three methods for finding relevant information: browsing, retrieval and filtering:

Figure 1.1: Information Retrieval

In the browsing approach, the user directs the display of information by manually navigating hypertext. Hypertext is information stored in the form of readable text that can be navigated by using hyperlinks. A hyperlink is a clickable element of the hypertext that directs the user to a new page of the hypertext. While browsing, no explicit query formulation takes place. Users implicitly formulate what they are looking for by clicking on hyperlinks. A well known example of a browsable information source on the Web is the open encyclopedia called Wikipedia.

In the case of retrieval, "the system is presented with a query by the user and expected to produce information sources which the user finds useful" [29]. The query is an explicit formulation of the user's informational needs, for instance in the form of relevant keywords. Information is "pulled" from the system; the user poses a question and the system returns the searched information. User interests associated with information retrieval are mostly short-term. From search to search, the requested information can be of a totally different type. As an example of an information retrieval system we can name the Yellow Pages. On the Yellow Pages the user can find a company providing needed services by inputting relevant keywords, for instance "plumbing".

Systems employing a filter to find relevant information are called information filtering systems. These systems "are typically designed to sort through large volumes of dynamically generated information and present the user with sources of information that are likely to satisfy his or her information requirement" [29]. The information source is often dynamically updated; new information is added or existing information is deleted regularly. As an example of such an information source, we can name the Internet movie database called IMDB. If the user likes movies, he will want to be informed by the system when new movies appear that are in line with his interests. As information filtering systems focus on informing users of new or updated items, they mainly address the user's long-term interests; interests that are stable over a longer period of time [31]. Information is "pushed" from the system; the user is informed about interesting items, without having to input any specific query.

This thesis focuses on recommender systems. Most recommender systems can be classified as information filtering systems. Generally we can say that recommender systems "produce individualized recommendations as output or have the effect of guiding the user in a personalized way to interesting or useful objects in a large space of possible options" [12]. Recommender systems can alleviate the problems of infollution and information overload by shortening the time that is necessary to find relevant information. This motivates further research into increasing the precision of recommendation algorithms.

1.1.5 Birth of the recommender system

Perhaps the best known information retrieval system is the search engine Google. Using a search engine, the user inputs his query in the form of keywords. The system then returns the search results in the form of a list of hyperlinks. Each hyperlink leads to a hypertext page on the Internet, hopefully containing the required information.

A common method for allowing information retrieval on textual documents is to calculate the Term Frequency-Inverse Document Frequency weight for terms contained in the document. The TF-IDF weight is a measure of relevance of particular terms in relation to the contents of the document. Consider for instance a document detailing the reproductive cycle of polar bears. The associated TF-IDF weights for terms such as "polar" or "bear" should have a high value. Indexing pages on the Internet in terms of TF-IDF weights allows retrieving those pages based on keyword search.

The TF-IDF weighting method can also be used for retrieving Usenet news articles. Usenet is one of the oldest computer network communication systems still in widespread use. It was conceived in 1979 and publicly established in 1980 at the University of North Carolina at Chapel Hill and Duke University. The articles that users post to Usenet are organized into topical categories called newsgroups, which are themselves logically organized into hierarchies of subjects. Given the high number of articles published on Usenet, it soon became apparent that filtering them solely on the basis of topics or keywords is inadequate. Factors such as individual preferences of the user and the quality of the returned articles need to be taken into account in order to implement effective filtering. Computer systems are not capable of assessing the quality of articles as well as human beings can. It stands to reason, then, to use feedback from other users to perform filtering.
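To make the TF-IDF weighting described above concrete, a minimal sketch using the common formulation tf(t, d) · log(N / df(t)) on a made-up toy corpus (real search engines use refined variants and smoothing):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for every term in each document of a corpus.

    tf(t, d) = count of t in d divided by the number of terms in d;
    idf(t)   = log(N / number of documents containing t).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus: "cycle" appears in every document, so its IDF (and thus
# its weight) is zero, while the rarer "polar" scores high in document 0.
docs = [
    ["polar", "bear", "polar", "cycle"],
    ["movie", "rating", "cycle"],
    ["movie", "bear", "movie", "cycle"],
]
w = tf_idf(docs)
```

As expected, a term occurring in every document carries no discriminating power, while a term frequent within one document but rare in the corpus receives a high weight.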
If I would like to receive an assessment of an article that is unknown to me, I could try to find users with similar taste who have read and assessed the article before me. Given the similarity in taste, their opinion should matter to me. The idea of using input from multiple related users or data sources for filtering is called collaborative filtering. The first recommender system to use collaborative filtering was the Information Tapestry project at Xerox PARC. This system allowed users to find documents based on previous comments by other users [17].

In 1992, Paul Resnick, John Riedl and associates at the University of Minnesota started the GroupLens project to examine automated collaborative filtering systems. A Usenet news client was created that allowed the readers to rate each other's messages.
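The collaborative idea sketched above - weigh the opinions of users with similar taste - can be illustrated with a minimal user-based predictor. All names and ratings below are made up, and real systems such as GroupLens use more refined similarity weighting and normalization:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity over the items two users have both rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common))
           * math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Predict `user`'s rating of `item` as a similarity-weighted
    average of the ratings other users gave to that item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine_sim(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None

# Hypothetical 1-5 star ratings: "ann" agrees with "me", "bob" does not.
ratings = {
    "me":  {"alien": 5, "heat": 4},
    "ann": {"alien": 5, "heat": 4, "solaris": 5},
    "bob": {"alien": 1, "heat": 2, "solaris": 2},
}
p = predict(ratings, "me", "solaris")
```

The prediction leans towards the rating of the most similar neighbour, which is exactly the "their opinion should matter to me" intuition in numeric form.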


The system used those ratings to predict how much other readers would like an article before they read it. This recommendation engine was one of the first automated collaborative filtering systems in which algorithms were used to automatically form predictions based on historical patterns of ratings [33]. In 1995, Riedl and Resnick invited Joseph Konstan to join the team. In the summer of 1996, Riedl was introduced to Steven Snyder, a former Microsoft employee. Snyder realized the commercial potential of collaborative filtering and encouraged the team to found a company. By June 1996, Gardiner, Snyder, Miller, Riedl, and Konstan had incorporated their company, called Net Perceptions. Net Perceptions went on to be one of the leading companies in personalization during the Internet boom of the late 1990s. Recommender systems have since become ubiquitous in the online world, with leading vendors such as E! Online, Amazon and Netflix deploying sophisticated recommender systems.

Meanwhile, research continued at the University of Minnesota. When the EachMovie site closed in 1997, the researchers behind it generously released the anonymous rating data they had collected for other researchers to use [27]. The GroupLens research team, led by Brent Dahlen and Jon Herlocker, used this dataset to jump-start a new movie recommendation site called MovieLens. Since 1997, MovieLens has been a very visible research platform. The research presented in this thesis also makes use of a dataset made available by MovieLens.

1.2 Temporal preference analysis

When we think of a computer system that stores user preferences in a user model to allow personalized recommendations, an obvious question comes to mind: what happens when the user changes his preferences over time? Let us consider an example from the movies domain. Suppose the system "knows" that I like horror movies. One morning, I might decide that I never want to see a horror movie again. But the system does not know this and might continue to recommend horror movies to me. A change in preferences makes part of the information stored in the user profile outdated. In order to account for such effects, we would like to construct a method for performing temporal preference analysis.

Choosing movies from a large pool can be considered a probability experiment. We can pose questions like "how many action movies can we expect to be chosen during one week?" If we let a monkey choose the movies, then we can rest assured that each choice is random and nearly independent. From a mathematical standpoint, we can model such a thought experiment as a Poisson process. In a typical realization of a Poisson process, the distribution of signals - in our case the appearance of action movies - is far from being uniform along the time axis. It reveals a natural tendency to create clusters. We call this tendency spontaneous clustering. When movies are chosen at random, we can expect to see periods containing many action movies alternating with periods containing few or no action movies. When the clustering of signals is stronger than spontaneous, we speak of attraction, and when it is weaker than spontaneous, we speak of repelling [16]. The following figure shows the concepts of attraction and repelling in a graphical way:


Figure 1.2: Clustering
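Spontaneous clustering can also be seen in a small simulation (illustrative parameters only): draw event times of a homogeneous Poisson process via exponential inter-arrival times and bin them per day. Even though the rate is constant, some days collect several events while others remain empty:

```python
import random

random.seed(42)

def poisson_process(rate, horizon):
    """Sample event times of a homogeneous Poisson process on [0, horizon)
    by accumulating exponential inter-arrival times."""
    t, events = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t >= horizon:
            return events
        events.append(t)

# One "month" of choices at an average rate of one action movie per day.
events = poisson_process(rate=1.0, horizon=30.0)

# Bin the events per day: the unevenness of these counts is the
# spontaneous clustering described above.
counts = [0] * 30
for t in events:
    counts[int(t)] += 1
```

Attraction and repelling would then correspond to the empirical counts clustering more, respectively less, strongly than such a simulated baseline.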

People do not choose movies to watch at random. People choose movies according to taste, interest or habit. Norman Munn states in his book on psychology that "the human is a being of habits, not a being of instinct" [28]. A. Lewicki poses the question whether "non-biological values" such as patriotism or a sense of beauty can form "comparable conditions for inner balance as biological and material values can" [26]. The answer is positive, as Lewicki states that this property of "non-biological values" is the very reason for the formation of dynamic stereotypes, that is, habitual behaviour in humans. As the name suggests, a dynamic stereotype is not a rigid structure. People adapt their behaviour and change habits over time.

Habits can lead both to attraction and repelling. Suppose Jan likes to watch at least 10 action movies over the weekend. Since the frequency is high within a short time frame, the resulting clustering of action movies will show "attraction". On the other hand, suppose Jane likes to watch a romantic comedy every last Friday of the month. In her case the frequency is low and the resulting clustering of romantic comedies might show "repelling".

Apart from habits, another phenomenon that can influence clustering is the so-called exploration drive. The drive to explore a new surrounding can be found both in animals and in people and can often overpower other needs, such as the need for food. A small child that has not eaten for 14 hours, once seated behind the dinner table, will often refrain from eating and instead "throw peas into the milk (...) throw spoons on the floor and use the potatoes as paint for finger-drawing" [19]. Most people, when confronted with a movie category that is unknown to them, will want to see at least a few movies representing that category. We can assume that the first contact with a new category will lead to a period of heightened interest. Once the new category has been explored, the initial interest might wane forever, or it might return periodically. In any case, the exploration drive most certainly leads to attraction.

It is our expectation that, thanks to natural phenomena such as habits and the exploration drive, certain types of items will show attraction within the rating series of a user. Furthermore, we expect that by identifying periods showing attraction, we will be able to improve the recommendation precision for items rated in the future.

In the context of recommender systems, we are interested in finding periods of heightened interest for movies of a certain category. Such periods are characterised by an increased frequency of high ratings for these movies as compared to the frequency that would result from a uniform distribution. For the purpose of this research, a period showing a clustering of high ratings for a certain type of movies is called an interest period, regardless of the exact causes that led to the clustering. A period which is not an interest period is called a disinterest period. The transition from an interest period to a disinterest period, or vice versa, can be interpreted as a change of preferences. Please see section 2.2.1 for precise definitions of the terms and concepts introduced here. The basic idea for performing temporal preference analysis can be formulated by the following conjectures:

Conjecture 1. We can detect recurrent interest periods by analysing the natural rating series of users by means of an efficient algorithm.

Conjecture 2. Interest period parameters such as duration and intensity, can provide a measure for capturing changes in user preferences.

Conjecture 3. We can use knowledge about interest period parameters to improve the recommendation precision for a particular user.

If successful, we can state: temporal preference analysis constitutes a set of methods for recognizing periodic preference changes within the natural rating series of users and using this information to improve the precision of recommender systems.
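As a toy illustration of the detection task in Conjecture 1 - not the algorithm developed in Chapter 2; the window size, threshold factor and data are all made up - one can flag windows in which high ratings for a category cluster more densely than a uniform spread would predict:

```python
def candidate_interest_periods(high_days, total_days, window=7, factor=2.0):
    """Return windows [start, start+window) in which the number of
    high-rating days exceeds `factor` times the uniform expectation."""
    # Under a uniform spread, a window of this size would contain
    # len(high_days) * window / total_days high-rating days on average.
    expected = len(high_days) * window / total_days
    periods = []
    for start in range(total_days - window + 1):
        count = sum(1 for d in high_days if start <= d < start + window)
        if count > factor * expected:
            periods.append((start, start + window, count))
    return periods

# Hypothetical days (0-59) on which a user rated action movies highly:
# a cluster around days 10-14 plus two isolated high ratings.
high_days = [10, 11, 12, 13, 14, 30, 47]
periods = candidate_interest_periods(high_days, total_days=60)
```

On this data every flagged window overlaps the cluster around days 10-14, while the isolated ratings on days 30 and 47 trigger nothing - the kind of attraction-versus-background distinction the conjectures rely on.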

1.3 Research objectives

In order to reach the main objective - that is to investigate the possibility of improving recommendation precision by performing temporal preference analysis - this research has been divided into a number of distinct research objectives. The following overview is loosely sorted in chronological order.

Research Objective 1. Develop a theoretical basis for conducting temporal prefer- ence analysis.

Since temporal preference analysis as conducted in this research is an original idea, no theoretical basis exists that defines the problem domain. In order to eliminate the possibility of ambiguity in the algorithmic implementation and to allow a clear presentation of results, a new notation and a set of definitions have to be created.

Research Objective 2. Implement a system for performing temporal preference anal- ysis and evaluating the results.

This goal encompasses the construction of a software system that will enable us to analyse rating series of users as well as to process, store and display the results of temporal preference analysis. Furthermore, the system must enable us to implement different prediction algorithms and measure their precision. The theory developed as part of the first research objective serves as a basis for the actual system implementation.


Research Objective 3. Validate the relevance of individual category preferences for predicting user ratings.

It is crucial to show that the individual category preferences of users are an important factor in predicting movie ratings. If we cannot show this to be true, then we cannot expect that recognizing interest periods will allow us to improve recommendation precision. In order to reach this research objective, a prediction algorithm will be constructed that enables us to compare a prediction based on individual category preferences with a prediction that does not take this information into account.

Research Objective 4. Construct a basic prediction algorithm that accounts for known temporal effects.

A number of temporal effects have been described in the relevant literature. We will construct a prediction algorithm that accounts for the significant temporal effects and use the resulting predictions as a reference. By doing so, we can be sure that any temporal effects measured by our method are distinct from the temporal effects that have been studied by other researchers. Furthermore, any improvement upon these basic predictions supports the claim that our method for performing temporal preference analysis is valid. Further research will therefore concentrate on improving upon the basic prediction algorithm by making use of temporal effects that can be related directly to interest period parameters.

Research Objective 5. Detect and analyse temporal effects that can be related directly to interest period parameters.

If we want to improve the basic predictions, we must first detect temporal effects that can be linked to interest period parameters. Therefore, for any given user rating, we want to record two things. First, we want to record the parameters describing the user's interest periods. Second, we want to measure the divergence between the given rating and the prediction produced by our basic prediction algorithm. Since the basic algorithm accounts for known temporal effects, any consistent divergence from it signals a new temporal effect. A consistent temporal effect will show a correlation with one or more parameters describing interest periods. For instance, short interest periods might be correlated with low future ratings for a given genre. Some parameters of interest periods may be associated with stronger temporal effects than others. Analysing and comparing the relative significance of different parameters for predicting future ratings will allow us to decide which ones to use to improve the precision of the basic prediction algorithm. Where necessary, we will resort to statistical analysis to investigate the recorded parameters and associated temporal effects. If no new temporal effects can be detected, we will conclude that our method is not usable for improving recommendation precision.

Research Objective 6. Improve the precision of the basic prediction algorithm.

If we can find new temporal effects which are related to interest period parameters, we must investigate whether knowledge about them can be used to improve the precision of the basic prediction algorithm. To this end, we will attempt to construct an improved prediction algorithm which incorporates the new effects. If successful, we will assess

the effectiveness of our method by comparing the performance of the improved algorithm to the performance of temporal algorithms reported by other researchers. In case we do find new temporal effects, but prove unable to use them for improving the precision, we will likewise have to conclude that our approach is not feasible.

1.4 Outline

In chapter 2 we present the preparations which are necessary for conducting temporal preference analysis. Part of the preparations is the development of a theoretical basis for conducting temporal preference analysis and a notation that allows us to present the results. The theory is based on the premises presented in this chapter and on an analysis of related research, which is presented in the next chapter. In chapter 2, we also discuss the architecture of the system that has been used to perform the research.

In chapter 3 the reader will find the documentation of performing temporal preference analysis with the implemented software system. The chapter includes an overview of investigated temporal effects and a visualisation of the most important results. Furthermore, the chapter documents various attempts at applying temporal effects in order to improve recommendation precision.

The thesis ends with chapter 4, reflecting on the obtained results and presenting the project's contributions within the field of recommender systems. In the final chapter, the reader will also find pointers for future work and a number of conclusions addressing the main research questions.


Chapter 2

Preparations

In this chapter we present the work that was conducted in preparation for carrying out the actual research into temporal preference analysis. Naturally, a significant portion of the preparatory work consists of a study of known theory and related research. In the following sections the reader will find a brief sketch explaining the main classification of recommender systems and a slightly more elaborate overview of related research concerning temporal effects within recommender systems.

2.1 Theory of recommender systems and temporal dynamics

2.1.1 Main types of recommender systems
Recommender systems can be classified according to two general types: memory-based systems and model-based systems. Memory-based systems employ statistical techniques to find a set of users, known as neighbors, which have a history of agreeing with the target user. A ratings history or a purchase history can be used to compute user similarity by Pearson correlation or by vector cosine [36]. Once a neighborhood of like-minded users is formed, the system can use different algorithms to combine the preferences of neighbors to produce a prediction or top-N recommendation for the target user. These techniques, also known as nearest-neighbor or user-based collaborative filtering, are quite popular and widely used in practice [34]. The first recommenders, developed and popularized by the company Net Perceptions, are examples of memory-based systems (see 1.1.5). In memory-based systems, there is no need to analyse the contents of the items being recommended. The systems are easy to implement and quite effective. On the downside, memory-based systems often show poor performance in case of rating data sparsity.

Model-based collaborative filtering algorithms provide item recommendations by first developing a model that is based on past user ratings. Algorithms in this category take a probabilistic approach and compute the expected value of a user prediction, given his ratings on other items. The model building process can be performed by different machine learning algorithms such as Bayesian networks, clustering models, latent semantic models and rule-based models [34]. Model-based systems can alleviate the data sparsity problem, improve scalability with large data sets and improve recommendation precision. On the other hand, these systems can be quite complex and carry a high cost of implementation.

Many real world recommender systems are of the hybrid kind. Hybrid recommender systems use both memory-based and model-based techniques in order to combine the strengths and alleviate the weaknesses of both approaches.

The system developed for this research is an example of a model-based recommender system. In order to identify interest periods, an analysis is carried out of past user ratings. Different aspects of interest periods are stored as part of the model. Future ratings are predicted on the basis of the collected information.
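The Pearson-correlation similarity used by memory-based systems can be illustrated with a short sketch. The class and method names below are our own; a production system would add significance weighting and other refinements described in [36].

```java
import java.util.HashMap;
import java.util.Map;

public class UserSimilarity {

    // Pearson correlation over the items both users have rated.
    // Returns 0 when there is no overlap or no variance.
    public static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            if (b.containsKey(e.getKey())) {
                sumA += e.getValue();
                sumB += b.get(e.getKey());
                n++;
            }
        }
        if (n == 0) return 0.0;
        double meanA = sumA / n, meanB = sumB / n;
        double num = 0, denA = 0, denB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            if (b.containsKey(e.getKey())) {
                double da = e.getValue() - meanA;
                double db = b.get(e.getKey()) - meanB;
                num += da * db;
                denA += da * da;
                denB += db * db;
            }
        }
        if (denA == 0 || denB == 0) return 0.0;
        return num / Math.sqrt(denA * denB);
    }

    public static void main(String[] args) {
        Map<Integer, Double> u1 = new HashMap<>();
        Map<Integer, Double> u2 = new HashMap<>();
        u1.put(1, 4.0); u1.put(2, 3.0); u1.put(3, 5.0);
        u2.put(1, 4.5); u2.put(2, 3.5); u2.put(3, 5.0);
        System.out.println(pearson(u1, u2)); // high positive correlation
    }
}
```

Once such pairwise similarities are known, the most similar users form the neighborhood whose ratings are combined into a prediction.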

2.1.2 The Netflix prize - a strong incentive to research
One of the companies that made use of the recommender technology offered by Net Perceptions was the online movie rental service called Netflix. In 2006 Netflix started a competition for the best movie recommender system. Prizes were based on improvement over Netflix's own algorithm, called Cinematch, or over the previous year's score if a team had made an improvement beyond a certain threshold. In order to win the grand prize of one million dollars, a participating team had to improve on the Cinematch algorithm by 10%. Research teams around the world joined the Netflix competition. As a result of this broad involvement of the scientific community, many advancements have been made in techniques for generating recommendations. On 21 September 2009, the grand prize was awarded to the BellKor's Pragmatic Chaos team.

Despite the intensive research spurred by the Netflix prize, one aspect of recommender systems has been left out of the picture for a considerable time. Up to 2008, not much structured research had been conducted into changing user preferences. What follows is a concise overview of related research to date.

2.1.3 Concept drift
Modelling temporal effects known as concept drifts is a central issue in predictive analytics and machine learning. A concept drift means that the properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems, because the predictions become less accurate as time passes. To prevent deterioration in prediction accuracy over time, the model has to be refreshed periodically. Two algorithms - FLORA and STAGGER - are well regarded in the field. In the FLORA approach, the model is retrained using only the most recently observed samples [38]. This method is also known as time windowing. In case of STAGGER, the learning algorithm uses training examples which are weighted according to their present utility. A description of the STAGGER algorithm can be found in the works by Schlimmer and Granger [35]. While these algorithms strive to incorporate concept drift into the prediction model at regular intervals, they do not actually try to detect a concept drift actively.

In some cases, the model can be augmented with new inputs which may be better at explaining the causes of the concept drift. For instance, in a sales prediction application, concept drift could be accounted for by adding information about the season to the model. In real world applications however, it is often difficult to identify the causes

leading to concept drift unequivocally. In such cases, we can best detect concept drifts post factum and adapt our learning algorithm accordingly. To this end, some indicators of concept drift must be monitored over time. This approach is also used in this research. By analysing parameters relating to interest periods, we draw conclusions about a possible change of interest that a user shows for different movie genres.

While related, concept drifts and changes in user interests are different kinds of temporal effects. The former implies a general and gradual change, while the latter implies an individual change, which can be more abrupt in nature. Traditional studies into concept drift have mainly focused on long-term changes that affect the whole population. “In contrast with global drifts, individual changes are characterized by localized causes. A change in family structure for instance, can drastically change shopping patterns. Such preference changes and their influence on ratings can not be captured by methods that seek a global concept drift” [22]. Instead, related temporal effects need to be modelled at the level of each user individually. In case of temporal preference analysis, identified interest periods are only meaningful in the context of a particular user.
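The time windowing idea described above can be sketched in a few lines. Note that the class name, the fixed window size and the simple running-average “model” are our own illustrative choices, not part of the FLORA algorithm itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of time windowing: only the most recent maxSize
// observations are kept for (re)training the model, so older samples
// that may reflect an outdated concept are forgotten automatically.
public class SlidingWindow {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int maxSize;

    public SlidingWindow(int maxSize) { this.maxSize = maxSize; }

    public void observe(double rating) {
        window.addLast(rating);
        if (window.size() > maxSize) {
            window.removeFirst(); // forget the oldest sample
        }
    }

    // Toy "model": predict the mean of the samples currently in the window.
    public double predict() {
        if (window.isEmpty()) return 0.0;
        double sum = 0;
        for (double r : window) sum += r;
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindow w = new SlidingWindow(3);
        for (double r : new double[]{1, 1, 1, 5, 5, 5}) w.observe(r);
        System.out.println(w.predict()); // only the three most recent ratings count
    }
}
```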

2.1.4 Interest change
In 2001, Lam et al. [23] investigated changes of user interest in information filtering systems. As a result, a Bayesian-based technique for tracking users' interest drifts was developed. In a work from 2003 [39], we can find examples of algorithms for detecting interest change based on time windowing, decay functions or evolutionary algorithms. It was not until 2008, however, that a systematic study on interest drift patterns first appeared [14]. The authors present both a method for detecting interest drift patterns and a method for improving recommendations. The precision of recommendations is improved by removing ratings reflecting past interests from the training data. In case of temporal preference analysis however, we do not expect interests to disappear permanently. Instead, we expect particular interests to appear and disappear in the form of recurrent periods.

A number of significant contributions to the research into changing interests has been made by Yehuda Koren in a 2009 paper titled “Collaborative Filtering with Temporal Dynamics” [22]. In this work concerning the Netflix movies dataset, a decomposition of temporal effects has been proposed. According to the author, we can discern biases that are related to the movie as well as biases which are related to the user. For instance, most movies will show a tendency to receive higher ratings as time goes by. On the other hand, some users might show a tendency to give increasingly lower ratings as time goes by. The proposed optimization model allows calculating movie and user specific biases that give the best predictions for a set of training ratings. Taking these biases into account when calculating predictions for a set of test ratings considerably improves the precision.
The effectiveness and importance of analysing temporal dynamics in recommender systems is exemplified by the fact that the presented solutions have allowed the BellKor team to win the Netflix prize in late 2009. A number of observations are in order at this point. First of all, periodic effects are not the main interest area of collaborative filtering with temporal dynamics. In the aforementioned paper, periodic effects are modelled by a dedicated parameter that


reflects the expected bias associated with a regular period. For example, the periods could be limited to weekdays and weekends. In contrast, this research takes the notion of a recurrent interest period of variable length as a central idea. Different aspects of interest periods are investigated in order to determine their usability for improving recommendation precision.

Second of all, the strongest temporal effect identified during temporal dynamics research is the day specific user effect. According to the author, “in many applications there are sudden drifts emerging as “spikes” associated with a single day or session. For example, in the movie ratings dataset we have found that multiple ratings a user gives in a single day, tend to concentrate around a single value. Such an effect does not span more than a single day. This may reflect the mood of the user that day, the impact of ratings given in a single day on each other, or changes in the actual rater in multi-person accounts”. As we can see, the exact causes for the day specific effect are not given. In our research we will exclusively try to identify effects that can be related to well defined interest period parameters.

Next to the day specific effect, the author mentions a number of other temporal effects which can be found in a movie ratings dataset. The most significant ones will be modelled separately in this research, as part of the basic prediction algorithm. Please see research objective 4 for an explanation of the role played by the basic prediction algorithm. Please see section 2.3.9 for a description of the actual system implementation of the basic prediction algorithm.

2.2 Theory of temporal preference analysis

2.2.1 Terminology
In order to conduct temporal preference analysis in an unambiguous way, we need to introduce new terminology. In this section, the reader will find a set of definitions that serve as a basis for the implementation. The definitions are written in a general way and can apply to items of various nature. In our research, the items to be recommended are movies. Items which possess common characteristics can be grouped into categories. In case of movies, the categories are equivalent to movie genres. Each movie is associated with a set of related categories. The analysis of a movie rating is always conducted in the context of one particular category and repeated for each category in the set.

Definition 1. [User Rating] We define a user rating as a set consisting of (1) a set of item related categories (2) a rating value from the set {0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}.

In our research, each movie has a set of movie genres related to it. We will use the information about related genres when determining a prediction for the rating value. The rating value is expressed as the number of “stars” assigned to the rated movie on a scale from 0.5 to 5 stars.
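Definition 1 can be encoded directly, for example as the following Java class. The class and field names are our own illustration; the timestamp field anticipates the ordering by rating time required for the rating series of Definition 2.

```java
import java.util.Set;

// One possible encoding of Definition 1: a rating consists of the set of
// genres of the rated movie plus a star value on the half-star scale.
public class UserRating {
    final Set<String> categories; // e.g. {"Action", "Thriller"}
    final double value;           // one of 0.5, 1.0, ..., 5.0
    final long timestamp;         // used for ordering the rating series

    UserRating(Set<String> categories, double value, long timestamp) {
        // Reject values outside the allowed half-star scale.
        if (value < 0.5 || value > 5.0 || (value * 2) % 1 != 0) {
            throw new IllegalArgumentException("value must be a half-star in [0.5, 5]");
        }
        this.categories = categories;
        this.value = value;
        this.timestamp = timestamp;
    }

    public static void main(String[] args) {
        UserRating r = new UserRating(Set.of("Action", "Thriller"), 3.5, 100L);
        System.out.println(r.categories + " -> " + r.value);
    }
}
```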

Definition 2. [Rating Series] We define a rating series as an ordered list of unique user ratings of items.


The ratings which make up the rating series are ordered by timestamp. The timestamp indicates the actual time at which the rating took place. Since any movie can only be rated once by any user, the list includes only unique elements.

Definition 3. [Known Rating Series] We define the known rating series as the sublist of the rating series which includes all ratings up to the analysed position.

When trying to predict the rating for a movie at the analysed position - for instance at position 15 - the known rating series consists of the ratings from position 1 to position 14.

Definition 4. [Positive Rating] We define a positive rating as a rating of 3.5 stars or more for an item belonging to the analysed category.

The rating scale in our data set is 5 stars with half star increments. When analysing action movies, a rating of 3.5 stars or more for an action movie is considered a positive rating within the analysed category.

Definition 5. [Neutral Rating] We define a neutral rating as a rating of 3 stars for items belonging to the analysed category or any rating for items not belonging to the analysed category.

When analysing action movies, any rating for a movie which is not an action movie, or a rating of 3 stars for an action movie is considered a neutral rating within the analysed category.

Definition 6. [Negative Rating] We define a negative rating as a rating of 2.5 stars or less for an item belonging to the analysed category.
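Definitions 4 to 6 translate into a small classification routine. The class and enum names below are illustrative, not taken from the actual system.

```java
enum RatingClass { POSITIVE, NEUTRAL, NEGATIVE }

// Direct translation of Definitions 4-6 for one analysed category:
// ratings of items outside the category are always neutral; inside the
// category, 3.5+ stars is positive, 2.5- stars is negative, 3 is neutral.
public class RatingClassifier {
    public static RatingClass classify(double stars, boolean inCategory) {
        if (!inCategory) return RatingClass.NEUTRAL;
        if (stars >= 3.5) return RatingClass.POSITIVE;
        if (stars <= 2.5) return RatingClass.NEGATIVE;
        return RatingClass.NEUTRAL; // exactly 3 stars
    }

    public static void main(String[] args) {
        System.out.println(classify(4.0, true));  // POSITIVE
        System.out.println(classify(3.0, true));  // NEUTRAL
        System.out.println(classify(2.0, true));  // NEGATIVE
        System.out.println(classify(1.0, false)); // NEUTRAL
    }
}
```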

Definition 7. [Period] We define a period as any sublist of the known rating series of a user.

By definition, a sublist is a list that makes up a contiguous part of a larger list. In our research, a period is simply an arbitrary portion of the known rating series of a user.

Definition 8. [Delimited Period] We define a delimited period as any period which starts and ends with a positive rating for items belonging to the analysed category.

Let us again consider the action movies genre. A delimited period for action movies is any period which starts and ends with a positive rating for action movies. Interest periods are characterized by a high frequency of positive ratings. Adding negative ratings to the start or end of a period decreases the chance that this period will be identified as an interest period. All interest periods must therefore be delimited.
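Treating a period as a string of per-rating symbols in the notation of section 2.2.2 (capital letter for a positive rating in the analysed genre), Definition 8 amounts to a simple boundary check. The class name is our own.

```java
// Definition 8 check: a period is delimited when it starts and ends with a
// positive (capital-letter) rating for the analysed genre.
public class DelimitedPeriod {
    public static boolean isDelimited(String period, char genreLetter) {
        if (period.isEmpty()) return false;
        char upper = Character.toUpperCase(genreLetter);
        return period.charAt(0) == upper
            && period.charAt(period.length() - 1) == upper;
    }

    public static void main(String[] args) {
        System.out.println(isDelimited("A..aA", 'A'));   // true
        System.out.println(isDelimited(".A..A.", 'A'));  // false: neutral boundaries
    }
}
```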

Definition 9. [Merge Factor] The merge factor determines how many more positive ratings for category related items must be present within a popularity period than would be the result of a uniform distribution.

It is called the merge factor, because it influences the way that periods consisting of sequences of equal elements are merged together to form interest periods. Please see figure 2.3 for an example of the result of merging interest periods.

17 2.2 Theory of temporal preference analysis Preparations

The total length of an interest period gives important clues about the user's bias towards the analysed category. We therefore want the interest periods identified by our analysis to be as large as possible. Alternatively, we can say that we want to avoid a situation in which we would identify two distinct interest periods which would form a larger interest period when considered as a whole.

Definition 10. [Popularity Period] We define a popularity period as any period in which the number of positive ratings for category related items is higher than or equal to the expected number of positive ratings times the merge factor.

The expected number of positive ratings within a given category and for a given length of the rating series can easily be calculated. In case of action movies, this is simply the probability of any movie being an action movie, times the probability of any rating being higher than 3 stars, times the length of the series. As an interest period is characterized by a high frequency of positive ratings, we expect the number of positive ratings within it to be higher than this expected value. The merge factor can be chosen arbitrarily. It determines how much the frequency of positive ratings within a period must diverge from the expected value in order to identify the period as a popularity period. Please note that ratings of items not belonging to the analysed category can also be part of a popularity period. Intuitively, a person who is an action movie aficionado can watch movies from other genres as well.
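The test of Definition 10 can be sketched as follows. The probabilities that a movie belongs to the analysed category and that a rating is positive are assumed to be estimated elsewhere from the full dataset; all names and example numbers are illustrative.

```java
// Sketch of the popularity period test of Definition 10: a period qualifies
// when its positive-rating count reaches the expected count (under a uniform
// distribution) multiplied by the merge factor.
public class PopularityPeriod {
    public static boolean isPopularityPeriod(int positiveCount, int periodLength,
                                             double pCategory, double pPositive,
                                             double mergeFactor) {
        double expected = pCategory * pPositive * periodLength;
        return positiveCount >= expected * mergeFactor;
    }

    public static void main(String[] args) {
        // 6 positive action ratings in a period of 20 ratings, where 20% of
        // movies are action movies and 40% of all ratings are positive:
        // expected = 0.2 * 0.4 * 20 = 1.6, threshold = 1.6 * 2.0 = 3.2.
        System.out.println(isPopularityPeriod(6, 20, 0.2, 0.4, 2.0)); // true
    }
}
```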

Definition 11. [Strong Popularity Period] We define a strong popularity period as any popularity period which cannot be divided into two separate periods such that one or both would not be a popularity period.

This definition implies that any sublist of the strong popularity period which starts at its leftmost position or ends at its rightmost position is itself a popularity period.

Definition 12. [Interest Period] We define an interest period as any delimited strong popularity period which is not part of a larger strong popularity period.

Definition 13. [Disinterest Period] We define a disinterest period as any period which does not overlap with an interest period and is not part of a larger disinterest period.

As we can understand from the definition set, we can divide the rating series of users into interest periods and disinterest periods. We try to choose our interest periods as large as possible, while still maintaining the requirement regarding the minimal frequency of positive ratings within them. In the next section we will see what such a division looks like by introducing a domain specific notation.

2.2.2 Notation In order to present the rating series of a user in a concise way, a domain specific notation is used. Each movie genre is specified by a distinct letter from the alphabet. As there are exactly 26 genres in our dataset, this is a convenient choice. The analysis of a rating series always takes place in the context of one genre. In our system, action

movies are associated with the letter A. Positive ratings are depicted by a capital letter, negative ratings by a small letter. Neutral ratings are depicted either by a star - when the item does belong to the analysed category - or by a dot - when the item does not belong to the analysed category. All ratings are ordered by timestamp, so the string depicts what kind of ratings have been given in sequence. The following figure gives an example of the notation used for depicting an analysed rating series of a user in the context of action movies:




Figure 2.1: Notation

The vertical dashes show the boundaries of interest periods as they could be determined by the algorithm. The exact placement of these boundaries will of course depend on the merge factor, which determines the minimum frequency of positive ratings that is required within an interest period.

In section 1.2 we have explained why it is reasonable to assume that interest periods will be characterized by a high number of positive ratings for the analysed category. Using the introduced notation, these periods show up as having a high number of capital letters. In figure 2.1, “x” stands for the analysed movie rating. For every user, we know the given ratings up to “x” and we would like to predict the rating for the movie at “x”. By using the basic prediction algorithm explained in section 2.3.9, we can calculate a reference prediction at position “x” that accounts for known temporal effects. The goal is then to increase the precision of the basic prediction algorithm by incorporating temporal effects which can be related to interest period parameters.

By performing temporal preference analysis, we can determine a number of relevant parameters which describe the known rating series of the user. For instance, at position “x”, the length of the last interest period is 8 ratings, while the distance from it is 4 ratings. This kind of data should enable us to adjust the basic prediction at position “x” in order to increase the recommendation precision. Together with the definition set, the introduced notation fulfills the requirements of research objective 1 - that is, providing a theoretical basis for performing temporal preference analysis.
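Rendering a rating series in this notation is straightforward; a minimal sketch follows. The class, method and parameter names are our own, not those of the actual system.

```java
// Builds the notation string of section 2.2.2 for one analysed genre.
// Each rating becomes one character: capital letter = positive rating in the
// genre, small letter = negative rating in the genre, '*' = neutral rating
// in the genre, '.' = rating of a movie outside the genre.
public class NotationBuilder {
    public static String toNotation(double[] stars, boolean[] inGenre, char genreLetter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < stars.length; i++) {
            if (!inGenre[i]) {
                sb.append('.');
            } else if (stars[i] >= 3.5) {
                sb.append(Character.toUpperCase(genreLetter));
            } else if (stars[i] <= 2.5) {
                sb.append(Character.toLowerCase(genreLetter));
            } else {
                sb.append('*'); // exactly 3 stars
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] stars   = {4.0, 2.0, 3.0, 5.0, 1.0};
        boolean[] action = {true, true, true, true, false};
        System.out.println(toNotation(stars, action, 'A')); // Aa*A.
    }
}
```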

2.2.3 Measuring precision
It is common to assess the effectiveness of learning prediction algorithms by splitting the available ratings data into two sets: the training set and the quiz or test set. The training set typically contains ratings that precede the ratings contained in the test set. The ratings from the training set are used as input for learning the prediction algorithm. The learned algorithm then produces predictions for the ratings contained in the quiz


set. By comparing the predicted ratings with the actual ratings in the quiz set, we can measure the precision of the learned prediction algorithm. A widely accepted measurement of recommendation precision is the RMSE or Root Mean Squared Error. Expressed in mathematical terms:

RMSE = \sqrt{ \frac{ \sum_{i=1}^{n} (\hat{r}_i - r_i)^2 }{ n } }

where \hat{r}_i is the predicted rating, r_i is the given rating and n is the total number of analysed ratings. Another measurement is the MAE or Mean Absolute Error. Expressed in mathematical terms:

MAE = \frac{ \sum_{i=1}^{n} | \hat{r}_i - r_i | }{ n }

where \hat{r}_i is the predicted rating, r_i is the given rating and n is the total number of analysed ratings. In both cases, lower values mean better predictions. An RMSE or MAE of 0 would imply a perfect prediction algorithm; one that always predicts the very rating that a user gives in practice. In a typical dataset containing movie ratings, a trivial algorithm that for each movie in the quiz set predicts its average grade from the training data produces an RMSE of 1.0540 [22]. In most research papers concerning recommendation techniques, the measurement of choice is RMSE. Compared to MAE, RMSE lends itself better to efficient optimization calculations. In this research we use the RMSE as our standard measure of recommendation precision.
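Both measures translate directly into code; a minimal sketch (class and variable names are our own):

```java
// RMSE and MAE as defined above; predicted[i] corresponds to the predicted
// rating and actual[i] to the rating the user actually gave.
public class PrecisionMetrics {
    public static double rmse(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sum += d * d;
        }
        return Math.sqrt(sum / predicted.length);
    }

    public static double mae(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        double[] p = {3.5, 4.0, 2.5};
        double[] a = {3.0, 4.0, 3.5};
        System.out.println(rmse(p, a)); // squaring penalizes the 1-star miss more
        System.out.println(mae(p, a));
    }
}
```

Because squared errors weight large misses more heavily, RMSE is always at least as large as MAE on the same prediction set.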

2.3 Implementation

2.3.1 External software
External software packages which have been used to augment or support our system fall into two categories. The first category consists of software programs which provide functionality necessary to implement the system, manage the database or interpret the results. The main Integrated Development Environment or IDE in which the system was implemented is NetBeans. NetBeans is an open source programming suite which supports various programming languages. The MySQL Workbench database management system is used for running various SQL scripts on the underlying database. MySQL Administrator is used to carry out basic administration of the MySQL database server. The SOFA statistics package has been used to get insight into the distribution, averages and frequencies of temporal effects. VP-UML has been used to produce system architecture diagrams. Several programs from the NeoOffice suite were used as well.

The second category of external software encompasses Java software libraries which include routines useful for performing temporal preference analysis. Dr. Michael Thomas Flanagan's Java Scientific Library [7] has been incorporated into the system as a referenced file. The library provides a large number of mathematical algorithms for performing various scientific calculations. For the purpose of our research, the

regression class is called to perform linear regression analysis. The vecmath library contains functions to perform vector and matrix operations which are used during analysis. Greg Dennis's drej library has been included as a referenced file and allows for calculating various nonparametric regression functions. Please see section 3.3.3 for a description of a kernel based algorithm. Finally, Tiles modules are used for standardizing the layout of the user interface.

2.3.2 The system architecture
Our system is a web application which has been implemented within the Java Server Faces application framework. JSF is a popular application framework and part of a larger family of web based Java application frameworks such as AppFuse, Spring, Google Web Toolkit and the well known Struts. The choice for JSF is motivated by the broad availability of coding examples and the general ease of implementation that is a distinctive feature of this framework. According to some reports, systems based on this framework might not scale as well with increased user numbers as, for instance, systems based on the Struts framework. This might be a concern when the system is intended for heavy loads in a corporate setting, but should not be a problem when the system is used internally within a company or institution.

The application is written in Java and runs on a Glassfish Application Server. Other web application servers which implement the J2EE framework, such as Tomcat for instance, can also be used without reservation. While not strictly necessary for conducting temporal preference analysis, such a solution does provide an appealing user interface and a good base for constructing a real-life online recommender system. At present, apart from performing temporal preference analysis, the system is capable of registering online users and storing their favourite movies.

The figure shown on the following page shows the relationships between the main classes participating in temporal preference analysis. The central place is taken by the RatingController object, which accepts input from the user and manipulates other objects in the system in order to carry out temporal preference analysis functions. Due to space constraints, a number of classes have been left out of the diagram, for instance classes such as Period or Analysis, which are used for handling the results of an analysis run.
For the same reason, the attributes of the RatingController have not been shown. The implemented system is a typical example of a web application following the Model View Controller architecture. In an MVC architecture, system objects are segregated into three categories, each taking care of a different aspect of operation.

The model is comprised of objects which represent the business logic of the implemented domain. In our case, these are objects such as “user”, “movie”, “rating”, “actor” and so on. The central place is occupied by the “rating” object, which represents a user rating of a movie.

The view is comprised of objects which enable user interaction with the underlying business functionality. In our system, the view is made up of a collection of Java Server Pages. JSP pages present screen elements such as buttons, search boxes and lists, which provide an interface to underlying system functions.

Finally, the controller is made up of objects which can process the input received from the user. Control objects can instruct business objects to perform certain func-

21 2.3 Implementation Preparations

                                                                                        
                

     

 

                                                     
                                                 

Figure 2.2: Main Classes

22 Preparations 2.3 Implementation tions. They can also instruct view objects to present the results of system operations to the user. Controller objects are typically associated with certain business objects or with a set of operations. In our system, all functions associated with temporal prefer- ence analysis are managed by the RatingController object. This objects counts close to 3000 lines of code. On the whole, the implemented system functions - not counting any external modules - consist of approximately 10000 lines of code.

2.3.3 The data set

For this research two public data sources have been used: the Internet Movie Database or IMDB [9] and the 10 million ratings MovieLens dataset [8]. Both sources have been read into a MySQL database and the provided information has been linked together on the level of individual movies. Next to providing detailed information on movies, IMDB also provides an extended classification of movies. The MovieLens dataset uses only 18 genres to classify movies, while IMDB uses 26 genres. More genres result in a greater precision when analysing individual category preferences. Secondly, the information from IMDB is valuable in case the system is ever to be used as a real-life recommendation engine.

The dataset we are concerned with provides about 10 million ratings of about 10 thousand movies. The rating scale runs in half-star steps from 0.5 stars for a very bad movie to 5 stars for an excellent movie. The ratings have been given by approximately 70 thousand anonymous users and span a period from January 1995 to January 2009. Every user in the dataset has rated at least 20 movies. According to the GroupLens research group, 20 ratings is the minimum number needed for reflecting the interests of a user [34].

In order to test the universality of our approach, we have divided the user ratings roughly into two equal parts: a training set and a test set. The division is made solely on user number, which results in a semi-random split. Both sets contain around five million ratings given by about thirty five thousand users. The users in both sets have rated movies in the same time frame. The average rating for the training set and the test set is respectively 3.514 and 3.512 stars. The standard deviation is respectively 1.059 and 1.060 stars. Grosso modo, both sets are quite similar in contents. The training set is used to measure and store effects associated with interest period parameters. The test set is used to verify the generality of obtained results - i.e. to confirm that effects measured for the training set can be applied to improve recommendation precision for the ratings in the test set.
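A split made solely on user number could be sketched as follows. This is an illustration only: we assume an even/odd division of user ids as the criterion, since the thesis does not specify the exact rule, and the class and method names are our own invention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper sketching a user-based train/test split: the split
// depends only on the user number, so both halves cover the same time frame
// and end up with roughly equal numbers of users and ratings.
public class UserSplit {

    // Assumed rule for illustration: even ids train, odd ids test.
    public static boolean isTrainingUser(int userId) {
        return userId % 2 == 0;
    }

    // Partition a list of user ids into training (key true) and test (key false).
    public static Map<Boolean, List<Integer>> split(List<Integer> userIds) {
        Map<Boolean, List<Integer>> sets = new HashMap<>();
        sets.put(true, new ArrayList<>());
        sets.put(false, new ArrayList<>());
        for (int id : userIds) {
            sets.get(isTrainingUser(id)).add(id);
        }
        return sets;
    }
}
```

Because the criterion ignores rating content and timestamps entirely, the resulting division is semi-random in the sense used above.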

2.3.4 The user interface

The front-end of the application is defined by means of JSP pages. The layout has been standardised over all pages by augmenting the JSF framework with Tiles. Tiles allows us to set up page templates which lead to a consistent look and feel across the whole user interface.

The interface is divided into a number of pages. The first four pages provide access to the part of the system not directly related to temporal preference analysis. The page called "Home" provides a user login screen. The "Search a Movie" page allows the user to search the underlying database and retrieve detailed information on any movie. The


user can add found movies to his list of favourite movies. At "Favorite Movies" the user can view and edit his list of favourite movies. At "Recommended Movies", the user can see the list of movies recommended to him by the system.

Early on in the research, we wanted to test the methods for temporal preference analysis on a group of online users. The users would be presented with movie recommendations based on the aforementioned techniques. The benefit of this approach would be the ability to store and analyse user feedback and determine such things as user satisfaction.

The idea for an online recommender necessitated the creation of a system capable of storing information on users and their favourite movies. In the course of further research it was realised that attracting a sufficient number of distinct online users would be difficult. It is easier, and more in line with other recommender systems research, to use a readily available dataset of user ratings. The idea to create a web-based recommendation system for testing temporal preference techniques was abandoned as a result.

The rest of the pages allow access to that part of the system which is concerned with running temporal preference analysis functions on the MovieLens dataset. A page called "Genre Series" shows the results of performing temporal preference analysis on the rating series of a selected user. The reader will find an example of related output in figure 2.3.

"Director Series" is an experimental page showing the results of temporal preference analysis where movie similarity is based on the director and not on movie genre. The idea to group movies according to director instead of genre was abandoned early on in the research. The reason is the same as in the case of actors, for instance: the number of categories is too large to be able to discern interest periods in user rating series.

"Analysis" is a page allowing us to run various functions specific to temporal preference analysis. This is the part of the system which has taken up most of the development time. Please see section 2.3.5 for a description of the implemented functions. What follows is a figure showing a screenshot of the "Genre Series" page. On the left we see the links leading to the different pages of the system. On the right we see the results after analysing the rating series of a user.

Figure 2.3: User Interface


The top string on screen shows the rating series of a user before determining interest periods. The series has simply been divided into parts holding sequences of equal elements. After running the algorithm, some parts have been merged together to form interest periods. This result can be seen in the middle string. The lower string is a legend showing which letter is used for which category. In the presented example, the analysis has been carried out in the context of action movies. When a movie belongs to several categories - and most movies do - the analysis is carried out for each associated category separately.
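The first step shown in the screenshot - dividing the series into parts holding sequences of equal elements - is a run-length grouping. A minimal sketch (hypothetical helper, not the thesis code), operating on the series as a string of category letters:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the first analysis step: split a rating series, encoded as a
// string of category letters, into maximal runs of equal characters.
// These runs are the raw parts that are later merged into interest periods.
public class Runs {

    public static List<String> runsOf(String series) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= series.length(); i++) {
            // Close the current run at the end of the string or when the letter changes.
            if (i == series.length() || series.charAt(i) != series.charAt(start)) {
                runs.add(series.substring(start, i));
                start = i;
            }
        }
        return runs;
    }
}
```

For example, the string "aaabba" is divided into the parts "aaa", "bb" and "a"; the merging step that produces interest periods then operates on these parts.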

2.3.5 Temporal preference analysis functions

The "Analysis" page allows the user to run a number of functions needed for performing temporal preference analysis. The output of these functions is written to the system console which is part of the NetBeans IDE. In some cases, the results are written to the database in order to be retrievable later. The following table gives an overview of the implemented functions:

Set category probabilities: For each category, calculate and store the probability that any movie belongs to this category. This information is needed later on when determining interest periods. Please see definition 10.

Set movie averages: For each movie, calculate and store the global average rating for this movie. The global movie average is used when determining the importance of personal category ratings as documented in section 2.3.6. In the basic prediction algorithm described in section 2.3.9, we use a binned movie average. Determining the binned averages is carried out by executing SQL scripts directly in the MySQL Workbench system.

Set category letters: Automatically assign category letter codes to a chosen category. In the case of movie genres, it suffices to use the 26 letters of the alphabet. Other categories such as directors need codes consisting of several letters.

Analyse users: This is the main routine to carry out an analysis of the rating series of users. The routine has two distinct modes of operation: 1) Analyse temporal preferences of users in the training set and store the associated temporal effects in the database. 2) Analyse temporal preferences of the users in the test set and apply temporal effects to rating predictions for these users. The desired mode of operation - as well as other aspects of system operation - can be set in code.


Fill frequencies: For a combination of effect type and length of the last identified interest period, calculate and store the number of occurrences for a number of effect value intervals. Please see section 3.3.6 for an explanation of how this information is used.

Show SlopeOne estimations: Slope One is a method for calculating rating predictions. As an early research idea, the method was to be adapted as our basic prediction algorithm. This idea was abandoned because Slope One predictions do not take known temporal effects into account.

Calculate effects: Calculate and store average effects by an alternative method that takes the frequency of occurrence into account. Please see section 3.3.6 for an explanation.

Aggregate average effects directly from periodeffects: This function calculates average effects by aggregating the effects directly from the table holding all effects determined for the training set. The aggregation can be one of three types as explained in section 3.2.2.

Table 2.1: Implemented Functions

2.3.6 Personal category preferences

The 3rd research objective - acknowledging the relevance of category preferences - is a conditio sine qua non for further research. If we cannot prove that capturing and using personal category preferences plays a significant role in predicting future ratings, then we cannot expect that recognizing interest periods will be of any use.

In order to meet the research goal, we would like to compare a prediction based solely on user preferences with a prediction based on other factors. To this end, we will use a category average rating and a movie average rating. The "movie average" is simply the system wide average rating for a movie. The "category average" is the user's average rating for movies which share one or more categories with the analysed movie and have been rated before it. Any given rating of a user can be expressed as a linear combination of the movie average and the category average:

\hat{r} = w_1 \cdot ma + w_2 \cdot ca

where \hat{r} is the predicted rating, w_1 and w_2 are weights, ma is the global movie average and ca is the category average. Also, w_1 + w_2 = 1, so w_2 = (1 - w_1) and \hat{r} = w_1 \cdot ma + (1 - w_1) \cdot ca = w_1(ma - ca) + ca.

By performing a least squares regression, we can determine the weights which lead to predicted ratings which are closest to the actual ratings in the test set. If the weight for the category average is not very small in comparison to the weight of the movie average, then we can conclude that personal category preferences play a significant role in predicting ratings. The sum of squared errors over all ratings in the test set can be expressed mathematically as:

SE = \sum_{i=1}^{n} (r_i - \hat{r}_i)^2

where \hat{r}_i is the predicted rating for index i, r_i is the given rating for index i and n is the total number of ratings in the test set. In order to find the weights leading to the minimum of the squared errors function, we need to calculate


the first order derivative for the variable w1 and set it to zero:

SE' = \left( \sum_{i=1}^{n} (r_i - \hat{r}_i)^2 \right)' = \left( \sum_{i=1}^{n} \big( r_i - (w_1(ma_i - ca_i) + ca_i) \big)^2 \right)'

= \sum_{i=1}^{n} 2 \big( r_i - (w_1(ma_i - ca_i) + ca_i) \big)(ca_i - ma_i)

Therefore:

0 = 2 \sum_{i=1}^{n} \big( r_i - (w_1(ma_i - ca_i) + ca_i) \big)(ca_i - ma_i)

\Rightarrow 0 = \sum_{i=1}^{n} \big( r_i(ca_i - ma_i) - w_1(ma_i - ca_i)(ca_i - ma_i) - ca_i(ca_i - ma_i) \big)

\Rightarrow 0 = \sum_{i=1}^{n} r_i(ca_i - ma_i) + w_1 \sum_{i=1}^{n} (ma_i - ca_i)^2 - \sum_{i=1}^{n} ca_i(ca_i - ma_i)

\Rightarrow w_1 = \frac{\sum_{i=1}^{n} ca_i(ca_i - ma_i) - \sum_{i=1}^{n} r_i(ca_i - ma_i)}{\sum_{i=1}^{n} (ma_i - ca_i)^2}

Having calculated the optimal weights w_1 and w_2 for the regression function over all the ratings in the test set, we arrive at the following formula:

\hat{r} = 0.55 \cdot ma + 0.45 \cdot ca

The weights have been rounded to two decimal places. As can be seen, the optimal weight associated with the category average is not much smaller than the weight of the movie average. This means that the predictive power of the category average is not much smaller than that of the movie average. This is a favourable result. It proves that personal category preferences do play a significant role when trying to predict a movie rating.

In the calculation shown above, the optimal weights have been determined over all ratings of all users in the test set. We could use the formula shown above to predict ratings in a very general way. Calculating the optimal weights per user leads to a better precision. Some users are inclined to follow the general consensus and thus rate movies close to the average rating. Other users show greater individuality and rate movies closer to their category average. Finding the optimal weights per user is performed by regression analysis as explained in section 2.3.8.
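The closed-form solution for w_1 derived above translates directly into code. The following is a minimal sketch (class and method names are ours, not the thesis implementation):

```java
// Sketch of the least-squares fit for the maca combination:
//   w1 = (sum ca_i(ca_i - ma_i) - sum r_i(ca_i - ma_i)) / sum (ma_i - ca_i)^2
// with w2 = 1 - w1, so that r_hat = w1*(ma - ca) + ca.
public class MacaWeights {

    // r: given ratings; ma: movie averages; ca: category averages.
    public static double optimalW1(double[] r, double[] ma, double[] ca) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < r.length; i++) {
            double d = ca[i] - ma[i];
            numerator += ca[i] * d - r[i] * d;
            denominator += (ma[i] - ca[i]) * (ma[i] - ca[i]);
        }
        return numerator / denominator;
    }

    // Predicted rating; equivalent to w1*ma + (1 - w1)*ca.
    public static double predict(double w1, double ma, double ca) {
        return w1 * (ma - ca) + ca;
    }
}
```

Fitting over all test-set ratings yields the global weights rounded above; fitting per user, as described in section 2.3.8, yields individual weights.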


2.3.7 Prediction algorithms

For convenience, we will refer to a prediction algorithm with the term "predictor". All predictors are calculated in the context of the analysed user. To support the preparatory phase of research we have implemented the following basic predictors in our system:

Movie Average (ma): The ratings in our dataset are divided into time intervals called bins. A bin in our system spans 10 consecutive weeks. The bin in which the analysed movie rating falls is determined. The movie average equals the average rating of the analysed movie within the relevant bin.

Category Average (ca): The category average is the average of all user ratings up to the analysed position for movies which share one or more categories with the analysed movie.

Day Average (da): The day in which the analysed movie rating falls is determined. The day average equals the average of all user ratings on that day, up to the analysed position.

Movie Average Category Average (maca): A weighted combination of the ma and ca predictors. The optimal weights are determined per user by linear regression optimization.

Movie Average Category Average Day Average (macada): A weighted combination of the ma, ca and da predictors. The optimal weights are determined per user by means of linear regression optimization.

Table 2.2: Implemented Predictors

In section 2.3.9 a clarification is given of the chosen implementation specifics of the presented predictors. The first three predictors shown above are called singular predictors. They can be considered the basic building blocks for constructing a combined predictor. The last two predictors shown above are combined predictors. A combined predictor determines a weighted average of the singular predictors which are part of it. The optimal weights of the combined predictor are determined per user by means of linear regression analysis.
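The singular/combined structure can be sketched as follows. The interface and method names are our own invention, not the thesis classes, and the per-user weights are assumed to come from the regression described in section 2.3.8:

```java
// Sketch of the predictor structure: singular predictors each produce one
// value in the context of the analysed user and position; a combined
// predictor is a weighted average of its singular parts.
public class Predictors {

    // A singular predictor such as ma, ca or da.
    public interface Predictor {
        double predict();
    }

    // Convenience factory used here to stand in for a real singular predictor.
    public static Predictor constant(double value) {
        return () -> value;
    }

    // Weighted combination of singular predictors, e.g. maca or macada.
    // Weights are assumed to be fitted per user by linear regression.
    public static double combined(double[] weights, Predictor... parts) {
        double sum = 0.0;
        for (int i = 0; i < parts.length; i++) {
            sum += weights[i] * parts[i].predict();
        }
        return sum;
    }
}
```

With the global weights from section 2.3.6, a maca prediction for ma = 4 and ca = 3 would be 0.55 * 4 + 0.45 * 3 = 3.55.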

2.3.8 Regression analysis

In order to determine optimal weights for combined predictors, regression analysis is performed. The goal of regression analysis is to model the relationship between a dependent variable and one or more independent variables. In our case the dependent variable is the rating for a movie and the independent variables are the basic predictors which together make up the combined predictor. We would like to express the predicted movie rating as a combination of the basic predictors. In the case of the macada predictor this would be:


\hat{r} = w_1 \cdot ma + w_2 \cdot ca + w_3 \cdot da + C

where \hat{r} is the predicted rating, w_1, w_2 and w_3 are the optimal weights for the particular user, ma is the movie average, ca is the category average, da is the day average and C is a constant.

The function which expresses the dependent variable in terms of the independent variables is called the regression function. In the example above the relationship is linear, but non-linear regression functions are possible as well. The most precise recommender engines, like the one which proved best in the Netflix competition, use a linear combination of a large number of different predictors to determine the estimated movie rating.

As we remember from section 2.3.6, the calculation of the regression function depends on finding the minimum of the function of squared errors. Essentially, we want to find a function for which the difference between the given ratings and the predicted ratings is as small as possible. Such a function is an optimal representation of the linear relationship between the basic predictors and the given ratings.

In our research we determine the optimal weights for our basic predictors by calling the regression class of Dr. Michael Thomas Flanagan's Java Scientific Library. The regression class accepts a number of training data points to perform the analysis on. These are examples of given ratings and the corresponding ma, ca and da prediction values which can be determined for the rating series of the analysed user. In our system, we start using combined predictors after having analysed at least 10 ratings from the rating series. As every user in the system has inputted at least 20 ratings, this is possible for all users. After having predicted a particular rating, we use the actual rating at the analysed position to determine a new regression function before predicting the next rating.
In other words, we recalculate the regression function every time that we move forward in the rating series of a user and new information about the given ratings becomes available. Increasing the number of example data points allows for a more precise calculation of the regression function.
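This recalculate-as-you-go scheme can be sketched as follows. The sketch uses names of our own choosing and, for self-containment, substitutes the closed-form maca fit from section 2.3.6 for Flanagan's regression class:

```java
import java.util.Arrays;

// Sketch of the sequential scheme: starting after the first 10 ratings,
// each rating is predicted with weights fitted on all earlier ratings,
// then that rating joins the training points for the next prediction.
public class PrequentialMaca {

    // r: given ratings in series order; ma, ca: predictor values per position.
    public static double[] predictSeries(double[] r, double[] ma, double[] ca) {
        double[] predictions = new double[r.length];
        Arrays.fill(predictions, Double.NaN); // first 10 positions get no prediction
        for (int pos = 10; pos < r.length; pos++) {
            double w1 = fitW1(r, ma, ca, pos); // refit on ratings 0 .. pos-1
            predictions[pos] = w1 * (ma[pos] - ca[pos]) + ca[pos];
        }
        return predictions;
    }

    // Closed-form least-squares weight over the first n ratings.
    private static double fitW1(double[] r, double[] ma, double[] ca, int n) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            double d = ca[i] - ma[i];
            numerator += (ca[i] - r[i]) * d;
            denominator += d * d;
        }
        return numerator / denominator;
    }
}
```

The growing window mirrors the behaviour described above: every step forward adds one observed rating to the data the regression is computed from.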

2.3.9 The basic prediction algorithm

This research is mainly concerned with temporal effects associated with interest periods. As we can learn from [22], a number of temporal effects can be identified in a dataset containing movie ratings. We would like to isolate significant temporal effects that can be ascribed to other causes than interest periods. This will increase our certainty that any remaining effect that can be measured is indeed associated with interest periods. To this end we need to construct a basic prediction algorithm as stipulated in research objective 4. As the research conducted by Yehuda Koren is considered exemplary, we have used the presented findings as a reference point. In the temporal dynamics research paper [22] the following significant temporal effects are mentioned:

Time changing movie bias. The average movie rating changes over time. It has been found that a bin of 10 consecutive weeks works well for representing the time variant movie average. In our research we therefore determine the movie average as the average rating for the considered movie in a period of 10 consecutive weeks. Please note that this is different from the global movie average as used in section 2.3.6. The global movie average is the movie's average rating over all ratings in the dataset.
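Under the assumption that bins are counted in consecutive 70-day blocks from the start of the dataset (the bin origin is not spelled out here), the bin lookup could look like this sketch with names of our own choosing:

```java
// Sketch of 10-week binning for the time-variant movie average: a rating's
// bin index is how many whole 70-day blocks have passed since the dataset
// origin. All ratings of a movie in the same bin share one movie average.
public class TimeBins {

    public static final long SECONDS_PER_DAY = 86_400L;
    public static final long BIN_SECONDS = 70 * SECONDS_PER_DAY; // 10 weeks

    // epochSeconds: timestamp of the rating; originSeconds: dataset start.
    public static int binIndex(long epochSeconds, long originSeconds) {
        return (int) ((epochSeconds - originSeconds) / BIN_SECONDS);
    }
}
```

A rating given 69 days after the origin falls in bin 0; one given 70 days after falls in bin 1.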


Time changing user bias. Users change their baseline ratings over time. In the reference research, this gradual change in user ratings is modelled by a linear function or by a spline. In our research we have chosen a different approach. For each user, we determine a category average rating. This is the average rating of movies that belong to one or more of the categories that the analysed movie belongs to and which are present in the rating series of the user up to the analysed position. As an alternative, one could first determine the set of average ratings given by the user for each category of the analysed movie and then calculate the average of these averages. We have tested both implementations and chosen the first one as it produces a more precise result. In our view, the category average is a reflection of the user-movie interaction and to some extent also of the time changing user bias. Accounting for the time changing user bias with a linear or spline model improves recommendation precision only by a small amount [22].

Single day effect. From the reference research we can learn that "multiple ratings a user gives in a single day, tend to concentrate around a single value". In our research we account for the single day effect by taking the average of all ratings given on the same day, up to the rating at the analysed position. Of course, if the rating for the analysed movie is the first rating on a day, the day average cannot be determined. According to [22] the single day effect is the strongest temporal effect found in the Netflix dataset. Accounting for this effect allows the best improvement of recommendation precision.

In order to make a suitable choice for our basic prediction algorithm, we have compared the precision of the two combined predictors. The first one - code named macada - is a weighted combination of all 3 predictors described above. The second one - code named maca - uses a combination of the movie average and the category average. Both algorithms have been tested for precision using all ratings from the test set. In both cases the optimal weights are determined by linear regression. The following table shows the precision of the singular predictors and the combined predictors, rounded to three decimal places:

Predictor   ma      ca      da      maca    macada
RMSE        0.921   0.979   0.981   0.841   0.848

Table 2.3: Comparison of predictor RMSE
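The RMSE values in Table 2.3 follow the standard root-mean-square-error formula; as a sketch (helper names are ours):

```java
// Standard RMSE over a set of ratings: the square root of the mean
// squared difference between given and predicted ratings. Lower is better.
public class Rmse {

    public static double rmse(double[] actual, double[] predicted) {
        double sumSquaredError = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sumSquaredError += error * error;
        }
        return Math.sqrt(sumSquaredError / actual.length);
    }
}
```

Because errors are squared before averaging, RMSE penalises large prediction mistakes more heavily than small ones.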

As can be seen from the measurement data, the da predictor is the least precise of the basic predictors. Furthermore, the maca predictor is more precise than the macada predictor over the whole test set. Other combinations that one can think of, like "mada" and "cada", have also been shown to be less precise than the maca predictor. This result is surprising - in our dataset we do not see any improvement in precision from using the day average. Equally noteworthy: the maca predictor, which is the result of a relatively uncomplicated calculation, shows good precision. Please refer to section 4.1.2 for further discussion of this result.

Besides being less precise, the macada predictor is also not able to produce a meaningful prediction in as many cases as the maca predictor can. In about 10 percent of cases, the macada predictor will not produce a prediction while the maca predictor will. The fact that the macada predictor is based on three independent variables leads

to an increased complexity of the necessary linear regression calculations. This could be solved by using a different optimization technique, such as gradient descent [21], but this lies outside the scope of our research.

On the basis of the obtained results, we have decided to use the maca predictor as our basic prediction algorithm. The goal of further research is to improve the maca predictor's precision by incorporating temporal effects which are measured by means of temporal preference analysis.


Chapter 3

Results

The objectives stipulated in section 1.3 can coarsely be divided into a preparatory part and a practical part. The realisation of the first set of objectives - developing a theoretical basis for temporal preference analysis and implementing a software system - has been documented in the previous chapter. In the current chapter, we will present the results of performing temporal preference analysis with the goal of improving the precision of the basic maca prediction algorithm.

3.1 Measuring temporal effects

The first question to be answered is whether we can detect a temporal effect which can be related to interest periods. In order to do that, we must analyse the rating series of a user and express it as a set of parameter values. We must also measure the associated temporal effect - that is, if any effect can be measured at all.

3.1.1 Analysing the rating series

The goal of analysing the rating series of a user is to express identified interest and disinterest periods by a set of parameter values. The rating series of a user - typically consisting of many interest and disinterest periods - can be divided into three separate regions. The regions discerned are the "gap", the "last period" and the "rest of series". The last period is simply the last identified interest period in the rating series. The gap is the last identified disinterest period. If the analysed movie at position "x" is adjacent to the last interest period, then the gap length is zero. All ratings before the last interest period are called the rest of the rating series. See the following figure for a graphical depiction:

     



Figure 3.1: Rating Series Regions

The presented division is motivated by the idea that at the analysed position, the most significant interest period will be the last one, as it reflects the most recent preferences of the user. For the same reason, the last disinterest period, or the gap, is also studied as a separate entity.

Every region in the rating series is represented by its own set of parameters. The parameter values - which are determined by measurement - characterize the rating series of a user up to the analysed movie rating at position "x". The following table lists the researched parameters pertaining to the three regions in the rating series:

gap length: Number of ratings following the last interest period.
gap neutrals: Number of neutral ratings within the gap.
gap negatives: Number of negative ratings within the gap.
gap average: Average rating of category related movies within the gap.
last length: Number of ratings within the last interest period.
last average: Average rating of movies within the last interest period.
last frequency: Number of positive ratings divided by the total number of ratings within the last interest period.
last negatives: Number of negative ratings within the last interest period.
last neutrals: Number of neutral ratings within the last interest period.
total length: Total number of ratings up to the analysed position.
interest periods: Number of interest periods found before the last interest period.
disinterest periods: Number of disinterest periods found before the last interest period.
positivity: Average length of an interest period divided by the average length of a disinterest period. The periods are determined before the last period.

Table 3.1: Researched Parameters

The presented parameters have been chosen to reflect different aspects of interest periods that could show a relation with the given movie ratings at the analysed position. In accordance with research objective 5, we would like to find out which of these parameters show a strong relation with eventual temporal effects. In order to do that, we perform an analysis of the rating series present in the training set. For each analysed movie rating, we store the interest period parameter values together with the measured temporal effects. If a persistent temporal effect is present, we should be able to see it as a pattern in the data.

3.1.2 Measuring effects

As described in the previous chapter, the basic maca predictor accounts for known temporal effects at the analysed position. We would like to determine by measurement whether any extra temporal effect can be detected that can be related to interest period parameters. To this end, we simply record the difference between the given rating and the maca prediction for all ratings in the training set. This difference is called the maca effect. We also store the difference between the given rating and the movie average, called the movie effect, and between the given rating and the category average, called the category effect. The following table gives an overview of the stored temporal effects:

movie effect: Difference between the given rating and the movie average predictor.
category effect: Difference between the given rating and the category average predictor.
maca effect: Difference between the given rating and the maca predictor.

Table 3.2: Measured Temporal Effects

For each analysed movie rating, the effects are stored together with the measured values of the interest parameters described previously. As any movie can belong to one or more categories, we need to carry out the analysis for each associated genre. For the entire training set, which consists of about five million ratings in total, this means storing about twelve million measurements in the database.
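As a minimal illustration of what each stored measurement contains, the three effect types are plain differences between the given rating and the respective predictor values (an illustrative Python sketch, not the thesis implementation; names are assumptions):

```python
def record_effects(rating, movie_avg, category_avg, maca_prediction):
    """Return the three temporal effects for one analysed rating.

    Each effect is the difference between the given rating and a
    basic predictor; a positive value means the user rated above
    the prediction.
    """
    return {
        "movie_effect": rating - movie_avg,
        "category_effect": rating - category_avg,
        "maca_effect": rating - maca_prediction,
    }

# A rating of 4 stars against a movie average of 3.5, a category
# average of 3.0 and a maca prediction of 3.8 (illustrative values):
effects = record_effects(4.0, 3.5, 3.0, 3.8)
```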

3.1.3 Depicting effects

We would like to see the average effects for different values of each interest period parameter. If there is no persistent effect present in the data, then the difference between the given ratings and the predictors should be zero on average. On the other hand, if any temporal effect is present, it should show up as a constant bias on our chart. When making the selection, we do not discern between movie categories, nor do we set any restrictions regarding the other parameters. The following chart depicts the average effects for different lengths of the gap:

[Chart: average category_effect, movie_effect and maca_effect (vertical axis, -0.3 to 0.3 stars) plotted against gap lengths of 0 to 30 ratings.]

Figure 3.2: Gap Length

35 3.1 Measuring temporal effects Results

It is evident from the presented chart that, on average, increasing gap lengths lead to increasing negative effects. Looking at the graph for a gap length of 10 ratings, we can see that all three effects are negative. This means that on average, the given ratings are below the prediction values for all three analysed predictors. On the other hand, when the gap length is small, the effects are predominantly positive. In such cases, we are analysing movies close to the last identified interest period and the given ratings are higher than their predictions. The effect is least pronounced for the maca predictor. This is to be expected, as the maca predictor is the most precise of our basic predictors. Nevertheless, the pattern shown in the data seems to suggest that accounting for the displayed temporal effect can increase the overall precision of the maca predictor. The effects shown above are expressed in stars. When compared to the size of the rating scale, which is 4.5 stars, an average effect of 0.1 star is about 2.2 % of the total rating scale. The measured effects seem significant enough to allow for an improvement of precision by accounting for them in the prediction algorithm. The following chart shows the average effects for different lengths of the last interest period:

[Chart: average category_effect, movie_effect and maca_effect (vertical axis, -0.3 to 0.5 stars) plotted against last period lengths of 1 to 31 ratings.]

Figure 3.3: Last Period Length

This effect is obviously in opposition to the previous one. Longer interest periods seem to be an indication of a user's affinity for movies from the analysed category. It is interesting to note that the maca and movie effects are predominantly positive, while the category effect is predominantly negative with increasing lengths. This is not difficult to explain. Long interest periods are the effect of a user giving many high ratings to movies of a certain category. This implies that his related category average is high. The chance that any subsequent rating will be below this category average is high. This leads to negative effects. On the other hand, the chance that a subsequent rating will be above the movie average is high as well, as the movie average does not reflect the personal preferences of the analysed user. This leads to positive effects. The maca effect is less regular than in the case of the gap length effects, especially for long period lengths. It should be noted that long period lengths have a smaller chance of occurrence than short ones. To give an idea: a last period length of 5 ratings occurs about 288 thousand times in the training set, while a last period length of 30 ratings occurs about 6 thousand times. Needless to say, the average effects calculated for long last period lengths are based on a smaller number of samples. Not all parameters show the same kind of strong relationship with the measured effects. The following chart shows the average effects in relation to the total length of the analysed rating series:

[Chart: average category_effect, movie_effect and maca_effect (vertical axis, -0.15 to 0.15 stars) plotted against total rating series lengths of 10 to 40 ratings.]

Figure 3.4: Total Length of Rating Series

As stated in section 2.3.8, we calculate the maca prediction starting from rating 11 in the rating series, so the minimal total length of the analysed rating series is 10 ratings. From the chart shown above, we can see that for total lengths above 30 ratings, there seems to be little relation with the calculated average effects. There seems to be some relation for lengths from 10 to 30 ratings, but the calculated effects seem quite small and irregular. In our training dataset, the average total length of all analysed rating series is about 225 ratings. This rather high average is partly due to some users who have entered an extraordinarily large number of ratings. The conclusion we draw is that accounting for effects on the basis of the total length of the analysed rating series is not going to be very helpful for increasing recommendation precision. Similar charts have been created for all researched parameters; please see appendix A. After an analysis of the results, we can summarize the findings per parameter:


gap length: Strong correlation with negative effects for all analysed predictors.
gap neutrals: Similar to gap length. Slightly less correlation.
gap negatives: Strong correlation with negative effects for all analysed predictors. Low count numbers lead to irregular average effect values.
gap average: Almost a linear relationship, where lower average ratings relate to stronger negative effects. Calculable only when there are category related ratings within the gap.
last length: Strong correlation with positive effects for the maca and ma predictors. Relates to negative effects for the ca predictor.
last average: Some relation to positive effects for the ma predictor. Weak relation to effects with either the ca or maca predictors.
last frequency: Weak relation to any sort of effect.
last negatives: Resembles last length. The maca effect is very irregular. The relationship is weak.
last neutrals: Very similar to last length.
total length: Weak relation to calculated effects. The effects are irregular.
interest periods: Minimal effects for the maca and ma predictors. Small effect for the ca predictor.
disinterest periods: Minimal effects for the maca predictor. Some irregular effects for the ma and ca predictors.
positivity: Strong correlation with all effect types.

Table 3.3: Results Per Parameter

The parameters which show the best correlation with the calculated average effects are the gap length, the gap average, the last length and the positivity. In accordance with intuition, the strongest positive effects are related to long lengths of the last interest period, while negative effects are related to long gap lengths. The next step in our research will allow us to verify the main thesis by answering the question whether we can use knowledge about interest periods and associated temporal effects to increase recommendation precision. We would like to find a method to increase the precision of the basic maca predictor by applying temporal effects to it.

3.2 Applying temporal effects

Up to this point, it has been established that temporal effects can be detected in our training set as a constant average divergence from the predictions made with the basic predictors. It has also been determined that these effects are related to interest period parameters.


Different aspects of interest periods show a different degree of correlation with the calculated average effects. The best correlated effects are related to the "gap length", the "last length" and the "positivity" parameters. It should be possible to construct an algorithm that uses effects associated with these parameters to improve overall precision.

3.2.1 The algorithm

Let us start by explaining the general idea behind improving the precision of a basic predictor. The whole process can be described as a number of steps in pseudo code:

1. Perform an analysis of the rating series present in the training set. Record the found interest period parameter values and the associated temporal effects.
2. Create a table with aggregated temporal effects. This table holds the average temporal effect for a certain movie category and combination of interest period parameter values.
3. Perform an analysis of the rating series present in the test set. For each analysed movie, determine the interest period parameter values for each associated category.
4. Read the average effects from the aggregated table which correspond to the parameter values determined in the previous step.
5. Apply the average effects to the basic predictor. The result is a temporal predictor which should show a better precision than the basic predictor.

Table 3.4: The algorithm for improving maca predictions

The presented algorithm is based on the idea that the average effects, which have been calculated for a high number of samples in the training set, can be applied as a 'corrective factor' to the result of basic predictions for the test set. Before the correction can be applied, we need to read the effects from the database as stated in step 4. The effects are read on the basis of the determined interest period parameters. There are three parameters which show a strong relation to the calculated average effects. It is not clear, however, what combination of parameters will give the best characterization of the rating series when aggregating effects.
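Steps 2, 4 and 5 of the algorithm can be sketched as follows (illustrative Python, not the thesis's Java implementation; the record layout and all names are assumptions):

```python
from collections import defaultdict

def aggregate_effects(records):
    """Step 2: average the measured maca effects per
    (category, gap length, last length) combination."""
    sums = defaultdict(lambda: [0.0, 0])
    for category, gap_len, last_len, effect in records:
        acc = sums[(category, gap_len, last_len)]
        acc[0] += effect
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def temporal_prediction(basic_prediction, key, effect_table):
    """Steps 4 and 5: apply the aggregated average effect as a
    corrective factor to the basic prediction; fall back to the
    uncorrected prediction when no effect is stored for the key."""
    return basic_prediction + effect_table.get(key, 0.0)

# Three measurements from the training set (illustrative values):
records = [("Action", 0, 10, 0.2), ("Action", 0, 10, 0.4),
           ("Drama", 5, 3, -0.3)]
table = aggregate_effects(records)
corrected = temporal_prediction(3.5, ("Action", 0, 10), table)
```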

3.2.2 Aggregating effects

The table holding the effects measured in step 1 of the algorithm presented earlier contains twelve million records. As we are interested in average effects, we aggregate the data, grouping by a specific combination of parameter values. When deciding which combination of parameters to group by, the following possibilities have been considered:

Group per parameter. The effects are aggregated separately per parameter. The problem with this approach is that a grouping by a single parameter disregards the values of the other parameters. It is not clear how to combine the effects relating to different parameters when trying to apply the calculated effects to the basic predictor.

Group by gap length and last length. This grouping seems like a reasonable choice. Both parameters are in opposition to each other, so their combination provides a good characterization of the rating series. The associated temporal effects can be applied in a straightforward manner.

Group by gap length, last length and positivity. This combination would seem even better than the previous one. The addition of one extra parameter makes the characterization of the rating series even more precise. The downside is that grouping by 3 parameters often leads to parameter combinations which are not well represented in the data set. This in turn leads to calculated average values which might not be reliable. Applying such average values might not increase the recommendation precision.

In order to determine the relative merits of the presented options, we have implemented them as a set of temporal predictors.

3.2.3 Temporal predictors

After running the analysis routines on our training set, we have the aggregated temporal effects and associated interest period parameter values stored in the database. The next step is to use these effects while generating predictions for the test set. A straightforward idea is to read the average temporal effect from the database which corresponds to the current interest period parameters. This average effect can then be applied to our basic predictor in an attempt to improve its precision. The result of applying average effects to the basic predictor is a temporal predictor which accounts for temporal effects related to interest periods. For the maca temporal predictor we write:

r̂ = maca + (1/n) ∑_{i=1}^{n} Effect(PV_i)

where r̂ is the predicted rating, maca is the prediction of the maca predictor, n is the number of categories related to the analysed movie and Effect(PV_i) is the average temporal effect corresponding to a particular combination of parameter values for movie category i. As any movie can belong to a number of categories, the corresponding average effects are summed and divided by the total number of associated categories. The result is an average of average effects. In order to assess the effectiveness of applying average effects to our basic predictor, a number of temporal predictors have been implemented which use the interest parameters according to the options presented in the previous section. The following table gives an overview:

maca_pe_1 (maca temporal with an average effect based on 1-parameter groupings): The values of the gap length, last length and positivity parameters are determined for the analysed movie. The average maca effects associated with the determined parameter values are read from the database. An average is calculated from the three read effects. This is repeated for all associated movie categories and the average of the results is applied as a correction to the basic maca predictor.

maca_pe_2 (maca temporal with an average effect based on 2-parameter groupings): The values of the gap length and last length parameters are determined for the analysed movie. The average maca effect associated with the determined combination of gap length and last period length is read from the aggregated table. This is repeated for all associated movie categories and the average of the results is applied to the basic maca predictor.

maca_pe_3 (maca temporal with an average effect based on 3-parameter groupings): The values of the gap length, last length and positivity parameters are determined for the analysed movie. The average maca effect associated with the determined combination of gap length, last period length and positivity is read from the aggregated table. This is repeated for all associated movie categories and the average of the results is applied to the basic maca predictor.

Table 3.5: Temporal Predictors

The maca_pe_2 and maca_pe_3 predictors determine one effect per category for a combination of interest parameter values. The maca_pe_1 predictor determines three effects per category - one for each determined parameter value.
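The per-category averaging that all three predictors share, i.e. the formula given earlier in this section, can be sketched as follows (an illustrative Python sketch, not the thesis's Java implementation; names and values are assumptions):

```python
def maca_temporal(maca, category_effects):
    """Temporal maca prediction: the basic maca prediction plus the
    mean of the average effects read for the movie's n categories."""
    if not category_effects:
        return maca  # no effects available: fall back to the basic prediction
    return maca + sum(category_effects) / len(category_effects)

# A movie belonging to two categories, with average effects of
# 0.2 and -0.1 stars read from the aggregated table:
r_hat = maca_temporal(3.6, [0.2, -0.1])
```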

3.2.4 Optimal merge factor

Before testing the precision of the implemented temporal predictors over the whole test set, we would like to choose a reasonable value for the merge factor. The merge factor determines how substrings are merged into interest periods. The higher the merge factor, the more positive ratings must be present in the resulting interest periods. Different merge factors will lead to different interest periods and thus to a different analysis result. We do not know up front which merge factor leads to the best precision. Therefore, we have measured the resulting precision when using several different merge factors for the maca_pe_2 predictor. Only positive effects have been applied in this case, as explained in section 3.3.5. This leads to very low RMSE values. Because the calculation of the RMSE for the whole test set takes over 24 hours to complete, we have calculated it for a batch of 300 users. This is about 1 % of the total and sufficient to make a comparison. The following table lists the obtained RMSE values, rounded to three decimal places:

Merge factor: 3, 4, 5, 6, 7
RMSE: 0.807, 0.805, 0.802, 0.804, 0.804

Table 3.6: maca_pe RMSE for different merge factors


The RMSE values have been calculated under special circumstances. They are not representative of the overall precision and are solely used as a basis for comparison. On the basis of the shown results, we have chosen to use a merge factor of 5 as a standard for all further calculations. A merge factor equal to 5 means that the number of positive, category related ratings within a popularity period needs to be at least 5 times higher than would be the result of a uniform distribution. Please see section 2.2.1 for the relevant definitions.
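For reference, the RMSE values used for these comparisons follow the standard root mean squared error definition, sketched here in Python (illustrative, not the thesis implementation):

```python
from math import sqrt

def rmse(predictions, actuals):
    """Root mean squared error between predicted and given ratings:
    the square root of the mean of the squared differences."""
    assert len(predictions) == len(actuals)
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sqrt(sum(squared) / len(squared))

# A perfect predictor has RMSE 0; here one of three predictions
# is off by one star:
error = rmse([3.0, 4.0, 5.0], [3.0, 4.0, 4.0])
```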

3.3 Recommendation precision

3.3.1 Improving singular predictors

The movie and category effects which are associated with the singular predictors have been recorded simultaneously with the maca effects associated with the combined maca predictor. There is no fundamental difference in applying any effect type to its associated predictor. We have chosen to use the movie effects and category effects calculated for a grouping of the two parameters gap length and last length, and test their effectiveness when applied to the associated singular predictors. As observed earlier, the singular predictors are less precise than the maca predictor and the effects associated with them are much larger. It is reasonable to expect that applying the relevant effects to the singular predictors will yield the biggest improvement of precision. If our method is valid, we should see it clearly in the results. The temporal versions of the singular predictors are called ma_pe and ca_pe. In order to test this supposition, the RMSE values for both the basic predictors and their corrected versions have been calculated for all ratings in the test set. The following table presents the results:

Predictor: ma, ma_pe, ca, ca_pe
RMSE: 0.921, 0.890, 0.979, 0.955

Table 3.7: RMSE for temporal singular predictors

Both the ma_pe and ca_pe predictors show a significant improvement of precision in comparison to their uncorrected versions. The implementation details of these predictors are left out of the discussion; suffice it to say that they closely resemble the maca_pe_2 predictor. The obtained results provide the first evidence in support of the main thesis.

3.3.2 Improving the maca predictor

The goal of this research is to improve the precision of the maca predictor. Such an improvement provides strong evidence in support of the thesis. A calculation run for the entire test set was carried out and the RMSE values have been determined for all three variants of the temporal maca predictor. The following table presents the results:


Predictor: maca, maca_pe_1, maca_pe_2, maca_pe_3
RMSE: 0.841, 0.840, 0.839, 0.839

Table 3.8: RMSE for temporal maca predictors

The improvement of the maca predictor is not as pronounced as the improvement of the singular predictors shown before. It is, however, significant enough to verify the main thesis of this research: temporal preference analysis can be used to improve the precision of recommendations. Please see section 4.4.2 for further discussion of this result.

3.3.3 A kernel based alternative algorithm

Early on in the research, a completely different type of algorithm was envisioned for determining the value of temporal effects. The mapping from a set of interest period parameter values to an effect value can be seen as a multidimensional function. The exact function is unknown to us, but we can try to approximate it by means of regression analysis. The regression function approximates the unknown mapping from interest parameter values to effect values for the entire function domain. The theoretical benefit of calculating the effect by means of a regression function is that we can calculate it for combinations of interest parameter values which have not been observed in the training set. On the other hand, if we cannot base the regression analysis on a sufficient amount of data points, then the resulting regression function might not be very precise. In order to find out how such an approach would perform in practice, we have decided to perform a nonparametric regression analysis and use the resulting regression function to predict effect values. Nonparametric regression is a subclass within the family of regression methods. In the case of nonparametric regression, the predictor does not take a predetermined form but is constructed according to information derived from the data. Nonparametric regression requires larger sample sizes than regression based on parametric models, because the data must supply the model structure as well as the model estimates [4]. For testing the accuracy of predictions based on a nonparametric regression function, we have used the drej library by Greg Dennis. The library allows one to perform a kernel based regression. Kernel regression estimates the continuous dependent variable from a limited set of data points by convolving the data points' locations with a kernel function.
Approximately speaking, the kernel function specifies how to "blur" the influence of the data points so that their values can be used to predict the value for nearby locations [4]. A Gaussian curve has been chosen for our kernel function. The analysis was performed on 10000 data points consisting of measured interest period parameter values and related effect values. The resulting regression function was then used to predict effect values during subsequent analysis runs. The predicted effect values have been applied to the maca predictor as a corrective factor, resulting in a maca_pe predictor. The results of predicting effect values with the nonparametric regression function were not satisfactory. The precision of the maca_pe predictor was much worse than that of the uncorrected maca predictor. On the basis of this result, we have abandoned the idea to determine effects by means of a nonparametric regression function. Although a parametric model such as linear regression could lead to a better precision, we have decided to try a more direct approach and calculate the effect as an average over many samples. Thanks to the fact that the training set consists of almost five million ratings, we can find a high number of samples for almost every combination of parameter values. In exotic cases, such as with users who have interest periods consisting of several hundred ratings, the number of samples might be too low to calculate an accurate average. This is not a problem though, as such situations are only a marginal fraction of all cases in which we will be predicting effect values. It was established that calculating effects as a simple average of observed values leads to a satisfactory result, and this method was chosen as the standard for all further calculations.
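The kind of kernel based regression that was tried can be illustrated with a minimal Nadaraya-Watson estimator using a Gaussian kernel (an illustrative Python sketch, not the drej implementation; parameter names and values are assumptions):

```python
from math import exp

def gaussian_kernel(distance_sq, bandwidth):
    """Gaussian kernel: nearby data points receive high weight,
    distant ones receive weight close to zero."""
    return exp(-distance_sq / (2.0 * bandwidth ** 2))

def kernel_regression(train_points, train_effects, query, bandwidth=2.0):
    """Nadaraya-Watson estimate: a weighted average of the observed
    effect values, weighted by kernel distance to the query point.

    train_points : list of interest parameter vectors,
                   e.g. (gap_length, last_length)
    train_effects: measured effect value for each vector
    query        : parameter vector to predict an effect for
    """
    weights = []
    for point in train_points:
        dist_sq = sum((p - q) ** 2 for p, q in zip(point, query))
        weights.append(gaussian_kernel(dist_sq, bandwidth))
    total = sum(weights)
    if total == 0.0:
        return 0.0  # query too far from all data points
    return sum(w * e for w, e in zip(weights, train_effects)) / total

# Effects measured at two parameter combinations; querying one of
# them returns (almost exactly) its own effect value:
estimate = kernel_regression([(0, 10), (10, 0)], [0.3, -0.3], (0, 10))
```

The upside of such an estimator is that it also produces a value for parameter combinations that were never observed; the downside, as found in practice, is that it needs many data points near the query to be precise.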

3.3.4 The ma_pe_ca_pe alternative algorithm

The main algorithm for improving precision calculates average maca effects and applies these effects to the maca predictions for the test set ratings. The maca predictor is a combined predictor. This means that the effects are measured and applied in relation to a predictor which is the result of linearly combining two singular predictors. As documented in sections 3.1.3 and 3.3.1, we have also determined effects for the ma and ca predictors separately. We can therefore construct an alternative algorithm, in which we first try to improve the ma and ca predictions separately and then combine them linearly in the same way as the maca prediction is constructed. The code for this predictor is ma_pe_ca_pe, to reflect the fact that the period effects are first applied to the singular predictions and then the results are combined. The optimal weights for combining the corrected singular predictors in the ma_pe_ca_pe predictor are determined by linear regression, as explained in section 2.3.8. The ma_pe_ca_pe predictor shows a precision which is slightly worse than that of the maca_pe predictor when predicting for the same test set under the same system settings. Furthermore, there exists a small number of cases in which the ma_pe_ca_pe predictor cannot produce a prediction, while the maca_pe predictor can. We conclude that in our test setting, the ma_pe_ca_pe predictor is inferior to the maca_pe predictor and we will concentrate our efforts on improving the maca_pe predictor. Although improving singular predictors before combining them linearly proves to be an inferior technique in our system, this is not to say that the technique can be dismissed in all cases. In general, we are free to choose whether we want to measure and apply temporal effects to combined predictors or to singular predictors.
In the latter case, we simply improve each singular predictor separately and then combine them linearly by using weights determined by some optimization technique. It is conceivable that in some cases this approach might lead to a better precision.

3.3.5 Asymmetry of results

The effects retrieved from the database for a particular category and combination of interest period parameters are combined algorithmically as documented in section 3.2.3. The resulting average effects can either be positive or negative. Positive effects increase the value of the rating prediction, negative effects decrease it. Positive effects are associated with long interest periods, expressed by the last length and positivity parameters. Negative effects are associated with long disinterest periods, expressed by the gap and positivity parameters. It is surprising to see that positive and negative effects do not contribute to an increase in precision in the same manner. In fact, applying negative effects is detrimental to the attained precision. The following table shows the calculated RMSE when using only positive effects, only negative effects or both. The RMSE has been calculated using all ratings in the test set.

Predictor: maca, maca_pe_1, maca_pe_2, maca_pe_3
Positive only: 0.841, 0.829, 0.818, 0.831
Both: 0.841, 0.840, 0.839, 0.839
Negative only: 0.841, 0.848, 0.857, 0.847

Table 3.9: RMSE for positive versus negative effects

As we can see, applying negative effects to the maca predictor leads to a decreased precision for all temporal predictors. The strongest deterioration is displayed by the maca_pe_2 predictor. This predictor uses the gap length and last length parameters. On the other hand, the best improvement of precision is also displayed by the maca_pe_2 predictor. When using only positive effects, the improvement is substantial. The problem with this approach is that the ratio of positive versus negative effects is about fifty-fifty. This means that applying solely positive effects is only possible for about half of all cases.

3.3.6 Frequency versus average

As we know, the average value of a random variable gives limited information without looking at the probability distribution of that variable. In our case, we are interested in the probability distribution of temporal effects. Rating a movie is a natural activity which depends on a great number of factors related to the user. The difference between the given rating and the maca prediction can be considered to follow from natural causes. The size of natural phenomena, such as the size of a tree, is often distributed normally. Such a distribution is expressed by the bell curve. We would expect to see a similar distribution of maca effects. To confirm this claim, we have used the SOFA statistics package to investigate the distribution of maca effects. When looking at the measured maca effects in general, that is without discriminating between movie categories, SOFA clearly shows a distribution which approaches the normal distribution. Because SOFA does not have a good exporting facility for charts, we have simulated the charts seen in SOFA by making a suitable selection from the database and charting the data in NeoOffice's spreadsheet application. Please see figure 4.2.1 for the general case. At this point, let us take a look at the distribution of effect values for a common combination of interest period parameter values. To this end, we have restricted ourselves to the category of action movies, a last period length of 10 ratings and a gap length of 0 ratings. Relatively speaking, these values are well represented in the data set. There are 4425 measurements stored in the database which meet the criteria. We have counted the number of effects inside intervals of 0.01 star in magnitude. The following chart shows the result:

[Chart: histogram of maca effect counts (0 to 40) over effect value intervals from -0.99 to 0.98 star.]

Figure 3.5: Maca Effect Count

A few things are evident from the chart. Even for this well represented combination of parameters, the distribution is not very smooth. The count values seem to jump up and down for nearby effect value intervals. This leads to the question whether we can calculate the most frequent values with sufficient precision. Furthermore, the distribution is not symmetrical around the most frequent interval value. The 'left side' is clearly 'thicker' than the 'right side' of the distribution. This could mean that the average effect value might not coincide with the most frequent effect value. Looking at the chart, we can see that the most frequently counted effect values are found in the interval around 0.13 star. The average maca effect for the same set of data points is 0.076. For the investigated effects, the average effect value thus seems to be lower than the most frequent effect value. This could influence the resulting precision, as we would like to apply effects which have been most frequently measured in practice. Obviously, such effects have the biggest probability of being exactly right for the measured interest period parameter values and thus have the greatest potential to improve the maca prediction. In order to address this issue, we would like to determine the most frequent effect values instead of the average effect values. We can describe the algorithm in pseudo code:


1. For each combination of category, last length and gap length, count the measured maca effect values within intervals of 0.01 star. This is done by rounding the effect values to a specified number of decimal places.
2. Order the resulting list by descending count. The effect values with the highest number of occurrences will be at the top of this list.
3. Take the top N effect values from the list and calculate their average value, weighted by the determined number of occurrences.
4. Store the calculated value as the effect to be applied in case of the specific combination of category, last length and gap length.

Table 3.10: The algorithm for determining the most frequent effects
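The four steps of the algorithm can be sketched in a few lines. The function below is a minimal illustration, not the thesis implementation; the variable names and the example effect values are assumptions made for the sake of the sketch.

```python
from collections import Counter

def most_frequent_effect(effects, top_n=5, decimals=2):
    """Estimate the 'most frequent' effect value for one
    (category, last length, gap length) combination.

    Mirrors the steps of Table 3.10: round each measured effect
    to the interval size (step 1), count occurrences and order by
    descending count (step 2), take the top-N bins and average
    them weighted by their counts (steps 3 and 4).
    """
    counts = Counter(round(e, decimals) for e in effects)
    top = counts.most_common(top_n)           # [(value, count), ...]
    total = sum(c for _, c in top)
    return sum(v * c for v, c in top) / total

# Hypothetical measured effects for one parameter combination:
sample = [0.13, 0.12, 0.13, 0.14, 0.13, -0.05, 0.12, 0.13, 0.40]
print(most_frequent_effect(sample, top_n=2))  # roughly 0.127
```

The choice of `decimals` corresponds to the interval size (0.01 star here); larger `top_n` values drift the result towards the plain average, as the text notes below.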

It should be noted that we look at the top N most frequent effect values and not just the most frequent one. This is because for any specific combination of category and parameter values there are not enough samples available, and the resulting effect count values are irregular. On the other hand, we do not want to make the top N set too large, as this would lead to calculating a value close to the average.

The algorithm described above has been implemented and tested in practice. Unfortunately, the effects determined by this algorithm did not improve the precision of the maca predictor in a satisfactory way. Other interval sizes of 0.1 and 0.001 star, as well as other top N set sizes, have also been tried. In all cases, the result of applying the calculated effects was much worse than simply applying average effects. Please see section 4.3.1 for further discussion.

3.3.7 Predicting ratings in the training set

All effects which have been measured and stored in the database have been determined by analysing ratings in the training set. Using those effects on the test set allows us to establish the generality of our approach. On the other hand, using the effects on the training set yields an interesting result as well. Since the effects are derived from the training set, it is reasonable to expect that they will allow a bigger improvement of precision when applied to this set. The question is how much of a difference this makes. In order to find out, a full calculation run on the training set was conducted, using both positive and negative effect types. The results can be directly compared with those in section 3.3.2. For convenience, the results for both datasets are shown side by side in the following table:

Predictor      maca    maca_pe_1   maca_pe_2   maca_pe_3
Test set       0.841   0.840       0.839       0.839
Training set   0.837   0.836       0.834       0.833

Table 3.11: RMSE for training versus test set

Two surprises show up. First, the maca predictor is significantly more precise when predicting ratings in the training set. Second, the maca_pe_3 predictor is more precise than the maca_pe_2 predictor in case of training set ratings.


Furthermore, the improvement of RMSE for the training set is bigger than for the test set. This implies that it is best to calculate and use effects for the same users. In a real world scenario, this is always possible: we can calculate average effects for all users in our dataset and use these effects when predicting ratings for the same users. This concludes the documentation of the results obtained by performing temporal preference analysis. In the next and final chapter, the reader will find a retrospective analysis.

Chapter 4

Conclusions and Future Work

In light of this thesis, the main question is whether the precision of recommender systems can be improved by means of temporal preference analysis. In order to find an answer, a set of research objectives has been formulated, which can be found in section 1.3. Chapter 2 documents the preparatory work that has led to a theoretical basis and a system implementation for performing temporal preference analysis research. The results of the research have been documented in chapter 3. In this concluding chapter we will address the main research question on the basis of the obtained results. In the following sections the reader will find an overview of the project's main contributions, pointers for future work and some final conclusions.

4.1 Contributions

The project's execution led to a number of contributions in the field of recommender systems. The most significant results include methods for temporal preference analysis, a precise basic prediction algorithm, the detection of temporal effects in the dataset and an improvement of recommendation precision by use of temporal predictors.

4.1.1 Methods for temporal preference analysis

A set of methods has been presented for performing temporal preference analysis. The methods are based on a theory which includes a definition of terminology and a notation. The defined terms allow for unambiguous algorithmic implementation. The notation permits a convenient way of presenting the results of conducting temporal preference analysis. Please see section 2.2 for the documentation of the theoretical basis.

Thanks to the developed methods we can conduct a number of analysis operations. First of all, we are able to analyse the rating series of users in order to identify interest and disinterest periods. Furthermore, we can measure temporal effects associated with the identified interest periods. Average temporal effects can then be calculated based on the measurement data. Finally, we are able to apply average effects to basic prediction algorithms.

Applying temporal effects to a basic predictor leads to an increase in the precision of rating predictions. As far as we know, this is a novel way to increase the precision of prediction algorithms. The increase in precision is significant. This result is the most important contribution to the field of recommender systems in the context of our research.

4.1.2 A precise basic prediction algorithm

A trivial algorithm that predicts for each movie in the quiz set its average grade from the whole set of training data produces an RMSE of 1.0540. Obviously, this is a very coarse prediction, as it takes no account of individual preferences at all. Cinematch, the proprietary algorithm of NetFlix, scores an RMSE of 0.9514 on the quiz data. In order to win the grand prize of one million dollars, a team participating in the NetFlix competition had to improve this by another 10%. Such an improvement on the NetFlix quiz set corresponds to an RMSE of 0.8563.

The basic prediction algorithm from section 2.3.9 is quite uncomplicated. It uses a weighted combination of two components: the movie average over a period of 10 consecutive weeks and the category average rating of the analysed user. The optimal weights for both components are determined by linear regression analysis, which can be performed online in real time. Despite the simplicity of this approach, the attained precision is very good: the algorithm shows a low RMSE of 0.841 on our test set. It has to be noted that the algorithm is not used for the first 10 ratings of a user, only from rating 11 onwards. This is to make sure that the regression analysis is performed on a sufficient number of data points. For comparison: if the algorithm is used from rating 6 onwards, the resulting RMSE is 0.844. This is still lower than the lowest RMSE values reported in the acclaimed research paper on collaborative filtering with temporal dynamics [22]. Please see section 4.2.2 for further perspective on the attained results.
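The weight fitting described above can be sketched as a two-variable least-squares problem solved via the normal equations. The snippet below is a minimal pure-Python illustration with assumed names; the actual system used the Java Scientific Library for the regression.

```python
def fit_weights(x1, x2, y):
    """Least-squares weights w1, w2 for y ~ w1*x1 + w2*x2
    (no intercept), solved via the 2x2 normal equations.

    x1: movie averages per rating, x2: per-user category averages,
    y: the given ratings (all illustrative names).
    """
    a = sum(v * v for v in x1)                    # x1.x1
    b = sum(p * q for p, q in zip(x1, x2))        # x1.x2
    c = sum(v * v for v in x2)                    # x2.x2
    d1 = sum(p * t for p, t in zip(x1, y))        # x1.y
    d2 = sum(q * t for q, t in zip(x2, y))        # x2.y
    det = a * c - b * b
    w1 = (c * d1 - b * d2) / det
    w2 = (a * d2 - b * d1) / det
    return w1, w2

# Synthetic data generated with weights 0.6 and 0.4:
w1, w2 = fit_weights([3, 4, 5], [2, 5, 3], [2.6, 4.4, 4.2])
print(w1, w2)  # recovers 0.6 and 0.4
```

Because only two sums per component have to be maintained, the fit can indeed be updated incrementally as new ratings arrive, which is what makes an online, real-time regression feasible.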

4.1.3 Temporal predictors

The result of applying temporal effects to a predictor from section 2.3.7 is a temporal predictor with increased precision. The effects need to be calculated in relation to the predictor that will have the effects applied to it. It is possible to take any kind of prediction algorithm and calculate temporal effects for it. Unless the calculated effects are very small, it is then possible to construct a temporal version of that predictor by applying the calculated effects.

The generality of the presented method allows it to be applied to any recommendation algorithm. One could for instance construct a hybrid recommendation system incorporating collaborative filtering and temporal preference analysis. In such a system, the basic prediction would be the result of collaborative filtering. This basic prediction would then be corrected on the basis of temporal preference analysis. The result would be a temporal predictor based on collaborative filtering.
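The construction of such a temporal predictor amounts to a thin wrapper around any base prediction function. The sketch below is illustrative only; the lookup key and parameter names are assumptions, not the thesis implementation.

```python
def temporal_predict(base_predict, effect_table, user, movie, params):
    """Wrap an arbitrary base predictor with a temporal correction.

    base_predict(user, movie) -> float rating prediction.
    effect_table: dict mapping (category, last_length, gap_length)
        to an average measured effect; unknown combinations get 0.
    params: interest-period parameters determined for this
        user/movie pair from the user's rating series.
    """
    effect = effect_table.get(params, 0.0)
    return base_predict(user, movie) + effect

# Usage with a hypothetical base predictor and effect table:
base = lambda user, movie: 3.5
effects = {("action", 10, 0): 0.13}
print(temporal_predict(base, effects, "u1", "m1", ("action", 10, 0)))
```

In the hybrid scenario described above, `base_predict` would be the output of a collaborative filtering algorithm rather than the maca predictor.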

4.2 Reflection

The obtained results should be put into perspective by discussing the particular nature of the conducted calculations. In this chapter, we will explain how our results relate to those obtained by other researchers and discuss some specific aspects of the used dataset and the applied algorithms. We will also take a look at the nature of positive versus negative effects.

4.2.1 Negative effects

We know from section 3.3.5 that positive and negative effects do not improve precision in the same way: positive effects increase precision, negative effects decrease it. In order to get a feeling for the differences that exist between these effect types, let us first take a look at the distribution of maca effects. To this end, we have counted maca effects within intervals of 0.1 star in size and plotted the results. We have not distinguished between movie categories or interest period parameter values. Keeping in mind a standard deviation of about 1.06 star, we have also disregarded effects over 2 stars in magnitude. The plotted count values should be considered a global representation of the relative frequencies of occurrence.

As we can see from the graph below, the distribution in general approaches the normal distribution. Effects around 0.1 star are the most frequent. The average effect, however, is closer to 0. This is due to the fact that the negative side of the distribution is "thicker". Evidently, people tend towards overly negative ratings more frequently than towards overly positive ones. As the average rating is around 3.7 stars, this is understandable.

[Chart: histogram of maca effect counts per 0.1-star interval, for effect values from -2.0 to 2.0 stars; counts range from 0 to about 800,000.]

Figure 4.1: Distribution of Maca Effects

Next, let us take a look at the average parameter values that go together with a certain effect value. Again, we have determined effects within 0.1 star intervals and calculated the average values of the associated interest period parameters. The first graph shows the average length of the last interest period and the positivity value. The positivity values have been scaled up to make comparison easier.


[Chart: average last length and scaled positivity (vertical axis, 0.0 to 5.0) plotted per 0.1-star effect interval, for effect values from -2 to 2 stars.]

Figure 4.2: Positivity and Last Length

As we can see from the graph, the highest average positivity values and the longest average last interest periods are associated with effects around 0.3 star in magnitude, which is relatively big. Let us finally consider the same kind of chart for the gap length parameter:

[Chart: average gap length (vertical axis, 0.0 to 14.0) plotted per 0.1-star effect interval, for effect values from -2 to 2 stars.]

Figure 4.3: Gap Length

As we can see, for small positive effects, the average gap lengths are minimal over the considered interval. This implies that the combination of long last lengths and short gap lengths is a strong indication that a positive effect should be applied. These small positive effects are also the most frequent; we can determine them with a reasonable degree of confidence.

On the other hand, negative effects are associated with longer average gap lengths. If we look at the interval from -1 star to -0.1 star, we can see that the differences in the average gap length are quite small. This implies that within the interval of frequent negative effects, it is difficult to determine the size of the negative effect that should apply solely on the basis of gap length. Furthermore, in a small number of cases, long gap lengths lead to very strong positive effects. Determining that the gap length is 11 ratings thus does not provide much information about the effect value that should apply. Looking at the combination of gap length and last length certainly provides more information. However, as we can see from the RMSE results in section 3.3.5, applying negative effects is detrimental to the precision of the temporal maca predictor. We have not been able to find a solution to this problem.

4.2.2 MovieLens versus NetFlix

An important question is how the results attained in our research compare to the results of the reference research [22]. The best precision on the NetFlix dataset reported there is an RMSE value of 0.855. Our maca predictor, with a calculated RMSE of 0.841 on the test set, compares favourably. The maca_pe predictor - incorporating temporal effects derived by our method - does even better and attains an RMSE of 0.839 on the test set. This precision is attained when using both positive and negative effects. If only positive effects are used, the RMSE is as low as 0.818 for the test set.

Both the NetFlix dataset and the MovieLens dataset contain ratings of movies on a 5 star scale. However, the NetFlix rating system only allows ratings in whole star increments; a rating such as 4.5 star is not possible. In our dataset, half star increments are allowed and present. Furthermore, the ratings in the NetFlix dataset have been made by users who may have rented the related movie from NetFlix, while the ratings in the MovieLens dataset are made by users who are simply movie enthusiasts. In the latter case, no commercial aspect is present. It is not hard to imagine that paying for a movie which was not to our liking can result in a particularly low rating. In short, while similar, the datasets differ slightly in nature; results obtained on one dataset can not be related to the other without reserve.

In order to make a performance comparison, one would have to run our algorithms on the NetFlix dataset. This has not been attempted in our research. It seems to be the general consensus, however, that it is more difficult to generate precise recommendations for the NetFlix dataset than for the MovieLens dataset. In [40] for instance, the authors present an improved algorithm for collaborative filtering and state an improvement of 22% on the NetFlix dataset and 23% on the MovieLens dataset.

4.2.3 Nature of MovieLens ratings

MovieLens is an online movie recommendation system. The system generates recommendations for movies by means of collaborative filtering and on the basis of user ratings. The user can either rate a movie which has been recommended to him - and thus provide feedback to the system - or he can search for a movie by keywords and give it a rating [15]. We have no way of knowing which route was taken by the user when he rated a movie; this information is not provided in the ratings dataset from MovieLens.

Does this have implications for our approach? Perhaps to a small extent. If an interest period is the result of the system making recommendations for movies within a certain category - instead of the user making deliberate choices for movies within that category - the user is still required to give high ratings to these movies. In other words, even if the system were biased towards a particular category, this would not lead to interest periods if the user did not give high ratings to the recommended movies. We do acknowledge that interest periods which are the result of system bias could be longer than those which are the result of manual search and rating actions.

4.2.4 Real time factors

In our research, the analysis of a user's rating series takes place without regard for the actual time at which the ratings take place. The ratings are ordered by timestamp, but the actual time between two consecutive ratings is not a factor in our analysis. Taking the actual time between ratings into account when determining interest periods would lead to different parameter values describing the rating series in a number of cases.

It is doubtful whether accounting for real time factors would have a large impact on the prediction precision. This is because most people input their ratings into the system within a relatively short time span. In our dataset, the number of users who have inputted all their ratings within the period of one day approaches 64% of all users; the number of users who have inputted all their ratings within one month approaches 76%. Real time factors - if relevant - would only seem to apply to a limited portion of users.
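Percentages like the ones above can be computed directly from the rating timestamps. The sketch below, with illustrative names and data, computes the fraction of users whose complete rating series falls within a given time span:

```python
def fraction_within(ratings_by_user, max_span_seconds):
    """Fraction of users whose entire rating series falls within
    max_span_seconds (e.g. 86400 for one day).

    ratings_by_user: dict mapping a user id to the list of Unix
    timestamps of that user's ratings (illustrative structure).
    """
    within = sum(
        1 for ts in ratings_by_user.values()
        if max(ts) - min(ts) <= max_span_seconds
    )
    return within / len(ratings_by_user)

# Hypothetical example: two of four users rated within one day.
users = {"a": [0, 100], "b": [0, 200000], "c": [5, 10], "d": [0, 90000]}
print(fraction_within(users, 86400))  # prints 0.5
```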

4.2.5 Optimization

The main goal of our research was to investigate the possibility of improving precision through temporal preference analysis. We have not tried to attain the best possible precision, although optimization of the prediction routines has taken place. For instance, an optimal merge factor has been determined which gives the best results in case of applying positive effects to the maca predictor. Although present, the effect of changing the merge factor is not very pronounced.

Other optimizations which have been carried out concern memory and CPU usage. Many aspects of the system have been optimized in order to improve run time performance. One can name the pre-calculation of probability values or a mechanism which reads the whole aggregated effects table into memory. Not having to access the database when determining effects for a particular set of interest period parameters significantly shortens the total execution time. It should be noted that, with all the run time optimizations in place, the analysis of the whole test set for a particular combination of system settings still takes a full day to complete.


While there might not be much room left for optimizing run time performance, we do feel that improving the effectiveness of the temporal preference methods is possible. In the next section, the reader will find a number of research directions that could lead to an increased precision of the overall system.

4.3 Future work

Keeping the ultimate goal in mind - that is, increasing rating prediction precision - a number of research directions seem promising in the context of temporal preference analysis. In the following sections we will discuss these possibilities in order of decreasing relevance.

4.3.1 Negative and frequent effects

As we could see in section 3.3.5, applying negative effects leads to decreased precision. We have not been able to find a dependable way to establish, on the basis of temporal preference analysis, when negative effects should apply. A solution to this problem would lead to a much improved precision of the system.

Another issue has been addressed in section 3.3.6. It seems probable that the average effect values that we have calculated for combinations of category, last length and gap length values do not coincide perfectly with the most frequent effect values for the same parameter values. We can not determine this with certainty, as we have not been able to construct an algorithm that calculates the most frequent effect values dependably. The claim is based solely on a visual inspection of the distribution of effect values - see the figure in section 3.3.6. Nevertheless, it seems probable that by constructing an algorithm that determines the most frequent effect values dependably, we can further improve the precision of the maca_pe predictor.

4.3.2 Clustering similar users

At present we do not distinguish between users on the basis of the analysis results. Every user is treated the same and the same effects are applied in each case when calculating a rating prediction. The results suggest, however, that the correlation between interest period parameters and the measured effects is different for different users. From section 3.3.7 we know that applying the same effects to different datasets produces different results. By clustering users which exhibit similar behaviour as far as interest periods and associated effects are concerned, we should be able to apply the effects much more precisely.
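One possible starting point for such a clustering is a small k-means over per-user interest period features, for instance the average last length and the average gap length. The sketch below is purely illustrative; the feature choice and the deterministic initialisation are assumptions, not part of the thesis implementation.

```python
def kmeans(points, k=2, iters=20):
    """Minimal k-means over per-user feature tuples.

    points: list of per-user feature vectors, e.g.
    (average last length, average gap length).
    Returns a cluster label for each point.
    """
    # Initialise centroids with the first k points (deterministic).
    cents = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])),
            )
        # Update step: move each centroid to its cluster mean.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                cents[c] = [sum(d) / len(members) for d in zip(*members)]
    return assign

# Hypothetical users: two with long last periods and short gaps,
# two with short last periods and long gaps.
print(kmeans([(10, 0), (9, 1), (2, 12), (1, 11)], k=2))
```

Separate effect tables could then be calculated per cluster, so that the applied correction matches the behaviour of the user's group.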

4.3.3 Using different categories

Some movie categories are better represented in the dataset than others. Within the category of adult movies, for instance, one can not find many ratings. As we are interested in finding average effects - which are often quite small in comparison to the whole rating scale - we need a sufficient number of examples to calculate them reliably. If a certain category is ill-represented in the dataset, then it is difficult to find enough examples to base our calculations on. In such cases, we can not place much trust in the calculated average effect. Applying such effects is detrimental to the attained precision.

This leads to the conclusion that not all effects are equally suitable for improving precision. At present, all effects are applied with equal weight, regardless of the associated category. It is quite probable that using weights which reflect the predictive strength of different categories would lead to increased precision. It remains an open question how the weights should be determined for different effects. One idea is to simply take the total number of observed cases into account when applying the effect. In such a scenario, effects which have been calculated for a larger number of samples would have a bigger weight associated with them.

Another idea is to use a completely different clustering of items into categories. In our research we have chosen to use movie genre as a basis for grouping movies into common categories. It is a convenient choice, as this data is readily available and is considered a domain specific clustering of items possessing common characteristics. Nevertheless, other sorts of clustering of movies are also possible, by using a different similarity measure. A similarity measure determines the likeness of two comparable items. When the characteristics of a movie - such as participating actors, synopsis or the director - are expressed in the form of vectors, the similarity between movies can be calculated as a vector cosine [36]. Another method for relating movies is counting the number of times that two movies appear close to each other in the rating series of users. Using a different grouping of movies into categories will lead to a different result when determining interest periods. This could have a significant impact on the attained RMSE.
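The vector cosine mentioned above can be computed as follows; the binary feature encoding of movie characteristics is only an example of how such vectors might be built.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two movie feature vectors,
    e.g. binary indicators for actors, director or keywords.
    Returns a value in [-1, 1]; 0 if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Identical feature vectors give similarity 1, orthogonal ones give 0:
print(cosine_similarity([1, 1, 0], [1, 1, 0]))
print(cosine_similarity([1, 0], [0, 1]))
```

Movies could then be grouped by thresholding this similarity instead of by genre, yielding an alternative categorisation for interest period detection.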

4.3.4 Calculating average effects differently

At present, the rating series of a user are characterized by a set of basic parameters relating to different aspects of interest periods. In section 3.2.2, we have presented three distinct methods for relating interest period parameters with average effects. The first option - where all parameters are determined separately - could be implemented differently. Firstly, we could increase the number of used parameters. Secondly, the effects which relate to separate parameters could be combined differently. At present a simple average is computed. It might be better to compute a weighted average in which the weights are determined by means of an optimization algorithm. We have constructed such an algorithm based on linear regression, but its application has not led to increased precision. Nevertheless, we feel that a different implementation could produce better results.
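The difference between the current plain average and the proposed weighted variant can be sketched in one function; the weight values here are hypothetical, standing in for weights that would be fitted offline by an optimization algorithm.

```python
def combine_effects(effects, weights=None):
    """Combine per-parameter effects into one correction value.

    With no weights this reduces to the simple average currently
    used; with weights (e.g. fitted by an optimization algorithm)
    it becomes the proposed weighted average.
    """
    if weights is None:
        weights = [1.0] * len(effects)
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# Plain average of two per-parameter effects, then a weighted one
# that trusts the first parameter three times as much:
print(combine_effects([0.1, 0.3]))            # 0.2
print(combine_effects([0.1, 0.3], [3.0, 1.0]))  # 0.15
```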

4.3.5 Basic prediction algorithm

For the reasons given in section 2.3.9, we have chosen the maca predictor to be our basic prediction algorithm. The generality of our method allows its application as an extension to any basic algorithm. One could for instance take a recommendation based on collaborative filtering and augment it with temporal preference analysis. In such cases we would measure different temporal effects related to the interest period parameters. It is to be expected that using a different basic algorithm will lead to a different RMSE of the constructed temporal predictor. It is difficult to predict how changing the basic algorithm would influence the attained precision.

4.4 Final conclusions

The theory for temporal preference analysis is based on three basic premises. The first one states that user preferences for a certain category of items are manifested in the form of recurrent interest and disinterest periods. The second one states that interest periods are characterized by a high frequency of positive ratings for items of the related category. The last one states that we can improve the precision of recommendation algorithms by accounting for temporal effects which are associated with interest periods.

In order to validate or disprove these claims, a theory has been developed for performing temporal preference analysis. A system has been implemented to test the theory and provide answers to a number of key questions:

- Can we detect temporal effects in our data set by performing temporal preference analysis?
- Are the detected effects different from those reported by other researchers?
- Can we improve prediction precision by applying temporal effects determined by analysis as a corrective factor?

In the following sections we will address these questions.

4.4.1 Temporal effects

Before actual measurement, it was not clear whether interest periods can be associated with any kind of temporal effect - defined as a constant difference between given ratings and the value of a prediction. The results presented in section 3.1.3 verify the supposition that temporal effects can be detected in our dataset. These effects show a dependency on interest period parameters, most notably on the length of the last identified interest period and the length of the last identified disinterest period. Other parameters such as positivity - that is, the ratio between the average interest period length and the average disinterest period length - also show a strong relation to the calculated average temporal effects.

Temporal effects are calculated as the average difference between predictor values and the given ratings. The basic prediction algorithm presented in section 2.3.9 accounts for temporal effects known from the relevant literature. Therefore, we have a high degree of certainty that the calculated temporal effects are mainly due to interest periods.

The size of the calculated average temporal effects is significant. In the training set, the average rating is about 3.512 stars with a standard deviation of 1.060 star. This implies that about 70% of all ratings fall in the interval from 2.5 to 4.6 stars. For a rating of 2.5 stars, an effect of 0.1 star constitutes about 4%; for a rating of 4.6 stars, the same effect constitutes about 2.1%. As can be seen in section 3.1.3, average effects can approach 0.4 star in extreme cases. The effects are consistent and significant in comparison to the rating scale.

The improvement of precision which has been attained by applying temporal effects to basic predictors is very good. We are confident that further improvement of precision can be attained by perfecting the methods for temporal preference analysis.

4.4.2 Improving precision by temporal preference analysis

Temporal preference analysis provides information about interest periods, which reflect the user's affinity with movies from a certain genre. The main goal of this research is to investigate the possibility of using information about interest periods and associated temporal effects to improve rating prediction precision. Specifically, we want to improve the precision of the basic maca prediction algorithm, which accounts for known temporal effects. On the basis of the results presented in section 3.3, we conclude that such an improvement is possible.

The most important lesson learned from the research is that positive effects and negative effects do not contribute to an increase of precision to the same degree. We speak of a positive effect when the corrective factor to be applied is positive in value; in the opposite case, we speak of a negative effect. Applying negative effects to the basic prediction algorithm is detrimental to the precision, while applying positive effects to the same predictor leads to a significant improvement of precision. Unfortunately, the number of cases in which we determine positive effects is almost equal to the number of cases in which negative effects are determined. In other words, we can not improve the precision of the basic prediction algorithm in about half of all cases.

In the sections about future work, many aspects of temporal preference analysis have been presented which have not been explored yet. Given the encouraging results obtained when applying positive effects, we feel that further research is warranted. Especially solving the problem with negative effects bears the promise of a very effective method for improving the overall precision of recommender systems.

Bibliography

[1] http://en.wikipedia.org/wiki/information.

[2] http://en.wikipedia.org/wiki/information_retrieval.

[3] http://en.wikipedia.org/wiki/information_society.

[4] http://en.wikipedia.org/wiki/nonparametric_regression.

[5] http://en.wikipedia.org/wiki/quaternary_sector_of_the_economy.

[6] http://en.wikipedia.org/wiki/three-sector_hypothesis.

[7] http://www.ee.ucl.ac.uk/~mflanaga/java/regression.html.

[8] http://www.grouplens.org/node/73.

[9] http://www.imdb.com/.

[10] http://www.itfacts.biz/average-us-internet-usage-in-april-2008/10598.

[11] David A. Bray. Information pollution, knowledge overload, limited attention spans, and our responsibilities as IS professionals. SSRN eLibrary, 2007.

[12] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.

[13] K.Y. Cai and C.Y. Zhang. Towards a research on information pollution. In IEEE International Conference on Systems, Man, and Cybernetics, volume 4, 1996.

[14] H. Cao, E. Chen, J. Yang, and H. Xiong. Enhancing recommender systems under volatile user interest drifts. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1257–1266. ACM, 2009.

[15] Y. Chen, F.M. Harper, J. Konstan, and S.X. Li. Social comparisons and contributions to online communities: A field experiment on MovieLens. American Economic Review, forthcoming, 2009.


[16] T. Downarowicz and Y. Lacroix. The law of series. Technical report, Wroclaw University of Technology, 2006.

[17] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):70, 1992.

[18] Bertram Myron Gross. The managing of organizations. Free Press of Glencoe, 1964.

[19] H.F. Harlow. Motivational forces underlying learning. In Learning theory, personality theory, and clinical research. The Kentucky Symposium. New York: Wiley, 1954.

[20] Stephen J. Kavanagh. Protecting children in cyberspace. Behavioral Psychotherapy Center, 1st edition, 1998.

[21] J. Kivinen and M.K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[22] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 447–456. ACM, 2009.

[23] W. Lam and J. Mostafa. Modeling user interest shift using a Bayesian approach. Journal of the American society for information science and technology, 52(5):416–429, 2001.

[24] Stanisław Lem. Bomba megabitowa. Wydawn. Literackie (Kraków), 1999.

[25] Michael Lesk. How much information is there in the world? Technical report, 1999.

[26] A. Lewicki. Procesy poznawcze i orientacja w otoczeniu. PWN Warszawa, 1960.

[27] M. Lim and J. Kim. An adaptive recommendation system with a coordinator agent. Web Intelligence: Research and Development, pages 438–442, 2001.

[28] Norman L. Munn. Psychology. Boston. Houghton Mifflin, 1951.

[29] D.W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction, 7(3):141–178, 1997.

[30] L. Orman. Fighting information pollution with decision support systems. J. MANAGE. INFO. SYST., 1(2):64–71, 1984.

[31] M.J. Pazzani, J. Muramatsu, D. Billsus, et al. Syskill & Webert: Identifying interesting web sites. In Proceedings of the national conference on artificial intelligence, pages 54–61, 1996.

[32] Robert Plomin. Behavioral genetics. Worth Pubs., New York, 4th ed. edition, 2001.


[33] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work, pages 175–186. ACM, 1994.

[34] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, page 295. ACM, 2001.

[35] J.C. Schlimmer and R. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, volume 1, pages 502–507, 1986.

[36] E. Vozalis and K.G. Margaritis. Analysis of recommender systems algorithms. In Proceedings of the 6th Hellenic European Conference on Computer Mathematics and its Applications (HERCMA-2003), Athens, Greece. Citeseer, 2003.

[37] Julia Catherine Weathersbee. Impact of technology integration on academic performance of Texas school children. Technical report, Texas State University-San Marcos, Dept. of Political Science, 2008.

[38] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.

[39] D.H. Widyantoro, T.R. Ioerger, and J. Yen. Tracking changes in user interests with a few relevance judgments. In Proceedings of the twelfth international conference on Information and knowledge management, page 551. ACM, 2003.

[40] T. Zhou, R.Q. Su, R.R. Liu, L.L. Jiang, B.H. Wang, and Y.C. Zhang. Accurate and diverse recommendations via eliminating redundant correlations. New Journal of Physics, 11:123008, 2009.


Appendix A

Interest period parameters

In this appendix we present charts of the investigated interest period parameters that have not been used as input for the temporal predictors introduced in section 3.2.3. Each chart shows the average effects as measured for the ma, ca and maca predictors over the entire training set. The name of each figure reflects the name of the investigated parameter. Please see section 3.1.3 for an explanation of the parameters and the conditions under which the calculations are made.
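As an illustrative sketch (not the thesis implementation), the per-parameter averages behind these charts can be obtained by grouping the observed effect values by the (binned) parameter value and averaging within each group. All names below are hypothetical:

```python
from collections import defaultdict

def average_effects(observations):
    """Average effect per parameter value.

    observations: iterable of (parameter_value, effect) pairs,
    e.g. (number_of_gap_neutrals, measured maca_effect).
    Returns a dict mapping each parameter value to its mean effect.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, effect in observations:
        sums[value] += effect
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in sums}

# Toy example: two observations at parameter value 1, one at value 2.
obs = [(1, 0.2), (1, 0.4), (2, -0.1)]
averages = average_effects(obs)
print(averages)
```

Repeating this computation once per effect type (category_effect, movie_effect, maca_effect) yields the three curves plotted in each figure.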

[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.30 to 0.20) against the number of gap neutrals (x-axis: 0 to 40).]

Figure A.1: Gap Neutrals


[Chart: average category_effect, movie_effect and maca_effect (y-axis: -1.20 to 0.20) against the number of gap negatives (x-axis: 0 to 30).]

Figure A.2: Gap Negatives

[Chart: average category_effect, movie_effect and maca_effect (y-axis: -1.80 to 0.00) against the gap average (x-axis: 0.5 to 3.0 in steps of 0.1).]

Figure A.3: Gap Average


[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.40 to 1.20) against the last average (x-axis: 2.9 to 5.0 in steps of 0.1).]

Figure A.4: Last Average

[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.30 to 0.40) against the last frequency (x-axis: 0 to 1 in steps of 0.1).]

Figure A.5: Last Frequency


[Chart: average category_effect, movie_effect and maca_effect (y-axis: -1.50 to 2.00) against the number of last negatives (x-axis: 0 to 18).]

Figure A.6: Last Negatives

[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.20 to 0.50) against the number of last neutrals (x-axis: 0 to 30).]

Figure A.7: Last Neutrals


[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.14 to 0.04) against the number of interest periods (x-axis: 0 to 30).]

Figure A.8: Interest Periods

[Chart: average category_effect, movie_effect and maca_effect (y-axis: -0.20 to 0.20) against the number of disinterest periods (x-axis: 0 to 30).]

Figure A.9: Disinterest Periods
